Yuqing Sun · Tun Lu · Yinzhang Guo · Xiaoxia Song · Hongfei Fan · Dongning Liu · Liping Gao · Bowen Du (Eds.)
Communications in Computer and Information Science
1682
Computer Supported Cooperative Work and Social Computing 17th CCF Conference, ChineseCSCW 2022 Taiyuan, China, November 25–27, 2022 Revised Selected Papers, Part II
Editorial Board Members

Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Ashish Ghosh, Indian Statistical Institute, Kolkata, India
Raquel Oliveira Prates, Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil
Lizhu Zhou, Tsinghua University, Beijing, China
Rationale

The CCIS series is devoted to the publication of proceedings of computer science conferences. Its aim is to efficiently disseminate original research results in informatics in printed and electronic form. While the focus is on publication of peer-reviewed full papers presenting mature work, inclusion of reviewed short papers reporting on work in progress is welcome, too. Besides globally relevant meetings with internationally representative program committees guaranteeing a strict peer-reviewing and paper selection process, conferences run by societies or of high regional or national relevance are also considered for publication.

Topics

The topical scope of CCIS spans the entire spectrum of informatics, ranging from foundational topics in the theory of computing to information and communications science and technology and a broad variety of interdisciplinary application fields.

Information for Volume Editors and Authors

Publication in CCIS is free of charge. No royalties are paid. However, we offer registered conference participants temporary free access to the online version of the conference proceedings on SpringerLink (http://link.springer.com) by means of an HTTP referrer from the conference website and/or a number of complimentary printed copies, as specified in the official acceptance email of the event. CCIS proceedings can be published in time for distribution at conferences or as post-proceedings, and delivered in the form of printed books and/or electronically as USBs and/or e-content licenses for accessing proceedings at SpringerLink. Furthermore, CCIS proceedings are included in the CCIS electronic book series hosted in the SpringerLink digital library at http://link.springer.com/bookseries/7899. Conferences publishing in CCIS are allowed to use the Online Conference Service (OCS) for managing the whole proceedings lifecycle (from submission and reviewing to preparing for publication) free of charge.

Publication Process

The language of publication is exclusively English. Authors publishing in CCIS have to sign the Springer CCIS copyright transfer form; however, they are free to use their material published in CCIS for substantially changed, more elaborate subsequent publications elsewhere. For the preparation of the camera-ready papers/files, authors have to strictly adhere to the Springer CCIS Authors’ Instructions and are strongly encouraged to use the CCIS LaTeX style files or templates.

Abstracting/Indexing

CCIS is abstracted/indexed in DBLP, Google Scholar, EI-Compendex, Mathematical Reviews, SCImago, and Scopus. CCIS volumes are also submitted for inclusion in ISI Proceedings.

How to Start

To start the evaluation of your proposal for inclusion in the CCIS series, please send an e-mail to [email protected].
Editors

Yuqing Sun, Shandong University, Jinan, China
Tun Lu, Fudan University, Shanghai, China
Yinzhang Guo, Taiyuan University of Science and Technology, Taiyuan, China
Xiaoxia Song, Shanxi Datong University, Datong, China
Hongfei Fan, Tongji University, Shanghai, China
Dongning Liu, Guangdong University of Technology, Guangzhou, China
Liping Gao, University of Shanghai for Science and Technology, Shanghai, China
Bowen Du, Tongji University, Shanghai, China
ISSN 1865-0929    ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-981-99-2384-7    ISBN 978-981-99-2385-4 (eBook)
https://doi.org/10.1007/978-981-99-2385-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Welcome to the post-proceedings of ChineseCSCW 2022, the 17th CCF Conference on Computer-Supported Cooperative Work and Social Computing. ChineseCSCW 2022 was organized by the China Computer Federation (CCF) and co-hosted by the CCF Technical Committee on Cooperative Computing (CCF TCCC), Taiyuan University of Science and Technology, and Shanxi Datong University in Taiyuan, Shanxi, China, during November 25–27, 2022. The conference was also supported by SCHOLAT and Guangdong Xuanyuan Network Technology Co., Ltd. The theme of the conference was Human-Centered Collaborative Intelligence, reflecting the emerging combination of artificial intelligence, human-system collaboration, and AI-empowered applications.

ChineseCSCW (initially known as CCSCW) is a highly reputable conference series on computer-supported cooperative work (CSCW) and social computing in China with a long history. It aims to bridge Chinese and overseas CSCW researchers, practitioners, and educators, with a particular focus on innovative models, theories, techniques, algorithms, and methods, as well as domain-specific applications and systems, from both the technical and the social aspects of CSCW and social computing. The conference was held biennially from 1998 and has been held annually since 2014.

This year, the conference received 211 submissions. After a rigorous double-blind peer review process, 60 of them were accepted as full papers for oral presentation, an acceptance rate of 28%. The program also included 30 short papers, which were presented as posters. In addition, the conference featured 6 keynote speeches, 5 high-level technical seminars, the ChineseCSCW Cup 2022 Collaborative Intelligence Big Data Challenge, the Forum for Outstanding Young Scholars, the Forum for Presentations of Top-Venue Papers, and an awards ceremony for senior TCCC members.
We are grateful to the distinguished keynote speakers: Changjun Jiang (CAE Member) from Tongji University, Xingshe Zhou from Northwestern Polytechnical University, Ting Liu from Harbin Institute of Technology, Xing Xie from Microsoft Research Asia, Xin Lu from National University of Defense Technology, and Tong Zhang from South China University of Technology.

We hope that you enjoyed ChineseCSCW 2022.

November 2022
Yong Tang Peikang Bai Liying Yao
Organization
Steering Committee

Yong Tang (South China Normal University, China)
Weiqing Tang (China Computer Federation, China)
Ning Gu (Fudan University, China)
Shaozi Li (Xiamen University, China)
Bin Hu (Lanzhou University, China)
Yuqing Sun (Shandong University, China)
Xiaoping Liu (Hefei University of Technology, China)
Zhiwen Yu (Northwestern Polytechnical University, China)
Xiangwei Zheng (Shandong Normal University, China)
Tun Lu (Fudan University, China)
General Chairs

Yong Tang (South China Normal University, China)
Peikang Bai (Taiyuan University of Science and Technology, China)
Liying Yao (Shanxi Datong University, China)
Program Committee Chairs

Yuqing Sun (Shandong University, China)
Tun Lu (Fudan University, China)
Dongning Liu (Guangdong University of Technology, China)
Yinzhang Guo (Taiyuan University of Science and Technology, China)
Xiaoxia Song (Shanxi Datong University, China)
Organization Committee Chairs

Xiaoping Liu (Hefei University of Technology, China)
Zhiwen Yu (Northwestern Polytechnical University, China)
Chaoli Sun (Taiyuan University of Science and Technology, China)
Jifu Zhang (Taiyuan University of Science and Technology, China)
Publicity Chairs

Xiangwei Zheng (Shandong Normal University, China)
Jianguo Li (South China Normal University, China)
Publication Chairs

Bin Hu (Lanzhou University, China)
Hailong Sun (Beihang University, China)
CSCW Cup Competition Chairs

Chaobo He (South China Normal University, China)
Yong Li (Shanxi Datong University, China)
Paper Award Chairs

Shaozi Li (Xiamen University, China)
Yichuan Jiang (Southeast University, China)
Paper Recommendation Chairs

Honghao Gao (Shanghai University, China)
Yiming Tang (Hefei University of Technology, China)
Finance Chairs

Huichao Yan (Shanxi Datong University, China)
Guoyou Zhang (Taiyuan University of Science and Technology, China)
Program Committee

Tie Bao (Jilin University, China)
Zhan Bu (Nanjing University of Finance and Economics, China)
Hongming Cai (Shanghai Jiao Tong University, China)
Xinye Cai (Nanjing University of Aeronautics and Astronautics, China)
Yongming Cai (Guangdong Pharmaceutical University, China)
Yuanzheng Cai (Minjiang University, China)
Zhicheng Cai (Nanjing University of Science and Technology, China)
Buqing Cao (Hunan University of Science and Technology, China)
Donglin Cao (Xiamen University, China)
Jian Cao (Shanghai Jiao Tong University, China)
Jingjing Cao (Wuhan University of Technology, China)
Chao Chen (Chongqing University, China)
Jianhui Chen (Beijing University of Technology, China)
Long Chen (Southeast University, China)
Longbiao Chen (Xiamen University, China)
Liangyin Chen (Sichuan University, China)
Qingkui Chen (University of Shanghai for Science and Technology, China)
Ningjiang Chen (Guangxi University, China)
Wang Chen (China North Vehicle Research Institute, China)
Weineng Chen (South China University of Technology, China)
Yang Chen (Fudan University, China)
Zhen Chen (Yanshan University, China)
Shiwei Cheng (Zhejiang University of Technology, China)
Xiaohui Cheng (Guilin University of Technology, China)
Yuan Cheng (Wuhan University, China)
Lizhen Cui (Shandong University, China)
Weihui Dai (Fudan University, China)
Xianghua Ding (Fudan University, China)
Wanchun Dou (Nanjing University, China)
Bowen Du (Tongji University, China)
Hongfei Fan (Tongji University, China)
Yili Fang (Zhejiang Gongshang University, China)
Lunke Fei (Guangdong University of Technology, China)
Liang Feng (Chongqing University, China)
Shanshan Feng (Shandong Normal University, China)
Honghao Gao (Shanghai University, China)
Jing Gao (Guangdong Hengdian Information Technology Co., Ltd., China)
Ying Gao (South China University of Technology, China)
Yunjun Gao (Zhejiang University, China)
Liping Gao (University of Shanghai for Science and Technology, China)
Ning Gu (Fudan University, China)
Bin Guo (Northwestern Polytechnical University, China)
Kun Guo (Fuzhou University, China)
Wei Guo (Shandong University, China)
Yinzhang Guo (Taiyuan University of Science and Technology, China)
Tao Han (Zhejiang Gongshang University, China)
Fei Hao (Shanxi Normal University, China)
Chaobo He (Zhongkai University of Agriculture and Engineering, China)
Fazhi He (Wuhan University, China)
Haiwu He (Chinese Academy of Sciences, China)
Bin Hu (Lanzhou University, China)
Daning Hu (Southern University of Science and Technology, China)
Wenting Hu (Jiangsu Open University, China)
Yanmei Hu (Chengdu University of Technology, China)
Changqin Huang (South China Normal University, China)
Tao Jia (Southwest University, China)
Bo Jiang (Zhejiang Gongshang University, China)
Bin Jiang (Hunan University, China)
Jiuchuan Jiang (Nanjing University of Finance and Economics, China)
Weijin Jiang (Xiangtan University, China)
Yichuan Jiang (Southeast University, China)
Lu Jia (China Agricultural University, China)
Miaotianzi Jin (Shenzhen Artificial Intelligence and Data Science Institute (Longhua), China)
Lanju Kong (Shandong University, China)
Yi Lai (Xi’an University of Posts and Telecommunications, China)
Dongsheng Li (Microsoft Research, China)
Guoliang Li (Tsinghua University, China)
Hengjie Li (Lanzhou University of Arts and Science, China)
Jianguo Li (South China Normal University, China)
Jingjing Li (South China Normal University, China)
Junli Li (Jinzhong University, China)
Li Li (Southwest University, China)
Pu Li (Zhengzhou University of Light Industry, China)
Renfa Li (Hunan University, China)
Shaozi Li (Xiamen University, China)
Taoshen Li (Guangxi University, China)
Weimin Li (Shanghai University, China)
Xiaoping Li (Southeast University, China)
Yong Li (Tsinghua University, China)
Lu Liang (Guangdong University of Technology, China)
Hao Liao (Shenzhen University, China)
Bing Lin (Fujian Normal University, China)
Dazhen Lin (Xiamen University, China)
Cong Liu (Shandong University of Technology, China)
Dongning Liu (Guangdong University of Technology, China)
Hong Liu (Shandong Normal University, China)
Jing Liu (Guangzhou Institute of Technology, Xidian University, China)
Li Liu (Chongqing University, China)
Shijun Liu (Shandong University, China)
Shufen Liu (Jilin University, China)
Xiaoping Liu (Hefei University of Technology, China)
Yuechang Liu (Jiaying University, China)
Tun Lu (Fudan University, China)
Hong Lu (Shanghai Polytechnic University, China)
Huijuan Lu (China Jiliang University, China)
Dianjie Lu (Shandong Normal University, China)
Qiang Lu (Hefei University of Technology, China)
Haoyu Luo (South China Normal University, China)
Zhiming Luo (Xiamen University, China)
Peng Lv (Central South University, China)
Pin Lv (Guangxi University, China)
Xiao Lv (Naval University of Engineering, China)
Li Ni (Anhui University, China)
Hui Ma (University of Electronic Science and Technology of China and Zhongshan Institute, China)
Keji Mao (Zhejiang University of Technology, China)
Chao Min (Nanjing University, China)
Haiwei Pan (Harbin Engineering University, China)
Li Pan (Shandong University, China)
Yinghui Pan (Shenzhen University, China)
Lianyong Qi (Qufu Normal University, China)
Jiaxing Shang (Chongqing University, China)
Limin Shen (Yanshan University, China)
Yuliang Shi (Shanda Dareway Company Limited, China)
Yanjun Shi (Dalian University of Science and Technology, China)
Xiaoxia Song (Datong University, China)
Kehua Su (Wuhan University, China)
Songzhi Su (Xiamen University, China)
Hailong Sun (Beihang University, China)
Ruizhi Sun (China Agricultural University, China)
Yuqing Sun (Shandong University, China)
Yuling Sun (East China Normal University, China)
Wen’an Tan (Nanjing University of Aeronautics and Astronautics, China)
Lina Tan (Hunan University of Technology and Business, China)
Yong Tang (South China Normal University, China)
Shan Tang (Shanghai Polytechnic University, China)
Weiqing Tang (China Computer Federation, China)
Yan Tang (Hohai University, China)
Yiming Tang (Hefei University of Technology, China)
Yizheng Tao (China Academy of Engineering Physics, China)
Shaohua Teng (Guangdong University of Technology, China)
Fengshi Tian (China People’s Police University, China)
Zhuo Tian (Institute of Software, Chinese Academy of Sciences, China)
Binhui Wang (Nankai University, China)
Dakuo Wang (IBM Research, USA)
Hongbin Wang (Kunming University of Science and Technology, China)
Hongjun Wang (Southwest Jiaotong University, China)
Hongbo Wang (University of Science and Technology Beijing, China)
Lei Wang (Alibaba Group, China)
Lei Wang (Dalian University of Technology, China)
Tao Wang (Minjiang University, China)
Tianbo Wang (Beihang University, China)
Tong Wang (Harbin Engineering University, China)
Wanyuan Wang (Southeast University, China)
Xiaogang Wang (Shanghai Dianji University, China)
Yijie Wang (National University of Defense Technology, China)
Yingjie Wang (Yantai University, China)
Zhenxing Wang (Shanghai Polytechnic University, China)
Zhiwen Wang (Guangxi University of Science and Technology, China)
Zijia Wang (Guangzhou University, China)
Yiping Wen (Hunan University of Science and Technology, China)
Ling Wu (Fuzhou University, China)
Quanwang Wu (Chongqing University, China)
Wen Wu (East China Normal University, China)
Zhengyang Wu (South China Normal University, China)
Chunhe Xia (Beihang University, China)
Fangxiong Xiao (Jinling Institute of Technology, China)
Jing Xiao (South China Normal University, China)
Zheng Xiao (Hunan University, China)
Xiaolan Xie (Guilin University of Technology, China)
Zhiqiang Xie (Harbin University of Science and Technology, China)
Yu Xin (Harbin University of Science and Technology, China)
Jianbo Xu (Hunan University of Science and Technology, China)
Jiuyun Xu (China University of Petroleum, China)
Meng Xu (Shandong Technology and Business University, China)
Heyang Xu (Henan University of Technology, China)
Yonghui Xu (Shandong University, China)
Xiao Xue (Tianjin University, China)
Yaling Xun (Taiyuan University of Science and Technology, China)
Jiaqi Yan (Nanjing University, China)
Xiaohu Yan (Shenzhen Polytechnic, China)
Yan Yao (Qilu University of Technology, China)
Bo Yang (University of Electronic Science and Technology of China, China)
Chao Yang (Hunan University, China)
Dingyu Yang (Shanghai Dianji University, China)
Gang Yang (Northwestern Polytechnical University, China)
Jing Yang (Harbin Engineering University, China)
Lin Yang (Shanghai Computer Software Technology Development Center, China)
Tianruo Yang (Hainan University, China)
Xiaochun Yang (Northeastern University, China)
Xu Yu (Qingdao University of Science and Technology, China)
Shanping Yu (Beijing Institute of Technology, China)
Zhiwen Yu (Northwestern Polytechnical University, China)
Zhiyong Yu (Fuzhou University, China)
Jianyong Yu (Hunan University of Science and Technology, China)
Yang Yu (Zhongshan University, China)
Zhengtao Yu (Kunming University of Science and Technology, China)
Chengzhe Yuan (Guangdong Engineering and Technology Research Center for Service Computing, China)
Junying Yuan (Nanfang College Guangzhou, China)
An Zeng (Guangdong Polytechnical University, China)
Dajun Zeng (Institute of Automation, Chinese Academy of Sciences, China)
Zhihui Zhan (South China University of Technology, China)
Changyou Zhang (Chinese Academy of Sciences, China)
Jia Zhang (Jinan University, China)
Jifu Zhang (Taiyuan University of Science and Technology, China)
Jing Zhang (Nanjing University of Science and Technology, China)
Liang Zhang (Fudan University, China)
Libo Zhang (Southwest University, China)
Miaohui Zhang (Energy Research Institute of Jiangxi Academy of Sciences, China)
Peng Zhang (Fudan University, China)
Senyue Zhang (Shenyang Aerospace University, China)
Shaohua Zhang (Shanghai Software Technology Development Center, China)
Wei Zhang (Guangdong University of Technology, China)
Xin Zhang (Jiangnan University, China)
Zhiqiang Zhang (Harbin Engineering University, China)
Zili Zhang (Southwest University, China)
Hong Zhao (Xidian University, China)
Xiangwei Zheng (Shandong Normal University, China)
Jinghui Zhong (South China University of Technology, China)
Ning Zhong (Beijing University of Technology, China)
Yifeng Zhou (Southeast University, China)
Huiling Zhu (Jinan University, China)
Nengjun Zhu (Shanghai University, China)
Tingshao Zhu (Chinese Academy of Science, China)
Xia Zhu (Southeast University, China)
Xianjun Zhu (Jinling University of Science and Technology, China)
Yanhua Zhu (The First Affiliated Hospital of Guangdong Pharmaceutical University, China)
Jia Zhu (South China Normal University, China)
Jianhua Zhu (City University of Hong Kong, China)
Jie Zhu (Nanjing University of Posts and Telecommunications, China)
Qiaohong Zu (Wuhan University of Technology, China)
Contents – Part II
Crowd Intelligence and Crowd Cooperative Computing

MatricEs: Matrices Embeddings for Link Prediction in Knowledge Graphs . . . 3
Huiling Zhu, Liming Gao, and Hankui Zhuo

Learning User Embeddings Based on Long Short-Term User Group Modeling for Next-Item Recommendation . . . 18
Nengjun Zhu, Jieyun Huang, Jian Cao, and Shanshan Feng

Context-Aware Quaternion Embedding for Knowledge Graph Completion . . . 33
Jingbin Wang, Xinyi Yang, Xifan Ke, Renfei Wu, and Kun Guo

Dependency-Based Task Assignment in Spatial Crowdsourcing . . . 48
Wenan Tan, Zhejun Liang, Jin Liu, and Kai Ding

ICKG: An I Ching Knowledge Graph Tool Revealing Ancient Wisdom . . . 62
Gaojie Wang, Liqiang Wang, Shijun Liu, Haoran Shi, and Li Pan

Collaborative Analysis on Code Structure and Semantics . . . 75
Xiangdong Ning, Huiqian Wu, Lin Wan, Bin Gong, and Yuqing Sun

Temporal Planning-Based Choreography from Music . . . 89
Yuechang Liu, Dongbo Xie, Hankz Hankui Zhuo, Liqian Lai, and Zhimin Li

An Adaptive Parameter DBSCAN Clustering and Reputation-Aware QoS Prediction Method . . . 103
Yajing Li, Jianbo Xu, Guozheng Feng, and Wei Jian

Effectiveness of Malicious Behavior and Its Impact on Crowdsourcing . . . 118
Xinyi Ding, Zhenjie Zhang, Zhuangmiao Yuan, Tao Han, Huamao Gu, and Yili Fang

Scene Adaptive Persistent Target Tracking and Attack Method Based on Deep Reinforcement Learning . . . 133
Zhaotie Hao, Bin Guo, Mengyuan Li, Lie Wu, and Zhiwen Yu

Research on Cost Control of Mobile Crowdsourcing Supporting Low Budget in Large Scale Environmental Information Monitoring . . . 148
Lili Gao, Zhen Yao, and Liping Gao
Question Answering System Based on University Knowledge Graph . . . 164
Jingsong Leng, Yanzhen Yang, Ronghua Lin, and Yong Tang

Deep Reinforcement Learning-Based Scheduling Algorithm for Service Differentiation in Cloud Business Process Management System . . . 175
Yunzhi Wu, Yang Yu, and Maolin Pan

A Knowledge Tracing Model Based on Graph Attention Mechanism and Incorporating External Features . . . 187
Jianwei Cen, Zhengyang Wu, Li Huang, and Zhanxuan Chen

Crowd-Powered Source Searching in Complex Environments . . . 201
Yong Zhao, Zhengqiu Zhu, Bin Chen, and Sihang Qiu

Cooperative Evolutionary Computation and Human-Like Intelligent Collaboration

Task Offloading and Resource Allocation with Privacy Constraints in End-Edge-Cloud Environment . . . 219
Xia Zhu, Wei Sun, and Xiaoping Li

A Classifier-Based Two-Stage Training Model for Few-Shot Segmentation . . . 235
Zhibo Gu, Zhiming Luo, and Shaozi Li

EEG-Based Motor Imagery Classification with Deep Adversarial Learning . . . 243
Dezheng Liu, Siwei Liu, Hanrui Wu, Jia Zhang, and Jinyi Long

Comparison Analysis on Techniques of Preprocessing Imbalanced Data for Symbolic Regression . . . 256
Cuixin Ma, Wei-Li Liu, Jinghui Zhong, and Liang Feng

A Feature Reduction-Induced Subspace Multiple Kernel Fuzzy Clustering Algorithm . . . 271
Yiming Tang, Bing Li, Zhifu Pan, Xiao Sun, and Renhao Chen

A Deep Neural Network Based Resource Configuration Framework for Human-Machine Computing System . . . 286
Zhuoli Ren, Zhiwen Yu, Hui Wang, Liang Wang, and Jiaqi Liu

Research on User’s Mental Health Based on Comment Text . . . 298
Yubo Shen, Yangming Huang, Ru Jia, and Ru Li

A Multi-objective Level-Based Learning Swarm Optimization Algorithm with Preference for Epidemic Resource Allocation . . . 311
Guo Yang, Xuan-Li Shi, Feng-Feng Wei, and Wei-Neng Chen
Aesthetics-Driven Online Summarization to First-Person Tourism Videos . . . 326
Yiyang Shao, Bin Guo, Yuqi Zhang, Ke Ma, and Zhiwen Yu

Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction . . . 337
Feiyang Liu, Bin Guo, Hao Wang, and Yan Liu

A Weighting Possibilistic Fuzzy C-Means Algorithm for Interval Granularity . . . 352
Yiming Tang, Lei Xi, Wenbin Wu, Xi Wu, Shujie Li, and Rui Chen

An Evolutionary Multi-task Genetic Algorithm with Assisted-Task for Flexible Job Shop Scheduling . . . 367
Xuhui Ning, Hong Zhao, Xiaotao Liu, and Jing Liu

Depression Tendency Assessment Based on Cyber Psychosocial and Physical Computation . . . 379
Huanhong Huang, Deyue Kong, Fanmin Meng, Siyi Yang, Youzhe Liu, Weihui Dai, and Yan Kang

Optimization of On-Ramp Confluence Sequence for Internet of Vehicles with Graph Model . . . 387
Zhiheng Yuan, Yuanfei Fang, Xinran Qu, and Yanjun Shi

Chinese Event Extraction Based on Hierarchical Attention Mechanism . . . 401
Qingmeng Hu and Hongbin Wang

Instance-Aware Style-Swap for Disentangled Attribute-Level Image Editing . . . 412
Xinjiao Zhou, Bin Jiang, Chao Yang, Haotian Hu, and Minyu Sun

Collaborative Multi-head Contextualized Sparse Representations for Real-Time Open-Domain Question Answering . . . 423
Minyu Sun, Bin Jiang, Xinjiao Zhou, Bolin Zhang, and Chao Yang

Automatic Personality Prediction Based on Users’ Chinese Handwriting Change . . . 435
Yu Ji, Wen Wu, Yi Hu, Xiaofeng He, Changzhi Chen, and Liang He

Domain-Specific Collaborative Applications

A Faster, Lighter and Stronger Deep Learning-Based Approach for Place Recognition . . . 453
Rui Huang, Ze Huang, and Songzhi Su
A Improved Prior Box Generation Method for Small Object Detection . . . 464
Ximin Zhou, Zhiming Luo, and Shaozi Li

ACAGNN: Source Code Representation Based on Fine-Grained Multi-view Program Features . . . 476
Ji Li, Xiao Wang, and Chen Lyu

A Framework for Math Word Problem Solving Based on Pre-training Models and Spatial Optimization Strategies . . . 488
Weijiang Fan, Jing Xiao, and Yang Cao

A Spillover-Based Model for Default Risk Assessment of Transaction Entities in Bulk Commodity Trade . . . 499
Yin Chen, Kai Di, Yichuan Jiang, and Jiuchuan Jiang

The Sandpile Model of Japanese Empire Dynamics . . . 514
Peng Lu, Zhuo Zhang, and Mengdi Li

Active Authorization Control of Deep Models Using Channel Pruning . . . 530
Linna Wang, Yunfei Song, Yujia Zhu, and Daoxun Xia

A Knowledge Graph-Based Analysis Framework for Aircraft Configuration Change Propagation . . . 543
Yuxiao Wang, Xinyuan Zhang, Hongming Cai, Ben Wan, Mu Liu, and Lihong Jiang

Node-IBD: A Dynamic Isolation Optimization Algorithm for Infection Prevention and Control Based on Influence Diffusion . . . 555
Songjian Zhou, Zheng Zhang, Ziqiang Wu, Hao Cheng, Shuo Wang, Sheng Bi, and Hao Liao

A Hybrid Layout Method Based on GPU for the Logistics Facility Layout Problem . . . 570
Fulin Jiang, Lin Li, Junjie Zhu, and Xiaoping Liu

An Interpretable Loan Credit Evaluation Method Based on Rule Representation Learner . . . 580
Zihao Chen, Xiaomeng Wang, Yuanjiang Huang, and Tao Jia

A Survey of Computer Vision-Based Fall Detection and Technology Perspectives . . . 595
Manling Yang, Xiaohu Li, Jiawei Liu, Shu Wang, and Li Liu
3D Gaze Vis: Sharing Eye Tracking Data Visualization for Collaborative Work in VR Environment . . . 610
Song Zhao, Shiwei Cheng, and Chenshuang Zhu

A Learning State Monitoring Method Based on Face Feature and Posture . . . 622
Xiaoyi Qiao, Xiangwei Zheng, Shuqin Li, and Mingzhe Zhang

Meta-transfer Learning for Person Re-identification in Aerial Imagery . . . 634
Lili Xu, Houfu Peng, Linna Wang, and Daoxun Xia

Horizontal Federated Traffic Speed Prediction Base on Secure Node Attribute Aggregation . . . 645
Enjie Ye, Kun Guo, Wenzhong Guo, Dangrun Chen, Zihan Zhang, Fuan Li, and JiaChen Zheng

Author Index . . . 661
Contents – Part I
Social Media and Online Communities

Multi-step Ahead PM2.5 Prediction Based on Hybrid Machine Learning Techniques . . . 3
Yulin Wang, Junying Yuan, Yiwu Xu, and Yun Chen

A Joint Framework for Knowledge Extraction from Flight Training Comments . . . 17
Yuxuan Zhang, Jiaxing Shang, Linjiang Zheng, Quanwang Wu, Weiwei Cao, and Hong Sun

ScholarRec: A User Recommendation System for Academic Social Network . . . 28
Yu Weng, Wenguang Yu, Ronghua Lin, Yong Tang, and Chaobo He

Incremental Evolutionary Community Discovery Method Based on Neighbor Subgraph . . . 42
Yan Zhao, Chang Guo, Weimin Li, Dingmei Wei, and Heng Zhu

Video Rumor Classification Based on Multi-modal Theme and Keyframe Fusion . . . 58
Jinpeng You, Yanghao Lin, Dazhen Lin, and Donglin Cao

Association Rule Guided Web API Complementary Function Recommendation for Mashup Creation: An Explainable Perspective . . . 73
Pengfei He, Wenchao Qi, Xiaowei Liu, Linlin Liu, Dianlong You, Limin Shen, and Zhen Chen

Globally Consistent Vertical Federated Graph Autoencoder for Privacy-Preserving Community Detection . . . 84
Yutong Fang, Qingqing Huang, Enjie Ye, Wenzhong Guo, Kun Guo, and Xiaoqi Chen

Research on User Personality Characteristics Mining Based on Social Media . . . 95
Yu Zheng, Jun Shen, Ru Jia, and Ru Li
A Unified Stream and Batch Graph Computing Model for Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Jinkun Dai, Ling Wu, and Kun Guo
xxiv
Contents – Part I
A Feature Fusion-Based Service Classification Approach for Collaborative Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Kun Hu, Aohui Zhou, Ye Wang, Bo Jiang, and Qiao Huang Requirements Classification and Identification Approach for E-Collaboration Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Shizhe Song, Bo Jiang, Siyuan Zhou, Ye Wang, and Qiao Huang Community Evolution Tracking Based on Core Node Extension and Edge Variation Discerning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 Qifeng Zhuang, Zhiyong Yu, and Kun Guo University Knowledge Graph Construction Based on Academic Social Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 Yanzhen Yang, Jingsong Leng, Ronghua Lin, Jianguo Li, and Feiyi Tang Country-Level Collaboration Patterns of Social Computing Scholars . . . . . . . . . . 173 Jingcan Chen, Yuting Shao, Qingyuan Gong, and Yang Chen An Intelligent Mobile System for Monitoring Relapse of Depression . . . . . . . . . . 182 Wenyi Yin, Chenghao Yu, Pianran Wu, Wenxuan Jiang, Youzhe Liu, Tianqi Ren, and Weihui Dai Fine-Grained Sentiment Analysis of Online-Offline Danmaku Based on CNN and Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 Yan Tang and Hongyu Zhang Ramp Merging of Connected Vehicle With Virtual Platooning Control . . . . . . . . 207 Yijia Guo, Wenhao Wang, Wang Chen, Chaozhe Han, and Yanjun Shi Community Detection Based on Enhancing Graph Autoencoder with Node Structural Role . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
217 Ling Wu, Jinlong Yang, and Kun Guo Representation of Chinese-Vietnamese Bilingual News Topics Based on Heterogeneous Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 Zhilei He, Enchang Zhu, Zhengtao Yu, Shengxiang Gao, Yuxin Huang, and Linjie Xia Convolutional Self-attention Network for Sequential Recommendation . . . . . . . . 245 Yichong Hu, Liantao Lan, Ronghua Lin, Chengzhe Yuan, and Yong Tang Towards Using Local Process Mining to Analyse Learning Behavior Pattern . . . 257 Sipeng Ouyang, Yiping Wen, Jianxun Liu, and Lianyong Qi
Contents – Part I
xxv
Collaborative Mechanisms, Models, Approaches, Algorithms and Systems Memory-Effective Parallel Mining of Incremental Frequent Itemsets Based on Multi-scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 Linqing Wang, Yaling Xun, Jifu Zhang, and Huimin Bi An AST-Based Collaborative Discussion Tool for the MOOC Environment . . . . 284 Xinyue Yu and Tun Lu DQN-Based Comprehensive Consumption Minimization on Calculation Offloading in Mobile Edge Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Kai Ding, Wenan Tan, Zhejun Liang, and Jin Liu Stochastic Task Offloading Problems for Edge Computing . . . . . . . . . . . . . . . . . . . 306 Kexin Ding, Zhi Zhong, and Jie Zhu Container-Driven Scheduling Strategy for Scientific Workflows in Multi-vCPU Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 Peng Xiang, Bing Lin, Hongjie Yu, and Dui Liu A Segmented Path Heuristic Recovery Algorithm for WSNs Based on Mobile Sink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Nie Wenmei, Song Xiaoxia, Li Yong, and Zhang Xulong TRindex: Distributed Double-Layer Road Network Trajectory Index . . . . . . . . . . 350 Weiqi Chen, Na Tang, Jingjing Li, and Yong Tang Sleep Scheduling for Enhancing the Lifetime of Three-Dimensional Heterogeneous Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 Haoyang Zhou and Jingjing Li CoSBERT: A Cosine-Based Siamese BERT-Networks Using for Semantic Textual Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 Wenguang Yu, Yu Weng, Ronghua Lin, and Yong Tang Towards Heterogeneous Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
390 Yue Huang, Yonghui Xu, Lanju Kong, Qingzhong Li, and Lizhen Cui A Graph-Based Efficient Service Composition Method for Computer Aided Engineering (CAE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405 Zhuo Tian, Changyou Zhang, and Jiaojiao Xiao
xxvi
Contents – Part I
Privacy-Preserving Federated Learning Framework in Knowledge Concept Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411 Yangjie Qin, Jia Zhu, and Jin Huang RCPM: A Rule-Based Configurable Process Mining Method . . . . . . . . . . . . . . . . 422 Yang Gu, Yingrui Feng, Heng Huang, Yu Tian, and Jian Cao Popularity Bias Analysis of Recommendation Algorithm Based on ABM Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437 Cizhou Yu, Dongsheng Li, Tun Lu, and Yichuan Jiang Cloud-Edge Collaborative Task Scheduling Mechanism Based on Improved Parameter Adaptation Particle Swarm Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 Haoyang Zeng, Ningjiang Chen, Wanting Li, and Siyu Yu An Approach to Assessing the Health of Opensource Software Ecosystems . . . . 465 Ruoxuan Yang, Yongqiang Yang, Yijun Shen, and Hailong Sun Topic Discovery in Scientific Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481 Yujian Huang, Qiang Liu, Jia Liu, and Yanmei Hu Multi-agent Adversarial Reinforcement Learning Algorithm Based on Reward Query Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 Liwei Chen, Dingquan Jin, Tong Wang, and Yuan Chang UAV Target Roundup Strategy Based on Wolf Pack Hunting Behavior . . . . . . . . 502 Tong Wang, Jianchao Wang, Min Ouyang, and Yu Tai Prediction of New Energy Vehicles via ARIMA-BP Hybrid Model . . . . . . . . . . . 516 Beiteng Yang, Jianjun Liu, and Dongning Liu Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
Crowd Intelligence and Crowd Cooperative Computing
MatricEs: Matrices Embeddings for Link Prediction in Knowledge Graphs

Huiling Zhu1(B), Liming Gao2, and Hankui Zhuo3

1 Jinan University, Guangzhou, China
[email protected]
2 Wechat Group, Tencent Inc., Guangzhou, China
3 Sun Yat-sen University, Guangzhou, China
Abstract. Knowledge graphs have been constructed to represent real-world knowledge. However, downstream tasks usually suffer from the incompleteness of the knowledge graphs. To predict the missing links, various models have been proposed which embed the entities and relations into lower-dimensional spaces. Existing approaches usually ignore the fact that there are far fewer relations than entities and allow redundant parameters for relations. In this paper, we present MatricEs, a novel approach for link prediction, and propose its variations to reduce the dimension of the relation space. In particular, MatricEs utilizes matrix embeddings and models the relation as a linear transformation from the head entity matrix to the tail entity matrix. MatricEs is universal as it subsumes many link prediction models. Recently, relation patterns have drawn much attention in building models with better expressiveness and interpretability. We formally define the relation patterns which are satisfied by MatricEs, including symmetry, antisymmetry, inversion, commutative and non-commutative compositions, absorption and transitivity. Theoretical analysis shows that MatricEs is a valid, simple and universal model. Experiments show that MatricEs is effective as it outperforms most existing methods on link prediction datasets.
Keywords: Knowledge graph · Link prediction · Relation pattern

1 Introduction
Large-scale knowledge graphs have been constructed in the past two decades, including YAGO [22], DBpedia [16], Freebase [2] and Google's Knowledge Vault [7]. These knowledge graphs contain enormous facts about our real world and have been applied in several tasks such as natural language processing [11], information extraction [12], question-answering [4] and medical science [32]. A knowledge graph consists of facts in the real world which are stored in the symbolic form of triples (head entity, relation, tail entity). For instance, the fact that the Earth is a planet of the Sun is denoted by (Earth, be_planet_of, Sun). Although the above knowledge graphs contain millions or billions of facts,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 3–17, 2023. https://doi.org/10.1007/978-981-99-2385-4_1
there are still many missing facts. Link prediction, also known as knowledge graph completion, aims to find the missing facts automatically. Symbolic representations provide direct approaches, but are difficult to operate on and compute with, as the complexity grows rapidly with the size of the knowledge graph. Knowledge graph embedding, on the other hand, attempts to embed a knowledge graph into a lower-dimensional linear space so that operations on that space can be utilized in tasks including knowledge graph completion [19]. In the past decade, many approaches have been developed and proved effective in finding missing edges, including RESCAL [21], TransE [3], HolE [20], ComplEx [25], ANALOGY [18], SimplE [15], RotatE [23], DihEdral [28], QuatE [30], HAKE [31], ATTH [5] and BoxE [1]. These models represent the entities as vectors and the relations as vectors or matrices, which forces redundant parameters onto the relation space: if the dimension of the entity space is d, then the dimension of the relation space is d or even d². Yet knowledge graphs usually contain far fewer relations than entities. For instance, the FB15k dataset contains 14,951 entities but only 1,345 relations, while the WN18 dataset contains 40,943 entities but only 18 relations [3]. In this paper, we propose a novel embedding model MatricEs, which represents entities as m × n matrices and relations as n × n matrices. We further propose several variations in order to ensure that n ≪ m and to reduce the number of parameters in relation embeddings significantly. Recently, relation patterns have been investigated, including symmetry, antisymmetry, inversion and composition [15, 23, 28]. In this paper, we further define absorption and transitivity. Most existing models satisfy some of these patterns while MatricEs satisfies all of them. In fact, we prove the universality of MatricEs by showing that it subsumes many typical models. Details will be provided in Sect. 4.3.
To summarize, our contributions include: (1) the proposition of MatricEs; (2) variations of MatricEs that reduce the parameters; (3) a formal definition of relation patterns; (4) a proof of the universality of MatricEs; (5) experiments which exhibit the superiority of MatricEs in link prediction tasks. The rest of this paper is organized as follows: in Sect. 2, we formally define the task of link prediction and relation patterns; related approaches are reviewed in Sect. 3; in Sect. 4, we present the novel model MatricEs, its ability to express relation patterns and its variations which need few parameters, and compare MatricEs with some classical models; we exhibit and analyze the experiment results in Sect. 5 and conclude in Sect. 6.
2 Definition

2.1 Problem Definition
Let E and R denote the set of entities and the set of relations, respectively. A fact is denoted by a triple (h, r, t), where h, t ∈ E are the head entity and the tail entity, respectively, and r ∈ R is an irreflexive relation between them. Let T⁺ and T⁻ denote the sets of true facts and false facts, respectively. We have
that T⁺ ∪ T⁻ = E × R × E \ {(h, r, t) | h = t}. A knowledge graph K is a subset of T⁺. The task of link prediction is to extend K to a knowledge graph K′ such that K ⊂ K′ ⊂ T⁺.

2.2 Relation Patterns
We define relation patterns as follows.
– The inverse of a relation r is a relation r′ which satisfies (h, r, t) ∈ T⁺ ⇔ (t, r′, h) ∈ T⁺. For instance, part_of is the inverse of has_part, and _hypernym is the inverse of _hyponym.
– A relation r is symmetric if (h, r, t) ∈ T⁺ ⇔ (t, r, h) ∈ T⁺; namely, r′ = r. For instance, similar_to, spouse_of, adjacent_to and verb_group are symmetric relations.
– A relation r is antisymmetric if (h, r, t) ∈ T⁺ ⇒ (t, r, h) ∈ T⁻. Most relations are antisymmetric, as there are type constraints on entities. For instance, the head entity of place_of_birth should be a person while the tail entity must be a location. Antisymmetric relations whose head entity and tail entity are of the same type include prequel_of, location/contains and child_of. A relation is asymmetric if it is not symmetric. Antisymmetric relations are clearly asymmetric. Relations which are asymmetric but not antisymmetric include influenced_by and follows (in the graphs of social networks). In practice, models that express asymmetry perfectly are considered to model antisymmetry partially.
– A relation r3 is a composition of the relations r1 and r2, denoted by r3 = r1 ∘ r2, if (h, r1, m) ∈ T⁺ ∧ (m, r2, t) ∈ T⁺ ⇒ (h, r3, t) ∈ T⁺. For instance, uncle_of is a composition of brother_of and parent_of. Two relations r1 and r2 commute with each other if r1 ∘ r2 = r2 ∘ r1. Clearly, material/quality commutes with material/color while fellow_worker_of does not commute with neighbour_of.
– A relation r2 is left-absorbed by the relation r1 if r1 ∘ r2 = r1. Clearly, sibling_of is left-absorbed by parent_of. A relation r2 is right-absorbed by the relation r1 if r2 ∘ r1 = r1. Right-absorption is a dual notion of left-absorption: formally, r1 ∘ r2 = r1 holds if and only if r2′ ∘ r1′ = r1′. Example: child_of is right-absorbed by sibling_of.
– A relation r is transitive if r ∘ r = r. For instance, sister_of, descendant_of, has_part and equivalent_to are transitive relations.

2.3 Notations
Vectors and matrices are denoted by boldface lowercase letters and boldface uppercase letters, respectively. diag is the diagonalization operator. The Frobenius norm of a matrix M = (M_ij) is ‖M‖_F = √(Σ_i Σ_j |M_ij|²). The inner product of vectors is denoted by · or ⟨⟩, and the elementwise (Hadamard) product of vectors or matrices is denoted by ∘. The sigmoid function is denoted by σ: x ↦ 1/(1 + e^(−x)).
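The notation can be illustrated with a short NumPy sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def frobenius(M):
    # ||M||_F = sqrt of the sum of |M_ij|^2 over all i, j
    return np.sqrt(np.sum(np.abs(M) ** 2))

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))
```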
3 Related Work
Table 1. Embeddings and score functions of link prediction models. ‖·‖_{1/2} stands for either the L1 or the L2 norm; Re(·) stands for the real part of a complex number; ⟨⟩ stands for the inner product; SimplE jointly considers a fact (h, r, t) with its dual form (t, r⁻¹, h), so each entity e has two embeddings: e as head and e* as tail; D_K is the dihedral group which consists of K rotations and K reflections; H denotes the set of Hamilton's quaternions; ⊗ denotes the Hamilton product between two quaternions.

Model    | Entity embedding     | Relation embedding                       | Score function
RESCAL   | h, t ∈ R^d           | M_r ∈ R^(d×d)                            | hᵀ M_r t
TransE   | h, t ∈ R^d           | r ∈ R^d                                  | −‖h + r − t‖_{1/2}
DistMult | h, t ∈ R^d           | r ∈ R^d                                  | hᵀ diag(r) t
ComplEx  | h, t ∈ C^d           | r ∈ C^d                                  | Re(⟨h, r, t̄⟩)
TorusE   | [h], [t] ∈ R^d/Z^d   | [r] ∈ R^d/Z^d                            | −‖[h] + [r] − [t]‖
SimplE   | h, h*, t, t* ∈ R^d   | r, r⁻¹ ∈ R^d                             | ½(⟨h, r, t*⟩ + ⟨t, r⁻¹, h*⟩)
RotatE   | h, t ∈ C^d           | r ∈ C^d                                  | −‖h ∘ r − t‖
DihEdral | h, t ∈ R^(2d)        | R = diag(R₁, …, R_d), R₁, …, R_d ∈ D_K   | hᵀ R t
QuatE    | Q_h, Q_t ∈ H^d       | W_r ∈ H^d                                | (Q_h ⊗ W_r/‖W_r‖₂) · Q_t
MatricEs | M_h, M_t ∈ R^(m×n)   | M_r ∈ R^(n×n)                            | −‖M_h M_r − M_t‖_F
Various models have been proposed for knowledge graph completion. These models can be roughly categorized into two groups [26]: transformation based models and semantic matching models. Table 1 summarizes the most typical and recent models.

3.1 Transformation Based Models

Transformation based models usually embed the entities as vectors, model the relations as transformations which act on the head entities, and measure the element-to-element difference between the tail entity and the transformed head entity. Translational distance models are the earliest transformation based models. TransE [3] embeds the entities and relations as vectors in R^n and computes the plausibility of a triple (h, r, t) by ‖h + r − t‖. TransE is simple and scalable, yet it can only express 1-to-1 relations. A family of translational distance models named after Trans have been proposed with better expressiveness, including TransH [27], TransM [9], TransR [17], TransD [13], TranSparse [14] and TransF [10]. As projections from higher-dimensional Euclidean spaces are used, the cost paid is an increase in model complexity. A review of these approaches can be found in the survey [26].
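As a concrete illustration of the translational score (a minimal sketch; the function name is ours, not from [3]):

```python
import numpy as np

def transe_score(h, r, t, p=2):
    # plausibility of (h, r, t) under TransE: -||h + r - t||_p
    return -np.linalg.norm(h + r - t, ord=p)

# an exactly-translated triple (h + r == t) attains the maximal score 0
h, r, t = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
```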
Recently, generalizations have been studied extensively, both of the space and of the transformation. TorusE [8] is based on a torus T^n, which is essentially the quotient space R^n/Z^n. RotatE [23] embeds the entities in the complex space C^n and models the relation as a rotation from the head entity to the tail entity. These models can express inversion, antisymmetry and commutative composition, but can hardly express non-commutative composition, absorption or transitivity, since the operations used (addition and multiplication on R or C) are commutative and invertible.

3.2 Semantic Matching Models

Semantic matching models utilize latent semantic attributes of the embeddings and can be further categorized into two subgroups: neural network based models and bilinear models. Different from transformation based models, the score function of a semantic matching model may compute all possible combinations of elements of the head entity and the tail entity: informally, for every pair of integers (i, j), h_i interacts with t_j via some element of the relation embedding. A review of neural network based models can be found in the survey [26]. Such models contain more parameters and are unlikely to model relation patterns in general. The first bilinear model, RESCAL [21], represents each entity as a vector and each relation as a square matrix M_r. The score function is the bilinear form hᵀ M_r t. Later, many bilinear models have been proposed, either to reduce the complexity or to gain better expressiveness, including DistMult [29], HolE [20], ComplEx [25], ANALOGY [18], SimplE [15], DihEdral [28] and QuatE [30]. Bilinear models can model symmetry and inversion, but may not model antisymmetry, composition, absorption or transitivity.
4 Our MatricEs Approach
In this section, we propose a novel approach MatricEs, which is by nature a transformation based model.

4.1 MatricEs
We represent the entities h, t as m × n matrices M_h, M_t and the relation r as an n × n matrix M_r. We expect that a true fact (h, r, t) satisfies M_h M_r = M_t. Informally, a single row of the entity matrix may stand for some attribute of the entity; the relation matrix, as a linear transformation from the head entity to the tail entity, connects the attributes of the entities. The fact that the relation matrix is shared by all rows of the entity matrix not only reduces the number of parameters used for relations but also allows the rows of the entity matrix to propagate information during training. As matrix multiplication is in general non-commutative and non-invertible, MatricEs is able to express non-commutative composition, absorption and transitivity.
Formally, we define the score function of MatricEs as:

f_r(h, t) = −‖M_h M_r − M_t‖_F    (1)

The dimension of the relation space for MatricEs is d = n², which is comparable with state-of-the-art models. To further reduce the dimension of the relation space, we propose four variations of MatricEs, namely MatricEs-S, MatricEs-SS, MatricEs-D and MatricEs-SD, as shown in Fig. 1.
Fig. 1. Variations of MatricEs
Specifically, the four variations are as follows:
– Square MatricEs (MatricEs-S). Let m = n, so that M_h, M_r, M_t are square matrices and the embeddings of entities and relations have the same dimension, which is similar to many existing approaches. Additional requirements on the square matrices may include invertibility, determinant ±1 and orthogonality.
– Diagonal Blocks MatricEs (MatricEs-D). Let m = n and let l be a divisor of n. The embeddings are n × n matrices partitioned into l × l blocks, within which only the diagonal blocks may contain nonzero elements. l is a small number, practically in the set {2, 3, 4}. On one hand, the parameters for a relation embedding are further reduced to nl. On the other hand, transformations described by low-rank square matrices have clear geometric meaning.
– Stack of Square MatricEs (MatricEs-SS). Let n | m, namely m be a multiple of n, so that the embeddings of entities are stacks of square matrices which share the same relation matrix. In practice, n can be a relatively small number; the number of parameters for a relation is then reduced to n².
– Stack of Diagonal Blocks MatricEs (MatricEs-SD). Let l | n | m; the embeddings of relations are diagonal block matrices while the embeddings of entities are stacks of such matrices.

4.2 Overview of Our Learning Framework
For each positive sample (h, r, t), negative samples are generated by corrupting the head entity or the tail entity. We adopt the self-adversarial negative sampling proposed in [23]:
p(h′_j, r, t′_j | {(h_i, r_i, t_i)}) = exp(α f_r(h′_j, t′_j)) / Σ_i exp(α f_r(h′_i, t′_i))    (2)
where (h′_i, r, t′_i) is a sampled negative triple and α is the temperature of sampling. We utilize the corresponding margin loss function:

L = −log σ(γ + f_r(h, t)) − Σ_{i=1}^{n} p(h′_i, r, t′_i) log σ(−f_r(h′_i, t′_i) − γ)    (3)

where γ is the margin. The learning framework is described by Algorithm 1.
Algorithm 1
Require: Training set K = {(h, r, t)}, set of entities E and set of relations R. Hyperparameters: margin γ, matrix dimension d, negative sample ratio ρ, sampling temperature α.
  initialize M_e ← uniform(−(γ + 2.0)/d, (γ + 2.0)/d) for e ∈ E
             M_r ← uniform(−(γ + 2.0)/d, (γ + 2.0)/d) for r ∈ R
  repeat
    T_pos ← uniformly randomly sample triples (h, r, t) from K
    for (h, r, t) ∈ T_pos do
      (h′, r, t′) ← generate ρ negative samples for (h, r, t)
      T ← T_pos ∪ {(h′, r, t′)}
    end for
    compute the weight of each (h′, r, t′): p(h′_j, r, t′_j | {(h′_i, r, t′_i)}) = exp(α f_r(h′_j, t′_j)) / Σ_{i=1}^{ρ} exp(α f_r(h′_i, t′_i))
    update the embeddings:
      M_r ← M_r − ∇_{θ_r} [−log σ(γ + f_r(h, t)) − Σ_{i=1}^{ρ} p(h′_i, r, t′_i) log σ(−f_r(h′_i, t′_i) − γ)]
      M_e ← M_e − ∇_{θ_e} [−log σ(γ + f_r(h, t)) − Σ_{i=1}^{ρ} p(h′_i, r, t′_i) log σ(−f_r(h′_i, t′_i) − γ)]
  end repeat
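The weighting of Eq. (2) and the loss of Eq. (3) can be sketched as follows (a hedged NumPy illustration; the paper's actual implementation is in PyTorch and its details may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adversarial_loss(pos_score, neg_scores, gamma, alpha):
    # Eq. (2): softmax weights over the negative-sample scores
    w = np.exp(alpha * neg_scores)
    w = w / w.sum()
    # Eq. (3): margin loss with self-adversarially weighted negatives
    return (-np.log(sigmoid(gamma + pos_score))
            - np.sum(w * np.log(sigmoid(-neg_scores - gamma))))
```

A well-fit positive triple (score near 0) with badly-fit negatives (very negative scores) yields a small loss, while the reverse yields a large one.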
4.3 Theoretical Analyses
Comparison with DihEdral and RotatE. DihEdral [28] embeds entities as h, t ∈ R^n for an even number n, and a relation as R = diag(R₁, …, R_{n/2}), with each R_k being a 2 × 2 matrix for a rotation or reflection. The score function of DihEdral is hᵀ R t, which is equivalent to −‖hᵀ R − tᵀ‖ under the assumption that h, t are unit vectors. Hence, the bilinear model DihEdral can be considered a transformation based model, and in fact a special case of Diagonal Blocks MatricEs, by letting every row of M_h be hᵀ, every row of M_t be tᵀ, and M_r = R. DihEdral is unable to model absorption or transitivity, as its relations are embedded as rotation and reflection matrices, which are invertible.
RotatE [23] embeds entities as h, t ∈ C^(n/2) and relations as r ∈ C^(n/2), with each component of a relation r being a unit complex number cos θ + i sin θ. The score function of RotatE is −‖h ∘ r − t‖. By representing a rotation as the matrix

( cos θ  −sin θ )
( sin θ   cos θ )

it is obvious that RotatE is subsumed by DihEdral and hence by MatricEs. RotatE is unable to model non-commutative composition, as multiplication of unit complex numbers is determined by addition of angles, which is commutative.

Comparison with TransE. Both TransE and MatricEs are transformation based models. In order to connect the score functions, we need to express vector addition in the form of matrix multiplication as follows. Let S be the n × n shifting matrix:

( 0 0 0 … 1 )
( 1 0 0 … 0 )
( 0 1 0 … 0 )
( … … … … )
( 0 0 … 1 0 )

Let M_h = diag(h) + S, M_r = S^(n−1) diag(r) + I and M_t = diag(t) + S. Then ‖h + r − t‖₂ = ‖(M_h M_r − M_t) ∘ I‖_F. Although the above connection seems to "do a little with a lot", it reveals the universality of MatricEs. Many other translational distance models can be similarly subsumed by MatricEs.

Modeling Relation Patterns. In this part, we study how MatricEs models relation patterns and compare it with other models.

Theorem 1. MatricEs models all the relation patterns defined in Sect. 2.

Proof.
– Symmetry: if a relation r is symmetric, then it can be represented by a symmetric matrix M_r = M_rᵀ.
– Antisymmetry: an antisymmetric relation r can be best modeled by a matrix M_r which satisfies M_r = −M_rᵀ; namely, M_r is an antisymmetric matrix with 0s on the diagonal.
– Inversion: if M_r is the embedding of a relation r, then M_rᵀ represents the inverse relation r′.
– Composition: if M_{r1} and M_{r2} represent r1 and r2, respectively, then M_{r1} M_{r2} represents r1 ∘ r2. Commutativity and non-commutativity of composition coincide with those of matrix multiplication.
– Absorption and transitivity can be modeled similarly.
In particular, a transitive relation is described by an idempotent matrix: M_r M_r = M_r.

Comparison on Modeling Relation Patterns. A comparison of the abilities of link prediction models to express relation patterns is shown in Table 2. As we just proved, MatricEs is fully expressive. We remark that TransE can express asymmetry but not antisymmetry: if TransE imposed both h + r = t and a large ‖t + r − h‖, then, since h + r = t implies t + r − h = 2r, ‖r‖ would be large, which contradicts the setting of TransE. RotatE models antisymmetry with θ ∈ {π/2, 3π/2}.
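The claims above are easy to check numerically: composition is plain associativity of matrix multiplication, transitivity corresponds to idempotence, and the TransE reduction holds exactly for the shifting-matrix construction (a sketch; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Composition: the matrix product Mr1 @ Mr2 represents r1 ∘ r2
# (this is just associativity of matrix multiplication).
Mh = rng.normal(size=(3, n))
Mr1, Mr2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Mt = (Mh @ Mr1) @ Mr2                     # (h, r1, m) and (m, r2, t) hold exactly
comp_ok = np.allclose(Mh @ (Mr1 @ Mr2), Mt)

# Transitivity: an idempotent relation matrix satisfies Mr @ Mr == Mr.
P = np.diag([1.0, 0.0, 1.0, 1.0, 0.0])
trans_ok = np.allclose(P @ P, P)

# TransE reduction: with the cyclic shift S, let
#   Ah = diag(h) + S,  Ar = S^(n-1) diag(r) + I,  At = diag(t) + S;
# then ||h + r - t||_2 equals ||(Ah Ar - At) ∘ I||_F.
h, r, t = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
S = np.roll(np.eye(n), 1, axis=0)         # S[0, n-1] = 1 and S[i, i-1] = 1
Ah = np.diag(h) + S
Ar = np.linalg.matrix_power(S, n - 1) @ np.diag(r) + np.eye(n)
At = np.diag(t) + S
transe_ok = np.isclose(np.linalg.norm(h + r - t),
                       np.linalg.norm((Ah @ Ar - At) * np.eye(n), ord="fro"))
```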
Table 2. Comparison of expressiveness of link prediction models. Sym, Anti, Inv, CC, NC, Abs and Trans stand for symmetry, antisymmetry, inversion, commutative composition, non-commutative composition, absorption and transitivity; ✓ stands for "able to express", ✗ for "unable to express", and (✓) for partial expressibility, namely "able to express asymmetry but unable to express antisymmetry".

Model    | Sym | Anti | Inv | CC | NC | Abs | Trans
RESCAL   | ✓   | ✓    | ✓   | ✗  | ✗  | ✗   | ✗
TransE   | ✗   | (✓)  | ✓   | ✓  | ✗  | ✗   | ✗
DistMult | ✓   | ✗    | ✗   | ✗  | ✗  | ✗   | ✗
ComplEx  | ✓   | ✓    | ✓   | ✗  | ✗  | ✗   | ✗
TorusE   | ✓   | ✓    | ✓   | ✓  | ✗  | ✗   | ✗
SimplE   | ✓   | ✓    | ✓   | ✗  | ✗  | ✗   | ✗
RotatE   | ✓   | ✓    | ✓   | ✓  | ✗  | ✗   | ✗
DihEdral | ✓   | ✓    | ✓   | ✓  | ✓  | ✗   | ✗
QuatE    | ✓   | ✓    | ✓   | ✓  | ✗  | ✗   | ✗
MatricEs | ✓   | ✓    | ✓   | ✓  | ✓  | ✓   | ✓
5 Experiments
In this section, we exhibit the experimental results of MatricEs-D and MatricEs-SD. Compared with the other two variations, they need fewer parameters for the relation embeddings and are more effective in predicting missing edges.

5.1 Datasets
We conducted experiments on four popular datasets: WN18, FB15k, WN18RR and FB15k-237, whose statistics are shown in Table 3. WN18 [3] was sampled from WordNet¹, an English-language knowledge graph which consists of descriptions of words and relations between pairs of words. FB15k [3] was sampled from Freebase [2], a large-scale knowledge graph about real-life facts. WN18RR [6] and FB15k-237 [24] were further sampled from WN18 and FB15k, respectively, by removing one relation from each pair of inverse relations. Table 3. Statistics of Datasets.
number of number of triples in entities relations Training set Validation set Test set
WN18
41k
18
141k
5k
5k
FB15K
15k
1.3k
483k
50k
59k
WN18 RR
41k
11
87k
3k
3k
237
273k
18 k
20k
FB15K-237 15 k 1
https://wordnet.princeton.edu/.
5.2 Evaluation Metric
To measure the rank of the correct entity when completing an incomplete triple, we use the standard evaluation metrics, including MR (Mean Rank), MRR (Mean Reciprocal Rank) and Hit@n with n = 1, 3, 10, which are defined as follows. For each test triple (h, r, t), we first replace h by every entity h′ ∈ E and compute f_r(h′, t) for each resulting triple (h′, r, t). We then sort the candidates h′ by their scores f_r(h′, t) in descending order to obtain the rank of the original entity h, denoted by K(h). MR is defined as:

MR = ( Σ_{(h,r,t)∈T} K(h) ) / |T|

which means MR is the average rank of the original entities over the test triples. Likewise, MRR can be calculated as follows:

MRR = ( Σ_{(h,r,t)∈T} 1/K(h) ) / |T|
which indicates MRR is the average inverse rank of the original entities over the test triples. Hit@n denotes the proportion of original entities ranked in the top n, which can be calculated by:

Hit@n = ( Σ_{(h,r,t)∈T} χ(K(h) ≤ n) ) / |T|

where χ(K(h) ≤ n) is 1 if K(h) ≤ n, and 0 otherwise. We compare MatricEs with several strong baselines. For translation based models, we select the classical model TransE and its successors TorusE and RotatE. For bilinear models, we select the typical model ComplEx and two recent models, DihEdral and QuatE. Neural network based models are not under consideration, as they generally use too many parameters and are hard to interpret.
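The three metrics can be computed in a few lines (a minimal sketch; the function name is ours):

```python
import numpy as np

def ranking_metrics(ranks, n=10):
    # ranks: the rank K(h) of each original head entity over the test set T
    ranks = np.asarray(ranks, dtype=float)
    mr = ranks.mean()              # MR: mean rank
    mrr = (1.0 / ranks).mean()     # MRR: mean reciprocal rank
    hitn = (ranks <= n).mean()     # Hit@n: fraction ranked in the top n
    return mr, mrr, hitn
```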
5.3 Implementation Details
We used PyTorch (https://pytorch.org/) to implement our model. Hyper-parameters were selected by grid search as follows: fixed margin γ ∈ {6, 9, 12, 18, 24}; learning rate initialized as 0.1 and tuned within {0.0001, 0.0003, 0.001, 0.003}; batch size ∈ {32, 128, 256, 512, 1024}; negative sampling ratio ∈ {32, 128}; and self-adversarial sampling temperature α ∈ {0.5, 1.0}. The dimension of the relation embedding d was in the set {200, 400, 1000} with m = n ∈ {100, 200, 500} and l = 2 for MatricEs-D, and in the set {200, 400} with n = m/2 ∈ {100, 200} and l = 2 for MatricEs-SD. Elements of the matrices were initialized randomly between −1 and 1.

5.4 Experimental Results
The results are shown in Table 4 and Table 5. MatricEs outperforms all baseline models on FB15k, WN18 and FB15k-237. On the WN18RR dataset, it achieves the best results on the MR and Hit@10 metrics.
Table 4. Link prediction results on the datasets FB15k and WN18. Results marked by "♠" are taken from the paper [Sun et al., 2019] and the rest are taken from the original papers. The best results are in bold and the second best results are underlined.

Model       | FB15k: MR | MRR   | Hit@1 | Hit@3 | Hit@10 | WN18: MR | MRR   | Hit@1 | Hit@3 | Hit@10
TransE ♠    | –         | 0.463 | 0.297 | 0.578 | 0.749  | –        | 0.495 | 0.113 | 0.888 | 0.943
ComplEx ♠   | –         | 0.692 | 0.599 | 0.759 | 0.840  | –        | 0.941 | 0.936 | 0.945 | 0.947
HolE        | –         | 0.524 | 0.402 | 0.613 | 0.739  | –        | 0.938 | 0.930 | 0.945 | 0.949
ANALOGY     | –         | 0.725 | 0.646 | 0.785 | –      | –        | 0.942 | 0.939 | 0.944 | –
TorusE      | –         | 0.733 | 0.674 | 0.771 | 0.832  | –        | 0.947 | 0.943 | 0.950 | 0.954
SimplE      | –         | 0.727 | 0.660 | 0.773 | 0.838  | –        | 0.942 | 0.939 | 0.944 | 0.947
RotatE ♠    | 40        | 0.797 | 0.746 | 0.830 | 0.884  | 309      | 0.949 | 0.944 | 0.952 | 0.959
DihEdral    | –         | 0.733 | 0.641 | 0.803 | 0.877  | –        | 0.946 | 0.942 | 0.948 | 0.952
QuatE       | 41        | 0.770 | 0.700 | 0.821 | 0.878  | 388      | 0.949 | 0.941 | 0.954 | 0.960
MatricEs-D  | 39        | 0.800 | 0.738 | 0.836 | 0.888  | 246      | 0.950 | 0.944 | 0.952 | 0.961
MatricEs-SD | 43        | –     | 0.756 | 0.831 | 0.888  | 193      | 0.950 | 0.944 | 0.953 | 0.960

5.5 Discussion
In this part, we analyze the properties of our approach. We compare our MatricEs-SD with two remarkable models, RotatE and QuatE, on the number of free parameters used for both entities and relations on the four datasets. The results are shown in Table 6. For WN18 and WN18RR, MatricEs-SD outperforms the baselines when m = 200, n = 100 and l = 2; for the FB15K and FB15K-237 datasets, MatricEs-SD performs best when m = 400, n = 200 and l = 2. MatricEs-SD uses about 60% fewer parameters than RotatE on all four datasets and over 50% fewer than QuatE on WN18 and FB15K. On WN18RR, MatricEs-SD uses as many parameters as QuatE with improved performance. When setting m = 200, n = 100 and l = 2, so that MatricEs-SD has the same number of parameters as QuatE, MatricEs-SD still performs better than QuatE, with MR = 242, MRR = 0.319 and Hit@1/3/10 = 0.224/0.358/0.510.

Table 5. Link prediction results on the datasets FB15k-237 and WN18RR. Results marked by "♠" are taken from the paper [Sun et al., 2019] and the rest of the results are taken from the original papers. The best results are in bold and the second-best results are underlined.

| Model       | FB15k-237 MR | MRR   | Hit@1 | Hit@3 | Hit@10 | WN18RR MR | MRR   | Hit@1 | Hit@3 | Hit@10 |
|-------------|--------------|-------|-------|-------|--------|-----------|-------|-------|-------|--------|
| TransE ♠    | 357          | 0.294 | –     | –     | 0.465  | 3384      | 0.226 | –     | –     | 0.501  |
| ComplEx ♠   | 339          | 0.247 | 0.158 | 0.275 | 0.428  | 5261      | 0.44  | 0.41  | 0.46  | 0.51   |
| RotatE ♠    | 177          | 0.338 | 0.241 | 0.375 | 0.533  | 3340      | 0.476 | 0.428 | 0.492 | 0.571  |
| DihEdral    | –            | 0.32  | 0.23  | 0.353 | 0.502  | –         | 0.48  | 0.452 | 0.491 | 0.536  |
| QuatE       | 176          | 0.311 | 0.221 | 0.342 | 0.495  | 3472      | 0.481 | 0.436 | 0.500 | 0.564  |
| MatricEs-D  | 209          | 0.346 | 0.247 | 0.385 | 0.544  | 2999      | 0.472 | 0.421 | 0.491 | 0.575  |
| MatricEs-SD | 212          | 0.341 | 0.244 | 0.380 | 0.534  | 2904      | 0.462 | 0.409 | 0.478 | 0.572  |
H. Zhu et al.
Table 6. Comparison on the number of free parameters. The numbers of free parameters for RotatE and QuatE are taken from [Zhang et al., 2019]. l = 2 for MatricEs-SD. The number of free parameters for MatricEs-SD equals ml(#E) + nl(#R).

| Dataset   | RotatE | QuatE  | m   | n   | MatricEs-SD | compare RotatE | compare QuatE |
|-----------|--------|--------|-----|-----|-------------|----------------|---------------|
| WN18      | 40.95M | 49.15M | 200 | 100 | 16.38M      | ↓ 60.00%       | ↓ 66.67%      |
| FB15K     | 31.25M | 26.08M | 400 | 200 | 12.1M       | ↓ 61.28%       | ↓ 53.6%       |
| WN18RR    | 40.95M | 16.38M | 200 | 100 | 16.38M      | ↓ 60.0%        | 0.0%          |
| FB15K-237 | 29.32M | 5.82M  | 400 | 200 | 11.73M      | ↓ 60.0%        | ↑ 101.5%      |
| FB15K-237 | 29.32M | 5.82M  | 200 | 100 | 6M          | ↓ 79.54%       | ↑ 3%          |
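The parameter formula ml(#E) + nl(#R) can be checked numerically. The sketch below plugs in the standard entity/relation counts of the benchmarks (#E = 40943, #R = 18 for WN18; #E = 14541, #R = 237 for FB15k-237) — these counts are assumed here, as the text does not state them:

```python
def matrices_sd_params(num_entities, num_relations, m, n, l=2):
    """Free parameters of MatricEs-SD: ml(#E) + nl(#R), as in Table 6."""
    return m * l * num_entities + n * l * num_relations

# Standard benchmark statistics (#E, #R); assumed, not given in the text.
wn18 = matrices_sd_params(40943, 18, m=200, n=100)
fb237 = matrices_sd_params(14541, 237, m=400, n=200)
print(f"WN18: {wn18 / 1e6:.2f}M")        # 16.38M, matching Table 6
print(f"FB15K-237: {fb237 / 1e6:.2f}M")  # 11.73M, matching Table 6
print(f"vs RotatE on WN18: {1 - wn18 / 40.95e6:.1%} fewer")  # 60.0%
```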
We would like to see to what extent the relation patterns are learned by our model. The results are shown in Fig. 2. (a) and (b) show that symmetry is learned perfectly. (c) and (d) are auxiliary figures illustrating the distributions of two relation embeddings. (e) and (f) show unsatisfactory behavior on antisymmetric relations. We argue that learning antisymmetry is hard mainly because of the structure of the data: only positive triples are stored in knowledge graphs, yet antisymmetry asserts the existence of negative triples. Employing type constraints would ease this problem, as negative triples would be confirmed by the constraints and become useful for training link prediction models. As we mainly focus on the algorithm in this paper, we did not consider type constraints. (g) shows that, despite the unsatisfactory results on antisymmetry, the pair of relations in (e) and (f) have embeddings that are transposes of each other, which perfectly models the inversion pattern. (h) shows that the composition pattern is learned satisfactorily. Although MatricEs perfectly models absorption and transitivity, the lack of data with such relation patterns makes empirical verification difficult.
Fig. 2. Visualization of relation patterns learned by MatricEs-SD. In the histograms, the X-axes describe intervals of the ranges of elements in the corresponding matrices and the Y-axes describe the distribution, namely the percentage of elements that lie in each interval. |M| denotes the matrix obtained from M by taking the element-wise absolute value. (a) and (b) illustrate the symmetry of two relations from WN18, via |M_{r1}^T − M_{r1}| and |M_{r2}^T − M_{r2}|, respectively, where r1 is _similar_to and r2 is _verb_group; (c) and (d) illustrate the distributions of two relation embeddings M_{r3} and M_{r4}, where r3 is _hypernym and r4 is _hyponym. (e) and (f) attempt to illustrate the antisymmetry of r3 and r4 via |M_{r3}^T + M_{r3}| and |M_{r4}^T + M_{r4}|, respectively, while (g) shows that they are in fact the inverse of each other, via |M_{r3}^T − M_{r4}|; (h) illustrates the composition of three relations from FB15k via |M_{r5} M_{r6} − M_{r7}|, where r5 is the relation award_nominee/award_nominations./award/award_nomination/nominated_for, r6 is award_category/winners./award/award_honor/award_winner and r7 is award_category/nominees./award/award_nomination/nominated_for.
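The residual checks visualized in Fig. 2 can be sketched as follows, with synthetic random matrices standing in for the learned relation embeddings (everything here is illustrative, not the authors' evaluation code):

```python
import numpy as np

def residual(M):
    """Mean element-wise absolute value; near zero means the pattern holds."""
    return float(np.abs(M).mean())

n = 100
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
S = (A + A.T) / 2                 # a symmetric relation embedding
M3 = rng.normal(size=(n, n))      # stand-in for _hypernym
M4 = M3.T                         # its inverse, stand-in for _hyponym
M5, M6 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
M7 = M5 @ M6                      # composition of two relations

print(residual(S.T - S))          # symmetry check |M^T - M| -> 0.0
print(residual(M3.T - M4))        # inversion check |M_r3^T - M_r4| -> 0.0
print(residual(M5 @ M6 - M7))     # composition check |M_r5 M_r6 - M_r7| -> 0.0
print(residual(M3.T + M3) > 0.5)  # antisymmetry residual stays large: True
```

For actually learned embeddings the residuals would be small but nonzero, which is exactly what the histograms in Fig. 2 visualize.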
6 Conclusion
In this paper, we introduced a novel link prediction model, MatricEs, which defines a relation as a linear transformation from the head entity matrix to the tail entity matrix. We designed a couple of variations of MatricEs to reduce the number of parameters. We proved the universality of MatricEs, as it subsumes several typical transformation-based models, and presented how MatricEs models relation patterns. We conducted experiments on four link prediction datasets and verified the validity, effectiveness and simplicity of MatricEs. In the future, we will put effort into overcoming the problems with the data, so as to better understand the relation patterns of antisymmetry, absorption and transitivity.

Acknowledgements. This work is supported by the Joint Funds of the National Natural Science Foundation of China (Grant No. U1811263) and the National Natural Science Foundation of China for Young Scientists of China (Grant No. 11701592).
References

1. Abboud, R., Ceylan, İ.İ., Lukasiewicz, T., Salvatori, T.: BoxE: a box embedding model for knowledge base completion. In: NeurIPS 2020 (2020)
2. Bollacker, K.D., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1247–1250 (2008)
3. Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, vol. 26, pp. 2787–2795 (2013)
4. Bordes, A., Weston, J., Usunier, N.: Open question answering with weakly supervised embedding models. In: Proceedings of Machine Learning and Knowledge Discovery in Databases - European Conference, pp. 165–180 (2014)
5. Chami, I., Wolf, A., Juan, D., Sala, F., Ravi, S., Ré, C.: Low-dimensional hyperbolic knowledge graph embeddings. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pp. 6901–6914. Association for Computational Linguistics (2020)
6. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1811–1818 (2018)
7. Dong, X., et al.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–610 (2014)
8. Ebisu, T., Ichise, R.: TorusE: knowledge graph embedding on a Lie group. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1819–1826 (2018)
9. Fan, M., Zhou, Q., Chang, E., Zheng, T.F.: Transition-based knowledge graph embedding with relational mapping properties. In: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation, pp. 328–337 (2014)
10. Feng, J., Huang, M., Wang, M., Zhou, M., Hao, Y., Zhu, X.: Knowledge graph embedding by flexible translation. In: Principles of Knowledge Representation and Reasoning: Proceedings of the Fifteenth International Conference, pp. 557–560 (2016)
11. Gabrilovich, E., Markovitch, S.: Wikipedia-based semantic interpretation for natural language processing. J. Artif. Intell. Res. 34, 443–498 (2009)
12. Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L.S., Weld, D.S.: Knowledge-based weak supervision for information extraction of overlapping relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 541–550 (2011)
13. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pp. 687–696 (2015)
14. Ji, G., Liu, K., He, S., Zhao, J.: Knowledge graph completion with adaptive sparse transfer matrix. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 985–991 (2016)
15. Kazemi, S.M., Poole, D.: SimplE embedding for link prediction in knowledge graphs. In: Advances in Neural Information Processing Systems, vol. 31, pp. 4289–4300 (2018)
16. Lehmann, J., et al.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)
17. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2181–2187 (2015)
18. Liu, H., Wu, Y., Yang, Y.: Analogical inference for multi-relational embeddings. In: Proceedings of the 34th International Conference on Machine Learning, pp. 2168–2178 (2017)
19. Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1), 11–33 (2016)
20. Nickel, M., Rosasco, L., Poggio, T.A.: Holographic embeddings of knowledge graphs. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1955–1961 (2016)
21. Nickel, M., Tresp, V., Kriegel, H.: A three-way model for collective learning on multi-relational data. In: Proceedings of the 28th International Conference on Machine Learning, pp. 809–816 (2011)
22. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706 (2007)
23. Sun, Z., Deng, Z., Nie, J., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space. In: 7th International Conference on Learning Representations (2019)
24. Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66 (2015)
25. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 2071–2080 (2016)
26. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017)
27. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1112–1119 (2014)
28. Xu, C., Li, R.: Relation embedding with dihedral group in knowledge graph. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 263–272 (2019)
29. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: 3rd International Conference on Learning Representations (2015)
30. Zhang, S., Tay, Y., Yao, L., Liu, Q.: Quaternion knowledge graph embeddings. In: Advances in Neural Information Processing Systems, vol. 32, pp. 2731–2741 (2019)
31. Zhang, Z., Cai, J., Zhang, Y., Wang, J.: Learning hierarchy-aware knowledge graph embeddings for link prediction. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 3065–3072. AAAI Press (2020)
32. Zheng, S., et al.: PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics (2020)
Learning User Embeddings Based on Long Short-Term User Group Modeling for Next-Item Recommendation

Nengjun Zhu¹, Jieyun Huang¹, Jian Cao²(B), and Shanshan Feng³

¹ School of Computer Engineering and Science, Shanghai University, 99 Shangda Road, Shanghai 200444, China {zhu_nj,huang0615}@shu.edu.cn
² Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China [email protected]
³ School of Information Science and Engineering, Shandong Normal University, No.1 University Road, Ji'nan 250358, China
Abstract. Session-based recommender systems are increasingly applied to next-item recommendation. However, existing approaches encode the session information of each user independently and do not consider the interrelationships between users. This work is based on the intuition that dynamic groups of like-minded users exist over time, and users in the same group may share similar preferences. By considering the impact of latent user groups, we can learn a user's preference in a better way. To this end, we propose a recommendation model that learns user embeddings by modeling long and short-term dynamic latent user groups. It not only captures users' latent group information, but also perceives how user grouping changes over time. Specifically, we utilize two network units to learn users' long and short-term sessions, respectively. Meanwhile, we employ two additional units to detect which latent groups a user belongs to, followed by an aggregation of these latent group representations. Finally, user preference representations are shaped comprehensively by considering all four of these aspects, based on an attention mechanism. Extensive experiments show that our model outperforms multiple state-of-the-art methods in terms of Recall, mAP, and AUC metrics.
Keywords: Session-based recommender · User group modeling · Attention mechanism

1 Introduction
Session-based recommender systems (SBRSs) have been a hot research topic since they can not only model users' long-term preferences, but also highlight
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 18–32, 2023. https://doi.org/10.1007/978-981-99-2385-4_2
Fig. 1. An example of two different approaches to modeling the influence of grouping: (1) our dynamic grouping mechanism, and (2) a conventional static grouping mechanism. The central user belongs to two latent groups, represented by the orange and green frames. In (2), the groups influence the user equally, and thus the visited items denoted by the orange cuboid and the green cylinder should be recommended with no difference. In (1), the two groups are treated differently, since the central user has switched from the green group to the orange one. The up-to-date group may have a larger impact on him, and thus items represented by the orange cuboid are more in line with his taste. (Color figure online)
short-term demands [1–3]. A session in SBRSs specifies a scope of encapsulation of items, such as a set of products in a shopping cart, a set of viewed websites within a time window, and so forth. Different sessions reflect users’ diverse preferences and requirements because users’ interests keep changing in various periods [4]. Conventional approaches such as [5,6] treat all sessions equally and overlook the heterogeneity between different sessions, which degrades the performance of next-item recommendations. On the contrary, some SBRSs such as [7–9] distinguish the contribution of each session to depict users’ current interests. These systems have demonstrated a decent improvement of recommendation performances compared to conventional approaches. Most existing SBRSs assume a user’s current preference is associated differently with long and short-term sessions. For instance, to learn more complete representations of users, SHAN [9] adopts a hierarchical structure to fuse the pooling results of long and short-term sessions. The fusion weights are differentiated based on an attention network. Similarly, KA-MemNN [7,10] encodes each session from two perspectives: users’ intentions and preferences, followed by a more precisely weighted combination of long and short-term session representations. Besides, AttRec [8] and PLASTIC [3] exploit two different prototypes to learn user preferences based on long-term sessions (e.g., using matrix
factorization (MF) [11]) and short-term sessions (e.g., using a recurrent neural network (RNN) [12]), respectively. In all these approaches, user representations are summarized from each user's sessions independently, so the learned models are built on a per-user basis. There is no explicit information sharing between the users' models. However, in real-world applications, groups of like-minded users exist in different contexts. Users in the same group usually share similar preferences and thus may behave similarly. Unfortunately, group information is often neglected in existing SBRSs. If the data of all related items and users is treated indiscriminately, the result is a global model. Instead, more emphasis can be put on some of the related items or users to make the model more targeted, which yields a local model. For example, in [8,13], to capture users' more specific preferences, user representations are learned from currently visited items. At the same time, many local non-session-aware recommender systems (NSRSs) have been widely explored. For instance, based on truncated SVD (singular value decomposition), the work in [14] learns a global model for a shared aspect set as well as a set of user-subset-specific models. CMN [5] takes users' neighbors as the values in memory network banks, and the values are further accumulated to model users' preferences. Although NSRSs can improve recommendation performance by considering stable local influences, they still fail to capture users' dynamic preferences and evolving latent groups, and thus are not effective for next-item recommendation. Figure 1 further exhibits the difference between local NSRSs and local SBRSs. To this end, in this paper, we aim at proposing a local SBRS. However, due to users' complicated behavior patterns, local SBRSs face several challenges: (1) Instead of being assigned a static group as in conventional local approaches, in practice, a user might belong to multiple groups.
For example, a user can be a cartoon fan and a tech fan at the same time. In such a case, when he purchases a comic book, the taste of cartoon fans has a more significant impact on him than that of tech fans, and vice versa. (2) The interests of each latent user group can be updated over time. For example, a trending item can disturb the widespread tendency inside the group. (3) Users can switch their latent groups for multiple reasons, such as the evolution of their preferences and requirements. Capturing the dynamics of each latent user group and the evolving grouping is not trivial. To address the problems above, we design a next-item recommendation system (RS), named LSUG, based on modeling long and short-term latent user groups. Specifically, we employ a hierarchical neural network to build an end-to-end representation learning mechanism. We first split the sessions into long and short-term sessions, and then embed each item in the sessions into a dense representation. Based on the item embeddings, we abstract critical information to form long and short-term session representations via a pooling layer. These representations reflect users' preferences and requirements. By analyzing the relations between them and the latent user group embeddings, we can assign the target user to multiple user groups with different probabilities. Groups with
higher probability have a greater impact on user preferences. Considering that users’ interests may be updated over time, as well as their long and short-term preferences, the probabilities of users in groups are dynamic. Then, we can summarize the influences from different user groups by a weighted combination of group embeddings. Finally, the long and short-term session representations and the user group influences are further aggregated to more comprehensive user representations based on an attention model. These representations replace user latent vectors in a pairwise model, i.e., BPR [15], to estimate the probability of an item to be the next visited one. The experimental results show the superiority of our model over multiple state-of-the-art approaches in terms of Recall, AUC, and mAP metrics.
2 Related Work
Traditional approaches, e.g., collaborative filtering (CF), model the relations between users and items in a static way [16]. They neglect the sequential dependencies inside the user-item interactions. To tackle such a problem, Markov chain-based (MC-based) methods such as [17] are developed. The work [18] incorporates hidden Markov models into matrix factorization to deal with temporal dynamics in recommender systems. However, MC-based models only consider the first-order dependency, i.e., it predicts the transition between a pair of items instead of that between an item and a contextual item set. Thus they don’t fit the next item recommendation well. Neural networks (NNs) are widely explored and applied in recent years thanks to their abilities to handle highly complex users’ behaviors. Different from MCbased methods, RNN-based technologies like HRNN [19] can model higher-order sequential dependencies while avoiding exponential growth of parameters existing in higher-order MCs [17]. However, RNN models suppose that items in a session follow a rigid order, which doesn’t match the real-world session-based settings as a user might buy or look through these items randomly in a short time. Attention Mechanism assigns different weights to each part of the input and extracts more critical information. It makes more accurate judgments. MCRec [20] uses a deep neural network model with the co-attention mechanism to learn interaction-specific representations for users, items and meta-path context for top-N recommendation in HIN(heterogeneous information network). CoCoRec [21] leverages category information to capture the context-aware action dependence and uses a self-attention network to capture item-to-item transition patterns within each category-specific subsequence. The effectiveness of graph neural networks (GNNs) has been reported [22,23] in recommendation domains. MixGCF [24] utilizes the underlying GNN-based recommenders to synthesize negative samples. 
GHCF [25] uses a GCN (graph convolutional network) to explore high-hop user-item interactions. MB-GMN [26] is an integrative neural architecture with a meta-knowledge learner and a meta graph neural network to capture personalized multi-behavior characteristics. These models are usually trained in a pairwise manner, i.e., one item has priority over the
other. But in reality, items in the same session cannot always have such partial-order relations. Thus, these approaches might lead to false dependencies. Besides, RNN- and MC-based approaches are apt to forget long-term information and are biased toward recently visited items, owing to their structures. Recently, many SBRSs rely on NNs to model users' long and short-term interests. They pursue this task from two perspectives: (1) using an attention mechanism to learn explicit session-specific weights [7,9,27]; (2) exploiting different prototypes to model long and short-term sessions respectively [2,3,8]. Both of these techniques have proven very successful in SBRSs. Our work follows the first pipeline. However, these SBRSs treat each user independently and learn from users' personal behaviors to make recommendations, so there is no explicit information sharing between similar users. Therefore, user and item representations are learned from a global view. In reality, there might be strong local associations among users and items, which has been validated in many local NSRSs such as CMN [5] and r(s)GLSVD [14]. Unfortunately, NSRSs do not take the temporal order of user behaviors into account, which limits their performance. To this end, we propose a local SBRS to combine the advantages of SBRSs and local fusion settings.
3 LSUG Model

3.1 Problem Formulation
In a recommender system (RS), we have users u and items v from a user set U = {u1, u2, ..., u|U|} and an item set V = {v1, v2, ..., v|V|}, respectively. Let s = {v1, v2, ..., v|s|} ⊂ V be the set of items clicked by a target user within a session, i.e., a time interval Δt. Over the history of a user's behaviors, we have a session sequence denoted by S_t^u = {s_1^u, s_2^u, ..., s_t^u} for each user, where t indicates the index of sessions ordered by timestamps. Formally, given a user u and his session sequence S_t^u, we aim to build a model that predicts the next items that have high probabilities of belonging to the current session s_t, taking into consideration user u's long and short-term sessions, as well as the influences of his long and short-term latent groups.

3.2 Overview
Our model (the framework based on Long and Short-term latent User Group modeling, LSUG), shown in Fig. 2, is a hierarchical end-to-end framework. It splits user behaviors into long and short-term ones. For each part, we aggregate item embeddings to form a user preference representation. Then, based on the long and short-term representations, we calculate the probability distribution over the groups that a user might belong to, followed by an aggregation of group features to capture the impact of a user's neighbors as well as the differences in preference between subsets of like-minded users. Finally, we construct a hybrid user representation from the user's long and short-term session representations and the group influences. Next, we introduce each part of the model.
Fig. 2. The framework of LSUG.
3.3 General Embedding Construction
We use two matrices U ∈ R^{N×K} and V ∈ R^{M×K} with fully-connected NNs to transform the one-hot encodings of users and items into dense vectors, where N = |U| (resp. M = |V|) denotes the number of users (resp. items) and K is the latent dimensionality. Let u ∈ R^K and v ∈ R^K denote the embedding vectors of user u and item v, respectively. They capture static features, since they do not change over time. Inspired by key-value memory networks (KV-MemNN) [28], to determine the influence of groups on each user, we assume L user groups with L latent anchor representations, denoted by G^v ∈ R^{L×K}. G^v describes the preferences of the latent user groups. At the same time, the latent groups have an additional representation matrix, denoted by G^k ∈ R^{L×K}, which decides their relations with users. G^k and G^v are analogous to the key and value elements in KV-MemNN, respectively. In this way, we can assign a user to multiple groups and accumulate the influences of user grouping for him.

3.4 Context-Aware Input Embedding
We split the sessions of a user into two parts: his current session s_t^u = {v1, v2, ..., v_{|s_t^u|}}, which is used to construct context-aware input embeddings capturing his short-term demands, and s_{1:t−1}^u = {v1, v2, ..., v_{|s_{1:t−1}^u|}}, which captures his long-term preference. Each session has a related embedding matrix, computed from the representations of the items in it. We feed these matrices to an aggregation function to learn the semantic input embeddings x_s, x_l ∈ R^K as follows:

    x_s^u = pooling(E_t^u),    x_l^u = pooling(E_{1:t−1}^u)    (1)

We explore three different pooling methods to aggregate item features, i.e., mean, max, and attention pooling functions.
24
N. Zhu et al.
– mean pooling: averages the values at each dimension of the features; each item contributes equally to the final result.
– max pooling: takes the maximum value in each dimension and captures the item set's features in an extreme way.
– attention pooling: a weighted average pooling, where we calculate the weights from the relations between the general representation of the target user u and each item v:

    w_{u,v} = exp(u⊤v) / Σ_{v_i ∈ s_t^u} exp(u⊤v_i),    x_t^u = Σ_{v ∈ s_t^u} w_{u,v} v    (2)
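The three pooling functions above can be sketched in a few lines of numpy (names and shapes are ours; the paper gives no reference implementation):

```python
import numpy as np

def mean_pooling(E):
    """E: (|s|, K) matrix of item embeddings in one session."""
    return E.mean(axis=0)

def max_pooling(E):
    return E.max(axis=0)

def attention_pooling(E, u):
    """Weighted average with weights derived from u^T v, as in Eq. (2)."""
    logits = E @ u
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    return w @ E

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))   # a session with 4 items, K = 8
u = rng.normal(size=8)        # general embedding of the target user
print(attention_pooling(E, u).shape)  # (8,)
```

Note that with a zero user embedding the attention weights become uniform and attention pooling degenerates to mean pooling, which is a handy sanity check.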
The input embedding x_t^u, where t ∈ {s, l}, reflects user u's status at different timestamps, such as his purchase requirements and related groups. Thus, it is reasonable to judge the relations between the groups, i.e., the anchor points, and user u according to this input embedding x_t^u.

3.5 Latent User Group Influence Modeling
We first calculate the similarity between the input embeddings, i.e., x_s^u and x_l^u, and the key embeddings of the latent groups G^k, to assess the user's current probability distributions b_s^u and b_l^u, which decide which latent groups the user belongs to:

    b_s^u = softmax(G^k x_s^u),    b_l^u = softmax(G^k x_l^u)    (3)

where b_s^u, b_l^u ∈ R^L, and the softmax(·) function converts the vector G^k x^u into a pseudo probability distribution vector. Then, we aggregate the group features, i.e., G^v, according to these distributions to construct aggregated group feature vectors:

    g_l^u = G^v⊤ b_l^u,    g_s^u = G^v⊤ b_s^u    (4)
where g_l^u ∈ R^K and g_s^u ∈ R^K represent the long and short-term latent group influences on user u, respectively.

3.6 Hybrid User Representation Modeling
This part yields a hybrid user representation from four aspects: two personal preference representations, i.e., users’ long and short-term context-aware input session embeddings, and two group influence representations, i.e., the impacts from users’ current groups and historical groups. To combine these four components in a dynamic way, we investigate two approaches to fuse them. For simplicity, we denote {xus , xul , gsu , glu } as F .
LSUG
25
– MLP hybrid: We use a multi-layer perceptron (MLP) to map each feature vector to a scalar, followed by a softmax layer converting the scalars to a weight for each component:

    w_f = exp(MLP(f)) / Σ_{f′∈F} exp(MLP(f′)),    h_u = Σ_{f∈F} w_f f    (5)

– attention hybrid: We calculate the weights according to the relations between the components and the embedding of the target user u:

    w_f = exp(u⊤f) / Σ_{f′∈F} exp(u⊤f′),    h_u = Σ_{f∈F} w_f f    (6)
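Equations (3)–(6) chain together into the hybrid representation. The numpy sketch below uses randomly initialized matrices in place of learned parameters and shows the attention-hybrid variant (all names are ours, and this is an illustration, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

K, L = 8, 5                    # latent dimension and number of groups
rng = np.random.default_rng(1)
Gk = rng.normal(size=(L, K))   # key embeddings of the latent groups
Gv = rng.normal(size=(L, K))   # value (preference) embeddings of the groups
u = rng.normal(size=K)         # general embedding of the target user
x_s = rng.normal(size=K)       # short-term session embedding (from pooling)
x_l = rng.normal(size=K)       # long-term session embedding

# Eq. (3): pseudo probability distributions over latent groups.
b_s, b_l = softmax(Gk @ x_s), softmax(Gk @ x_l)
# Eq. (4): aggregated group influences.
g_s, g_l = Gv.T @ b_s, Gv.T @ b_l

# Eq. (6), attention hybrid: weight the four components by their fit to u.
F = np.stack([x_s, x_l, g_s, g_l])
w = softmax(F @ u)
h_u = w @ F                    # hybrid user representation, shape (K,)
print(h_u.shape)               # (8,)
```

Swapping `softmax(F @ u)` for a small learned MLP over each row of `F` would give the MLP-hybrid variant of Eq. (5) instead.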
3.7 Model Learning
The overall training procedure is shown in Algorithm 1. After the final user representation is learned, we compute the inner product of the user representation and an item embedding as their similarity, i.e., the user's preference for the item:

    R̂_{u,v} = h_u⊤ v    (7)
We utilize the ranking and pairwise loss function proposed in [15] to train the model. For positive sampling, we randomly pick an item from user u's current session; for negative sampling, we choose an item that user u has never bought or visited before. We denote the positive and negative items by v+ and v−, respectively. Then, we calculate the final loss as follows:

    arg min_Θ Σ_{(u, S_t^u, v+, v−) ∈ D} −ln σ(R̂_{u,v+} − R̂_{u,v−})    (8)

where D is the training set containing all samples, and σ(x) = 1/(1 + e^{−x}) is the sigmoid function.
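The loss in Eq. (8) for a single training sample can be sketched as:

```python
import math

def bpr_loss(score_pos, score_neg):
    """-ln sigma(R_{u,v+} - R_{u,v-}) for one (v+, v-) pair, as in Eq. (8)."""
    margin = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the positive item is scored above the negative one.
print(round(bpr_loss(2.0, 0.0), 4))  # 0.1269
print(round(bpr_loss(0.0, 2.0), 4))  # 2.1269
```

Only the difference between the two scores matters, which is what makes the objective a ranking loss rather than a pointwise regression.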
4 Experiments

4.1 Experimental Setup
Datasets. We use the Tmall dataset [29], which contains users' purchase behaviors in the Tmall online shop, and the Gowalla dataset [30], which collects users' check-in behaviors. Following the settings in [9], we keep the last seven months of data and the items that have been observed by at least 20 users. We aggregate the items purchased in one day by the same user into a session and remove sessions that contain only a single item. We randomly pick 20% of the users for testing and randomly select an item in their last session as the target item to be predicted. The statistics of the datasets are shown in Table 1.

Algorithm 1: Training process
input : embedding dimension K, number of groups L, session data S, initial learning rate η
output: trained model with parameters Θ
do initialization;
shuffle the session data S;
while not converged do
    for batch in S do
        randomly select t for each session sequence and split the sessions into long- and short-term parts;
        do positive sampling in the last session;
        do negative sampling in unvisited items;
        compute the loss according to Eq. (8);
        do backpropagation and update the parameters Θ;
    end
end

Table 1. Statistics of datasets

Dataset                   Tmall    Gowalla
#user                     20202    15076
#item                     24774    12419
avg. session length       2.72     2.95
#train session            70895    128374
#test session             4040     3015
user-item matrix density  0.039%   0.15%
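The pairwise update at the heart of Algorithm 1 and Eq. (8) can be sketched in a few lines of numpy. This is a simplified sketch under our own assumptions: the hybrid user representation h_u is stood in for by a plain user vector, and the function name is ours.

```python
import numpy as np

def bpr_step(u, v_pos, v_neg, lr=0.03):
    """One gradient step on -ln sigma(R_{u,v+} - R_{u,v-}) (Eq. 8).

    u, v_pos, v_neg: 1-D embeddings of the user, the sampled positive
    item, and the sampled negative item. Returns updated copies and the
    loss at the input parameters.
    """
    x = u @ v_pos - u @ v_neg                # score margin
    g = 1.0 / (1.0 + np.exp(x))              # = 1 - sigma(x)
    loss = np.log1p(np.exp(-x))              # -ln sigma(x)
    u_new = u + lr * g * (v_pos - v_neg)     # descend the negative gradient
    vp_new = v_pos + lr * g * u
    vn_new = v_neg - lr * g * u
    return u_new, vp_new, vn_new, loss
```

Repeating this step over shuffled session batches, with the positive item drawn from the last session and the negative item from unvisited ones, mirrors the loop of Algorithm 1.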
Baselines. We compare our model with the following baselines, covering SBRSs, NSRSs, and local NSRSs. 1) BPR [15] is a classic NSRS that learns to rank from users' feedback data via pairwise optimization. 2) GRU4Rec-bpr and 3) GRU4Rec-ce [12] are strong algorithms that use a GRU (gated recurrent unit) to model sequential data; the former uses BPR as the ranking loss function, while the latter uses cross-entropy. 4) CMN [5] is a local NSRS that takes users' neighbors as the values in a memory bank. 5) SHAN [9] is a state-of-the-art SBRS, which also utilizes a hierarchical neural network.
Metrics. We use Recall, AUC, and mAP to evaluate the models. Recall measures how much of the ground truth the prediction covers. AUC evaluates how highly positive examples are ranked over negative examples. mAP evaluates the positions of the actually visited items in the predicted list.

Parameter Settings. We set K to 150 and L to 512 for both datasets. The initial learning rate is set to 0.03 with a 0.8 decay rate every eight steps. We train the model until convergence. In the final models, we choose {pooling layer: mean pooling} and {hybrid layer: MLP hybrid} for the Tmall dataset, and {pooling layer: attention pooling} and {hybrid layer: attention hybrid} for the Gowalla dataset, since these perform best in our experiments.
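The three evaluation metrics can be computed as sketched below. These are our own minimal definitions (the paper does not give formulas): Recall@k over a ranked list, per-query AUC for a single positive item, and average precision over the ranks of the relevant items.

```python
def recall_at_k(ranked, relevant, k):
    """Share of the ground-truth items that appear in the top-k list."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def auc_single(scores, positive, negatives):
    """Fraction of negative items scored below the positive one."""
    return sum(scores[positive] > scores[n] for n in negatives) / len(negatives)

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

mAP is then the mean of `average_precision` over all test users.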
Fig. 3. Performance comparison
4.2 Comparison of Performance
Fig. 3 shows the performance of LSUG and the baselines on both the Tmall and Gowalla datasets under all metrics. From the figure, we can observe the following: 1) Our LSUG outperforms all the baselines by a large margin, including a latent-factor CF model (BPR), a local NSRS (CMN), two sequential models (GRU4Rec-bpr and GRU4Rec-ce), and a hierarchical SBRS (SHAN), especially on the Tmall dataset. For example, although SHAN outperforms the other baselines, LSUG improves on it by 16.9% at Recall@20 (15.9% vs. 13.6%) and by 4.16% at Recall@100 (22.5% vs. 21.6%). Since both LSUG and SHAN split the sessions into long- and short-term ones and both are hierarchical models, the performance gain likely comes from the group influence components (GICs), which indicates that the GICs help model user preferences and make better recommendations.
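As a quick arithmetic sanity check, the quoted relative gain follows directly from the rounded Recall@20 figures:

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Recall@20 on Tmall: LSUG 15.9% vs. SHAN 13.6% -> roughly a 16.9% relative gain
gain = relative_improvement(0.159, 0.136)
```

(The Recall@100 figure is computed the same way; with the rounded values it comes out near the reported 4.16%.)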
2) Both GRU4Rec-bpr and GRU4Rec-ce perform well, likely because they successfully capture sequential patterns, i.e., the dependency relations between items. Moreover, GRU4Rec-ce outperforms GRU4Rec-bpr under the Recall@N and AUC metrics. The reason might be that its softmax layer computes the probability of the positive item over all negative ones, while the BPR loss only uses the sampled item pairs. 3) Although CMN and BPR are both NSRSs, CMN outperforms BPR, possibly because CMN collects user neighbors' preferences for the current user. This further shows that local information helps model user preferences.
Table 2. Influence of pooling functions and hybrid methods

                 Tmall                 Gowalla
pooling-hybrid   recall@20   mAP       recall@20   AUC
SHAN             0.136       0.037     0.424       0.956
max-attn         0.146       0.045     0.412       0.957
max-MLP          0.136       0.039     0.399       0.944
mean-attn        0.153       0.047     0.429       0.961
mean-MLP         0.159       0.042     0.424       0.957
attn-attn        0.141       0.041     0.433       0.962
attn-MLP         0.155       0.042     0.400       0.954

4.3 Influence of Components
Influence of Latent User Groups. To further investigate the effectiveness of user groups, we remove the group features from the model and only combine x_s^u and x_l^u. The results are shown in Fig. 5, in which LSUG-d denotes LSUG with the group features deleted. We can see that without the group features the performance of the model becomes worse, which shows that the group features are important.

Influence of Pooling and Hybrid Methods. To show the influence of aggregation methods, we report the performance of all combinations of {pooling layer: mean pooling, max pooling, attention pooling} × {hybrid layer: attention hybrid, MLP hybrid}. As shown in Table 2, for combining item features (the pooling layer in our model), mean pooling is better than max pooling under all experimental settings, e.g., mean-MLP vs. max-MLP. The reason might be that mean pooling takes all item features into consideration and passes the information to the downstream network, while max pooling only picks the most extreme features. Attention pooling sometimes obtains worse results than mean pooling. A possible explanation is that the target item is sometimes not similar to a user's general preference representation u, and thus the model pays attention to the wrong items. For the hybrid methods (the hybrid layer in our model), attention and MLP obtain comparable results, and we cannot conclude which one is better. However, most combinations steadily achieve better results than SHAN on both datasets, which shows the power of user group modeling. On the Tmall dataset, attn-MLP outperforms attn-attn by a large margin, while on the Gowalla dataset the observation is the opposite. We note that the Gowalla dataset records users' check-in data, and a user can visit one place repeatedly, while on the Tmall dataset a user re-purchases already-bought items with much lower frequency. Under such circumstances, user embeddings on the Gowalla dataset might be more similar to the frequently visited items, and these items also occur often in the test set. As a result, on the Gowalla dataset the attention mechanism takes general user preferences into account, leading to better performance. On the contrary, MLP only considers the current features and thus performs worse on the Gowalla dataset but better on the Tmall dataset. We randomly sample several users from both datasets and visualize their weights of long- and short-term personal preferences (LP & SP) and long- and short-term group influences (LG & SG), as shown in Fig. 4. We can observe that the weights are customized for different users.
Fig. 4. Weights visualization
4.4 Influence of Hyper-Parameters
We study the influence of the latent group size L in our model. The value of L varies from 100 to 1000, and we only report Recall@20 due to limited space. As shown in Fig. 6, as the group size grows, the metric first increases gradually on both datasets, indicating that a small size is not enough to cover all potential latent user groups. However, an overly large size also decreases the performance because it may cause overfitting. Overall, a proper group size, e.g., L = 500 on the Tmall dataset and L = 600 on the Gowalla dataset, should be tuned to obtain good results.
Fig. 5. The performance of LSUG and LSUG-d
Fig. 6. The influence of group size L

5 Conclusion
In this paper, we proposed a next-item recommendation model based on learning long- and short-term user groups. Specifically, we split user behaviors into long- and short-term sessions and abstract a representation for each session from its items. The designed GICs then detect users' latent long- and short-term groups and incorporate the influences from the different latent groups to form the final user representations. Experiments conducted on two real-world datasets demonstrate that our model outperforms several state-of-the-art models in terms of multiple metrics.

Acknowledgments. This work is supported by the Shanghai Youth Science and Technology Talents Sailing Program (No. 22YF1413700). Thanks to Runtong Li and Xingjing Lu for their valuable advice and help.
References

1. Yu, L., Zhang, C., Liang, S., Zhang, X.: In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5709–5716 (2019)
2. Guo, L., Yin, H., Wang, Q., Chen, T., Zhou, A., Hung, N.Q.V.: In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1569–1577. ACM (2019)
3. Zhao, W., Wang, B., Ye, J., Gao, Y., Yang, M., Chen, X.: In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3676–3682 (2018)
4. Wang, S., Hu, L., Wang, Y., Sheng, Q.Z., Orgun, M., Cao, L.: In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1–7. AAAI Press (2019)
5. Ebesu, T., Shen, B., Fang, Y.: In: Proceedings of the International SIGIR Conference on Research & Development in Information Retrieval, pp. 515–524. ACM (2018)
6. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: In: Proceedings of the International Conference on World Wide Web (WWW), pp. 173–182. International World Wide Web Conferences Steering Committee (2017)
7. Zhu, N., Cao, J., Liu, Y., Yang, Y., Ying, H., Xiong, H.: In: Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), pp. 807–815. ACM (2020)
8. Zhang, S., Tay, Y., Yao, L., Sun, A., An, J.: In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 9 (2019)
9. Ying, H., et al.: In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2018)
10. Zhu, N., Cao, J., Lu, X., Xiong, H.: ACM Trans. Inf. Syst. (TOIS) 40(2), 1 (2021)
11. Koren, Y., Bell, R., Volinsky, C.: Computer (8), 30 (2009)
12. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: In: 4th International Conference on Learning Representations, ICLR (2016)
13. Wu, S., Tang, Y., Zhu, Y., Wang, L., Xie, X., Tan, T.: In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 346–353 (2019)
14. Christakopoulou, E., Karypis, G.: In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1235–1243. ACM (2018)
15. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: CoRR abs/1205.2618 (2012)
16. Li, W., et al.: IEEE Access 7, 45451 (2019)
17. He, R., McAuley, J.: In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 191–200. IEEE (2016)
18. Zhang, R., Mao, Y.: IEEE Access 7, 13189 (2019)
19. Quadrana, M., Karatzoglou, A., Hidasi, B., Cremonesi, P.: In: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, pp. 130–137. Association for Computing Machinery, New York, NY, USA (2017)
20. Hu, B., Shi, C., Zhao, W.: In: The 24th ACM SIGKDD International Conference (2018)
21. Cai, R., Wu, J., San, A., Wang, C., Wang, H.: In: SIGIR 2021: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021)
22. Ge, S., Wu, C., Wu, F., Qi, T., Huang, Y.: In: Proceedings of The Web Conference 2020, pp. 2863–2869. Association for Computing Machinery, New York, NY, USA (2020)
23. Zheng, J., Ma, Q., Gu, H., Zheng, Z.: In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD 2021, pp. 2338–2348. Association for Computing Machinery, New York, NY, USA (2021)
24. Huang, T., et al.: In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD 2021, pp. 665–674. Association for Computing Machinery, New York, NY, USA (2021)
25. Chen, C., et al.: In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3958–3966 (2021)
26. Xia, L., Xu, Y., Huang, C., Dai, P., Bo, L.: In: SIGIR 2021: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021)
27. Yuan, J., Song, Z., Sun, M., Wang, X., Zhao, W.X.: In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4635–4643 (2021)
28. Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1400–1409 (2016)
29. Hu, L., Cao, L., Wang, S., Xu, G., Cao, J., Gu, Z.: In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1858–1864 (2017)
30. Cho, E., Myers, S.A., Leskovec, J.: In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1082–1090. ACM (2011)
Context-Aware Quaternion Embedding for Knowledge Graph Completion Jingbin Wang, Xinyi Yang, Xifan Ke, Renfei Wu, and Kun Guo(B) College of Computer and Data Science, Fuzhou University, Fujian 350108, China [email protected]
Abstract. In this paper, we study the learned representations of entities and relations in the link prediction task on knowledge graphs. A knowledge graph is a collection of factual triples, but most knowledge graphs are incomplete. Some current models use complex-valued rotations to model triples and obtain strong results. However, such models generally use specific structures to learn the representations of entities or relations and do not make full use of the context information of entities and relations. In addition, to achieve high performance, these models often need larger embedding dimensions and more epochs, which incurs large time and space costs. To systematically tackle these problems, we develop a novel knowledge graph embedding method named CAQuatE. We propose two concepts to select valuable context information, design a context information encoder to enhance the original embeddings, and finally use quaternion multiplication to model triples. Experimental results on two common benchmark datasets show that CAQuatE can significantly outperform existing state-of-the-art models on the knowledge graph completion task while obtaining lower-dimensional representation vectors with fewer epochs and no additional parameters.
Keywords: knowledge graph · knowledge graph completion · link prediction · quaternion

1 Introduction
Knowledge graphs are the core of many semantic applications, such as question answering, search, and natural language processing. A knowledge graph is usually represented as a multi-relational directed graph consisting of entities as nodes and relations as edges. The real-world facts stored in a knowledge graph are modeled as triples (head entity, relation, tail entity), expressed as (h, r, t), such as (USA, capital, Washington D.C.). However, knowledge in the real world grows without bound, and no knowledge graph can contain all of it, so real-world knowledge graphs are usually incomplete. Therefore, more and more attention has been paid to predicting the missing links between entities.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 33–47, 2023. https://doi.org/10.1007/978-981-99-2385-4_3
34
J. Wang et al.
This task is called knowledge graph completion (KGC). So far, a series of KGC models have been proposed. These models learn low-dimensional representations of the entities and relations in the knowledge graph (called embeddings), optimize a scoring function so that valid triples receive higher scores than invalid ones, and then use these embeddings to predict new facts. At present, effective KGC models can be roughly divided into two general categories. One uses complex multiplication to model a relation as a rotation operation; this category includes RotatE, QuatE, and DualE. As stated in the representative work RotatE, rotation can model all three basic relation patterns, namely symmetry/antisymmetry, inversion, and composition. The other category attempts to apply graph neural networks (GNNs) to KGC. These GNN-based KGC models can effectively aggregate information from multi-hop neighbors to enrich entity/relation representations. We therefore consider a question: is there a way to combine the advantages of these two types of models? In this paper, we propose a new method called Context-Aware Quaternion Embedding for Knowledge Graph Completion (CAQuatE) to address the shortcomings of existing models. First, CAQuatE samples context information by calculating the similarity between two entities and the correlation between an entity and a relation, and constructs a context star graph to model entity/relation contexts. Second, the original embeddings pass through a lightweight context information encoder that outputs enhanced embeddings. Finally, the model defines each relation as a rotation from the source entity to the target entity in quaternion space. CAQuatE considers both entity context and relation context and exploits the asymmetry of quaternion multiplication, which makes it fully aware of context information and able to model various relations.
To summarize, our contributions are as follows: (1) We propose two concepts, entity similarity and entity-relation correlation, to sample valuable context in a knowledge graph. (2) We propose an encoder that aggregates entity and relation context information on top of modeling the structural information of triples with quaternion rotations; it combines the strength of semantic-matching models in modeling complex relations with the strength of GNN-based models in using context information. (3) We carry out link prediction experiments on two public datasets. The experimental results show that our model obtains state-of-the-art performance with lower dimensions, fewer epochs, and no additional parameters.
2 Related Work
The current most advanced KGC methods can be roughly divided into two general categories according to how they deal with triples. In this section, we review the methods of each category in further detail. The first, which we call semantic-matching-based models here, uses a multiplicative score function to calculate the plausibility of a given triple and is therefore also called the multiplicative model. This idea was initiated by RESCAL [1],
which computes a three-way factorization of the third-order adjacency tensor representing the input knowledge graph to score triples. DistMult can be regarded as an efficient extension of RESCAL that restricts each relation to a diagonal matrix to reduce RESCAL's complexity, but it performs well only on symmetric relations and poorly on antisymmetric ones. ComplEx [2] extended the embedding space to the complex space and extended DistMult by learning representations in a complex vector space; it can infer symmetric and antisymmetric relations through a Hermitian inner product of the embeddings, which involves the conjugate transpose of one of the two input vectors. RotatE [3] then proposed to formulate relations as rotations in a complex space with a single rotating plane, expressing a relation as a rotation from the head entity to the tail entity. Notably, RotatE is the first model to unify the symmetry/antisymmetry, inversion, and composition patterns in KGE, which shows that rotation in complex space has strong potential to enhance general knowledge representation. QuatE [4] expanded the complex space to a quaternion space with two rotating planes and applied quaternion multiplication and the inner product to score triples. A related line of work uses multi-vector representations and geometric products to model entities and relations. Recently, DualE [5] introduced dual quaternions into knowledge graph embeddings, modeling relations as combinations of translation and rotation operations. The latest research, DensE, decomposes each relation into a rotation operator and a scaling operator based on the SO(3) group. Although this category of models can capture all three basic relation patterns, namely symmetric/antisymmetric, inverse, and composite relations, they ignore the structural and neighbor information in the context of KGs.
The other category is graph neural network (GNN)-based models, which use GNNs to capture the structural features of KGs. These models first aggregate context into entity/relation embeddings through a GNN and then pass the context-aware embeddings to a context-independent scoring function. A2N [6] distinguishes the weights of adjacent nodes using a method similar to the graph attention network. KBAT extends the graph attention mechanism to capture entity and relation features in the multi-hop neighborhood of a given entity. SACN uses a variant of the graph convolutional network as the encoder and a variant of ConvE as the decoder. RGHAT is equipped with relation-level and entity-level attention to compute the weights of adjacent relations and entities, respectively; the convolution operation is applied to each adjacent entity with equal weight. CompGCN uses entity-relation composition operations such as TransE [7] to update entity embeddings and proposes a variety of neighborhood aggregation and composition operations for the structural patterns of multi-relational graphs. SE-GNN discusses the factors influencing extrapolation from the relation, entity, and triple levels, respectively, and proposes three kinds of semantic evidence, which are fully combined through a multi-layer neural aggregation mechanism to obtain a more extrapolative knowledge representation. The disadvantage of this kind of model is that its geometric meaning is unclear and it cannot model multivariate relations well.
3 Preliminaries

3.1 Definition
A knowledge graph is denoted as a set of triples G = {(h, r, t) | (h, r, t) ∈ E × R × E} consisting of N entities and M relations, where E and R represent the sets of entities and relations, h, t ∈ E are the head and tail entities, and r ∈ R is the relation between the two entities. Given a triple (h, r, t), the corresponding embeddings are Q_h, Q_r, Q_t, where Q_h, Q_t ∈ H^{N×d}, Q_r ∈ H^{M×d}, d is the embedding dimension, and H is the set of quaternions.

3.2 Quaternion Background
Quaternions [Hamilton, 1844] form a simple hypercomplex number system. A complex number is composed of a real part and the imaginary unit i, where i² = −1. Similarly, a quaternion Q is composed of a single real component and three imaginary components i, j, k, satisfying i² = j² = k² = ijk = −1. A quaternion has the form Q = a + bi + cj + dk, where a, b, c, d are real numbers. Multiplication between two quaternions is non-commutative. The quaternion operations used in this paper are as follows:

Quaternion Norm:
|Q| = √(a² + b² + c² + d²)    (1)
Quaternion Addition: The addition of Q1 = a1 + b1 i + c1 j + d1 k and Q2 = a2 + b2 i + c2 j + d2 k is defined as:
Q1 + Q2 = (a1 + a2) + (b1 + b2) i + (c1 + c2) j + (d1 + d2) k    (2)
Inner Product: The inner product of two quaternions Q1 = a1 + b1 i + c1 j + d1 k and Q2 = a2 + b2 i + c2 j + d2 k is defined as:
Q1 · Q2 = ⟨a1, a2⟩ + ⟨b1, b2⟩ + ⟨c1, c2⟩ + ⟨d1, d2⟩    (3)
Hamilton Quaternion Product: The Hamilton quaternion product is defined as: Q1 ⊗ Q2 = (a1 a2 − b1 b2 − c1 c2 − d1 d2 ) + (a1 b2 + b1 a2 + c1 d2 − d1 c2 )i +(a1 c2 − b1 d2 + c1 a2 + d1 b2 )j + (a1 d2 + b1 c2 − c1 b2 + d1 a2 )k
(4)
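The four operations of Eqs. (1)-(4) can be sketched in a small, self-contained class (the class and method names are ours):

```python
import math

class Quat:
    """Plain quaternion a + b*i + c*j + d*k."""
    def __init__(self, a, b, c, d):
        self.a, self.b, self.c, self.d = a, b, c, d

    def norm(self):                                   # Eq. (1)
        return math.sqrt(self.a**2 + self.b**2 + self.c**2 + self.d**2)

    def __add__(self, o):                             # Eq. (2)
        return Quat(self.a + o.a, self.b + o.b, self.c + o.c, self.d + o.d)

    def inner(self, o):                               # Eq. (3)
        return self.a*o.a + self.b*o.b + self.c*o.c + self.d*o.d

    def __matmul__(self, o):                          # Hamilton product, Eq. (4)
        a1, b1, c1, d1 = self.a, self.b, self.c, self.d
        a2, b2, c2, d2 = o.a, o.b, o.c, o.d
        return Quat(a1*a2 - b1*b2 - c1*c2 - d1*d2,
                    a1*b2 + b1*a2 + c1*d2 - d1*c2,
                    a1*c2 - b1*d2 + c1*a2 + d1*b2,
                    a1*d2 + b1*c2 - c1*b2 + d1*a2)
```

The non-commutativity is easy to see here: i ⊗ j = k while j ⊗ i = −k, which is exactly the asymmetry CAQuatE exploits to model directed relations.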
4 Methodology
In this section, we introduce the CAQuatE model proposed in this paper; its framework is shown in Fig. 1. It consists of three parts: (1) a context sampling module, which samples entity contexts and relation contexts; (2) a context information encoder module, which uses context information to enhance the embeddings of entities and relations; and (3) a quaternion rotation module, which models triples via quaternion rotations. First, to select the most valuable part of the many contexts of an entity/relation, we propose the concepts of entity similarity and entity-relation correlation and rank the context information by these two measures. Second, to make full use of the context information of entities and relations, the context information encoder module produces enhanced quaternion representations of entities and relations. Finally, a triple is modeled as the head entity being rotated by the relation to the tail entity, which yields the final score of the triple.
Fig. 1. The architecture of CAQuatE
Figure 1 shows the model architecture of CAQuatE. First, the context sampling module filters the key contexts according to entity similarity and entity-relation correlation. The knowledge graph then enters the context information encoder module, where all context information of each entity/relation is processed (the diamonds φent and φrel in the figure) and aggregated with certain weights (the dotted lines represent the weights α and β) to generate a quaternion embedding for each entity/relation; the aggregated representation enhances the original one. Finally, the quaternion rotation module normalizes the quaternion representation of the relation and defines a triple as a relational rotation from the head entity to the tail entity in quaternion space. Take the triple (a, r1, b) in the figure as an example. First, the whole knowledge graph is input into the context sampling module to obtain the sampled knowledge graph. Then the sampled knowledge graph is input into the context information encoder to obtain the context aggregation results a' and b', which are added to the original embeddings a and b in a certain proportion to get a* and b*; similarly, we obtain the enhanced embedding r1* after aggregating the context of relation r1. Note that every embedding here is a quaternion. Next, we multiply a* on the right by the normalized r1* and take the inner product with b* to get the score of the triple.

4.1 Context Sampling Module
Before formally training the model, we preprocess the data. The module first obtains the contexts of all entities in the training set (i.e., the corresponding relations and neighbor entities) and the contexts of all relations in the training set (i.e., the corresponding entity pairs). The definitions are as follows:

Definition 1. Entity Context: For each entity ei in the knowledge graph G, the entity context of ei is defined as Cei = {(rj, ek) | (ei, rj, ek) ∈ G ∨ (ek, rj, ei) ∈ G}, that is, all the (relation, neighbor entity) pairs connected to ei in G.

Definition 2. Relation Context: For each relation ri in the knowledge graph G, the relation context of ri is defined as Cri = {(ej, ek) | (ej, ri, ek) ∈ G}, that is, all the (head entity, tail entity) pairs in G whose relation is ri.

We limit the number of contexts of an entity to α. To select, from the many contexts, those that are helpful for model training, we propose the concept of entity similarity. We define the similarity of entities ei and ej as:

Sim(ei, ej) = |{(rj, ek) | (rj, ek) ∈ Cei ∧ (rj, ek) ∈ Cej}|    (5)
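A small sketch of Definition 1 and the similarity ranking of Eq. (5), under our own assumptions (the function names are ours, and Sim is computed as the size of the intersection of the two context sets):

```python
from collections import defaultdict

def entity_contexts(triples):
    """Build C_e for every entity: the (relation, neighbour) pairs of
    Definition 1, covering both directions of the disjunction."""
    ctx = defaultdict(set)
    for h, r, t in triples:
        ctx[h].add((r, t))
        ctx[t].add((r, h))
    return ctx

def sample_entity_context(e, ctx, alpha):
    """Keep the alpha context pairs whose entity is most similar to e,
    where Sim(e1, e2) counts shared context pairs (Eq. 5)."""
    def sim(e1, e2):
        return len(ctx[e1] & ctx[e2])
    ranked = sorted(ctx[e], key=lambda pair: sim(e, pair[1]), reverse=True)
    return ranked[:alpha]
```

The relation-context sampling of Eq. (6) follows the same pattern with the correlation measure in place of Sim.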
We rank all contexts according to the similarity between the neighbor entity and the original entity and select the top α entity contexts with the highest similarity. Similarly, we limit the number of contexts of a relation to β. To select the contexts that are helpful for model training, we propose the concept of entity-relation correlation. We define the correlation of entity ei and relation rj as:

Corr(ei, rj) = |{(ei, ek) | (ei, ek) ∈ Crj} ∩ {(rj, ek) | (rj, ek) ∈ Cei}|    (6)

We rank all contexts according to the correlation between their head/tail entities and the relation and select the top β relation contexts with the highest correlation. The module thus filters the context of entity e according to entity similarity and filters the context of relation r according to entity-relation correlation. The results of the two parts are then input into the context information encoder.

4.2 Context Information Encoder Module
We aggregate context information into the representations of entities and relations to enhance their embeddings.
The entity context information aggregation function is formulated as:

Q′ei = Σ_{(rj, ek)∈Cei} αi,j,k · φ(Qrj, Qek)    (7)

where ei is the current entity, the pairs (rj, ek) form the entity context of ei, Qei, Qrj, Qek are the quaternion embeddings of ei, rj, and ek, Q′ei is the aggregated information, and φ(a, b) is a function for aggregating context information. Three types are given in this paper:

φ1(a, b) = a − b,    φ2(a, b) = a · b,    φ3(a, b) = a ⊗ b    (8)

Note that both a and b here are quaternions; −, ·, ⊗ denote quaternion subtraction, quaternion inner product, and quaternion multiplication, respectively (see Sect. 3.2 for the formulas). αi,j,k is the weight representing the importance of each entity context to ei. For αi,j,k, we have four calculation methods:

α1 = 1
α2 = exp(Qei + Qrj − Qek) / Σ_{(rj, ek)∈Cei} exp(Qei + Qrj − Qek)
α3 = exp(Qei · Qrj · Qek) / Σ_{(rj, ek)∈Cei} exp(Qei · Qrj · Qek)
α4 = exp(Qei ⊗ Q′rj · Qek) / Σ_{(rj, ek)∈Cei} exp(Qei ⊗ Q′rj · Qek)    (9)

where Q′rj is the normalized quaternion representation of the embedding Qrj of relation rj.

The relation context information aggregation function is formulated as:

Q′ri = Σ_{(ej, ek)∈Cri} βi,j,k · φ(Qej, Qek)    (10)
where ri is the current relation, the pairs (ej, ek) form the relation context of ri, Qri, Qej, Qek are the quaternion embeddings of ri, ej, and ek, Q′ri is the aggregated information, and φ(a, b) is the aggregation function defined above. βi,j,k is the weight representing the importance of each relation context to ri. For βi,j,k, we have four calculation methods:

β1 = 1
β2 = exp(Qej + Qri − Qek) / Σ_{(ej, ek)∈Cri} exp(Qej + Qri − Qek)
β3 = exp(Qej · Qri · Qek) / Σ_{(ej, ek)∈Cri} exp(Qej · Qri · Qek)
β4 = exp(Qej ⊗ Q′ri · Qek) / Σ_{(ej, ek)∈Cri} exp(Qej ⊗ Q′ri · Qek)    (11)
where Q′ri is the normalized quaternion representation of the embedding Qri of relation ri.

The above can be regarded as one aggregation layer, which only captures neighbor information in the 1-hop context. To obtain multi-hop neighbor information, we introduce a multi-layer version of the context aggregation, in which the enhanced embeddings are used as the input of the next layer, and design an iterative aggregation mechanism. The formulas are as follows:

Qei^(l+1) = Qei^(l) + (1 − γe · l) Q′ei^(l)    (12)

Qri^(l+1) = Qri^(l) + (1 − γr · l) Q′ri^(l)    (13)

where 0 ≤ l ≤ L, L is the maximum number of aggregation layers, Qei^(l+1) is the embedding of ei after aggregating (l + 1)-hop neighbors, Qri^(l+1) is the embedding of ri after aggregating (l + 1)-hop neighbors, and γe and γr are decay factors.

4.3 Quaternion Rotation Module
The quaternion multiplication of a quaternion $Q_1$ by a quaternion $Q_2$ can be seen as performing two transformations at once: scaling and rotation. If $Q_2$ is restricted to a unit quaternion, the multiplication produces no scaling, only rotation. We represent the embeddings of entities and relations in quaternion space. Given a triple $(h, r, t)$, the quaternion embeddings of $h$, $r$, $t$, namely $Q_h, Q_r, Q_t \in \mathbb{H}$, are expressed as:

$$Q_h = Q_{h,r} + Q_{h,i}\,i + Q_{h,j}\,j + Q_{h,k}\,k$$
$$Q_r = Q_{r,r} + Q_{r,i}\,i + Q_{r,j}\,j + Q_{r,k}\,k$$
$$Q_t = Q_{t,r} + Q_{t,i}\,i + Q_{t,j}\,j + Q_{t,k}\,k \tag{14}$$
Here $Q_{h,r}, Q_{h,i}, Q_{h,j}, Q_{h,k}, Q_{r,r}, Q_{r,i}, Q_{r,j}, Q_{r,k}, Q_{t,r}, Q_{t,i}, Q_{t,j}, Q_{t,k} \in \mathbb{R}$. In the proposed CAQuatE model, every time the quaternion score of a triple is computed, the embeddings of its entities and relations are enhanced with context information. Specifically, the quaternion embeddings of entities and relations are initialized first. In each training epoch, we take the quaternion embeddings of the head entity, the tail entity, and the relation, aggregate context information for them once, and then enhance the original embeddings with the aggregated information. After the context information is aggregated, the quaternion $Q_r$ of the relation is normalized to the unit quaternion $\bar{Q}_r$, i.e., the scaling effect is eliminated by dividing $Q_r$ by its norm:

$$\bar{Q}_r = \frac{Q_r}{|Q_r|} = \frac{Q_{r,r} + Q_{r,i}\,i + Q_{r,j}\,j + Q_{r,k}\,k}{\sqrt{Q_{r,r}^2 + Q_{r,i}^2 + Q_{r,j}^2 + Q_{r,k}^2}} = \bar{Q}_{r,r} + \bar{Q}_{r,i}\,i + \bar{Q}_{r,j}\,j + \bar{Q}_{r,k}\,k \tag{15}$$
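In QuatE-style models each quaternion coefficient is itself a d-dimensional vector, so the normalization of Eq. (15) is applied coordinate-wise across the four components. A minimal sketch; the `(4, d)` array layout is an assumption of this sketch, not the paper's code:

```python
import numpy as np

def normalize_quaternion(q, eps=1e-12):
    """q: array of shape (4, d) holding the (r, i, j, k) coefficient vectors."""
    norm = np.sqrt((q ** 2).sum(axis=0, keepdims=True))  # |Q_r| per coordinate
    return q / (norm + eps)

q = np.arange(1.0, 9.0).reshape(4, 2)  # toy relation embedding with d = 2
q_bar = normalize_quaternion(q)
# every coordinate of q_bar now has unit norm across the four components
```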
Context-Aware Quaternion Embedding for Knowledge Graph Completion
41
Next, we rotate the head entity $Q_h$ by quaternion multiplication with $\bar{Q}_r$:

$$\begin{aligned}
Q_h' = Q_h \otimes \bar{Q}_r &= (Q_{h,r} \circ \bar{Q}_{r,r} - Q_{h,i} \circ \bar{Q}_{r,i} - Q_{h,j} \circ \bar{Q}_{r,j} - Q_{h,k} \circ \bar{Q}_{r,k}) \\
&\quad + (Q_{h,r} \circ \bar{Q}_{r,i} + Q_{h,i} \circ \bar{Q}_{r,r} + Q_{h,j} \circ \bar{Q}_{r,k} - Q_{h,k} \circ \bar{Q}_{r,j})\,i \\
&\quad + (Q_{h,r} \circ \bar{Q}_{r,j} - Q_{h,i} \circ \bar{Q}_{r,k} + Q_{h,j} \circ \bar{Q}_{r,r} + Q_{h,k} \circ \bar{Q}_{r,i})\,j \\
&\quad + (Q_{h,r} \circ \bar{Q}_{r,k} + Q_{h,i} \circ \bar{Q}_{r,j} - Q_{h,j} \circ \bar{Q}_{r,i} + Q_{h,k} \circ \bar{Q}_{r,r})\,k \\
&= Q_{h,r}' + Q_{h,i}'\,i + Q_{h,j}'\,j + Q_{h,k}'\,k
\end{aligned} \tag{16}$$

where $\circ$ denotes element-wise multiplication.
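Eq. (16) is the Hamilton product with element-wise multiplication of the coefficient vectors. A sketch under the same assumed `(4, d)` layout:

```python
import numpy as np

def hamilton_product(a, b):
    """Element-wise Hamilton product of Eq. (16); a and b have shape (4, d)."""
    ar, ai, aj, ak = a
    br, bi, bj, bk = b
    return np.stack([
        ar * br - ai * bi - aj * bj - ak * bk,   # real component
        ar * bi + ai * br + aj * bk - ak * bj,   # i component
        ar * bj - ai * bk + aj * br + ak * bi,   # j component
        ar * bk + ai * bj - aj * bi + ak * br,   # k component
    ])

# Rotating by a coordinate-wise unit quaternion preserves the norm of the head,
# which is exactly why the scaling effect disappears after normalization.
rng = np.random.default_rng(0)
q_h, q_r = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
q_r /= np.sqrt((q_r ** 2).sum(axis=0, keepdims=True))  # normalize as in Eq. (15)
q_h_rot = hamilton_product(q_h, q_r)
```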
4.4 Scoring Function and Loss
Similar to QuatE, we use as the scoring function the inner product between the quaternion of the head entity after rotation by the relation and the quaternion of the tail entity:

$$\phi(h, r, t) = Q_h' \cdot Q_t = \langle Q_{h,r}', Q_{t,r} \rangle + \langle Q_{h,i}', Q_{t,i} \rangle + \langle Q_{h,j}', Q_{t,j} \rangle + \langle Q_{h,k}', Q_{t,k} \rangle \tag{17}$$

We regard the task as a classification problem and learn the model parameters by minimizing the following regularized logistic loss:

$$L(Q, W) = \sum_{r(h,t) \in \Omega \cup \Omega^-} \log\left(1 + \exp(-Y_{hrt}\, \phi(h, r, t))\right) + \lambda_1 \|Q\|_2^2 + \lambda_2 \|W\|_2^2 \tag{18}$$
where $\Omega^-$ is sampled from the unknown triple set $\bar{\Omega}$.
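A sketch of the scoring function (17) and the per-triple logistic loss of (18), again treating embeddings as `(4, d)` arrays; the regularization terms are folded into `reg`, and all names are illustrative:

```python
import numpy as np

def score(q_h_rot, q_t):
    """Eq. (17): inner product of the rotated head and the tail, summed over
    all four quaternion components."""
    return float((q_h_rot * q_t).sum())

def triple_loss(phi, y, params=(), lam=0.01):
    """Eq. (18) for a single labeled triple: y = +1 (observed), -1 (negative)."""
    reg = lam * sum(float((p ** 2).sum()) for p in params)
    return float(np.log1p(np.exp(-y * phi))) + reg

# A positive triple with a higher score incurs a lower loss.
assert triple_loss(2.0, +1) < triple_loss(0.0, +1)
```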
5 Experiments and Results
In this section, we conduct a number of experiments on two widely used knowledge graph datasets. We begin by describing the datasets, baseline models, implementation details, and evaluation protocol. We then demonstrate the validity of the proposed model and of each module by comparing them with several baselines. In addition, further experiments, including ablation studies and case studies, were conducted.
5.1 Datasets
WN18RR [9] is a database about vocabulary from which inverse relations have been deleted; its main relation patterns are therefore symmetry/antisymmetry and composition. FB15k-237 [10] is a subset of a large-scale knowledge graph containing general facts, from which inverse relations have also been deleted. Statistics of the datasets are shown in Table 1.
Table 1. Statistics about the datasets.

Dataset     #entity  #relation  #training  #validation  #test
FB15k-237   14,541   237        272,115    17,535       20,466
WN18RR      40,943   11         86,835     3,034        3,134
5.2 Baselines
We compared CAQuatE with a wide range of baselines. Among semantic matching based models, we report DistMult, ComplEx [2], RotatE [3], QuatE [4], GeomE, AutoSF [11], DualE [5], and DensE. Among GNN based models, we report R-GCN, VR-GCN, SACN, CompGCN, and SE-GNN.
5.3 Evaluation Protocol
In this paper, we use Mean Reciprocal Rank (MRR) and Hits@N as evaluation metrics. MRR is the mean of the reciprocal ranks of the correct entities. Hits@N is the proportion of test triples for which the correct entity is ranked in the top N, where N = 1, 3, 10.
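For concreteness, both metrics can be computed from the rank assigned to the correct entity for each test triple:

```python
def mrr(ranks):
    """Mean of the reciprocal ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """Fraction of test triples whose correct entity is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 2, 4, 10]   # illustrative ranks of the correct entities
# mrr(ranks) = (1 + 1/2 + 1/4 + 1/10) / 4 ≈ 0.4625; hits_at(ranks, 3) = 0.5
```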
5.4 Experimental Settings
We implement CAQuatE in PyTorch and test the model on a single NVIDIA GeForce RTX 3090 GPU. We use grid search to obtain the hyper-parameters. The embedding dimension is chosen from {100, 150, 200, 250, 300}, the regularization rates λ1 and λ2 are chosen from {0, 0.01, 0.05, 0.1, 0.2}, the learning rate is chosen from 0.02 to 0.1, and the number of negatives per training sample is chosen from {1, 5, 10, 20}. The hyperparameters yielding the best performance are shown in Table 2.

Table 2. Hyperparameters for optimal performance.

Dataset     d    λ1   λ2   α  β   γe    γr
FB15k-237   250  0.1  0.1  4  10  0.1   0.1
WN18RR      100  0.1  0.1  3  8   0.05  0.2
5.5 Link Prediction
The empirical results on the WN18RR and FB15k-237 datasets are reported in Table 3, with Hits@N shown as percentages. The best score is in bold and the second best is underlined. From the results, the following conclusions can be drawn. CAQuatE surpasses the other baselines in general, which demonstrates the superiority of the model on the completion task. More specifically, on WN18RR, CAQuatE outperformed all baselines on all metrics, particularly with 1.6%,
2.5%, 1.5%, and 0.5% improvements over QuatE on MRR, Hits@1, Hits@3, and Hits@10, respectively. On FB15k-237, CAQuatE achieved only the second-best result on Hits@1, while on the remaining metrics it outperformed all baselines, with improvements of 6.3%, 10%, 6%, and 2.7% over QuatE on MRR, Hits@1, Hits@3, and Hits@10, respectively.

Table 3. Link prediction results on WN18RR and FB15k-237.

                   WN18RR                          FB15k-237
Model              MRR    Hits@1  Hits@3  Hits@10  MRR    Hits@1  Hits@3  Hits@10
DistMult (2015)    0.43   39      44      49       0.241  15.5    26.3    41.9
ComplEx (2016)     0.44   41      46      51       0.247  15.8    27.5    42.8
RotatE (2019)      0.476  42.8    49.2    57.1     0.338  24.1    37.5    53.3
QuatE (2019)       0.488  43.8    50.8    58.2     0.348  24.8    38.2    55
GeomE (2020)       0.485  44.4    50.1    57.3     0.366  27.2    40.1    55.7
AutoSF (2020)      0.490  45.1    –       56.7     0.360  26.7    –       55.2
DualE (2021)       0.492  44.4    51.3    58.4     0.365  26.8    40      55.9
DensE (2022)       0.492  –       –       58.6     0.351  –       –       54.4
R-GCN (2018)       –      –       –       –        0.248  15.1    26      41.7
VR-GCN (2019)      –      –       –       –        0.248  15.9    27.2    43.2
SACN (2019)        0.47   43      48      54       0.35   26      39      54
CompGCN (2020)     0.479  44.3    49.4    54.6     0.355  26.4    39      53.5
SE-GNN (2021)      0.484  44.8    50.9    57.3     0.368  28.3    40.2    56.2
CAQuatE            0.497  45.1    51.7    58.6     0.371  27.3    40.5    56.5
5.6 Number of Free Parameters Comparison
Table 4 compares the number of parameters of CAQuatE with several multiplication-based baselines and with SE-GNN. GeomE achieves its best results when combining GeomE2D and GeomE3D, in which case each entity/relation embedding requires 12 vector representations. DualE uses dual quaternions, so each entity/relation embedding requires 8 vector representations. DensE requires 300 and 800 dimensions on WN18RR and FB15k-237, respectively. Our model removes the unnecessary fully connected, dropout, and BatchNorm1d layers from QuatE, so its number of parameters is slightly lower than that of QuatE. For comparison with GNN models, we also list the number of SE-GNN parameters in the table.

Table 4. Comparison of the number of parameters.

Model       RotatE  QuatE   GeomE   DualE   DensE   SE-GNN  CAQuatE
WN18RR      40.95M  16.38M  49.11M  32.76M  49.14M  18.17M  16.37M
FB15k-237   29.32M  5.82M   17.43M  11.64M  46.56M  56.02M  5.81M
5.7 Ablation Study
To evaluate the effect of each module, four ablation studies were conducted: replacing entity-similarity selection with random selection; replacing entity-relation-correlation selection with random selection; removing the entity context information encoding module and using the original embedding; and removing the relation context information encoding module and using the original embedding. The results, shown in Table 5, demonstrate the effectiveness of each module.

Table 5. Analysis of the effect of each module.

                                  WN18RR                          FB15k-237
Model                             MRR    Hits@1  Hits@3  Hits@10  MRR    Hits@1  Hits@3  Hits@10
w/o entity similarity             0.493  44.5    51.3    57.8     0.361  26.3    40.0    56.1
w/o entity-relation correlation   0.495  44.9    51.6    58.5     0.363  26.5    39.8    56.0
w/o entity context                0.490  44.1    51.3    57.7     0.352  25.5    38.6    55.0
w/o relation context              0.492  44.4    51.2    58.3     0.354  25.9    38.8    54.8
CAQuatE                           0.497  45.1    51.7    58.6     0.371  27.3    40.5    56.5
5.8 Convergence Rate Comparison
We plot the performance curves of QuatE, DualE, and CAQuatE on the WN18RR dataset as training epochs increase. As shown in Fig. 2, our CAQuatE model not only performs better than the baselines but also converges faster. Its performance keeps rising after 10,000 epochs, which may be because the model makes full use of context information and is better able to learn the structure of triples.
5.9 Impact of Embedding Dimension
To understand the impact of dimension on model performance, we ran CAQuatE against the QuatE, DualE, and SE-GNN models under different dimension settings on the WN18RR dataset. As shown in Fig. 3, our model achieves its best result at 100 dimensions and consistently outperforms all baselines, indicating that CAQuatE can still achieve high accuracy across a wide range of dimensions.
Fig. 2. Performance curves of QuatE, DualE, and CAQuatE on WN18RR as training epochs increase
Fig. 3. Curve of experimental results with embedding dimension on WN18RR
Figures 4 and 5 show the link prediction results (MRR, Hits@1, and Hits@10) of CAQuatE with different embedding dimensions, d = {10, 50, 100, 200, 250} on the WN18RR dataset and d = {50, 100, 150, 200, 250, 300, 350} on the FB15k-237 dataset. The performance of CAQuatE is best when the embedding dimension is 100 and 250, respectively.
Fig. 4. Variation of experimental results with embedding dimension on WN18RR
Fig. 5. Variation of experimental results with embedding dimension on FB15k-237

5.10 Multi-relation Analysis
We analyze the experimental results by relation category on FB15k-237 and WN18RR, as shown in Table 6. There are 11 relation types in the test set of WN18RR, with a total of 3,134 triples, of which 1.34% are 1-to-1, 15.16% are 1-to-n, 47.45% are n-to-1, and 36.06% are n-to-n. There are 224 relation types in the test set of FB15k-237, with a total of 20,466 triples, of which 1-to-1 accounts for 0.94%, 1-to-n for 6.32%, n-to-1 for 20.45%, and n-to-n for 72.30%.
Table 6. Results on 1-to-1, 1-to-n, n-to-1, and n-to-n relations.

                WN18RR                          FB15k-237
Relation Type   MRR    Hits@1  Hits@3  Hits@10  MRR    Hits@1  Hits@3  Hits@10
1-to-1          0.460  40.5    49.9    55.8     0.515  39.4    47.5    55.3
1-to-n          0.262  18.4    27.6    42.3     0.332  24.5    33.5    47.5
n-to-1          0.228  16.9    25.2    24.7     0.556  47.9    56.6    65.2
n-to-n          0.969  94.0    96.2    96.5     0.340  22.1    36.9    55.3
6 Conclusion
To combine the advantages of semantic matching based models and GNN based models, we proposed a novel knowledge graph embedding model, CAQuatE. We propose entity similarity and entity-relation correlation measures to filter valuable context information, design a context information encoder to learn context-enhanced entity/relation embeddings, and model each triple as a rotation from the head entity to the tail entity in hypercomplex vector space. Experiments on two benchmark datasets show that our model combines the advantages of the two categories of models and is superior to other state-of-the-art methods.

Acknowledgements. This work was supported by the Natural Science Foundation of Fujian, China (No. 2021J01619) and the National Natural Science Foundation of China (No. 61672159).
References

1. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: ICML, vol. 11, pp. 809–816 (2011)
2. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: International Conference on Machine Learning, pp. 2071–2080 (2016)
3. Sun, Z., Deng, Z.H., Nie, J.Y., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space. In: International Conference on Learning Representations (2019)
4. Zhang, S., Tay, Y., Yao, L., Liu, Q.: Quaternion knowledge graph embeddings. In: NIPS, pp. 2731–2741 (2019)
5. Cao, Z., et al.: Dual quaternion knowledge graph embeddings. In: National Conference on Artificial Intelligence (2021)
6. Bansal, T., Juan, D.C., Ravi, S., McCallum, A.: A2N: attending to neighbors for knowledge graph inference. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4387–4392 (2019)
7. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, vol. 26, pp. 2787–2795 (2013)
8. Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., Gamon, M.: Representing text for joint embedding of text and knowledge bases. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509 (2015)
9. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: AAAI, pp. 1811–1818 (2018)
10. Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: CVSC, pp. 57–66 (2015)
11. Zhang, Y., Yao, Q., Dai, W., Chen, L.: AutoSF: searching scoring functions for knowledge graph embedding. In: ICDE, pp. 433–444 (2020)
Dependency-Based Task Assignment in Spatial Crowdsourcing

Wenan Tan1,2(B), Zhejun Liang1, Jin Liu1, and Kai Ding1

1 Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
[email protected]
2 Shanghai Polytechnic University, Shanghai 201209, China
Abstract. Task assignment is one of the central problems in spatial crowdsourcing research: a good assignment approach matches the best performers to each task. Complex tasks account for an increasing proportion of task assignment demands, yet most previous research on complex task assignment has ignored the dependency relationships between tasks, resulting in many invalid matches and wasted worker resources. A complex task can be assigned only after the tasks it depends on have been assigned, as in house decoration. In addition, task quality is an important factor in the assignment process, since the high-quality completion of tasks benefits all three parties in the crowdsourcing system. Therefore, this paper proposes a dependency-based greedy approach that, under distance, time, budget, and skill constraints, first assigns sets of available workers to tasks without unresolved dependencies and maximizes the total quality of the assigned tasks. Finally, extensive experiments are conducted on a dataset, and the results demonstrate the effectiveness of the proposed approach.
Keywords: spatial crowdsourcing · task assignment · task dependency · task quality

1 Introduction
The emergence of the mobile Internet has not only brought convenience and benefits to people but has also bred a new mode of solving problems through collective intelligence, namely crowdsourcing. Unlike outsourcing, crowdsourcing does not assign tasks to specific sets of workers but uses the free mass of Internet users to solve problems [3]. In recent years, with the rapid development of the "Internet+" model and the widespread popularity of mobile devices, traditional online crowdsourcing has also developed into a new service model, namely spatial crowdsourcing. Different from traditional crowdsourcing, spatial crowdsourcing requires workers to arrive at a specific spatial and temporal location to participate in and complete tasks. A general crowdsourcing system consists of three parties, the task requester, the crowdsourcing platform, and the workers, and task completion requires good collaboration among the three.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 48–61, 2023. https://doi.org/10.1007/978-981-99-2385-4_4
Task assignment is one of the hot topics in spatial crowdsourcing research; it aims at assigning the worker or set of workers most suitable to perform a task under the relevant constraints. Reasonable task assignment can not only take advantage of workers' strengths but also reduce task costs and ensure the quality of task completion. The tasks published on a crowdsourcing platform can be divided into two categories: simple tasks with an independent and single structure, such as answering questions online; and complex tasks requiring multiple skills and having a complex structure, such as house decoration, which often require the collaboration of multiple workers. Different from simple task assignment, the assignment of complex tasks must consider not only multi-skill requirements and task completion quality but also the dependencies between tasks. Take decorating a house, for example, with three complex tasks: the first is water and electricity renovation, the second is ceiling and floor construction, and the third is furniture installation. The three tasks are interdependent: if task one is not matched with a suitable worker set, then even if tasks two and three are matched with workers, those matches are invalid. Therefore, it is necessary to assign tasks based on the dependencies between them. Furthermore, when assigning tasks, we should consider not only whether workers can provide the skills required by the task but also the quality of those skills; the higher the skill quality, the higher the task quality and the more satisfied the task requester.
The remainder of this paper is organized as follows. Related work on task assignment in spatial crowdsourcing is presented in the second part. The formal definition of the dependency-based task assignment problem is given in the third part. In the fourth part, the task assignment approach is proposed, which iteratively selects the task with the highest task quality for assignment. In the fifth part, extensive experiments are conducted on a dataset to demonstrate the effectiveness of the proposed approach, and the sixth part concludes the paper.
2 Related Work
Task assignment is one of the central research problems in spatial crowdsourcing. As actual demands have changed, more and more task requesters have started to submit complex tasks, and how to assign complex tasks reasonably has become one of the research hotspots of recent years. Complex tasks are often impossible for a single worker to complete, so it is necessary to assign a set of appropriate workers to complete them collaboratively [1,5,10–13]. Kittur et al. [4] propose a complex task decomposition approach based on the idea of MapReduce, which decomposes a complex task into several independent subtasks; the crowdsourcing platform then assigns the subtasks to workers, integrates the submitted results into the result of the complex task, and delivers it to the task requester. Wangze Ni et al. [7] consider the dependencies between complex tasks and propose a greedy approach and a game-theoretic approach for task assignment that maximize the number of assigned tasks. Zhao Liu et al. [6] regard a complex task as a combination of multiple simple subtasks with dependencies, and also propose a greedy approach and a game
approach to assign workers to each simple subtask so as to maximize platform profit. These studies do not consider the skill quality of workers when making task assignments; when worker skill quality varies greatly, this may lead to unqualified task results and damage the reputation of the crowdsourcing platform. For complex task assignment, Cheng P et al. [2] proposed a greedy approach, a g-divide-and-conquer approach, and a cost-model-based adaptive approach under constraints such as time, distance, and budget, so that the skills provided by the assigned workers match the skills required by the task while maximizing platform profit. Liang Q et al. [8] proposed a heuristic task assignment approach, FCC-SA, based on feedback from task publishers and mutual evaluation among workers, to accurately evaluate the matching degree between tasks and workers, evenly assign skilled workers to cooperative tasks, and maximize the number of assigned tasks. Under task budget and skill constraints, Rahman et al. [9] optimized the communication cost of a worker team by considering affinity and upper critical mass, improving the team's collaboration efficiency when performing tasks. All of these studies consider complex tasks as a whole when assigning them, ignoring the dependencies between tasks, which leads to many invalid assignments. To solve the problems presented above, this paper proposes a dependency-based greedy approach to match appropriate sets of workers to complex tasks, maximizing the total quality of the assigned tasks while minimizing task cost under the constraints related to workers and tasks.
3 Problem Definition
In this section, we present the formal definition of the dependency-based task assignment problem in spatial crowdsourcing.

Definition 1 (Skill set). Let K = {k1, ..., kh, ..., kd} be a skill set containing d different skills. Each worker has at least one of these skills, and each complex task requires multiple skills to complete.

Definition 2 (Multi-skilled worker set). Let Wp = {w1, ..., wi, ..., wm} be a multi-skilled worker set containing m available workers at timestamp p. Each worker wi ∈ Wp is defined as a four-tuple wi = (li, ri, vi, Qi), where li = (lix, liy) denotes the location of worker wi, ri denotes the maximum moving distance of worker wi, vi denotes the average moving speed of worker wi, and Qi = {qi1, ..., qih, ..., qid} denotes the skill qualities of worker wi, where qih denotes the quality of the hth skill in the skill set; qih = 0 means that worker wi does not possess this skill.

Definition 3 (Complex task set). Let Tp = {t1, ..., tj, ..., tn} be a complex task set containing n tasks to be assigned at timestamp p. Each task tj ∈ Tp is defined as a five-tuple tj = (lj, ej, bj, Qj, Dj), where lj = (ljx, ljy) denotes the location of task tj, ej denotes the recruitment deadline of task tj,
bj denotes the budget of task tj, and Qj = {qj1, ..., qjh, ..., qjd} denotes the minimum skill qualities required by task tj, where qjh denotes the minimum quality of the hth skill required by task tj; qjh = 0 means that task tj does not need this skill. Dj denotes the dependency set of task tj, with Dj ⊆ Tp and tj ∉ Dj; task tj can only participate in task assignment when Dj = ∅, otherwise the tasks in the dependency set Dj are assigned first.

Definition 4 (Valid matching pairs). When assigning tasks at timestamp p, a valid matching pair ⟨wi, tj⟩ needs to satisfy the dependency, distance, time, budget, and skill constraints:

1) Dependency constraint: all tasks in the dependency set Dj of task tj have been successfully assigned, i.e., Dj = ∅, before task tj can participate in task assignment.

2) Distance constraint: the distance between worker wi and task tj is not greater than the maximum moving distance of worker wi, i.e., dist(wi, tj) ≤ ri. Euclidean distance is used as the distance measure in this paper:

$$dist(w_i, t_j) = \sqrt{(l_{jx} - l_{ix})^2 + (l_{jy} - l_{iy})^2} \tag{1}$$

where lix and liy are the horizontal and vertical coordinates of worker wi, and ljx and ljy are the horizontal and vertical coordinates of task tj.

3) Time constraint: worker wi must reach the task location before the recruitment deadline, i.e., $dist(w_i, t_j) \cdot v_i^{-1} \le e_j - p$.

4) Budget constraint: the travel cost of worker wi does not exceed the remaining budget $b'_j$ of task tj:

$$c_{ij} = C \cdot dist(w_i, t_j) \le b'_j \tag{2}$$

$$b'_j = b_j - \sum_{w_i \in W_j} c_{ij} \tag{3}$$

where C is the cost of moving per unit distance, which can be expressed as fuel consumption per kilometer for automobiles or electricity consumption per kilometer for electric vehicles, and Wj is the set of workers already assigned to task tj.

5) Skill constraint: the assigned worker wi can provide skills needed by task tj, i.e., $Q_i^T * Q_j \ne 0$.

Before defining the dependency-based task assignment problem, we first discuss how to calculate the quality of an individual task. For an assigned task tj, this paper uses the sum of the effective skill qualities of the worker set Wj as its quality. The effective skill quality of worker wi is not the sum of the qualities of all of the worker's skills, but only of the skills that worker wi provides for task tj. The quality of task tj is calculated as:

$$Q(t_j) = \sum_{w_i \in W_j} \sum_{h \in R_e} q_{ih} \tag{4}$$

where Re is the set of indices of the skills provided by worker wi for task tj; for h ∈ Re, $q_{ih} \cdot q_{jh} \ne 0$.
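As an illustration, Definitions 2 to 4 can be encoded directly. The class and field names below are our own shorthand, not the paper's implementation, and `C` defaults to a unit travel cost:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Worker:              # Definition 2: w_i = (l_i, r_i, v_i, Q_i)
    loc: tuple
    max_dist: float
    speed: float
    skills: list           # q_ih = 0 means the skill is absent

@dataclass
class Task:                # Definition 3: t_j = (l_j, e_j, b_j, Q_j, D_j)
    loc: tuple
    deadline: float
    budget: float
    min_quality: list      # q_jh = 0 means the skill is not needed
    deps: set = field(default_factory=set)

def is_valid_pair(w, t, p, C=1.0, remaining_budget=None):
    """Checks the five constraints of Definition 4 for a pair <w_i, t_j>."""
    if t.deps:                                   # 1) dependency constraint
        return False
    d = math.dist(w.loc, t.loc)                  # Eq. (1): Euclidean distance
    if d > w.max_dist:                           # 2) distance constraint
        return False
    if d / w.speed > t.deadline - p:             # 3) time constraint
        return False
    b = t.budget if remaining_budget is None else remaining_budget
    if C * d > b:                                # 4) budget constraint, Eqs. (2)-(3)
        return False
    # 5) skill constraint: the worker supplies at least one required skill
    return any(qw * qt != 0 for qw, qt in zip(w.skills, t.min_quality))

w = Worker(loc=(0, 0), max_dist=10, speed=2, skills=[3, 0, 2])
t = Task(loc=(3, 4), deadline=5, budget=100, min_quality=[2, 0, 0])
assert is_valid_pair(w, t, p=0)
```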
Next, we discuss how to calculate the cost of task tj. In this paper, we use the total travel cost of the worker set Wj as the cost of task tj:

$$C(t_j) = \sum_{w_i \in W_j} c_{ij} \tag{5}$$

Definition 5 (Dependency-based task assignment problem). At timestamp p, given a set of available workers Wp and a set of complex tasks Tp with interdependencies to be assigned on the crowdsourcing platform, the dependency-based task assignment problem is to obtain an assignment instance Ip such that: 1) for each task tj, the total travel cost of the worker set Wj is not greater than the task budget; 2) for each minimum skill quality required by task tj, the sum of the corresponding skill qualities provided by the worker set Wj satisfies it; and 3) the total quality Q(Ip) of the assigned tasks is maximized while keeping the task cost as small as possible:

$$Q(I_p) = \sum_{t_j \in CT_p} Q(t_j) = \sum_{t_j \in CT_p} \sum_{w_i \in W_j} \sum_{h \in R_e} q_{ih} \tag{6}$$

where CTp is the set of tasks successfully assigned at timestamp p. Tasks that are not successfully assigned at the current timestamp can continue to participate in the assignment at the next timestamp, as long as their recruitment deadline has not been reached.
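The objective quantities of Eqs. (4)-(6) can be sketched as follows; the data layout (skill-quality lists per worker) is illustrative:

```python
def task_quality(min_quality, worker_skill_lists):
    """Eq. (4): sum of the effective skill qualities of the assigned workers,
    counting only skills the task actually requires (q_jh != 0)."""
    return sum(q for skills in worker_skill_lists
                 for q, need in zip(skills, min_quality)
                 if need > 0 and q > 0)

def task_cost(travel_costs):
    """Eq. (5): total travel cost of the worker set W_j."""
    return sum(travel_costs)

def total_quality(assignments):
    """Eq. (6): summed quality over all tasks assigned at timestamp p;
    assignments is a list of (min_quality, worker_skill_lists) pairs."""
    return sum(task_quality(mq, ws) for mq, ws in assignments)

# Two workers assigned to a task needing skills 0 and 2: only the qualities
# they supply for those needed skills count (1 from the first, 4 from the second).
assert task_quality([2, 0, 3], [[1, 5, 0], [0, 0, 4]]) == 5
```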
4 Task Assignment Approach
In this section, a dependency-based greedy approach for task assignment is proposed. First, we define the matching pair availability.
4.1 Matching Pair Availability
At timestamp p, when assigning worker wi to task tj, two aspects need to be considered: the relationship between the skill qualities of worker wi and the remaining skill qualities required by task tj, and the relationship between the travel cost of worker wi and the budget of task tj. Therefore, the availability of a matching pair ⟨wi, tj⟩ is defined as follows:

$$A(w_i, t_j) = \frac{b'_j \cdot \sum_{h \in R'_e} q_{ih}}{c_{ij} \cdot |R'_e| + \epsilon} \tag{7}$$

where $R'_e$ is the set of indices of skills with non-zero remaining required quality that worker wi provides for task tj, and $\epsilon$ smooths the minimum travel cost of worker wi: when cij is 0, $\epsilon$ is 0.05; otherwise, $\epsilon$ is 0.
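A direct transcription of Eq. (7); the argument names are ours:

```python
def availability(remaining_budget, supplied_qualities, travel_cost):
    """Eq. (7): availability A(w_i, t_j) of a matching pair.

    supplied_qualities are the worker's qualities q_ih for h in R'_e (skills
    with non-zero remaining required quality); eps is 0.05 only when the
    travel cost c_ij is zero, as in the paper.
    """
    eps = 0.05 if travel_cost == 0 else 0.0
    return (remaining_budget * sum(supplied_qualities)) / (
        travel_cost * len(supplied_qualities) + eps)

assert availability(10, [2, 3], 5) == 5.0   # 10 * (2 + 3) / (5 * 2)
```

A larger remaining budget and higher supplied quality raise availability, while a higher travel cost lowers it, which is what the greedy selection in Sect. 4.3 exploits.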
4.2 Pruning Strategy
If all listed matching pairs are evaluated, the greedy approach proposed in this paper produces many useless calculations and wastes system resources. Therefore, this subsection proposes two pruning strategies, a worker pruning strategy and a task pruning strategy, which exclude unqualified workers and tasks before the calculation to improve the efficiency of the task assignment approach.

Worker Pruning Strategy. Let ⟨wa, tj⟩ and ⟨wb, tj⟩ be two valid matching pairs for task tj. If the skills of worker wa include the skills of worker wb, the quality of each skill of worker wa is greater than or equal to that of worker wb, and the travel cost of worker wa is less than or equal to that of worker wb, i.e., ∀kh ∈ K: qah ≥ qbh and caj ≤ cbj, then worker wa dominates worker wb, denoted wa ≺ wb. When worker wb is dominated, he will not be assigned to task tj, and the matching pair ⟨wb, tj⟩ can be safely deleted.

Task Pruning Strategy. Let W′j be the set of all valid workers that can be assigned to task tj and Wj the set of workers already assigned to task tj. When a new worker wi ∈ (W′j − Wj) is considered for task tj, if the matching pair ⟨wi, tj⟩ has the maximum availability but the travel cost of worker wi is greater than the remaining budget of task tj, i.e., cij ≥ b′j, then the budget of task tj is insufficient and task tj can be safely deleted. Deleted tasks can participate in the assignment at the next timestamp, or wait until the task publisher raises the budget.
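The worker pruning test is a skyline-style dominance check; a minimal sketch:

```python
def dominates(skills_a, cost_a, skills_b, cost_b):
    """True if worker a dominates worker b for a task: every skill quality of
    a is >= b's and a's travel cost is <= b's (non-strict, as in the paper)."""
    return (all(qa >= qb for qa, qb in zip(skills_a, skills_b))
            and cost_a <= cost_b)

assert dominates([3, 2], 4.0, [1, 2], 5.0)       # a is at least as good everywhere
assert not dominates([1, 2], 5.0, [3, 2], 4.0)   # a is worse on the first skill
```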
4.3 Task Matching Pairs Selection
According to the availability of matching pairs defined in Sect. 4.1, this subsection proposes the task matching pairs selection algorithm, which tries to match a set of workers with the highest availability to a task.

Algorithm 1. Task Matching Pairs Selection
Input: A task tj and a set of workers Wp
Output: A set of candidate matching pairs Ij
1: Initialize: Ij ← ∅
2: if fj = 0 and Dj = ∅ then
3:   Itmp ← ∅
4:   for each worker wi ∈ Wp do
5:     if worker wi satisfies the relevant constraints then
6:       if worker wi cannot be removed by the worker pruning strategy then
7:         Itmp ← Itmp ∪ {⟨wi, tj⟩}
8:       end if
9:     end if
10:   end for
11:   tag ← True
12:   while tag do
13:     if Itmp = ∅ then
14:       tag ← False
15:     else
16:       Obtain a pair ⟨wi, tj⟩ ∈ Itmp with the maximum availability
17:       if task tj can be removed by the task pruning strategy then
18:         tag ← False
19:       else
20:         Update the remaining skill quality and budget of task tj
21:         Ij ← Ij ∪ {⟨wi, tj⟩}
22:         Itmp ← Itmp − {⟨wi, tj⟩}
23:         if task tj meets the assignment completion conditions then
24:           return Ij
25:         end if
26:       end if
27:     end if
28:   end while
29:   if tag = False then
30:     fj ← 1
31:     Ij ← ∅
32:   end if
33: end if
34: return Ij
First, an assignment set Ij is initialized to store the matching pairs related to task tj (line 1). If task tj cannot be assigned or does not satisfy the dependency constraint, the empty set is returned directly, indicating that task tj does not participate in this round of assignment; otherwise the algorithm tries to assign a set of workers with maximum availability to task tj (lines 2 to 33). In line 3, a temporary set Itmp is initialized to store the newly obtained matching pairs for task tj. For each available worker, line 5 checks whether it satisfies the distance, time, budget, and skill constraints. If so, the worker pruning strategy is used to decide whether the worker is retained: if worker wi is not dominated by another worker in the set Itmp (line 6), the matching pair ⟨wi, tj⟩ is added to Itmp (line 7), and any matching pair ⟨wa, tj⟩ whose worker wa is dominated by worker wi is deleted from Itmp. The tag in line 11 indicates whether a set of workers with maximum availability can be assigned to task tj; the assignment process is described in lines 12 to 28. First, the algorithm checks whether any available matching pairs remain: if task tj does not yet meet the assignment completion condition and no matching pairs are available, tag is set to False (line 14); otherwise, the matching pair with the highest availability is selected from Itmp (line 16). The task pruning strategy then decides whether the task is retained: if the travel cost of worker wi is greater than the remaining budget of task tj, the task budget is insufficient and tag is set to False (line 18); otherwise, the remaining required skill quality and remaining budget of task tj are updated (line 20), and the matching pair is added to Ij (line 21) and removed from Itmp (line 22). If the sum of each skill quality over the worker set Wj is greater than or equal to the corresponding minimum skill quality of task tj, task tj meets the assignment completion condition (line 23) and the set Ij is returned (line 24). If task tj cannot be matched with a suitable worker set, i.e., tag equals False, then fj is set to 1, indicating that task tj does not participate in subsequent assignment at the current timestamp, the assignment set Ij is emptied, and an empty set is returned (lines 29 to 31).
The Greedy Approach
In this paper, we propose a dependency-based greedy approach for task assignment. When performing task assignment, the approach iteratively selects the task with the highest task quality to assign, maximizing the total quality of the assigned tasks.

Algorithm 2. Dependency-based Task Assignment Approach
Input: A timestamp p
Output: An assignment instance Ip
1: Initialize: Ip ← ∅
2: Collect a set of available workers Wp
3: Collect a set of tasks Tp that need to be assigned
4: while Ip is not stable do
5:   CTtmp ← ∅
6:   for each task tj ∈ Tp do
7:     Ij ← Use Algorithm 1
8:     if Ij ≠ ∅ then
9:       CTtmp ← CTtmp ∪ {tj}
10:    end if
11:  end for
12:  Obtain a task tj ∈ CTtmp with the maximum quality
13:  fj ← 2
14:  Ip ← Ip ∪ Ij
15:  Wp ← Wp − {wi | ⟨wi, tj⟩ ∈ Ij}
16:  for ta ∈ Tp do
17:    if tj ∈ Da then
18:      Da = Da − {tj}
19:    end if
20:  end for
21: end while
22: return Ip
At timestamp p, an assignment instance set Ip is initialized to store the finally successful matching pairs (line 1). Lines 2 and 3 collect the currently available worker set Wp and the task set Tp to be assigned. Available workers include workers not assigned at the last timestamp, workers who have completed previously assigned tasks, and workers newly appearing on the platform; tasks to be assigned include tasks not assigned at the last timestamp and tasks newly published on the platform by task requesters. After that, task assignment is performed until the set Ip is stable (lines 4 to 21). Lines 6 to 11 add all assignable
56
W. Tan et al.
tasks to the set CTtmp using the task matching-pair selection algorithm; these assignable tasks are the tasks whose assignment set returned by Algorithm 1 is not empty. The task tj with the highest quality is then selected (line 12) as the result of this round of assignment, fj is set to 2 to indicate that the task has been successfully assigned (line 13), the matching pairs in the assignment set Ij of task tj are added to the assignment instance set Ip (line 14), and the assigned workers are removed from Wp (line 15). Then, for every task that depends on task tj, task tj is removed from its dependency set (lines 16 to 20). Finally, the assignment instance set is returned (line 22) and the workers are notified to perform the corresponding tasks.
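The loop of Algorithm 2 can be sketched in Python as follows. This is a simplified sketch, not the paper's implementation: `select_workers` stands in for Algorithm 1, each task carries only a dependency set and a quality value, and the flag fj is implied by membership in the returned instance.

```python
# Simplified sketch of Algorithm 2 (dependency-based greedy assignment).
# select_workers(task_id, available_workers) plays the role of Algorithm 1:
# it returns a set of worker ids, or None if the task cannot be assigned.
# Note: the dependency sets in `tasks` are updated in place.

def greedy_assign(tasks, workers, select_workers):
    """tasks: dict task_id -> {'deps': set of task_ids, 'quality': float}."""
    instance = {}                       # task_id -> assigned worker set (Ip)
    available = set(workers)            # Wp
    while True:
        # Collect every task that can currently be assigned (lines 6-11).
        candidates = {}
        for tid, t in tasks.items():
            if tid in instance or t['deps']:
                continue                # already assigned / dependencies unmet
            sel = select_workers(tid, available)
            if sel:
                candidates[tid] = sel
        if not candidates:
            break                       # the instance Ip is stable
        # Pick the candidate task with the highest quality (line 12).
        best = max(candidates, key=lambda tid: tasks[tid]['quality'])
        instance[best] = candidates[best]
        available -= candidates[best]   # line 15
        # Remove the assigned task from every dependency set (lines 16-20).
        for t in tasks.values():
            t['deps'].discard(best)
    return instance
```

Note how a task whose dependencies were just cleared becomes assignable in the next iteration of the outer loop, which is exactly why the loop runs until the instance is stable rather than making a single pass.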
5 Experimental Study
In this section, extensive experiments are conducted on a synthetic dataset to demonstrate the effectiveness of the task assignment approach proposed in this paper.

5.1 Experimental Setting
Experimental Environment. The experiments in this paper are run in the following environment: an AMD Ryzen 7 CPU, 16 GB of memory, and the Windows 11 operating system. The algorithms are implemented in Python.

Data Sets. Table 1 summarizes the ranges of the experimental parameter values, where the default value of each parameter is shown in bold. In each experiment, only one parameter is changed while the others keep their default values.

Table 1. Experimental Parameter Setting

Parameters                          Values
the number of workers               200, 400, 600, 800
the number of tasks                 100, 200, 300, 400
the range of worker skill size      [1, 3], [1, 4], [1, 5], [1, 6]
the range of task dependency size   [0, 3], [0, 5], [0, 7], [0, 9]
Specifically, the locations of workers and tasks are generated uniformly at random in the two-dimensional space [0, 1]^2; the number of workers varies from 200 to 800, and the number of tasks varies from 150 to 300. For each worker wi, the average moving speed vi is randomly generated in [0.01, 0.1], the maximum moving distance ri in [0.2, 0.3], the size of the skill set in [sk−, sk+], where [sk−, sk+] varies from [1, 3] to [1, 6], and the quality of each skill in [0.3, 0.5]. For each task tj, the recruitment deadline ej is randomly generated in [10, 30], the budget
bj is randomly generated in [5, 10], the size of the skill set in [4, 8], the minimum quality of each skill in [0.4, 0.8], and the size of the dependency set Dj in [de−, de+], where [de−, de+] varies from [0, 3] to [0, 9]. To build Dj, a task ta generated before task tj is randomly selected, and ta together with the tasks in its dependency set Da is added to Dj until the size of Dj reaches the generated value. In addition, the cost of moving per unit distance C is set to 10 and the total number of skills d to 10.

Comparison Experiment. To evaluate the effectiveness and correctness of the proposed task assignment approach, we vary the number of workers, the number of tasks, the size of the worker skill set, and the size of the task dependency set, compare with the Random approach and the MSSC Greedy approach proposed in [2], and analyze the differences in terms of both total task quality and running time. Each experiment is repeated several times, and the average is reported as the final result. To illustrate the impact of dependencies on the final task assignment results, two approaches that do not consider dependencies are selected for comparison. Specifically, the approach proposed in this paper takes into account inter-task dependencies and skill quality and selects the task with the highest quality to assign in each round; the MSSC Greedy approach considers neither inter-task dependencies nor skill quality and selects the matching pair with the highest profit growth in each round; the Random approach takes nothing into account, randomly selecting a worker in each round and then randomly assigning the worker to a task. For the Random approach, the best result of 100 runs is taken as the final result.

5.2 Result Analysis
In this paper, we analyze the effect of the following parameters on the experimental results: the number of workers, the number of tasks, the size of the worker skill set, and the size of the task dependency set. Since this paper uses a synthetic dataset, the reported results are averages obtained over extensive repeated experiments.

Number of Workers and Tasks. Figure 1 shows the experimental results of changing the number of workers from 200 to 800, and Fig. 2 shows the results of changing the number of tasks from 150 to 300. As observed from Fig. 1(a) and Fig. 2(a), the total task quality of the three approaches increases as the number of workers and tasks increases; the reason is that with more workers and tasks there are more valid matching pairs for each task, so more tasks can be successfully assigned. In addition, the Dependency-based Greedy approach obtains the highest total task quality, followed by the MSSC Greedy approach, while the Random approach obtains the lowest total task
Fig. 1. Effect of the number of workers
Fig. 2. Effect of the number of tasks
quality. As observed from Fig. 1(b) and Fig. 2(b), the running time of the three approaches increases as the number of workers and tasks increases, because more matching pairs are generated, which increases the computational cost. The Random approach is the simplest, taking nothing into account in the assignment process, and has the shortest running time. The Dependency-based Greedy approach avoids many useless calculations by taking inter-task dependencies into account and is second. The MSSC Greedy approach only considers platform profit, ignores inter-task dependencies, generates many invalid matching pairs, and has the longest running time.

Worker Skill Set Size. Figure 3 shows the experimental results of changing the size of the worker skill set from [1, 3] to [1, 6]. As shown in Fig. 3(a), the total task quality of the three approaches increases as the worker skill set grows, because a larger skill set makes it easier to meet the skill requirements of each task, so more tasks can be successfully assigned. The total task quality obtained by the Dependency-based Greedy approach is the highest, followed by the MSSC Greedy approach, with the Random approach lowest. As observed from Fig. 3(b), the running time of the three approaches increases as the worker skill set grows, because more tasks can be allocated. The Random approach takes
Fig. 3. Effect of the range of worker skill size
the least running time, followed by the Dependency-based Greedy approach, and the MSSC Greedy approach takes the most time.
Fig. 4. Effect of the range of task dependency size
Task Dependency Set Size. Figure 4 shows the experimental results of changing the size of the task dependency set from [0, 3] to [0, 9]. As observed from Fig. 4(a), the total task quality of the three approaches decreases as the task dependency set grows, because a larger dependency set makes the dependency constraint harder to satisfy, so most tasks cannot participate in the assignment. The total task quality obtained by the Dependency-based Greedy approach is the highest, followed by the MSSC Greedy approach and the Random approach. As shown in Fig. 4(b), as the task dependency set grows, the running time of the Dependency-based Greedy approach decreases gradually, while the running times of the MSSC Greedy and Random approaches are almost unchanged. The reason is that a larger dependency set prevents most tasks from participating in the assignment, which reduces the computational cost; however, the MSSC Greedy and Random approaches do not consider inter-task dependencies when assigning, so their running times are not reduced. The Random approach takes the least running time, the Dependency-based Greedy approach is second, and the MSSC Greedy approach takes the most time.
According to the above experimental results, it can be seen that the approach proposed in this paper can ensure the completion of complex tasks with high quality while the running time is within an acceptable range.
6 Conclusion
This paper studies dependency-based multi-skill task assignment in spatial crowdsourcing. In addition to the inter-task dependency constraint, worker assignment is also constrained by distance, time, cost, and skill. Under these constraints, this paper proposes a dependency-based greedy approach that assigns a set of workers with the highest availability to each task and then selects the task with the highest quality from the successfully assignable tasks, maximizing the total quality of the assigned tasks while keeping the task cost as small as possible. Finally, the effectiveness and correctness of the proposed task assignment approach are demonstrated by extensive experiments. In future work, we will study whether each complex task can be divided into multiple interdependent subtasks for assignment.
References
1. Cheng, P., Chen, L., Ye, J.: Cooperation-aware task assignment in spatial crowdsourcing. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1442–1453 (2019)
2. Cheng, P., Lian, X., Chen, L., et al.: Task assignment on multi-skill oriented spatial crowdsourcing. IEEE Trans. Knowl. Data Eng. 28(8), 2201–2215 (2016)
3. Howe, J.: The rise of crowdsourcing. Wired Mag. 14(6), 1–4 (2006)
4. Kittur, A., Smus, B., Khamkar, S., et al.: CrowdForge: crowdsourcing complex work. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 43–52 (2011)
5. Liang, Z., Tan, W., Liu, J., et al.: Multi-skill collaboration-based task assignment in spatial crowdsourcing. In: International Conference on Computer Application and Information Security (ICCAIS 2021), pp. 42–48 (2022)
6. Liu, Z., Li, K., Zhou, X., et al.: Multi-stage complex task assignment in spatial crowdsourcing. Inf. Sci. 586, 119–139 (2022)
7. Ni, W., Cheng, P., Chen, L., et al.: Task allocation in dependency-aware spatial crowdsourcing. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 985–996 (2020)
8. Qiao, L., Tang, F., Liu, J.: Feedback based high-quality task assignment in collaborative crowdsourcing. In: 2018 IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA), pp. 1139–1146 (2018)
9. Rahman, H., Roy, S., Thirumuruganathan, S., et al.: Optimized group formation for solving collaborative tasks. VLDB J. 28(1), 1–23 (2019). https://doi.org/10.1007/s00778-018-0516-7
10. Rahman, H., Thirumuruganathan, S., et al.: Worker skill estimation in team-based tasks. Proc. VLDB Endow. 8(11), 1142–1153 (2015)
11. Song, T., Xu, K., Li, J., et al.: Multi-skill aware task assignment in real-time spatial crowdsourcing. GeoInformatica 24(1), 153–173 (2020). https://doi.org/10.1007/s10707-019-00351-4
12. Tan, W., Zhao, L., Li, B., et al.: Multiple cooperative task allocation in group-oriented social mobile crowdsensing. IEEE Trans. Serv. Comput. 15(6), 3387–3401 (2021)
13. Zhao, L., Tan, W., Xu, L., et al.: Crowd-based cooperative task allocation via multicriteria optimization and decision-making. IEEE Syst. J. 14(3), 3904–3915 (2020)
ICKG: An I Ching Knowledge Graph Tool Revealing Ancient Wisdom

Gaojie Wang1, Liqiang Wang2, Shijun Liu1(B), Haoran Shi1, and Li Pan1

1 School of Software, Shandong University, Jinan, China
[email protected]
2 School of Journalism and Communication, Shandong University, Jinan, China
Abstract. The I Ching was originally a book of divination in ancient China. After thousands of years of derivation and reinterpretation, it has been transformed into a profound book of dialectical philosophy carrying a series of philosophical thoughts and intellectual wisdom. However, due to its rich and special knowledge, it is difficult for traditional methods to mine its deeper knowledge. This article first gives a brief introduction to the I Ching. Then, according to the knowledge structure of the I Ching, the classes of entities such as hexagrams, Yin and Yang lines, and words in the I Ching are defined, the relationships between entities of the same class are defined, and the relationships between entities of different classes are constructed to establish the I Ching Knowledge Graph. Through the I Ching Knowledge Graph, knowledge reasoning and link mining can be realized. In addition, an interactive tool (ICKG) is developed to support knowledge graph-based visualization and interactive knowledge mining. It provides users with search and visualization functions for each hexagram and each entity in the I Ching. Moreover, it supports visualized linking, including the connections among entities in the knowledge graph and the relationships among the images of the 64 hexagrams. This provides not only an interactive platform for scholars to better understand the I Ching, but also a powerful toolkit to mine the deep knowledge in the I Ching.

Keywords: I Ching · Hexagram · Yin and Yang · Knowledge graph · Entity · Relationship

1 Introduction
“The I Ching” or “The Book of Changes”, originally a divination book formed in the Western Zhou period (1000–750 BC) in ancient China, contains exceedingly rich content. It starts by explaining the rules of divination and proposes a set

Supported by the National Natural Science Foundation of China under Grant 61872222; the Major Projects of the National Social Science Foundation of China under Grant 19ZDA026.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 62–74, 2023. https://doi.org/10.1007/978-981-99-2385-4_5
of laws about the change of things according to the change of the odd and even numbers of the Yin and Yang signs. Based on these laws, it further explains the changes in the affairs of humans and heaven, forming a system of thought with the theory of the change of Yin and Yang at its core. The conventional interpretation, merely by explanation and association, is too limited to explore deeper insights concerning the philosophy, culture, and history of the I Ching [1]. With the development of digital humanities and artificial intelligence, the state of research on the I Ching has changed [2,3]. To help scholars of the I Ching analyze and express the semantic relations contained in the text, and to realize knowledge reasoning and online visualization, a novel digital humanities research platform on the I Ching is necessary to support academic work through more intuitive and effective approaches. The first Chinese knowledge graph of the I Ching was thus created and a corresponding online application was developed.
Fig. 1. I Ching knowledge structure system
Concerning the particularity of I Ching knowledge, the construction of the I Ching knowledge graph has to consider the characteristics of both the I Ching text and its knowledge structure. Figure 1 gives a brief introduction to the I Ching knowledge structure. As the Tai Chi diagram in Fig. 1 shows, Tai Chi produces Yin and Yang. A solid line represents Yang (which can also be represented by the number 1), and a broken line represents Yin (represented by the number 0). When the Yin and Yang lines are combined in different ways, four bigrams are formed, namely Taiyang, Shaoyang, Taiyin, and Shaoyin (clockwise from 12 o'clock). Different constructions of three Yin and Yang lines result in eight
trigrams, namely Qian, Xun, Kan, Zhen, Kun, Gen, Li, and Dui, as shown in the Tai Chi diagram (clockwise from 12 o'clock). The eight trigrams symbolize eight basic elements of the universe through their attributes and natures. Any two of the eight trigrams placed together form a compound trigram, i.e., a hexagram. There are a total of sixty-four hexagrams (the outermost circle of Tai Chi in Fig. 1), which symbolize various situations in nature and human society. There are expositions for each hexagram and each line, named the judgment of the hexagram and the statement of the line, respectively. As the Qian hexagram in Fig. 1 shows, it is the first of the 64 hexagrams. There is a short hexagram judgment for Qian; all lines in the Qian hexagram are solid, and each line has a corresponding short line statement. As shown in the Qian trigram part of Fig. 1, all the hexagrams in the first row share the same lower trigram with three Yang lines, i.e., the Qian trigram; correspondingly, the hexagrams in the second row share Qian as the same upper trigram. As all these hexagrams share the same Qian trigram, they are related to each other to some extent in their judgments and line statements. As shown in Fig. 1, there are three kinds of "hexagram changing" transformations: intricate transformation, synthesis transformation, and mutual transformation [4]. There are many forms of change in a hexagram, such as changes in any one of the six lines, changes in any two lines, changes in any three lines, etc. "Yin", "Yang", and the "eight trigrams" are the basis of the I Ching, and the 64 hexagrams are generated through combinations of the eight trigrams, which embodies the operating law of the I Ching symbol system. They contain original but rich content, including various rules and changes in life and the universe, as well as profound intellectual wisdom.
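The line structure above lends itself to a compact encoding. The sketch below is a hypothetical illustration, not code from the paper: a hexagram is represented as a 6-tuple of lines, bottom line first, with 1 for Yang and 0 for Yin. Which of the paper's three transformation names corresponds to which classical operation (cuo, zong, hu) is our assumption.

```python
# Hexagram as 6-tuple of lines, bottom first: 1 = Yang (solid), 0 = Yin.
# The mapping of the paper's terms (intricate/synthesis/mutual) to the
# classical cuo/zong/hu operations below is our reading, not the text's.

def reversed_hexagram(h):      # zong: the hexagram turned upside down
    return tuple(reversed(h))

def opposite_hexagram(h):      # cuo: every Yin/Yang line flipped
    return tuple(1 - x for x in h)

def mutual_hexagram(h):        # hu: lines 2-4 and 3-5 form a new hexagram
    return h[1:4] + h[2:5]

qian = (1, 1, 1, 1, 1, 1)      # the Qian hexagram: six Yang lines
kun = opposite_hexagram(qian)  # six Yin lines, the Kun hexagram
```

With this encoding, checking whether two hexagrams share a lower or upper trigram (as in the Qian rows of Fig. 1) is just a comparison of `h[:3]` or `h[3:]`.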
Between the hexagram images and the textual judgments, some implicit linkages still need to be explored with more advanced methods. The main contributions of this paper include: 1) as far as we know, we build the first I Ching knowledge graph based on the I Ching knowledge system; 2) we design an online application based on the I Ching knowledge graph and develop a toolkit (ICKG) to support visualized profiling and link mining over the 64 hexagrams.
2 Related Work
The knowledge graph is a current research hotspot and is widely used in many fields. Knowledge graphs [5] play a key role in the development of question answering, advanced analytics, recommendation, and semantic search applications [6]. Various applications [7] based on knowledge graphs have been used in industry, including internet finance, recommendation systems, social network analysis, and other areas. Public knowledge graphs such as WordNet [8], YAGO [9], DBpedia [10], and Wikidata [11] are flourishing quickly. As a consequence, data management solutions for knowledge graphs have also developed considerably, such as Neo4j [12], GraphDB [13], Titan [14], etc.
There are also many studies on knowledge graphs in traditional culture, and some scholars have applied knowledge graph construction to intangible cultural heritage [15,16]. To support managing this knowledge, researchers have long been committed to the construction of domain ontologies, and many ontologies and metadata standards exist in the cultural heritage domain: CIDOC CRM, DC, VRA Core, the Europeana Data Model (EDM), etc. Weng et al. [17] proposed a framework for the automatic construction of medical knowledge graphs based on semantic analysis, including a medical ontology constructor, knowledge elements, and graph model generation, and presented the implementation and application of a personalized treatment recommendation framework based on knowledge graphs. Yu et al. [18] constructed a large-scale knowledge graph for Chinese medicine that integrates terms, documents, databases, and other knowledge resources in the field; this graph can facilitate various knowledge services, such as knowledge visualization, knowledge retrieval, and knowledge recommendation. Shi et al. [19] explored a new model to organize and integrate semantic health knowledge into concept graphs, automatically retrieve knowledge from the knowledge graph, and reason over it, and proposed a contextual inference pruning algorithm to achieve efficient chain inference. With the wide application of information technology, humanities research has shown clear interdisciplinary characteristics. The development of digital humanities research in China has accelerated: not only have academic publications and related research in the field of digital humanities emerged, but related academic programs have also been established [20].
In terms of digital humanities platform construction, the Chinese ancient book digital humanities research platform supports Chinese historical research by providing digital archives, digital reading, and basic search, including an automatic text annotation system to explain texts and social network mapping tools to explore the social networks of historical figures [21]. Using the "China Biographical Database Project", with Tang and Song dynasty poetry as the main body combined with related poetry literature, a knowledge graph system of Tang poetry and Song lyrics was established, providing retrieval, display, and visualization functions [22]. As far as we know, this article is the first to apply a knowledge graph to the ancient Chinese oracular text, the I Ching, in order to explore the philosophical and intellectual mysteries it contains. In addition, the I Ching knowledge management platform, ICKG, is initially established in this paper to better explore the profound wisdom contained in this ancient oracular text through a digital humanities approach.
3 I Ching Knowledge Graph

3.1 Key Entity Classes in the I Ching
In order to organise and integrate the content in the I Ching and build a knowledge graph, we first define the classes of entities in the I Ching and the specific
instances of each entity class. The I Ching contains rich concepts, such as Yin and Yang, the eight trigrams, the sixty-four hexagrams, and other images (symbols), and there are also statements that describe each image in depth, such as the judgment of each hexagram and the statement of each line; each concept corresponds to rich content and profound philosophy. Briefly summarizing, the concepts in the I Ching are classified as:

• Image entities: Yin line, Yang line, eight trigrams, sixty-four hexagrams, three hundred and eighty-six lines, upper trigram, lower trigram.
• Hexagram judgment entities: supreme fortune, profitable, steadfast, auspicious, no harm, profits, no regret, little light, lost, worry, harmful.
• Line statement entities: natural objects, animals, figures, architecture, direction, geographical location information, military scenario.
• Word entities: divination term, physical image, person, space, time, event.

There is a unique hierarchy between the concepts in the I Ching, and the blue boxes in Fig. 2 represent these concepts. For example, any three lines can be combined to give a trigram, and any two trigrams can be combined to give a hexagram. As shown in Fig. 2, below each concept is information about its attributes; in the case of a hexagram, there are attributes such as hexagram name, hexagram order, explanation, statement, and commentary.
Fig. 2. I Ching entity classes and attributes (Color figure online)
Since the judgments and statements are records of divination whose content describes various aspects of ancient social life, in order to obtain each concrete entity with a semantic unit in the judgments and statements, we need to segment them. Since the I Ching is a classical Chinese text from nearly 3,000 years ago, it is not suitable to use modern automatic word
segmentation methods to determine the entities in it. Therefore, the word segmentation of the judgments and statements was performed manually by postgraduates in the field of I Ching research, so as to determine each entity with a specific meaning. As shown in Fig. 3, the statement of the second line of the Qian hexagram is divided in different ways to obtain nine word entities with different meanings.
Fig. 3. The statement of the second place line in Qian hexagram
After nearly four months of manual segmentation of the judgments and statements by postgraduates majoring in I Ching research, more than 7,000 words with semantic units were finally obtained, of which 2,124 were unique. The hexagram judgments contain mainly conclusive statements such as auspicious, evil, regretful, and miserly, while the line statements contain not only divination terms but also a large number of historical events, reflecting the opposing phenomena of nature, society, and the universe as a whole. Therefore, we refer to the ancient Chinese classic annotations on the judgment of each hexagram and the statement of each line, and manually label and classify each word. As shown in Fig. 2, the words are divided into six categories, namely divination terms, natural objects, character images, space, time, and events; the words in each category are then further classified to obtain the smallest entities with semantic units.

3.2 The Relationship Among I Ching Entities
The relationships between entities are another basic element of the I Ching knowledge graph, directly determining the richness of the graph and the functional scope of applications built on it. We use SPO (subject, predicate, object) triples to represent the relationships between entities in the text. The entity-relationship construction task for the I Ching determines relationships at two levels according to the knowledge structure. The first level is the relationship within the same entity class; for example, there are intricate, synthesis, and mutual relationships between hexagram images (as shown in the hexagram changing part of Fig. 1). The second level is the relationship between different entity classes; for example, there is an inclusion relationship between hexagrams and lines, and an existence relationship between lines and words. We have defined a total of 12 relationships so far, and Fig. 4 is a schematic diagram of the entity relationships in the I Ching.
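The SPO representation can be illustrated with a few toy triples. The entity and relation names below are made up for illustration; the actual schema of the graph is not given in this excerpt.

```python
# Illustrative SPO triples; names are hypothetical, not the graph's schema.
triples = [
    ("Qian_hexagram", "HAS_UPPER_TRIGRAM", "Qian_trigram"),
    ("Qian_hexagram", "CONTAINS_LINE", "Qian_line_1"),
    ("Qian_line_1", "CONTAINS_WORD", "hidden_dragon"),
    ("Qian_hexagram", "SYNTHESIS_OF", "Kun_hexagram"),
]

def objects(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

The first and fourth triples are level-one relations (between image entities); the second and third cross entity classes, matching the two levels described above.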
Fig. 4. Illustration of the concept and entity graph knowledge representation of I Ching
3.3 Knowledge Graph Storage and Mining of Linkages
We use the graph database Neo4j¹ to store and manage the I Ching knowledge graph. Neo4j is a mature and robust graph database with a high-performance graph engine. It uses the Cypher language for expressive and efficient querying, updating, and management. Cypher is a declarative graph query language whose design is very user-friendly and suitable for both developers and professional operators; operations on the graph database can be realized without writing complex query code. Neo4j also provides visualization of graph data. Figure 5 describes the number of entities of each class in the I Ching knowledge graph and the definition process; the numbers in the figure represent the number of instances in each entity class. Research on the I Ching has a long history and rich content, and Fig. 5 shows only the current construction of the I Ching knowledge graph. To reflect the professionalism of the graph and to take into account the needs of I Ching experts and scholars, later work needs to integrate more than 2,000 years of traditional I Ching commentaries, from the pre-Qin period to the Republic of China, including interpretations by famous scholars of past dynasties, into the I Ching Knowledge Graph. So far, users can query and visualize entities and relationships through the Cypher language of the Neo4j database.
1 https://neo4j.com
Fig. 5. Scale of knowledge graph and generation process
As shown in Fig. 6, which visualizes the entities associated with the Kun hexagram in Neo4j, entities of the same color belong to the same class. Figure 7 visualizes the hexagram judgments and line statements containing the entity "auspicious". In addition, we can perform higher-level knowledge mining through the constructed knowledge graph, including knowledge link mining and path mining. For example, a user may want to answer specific questions such as: (1) Which lines have the same divination term and the same line position? (2) Which hexagrams involve the "sacrifice" entity at the national level and also include the "war" entity? (3) What is the shortest path between the "Qian hexagram" entity and the "fruit" entity? All of these questions can be answered through the knowledge graph we constructed.
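For instance, question (3) reduces to a shortest-path search over the graph. The sketch below answers it with a breadth-first search over an in-memory triple list; the toy data is illustrative only, and in Neo4j itself such queries can be expressed directly in Cypher, e.g. with its shortestPath pattern.

```python
from collections import deque

def shortest_path(triples, start, goal):
    """Shortest path between two entities, treating every SPO relation
    as an undirected edge; returns a list of entities or None."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Because BFS explores the graph level by level, the first path that reaches the goal is guaranteed to have the fewest hops, which is exactly the notion of "shortest path" in an unweighted knowledge graph.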
4 ICKG Tool Development
To make it more convenient for users to search, manage, and analyze the knowledge in the I Ching Knowledge Graph, we have developed an easy-to-learn knowledge graph toolkit named ICKG as a web application. We use the Neo4j graph database as the back-end database for data storage and query, Django² as the back-end framework, and Vue³ as the front-end framework to visualize the entities and links.
2 https://www.djangoproject.com
3 https://vuejs.org
Fig. 6. Diagram of the relationship between Kun hexagram
Fig. 7. Hexagram and Line containing the entity “auspicious”
We use the MVC (Model-View-Controller) pattern, which allows us to separate the user interface logic from the business requirements. The Model encapsulates the data and the data processing methods associated with the application's business logic. The View enables the purposeful display of data. The Controller acts as an organiser between the different levels. In this way, the association, listing, and presentation of entities are achieved, and the richness of the hexagram relationships is expressed. The main system interface is shown in Fig. 8. The ICKG is developed by taking advantage of the knowledge graph's strong intuitive and interactive features. The web application interface is shown in Figs. 9, 10, 11, and 12. Through this platform, as shown in Fig. 9, users can quickly find all the knowledge of the 64 hexagrams, including the judgment of each hexagram and the statement of each line, as well as the associated entities.
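The MVC separation described above can be illustrated with a minimal pure-Python sketch; the class and field names are invented for illustration and are not ICKG's actual code.

```python
# Minimal MVC illustration (hypothetical names, not ICKG's actual code).

class HexagramModel:                  # Model: data + data access methods
    def __init__(self):
        self._data = {"Qian": "six Yang lines", "Kun": "six Yin lines"}

    def get(self, name):
        return self._data.get(name)

class HexagramView:                   # View: purposeful display of data
    def render(self, name, description):
        if description is None:
            return f"{name}: not found"
        return f"{name}: {description}"

class HexagramController:             # Controller: organiser in between
    def __init__(self, model, view):
        self.model, self.view = model, view

    def show(self, name):
        return self.view.render(name, self.model.get(name))

app = HexagramController(HexagramModel(), HexagramView())
```

Keeping all query logic inside the Model is what makes it possible to swap the in-memory dictionary for Cypher queries against the Neo4j back end without touching the View or Controller.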
ICKG: An I Ching Knowledge Graph Tool Revealing Ancient Wisdom
Fig. 8. I Ching Knowledge Graph Construction and Application System
This reflects our system's capacity to accommodate the global knowledge of the I Ching. The hexagrams and lines of the I Ching form a system based on "Yin", "Yang" and the "eight trigrams". Through the web application, users can view the "Yin" and "Yang" of each line of a hexagram. It is also possible to search for any three consecutive lines among the six lines and the trigram those three lines represent. Figure 10 shows three consecutive lines selected arbitrarily from the six lines of the Meng hexagram; users can also view the other hexagrams among the 64 that share the same three consecutive lines with the Meng hexagram. The hexagrams and lines are all-encompassing, given the rich information in their explanations and interpretations. Through the web application we developed, users can view all the entities contained in each hexagram, as well as which hexagrams contain a given entity. As shown in Fig. 11, we query all the entities of the Qian hexagram and the links between them and the entities of other hexagrams. This returns 12 hexagrams related to the "originating" entity, and all 14 line-statement entities related to it. Figure 12 shows the graph visualization of the changing relationships between hexagrams. The web application can also display detailed relationship information between hexagrams, embodied in the relationships between the lines of the 64 hexagrams. The knowledge of the I Ching and its relationships displayed by the web application depend mainly on the organization of the I Ching data in the back-end graph database. Therefore, the quality of the knowledge extracted from the I Ching texts is very important; it requires human supervision and subsequent targeted processing, especially in the early stage of constructing the ICKG web application. In future work, we will continue to extend the functionality of the system.
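The consecutive-line (trigram) lookup described above can be sketched as follows. The yin(0)/yang(1), bottom-to-top encoding and the eight-trigram table follow the conventional scheme; the Meng encoding shown is our own illustration:

```python
# Sketch of the consecutive-line search: a hexagram is six lines (bottom to
# top), and every window of three consecutive lines forms one of eight trigrams.
TRIGRAMS = {
    (1, 1, 1): "Qian", (0, 0, 0): "Kun",  (1, 0, 0): "Zhen", (0, 1, 0): "Kan",
    (0, 0, 1): "Gen",  (0, 1, 1): "Xun",  (1, 0, 1): "Li",   (1, 1, 0): "Dui",
}

def consecutive_trigrams(hexagram):
    """All four windows of three consecutive lines, with their trigram names."""
    return [(i + 1, TRIGRAMS[tuple(hexagram[i:i + 3])]) for i in range(4)]

meng = [0, 1, 0, 0, 0, 1]  # Meng hexagram: Kan below, Gen above
print(consecutive_trigrams(meng))
# [(1, 'Kan'), (2, 'Zhen'), (3, 'Kun'), (4, 'Gen')]
```

Finding other hexagrams that share a given three-line window is then a simple filter over the 64 encodings.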
For example, we hope to understand the global picture of the I Ching from the perspective of statistical analysis, and such quantitative analysis can be performed easily through the ICKG web application. We counted the number of identical entities contained in any two hexagrams, on the hypothesis that the more identical entities two hexagrams share, the closer their meanings and interpretations may be.
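With toy, invented entity sets standing in for the extracted knowledge, this shared-entity statistic could look like:

```python
# Toy illustration of the shared-entity count; the entity sets are invented,
# not the actual extraction results.
entities = {
    "Qian": {"originating", "auspiciousness", "dragon"},
    "Kun":  {"originating", "mare", "auspiciousness"},
    "Meng": {"youth", "instruction"},
}

def shared_entity_count(h1, h2):
    """More shared entities -> closer meanings, under the hypothesis above."""
    return len(entities[h1] & entities[h2])

print(shared_entity_count("Qian", "Kun"))   # 2
print(shared_entity_count("Qian", "Meng"))  # 0
```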
Fig. 9. Search for hexagrams or entities
Fig. 10. The image connection of the hexagrams
Fig. 11. Word entity linkages
Fig. 12. Visualization of hexagram relations
5 Conclusion
We have constructed a knowledge graph of the I Ching, the earliest extant philosophical monograph of ancient China, which enables link discovery and knowledge mining among the 64 hexagrams. We have also developed the ICKG web application, through which users can search for hexagrams and related entities in the I Ching. The system also provides a preliminary visualization of the knowledge graph. The functionality and applications of the platform can be further improved. For example, since the birth of the I Ching, famous scholars of every dynasty have offered various interpretations; in later work, interpretations of famous masters from dynasties such as the Han, Tang and Song can be added to the system. After nearly 3,000 years of inheritance and development, the I Ching has formed a huge knowledge system. Building ICKG into a platform that serves experts and scholars in the field of the I Ching, guided by their suggestions, is the direction of our later work.
Collaborative Analysis on Code Structure and Semantics Xiangdong Ning, Huiqian Wu, Lin Wan, Bin Gong, and Yuqing Sun(B) School of Software, Shandong University, Jinan, China {wanlin,gb,sun_yuqing}@sdu.edu.cn
Abstract. In this paper, we propose a collaborative method that analyzes both code structure and function semantics for code comparison. First, we create the function call graph of the code and use a graph auto-encoder to obtain the structure semantics. Then the function semantics are obtained from the names and definitions of the library functions and built-in classes used in the code. Finally, we integrate the structure and function semantics to collaboratively analyze the similarity of codes. We adopt several real code datasets to validate our method, and the experimental results show that it outperforms the baselines. Ablation experiments show that the function call structure contributes the most to the performance. We also visualize the semantics of function structures to illustrate that the proposed method can extract the correlations and differences between codes. Keywords: Code Structure · Function Semantics · Self-Encoder
1 Introduction

Open-source platforms provide an environment for researchers to share and exchange code. For a given problem, researchers often search these platforms for relevant solutions to reuse [1]. Codes addressing the same problem can differ considerably in structure and function, so to help researchers select high-quality code effectively, code comparison is an essential problem. Related works fall into four main categories. The first is the text-based analysis method [2–4], which treats code as plain text and compares the similarity of the texts; it applies to source code in different programming languages, but its accuracy is usually low. The second is the structure-based analysis method [5–8], which encodes the Abstract Syntax Tree (AST) of the code as a vector and computes code similarity on that basis; it preserves the logical structure of the source code, but the cost of constructing the AST is usually high. The third is the static feature-based comparison method [9], which extracts static features such as lines of code and number of parameters to form feature vectors and compares codes through them; it is sensitive to the number of features selected. Finally, the binary-based analysis method [10] generally disassembles binary code to obtain a sequence of instructions for each function and then performs code comparison based on vectorized instruction features; however, the interpretability of the vectorization process is usually poor.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 75–88, 2023. https://doi.org/10.1007/978-981-99-2385-4_6

Modular structure is a basic form of code design. Frequently called functions are often wrapped in library functions, which are encapsulated in built-in classes, such as the NumPy, math and random libraries in the Python language. We often use functions to implement a modular structure in which they call each other to achieve specific functionality. Therefore, call structures, function names and built-in class names together provide important technical features of the code. Existing work is mainly based on comparing code text and structural syntax, but it rarely considers code text and structural semantic information collaboratively. To solve this problem, we propose a collaborative analysis method combining code structure and function semantics. First, we create the function call graph of the code and use a self-encoder based on graph convolutional neural networks to encode the structure semantics. The function semantics are obtained from the call information of library functions and built-in class names. Finally, we integrate code structure and function semantics to collaboratively analyze the similarity of codes. Several real code datasets were adopted to validate our method, showing that it outperforms the baselines. Through ablation analysis we verified that the function call structure part contributes the most to the performance. We also visualize the semantics of function structures to illustrate that our method can extract the correlations and differences between codes. In the remainder of this paper, we first analyze the strengths and weaknesses of existing methods in related work. Then, the detailed description of our method is presented.
Finally, a performance comparison with existing methods is carried out on three real code datasets, and we discuss the contribution of each component of our method as well as the correlations and differences in the structure semantics of codes.
2 Related Work

The most representative approach to code comparison is the Abstract Syntax Tree (AST)-based method, which uses a parser to generate the AST of the source code and calculates code similarity through tree matching [13]. RtvNN [5] is one such method: it transforms function names, class names and variable names in the code into specific tokens and uses recurrent neural networks to obtain embeddings of the code tokens; the AST is then encoded into an implicit vector using a recursive self-encoder, and the token embeddings and the implicit AST vector are combined to compare code similarity. TBCCD [6] detects code syntactic similarity using tree convolution: it obtains the code vector with tree convolutional neural networks and uses it for similarity analysis. Program-graph-based comparison obtains the program dependency graph of the code, which captures its control and data dependencies, and then compares code similarity with a graph matching algorithm [14]. DeepSim [7] transforms the code's control flow and data flow into semantic matrices, converts them into vectors using feedforward neural networks, and thereby turns code similarity detection into a binary classification problem. These types of methods preserve the
logical structure information of the source code, but the cost of constructing ASTs and program dependency graphs is usually high. Two further representative approaches are the text-based and the static feature-based analysis methods. Text-based methods generally treat the code as plain text and use text similarity to calculate the similarity between codes [11]. Token-based methods convert the code into a sequence of tokens, for example by replacing all class names with class representations, and then compute code similarity with algorithms such as dynamic programming [12]. SCDetector [2] combines token-based and graph-based comparison: the control flow graph of the code is obtained with tooling, and social network centrality analysis is applied to mine the centrality of its basic blocks, yielding so-called semantic tokens; these are encoded into implicit vectors with recurrent neural networks to calculate code similarity. Such methods are applicable to almost all programming languages, but their accuracy is usually low. The feature-based comparison method parses the code to obtain static features and compares codes based on them [15]. The TaaM [9] method defines several such features: the number of lines of code, parameters, function calls, local variables, conditional statements, loop statements and return statements. This method is sensitive to the number of features selected: too many features lead to overfitting, while too few lead to information loss.
A further family of methods works on the binary form of the code: the binary is disassembled to obtain the instruction sequence of each function, the instruction features are vectorized, and code similarity is finally calculated from the feature vectors [17, 18]. The BinDeep [10] method first extracts instruction sequences from binary functions and then vectorizes the instruction features using an instruction embedding model. BinMatch [16] is a semantics-based hybrid method that compares binary functions and measures their similarity based on signatures. These methods can compare binary code across architectures, compilers and versions, but their vectorization process is barely interpretable.
3 Collaborative Analysis on Code Structure and Semantics

3.1 Problem Definition

The code comparison problem is designed to compute the similarity of two given codes, which should be consistent with the following properties:

1. Similarity of structure: Modular structure is the basic form of code design achieved by using functions; by calling each other, functions realize specific functionality. Therefore, the function call structures of codes that achieve similar functionality are similar.
2. Similarity of function: Frequently called functions are usually packaged into library functions, which constitute a library of basic functions, for example the NumPy, math and random libraries in the Python language. Therefore, the names of the functions called in codes that achieve similar functionality are also similar.
3. Similarity of built-in classes: Library functions are generally encapsulated in built-in classes, so the built-in classes imported by codes that achieve similar functionality are similar.

For the convenience of the subsequent discussion, we define the function call graph of a code as follows: a node denotes a function name, the attribute of a node denotes the number of calls of that function, and an edge exists between two nodes if a call relationship exists between them.

3.2 Framework

We propose the collaborative analysis method incorporating code structure and function semantics. Our method consists of three parts: structure similarity, function similarity and built-in class similarity, as shown in Fig. 1.
Fig. 1. The Framework of Collaborative Analysis on Code Structure and Semantics
First, the structured graph of function calls in the given code is created, and the graph is encoded with a self-encoder [3] based on graph convolutional neural networks. Then the semantic comparison of functions and built-in classes is performed using the call information of library functions and built-in class names. Finally, code similarity is analyzed collaboratively by integrating structure similarity and function semantics similarity.

3.3 Function Call Graph Construction

We first create the function call structure graph of the given code, as shown in Fig. 2. We use a tool to generate a DOT file of the function call structure, which describes the nodes in the graph and the relationships between them. For example, for a Python application, the function call graph can be obtained using the PyCallGraph tool.
We preprocess the DOT file to obtain the graph representation g, as shown in Eq. (1):

g = (V, A_V, E, A_E)    (1)

where V denotes the set of function names, A_V the set of function attributes, E the set of function call edges, and A_E the set of edge attributes. From the graph representation g we can obtain the adjacency matrix A.
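As a small sketch, the adjacency matrix A can be derived from the graph representation like this; the function names and call edges are invented for illustration:

```python
# Sketch of building the adjacency matrix A from the graph representation
# g = (V, A_V, E, A_E); the function names and call edges are invented.
V = ["main", "load", "solve"]
E = [("main", "load"), ("main", "solve")]

idx = {v: i for i, v in enumerate(V)}
A = [[0] * len(V) for _ in V]
for u, w in E:
    A[idx[u]][idx[w]] = 1
    A[idx[w]][idx[u]] = 1  # symmetrized, matching the undirected GCN encoder

print(A)  # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```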
Fig. 2. Constructing function call structure graph
3.4 The Self-encoder Based Structure Similarity

Our collaborative analysis method incorporating code structure and function semantics consists of three parts: structure similarity, function similarity and built-in class similarity. The structure similarity part uses the construction method in Sect. 3.3 to obtain the graph representation of the function call structure as the technical features of the given code. Then we use a self-encoder [3] based on graph convolutional neural networks to encode the function call structures. The structure vector h_s contains information on the function properties and the function call edges. The encoder is calculated as shown in Eqs. (2) to (4):

h_s = GCN(A_V, A)    (2)

GCN(A_V, A) = Ã ReLU(Ã A_V W_0) W_1    (3)

Ã = D^(-1/2) A D^(-1/2)    (4)

where A_V is the collection of function attributes, A is the adjacency matrix of the function call graph, h_s is the function call structure vector, D is the degree matrix, ReLU is the activation function, and W_0 and W_1 are the parameters to be learned.
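Equations (2)-(4), together with the decoder and loss of Eqs. (5)-(6) below, can be sketched numerically as follows. This is an untrained toy: the node attributes and the weights W_0, W_1 are random stand-ins for learned parameters, and the loss is averaged over all matrix entries rather than the paper's edge count N:

```python
import numpy as np

# Numerical sketch of the graph self-encoder: normalized adjacency, two GCN
# layers, sigmoid decoder, cross-entropy reconstruction loss.
rng = np.random.default_rng(0)

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])            # adjacency matrix of the call graph
A_V = rng.normal(size=(3, 8))           # function (node) attributes
W0 = 0.1 * rng.normal(size=(8, 32))     # random stand-in for learned weights
W1 = 0.1 * rng.normal(size=(32, 16))

D = np.diag(A.sum(axis=1))              # degree matrix
A_tilde = np.linalg.inv(np.sqrt(D)) @ A @ np.linalg.inv(np.sqrt(D))  # Eq. (4)

relu = lambda x: np.maximum(x, 0.0)
h_s = A_tilde @ relu(A_tilde @ A_V @ W0) @ W1      # Eqs. (2)-(3)

A_hat = 1.0 / (1.0 + np.exp(-h_s @ h_s.T))         # Eq. (5): sigmoid decoder
loss = -np.mean(A * np.log(A_hat) + (1 - A) * np.log(1 - A_hat))  # Eq. (6)
print(h_s.shape)  # (3, 16): one 16-dimensional embedding per function
```

Training would adjust W_0 and W_1 by gradient descent on this loss; only the forward pass is shown here.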
The graph self-encoder uses the sigmoid function as the decoder to reconstruct the original graph, as shown in Eq. (5):

Â = σ(h_s h_s^T)    (5)

where Â is the adjacency matrix of the reconstructed graph and σ is the sigmoid function. To make the reconstructed adjacency matrix as similar as possible to the original one, we use the cross-entropy of the two matrices as the loss function, as shown in Eq. (6):

loss = -(1/N) Σ_{i,j} [a_ij log â_ij + (1 - a_ij) log(1 - â_ij)]    (6)

where a_ij denotes the elements of the adjacency matrix A of the original graph, â_ij denotes the elements of the adjacency matrix Â of the reconstructed graph, and N denotes the number of function call edges.

3.5 Function Semantics Similarity

In the function semantics similarity part, the function call information is computed from library function names. Frequently called functions are usually packaged into library functions, which constitute a library of basic functions; therefore, the library functions called by codes with similar functionality are also similar. From the TF-IDF values of the function names, we obtain the function call information vector h_f, as shown in Eq. (7):

h_f = [f(fun_1), f(fun_2), ...]    (7)

where h_f denotes the function vector and f(fun_i) denotes the TF-IDF value of the i-th function called in the code. In the built-in class similarity part, the similarity is obtained from the call information vector of built-in class names. Library functions are generally encapsulated in built-in classes, and the built-in classes imported by codes with similar functionality are also similar. Therefore, the built-in class call information vector h_c is obtained from the TF-IDF values of the built-in class names, as shown in Eq. (8):

h_c = [f(cls_1), f(cls_2), ...]    (8)

where h_c denotes the built-in class vector and f(cls_i) denotes the TF-IDF value of the i-th built-in class imported in the code. We concatenate the structure semantic vector, function vector and built-in class vector for collaborative analysis. The code vector h is calculated as shown in Eq. (9):

h = h_s ⊕ h_f ⊕ h_c    (9)

where h denotes the code vector and ⊕ is the vector concatenation operator.
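A hand-rolled sketch of Eqs. (7)-(9) on an invented three-code corpus; the structure vector h_s and built-in class vector h_c are placeholders, so only the TF-IDF and concatenation steps are shown concretely:

```python
import math

# TF-IDF over library-function names; the "codes" are invented for illustration.
codes = [
    ["sort", "append", "len"],
    ["sort", "range", "len"],
    ["open", "read"],
]
vocab = sorted({fun for code in codes for fun in code})

def tfidf_vector(code):
    """TF-IDF value per vocabulary term for one code's called functions."""
    vec = []
    for term in vocab:
        tf = code.count(term) / len(code)
        df = sum(term in c for c in codes)
        vec.append(tf * math.log(len(codes) / df))
    return vec

h_f = tfidf_vector(codes[0])       # function vector, Eq. (7)
h_s = [0.1, -0.2, 0.3]             # placeholder structure vector from the encoder
h_c = [0.5, 0.0]                   # placeholder built-in class vector, Eq. (8)
h = h_s + h_f + h_c                # concatenation (⊕), Eq. (9)
print(len(h))                      # 3 + |vocab| + 2 dimensions
```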
3.6 Code Comparison Combining Code Structure and Function Semantics

In the code comparison part, we first use the method in Sect. 3.3 to obtain the graph representations of the function call structures of the given codes. Then the vectors of the given codes are obtained by applying the methods in Sects. 3.4 and 3.5. Finally, the cosine similarity sim_a of the code vectors is used as the similarity score of the given codes, as shown in Eq. (10):

sim_a = cos(h_sea, h_des)    (10)

where h_sea is the vector of the retrieved reusable code, h_des is the vector of the newly designed code, and cos is the cosine similarity function.
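The scoring step of Eq. (10) is plain cosine similarity; a self-contained sketch with toy vectors:

```python
import math

# Eq. (10): cosine similarity of two code vectors (toy values).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

h_sea = [1.0, 2.0, 0.0]   # retrieved reusable code vector
h_des = [2.0, 4.0, 0.0]   # newly designed code vector
sim_a = cosine(h_sea, h_des)
print(round(sim_a, 3))  # 1.0: parallel vectors, maximal similarity
```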
4 Experiments and Analysis of Results

4.1 Dataset and Experimental Settings

We adopt three real code datasets to validate our method from three perspectives: comparing our collaborative method with other structure-based and text-based methods; discussing, through ablation analysis, the impact of the structure similarity, function similarity and built-in class similarity components on performance; and visualizing the correlations and differences of the structure semantic encodings. The dataset statistics are shown in Table 1. The Google Code Jam (GCJ) dataset is the one used by Wang et al. [19]; Google Code Jam (https://code.google.com/codejam/past-contests) is an international programming competition held by Google. The GCJ dataset contains 1669 code files for 12 different problems. Codes that solve the same problem are considered functionally similar, while codes that solve different problems are considered not similar. The Big Clone Bench (BCB) dataset, proposed by Svajlenko et al. [20], is a widely used code plagiarism benchmark containing 6 million plagiarism pairs and 260,000 non-plagiarism pairs. In code plagiarism detection tasks, researchers classify plagiarism into four types, where type IV refers to code that implements the same functionality with different code; more than 98% of the plagiarism pairs in BCB belong to type IV [19]. Since our main goal is to compare the structure semantics of code, the BCB dataset suits us. We follow the settings of Wang et al. [19], discarding code fragments without any markup, which leaves 9134 code files. The POJ Clone (POJ) dataset, proposed by Mou et al. [21], was initially used for code classification and later by code comparison methods such as TBCCD [6]. OpenJudge (http://noi.openjudge.cn/) is an online program evaluation system of Peking University; the dataset consists of 7500 codes for 104 different problems submitted by students on OpenJudge, mainly covering sorting, Boolean expression computation, the Josephus problem, etc. OpenJudge verifies the correctness of each code, and codes solving the same problem are considered similar.
Table 1. The Dataset Statistics

Dataset | Number of code files | Total lines of code | Average number of lines | Average number of functions
GCJ     | 1669                 | 98117               | 58.79                   | 23.15
BCB     | 9134                 | 302370              | 33.10                   | 31.14
POJ     | 7500                 | 265737              | 35.43                   | 15.70
The function call graphs of 1500 codes randomly selected from the three datasets are used to pre-train the graph self-encoder. The batch size is set to 128 and the number of epochs to 100. The graph convolutional neural network includes 2 convolutional layers, whose outputs are 32- and 16-dimensional vectors. We set the dropout to 0.2 and the learning rate to 0.001. Node attributes are set to 1433-dimensional random vectors, and the structure semantic vectors, function vectors and built-in class vectors are transformed into 4096-, 256- and 256-dimensional vectors, respectively. A code pair is considered similar if its similarity score is greater than or equal to a threshold; the thresholds are set to 0.77, 0.72 and 0.79 on the GCJ, BCB and POJ datasets, respectively.

4.2 Evaluation Metrics

We apply evaluation metrics such as precision and recall, which are suitable for labeled datasets, to evaluate the performance of our collaborative analysis method. The GCJ and POJ datasets consist of codes that solve different problems, so codes that solve the same problem are considered functionally similar; plagiarized pairs in the BCB dataset are considered similar. Therefore, precision and recall can be applied to all three datasets. A true positive (TP) means the code pair is actually similar and predicted similar, a false positive (FP) means the pair is actually not similar but predicted similar, and a false negative (FN) means the pair is actually similar but predicted not similar. We use precision, recall and the F1 score, as shown in Eqs. (11) to (13):

P = TP / (TP + FP)    (11)

R = TP / (TP + FN)    (12)

F1 = 2 * P * R / (P + R)    (13)
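Eqs. (11)-(13) computed from toy counts (the TP/FP/FN values are invented for illustration):

```python
# Precision, recall and F1 from toy prediction counts.
def prf1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf1(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.8 0.89 0.842
```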
4.3 Comparison Methods

We considered five comparison methods, covering AST-based, control/data flow-based, program dependency graph-based and token-based approaches.

RtvNN [5]: This method first transforms function names, class names, variable names, etc. in the code into specific tokens and uses a recurrent neural network to obtain the embedding vectors of the code tokens. Then the AST is encoded into an implicit vector using a recursive self-encoder. Finally, the embedding vectors of the tokens and the implicit vector of the AST are combined to determine whether the codes are similar.

TBCCD [6]: This method uses tree convolution to detect the semantic similarity of codes. First, a code fragment is parsed into an AST, which is transformed into a complete binary tree. Then a tree convolutional neural network produces the code vector. Finally, the cosine similarity of the vectors is used for similarity analysis.

DeepSim [7]: This method encodes code control flow graphs and data flow graphs into semantic matrices, each row of which is a sparse binary vector. It transforms the semantic matrices into vectors representing the implicit features of the codes using feedforward neural networks, turning code similarity detection into a binary classification problem.

EU-HOLMES [23]: This representative method, proposed by Mehrotra et al. in 2021, uses an attention-based graph neural network to encode the program dependency graph of a code as a feature vector, and codes are compared by the similarity of the feature vectors.

SCDetector [2]: This method treats the control flow graph as a social network and applies social network centrality analysis to mine the centrality of the basic blocks in the graph. The resulting semantic tokens are encoded as implicit vectors using recurrent neural networks, based on which code similarity is compared.

StrSemSim: This is our proposed method, which integrates code structure and function semantics collaboratively.
The following are its variants. StrSemSim-class: our method without the built-in class similarity part. StrSemSim-fun: our method without the function similarity part. StrSemSim-str: our method without the function call structure similarity part.

4.4 Performance Analysis

To verify the performance of our method, its effectiveness was compared with the above models on the three datasets using precision, recall and the F1 score, as shown in Table 2. To be consistent with existing work, the data in Table 2 are taken from the related papers; since only TBCCD [6] used the POJ dataset, performance on POJ is not listed. The recall of our method outperforms existing methods on both the GCJ and BCB datasets, indicating that it recognizes most of the actually similar code pairs. EU-HOLMES and DeepSim achieve the highest precision on the GCJ and BCB datasets, indicating that codes predicted similar by our method are sometimes actually not similar. On the GCJ dataset, the F1 score of our method improved by 0.56 and 0.07 compared to the RtvNN and TBCCD methods, respectively. RtvNN has the lowest precision because it predicts almost all code pairs as similar. RtvNN generates the implicit
Table 2. Performance comparison of code comparison models

Model      | GCJ              | BCB
           | P    R    F1     | P    R    F1
RtvNN      | 0.20 0.90 0.33   | 0.95 0.01 0.01
DeepSim    | 0.71 0.82 0.76   | 0.97 0.98 0.98
TBCCD      | 0.79 0.85 0.82   | 0.94 0.95 0.95
EU-HOLMES  | 0.84 0.92 0.88   | 0.72 0.97 0.83
SCDetector | 0.81 0.87 0.84   | 0.98 0.97 0.98
StrSemSim  | 0.83 0.95 0.89   | 0.93 0.99 0.96
vector representation of the codes based on the tokens and the AST and uses the Euclidean distance to determine whether codes are similar, so it is strongly influenced by the quality of the implicit vectors [7]. The F1 score of our method improved by 0.13 over the DeepSim method. The precision of DeepSim is 0.71 because it does not encode function call information and cannot distinguish these functions and their corresponding statements. On the GCJ dataset, the F1 score of our method improved by 0.05 over the token-based SCDetector method, because the token-based method cannot capture the syntactic and semantic information of the code. On the BCB dataset, the F1 score of our method improved by 0.13 over the EU-HOLMES method, which illustrates that the structure information of function calls brings larger gains than structured syntactic information.

4.5 Ablation Experiment

We split the model and validate the benefits brought by its different parts using F1 scores on the GCJ and BCB datasets; the results are shown in Fig. 3. StrSemSim has the highest F1 score, and the difference in F1 score between StrSemSim and each variant indicates the gain brought by the corresponding part. The results show that the structure similarity part brings the most benefit, and the function similarity part the second most. Since the BCB dataset contains only code fragments, which import no built-in classes, the built-in class similarity part does not improve performance on BCB. The F1 scores of StrSemSim-class and StrSemSim-fun on the GCJ dataset first increase and then decrease as the threshold increases, because the recall decreases while the precision increases with the threshold. The F1 score is maximal when the threshold is 0.77.
The F1 score of StrSemSim-str keeps decreasing as the threshold increases, since recall drops sharply with the threshold, indicating that codes with similar functionality have similar function call structures.
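The precision/recall trade-off described above can be reproduced with a minimal sketch: given similarity scores for code pairs and a decision threshold (0.77 is the tuned value reported above), pairs are predicted similar when the score exceeds the threshold. The function name and the toy data below are illustrative, not the paper's implementation.

```python
def prf1(similarities, labels, threshold):
    """Compute precision, recall and F1 for a given similarity threshold.

    similarities: predicted similarity score per code pair (e.g. cosine).
    labels: 1 if the pair is truly functionally similar, else 0.
    """
    tp = fp = fn = 0
    for s, y in zip(similarities, labels):
        pred = 1 if s >= threshold else 0
        if pred and y:
            tp += 1          # correctly predicted similar
        elif pred and not y:
            fp += 1          # predicted similar but actually not
        elif y:
            fn += 1          # missed a truly similar pair
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Raising the threshold trades recall for precision, moving F1 as in Fig. 3.
print(prf1([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 0], 0.77))  # (1.0, 1.0, 1.0)
```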
Collaborative Analysis on Code Structure and Semantics
Fig. 3. Results of ablation experiments
4.6 Visualized Analysis of Function Call Structure

We visualize the codes to examine how the function call structure semantics and the text semantics differ across problems. Different colors indicate different problems, and dots indicate the semantic features of the code text or of the function call structure, as shown in Fig. 4. In Fig. 4(a) and 4(c), the codes in the GCJ and POJ datasets are treated as plain text, and semantic vectors are obtained using BERT. In Fig. 4(b) and 4(d), the function call structures of the codes in the GCJ and POJ datasets are input to the graph autoencoder for graph semantic encoding. The text semantic vectors and the function call structure semantic vectors are projected onto a two-dimensional plane using t-SNE [23] to visually observe the differences in text semantics and in function call structure semantics across problems. As shown in Fig. 4(a) and 4(c), there is a large overlap in the code text semantics encoded by BERT, indicating that the text semantics of different codes are similar, as the data types, modifiers, library function names and built-in class names are basically fixed; hence there is little variability among the texts of different codes. As shown in Fig. 4(b), there is almost no overlap in the function call structure semantics extracted on the GCJ dataset, indicating that the function call structures of codes solving different problems differ greatly, while those of codes solving the same problem are similar. As shown in Fig. 4(d), the function call structure semantics extracted by our method on the POJ dataset show a small overlap, as the codes in the POJ dataset contain an average of 15.7 functions while those in the GCJ dataset contain an average of 23.15 functions.
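The projection step can be sketched as follows. The paper projects BERT text vectors and graph-encoder structure vectors with t-SNE [23]; to keep this illustration dependency-light we substitute a plain PCA projection via numpy SVD (a deliberate simplification, not the paper's method), and the input vectors are random stand-ins rather than real code embeddings.

```python
import numpy as np

def project_2d(vectors):
    """Project high-dimensional semantic vectors to 2-D (PCA via SVD).

    The paper uses t-SNE for Fig. 4; PCA is a simpler stand-in that still
    exposes the global spread of the embedding clusters.
    """
    x = np.asarray(vectors, dtype=float)
    x = x - x.mean(axis=0)            # center the embeddings
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T               # coordinates on the top-2 components

rng = np.random.default_rng(0)
codes = rng.normal(size=(30, 64))     # 30 stand-in 64-d semantic vectors
points = project_2d(codes)
print(points.shape)                   # (30, 2)
```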
The smaller the number of functions, the simpler the function call structure and the less information is available, indicating that our method can be affected by the number of functions. As shown in Fig. 4(d), the green color indicates the problem of sorting ages using the quick sort algorithm, and the green dots indicate the function call structure semantics of the age-sorting codes. A, B and C indicate the codes "9-70.c", "9-75.c", and "9-36.c", respectively. The purple color indicates the factorial summation problem, and
Fig. 4. The t-SNE distribution of code semantics (Color figure online)
the purple dots indicate the function call structure semantics of the factorial summation codes. D indicates the code "1-39.c". Code C encapsulates the person number and age as a structure and then performs quick sort based on the structure. D solves the factorial summation problem, and factorial summation is generally recursive, so the function call structure semantics of A and D are different.
5 Conclusion

In this paper, we propose a collaborative analysis method combining code structure and function semantics. First, we create the function call structure graph corresponding to the code, and the structure semantics are encoded using a graph autoencoder. Then, the call information of library functions and built-in class names is used for the function semantics. Finally, the structure and function semantics are combined to analyze code similarity collaboratively. The experimental results show that our method outperforms comparison methods based on different technical features. In future work, we plan to integrate semantic and syntactic information for code comparison analysis.
Acknowledgments. This work was supported by the Major Project of NSF Shandong Province under Grant No. ZR2018ZB0420 and the Key Research and Development Program of Shandong Province under Grant No. 2019JZZY010107.
References

1. Liao, Z., Zhao, Y., Liu, S., et al.: The measurement of the software ecosystem's productivity with GitHub. Comput. Syst. Sci. Eng. 36(1), 239–258 (2021)
2. Wu, Y., Zou, D., Dou, S., et al.: SCDetector: software functional clone detection based on semantic tokens analysis. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pp. 821–833 (2020)
3. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035 (2017)
4. Sajnani, H., Saini, V., Svajlenko, J., et al.: SourcererCC: scaling code clone detection to big code. In: Proceedings of the 38th International Conference on Software Engineering, pp. 1157–1168 (2016)
5. White, M., Tufano, M., Vendome, C., et al.: Deep learning code fragments for code clone detection. In: 2016 IEEE/ACM 31st International Conference on Automated Software Engineering, pp. 87–98 (2016)
6. Yu, H., Lam, W., Chen, L., et al.: Neural detection of semantic code clones via tree-based convolution. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension, pp. 70–80 (2019)
7. Zhao, G., Huang, J.: DeepSim: deep learning code functional similarity. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 141–151 (2018)
8. Roy, C.K., Cordy, J.R.: NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: 2008 16th IEEE International Conference on Program Comprehension, pp. 172–181 (2008)
9. Kodhai, E., Kanmani, S., Kamatchi, A., et al.: Detection of type-1 and type-2 code clones using textual analysis and metrics. In: 2010 International Conference on Recent Trends in Information, Telecommunication and Computing, pp. 241–249 (2010)
10. Jia, X., Ma, R., Liu, S., et al.: BinDeep: a deep learning approach to binary code similarity detection. Expert Syst. Appl. 168, 114348 (2021)
11. Rattan, D., Bhatia, R.K., Singh, M.: Software clone detection: a systematic review. Inf. Softw. Technol. 55(7), 1165–1199 (2013)
12. Rattan, D., Kaur, J.: Systematic mapping study of metrics based clone detection techniques. In: Proceedings of the International Conference on Advances in Information Communication Technology and Computing, pp. 1–7 (2016)
13. Roy, C.K., Cordy, J.R.: A survey on software clone detection research. Queen's Sch. Comput. TR 541(115), 64–68 (2007)
14. Sheneamer, A., Kalita, J.: Code clone detection using coarse and fine-grained hybrid approaches. In: 2015 IEEE 7th International Conference on Intelligent Computing and Information Systems, pp. 472–480 (2015)
15. Sudhamani, M., Rangarajan, L.: Code clone detection based on order and content of control statements. In: 2016 2nd International Conference on Contemporary Computing and Informatics, pp. 59–64 (2016)
16. Hu, Y., Wang, H., Zhang, Y., et al.: A semantics-based hybrid approach on binary code similarity comparison. IEEE Trans. Softw. Eng. 47(6), 1241–1258 (2019)
17. Zhang, F., Li, G., Liu, C., et al.: Flowchart-based cross-language source code similarity detection. Sci. Program. 2020, 1–15 (2020)
18. Haq, I.U., Caballero, J.: A survey of binary code similarity. ACM Comput. Surv. 54(3), 1–38 (2021)
19. Wang, W., Li, G., Ma, B., et al.: Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering, pp. 261–271 (2020)
20. Svajlenko, J., Islam, J.F., Keivanloo, I., et al.: Towards a big data curated benchmark of inter-project code clones. In: 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 476–480 (2014)
21. Mou, L., Li, G., Zhang, L., et al.: Convolutional neural networks over tree structures for programming language processing. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1287–1293 (2016)
22. Mehrotra, N., Agarwal, N., Gupta, P., et al.: Modeling functional similarity in source code with graph-based Siamese networks. IEEE Trans. Softw. Eng. 48, 3771–3789 (2021)
23. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Temporal Planning-Based Choreography from Music

Yuechang Liu1, Dongbo Xie1, Hankz Hankui Zhuo2(B), Liqian Lai1, and Zhimin Li1

1 Jiaying University, Meizhou, China {ycliu,lqlai}@jyu.edu.cn
2 Sun Yat-sen University, Guangzhou, China [email protected]
Abstract. Dancing robots have attracted immense attention from numerous sources. Despite the success of previous systems in automated choreography to make robots dance in response to external stimuli, they are often either limited to a pre-defined set of movements or lack a nuanced understanding of the relationship between dancing motions. In this paper, we propose a temporal planning-based choreography approach, which considers choreography with music as a temporal planning problem, builds planning models automatically, and solves the problem with the learned models. In our experiment, we exhibit that our approach is both effective and efficient with respect to dance diversity, user scores, and running time.
Keywords: AI Planning · Mixed Planning · Choreography

1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 89–102, 2023. https://doi.org/10.1007/978-981-99-2385-4_7

Dancing robots have attracted considerable attention from various areas. For example, SONY presented a humanoid robot called QRIO [11]. Multiple QRIO units can dance in a highly coordinated fashion by imitating human movements. Nakaoka et al. explored a motion capturing system to teach a robot, called HRP-2, to perform a Japanese folk dance [20]. Despite the success of previous systems, they either use a predefined set of movements (in conjunction with music) or have little variance in response to external stimuli. To alleviate the reliance on predefined movements and improve the variation in dances, probabilistic graphical model-based approaches, such as Markov chains [5], were proposed, which enable legged robots to dance in synchronization with pieces of music in a diverse fashion. Such robots create a choreographed dance sequence by picking motions from a dance motion library based on previously picked dance motions and the current music tempo. However, building effective probabilistic graphical models that consider various (potentially innovative) dances is often difficult. Even though there have been approaches proposed to learn dancing models automatically, such as [27], which
consider the beat-motion synchronization and limited posture relations, such approaches ignore humans' knowledge in choreography. These learning-based approaches are also restricted by music-dance data sets, which are not easy to collect. The dances created by those approaches are often far from human understanding, i.e., they are unexplainable to humans. By observation, we found certain motions that have hard constraints. For example, two motions may have to be executed in a specific order ("standing up" before "jumping"); two motions may have to be executed in parallel ("raising" the right hand and "moving" the right leg simultaneously); or two motions cannot be executed sequentially (after moving one leg up, the robot cannot move the other leg up; before stepping forward, the robot has to stand up if it sits on the ground). These constraints can be naturally encoded into the representations of action models in the Planning Domain Definition Language (PDDL) [18]. Intuitively, the action models in Fig. 1 can be used to describe constraints among dance motions.
Fig. 1. Action Constraint in PDDL.
Fig. 2. Plan2Dance Architecture.
Furthermore, some choreography knowledge, such as "action A usually comes after B, but not always," can be represented via preferences in the form of a PDDL3 model [10], as shown below: (preference (pf (> (comes-next-to A B) 1))), where pf indicates a symbol to control the metric. In this paper, we propose a novel approach, called Plan2Dance, to perform choreography automatically based on music. We first build a set of temporal action models based on the relationships between basic dancing motions, considering the temporal requirements of dancing motions as well as choreography constraints and preferences for various kinds of music. We then analyze the input music to obtain a series of features such as beats, tempos, and music segments. The result is converted into planning problems, which are solved by an off-the-shelf planner using the temporal action models built in the first step.
This work involves two separate challenges. First, building PDDL action models based on basic dancing motions involves understanding complex domain knowledge related to dancing. The number of basic motions should also be balanced; we need more motions to create more varied dances, although this lowers the planning efficiency. Second, a piece of music must be converted into a planning problem (including an initial state and the goal) with preferences and constraints being taken into account.
2 Related Works
Dance choreography by humanoid robots that respond to external stimuli has been widely studied. Some researchers call the reaction of robots to external sensible stimuli "inter-modality mapping" [15]. Such mapping can be in the form of vision-to-dance, music/sound-to-dance, or even trajectory-to-dance (i.e., when a human dancer leads the robots) [2]. Dancing robots have been thoroughly studied by industrial and academic communities. Examples include QRIO from SONY [11], HRP-2, a robot that dances to traditional Japanese folk music [20], and "Keepon," a robot that can generate motions from the movements of objects [22], and so on. The dance choreographies for these robots are based either on genetic algorithms or on neural networks [4,27]. With the advancement of machine learning research, many methods have been proposed for music-driven choreography [8,16,21]. GrooveNet [1], Chor-rnn [7], Learning to Dance [9], and Gui et al. [14] generate individual choreography from features extracted from motions and use a recurrent network to generate new dance motions. ChoreoMaster [6], Li et al. [17], and Tang et al. [26] trained an LSTM (Long Short-Term Memory) autoencoder to generate dance motions directly from music features. Zhuang et al. [29] adopted WaveNet [23] as the motion generation model, which can generate dance motions of different dance types from the same model. Sun et al. [25] used a generative adversarial network (GAN)-based cross-modal association model that aims at creating the desired dance sequence from the input music. ChoreoNet [28] differs from typical network methods in that it uses the CAU (choreographic action unit) as the basic unit of choreography rather than a direct mapping. Their framework first devises a CAU prediction model to learn the mapping relationship between music and CAU sequences, and finally converts the CAU sequence into continuous dance motions.
This kind of approach can produce only limited dance movements, as all motions are retrieved from a pre-constructed database or graph model [19]; this is useful for generating dance movements for a specific piece of music, but is difficult to apply to different musical pieces. Moreover, none of these approaches deeply involve dance experts; they use simple relations between poses to represent all correlations between motions in the choreography, and thus lose crucial information about how a real dancer thinks and performs in a dance. We therefore propose a knowledge-based approach, named Plan2Dance, which builds motion models with explicit inter-motion correlations and details derived from real dancers.
The Plan2Dance approach proposed in this paper is knowledge-based and intensively uses a temporal planning method upon PDDL2.1 and PDDL3 models. Compared to PDDL [18], PDDL2.1 [10] provides rich temporal-feature representation, and PDDL3 further supports the description of constraints and preferences in planning. Using these features, Plan2Dance can represent domain knowledge about dance in choreography and, furthermore, remains easy to understand.
3 Problem Definition
The aim of our approach is to build action sequences that can be converted into dancing motion scripts executable by robots, given music as input. Formally, we define our problem as a tuple ⟨ML, AA, P, MC, δ⟩, where:
– ML indicates a motion library composed of a set of basic dance motions;
– AA indicates an audio analyzer for audio/music files;
– P indicates a planner that processes PDDL3 models;
– MC indicates a motion compiler that transforms a dancing plan into a motion script for the robot;
– δ indicates the tolerated time delay in music-motion synchronization.
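The tuple can be mirrored directly in code. The sketch below is a hypothetical skeleton (all names and stub callables are ours, not from the Plan2Dance sources) showing how the five components compose into one pipeline:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plan2DanceProblem:
    """The tuple <ML, AA, P, MC, delta> from the problem definition."""
    motion_library: dict       # ML: name -> list of (time, gear-vector) frames
    audio_analyzer: Callable   # AA: music file -> segments/beats/tempo
    planner: Callable          # P:  (domain, problem) -> plan
    motion_compiler: Callable  # MC: plan -> robot motion script
    delta: float               # maximum music-motion synchronization delay

    def choreograph(self, music_file: str) -> str:
        features = self.audio_analyzer(music_file)
        plan = self.planner("dance-domain.pddl", features)
        return self.motion_compiler(plan)

# Stub components just to exercise the pipeline shape.
p2d = Plan2DanceProblem(
    motion_library={"wave": [(0.0, [0] * 16), (0.5, [10] * 16)]},
    audio_analyzer=lambda f: {"beats": [0.5, 1.0], "tempo": 120},
    planner=lambda dom, prob: ["wave", "dummy", "wave"],
    motion_compiler=lambda plan: ";".join(plan),
    delta=0.1,
)
print(p2d.choreograph("song.wav"))   # wave;dummy;wave
```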
4 Our Plan2Dance Approach
The framework of Plan2Dance is shown in Fig. 2. The high-level idea of Plan2Dance is to first build a domain model in PDDL [18], automatically convert the input music into planning problems, and feed the domain model and planning problems to an off-the-shelf planner (such as OPTIC-CLP [3]) to generate solution plans. Finally, we convert the plans into sequences of low-level motion scripts that the robot can dance with. Before the system runs, an action base that stores all predefined basic actions is constructed; accordingly, a planning domain must be ready. When a piece of music or a sound file is provided as input, it is first fed into the audio analyzer AA, where the file is transformed, segmented, and clustered. After the segmentation, the temporal information attached to the candidate actions is collected and represented in the planning problem file. The generated problem file, along with the domain file, is then fed into an off-the-shelf planner, such as OPTIC-CLP [3], from which a solution plan is generated and transformed into the dancing action script for the robot to execute. The procedure is illustrated in Fig. 3.
Fig. 3. System Procedure.
4.1 Step 1: Dancing Domain Modeling
A motion library (ML) describes a set of motions, ML = {m1, m2, ..., mT}; each motion is represented by a set of frames, mi = {frame1, frame2, ..., frame|mi|}. Each frame defines the pose at a certain instant, framei = ⟨ti, Gi⟩, where ti is the time point and Gi is a vector of gear positions [g1, g2, ..., gK], which defines the pose of the robot. A dance is composed of temporal transformations of a series of motions. In this paper, we use the humanoid robot "ROBOTIS MINI" from ROBOTIS [24] to evaluate our approach. There are 16 gears in each ROBOTIS MINI robot (thus K = 16). The durative actions in the form of PDDL2.1 are generated from the motions in ML. The detailed procedure is as follows:
– For each motion, a distinct constant si (of type state) is defined to represent the robot status resulting from the action's execution.
– The zero-ary predicate is_body_free() is generated to record the bodily state of the robot. For most actions, is_body_free() is a required precondition. In the at-start effect, is_body_free() is set to false, and it turns back to true in the at-end effect.
– Each motion is defined to have a standard duration (sd), defined as sd = max{ti | ⟨ti, Gi⟩ ∈ mj, ∃mj ∈ ML}. Each action is allowed to act either slower or faster than the standard setting. To reflect this, a function action-rate(?rate) is defined in the domain, and the duration of the action is defined by: :duration (= ?duration (* sd (action-rate ?rate))).
– To record the total execution time from the beginning of the dance, the function dance-time() is defined. It is updated with each action as follows: (increase (dance-time) (* sd (action-rate ?rate))).
– In dance choreography, it is sometimes acceptable to have a null action (doing nothing) during the music. For such flexibility, a special action, the Dummy Action (DA), and a function that records the total time of DA execution are defined in the domain. Meanwhile, :duration (= ?duration δ) and (increase (dumb-total-time) δ) in the at-end effects are asserted.

Dummy Action. The dummy action is a special action defined in the domain in which no motion is actually performed. It is needed when a certain section of the dance should be left "blank", and it is also used to adjust the dance rhythm in a planning problem via plan preferences.

Motion Stability. Motion stability is a basic requirement for dancing robots in general. It is, however, far from an easy task, because (1) it is hard to define a reliable physical model, with limited sensors, that captures the stability of the robot; and (2) stability heavily depends on the hardware, which makes it difficult to define general fall-prevention knowledge for different robots. By observation, we found that instability is usually caused by the following factors:
1. A sudden increase in amplitude during motion switching, e.g., when the robot suddenly stands up and waves its hand from a squatting position.
2. A collision during motion composition, e.g., when the robot is made to step forward while its arms are flapping up and down (its arms will collide with its chest).
3. An over-transformation in space during motion switching.
For cases 1 and 2, we define transition actions between the action pairs that cause falling. For case 3, we classify the actions into categories along the X, Y, and Z axes, each with down, middle, and up subcategories.
When an action in the "down" category switches to one in the "up" category, a transition action is defined to avoid switching directly between them.
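The per-motion action generation described above can be sketched as a small PDDL emitter. This is our illustrative reconstruction, not the Plan2Dance generator itself: the identifier names follow the text, sd is taken here as the motion's latest frame timestamp (one reading of the sd definition), and the is_body_free predicate brackets the at-start/at-end effects.

```python
def motion_to_durative_action(name, frames):
    """Emit a PDDL2.1 durative action for one motion.

    frames: list of (t_i, gear_vector) pairs; sd is the latest timestamp.
    """
    sd = max(t for t, _ in frames)  # standard duration of this motion
    return f"""(:durative-action {name}
  :parameters (?rate - rate)
  :duration (= ?duration (* {sd} (action-rate ?rate)))
  :condition (and (at start (is_body_free)))
  :effect (and (at start (not (is_body_free)))
               (at end (is_body_free))
               (at end (increase (dance-time)
                                 (* {sd} (action-rate ?rate))))))"""

act = motion_to_durative_action("handsWaveDown", [(0.0, [0] * 16), (1.2, [30] * 16)])
print(act)
```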
4.2 Step 2: Music Analysis
Once a music file is fed into Plan2Dance, the analyzer module is invoked. The following tasks are performed during the analysis:
– Music segmentation: The music is divided into segments. Such segments are used for music-dance synchronization and/or for parallel processing.
– Amplitude detection: The amplitude of the music affects the motion amplitude and velocity. For example, stronger amplitudes usually correspond to stronger and faster action responses.
– Beat and tempo detection: A beat is a local maximum of the amplitude of a music segment, while tempo is the number of beats per minute. Similar to the amplitude, the tempo determines the number of motions associated with a music segment.
– Repeatability: Motion repetition is very common in dance. In Plan2Dance, whether a motion or dance segment is selected again depends on the similarity between the current music segment and the one for which it was first selected.
Unlike automatic choreography learning approaches like [5], which focus on numerical computation of motion selection and motion features, Plan2Dance depends on action reasoning related to metric optimization. This results in differences in the feature definitions. In Plan2Dance, the music analysis task is accomplished using the pyAudioAnalysis library [13]. The aim of music analysis is to generate a set of planning problems, which will be used to compute dance plans in the next step.

Music Segmentation. Provided a music/audio file, music segmentation in AA generates a set of music segments seg = {seg1, seg2, ..., segl}. For each segi, let T(segi) denote the instant at which the segment ends.
For Plan2Dance, music segmentation is used to obtain the time points of the pieces of music that are critical for music-dance synchronization, i.e., to obtain the music features (short- and mid-term features in [13]) that are essential for action selection while planning, and to speed up the planning procedure (parallel computation in a divide-and-conquer manner). Two segmentation approaches are investigated in this study:
– Fixed-sized segmentation: Given a fixed length, the music is divided into segments of this length. For each segment, short-term features (34 features for every 50 ms) and mid-term features (37 features for 1–10 s of segments) are extracted and classified under a supervised model such as an SVM. Finally, adjacent segments classified into the same class are merged into one.
– Silence removal-based segmentation: Silences in the music indicate a natural separation of segments. Once the silences are detected and removed, the remaining portions are the segments to be considered.

Audio Similarity and Motion Repetition. The repetition of dance segments is based on similarity detection. When two music segments are regarded as similar (over a certain threshold), the corresponding dance segments are designed to be the same. Segment similarity is based on the self-similarity matrix computation in the pyAudioAnalysis library [13]. For example, in a self-similarity matrix for the song Space by Wingtip and Youngr, the highlighted diagonal segments with warmer colors, i.e., the segments (100.0 s–120.0 s) and (120.0 s–140.0 s), show high similarity.

4.3 Step 3: Planning Problem Generation
Figure 5 shows a generated planning problem, "robot-1". Using the temporal features, the connections between actions can be easily defined and the temporal aspects of dances are kept under control. The preferences and plan metrics are defined in the planning problem. The total dance duration must be specified in the goals (here, we allow the total time to float in a range from 11.0 to 12.5). Note again that the temporal values in the problem are calculated in the preceding music analysis phase.

Motion Velocity. For the preliminary version of Plan2Dance, we provide three options (medium, slow, and fast) for users to express their preferred motion velocity. The medium mode is the standard predefined velocity, while the slow and fast modes are 0.85 and 1.15 times the medium mode, respectively; all of these can be configured by the users. They are defined as objects of type "rate" and can be set in the preferences in the goal section of the planning problem.

Constraints and User Preferences. The constraints and preferences supported in PDDL3 allow people to express their inclinations about the resulting plan trajectory. Specifically, a dance choreographer can express what makes a dance better, and the planner will then compute a plan based on these criteria. For the current version of Plan2Dance, we define four preferences, I-IV, as follows (Fig. 4).
Fig. 4. Durative action: handsWaveDown and Dummy Action.
Fig. 5. Problem definition.
Preference I: Best Rate (BR). The best-rate preference allows users to express their preference regarding the dance rhythm (slow, medium, or fast). For example, the preference for fast-rhythm dance is defined as: (forall (?s - state) (preference p0 (best-rate ?s fast))), with (:metric minimize (is-violated p0)).

Preference II: Synchronization Control (SC). In some scenarios, dancing motions need not synchronize strictly with certain beats of the music. The synchronization can be controlled through beat tracking as follows: (forall (?b - beat ?s - state) (preference p1 (beat-satisfy ?b ?s))).

Preference III: Beat Ignoring (BI). As mentioned above, not every beat needs to be associated with a dance move; some beats can be ignored, which dummy actions in the dance plan can accomplish. This preference is defined as: (preference p2 (>= (/ (dumb-time) (dance-total-time)) (dumb-prop))), which relates the proportion of total dummy action time over the total dancing time to the threshold dumb-prop.

Preference IV: Coherent Action (CA). To express knowledge like "A is usually followed by B, but not necessarily", the preference can be defined as: (preference p3 (> (coherent-satisfy s18 s19) 1)), with (:metric minimize (is-violated p3)).

Furthermore, the total length of music and dance is ensured by adding the following condition to the goal: (> (dance-time) n - e) and (< (dance-time) n + e), where n denotes the length of dance the user wants and e is the tolerance.
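Assembling these pieces into a problem file is mechanical string work. The helpers below are a hypothetical sketch of such an emitter (the function names are ours; the emitted expressions follow the forms given in the text, with n and e as above):

```python
def best_rate_preference(rate):
    """Emit the Preference I (BR) expression for a given rhythm rate."""
    return f"(forall (?s - state) (preference p0 (best-rate ?s {rate})))"

def duration_goal(n, e):
    """Constrain total dance time to [n - e, n + e] as in the goal section."""
    return (f"(> (dance-time) {n - e:.2f}) "
            f"(< (dance-time) {n + e:.2f})")

print(best_rate_preference("fast"))
print(duration_goal(11.75, 0.75))   # (> (dance-time) 11.00) (< (dance-time) 12.50)
```

With n = 11.75 and e = 0.75, the emitted window matches the 11.0 to 12.5 range used in the example problem above.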
Algorithm 1. MARKOV-DANCE
Require: B[1...m]: the detected beats; BPM[1...m]: the BPM extracted for all beats; m: total number of detected beats; R[1...n]: the repeatability of the basic motions; n: total number of basic motions; BPM_max[1...n]: the maximum appropriate BPM of each basic motion; BPM_min[1...n]: the minimum appropriate BPM of each basic motion.
Ensure: MOTION[1...m]: the selected motion for each beat.
1:  for i = 1 to m do
2:      Set idx = 1
3:      Set w = 0
4:      for j = 1 to n do
5:          if BPM[i] >= BPM_min[j] and BPM[i] <= BPM_max[j] then
6:              if i = 1 or MOTION[i-1] = j then
7:                  Set r = 1
8:              else
9:                  Set r = R[j]
10:             end if
11:             if w < r * BPM[j] then
12:                 Set idx = j
13:                 Set w = r * BPM[j]
14:             end if
15:         end if
16:     end for
17:     Set MOTION[i] = idx
18: end for
19: return MOTION
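Algorithm 1 translates almost line for line into Python. The sketch below follows the pseudocode verbatim, including its weight term r * BPM[j] (which indexes BPM by the motion index j, so it assumes n <= m); indices returned are 1-based as in the paper.

```python
def markov_dance(bpm, r, bpm_max, bpm_min):
    """Baseline beat-to-motion selection (Algorithm 1, MARKOV-DANCE).

    bpm: BPM per detected beat; r: repeatability per basic motion;
    bpm_min/bpm_max: appropriate BPM range per basic motion.
    Returns the selected motion index (1-based) for each beat.
    """
    m, n = len(bpm), len(r)
    motion = []
    for i in range(m):
        idx, w = 1, 0.0
        for j in range(1, n + 1):
            if bpm_min[j - 1] <= bpm[i] <= bpm_max[j - 1]:
                # Pseudocode: r = 1 on the first beat or when the previous
                # motion already equals j, else the repeatability weight R[j].
                rr = 1.0 if i == 0 or motion[i - 1] == j else r[j - 1]
                if w < rr * bpm[j - 1]:  # weight term as written in Algorithm 1
                    idx, w = j, rr * bpm[j - 1]
        motion.append(idx)
    return motion

# Two beats, two candidate motions with disjoint appropriate BPM ranges.
print(markov_dance([120, 60], [2.0, 2.0], [140, 80], [100, 50]))   # [1, 2]
```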
5 Experiment
Plan2Dance is implemented in Python, running on Ubuntu Linux 18.0 with an AMD CPU at 2.0 GHz and 6 GB of DDR3 RAM. The humanoid robot ROBOTIS MINI [24] from the ROBOTIS company is adopted. The core planner used is OPTIC-CLP [3]. Before running the experiments, we built a music library with 30 songs: 20 modern, five Tibetan, and five ballet. A motion library was constructed with 88 basic motions for the modern songs, 20 for Tibetan, and another 20 for ballet. For the evaluation of the generated dances, we developed a website that allows public grading of the dances generated for 10 songs (accessible at www.dongbox.top/Plan2Dance). For comparison, we implemented the Markov-based approach proposed in [5] (referred to as "MARKOV-DANCE"; see Algorithm 1). We compare Plan2Dance with MARKOV-DANCE in terms of quality (based on user scores), motion diversity, and running time.
Temporal Planning-Based Choreography from Music
Fig. 6. Dance Diversity. Fig. 7. Dance Scores. Fig. 8. Running Time.

5.1 Motion Diversity
Motion diversity reflects the number of distinct actions chosen from all candidate actions when the choreography approach runs on a set of music pieces. Formally, the motion diversity between two dances D1 and D2 over the motion library L is defined as:

Diversity(D1, D2) = 1 - |D1 ∩ D2| / |L|

The diversity between a dance Dx and a dance set DD = {D1, D2, ..., Dl} is defined as:

Diversity(Dx, DD) = \frac{1}{l} \sum_{i=1}^{l} Diversity(Dx, D_i)
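The two diversity measures can be computed in a few lines of Python (an illustrative sketch with dances represented as sets of motion indices; the names are ours, not the authors' code):

```python
def diversity(d1, d2, library_size):
    """Diversity between two dances, each a set of motion indices:
    Diversity(D1, D2) = 1 - |D1 ∩ D2| / |L|."""
    return 1 - len(set(d1) & set(d2)) / library_size

def diversity_to_set(dx, dances, library_size):
    """Average diversity between dance Dx and a dance set DD."""
    return sum(diversity(dx, d, library_size) for d in dances) / len(dances)
```

Two dances sharing few motions relative to the library size score close to 1; identical motion sets over a small library score close to 0.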
Figure 6 shows the results with respect to diversity: our Plan2Dance generates more diverse dances than the Markov-based approach, especially for modern music. This indicates that Plan2Dance is able to capture small differences and generate diverse plans, while the Markov-based approach tends to ignore those small differences and generates less diverse dances.

5.2 User Study
For all the generated dances used in the experiments, we posted the dance videos on an open-access website and asked users to grade the dance performances. The data in Fig. 7 shows the average score (on a 10-point scale) over 200 users, who generated 13,000 score records. The data shows little variation between planning with different preferences. However, the dances generated by the MARKOV-DANCE algorithm scored evidently lower than the others. The lower scores can be attributed to a lack of diversity: the dances produced by MARKOV-DANCE contain large numbers of similar motions, which users find repetitive and lacking in innovation.
5.3 Running Time
Figure 8 gives the running times (in seconds) of our approach under the four preferences (BR, BI, SC, and CA) and of the MARKOV-DANCE algorithm. From the figure, we can see that even though Plan2Dance attempts to compute logic-based explicable dances (or plans), its running time is close to that of the unexplainable MARKOV-DANCE.
6 Conclusion
In this paper, we present Plan2Dance, a system for choreographing dance given a music file as input. Plan2Dance is based on temporal planning in PDDL2.1 [10] and on constraints and preferences in PDDL3 [12]. A set of action models and planning problems is generated from an analysis of the input music. The planner is then invoked to generate a dance plan, which is further transformed into motion scripts that drive the robot's dance. Experiments were conducted to evaluate the CPU time cost of the planning process, dance motion diversity, and dance scores from users via an evaluation website, compared against the Markov-based approach. The results show that while dance generation by Plan2Dance is relatively slower than the Markov algorithm, the generated dance sequences have higher motion diversity and obtain better user scores. Thus, although the Markov-based algorithm is good at real-time dance generation, it is not good at choreography. For artistic and innovative choreography, domain knowledge from dance choreographers must be encoded into the algorithms driving dancing robots, which Plan2Dance achieves in the planning paradigm.
References

1. Alemi, O., Françoise, J., Pasquier, P.: GrooveNet: real-time music-driven dance movement generation using artificial neural networks. Networks 8(17), 26 (2017)
2. Aucouturier, J.J.: Cheek to chip: dancing robots and AI's future. IEEE Intell. Syst. 23(2), 74–84 (2008)
3. Benton, J., Coles, A., Coles, A.: Temporal planning with preferences and time-dependent continuous costs. In: ICAPS (2012)
4. Berman, A., James, V.: Learning as performance: autoencoding and generating dance movements in real time. In: Liapis, A., Romero Cardalda, J.J., Ekárt, A. (eds.) EvoMUSART 2018. LNCS, vol. 10783, pp. 256–266. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77583-8_17
5. Bi, T., Fankhauser, P., Bellicoso, D., Hutter, M.: Real-time dance generation to music for a legged robot. In: IROS, pp. 1038–1044 (2018)
6. Chen, K., et al.: ChoreoMaster: choreography-oriented music-driven dance synthesis. ACM Trans. Graph. 40(4), 1–13 (2021)
7. Crnkovic-Friis, L., Crnkovic-Friis, L.: Generative choreography using deep learning. arXiv preprint arXiv:1605.06921 (2016)
8. Fan, R., Xu, S., Geng, W.: Example-based automatic music-driven conventional dance motion synthesis. IEEE Trans. Vis. Comput. Graph. 18(3), 501–515 (2011)
9. Ferreira, J.P., et al.: Learning to dance: a graph convolutional adversarial network to generate realistic dance motions from audio. Comput. Graph. 94, 11–21 (2021)
10. Fox, M., Long, D.: PDDL2.1: an extension to PDDL for expressing temporal planning domains. J. Artif. Intell. Res. 20, 61–124 (2003)
11. Geppert, L.: Qrio, the robot that could. IEEE Spectr. 41(5), 34–37 (2004)
12. Gerevini, A., Long, D.: Preferences and soft constraints in PDDL3. In: Workshop on Preferences and Soft Constraints in Planning, ICAPS 2006 (2006)
13. Giannakopoulos, T.: pyAudioAnalysis: an open-source Python library for audio signal analysis. PLoS ONE 10(12), 1–17 (2015)
14. Gui, L.-Y., Wang, Y.-X., Liang, X., Moura, J.M.F.: Adversarial geometry-aware human motion prediction. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 823–842. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_48
15. Hattori, Y., Kozima, H., Komatani, K., Ogata, T., Okuno, H.G.: Robot gesture generation from environmental sounds using inter-modality mapping. In: International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, vol. 123, pp. 139–140 (2006)
16. Lee, J., Kim, S., Lee, K.: Listen to dance: music-driven choreography generation using autoregressive encoder-decoder network. arXiv preprint arXiv:1811.00818 (2018)
17. Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with AIST++: music conditioned 3D dance generation. arXiv preprint arXiv:2101.08779 (2021)
18. McDermott, D., Committee, T.A.P.C.: PDDL - the planning domain definition language. Technical report, Yale University (1998). www.cs.yale.edu/homes/dvm
19. Min, J., Chai, J.: Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Trans. Graph. (TOG) 31(6), 1–12 (2012)
20. Nakaoka, S., Nakazawa, A., Kanehiro, F., Kaneko, K., Morisawa, M., Ikeuchi, K.: Task model of lower body motion for a biped humanoid robot to imitate human dances. In: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3157–3162 (2005)
21. Ofli, F., Erzin, E., Yemez, Y., Tekalp, A.M.: Learn2Dance: learning statistical music-to-dance mappings for choreography synthesis. IEEE Trans. Multimedia 14(3), 747–759 (2011)
22. Ogata, T., Hattori, Y., Kozima, H., Komatani, K., Okuno, H.: Generation of robot motions from environmental sounds using inter-modality mapping by RNNPB (2006)
23. van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
24. ROBOTIS: ROBOTIS MINI robot (2019). http://www.robotis.us/robotis-mini-intl/. Accessed 8 Nov 2019
25. Sun, G., Wong, Y., Cheng, Z., Kankanhalli, M.S., Geng, W., Li, X.: DeepDance: music-to-dance motion choreography with adversarial learning. IEEE Trans. Multimedia 23, 497–509 (2020)
26. Tang, T., Jia, J., Mao, H.: Dance with melody: an LSTM-autoencoder approach to music-oriented dance synthesis. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1598–1606 (2018)
27. Wu, R., et al.: Towards deep learning based robot automatic choreography system. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds.) ICIRA 2019. LNCS (LNAI), vol. 11743, pp. 629–640. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27538-9_54
28. Ye, Z., et al.: ChoreoNet: towards music to dance synthesis with choreographic action unit. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 744–752 (2020)
29. Zhuang, W., Wang, C., Xia, S., Chai, J., Wang, Y.: Music2Dance: music-driven dance generation using WaveNet. arXiv preprint arXiv:2002.03761 (2020)
An Adaptive Parameter DBSCAN Clustering and Reputation-Aware QoS Prediction Method

Yajing Li, Jianbo Xu(B), Guozheng Feng, and Wei Jian

Hunan University of Science and Technology, Xiangtan 411201, China
[email protected]
Abstract. QoS prediction for Web services is a hot research topic in service computing, driving the development of service recommendation techniques. Collaborative filtering algorithms have been widely used in QoS prediction. As a group-intelligence algorithm, collaborative filtering relies on a large amount of historical QoS data, so the credibility of that data must be ensured first. Previous work attempts to identify outliers through clustering algorithms; however, historical QoS data is very sparse and contains a large number of services with varying data distributions, making it difficult to determine a number of clusters that applies to all services, so classification errors often occur. To solve these problems, the method proposed in this paper first uses the RSVD model, which accounts for user and service deviations, to fill the matrix and alleviate the impact of data sparsity, with the Adagrad algorithm used to optimize accuracy. Outliers are then detected in two stages. In the first stage, a DBSCAN clustering algorithm with adaptive parameters is applied: suitable parameters are matched for clustering according to the data distribution of each service, the number of clusters and individual outliers are discovered automatically, and similar users are obtained. The second stage removes common-sense outliers. The number of outliers provided by each user in both stages is counted to identify untrustworthy users. Finally, QoS values are predicted based on the trusted data. The experimental evaluation shows that, compared with other baseline methods, the proposed method greatly improves prediction accuracy.

Keywords: QoS prediction · Matrix decomposition · Reputation-aware · DBSCAN clustering
1 Introduction

Web services are self-contained, self-describing, modular application systems that can be published, located, and invoked from anywhere on the Web. The rapid increase in the number of Web services has been beneficial for building diverse and multi-functional service-oriented systems, but it has also raised users' expectations when choosing services. How can a service with better performance be chosen among many services with similar functions? Quality of Service (QoS), which reflects service availability and includes metrics such as response time and throughput, becomes an important factor in service selection. Affected by many factors, the QoS values obtained by different users calling the same Web service differ considerably, so it is very important to obtain personalized QoS values

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 103–117, 2023. https://doi.org/10.1007/978-981-99-2385-4_8
104
Y. Li et al.
for different users. However, for a single user it is impractical to call all Web services to obtain personalized QoS values, which poses a great challenge to service evaluation; it is therefore crucial to use historical QoS data to predict the missing personalized QoS values before making service recommendations. The collaborative filtering (CF) algorithm has been widely used in Web service QoS prediction research. The algorithm exploits group intelligence and relies on a large amount of historical QoS data, so the key for CF-based methods is to ensure that the historical QoS values are credible: using data mixed with outliers can cause values that would otherwise represent the data's attributes to lose their meaning, significantly reducing prediction accuracy. Abnormal data comes from spurious QoS values provided by untrustworthy users and from abnormal values obtained by ordinary users due to their contextual environment. Most previous studies identify abnormal values by constructing probability distribution models or by using the K-Means clustering algorithm, but the former is not good at detecting outliers in high-dimensional data, while K-Means is very sensitive to abnormal values: outliers have a great impact on the cluster centers, and it is difficult to set an appropriate number of clusters K for all services, so classification errors often occur. To solve these problems, we propose an adaptive parameter DBSCAN clustering and reputation-aware QoS prediction method (UDBSCAN_RSVD+). Firstly, to mitigate the sparsity of the historical QoS matrix, the random singular value decomposition technique (RSVD+), which considers user and service observation bias, is applied to pre-populate the sparse matrix, and the gradient descent algorithm with adaptive gradients (Adagrad) is used to improve the accuracy of the populated values.
Then two stages are used to identify untrustworthy users. The first stage applies the DBSCAN clustering algorithm with adaptive parameters to each service; it adjusts suitable parameters according to the data distribution of each service, and because its core points can merge intersecting neighborhoods, any number of clusters of different shapes can be mined to find outliers and similar users. In the second stage, common-sense outliers of the QoS matrix are removed. The number of times each user provides outliers across the two stages is recorded to obtain the set of untrustworthy users. Finally, the trusted clustering information is used to predict the QoS values. The main contributions of this paper are summarized as follows:
• RSVD+ technology is used to pre-populate the matrix, which alleviates the adverse influence of the sparse matrix on QoS prediction accuracy.
• We propose a DBSCAN clustering algorithm with adaptive parameters to identify abnormal data; it adjusts suitable parameters according to the data distribution of each service, mines the number of clusters by itself, finds similar users and untrustworthy users, and purifies the data.
• We conduct a comprehensive experimental evaluation and demonstrate that our method outperforms other baseline methods with higher prediction accuracy.
2 Related Work

Matrix factorization (MF) is a widely used technique for QoS prediction [1, 8], and most MF-based prediction methods are dedicated to minimizing a loss function to make
An Adaptive Parameter DBSCAN Clustering and Reputation-Aware
105
the predicted values approximately fit the true QoS values [2], while ignoring the impact of outliers on prediction accuracy. Because the loss function uses the L2 norm, it is very sensitive to outliers [3]: in the process of minimizing the loss function, outliers cause serious deviation between the real QoS values and the predicted values, so satisfactory performance cannot be obtained. Therefore, matrix factorization can only serve as a method for pre-populating the sparse matrix, and subsequent abnormal-value checking is needed to improve performance. Dimension reduction technology is widely used to fill sparse matrices; currently the most widely used methods are PCA dimensionality reduction and SVD singular value decomposition. Nilashi et al. [4] applied dimensionality reduction techniques to CF algorithms and proposed a multi-criteria prediction method for recommender systems, and Vaswani et al. [5] applied PCA techniques to sparse matrix decomposition. However, the basic SVD method requires the original co-occurrence matrix to be dense, while the historical QoS matrix is very sparse, so the filling effect is not ideal. Clustering algorithms are often applied to identify outliers [6], usually by finding the cluster that contains the fewest users and treating all users in it as outliers. However, it is difficult to set a number of clusters applicable to all services. Zheng et al. [7, 9] proposed a Web service prediction method based on two-stage K-Means clustering to reduce the impact of unreliable data on prediction accuracy. This method does not consider that it is difficult to find similar users when the QoS matrix is extremely sparse, that the historical QoS matrix is high-dimensional, and that there are a large number of services with different data distributions; clustering all services with the same K value using the K-Means algorithm gives unsatisfactory results.
3 Preparatory Knowledge

3.1 RSVD+ Model

RSVD is one of the latent factor models. By matrix decomposition of the user-service QoS matrix [10], two matrices P and Q without missing values are obtained, representing users and services respectively; each user and service is labeled with an implicit vector (potential information mined by the model). The service and user matrices are then multiplied to refill the sparse matrix, which is possible because any matrix has a full-rank decomposition. RSVD therefore initializes the P and Q matrices, uses their product to represent the QoS matrix, and uses \hat{r}_{u,i} to denote the predicted QoS value, obtained from Formula (1):

\hat{r}_{u,i} = p_u^T q_i  (1)

The real QoS value observed by the user is denoted r_{u,i}. All known real values in the QoS matrix are traversed, and the sum of squared errors against the predicted values is calculated. The loss function is shown in Eq. (2):

SSE = \sum_{u,i} (r_{u,i} - \hat{r}_{u,i})^2  (2)
To prevent over-fitting, the RSVD model adds a regularization term to the original loss function as a penalty, giving the loss function shown in Formula (3):

SSE = \sum_{u,i} (r_{u,i} - p_u^T q_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2)  (3)

where \lambda is the penalty coefficient. This paper improves the RSVD model and puts forward the RSVD+ model. The model considers the differing contextual environments of users and services: some users obtain an overall faster response time when invoking services, while others obtain an overall slower one, and the same holds for services. This forms an observation bias. To eliminate the observation biases of users and services, a user bias and a service bias are added to the loss function of Formula (3), as shown in Formula (4):

SSE = \sum_{u,i} (r_{u,i} - p_u^T q_i - b_u - b_i)^2 + \lambda (\|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2)  (4)

where b_u is the user bias and b_i is the service bias. Thus, the matrix decomposition problem is transformed into an optimization problem, and the Adagrad gradient descent algorithm is used to minimize the loss function, yielding the trained P and Q matrices along with b_u and b_i; the predicted value fitting the true value is obtained through Formula (5):

\hat{r}_{u,i} = p_u^T q_i + b_u + b_i  (5)
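The RSVD+ training loop can be sketched as follows: a minimal NumPy illustration that minimizes the loss of Eq. (4) with an Adagrad-style per-parameter learning rate and returns the populated matrix of Eq. (5). Function names and hyperparameter values are ours, not taken from the paper:

```python
import numpy as np

def train_rsvd_plus(R, mask, k=5, lr=0.1, lam=0.02, epochs=300, seed=0):
    """Minimise the regularised loss of Eq. (4) with Adagrad.

    R    -- user-service QoS matrix; mask marks the observed entries.
    Returns the fully populated prediction matrix of Eq. (5).
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=0.1, size=(m, k))   # user latent factors
    Q = rng.normal(scale=0.1, size=(n, k))   # service latent factors
    bu, bi = np.zeros(m), np.zeros(n)        # user / service biases
    # Adagrad accumulators: one squared-gradient sum per parameter
    gP = np.full_like(P, 1e-8); gQ = np.full_like(Q, 1e-8)
    gbu = np.full_like(bu, 1e-8); gbi = np.full_like(bi, 1e-8)
    users, items = np.nonzero(mask)
    for _ in range(epochs):
        for u, i in zip(users, items):
            e = R[u, i] - (P[u] @ Q[i] + bu[u] + bi[i])
            # gradients of (r - p.q - bu - bi)^2 + lam * regulariser
            dP = -2 * e * Q[i] + 2 * lam * P[u]
            dQ = -2 * e * P[u] + 2 * lam * Q[i]
            dbu = -2 * e + 2 * lam * bu[u]
            dbi = -2 * e + 2 * lam * bi[i]
            # Adagrad step: global rate divided by sqrt of accumulated g^2
            gP[u] += dP ** 2; P[u] -= lr * dP / np.sqrt(gP[u])
            gQ[i] += dQ ** 2; Q[i] -= lr * dQ / np.sqrt(gQ[i])
            gbu[u] += dbu ** 2; bu[u] -= lr * dbu / np.sqrt(gbu[u])
            gbi[i] += dbi ** 2; bi[i] -= lr * dbi / np.sqrt(gbi[i])
    return P @ Q.T + bu[:, None] + bi[None, :]
```

As the accumulators grow, the effective step size shrinks, giving the fast-then-cautious convergence behaviour described in Sect. 3.2.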
3.2 Adagrad Algorithm

Gradient descent is the main method for minimizing the loss function, but its performance depends on the learning rate: set too large, the iterates oscillate around the minimum and fail to converge; set too small, good results are not obtained even after many iterations. Ideally, convergence should be encouraged at the start so the optimal solution is approached quickly, while later iterations should be penalized so the minimum is not overshot. The Adagrad algorithm achieves this. The algorithm initializes a gradient accumulation variable r; at each iteration, the squared gradient of each parameter is accumulated into r, and the global learning rate is divided by the arithmetic square root of r as a dynamic update of the learning rate. As the number of iterations increases, r grows larger and larger, and the learning rate slowly decreases.

3.3 Adaptive Parameter DBSCAN Algorithm

DBSCAN is a density-based clustering algorithm. Compared with other clustering algorithms, DBSCAN has great advantages in processing data sets containing outliers: because core points can merge intersecting neighborhoods, there is no need
to set the number of clusters in advance, and any number of clusters of different shapes can be mined. Once the Eps parameter (the radius of the neighborhood around a point) and the Min_sample parameter (the minimum number of points contained in that neighborhood) are determined, the algorithm clusters the data and finds outliers based on the density-reachable and density-connected properties. For example, if point B lies in the Eps neighborhood of core point A, then A and B are directly density-reachable; if point C is not in the neighborhood of A but is in the neighborhood of B, then A and C are density-reachable, and A and C can be placed in the same cluster based on density-connectedness. If a point D is not a core point and does not belong to the Eps neighborhood of any point, then D is a noise point and is treated as anomalous data. The traditional DBSCAN algorithm sets the Eps and Min_sample parameters as global variables, but this experiment contains a large number of Web services whose data distributions differ greatly, so a DBSCAN clustering algorithm with adaptive parameters is adopted, setting a different Eps according to the data distribution of each service. If the value of Min_sample is too small, sparse clusters cannot expand; if it is too large, two neighboring high-density clusters may be merged into one. Therefore, Min_sample is set to 4, and Eps is obtained by combining the following two methods:
• Draw the K-distance curve; the abscissa of the inflection point of the curve is the Eps parameter value.
• Compute Eps according to Formula (6):

Eps = \left( \frac{(xMax - xMin) \times Min\_samples \times \Gamma(0.5n + 1)}{m \sqrt{\pi^n}} \right)^{1/n}  (6)

where xMax is the maximum sample, xMin is the minimum sample, m is the number of samples, and n is the data dimension.
The two methods are evaluated with the silhouette coefficient, and the optimal Eps value under the current service is selected; the resulting clusters are shown in Fig. 1.
Fig. 1. DBSCAN clustering graph with adaptive parameters
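The per-service clustering of the first stage can be sketched as follows: a self-contained illustration pairing a minimal 1-D DBSCAN with the Eps heuristic of Formula (6). All names are ours, and the Gamma-function form of the heuristic is our reading of the formula, so treat both as assumptions rather than the authors' implementation:

```python
import numpy as np
from math import gamma, pi, sqrt

def eps_formula(x, min_samples=4):
    """Heuristic Eps of Formula (6) for a service's 1-D QoS samples
    (n = 1 here)."""
    x = np.asarray(x, float)
    m, n = len(x), 1
    return ((x.max() - x.min()) * min_samples * gamma(0.5 * n + 1)
            / (m * sqrt(pi ** n))) ** (1.0 / n)

def dbscan_1d(x, eps, min_samples=4):
    """Minimal DBSCAN for 1-D data; label -1 marks noise (outliers)."""
    x = np.asarray(x, float)
    n = len(x)
    neigh = [np.flatnonzero(np.abs(x - x[i]) <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neigh]
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cid
        stack = [i]
        while stack:                 # expand the cluster from core points
            p = stack.pop()
            for q in neigh[p]:
                if labels[q] == -1:
                    labels[q] = cid
                    if core[q]:      # border points are labelled, not expanded
                        stack.append(q)
        cid += 1
    return labels
```

On a service with two dense value groups and one isolated value, the isolated value keeps label -1 and is reported as an outlier.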
In Fig. 1, the left panel shows the response-time clustering of one service: according to the data distribution, the algorithm divides the data into five density-based clusters, with abnormal values accounting for 12% of the total. Most of the abnormal data are maxima and minima; a small number of outliers are values far from the other data points, and even when such a value is close to the average it can still be detected by this algorithm. The right panel shows the throughput clustering of another service, whose data the algorithm divides into 9 clusters, with 5% outliers. As can be seen from the figure, most outliers are detected by the algorithm, while a small number of outliers appear in groups and are regarded as normal values by the density-based clustering algorithm. Therefore, common-sense outlier detection is required in the next step.
4 An Adaptive Parameter DBSCAN Clustering and Reputation-Aware QoS Prediction Method

The main process of the proposed method (UDBSCAN_RSVD+) is shown in Fig. 2, comprising pre-population, two-stage outlier detection, and QoS value prediction.
• Pre-population: RSVD+ is used to pre-populate the sparse QoS matrix, yielding a QoS matrix without missing values.
• Outlier detection: In the first stage, the DBSCAN clustering algorithm with adaptive parameters clusters each service in turn on the pre-populated QoS matrix, obtaining the similar users and outliers under each service. The second stage finds common-sense outliers, that is, values beyond the scope of common knowledge; response times or throughputs that are zero or negative go against common sense and are regarded as outliers. The outliers found in the two stages are deleted, and the number of outliers provided by each user is recorded in the untrustworthy user matrix.
• QoS value prediction: After deleting abnormal users, users similar to the active user are found and the missing QoS values are predicted.
The details of each step are introduced below.

4.1 Sparse Matrix Padding

The historical QoS matrix is often extremely sparse, which not only makes it difficult for collaborative filtering to find the most similar users (or services), but also means that a credible QoS value provided by a user may be removed as an outlier simply because the data is sparse and the corresponding service has no other values distributed near it. To mitigate the adverse effect of the sparse matrix on QoS prediction accuracy, RSVD+ is applied to pre-populate the sparse matrix before detecting the anomalous data.
Fig. 2. A flowchart of an adaptive parameter DBSCAN clustering and reputation-aware QoS prediction method
4.2 Two-Phase Anomaly Data Detection

Taking the response-time data as an example, viewing the distribution of the original data at a specific density (Fig. 3) shows that the QoS values of many services are negative. These outliers will not disappear through matrix pre-population or the subsequent collaborative filtering algorithm, and they would seriously affect prediction accuracy, so outlier detection is necessary.
Fig. 3. Historical QoS Data Boxplot
The sources of outliers fall into two cases: fake observations provided by untrustworthy users, and abnormal values obtained by ordinary users due to the influence of their contextual environment. Outliers include not only maximum or minimum values but also values far away from the other data points in the data set; even a value that approaches the average of the corresponding service should
be regarded as an outlier. Therefore, the following two stages are used to detect outliers and find similar users and untrustworthy users.
1) The first phase [11]: finding outlier points. Initialize the untrustworthy user matrix A (of size m × 1) and the similar user matrix B (of size m × m) to record clustering information, setting every entry of both matrices to 0, and then apply the DBSCAN clustering algorithm with adaptive parameters to each service in turn. If the QoS value provided by user i is considered abnormal on a service, the abnormal value is deleted and a_i is increased by 1. On each service, users clustered into the same cluster are regarded as similar users under that service and are recorded pairwise in matrix B: letting C_s^k denote the k-th cluster of service s, when users i and j both belong to C_s^k, both b_{i,j} and b_{j,i} are increased by 1. While traversing the services, the untrustworthy user matrix A and the similar user matrix B are continually updated, and the data regarded as abnormal is continually deleted from the pre-populated QoS matrix. After the traversal, the similar user matrix is obtained.
2) The second phase: searching for common-sense outliers. Abnormal data that violates common sense may appear in groups within a service, and density-based clustering alone cannot detect such anomalies. As shown in the right panel of Fig. 1, some throughput values are negative, which obviously goes against common sense; yet because some of these outliers appear in groups, they are not detected and are classified into clusters together with normal data. Applying the Kolmogorov-Smirnov test (K-S test) to the data distribution of each service shows that the distributions are not normal, so the extremes of the two tails cannot be identified by the quartile method.
Therefore, the QoS matrix produced by the first stage is traversed, the data with QoS value ≤ 0 in each service is deleted, and for each user i who provided such a value, a_i in the untrustworthy user matrix A is increased by 1. After the traversal, the untrustworthy user matrix A and the QoS matrix to be predicted are obtained.

4.3 Predicting Missing Values

In the prediction stage, the untrustworthy user matrix A obtained from outlier detection is first sorted in descending order, and a threshold for the proportion of untrustworthy users is set. According to the ranking and threshold, the untrustworthy users and the data they provided are deleted from the QoS matrix to be predicted and from the similar user matrix B to improve prediction accuracy. When predicting the missing QoS value of the active user u on service s, the similar user matrix is sorted by the values of user u's column from largest to smallest, yielding a ranking of the users most similar to u. Because the Top-N algorithm ignores the case where the number of similar neighbors is less than N, i.e., negative filtering [8], we instead find the most similar user of the active user u. If this user has called service s, we look up the cluster C_s^k to which that similar user belongs on service s and compute the average QoS value provided by all trusted users in C_s^k (excluding data provided by untrustworthy users) to obtain the predicted value, as shown in Formula (7). If the most similar user does not
call the service s, we check whether the second-ranked similar user has called s, and so on until a similar user who has called service s is found:

\hat{r}_{u,s} = \frac{\sum_{c_u \in C_s^k} r_{c_u,s}}{N}  (7)

where r_{c_u,s} denotes the QoS value provided by a user c_u in C_s^k who has invoked service s, and N denotes the number of such users in the cluster.
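The prediction step of Formula (7) can be sketched as follows. This is a hypothetical illustration of the lookup described above; the data-structure layout (dicts keyed by user and service ids) is assumed, not prescribed by the paper:

```python
def predict_qos(u, s, similar, called, clusters, qos, trusted):
    """Predict the missing QoS of active user u on service s (Formula (7)).

    similar[u]     -- user ids ordered by decreasing similarity to u
    called[v]      -- set of services user v has invoked
    clusters[s][v] -- cluster id of user v on service s (from the first stage)
    qos[v][s]      -- observed QoS value; trusted -- users not flagged
    """
    for v in similar[u]:
        if s in called[v]:                        # first similar user who called s
            k = clusters[s][v]
            vals = [qos[w][s] for w in trusted
                    if s in called[w] and clusters[s].get(w) == k]
            return sum(vals) / len(vals)          # cluster average, Formula (7)
    return None  # no similar user has invoked service s
```

Walking the similarity ranking until a caller of s is found avoids the negative-filtering problem of a fixed Top-N neighborhood.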
5 Experimental Evaluation

In this section, we conduct a comprehensive experimental evaluation of the proposed Web service QoS prediction method. The experiments use the WS-DREAM public real-world dataset provided by Zheng's team and consist of three parts: 1) comparing the UDBSCAN_RSVD+ method proposed in this paper with several baseline methods to verify its effectiveness; 2) verifying the effect of different parameters on prediction accuracy; 3) conducting a comprehensive ablation study of the proposed method.

5.1 Datasets

The WS-DREAM dataset used in the experiments is the real dataset with the largest number of participating users, the largest number of invoked Web services, and the largest scale; it contains 1,974,675 invocation records generated by 339 users invoking 5,825 services. These records are converted into two QoS matrices: the response-time matrix and the throughput matrix. The dataset also includes the location information of users and services (IP address, AS autonomous system, latitude, longitude, etc.), but the location information of services is incomplete. For services whose IP address is known but whose AS autonomous system, latitude, and longitude are missing, an API interface is called to obtain the home location of the IP address. In the end, 4,993 services with complete location information are obtained. Table 1 gives detailed information on the experimental dataset.

Table 1. Experimental data set

Statistics                            Values
Number of Service Users               339
Number of Web Services                4993
Number of Web Service Invocations     1692627
Range of Response Time                0–20 s
Range of Throughput                   0–1000 kbps
5.2 Evaluation Metrics

The following indicator is used to measure the closeness between the predicted values and the real values. The Mean Absolute Error (MAE) reflects the actual error of the predicted values; the smaller the MAE, the higher the prediction accuracy. The mathematical expression of MAE is shown in (8):

MAE = \frac{\sum_{u,s} |r_{u,s} - \hat{r}_{u,s}|}{N}  (8)

where r_{u,s} denotes the true QoS value obtained by user u invoking service s, \hat{r}_{u,s} denotes the predicted QoS value, and N denotes the number of predicted values.

5.3 Baseline Methods

To evaluate the performance of the UDBSCAN_RSVD+ method, the following baseline methods are used for comparison:
• UPCC+: a reliability-aware user-based collaborative filtering approach. On the basis of user-based collaborative filtering, the DBSCAN clustering algorithm with adaptive parameters is used to find and delete the abnormal values of each service, and the number of abnormal values provided by each user is counted. After deleting abnormal users in proportion, the similar user matrix is updated and the previously deleted abnormal values are re-predicted.
• IPCC+: a reliability-aware item-based collaborative filtering method. On the basis of item-based collaborative filtering, outliers and untrustworthy users are deleted, the similar service matrix is updated, and the QoS values previously deleted as abnormal data are re-predicted.
• UDBSCAN_USVD: a reliability-aware user-mean singular value decomposition method proposed in this paper. Since singular value decomposition (SVD) requires the original co-occurrence matrix to be dense while the QoS matrix is sparse, the user average is first used to fill in the missing values, SVD is then used to pre-populate the matrix, and finally the proposed outlier-detection procedure is applied and the QoS values are predicted.
• UDBSCAN_ISVD: the reliability-aware service-mean singular value decomposition method proposed in this paper. The service average is first used to fill in the missing values, SVD is then used to pre-populate the matrix, and finally the proposed method is used to detect outliers and predict QoS values.
• UDBSCAN_LFM: the reliability-aware latent factor model method. Two parameter matrices U and V are initialized; the predicted value is obtained as the product of the two latent vectors and compared with the real value to compute the loss, which is minimized by gradient descent; the two latent vectors are updated iteratively to pre-fill the matrix. Finally, the method in this paper is used to detect outliers, and the QoS values are predicted.
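The mean-filling plus SVD pre-population used by the UDBSCAN_USVD and UDBSCAN_ISVD baselines can be sketched with NumPy. This is a rough illustration on a toy matrix, not the authors' implementation; the function name and the NaN-for-missing convention are our own.

```python
import numpy as np

def svd_prefill(qos, k=2, axis=1):
    """Fill missing entries (NaN) with row/column means, then replace them
    with a rank-k truncated-SVD estimate (axis=1: user means; axis=0: service means)."""
    filled = qos.copy()
    means = np.nanmean(qos, axis=axis)
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(means, idx[1 - axis])       # mean fill
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]           # rank-k reconstruction
    filled[idx] = approx[idx]                         # only missing cells change
    return filled

qos = np.array([[1.0, np.nan, 2.0],
                [1.2, 0.9, np.nan],
                [np.nan, 1.1, 2.1]])
dense = svd_prefill(qos, k=2)      # user-mean variant; axis=0 gives the service-mean one
```

Observed entries are kept as-is; only the missing cells take the low-rank estimate.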
An Adaptive Parameter DBSCAN Clustering and Reputation-Aware
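For reference, the MAE of Eq. (8) used throughout the following comparisons is a direct computation; a minimal pure-Python sketch with hypothetical QoS values:

```python
def mae(actual, predicted):
    """Mean Absolute Error, Eq. (8): mean of |r - r_hat| over the N predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need equal-length, non-empty sequences")
    return sum(abs(r - r_hat) for r, r_hat in zip(actual, predicted)) / len(actual)

# Hypothetical response-time values (seconds): observed vs. predicted
observed = [0.5, 1.2, 3.0, 0.8]
predicted = [0.6, 1.0, 2.5, 0.8]
error = mae(observed, predicted)   # ~0.2 s on these toy numbers
```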
5.4 Performance Comparison

In real life, the user-service QoS matrix is often extremely sparse, and many Web services are invoked by only a small number of users. To simulate this situation, some QoS values are first randomly deleted from the matrix to obtain a matrix of a specific density, and then 10% of the users, namely 34 users, are randomly selected as active users. The remaining QoS values are regarded as real QoS values. To avoid the influence of differences in the remaining real QoS values of active users (such as the number of service calls) on the experimental results, prediction accuracy is tested with 10-fold cross-validation. In the following experiment, if abnormal users appear among the active users, they are removed before the prediction accuracy is tested.

In this section, the proportion of untrustworthy users varies from 1% to 5% in steps of 1%, and the matrix density is set to 20%. The following points should be noted. First, UPCC+ and IPCC+ obtain similar users through the Pearson correlation coefficient before outlier detection; Top-K is set to 2, and the average value of the two most similar users is used to fill in the missing values. The other four methods use the DBSCAN clustering algorithm with adaptive parameters to obtain similar users and abnormal users, and fill in missing values with the average QoS value of the cluster to which the most similar users who have called the service belong. Second, in the matrix pre-population stage, the number of iterations for LFM and RSVD+ is set to 300, and the initial learning rate is the same under the same matrix density. Third, if there is an abnormal user among the active users, that user is removed and the prediction accuracy is tested again. The performance comparison of different methods is shown in Table 2.
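The sparsification step described above (randomly removing QoS values to reach a target matrix density) might look as follows; a toy sketch under our own conventions (None marks a missing entry), not the authors' exact protocol:

```python
import random

def sparsify(matrix, density, seed=0):
    """Randomly keep about `density` of the entries; the rest become None (missing)."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    keep = set(rng.sample(cells, int(len(cells) * density)))
    return [[matrix[i][j] if (i, j) in keep else None for j in range(cols)]
            for i in range(rows)]

full = [[float(i + j) for j in range(10)] for i in range(10)]
sparse = sparsify(full, density=0.2)
observed = sum(v is not None for row in sparse for v in row)   # 20 of 100 entries remain
```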
The experimental results show that:

• Taking the response time matrix as an example and MAE as the evaluation index, when the proportion of untrusted users is 1%, 2%, 3% and 5%, UDBSCAN_RSVD+ achieves higher prediction accuracy than the other methods, which shows that the proposed method outperforms the baselines.
• When the proportion of untrusted users increases from 1% to 5%, the prediction accuracy of all methods improves substantially, although with fluctuations. For example, for the response time shown in the left panel of Fig. 4, the prediction accuracy of the UDBSCAN_USVD method improves most significantly, by 24.7%. The IPCC+ method improves markedly as the proportion of untrustworthy users increases from 1% to 4%, but drops sharply from 4% to 5%, because reliable users are excluded according to the set untrusted-user threshold; thus, the trusted-user threshold should be tuned per method to improve prediction accuracy. This shows that data reliability has a great influence on prediction accuracy.
• Compared with UDBSCAN_LFM, the method in this paper accounts for user and service deviations and a penalty term in the pre-filling process and uses the Adagrad algorithm for gradient descent, which greatly improves the prediction accuracy and shows that these two steps are necessary.
Table 2. Performance comparison with baseline methods

Response Time

Method           Percentage of Untrusted Users
                 1%       2%       3%       4%       5%
UPCC+            0.5994   0.5810   0.5737   0.5741   0.5637
IPCC+            0.6088   0.6037   0.6060   0.6006   0.6587
UDBSCAN_USVD     0.9457   0.8812   0.8103   0.7413   0.7116
UDBSCAN_ISVD     0.5945   0.5884   0.5224   0.5224   0.5224
UDBSCAN_LFM      0.5225   0.5225   0.5012   0.4365   0.4364
UDBSCAN_RSVD+    0.4823   0.4732   0.4644   0.4657   0.4164

Throughput

Method           Percentage of Untrusted Users
                 1%       2%       3%       4%       5%
UPCC+            26.33    26.33    26.36    26.36    26.10
IPCC+            30.59    30.86    30.24    29.88    30.00
UDBSCAN_USVD     48.88    49.03    49.03    49.10    49.10
UDBSCAN_ISVD     28.95    28.65    28.65    28.65    28.65
UDBSCAN_LFM      25.34    25.04    24.70    24.70    24.70
UDBSCAN_RSVD+    25.15    24.86    24.63    24.63    24.63
• Since the response time ranges over 0–20 s while the throughput ranges over 0–1000 kbps, the throughput error is correspondingly large, and after deleting abnormal users the improvement in prediction accuracy is slightly smaller than that for response time. However, as shown in the right panel of Fig. 4, taking MAE as the evaluation index, UDBSCAN_RSVD+ outperforms the other baseline methods and maintains high prediction accuracy under all proportions of untrusted users.
Fig. 4. The impact of untrusted users on prediction accuracy
5.5 Impact of Matrix Density

This subsection investigates the effect of matrix density on the prediction results. In this part of the experiment, we set the percentage of untrustworthy users to 5% and Top-K to 2. The matrix density varies from 5% to 20% with a step size of 5%. The results are shown in Fig. 5:
Fig. 5. The improvements of the method at different matrix densities
The experimental results show that:

• Taking MAE as the evaluation index, when the matrix density is 5%, the prediction accuracy of all methods is low and the corresponding MAE values are high. As the matrix density increases to 20%, the performance of all methods improves significantly. Under all matrix densities, the UDBSCAN_RSVD+ method achieves the highest prediction accuracy for both the response time matrix and the throughput matrix.
• As shown in Fig. 5, for each matrix density, the proposed method is compared with the best-performing baseline at that density to calculate the improvement rate. The improvement rate is largest when the matrix density is 5%, reaching 7.0% for response time and 3.8% for throughput, indicating that the UDBSCAN_RSVD+ method performs well when the matrix is extremely sparse.

5.6 Ablation Study

In this section, an ablation study is used to verify the effectiveness of matrix pre-filling and abnormal value detection. On the basis of the UDBSCAN_RSVD+ model proposed in this paper, the ablation experiment compares the results of the original method with those obtained after removing DBSCAN, removing RSVD+, and removing both (namely the UPCC method). The ratio of untrusted users for the proposed method and the (w/o) RSVD+ method is set to 5%, with MAE as the evaluation index, as shown in Table 3:
Table 3. Ablation experiment results

Matrix   Methods          Matrix Density
                          5%       10%      15%      20%
RT       UPCC             1.0326   0.9721   0.9500   0.9128
         (w/o) DBSCAN     0.6780   0.5894   0.4702   0.4558
         (w/o) RSVD+      0.6116   0.4915   0.4642   0.4558
         UDBSCAN_RSVD+    0.5756   0.4623   0.4427   0.4164
TP       UPCC             40.78    45.21    53.18    75.25
         (w/o) DBSCAN     40.44    35.45    32.57    27.31
         (w/o) RSVD+      36.07    34.14    29.82    26.99
         UDBSCAN_RSVD+    31.74    30.10    28.06    24.63
The experimental results show that:

• Under all matrix densities, for both the response time matrix and the throughput matrix, the prediction accuracy of the full method is the highest.
• Removing either DBSCAN or RSVD+ performs worse than the full method but better than removing both, which shows that the proposed method performs best and that the two components, RSVD+ pre-filling and DBSCAN outlier detection, are both effective.
6 Conclusion

In this paper, we propose an adaptive parameter DBSCAN clustering and reputation-aware QoS prediction method, which uses RSVD technology considering user and service bias to pre-populate the original sparse matrix and alleviate the adverse effects of data sparsity. Outlier detection is then performed in two stages to remove untrusted users and abnormal data. To evaluate the effectiveness of the proposed method, the WS-DREAM dataset is used for experimental verification. Experimental results show that the proposed method achieves better performance than the baseline methods. In this work, the effect of the contextual environment on prediction accuracy is approximated by setting user and service deviations, which may not be accurate enough; in future work, user and service location information can be introduced to better approximate the effect of the contextual environment and improve prediction accuracy.

Acknowledgments. This work was supported by the National Natural Science Foundation of China (Grants 61872138, 61572188).
Effectiveness of Malicious Behavior and Its Impact on Crowdsourcing

Xinyi Ding, Zhenjie Zhang, Zhuangmiao Yuan, Tao Han, Huamao Gu, and Yili Fang(B)

Zhejiang Gongshang University, Hangzhou 310018, Zhejiang, China
[email protected]
Abstract. Crowdsourcing has achieved great success in fields such as data annotation, social surveys, and object labeling. However, enticed by potentially high rewards, more and more malicious behavior has appeared, such as plagiarism, random submission, and offline collusion. Such malicious behavior not only increases the cost of handling tasks for requesters but also lowers the quality of the collected data. Existing research investigates only some specific types of malicious behavior and focuses mostly on their impact on aggregation results; it also does not evaluate the effectiveness of this malicious behavior in different scenarios. In this study, we formally propose a malicious behavior effectiveness analysis model that can be applied in different scenarios. Through comprehensive experiments on four typical types of malicious behavior, we demonstrate that as the number of malicious workers increases, all of them lead to a decrease in the accuracy of aggregation algorithms, among which random submission causes the largest decline. Our study can provide guidance for designing secure crowdsourcing platforms and ensuring high-quality data.
Keywords: Crowdsourcing · Malicious behavior · Data aggregation

1 Introduction
Crowdsourcing has been successfully applied in fields such as image labeling [1], emotion analysis [2], video description [3], and text translation [4]. Crowdsourcing mainly involves two stages: task handling and data aggregation. For task handling, requesters publish tasks redundantly to anonymous workers from the Internet according to some reward strategy and platform policy. Registered workers complete these tasks and submit their responses back to the platform. Due to the unreliability of these workers, one task is assigned to multiple workers. Thus, in the data aggregation stage, aggregation algorithms are run to distill high-quality data for requesters. These two stages need to be carefully designed because they both have a huge impact on data quality [5]. In task handling, due to the diversity of the knowledge backgrounds, skills, and motivations of the Internet users participating in crowdsourcing tasks, the quality
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 118–132, 2023. https://doi.org/10.1007/978-981-99-2385-4_9
of collected data is not guaranteed. What's worse, some workers try to gain more rewards with less effort, leading to various kinds of malicious behavior. For instance, in order to complete as many tasks as possible in a short time, some malicious workers may provide ill-fitting responses to the platform [6,7]. Although most workers on crowdsourcing platforms are anonymous, they can use third-party social networks or other communication channels to form implicit networks for information exchange [8]. These malicious workers share their answers through such self-organized networks, trying to gain more rewards with less effort. The existence of such malicious behavior not only increases the cost to requesters but also decreases the diversity of the collected data, seriously damaging the quality of the aggregated data. Current malicious behavior can be mainly classified into the following four categories:

Random Submission. In order to complete as many tasks as possible in a short time, some workers may provide ill-fitting responses. Malicious workers can use probabilistic models to generate random answers and submit them to the platform, and can register multiple robot accounts to maximize their gains. Using robot accounts and probability models can seriously damage the quality of the collected data [6,7].

Group Plagiarism. Group plagiarism refers to the scenario in which, after one worker (usually the leader) has completed all the tasks, he/she shares the answers with the others through third-party social networks or implicit communication channels. The others simply copy the shared answers and submit them to the platform. Group plagiarism leads to identical answers from different accounts; when enough workers are involved, it can easily overturn the aggregation results [9].

Division and Combination. In this scenario, workers first divide all the tasks within the group.
Each member of the group completes the assigned tasks and shares the responses with the others. Thus, in the worst case, all workers submit the same responses, which greatly reduces the diversity of the collected data [10].

Aggregation Submission. In order to improve the quality of their submissions, and thus increase the acceptance rate by crowdsourcing platforms, some workers may communicate and exchange answers through third-party social networks. They come up with high-quality responses using methods such as majority voting before submission. This kind of behavior may increase the quality of the collected data, but at the same time it loses diversity information and is therefore still considered harmful to the platform [11].

The malicious behavior above all tries to gain more rewards through a series of actions that change the answers submitted to the platform. Random submission directly decreases the accuracy of the aggregation results, while the other three strategies affect the diversity of the submitted answers and are thus also harmful to the aggregation algorithms. A deep understanding of how these different kinds of malicious behavior impact crowdsourcing results under various conditions is critical for future mitigation and filtering of the noisy data they cause.
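As a concrete illustration of the first category, a random-submission "robot account" can be simulated in a few lines; the function and its parameters are hypothetical:

```python
import random

def random_submission(task_ids, label_space, seed=None):
    """A robot account's answers: one uniformly random label per task."""
    rng = random.Random(seed)
    return {t: rng.choice(label_space) for t in task_ids}

# Hypothetical scenario: one robot account answering 100 binary tasks
answers = random_submission(range(100), label_space=[0, 1], seed=42)
```

Registering many such accounts simply means calling this with different seeds, which is why random submission scales so cheaply for the attacker.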
X. Ding et al.
In this study, we first analyze the effectiveness of different kinds of malicious behavior, formulating the malicious behavior problem and our effectiveness analysis model. Based on the proposed model, we analyze the effectiveness of the four typical kinds of malicious behavior mentioned above. The results show that under an indifferential pricing policy, random submission, group plagiarism, and division and combination are effective malicious behavior, while aggregation submission is ineffective. For aggregation submission, when the extra reward from improved accuracy exceeds the labor cost of aggregation, it can also be considered effective malicious behavior. Through comprehensive experiments, we investigate the impact of these kinds of malicious behavior on different aggregation algorithms under various scenarios. We select three common aggregation algorithms, DS [12], HDS [13,14], and majority voting (MV), for this study. We find that random submission causes the most serious damage to these aggregation algorithms: on the two multiclass classification datasets synth4 and dog, the aggregation accuracy can drop below 0.3. For both homogeneous and inhomogeneous datasets, as the proportion of malicious workers increases, random submission, division and combination, and aggregation submission all lead to a decrease in accuracy. We also notice that the impact of group plagiarism on inhomogeneous datasets is small, which can be explained by the fact that inhomogeneous datasets contain many workers who complete only a few tasks. We summarize the contributions of this paper as follows:

– We formally formulate the definition of malicious behavior and propose an effectiveness analysis model that can be used to evaluate the effectiveness of different types of malicious behavior. To the best of our knowledge, this is the first discussion of malicious behavior from the effectiveness point of view.
– Through comprehensive experiments, we find that as the proportion of malicious workers increases, all four typical kinds of malicious behavior lead to a decrease in the accuracy of aggregation algorithms.
– We investigate the impact of these kinds of malicious behavior under different scenarios and find significant differences in their effectiveness. Besides, we find that the impact of group plagiarism on inhomogeneous datasets is not as large as usually thought.
2 Related Work
The malicious behavior of workers can have a huge impact on the aggregation results of labeling tasks, decreasing the quality of the collected data and increasing the cost to requesters. Thus, crowdsourcing platforms often need strategies to filter out malicious workers. Crowdsourcing workers always want to gain more rewards with less effort, and some of them may engage in malicious behavior, which can seriously damage the aggregation results [15]. Several research works have focused on the analysis of malicious behavior in crowdsourcing [2,16,17]. These works include the discussion of Unqualified Workers:
workers usually have to follow the guidance and task descriptions provided by the crowdsourcing platform before they can submit answers. However, some workers may not be able to understand the task description properly; for instance, in some cases the first three prerequisite tasks must be completed before the fourth can be answered. These workers may not be malicious, but their submissions are invalid and can cause issues for requesters [7]. Fraudsters: those who try to maximize their gains in the shortest time by submitting improper answers; for instance, they may submit the same answer to different tasks [7]. Rule Breakers: workers who do not follow the guidance or task description when handling tasks, resulting in invalid data. For example, some tasks ask workers to select three options, but some workers select only one, so the collected data do not meet the requesters' needs [7]. Random Submission: malicious workers or robot accounts use probabilistic models to generate random answers and submit them to the platform. One example is the Sybil attack [6], in which a batch of robot accounts is registered and similar answers are submitted through them; existing aggregation algorithms usually fail in such circumstances. Group Plagiarism: a group of workers assigns all the tasks to one worker (usually the leader), and the rest simply copy this worker's answers. In the worst case, no worker actually goes to the trouble of answering the tasks, and what they copy is no better than randomly generated answers [9]. Division and Combination: all tasks are split among the group members and each member completes his/her own part. Then they share their answers within the group. In this scenario, if the assigned tasks are the ones a worker is actually good at, the final aggregation results maintain good accuracy.
But in practice, these tasks are usually randomly distributed and the collected answers contain many duplicates, which in turn jeopardizes the aggregation results [10]. Aggregation Submission: sometimes the platform will not accept a worker's answers if their accuracy is below 0.7, and the payment is proportional to data accuracy. In such cases, in order to obtain more rewards, workers use third-party social networks or other communication channels to aggregate their answers first [11]. One simple aggregation strategy is majority voting (MV), in which the answer with the most votes is selected; note the difference from the MV algorithm used for truth inference. Some other research works focus on analyzing the impact of such malicious behavior on aggregation algorithms. Farnaz et al. [18] proposed a method to classify data poisoning attacks in truth inference and summarized the impact of these attacks and malicious behavior on truth inference algorithms. They also used several metrics to measure the sensitivity of different truth inference algorithms to various attacks, but their work covers only a few aggregation algorithms and does not consider them under different scenarios. Miao et al. [19] investigated the impact of malicious behavior on the DS model and discussed its influence on the aggregation results in detail. However, they did not consider different scenarios, nor did they investigate the impact on other aggregation algorithms.
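The majority voting mentioned above, whether used by colluding workers before submission or as a truth-inference baseline, can be sketched as:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label with the most votes; ties go to the earliest-seen label
    (CPython 3.7+ Counter preserves insertion order for equal counts)."""
    return Counter(labels).most_common(1)[0][0]

# Five hypothetical workers answering one task
consensus = majority_vote(["A", "B", "A", "A", "C"])   # "A"
```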
Existing works focus more on modifying current aggregation algorithms so that the impact of malicious behavior is minimized; some other works focus on attacks on the crowdsourcing process. In this study, we analyze four typical kinds of malicious behavior in crowdsourcing and conduct comprehensive experiments to illustrate their characteristics and their impact on aggregation algorithms under different conditions.
3 Problem Definition
Crowdsourcing platforms usually pay workers based on the number of tasks completed and the quality of the answers. To gain more rewards with less effort, some workers may provide ill-fitting responses or participate in collusion. To improve the acceptance rate by platforms, some adopt strategies such as aggregation to preprocess their responses before submission. To help analyze this kind of behavior, we formally define malicious behavior as follows:

Definition 1 (Malicious Behavior). Let $T$ be the task set, $W$ the worker set and $L$ the set of responses from workers. $A = \{\langle l_{ij}, w_i, t_j \rangle \mid l_{ij} \in L, w_i \in W, t_j \in T\}$ is the set of ideal answers produced by workers with their own effort, in which $l_{ij}$ is the ideal answer provided by worker $w_i$ through his/her own effort for task $t_j$, while $l'_{ij}$ is the actual response of worker $w_i$ for task $t_j$. We call a function defined on $A$, $f(\langle l_{ij}, w_i, t_j \rangle) = \langle l'_{ij}, w_i, t_j \rangle$, Malicious Behavior if and only if there is a sample with $l'_{ij} \neq l_{ij}$. We call $f$ Non-Malicious or Zero-Malicious if and only if $l'_{ij} = l_{ij}$ for every sample. We call the subset $T_m = \{t_j \mid f(\langle l_{ij}, w_i, t_j \rangle) \neq \langle l_{ij}, w_i, t_j \rangle\}$ of $T$ the malicious task set, the subset $W_m = \{w_i \mid f(\langle l_{ij}, w_i, t_j \rangle) \neq \langle l_{ij}, w_i, t_j \rangle\}$ of $W$ the malicious worker set, the subset $A_m = \{\langle l_{ij}, w_i, t_j \rangle \mid f(\langle l_{ij}, w_i, t_j \rangle) \neq \langle l_{ij}, w_i, t_j \rangle\}$ of $A$ the malicious origin dataset, and $f(A_m)$ the malicious dataset. Let $P(\cdot)$ be the reward function and $E(\cdot)$ the cost function; then if $f$ satisfies

\[ \frac{P(f(A))}{E(f(A))} > \frac{P(A)}{E(A)}, \tag{1} \]

we call $f$ Effective Malicious Behavior, and when

\[ f = \arg\max_{f \in F} \frac{P(f(A))}{E(f(A))}, \tag{2} \]
we call $f$ The Optimal Malicious Behavior.

We can see from the above definition that malicious behavior is closely related to the reward function and the cost function. If a crowdsourcing platform does not provide different pricing strategies for tasks, then the reward is proportional to the number of completed tasks, that is, $P(A) \propto |A|$. If tasks are priced based on the quality with which they are completed, then the reward is proportional to the data quality, that is, $P(A) \propto V(A)$, where $V(A)$ is the quality function of
data. A common way to measure data quality is accuracy. For the cost function, the more effort malicious workers put in, the larger the function value. For example, we can set the cost to 1 if a worker completed a task through his/her own effort and 0 if the answer was copied from others; the cost function is then $E(A) = |A'|$, where $A' = A/A_m$ is the dataset produced with effort. With this definition, the goal of malicious users is to find the optimal malicious behavior defined by $f$. We categorize existing malicious behavior into four types, random submission, group plagiarism, division and combination, and aggregation submission, and describe the details and conduct theoretical analysis in the next section.
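Under the indifferential pricing policy just described (reward proportional to submitted tasks, cost counting only tasks answered with effort), the effectiveness test of Definition 1 reduces to comparing reward-to-cost ratios; a toy sketch with hypothetical numbers:

```python
def effectiveness(reward, cost):
    """The ratio P(.)/E(.) from Definition 1; behavior f is effective
    when effectiveness(P(f(A)), E(f(A))) > effectiveness(P(A), E(A))."""
    return reward / cost

# Hypothetical: 10 workers x 100 tasks, indifferential pricing.
honest = effectiveness(reward=10 * 100, cost=10 * 100)   # everyone works: 1.0
plagiarism = effectiveness(reward=10 * 100, cost=100)    # only the leader works: 10.0
assert plagiarism > honest   # Eq. (1) holds, so copying is "effective" here
```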
4 Effectiveness of Four Typical Malicious Behavior
Using the definition of malicious behavior, we analyze four typical kinds: random submission, group plagiarism, division and combination, and aggregation submission. All of them affect the aggregation results to some degree. Readers should be aware that under these kinds of malicious behavior, workers do not put in their own effort, or at least do not try their best to complete the tasks, so it is hard to infer their true abilities. We analyze the four typical kinds of malicious behavior in detail next.

Random Submission: In this kind of malicious behavior, workers aim to accomplish as many tasks as possible, so they usually provide random answers. For a task $t_j \in T$, a worker can randomly pick an answer $l'_{ij} \in L$. Assume that workers $W_m \subseteq W$ use random answers for $T_m \subseteq T$, resulting in a malicious dataset $f(A_m)$; the reward is $P(f(A_m))$. On the contrary, if these workers completed the tasks with effort, the reward would be $P(A_m)$. Obviously, providing random answers has a lower labor cost, that is, $E(f(A_m)) < E(A_m)$. For the other tasks, which are completed with effort, we have $P(f(A/A_m)) = P(A/A_m)$ and $E(f(A/A_m)) = E(A/A_m)$. $A' = f(A_m) \cup (A/A_m)$ is the final dataset. When the platform does not provide different pricing strategies for different tasks, we have $P(f(A)) = P(A)$ and

\[ \frac{P(f(A))}{E(f(A))} = \frac{P(f(A_m)) + P(A/A_m)}{E(f(A_m)) + E(A/A_m)} > \frac{P(A_m) + P(A/A_m)}{E(A_m) + E(A/A_m)} = \frac{P(A)}{E(A)}. \tag{3} \]
Thus, under an indifferential pricing policy, random submission is effective malicious behavior. If, however, the platform's pricing policy is based on response accuracy, the value of $P(f(A'))$ depends on the workers' abilities and the actual accuracy of the random submissions.

Group Plagiarism: In group plagiarism, one worker (usually the leader) accomplishes all the tasks and the rest simply copy this worker's answers. For a task $t_j \in T$, let $w_l$ be the worker that everyone else copies from; they simply use $l_{lj} \in L$ as the answer for task $t_j$. For a task set $T_m \subseteq T$, assume workers $W_m \subseteq W$ all copy the answers of $w_l$; we obtain the
malicious origin dataset $A_m$ and the malicious dataset $f(A_m)$, with reward $P(f(A_m))$. If all these workers accomplished the tasks on their own, the reward would be $P(A_m)$. Obviously, the cost of copying is lower than that of working with effort, that is, $E(f(A_m)) < E(A_m)$. For the tasks completed with effort, we have $P(f(A/A_m)) = P(A/A_m)$ and $E(f(A/A_m)) = E(A/A_m)$. $A' = f(A_m) \cup (A/A_m)$ is the final collected dataset. When the pricing strategy is the same across all tasks, we have $P(f(A)) = P(A)$ and

\[ \frac{P(f(A))}{E(f(A))} = \frac{P(f(A_m)) + P(A/A_m)}{E(f(A_m)) + E(A/A_m)} > \frac{P(A_m) + P(A/A_m)}{E(A_m) + E(A/A_m)} = \frac{P(A)}{E(A)}. \tag{4} \]
Thus, group plagiarism is effective malicious behavior when the pricing strategy is the same across all tasks. When pricing is based on data quality, $P(f(A'))$ depends on the workers' abilities and the actual accuracy of the submissions.

Division and Combination: In division and combination, each malicious worker $w_{l_i}$ accomplishes the tasks assigned to him/her with effort and then shares the answers with the others. Thus, division and combination can be seen as conducting multiple group plagiarisms. Let the answers other workers copy from $w_{l_i}$ be $f(A_m^i)$, and let there be $M$ workers in total who complete their tasks by themselves. Then $f(A_m) = \cup_{i=1}^{M} f(A_m^i)$, and the data produced with effort is $A/A_m$. For each group plagiarism $A_m^i$, we have $E(f(A_m^i)) < E(A_m^i)$ and reward $P(f(A_m^i))$; if this part were accomplished with effort, its reward would be $P(A_m^i)$. For the other tasks we have $P(f(A/A_m)) = P(A/A_m)$ and $E(f(A/A_m)) = E(A/A_m)$. $A' = f(A_m) \cup (A/A_m)$ is the final collected dataset. If the pricing strategy is the same across all tasks, then $P(f(A)) = P(A)$, and we have

\[ \frac{P(f(A))}{E(f(A))} = \frac{P(f(A_m)) + P(A/A_m)}{E(f(A_m)) + E(A/A_m)} = \frac{\sum_{i}^{M} P(f(A_m^i)) + P(A/A_m)}{\sum_{i}^{M} E(f(A_m^i)) + E(A/A_m)} > \frac{\sum_{i}^{M} P(A_m^i) + P(A/A_m)}{\sum_{i}^{M} E(A_m^i) + E(A/A_m)} = \frac{P(A)}{E(A)}. \tag{5} \]
Thus, under an indifferential pricing policy, division and combination is effective malicious behavior, and group plagiarism can be viewed as a special case of it. If the pricing strategy is based on data accuracy, $P(f(A'))$ depends on the workers' abilities and the actual data accuracy.

Aggregation Submission: Sometimes crowdsourcing platforms reject submissions from low-accuracy workers to filter out random responses, while in other cases the reward is proportional to data quality. Thus, to gain more rewards, some workers may adopt strategies to improve their responses before submission. In aggregation submission, each worker actually tries to accomplish all the tasks with effort; assume the resulting dataset is $A$. The workers may then
Effectiveness of Malicious Behavior and its Impact on Crowdsourcing
Table 1. Statistics of Datasets

Dataset name   # classes   # items   # workers   # worker labels
bluebird            2         108        39            4212
RTE                 2         800       164            8000
dog                 4         807       109            8071
synth4              4        4000         5           20000
share their answers via third-party social networks or offline with other workers. For each task, they discuss and pick the best answer (using majority voting, for example). Notice that in this case, all the answers submitted by the workers who participated in the discussion are the same. Assume the aggregated result is f(A_m). For the other, normal submissions we have A/A_m. For an indifferential pricing policy we know the reward is the same, that is, P(f(A_m)) = P(A_m). But because of the extra cost of discussion, we have E(f(A_m)) > E(A_m). For the workers that do not participate in aggregation submission, we have P(f(A/A_m)) = P(A/A_m), and the cost is E(f(A/A_m)) = E(A/A_m). A = f(A_m) ∪ (A/A_m) is the final dataset we get. For an indifferential pricing policy we have P(f(A)) = P(A), and

P(f(A))/E(f(A)) = [P(f(A_m)) + P(A/A_m)] / [E(f(A_m)) + E(A/A_m)] < [P(A_m) + P(A/A_m)] / [E(A_m) + E(A/A_m)] = P(A)/E(A).   (6)
Thus, aggregation submission is not effective malicious behavior for an indifferential pricing policy. If the pricing policy is based on accuracy and we assume the labor cost is E(f(A_m)) = E(A_m), then because of the aggregation preprocessing before submission the data quality might be improved, resulting in more rewards. We have P(f(A_m)) > P(A_m), and

P(f(A))/E(f(A)) = [P(f(A_m)) + P(A/A_m)] / [E(f(A_m)) + E(A/A_m)] > [P(A_m) + P(A/A_m)] / [E(A_m) + E(A/A_m)] = P(A)/E(A).   (7)
Thus, in this case, aggregation submission is effective malicious behavior. However, in practice E(f(A_m)) ≠ E(A_m), so whether it is effective malicious behavior depends on the actual cost function E(·). Note: From the above analysis, we know that when the pricing policy is indifferential, random submission, group plagiarism, and division and combination are effective malicious behavior, but aggregation submission is not. When the pricing policy is based on accuracy, whether a behavior is effective malicious behavior depends on the actual reward function and cost function.
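The note above can be made concrete with a toy computation. The sketch below is not from the paper; every parameter value (per-task reward, honest cost, copying cost, discussion cost) is an invented illustration. It compares the benefit-to-cost ratio P(f(A))/E(f(A)) of the copying behaviors and of aggregation submission against honest work under a flat (indifferential) pricing policy:

```python
# Toy benefit-to-cost ratios under a flat (indifferential) pricing policy.
# All parameter values are illustrative assumptions, not from the paper.

P_TASK = 1.0      # flat reward per task
E_TASK = 1.0      # cost of answering a task with effort
E_COPY = 0.1      # cost of copying an answer, so E(f(A)) < E(A)
E_DISCUSS = 0.3   # extra per-task cost of discussing in aggregation submission

def ratio(reward, cost):
    return reward / cost

n = 100           # tasks per worker
honest = ratio(n * P_TASK, n * E_TASK)

# Group plagiarism / division and combination: some tasks are copied,
# so total cost drops while the flat reward is unchanged (cf. Eq. (5)).
n_copied = 60
plagiarism = ratio(n * P_TASK, (n - n_copied) * E_TASK + n_copied * E_COPY)

# Aggregation submission: every task is done with effort PLUS discussion,
# so cost rises while the flat reward is unchanged (cf. Eq. (6)).
aggregation = ratio(n * P_TASK, n * (E_TASK + E_DISCUSS))

print(honest, plagiarism, aggregation)
```

Under flat pricing, the copying behaviors strictly improve the ratio while aggregation submission strictly worsens it, matching the note.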
5
Experiments
To investigate the impact of different malicious behavior on existing aggregation algorithms under various scenarios, we conduct comprehensive experiments on
X. Ding et al.
Fig. 1. The distribution of the number of tasks performed by workers in each dataset.
four public datasets. The statistics of these datasets are shown in Table 1. In this section, we will first describe the details of the datasets used and our experimental setup. Then, through experiments, we show how these four typical malicious behavior could impact existing aggregation algorithms. We also provide possible explanations for our observations. 5.1
Dataset
We conduct experiments on four widely used public datasets, including two binary classification datasets and two multiclass classification datasets, as shown in Table 1. The two binary classification tasks are bluebird [20], which asks whether an image contains an indigo bunting or a blue grosbeak, and the RTE dataset [21] for text entailment. The main goal of text entailment is to determine whether one text fragment can be inferred from the meaning of another text fragment. For multiclass classification tasks, we use the dog dataset from ImageNet [1] and the synth4 dataset, which consists of 4000 tasks and 5 workers. Table 1 shows the statistics of these four datasets. bluebird and synth4 are homogeneous, which means the number of tasks completed by each worker is the same. The dog and RTE datasets are inhomogeneous. Figure 1 shows the distribution of the number of tasks completed by workers. For the bluebird and synth4 datasets, the number of tasks completed by each worker is 108 and 4000, respectively. For the RTE and dog datasets, most workers performed fewer than 100 tasks.
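As a sanity check on Table 1, the reported counts directly imply each dataset's average redundancy (labels per item) and average worker load (labels per worker); the snippet below recomputes them:

```python
# Recompute average redundancy and worker load from Table 1.
datasets = {
    # name: (classes, items, workers, worker_labels)
    "bluebird": (2, 108, 39, 4212),
    "RTE":      (2, 800, 164, 8000),
    "dog":      (4, 807, 109, 8071),
    "synth4":   (4, 4000, 5, 20000),
}

for name, (_, items, workers, labels) in datasets.items():
    redundancy = labels / items    # average number of workers per task
    load = labels / workers        # average number of tasks per worker
    print(f"{name:8s} redundancy={redundancy:6.1f} tasks/worker={load:7.1f}")
```

For bluebird and synth4 the load equals the item count (108 and 4000), confirming that every worker labeled every item (homogeneous), while the RTE and dog loads are averages over a skewed distribution (inhomogeneous).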
Fig. 2. The effect of worker redundancy on accuracy for different datasets.
5.2
Experimental Setup
The four original datasets used in this study do not include malicious workers (or at least no malicious workers are reported). We simulate malicious behavior by changing workers' responses. For each dataset, we do not change the distribution of tasks over workers; instead, we turn some workers W̄ ⊆ [M] into malicious ones by changing their responses, so the ratio of malicious workers is |W̄|/(|W| + |W̄|). We randomly select workers and turn them into malicious workers. This sampling process is repeated 50 times and we report the average results. For the worker redundancy experiments in the following sections, we also repeat 50 times and report the average results. We set a reasonable range when changing workers' responses to make sure they are valid answers to the platform. We investigate four typical malicious behavior, namely random submission, group plagiarism, division and combination, and aggregation submission, as discussed in Sect. 4, and use a group of workers with no malicious behavior for comparison. We mainly consider three commonly used aggregation algorithms in this study: Majority Voting (MV), DS [12], and HDS [13,14]. DS is an EM-based probabilistic model that focuses on estimating ground truth from multiple noisy labels. HDS is very similar to DS, but it focuses on learning a classifier; the estimation of ground truth and workers' performance are byproducts.
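The corruption protocol for random submission can be sketched as follows. This is a simplified single-run illustration: the function names, the toy ground truth, and the 80% honest-worker accuracy are our own assumptions, and the paper averages each setting over 50 repetitions.

```python
import random
from collections import Counter

def make_malicious(labels, worker_ids, frac, n_classes, rng):
    """Turn a random fraction of workers into random submitters.

    labels: dict (worker, task) -> answer. The task distribution is kept
    unchanged; only the selected workers' answers are replaced.
    """
    bad = set(rng.sample(sorted(worker_ids), int(frac * len(worker_ids))))
    return {
        (w, t): (rng.randrange(n_classes) if w in bad else a)
        for (w, t), a in labels.items()
    }

def majority_vote(labels):
    """Aggregate per-task answers with MV, breaking ties arbitrarily."""
    by_task = {}
    for (_, t), a in labels.items():
        by_task.setdefault(t, []).append(a)
    return {t: Counter(ans).most_common(1)[0][0] for t, ans in by_task.items()}

# Toy run: 10 workers, 50 binary tasks, 80%-accurate honest workers,
# 30% of the workers turned into random submitters.
rng = random.Random(0)
truth = {t: rng.randrange(2) for t in range(50)}
labels = {(w, t): (truth[t] if rng.random() < 0.8 else 1 - truth[t])
          for w in range(10) for t in range(50)}
corrupted = make_malicious(labels, range(10), 0.3, n_classes=2, rng=rng)
est = majority_vote(corrupted)
acc = sum(est[t] == truth[t] for t in truth) / len(truth)
print(acc)
```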
Fig. 3. The influence of the proportion of malicious workers on the MV aggregation results with different worker redundancy (r), in the bluebird dataset.
5.3
Experimental Results
Redundancy. In crowdsourcing, due to the unreliability of workers, one task is usually assigned to several workers, and their responses are then aggregated to get the final answer. Redundancy is the average number of workers per task. Figure 2 shows the impact of different redundancy levels on the aggregation results. As we can see from this figure, for the three aggregation algorithms MV, HDS, and DS, the accuracy increases with the redundancy level. Because DS and HDS are very similar algorithms, the impact of redundancy on them is also similar. For the bluebird dataset, when the redundancy level is larger than 15, the MV algorithm starts to converge. Figure 3 shows the influence of the proportion of malicious workers on the MV aggregation algorithm under different redundancy levels (we increase the redundancy by 10 each time until its maximum value, which is 39 for the bluebird dataset). As we can see, once converged, further increasing the redundancy level does not affect our analysis of the proportion of malicious workers on the MV aggregation algorithm. We have consistent observations on the other datasets and algorithms; thus, for the following experiments, we use the redundancy level at which the aggregation algorithm starts to converge and do not consider different redundancy levels.
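The redundancy experiment can likewise be sketched by subsampling the worker labels kept per task before aggregation. Again a toy illustration with an assumed 70% worker accuracy, not the paper's exact protocol:

```python
import random
from collections import Counter, defaultdict

def at_redundancy(labels, r, rng):
    """Keep at most r randomly chosen worker labels per task."""
    per_task = defaultdict(list)
    for (w, t), a in labels.items():
        per_task[t].append((w, a))
    kept = {}
    for t, entries in per_task.items():
        for w, a in rng.sample(entries, min(r, len(entries))):
            kept[(w, t)] = a
    return kept

def mv_accuracy(labels, truth):
    """Majority-vote accuracy against the ground truth."""
    votes = defaultdict(list)
    for (_, t), a in labels.items():
        votes[t].append(a)
    est = {t: Counter(v).most_common(1)[0][0] for t, v in votes.items()}
    return sum(est[t] == truth[t] for t in truth) / len(truth)

# 39 workers (as in bluebird), 200 binary tasks, 70%-accurate workers.
rng = random.Random(1)
truth = {t: rng.randrange(2) for t in range(200)}
labels = {(w, t): (truth[t] if rng.random() < 0.7 else 1 - truth[t])
          for w in range(39) for t in range(200)}
for r in (1, 5, 15, 39):
    print(r, mv_accuracy(at_redundancy(labels, r, rng), truth))
```

Accuracy rises with the redundancy level and saturates well before the maximum, which is the convergence behavior described above.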
Fig. 4. The influence of the proportion of malicious workers on the accuracy, in the bluebird dataset. (r = 15)
Fig. 5. The influence of the proportion of malicious workers on the accuracy, in the synth4 dataset. (r = 5)
Proportion of Malicious Workers on Homogeneous Datasets. Fig. 4 and Fig. 5 show the impact of the four typical malicious behavior on accuracy for the homogeneous datasets bluebird and synth4. From these figures, we can see that with the increase of the number of malicious workers, the accuracy decreases accordingly. Group plagiarism and division and combination behave similarly: they first reduce the accuracy of the aggregation results until convergence. Random submission causes the largest decline (below 0.3), and its impact on multiclass classification tasks is larger than on binary classification tasks. What surprises us is that, for aggregation submission with MV, the accuracy first goes down and then goes up (this is especially true for the synth4 dataset). This is because when there are only a few malicious workers, the aggregated results might not be good, but with the increase of malicious workers, the accuracy of the aggregated results gets close to the situation where there is no malicious behavior. All these malicious behavior have similar impact on the DS and HDS algorithms, because DS and HDS are closely related aggregation algorithms; readers may refer to Fig. 2.
Fig. 6. The influence of the proportion of malicious workers on the accuracy, in the dog dataset. (r = 109)
Fig. 7. The influence of the proportion of malicious workers on the accuracy, in the RTE dataset. (r = 164)
For the bluebird dataset, aggregation submission does not have much impact on the MV aggregation algorithm. When the proportion of malicious workers is larger than 0.5, the impact of group plagiarism and division and combination starts to decrease. For the DS and HDS algorithms, when the proportion is larger than 0.3, the impact of group plagiarism, division and combination, and aggregation submission starts to decrease. On the contrary, when the proportion of malicious workers is less than 0.5, random submission has the minimal impact, but as the proportion increases, its influence becomes obvious and does not seem to decelerate. We have similar observations on the synth4 dataset. Please note that the oscillation of aggregation submission under MV is caused by ties (parity) in majority voting. Proportion of Malicious Workers on Inhomogeneous Datasets. Fig. 6 and Fig. 7 show the impact of the four typical malicious behavior on the inhomogeneous datasets (dog, RTE). We find that the overall behavior is similar to that on the homogeneous datasets: with the increase of the number of malicious workers, the accuracy starts to decrease. But we can also see that the influence of group plagiarism is small. This is because, for an inhomogeneous dataset, there exists a large number of workers who accomplished only a few tasks. This corresponds to the real situation that some workers may only want to share part of their answers.
6
Conclusion
This study investigated four different types of malicious behavior in crowdsourcing and proposed effectiveness analysis models for such malicious behavior. We analyzed the impact of these malicious behavior on three aggregation algorithms, namely MV, DS, and HDS. For the DS and HDS aggregation algorithms, when the proportion of malicious workers is less than 0.5, the influence of random submission is small, while group plagiarism and division and combination cause the biggest decline. But overall, random submission can lead to the biggest damage. For aggregation submission with MV, we observed that the accuracy first goes down and then goes up. For binary classification tasks, group plagiarism and division and combination cause the biggest damage, while for multiclass classification tasks, random submission has more influence. Acknowledgments. This research has been supported by the National Natural Science Foundation of China under grant 61976187 and the Natural Science Foundation of Zhejiang Province under grants LZ22F020008, LY20F030002, and LQ22F020002.
References
1. Deng, J., Dong, W., Socher, R., et al.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
2. Liu, X., Lu, M., Ooi, B.C., et al.: CDAS: a crowdsourcing data analytics system. PVLDB 5(10), 1040–1051 (2012)
3. Fang, Y., Sun, H., Zhang, R., et al.: A model for aggregating contributions of synergistic crowdsourcing workflows. In: Twenty-Eighth AAAI Conference on Artificial Intelligence (2014)
4. Zaidan, O., Callison-Burch, C.: Crowdsourcing translation: professional quality from non-professionals. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1220–1229 (2011)
5. Li, G., Wang, J., Zheng, Y., et al.: Crowdsourced data management: a survey. IEEE Trans. Knowl. Data Eng. 28(9), 2296–2319 (2016)
6. Yuan, D., Li, G., Li, Q., et al.: Sybil defense in crowdsourcing platforms. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1529–1538 (2017)
7. Gadiraju, U., Kawase, R., Dietze, S., et al.: Understanding malicious behavior in crowdsourcing platforms: the case of online surveys. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1631–1640 (2015)
8. Yin, M., Gray, M.L., Suri, S., et al.: The communication network within the crowd. In: Proceedings of the 25th International Conference on World Wide Web, pp. 1293–1303 (2016)
9. Chen, P.P., Sun, H.L., Fang, Y.L., et al.: Collusion-proof result inference in crowdsourcing. J. Comput. Sci. Technol. 33(2), 351–365 (2018)
10. Yu, L., Nickerson, J.V.: Cooks or cobblers? Crowd creativity through combination. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1393–1402 (2011)
11. Tang, W., Yin, M., Ho, C.J.: Leveraging peer communication to enhance crowdsourcing. In: The World Wide Web Conference, pp. 1794–1805 (2019)
12. Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Royal Stat. Soc. Series C (Appl. Stat.) 28(1), 20–28 (1979)
13. Raykar, V.C., Yu, S., Zhao, L.H., et al.: Learning from crowds. J. Mach. Learn. Res. 11(4), 1297–1322 (2010)
14. Zhou, D., Basu, S., Mao, Y., et al.: Learning from the wisdom of crowds by minimax entropy. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
15. Liu, C., Wang, S., Ma, L., et al.: Mechanism design games for thwarting malicious behavior in crowdsourcing applications. In: IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1–9. IEEE (2017)
16. Wang, G., Wang, T., Zheng, H., et al.: Man vs. machine: practical adversarial detection of malicious crowdsourcing workers. In: 23rd USENIX Security Symposium (USENIX Security 14), pp. 239–254 (2014)
17. Kaghazgaran, P., Caverlee, J., Alfifi, M.: Behavioral analysis of review fraud: linking malicious crowdsourcing to Amazon and beyond. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 11, no. 1, pp. 560–563 (2017)
18. Tahmasebian, F., Xiong, L., Sotoodeh, M., Sunderam, V.: Crowdsourcing under data poisoning attacks: a comparative study. In: Singhal, A., Vaidya, J. (eds.) DBSec 2020. LNCS, vol. 12122, pp. 310–332. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49669-2_18
19. Miao, C., Li, Q., Su, L., et al.: Attack under disguise: an intelligent data poisoning attack mechanism in crowdsourcing. In: Proceedings of the 2018 World Wide Web Conference, pp. 13–22 (2018)
20. Snow, R., O'Connor, B., Jurafsky, D., et al.: Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 254–263 (2008)
21. Welinder, P., Branson, S., Perona, P., et al.: The multidimensional wisdom of crowds. In: Advances in Neural Information Processing Systems, vol. 23 (2010)
Scene Adaptive Persistent Target Tracking and Attack Method Based on Deep Reinforcement Learning Zhaotie Hao, Bin Guo(B) , Mengyuan Li, Lie Wu, and Zhiwen Yu School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China {haozhaotie,mengyuanli,leiwu}@mail.nwpu.edu.cn, {guob, zhiwenyu}@nwpu.edu.cn
Abstract. As intelligent devices integrating a series of advanced technologies, mobile robots have been widely used in the field of defense and military affairs because of their high degree of autonomy and flexibility; they can independently track and attack dynamic targets. However, traditional tracking and attack algorithms are sensitive to changes in the external environment and lack transferability and extensibility, while deep reinforcement learning can adapt to different environments because of its good learning and exploration ability. In order to pursue targets accurately and robustly, this paper proposes a solution based on a deep reinforcement learning algorithm. In view of the low accuracy and low robustness of traditional dynamic target pursuit, this paper models the dynamic target tracking and attack problem of mobile robots as a Partially Observable Markov Decision Process (POMDP) and proposes a general-purpose end-to-end deep reinforcement learning framework based on dual agents to track and attack targets accurately in different scenarios. Aiming at the problem that it is difficult for mobile robots to accurately track targets and evade obstacles, this paper uses a partial zero-sum game to improve the reward function, providing implicit guidance for attackers to pursue targets, and uses the asynchronous advantage actor-critic (A3C) algorithm to train models in parallel. Experiments show that the model can be transferred to different scenarios and has good generalization performance. Compared with the baseline method, the attacker's time to successfully destroy the target is reduced by up to 44.7% in the maze scene and up to 40.5% in the block scene, which verifies the effectiveness of the proposed method. In addition, this paper analyzes the effectiveness of each structure of the model through ablation experiments, which illustrates the effectiveness and necessity of each module and provides a theoretical basis for subsequent research.
Keywords: Deep Reinforcement Learning · Partial zero-sum game · Target pursuit · Dual agent
1 Introduction A mobile robot is a multi-functional mobile platform integrating perception, decision-making, attitude control and other functions [1]. Because of its good flexibility and
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 133–147, 2023. https://doi.org/10.1007/978-981-99-2385-4_10
Z. Hao et al.
mobility, it has been widely used in various fields of human life. Especially in the modern information warfare system, mobile robots can track and attack accurately in real time, playing an increasingly important role on the future battlefield. Among these capabilities, vision-based automatic detection and tracking of battlefield targets has become the basic means for mobile robots to realize battlefield situation awareness and accurate attack. The traditional pursuit method separates the three tasks of visual recognition, visual tracking and attack decision-making, which must be executed separately and require a large number of cascaded adjustments. On a rapidly changing battlefield, even a small delay may lead to the failure of the entire tactical action. In addition, during the pursuit process, as the surrounding environment continuously changes, the target becomes more difficult to track due to sudden situations such as background changes, light intensity changes, obstacle occlusion, and self-rotation. These challenges place higher requirements on the robustness of mobile robots. Deep reinforcement learning (DRL) combines the advantages of deep learning and reinforcement learning [2]: it selects the optimal strategy with the help of the strong representation ability of deep learning, and uses neural networks to realize a nonlinear mapping from the perception end to the action end, so as to maximize the expected return. Since deep reinforcement learning has a strong exploratory learning ability [3], introducing its decision strategies into target tracking and attack tasks to adapt to complex and changeable environments and enhance robustness has become a hot topic.
Aiming at the low time efficiency of traditional target tracking and attack methods, this paper uses a deep reinforcement learning algorithm to solve, end-to-end, the problem of a single mobile robot chasing a target; uses a game mechanism to improve the reward function, addressing the challenge of low robustness in the pursuit process; and simulates the scene of a mobile robot chasing an enemy target in a real environment, which has reference significance for improving strike speed and accuracy. The main work and contributions of this paper include the following aspects: (1) The mobile robot tracking and attack problem is modeled as a Partially Observable Markov Decision Process (POMDP), and a general deep reinforcement learning solution based on dual agents is proposed to accurately track and attack the target in different scenarios. (2) Aiming at the problem that it is difficult for mobile robots to attack accurately and avoid obstacles, this paper uses a partial zero-sum game to improve the reward mechanism and provide implicit guidance for agents to track dynamic targets. (3) In a 2D simulation environment, the A3C algorithm is used to train the model, comparative transfer analysis is carried out, and the effectiveness of each module of the model is verified. The results show that the attacker can track the target accurately and avoid obstacles.
Scene Adaptive Persistent Target Tracking and Attack Method
2 Related Work 2.1 Visual Tracking Algorithm Vision-based target tracking refers to modeling the appearance of a target using temporal context so as to predict its future state [5]. Images are collected through the camera, and the target is recognized and localized. According to the number of objects tracked, vision-based tracking algorithms are divided into single object tracking (SOT) and multiple object tracking (MOT). This paper only discusses single object tracking and summarizes its main methods: generative tracking algorithms, discriminative tracking algorithms, and deep reinforcement learning based algorithms. Generative tracking algorithms complete target localization and continuous tracking by searching for the area in the image that best matches the target [5], e.g., particle filters [6] and optical flow [7]. Since they only focus on the target information and do not consider background information, illumination changes, target deformation and so on, their tracking accuracy is not high. Discriminative tracking algorithms use machine learning methods to build a discriminative model over the extracted image information and predict the position of the target in the picture, such as the kernelized correlation filter (KCF) algorithm [8] and the GradNet algorithm [9]. Although discriminative methods have greatly improved tracking performance, when deploying a real mobile robot they still need to handle the camera control task additionally, and cascaded parameter adjustment is required. Deep reinforcement learning combines the feature extraction ability of deep learning and can realize a non-linear mapping from the perception end to the action-value end, directly mapping the original image input to the agent's action output.
In recent years, attempts to combine the target tracking problem with reinforcement learning methods using deep networks have gradually increased, among which ptracker [10] and East [11] are representative. This paper uses deep reinforcement learning to realize the target tracking and attack process end-to-end. 2.2 Deep Reinforcement Learning Over the past few decades, with the increasing performance of modern computing devices, deep reinforcement learning has developed by leaps and bounds. It has been successfully applied not only in virtual environments (e.g., video games [12]), but also in real environments. For instance, the quadrupedal robot Anymal [13], developed by the Federal Institute of Technology in Zurich, can, with the support of DRL, independently traverse a variety of complex terrain. Under the same conditions, it takes far less time to cross the Alps than humans, only 76 min. Adversarial reinforcement learning is a branch of reinforcement learning that uses adversarial setups to improve the robustness of agents. In order to change or mislead the agent's strategy, Huang et al. [14] added stochastic hostile noise to the state input. [15] lets the virtual agent Alice compete with another agent Bob, in which Alice constantly creates greater challenges for Bob so that Bob obtains more robust strategies. In conclusion, in the methods proposed by predecessors, the opponent will not make intelligent choices for
the strategy of the protagonist agent and can only challenge the protagonist agent by adding noise to the observations. This paper designs a dual-agent game strategy for the follow-and-attack task, in which the attacker tries to follow and attack the target while the target tries to escape from the attacker.
3 Follow Attack Model Based on Game Mechanism

In this section, we describe a pursuit model based on a dual-agent game from the perspective of game theory. The model uses a partial zero-sum game to improve the reinforcement learning reward function and implicitly guides the agent to pursue the target, enhancing its pursuit performance and providing a reference for deploying the algorithm on real robots.

3.1 Problem Description

We first define the target tracking and attack problem with mathematical symbols. On the battlefield, once an enemy target is found, following it closely is the basis for an accurate attack. Since mobile robots cannot observe all the state values of the environment, we define the pursuit process as a partially observable Markov decision process [4]. Specifically, we define the attacker as agent1 and the target as agent2, and use a tuple to represent the whole attack process; the meanings of the symbols are shown in Table 1:

Table 1. Symbols of the pursuit POMDP.

Symbol     Meaning
S          State space
O1, O2     Observation spaces
A1, A2     Action spaces
r1, r2     Reward functions
P          State transition matrix
t          Time step, t ∈ {1, 2, . . .}
The pursuit process can be described as follows: at time t, the attacker uses the observation o_{1,t} to select the best action a_{1,t} ∈ A_1 according to its strategy π_1; the target likewise selects an action a_{2,t} ∈ A_2 according to its strategy π_2. Both act simultaneously, and the environment state is updated according to the state transition matrix P(· | s_t, a_{1,t}, a_{2,t}). At the same time, the attacker and the target receive reward feedback r_{1,t}, r_{2,t} from the environment respectively, where the symbols satisfy:

o_{1,t} = o_{1,t}(s_t, s_{t−1}, o_{t−1}),  o_t, o_{t−1} ∈ O,  s_t, s_{t−1} ∈ S   (1)

r_{1,t} = r_{1,t}(s_t, a_{1,t}),  r_{2,t} = r_{2,t}(s_t, a_{2,t})   (2)

The attacker's strategy is a distribution over actions a_{1,t} given the observation o_{1,t}. Reinforcement learning is used to learn the strategy parameters θ_1, so learning can be regarded as a neural network optimization process over these parameters, written as:

π_1(a_{1,t} | o_{1,t}; θ_1)   (3)

Likewise, the strategy of the target can be written as:

π_2(a_{2,t} | o_{2,t}; θ_2)   (4)

The goal of the attacker and the target is to maximize their accumulated rewards E_{π_1,π_2}[Σ_{t=1}^T r_{1,t}] and E_{π_1,π_2}[Σ_{t=1}^T r_{2,t}], where T is the total length (in time steps) of a round. The attack process can be seen as a game between the attacker and the target: both sides choose strategies to counter the opponent's strategy. In order to destroy the target as soon as possible, or to get rid of the attacker's tracking as soon as possible, the ability of both sides to make the best strategy improves in the continuous game. In the process of pursuit, there are two challenges: ensuring accuracy and ensuring a certain degree of robustness (Fig. 1).
(a) Accurately tracking targets
(b) Avoiding obstacles
Fig. 1. Dynamic target tracking challenges
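The simultaneous-move interaction described by Eqs. (1)-(4) can be sketched as a small rollout loop. The names and the 1-D chase environment below are hypothetical illustrations, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DualAgentPOMDP:
    """Minimal sketch of the tuple summarized in Table 1."""
    observe1: Callable[[Any], Any]              # o_{1,t} from the state
    observe2: Callable[[Any], Any]              # o_{2,t} from the state
    reward1: Callable[[Any, Any], float]        # r_{1,t}(s_t, a_{1,t})
    reward2: Callable[[Any, Any], float]        # r_{2,t}(s_t, a_{2,t})
    transition: Callable[[Any, Any, Any], Any]  # next state from (s_t, a_{1,t}, a_{2,t})

def rollout(env, s, pi1, pi2, T):
    """Both agents act simultaneously for T steps; returns cumulative rewards."""
    R1 = R2 = 0.0
    for _ in range(T):
        a1 = pi1(env.observe1(s))
        a2 = pi2(env.observe2(s))
        R1 += env.reward1(s, a1)
        R2 += env.reward2(s, a2)
        s = env.transition(s, a1, a2)
    return R1, R2

# Usage: a 1-D chase where the attacker's reward is minus the distance
# to the target and the target's reward is the distance (zero-sum).
env = DualAgentPOMDP(
    observe1=lambda s: s, observe2=lambda s: s,
    reward1=lambda s, a: -abs(s[0] - s[1]),
    reward2=lambda s, a: abs(s[0] - s[1]),
    transition=lambda s, a1, a2: (s[0] + a1, s[1] + a2),
)
R1, R2 = rollout(env, (0, 5), lambda o: 1, lambda o: 1, T=10)
```

With both agents moving at the same speed the gap never closes, and the two returns are exact opposites, the zero-sum case discussed in Sect. 3.2.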
The challenge of accurately tracking targets: on a real battlefield, if a mobile robot attacks a wrong target instead of the real tactical target, it is likely to cause the failure of a tactical action and thus delay the whole battle. The obstacle avoidance challenge: in the real environment, mobile robots face many difficulties, the biggest of which is interference from surrounding obstacles. If the robot cannot evade obstacles, it is likely to hit them, causing the tracking attack to fail. In view of the above two challenges, this paper uses a partial zero-sum game to improve the reward function, judges the target and obstacle information within the attack range of the agent, and uses a convolutional neural network to identify the attack target to ensure the accuracy of the attack process. The next section introduces this in detail.
3.2 Model Framework The reward function in the target pursuit scenario should give more reward when the attacker hits the target and severe punishment when it misses. Generally speaking, attack damage is related to attack distance: the closer the distance, the greater the damage; the farther the distance, the smaller the damage. Therefore, the reward function should be related to the distance between the target and the attacker. Traditional target pursuit research only considers the behavior decision of the attacker, without considering the adversarial interaction between the attacker and the target. However, the target is also an agent, which can make coping strategies to resist the attacker. Based on this, we designed a dual-agent pursuit game framework based on deep reinforcement learning, as shown in Fig. 2.
Fig. 2. Mobile robot target tracking and attack system framework
Both the attacker and the target have corresponding action spaces, state spaces and reward functions. In the 2D environment, the state space is designed as a two-dimensional matrix whose elements represent attackers, targets, obstacles and passable areas. In the 3D environment, it can be designed as a picture containing the pixel features observed in the agent's field of view. The action space of an agent is either discrete or continuous. The discrete space is represented by a one-dimensional vector, and at each time step the agent can move a unit distance in different directions. The continuous space is represented by velocity information: the agent's strategy outputs the mean and standard deviation of each action dimension and uses a Gaussian distribution to represent the action distribution. The reward functions of the attacker and the target should encourage them to avoid collisions with obstacles and keep moving for as long as possible. When the attacker is close to the target, more reward is given; when it is far away, less reward or a punishment is given. With intact attack equipment and sufficient ammunition, when the attacker tracks the target continuously for a period of time, the target can be considered destroyed. In this process, the reward function implicitly guides the training of the model, so as to learn the optimal strategy network.
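The continuous action parameterization described above, a per-dimension Gaussian with policy-predicted mean and standard deviation, can be sketched as follows; the clipping range is an assumption added for illustration:

```python
import random

def sample_gaussian_action(mean, std, low, high, rng=random):
    """Sample one continuous action: each dimension i is drawn from
    N(mean[i], std[i]^2) and clipped to the valid velocity range
    [low, high] (the clipping bounds are an illustrative assumption)."""
    action = []
    for m, s in zip(mean, std):
        a = rng.gauss(m, s)
        action.append(max(low, min(high, a)))
    return action

# e.g. a 2-D velocity action whose per-dimension mean/std would come
# from the policy network's output head
v = sample_gaussian_action(mean=[0.5, -0.2], std=[0.1, 0.1], low=-1.0, high=1.0)
print(v)
```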
In order to make the attacker track the target quickly and accurately while remaining robust across different scenarios, this paper uses a partial zero-sum game to improve the reward strategy [16]. A zero-sum game is a concept from game theory: under strict competition between the two sides, the gain of one side necessarily means the loss of the other, and the sum of gains and losses is always "zero". The game between the attacker and the target is also a zero-sum game: either the attacker successfully destroys the target, or the target successfully escapes. The attacker's success means the target's payoff is damaged, and its failure means the target has successfully escaped.
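A minimal sketch of such a distance-based attacker reward, following the partial zero-sum form given below as Eq. (5), can be written directly; the coefficient values A, ζ, ω used here are illustrative assumptions, not the paper's settings:

```python
def attacker_reward(rho, rho_star, rho_max, A=1.0, zeta=1.0, omega=1.0):
    """Eq. (5): r1 = A - zeta*|rho - rho*|/rho_max + omega*min(rho - rho*, 0).

    rho: actual attacker-target distance; rho_star: desired attack distance;
    rho_max: maximum attack range. Coefficient defaults are assumptions.
    """
    return A - zeta * abs(rho - rho_star) / rho_max + omega * min(rho - rho_star, 0.0)

# Reward peaks exactly at the desired distance and, via the omega term,
# drops faster on one side of it than the other.
print(attacker_reward(rho=2.0, rho_star=2.0, rho_max=10.0))
print(attacker_reward(rho=1.0, rho_star=2.0, rho_max=10.0))
print(attacker_reward(rho=3.0, rho_star=2.0, rho_max=10.0))
```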
Fig. 3. Schematic diagram of target pursuit
As shown in Fig. 3, ρ is the relative distance between the target and the attacker. The attacker hopes that its actions keep the target at the desired position. In fact, due to untimely tracking and other reasons, the actual position of the target (ρ, θ) may not coincide with the attacker's desired position (ρ*, θ*), the nearest safe attack position, nor with the target's own desired position (ρ2*, θ2*). In the framework of a zero-sum game, the sum of the rewards of the attacker and the target is required to be zero, that is, r1,t + r2,t = 0. In fact, because the attack range is limited, once the target escapes beyond the attack range the attacker can no longer observe it and makes wrong decisions. The experience the attacker gains through sampling at that point is inefficient and impractical, so the whole process is not entirely a zero-sum game. Therefore, in order to obtain better sampling experience, we use a partial zero-sum game to improve the reward function and punish the attacker when the target is outside its attack field of view, ensuring that the attacker does not stray far from the target. The reward function of the attacker is written as:

    r1 = A − ζ · |ρ − ρ*| / ρmax + ω · min(ρ − ρ*, 0)    (5)
where r1 is the reward of the attacker and μ > 0. When the target is within the attack range of the attacker, that is ρ < ρmax, the two sides are in a zero-sum game. When the target escapes beyond the attack range, that is ρ > ρmax, the target is punished: the farther it is from the attacker, the greater the punishment. The best strategy for the target would be to escape from the attacker's field of view, but in order to
Z. Hao et al.
avoid excessive punishment, the target needs to move near the maximum range of the attacker's field of view, as shown in Fig. 3. The reward range of the target is [−A, A]. Specific parameter settings for attackers and targets will be introduced in Chapter 4.

3.3 Model Structure

The observation encoder encodes the original image into a feature vector that serves as the input of the sequence encoder. Unlike other image coding networks, it has no pooling layer and its structure is relatively simple: after two convolutions and a fully connected operation, the result is passed to the sequence encoder. The specific model structure and input-output relationship are shown in Fig. 4. Since we work in a 2D environment, we cannot obtain what the attacker really sees; we only input the area within the maximum attack range (a ρmax × ρmax matrix) around the attacker in the form of an image. The observation encoder outputs a 256-dimensional feature vector (Fig. 4).
Fig. 4. Observation encoder
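The encoder above can be sketched in plain NumPy. The 3×3 kernels, the single channel, and the ReLU activations are assumptions made for illustration, but the structure follows the text: two convolutions without pooling, then a fully connected layer producing a 256-dimensional feature:

```python
import numpy as np

def conv2d_relu(x, k):
    """Valid-mode single-channel 2D convolution followed by ReLU (no pooling)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
obs = rng.random((15, 15))                 # 15x15 local observation around the attacker
k1 = rng.standard_normal((3, 3)) * 0.1     # first convolution kernel (assumed 3x3)
k2 = rng.standard_normal((3, 3)) * 0.1     # second convolution kernel
h = conv2d_relu(conv2d_relu(obs, k1), k2)  # 15x15 -> 13x13 -> 11x11
W = rng.standard_normal((256, h.size)) * 0.1
feature = np.maximum(W @ h.flatten(), 0.0) # fully connected layer -> 256-dim feature
```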
The sequence encoder fuses historical observation features into a representation containing time-series features, which is used as the input of the Actor-Critic network. For the pursuit problem, in addition to identifying and locating the target, the corresponding time-series characteristics are also crucial: they help the attacker judge the direction of the target's movement. The sequence encoder is generally composed of recurrent neural networks; in this paper, a long short-term memory (LSTM) network is used. It takes the 256 input features from the observation encoder, and both its hidden layer and output layer are 128-dimensional. The sequence encoder is necessary for the model: if only the current observation were input and the historical information ignored, the movement trend of the target would be unknown, and if the target suddenly turned it would be easy to lose. The effectiveness and necessity of the sequence encoder will be verified in the ablation experiments (Table 5). Both the attacker and the target networks are trained using the A3C algorithm. The models are generally similar, with some differences. Both input their own observed surrounding environment state. The attacker outputs four discrete strategic actions (east, west, south and north). To make the pursuit process more
challenging, the target can output eight discrete strategic actions (southeast, northeast, southwest and northwest are added), and selects one strategy to perform the action. The actor network and the critic network share the features of the sequence encoder as input, and output the corresponding action strategy and approximate value function respectively. The approximate value function represents the expected cumulative reward and is mainly used during training to calculate the actor's policy gradient for updating the network. In this paper, both the actor and critic networks are composed of fully connected layers. The action space is discrete, and the policy function generates an action distribution over the different directions. For gradient-based optimization we use the Adam optimizer proposed by Kingma and Ba [17] in 2014; because of its good learning-rate adaptability and efficient computing performance, it has become one of the most mainstream optimization algorithms in deep learning. The learning rate of the Adam optimizer is set to 0.001.
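The actor and critic heads can be sketched as follows; the single fully connected layer per head and the random weights are illustrative assumptions, but the structure matches the description: shared sequence-encoder features, a softmax distribution over the discrete moves, and a scalar value estimate:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the action logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_heads(feature, W_actor, w_critic):
    """Fully connected actor and critic heads sharing the sequence-encoder
    feature: return a distribution over discrete actions and a scalar value."""
    policy = softmax(W_actor @ feature)
    value = float(w_critic @ feature)
    return policy, value

rng = np.random.default_rng(0)
feature = rng.random(128)                       # 128-dim sequence-encoder output
W_actor = rng.standard_normal((4, 128)) * 0.1   # 4 attacker actions (E, W, S, N)
w_critic = rng.standard_normal(128) * 0.1
policy, value = actor_critic_heads(feature, W_actor, w_critic)
action = rng.choice(4, p=policy)                # sample a move from the policy
```

The target would use an 8-row actor matrix for its eight actions; everything else is unchanged.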
4 Experimental Verification

In this chapter, the effectiveness of the model is verified through 2D simulation experiments, and the generalization ability of the model is verified by migrating the trained model to different scenarios. Finally, the impact of each module on the overall performance of the model is analyzed through ablation experiments.

4.1 Experimental Setup

This paper improves the 2D scene of Zhong et al. [18] and adds dynamic obstacles to the environment, making the simulation more challenging and closer to the real world. As shown in Fig. 5, the simulation environment is a simple 100 * 100 matrix, where 0 represents free area, 1 represents obstacles, 2 represents our attacker, 3 represents dynamic obstacles, and 4 represents the enemy target; they are drawn in white, black, red, green, and blue respectively. Our goal is to track and attack the enemy. The map is divided into two modes according to the distribution of obstacles: Maze mode and Block mode. In Maze mode, obstacles form a maze-like environment, while Block mode randomly generates disordered block-like obstacles in the environment. In this paper, we first train under each map, and then transfer the model to other scenarios for testing to verify its migration ability. As shown in the partial figures of Fig. 5, an attacker can attack the surrounding 15 * 15 environment. The purpose of the attacker is to keep the target at its nearest safe attack position. In each round of the pursuit game, the attacker starts from any free space, and the target randomly appears in the 3 * 3 range centered on the attacker. Attackers and targets can move in different directions. The experiment sets three groups of baselines according to the way the target travels in the environment: random mode (Ram), regular path follower (RPF), and navigation mode (Nav). As shown in Fig. 5, the target in Nav mode starts from a randomly generated position in the environment and moves to a target point set in advance, and it can walk through most of the environments in the map. In Ram mode, the target walks
randomly without any rules, and is more likely to move repeatedly within one area of the map. The RPF-mode target moves along a regular rectangle throughout the map, and its trajectory is relatively fixed. For attackers, Ram and RPF targets are easy to pursue, while Nav targets are difficult.
(a) Block-Ram
(b) Block-RPF
(c) Block-Nav
(d) Block Partial Observations
(e) Maze-Ram
(f) Maze-RPF
(g) Maze-Nav
(h) Maze Partial Observations
Fig. 5. 2D simulation environment (Color figure online)
We use the A3C model for training based on the work of David Griffis et al. [19]. In the process of training the model, the parameter settings of the attacker and the target are shown in Table 2:

Table 2. Experimental parameter setting

Parameter   Value     Parameter   Value
A           1         γ           0.95
ζ           2         ω           1
μ           1         δ1          0.001
ρmax        7         n           10
λ1          0.01      λ2          0.25
where A is the maximum reward that an agent can obtain; ζ, μ and ω are normalization parameters of the agent reward functions; ρmax is the maximum attack range of the agent; δ1 is the learning rate of the Adam optimizer; γ is the reward discount factor; and λ1, λ2 are the regularization factors of the attacker and the target respectively. The target uses a larger regularization factor to encourage exploration. n is the frequency of parameter updates. The global maximum is 300K iterations without pursuit failure. In this paper, the pursuit performance is evaluated by accumulated reward and episode time.
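As an illustration, the attacker reward of Eq. (5) can be computed with the Table 2 values (A = 1, ζ = 2, ω = 1, ρmax = 7). The desired distance ρ* and the target's out-of-range penalty form are assumptions made for the sketch, since the text does not state them explicitly:

```python
# Attacker reward of Eq. (5) with the Table 2 parameters.
# rho_star and the target's out-of-range penalty shape are assumed values.
A, zeta, omega, rho_max = 1.0, 2.0, 1.0, 7.0
rho_star = 3.0  # desired (nearest safe) attack distance -- illustrative

def attacker_reward(rho):
    """Eq. (5): maximal reward A at rho = rho_star, with an extra penalty
    (the omega term) when the target gets closer than the desired distance."""
    return A - zeta * abs(rho - rho_star) / rho_max + omega * min(rho - rho_star, 0.0)

def target_reward(rho):
    """Partial zero-sum: inside the attack range the two rewards sum to zero;
    outside it, the target is punished more the farther it strays (assumed linear)."""
    if rho <= rho_max:
        return -attacker_reward(rho)
    return -(rho - rho_max)
```

At ρ = ρ* the attacker receives the maximum reward A, and beyond ρmax the growing punishment drives the target back toward the edge of the field of view, matching the behavior described above.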
Accumulated reward: the sum of the real-time rewards obtained in each round. One round refers to a pursuit process that runs to the maximum number of iterations or ends early because of tracking failure. It reflects the accuracy and robustness of the attacker in tracking the attack target. Episode time: the shortest follow-up time needed to destroy the target in a round. Since the 2D scene cannot simulate the attack itself, it is assumed that the attacker destroys the target by successfully following it for 100 consecutive steps. Episode time can roughly measure the performance of the pursuit. In this paper, when the attacker loses the target for 10 consecutive steps or reaches the maximum number of iterations, the pursuit process ends.

4.2 Comparative Transfer Experiment

This paper trains attackers to pursue different targets under different maps, compares the performance of each strategy, and then transfers the attacker model trained by the game mechanism to pursue different targets to verify the model's transfer performance. Considering the randomness of the initial position in each experimental environment and of the trajectory of the agent, this paper conducted 10 experiments to calculate the average accumulated reward and episode time. Different targets are used for training in the two map modes. The model training results are shown in Table 3.

Table 3. Model training results

Map Mode   Target Mode   Accumulated reward   Episode time
Maze       Ram           273 ± 11             1.72 ± 0.25 s
           Nav           254 ± 12             1.86 ± 0.43 s
           RPF           260 ± 85             1.77 ± 0.78 s
           Game          281 ± 11             1.38 ± 0.28 s
Block      Ram           296 ± 13             1.58 ± 0.34 s
           Nav           332 ± 17             1.66 ± 0.28 s
           RPF           302 ± 50             1.60 ± 0.43 s
           Game          349 ± 24             1.31 ± 0.29 s
It can be found that the accumulated rewards in game mode are higher than in the other target modes, and the episode time is shorter, which shows that the tracking accuracy and robustness of the partial zero-sum game are better than those of the other target modes. In addition, compared with the block map, the accumulated rewards of targets in all modes on the maze map are significantly lower and the episode times longer, which confirms that the maze map is more challenging, as mentioned above. Note that since the attacker closely follows the target in the later stage of training, where the game is effectively zero-sum, the target's accumulated reward is the negative of the attacker's.
Table 4. Model generalization performance

Map Mode   Target Mode   Accumulated reward   Episode time
Maze       Ram           375 ± 3              0.95 ± 0.01 s
           Nav           302 ± 12             1.13 ± 0.25 s
           RPF           308 ± 14             1.00 ± 0.17 s
Block      Ram           384 ± 3              0.94 ± 0.07 s
           Nav           278 ± 12             1.10 ± 0.23 s
           RPF           282 ± 9              0.97 ± 0.18 s
For a model, achieving good results only in the training scenario is not enough to demonstrate its quality; we also need to transfer it to other scenarios to verify its generalization ability. We transfer the attacker model trained with the partial zero-sum game to different map modes, and use a variety of targets to verify the attacker's tracking performance. The experimental results are shown in Table 4. From Table 4, we find that the model generalizes well when verified in other scenarios. When the target walks randomly in the maze map, the shortest destruction time is 44.7% less than the baseline method, the most obvious improvement. In the block map, the shortest hit time is also reduced by 40.5% compared with the baseline method. We make a more intuitive comparison between the transfer results and the baseline results. As shown in Fig. 6, the abscissa shows the different maps and target modes, and the ordinate is the episode time. It can be seen that the game method improves significantly over the baseline on every map.
Fig. 6. Comparison between game mode and baseline
4.3 Ablation Experiment

In Sect. 3.3, for the design of the attacker and target networks, the observation encoder only uses a shallow CNN, and the sequence encoder uses an LSTM
network. Whether this design is effective lacks verification and needs to be checked experimentally.

Table 5. Validation model of changing parameters

Map Mode   Changes           Accumulated reward   Episode time
Maze       More CNN layers   260 ± 13             1.35 ± 0.25 s
           GRU network       265 ± 11             1.46 ± 0.27 s
           Remove LSTM       225 ± 7              1.88 ± 0.55 s
Block      More CNN layers   336 ± 6              1.35 ± 0.20 s
           GRU network       324 ± 21             1.40 ± 0.34 s
           Remove LSTM       248 ± 13             1.75 ± 0.52 s
In order to verify the importance of the LSTM module, this paper carries out an ablation study by removing the LSTM while keeping other settings unchanged. The results are shown in Table 5. Compared with Table 3, after removing the LSTM network the pursuit performance decreases significantly (the destruction time under the maze and block maps increases by 36.2% and 33.5% respectively), which indicates that the LSTM module contributes greatly to the success of attacks. We also replace the LSTM with another recurrent neural network, the GRU [20] (Gated Recurrent Unit), for training. The accumulated rewards and episode times do not change much, indicating that different recurrent modules achieve comparable performance. In addition, this paper also tries a deeper CNN that adds three convolution layers; the results are shown in Table 5. Comparing the deep network with the shallow one, the episode time does not decrease but slightly increases, indicating that increasing the number of network layers cannot significantly improve the pursuit performance.
5 Conclusion

In this paper, aiming at the low accuracy and poor robustness of traditional approaches to the dynamic target pursuit problem, the dynamic target pursuit problem of mobile robots is modeled as a Partially Observable Markov Decision Process (POMDP), and a general end-to-end deep reinforcement learning solution is proposed. Aiming at the difficulty mobile robots have in tracking accurately while avoiding obstacles, a deep reinforcement learning solution based on a partial zero-sum game is proposed to provide implicit guidance for attackers pursuing targets. After parallel training with the A3C algorithm, the attacker can accurately track the target, avoid obstacles, and adapt to different scenarios. In addition, this paper analyzes the effectiveness of each module of the model, which lays a foundation for further algorithm optimization.
At present, deep reinforcement learning algorithms based on the game mechanism only consider 1v1 scenarios, that is, one agent pursuing another. However, on the real battlefield there are many one-to-many and many-to-many situations. In the future, it is necessary to improve the game cooperation strategy according to different scenarios to adapt to a variety of environments.

Acknowledgements. This project was supported by the National Outstanding Young Scientists Foundation of China (62025205), the National Key Research and Development Program of China (2019QY0600), and the National Natural Science Foundation of China (61960206008, 61725205).
References

1. Alexopoulos, C., Griffin, P.M.: Path planning for a mobile robot. IEEE Trans. Syst. Man Cybern. 22(2), 318–322 (1992)
2. Li, Y.: Deep reinforcement learning: an overview. arXiv preprint arXiv:1701.07274 (2017)
3. Arulkumaran, K., Deisenroth, M.P., Brundage, M., et al.: Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34(6), 26–38 (2017)
4. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994, pp. 157–163. Morgan Kaufmann (1994)
5. Li, X., Zha, Y.F., et al.: A survey of target tracking based on deep learning. J. Image Graph. 24(12), 2057–2080 (2019)
6. Innmann, M., Zollhöfer, M., Nießner, M., Theobalt, C., Stamminger, M.: VolumeDeform: real-time volumetric non-rigid reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 362–379. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_22
7. Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 661–675. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-47969-4_44
8. Danelljan, M., Häger, G., Khan, F., et al.: Accurate scale estimation for robust visual tracking. In: British Machine Vision Conference, Nottingham, September 1–5, 2014. BMVA Press (2014)
9. Chen, B., Wang, D., Li, P., Wang, S., Lu, H.: Real-time 'actor-critic' tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 328–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_20
10. Huang, C., Lucey, S., Ramanan, D.: Learning policies for adaptive tracking with deep feature cascades. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 105–114 (2017)
11. Yilmaz, A., Javed, O., Shah, M.: Object tracking: a survey. ACM Comput. Surv. (CSUR) 38(4), 13-es (2006)
12. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning (1998)
13. Miki, T., Lee, J., Hwangbo, J., et al.: Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 7(62), eabk2822 (2022)
14. Huang, S., Papernot, N., Goodfellow, I., et al.: Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284 (2017)
15. Sukhbaatar, S., Lin, Z., Kostrikov, I., et al.: Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407 (2017)
16. Zhong, F., Sun, P., Luo, W., et al.: AD-VAT: an asymmetric dueling mechanism for learning visual active tracking. In: International Conference on Learning Representations (2018)
17. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
18. Zhong, F., Qiu, W., Yan, T., Yuille, A., Wang, Y.: Gym-UnrealCV: realistic virtual worlds for visual reinforcement learning (2017). https://github.com/unrealcv/gym-unrealcv
19. Griffis, D.: A3C LSTM Atari with PyTorch plus A3G design. https://github.com/dgriff777/rl_a3c_pytorch
20. Cho, K., Van Merriënboer, B., Bahdanau, D., et al.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
Research on Cost Control of Mobile Crowdsourcing Supporting Low Budget in Large Scale Environmental Information Monitoring Lili Gao1(B) , Zhen Yao2 , and Liping Gao2 1 Weifang University, No. 5147 Dongfeng East Street, Weifang, China
[email protected] 2 University of Shanghai for Science and Technology, No. 516 Jungong Road, Shanghai, China
Abstract. Due to the large scale of information, the high cost of deployment, and the vulnerability of instruments, it is difficult to solve the problem of information monitoring in urban environments. The emergence of mobile crowdsourcing addresses the problems of high deployment cost and fragile instruments, making it possible to solve this monitoring problem. However, the massive employment cost caused by the huge information scale makes it difficult to apply mobile crowdsourcing technology in practice. This paper proposes a low-budget model based on compressed sensing and a naive Bayes classifier, which takes into account the impact of human activities on environmental information, improves the recovery algorithm of compressed sensing, and reduces the error of data recovery. The model also considers the fact that some participants are not qualified to complete the task, and improves the naive Bayes classifier to identify more reliable participants and reduce the re-employment rate. The model can recover all the data from a small number of sampling points, greatly reducing the task cost, and it can identify qualified participants with high accuracy, further reducing the task cost. Experiments show that the proposed low-budget model performs well on cost control. Keywords: Crowdsourcing · compressed sensing · naive Bayes classifier · cost control
1 Introduction

In recent years, with the rapid development of intelligent devices and the emergence of more sophisticated sensors, mobile crowdsourcing has become a new research hotspot. Mobile crowdsourcing refers to the mode in which a crowdsourcing platform sends tasks to participants, who sample the target information through the smart devices they carry and send the obtained data back to the platform for payment. This mode has become a favorable tool for solving the problem of large-scale environmental information monitoring. Before the emergence of mobile crowdsourcing, urban scale

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 148–163, 2023. https://doi.org/10.1007/978-981-99-2385-4_11
environmental information monitoring faced three difficult problems. First, large-scale environmental information monitoring requires deploying a corresponding number of sensors, whose huge cost is difficult to accept. Second, some areas are not suitable for installing sensors. Finally, sensors are sophisticated equipment that can be easily damaged and are costly to repair and replace. Scholars have introduced mobile crowdsourcing technology into urban-scale environmental information monitoring, solved the above problems, and developed a series of applications [1–3]. These applications realize noise pollution monitoring and environmental information collection through mobile phone location information and uploaded data. However, since the task requires employing a large number of participants, the large budget remains a challenge for academics. In the field of crowdsourcing, scholars have studied cost control from various perspectives, including power consumption [4], data transmission cost [5], and the number of sampling points [6]. Among these, compressed sensing is widely studied because of its good performance in cost control. Compressed sensing was proposed by Candès et al. [7, 8] in 2006. It breaks the traditional Shannon-Nyquist theorem and can restore all data from a small number of sampling points, which greatly saves sampling and transmission cost. Its core idea is to exploit the sparsity of data: project the high-dimensional data into a low-dimensional space through a corresponding sparse basis to obtain a small set of sampling points containing all the data information, reducing the cost of data transmission. After receiving these sampling points, the original data can be recovered through the corresponding sparse basis and recovery algorithm.
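The sample-then-recover idea can be illustrated with a generic sparse-recovery sketch. Orthogonal matching pursuit (OMP) stands in here for the (unspecified) recovery algorithm, and the signal sizes, sparsity level, and random Gaussian measurement matrix are all illustrative assumptions:

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily build a support of size k,
    refitting least squares on the chosen columns at every step."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        idx = int(np.argmax(np.abs(Phi.T @ residual)))  # most correlated column
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 50, 30, 3                              # 50 values, 30 samples, 3 nonzeros
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = [2.0, -1.5, 1.0]
Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
x_hat = omp(Phi, Phi @ x, k)                     # recover from m < n samples
```

With a sufficiently incoherent measurement matrix, such greedy recovery typically reconstructs the sparse signal from far fewer measurements than its length, which is the property sparse crowdsourcing exploits.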
In crowdsourcing, scholars infer the missing data by solving a minimum-rank problem based on the idea of compressed sensing, which greatly reduces the cost of sampling. The key to applying compressed sensing is whether the data to be processed is sparse in a low-dimensional space. Existing studies [9–14] have shown that environmental information such as temperature, humidity and air quality is suitable for compressed sensing. Therefore, compressed sensing can serve as the main means of cost control in urban-scale environmental information monitoring. It is important to note that not all participants are qualified to complete the task. If the task data submitted by a participant does not meet the requirements, the crowdsourcing platform has to hire a new participant to perform the task. In order not to dampen the enthusiasm of participants and thereby shrink the platform's user base, the platform needs to give a certain reward even when the submitted task data is of poor quality; after all, a large user base is the basic guarantee that a crowdsourcing platform can finish its tasks. Urban-scale environmental information monitoring places a higher requirement on the rate at which participants complete tasks satisfactorily because of its large scale: even a small increase in the task pass rate saves considerable cost. This paper selects qualified participants and improves the task pass rate using a naive Bayes classifier, which serves as an auxiliary means of cost control. For the high-cost problem of urban-scale environmental information monitoring, this paper proposes a low-budget model that reduces cost by reducing both the number of sampling points and the number of invalid hires. The main contributions of this paper are as follows:
• In this paper, compressed sensing is combined with mobile crowdsourcing, and the required data is recovered from a small number of sampling points, which greatly reduces the sampling cost. Meanwhile, the recovery algorithm in compressed sensing is improved to reduce the mean error of data recovery.
• Based on the naive Bayes classifier, this article constructs a participant employment module that identifies participants qualified to complete crowdsourcing tasks with high accuracy, thus reducing the number of invalid hires and further reducing the cost of crowdsourcing.
• This paper conducts experiments based on simulated temperature data, showing that compressed sensing can effectively reduce costs in environmental information monitoring and that the improved recovery algorithm improves the overall quality of data recovery. It also conducts experiments based on simulated participant data, showing that the participant employment module based on the naive Bayes classifier effectively reduces the employment cost. Together, the two experiments show that the proposed low-budget model performs well on cost control.

The organization of this article is as follows: Part 2 introduces related work; Part 3 presents the definitions, assumptions and problem statement; Part 4 describes the low-budget model in detail; Part 5 presents the experimental results and analysis; Part 6 summarizes the paper and discusses future research directions.
2 Related Works

In the field of mobile crowdsourcing, there are three mainstream research directions for cost control [15]. The first is to control the energy consumption of participants' equipment. Reference [16] uses a piggyback method to transmit data, transferring it only when participants use other applications, thus saving overall energy. Building on piggyback transmission, Ref. [17] uses an improved simulated annealing algorithm for prediction and selects participants with high application-usage frequency for task assignment, saving energy while ensuring data timeliness. However, for environmental information monitoring this method saves too little cost, and there is a risk that data cannot be uploaded to the crowdsourcing platform in time, so it is not applicable in this setting. The second approach is to control the cost of data transfer. Reference [18] proposed the EcoSense mechanism, which divides participants into two categories: those with unlimited data plans and those who pay per use. Pay-per-use participants transmit their collected data via Bluetooth to participants with unlimited data plans, who forward it to the crowdsourcing platform, saving transmission fees. Similarly, Ref. [19] lets participants hold data until they find a WiFi node and then upload it to the platform through WiFi. This method suits scenarios with large data volumes and low real-time requirements, but not environmental information monitoring, where information is usually updated every hour or less and thus has certain real-time requirements. The third method is to control the number of sampling
points. In 2015, L. Xu et al. [20] combined compressed sensing and crowdsourcing for the first time, recovering the required data from a small number of sampling points; the approach attracted much attention for its good performance in cost control. Because compressed sensing requires only a small number of sampling points, crowdsourcing using compressed sensing is also called sparse crowdsourcing. Sparse crowdsourcing can greatly reduce costs while ensuring the timeliness of data, so this paper takes it as the main means of cost control. Following L. Xu et al., scholars have studied sparse crowdsourcing from various angles [9–14, 21–24] and achieved good results. However, in order to maximally extract the characteristics of environmental information, they generally use a strict constraint matrix in the recovery algorithm, which makes it more sensitive to interference. In the real environment, human activities affect environmental information and cause a great deal of interference, which greatly reduces the performance of the recovery algorithm. To solve this problem, this paper uses a relatively loose constraint matrix to reduce the recovery algorithm's sensitivity to interference while still effectively extracting the environmental characteristics. At the same time, existing research relies only on compressed sensing for cost control, which cannot minimize the cost of crowdsourcing. For example, Refs. [9, 10, 14] assume that each participant returns correct sampled data after receiving a task. Under this assumption, when the total number of sampling points is 100 and the sampling rate is 50%, and all data can be correctly recovered, the number of hires can be reduced to 50, greatly cutting the sampling cost. In real life, however, participants may fail to complete the task properly for a variety of reasons.
If participants have a 50% pass rate on completing tasks, then even with a 50% sampling rate the number of hires would still be 100. To solve this problem, this paper adopts a naive Bayes classifier to select participants who are qualified to complete the task, reducing the number of invalid hires; this serves as an auxiliary means of cost control. Considering the advantages and disadvantages of existing studies, this paper proposes a low-budget model for urban-scale environmental information monitoring, which combines compressed sensing and a naive Bayes classifier to achieve maximum cost control.
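The participant-selection idea can be sketched with a from-scratch Gaussian naive Bayes classifier. The paper's improved classifier and its actual features are not specified here; the two features (e.g. historical pass rate and on-time rate) and the toy data are purely illustrative:

```python
import numpy as np

def train_nb(X, y):
    """Gaussian naive Bayes: store per-class feature means, variances
    and class priors (a small variance floor avoids division by zero)."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6, len(Xc) / len(X))
    return stats

def predict_nb(stats, x):
    """Pick the class with the highest log-posterior under the
    feature-independence assumption."""
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in stats.items():
        score = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data: class 1 = qualified participants (hypothetical features such as
# historical pass rate and on-time rate), class 0 = unqualified.
X = np.array([[0.9, 0.8], [0.85, 0.9], [0.2, 0.3], [0.3, 0.1]])
y = np.array([1, 1, 0, 0])
stats = train_nb(X, y)
```

A candidate predicted as class 1 would be hired first, which is how the classifier reduces the number of invalid hires.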
3 Problem Formulation

In this part, the definitions and assumptions used in this article are introduced first, and then the problem to be solved is elaborated in detail.

3.1 Definition

Definition 1: Full sampling matrix. For a mobile crowdsourcing task with m sampling points and n cycles, we use Fm×n to represent its full sampling matrix. F(i, j) denotes the ground-truth data of cell i in cycle j.

Definition 2: Binary matrix. We use Bm×n to denote the binary matrix, whose elements are 0 or 1. If B(i, j) equals 1, we sample cell i in cycle j. Otherwise, the data of cell i in cycle j is obtained by the recovery algorithm.
152
L. Gao et al.
Definition 3: Sampling matrix. We use sampling matrix Sm×n to store the data obtained from the sampling, which is calculated as follows:

S = F ◦ B. (1)

Where ◦ denotes the element-wise product of the two matrices.

Definition 4: (e, p)-quality. For a mobile crowdsourcing task lasting n cycles, it meets the quality requirement of (e, p)-quality when it satisfies:

|{k | e_k ≤ e, 1 ≤ k ≤ n}| ≥ n · p. (2)
Where e represents the maximum allowable error of data recovery in each cycle, and p represents a probability, meaning that the data-recovery error in at least n · p cycles is less than e. Since the real environmental data cannot be known in advance in practical applications, it is impossible to guarantee that the data-recovery error of every cycle is always less than e. In this paper, p is set to a large value (such as 0.9 or 0.95) and Bayesian inference based on the retention method is used to guarantee the recovery quality in practical applications.

3.2 Assumptions

To simplify the model, the following assumptions are made in this article.

Assumption 1: There are enough participants to take part in the crowdsourcing task in the target sampling area. Therefore, in each sampling cycle, any target cell has enough candidate participants who can accept the crowdsourcing task and sample the cell. Assumption 1 is achievable in real life. For example, WAZE, a crowdsourced traffic monitoring app, has more than 50 million users so far, more than enough potential participants to monitor environmental information on a city scale. Assumption 1 simplifies the model by eliminating the need to consider the movement trajectories of participants.

Assumption 2: During the sampling cycle, we have enough time to keep hiring participants until one of them can provide qualified sampling data. Assumption 2 is also achievable in real life. For city-scale environmental information monitoring, the update frequency of information is generally one hour, while the sampling time of a sensor is generally within a few minutes. For example, the sampling time of the humidity sensor SHTW2 is only 8 s. Therefore, we have enough time to collect qualified data. Assumption 2 enables us to select participants iteratively, avoiding the oversampling of target cells (and the resulting cost waste) caused by hiring multiple participants at the same time.
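The (e, p)-quality requirement of Definition 4 can be checked in a few lines; this is a minimal sketch, and the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def meets_ep_quality(errors, e, p):
    """Check the (e, p)-quality requirement of Definition 4:
    |{k : e_k <= e, 1 <= k <= n}| >= n * p,
    where `errors` holds the per-cycle recovery errors e_k."""
    errors = np.asarray(errors, dtype=float)
    return np.count_nonzero(errors <= e) >= errors.size * p

# 9 of 10 cycles within error 0.5: satisfies (0.5, 0.9)-quality
# but not (0.5, 0.95)-quality.
errors = [0.1, 0.2, 0.3, 0.4, 0.2, 0.1, 0.3, 0.2, 0.4, 0.9]
print(meets_ep_quality(errors, e=0.5, p=0.9))   # True
print(meets_ep_quality(errors, e=0.5, p=0.95))  # False
```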
Since both Assumption 1 and Assumption 2 can be realized in real life, the model designed in this paper can be applied in real environments.
Research on Cost Control of Mobile Crowdsourcing
153
3.3 Problem Formulation

Based on the above definitions and assumptions, we define the problem to be solved in this paper as follows. Given a crowdsourcing task with m sampling cells and n sampling cycles, in which r denotes the sampling rate, F denotes the full sampling matrix, F̂ denotes the recovered matrix and q denotes the participants' task pass rate, we need to keep the total number of hires to a minimum while keeping the recovered data as close to the real data as possible. The problem is expressed as follows:
min rp = (r × m × n) / q, s.t. F̂ = F. (3)

Where rp denotes the number of employees and F̂ = F denotes that the inferred data should be as close to the real data as possible. The key to solving this problem is achieving the lowest sampling rate and the highest task pass rate, for which we propose a low-budget model to maximize cost control.
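The cost relation of Eq. (3) can be checked directly; this tiny sketch (function name illustrative) reproduces the paper's 100-point example.

```python
def expected_hires(r, m, n, q):
    """Expected number of hires rp = (r * m * n) / q from Eq. (3):
    the number of sampled points divided by the participants' task pass rate."""
    return r * m * n / q

# The earlier example: 100 points, 50% sampling rate, 50% pass rate
# still requires 100 hires per cycle; with a perfect pass rate, only 50.
print(expected_hires(r=0.5, m=100, n=1, q=0.5))  # 100.0
print(expected_hires(r=0.5, m=100, n=1, q=1.0))  # 50.0
```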
4 The Low Budget Model

The workflow of the low-budget model proposed in this article is shown in the figure below.
Fig. 1. The workflow for the low-budget model.
As shown in Fig. 1, after a crowdsourcing task is received by the crowdsourcing platform, a random cell position is generated by the sampling matrix generation module. The participant employment module screens the candidate participants in the region according to the cell location and hires the more reliable ones to conduct sampling. After sampling, participants upload the sampled data to the crowdsourcing platform. Finally, the data recovery module performs data recovery on the sampled data to obtain the data of unsampled cells. The above process repeats until the quality of data
recovery meets the quality requirement (e, p)-quality. In practice, in order to avoid task timeout, a sufficient number of points can first be sampled at once according to historical experience; we can then sample points one by one until the task quality requirement is met. Next, we introduce the sampling matrix generation module, the participant employment module and the data recovery module in detail.

4.1 The Sampling Matrix Generation Module

Random sampling matrices are widely used in the field of sparse crowdsourcing. In this paper, the process of generating a random sampling matrix is as follows. For the binary matrix Bm×n to be generated, all elements are initialized to 0, indicating that no sampling is required. For cycle j, we randomly generate a number i satisfying 0 ≤ i < m. If B[i, j] equals 0, we set it to 1, meaning that cell i needs to be sampled in cycle j. A new number is then randomly generated and the process iterates until the quality of data recovery meets (e, p)-quality.

4.2 The Participant Employment Module

In real life, participants may be unable to complete tasks properly for various reasons, forcing the crowdsourcing platform to hire new participants for sampling, which wastes a large amount of employment cost. Especially in city-scale environmental information monitoring, the number of sampling points remains huge even when compressed sensing is adopted. If the task pass rate of participants is not high enough, the employment cost will be high. However, if payment depends simply on whether the task is completed to standard, participants will be discouraged from taking part in crowdsourcing tasks, which adversely affects the user scale of the crowdsourcing platform, the basic guarantee for completing crowdsourcing tasks.
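The random sampling-matrix generation of Sect. 4.1 can be sketched as follows. This is a minimal illustrative version: instead of looping until (e, p)-quality is met (which would require running the recovery algorithm), it simply marks a fixed number of cells per cycle, and `num_samples` is an assumed parameter not in the paper.

```python
import numpy as np

def generate_sampling_matrix(m, n, num_samples, rng=None):
    """Build a binary matrix B (m cells x n cycles), initialised to zero;
    for each cycle j, repeatedly draw a random cell index i in [0, m)
    and flip B[i, j] from 0 to 1 until num_samples cells are marked."""
    rng = rng if rng is not None else np.random.default_rng()
    B = np.zeros((m, n), dtype=int)
    for j in range(n):
        sampled = 0
        while sampled < num_samples:    # requires num_samples <= m
            i = rng.integers(0, m)
            if B[i, j] == 0:
                B[i, j] = 1
                sampled += 1
    return B

B = generate_sampling_matrix(m=100, n=40, num_samples=50)
print(B.sum(axis=0))  # 50 marked cells in every cycle
```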
To solve the above problems, we use the naive Bayes classifier to screen the participants and employ those with a high probability of completing the task in a qualified manner. The expression of the naive Bayes classifier [25] is as follows:

h(x) = argmax_{c∈y} P(c) ∏_{i=1}^{d} P(x_i | c). (4)

Where x denotes the classification sample, c denotes the category, y denotes the set of categories, d denotes the number of attributes, x_i denotes the value of x on attribute i, P(c) denotes the prior probability of the class, and P(x_i | c) denotes the conditional probability. P(c) is calculated as follows:

P(c) = |D_c| / |D|. (5)

Where D is the training set and D_c is the set of class-c samples in D. P(x_i | c) is calculated as follows:

P(x_i | c) = |D_{c,x_i}| / |D_c|. (6)
Where D_{c,x_i} denotes the set of samples in D_c whose value on attribute i is x_i. Equation (4) is the most commonly used naive Bayes classifier formula. On this basis, this article improves it to further raise its classification accuracy. For each P(x_i | c), we maintain a weight variable w_{x_i,c} whose initial value is 1. Given a coefficient k satisfying 0 < k < 1, when a sample belongs to category c_r but is misclassified as category c_l, we set w_{x_i,c_r} = w_{x_i,c_r} ÷ k and w_{x_i,c_l} = w_{x_i,c_l} × k. Equation (4) is therefore modified as follows:

h(x) = argmax_{c∈y} P(c) ∏_{i=1}^{d} w_{x_i,c} P(x_i | c). (7)
In the participant employment module, participants are divided into two categories by the naive Bayes classifier according to their ability to complete tasks. Each time a participant submits task data, the crowdsourcing platform checks whether it meets the requirements and updates the weight variable w_{x_i,c} accordingly. The more attributes and values extracted from the training set, the better the classifier, but the more computational resources it takes up. At the same time, the choice of attributes also affects the effect of the classifier; the name attribute, for example, has no effect on the classification of participants. For these reasons, the classifier attributes used in this paper are defined as the 6-tuple <pe, pt, re, sa, hq, lt> and the value range of each attribute is defined as the 3-tuple <lp, mp, hp>, with lp, mp and hp respectively representing low performance, medium performance and high performance, indicating the effect of the attribute on participants' qualified completion of tasks. In the attribute tuple, pe denotes mobile phone energy level, pt denotes mobile phone model, re denotes task reward, sa denotes monthly salary, hq denotes the number of tasks historically participated in and lt denotes the pass rate of the last 3 tasks. Details of attributes and attribute values are shown in Table 1. Since task rewards vary with crowdsourcing tasks, we assume that task rewards range from 10 to 100.

Table 1. Attribute value division.

     pe(%)   pt(launched years)  re(¥)   sa(¥)        hq(quantity)  lt(qualified quantity)
hp   80–100  1                   80–100  above 10000  above 10      3
mp   50–80   1–3                 50–80   5000–10000   3–10          1–2
lp   0–50    3–10                10–50   below 5000   0–2           0
Given a participant, we assume that the probability that he completes the task in a qualified manner is P+ and the probability that he does not is P−. We calculate P+ and P− through Eq. (6). If P+ > P−, we employ the participant to sample. Otherwise, another random participant is selected and classified. It should be noted that if a certain attribute value never appears together with a certain class in the training set, directly using Eq. (6) for probability estimation will lead to an error. For example, if the value of the first attribute of the sample is lp and none of
the values of the first attribute of the participants in the training set who are qualified to complete the task is lp, then the calculated probability P+ must be 0, which means that the information carried by the other attributes is erased. Therefore, we modify Eq. (5) and Eq. (6) by using the Laplace correction:

P̂(c) = (|D_c| + 1) / (|D| + N). (8)

P̂(x_i | c) = (|D_{c,x_i}| + 1) / (|D_c| + N_i). (9)
Where N represents the number of possible categories in training set D, and N_i represents the number of possible values of attribute i. In real life, a problem we need to face is the absence of some data in the data set. Participants may refuse to disclose their monthly salary and phone model for privacy reasons, or they may lack historical data because it is their first crowdsourcing task. Many causes can lead to missing data and thus affect the accuracy of classification, so missing-value processing is an integral part of the module. For participant samples lacking some attribute values, this article uses the naive Bayes classifier to infer the missing attribute values from the complete ones and thereby complete the training set. The key factor of this completion method lies in the order of attribute completion. Since a predicted attribute participates in the prediction of the next missing attribute, if it is predicted wrongly, the accuracy of the next prediction declines. Therefore, attributes carrying more information should be predicted first to reduce classification error. In this paper, information gain is used to sort the attributes that need to be predicted; the greater the information gain of an attribute, the more information it carries. The expression of information gain is as follows:

Gain(D, a) = ρ × ( Ent(D̃) − Σ_{v=1}^{V} r̃_v Ent(D̃^v) ). (10)

Where D denotes the training set, a denotes the attribute, ρ denotes the proportion of samples without missing values, D̃ denotes the subset of D that has no missing values on attribute a, V denotes the number of values of attribute a, r̃_v denotes the proportion of samples without missing values whose value on attribute a is a^v, D̃^v denotes the subset of D̃ whose value on attribute a is a^v, and Ent(D̃) denotes the information entropy, whose expression is as follows:

Ent(D̃) = − Σ_{k=1}^{|y|} p̃_k log₂ p̃_k. (11)
Where y denotes category and p˜ k denotes the proportion of category k in the samples without missing values.
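The screening classifier of this module (Eqs. (4)–(9)) can be sketched in a few dozen lines. This is a minimal illustrative implementation, not the paper's code: the class and method names are invented, and the weight w_{x_i,c} of Eq. (7) is applied as a simple multiplier on each Laplace-corrected conditional probability.

```python
from collections import Counter, defaultdict

class WeightedNaiveBayes:
    """Naive Bayes with Laplace correction (Eqs. (8)-(9)) and the
    multiplicative weight update of Eq. (7). Attribute values are
    assumed to be discrete levels such as lp/mp/hp from Table 1."""

    def __init__(self, k=0.9994):
        self.k = k  # weight-update coefficient, 0 < k < 1

    def fit(self, X, y):
        self.classes = sorted(set(y))
        N = len(self.classes)                      # number of classes
        d = len(X[0])                              # number of attributes
        self.values = [sorted({x[i] for x in X}) for i in range(d)]
        self.prior, self.cond = {}, {}
        self.w = defaultdict(lambda: 1.0)          # weights w_{x_i,c}, initially 1
        for c in self.classes:
            Xc = [x for x, label in zip(X, y) if label == c]
            self.prior[c] = (len(Xc) + 1) / (len(X) + N)   # Eq. (8)
            for i in range(d):
                counts = Counter(x[i] for x in Xc)
                for v in self.values[i]:
                    # Eq. (9): Laplace-corrected conditional probability
                    self.cond[(i, v, c)] = (counts[v] + 1) / (len(Xc) + len(self.values[i]))

    def score(self, x, c):
        s = self.prior[c]
        for i, v in enumerate(x):   # values must have been seen in training
            s *= self.w[(i, v, c)] * self.cond[(i, v, c)]
        return s

    def predict(self, x):
        return max(self.classes, key=lambda c: self.score(x, c))

    def update(self, x, true_c):
        """On a misclassification, boost the true class's weights (divide
        by k) and shrink the predicted class's weights (multiply by k)."""
        pred = self.predict(x)
        if pred != true_c:
            for i, v in enumerate(x):
                self.w[(i, v, true_c)] /= self.k
                self.w[(i, v, pred)] *= self.k

# Toy usage on two attributes: '+' = qualified, '-' = unqualified.
clf = WeightedNaiveBayes()
clf.fit([('hp', 'hp'), ('hp', 'mp'), ('lp', 'lp'), ('lp', 'mp')], ['+', '+', '-', '-'])
print(clf.predict(('hp', 'hp')))  # +
```

The `update` method would be called each time the platform checks a submitted task, mirroring the online weight adjustment described above.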
4.3 Data Recovery Module
Given a sampling matrix S, we need a recovered matrix F̂ as close to the full sampling matrix F as possible. Based on the low-rank property of F, we can obtain F̂ by solving the minimum-rank problem:

min rank(F̂), s.t. F̂ ◦ B = S. (12)

Since this problem is non-convex and difficult to solve directly, we first perform singular value decomposition on F̂:

F̂ = U Σ V^T = L R^T. (13)

Where L = U Σ^{1/2} and R = V Σ^{1/2}. According to compressed sensing theory [26, 27], when the restricted isometry property is satisfied, the minimum-rank problem can be transformed into the minimum nuclear-norm problem of the low-rank matrix, so problem (12) becomes:

min ‖L‖_F² + ‖R‖_F², s.t. (L R^T) ◦ B = S. (14)

In practical problems, F is usually close to low rank but not exactly low rank, and the sampled data usually contain noise, so it is difficult to find L and R that completely satisfy Eq. (14). For this reason, we use a Lagrangian multiplier to relax the constraint:

min ‖(L R^T) ◦ B − S‖_F² + λ(‖L‖_F² + ‖R‖_F²). (15)

Where λ denotes the tradeoff between rank minimization and accuracy. On the basis of Eq. (15), Ref. [28] adds spatial and temporal constraints:

min ‖(L R^T) ◦ B − S‖_F² + λ(‖L‖_F² + ‖R‖_F²) + λ_t ‖(L R^T) T_c^T‖_F² + λ_s ‖S_c (L R^T)‖_F². (16)

Where T_c denotes the temporal constraint and S_c denotes the spatial constraint. On the basis of Eq. (16), Ref. [29] adds a value constraint:

min ‖(L R^T) ◦ B − S‖_F² + λ(‖L‖_F² + ‖R‖_F²) + λ_t ‖(L R^T) T_c^T‖_F² + λ_s ‖S_c (L R^T)‖_F² + λ_v ‖V_c (L R^T)‖_F². (17)
Where Vc denotes the value constraint. Reference [28] proves that the sampling data of adjacent cells are similar and the sampling data of adjacent cycles of the same cell are similar, so the spatial-temporal constraints are used to capture these features. Reference [29] further proves that the sampling
data of two cells with similar surroundings are also similar, so the value constraint is used to capture this feature. But they do not consider the impact of human activities on environmental information. Suppose there are two adjacent cells A and B, and A is an outdoor barbecue area. When the outdoor barbecue in area A opens for business, the temperature in the next sampling cycle of area A rises suddenly, breaking the original spatial correlation between A and B. Such sampling points are called abnormal points in this paper. Similarly, the temporal and value correlations between sampling points can be disrupted by human activity. Abnormal points bring more interference information during data recovery. Spatial-temporal constraints and value constraints are sensitive to interference information because of their strictness; when restoring data in a region containing abnormal points, such strict constraints degrade performance. Therefore, a variance constraint is adopted in this paper. Compared with the spatial-temporal constraint and value constraint, the variance constraint is looser: it can still capture spatial-temporal and value correlations while reducing sensitivity to interference information. The formula of the recovery algorithm used in this paper is as follows:

min ‖(L R^T) ◦ B − S‖_F² + λ(‖L‖_F² + ‖R‖_F²) + (λ_v / n) ‖L R^T − M‖_F². (18)

Where n denotes the total number of cells in the target area and M denotes the average matrix, each of whose elements M[i, j] equals the average value of all sampled data. In this paper, the gradient descent method is used to alternately fix L and R for iterative optimization. Finally, the recovered matrix F̂ is obtained through Eq. (13).
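A minimal NumPy sketch of one way to minimize the variance-constrained objective of Eq. (18) is given below. The paper alternately fixes L and R; this sketch takes joint gradient steps for brevity, and every hyperparameter (rank, λ, λ_v, learning rate, iteration count) is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

def recover(S, B, rank=5, lam=0.1, lam_v=0.1, lr=0.01, iters=2000):
    """Gradient-descent sketch of the variance-constrained objective:
    ||(L R^T) o B - S||_F^2 + lam (||L||_F^2 + ||R||_F^2)
      + (lam_v / num_cells) ||L R^T - M||_F^2,
    where M is filled with the mean of the sampled entries."""
    rng = np.random.default_rng(0)
    num_cells = S.shape[0]                 # the paper scales lam_v by the cell count
    M = np.full(S.shape, S[B == 1].mean())
    L = rng.standard_normal((S.shape[0], rank)) * 0.1
    R = rng.standard_normal((S.shape[1], rank)) * 0.1
    for _ in range(iters):
        F_hat = L @ R.T
        E = (F_hat * B - S) * B            # residual on sampled entries only
        V = F_hat - M                      # variance-constraint residual
        gL = 2 * (E @ R + lam * L + (lam_v / num_cells) * V @ R)
        gR = 2 * (E.T @ L + lam * R + (lam_v / num_cells) * V.T @ L)
        L -= lr * gL
        R -= lr * gR
    return L @ R.T                         # recovered matrix, via Eq. (13)
```

Calling `recover(S, B)` on a sampling matrix and its binary mask returns the dense recovered matrix; the regularization keeps the factorization from overfitting the sampled entries.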
5 Experiment Results

5.1 Data Sets

In this paper, a simulated temperature data set is used to evaluate the data recovery module. The simulated data set contains 40 cycles and each cycle contains 100 cells, in which the proportion of abnormal points is 1%. If a cell is an abnormal point, its temperature is randomly increased by 0.5 to 1.5 degrees. For privacy reasons, this paper also uses simulated data to evaluate the actual effect of the participant employment module. We map the attribute values <lp, mp, hp> to scores <1, 2, 3> and the attributes <pe, pt, re, sa, hq, lt> to weights <0.1, 0.1, 0.2, 0.2, 0.2, 0.2>. We then randomly generate each attribute value of a participant A, so that A's score must be between 1 and 3. If A scores more than 1.5, A is a positive example with probability 80%, meaning A is qualified to complete the task; otherwise A is a negative example with probability 80%. According to these rules, 500 samples were randomly generated as experimental data, of which 300 were training samples and 200 were test samples.
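The participant-data generation rule described above can be reproduced as follows; this is a sketch, and the RNG seed and function name are arbitrary assumptions.

```python
import numpy as np

# Attribute levels <lp, mp, hp> map to scores <1, 2, 3>; the attributes
# <pe, pt, re, sa, hq, lt> carry weights <0.1, 0.1, 0.2, 0.2, 0.2, 0.2>.
WEIGHTS = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])

def generate_samples(n, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    attrs = rng.integers(1, 4, size=(n, 6))        # levels 1 (lp) .. 3 (hp)
    scores = attrs @ WEIGHTS                       # weighted score in [1, 3]
    flip = rng.random(n) < 0.8                     # true with probability 80%
    # Score > 1.5: positive with probability 0.8; otherwise negative with 0.8.
    labels = np.where(scores > 1.5, flip, ~flip).astype(int)
    return attrs, labels

X, y = generate_samples(500)
X_train, y_train = X[:300], y[:300]   # 300 training samples
X_test, y_test = X[300:], y[300:]     # 200 test samples
```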
5.2 Experiment Results

Experiment on Recovery Algorithm
Fig. 2. Average error of data recovery.
To avoid contingency, all test data are averaged over 10 runs. We first conducted experiments on the actual effect of compressed sensing. Figure 2 shows the average error of the results obtained by different recovery algorithms under different numbers of sampling points, in which GR denotes the general recovery algorithm, VBR denotes the recovery algorithm based on the variance constraint proposed in this paper, STBR denotes the recovery algorithm based on the spatial-temporal constraint, and STVBR denotes the recovery algorithm based on the spatial-temporal constraint and value constraint. Figure 2 shows that the VBR method performs best in every case. The performance of the STBR and STVBR methods is worse than that of the GR method, indicating that the error caused by the spatial-temporal constraint under the influence of interference information exceeds the performance advantage it brings. The global performance of the STBR and STVBR methods is similar, indicating that the value constraint also loses its performance advantage under interference information, although it does not degrade performance as the spatial-temporal constraint does. The effect of all four recovery methods improves with the number of sampling points, because more sampling points bring more valid information. However, when the number of sampling points exceeds 70, the effect of the four recovery methods decreases slightly. This is because when enough sampling points are collected, the probability of containing
abnormal points in the sampled data increases greatly, thus increasing the amount of interference information during data recovery.

Experiments on Participant Recruitment
Fig. 3. The number of employees corresponding to different sampling points.
Fig. 4. Classification accuracy under different coefficients k.
Next, we conducted an experiment on the effect of the participant employment module. Figure 3 shows the number of employments required using the employment methods based on random recruitment (RR) and the naive Bayes classifier (NBC) at different numbers of sampling points. Figure 3 shows that the participants employed by the RR and NBC methods had nearly constant task pass rates at the macro level. The NBC approach can employ participants with a higher probability of completing the task, so as the number of sampling points increases, the NBC approach saves more cost. Figure 4 shows the classification accuracy of the classifier under different values of coefficient k. When k is 1, the weight variable w_{x_i,c} is always 1, and the classifier used in this paper reduces to the common naive Bayes classifier. When k is 0.9994, the classifier achieves its best performance, and the classification accuracy improves by about 2.5% compared with the common naive Bayes classifier. Finally, we randomly discard some values of the training set to obtain training sets with different
Fig. 5. The classification accuracy of training sets with different initial integrity.
integrity, and then predict and complete the missing values with the improved naive Bayes classifier. We classified the whole test set using both the completed training sets and the original training set, and the results are shown in Fig. 5. When the training set integrity decreases from 100% to 70%, the classification accuracy also decreases, indicating that data integrity affects classification accuracy. When the integrity of the training set is 60%, its performance exceeds that of the training set with 80% integrity, indicating that a correct completion order and method can, to a certain extent, offset the impact of data integrity on classification accuracy. The completed training sets of different initial integrity achieve accuracy similar to the original training set, which indicates that the completion method used in this paper is feasible.
6 Conclusion and Future Work

The low-budget model proposed in this paper minimizes the cost of mobile crowdsourcing tasks from two aspects: sampling points and employed participants. To reduce the number of sampling points, this paper proposes a recovery algorithm based on the variance constraint. Compared with the general recovery algorithm and the algorithms based on spatial-temporal and value constraints in Refs. [28, 29], this paper's method is better at handling interference information. To reduce the number of employed participants, this paper proposes an employment method based on naive Bayes classification, which greatly reduces the number of employed participants compared with general random employment. At the same time, for the problem of missing data in real scenes, this paper uses the naive Bayes classification method to complete the original data set according to information entropy. In future work, the main research directions are as follows. First, we need to improve the existing gradient descent algorithm to reduce the average error of data recovery. Second, we should design a reward mechanism linking information integrity with task reward, so as to encourage users to complete their information and thus improve the accuracy of the classifier. Finally, the prerequisite of the naive Bayes classifier is the assumption that each attribute independently affects the classification result.
In real life, however, attributes affect each other. Therefore, we need to improve the classifier used in this paper.
References

1. Maisonneuve, N., Stevens, M., Ochab, B.: Participatory noise pollution monitoring using mobile phones. Inform. Polity 15(1/2), 51–71 (2010)
2. Rana, R.K., Chou, C.T., Kanhere, S., et al.: Ear-Phone: an end-to-end participatory urban noise mapping system. In: 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN, pp. 105–116 (2010)
3. Min, M., Sasank, R., Katie, S., et al.: PEIR, the personal environmental impact report, as a platform for participatory sensing systems research. In: 7th International Conference on Mobile Systems, Applications, and Services, pp. 55–68 (2009)
4. Lane, N.D., Chon, Y., Zhou, L., et al.: Piggyback CrowdSensing (PCS): energy efficient crowdsourcing of mobile sensor data by exploiting smartphone app opportunities. In: ACM Conference on Embedded Networked Sensor Systems, pp. 1–14 (2013)
5. Xiao, M., Wu, J., Huang, L.: Online task assignment for crowdsensing in predictable mobile social networks. IEEE Trans. Mob. Comput. 16(8), 2306–2320 (2017)
6. He, S., Kang, G.S.: Steering crowdsourced signal map construction via Bayesian compressive sensing. In: IEEE Conference on Computer Communications, IEEE INFOCOM, pp. 1016–1024 (2018)
7. Candes, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
8. Candes, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
9. Wang, L., Zhang, D., Pathak, A., et al.: CCS-TA: quality-guaranteed online task allocation in compressive crowdsensing. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp, pp. 683–694 (2015)
10. Wang, L., Zhang, D., Yang, D., et al.: SPACE-TA: cost-effective task allocation exploiting intradata and interdata correlations in sparse crowdsensing. ACM Trans. Intell. Syst. Technol. 9(2), 20 (2018)
11. Chen, Y., Guo, D., Xu, M.: ProSC plus: profit-driven online participant selection in compressive mobile crowdsensing. In: 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS, pp. 1–6 (2018)
12. Zhou, T., Cai, Z., Xiao, B., et al.: Location privacy-preserving data recovery for mobile crowdsensing. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 2(3), 151 (2018)
13. Liu, T., Zhu, Y., Yang, Y., et al.: Incentive design for air pollution monitoring based on compressive crowdsensing. In: 59th Annual IEEE Global Communications Conference, IEEE GLOBECOM, pp. 1–6 (2016)
14. Chen, J., Chen, Z., Zheng, H., et al.: A compressive and adaptive sampling approach in crowdsensing networks. In: 2017 9th International Conference on Wireless Communications and Signal Processing, WCSP, pp. 1–6 (2017)
15. Guo, B., Liu, Y., Wang, L.: Task allocation in spatial crowdsourcing: current state and future directions. IEEE Internet Things J. 5(3), 1749–1764 (2018)
16. Ko, H., Pack, S., Leung, V.C.M.: Coverage-guaranteed and energy-efficient participant selection strategy in mobile crowdsensing. IEEE Internet Things J. 6(2), 3202–3211 (2019)
17. Bradai, S., Khemakhem, S., Jamaiel, M.: Real-time and energy aware opportunistic mobile crowdsensing framework based on people's connectivity habits. Comput. Netw. 142, 179–193 (2018)
18. Wang, L., Zhang, D., Xiong, H., et al.: ecoSense: minimize participants' total 3G data cost in mobile crowdsensing using opportunistic relays. IEEE Trans. Syst. Man Cybern. Syst. 47(6), 965–978 (2017)
19. Peng, Z., Gui, X., An, J., et al.: Multi-task oriented data diffusion and transmission paradigm in crowdsensing based on city public traffic. Comput. Netw. 156, 41–51 (2019)
20. Xu, L., Hao, X., Lane, N.D., et al.: More with less: lowering user burden in mobile crowdsourcing through compressive sensing. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp, pp. 659–670 (2015)
21. Hao, X., Xu, L., Lane, N.D., et al.: Density-aware compressive crowdsensing. In: 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN, pp. 29–39 (2017)
22. Liu, W.B., Yang, Y.J., Wang, E., et al.: User recruitment for enhancing data inference accuracy in sparse mobile crowdsensing. IEEE Internet Things J. 7(3), 1802–1804 (2020)
23. Wang, L.Y., Zhang, D.Q., Yang, D.Q., et al.: Sparse mobile crowdsensing with differential and distortion location privacy. IEEE Trans. Inf. Forensics Secur. 15, 2735–2749 (2020)
24. Gao, L., Yao, Z., Li, G., Chen, Q.: Research on cost control of mobile crowdsourcing based on compressive sensing in environmental information monitoring. J. Chin. Mini-Micro Comput. Syst. 43(02), 443–448 (2022)
25. Zhou, Z.: Machine Learning. Tsinghua University Press, Beijing (2016)
26. Donoho, D.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
27. Candes, E.J., Tao, T.: Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006)
28. Liu, W., Wang, L., Wang, E., et al.: Reinforcement learning-based cell selection in sparse mobile crowdsensing. Comput. Netw. 161, 102–114 (2019)
29. Liu, W., Yang, Y., Wang, E., et al.: Multi-dimensional urban sensing in sparse mobile crowdsensing. IEEE Access 7, 82066–82079 (2019)
Question Answering System Based on University Knowledge Graph

Jingsong Leng1,2, Yanzhen Yang1,2, Ronghua Lin1,2, and Yong Tang1,2(B)

1 South China Normal University, Guangzhou 510631, China
{lengjingsong,yangyz,rhlin,ytang}@m.scnu.edu.cn
2 Pazhou Lab, Guangzhou 510330, China
Abstract. The Question Answering (Q&A) system recognizes natural language questions and gives answers, so people can get the information they want accurately and quickly. SCHOLAT is an academic social network. Its University Portrait Project includes information on more than 2,700 colleges and universities in China. In this paper, we build a Q&A system for the university information field on this basis. We construct a university Knowledge Graph from the university database as the knowledge base for the Q&A system. We also collect more than two thousand university-related questions as training data for the Q&A model. The Q&A system first performs entity recognition on the input question to obtain the abstract formula of the question sentence and a list of entity vocabulary. Second, feature vocabulary is extracted and a word vector is constructed. Third, the word vector is input to a Bayes classifier to obtain the question type (i.e., intent). Then, a structured query statement is selected according to the question type, with the entity vocabulary list as query parameters, to obtain relevant knowledge data from the Knowledge Graph. Finally, this knowledge data is used to construct and output the answer sentence.
Keywords: Question Answering System · Knowledge Graph · SCHOLAT

1 Introduction
A Question Answering (Q&A) system is an intelligent system that understands questions asked by people. It has a certain level of knowledge and can answer a question. Its input and output are sentences described in natural language. Q&A systems can be divided into open-domain and restricted-domain Q&A systems according to the domain they are designed for [8]. Open-domain Q&A systems usually use various information resources on the Web to find answers using retrieval and statistical methods, and the algorithms and models used are not particularly complex [13]. Compared with an open-domain Q&A system, a restricted-domain Q&A system can make use of many information resources in the domain

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 164–174, 2023. https://doi.org/10.1007/978-981-99-2385-4_12
and optimize the algorithm specifically for this restricted domain, so that the Q&A system can give more professional and accurate answers. By applying theoretical or empirical techniques from the restricted domain, the Q&A system can gain a deeper understanding of complex issues [2]. The Knowledge Graph is a structured semantic knowledge base used to describe concepts and their interrelationships in the physical world in symbolic form. The concept of the Knowledge Graph was introduced by Google [22] in 2012. In Q&A systems, natural language can be easily expressed and stored through semantic networks. SCHOLAT is an academic social network with more than 2,700 colleges and universities in its database, and it already has a university Knowledge Graph. This can serve as the source of knowledge data for our Q&A system.
2 Related Work
The concept of intelligent Q&A has been around since 1950, when Alan Turing's article Computing Machinery and Intelligence [24] opened with The Imitation Game and discussed the question "Can machines think?". The Imitation Game determines whether a machine is intelligent by asking it questions and checking the answers it gives. The Knowledge Graph makes knowledge search easy and efficient and has led to the development of Knowledge Graph-based Q&A systems. Liu et al. [17] applied the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model [6] to a knowledge Q&A system, avoiding the time spent on manual template creation and improving its generalization ability. Banerjee [1] constructed an implicitly supervised neural network model that learns from the Knowledge Graph to build a Q&A model. Lan et al. [15] proposed a novel iterative sequence matching model for multi-hop KBQA (Knowledge-Based Question Answering), which is very helpful for ranking the candidate paths queried from the Knowledge Graph. Do et al. [7] used the Knowledge Graph and BERT model to enhance the Q&A system; they divided the triples in the Knowledge Graph into three classes and constructed two classifiers. While most Q&A systems build an accurate Q&A model before deployment, Li et al. [16] built a Q&A system that continuously improves the accuracy of the model through user interaction feedback. There are many examples of open-domain Q&A systems, which are mainly integrated into intelligent voice assistants [14]; questions they cannot answer are sent to search engines for retrieval. Examples include IBM's Watson Assistant [4], Google Now [23], Apple's Siri, Amazon's Alexa, and Microsoft's XiaoIce social chatbot [20,25]. Many researchers have contributed Q&A systems for different fields. Andreas Both et al. constructed a Q&A system for the COVID-19 domain using Component-Based Microbenchmarking [3].
Zhixue Jiang et al. constructed a Knowledge Graph-based Q&A system for the medical domain [11]. Xueling Dai et al. constructed a military domain-based Q&A system [5], which uses semantic Web technologies to deeply analyze military documents.
J. Leng et al.

3 Preparation

3.1 University Knowledge Graph
The database used for the Knowledge Graph is Neo4j, a graph database developed in the Java language. It provides a specialized structured query language called CQL (Cypher Query Language). The whole Knowledge Graph contains 191,089 entities and 1,638,275 relationships between entities. Among them, there are 2,756 entities of university type, 3,391 entities of course type, and 168,671 entities of scholar type; the remaining 16,271 entities cover majors, academic fields, research directions, etc. The main attributes of a university-type entity node are the name of the university in English and Chinese, alias and abbreviation, location, year of establishment, institution code, and profile.
3.2 Natural Language Question Dataset
The question dataset used to train the classification model is partially shown in Fig. 1; each item is pre-labeled with its question type for supervised model training. In addition, these data can be used to test the entity recognition algorithm on university-name vocabulary.
Fig. 1. Natural language questions dataset
4 Question Understanding and Knowledge Search
Consider the question “北京大学的院校代码是什么? (What is the institution code of Peking University?)” , we can see that this question intends to ask for the institution code of a school, and the entity word “北京大学 (Peking University)” does not affect the intent of this question. In other words, replacing “北京大学 (Peking University)” with other entity words of the same type will have the same meaning, for example, “清华大学的院校代码是什么? (What is the institution code of Tsinghua University?)”, “华南师范大学的院校代码是什么? (What is the institution code of South China Normal University?)”. Then, we can use the placeholder “[school]” instead of “北京大学 (Peking University)”,
Question Answering System Based on University Knowledge Graph
so that we get an abstract sentence for this type of question: "[school]的院校代码是什么? (What is the institution code of [school]?)". Doing the same for each question in the original question set yields a dataset of abstract sentences. Consider the following three interrogative sentences:

1. "清华大学的院校代码是什么? (What is the institution code of Tsinghua University?)"
2. "北京大学的院校代码是什么? (What is the institution code of Peking University?)"
3. "北京大学是双一流高校吗? (Is Peking University a Double First-Class university?)"

The first two sentences differ in their entity words, but both entities are university names, and the remainder of each question is identical. The intent of these two sentences is therefore the same, and the answer can be retrieved through the same knowledge retrieval path, only with different input parameters. In the latter two sentences, the entity word is the same, "北京大学 (Peking University)", but the rest of the question is completely different: they ask about different aspects of the same entity, so their intents are completely different. In other words, the latter two sentences belong to different question types. From this analysis, we conclude that the type of a question is determined by the remainder of the sentence once the entity words are removed, together with the types of the removed entities. For example, "北京大学的简称是什么? (What is the short name of Peking University?)" and "计算机科学与技术的简称是什么? (What is the short form of Computer Science and Technology?)" are different types of questions because their entities have different types. The next steps are then as in Fig. 2. (1) Named entity recognition (NER) is applied to the input question sentence.
This step outputs the abstract formula of the question and a list of entity words; the type of each abstract formula is labeled manually. (2) Extract the feature terms from the abstract sentence and construct the word vector. (3) Feed the word vector and type label of each sentence into the Bayes classification model and train the model. (4) Query information from the Knowledge Graph and compose an answer based on that information.
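The abstraction in step (1) can be sketched in Python. The entity dictionary here is a toy stand-in for the glossary extracted from the Knowledge Graph; the placeholder names are assumptions, not the system's actual labels.

```python
# Sketch of question abstraction: replace entity words with their type
# placeholders so that questions with the same intent map to the same
# abstract sentence. ENTITY_DICT is a toy stand-in for the real glossary.
ENTITY_DICT = {
    "北京大学": "school",
    "清华大学": "school",
    "华南师范大学": "school",
    "计算机科学与技术": "major",
}

def abstract_question(question):
    """Return the abstract sentence and the list of extracted entities."""
    entities = []
    # Match longer entity names first so substrings do not shadow them.
    for word in sorted(ENTITY_DICT, key=len, reverse=True):
        if word in question:
            entities.append(word)
            question = question.replace(word, f"[{ENTITY_DICT[word]}]")
    return question, entities

abstract, args = abstract_question("北京大学的院校代码是什么?")
# abstract == "[school]的院校代码是什么?", args == ["北京大学"]
```

With this mapping, "清华大学的院校代码是什么?" produces the same abstract sentence, matching the observation that both questions share one retrieval path.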
4.1 Named Entity Recognition
With the Knowledge Graph as the data source for the Q&A system, the next step is to design the Q&A system. The first step is to identify and extract the entity vocabulary from the user input so that we can know what the question is about. Named Entity Recognition detects references to real-world entities from text and classifies them into predefined types [10]. An entity can be considered as an instance of a real-world concept. For example, “school” is a concept, and
Fig. 2. Steps of the Question Answering system
“South China Normal University” is an instance of that concept, i.e., an entity. The concept "school" is also referred to as the type of the entity "South China Normal University". NER extracts entity words from a natural language sentence and determines their types. This paper uses the open-source Chinese natural language processing toolkit HanLP [9], whose word segmentation and entity annotation models can be trained on custom corpus data. The original pre-trained HanLP segmentation model already recognizes some entity words well, and after enhanced training with a university corpus the model recognizes university-type entity words better, as shown in Fig. 3.
Fig. 3. Effectiveness of HanLP word segmentation and entity recognition
In addition, ordinary word segmentation algorithms may fail to recognize some longer or more complex entity words, such as school names or the names of some majors. To solve this problem, we use the Knowledge Graph to assist HanLP in recognizing these special noun words. We extract the entities of universities, university types, university titles, majors, research directions, etc. from the Knowledge Graph, use these entity nodes' "name" fields to build a glossary, and label the type of each term. This approach is called dictionary-based NER [18].
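Dictionary-based NER over such a glossary can be sketched as a forward maximum-matching scan; the glossary entries and type labels below are illustrative, not the system's actual data.

```python
# Dictionary-based NER sketch: forward maximum matching against a
# glossary built from Knowledge Graph entity names. The glossary below
# is illustrative; in the real system it comes from Neo4j "name" fields.
GLOSSARY = {
    "华南师范大学": "school",
    "云南农业职业技术学院": "school",
    "计算机科学与技术": "major",
    "985": "title",
}
MAX_LEN = max(len(w) for w in GLOSSARY)

def dict_ner(text):
    """Scan text left to right, greedily matching the longest glossary term."""
    found, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in GLOSSARY:
                found.append((text[i:j], GLOSSARY[text[i:j]]))
                i = j
                break
        else:
            i += 1  # no term starts here; move one character forward
    return found
```

Longest-match-first keeps a long school name from being split into shorter dictionary terms, which is the failure mode of plain segmentation that motivated this step.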
4.2 Feature Term Extraction and Word Vector Construction
As shown in Fig. 4, this subsection describes the conversion of the raw question data into word vector data that can be fed into a Bayes classification model. The data type used for training the Bayes classification model presented in the next section is a set of word vectors. In order to use the original question data for training the Bayes classification model, we need to extract the feature
Fig. 4. Question sentences preprocessing
words from the original question sentences and then use these feature words to construct a sparse matrix. Each row of this sparse matrix corresponds to an original question, and each column corresponds to a feature term. If a feature word appears in a sentence, the corresponding position of the sparse matrix is set to 1; the remaining positions are set to 0. From the above discussion, we know that the specific value of the entity vocabulary does not affect the type of question. We therefore first run entity recognition on the original question and then replace the entity words with their types, as shown in the first line of Fig. 4. In addition, modal particles, auxiliaries, and punctuation are of little use for question intent identification, so these stop words are also removed. The feature words that make up the matrix then comprise the types of the entity vocabulary and some predicates or pronouns that are not related to the entities. As an example, the 5 abstract sentences in the rightmost box in the second row of Fig. 4 are divided into 8 feature terms, two of which are entity types, namely "[school]" and "[major]". They form a sparse matrix with 5 rows and 8 columns. The abstract sentence "[school] 的院校代码是什么? (What is the institution code of [school]?)" contains the first four feature terms, so the first four columns of the first row of the matrix are 1 and the last four columns are 0. The remaining four sentences are assigned values in the matrix in the same way.
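The word-vector construction described above can be sketched as follows; the feature-term vocabulary is illustrative, not the paper's actual eight terms from Fig. 4.

```python
# Sketch of word-vector construction: each abstract sentence becomes a
# binary row over the vocabulary of feature terms (entity-type
# placeholders plus intent-bearing words). FEATURES is illustrative.
FEATURES = ["[school]", "的", "院校代码", "是什么", "[major]", "简称", "在哪", "是"]

def to_vector(terms):
    """One-hot row: 1 where the sentence contains the feature term."""
    return [1 if f in terms else 0 for f in FEATURES]

rows = [
    to_vector(["[school]", "的", "院校代码", "是什么"]),
    to_vector(["[school]", "的", "简称"]),
]
# rows[0] == [1, 1, 1, 1, 0, 0, 0, 0]
```

Each row of `rows` is one row of the sparse matrix fed to the classifier in the next subsection.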
4.3 Question Intent Recognition
As discussed above, the entity itself does not affect the intent of a question; only the entity's type does. The following example sentences are therefore represented by abstract sentences with the concrete entities removed. Considering the intent "询问一个学校的地址在哪 (asking where a school is located)", we find that there are
many natural language sentences that can express it, such as the following:

1. "[school] 的地址是? (What is the address of [school]?)"
2. "[school] 的校区位于哪里? (Where is [school]'s campus located?)"
3. "[school] 在什么地方? (What place is [school] in?)"

These sentences all have the same question intent, and we classify them into the same type of interrogative sentence. When a new question is input, we abstract it to obtain the abstract sentence and then classify it to determine the intent of the question. In this way, we transform the problem of identifying question intent into the problem of classifying abstract sentences, for which we use a Bayes classifier. Bayes classifiers are a class of algorithms that use Bayes' theorem from probability and statistics for classification tasks. They offer several advantages: high classification accuracy, fast speed, a simple algorithm, and easy implementation. The Bayes classifier computes, in turn, the probability that the input feature-word vector belongs to each type, ranks these probabilities, and outputs the type with the highest probability. In this paper, a Naive Bayes classifier [19] is used, which assumes that the feature words in the input vector are mutually independent. For training the Bayes classification model, we introduce two distributed storage and computation engines, Hadoop [21] and Spark [12]. The raw question data for classification training and some intermediate processing data are stored on Hadoop, while the computational tasks of the model are executed on the Spark compute engine. HDFS performs well on incrementally appended data, which suits our system. To save I/O overhead, we replace MapReduce with the memory-based Spark compute engine.
The training of the Naive Bayes model can also be done using the SparkML machine learning toolkit of the Spark platform.
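As a minimal local stand-in for the Spark-based trainer, a Bernoulli Naive Bayes over the binary word vectors of the previous subsection can be written directly; Laplace smoothing is an assumption added to keep unseen features from zeroing out a class.

```python
import math
from collections import Counter, defaultdict

# Minimal Bernoulli Naive Bayes over binary word vectors, a local
# sketch of what SparkML's trainer does at scale.
def train_nb(X, y):
    classes = Counter(y)
    n_features = len(X[0])
    counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            counts[label][i] += v
    model = {}
    for c, n_c in classes.items():
        prior = math.log(n_c / len(y))
        # Laplace smoothing: P(feature=1 | class c)
        probs = [(counts[c][i] + 1) / (n_c + 2) for i in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, row):
    """Score every class and return the most probable one."""
    best, best_score = None, float("-inf")
    for c, (prior, probs) in model.items():
        score = prior + sum(
            math.log(p if v else 1 - p) for v, p in zip(row, probs))
        if score > best_score:
            best, best_score = c, score
    return best
```

Training on a handful of labeled vectors and predicting the type of a new abstract sentence follows the ranking procedure described above: log-probabilities are summed per class and the arg-max class is returned.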
4.4 Assembling Structured Query Statements
The Knowledge Graph used in the Q&A system is stored in the Neo4j graph database. Preprocessing and intent recognition of a question output two values: the type type to which the question belongs and a list of entity terms args. The question type type determines which CQL query statement is selected, while the list of entity terms args is passed to the CQL statement as query parameters. For example, the question "Is Peking University a Double First-Class university?" is determined by the classifier to be of type 1, and the entity terms "Peking University" and "Double First-Class" are extracted, i.e., type = 1, args = ["Peking University", "Double First-Class"]. Then a query statement Q1(args) is executed. We obtain the knowledge needed to answer the question and generate a natural language answer sentence based on this knowledge as the output, completing the process.
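This type-to-CQL dispatch can be sketched as below; the node labels, property names, and template table are assumptions for illustration, not the system's actual schema.

```python
# Sketch of assembling a parameterized CQL (Cypher) query from the
# classifier output (type, args). Labels, properties, and the template
# table are assumed for illustration, not the actual schema.
CQL_TEMPLATES = {
    1: ("MATCH (s:School {name: $school}) "
        "RETURN $title IN s.titles AS answer"),
    5: ("MATCH (s:School {name: $school}) "
        "RETURN s.institution_code AS answer"),
}

def assemble_query(qtype, args):
    """Select the CQL template for the question type and bind parameters."""
    params = {"school": args[0]}
    if len(args) > 1:
        params["title"] = args[1]
    return CQL_TEMPLATES[qtype], params

cql, params = assemble_query(1, ["Peking University", "Double First-Class"])
# params == {"school": "Peking University", "title": "Double First-Class"}
```

Passing entity terms as parameters rather than splicing them into the query string is the usual way to keep Cypher queries safe and cacheable.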
Fig. 5. Question Answering system architecture
5 Implementation of Question Answering System
The flowchart of the Q&A system is shown in Fig. 5; the whole system consists of two Q&A model layers. The first layer is a FAQ-based Q&A model, mainly used to answer everyday chat questions. These question-answer pairs directly record the corresponding answer for each question, for example, "< Hi, nice to meet you! – Thank you. Nice to meet you too! >". For a user input question: (1) the system first calculates the cosine similarity between the question and each question in the question-answer database; (2) it checks whether the similarity score is greater than the set threshold (e.g., 60% in Fig. 5): if so, the record is kept, otherwise it is discarded; (3) it sorts all the retained records by similarity and takes the record with the highest similarity score; (4) finally, it outputs the answer of this record.
(FAQ: Frequently Asked Questions. This layer searches for an answer in a pre-prepared question-and-answer database.)
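The first-layer FAQ matching can be sketched with bag-of-words cosine similarity; the FAQ pairs and the 0.6 threshold below are illustrative, not the deployed values.

```python
import math
from collections import Counter

# First-layer FAQ matching sketch: bag-of-words cosine similarity with a
# threshold; below the threshold the question falls through to the
# second (Knowledge Graph) layer. FAQ pairs and THRESHOLD are examples.
FAQ = {
    "hi nice to meet you": "Thank you. Nice to meet you too!",
    "who are you": "I am the university Q&A assistant.",
}
THRESHOLD = 0.6

def cosine(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def faq_answer(question):
    """Return the best FAQ answer, or None to fall through to layer two."""
    best, score = None, 0.0
    for q, ans in FAQ.items():
        s = cosine(question.lower(), q)
        if s > score:
            best, score = ans, s
    return best if score >= THRESHOLD else None
```

A `None` result corresponds to steps (2)–(3) discarding every record, in which case the question is handed to the Knowledge Graph layer described next.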
The second layer of the Q&A model is based on the university Knowledge Graph. This model is centered on the Knowledge Graph and uses a Bayes classifier to identify the intent of the input question, then queries the relevant knowledge data from the Knowledge Graph, and finally uses that data to construct the answer statements. The major difference from the first-layer model is that answer statements are generated on the fly instead of being pre-recorded. The second-layer model can truly understand the intent of the user's question, retrieve the knowledge needed to answer it from the knowledge base, and convert the retrieved structured knowledge into ordinary natural language sentences. To allow the model to continuously improve, we set up regular updates for the Q&A model. Each time a user enters a question, the system records it, and when the number of questions reaches a threshold, the system stores them in Hadoop. We then annotate the types of the new questions manually. When the number of newly written questions grows to a specified amount, or when a specified time interval has elapsed since the last model update, the system initiates a model update task. This task does not affect the running Q&A system, because the classification model is trained on another machine and then reloaded into the Q&A system; loading a new model is very fast, with minimal impact on the system.
6 Conclusion
In this paper, we build a Q&A system for the university domain based on the university Knowledge Graph. The system uses the Knowledge Graph, NER, Chinese word segmentation, a Bayes classification model, Neo4j, and other techniques. Using these techniques, the system first preprocesses the user input question to obtain a list of entities, then feeds the abstract sentence to the Bayes classification model to determine the intent of the question, then obtains knowledge about the question from the Knowledge Graph, and finally generates a natural language answer sentence. This Q&A system was tested with more than two thousand questions (Fig. 1), and its accuracy rate of question intent recognition was 92.4%. Some test data and results are shown in Table 1.

Table 1. Examples of some test results of this Q&A system

Original question | Abstract formula | Type | Answer
清华大学是 985 吗? (Is Tsinghua University a 985 university?) | school 是 985 吗? | 1 | 清华大学是985工程大学。 (Tsinghua University is a Project 985 university.)
南昌大学是重点大学吗? (Is Nanchang University a key university?) | school 是重点大学吗? | 2 | 南昌大学是 211工程大学。 (Nanchang University is a Project 211 university.)
五邑大学的代码是什么? (What is the code of Wuyi University?) | school 的代码是什么? | 5 | 11349
华南师范大学的简称 (The abbreviation of South China Normal University) | school 的简称 | 7 | 华南师大、SCNU
云南农业职业技术学院在哪? (Where is Yunnan Vocational and Technical College of Agriculture?) | school 在哪? | 8 | 云南省 (Yunnan Province)
重庆工程学院在哪个省? (Which province is Chongqing Institute of Engineering in?) | school 在哪个省? | 8 | 重庆市 (Chongqing)
The training of the current Q&A model requires a lot of labor to annotate the original question sentences. In the future, we intend to use an unsupervised machine learning model to cluster these raw question sentences and then check them manually. Nevertheless, this Q&A system has the following advantages: (1) In the NER stage, it can identify more accurately some entity vocabulary in the question, such as university name, “985”, “211”, and other proper nouns. (2) The numerous nodes and relationships of SCHOLAT’s university knowledge graph can be used to retrieve the information needed for the question more efficiently. Acknowledgment. This work was supported in part by the National Natural Science Foundation of China under Grant U1811263.
References

1. Banerjee, P.: Implicitly supervised neural question answering. Ph.D. thesis, Arizona State University (2022)
2. Biswas, P., Sharan, A., Malik, N.: A framework for restricted domain question answering system. In: 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), pp. 613–620 (2014). https://doi.org/10.1109/ICICICT.2014.6781351
3. Both, A., et al.: Quality assurance of a German COVID-19 question answering systems using component-based microbenchmarking. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, pp. 1561–1564. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3488560.3502196
4. Budzik, J., Hammond, K.: Watson: anticipating and contextualizing information needs. In: 62nd Annual Meeting of the American Society for Information Science. Citeseer (1999)
5. Dai, X., Ge, J., Zhong, H., Chen, D., Peng, J.: QAM: question answering system based on knowledge graph in the military. In: 2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 100–104 (2020). https://doi.org/10.1109/ICCICC50026.2020.9450261
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805
7. Do, P., Phan, T.H.V.: Developing a BERT-based triple classification model using knowledge graph embedding for question answering system. Appl. Intell. 52(1), 636–651 (2021). https://doi.org/10.1007/s10489-021-02460-w
8. Dong-sheng, W., Wei-min, W., Shi, W., Jian-hui, F., Feng, Z.: Research on domain-specific question answering system oriented natural language understanding: a survey. Comput. Sci. 44(8), 1–8 (2017)
9. He, H., Choi, J.D.: The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5555–5577. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://aclanthology.org/2021.emnlp-main.451
10. Jiang, H., Zhang, D., Cao, T., Yin, B., Zhao, T.: Named entity recognition with small strongly labeled and large weakly labeled data. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1775–1789. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.140
11. Jiang, Z., Chi, C., Zhan, Y.: Research on medical question answering system based on knowledge graph. IEEE Access 9, 21094–21101 (2021). https://doi.org/10.1109/ACCESS.2021.3055371
12. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc. (2015)
13. Kozareva, Z., Hovy, E.: Tailoring the automated construction of large-scale taxonomies using the web. Lang. Resour. Eval. 47(3), 859–890 (2013)
14. Këpuska, V., Bohouta, G.: Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home). In: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), pp. 99–103 (2018). https://doi.org/10.1109/CCWC.2018.8301638
15. Lan, Y., Wang, S., Jiang, J.: Multi-hop knowledge base question answering with an iterative sequence matching model. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 359–368 (2019). https://doi.org/10.1109/ICDM.2019.00046
16. Li, Z., Sharma, P., Lu, X.H., Cheung, J., Reddy, S.: Using interactive feedback to improve the accuracy and explainability of question answering systems post-deployment. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 926–937. Association for Computational Linguistics, Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.findings-acl.75
17. Liu, A., Huang, Z., Lu, H., Wang, X., Yuan, C.: BB-KBQA: BERT-based knowledge base question answering. In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) CCL 2019. LNCS (LNAI), vol. 11856, pp. 81–92. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_7
18. Lou, Y., Qian, T., Li, F., Ji, D.: A graph attention model for dictionary-guided named entity recognition. IEEE Access 8, 71584–71592 (2020). https://doi.org/10.1109/ACCESS.2020.2987399
19. Murphy, K.P., et al.: Naive Bayes classifiers. Univ. Br. Columbia 18(60), 1–8 (2006)
20. Shum, H.Y., He, X.D., Li, D.: From Eliza to XiaoIce: challenges and opportunities with social chatbots. Front. Inf. Technol. Electron. Eng. 19(1), 10–26 (2018)
21. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10 (2010). https://doi.org/10.1109/MSST.2010.5496972
22. Singhal, A., et al.: Introducing the Knowledge Graph: things, not strings. Official Google Blog 5, 16 (2012)
23. Thakur, S.: Personalization for Google Now: user understanding and application to information recommendation and exploration. In: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys 2016, p. 3. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2959100.2959192
24. Turing, A.M.: Computing machinery and intelligence. Mind 59(236), 433 (1950)
25. Zhou, L., Gao, J., Li, D., Shum, H.Y.: The design and implementation of XiaoIce, an empathetic social chatbot. Comput. Linguist. 46(1), 53–93 (2020). https://doi.org/10.1162/coli_a_00368
Deep Reinforcement Learning-Based Scheduling Algorithm for Service Differentiation in Cloud Business Process Management System

Yunzhi Wu1, Yang Yu2(B), and Maolin Pan2

1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
2 School of Software Engineering, Sun Yat-sen University, Guangzhou, China
[email protected]
Abstract. Business process management systems in the cloud are increasingly used in enterprise organizations for their flexibility, affordability, and usability. However, the service differentiation problem for cloud business process management systems has not been thoroughly investigated. Service differentiation requires corresponding service quality for different service level agreements. To address this problem, this paper designs a scheduling system for a cloud business process management system that accounts for the characteristics of the workflow engine, models the request scheduling process as a Markov decision process, and proposes a deep reinforcement learning-based request scheduling algorithm for service differentiation. The algorithm effectively reduces the over-provisioning of service quality while maintaining a low service level agreement violation rate, thus enhancing the service differentiation effect and conserving system resources. Experiments show that, compared to a heuristic algorithm, the algorithm achieves a better service differentiation effect, reducing service quality over-provisioning by 74.4% while keeping the service level agreement violation rate within 5%. Keywords: Service Differentiation · Request Scheduling · Business Process Management System · Deep Reinforcement Learning · Cloud Computing
1 Introduction
With the development of cloud computing, the Business Process Management System (BPMS) has shifted from traditional on-premises software to cloud BPMS, which is deployed in the cloud and provides services on demand. Cloud BPMS eliminates the upfront investment of money and resources, making BPMS easier, faster, and cheaper [1]. Cloud BPMS provides Business Process as a Service (BPaaS), where tenants use automated processes in the cloud as a service on the pay-as-you-go model [2]. Compared to the on-premises deployment approach of BPMS, the service approach of cloud BPMS is more affordable for small and medium-sized enterprises and organizations, improving the applicability of BPMS to all types of organizations.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 175–186, 2023. https://doi.org/10.1007/978-981-99-2385-4_13

The Service Level Agreement (SLA) is an important element of cloud services, stipulating the quality of service (QoS) and the corresponding price. Service response time, an essential metric in SLAs, is chosen as the research object. Tenants choose SLAs of different levels based on their real-time requirements. The essence of cloud services is on-demand service, i.e., service differentiation: corresponding QoS is provided for each SLA level. When a cloud service provider delivers QoS lower than agreed in the SLA, an SLA violation occurs; when it delivers higher QoS, this is over-provisioning of QoS. QoS over-provisioning lets tenants who purchase low-level SLAs also enjoy high QoS and take up more resources. If tenants who purchase different SLA levels receive the same QoS, tenants with high QoS needs have less incentive to purchase high-level SLAs, reducing the revenue of cloud service providers. Current research efforts for cloud BPMS focus on architecture design and resource scheduling; little attention has been paid to SLA service differentiation. [3] design and implement a multi-tenant business process system architecture that allows multiple tenants to securely share a single workflow engine instance with no additional overhead. [4] propose a request scheduling algorithm for cloud workflow and load-balance the engine instances according to the workflow engine characteristics, effectively saving computational resources. [5] address the problem of automatic scaling of cloud workflow services, model it as a semi-Markov process, and propose a reinforcement learning (RL) based algorithm for automatic scaling of workflow instances.
In this paper, SLA service differentiation focuses on request response time, so for the scheduling system, requests have corresponding deadlines. On the application of RL to request scheduling with deadline, [6] use Earliest Deadline First (EDF) combined with RL to allocate bandwidth for requests with deadlines. [7] use a multi-agent deep reinforcement learning (DRL) model as a scheduler for real-time systems to improve the success rate of tasks. To address the service differentiation problem, this paper develops a scheduling system in cloud BPMS and proposes a scheduling algorithm based on DRL. The algorithm models the request scheduling problem as a Markov decision process (MDP) problem and combines characteristics of the workflow engine. The algorithm reduces the QoS over-provisioning rate while maintaining the SLA violation rate within 5%, thus improving the effect of service differentiation and making the cloud BPMS more consistent with characteristics of cloud services and realizing on-demand services.
2 Problem Description and System Design

2.1 Cloud BPMS Description
Cloud computing systems have a three-tier structure: tenants, service providers, and resource providers. The service provider rents or purchases resources from
the resource provider to supply tenants with services. The resource provider makes physical or virtual resources available to the service provider. The cloud BPMS is built on this three-tier structure, and the workflow engine cluster is deployed on resources rented from resource providers. The workflow engine is one of the core components of the BPMS: it creates process instances based on process definitions and controls their execution [8]. The architecture of the cloud BPMS is shown in Fig. 1. When a tenant sends a request to start a process instance, the scheduling system receives the request and submits it to the workflow engine cluster for execution in the proper order, according to the SLA signed between the tenant and the service provider. The workflow engine creates a process instance and assigns its tasks to task executors, which can be humans, applications, robots, etc. The BPMS has no control over how long task executors take to execute a task. After a task executor completes a task, a request to complete the task is submitted to the scheduling system, which forwards it to the workflow engine cluster and continues the process instance.
Fig. 1. The architecture of the cloud BPMS
The chosen BPM framework is Activiti Cloud, a cloud-native BPM framework that provides a scalable and transparent solution for BPMS in the cloud; its services are stateless [9]. A stateless engine's response time for executing business process requests depends only on its resources, not on the complexity of the process definition, the tenants, or third-party services. Testing shows that, within a certain load threshold, the Activiti workflow engine cluster's request response time increases slowly with load; when the load is too high, the response time increases rapidly and the cluster may even crash from overload, so an overload threshold must be set to protect the engine cluster. With a set overload threshold TH and fixed resources such as CPU and memory, the Activiti engine cluster has a maximum request response time, denoted t0. t0 is the minimum request response time that the service provider can guarantee in the SLAs it sells. Tenants that are not sensitive to response time can sign an SLA with the service provider that specifies a response time of several times t0.
2.2 Problem Description
The request req_j sent to the scheduling system has an SLA response time SLA_j. If the response time R_j of req_j exceeds SLA_j, req_j causes an SLA violation; if R_j is less than SLA_j, req_j causes QoS over-provisioning. The over-provisioning time M_j of req_j is defined in Eq. (1). The request scheduling problem in a cloud BPMS can then be described as follows:

1. Known conditions: (i) TH, the overload threshold; (ii) t0, the maximum request response time of the Activiti engine cluster.
2. Goal: minimize the QoS over-provisioning rate Rate_over_prov while keeping the SLA violation rate Rate_sla_vio within a certain range. Rate_sla_vio is expressed by Eq. (2), where J is the total number of requests and J_sla_vio is the number of SLA-violating requests. Rate_over_prov is the ratio of the total over-provisioning time to the total SLA response time, as Eq. (3) shows.

The effect of SLA service differentiation is reflected by Rate_sla_vio and Rate_over_prov; when Rate_sla_vio is limited to a certain range, it is mainly reflected by Rate_over_prov.

$$M_j = \begin{cases} SLA_j - R_j, & R_j < SLA_j \\ 0, & R_j \ge SLA_j \end{cases} \quad (1)$$

$$Rate_{sla\_vio} = J_{sla\_vio} / J \quad (2)$$

$$Rate_{over\_prov} = \sum_{j=0}^{J} M_j \Big/ \sum_{j=0}^{J} SLA_j \quad (3)$$
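The two rates above can be sketched directly from per-request data. The helper below is illustrative, not from the paper; it assumes the input is a list of (SLA_j, R_j) pairs and follows Eqs. (1)–(3).

```python
def sla_metrics(requests):
    """Compute (Rate_sla_vio, Rate_over_prov) from (SLA_j, R_j) pairs.

    Eq. (1): M_j = SLA_j - R_j when R_j < SLA_j, else 0.
    Eq. (2): Rate_sla_vio = J_sla_vio / J.
    Eq. (3): Rate_over_prov = sum(M_j) / sum(SLA_j).
    """
    m = [sla - r if r < sla else 0.0 for sla, r in requests]   # Eq. (1)
    violations = sum(1 for sla, r in requests if r >= sla)
    rate_sla_vio = violations / len(requests)                  # Eq. (2)
    rate_over_prov = sum(m) / sum(sla for sla, _ in requests)  # Eq. (3)
    return rate_sla_vio, rate_over_prov
```

For example, with one fast response (R = 1 s against SLA = 2 s) and one violation (R = 3 s against SLA = 2 s), half the requests violate and a quarter of the total SLA time is over-provisioned.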
2.3 Scheduling System Design
Fig. 2. Scheduling process at t, t + 1, t + 2
DRL-Based Scheduling Algorithm for Service Differentiation in Cloud BPMS
The scheduling system maintains a priority queue Q = [req_1, req_2, ..., req_n] for caching unexecuted requests. The priority of req_j is determined by its deadline D_j, where D_j = A_j + SLA_j - t0 and A_j is the moment when req_j arrives at the scheduling system. If the scheduling system submits req_j to the workflow engine cluster before D_j, req_j will not cause an SLA violation; otherwise it becomes an expired request and causes an SLA violation. An expired request queue Q_exp stores expired requests for subsequent processing. The time interval between a request's deadline and the current system time t is finite and is determined by the maximum response time agreed in the SLAs offered by the cloud service provider; it is denoted N. Figure 2 shows the scheduling process at t, t + 1, t + 2. At each time step t, the scheduling system selects some requests from Q to submit to the workflow engine cluster according to the scheduling algorithm, moves the expired requests out of Q, and adds them to Q_exp. Submitting a request whose deadline is later than the current system time causes QoS over-provisioning, but reduces the load pressure at subsequent time steps; if the subsequent load pressure is high, this can reduce Rate_sla_vio. It is therefore difficult for rule-based heuristic algorithms to maintain the service differentiation effect in a dynamic load environment, and a DRL-based scheduling algorithm is used as the solution.
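The queue bookkeeping described above can be sketched as follows. Class and method names are hypothetical, not from the paper; t0 = 1 s follows the paper's experimental setup, and the `state` method buckets queued requests by deadline, which is later used as the DRL observation.

```python
import heapq

T0 = 1.0  # engine's maximum request response time t0 (1 s in the paper's setup)

class SchedulerQueue:
    """Priority queue Q ordered by deadline D_j = A_j + SLA_j - t0."""

    def __init__(self):
        self._heap = []    # entries: (deadline, request_id)
        self.expired = []  # Q_exp, expired requests awaiting later handling

    def push(self, req_id, arrival, sla):
        # D_j = A_j + SLA_j - t0
        heapq.heappush(self._heap, (arrival + sla - T0, req_id))

    def move_expired(self, now):
        """Move requests whose deadline has passed into Q_exp."""
        while self._heap and self._heap[0][0] < now:
            self.expired.append(heapq.heappop(self._heap)[1])

    def state(self, now, horizon):
        """Count queued requests per deadline bucket [t+i, t+i+1), i < horizon."""
        s = [0] * horizon
        for deadline, _ in self._heap:
            i = int(deadline - now)
            if 0 <= i < horizon:
                s[i] += 1
        return s
```

A request arriving at time 0 with SLA 2 s thus has deadline 1.0 and expires once the system clock passes it.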
3 Algorithm Design

3.1 Problem Modeling
In RL, at each discrete time step t, the agent observes the state S_t of the environment and selects an action A_t according to its policy π_θ(s), parameterized by θ. The environment then provides the reward R_t and the next state S_{t+1}. The interaction is formalized as an MDP. The goal of the agent is to maximize the expected discounted return by optimizing π_θ(s). The state-value function $V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ measures the expected discounted return, where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the discounted return and γ ∈ [0, 1] is the discount factor. The process of scheduling requests in Q is modeled as an MDP, where the state describes the current state of Q, the action specifies which requests are submitted from Q, and the reward encourages the agent to learn a scheduling policy that keeps Rate_sla_vio low and minimizes Rate_over_prov to maintain the service differentiation effect. The state, action, and reward are defined as follows:

1. State: S_t is the state of Q, expressed as Eq. (4). For any i ∈ [0, N-1], s_{ti} = |Q_{t+i}|, where Q_{t+i} is the request set with ∀ req_j ∈ Q_{t+i}, D_j ∈ [t+i, t+i+1).

$$S_t = [s_{t0}, s_{t1}, ..., s_{t(N-1)}] \quad (4)$$
2. Action: At time t, the agent chooses requests from Q to submit. The scheduling action A_t is formally expressed as:

$$A_t = [a_{t0}, a_{t1}, ..., a_{t(N-1)}] \quad (5)$$

where a_{ti}, i ∈ [0, N-1], is the number of requests submitted from the set Q_{t+i}, with $\sum_{i=0}^{N-1} a_{ti}$ limited to no more than the overload threshold TH.
3. Reward: The reward evaluates the value of the action selected by the agent in the corresponding state. It needs to account for Rate_sla_vio and Rate_over_prov, and to penalize the part of A_t that exceeds TH. Q.get_exp_req(t) is defined as extracting from Q the set of expired requests whose deadlines are earlier than t. The reward function R_t is:

$$R_t = \begin{cases} A\sum_{i=0}^{N-1}(1+\beta i)\,a_{ti} + B\,|Q.get\_exp\_req(t)| + C\left(\sum_{i=0}^{N-1} a_{ti} - TH\right), & \sum_{i=0}^{N-1} a_{ti} > TH \\ A\sum_{i=0}^{N-1}(1+\beta i)\,a_{ti} + B\,|Q.get\_exp\_req(t)|, & \sum_{i=0}^{N-1} a_{ti} \le TH \end{cases} \quad (6)$$

where A is a positive constant representing the reward factor for actions that do not cause SLA violations; β ∈ (-1, 0) is a discount coefficient that reduces the reward for the components a_{ti} with i ≠ 0, because submitting requests from Q_{t+i} with i ≠ 0 causes QoS over-provisioning; B is a negative constant representing the penalty factor for violations; and C is a negative constant representing the penalty factor for the total number of submitted requests $\sum_{i=0}^{N-1} a_{ti}$ exceeding TH.
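The reward in Eq. (6) can be sketched as follows. The constants A, B, C, and β here are hypothetical placeholder values; the paper only fixes their signs, while TH = 45 follows its experimental setup.

```python
# Hypothetical constants: the paper only states A > 0, B < 0, C < 0, beta in (-1, 0).
A, B, C, BETA, TH = 1.0, -2.0, -3.0, -0.1, 45

def reward(action, n_expired):
    """Reward R_t per Eq. (6).

    action:    [a_t0, ..., a_t(N-1)], requests submitted per deadline bucket
    n_expired: |Q.get_exp_req(t)|, expired requests at this step
    """
    base = A * sum((1 + BETA * i) * a for i, a in enumerate(action)) + B * n_expired
    total = sum(action)
    if total > TH:
        base += C * (total - TH)  # penalize submissions beyond the overload threshold
    return base
```

With these placeholder constants, submitting from later buckets (i > 0) earns less reward, expired requests subtract from it, and exceeding TH adds a further penalty.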
3.2 DRL-Based Request Scheduler
Traditional RL methods are mainly table-driven and struggle with large state and action spaces. To address this, DRL uses deep neural networks to represent the value function and policy function in RL.
Fig. 3. The structure of actor and critic network
Proximal Policy Optimization (PPO) [10] is chosen; it belongs to the actor-critic framework. The actor network represents the policy π_θ(s) and the critic network represents the state-value function V(s). The structure of the actor and critic networks is shown in Fig. 3. The critic network consists of three fully connected (FC) layers, with smooth L1 loss as the objective function. The actor network is also composed of three FC layers, and its objective function can be expressed as:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \quad (7)$$

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T) \quad (8)$$
The estimator of the advantage function Â_t is a relative evaluation of the action, as shown in Eq. (8). r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t) denotes the probability ratio, and the clip function limits r_t(θ) to [1-ε, 1+ε]. The clipping threshold ε typically takes values in 0.1 ∼ 0.3.

Algorithm 1. PPO, Actor-Critic Style
1: Initialize policy parameter θ and clipping threshold ε
2: for iteration = 1, 2, ... do
3:   Run policy π_θold in environment for T timesteps
4:   Compute advantage estimates Â_1, ..., Â_T
5:   Optimize L^CLIP with minibatch Stochastic Gradient Descent
6:   θold ← θ
7: end for
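The per-sample clipped surrogate of Eq. (7) can be sketched with NumPy as below. This is an illustration, not the authors' implementation; the default ε = 0.25 matches the value listed later in Table 1.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.25):
    """PPO clipped objective per Eq. (7) for one sample.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage: advantage estimate A_hat_t
    Averaged over a minibatch and maximized during training.
    """
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)
```

The `min` makes the objective pessimistic: large policy ratios cannot inflate the surrogate beyond the clipped value, which keeps each update close to π_θold.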
The PPO algorithm is shown in Algorithm 1. In each iteration, we use π_θold to explore the environment, collect samples, and compute the advantage estimates. The policy parameter θ is then updated after exploring T timesteps, and the new policy is assigned to the old one. In the cloud BPMS, the scheduler holds a trained actor network. Q_exp.put(req_set) is defined as adding the request set req_set to the expired request queue Q_exp, and Q_exp.get(num) extracts num requests from Q_exp. The scheduling process of the DRL-based request scheduler is shown in Algorithm 2. The current environment observation is obtained from Q and the policy outputs the action. Requests are submitted to the workflow engine cluster for execution according to the action, adjusted to respect the overload threshold TH. If the number of submissions is below the overload threshold, additional requests are extracted from Q_exp and sent for execution.
Algorithm 2. DRL-based Request Scheduler
Require: overload threshold TH, priority cache queue Q, expired request queue Q_exp
1: Load the trained policy π_θ
2: while not done do
3:   Q_exp.put(Q.get_exp_req(t))
4:   Observe the state S_t
5:   A_t = π_θ(S_t)
6:   if Σ_{i=0}^{N-1} a_{ti} > TH then
7:     Adjust A_t so that Σ_{i=0}^{N-1} a_{ti} ≤ TH
8:   end if
9:   Execute the action A_t
10:  if Σ_{i=0}^{N-1} a_{ti} < TH then
11:    Submit Q_exp.get(TH - Σ_{i=0}^{N-1} a_{ti}) to the workflow engine cluster
12:  end if
13:  t ← t + 1
14: end while
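Line 7 of Algorithm 2 leaves the adjustment of A_t unspecified. One plausible realization, an assumption rather than the paper's method, caps submissions greedily starting from the earliest-deadline bucket, since earlier buckets are the most urgent:

```python
def adjust_action(action, th):
    """Cap the total submissions at th, filling earliest-deadline buckets first.

    One possible realization of line 7 in Algorithm 2 (the paper does not
    specify how A_t is adjusted). `action` is [a_t0, ..., a_t(N-1)].
    """
    budget = th
    adjusted = []
    for a in action:
        take = min(a, budget)  # submit as many as the remaining budget allows
        adjusted.append(take)
        budget -= take
    return adjusted
```

For example, a proposal of [30, 20, 10] with TH = 45 becomes [30, 15, 0]: the most urgent bucket is served in full and later buckets absorb the cut.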
4 Experiment

4.1 Experimental Setup
The experimental environment is a Kubernetes cluster deployed on a server with 24 CPU cores and 32 GB of memory. The experiment deploys the official release of Activiti Cloud, which has strong microservice support. Each Activiti engine container is allocated 1 CPU and 2 GB of memory.
Fig. 4. NASA website access traffic
To simulate the environment in which tenants send requests to a realistic cloud BPMS, this paper uses Gatling, an open-source tool for simulating concurrent request loads. The experiment uses part of the access traffic of the 1995 NASA website, from which the traffic data from July 1 to July 3 is extracted, as shown in Fig. 4. To examine the strengths and shortcomings of the proposed algorithm, the experiments compare the service differentiation effects of Earliest Deadline First (EDF), First Come First Serve (FCFS), and the proposed algorithm, all applied to the request scheduling problem in cloud BPMS. EDF can either submit only the requests in Q_t or the top TH requests in Q; to distinguish the two variants, the former is referred to as Conservative EDF and the latter as EDF. The parameters used in this paper are shown in Table 1.

Table 1. The parameters used in this paper

Parameter | Value
The overload threshold | TH = 45
The maximum request response time of the workflow engine cluster | t0 = 1 s
Clip factor | ε = 0.25
Discount factor | γ = 0.8
Batch size | 256
Memory size | 4096

4.2 Experiment Result
Based on the NASA website access dataset, increasing or decreasing the number of requests per second by 15 yields the high-load and low-load scenarios, respectively, and the original dataset is treated as the medium-load scenario. Table 2 shows the evaluation results of the different algorithms under the different load scenarios. Under medium load, although FCFS is simple in principle and easy to implement, it does not consider the SLA response time of each request, resulting in a high SLA violation rate Rate_sla_vio and QoS over-provisioning rate Rate_over_prov, which is difficult for cloud service providers to accept. Conservative EDF focuses on reducing Rate_over_prov: the SLA differentiation effect is good when the load is low, but Rate_sla_vio rises when the load is high. Conservative EDF therefore has a high Rate_sla_vio in a fluctuating load environment, resulting in high SLA violation penalties for cloud service providers.
Rate_sla_vio of both EDF and the proposed algorithm is below 5%, while Rate_over_prov is 52.3% for EDF and 13.4% for the proposed algorithm, a 74.4% relative reduction. EDF's best-effort submission causes high Rate_over_prov when the load is low, whereas the proposed algorithm adapts better to the fluctuating load environment.

Table 2. Evaluation results of different algorithms

Scenario | Metric | FCFS | Conservative EDF | EDF | Proposed
Low Load | Rate_sla_vio | 0.3% | 2.8% | 0% | 0.3%
Low Load | Rate_over_prov | 94.8% | 6.1% | 48.7% | 18.7%
Middle Load | Rate_sla_vio | 51.4% | 10.3% | 3.7% | 5%
Middle Load | Rate_over_prov | 38.4% | 5.3% | 52.3% | 13.4%
High Load | Rate_sla_vio | 80.2% | 23.5% | 21.3% | 19.9%
High Load | Rate_over_prov | 17.2% | 4% | 22.3% | 9.6%
If the cloud provider wishes to strictly control Rate_over_prov, it can use Conservative EDF for request scheduling and increase the workflow engine cluster resources when Rate_sla_vio is high, to guarantee performance and reduce SLA violation penalties. If the cloud service provider instead wants to ensure the SLA service differentiation effect without increasing resource cost, maintaining a low Rate_sla_vio with as little Rate_over_prov as possible, the proposed algorithm can be used. In the low-load scenario, Rate_sla_vio of every algorithm is low due to the low load pressure, and both the proposed algorithm and Conservative EDF keep Rate_over_prov at a low level; with low load pressure and underutilized resources, cloud service providers can use Conservative EDF as the request scheduling algorithm. In the high-load scenario, the load pressure exceeds the capacity of the engine cluster, resulting in high Rate_sla_vio for every algorithm. Maintaining a low Rate_sla_vio in this case requires horizontal or vertical scaling of resources to accommodate the load pressure; cloud service providers can scale engine instances with RL-based auto-scaling algorithms [5]. Without scaling engine instances, the proposed algorithm has the lowest Rate_sla_vio among the algorithms. In terms of Rate_over_prov, Conservative EDF has the lowest value at 4%, and the proposed algorithm's Rate_over_prov is 5.6 percentage points higher than Conservative EDF's.
Fig. 5. Service differentiation effect of different algorithms
Figure 5 shows the effect of service differentiation with different algorithms under medium load, where the SLA response times of tenant A and tenant B differ: tenant A requires a lower response time for process requests, while tenant B requires less real-time performance. The upper limit of the response time for serving a request is set to 20 t0; when a request's response time exceeds this limit, the request is considered an SLA failure. The response times of SLA-successful requests are plotted in Fig. 5. FCFS suffers severe SLA violations. Rate_sla_vio of Conservative EDF is high, so even though its Rate_over_prov is the lowest among the four algorithms, only the response time of tenant B's requests is close to the SLA response time. Rate_sla_vio of EDF and the proposed algorithm is less than 5%. In the results of the proposed algorithm, each tenant's request response time is closest to the SLA response time, and the response time of tenant A's requests is significantly lower than that of tenant B's, indicating the best SLA service differentiation effect.
5 Conclusion and Future Work
To improve the effect of SLA service differentiation in cloud BPMS and make cloud BPMS more in line with the characteristics of cloud services, this paper designs a scheduling system in cloud BPMS, models the request scheduling problem in it as an MDP problem, and proposes a request scheduling algorithm based
on DRL. The experimental results show that the proposed algorithm maintains Rate_sla_vio below 5% and reduces Rate_over_prov by 74.4% compared with the heuristic algorithm, effectively improving the service differentiation effect. Although the proposed algorithm solves the service differentiation problem in cloud BPMS to a certain extent, there is room for improvement. On the one hand, the proposed request scheduling mechanism caches arriving requests and could be combined with traffic prediction methods to improve the scheduling effect. On the other hand, if cloud BPMS is combined with Robotic Process Automation (RPA), i.e., the process task executor is a robot, the execution time of each task is relatively fixed; the arrival time of a process task can then be predicted from the known business process definition to further improve the service differentiation effect.

Acknowledgements. This work is supported by the NSFC-Guangdong Joint Fund Project under Grant Nos. U1911205 and U20A6003, and the Research Foundation of the Science and Technology Plan Project in Guangdong Province under Grant No. 2020A0505100030.
References
1. Baeyens, T.: BPM in the cloud. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 10–16. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_3
2. Carrillo, A., Sobrevilla, M.: BPM in the cloud: a systematic literature review. arXiv preprint arXiv:1709.08108 (2017)
3. Pathirage, M., Perera, S., Kumara, I., Weerawarana, S.: A multi-tenant architecture for business process executions. In: 2011 IEEE International Conference on Web Services, pp. 121–128 (2011). https://doi.org/10.1109/ICWS.2011.99
4. Lin, G.D., Huang, Q.K., Yu, Y., Pan, M.I.: Stateless cloud workflow scheduling algorithm on Activiti engine. Comput. Integr. Manuf. Syst. 26(6), 9 (2020)
5. Lu, J., Yu, Y., Pan, M.: Reinforcement learning-based auto-scaling algorithm for elastic cloud workflow service. In: Shen, H., et al. (eds.) PDCAT 2021. LNCS, vol. 13148, pp. 303–310. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-96772-7_28
6. Ghosal, D., Shukla, S., Sim, A., Thakur, A.V., Wu, K.: A reinforcement learning based network scheduler for deadline-driven data transfers. In: 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6 (2019). https://doi.org/10.1109/GLOBECOM38437.2019.9013255
7. Bo, Z., Qiao, Y., Leng, C., Wang, H., Guo, C., Zhang, S.: Developing real-time scheduling policy by deep reinforcement learning. In: 2021 IEEE 27th Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 131–142. IEEE (2021)
8. Weske, M.: Business process management architectures. In: Business Process Management, pp. 333–371. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28616-2_7
9. Activiti homepage. https://www.activiti.org/. Accessed 7 Jun 2021
10. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
A Knowledge Tracing Model Based on Graph Attention Mechanism and Incorporating External Features

Jianwei Cen1, Zhengyang Wu1,2(B), Li Huang2, and Zhanxuan Chen2

1 School of Artificial Intelligence, South China Normal University, Foshan 528225, China
[email protected]
2 School of Computer Science, South China Normal University, Guangzhou 510631, China
{wuzhengyang,lhuang,czxuan}@m.scnu.edu.cn
Abstract. In recent years, research has focused heavily on Knowledge Tracing (KT), a crucial technique for learner state modeling in intelligent education. Several KT models are based on graph convolutional networks (GCN-KTs), but none of them can distinguish the importance of exercises or knowledge concepts: existing GCN-KTs treat all neighboring nodes "equally" when performing graph convolution operations for exercise and concept embeddings, resulting in insufficiently robust node representations. Building on GCN-KTs, we offer a Knowledge Tracing model based on the Graph Attention Mechanism (GAFKT) with an encoder-decoder structure. The encoder applies a self-attention layer to the topology and node features of the exercises and concepts, respectively, and the decoder uses the inner product to reconstruct the graph structure. In GAFKT, the semantic model of student knowledge and exercises is further enriched by mapping external feature embeddings from the original data into the same space as the exercise and concept embeddings. This work compares GAFKT with several state-of-the-art models on two open-source datasets, and the results demonstrate its effectiveness.

Keywords: Knowledge Tracing · Graph Auto-Encoder · Graph Attention Mechanism · Online Learning

1 Introduction
At present, online learning systems continue to evolve as the internet and computer technology advance. Through online learning applications, students can access and download a variety of learning resources prepared by educators, such as handouts, slides, and videos [1]. In the context of the global COVID-19 pandemic, many countries have had to stop offline teaching and close schools to reduce social contact, and students have been forced to study at home [2].
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 187–200, 2023. https://doi.org/10.1007/978-981-99-2385-4_14
J. Cen et al.
Online education tutoring platforms, such as Massive Open Online Courses (MOOC) [3], adaptive practice systems [26], and Intelligent Tutoring Systems (ITS) [4], are playing an indispensable role in personalized learning for students and have received great attention from educators and the public. Simultaneously, with the rapid growth of learning platforms and the increasing number of students studying online, online platforms face many challenges, such as how to provide each student with teaching suited to their aptitude, and how to tailor the platform to how each student learns. We can accurately analyze students' mastery of knowledge through the large number of learning records they leave on the online platform, and provide them with a learning plan that suits their needs. Corbett et al. [5] introduced the Knowledge Tracing (KT) task into the ITS field for the first time. The purpose of KT is to accurately forecast how students will perform in future learning processes by modeling the changing knowledge state of learners during knowledge acquisition. Knowledge tracing has become an important research direction in intelligent education today [6,7]. For the KT task, research on KT models has emerged one after another, falling into two main categories [8]: (1) traditional KT models and (2) KT models based on deep learning. Representative methods of these two types are Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT). BKT [5] utilizes a Hidden Markov Model to capture a student's latent knowledge state, which is modeled as a binary variable indicating whether or not the student comprehends a knowledge concept.
DKT [9] is the first KT model to incorporate deep learning; it takes a one-hot vector representation of students' knowledge concept responses as input and utilizes a Recurrent Neural Network (RNN) to forecast the likelihood that students will correctly answer the problem at each time step. The proposal of DKT attracted the research interest of many researchers and scholars in related fields, who applied advances in artificial intelligence to knowledge tracing tasks and successively proposed a series of novel models, such as DKVMN [10], GKT [11], and JKT [12]. The Dynamic Key-Value Memory Network (DKVMN) [10] is an extension of DKT that maintains two matrices, key and value, which save and update the student's knowledge proficiency, respectively. GKT [11] is a graph neural network-based knowledge tracing model that transforms the structure of knowledge points into graphs and implements various knowledge graph structures. To a certain extent, these concept-based KT approaches have had positive outcomes, but they frequently ignore the unique characteristics of the exercises and fail to further capture the deep hidden relation between exercises and concepts. Moreover, external features related to the exercises (e.g., the time the student first answered, the type of problem, the difficulty of the exercise) are also ignored. JKT [12] converted the multidimensional connections of "practice to practice" and "concept to concept" into a graph structure and integrated them with the "practice to concept" connection. However, when the graph convolution network performs the convolution operation, it considers all adjacent nodes "equally," making it difficult to identify the
relevance of different nodes. Furthermore, JKT ignores the external aspects of the exercises, which is likely to lower the accuracy of the model's final prediction: even if two exercises include identical information, the probability of a correct response can vary substantially. To address the aforementioned problems, this paper suggests GAFKT, which builds on GCN-KT and enhances knowledge tracing by fusing a graph attention mechanism with external feature embeddings. Overall, this paper's contributions can be described by the three points below:
• This paper proposes the GAFKT model, which introduces a graph auto-encoder consisting of an encoder and a decoder. The encoder adopts a graph attention network to generate efficient latent representations for nodes, and the decoder reconstructs the graph structure using inner products. GAFKT can also capture the semantic information of the higher-level latent student knowledge state in the KT process. Compared with GCN, the node representations generated by the graph attention mechanism as the encoder are more robust.
• This paper extracts several external features related to the students' answering state from the original dataset, maps them into the same space as the exercise and concept embeddings, and finally predicts the students' learning state at the next time step through an LSTM. Experiments demonstrate that models trained with external features typically outperform models without them.
• Through experiments on open-source datasets, this paper compares the effectiveness of GAFKT with two classical knowledge tracing models, and the results indicate that GAFKT performs well on both evaluation metrics.
2 Related Work

2.1 Knowledge Tracing
With the rapid growth of artificial intelligence technology, various models have been proposed to address the knowledge tracing problem. For instance, BKT [5] models knowledge concepts without taking into account the differences between exercises, although in fact the degree of difficulty varies from exercise to exercise. Additionally, BKT's hidden state has only two options, "mastered" and "not mastered," lacking a richer and more nuanced expression. Piech et al. [9] proposed the DKT model in 2015, first introducing deep learning techniques into the KT task; DKT uses recurrent neural networks to track the student's changing learning state. DKT, however, ignores the connection between exercise and concept by decomposing each responded exercise into its constituent knowledge concepts and modeling the students' responses using only the position information of the concepts. JKT [12] is also a GNN-based knowledge tracing model; it focuses on the higher-order relations of "exercise-concepts" and applies graph convolution to draw rich implicit information out of their relational networks. The difference
190
J. Cen et al.
between our approach and theirs is that the graph structure and node features of the original data are input to the attention layer of the encoder to further mine the deeply hidden information of the exercises and knowledge concepts, which also provides a degree of model interpretability thanks to the attention mechanism.

2.2 Graph Auto-Encoder
Recently, the Graph Auto-Encoder (GAE) has been favored by researchers in various fields because of its concise encoder-decoder structure and its powerful encoding capability for graph embeddings in graph neural networks. For example, in the recommendation domain, Berg et al. [14] proposed the GCMC framework, using a graph auto-encoder to treat the matrix completion problem in recommender systems as a link prediction task on graphs; Mahdi et al. [15] used both the encoder and decoder of a GAE with multilayer perceptrons to learn graph embeddings, followed by other networks for rating prediction. In the field of bioinformatics, a multi-channel graph attention auto-encoder model was proposed in [16] to predict lncRNA-disease associations (LDA), and Sun et al. [17] applied a graph convolutional auto-encoder to learn low-dimensional embeddings of drug and target nodes, effectively integrating multiple types of connections in the network. In this study, we add an attention layer to the GAE in order to capture the higher-order information of the exercises and concepts.

2.3 Graph Attention Network
To address the drawbacks of previous graph convolutional network-based models, Graph Attention Networks (GAT) [13] were proposed. Earlier graph neural networks, such as [18], only mapped graphs or node information into vector representations or transformed them into tree structures that were then processed with recurrent neural networks. To make them applicable to sequential tasks, Li et al. [19] proposed GGS-NN, with corresponding improvements to GNN training. Later, researchers began to introduce convolutional operations into graphs; in [20], graph convolutional networks (GCNs) containing multiple convolutional layers were proposed to efficiently extract features from the graph and generate embedding representations of the nodes. In the KT domain, Song et al. [12] first extended GCN to the knowledge tracing task and fused two influence sub-graphs. However, graph convolutional networks assign the same weights to all neighboring nodes when performing convolution operations, which makes it harder to distinguish the importance of distinct nodes and leads to less robust node representations. Such a problem does not exist in GAT, where each node can assign different weighting factors to adjoining nodes according to their attributes, thus strengthening the representation of the nodes themselves.
In this work, we try to use GAT to handle knowledge tracing questions and finally construct predictive models fusing graph embedding representations and external feature representations to obtain the probability of students answering the questions.
3 Problem Definition
Knowledge tracing automatically tracks changes in knowledge state based on a student's historical learning trajectory and predicts the learner's knowledge mastery in the future learning process. The knowledge tracing problem may typically be defined as follows: given a learner's previous interaction sequence X_t = (x_1, x_2, ..., x_t) on a specific learning task, forecast the learner's performance on the next interaction x_{t+1}. Here x_t = {q_t, a_t}, where q_t denotes the corresponding exercise and a_t ∈ {0, 1} is binary, with 0 meaning a wrong answer and 1 a correct answer. The probability of a correct response is thus denoted p(a_t = 1 | q_t, X). Generally speaking, an exercise is composed of one or more concepts, and different exercises may share the same concepts, so relationships necessarily exist between exercises and between concepts. The primary symbols used in this paper are described in Table 1.
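As an illustration of how an interaction x_t = (q_t, a_t) is typically fed to such a model, the sketch below shows the common DKT-style one-hot encoding; the helper name and encoding scheme are assumptions for illustration, not necessarily GAFKT's exact input pipeline.

```python
import numpy as np

def encode_interaction(q, a, num_exercises):
    """One-hot encode x_t = (q_t, a_t) into a vector of length 2 * num_exercises.

    Convention commonly used in DKT-style models: position q is set when the
    exercise was answered correctly, position num_exercises + q when it was not.
    """
    x = np.zeros(2 * num_exercises)
    x[q if a == 1 else num_exercises + q] = 1.0
    return x
```

A sequence of such vectors is then consumed step by step by the sequence model (an RNN or LSTM) to predict p(a_{t+1} = 1 | q_{t+1}, X).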
Symbol
Description
Ge
exercise influence sub-graph
Gc
concept influence sub-graph
Ve , E e
exercise set, exercise interaction set
Vc , E c
concept set, concept interaction set
weij (i, j ∈ N ; i, j ≤ m) weighted influence of co-occurrence exercise edge < Vei , Vej > wcij (i, j ∈ N ; i, j ≤ m) weighted influence of co-occurrence concept edge < Vci , Vcj > Ae , Ac
adjacency matrix representation of exercise, concept
Xe , X c
attribute information matrix
Ze , Z c Aˆe , Aˆc
node represent matrix
Fout ∈ RN ×F
reconstructed sub-graph adjacency matrix representation
external feature matrix
To better accomplish the knowledge tracing task, some studies have modeled exercises and knowledge concepts before predicting students' knowledge acquisition level [11,12]. We follow the design of [12] in this work.
Definition 1 (Exercise Influence Sub-graph, E2E). The E2E sub-graph G_e = (V_e, E_e, w_eij) is formed from a given collection of exercises V_e and exercise interaction set E_e, with the exercise set V_e as vertices and the interaction set E_e as edges. The weighted influence of the co-occurrence exercise edge
<V_e^i, V_e^j> is w_eij (i, j ∈ N, i, j ≤ m), where "co-occurrence exercise edge" means that the student answered the same question more than twice in a row, N denotes the nodes of the exercise set, and m represents the interactions between exercises i and j. The calculation is as follows:

$$w_{e_{ij}} = \frac{g^c(V_e^i, V_e^j)}{\sum_{m} g^o(V_e^i, V_e^m)} \quad (1)$$

where w_eij represents the co-answer rate of V_e^i and V_e^j over all exercises involving co-occurrence with <V_e^i, V_e^j>, indicating the average probability of answering the same question correctly. g^c and g^o represent the numbers of co-answers and co-occurrences, respectively. If and only if w_eij is greater than 0, we consider the edge <V_e^i, V_e^j> to exist and define w_eij as the weight on the edge.
Definition 2 (Concept Influence Sub-graph, C2C). The C2C sub-graph G_c = (V_c, E_c, w_cij) is formed from a given collection of concepts V_c and concept interaction set E_c, with the concept set V_c as vertices and the interaction set E_c as edges, where N denotes the number of nodes and n denotes the number of interactions between concepts i and j. The weighted influence w_cij (i, j ∈ N, i, j ≤ n) of the co-occurring concept edge <V_c^i, V_c^j> is computed as:

$$w_{c_{ij}} = \frac{g^m(V_c^i, V_c^j)}{\sum_{n} g^o(V_c^i, V_c^n)} \quad (2)$$

In (2), w_cij represents the co-answer rate of V_c^i and V_c^j over all knowledge concepts involving co-occurrence with <V_c^i, V_c^j>. g^m and g^o represent the numbers of co-answers and co-occurrences, respectively. There is one such edge between each pair of co-occurring concept nodes, and w_cij is the weight on the edge.
4 Model
Here, we introduce Knowledge Tracing based on Graph Attention Mechanism and Incorporating External Features (GAFKT). Figure 1 depicts the overall architecture. The model is divided into three layers: the sub-graph building layer, the sub-graph embedding layer, and the prediction layer. First, we finish the sub-graph construction and represent each sub-graph by its adjacency matrix and attribute matrix, according to Definitions 1 and 2, respectively. The sub-graph embedding layer receives the adjacency matrix A and the attribute matrix X of a sub-graph G; an encoder with a single attention layer then generates a unified representation matrix Z for the nodes of each of the two sub-graphs. Next, the inner product ZZ^T is computed by the inner-product decoder to obtain the topological structure Â of the reconstructed graph. Finally, the obtained embedding representations of exercises and concepts are mapped into the same space, and the auxiliary information feature vectors are fused and fed into our LSTM prediction model. For convenience of description, the above process omits the subscripts of G, Z, X, and A; the same applies below.
A Knowledge Tracing Model Based on Graph Attention Mechanism
Fig. 1. Overview framework of the model.
4.1 Sub-graph Building Layer
This layer is responsible for reconstructing the input data. In order to retain as much structural information as feasible, two influence sub-graphs, "E2E" and "C2C," are generated from the original input data, as defined in Sect. 3 [12].
4.2 Sub-graph Embedding Layer
Inspired by the Graph Auto-Encoder (GAE), we use graph attention networks as the encoder of the GAE. The vast majority of existing graph auto-encoders use graph convolutional networks (GCN) [15] to learn the topology and attribute information of graphs. One drawback of GCN is that it assigns the same weight to all surrounding nodes during convolution; it is therefore difficult to differentiate the importance of each node, because neighboring nodes are treated as "equally" important. In contrast, this paper uses GAT, whose attention mechanism assigns relevance to neighboring nodes: each node in the graph can allocate different weighting factors to adjacent nodes according to their properties, strengthening the model's learning capability. The sub-graph embedding layer consists of an encoder, which uses a graph attention network (GAT) to generate valid latent representations for the nodes, and a decoder, which uses an inner product to reconstruct the graph structure. This paper uses a single graph attention layer, whose input is a set of node feature vectors:

(a) For the "E2E": h_e = {h_e^1, h_e^2, ..., h_e^N}, h_e^i ∈ R^F    (3)
(b) For the "C2C": h_c = {h_c^1, h_c^2, ..., h_c^N}, h_c^i ∈ R^F    (4)

In Eqs. (3) and (4), N is the number of nodes, and F is the feature dimension of the nodes.
For vertex i, we calculate the attention coefficient with each neighbor j one by one; the calculation is identical for "E2E" and "C2C", so we omit the superscripts e and c of the node representation h to simplify the formulas:

e_ij = σ(a([W h_i || W h_j])), j ∈ N_i    (5)

In (5), the vertex representation is first transformed by a linear mapping with a shared parameter W ∈ R^{F'×F}; [·||·] means that the transformed features of nodes i and j are concatenated; finally, the concatenated high-dimensional representation is mapped to a real number by the attention function a(·): R^{F'} × R^{F'} → R. Here, N_i denotes the neighborhood of node i, e_ij is the attention coefficient indicating the relevance of nodes i and j, and σ is a nonlinear activation function. Then, using the softmax function, Eq. (5) is normalized to obtain the normalized attention coefficient α_ij:

α_ij = softmax(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)    (6)
Next, a nonlinear activation function is applied to aggregate the transformed neighbor features weighted by the attention coefficients, which gives the final output feature of the node:

h_i' = σ(Σ_{j∈N_i} α_ij W h_j), h_i' ∈ R^{F'}    (7)

After the above steps, the adjacency matrices A_{e,c} and the attribute matrices X_{e,c} of the sub-graphs G_{e,c} are input into the encoder, and the latent embedding representation matrices Z_{e,c} = {h_1', h_2', ..., h_N'} of the exercises and concepts are generated through the attention layer, respectively. During this process, the dimension of the node representation may change, i.e., F → F'. Finally, the original sub-graph is reconstructed using the inner product as the GAE decoder. Employing the embedding representation matrices Z_{e,c}, the decoder's aim is to reconstruct a fresh topological information matrix Â_{e,c}:

Â_{e,c} = σ(Z_{e,c} Z_{e,c}^T)    (8)
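The attention computation of Eqs. (5)-(7) can be sketched as a single-head graph attention layer. The following dependency-free Python sketch is illustrative only: we assume LeakyReLU for the attention nonlinearity and ReLU for the output activation, and represent matrices as plain nested lists.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gat_layer(h, adj, W, a, leaky=0.2):
    """Single-head graph attention layer.
    h: node features (N vectors of dim F); adj[i]: neighbours of node i;
    W: F' x F weight matrix; a: attention vector of dim 2*F'."""
    wh = [matvec(W, hi) for hi in h]                      # W h_i for every node
    def e(i, j):                                          # Eq. (5)
        z = sum(ak * xk for ak, xk in zip(a, wh[i] + wh[j]))
        return z if z > 0 else leaky * z                  # LeakyReLU as sigma
    out = []
    for i in range(len(h)):
        nbrs = list(adj[i])
        scores = [e(i, j) for j in nbrs]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]         # Eq. (6): softmax
        total = sum(exps)
        alphas = [x / total for x in exps]
        agg = [0.0] * len(wh[0])                          # Eq. (7): weighted sum
        for al, j in zip(alphas, nbrs):
            for k in range(len(agg)):
                agg[k] += al * wh[j][k]
        out.append([x if x > 0 else 0.0 for x in agg])    # ReLU as sigma
    return out

# toy graph: 3 nodes, identity W, zero attention vector (uniform alphas)
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = {0: [1, 2], 1: [0], 2: [0]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 0.0, 0.0]
print(gat_layer(h, adj, W, a))  # node 0 receives the mean of its neighbours
```

With a non-zero attention vector `a`, the α_ij become non-uniform, which is exactly the property that distinguishes GAT from the uniform averaging of a GCN.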
In order to make the structure of Â as similar as possible to that of the original graph A, the following loss functions are minimized:

L_e = ||A_e − Â_e|| = Σ (A_e − Â_e)^2    (9)
L_c = ||A_c − Â_c|| = Σ (A_c − Â_c)^2    (10)

With this graph attention auto-encoder, we dig deep into the hidden relationships between the nodes of the exercise and concept sub-graphs.
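The inner-product decoder and the reconstruction losses of Eqs. (8)-(10) are straightforward to write down. The sketch below is again an illustrative pure-Python version, with the squared error summed element-wise.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reconstruct(Z):
    """Inner-product decoder, Eq. (8): A_hat = sigma(Z Z^T)."""
    n = len(Z)
    return [[sigmoid(sum(zi * zj for zi, zj in zip(Z[i], Z[j])))
             for j in range(n)] for i in range(n)]

def recon_loss(A, A_hat):
    """Eqs. (9)-(10): element-wise squared reconstruction error."""
    return sum((x - y) ** 2
               for ra, rb in zip(A, A_hat) for x, y in zip(ra, rb))

# two well-separated embeddings reconstruct a diagonal-heavy adjacency
Z = [[2.0, 0.0], [0.0, 2.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
A_hat = reconstruct(A_hat := None) if False else reconstruct(Z)
print(round(recon_loss(A, A_hat), 3))
```

Minimizing this loss with respect to the encoder parameters drives the latent matrices Z_{e,c} toward embeddings whose inner products recover the observed edges.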
4.3 Prediction Layer
Before making predictions, we extract from the original dataset all external features relevant to the exercises (e.g., the student's first response time, the type of problem, the difficulty of the problem), and combine them with the exercise embeddings and concept embeddings in the same space as the input for prediction. The fusion formula is shown in (11). To avoid dimensional inconsistencies when concatenating node features, we utilize a linear transformation with a trainable weight matrix W_f; f_out ∈ F, where f_out represents the external correlation feature matrix, as follows:

f = Concat(W_f [h_e ⊕ h_c], W_f f_out)    (11)
Finally, the LSTM network receives the fused vector f. Because the KT task can be considered a binary classification problem (predicting whether students will answer correctly or incorrectly), the loss function is the standard cross-entropy:

L = −Σ_t (y_t log ŷ_t + (1 − y_t) log(1 − ŷ_t))    (12)

where y_t denotes the true value and ŷ_t denotes the predicted value.
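A minimal sketch of the fusion and loss steps of Eqs. (11)-(12) is given below. Note the assumptions: we take ⊕ as element-wise addition of the exercise and concept embeddings, omit the trainable matrix W_f (treated as identity), and leave out the LSTM itself; only the concatenation and the cross-entropy are shown.

```python
import math

def fuse(h_e, h_c, f_out):
    """Eq. (11), simplified: concatenate the combined embedding with the
    external feature vector (W_f omitted for brevity)."""
    combined = [a + b for a, b in zip(h_e, h_c)]   # h_e (+) h_c
    return combined + list(f_out)

def bce_loss(y_true, y_pred, eps=1e-12):
    """Eq. (12): binary cross-entropy over a response sequence, with the
    conventional leading minus sign."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred))

print(fuse([1, 2], [3, 4], [5]))  # [4, 6, 5]
```

In the full model, `fuse` would be applied at every time step and the resulting sequence fed to the LSTM, whose sigmoid output supplies `y_pred` to the loss.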
5 Experiments

Table 2. Statistics of the datasets.

Statistics          ASSIST2009  Matmat
#students           3873        18847
#problems           15911       1150
#concepts           123         433
#records            432702      390367
problems per skill  283         335
skills per problem  1.29        2.66
problem types       7           5

5.1 Dataset
Two openly available datasets, ASSIST2009 and Matmat, were used in the experiments; each is detailed below. Table 2 displays the full statistics of the datasets after preprocessing.
• ASSIST2009 [21]. This open-source dataset was collected by the ASSISTments online tutoring website. This dataset is the most frequently used
https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/ skill-builder-data-2009-2010.
benchmark dataset for knowledge tracing tasks. Following existing research methods [12,22,23], we removed records with "noskill" knowledge concepts, scaffolding problems [25] (in which students complete, through the teacher's prompts, questions they could not complete independently), and user records with answer sequences shorter than 3.
• Matmat [26]. This is an open-source dataset about students learning mathematics, collected by the Adaptive Learning Group from the Adaptive Practice System. Similarly, we removed records with fewer than 3 student responses from this dataset.
Note that for each dataset we extract a number of external features related to the students' answering behavior, such as first answer time, problem difficulty, and problem type, which form a feature matrix mapped into the same space as the exercise and knowledge embeddings.
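The preprocessing described above can be sketched as a simple filter. The record layout and parameter names below are our own assumptions for illustration, not the paper's scripts.

```python
def preprocess(records, min_len=3, drop_skills=("noskill",)):
    """Drop interactions tagged with excluded skills and drop users whose
    remaining answer sequence is shorter than min_len, mirroring the
    filtering steps described above (scaffolding problems would be
    removed the same way, given a tag for them).

    records: {user_id: [(skill, correct), ...]}"""
    cleaned = {}
    for user, seq in records.items():
        kept = [(skill, correct) for skill, correct in seq
                if skill not in drop_skills]
        if len(kept) >= min_len:
            cleaned[user] = kept
    return cleaned

records = {
    "u1": [("a", 1), ("noskill", 0), ("b", 1), ("c", 0)],
    "u2": [("a", 1), ("b", 0)],   # too short after filtering
}
print(preprocess(records))  # only u1 survives, with 3 interactions
```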
5.2 Evaluating Indicator
Knowledge tracing is a prediction problem that can be regarded as a binary classification task, where students' correct/incorrect (1/0) exercise records represent the positive and negative classes. Therefore, following most previous research, we choose the following two evaluation indicators to verify the predictive ability of the model.
• The area under the curve (AUC) [24]: the greater the value of this indicator, the more likely a predicted positive is ranked ahead of a predicted negative.
• Accuracy (ACC): the greater the value of this indicator, the better the accuracy of the model in predicting the positive class.
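Both indicators are easy to compute from predicted probabilities. The sketch below uses the pairwise-ranking definition of AUC (the probability that a random positive outranks a random negative, ties counting one half) and thresholded accuracy; it is an illustration, not the paper's evaluation code.

```python
def auc(y_true, scores):
    """Pairwise AUC: P(score of a random positive > score of a random
    negative), with ties counted as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def acc(y_true, scores, threshold=0.5):
    """Accuracy of predictions thresholded at 0.5."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(y_true, scores))
    return hits / len(y_true)

y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.6, 0.2]
print(auc(y, p), acc(y, p))  # 0.75 0.5
```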
5.3 Baseline Methods
Four baseline models (BKT, DKT, DKVMN, and JKT) were chosen for comparison in order to evaluate the validity of GAFKT.
• BKT [5]: a KT model based on a Hidden Markov Model, which treats the student's knowledge mastery status as a binary variable updated by a Bayesian algorithm.
• DKT [9]: the first deep learning KT approach, which employs Recurrent Neural Networks (RNN) to forecast the probability that a student will correctly respond to a problem at each time step.
• DKVMN [10]: a memory-aware KT model that maintains two matrices, key and value, which save and update the student's knowledge proficiency, respectively.
• JKT [12]: a GCN-based KT model that converts the "exercise-concept" relationship of the original datasets into "exercise-exercise" and "concept-concept" multidimensional relationships in order to capture more complex semantic information.
https://www.fi.muni.cz/adaptivelearning/?a=data.
5.4 Experimental Setting
In this paper, Adam is used as the optimizer with the following settings: learning rate lr = 1e−5; batch size = 128; dropout = 0.6, which effectively avoids overfitting; and a single attention head (head = 1) for the graph attention network. The parameters of all baseline models use the best settings reported in their papers. The training set comprises 80% of each dataset, and the test set comprises the remaining 20%.
5.5 Overall Performance
Fig. 2. Performance of the baseline models compared to GAFKT.
Table 3 displays the AUC and ACC values of our model and each baseline model on both datasets. To compare the experimental results more visually, we plotted the data in Table 3 as a bar chart, as shown in Fig. 2. We observed that GAFKT significantly outperformed all baselines on all datasets. Specifically, compared to the approach that uses GCN and solely models the graph structure, GAFKT raised the AUC on ASSIST2009 and Matmat by 1.5% and 4%, respectively, and the ACC by 8.9% and 3.8%, respectively. Among the baseline models, BKT performs worse than the deep learning-based models, since it models knowledge states as binary variables and ignores the association between exercises and concepts, while the graph-based JKT model outperforms all sequence-based models overall, which indicates that structuring KT problems as graph representations better captures the higher-order relationships between exercises and concepts. These results demonstrate that each of the GAFKT improvements is effective: the joint optimization of node embeddings by the graph attention auto-encoder and the fusion of externally relevant features both help to enhance the overall performance of the model.
Table 3. The AUC and ACC on two datasets.

Method               ASSIST2009      Matmat
                     AUC    ACC      AUC    ACC
BKT [5]              0.708  0.688    0.630  0.627
DKT [9]              0.740  0.715    0.742  0.775
DKVMN [10]           0.743  0.726    0.750  0.724
JKT [12]             0.798  0.753    0.822  0.818
GAFKT+GCN            0.801  0.819    0.834  0.864
GAFKT (no features)  0.806  0.825    0.832  0.849
GAFKT                0.813  0.842    0.862  0.856
5.6 Ablation Experiment
We performed ablation experiments on each of the two datasets to further investigate the effect of the model's key components on its performance. The ablation setup is shown below, and the prediction performance is shown in Fig. 3.
• GAFKT+GCN: incorporates the external correlation features into the GCN-based KT model, exploring the effect of external correlation features on the model to demonstrate the effectiveness of our combined features.
• GAFKT (no features): only the graph attention auto-encoder is constructed, without the combined-feature optimization; this eliminates the influence of external features and demonstrates that we optimize the embedding representation of the sub-graphs more adequately.
• GAFKT: the complete model.
Fig. 3. Comparison of AUC and ACC for ablation experiments.
The results demonstrate that the knowledge tracing model's prediction performance can be greatly enhanced by integrating the graph attention mechanism and external features. In addition, GAFKT is superior to GAFKT+GCN,
which is due to the fact that GAT assigns weights to nodes according to the magnitude of their contributions to their neighboring nodes, improving the encoder's ability to process graph data. The AUC value of the knowledge tracing model incorporating GCN and external features is 0.2% higher than that of GAFKT without external features, which is attributed to the fact that external features further enrich the semantic representation of knowledge; thus the model's prediction performance is improved to some extent. The experiments demonstrate that these two key components help to improve model performance.
6 Conclusion
In this paper, we propose GAFKT, which for the first time uses a graph attention auto-encoder to model exercises and concepts in a KT task. Thanks to the graph attention layer, the learned graph node representations adaptively adjust the weights of the nodes in the graph according to their degree of importance. Furthermore, the model enhances the semantic modeling of students and exercises by mapping the external feature embeddings of the original data into the same space as the embeddings of the exercises and concepts. Experimental findings on two open datasets show that the proposed GAFKT has better prediction performance than the baseline models, raising the prediction accuracy of deep learning-based knowledge tracing. In the future, we will experiment on more datasets to validate the performance of GAFKT and conduct further cold-start experiments on knowledge tracing.

Acknowledgements. This research was supported by the National Natural Science Foundation of China (NSFC) under Grant No. U1811263. We would like to thank the anonymous reviewers for their constructive advice.
References
1. Ossiannilsson, E.: Sustainability: special issue "the futures of education in the global context: sustainable distance education". Sustainability (2020)
2. Pardamean, B., Suparyanto, T., Cenggoro, T.W., Sudigyo, D., Anugrahana, A.: AI-based learning style prediction in online learning for primary education. IEEE Access 10, 35725–35735 (2022)
3. Vardi, M.Y.: Will MOOCs destroy academia? Commun. ACM 55(11), 5 (2012)
4. Psotka, J., Massey, L.D., Mutter, S.A.: Intelligent Tutoring Systems: Lessons Learned. Psychology Press (1988)
5. Corbett, A.T., Anderson, J.R.: Knowledge tracing: modeling the acquisition of procedural knowledge. User Model. User Adap. Interact. 4(4), 253–278 (1994). https://doi.org/10.1007/BF01099821
6. Dowling, C.E., Hockemeyer, C.: Automata for the assessment of knowledge. IEEE Trans. Knowl. Data Eng. 13(3), 451–461 (2001)
7. Pardos, Z.A., Bergner, Y., Seaton, D.T., Pritchard, D.E.: Adapting Bayesian knowledge tracing to a massive open online course in edX. Educ. Data Min. 13, 137–144 (2013)
8. Abdelrahman, G., Wang, Q., Nunes, B.P.: Knowledge tracing: a survey. arXiv preprint arXiv:2201.06953 (2022)
9. Piech, C., et al.: Deep knowledge tracing. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
10. Zhang, J., Shi, X., King, I., Yeung, D.-Y.: Dynamic key-value memory networks for knowledge tracing. In: Proceedings of the 26th International Conference on World Wide Web, pp. 765–774 (2017)
11. Nakagawa, H., Iwasawa, Y., Matsuo, Y.: Graph-based knowledge tracing: modeling student proficiency using graph neural network. In: 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 156–163. IEEE (2019)
12. Song, X., Li, J., Tang, Y., Zhao, T., Chen, Y., Guan, Z.: JKT: a joint graph convolutional network based deep knowledge tracing. Inf. Sci. 580, 510–523 (2021)
13. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
14. Berg, R.V.D., Kipf, T.N., Welling, M.: Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017)
15. Kherad, M., Bidgoly, A.J.: Recommendation system using a deep learning and graph analysis approach. arXiv preprint arXiv:2004.08100 (2020)
16. Sheng, N., et al.: Multi-channel graph attention autoencoders for disease-related lncRNAs prediction. Brief. Bioinform. 23(2), bbab604 (2022)
17. Sun, C., Xuan, P., Zhang, T., Ye, Y.: Graph convolutional autoencoder and generative adversarial network-based method for predicting drug-target interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 19(1), 455–464 (2020)
18. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2008)
19. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
20. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
21. Feng, M., Heffernan, N., Koedinger, K.: Addressing the assessment challenge with an online system that tutors as it assesses. User Model. User-Adap. Interact. 19(3), 243–266 (2009). https://doi.org/10.1007/s11257-009-9063-7
22. Ghosh, A., Heffernan, N., Lan, A.S.: Context-aware attentive knowledge tracing. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2330–2339 (2020)
23. Liu, Y., Yang, Y., Chen, X., Shen, J., Zhang, H., Yu, Y.: Improving knowledge tracing via pre-training question embeddings. arXiv preprint arXiv:2012.05031 (2020)
24. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
25. Wood, D., Bruner, J.S., Ross, G.: The role of tutoring in problem solving. J. Child Psychol. Psychiatry 17(2), 89–100 (1976)
26. Papoušek, J., Pelánek, R., Stanislav, V.: Adaptive geography practice data set. J. Learn. Anal. 3(2), 317–321 (2016)
Crowd-Powered Source Searching in Complex Environments

Yong Zhao1, Zhengqiu Zhu1, Bin Chen1,2, and Sihang Qiu1,2(B)

1 National University of Defense Technology, Changsha, China
{zhaoyong15,zhuzhengqiu12,chenbin06,qiusihang11}@nudt.edu.cn
2 Hunan Institute of Advanced Technology, Changsha, China

Abstract. Source searching algorithms are widely used in different domains and for various applications, for instance, to find gas or signal sources. As source searching algorithms advance, search problems need to be addressed in increasingly complex environments. Such environments can be high-dimensional and highly dynamic. Therefore, novel search algorithms combining heuristic methods and intelligent optimization have been designed to tackle search problems in large and complex search spaces. However, these intelligent search algorithms usually cannot guarantee completeness and optimality, and therefore commonly suffer from problems such as local optima. Recent studies have used crowd-powered systems to address complex problems that machines cannot solve on their own. While leveraging human rationales in a computer system has been shown to be effective in making a system more reliable, whether the power of the crowd can improve source searching algorithms remains unanswered. To this end, we propose a crowd-powered source searching approach that uses human rationales as external support to improve existing search algorithms, while minimizing human effort using machine predictions. Furthermore, we designed a prototype system and carried out an experiment with 10 participants (4 experts and 6 non-experts). Quantitative and qualitative analysis showed that the source searching algorithm enhanced by the crowd could achieve both high effectiveness and efficiency. Our work provides valuable insights into human-computer collaborative system design.

Keywords: Source searching · Crowd-powered system · Crowd computing

1 Introduction
Source searching problems are ubiquitous in nature and in our daily lives, such as animals finding an odor source to acquire food in the wild and people searching for the emission source of air pollution. Traditional search algorithms, such as tree search and graph search algorithms, work well for limited search spaces given enough time. However, as technology advances and computing power explodes, people have started to expect search algorithms to solve search problems in more complex scenes under stricter time limits.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 201–215, 2023. https://doi.org/10.1007/978-981-99-2385-4_15
Complex search spaces can be high-dimensional and dynamic, and in many applications the computation must be completed in real time. Since traditional traversal algorithms can no longer meet these requirements, novel search algorithms have been proposed to tackle source searching problems in large and complex search spaces [23,31]. These novel search algorithms integrate human cognition and animal behaviors into the search rules, and gather information during the search process to dynamically adjust search parameters. While search efficiency has been dramatically improved, completeness and optimality can no longer be guaranteed. Therefore, it is possible that these search algorithms return a local optimum, or even no solution, instead of the global optimal solution. Researchers and practitioners have noticed this issue [14]. Recent work has focused on crowd-powered systems and human-AI collaboration [1]. These approaches enable humans to take part in an automatic process that is supposed to be completely controlled by machines, in order to solve complex problems. A crowd-powered system provides a new perspective in which humans get involved in the automatic problem-solving process to enhance the effectiveness and efficiency of algorithms [6,7,15,20]. Since human-machine collaboration has been proved feasible in a variety of domains, we see an opportunity to combine human rationales with search algorithms, to overcome the difficulties that current source searching approaches usually encounter. However, whether the power of the crowd is really effective and efficient in improving existing source searching algorithms remains unknown. To address this knowledge gap, in this work we are particularly interested in answering the research question: how can crowd-powered approaches improve the effectiveness and efficiency of existing source searching algorithms?
To answer the research question, we designed a human-machine collaborative framework that improves an existing source searching algorithm by crowdsourcing the problems that occur during the search. We implemented a prototype system in which a virtual robot autonomously searches for a source in complex simulated environments, to enable a user study. The crowd-powered system is responsible for detecting fatal problems, explaining the algorithm, giving suggestions, and generating tasks for humans to complete. When it is a human's turn, the human can either take full control of the robot or aid the robot in addressing problems. In particular, to better facilitate effective problem solving, the system predicts the location of the source using Bayesian methods and sequential Monte Carlo methods to further assist humans in making decisions and taking actions. We recruited 10 participants in this study, including 4 domain experts in the field of source searching and 6 non-experts with no experience in source searching, to evaluate the proposed human-machine collaborative approach in randomly generated complex search environments. The experiment shows that the crowd-powered system improves the performance of state-of-the-art source searching algorithms in both effectiveness (success rate 100%, 22% higher than the non-human methods) and efficiency
(using significantly fewer iterations/steps to find the source). Furthermore, we analyzed the system usability scores and cognitive workload scores reported by the participants. Results show that a specific way of interacting with the prototype system can achieve better usability and cognitive workload for a group of participants. Our work provides useful suggestions, valuable insights, and important implications for leveraging human-machine collaboration to improve source searching algorithms.
2 Related Work
We discuss related literature from two perspectives: crowd-powered systems and source searching.

2.1 Crowd-Powered Systems
The goal of designing a crowd-powered system is to leverage human rationales, combined with computer systems, to collaboratively solve complex problems. A typical crowd-powered system is the ESP game [38], an image labeling system developed by Google. This system successfully gamified the image labeling process and produced a large amount of data while users were enjoying the game. CrowdDB is another example, leveraging human input to process queries that database systems cannot answer [15]. Bozzon et al. proposed Crowdsearcher, a system that is able to answer search queries using the intelligence of crowds [6]. Furthermore, previous work combined human intelligence with machine learning methods to address problems such as conversational agents learning intents and text classification [3,39]. Recent studies recruited online users from crowdsourcing platforms and applied smart task scheduling and output prediction methods to produce city maps [30,33]. While human-in-the-loop systems have been shown to be effective in many domains, algorithm-in-the-loop systems have also started to play a critical role in human decision-making. Previous work introduced this concept and provided principles for human-AI decision-making and risk analysis [18,19]. In the domain of robotics and engineering, human-machine collaboration has long been used to address practical problems that can hardly be considered in theoretical models [17,25]. For instance, human-machine collaboration was effectively applied to address radiation source search and localization [5], spill finding and perimeter formation [9], and urban search and rescue response [10].
2.2 Source Searching
In general, source searching is a kind of problem that aims to determine the location of a source (of gas or signal) in the shortest possible time, which is of vital importance for both nature and mankind [12,24]; examples include the search for prey [21], submarines [11], survivors [36], and pollution sources [41]. As a classical kind of source searching algorithm, the bio-inspired algorithm typically
leverages the gradient ascent strategy to approach the source, based on the reasonable assumption that the signal emitted by the source has a greater intensity near the source [22,29]. However, in the presence of environmental disturbances (e.g., turbulence), the intensity gradient of the emitted signal may be disrupted, undermining the feasibility of bio-inspired searching algorithms [23]. An alternative kind of source searching algorithm has been developed based on Bayesian theory [12]. Previous works [23,31] proposed cognitive searching algorithms that model the source searching process as a Markov Decision Process. To further enhance the performance (i.e., success rate and efficiency) of a searching algorithm, multi-robot collaboration mechanisms [13,27,35] were designed and adopted. However, when source searching happens in complex environments, the search process often encounters fatal problems, resulting in wrong outcomes. In this work, we designed a prototype system for source searching in complex environments and carried out a user study with this system to answer the research question.
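As a caricature of the gradient-ascent strategy mentioned above, the following sketch greedily climbs a static intensity field on a grid. In a disturbance-free field it walks straight to the source; under turbulence the local gradient becomes unreliable, which is precisely what motivates the Bayesian and cognitive alternatives. The grid setting and function shape are illustrative assumptions, not taken from the cited algorithms.

```python
def gradient_search(field, start, max_steps=100):
    """Greedy gradient ascent on a grid: move to the neighbouring cell
    with the highest sensed intensity; stop at a local maximum."""
    x, y = start
    rows, cols = len(field), len(field[0])
    for _ in range(max_steps):
        best = (x, y)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < rows and 0 <= ny < cols
                    and field[nx][ny] > field[best[0]][best[1]]):
                best = (nx, ny)
        if best == (x, y):   # local maximum: may or may not be the true source
            break
        x, y = best
    return (x, y)

# intensity peaks at (1, 2); the greedy searcher finds it from (0, 0)
field = [[0, 1, 2],
         [1, 2, 5],
         [0, 1, 3]]
print(gradient_search(field, (0, 0)))  # (1, 2)
```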
3 Methods
To address source searching in complex environments, the search algorithm needs to navigate a robot that senses the signals emitted by the source while simultaneously moving in the environment. The search process ends when the source is found. There may be many obstacles in a complex environment, which hinder the movement of the robot and bring challenges to the search task, especially when the entire environment is unknown. Previous work proposed the Infotaxis search algorithm, featuring a reward function that determines which direction to go for source searching [37]. However, prior studies have shown that Infotaxis and its improved versions do not perform well in complex environments and still face problems occasionally [23,28,31,40]. To this end, we designed a prototype system based on the crowd-powered method to address source searching problems.
3.1 Method Overview
In this section, we present the design of the crowd-powered source searching method and explain how human rationales can be used during the search process to improve the effectiveness and efficiency of search algorithms. An overview of the method is shown in Fig. 1. In this work, human wisdom, together with machine intelligence, plays an important role in existing source searching algorithms, addressing problems that the machine cannot solve on its own. As shown in Fig. 1, the human-machine collaboration contains three main parts: problem detection and task generation by the machine, task explanation and solution suggestion by the machine, and problem solving and task completion by the human. In the following sections, we explain the three main parts in detail.
Fig. 1. The crowd-powered method that integrates human-machine collaboration into the search process.
3.2 Human-Machine Collaborative Tasks
The prototype system was designed following the framework shown in Fig. 1. The source searching algorithm used in this system is Infotaxis, one of the most popular novel search strategies and particularly effective for source searching problems [31,37].

Problem Detection and Task Generation. Current source searching algorithms (including Infotaxis) usually suffer from local optimum problems, which eventually result in no information gained and infinite loops. We therefore propose a simple rule-based mechanism to detect the no-information-gained and infinite-loop problems automatically: if a robot 1) passes by the same spot 5 times within a specific time window, and 2) acquires no information, the system detects a problem and pauses the search process. A task is then generated and crowdsourced, leveraging human intelligence to enable effective problem solving. The crowdsourcing task features a user interface where crowd workers can view the problem explanation and execute the task. A screenshot of the crowdsourcing task is shown in Fig. 2.

Task Explanation and Solution Suggestion. When a problem is detected, the task interface uses graphical elements to explain the task as well as the problem. In the prototype system, the goal of the explanation is to let users clearly "see" the problem. We show the direction that the robot wanted to go and the direction the robot had to go (because of obstacles), to help people understand why the problem could happen. We do not further explain the reasons, since a problem can be the consequence of many different factors; future work could focus on deep human understanding of problems. In addition, to enable effective human-machine collaboration, the task gives a solution suggestion, which helps crowd workers better execute the task. The solution suggestion
Y. Zhao et al.
Fig. 2. A screenshot of the crowdsourcing task generated by the crowd-powered source searching prototype system. (Color figure online)
features a source estimation method that uses Bayesian inference and sequential Monte Carlo methods to show the posterior probability distribution of the source location (see the green particles in Fig. 2) [2,34]. The machine also suggests an area where the source is most likely located, called the "belief source area", computed using DBSCAN [32], to provide more information that helps humans understand and address the problem.
Problem Solving and Task Completion. When the human (crowd worker) starts to execute the task, the prototype system provides two control modes. A full control mode allows the user to take over the search process and control every single step of the robot; an aided control mode allows the user to define a temporary goal (a targeted location), so that the robot pauses its current search activities and moves to the targeted location set by the user. We did not implement other problem-solving means in the prototype system, since they require more expertise and incorrect operations may lead to a high failure rate. Future work could consider implementing more control modes, such as setting forbidden areas and tuning search parameters.
3.3 Task Interface
As shown in Fig. 2, the task interface uses graphical elements to explain the source search task and the problem by displaying the searcher (robot), the search environment, the search route, the current search state, and a potential search goal (the estimated source). When a problem is found, the system automatically generates a crowdsourcing task and assigns it to a human crowd worker. The worker can then click the [EXECUTE] button on the interface to control the robot or plan a path for it, and click the [CONTINUE] button to resume the automatic search. This process continues until the source is found. The prototype system was developed using Python 3.7 and the tkinter package.
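The source posterior shown as green particles (Sect. 3.2) rests on a Bayesian reweighting of candidate source locations. A minimal, generic sequential Monte Carlo update step is sketched below; the function name and the toy likelihood are invented for illustration and are not the paper's exact filter.

```python
def smc_update(particles, weights, likelihood):
    """One Bayesian reweighting step (sketch): multiply each particle's
    weight by the likelihood of the latest observation given that candidate
    source location, then renormalize so the weights sum to 1."""
    new_w = [w * likelihood(p) for p, w in zip(particles, weights)]
    total = sum(new_w)
    if total == 0:                       # degenerate case: reset to uniform
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in new_w]

# Toy example: candidate source positions on a line, observation favoring x = 2.
particles = [0.0, 1.0, 2.0, 3.0]
weights = [0.25] * 4
posterior = smc_update(particles, weights,
                       likelihood=lambda x: 1.0 / (1.0 + (x - 2.0) ** 2))
```

Clustering the surviving high-weight particles (e.g., with DBSCAN) would then yield the "belief source area" displayed to the worker.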
4 User Study
We designed a user study to answer our research question. In this section, we introduce the experimental conditions, environments, measures, and procedure of the user study.
4.1 Experimental Conditions
As introduced in the previous section, we provide two interaction/control modes: Full Control (FC) and Aided Control (AC). The full control mode represents the problem-solving method in which humans take over the search process, while the aided control mode represents the problem-solving method of setting a temporary goal (letting the robot exit its current search state and then navigating it to a manually defined location). Furthermore, we used two baseline conditions in our experiment. The baseline 1 condition directly uses the state-of-the-art source search algorithm (Infotaxis), while the baseline 2 condition also uses our proposed automatic problem detection method and then navigates the robot to a random location in order to escape the problem. Note that the baseline 2 condition is therefore also an improvement over the state-of-the-art source search algorithm.
4.2 Experimental Environments
The source searching activities are performed by a virtual robot in a 2D 20 m × 20 m square area in a simulation setup. The search area is divided into a grid of 20 × 20 cells. Each cell contains an obstacle with probability Po. Po is set to 0.75 to give tasks a relatively high difficulty (more obstacles), since simple environments (with few obstacles) do not need human assistance as much. In this study, we did not consider the specific types or shapes of obstacles: if a cell contains an obstacle, the cell is considered completely obstructed and cannot be entered or traversed by the robot. The prototype system was deployed on a single machine, and all participants executed tasks on the same machine to ensure a fair comparison. Participants were invited to a quiet lab to make sure the experiment would not be interrupted by others.
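The environment generation described above can be sketched in a few lines; the function and parameter names are assumptions, and Po is treated as the per-cell obstacle probability exactly as stated in the text.

```python
import random

def make_environment(size=20, p_obstacle=0.75, seed=None):
    """Generate the simulated 2D search area (sketch): a size x size grid in
    which each cell independently contains an obstacle with probability
    p_obstacle (Po = 0.75 in the paper). A True cell is fully obstructed
    and cannot be entered or traversed by the robot."""
    rng = random.Random(seed)
    return [[rng.random() < p_obstacle for _ in range(size)]
            for _ in range(size)]

grid = make_environment(seed=42)
```

A search algorithm would then treat True cells as walls when expanding the robot's possible moves.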
4.3 Measures
In this study, we measure the effectiveness and efficiency of the source search process and outcomes. Effectiveness is measured by the success rate. As the source searching process can go on forever if the source is not found, we define a successful source search as the robot finding the actual source within 400 steps (a step is one iteration of updating the search states). If the robot cannot find the source (either with or without human involvement) within 400 steps, the
source search task is considered failed. Efficiency is measured by the number of steps the robot takes to successfully find a source; failed source searches are not taken into account when calculating efficiency. Furthermore, we measure the human execution time per task to see how engaged the participants are during task execution. We also use two standard questionnaires to understand the perceived usability and cognitive workload while using the crowd-powered source searching system. Perceived usability is measured by the System Usability Scale (SUS) [8]; from the ratings of the SUS items, we can derive scores in two aspects, usability and learnability [4,26]. Cognitive workload is measured using the NASA-TLX.
4.4 Procedure
We first asked participants to complete a demographic survey, which collected basic background information about their age, gender, education level, and domain knowledge about search algorithms. After the demographic survey, we briefly explained the experimental scenario (i.e., finding a gas source) and how to use the prototype system. Participants were then asked to complete the source searching tasks. Each participant completed 20 tasks using the 2 control modes, i.e., full control and aided control. To avoid learning biases, the order of the control modes was pre-scheduled: half of the participants (2 experts + 3 non-experts) first executed 10 full-control tasks and then the aided-control tasks, and the other half first executed the aided-control tasks and then the full-control ones. After finishing each control mode (10 tasks), participants rated their feelings on system usability and cognitive workload using the standard questionnaires.
5 Results
We evaluated the effect of using the crowd-powered method in source searching algorithms by measuring the effectiveness (success rate), the efficiency (the number of steps taken to find the source), the human execution time, the self-reported SUS scores, and the self-reported TLX scores.
5.1 Participants
We asked four experts (academic researchers or engineers), who had been working on topics related to source searching for at least 1 year, to participate in our study. Furthermore, we recruited 6 non-expert volunteers from our institute who had no experience in source searching. People involved in the prototype system development were not invited to the experiment, to avoid potential biases. The experiment was approved by the ethics committee of our institute.
5.2 Source Searching Result
We evaluated source searching from three perspectives, namely the effectiveness (the success rate), the efficiency (the number of steps used to find the source), and the human execution time per task. Results are shown in Table 1. The crowd-powered method clearly proved effective: the success rate reaches 100% in all but one case, approximately 22% higher than baseline 1 and 12% higher than baseline 2. This shows that leveraging human inputs can make the algorithm performance nearly perfect. Furthermore, we observe an improvement in efficiency when the full control mode is used, in comparison with both the aided control mode and the baselines. In general, both experts and non-experts performed well while collaborating with the machine to solve the problems of the search algorithm.

Table 1. Results of the source searching experiment.

Groups          Expertise    Effectiveness       Efficiency           Human execution time
                             (% success rate)    (# steps per task)   (seconds per task)
Full control    Expert       100                 138.85 ± 79.00       29.59 ± 25.47
                Non-expert   98                  144.73 ± 87.62       34.40 ± 30.16
Aided control   Expert       100                 175.10 ± 67.67       33.58 ± 27.87
                Non-expert   100                 165.67 ± 80.60       29.01 ± 29.51
Baseline 1      –            78.5                154.04 ± 91.32       –
Baseline 2      –            88                  179.64 ± 96.45       –
To better understand the differences among experimental conditions, we performed statistical analysis of the efficiency (number of steps) and the human execution time (per task). Since the numbers of search steps follow normal distributions according to the normality tests, we applied a two-way ANOVA to examine the effects of the two factors considered in this study, expertise (expert vs non-expert) and control mode (full control vs aided control), as well as their interaction. Results of the statistical tests are shown in Table 2. We found that the efficiency of source searching differs significantly with the control mode (p = 0.026), meaning the full control mode achieves better efficiency regardless of expertise.

Table 2. Results of two-way ANOVA for the efficiency (# of steps) of source searching.

Factors                                        dF   F-value   p-value
Expertise (Expert vs Non-expert)               1    0         0.9764
Control mode (Full control vs Aided control)   1    5.01      0.0263*
Expertise × Control mode                       1    0.68      0.4089

Note: an asterisk (*) represents a significant difference (p < 0.05).
Since the distributions of human execution time are not normal according to the normality tests (p < 0.003 for all data groups), we applied pairwise Mann-Whitney U tests for the significance tests (the α value was adjusted by Bonferroni correction). We did not find a significant difference in human execution time (p > 0.07 for all pairs), meaning neither expertise nor control mode significantly affects execution time. The source searching results convey three main messages:
1. The crowd-powered method is effective and efficient for improving source searching;
2. Through our design, non-experts can achieve performance similar to that of experts;
3. Taking over the machine during problem solving can further improve the efficiency of source searching.
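For reference, the Mann-Whitney U statistic used above can be computed by direct pair counting; this sketch omits the p-value and the Bonferroni adjustment (a statistics library would normally supply both).

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for sample `a` against sample `b` (sketch):
    count, over all pairs, how often a value from `a` exceeds a value from
    `b`, with ties counted as 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The O(n·m) pair counting is fine at this study's sample sizes; rank-based formulas are preferred for larger data.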
5.3 Usability
We asked all participants to fill out the System Usability Scale (SUS) after completing each control mode (i.e., full control and aided control), so each participant provided 2 SUS responses. Since we only recruited 10 participants (20 SUS responses in total), we did not use statistical tests in the analysis. SUS scores are reported in Table 3. According to previous studies [4,8,26], the SUS can measure the usability and learnability of a system. While we found that the full control mode (in which the search process is taken over by humans) yielded better search efficiency, the participants in general reported better perceived usability and learnability for the aided control mode than for the full control mode. Interestingly, for all the non-experts, the SUS scores of aided control were not lower than those of full control, meaning they all preferred the aided control mode. Among the experts, however, we observed more diverse opinions, and the difference in the experts' overall average SUS scores between aided control and full control was less pronounced (Aided Control 86.25 vs Full Control 80.63) than for the non-experts (Aided Control 80.42 vs Full Control 65.00). The SUS result conveys one main message: the participants, especially the non-experts, generally perceived better usability when they were aiding the machine than when they were taking over the machine.
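The SUS scoring used here follows the standard convention; the usability/learnability split below assumes the two-factor structure reported by Lewis and Sauro (cited as [4,26] above), with learnability derived from items 4 and 10. This is a sketch, not the authors' scoring script.

```python
def sus_scores(ratings):
    """Score a 10-item SUS questionnaire (ratings 1-5). Standard convention:
    odd items contribute (rating - 1), even items contribute (5 - rating);
    the overall score is the contribution sum times 2.5 (range 0-100).
    Learnability uses items 4 and 10 (scaled by 12.5); usability uses the
    remaining eight items (scaled by 3.125)."""
    assert len(ratings) == 10
    contrib = [(r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so i=0 is item 1
               for i, r in enumerate(ratings)]
    overall = sum(contrib) * 2.5
    learnability = (contrib[3] + contrib[9]) * 12.5
    usability = (sum(contrib) - contrib[3] - contrib[9]) * 3.125
    return overall, usability, learnability
```

All three scores land on the same 0-100 scale, which is what makes the AC/FC comparisons in Table 3 directly readable.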
5.4 Cognitive Workload
To understand the cognitive workload during human-machine collaboration, we administered the NASA-TLX scale after the participants completed each control mode (i.e., full control and aided control). Each participant likewise provided 2 TLX responses, resulting in 20 TLX responses in total; therefore, we did not perform statistical analysis. Results of the NASA-TLX scores on six dimensions, namely physical demand, mental demand, temporal demand, performance, effort, and frustration, as well as the overall TLX score, are reported in Table 4. We observed
Table 3. Results of the system usability. FC means the full control mode (humans take over the machine), while AC means the aided control mode (humans aid the machine).

Expertise    ID        Usability                 Learnability              Overall
Expert       1         AC (90.63) > FC (68.75)   AC (100.0) = FC (100.0)   AC (92.50) > FC (75.00)
             2         FC (93.75) > AC (90.63)   AC (62.50) > FC (50.00)   AC (85.00) = FC (85.00)
             3         AC (84.38) = FC (84.38)   FC (87.50) > AC (75.00)   FC (85.00) > AC (82.50)
             4         AC (81.25) > FC (71.88)   AC (100.0) = FC (100.0)   AC (85.00) > FC (77.50)
             Average   AC (86.72) > FC (79.69)   AC (84.38) = FC (84.38)   AC (86.25) > FC (80.63)
Non-Expert   5         AC (78.13) > FC (59.38)   AC (87.50) > FC (50.00)   AC (80.00) > FC (57.50)
             6         AC (87.50) > FC (68.75)   AC (87.50) > FC (62.50)   AC (87.50) > FC (67.50)
             7         AC (87.50) > FC (84.38)   AC (100.0) = FC (100.0)   AC (90.00) > FC (87.50)
             8         AC (71.88) > FC (65.63)   AC (100.0) > FC (87.50)   AC (77.50) > FC (70.00)
             9         AC (87.50) > FC (71.88)   AC (37.50) = FC (37.50)   AC (77.50) > FC (65.00)
             10        AC (78.13) > FC (43.75)   AC (37.50) = FC (37.50)   AC (70.00) > FC (42.50)
             Average   AC (81.77) > FC (65.63)   AC (75.00) > FC (62.50)   AC (80.42) > FC (65.00)
that the non-experts in general perceived less cognitive workload in the aided control mode than in the full control mode, across all the TLX dimensions. Among the experts, however, all felt that their performance when using the full control mode was not worse than in the aided control mode (3 out of 4 reported it as higher), the opposite of the finding for the non-experts. On the other dimensions, we again observed more diverse opinions from the experts, resulting in only small differences between the aided control mode and the full control mode.

Table 4. Results of the cognitive workload. FC means the full control mode (humans take over the machine), while AC means the aided control mode (humans aid the machine).

Expertise   ID        Physical demand         Mental demand           Temporal demand
Expert      1         AC (0) < FC (5)         AC (0) < FC (5)         AC (5) < FC (20)
            2         FC (65) < AC (80)       FC (50) < AC (70)       FC (0) < AC (5)
            3         AC (10) < FC (20)       AC (5) < FC (30)        AC (0) = FC (0)
            4         AC (60) = FC (60)       AC (60) < FC (80)       AC (0) = FC (0)
            Average   AC (37.5) = FC (37.5)   AC (33.8) < FC (41.3)   AC (2.5) …
(7)

FTT_{γ_k^{p−1}} represents the finish time of the transmission of task t_{γ_k^{p−1}}. The finish time FTT_i of the transmission of task t_i is

FTT_i = T_i^{e,tran} + STT_i, ∀i ∈ N        (8)

The decision variable y_{i,j}^e = 1 indicates that task t_i is executed on the edge server e_j (j ∈ M, M = {1, . . . , M}), which needs to satisfy the constraint of formula (9).
Σ_{j=1}^{M} y_{i,j}^e = x_i^e, ∀i ∈ N        (9)
Tasks are scheduled on the edge server after the data transmission. Assume that each edge server can only execute one task at a time. When the edge server e_j has h_j tasks to process and the processing order is {ψ_1^j, ψ_2^j, . . . , ψ_{h_j}^j}, the start execution time SET_{ψ_p,j}^e of task t_{ψ_p^j} on edge server e_j is

SET_{ψ_p,j}^e = { FTT_{ψ_p^j},                 p = 1
                { max{FTT_{ψ_p^j}, EAT_j^e},   p > 1        (10)
Task Offloading and Resource Allocation with Privacy Constraints
EAT_j^e = T_{ψ_{p−1}^j, j}^e        (11)
EAT_j^e represents the earliest available time of e_j when task t_{ψ_p^j} is scheduled, which is equal to the completion time of task t_{ψ_{p−1}^j} on edge server e_j. The completion time T_{i,j}^e of task t_i on edge server e_j can be expressed as

T_{i,j}^e = SET_{i,j}^e + T_{i,j}^{e,exe}, ∀i ∈ N, j ∈ M        (12)
T_{i,j}^{e,exe} represents the execution time of task t_i on edge server e_j:

T_{i,j}^{e,exe} = w_i / f_j^e, ∀i ∈ N, j ∈ M        (13)
Cloud Server Computing. In addition to the three stages of edge server computing, processing a task on the cloud server involves uploading the task from the base station to the remote cloud through the core network, processing the task on the cloud server, and returning the result to the base station through the core network. The execution time T_i^{c,exe} of task t_i on the cloud server is shown in formula (14).

T_i^{c,exe} = w_i / f^c, ∀i ∈ N        (14)
f^c represents the processing power of the cloud server. The completion time of task t_i on the cloud server can be expressed as

T_i^c = FTT_i + T_i^{c,exe} + 2 × L^c, ∀i ∈ N        (15)
The mathematical model of the task offloading and resource allocation problem with privacy awareness, aiming at maximizing the number of successfully executed tasks, is expressed as follows.

max N_success = Σ_{i=1}^{N} C_i        (16)

s.t. C_i = { 1, T_i ≤ δ_i
           { 0, T_i > δ_i        (17)

T_i = Σ_{p∈{l,e,c}} T_i^p × x_i^p        (18)
X. Zhu et al.
4 Task Offloading and Resource Allocation Algorithm with Privacy Awareness

A heuristic privacy-aware task offloading and resource allocation algorithm (PTORA) is proposed, which solves the task offloading and resource allocation subproblems alternately and optimizes them collaboratively. The PTORA algorithm includes five steps: 1) task offloading sequence generation; 2) offloading decision adjustment; 3) communication resource allocation; 4) computing resource allocation; 5) scheduling result adjustment. The PTORA algorithm is described as follows.
4.1 Task Offloading Sequence Generation

The PTORA algorithm successively offloads low-privacy and non-privacy tasks to find suboptimal offloading decisions. Three offloading sequence generation (OSG) rules are proposed to generate non-privacy task offloading sequences and low-privacy task offloading sequences.
• OSG1: Maximum ratio of the local execution time to the task transmission time first. The transmission time and the ratio of the local execution time to the transmission time can be calculated by formulas (19) and (20).

T_i^{tran} = d_i / B^e        (19)

ratio_i = T_i^{l,exe} / T_i^{tran}        (20)
• OSG2: Latest deadline first. Prioritize offloading tasks with higher latency tolerance, and allocate local computing resources to latency-sensitive tasks.
• OSG3: Minimum latest start time first. If a task starts local processing before this threshold, it can be completed within its deadline; otherwise, it times out. If the latest start time LST_i is less than 0, the local device cannot complete the task within the deadline. LST_i can be calculated by formula (21).

LST_i = δ_i − w_i / f_{λ_i}^l        (21)
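The three OSG rules reduce to different sort keys over the task set; a sketch, where the task representation (a dict with data size "d", workload "w", and deadline "delta") and the function name are assumptions:

```python
def osg_order(tasks, rule, bandwidth, f_local):
    """Order tasks under the three OSG rules (sketch).
    OSG1: largest ratio of local execution time to transmission time first
          (formulas (19)-(20));
    OSG2: latest deadline first;
    OSG3: smallest latest start time LST_i = delta_i - w_i / f_local first
          (formula (21))."""
    if rule == "OSG1":
        key = lambda t: -((t["w"] / f_local) / (t["d"] / bandwidth))
    elif rule == "OSG2":
        key = lambda t: -t["delta"]
    elif rule == "OSG3":
        key = lambda t: t["delta"] - t["w"] / f_local
    else:
        raise ValueError("unknown rule: " + rule)
    return sorted(tasks, key=key)
```

The same sort-key pattern applies to the TUS and LRA rules introduced below.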
4.2 Offloading Decision Adjustment

After each round of scheduling, the algorithm selects Kω tasks to join the offloaded task set S_offloaded according to the current load of the edge servers with privacy protection capability. If flag == true, too many low-privacy tasks have been offloaded, overloading the privacy edge servers; in this case, only the first Kω tasks in the non-privacy task offloading sequence Q0 are selected to join S_offloaded. If flag == false, the OSG3 rule is applied and the top Kω tasks with the smallest data amount across Q0 and Q1 are selected to join S_offloaded.

4.3 Communication Resource Allocation

The system bandwidth is allocated to each device according to the ratio of the data size the device needs to upload for its offloaded tasks to the total data size that needs to be uploaded by the whole system. The relationship between the offloaded task set U_k^off of device k and the bandwidth B_k is shown in formula (22).

B_k = B^e × (Σ_{i∈U_k^off} d_i) / (Σ_{p∈K} Σ_{q∈U_p^off} d_q), ∀k ∈ K        (22)
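Formula (22) is a proportional split of the total system bandwidth; a sketch (names are assumptions):

```python
def allocate_bandwidth(offloaded_sizes, total_bandwidth):
    """Formula (22) (sketch): split the system bandwidth B^e among devices
    in proportion to the total data size each device must upload.
    offloaded_sizes maps device id -> list of offloaded task data sizes."""
    grand_total = sum(sum(sizes) for sizes in offloaded_sizes.values())
    return {k: total_bandwidth * sum(sizes) / grand_total
            for k, sizes in offloaded_sizes.items()}
```

A device with no offloaded tasks simply receives zero bandwidth under this rule.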
In addition, a task upload sequence needs to be generated for each device, and three task upload sequence (TUS) generation rules are used: minimum computation first (TUS1), minimum data first (TUS2), and earliest deadline first (TUS3).
4.4 Computing Resource Allocation

Local Resource Allocation. When ordering tasks executed locally on the same device, only the workload and the deadline of tasks need to be considered. Three local resource allocation (LRA) strategies are designed: minimum calculation amount first (LRA1), earliest deadline first (LRA2), and minimum latest start time first (LRA3).
Edge Server Resource Allocation. When scheduling offloaded tasks, the server selection policy affects the scheduling results. Three edge server selection policies (SSP) are designed as follows.
• SSP1: Earliest available time first. Assign tasks to the server with the earliest available time, which reduces the edge server idle time.
• SSP2: Earliest finish time first. Assign tasks to the server with the earliest finish time, so that the completion time of the currently scheduled task is minimized.
• SSP3: Minimum waste first. WC_j represents the waste of edge server e_j, which is determined by the idle time and the processing speed of the edge server. The idle time of e_j is the difference between its next start execution time max{EAT_j^e, FTT_i} and its earliest available time EAT_j^e. WC_j is calculated by formula (23).

WC_j = (max{EAT_j^e, FTT_i} − EAT_j^e) × f_j^e        (23)
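SSP3's waste metric (23) can be evaluated directly over the candidate servers; a sketch (names are assumptions):

```python
def select_server_ssp3(servers, ftt):
    """SSP3 (sketch): pick the edge server minimizing the wasted capacity
    WC_j = (max(EAT_j, FTT_i) - EAT_j) * f_j (formula (23)), i.e. the idle
    time weighted by processing speed. `servers` is a list of (eat, f)
    pairs; returns the index of the chosen server."""
    wastes = [(max(eat, ftt) - eat) * f for eat, f in servers]
    return wastes.index(min(wastes))
```

Note that a fast server idling costs more under (23) than a slow one, which is what distinguishes SSP3 from plain earliest-available-time selection.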
The edge server resource allocation algorithm proceeds as follows. Since it is hard to predict whether a task will be executed successfully before the end of its scheduling, and to determine when subsequent tasks on the same device start uploading, a minimum priority queue priorityQueue is used to store the next task to be uploaded on each device. First, initialize priorityQueue and the failed task set T^failed as empty sets. Then traverse each device: sort the task set U_k^off to generate the task upload sequence L_k^off according to the TUS rule, and initialize the earliest transmittable time ETT_k of the device to 0. Next, take out the front task t_k of the queue, calculate its finish transmission time FTT_k, and add t_k to priorityQueue. During scheduling, take out the front task t_top of priorityQueue, and call Algorithm 2 or Algorithm 3 according to the task's privacy type. If t_top can be completed within its deadline, the earliest transmittable time of the device where t_top is located is updated to the finish transmission time of t_top; otherwise, it is not updated and t_top is added to T^failed. Finally, add the next task to be uploaded on the device where t_top is located to priorityQueue. Repeat this process until priorityQueue is empty and the offloading task scheduling ends.
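The queue-driven loop described above can be skeletonized with a binary heap; here deadline_check stands in for Algorithms 2/3 (which are not reproduced in this text), and all names are assumptions.

```python
import heapq

def schedule_offloaded(device_queues, bandwidth, deadline_check):
    """Skeleton of the scheduling loop (sketch). device_queues maps device
    id -> ordered upload list of (data_size, task_id); bandwidth maps device
    id -> allocated bandwidth B_k; deadline_check(task_id, ftt) returns True
    when the task can be completed within its deadline."""
    heap, done, failed = [], [], []
    ett = {k: 0.0 for k in device_queues}   # earliest transmittable time ETT_k
    idx = {k: 0 for k in device_queues}     # position of the next task per device

    def push_next(k):
        # Queue the device's next task, keyed by its finish transmission time.
        if idx[k] < len(device_queues[k]):
            size, tid = device_queues[k][idx[k]]
            idx[k] += 1
            heapq.heappush(heap, (ett[k] + size / bandwidth[k], tid, k))

    for k in device_queues:
        push_next(k)
    while heap:
        ftt, tid, k = heapq.heappop(heap)   # front task t_top
        if deadline_check(tid, ftt):
            done.append(tid)
            ett[k] = ftt                    # device may start its next upload
        else:
            failed.append(tid)              # add t_top to T^failed, keep ETT_k
        push_next(k)
    return done, failed
```

The heap orders pending uploads by FTT across all devices, matching the minimum-priority-queue behavior described in the text.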
4.5 Scheduling Result Adjustment

Due to the characteristics of the tasks themselves, too many tasks may be offloaded from the same device, leaving the local devices underutilized. To maximize the number of tasks completed within their deadlines, a scheduling result adjustment strategy based on rescheduling failed tasks is designed. The process is shown in Algorithm 4.
5 Experiment and Performance Analysis

The experiment is based on EdgeCloudSim in Java. The development environment is IntelliJ IDEA 2018.3.2 x64 on an Intel(R) Core(TM) i5-10210U CPU @ 1.60 GHz with 12 GB of RAM. The simulation scenario is set as follows. There are 8 edge servers deployed in the base station, 2 of which have privacy protection capabilities and 6 of which are ordinary edge servers. The processing power of the edge servers, the processing power of the local devices, the number of tasks generated by each device, and the task computation amount follow uniform distributions. The processing power of the cloud server is 5 GHz, the propagation delay between the base station and the cloud is 1 s, and the total system bandwidth is 50 Mbps. Experimental parameters are shown in Table 1.

5.1 Parameter Calibration

To evaluate the effect of the number of devices covered by the base station on the algorithm performance, the number of devices is set based on [23]. A ratio is used to represent the proportion of tasks with high, low, and non-privacy levels; its value range is {(5%, 20%, 75%), (10%, 30%, 60%), (15%, 40%, 45%), (20%, 50%, 30%)}. The experimental analysis is carried out under four data-size ranges di ∈ {[0.2, 0.5], [0.5, 1.5], [1.5, 3], [3, 5]} (MB) and five device counts K ∈ {10, 20, 30, 40, 50}. In total, 5 × 4 × 4 × 10 = 800 random instances are used, with 10 instances generated for each case. For the component calibration with Kω ∈ {5, 10, 15, 20, 25}, 3 OSG rules (OSG1, OSG2, OSG3), 3 TUS rules (TUS1, TUS2, TUS3), 3 LRA strategies (LRA1, LRA2, LRA3), and 3 SSPs (SSP1, SSP2, SSP3), i.e., 5 × 3 × 3 × 3 × 3 = 405 experiments per instance, the parameters are calibrated 800 × 405 = 324000 times. Each instance is repeated 10 times in the experiment. The relative percentage deviation (RPD) is used to evaluate and analyze the algorithm performance.
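The calibration grid described above can be enumerated directly, which also verifies the instance and experiment counts:

```python
from itertools import product

# Instance grid: 5 device counts x 4 privacy ratios x 4 data-size ranges,
# with 10 random instances each (800 instances in total).
devices = [10, 20, 30, 40, 50]
ratios = [(5, 20, 75), (10, 30, 60), (15, 40, 45), (20, 50, 30)]
data_sizes = [(0.2, 0.5), (0.5, 1.5), (1.5, 3), (3, 5)]
instances = list(product(devices, ratios, data_sizes, range(10)))

# Component grid: K_omega x OSG x TUS x LRA x SSP = 5 x 3 x 3 x 3 x 3 = 405.
components = list(product([5, 10, 15, 20, 25],
                          ["OSG1", "OSG2", "OSG3"],
                          ["TUS1", "TUS2", "TUS3"],
                          ["LRA1", "LRA2", "LRA3"],
                          ["SSP1", "SSP2", "SSP3"]))
total_runs = len(instances) * len(components)   # 800 * 405 = 324000
```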
For the task set T , assume that the solution obtained by the current algorithm is πω , and the corresponding number of tasks completed within the deadline is Nsuccess (πω ). For the current task set, the optimal solution obtained under
Table 1. Experimental Settings

Parameter                                                            Value
Number of edge servers M                                             8
Number of edge servers with privacy protection capabilities |E^P|    2
Number of ordinary edge servers |E^N|                                6
Number of devices K                                                  {10, 20, 30, 40, 50}
Number of tasks generated by each device                             [1, 40]
Workload w_i                                                         [0.4, 2.4] × 10^9 CPU cycles
Device CPU processing power f_k^l                                    [0.2, 0.5] GHz
Edge server CPU processing power f_j^e                               [1, 2] GHz
Cloud server CPU processing power f^c                                5 GHz
Propagation delay between base station and cloud data center L^c    1 s
System bandwidth B^e                                                 50 Mbps
different components or algorithms is πω*, and the corresponding number of tasks completed within the deadline is Nsuccess(πω*). The RPD is calculated by

RPD(%) = (Nsuccess(πω*) − Nsuccess(πω)) / Nsuccess(πω*) × 100%

The experimental results are analyzed with the ANOVA technique. Three main hypotheses (normality, homoscedasticity, and independence of the residuals) are checked from the residuals of the experiments. Apart from a slight non-normality in the residuals, all the hypotheses are easily accepted. As shown in Fig. 2(a), the RPD when Kω = 10 is not significantly different from that when Kω = 5, while the algorithm performance decreases when Kω is greater than 10. The reason is that a too-large Kω drives the remote resources from idle to overloaded and reduces the algorithm performance; therefore, Kω = 10 is adopted. As shown in Fig. 2(b), the performance of OSG1 is significantly better than that of OSG2 and OSG3. From Fig. 2(c), the algorithm performance is significantly better than the others when TUS3 is used for the task upload sequence, because TUS3 uploads tasks with earlier deadlines first, which helps prevent such tasks from missing their deadlines. In Fig. 2(d), the RPDs of LRA2 and LRA3 are smaller than that of LRA1; according to the calibration results, the LRA3 strategy is selected in this paper. As shown in Fig. 2(e), SSP1, SSP2, and SSP3 show no significant difference in their impact on the scheduling results. Among them, SSP3 is slightly better than the other two server selection policies; therefore, SSP3 is selected as the edge server selection policy.
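The RPD defined above reduces to a one-line helper; as a sketch:

```python
def rpd(n_best, n_current):
    """Relative percentage deviation (sketch): how far the current
    solution's number of on-time tasks falls short of the best-known
    solution, as a percentage of the best-known value."""
    return (n_best - n_current) / n_best * 100.0
```

Lower RPD is better, with 0% meaning the configuration matched the best-known solution on that instance.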
Fig. 2. The mean plot of parameter and component settings with 95% Tukey HSD intervals
5.2 Algorithm Comparison

To verify the performance of the proposed PTORA algorithm, the following representative strategies are compared.
• PASHE [24] algorithm: Prioritizes tasks according to their privacy, scheduling high-privacy tasks first, then low-privacy tasks, and finally non-privacy tasks.
• RANDOM algorithm: To prevent user privacy leaks, the RANDOM algorithm restricts high-privacy tasks to local execution only, while low-privacy and non-privacy tasks are randomly chosen to be processed locally or offloaded.
• NOCLOUD algorithm: Remote cloud resources are not considered; that is, tasks can only be executed locally or offloaded to edge servers for processing.
Figure 3 shows the interactions between data sizes and the compared algorithms as the number of devices changes. From Fig. 3(a), it can be seen that as the number of devices increases, the RPD of PTORA stays lower than that of the other three algorithms. The RPD of RANDOM shows a slight downward trend, while those of NOCLOUD and PASHE show upward trends. When the number of devices is large, the number of tasks increases, and the availability of cloud server resources seriously affects the number of tasks completed within the deadline; therefore, the performance gap between NOCLOUD and PTORA tends to widen as the number of devices grows. Figures 3(a–d) show that, as the data size increases, the transmission time of offloaded tasks increases and the limited system bandwidth becomes a performance bottleneck of computation offloading. Furthermore, the performance gaps between RANDOM and PTORA and between NOCLOUD and PTORA decrease, while the performance gap between PASHE and PTORA tends to increase.
Fig. 3. Interactions between data sizes and compared algorithms with 95% Tukey HSD
6 Conclusion

This paper investigates the task offloading and resource allocation problem with privacy protection in a multi-task, end-edge-cloud environment. A heuristic privacy-aware task offloading and resource allocation algorithm is proposed for end-edge-cloud computing according to the type of task privacy; the offloading decision and resource allocation are optimized alternately and iteratively. Experimental results show that the proposed algorithm outperforms the compared strategies.

Acknowledgment. This work was supported by the Key-Area Research and Development Program of Guangdong Province (No. 2021B0101200003) and the National Natural Science Foundation of China (Nos. 61872077 and 61832004).
References
1. Cohen, J.: Embedded speech recognition applications in mobile phones: status, trends, and challenges. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5352–5355. IEEE, Las Vegas, NV, USA (2008)
2. Kumar, K., Lu, Y.H.: Cloud computing for mobile users: can offloading computation save energy? Computer 43(4), 51–56 (2010)
3. Soyata, T., et al.: Cloud-vision: real-time face recognition using a mobile-cloudlet-cloud acceleration architecture. In: IEEE Symposium on Computers and Communications (ISCC), pp. 59–66. IEEE, Cappadocia, Turkey (2012)
4. Guo, F., et al.: An efficient computation offloading management scheme in the densely deployed small cell networks with mobile edge computing. IEEE/ACM Trans. Network. 26(6), 2651–2664 (2018)
5. Wang, C., et al.: Integration of networking, caching, and computing in wireless systems: a survey, some research issues, and challenges. IEEE Commun. Surv. Tutor. 20(1), 7–38 (2017)
6. Khan, W.Z., et al.: Edge computing: a survey. Future Gener. Comput. Syst. 97, 219–235 (2019)
7. Pace, P., et al.: An edge-based architecture to support efficient applications for healthcare industry 4.0. IEEE Trans. Indust. Inform. 15(1), 481–489 (2018)
8. Mora, H., et al.: Multilayer architecture model for mobile cloud computing paradigm. Complexity 2019, 1–13 (2019)
9. Zhou, J., et al.: Research advances on privacy preserving in edge computing. J. Comput. Res. Develop. 57(10), 2027–2051 (2020)
10. Sonmez, C., Ozgovde, A., Ersoy, C.: Edgecloudsim: an environment for performance evaluation of edge computing systems. Trans. Emerg. Telecommun. Technol. 29(11), e3493 (2018)
11. Mao, Y.Y., Zhang, J., Letaief, K.B.: Joint task offloading scheduling and transmit power allocation for mobile-edge computing systems. In: IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6. IEEE, San Francisco, CA, USA (2017)
12. Zhang, G.W., et al.: FEMTO: fair and energy-minimized task offloading for fog-enabled IoT networks. IEEE Internet Things J. 6(3), 4388–4400 (2018)
13. Lyu, X., Tian, H.: Adaptive receding horizon offloading strategy under dynamic environment. IEEE Commun. Lett. 20(5), 878–881 (2016)
14. Chen, M., Hao, Y.X.: Task offloading for mobile edge computing in software defined ultra-dense network. IEEE J. Sel. Areas Commun. 36(3), 587–597 (2018)
15. Zhang, Q., et al.: Dynamic task offloading and resource allocation for mobile-edge computing in dense cloud RAN. IEEE Internet Things J. 7(4), 3282–3299 (2021)
16. Yan, J., et al.: Optimal task offloading and resource allocation in mobile-edge computing with inter-user task dependency. IEEE Trans. Wireless Commun. 19(1), 235–250 (2019)
17. Chen, X., et al.: Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Trans. Networking 24(5), 2795–2808 (2015)
18. Chen, S.G., et al.: Efficient privacy preserving data collection and computation offloading for fog-assisted IoT. IEEE Trans. Sustain. Comput. 5(4), 526–540 (2020)
19. Hwang, R.H., Hsueh, Y.L., Chung, H.W.: A novel time-obfuscated algorithm for trajectory privacy protection. IEEE Trans. Serv. Comput. 7(2), 126–139 (2013)
20. Razaq, M.M., et al.: Privacy-aware collaborative task offloading in fog computing. IEEE Trans. Comput. Soc. Syst. 9(1), 88–96 (2022)
21. Wang, T., et al.: A three-layer privacy preserving cloud storage scheme based on computational intelligence in fog computing. IEEE Trans. Emerg. Top. Comput. Intell. 2(1), 3–12 (2018)
22. Lyu, X., et al.: Multiuser joint task offloading and resource optimization in proximate clouds. IEEE Trans. Veh. Technol. 66(4), 3435–3447 (2016)
23. Peng, K., Huang, H., Wan, S., Leung, V.C.M.: End-edge-cloud collaborative computation offloading for multiple mobile users in heterogeneous edge-server environment. Wireless Netw. 1–12 (2020). https://doi.org/10.1007/s11276-020-02385-1
24. Fizza, K., et al.: PASHE: privacy aware scheduling in a heterogeneous fog environment. In: IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud), pp. 333–340. Barcelona, Spain (2018)
A Classifier-Based Two-Stage Training Model for Few-Shot Segmentation Zhibo Gu, Zhiming Luo(B) , and Shaozi Li Department of Artificial Intelligence, Xiamen University, Xiamen, China [email protected]
Abstract. Over the past few years, deep learning-based semantic segmentation methods have reached state-of-the-art performance. However, the segmentation task is time-consuming and requires a large amount of pixel-level annotated data, which restricts its application. Benefiting from advances in the general segmentation task, few-shot semantic segmentation has also developed significantly. In this study, we propose a real-time training method based on feature transformation and a multi-stage classifier. The generalization ability of the model is enhanced through the strategy of real-time training. To address the inconsistency between the feature domains of the support set and the query set, we propose a feature transformation module, which uses a memory mechanism to map the query set features into the feature domain of the support set. The query set features can then better adapt to the classifier. The multi-stage classifier retains hierarchical information at different scales, and an attention mechanism is introduced to further explore information across sizes and channels, effectively preventing the misuse of high-level features. We conducted experiments on the COCO-20i dataset, and our model obtains good performance, i.e., 32.7% and 41.7% mIoU scores for the 1-shot and 5-shot settings, respectively.

Keywords: Few-shot Learning · Semantic Segmentation · Real-time

1 Introduction
The considerable improvement in deep learning-based computer vision methods in recent years mainly relies on the availability of large-scale annotated datasets and growing computing power. However, building large-scale datasets [2,7,13] is time-consuming and expensive, particularly for object detection and semantic segmentation tasks. Moreover, because they rely on a large-scale training dataset with exhaustive pixel-wise labels for every category, existing solutions have insufficient scalability to unseen categories. The human visual system can recognize a new category by referring to only a few examples. Inspired by this ability, few-shot learning has attracted more and more attention, aiming to recognize unseen categories in images by building a model from a few samples. Few-shot semantic segmentation develops segmentation ability in a computer system with only a few samples. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 235–242, 2023. https://doi.org/10.1007/978-981-99-2385-4_17
Deep learning-based semantic segmentation takes advantage of progress in image classification, and few-shot semantic segmentation methods are built on top of general semantic segmentation algorithms. Generally, a deep learning-based few-shot segmentation model is pre-trained on base categories, and its generalization is then tested on few-shot tasks with unseen categories. Each few-shot task contains an unlabeled test sample, referred to as the query set, and a few labeled samples, referred to as the support set. Many works on few-shot semantic segmentation follow the meta-learning [14,22,23], i.e., learning-to-learn, paradigm. They focus on the episodic-training scheme and on designing specialized architectures for the base training. However, episodic training assumes a similar structure between episodic-training tasks and new few-shot tasks, and that the unseen categories come from the same domain. These assumptions may limit the applicability of existing few-shot segmentation methods in realistic scenarios. To improve model generalization, we mainly focus on improving the testing phase, in which we initialize and train a small classifier for the specific few-shot task without meta-learning. This leads to outstanding performance on few-shot semantic segmentation. To address the generalization limitation of meta-learning, we propose in this paper a real-time training method based on feature transformation and multi-stage classifier modules. The generalization ability of the model is enhanced through the strategy of real-time training. To deal with the inconsistency between the feature domains of the support set and the query set, we propose a feature transformation module, which uses a memory mechanism to map the query set features into the feature domain of the support set. Therefore, the query set features can better adapt to the classifier.
The multi-stage classifier is used to retain hierarchical information at different scales, and an attention mechanism is introduced to further explore the relationships among feature information of various sizes and channels, effectively preventing the misuse of high-level features. The training of the model is divided into two stages. First, the backbone network is trained on a regular segmentation task to enhance its generalization ability for feature extraction. In the second stage, the backbone network parameters are frozen, and a multi-level classifier is trained for segmentation prediction according to the specific segmentation task. Training only a small classifier can effectively prevent the model from overfitting. Experiments on the COCO-20i dataset verify the effectiveness of our proposed methods, which achieve 32.7% and 41.7% mIoU scores for the 1-shot and 5-shot settings, respectively.
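The second training stage described above can be sketched as follows: with the backbone frozen, only a small classifier is fitted by gradient descent to the (fixed) features of the specific task. The simple binary linear classifier, feature dimension, and hyperparameters below are illustrative assumptions, not the paper's exact multi-stage architecture.

```python
import numpy as np

def train_task_classifier(feats, labels, n_iters=200, lr=0.5):
    """Stage 2: the backbone is frozen, so `feats` are fixed feature
    vectors; only this small linear classifier is fitted per task."""
    n, d = feats.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        logits = feats @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        grad = probs - labels                  # dL/dlogits for cross-entropy
        w -= lr * feats.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def predict(feats, w, b):
    return (feats @ w + b > 0).astype(int)

# Toy check: pixel features from two well-separated clusters of a task.
rng = np.random.default_rng(0)
fg = rng.normal(2.0, 0.3, size=(20, 8))   # "foreground" pixel features
bg = rng.normal(-2.0, 0.3, size=(20, 8))  # "background" pixel features
X = np.vstack([fg, bg])
y = np.array([1] * 20 + [0] * 20)
w, b = train_task_classifier(X, y)
print((predict(X, w, b) == y).mean())  # 1.0 on this separable toy set
```

Because only the small classifier's parameters are updated, each few-shot task can be fitted quickly and without overfitting the backbone.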
2 Related Work
Semantic segmentation is one of the hottest research fields in computer vision. Long et al. [10] first proposed Fully Convolutional Networks (FCN) for the problem of semantic segmentation. Since then, FCN-based semantic segmentation frameworks have adopted the encoder-decoder architecture, and subsequent few-shot semantic segmentation methods also use a similar model as the backbone. Compared with the conventional semantic segmentation task, few-shot semantic segmentation only uses limited samples to achieve segmentation.
Few-Shot Learning. Training a deep model requires the support of a large-scale dataset, which is expensive to obtain. However, human beings are good at learning to recognize a new category with a few samples. Inspired by this rapid learning ability, few-shot learning has sprung up, aiming to build effective machine learning models that quickly learn new category knowledge from a minimal number of samples. There are a lot of popular works in few-shot learning [4,6,8,11,12,17,18,20]. These works can mainly be classified into two types: gradient-based methods and metric learning-based methods. Few-Shot Classification. Gradient-based methods adopt stochastic gradient descent (SGD) to learn the commonalities between different tasks. Metric learning-based methods use a deep model as a feature embedding function and implement the few-shot task by comparing the spatial distances between training samples and testing samples. Prototype learning [1,9,19] is one of the most popular metric learning methods; in few-shot semantic segmentation, it has been the main type of method in the past few years. There are also a lot of works based on metric learning in few-shot classification [3,15,16,21].
3 Problem Definition
Few-shot semantic segmentation aims to recognize a category and segment every pixel belonging to that category by learning from a few labeled samples. The training set Dtrain and the testing set Dtest are from non-overlapping categories. In both the training and testing phases, each episode contains two sets: a support set S, which serves as the training set of the few-shot task, and a query set Q, which serves as the testing set of the few-shot task without labels. Few-shot tasks are formulated as N-way K-shot segmentation tasks. Specifically, K-shot means that the support set S includes K (image, mask) pairs for each semantic class, and there are N classes in the query set. The query set Qi contains Nquery (image, mask) pairs from the same categories as the support set. Generally, the model first extracts category information from the support set and then applies the preserved knowledge to perform semantic segmentation on the query set.
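The N-way K-shot episode construction described above can be sketched as follows; the `sample_episode` helper and its dataset layout are hypothetical illustrations, not code from the paper.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, seed=None):
    """Build one few-shot episode: a support set of K labeled (image, mask)
    pairs per class and a query set from the same classes.
    `dataset` maps class name -> list of (image, mask) pairs."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for c in classes:
        # Sample without replacement so support and query never overlap.
        pairs = rng.sample(dataset[c], k_shot + n_query)
        support += [(c, p) for p in pairs[:k_shot]]
        query += [(c, p) for p in pairs[k_shot:]]
    return support, query

# Toy usage with string placeholders standing in for (image, mask) pairs.
data = {f"class{i}": [f"sample{i}_{j}" for j in range(10)] for i in range(5)}
sup, qry = sample_episode(data, n_way=1, k_shot=5, n_query=2, seed=0)
print(len(sup), len(qry))  # 5 2
```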
4 Method
In this section, we first describe our model architecture and then introduce the two main modules of our network. As shown in Fig. 1, our model framework introduces a multi-stage classifier to perform classification on query images and a feature transformation module for domain mapping. Method Overview. Our model contains three parts: an encoder, a multi-stage classifier, and a feature transformation module. The backbone network is ResNet50, and different feature maps are spliced to obtain multi-size feature maps. In the inference stage of the model, we input the support set samples into the encoder to obtain their features Fs.
Fig. 1. Illustration of the architecture of our model. The query set and support set are embedded into deep features by a shared-weight encoder. Then a multi-stage classifier is trained on the support set. The feature transformation module maps the query features to the support feature domain for prediction.
Fs = fθ(x)   (1)
where fθ is the encoder function and x denotes the support images. Then, we initialize the multi-stage classifier, take the features extracted by the encoder as the classifier input, and iterate the training by gradient descent until the classifier converges.

y = classifier(Fs)   (2)
We obtain the query set features Fq from the encoder. The support set features and query set features are used as the input of the feature transformation module, which maps the query set feature vectors into the feature domain of the support set features to generate enhanced features. Finally, the classifier segments the enhanced features to obtain the final prediction:

yq = classifier(ftrans(Fq, Fs))   (3)
Feature Transformation Module. The feature transformation module takes the query set feature Fq and the support set feature Fs as input, using an untrained memory mechanism to map the query features to the support feature domain. We define Fs as the keys K, Fs as the values V, and Fq as the queries Q. For each term in Q, the cosine distance to the keys is calculated as their similarity. Then, according to the similarity matrix, Q is multiplied with the corresponding values to obtain Ftrans (Fig. 2):

Ftrans = softmax(Fq × K^T / √dk) · V   (4)
Fig. 2. The feature transformation module, where K and V denote the support features and Q denotes the query features.
Due to the difference in samples, the directly calculated Ftrans is aligned with the feature domain of the support set, but the continuity of the query feature vectors is disrupted. To achieve a smooth feature migration, the original query feature Fq and the mapped feature Ftrans are fused by a weighted sum to generate F̂q:

F̂q = (1 − α)Fq + αFtrans   (5)

Multi-stage Classifier. We first generate a prediction map from the deepest feature map. This prediction map is used to guide the further prediction of the next feature layer. Specifically, we concatenate the prediction P^i of layer i with F^{i+1}, the feature of layer i + 1, to get the fused feature F̂^{i+1}:

F̂^{i+1} = Concat(F^{i+1}, P^i)   (6)
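The feature transformation of Eqs. (4) and (5) can be sketched in NumPy as follows. The sketch follows the scaled dot-product form of Eq. (4) (the text also mentions cosine similarity, which would correspond to L2-normalizing the vectors first); the feature shapes and the default α are illustrative assumptions.

```python
import numpy as np

def feature_transform(Fq, Fs, alpha=0.5):
    """Map query features toward the support feature domain (Eqs. 4-5).
    Fq: (Nq, d) query feature vectors; Fs: (Ns, d) support features,
    serving as both keys K and values V. `alpha` is the fusion weight."""
    d_k = Fq.shape[1]
    scores = Fq @ Fs.T / np.sqrt(d_k)            # similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    F_trans = attn @ Fs                          # Eq. (4)
    return (1 - alpha) * Fq + alpha * F_trans    # Eq. (5)

rng = np.random.default_rng(0)
Fq = rng.normal(size=(4, 16))
Fs = rng.normal(size=(6, 16))
out = feature_transform(Fq, Fs)
print(out.shape)  # (4, 16)
```

Setting α = 0 returns the original query features, while α = 1 replaces them entirely with support-domain mixtures; the weighted fusion of Eq. (5) interpolates between the two.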
Before the classifier, we apply a channel attention module to learn the significance of different features:

SE(x) = x · σ(FC(GAP(x)))   (7)

where FC denotes the fully connected layer, GAP denotes global average pooling, and σ is the sigmoid function.
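A minimal NumPy sketch of the channel attention in Eq. (7); the single fully connected layer with bias omitted and the feature-map shape are simplifying assumptions.

```python
import numpy as np

def se_block(x, W):
    """Channel attention of Eq. (7): SE(x) = x * sigmoid(FC(GAP(x))).
    x is a (C, H, W) feature map; W is the (C, C) weight of the fully
    connected layer (bias omitted for brevity)."""
    gap = x.mean(axis=(1, 2))                 # GAP: (C, H, W) -> (C,)
    scale = 1.0 / (1.0 + np.exp(-(W @ gap)))  # FC + sigmoid -> (C,)
    return x * scale[:, None, None]           # channel-wise re-weighting

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))
W = rng.normal(size=(8, 8))
y = se_block(x, W)
print(y.shape)  # (8, 4, 4)
```

Each channel is scaled by a factor in (0, 1), so informative channels can be emphasized and less useful ones suppressed before classification.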
5 Experiments
We evaluate our proposed model on the COCO-20i dataset, which is built on the COCO dataset provided by Microsoft for image tasks. There are 32,800 images in 91 categories, which are divided into four splits. We train our model on 3 splits and evaluate it on the remaining one, using mIoU as the performance measure. We adopt ResNet50 as our backbone network, which is pretrained on ImageNet for classification. In the first stage, we train the encoder on a regular segmentation task over the three training splits, then test our model on the remaining split. For each few-shot task,
Table 1. Comparison of our model with other state-of-the-art methods on COCO-20i (mIoU, %). All methods use a ResNet50 [5] backbone.

                 1-shot                               5-shot
Method           split0 split1 split2 split3 mean     split0 split1 split2 split3 mean
PANet            31.5   22.6   21.5   16.2   23.0     45.9   29.2   30.6   29.6   33.8
RPMMs [20]       29.5   36.8   29.0   27.0   30.6     33.8   42.0   33.0   33.3   35.5
PPNet            34.5   25.4   24.3   18.6   25.7     48.3   30.9   35.7   30.2   36.2
CWT              32.2   36.0   31.6   31.6   32.9     40.1   43.8   39.0   42.4   41.3
Ours             30.2   35.4   32.8   32.4   32.7     37.2   45.8   39.9   43.8   41.7
a multi-stage classifier is trained on the support set. We conduct experiments on the 1-shot and 5-shot tasks. Table 1 shows that our model achieves state-of-the-art performance.

Table 2. Ablation study on COCO-20i for the Feature Transformation module and the Multi-stage Classifier (mIoU, %).

Transformation  Classifier  1-shot  5-shot
–               –           28.4    35.9
–               √           29.1    37.2
√               –           31.2    39.9
√               √           32.7    41.7
The ablation study investigates the contribution of each module in our model. We conduct experiments on COCO-20i for the Feature Transformation module and the Multi-stage Classifier. As shown in Table 2, the Feature Transformation module brings significant improvements on both the 1-shot and 5-shot tasks, and the Multi-stage Classifier further contributes to the final prediction.
6 Conclusion
We proposed a few-shot segmentation model based on Feature Transformation and a Multi-stage Classifier, trained with a two-stage strategy. First, the encoder is trained for semantic segmentation; with a large number of samples and sufficient training, the model obtains strong generalization ability. For a specific few-shot task, the Multi-stage Classifier is then trained on the support set and performs prediction on the query set. Feature Transformation maps query features into the support feature domain, and the multi-stage classifier takes full advantage of features from different stages. Results show that our model achieves state-of-the-art performance in few-shot segmentation and outperforms previous work.
References 1. Dong, N., Xing, E.P.: Few-shot semantic segmentation with prototype learning. In: BMVC (2018) 2. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4 3. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML, pp. 1126–1135 (2017) 4. Gairola, S., Hemani, M., Chopra, A., Krishnamurthy, B.: SimPropNet: improved similarity propagation for few-shot image segmentation. In: IJCAI, pp. 573–579 (2021) 5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016) 6. Li, X., Wei, T., Chen, Y.P., Tai, Y.W., Tang, C.K.: FSS-1000: a 1000-class dataset for few-shot segmentation. In: CVPR, pp. 2869–2878 (2020) 7. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48 8. Liu, W., Zhang, C., Lin, G., Liu, F.: CRNet: cross-reference networks for few-shot segmentation. In: CVPR, pp. 4165–4173 (2020) 9. Liu, Y., Zhang, X., Zhang, S., He, X.: Part-aware prototype network for few-shot semantic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 142–158. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58545-7_9 10. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) 11. Nguyen, K., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: ICCV, pp. 622–631 (2019) 12. Rakelly, K., Shelhamer, E., Darrell, T., Efros, A., Levine, S.: Conditional networks for few-shot semantic segmentation. In: ICLR (2018) 13. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. 
Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y 14. Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. In: BMVC (2017) 15. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017) 16. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: CVPR, pp. 1199–1208 (2018) 17. Tian, Z., Zhao, H., Shu, M., Yang, Z., Li, R., Jia, J.: Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 44(2), 1050–1065 (2020) 18. Wang, H., Zhang, X., Hu, Y., Yang, Y., Cao, X., Zhen, X.: Few-shot semantic segmentation with democratic attention networks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 730–746. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0_43 19. Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J.: PANet: few-shot image semantic segmentation with prototype alignment. In: ICCV, pp. 9197–9206 (2019)
20. Yang, B., Liu, C., Li, B., Jiao, J., Ye, Q.: Prototype mixture models for few-shot semantic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 763–778. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58598-3_45 21. Ye, H.J., Hu, H., Zhan, D.C., Sha, F.: Few-shot learning via embedding adaptation with set-to-set functions. In: CVPR, pp. 8808–8817 (2020) 22. Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C.: CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In: CVPR, pp. 5217–5226 (2019) 23. Zhang, X., Wei, Y., Yang, Y., Huang, T.S.: SG-One: similarity guidance network for one-shot semantic segmentation. IEEE Trans. Cybern. 50(9), 3855–3865 (2020)
EEG-Based Motor Imagery Classification with Deep Adversarial Learning Dezheng Liu, Siwei Liu, Hanrui Wu, Jia Zhang(B) , and Jinyi Long College of Information Science and Technology, Jinan University, Guangzhou 510000, China [email protected]
Abstract. Brain-computer interface (BCI) provides a direct communication pathway from the human brain to computers. Since electroencephalogram (EEG) data are greatly affected by individual differences, it is hard to acquire general models applicable across subjects. For this reason, we propose a domain adversarial neural network (DANN) that can efficiently extract and classify domain-invariant features across domains. DANN consists of three parts: feature extraction, label prediction, and domain discrimination. During training, labeled data of the source domain and unlabeled data of the target domain are used as inputs. The feature extraction part learns discriminative features of EEG signals through deep networks. The features of the source domain data are then fed into the label prediction part, which learns to distinguish between different task classes, while the domain discrimination part reduces inter-domain differences by learning domain-invariant features across domains. By jointly optimizing the label prediction and domain discrimination parts, the model learns task classification features that are domain-invariant. The DANN approach is validated on several BCI competition datasets, indicating its advantages in cross-subject motor imagery classification. Keywords: Transfer learning · Brain-computer Interface · Domain Adversarial Neural Network
1 Introduction
Brain-computer interface (BCI) provides a direct communication pathway from the human brain to computers [1]. EEG (electroencephalogram) is a widely used BCI signal, as it is inexpensive, noninvasive, and easy to use. In general, there are three types of BCIs: event-related potentials, steady-state visual evoked potentials, and motor imagery [2]. In this article, the focus is motor imagery (MI). The MI task is voluntarily induced: subjects imagine the motion of their body, which activates multiple parts of the brain. This activation is similar to the activation of the cerebral motor cortex caused by actual exercise, which has significant value in clinical and other application fields [3]. Although EEG-based BCIs have been widely studied, there still exist some problems in analyzing EEG signals. The features of EEG signals are greatly influenced by individual differences, and obtaining a general model applicable across subjects is difficult [4,
5]. The traditional approach is to train a model for each subject [4, 6], but training recognition models for EEG signals requires a lot of labeled EEG data, which are hard to collect. In the face of the above problems, transfer learning methods have become a primary focus. Transfer learning applies knowledge learned in one domain to a different domain, and many transfer learning applications have been proposed in BCI. For instance, some researchers implement transfer by reducing the variation in feature distributions across domains [7–9]. Besides, thanks to the excellent feature learning capability of neural networks, deep learning has received widespread attention in EEG classification [10–14]. However, inter-individual variation is still an obstacle limiting further improvement in accuracy, so we design a domain adversarial neural network (DANN) that can efficiently extract and classify domain-invariant features between different domains. In detail, DANN consists of three parts: feature extraction, label prediction, and domain discrimination. During training, labeled data of the source domain and unlabeled data of the target domain are used as inputs. The features extracted from the source domain data are sent into the label prediction part, which learns to distinguish between different task classes, while the features extracted from all domain data are sent to the domain discrimination part, which learns to minimize the inter-domain differences. By optimizing the model, the network learns task classification features that are domain-invariant. This paper is organized as follows. Section 2 reviews existing methods for EEG classification. Section 3 details the DANN method. Section 4 illustrates the experimental results. Finally, Sect. 5 gives the conclusion.
2 Related Work
To solve the above problems, some researchers improve performance with traditional analysis methods. For instance, in feature extraction, in addition to the traditional time-frequency features, other types of features have also been tried, such as connectivity features and higher-order statistics [15, 16]. Moreover, multiple studies have incorporated various features to represent EEG more informatively [17, 18]. In feature classification, methods such as deep learning and Riemannian geometry are used to enhance the capabilities of classifiers from various perspectives [19, 20]. However, for most traditional methods and their improvements, individual differences make it difficult to directly apply a model pre-trained on one subject to a new subject. Therefore, many researchers use transfer learning approaches to solve the above problems. Transfer learning-based methods for EEG signal classification are grouped into shallow methods and deep methods. In the first group, many methods build on the Common Spatial Pattern (CSP). For example, Kang et al. [7] developed two weighted-sum methods to linearly combine the covariance matrices of different subjects, thus obtaining a composite CSP for inter-subject knowledge transfer. Blankertz et al. [8] improved the coefficients of CSP to enforce an invariance property, thereby extracting CSP features that are less affected by noise. In addition, many methods generate domain-invariant features by reducing the differences across domains. For instance, Lan et al. [9] used domain adaptation methods such as transfer component analysis to reduce inter-domain differences in EEG signals across subjects.
In terms of deep learning methods, Daoud et al. [10] proposed a neural network to learn important spatial representations from various scalp positions; recurrent neural networks are then utilized to predict the incidence of epilepsy. Yildirim et al. [12] designed a new 1-D convolutional neural network model that is efficient, fast, uncomplicated, and simple to use. Wei et al. [13] proposed a spatial component-wise convolutional network (SCCNet), whose initial convolutional layer performs spatial filtering to enhance and denoise the EEG signals. Ming et al. [14] designed a deep network based on generative adversarial networks, which mitigates different variances to enhance the generalization ability of models. Different from the above approaches, the proposed adversarial inference method considers not only the conditional probability distributions across tasks, but also the marginal probability distributions across subjects. We design a domain adversarial network that adversarially learns the differences between tasks while narrowing the inter-domain differences. Finally, the deep representations extracted from the network are domain-invariant, thus achieving transfer for the classification of motor imagery.
3 The Proposed Method
In this section, we first describe the notations and definitions used later in this work. We then outline the architecture of our approach and detail the proposed method.

3.1 The Framework
Suppose {Xs, Ys} is the labeled source domain set and Xt is the unlabeled target domain set. In detail, Xs = {xi}_{i=1}^{ns}, Ys = {yi}_{i=1}^{ns}, and Xt = {xi}_{i=1}^{nt}, where xi ∈ R^{C×T}, yi ∈ {0, 1, ..., nc}, ns and nt are the sample sizes of the source and target domain sets, respectively, and C and T denote the numbers of channels and time points. DANN is shown in Fig. 1. Firstly, since EEGNet [21] works well on EEG data, an EEGNet model is trained using the source domain data and further leveraged to predict the result of the target data. However, the extracted features of the source and target domains vary owing to inter-domain discrepancies. For that, we design a domain discriminator to constrain the learned features across domains to be closer to each other. We then compose the EEGNet and the domain discriminator into DANN, which consists of the following three components:

Feature Extractor Gf. It learns discriminative feature representations of EEG signals through deep networks; Gf(·; θf) is the multi-dimensional feature vector of EEG data extracted by Gf, where θf is the parameter set of Gf. Specially, different from ordinary CNNs, the feature extractor uses a depthwise convolution to learn the spatial-domain features of EEG data, and a separable convolution to learn the time-spatial domain features.

Label Predictor Gy. It calculates the prediction result of EEG data from the features extracted by Gf. To decode the extracted features of EEG signals, we design a simple
and efficient classifier Gy(·; θy), where θy is the parameter set of Gy. Since the data Xt from the target domain are unlabeled, the source data are employed for training the feature extractor and the label predictor. First, the features of the EEG signal are extracted by Gf, written as Gf(Xs; θf). These features are then classified into labels Ŷ by the label predictor Gy:

Ŷ = Gy(Gf(Xs; θf); θy)   (1)

To make DANN predict the labels as correctly as possible, we minimize the discrepancy between the predicted labels Ŷ and the true labels Ys by the following formula:

Ey(θf, θy) = Ly(Ŷ, Ys)   (2)

Here Ly is the cross-entropy function, chosen for its effectiveness and simplicity.

Domain Discriminator Gd. We define Gd(·; θd) as the domain discriminator, where θd is the parameter set of Gd. By tricking the discriminator during training, the differences between domains are reduced. To achieve this, the domain labels of the source and target domains are set to 0 and 1, respectively. The features from both domains are fed into Gd to train the domain discriminator. To express the optimization function clearly, let X = {Xs, Xt}. Then Gf(X; θf) denotes the feature vectors of X, and Gd(Gf(X; θf); θd) gives the predicted domain labels D̂ of X. Cross-entropy again serves as the loss function to optimize the feature extractor and the domain discriminator:

Ed(θf, θd) = Ld(D̂, D)   (3)

where D denotes the true domain labels.

3.2 Training Detail and Prediction
To enable the proposed method to learn cross-domain knowledge, we try to maximize Ld while minimizing Ly as much as possible. One way to satisfy both criteria is to minimize the loss:

E(θf, θy, θd) = Ey(θf, θy) − λEd(θf, θd)   (4)

where λ is a hyperparameter trading off the two losses, and the total loss E(θf, θy, θd) can be obtained for training. Moreover, the parameter update rule of the proposed network is designed as follows:

(θ̂f, θ̂y) = argmin_{θf, θy} E(θf, θy, θ̂d)   (5)

θ̂d = argmax_{θd} E(θ̂f, θ̂y, θd)   (6)

where the parameters θ̂f, θ̂y, and θ̂d denote the saddle point of the function (4). As we can observe from Eqs. (5) and (6), the label prediction loss is minimized while the domain classification loss is maximized during training. Thus, the deep representations calculated by the trained model are domain-invariant. In addition, the detailed structure of DANN is shown in Table 1.
Fig. 1. The proposed domain adversarial neural network.
3.3 Optimization with Backpropagation
During training, the following rules are used to update the parameters of the network:

θf ← θf − η(∂Liy/∂θf − λ ∂Lid/∂θf)   (7)

θy ← θy − η ∂Liy/∂θy   (8)

θd ← θd − η ∂Lid/∂θd   (9)
where η is the learning rate and i denotes the sample index. During the feedforward pass, the input EEG data are transformed into deep representations with the parameters θf, and then Ly and Ld are generated by the fully connected layers with the parameters θy and θd, respectively. The backpropagation follows Eqs. (7)–(9). Based on this, we can obtain the optimal solution and a well-trained model.
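The update rules of Eqs. (7)–(9) can be sketched as follows, with the per-sample gradients assumed to be precomputed by backpropagation; the scalar toy values are purely illustrative.

```python
def dann_update(params, grads, eta, lam):
    """One step of Eqs. (7)-(9). `grads` holds the per-sample gradients
    dLy/dtheta_f, dLd/dtheta_f, dLy/dtheta_y, dLd/dtheta_d (assumed
    precomputed by backprop). The reversed sign of the domain-loss term
    in the feature-extractor update is what makes training adversarial."""
    theta_f, theta_y, theta_d = params
    dLy_df, dLd_df, dLy_dy, dLd_dd = grads
    theta_f = theta_f - eta * (dLy_df - lam * dLd_df)  # Eq. (7)
    theta_y = theta_y - eta * dLy_dy                   # Eq. (8)
    theta_d = theta_d - eta * dLd_dd                   # Eq. (9)
    return theta_f, theta_y, theta_d

# Toy scalar check of the update directions.
new = dann_update((1.0, 1.0, 1.0), (0.5, 0.2, 0.5, 0.5), eta=0.1, lam=1.0)
print(new)  # updated (theta_f, theta_y, theta_d)
```

In practice this sign reversal for θf is usually implemented with a gradient reversal layer between the feature extractor and the domain discriminator, so that a single backward pass produces all three updates.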
Table 1. The details of DANN.

Part                  Layer              Filters × (Kernel Size)   Activation
Feature extractor     Input EEG          –                         –
                      Conv2D             8 × (1, 32)               Linear
                      BatchNorm          –                         –
                      Depthwise Conv2D   2 × (64, 1)               Linear
                      BatchNorm          –                         ELU
                      Mean pooling       (1, 4)                    –
                      Dropout            –                         –
                      Separable Conv2D   16 × (1, 16)              Linear
                      BatchNorm          –                         ELU
                      Mean pooling       (1, 8)                    –
                      Dropout            –                         –
                      Flatten            –                         –
Label predictor       Fully connected    25                        ReLU
                      Dropout            –                         –
                      Fully connected    Number of classes         Softmax
Domain discriminator  Fully connected    100                       ReLU
                      Dropout            –                         –
                      Fully connected    Number of domains         Softmax
4 Experiment and Results

4.1 Data Description
In our experiments, we evaluate on two publicly available motor imagery BCI competition datasets [22, 23]. Their statistics are summarized in Table 2. For dataset IVa (M1), 5 subjects execute right-hand or right-foot motor imagery tasks within 3.5 s. For dataset IIIb (M2), recordings were obtained from three healthy subjects (O3, S4, X11). O3 executes a 5-s motor imagery task (left or right) in a virtual reality environment, whereas S4 and X11 execute a "basket paradigm" [24] motor imagery task (left or right) facing a screen in a real environment. For M1, each subject executes 140 right-hand MI and 140 right-foot MI trials; the number of channels is 118, and the sampling frequency is 100 Hz. For M2, the numbers of trials for the 3 subjects are 640, 1080, and 1080, with equal numbers of "left" and "right" trials, and the signals are collected from 2 channels at 125 Hz. Due to a recording bug, O3's trials 1–160 and 161–320 are identical, and only the labels of trials 321–640 are officially provided. Therefore, the number of trials for O3 becomes 319.
EEG-Based Motor Imagery Classification with Deep
249
Since the M1 task is executed within 3.5 s after the cue and the M2 task within 5 s after the cue, we extract the M1 data within [0.5, 3.5] s after the cue and the M2 data within [1, 5] s after the cue. Additionally, we use an 8–30 Hz band-pass filter to remove noise from the EEG signals and obtain cleaner signals.

Table 2. The statistics of the two datasets.

  Dataset   Number of Subjects   Number of Channels   Number of Time Samples   Trials per Subject   Class-Imbalance
  M1        5                    118                  350                      280                  No
  M2        3                    2                    625                      640, 1080, 1080      No
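The window extraction and band-pass filtering described above can be sketched as follows. This is a minimal NumPy sketch; the paper does not state the filter design, so the FFT-mask approach and the function name `preprocess_trial` are our illustrative choices:

```python
import numpy as np

def preprocess_trial(eeg, fs, t_start, t_end, band=(8.0, 30.0)):
    """Cut the [t_start, t_end] window after the cue from a (channels x
    samples) trial and keep only the 8-30 Hz band via an FFT mask."""
    a, b = int(round(t_start * fs)), int(round(t_end * fs))
    epoch = eeg[:, a:b]
    n = epoch.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(epoch, axis=1)
    spec[:, (freqs < band[0]) | (freqs > band[1])] = 0.0  # zero out-of-band bins
    return np.fft.irfft(spec, n=n, axis=1)

# e.g. an M1-like trial: 118 channels at 100 Hz, window [0.5, 3.5] s -> 300 samples
trial = np.random.randn(118, 400)
epoch = preprocess_trial(trial, fs=100, t_start=0.5, t_end=3.5)
```

In practice a causal IIR or FIR band-pass (e.g. a Butterworth design) would be a more typical choice; the FFT mask is used here only to keep the sketch dependency-free.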
4.2 Experimental Settings

The experiments are conducted under two settings: single source to single target (S → S) and multiple sources to single target (M → S). In S → S, each subject is chosen in turn as the target, and each of the remaining subjects serves in turn as the source. In M → S, each subject is chosen in turn as the target, and the data from all other subjects are concatenated as the source. For instance, the M1 dataset has 5 subjects (1, …, 5); thus there are 5 × 4 = 20 S → S tasks, e.g., 1 → 2 (subject 1 is the source and subject 2 the target), 1 → 3, 1 → 4, …, 5 → 4, and five M → S tasks, e.g., {1, 2, 3, 4} → 5, {1, 2, 3, 5} → 4, …, {2, 3, 4, 5} → 1. The balanced classification accuracy (BCA) serves as the performance measure:

BCA = (1/l) Σ_{c=1}^{l} tp_c / n_c    (10)
where tp_c denotes the count of true positives in class c, n_c is the count of samples in class c, and l is the count of classes.

4.3 Comparative Studies

DANN is compared with several state-of-the-art BCI classification algorithms:
➀ CSP-LDA (linear discriminant analysis) [25]: CSP maximizes the difference between the variances of two classes of data to obtain highly discriminative feature vectors; LDA serves as the classifier.
➁ CA (centroid alignment) [26]: CA achieves transfer across domains by reducing the discrepancy between the marginal probability distributions of the domains.
➂ CA-CORAL (correlation alignment) [27]: CORAL reduces domain discrepancies by aligning the second-order statistics across domains.
➃ CA-JGSA (joint geometrical and statistical alignment) [28]: JGSA is an unsupervised domain adaptation approach that reduces the shift between domains both statistically and geometrically.

Methods ➁➂➃ use an SVM (support vector machine) [29] as the classifier. The parameters of the baseline algorithms are initialized following the recommendations in the respective literature. For the DANN model, we use the Adam optimizer [30] with learning rate η = 1e−3, β1 = 0.9, β2 = 0.99, and ε = 10^−8. Moreover, we search λ from 0.2 to 1 in steps of 0.2 and find that λ = 1 is suitable for both the M1 and M2 datasets; thus, we set λ = 1. The batch size is set to 16. Before training, 70% of the data is randomly chosen from the target domain and combined with the source data for training; the remaining target-domain data serve as the test set to evaluate the model's performance. To verify the effectiveness of the DANN model, we repeat the above steps 10 times and compare the results using the average of the 10 runs.

4.4 Experimental Results Analysis

Tables 3 and 4 show the means and standard deviations of the BCAs on the two datasets with S → S and M → S transfers, respectively. For clarity, the maximum value is shown in bold. In both the S → S and M → S settings, the DANN approach obtains the highest accuracy on both datasets in comparison with the baseline approaches. The non-transfer-learning method CSP-LDA performs the worst. Some of the transfer learning methods behave inconsistently across the two datasets; for example, CA-CORAL performs better on M1 but worse on M2. In contrast, our proposed DANN model obtains the highest accuracy on both datasets under both evaluation settings, i.e., S → S and M → S, illustrating the superiority of our method in reducing inter-domain differences and extracting discriminative class features.
In addition, we perform paired t-tests on the BCAs to verify whether the accuracy improvement of DANN is statistically significant. Due to the small number of subjects in datasets M1 and M2, we combine the BCAs of M1 and M2 for the paired t-test. Prior to the test, we perform the Lilliefors test [31] to verify that the null hypothesis that the data come from a normal distribution cannot be rejected. Next, we apply a false discovery rate correction [32], a linear step-up procedure, to the paired p-values for each task at a fixed significance level (α = 0.05). The false-discovery-rate-adjusted p-values (q-values) are shown in Table 5. The proposed DANN performs significantly better than all baselines in S → S transfer. In the M → S tasks, the performance improvement becomes less pronounced, which is reasonable since, in machine learning, the variation among algorithms usually decreases as the training data size increases.
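The false discovery rate correction cited here is the Benjamini–Hochberg step-up procedure [32]. A minimal sketch of the q-value computation (the function name is ours):

```python
import numpy as np

def fdr_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values):
    q_(i) = min over j >= i of p_(j) * m / j, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                           # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)     # p_(i) * m / i
    q = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)                 # restore original order
    return out
```

A hypothesis is rejected at level α when its q-value is at most α, which matches the α = 0.05 threshold applied to Table 5.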
Table 3. Mean (%) and standard deviation (%) of the BCAs in S → S transfers.

  Method     M1             M2             Avg
  CSP-LDA    55.80 (8.50)   49.97 (0.81)   52.89
  CA         70.79 (13.12)  57.01 (10.76)  63.90
  CA-CORAL   74.30 (12.25)  56.63 (9.42)   65.47
  CA-JGSA    61.55 (17.04)  50.41 (1.21)   55.98
  DANN       76.43 (9.38)   66.22 (8.78)   71.33
Table 4. Mean (%) and standard deviation (%) of the BCAs in M → S transfers.

  Method     M1             M2             Avg
  CSP-LDA    63.64 (16.12)  60.09 (13.39)  61.87
  CA         75.50 (11.20)  63.28 (18.43)  69.39
  CA-CORAL   75.21 (7.95)   63.16 (18.65)  69.19
  CA-JGSA    65.43 (20.76)  51.65 (2.86)   58.54
  DANN       80.24 (8.85)   69.10 (7.23)   74.67
Table 5. False discovery rate adjusted p-values in paired t-tests (α = 0.05).

  Setting   DANN vs    M1 + M2
  S → S     CSP-LDA    .000
            CA         .006
            CA-CORAL   .027
            CA-JGSA    .001
  M → S     CSP-LDA    .014
            CA         .163
            CA-CORAL   .109
            CA-JGSA    .055
4.5 Algorithmic Properties

Figure 2 shows the training loss per iteration of the DANN model on the M1 and M2 datasets. The training loss drops sharply and then stabilizes after about 50 iterations, showing that DANN converges to the optimal solution efficiently.
Fig. 2. Convergence analysis of the DANN model on the M1 and M2 datasets.
4.6 Visualization
Fig. 3. T-SNE visualization of the feature distributions when transferring Subject 3’s data (source) to Subject 1 (target) in M1.
T-SNE [33] is a method for visualizing data distributions. To visualize the effectiveness of the DANN method in domain adaptation, t-SNE is used to reduce the high-dimensional EEG features to two dimensions so that we can observe whether the DANN method aligns the data distributions of different domains. Figure 3 shows the result of transferring the data from subject 3 to subject 1 in M1. We can observe that all the transfer algorithms, except the non-transfer algorithm CSP-LDA, bring the feature distributions of the two domains closer to each other.
DANN reduces domain discrepancies more effectively than the other baseline methods, demonstrating that DANN performs better in knowledge transfer.
5 Conclusion

Transfer learning is a popular and effective approach for EEG signal analysis, which can cope with the variations among EEG signals with different distributions. In this paper, we designed DANN, which can effectively extract and classify domain-invariant representations across domains. Specifically, the adversarial network consists of a feature extractor, a label predictor, and a domain discriminator. The feature extractor learns discriminative feature representations of EEG signals through deep networks. The label predictor is used to distinguish different tasks, while the domain discriminator is used to reduce the inter-domain differences of the features. By jointly optimizing the three components, the network learns task-specific features that are domain-invariant. We have validated the proposed method on several datasets and demonstrated that DANN outperforms several state-of-the-art transfer algorithms for cross-subject knowledge transfer of motor imagery data. In the future, it would be interesting to further improve the domain adversarial network to make it a stable EEG classification mechanism.

Acknowledgements. This work is supported by the National Natural Science Foundation of China (No. 62106084 & No. 61773179), the Outstanding Youth Project of Guangdong Natural Science Foundation of China (No. 2021B1515020076), the Natural Science Foundation of Guangdong, China (No. 2022A1515010468 & No. 2019A1515012175), the Fundamental Research Funds for the Central Universities, Jinan University (No. 21621026 & No. 21622326), and the Guangdong Provincial Key Laboratory of Traditional Chinese Medicine Informatization (No. 2021B1212040007).
References

1. McFarland, D.J., Wolpaw, J.R.: Brain-computer interfaces for communication and control. Commun. ACM 54(5), 60–66 (2011)
2. Giacopelli, G., Migliore, M., Tegolo, D.: Graph-theoretical derivation of brain structural connectivity. Appl. Math. Comput. 377, 125150 (2020)
3. Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., Muller, K.R.: Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Process. Mag. 25(1), 41–56 (2007)
4. Jayaram, V., Alamgir, M., Altun, Y., Scholkopf, B., Grosse-Wentrup, M.: Transfer learning in brain-computer interfaces. IEEE Comput. Intell. Mag. 11(1), 20–31 (2016)
5. Zheng, W., Lu, B.L.: Personalizing EEG-based affective models with transfer learning. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2732–2738 (2016)
6. Lin, Y.P., et al.: EEG-based emotion recognition in music listening. IEEE Trans. Biomed. Eng. 57(7), 1798–1806 (2010)
7. Kang, H., Nam, Y., Choi, S.: Composite common spatial pattern for subject-to-subject transfer. IEEE Signal Process. Lett. 16(8), 683–686 (2009)
8. Blankertz, B., Kawanabe, M., Tomioka, R., Hohlefeld, F., Müller, K.R., Nikulin, V.: Invariant common spatial patterns: alleviating nonstationarities in brain-computer interfacing. Adv. Neural Inform. Process. Syst. 20 (2007)
9. Lan, Z., Sourina, O., Wang, L., Scherer, R., Müller-Putz, G.R.: Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets. IEEE Trans. Cogn. Develop. Syst. 11(1), 85–94 (2018) 10. Daoud, H., Bayoumi, M.A.: Efficient epileptic seizure prediction based on deep learning. IEEE Trans. Biomed. Circuits Syst. 13(5), 804–813 (2019) 11. Xu, G., et al.: A deep transfer convolutional neural network framework for EEG signal classification. IEEE Access 7, 112767–112776 (2019) 12. Yıldırım, Ö., Pławiak, P., Tan, R.S., Acharya, U.R.: Arrhythmia detection using deep convolutional neural network with long duration ECG signals. Comput. Biol. Med. 102, 411–420 (2018) 13. Wei, C.S., Koike-Akino, T., Wang, Y.: Spatial component-wise convolutional network (SCCNet) for motor-imagery EEG classification. In: Proceedings of the International IEEE/EMBS Conference on Neural Engineering (NER), pp. 328–331 (2019) 14. Ming, Y., et al.: Subject adaptation network for EEG data analysis. Appl. Soft Comput. 84, 105689 (2019) 15. Brodu, N., Lotte, F., Lécuyer, A.: Exploring two novel features for EEG-based brain–computer interfaces: multifractal cumulants and predictive complexity. Neurocomputing 79, 87–94 (2012) 16. Zhang, H., Chavarriaga, R., Millán, J.R.: Discriminant brain connectivity patterns of performance monitoring at average and single-trial levels. Neuroimage 120, 64–74 (2015) 17. Frey, J., Appriou, A., Lotte, F., et al.: Classifying EEG signals during stereoscopic visualization to estimate visual comfort. Comput. Intell. Neurosci. (2016) 18. Roy, R.N., Charbonnier, S., Campagne, A., Bonnet, S.: Efficient mental workload estimation using task-independent EEG features. J. Neural Eng. 13(2), 026019 (2016) 19. Abu-Rmileh, A., Zakkay, E., Shmuelof, L., Shriki, O.: Co-adaptive training improves efficacy of a multi-day EEG-based motor imagery BCI training. Front. Hum. Neurosci. 362 (2019) 20. 
Dose, H., Møller, J.S., Iversen, H.K., Puthusserypady, S.: An end-to-end deep learning approach to MI-EEG signal classification for BCIs. Expert Syst. Appl. 114, 532–542 (2018) 21. Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: EEGNet: a compact convolutional neural network for EEG-based brain–computer interface. J. Neural Eng. 15(5), 056013 (2018) 22. Dornhege, G., Blankertz, B., Curio, G., Muller, K.R.: Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Trans. Biomed. Eng. 51(6), 993–1002 (2004) 23. Schlögl, A.: Dataset IIIb: Non-stationary 2-class BCI data. BCI Competition III (2005) 24. Vidaurre, C., Schlögl, A., Cabeza, R., Pfurtscheller, G.: A fully on-line adaptive brain computer interface. Biomed. Tech. 49(2), 760–761 (2004) 25. Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning. Springer, New York (2006) 26. Zhang, W., Wu, D.: Manifold embedded knowledge transfer for brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. 28(5), 1117–1127 (2020) 27. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, p. 30(1) (2016) 28. Zhang, J., Li, W., Ogunbona, P.: Joint geometrical and statistical alignment for visual domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pp. 1859–1867 (2017) 29. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011) 30. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv Preprint arXiv: 1412.6980 (2014)
31. Lilliefors, H.W.: On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 62(318), 399–402 (1967) 32. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57(1), 289–300 (1995) 33. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
Comparison Analysis on Techniques of Preprocessing Imbalanced Data for Symbolic Regression

Cuixin Ma1, Wei-Li Liu2(B), Jinghui Zhong1(B), and Liang Feng3

1 South China University of Technology, Guangzhou, China
[email protected]
2 Guangdong Polytechnic Normal University, Guangzhou, China
[email protected]
3 Chongqing University, Chongqing, China
Abstract. Symbolic regression is an important research field which aims to find a symbolic function to fit a given data set. Since real-world data sets are usually imbalanced, preprocessing imbalanced data sets is an important technique for improving the performance of symbolic regression. In this paper, we compare and analyze the performance of six techniques for preprocessing imbalanced data for symbolic regression. Specifically, three of them are data weighting techniques: proximity weighting, remoteness weighting, and nonlinearity weighting. The fourth is a data compression technique, which removes redundant data according to the weights to improve fitting accuracy and efficiency. The last two are data sampling techniques, namely synthetic minority over-sampling and K-medoids clustering undersampling.

Keywords: Symbolic regression · Imbalanced data · Data weighting · Data compression · Data sampling

1 Introduction
Symbolic regression aims to search for a symbolic mathematical expression that best fits a given data set. It has a range of real-world applications, such as biomechanical model discovery [1], modeling and evaluating chemical synthesis processes in industry [2], and time series prediction [3]. Genetic programming (GP) is the mainstream method for symbolic regression. However, existing GP-based symbolic regression techniques focus on fitting balanced data sets and ignore the imbalanced characteristics of the data, which may reduce the generality of the symbolic model found by GP.

This work is supported by the National Natural Science Foundation of China (Grant No. 62076098), the Guangdong Natural Science Foundation Research Team (Grant No. 2018B030312003), the GuangDong Basic and Applied Basic Research Foundation (Grant No. 2021A1515110072), and the research startup funds of Guangdong Polytechnic Normal University (Grant No. 2021SDKYA130).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 256–270, 2023. https://doi.org/10.1007/978-981-99-2385-4_19
In the literature, a number of methods for preprocessing imbalanced data have been proposed for classification and regression, such as the random sampling technique [4], the Gaussian noise method [5], the Synthetic Minority Oversampling Technique (SMOTE) [6], and Model-Based Synthetic Sampling (MBS) [7]. Existing imbalanced data processing methods can generally be categorized into two types: sampling methods and weighting methods. The sampling methods make the original data set balanced by adding or deleting data, whilst the weighting methods achieve this goal by adjusting the weights of the samples. Different preprocessing methods have different characteristics, and few efforts have been made to systematically compare and analyze them. To fill this gap, this paper compares six commonly used techniques for preprocessing imbalanced data sets for symbolic regression. Among the six techniques, three are data weighting techniques (i.e., proximity weighting, remoteness weighting, and nonlinearity weighting), the fourth is a data compression method, and the last two are data sampling methods.

The rest of this paper is organized as follows. Section 2 introduces the preprocessing methods for comparison. Section 3 describes the experiment design and the comparison results. Finally, Sect. 4 provides the summary of this study.
2 Preprocessing Algorithms for Imbalanced Data Sets

2.1 Data Weighting
Setting the weight of a data point first requires selecting its neighbors from the training samples. Note that the distance to a neighbor is calculated using only the input variables. Suppose a data set M = {M1, M2, ..., MN} has N data points, each with (D + 1) dimensions: D dimensions from the input space and 1 dimension from the output space. Each data point is denoted by Mi = (x_1^i, x_2^i, ..., x_D^i, y^i), i ∈ {1, 2, ..., N}, and its projection in the input space is denoted by Pi = (x_1^i, x_2^i, ..., x_D^i). The set of all input points is denoted by P and the output vector by Y = (y^1, y^2, ..., y^N)^T. The distance between two points Pi and Pj is calculated by

dist_x(Pi, Pj) = (1/D) Σ_{d=1}^{D} ||x_d^i − x_d^j||    (1)

where the norm can be either ||·||_1 or ||·||_2. In this paper, the L2 norm is chosen to measure the distance. Based on the distance values, we can select the k nearest neighbors of each Pi. After selecting the k neighbors, three data weighting techniques can be used to assign weights to the data points.

2.1.1 Proximity Weighting
The proximity weighting [8] directly takes the linear distance between a data point and its neighbors as the weight. First, k neighbors are selected for each
data point of the data set. Then the Euclidean distance from the k nearest neighbors to the data point is computed as the proximity weight of the point. Here, the weight of the i-th point in the data set M is calculated by

π(i, M, P, k) = (1/k) Σ_{j=1}^{k} ||Mi − nj(Mi, M, P)||    (2)

where nj(Mi, M, P) is the j-th nearest neighbor of Mi in the input space, which also means that the projection of nj(Mi, M, P) in the input space is nj(Pi, P). In this way, this weighting technique intuitively reflects the local importance of each data point.

2.1.2 Remoteness Weighting
The remoteness weight [9] normalizes the proximity weight. For each data point, k neighbors are selected and the proximity weight is computed first. The data points are then sorted according to their proximity weights, each data point is assigned its ordinal rank, and the ranks are processed so that the final weights of all data points sum to 1. The remoteness weight is defined as

ρ(i, M, P, k) = I[i; π(i, M, P, k)]    (3)

where I[i; π(i, M, P, k)] is the rank of point i after sorting by the proximity weight π(i, M, P, k). The remoteness weight is thus defined according to the distance of the point Mi from its k nearest neighbors in the input space, used as a weight after normalizing its proximity weight. As a result, only half of the data will have larger weights than the other half, regardless of the degree of data imbalance.

2.1.3 Nonlinearity Weighting
The first two weighting techniques can successfully detect data points that vary strongly with respect to their neighbors, capturing "steep" regions of the response surface, but they may ignore regions with low variability yet highly nonlinear changes (large changes in input space and small changes in output space). From a modeling point of view, however, highly nonlinear regions of the response surface are more important than steep but linear regions. Therefore, a weighting technique that focuses on the degree of nonlinearity [9] is introduced. The nonlinearity weight is defined as

ν(i, M, P, k) = dist_XY(Mi, Πi)    (4)

where Πi is the least-squares hyperplane fitted through the k neighbors, and dist_XY is a chosen distance measure in the input-output space. The nonlinearity weight highlights the nonlinear characteristics of a data point: the larger it is, the more the point helps to outline the profile of the function.
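The weighting schemes above share one skeleton: find each point's k nearest neighbors in the input space via Eq. (1), then score the point. Below is a minimal NumPy sketch of the proximity weight (Eq. (2)), together with one plausible reading of the rank-based remoteness weight (Eq. (3); the exact normalization used in [9] may differ):

```python
import numpy as np

def proximity_weights(X, Y, k=1):
    """Eq. (2): mean full-space distance from M_i to its k nearest
    input-space neighbors. X: (N, D) inputs, Y: (N,) outputs."""
    M = np.column_stack([X, Y])                      # points M_i in input+output space
    d_in = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d_in, np.inf)                   # a point is not its own neighbor
    w = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.argsort(d_in[i])[:k]               # k nearest neighbors (input space)
        w[i] = np.linalg.norm(M[i] - M[nbrs], axis=-1).mean()
    return w

def remoteness_weights(w_prox):
    """Eq. (3), read as: rank points by proximity weight and normalize
    the ranks so that all weights sum to 1 (our assumption on I[.])."""
    ranks = np.argsort(np.argsort(w_prox)) + 1       # 1 = smallest proximity weight
    return ranks / ranks.sum()
```

The nonlinearity weight (Eq. (4)) follows the same skeleton but replaces the neighbor distance by the residual of the point to the least-squares hyperplane fitted through its k ≥ D + 1 neighbors.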
In a D-dimensional input space, 1 or D + 1 neighbors are recommended for the proximity and remoteness weighting techniques, whilst at least D + 1 neighboring points are recommended for the nonlinearity weighting technique. Though the nonlinearity weight may fail to correctly define the least-squares plane fitted to the k nearest neighbors in the input space, the situation can be improved by increasing the number of neighbors. Accordingly, this paper uses 1 neighboring point for the proximity and remoteness weights, and D + 1 neighboring points for the nonlinearity weight.

2.2 Data Compressing
After the data points are sorted according to weight, a data compressing technique can be used to cut off inferior points with small weights. However, a one-time weighting only reflects the local importance of a data point relative to its k neighbors, not its global importance to the whole data set. To address this issue, the Simple Multidimensional Iterative Technique for Subsampling (SMITS) [9] data compressing technique is recommended. The SMITS method iteratively deletes the points with the lowest weight and then re-evaluates the weights of the remaining points. Specifically, the data compression repeats the following "delete-update-sort-delete" steps until the size of the remaining data set reaches a predefined value:

Step 1: Calculate the initial weight value for each point of the entire data set.
Step 2: Find and delete the points with the lowest weight.
Step 3: Update the weights of the neighboring points of the deleted points.
Step 4: Repeat Steps 2 and 3 until a predefined number of points are left in the data set.

2.3 Data Sampling
For data classification, techniques to preprocess imbalanced data sets include different types of data resampling, such as random resampling. Such techniques randomly select data points to replicate (oversampling) or to remove (undersampling), which may lead to overfitting or to the loss of important information, respectively. This subsection hence introduces two improved sampling techniques: oversampling to add points and undersampling to remove points.

2.3.1 SMOTE Oversampling
SMOTE [6] is a method of over-sampling the minority class by synthesizing new data samples to achieve data balance. For each data point of the minority class, the k nearest neighbors are first chosen, and new samples are then randomly generated along the lines between the point and its neighbors. With these new samples added, the decision area of the minority class is effectively enlarged. This method can also be
applied in the field of data regression. First, every data point whose gap to any of its neighbors exceeds a predefined threshold is found. Then, new points are generated at the midpoints of the lines between the point and its neighbors, until no gap between points is larger than the threshold. Although the fitted function curve of the original data set does not necessarily pass through the linearly generated data points, the approximation can be regarded as acceptable within a relatively local area. Besides, generating new points at the midpoints of the connecting lines promotes an even distribution of the final data set, reducing the effect of imbalance on the symbolic regression process.

2.3.2 K-medoids Clustering Undersampling
The K-medoids [4] clustering undersampling algorithm is based on an unsupervised clustering algorithm in which the cluster centers are actual data points. The cluster centers are initially chosen at random and are then repeatedly optimized until the best result (the lowest cost) is obtained. The final training set consists of all data of the minority class and the cluster centers of the majority class. In this experiment, we first initialize five cluster centers and divide the data set into five clusters by calculating the distance of each point to the cluster centers. The cluster centers are then re-selected within the clusters so that the total distance to the center of each cluster is minimized. After obtaining the new cluster centers, the whole data set is assigned again, until both the cluster centers and the clusters stop changing. After identifying the smallest class, the majority class is clustered again with the number of clusters equal to the size of the minority class. Finally, the cluster centers of the majority class together with all points of the minority class constitute the new data set, achieving a balance among the classes in the data set.
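The clustering loop described above can be sketched as follows. This is a plain K-medoids sketch: the `init` argument and function name are ours, and efficiency is ignored for brevity:

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0, init=None):
    """K-medoids clustering: cluster centers are actual data points,
    returned as indices into X along with the point-to-cluster labels."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = np.asarray(init) if init is not None else rng.choice(len(X), k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):                              # re-pick each cluster's medoid
            members = np.flatnonzero(labels == c)
            new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):                # converged
            break
        medoids = new
    return medoids, labels
```

For undersampling, the majority class would be clustered with k equal to the minority-class size, and the returned medoids replace the majority class in the training set.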
3 Experimental Studies

3.1 Data Set
In this study, nine two-dimensional functions and two three-dimensional functions with different characteristics are selected for testing, as listed in Table 1. For each test problem, the training set is an imbalanced data set generated using the target function, whilst the test set is a balanced data set. Note that there are no overlapping data points between the training and test sets of any test function in our experiments.
Table 1. Experimental function design.

  Function name   Target function
  F1              f(x) = x^4 + x^3 + x^2 + x
  F2              f(x) = x^5 − 2x^3 + x
  F3              f(x) = sin(x^2) cos(x)
  F4              f(x) = sin(x^2 + x) + sin(x)
  F5              f(x) = ln(x + 1) + ln(x^2 + 1)
  F6              f(x) = ln(sqrt(x^2 + 4))
  F7              f(x) = 2^x
  F8              f(x) = e^(sin x + 1)
  Salustowicz1d   f(x) = x^3 e^(−x) cos(x) sin(x) (sin^2(x) cos(x) − 1)
  Salustowicz2d   f(x1, x2) = x1^3 e^(−x1) cos(x1) sin(x1) (sin^2(x1) cos(x1) − 1)(x2 − 5)
  Kotanchek2d     f(x1, x2) = e^(−(x1 − 1)^2) / (1.2 + (x2 − 2.5)^2)
F1 is a non-monotonic, non-periodic polynomial function whose slope changes sign once. The training set takes 100 Gaussian-distributed points of x ∈ [−7, 7] plus 1000 dense points of x ∈ [4, 5], and the test set takes 100 uniform points of x ∈ [−7, 7]. The data are compressed to 100 points. The k-medoids clustering undersampled data set size is 395 points. The oversampled data set is approximately twice as large as the original one.

F2 is a monotonically increasing polynomial function; the slope does not change sign. The training set takes 1000 Gaussian-distributed points of x ∈ [−5, 5], and the test set takes 50 uniform points of x ∈ [−5, 5]. The data compression takes 50 points. The k-medoids clustering undersampled data set size is 45 points. The oversampled data set is approximately 14 times larger than the original one.

F3 is a non-monotonic, non-periodic mixed trigonometric function whose slope changes sign frequently. The training set takes 5000 Gaussian-distributed points of x ∈ [−10, 10], and the test set takes 200 uniform points of x ∈ [−4, 4]. The data compression takes 200 points. The k-medoids clustering undersampled data set size is 3110 points. The oversampled data set has a similar size to the original one.

F4 is a non-monotonic, non-periodic mixed trigonometric function whose slope changes sign frequently. The training set takes 5000 Gaussian-distributed points of x ∈ [0, 15], and the test set takes 250 uniform points of x ∈ [5, 10]. The data compression takes 200 points. The k-medoids clustering undersampled data set size is 2515 points. The oversampled data set is 30% larger than the original one.

F5 is a monotonically increasing mixed logarithmic function; the slope does not change sign. The training set takes 1000 Gaussian-distributed points of x ∈ [0, 12], and the test set takes 150 uniform points of x ∈ [0, 20]. The data compression takes 100 points. The k-medoids clustering undersampled data set size is 430 points. The oversampled data set is similar in size to the original one.

F6 is a non-monotonic, non-periodic mixed logarithmic function whose slope changes sign once. The training set takes 1000 Gaussian-distributed points of x ∈ [−10, 10] plus 2000 dense points within [2, 3]. The test set takes 150 uniform points of x ∈ [−5, 5]. The data compression takes 200 points. The k-medoids clustering undersampled data set size is 910 points. The oversampled data set is approximately twice as large as the original one.

F7 is a monotonically increasing exponential function; the slope does not change sign. The training set takes 3000 Gaussian-distributed points of x ∈ [−7, 15]. The test set takes 200 uniform points within [0, 10]. The data compression takes 200 points. The k-medoids clustering undersampled data set size is 180 points. The oversampled data set is approximately twice as large as the original one.

F8 is a non-monotonic, periodic mixed function whose slope changes frequently. The training set takes 2000 Gaussian-distributed points of x ∈ [−3, 5]. The test set takes 200 uniform points of x ∈ [−6, 3]. The data compression takes 200 points. The k-medoids clustering undersampled data set size is 875 points. The oversampled data set is approximately 4 times larger than the original one.

Salustowicz1d is a standard test function: a non-periodic, non-monotonic hybrid function with irregular variation. The training set takes 2000 Gaussian-distributed points of x ∈ [3.5, 6] plus 250 uniform points within each of [0, 3.5] and [6, 9]. The test set takes 1001 uniform points within [−0.5, 10.5]. The data compression takes 200 and 500 points. The k-medoids clustering undersampled data set size is 575 points. The SMOTE oversampling uses a maximum gap of 0.01 between data points, and the resampled data set is approximately three times larger.

Salustowicz2d is a standard test function: a non-periodic, non-monotonic hybrid function with irregular sign variation. The training set takes 4000 Gaussian-distributed points of x1 ∈ [4, 20] and x2 ∈ [4, 20]. The test set takes 1000 uniform points of x1 ∈ [0, 15] and x2 ∈ [−2, 13]. The data compression takes 400 points. The k-medoids clustering undersampled data set size is 2040 points. The oversampled data set is approximately twice as large as the original one.

Kotanchek2d is a standard test function: a non-periodic, non-monotonic hybrid function with irregular variation. The training set takes 4000 Gaussian-distributed points of x1 ∈ [−6, 10] and x2 ∈ [−4, 10]. The test set takes 1000 uniform points of x1 ∈ [−1, 4] and x2 ∈ [−2, 2]. The data compression takes 400 points. The k-medoids clustering undersampled data set size is 1955 points. The resampled data set is 75% larger than the original one.
C. Ma et al.

3.2 Symbolic Regression Parameter Setting
The SL-GEP [10] is adopted as the GP solver to fit the data. The parameters of the SL-GEP are set as follows: (1) the function set is {+, −, ∗, /, sin, cos, exp, ln(|x|)}; (2) the length of the main function is 21, where the length of the head is 10 and that of the tail is 11; (3) the number of ADFs is 2, with an ADF head length of 3 and tail length of 4; (4) the maximum number of generations is 20,000. In SL-GEP, the weighted fitting error is used to evaluate a solution:

$$\mathrm{RMSE}_w = \sqrt{\frac{\sum_{i=1}^{N} w_i\,(y_i - \hat{y}_i)^2}{N}} \quad (5)$$

where $w_i$ is the weight of the $i$-th sample. To avoid chance in the experimental results, each experiment was run 30 times with different random seeds. We recorded the error every 100 generations and took the mean error curve for comparison.
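For concreteness, the weighted error measure above can be sketched as follows (a minimal illustration; the function name and the choice of uniform test weights are ours, not the paper's):

```python
import numpy as np

def rmse_w(y_true, y_pred, w):
    """Weighted RMSE: sqrt(sum_i w_i (y_i - yhat_i)^2 / N)."""
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, w))
    n = len(y_true)
    return np.sqrt(np.sum(w * (y_true - y_pred) ** 2) / n)

# with uniform unit weights this reduces to the ordinary RMSE
y, yhat = np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])
print(rmse_w(y, yhat, np.ones(3)))  # sqrt(4/3)
```

Raising the weights of sparse-region samples makes their residuals dominate the fitness, which is exactly how the weighting schemes below steer the regression.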
3.3 Compared Algorithms

Three kinds of methods are designed for data processing in the experiments.
1. Direct weighting (weighted): directly apply the proximity weighting, remoteness weighting, or non-linearity weighting to the function fitting process.
2. Weighted compression (weighted+SMITS): perform one of the above three weighting methods, then apply the SMITS compression algorithm to the data set.
3. Weighted compression and then weighting (weighted+SMITS+weighted): perform one weighting technique, compress the data, and then perform the selected weighting again on the subset.
Note that the data sampling techniques (either oversampling or undersampling) are performed only once, and the resampled dataset is then used directly for regression, without data weighting or data compression.

3.4 Results and Analysis
For each problem, eleven data preprocessing schemes have been designed in addition to the original dataset. Thus, there are twelve regression results for each problem.

3.4.1 General Test Functions
Comparison Analysis on Techniques of Preprocessing

Problems of this category are F1 to F8 in Table 1. In the data pre-processing part, as can be seen from Fig. 1, the proximity-weight and remoteness-weight compression results are similar: most of the data points in the sparse part are retained, and more data points in the dense part are deleted. The non-linearity weight, on the other hand, mainly retains the data points around the inflection points of the curve. K-medoids clustering undersampling extracts class medoids from the majority classes to balance them with the minority classes. SMOTE oversampling approximately fills the gaps in the original data curve by generating new data points, and connects the outlier points with the dense points. According to the results in Fig. 2, the error curves of functions F1, F2, F3 and F4 drop faster using the non-linearity weight. The k-medoids clustering undersampling performed well on F2, but not on F1. The SMOTE oversampling method shows average performance among the methods. For the logarithmic and exponential test functions, i.e., F5, F6 and F7, there are still certain error values after 20,000 generations. The error curves show that the K-medoids undersampling method performs best on F5 and F7, but not on F6. The mixed function F8 has obvious periodicity and peak-valley characteristics; the best results are obtained by proximity-weight compression followed by re-weighting and by K-medoids clustering undersampling, both of which achieve a regression error of zero. The regression results of the SMOTE algorithm are similar to those of the non-linearity weighting method; however, data compression consumes less computational time than SMOTE oversampling, due to the smaller data set. Generally, the above results indicate that if the function has a certain regularity (such as periodicity, symmetry or parity), the non-linearity weighting method seems to perform better. If the function decreases slowly or increases quickly in a monotonic manner, the remoteness weighting method seems to perform better.
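The distance-based weighting idea discussed above can be illustrated with a toy sketch, assuming proximity-style weights proportional to the mean distance to the k nearest neighbours (in the spirit of Vladislavleva et al. [9]; this is not the paper's exact formula):

```python
import numpy as np

def proximity_weights(x, k=5):
    """Toy sketch: weight each sample by the mean distance to its k nearest
    neighbours, then normalise so the weights sum to 1.  Points in sparse
    regions get larger weights, mimicking the balancing effect above."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    d = np.abs(x - x.T)                       # pairwise distances (1-D data)
    # mean distance to the k nearest neighbours, excluding the point itself
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    return knn_mean / knn_mean.sum()

np.random.seed(0)
# dense cluster around 0 plus three sparse outlying points
data = np.concatenate([np.random.normal(0.0, 0.1, 200),
                       np.array([5.0, 7.0, 9.0])])
w = proximity_weights(data)
print(w[-3:].mean() / w[:200].mean())  # sparse points weigh far more
```

The same skeleton yields a remoteness-style weight by measuring distance to the data core rather than to the nearest neighbours.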
Due to the reduction of data points after compression, the regression time can be reduced by 20% to 60%, and most of the functions after compression achieve better or similar performance compared with the case without compression. The performances of the SMOTE oversampling and the K-medoids clustering undersampling methods are unstable; further study is needed to identify their preferred problem types.

3.4.2 Two-Dimensional Benchmark Test Functions
Salustowicz1d is highly complex, as shown in Fig. 3, and it is difficult to fit the function completely. The comparison results are shown in Fig. 4. It can be observed that the dataset after K-medoids clustering undersampling is more uniform, while the extraction of data points by non-linearity weight plus compression is concentrated on a few well-defined areas around the peaks and valleys of the transitions. According to the results in Table 2 and the error curves in Fig. 4, directly applying weights to the original data set, especially the proximity weight and the remoteness weight, had an improving effect. The other methods, especially the compression methods, do not perform well. Therefore, it seems that for complex function images, data should not be over-compressed. The SMOTE oversampling also showed a positive effect, while the K-medoids clustering undersampling achieves optimal results with minimal error values.
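The K-medoids clustering undersampling used throughout these experiments can be illustrated with a minimal sketch (a naive alternating assign/update loop rather than a full PAM implementation; all names and parameter values are illustrative):

```python
import numpy as np

def kmedoids_undersample(x, k, iters=20, seed=0):
    """Toy sketch of k-medoids-based undersampling: cluster the data and
    keep only the k medoids as the reduced data set."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    d = np.abs(x - x.T)                              # pairwise distance matrix
    medoids = rng.choice(len(x), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(d[:, medoids], axis=1)    # nearest-medoid assignment
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # medoid = member minimising total within-cluster distance
            new[j] = members[np.argmin(d[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return x[medoids].ravel()

np.random.seed(0)
data = np.concatenate([np.random.normal(0, 0.2, 300), np.random.uniform(3, 9, 30)])
subset = kmedoids_undersample(data, k=12)
print(len(subset))  # 12 representative points drawn from the data itself
```

Because medoids are actual data points, the reduced set stays on the original curve, which is why the undersampled sets in Fig. 4 look uniform.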
Fig. 1. Function F1 training set and test set
Fig. 2. Regression error curve of general test functions
Table 2. Minimum error value for all test functions (20,000 generations).

function (RMSE_w)                   F1   F2   F3       F4   F5       F6        F7
origin                              0    0    0        0    0.299    0.05591   14.016
proximity                           0    0    0        0    0.2872   0.04911   22.0786
remoteness                          0    0    0        0    0.2973   0.06240   13.240
non-linearity                       0    0    0        0    0.4005   0.050812  20.325
proximity+SMITS                     0    0    0        0    0.2849   0.04999   15.390
remoteness+SMITS                    0    0    0        0    0.2626   0.05520   13.335
non-linearity+SMITS                 0    0    0        0    0.2875   0.05497   15.639
proximity+SMITS+proximity           0    0    0        0    0.2744   0.04885   16.849
remoteness+SMITS+remoteness         0    0    0        0    0.3123   0.04862   16.160
non-linearity+SMITS+non-linearity   0    0    0        0    0.3710   0.05229   15.612
K-medoids                           0    0    0.2929   0    0.2630   0.5900    13.2237
SMOTE                               0    0    0        0    0.3098   0.0514    19.0273

function (RMSE_w)                   F8        sal1d(200)   sal2d    kot2d
origin                              0.04393   0.08886      0.4276   0.01166
proximity                           0.07050   0.07993      0.3978   0.01129
remoteness                          0.09115   0.08658      0.4038   0.01143
non-linearity                       0.01373   0.17206      0.5665   0.01227
proximity+SMITS                     0.05828   0.13335      0.3859   0.1164
remoteness+SMITS                    0.03856   0.13593      0.4193   0.1239
non-linearity+SMITS                 0.00613   0.12620      0.4407   0.1295
proximity+SMITS+proximity           0         0.0994       0.4425   0.1092
remoteness+SMITS+remoteness         0.04340   0.13764      0.4136   0.1142
non-linearity+SMITS+non-linearity   0.00818   0.12083      0.5711   0.1495
K-medoids                           0         0.0994       0.4425   0.1092
SMOTE                               0.0434    0.1376       0.4136   0.1142

3.4.3 Three-Dimensional Benchmark Test Functions
Fig. 3. Salustowicz1d training set and test set

Fig. 4. Salustowicz1d regression results

Salustowicz2d is a hybrid test function in three-dimensional space, and its training set is also generated according to a Gaussian distribution, dense in the middle and sparse at the edges. Kotanchek2d's training set is also an imbalanced data set. The data preprocessing results and convergence curves of the two problems are shown in Fig. 5 and Fig. 6. The data preprocessing effects of the various methods are similar to the processing results on the two-dimensional functions. For the complex Salustowicz2d function, according to the results in Table 2 and Fig. 6(a), the non-linearity weight is not suitable. Meanwhile, the compressed subset obtained by using the proximity weight for compression has the smallest error value. For the Kotanchek2d function, the minimum error values reached by the various data preprocessing methods are similar, but compression with the proximity weight followed by re-weighting is the method that reaches the minimum value. For the three-dimensional functions, the experimental samples are still insufficient, but from the experimental results of the two existing standard test functions, the proximity weight performs better than the other weighting techniques, and the compression method is able to improve the performance.
Fig. 5. Salustowicz2d training set and test set
Fig. 6. Three-dimensional standard test function regression results
4 Conclusions
This paper conducted an experimental comparison and analysis of preprocessing techniques for imbalanced data in symbolic regression. The experiments show that all three data weighting techniques can reduce the effect of the dense part and increase the effect of the sparse part of a data set on the symbolic regression. Besides, the SMITS compression technique also helps to balance the data set by pruning some dense data points. Moreover, the k-medoids clustering undersampling method can extract data points in a balanced manner, and the SMOTE oversampling is more effective at filling the gaps in the function distribution to link the sparse and dense data points together. For the two-dimensional test functions, when the data set curve has certain characteristics, such as periodicity, the non-linearity weight is more suitable. On the contrary, the (nearly) monotonic functions prefer the remoteness weighting most, and then the proximity weighting. Generally, data compression helps to shorten the regression time. As for the two sampling techniques, the K-medoids clustering undersampling method has shown better effectiveness on one-third of the test functions and is relatively insensitive to the shape of the function images, whilst the SMOTE oversampling method has not shown good effectiveness in symbolic regression. For the three-dimensional test functions, the experimental results of the proximity weight are better than those of the other weighting techniques, which fits functions with complex distributions but few slope changes. In the large and imbalanced case, the compressed data set can achieve better minimum error values than the original unprocessed data. In conclusion, both data-weighted compression and undersampling methods are suitable for preprocessing imbalanced data sets for symbolic regression, but more experiments are needed to verify this in higher dimensions.
The compression rate in data compression and the number of clusters in K-medoids should also be considered, so as to obtain a more balanced data set and a better regression result.
References

1. Hughes, J.A., Brown, J.A., Khan, A.M., Khattak, A.M., Daley, M.: Analysis of symbolic models of biometric data and their use for action and user identification. In: 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), St. Louis, MO, pp. 1–8 (2018). https://doi.org/10.1109/CIBCB.2018.8404969
2. González-Campos, G., Torres-Treviño, L.M., Luévano-Hipólito, E., Martinez-de la Cruz, A.: Modeling synthesis processes of photocatalysts using symbolic regression. In: 2014 13th Mexican International Conference on Artificial Intelligence, Tuxtla Gutierrez, pp. 174–179 (2014). https://doi.org/10.1109/MICAI.2014.33
3. Hughes, J.A., Houghten, S., Brown, J.A.: Descriptive symbolic models of gaits from Parkinson's disease patients. In: 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Siena, Italy, pp. 1–8 (2019). https://doi.org/10.1109/CIBCB.2019.8791459
4. Dubey, R., Zhou, J., Wang, Y.: Analysis of sampling techniques for imbalanced data: an n = 648 ADNI study. NeuroImage 87, 220–241 (2014). https://doi.org/10.1016/j.neuroimage.2013.10.005
5. Ali, A.: Classification with class imbalance problem: a review. Int. J. Adv. Soft Comput. Appl. 7(3) (2015). ISSN 2074-8523
6. Chawla, N.V.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
7. Liu, C.: Model-based synthetic sampling for imbalanced data. IEEE Trans. Knowl. Data Eng. 32(8), 1543–1556 (2020)
8. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the surprising behavior of distance metrics in high dimensional space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, pp. 420–434. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44503-X_27
9. Vladislavleva, E., Smits, G., Den Hertog, D.: On the importance of data balancing for symbolic regression. IEEE Trans. Evol. Comput. 14(2), 252–277 (2010)
10. Zhong, J., Ong, Y.S., Cai, W.: Self-learning gene expression programming. IEEE Trans. Evol. Comput. 20(1), 65–80 (2015)
A Feature Reduction-Induced Subspace Multiple Kernel Fuzzy Clustering Algorithm Yiming Tang1,2(B) , Bing Li1 , Zhifu Pan1 , Xiao Sun1 , and Renhao Chen1 1 Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine,
School of Computer and Information, Hefei University of Technology, Hefei 230601, Anhui, China [email protected] 2 Engineering Research Center of Safety Critical Industry Measure and Control Technology, Ministry of Education, Hefei University of Technology, Hefei 230601, Anhui, China
Abstract. High-dimensional data poses a great challenge to clustering, and subspace clustering algorithms have unique advantages when working with high-dimensional data. However, there are still some difficulties in adapting to nonlinear feature dimensions and in the selection and collaborative setting of kernel functions. Therefore, in this study, we propose a feature reduction-induced subspace multiple kernel fuzzy clustering algorithm. Firstly, in order to address the information loss caused by the nonlinear characteristics of the data, the multiple kernel clustering method is introduced, and the nonlinear features of the data are collaboratively mapped to a high-dimensional linear space, so as to better capture the characteristic information of the data. Secondly, due to the complexity of subspace and multiple kernel learning, we introduce the idea of feature reduction and reduce the data dimension according to the importance of information, which reduces the complexity of the algorithm on the one hand and improves the role of important feature attributes in clustering on the other. Finally, the proposed RS-MKFC algorithm and 6 related algorithms are compared on 6 ordinary datasets and 4 high-dimensional datasets, and the proposed algorithm is found to be superior to the other 6 algorithms. At the same time, we verify the ability of the RS-MKFC algorithm to screen important features as well as the function of feature reduction, and good results are achieved. Keywords: Fuzzy clustering · multiple kernel collaboration · subspace clustering · clustering analysis
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 271–285, 2023. https://doi.org/10.1007/978-981-99-2385-4_20

1 Introduction

Clustering is an unsupervised machine learning method [1–4] that can help us better grasp the laws in the data. Its main purpose is to find a reasonable data clustering in a given set of data samples without prior information, so that the similarity between the data samples within a cluster is high and the similarity between clusters is low. Early clustering methods were almost always hard clustering, such as the K-means algorithm [5, 6] or the Density Peak Clustering (DPC) algorithm, which is also a highly recognized hard clustering algorithm [7]. Hard clustering is characterized by either-or: a data object can only belong strictly to one cluster. However, in real life, the classification of most objects is not strictly like this, but is ambiguous in terms of categories. Such situations are more suitable for soft division, so fuzzy clustering came into being [8]. The FCM algorithm is one of the most classic and most used fuzzy clustering algorithms [9], but it suffers from sensitivity to the initialization of cluster centers and to noise points. Krishnapuram and Keller proposed the Possibilistic C-Means (PCM) algorithm [10]. It relaxes the FCM restriction that the memberships of each datum across all classes must sum to 1, thereby reducing the influence of noise points on the clustering process. However, the PCM algorithm has problems such as coincident cluster centers and dependence on the FCM initialization parameters. Later, Pal et al. [11] proposed the Possibilistic Fuzzy C-Means clustering algorithm (PFCM), which takes into account both fuzzy membership and possibilistic membership. Another important line of development is the use of kernel functions. Kernel functions map data from low-dimensional spaces to high-dimensional spaces and cluster in the high-dimensional spaces, the most classic example being the KFCM algorithm [12]. However, in a kernel-based fuzzy clustering algorithm, an important step is to choose an appropriate kernel function or combination of kernel functions, which relies on artificial prior knowledge. Therefore, Huang et al. [13] proposed the Multiple Kernel Fuzzy Clustering algorithm (MKFC), which adaptively adjusts the effect of each kernel function by introducing kernel-function weight values, assigning higher weights to better-performing kernel functions and thereby improving the clustering effect.
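The multiple-kernel idea behind MKFC, a weighted combination of base kernels whose weights are adapted during clustering, can be sketched as follows (the kernel choices and weights here are illustrative assumptions, not the settings of [13]):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x, y, degree=2):
    return (np.dot(x, y) + 1.0) ** degree

def combined_kernel(x, y, weights):
    """Weighted combination of base kernels: instead of committing to one
    kernel, let the weights (learned during clustering in MKFC) decide how
    much each base kernel contributes."""
    kernels = [gaussian_kernel, polynomial_kernel]
    return sum(w * k(x, y) for w, k in zip(weights, kernels))

x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
# equal weights here; MKFC would adapt them so better kernels weigh more
print(combined_kernel(x, y, weights=[0.5, 0.5]))
```

A kernel that separates the clusters well would then receive a larger weight at each iteration, which is the adaptive behaviour the text describes.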
In general, all feature dimensions of the data are considered to have the same importance in a clustering algorithm. However, actual data in various application areas tend to be increasingly high-dimensional [14], which leads to a gradual convergence of distance-based similarities between any two data points, and redundant and unrelated attributes have a great impact on clustering results. To this end, subspace-based clustering methods were proposed: by assigning weights to different feature attributes, a subspace clustering algorithm divides the data samples into clusters and searches for the subspace where each cluster is located [15, 16]. Deng et al. [17] used within-cluster and between-cluster information to propose a new fuzzy clustering method, Enhanced Soft Subspace Clustering (ESSC). Yang and Nataliani proposed a feature-reduction FCM algorithm (FRFCM) [18] to improve the effect of key features in clustering. Subspace clustering algorithms are more prominent when working with high-dimensional data, and the quality of the clustering results is higher. Compared with hard subspace clustering, soft subspace clustering reflects whether attributes and clusters are related on the one hand, and on the other hand clarifies the degree of correlation and difference between them, giving better adaptability and flexibility [19]. However, there are still some problems when working with high-dimensional data using subspaces. 1) The problem of adaptability of nonlinear feature dimensions. Some data are not fully suited to linear feature spaces and may be represented better in nonlinear feature spaces.
2) The problem of kernel function selection and collaborative setting. Using kernel functions to map the data features can solve the problem of nonlinear features, but it is difficult to select an appropriate kernel function and set appropriate parameters.

To solve the above problems, in this paper we propose a feature reduction-induced subspace multiple kernel fuzzy clustering (RS-MKFC) algorithm, which combines multiple kernel learning and feature reduction to improve the adaptability of the algorithm. Firstly, the high-dimensional data clustering problem is handled using subspaces, dividing each data point into its respective subspace. Then, to address the information loss caused by the nonlinear features of the data, we introduce a multiple kernel learning method to map the nonlinear features of the data to a high-dimensional linear space to better display the feature information of the data. Secondly, due to the complexity of subspace and multiple kernel learning, we use the idea of feature reduction to reduce the data dimensions according to the importance of information, and improve the role of important feature attributes in clustering. Finally, the proposed algorithm is compared with the relevant algorithms, which shows that it has certain advantages.
2 Feature Reduction-Induced Subspace Multiple Kernel Fuzzy Clustering

2.1 The Idea of the Proposed Algorithm

A new objective function is proposed as follows:

$$J_{\mathrm{RS\text{-}MKFC}} = \sum_{k=1}^{c}\sum_{i=1}^{n} u_{ki}^{m} \sum_{l=1}^{L} \delta_l w_{kl}\,(\phi(x_{il}) - v_{kl})^{T}(\phi(x_{il}) - v_{kl}) + \gamma \sum_{k=1}^{c}\sum_{t=1}^{p}\left(h_{kt}^{2}\log h_{kt}^{2} - h_{kt}^{2}\right) + \eta \sum_{k=1}^{c}\sum_{l=1}^{L} w_{kl}\log \delta_l w_{kl}. \quad (1)$$

The following constraints need to be met:

$$\sum_{k=1}^{c} u_{ki} = 1,\; u_{ki}\in[0,1],\; m>1, \quad (2)$$

$$\sum_{l=1}^{L} w_{kl} = 1,\; 0 \le w_{kl} \le 1,\; k=1,2,\ldots,c, \quad (3)$$

$$\sum_{t=1}^{p} h_{kt}^{2} = 1,\; k=1,2,\ldots,c. \quad (4)$$
Here the parameter m is the fuzzy coefficient, and w_kl represents the weight of the l-th feature of class k. h_kt denotes the weight of the t-th kernel function in the subspace of class k. γ and η are positive regularization parameters, and the parameter τ takes the integer 1 or 2, though for some special data sets τ needs to be adjusted. δ_l measures the data distribution density of the l-th feature; it is used to control the feature weights and help the objective function reduce features. In addition, log x = log_e x. φ is the multiple kernel mapping, a collaboratively weighted combination of multiple specific kernel functions: φ(x_il) = Σ_{t=1}^p h_kt K_t(x_il), where K_t is the specific t-th kernel function. Then we can get φ(x_il)^T φ(x_il) = Σ_{t=1}^p h²_kt K_t(x_il)^T K_t(x_il).

The proposed RS-MKFC algorithm uses three parts to construct the objective function (1). The first part adds feature weights to the objective function of the multiple kernel version of the FCM algorithm (i.e., MKFC). The second part is the kernel-function weight penalty term, which helps calculate the kernel-function weight values. The third part is a maximum-entropy regularization term, which helps the objective adaptively iterate the feature weights and assign higher weights to important attributes, thereby improving the clustering effect.

The setting of the δ_l parameter is a core of the algorithm. In probability theory and statistics, standard deviation and variance are used to measure the dispersion of data. According to [7], the smaller the dispersion, the closer the data set is to the cluster center; the greater the dispersion, the farther it is from the cluster center. We need to keep the data features with small dispersion and discard those with large dispersion. The indices VMR = σ²/μ [21] and MVR = μ/σ² [22] can address this, but they may not work well when the mean is 0 or for some extreme data sets.
Here, we introduce the interquartile range of the data and propose a new parameter, IMVR, to measure feature dispersion, which is used to better control the iteration of the data feature weights. The parameter IMVR is set as follows:

$$(\delta_l)_{\mathrm{IMVR}} = \left( \mathrm{IQR}(x) \cdot \frac{\mathrm{mean}(x)}{\mathrm{var}(x)} \right)_{l}. \quad (5)$$

Here, the value of IQR is the data value at the 3/4 position minus the data value at the 1/4 position after the feature data are arranged in ascending order. The interquartile range avoids misjudging the data dispersion due to abnormally large or small values in the data, while also taking into account the global distribution characteristics of the data.

2.2 The Generation of Iteration Formulas

According to (1), the Lagrange multiplier method is used to obtain a new objective function:

$$J = J_{\mathrm{RS\text{-}MKFC}} + \lambda_1\left(1 - \sum_{k=1}^{c} u_{ki}\right) + \lambda_2\left(1 - \sum_{t=1}^{p} h_{kt}^{2}\right) + \lambda_3\left(1 - \sum_{l=1}^{L} w_{kl}\right). \quad (6)$$
Moreover, the necessary conditions for the minimum of Eq. (1) are as follows (k = 1, ..., c, i = 1, ..., n, l = 1, ..., L):

$$\frac{\partial J}{\partial u_{ki}} = 0,\quad \frac{\partial J}{\partial v_{kl}} = 0,\quad \frac{\partial J}{\partial w_{kl}} = 0,\quad \frac{\partial J}{\partial h_{kt}^{2}} = 0. \quad (7)$$
Firstly, from (7), the implicit expression of v_kl can be obtained from

$$\frac{\partial J}{\partial v_{kl}} = -2\sum_{i=1}^{n} u_{ki}^{m}\,\delta_l w_{kl}\,(\phi(x_{il}) - v_{kl}), \quad (8)$$

$$v_{kl} = \frac{\sum_{i=1}^{n} u_{ki}^{m}\,\phi(x_{il})}{\sum_{i=1}^{n} u_{ki}^{m}}. \quad (9)$$

After introducing the multiple kernel mapping and substituting (9), it can be obtained that

$$d_{kil}^{2} = (\phi(x_{il}) - v_{kl})^{T}(\phi(x_{il}) - v_{kl}) = \phi(x_{il})^{T}\phi(x_{il}) - 2 v_{kl}^{T}\phi(x_{il}) + v_{kl}^{T} v_{kl} = \sum_{t=1}^{p} h_{kt}^{2}\,\alpha_{kil}. \quad (10)$$

Here we let (k = 1, ..., c, i = 1, ..., n, l = 1, ..., L)

(11)

Then, the distance between data x_i and cluster center v_k can be expressed as:

$$d_{ki} = \sum_{l=1}^{L} w_{kl}\, d_{kil}^{2}. \quad (12)$$
At the same time, (1) can be expressed as:

$$J_{\mathrm{RS\text{-}MKFC}} = \sum_{k=1}^{c}\sum_{i=1}^{n} u_{ki}^{m}\sum_{l=1}^{L}\delta_l w_{kl}\, d_{kil}^{2} + \gamma\sum_{k=1}^{c}\sum_{t=1}^{p}\left(h_{kt}^{2}\log h_{kt}^{2} - h_{kt}^{2}\right) + \frac{n}{\tau c}\sum_{k=1}^{c}\sum_{l=1}^{L} w_{kl}\log\delta_l w_{kl} = \sum_{k=1}^{c}\sum_{i=1}^{n}\sum_{t=1}^{p} h_{kt}^{2}\, u_{ki}^{m}\,\varsigma_{kit} + \gamma\sum_{k=1}^{c}\sum_{t=1}^{p}\left(h_{kt}^{2}\log h_{kt}^{2} - h_{kt}^{2}\right) + \frac{n}{\tau c}\sum_{k=1}^{c}\sum_{l=1}^{L} w_{kl}\log\delta_l w_{kl}. \quad (13)$$

Thereinto, we let

$$\beta_{kt} = \sum_{i=1}^{n} u_{ki}^{m}\,\varsigma_{kit},\qquad \varsigma_{kit} = \sum_{l=1}^{L} w_{kl}\,\alpha_{kil}. \quad (14)$$
Secondly, from (7) we can get:

$$\beta_{kt} + \gamma\log h_{kt}^{2} - \lambda_2 = 0, \quad (15)$$

$$h_{kt}^{2} = \exp\!\left(\frac{\lambda_2 - \beta_{kt}}{\gamma}\right) = \exp\!\left(\lambda_2\gamma^{-1}\right)\exp\!\left(-\beta_{kt}\gamma^{-1}\right). \quad (16)$$

Due to (4), we can obtain:

$$\exp\!\left(\lambda_2\gamma^{-1}\right) = \frac{1}{\sum_{t'=1}^{p}\exp\!\left(-\beta_{kt'}\gamma^{-1}\right)}. \quad (17)$$

Combining (16) and (17), we can obtain the following iterative formula for h_kt²:

$$h_{kt}^{2} = \frac{\exp\!\left(-\beta_{kt}\gamma^{-1}\right)}{\sum_{t'=1}^{p}\exp\!\left(-\beta_{kt'}\gamma^{-1}\right)}. \quad (18)$$
Furthermore, from (7) we can get:

$$\sum_{i=1}^{n} u_{ki}^{m}\,\delta_l\, d_{kil}^{2} + \eta\left(\log\delta_l w_{kl} + 1\right) - \lambda_3 = 0. \quad (19)$$

Let $\vartheta_{kl} = \sum_{i=1}^{n} u_{ki}^{m}\, d_{kil}^{2}$, and it can be seen from the above equation that

$$w_{kl} = \exp\!\left(\frac{\lambda_3}{\eta}\right)\frac{1}{\delta_l}\exp\!\left(\frac{-\delta_l\vartheta_{kl}-\eta}{\eta}\right). \quad (20)$$

Due to (3), it is obtained from (20) that

$$\exp\!\left(\frac{\lambda_3}{\eta}\right) = \frac{1}{\sum_{l=1}^{L}\frac{1}{\delta_l}\exp\!\left(\frac{-\delta_l\vartheta_{kl}-\eta}{\eta}\right)}. \quad (21)$$

We substitute (21) into (20) to get the iterative formula for w_kl as follows:

$$w_{kl} = \frac{\frac{1}{\delta_l}\exp\!\left(\frac{-\delta_l\vartheta_{kl}}{\eta}\right)}{\sum_{l'=1}^{L}\frac{1}{\delta_{l'}}\exp\!\left(\frac{-\delta_{l'}\vartheta_{kl'}}{\eta}\right)}. \quad (22)$$

At the same time, we consider feature reduction and feature restrictions, so the feature-weight update after reduction is:

$$w_{kl}^{(\mathrm{new})} = \frac{w_{kl}}{\sum_{l=1}^{L} w_{kl}}. \quad (23)$$
Finally, from ∂J/∂u_ki = 0 we can get:

$$u_{ki} = \left(\frac{\lambda_1}{m\sum_{l=1}^{L}\delta_l w_{kl}\, d_{kil}^{2}}\right)^{\frac{1}{m-1}}. \quad (24)$$

Due to (2), it is obtained:

$$u_{ki} = \frac{\left(\sum_{l=1}^{L}\delta_l w_{kl}\, d_{kil}^{2}\right)^{-\frac{1}{m-1}}}{\sum_{k'=1}^{c}\left(\sum_{l=1}^{L}\delta_{l} w_{k'l}\, d_{k'il}^{2}\right)^{-\frac{1}{m-1}}}. \quad (25)$$
2.3 Algorithm Framework

So far, the overall idea of the RS-MKFC algorithm has been introduced. The specific execution process is shown in Algorithm 1. When the feature weight of a certain dimension j of the data satisfies $w_j^{(t)} < 1/\sqrt{cnL}$, the feature is reduced.
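A minimal sketch of one update round, using the softmax-type formulas (22) and (25) and the reduction threshold above on synthetic quantities (the distance tensor and tightness values are random stand-ins, not quantities computed from real data):

```python
import numpy as np

# Toy sketch (not the authors' code) of one RS-MKFC-style update round.
rng = np.random.default_rng(1)
c, n, L = 3, 50, 4                 # clusters, samples, features
m, eta = 2.0, 1.0                  # fuzzifier and regularization parameter

d2 = rng.random((c, n, L)) + 0.1   # d^2_{kil}: per-feature kernel distances
delta = rng.random(L) + 0.5        # feature tightness parameters (Eq. (5))

u = rng.random((c, n))
u /= u.sum(axis=0)                 # initial memberships, columns sum to 1

# feature-weight update w_{kl} (Eq. (22)): softmax-like over features
theta = ((u ** m)[:, :, None] * d2).sum(axis=1)          # theta_{kl}
w = (1.0 / delta) * np.exp(-delta * theta / eta)
w /= w.sum(axis=1, keepdims=True)

# feature reduction: drop a feature once its weight falls below 1/sqrt(cnL)
keep = ~(w < 1.0 / np.sqrt(c * n * L)).all(axis=0)

# membership update u_{ki} (Eq. (25))
dist = np.einsum('kl,knl->kn', delta * w, d2)            # sum_l delta_l w_kl d^2_kil
u = dist ** (-1.0 / (m - 1.0))
u /= u.sum(axis=0)

print(w.round(3), keep, u[:, 0])
```

The kernel-weight update (18) follows the same softmax pattern over the p kernels; it is omitted here for brevity.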
3 Experiments

To verify the clustering ability of the proposed RS-MKFC algorithm, it is compared against other algorithms. The test environment was an Intel(R) Core(TM) i5-8400 CPU @ 3.00 GHz with 16.00 GB RAM and Windows 10; the programming software is MATLAB 2018. First, we illustrate the applicability of feature reduction on the Iris dataset from UCI, and then compare with the six related fuzzy algorithms FCM, PFCM, MKKM, MKFC [13], EWFCM [20], and FRFCM [18].
The evaluation indicators include hard clustering validity indices and a soft clustering validity index. The hard validity indices include the Accuracy Classification rate (ACC) and the Normalized Mutual Information (NMI) [23]. The soft clustering validity index adopts the Extension index of ARI (EARI) [24]. The description of the specific datasets is shown in Table 1. We take the real UCI dataset Iris as an example. The Iris flower dataset has 150 samples and four features, namely calyx length, calyx width, petal length, and petal width. First, Fig. 1 shows scatterplots of all possible combinations of any two features in the Iris dataset, with the same color representing the same class. In the two-dimensional plot composed of calyx length and calyx width, the data are mixed between different categories and cannot be well distinguished. The scatterplots combining a calyx feature with petal length or petal width help distinguish the categories somewhat. Among them, the two-dimensional scatterplot composed of petal length and petal width is the best, with clear intervals between categories, and the data of the same category are clearly clustered together.
Table 1. Experimental datasets.

Database name              Total number of samples   Number of features   Reference number of clusters
Iris                       150                       4                    3
Zoo                        101                       17                   7
Seeds                      210                       7                    3
Spect heart                267                       22                   2
Breast Cancer Wisconsin    569                       30                   2
Ecoli                      337                       7                    8
Fig. 1. Scatter plots of any two features for Iris.
From this, we can understand that the first and second features are relatively minor, so we want them to have smaller weights in order to weaken them during clustering. As shown in Table 2, we run the RS-MKFC algorithm on the Iris dataset and observe the weight of each feature at different iterations. Initially, we assign equal weights to each feature. In the first iteration, the calyx width attribute is reduced away; after the second iteration, only petal length and petal width retain weight, and the weights no longer change in later iterations. The features retained by RS-MKFC on Iris are exactly the two we expect to keep (i.e., petal length and petal width), so the effect is ideal.

Table 2. Different feature weights of Iris.

Iteration   Calyx length   Calyx width   Petal length   Petal width
initial     0.2500         0.2500        0.2500         0.2500
1           0.0392         –             0.8589         0.1379
2           –              –             0.8569         0.1431
3           –              –             0.8569         0.1431
4           –              –             0.8569         0.1431
In terms of the tightness parameters for measuring features, in Table 3 we compare the proposed parameter IMVR with VMR [21] and MVR [22]. VMR cannot produce smaller values for the third and fourth features, and MVR works slightly better, but the difference between the values of different features is too large. The proposed IMVR parameter considers the global characteristics of the data features, avoids the influence of a zero mean or extreme values, and obtains a more balanced value to measure the compactness of the features, which has a certain rationality.

Table 3. Tightness parameters for different features of Iris.

                 Calyx length   Calyx width   Petal length   Petal width
Average value    5.8433         3.0540        3.7587         1.1987
Variance value   0.6857         0.1880        3.1132         0.5824
VMR              0.1173         0.0616        0.8283         0.4859
MVR              8.5218         16.2443       1.2073         2.0581
IMVR             11.0783        8.1222        4.3464         3.0871
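The IMVR parameter of Eq. (5) can be computed as below (a sketch; the exact quartile interpolation convention used by the authors is an assumption, so small deviations from Table 3 are possible):

```python
import numpy as np

def imvr(x):
    """IMVR tightness of one feature: interquartile range times the
    mean-to-variance ratio MVR.  Uses sample variance (ddof=1), which
    matches the variance values reported in Table 3 for Iris."""
    x = np.asarray(x, dtype=float)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    return iqr * x.mean() / x.var(ddof=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# IQR = 4 - 2 = 2, mean = 3, sample var = 2.5  ->  IMVR = 2 * 3 / 2.5 = 2.4
print(imvr(x))
```

Note that multiplying MVR by the IQR is what keeps the tightness values of the four Iris features on a comparable scale in Table 3.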
Next, the performance of the proposed RS-MKFC algorithm is verified on 6 UCI datasets [25] and 4 high-dimensional datasets, and compared with the six algorithms FCM, PFCM, MKKM, MKFC, EWFCM, and FRFCM. The first six datasets are used to compare the performance of the proposed algorithm on ordinary datasets. The four high-dimensional datasets are used for comparison with FRFCM, which also performs feature reduction. Tables 4, 5, 6, 7, 8 and 9 show the details. As shown in these tables, for the classic Iris dataset, it can be seen that the indicators of the RS-MKFC algorithm perform best and achieve a relatively
Table 4. Clustering results on Iris. FCM
PFCM
MKKM
MKFC
EWFCM
FRFCM
RS-MKFC
ACC
0.807
0.840
0.873
0.893
0.893
0.913
0.940
NMI
0.665
0.698
0.742
0.730
0.743
0.771
0.795
EARI
0.696
0.734
–
0.793
0.812
0.824
0.848
Table 5. Clustering results on Zoo.

| | FCM | PFCM | MKKM | MKFC | EWFCM | FRFCM | RS-MKFC |
|---|---|---|---|---|---|---|---|
| ACC | 0.584 | 0.644 | 0.7327 | 0.753 | 0.663 | 0.683 | 0.773 |
| NMI | 0.626 | 0.636 | 0.7049 | 0.720 | 0.677 | 0.690 | 0.734 |
| EARI | 0.642 | 0.648 | – | 0.747 | 0.712 | 0.706 | 0.758 |
Table 6. Clustering results on Seeds.

| | FCM | PFCM | MKKM | MKFC | EWFCM | FRFCM | RS-MKFC |
|---|---|---|---|---|---|---|---|
| ACC | 0.776 | 0.791 | 0.833 | 0.838 | 0.852 | 0.886 | 0.895 |
| NMI | 0.542 | 0.572 | 0.600 | 0.609 | 0.633 | 0.672 | 0.696 |
| EARI | 0.755 | 0.776 | – | 0.790 | 0.804 | 0.827 | 0.832 |
Table 7. Clustering results on Spect.

| | FCM | PFCM | MKKM | MKFC | EWFCM | FRFCM | RS-MKFC |
|---|---|---|---|---|---|---|---|
| ACC | 0.566 | 0.674 | 0.727 | 0.730 | 0.794 | 0.795 | 0.839 |
| NMI | 0.113 | 0.138 | 0.139 | 0.148 | 0.173 | 0.186 | 0.222 |
| EARI | 0.264 | 0.339 | – | 0.685 | 0.708 | 0.717 | 0.724 |
Table 8. Clustering results on Cancer.

| | FCM | PFCM | MKKM | MKFC | EWFCM | FRFCM | RS-MKFC |
|---|---|---|---|---|---|---|---|
| ACC | 0.785 | 0.787 | 0.859 | 0.872 | 0.894 | 0.916 | 0.928 |
| NMI | 0.290 | 0.307 | 0.485 | 0.492 | 0.532 | 0.573 | 0.640 |
| EARI | 0.789 | 0.811 | – | 0.826 | 0.857 | 0.858 | 0.861 |
Y. Tang et al.

Table 9. Clustering results on Ecoli.

| | FCM | PFCM | MKKM | MKFC | EWFCM | FRFCM | RS-MKFC |
|---|---|---|---|---|---|---|---|
| ACC | 0.446 | 0.548 | 0.580 | 0.595 | 0.643 | 0.673 | 0.717 |
| NMI | 0.386 | 0.414 | 0.416 | 0.431 | 0.432 | 0.487 | 0.518 |
| EARI | 0.499 | 0.533 | – | 0.545 | 0.561 | 0.588 | 0.606 |
optimal clustering effect. On datasets with higher feature dimensions, such as Breast Cancer Wisconsin and SPECT Heart, the accuracy rates are 0.928 and 0.839, a great improvement over the other algorithms. The proposed RS-MKFC algorithm achieves good results even on the Zoo and Ecoli datasets, which have many cluster categories and feature dimensions. This is because the algorithm maps nonlinear features to high-dimensional spaces through a multiple kernel approach while screening important features. Compared with single subspace and single multiple kernel fuzzy methods, the proposed algorithm combines the advantages of the two approaches well.

For the two feature reduction algorithms, FRFCM and RS-MKFC, Table 10 shows the results on four high-dimensional datasets. Both algorithms can filter features to strengthen the role of important feature attributes in clustering. Owing to multiple kernel learning, the RS-MKFC algorithm reduces the data features more effectively than the FRFCM algorithm, and its ACC values are better. Because the RS-MKFC algorithm refines the data features, it is easier to filter out unimportant features. At the same time, as Fig. 2 shows, filtering out unimportant features greatly reduces the running time of the overall algorithm.

Table 10. Clustering results on high-dimensional datasets.

| Dataset | Amount of data | Number of features | Features retained (FRFCM) | Features retained (RS-MKFC) | ACC (FRFCM) | ACC (RS-MKFC) |
|---|---|---|---|---|---|---|
| arrhythmia | 452 | 278 | 115 | 79 | 0.7345 | 0.7832 |
| libras | 360 | 90 | 12 | 15 | 0.4139 | 0.4336 |
| semeion | 1592 | 265 | 14 | 8 | 0.7850 | 0.8945 |
| musk | 476 | 166 | 10 | 6 | 0.6608 | 0.7681 |
We also compare the clustering accuracy of the proposed RS-MKFC algorithm with the traditional FCM algorithm on the high-dimensional datasets, as shown in Fig. 3. The proposed RS-MKFC algorithm works much better than FCM. This is due to the fact
that traditional algorithms assign equal weights to every feature in clustering, while high-dimensional data is sparse and many attribute values are 0, which greatly reduces the effectiveness of algorithms based on Euclidean distance.
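The effect described here can be illustrated with a toy weighted Euclidean distance: down-weighting uninformative, mostly-zero features (as subspace weighting does) suppresses their noise contribution. The points and weights below are purely illustrative, not taken from the paper:

```python
# Weighted Euclidean distance: w[i] scales the contribution of feature i.
def weighted_dist(x, y, w):
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)) ** 0.5

a = [1.0, 0, 0, 0, 0, 0]      # differs from b only in the first feature
b = [5.0, 0, 0, 0, 0, 0]
c = [1.0, 0.9, 0.8, 0, 0, 0]  # same first feature as a, plus noise elsewhere

equal = [1.0] * 6                            # FCM-style equal weights
subspace = [1.0, 0.1, 0.1, 0.1, 0.1, 0.1]    # illustrative learned weights

# With subspace-style weights the noisy features contribute far less.
print(weighted_dist(a, c, equal) > weighted_dist(a, c, subspace))  # True
```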
Fig. 2. The running time of each iteration on the musk Data set.
Fig. 3. Comparison of classification rates on high dimensional datasets.
The above experimental results show that the proposed RS-MKFC algorithm performs better than FRFCM, EWFCM, MKFC, MKKM, PFCM, FCM and other algorithms. At the same time, in the processing of high-dimensional datasets, the RS-MKFC algorithm can filter out important features, reduce the complexity of the algorithm, and improve the efficiency and accuracy of clustering.
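For reference, the ACC and NMI indicators used throughout Tables 4–10 follow their standard definitions; a self-contained sketch (brute-force cluster-label matching, adequate for the small class counts here; the fuzzy EARI of [24] is not reproduced):

```python
import itertools
import math
from collections import Counter

def clustering_acc(y_true, y_pred):
    """Best-match accuracy: relabel predicted clusters by the permutation
    of true labels that maximizes agreement (brute force over permutations)."""
    labels = sorted(set(y_pred))
    best = 0
    for perm in itertools.permutations(sorted(set(y_true))):
        mapping = dict(zip(labels, perm))
        best = max(best, sum(mapping[p] == t for p, t in zip(y_pred, y_true)))
    return best / len(y_true)

def nmi(y_true, y_pred):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    pt, pp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * math.log((c / n) / ((pt[t] / n) * (pp[p] / n)))
             for (t, p), c in joint.items())
    h = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    return mi / ((h(pt) + h(pp)) / 2) if h(pt) and h(pp) else 0.0

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 0, 0, 2, 2, 2]
print(round(clustering_acc(y_true, y_pred), 3))  # 0.889 after best relabeling
```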
4 Summary and Outlook

In this paper, we propose a feature reduction-induced subspace multiple kernel fuzzy clustering algorithm. Firstly, the RS-MKFC algorithm introduces the idea of subspace clustering, which assigns different weights according to the importance of the information in different dimensions of the data. Secondly, a feature reduction method is adopted to avoid the huge computational complexity of dealing with high-dimensional datasets. Thirdly, we propose a parameter IMVR to measure the degree of dispersion of data features, which helps the algorithm reduce the data features and improve
the influence of key features in the algorithm. Finally, we compare the proposed RS-MKFC algorithm with six related algorithms on UCI datasets. The proposed algorithm outperforms the FCM, PFCM, MKKM, MKFC, EWFCM, and FRFCM algorithms. In addition, we analyzed the ability of the RS-MKFC algorithm to screen important features and the effect of feature reduction, both of which achieved good results. In the future, logical reasoning [26, 27] and granular computing [28, 29], which have important research value in the field of artificial intelligence, will be considered in combination with multiple kernel fuzzy clustering to form a new model of reasoning-driven clustering, which is expected to bring new vitality and exploration directions to the field.

Acknowledgment. This work has been supported by the National Natural Science Foundation of China (Nos. 62176083, 62176084, 61877016, and 61976078), the Key Research and Development Program of Anhui Province (No. 202004d07020004), the Natural Science Foundation of Anhui Province (No. 2108085MF203), and the Fundamental Research Funds for the Central Universities of China (No. PA2021GDSK0092).
References

1. Tang, Y.M., Ren, F.J., Pedrycz, W.: Fuzzy c-means clustering through SSIM and patch for image segmentation. Appl. Soft Comput. 87, Art. no. 105928, 1–16 (2020)
2. Tang, Y.M., Hu, X.H., Pedrycz, W., Song, X.C.: Possibilistic fuzzy clustering with high-density viewpoint. Neurocomputing 329, 407–423 (2019)
3. Tang, Y.M., Li, L., Liu, X.P.: State-of-the-art development of complex systems and their simulation methods. Compl. Syst. Model. Simul. 1(4), 271–290 (2021)
4. Tang, Y.M., Pan, Z.F., Pedrycz, W., Ren, F.J., Song, X.C.: Viewpoint-based kernel fuzzy clustering with weight information granules. IEEE Trans. Emerg. Top. Comput. Intell. (2022). https://doi.org/10.1109/TETCI.2022.3201620
5. Bai, L., Liang, J., Cao, F.: A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters. Inform. Fus. 61, 36–47 (2020)
6. Huang, J.Z., Ng, M.K., Rong, H.: Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 657–668 (2005)
7. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
8. Yang, M.S.: A survey of fuzzy clustering. Math. Comput. Model. 18(11), 1–16 (1993)
9. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. 3(3), 32–57 (1973)
10. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy Syst. 1(2), 98–110 (1993)
11. Pal, N.R., Pal, K., Keller, J.M., et al.: A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 13(4), 517–530 (2005)
12. Zhang, D.Q., Chen, S.C.: Clustering incomplete data using kernel-based fuzzy c-means algorithm. Neural Process. Lett. 18(3), 155–162 (2003)
13. Huang, H.C., Chuang, Y.Y., Chen, C.S.: Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst. 20(1), 120–134 (2012)
14. Gan, G., Wu, J., Yang, Z.: A fuzzy subspace algorithm for clustering high dimensional data. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 271–278. Springer, Heidelberg (2006). https://doi.org/10.1007/11811305_30
15. Li, J., Liu, H., Tao, Z., et al.: Learnable subspace clustering. IEEE Trans. Neural Netw. Learn. Syst. 33(3), 1119–1133 (2022)
16. Lu, C., Feng, J., Lin, Z., et al.: Subspace clustering by block diagonal representation. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 487–501 (2018)
17. Deng, Z., Choi, K.S., Chung, F.L., et al.: Enhanced soft subspace clustering integrating within-cluster and between-cluster information. Pattern Recogn. 43(3), 767–781 (2010)
18. Yang, M.S., Nataliani, Y.: A feature-reduction fuzzy clustering algorithm based on feature-weighted entropy. IEEE Trans. Fuzzy Syst. 26(2), 817–835 (2017)
19. Xu, P., Deng, Z., Cui, C., et al.: Concise fuzzy system modeling integrating soft subspace clustering and sparse learning. IEEE Trans. Fuzzy Syst. 27(11), 2176–2189 (2019)
20. Zhou, J., Chen, L., Chen, C.L.P., et al.: Fuzzy clustering with the entropy of attribute weights. Neurocomputing 198, 125–134 (2016)
21. Cox, D.R., Lewis, P.A.W.: The Statistical Analysis of Series of Events. Methuen & Co Ltd, London (1966)
22. Bai, Z., Wang, K., Wong, W.K.: The mean-variance ratio test - a complement to the coefficient of variation test and the Sharpe ratio test. Statist. Probab. Lett. 81(8), 1078–1085 (2011)
23. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(3), 583–617 (2002)
24. Campello, R.J.G.B.: A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recogn. Lett. 28(7), 833–841 (2007)
25. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Information and Computer Science, University of California, Irvine, CA, USA (2007). http://archive.ics.usi.edu/ml/Datasets.html
26. Tang, Y.M., Ren, F.J.: Fuzzy systems based on universal triple I method and their response functions. Int. J. Inf. Technol. Decis. Mak. 16(2), 443–471 (2017)
27. Tang, Y.M., Zhang, L., Bao, G.Q., et al.: Symmetric implicational algorithm derived from intuitionistic fuzzy entropy. Iranian J. Fuzzy Syst. 19(4), 27–44 (2022)
28. Tang, Y.M., Pedrycz, W.: Oscillation bound estimation of perturbations under Bandler-Kohout subproduct. IEEE Trans. Cybern. 52(7), 6269–6282 (2022)
29. Tang, Y.M., Pedrycz, W., Ren, F.J.: Granular symmetric implicational method. IEEE Trans. Emerg. Top. Comput. Intell. 6(3), 710–723 (2022)
A Deep Neural Network Based Resource Configuration Framework for Human-Machine Computing System

Zhuoli Ren, Zhiwen Yu(B), Hui Wang, Liang Wang, and Jiaqi Liu

Northwestern Polytechnical University, Xi'an 710072, China
[email protected]
Abstract. The collaborative computing between humans and machines has reached a new level in recent decades as a result of the increasing convergence of technologies and progress in different scientific fields. By combining the strengths of humans and machines, Human-Machine Computing (HMC) integrates the complex cognitive reasoning capabilities of humans and the high-performance computing capabilities of computer clusters to tackle complex tasks that are difficult to accomplish by machines alone. However, combining humans with intelligent machines to obtain more efficient scheduling and management is a nontrivial task. Considering the heterogeneity of human-machine resources, we propose a deep neural network-based resource configuration framework for the HMC system. In particular, we first describe the architecture of the HMC system and, on this basis, present the modeling details of the human-machine computing resources. Secondly, we analyze the optimization problem to be solved by the framework and propose a deep neural network-based scheduler for the resource configuration problem. Finally, the performance of our proposed framework is evaluated through simulation experiments, and the results show that our solution achieves remarkable effectiveness and can serve as a guideline for future research on HMC systems.

Keywords: Human-machine computing · Human-machine collaboration · Heterogeneous resource · Deep neural network
1 Introduction

With the increasing popularity and momentum of artificial intelligence, today's intelligent machines are able to learn from data and work like humans. Collaborative human-machine computing refers to the cooperation between humans and intelligent machines, in an organic organizational form guided by existing methods, to accomplish a defined target task and obtain maximum benefit. For example, in tasks such as medical diagnosis [2, 3] and criminal justice [13], machine processing results must be verified and modified by a human before output, which requires human expertise and feedback to be included in the computational system. Therefore, Human-Machine Computing (HMC) aims to investigate a new computing paradigm that uses human-machine interaction

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 286–297, 2023. https://doi.org/10.1007/978-981-99-2385-4_21
to effectively combine human and machine intelligence to handle pervasive computing tasks. Different from machine-supported computing systems (e.g., cloud computing, fog computing), we introduce the Human Processing Unit (HPU), a concept analogous to the CPU/XPU, as the human computing resource in the HMC system. Unlike machine computing resources, an HPU needs to be configured with the intermittency of human work in mind and formally represented by corresponding parameters. Given these hybrid human-machine resources, the key problems to be solved by the HMC system are how to unify the configuration and management of heterogeneous human-machine resources to meet the needs of diverse human-machine computing tasks, and how to constrain and model task scheduling and allocate the corresponding computing resources in the face of diverse computing goals, so as to obtain a reasonable solution under multiple conditions in a way that differs from existing machine-supported systems. The main contributions of our work are threefold:

• We present a novel framework for human-machine resource modeling and dynamic task management.
• We propose a scheduler that uses a gradient-based back-propagation strategy for scalable and fast scheduling, and show that it is superior to other advanced schedulers.
• Evaluation results show that our solution achieves remarkable effectiveness.
2 Related Work

In recent studies, the HMC system has received considerable attention from many scholars [1, 12]. For example, Wang et al. [4] used HMC to mine valuable information samples to tackle real challenges in object detection, while the authors in [5] designed an HMC system for unstructured data analytics. However, at the system level, most existing research related to task scheduling and resource configuration focuses on problems for a single platform or scenario: on the machine side, e.g., cloud computing [14], edge computing [15], robotics [16], and virtual containers [17]; on the human side, e.g., crowds [18], assembly-line workshops [19], and expert clouds [20]. For specific scenarios, different problems have their own scheduling models. For instance, Madni et al. [8] proposed a heuristic multi-objective resource scheduling algorithm based on cuckoo search for cloud computing resources, with completion time, cost, and resource utilization as the objective functions for optimal search. Similarly, Cheng et al. [9] presented a novel deep reinforcement learning-based scheduling method to minimize energy cost for large-scale CSPs with very large numbers of servers that receive enormous numbers of user requests per day. However, existing schedulers are unable to predict the resource requirements of different workloads, resulting in poor utilization of computing resources [10]. Considering this, the authors in [11] proposed a novel approach to task scheduling optimization based on dynamic dispatch queues and hybrid meta-heuristic algorithms, which achieves good performance in minimizing waiting time as well as maximizing resource utilization. For the scheduling and configuration of human resources, Han et al. [6] proposed a genetic evolutionary algorithm with heuristic decoding to solve the production scheduling problem based on sequential scheduling with multi-objective optimization considering
completion time as well as total delay. In a real-time crowdsourcing scenario, the authors in [7] developed a crowdsourcing system that solves the task allocation problem by efficiently identifying the most appropriate group of workers for each incoming task, thereby meeting the real-time needs of the application and returning high-quality results within budget constraints.
3 System Framework and Problem Description

3.1 Human-Machine Computing Framework
Fig. 1. Architecture of HMC system.
We consider a heterogeneous human-machine computing environment in which the system handles different computing tasks requiring human and machine resources; Fig. 1 shows the architecture of our HMC system. As shown in the hierarchical framework diagram, the system is divided into four layers, functionally and logically. The lowest layer is the infrastructure layer, which includes hybrid human and machine computing resources. As the middleware connecting resources and tasks, the provisioning layer is responsible for resource registration, management, and configuration. The scheduling layer completes task scheduling and processing according to the task queue, task objective, and scheduling constraints, and the top layer involves human-machine computing tasks, including human-machine task decomposition and task management. Our work mainly focuses on the scheduling layer: on the basis of task
and resource modeling, it considers the task structure, time, cost, and other factors during computation to provide computing resources for different types of human-machine tasks. A specific task can be decomposed into a set of subtasks performed in stages on distinct processing units. Due to the diversity of task characteristics, different types of subtasks have different requirements and constraints on human-machine computing resources.

3.2 Data Model

Computing Resource. Through registration and virtualization, the human computing resources in the provisioning layer, which we refer to as HPUs, can expose diverse computing capabilities. We assume that there is a fixed number of human computing units in the system and denote them as $H = \{h_0, h_1, \ldots, h_{|H|-1}\}$. Specifically, each HPU unit is described by a 3-tuple $\langle AlloRate_i, Cap\langle s_1, s_2, \ldots \rangle_i, Fatigue_i \rangle$, where $AlloRate_i$ represents the allocation rate in the virtualization state of $h_i$; we introduce the allocation-rate attribute to measure how busy an HPU is during computation. $Cap\langle s_1, s_2, \ldots \rangle_i$ is a collection that quantifies $h_i$'s ability for different types of task, and $Fatigue_i$ represents the fatigue state. Similarly, we denote the set of machine computing units in the system as $M = \{m_0, m_1, \ldots, m_{|M|-1}\}$, and each machine unit is described by a 4-tuple $\langle IPS, RAM, Disk, Bandwidth \rangle$, which describes its capabilities in high-performance computing, storage, and network communication. For hybrid human-machine computing resources, we use $u_H(h_i^t)$ or $u_M(m_i^t)$ to denote the collection of time-series utilization metrics. The collection of maximum capacities of the unit $h_i$ or $m_i$ is denoted as $c_H(h_i^t)$ or $c_M(m_i^t)$.

Dynamic Task Model. The task scheduling of the HMC system is organized as a time sequence. The $t$-th interval is denoted $I_t$ and starts at $s(I_t)$, so $s(I_0) = 0$ and $s(I_t) = s(I_{t-1}) + \Delta$ for $t > 0$, where $\Delta$ is the interval length.
At the end of interval $I_{t-1}$, $N_t$ new tasks are submitted by different users. A task is considered active for scheduling only if at least one of its sub-tasks is being executed in the system. If no sub-task of a task can be executed in the current interval $I_t$, it is added to a waiting queue $W_t$. We denote the active tasks in interval $I_t$ as $A_t = \{a_0^t, a_1^t, \ldots, a_{|A_t|-1}^t\}$, consisting of $|A_t|$ tasks, where $a_i^t$ represents an active task in $A_t$ at $I_t$. The scheduling layer of the HMC system allocates new tasks to suitable computing resources, and the set of tasks completed by the end of $I_t$ is denoted $L_t$.
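The HPU 3-tuple and machine 4-tuple of Sect. 3.2 can be sketched as plain records; the field names and example values below are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class HPU:
    alloc_rate: float  # AlloRate_i: busy degree during computation, in [0, 1]
    capability: dict = field(default_factory=dict)  # Cap<s1, s2, ...>_i per task type
    fatigue: float = 0.0  # Fatigue_i: current fatigue state

@dataclass
class MachineUnit:
    ips: float        # IPS: instructions per second
    ram: float        # RAM in GB
    disk: float       # disk in GB
    bandwidth: float  # bandwidth in Mbps

# Hypothetical units for illustration only.
h0 = HPU(alloc_rate=0.3, capability={"labeling": 0.9, "review": 0.7})
m0 = MachineUnit(ips=2.4e9, ram=32, disk=512, bandwidth=1000)
```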
3.3 Problem Description Usually human-machine tasks are diverse and are issued by users with different needs, so different tasks may require different amounts and attributes of human-machine resources during the scheduling process, this can be described by Eq. (1): ∀ait ∈ At , uH (ait ) > 0, uM (ait ) > 0
(1)
The scheduler in our system is denoted as S, only if the target computing unit can satisfy the current task requirements, the scheduling is considered to be executable. We use S˜ to denote one executable scheduling process, which should follow ∀ait , hti ∈ S˜ t , uH (ait ) + u(hti ) ≤ cH (hi )
(2)
∀ait , mti ∈ S˜ t , uM (ait ) + u(mti ) ≤ cM (mi )
(3)
The notation ait ∈ S˜ t represents that the task ait is scheduled in interval It .The dynamic update process of tasks in the system can be described as: ˜ t−1 + At−1 \Lt → At N˜ t + W
(4)
˜ t−1 + Nt \N˜ t → Wt Wt−1 \W
(5)
We use Ft to denote the QoS parameters of running tasks in the system, such as response time, the matching degree for task, consumption, etc. Further, we consider the problem of the scheduler is minimizing an objective score for the allocated task in It , which is presented as O(Ft ). In order to find the optimal scheduling decision, the scheduler need to minimize the O(Ft ) throughout the execution. Thus, the problem can be described as: minimize S
T
O(Ft ), subject to ∀t, Eqs.(2) − (6).
(6)
t=0
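The executability conditions of Eqs. (2) and (3) amount to a per-unit capacity check over a candidate assignment; a minimal sketch with illustrative names and numbers:

```python
# A scheduling decision is executable only if adding each task's demand to the
# target unit's current utilization stays within that unit's capacity.
def executable(assignments, util, cap):
    """assignments: list of (task_demand, unit_id);
    util/cap: current utilization and maximum capacity per unit."""
    load = dict(util)  # running utilization including already-placed tasks
    for demand, unit in assignments:
        load[unit] = load.get(unit, 0.0) + demand
        if load[unit] > cap[unit]:
            return False
    return True

util = {"h0": 0.6, "m0": 0.5}   # current utilization (illustrative)
cap = {"h0": 1.0, "m0": 1.0}    # capacities c_H / c_M (illustrative)
print(executable([(0.3, "h0"), (0.4, "m0")], util, cap))  # True
print(executable([(0.3, "h0"), (0.2, "h0")], util, cap))  # False: 0.6+0.3+0.2 > 1.0
```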
4 The DNN Based Scheduler

4.1 Objective Function

For the scheduler, we need to focus on how to provide the most appropriate resources for a particular task under constraints. To optimize the QoS parameters, we consider an objective function focusing on response time, matching degree, and consumption. Equation (7) defines the objective function $O(F_t)$ for interval $I_t$:

$$O(F_t) = \alpha \cdot \mathrm{ART}_t + \beta \cdot \mathrm{AMD}_t + \gamma \cdot \mathrm{AC}_t \qquad (7)$$

Average Response Time (ART) is defined for an interval $I_t$ as the average response time of all tasks completed ($L_t$) in the system, normalized by the maximum response time observed up to the current interval, as shown by Eq. (8):

$$\mathrm{ART}_t = \frac{\sum_{l_j^t \in L_t} \mathrm{ResponseTime}(l_j^t)}{|L_t| \cdot \max_{s \le t} \max_{l_j^s \in L_s} \mathrm{ResponseTime}(l_j^s)} \qquad (8)$$
Average Matching Degree (AMD) is defined to select the most appropriate resource for the current task within the same time sequence. The idea of clustering is used in matching, which requires certain attributes and constraints in the modeling. In the cluster-matching process, feature-attribute extraction and dimensionality reduction are first carried out to make the features consistent. After clustering, different resource categories and task categories are obtained, and the matching degree between them is calculated. For the objective function $O(F_t)$, the average matching degree is formulated by Eq. (9):

$$\mathrm{AMD}_t = \frac{\sum_{a_i^t \in A_t} \mathrm{MatchDegree}_{S_t}(a_i^t)}{|A_t| \cdot \max_{s \le t} \max_{a_i^s \in A_s} \mathrm{MatchDegree}(a_i^s)} \qquad (9)$$

Average Consumption (AC) is defined for any interval as the cost and energy of the system, which includes both HPU working cost and machine computing consumption:

$$\mathrm{AC}_t = \mathrm{AHC}_t + \mathrm{AMC}_t \qquad (10)$$

where $\mathrm{AHC}_t$ and $\mathrm{AMC}_t$ are the average HPU cost and average machine consumption for interval $I_t$, as shown below:

$$\mathrm{AHC}_t = \frac{\sum_{h_i \in H} I_t \cdot \mathrm{cost}_{h_i}}{|A_t| \cdot \sum_{h_i \in H} \mathrm{cost}_{h_i}^{\max} \cdot (t_{i+1} - t_i)} \qquad (11)$$

$$\mathrm{AMC}_t = \frac{\sum_{m_i \in M} \int_{s(I_t)}^{s(I_{t+1})} \mathrm{Power}_{m_i}(t)\,dt}{|A_t| \cdot \sum_{m_i \in M} \mathrm{Power}_{m_i}^{\max} \cdot (t_{i+1} - t_i)} \qquad (12)$$
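The weighted objective of Eq. (7) and the normalization of Eq. (8) can be sketched as follows; the weights and inputs are illustrative, and the matching-degree and consumption terms would be computed analogously from Eqs. (9)-(12):

```python
# O(F_t) = alpha*ART_t + beta*AMD_t + gamma*AC_t, each term normalized to [0, 1].
def objective(art, amd, ac, alpha=0.4, beta=0.3, gamma=0.3):
    return alpha * art + beta * amd + gamma * ac

# Eq. (8): mean response time of tasks completed in I_t, normalized by the
# maximum response time observed so far.
def avg_response_time(completed, max_seen):
    return sum(completed) / (len(completed) * max_seen)

print(round(objective(0.5, 0.2, 0.1), 2))  # 0.29
```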
4.2 Model Input and Training

Input. In the proposed HMC system, we assume that there is a finite number of active tasks that can be handled by the scheduling layer, with an upper bound of $N$; hence, for the scheduler, at any interval, $|A_t| \le N$. Moreover, because tasks in the system require both human and machine resources, we consider the utilization metrics of allocation rate, fatigue, IPS, RAM, disk, and bandwidth, which form a feature vector of size $F$. We then use $\phi(A_{t-1})$, an $N \times F$ matrix, to express the active-task utilization metrics $\{u(a_i^{t-1}) \mid \forall i, a_i^{t-1} \in A_{t-1}\}$, where the first $|A_{t-1}|$ rows are occupied by the feature vectors of active tasks in order of arrival interval and the remaining rows are 0. For human resources, we consider the allocation rate and fatigue utilization of each HPU, forming vectors of size $F_1$. For machine units, we form feature vectors from IPS, RAM, disk, and bandwidth utilization and capacities, of size $F_2$, where $F = F_1 + F_2$. As mentioned above, at every interval $I_t$, for the $|H|$ HPUs we form an $|H| \times F_1$ matrix $\phi(H_{t-1})$, and for the $|M|$ machine units an $|M| \times F_2$ matrix $\phi(M_{t-1})$ is constructed, both using the computing-unit utilization metrics of interval $I_{t-1}$.

Training. As mentioned in Sect. 1, in order to use the gradient-based optimization strategy, we train a neural model $f$ to approximate the objective function $O(F_t)$ using the input parameters $[\phi(A_{t-1}), \phi(H_{t-1}), \phi(M_{t-1}), \phi(S)]$ that have been
introduced above. We consider a continuous function $f(x; \omega)$, where $x$ is a continuous or discrete bounded variable and $\omega$ is a vector denoting the neural network parameters, as a neural approximator of $O(F_t)$. More specifically, $x$ is the set of utilization metrics of tasks and computing units together with scheduling decisions; the parameters $\omega$ are trained on the dataset $\{O(F_t), [\phi(A_{t-1}), \phi(H_{t-1}), \phi(M_{t-1}), \phi(S)]\}$, and we use a loss function $L$ minimized over this dataset. For the model $f$, the Mean Square Error (MSE) is used as the loss function $L$, which quantifies the dissimilarity between the output of $f$ and the ground truth, as shown by Eq. (13):

$$L(f(x; \omega), y) = \frac{1}{T} \sum_{t=0}^{T} (y - f(x; \omega))^2, \quad (x, y) \in \text{the dataset} \qquad (13)$$

where $y$ is the value of the objective function $O(F_t)$ corresponding to $x$. A key capability of neural networks is to randomly initialize a parameterized function and adjust a large number of parameters according to the loss function.
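The MSE-driven training of Eq. (13) can be illustrated with a one-parameter stand-in for the neural approximator $f(x; \omega)$, so the gradient step stays visible; this is a sketch, not the paper's network:

```python
# Mean square error between predictions and ground-truth objective scores.
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# One gradient-descent step for a linear model f(x; w) = w * x:
# d/dw mean((y - w*x)^2) = mean(-2 * x * (y - w*x)).
def train_step(w, xs, ys, lr=0.05):
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # synthetic targets: y = 2x
w = 0.0
for _ in range(200):
    w = train_step(w, xs, ys)
print(round(w, 3))  # 2.0: the parameter converges to the generating slope
```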
4.3 Scheduler

After training the model $f$, we execute the scheduling process summarized in Algorithm 1. The inputs to the scheduler are the tasks to be assigned and the resource utilization metrics of the human-machine computing resources, as well as the capacity characteristics of the HPU units. For each scheduling decision, we consider selecting a subset of all tasks and computing units of size $K$ (lines 4–5 in Algorithm 1). After this, we configure the appropriate computing units for each high-utilization task. In addition, to accommodate the dynamic nature of arriving tasks, at each scheduling interval we train the scheduling model using back-propagation. For this purpose, we obtain the latest QoS score from the scheduling layer of the system and fine-tune the weights of the scheduler model using the MSE loss (lines 15–16 in Algorithm 1). Continuous optimization and training of the model allows the system to quickly adapt to dynamic task queues and make appropriate changes to scheduling decisions.
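The Algorithm 1 listing itself is not reproduced in this text; the following is a speculative sketch of the selection-and-assignment loop it describes (top-K task selection, capacity-aware unit choice scored by a surrogate), with illustrative names and data:

```python
# Per interval: pick the K highest-utilization tasks, then pair each with the
# feasible unit the surrogate scores best; the model fine-tuning step on the
# observed QoS is omitted here.
def schedule(tasks, units, score, k=2):
    """tasks: {id: utilization}; units: {id: free capacity};
    score(task, unit): lower is better (stand-in for the trained f)."""
    chosen = sorted(tasks, key=tasks.get, reverse=True)[:k]  # top-K tasks
    decisions, free = {}, dict(units)
    for t in chosen:
        fits = [u for u in free if free[u] >= tasks[t]]  # capacity check
        if fits:
            u = min(fits, key=lambda u: score(t, u))
            decisions[t] = u
            free[u] -= tasks[t]
    return decisions

tasks = {"t1": 0.9, "t2": 0.5, "t3": 0.2}
units = {"h0": 1.0, "m0": 0.6}
print(schedule(tasks, units, score=lambda t, u: units[u]))
# {'t1': 'h0', 't2': 'm0'}: t1 only fits h0, then t2 only fits m0
```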
5 Experiments and Analysis

5.1 Experimental Setup

In our experiments, each HPU is randomly assigned certain properties from the functional capabilities of the tasks defined by the system. More specifically, we set the number of task requests in the range [100, 800]. The numbers of HPUs and machine computing units in our simulated resource pool are 20 and 10, respectively. We keep the size of the scheduling interval $I$ at 300 s and run each experiment 5 times, averaging the results to generate the QoS metrics. To evaluate the performance of our deep neural network-based scheduler, the compared scheduling methods include an evolutionary model, Genetic Algorithms (GA), a lightweight heuristic method, and a Reinforcement Learning (RL) based model. We evaluate the metrics of average response time, total consumption, and scheduling time through simulation experiments. All simulations ran on a PC with an Intel Core i7-10875H CPU @ 2.30 GHz, 32 GB RAM, an Nvidia RTX 2080, and Windows 10.
5.2 Results

For all the models in the simulation experiments, random scheduling is taken as the benchmark. We first evaluate the average response time and scheduling time; Fig. 2 shows the variation of average response time with the number of tasks. Compared with the other scheduling models, our average response time is the lowest except for random scheduling. More specifically, when the number of input task requests reaches 800, due to resource and scheduling constraints, the effect of the other models is not obvious; however, the response time of our model is approximately 60% of that of the alternative scheduling methods. In terms of scheduling time, although random scheduling can produce a decision in a very short time, it does not take factors such as resources into account, which leads to its poor average response time. Except for random scheduling, our model has the shortest scheduling time because it adapts quickly in highly volatile environments. Figure 4 shows the total consumption of both HPUs and machine computing units for each scheduling model. As the number of input tasks gradually increases, the gap in total consumption between schedulers becomes larger, because resource utilization gradually approaches saturation, and the scheduling performance of a model only stands out when the number of computing resources is further increased (Fig. 3).
Fig. 2. Average Response Time of our scheduler and baselines with increasing number of tasks
Fig. 3. Scheduling Time of our scheduler and baselines with increasing number of tasks
6 Conclusion and Future Work

In this paper, we proposed a deep neural network-based task scheduling and resource configuration framework for the HMC system. More specifically, we introduced the HMC architecture and the modeling details of human-machine computing resources, and on this basis presented the dynamic task model and the framework of our DNN-based scheduler. Finally, the experimental results showed that our framework achieves remarkable performance, clearly better than the existing scheduling algorithms, and can serve as a guideline for future research on HMC systems. In future work, we will consider further enriching and refining the human-machine modeling attributes to suit diverse task requirements. For the scheduler, we will try to add networks with memory features (e.g., RNN, LSTM) to the model, which can better predict and adapt to dynamic task queues.
Fig. 4. Consumption of our scheduler and baselines with increasing number of tasks
References 1. Amershi, S., Weld, D., Vorvoreanu, M., et al.: Guidelines for human-AI interaction. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2019) 2. Beede, E., Baylor, E., Hersch, F., et al.: A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2020) 3. Lee, M.H., Siewiorek, D.P., Smailagic, A., et al.: A human-AI collaborative approach for clinical decision making on rehabilitation assessment. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14 (2021) 4. Wang, K., Yan, X., Zhang, D., et al.: Towards human-machine cooperation: self-supervised sample mining for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1605–1613 (2018) 5. Sinha, K., Manjunath, G., Gupta, B., et al.: Designing a human-machine hybrid computing system for unstructured data analytics. arXiv preprint arXiv:1606.04929 (2016) 6. Han, W., Deng, Q., Gong, G., et al.: Multi-objective evolutionary algorithms with heuristic decoding for hybrid flow shop scheduling problem with worker constraint. Expert Syst. Appl. 168, 114282 (2021) 7. Boutsis, I., Kalogeraki, V.: On task assignment for real-time reliable crowdsourcing. In: 2014 IEEE 34th International Conference on Distributed Computing Systems, pp. 1–10. IEEE (2014)
Research on User's Mental Health Based on Comment Text
Yubo Shen, Yangming Huang, Ru Jia(B), and Ru Li
School of Computer Science (School of Software), Inner Mongolia University, Hohhot, Inner Mongolia 010021, China
[email protected]
Abstract. With the rapid development of society and the fast pace of life, people's psychological pressure is increasing, which leads to a surge in the number of patients with mental illness and affects society as a whole. To better understand people's mental health status, many experts and scholars have designed a series of psychological assessment questionnaires, but most of these questionnaires take the form of judgment and multiple choice, so patients with mental illness can hide their real mental state by choosing seemingly correct options. In this paper, we build a dictionary-based sentiment analysis system for assessing user mental health: we select the BosonNLP sentiment dictionary as the basic sentiment lexicon and combine it with a praise and criticism dictionary, a negation dictionary, an emoji dictionary, and a degree adverb dictionary to analyze the sentiment of comment texts. The resulting positive/negative classification serves as an emotional feature for identifying users with abnormal psychology, in order to analyze their mental health status. Evaluation on a microblog comment text dataset shows that the classification accuracy of the sentiment dictionary constructed in this paper is better than that of the BosonNLP sentiment dictionary alone. Keywords: Mental health · Comment text · Emotional analysis
1 Introduction
With the development of the Internet, more and more social networking platforms have come into people's view and become the mainstream media on which people publish opinions and write posts. These platforms contain a large amount of useful text information that can be applied to an objective analysis of users' mental health status, helping us understand people's level of mental health. In foreign studies, Riloff et al. [1] were the first to study the construction of sentiment dictionaries, in 1997; they used corpus data to construct sentiment dictionaries and carried out many studies on this basis. Kim et al. [2] held that opinions can be expressed in four aspects: article topic, opinion holder, emotional expression item, and positive or negative tendency, and that the combination of these constitutes the opinion holder's positive or negative emotional tendency towards a certain topic. In 2017, Zhao Yanyan et al. [3] approached the problem from the perspective of dictionary scale and constructed a large-scale sentiment dictionary to solve the problem
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 298–310, 2023. https://doi.org/10.1007/978-981-99-2385-4_22
Research on User’s Mental Health Based on Comment Text
299
of insufficient sentiment words. By analyzing the comment text data of social network platforms, the monitoring of users' mental health can be improved, enabling early detection and treatment and effectively avoiding the personal loss and social harm caused by mental illness. Based on the sentiment-dictionary approach, this paper analyzes the sentiment of microblog comment texts, predicts users' mental health status, and uses computer technology to automatically identify users with mental health problems. At present, the methods for obtaining research data in the field of psychology are relatively limited and time-consuming, which leaves a large amount of psychological research stuck at the theoretical level, unable to be verified with data. At the same time, the number of samples generated by Weibo, Douban, Facebook, and other social media platforms with billions of users has increased exponentially, effectively making up for the small sample sizes common in psychological research, thus increasing the applicability and universality of such research and improving its efficiency. Analyzing users' mental health through comment text data from social platforms not only breaks the limitations of time and space but also relieves psychologists of the need to participate in the whole process, providing a good way to diagnose and treat mental illness at an early stage.
2 Related Research
2.1 Text Emotion Analysis
Text emotion analysis analyzes, processes, and extracts text with subjective emotion using natural language processing and text mining techniques [4]. It involves statistical data analysis, computational linguistics, natural language processing, and more; it is interdisciplinary research. According to text granularity, emotion analysis can be divided into the word level, sentence level, and document level. Among these, word-level emotion analysis is the basis of the sentence and document levels. Extracting lexical elements with practical significance can not only reduce the number of features analyzed but also lay a solid foundation for later, higher-level text analysis. Compared with traditional text analysis, the emotional analysis of text faces more complex problems, such as the vocabulary combinations of natural language and its varied syntactic structures, which make it difficult for a computer to analyze the emotional semantics contained in text [5]. At present, there are two main approaches to text emotion analysis: one based on dictionaries and one based on features [6]. The former establishes rules based on the characteristics of the text, combines them with a curated dictionary for judgment and analysis, and obtains the emotional tendency; the latter judges the emotion of unknown text from its relevant features, using statistical theory combined with machine-learning training on a massive corpus [7].
2.2 Dictionary Based Emotion Analysis
Dictionary based emotion analysis is a semantics-based corpus processing technique. By analyzing the words with emotional components and then quantifying them,
a complete emotion classification of text data is obtained. The process of emotion classification based on the dictionary method is as follows (Fig. 1):
Fig. 1. Dictionary based emotion analysis process
From the above process, we can see that the core of text sentiment classification is to extract sentiment keywords, establish a sentiment dictionary, and set weights to calculate sentiment tendency. Among these, the establishment of the sentiment dictionary is the premise of text sentiment classification, and its comprehensiveness and accuracy directly affect the classification results. Sentiment dictionary extension is a method built on top of an existing sentiment dictionary, used to make up for its shortcomings and improve classification accuracy. Sentiment orientation is calculated by extracting assessment factors from the comment text and matching phrases. The importance of all evaluation factors at all levels is comprehensively considered, and the emotional words in the text are given weights, so as to obtain the degree of emotional orientation, which is divided into positive and negative emotions [8].
2.3 Introduction to Chinese Word Segmentation
Chinese text is written as a continuous sequence of characters, with no delimiters between words. In classification tasks, a word segmentation tool is therefore used to separate words before classifying. Word segmentation decomposes a continuous character sequence according to certain rules so that it can be reorganized into a set of words, from which all the emotional words it contains can be extracted. It is a critical preprocessing step that produces the word sequences on which later emotion analysis is based. Commonly used Chinese word segmentation tools include SCWS, ICTCLAS (compatible with multiple operating systems), and Paoding for the Java platform [9]. Among Chinese word segmentation systems, the most authoritative is ICTCLAS [10], developed over five years by the Institute of Computing Technology, Chinese Academy of Sciences.
After 6 upgrades, the word segmentation system has reached version 3.0, with a word segmentation speed of 996 KB/s and a word segmentation accuracy of 98.45%.
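To make the segmentation idea concrete, the following is a minimal sketch of forward maximum matching, the greedy strategy that MMSEG-style segmenters build on: at each position, take the longest dictionary word that matches, falling back to a single character. The tiny vocabulary here is purely illustrative; real systems such as ICTCLAS or SCWS use much larger dictionaries plus statistical models.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching over a word set `vocab`.
    At each position, try the longest candidate first; if no
    dictionary word matches, emit a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Illustrative vocabulary for the example sentence "这本书非常好看"
# ("This book is very good-looking").
vocab = {"这本", "本书", "非常", "好看", "故事", "有趣"}
print(forward_max_match("这本书非常好看", vocab))  # → ['这本', '书', '非常', '好看']
```

Note that the greedy choice of "这本" blocks the equally plausible "本书"; resolving such ambiguities is exactly what the statistical models in ICTCLAS and jieba add on top of dictionary matching.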
3 Research Method
3.1 Chinese Oriented Microblog Emotion Dictionary
There are various forms of expression on microblogs. Some comments contain no emotional words, yet may still carry a certain emotional tendency. A large amount of Weibo data shows that most users like to express their emotions through both text and emoticons; therefore, emoticons also play a crucial role in analyzing the sentiment polarity of microblog text. A research focus of this paper is to establish an emoji dictionary and integrate existing dictionary resources to form a dictionary of a certain scale for the Weibo domain. The microblog dictionary constructed in this paper comprises five categories: a basic sentiment dictionary, a praise and criticism dictionary, an emoji dictionary, a degree adverb dictionary, and a negation dictionary.
3.2 Construction of Basic Emotion Dictionary
Compared with traditional text, microblog comment text is characterized by large data volume, short content, colloquial expression, and many Internet slang words [11]. Therefore, the BosonNLP sentiment dictionary is selected as the basic sentiment dictionary; it is a sentiment polarity dictionary automatically constructed from millions of sentiment-annotated data sources such as microblogs, news, and forums [12]. Because it includes microblog data, the dictionary contains a large number of Internet slang words and informal abbreviations, and also covers many non-standard comments. The final sorted basic dictionary contains 114,767 emotional words, each given a weight in the range [−7, 7] according to its emotional tendency and intensity. Negative numbers indicate negative words, positive numbers indicate positive words, and the magnitude reflects the degree of positivity or negativity. Some emotional words and their weights are shown in Table 1.
(1) Praise and criticism dictionary. Based on selecting and sorting entries from HowNet ("Knowledge Network") and the Chinese Commendatory and Derogatory Thesaurus, this paper compiles a praise and criticism dictionary containing 10,190 positive emotion words and 13,711 negative emotion words, together with colloquial words and some positive or negative words from microblog text, as shown in Table 2.
(2) Emoji dictionary. One important factor in the rapid development of microblogging is its variety of forms of expression: users can express their emotions by posting text, pictures, videos, and emoticons. As a new element of online language, emoticons are used more and more on Weibo. Weibo emojis are mainly basic emoticons that appear in the form of "[]" + text; they have become increasingly popular, and various cartoon characters and icons have been added, enriching microblog expression [13]. Emojis usually carry some emotional polarity and can convey the user's emotion more clearly, so they play an important role in microblog sentiment analysis. In other words, in sentiment analysis, emotional symbols must be
included in the "feature category". On this basis, this paper classifies emojis by their positive and negative polarity and finally forms an emoji dictionary. The specific emoticons and their weights are shown in Table 3.
(3) Degree adverb dictionary. Adverbs of degree usually modify verbs and adjectives and can affect the emotional intensity of a sentence without changing its emotional polarity. Since most microblog comments express the user's emotions through colloquial language that is short but informative, such modifying adverbs of degree often appear in the text, conveying the user's emotions more directly [14]. For example, in "This book is very good. It has lots of interesting stories.", "very" is an adverb of degree used to modify an emotional word, and it better shows the user's good impression of the book. Adverbs of degree affect the user's emotional expression, so when they appear, they should be given an appropriate weight according to the degree expressed. Based on 219 Chinese adverbs of degree collected from the HowNet sentiment analysis lexicon, this paper divides them into six levels according to emotional strength and assigns weights from 0.5 to 3. Degree adverbs and their weights are shown in Table 4.
(4) Negation dictionary. Negation words often greatly affect the expression of emotion. A negation word appearing around an emotional word generally indicates a change of emotional polarity: when a negation word modifies a positive emotion, the result is negative; conversely, when it modifies a negative emotion, the result is positive. In this paper, 71 negation words are collected to form the negation dictionary, and their weight is set to −1.
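The five lexicons described above can be represented as simple lookup tables. The sketch below uses a few illustrative entries (weights taken from Tables 1, 3, and 4; the variable and function names are our own, not part of any published toolkit):

```python
# Illustrative fragments of the five lexicons; the real versions hold
# 114,767 base words, 10,190 + 13,711 praise/criticism words,
# 219 degree adverbs, and 71 negation words.
base_lexicon = {"speechless": -5.6046, "miserable": -3.7986,
                "approachable": 1.6246, "overcome": 4.3923}
praise_criticism = {"excellent": 1, "cozy": 1, "useless": -1}
emoji_lexicon = {"[Thumbs up]": 2, "[Ha ha]": 2, "[tears]": -2, "[Anger]": -2}
degree_adverbs = {"most": 3, "extremely": 2.5, "very": 2,
                  "relatively": 1.5, "slightly": 1}
negation_words = {"not", "no", "never"}   # each carries weight -1

def word_weight(token):
    """Prior sentiment weight of a token; 0 if it is in no lexicon."""
    for lex in (base_lexicon, praise_criticism, emoji_lexicon):
        if token in lex:
            return lex[token]
    return 0

print(word_weight("[Ha ha]"))  # → 2
```

Keeping the emoji dictionary separate from the word lexicons matters later, because formula (6) sums word scores and emoji scores as distinct terms.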
Table 1. Basic emotion dictionary.

Emotional words   Weights
speechless        −5.604577436
miserable         −3.798636281
woohoo            −3.539448046
approachable       1.624643516
overcome           4.392317423
…                 …
3.3 Comment Text Preprocessing Technology
Microblog users are free to express their views, with few restrictions on how they do so, and in some cases their grammatical forms may not conform to general standards. Studies show that in a microblog post, nouns,
Table 2. List of praise and criticism dictionary.

Positive emotion words        Negative emotion words
excellent                     pit father
pretty                        have a guilty conscience
methodical                    lead a gay one's fling
cozy                          narrow-minded
don't be afraid of danger     useless
…                             …
Table 3. Emoji dictionary.

Category                Weight   Content
positive comment text    2       [So proud] [Ha ha] [so happy] [clap] [ok] [good] [Yeah] [Thumbs up] [Awesome] [Love you] [Red envelope] [heart] [candle] [cake] [words Jane] [present] [pandas] [rabbit] [gloves] [eat] [thinking] [cap] [shake hands]…
negative comment text   −2       [Anger] [shut up] [contempt] [tears] [sickness] [vomit] [grief] [Mad] [insidious] [scolding] [yawning] [question] [crazy] [worst] [pitiful] [surprised] [shy] [snickering] [too lazy to talk to you] [dizzy] [sleep] [nerd] [right hem] [left hem] [bad] [grievance] [snort]…
Table 4. Degree adverbs dictionary.

Degree       Weight   Number   Degree adverb
most         3        30       over
extremely    2.5      69       extremely
very         2        42       very
relatively   1.5      37       compare
slightly     1        29       little by little, more or less
owe          0.5      12       a little, not much
adjectives, and adverbs play a large role in text sentiment analysis, while adverbials play a small role at the grammatical level. In addition, users can freely forward other people's posts, which leads to a great deal of duplicated data. Therefore, it is necessary to preprocess microblog data.
Word segmentation and part-of-speech tagging are very important tasks in natural language processing. Chinese microblog text uses punctuation marks
to mark sentence boundaries; unlike English, there are no spaces between words, so Chinese microblog text needs word segmentation. Many scholars have studied this and achieved notable results. The most popular Chinese word segmentation system is ICTCLAS [10], developed by the Institute of Computing Technology, Chinese Academy of Sciences; it is based on a multi-layer hidden Markov model, with a single-machine segmentation speed of 996 KB/s and a segmentation accuracy of 98.45%. The Chinese automatic word segmentation system SCWS is suitable for small and medium-sized search engines, keyword extraction, and similar applications, with an accuracy of 90%–95%. MMSEG4J, an open-source Java word segmentation widget based on forward maximum matching, achieves a word recognition rate of 98.41%. Jieba ("stuttering") segmentation, a Python Chinese word segmentation component, uses dynamic programming to find the most probable segmentation path and can also handle unknown words. Since the experimental environment of this study is based on Python, jieba segmentation was used to preprocess the microblog data.
3.4 Rules of Text Emotion Analysis
Sentence Pattern Analysis Rules. Through preprocessing, a microblog text is divided into several short sentences, S = {S1, S2, …, Sn}, where Si denotes the i-th sentence. A complete sentence is delimited by its sentence-final punctuation (period, question mark, or exclamation point). On this basis, this paper sets rules and analyzes the role of sentence patterns in emotion. A period is usually used at the end of a declarative sentence; it carries no emotional color and has no practical impact on the expression of emotion. A sentence ending with a question mark can be a genuine question or a rhetorical question.
Rhetorical questions generally contain rhetorical markers such as "could it be" and "how"; they express an emotion opposite to the literal one in the sentence. Ordinary interrogative sentences express the user's uncertainty, without a clear emotional intention. Finally, exclamatory sentences strengthen the mood and thus intensify the emotion. On this basis, this paper uses the following rules to adjust the emotional weight of different sentence types, defining a sentence-pattern influence factor Y with initial value 1.
Rule 1: If the sentence is exclamatory, i.e., "!" appears in the sentence, then Y = 2.
Rule 2: If the sentence is an ordinary question, i.e., "?" appears in the sentence and it contains no rhetorical marker, then Y = 1.
Rule 3: If the sentence is a rhetorical question, i.e., "?" appears in the sentence and it contains a rhetorical marker, then Y = 1.5.
In summary, the sentence-pattern calculation formula for a Weibo comment text is:

Ti = Si × Y   (1)
where Ti is the corrected sentiment value of sentence i, Si is the sentiment value of sentence i in the comment text, and Y is the sentence-pattern influence factor.
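Rules 1–3 and formula (1) can be sketched as a small helper. The rhetorical cue list below is illustrative only (the paper's actual markers are Chinese words), and the function names are ours:

```python
RHETORICAL_MARKERS = ("could it be", "how")   # illustrative cue words

def sentence_type_factor(sentence):
    """Sentence-pattern influence factor Y (Rules 1-3; default Y = 1)."""
    s = sentence.strip()
    if s.endswith("!"):            # Rule 1: exclamatory sentence
        return 2.0
    if s.endswith("?"):
        if any(m in s.lower() for m in RHETORICAL_MARKERS):
            return 1.5             # Rule 3: rhetorical question
        return 1.0                 # Rule 2: ordinary question
    return 1.0                     # declarative sentence

def corrected_sentence_value(s_i, sentence):
    """Formula (1): Ti = Si * Y."""
    return s_i * sentence_type_factor(sentence)

print(corrected_sentence_value(1.6, "This book is so good!"))  # → 3.2
```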
Word Analysis Rules. After a sentence is segmented, the sentiment words, degree adverbs, and negation words are identified according to the sentiment dictionary constructed in this paper. Since different types of words influence the sentiment polarity of a sentence differently, the following calculation rules are formulated.
Rule 1 (negation word + sentiment word). If the sentiment word is modified by negation words, the calculation rule is:

C = (−1)^n × Csen   (2)

Rule 2 (degree adverb + sentiment word). If the sentiment word is modified by a degree adverb, the calculation rule is:

C = Cdeg × Csen   (3)

Rule 3 (degree adverb + negation word + sentiment word). If the sentiment word is modified by a degree adverb followed by negation words, the calculation rule is:

C = (−1)^n × Cdeg × Csen × 1.5   (4)

Rule 4 (negation word + degree adverb + sentiment word). If the sentiment word is modified by negation words followed by a degree adverb, the calculation rule is:

C = (−1)^n × Cdeg × Csen × 0.5   (5)
where C is the corrected sentiment value of the sentiment word, Cdeg is the degree adverb influence factor, Csen is the initial sentiment value of the sentiment word, and n is the number of negation words.
3.5 Comprehensive Calculation of Sentiment Tendency Weight
Based on the calculation rules proposed above, the sentiment value of a microblog comment text is computed as follows. The sentiment value of a sentence is:

Si = ΣC + ΣE   (6)

where Si is the sentiment value of sentence i in the comment text, C is the corrected sentiment value of each sentiment word, and E is the sentiment value of each emoji. Let the final sentiment value of the microblog comment text be W; then W is calculated by:

W = ΣTi   (7)

Comparing the calculated W with 0 gives the sentiment polarity of the Weibo comment text: when W > 0, the comment text is judged to be positive; when W = 0, neutral; and when W < 0, negative.
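Rules 1–4 and formulas (6)–(7) combine into a small scorer. The sketch below is ours (argument names are hypothetical); it covers the four modifier patterns and the two summation formulas:

```python
def corrected_word_value(c_sen, c_deg=1.0, n_neg=0, negation_first=False):
    """Apply Rules 1-4 to one sentiment word.

    c_sen: initial sentiment value Csen
    c_deg: degree-adverb factor Cdeg (1.0 when no degree adverb)
    n_neg: number of negation words n
    negation_first: True for 'negation + degree + word' (Rule 4, x0.5),
                    False for 'degree + negation + word' (Rule 3, x1.5)
    """
    c = ((-1) ** n_neg) * c_sen          # Rule 1, formula (2)
    if c_deg != 1.0:
        c *= c_deg                       # Rule 2, formula (3)
        if n_neg > 0:                    # Rules 3/4, formulas (4)/(5)
            c *= 0.5 if negation_first else 1.5
    return c

def sentence_value(word_values, emoji_values):
    """Formula (6): Si is the sum of corrected word and emoji values."""
    return sum(word_values) + sum(emoji_values)

def text_value(sentence_values):
    """Formula (7): W is the sum of corrected sentence values Ti."""
    return sum(sentence_values)

print(corrected_word_value(2.0, n_neg=1))             # → -2.0
print(corrected_word_value(2.0, c_deg=2.0, n_neg=1))  # → -6.0
```

The second call shows Rule 3 in action: one negation flips the sign, the degree adverb doubles the magnitude, and the ×1.5 factor strengthens the combined modification.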
4 Experiment Analysis
4.1 Experimental Data
This study crawled part of the comment text data on Sina Weibo, obtaining a total of 23,968 comment texts. These were preprocessed to remove hyperlinks, stop words, @username mentions, and other meaningless text, yielding high-quality Weibo comment text data, which was organized and saved in a txt file. Through manual annotation, the comment texts were divided into two categories, positive and negative, and 20,000 of them were randomly selected to form the test data for this experiment. The final annotated test data statistics are shown in Table 5.

Table 5. Weibo comment text test data statistics.

Comment text category   Number   Text example
positive comment text   10000    Happy and fulfilling! [Applause] / I'm trying to be someone like you ~~ [Haha] / Baby, my sister really loves you! [Love you] [Kiss]
negative comment text   10000    How many people's voices to say ~ [sad] / The people who watch it are really sad [tears] / Not with such attacks and ridicule! What's the difference with the mob that smashed the car?! [Tears][tears][tears]
4.2 Experimental Performance Evaluation Metrics
Commonly used indicators in emotion recognition are precision, recall, and F-Score. Precision measures the accuracy of the classification results, recall measures the completeness of the classification model, and F-Score is an evaluation index that integrates the two, reflecting overall performance. This paper mainly uses F-Score as the evaluation index.

Precision = Pc / Pa   (8)

where Pc is the number of comment texts of a category that are judged correctly, and Pa is the number of comment texts judged to be of that category.

Recall = Rc / Ra   (9)

where Rc is the number of comment texts of a category that are judged correctly, and Ra is the number of comment texts that actually belong to that category.

F-Score = (2 × Precision × Recall) / (Precision + Recall) × 100%   (10)
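Formulas (8)–(10) translate directly into code, computed from raw counts (variable names follow the formulas; the example counts are hypothetical, not from the paper's experiments):

```python
def evaluate(p_c, p_a, r_a):
    """Precision, recall, and F-Score from counts.

    p_c: comment texts of the category judged correctly
    p_a: comment texts judged to be of the category
    r_a: comment texts that actually belong to the category
    """
    precision = p_c / p_a                                # formula (8)
    recall = p_c / r_a                                   # formula (9)
    f = 2 * precision * recall / (precision + recall)    # formula (10)
    return precision, recall, f

# Hypothetical counts: 90 correct out of 100 predicted, 120 actual.
p, r, f = evaluate(90, 100, 120)
print(round(p, 2), round(r, 2), round(f, 4))  # → 0.9 0.75 0.8182
```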
4.3 Experimental Results and Analysis
Based on the sentiment dictionary constructed in Chapter 3, this paper conducts sentiment analysis on the crawled Weibo comment texts in two steps. The experimental results are shown in Table 6.

Table 6. Experimental results.

Experimental method                             Category   Precision   Recall   F-Score
BosonNLP Dictionary + Semantic Rules            positive   0.6409      0.9203   0.7555
                                                negative   0.8587      0.4845   0.6194
Basic Dictionary + Expression + Semantic Rules  positive   0.6430      0.9256   0.7588
                                                negative   0.8672      0.4861   0.6229
In the first step, only the BosonNLP sentiment dictionary (i.e., the basic sentiment dictionary) and the semantic rules formulated in Chapter 3 are used for analysis. The F-Score is 0.7555 for positive texts and 0.6194 for negative texts. In the second step, the Weibo sentiment dictionary constructed in this paper (BosonNLP sentiment dictionary + praise and criticism dictionary + negation dictionary + emoji dictionary + degree adverb dictionary) and the same semantic rules are used. The F-Score is 0.7588 for positive texts and 0.6229 for negative texts. The following conclusions can be drawn from the experimental results:
(1) Compared with the first group, the precision, recall, and F-Score of the second group increased by 2.1‰, 5.3‰, and 3.3‰ respectively on positive comment texts, and by 8.5‰, 1.6‰, and 3.5‰ respectively on negative comment texts. The comparison shows that the Weibo sentiment dictionary constructed in this paper achieves relatively good results in judging negative comment texts. At the same time, compared with using the BosonNLP sentiment dictionary alone, adding the praise and criticism dictionary, emoji dictionary, negation dictionary, and degree adverb dictionary improves the effect of sentiment analysis. The reason is that the added dictionaries expand the coverage of emotional words, so more emotional features are extracted and the sentiment analysis of the text is more accurate. Therefore, the coverage of sentiment words affects the analysis results of a sentiment lexicon to a certain extent.
(2) The Weibo sentiment dictionary constructed in this paper uses the BosonNLP sentiment dictionary as the basic sentiment dictionary and adds a Weibo emoji dictionary.
Since the BosonNLP sentiment dictionary incorporates a large amount of Weibo corpus in its construction, it can more accurately identify network terminology; the emoji dictionary, introduced according to the characteristics of Weibo, further helps the emotional recognition of text. The experimental results show that the sentiment dictionary
constructed in this paper performs well in the precision, recall, and F-Score of sentiment polarity recognition on comment texts. In summary, the Weibo sentiment dictionary built on the BosonNLP sentiment dictionary achieves better performance in sentiment recognition of Weibo comment texts.
Next, 2,000 comment texts were randomly selected from the remaining 3,968 Weibo comment texts for sentiment analysis. Without manual annotation in advance, the sentiment dictionary constructed in this paper was used directly for analysis. The results are shown in Table 7: of these comment texts, 1,418 are positive and the remaining 582 are negative.

Table 7. Test results.

Total number of test comment texts   Positive comment texts   Negative comment texts
2000                                 1418                     582
According to the test results in Table 7, the proportions of the different categories of microblog comment text were calculated. Among the 2,000 randomly selected comment texts, positive comment texts account for 70.9% and negative comment texts for 29.1%; positive texts form the vast majority, more than two-thirds of the total. It can be seen that there is a large difference between the proportions of positive and negative emotions in microblog comment texts and that, on the whole, positive emotions dominate. The test results suggest that the vast majority of Weibo users are in good mental health, and only a few users may have mental health problems. The experimental results also demonstrate the feasibility of analyzing users' mental health through Weibo comment texts.
5 Summary
The progress of society brings people increasing pressure, and the fast pace of life leaves them unable to relieve this pressure in time, making them prone to mental health problems. As mental health has attracted wide attention, more and more psychological counseling, prevention, and screening work has been undertaken. In order to monitor people's mental health status in a timely and effective manner, this paper proposes screening and analyzing users' mental health status through
the sentiment analysis of microblog comment texts. The key to accurate sentiment analysis of microblog comments is to establish a sentiment dictionary: the greater the number and accuracy of the sentiment words it contains, the more accurate the results. This paper constructs a new sentiment lexicon comprising the BosonNLP sentiment lexicon, a praise and criticism lexicon, an emoji lexicon, a degree adverb lexicon, and a negation lexicon. This method can effectively handle the large number of emotional expressions that appear in comments, thereby improving the accuracy of sentiment analysis of microblog comments. The results show that the microblog sentiment lexicon constructed in this paper performs well in text sentiment analysis and can accurately judge the sentiment polarity of text. In general, the constructed lexicon performs well in the sentiment analysis of microblog comment texts, but there is still room for improvement in the construction of a microblog-domain sentiment lexicon and in the correct identification of special sentence patterns. The Internet develops rapidly and Internet slang grows explosively; building a domain sentiment dictionary for microblogs and improving the coverage of its sentiment words can effectively improve the accuracy of text sentiment analysis. Since a sentiment lexicon by itself has certain limitations in text sentiment analysis, future work may consider deep learning methods to judge the sentiment polarity of text, or properly integrate the sentiment lexicon with deep learning, so as to obtain higher accuracy.
Acknowledgments. This paper is supported by the Inner Mongolia Natural Science Foundation Project (2020MS07018), the National Natural Science Foundation of China (61862046), and the Inner Mongolia Autonomous Region Scientific and Technological Achievement Transformation Project (CGZH2018124).
References
1. Riloff, E., Shepherd, J.: A corpus-based approach for building semantic lexicons. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 117–124 (1997)
2. Kim, S.M., Hovy, E.: Automatic detection of opinion bearing words and sentences. In: Companion Volume of the Proceedings of IJCNLP-05, Jeju Island, Republic of Korea (2005)
3. Yanyan, Z., Bing, Q., Qiuhui, S., et al.: Construction of large-scale sentiment lexicon and its application in sentiment classification. J. Chin. Inf. Process. 31(2), 187–193 (2017)
4. Zhiming, L., Bo, Y., Chunping, O., et al.: SE-TextRank sentiment summarization method based on topic. Technol. Intell. Eng. 3(3), 8 (2017)
5. Yongcheng, Z., Huaibing, W.: Application of Python natural language processing methods in text sentiment analysis. Comput. Knowl. Technol. 16(36), 87–88 (2020)
6. Hengliang, F., Weiqing, C.: A KNN text classification method based on association analysis. Comput. Technol. Dev. 24(6), 4 (2014)
7. Jianfeng, X., Yuan, X., Yuanchen, X., et al.: Hybrid sentiment classification algorithm framework for Chinese text based on semantic understanding and machine learning. Comput. Sci. 42(06), 61–66 (2015)
8. Xiaodong, C.: Sentiment Analysis of Chinese Microblog Based on Sentiment Dictionary. Huazhong University of Science and Technology, Hubei (2012)
Y. Shen et al.
9. Jiahui, C.: Sentiment Analysis and Topic Orientation Determination of Chinese Microblog Based on Sentiment Dictionary. Southwest University, Chongqing (2019)
10. Qun, L., Huaping, Z., Hongkui, Y., et al.: Chinese lexical analysis based on cascaded hidden Markov model. J. Comput. Res. Dev. 08, 1421–1429 (2004)
11. Guowei, S., Wu, Y., Wei, W.: Burst topic detection for large-scale microblog message streams. J. Comput. Res. Dev. 52(02), 512–521 (2015)
12. Jiesheng, W., Kui, L.: Sentiment analysis of Chinese microblog based on multiple sentiment dictionaries and rule sets. Comput. Appl. Softw. 36(9), 7 (2019)
13. Jianying, L.: Sentiment Analysis of Chinese Microblog Based on Sentiment Lexicon. National University of Defense Technology (2016)
14. Jiangyue, L.: Research on Psychological Early Warning Model Based on Fine-Grained Sentiment Lexicon. Tianjin University (2016)
A Multi-objective Level-Based Learning Swarm Optimization Algorithm with Preference for Epidemic Resource Allocation

Guo Yang, Xuan-Li Shi, Feng-Feng Wei, and Wei-Neng Chen(B)

School of Computer Science and Engineering, South China University of Technology, Guangzhou 51006, China
[email protected]
Abstract. Epidemics like COVID-19 seriously threaten public health. How to control the spread of epidemics has long been an important topic attracting a large amount of research effort. During epidemic prevention, it is crucial to effectively reduce the number of infected people, and it is equally important to make good use of epidemic prevention resources and reduce their cost. To address this issue, we consider a multi-objective epidemic resource allocation problem in this paper. First, we define an optimization model for this problem whose objectives are to minimize the number of infected and exposed people and to minimize the cost of resources. The model integrates the Susceptible-Exposed-Infected-Vigilant (SEIV) model to predict the spread of the epidemic and estimate the demand for resources. Second, to solve this intractable problem, we develop a multi-objective level-based learning swarm optimizer (MLLSO) by combining the concepts of level-based learning and nondominated sorting. Third, to further improve the search performance, a special initialization strategy and a preference-based mechanism are also introduced into the algorithm, leading to MLLSO-P. Finally, experimental results demonstrate the effectiveness of the proposed approaches.

Keywords: resource allocation · infectious disease spread · evolutionary computation algorithm
1 Introduction

Throughout history, infectious diseases have been a serious threat to public health. Recently, COVID-19 has brought global attention to the topic of infectious disease prevention and control. At present, people have epidemic prevention resources such as masks and vaccines, and epidemic prevention measures such as isolation, to control the spread of the epidemic. Different epidemic prevention resources have different effects on epidemic prevention, and also have different production costs and related social costs. Thus, how to allocate epidemic prevention resources reasonably plays a crucial role in the control of infectious diseases [1, 2].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 311–325, 2023. https://doi.org/10.1007/978-981-99-2385-4_23

To effectively allocate resources, some epidemic spreading models can be used to predict the infectious disease propagation before resource allocation [3, 4]. The most
famous model is the Susceptible-Infectious-Recovered (SIR) model [5], where each individual has one of three states—susceptible, infectious and recovered—and the spread of the epidemic can be predicted by the dynamic model constructed from the transition relationships among the three states. Since the SIR model was developed, many extensions of the SIR have been proposed, e.g., the Susceptible-Exposed-Infectious-Vigilant (SEIV) model [6, 7], the Susceptible-Exposed-Infectious-Recovered (SEIR) model [8], etc. The SIR model and its extensions have been widely used to predict infectious disease propagation. For example, He et al. [8] took some general control strategies as input and optimized the parameters of SEIR, where SEIR is used to predict the epidemic spread and further predict the resource demand; the infectious disease evolution and forecast were verified on Hubei province for COVID-19 in [8]. Considering asymptomatic people, Leon et al. [9] proposed a Susceptible-Exposed-Infected-Asymptomatic-Recovered-Dead model to model and predict the spread of COVID-19. Besides, some data-driven models have also been developed for forecasting infectious diseases [10, 11].

Based on the above infectious disease models, some literature has considered the resource allocation problem to control the spread of epidemics. We divide these studies into two categories according to the objective function, i.e., the infection-optimal strategy [12–15] and the cost-optimal strategy [2, 16]. In the infection-optimal strategy, researchers generally aim at minimizing the number of infected people or the infection rate under cost constraints. Brandeau et al. [12] allocated limited resources among multiple noninteracting populations to control the spread of infectious disease. Addressing the combinatorial, discrete resource allocation problem, Zhao et al. [15] proposed an evolutionary divide-and-conquer algorithm for minimizing the epidemic spread under a fixed cost.
Conversely, some researchers take the cost as the objective and regard the infection rate as a constraint; a desired exponential decay rate for eradicating the infectious disease is often given as the constraint. Nowzari et al. [16] reformulated the epidemic resource allocation problem as geometric programs and developed an optimization framework to control the epidemic; they found the optimal resource allocation schedule that minimizes the cost under a fixed epidemic decay rate. However, existing studies rarely optimize the infection rate and the cost simultaneously. Although Nowzari et al. [16] considered both the infection-optimal and the cost-optimal strategy, they treated the resource allocation in these two strategies as different problems and solved them separately. In the real world, there is a requirement to control the infectious disease spread as quickly as possible while minimizing the cost. Therefore, it is promising to consider the epidemic resource allocation problem as a multi-objective problem in which the infectious disease spread and the cost are minimized simultaneously.

As the epidemic resource allocation problem has been proven intractable [14, 15], in this paper we consider applying a multi-objective evolutionary algorithm (MOEA). MOEAs, such as the nondominated sorting genetic algorithm II (NSGA-II) [17] and the multi-objective evolutionary algorithm based on decomposition (MOEA/D) [18], have been widely used for solving complex multi-objective problems [19–22]. MOEAs imitate biological inheritance and evolution mechanisms to find a set of solutions approximating the Pareto front of the considered problem in a single run, and have shown good performance in some network-based multi-objective optimization problems. For example, combining ant colony optimization with the MOEA/D framework, Chen et al. [19]
designed an adaptive dimension size selection algorithm to control the spread of rumors in a social network. Wang et al. [20] concentrated on the cooperation and robustness optimization problem in dynamic directed networks; they proposed MOEA-NetCC to design networks with a good ability to maintain cooperation and high robustness. Although MOEAs can find a set of Pareto solutions, the final solution is usually determined by the decision makers [23], and only the solutions in the region of interest (ROI) attract them [24]. For example, in the epidemic resource allocation problem, neither too many infections nor too high a cost is an acceptable outcome; such solutions lie outside the ROI. Thus, integrating preference into an MOEA for epidemic resource allocation is promising.

To this end, in this paper we develop a multi-objective level-based learning swarm optimizer with preference (MLLSO-P) for the multi-objective epidemic resource allocation problem. The main contributions are as follows.

(1) We define a multi-objective epidemic resource allocation optimization model, in which the number of infected and exposed people and the cost of resources are the objectives. Moreover, the Susceptible-Exposed-Infected-Vigilant (SEIV) model is adopted to predict the epidemic spread, and the demand for resources is estimated based on this model.

(2) To solve this multi-objective optimization problem, we develop a new MOEA by combining the concepts of level-based learning [27] and nondominated sorting [17], leading to the multi-objective level-based learning swarm optimizer (MLLSO). Since the resource allocation in this paper is discrete, a binary update rule based on the level-based learning swarm optimization (LLSO) is developed for the particles.

(3) To further improve the swarm diversity of MLLSO, particles are divided into two parts that are initialized in different ways.
Moreover, a comprehensive sorting strategy with preference is developed for MLLSO, resulting in MLLSO-P.

The rest of this paper is organized as follows. "Background" introduces basic information on the existing SEIV model and LLSO. Next, the multi-objective resource allocation optimization model is defined in "Model Definition" to control both the spread of the epidemic and the cost. Subsequently, we explain MLLSO in detail in "Proposed Algorithm" and further define MLLSO-P. Various experiments are conducted in "Experiment".
2 Background

2.1 Existing SEIV Model

To allocate epidemic resources more precisely, the SEIV model is adopted to predict the spread of the epidemic. As a variant of SEIR [8], SEIV has the advantage of generality, since it can transit to other epidemic models such as SIR [5], SEIR [8] and SIS [25], and of good parameter approximation [15], so it has been extensively discussed in studies of epidemic spread [2, 6, 8, 14, 15, 26]. However, the archetypical SEIV model for virus-spreading simulation lacks consideration of social phenomena, e.g., interactions and psychological behaviors among people. To make up for this deficiency, many modified
SEIV models have been proposed and have shown excellent performance in real interaction networks [32–34]. For example, some scholars incorporated people's awareness response into the original SEIV model [32], while others focused on the design of state-transition rates: Adebimpe et al. considered temporary immunity and a saturated incidence rate [33], and Trpevski et al. considered a multiplicative infection rate [34], which is also adopted in our paper. In the general SEIV model, an individual may be in one of four states, i.e., Susceptible (S), Exposed (E), Infected (I) and Vigilant (V), with probabilities p_i^S(t), p_i^E(t), p_i^I(t) and p_i^V(t), respectively, where

p_i^S + p_i^E + p_i^I + p_i^V = 1,  s.t.  0 ≤ p_i^S, p_i^E, p_i^I, p_i^V ≤ 1

Individuals and the communication among individuals are represented as nodes (n) and edges (e) in an unweighted and undirected graph G_N = (n, e), where N is the total number of nodes and A = [a_ij] (i, j = 1, 2, …, N) is the adjacency matrix of G_N. We give the general state transition relationship of SEIV in Fig. 1 and summarize the parameters in Table 1.
Fig. 1. The state transition relationship of SEIV.
Table 1. The parameter definitions of SEIV.

Variable   Definition
θ_i        The immune probability of an individual transiting from S to V
γ_i        The immunity loss probability of an individual transiting from V to S
β_i^E      The probability of a susceptible being infected by exposed people, transiting from S to E
β_i^I      The probability of a susceptible being infected by infected people, transiting from S to E
δ_i^I      The recovery probability of an infected individual transiting from I to V
δ_i^E      The recovery probability of an exposed individual transiting from E to V
ξ_i        The incidence probability of an exposed individual transiting from E to I
μ_i        The prevalence probability of a susceptible individual, transiting to the exposed state
In particular, the prevalence probability is calculated as Eq. (1), which represents the total probability of node i being infected by all of its neighbors.

μ_i = 1 − ∏_{j=1}^{N} (1 − β_i^E a_ij p_j^E − β_i^I a_ij p_j^I)    (1)
The SEIV model simulates epidemic spread through a series of differential equations describing the change of individual states over time [14, 26]. According to the transition process presented in Fig. 1, the governing equations can be formulated as (2). Based on these differential equations, we can predict the epidemic spread and further predict the resource demand before allocating.

dp_i^S/dt = γ_i p_i^V − θ_i p_i^S − (1 − θ_i) μ_i p_i^S
dp_i^E/dt = (1 − θ_i) μ_i p_i^S − (ξ_i + (1 − ξ_i) δ_i^E) p_i^E
dp_i^I/dt = ξ_i p_i^E − δ_i^I p_i^I
dp_i^V/dt = θ_i p_i^S + (1 − ξ_i) δ_i^E p_i^E + δ_i^I p_i^I − γ_i p_i^V    (2)
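To make the prediction step concrete, the following sketch integrates Eqs. (1)–(2) with a simple forward-Euler scheme. It is a minimal illustration, not the authors' implementation: the fully connected three-node network, all rate parameters, the step size and the horizon are assumed values.

```python
# Forward-Euler integration of the SEIV dynamics of Eqs. (1)-(2).
# The fully connected 3-node graph and all rate parameters are illustrative.

def seiv_step(p, A, theta, gamma, betaE, betaI, delE, delI, xi, dt=0.01):
    """Advance the per-node state probabilities p[i] = (S, E, I, V) one step."""
    N = len(p)
    new_p = []
    for i in range(N):
        S, E, I, V = p[i]
        # Eq. (1): prevalence probability mu_i of node i given its neighbors
        prod = 1.0
        for j in range(N):
            prod *= 1.0 - betaE[i] * A[i][j] * p[j][1] - betaI[i] * A[i][j] * p[j][2]
        mu = 1.0 - prod
        # Eq. (2): state derivatives
        dS = gamma[i] * V - theta[i] * S - (1 - theta[i]) * mu * S
        dE = (1 - theta[i]) * mu * S - (xi[i] + (1 - xi[i]) * delE[i]) * E
        dI = xi[i] * E - delI[i] * I
        dV = theta[i] * S + (1 - xi[i]) * delE[i] * E + delI[i] * I - gamma[i] * V
        new_p.append((S + dt * dS, E + dt * dE, I + dt * dI, V + dt * dV))
    return new_p

N = 3
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]          # adjacency matrix of a toy graph
theta = [0.05] * N; gamma = [0.01] * N
betaE = [0.3] * N; betaI = [0.4] * N
delE = [0.1] * N; delI = [0.1] * N; xi = [0.2] * N
p = [(0.9, 0.05, 0.05, 0.0)] * N               # initial state probabilities
for _ in range(1000):                          # integrate up to T = 10
    p = seiv_step(p, A, theta, gamma, betaE, betaI, delE, delI, xi)
```

Because the four derivatives in Eq. (2) sum to zero, the Euler update preserves p_i^S + p_i^E + p_i^I + p_i^V = 1 at every step, which is a useful sanity check on any implementation.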
2.2 Level-Based Learning Swarm Optimizer

In this paper, the idea of level-based learning is introduced into the proposed MLLSO algorithm. Level-based learning was first proposed in the level-based learning swarm optimizer (LLSO) by Yang et al. [27] for single-objective large-scale optimization. Similar to other swarm optimizers, such as the particle swarm optimizer (PSO) [28] and the competitive swarm optimizer (CSO), LLSO imitates the swarm foraging behavior in biology to search for the optimal solution. Moreover, a level-based strategy is used in LLSO, where particles are separated into different levels according to their fitness: particles in higher levels have better solutions, and particles in lower levels have worse solutions. By learning from two exemplars randomly selected from higher levels, particles update their positions using Eq. (3),

v_{i,j}^d = r_1 v_{i,j}^d + r_2 (x_{rl_1,k_1}^d − x_{i,j}^d) + ϕ r_3 (x_{rl_2,k_2}^d − x_{i,j}^d)
x_{i,j}^d = x_{i,j}^d + v_{i,j}^d    (3)

where X_{i,j} = [x_{i,j}^1, …, x_{i,j}^d, …, x_{i,j}^D] and V_{i,j} = [v_{i,j}^1, …, v_{i,j}^d, …, v_{i,j}^D] are the position vector and velocity vector, respectively, and x_{i,j}^d is the dth dimension of the jth particle in the ith level. rl_1 and rl_2 are two randomly selected levels higher than the ith level, and k_1 and k_2 are particles randomly selected from the rl_1th and rl_2th levels, respectively. r_1, r_2, r_3 are random numbers within [0,1], and ϕ is a control parameter within [0,1]. We summarize the simplified process of LLSO in Algorithm 1.
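The level-based update of Eq. (3) can be sketched as follows for a continuous minimization problem. This is a simplified illustration rather than the original LLSO code: the swarm size, level count NL, control parameter ϕ and the sphere test function are assumptions, and for simplicity the two best levels are kept unchanged each iteration.

```python
import random

# Simplified level-based learning iteration (Eq. (3)), minimization.
# Swarm size, NL, phi and the sphere objective are illustrative assumptions.

def llso_update(swarm, vel, fitness, NL=4, phi=0.4):
    """One level-based learning step: sort by fitness, split into NL levels,
    and let particles in lower levels learn from two higher levels."""
    idx = sorted(range(len(swarm)), key=lambda i: fitness(swarm[i]))  # best first
    per = len(swarm) // NL
    levels = [idx[l * per:(l + 1) * per] for l in range(NL)]
    levels[-1].extend(idx[NL * per:])          # leftovers go to the worst level
    for li in range(2, NL):                    # the two best levels stay unchanged
        for j in levels[li]:
            rl1, rl2 = sorted(random.sample(range(li), 2))  # two higher levels
            k1 = random.choice(levels[rl1])
            k2 = random.choice(levels[rl2])
            for d in range(len(swarm[j])):
                r1, r2, r3 = random.random(), random.random(), random.random()
                vel[j][d] = (r1 * vel[j][d]
                             + r2 * (swarm[k1][d] - swarm[j][d])
                             + phi * r3 * (swarm[k2][d] - swarm[j][d]))
                swarm[j][d] += vel[j][d]
    return swarm, vel

# toy usage: minimize the sphere function
random.seed(0)
sphere = lambda x: sum(v * v for v in x)
swarm = [[random.uniform(-5, 5) for _ in range(10)] for _ in range(40)]
vel = [[0.0] * 10 for _ in range(40)]
best0 = min(sphere(x) for x in swarm)
for _ in range(100):
    swarm, vel = llso_update(swarm, vel, sphere)
best1 = min(sphere(x) for x in swarm)
```

Since the particles in the top levels are not moved, the best fitness in the swarm can never deteriorate across iterations in this sketch.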
3 Model Definition

In this section, we define the multi-objective epidemic resource allocation model, where the cost and the number of infected or exposed people are the objective functions. In the real world, there are various types of epidemic resources. Five commonly used resources [14] are considered in this paper to control the epidemic, i.e., the vaccinating resource R1, the protective resource R2, the detective resource R3, the curative resource R4, and the detective-and-curative resource R5. The allocation of these resources is represented as a 5 × N Boolean matrix M = [m_{r,i}]: if the rth resource is allocated to the ith node, m_{r,i} = 1, otherwise m_{r,i} = 0. Generally, resources help individuals prevent the epidemic by affecting the epidemic transition probabilities [34, 35], and different resources influence different state-transition probabilities. For example, the vaccinating resource R1 can enhance the immune probability of an individual transiting from S to V. The influence of the resources on the transition probabilities of SEIV can be formulated as follows.

R1: θ_i = θ_i + (\bar{θ}_i − θ_i) m_{1,i}
R2: β_i^E = β_i^E (1 − m_{2,i}) + \underline{β}_i^E m_{2,i},  β_i^I = β_i^I (1 − m_{2,i}) + \underline{β}_i^I m_{2,i}
R3: ξ_i = ξ_i + (\bar{ξ}_i − ξ_i) m_{3,i}
R4: δ_i^I = δ_i^I + (\bar{δ}_i^I − δ_i^I) m_{4,i}
R5: ξ_i = ξ_i + (\bar{ξ}_i − ξ_i) m_{5,i},  δ_i^E = δ_i^E + (\bar{δ}_i^E − δ_i^E) m_{5,i}    (4)
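A minimal sketch of how the Boolean allocation matrix M modifies the SEIV transition probabilities, following Eq. (4). The baseline parameter values and the (min, max) bounds standing in for the underlined/overlined quantities are illustrative assumptions.

```python
# Applying a 5 x N Boolean allocation matrix M to the SEIV parameters (Eq. (4)).
# Since m_{r,i} is Boolean, theta + (theta_max - theta) * m reduces to theta_max
# when the resource is allocated. All parameter values here are illustrative.

def apply_resources(M, params, bounds):
    """params: dict of per-node lists; bounds: dict of (min, max) pairs
    playing the role of the underlined/overlined values in Eq. (4)."""
    N = len(params["theta"])
    out = {k: list(v) for k, v in params.items()}  # leave the input untouched
    for i in range(N):
        if M[0][i]:  # R1, vaccinating: raise theta to its maximum
            out["theta"][i] = params["theta"][i] + (bounds["theta"][1] - params["theta"][i])
        if M[1][i]:  # R2, protective: lower infection probabilities to their minima
            out["betaE"][i] = bounds["betaE"][0]
            out["betaI"][i] = bounds["betaI"][0]
        if M[2][i]:  # R3, detective: raise the incidence probability xi
            out["xi"][i] = params["xi"][i] + (bounds["xi"][1] - params["xi"][i])
        if M[3][i]:  # R4, curative: raise the recovery probability delta^I
            out["deltaI"][i] = params["deltaI"][i] + (bounds["deltaI"][1] - params["deltaI"][i])
        if M[4][i]:  # R5, detective and curative: raise xi and delta^E
            out["xi"][i] = params["xi"][i] + (bounds["xi"][1] - params["xi"][i])
            out["deltaE"][i] = params["deltaE"][i] + (bounds["deltaE"][1] - params["deltaE"][i])
    return out

params = {"theta": [0.05, 0.05], "betaE": [0.3, 0.3], "betaI": [0.4, 0.4],
          "xi": [0.2, 0.2], "deltaI": [0.1, 0.1], "deltaE": [0.1, 0.1]}
bounds = {k: (0.01, 0.8) for k in params}      # shared toy (min, max) bounds
M = [[1, 0], [0, 1], [0, 0], [0, 0], [0, 0]]   # vaccinate node 0, protect node 1
new = apply_resources(M, params, bounds)
```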
The overline/underline on a variable represents the maximum/minimum value of that variable. We can see from Eq. (4) that an appropriate resource allocation can effectively control the epidemic spread by affecting the values of the state transition probabilities.

3.1 The Number of Infected People and Exposed People

Based on Eq. (4), the change of the transition probabilities can be calculated. Thus, the final number of people in the different epidemic states can be computed by Eq. (2). We define
the number of infected people and exposed people (L) as one objective to measure the epidemic spread:

L = E(T) + I(T)    (5)
where E(T) and I(T) are the total numbers of people in the exposed and infected states, respectively, at the final evolution time T.

3.2 The Cost of All Allocated Resources

In the real world, it is impossible to allocate resources without considering the cost. Many studies take the cost as a constraint in the epidemic resource allocation problem [14, 15]; we take it as the second objective in this paper. In total, we consider five different kinds of resources: the vaccinating resource R1 (e.g., vaccines), the protective resource R2 (e.g., masks), the detective resource R3 (e.g., pathological diagnosis equipment), the curative resource R4 (e.g., medicaments, surgical equipment), and the comprehensive resource R5. The costs of the different resources are formulated as

Cost_R1 = c_1 Σ_{i=1}^{N} p_i^S m_{1,i},  Cost_R2 = c_2 Σ_{i=1}^{N} p_i^S m_{2,i}
Cost_R3 = c_3 Σ_{i=1}^{N} p_i^E m_{3,i},  Cost_R4 = c_4 Σ_{i=1}^{N} p_i^I m_{4,i}
Cost_R5 = c_3 Σ_{i=1}^{N} p_i^E m_{5,i} + c_4 Σ_{i=1}^{N} p_i^E m_{5,i}    (6)

where c_1, c_2, c_3, c_4 and c_5 are the unit costs for R1, R2, R3, R4 and R5, respectively. Thus, the total cost (C) can be calculated as Eq. (7).

C = Cost_R1 + Cost_R2 + Cost_R3 + Cost_R4 + Cost_R5    (7)
Considering both the control of the epidemic and the cost, we take L and C as the objective functions of the epidemic resource allocation problem.

F: min(f_1, f_2)
   f_1 = L,  f_2 = C
   s.t.  M = [m_{r,i}],  r = 1, ..., 5,  i = 1, ..., N,  m_{r,i} ∈ {0, 1}    (8)
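The two objectives of Eq. (8) can be evaluated directly from an allocation matrix, as the following sketch shows. The state probabilities and unit costs are illustrative assumptions; as in Eq. (6), the cost of R5 combines the unit costs c_3 and c_4.

```python
# Evaluating the two objectives of Eq. (8) for a Boolean allocation matrix M:
# f1 = L = E(T) + I(T) (Eq. (5)) and f2 = C, the total cost (Eqs. (6)-(7)).
# The state probabilities and unit costs below are illustrative assumptions.

def total_cost(M, pS, pE, pI, c):
    """Eqs. (6)-(7): c = [c1, c2, c3, c4]; R5 combines c3 and c4."""
    N = len(pS)
    cost = 0.0
    cost += c[0] * sum(pS[i] * M[0][i] for i in range(N))        # R1, vaccinating
    cost += c[1] * sum(pS[i] * M[1][i] for i in range(N))        # R2, protective
    cost += c[2] * sum(pE[i] * M[2][i] for i in range(N))        # R3, detective
    cost += c[3] * sum(pI[i] * M[3][i] for i in range(N))        # R4, curative
    cost += (c[2] + c[3]) * sum(pE[i] * M[4][i] for i in range(N))  # R5
    return cost

def infection_objective(pE_T, pI_T):
    """Eq. (5): expected number of exposed plus infected people at time T."""
    return sum(pE_T) + sum(pI_T)

pS = [0.8, 0.6]; pE = [0.1, 0.3]; pI = [0.05, 0.1]   # toy state probabilities
c = [2.0, 1.0, 1.5, 3.0]                              # toy unit costs
M = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0]]          # toy allocation matrix
f1 = infection_objective(pE, pI)
f2 = total_cost(M, pS, pE, pI, c)
```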
4 Proposed Algorithm

In order to solve the multi-objective epidemic resource allocation problem, we propose the MLLSO algorithm, which combines level-based learning and nondominated sorting. A binary particle update rule is designed for MLLSO so that it can solve discrete problems. Moreover, considering the high dimensionality of the resource allocation problem,
we develop a random-uniform initialization method to improve the swarm diversity. Furthermore, preference is incorporated into MLLSO, resulting in MLLSO-P, which maintains solutions in the ROI of the decision makers. Instead of nondominated sorting, MLLSO-P uses a comprehensive sorting strategy with preference to sort particles. Besides, the position of each particle represents a feasible solution (x = M), where M is the allocation matrix; thus, the dimension of the velocity is also equal to that of M, D = 5 × N.

4.1 Algorithm Description of MLLSO

MLLSO integrates level-based learning [27] and nondominated sorting [17], extending the original single-objective LLSO into a multi-objective version. Particles in different positions generally have different potential for exploitation and exploration. Grouping particles into different levels according to their fitness and treating them differently is the key idea of level-based learning: particles in low levels learn from particles in high levels, rather than from the global best particle or their own historical optima. Through this mechanism, the swarm can maintain good diversity. However, the original level-based learning is proposed for single-objective problems, where particles can be sorted by a single fitness value—one particle corresponds to one fitness. In a multi-objective problem, one particle has two different fitness values, so how to sort particles in level-based learning remains to be solved. Thus, we bring nondominated sorting into level-based learning. Nondominated sorting was proposed in the NSGA-II algorithm [17] for multi-objective problems: according to the domination relationships among particles, it ranks the particles into different fronts, and the particles in the first front have nondominated fitness. We summarize the overall process of MLLSO-P in Algorithm 2. First of all, the archive set of optimal solutions A is initialized as an empty set (line 1).
We assign the positions (X) and velocities (V) in line 2 using random-uniform initialization. This initialization method allows a fraction of the particles to be initialized randomly while the others keep their pre-allocated uniform fitness values over the iterations using a repairing function; therefore the solution diversity of the whole system is enhanced. Next, the process enters the loop to evolve the swarm until the termination condition is satisfied. At the beginning of the loop, the fitness values of the two objectives of the swarm are calculated in line 4, and then the true_rank of the particles is obtained through a comprehensive sorting, which includes two nondominated sorts: one uses the fitness values, and the other uses the front result and the distance to the reference point of each particle. Through such a design, MLLSO can avoid being trapped in intractable solutions and improve effectiveness at the same time. MLLSO uses the first front of true_rank, which contains the optimal solutions, to update A in line 6: after adding the first front of true_rank to A, only the nondominated solutions are retained in A. Subsequently, MLLSO divides the particles into NL levels successively according to true_rank and updates the particles using Eq. (13) in lines 8–9. Finally, the repairing function is applied again to fix a fraction of the particles under the fixed cost.
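The nondominated sorting used to obtain the fronts (as in NSGA-II [17]) can be sketched as follows for two minimization objectives; the bi-objective points are illustrative.

```python
# Fast nondominated sorting (as in NSGA-II) for two minimization objectives.
# The five bi-objective points are illustrative.

def dominates(a, b):
    """a dominates b: no worse in all objectives, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_sort(points):
    """Return a list of fronts, each a list of indices into `points`."""
    n = len(points)
    S = [[] for _ in range(n)]       # S[p]: solutions dominated by p
    cnt = [0] * n                    # cnt[p]: how many solutions dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            if dominates(points[p], points[q]):
                S[p].append(q)
            elif dominates(points[q], points[p]):
                cnt[p] += 1
        if cnt[p] == 0:
            fronts[0].append(p)      # first front: nondominated solutions
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in S[p]:
                cnt[q] -= 1
                if cnt[q] == 0:
                    nxt.append(q)
        i += 1
        fronts.append(nxt)
    return fronts[:-1]               # drop the trailing empty front

pts = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
fronts = nondominated_sort(pts)      # -> [[0, 1, 2], [3], [4]]
```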
4.2 Binary Particle Update Rule

Since the traditional level-based learning [27] is proposed for continuous problems, a binary level-based learning is used here, so that MLLSO can solve the discrete multi-objective resource allocation problem. We use a sigmoid function to transfer the continuous values of the particles into binary values. This idea refers to [14], where Zhao et al. proposed a binary particle swarm optimizer for networked epidemic control. The binary LLSO adopts the same update rule as the traditional LLSO for the velocities, but updates the positions differently, as shown in the following.

v_{i,j}^d ← r_1 v_{i,j}^d + r_2 (x_{rl_1,k_1}^d − x_{i,j}^d) + ϕ r_3 (x_{rl_2,k_2}^d − x_{i,j}^d)
x_{i,j}^d = 1, if rand(0,1) < sig(v_{i,j}^d); 0, otherwise    (9)

    then ← c  // update the threshold
    // return this segment
  else
    drop out this segment
  end if
Third, to meet the requirements of an online, real-time algorithm, this paper extracts the HoG feature [20] of each frame and averages them to obtain the feature of the shot. Last, inspired by Nagar et al. [17], we utilize Eq. (1) to represent the distinctiveness of a shot set. In order to speed up the maintenance process, we simplify Eq. (1) to Eq. (2). Here, f(∗) is the feature extraction function.

R_fc = (1 / (|E|(|E| − 1))) Σ_{S_i ∈ E} Σ_{S_j ∈ E, S_j ≠ S_i} |f(S_i) − f(S_j)|²    (1)

R_dff = (1 / (|E| − 1)) Σ_{S_i, S_{i+1} ∈ E} |f(S_i) − f(S_{i+1})|²    (2)
The distinctiveness of the set E is kept maximal by a greedy strategy: given a new shot S_now, one existing shot is deleted from the set so as to maximize the distinctiveness of E.
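The distinctiveness measures of Eqs. (1)–(2) and the greedy maintenance of the set E can be sketched as follows; the two-dimensional shot features are illustrative stand-ins for averaged HoG vectors.

```python
# Distinctiveness of a shot set (Eqs. (1)-(2)) and the greedy maintenance step.
# The 2-D shot features are illustrative stand-ins for averaged HoG vectors.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def r_fc(E):
    """Eq. (1): mean squared distance over all ordered pairs of shots in E."""
    n = len(E)
    return sum(sq_dist(E[i], E[j])
               for i in range(n) for j in range(n) if j != i) / (n * (n - 1))

def r_dff(E):
    """Eq. (2): mean squared distance between temporally adjacent shots."""
    n = len(E)
    return sum(sq_dist(E[i], E[i + 1]) for i in range(n - 1)) / (n - 1)

def greedy_insert(E, s_new, score=r_dff):
    """Replace one existing shot by s_new, or drop s_new, to maximize the score."""
    best, best_score = E, score(E)
    for k in range(len(E)):
        cand = E[:k] + [s_new] + E[k + 1:]
        if score(cand) > best_score:
            best, best_score = cand, score(cand)
    return best

E = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]   # current shot features
E2 = greedy_insert(E, [5.0, 5.0])          # a distinctive new shot replaces one
```

Note that R_dff only visits adjacent pairs, so one evaluation costs O(|E|) instead of the O(|E|²) of R_fc, which is what makes it attractive for the online setting discussed in the efficiency experiments.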
Y. Shao et al.

3.3 Aesthetics Selection
In the second stage, the model evaluates aesthetic value from two aspects: symmetry and color.
Algorithm 2. Aesthetics Selection
Require: candidate set E; total frame number nframe; total shot number nshot
  rank = [(sh_1, sc_1), ..., (sh_nframe, sc_nframe)]  // calculate the scores and sort
  V ← []
  cntBeauty ← [0...0]  // record the number of chosen frames for each shot
  for (sh_i, sc_i) in rank do
    if |V| < nshot then
      cntBeauty[sh_i] += 1
      if cntBeauty[sh_i] >= length(sh_i) then
        add sh_i into V
      end if
    else
      break
    end if
  end for
  sort frames in V by time
  return V
Given the current frame, we first obtain its horizontally flipped picture frame_flip. Then the SIFT operator [21] is used to extract the key points of both, P_S = {p_1, …, p_k} and P_F = {q_1, …, q_k}, with corresponding features F_s = {f_1, …, f_k} and F_f = {g_1, …, g_k}. We score symmetry by comparing the corresponding points in frame and frame_flip, as shown in Eq. (3).

sem = (1/k) Σ_{i ∈ P_S} min_{j ∈ P_F} sc_{i,j}
sc_{i,j} = log(similarity(f_i, g_j) / dist(p_i, q_j)),  dist(p_i, q_j) > 50.0    (3)

In Sect. 4 it is experimentally found that the calculation of similarity has little impact; therefore, it is set to a constant value in the implementation. For the color score, we convert the original frame from the RGB gamut to the HSV gamut, where channel S represents saturation and channel V represents value. The average of channels S and V is then used as the color score, as shown in Eq. (4).

col_i = β Σ_{(x,y)} [s(x, y) + v(x, y)] · 0.5    (4)
Here, s(x, y) represents the saturation of point (x, y) in frame i, while v(x, y) is its value; β is a normalization factor, so that the score lies in [0, 1].
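A minimal sketch of the color score of Eq. (4), plus the weighted combination with a symmetry score used for the final aesthetics ranking. It uses Python's standard colorsys module for the RGB-to-HSV conversion; the tiny four-pixel "frame", the fixed symmetry score and the weight α are illustrative assumptions.

```python
import colorsys

# Color score of Eq. (4): mean of the HSV saturation and value channels,
# normalized to [0, 1]. The 4-pixel "frame" is an illustrative stand-in.

def color_score(frame_rgb):
    """frame_rgb: list of (r, g, b) tuples with components in [0, 1]."""
    beta = 1.0 / len(frame_rgb)               # normalization factor
    total = 0.0
    for r, g, b in frame_rgb:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        total += (s + v) * 0.5
    return beta * total

def aesthetics_score(sem, col, alpha=0.5):
    """Weighted combination of the symmetry and color scores."""
    return alpha * sem + (1 - alpha) * col

frame = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]
col = color_score(frame)                      # saturated pixels score high
sc = aesthetics_score(sem=0.8, col=col, alpha=0.6)
```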
Aesthetics-Driven Online Summarization to First-Person Tourism Videos
After calculating the scores of symmetry and color, the aesthetics score sc_i is calculated as Eq. (5), where α is a hyperparameter. Given that frame i belongs to shot sh_i, we can obtain a rank list Rank; shots are then selected based on this Rank until the final set V contains enough shots, as illustrated in Algorithm 2.

sc_i = α · sem_i + (1 − α) · col_i    (5)
4 Experiment

4.1 Offline Evaluation Based on Dataset
For the first part of the experiment, this paper selects, from the 25 videos in the SumMe dataset [17], the 8 videos that are closest to tourism scenarios. Two other baselines are selected for the comparison of F-measure scores [17]: one randomly selects 15% of the video frames, and the other is the method based on video interest scores proposed by Gygli et al. [17]. The best reasonable settings are selected for our model. The final experimental results are shown in Table 1. It can be seen that the F-measure score of the FoVlog model exceeds both the random method and the method proposed by Gygli et al. [17] on most of the chosen videos.

Table 1. The comparison of our method and others. The data of Gygli's method is from [16], and we choose the best score among settings as the score of FoVlog.
Video Name          Random  Gygli  FoVlog (best)
Base jumping        0.141   0.121  0.173
Bearpark climbing   0.156   0.118  0.209
Eiffel Tower        0.125   0.295  0.235
Notre Dame          0.137   0.235  0.134
Playing ball        0.145   0.174  0.198
Scuba               0.143   0.184  0.218
St Maarten Landing  0.143   0.313  0.416
Statue of Liberty   0.129   0.192  0.131
AVERAGE             0.140   0.204  0.214
4.2 Experiments on Aesthetics Rules
We first build a simple picture dataset: we select some photos and divide them into three categories: (1) photos of high aesthetic value taken by professional photographers; (2) ordinarily taken photos; (3) randomly taken photos. Then, the symmetry and color scores are calculated using the methods in Sect. 3.3. The corresponding results are shown in Fig. 2(a).
Fig. 2. Aesthetics scores of the chosen photos. (a) is the original method; (b) is the result after simplification. It can be seen that the aesthetics scores remain distinguishable.
4.3 Experiments on Efficiency
This section explains the time efficiency of the FoVlog model from two aspects. On the one hand, using R_dff brings greater computing speed than R_fc. On the other hand, the experiment verifies that some simplifications of the aesthetics scoring can still guarantee quality while speeding up the operation.

Table 2. Comparison of the frame loss rate of different diversity calculation methods. It can be seen from the table that, for this experimental video, using R_fc the frame loss rate already exceeds 2% when the number of shots is 10, while in the same case R_dff does not cause any frame drop.

Number of Shots  R_fc    R_dff
5                0.0007  0
10               0.0234  0
15               0.1221  0
In the experiment, the duration of one shot is 1 s, the frame rate is 16 FPS, and the number of shots in the final set is 5, 10 and 15, respectively. R_fc and R_dff are used to measure the distinctiveness of the shot set. The same video with a duration of 3:55 is used as the original video, and the frame loss rates of the two calculation methods are recorded. The results are shown in Table 2. Hence, in order to achieve a real-time effect, R_dff is better than R_fc. The experiments in this paper also find that the aesthetics scoring rules in the model are reasonable but complex, which has a negative impact on the performance of the whole system. To simplify this, we reduce the quality of
frames during the score calculation and discard the feature similarity values when computing sem_i. Figure 2(b) shows that the scores remain distinguishable after the simplification, while the time for scoring one photo is reduced from 8.84 s to 0.12 s.

4.4 Prototype Implementation
Fig. 3. APP interface. (a) is user configuration before recording. (b) is the process of recording and video summarization. (c) is the display of the results
Table 3. User waiting time in the second stage under different summary qualities. The higher the quality of the summary, the longer the user waiting time in the second stage, which seriously affects the user experience.

Quality  User Waiting Time
270P     00:15
480P     00:56
720P     03:21
1080P    06:08
In this paper, the FoVlog model is implemented on a Harmony mobile phone with 8.0 GB of main memory, as shown in Fig. 3. After testing it in real scenarios, we found that calculating the aesthetics scores and sorting the video frames cause
some user waiting time. Therefore, this paper measures the user waiting time of the second stage in the actual situation (five 1-s shots in the final summary), as shown in Table 3. Considering some indoor scenarios, some computing work can be distributed to an edge server. This paper proposes an edge-terminal collaborative mode for indoor scenarios, whose structure is shown in Fig. 4. After changing to this mode, the user waiting time in the second stage is expected to shorten, achieving a more real-time effect.
Fig. 4. The diagram of the edge-terminal collaborative structure. The tasks of the second stage are arranged on the edge server. During the recording process, the mobile terminal device shares the changes of the set E with the edge server, so that the second stage can be computed concurrently.
5
Summary and Expectation
This paper studies the problem of aesthetics-driven online summarization of first-person tourism videos on mobile devices with limited resources and designs an aesthetics-driven online summarization model for first-person tourism videos. Experiments have verified the effectiveness of this model in terms of both aesthetics and processing time. Finally, the whole system is implemented on mobile phones, and a mode of edge-terminal cooperation for indoor scenarios is proposed. We find that when indoor and outdoor scenes alternate in the same video, the aesthetics rules prefer the brighter pictures. So later work should pay more attention to the information fusion of the two stages and the sophisticated refinement of the aesthetics rules.
Acknowledgements. This project was supported by the National Outstanding Young Scientists Foundation of China (62025205), the National Key Research and Development Program of China (2019QY0600), and the National Natural Science Foundation of China (61960206008, 61725205).
References
1. Li, X., Zhao, B.: Video distillation. Scientia Sinica Inform. 51(5) (2021)
2. Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2513–2520 (2014)
3. Del Molino, A.G., Tan, C., Lim, J.H., et al.: Summarization of egocentric videos: a comprehensive survey. IEEE Trans. Hum.-Mach. Syst. 47(1), 65–76 (2016)
4. Yang, Y.Z., Liu, L., Fu, X.D., et al.: Multi-pedestrian tracking optimized by social force model under first-person perspective. J. Image Graph. (2020)
5. Hua, Z.Y.: Research on first person video summarization for the lost of the elderly. Harbin Institute of Technology (2017)
6. Lin, Y.L., Morariu, V.I., Hsu, W.: Summarizing while recording: context-based highlight detection for egocentric videos. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 51–59 (2015)
7. Zhang, Y., Liang, X., Zhang, D., et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognit. Lett. 130, 376–385 (2020)
8. Nagar, P., Rathore, A., Jawahar, C.V., et al.: Generating personalized summaries of day long egocentric videos. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
9. Guo, B., Zhang, Q.Y., Fang, Y.Y.: Computational aesthetics: visual aesthetics measurement and generation driven by computational science. Package Eng. 42(22), 62–77, 102 (2021)
10. Tong, H., Li, M., Zhang, H.-J., He, J., Zhang, C.: Classification of digital photos taken by photographers or home users. In: Aizawa, K., Nakamura, Y., Satoh, S. (eds.) PCM 2004. LNCS, vol. 3331, pp. 198–205. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30541-5_25
11. Datta, R., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach: U.S. Patent Application 12/116,578 (2008)
12. Nishiyama, M., Okabe, T., Sato, I., et al.: Aesthetic quality classification of photographs based on color harmony. In: CVPR, pp. 33–40. IEEE (2011)
13. Lu, X., Lin, Z., Jin, H., et al.: Rating image aesthetics using deep learning. IEEE Trans. Multimedia 17(11), 2021–2034 (2015)
14. Lu, X., Lin, Z., Shen, X., et al.: Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 990–998 (2015)
15. Sheng, K., Dong, W., Ma, C., et al.: Attention-based multi-patch aggregation for image aesthetic assessment. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 879–886 (2018)
16. Gygli, M., Grabner, H., Riemenschneider, H., et al.: The interestingness of images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1633–1640 (2013)
17. Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 505–520. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_33
18. Zhang, L., Jing, P., Su, Y., et al.: SnapVideo: personalized video generation for a sightseeing trip. IEEE Trans. Cybern. 47(11), 3866–3878 (2016)
19. Bettadapura, V., Castro, D., Essa, I.: Discovering picturesque highlights from egocentric vacation videos. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. IEEE (2016)
20. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893. IEEE (2005)
21. Lowe, D.G.: Object recognition from local scale-invariant features. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction Feiyang Liu1 , Bin Guo1(B) , Hao Wang1 , and Yan Liu2 1 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
{liufeiyang,wanghao456}@mail.nwpu.edu.cn, [email protected] 2 School of Computer Science, Peking University, Beijing 100871, China [email protected]
Abstract. Adequate perception and understanding of the user's visual context is an important part of a robot's ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) The video content has complex dynamic changes in both the temporal and spatial semantic spaces, which makes it difficult to extract accurate visual semantic information; (2) The user's attention during multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, so the dialogue agent must have fine-grained reasoning capabilities to understand the user dialogue context; (3) There is information redundancy and complementarity among multi-modal features, which requires reasonable processing of multi-modal information to enable the dialogue agent to gain a comprehensive understanding of the dialogue scene. To address the above challenges, this paper proposes a Transformer-based neural network framework to extract fine-grained visual semantic information through space-to-time and time-to-space bidirectional inference, and proposes a multi-modal fusion method based on the cross-attention framework, which enables multi-modal features to be fully interacted and fused in a cross manner. The experimental results show that, compared with the baseline model, the model in this paper improves by 39.5%, 32.1%, 19.7%, and 61.3% in the four metrics of BLEU, METEOR, ROUGE-L, and CIDEr, which represent the fluency, accuracy, adequacy, and recall of the generated conversation contents, respectively.
Keywords: Human-machine interaction · Human-machine dialogue · Scene awareness · Spatial-temporal reasoning · Cross-attention mechanism
1 Introduction Human-computer interaction, as a fundamental technology for information exchange between humans and machines in this age, has received widespread attention from academia and industry. Human-machine dialogue is the core area of HCI technology, which aims to maximize the imitation of human conversations, enabling machines to © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 337–351, 2023. https://doi.org/10.1007/978-981-99-2385-4_25
338
F. Liu et al.
communicate with humans in a more natural way. As natural human communication can span multiple modalities such as vision, hearing, and language, researchers hope that machines can improve the naturalness and efficiency of human-machine interactive dialogues by integrating and extracting effective information from multiple modalities so as to better serve users. Therefore, tasks related to audio visual scene-aware dialogues have received much attention in recent years. In 2019, Alamri et al. [1] first proposed the AVSD task while collecting short video dialogue data from CHARADES [2] to publish an audio-visual context-aware dialogue dataset, which covers scene-specific video and audio, conversation history, and video summaries, presenting a challenging dialogue system modeling task for multi-modal representation learning and inference learning. The following difficulties exist in this task: (1) the video content contains both spatial and temporal variations, making it more difficult to obtain semantic information from it; (2) the user's discourse may involve different spatial objects in different video clips over multiple rounds of dialogue, so the dialogue agent needs fine-grained inference capabilities; (3) the input data comes from multiple modalities such as text, audio, and visual content, and there is information redundancy and complementarity among multi-modal features, so the dialogue agent needs to reasonably process multi-modal information to obtain a comprehensive understanding of the dialogue context. This paper focuses on the visual scene-aware dialogue system for cross-modal intelligent human-machine interaction, extracting fine-grained semantic information from multi-modal features such as visual, auditory, and textual content containing complex spatial-temporal changes and rich semantic information.
A comprehensive understanding of the dialogue context is obtained by fusing multi-modal information, which leads to reasonable dialogue response generation. The main contributions of this paper include the following aspects:
1. A Transformer-based neural network framework guided by user questions is proposed to achieve fine-grained contextual semantic information extraction. A bidirectional spatial-temporal inference approach is used for visual semantic information extraction, which not only utilizes information at the spatial and temporal levels, but also learns the dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial inference, aiming to address the changing semantics of user questions in conversational environments.
2. A multi-modal fusion approach based on the cross-attention framework is proposed, where each modal feature can interact directly with the other modal features in pairs, allowing multi-modal features to be fully interacted and fused in a cross manner. The results are used as the contextual information for generating system responses, and strategies such as label smoothing, multi-pointer generation, and beam search are applied to enhance the prediction capability of the model.
3. The effectiveness of this paper's model was verified on the large-scale video conversation dataset AVSD; compared with the baseline model, this paper's model improved by 39.5%, 32.1%, 19.7%, and 61.3% in the four metrics of BLEU, METEOR, ROUGE-L, and CIDEr, which represent the fluency, accuracy, adequacy, and recall of the generated conversation content, respectively.
2 Related Work
2.1 Audio Visual Scene-Aware Dialog
Research related to audio visual scene-aware dialogue systems has recently made some progress. Hori et al. [3] proposed a sequence-to-sequence approach with a question-guided LSTM based on the visual and auditory features of videos. Sanabria et al. [4] extend previous work with more fine-grained visual features and transfer learning from video summarization tasks. Le et al. [5] propose question-aware attention mechanisms and a multi-modal Transformer for embedding the different modalities. Li et al. [7] proposed a multi-modal dialogue generation model based on a pre-trained language model and introduced a multi-task learning approach. Geng et al. [8] proposed to represent videos as spatial-temporal scene graphs that capture key audio visual cues and semantic structure to generate visual memories for each scene graph frame. Most current work typically employs pre-trained models for video feature extraction, such as 2D CNN models on video frames [9] and 3D CNN models on video clips [10]. However, these approaches focus on cues at the temporal level of visual content, ignoring finer-grained cues at the spatial level, simply integrating spatial features by equal-weight sum pooling to obtain a global representation at the temporal level. As a result, they are not ideal for studying complex problems involving local spatial objects [11]. Multi-round dialogue amplifies this drawback, as it allows users to explore different spatial objects in different video clips across multiple dialogue rounds. Le et al. [12] proposed an approach which not only emphasizes the equal importance of spatial and temporal cues, but also achieves dynamic information diffusion between the two visual feature spaces through a bidirectional temporal-to-spatial and spatial-to-temporal reasoning strategy.
Inspired by this, this paper adopts a Transformer-based neural network framework guided by user questions, and adopts a bidirectional spatial-temporal inference method for visual semantic information extraction and a dot-product self-attention mechanism for the extraction of the remaining modalities, in order to achieve fine-grained contextual semantic information extraction.
2.2 Multi-modal Fusion
In order to combine feature inputs from different modalities, Alamri et al. [1] simply concatenate multi-modal features and let them interact only through a fully connected layer, which is too simple and can lead to serious information loss. Le et al. [5] utilized the Transformer, making the target responses pass through the attention sub-layer of each modality in turn to achieve multi-modal information fusion. However, the authors did not provide a reasonable explanation for the sequential arrangement of the multi-modal fusion, and the multi-modal features still do not interact sufficiently with each other. In this paper, we propose a multi-modal fusion method based on a cross-attention framework, where each modal feature can interact with every other modal feature pairwise; this method aims to enable multi-modal features to be fully interacted and fused in a cross-attention manner.
3 Fine-Grained Contextual Semantic Information Extraction
This section designs a multi-modal Transformer network for fine-grained contextual semantic information extraction, where user questions are used to extract semantic information from visual, auditory, and textual content. This method first preprocesses the multi-modal features, encodes them into a continuous representation by different encoders, and then extracts semantic information based on the user questions by multi-modal Transformer networks, respectively (Fig. 1).
Fig. 1. The framework of fine-grained contextual semantic information extraction
3.1 Feature Preprocessing
The input to the visual context-aware dialogue task consists of three main modal data, i.e., visual data, auditory data, and text data, which includes: a video V, a summary of the video or a video description C, the user's current discourse Ht, and (t−1) rounds of conversation history (the current conversation is round t, and each round contains one user sentence Hi and one system response Ai). We represent the textual inputs such as the conversation history Xhis, user query input Xque, video summary Xcap, and system-generated responses Y as sequences of words, and each token is represented by the index of the corresponding token in the vocabulary V. The vocabulary is generated by traversing all text inputs. We note that when the vocabulary is too large, the model may be difficult to train and converge even after multiple training rounds, so we introduce a hyper-parameter cutoff and include a word in the vocabulary only when it occurs more than cutoff times.
Text Encoder
We use a text encoder to embed the text input X as a continuous representation Z ∈ R^{L×d}, where d is the embedding dimension and L denotes the length of the sequence X. It
consists of a token embedding layer and a layer normalization [13]. The embedding layer contains a trainable matrix E ∈ R^{|V|×d}, each row of which is a d-dimensional vector representing a token in the vocabulary, where |V| denotes the size of the vocabulary. We use E(X) to denote the embedding function that takes out the corresponding vector for each token in the input sequence X:

Zemb = E(X) ∈ R^{LX×d}    (1)
In order to incorporate positional information into the sequence of tokens, we adopt the approach in [6], where the position of each token is encoded using sine and cosine functions. The positional encoding is combined with the word embedding through element-wise addition, followed by layer normalization. The outputs include the conversation history Zhis, user discourse Zque, video summaries Zcap, and system-generated responses Zres, and all text sequences share the embedding matrix E. During training, the target output sequence is the input sequence shifted one position to the left, so that sentence prediction is done in an auto-regressive manner.
Video Encoder
For the visual features of the video, we use a pre-trained 3D-CNN model as a feature extractor to extract the spatial-temporal visual features. The dimension of the output features depends on the sampling step as well as the video clip length, and we denote the output of the feature extractor Zvis^pre ∈ R^{F×P×dvis^pre}, where F is the number of sampled video clips, P is the spatial dimension of a 3D-CNN layer, and dvis^pre is the feature dimension. The features are then passed through a linear layer with ReLU and a layer normalization in order to reduce the feature dimension from dvis^pre to d (d << dvis^pre), finally obtaining the visual features Zvis ∈ R^{F×P×d}. For the audio features, we utilize a pre-trained VGGish model as a feature extractor and obtain the audio features Zaud ∈ R^{F×d} by a similar step. During the training process, we freeze the pre-trained feature extractors and use the extracted features directly in our dialogue system.
3.2 Multi-modal Contextual Semantic Information Extraction
After obtaining the multi-modal features, the semantic information in them needs to be extracted. In this paper, we use a multi-modal semantic information extraction network based on an attention mechanism to extract relevant semantic information from each modal feature using user queries.
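As a concrete illustration, the sine/cosine positional encoding of [6] used by the text encoder can be sketched as follows (a minimal NumPy sketch, not the authors' code; the function name and toy sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(length: int, d: int) -> np.ndarray:
    """Standard positional encoding from [6]:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    (d is assumed even here)."""
    positions = np.arange(length)[:, None]            # (L, 1)
    dims = np.arange(0, d, 2)[None, :]                # (1, d/2)
    angles = positions / np.power(10000.0, dims / d)  # (L, d/2)
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoder adds PE element-wise to the token embeddings
# before layer normalization.
pe = sinusoidal_positional_encoding(length=12, d=8)
```

Note that at position 0 the encoding is exactly [0, 1, 0, 1, ...], and each dimension oscillates at a different frequency, so relative positions are linearly recoverable.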
Spatial Visual Semantic Information Extraction
In this module, user queries are used at each spatial position separately to extract relevant semantic information along the time steps. The user query features are first stacked at P spatial positions, denoted as Zque^stack ∈ R^{P×Lque×d}. For each spatial position, the network learns the relevance of the user query to each of the F time steps by the following attention mechanism:

Zt2s^(1) = Zvis^T Wt2s^(1) ∈ R^{P×F×datt}    (2)

Zt2s^(2) = Zque^stack Wt2s^(2) ∈ R^{P×Lque×datt}    (3)

St2s^(1) = Softmax(Zt2s^(2) Zt2s^(1)T) ∈ R^{P×Lque×F}    (4)

where datt is the attention layer dimension and Wt2s^(1), Wt2s^(2) ∈ R^{d×datt}. The attention score St2s^(1) is used to compute a weighted sum over the time steps at each spatial position of Zvis. The result of the weighted sum is mapped back to the original dimension d by a linear layer with ReLU, and Zque^stack is added to it by a skip connection to obtain the temporally attended visual semantic information, which we denote as Zt2s^t. Subsequently, user queries are again used to extract semantic information along the spatial dimension. We use a similar attention network to model the interaction between each token in the user query and each temporally attended spatial position:

Zt2s^(3) = Zt2s^t Wt2s^(3) ∈ R^{P×Lque×datt}    (5)

Zt2s^(4) = Zt2s^t Wt2s^(4) ∈ R^{P×Lque×datt}    (6)

St2s^(2) = Softmax(Zt2s^(3) Zt2s^(4)T) ∈ R^{Lque×P}    (7)

where Wt2s^(3), Wt2s^(4) ∈ R^{d×datt}. The attention score St2s^(2) is used to compute a weighted sum over all spatial positions of Zt2s^t. The temporal-to-spatial attended features are then summed with Zque by a skip connection; the result is denoted as Zt2s.
Temporal Visual Semantic Information Extraction
In this module, we use a similar process to extract semantic information, first obtaining spatially attended visual features and then spatial-to-temporal attended visual features. The main difference from the previous module is that we stack the user queries to F time steps at the beginning to get Zque^stack ∈ R^{F×Lque×d}; the subsequent calculation steps are similar to Eqs. (2) to (7). We denote the final output as Zs2t.
Audio Semantic Information Extraction
In this module, we use a network model based on an attention mechanism to extract the semantic information related to the user question from the audio features. First, with the user question as Q and the audio features as K and V, a multi-headed dot-product attention operation is performed, wrapped by a residual connection and followed immediately by a layer normalization. The output then enters a feed-forward network, which also has a residual connection and layer normalization. The output of this module is denoted as Zq2a.
Caption Semantic Information Extraction
Since video captions contain rich information about the video content, in this module we focus on the correlation between user questions and video captions, and through a network similar to the one used to extract audio semantic information, we obtain the video caption semantic features Zq2c.
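The query-guided attention pattern used throughout this section, e.g. the query-to-time attention of Eqs. (2)-(4), can be sketched as follows (a NumPy illustration with assumed toy shapes and random weights, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
P, F, L_que, d, d_att = 4, 6, 5, 8, 8  # spatial positions, time steps, query length, dims

Z_vis = rng.normal(size=(F, P, d))            # visual features, F x P x d
Z_que_stack = rng.normal(size=(P, L_que, d))  # user query stacked at P spatial positions
W1 = rng.normal(size=(d, d_att))              # plays the role of W_t2s^(1)
W2 = rng.normal(size=(d, d_att))              # plays the role of W_t2s^(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Z1 = Z_vis.transpose(1, 0, 2) @ W1          # (P, F, d_att), cf. Eq. (2)
Z2 = Z_que_stack @ W2                       # (P, L_que, d_att), cf. Eq. (3)
S = softmax(Z2 @ Z1.transpose(0, 2, 1))     # (P, L_que, F), cf. Eq. (4)
attended = S @ Z_vis.transpose(1, 0, 2)     # weighted sum over the F time steps
```

Each query token thus gets, at every spatial position, a distribution over time steps that sums to 1; the linear projection with ReLU and the skip connection described above are omitted for brevity.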
Fig. 2. The framework of multi-modal fusion reasoning and response generation
4 Multi-modal Fusion Reasoning and Response Generation
After obtaining the multi-modal fine-grained semantic information, since the information between the modalities has not directly interacted, this paper proposes a cross-attention framework to enable the multi-modal features to be fully interacted and fused in a cross manner. The results of multi-modal feature fusion, together with the conversation history, user queries, and the initial target sequence, enter the decoder to generate the target sequence in an auto-regressive manner. To generate more fluent and natural responses, three strategies, Label Smoothing [14], Multi-Pointer Generator [15], and beam search, are used to enhance the word prediction capability of the model (Fig. 2).
4.1 Multi-modal Fusion Reasoning
Cross-Attention Framework
Taking the video description features as an example, the details of the framework are as follows: first, three sets of multi-head attention operations are performed simultaneously, each with Q being the video caption feature Zq2c, and with K, V of the three sets being the audio feature Zq2a, the temporal feature Zt2s, and the spatial feature Zs2t, respectively. A residual connection of Zq2c is made around each set of attention, followed by a layer normalization. Then, each of the three groups is passed through a feed-forward network, also wrapped by a residual connection and followed by a layer normalization. Subsequently, the results of the three sets of operations are concatenated along the last dimension and then mapped back to the original dimension through a linear layer. The output of the final linear layer is summed with the initial Zq2c and then passed through a layer normalization. With a similar network, we perform the same cross-attention operation for the features of the other three modalities. We concatenate the results of the obtained
cross-attention for the four modalities with the user question features to obtain:

Zcrs = [Zque; Zt2s; Zs2t; Zq2a; Zq2c] ∈ R^{Lque×5d}    (8)

where ; denotes the concatenation operation. Then, an importance score is obtained using the following formula:

Scrs = Softmax(Zcrs Wcrs) ∈ R^{Lque×4}    (9)
where Wcrs ∈ R^{5d×4}. This importance score is used to obtain a weighted sum of the four modal features, and we denote the final multi-modal fusion vector as Zvid. Unlike [3] and others who treat each modality equally, this approach lets the model find the most appropriate fusion through training, potentially avoiding unnecessary noise, for example, by not considering auditory information when the question involves only visual content.
Response Reasoning Decoder
The purpose of this network is to decode the system's responses in an auto-regressive manner. At the beginning, a special start-of-sequence symbol is fed into the network model; subsequently the output token of the model is appended to this symbol and fed into the network model again to produce the next token. This process is repeated until the maximum sequence length limit is reached or a special end-of-sequence symbol is predicted. This network contains four attention sub-layers, which integrate textual cues and multi-modal fusion cues into the output token representation. The first layer is a self-attention layer, used to generate a clearer attention distribution over the current sequence. The second and third layers are used to capture the conversation history and the contextual information of the user's current question so that the response fits the current conversational context. The fourth layer is used to capture the semantic information in the multi-modal fusion vector to support response generation. At the i-th inference step, we write the output of the network as Zdec ∈ R^{i×d}.
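The importance-weighted fusion of Eqs. (8)-(9) can be sketched as follows (a NumPy illustration under assumed toy shapes and random values, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
L_que, d = 5, 8
# Query features plus the four query-attended modal features (toy values).
Z_que, Z_t2s, Z_s2t, Z_q2a, Z_q2c = (rng.normal(size=(L_que, d)) for _ in range(5))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Eq. (8): concatenate along the feature dimension -> (L_que, 5d).
Z_crs = np.concatenate([Z_que, Z_t2s, Z_s2t, Z_q2a, Z_q2c], axis=-1)
W_crs = rng.normal(size=(5 * d, 4))
# Eq. (9): one importance weight per modality, per query position.
S_crs = softmax(Z_crs @ W_crs)                         # (L_que, 4)

# Weighted sum of the four modal features -> fused vector Z_vid.
modal = np.stack([Z_t2s, Z_s2t, Z_q2a, Z_q2c], axis=1)  # (L_que, 4, d)
Z_vid = (S_crs[..., None] * modal).sum(axis=1)          # (L_que, d)
```

Because the weights come from a softmax, an uninformative modality (e.g. audio for a purely visual question) can be driven toward weight 0 during training.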
yi′ = (1 − ε) yi + ε · 1/(K − 1)    (10)
where y is the original label, y′ is the smoothed label, ε is a hyper-parameter with a default of 0.1, and K is the number of categories. This approach alleviates the problem of overly confident models to some extent, provides some robustness to noise, and compensates for the problem of insufficient supervised signals (relatively low information entropy)
in simple classification, increases the amount of information, and potentially enhances the generalization ability of the model.
Multi-Pointer Generator
The Pointer Generator is proposed in [15]; it retains the generative capability of the traditional seq2seq model while adding the copying capability of the Pointer Network. First, as in the seq2seq model, the context vector C is passed through a linear classifier to obtain the vocabulary distribution Pvocab; an attention mechanism then yields the attention scores of C on each token of the source sequence, which are mapped back to the vocabulary to obtain the pointer distribution Pptr. Finally, with a weight pgen, we compute the final output distribution Pout:

Pout = pgen × Pvocab + (1 − pgen) × Pptr    (11)
To enhance the generative ability of the model, a similar approach is used in this paper to emphasize the importance of tokens in the source text, i.e., the user queries as well as the video captions. After obtaining the output of the multi-modal inference network, a linear layer is first used to obtain the vocabulary distribution:

Pvocab = Softmax(Zdec Wvocab) ∈ R^{i×|V|}    (12)

where Wvocab ∈ R^{d×|V|}; since the semantic information of the input sequence and the generated responses is similar, we share the weights between Wvocab and E. Subsequently, Zdec is used to perform dot-product attention with the user questions and the video descriptions, respectively, to obtain attention scores Sd2q and Sd2c and attention results Zd2q and Zd2c. Sd2q and Sd2c map the scores of the corresponding tokens in the two sequences to their corresponding positions in the vocabulary, giving the user question distribution Pque and the video caption distribution Pcap. Zd2q and Zd2c participate in the following operations to obtain the weights of the three distributions:

Zgen = [Zres; Zdec; Zd2q; Zd2c] ∈ R^{i×4d}    (13)

Sdis = Softmax(Zgen Wgen) ∈ R^{i×3}    (14)

where Wgen ∈ R^{4d×3}. Finally, we use the importance scores Sdis to calculate the weighted sum of the three distributions to obtain the final output distribution Pout.
Beam Search
Beam search is a compromise between greedy search and exhaustive search, with a hyper-parameter named beam size k. At time step 1, we choose the k tokens with the highest conditional probability; these k tokens become the first token of each of the k candidate output sequences. In each subsequent time step, based on the k candidate output sequences from the previous time step, we continue to pick the k candidate output sequences with the highest probability from the k·|Y| (|Y| is the vocabulary size) possible choices. Finally, we obtain the sequence with the highest
product of conditional probabilities among the k·|Y|·T (T is the maximum sequence length) candidate sequences as the output sequence by selecting:

(1/L^α) log P(y1, …, yL | c) = (1/L^α) Σ_{t=1}^{L} log P(yt | y1, …, y_{t−1}, c)    (15)
where c is the context vector containing semantic information, L is the length of the final candidate sequence, and α is usually set to 0.75. By choosing k flexibly, beam search allows a trade-off between accuracy and computational cost.
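A minimal beam search sketch over a toy next-token model is shown below (illustrative only; the toy transition probabilities are made up, and the final selection uses the length-normalized log-probability of Eq. (15)):

```python
import math

# Toy next-token model P(next | last token). Vocabulary: 0 = end symbol, 1, 2.
PROBS = {
    None: [0.1, 0.6, 0.3],  # distribution for the first token
    1:    [0.3, 0.2, 0.5],
    2:    [0.5, 0.4, 0.1],
    0:    [1.0, 0.0, 0.0],  # the end symbol only repeats itself
}

def beam_search(k=2, max_len=4, alpha=0.75):
    beams = [([], 0.0)]  # (sequence, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            last = seq[-1] if seq else None
            for tok, p in enumerate(PROBS[last]):
                if p > 0.0:
                    candidates.append((seq + [tok], logp + math.log(p)))
        # Keep only the k highest-scoring of the (up to) k*|Y| candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Length-normalized score, Eq. (15): (1 / L^alpha) * sum of log-probs.
    return max(beams, key=lambda c: c[1] / (len(c[0]) ** alpha))[0]

best = beam_search()
```

With k = 1 this degenerates to greedy search; with k = |Y|^T it becomes exhaustive search, which is the trade-off described above.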
5 Experiments
5.1 Datasets
This paper uses AVSD, a large-scale video dialogue dataset from DSTC7 [16].
Fig. 3. A sample in the training set of the AVSD dataset

Table 1. Summary of AVSD dataset

              Training    Validation     Test
# of Dialogs     7,659         1,787    1,710
# of Turns     153,180        35,740   13,490
# of Words   1,450,754       339,006  110,252
Figure 3 shows a sample in the training set of the AVSD dataset, where C represents the video caption, Qi and Ai represent the i-th question and the i-th response, and S represents the video summary. Table 1 is a summary of this dataset.
As in [3], we apply the VGGish [17] model, pre-trained on a large-scale YouTube dataset, as the audio feature extractor, obtaining audio features with an embedding dimension of 128 and a sequence length varying with the audio length. A 3D-CNN ResNeXt-101 [19] model pre-trained on Kinetics [18] is used as the feature extractor for the visual features of the video. The video clips were sampled with a sampling step of 16 frames and a window size of 16 frames, finally yielding video features with an embedding dimension of 2048, 16 spatial positions, and a sequence length varying with the video length. In addition, the video caption of each sample in the dataset is merged with the video summary as the default video caption.
5.2 Baselines
• Baseline [3]: Proposes a sequence-to-sequence approach with a question-guided LSTM based on visual and auditory features of video.
• CMU Sinbad's (AVSD Winner) [4]: Performs transfer learning using more fine-grained visual features than the former and leverages video summaries.
• MTN [5]: A Transformer-based approach that sequentially fuses multi-modal features through a Transformer decoder framework.
• STSGR [8]: A hierarchical graph representation learning and Transformer inference framework for visual context-aware dialogues.
• Vx2TEXT [27]: A Vx2Text framework that generates text from a multi-modal input of "video + X" (X stands for text, voice, audio, etc.).
• BiST [12]: A textual-cue-based video high-resolution query framework that captures the complex visual nuances of videos through a bidirectional inference framework over the spatial-temporal dimensions.
• VGD-GPT2 [28]: Addresses video-grounded dialogue tasks using pre-trained language models.
5.3 Implementations
Unlike training, which starts with a complete response sentence, during testing the decoder always starts with a special symbol and generates words one by one in an auto-regressive manner.
Therefore, the following method is used in the training phase to simulate this test-time process: each reference sentence in the training set is cropped, with probability p (e.g., 0.5), at a random position i (2 < i < L), and the sequence to the left of i is retained as the new reference sentence. The feedforward network has 512 hidden units, and multi-head attention uses 8 heads. The maximum sequence length is 12; cutoff = 4, which controls the number of cut words; beam size = 5 for beam search; for label smoothing, ε is set to 0.1; the dropout rate [20] of all sublayers is 0.2 to prevent overfitting. Adam [21] is used as the optimizer with betas = (0.9, 0.98) and eps = 1e-9; the learning-rate schedule of [6] is adopted with 12,000 warm-up steps, and the model is trained for 50 epochs. All model parameters are initialized with Xavier initialization [22] before training.
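The truncation trick described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code, and the function name is ours:

```python
import random

# Hypothetical sketch of the reference-sentence truncation described above:
# with probability p, crop the reference at a random position i (2 < i < L)
# and keep only the tokens to the left of i, so that training better matches
# the word-by-word autoregressive decoding used at test time.
def truncate_reference(tokens, p=0.5, rng=random):
    L = len(tokens)
    if L > 3 and rng.random() < p:
        i = rng.randint(3, L - 1)  # a position i satisfying 2 < i < L
        return tokens[:i]
    return tokens

ref = ["a", "man", "is", "holding", "a", "bottle", "of", "water"]
out = truncate_reference(ref, p=0.5, rng=random.Random(0))
assert out == ref[:len(out)]       # the result is always a prefix
assert 3 <= len(out) <= len(ref)
```

Since the left context is preserved, the cropped sequence remains a valid partial response, matching the prefixes the decoder sees at test time.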
348
F. Liu et al.
5.4 Experiment Results

In this paper, system-generated responses are evaluated with several objective measures commonly used in natural language processing to compare the similarity between automatically generated sentences and reference responses: BLEU [23], METEOR [24], ROUGE-L [25], and CIDEr [26]. Table 2 presents the evaluation results of our model and the above models on the AVSD test set. Our model achieves the best results on three evaluation metrics, namely BLEU2, METEOR, and ROUGE-L, where a high BLEU score indicates fluent responses, a high METEOR score indicates accurate responses, and a high ROUGE-L score indicates responses with high recall. Our model performs much better than Baseline, Sinbad's, and MTN on all metrics, with improvements of 39.5%, 32.1%, 19.7%, and 61.3% on BLEU4, METEOR, ROUGE-L, and CIDEr, respectively, over the baseline model.

Table 2. Results of experiments

| Model    | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE-L | CIDEr |
|----------|-------|-------|-------|-------|--------|---------|-------|
| Baseline | 0.626 | 0.485 | 0.383 | 0.309 | 0.215  | 0.487   | 0.746 |
| Sinbad's | 0.718 | 0.584 | 0.478 | 0.394 | 0.267  | 0.563   | 1.094 |
| MTN      | 0.731 | 0.597 | 0.494 | 0.410 | 0.274  | 0.569   | 1.129 |
| STSGR    | -     | -     | -     | 0.133 | 0.165  | 0.362   | 1.272 |
| Vx2TEXT  | 0.361 | 0.260 | 0.197 | 0.154 | 0.178  | 0.393   | 1.605 |
| BiST     | 0.755 | 0.619 | 0.510 | 0.429 | 0.284  | 0.581   | 1.192 |
| VGD-GPT2 | 0.749 | 0.620 | 0.520 | 0.436 | 0.282  | 0.582   | 1.194 |
| Ours     | 0.753 | 0.620 | 0.515 | 0.431 | 0.284  | 0.583   | 1.203 |
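As a concrete illustration of what these n-gram metrics measure, the clipped n-gram precision at the heart of BLEU can be sketched as follows. This is a minimal didactic example, not the evaluation code used in the experiments:

```python
from collections import Counter

# Minimal didactic illustration (not the paper's evaluation code) of the
# clipped n-gram precision underlying BLEU: each candidate n-gram is
# counted at most as many times as it appears in the reference.
def modified_precision(candidate, reference, n=1):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()
# 5 of the 6 candidate unigrams survive clipping ("the" occurs twice in the
# candidate but only once in the reference), giving a precision of 5/6.
assert abs(modified_precision(cand, ref, n=1) - 5 / 6) < 1e-9
```

The full BLEU score combines such precisions for n = 1..4 with a brevity penalty, which is why a response can score well on BLEU1 yet poorly on BLEU4.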
It can also be noted that, compared to BiST, which achieved the best score on BLEU1, our model is second on BLEU1 by a margin of only 0.002 points, while it scores higher than or equal to BiST on the remaining six metrics. Compared to VGD-GPT2, which also achieved the best score on three metrics, our model scores higher than or equal to VGD-GPT2 on five metrics. Examining the CIDEr metric, we find that it focuses on the word correspondence between the candidate sentence and the reference sentence without considering whether the candidate sentence is fluent. We therefore conjecture that STSGR and Vx2TEXT are extreme in their pursuit of word recall, packing as many words as possible into their generated responses while ignoring fluency. If these two extreme models are excluded, the model in this paper also has the best performance on the CIDEr metric.

5.5 Ablation Study

To verify the validity of the constituent modules of the proposed model, the following ablation experiments were conducted:
(1) With/without the cross-attention framework.
(2) With a linear/single-pointer/multi-pointer generator: the linear generator uses only one linear classifier for word prediction; the single-pointer generator uses only the user query as the source text; the multi-pointer generator uses both the user query and the video caption as the source text.
Table 3. Ablation study based on the use of the cross-attention framework or not

| Cross-attention framework | BLEU4 | METEOR | ROUGE-L | CIDEr |
|---------------------------|-------|--------|---------|-------|
| N                         | 0.415 | 0.278  | 0.575   | 1.152 |
| Y                         | 0.421 | 0.282  | 0.582   | 1.168 |

Table 4. Ablation study based on different generators

| Generator | BLEU4 | METEOR | ROUGE-L | CIDEr |
|-----------|-------|--------|---------|-------|
| Linear    | 0.400 | 0.273  | 0.566   | 1.117 |
| Single    | 0.411 | 0.274  | 0.568   | 1.138 |
| Multiple  | 0.416 | 0.277  | 0.571   | 1.145 |
From Table 3, we note that including the cross-attention framework increases the model's score on every metric. This result demonstrates the important contribution of the proposed cross-attention framework to multimodal interaction: it allows each modality's features to learn relevant semantic information from the other modalities in a crossed manner. From Table 4, we note that the scores on every metric are lowest when only the linear classifier is used, and that the model with the multi-pointer generator outperforms the model with the single-pointer generator. This proves the effectiveness of the multi-pointer generator, which retains the linear classifier's ability to generate new words while also being able to copy words from the source text, thus improving the quality of the generated responses.
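The copy mechanism of the multi-pointer generator can be sketched schematically as follows. The function and variable names are ours, not the paper's; the mixing gates are assumed to come from a softmax (non-negative, summing to 1), which is what guarantees a valid output distribution:

```python
import numpy as np

# Schematic sketch (ours) of how a multi-pointer generator can mix a
# vocabulary distribution with copy distributions over two source texts
# (user query and video caption). Gates are assumed softmax-normalized.
def multi_pointer_mix(p_vocab, attn_query, attn_caption,
                      query_ids, caption_ids, gates):
    g_vocab, g_query, g_caption = gates
    p_final = g_vocab * p_vocab
    for a, tok in zip(attn_query, query_ids):      # copy from the user query
        p_final[tok] += g_query * a
    for a, tok in zip(attn_caption, caption_ids):  # copy from the caption
        p_final[tok] += g_caption * a
    return p_final

V = 10
p_vocab = np.full(V, 1.0 / V)                      # uniform vocab distribution
attn_q, q_ids = np.array([0.7, 0.3]), [2, 5]       # attention over 2 query tokens
attn_c, c_ids = np.array([1.0]), [7]               # attention over 1 caption token
p = multi_pointer_mix(p_vocab, attn_q, attn_c, q_ids, c_ids, (0.5, 0.3, 0.2))
assert abs(p.sum() - 1.0) < 1e-9                   # still a valid distribution
```

Because the copy terms add probability mass only to tokens that occur in the source texts, words from the query and caption become easier to reproduce verbatim while the vocabulary term keeps the ability to generate new words.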
6 Conclusion

In this paper, we focus on visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction and propose a Transformer-based neural network framework, guided by user questions, for fine-grained extraction of video semantic information. The proposed cross-attention framework allows multimodal features to interact and be integrated fully in an intersecting manner. The experimental results show that our model outperforms the baseline models on the AVSD dataset and achieves the highest scores on BLEU2, METEOR, and ROUGE-L.
In the future, we will further explore methods to use multimodal information more efficiently (e.g., hierarchical transformers) and apply transfer learning to this work, which may yield a more capable visual scene-aware dialogue system that enables truly intelligent human-computer interaction.
References

1. Alamri, H., et al.: Audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7558–7567 (2019)
2. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_31
3. Hori, C., et al.: End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2352–2356. IEEE (2019)
4. Sanabria, R., Palaskar, S., Metze, F.: CMU Sinbad's submission for the DSTC7 AVSD challenge. In: DSTC7 at AAAI 2019 Workshop (2019)
5. Le, H., Sahoo, D., Chen, N.F., Hoi, S.C.: Multimodal transformer networks for end-to-end video-grounded dialogue systems (2019)
6. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
7. Li, Z., Li, Z., Zhang, J., Feng, Y., Zhou, J.: Bridging text and video: a universal multimodal transformer for audio-visual scene-aware dialog. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2476–2483 (2021)
8. Geng, S., et al.: Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 2, pp. 1415–1423 (2021)
9. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
10. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
11. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2758–2766 (2017)
12. Le, H., Sahoo, D., Chen, N.F., Hoi, S.C.: BiST: bi-directional spatio-temporal reasoning for video-grounded dialogues (2020)
13. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (2016)
14. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
15. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
16. Yoshino, K., et al.: Dialog system technology challenge 7 (2019)
17. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)
18. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555 (2018)
19. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)
20. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014)
22. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
23. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
24. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)
25. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
26. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
27. Lin, X., Bertasius, G., Wang, J., Chang, S.-F., Parikh, D., Torresani, L.: Vx2Text: end-to-end learning of video-based text generation from multimodal inputs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7005–7015 (2021)
28. Le, H., Hoi, S.C.: Video-grounded dialogues with pretrained generation language models (2020)
A Weighting Possibilistic Fuzzy C-Means Algorithm for Interval Granularity Yiming Tang1,2(B) , Lei Xi1 , Wenbin Wu1 , Xi Wu1 , Shujie Li1 , and Rui Chen1 1 Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine,
School of Computer and Information, Hefei University of Technology, Hefei 230601, Anhui, China [email protected] 2 Engineering Research Center of Safety Critical Industry Measure and Control Technology, Ministry of Education, Hefei University of Technology, Hefei 230601, Anhui, China
Abstract. Granular clustering is an emerging branch of the clustering field. However, existing granular clustering algorithms are still immature in the weight setting of granular data and in noise resistance. In this study, a weighting possibilistic fuzzy c-means algorithm for interval granularity (WPFCM-IG) is proposed. To begin with, a new weight-setting method for interval granular data is given: the principle of justifiable granularity is used as the evaluation criterion for granular data, and a weight is assigned to each granular datum from the two perspectives of coverage and specificity to measure its quality. In addition, the idea of possibilistic clustering is introduced, which helps to improve noise resistance; moreover, with the proposed weights of interval granular data, the influence of data with smaller weights on the clustering results can be reduced during clustering. Based on these ideas, the WPFCM-IG algorithm is put forward, and its core idea, formula derivation, and implementation process are described. Finally, the performance of the proposed algorithm is verified by comparison experiments on artificial and UCI datasets. The experimental results show that the WPFCM-IG algorithm outperforms other advanced algorithms in this field in terms of reconstruction error. Moreover, the WPFCM-IG algorithm yields a smoother cooperative relationship curve between the fuzzy coefficient and the reconstruction error than the other algorithms, so WPFCM-IG can better optimize the fuzzy coefficient.

Keywords: Granular computing · Fuzzy clustering · Granular data · Coverage · Specificity
1 Introduction

Cluster analysis has achieved significant application results in biology, e-commerce, pattern recognition, and image processing [1–4]. Clustering includes two branches: hard clustering and fuzzy clustering. Hard clustering simply places a piece of data strictly in one category. However, the limitations of this approach are obvious: in complex environments, data generally cannot be assigned to a single specific category, which

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 352–366, 2023. https://doi.org/10.1007/978-981-99-2385-4_26
is prone to misclassification in extreme cases and cannot adapt well to complex and changeable environments [5]. To solve this problem, Zadeh proposed fuzzy set theory [6]. Scholars then introduced fuzzy set theory into clustering, giving rise to the idea of fuzzy clustering. Although there are many fuzzy clustering algorithms, those based on an objective function remain the most widely used, the most typical being the fuzzy c-means (FCM) algorithm [7, 8]. However, FCM is sensitive to the initialization of cluster centers, and noise points also affect its clustering performance. Subsequently, Krishnapuram and Keller proposed the possibilistic c-means (PCM) algorithm [9]. Compared with FCM, PCM handles noise better, but it suffers from the problem that cluster centers tend to coincide. On this basis, the possibilistic fuzzy c-means (PFCM) algorithm was proposed by Pal et al. [10]. In addition, to better handle high-dimensional data, the concept of kernel functions was introduced to map high-dimensional data into a kernel space, and a kernel-based FCM algorithm named KFCM [11] was proposed on the basis of FCM. It was then found that the features of each data dimension affect clustering differently, so methods that weight different data dimensions were proposed. WFCM [7] better fits the information carried by the data through feature weighting, thereby improving the clustering effect. However, when feature weights are introduced into the Euclidean distance for clustering, some attributes of high-dimensional data are irrelevant to clustering, and the Euclidean distance cannot distinguish these attributes well. Therefore, Zhou et al. presented EWFCM, a weighted fuzzy c-means algorithm with maximum entropy regularization [12].
The algorithm realizes feature selection by assigning weights to data features. The weighted possibilistic c-means clustering algorithm (WPCM) proposed by Schneider [13] assigns low weights to noise points, which effectively reduces the impact of noisy data on the clustering results. Bahrampour et al. proposed weighted and constrained possibilistic c-means clustering (WCPCM) [14], which introduces local attribute weighting so that objects with corresponding features can be assigned different weights, making the clustering process more targeted. Bose and Mali [15] put forward a type-reduced possibilistic fuzzy clustering algorithm. Granular computing has attracted much attention from scholars in recent years. An information granule is a more general and abstract entity than an ordinary number, and it is the core structure of granular computing. The concept of an information granule [16] refers to an entity constructed on the basis of similarity, proximity, spatial relationships, etc. Information granules and their processing form the field of granular computing. Generally, information granules are represented as intervals, fuzzy sets, rough sets, and so on. Recently, clustering granular data has attracted more and more attention, and granular clustering has become a new and interesting research hotspot. In this emerging field, a small number of scholars have begun to put forward granular clustering algorithms [17–19]. For example, Zarandi and Razaee [20] proposed a granular clustering algorithm based on the Wasserstein distance (WDFCM). Effati et al. [21] proposed A-CFCM, a fuzzy clustering algorithm based on α-cut distances. In
addition, Shen et al. [22] presented a weighted fuzzy c-means algorithm for granular data (WFCM-G). Although the granular clustering algorithms above work well, two problems remain:

1) Weight setting for granular data. Weighting is an important way to improve the performance of granular clustering. After the granular data is constructed, appropriate criteria for measuring its quality should be selected, considering the setting of weights from the perspectives of both coverage and specificity. At present, granular clustering algorithms either do not consider these two aspects of the weights or provide only a simple weighting mechanism.

2) Noise resistance of granular clustering algorithms. The existing granular clustering algorithms still handle noise crudely. For example, A-CFCM and WFCM-G are essentially direct adaptations of the FCM algorithm and are likewise sensitive to noise points. Improving the noise resistance of these algorithms is one of the key issues to be considered.

To address these problems, this paper proposes a weighting possibilistic fuzzy c-means clustering algorithm for interval granularity (WPFCM-IG).
2 Weights of Interval Granular Data

Suppose there is a data set Y = \{y_1, y_2, \ldots, y_N\}, y_k \in R^n. These original data are represented as granular data through the principle of justifiable granularity (PJG), namely G = \{g_1, g_2, \ldots, g_M\}, where g_s denotes the s-th granular datum. For example, [a, b] is an interval datum, in which a and b represent the lower and upper bounds of the interval, respectively. Two characteristics related to the weight of a granular datum are coverage and specificity. Coverage reflects the number of data points contained in the information granule, while specificity is related to the length of each feature range of the granule. We compute the weight as the product of coverage and specificity, so the quality of the information granule g_s is expressed as

H_s = P_s \times \frac{1}{n} \sum_{l=1}^{n} SD_{sl}.   (1)

Here P_s represents the coverage of the s-th granular datum, and SD_{sl} denotes the interval specificity of the l-th feature of the s-th granular datum; the specificity of the s-th granular datum is obtained by averaging the specificities of its features, and n is the number of features. Coverage and specificity are computed as

P_s = \frac{1}{N} \, \mathrm{card}\{y_k \mid y_k \in g_s, \; k = 1, \ldots, N\}, \quad s = 1, \ldots, M,   (2)

SD_{sl} = 2 - \frac{b_{sl} - a_{sl}}{\phi_l}.   (3)

Here card denotes the number of elements in a set. Furthermore, we adopt the normalized coefficient

w_s = \frac{H_s}{\max(H_s)},   (4)

where \max(H_s) represents the largest quality value among the granular data.
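For intuition, the weight computation of this section can be sketched for one-dimensional interval granules as follows. This is our illustrative code, with n = 1 so the specificity average in (1) reduces to a single term, and with φ taken as the data range:

```python
import numpy as np

# Illustrative sketch (ours) of the weight computation for 1-D interval
# granules [a, b]: coverage P_s is the fraction of data points inside the
# interval (eq. 2), specificity SD_s = 2 - (b - a)/phi (eq. 3) with phi the
# data range, H_s = P_s * SD_s (eq. 1 with n = 1), and w_s = H_s / max(H_s)
# (eq. 4).
def granule_weights(data, intervals):
    data = np.asarray(data, dtype=float)
    phi = data.max() - data.min()
    H = []
    for a, b in intervals:
        coverage = np.mean((data >= a) & (data <= b))
        specificity = 2.0 - (b - a) / phi
        H.append(coverage * specificity)
    H = np.asarray(H)
    return H / H.max()

data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 9.0, 10.0]
w = granule_weights(data, [(0.0, 5.0), (8.0, 10.0)])
assert w.max() == 1.0            # weights are normalized by the largest H_s
```

A wide interval covering many points and a narrow interval covering few both end up with moderate quality: the product rewards granules that balance coverage against specificity, and the normalization makes the best granule the reference point.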
3 A Weighting Possibilistic Fuzzy C-Means Algorithm for Interval Granularity

3.1 The Idea of the Proposed Algorithm

The objective function of the WPFCM-IG algorithm consists of two parts. The first part introduces a typicality matrix on the basis of the FCM objective function and integrates the weights of the granular data into it; the second part controls the typicality matrix. The objective function of the proposed WPFCM-IG algorithm is therefore

J_{GWPFCM} = \sum_{i=1}^{c} \sum_{s=1}^{M} (a u_{is}^{m} + b t_{is}^{p}) \, w_s \, \|g_s - v_i\|^2 + \frac{\eta}{m^2 c} \sum_{i=1}^{c} \theta_i \sum_{s=1}^{M} (1 - t_{is})^{p}.   (5)

Here G = \{g_s\}_{s=1,\ldots,M} is the set of M constructed granular data. The WPFCM-IG algorithm strives to obtain clustering prototypes that lie in dense areas and are relatively dispersed. c is the number of clusters, and u_{is} represents the similarity between the granular datum g_s and the cluster center v_i, subject to the restriction \sum_{i=1}^{c} u_{is} = 1. The typicality t_{is} in (5) relaxes the restriction that the memberships of each granular datum sum to 1, so as to better reflect how typical each granular datum g_s is of the i-th cluster; T = [t_{is}]_{c \times M} denotes the typicality matrix. In addition, w_s is the weight of the s-th granular datum; m and p are the fuzzy coefficient and the typicality coefficient, respectively (m > 1, p > 1); and a, b, \theta_i (i = 1, \ldots, c) are parameters. The parameter \eta measures the compactness of the data set; it is introduced because exploiting the compactness and separation characteristics of the data set makes the clustering result more effective. It is calculated as

\eta = \frac{\sum_{s=1}^{M} \|g_s - \bar{g}\|^2}{M}, \qquad \bar{g} = \frac{\sum_{s=1}^{M} g_s}{M}.   (6)

The constraints of the objective function are

\sum_{i=1}^{c} u_{is} = 1, \qquad 0 \le u_{is} \le 1.   (7)
3.2 Derivation of Iterative Formulas

Using the Lagrange multiplier method, the Lagrange function corresponding to (5) can be constructed as

J = \sum_{i=1}^{c} \sum_{s=1}^{M} (a u_{is}^{m} + b t_{is}^{p}) \, w_s \, \|g_s - v_i\|^2 + \frac{\eta}{m^2 c} \sum_{i=1}^{c} \theta_i \sum_{s=1}^{M} (1 - t_{is})^{p} + \lambda \Big( 1 - \sum_{i=1}^{c} u_{is} \Big).   (8)

The necessary conditions for minimizing (8) are

\frac{\partial J}{\partial u_{is}} = 0, \quad \frac{\partial J}{\partial v_i} = 0, \quad \frac{\partial J}{\partial t_{is}} = 0, \qquad i = 1, \ldots, c, \; s = 1, \ldots, M.   (9)

By the Lagrange multiplier method, we obtain

u_{is} = \frac{\|g_s - v_i\|^{-\frac{2}{m-1}}}{\sum_{j=1}^{c} \|g_s - v_j\|^{-\frac{2}{m-1}}},   (10)

v_i = \frac{\sum_{s=1}^{M} (a u_{is}^{m} + b t_{is}^{p}) \, w_s \, g_s}{\sum_{s=1}^{M} (a u_{is}^{m} + b t_{is}^{p}) \, w_s},   (11)

t_{is} = \frac{1}{1 + \left( \frac{b c m^2 w_s \|g_s - v_i\|^2}{\eta \theta_i} \right)^{\frac{1}{p-1}}}.   (12)
Thus, the overall idea and principle of the WPFCM-IG algorithm have been completely introduced.

3.3 Algorithm Framework

After the granular data is constructed, the WPFCM-IG algorithm is used to cluster the granular data. The execution process of the algorithm is shown in Table 1.
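A compact numerical sketch of one round of the updates (10)-(12) for one-dimensional granule centers is given below. This is our illustration, not the authors' implementation, and it is meant only to mirror the formulas:

```python
import numpy as np

# Sketch (ours) of one WPFCM-IG iteration: memberships (10), typicalities
# (12), and prototype update (11) for 1-D granule centers g_s with weights
# w_s; a, b, m, p, eta, theta are the parameters of objective (5). The tiny
# constant added to d2 only avoids division by zero.
def wpfcm_ig_step(g, w, v, a=1.0, b=1.0, m=2.0, p=2.0, eta=1.0, theta=None):
    c = len(v)
    theta = np.ones(c) if theta is None else theta
    d2 = (g[None, :] - v[:, None]) ** 2 + 1e-12        # ||g_s - v_i||^2
    inv = d2 ** (-1.0 / (m - 1.0))                     # eq. (10): memberships
    u = inv / inv.sum(axis=0, keepdims=True)
    t = 1.0 / (1.0 + (b * c * m**2 * w[None, :] * d2   # eq. (12): typicalities
                      / (eta * theta[:, None])) ** (1.0 / (p - 1.0)))
    coef = (a * u**m + b * t**p) * w[None, :]          # eq. (11): prototypes
    v_new = (coef * g[None, :]).sum(axis=1) / coef.sum(axis=1)
    return u, t, v_new

g = np.array([0.0, 0.1, 0.2, 5.0, 5.1])                # two well-separated groups
w = np.ones_like(g)
u, t, v = wpfcm_ig_step(g, w, v=np.array([0.5, 4.5]))
assert np.allclose(u.sum(axis=0), 1.0)                 # constraint (7) holds
assert v[0] < 1.0 < 4.0 < v[1]                         # prototypes stay separated
```

Because the prototype update weights each granule by both its membership/typicality and its quality weight w_s, low-weight (e.g., noisy) granules pull the prototypes less, which is the intended anti-noise behavior.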
Table 1. The execution process of the WPFCM-IG algorithm.
4 Experiment and Analyses

To verify the clustering performance of the proposed WPFCM-IG algorithm on granular data, we carried out a set of experiments. The clustering effects of the proposed granularity-weighted WPFCM-IG algorithm are compared with the fuzzy data clustering algorithm WDFCM [20], the fuzzy clustering algorithm A-CFCM [21], and the weighted FCM clustering algorithm for granular data, WFCM-G [22]. The data sets used in the experiments include 2 artificial data sets and 4 UCI data sets [23]. The artificial data sets DATA1 and DATA2 both consist of points with Gaussian distributions, generated with a data-generation tool. The tested UCI data sets are Aggregation, Ecoli, Authentication, and Vowel_context, all of which are frequently used. Table 2 provides the basic information of the 2 artificial data sets and 4 UCI data sets: the total number of samples and the number of features. In the experiments, the number of clusters c and the fuzzy coefficient m are optimized by repeated adjustment.

4.1 Evaluation Index

The reconstruction criterion (RC) is a reliable way to quantify the performance of clustering algorithms on numerical data [24, 25]. In terms of the ability to represent the original data, the smaller the difference between the original data and the reconstructed data, the better the performance of the clustering algorithm. The procedure includes the following steps. Firstly, the WPFCM-IG algorithm is used to cluster the M granular data g_s into c groups; the membership matrix and clustering prototypes obtained by iterating on the constructed granular data are U and V = {v_1, v_2, ..., v_c}, respectively.
Table 2. Statistics of experimental data sets.

| Name           | Instances | Attributes |
|----------------|-----------|------------|
| DATA1          | 150       | 2          |
| DATA2          | 300       | 2          |
| Aggregation    | 788       | 2          |
| Ecoli          | 336       | 7          |
| Authentication | 1372      | 4          |
| Vowel_context  | 990       | 10         |
Secondly, the degranulation process is determined by minimizing the distance

\min \sum_{i=1}^{c} u_{is}^{m} \, \|\hat{g}_s - v_i\|^2.   (13)

Here m is the fuzzy coefficient and u_{is} represents the membership degree of the s-th granular datum g_s with respect to the i-th clustering prototype v_i; both U and V come from the granulation process. We finally obtain the reconstructed granular data based on the c prototypes v_i:

\hat{g}_s = \frac{\sum_{i=1}^{c} u_{is}^{m} v_i}{\sum_{i=1}^{c} u_{is}^{m}}.   (14)

Thirdly, we sum the differences between each pair of original granular datum g_s and reconstructed datum \hat{g}_s, and use this sum to represent the reconstruction error between the original granular data G and the reconstructed granular data \hat{G}:

E = \sum_{s=1}^{M} \|g_s - \hat{g}_s\|^2.   (15)
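The degranulation and reconstruction-error computation can be sketched as follows (our illustrative code for the one-dimensional case):

```python
import numpy as np

# Sketch (ours) of the degranulation step (14) and reconstruction error (15):
# each granule center g_s is rebuilt from the prototypes v_i via its
# membership degrees, and E sums the squared reconstruction differences.
def reconstruction_error(g, u, v, m=2.0):
    g_hat = (u**m * v[:, None]).sum(axis=0) / (u**m).sum(axis=0)   # eq. (14)
    return float(((g - g_hat) ** 2).sum()), g_hat                  # eq. (15)

g = np.array([0.0, 1.0, 4.0, 5.0])
v = np.array([0.5, 4.5])
u = np.array([[0.9, 0.8, 0.1, 0.0],       # memberships to prototype v_1
              [0.1, 0.2, 0.9, 1.0]])      # memberships to prototype v_2
E, g_hat = reconstruction_error(g, u, v)
assert E < ((g - g.mean()) ** 2).sum()    # better than a single-point summary
```

A small E means the prototypes and memberships jointly retain enough information to rebuild the granules, which is why RC serves as a quality measure for the whole granulation–degranulation cycle.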
4.2 Artificial Data Sets

We first discuss the artificial data sets, whose basic information is given in Table 2. DATA1 contains 150 data points with 2 features, and DATA2 contains 300 data points. Figure 1 shows the data distribution of DATA2. We first select some high-density points from the 300 data points in DATA2 to build the granular data. After the granular data is formed, the weight of each granular datum is calculated; the four granular clustering algorithms are then used to cluster them, and the reconstruction errors produced by the four algorithms are calculated.
The reconstruction criterion RC can be used to optimize the fuzzy coefficient m, that is, to help determine the best value of m for different numbers of clusters. Table 3 and Table 4 show the reconstruction errors produced by the proposed WPFCM-IG algorithm and the other three granular clustering algorithms on the data sets DATA1 and DATA2, respectively, where the number of clusters c takes the values 3, 6, and 9.
Fig. 1. Data distribution of DATA2.
For the data set DATA1 in Table 3, when the number of clusters c is 3, the reconstruction errors of WPFCM-IG and WFCM-G reach their minimum at fuzzy coefficient m = 1.21, while WDFCM and A-CFCM both achieve their minimum error at m = 1.31. When c = 6, the reconstruction errors of WPFCM-IG and WFCM-G are minimized at m = 1.11, and the errors of the other two algorithms are minimized at m = 1.21. When c = 9, all four clustering algorithms have their smallest reconstruction error at m = 1.01. It can be seen from Table 3 and Table 4 that the granularity-weighted WPFCM-IG algorithm has the smallest reconstruction error among the four granular clustering algorithms regardless of whether the number of clusters is 3, 6, or 9.

Table 3. Comparative analyses of interval data on DATA1.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.31 | 27.62 ± 3.03 | 1.31 | 27.85 ± 3.35 | 1.21 | 26.83 ± 2.53 | 1.21 | 26.12 ± 2.39 |
| 6 | 1.21 | 22.72 ± 2.78 | 1.21 | 22.96 ± 2.85 | 1.11 | 21.75 ± 2.24 | 1.11 | 21.36 ± 2.32 |
| 9 | 1.01 | 18.87 ± 1.73 | 1.01 | 19.26 ± 1.63 | 1.01 | 17.65 ± 1.61 | 1.01 | 17.33 ± 1.52 |
Table 4. Comparative analyses of interval data on DATA2.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.41 | 46.26 ± 4.33 | 1.41 | 46.34 ± 4.28 | 1.31 | 43.42 ± 4.16 | 1.31 | 42.37 ± 3.57 |
| 6 | 1.31 | 37.15 ± 3.86 | 1.31 | 36.95 ± 4.02 | 1.21 | 35.66 ± 3.93 | 1.21 | 35.58 ± 3.54 |
| 9 | 1.21 | 30.85 ± 3.21 | 1.21 | 31.32 ± 3.09 | 1.11 | 29.27 ± 2.56 | 1.11 | 28.63 ± 2.48 |
4.3 UCI Data Sets

In the clustering and optimization process of the four granular clustering algorithms, the value of the fuzzy coefficient m is increased gradually with a step size of 0.1. Table 5, Table 6, Table 7, and Table 8 compare the error values obtained by the four granular data clustering algorithms on the four UCI data sets, where the number of clusters c is again set to 3, 6, and 9. We see that the WPFCM-IG algorithm performs better than the other three clustering algorithms in all cases.

Table 5. Comparative analyses of interval data on the Aggregation dataset.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.31 | 1978.43 ± 80.32 | 1.41 | 1987.05 ± 86.76 | - | 1837.91 ± 76.32 | 1.31 | 1834.45 ± 65.66 |
| 6 | 1.21 | 1621.64 ± 76.77 | 1.31 | 1605.53 ± 79.73 | 1.31 | 1540.03 ± 70.36 | 1.21 | 1534.43 ± 63.25 |
| 9 | 1.01 | 1415.25 ± 72.31 | 1.11 | 1451.36 ± 65.08 | 1.21 | 1285.96 ± 53.65 | 1.11 | 1280.55 ± 35.69 |

Table 6. Comparative analyses of interval data on the Ecoli dataset.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.11 | 2.06 ± 0.11 | 1.21 | 2.11 ± 0.12 | 1.11 | 2.01 ± 0.08 | 1.21 | 1.99 ± 0.05 |
| 6 | 1.11 | 1.72 ± 0.08 | 1.11 | 1.75 ± 0.09 | 1.21 | 1.66 ± 0.03 | 1.11 | 1.65 ± 0.03 |
| 9 | 1.01 | 1.42 ± 0.07 | 1.01 | 1.43 ± 0.06 | 1.11 | 1.38 ± 0.04 | 1.01 | 1.36 ± 0.03 |
Figures 2, 3, 4, and 5 show the cooperative relationship between the fuzzy coefficient and the reconstruction error of the four clustering algorithms for interval data on the UCI dataset Authentication, reflecting the optimization process for m. From these figures, we can see the relationship between the fuzzy coefficient m and the reconstruction error of the four algorithms under the same number of clusters. The A-CFCM algorithm is more unstable when clustering fuzzy data with outliers, while the curves obtained by the WPFCM-IG algorithm are smoother than those of the other three algorithms. In addition, when
Table 7. Comparative analyses of interval data on the Authentication dataset.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.41 | 2274.97 ± 90.67 | 1.41 | 2287.86 ± 97.65 | 1.01 | 2173.56 ± 89.54 | 1.31 | 2162.54 ± 90.23 |
| 6 | 1.31 | 1929.03 ± 82.43 | 1.31 | 1941.09 ± 83.67 | 1.31 | 1845.25 ± 76.33 | 1.21 | 1834.76 ± 75.78 |
| 9 | 1.21 | 1648.87 ± 79.45 | 1.21 | 1676.56 ± 77.93 | 1.21 | 1544.27 ± 64.53 | 1.11 | 1528.05 ± 62.76 |
Table 8. Comparative analyses of interval data on the Vowel_context dataset.

| c | A-CFCM m | A-CFCM Error | WDFCM m | WDFCM Error | WFCM-G m | WFCM-G Error | WPFCM-IG m | WPFCM-IG Error |
|---|----------|--------------|---------|-------------|----------|--------------|------------|----------------|
| 3 | 1.31 | 252.26 ± 2.58 | 1.31 | 254.34 ± 2.75 | 1.11 | 248.09 ± 2.36 | 1.31 | 245.56 ± 2.54 |
| 6 | 1.21 | 215.25 ± 2.16 | 1.21 | 214.73 ± 1.42 | 1.31 | 213.37 ± 1.95 | 1.21 | 211.20 ± 1.96 |
| 9 | 1.11 | 172.85 ± 1.41 | 1.21 | 174.32 ± 1.69 | 1.21 | 170.27 ± 1.36 | 1.11 | 168.39 ± 1.15 |
we focus on a specific algorithm, its performance characteristics depend on the data used.
Fig. 2. The relationship between fuzzy coefficient and reconstruction error for A-CFCM on Authentication dataset.
In addition, Fig. 6, Fig. 7, Fig. 8, and Fig. 9 show bar graphs of the correspondence between different numbers of clusters c and the best fuzzification coefficient m on the UCI dataset Authentication for the different clustering algorithms. For the WPFCM-IG algorithm, when the number of clusters c is 3, the reconstruction error is minimized at m = 1.41. When c is 6, the best value of m is 1.21. When c is 9, the reconstruction error
362
Y. Tang et al.
Fig. 3. The relationship between fuzzy coefficient and reconstruction error for WDFCM on Authentication dataset.
Fig. 4. The relationship between fuzzy coefficient and reconstruction error for WFCM-G on Authentication dataset.
reaches the minimum value when m is 1.11. The other three clustering algorithms show the same trend. Obviously, the best value of m is generally stable within [1.11, 1.41]. Through experimental comparison and analysis on four UCI datasets, it is not difficult to find that the reconstruction error of the proposed WPFCM-IG algorithm is the smallest among the four clustering algorithms in most cases. It can be seen that the clustering results obtained by the WPFCM-IG algorithm have higher reliability and better performance.
5 Summary and Outlook In this paper, a weighting possibilistic fuzzy c-means algorithm for interval granularity (WPFCM-IG) is proposed, and its main contributions are as follows:
Fig. 5. The relationship between fuzzy coefficient and reconstruction error for WPFCM-IG on Authentication dataset.
Fig. 6. Corresponding relations between c and m in the A-CFCM algorithm.
Fig. 7. Corresponding relations between c and m in the WDFCM algorithm.
Fig. 8. Corresponding relations between c and m in the WFCM-G algorithm.
Fig. 9. Corresponding relations between c and m in the WPFCM-IG algorithm.
Firstly, a new weight-setting method for interval granular data is given. With the principle of justifiable granularity as the evaluation standard, the quality of each granular datum is measured by assigning it a weight from the perspectives of coverage and specificity. Secondly, we propose the WPFCM-IG algorithm. To better reflect the information contained in granular data, we introduce the idea of possibilistic clustering, which helps to improve the anti-noise ability. In addition, with the help of the proposed granular data weights, the influence of low-weight data on the clustering results is reduced, and the clustering correctness of the algorithm is improved to a certain extent. Finally, we perform comparative experiments on artificial datasets and UCI datasets to verify the performance of the proposed algorithm. The experimental results show that the WPFCM-IG algorithm is better than the WDFCM, A-CFCM and WFCM-G algorithms in reconstruction error. Moreover, the cooperative relation curve between the fuzzy coefficient and the reconstruction error of WPFCM-IG is smoother than those of the other algorithms, so that the fuzzy coefficient can be better optimized.
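The coverage-and-specificity weighting in the first contribution can be sketched for a one-dimensional interval granule as follows; the linear specificity and the product combination are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

def granule_weight(interval, data, data_range):
    """Weight an interval granule by coverage x specificity (illustrative).

    coverage: fraction of data points falling inside the interval;
    specificity: 1 minus the interval length relative to the data range.
    """
    lo, hi = interval
    coverage = float(np.mean((data >= lo) & (data <= hi)))
    specificity = max(0.0, 1.0 - (hi - lo) / data_range)
    return coverage * specificity
```

Under this toy definition, a narrow interval that still covers many points gets a large weight, while a very wide or sparsely covered interval is down-weighted in the clustering objective.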
In future work, we will try to apply the proposed WPFCM-IG algorithm to the field of image segmentation. In addition, logical reasoning [26–29] is one of the important theoretical foundations of artificial intelligence. We will consider combining logical reasoning with granular clustering, so as to give a new measure to improve granular clustering. Acknowledgment. This work has been supported by the National Natural Science Foundation of China (Nos. 62176083, 62176084, 61877016, and 61976078), the Key Research and Development Program of Anhui Province (No. 202004d07020004), the Natural Science Foundation of Anhui Province (No. 2108085MF203), and the Fundamental Research Funds for Central Universities of China (No. PA2021GDSK0092).
References 1. Li, X.L., Zhang, H., Wang, R., Nie, F.P.: Multiview clustering: a scalable and parameter-free bipartite graph fusion method. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 330–344 (2022) 2. Tang, Y.M., Pan, Z.F., Pedrycz, W., Ren, F.J., Song, X.C.: Viewpoint-based kernel fuzzy clustering with weight information granules. IEEE Trans. Emerg. Top. Comput. Intell. (2022). https://doi.org/10.1109/TETCI.2022.3201620 3. Tang, Y.M., Ren, F.J., Pedrycz, W.: Fuzzy c-means clustering through SSIM and patch for image segmentation. Appl. Soft Comput. 87, 105928: 1–16 (2020) 4. Tang, Y.M., Hu, X.H., Pedrycz, W., Song, X.C.: Possibilistic fuzzy clustering with highdensity viewpoint. Neurocomputing 329, 407–423 (2019) 5. Tang, Y.M., Li, L., Liu, X.P.: State-of-the-art development of complex systems and their simulation methods. Complex Syst. Model. Simul. 1(4), 271–290 (2021) 6. Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965) 7. Wang, X., Wang, Y., Wang, L.: Improving fuzzy c-means clustering based on feature-weight learning. Pattern Recogn. Lett. 25(10), 1123–1132 (2004) 8. Dunn, J.C.: Well-separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (2008) 9. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy Syst. 1(2), 98–110 (1993) 10. Pal, N.R., Pal, K., Keller, J.M., et al.: A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 13(4), 517–530 (2005) 11. Graves, D., Pedrycz, W.: Kernel-based fuzzy clustering and fuzzy clustering: a comparative experimental study. Fuzzy Sets Syst. 161(4), 522–543 (2010) 12. Zhou, J., Chen, L., Chen, C.L.: Fuzzy clustering with the entropy of attribute weights. Neurocomputing 19(8), 125–134 (2016) 13. Schneider, A.: Weighted possibilistic c-means clustering algorithms. In: Proceedings of the Ninth IEEE International Conference on Fuzzy Systems, FUZZ, pp. 176–180. IEEE (2000) 14. 
Bahrampour, S., Moshiri, B., Salahshoor, K.: Weighted and constrained possibilistic c-means clustering for online fault detection and isolation. Appl. Intell. 35(2), 269–284 (2011) 15. Bose, A., Mali, K.: Type-reduced vague possibilistic fuzzy clustering for medical images. Pattern Recogn. 112, 107784 (2021) 16. Zadeh, L.A.: Toward a generalized theory of uncertainty (GTU)—an outline. Inf. Sci. 172(1), 1–40 (2005) 17. Pedrycz, W., Succi, G., Sillitti, A., Iljazi, J.: Data description: a general framework of information granules. Knowl.-Based Syst. 80, 98–108 (2015)
18. Zhu, X.B., Pedrycz, W., Li, Z.W.: Granular data description: designing ellipsoidal information granules. IEEE Trans. Cybern. 47(12), 4475–4484 (2017) 19. Ouyang, T.H., Pedrycz, W., Reyes-Galaviz, O.F., Pizzi, N.J.: Granular description of data structures: a two-phase design. IEEE Trans. Cybern. 51(4), 1902–1912 (2021) 20. Zarandi, M.H.F., Razaee, Z.S.: A fuzzy clustering model for fuzzy data with outliers. Int. J. Fuzzy Syst. 1(2), 29–42 (2011) 21. Effati, S., Yazdi, H.S., Sharahi, A.J.: Fuzzy clustering algorithm for fuzzy data based on α–cuts. J. Intell. Fuzzy Syst. 24(3), 511–519 (2013) 22. Shen, Y.H., Pedrycz, W., Wang, X.M.: Clustering homogeneous granular data: formation and evaluation. IEEE Trans. Cybern. 49(4), 1391–1401 (2019) 23. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Information and Computer Science, University of California, Irvine, CA, USA (2007). http://archive.ics.usi. edu/ml/Datasets.html 24. Pedrycz, W., Valente de Oliveira, J.: A development of fuzzy encoding and decoding through fuzzy clustering. IEEE Trans. Instrum. Measur. 57(4), 829–837 (2008). https://doi.org/10. 1109/TIM.2007.913809 25. Zhu, X.B., Pedrycz, W., Li, Z.W.: Granular encoders and decoders: a study in processing information granules. IEEE Trans. Fuzzy Syst. 25(5), 1115–1126 (2017) 26. Tang, Y.M., Ren, F.J.: Fuzzy systems based on universal triple I method and their response functions. Int. J. Inf. Technol. Decis. Mak. 16(2), 443–471 (2017) 27. Tang, Y.M., Zhang, L., Bao, G.Q., et al.: Symmetric implicational algorithm derived from intuitionistic fuzzy entropy. Iranian J. Fuzzy Syst. 19(4), 27–44 (2022) 28. Tang, Y.M., Pedrycz, W., Ren, F.J.: Granular symmetric implicational method. IEEE Trans. Emerg. Top. Comput. Intell. 6(3), 710–723 (2022) 29. Tang, Y.M., Pedrycz, W.: Oscillation bound estimation of perturbations under Bandler-Kohout subproduct. IEEE Trans. Cybern. 52(7), 6269–6282 (2022)
An Evolutionary Multi-task Genetic Algorithm with Assisted-Task for Flexible Job Shop Scheduling Xuhui Ning, Hong Zhao(B) , Xiaotao Liu, and Jing Liu Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China [email protected], {hongzhao,xtliu}@xidian.edu.cn, [email protected]
Abstract. Flexible job-shop scheduling problem (FJSP) has attracted much attention from academia. Evolutionary multitasking optimization (EMTO) is known for solving multiple tasks simultaneously by leveraging the knowledge shared among tasks. To explore the universality of EMTO, an assisted-task based evolutionary multi-task genetic algorithm (MTGAA) is proposed, for the first time, to deal with FJSP. In MTGAA, each FJSP task is equipped with a constitutive assisted task that generates a high-quality initial population according to priority rules, so that the target task is improved by using the knowledge from the assisted task. To improve the search ability of MTGAA, an adaptive crossover strategy is designed that uses two popular crossover operators at the same time. Besides, the effectiveness of the two proposed components is verified by comparing MTGAA to four variants of MTGAA. The experimental results of MTGAA are compared with two recent algorithms on standard benchmark instances, and the results show that MTGAA is competitive in dealing with FJSP. Keywords: Flexible job shop scheduling · Evolutionary multi-tasking · Knowledge transfer
1 Introduction and Related Work Job-shop scheduling problem (JSP) is one of the classic combinatorial optimization problems. It consists of multiple jobs; all the operations of a job can only be processed on one unique machine, and the processing time and order of each operation are known in advance. The objective of JSP is to find a suitable operation sequence that minimizes a certain metric while satisfying the constraints. Flexible job-shop scheduling problem (FJSP) is a specific branch of JSP [1]. In FJSP, each machine can process multiple kinds of operations and each operation is available on several different machines. JSP has been proven to be NP-hard [2, 3], while FJSP contains an extra machine-assignment sub-problem which makes it more complicated than JSP. Solving FJSP requires constructing a suitable schedule that minimizes a certain criterion such as completion time or flow time. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 367–378, 2023. https://doi.org/10.1007/978-981-99-2385-4_27
368
X. Ning et al.
Owing to the complexity of FJSP, traditional mathematical optimization methods are time-consuming and inefficient when dealing with it. To develop more effective methods, many swarm intelligence algorithms have been employed: Zhang et al. [4] combined particle swarm optimization (PSO) and tabu search (TS) to deal with FJSP. Xing et al. [5] proposed a knowledge-based ant colony optimization (ACO) algorithm for FJSP. Li et al. [6] proposed a hybrid Pareto-based discrete artificial bee colony algorithm to handle FJSP. Caldeira et al. [7] designed a hybrid artificial bee colony algorithm for solving FJSP. Evolutionary algorithms (EAs) employ the natural evolutionary mechanism to solve complex and difficult optimization problems such as multimodal optimization problems [8, 9] and portfolio optimization problems [10]. Their high robustness and outstanding performance make genetic algorithms (GAs) popular for dealing with FJSP [11, 12]. To improve the ability of GA in processing FJSP, many notable works focus on encoding and decoding strategies, population initialization methods, and local search operators. Pezzella et al. [1] improved the chromosome encoding of FJSP and used the minimum-processing-time rule to initialize the population. Zhang et al. [13] encoded machine assignment and operation sequence separately: the machine assignment chromosome represents the selected machine for each operation, and the operation sequence chromosome is encoded by job indexes. Li et al. [14] hybridized GA and tabu search to deal with FJSP. Gao et al. [15] improved GA by using Variable Neighborhood Descent as a local search operator. Moreover, Chen et al. [16] proposed a key-parameters self-learning GA based on reinforcement learning to solve FJSP. Evolutionary multi-tasking optimization (EMTO) [17] aims to optimize multiple tasks at the same time by using the relevance among these tasks [18, 19]. Zhang et al.
[20] proposed a GP (Genetic Programming)-based surrogate-assisted evolutionary multitasking algorithm to deal with FJSP, where knowledge transfer occurred among the surrogates built for different FJSP tasks. Yuan et al. [21] encoded four different permutation-based combinatorial optimization problems uniformly; knowledge is shared in the unified search space so that each task can be promoted. However, the use of a unified representation may cause much redundant information to be transferred during knowledge transfer due to the heterogeneity of different FJSP tasks. An assisted-task based evolutionary multi-task genetic algorithm (MTGAA) is proposed to solve this problem; our contributions are summarized as follows:
• We construct a constitutive assisted task for each FJSP task; the assisted task adopts a dedicated initial population strategy that provides high-quality solutions for the target task. By this means, the redundant information caused by the heterogeneity of tasks is avoided and the transferred knowledge becomes more effective.
• An adaptive crossover strategy is designed to improve the quality of solutions. Two currently popular machine assignment crossover operators, multi-point crossover (MPX) [22] and uniform crossover (UX) [15], are employed in the strategy, and the selection probability of the two crossover operators is adjusted according to the improvement of the offspring after crossover.
An Evolutionary Multi-task Genetic Algorithm with Assisted-Task
369
To verify the effectiveness of the proposed assisted-task component and adaptive crossover strategy component, we selected 10 instances from Brandimarte's data set [23] for a series of experiments, and four variants of MTGAA are implemented in this paper. Moreover, to verify the performance of MTGAA, it is compared with some recent algorithms, and the results show that our algorithm enjoys a certain degree of competitiveness.
2 Methodology 2.1 Problem Formulation For an n × m FJSP, which contains n jobs and m machines, each job J i (i ∈ [1, n]) consists of a series of operations O i = {O i,1 , O i,2 , O i,3 , …, O i,h }, where O i,j is the j-th operation of J i (h is the number of operations) and Ω i,j is the set of available machines of O i,j . p i,j,k is the processing time of operation O i,j on machine M k (M k ∈ Ω i,j ). In this paper, the following constraints are assumed:
1) Each job J i and machine M k are available at time 0.
2) Each machine can only process one operation at a time.
3) The order of operations for each job is determined in advance.
4) Ongoing operations cannot be interrupted and machine disruptions are ignored.
5) The next operation on the machine can be performed immediately after the previous operation is completed.
6) Ω i,j is non-empty.
FJSP can be divided into two sub-problems: machine assignment of operations and sequencing of operations; FJSP involves how to sort the operations and select the corresponding machine for each operation so that a given performance indicator is optimized. A typical performance indicator for FJSP is makespan. In this paper, the objective of FJSP is to find a schedule with the smallest makespan. Makespan is the maximum completion time of all jobs, which can be expressed as:

$C_{\max} = \max_{1 \le i \le n} \max_{1 \le j \le h} \left( s_{i,j,k} + p_{i,j,k} \right)$   (1)

where s i,j,k denotes the start time of operation O i,j on machine M k , and C max is the maximum completion time of all jobs. The objective of FJSP in this paper is to minimize C max . 2.2 Basis of MTGAA The proposed MTGAA incorporates the core of evolutionary multitasking: the knowledge transfer mechanism. Chromosome coding and decoding, population initialization, selection, crossover and mutation of MTGAA are introduced in the following subsections.
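Eq. (1) in executable form: given the start and processing times of every scheduled operation, the makespan is simply the latest completion time. A small sketch (the dictionary layout is a hypothetical choice):

```python
def makespan(schedule):
    """C_max of a schedule, per Eq. (1): the latest completion time over all
    operations. `schedule` maps (job, op) -> (start_time, processing_time)."""
    return max(start + dur for start, dur in schedule.values())
```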
Chromosome Representation. Chromosome representation has a great influence on the search efficiency of GA. Therefore, a targeted chromosome representation within the GA framework can be designed to reduce the cost of decoding. The chromosome in this paper is bilayer encoded: machine assignment (MA) and operation sequence (OS). Table 1 gives an example with 3 jobs (J 1 , J 2 , J 3 ) and 6 machines (M 1 , M 2 , M 3 , M 4 , M 5 , M 6 ), where each job has 3 operations (J 1 (O 11 , O 12 , O 13 ); J 2 (O 21 , O 22 , O 23 ); J 3 (O 31 , O 32 , O 33 )). As shown in Fig. 1, in the OS chromosome part, the first occurrence of a job index refers to the first operation of that job, the second occurrence refers to its second operation, and so on. In the MA chromosome part, in order from smallest to largest job index, a machine is selected from the available machine set (see Table 1) of each operation.
Fig. 1. The structure of the bilayer chromosome encoding.
Chromosome Decoding. The MA chromosome is read in order: each gene represents the index of the machine selected for one operation, and the corresponding processing time is recorded according to the processing-time table of the benchmark. The OS chromosome is also read from left to right, and each gene is decoded into the corresponding operation. Population Initialization. Population initialization has a large impact on the quality of the solution; the populations of the target task and the assisted task are initialized in two different ways. A random initialization strategy is employed for the target task: the operation sequence is randomly generated and the machine for each operation is also randomly selected. Two priority rules, LMPT and GMPT [24], are employed in the population initialization of the assisted task; it is worth noting that the two priority rules are only employed for machine assignment while the operation sequence is randomly generated. Selection Operation. The purpose of selection is to select better individuals to enter the next generation based on fitness. Tournament selection is employed for the target task; it allows elite individuals to have a greater probability of retention
and avoids premature convergence. Elitist selection is employed for the assisted task; it speeds up the convergence of the assisted-task population so that high-quality knowledge is provided faster.

Table 1. Machines and corresponding processing times for each operation; '--' indicates the machine is not available.

Job | Operation | M1 | M2 | M3 | M4 | M5 | M6
J1 | O11 | -- | 4 | -- | 3 | -- | --
J1 | O12 | 5 | -- | 2 | 5 | -- | --
J1 | O13 | 2 | -- | -- | -- | 4 | 3
J2 | O21 | -- | 4 | 5 | -- | -- | --
J2 | O22 | -- | -- | -- | 6 | 7 | 3
J2 | O23 | -- | 3 | -- | 8 | -- | 2
J3 | O31 | 7 | 2 | -- | 4 | -- | --
J3 | O32 | 1 | 5 | -- | -- | 3 | --
J3 | O33 | -- | -- | 5 | 6 | -- | 4
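The decoding just described can be sketched as a greedy scheduler (function names and the simple append-only policy are our assumptions; the paper's decoder may use more elaborate gap filling):

```python
def decode(os_chrom, ma_chrom, proc_time, n_ops):
    """Decode the bilayer chromosome into operation start times and makespan.

    os_chrom: job indices; the k-th occurrence of job j denotes operation k of j
    ma_chrom: machine chosen per operation, ordered by (job, operation) index
    proc_time[(job, op, machine)]: processing time
    n_ops[job]: number of operations of each job
    """
    order = [(j, o) for j in sorted(n_ops) for o in range(n_ops[j])]
    machine_of = dict(zip(order, ma_chrom))  # (job, op) -> assigned machine

    job_ready, mach_ready, seen, start = {}, {}, {}, {}
    for j in os_chrom:
        o = seen.get(j, 0)
        seen[j] = o + 1
        m = machine_of[(j, o)]
        # an operation starts when both its job and its machine are free
        s = max(job_ready.get(j, 0), mach_ready.get(m, 0))
        start[(j, o)] = s
        job_ready[j] = mach_ready[m] = s + proc_time[(j, o, m)]
    return start, max(job_ready.values())
```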
Crossover Operator. The evolution of the population is greatly driven by crossover; offspring with better fitness can be generated by exchanging parts of the chromosomes of two selected individuals. Due to the different characteristics of machine assignment and operation sequence, different crossover and mutation operators are employed for the corresponding chromosomes. Precedence preserving order-based crossover (POX) [25] is employed for the operation sequence chromosome; the offspring generated by POX is always feasible, which avoids a repair mechanism and thus improves the efficiency of decoding. Two popular crossover operators, MPX and UX, are employed for the machine assignment chromosome; the crossover operator is randomly selected by an adaptive selection mechanism (see Algorithm 1). As shown in Algorithm 1, crossover_vec is a weight vector, and cross_idx is the index of the crossover operator generated from crossover_vec through roulette wheel selection.
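Algorithm 1's adaptive mechanism can be sketched as roulette-wheel selection over crossover_vec plus a weight update; the additive update rule below is our assumption, since the text only states that selection probabilities are adjusted by offspring improvement:

```python
import random

def pick_crossover(crossover_vec, rng=random):
    """Roulette-wheel selection: return cross_idx, the index of a crossover
    operator drawn proportionally to its weight in crossover_vec."""
    r = rng.uniform(0.0, sum(crossover_vec))
    acc = 0.0
    for idx, w in enumerate(crossover_vec):
        acc += w
        if r <= acc:
            return idx
    return len(crossover_vec) - 1

def update_weights(crossover_vec, idx, improved, step=0.1, floor=0.05):
    """Reward the chosen operator if the offspring improved on its parents,
    otherwise decay its weight (kept above a small floor)."""
    crossover_vec[idx] = max(floor, crossover_vec[idx] + (step if improved else -step))
    return crossover_vec
```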
POX (For Operation Sequence Chromosome). As shown in Fig. 2, the job set J is divided into two non-empty sub-sets J 1 = {1, 3} and J 2 = {2}. The jobs of J 1 contained in Parent 1 are copied to Children 1 with their positions preserved, and the jobs of J 2 contained in Parent 2 are copied to Children 1 with their order preserved. Symmetrically, the jobs of J 2 contained in Parent 2 keep their positions in Children 2, and the jobs of J 1 contained in Parent 1 are copied to Children 2 with their order preserved.
Parent 1 2 1 3 3 2 1 2 3 1
Children 1 2 1 3 3 2 1 2 3 1
Parent 2 3 2 1 2 1 1 3 2 3
Children 2 1 2 3 2 3 1 3 2 1
Fig. 2. Illustration of POX (for operation sequence chromosome)
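A runnable sketch of POX as illustrated in Fig. 2 (function and argument names are ours):

```python
def pox(p1, p2, set1):
    """Precedence-preserving order-based crossover for OS chromosomes.

    Child 1 keeps the positions of set1 jobs from p1 and takes the remaining
    genes in their p2 order; Child 2 keeps the positions of the complementary
    jobs from p2 and takes set1 genes in their p1 order. Offspring are always
    feasible, so no repair is needed.
    """
    set2 = set(p1) - set1

    def make_child(pos_parent, order_parent, keep):
        child = [g if g in keep else None for g in pos_parent]
        filler = iter(g for g in order_parent if g not in keep)
        return [g if g is not None else next(filler) for g in child]

    return make_child(p1, p2, set1), make_child(p2, p1, set2)
```

With the parents of Fig. 2 and J 1 = {1, 3}, this reproduces exactly the children shown there.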
MPX (For Machine Assignment Chromosome). As shown in Fig. 3, a binary string S of 0s and 1s, equal in length to the chromosome, is randomly generated. The genes of Parent 1 and Parent 2 are exchanged at the positions where 1 appears in S, and the other genes are left unchanged. S
1 0 0 1 0 1 1 0 1
Parent 1 4 1 6 2 5 2 4 3 2 Parent 2 2 3 1 3 4 6 1 6 5
Children 1 2 1 6 3 5 6 1 3 5 Children 2 4 3 1 2 4 2 4 6 2
Fig. 3. Illustration of MPX (for machine assignment chromosome)
UX (For Machine Assignment Chromosome). As shown in Fig. 4, a random integer r ∈ (1, length of chromosome) is generated, a sub-section of length r is selected randomly, the genes of Parent 1 and Parent 2 within the sub-section are exchanged, and the rest remain unchanged (Fig. 4). Mutation Operator. To a certain extent, mutation increases the diversity of the population and improves the local random search capability. The mutation operators employed for the machine assignment chromosome and the operation sequence chromosome are also different. Mutation of Operation Sequence Chromosome. The process of mutation is presented in Fig. 5: two genes are randomly selected from the operation sequence chromosome and then exchanged.
r=5
Parent 1 4 1 6 2 5 2 4 3 2 Children 1 4 3 1 3 4 6 4 3 2 Parent 2 2 3 1 3 4 6 1 6 5
Children 2 2 1 6 2 5 2 1 6 5
Fig. 4. Illustration of UX (for machine assignment chromosome)
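Both machine-assignment crossovers can be sketched as follows; mpx exposes its 0/1 mask so the Fig. 3 example can be reproduced, and by default both operators draw randomly:

```python
import random

def mpx(p1, p2, mask=None, rng=random):
    """Multi-point crossover: swap genes wherever the random 0/1 mask is 1."""
    if mask is None:
        mask = [rng.randint(0, 1) for _ in p1]
    c1 = [b if s else a for s, a, b in zip(mask, p1, p2)]
    c2 = [a if s else b for s, a, b in zip(mask, p1, p2)]
    return c1, c2

def ux(p1, p2, rng=random):
    """Uniform crossover as described above: swap a randomly placed
    sub-section of random length r between the two parents."""
    n = len(p1)
    r = rng.randint(1, n - 1)
    lo = rng.randint(0, n - r)
    c1, c2 = list(p1), list(p2)
    c1[lo:lo + r], c2[lo:lo + r] = p2[lo:lo + r], p1[lo:lo + r]
    return c1, c2
```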
OS
2 1 3 3 2 1 2 3 1
MA
4 1 6 2 5 2 4 3 2
OS
2 2 3 3 1 1 2 3 1
MA 4 1 6 2 5 2 4 3 2 Fig. 5. Illustration of mutation for operation sequence chromosome
Mutation of Machine Assignment. As shown in Fig. 6, one operation is selected randomly, a machine is reselected from the available machine set of this operation, and the corresponding gene of the machine assignment chromosome is changed. The offspring produced after mutation is always feasible, so no chromosome repair mechanism is required.

Job J1, Operation O11: M1 --, M2 4, M3 --, M4 3, M5 --, M6 -- (mutated operation: O11)
Before: OS 2 1 3 3 2 1 2 3 1 | MA 4 1 6 2 5 2 4 3 2
After:  OS 2 1 3 3 2 1 2 3 1 | MA 2 1 6 2 5 2 4 3 2

Fig. 6. Illustration of mutation for the machine selection chromosome
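Both mutation operators are small and keep offspring feasible. A sketch (hypothetical names; available[i] is the available-machine list of the operation at gene position i):

```python
import random

def mutate_os(os_chrom, rng=random):
    """Swap two randomly chosen genes of the operation sequence chromosome."""
    c = list(os_chrom)
    i, j = rng.sample(range(len(c)), 2)
    c[i], c[j] = c[j], c[i]
    return c

def mutate_ma(ma_chrom, available, rng=random):
    """Reselect a machine for one randomly chosen operation from its
    available machine set."""
    c = list(ma_chrom)
    idx = rng.randrange(len(c))
    c[idx] = rng.choice(available[idx])
    return c
```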
2.3 The Framework of MTGAA The procedure of the proposed algorithm is described in Algorithm 2. The populations P1 and P2 are initialized for the target task and the assisted task, respectively: P1 with the random initialization strategy and P2 with the LMPT and GMPT priority rules (90% of the individuals of P2 are produced by GMPT and 10% by LMPT). In each iteration, the worst 5 individuals in P1 are replaced with the best 5 individuals in P2. Transferring too many good individuals of P2 to P1 might cause premature convergence of P1, so only 5 individuals are selected for knowledge transfer. Then crossover (according to Algorithm 1) and mutation are performed on P1 and P2.
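The knowledge-transfer step of Algorithm 2 amounts to replacing the k worst target individuals with the k best assisted individuals (k = 5 above). A sketch for a minimization fitness such as makespan:

```python
def transfer_knowledge(P1, P2, fitness, k=5):
    """Replace the k worst individuals of the target population P1 with the
    k best individuals of the assisted population P2 (smaller fitness is
    better, e.g. makespan)."""
    survivors = sorted(P1, key=fitness)[:-k]   # drop the k worst of P1
    elites = sorted(P2, key=fitness)[:k]       # take the k best of P2
    return survivors + elites
```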
3 Experiment Result In this section, the effectiveness of the proposed MTGAA is evaluated on a standard test set: 10 instances from Brandimarte's data set are selected for a series of experiments. The experimental parameters are as follows: population size: 100; maximum number of evaluations: 20000; probability of crossover: 0.8; probability of mutation: 0.5. Effectiveness of Multi-task with Assisted-task in MTGAA. In the multi-task environment, the target task is promoted through the high-quality knowledge provided by the assisted task. To validate the advantage of the knowledge transfer of MTGAA, this paper implements IniGA (population initialization strategy only) and
the traditional genetic algorithm (GA) and compares them with MTGAA. As shown in Table 2, MTGAA is superior to GA and IniGA, so the effectiveness of multi-task with assisted-task in MTGAA is verified. Effectiveness of Adaptive Crossover Strategy of MTGAA. To validate the advantage of the adaptive crossover strategy of MTGAA, this paper implements UXGA (MTGAA with UX only) and MPXGA (MTGAA with MPX only) and compares them with MTGAA. As shown in Table 2, the results of MTGAA are better than those of MPXGA and UXGA, so the significance of the adaptive crossover strategy of MTGAA is verified. Figure 7 shows the performance of the five algorithms on MK10. It can be observed that MTGAA has the strongest optimal-solution search capability, followed by UXGA and MPXGA; GA has the poorest performance, and the premature convergence of IniGA is caused by an initial population with low diversity.

Table 2. Comparison results in makespan on Brandimarte's data. (LB, UB) represents the best lower and upper bounds for each instance up to now.

Problem | Size (n*m) | LB, UB | GA | IniGA | UXGA | MPXGA | MTGAA
Mk01 | 10*6 | 36, 42 | 44 | 42 | 42 | 42 | 41
Mk02 | 10*6 | 24, 32 | 34 | 30 | 33 | 30 | 29
Mk03 | 15*8 | 204, 211 | 207 | 204 | 204 | 204 | 204
Mk04 | 15*8 | 48, 81 | 74 | 67 | 66 | 67 | 65
Mk05 | 15*4 | 168, 186 | 182 | 183 | 185 | 179 | 178
Mk06 | 10*15 | 33, 86 | 82 | 83 | 77 | 68 | 68
Mk07 | 20*5 | 133, 157 | 159 | 156 | 157 | 151 | 148
Mk08 | 20*10 | 523 | 524 | 523 | 523 | 523 | 523
Mk09 | 20*10 | 299, 369 | 334 | 346 | 327 | 322 | 311
Mk10 | 20*15 | 165, 296 | 294 | 258 | 259 | 247 | 240
Superiority of MTGAA. To verify the performance of MTGAA in dealing with FJSP, the best results of MTGAA over 5 runs are compared with those of other recent algorithms, including EDPSO [26] and GWO [27]. As shown in Table 3, MTGAA obtains the best result on 5 of the 10 instances, while MATSPSO obtains 3 and GWO obtains 4. Therefore, MTGAA enjoys a certain degree of competitiveness compared with these algorithms.
Fig. 7. The performance of MTGAA, UXGA (MTGAA with UX only), MPXGA (MTGAA with MPX only), GA and IniGA (population initialization strategy only) on MK10.

Table 3. Comparison results of MTGAA and comparison algorithms on makespan.

Problem | Size (n*m) | LB, UB | MATSPSO | GWO | MTGAA
Mk01 | 10*6 | 36, 42 | 39 | 40 | 41
Mk02 | 10*6 | 24, 32 | 27 | 29 | 29
Mk03 | 15*8 | 204, 211 | 207 | 204 | 204
Mk04 | 15*8 | 48, 81 | 65 | 64 | 65
Mk05 | 15*4 | 168, 186 | 174 | 175 | 178
Mk06 | 10*15 | 33, 86 | 72 | 69 | 68
Mk07 | 20*5 | 133, 157 | 154 | 147 | 148
Mk08 | 20*10 | 523 | 523 | 523 | 523
Mk09 | 20*10 | 299, 369 | 340 | 322 | 311
Mk10 | 20*15 | 165, 296 | 299 | 249 | 240
4 Conclusion To solve FJSP effectively, the genetic algorithm is equipped with an assisted task based on the knowledge transfer mechanism of evolutionary multi-tasking. To take advantage of the knowledge transfer of EMTO, we build an assisted task to provide high-quality knowledge for each FJSP task. Moreover, to improve the search ability, an adaptive crossover strategy is designed that uses two crossover operators at the same time. To verify the effectiveness of MTGAA, numerous comparative experiments
are conducted on the standard benchmark data set, and the results show that MTGAA is competitive in solving FJSP. It is worth noting that MTGAA is the first attempt to use EMTO specifically to deal with FJSP; there is great potential for exploration in the construction of the assisted task and the form of knowledge. In future work, we will be interested in solving more complex combinatorial optimization problems with EMTO-style methods, such as promoting complex optimization problems with expensive evaluation cost by using knowledge from simple optimization problems with cheap cost [28, 29]. Acknowledgement. This work was supported by the Guangdong Basic and Applied Basic Research Foundation (2021A151511073, 2022A1515011297).
References 1. Pezzella, F., Morganti, G., Ciaschetti, G.: A genetic algorithm for the flexible job-shop scheduling problem. Comput. Oper. Res. 35, 3202–3212 (2008) 2. Gao, K., Yang, F., Zhou, M., Pan, Q., Suganthan, P.N.: Flexible job-shop rescheduling for new job insertion by using discrete Jaya algorithm. IEEE Trans. Cybern. 49, 1944–1955 (2018) 3. Garey, M.R., Johnson, D.S., Sethi, R.: The complexity of flowshop and jobshop scheduling. Math. Oper. Res. 1, 117–129 (1976) 4. Zhang, G., Shao, X., Li, P., Gao, L.: An effective hybrid particle swarm optimization algorithm for multi-objective flexible job-shop scheduling problem. Comput. Ind. Eng. 56, 1309–1318 (2009) 5. Xing, L.N., Chen, Y.W., Wang, P., Zhao, Q.S., Xiong, J.: A knowledge-based ant colony optimization for flexible job shop scheduling problems. Appl. Soft Comput. 10, 888–896 (2010) 6. Li, J.Q., Pan, Q.K., Gao, K.Z.: Pareto-based discrete artificial bee colony algorithm for multiobjective flexible job shop scheduling problems. Int. J. Adv. Manuf. Technol. 55, 1159–1169 (2011) 7. Caldeira, R.H., Gnanavelbabu, A., JosephSolomon, J.: Solving the flexible job shop scheduling problem using a hybrid artificial bee colony algorithm. In: Vijayan, S., NachiappanSubramanian, K. (eds.) Trends in Manufacturing and Engineering Management. LNME, pp. 833–843. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-4745-4_72 8. Zhao, H., et al.: Local binary pattern-based adaptive differential evolution for multimodal optimization problems. IEEE Trans. Cybern. 50, 3343–3357 (2019) 9. Zhao, H., Li, J., Liu, J.: Localized distance and time-based differential evolution for multimodal optimization problems. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 510–513 (2022) 10. Zhao, H., Chen, Z.-G., Zhan, Z.-H., Kwong, S., Zhang, J.: Multiple populations coevolutionary particle swarm optimization for multi-objective cardinality constrained portfolio optimization problem. 
Neurocomputing 430, 58–70 (2021) 11. Gao, K., Cao, Z., Zhang, L., Chen, Z., Han, Y., Pan, Q.: A review on swarm intelligence and evolutionary algorithms for solving flexible job shop scheduling problems. IEEE/CAA J. Autom. Sin. 6, 904–916 (2019) 12. Shao, G., Shangguan, Y., Tao, J., Zheng, J., Liu, T., Wen, Y.: An improved genetic algorithm for structural optimization of Au–Ag bimetallic nanoparticles. Appl. Soft Comput. 73, 39–49 (2018)
13. Zhang, G., Gao, L., Shi, Y.: An effective genetic algorithm for the flexible job-shop scheduling problem. Expert Syst. Appl. 38, 3563–3573 (2011) 14. Li, X.Y., Gao, L.: An effective hybrid genetic algorithm and tabu search for flexible job shop scheduling problem. Int. J. Prod. Econ. 174, 93–110 (2016) 15. Gao, J., Sun, L., Gen, M.: A hybrid genetic and variable neighborhood descent algorithm for flexible job shop scheduling problems. Comput. Oper. Res. 35, 2892–2907 (2008) 16. Chen, R., Yang, B., Li, S., Wang, S.: A self-learning genetic algorithm based on reinforcement learning for flexible job-shop scheduling problem. Comput. Ind. Eng. 149, 106778 (2020) 17. Gupta, A., Ong, Y.-S., Feng, L.: Multifactorial evolution: toward evolutionary multitasking. IEEE Trans. Evol. Comput. 20, 343–357 (2015) 18. Wei, T., Wang, S., Zhong, J., Liu, D., Zhang, J.: A review on evolutionary multi-task optimization: trends and challenges. IEEE Trans. Evol. Comput. 26, 941–960 (2021). https://doi. org/10.1109/TEVC.2021.3139437 19. Osaba, E., Del Ser, J., Martinez, A.D., Hussain, A.: Evolutionary multitask optimization: a methodological overview, challenges, and future research directions. Cogn. Comput. 14, 927–954 (2022) 20. Zhang, F., Mei, Y., Nguyen, S., Zhang, M., Tan, K.C.: Surrogate-assisted evolutionary multitask genetic programming for dynamic flexible job shop scheduling. IEEE Trans. Evol. Comput. 25, 651–665 (2021) 21. Yuan, Y., Ong, Y.S., Gupta, A., Tan, P.S., Xu, H.: Evolutionary multitasking in permutationbased combinatorial optimization problems: realization with TSP, QAP, LOP, and JSP. In: 2016 IEEE Region 10 Conference (TENCON), pp. 3157–3164. IEEE (2016) 22. Davis, L.: Applying adaptive algorithms to epistatic domains. In: IJCAI, pp. 162–164 (1985) 23. Brandimarte, P.: Routing and scheduling in a flexible job shop by tabu search. Ann. Oper. Res. 41, 157–183 (1993) 24. 
Bagheri, A., Zandieh, M., Mahdavi, I., Yazdani, M.: An artificial immune algorithm for the flexible job-shop scheduling problem. Future Gener. Comput. Syst. 26, 533–541 (2010) 25. Lee, K. M., Yamakawa, T., Lee, K.-M.: A genetic algorithm for general machine scheduling problems. In: 1998 Second International Conference. Knowledge-Based Intelligent Electronic Systems. Proceedings KES 1998 (Cat. No. 98EX111), pp. 60–66. IEEE (1998) 26. Meng, T., Pan, Q.-K., Sang, H.-Y.: A hybrid artificial bee colony algorithm for a flexible job shop scheduling problem with overlapping in operations. Int. J. Prod. Res. 56, 5278–5292 (2018) 27. Nouiri, M., Bekrar, A., Jemai, A., Niar, S., Ammari, A.C.: An effective and distributed particle swarm optimization algorithm for flexible job-shop scheduling problem. J. Intell. Manuf. 29(3), 603–615 (2015). https://doi.org/10.1007/s10845-015-1039-3 28. Liao, P., Sun, C., Zhang, G., Jin, Y.: Multi-surrogate multi-tasking optimization of expensive problems. Knowl. Based Syst. 205, 106262 (2020) 29. Wang, C., Wu, K., Liu, J.: Evolutionary multitasking AUC optimization. arXiv preprint arXiv: 2201.01145 (2022)
Depression Tendency Assessment Based on Cyber Psychosocial and Physical Computation Huanhong Huang1 , Deyue Kong1 , Fanmin Meng1 , Siyi Yang1 , Youzhe Liu2 , Weihui Dai1(B) , and Yan Kang3(B) 1 School of Management, Fudan University, Shanghai 200433, China {18307100138,17307110411,18307100021,21210690199, whdai}@fudan.edu.cn 2 School of Data Science, Fudan University, Shanghai 200433, China [email protected] 3 School of Humanities and Management Sciences, Southwest Medical University, Luzhou 646000, China [email protected]
Abstract. In the modern social environment, depression tendency has become a common psychological state that can have a variety of adverse effects on people's physical and mental health as well as their life and work. At present, the assessment of depression tendency mainly depends on professional psychological scales, which makes it difficult to detect this tendency in the social population in a timely manner. Based on the CPP (Cyber Psychosocial and Physical) Computation methodology, this study analyzed the behavioral traits of depression tendency in physical space and cyberspace. It was found that these traits show some consistency among persons with high depressive tendencies, but large diversity among people with low depressive tendencies. Therefore, a clustering analysis and an LS-SVR (Least Squares Support Vector Regression) estimator weighted by the distance coefficients of different clusters were proposed for depression tendency assessment according to the QIDS-SR16 Scale. Results showed that this method can improve assessment accuracy and recognition rate. Keywords: Depression Tendency · Assessment · CPP Computation · LS-SVR · Clustering Analysis
1 Introduction As a common mental disorder syndrome, depression is well known to the public, with main manifestations such as slowed thinking, depressed mood, decreased initiative, and diminished interest [1]. It can be reflected in somatic symptoms (e.g., a sedentary lifestyle [2], excessive body mass index (BMI) [3, 4], and loss of appetite) as well as cognitive impairment [5], attentional bias [6, 7], and emotional inertia [8] through traditional observation in physical space. With the widespread popularity of ubiquitous © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 379–386, 2023. https://doi.org/10.1007/978-981-99-2385-4_28
380
H. Huang et al.
network and social media in modern society, the behavioral traits of depression in cyberspace have also increasingly attracted researchers' attention. For example, the frequency of certain words, the use of emojis, the amount and timing of information posted, and the speech in voice messages have all been found to be associated with this illness [9, 10]. Unlike depression, depressive tendencies are mild negative emotional states. In severe form, they may also show symptoms such as depressed mood, slow reaction, and reduced volitional activity, but those symptoms are short-lived, unstable, or can be self-regulated [1]. Depressive tendencies can be strongly influenced by the environment and are more likely to occur or worsen under stress. They may decrease or disappear when the stress disappears, but can easily develop into depression if they become more serious and last for a long time. In recent years, social monitoring and intervention of depressive tendencies have drawn great attention from global medical communities, psychosocial researchers, and human resource management departments [11]. Various professional psychological scales are currently used to assess depressive tendencies, mainly the Hamilton Depression Rating Scale (HAMD), the Beck Depression Inventory (BDI), the Quick Inventory of Depressive Symptomatology (QIDS), and the Patient Health Questionnaire-9 (PHQ-9). However, most of them require the involvement of professionals, which makes them hard to popularize among the public. Besides, those scales fail to consider the influence of the external environment (e.g., weather, seasons) and may therefore introduce biases. In addition, depressive tendencies are affected by various factors and can change dynamically, so the time window for assessment is so limited that it is impossible to fully unveil them.
Although various intelligent assessment technologies have been developed based on speech and motion pattern analysis [12, 13], further in-depth research is still necessary, taking individual environmental factors into consideration, because these patterns are unstable and hard to identify. The rapid development of the Internet of Things, the mobile Internet, and social media has provided a new way to dynamically monitor and analyze depressive tendencies from various behavioral data in physical space and cyberspace [11–13]. In previous studies, we conducted a large number of analyses on the associated characteristics of people's psychology and behaviors in physical space, psychosocial space, and cyberspace, and proposed the CPP (Cyber Psychosocial and Physical) Computation methodology [14, 15], which has been applied in epidemic evolution prediction, social perception computation, and social security monitoring [15–17]. This paper aims to explore a new method to analyze and assess social depressive tendencies based on this methodology and framework.
2 Survey and Statistical Analysis 2.1 Questionnaire Design According to our previous studies based on the CPP Computation methodology, the associated characteristics of people's psychology and behaviors in physical space, psychosocial space, and cyberspace exhibit individual differences, so this study designed a survey questionnaire for collecting data and conducting a correlative analysis at the individual level. It included four parts: basic information, a depression tendency scale, behavioral traits in
Depression Tendency Assessment
381
physical space, and behavioral traits in cyberspace. The questions covered demographic factors such as age, marital status, and living arrangements, and other factors discussed in the relevant literature, including smoking and drinking habits and behavior on social media, to better identify individuals with depressive tendencies. Among all the depressive tendency scales, only the Quick Inventory of Depressive Symptomatology (QIDS) is suitable for self-testing, so this study used the QIDS-SR16 Scale as the psychological scale of our questionnaire. It covers 9 areas: sleep quality, negative mood, appetite and weight, attention perception, self-criticism, suicidal tendencies, interests, energy/fatigue, and psychomotor changes. The scale classifies depression into 5 levels: 0–5 no depression, 6–10 mild depression, 11–15 moderate depression, 16–20 severe depression, and 21–27 very severe depression. 2.2 Survey and Data Collection Data were collected through an online survey from 206 participants distributed over 28 provinces and municipalities of China as well as South Korea, with 122 males and 84 females aged from 16 to 56. Figure 1 shows the distribution of the participants' depressive tendency scores assessed by the QIDS-SR16 Scale.
Fig. 1. Number distribution of depressive tendency scores.
According to the QIDS-SR16 scale, the percentages of no depression, mild depression, moderate depression, severe depression, and very severe depression are 27.7%, 27.7%, 24.3%, 15.0%, and 5.3%, respectively, as shown in Fig. 2. This study collected the participants' gender, age, occupation, marital status, and whether they lived alone or with others, and asked them to fill in the QIDS-SR16 Scale. Based on the analysis of the relevant literature, we investigated the weather conditions and the participants' behavioral traits in physical space (outdoor activity frequency, exercise time, keeping pets, smoking, and drinking) and their behavioral traits in cyberspace (posting frequency and the emotional words on WeChat Moments).
Fig. 2. Percentages of different depression tendency levels.
2.3 Statistical Analysis Through regression modeling analysis, six variables are found to have a significant influence on depressive tendencies. Their regression coefficients are shown in Table 1.

Table 1. Regression coefficients of variables with significant influence.

Variable                                  Coefficient   Standard deviation   t-Value   p-Value
Constant (b)                              12.18         0.81                 15.124    0.000
Weather-rainy (X1)                        4.58          2.59                 1.770     0.078
Unmarried (X2)                            −2.43         0.81                 −3.011    0.003
Divorced (X3)                             8.17          5.71                 1.431     0.154
Frequency of going out-changeless (X4)    −2.35         0.82                 −2.864    0.005
Post moment-frequent (X5)                 2.34          1.05                 2.240     0.026
Therefore, the regression model can be constructed as follows based on the coefficients given in Table 1:

P = 4.58 X_1 − 2.43 X_2 + 8.17 X_3 − 2.35 X_4 + 2.34 X_5 + 12.18    (1)

In the model, P is the total depressive tendency score, and the definitions of the remaining variables are given in Table 1. The model reveals the relationship between depressive tendency levels and the significantly influential variables, covering the individual's personal life, weather conditions, and behavioral characteristics in physical space and cyberspace. However, it only captures linear relationships between the variables. In fact, there may be complex non-linear relationships between depressive tendencies and the various variables, which should be analyzed and assessed by more accurate methods constructed through machine learning.
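As a minimal illustration (the function and variable names are ours, and coding X1–X5 as 0/1 indicator variables follows the usual dummy-variable convention, which the paper does not spell out), the score predicted by Formula (1) can be computed as:

```python
# Evaluate the linear regression model of Formula (1).
# X1..X5 are assumed to be 0/1 indicator variables (e.g. X1 = 1 on rainy days).
COEFFS = {"X1": 4.58, "X2": -2.43, "X3": 8.17, "X4": -2.35, "X5": 2.34}
INTERCEPT = 12.18

def predict_score(x: dict) -> float:
    """Total depressive tendency score P for one respondent."""
    return INTERCEPT + sum(COEFFS[name] * x.get(name, 0) for name in COEFFS)

# Example: rainy weather, unmarried, frequent posting on Moments.
p = predict_score({"X1": 1, "X2": 1, "X5": 1})
print(round(p, 2))  # 16.67 -> falls in the "severe depression" band (16-20)
```

The example respondent's predicted score of 16.67 lands in the severe-depression band of the QIDS-SR16 classification given in Sect. 2.1.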
3 Clustering Analysis and Machine Learning Method 3.1 Clustering Analysis Through examination of the sample data, the study finds that behavioral traits show some consistency among people with high depressive tendencies, but large diversity among people with low depressive tendencies. In order to observe the distribution characteristics of the samples, the study uses the K-means method to perform a clustering analysis, as shown in Fig. 3.
Fig. 3. Distribution of sample clusters.
As can be seen in Fig. 3, there are three clusters in the samples. Although many samples are concentrated in the second cluster, there is a fairly large distance between this cluster and the other two. If all samples were used together for machine learning, the samples in the first and third clusters would likely be neglected as noisy data. To avoid this problem, this study first divides the samples into three subsets A, B, and C, performs machine learning on each separately, and then uses the weighted Euclidean distances between a sample and the cluster centers to calculate the final result. 3.2 Machine Learning Method After testing and comparing four machine learning algorithms (K-nearest neighbors, logistic regression, support vector machines, and random forests), the study finds that the SVM (Support Vector Machine) algorithm obtains the best result. However, the commonly used SVM model can only identify the classification of depression tendencies. In order to obtain a quantitative score on the QIDS-SR16 Scale, this research uses the LS-SVR (Least Squares Support Vector Regression) model as an estimator. In the learning method proposed by this study, three estimators G_a, G_b, and G_c are first trained on the sample subsets A, B, and C respectively using the LS-SVR model. Afterwards, a new sample to be assessed is input into the three estimators at the same time, and its final assessment result is calculated by the following weighted formula:

P = a_1 E_1 + a_2 E_2 + a_3 E_3    (2)
where E_1, E_2, and E_3 are the results of estimators G_a, G_b, and G_c respectively, and a_1, a_2, and a_3 are the sample-number-weighted Euclidean distance coefficients between the sample x to be assessed and the cluster centers C_i of subsets A, B, and C, which can be calculated by the following formula:

a_i = s_i \cdot d(x, C_i) / \sum_{j=1}^{3} d(x, C_j),  i = 1, 2, 3    (3)
where the distance coefficients d(x, C_i) / \sum_{j=1}^{3} d(x, C_j) reflect the degrees to which the sample x belongs to subsets A, B, and C, and the s_i are their sample-number weighting coefficients, representing the percentage of each subset's sample count in the total. Using them as weighting factors yields better estimation accuracy when the samples are unevenly distributed and diversified. The study randomly extracts two-thirds of the samples at each level for training and uses the remaining samples for testing the proposed method. Table 2 shows the accuracy of the estimated scores at different levels of depressive tendency.

Table 2. Accuracy of estimated scores at different levels of depressive tendencies.

Depressive tendency level        E1       E2       E3       P
No depression (0–5)              71.6%    82.5%    62.1%    78.6%
Mild depression (6–10)           59.1%    75.6%    51.2%    73.9%
Moderate depression (11–15)      69.3%    63.0%    53.8%    68.6%
Severe depression (16–20)        58.8%    59.6%    68.4%    67.2%
Very severe depression (21–27)   63.4%    52.7%    74.6%    72.7%
Average accuracy                 64.4%    66.7%    62.0%    72.2%
As can be seen in Table 2, the accuracies of the three estimators G_a, G_b, and G_c differ significantly. For no depression (0–5 points) and mild depression (6–10 points), E_2 has the highest estimation accuracies, 82.5% and 75.6% respectively; for moderate depression (11–15 points), E_1 has the highest estimation accuracy, 69.3%; for severe depression (16–20 points) and very severe depression (21–27 points), E_3 has the highest estimation accuracies, 68.4% and 74.6% respectively. Strengthened with the Euclidean distance coefficients, P achieves a mean accuracy of 72.2%, which is higher than those of E_1, E_2, and E_3. The accuracies above are calculated using the depressive tendency scores as estimation targets. Although not yet ideal, they reflect the accuracy achievable from the available information. If only a 5-level classification of depressive tendencies is required, the recognition results on the 126 test samples obtained from the SVM models trained on the sample subsets are shown in Table 3.
Table 3. Recognition results of depressive tendency levels.
As can be seen in Table 3, the average recognition rate of the 5-level classification is 86.3%. The recognition rates of very severe depression (21–27 points) and severe depression (16–20 points) are 100% and 87.5%, respectively. Since the samples at these two levels were few, further validation with more samples is required. The recognition rate of mild depression (6–10 points) was 78.0%, the lowest of all, probably because this level is difficult to differentiate from no depression (0–5 points).
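The cluster-weighted combination of Formulas (2) and (3) can be sketched in a few lines of Python. This is a hypothetical illustration: the cluster centers, subset sizes, and per-cluster estimator outputs below are invented, and any per-cluster regressor (e.g. an LS-SVR implementation) could supply the E_i values.

```python
import math

def weighted_assessment(x, centers, subset_sizes, estimator_outputs):
    """Combine per-cluster estimates via Formulas (2) and (3).

    centers: cluster centers C_i (feature vectors) of subsets A, B, C
    subset_sizes: sample counts of the subsets
    estimator_outputs: E_i, the score predicted for x by each cluster's regressor
    """
    total = sum(subset_sizes)
    s = [n / total for n in subset_sizes]                 # sample-number weights s_i
    d = [math.dist(x, c) for c in centers]                # Euclidean distances d(x, C_i)
    a = [s[i] * d[i] / sum(d) for i in range(len(d))]     # coefficients a_i, Formula (3)
    return sum(a[i] * estimator_outputs[i] for i in range(len(a)))  # P, Formula (2)

# Toy example with invented numbers: three 2-D cluster centers.
P = weighted_assessment(
    x=[1.0, 1.0],
    centers=[[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]],
    subset_sizes=[40, 120, 46],
    estimator_outputs=[8.0, 11.0, 17.0],
)
print(round(P, 2))  # -> 3.9
```

Note that the coefficients a_i as written in Formula (3) do not generally sum to one; the sketch implements the formula exactly as stated in the paper.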
4 Conclusion and Discussion Based on the CPP (Cyber Psychosocial and Physical) Computation methodology, this paper analyzed the behavioral traits of depression tendency in physical space and cyberspace, and studied its assessment method. It was found that those behavioral traits show more consistency at high levels of depressive tendency, but large diversity at low levels, which makes it difficult to construct an effective machine learning assessment algorithm. In this study, a method combining clustering analysis and a weighted LS-SVR estimator was proposed to improve the assessment accuracy and recognition rate, and it achieved better results. However, the proposed method needs to be further verified on more samples, and the following work may be considered in future studies: producing optimal sample subsets through a self-adapting clustering process; improving the machine learning algorithm with prior knowledge of the psychological and behavioral mechanisms related to depressive tendency; and understanding the continuous changes in behavioral traits as the tendency develops from a low level to higher ones. Acknowledgements. This study was supported by the Project of the Ministry of Education of China (No. 19JZD010, No. 18YJA630019), the Luzhou City Social Psychological Service and Crisis Intervention Research Project (No. LZXL-202218), and the Undergraduate Program of Fudan University (No. 202009, No. 202010). Huanhong Huang, Deyue Kong, Fanmin Meng, and Siyi Yang are the joint first authors who made equal contributions. Weihui Dai and Yan Kang are joint corresponding authors. Many thanks to Professor Shuang Huang from Shanghai International Studies University for completing the translation work.
References
1. Chamberlain, S.R., Blackwell, A.D., Fineberg, N.A., Robbins, T.W., Sahakian, B.J.: The neuropsychology of obsessive compulsive disorder: the importance of failures in cognitive and behavioural inhibition as candidate endophenotypic markers. Neurosci. Biobehav. Rev. 29, 399–419 (2005)
2. Hallgren, M., Nguyen, T.T.D., Owen, N., Vancampfort, D., Ekblom-Bak, E.: Associations of sedentary behavior in leisure and occupational contexts with symptoms of depression and anxiety. Prevent. Med. Int. J. Devot. Pract. Theory 133, 106021 (2020)
3. Mulugeta, A., Zhou, A., Vimaleswaran, K.S., Dickson, C., Hyppönen, E.: Depression increases the genetic susceptibility to high body mass index: evidence from UK Biobank. Depress. Anxiety 36, 1154–1162 (2019)
4. Estrella-Castillo, D.F., Gómez-de-Regil, L.: Comparison of body mass index range criteria and their association with cognition, functioning and depression: a cross-sectional study in Mexican older adults. BMC Geriatr. 19, 339 (2019)
5. Dai, Z.Z.: Study on the Characteristics of Neurocognitive Function and Social Cognitive Function on Depressive Patients. Nanjing Medical University, Nanjing (2016)
6. Han, B.X., Jia, L.P., Zhu, G.H., Wang, M.M., Lu, G.H.: Attention bias to emotional faces in depression patients at different states. Chin. J. Health Psychol. 28(6), 819–824 (2020)
7. Li, X., Li, H.Z.: Progress of research on attention bias of depressive patients. World Latest Med. Inf. 19(99), 91–93 (2019)
8. Ma, H.X., Li, H.Q., Liu, J.F., Zhai, Y.F.: Emotional inertia: influencing factors and its relationship with depression. Chin. J. Clin. Psychol. 28(1), 136–139, 144 (2020)
9. Wang, H.F., Liu, L.: Social decision-making of depressed individuals: present situation and prospect. Chin. J. Health Psychol. 26(5), 795–800 (2018)
10. Zhang, Q.: Behavioral and Electrophysiological Study of Impaired Interpersonal Function of Depression. Anhui Medical University, Hefei (2016)
11.
Li, P.Y.: A Detection Model for Identification of Depressed College Students on Weibo Social Network. Harbin Institute of Technology, Harbin (2014) 12. Han, S.Y., Chen, T.Y., Gao, K., Wang, J.Z., Dai, W.H.: Speech intelligent monitoring for early warning of depression recurrence. In: Proceedings of 21st International Conference on IT Applications and Management (ITAM-21), pp. 123–135, Huelva (2019) 13. Masud, M.T., Mamun, M.A., Thapa, K., Lee, D.H., Griffiths, M.D., Yang, S.-H.: Unobtrusive monitoring of behavior and movement patterns to detect clinical depression severity level via smartphone. J. Biomed. Inform. 103, 103371 (2020) 14. Dai, W.H.: Cyber Psychological and Physical (CPP) Computation Based on Social Neuromechanism. Fudan University, Shanghai (2015) 15. Dai, W.H., Wang, J.Z., Cang, X., Feng, G.G.: Smart learning for CyberPsychosocial and physical computation. In: Proceedings of The 2016 International Conference on Social Collaboration and Shared Values in Business (ICSCSVB-1), pp. 101–110, Gwangjiu, Korea (2016) 16. Qian, X.S., Hu, A.A., Dai, W.H., Ling, H.: The functions and the significance of online games in COVID-19 prevention and control: an empirical analysis of online interactive data during the epidemic period. Sci. Technol. Rev. 39(14), 129–143 (2021) 17. Dai, W.H., Duch, W., Abdullah, A.H., Xu, D.R., Chen, Y.-S.: Recent advances in learning theory. Comput. Intell. Neurosci. 2015, 395948 (2015)
Optimization of On-Ramp Confluence Sequence for Internet of Vehicles with Graph Model Zhiheng Yuan, Yuanfei Fang, Xinran Qu, and Yanjun Shi(B) Dalian University of Technology, Dalian 116024, China [email protected]
Abstract. In the confluence area of a highway on-ramp, there are potential conflicts between vehicles on the on-ramp and vehicles in the main lane, which not only affect vehicle safety and traffic efficiency, but also increase energy consumption and environmental pollution. Based on Internet of Vehicles (IoV) and Edge Computing (EC) technology, an optimization method for the on-ramp confluence sequence is proposed. Optimization efficiency and driving safety are improved by analyzing the confluence sequence through a graph model. Finally, MATLAB and SUMO are used to verify the reliability of the optimization. Keywords: Internet of Vehicles · Ramp Confluence · Vehicles Platooning
1 Introduction With the rapid development of China's national economy and society, traffic problems have increasingly attracted the attention of the public and the relevant departments. By the end of 2021, the number of vehicles in use in the country was 300 million, an increase of 20.64 million, or 7.35%, over the end of the previous year [1]. In order to improve traffic capacity and efficiency, highways, as highly efficient transportation infrastructure, have developed rapidly: by the end of 2020, the length of highways in China had reached 161,000 km [2]. However, due to the large increase in the number of vehicles and in traffic flow, highways also face traffic congestion, which increases travel time and vehicle energy consumption and, in turn, environmental pollution. The freeway on-ramp confluence section is one of the key factors affecting the traffic efficiency of the entire freeway system [3]. The ramp confluence section is the intersection of the outermost lane of the highway and the entrance ramp. As vehicles on the ramp enter the main lane, there are potential conflicts with vehicles in the main lane, which may cause vehicles to slow down or even queue. As an important component of the highway system, improving the traffic efficiency of ramp confluence is very important to ensure the smooth flow of the highway. The main causes of this traffic congestion lie in insufficient coordination between vehicles and roads and in uncoordinated, competitive merging. The development © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 387–400, 2023. https://doi.org/10.1007/978-981-99-2385-4_29
388
Z. Yuan et al.
of IoV technology and edge computing technology provides the technical conditions to solve the above problems. A vehicle can exchange information with the IoV, sending its own vehicle information and receiving coordination information from the road. In the ramp confluence section, the IoV can not only coordinate vehicles to avoid traffic congestion, but also make it possible to develop more advanced control systems that reduce vehicle energy consumption and improve road capacity. In this paper, first, the roads and vehicles in the ramp confluence scene are analyzed and modeled. Second, a ramp confluence sequence method is proposed based on the IoV. Finally, the feasibility and validity of the ramp confluence sequence method are verified by simulation of the ramp confluence scene, which provides a theoretical basis for solving traffic congestion at highway entrance ramps under the IoV. The rest of the paper is organized as follows: Sect. 2 introduces the current state of ramp confluence optimization research; Sect. 3 proposes the optimization model of the ramp confluence sequence; Sect. 4 realizes ramp confluence sequence optimization based on the graph model; Sect. 5 builds a simulation platform for on-ramp convergence under the IoV using MATLAB and SUMO, and verifies the efficiency and reliability of the graph-model-based confluence sequence optimization.
2 Related Work With the help of vehicle network communication technologies, ramp convergence can be made efficient through sequence optimization, improving on-ramp capacity and reducing energy consumption. 2.1 Optimal Strategy Method Methods based on an optimal strategy aim to find the globally optimal confluence sequence; their disadvantage is excessive computational complexity. Ye et al. [4] proposed a double-layer optimal edge computing model for ramp confluence to maximize vehicle capacity. The ramp confluence sequence optimization is modeled as a Mixed Integer Linear Programming (MILP) problem, and the MILP-based optimal merging sequence optimization method is verified to improve traffic capacity. Jing et al. [5] proposed an optimization framework and algorithm based on a multi-player cooperative game to coordinate vehicles merging from the ramp so as to minimize the global cost. Although optimal strategy methods can obtain the optimal merging sequence of vehicles, the computational complexity of the solution generally increases rapidly with the number of vehicles, and it is difficult to obtain results in real-time applications, especially in the traffic scene of ramp confluence.
Optimization of On-Ramp Confluence Sequence
389
2.2 Rule-Based Method Compared with optimal strategy methods, rule-based ramp confluence methods aim to obtain a suboptimal merging sequence within limited computing time by using heuristic rules. Cao et al. [6] proposed a path generation method based on Model Predictive Control (MPC) to optimize the confluence sequences of vehicles. The main idea of this method is to make the related vehicles on the main lane speed up or slow down slightly, so that the vehicles on the ramp can merge easily, while ensuring that no collision occurs in the merging process. Kong et al. [7] applied an improved Cellular Automata approach to the merging problem of vehicles in ramp scenarios. The main lane and ramp in the ramp confluence scene were divided into discrete grids, each of which was empty or occupied by a vehicle, and the merging process of ramp vehicles was simulated by improving the updating rules of the grids. The main advantage of the rule-based approach lies in its high computational efficiency, which can meet the real-time requirements of ramp confluence, but it lacks a rigorous theory to guarantee the performance of the algorithm.
3 Optimization Model of Confluence Sequences

According to the control process of the centralized hierarchical control system, each vehicle receives a unique identifier after entering the optimized area; vehicle i denotes the i-th vehicle to arrive at the confluence point. The ramp confluence strategy aims to reduce vehicle energy consumption and improve traffic efficiency, so the objective function can be described as follows:

F_1 = w_1 \sum_{i=1}^{c} \int_{t_i^0}^{t_i^{assigned}} a_i^2(t) dt + w_2 \sum_{i=2}^{c} (t_i^{assigned} - t_{i-1}^{assigned})    (1)

where w_1 and w_2 are weight coefficients, t_i^0 and t_i^{assigned} are the start time of the optimization of vehicle i and the time when vehicle i reaches the confluence point, respectively, and a_i(t) is the acceleration of vehicle i at time t.

Since the control input and speed must stay within given allowable ranges, the arrival time t_i^{assigned} of vehicle i has a lower bound, described by the constraint:

t_i^{assigned} \ge t_i^{min}    (2)

where i \ge 2 and t_i^{min} is the lower bound of the time when vehicle i reaches the confluence point. When vehicle i accelerates uniformly at the maximum acceleration a_{max}, the time it takes to accelerate to the maximum speed v_{max} and then pass the confluence point at constant speed is this lower bound. There are two situations: (1) vehicle i passes the confluence point before accelerating to v_{max}; (2) vehicle i first accelerates to v_{max} and then passes the confluence point at speed v_{max}. The speed curves of vehicle i in the two cases are shown in Fig. 1(a) and (b), respectively.
According to Fig. 1(a), Formula (3) can be obtained:

v_i^0 (t_i^{min} - t_i^0) + \frac{1}{2} a_{max} (t_i^{min} - t_i^0)^2 = |p_i^0|    (3)

Fig. 1. Vehicle speed trajectory in extreme cases

According to Fig. 1(b), Formula (4) can be obtained:

\begin{cases} v_i^0 + a_{max} \Delta t = v_{max} \\ (v_{max})^2 - (v_i^0)^2 = 2 a_{max} p_{in} \\ v_{max} (t_i^{min} - t_i^0 - \Delta t) = |p_i^0| - p_{in} \end{cases}    (4)

where v_i^0 is the initial speed of vehicle i at the start of optimization, p_i^0 is its initial position, and t_i^0 is the start time of the optimization; a specific expression for t_i^{min} can be derived from these equations.

Accordingly, t_i^{assigned} also has an upper bound, described by the constraint:

t_i^{assigned} \le t_i^{max}    (5)

where t_i^{max} is the upper bound of the time when vehicle i reaches the confluence point. Specifically, when vehicle i decelerates uniformly at the maximum deceleration a_{min} until it reaches the minimum speed v_{min} and then passes through the confluence point at constant speed, the resulting time is the upper bound of the arrival time. There are two situations analogous to the acceleration case, whose speed curves are shown in Fig. 1(c) and (d), respectively.
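The lower bound t_i^{min} implied by Formulas (3) and (4) can be computed with a small sketch (the function and variable names are ours; the position is measured as the distance |p_i^0| remaining to the confluence point):

```python
def t_min(v0, p0_abs, v_max, a_max, t0=0.0):
    """Earliest time a vehicle can reach the confluence point.

    v0: initial speed, p0_abs: |p_i^0| distance to the confluence point,
    v_max: speed limit, a_max: maximum acceleration, t0: optimization start time.
    Implements the two cases of Fig. 1(a)/(b): either the vehicle passes the
    point before reaching v_max, or it reaches v_max and cruises the rest.
    """
    p_in = (v_max**2 - v0**2) / (2 * a_max)   # distance needed to reach v_max
    if p_in >= p0_abs:
        # Case (a): solve v0*t + 0.5*a_max*t^2 = |p0| for t (Formula (3)).
        t = (-v0 + (v0**2 + 2 * a_max * p0_abs) ** 0.5) / a_max
        return t0 + t
    # Case (b): accelerate for dt = (v_max - v0)/a_max, then cruise (Formula (4)).
    dt = (v_max - v0) / a_max
    return t0 + dt + (p0_abs - p_in) / v_max

print(t_min(v0=20.0, p0_abs=500.0, v_max=30.0, a_max=2.0))  # 17.5
```

In the printed example, the vehicle needs 125 m to reach 30 m/s (case (b)): 5 s of acceleration plus 375 m of cruising at 30 m/s gives 17.5 s. The upper bound t_i^{max} follows the same structure with a_{min} and v_{min} substituted.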
In order to avoid vehicle collisions, the times at which vehicles pass through the confluence point should respect the time headway of a certain safety interval, described by the formula:

t_i^{assigned} - t_{i-1}^{assigned} \ge T_{hd}    (6)

where T_{hd} is the time headway of the safety interval. Based on the above description, the whole optimization problem can be stated as:

min F_1  subject to (2), (5), (6)    (7)
4 Optimization of Confluence Sequences with Graph Model In the ramp confluence scenario under the IoV, the performance of the ramp convergence strategy mainly depends on the sequence in which vehicles pass through the convergence point [8]. In the on-ramp convergence scenario of this paper, it is assumed that overtaking is not allowed on a single-lane road. Vehicles are grouped according to their motion states in the detection area, as shown in Fig. 2. For each group of vehicles, m denotes the number of vehicles on the main road and n the number of vehicles on the ramp; the vehicle identifier set on the main lane is M = {1, ..., m} and that on the ramp is N = {1, ..., n}.
Fig. 2. Vehicles clustered in detecting zone
In order to ensure that no collision occurs between vehicles in the merging area, the time at which a vehicle arrives at the merging point should be greater than or equal to the arrival time of the preceding vehicle plus the predefined safety headway. Within a group of vehicles, the time at which vehicle i arrives at the merging point is given by Formula (8):

t_i^{assigned} = t_{i-1}^{assigned} + T_{hd}    (8)

where i \ge 2 and T_{hd} is the time headway of the safety interval. Once the merging sequence of the vehicles is obtained, the time of each vehicle arriving at the merging
point can be obtained by iteration according to Formula (8). It is worth noting that, for the head vehicle of each group, the arrival time at the confluence point is not determined by this formula. The confluence time of the head vehicle of each group is optimized separately as t_i^opt, while the confluence time of the tail vehicle of the previous group is t_{i-1}^assigned; the confluence time of each group's head vehicle is therefore max(t_i^opt, t_{i-1}^assigned + T_hd).

Since the merging time of the vehicles is determined by Formula (8), in this case only the total energy consumption of the vehicles needs to be considered instead of Formula (1), and the objective function is described as follows:

F_2 = Σ_{i=1}^{c} (1/2) ∫_{t_i^0}^{t_i^assigned} a_i(t)^2 dt    (9)
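As a small numerical sketch of the objective in (9): the total cost sums, over the vehicles, half the integral of the squared acceleration. The sampled acceleration profile below is hypothetical, used only to check the arithmetic.

```python
import numpy as np

# Numerical sketch of objective (9): F2 = sum_i 1/2 * integral of a_i(t)^2 dt,
# with each vehicle's acceleration profile sampled every dt seconds.

def _trapz(y, dx):
    # Trapezoidal rule for uniformly sampled y.
    return dx * (y[0] / 2 + y[-1] / 2 + y[1:-1].sum())

def energy_objective(accel_profiles, dt):
    """accel_profiles: list of 1-D arrays a_i(t); dt: sampling step."""
    return sum(0.5 * _trapz(a ** 2, dt) for a in accel_profiles)

# Constant 2 m/s^2 for 1 s (11 samples at dt = 0.1): 0.5 * 4 * 1.0 = 2.0.
a1 = np.full(11, 2.0)
print(energy_objective([a1], 0.1))
```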
This objective function is optimized to reduce the total energy consumption of the vehicles; that is, under the condition of meeting its confluence time, each vehicle drives to the confluence point along the trajectory with the lowest energy consumption. In order to optimize the confluence sequence of a group of vehicles, a graph model is established to reduce the total energy consumption. A directed graph G = (V, E) is constructed, as shown in Fig. 3, in which V and E ⊆ V × V are the vertex set and the directed edge set, respectively. Each vertex of the graph model is represented by a pair (j, k): j is the total number of vehicles on the main lane that have been assigned the right of way, and k is the total number of vehicles on the on-ramp that have been assigned the right of way. The edge set E contains two subsets:

E1 = {(j, k) → (j + 1, k) | 0 ≤ j < m, 0 ≤ k ≤ n, j ∈ Z, k ∈ Z},
E2 = {(j, k) → (j, k + 1) | 0 ≤ j ≤ m, 0 ≤ k < n, j ∈ Z, k ∈ Z},

with E = E1 ∪ E2 and E1 ∩ E2 = ∅.
Fig. 3. Graph optimizing the merging sequences
The directed edge (j, k) → (j + 1, k), from vertex (j, k) to vertex (j + 1, k), represents increasing by one the total number of main-lane vehicles with the right of way; similarly, the directed edge (j, k) → (j, k + 1), from vertex (j, k) to vertex (j, k + 1), represents increasing by one the total number of on-ramp vehicles with the right of way.
Optimization of On-Ramp Confluence Sequence
393
For m vehicles on the main lane and n vehicles on the ramp, all of these vehicles eventually pass through the merging point. The optimization of the merging sequence is therefore the problem of allocating the right of way at the merging point m + n times to these vehicles in a certain order. In the directed graph G, the confluence sequence optimization problem is converted into a shortest path problem: each path from vertex (0, 0) to vertex (m, n) represents a vehicle confluence sequence. The variable r_w records the choice made at step w; clearly, in G every step has two options, r_w ∈ {0, 1}. When r_w = 0 and j < m, step w goes along the directed edge (j, k) → (j + 1, k); when r_w = 1 and k < n, step w goes along the directed edge (j, k) → (j, k + 1) (Fig. 4).
Fig. 4. Schematic diagram of the correspondence between routes and confluence sequences
Corresponding to the merging sequence optimization problem, the right of way at the merging point is allocated m + n times, from the 1st to the (m + n)-th, and each allocation has two possibilities. In the on-ramp merging scene, the variable r_w can also record the right-of-way allocation decision at step w: when r_w = 0 and j < m, a vehicle on the main lane gains the right of way at the merging point; when r_w = 1 and k < n, a vehicle on the ramp gains the right of way at the merging point. Through the variable r_w, the path selection at each step of the shortest path problem on G corresponds to the right-of-way allocation decision at each step of the confluence sequence optimization problem, so the possible confluence sequences of the vehicles are in one-to-one correspondence with the possible paths from vertex (0, 0) to (m, n) in G. For example, when two cars on the main lane and two on the ramp form one group, namely m = 2 and n = 2, the path (0, 0) → (0, 1) → (0, 2) → (1, 2) → (2, 2) corresponds to the vehicle confluence sequence A → C → B → D. The correspondence between all possible paths and possible confluence sequences is shown in Table 1. The number of all possible
Table 1. Table of correspondence between routes and merging sequences

A path in a directed graph                      The merging sequence of vehicles
(0, 0) → (0, 1) → (0, 2) → (1, 2) → (2, 2)      A → C → B → D
(0, 0) → (0, 1) → (1, 1) → (1, 2) → (2, 2)      A → B → C → D
(0, 0) → (0, 1) → (1, 1) → (2, 1) → (2, 2)      A → B → D → C
(0, 0) → (1, 0) → (1, 1) → (1, 2) → (2, 2)      B → A → C → D
(0, 0) → (1, 0) → (1, 1) → (2, 1) → (2, 2)      B → A → D → C
(0, 0) → (1, 0) → (2, 0) → (2, 1) → (2, 2)      B → D → A → C
paths or possible merging sequences can be calculated by Formula (10):

Num = (m + n)! / (m! n!)    (10)
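Formula (10) counts the interleavings of the two lane queues (overtaking within a lane is forbidden), i.e. the binomial coefficient C(m + n, m). A quick sketch that cross-checks the count by enumeration:

```python
from itertools import permutations
from math import comb

# Formula (10): the number of merging sequences for m main-lane and n ramp
# vehicles is (m + n)! / (m! n!) = C(m + n, m).

def num_sequences(m: int, n: int) -> int:
    return comb(m + n, m)

# Cross-check by enumerating distinct interleavings for m = n = 2
# ('M' = a main-lane slot, 'R' = a ramp slot).
interleavings = set(permutations("MMRR"))
print(num_sequences(2, 2), len(interleavings))  # 6 6
```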
Based on the graph G = (V, E), the shortest path from vertex (0, 0) to (m, n) is found, in which the edge weights represent the predicted energy consumption of the corresponding vehicles. The shortest path from (0, 0) to (m, n) therefore represents the confluence sequence with the smallest total energy consumption, which solves the vehicle confluence sequence optimization problem. Grouping the vehicles in the detection area first is intended to create suitable conditions for the graph-model-based confluence sequence optimization. Consider what happens if vehicles are not grouped before the optimization in a low-traffic scene, as shown in Fig. 5: vehicle A triggers this round of optimization, and there are only vehicles A and B in the detection area. Vehicle A is about to leave the detection area at a speed of v_max, while vehicle B has just entered the detection area at a speed of v_min.
Fig. 5. On-ramp confluence scene at low traffic flow
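To make the shortest-path formulation over the grid-shaped DAG concrete, here is a dynamic-programming sketch; the cost function standing in for the predicted per-vehicle energy consumption is invented purely for illustration.

```python
# Shortest path on the grid DAG G: vertex (j, k) means j main-lane and k ramp
# vehicles have received the right of way. edge_cost(lane, j, k) is a
# hypothetical stand-in for the predicted energy of the vehicle that merges
# on the edge arriving at vertex (j, k).

def best_sequence(m, n, edge_cost):
    """DP from (0, 0) to (m, n); returns (min cost, decision string),
    where each decision r_w is 'M' (main lane) or 'R' (ramp)."""
    INF = float("inf")
    cost, back = {(0, 0): 0.0}, {}
    for j in range(m + 1):
        for k in range(n + 1):
            if (j, k) == (0, 0):
                continue
            best = (INF, None)
            if j > 0:  # edge (j-1, k) -> (j, k): a main-lane vehicle merges
                best = min(best, (cost[(j - 1, k)] + edge_cost("M", j, k), "M"))
            if k > 0:  # edge (j, k-1) -> (j, k): a ramp vehicle merges
                best = min(best, (cost[(j, k - 1)] + edge_cost("R", j, k), "R"))
            cost[(j, k)], back[(j, k)] = best
    # Reconstruct the decision sequence r_1 ... r_{m+n} by backtracking.
    seq, j, k = [], m, n
    while (j, k) != (0, 0):
        d = back[(j, k)]
        seq.append(d)
        j, k = (j - 1, k) if d == "M" else (j, k - 1)
    return cost[(m, n)], "".join(reversed(seq))

# Toy cost: main-lane merges get more expensive later in the sequence,
# so the optimum schedules the main-lane vehicles first.
c, seq = best_sequence(2, 2, lambda lane, j, k: 1.0 if lane == "R" else 2.0 + j + k)
print(c, seq)  # 9.0 MMRR
```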
The graph model was used to optimize the confluence sequence of vehicles A and B, and the resulting passage sequence was A → B. The edge computing server sets the unique numbers of vehicles A and B as i and i + 1, respectively, according to the merging sequence. For vehicle i, the head vehicle, the confluence time t_i^assigned is assigned. According to Formula (8), the confluence time of vehicle i + 1 is t_{i+1}^assigned = t_i^assigned + T_hd. Under the vehicle capability constraints, however, this assigned time falls outside the feasible confluence time range of vehicle i + 1:

t_{i+1}^assigned ∉ [t_{i+1}^min, t_{i+1}^max]    (11)

so vehicle i + 1 cannot reach the confluence point within the prescribed time, as shown in Fig. 6(a). Considering the grouping of vehicles in the detection area, vehicles A and B are divided into two groups, and the graph-model merging sequence optimization is carried out for each group separately. Since each group contains only one car, and each car is the head car of its group, the head car of each group is not constrained by the merging time in Formula (11). Therefore, the confluence time of vehicle B can be optimized separately, as shown in Fig. 6(b).
Fig. 6. Diagram of vehicle position trajectory: (a) The vehicle i+1 can’t reach the point; (b) The vehicle i+1 can reach the point
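A crude sketch of the capability check behind constraint (11): ignoring acceleration limits for simplicity (the paper's full vehicle model is richer), a vehicle at distance d from the merging point and bounded by speeds v_min..v_max can arrive no earlier than d / v_max and no later than d / v_min. All names here are ours.

```python
# Simplified reachable-time window for constraint (11); acceleration limits
# are deliberately ignored in this sketch.

def reachable_window(d: float, v_min: float, v_max: float) -> tuple[float, float]:
    return d / v_max, d / v_min

def can_meet(t_assigned: float, d: float, v_min: float, v_max: float) -> bool:
    t_min, t_max = reachable_window(d, v_min, v_max)
    return t_min <= t_assigned <= t_max

# A vehicle 200 m from the merging point, bounded to 10..20 m/s, cannot
# arrive before 10 s or after 20 s.
print(can_meet(5.0, 200.0, 10.0, 20.0), can_meet(15.0, 200.0, 10.0, 20.0))  # False True
```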
The flowchart of the ramp confluence sequence method is shown in Fig. 7.
Fig. 7. Flowchart of the ramp confluence sequence method
5 Simulation Analysis of Ramp Confluence

In this paper, MATLAB is used to implement the ramp confluence sequence optimization module, and the TraCI4Matlab interface is used to interact with SUMO. The pseudocode of the interaction between MATLAB and SUMO is shown in Table 2. The main functions of the ramp confluence sequence optimization and formation planning simulation include: (1) judging whether any vehicle triggers the optimization conditions (i.e., whether the scene is low traffic flow or not); (2) if the optimization condition is triggered, setting the number of each vehicle in the group; (3) using the graph model to optimize the vehicle confluence sequence; (4) formation planning based on the motion trajectory (according to the vehicle and road conditions). In addition, other auxiliary function modules are also implemented, such as obtaining basic road information, detecting the vehicles to be optimized in each round of optimization, vehicle energy consumption statistics, and vehicle motion analysis. According to the optimized merging sequence, each vehicle in the two groups runs along the motion trajectory that minimizes the integral of the squared acceleration, as shown in Fig. 7. Figure 7(a) shows the position curves of the two groups of vehicles. It can be seen that in this round of optimization, no collision occurred during the merging process, and the two groups of vehicles formed two vehicle formations after passing the merging point. The vehicles are divided into two groups, so that the lead car of
Table 2. Pseudocode for the interactive implementation between MATLAB and SUMO

traci.start(sumoCmd);   % Start SUMO
simTime=300;            % Set the simulation time
step=0.1;               % The initial simulation step size
while step

as an example, the model identifies the relationship between them as "start-position/person", in which "start-position" is the event type and "person" is the argument role of "陈德良" in the "start-position" event.
Fig. 2. Trigger and argument relationship classification
The advantage of our method is that it merges four event extraction tasks into one classification task in a relatively simple way. The model structure is simple and easy to train, and at the same time it solves the problems of argument overlap and multi-role arguments.
2 Model

Our model treats event extraction as a relation classification task. This method classifies the relationship between a trigger word and an argument, where the relationship type is "event type + argument role". The method utilizes the information between candidate trigger words and candidate arguments to achieve event extraction, and also tackles sentences with multiple events, argument overlap, and multi-role arguments. The idea of our method is shown in Fig. 2. First, candidate trigger words and candidate arguments are identified, and then the relationship between them is classified. Incorrectly identified candidate trigger words and candidate arguments are further processed in the subsequent relation classification: if either the trigger word or the argument is wrong, the relationship is labeled "none". The main structure of the model, shown in Fig. 3, includes two parts: a bidirectional GRU and hierarchical attention. The hierarchical attention in our model differs from the word-level and sentence-level attention in HAN [17]. We study sentence-level event extraction, and in trigger words such as "受了伤" (got injured) and "杀青" (wrapped filming), each character has a different importance for event type detection. Therefore, we modify the hierarchical attention mechanism in HAN into word-level attention and character-level attention.

2.1 Candidate Trigger Word and Candidate Argument Detection

To identify candidate trigger words and arguments, it is necessary to identify as many trigger words and arguments in the sentence as possible; wrongly identified trigger words
404
Q. Hu and H. Wang

Fig. 3. Model structure of this paper
and arguments will be handled during relation classification. We use a BERT + CRF model to treat candidate trigger word and argument detection as a word-level sequence labeling task. Assuming that the sentence contains n words, these n words are input into the BERT model to learn their features. The features are then fed into the CRF to identify the most likely label sequence for the words. When identifying arguments, we only need to decide whether a word belongs to an argument, not its role. Therefore, we only use three labels (B, I and O) for the CRF: "B" means that the current character is at the beginning of an argument, "I" means that it is inside an argument, and "O" means that it is not in an argument. The same method is used for candidate trigger word detection.

2.2 Model Input

In the hierarchical attention mechanism, we suppose that there are T characters and L words in the input sentence of the model. The T characters are first input, and the Embedding Layer converts them into embedding vectors. The model obtains the input vector X by concatenating the word embedding with the candidate argument and trigger word position embeddings. In order to make the model fully learn the forward and backward semantic information in the sentence, we choose the bidirectional GRU, which is easier to train than the bidirectional LSTM, as the base model.
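The B/I/O labeling scheme used for candidate detection (Sect. 2.1) can be decoded into candidate spans with a few lines of code. A sketch with names of our choosing, independent of the paper's implementation:

```python
# Turn a CRF's B/I/O label sequence into (start, end) spans (end exclusive);
# used the same way for both trigger-word and argument candidates.

def bio_to_spans(labels):
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":
            if start is not None:      # close the previous chunk
                spans.append((start, i))
            start = i
        elif lab == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # lab == "I": continue the current chunk (a stray I after O is dropped)
    if start is not None:
        spans.append((start, len(labels)))
    return spans

print(bio_to_spans(["O", "B", "I", "O", "B", "B", "I"]))  # [(1, 3), (4, 5), (5, 7)]
```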
Chinese Event Extraction Based on Hierarchical
405
2.3 Hierarchical Attention

Not all characters in a trigger word are equally important to the word; for example, in the trigger word "受了伤" (got injured), the character "伤" is obviously much more important than "了". Therefore, following the hierarchical attention mechanism, we apply character-level attention in this paper. The attention score is computed from the character hidden state, as shown in Eq. (2), and the representation of a word is obtained as the weighted sum of the hidden states of its characters, as shown in Eq. (3):

u_it = tanh(W_w h_it + b_w)    (1)

a_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)    (2)

w_i = Σ_t a_it h_it    (3)

where W_w is the weight matrix of the fully connected layer, u_w is the character-level context vector, and h_it is the feature vector of character t obtained from the bidirectional GRU. The word representation w_i is obtained by summing the character hidden states weighted by the attention scores. After that, a Bi-GRU of the same structure is used to learn the hidden states of the words. Word attention scores are computed from the context vector u_c and the word representations via softmax, as shown in Eq. (6). Finally, the word hidden vectors are weighted and summed to obtain the sentence hidden vector s. The vector s, as the representation of the candidate trigger word, the candidate argument, and the sentence information, is fed into a fully connected layer to obtain the relation label.

h_i = Bi-GRU(w_i), i ∈ [1, L]    (4)

u_i = tanh(W_s h_i + b_s)    (5)

a_i = exp(u_i^T u_c) / Σ_i exp(u_i^T u_c)    (6)

s = Σ_i a_i h_i    (7)
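The character-level pooling in Eqs. (1)–(3), and its word-level twin in Eqs. (5)–(7), amount to a scored weighted sum. A NumPy sketch with illustrative shapes and a random context vector (not the trained parameters):

```python
import numpy as np

# Attention pooling of Eqs. (1)-(3): score each hidden state against a
# context vector, softmax the scores, return the weighted sum.

def attention_pool(H, W, b, u):
    """H: (T, d) hidden states; W: (d, d) weights; b: (d,) bias; u: (d,) context."""
    U = np.tanh(H @ W + b)            # Eq. (1): u_it = tanh(W_w h_it + b_w)
    scores = U @ u                    # u_it^T u_w
    a = np.exp(scores - scores.max())
    a = a / a.sum()                   # Eq. (2): softmax attention weights
    return a @ H                      # Eq. (3): w_i = sum_t a_it h_it

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))           # 5 characters, hidden size 8
w = attention_pool(H, np.eye(8), np.zeros(8), rng.normal(size=8))
print(w.shape)  # (8,)
```

The same function, applied to the word hidden states with context vector u_c, yields the sentence vector s of Eq. (7).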
2.4 Model Output

We regard event extraction as relationship prediction between candidate trigger words and candidate arguments. The next step is to map the feature vector s into the classification space using a fully connected layer. We construct the relationship between candidate trigger words and candidate arguments as event type plus argument role. There are 33 event types and 35 argument roles, and 223 relationship types are finally constructed. In order to deal with wrong candidates, a "none" relationship label is also needed, indicating that the candidate trigger word or the candidate argument is wrong. Therefore, we regard event extraction as a 224-way classification task. In order to prompt the model to learn more event knowledge, when the trigger words are the same we want the predicted event types to be the same too, so we additionally perform event type classification on the feature vector s, which is a 34-way classification task (33 event types plus none). The relationship classification output is shown in Eq. (9), and the event type classification in Eq. (11):

o = tanh(W_o s + b_o)    (8)

y = softmax(o)    (9)

t = tanh(W_p s + b_p)    (10)

p = softmax(t)    (11)

The output y gives the scores for classifying s into the various relations, where W_o and W_p are weight matrices, b_o and b_p are bias terms, and y ∈ R^224 and p ∈ R^34 are the classification probabilities over relations and event types, respectively.

2.5 Loss Function

We use the cross-entropy loss as the model's loss function. The calculation is shown in Eq. (12), where y is the one-hot relationship label between the candidate trigger word and the candidate argument, z is the event type label, N = 224, and M = 34:

Loss = − Σ_{j=1}^{N} y_j · log o_j − Σ_{j=1}^{M} z_j · log t_j    (12)
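The joint loss in Eq. (12) is the sum of two cross entropies, one over the 224 relation classes and one over the 34 event type classes. A NumPy sketch with the paper's N and M but with zero (untrained) logits standing in for the network outputs:

```python
import numpy as np

# Sketch of Eq. (12): cross entropy of the relation prediction plus cross
# entropy of the event type prediction, with one-hot labels y and z.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_loss(rel_logits, type_logits, y, z, eps=1e-12):
    o = softmax(rel_logits)   # relation probabilities (Eq. 9)
    t = softmax(type_logits)  # event type probabilities (Eq. 11)
    return -np.sum(y * np.log(o + eps)) - np.sum(z * np.log(t + eps))

N, M = 224, 34
y = np.eye(N)[3]              # gold relation label (one-hot)
z = np.eye(M)[0]              # gold event type label (one-hot)
loss = joint_loss(np.zeros(N), np.zeros(M), y, z)
print(loss)  # uniform predictions give log(224) + log(34)
```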
3 Experiment

We evaluate on the public datasets ACE 2005 and CEC (Chinese Emergency Corpus). We divide ACE 2005 into 549 texts as the training set, 20 as the validation set, and 64 as the test set. The CEC dataset contains 333 texts, and we randomly divide it into training, validation, and test sets according to the ratio 7:2:1. In order to verify our method, we selected the following models for comparison:

(1) DMCNN was proposed by Chen et al. [5] in 2015 for sentences containing multiple events. On the basis of a convolutional neural network, dynamic multi-pooling is designed to extract more important features.
(2) Rich-C was proposed by Chen et al. [6]. Based on Lin et al., they mix trigger word context features, dependency features, semantic features, and nearest-entity features for event detection.
(3) C-BiLSTM was proposed by Zeng et al. [18] in 2016, combining CNN and LSTM for feature extraction. This method treats event extraction as a word-level sequence labeling task.
(4) NPNs was proposed by Lin et al. [15] in 2018 to address trigger word segmentation errors. The model learns the structural and semantic information of words and characters, and the word and character features are mixed for event detection. This is a classification method based on character features.
(5) JMCEE was proposed by Xu et al. [19] in 2020. This model regards argument extraction as a word-level binary classification task for the case of multiple events in a sentence: each word is classified as to whether it is the start word or the end word of an argument.

3.1 Experimental Results and Analysis

Experimental results are shown in Table 1 and Table 2. From the results, we can see that compared with DMCNN and C-BiLSTM, the F1 score of our model improves by 2.3% and 2.1%, respectively. In Table 2, our method also outperforms JMCEE (BERT-Pipeline) on the argument classification task, but it is not satisfactory compared with the other models. Although the effect is not outstanding, our model is relatively simple and can handle the problems of argument overlap and multi-role arguments that are difficult for other models. The main reason the model did not achieve better results is probably that we use BERT + CRF for trigger word and argument detection: only 62.7% of the trigger words and 59.3% of the arguments in the test set were identified. One reason may be that arguments are not simple named entities but also include time and value expressions. Another reason may be that the number of labeled examples in the ACE dataset is too small to train a sufficiently powerful BERT + CRF model.

Table 1. Experiment on event detection

                        Trigger detection       Trigger classification
                        P     R     F1          P     R     F1
DMCNN                   66.6  63.6  65.1        61.6  58.8  60.2
Rich-C                  62.2  71.9  66.9        58.9  68.1  63.2
C-BiLSTM                65.6  66.7  66.1        60.0  60.9  60.4
NPNs                    75.9  61.2  67.8        73.8  59.6  65.9
JMCEE(BERT-Pipeline)    82.5  78.0  80.2        72.6  68.2  70.3
JMCEE(BERT-Joint)       84.3  80.4  82.3        76.4  71.7  74.0
Ours                    63.6  71.5  67.3        59.0  66.4  62.5
In order to exclude the influence of the low accuracy of BERT + CRF in identifying arguments, and to verify whether our method is useful for event classification and argument
Table 2. Experiment on argument extraction

                        Argument detection      Argument classification
                        P     R     F1          P     R     F1
Rich-C                  43.6  57.3  49.5        39.2  51.6  44.6
C-BiLSTM                53.0  52.2  52.6        47.3  46.6  46.9
JMCEE(BERT-Pipeline)    59.5  40.4  48.1        51.9  37.5  43.6
JMCEE(BERT-Joint)       66.3  45.2  53.7        53.7  46.7  50.0
Ours                    62.3  38.3  47.4        58.8  36.2  44.8
role classification, we carried out the following experiments: we constructed pairs of trigger words and arguments as positive samples, and added some negative samples with the relationship "none" to simulate the situation where the trigger word or argument is identified incorrectly. The experimental results are shown in Table 3.

Table 3. Trigger classification and argument classification

                        Trigger classification  Argument classification
                        P     R     F1          P     R     F1
DMCNN                   61.6  58.8  60.2        39.2  51.6  44.6
Rich-C                  58.9  68.1  63.2        47.3  46.6  46.9
C-BiLSTM                60.0  60.9  60.4        51.9  37.5  43.6
NPNs                    73.8  59.6  65.9        53.7  46.7  50.0
JMCEE(BERT-Pipeline)    72.6  68.2  70.3        39.2  51.6  44.6
JMCEE(BERT-Joint)       76.4  71.7  74.0        47.3  46.6  46.9
Ours                    80.4  79.5  80.0        60.7  58.0  59.3
From the experimental results in Table 3, it can be seen that with improved trigger word detection accuracy, our model has great advantages on classification tasks such as event type classification and argument role classification. To further verify the performance, we selected the following models for experiments on the CEC dataset. Yang et al. [20] used Bi-LSTM and CRF in the financial domain, treating event extraction as a sequence labeling task (DCFEE). Ma et al. [21] proposed BiGRU (Bi-directional Gated Recurrent Unit), which uses an end-to-end approach to perform trigger word detection and event type classification simultaneously to avoid error propagation. Addressing the shortcomings of DCFEE, Zheng et al. [22] proposed Doc2EDAG (Document to Event Directed Acyclic Graph), which effectively converts text into a directed acyclic graph of entities for end-to-end event extraction. Transfer [23] was proposed by Huang et al. for new event detection. The LEAM [24] model utilizes the information of labels and treats text classification as a joint label-word embedding problem. Yin et al. [25] introduced a residual network into the network structure to alleviate the vanishing gradient problem and proposed Conv-RDBiGRU. The experimental results are shown in Table 4.

Table 4. CEC dataset event detection experiment

Model           P     R     F1
DCFEE           68.7  70.9  69.4
BiGRU           71.1  69.0  70.0
Doc2EDAG        73.5  70.3  71.9
Transfer        74.1  70.5  72.2
Conv-RDBiGRU    78.8  69.3  73.8
LEAM            71.1  79.7  75.2
Ours            75.9  76.1  76.0
From Table 4, we can see that our model achieves better results than the other models, which indicates that incorporating both trigger word and argument information into the model is useful. On the other hand, the knowledge learned by our model is richer, and it finishes event extraction in one step, which indicates that the joint model outperforms the pipeline models. Compared with models such as Doc2EDAG, Transfer, and LEAM, our model is simpler and easier to train.
4 Conclusion

In this paper, we investigate event extraction research based on deep learning and summarize it into three categories: classification methods based on feature learning, methods based on question answering, and methods based on Seq2seq. Classification methods based on feature learning have difficulty dealing with argument overlap and multi-role arguments, methods based on question answering are prone to error propagation, and methods based on Seq2seq are difficult to train. Since current methods each have their own intractable problems, we propose an event extraction method based on the idea of relation classification. This method takes candidate trigger word and candidate argument pairs as input, and the model classifies their relationship to extract events. After experiments on the ACE 2005 and CEC data, our method performs well on both the event type classification and argument classification tasks.

Acknowledgments. This work is supported by the National Natural Science Foundation of China under Grant No. 61966020. Moreover, we sincerely thank all reviewers for their valuable comments.
410
Q. Hu and H. Wang
References

1. Ahn, D.: The stages of event extraction. In: Proceedings of the Workshop on Annotating and Reasoning About Time and Events, pp. 1–8 (2006)
2. Zheng, S., Cao, W., Xu, W., et al.: Doc2EDAG: an end-to-end document-level framework for Chinese financial event extraction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 337–346 (2019)
3. Chen, Z., Ji, H.: Language specific issue and feature exploration in Chinese event extraction. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pp. 209–212 (2009)
4. Wang, X., Wang, Z., Han, X., et al.: HMEAE: hierarchical modular event argument extraction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5781–5787 (2019)
5. Chen, Y., Xu, L., Liu, K., et al.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 167–176 (2015)
6. Chen, C., Ng, V.: Joint modeling for Chinese event extraction with rich linguistic features. In: Proceedings of COLING 2012, pp. 529–544 (2012)
7. Zhang, J., Qin, Y., Zhang, Y., et al.: Extracting entities and events as a single task using a transition-based neural model. In: IJCAI, pp. 5422–5428 (2019)
8. Chen, Y., Chen, T., Ebner, S., et al.: Reading the manual: event extraction as definition comprehension. In: Proceedings of the Fourth Workshop on Structured Prediction for NLP, p. 783 (2020)
9. Du, X., Cardie, C.: Event extraction by answering (almost) natural questions. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 671–683 (2020)
10. Liu, J., Chen, Y., Liu, K., et al.: Event extraction as machine reading comprehension. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1641–1651 (2020)
11. Li, F., Peng, W., Chen, Y., et al.: Event extraction as multi-turn question answering. In: Findings of the Association for Computational Linguistics, pp. 829–838 (2020)
12. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 6000–6010 (2017)
13. Devlin, J., Chang, M., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
14. Li, S., Ji, H., Han, J.: Document-level event argument extraction by conditional generation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (2021)
15. Lin, H., Lu, Y., Han, X., et al.: Nugget proposal networks for Chinese event detection. arXiv preprint arXiv:1805.00249 (2018)
16. Lin, J., Jian, J., Chen, Q.: Eliciting knowledge from language models for event extraction. arXiv preprint arXiv:2109.05190 (2021)
17. Paolini, G., Athiwaratkun, B., Krone, J., et al.: Structured prediction as translation between augmented natural languages. In: International Conference on Learning Representations, pp. 1–26 (2021)
18. Zeng, Y., Yang, H., Feng, Y., Wang, Z., Zhao, D.: A convolution BiLSTM neural network model for Chinese event extraction. In: Lin, C.Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) Natural Language Understanding and Intelligent Applications. ICCPOL/NLPCC 2016. Lecture Notes in Computer Science, vol. 10102, pp. 275–287. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4_23
19. Xu, N., Xie, H., Zhao, D.: A novel joint framework for multiple Chinese events extraction. In: Sun, M., Li, S., Zhang, Y., Liu, Y., He, S., Rao, G. (eds.) CCL 2020. LNCS (LNAI), vol. 12522, pp. 174–183. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63031-7_13
20. Yang, H., Chen, Y., Liu, K., Xiao, Y., Zhao, J.: DCFEE: a document-level Chinese financial event extraction system based on automatically labeled training data. In: Proceedings of ACL 2018, System Demonstrations, pp. 50–55, Melbourne, Australia (2018)
21. Ma, C., Chen, X., Wang, W.: Chinese event detection based on recurrent neural network. Netinfo Secur. 5, 75–81 (2018)
22. Zheng, S., Cao, W., Xu, W., Bian, J.: Doc2EDAG: an end-to-end document-level framework for Chinese financial event extraction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), vol. 1, pp. 337–346, Hong Kong, China (2019)
23. Huang, L., Ji, H., Cho, K., Dagan, I., Riedel, S., Voss, C.: Zero-shot transfer learning for event extraction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 2160–2170, Melbourne, Australia (2018)
24. Wang, G., Li, C., Wang, W., et al.: Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 2321–2331, Melbourne, Australia (2018)
25. Yin, H., Cao, J., Cao, L., et al.: Chinese emergency event recognition using Conv-RDBiGRU model. Comput. Intell. Neurosci. 2020 (2020)
Instance-Aware Style-Swap for Disentangled Attribute-Level Image Editing

Xinjiao Zhou1, Bin Jiang1,2(B), Chao Yang1, Haotian Hu1, and Minyu Sun1

1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
{zhouxinjiao,jiangbin,yangchaoedu,huhaotian,sunminyu}@hnu.edu.cn
2 Key Laboratory for Embedded and Network Computing of Hunan Province, Hunan University, Changsha 410082, Hunan, China
Abstract. Recent studies have shown that attribute-level image editing can be achieved by modifying the latent code of a style-based Generative Adversarial Network (StyleGAN). However, many existing methods suffer from semantic entanglement, which leads to undesirable changes when editing a specific attribute. To solve this problem, we focus on the S space of StyleGAN2 and propose a simple and effective Optimization-based Instance-Aware Style-Swap method, which can perform a natural transformation effect on arbitrary source and reference images by searching instance-aware swap coefficients with the guidance of prior information. By further analyzing the learned swapping coefficients, we can identify the style channels connected with a specific target attribute. We verify the effectiveness of our proposed method in the field of face attribute editing. Extensive experiments have demonstrated that our method can achieve controllable and fine-grained image editing along various attributes. Qualitative and quantitative results show the advantages of our method compared with other semantic image editing methods.
Keywords: Image editing Swapping coefficients
1
· Optimization-based · Instance-Aware ·
Introduction
In recent years, generative adversarial networks (GANs) [1–5] have played a leading role in the field of image generation with their superior performance in generating realistic high-resolution images. The traditional GAN [1] randomly samples noise from a Gaussian distribution and then maps it into an image of a specific domain through a deep neural network. However, due to the unconditional setting, the sampled latent code is hard to interpret. This presents a challenge for semantic image editing, which aims to transform a source image into a target image with the desired attribute while preserving other attributes unchanged.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 412–422, 2023.
https://doi.org/10.1007/978-981-99-2385-4_31
Fig. 1. We propose an Optimization-based Instance-Aware Style-Swap method for editing a source image, which could achieve natural attribute-level transformation from a reference image to a target image with the desired attribute, e.g., (a) narrow eyes, (b) mustache and (c) lipsticks.
A feasible solution to the above challenges is conditional GANs [6–9], which are not only committed to generating realistic images but also pay attention to maintaining the information of the input conditions. However, training a conditional GAN requires a large number of manually labeled samples and long training times. Even so, the images generated by conditional GANs can be blurry and no better than those of unconditional GANs. Another solution is to disentangle the latent space of GANs. Recent works [10–14] reveal that the latent space of GANs contains a wide range of semantic directions. Unfortunately, unsupervised methods cannot precisely edit the user-desired target attribute, and supervised ones face a severe problem of feature entanglement. To address these problems, many studies [15–17] have focused on the S space of StyleGAN2 [5]. Compared with a traditional generator, StyleGAN [4,5] uses eight MLPs to convert the randomly sampled noise into the W latent space, which is then transformed into the S space that controls each convolution layer. StyleSpace [15] reveals that each style channel in S space is responsible for only a specific attribute, which means we can edit specific attributes by modifying the corresponding style channel. The key points are therefore how to find the style channel that controls the corresponding attribute and how to adjust its style value. StyleSpace [15] finds precisely controllable directions in the S space guided by semantic masks and analyzes the style channels that control a specific attribute. However, it can only obtain the spatial area controlled by the channels and cannot determine how to adjust the activation value. EIS [16] obtains a coarse image segmentation by using K-spherical-means [18] to cluster the activated feature maps of a given layer, and then swaps specific style channels between the source image and the reference image to edit the image.
Since K-spherical-means cannot obtain the precise region controlled by a channel, the visual effect of the editing result may not be excellent, and fine-grained attribute editing cannot be achieved. Furthermore, RIS [17] indicates
X. Zhou et al.
Fig. 2. Our proposed method. We invert Isource and Irefer into the style vectors ss, sr, where ss, sr ∈ S. The two style vectors are fed into the Style-Swap module to obtain a new style vector se. We employ a generator to reconstruct the image Iedit from se. In particular, we use a pre-trained attribute classifier to predict the attribute scores of the image. Finally, we iteratively perform the swap process to minimize the objective loss function.
that the style channels corresponding to a particular feature differ across images, and computes an attribute-channel catalog per image. However, it still needs to manually label each cluster, which is time-consuming. Our preliminary intuition is to seek a dynamic attribute catalog of style channels and to use prior information for supervision. In this work, we aim to transform a source image into a target image with the desired attribute while preserving other attributes unchanged. To achieve this goal, motivated by EIS [16] and RIS [17], we propose a simple and effective Optimization-based Style-Swap method. By swapping specific style channels between the source image and the reference image under the guidance of prior information, we can perform a natural transformation effect on arbitrary input images. Instead of using a fixed attribute catalog of style channels, we view the swapping coefficients as learnable parameters and iteratively search for instance-aware swap coefficients, where "instance" refers to both the source image and the reference image. For each pair of source and reference images, we learn the swap coefficients by back-propagating the gradient of a pre-trained attribute classifier to obtain the target results. We have conducted experiments in the field of face attribute editing to verify the effectiveness of our idea, as shown in Figure 1. Experiments demonstrate that our method can produce high-quality disentangled face editing results in a short inference time of about 30 s. Furthermore, by analyzing the learned swap coefficients, we can identify the style channels that control a specific target attribute, which further demonstrates the disentanglement of S space. Experiments also show that directly adjusting a single channel enables more fine-grained editing.
2 Methods

2.1 Problem Statement
We consider achieving semantic image editing via the S space of StyleGAN2. We first define some general terms that are necessary for a formal description of the semantic image editing problem. We denote the image to be transformed by Isource and the image that has the target attribute by Ireference. We employ a fixed inversion model to map an image into a latent vector s ∈ S and the generator G to reconstruct the image from the latent vector. We use a fixed attribute classifier to predict the attributes of Isource and Ireference: As = (a^s_1, a^s_2, ..., a^s_i, ..., a^s_K) for Isource and Ar = (a^r_1, a^r_2, ..., a^r_i, ..., a^r_K) for Ireference, where K denotes the total number of attributes. We then swap the values of the specified attribute to obtain the target attributes At = (a^t_1, a^t_2, ..., a^t_i, ..., a^t_K). Our goal is to transform Isource into a new image Iedit with the attributes At.
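As a concrete (hypothetical) illustration of this bookkeeping, the target attribute vector At can be built by copying As and taking only the edited entry from Ar; the helper name below is ours, not the authors':

```python
# Hypothetical sketch of building the target attributes A_t: copy the source
# attributes A_s and swap in the reference value for the single attribute j
# being edited. All names here are illustrative, not the authors' code.
def target_attributes(a_source, a_reference, j):
    a_target = list(a_source)      # preserve all source attributes
    a_target[j] = a_reference[j]   # take the edited attribute from A_r
    return a_target
```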
2.2 Proposed Approach
Overview. Our model is an optimization-based method, as shown in Figure 2. For images Isource and Ireference, we first employ a fixed inversion model to map the two images to style vectors ss, sr. The two style vectors are fed into a Style-Swap module to obtain a new style vector se. The function of this module is to accurately swap the channels that correspond to the target attribute. Then we employ the generator to reconstruct the image Iedit from se. Finally, we use the fixed classifier to predict the attributes Ae of Iedit. If Ae is sufficiently close to At, Iedit is the final editing result. Otherwise, we repeat the above steps until the maximum number of iterations is reached. Our goal is to search for optimal instance-aware swap coefficients that make the transformed image carry the desired attributes. Next, we introduce the key Style-Swap module of our method.

Instance-Aware Style-Swap. As mentioned above, the function of this module is to accurately swap the channels that correspond to the target attribute. Rather than using a fixed attribute catalog of style channels, which could be obtained as in [16,17], we directly view the swapping coefficients as learnable parameters. The image editing problem is thus transformed into searching for the optimal swapping coefficients Λ = [λ1, λ2, ..., λn], where n is the dimension of the S space. The coefficient vector and the input style vectors are then passed to the swap process, where the swapped style vector is computed as:

se = (1 − Λ) ⊙ ss + Λ ⊙ sr    (1)

where ⊙ denotes per-element multiplication. This operation blends the channels of ss and sr according to Λ and outputs a swapped code se. Ideally, each λi is a binary number, i.e., λi ∈ {0, 1}. When λi equals 0, the i-th style channel holds the value of ss, and when λi equals 1, the i-th channel is swapped for the value of sr. However, since
discrete parameters cannot be optimized by back-propagation, the range of λi is instead set to [0, 1], which represents the swapping degree of the i-th channel in the S space.

Prior Information Supervision. In our optimization procedure for searching instance-aware swapping coefficients, semantic-aware supervision is required to provide guidance for transforming Isource into a new image Iedit with the target attributes. In our framework, we use a pre-trained attribute classifier as the prior knowledge. Specifically, we propose an attribute loss to constrain the attributes of the generated image Iedit. Formally, we construct the following binary cross-entropy loss L^cls_i for the i-th attribute:

L^cls_i(Λ) = −a^t_i · log(fi(G(se))) − (1 − a^t_i) · log(1 − fi(G(se)))    (2)
where fi(G(se)) predicts the score of the i-th attribute for the image Iedit. The total attribute loss Latt is a weighted sum of the cross-entropy losses of all attributes, with different weights for the attribute j to be edited and the other irrelevant attributes:

Latt(Λ) = α · L^cls_j(Λ) + β · Σ_{i=1, i≠j}^K L^cls_i(Λ)    (3)
where α and β denote the hyperparameters that control the loss weights of the edited attribute and the other unrelated attributes. Richardson et al. [19] show that an identity loss is crucial for the reconstruction of realistic facial images. When editing human faces, we therefore also include an identity loss Lid_loss in our optimization process to ensure identity preservation. Lid_loss is computed from the identity information of Isource and Iedit:

Lid_loss(Λ) = 1 − ⟨R(G(se)), R(G(ss))⟩    (4)

where R denotes the pre-trained ArcFace network [20] and ⟨·, ·⟩ computes the cosine similarity. Our method swaps two style vectors via the swapping coefficients. Obviously, if too many channels are swapped, the resulting image will be closer to Ireference than to Isource in content. Since only the channels of the specific attribute need to be swapped, we use an l2-norm loss to limit the number of swapped channels:

Ll2_norm(Λ) = ‖Λ‖2    (5)
In summary, the overall objective function is:

Ltotal(Λ) = λattr · Latt(Λ) + λid · Lid_loss(Λ) + λ2 · Ll2_norm(Λ)    (6)

where λattr, λid and λ2 are weights that balance the three losses. The optimal swapping coefficients Λ* are obtained by solving arg minΛ Ltotal(Λ).
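The objective in Eqs. 1–6 can be sketched with NumPy stand-ins; the generator G, classifier f and ArcFace network R are replaced by caller-supplied functions, and all names below are illustrative, not the authors' implementation (the default weights follow the values reported later in the experimental settings):

```python
import numpy as np

# Toy sketch of the Style-Swap objective (Eqs. 1-6). attr_score_fn and
# id_sim_fn stand in for the pre-trained classifier f(G(.)) and the ArcFace
# identity similarity <R(G(.)), R(G(.))>; all names are illustrative.
def style_swap(lam, s_s, s_r):
    # Eq. 1: per-element blend of source and reference style vectors.
    return (1.0 - lam) * s_s + lam * s_r

def total_loss(lam, s_s, s_r, attr_score_fn, id_sim_fn, a_t, j,
               alpha=5.0, beta=2.0, w_attr=1.0, w_id=0.8, w_l2=0.1):
    s_e = style_swap(lam, s_s, s_r)
    p = attr_score_fn(s_e)                                # scores in (0, 1)
    bce = -(a_t * np.log(p) + (1 - a_t) * np.log(1 - p))  # Eq. 2, per attribute
    l_att = alpha * bce[j] + beta * (bce.sum() - bce[j])  # Eq. 3
    l_id = 1.0 - id_sim_fn(s_e, s_s)                      # Eq. 4
    l_l2 = np.linalg.norm(lam)                            # Eq. 5
    return w_attr * l_att + w_id * l_id + w_l2 * l_l2     # Eq. 6
```

In the paper, Λ would be updated by back-propagating this loss through the generator and classifier; here the loss is only evaluated on plain vectors.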
Fig. 3. Visual results of our method when manipulating various attributes with different reference images.
Other Modules. As in previous studies, we employ the pre-trained StyleGAN2 [5] generator to generate faces, since it produces realistic and natural images. We choose the pSp [21] architecture as our inversion model, a hierarchical structure that maps the input image to a latent code in W+ space [22]; we then obtain the corresponding style vectors through the affine layers of StyleGAN2. For the attribute classifier, we follow [23] and extend the ResNet-50 [24] architecture. The last fully connected layer outputs 40 attribute scores, each between 0 and 1. The above modules are pre-trained before the experiments and kept fixed during the optimization process.
3 Experiments

3.1 Experimental Settings
We conduct experiments in the field of face attribute editing to verify the effectiveness of our method. Both the StyleGAN2 generator and the pSp model are pre-trained on the Flickr-Faces-HQ dataset (FFHQ) [4]. The attribute classifier is trained on the CelebA dataset [2] with 40 attributes [25]. In the experiments, we choose the Adam [26] optimizer to search for the optimal swap coefficients; the initial learning rate is 0.01 with linear decay. To balance the importance of each loss, we set λattr = 1, λid = 0.8 and λ2 = 0.1 in Eq. 6, and α = 5, β = 2 in Eq. 3. For each pair of source and reference images, we set the maximum number of iterative steps to 200 and a minimum threshold of 1e−4 on Ltotal to control early stopping. We conduct all experiments on a single NVIDIA GeForce GTX 1080 Ti (11 GB) in PyTorch, and each optimization process completes in less than 30 s.
Fig. 4. Visualization of continuous editing by linearly increasing the coefficient η.
3.2 Instance-Aware Attribute-Level Image Editing
In this section, we apply our method to different pairs of images and investigate the quality of the instance-aware attribute-level image editing results. Figures 1 and 3 clearly show that our method can achieve disentangled attribute editing with a natural transformation effect. For example, for smile editing, our method focuses on both the changes of the cheekbones and real decrescendo. In particular, instead of using spatial supervision such as binary masks, our approach utilizes the guidance provided by the attribute classifier, so that more fine-grained attribute editing can be achieved. For instance, on the hair region, we can edit the hairstyle and the hair color individually thanks to the fine-grained supervision. Furthermore, by adopting an optimization process that dynamically searches for instance-aware swap coefficients, we can obtain different editing effects for the same attribute by selecting different reference images. As shown in Figure 3, we obtain different glasses styles from different reference images for the glasses attribute, and for smile editing we also achieve different degrees of smiling. [16] shows that linear interpolation in GANs can lead to continuous changes. In our method, we can also achieve continuous editing by modifying Eq. 1. Specifically, for the fixed learned swapping coefficients Λ, we multiply an additional coefficient η to control the degree of editing as follows:

se = ss + η · Λ ⊙ (sr − ss)    (7)
where η ∈ R. Figure 4 shows the results of our continuous editing. We can see that our method can make continuous edits to the attributes of the source image with natural and smooth transformation effects.
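Eq. 7 is a one-liner; the sketch below (illustrative names, NumPy vectors standing in for style vectors) shows how η interpolates between the source style and the fully swapped style:

```python
import numpy as np

# Sketch of Eq. (7): with the learned coefficients lam (Λ) fixed, a scalar
# eta (η) scales the edit strength; eta = 0 returns the source style and
# eta = 1 reproduces the full Style-Swap blend. Names are illustrative.
def continuous_edit(eta, lam, s_s, s_r):
    return s_s + eta * lam * (s_r - s_s)
```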
3.3 Comparison with SOTA Methods
To demonstrate the advantages of our method, in this section we compare our model with several other SOTA methods [7,10,17], including the label-based model InterfaceGAN [10], the GAN-based model AttGAN [7], and the reference-based model RIS [17]. We reimplement InterfaceGAN by using an SVM to train a hyperplane for each attribute in the W+ space of StyleGAN2. AttGAN is trained on thirteen attributes with strong visual impact, including "Bald", "Bangs", "Black Hair", "Blond Hair", "Brown Hair", "Bushy Eyebrows", "Eyeglasses", "Gender", "Mouth Open", "Mustache", "No Beard", "Pale Skin" and "Age".
Fig. 5. Qualitative comparison of face attribute editing results between our method, the label-based method InterfaceGAN [10] and the GAN-based method AttGAN [7].
Figures 5 and 6 show our qualitative results. It can be seen that the label-based model InterfaceGAN [10] is able to achieve the specified image editing, but it has serious problems with semantic entanglement. As shown in Figure 5, it suffers from undesirable changes when editing a specific attribute. For instance, when we edit brown hair to black hair, the skin tone of the face also becomes darker. While AttGAN [7] largely solves the entanglement problem, it suffers from unrealistic effects. In Figure 5, on bangs editing, the resulting image looks blurred, and it fails to edit the bald attribute. In addition, AttGAN [7] cannot edit attributes that were not seen during training, which limits its flexibility. In contrast, our approach is simpler and more effective. In general, we achieve better results with a natural transformation effect and a more disentangled visual impact for the various attribute edits, e.g., hair color, smiling and bald. Figure 6 compares our results with the reference-based model RIS [17]. The results show that the editing of RIS [17] is spatial-level rather than attribute-level. For example, when we edit the mouth, RIS [17] directly swaps the whole mouth region including the jaw, making the result look very unsuitable for the
Fig. 6. Qualitative comparison between our method and the reference-based method RIS [17] on editing face attributes, e.g., smile, hair color and eyeglasses.
original image. In contrast, our experimental results are better matched and more realistic, since we dynamically search for instance-aware swap coefficients under the guidance of prior information. In addition, RIS [17] cannot perform fine-grained editing. For example, for hair color and hairstyle, RIS [17] can only rigidly swap all style vectors controlling the hair region and cannot edit hair color or hairstyle individually. Our method is based on the attribute classifier, so we can edit almost all the attributes covered by the classifier.

Table 1. Quantitative comparison results with other SOTA methods measured by FID (lower is better) and ID (higher is better).

Methods             bangs           black hair      smile
                    FID↓    ID↑     FID↓    ID↑     FID↓    ID↑
InterfaceGAN [10]   27.22   0.641   36.18   0.549   27.49   0.770
AttGAN [7]          32.31   0.663   36.44   0.609   26.87   0.696
RIS [17]            25.39   0.671   36.75   0.775   21.88   0.767
Ours                24.12   0.827   36.07   0.881   21.38   0.893
To better evaluate our method, we also conducted quantitative experiments. In particular, we sample 1000 face images and perform semantic image editing with the above methods along the bangs, black hair and smile attributes. We then adopt the Fréchet Inception Distance (FID) [27] and the face verification score (ID) [21] as metrics to measure the manipulated results. FID [27] is commonly used to measure the distance between the edited images and the source images. ID [21] uses the pre-trained ArcFace network [20] to measure how well face identity is preserved in the edited images. Table 1 shows that our method outperforms the other standard approaches in almost all cases, which indicates that our method not only generates high-quality images but also ensures accurate face attribute editing.
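The ID metric reduces to a cosine similarity between identity embeddings of the source and edited faces; a minimal sketch, with plain vectors standing in for ArcFace features (this is not the authors' evaluation code):

```python
import numpy as np

# Hypothetical sketch of the ID metric: cosine similarity between identity
# embeddings of the source image and the edited image.
def id_score(e_source, e_edit):
    num = float(np.dot(e_source, e_edit))
    den = float(np.linalg.norm(e_source) * np.linalg.norm(e_edit))
    return num / den
```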
4 Conclusion
In this work, we focus on the S space of StyleGAN2 and propose a simple and effective Optimization-based Instance-Aware Style-Swap method to achieve disentangled semantic image editing. By analyzing the learned swapping coefficients, we demonstrate the high attribute-level disentanglement of S space and identify the style channels connected with a specific target attribute. The experiments show that we can perform a natural and disentangled transformation between arbitrary source and reference images with an inference time of about 30 s, which is significantly better than previous methods. However, we obtain the style vector of a real image by inverting it into W+ space and then passing it through the affine layers, since there is no direct way to obtain it. In future work, we plan to investigate inversion methods directly in S space.
References
1. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
2. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
3. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
4. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)
5. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119 (2020)
6. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
7. He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: AttGAN: facial attribute editing by only changing what you want. IEEE Trans. Image Process. 28(11), 5464–5478 (2019)
8. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797 (2018)
9. Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: StarGAN v2: diverse image synthesis for multiple domains. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197 (2020)
10. Shen, Y., Gu, J., Tang, X., Zhou, B.: Interpreting the latent space of GANs for semantic face editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9243–9252 (2020)
11. Shoshan, A., Bhonker, N., Kviatkovsky, I., Medioni, G.: GAN-control: explicitly controllable GANs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14083–14093 (2021)
12. Voynov, A., Babenko, A.: Unsupervised discovery of interpretable directions in the GAN latent space. In: International Conference on Machine Learning, pp. 9786–9796. PMLR (2020)
13. Tzelepis, C., Tzimiropoulos, G., Patras, I.: WarpedGANSpace: finding non-linear RBF paths in GAN latent space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6393–6402 (2021)
14. Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: GANSpace: discovering interpretable GAN controls. Adv. Neural Inf. Process. Syst. 33, 9841–9850 (2020)
15. Wu, Z., Lischinski, D., Shechtman, E.: StyleSpace analysis: disentangled controls for StyleGAN image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863–12872 (2021)
16. Collins, E., Bala, R., Price, B., Susstrunk, S.: Editing in style: uncovering the local semantics of GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5771–5780 (2020)
17. Chong, M.J., Chu, W.S., Kumar, A., Forsyth, D.: Retrieve in style: unsupervised facial feature transfer and retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3887–3896 (2021)
18. Hornik, K., Feinerer, I., Kober, M., Buchta, C.: Spherical k-means clustering. J. Stat. Softw. 50, 1–22 (2012)
19. Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in GANs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1532–1540 (2021)
20. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
21. Richardson, E., et al.: Encoding in style: a StyleGAN encoder for image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2287–2296 (2021)
22. Abdal, R., Qin, Y., Wonka, P.: Image2StyleGAN: how to embed images into the StyleGAN latent space? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441 (2019)
23. Zhuang, P., Koyejo, O., Schwing, A.G.: Enjoy your editing: controllable GANs for image editing via latent space navigation. arXiv preprint arXiv:2102.01187 (2021)
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
25. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
27. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Collaborative Multi-head Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

Minyu Sun1, Bin Jiang1,2(B), Xinjiao Zhou1, Bolin Zhang1, and Chao Yang1

1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
{sunminyu,jiangbin,zhouxinjiao,onlyou,yangchaoedu}@hnu.edu.cn
2 Key Laboratory for Embedded and Network Computing of Hunan Province, Hunan University, Changsha 410082, Hunan, China
Abstract. An efficient method of representing and retrieving information is an essential component of open-domain QA. Some question answering models allow real-time responses, with benefits in speed and scalability. Nonetheless, due to the limitations of existing phrase models, their accuracy is low. In this paper, we improve the contextualized sparse representation to strengthen the connection between contextual information. We achieve better answer retrieval by enhancing the embedding quality of the model's phrase representation. Specifically, based on the original contextualized sparse representations, we transform the single self-attention into collaborative multi-head attention so that the attention heads can connect and attend to crucial information at different context locations. Compared with learning sparse vectors in the n-gram vocabulary space by rectified self-attention, collaborative multi-head attention performs better on the SQuAD dataset. Due to the increased efficiency of critical information representation, the model improves to varying degrees on both evaluation metrics.

Keywords: open-domain question answering · real-time answering · sparse vectors · attention mechanisms

1 Introduction
Open-domain question answering (QA) typically refers to answering factoid questions from existing knowledge sources (such as Wikipedia). The pipeline approach [1] is a widely used open-domain question answering technique: it usually retrieves documents related to a question from a knowledge source through an efficient retrieval technique and then finds the answer in the documents through a QA model [2–5]. The inherent complexity of neural reading comprehension models makes open-domain QA with real-time responses face the problem of excessive time consumption, so it is unsuitable for low-latency open-domain QA applications [6,7]. While classical information
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 423–434, 2023.
https://doi.org/10.1007/978-981-99-2385-4_32
retrieval algorithms such as tf-idf [8] can guarantee retrieval speed, the accuracy of the replies generated by QA models is frequently low because of the limits of existing phrase representation models [9]. Therefore, it is worthwhile to investigate ways to increase response accuracy while keeping real-time QA speed. By encoding and indexing every conceivable text span in a dense vector offline, Seo et al. [10] greatly improve the performance of end-to-end phrase retrieval of related texts. In addition, to alleviate the poor performance on entity-centric questions, Seo et al. [11] connect sparse and dense vectors based on word frequency to capture lexical information. Since the importance of every word in a document encoded by sparse vectors is considered equally regardless of its context, some of the importance weight is lost. To compensate for these losses and obtain richer lexical information in sparse vectors, Lee et al. [12] propose SPARC, which expands the base to the n-gram vocabulary space using rectified self-attention weights. However, we hope the lexical information in the sparse vectors can be further exploited, which can enhance QA accuracy. In this paper, we propose an improved sparse attention mechanism and demonstrate its use in an open-domain QA setting with phrase retrieval. Our improvement of the original attention mechanism can be divided into two stages. First, we transform the self-attention in SPARC into multi-head attention, which extends the ability of the model to focus on the sparse vector at different locations without vector expansion. This approach effectively improves the accuracy of answers and does not have much effect on the speed of answer retrieval. Second, considering that not all heads are equally informative, we use a collaborative method between different attention heads. The collaboration of heads enables better detection and quantification of head redundancy.
In this way, we can better assign weights and reparameterize using fewer parameters. With the help of collaborative heads, the accuracy of model responses improves significantly. In summary, our contributions are as follows:
∗ To capture different features of contexts, we replace the unidimensional sparse attention with multi-head attention.
∗ To further exploit the information captured by multi-head attention, we use a concatenation-based method to share the weights of different heads.
∗ By comparing the experimental results with other methods and through an ablation study, we demonstrate that increasing the number of attention heads and collaborative multi-head attention can improve the quality of the answers.
2 Related Work
In this section, we provide a quick overview of the phrase query model and the attention mechanism.
2.1 Phrase Query Model Based on Sparse and Dense Vectors
The open-domain QA problem over unstructured text can be summarized as follows. We are given a set of documents x1, ..., xK and a question
q (where K is the number of documents). We need a method that obtains the response by extracting the text span that answers question q. The speed of an open-domain QA system is low due to the limitations of traditional pipeline methods in phrase representation [13,14]. This issue is well addressed by the indexable query-agnostic phrase representation approach presented in [11]. In this phrase representation, dense vectors effectively exploit recent developments in contextualized text encoding to capture local syntax and semantics [15], while sparse vectors are better at encoding precise lexical information. In addition, the model encodes the documents and questions independently, which means that encoded documents do not need to be encoded again when they are used to answer a new question. This efficient method meets the time requirements of real-time applications. To respond to question q, the real-time open-domain QA model proposed by [11,12] directly encodes every term in the document, regardless of the query, and then performs a similarity search over the encoded phrases. The answer ã is given by:

ã = argmax_{x^k_{i:j}} Hx(x^k_{i:j}) · Hq(q)    (1)
where x^k_{i:j} is the phrase spanning the i-th to the j-th word of the k-th document, Hx is an encoder for contextualized, query-agnostic phrases, and Hq encodes the question. To compute the similarity, we use · to denote the fast inner product operation. However, Seo et al. [11] argue that the properties of tf-idf, which is not learnable and remains constant across the same document, limit its representation capability. They therefore proposed the Contextualized Sparse Representation (SPARC) [12] to implement a better, learnable sparse representation model and further improve QA accuracy. The sparse representation of a phrase is obtained as a concatenation of the sparse embeddings of its start and end words, where the sparse embeddings are rectified by self-attention.
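The retrieval step in Eq. (1) amounts to a maximum inner product search over precomputed phrase vectors; a minimal dense sketch with illustrative names (not the model's actual index):

```python
import numpy as np

# Minimal sketch of Eq. (1): answer retrieval as a maximum-inner-product
# search over precomputed phrase vectors (stand-ins for H_x(x_{i:j}^k)).
def retrieve(phrase_vecs, phrase_spans, question_vec):
    scores = phrase_vecs @ question_vec           # inner-product similarity
    return phrase_spans[int(np.argmax(scores))]   # best-scoring text span
```

A real system would replace the dense matrix product with an approximate similarity-search index over millions of phrases.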
2.2 Multi-head Attention and Collaborative Attention
Multi-head attention was proposed by [16], adopting the Query-Key-Value attention model [17]. Instead of using a single attention function, the Transformer projects the queries, keys and values once per head, computes the attention outputs in parallel, and then concatenates the outputs and projects them back to the model dimension. Multi-head attention can thus focus on crucial information at different locations. It has been pointed out that attention heads may focus on the same information, which causes redundancy [18]. A practical way to compress the multi-head attention layer is to prune the less informative heads [19,20], thus significantly lowering the number of parameters. However, the pruning method still requires pre-training the original model with all heads. The collaborative attention proposed by Cordonnier et al. [21] solves this problem: noticing that heads learn redundant projections, the model learns the key-query projections of all heads at once, and each head's weights are then re-parameterized with the help of the other heads.
426
M. Sun et al.
projections. Each head's weights are then refined with the help of the other heads. The main difference from standard multi-head attention is that the key and query matrices are not replicated per head. This allows attention heads to use more or fewer dimensions as needed. Moreover, because the projections are shared, they are stored only once, so the parameters are represented more efficiently. Therefore, increasing the interaction between heads in multi-head attention can effectively improve the performance of the QA system, and the adaptive head size improves the flexibility of the model.
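The contrast between ordinary concatenated heads and collaborative heads can be sketched in a few lines of NumPy. The function below is an illustrative reconstruction, not code from [21]; the names and shapes are our own assumptions.

```python
import numpy as np

def collaborative_attention_scores(X, W_Q, W_K, M):
    """Per-head attention scores with collaborative (shared) projections.

    X:   (N, d)  input token representations
    W_Q: (d, D)  shared query projection, D = total key/query dimensions
    W_K: (d, D)  shared key projection
    M:   (H, D)  learned mixing matrix; row m_i re-weights the shared
                 query dimensions for head i (i.e. Q_i = Q @ diag(m_i))
    Returns: (H, N, N) attention score matrices, one per head.
    """
    Q, K = X @ W_Q, X @ W_K          # computed once, shared by all heads
    H, D = M.shape
    d_head = D // H
    # Standard multi-head attention would slice Q and K into H disjoint
    # d_head-sized chunks; here every head sees all D dimensions,
    # weighted by its learned mixing vector.
    return np.stack([(Q * M[i]) @ K.T / np.sqrt(d_head) for i in range(H)])
```

With block one-hot rows in M (ones on head i's d_head dimensions, zeros elsewhere) this reduces to ordinary concatenated heads; learned real-valued rows let heads share or specialize dimensions adaptively.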
3
Overview
In this section, we detail the self-attention sparse coding module of the backbone model and describe our collaborative multi-head sparse coding module. For ease of description, the symbols and functions in this section are generic.

3.1
Contextual Sparse Representations Backbone
Our multi-head collaborative self-attentive sparse encoding is built on [11]. In the backbone model, the sparse part of each phrase can be described as s_{i:j} = [s_i^start, s_j^end], where s_i^start and s_j^end are the sparse embeddings of the start word and the end word. By doing this, we are able to compute them efficiently without explicitly enumerating every possible phrase. In the following, we omit the superscript and only describe how to obtain the start vector from the sparse encoder, because the end vector is obtained in the same manner (with independent parameters). Let H = [h_1, ..., h_N] ∈ R^{N×d} be the contextualized encoding of a document. The sparse encoding S = [s_1, ..., s_N] ∈ R^{N×F} is computed as:

S = Attention(Q, K, F) = ReLU(QK^T / √d) F ∈ R^{N×F},

(2)

In the above equation, Q, K ∈ R^{N×d} are the query and key matrices, obtained from H by different linear transformations with weights W_Q, W_K ∈ R^{d×d}. F ∈ R^{N×F} is a one-hot n-gram feature matrix of the input document x.

3.2
Multi-head Self-attention Sparse Representations
To improve the backbone, we split the attention mechanism for sparse vectors into a multi-head form. The head outputs are concatenated to obtain multi-head attention, defined for H heads as: MultiHeadAttn(Q, K, F) = Concat(head_1, ..., head_H)W_O
(3)
headi = Attention(Qi , Ki , F ),
(4)
where the original queries Q, keys K, and values F are projected into d dimensions per head. For each projected query, the keys and values as well as
Collaborative Multi-head Contextualized Sparse Representations
427
the outputs are calculated according to Eq. (2). The matrix W_O ∈ R^{D×N} of additional parameters projects the concatenation of the H head outputs back to the output space R^N (D denotes the product of H and d). In addition, distinct parameter matrices for Q_i and K_i are learned for each head i ∈ [H], while F is shared by all heads. The model then combines all head outputs and projects them back to a d-dimensional representation.

3.3
Overall Structure
In order to enhance the collaboration between attention heads, we learn the keys/queries of all heads at once. Each head's weights are then rectified under the influence of the other heads. Note that the parameter settings in this part are identical to those in Sect. 3.2. The collaborative head attention over a sparse vector is defined as follows: CollabAttn(Q, K, F) = Concat(head_1, ..., head_H)W_O
(5)
headi = Attention(Qi diag(mi ), Ki , F ),
(6)
As Eq. (6) shows, instead of copying the keys K and queries Q to all heads, mixing vectors m_i are learned jointly. With the help of the mixing vectors, the heads can use more or fewer dimensions to express themselves, giving each head a more robust representation. Since the heads share projections, these only need to be stored and read once, making the parameter representation more efficient. Part (b) of Fig. 1 displays how our attention mechanism computes S from the input vectors. The mixing matrix M is given by: M := concat_{i∈[H]} [m_i] ∈ R^{H×D}.
(7)
In the mixing matrix M, the row m_i for the i-th head is a vector with ones aligned with the d dimensions allocated to the i-th head among the D total dimensions. The collaboration between heads is optional, which means that the size d of each head is adaptive, so the heads can handle larger or smaller subspaces when needed.

3.4
Overall Structure
The process of dense and sparse encoding of the input documents H is then described. As shown in Eq. (8), the vector x^k_{i:j} is formed by combining the dense and sparse vectors obtained from the dense and sparse encoders:

x^k_{i:j} = [e^k_{i:j}, s^k_{i:j}],

(8)

where x^k_{i:j} is the phrase embedding of the span from the i-th word to the j-th word of the k-th document, e^k_{i:j} is the dense part, and s^k_{i:j} is the sparse part of the final text span (i, j) in the k-th document. The
Fig. 1. (a) is the original self-attention score calculation module without the multi-head mechanism. The attention score is obtained from the feature matrices WQ, WK. In (b), we use concatenated multi-head attention with H = 3 independent heads, and the blocks of different colors represent the feature matrices of different heads. The mixing matrix M's block structure ensures that each head's dot product focuses on information at different locations.
sparse vectors s are obtained by taking the attention scores shown in part (b) of Fig. 1 and multiplying them with F, the one-hot feature of the input document x. The encoding of the dense vector e and of the question q follows the method in [11], and the answer is generated according to Eq. (1). In addition, we compute question q's sparse encodings similarly to the document's, except that we use a special token to represent the entire question rather than start and end words. For phrase encoding, we use the same BERT and linear transformation weights.

3.5
Training
We train our encoders using training samples from an extractive question answering dataset (SQuAD), because training phrase encoders on the whole of Wikipedia is computationally prohibitive. In addition, we use an improved negative sampling strategy [12], which increases the resistance of both the dense and sparse models to noisy texts.
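The idea behind negative sampling here can be sketched as follows. This is a simplified illustration with names of our own choosing, not the actual strategy of [12], which selects harder negatives (e.g., paragraphs paired with similar questions) rather than arbitrary ones.

```python
import numpy as np

def add_negative_phrases(gold_logits, neg_logits, gold_span):
    """Append logits of phrases from a sampled distractor document to the
    gold document's candidate list; gold_span stays valid because the
    negatives are appended at the end.

    gold_logits: (P,) phrase logits (dense + sparse) for the gold document
    neg_logits:  (Q,) phrase logits for the distractor document
    gold_span:   index of the answer span inside gold_logits
    """
    logits = np.concatenate([gold_logits, neg_logits])
    # Softmax cross-entropy over the enlarged candidate set: the model
    # must rank the gold span above phrases from the noisy document too.
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    loss = -(logits[gold_span] - log_z)
    return logits, loss
```

Adding distractor phrases can only enlarge the partition function, so the loss with negatives is never smaller than without them, which is what pushes the encoders to be robust to noisy texts.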
The question q and the golden document x are given in the dataset as QA pairs. The loss function must take into account both the dense and sparse encodings of the golden document x. The sparse logit of each phrase x_{i:j} can be described as x^sparse_{i:j} = s_{i:j} · s_[CLS] = s^start_i · s_[CLS] + s^end_j · s_[CLS]. For descriptive convenience, we omit the superscripts start and end of s; both terms are computed in the same way:

s_i · s_[CLS] = [ReLU(QK^T / √d) F]_i · [ReLU(Q′K′^T / √d) F′]^T

(9)

where Q′, K′ ∈ R^{M×d} and F′ ∈ R^{M×F} denote the question-side query, key, and n-gram feature matrices. The kernel function of Eq. (9) effectively reduces the dimensionality of the output F. The final loss to minimize is the negative log probability over the summed dense and sparse logits:

L = −(l_{i*,j*} + l^sparse_{i*,j*}) + log Σ_{i,j} exp(l_{i,j} + l^sparse_{i,j})

(10)
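This loss can be sketched in a few lines over flattened span logits. The code below is our own illustration of Eq. (10), with standard log-sum-exp stabilization added.

```python
import numpy as np

def phrase_loss(dense_logits, sparse_logits, true_idx):
    """Negative log-likelihood over summed dense and sparse phrase logits
    (the form of Eq. (10)). Both arrays are flattened over candidate
    spans (i, j); true_idx points at the gold span (i*, j*).
    """
    logits = dense_logits + sparse_logits       # l_ij + l_ij^sparse
    m = logits.max()                            # stabilize the log-sum-exp
    log_z = m + np.log(np.exp(logits - m).sum())
    return -logits[true_idx] + log_z
```

When the gold span's combined logit dominates, the loss approaches zero; with uniform logits over P candidates it equals log P, the usual softmax cross-entropy behavior.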
4
Experiments
In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed method. We first present the experimental setup, including the dataset, experimental environment, and parameter settings. Then we compare our model with the baseline and other models, and we investigate the importance of collaborative multi-head attention through an ablation study. Finally, we discuss the effect of different numbers of heads on accuracy and analyze the output of our QA model.
Experimental Setup
Dataset. SQuAD [22] is a reading comprehension dataset made up of questions posed by crowdworkers on a set of Wikipedia articles. Since the phrase encoder cannot be trained on the entire Wikipedia corpus, we use the training examples in SQuAD-open (an open-domain version of SQuAD) to train our encoder. Each example in the dataset contains the paragraph, question, answer, and the starting position of the answer in the text. In this paper, we used 97,339 training examples to train our encoder and 12,151 samples to test the accuracy of the model. Evaluation Metrics. The experiments use two evaluation metrics, EM and F1. ∗ EM: Exact match (EM) is the proportion of model-generated responses that are identical to the standard responses. A higher EM represents better model performance. Because open-domain QA answers are frequently only one word or phrase long, EM is an excellent way to assess the response quality of a QA system with mostly short responses.
∗ F1: First, we define three machine learning concepts: accuracy, precision, and recall. In a classification problem, accuracy is the proportion of correctly classified samples among all samples. Precision is the proportion of correct samples among the samples the classifier labels positive. Recall is the proportion of true positive samples that are correctly classified. The F1 score is a statistical measure of a binary model's quality on imbalanced data, calculated as the harmonic mean of precision and recall, as shown in Eq. (11):

F1-score = ((Precision^{-1} + Recall^{-1}) / 2)^{-1}

(11)
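The two metrics can be sketched concretely as follows. This is a simplified token-level version of our own; the official SQuAD evaluation script additionally strips articles and punctuation before comparing.

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the normalized prediction equals the gold answer."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1(pred, gold):
    """Token-level F1: harmonic mean of precision and recall over the
    tokens shared by the prediction and the gold answer."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    common = Counter(p_toks) & Counter(g_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    # 2PR / (P + R) is algebraically the harmonic mean of Eq. (11)
    return 2 * precision * recall / (precision + recall)
```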
According to Eq. (11), the F1 score varies between 0 and 1, and it only becomes large when both precision and recall are large enough. This makes F1 suitable for judging how good the results of a QA system are. It is worth noting that the F1 value is usually multiplied by 100 for comparison purposes, so it floats between 0 and 100. Model Details. We conduct all experiments on a single NVIDIA GeForce GTX 1080 Ti (11 GB of memory) with the PyTorch toolbox. Due to the limitations of the experimental environment, we set the batch size to 1. The maximum total input sequence length after WordPiece tokenization is set to 256, and the stride taken between chunks when splitting up a long document is set to 64. We set the maximum query length to 32 and use bert-small [15] as our encoder, which we fine-tune; the vocabulary size of the BERT model is 30,522. Comparison. For comparison, we selected several advanced open-domain QA models: ∗ DrQA [2]: A large-scale open-domain question answering system proposed by Facebook in 2017. DrQA first uses a document retriever to find the pertinent Wikipedia articles (focusing on the top 5), then uses a document reader to comprehend the semantics of the documents, returning the correct text spans as responses. ∗ DENSPI [11]: An indexable query-independent phrase representation model for real-time open-domain QA. ∗ SPARC [12]: A model using an n-gram rectified self-attention mechanism built on DENSPI. SPARC encodes phrases with rich lexical information for open-domain question answering.

4.2
Results
Quantitative Comparison. We use the EM and F1 metrics for quantitative evaluation. We set the number of attention heads to 16 in this part of the experiment. Table 1 shows the results of the experiments on the SQuAD dataset. Our model improves the EM score by more than 1% and the F1 score by more than 0.7%
Table 1. Results on SQuAD dataset.

Model                                EM     F1
DrQA                                 62.17  72.11
Dense-Sparse Phrase Index (DENSPI)   67.28  75.6
DENSPI+SPARC                         69.64  78.16
Ours                                 70.72  78.95
compared with the baseline, and our model achieves the best results when compared to DENSPI and DrQA. Ablation Study. Table 2 compares the performance of our model with its variants on the SQuAD dataset to investigate the impact of the multi-head and collaborative mechanisms. We first remove the collaboration module and use only the multi-head mechanism to improve the original sparse attention representation. Then, since the collaboration mechanism can only operate in a multi-head attention setting, we remove both the collaboration and multi-head modules. As shown in Table 2, both removal operations cause a drop in EM and F1, and removing both modules at the same time causes an even larger drop. This experiment validates the collaboration module and the multi-head mechanism, indicating that both are effective.

Table 2. Ablations of our model.

Model                           EM            F1
Ours                            70.72         78.95
- collaboration                 70.21 (-0.51) 78.61 (-0.34)
- (collaboration + multi-head)  69.53 (-1.19) 78.32 (-0.63)
Number of Heads. To explore the reasonableness of the head-count setting, we also compare the effect of multi-head attention alone against collaborative multi-head attention, as shown in Fig. 2. The horizontal axis of both plots indicates the number of heads. The vertical axis of plot (a) indicates the difference between the EM score of our model and the baseline, and the vertical axis of plot (b) indicates the difference in F1 score from the baseline. Both plots show that merely increasing the number of heads does not improve model performance. When using only multi-head attention, setting the number of heads to 4 is the most reasonable choice for improving performance. For the collaborative multi-head model proposed in this paper, a range of 8-16 heads is suitable, and a larger number of heads leads to performance degradation.
Fig. 2. Figures (a) and (b), which represent the models employing collaborative multi-head attention and multi-head attention alone under various head-count settings, respectively, show the improvement in EM and F1 scores compared to the baseline.
Qualitative Analysis. Table 3 shows the output of our model and the baseline DENSPI+SPARC model. We can see that our model can remove the interference and output the correct answer where the original model is confused by the contextual information. These results show that our model is more capable of retrieving the correct answer in context, which demonstrates the effectiveness of the multi-head collaborative attention mechanism in sparse representations.

Table 3. Prediction samples from DENSPI+SPARC and our model.

Context: The league eventually narrowed the bids to three sites: New Orleans' Mercedes-Benz Superdome, Miami's Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium
Question: What was the given name of Miami's stadium at the time of Super Bowl 50?
DENSPI+SPARC: Levi's Stadium
OURS: Sun Life Stadium
4.3
Conclusion
In this paper, we use a collaborative multi-head attention mechanism to improve contextualized sparse representations. The multi-head part concentrates on information at various positions in the sparse vector, and the collaboration mechanism increases the effectiveness of the multi-head part. With the help of both parts, real-time open-domain QA accuracy is improved. Experiments on an open-domain QA dataset reveal that our model performs better than the SPARC-enhanced DENSPI model. The ablation study also shows that while using multi-head attention alone improves the model to some extent, its overall performance is not as good as that of the collaborative multi-head attention mechanism.
References

1. Semnani, S.J., Pandey, M.: Revisiting the open-domain question answering pipeline. arXiv preprint arXiv:2009.00914 (2020)
2. Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to answer open-domain questions. In: ACL (1) (2017)
3. Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M., Lin, J.: End-to-end open-domain question answering with BERTserini. In: NAACL-HLT (Demonstrations) (2019)
4. Wang, S., et al.: R^3: reinforced ranker-reader for open-domain question answering. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
5. Das, R., Dhuliawala, S., Zaheer, M., McCallum, A.: Multi-step retriever-reader interaction for scalable open-domain question answering. In: International Conference on Learning Representations (2018)
6. Jiang, B., Yang, J., Yang, C., Zhou, W., Pang, L., Zhou, X.: Knowledge augmented dialogue generation with divergent facts selection. Knowl. Based Syst. 210, 106479 (2020)
7. Seo, M.J., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. In: 5th International Conference on Learning Representations, ICLR (2017)
8. Fautsch, C., Savoy, J.: Adapting the tf-idf vector-space model to domain specific information retrieval. In: Proceedings of the 2010 ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, March 22-26, 2010, pp. 1708-1712 (2010)
9. Kadlec, R., Schmid, M., Bajgar, O., Kleindienst, J.: Text understanding with the attention sum reader network. In: ACL (1) (2016)
10. Seo, M., Kwiatkowski, T., Parikh, A., Farhadi, A., Hajishirzi, H.: Phrase-indexed question answering: a new challenge for scalable document comprehension. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 559-564 (2018)
11. Seo, M., Lee, J., Kwiatkowski, T., Parikh, A., Farhadi, A., Hajishirzi, H.: Real-time open-domain question answering with dense-sparse phrase index. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4430-4441 (2019)
12. Lee, J., Seo, M., Hajishirzi, H., Kang, J.: Contextualized sparse representations for real-time open-domain question answering. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 912-919 (2020)
13. Lin, Y., Ji, H., Liu, Z., Sun, M.: Denoising distantly supervised open-domain question answering. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1736-1745 (2018)
14. Wang, Z., Ng, P., Ma, X., Nallapati, R., Xiang, B.: Multi-passage BERT: a globally normalized BERT model for open-domain question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5878-5882 (2019)
15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171-4186 (2019)
16. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
17. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR (2015)
18. Voita, E., Serdyukov, P., Sennrich, R., Titov, I.: Context-aware neural machine translation learns anaphora resolution. In: 56th Annual Meeting of the Association for Computational Linguistics, pp. 1264-1274. Association for Computational Linguistics (2018)
19. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797-5808 (2019)
20. Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? In: Advances in Neural Information Processing Systems, vol. 32 (2019)
21. Cordonnier, J.B., Loukas, A., Jaggi, M.: Multi-head attention: collaborate instead of concatenate. arXiv preprint arXiv:2006.16362 (2020)
22. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP (2016)
Automatic Personality Prediction Based on Users' Chinese Handwriting Change

Yu Ji1, Wen Wu2(B), Yi Hu3, Xiaofeng He1, Changzhi Chen4, and Liang He1

1
School of Computer Science and Technology, East China Normal University, Shanghai, China 2 Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Computer Science and Technology, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China [email protected] 3 School of Psychology and Cognitive Science, East China Normal University, Shanghai, China 4 Born to Learn Education Technology, Sichuan, China
Abstract. In recent years, personality has been considered a valuable personal factor applied in many fields. Although some recent studies have tried to implicitly obtain a user's personality from her/his handwriting, they failed to achieve satisfactory prediction performance. Most of the related methods focus on constructing handwriting features, while handwriting change information is ignored. In fact, a user's handwriting change can reflect her/his physical and mental state more finely, which helps in recognizing the user's personality. Furthermore, the related studies may not fully use Chinese character features to analyze the change of Chinese handwriting. In this paper, we propose an effective Chinese Handwriting Change based Personality Prediction (CHCPP) model to identify users' personalities. Specifically, we construct the handwritten character sequence based on the writing order. We then extract the Chinese character features and the visual signals of each handwritten character in the sequence to analyze the handwriting change. Meanwhile, we also construct statistical Chinese character features based on the whole handwritten character set to assist in modeling the change of Chinese handwriting. Lastly, we utilize the handwriting change information and the statistical Chinese character features to obtain the prediction results. Experimental results show that our CHCPP model outperforms related methods on a real-world dataset.
Keywords: Personality prediction · Handwriting change · Chinese character feature · Deep learning

1
Introduction
As an affect-processing system, personality describes the relatively stable pattern of humans in terms of their behaviors, thoughts, and emotions [23]. Hence, c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 435–449, 2023. https://doi.org/10.1007/978-981-99-2385-4_33
436
Y. Ji et al.
personality has been considered a valuable personal factor used in many fields. Taking the field of education as an example, researchers have shown that students' personalities can affect their learning motivation [19] and their preference for teaching methods [6]. Therefore, teachers can improve both students' learning performance [21] and learning satisfaction [27] by tailoring the learning process to meet students' potential learning needs based on their personality profiles. Traditional questionnaire-based methods for personality identification (e.g., the 240-item NEO-PI-R questionnaire [9]) are time-consuming and laborious [36]. Hence, some researchers focus on implicitly recognizing users' personalities from their handwriting, as an individual's handwriting is the product of an organized system and can reflect the individual's personality [1,17]. Most of the existing methods utilize machine learning algorithms to build personality classifiers with handwriting features [24,25]. However, the related researchers ignore handwriting change information. This kind of information can actually reflect users' physical and mental state in a more fine-grained manner [30], which is useful for personality prediction. Furthermore, few of them make full use of Chinese character features when analyzing users' Chinese handwriting changes. In fact, unlike other languages, Chinese characters have unique square shapes [18], which can help researchers observe the change of Chinese handwriting more intuitively. To deal with these issues, we propose a Chinese Handwriting Change based Personality Prediction (CHCPP) model to identify users' personalities. By mining and fusing the Chinese character features and the visual signals from the handwriting images to analyze users' handwriting changes, the CHCPP model can better identify users' personalities.
Concretely, we construct the image sequence of handwritten characters according to the writing order, and extract the Chinese character features and the visual signals of each handwritten character in the sequence. Meanwhile, we also design statistical Chinese character features based on the whole handwritten character set. We then combine the Chinese character features and the visual signals to analyze the handwriting change, and use the statistical Chinese character features to focus on important characters in the handwritten character sequence. Finally, we utilize the handwriting change information together with the statistical Chinese character features to identify users' personalities. The main contributions of our work are as follows: (1) We propose an effective Chinese Handwriting Change based Personality Prediction (CHCPP) model to better accomplish the handwriting-based personality prediction task. Compared with other methods, our CHCPP model can classify users' personalities more precisely by analyzing their handwriting changes. (2) We extract various Chinese character features and combine them with visual signals to analyze the handwriting change that occurs in the handwritten character sequence. By fusing these kinds of information with psychological theory, our CHCPP model can extract practical handwriting change information.
(3) We conduct experiments on a real-world dataset to verify the effectiveness of our CHCPP model. The experimental results show that our CHCPP model achieves better performance than other related methods.
2
Related Work
The Big-Five Factor (BFF) model is one of the most authoritative personality models, which describes personality based on five traits: Openness to experience (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N) [10]. In this study, we construct the personality prediction model based on the BFF model. The traditional way to assess users' personalities is via questionnaire (e.g., the 50-item IPIP questionnaire [14]). However, it is time-consuming and labor-intensive [36], which impedes the large-scale application of personality in upper-level tasks (e.g., online learning systems [3] and gamified systems [26]). To utilize users' personalities on a large scale, researchers attempt to implicitly acquire them from various user-generated content, such as digital footprints [33], eye-tracking data [5], and handwriting [7]. Specifically, according to previous psychological findings, an individual's handwriting is a projection of her/his innate characteristics and has scientific bases that make it possible to identify an individual's personality [1]. Hence, handwriting has been widely studied for personality prediction [12]. Most of the related studies utilized machine learning algorithms to train personality classifiers with handwriting features. To be specific, Mekhaznia et al. extracted textural features from handwritten Spanish for personality prediction [24]. Mostafa et al. identified users' personalities by constructing writing-style features based on Arabic handwriting [25]. With the rapid development of deep learning, some deep neural network based methods have been proposed for the handwriting-based personality prediction task. For example, Gavrilescu et al. constructed a non-invasive three-layer architecture based on neural networks to recognize users' personalities [13].
On the other hand, Convolutional Neural Networks (CNNs) were also introduced into the field of handwriting analysis because they are well suited to extracting various visual signals from handwriting images. For example, a CNN with five convolutional layers is used to extract image features from the handwriting image [31]. However, the above-mentioned methods failed to achieve satisfactory performance on the handwriting-based personality prediction task, as they ignored the change that occurs in users' handwriting over time. In fact, handwriting change information can reflect users' physical and mental state in detail (e.g., frequent change of handwriting is a symptom of multiple personality disorder [28]), which may help the model classify users' personalities more accurately. In addition, the existing methods may not make full use of Chinese character features, which leads to an insufficient understanding of users' Chinese handwriting changes. Hence, we design a Chinese handwriting change based personality prediction model in this work. We also
Fig. 1. The architecture of CHCPP model.
Fig. 2. Examples of the handwritten characters segmented from the handwriting image.
would like to see whether the Chinese character features could be useful to analyze users’ Chinese handwriting change and improve our CHCPP model for the handwriting-based personality prediction task.
3
Methodology
Before introducing the details of our CHCPP model, we first give the formalization. For one of the five personality traits, a target user u ∈ U has a handwriting image I_u and a corresponding personality class P_u ∈ {low, high}. The whole handwritten character set R_u with n handwritten character images {r_u^1, r_u^2, ..., r_u^n} is segmented from I_u. The handwritten character sequence S_u with k handwritten character images {s_u^1, s_u^2, ..., s_u^k} is sampled from R_u. In addition, G_u = {g_u^1, g_u^2, ..., g_u^k} and C_u = {c_u^1, c_u^2, ..., c_u^k} denote the visual signals and Chinese character features extracted from S_u, respectively. Besides, D_u represents the statistical Chinese character features extracted from R_u, and Q_u denotes the handwriting change information. The architecture of our CHCPP model is shown in Fig. 1. We introduce the details of the CHCPP model module by module.

3.1
Handwritten Character Sequence Construction Module
We construct the image sequence of handwritten characters for subsequent analysis of handwriting change. Concretely, we first segment all of the handwritten
Fig. 3. The architecture of visual signal extraction module.
characters in the handwriting image. The image size of each handwritten character is 55 × 55 pixels. Figure 2 shows some examples of handwritten characters segmented from the handwriting image. To speed up training, we adopt systematic random sampling to select k = 100 handwritten characters, which are arranged into the handwritten character sequence according to the writing order.

3.2
Visual Signal Extraction Module
The visual signal extraction module extracts the key visual signals from each handwritten character; the extracted visual signals are used to analyze users' handwriting changes. As CNNs are good at extracting various visual signals with different convolution kernels, we build the visual signal extraction module on LeNet-5 [11], a successful application of CNNs in handwritten character recognition. The detailed architecture of the visual signal extraction module is shown in Fig. 3. Concretely, we use the handwritten character s_u^i (i = 1, 2, ..., k) as the input of the visual signal extraction module to acquire the corresponding visual signal g_u^i (i = 1, 2, ..., k).

3.3
Character Feature Extraction Module
Unlike other languages, Chinese characters have unique square shapes [18], and studies have shown that different square shapes (e.g., width-height ratios) can reflect different personalities [8]. Hence, we extract Chinese character features of each handwritten character for subsequent analysis of handwriting change. Concretely, we utilize OpenCV1 to extract 9 Chinese character features, which are listed in Table 1. Besides, we construct the statistical Chinese character features D_u, which include the mean, max, min, and median values of each Chinese character feature over all handwritten characters in R_u.
1 https://opencv.org/opencv-4-5-3/.
Y. Ji et al.
Table 1. The descriptions and examples of the Chinese character features.
Chinese Character Feature | Description
Width | The distance from the left-most point to the right-most point of the character
Height | The distance from the upper-most point to the lower-most point of the character
Width-Height Ratio (short as whr.) | Width/height, where / denotes the division operation
Area | Width*height, where * stands for the multiplication operation
Slant | The tilt angle of the character relative to the lower border
Left Blank | The distance from the left-most point of the character to the left border
Right Blank | The distance from the right-most point of the character to the right border
Upper Blank | The distance from the upper-most point of the character to the upper border
Lower Blank | The distance from the lower-most point of the character to the lower border
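Most of the bounding-box features in Table 1 can be computed directly from the ink pixels of a segmented character. Below is a minimal pure-Python sketch (the paper uses OpenCV; the function name and ink threshold are illustrative assumptions, and the slant feature, which would need e.g. image moments, is omitted):

```python
def character_features(img, ink_threshold=128):
    """Bounding-box features (Table 1) of one handwritten character,
    given as a 2-D list of grayscale values; dark pixels are ink."""
    h_img, w_img = len(img), len(img[0])
    ink = [(y, x) for y in range(h_img) for x in range(w_img)
           if img[y][x] < ink_threshold]
    ys = [y for y, _ in ink]
    xs = [x for _, x in ink]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    width, height = right - left + 1, bottom - top + 1
    return {
        "width": width, "height": height,
        "whr": width / height,            # width-height ratio
        "area": width * height,
        "left_blank": left,               # distances from character to borders
        "right_blank": w_img - 1 - right,
        "upper_blank": top,
        "lower_blank": h_img - 1 - bottom,
    }

# Toy 55x55 image with a 10-wide, 20-tall block of "ink".
img = [[255] * 55 for _ in range(55)]
for y in range(5, 25):
    for x in range(10, 20):
        img[y][x] = 0
feats = character_features(img)
assert feats["width"] == 10 and feats["height"] == 20
assert feats["whr"] == 0.5 and feats["upper_blank"] == 5
```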
3.4 Handwriting Change Information Extraction Module
Figure 4 shows the architecture of the handwriting change information extraction module. Bi-directional Long Short-Term Memory (Bi-LSTM) [34] is a typical variant of recurrent neural networks, which has been widely adopted for sequence problems thanks to its memory cells and gate mechanism. Hence, we use Bi-LSTM with an attention mechanism [2] to extract the handwriting change information from the handwritten character sequence. Concretely, we combine the visual signals and the Chinese character features of each handwritten character as the character's handwriting information. We then send the sequence of handwriting information (i.e., {[g_u^1; c_u^1], [g_u^2; c_u^2], ..., [g_u^k; c_u^k]}) to the Bi-LSTM to generate the handwriting change information Q_u. Formally, Q_u is a weighted summation of hidden states:

$$Q_u = \sum_{i=1}^{k} \alpha_i h_i \tag{1}$$
where h_i is the hidden state of s_u^i in S_u, and α_i is the attention weight of h_i, which measures the importance of s_u^i in S_u. The attention weight α_i for each hidden state is defined as:
Fig. 4. The architecture of handwriting change information extraction module.
$$\alpha_i = \frac{\exp(e(h_i, D_u))}{\sum_{j=1}^{k} \exp(e(h_j, D_u))} \tag{2}$$

$$e(h_i, D_u) = v^{T} \tanh(W_h h_i + W_d D_u + b) \tag{3}$$
where e is a score function that measures the importance of handwritten characters for composing the handwriting change information, v is a weight vector and v^T represents its transpose, and W_h and W_d are weight matrices. Considering that the statistical Chinese character features D_u could reflect user u's personality to a certain extent, we use them to help the CHCPP model focus on the specific handwritten Chinese characters that contribute more to the user's personality.
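Eqs. (1)-(3) can be sketched as follows (a toy pure-Python version with illustrative dimensions and weights; in the actual model, this pooling is applied on top of Bi-LSTM hidden states):

```python
import math

def attention_pool(H, D, Wh, Wd, v, b):
    """Score each hidden state h_i against the statistical features D_u
    (Eq. (3)), softmax the scores (Eq. (2)), and return the weighted
    sum Q_u (Eq. (1)) plus the attention weights."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Dd = matvec(Wd, D)
    scores = []
    for h in H:                                   # e(h_i, D_u), Eq. (3)
        z = [math.tanh(a + c + bi)
             for a, c, bi in zip(matvec(Wh, h), Dd, b)]
        scores.append(sum(vi * zi for vi, zi in zip(v, z)))
    m = max(scores)                               # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]        # Eq. (2)
    # Q_u = sum_i alpha_i * h_i, Eq. (1)
    Q = [sum(a * h[j] for a, h in zip(alphas, H)) for j in range(len(H[0]))]
    return Q, alphas

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy hidden states
D = [0.5, -0.5]                                   # toy statistical features
I = [[1.0, 0.0], [0.0, 1.0]]
Q, alphas = attention_pool(H, D, Wh=I, Wd=I, v=[1.0, 1.0], b=[0.0, 0.0])
assert abs(sum(alphas) - 1.0) < 1e-9 and len(Q) == 2
```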
3.5 Prediction Module
We utilize a feed-forward network to project the handwriting change information Q_u, together with the statistical Chinese character features D_u, into the target space of two classes and obtain the prediction result F_u (see Eq. (4)). The loss function for optimization is the cross-entropy error.

$$F_u = W_{p2}\,\mathrm{ReLU}(W_{p1}[Q_u, D_u] + b_{p1}) + b_{p2} \tag{4}$$
where ReLU(x) = max(0, x) is the activation function.
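The prediction module of Eq. (4) and the cross-entropy loss can be sketched as follows (toy dimensions and weights are illustrative assumptions):

```python
import math

def predict(QD, Wp1, bp1, Wp2, bp2):
    """Eq. (4): F_u = Wp2 * ReLU(Wp1 [Q_u, D_u] + bp1) + bp2."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    hidden = [max(0.0, z + bi) for z, bi in zip(matvec(Wp1, QD), bp1)]
    return [z + bi for z, bi in zip(matvec(Wp2, hidden), bp2)]

def cross_entropy(logits, label):
    """Cross-entropy loss over the two personality classes (low/high)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

QD = [0.2, -0.1, 0.4]                 # concatenated [Q_u, D_u] (toy sizes)
Wp1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
Wp2 = [[1.0, 0.0], [0.0, 1.0]]
F = predict(QD, Wp1, [0.0, 0.0], Wp2, [0.0, 0.0])
assert len(F) == 2 and cross_entropy(F, 1) > 0.0
```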
4 Experiment

4.1 Dataset
We conducted 5-fold experiments to validate the effectiveness of our CHCPP model on a real dataset. The dataset is provided by Born to Learn Education Technology(2), an education technology company in China. From May to July 2021, we invited 609 senior high school students to join our study. The 609 students come from the same school and the same grade, but different classes. On the one hand, we collected the students' handwritten Chinese essays, which are stored as 2690 × 1715 scanned images. An example of the handwriting image is illustrated in

2 http://www.sxw.cn.
Fig. 5. An example of the handwriting image.
Fig. 6. The class distribution under different personality traits.
Fig. 5, where we blurred the handwriting image to protect the student's privacy. To speed up training, we converted the color images into grayscale. On the other hand, we obtained each student's personality via the Ten-Item Personality Inventory (TIPI) questionnaire [15], which reaches adequate levels of convergent and discriminant validity. To be specific, the five personality traits are evaluated by 10 items: 2 items for each trait. Each item requests students to rate themselves on a 7-point Likert scale ranging from strongly disagree (i.e., 1) to strongly agree (i.e., 7). After filtering out invalid answers to the TIPI questionnaire, we ultimately collected the questionnaire answers of 419 students. To be specific, we first excluded students who chose the default option (i.e., 4) for all of the 10 items. We then excluded students whose answers to the TIPI questionnaire are contradictory. For example, a student rated 1 or 7 on both of two opposite items such as "I think I am extraverted, enthusiastic" and "I think I am reserved, quiet". According to the questionnaire answers, we acquired each student's score (between 1.0 and 7.0) for each personality trait. Specifically, for better observing the handwriting of students with extreme personality traits, we removed the personality scores in the range [m − 0.5, m + 0.5], where m stands for the median score for a given trait [32]. Hence, the classes for each trait are low and high, corresponding to the low scores (i.e., score < m − 0.5) and the high scores (i.e., score > m + 0.5), respectively. The class distribution under different personality traits is shown in Fig. 6.
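The median-band filtering that produces the low/high classes can be sketched as follows (the function name and toy scores are illustrative):

```python
from statistics import median

def label_extremes(scores, band=0.5):
    """For one personality trait: remove scores inside [m - 0.5, m + 0.5]
    around the trait median m, and label the rest as low/high classes."""
    m = median(scores)
    labeled = []
    for s in scores:
        if s < m - band:
            labeled.append((s, "low"))
        elif s > m + band:
            labeled.append((s, "high"))
        # scores inside the middle band are discarded
    return labeled

trait_scores = [1.5, 3.0, 4.0, 4.5, 5.0, 6.5]   # toy TIPI scores (1.0-7.0)
labeled = label_extremes(trait_scores)
assert (1.5, "low") in labeled and (6.5, "high") in labeled
assert all(abs(s - 4.25) > 0.5 for s, _ in labeled)   # band removed
```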
4.2 Evaluation Metric
We use the Area Under the receiver operating characteristic Curve (AUC, the higher the better) [16] to evaluate the performance of personality prediction models, as AUC provides a more discriminative evaluation of the quality of sample rankings than accuracy does [20].
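AUC can be computed directly from its ranking interpretation, which is equivalent to the area under the ROC curve (a minimal sketch; in practice a library routine such as scikit-learn's roc_auc_score would typically be used):

```python
def auc(labels, scores):
    """AUC as the probability that a random positive sample is ranked
    above a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
assert auc(labels, scores) == 8 / 9   # one positive ranked below a negative
```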
4.3 Baselines
We compared our CHCPP model with several related methods, as well as four variations of our model (i.e., CHCPP-A, CHCPP-D, CHCPP-DC, and CHCPP-DG). These methods can be divided into two categories: methods without/with handwriting change information.
• Methods without handwriting change information
– SVM+C [29] uses the handwriting features to train an SVM classifier. Specifically, due to the difference in language, we use the statistical Chinese character features to replace the handwriting features in the original paper.
– WriterPatch [31] expands the dataset by dividing the handwriting image into patches for personality prediction. Concretely, this expansion is made by cropping the original handwriting image into patches of 200 × 200 pixels. This method then utilizes all of the patches to train a CNN-based classifier. The final result for one handwriting image is acquired by voting over the personality classes of all its patches.
– BEiT [4] stands for bidirectional encoder representation from image transformers. To be specific, we fine-tune BEiT (initialized from beit-base-patch16-224-pt22k-ft22k) to extract high-order image features of the original handwriting image. We then send the extracted image features to a linear layer for personality prediction.
• Methods with handwriting change information
– CHCPP-A uses average pooling to replace the attention mechanism in the handwriting change information extraction module.
– CHCPP-D only uses the handwriting change information in the personality prediction module.
– CHCPP-DC only extracts the handwriting change information from the visual signals and removes the statistical Chinese character features in the personality prediction module.
– CHCPP-DG removes the statistical Chinese character features in the personality prediction module and only adopts the Chinese character features to extract the handwriting change information.
4.4 Experimental Results
The experimental results are shown in Table 2. Specifically, we obtain the overall AUC performance of each model by averaging its AUC performances over all of the five personality traits. In addition to AUC, we also present the Improvement Percentage (IP) of each model relative to the baseline SVM+C in Table 2. The improvement percentage of AUC is calculated as:

$$IP = \frac{Value_{test\,model} - Value_{baseline}}{Value_{baseline}} \times 100\% \tag{5}$$
Table 2. Personality prediction results. The best performances are in bold and the Improvement Percentage (IP) of each model is based on SVM+C.

Model       | O AUC (IP)     | C AUC (IP)     | E AUC (IP)     | A AUC (IP)     | N AUC (IP)     | Overall AUC (IP)
Models without handwriting change information
SVM+C       | 0.500          | 0.518          | 0.500          | 0.473          | 0.557          | 0.510
WriterPatch | 0.500 (+0.0%)  | 0.517 (-0.2%)  | 0.500 (+0.0%)  | 0.519 (+9.7%)  | 0.546 (-2.0%)  | 0.516 (+1.2%)
BEiT        | 0.500 (+0.0%)  | 0.509 (-1.7%)  | 0.510 (+2.0%)  | 0.518 (+9.5%)  | 0.519 (-6.8%)  | 0.511 (+0.2%)
Models with handwriting change information
Ours: CHCPP | 0.568 (+13.6%) | 0.613 (+18.3%) | 0.611 (+22.2%) | 0.571 (+20.7%) | 0.627 (+12.6%) | 0.598 (+17.3%)
CHCPP-A     | 0.541 (+8.2%)  | 0.601 (+16.0%) | 0.564 (+12.8%) | 0.553 (+16.9%) | 0.608 (+9.2%)  | 0.573 (+12.4%)
CHCPP-D     | 0.500 (+0.0%)  | 0.556 (+7.3%)  | 0.510 (+2.0%)  | 0.541 (+14.4%) | 0.619 (+11.1%) | 0.545 (+6.9%)
CHCPP-DC    | 0.500 (+0.0%)  | 0.528 (+1.9%)  | 0.500 (+0.0%)  | 0.511 (+8.0%)  | 0.538 (-3.4%)  | 0.515 (+1.0%)
CHCPP-DG    | 0.500 (+0.0%)  | 0.566 (+9.3%)  | 0.508 (+1.6%)  | 0.532 (+12.5%) | 0.599 (+7.5%)  | 0.541 (+6.1%)
where Value_baseline and Value_testmodel denote the AUC of the baseline SVM+C and of the test model, such as our CHCPP model.
The Effect of Handwriting Change Information. It can be seen from Table 2 that the models with handwriting change information perform better than the models without handwriting change information on most personality traits. For example, relative to SVM+C, CHCPP-DG increases the AUC on Agreeableness from 0.473 to 0.532. Similarly, CHCPP-DC also performs better on Conscientiousness when compared with BEiT (i.e., 0.528 vs. 0.509). On the contrary, the overall performance of the models without handwriting change information is far from satisfactory. Among them, WriterPatch obtains better performance (i.e., overall AUC = 0.516) by adopting a dataset expansion strategy. Even so, our CHCPP model still achieves better AUC than WriterPatch on all of the five personality traits regardless of their class distribution (i.e., Openness to experience is class-imbalanced, while the other four personality traits are relatively class-balanced). A possible reason is that our CHCPP model mines various information from the handwriting images (i.e., the Chinese character features and the visual signals) for analyzing users' handwriting changes, which enables it to better understand users' physical/mental state and then make an accurate classification of users' personalities.
The Effect of the Statistical Chinese Character Features. As shown in Table 2, CHCPP-A and CHCPP-D have lower performance than CHCPP on all of the five personality traits. For example, relative to CHCPP, CHCPP-A and CHCPP-D decrease the AUC on Neuroticism from 0.627 to 0.608 and 0.619, respectively. This proves the effectiveness of the statistical Chinese character features for analyzing users' handwriting changes and identifying users' personalities. To display the effect of the statistical Chinese character features intuitively, we plot the correlation analysis between the personality traits and the statistical Chinese character features in Fig. 7. As shown in Fig. 7, some
Fig. 7. Correlation analysis between the personality traits and the statistical Chinese character features. The number in the cell represents the corresponding correlation coefficient, while a superscript * indicates statistical significance.
When the probability exceeds the global threshold (P > PJ), society has reached a critical state of crisis, and some major problems have been encountered. Under this condition, society is no longer healthy and faces severe crises, and the country will collapse. For instance, the peasant revolt at the end of the Ming Dynasty accelerated the downfall of the Ming Dynasty [13].
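This threshold-collapse mechanism can be illustrated with a minimal Bak-Tang-Wiesenfeld sandpile sketch (a generic SOC toy, not our exact agent model; the 5 × 5 grid, the threshold of 4 grains, and the drop schedule are illustrative assumptions):

```python
import random

def topple(grid, threshold=4):
    """One avalanche of a Bak-Tang-Wiesenfeld sandpile: any site at or
    above the threshold collapses, shedding one grain to each neighbor;
    grains falling off the boundary are lost. Returns the avalanche size."""
    n, size = len(grid), 0
    unstable = True
    while unstable:
        unstable = False
        for y in range(n):
            for x in range(n):
                if grid[y][x] >= threshold:
                    grid[y][x] -= 4
                    size += 1
                    unstable = True
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        if 0 <= y + dy < n and 0 <= x + dx < n:
                            grid[y + dy][x + dx] += 1
    return size

random.seed(1)
grid = [[0] * 5 for _ in range(5)]
sizes = []
for _ in range(200):                     # drop grains one by one
    y, x = random.randrange(5), random.randrange(5)
    grid[y][x] += 1
    sizes.append(topple(grid))
assert all(cell < 4 for row in grid for cell in row)   # stable again
assert max(sizes) > 0                    # some drops trigger avalanches
```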
3 Outcomes of Optimal Solutions

3.1 Solving Optimal Parameters
We simulated the life cycle process based on the settings. To achieve fitness (validity), simulated outcomes should match the life cycles of Japanese history. Therefore, the key is to find the optimal solution Par*(·), which should best match the 13 real empires in Table 1. According to the chronology, we combine them into time-series data f_real(·). The total span is 2267 years. We treat the 13 empires as the set {Y_1, Y_2, ..., Y_13}, where Y_i represents the life cycle duration. To match the history of Japanese empires, we define the real history function f_real(·) = {Y_1, Y_2, ..., Y_12, Y_13}, where Y_i refers to the life-cycle process and duration of a specific empire. Likewise, we define the simulation function f_sim(·) = {Ŷ_1, Ŷ_2, ..., Ŷ_12, Ŷ_13}, where Ŷ_i captures the life-cycle duration of simulated empires. In Eq. (3), we compare the real empire durations in history, f_real(·), with the simulated ones, f_sim(·). Under each combination of parameters, we calculate the fitness of our model. The difference term, Δ = f_sim(·) − f_real(·), is applied to measure the difference between real and simulated empires. When it is minimal, the optimal solution achieves the highest matching degree.

$$\mathrm{Par}^*(\cdot) = \operatorname*{arg\,min} \sum_{i=1}^{13} (Y_i - \hat{Y}_i)^2 \quad \text{s.t.} \quad \sum_{i=1}^{13} \hat{Y}_i \approx 2267 \text{ years } (= \text{ticks}) \tag{3}$$
P. Lu et al.
The non-linear structure can be predicted and reverse-deduced [31]. The sandpile model captures the nature of SOC and therefore can be applied in reverse deduction of empire evolution [25]. By parameter traversal, we obtain the optimal solution Par*(·), which indicates that the global threshold PJ is 0.622. For Japanese society, this global threshold (PJ = 0.622) indicates the overall stability threshold of the sandpile system and the empire society. It determines the rises and falls of empires, and can best simulate the real empires in history. This section provides the outcomes of the optimal solution Par*(·), whose robustness and validity can be verified. For robustness, we run each simulation 1000 times to obtain robust outcomes. We use three criteria to verify the fitness: (a) the number of empires: as we have 13 empires in real history, our simulations should accurately generate 13 empires under the optimal solution; (b) the total duration of all empires: in the history of Japan, the summation over the 13 empires is 2267 years (Table 1), and the total duration of the 13 simulated empires should be 2267 ticks (years) as well; and (c) the paired span matching of empires: besides macroscopic matching, we also need microscopic matching, so we compare the 13 pairs of empire durations. If the gaps between paired empires are tiny, our model achieves the highest fitness.
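The squared-gap objective of Eq. (3) and the selection of the best run can be sketched as follows (the helper names are illustrative; the first candidate run reproduces the best-simulation gaps reported in Sect. 3.3, while the second is a deliberately worse dummy):

```python
def fitness(simulated, real):
    """Sum of squared duration gaps between paired simulated and real
    empires (the objective minimized in Eq. (3))."""
    assert len(simulated) == len(real)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(real, simulated))

# The 13 real empire durations (Yayoi ... Heisei), totalling 2267 years.
REAL = [600, 242, 118, 84, 391, 151, 237, 30, 264, 44, 13, 63, 30]
assert sum(REAL) == 2267

def best_run(runs, real=REAL):
    """Among simulation runs that produced 13 empires, pick the one
    minimizing the squared-gap objective."""
    candidates = [r for r in runs if len(r) == len(real)]
    return min(candidates, key=lambda r: fitness(r, real))

runs = [[611, 239, 143, 77, 389, 131, 249, 50, 241, 24, 21, 62, 34],
        [500, 300, 100, 100, 350, 200, 200, 50, 250, 50, 20, 70, 40]]
assert best_run(runs) == runs[0]
```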
3.2 Matched Size of Simulated Empires
Figure 2 shows the robust outcomes under Par*(·) for 1000 simulations. The number of simulated empires ranges from 6 to 21. The process of simulated empires is similar to the life cycles of Japanese empires, which is random and complex. However, the normal distribution pattern is clear, as can be seen in the Q-Q normal plot of Fig. 2: most data points are concentrated on the straight (y = x) line, which supports the fitness and robustness of our model. For the number of simulated empires, the mean value is 13 (12.8613), the same number as in Japanese history. Besides, the standard deviation (SD) is 3.058, which indicates that the optimal solution is stable and robust. Among the simulated runs, we have 130 cases with exactly 13 empires, which account for 13.0%. Figure 2 indicates that the number distribution of simulated empires is symmetric, with the following percentages for the different empire counts: N = 6 (1.10%), N = 7 (1.70%), N = 8 (5.40%), N = 9 (6.80%), N = 10 (8.90%), N = 11 (11.00%), N = 12 (11.10%), N = 13 (13.00%), N = 14 (9.60%), N = 15 (11.00%), N = 16 (8.50%), N = 17 (4.50%), N = 18 (3.70%), N = 19 (1.90%), N = 20 (1.10%), and N = 21 (0.70%). To a large extent, the sandpile model simulations can restore the history of Japanese empires.
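As a consistency check, the reported mean of 12.8613 can be recovered from the percentages listed above (the dictionary below simply transcribes them):

```python
# Empire-count distribution over 1000 simulations (Fig. 2), in percent.
dist = {6: 1.1, 7: 1.7, 8: 5.4, 9: 6.8, 10: 8.9, 11: 11.0, 12: 11.1,
        13: 13.0, 14: 9.6, 15: 11.0, 16: 8.5, 17: 4.5, 18: 3.7,
        19: 1.9, 20: 1.1, 21: 0.7}
assert abs(sum(dist.values()) - 100.0) < 1e-6      # percentages sum to 100
mean = sum(n * p for n, p in dist.items()) / 100.0
assert abs(mean - 12.8613) < 0.01   # matches the reported mean of ~12.86
```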
3.3 Matched Duration of Simulated Empires
There are 13 empires from the Yayoi to the Heisei period in Japan [21], and we check the matching of empires one by one. As Fig. 2 shows, we have 130 cases with 13 simulated empires. Figure 3A displays the real durations in historical order; the duration sequence is as follows: Yayoi (600 years), Kofun (242 years), Asuka (118 years), Nara (84 years), Heian (391 years), Kamakura (151 years), Muromachi
The Sandpile Model of Japanese Empire Dynamics
Fig. 2. Number of simulated empires under optimal solution (N = 1000). It shows the size of simulated empires under 1000 simulations.
(237 years), Azuchi-Momoyama (30 years), Edo (264 years), Meiji (44 years), Taisho (13 years), Showa (63 years), and Heisei (30 years). Figure 3B shows the best simulation under Par*(·), and we rank the simulated empires to match history. The series of bars refers to the simulated empire durations. Comparing Figs. 3A and 3B, the one-by-one span matching is obvious. The long history of 2267 years (Table 1) is full of chaos and complexity. Hence, small gaps between simulated and real empires should be deemed acceptable, as long as they are within 10% of the specific real duration. The gaps of the 13 paired empires form the value set {11, −3, 25, −7, −2, −20, 12, 20, −23, −20, 8, −1, 4}, and the 13 gap percentages form the value set {1.83%, 1.24%, 21.19%, 8.33%, 0.51%, 13.25%, 5.06%, 66.67%, 8.71%, 45.45%, 61.54%, 1.59%, 13.33%}. In Fig. 3B, the 13 gaps are normally distributed and the trend is close to zero. Figure 3C presents the 130 simulations (with 13 empires) out of 1000 simulations. Similarly, the 13 average durations of multiple simulations form the set {600.3, 245.4, 116.4, 81.4, 395.8, 160.9, 207.1, 21.6, 304.4, 39.7, 11.2, 52, 23.3}, and we have the set of gaps {−0.3, −3.4, 1.6, 2.6, −4.8, −9.9, 29.9, 8.4, −40.4, 4.3, 1.8, 11, 6.7}. Thus, most are within 10% of the real durations, and the overall gap is close to 0. Besides, if divided by the total duration, the gap percentages for the best simulation form {0.49%, 0.13%, 1.11%, 0.31%, 0.09%, 0.89%, 0.53%, 0.89%, 1.02%, 0.89%, 0.35%, 0.04%, 0.18%}, and those for multiple simulations form {0.01%, 0.15%, 0.07%, 0.12%, 0.21%, 0.44%, 1.32%, 0.37%, 1.79%, 0.19%, 0.08%, 0.49%, 0.30%}, which are even smaller. We further check the continuous distribution matching of simulated empires. For the real history in Fig. 3A, the probability density curve peaks when the
duration is around 100 years, then declines as the duration grows further. Besides, it is obvious that the distribution of real durations in history has a long right tail in Fig. 3D. For the best simulation, the density function of simulated empire durations in Fig. 3E is very close to that in Fig. 3D, which indicates that the best simulation has the same density function and its matching degree is perfect. The kernel density curve also has a longer right tail, which is consistent with previous studies [1]. For the 130 simulations with 13 empires in Fig. 3F, the aggregate curve is also close to that in Fig. 3D, and the perfect matching can be achieved. Figure 3F shows 130 kernel density curves; the aggregate kernel density curve is calculated based on the average outcomes of these 130 curves. Considering the duration similarity between real history and simulations, the one-by-one matching, and the overall trend matching, we conclude that perfect matching to real history can be confirmed.
Fig. 3. One-by-one matching of 13 empires. For A to C, x-axis refers to empires (Yayoi to Heisei), and y-axis refers to the duration. For D to F, X is duration, and Y is PDF.
4 Counterfactual Simulations
Besides perfect matching to real history, we further investigate counterfactuals. Although they did not happen in this world, they may be seen in other parallel worlds, and parallel simulation is suitable for exploring all possibilities. Besides, the division of the Japanese Empire has always been controversial, and the classification differs across perspectives. For example, the Early Kofun can only be studied through the monumental tombs, the stone chambers, and the various jade objects and bronze mirrors of the time [6]. The area of domination and the
period of its emergence and downfall are also controversial. The Northern and Southern Dynasties existed between the Kamakura and Muromachi periods, and we have broad and narrow definitions of the Muromachi period [27]. Therefore, the size can be other than 13. Thus, to further check the robustness and generality of our model, we relax the size to 13 ± 2. The matching is also well-supported.
4.1 The Matching for 11 and 12 Empires
We first explore 11 and 12 empires. Under the optimal solution Par*(·), we have 110 cases (11 empires) and 111 cases (12 empires). For 11 or 12 simulated empires, 2 or 1 tiny empires are removed, to make 11 or 12 pairs. First, we check the fitness of the best simulation. The durations of the 11 real empires are visualized in Fig. 4A, according to the historical order in Table 2. (a) For 110 simulations with 11 empires. In Japanese history, it is reasonable to classify the 2267 years into 11 empires. So, we drop the two smallest empires, Taisho (13 years) and Heisei (30 years). Finally, the 11 pairs of simulated and real empires can be obtained. The best simulation with 11 empires is exhibited in Subfig. 4B. The overall trend of the 11 simulated empires fits real history well. The longest empire lasts 600 years in real history and 636 years in the best simulation; the difference of 36 years is within 10% of 600 years. The Azuchi-Momoyama period is the shortest one, with 30 years in history and 38 years in the best simulation, a gap of 8 years. Subfigure 4B (One Sim Error) visualizes the 11 gaps or
Fig. 4. Matching of fewer than 13 empires. A and D exhibit 11 and 12 empires in real history. B and C exhibit the best simulation and 110 simulations in 11 empires, respectively. E and F, respectively, show the best simulation and means of 111 ones.
errors in years, and linear regression is used to fit their trend. Subfigure 4B shows that most data points and the trend are close to the horizontal y = 0 line. The smoothed regression line is also close to this y = 0 line, indicating that the simulation results have relatively small gaps and our model fits well. Then, we check the outcomes of multiple simulations (N = 110). Subfigure 4A illustrates the durations of the 11 real empires in history, and similarly, Subfig. 4C shows the distribution of the mean values of the 110 simulations. The error bars shed light on the standard deviation (SD) for each empire, because we have 110 observations. As seen in Subfig. 4C (Multi Sim Error), the SD is in a small range. The mean of multiple simulations fits the distribution of real historical empires better. For instance, the longest and shortest ones in history are the Yayoi period with 600 years and the Azuchi-Momoyama period with 30 years. In multiple simulations, they lasted 617.9 and 20.9 years, with tiny gaps of 17.9 and 9.1 years. Overall, the outcomes of multiple simulations are remarkably consistent with real history. Subfigure 4C (Multi Sim Error) reveals the gap values, and the linear regressive trend can be estimated and visualized. The smoothed regression line fluctuates around (covers) the horizontal y = 0 line, which indicates that the gaps of the simulations are tiny, within an acceptable range. (b) For 111 simulations with 12 empires. Similarly, we drop the smallest one (the Taisho, with 13 years) from Table 1 and keep the Heisei with 30 years. Then, we obtain a new list of empires in real history. Out of 1000 simulations, we have 111 simulations with 12 empires. First, we compare the 12 real and 12 simulated durations for the best simulation. Subfigure 4D visualizes the sequence of 12 empires in real history. Likewise, Subfig. 4E reveals the sequence of 12 empires under the best simulation. Generally speaking, the best simulation fits real history well.
The longest (Yayoi period) and shortest ones (Azuchi-Momoyama and Heisei) last 600 and 30 years in history. For the best simulation, the Yayoi is 621 years, the Azuchi-Momoyama is 9.3 years, and the Heisei is 27.3 years. The three gaps are 21, 20.7, and 2.7 years, which are relatively small and acceptable. The best fitness is achieved at the Kofun with 242 years, Asuka with 118 years, Nara with 84 years, and Showa with 63 years; the corresponding simulated empires last 244, 110, 79, and 55 years, and the gaps are tiny (2, 8, 5, and 8 years). Subfigure 4E illustrates the gaps between real history and the best simulation. All data points are centered on the y = 0 line, and the smoothed regression line fluctuates around y = 0, which indicates that the best simulation matches real history very well. Second, we also investigate the matching degree of multiple simulations, to verify robustness and generality. Subfigure 4D shows the 12 real durations in history, and Subfig. 4F shows the robustness of multiple simulations. The Yayoi has 568.2 years, Azuchi-Momoyama has 12.3 years, and Heisei has 21.6 years, so the three gaps (31.8, 17.7, and 8.4 years) are tiny. Similarly, we have real empires such as Kofun (242 years), Asuka (118 years), Heian (391 years), and Kamakura (151 years) in history, and multiple simulations provide 232.5, 96.2, 396.6, and 146.2 years. The four gaps are merely 9.5, 21.8, 5.6, and 4.8 years. The error bars are SD values. The smoothed regression line in Fig. 4F is also close to the horizontal y = 0 line, which supports fine fitness.
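The pairing procedure used throughout this section, dropping the smallest empires so that the simulated and real lists have equal length while preserving chronological order, can be sketched as follows (the function name is illustrative):

```python
def align_by_dropping(durations, target_size):
    """Drop the smallest empires until the list matches the target
    number of empires, keeping the chronological order of the rest."""
    durations = list(durations)
    while len(durations) > target_size:
        durations.remove(min(durations))
    return durations

# A toy 14-empire run reduced to 13 pairs: only the tiniest (3) is dropped.
sim = [610, 240, 120, 80, 390, 150, 230, 9, 260, 40, 5, 60, 28, 3]
assert align_by_dropping(sim, 13) == [610, 240, 120, 80, 390, 150, 230, 9,
                                      260, 40, 5, 60, 28]
```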
4.2 The Matching for 14 and 15 Empires
Out of 1000 simulations, we have 96 simulations with 14 empires and 110 simulations with 15 empires. From the perspective of parallel universes [35], it is possible to have 14 or 15 empires; in the universe where we live, we have 13. Hence, we drop 1 or 2 simulated ones to obtain 13 pairs of empires. The method is the same: dropping the tiniest empires, so that we can compare real and simulated empires and check the matching degree.
Fig. 5. Matching of more than 13 empires. A and D exhibit the 13 empires in real history. B and C exhibit the best simulation and the 96 simulations with 14 empires, respectively. E and F, respectively, show the best simulation and the means of the 110 simulations with 15 empires.
(a) For 96 simulations with 14 empires. We drop the smallest empire and obtain a sequence of 13 empires. First, we check the best simulation. Figure 5A shows the real history of 13 empires: the Asuka has 118 years, Nara 84 years, Kamakura 151 years, Meiji 44 years, and Showa 63 years. For the best simulation, they have 112, 98, 141, 54, and 70 years in Fig. 5B. The gaps are 6, 14, 10, 10, and 7 years, which are relatively tiny. Figure 5B (One Sim. Error) visualizes all gaps, and linear regression is applied to fit the errors. The smoothed regression curve is close to the y = 0 line, which indicates small gaps. Second, we check multiple (N = 96) simulations. Figure 5C presents the mean of multiple simulations with 14 empires. In Fig. 5C, the Asuka, Nara, Kamakura, Meiji, and Showa periods have 115.4, 87.7, 146.5, 40, and 59.6 years. In real history, they are 118, 84, 151, 44, and 63. The gaps are −2.6, 3.7, −4.5, −4, and −3.4, which are relatively
tiny. The error bars illustrate the SD values, which are small and indicate the robustness of our model. Figure 5C plots the 13 gaps, and the smoothed trend is close to the y = 0 line. Therefore, 14 simulated empires can back-calculate or restore real history. (b) For 110 simulations with 15 empires. Discarding two tiny dynasties, we also obtain 13 simulated empires, with 13 counterparts in real history. First, we check the best simulation. The matching degree is satisfactory. For instance, in Fig. 5D we have the 3 longest empires: Yayoi (600 years), Heian (391 years), and Edo (264 years). For the best simulation in Fig. 5E, the simulated Yayoi, Heian, and Edo periods have 591, 385, and 277 years. The gaps (9, 6, and 13 years) are relatively tiny. The smoothed regression curve is close to the y = 0 line, which indicates that the best simulation with 15 empires can reflect real history. Then, we further check multiple simulations. Figure 5D shows the real empires, and 5F shows the multiple simulations. For instance, the Yayoi, Heian, and Edo have 600.6, 388.4, and 291 years, with tiny gaps (0.6, 2.6, and 27 years). Figure 5F indicates that most data points coincide with the real empires. The error bar (SD) is narrow, which indicates small deviations and the robustness of multiple simulations. Subfigure 5F (Multi Sim. Error) illustrates the gaps between real history and multiple simulations. Also, the smoothed regression line is close to the y = 0 line, which indicates the fine fitness of our model.
5 Conclusions and Discussions
The life cycle pattern of empires exists worldwide [10]. Our model has indicated the related rules and regulations that govern this life cycle pattern. Microscopically, it is determined by the self-organized behaviors of agents, which can be captured by the sandpile model. The evolutionary features of sandpiles and empires are similar: far from equilibrium states, they all collapse automatically. (a) Periodicity and Fluctuations. Natural and social phenomena follow similar periodic laws, because the internal self-organized structure is similar. The sandpile system collapses as more sand particles are added, which takes on periodic oscillations, and this also applies to empires. For the Japanese empires, there are 13 empire cycles. By 1912, Emperor Meiji had transformed Japan from an isolated feudal society into a rapidly expanding industrial empire, and some successful wars helped Japan expand its territory. In 1941 in particular, the territory reached its peak. However, the defeat in World War II ended the Japanese Empire [2]. Empires always follow the periodic cycle of rise and fall, a chaotic-orderly-chaotic process. This periodicity and turbulence can be better interpreted and modeled by the "sandpile collapse". (b) Self-organized criticality. The sandpile model shows typical self-organized critical phenomena. When local patches (agents) have reached the "critical" state, they take on system-level rules and regulations, and newly added sand particles cause systemic disturbances. Although local disturbances are not obvious, they shape the whole system; the sandpile structure makes itself more fragile to any new sand. Finally, global collapses or turbulence of the system will occur. Empire
also has features of self-organized criticality. When the Japanese Empire reached the critical state, changes in economy, politics, power, climate, and world order had great impacts. Eventually, they led to the decline of the Japanese Empire. It is obvious that the empire has evolved from one regime to another, not gradually, but in the form of catastrophic avalanches. Considering the sandpile model of Chinese empires [30], both Japanese and Chinese empires are under self-organized criticality. (c) Back-calculations of real history. As a typical model of Self-Organized Criticality [26], the sandpile model has accurately back-calculated Japanese history [7]. The optimal solution supports the validity, robustness, and generality of our model. First, the size of simulated empires perfectly matches 13: for 1000 optimal-solution simulations, we obtain the size distribution of empires (N = 1000), whose mean is 13 and which follows a normal distribution. Second, the 13 durations of real empires in history can be well fitted, under both the best simulation and multiple simulations; for multiple simulations, the 13 simulated durations can better back-calculate real history, and the kernel density curves also match real history. Last, we loosen the conditions and infer counterfactuals, such as 11-12 and 14-15 empires, and the best matching can be achieved as well. The span gaps between real and simulated empires are within reasonable ranges, and the trend is close to the y = 0 line. Hence, we can perfectly back-calculate the process of real history, which reinforces the similarity (coupling) between human society and natural systems [29]. (d) Limitations and Future Directions. Although generally accepted standards of empire division are adopted, there are still some problems in the life cycle research of Japanese empires. Different standards produce different classifications, which shapes our real target function f_real(·). Besides, like natural systems, society and history are also full of complexity.
For more than 2000 years, empires have all had life cycles, which indicates a strong pattern whose internal mechanism should be explained. Sometimes, two empires may coexist (overlap) during the same period, which easily leads to ambiguity in longevity. In future research, we will continue to search for historical data on other empires. Based on self-organized criticality, more agent-based models should be applied to unveil empire dynamics. Moreover, we should compare eastern (Japan, China, etc.) and western empires from the broader perspective of global history.
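The avalanche mechanism invoked above, slow driving punctuated by cascading relaxations, can be illustrated with a minimal Bak-Tang-Wiesenfeld sandpile sketch in the spirit of [4]. The grid size, number of grains, and random seed are arbitrary illustration choices, not the calibrated empire model of [25].

```python
import random

def sandpile_avalanches(n=20, grains=2000, seed=7):
    """Drop grains one at a time onto an n x n lattice; any cell holding
    4 or more grains topples, giving one grain to each neighbour (grains
    toppled off the edge are lost). Returns the size (number of
    topplings) of every avalanche triggered."""
    random.seed(seed)
    grid = [[0] * n for _ in range(n)]
    sizes = []
    for _ in range(grains):
        i, j = random.randrange(n), random.randrange(n)
        grid[i][j] += 1
        stack, size = [(i, j)], 0
        while stack:
            x, y = stack.pop()
            if grid[x][y] < 4:
                continue
            grid[x][y] -= 4
            size += 1
            if grid[x][y] >= 4:      # may need to topple again
                stack.append((x, y))
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < n and 0 <= ny < n:
                    grid[nx][ny] += 1
                    if grid[nx][ny] >= 4:
                        stack.append((nx, ny))
        if size:
            sizes.append(size)
    return sizes
```

Near the critical state most drops cause no avalanche or a tiny one, while occasional drops trigger system-wide cascades: the heavy-tailed size distribution the paper maps onto the rise and fall of empires.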
References

1. Arbesman, S.: The life-spans of empires. Hist. Methods J. Quant. Interdiscipl. Hist. 44(3), 127–129 (2011)
2. Ardrey, R.: Territorial Imperative; A Personal Inquiry into the Animal Origins of Property and Nations. Atheneum, New York (1966)
3. Bak, P.: How Nature Works: The Science of Self-organized Criticality. Springer, New York (2013). https://doi.org/10.1007/978-1-4757-5426-1
4. Bak, P., Tang, C., Wiesenfeld, K.: Self-organized criticality. Phys. Rev. A 38(1), 364 (1988)
5. Barceló, J.A., Del Castillo, F.: Simulating Prehistoric and Ancient Worlds. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31481-5
P. Lu et al.
6. Barnes, G.L.: A hypothesis for early Kofun rulership. Jpn. Rev. 3–29 (2014)
7. Beasley, W.G.: The Japanese Experience: A Short History of Japan. University of California Press, Berkeley (2000)
8. Bhowal, A.: Damage spreading in the 'sandpile' model of SOC. Phys. A 247(1–4), 327–330 (1997)
9. Chang, K.C., Lamberg-Karlovsky, C.: Ancient China and its anthropological significance. In: Archaeological Thought in America, pp. 155–166. Cambridge University Press, London (1989)
10. Chen, Q.: Climate shocks, dynastic cycles and nomadic conquests: evidence from historical China. Oxf. Econ. Pap. 67(2), 185–204 (2015)
11. Cooley, M.: The giant remains: Mesoamerican natural history, medicine, and cycles of empire. Isis 112(1), 45–67 (2021)
12. Davies, J.C.: The territorial imperative. A personal inquiry into the animal origins of property and nations. Am. Polit. Sci. Rev. 61(1), 162–163 (1967)
13. De Tocqueville, A.: The Old Regime and the French Revolution. Anchor (2010)
14. Diamond, J.M., Ordunio, D.: Guns, Germs, and Steel, 1st edn. Books on Tape, New York (1999)
15. François, P.: Empire: the rise and demise of the British world order and the lessons for global power. Bull. D'information: ABHC = Mededelingenblad: BVNG 26(2–3), 18–20 (2005)
16. Hardt, M., Negri, A.: Empire, 1st edn. Harvard University Press, Cambridge (2020)
17. Ito, K.: Punctuated-equilibrium model of biological evolution is also a self-organized-criticality model of earthquakes. Phys. Rev. E 52(3), 3232 (1995)
18. Ivashkevich, E., Priezzhev, V.B.: Introduction to the sandpile model. Phys. A 254(1–2), 97–116 (1998)
19. Jackson, J.C., Rand, D., Lewis, K., Norton, M.I., Gray, K.: Agent-based modeling: a guide for social psychologists. Soc. Psychol. Person. Sci. 8(4), 387–395 (2017)
20. James, D.H.: The Rise and Fall of the Japanese Empire. Routledge, Milton Park (2010)
21. Jansen, M.B., Hall, J.W.: The Cambridge History of Japan. No. 1, Cambridge University Press, Cambridge (1989)
22.
Kelly, K.: Out of Control: The New Biology of Machines, Social Systems, and the Economic World. Hachette UK, London (2009)
23. Kidd, B.: Social Evolution. GP Putnam's Sons, New York (1898)
24. Lieven, D.C.: Empire: The Russian Empire and Its Rivals. Yale University Press, New Haven (2002)
25. Lu, P., Yang, H., Li, M., Zhang, Z.: The sandpile model and empire dynamics. Chaos Sol. Fract. 143, 110615 (2021)
26. Majumdar, S.N., Dhar, D.: Height correlations in the abelian sandpile model. J. Phys. A: Math. Gen. 24(7), L357 (1991)
27. Meyer, M.W.: Japan: A Concise History. Rowman & Littlefield Publishers (2012)
28. Olson, M.: The Rise and Decline of Nations. Yale University Press, New Haven (2008)
29. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043), 814–818 (2005)
30. Paoletti, G.: Deterministic Abelian Sandpile Models and Patterns. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-01204-9
31. Pincus, S.: 1688. Yale University Press (2009)
32. Sklar, E.: NetLogo, a multi-agent simulation environment. Artif. Life 13(3), 303–311 (2007)
The Sandpile Model of Japanese Empire Dynamics
33. Smith, E.R., Conrey, F.R.: Agent-based modeling: a new approach for theory building in social psychology. Pers. Soc. Psychol. Rev. 11(1), 87–104 (2007)
34. Tainter, J.: The Collapse of Complex Societies. Cambridge University Press, Cambridge (1988)
35. Tegmark, M.: Parallel universes. Sci. Am. 288(5), 40–51 (2003)
Active Authorization Control of Deep Models Using Channel Pruning

Linna Wang1, Yunfei Song1, Yujia Zhu1, and Daoxun Xia1,2(B)

1 School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550025, China
2 Engineering Laboratory for Applied Technology of Big Data in Education, Guizhou Normal University, Guiyang 550025, China
[email protected]
Abstract. In recent work, researchers have proposed active authorization control strategies to protect deep neural network (DNN) models. Because active authorization strategies can prevent attackers from stealing models in advance, they have become a focus of DNN model copyright protection research. At present, most active authorization methods prevent malicious infringers from using a model by encrypting its parameters or modifying its structure, which significantly reduces accuracy. Active authorization methods that modify the structure of a DNN model impact the accuracy of the model on its original task. Moreover, in active authorization methods that encrypt DNN model parameters, authorized users need to perform many calculations to decrypt the parameters. Therefore, this paper uses a channel pruning algorithm to control the authorization of a DNN model. In this work, the pruning rate or threshold is used as the secret key of the DNN model, the key is used to prune and fine-tune the original model before it is distributed to authorized users, and the fine-tuned model can restore performance similar to that of the original model. Due to the advantages of the pruning mechanism, the DNN model retains the performance of the original task during the active authorization process while reducing the number of calculations. We evaluate our work with the CIFAR-10 and CIFAR-100 datasets, and the experimental results show that we only need to prune a small number of channels in the DNN model to determine whether a user is authorized.

Keywords: Deep neural network · Channel pruning · Authorization control · Copyright protection

1 Introduction
Supported by National Natural Science Foundation of China (grant no. 62166008).
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 530–542, 2023. https://doi.org/10.1007/978-981-99-2385-4_40

In recent years, deep learning (DL) has achieved great success in various fields, including computer vision, speech recognition, natural language processing,
and other critical artificial intelligence fields. Deep convolutional neural network (DCNN) frameworks, such as LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet, have made it easier to train deep neural network (DNN) models. However, DNN model training is still a difficult task because it requires large datasets and a considerable amount of computing resources. Consequently, deep learning models have been deployed more widely and have become more valuable, and it is necessary to protect the intellectual property rights of deep neural networks (DNNs).

Researchers have recently shown interest in using watermarks to verify the copyright of DNN models. DNN watermarking is a technology that attempts to embed a watermark in the redundant information of a DNN model, extract the watermark from a stolen model, and then prove the copyright of the stolen model. However, the main disadvantage of DNN watermarking is that it can be applied only after a DNN model is stolen and thus cannot prevent piracy in advance. As an alternative to DNN watermarking, we propose a DNN channel pruning algorithm to protect DNN models from illegal use. To the best of our knowledge, the present paper is the first work that shows how to prune and fine-tune DNN models to ensure that DNN models are not infringed upon by unauthorized users. The contributions of this paper include the following four aspects.

– We develop a new copyright protection method for DNN models based on a channel pruning algorithm that reliably protects DNN models.
– This is the first study to control the authorization of DNN models using channel pruning, and it provides an important opportunity to advance our understanding of the main DNN model authorization control methods by making the functions of the DNN model unavailable.
– Because important channels account for only a small portion of the total number of channels, we need to prune only the important channels in a DNN model to protect it from illegal use.
– Due to the advantages of the pruning mechanism, the DNN model retains the performance of the original task during the active authorization process while reducing the number of calculations.
2 Related Work

2.1 DNN Model Protection Strategies
DNN model protection strategies can be divided into two categories: passive verification and active authorization control. With passive verification, a DNN model owner can verify the copyright of a model if a DNN model is stolen, usually with a watermark that is embedded into the DNN model before the model is distributed [19]. Passive verification strategies can be divided into white-box watermarking, black-box watermarking, and no-box watermarking. White-box watermarking embeds watermarks into the parameters of the DNN model [9,13,16,17]. Black-box watermarking embeds watermarks into the application programming interface (API) of the DNN model
and extracts the watermarks according to the relationship between the input and output of the DNN model [1,22]. No-box watermarking embeds watermarks into the output results of a DNN model [23]. With active authorization control, users can activate the inference function of the DNN model only if they can prove that they have the legal right to use the model [19]. Otherwise, the DNN model is paralyzed and cannot be used. Fan et al. [2,3] used a specified digital entity as the authorization certificate in the DNN model. This digital entity, known as a digital passport, needs to be embedded into the DNN model during training. Users can use the DNN model only when they input the correct digital passport, and a forged digital passport significantly degrades the performance of the DNN model. However, this method changes the internal structure of the DNN model, resulting in performance degradation. Tian et al. [15] encrypted important parameters in the DNN model with a selective encryption algorithm. The DNN model decrypts different numbers of parameters for different access users and provides hierarchical access services; however, because DNN models usually have a large number of parameters, the computational cost of parameter encryption is not acceptable, and it increases the decryption time. Xue et al. [21] proposed a novel solution. They first used a loss function to select and encrypt some parameters in the DNN model and then used the adversarial perturbation correction parameters of the DNN model, the position of the encrypted parameters, and the value of the adversarial perturbation to jointly generate a key. Authorized users can decrypt the DNN model with this key, allowing them to utilize the inference function of the DNN model. This method effectively prevents malicious infringers from using the inference function of the DNN model. Xue et al.
[20] implemented user fingerprint management while implementing active authorization control. This method distributes adversarial samples of the DNN model as fingerprints to authorized users and adds a control layer as the last layer of the DNN model. The control layer restricts unauthorized users from accessing or using the DNN model. When an authorized user inputs a fingerprint into the DNN model, the control layer is automatically deleted, and the DNN model is restored to normal use.

2.2 Deep Neural Network Pruning
DNNs are usually over-parameterized, and DNN models have significant redundancy, leading to unnecessary computations and inappropriate memory usage. Various approaches for eliminating this redundancy have been proposed. To address these limitations, Song et al. [4] proposed a method that reduces the storage and computation requirements of DNN models by one order of magnitude without affecting model accuracy by learning only the important connections. They used a three-step method to prune redundant connections. First, they trained the network to determine which connections are important. Next, they deleted the connections that were deemed unimportant. Finally, they retrained the network to fine-tune the weights of the remaining connections. Song's pruning method is a type of unstructured pruning approach.
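The pruning step of the three-step scheme can be sketched as follows; the magnitude-threshold masking is a simplified stand-in for the full train/prune/retrain pipeline of [4], and the weight values are illustrative.

```python
def magnitude_prune(weights, prune_fraction):
    """Step 2 of the three-step scheme: zero out the prune_fraction
    smallest-magnitude weights, returning the pruned weights and a 0/1
    keep-mask. In the full pipeline the mask is then frozen while the
    surviving weights are retrained (fine-tuned)."""
    k = int(len(weights) * prune_fraction)
    if k == 0:
        return list(weights), [1] * len(weights)
    # magnitude of the k-th smallest weight acts as the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    mask = [0 if abs(w) <= threshold else 1 for w in weights]
    return [w * m for w, m in zip(weights, mask)], mask
```

For example, pruning half of `[0.1, -0.5, 0.9, 0.05]` removes the two smallest-magnitude entries and keeps `-0.5` and `0.9`.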
Because unstructured pruning approaches have specific hardware requirements, structured pruning approaches have been developed. Channel pruning is a common structured pruning approach. A common channel pruning method is scale factor mapping. First, a scale factor is mapped to each channel in the CNN model; then, a regularizer is used to make the scale factors sparse. The L1 regularizer is widely used to sparsify DNN models. Liu et al. [10] proposed a neural network learning scheme that applies the L1 regularizer directly to the scaling factors in the batch normalization layers, setting unimportant scale factors to zero and thus automatically identifying and pruning unimportant channels during the training process. As a result, a compact network with fewer parameters can be obtained, and fewer computations are required. Because the L1 regularizer pushes all scale factors towards 0, it does not distinguish between pruned channels and preserved channels. A more reasonable pruning method is to suppress only the unimportant channels (scale factor close to 0) while retaining the important channels (large scale factor). To achieve this goal, Zhuang et al. [25] proposed a new scale factor regularizer: the polarization regularizer. The polarization regularizer increases the distance between the deleted and reserved channels; thus, the pruned and reserved channels are easier to separate. The experimental results showed that structured pruning with the polarization regularizer is significantly better than with the L1 regularizer. Channel pruning is one of the most effective DNN pruning methods; hence, in this work, we introduce an active authorization control method that uses an L1 regularizer and a polarization regularizer to prune the channels in DNN models that are mapped to larger scale factors.
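The contrast between the two regularizers can be seen on toy scale factors. The polarization form below is a simplified paraphrase (an L1 pull minus the spread around the mean), not the exact formula of [25].

```python
def l1_reg(gammas):
    # L1 pushes every scale factor towards zero indiscriminately
    return sum(abs(g) for g in gammas)

def polarization_reg(gammas, t=1.0):
    """Simplified polarization-style penalty: an L1 pull weighted by t,
    minus the spread around the mean, so factors are rewarded for
    splitting into a near-zero group and a clearly large group."""
    mean = sum(gammas) / len(gammas)
    return t * sum(abs(g) for g in gammas) - sum(abs(g - mean) for g in gammas)

polarized = [0.0, 0.0, 1.0, 1.0]   # two poles: prune the zeros, keep the ones
uniform   = [0.5, 0.5, 0.5, 0.5]   # no separation between channels
```

Both lists have the same L1 value, but the polarization penalty is lower for the separated configuration, which is why it yields channels that are easier to threshold apart.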
Fig. 1. Illustration of the total active authorization framework for DNN models. The darker the colour of the channel is, the more important that channel is, and the lighter the colour of the channel is, the less important that channel is.
3 Proposed Method
As shown in Fig. 1, we propose an active authorization control framework for DNN models with three parts: the original DNN models, the pruned DNN models, and the fine-tuned DNN models. Next, we describe the channel pruning process, the fine-tuning process, and the authorization control process for DNN models.

3.1 Overview
In the pruning and fine-tuning stage, we prune the DNN model with the pruning rates or thresholds provided by users and fine-tune the pruned DNN model to improve its accuracy (Sect. 3.2). In Sect. 3.3, we introduce the authorization control process of our method. Specifically, we discuss how the pruning rates and thresholds are set and how different pruning rates and thresholds result in various pruning and fine-tuning accuracies.

3.2 Pruning and Fine-Tuning
In previous work, the purpose of pruning channels in DNN models was to remove unimportant channels, ensure that pruning did not impact DNN model performance, and reduce the number of calculations as much as possible. Traditionally, DNN model pruning has been used to develop DNN models with smaller sizes, improved memory savings, faster inference speeds, and minimized accuracy losses. However, the DNN model pruning method discussed in this paper selects and prunes important channels in DNN models, reducing the performance of the DNN model and making the main functions of the model unavailable. To establish whether our method is applicable to most previous DNN model pruning methods, we evaluate two different DNN model channel pruning methods [10,25]. In terms of the rule for selecting channels, the model owner first initializes a set of scaling factors γ_1, γ_2, ..., γ_n, where γ_i represents the scale factor of channel i, and then trains the scale factors with a channel sparsity regularizer R_s(γ), which can be either an L1 regularizer [10] or a polarization regularizer [25]. The L1 and polarization regularizers are applied to the scaling factors in the batch normalization (BN) layers as follows:

\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} L(f(x_i;\theta), y_i) + R(\theta) + \lambda R_s(\gamma),   (1)

thus, our method is easy to implement and does not change existing CNN architectures. In the above equation, x_i and y_i denote the training input and training target, respectively, θ denotes the trainable weights, L(·) is the loss function, R(·) is usually the L2 regularizer on the parameters of the DNN model, R_s(·) indicates the sparsity regularization on the scale factors of the channels, which can be an L1 regularizer or a polarization regularizer in this paper, and λ balances
the three terms. After the channel pruning process, the accuracy of the DNN model decreases. In general, the accuracy can be restored by fine-tuning the DNN model after pruning. Because the structure of a DNN model is simpler after pruning, the model often generalizes better when the task is less complex; that is, the DNN model may be overfitted to the dataset before pruning, while its accuracy may improve after pruning and fine-tuning.

3.3 Authorization Control
Setting the Prune Rates and Thresholds. For the L1 regularizer, the DNN model owner distributes a pruning rate to each authorized user as a secret key. The pruning rates of the authorized users are in the following range:

Min_pr < pru_ratio < Max_pr.   (2)
Before a user can employ a DNN model, the user must present a unique pruning rate pru_ratio, and then pru_ratio% of the total number of channels in the DNN model are pruned. If pru_ratio is greater than Max_pr, too many channels may be pruned, and the accuracy after fine-tuning may be low. If pru_ratio is less than Min_pr, too few channels may be pruned, the accuracy after pruning may be very high, and the accuracy of the fine-tuned model distributed to users may also be very high, resulting in illegal use of the DNN model. Therefore, the pruning rates of authorized users need to range between Min_pr and Max_pr. For the polarization regularizer, the DNN model owner distributes a pruning threshold to each authorized user as a secret key. The pruning thresholds of the authorized users are in the following range:

Min_pt < pru_threshold < Max_pt.   (3)
Before a user can employ a DNN model, the user must present a unique pruning threshold pru_threshold, and then the channels whose scale factors are greater than pru_threshold are pruned from the DNN model. If pru_threshold is smaller than Min_pt, too many channels may be pruned, and the accuracy after fine-tuning may be very low. If pru_threshold is greater than Max_pt, too few channels may be pruned, the accuracy after pruning may be very high, and the accuracy of the fine-tuned model distributed to users may also be very high, resulting in illegal use of the DNN model. Therefore, the pruning thresholds of authorized users need to range between Min_pt and Max_pt.

Table 1. Determination of authorized and unauthorized users. Baseline represents the accuracy of the DNN model without pruning, Prune_acc is the accuracy of the DNN model after pruning, Finetune_acc is the accuracy of the DNN model after fine-tuning, ✓ represents an authorized user, and ✗ represents an unauthorized user.

                                 | Baseline − Prune_acc ≥ τ_p | Baseline − Prune_acc < τ_p
|Finetune_acc − Baseline| ≤ τ_f | ✓                          | ✗
|Finetune_acc − Baseline| > τ_f | ✗                          | ✗

Table 2. The accuracy threshold settings in Table 1 for different datasets and DNNs. τ_p represents the pruning accuracy threshold, and τ_f represents the fine-tuning accuracy threshold.

Datasets  | DNNs         | τ_p (%) | τ_f (%)
CIFAR-10  | VGG-19       | 50.00   | 3.50
CIFAR-100 | PreResNet-56 | 20.00   | 3.50
CIFAR-100 | ResNet-56    | 20.00   | 2.00

Pruning Accuracy and Fine-Tuning Accuracy. Consider a scenario in which an unauthorized user provides a smaller pruning rate or a greater pruning threshold so that improved accuracy is obtained during the fine-tuning process. To avoid such impersonation of authorized users, as shown in Table 1, the accuracy after pruning is limited to the range:

Baseline − Prune_acc ≥ τ_p,   (4)
the difference between the accuracy of the original DNN model and that of the DNN model after pruning must be greater than the threshold τ_p, which increases the difficulty of an unauthorized user guessing the pruning rates or thresholds. Thus, the probability that a guessed pruning rate or pruning threshold coincides with one distributed to an authorized user within the specified range is very low, increasing the security of the active authorization method for DNN models. If unauthorized users keep trying to crack the secret key by brute force, they first need to know the DNN active authorization method and the pruning method, and the cost of a brute-force search for this information is prohibitive. Even if it is assumed that unauthorized users do crack the DNN model by brute force, it will take a large amount of computational resources: they need to use different secret keys to prune and fine-tune the model so that the accuracy of the pruned model and the accuracy of the fine-tuned model fall within the ranges shown in Table 2, and the computational resources spent are enough for unauthorized users to train the model by themselves. As shown in Table 1, the accuracy after fine-tuning is limited to the range |Finetune_acc − Baseline| ≤ τ_f.

Logic Link Recognition. Obviously, CI and EI are dispersed in the influence probability network, and focusing on their relationship links is the key to probability calculation. The flow network complex graph G_1 can be constructed from the IPN for any causal pair by DFS, and the logical link set R can be extracted.

Markov-Based Influence Probability Calculation. Since the influence between configuration items is transitive, there are many influencing factors between non-adjacent nodes, and the relationships are complex. This paper assumes that the influence chains of configuration items conform to the Markov property [14]. A Markov process refers to a random process in a state space that transitions from one state to another.
This process requires the "memoryless" property; that is, the probability distribution of the next state is determined only by the current state, and the events before it are irrelevant in the time series. Therefore, the formula can be constructed as follows:

P_l(v | u) = Σ_{r=(u,k_1,k_2,...,k_n,v)∈R} P(v | k_n) · · · P(k_2 | k_1) P(k_1 | u)   (8)

where r = (u, k_1, k_2, ..., k_n, v) ∈ R represents any connected path with u and v as its two ends in the flow network complex graph G_1, and n represents the length of the path. Fig. 2 shows the calculation process of a single causal influence probability.
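Equation (8) can be sketched as a DFS over the flow network that multiplies one-step conditional probabilities along each path and sums over paths. The three-node graph and its probabilities below are invented for illustration.

```python
def influence_probability(edges, u, v):
    """Eq. (8) sketch: sum over every simple path r = (u, k1, ..., kn, v)
    of the product of one-step conditional probabilities along the path.
    `edges` maps each node to a dict {successor: P(successor | node)}."""
    total = 0.0

    def dfs(node, prob, visited):
        nonlocal total
        if node == v:
            total += prob
            return
        for nxt, p in edges.get(node, {}).items():
            if nxt not in visited:           # keep paths simple (acyclic)
                dfs(nxt, prob * p, visited | {nxt})

    dfs(u, 1.0, frozenset({u}))
    return total

# Hypothetical fragment: A influences C directly and through B.
g = {"A": {"B": 0.5, "C": 0.4}, "B": {"C": 0.3}}
# P_l(C | A) = 0.4 + 0.5 * 0.3 = 0.55
```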
Y. Wang et al.
Fig. 2. Example of single factor causal pair computing
2.5 Multiple Factor Causal Pair Computing
Since a change instruction often affects a group of initial configuration items rather than one, when multiple initial configuration items are disturbed simultaneously, the analysis of other potentially affected configuration items is causally complex. To solve this problem, we first establish the SCM [15] for multiple factor causal pairs to analyze whether there are confounding factors in any causal pair. If there are, the influence of the confounding factors is removed by an intervention method to obtain the non-confounded influence probability, and then the multiple factor causal pairs are added up as a whole.

SCM-Based Multiple Factor Causal Confounder Recognition. On the basis of the IPN G_0, the initial configuration items CI ∈ C_0 are marked. For every potentially affected configuration item node EI ∈ V, all reachable directed edges e = (CI, EI) are connected, and the set of edges is defined as E_3. Thus, the graph G_3 = (V_0, E_3) representing the SCM is generated. We then check whether a backdoor path exists: a backdoor path between nodes X and Y is a path that starts with an arrow pointing into X, for example, X ← Z → Y, where Z is a confounder between X and Y.

Intervention-Based Influence Probability Calculation. Backdoor adjustment is a causality calculation method based on the do operator that eliminates the confounding introduced by backdoor paths; it uses known data distributions to estimate the causal effects of interventions between variables. Given a directed acyclic graph G and a pair of ordered variables X and Y in G, if none of the nodes in a set of variables Z are descendants of X and Z blocks all backdoor paths between X and Y, then Z satisfies the backdoor criterion for (X, Y). Thus, the causal relationship between X and Y can be written as:

P(Y = y | do(X = x)) = Σ_z P(Y = y | X = x, Z = z) P(Z = z)   (9)
A Knowledge Graph-Based Analysis Framework
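The backdoor adjustment of Eq. (9) can be sketched on a toy discrete distribution; the probability tables below are invented for illustration.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """Eq. (9) sketch: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) P(Z=z),
    valid when Z satisfies the backdoor criterion for (X, Y)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

# Toy confounder Z that influences both X and Y.
p_z = {0: 0.5, 1: 0.5}                       # P(Z = z)
p_y1 = {(0, 0): 0.1, (0, 1): 0.7,            # P(Y = 1 | X = x, Z = z)
        (1, 0): 0.2, (1, 1): 0.8}
# P(Y=1 | do(X=1)) = 0.2 * 0.5 + 0.8 * 0.5 = 0.5
```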
Therefore, this can be extended to the problem in this paper. For the set of initial configuration items CI = {CI_1, CI_2, ..., CI_k} ⊆ V_0 and any EI ∈ V_0, the influence probability of any potential multiple factor causal pair <CI, EI> can be calculated as follows:

P_t(EI | CI) = Σ_k P_d(EI | do(CI_k))   (10)
where k ranges over the initial configuration items identified by the change instruction, and P_t(EI | CI) is the influence probability of EI as a contribution value. In particular, if there is no directed edge e ∈ E_2 ending at CI_k, then P_d(EI | do(CI_k)) = P_l(EI | CI_k).

2.6 Output Process
The output process module mainly realizes the updating of the configuration item probabilistic network and the visualization of inference evidence. To achieve model hyperparameter acquisition and build an architecture change impact prediction framework based on continual learning, the following model update strategy is proposed in this paper. With historical <C, E> causal pairs as the update data for model initialization, the influence probability of <C, E_out> is calculated by the above method, where E_out represents any configuration item node that may be affected. To filter out nodes with a low influence probability, the threshold h is set: when P_t < h, the corresponding affected configuration item is ignored. Besides, the loss function L(α, β) is designed as follows. In the iterative training process, the Adam optimizer is used to continuously reduce the loss value and carry out backpropagation, which updates the hyperparameters in the model.
⎨ ⎬ Eembedding − argsoftmax softmax (F (kn , kn+1 )) ⎩ ⎭ Ep H n
(11) 2
where H equals to r = (k0, , k1 , . . . , kn ) ∈ R, k0 = C, kn = Ep , F (kn , kn+1 = α(kn kn+1 )F0 (kn , kn+1 ) + β(kn kn+1 ), Ee mbedding represents the representation vector of the real affected configurational term caused by CI. Ep represents the node of affected configuration term caused by CI, r = (k0 , k1 , ..., kn ) ∈ R represents all connected paths between CI and Ep obtained by algorithm 2, and n represents the length of any path among them. To guarantee the interpretability of apriorism, output process module will calculate the susceptible configuration item and verify the result with the original knowledge graph. In addition, the weights of adjacent edges are used to represent the causal contributions between adjacent configuration items.
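The thresholding step of the update strategy can be sketched as a simple filter; the item names, probabilities, and the value of h below are hypothetical illustration choices.

```python
def filter_affected(p_t, h):
    """Keep only configuration items whose total influence probability
    P_t reaches the threshold h; items with P_t < h are ignored."""
    return {ei: p for ei, p in p_t.items() if p >= h}

# Hypothetical influence probabilities for three candidate items.
scores = {"C": 0.73, "D": 0.62, "E": 0.37}
# filter_affected(scores, 0.5) keeps C and D and drops E
```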
3 Case Study
In this paper, an input change instruction is taken as the demand case, and the process and details of the method are described through the application of the causal pair influence probability calculation framework. The case is presented in three stages: data model preparation, multi-level calculation, and conclusion analysis.

Data Model Preparation. The input of the configuration change influence analysis framework proposed in this paper is divided into two parts: the change instruction and the aircraft configuration knowledge graph. The carrier of the change instruction is natural language, from which the initial configuration items can be identified through entity recognition and intention recognition. As shown in Fig. 3, "304F8337-005-008" and "211P4537-002-003" are the initial configuration items required for this change. Fig. 3 also shows the knowledge graph of the aircraft configuration. It contains configuration items and non-configuration items and imposes no additional requirements on the specific ontology.
Fig. 3. Case of aircraft configuration knowledge graph
Multi-level Calculation. We mark the initial configuration items in Fig. 3 as A and B and the remaining potentially affected configuration items as C–F. Using the DFS algorithm, the set of single factor causal pairs for this case can be constructed: {<A, B>, <A, C>, <A, D>, <A, E>, <B, C>, <B, D>, <B, E>}. The above operations are performed on all single factor causal pairs, and the weight of the obtained P_l is assigned to the corresponding edge in E_3, so that the single factor causal pair influence probability network G_3 = (V_0, E_3) is obtained, as shown in Fig. 4(b).
Fig. 4. Multi-level influence probability network calculation steps
Since multiple initial configuration items may be triggered by a change instruction, the influence probability of multiple initial configuration items on a potentially affected configuration item is causally complex. For example, the effects of A and B on the other potentially affected configuration items in this example should be considered as a whole. Therefore, the influence probability graph of a change instruction can be obtained through the following two steps: (1) Regard G_3 as the SCM and check whether there is a backdoor path. If so, intervene on the initial configuration items, as shown in Fig. 4(c). (2) According to the calculation formula in Sect. 2.5, the influence probabilities of multiple causal pairs pointing to the same configuration item are added up after the intervention to obtain the influence probability graph of the change instruction, as shown in Fig. 4(d).

Conclusion Analysis. Finally, we can calculate the influence probability P(EI | CI) of the potential configuration items, as shown in Table 1. Assuming k = 2, that is, the top(2) influence range is evaluated, A and B are included as the initial configuration items, and C and D are included as the configuration items affected by secondary propagation. At the same time, each selected conclusion can support its reliability with the influence probability network subgraphs from each stage of the process.
Y. Wang et al.

Table 1. The results of susceptible configuration items

EI            C     D     E
P(EI | CI)    0.73  0.62  0.37
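The top(k) evaluation described above can be sketched directly from the Table 1 values; the helper `top_k` is an illustrative assumption, not the paper's code.

```python
# Sketch: pick the k configuration items with the highest P(EI | CI).

def top_k(influence, k):
    """Return the k items with the highest influence probability."""
    return [item for item, _ in
            sorted(influence.items(), key=lambda kv: -kv[1])[:k]]

# Influence probabilities from Table 1.
influence = {"C": 0.73, "D": 0.62, "E": 0.37}
print(top_k(influence, 2))  # ['C', 'D']
```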
Figure 5 shows why D is included in the scope of influence. Figure 5(a) is a subgraph of G0, showing the reachable paths from A and B to D on G0. Figure 5(b) is a subgraph of G2, illustrating the probability transfer contribution from A and B to D.
Fig. 5. Case of change propagation evidence graph
4 Discussion
We discuss the proposed method by comparing it with two related methods; the results are summarized in Table 2. Literature [16] shares the objective of this paper: to assess the impact of aircraft configuration changes, it uses axial search over a product change matrix card to assist qualitative reasoning about configuration change. That method is only applicable to a single initial configuration item. Literature [17] assesses the impact of diesel engine engineering changes: relying on a change impact matrix evaluated by experts, it computes important nodes by constructing multiple networks and uses a susceptible-infected-susceptible model for template matching, achieving a quantitative assessment of the propagation impact of complex product engineering changes. Integrity. Compared with these works, the proposed method is built on a domain knowledge graph constructed from historical cases combined with expert experience (ontology); sampling is sufficient and potential data associations are retained, so the inference results are relatively complete. Accuracy. Our method is oriented to the actual situation, models information under multiple disturbances, and adopts a neural network; the conclusions obtained by the model after sufficient training are highly accurate.
Table 2. Comparison with other related methods

                  Literature [16]                Literature [17]                Our Approach
Goal              Aircraft configuration         Diesel engine change           Aircraft configuration
                  change analysis                analysis                       change analysis
Data              Configuration component        Expert evaluation matrix       Configuration knowledge graph
                  document
Initial CI        Single                         Single                         Multiple
Calculation Mode  Qualitative calculation        Quantitative calculation       Quantitative calculation
Reasoning Model   Score matrix-based             Multiple network-based         Multi-level network-based
                  qualitative logical reasoning  node importance assessment     influence probability calculation
Integrity         Middle                         Strong                         Weak
Accuracy          Strong                         Weak                           Middle
Scalability       Middle                         Weak                           Strong
Interpretability  Strong                         Ordinary                       Middle
Scalability. The proposed method places low requirements on the configuration knowledge graph, which only needs to include configuration and non-configuration items, and it accepts change instructions carried in natural language, so it is highly scalable. Interpretability. Through multi-level logical decomposition, the causal contribution values are explicitly expressed in the network diagram, giving strong explanatory ability.
5 Conclusion
For more realistic change orders, we propose a knowledge graph-based analysis framework for aircraft configuration change propagation. First, an influence probability network (IPN) based on graph representation is constructed. On top of the IPN, we propose a two-level causal pair computing algorithm to handle the simultaneous disturbance of multiple factors. The case study demonstrates that the framework achieves integrity, accuracy, scalability, and interpretability. In the future, we will focus on expanding prior knowledge, consider configuration mechanical constraints, and combine a broader open knowledge domain to study causal reasoning techniques based on data and knowledge.
References

1. Zhenhua, S.: Research on configuration management for aircraft design. Intell. Manuf. 04, 50–53 (2020)
2. Chen, X., Jia, S., Xiang, Y.: A review: knowledge reasoning over knowledge graph. Expert Syst. Appl. 141, 112948 (2020)
3. Zhongwei, G., Rong, M., Haicheng, Y., et al.: Engineering change based on product development network hub node. Comput. Integr. Manuf. Syst. 18(1), 40–46 (2012)
4. Zhang, N., Yang, Y., Wang, J., et al.: Identifying core parts in complex mechanical product for change management and sustainable design. Sustainability 10(12), 4480 (2018)
5. Xi, Y., Yimin, D., Peng, Y.: Design change propagation process and characteristics analysis of variable function machinery based on FBS. J. Eng. Des. 23(1), 8–13 (2016)
6. Yupeng, L., Xiaochun, W., Xiaolin, L.: Impact assessment of complex product design changes based on BBV network model. Comput. Integr. Manuf. Syst. 7, 1429–1438 (2017)
7. Hamraz, B., Caldwell, N.H.M., Ridgman, T.W., et al.: FBS linkage ontology and technique to support engineering change management. Res. Eng. Design 26(1), 3–35 (2015)
8. Lu, G., Zhang, L., Jin, M., Li, P., Huang, X.: Entity alignment via knowledge embedding and type matching constraints for knowledge graph inference. J. Amb. Intell. Hum. Comput. 13, 5199–5209 (2021)
9. Chao, L., Wang, T., Chu, W.: PIE: a parameter and inference efficient solution for large scale knowledge graph embedding reasoning. arXiv preprint arXiv:2204.13957 (2022)
10. Cheng, K., Yang, Z., Zhang, M., et al.: UniKER: a unified framework for combining embedding and definite horn rule reasoning for knowledge graph inference. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9753–9771 (2021)
11. Chen, Y.: Convolutional Neural Network for Sentence Classification. University of Waterloo (2015)
12. Chen, T., Xu, R., He, Y., et al.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst. Appl. 72, 221–230 (2017)
13. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
14. Mor, B., Garhwal, S., Kumar, A.: A systematic review of hidden Markov models and their applications. Arch. Comput. Methods Eng. 28(3), 1429–1448 (2021)
15. Pearl, J., Mackenzie, D.: The Book of Why: The New Science of Cause and Effect. Basic Books (2018)
16. Zepeng, S.: Research on change impact assessment process based on CM2. Mech. Eng. 4, 110–112 (2020)
17. Congdong, L., Zhiwei, Z., Cejun, C., et al.: Impact assessment of engineering change propagation for complex products based on multiple networks. J. Comput. Appl. 40(4), 1215 (2020)
Node-IBD: A Dynamic Isolation Optimization Algorithm for Infection Prevention and Control Based on Influence Diffusion

Songjian Zhou, Zheng Zhang, Ziqiang Wu, Hao Cheng, Shuo Wang, Sheng Bi, and Hao Liao(B)

Shenzhen University, Shenzhen, China
[email protected]
Abstract. In the prevention and control of epidemics, isolation has always been an important means of curbing the spread of infection. Isolation targets not only confirmed patients but also their close contacts, sub-close contacts, and other at-risk groups. It is clearly impractical to isolate everyone at risk. In this paper, we propose Node-IBD, an isolation optimization algorithm that maximizes influence blocking, aiming to isolate a certain percentage of close contacts or sub-close contacts so as to maximize the prevention effect and curb the spread of the epidemic. The possibility of further spread can be minimized even when potentially infected persons in the risk population cannot be identified. We demonstrate the feasibility and effectiveness of the isolation algorithm through experiments on static and dynamic contact networks, and expect it to provide a useful strategy for future epidemic prevention.

Keywords: epidemic model · adaptive control · influence diffusion · isolation optimization algorithm

1 Introduction
From ancient times to the present, infectious diseases have accompanied the development of human society; human history can be read as a history of infectious diseases [1]. Throughout human history, infectious diseases have always been with us and may return at any time. Therefore, developing a scientific prevention and control system grounded in infectious disease research helps to formulate prevention and control strategies in a timely manner after an outbreak. During the COVID-19 epidemic, there have been many studies on the effectiveness of prevention and control measures, including social distancing [2], city lockdowns [3], and vaccination [4], all analyzed from the perspective of transmission on complex networks. The spread, prevention, and control of infectious diseases is thus a current research focus of complex networks

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 555–569, 2023.
https://doi.org/10.1007/978-981-99-2385-4_42
[5], which has significant value for the human response to epidemics and bears on the development, and even the continuation, of human societies. Research on the spread of infectious diseases can be done by modelling infectious disease dynamics [6] and discovering transmission patterns through propagation phenomena in networks. This research was initially based on population-level SIR models, which divide the population into different states and represent the transmission process as transitions between states. The small-world network [7] published in Nature and the scale-free network [8] in Science greatly advanced the understanding of real-world networks in the field of complex networks. Studying disease transmission on these two networks is closer to how epidemics spread in human societies, which led to infectious disease research based on transmission networks. The COVID-19 epidemic that swept the world over the last three years further promoted research on infectious diseases in complex networks: scholars proposed infectious disease models for the COVID-19 virus [9,10] and used real epidemic data, such as cases [11], mobile populations, and prevention measures [3], to study the factors influencing the spread of infectious diseases. In recent years, China has developed very efficient isolation measures for epidemic prevention and control, which is one of the reasons the 'zero-COVID' policy in China has succeeded. Although isolation strategies based on isolating close contacts and sub-close contacts of confirmed patients are simple and efficient, blanket isolation carries a huge economic cost. Traditional approaches tend to be based on the SIR model and do not take into account the maximization of influence blocking in the contact network.
Therefore, we hope to minimize the spread of the outbreak by means of influence blocking. In this paper, we design Node-IBD (Node Influence Block Degree), an isolation algorithm that minimizes network propagation force, based on the isolation strategy for close contacts and sub-close contacts of confirmed patients adopted over the last three years of the COVID-19 epidemic. In a given contact network, a limited number of nodes (i.e., individuals) are selected for isolation so that the overall network propagation force is minimized and influence blocking is maximized, thereby minimizing the spread of the epidemic. The algorithm represents the network propagation force through the adjacency matrix and, via matrix operations, selects and deletes the optimal set of nodes to minimize the propagation force. We conduct experiments on static artificial networks, static real networks, and dynamic artificial networks to verify the effectiveness of the proposed isolation algorithm. The algorithm has significance and scientific value for prevention and control efforts addressing the COVID-19 epidemic.
2 Related Work
In this paper, in order to simulate and solve the real-world propagation inhibition problem, we start from a model of human mobility behaviour and treat the epidemic propagation problem as a class of influence propagation problems.
2.1 Influence Blocking Maximization
In the research on network transmission, an important direction is the spread of influence. How to control and exploit influence is of great research value in many fields; in particular, the blocking of influence has great potential in the isolation work of epidemic prevention and control. The optimization problem of influence blocking maximization [12] was originally proposed by He et al. and was mainly applied to competitive networks for rumour containment. Subsequently, many other solutions have been proposed. Zhu et al. [13] showed that location information affects influence blocking and designed a seed set selection method based on location information; Nguyen and Yan [14] combined a greedy algorithm with the association structure in the network and designed the GVS algorithm, which selects a minimum number of high-influence nodes to maximize the information propagation of competing entities; Wu et al. [15] proposed the extension problem of influence blocking maximization and designed two models, MCICM and COICM, to describe it; Budak and Agrawal [16] proposed heuristic methods to limit the propagation of misinformation to the maximum extent; and Taninmis, Aras and Altinel [17] proposed a heuristic algorithm based on greedy and mathematical programming, treating influence blocking maximization as competing parties countering propagation via sequential decision making.

2.2 Human Mobility Behavior Model
Studying the principles behind human movement behaviour has long been a popular topic in the field of complex networks, and in infectious disease prevention, understanding human spatial movement behaviour is crucial: it helps us understand human contact behaviour and provides important information for studying disease transmission in time and space. With the rapid development of communication technology in the 21st century, large-scale human activity data captured by GPS [18,19] brought a breakthrough in recognizing and understanding human movement behaviour: from early random-walk models, such as Brownian motion [20] and Lévy-flight models [21]; to models that incorporate human behavioural mechanisms, such as gravity models [22] and Individual Mobility Models (IMMs) [23]; to models that consider the constraints of geographic space on human behaviour, such as mobility models based on hierarchical transportation networks [24] and on the geographic location and shape of cities [25]. The most mainstream research revolves around human behavioural mechanisms, which can usually be summarized in two points: 1) social interaction mechanisms: one person's mobility affects others, and people tend to follow the general flow and move toward crowded places, reflecting human aggregation [26]; 2) long-term memory mechanisms: a person's mobility is influenced by his historical mobility, and people tend to return to places they have been, such as the workplace and the place of residence, reflecting human habituation [27]. The Collaborative Mobility Model (CMM) was proposed by Xu [28], who fused the gravity model and the IMM to simulate human mobility over time, incorporating both a social interaction mechanism and a long-term memory mechanism. The CMM does not rely on externally pre-input contextual information and can realize the social interaction mechanism by using population density. CMM likewise classifies human mobility behaviour into return and exploration. The probability P_ret that an individual decides to return and the probability P_exp that he decides to explore can be expressed, respectively, as:

P_ret = 1 − δN^(−γ)    (1)

P_exp = δN^(−γ)    (2)

where δ controls the initial probability of exploration, N is the number of places ever visited, and γ is the degree to which the number of visited places affects the exploration probability. As a person visits more places, the probability of exploring new places decreases and the probability of returning to old places increases. This is consistent with the rationale that new places are limited: once a person has visited enough places, there are fewer new places left to explore, and the probability of returning to old places naturally increases.
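Equations (1) and (2) can be sketched directly in code. The parameter values δ = 0.6 and γ = 0.2 are assumptions for illustration; the paper does not fix them here.

```python
# Sketch of CMM's return/explore split (Eqs. 1-2). delta and gamma
# defaults are illustrative assumptions.

def explore_probability(n_visited, delta=0.6, gamma=0.2):
    """P_exp = delta * N^(-gamma): chance of exploring a new place."""
    return delta * n_visited ** (-gamma)

def return_probability(n_visited, delta=0.6, gamma=0.2):
    """P_ret = 1 - delta * N^(-gamma): chance of returning to a known place."""
    return 1.0 - explore_probability(n_visited, delta, gamma)

# As visited places accumulate, exploration fades and returning dominates.
for n in (1, 10, 100):
    print(n, round(return_probability(n), 3))
```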
3 Model
Isolating confirmed patients and their associated close contacts, sub-close contacts, and other risk groups is a very important part of outbreak prevention and control. For confirmed patients, isolation and treatment are inevitable; for close contacts and sub-close contacts, isolation and observation serve mainly for prevention. There is no way to detect an infected person immediately, so the virus may well have spread before a diagnosis is confirmed. The first targets of transmission are the close contacts of confirmed patients, who are most likely to be infected; assuming some of them are infected, their close contacts, i.e., the sub-close contacts of the confirmed patients, are the next high-risk group. To avoid further spread of the virus, when a confirmed patient is found, we usually investigate their movement trajectory, identify their close contacts, and even their indirect sub-close contacts, and isolate them until their diagnoses are available. China has done this very well, which is an important reason why China has been able to effectively control the COVID-19 outbreak.
The number of close contacts of confirmed patients is not small, especially for COVID-19, whose longer infectious cycle means multiple days of close contacts must be considered. In certain occupations the number of contacts is particularly high, such as delivery workers, couriers, and airport staff; if we also count their sub-close contacts, the number is even more alarming. Isolating all close contacts and sub-close contacts would be a huge drain and is often infeasible. We therefore define the problem more rigorously: in the contact network, the initial propagation source and the close contacts must be isolated, while among the sub-close contacts only p% can be selected for isolation, and the selected isolation set should minimize the final propagation force. Designing an isolation algorithm that isolates only a certain proportion of the risk population while minimizing the spread of the virus would provide a novel and efficient option for epidemic control.

3.1 Epidemic Spread Model
Given a static contact network G = (V, E), where V represents the set of individuals and E the set of contact relationships between them, suppose there is an individual in V diagnosed as infected (red). We isolate all of his first-order neighbors, i.e., close contacts (blue); among the second-order neighbors, i.e., sub-close contacts (purple), we select p% to isolate, as shown in Fig. 1.
Fig. 1. Expression and treatment of close contacts and sub-close contacts
Our goal is to minimize the spread of the epidemic through isolation at a certain cost. Denote the given cost by p, the proportion of sub-close contacts that can be selected; denote by F(S) the propagation force of the network after isolation, where S is the set of objects selected for isolation. We
achieve the goal of controlling the epidemic by minimizing the transmission force F(S) of the network. A question may arise here: close contacts have been isolated, so how can the confirmed patient still spread to other individuals in the network? In fact, the subsequent source of infection is not the confirmed patient. In a real epidemic, confirmed patients are controlled more tightly than close contacts and sub-close contacts. We isolate close contacts and sub-close contacts not because we worry that confirmed patients will infect them, but because new infected people may appear among them and infect other individuals in the network. Therefore, we assume that new infected persons appear randomly among the sub-close contacts and observe their transmission.

3.2 Method

Fig. 2. Influence blocking on network structure
In the isolation problem, the virus is the original spreading information, but there is no competing information propagating against it; even a vaccine cannot play such a role, so competitive propagation models cannot be used. Instead, we block influence by changing the network structure. As shown in Fig. 2, isolating nodes 2 and 8 means the original propagation can no longer reach them, and subsequent propagation can no longer reach nodes 5 and 7. Clearly, our isolation problem cannot be approached via a competitive propagation model. Referring to research on how network structure affects propagation [29,30], for a propagation network the principal eigenvalue λ (i.e., the eigenvalue with the largest modulus) of its adjacency matrix is the decisive factor of the network's propagation threshold. The propagation threshold corresponds, in epidemic transmission, to the critical value for an outbreak. The smaller
the principal eigenvalue λ, the larger the propagation threshold and the slower the propagation. Based on this relationship, we can transform the minimization of network propagation force in the isolation problem into the minimization of the principal eigenvalue λ of the adjacency matrix, and obtain the mathematical definition of the isolation problem, expressed as follows.

Problem Definition: Given a contact network G = (V, E), there is a node v0 ∈ V that is the initial propagation source, whose first-order neighbors form the node set V1 and whose second-order neighbors form the node set V2. The set of all nodes to be isolated is denoted by S, which must contain the initial propagation source v0 and the first-order neighbor set V1, i.e., S ⊇ {v0} ∪ V1. Given a budget p denoting the proportion of second-order neighbors that can be picked, the number of nodes that can be picked among the second-order neighbors is p|V2|. For the initial contact network G, the principal eigenvalue of its adjacency matrix is denoted by λ; after removing the node set S (including the nodes and all incident edges), the principal eigenvalue of the adjacency matrix of the new contact network G̃ is denoted by λ̃. The influence blocking degree of this node set can then be denoted as Δλ = λ − λ̃. The optimization goal is to find a set S such that the principal eigenvalue λ̃ after deletion is minimized; in other words, to maximize the total influence blocking degree Δλ. For a node v ∈ S, its influence blocking degree is denoted by F(v), so that:

Δλ = argmax_{|S| ≤ 1 + |V1| + p|V2|} Σ_{v∈S} F(v)    (3)
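The quantity Δλ = λ − λ̃ can be computed directly on a toy network. The 6-node contact graph below is an illustrative assumption; the point is only that removing a node set strictly lowers the principal eigenvalue of the adjacency matrix.

```python
# Sketch: influence blocking degree of a node set as the drop in the
# principal (largest-modulus) eigenvalue. Graph is assumed for illustration.
import numpy as np

def principal_eigenvalue(adj):
    """Largest-modulus eigenvalue of an adjacency matrix."""
    return float(max(abs(np.linalg.eigvals(adj))))

def remove_nodes(adj, nodes):
    """Delete the rows/columns of the isolated node set S."""
    keep = [i for i in range(len(adj)) if i not in nodes]
    return adj[np.ix_(keep, keep)]

# Toy symmetric contact network: a triangle 0-1-2 with a tail 2-3-4-5.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1

lam = principal_eigenvalue(A)
lam_tilde = principal_eigenvalue(remove_nodes(A, {2}))
print(round(lam - lam_tilde, 3))  # influence blocking degree of node 2
```

Removing the articulation node 2 leaves only a single edge and a 3-node path, so λ̃ drops to √2, illustrating how node removal raises the propagation threshold.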
However, how to express the influence blocking degree of nodes and how to resolve the possible overlap between nodes are prerequisites for selection. The problem can therefore be considered from two aspects: optimal selection on connected edges and optimal selection on nodes.

Optimal Selection on Connected Edges. To represent the influence blocking degree of a node, we can first consider its connected edges. If F(e) denotes the influence blocking degree of edge e, then the influence blocking degree of a node can be expressed as:

F(v_i) = Σ_{e_{i,j}∈E} F(e_{i,j})    (4)

For node v_i, the sum of the influence blocking degrees of all its incident edges e_{i,j} is the node's influence blocking degree. The question now is how to express the influence blocking degree of an edge. Suppose we use the change in the principal eigenvalue before and after deleting edge e:

F(e) = λ(G) − λ(G − e)    (5)
We find that the change in the principal eigenvalue after deleting a set of edges is not equal to the sum of the changes from deleting each edge individually. This indicates that the influence blocking degree of a set of edges cannot be expressed by summing the per-edge changes in the principal eigenvalue; Eq. 5 is not simply additive. A representation of the influence blocking degree of edges is proposed in the literature [31], which reflects an edge's influence on the principal eigenvalue indirectly through the product of eigenvector entries of the adjacency matrix. The result approximates the actual influence on the principal eigenvalue and, most importantly, achieves decoupling between different edges: the influence of a set of edges selected in this way on the principal eigenvalue can be represented by simple accumulation, and at the same time the per-edge value can serve as that edge's influence blocking degree for selection from largest to smallest. The influence blocking degree of an edge can be expressed as:

F(e_{i,j}) = u(i)v(j)    (6)

On the one hand, this corresponds to the influence on the principal eigenvalue after each edge is deleted, so it is convenient to select edges from largest to smallest; on the other hand, decoupling between different edges is realized, so the sum of influence blocking degrees can be obtained by simple accumulation.

Optimal Selection on Nodes. With the edge-level expression obtained above and according to Eq. 4, the influence blocking degree of a node can be expressed as:

F(v_i) = Σ_{e_{i,j}∈E} u(i)v(j)    (7)
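Equation (6) can be sketched for an undirected network, where the left and right principal eigenvectors coincide. The star-plus-pendant graph below is an illustrative assumption; it shows that edges incident to well-connected nodes receive higher blocking scores than peripheral edges.

```python
# Sketch of Eq. 6: score each edge by the product of principal-eigenvector
# entries u(i)*u(j). Graph is assumed for illustration.
import numpy as np

def edge_blocking_scores(adj):
    """Approximate each edge's effect on the principal eigenvalue via
    eigenvector-entry products (symmetric case, so u = v)."""
    vals, vecs = np.linalg.eigh(adj)
    u = np.abs(vecs[:, np.argmax(vals)])  # Perron (principal) eigenvector
    n = len(adj)
    return {(i, j): float(u[i] * u[j])
            for i in range(n) for j in range(i + 1, n) if adj[i, j]}

# Star with hub 0 (leaves 1, 2, 3) plus a pendant edge 3-4.
A = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (0, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

scores = edge_blocking_scores(A)
# Edges touching the hub outrank the peripheral edge (3, 4).
```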
However, the influence blocking degrees of different nodes overlap. For a pair of adjacent nodes v_i and v_j sharing an edge e_{i,j}, F(e_{i,j}) is accumulated twice when their influence blocking degrees are calculated. Therefore, when selecting the seed node set, the influence blocking degree of the node set cannot be obtained by simple summation. It can be proved that the influence blocking degree function is submodular, so the optimization can rely on the maximization of monotone submodular functions: Kempe et al. [32] proved that a greedy algorithm attains an approximately optimal solution with approximation ratio 1 − 1/e. Based on this, we design Algorithm 1 to solve the optimal node selection problem.
Algorithm 1: Greedy(F, p): select the node set with the greatest influence blocking degree

Input: node influence blocking degree function F, budget p
Output: node set S with the largest influence blocking degree
1: S ← {v0} ∪ V1
2: for 1 to p|V2| do
3:     u ← argmax_{v ∈ V2 \ S} F(S ∪ {v})
4:     S ← S ∪ {u}
5: end for
6: return S
For the initial infection source v0 and the set V1 of all its first-order neighbors, all are placed into the isolation node set. From the set of second-order neighbors V2, we select p|V2| nodes for isolation according to the greedy algorithm. By the principle of influence blocking maximization, deleting the selected node set drives the network's propagation force to its lowest point under that cost. We verify the effect of the algorithm by experiment below.
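The greedy selection above can be sketched as follows. For clarity this sketch scores each candidate by exactly recomputing the residual principal eigenvalue, whereas the paper scores nodes via the eigenvector products of Eq. 7 for efficiency; the toy contact network is an assumption.

```python
# Greedy sketch of Algorithm 1 (exact eigenvalue recomputation, small graphs).
import numpy as np

def residual_eigenvalue(adj, isolated):
    """Principal eigenvalue of the graph after deleting the isolated set."""
    keep = [i for i in range(len(adj)) if i not in isolated]
    if not keep:
        return 0.0
    sub = adj[np.ix_(keep, keep)]
    return float(max(abs(np.linalg.eigvals(sub))))

def greedy_node_ibd(adj, v0, p):
    """Isolate v0 and all first-order neighbors V1, then greedily add the
    p|V2| second-order neighbors whose removal lowers the residual
    principal eigenvalue the most."""
    n = len(adj)
    v1 = {j for j in range(n) if adj[v0, j]}
    v2 = {k for j in v1 for k in range(n) if adj[j, k]} - v1 - {v0}
    S = {v0} | v1
    for _ in range(int(p * len(v2))):
        best = min(v2 - S, key=lambda v: residual_eigenvalue(adj, S | {v}))
        S.add(best)
    return S

# Toy contact network (assumed): v0 = 0, V1 = {1, 2}, V2 = {3, 4}.
A = np.zeros((7, 7))
for i, j in [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]:
    A[i, j] = A[j, i] = 1

S = greedy_node_ibd(A, v0=0, p=0.5)
```

With p = 0.5 the budget is int(0.5 · |V2|) = 1, so the result contains {0, 1, 2} plus one of the two sub-close contacts.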
4 Experiments

4.1 Experiment Setting
To verify the proposed isolation optimization algorithm, we conduct extensive experiments on static artificial networks, static real networks, and a dynamic artificial network.

Static Artificial Networks. For static artificial networks we mainly use the small-world network (WS) and the scale-free network (BA). In the simulation experiments, for both artificial networks the node count N is uniformly set to 1000 and the average node degree to 24. The basic properties of each network are shown in Table 1. Both are classical artificial networks in network communication studies and can represent contact networks here. To show the structural features more clearly, we illustrate these two network types with 50 nodes in Fig. 3.

Table 1. Properties of the networks

Dataset        nodes  edges   average degree
WS             1000   12000   24.000
BA             1000   11856   23.712
Ego-Facebook   4039   88234   43.691
Musae-Twitch   9498   153138  32.246
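The artificial networks in Table 1 can be reproduced from their stated parameters, e.g. with NetworkX; the WS rewiring probability is an assumption, since the paper does not state it. Note that a BA network grown with m = 12 attachments per node yields exactly m(n − m) = 11856 edges, matching Table 1.

```python
# Sketch: regenerate the WS/BA networks of Table 1. The WS rewiring
# probability (0.1) and the seeds are illustrative assumptions.
import networkx as nx

ws = nx.watts_strogatz_graph(n=1000, k=24, p=0.1, seed=42)
ba = nx.barabasi_albert_graph(n=1000, m=12, seed=42)  # ~24 average degree

print(ws.number_of_edges())  # 12000
print(ba.number_of_edges())  # 11856, as in Table 1
```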
Fig. 3. Structure diagram of artificial networks
Static Real Networks. For static real networks we choose two social networks: Ego-Facebook, an egocentric network built from Facebook users' friend lists and then stitched together, and Musae-Twitch, a network of user relationships on the game-streaming platform Twitch. Table 1 shows the basic information of these two real social networks.

Dynamic Artificial Network. To better simulate real epidemic spread, we also generated a set of dynamic contact networks based on the human mobility behavior model proposed in [28]. That model was proposed to study the development and evolution of cities: the CMM accounts for two major factors, social interaction and historical mobility behavior, and can reproduce realistic urban population density distributions, with a dense center and distinct fractal patterns at the edges. Given the model's scientific grounding, we use it to construct the required dynamic contact network. CMM simulates human movement in an n × n grid. At the beginning, m individuals are all placed in the central cell (N/2, N/2); they then move according to the model and gradually disperse to each cell. After several rounds of iteration, the system reaches a relatively stable state. We selected 21 stable iterations to build the dynamic contact network. Individuals in the same cell are considered close contacts, and a complete graph is constructed among them, as shown in Fig. 4. We need not consider how the per-cell cliques are connected, because in the next iteration each individual may move to a different cell and form connections with other individuals.

In our experiments, we judge the effectiveness of the algorithm in isolation work from two aspects, network properties and the simulated propagation process:

1. The change Δλ in the principal eigenvalue of the isolated network, which measures the change in propagation force from the network's properties.
Fig. 4. From population heatmap to contact network
2. The change curve of the infected ratio I/N, which reflects the effectiveness of isolation from the simulated epidemic transmission.

Three baseline selection methods were used for comparison: random selection (Rand), which randomly selects a proportion p of close contacts or sub-close contacts for isolation; degree-based selection (Degree), which selects a proportion p in descending order of node degree; and betweenness-based selection (Betweenness), which selects a proportion p in descending order of node betweenness. Our proposed isolation optimization algorithm is based on the influence blocking degree of nodes and selects the node set maximizing the sum of influence blocking degrees; we therefore call it Node-IBD.

4.2 Experiment Results
In the two kinds of static artificial networks, we observed the development of the epidemic over the next 14 days under no isolation (No Action) and under isolation of 10% of sub-close contacts selected in the four ways described above. Since the selection of potentially infected persons is random, a single run may not fully reflect the effect of isolation; our results are therefore the mean of 100 runs. The change in the proportion of infected persons is shown in Fig. 5. The results on the static generated networks show that our isolation optimization algorithm Node-IBD is superior to the other methods in both the small-world network (WS) and the scale-free network (BA). It is worth noting that in the BA network, partial isolation of sub-close contacts is outstandingly effective, and its containment of the epidemic is most significant compared with non-isolation. This is because in the BA network the node degree follows a power-law distribution and there are nodes with significant influence; isolating these nodes plays a very important role in controlling the spread of the epidemic.

566

S. Zhou et al.

Fig. 5. The change of the proportion of infected persons in the static artificial networks

Fig. 6. The change of the proportion of infected persons in the static real networks

In the static real networks, our isolation algorithm still performs best. The improvement is less pronounced in the Musae-Twitch network, mainly because all three isolation methods, Degree, Betweenness, and Node-IBD, reduce the proportion of infected individuals to less than 2/10,000, a state in which the epidemic is almost receding. This is why we only observe the epidemic development over the next 7 days in the Musae-Twitch network. The change in the proportion of infected is shown in Fig. 6.

In the dynamic artificial network, we observed the development of the epidemic over the next 14 days under no isolation and under isolation of 10% of close contacts selected in the four different ways. The change in the proportion of infected persons is shown in Fig. 7.
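As a toy illustration of the simulated-propagation measurements above (the I/N curve under a given selection strategy), an SI-style spread with degree-based isolation of close contacts might look like the following. The network, infection rate, and helper names are hypothetical, not the authors' experimental setup:

```python
import random

def top_degree(adj, candidates, p):
    """Degree-based selection: isolate the top proportion p of candidates by degree."""
    k = max(1, int(len(candidates) * p))
    return sorted(candidates, key=lambda v: len(adj[v]), reverse=True)[:k]

def si_curve(adj, seeds, isolated, beta=0.3, days=14, seed=7):
    """Simulate SI spread, skipping isolated nodes; return the daily I/N ratio."""
    rng = random.Random(seed)
    infected = set(seeds) - set(isolated)
    curve = []
    for _ in range(days):
        new = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and v not in isolated and rng.random() < beta:
                    new.add(v)
        infected |= new
        curve.append(len(infected) / len(adj))
    return curve

# Toy contact network: node 0 is an infected hub, nodes 1-5 are its close contacts.
adj = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}, 5: {0}}
isolated = top_degree(adj, [1, 2, 3, 4, 5], p=0.2)
curve = si_curve(adj, seeds={0}, isolated=isolated)
```

Averaging such curves over many seeded runs, as done in the experiments, smooths out the randomness of individual transmissions.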
Fig. 7. The change of the proportion of infected persons in the dynamic artificial network
The experimental results show that our proposed isolation algorithm Node-IBD is optimal. Although the above graph shows it only slightly ahead of the degree-based isolation algorithm Degree and the betweenness-based isolation algorithm Betweenness, this proportional difference (0.15%) corresponds to about 1.5 fewer infections in a network of 1000 people at day 14, which is significant for a subsequent epidemic that is still on the rise. Admittedly, our method itself does not take into account whom individuals are exposed to later in the dynamic network; however, given the habitual nature of human mobility, a large proportion of those exposed in the first 7 days have a high probability of being exposed again in the following 14 days. It is therefore reasonable for our method to construct the contact network for node selection from the first 7 days.
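The Δλ metric used above, the drop in the contact network's principal eigenvalue after isolating a node set, can be approximated with plain power iteration. This is an illustrative sketch, not the authors' code; the shift by the identity matrix is only there to avoid oscillation on bipartite graphs:

```python
def principal_eigenvalue(a, iters=300):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix via power
    iteration on A + I (the +1 shift is removed before returning)."""
    n = len(a)
    x = [1.0] * n
    norm = 1.0
    for _ in range(iters):
        y = [x[i] + sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(y)
        x = [v / norm for v in y]
    return norm - 1.0

def isolate(a, nodes):
    """Return a copy of the adjacency matrix with the given nodes cut off."""
    b = [row[:] for row in a]
    for u in nodes:
        for j in range(len(b)):
            b[u][j] = b[j][u] = 0
    return b

# Star network: hub 0 with four leaves; its principal eigenvalue is 2.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
delta = principal_eigenvalue(star) - principal_eigenvalue(isolate(star, [0]))
```

Here isolating the hub drops λ from 2 to 0, which is exactly why influential nodes are the most valuable isolation targets.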
5 Conclusion
Aiming at the isolation of at-risk populations, one of the keys to epidemic prevention and control during a pandemic, we propose an isolation optimization algorithm, Node-IBD, which considers the maximization of influence blocking. With convenient transportation, confirmed patients have large numbers of close contacts and sub-close contacts, and it is not practical to isolate them all. Our algorithm minimizes the spread of the epidemic by isolating a certain proportion of close contacts or sub-close contacts. The method starts from the adjacency matrix of the contact network: by reducing the principal eigenvalue of the matrix, the network propagation force is minimized and the influence blocking is maximized. We have verified the algorithm on static and dynamic contact networks, and the experimental results confirm its feasibility and effectiveness. This study is expected to serve as a reference and guide for isolation work in epidemic prevention and control.

In the future, research on the transmission model needs to focus more on integration with the actual situation at the community level. Through continuous revision of an adaptive epidemic prevention and control model, it can be better applied in urban epidemic prevention and control work. As for the isolation of at-risk populations, the isolation optimization algorithm proposed in this paper still has much room for improvement on real contact networks, especially dynamic ones; predicting the future contact population could provide more meaningful help in selecting isolation targets.

Acknowledgment. This work was supported by the NSFC under Grant No. 62276171, the Natural Science Foundation of Guangdong Province of China under Grant Nos. 2019A1515011173 and 2019A1515011064, the Shenzhen Fundamental Research General Project under Grant No. JCYJ20190808162601658, the CCF-Baidu Open Fund, and the NSF-SZU and Tencent-SZU funds.
References 1. Leventhal, G.E., Hill, A.L., Nowak, M.A., et al.: Evolution and emergence of infectious diseases in theoretical and real-world networks. Nat. Commun. 6, 6101 (2015) 2. Du, Z., Xu, X., Wang, L., et al.: Effects of proactive social distancing on COVID-19 outbreaks in 58 cities, China. Emerg. Infect. Dis. 26(9), 2267 (2020) 3. Tian, H., Liu, Y., Li, Y., et al.: An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China. Science 368(6491), 638–642 (2020) 4. Han, S., Cai, J., Yang, J., et al.: Time-varying optimization of COVID-19 vaccine prioritization in the context of limited vaccination capacity. Nat. commun. 12(1), 1–10 (2021) 5. Riley, S.: Large-scale spatial-transmission models of infectious disease. Science 316(5829), 1298–1301 (2007) 6. Kermack, W.O., McKendrick, A.G.: A contribution to the mathematical theory of epidemics. In: Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, vol. 115, pp. 700–721 (1927) 7. Watts, D.J., Strogatz, S.H.: Collective dynamics of “small-world” networks. Nature 393(6684), 440–442 (1998) 8. Barabasi, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999) 9. Zhang, W., Zhao, W.G.W., Wu, D., et al.: Predicting COVID-19 trends in Canada: a tale of four models. Cogn. Comput. Syst. 2(3), 112–118 (2020) 10. Achterberg, M.A., Prasse, B., Ma, L., et al.: Comparing the accuracy of several network-based COVID-19 prediction algorithms. Int. J. Forecast. (2020) 11. Ghamizi, S., Rwemalika, R., Cordy, M., et al.: Data-driven simulation and optimization for covid-19 exit strategies. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3434–3442 (2020) 12. He, X., Song, G., Chen, W., et al.: Influence blocking maximization in social networks under the competitive linear threshold model. 
In: Proceedings of the 12th SIAM International Conference on Data Mining, pp. 463–474 (2012) 13. Zhu, W., Yang, W., Xuan, S., et al.: Location-aware targeted influence blocking maximization in social networks. In: 2019 28th International Conference on Computer Communication and Networks (ICCCN), pp. 1–9. IEEE (2019)
14. Nguyen, N.P., Yan, G., Thai, M.T., et al.: Containment of misinformation spread in online social networks. In: Proceedings of the 4th Annual ACM Web Science Conference, pp. 213–222 (2012) 15. Wu, P., Pan, L.: Scalable influence blocking maximization in social networks under competitive independent cascade models. Comput. Netw. 123, 38–50 (2017) 16. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in social networks. In: Proceedings of the 20th International Conference on World Wide Web, pp. 665–674 (2011) 17. Tanınmış, K., Aras, N., Altınel, İ.K., et al.: Minimizing the misinformation spread in social networks. IISE Trans. 52(8), 850–863 (2020) 18. Schrank, D., Eisele, B., Lomax, T.: TTI's 2012 urban mobility report. Texas A&M Transportation Institute, The Texas A&M University System, vol. 4 (2012) 19. Gonzalez, M.C., Hidalgo, C.A., Barabasi, A.L.: Understanding individual human mobility patterns. Nature 453(7196), 779–782 (2008) 20. Einstein, A.: Investigations on the Theory of the Brownian Movement. Courier Corporation, North Chelmsford (1956) 21. Shlesinger, M.F., Klafter, J., Wong, Y.M.: Random walks with infinite spatial and temporal moments. J. Stat. Phys. 27(3), 499–512 (1982) 22. Zipf, G.K.: The P1P2/D hypothesis: on the intercity movement of persons. Am. Sociol. Rev. 11(6), 677–686 (1946) 23. Song, C., Koren, T., Wang, P., et al.: Modelling the scaling properties of human mobility. Nat. Phys. 6(10), 818–823 (2010) 24. Han, X.P., Hao, Q., Wang, B.H., et al.: Origin of the scaling law in human mobility: hierarchy of traffic systems. Phys. Rev. E 83(3), 036117 (2011) 25. Kang, C., Ma, X., Tong, D., et al.: Intra-urban human mobility patterns: an urban morphology perspective. Phys. A Stat. Mech. Appl. 391(4), 1702–1717 (2012) 26. Grabowicz, P.A., Ramasco, J.J., Gonçalves, B., et al.: Entangling mobility and interactions in social media. PLoS ONE 9(3), e92196 (2014)
27. Lu, X., Wetter, E., Bharti, N., et al.: Approaching the limit of predictability in human mobility. Sci. Rep. 3(1), 1–9 (2013) 28. Xu, F., Li, Y., Jin, D., et al.: Emergence of urban growth patterns from human mobility behavior. Nat. Comput. Sci. 1(12), 791–800 (2021) 29. Prakash, B.A., Chakrabarti, D., Valler, N.C., et al.: Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl. Inf. Syst. 33(3), 549–575 (2012) 30. Milanese, A., Sun, J., Nishikawa, T.: Approximating spectral impact of structural perturbations in large networks. Phys. Rev. E 81(4), 046112 (2010) 31. Tong, H., Prakash, B.A., Eliassi-Rad, T., et al.: Gelling, and melting, large graphs by edge manipulation. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 245–254 (2012) 32. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146 (2003)
A Hybrid Layout Method Based on GPU for the Logistics Facility Layout Problem

Fulin Jiang1, Lin Li1,2(B), Junjie Zhu1, and Xiaoping Liu1,2

1 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
lilin [email protected]
2 Engineering Research Center of Safety Critical Industrial Measurement and Control Technology, Ministry of Education, Hefei 230009, China

Abstract. The existing facility layout problem (FLP) only considers the layout of processing facilities. However, in current industrial logistics scenarios there are not only working facilities but also many transportation facilities. To solve the logistics facility layout problem (LFLP), this paper proposes a two-step algorithmic framework based on GPU acceleration to obtain feasible solutions. The first step lays out the working facilities with a meta-heuristic algorithm, and the second step lays out the transportation facilities with a routing algorithm. A GPU parallel scheme is then used to accelerate the optimization of material handling cost (MHC) and transportation facility cost (TFC) throughout the process. Finally, the framework is tested on three layout instances with two different meta-heuristic algorithms. Compared with omitting the routing step or the GPU, our method solves the LFLP more accurately and efficiently.

Keywords: Logistics facility layout problem · Hybrid method · GPU acceleration

1 Introduction
The FLP is an important research topic in industrial production. Given the material data and logistics relationships between a set of facilities, this problem focuses on determining the optimal layout on the shop floor so as to reduce production costs or other objectives. The LFLP is a special kind of FLP arising in industrial logistics scenes. There are two types of facilities in a logistics scenario: working facilities and transportation facilities. The former refers to processing, sorting, and storage facilities; the latter connects different working facilities to transport materials. Existing FLP methods cannot solve the LFLP for two reasons. First, they do not consider the layout of transportation facilities, so when calculating MHC the ideal Manhattan distance is generally used to represent the distance between working facilities; however, conflicts arise when placing transportation facilities along the Manhattan path (Fig. 1). Second, more transportation facilities bring more cost and transportation time, so the TFC must be an optimization objective in addition to the MHC. c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 570–579, 2023. https://doi.org/10.1007/978-981-99-2385-4_43
A Hybrid Layout Method Based on GPU for the LFLP
571
The FLP is NP-hard and complicated to solve. Mainstream methods obtain feasible solutions through meta-heuristic algorithms. Palomo-Romero et al. [10] proposed a parallel genetic algorithm based on the island model to solve the FLP. Pourhassan et al. [11] proposed an innovative dynamic FLP and solved it with the NSGA2 algorithm. Jolai et al. [5] considered pickup/drop-off locations of the facilities using the MOPSO algorithm for layout. Derakhshan and Wong [4] proposed a modified particle swarm optimization to solve static and dynamic facility layout problems. Liu et al. [7] proposed a MOPSO algorithm combined with overlap processing based on the gradient method, which is effective and robust in solving the multi-objective FLP. In addition, some researchers have used hybrid layout methods to enhance the effectiveness of facility layout schemes. Kulturel-Konak and Konak [6] introduced the cyclic facility layout problem and used a hybrid simulated annealing algorithm to optimize the layout of facilities at different scales. Mohamadi et al. [8] proposed a two-stage algorithm that uses attribute weights to calculate the adjacency level of facilities. Besbes et al. [2] proposed a three-dimensional facility layout model that uses the A* algorithm to obtain the shortest path between facilities. Most current facility layout algorithms are CPU-based rather than GPU-accelerated. In other fields, some researchers have redesigned GPU-accelerated versions of meta-heuristics. For the genetic algorithm, Mohammadi et al. [9] used a GPU-based parallel genetic algorithm to solve the assignment problem, improving the strategy for selecting new populations. For the particle swarm algorithm, Dali et al. [3] implemented a parallel GPU-PSO and a GPU-distributed PSO to improve the efficiency of solving maximum constraint satisfaction problems.
In this paper, a multi-objective optimization mathematical model considering the total MHC and the total TFC is first established. The proposed hybrid facility layout method based on CUDA acceleration is divided into two steps. The first step is a multi-objective meta-heuristic layout algorithm, which lays out the working facilities; Multi-Objective Particle Swarm Optimization (MOPSO) [5] and the Non-dominated Sorting Genetic Algorithm 2 (NSGA2) [11] are selected as the meta-heuristic examples in this paper. The second step is a path search algorithm, which lays out the transportation lines.
2 Problem Description and Mathematical Model
Combined with the relevant background knowledge of industrial logistics and actual conditions, the assumptions of the LFLP cover four aspects. The shape of every working facility can be regarded as a rectangle with fixed length and width, and the location of the logistics entrance and exit is known. Working facilities can only be placed horizontally or vertically. Transportation facilities can extend horizontally or vertically but not diagonally. The transport distance between working facilities is the actual total length of the material transportation facilities.
572
F. Jiang et al.
The LFLP is defined in a two-dimensional coordinate system (Fig. 1), which indicates the coordinate range of the shop floor, the locations of the working facilities, and the extension directions of the transportation facilities. The point O is the origin of the coordinate system. The location of a working facility is represented by its center coordinates. Ein and Eout represent the entrance and exit of the shop floor, and a transportation line is represented by the set of directed line segments between the start and end points of its path. A series of parameters is used to express the objective functions and constraints of the LFLP, described as follows. The subscripts i and j index the working facilities, up to the total number of working facilities N. The subscript k indexes the material transport lines, up to the total number of materials M. ai denotes the working facility with subscript i and mk denotes the material with subscript k.

L: shop floor length
W: shop floor width
li: length of working facility ai
wi: width of working facility ai
Ck: unit distance cost of transporting material mk
Cs: cost per unit length of linear conveyor
Ct: cost of a turning conveyor
Wc: width of the conveyor
LCM: minimum length of a linear conveyor
Lk: total length of the transportation line of material mk
Sk: total amount of material mk
Ls: total length of linear conveyors
St: total number of turning conveyors
Lmn: length of the transportation line between turning points m and n
(Xi, Yi): coordinates of working facility ai

fMHC = \sum_{k=1}^{M} Lk × Sk × Ck   (1)

fTFC = Ls × Cs + St × Ct   (2)

Xi − 0.5li ≥ 0, ∀i ∈ {1, 2, · · · , N}   (3)

Yi − 0.5wi ≥ 0, ∀i ∈ {1, 2, · · · , N}   (4)

Xi + 0.5li ≤ L, ∀i ∈ {1, 2, · · · , N}   (5)

Yi + 0.5wi ≤ W, ∀i ∈ {1, 2, · · · , N}   (6)

I(ai) ∩ I(aj) = ∅, i ≠ j, ∀i, j ∈ {1, 2, · · · , N}   (7)

Lmn ≥ LCM + Wc   (8)
Fig. 1. The LFLP coordinate system.
Two objective functions, minimizing MHC and minimizing TFC, are established in this paper. The calculation of MHC is shown in Eq. (1), which represents the total cost of transporting the materials along the given lines. The calculation of TFC is shown in Eq. (2), which gives the total cost of the transportation line layout, consisting of the cost of the linear conveyors and the cost of the turning conveyors. During layout, working facilities and transportation facilities must satisfy a series of constraints. Constraints (3)–(6) require working facilities to be located within the shop floor. Constraint (7) states that no two working facilities may overlap. Constraint (8) imposes a minimum length on each transportation line segment.
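Equations (1) and (2) are straightforward to evaluate once a layout is fixed. The numbers below are hypothetical, purely to show the shape of the computation:

```python
def mhc(lines):
    """Eq. (1): f_MHC = sum over all materials of L_k * S_k * C_k."""
    return sum(Lk * Sk * Ck for Lk, Sk, Ck in lines)

def tfc(linear_length, turn_count, Cs, Ct):
    """Eq. (2): f_TFC = Ls * Cs + St * Ct."""
    return linear_length * Cs + turn_count * Ct

# Two materials, each a (total line length L_k, amount S_k, unit-distance cost C_k).
lines = [(12.0, 3, 2.0), (8.0, 5, 1.5)]
print(mhc(lines))               # 12*3*2 + 8*5*1.5 = 132.0
print(tfc(20.0, 3, 4.0, 10.0))  # 20*4 + 3*10 = 110.0
```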
3 Hybrid Layout Method Framework
The proposed algorithmic framework is divided into five main modules (Fig. 2). Blue marks the modules processed on the CPU and red marks the modules processed on the GPU. The GPU modules operate on the solutions of each population in parallel through CUDA, which significantly accelerates the whole iteration process of the algorithm.
Fig. 2. Hybrid layout method main modules.
3.1 Initialization Strategy
The algorithm encodes the coordinate and orientation information of the working facilities as Plt = (X1, Y1, O1, · · · , Xi, Yi, Oi, · · · , XN, YN, ON). For NSGA2, Plt represents the chromosome code of individual l in generation t; for MOPSO, Plt represents the position vector of particle l in generation t. When initializing the population, this paper designs a strategy to ensure non-overlap and diversity of the initial solutions. The shop floor is divided into a two-dimensional grid of cells of equal length and width. The program randomly generates the code of a working facility and checks whether it overlaps with other working facilities; if it does, the program tries a new location and orientation. This process is repeated several times to obtain a set of initial solutions.

3.2 Working Facility Layout
After the initial solutions are generated, different meta-heuristic algorithms apply different update strategies. For example, the NSGA2 algorithm generates new individuals through crossover and mutation of parental chromosomes, while in the MOPSO algorithm each particle updates itself by tracking two extreme values, the individual best Pbest and the global best Gbest. If an encoded solution obtained by the algorithm violates the working facility constraints (3)–(7), a non-overlapping strategy is used to adjust the coordinates. The GPU accesses the coded information of all working facilities in each population in parallel. If overlap occurs, the facility is shifted toward the lower-left side of the shop area. After overlap processing (Fig. 3), all facilities are brought inside the shop by shifting the center coordinates of their envelope rectangle.
Fig. 3. The coordinate adjustment strategy.
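The feasibility tests behind this adjustment step, constraints (3)–(6) for the shop boundary and constraint (7) for pairwise overlap, reduce to simple interval arithmetic on the center-coordinate encoding. A minimal sketch (function names are ours, not the paper's):

```python
def inside_floor(x, y, l, w, L, W):
    """Constraints (3)-(6): the facility rectangle lies within the L x W floor."""
    return (x - 0.5 * l >= 0 and y - 0.5 * w >= 0
            and x + 0.5 * l <= L and y + 0.5 * w <= W)

def overlaps(f1, f2):
    """Constraint (7): two axis-aligned rectangles (x, y, l, w) intersect iff
    their centers are closer than half the summed extents on both axes."""
    x1, y1, l1, w1 = f1
    x2, y2, l2, w2 = f2
    return abs(x1 - x2) < 0.5 * (l1 + l2) and abs(y1 - y2) < 0.5 * (w1 + w2)
```

Because the test per facility pair is independent, it maps naturally onto the per-solution GPU parallelism described above.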
3.3 Transportation Facility Layout
Each time the locations of the working facilities are updated, the transportation facilities are laid out. In this paper, a path search algorithm based on multi-objective evaluation (MOEA) is adopted to generate transportation lines between different working facilities. The overall process conforms to constraint (8), and the pseudo-code is shown in Algorithm 1.
Fig. 4. Path point matrix construction.
Generally, the path point matrix is constructed by dividing a two-dimensional grid at equal intervals. However, in the hybrid facility layout framework the coordinate values of a solution are random floating-point numbers, and the location of a transportation facility is related to the center coordinates of the working facilities. It is therefore unreasonable to build the feasible point matrix at a fixed interval. In this paper, key points are chosen to construct the path point matrix (Fig. 4).
Algorithm 1: MOEA
Input: starting coordinates (xs, ys), ending coordinates (xe, ye), grid point set M, open list ol, close list cl
Output: path point set P
 1: ol ← null, cl ← null
 2: remove the nodes in M that lie inside a facility
 3: add (xs, ys) to ol
 4: get the node i with the lowest cost in ol
 5: add (xi, yi) to P
 6: while (xi, yi) ≠ (xe, ye) do
 7:   get the node m with the lowest cost in ol
 8:   add (xm, ym) to P
 9:   delete (xm, ym) from ol
10:   add (xm, ym) to cl
11:   get the neighbor nodes i of m
12:   cost(i) ← G(i) + H(i) + E(i)
13:   add (xi, yi) to ol
14: end while
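Algorithm 1 follows the general shape of an A*-style best-first search with an extra turning penalty E(i). The sketch below implements that idea on an integer grid; it is a simplified illustration (unit step cost, fixed turn penalty, names are ours), not the paper's implementation:

```python
import heapq
from itertools import count

def moea_path(start, end, blocked, width, height, turn_cost=5):
    """A*-style search: G = accumulated cost, H = Manhattan distance to the
    goal, E = penalty added whenever the direction of travel changes."""
    def h(p):
        return abs(p[0] - end[0]) + abs(p[1] - end[1])

    tie = count()  # tie-breaker so the heap never has to compare paths
    heap = [(h(start), next(tie), 0, start, None, [start])]
    best = {}
    while heap:
        _, _, g, p, d, path = heapq.heappop(heap)
        if p == end:
            return path
        if best.get((p, d), float("inf")) <= g:
            continue
        best[(p, d)] = g
        for nd in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            q = (p[0] + nd[0], p[1] + nd[1])
            if not (0 <= q[0] < width and 0 <= q[1] < height) or q in blocked:
                continue
            ng = g + 1 + (turn_cost if d is not None and nd != d else 0)
            heapq.heappush(heap, (ng + h(q), next(tie), ng, q, nd, path + [q]))
    return None  # no feasible transportation line

# A 5x5 point matrix with a small obstacle between start and end.
path = moea_path((0, 0), (4, 0), blocked={(2, 0), (2, 1)}, width=5, height=5)
```

Keeping the direction of travel in the search state is what lets the turn penalty steer the path toward fewer turning conveyors.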
In realistic logistics scenarios, the cost of a turning conveyor is generally higher than that of a linear conveyor, so turning points need to be taken into account when searching for path nodes. Suppose the coordinates of the starting node are (xs, ys), the coordinates of the ending node are (xe, ye), the coordinates of node i are (xi, yi), and the set of turning nodes is T. The line cost in the multi-objective evaluation search algorithm consists of three main terms: G(i), the actual cost from node i to the starting node; H(i), the straight-line cost from node i to the ending node; and E(i), the turning cost, whose value is Ct if node i belongs to T.

3.4 Filtering of Solutions
After all the working facilities and transportation facilities are laid out, the fitness function values of the new solutions are calculated. Solution filtering is then performed to update the relevant parameters and generate the next-generation solution set. Different meta-heuristic algorithms have different filtering strategies. For example, the NSGA2 algorithm selects new populations based on Pareto rank and crowding level, while the MOPSO algorithm obtains the new solution set by updating Pbest, Gbest, and the archive list.
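The Pareto-rank filtering mentioned for NSGA2 rests on a dominance test over the two minimized objectives (MHC, TFC). A minimal sketch of extracting the non-dominated set; names and data are illustrative:

```python
def dominates(a, b):
    """a dominates b if it is no worse on both objectives and better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(solutions):
    """Keep only solutions not dominated by any other (both objectives minimized)."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Candidate layouts as (MHC, TFC) pairs.
candidates = [(100, 10), (90, 12), (95, 11), (120, 20)]
front = pareto_front(candidates)
```

NSGA2 additionally ranks dominated solutions into successive fronts and breaks ties by crowding distance; the quadratic scan here is the simplest correct version.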
4 Layout Results and Analysis
This paper selects three shop floor layout instances [1] for experimental analysis. The algorithms compared against are MOPSO [5] and NSGA2 [11], which only consider the layout of working facilities and therefore calculate the length of a transportation line as the ideal Manhattan distance. In contrast, our method jointly iterates the layout of working facilities and transportation facilities; we call these variants MOPSO-Route and NSGA2-Route.

4.1 Objective Results and Comparison
As Table 1 shows, the proposed algorithms are superior to the comparison algorithms in most instances. The comparison algorithms do not consider the layout of transportation facilities during iteration and directly optimize the feasible solutions of the working facilities with Manhattan distances; the layout of transportation facilities is treated as post-processing that selects a conflict-free solution. The proposed algorithms consider the transportation facility constraint throughout, so the locations of the working facilities are adjusted according to it. The more working facilities there are, the greater the impact of the transportation facilities, and thus the more cost the hybrid algorithm can save in logistics scenarios. Several groups of experiments were run on CPU and on GPU (Fig. 5). The analysis shows that as the number of individuals increases, the parallel advantage of the GPU emerges and the speedup keeps growing, reaching a maximum of 5.152. The GPU parallel strategy proposed in this paper is therefore reasonable.
Table 1. Comparison of cost results on each instance.

Instance  Algorithm    |  MHC                     |  TFC
                       |  Best   Worst  Average   |  Best  Worst  Average
1         MOPSO        |  34730  49841  42639     |  692   864    735
          MOPSO-Route  |  34301  50528  41460     |  501   615    556
          NSGA2        |  38576  51530  43584     |  652   837    694
          NSGA2-Route  |  35782  51209  42067     |  530   646    581
2         MOPSO        |  921    1253   1044      |  1461  1812   1685
          MOPSO-Route  |  829    1121   995       |  1433  1675   1531
          NSGA2        |  977    1308   1097      |  1575  1902   1728
          NSGA2-Route  |  850    1074   981       |  1465  1653   1526
3         MOPSO        |  86978  97614  91286     |  4557  5763   5384
          MOPSO-Route  |  85130  96769  90222     |  4395  5459   5015
          NSGA2        |  87442  96332  92056     |  4778  5816   5401
          NSGA2-Route  |  84834  94541  90389     |  4439  5551   5127

4.2 Subjective Results and Comparison
Figure 6 shows the best layout results of all algorithms on the different instances. The colors of the working facilities are random, but the labels are consistent. In instance 1, the results of the comparison algorithms show transport lines with many turning conveyors (see the red circles), meaning that many turning conveyors are required. In instance 2, the results of the comparison algorithms place some transportation facilities too close together (see the red circles), which affects installation and maintenance. In contrast, the layout results of the proposed algorithms appear sparse and tidy. In instance 3, more facilities lead to a denser layout in a scenario of the same size; nevertheless, the proposed algorithms still keep a certain distance wherever possible, whereas the comparison algorithms have difficulty expressing the routing structure clearly.
Fig. 5. Running time of each instance.
Fig. 6. Layout results of each algorithm.
5 Conclusion
In logistics scenarios there are both working facilities and transportation facilities, which need to be laid out separately due to their different characteristics. In this paper, a hybrid layout algorithm framework combining multi-objective meta-heuristic optimization and a path search algorithm is proposed to address the LFLP. In addition, considering the parallelism of meta-heuristic algorithms, a GPU-based parallel strategy is introduced into the framework to improve layout efficiency. The current layout only considers the relationships between facilities, not those between people and facilities. In the future, connectivity constraints on the residual space can be added.
References 1. Asl, A.D., Wong, K.Y.: Solving unequal area static facility layout problems by using a modified genetic algorithm. In: 2015 IEEE 10th Conference on Industrial Electronics and Applications (ICIEA), pp. 302–305. IEEE (2015). https://doi.org/10.1109/ICIEA.2015.7334129 2. Besbes, M., Zolghadri, M., Costa Affonso, R., Masmoudi, F., Haddar, M.: 3D facility layout problem. J. Intell. Manuf. 32(4), 1065–1090 (2020). https://doi.org/10.1007/s10845-020-01603-z
A Hybrid Layout Method Based on GPU for the LFLP
579
3. Dali, N., Bouamama, S.: GPU-PSO: parallel particle swarm optimization approaches on graphical processing unit for constraint reasoning: case of maxCSPs. Procedia Comput. Sci. 60, 1070–1080 (2015) 4. Derakhshan Asl, A., Wong, K.Y.: Solving unequal-area static and dynamic facility layout problems using modified particle swarm optimization. J. Intell. Manuf. 28(6), 1317–1336 (2017) 5. Jolai, F., Tavakkoli-Moghaddam, R., Taghipour, M.: A multi-objective particle swarm optimisation algorithm for unequal sized dynamic facility layout problem with pickup/drop-off locations. Int. J. Prod. Res. 50(15), 4279–4293 (2012) 6. Kulturel-Konak, S., Konak, A.: A large-scale hybrid simulated annealing algorithm for cyclic facility layout problems. Eng. Optim. 47(7), 963–978 (2015) 7. Liu, J., Zhang, H., He, K., Jiang, S.: Multi-objective particle swarm optimization algorithm based on objective space division for the unequal-area facility layout problem. Expert Syst. Appl. 102, 179–192 (2018) 8. Mohamadi, A., Ebrahimnejad, S., Soltani, R., Khalilzadeh, M.: A new two-stage approach for a bi-objective facility layout problem considering input/output points under fuzzy environment. IEEE Access 7, 134083–134103 (2019) 9. Mohammadi, J., Mirzaie, K., Derhami, V.: Parallel genetic algorithm based on GPU for solving quadratic assignment problem. In: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 569–572. IEEE (2016). https://doi.org/10.1109/KBEI.2015.7436107 10. Palomo-Romero, J.M., Salas-Morera, L., García-Hernández, L.: An island model genetic algorithm for unequal area facility layout problems. Expert Syst. Appl. 68, 151–162 (2017) 11. Pourhassan, M.R., Raissi, S.: An integrated simulation-based optimization technique for multi-objective dynamic facility layout problem. J. Ind. Inf. Integr. 8, 49–58 (2017)
An Interpretable Loan Credit Evaluation Method Based on Rule Representation Learner

Zihao Chen1, Xiaomeng Wang1(B), Yuanjiang Huang2, and Tao Jia1

1 Southwest University, Chongqing 400700, China
[email protected], {wxm1706,tjia}@swu.edu.cn
2 BaiHang Intelligent Data Technology Institute, Chongqing, China
[email protected]
Abstract. The interpretability of models has become one of the obstacles to their wide application in high-stakes fields. The usual way to obtain interpretability is to build a black box first and then explain it using post-hoc methods. However, the explanations provided by post-hoc methods are not always reliable. Instead, we design an intrinsically interpretable model based on RRL (Rule Representation Learner) for the Lending Club dataset. Specifically, features are divided into three categories according to their own characteristics, and three sub-networks are built respectively, each of which is similar to a neural network with a single hidden layer but can be equivalently converted into a set of rules. During training, we borrow tricks from previous research to effectively train binary weights. Finally, our model is compared with tree-based models. The results show that our model performs much better than the interpretable decision tree and close to other black-box models, which is of practical significance to both financial institutions and borrowers. More importantly, our model is used to test the correctness of the explanations generated by a post-hoc method, and the results show that the post-hoc method is not always reliable.

Keywords: Personal credit evaluation · Interpretable machine learning · Binary neural network · Loan application · Knowledge extraction
1 Introduction
Credit is a core concept in the financial field, and credit scoring and rating are widely studied problems with a long history [1]. When borrowers apply for loans, banks or institutions are expected to make decisions not only agilely but also precisely. In the massive data environment created by financial technology (FinTech), technologies such as machine learning and data mining have become important technical means for credit evaluation due to their powerful data analysis capabilities. Related research and achievements have been witnessed [2–6].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 580–594, 2023. https://doi.org/10.1007/978-981-99-2385-4_44
However, since the period of statistical learning, the machine learning community has focused on pursuing the improvement of various performance metrics. This phenomenon has been particularly obvious since the era of deep learning began [7]. [8] points out that with the increasing strength of deep neural networks and their growing presence in our lives, there is also growing concern about their black-box nature. If domain experts cannot explain the decision-making process of a model, how can users trust it? A black box here refers to a model whose internal mechanism is not understood and whose decision-making process cannot be explained. Such a model cannot be directly used in domains demanding high transparency, such as finance, medicine, criminal justice, and other high-stakes decisions. It can be said that the lack of interpretability has become one of the obstacles to wide application in these fields. Therefore, interpretable machine learning has attracted more and more attention in the past years. The usual way to obtain interpretability is to build a black-box model first and then explain its behavior using post-hoc methods. However, many researchers are skeptical about the reliability of this direction, and considering interpretability before modeling is encouraged [9]. Some works on interpretable machine learning tend to propose a general model or algorithm [10–12], but it is unknown whether such a model meets the requirements of the actual application scenario. In application scenarios with high requirements for interpretation, it is unrealistic to seek a one-for-all model or interpretation technology, because different application scenarios need different interpretations. For example, in image classification applications [13], a partial area of the image is enough to interpret the classification result, without attributing to each individual feature (pixel).
For healthcare and criminal justice, the traditional scorecard is more acceptable to practitioners [14], whereas for credit evaluation, rule-based interpretation may be more user-friendly. RRL (Rule Representation Learner) [15] provides a rule representation and learning framework with performance advantages over traditional decision tree methods, so we design a credit evaluation model based on RRL. The main contributions of the paper are as follows: • Focusing on credit evaluation, we design an interpretable loan credit evaluation model based on RRL. Our model can naturally extract accurate global and local explanations without using post-hoc methods, which has practical significance for both financial institutions and borrowers. • Our model is compared and validated against tree-based models on the Lending Club dataset. The results show that our performance is far better than that of the interpretable CART and close to black-box models such as Random Forest, XGBoost, and LightGBM. • An experiment is designed to verify the correctness of a post-hoc interpretation method using our model; the results show why post-hoc methods are not suitable for high-stakes decision-making scenarios. The rest of this paper is organized as follows. Section 2 introduces the related work and development of interpretable machine learning and credit risk evaluation. Section 3 presents the structure and training process of our model. Section 4 shows the experimental results compared with the tree-based models, and also
illustrates how our model provides local and global explanations. An experiment to verify the correctness of the post-hoc interpretation method is also designed. Section 5 concludes the paper and covers future work.
2 Related Work

2.1 Interpretable Machine Learning
Interpretable machine learning has attracted much attention in recent years [16,17]. As suggested by [16], it can be simply divided into post-hoc methods and ante-hoc methods (intrinsically interpretable design, or white-box models). Post-hoc methods do not inspect the internal structure or parameters of the original model; they attempt to explain the behavior of the trained model (the black box). Therefore, they are applicable to all kinds of models and have been widely studied in the past few years. Representative research includes LIME [10], Anchor [11], and SHAP [12,18]. For example, LIME fits a locally linear surrogate model as an explanation for an individual sample, and Anchor provides sufficient conditions for the decision-making of the model. The emergence of post-hoc methods improves users' understanding in most cases. However, there is no such thing as a free lunch, and some researchers have raised concerns [9,19,20]. If the explanations provided by post-hoc methods can faithfully reflect the black-box model, is it necessary for the black box to exist? If not, why should we believe the explanations it provides? Post-hoc methods are suitable for fields with low safety requirements, where a misleading interpretation does not cause much loss. For other, high-stakes fields, they do not seem a sensible choice. Ante-hoc studies can be divided into two categories. The first is optimization of traditional statistical learning models and mathematical programming, including improving the training efficiency of interpretable algorithms and imposing regularization constraints on complexity. For example, [21,22] focused on decision trees and made efforts to obtain a sparse yet accurate decision tree according to a custom objective function.
[23] solved a mixed integer nonlinear programming problem in acceptable time and obtained an integer scorecard with a performance guarantee. The second is to design a model that meets the interpretability requirements of a specific application scenario: [24] decomposes a regression problem in which an LSTM is responsible for ensuring performance while linear regression is responsible for explaining. [25] built a two-layer additive risk model for the Explainable Machine Learning Challenge organized by FICO, which is comparable to black-box models in performance. [26] presented neural additive models, which combine some of the expressivity of DNNs with the inherent interpretability of generalized additive models. [15] extracted conjunction and disjunction rules from neural networks for classification tasks. Generally speaking, intrinsically interpretable models are a broad topic and a mainstream trend of interpretable machine learning. Modelers are required to design exclusive interpretable models for specific application scenarios; our model also belongs to this category.
2.2 Credit Risk Evaluation
Credit risk is one of the three risks defined in the Basel Accord [1]. Personal credit risk evaluation is an essential aspect of financial risk management. With the rapid development of digital finance, governments, banks, financial institutions, and FinTech companies have accumulated a large amount of data, providing a solid foundation for credit risk modeling. In the loan application scenario, a functional model can predict the solvency and willingness to repay of users according to the collected features, provide decision support for transactions, facilitate applicants, and help financial institutions avoid risks, which is conducive to the sound development of the financial market. Personal credit risk evaluation is a typical classification task suitable for machine learning modeling. In the past decade, a large number of machine learning methods have been applied to credit scoring or evaluation [27–29]. Among them, logistic regression is still widely used today because of its solid statistical foundation and strong interpretability. Deep learning and ensemble learning are usually superior in performance, but they are difficult to deploy widely in credit risk scenarios for many reasons [30], interpretability being one of the major obstacles. Therefore, applying interpretable machine learning techniques to credit evaluation for loan applications is natural.
3 Model
If the loan application of a customer is rejected, the financial institution needs to clarify the reason; for example, the application is rejected because "The total assets are less than 10000 and the monthly income is less than 5000". This requires that the causal relationship between the inputs and outputs of the credit evaluation model be clear. Although the decision tree model can easily do this, it has shortcomings in heuristic training and performance. Instead, RRL (Rule Representation Learner) [15] is applied to construct an interpretable credit evaluation model in this paper. The key steps of model construction are logical rule determination, feature selection and binarization, and model training.

3.1 Rule Representation
For user i, the rejection of his loan application is determined by a combination of factors. We formalize the interpretation as Eq. (1):

Ei = ri1 ∧ ri2 ∧ ... ∧ rin    (1)

where Ei is the interpretation set for rejected customer i, and rin is the nth meta rule for customer i, such as "The total assets are less than 10000". A neural network (Fig. 1) and a conjunction function [31] are used to represent the conjunction rule; the conjunction activation function used is as follows:
Conj = ∏_{i=1}^{n} (1 − wi(1 − xi)),  xi, wi ∈ {0, 1}    (2)
where wi represents the weight and xi the input from the previous layer. It should be noted that both the weights and the inputs are binary values. Therefore, such a network can be equivalently transformed into a set of conjunction rules, so the rules we need can be represented by a neural network. Taking Fig. 1 as an example, node 3 outputs one if and only if both node 1 and node 2 output one.
Fig. 1. An example of representing conjunction rules by a neural network.
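The conjunction node of Eq. (2) and the example of Fig. 1 can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' training code):

```python
import numpy as np

def conj_activation(x, w):
    """Conjunction node of Eq. (2): outputs 1 iff every input selected
    by the binary weight vector w is also 1."""
    x, w = np.asarray(x), np.asarray(w)
    return int(np.prod(1 - w * (1 - x)))

# Node 3 in Fig. 1, connected to nodes 1 and 2 (w = [1, 1, 0]):
fires = conj_activation([1, 1, 0], [1, 1, 0])   # 1: both selected inputs are 1
silent = conj_activation([1, 0, 1], [1, 1, 0])  # 0: node 2 outputs 0
```

Each weight of 1 adds a literal to the conjunction; inputs with weight 0 are ignored, which is why the node behaves exactly like the AND of the selected inputs.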
3.2 Feature Selection and Binarization

Following the suggestions of [32], the credit features of the applicant are divided into three categories: loan information, history information, and soft information. Loan information includes features directly related to the loan application. History features contain statistical information about the past behavior of the applicant. Soft information refers to features that are not directly related to lending but may still be helpful for classification. According to the model design, the input features are to be binarized, so the original feature values need to be processed by one-hot encoding. One-hot encoding is intuitive for categorical features but not for continuous ones, so binning is required for continuous features. Binning is a common engineering strategy, which introduces non-linearity and enhances the robustness of the data. Here we use the decision tree algorithm to discretize continuous features: for each continuous feature, a CART is trained separately, and the threshold values of its split nodes are taken as the bin boundaries. This is a commonly used binning strategy in credit evaluation modeling. After binning the continuous features, one-hot encoding can be carried out normally.
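The CART-based binning described above can be sketched with scikit-learn; the helper `cart_bin_edges` and the toy data are illustrative, not the paper's actual preprocessing pipeline:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_bin_edges(feature, labels, max_leaf_nodes=4):
    """Fit a CART on a single continuous feature and take its split
    thresholds as bin edges, as described above."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    tree.fit(feature.reshape(-1, 1), labels)
    # Internal split nodes have feature index >= 0; leaves are marked -2.
    return np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# Toy example: defaults cluster above 0.5, so CART should split near 0.5.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
edges = cart_bin_edges(x, y)
binned = np.digitize(x, edges)  # integer bin ids, ready for one-hot encoding
```

`np.digitize` then maps each continuous value to an integer bin id, which one-hot encoding can treat like any categorical feature.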
3.3 Model Outline and Training
The overall structure consists of three sub-networks whose outputs are aggregated by a fully connected layer to make the final decision. Each sub-network contains only one hidden layer (see Fig. 2). After one-hot encoding, the input layer (the black nodes) receives the new encoded features. The hidden layer (the blue nodes) is essentially a fully connected layer, but to simulate the conjunction behaviour between rules, we use the conjunction activation function introduced previously. Each sub-network can be regarded as a sub-classifier, which outputs a default probability between 0 and 1 through the sigmoid activation function. So far, the overall network structure has been presented, but there are still problems to be solved in the practical training process.
Fig. 2. Overall modeling structure.
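A minimal NumPy sketch of the forward pass in Fig. 2; the shapes and parameter names (`W_hidden`, `w_out`, `w_agg`) are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def subnet_forward(x, W_hidden, w_out, b_out):
    """One sub-network: a binary-weight conjunction layer (Eq. 2),
    one conjunction node per row of W_hidden, then a sigmoid output
    giving a default probability."""
    hidden = np.prod(1 - W_hidden * (1 - x), axis=1)
    return sigmoid(hidden @ w_out + b_out)

def model_forward(x_loan, x_hist, x_soft, subnets, w_agg, b_agg):
    """Aggregate the three sub-network scores with a final dense layer."""
    scores = np.array([subnet_forward(x, *params)
                       for x, params in zip((x_loan, x_hist, x_soft), subnets)])
    return sigmoid(scores @ w_agg + b_agg)
```

Each sub-network sees only the one-hot inputs of its own feature category (loan, history, soft), and only the aggregation layer mixes the three scores.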
The most severe obstacle is training the binary weights. As mentioned earlier, the weights of the hidden layer may only take 0 or 1; the weight matrix can be understood as an adjacency matrix. Unlike the continuous weights of a regular neural network, this makes gradient-based training almost impossible. Nevertheless, [15] solved the problem skillfully, presenting a training method for discrete-weight neural networks and an improved conjunction activation function, which address the discrete training and the vanishing gradient on large-scale datasets, respectively. More specifically, we draw on the idea of Gradient Grafting proposed by [15]. During training, a discrete and a continuous weight model are maintained simultaneously. The continuous model keeps its weights as floating-point numbers between 0 and 1, while the discrete model discretizes the hidden layer's weight matrix of the continuous model with a threshold of 0.5. It should be noted that only the hidden layer's weight matrix
of the discrete model is binary; the weights of the other layers are the same as those of the continuous model. The weights of the continuous model are manually clipped to [0, 1] after each update. Regular gradient descent is shown in Eq. (3), while the Gradient Grafting update is formulated as Eq. (4), where Yd is the output of the discrete model and Yc is the output of the continuous model:

Wt+1 = Wt − η · ∂L(Yc)/∂Wt    (3)

Wt+1 = Wt − η · (∂L(Yd)/∂Yd) · (∂Yc/∂Wt)    (4)

In this way, the updating direction of the weights focuses on optimizing the loss of the discrete model, and we effectively train the discrete model actually needed with the help of the continuous one.
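Gradient Grafting can be illustrated on a toy one-node model with a squared-error loss; this is a hand-derived sketch under those assumptions, not the RRL implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grafting_step(w, x, t, lr=0.5):
    """One Gradient Grafting update (Eq. 4) for a toy one-node model
    y = sigmoid(w . x): the loss gradient is taken at the discrete
    output Yd but chained through the continuous model Yc."""
    w_d = (w >= 0.5).astype(float)   # discrete twin of the weights
    y_c = sigmoid(x @ w)             # continuous output Yc
    y_d = sigmoid(x @ w_d)           # discrete output Yd
    dL_dYd = 2.0 * (y_d - t)         # loss L = (Yd - t)^2
    dYc_dw = y_c * (1.0 - y_c) * x   # dYc/dw of the sigmoid model
    w = w - lr * dL_dYd * dYc_dw     # grafted update, Eq. (4)
    return np.clip(w, 0.0, 1.0)      # keep continuous weights in [0, 1]

w = np.array([0.6, 0.4])
x = np.array([1.0, 1.0])
for _ in range(50):
    w = grafting_step(w, x, t=0.0)   # push the prediction toward 0
```

After repeated steps the continuous weights drift below the 0.5 threshold, so the discrete twin that is actually deployed flips to the all-zero rule, which is exactly the behavior the grafted loss asks for.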
However, when training on a large-scale dataset, the vanishing gradient is still unavoidable, as can be seen by analyzing the partial derivative with respect to the weights. As shown in Eq. (5), the factors 1 − Wi(1 − Xi) are values between 0 and 1, and the multiplication of many such numbers drives the entire value toward 0:

∂Conj/∂Wk = (Xk − 1) · ∏_{i=1, i≠k}^{n} (1 − Wi(1 − Xi))    (5)

Therefore, the improved conjunction activation function proposed by [15] can be used, as in Eq. (6):

Conj+ = −1 / (−1 + log(Conj))    (6)

Here, logarithms convert the multiplications into additions, and the projection −1/(−1 + x) keeps the behaviour of the conjunction activation function. So far, we have achieved efficient training of the discrete neural network through Gradient Grafting and the improved conjunction activation function. To be precise, each sub-network can be trained individually, but the network does not converge so easily when the three sub-networks are trained together. Since they can be trained individually, one can first train the three sub-networks separately and then use the trained weights as the initial values of later joint training.
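The improved activation of Eq. (6) can be sketched as follows; the small `eps` is an assumption added here to keep `log` finite at exact zeros:

```python
import numpy as np

def conj_plus(x, w, eps=1e-12):
    """Improved conjunction activation of Eq. (6): logarithms turn the
    product of Eq. (2) into a sum, avoiding the vanishing gradient of
    multiplying many values in [0, 1]; eps keeps log finite at zero."""
    log_conj = np.sum(np.log(1.0 - w * (1.0 - x) + eps))
    return -1.0 / (-1.0 + log_conj)

ones = np.ones(10)
satisfied = conj_plus(ones, ones)   # close to 1: all literals hold
x_bad = ones.copy()
x_bad[0] = 0.0
violated = conj_plus(x_bad, ones)   # close to 0: one literal violated
```

At the binary endpoints the projection −1/(−1 + x) recovers the original conjunction behaviour, while the log-sum form keeps gradients from vanishing when many literals are involved.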
4 Experiments

4.1 Dataset
Most public loan datasets are out of date and small-scale, which limits their practical significance. So we take data from the Lending Club, the largest P2P lending platform in the US, as the experimental data. We collected the Lending Club loan records from 2007 to the fourth quarter of 2018. Among the 1,048,575 records, we randomly selected 100,000 for the experiment to save training time.
The features used and their categories are shown in Table 1. The target variable is loan status, with three values: Current, Fully Paid, and Charged Off. What we actually need are the records of Fully Paid and Charged Off. Fully Paid means the loan has been repaid in full; it is encoded as 0 in the label column and referred to as the negative sample in this paper. Charged Off, encoded as 1, refers to the positive samples: applicants whose loans have not been repaid within the agreed period. So the task can be simplified to a typical binary classification. For features, we follow the preprocessing methods in [33]. However, we do not correct the data imbalance, because we want to minimize the influence of preprocessing on the performance results.

Table 1. Selected features and their categories.

Loan Information: Installment, Loan Purpose, Loan Application Type, Interest Rate, Last Payment Amount, Loan Amount, Revolving Balance
History Information: Delinquency In 2 Years, Inquiries In 6 Months, Mortgage Accounts, Grade, Open Accounts, Revolving Utilization Rate, Total Accounts, Fico Avg
Soft Information: Address State, Employment Length, Home Ownership, Verification Status, Annual Income
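The target encoding described above amounts to the following pandas sketch (toy data; column names such as `loan_status` are assumptions about the Lending Club export):

```python
import pandas as pd

# Toy frame standing in for the Lending Club export (column names assumed).
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Fully Paid"],
    "loan_amnt": [10000, 5000, 20000, 8000],
})

# Keep only closed loans and encode the binary target:
# Fully Paid -> 0 (negative sample), Charged Off -> 1 (positive sample).
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
df["label"] = (df["loan_status"] == "Charged Off").astype(int)
```

Dropping the Current loans first is what reduces the three-valued status column to the binary classification target used in the rest of the experiments.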
4.2 Classification Performance
First of all, it should be noted that performance is not the final goal of our model, but since a performance comparison is indispensable, a brief one is carried out here. CART. The decision tree can also be transformed into a conjunctive rule set similar to our model, so it is reasonable to focus on the performance difference between the two. We use the decision tree implementation in [34] with the default parameter settings.
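The 5-fold evaluation protocol used for the comparison can be sketched with scikit-learn on synthetic stand-in data (the real experiment uses the preprocessed Lending Club features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the preprocessed Lending Club features.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

cv = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=5,
                    scoring=("accuracy", "f1", "roc_auc"))
means = {k: v.mean() for k, v in cv.items() if k.startswith("test_")}
```

Swapping in any other estimator (Random Forest, XGBoost, LightGBM, or our model wrapped in a scikit-learn-style interface) reproduces the same averaged Accuracy, F1, and AUC protocol.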
Representative Black-Box Models. These mainly include tree-based models, which are usually considered inexplicable due to their complex internal structure, such as Random Forest, XGBoost, and LightGBM. We also use the implementations of these algorithms in [34]. For Random Forest, we set n_estimators=100 and criterion="gini". For XGBoost, we set learning_rate=0.01, n_estimators=160, and objective="binary:logistic". For LightGBM, we keep the default settings. These models are complex enough to be called black-box models, because we cannot trace their decision paths at all. Our Model. As mentioned before, our model consists of three sub-networks, each of which has only one hidden layer, so as to preserve interpretability and human understandability. The number of nodes in the hidden layer is a hyperparameter; we tried 16, 32, 64, 128, and 256, with learning rates of 0.1, 0.01, 0.001, 0.0001, and 0.00001. Considering that the dataset is unbalanced, we examine three metrics: Accuracy, F1-Score, and AUC. The results are shown in Table 2. On this data, the performance of the different models does not differ much, except for CART. Because the model structure has to preserve interpretability as much as possible, our model does not surpass LightGBM and XGBoost in performance, but it is pretty close. Such a loss can be exchanged for structural transparency and interpretability, which is worthwhile for high-stakes application scenarios. Our result also shows that more complex models do not necessarily lead to better performance, especially on structured tabular data; the common belief that model complexity is proportional to performance should be taken with a grain of salt. In practice, our model is pretty close to the tree-based black-box models on this dataset and clearly better than the decision tree with a similar structure.

Table 2. Classification performance comparison by 5-fold cross validation; the results are averaged.

Model | Accuracy | F1-Score | AUC
Our Model (interpretable) | 0.865 | 0.655 | 0.923
CART (interpretable) | 0.822 | 0.570 | 0.731
LightGBM (black box) | 0.871 | 0.665 | 0.930
XGBoost (black box) | 0.865 | 0.656 | 0.926
Random Forest (black box) | 0.864 | 0.646 | 0.924

4.3 Global and Local Explanations
The model can provide both global and local explanations. Global explanation means that the internal structure of the model is clear and can be equivalently
converted into rule sets. This helps users predict what kind of behaviour combination will lead to a default. Local interpretation focuses on a specific sample: for example, when a user is rejected, the reasons or factors that led to the default prediction can be inferred in reverse from the rule set. The two kinds of explanations make the business logic clearer, make applicants more accountable, and help eliminate discrimination. Global Explanation. The model can be converted into three rule sets according to the feature classification. As shown in Table 3, the weight of each rule is obtained after training. The loan information has the highest impact on the prediction, as we expected. For a specific rule, a weight above 0.5 represents a negative factor. Accordingly, a global rule view that reflects positive or negative factors can be drawn. With this global view, experience and knowledge are accumulated to understand and improve the model. Some rules confirm our common-sense knowledge. For example, the rule (Grade = A, 0.45) indicates that credit grade A is a positive factor in avoiding rejection, which matches our prior knowledge. For another example, the rule ((8.0 ≤ InterestRate < 12.0) ∧ (LastPaymentAmount < 7.0), 0.52) tells us that a smaller last payment has a negative impact on the prediction; this is seldom realized in everyday experience and can be extracted as knowledge for domain experts. Other rules in the model defy common sense. For example, (Verification = False, 0.49) indicates that verifying applicant information acts as a negative factor for loan applications, which is clearly against common sense. We then have reason to examine the dataset used for training and improve the model. In other words, the transparent structure allows us to debug our model efficiently.
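Reading a trained conjunction layer as a global rule view can be sketched as follows; the weights and feature names here are hypothetical examples echoing Table 3:

```python
import numpy as np

# Hypothetical trained sub-network: each binary row of W_hidden selects the
# literals of one conjunction rule; rule weights above 0.5 mark negative factors.
feature_names = ["Grade = A",
                 "8.0 <= InterestRate < 12.0",
                 "LastPaymentAmount < 7.0"]
W_hidden = np.array([[1, 0, 0],    # rule 1
                     [0, 1, 1]])   # rule 2
rule_weights = np.array([0.45, 0.52])

def extract_rules(W_hidden, rule_weights, feature_names):
    """Equivalently convert a conjunction layer into a readable rule set."""
    rules = []
    for row, wt in zip(W_hidden, rule_weights):
        literals = [feature_names[j] for j in np.flatnonzero(row)]
        factor = "negative" if wt > 0.5 else "positive"
        rules.append((" AND ".join(literals), round(float(wt), 2), factor))
    return rules

rules = extract_rules(W_hidden, rule_weights, feature_names)
```

Because the hidden weights are exactly 0 or 1, this conversion is lossless: the rule set and the network make identical decisions.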
The simpler the model structure, the faster the iterative upgrading of model performance, rather than the opposite. Local Explanation. When someone is rejected by the model, we can provide the applicant with the reasons for the refusal. We only need to check the network nodes activated by the input, that is, from the rule-set perspective, the rules that the applicant satisfies, and then show him the combinations of behaviors whose weight is greater than 0.5. In this way, users can place more trust in the financial institution's decision system, and at the same time their future behavior can be guided toward getting the loan.
4.4 Correctness Test of Post-hoc Methods
The flaws of post-hoc methods were brought to the fore as early as 2019 [9]: the explanations provided by a post-hoc method may not be faithful to the actual decision-making behaviour of the original model. However, this claim is difficult to accept without evidence to support it.
Table 3. Rule examples extracted from our model.

Subnet (subnet weight) | Rule weight | Rule
Loan Information (6.9160) | 0.52 | 8.0 ≤ InterestRate < 12.0 AND LastPaymentAmount < 7.0
Loan Information (6.9160) | 0.43 | InterestRate ≥ 8.0 AND LoanAmount > 9.2
... | ... | ...
History Information (−0.7029) | 0.45 | Grade = A
History Information (−0.7029) | 0.50 | MortgageAccounts = 0 AND Grade = D or less
... | ... | ...
Soft Information (1.0364) | 0.53 | AnnualIncome < 10.9
Soft Information (1.0364) | 0.49 | Verification = False
... | ... | ...
In this section, our model is used to test the correctness of the explanations generated by post-hoc methods. The basis is that each sub-network in our model can be equivalently converted into a rule set, whose representation is essentially similar to the explanations provided by some post-hoc methods. Anchor [11] is selected as the representative post-hoc method for testing. Anchor provides sufficient conditions for a model's decision result: any other sample that meets these conditions is classified as the same class as the explained sample. Specifically, after building our model, we treat it as a black box and use Anchor to explain a positive prediction (Charged Off) of the model. After that, we check whether the explanation provided by Anchor contradicts our actual decision path, for example, whether the rules provided by Anchor contain features that are not included in our model. Unsurprisingly, we did find such an example. A sample is classified as Charged Off by our model, and we check the activated nodes (the rules satisfied); the decision path of this sample is transparent and formulated as conjunction rules. Anchor was then used to explain the same sample. It provides a rule set, a precision, and a coverage, meaning that samples satisfying the rule set are classified as Charged Off by our model with probability equal to the precision. However, comparing the rules provided by Anchor to our actual decision paths, loan_amount ≥ 9.5 is not in the actual rule set at all, which means that Anchor provides a wrong interpretation of our model's decision. This case is relatively solid evidence that post-hoc methods sometimes provide explanations that are not faithful to
the behavior of the original model. It is as if we travel from one place to another: the original model arrives via path A, while the explanation provided by the post-hoc method arrives via path B; both reach the same destination, but this is not what we desire. An inaccurate interpretation gives wrong guidance for users' future behavior. For example, the explanation provided by Anchor contains loan_amount ≥ 9.5, suggesting that the loan amount is a significant reason for rejection, which may drive the user to reduce the amount in the next loan application. However, this would not actually improve the approval rate of future applications, because this rule is not on the decision path of the original model. Such deceptive explanations may make users more suspicious of, rather than trusting in, decision systems, contrary to the original intention of interpretable machine learning. Obviously, post-hoc methods are not suitable for high-stakes decision-making scenarios.
4.5 Experimental Summary
In the experimental stage, we first reported the classification performance of our model. The results show that our model significantly outperforms the decision tree, the model with the most similar structure, which indicates that building a decision-tree-like rule set by gradient descent can effectively avoid the local optima caused by heuristic training. Our performance is not as good as some black-box models, but it is pretty close, and the resulting transparency and interpretability are crucial for widespread application in high-stakes scenarios. We pay more attention to the practicality and interpretability of the model, which is more meaningful than improving performance at the third decimal place; the results show that we achieved an acceptable trade-off. Then we presented the advantages of our model, provided global and local explanations, and analyzed in detail how it breaks the trust barrier between applicants and financial institutions. More importantly, we demonstrated the unreliability of post-hoc methods with a practical case in which the post-hoc method is not faithful to the behavior of the original model. This suggests that the right direction for interpretable machine learning is to consider interpretability before building the model rather than post-hoc, especially in high-risk applications.
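The faithfulness test of Sect. 4.4 amounts to a subset check between the literals cited by the post-hoc explanation and the literals on the model's actual decision path (a sketch with illustrative literals):

```python
def faithful(explanation_literals, decision_path_literals):
    """A post-hoc explanation is faithful to the model only if every
    literal it cites actually occurs on the model's decision path."""
    return set(explanation_literals) <= set(decision_path_literals)

# Illustrative literals echoing the case in Sect. 4.4:
decision_path = {"8.0 <= InterestRate < 12.0", "LastPaymentAmount < 7.0"}
anchor_rules = {"LastPaymentAmount < 7.0", "loan_amount >= 9.5"}

result = faithful(anchor_rules, decision_path)  # False: an extra literal is cited
```

Such a check is only possible because the intrinsically interpretable model exposes its exact decision path; against a true black box, there is no ground truth to compare the explanation with.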
5 Conclusion
In FinTech, model interpretability is often more of a concern to users than performance. Post-hoc interpretation methods cannot accurately explain the internal mechanism of black-box models. RRL represents and learns conjunctive logical rules through neural networks, taking into account both interpretability and performance. Based on this advantage, we design decision rules for loans and create an interpretable credit evaluation model based on RRL. Experiments demonstrate
that the model performs nearly as well as state-of-the-art black-box models while maintaining interpretability, and it has been verified in credit evaluation application scenarios. In addition, using the proposed model as a verification tool, the analysis confirms that the credibility of post-hoc methods is insufficient. There are still many aspects of our work to improve in the future. In the structure of the proposed neural network, only the conjunction function is considered, yet disjunctions also occur frequently in loan credit evaluation; a disjunction activation function can be studied in future research. We can also continue to improve performance, for example by improving the gradient-grafting training trick or by incorporating the binning of continuous features into the training process of the model. In addition, by communicating with industry practitioners, we can design models that better meet the interpretability requirements of credit evaluation and other fields. At a macro level, effort should also go into formal definitions of interpretable machine learning, since no rigorous definition of this topic exists so far.

Acknowledgements. This research is supported by the National Natural Science Foundation of China (Grant No. 62006198).
References

1. Thomas, L., Crook, J., Edelman, D.: Credit Scoring and Its Applications. SIAM (2017)
2. De Prado, M.L.: Advances in Financial Machine Learning. John Wiley & Sons, Hoboken (2018)
3. Goodell, J.W., Kumar, S., Lim, W.M., Pattnaik, D.: Artificial intelligence and machine learning in finance: identifying foundations, themes, and research clusters from bibliometric analysis. J. Behav. Exp. Finance 32, 100577 (2021)
4. Nti, I.K., Adekoya, A.F., Weyori, B.A.: A systematic review of fundamental and technical analysis of stock market predictions. Artif. Intell. Rev. 53(4), 3007–3057 (2020)
5. Ozbayoglu, A.M., Gudelek, M.U., Sezer, O.B.: Deep learning for financial applications: a survey. Appl. Soft Comput. 93, 106384 (2020)
6. Zheng, X.L., Zhu, M.Y., Li, Q.B., Chen, C.C., Tan, Y.C.: FinBrain: when finance meets AI 2.0. Front. Inf. Technol. Electron. Eng. 20(7), 914–924 (2019)
7. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
8. Zhang, Y., Tiňo, P., Leonardis, A., Tang, K.: A survey on neural network interpretability. IEEE Trans. Emerg. Top. Comput. Intell. 5, 726–742 (2021)
9. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019)
10. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
11. Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: high-precision model-agnostic explanations. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
12. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
13. Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates Inc. (2019)
14. Rudin, C., Ustun, B.: Optimized scoring systems: toward trust in machine learning for healthcare and criminal justice. Interfaces 48(5), 449–466 (2018)
15. Wang, Z., Zhang, W., Liu, N., Wang, J.: Scalable rule-based representation learning for interpretable classification. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
16. Molnar, C., Casalicchio, G., Bischl, B.: Interpretable machine learning – a brief history, state-of-the-art and challenges. In: Koprinska, I., et al. (eds.) ECML PKDD 2020. CCIS, vol. 1323, pp. 417–431. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-65965-3_28
17. Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., Zhong, C.: Interpretable machine learning: fundamental principles and 10 grand challenges (2021)
18. Lundberg, S.M., Erion, G.G., Lee, S.I.: Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018)
19. Laugel, T., Lesot, M.J., Marsala, C., Renard, X., Detyniecki, M.: The dangers of post-hoc interpretability: unjustified counterfactual explanations. arXiv preprint arXiv:1907.09294 (2019)
20. Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp.
180–186 (2020) 21. Hu, X., Rudin, C., Seltzer, M.: Optimal sparse decision trees. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 22. Lin, J., Zhong, C., Hu, D., Rudin, C., Seltzer, M.: Generalized and scalable optimal sparse decision trees. In: International Conference on Machine Learning, pp. 6150– 6160. PMLR (2020) 23. Ustun, B., Rudin, C.: Learning optimized risk scores. J. Mach. Learn. Res. 20(150), 1–75 (2019) 24. Kim, T., Sharda, S., Zhou, X., Pendyala, R.M.: A stepwise interpretable machine learning framework using linear regression (LR) and long short-term memory (LSTM): city-wide demand-side prediction of yellow taxi and for-hire vehicle (FHV) service. Transp. Res. Part C: Emerg. Technol. 120, 102786 (2020) 25. Chen, C., Lin, K., Rudin, C., Shaposhnik, Y., Wang, S., Wang, T.: A holistic approach to interpretability in financial lending: models, visualizations, and summaryexplanations. Decis. Support Syst. 152, 113647 (2022) 26. Agarwal, R., et al.: Neural additive models: interpretable machine learning with neural nets. In: Advances in Neural Information Processing Systems, vol. 34 (2021) 27. Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J.: Benchmarking state-of-the-art classification algorithms for credit scoring. J. Oper. Res. Soc. 54(6), 627–635 (2003) 28. Lessmann, S., Baesens, B., Seow, H.V., Thomas, L.C.: Benchmarking state-of-theart classification algorithms for credit scoring: an update of research. Eur. J. Oper. Res. 247(1), 124–136 (2015)
594
Z. Chen et al.
29. Moscato, V., Picariello, A., Sperl´ı, G.: A benchmark of machine learning approaches for credit score prediction. Expert Syst. Appl. 165, 113986 (2021) ´ 30. Gunnarsson, B.R., Vanden Broucke, S., Baesens, B., Oskarsd´ ottir, M., Lemahieu, W.: Deep learning for credit scoring: do or don’t? Eur. J. Oper. Res. 295(1), 292– 305 (2021) 31. Payani, A., Fekri, F.: Learning algorithms via neural logic networks. arXiv preprint arXiv:1904.01554 (2019) 32. Ruyu, B., Mo, H., Haifeng, L.: A comparison of credit rating classification models based on spark-evidence from lending-club. Procedia Comput. Sci. 162, 811–818 (2019) 33. Lee, J.W., Lee, W.K., Sohn, S.Y.: Graph convolutional network-based credit default prediction utilizing three types of virtual distances among borrowers. Expert Syst. Appl. 168, 114411 (2021) 34. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
A Survey of Computer Vision-Based Fall Detection and Technology Perspectives

Manling Yang1, Xiaohu Li2, Jiawei Liu2, Shu Wang3, and Li Liu2(B)

1 Department of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China
[email protected]
2 School of Big Data and Software Engineering, Chongqing University, Chongqing, China
{xhlee,202124131055,dcsliuli}@cqu.edu.cn
3 School of Materials and Energy, Southwest University, Chongqing, China
[email protected]
Abstract. With the increase in the number of elderly people living alone, real-time fall detection based on computer vision is of great importance. In this paper, we review computer vision-based fall detection from four perspectives: background significance, current research status, relevant influencing factors, and future research outlook. We organize the review around the three types of input image data used in fall detection systems: RGB (Red, Green, Blue), Depth, and IR (Infrared Radiation); outline research on target tracking and skeleton detection as basic image processing tasks; and survey methods for processing video data. We analyze the possible effects on fall detection of camera selection, the individual being recognized, and the recognition environment, and collect the proposed solutions. Based on current problems and trends in vision-based fall detection, we present an outlook on future research and propose four new ideas: functional extensions using the easy feature fusion of Mask R-CNN (Mask Region with Convolutional Neural Network); use of the YOLO (You Only Look Once) family to improve the speed of target detection; use of LSTM (Long Short-Term Memory) variants such as the GRU (Gated Recurrent Unit) to achieve more efficient detection; and use of Transformer methods migrated from natural language processing to computer vision.

Keywords: Computer Vision · Deep Learning · Fall Detection · Neural Network · Video Surveillance System
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 595–609, 2023. https://doi.org/10.1007/978-981-99-2385-4_45

1 Introduction

Falls are among the most common risks faced by elderly and disabled people. A study conducted by the World Health Organization (WHO) in 2007 estimated that in people over 70 years of age, the probability of a fall event is as high as 42%, and 50% of them die unnaturally as a result of a fall [1]. Therefore, it is important to detect falls through technical means and provide timely assistance. Current fall detection methods can be
divided into three main categories. The first category is based on wearable devices, in which the monitored individual carries sensors; systems of this type use a wide range of technologies. The second category is based on ambient sensing, where sensors such as infrared or RF sensors [2] are placed around the monitored person. The third category is based on computer vision techniques, using cameras to capture video images for recognition; classical methods include Support Vector Machines (SVM) [3], Random Forests (RF) [4], and Artificial Neural Networks (ANN) [5]. With wearable devices, elderly people need to wear the detection devices continuously, yet they may forget to wear them or wear them improperly. Environment-based sensors have little impact on the user's activity, but the monitorable area is still limited (Fig. 1).
Fig. 1. General system architecture for image-based fall detection.
In summary, vision-based devices can replace the above sensors and provide a viable, cheaper solution. Cameras can also be installed in all rooms, with lower maintenance costs and easier replacement [6] (Table 1).

Table 1. Comparison of the three categories of current fall detection methods.

Compare angle   | Wearable technology                                                | Environmental placement                                    | Vision-based
Judgment signals | Signals from sensors worn on the body                             | Signals from sensors placed in the environment             | Image information
Hardware        | Accelerometers, pressure sensors, inclinometers, microphones, etc. | Pressure, acoustic, infrared, and RF sensors, etc.         | Various cameras: ordinary, depth, infrared, etc.
Advantages      | Variety of detection information and analysis methods              | Sensors do not affect the user being monitored             | Less influenced by people; visual information is more intuitive
Disadvantages   | The monitored subject may refuse or forget to wear them            | The equipment is influenced by the environment and sensor layout | Cameras are expensive; risk of information leakage
2 A Review of Vision-Based Fall Detection-Related Research and Techniques

After surveying related technologies, we classify the input image data commonly used in computer vision into three types: RGB, Depth, and IR. Common datasets and fall detection models for each type are introduced with examples in Sect. 2.1. For basic image processing, Sect. 2.2 illustrates common and effective template matching algorithms and feature fusion techniques from the perspective of tasks such as target tracking, motion prediction, and human skeleton detection. Section 2.3 focuses on video data processing, i.e., image data with a time axis (time-dependent sequential data), covering deep bidirectional LSTM and video motion detection.

2.1 Example of Methods Related to Classification by Input Image Data Type

We first collected the common datasets used in current fall detection research, summarized in Table 2.

Table 2. Summary of common fall datasets.

Dataset Name | Data Type | Features
Multi-camera fall dataset [7] | RGB | Multi-camera acquisition; contains fall simulations and normal daily activities in realistic situations
Le2i [8] | RGB | Realistic dataset of 191 videos containing fake falls and video frames without people
UR fall detection [9] | RGB | 30 fall sequences and 40 activities of daily living; fall events recorded with a Kinect camera
UP fall detection dataset [10] | RGB | 17 healthy young adults performing 11 activities, with data from wearable sensors, environmental sensors, and visual devices
SDU falls [11] | Depth | Kinect depth camera acquisition; six types of movements from ten subjects
TST Fall Detection [12] | Depth | Depth images and skeletal joint data collected with Microsoft Kinect v2
CMU Graphics Lab [13] | RGB and Depth | 2605 sequences in total, divided into 6 categories and 23 subcategories
IASLAB-RGBD fall dataset [14] | RGB and Depth | 15 different people, acquired in two different laboratory environments
Fall Detection Dataset [15] | RGB and Depth | Raw RGB and depth images from a single uncalibrated Kinect sensor; 8 different views in 5 different rooms with 5 different participants
CMD fall dataset [16] | RGB and Depth | Seven overlapping Kinect sensors and two wearable accelerometers; 20 activities from 50 subjects, multimodal multi-view data
Deep Learning Fall Detection Based on RGB Images with Parameter Optimization. G. Anitha and S. Baghavathi Priya proposed a new image-based fall detection system involving several operational stages: pre-processing of images, feature extraction, classification, and parameter optimization of the detection system [17] (Fig. 2).
Fig. 2. The working process of the VEFED-DL model.
To improve image quality and eliminate noise, the system processes the extracted frames at three levels: resizing, image enhancement, and min-max-based normalization [17].

IoT Fall Detection System Based on Deep Image HOG-SVM. Highly developed IoT technology and machine learning have enabled multimedia devices to be used in environments where vulnerable people need to be protected. Ritsumeikan University proposed a HOG-SVM (Histogram of Oriented Gradients-Support Vector Machine)-based fall detection IoT system for the elderly [18].

Non-invasive Multi-person Fall Detection Based on IR Images. Most studies on thermal vision-based fall detection focus on single-person occupancy scenarios, so they are not fully applicable to real life (Fig. 3).
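As a concrete illustration of the three-level preprocessing described for the VEFED-DL model, the following is a minimal pure-Python sketch of two of the stages, nearest-neighbour resizing and min-max normalization; the function names are our own and are not taken from [17]:

```python
def resize_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grayscale frame (list of lists)."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]


def min_max_normalize(frame):
    """Scale pixel values to [0, 1]; a constant frame maps to all zeros."""
    flat = [p for row in frame for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on constant frames
    return [[(p - lo) / span for p in row] for row in frame]
```

In a real system these steps would be applied to every frame extracted from the video stream before feature extraction.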
Fig. 3. Multi-Person Fall Detection (MoT-LoGNN) Method Flow.
The Key Laboratory of Computational Intelligence and Cyberspace Information of the South China University of Technology proposed a non-invasive thermal vision-based fall detection method for multiple people, which consists of four components: T-LoGNN, a fine-tuning mechanism, a multi-occupancy decomposer (MOD), and a sensitivity-based sample selector (SSS) [19].

2.2 Basic Image Processing in Fall Detection

Target Tracking Based on Template Matching Algorithm. Moving target detection is the basis for higher-level tasks such as target tracking and behavior recognition [20]. Intelligent video surveillance can automatically detect, identify, and track targets in video scenes without human intervention [21], using the computing power of computers combined with other technologies. The template matching algorithm uses the optimized sum of absolute differences (OSAD) to detect and recognize objects, with high tracking accuracy, stable performance, and independence from illumination conditions [22]. Separating the target from the background makes target detection easier, and the size of the tracking window can be adjusted according to the distance between the target and the camera [23]. Assuming the average distance in the previous frame has been obtained and that the target will not move beyond that range in the next frame, all pixels in the tracking window whose depth lies within that range can be selected and used to compute the new distance [23] (Fig. 4).
Fig. 4. Average distance measurement schematic
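The sum-of-absolute-differences criterion underlying the OSAD matcher can be sketched as follows. This is a plain, unoptimized exhaustive search (the "optimized" variants in [22] add acceleration tricks not shown here), and the toy frame layout is illustrative only:

```python
def sad(patch, template):
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(patch[i][j] - template[i][j])
               for i in range(len(template))
               for j in range(len(template[0])))


def match_template(frame, template):
    """Slide the template over the frame; return the (row, col) of the
    window that minimises the SAD score."""
    th, tw = len(template), len(template[0])
    best, best_pos = float("inf"), (0, 0)
    for r in range(len(frame) - th + 1):
        for c in range(len(frame[0]) - tw + 1):
            patch = [row[c:c + tw] for row in frame[r:r + th]]
            score = sad(patch, template)
            if score < best:
                best, best_pos = score, (r, c)
    return best_pos
```

Because the score is a difference of intensities within the window, a uniform brightness shift affects all candidate windows similarly, which is one reason SAD-style matching is relatively robust to illumination.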
Action Recognition Based on Feature Fusion and Human Skeleton Detection. Most current action recognition methods can be divided into three categories:
depth sequence-based, human skeleton-based, and feature fusion-based. Depth sequence-based methods use various deep learning models to process and judge the RGB-type image information; their main advantage is the richness of appearance information. Human skeleton-based approaches describe an action through the changes in human joint points between video frames, including the relative positions and appearance changes of the joints [24] (Fig. 5).
Fig. 5. Workflow of Skeletal Recognition.
The main problem with skeleton-based detection is that when occlusion occurs in the scene, the estimated joint points can be lost, which affects the action recognition results. Fusing skeletal features with depth information features can effectively overcome skeletal feature errors due to occlusion and perspective changes [25]. The main issue in feature fusion is how multiple types of data can be fused so that each contributes effectively. However, multimodal data fusion requires handling larger data volumes, higher feature dimensionality, and more complex action recognition computations [26].

2.3 Typical Video Data Processing Methods

Video Sequence Action Recognition Based on Deep Bidirectional LSTM. Videos are sequential data in which the motion of the visual content is represented across a sequence of frames, and the frame sequence helps to understand the context of an action. With long sequences, networks tend to forget the earlier inputs of the sequence, a problem known as the vanishing gradient problem; LSTM can be used to address it [27]. Sejong University proposed a novel approach to action recognition that processes video data using convolutional neural networks (CNN) and deep bidirectional LSTM (DB-LSTM) networks [28]. In a bidirectional LSTM, the output at time t depends not only on the previous frames in the sequence but also on the upcoming frames. Structurally, a bidirectional RNN (Recurrent Neural Network) stacks two RNNs together, one processing the sequence forward and the other backward.

DNN-Based Video Motion Monitoring. DNNs are suitable for problems involving time series, and videos are time-dependent, so video motion detection requires
using the current frame, previous frame, and next frame of a given video. The Department of Computer Science at Auckland University of Technology proposed a deep learning-based model that combines CNN and RNN into a DNN (Deep Neural Network) for video motion detection [29]. Integrating CNN and RNN can significantly reduce the size of the video data and the training time. However, the system only implements dynamic video detection; it cannot perform real-time object tracking or dynamic event recognition, and it requires a large number of real videos for training and testing to produce more accurate results.
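The bidirectional pattern described above, where the feature at time t reflects both earlier and later frames, can be illustrated with a toy one-unit tanh RNN standing in for a full LSTM; the weights here are arbitrary illustrative values, not trained parameters:

```python
import math


def rnn_scan(xs, w_x=0.5, w_h=0.8):
    """Run a one-unit tanh RNN over a scalar sequence; return all hidden states."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(xs):
    """Per-time-step features combining a forward pass with a backward pass
    (the backward states are re-aligned to the original frame order)."""
    fwd = rnn_scan(xs)
    bwd = list(reversed(rnn_scan(list(reversed(xs)))))
    return list(zip(fwd, bwd))
```

A DB-LSTM follows the same scheme, but with LSTM cells, vector-valued states, and several stacked layers.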
3 Analysis of Other Influencing Factors

3.1 Camera Choice

RGB Cameras and RGB-D Cameras. Traditional RGB color cameras can only capture image data within the camera's field of view, recording the R, G, and B values of each pixel and imaging as a 2-dimensional color image. The disadvantage is that the acquired image information is extremely limited. An RGB-D depth camera (also called a 3D camera) can obtain RGB images and depth images at the same time.

Depth Camera Comparison. Depth cameras can be divided into three types according to their principles: active projection structured light, passive binocular, and Time of Flight (TOF) measurement (Table 3).
Table 3. Comparison of the main performance and parameters of depth cameras based on three different principles.

Principle            | Active projection structured light | Passive binocular | Reflection time (TOF)
Measurement accuracy | Decreases with increasing distance | 0.01 mm to 1 cm at short range | Stable at about 1 cm
Dark environment     | Applicable | Not applicable | Applicable
Main advantages      | Low power consumption, low cost, suitable for low-light conditions | Low hardware cost; usable both indoors and outdoors | Longer measuring distance while maintaining accuracy; directly outputs 3D data of the measured object
Main disadvantages   | Poor accuracy at long distances; strong light interferes with the projected pattern | Strongly affected by very bright or low light and by textureless scenes, which can cause matching failure; complicated algorithm | Stable but not highly accurate; demanding time measurement; basically unusable under bright outdoor light
Applicable scenarios | Smartphone front cameras, face recognition, AR/VR, etc. | Driverless vehicles, gesture recognition, depth detection, etc. | Dynamic scenes, driverless vehicles, smartphone rear cameras, etc.
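For the passive binocular column above, depth follows from triangulation: Z = f * B / d, where f is the focal length in pixels, B the camera baseline, and d the pixel disparity. A minimal sketch (parameter names are ours) that also shows why accuracy degrades with distance, since one pixel of disparity quantization maps to a much larger depth step far from the camera:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulated depth for a passive binocular rig: Z = f * B / d."""
    if disparity_px <= 0:
        # zero/negative disparity corresponds to a failed stereo match
        raise ValueError("matching failed: non-positive disparity")
    return focal_px * baseline_m / disparity_px
```

With f = 700 px and B = 0.5 m, a target at disparity 70 px is 5 m away, while the step between disparities 2 px and 3 px spans tens of metres, which is the quantitative face of "measurement accuracy decreases with increasing distance".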
Infrared Camera Types. The biggest advantage of an infrared lens is that it can be used in night-vision scenes. Choosing the infrared illumination is a very important issue when selecting an infrared camera; the camera, lens, and power supply all need to be considered together (Table 4).

Table 4. Infrared imaging principle and infrared lens selection.

Classification
- Passive infrared imaging: detects the target's own infrared radiation, relying only on the distribution of the object's temperature.
- Active infrared imaging: infrared lights produce infrared radiation to irradiate the object.

Considerations
- Infrared sensing: infrared light causes color shifts on a color CCD, so ordinary color cameras filter it out; color infrared cameras require a dual-peak single filter.
- Lens selection: an improperly matched sensor and lens produces dark corners or wastes lens angle; the lens angle also needs to match the emission angle of the selected infrared light.
- Power supply selection: the supply should exceed the overall system requirement by at least 20%; below this value the power supply runs at full load.
3.2 Individual Influence of the Identified Object

Basic User Parameters. In practical applications, the system may not perform as expected on the prepared training data because of differences among users and usage scenarios. In the system proposed by the Ritsumeikan University Institute of Science and Technology to address this problem [30], the basic parameters of the user are computed by the edge node and sent to the cloud; the cloud server then calculates the best detection model and sends it back to the edge node. Because the amount of data in the overall test condition is small, as much data as possible needs to be collected for subsequent model training. At the same time, experimenters with similar body sizes should be selected to reduce the influence of individual differences such as height and weight on the model.
3.3 Environmental Effects During Recognition

Light Condition Changes. Improving the accuracy of fall detection in complex environments, such as rooms with changing light conditions, is an important issue for RGB image-based fall detection. The temporal evolution of visual data can be handled by using dynamic images [31]. Sagar Chhetri, Abeer Alsadoon, et al. from the School of Computing and Mathematics at Charles Sturt University proposed a mechanism that
can improve the performance of image preprocessing by capturing each dynamic action in a video into a single image using a dynamic optical flow technique [32] (Table 5).
Table 5. Comparison of the prior art with the solution proposed by this system [32].

Comparative aspect | Current solution | Proposed solution
Method name     | TV-L1 optical flow algorithm | Enhanced dynamic optical flow algorithm
Accuracy        | Improved fall detection sensitivity under stable lighting conditions | Improved fall detection sensitivity under dynamic lighting conditions
Processing time | Higher processing time and required processing power in the pre-processing stage | Reduced processing time in the image pre-processing stage using enhanced dynamic optical flow
The proposed approach addresses the conversion of video into image sequences, an operation that demands high processing power, and can reduce the processing power required for preprocessing. Encoding the temporal data of the optical flow video by rank pooling not only reduces processing time but also improves the performance of the fall detection classifier under various lighting conditions.

Impact of Camera Height and Layout on Recognition Accuracy. Cameras placed in a low position may suffer from occlusion and a single monitoring view. Because rooms differ, users will not set cameras or sensors at the same height. Ritsumeikan University proposed the enhanced tracking and denoising Alex-Net (ETDA-Net) algorithm to improve the related performance [33]. In later research, a camera-height-adaptive calculation algorithm could be introduced to measure the camera height accurately and feed it to the model, improving the model's adaptability to image input from different heights. To adapt to single-camera conditions, a late fusion technique has been proposed that can improve the accuracy of existing fall detection systems [34] (Fig. 6).
Fig. 6. Experimental design diagram
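The rank-pooling encoding mentioned above, which collapses the temporal evolution of a clip into a single "dynamic image", is often approximated with fixed per-frame weights alpha_t = 2t - T - 1 (an approximation in the style of Bilen et al.; not necessarily the exact variant used in [32]):

```python
def dynamic_image(frames):
    """Collapse T equally sized grayscale frames (lists of lists) into one
    image using approximate rank-pooling weights alpha_t = 2t - T - 1."""
    T = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - T - 1  # negative for early frames, positive for late ones
        for r in range(h):
            for c in range(w):
                out[r][c] += alpha * frame[r][c]
    return out
```

Because the weights sum to zero, a perfectly static scene produces an all-zero dynamic image; only pixels whose intensity changes over time survive, which is what makes the representation useful under varying lighting.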
4 Research Outlook

4.1 Image Data Type Selection

Image data type selection can be considered from two perspectives: detection effect and practical application. For detection effect, the RGB type is a classical data type in image processing, while the Depth type can address the privacy and ambient-light problems of the RGB type; the biggest difficulty with the IR type is the scarcity of public datasets. When conditions allow, an RGB-D camera can acquire both RGB and Depth image data inputs: RGB images capture color and appearance information, and Depth images handle scenes with changing illumination conditions [35]. For wider practical application, if a portable fall detection module is added to an existing intelligent surveillance system, most image data in the surveillance systems currently used in real life are RGB images, for which public datasets are rich.

4.2 Target Detection Module

Target detection methods fall into two mainstream types. One-stage algorithms, such as the YOLO series and SSD, directly predict the class and location of different targets using a single CNN. Two-stage algorithms, such as the R-CNN series based on candidate regions (region proposals), first generate candidate regions using Selective Search or a CNN-based Region Proposal Network (RPN), and then perform classification and regression. The first type detects faster but with relatively lower accuracy; the second type is relatively more accurate but slower.

Mask R-CNN Combined with Human Skeleton Detection. R-CNN is a classical target detection algorithm that has undergone three progressive stages of development: R-CNN, Fast R-CNN, and Faster R-CNN, with the structures contrasted in Fig. 7.
Fig. 7. Schematic diagram of R-CNN, Fast R-CNN, and Faster R-CNN frameworks
From three separate stages at the beginning to the final unification into one network, the parameters and operations were reduced and detection speed accelerated (Fig. 7). Mask R-CNN, which emerged subsequently, continues the development of Faster R-CNN. In the paper introducing Mask R-CNN, the authors also combined it with keypoint detection [36]. A similar idea was proposed by Sara Mobsite et al. in 2018, who used Mask R-CNN to output a human silhouette image, reducing unnecessary processing in subsequent steps [37]. In the OpenPose-based skeleton detection and LSTM/GRU fall detection framework proposed by Chuan-Bi Lin et al., linear interpolation is used to compensate for missing joint points [38]. When the skeleton is extracted from image features captured by an ordinary RGB camera, overlapping bodies, occlusion, and unclear body contours can all cause losses and errors in the generated skeleton.

YOLO Series Selection. The advantage of the YOLO series over two-stage algorithms is its ability to automatically extract features and complete target-box detection and end-to-end category prediction in one pass. YOLO can achieve state-of-the-art performance while remaining highly competitive in inference speed. Most fall detection systems currently using the YOLO family adopt the YOLO v5 algorithm, which is easier to adapt and to improve in detection accuracy; in particular, the YOLO v5s model, with smaller depth and width multipliers, helps reduce deployment cost [39]. Few attempts have been made to use the newer YOLOX; according to the data in its paper, detection speed can reach the millisecond level [40], which meets the target detection and tracking requirements of fall detection.
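The linear interpolation used by Lin et al. [38] to compensate for lost joint points can be sketched as follows; the track format, per-frame (x, y) tuples with None for missed detections, is our own illustrative choice:

```python
def interpolate_joint(track):
    """Fill None gaps in one joint's (x, y) track by linear interpolation
    between the nearest detected frames; gaps at the ends are left as None."""
    known = [i for i, p in enumerate(track) if p is not None]
    filled = list(track)
    for i, p in enumerate(track):
        if p is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None or nxt is None:
            continue  # cannot interpolate without a frame on each side
        w = (i - prev) / (nxt - prev)
        filled[i] = tuple(a + w * (b - a) for a, b in zip(track[prev], track[nxt]))
    return filled
```

In a full pipeline, this would be applied per joint to the OpenPose output before feeding the sequence to the LSTM/GRU classifier.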
4.3 Falling Action Judgment

Improvements to Current LSTM-Based Methods. For detection that must process video stream data, LSTM handles time-series tasks better than CNNs or plain RNNs. LSTM also solves the long-term dependency problem of RNNs and alleviates the vanishing and exploding gradients caused by RNN backpropagation during training. However, training is more time-consuming, as the LSTM model structure is relatively complex for a fall detection system that needs fast detection. Adding more LSTM units to the network can improve model performance, but computational complexity increases accordingly: increasing from 24 to 512 units makes the LSTM computation approximately 10 times slower [41].

Advantages of the GRU (an LSTM Variant). Classifying fall events requires both temporal and spatial features. Recurrent neural networks can extract temporal features by remembering the necessary information from the past, but vanishing and exploding gradients may occur. A Gated Recurrent Unit (GRU) network, with its update and reset gates, can solve
this problem by deciding which information is passed on as output vectors. The GRU is an LSTM variant with a simpler architecture, so it is faster than the LSTM in unit operations and detection speed. Because the LSTM is complex, many variants have been derived, of which the GRU is the most commonly used; other variants can be tried for comparison in subsequent research.
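The update/reset-gate mechanism that makes the GRU cheaper than the LSTM can be shown with a minimal scalar GRU step following the standard GRU equations; the explicit weight dictionary (biases omitted) is our own simplification for illustration:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One scalar GRU step. p holds input/recurrent weights for the update
    gate z, the reset gate r, and the candidate state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)            # update gate: blend old vs new
    r = sigmoid(p["wr"] * x + p["ur"] * h)            # reset gate: how much past to use
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h))
    return (1 - z) * h + z * h_cand                   # gated combination
```

Compared with an LSTM cell, there is no separate cell state and one fewer gate, which is where the reduction in per-step computation comes from.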
4.4 Application of the Transformer to the Vision Domain

Although the LSTM solves the limited memory length of the RNN, it still cannot be parallelized: the data at time t0 must be computed before the data at time t0 + 1. Google proposed the Transformer to replace previous temporal memory networks [42]; its memory span can be effectively unlimited if hardware permits, and its parallelized implementation greatly accelerates training. The first model proposed for the vision domain was the Vision Transformer (ViT), which applies the standard Transformer directly to images with as little modification as possible. When the model is pre-trained at a large enough scale and transferred to a classification task with fewer data points, accuracy improves significantly [43]. The Swin Transformer [44] has made significant progress on a variety of image data processing tasks in the CV domain, mainly in three categories: image classification, target detection, and semantic segmentation.
Fig. 8. Schematic diagram of Swin Transformer downsampling operation
Using different downsampling rates in the backbone helps build target detection, instance segmentation, and other tasks on this basis. Unlike ViT, where Multi-Head Self-Attention (MSA) is performed directly on the whole feature map, the Swin Transformer uses Windows Multi-Head Self-Attention (W-MSA) to divide the feature map into multiple disjoint regions, called windows in the paper; Multi-Head Self-Attention is performed only within each window, reducing the computation when the shallow feature map is large (Fig. 9). In early 2022, Microsoft Research Asia proposed Swin Transformer V2 [45] in response to three major problems of the Swin Transformer in training and applying large visual models: training instability, the resolution gap between pre-training and fine-tuning, and the large demand for labeled data; corresponding solutions and improvements were proposed for each. The application of the Transformer in the vision field is thus developing rapidly and has great potential.
Fig. 9. Shifted Windows Multi-Head Self-Attention (SW-MSA) schematic
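Swin's window partition, which restricts self-attention to disjoint win x win regions, amounts to a simple reshape. A sketch on a list-of-lists feature map (H and W are assumed divisible by the window size; real implementations operate on batched tensors with channel dimensions):

```python
def window_partition(feature_map, win):
    """Split an H x W feature map into non-overlapping win x win windows,
    scanned row-major, as in Swin's W-MSA."""
    H, W = len(feature_map), len(feature_map[0])
    windows = []
    for r0 in range(0, H, win):
        for c0 in range(0, W, win):
            windows.append([row[c0:c0 + win]
                            for row in feature_map[r0:r0 + win]])
    return windows
```

Attention computed inside each window costs on the order of H * W * win**2 instead of (H * W)**2 for global MSA, which is the saving the text refers to for large shallow feature maps.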
4.5 Future Research

Many insights for future optimization research can be drawn from the updates of earlier methods. First, inspired by the continuous evolution of R-CNN, an important optimization idea is to improve the comprehensiveness of the network while reducing the required parameters as much as possible. Second, following the continuous updates of the YOLO series, the loss computation in the training stage can be simplified through equivalent conversion or simplification of its mathematical formulation. Much current research in the vision field not only improves and evolves classical methods but also gradually produces completely new structural models that differ from traditional ones: the traditional LSTM, for example, has spawned many variants, alongside the entirely new Transformer. Another example is human pose estimation, where the classical regression-based approach predicts joint positions directly, while heat-map-based approaches, which predict the score of each keypoint appearing at each position, have gradually emerged. More and more methods have appeared in the field, but the goal has always been to improve detection accuracy and speed. At the same time, more researchers are focusing on improving the generality of methods for more usage scenarios; the three points above can guide future research.

Acknowledgment. This work was supported by grants from the National Natural Science Foundation of China (grant no. 61977012), the China Scholarship Council (grant no. 201906995003), the Central Universities in China (grant no. 2021CDJYGRH011), and the Key Research Programme of Chongqing Science & Technology Commission (grant no. cstc2019jscx-fxydX0054).
3D Gaze Vis: Sharing Eye Tracking Data Visualization for Collaborative Work in VR Environment

Song Zhao, Shiwei Cheng(B), and Chenshuang Zhu

Zhejiang University of Technology, Hangzhou 310023, China
[email protected]
Abstract. Conducting collaborative tasks, e.g., multi-user games, in virtual reality (VR) can enable more immersive and effective experiences. However, in current VR systems users cannot communicate properly with each other via their gaze points, which interferes with their mutual understanding of intentions. In this study, we aimed to find the optimal eye tracking data visualization, one that minimized cognitive interference and improved the understanding of visual attention and intention between users. We designed three different eye tracking data visualizations, gaze cursor, gaze spotlight and gaze trajectory, in a VR scene for a course on the human heart, and found that the gaze cursor from doctors could help students learn complex 3D heart models more effectively. To explore further, pairs of students were asked to finish a quiz in the VR environment while sharing gaze cursors with each other, and they achieved higher efficiency and scores. This indicates that sharing eye tracking data visualization can improve the quality and efficiency of collaborative work in the VR environment.

Keywords: Gaze fixation · computer supported collaborative learning · information visualization · medical visualization
1 Introduction

Virtual reality (VR) technology provides users with extraordinarily immersive entertainment. Software and hardware developers have also made great efforts to improve the experience, for example, adding auditory, haptic and visual approaches to make games more enjoyable, and realizing two-player or even multi-player online VR modes to improve communication between players. However, a major challenge remains: how to let users collaborate with each other as naturally and conveniently as they do in daily life.

In daily life, people collaborate with each other in many ways, among which eye contact is one of the most natural and effective. Through it, people can easily understand which region and object others are currently focusing on. However, only a few existing VR collaboration studies use eye tracking as a collaboration technique. A major problem in VR is that users cannot communicate through eye contact as they do in real life. Users in a VR scene cannot acquire any information about each other's gaze point. When they are discussing a phenomenon they are looking at, neither of them will know whether the other is getting the wrong information, let alone give any correction or reminder. By visualizing eye tracking data in VR, the other user's gaze information can be obtained, which improves the efficiency of collaboration between two users and makes the interaction process as natural as in a real scene.

We proposed a technique for sharing real-time eye tracking data visualization between users in a collaborative VR environment. We built three different eye tracking data visualization modes as well as a no-eye-tracking mode, and compared the effectiveness of user collaboration with and without shared eye tracking data visualization. The contribution of this study is the finding that the gaze cursor serves collaborative users best, improving the efficiency and quality of collaborative work.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 610–621, 2023. https://doi.org/10.1007/978-981-99-2385-4_46
2 Related Work

Non-verbal cues such as gaze play an important role in our daily communication: gaze is not only a way of expressing intention but also a way of conveying information to others. Špakov et al. [1] conducted a study on sharing visual attention between two players in a collaborative game, so that one player's focus area was visible to the other player. They investigated the difference between using head direction and eye gaze to estimate the point of attention; the results showed that task duration was shorter when sharing eye gaze than when sharing head direction, and subjective ratings of teamwork were better in the high immersion condition. Wang et al. [2] investigated the use of gaze in a collaborative assembly task in which a user assembled an object with the assistance of a robot. They found that being aware of a companion's gaze significantly improved collaboration efficiency: when gaze communication was available, task completion time was much shorter than when it was unavailable. Newn et al. [3] tracked the user's gaze in strategic online games, where eye-based deception added difficulty and challenge to the game. D'Angelo et al. [4] designed novel gaze visualizations for remote pair programming, with which the programmers spent more time viewing the same code lines concurrently. They also designed gaze visualizations for remote collaboration that show collaborators where each is looking in a shared visual space.

Various eye tracking devices have been used in single-user VR studies to accomplish different tasks. Kevin et al. [5] proposed a simulation of eye gaze in VR to improve the immersion of interaction between users and virtual non-player characters (NPCs).
They developed an eye tracking interaction narrative system centered on the user's interaction with a gaze-aware avatar that responds to the player's gaze, simulating real human-to-human communication in a VR environment, and made preliminary measurements based on the users' responses. This study demonstrated that users had a better experience during VR interactions with eye tracking. Boyd et al. [6] explored the effects of eye contact in immersive VR on children with autism, developing an interaction system based on eye tracking communication between the children and avatars. Visual attention prediction is crucial for predicting performance in motion: Heilmann et al. [7] investigated the difference between stimulus presentation and motor response in eye tracking studies, and examined the possibility of representing this relationship in VR. Llanes-Jurado et al. [8] proposed a calibration algorithm that can be
applied to further experiments on eye tracking integrated into head-mounted displays and presented guidelines for calibrating the fixation point recognition algorithm.
3 Eye Tracking Data Visualization for Collaboration

3.1 Eye Tracking Method

We implemented eye tracking using a method based on pupil center corneal reflection (PCCR) [9]. An infrared camera captured the user's eye image, after which the pupil center and the Purkinje image were localized. The PCCR vector was calculated from the position of the pupil center and the coordinates of the center of the Purkinje image in the eye image [10]. The obtained PCCR feature vector was fed into the ray-tracing module of the VR scene; the ray derived from the feature vector, denoted X, represents the direction of the user's gaze. By re-establishing the local geometry and performing collision detection, we calculated the coordinates of the collision point P, the intersection of ray X with the scene.

3.2 Eye Tracking Data Visualization

In a virtual reality environment, if the user's visual attention behavior can be observed intuitively with eye tracking, visual perception and cognition analysis in complex 3D scenes becomes much more convenient. Based on previous studies [11, 12], we designed three kinds of eye tracking data visualization modes in virtual reality scenes: gaze cursor, gaze trajectory and gaze spotlight, as shown in Fig. 1.
Fig. 1. Visualization of gaze points: gaze cursor, gaze spotlight and gaze trajectory (from left to right).
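As an illustration only (the paper gives no code), the gaze-to-scene mapping of Sect. 3.1, intersecting gaze ray X with scene geometry to obtain collision point P, might be sketched as follows. The sphere collider and all names here are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: map a gaze direction ray to a 3D collision point P.
import numpy as np

def gaze_collision(origin, direction, sphere_center, sphere_radius):
    """Intersect gaze ray X (origin + t * direction) with a sphere collider.

    Returns the nearest intersection point P, or None if the ray misses.
    """
    d = direction / np.linalg.norm(direction)   # normalize gaze direction
    oc = origin - sphere_center
    b = 2.0 * np.dot(oc, d)
    c = np.dot(oc, oc) - sphere_radius ** 2
    disc = b * b - 4.0 * c                      # quadratic discriminant (a == 1)
    if disc < 0:
        return None                             # gaze ray misses the collider
    t = (-b - np.sqrt(disc)) / 2.0
    if t < 0:
        return None                             # collider is behind the viewer
    return origin + t * d                       # collision point P

# Example: eye at the origin looking down +z at a unit sphere 5 units away.
p = gaze_collision(np.array([0.0, 0.0, 0.0]),
                   np.array([0.0, 0.0, 1.0]),
                   np.array([0.0, 0.0, 5.0]), 1.0)
```

A real VR engine would test the ray against every collider in the scene and keep the nearest hit; the single-sphere case above shows only the core geometry.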
Gaze cursor: based on a blue sphere with a specific radius (e.g., radius 5 in the world coordinate system). This visualization mode is highly directional and gives a clear, concise, focused field of view [13].

Gaze trajectory: displays the eye saccades, so that the originally independent gaze points are shown in chronological order.

Gaze spotlight: covers a range (radius 40 in the world coordinate system) for local highlighting. This visual representation covers a larger area than the gaze cursor, drawing attention to the information around the gaze points.
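For illustration, the three modes could be encoded as a small configuration structure. The radii (5 and 40 world units) come from the text above; the field names and the trajectory radius are assumptions.

```python
# Hypothetical encoding of the three visualization modes described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class GazeVisMode:
    name: str
    radius: float          # world-coordinate radius of the rendered marker
    keeps_history: bool    # True if past gaze points stay visible over time

GAZE_CURSOR = GazeVisMode("gaze cursor", radius=5.0, keeps_history=False)
GAZE_SPOTLIGHT = GazeVisMode("gaze spotlight", radius=40.0, keeps_history=False)
GAZE_TRAJECTORY = GazeVisMode("gaze trajectory", radius=5.0, keeps_history=True)
```

The key contrasts from the text are captured directly: the spotlight covers a larger area than the cursor, and only the trajectory accumulates gaze points in chronological order.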
3.3 Prototype System for Collaboration

In our research, we needed to build collaborative scenes and share eye tracking data between collaborators. The prototype system was developed with the Unity3D engine, which provides convenient access to external devices. The system recorded and processed 3D scene data and the users' eye tracking data. Eye tracking modules, including infrared (IR) LEDs, IR lenses, and high definition (HD) cameras, were added to the head mounted display (HMD) devices to capture the users' eye tracking data [14] (as shown in Fig. 2), and the high-precision eye tracking method used in this study ensured an accuracy of 0.5° of visual angle. We used a framework based on the server-client model to synchronize simulations over the network. The users shared the same viewing angle in the VR scene, which helped them increase their sense of presence, eliminate motion sickness, and facilitate collaboration.
Fig. 2. VR HMD with eye tracking module in our study.
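As a language-neutral sketch of the server-client synchronization idea (the actual prototype used Unity3D networking; the message format, roles and names here are assumptions), one client's gaze sample can be relayed to the other through a small server:

```python
# Illustrative sketch: a relay server forwards one user's gaze sample to the
# other user. Not the prototype's actual networking code.
import json
import socket
import threading

def serve_once(server_sock):
    """Accept two clients and relay one gaze sample from A to B."""
    conn_a, _ = server_sock.accept()       # user A connects first
    conn_b, _ = server_sock.accept()       # then user B
    sample = conn_a.recv(1024)             # gaze point sent by user A
    conn_b.sendall(sample)                 # forward it to user B
    conn_a.close()
    conn_b.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))              # ephemeral port on localhost
server.listen(2)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

user_a = socket.create_connection(("127.0.0.1", port))
user_b = socket.create_connection(("127.0.0.1", port))
user_a.sendall(json.dumps({"user": "A", "gaze": [0.1, 0.2, 4.0]}).encode())
received = json.loads(user_b.recv(1024))   # user B now sees A's gaze point
user_a.close()
user_b.close()
server.close()
```

In a real system each client would stream samples continuously (and in both directions) so the partner's gaze cursor can be rendered in real time.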
4 Experiment

4.1 VR Scene

We designed 3D heart models in VR based on a healthy heart and various diseased hearts, to set up a controlled experiment around a lesson on heart knowledge with different eye tracking data visualizations (gaze cursor, gaze trajectory, gaze spotlight). Users could learn about heart structure and disease in each eye tracking data visualization mode and in a no-visualization mode. After the experiment, we compared which eye tracking data visualization mode was optimal in VR. Each user was required to identify the heart model in VR and finish a quiz about heart structure and disease. Answer time, answer scores and eye tracking data were all recorded during the experiment.

4.2 Experiment 1: Optimal Eye Tracking Data Visualization Modes

In this experiment, we invited a doctor (30 years old, female) from a local hospital. A VR HMD with an eye tracking module was used.
First, after calibrating the eye tracker, the doctor used the VR HMD for a few minutes to become familiar with it, and then she gave a lecture about heart structure and related diseases in the VR environment. The lecture was recorded repeatedly, with and without the doctor's eye tracking data visualization.

Participants: We recruited 40 participants (26 males and 14 females, aged between 19 and 25) from the local participant pool. All participants had normal or corrected-to-normal vision and no knowledge of heart structure and disease; 9 participants were familiar with VR and eye tracking. Before the experiment began, each participant signed an informed consent form and filled out a short background questionnaire.

Groups: With the eye tracking data visualization as the independent variable, we divided the participants into 4 groups (10 participants each). In addition, we added the speech of the lecture in each eye tracking data visualization mode as well as the no-visualization mode. The doctor was required to teach the same content and keep her visual attention behavior as similar as possible across lectures:

Group 1: gaze trajectory + speech;
Group 2: gaze spotlight + speech;
Group 3: gaze cursor + speech;
Group 4: no eye tracking data visualization + speech.

Procedure: Before the experiment began, participants were allowed to spend a few minutes familiarizing themselves with the VR HMD. First, the participant was asked to learn from the doctor's teaching videos about heart structure, mitral stenosis, and aortic septal defect. Then the participant was required to wear the VR HMD and finish the quiz, which asked the participant to use the handle to point out specific parts of the heart model (i.e., the coronary artery, aorta, pulmonary artery, superior and inferior vena cava, left ventricle, right atrium, aortic valve, and mitral valve).
Second, the participant was asked to learn about heart diseases (e.g., symptoms caused by myocardial necrosis, mitral stenosis, aortic septal defect, atrial septal defect and ventricular septal defect) from the doctor's teaching videos with her eye tracking data visualizations. Then the participant was again required to wear the VR device to answer a quiz. The quiz presented heart models with different heart diseases: atrial septal defect, mitral stenosis, ventricular septal defect and a normal heart. Participants had to select the correct name of the disease accordingly.

We recorded participants' answer scores, completion time and eye tracking data while they answered the quizzes in these two steps. After completing the two quizzes, a questionnaire was used to collect participants' subjective feedback on learning with VR eye tracking data visualization.

4.3 Experiment 2: Collaboration with Eye Tracking Data Visualizations

After we obtained the optimal eye tracking data visualization in Experiment 1, we conducted a collaborative work experiment on VR eye tracking data visualization.
Participants: We recruited 20 participants (12 males and 8 females, aged between 19 and 22) from the local participant pool. All participants had normal or corrected-to-normal vision and no knowledge of heart structure and disease. No participants were familiar with VR and eye tracking. Each participant signed an informed consent form and filled out a short background questionnaire.

Groups: We randomly divided the participants into 2 groups:

Experimental group: with eye tracking data visualizations + speech (5 pairs, 10 participants in total);
Control group: without eye tracking data visualizations + speech (5 pairs, 10 participants in total).

Procedure: Before the experiment began, each participant was allowed to spend a few minutes familiarizing themselves with the VR HMD. Then each pair (two participants) was asked to wear VR HMDs, as shown in Fig. 3.
Fig. 3. Paired participants conducted collaborative work in VR.
In the VR environment, a diseased heart model was presented, and each participant needed to cooperate with the partner to recognize the disease. Participants in the experimental group could observe each other's eye tracking data visualization, while participants in the control group could not and could only work together through free talking. During the experiment, we recorded participants' eye tracking data in both groups, and further analyzed the quality of collaboration based on their answers, communication records and eye tracking data.
5 Results

5.1 Optimal Eye Tracking Data Visualization

To find the eye tracking data visualization that accurately conveyed the partner's visual attention while avoiding excessive visual interference, we analyzed the results of Experiment 1.
It can be seen from Table 1 that the correct rates in the heart structure and heart disease quizzes under the gaze cursor were 76% and 82.5%, respectively, clearly superior to the other three visualization modes, while the average answer time for the heart structure quiz under the gaze cursor (60.88 s) was a middle value among all the modes. We also found that, for each experimental condition, the correct rate of the heart disease quiz was equal to or higher than that of the heart structure quiz.

Table 1. Quiz results in different modes of eye tracking data visualization.

Visualization     | Avg. correct rate (structure quiz) | Avg. answer time (structure quiz) | Avg. correct rate (disease quiz) | Avg. answer time (disease quiz)
Gaze trajectory   | 40.00% | 56.62 s  | 40.00% | 124.284 s
Gaze spotlight    | 44.00% | 66.142 s | 70.00% | 122.477 s
Gaze cursor       | 76.00% | 60.88 s  | 82.50% | 118.57 s
No visualization  | 44.00% | 59.25 s  | 60.00% | 99.529 s
Fig. 4. Correctness rate for each question under each visualization mode in the heart structure quiz.
Figure 4 summarizes the correctness of each question in the heart structure quiz under the different modes of eye tracking data visualization. We found that the correctness rates for the aorta, left ventricle and right atrium were higher than for the rest, because these parts are more prominent and easier to recognize in the heart structure. In addition, for the remaining parts, the gaze cursor performed better than the other eye tracking data visualizations. Although for the aorta and right atrium the gaze cursor mode did not outperform the other modes, this was due to errors in the gaze point coordinates calculated by the eye tracking module; because the gaze cursor is small, such errors affect it more. Figure 5 summarizes the correctness rate for each question of the heart disease quiz under the different modes of eye tracking data visualization. We found that the correctness rates
Fig. 5. Correctness rate for each question under each visualization mode in the heart disease quiz.
under the gaze trajectory mode were poor, even worse than with no eye tracking data visualization. The reason is that the gaze trajectory caused obvious visual interference and distracted the participant's visual attention, so the participant could not read the partner's gaze data accurately; for example, in the cases of ventricular septal defect and atrial septal defect, the diseased areas in the heart model were small. On the other hand, the gaze cursor and the gaze spotlight yielded higher correctness rates, because these eye tracking data visualizations could accurately help participants find the diseased area.

We analyzed the correctness rates of the visualization modes with a one-way ANOVA. In the heart structure quiz, we found a significant difference among the four eye tracking data visualization modes (p < 0.05). Similarly, in the heart disease quiz, we also observed a significant difference (p < 0.001) across the four modes. Both results show that quiz performance differed significantly with the eye tracking data visualization mode in VR. However, we did not find a significant difference in the answer time of the heart structure quiz (p = 0.745) or the heart disease quiz (p = 0.428), indicating that the visualization mode did not affect how quickly participants finished the quizzes. Therefore, we further conducted a post-hoc LSD test to compare the correctness rates of the gaze cursor mode with the other three modes in the heart structure and heart disease quizzes. As shown in Table 2, in the heart structure quiz there was a significant difference between gaze cursor and gaze trajectory (p < 0.001), between gaze cursor and gaze spotlight (p < 0.005), and between gaze cursor and no visualization (p < 0.005).
In the heart disease quiz, a significant difference was observed between gaze cursor and gaze trajectory (p < 0.001) and between gaze cursor and no visualization (p < 0.05); however, no significant difference was observed between gaze cursor and gaze spotlight (p = 0.184). These results suggest that the gaze cursor is the optimal eye tracking data visualization mode in VR.
Table 2. Significant differences between gaze cursor and other eye tracking data visualization modes.

                     | Gaze trajectory | Gaze spotlight | No visualization
Heart structure quiz | p < .001        | p < .005       | p < .005
Heart disease quiz   | p < .001        | p = .184       | p < .05
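The one-way ANOVA used above can be reproduced in outline with SciPy. The per-participant correct rates below are invented for illustration; they are not the study's data.

```python
# Sketch of a one-way ANOVA across four visualization groups (SciPy).
# Scores are hypothetical, not the study's data.
from scipy import stats

# Hypothetical per-participant correct rates (10 participants per group).
trajectory = [0.3, 0.4, 0.5, 0.4, 0.3, 0.4, 0.5, 0.4, 0.4, 0.4]
spotlight  = [0.4, 0.5, 0.4, 0.5, 0.4, 0.4, 0.5, 0.4, 0.4, 0.5]
cursor     = [0.7, 0.8, 0.8, 0.7, 0.8, 0.7, 0.8, 0.8, 0.7, 0.8]
none       = [0.4, 0.5, 0.4, 0.4, 0.5, 0.4, 0.4, 0.5, 0.4, 0.5]

f_stat, p_value = stats.f_oneway(trajectory, spotlight, cursor, none)
significant = p_value < 0.05   # reject H0 that all group means are equal
```

A post-hoc test (the paper uses LSD; pairwise t-tests with a multiple-comparison correction are a common substitute) would then compare the gaze cursor group against each of the other three.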
5.2 Analysis of Collaborative Work

In the collaborative task, participants could observe the gaze cursor of their partner and communicate more efficiently. We analyzed the participants' eye tracking data during the collaboration and compared their performance. Figure 6 shows the paired participants' eye tracking data in the gaze cursor mode and the no visualization mode as time series; the color and length of the bars indicate which part of the heart model a participant viewed and for how long. As Fig. 6 shows, 4 out of 5 pairs using the eye tracking data visualization answered the quiz successfully, versus 2 out of 5 pairs without it. In addition, collaborative work in the gaze cursor mode clearly took less time than in the no visualization mode. This indicates that shared eye tracking data visualization in VR helped participants finish the collaborative work more quickly.
Fig. 6. 20 participants’ eye tracking data between gaze cursor mode and no visualization mode Red box indicates the participants successfully answered the quiz.
Comparing the eye tracking data in the diseased area, we found that P1 (highlighted with the red box at the top of Fig. 6) shows that the eye tracking data visualization helped
the participants a1 and a2 answer the quiz successfully, and their eye tracking data distributions largely coincide, indicating that with the assistance of the gaze cursor mode, paired participants could find and follow their partner's gaze. In the no visualization mode, P6 (highlighted with the purple box at the bottom of Fig. 6) shows that participants B1 and B2 (or D1 and D2) had cluttered gaze fixations and rarely discussed or communicated during the collaboration; as a result, they spent much time and did not answer the quiz correctly. During communication, when a participant using the gaze cursor wanted to discuss a certain area of the heart, it was easy and accurate to let the partner find the exact area through eye tracking. In the no visualization mode, by contrast, participants tended to spend a longer time making their partner understand and find the exact area, which was inefficient and time-wasting.

Furthermore, in the collaborative scenario, participants were found to actively observe their partner's eye tracking data visualization. They followed the orientation of the partner's visualization and established an eye-tracking-based interaction when working together. Interestingly, not all participants realized they had already found the diseased area at first glance; only when they followed their partner's eye tracking data visualization did they realize it. For example, for P2 (shown in the yellow box in Fig. 6), the speech records show that participant C2, who saw the diseased area first, did not realize she had found the exact area, but after her partner C1 followed C2's eye tracking, C1 recognized that the focus of C2's gaze was on the diseased area and reminded C2 immediately.
In this process, they could collect information from each other's eye tracking data visualization and cooperate to complete the task. It was also observed that paired participants could help each other by looking at each other's eye tracking data visualization. For example, P3 (highlighted with the green box at the top of Fig. 6) shows that participants d1 and d2 noticed the diseased area early, but both were uncertain; they then resorted to each other's gaze and discussed whether this was a diseased area. In this way, the eye tracking data visualization helped enhance both participants' recognition of the diseased area. In the no visualization mode, for example, P5 (highlighted with the black box at the bottom of Fig. 6) shows that without eye tracking data visualization, for participants B1 and B2 (or C1 and C2), one observed the diseased area while the other did not. In this case, participants were prone to confusion. As one participant said in the post-experiment interview: "I did not know whether what I saw was the diseased area; I'm not sure if I needed to communicate with my partner, so I spent much time and did not complete the task in the end."

Eye tracking data visualization also required participants to think seriously and discuss carefully. P4 (highlighted with the blue box in the last line of Fig. 6) shows that both participants e1 and e2 had seen the diseased area, but they were distracted by each other's eye tracking data visualization. They did not discuss what they observed in time and only blindly followed each other's gaze, which wasted much time and led to the failure of the task.
620
S. Zhao et al.
6 Conclusion

The complex VR environment and model structure bring many challenges for users accomplishing collaborative tasks: users cannot communicate and conduct collaborative work as naturally as in the real world. This study designed eye tracking data visualizations and used them as visual attention indicators for paired users during collaboration. We found that the gaze cursor was the best visualization mode and applied it to facilitate collaborative work in the heart lecture scene; the experimental results showed that it could improve the quality and efficiency of collaboration in the VR environment.

Acknowledgement. The authors would like to thank all the volunteers who participated in the experiments. This work was supported in part by the National Natural Science Foundation of China under Grants 62172368 and 61772468, and the Natural Science Foundation of Zhejiang Province under Grant LR22F020003.
References

1. Špakov, O., et al.: Eye gaze and head gaze in collaborative games. In: Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, vol. 85, pp. 1–9 (2019)
2. Wang, H., Shi, B.E.: Gaze awareness improves collaboration efficiency in a collaborative assembly task. In: Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, vol. 85, pp. 1–9 (2019)
3. Newn, J., Allison, F., Velloso, E., et al.: Looks can be deceiving: using gaze visualisation to predict and mislead opponents in strategic gameplay. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol. 261, pp. 1–12 (2018)
4. D'Angelo, S., Gergle, D.: An eye for design: gaze visualizations for remote collaborative work. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol. 349, pp. 1–12 (2018)
5. Kevin, S., Pai, Y.S., Kunze, K.: Virtual gaze: exploring use of gaze as rich interaction method with virtual agent in interactive virtual reality content. In: Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, vol. 130, pp. 1–2 (2018)
6. Boyd, L.A.E., Gupta, S., Vikmani, S.B., et al.: vrSocial: toward immersive therapeutic VR systems for children with autism. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol. 204, pp. 1–12 (2018)
7. Heilmann, F., Witte, K.: Perception and action under different stimulus presentations: a review of eye-tracking studies with an extended view on possibilities of virtual reality. Appl. Sci. 11(12), 5546 (2021)
8. Llanes-Jurado, J., et al.: Development and calibration of an eye-tracking fixation identification algorithm for immersive virtual reality. Sensors 20(17), 4956 (2020)
9. Cheng, S., Ping, Q., Wang, J., Chen, Y.: EasyGaze: hybrid eye tracking approach for handheld mobile devices. Virt. Real. Intell. Hardw. 4(2), 173–188 (2022)
10. Cheng, S.W., Sun, Z.Q.: An approach to eye tracking for mobile device based interaction. J. Comput.-Aided Des. Comput. Graph. 26(8), 1354–1361 (2014). (in Chinese)
11. Mitkus, M., Olsson, P., Toomey, M.B., et al.: Specialized photoreceptor composition in the raptor fovea. J. Comparat. Neurol. 525(9), 2152–2163 (2017)
12. Neider, M.B., Chen, X., Dickinson, C.A., Brennan, S.E., Zelinsky, G.J.: Coordinating spatial referencing using shared gaze. Psychon. Bull. Rev. 17, 718–724 (2010)
3D Gaze Vis: Sharing Eye Tracking Data Visualization
621
13. Shuang, Z.C.: Research on collaborative interaction based on eye movement in virtual reality environment. Zhejiang University of Technology, Hangzhou (2020)
14. Cheng, S.W., Zhu, C.S.: Prediction and assistance of navigation demand based on eye tracking in virtual reality environment. Comput. Sci. 48(8), 315–321 (2021)
A Learning State Monitoring Method Based on Face Feature and Posture

Xiaoyi Qiao, Xiangwei Zheng(B), Shuqin Li, and Mingzhe Zhang

School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China
[email protected]
Abstract. With online learning and blended teaching being increasingly adopted on the Internet today, how to monitor learners' states and improve the interaction between learners and instructors has attracted much attention from researchers. In this paper, a learning state monitoring method based on face feature and posture (LSMFP) is proposed to improve learners' efficiency. Based on learners' online learning video, the blink frequency, yawn depth, and emotional distribution are computed from face features. Human posture assessment is combined with these features to determine and calculate learning concentration, and the learning states are inferred from the monitoring information. To verify the feasibility of the proposed method, a learning state monitoring and analysis system is designed and developed. The application results show that the proposed method can improve the interaction between learners and instructors, which in turn can improve learners' efficiency.

Keywords: Face Feature · Human Posture Assessment · Monitoring and Analysis

1 Introduction
China's education informatization continues to develop, and new computer technology should be actively used to solve practical problems in education [1]. It is necessary to combine practical problems, innovate teaching management platforms, improve the quality of classroom monitoring services, realize machine-assisted scientific and automatic management, and provide directions and practical references for the realization of personalized teaching and diversified development. Software intelligence is developing rapidly, and artificial intelligence is showing its usefulness in a variety of fields. In the field of education, new technologies can make learning more effective and accessible. Therefore, monitoring the learning process by technical means has research value [2]. There are many cases of monitoring and analyzing data for online learning, but research remains to be done on how to monitor

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 622–633, 2023. https://doi.org/10.1007/978-981-99-2385-4_47
learning data in face-to-face classes. The monitoring system in the face-to-face classroom should allow the classroom participants and the classroom equipment to play their part through the monitoring function, so that learners can achieve self-control and instructors can make timely adjustments to the content based on feedback. The main subject of classroom monitoring is the learner. For the learner, the role of the monitoring system is firstly to encourage self-control through its objective presence, and secondly to monitor the learner's main performance in teaching and learning activities. The continuous strengthening of self-control improves it further and leads to better performance in teaching and learning activities. The main changes in the learner's learning process appear in the face, and the learning state of the learner can be inferred from face features. The main face features include blinking, yawning, and emotion. The posture of the body during learning also reflects the learning state and includes head posture, upper body posture, and gestures. Of these, head posture is most closely related to the learning state. The combination of face feature and human posture allows for a more scientific analysis of learning status and attentiveness. The main contributions of this paper can be summarized as follows:

(1) A learning state monitoring method based on face feature and posture (LSMFP) is proposed through the computation of face features and human pose estimation.
(2) Combined with the observation of learners' states, five learning states are proposed, namely concentrated, confused, distracted [3], fatigued [4], and drowsy. In the process of system monitoring, the above five states are judged by face features and head movements. Combined with changes in emotion and human posture, the learning concentration level is then determined.
(3) An LSMFP system was designed and developed with a B/S (browser/server) architecture for the instructor role and a C/S (client/server) architecture for the learner role, which enables a complete monitoring process and presentation of monitoring data.
2 Related Work

2.1 Computer Aided Education Systems
Teaching quality is a key concern in school teaching and learning, and the use of many technologies to monitor the teaching process is an objective measure to improve learners’ efficiency and learning quality. For example, Yang et al. [5] proposed an online education quality monitoring system based on brain-computer interface technology (BCI), which collects data through head-mounted equipment, analyses the data to obtain the concentration level of learning, and through the monitoring system, completes the data interaction with the instructor, who regulates the classroom learning in a timely manner. Zhao et al. [6] constructed a monitoring platform for teaching quality in universities based on the analysis of the current situation, providing new ideas for the development of monitoring systems.
2.2 Computer Vision Technology
Facial features reflect the state of the face through the positions of the eyes, mouth, and nose, which show distinct variations; the constructed facial data are analysed to obtain the required judgments. Face feature recognition has gradually evolved from geometric feature recognition to support vector machines, and currently face features combined with deep learning have been studied in many areas such as emotion classification. Zhang et al. [7] designed smart campus identity authentication with privacy protection based on face features. Light and pose can affect face recognition and its applications, and face feature point detection combined with head pose improves detection. Xu et al. [8] studied face feature point detection based on deep learning and analyzed the case of combining it with head pose recognition. Human pose estimation includes single-person pose assessment, multi-person pose assessment, human pose tracking, and 3D human pose assessment. Applications of human pose include action behavior recognition, human interaction, and providing assistance to other technologies.

2.3 Summary
There have been many attempts at face feature recognition and teaching aid systems. However, the application of face feature recognition has been relatively narrow, without integration with pose detection, and teaching aid systems still have problems in practical application, failing to detect and classify learning states in a detailed way.
3 The Proposed Method
The combination of technology and education to aid educational progress is a current research hotspot. Many attempts have been made at in-depth research and analysis of the teaching and learning process, in areas such as intelligent assisted education and smart classrooms. Teaching is a dynamic and changing scenario; the teaching process includes the teaching of the instructor, the learning of the learner, and the interaction between the two, all three of which influence each other. In order to promote a virtuous circle between the three, a monitoring method is designed to monitor learners' learning states and provide timely feedback to the instructor, who can then adjust the teaching style. The system is also suitable for learners' online self-study, and the learner side allows for learning process recording and real-time learning reminders.

3.1 Overview
A learning state monitoring method based on face feature and posture (LSMFP) implements face detection, human posture, and emotion recognition functions by processing captured video streams. Face detection implements blinking and yawning detection by face features. Face recognition focuses on determining
whether a face is detected, followed by blink detection and yawn detection. Human pose mainly covers head pose detection, which combines the upper-body movement of a person to obtain three head poses: normal, head down, and head turned left or right. Emotions are classified by an emotion recognition model to obtain real-time emotion states, and different values are assigned to different emotions and learning postures. The learner's learning concentration is obtained by weighting the emotion and posture scores. Learning states are detected based on emotional distribution, body posture, blinking, and yawning; they include concentrated, confused, distracted, fatigued, and drowsy. The system monitors attention and learning status, alerts when attention is low, and records the learning states. Statistical analysis of the learning process is carried out at the end of the study (see Fig. 1).
Fig. 1. LSMFP Flow Chart.
3.2 Face Feature
The extraction of face features is done on the basis of face recognition. Face detection includes face image acquisition and face detection; the application scenario in this paper uses a camera to acquire face images, and after image pre-processing (greyscale conversion, normalization, and denoising), the images are prepared so that face feature information can be obtained more reliably (see Fig. 2).
Fig. 2. Face Feature Processing Diagram.
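As a concrete illustration of the pre-processing steps above (greyscale conversion, normalization, and denoising), the following NumPy-only sketch processes one BGR frame; the luminance weights and the box-filter size are illustrative choices, not values taken from the paper:

```python
import numpy as np

def preprocess_face(img_bgr: np.ndarray, ksize: int = 3) -> np.ndarray:
    """Greyscale conversion, min-max normalization, and mean-filter denoising."""
    # Greyscale via standard luminance weights (input assumed BGR, uint8).
    b, g, r = img_bgr[..., 0], img_bgr[..., 1], img_bgr[..., 2]
    grey = 0.114 * b + 0.587 * g + 0.299 * r
    # Min-max normalization to [0, 1].
    grey = (grey - grey.min()) / (grey.max() - grey.min() + 1e-8)
    # Simple mean-filter denoising with a small box kernel.
    pad = ksize // 2
    padded = np.pad(grey, pad, mode="edge")
    out = np.zeros_like(grey)
    h, w = grey.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + ksize, j:j + ksize].mean()
    return out
```

In practice a library such as OpenCV would replace the explicit loops, but the steps are the same.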
Face emotions are a component of a learner's learning state and can objectively reflect the learner's attitude toward learning. At present, emotion recognition can be achieved by extracting face features and classifying them algorithmically. Emotion recognition here is based on the seven emotions proposed by Ekman [9] and identifies learners' emotions in class in real-time. In this paper, a trained emotion recognition model is used [10]; a BN (Batch Normalization) layer is used in the network to mitigate the vanishing or exploding gradient problem. After the image is input to the model, it is convolved through a Conv2D layer with the ReLU activation function; then, to address the degradation problem in the deeper network, it is selectively convolved either through a Conv2D layer or through two SeparableConv2D layers, four times in total, to weaken the strong connections between layers. Finally, the data are simplified by GlobalAveragePooling2D, and the prediction results are output through the Softmax function. The 68 key points of the face are detected by means of the Dlib model library, mainly the feature points of the eyes, mouth, and nose. The eye aspect ratio (EAR) is obtained from the eye feature points, and the blink frequency and eye-closure duration are calculated from the recorded data. The mouth aspect ratio (MAR) is used to detect yawning behavior; a yawn differs from ordinary mouth movement mainly in the degree of mouth opening and its duration, and the depth of a yawn is determined by its duration.
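A sketch of the eye and mouth aspect ratios described above, assuming the standard dlib 68-point landmark ordering (eyes at points 36–47, inner mouth at 60–67); the exact point selection for MAR is one common variant, not necessarily the authors':

```python
import numpy as np

def aspect_ratio(p: np.ndarray) -> float:
    """Ratio for six ordered points p1..p6:
    (|p2-p6| + |p3-p5|) / (2 * |p1-p4|)."""
    v1 = np.linalg.norm(p[1] - p[5])
    v2 = np.linalg.norm(p[2] - p[4])
    h = np.linalg.norm(p[0] - p[3])
    return (v1 + v2) / (2.0 * h + 1e-8)

def ear_mar(landmarks: np.ndarray):
    """landmarks: (68, 2) array in the dlib 68-point ordering."""
    ear_left = aspect_ratio(landmarks[36:42])   # left eye: points 36-41
    ear_right = aspect_ratio(landmarks[42:48])  # right eye: points 42-47
    ear = (ear_left + ear_right) / 2.0
    # Inner mouth: corners 60/64, vertical pairs 61-67 and 62-66.
    mar = aspect_ratio(landmarks[[60, 61, 62, 64, 66, 67]])
    return ear, mar
```

A small EAR value then indicates closed eyes, and a sustained large MAR indicates a yawn.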
3.3 Human Posture
Head posture reflects the direction of the learner's vision during learning, and the orientation of the face helps to determine the learner's level of concentration. Head posture can be judged from the Euler rotation angles (yaw, pitch, and roll), and describing head posture with 3D vectors allows accurate judgments about head-down, head-up, head-turned-left-or-right, and normal postures. Human pose detection abstracts the body's position in a video into a model connected by nodes. OpenPose is an open-source library developed with Caffe, based on convolutional neural networks and supervised learning; it recognizes all 18 key points of the human body (see Fig. 3). When judging a certain action, it is not necessary to track the position changes of all key points: for example, judging head actions requires combining the position changes of the five head key points, and the nose and neck key points can represent the overall displacement of the head. Therefore, this paper discards the leg key point data and focuses on the position changes of the upper-body key points.
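The upper-body filtering and nose/neck displacement just described can be sketched as follows, assuming the OpenPose COCO 18-keypoint ordering; averaging the nose and neck displacements is an illustrative choice:

```python
import numpy as np

# OpenPose COCO 18-keypoint ordering (assumed): 0 nose, 1 neck,
# 9/10 right knee/ankle, 12/13 left knee/ankle, 14-17 eyes and ears.
LEG_IDS = {9, 10, 12, 13}
NOSE, NECK = 0, 1

def upper_body(keypoints: np.ndarray) -> np.ndarray:
    """Discard leg keypoints from an (18, 2) pose array."""
    keep = [i for i in range(len(keypoints)) if i not in LEG_IDS]
    return keypoints[keep]

def head_displacement(prev: np.ndarray, cur: np.ndarray) -> float:
    """Overall head displacement approximated by nose and neck movement
    between two consecutive (18, 2) pose arrays."""
    d_nose = np.linalg.norm(cur[NOSE] - prev[NOSE])
    d_neck = np.linalg.norm(cur[NECK] - prev[NECK])
    return (d_nose + d_neck) / 2.0
```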
Fig. 3. 18 Key Points of Human Posture.
3.4 Classification
Combining the analysis of learners' classroom behaviors and emotions, this paper proposes five learning states: concentrated, confused, distracted, fatigued, and drowsy. The data detected for blinking, yawning, emotions, and head posture are synthesized and analyzed to obtain the different learning states. The concentrated state is judged as normal head posture or a concentration degree higher than 0.7, and the confused state is judged as a positive emotion with a concentration degree less than 0.7. Let H be the length of time with the head down and Ht the length of time with the head turned to the left or right or with the eyes unblinking; the distracted state is given by Eq. (1):

H > 3 s || Ht > 4 s (1)
The fatigued state is defined as yawning or blinking sleepily. A drowsy blink is determined by the duration of eye closure and EAR analysis. The EAR threshold range for closed eyes is given in Eq. (2):

0 < EAR < 0.3 (2)
Blinks in which the EAR stays within this range for 3 consecutive seconds are considered drowsy blinks. The drowsy state is defined as eyes closed for more than 3 s.
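The drowsy-blink criterion (EAR within the closed-eye range of Eq. (2) for 3 consecutive seconds) can be sketched over a per-frame EAR series; the frame rate is a parameter, and the threshold values come from the section above:

```python
def detect_drowsy(ear_series, fps=30, ear_thresh=0.3, hold_s=3.0):
    """Return True if the EAR stays in the closed-eye range (0, ear_thresh)
    for hold_s consecutive seconds of frames."""
    need = int(hold_s * fps)  # consecutive frames required
    run = 0
    for ear in ear_series:
        if 0 < ear < ear_thresh:
            run += 1
            if run >= need:
                return True
        else:
            run = 0  # eyes opened: reset the streak
    return False
```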
3.5 Concentration Calculation
Learning concentration is calculated from emotions and the head-turning angle. Emotions are divided into positive and negative and are assigned separate values in different categories. The head-turning angle is calculated using the human posture recognition model. A fuzzy matrix is constructed, with a 50% influence ratio set for emotion and for head-turning angle respectively. The identified emotions are assigned rating values, and concentration is calculated using a fuzzy comprehensive evaluation method in combination with the head-turning angle.
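A minimal sketch of this weighted combination, using the 50%/50% influence ratio stated above; the emotion rating values and the 45° yaw normalisation are hypothetical, since the paper does not give concrete numbers:

```python
# Hypothetical rating values for recognized emotions (not from the paper).
EMOTION_SCORE = {
    "happy": 1.0, "surprise": 0.9, "natural": 0.8,          # positive
    "sad": 0.4, "fear": 0.3, "disgust": 0.3, "angry": 0.2,  # negative
}

def concentration(emotion: str, yaw_deg: float, max_yaw: float = 45.0) -> float:
    """Weighted combination of an emotion score and a head-turning-angle
    score, each with a 0.5 influence ratio."""
    e = EMOTION_SCORE.get(emotion, 0.5)
    # A larger head-turning angle lowers the posture score, clipped to [0, 1].
    p = max(0.0, 1.0 - min(abs(yaw_deg), max_yaw) / max_yaw)
    return 0.5 * e + 0.5 * p
```

A full fuzzy comprehensive evaluation would replace the two scalar scores with membership vectors, but the weighting structure is the same.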
4 System Design and Development

4.1 System Design
The learning state monitoring system based on face feature and posture (LSMFP) can be used in the classroom or for online learning. Its operation process is as follows: the learning data of multiple learners are collected in real-time through the application and transferred to the cloud via the application server. After logical processing and data processing by the monitoring algorithm, the results are fed back to the web server and, in the classroom teaching scenario, displayed to the instructor via a web page. In the online learning scenario, the results are fed back to the application server, and the application can monitor the learning process and prompt the learner when learning is not serious. The monitoring algorithm mainly consists of data monitoring and data processing and analysis. Data monitoring records changes in face features, posture, and specific behaviors in real-time emotions, and analysis of the real-time data yields the learning state and attention. The LSMFP system enables data acquisition, transmission, and analysis, as well as data recording and visualization of the results (see Fig. 4).
Fig. 4. System Architecture Diagram.
The LSMFP system is divided into an instructor side and a learner side. The learner side records and displays the learner's states while learning by processing the captured video streams with face recognition, human posture, and emotion recognition functions. The instructor side shows learner information as well as visualizations of the results.
4.2 System Implementation
Learner Role. The learner side is based on a C/S architecture and is easy to install on the learner's computer (see Fig. 5). The main functions of the page include a video capture area, a learning state display area, a concentration display area, a monitoring information recording area, an information prompt area, and a function control area. Learners access the system using their name and password, upon which they are considered to have signed in. The video from the video capture area is processed and judged by face feature and human posture analysis to obtain drowsiness, head posture, yawn, and emotion recognition results, which are recorded in the database. The message area shows whether the camera was turned on successfully, whether the system is working properly, and the current stage of operation. The control area controls switching the camera on and off, as well as the operation of the system and the analysis of the learner's monitoring information.
Fig. 5. Learner Side Interface.
Instructor Role. The instructor side is developed on a B/S architecture so that instructors can log into the system from anywhere (see Fig. 6). The main functions of the instructor side include monitoring learner information in real-time and processing learners' recorded information. The instructor logs into the system with a username and password and enters the
real-time information display screen, which shows the information of learners who have logged into the system, with real-time concentration monitoring on the right side. The prompt message area shows individual learners who are drowsy or tired. At the end of the lesson, the learner's learning process can be viewed on the left-hand side.
Fig. 6. Instructor Side Interface.
At the end of the course, analysis of the learners' monitoring data can be viewed through the system, including overall data analysis for the whole group and individual data analysis: specifically, concentration charts, learning state histograms, emotion classification pie charts, and classroom information, which includes the total number of learners, the number of sign-ins, the number of sleepy learners, and the overall classroom performance (see Fig. 7).
Fig. 7. Information Analysis Interface.
4.3 Case Study
To verify the monitoring effectiveness of the system, several testers were recruited to conduct experiments. The first step was to verify the viability of the system. Feedback from testers using the system established that the system operation, information transmission, and information analysis functions worked properly. To verify the accuracy of the test results, data from testers simulating different learning states were recorded, including the concentration curves of testers in the states of serious study, study fatigue, and drowsiness. The curves show differences in concentration between the different states, indicating that the monitored data can reflect the learning states (see Fig. 8).
Fig. 8. Concentration Curve.
In order to verify the accuracy of learning state detection, we recorded the real-time monitoring factors of the testers and the learning states displayed in the system, to test whether the results obtained matched the states judged by the algorithm. Different conditions for blinking, yawning, emotion, and head posture reflect different learning states, and Table 1 records one combination of conditions for the four factors associated with judging the five categories of learning states.
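The factor combinations listed in Table 1 can be reproduced by a simple rule cascade; the decision order below is one possible reading of the table, not the authors' exact algorithm:

```python
def learning_state(dozing: bool, yawning: bool, emotion: str, head: str) -> str:
    """Map the four monitored factors to one of the five learning states
    (one possible decision order consistent with the example rows)."""
    if dozing:
        return "drowsy"
    if yawning:
        return "fatigued" if head == "head down" else "distracted"
    if head == "normal":
        return "concentrated" if emotion == "happy" else "confused"
    return "distracted"
```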
Table 1. Learning status real-time analysis table.

No | Dozing | Yawning | Emotion  | Head Posture | Learning State
1  | N      | N       | Happy    | Normal       | Concentrated
2  | N      | N       | Surprise | Normal       | Confused
3  | N      | Y       | Natural  | Turning head | Distracted
4  | N      | Y       | Natural  | Head down    | Fatigued
5  | Y      | N       | Natural  | Head down    | Drowsy

5 Conclusion
In this paper, a learning state monitoring method based on face feature and posture (LSMFP) is proposed and an LSMFP system is implemented. The system implements monitoring and analysis functions for teaching monitoring: it can monitor learner behavior, provide feedback for instructors to adjust classroom strategies in a timely manner, and generate post-lesson analysis charts for instructors and learners to self-reflect. However, technical limitations prevent the analysis of more complex teaching situations. In the future, the focus should be on how to better analyze classroom teaching and learning to ensure scientific and accurate results.

Acknowledgements. This work is supported by the Shandong Provincial Project of Graduate Education Quality Improvement (No. SDYJG21104, No. SDYJG19171, No. SDYY18058), the OMO Course Group "Advanced Computer Networks" of Shandong Normal University, the Teaching Team Project of Shandong Normal University, the Teaching Research Project of Shandong Normal University (2018Z29), the Provincial Research Project of Education and Teaching (No. 2020JXY012), and the Natural Science Foundation of Shandong Province (No. ZR2020LZH008, ZR2021MF118, ZR2019MF071).
References

1. Bao, L., Deng, Z., Zhong, Z.: Analysis of hot spots and trends of research on informatization of basic education in China. Digit. Teach. Prim. Secondary Sch. 03, 30–34 (2022)
2. Gu, R., Zhao, L.: Teaching quality monitoring in vocational colleges in the intelligent era: logic and dilemma breakthrough. Chin. Vocat. Tech. Educ. 35, 11–18 (2021)
3. Wang, J.: A study on the analysis of students' online learning status based on Yolov4. Shi Hezi Sci. Technol. 05, 55–57 (2021)
4. Zhang, C.: A study on classroom problem behaviors of middle school students. Theory Pract. Educ. 35(28), 56–60 (2015)
5. Yang, N., Shi, Q., Yu, J., et al.: BCI-based online education quality monitoring system. China Mod. Educ. Equipment 19, 7–9 (2020)
6. Zhao, H., Yu, G.: On the construction of information platform for teaching quality monitoring in applied universities. Appl.-Oriented High. Educ. Res. 4(01), 55–59 (2019)
7. Zhang, X., Liu, C., Yang, X., et al.: Privacy protection research based on face feature authentication in smart campus. Netw. Secur. Technol. Appl. 04, 95–96 (2022)
8. Xu, Y., Zhao, J., Zhang, Z., et al.: Automatic facial feature points location based on deep learning: a review. J. Image Graph. 26(11), 2630–2644 (2021)
9. Ekman, P., Friesen, W.: Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 17(2), 124–129 (1971)
10. Li, M.: https://github.com/liminze/Real-time-Facial-Expression-Recognitionand-Fast-Face-Detection. Accessed 22 Apr 2022
Meta-transfer Learning for Person Re-identification in Aerial Imagery

Lili Xu1, Houfu Peng1, Linna Wang1, and Daoxun Xia1,2(B)

1 School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550025, China
[email protected]
2 Engineering Laboratory for Applied Technology of Big Data in Education, Guizhou Normal University, Guiyang 550025, China

Abstract. Person re-identification (Re-ID) aims to retrieve a person of interest across multiple nonoverlapping cameras. In the last few years, the construction of aerial person Re-ID datasets has become appealing because visual surveillance from unmanned aerial vehicle (UAV) platforms has become very valuable in real-world scenarios. However, great differences exist between pedestrian images captured by ground cameras and those captured by UAVs, and person Re-ID methods based on ground person images have difficulty performing Re-ID on aerial person images. In this paper, we first propose a novel meta-transfer learning method for person Re-ID in aerial imagery; this approach trains a generalisable Re-ID model to learn discriminative feature representations for aerial person images. Specifically, a meta-learning strategy is introduced to learn a feature extractor, and a transfer learning strategy is introduced to utilise and further improve the acquired meta-knowledge. To overcome the reductions in convergence speed and recognition accuracy caused by the presence of difficult categories in the given dataset, we propose a learning strategy based on curriculum sampling that is harmonised with our meta-transfer learning framework. In addition, a new metric formulation of sample similarity based on Mahalanobis distance is introduced to improve the optimisation of the model. Extensive comparative evaluation experiments are conducted on a large-scale aerial Re-ID dataset, and the results show that our method achieves a Rank-1 accuracy of 63.63% and a mean average precision (mAP) of 38.02%, demonstrating its potential for person Re-ID in aerial images.
Ablation studies also validate that each component contributes to improving the performance of the model.

Keywords: Person re-identification · Aerial imagery · Meta-transfer learning

1 Introduction
Affected by the development of modern unmanned aerial vehicle (UAV) technology and its unique practical importance, person Re-ID in aerial imagery [1]

Supported by the National Natural Science Foundation of China (grant no. 62166008).
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 634–644, 2023. https://doi.org/10.1007/978-981-99-2385-4_48
has become a research subject worthy of great effort in the vision community. Research on person Re-ID in aerial imagery is not only helpful for criminal investigation, the search for missing persons and the prevention of terror attacks, but is also especially useful for person search and rescue under natural disasters such as earthquakes and fires. When ground cameras struggle to work under such disasters, aerial person Re-ID technology based on UAVs is effective for searching for and identifying pedestrians, helping to complete rescue tasks and protect public safety. Aerial person Re-ID, however, is an extremely challenging task affected by many factors, as shown in Fig. 1. In addition to the existing problems faced by traditional person Re-ID approaches, work on aerial person Re-ID must also handle issues such as weak pedestrian appearance features and large resolution variations in aerial imagery. Research on aerial person imagery, which has only recently become popular, mainly focuses on the basic work of constructing datasets, such as NWPU VHR [2], UAV123 [3], VisDrone [4], AVI [5], PRAI-1581 [1], and on the tasks of object detection [6], tracking [7] and segmentation [8]. Regarding the person Re-ID task, which has received little attention, Zhang et al. [1] presented an end-to-end learning method of subspace pooling using convolutional feature maps, which effectively improved person Re-ID accuracy in aerial imagery. They also pointed out that, due to the various perspectives and poses in aerial images, the part-based methods that are effective in traditional person Re-ID tasks do not perform as well on aerial person datasets. Different from part-based methods, meta-learning [9,10] aims to learn a model that generalises across tasks, which is a possible global-feature-based solution to the problem of large differences between pedestrian samples when conducting person Re-ID for air-ground integration.
Among meta-learning approaches, model-agnostic meta-learning (MAML) [9], as a famous example, designs a meta-learner to learn a good initialisation. However, this type of algorithm tends to build a great quantity of meta-tasks as support, which leads to expensive computing costs and a risk of overfitting during network training. In this paper, we propose a modified framework based on MAML, which utilises a powerful transfer strategy, fine-tuning, to reduce the chance of overfitting and avoid the problem of catastrophic forgetting. In this research, meta-learning and transfer learning are proposed for the first time to address the issue of aerial person Re-ID. Our method combines the advantages of transfer learning and meta-learning and trains a generalisable feature extractor to adapt to the resolution variations, viewpoint differences and other changes exhibited by the labelled samples. "Meta" refers to a type of meta-knowledge that is independent of specific tasks and can be learned by constructing a large number of meta-tasks; in other words, a base network that extracts feature vectors to represent each pedestrian image. "Transfer" refers to a feature representation network that is transformed and represented by specific layers of neurons. By fine-tuning these neurons, the network can learn new useful knowledge while reducing the overwriting of the learned meta-knowledge. It should be noted that we regard the fine-tuning of the network learned from training samples to adapt to
L. Xu et al.
Fig. 1. Several examples from the ground person dataset PRW-v16 and the aerial person datasets NWPU VHR-10 and PRAI-1581. The left part of the figure compares the difference between images of the ground dataset PRW-v16 and the aerial dataset NWPU VHR-10, and the right part displays five challenging problems of the samples in the aerial dataset PRAI-1581. (a) Images in diverse occlusion situations. (b) Images of the same identity with various perspectives and poses captured by UAVs. (c) Images with weak appearance features. (d) Images in low resolution. (e) Images with smaller objects under UAVs flying at higher altitudes.
new unseen samples as "transfer" because of the highly inconsistent data distribution in aerial person imagery. Nevertheless, we observe that a number of samples in the dataset are particularly difficult to identify, which significantly reduces the model's recognition accuracy. To solve this problem, a curriculum sampling method is used to retrain these samples with newly formed meta-tasks. In addition, we improve the distance metric based on Gaussian embedding to address the problem of a sub-optimal model updating direction during the two-stage training process of meta-learning. The contributions of this paper are summarised as follows: – We propose a framework based on meta-transfer learning; this is the first application of meta-learning and transfer learning to aerial person Re-ID. The meta-knowledge learned from the training samples is used to train a feature extractor that can quickly adapt to new samples. – We equip our framework with a curriculum sampling module, which samples the identities with the lowest validation accuracies in every meta-task online and forces the model to study deep features with better generalisability. – We also present a module combined with Gaussian embedding to improve the model performance, thereby preventing model optimisation towards the optimum for seen samples and a sub-optimum for unseen samples.
2 Methodology
In this paper, we mainly conduct our person Re-ID research in aerial imagery based on one specific aerial person dataset with supervision. We take the training
Meta-transfer Learning for Person Re-identification in Aerial Imagery
set, validation set, and testing set as three independent data sources, which are denoted D_S, D_V and D_T, respectively. A suitable model is trained for D_T with the meta-knowledge learned from D_S, and D_V is used to verify the convergence of the model online.

2.1 Overview
As shown in Fig. 2, we propose a meta-transfer learning framework that includes three phases: a pretraining phase, a meta-training phase and a meta-test phase. In the pretraining stage, we train a model on the D_S data and then save the weights of the lower layers of the network as the feature extractor Θ (Sect. 2.2). For the main procedure of the proposed framework, a series of meta-tasks T_S on the source dataset are designed, each with a meta-training stage T_S^{TR} and a meta-test stage T_S^{TE} (Sect. 2.3). Specifically, during the meta-training stage, we first copy the model saved during the pretraining phase and optimise it with the loss derived from the training data of the source domain, L(Θ, T_S^{TR}). Then, in the meta-test phase, we utilise the updated feature extractor Θ̃ to compute the loss yielded by the test data of the source domain, L(Θ̃, T_S^{TE}). This loss is used to update the original model parameters, and these updates are cumulative. After conducting meta-training on all episodes, a model optimised by the losses {L_i(Θ̃, T_S^{TE})} is learned, where i denotes the meta-task sequence number. At the end of each epoch after meta-learning, we verify the recognition accuracy of the current network on D_V to find the convergence model. Finally, we test the performance of the trained model on the data of D_T. At this time, we assume that the learned model is sufficiently discriminative with respect to unseen samples. Thus, T_T^{TE} corresponds to the meta-test flow of the meta-transfer learning framework, which aims at predicting the feature representations of unlabelled data in D_T.

2.2 Pretraining
To make meta-transfer learning easier and faster in terms of finding the target network, we first use the available D_S data to obtain an initialisation by training a deep neural network. For this network, the low-level part is regarded as the feature extractor Θ, and the high-level part is regarded as the classifier θ. During training, we optimise the entire network by minimising the following loss:

L_{D_S}([Θ; θ]) = (1/|D_S|) Σ_{(x,y) ∈ D_S} l(f_{[Θ;θ]}(x), y)    (1)
which is the empirical loss between the ground truth y and the prediction for the data x. However, in the end, only the parameters of the feature extractor Θ are saved, while the classifier θ is discarded. This is because, on the one hand, the tasks implemented by the classifier during this stage contain different classification objectives from those of the subsequent meta-tasks. On the other hand, the specificity of the neurons in the higher layers of a deep convolutional neural network makes them hardly beneficial for new tasks.
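The pretraining objective of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical toy features and a random classifier head, not the actual training code; in the real pipeline f_{[Θ;θ]} is a deep network optimised by SGD, and only the extractor Θ is kept afterwards.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pretrain_loss(features, labels, W):
    """Empirical loss of Eq. (1): mean cross-entropy over the source set D_S.
    `features` plays the role of the extractor output f_Theta(x); `W` is the
    throwaway classifier head theta."""
    probs = softmax(features @ W)
    n = len(labels)
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

# toy example: 4 images, 8-dim features, 3 identities (all values hypothetical)
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
labels = np.array([0, 1, 2, 0])
W = rng.normal(size=(8, 3))
loss = pretrain_loss(feats, labels, W)
# after pretraining, only the feature extractor Theta is kept;
# the classifier W (theta) is discarded, mirroring the paper's setup
```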
Fig. 2. Aerial person Re-ID system based on our proposed meta-transfer learning method. The blue dotted box is the overall framework of model training, and the red dotted box is the pedestrian retrieval process. For each meta-batch during training, the orange lines represent the meta-training stage and the green lines represent the meta-test stage. At the end of each meta-task, we select the identity that is most difficult to recognise according to the verification accuracy of the meta-test, collect all samples under this identity and resample the meta-batch for later hard meta-task training; this is called curriculum sampling and is represented by grey lines. (Color figure online)
2.3 Meta-transfer Learning
Our meta-transfer learning model is based on the MAML [9] framework, which first samples several meta-tasks from the distribution of D_S and then divides each task into a training branch and a test branch. The optimisation process is shown in Fig. 3. One substantial difference from the MAML framework is that the update of each task works on the high-level network with strong specificity instead of on the whole network.

Meta-training Phase. In this stage, a series of meta-tasks are designed as follows. For each task, a 16-class, 4-shot episode is randomly sampled from D_S as the training split, and another 16-class, 4-shot episode is sampled as the test split. First, we copy the feature extractor saved in the previous pretraining stage to predict the features of the training split and initialise a new classifier to predict their identities. Then, the loss (e.g., the cross-entropy loss) can be calculated as
follows:

L_i([Θ; θ]) = −(1/n) Σ_{i=1}^{n} log(ρ(y_i | x_i))    (2)

where (x_i, y_i) ∈ D_S^{TR} and n refers to the number of person images sampled in each meta-training batch. We apply this loss to optimise the feature extractor Θ and the classifier θ. Among them, the parameters of the feature extractor are saved, and the parameters of the classifier are discarded. For convenience, we record the updated feature extractor and classifier as Θ̃ and θ̃, respectively.
Fig. 3. The detailed optimisation process of the proposed method. The grey lines in the picture represent the operations of the meta-training phase and the orange lines correspond to the meta-test phase. In addition, the solid lines indicate copying to a model, while the dotted lines refer to optimisation. (Color figure online)
Meta-test Phase. In this stage, the feature extractor Θ̃ learned in the previous step is used to predict the features of the test split. Then, the loss induced on the test split can be calculated as follows:

L_{T_S^{TE}}(Θ̃) = (1/|D_S^{TE}|) Σ_{(x,y) ∈ D_S^{TE}} l(f_Θ̃(x), y)    (3)

where L_{T_S^{TE}}(Θ̃) refers to the total loss of all meta-tasks, e.g., the triplet loss:

L_tri(i, j, k) = max(ρ + d_ij − d_ik, 0)    (4)

where i, j, and k represent the marks of an anchored sample x_i, a positive sample x_j of the same person category and a negative sample x_k of a different category in a triplet; d_ij and d_ik represent the similarity distances between x_i and x_j and between x_i and x_k, respectively; and ρ represents the margin parameter. It should be noted that the Θ̃ in each meta-task is copied from the Θ̃ updated in the last meta-task. Once a meta-task is completed, the parameters of each updated classifier θ̃ are discarded, while each updated Θ̃ is saved for the next task. The model obtained from this stage is theoretically a feature extractor that can learn new sample features with a few examples.
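The two-stage loop above can be sketched as follows. This is a first-order, toy simplification: quadratic tasks stand in for the losses of Eqs. (2) and (3), gradients are supplied as callables, and the learning rates mirror the values reported later in the paper; the actual method adapts a deep extractor and restricts updates to the high-level layers.

```python
import numpy as np

def inner_update(theta, grad_train, lr_inner=0.0005):
    """Meta-training stage: one gradient step on the training split (cf. Eq. 2),
    producing the adapted parameters Theta-tilde."""
    return theta - lr_inner * grad_train(theta)

def meta_epoch(theta, tasks, lr_inner=0.0005, lr_outer=0.00005):
    """One pass over the meta-tasks. For each task we adapt a copy of theta on
    the training split, then the loss gradient on the test split (cf. Eq. 3)
    updates the original parameters; the updates are cumulative. This is a
    first-order simplification of MAML."""
    for grad_train, grad_test in tasks:
        theta_tilde = inner_update(theta, grad_train, lr_inner)
        theta = theta - lr_outer * grad_test(theta_tilde)  # cumulative outer update
    return theta

# toy quadratic tasks: minimise ||theta - t||^2 for task-specific targets t
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
tasks = [(lambda th, t=t: 2 * (th - t),   # gradient on the "training split"
          lambda th, t=t: 2 * (th - t))   # gradient on the "test split"
         for t in targets]
theta = meta_epoch(np.zeros(2), tasks)
```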
2.4 Curriculum Sampling
Curriculum learning was presented as a method of organising samples in a meaningful pattern during the training process. In this work, the proposed learning strategy based on curriculum learning can be integrated into the meta-transfer learning framework without any auxiliary model. Specifically, during the meta-test phase, the high-level neurons of the model are optimised once by the loss L_i^{TE}, and the recognition accuracies of the m classes are calculated. Then, in the current meta-task, we choose the identity corresponding to the lowest recognition accuracy Acc_m as the hardest identity. After obtaining all the categories that the model fails to recognise (indexed by m) from the k tasks in each epoch, we resample tasks from these hard data. Specifically, we assume that the harder samples are consistent with the task distribution, so we can improve the model performance by strengthening its learning process with the harder samples.

2.5 Gaussian Embedding
In this paper, we propose a meta-transfer framework based on MAML [9]. The labelled data are divided into two branches, and in each meta-task the losses of the different branches are calculated based on different networks. Euclidean distances are typically used as the similarity measurements between person images in meta-learning networks:

d(f_φ(x), μ_k) = (f_φ(x) − μ_k)^T (f_φ(x) − μ_k)    (5)

By default, the features of each category follow the same Gaussian distribution. When the training samples and test samples of each meta-task have the same identities, the optimisation process based on this assumption is reasonable. However, when the identities of the training samples and test samples are inconsistent, the test samples are embedded at random positions in the feature space, and the same optimisation may be suboptimal for the test samples [10]. To account for this fact, the Euclidean distance metric is replaced by the Mahalanobis distance as follows:

d(f_φ(x), μ_k) = (f_φ(x) − μ_k)^T Σ_k^{−1} (f_φ(x) − μ_k)    (6)

The relevant distance measurement parameters Σ_k^{−1} are learned by backpropagation when training the model, improving the optimisation direction of the model.
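Equations (5) and (6) differ only in the matrix placed between the two difference vectors, as this small NumPy sketch shows. The feature values and the fixed Σ_k^{−1} are illustrative; in the paper Σ_k^{−1} is learned by backpropagation.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance of Eq. (6); with sigma_inv = I it
    reduces to the squared Euclidean distance of Eq. (5)."""
    d = x - mu
    return float(d @ sigma_inv @ d)

# toy 2-D feature and class prototype (hypothetical values)
x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])

# Euclidean special case: 1*1 + 2*2 = 5
assert mahalanobis_sq(x, mu, np.eye(2)) == 5.0

# down-weighting the second (high-variance) dimension changes the metric
sigma_inv = np.diag([1.0, 0.25])
assert mahalanobis_sq(x, mu, sigma_inv) == 2.0  # 1*1 + 0.25*4
```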
3 Experiments
We evaluate our meta-transfer learning method on a large-scale aerial pedestrian dataset and conduct supplementary experiments to verify the effectiveness of each component. Below, we introduce the dataset and the utilized implementation settings (Sect. 3.1), followed by an analysis of the results (Sect. 3.2) and ablation studies (Sect. 3.3).
3.1 Datasets and Settings
Datasets. The experiments are conducted on a newly published person Re-ID benchmark obtained with multiple UAV cameras: PRAI-1581 [1]. On PRAI-1581, we conduct experiments according to the default partitioning of the original dataset, where the identities in the training set, validation set and testing set do not coincide at all. The training set data are split into two branches for meta-training and meta-testing. The validation set data are used to calculate the loss of the updated model after each epoch, indicating the best meta-network parameters with, in theory, the most generalisability. Several hard examples from PRAI-1581 are shown in Fig. 1. PRAI-1581 [1] is an open-source dataset for aerial person Re-ID, which contains 39461 images of 1581 person identities in total. The dataset is split into 19523 images whose identities are used for training and 19938 images among which 799 identities are used for testing. For the test set, 15258 and 4680 images are divided into the gallery and query sets, respectively.

Implementation Details. Our framework for the proposed meta-transfer learning method takes the widely used ResNet-50 and DenseNet-161 as its backbones. During data processing, the images are resized to 256 × 128 and augmented by random cropping and flipping. The training batch size is set to 128 during pretraining. During meta-transfer learning, we sample 16-class, 4-shot episodes to obtain samples for training and another 16 × 4 samples for testing. A stochastic gradient descent (SGD) optimiser and an Adam optimiser are adopted in the pretraining and meta-training stages, respectively. We initialise the learning rate of the inner loop of the meta-model to 0.0005 and that of the outer loop to 0.00005. Then, over the 100 epochs, the learning rates are decayed by a factor of 0.9 every 50 steps. In total, the meta-transfer learning stage takes 100 epochs.

3.2 Comparison with State-of-the-Art Methods
We compare our approach with state-of-the-art works on aerial person Re-ID and report the comparison results in Table 1. The aerial dataset PRAI-1581 is used to determine the model performance. The gallery and query images used during the evaluation were taken by UAVs. Unless otherwise stated, all the experiments take ResNet-50 as the backbone to train and test the model according to the standard dataset partition. From Table 1, we can draw the following three conclusions. First, our method achieves comparable performance on PRAI-1581, with an mAP of 38.02% and a Rank-1 accuracy of 63.63%, surpassing TL + SP [1] by 13.84% in Rank-1 accuracy. Second, our method outperforms other models by large margins in Rank-1 accuracy, but it produces only slight mAP improvements. This may be because of the randomness of task division: a number of meta-tasks are mostly composed of hard identities, making model recognition difficult. A retrieval result fragment for one identity in PRAI-1581 is shown in Fig. 4. The left image represents a query sample, and
the right part shows the top 10 similar images from top to bottom. The possible reasons for the failure cases include low resolution, viewpoint differences, illumination, occlusion, shadows and inconsistent bounding box content. In particular, we notice that the last few samples of the third case in Fig. 4 fail to match the query; this is probably because it is difficult for meta-learning to focus on the target person when an image contains multiple pedestrians. Third, compared with the method that uses ResNet-50 as the backbone, higher Rank-1 and mAP values are achieved when using DenseNet-161 instead.

Table 1. Comparison with state-of-the-art related methods on the large-scale aerial person Re-ID benchmark PRAI-1581. The performance is evaluated quantitatively by the mAP and the Rank-1 accuracy. The results of the compared methods are cited from [1].

| Method      | Rank-1 | mAP   | Method                  | Rank-1 | mAP   |
|-------------|--------|-------|-------------------------|--------|-------|
| ID          | 42.62  | 31.47 | DSR                     | 51.09  | 39.14 |
| TL          | 47.47  | 36.49 | IDE                     | 43.90  | 32.90 |
| PCB         | 47.47  | 37.15 | DCGAN                   | 38.93  | 28.82 |
| STL         | 47.49  | 37.13 | Deep Embedding          | 21.36  | 14.73 |
| SVDNet      | 46.10  | 36.70 | MGN                     | 49.64  | 40.86 |
| AlignedReID | 48.54  | 37.64 | TL + SP (ResNet-50)     | 49.79  | 39.58 |
| PCB+RPP     | 48.07  | 38.45 | TL + SP (DenseNet-161)  | 54.76  | 43.05 |
| MBC         | 30.05  | 22.83 | Ours (Baseline)^a       | 52.74  | 33.24 |
| 2Stream     | 47.79  | 37.02 | Ours (ResNet-50)^a      | 63.63  | 38.02 |
| Part-align  | 43.14  | 32.86 | Ours (DenseNet-161)^a   | 64.97  | 38.15 |

^a Results obtained by our method.
Fig. 4. Several retrieval examples of the proposed approach on PRAI-1581. Images with red and green boxes indicate the false and correct matches to the query, respectively. (Color figure online)
3.3 Ablation Studies
Some ablation studies are performed as follows to validate the effectiveness of each component of our method.

Effectiveness of Meta-transfer Learning. As shown in Table 2, we conduct ablation studies to verify the effectiveness of the proposed meta-transfer learning strategy. Notably, the Re-ID model based on the proposed framework achieves improved performance on the aerial person dataset. When using meta-learning, the mAP is improved from 33.24% to 38.02% with a 4.78% gain, and the Rank-1 accuracy is increased from 52.74% to 63.63% with a 10.89% gain on PRAI-1581. This shows that the meta-transfer learning framework enables the Re-ID model to learn strong representations for unseen samples.

Effectiveness of Curriculum Sampling. As shown in Table 2, ablation studies are also conducted to verify the effectiveness of the proposed curriculum sampling strategy for meta-learning. The model performance is considerably better when using our curriculum sampling strategy; in particular, the mAP is improved from 36.65% to 38.02% with a 1.37% gain. The reason why we obtain more obvious mAP gains than Rank-1 accuracy gains may be that our learning strategy reduces the impact of meta-tasks with excessive numbers of difficult samples on model training.

Effectiveness of Gaussian Embedding. Table 2 compares the results of model training with and without the proposed Gaussian embedding strategy. By applying this strategy to the meta-transfer learning framework, the mAP increases by 1.38% on PRAI-1581. This shows the ability of the designed metric to improve the model optimisation procedure.

Table 2. Results of ablation studies. Experiments are conducted on the large-scale aerial person Re-ID benchmark PRAI-1581. "Meta" denotes training with the meta-transfer learning strategy. "CS" denotes "Curriculum Sampling", which refers to training with the proposed curriculum sampling strategy. "GE" denotes "Gaussian Embedding", which refers to training with the proposed improved metric.

| Meta | CS | GE | mAP   | Rank-1 | Rank-5 | Rank-10 |
|------|----|----|-------|--------|--------|---------|
| ✓    | ✓  | ✓  | 38.02 | 63.63  | 87.76  | 93.42   |
|      |    |    | 33.24 | 52.74  | 78.05  | 85.91   |
| ✓    |    | ✓  | 36.65 | 61.28  | 86.81  | 92.97   |
| ✓    | ✓  |    | 36.64 | 60.06  | 87.20  | 92.08   |

4 Conclusions
In this work, meta-learning and transfer learning are applied to person Re-ID in aerial imagery for the first time. The proposed meta-transfer learning framework
learns knowledge that helps to extract the features of aerial person images when training the model and finally achieves comparable performance on aerial person datasets. We also introduce a learning scheme based on curriculum sampling to select the hardest identity in each meta-task for retraining in the later steps to eliminate the impact of distinguishing hard samples on the resulting model performance. In addition, a metric is designed to improve the optimisation direction of the model, making the meta-transfer model more suitable for unseen samples. Experimental results demonstrate the effectiveness of our method for person Re-ID in aerial imagery. Acknowledgements. This work is supported by the National Natural Science Foundation of China (no. 62166008).
References

1. Zhang, S., et al.: Person re-identification in aerial imagery. IEEE Trans. Multimedia 23(1), 281–291 (2021)
2. Cheng, G., Zhou, P., Han, J.: Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54(12), 7405–7415 (2016)
3. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
4. Zhu, P., Wen, L., Xiao, B., Ling, H., Hu, Q.: Vision meets drones: a challenge. In: European Conference on Computer Vision (ECCV), pp. 437–468. Springer, Munich (2018)
5. Singh, A., Patil, D., Omkar, S.N.: Eye in the sky: real-time drone surveillance system (DSS) for violent individuals identification using scatternet hybrid deep learning network. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018, pp. 1629–1637. IEEE (2018)
6. Han, J., Ding, J., Xue, N., Xia, G.: ReDet: a rotation-equivariant detector for aerial object detection. In: 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2786–2795. IEEE, Virtual, 19–25 June 2021
7. Cao, Z., Fu, C., Ye, J., Li, B., Li, Y.: HiFT: hierarchical feature transformer for aerial tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15457–15466. IEEE, Virtual, 11–17 October 2021
8. Lee, K., Lee, H., Hwang, J.: Self-mutating network for domain adaptive segmentation in aerial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7068–7077. IEEE, Virtual, 11–17 October 2021
9. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017, pp. 1126–1135. PMLR (2017)
10. Liu, B., Kang, H., Li, H., Hua, G., Vasconcelos, N.: Few-shot open-set recognition using meta-learning. In: 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020, pp. 8795–8804. IEEE (2020)
Horizontal Federated Traffic Speed Prediction Base on Secure Node Attribute Aggregation Enjie Ye1,2,3 , Kun Guo1,2,3(B) , Wenzhong Guo1,2,3 , Dangrun Chen1,2,3 , Zihan Zhang2 , Fuan Li2 , and JiaChen Zheng1,2,3
1 Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350108, China
[email protected]
2 College of Computer and Data Science/College of Software, Fuzhou University, Fuzhou 350108, China
3 Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou 350108, China
Abstract. Federated graph learning has been widely used in distributed graph machine learning tasks. The data distribution of existing graph-based federated spatio-temporal prediction methods is mainly segmented by graph topology. However, in the real-world spatio-temporal traffic speed prediction task, a location may have data from different devices belonging to different companies, so a node may carry multi-party information in a real-world distributed traffic speed prediction scenario. The differences in the multi-party information mean that the information is not fully utilised. Moreover, directly transmitting node embeddings in the federated learning process may risk privacy leaks, while using homomorphic encryption and other encryption methods brings a high computational overhead. Therefore, we propose a new distributed privacy-preserving traffic speed prediction algorithm, which uses a secure node attribute aggregation strategy (SNAAS) to serve the multi-party collaborative traffic speed prediction scenario in which the graph topology is public. At the same time, secret sharing technology is used in SNAAS to protect the attribute matrix and reduce the overhead of secret computing.

Keywords: Spatio-temporal traffic speed prediction · federated learning · secret sharing · secure node attribute aggregation strategy · graph neural network

1 Introduction
With the in-depth study of smart cities, various companies hold large amounts of traffic data. Traffic speed prediction is one of the key issues of the smart city. If we only know the current traffic status, traffic congestion is unavoidable: the traffic management department can only deal with congestion when it occurs. At the same time, drivers cannot predict the traffic status. Therefore,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 645–659, 2023. https://doi.org/10.1007/978-981-99-2385-4_49
drivers' arrangement of driving routes lags behind. Much work has focused on the task of predicting traffic speed. In this task, in addition to mining the patterns of time series data, we can exploit the characteristics of road interconnection and use spatial data to mine more information. Considering spatial features can improve the accuracy of traffic speed prediction [8], but it imposes higher requirements on dataset quality. However, real-world traffic data suffer from insufficient and missing records, and each company holds data that reflect only its own business scenarios. To solve the problem of insufficient and missing data, learning from combined multi-party data is an effective solution. However, traffic data are often sensitive, which raises the problem of data privacy protection. Recently, some work has applied federated learning [20] to the traffic speed prediction task to solve the privacy problem in multi-party traffic speed prediction. Some works [14,22,23] also consider the topological information of the graph to improve the prediction accuracy. [22,23] divide the complete graph among the participants. The scenario of [14] regards each node as a data island [20] when the graph's topology is public, which means that all participants share the topology of the graph. These works have achieved good prediction results in their respective scenarios. In a real-world scenario, information at the same place may belong to different devices, such as the driving speed of a taxi and a sensor's speed records at the same location; thus a node on the graph may have multiple features belonging to different participants when the graph's topology is public. However, current methods only consider the case in which each node has only one participant's data.

When a node carries the data of multiple participants, current methods cannot comprehensively consider the data of different participants for temporal feature extraction because of the differences among the participants' data. The differences in the data on a node badly impact the extraction of temporal features. At the same time, transmitting intermediate data without encryption poses a potential privacy risk, while directly using an encryption algorithm brings a large computing overhead because of the enormous matrix dimensions. In this paper, we propose the horizontal federated traffic speed prediction algorithm (HFTSPA) for a new scenario and consider the safety of the transmitted embedding vectors. In this scenario, each participant has equipment, and the corresponding data, at each node. We share the parameters of the local GRUs in HFTSPA. Moreover, we design the secure node attribute aggregation strategy (SNAAS) to generate each node's embedding vector on the participant side and aggregate it on the coordinator side. SNAAS can comprehensively consider all participants' data differences by encoding data on the participant side and aggregating them on the coordinator side. After aggregating the embedding vectors, we mine the graph information with GN [2]. After that, the updated node feature vector contains the information of each participant for prediction. Our contributions mainly include:

1. Our algorithm is well suited to the scenario in which the same place has data belonging to different participants in the distributed privacy-preserving traffic speed prediction task.
2. SNAAS uses a GRU to extract temporal features and, combined with secret sharing technology, aggregates the nodes' embedding vectors on the coordinator side, unifying the differences of multi-party information and reducing the computational overhead of direct encryption.
3. Experiments on real-world networks show that HFTSPA improves accuracy by considering graph topology data and adopting the federated strategy, and they also prove the effectiveness of HFTSPA in the new scenario.
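As a rough illustration of the secret-sharing idea that SNAAS builds on (this excerpt does not give SNAAS's exact protocol), additive secret sharing lets a coordinator recover only the sum of the participants' matrices, never an individual input. Real schemes operate over a finite field; floating-point shares are used here purely for illustration, and all matrices are hypothetical.

```python
import numpy as np

def make_shares(x, n_parties, rng):
    """Additively secret-share matrix x into n_parties random shares that
    sum to x; any n_parties - 1 shares alone reveal nothing about x."""
    shares = [rng.normal(size=x.shape) for _ in range(n_parties - 1)]
    shares.append(x - sum(shares))
    return shares

def aggregate(mixed_shares):
    """Coordinator side: summing one mixed share from every channel
    reconstructs the sum of the secret inputs without exposing any of them."""
    return sum(mixed_shares)

rng = np.random.default_rng(42)
# hypothetical node-embedding matrices from two participants (3 nodes, 4 dims)
x1 = rng.normal(size=(3, 4))
x2 = rng.normal(size=(3, 4))
s1 = make_shares(x1, 2, rng)  # participant 1 splits its matrix
s2 = make_shares(x2, 2, rng)  # participant 2 splits its matrix
# each mix is what one channel would carry; the coordinator sums the mixes
total = aggregate([s1[0] + s2[0], s1[1] + s2[1]])
assert np.allclose(total, x1 + x2)
```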
2 Related Work
This section introduces the related work from two aspects: traditional traffic speed prediction methods based on GNNs and federated learning methods.

2.1 Traffic Speed Prediction Based on GNN
GNNs remain among the most advanced algorithms for traffic prediction problems because they can capture spatial correlation [8]. To handle the fact that a traffic network is a non-Euclidean topology, [5] proposed a flow prediction framework based on multiple residual recursive graph neural networks. [6] developed a multi-temporal module with a large receptive field that can simultaneously capture short-term neighbouring and long-term periodic dependencies, as well as a global correlation spatial module that can simultaneously capture local and non-local correlations in traffic networks. [18] designed an attention mechanism that makes better use of short- and long-term dependencies in time series. [9] extracted the dynamic characteristics of node attributes, generated a dynamic graph, and proposed a model based on GNN and RNN to improve the prediction performance. [16] designed a simplified spatio-temporal traffic forecasting GNN, aggregating different domain representations to effectively capture spatial correlation and a weighted aggregation mechanism to spread temporal correlation. These methods show the influence of graph structure information on speed prediction results on a single dataset. However, in the real-world traffic speed prediction scene, a single participant's dataset often has problems such as missing or erroneous data. Combining multi-party datasets for distributed training is a solution to these problems, but the above methods are not suitable for distributed training scenarios.

2.2 Privacy-Preserving Traffic Speed Prediction
Privacy is becoming increasingly important in intelligent transportation systems, and several solutions have been developed to secure user privacy. We will introduce the related work from two aspects: the traditional privacy-preserving traffic speed prediction method and the federated traffic speed prediction method. In the traditional privacy-preserving traffic speed prediction method, [24] utilized bit arrays for encoding the users’ data and maximum-likelihood estimation to
predict the outcome. [19] used pseudonyms to conceal vehicles' information and optimised the way pseudonyms are updated. [10] proposed a privacy-preserving incentive announcement network based on blockchain, expected to encourage users to share announcements. With the rapid advancement of federated learning, many academics are considering its use in traffic speed prediction to ensure privacy. In the federated traffic speed prediction methods, [11] merged federated learning with a practical GRU neural network and integrated the optimal global model to further improve traffic flow prediction. [1] proposed semi-supervised federated learning using unlabeled data in ITS. In order to provide dependable, low-delay vehicle-mounted communication, [17] proposed a transmission power and resource allocation architecture based on federated learning. [7] proposed a privacy-protected mobility prediction framework for accurate mobility prediction. [21] presented a federated learning framework-based mobility-aware active edge caching strategy, in which local models are aggregated in roadside units to update the global model. [15] constructed a consortium blockchain for a decentralised FL-based TFP system, in which miners verify model updates for distributed vehicles to prevent unreliable model updates. These works consider privacy protection in traffic speed prediction tasks, and some also consider distributed training and graph topology.
Existing privacy-preserving traffic speed prediction methods, when applied to distributed settings, incur heavy computation and communication costs, and they do not consider the topological information between roads. Some works use federated learning to reduce the computation and communication costs of training, and some take the graph's topology into account. We summarize the multi-party real-world traffic speed prediction scenario as follows: the graph's topology is public, and each node holds data from different participants. Current works, however, target a scenario in which each node holds data from a single participant. Moreover, current federated traffic speed prediction methods do not consider the privacy leakage risk caused by uploading embedding vectors.
3 Preliminaries
In this section, we define the problems to be solved and the network privacy in traffic speed prediction in our work. After that, we will introduce the federated averaging algorithm commonly used in federated learning.
HFTSP Base on SNAA

3.1 Problem Definition
Suppose $n_p$ participants $\{p_1, \ldots, p_{n_p}\} \in P$ cooperate to predict traffic speed. We denote the traffic network of participant $p_i$ as $G_{p_i}(V, A, X_{p_i})$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of road IDs and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix; $A_{i,j} = 1$ indicates that there is a connection between road $i$ and road $j$. $X_{p_i} \in \mathbb{R}^{n \times m}$ is the node feature matrix, whose $i$-th row $X_i \in \mathbb{R}^{m}$ contains the $i$-th road's records during the past $m$ time steps. In our scenario, $V$ and $A$ are the same for every participant. The node label $Y_{p_i} \in \mathbb{R}^{n \times m}$ contains each road's records in the future time steps. When the graph topology is public, each participant $p_i \in P$ holds the temporal series data $X_{p_i}$ of every node on the graph. The goal of privacy-preserving traffic speed prediction is to predict the labels $Y_{p_i}$ from the time series data $\{X_{p_i}, p_i \in P\}$ and the topological structure $A$ without revealing the network privacy defined in Subsect. 3.2.

3.2 Network Privacy in Traffic Speed Prediction
We define our horizontal network privacy in the presence of semi-honest adversaries [4]: each participant follows the procedure of the algorithm and never colludes with others to obtain any participant's private information. The information of the network that should be protected is the attributes of the nodes. The attributes of a node are learned from the data owned by that node; if they are leaked, they bring the risk of data leakage.

3.3 The FederatedAveraging Algorithm
FedAvg [12] is a federated learning method that allows clients to benefit from a shared model without storing their data centrally. Each client's local training dataset is never uploaded to the server; instead, each client uploads a locally computed update to the global model maintained by the coordinator. Equation (1) shows the update step on the coordinator side, where $K$ is the number of participants, $n_k$ is the size of client $k$'s dataset, $n$ is the total dataset size, and $w$ is the model's weight:

$$w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k} \quad (1)$$
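The weighted aggregation of Eq. (1) can be sketched as follows. This is a minimal illustration using lists of numpy arrays as per-layer model weights; the function name and data representation are our own, not part of the paper:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate per-client weight lists per Eq. (1):
    w_{t+1} <- sum_k (n_k / n) * w_{t+1}^k, with n = sum_k n_k."""
    n = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum((n_k / n) * w[layer] for w, n_k in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two clients, one-layer "models"; client 2 holds 3x more data.
w_a = [np.array([1.0, 3.0])]
w_b = [np.array([3.0, 5.0])]
global_w = fedavg([w_a, w_b], [100, 300])
print(global_w[0])  # [2.5 4.5]
```

The client with the larger dataset pulls the average toward its own weights, which is exactly the $n_k/n$ weighting in Eq. (1).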
4 The Proposed Algorithm
In this section, we will describe the algorithm’s main stages and analyze the possibility of privacy leakage in the algorithm.
Fig. 1. Framework of HFTSPA
4.1 Horizontal Federated Traffic Speed Prediction Algorithm
The algorithm consists of three stages, temporal information extraction, spatial information extraction, and traffic speed prediction, as shown in Fig. 1. The steps of the algorithm are described as follows:

Stage 1: Temporal information extraction. This stage consists of three steps: each participant mines the temporal features of its original temporal data using a GRU, then encrypts and transmits the features.

Step 1: Participant $p_i$ inputs its data $X_{p_i}$ into GRU1 to obtain the initial temporal feature matrix $H_{p_i}$ of each node.

Step 2: Participants randomly generate mask matrices $M_{p_i,p_j}$ in pairs and exchange them with each other. Adding or subtracting such a randomly generated matrix from the matrix to be encrypted is called blinding [3]. The initial temporal feature matrix $H_{p_i}$ is blinded to $\langle H_{p_i} \rangle$ by Eq. (2):

$$\langle H_{p_i} \rangle = H_{p_i} + \sum_{p_u \in P:\, p_i < p_u} M_{p_i,p_u} - \sum_{p_v \in P:\, p_v < p_i} M_{p_v,p_i} \pmod{R} \quad (2)$$

Step 3: Send the matrix $\langle H_{p_i} \rangle$ to the coordinator.
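The pairwise blinding of Eq. (2), and the cancellation of the masks once the coordinator sums the blinded matrices, can be illustrated with a toy sketch. Integer matrices, the modulus `R = 2**16`, and integer party indices standing in for participant IDs are all simplifying assumptions of this sketch:

```python
import numpy as np

R = 2 ** 16  # toy modulus; a real deployment works over a much larger ring
rng = np.random.default_rng(42)

def pairwise_masks(n_parties, shape):
    """One shared random mask M[(i, j)] per ordered pair i < j."""
    return {(i, j): rng.integers(0, R, size=shape)
            for i in range(n_parties) for j in range(i + 1, n_parties)}

def blind(i, H, masks, n_parties):
    """Eq. (2): add masks shared with higher-indexed peers,
    subtract masks shared with lower-indexed peers (mod R)."""
    out = H % R
    for j in range(n_parties):
        if i < j:
            out = (out + masks[(i, j)]) % R
        elif j < i:
            out = (out - masks[(j, i)]) % R
    return out

n_parties, shape = 3, (2, 4)  # 3 participants, toy 2x4 feature matrices
H = [rng.integers(0, 100, size=shape) for _ in range(n_parties)]
masks = pairwise_masks(n_parties, shape)
blinded = [blind(i, H[i], masks, n_parties) for i in range(n_parties)]

# Coordinator side: every mask appears once with + and once with -,
# so the pairwise masks cancel in the sum (Eq. (3) then divides by |P|).
assert np.array_equal(sum(blinded) % R, sum(H) % R)
```

Each individual $\langle H_{p_i} \rangle$ is masked by random matrices the coordinator never sees, so the coordinator learns only the sum, which is all that Eq. (3) needs.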
Stage 2: Spatial information extraction. This stage runs at the coordinator and includes three steps: first, aggregate the temporal feature matrices of all participants, then use the aggregated features and the graph's topology to mine spatial information with a GNN.

Step 1: The coordinator receives the blinded node temporal feature matrix $\langle H_{p_i} \rangle$ from each participant, aggregates the temporal features of each node into a new temporal feature matrix $H_c$ by Eq. (3), and takes it as the global node feature:

$$H_c = \frac{\sum_{p_i \in P} \langle H_{p_i} \rangle}{|P|} \quad (3)$$

Step 2: Input the adjacency matrix $A_g$ together with the global temporal features $H_c$ into the GN module to update the node features. The updated feature matrix $H_G$ carries the spatial features of the graph.

Step 3: Send the updated features $H_G$ to each participant.

Stage 3: Traffic speed prediction. This stage includes two steps: splice the temporal feature matrix $H_{p_i}$ with the updated feature matrix $H_G$, and input the spliced matrix into a GRU for speed prediction.

Step 1: Each participant splices the updated global features $H_G$ with its local temporal features $H_{p_i}$.

Step 2: Input the spliced feature matrix into GRU2 to obtain the prediction results.

During model training, model averaging aggregation [12] is performed after every $k$ rounds of Stages 1–3, so the model can be trained with other participants' data without that data leaving their local servers. The pseudo-code of the algorithm is shown in Algorithm 1.

4.2 Privacy Analysis
We analyze the possibility of privacy leakage in the three stages of HFTSPA to demonstrate its security. Stage 1 has three steps. In step 1, each participant computes the initial temporal feature matrix $H_{p_i}$ locally without exposing any private data. In steps 2–3, each participant blinds $H_{p_i}$ into $\langle H_{p_i} \rangle$ and sends it to the coordinator. Neither the coordinator nor any other party can recover a participant's $H_{p_i}$ from the intermediate $\langle H_{p_i} \rangle$: without the participants' pairwise random masks, it is impossible to deduce any participant's attribute matrix from the encrypted intermediate results. In stage 2, step 1, the coordinator obtains only the aggregated result $H_c$; in step 2, the coordinator computes the updated feature matrix $H_G$ locally. Hence stage 2 is safe. In stage 3, each participant performs splicing and speed prediction locally without exposing any private data.
Algorithm 1: HFTSPA
Input: adjacency matrix A, temporal series data X_{p_i}
Output: speed predictions ŷ_{p_i}
 1  for training round r = 1, 2, ..., R do
        // STAGE 1: Time series data mining
        // Participants:
 2      for each participant p_i ∈ P do
 3          H_{p_i} ← GRU1(X_{p_i})
 4          randomly generate mask matrices M_{p_i,p_j} in pairs and exchange them
 5          ⟨H_{p_i}⟩ ← blind H_{p_i} using M_{p_i,p_j} by Eq. (2)
 6          send ⟨H_{p_i}⟩ to the coordinator
        // STAGE 2: Spatial data mining
        // Coordinator:
 7      H_c ← Σ_{p_i∈P} ⟨H_{p_i}⟩ / |P|        // calculate H_c according to Eq. (3)
 8      H_G ← GN(H_c)
 9      send H_G to each participant
        // STAGE 3: Traffic speed prediction
        // Participants:
10      for each participant p_i ∈ P do
11          ŷ_{p_i} ← GRU2([H_{p_i}; H_G])
12          ℓ_{p_i} ← L(ŷ_{p_i}, y_{p_i})
13      every k rounds, send the local model weights θ_{p_i,1}, θ_{p_i,2} to the coordinator
        // Coordinator: calculate θ̄_1 and θ̄_2 according to Eq. (1)
14      θ̄_1 ← Σ_{p_i∈P} (N_i / N) θ_{p_i,1}
15      θ̄_2 ← Σ_{p_i∈P} (N_i / N) θ_{p_i,2}
5 Experiments
This section introduces the datasets used in the experiments, the baseline algorithms, and the evaluation metric. We then conduct three experiments to verify our algorithm's effectiveness.

5.1 Datasets
We briefly describe the processing of the topology dataset and the two temporal series datasets used in the experiments. We construct Gg(A, V) based on the urban roads of the Shanghai public transport open dataset1. The dataset's fields include road ID, road name, starting road point, ending road point, grade, etc. We obtain the intersections by combining the road names of the starting point and the ending point into a new road name.

1 https://data.sh.gov.cn
We then take the road IDs as the node ID set V of the graph G(A, V). If a road vr has a starting road vi and an ending road vj, we consider roads vi and vj connected, that is, Aij = 1. We use two traffic speed datasets to evaluate the performance of our algorithm and the baselines.

(1) QiangSheng Taxi2: this dataset contains the taxi traffic speed in Shanghai over 20 days, from August 1st, 2016 to August 20th, 2016, with fields including taxi ID, longitude and latitude, time, and speed. First, the samples are filtered by the time field at 5-min intervals, i.e., one data sample is taken every five minutes. If there is no speed record at the current time, we use the average value of the data in the adjacent 5-min time window. The data are then aligned with Gg(A, V) according to the longitude and latitude in the dataset.

(2) Mobai3: this dataset contains the bike traffic speed in Shanghai over the same 20 days, with fields such as bicycle ID, longitude and latitude, and driving path. We calculate the average speed of each bicycle from the path and time fields, i.e., from the start time, end time, and the distance computed from the path, and take this average speed as the bicycle's speed at every point on its driving path. After that, time and road-section alignment are performed just as for the QiangSheng Taxi dataset.

We sampled the traffic data in the two datasets every 5 min and spliced every 12 sampling points into a sequence of length 12. Second, the sequences with a corresponding time length of 6 in the two databases are spliced into sequences of length 12. Third, all corresponding sequences are linked. Finally, the whole sequence is cut into several sequences of length 24 as training samples.
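The adjacency construction described above can be sketched as follows; the record layout (road ID, start road, end road) follows the dataset description, while the function name and the undirected treatment of connectivity are assumptions of this sketch:

```python
import numpy as np

def build_adjacency(node_ids, records):
    """records: (road_id, start_road, end_road) triples.
    A road whose start is v_i and end is v_j links v_i and v_j: A[i, j] = 1."""
    idx = {v: k for k, v in enumerate(node_ids)}
    A = np.zeros((len(node_ids), len(node_ids)), dtype=int)
    for _road, vi, vj in records:
        if vi in idx and vj in idx:
            A[idx[vi], idx[vj]] = 1
            A[idx[vj], idx[vi]] = 1  # assumption: connectivity treated as undirected
    return A

roads = ["r1", "r2", "r3"]
recs = [("r9", "r1", "r2"), ("r8", "r2", "r3")]
A = build_adjacency(roads, recs)
print(A)
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```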
The forecasting task is to predict the traffic speed at the last 6 sampling points of each sequence from its first 6 sampling points, for sequences of length 12. The statistics of the datasets are shown in Table 1. We use a 7:2:1 ratio to divide each dataset into training, test, and validation sets.

Table 1. Statistics of datasets QiangSheng Taxi and Mobai.

Dataset           # Train Seq   # Val Seq   # Test Seq
QiangSheng Taxi   672           96          192
Mobai             672           96          192

5.2 Baseline Algorithms
We use CNFGNN, Standalone HFTSPA (SHFTSPA), HFTSPA(w-en), and HFTSPA(w-GN) as the baseline algorithms.

2 https://sodachallenges.com/datasets/taxi-gps
3 https://sodachallenges.com/datasets/mobike-shanghai
(1) CNFGNN [14]: the settings of the CNFGNN model are the same as in [14]. Each node has a GRU model whose hidden layer has 64 dimensions, and a two-layer Graph Network (GN) serves as the spatial information extraction model at the coordinator. For comparison with our algorithm, we process the datasets as follows: first, we sample the traffic data in the two datasets every 5 min and splice every 6 sampling points into a sequence of length 6; second, the sequences with a corresponding time length of 6 in the two databases are spliced into sequences of length 12; third, all corresponding sequences are linked; finally, the whole sequence is cut into several sequences of length 24 as training samples.

(2) Standalone HFTSPA (SHFTSPA): a single participant holds the complete graph topology and traffic speed information. It uses the complete information of one dataset to compute the node embeddings and predict traffic speed. We use it as a benchmark for the consistency experiment.

(3) HFTSPA(w-en): HFTSPA without the secret-sharing technology; the embeddings are simply added directly, without blinding.

(4) HFTSPA(w-GN): uses a GRU-based encoder-decoder model on each participant and SNAA to aggregate the node embedding vectors. The algorithm is trained with the Federated Averaging algorithm [13].

5.3 Experimental Settings
We set up two participants to conduct the simulation experiments: participant 1 uses the QiangSheng Taxi dataset and participant 2 uses the Mobai dataset. We evaluate the performance of HFTSPA and all baseline algorithms on the traffic prediction task. For all algorithms, each participant has a GRU-based encoder-decoder model, and all algorithms use the Adam optimizer with a learning rate of 1e-3. In the accuracy experiment, the CNFGNN model uses a dataset composed of the two datasets: because CNFGNN runs on a single network, we aggregate the two datasets into one network. Therefore, when comparing, we average the RMSE obtained by the two participants of the other algorithms and use it as the RMSE for comparison.

5.4 Evaluation Metrics
We use the Root Mean Squared Error (RMSE) to evaluate the performance of each algorithm:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \quad (4)$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the ground-truth label, and $n$ is the number of samples.
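Eq. (4) translates directly into code; this is a minimal numpy sketch with illustrative speed values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, Eq. (4)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Three speed readings (km/h) vs. their predictions:
print(rmse([30.0, 40.0, 50.0], [28.0, 41.0, 53.0]))  # sqrt((4+1+9)/3) ≈ 2.160
```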
5.5 Consistency Experiment
We verify the correctness of HFTSPA by comparing the accuracy of its predicted traffic speed with that of HFTSPA(w-en) on traffic networks with the same initial parameter settings. As shown in Fig. 2, the RMSE values of HFTSPA are identical to those of HFTSPA(w-en); the experimental results therefore indicate that HFTSPA is as accurate as HFTSPA(w-en) in terms of RMSE. The secret-sharing technology ensures, without any loss of accuracy, that the intermediate embedding obtained by the coordinator's aggregation is consistent with the plaintext computation result. Based on these results, we conclude that HFTSPA does not incur accuracy loss in privacy-preserving traffic speed prediction.
Fig. 2. Results of the consistency experiment
5.6 Ablation Experiment
First, we use HFTSPA and HFTSPA(w-GN) to predict the traffic speed on the taxi and Mobai datasets to illustrate the impact of the graph's topology on the prediction task. Figure 3(a) depicts the RMSE values for HFTSPA(w-GN) and HFTSPA. The prediction accuracy of HFTSPA is higher than that of HFTSPA(w-GN), which shows that considering the graph structure has a great impact on the speed prediction results. Second, we compare the prediction accuracy of SHFTSPA and HFTSPA to illustrate the improvement brought by multi-party cooperative training. As shown in Fig. 3(b), HFTSPA performs better for both participants than SHFTSPA alone.
Fig. 3. Results of the ablation experiment
5.7 Accuracy Experiment
In the accuracy experiment, we compare the accuracy of HFTSPA and CNFGNN on the two datasets to illustrate the advantages of HFTSPA in real-world traffic speed prediction scenarios. We use HFTSPA and CNFGNN to predict the traffic speed on the taxi and Mobai datasets; Fig. 4 depicts the resulting RMSE values. We believe the accuracy of HFTSPA is higher than that of CNFGNN because CNFGNN assumes the data are scattered across the nodes: if a node holds two datasets, learning its temporal features is easily disturbed by data from different participants. In HFTSPA, by contrast, each participant learns the time series features from its own data only, and these features are then aggregated at the coordinator. This matrix aggregation method reasonably handles the situation where a node holds data from two participants.
Fig. 4. Results of the accuracy experiment
6 Conclusions
In this paper, we propose a horizontal federated traffic speed prediction algorithm for traffic speed prediction on datasets with a more practical distribution scenario. The coordinator and participants jointly construct a model based on GRU and GN to improve traffic prediction accuracy. Secret-sharing technology strictly protects the network privacy of each participant, without loss of accuracy, under a semi-honest model. Comprehensive experiments on the QiangSheng Taxi and Mobai datasets demonstrate the correctness and effectiveness of the algorithm. In the future, we will consider additional information in the prediction task, such as weather and festivals, rather than only temporal and spatial information, to achieve a more comprehensive traffic speed prediction algorithm.

Acknowledgements. This work was supported by the National Natural Science Foundation of China under Grant No. 62002063 and No. U21A20472, the National Key Research and Development Plan of China under Grant No. 2021YFB3600503, the Fujian Collaborative Innovation Center for Big Data Applications in Governments, the Fujian Industry-Academy Cooperation Project under Grant No. 2017H6008 and No. 2018H6010, the Natural Science Foundation of Fujian Province under Grant No. 2020J05112, the Fujian Provincial Department of Education under Grant No. JAT190026, the Major Science and Technology Project of Fujian Province under Grant No. 2021HZ022007, and the Haixi Government Big Data Application Cooperative Innovation Center.
References

1. Albaseer, A., Ciftler, B.S., Abdallah, M., Al-Fuqaha, A.: Exploiting unlabeled data in smart cities using federated edge learning. In: 2020 International Wireless Communications and Mobile Computing (IWCMC), pp. 1666–1671. IEEE (2020)
2. Battaglia, P.W., et al.: Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018)
3. Bonawitz, K., et al.: Practical secure aggregation for privacy-preserving machine learning. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191 (2017)
4. Brickell, J., Shmatikov, V.: Privacy-preserving graph algorithms in the semi-honest model. In: Roy, B. (ed.) ASIACRYPT 2005. LNCS, vol. 3788, pp. 236–252. Springer, Heidelberg (2005). https://doi.org/10.1007/11593447_13
5. Chen, C., et al.: Gated residual recurrent graph neural networks for traffic prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 485–492 (2019)
6. Fang, S., Zhang, Q., Meng, G., Xiang, S., Pan, C.: GSTNet: global spatial-temporal network for traffic flow prediction. In: IJCAI (2019)
7. Feng, J., Rong, C., Sun, F., Guo, D., Li, Y.: PMF: a privacy-preserving human mobility prediction framework via federated learning. Proc. ACM Interact. Mobile Wearable Ubiquitous Technol. 4(1), 1–21 (2020)
8. Jiang, W., Luo, J.: Graph neural network for traffic forecasting: a survey. Expert Syst. Appl. 207, 117921 (2022)
9. Li, F., et al.: Dynamic graph convolutional recurrent network for traffic prediction: benchmark and solution. ACM Trans. Knowl. Disc. Data (TKDD) 16(1), 1–22 (2021)
10. Li, L., et al.: CreditCoin: a privacy-preserving blockchain-based incentive announcement network for communications of smart vehicles. IEEE Trans. Intell. Transp. Syst. 19(7), 2204–2220 (2018)
11. Liu, Y., James, J., Kang, J., Niyato, D., Zhang, S.: Privacy-preserving traffic flow prediction: a federated learning approach. IEEE Internet Things J. 7(8), 7751–7763 (2020)
12. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282. PMLR (2017)
13. McMahan, H.B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: International Conference on Artificial Intelligence and Statistics (2016)
14. Meng, C., Rambhatla, S., Liu, Y.: Cross-node federated graph neural network for spatio-temporal data modeling. In: Knowledge Discovery and Data Mining, pp. 1202–1211 (2021)
15. Qi, Y., Hossain, M.S., Nie, J., Li, X.: Privacy-preserving blockchain-based federated learning for traffic flow prediction. Future Gener. Comput. Syst. 117, 328–337 (2021)
16. Roy, A., Roy, K.K., Ahsan Ali, A., Amin, M.A., Rahman, A.K.M.M.: SST-GNN: simplified spatio-temporal traffic forecasting model using graph neural network. In: Karlapalem, K., et al. (eds.) PAKDD 2021. LNCS (LNAI), vol. 12714, pp. 90–102. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-75768-7_8
17. Samarakoon, S., Bennis, M., Saad, W., Debbah, M.: Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Trans. Commun. 68(2), 1146–1159 (2019)
18. Shi, X., Qi, H., Shen, Y., Wu, G., Yin, B.: A spatial-temporal attention approach for traffic prediction. IEEE Trans. Intell. Transp. Syst. 22(8), 4909–4918 (2020)
19. Sucasas, V., Mantas, G., Saghezchi, F.B., Radwan, A., Rodriguez, J.: An autonomous privacy-preserving authentication scheme for intelligent transportation systems. Comput. Secur. 60, 193–205 (2016)
20. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10(2), 1–19 (2019)
21. Yu, Z., Hu, J., Min, G., Zhao, Z., Miao, W., Hossain, M.S.: Mobility-aware proactive edge caching for connected vehicles using federated learning. IEEE Trans. Intell. Transp. Syst. 22(8), 5341–5351 (2020)
22. Yuan, X., et al.: FedSTN: graph representation driven federated learning for edge computing enabled urban traffic flow prediction. IEEE Trans. Intell. Transp. Syst. (2022)
23. Zhang, C., Zhang, S., James, J., Yu, S.: FASTGNN: a topological information protected federated learning approach for traffic speed forecasting. IEEE Trans. Ind. Inf. 17(12), 8464–8474 (2021)
24. Zhou, Y., Mo, Z., Xiao, Q., Chen, S., Yin, Y.: Privacy-preserving transportation traffic measurement in intelligent cyber-physical road systems. IEEE Trans. Veh. Technol. 65(5), 3749–3759 (2015)
Author Index
B Bi, Huimin I-269 Bi, Sheng II-555 C Cai, Hongming II-543 Cao, Donglin I-58 Cao, Jian I-422, II-18 Cao, Weiwei I-17 Cao, Yang II-488 Cen, Jianwei II-187 Chang, Yuan I-492 Chen, Bin II-201 Chen, Changzhi II-435 Chen, Dangrun II-645 Chen, Jingcan I-173 Chen, Liwei I-492 Chen, Ningjiang I-449 Chen, Renhao II-271 Chen, Rui II-352 Chen, Wang I-207 Chen, Wei-Neng II-311 Chen, Weiqi I-350 Chen, Xiaoqi I-84 Chen, Yang I-173 Chen, Yin II-499 Chen, Yun I-3 Chen, Zhanxuan II-187 Chen, Zhen I-73 Chen, Zihao II-580 Cheng, Hao II-555 Cheng, Shiwei II-610 Cui, Lizhen I-390 D Dai, Jinkun I-110 Dai, Weihui I-182, II-379 Di, Kai II-499 Ding, Kai I-295, II-48
Ding, Kexin I-306 Ding, Xinyi II-118 F Fan, Weijiang II-488 Fang, Yili II-118 Fang, Yuanfei II-387 Fang, Yutong I-84 Feng, Guozheng II-103 Feng, Liang II-256 Feng, Shanshan II-18 Feng, Yingrui I-422 G Gao, Lili II-148 Gao, Liming II-3 Gao, Liping II-148 Gao, Shengxiang I-232 Gong, Bin II-75 Gong, Qingyuan I-173 Gu, Huamao II-118 Gu, Yang I-422 Gu, Zhibo II-235 Guo, Bin II-133, II-326, II-337 Guo, Chang I-42 Guo, Kun I-84, I-110, I-147, I-217, II-33, II-645 Guo, Wenzhong I-84, II-645 Guo, Yijia I-207 H Han, Chaozhe I-207 Han, Tao II-118 Hao, Zhaotie II-133 He, Chaobo I-28 He, Liang II-435 He, Pengfei I-73 He, Xiaofeng II-435 He, Zhilei I-232
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Sun et al. (Eds.): ChineseCSCW 2022, CCIS 1682, pp. 661–665, 2023. https://doi.org/10.1007/978-981-99-2385-4
Hu, Haotian II-412 Hu, Kun I-125 Hu, Qingmeng II-401 Hu, Yanmei I-481 Hu, Yi II-435 Hu, Yichong I-245 Huang, Heng I-422 Huang, Huanhong II-379 Huang, Jieyun II-18 Huang, Jin I-411 Huang, Li II-187 Huang, Qiao I-125, I-135 Huang, Qingqing I-84 Huang, Rui II-453 Huang, Yangming II-298 Huang, Yuanjiang II-580 Huang, Yue I-390 Huang, Yujian I-481 Huang, Yuxin I-232 Huang, Ze II-453 J Ji, Yu II-435 Jia, Ru I-95, II-298 Jia, Tao II-580 Jian, Wei II-103 Jiang, Bin II-412, II-423 Jiang, Bo I-125, I-135 Jiang, Fulin II-570 Jiang, Jiuchuan II-499 Jiang, Lihong II-543 Jiang, Wenxuan I-182 Jiang, Yichuan I-437, II-499 Jin, Dingquan I-492 K Kang, Yan II-379 Ke, Xifan II-33 Kong, Deyue II-379 Kong, Lanju I-390 L Lai, Liqian II-89 Lan, Liantao I-245 Leng, Jingsong I-162, II-164 Li, Bing II-271 Li, Dongsheng I-437 Li, Fuan II-645 Li, Ji II-476
Li, Jianguo I-162 Li, Jingjing I-350, I-365 Li, Lin II-570 Li, Mengdi II-514 Li, Mengyuan II-133 Li, Qingzhong I-390 Li, Ru I-95, II-298 Li, Shaozi II-235, II-464 Li, Shujie II-352 Li, Shuqin II-622 Li, Wanting I-449 Li, Weimin I-42 Li, Xiaohu II-595 Li, Xiaoping II-219 Li, Yajing II-103 Li, Zhimin II-89 Liang, Zhejun I-295, II-48 Liao, Hao II-555 Lin, Bing I-320 Lin, Dazhen I-58 Lin, Ronghua I-28, I-162, I-245, I-376, II-164 Lin, Yanghao I-58 Liu, Dezheng II-243 Liu, Dongning I-516 Liu, Dui I-320 Liu, Feiyang II-337 Liu, Jia I-481 Liu, Jianjun I-516 Liu, Jianxun I-257 Liu, Jiaqi II-286 Liu, Jiawei II-595 Liu, Jin I-295, II-48 Liu, Jing II-367 Liu, Li II-595 Liu, Linlin I-73 Liu, Mu II-543 Liu, Qiang I-481 Liu, Shijun II-62 Liu, Siwei II-243 Liu, Wei-Li II-256 Liu, Xiaoping II-570 Liu, Xiaotao II-367 Liu, Xiaowei I-73 Liu, Yan II-337 Liu, Youzhe I-182, II-379 Liu, Yuechang II-89 Long, Jinyi II-243 Lu, Peng II-514
Lu, Tun I-284, I-437 Luo, Zhiming II-235, II-464 Lyu, Chen II-476 M Ma, Cuixin II-256 Ma, Ke II-326 Meng, Fanmin II-379 N Ning, Xiangdong II-75 Ning, Xuhui II-367 O Ouyang, Min I-502 Ouyang, Sipeng I-257 P Pan, Li II-62 Pan, Maolin II-175 Pan, Zhifu II-271 Peng, Houfu II-634 Q Qi, Lianyong I-257 Qi, Wenchao I-73 Qiao, Xiaoyi II-622 Qin, Yangjie I-411 Qiu, Sihang II-201 Qu, Xinran II-387 R Ren, Tianqi I-182 Ren, Zhuoli II-286 S Shang, Jiaxing I-17 Shao, Yiyang II-326 Shao, Yuting I-173 Shen, Jun I-95 Shen, Limin I-73 Shen, Yijun I-465 Shen, Yubo II-298 Shi, Haoran II-62 Shi, Xuan-Li II-311 Shi, Yanjun I-207, II-387 Song, Shizhe I-135 Song, Yunfei II-530
Su, Songzhi II-453 Sun, Hailong I-465 Sun, Hong I-17 Sun, Minyu II-412, II-423 Sun, Wei II-219 Sun, Xiao II-271 Sun, Yuqing II-75 T Tai, Yu I-502 Tan, Wenan I-295, II-48 Tang, Feiyi I-162 Tang, Na I-350 Tang, Yan I-194 Tang, Yiming II-271, II-352 Tang, Yong I-28, I-245, I-350, I-376, II-164 Tian, Yu I-422 Tian, Zhuo I-405 W Wan, Ben II-543 Wan, Lin II-75 Wang, Gaojie II-62 Wang, Hao II-337 Wang, Hongbin II-401 Wang, Hui II-286 Wang, Jianchao I-502 Wang, Jingbin II-33 Wang, Liang II-286 Wang, Linna II-530, II-634 Wang, Linqing I-269 Wang, Liqiang II-62 Wang, Shu II-595 Wang, Shuo II-555 Wang, Tong I-492, I-502 Wang, Wenhao I-207 Wang, Xiao II-476 Wang, Xiaomeng II-580 Wang, Ye I-125, I-135 Wang, Yulin I-3 Wang, Yuxiao II-543 Wei, Dingmei I-42 Wei, Feng-Feng II-311 Wen, Yiping I-257 Weng, Yu I-28, I-376 Wenmei, Nie I-335 Wu, Hanrui II-243 Wu, Huiqian II-75
Wu, Lie II-133 Wu, Ling I-110, I-217 Wu, Pianran I-182 Wu, Quanwang I-17 Wu, Renfei II-33 Wu, Wen II-435 Wu, Wenbin II-352 Wu, Xi II-352 Wu, Yunzhi II-175 Wu, Zhengyang II-187 Wu, Ziqiang II-555
Yu, Siyu I-449 Yu, Wenguang I-28, I-376 Yu, Xinyue I-284 Yu, Yang II-175 Yu, Zhengtao I-232 Yu, Zhiwen II-133, II-286, II-326 Yu, Zhiyong I-147 Yuan, Chengzhe I-245 Yuan, Junying I-3 Yuan, Zhiheng II-387 Yuan, Zhuangmiao II-118
X Xi, Lei II-352 Xia, Daoxun II-530, II-634 Xia, Linjie I-232 Xiang, Peng I-320 Xiao, Jiaojiao I-405 Xiao, Jing II-488 Xiaoxia, Song I-335 Xie, Dongbo II-89 Xu, Jianbo II-103 Xu, Lili II-634 Xu, Yiwu I-3 Xu, Yonghui I-390 Xulong, Zhang I-335 Xun, Yaling I-269
Z Zeng, Haoyang I-449 Zhang, Bolin II-423 Zhang, Changyou I-405 Zhang, Hongyu I-194 Zhang, Jia II-243 Zhang, Jifu I-269 Zhang, Mingzhe II-622 Zhang, Xinyuan II-543 Zhang, Yuqi II-326 Zhang, Yuxuan I-17 Zhang, Zheng II-555 Zhang, Zhenjie II-118 Zhang, Zhuo II-514 Zhang, Zihan II-645 Zhao, Hong II-367 Zhao, Song II-610 Zhao, Yan I-42 Zhao, Yong II-201 Zheng, JiaChen II-645 Zheng, Linjiang I-17 Zheng, Xiangwei II-622 Zheng, Yu I-95 Zhong, Jinghui II-256 Zhong, Zhi I-306 Zhou, Aohui I-125 Zhou, Haoyang I-365 Zhou, Siyuan I-135 Zhou, Songjian II-555 Zhou, Ximin II-464 Zhou, Xinjiao II-412, II-423 Zhu, Chenshuang II-610 Zhu, Enchang I-232 Zhu, Heng I-42 Zhu, Huiling II-3 Zhu, Jia I-411 Zhu, Jie I-306
Y Yang, Beiteng I-516 Yang, Chao II-412, II-423 Yang, Guo II-311 Yang, Jinlong I-217 Yang, Manling II-595 Yang, Ruoxuan I-465 Yang, Siyi II-379 Yang, Xinyi II-33 Yang, Yanzhen I-162, II-164 Yang, Yongqiang I-465 Yao, Zhen II-148 Ye, Enjie I-84, II-645 Yin, Wenyi I-182 Yong, Li I-335 You, Dianlong I-73 You, Jinpeng I-58 Yu, Chenghao I-182 Yu, Cizhou I-437 Yu, Hongjie I-320
Zhu, Junjie II-570 Zhu, Nengjun II-18 Zhu, Xia II-219 Zhu, Yujia II-530
Zhu, Zhengqiu II-201 Zhuang, Qifeng I-147 Zhuo, Hankui II-3 Zhuo, Hankz Hankui II-89