LNAI 12430
Xiaodan Zhu Min Zhang Yu Hong Ruifang He (Eds.)
Natural Language Processing and Chinese Computing 9th CCF International Conference, NLPCC 2020 Zhengzhou, China, October 14–18, 2020 Proceedings, Part I
Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science
Series Editors Randy Goebel University of Alberta, Edmonton, Canada Yuzuru Tanaka Hokkaido University, Sapporo, Japan Wolfgang Wahlster DFKI and Saarland University, Saarbrücken, Germany
Founding Editor Jörg Siekmann DFKI and Saarland University, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/1244
Editors
Xiaodan Zhu, ECE & Ingenuity Labs Research Institute, Queen’s University, Kingston, ON, Canada
Min Zhang, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Yu Hong, School of Computer Science and Technology, Soochow University, Suzhou, China
Ruifang He, College of Intelligence and Computing, Tianjin University, Tianjin, China
ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-3-030-60449-3 ISBN 978-3-030-60450-9 (eBook)
https://doi.org/10.1007/978-3-030-60450-9
LNCS Sublibrary: SL7 – Artificial Intelligence
© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Welcome to the 9th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2020). Following the success of previous conferences held in Beijing (2012), Chongqing (2013), Shenzhen (2014), Nanchang (2015), Kunming (2016), Dalian (2017), Hohhot (2018), and Dunhuang (2019), this year’s NLPCC was held in Zhengzhou, in the central part of China. As a leading international conference on natural language processing (NLP) and Chinese computing (CC), organized by the CCF-NLP (Technical Committee of Natural Language Processing, China Computer Federation, formerly known as Technical Committee of Chinese Information, China Computer Federation), NLPCC 2020 serves as an important forum for researchers and practitioners from academia, industry, and government to share their ideas, research results, and experiences, and to promote their research and technical innovations in these fields.

The fields of NLP and CC have boomed in recent years, and the growing number of submissions to NLPCC is testament to this trend. After rejecting 25 submissions that did not meet the submission guidelines, we were left with a total of 404 valid submissions to the entire conference, including the main conference, student workshop, evaluation workshop, and the special explainable AI (XAI) workshop. Of the 377 valid submissions to the main conference, 315 were written in English and 62 in Chinese. Following NLPCC’s tradition, we welcomed submissions in nine topical areas for the main conference: Conversational Bot/QA; Fundamentals of NLP; Knowledge Base, Graphs and Semantic Web; Machine Learning for NLP; Machine Translation and Multilinguality; NLP Applications; Social Media and Network; Text Mining; Trending Topics (Explainability, Ethics, Privacy, Multimodal NLP, etc.).
Acceptance decisions were made in multiple scientific Program Committee (PC) meetings, held virtually due to the COVID-19 pandemic and attended by the general, PC, and area chairs. After our deliberations for the main conference, 83 submissions were accepted as oral papers (70 in English and 13 in Chinese) and 30 as poster papers. Nine papers were nominated by the area chairs for the Best Paper Award in both the English and Chinese tracks. An independent Best Paper Award Committee was formed to select the best paper from the shortlist. The proceedings include only the accepted English papers; the Chinese papers appear in the journal ACTA Scientiarum Naturalium Universitatis Pekinensis. In addition to the main proceedings, 2 papers were accepted for the student workshop, 8 for the evaluation workshop, and 4 for the special Explainable AI (XAI) workshop.

We were honored to have four internationally renowned keynote speakers – Claire Cardie (Cornell University, USA, and ACL Fellow), Ido Dagan (Bar-Ilan University, Israel, and ACL Fellow), Edward Grefenstette (Facebook AI Research, UK), and Danqi Chen (Princeton University, USA) – share their expert opinions on recent developments in NLP via their wonderful lectures.
The organization of NLPCC 2020 is due to the help of a great many people:

• We are grateful for guidance and advice provided by general co-chairs Mark Steedman and Xuanjing Huang, and Organization Committee co-chairs Hongying Zan, Xiaojun Wan, and Zhumin Chen. We especially thank Xiaojun Wan, the central committee member who acted as a central adviser to both of us as PC chairs, making sure all of the decisions were made on schedule.
• We would like to thank the student workshop co-chairs Jin-Ge Yao and Xin Zhao, evaluation co-chairs Shoushan Li and Yunbo Cao, XAI workshop co-chairs Feiyu Xu, Dongyan Zhao, Jun Zhu, and Yangzhou Du, as well as technical workshop co-chairs Xiaodong He and Feiyu Xu.
• We are indebted to the 18 area chairs and the 251 primary reviewers, for both the English and Chinese tracks. This year, in the special COVID-19 period, they operated under a severe load and still completed their high-quality reviews. We could not have met the various deadlines during the review process without their hard work.
• We thank tutorial co-chairs Xipeng Qiu and Rui Xia for assembling a comprehensive tutorial program covering a wide range of cutting-edge topics in NLP.
• We thank sponsorship co-chairs Dongyan Zhao and Derek Wong for securing sponsorship for the conference.
• We thank publication co-chairs Yu Hong and Ruifang He for ensuring every little detail in the publication process was properly taken care of. Those who have done this form of service work know how excruciating it can be. On behalf of us and all of the authors, we thank them for their work, as they truly deserve a big round of applause.
• Above all, we thank everybody who chose to submit their work to NLPCC 2020. Without your support, we could not have put together a strong conference program.

Stay safe and healthy, and we hope you enjoyed NLPCC 2020.

August 2020

Xiaodan Zhu
Min Zhang
Organization
NLPCC 2020 is organized by China Computer Federation, and hosted by Zhengzhou University and the National State Key Lab of Digital Publishing Technology.
Organization Committee

General Chairs
Mark Steedman (The University of Edinburgh, UK)
Xuanjing Huang (Fudan University, China)

Program Committee Chairs
Xiaodan Zhu (Queen’s University, Canada)
Min Zhang (Tsinghua University, China)

Student Workshop Chairs
Jin-Ge Yao (Microsoft Research Asia, China)
Xin Zhao (Renmin University of China, China)

Evaluation Chairs
Shoushan Li (Soochow University, China)
Yunbo Cao (Tencent, China)

Technical Workshop Chairs
Xiaodong He (JD.com, China)
Feiyu Xu (SAP, Germany)

Tutorial Chairs
Xipeng Qiu (Fudan University, China)
Rui Xia (Nanjing University of Science and Technology, China)

Publication Chairs
Yu Hong (Soochow University, China)
Ruifang He (Tianjin University, China)

Journal Coordinator
Yunfang Wu (Peking University, China)

Conference Handbook Chair
Yuxiang Jia (Zhengzhou University, China)
Sponsorship Chairs
Dongyan Zhao (Peking University, China)
Derek Wong (University of Macau, Macau)

Publicity Co-chairs
Wei Lu (Singapore University of Technology and Design, Singapore)
Haofen Wang (Tongji University, China)

Organization Committee Chairs
Hongying Zan (Zhengzhou University, China)
Xiaojun Wan (Peking University, China)
Zhumin Chen (Shandong University, China)
Area Chairs

Conversational Bot/QA
Yu Su (The Ohio State University, USA)
Quan Liu (iFlytek, China)

Fundamentals of NLP
Lili Mou (University of Alberta, Canada)
Jiajun Zhang (Institute of Automation, Chinese Academy of Sciences, China)

Knowledge Graph and Semantic Web
Xiang Ren (University of Southern California, USA)
Min Liu (Harbin Institute of Technology, China)

Machine Learning for NLP
Mo Yu (IBM T.J Watson Research Center, USA)
Jiwei Li (Shannon.AI, China)

Machine Translation and Multilinguality
Jiatao Gu (Facebook AI, USA)
Jinsong Su (Xiamen University, China)

NLP Applications
Wei Gao (Singapore Management University, Singapore)
Xiangnan He (University of Science and Technology of China, China)

Text Mining
Wei Lu (Singapore University of Technology and Design, Singapore)
Qi Zhang (Fudan University, China)

Social Network
Xiangliang Zhang (King Abdullah University of Science and Technology, Saudi Arabia)
Huaping Zhang (Beijing Institute of Technology, China)
Trending Topics
Caiming Xiong (Salesforce, USA)
Zhiyuan Liu (Tsinghua University, China)
Treasurer
Yajing Zhang (Soochow University, China)
Xueying Zhang (Peking University, China)

Webmaster
Hui Liu (Peking University, China)
Program Committee

Wasi Ahmad (University of California, Los Angeles, USA)
Xiang Ao (Institute of Computing Technology, Chinese Academy of Sciences, China)
Lei Bi (Beijing Normal University, Zhuhai, China)
Fei Cai (National University of Defense Technology, China)
Pengshan Cai (University of Massachusetts Amherst, USA)
Hengyi Cai (Institute of Computing Technology, Chinese Academy of Sciences, China)
Deng Cai (The Chinese University of Hong Kong, Hong Kong, China)
Yi Cai (South China University of Technology, China)
Yixin Cao (National University of Singapore, Singapore)
Yixuan Cao (Institute of Computing Technology, Chinese Academy of Sciences, China)
Ziqiang Cao (Microsoft STCA, China)
Hailong Cao (Harbin Institute of Technology, China)
Kai Cao (New York University, USA)
Ching-Yun Chang (Amazon.com, UK)
Hongshen Chen (JD.com, China)
Muhao Chen (University of Southern California and University of Pennsylvania, USA)
Yidong Chen (Xiamen University, China)
Chengyao Chen (Wisers AI Lab, Canada)
Jian Chen (Beijing Normal University, Zhuhai, China)
Yubo Chen (Institute of Automation, Chinese Academy of Sciences, China)
Lei Chen (Beijing Normal University, Zhuhai, China)
Wenliang Chen (Soochow University, China)
Kehai Chen (National Institute of Information and Communications Technology, Japan)
Boxing Chen (Alibaba, China)
Qingcai Chen (Harbin Institute of Technology, China)
Bo Chen (iscas.ac.cn, China)
Gong Cheng (Nanjing University, China)
Chenhui Chu (Kyoto University, Japan)
Yiming Cui (Harbin Institute of Technology, China)
Mao Cunli (Kunming University of Science and Technology, China)
Xinyu Dai (National Key Laboratory for Novel Software Technology, Nanjing University, China)
Xiang Deng (The Ohio State University, USA)
Xiao Ding (Harbin Institute of Technology, China)
Li Dong (Microsoft Research Asia, China)
Zi-Yi Dou (Carnegie Mellon University, USA)
Qianlong Du (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China)
Junwen Duan (Harbin Institute of Technology, China)
Nan Duan (Microsoft Research Asia, China)
Miao Fan (Baidu Research, China)
Yufei Feng (Queen’s University, Canada)
Yang Feng (Institute of Computing Technology, Chinese Academy of Sciences, China)
Shi Feng (Northeastern University, China)
Guohong Fu (Soochow University, China)
Wei Gao (Singapore Management University, Singapore)
Yeyun Gong (Microsoft Research Asia, China)
Yu Gu (The Ohio State University, USA)
Jiatao Gu (Facebook AI Research, USA)
Zhijiang Guo (Singapore University of Technology and Design, Singapore)
Han Guo (University of North Carolina at Chapel Hill, USA)
Xu Han (Tsinghua University, China)
Qinghong Han (Peking University, China)
Tianyong Hao (South China Normal University, China)
Jie Hao (Florida State University, USA)
Lei Hou (Tsinghua University, China)
Linmei Hu (Beijing University of Posts and Telecommunications, China)
Wei Hu (Nanjing University, China)
Lifu Huang (University of Illinois at Urbana-Champaign, USA)
Xuanjing Huang (Fudan University, China)
Jing Huang (JD.com, USA)
Minlie Huang (Tsinghua University, China)
Guimin Huang (Guilin University of Electronic Technology, China)
Chenyang Huang (University of Alberta, Canada)
Jiangping Huang (Chongqing University of Posts and Telecommunications, China)
Yuxiang Jia (Zhengzhou University, China)
Ping Jian (Beijing Institute of Technology, China)
Wenbin Jiang (Baidu Research, China)
Tianwen Jiang (Harbin Institute of Technology, China)
Shengyi Jiang (Guangdong University of Foreign Studies, China)
Zhanming Jie (ByteDance, Singapore)
Peng Jin (Leshan Normal University, China)
Wan Jing (Associate Professor, China)
Chunyu Kit (City University of Hong Kong, Hong Kong, China)
Fang Kong (Soochow University, China)
Xiang Kong (Language Technologies Institute, Carnegie Mellon University, USA)
Lun-Wei Ku (Academia Sinica, Taiwan, China)
Kenneth Kwok (Principal Scientist, Singapore)
Oi Yee Kwong (The Chinese University of Hong Kong, Hong Kong, China)
Yanyan Lan (Institute of Computing Technology, Chinese Academy of Sciences, China)
Man Lan (East China Normal University, China)
Hady Lauw (Singapore Management University, Singapore)
Wenqiang Lei (National University of Singapore, Singapore)
Yves Lepage (Waseda University, Japan)
Maoxi Li (Jiangxi Normal University, China)
Chenliang Li (Wuhan University, China)
Jian Li (The Chinese University of Hong Kong, Hong Kong, China)
Peifeng Li (Soochow University, China)
Hao Li (ByteDance, Singapore)
Ru Li (Shanxi University, China)
Fei Li (University of Massachusetts Lowell, USA)
Binyang Li (University of International Relations, China)
Junhui Li (Soochow University, China)
Bin Li (Nanjing Normal University, China)
Zhixu Li (Soochow University, China)
Zuchao Li (Shanghai Jiao Tong University, China)
Xiujun Li (Microsoft Research Redmond, USA)
Xiang Li (Xiaomi AI Lab, China)
Lishuang Li (Dalian University of Technology, China)
Yachao Li (Soochow University, China)
Jiaqi Li (Harbin Institute of Technology, China)
Hao Li (Rensselaer Polytechnic Institute, USA)
Yuan-Fang Li (Monash University, Australia)
Albert Liang (Google, USA)
Lizi Liao (National University of Singapore, Singapore)
Shujie Liu (Microsoft Research Asia, China)
Lemao Liu (Tencent AI Lab, China)
Qi Liu (University of Science and Technology of China, China)
Yang Liu (Wilfrid Laurier University, Canada)
Zitao Liu (TAL Education Group, China)
Zhengzhong Liu (Carnegie Mellon University and Petuum Inc., USA)
Xianggen Liu (Tsinghua University and DeeplyCurious.ai, China)
Ming Liu (Harbin Institute of Technology, China)
Yongbin Liu (University of South China, China)
Yijia Liu (Alibaba DAMO Academy, China)
Qun Liu (Huawei Noah’s Ark Lab, China)
Honglei Liu (Facebook Conversational AI, USA)
Yang Liu (Tsinghua University, China)
An Liu (Soochow University, China)
Linqing Liu (University of Waterloo, Canada)
Jiasen Lu (Allen Institute For AI, USA)
Zhunchen Luo (PLA Academy of Military Science, China)
Chen Lyu (Guangdong University of Foreign Studies, China)
Jing Ma (Hong Kong Baptist University, Hong Kong, China)
Yue Ma (LRI, Université Paris Sud, France)
Chih-Yao Ma (Georgia Tech, USA)
Xianling Mao (Beijing Institute of Technology, China)
Zhao Meng (ETH Zurich, Switzerland)
Xiangyang Mou (Rensselaer Polytechnic Institute, USA)
Preslav Nakov (Qatar Computing Research Institute, HBKU, Qatar)
Guoshun Nan (Singapore University of Technology and Design, Singapore)
Tong Niu (Salesforce Research, USA)
Vardaan Pahuja (Université de Montreal, Canada)
Shichao Pei (KAUST, Saudi Arabia)
Baolin Peng (Microsoft Research, USA)
Wei Peng (Artificial Intelligence Application Research Center, Huawei Technologies, China)
Chengbin Peng (Ningbo University, China)
Longhua Qian (Soochow University, China)
Tao Qian (Hubei University of Science and Technology, China)
Yanxia Qin (Donghua University, China)
Likun Qiu (Minjiang University, China)
Jing Qiu (Hebei University of Science and Technology, China)
Weiguang Qu (Nanjing Normal University, China)
Nazneen Fatema Rajani (Salesforce Research, USA)
Jinfeng Rao (Facebook Conversational AI, USA)
Zhaochun Ren (Shandong University, China)
Pengjie Ren (University of Amsterdam, The Netherlands)
Yafeng Ren (Guangdong University of Foreign Studies, China)
Feiliang Ren (Northeastern University, China)
Lei Sha (University of Oxford, UK)
Haoyue Shi (Toyota Technological Institute at Chicago, USA)
Xiaodong Shi (Xiamen University, China)
Kaisong Song (Alibaba Group, China)
Yiping Song (Peking University, China)
Ruihua Song (Microsoft Xiaoice, China)
Chengjie Sun (Harbin Institute of Technology, China)
Jingyuan Sun (Institute of Automation, Chinese Academy of Sciences, China)
Lichao Sun (University of Illinois at Chicago, USA)
Xiaobing Sun (Singapore University of Technology and Design, Singapore)
Xu Tan (Microsoft Research Asia, China)
Yiqi Tang (The Ohio State University, USA)
Zhiyang Teng (Westlake University, China)
Zhiliang Tian (Hong Kong University of Science and Technology, Hong Kong, China)
Jin Ting (Hainan University, China)
Ming Tu (ByteDance, USA)
Zhaopeng Tu (Tencent, China)
Masao Utiyama (NICT, Japan)
Xiaojun Wan (Peking University, China)
Huaiyu Wan (Beijing Jiaotong University, China)
Mingxuan Wang (ByteDance, China)
Bo Wang (Tianjin University, China)
Tianlu Wang (University of Virginia, USA)
Shaonan Wang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China)
Bailin Wang (The University of Edinburgh, UK)
Di Wang (Woobo, USA)
Zhen Wang (The Ohio State University, USA)
Xuancong Wang (MOH Office for Healthcare Transformation, Singapore)
Rui Wang (NICT, Japan)
Zhichun Wang (Beijing Normal University, China)
Zhigang Wang (Tsinghua University, China)
Longyue Wang (Tencent, China)
Dingquan Wang (Google, USA)
Xun Wang (University of Massachusetts Amherst, USA)
Zekun Wang (Harbin Institute of Technology, China)
Chuan-Ju Wang (Academia Sinica, Taiwan)
Zhongyu Wei (Fudan University, China)
Zhuoyu Wei (Microsoft Research Asia, China)
Gang Wu (Northeastern University, China)
Changxing Wu (East China Jiaotong University, China)
Yu Wu (Microsoft Research Asia, China)
Chien-Sheng Wu (Salesforce, USA)
Yunqing Xia (Microsoft Research Asia, China)
Yingce Xia (Microsoft Research Asia, China)
Tong Xiao (Northeastern University, China)
Yanghua Xiao (Fudan University, China)
Ruobing Xie (Tencent, China)
Xin Xin (Beijing Institute of Technology, China)
Wenhan Xiong (University of California, Santa Barbara, USA)
Hao Xiong (Alibaba, China)
Deyi Xiong (Tianjin University, China)
Jingjing Xu (Peking University, China)
Ruifeng Xu (Harbin Institute of Technology, China)
Jinan Xu (Beijing Jiaotong University, China)
Liang Yang (Dalian University of Technology, China)
Yating Yang (The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, China)
Chenghao Yang (Columbia University, USA)
Ziyu Yao (The Ohio State University, USA)
Pengcheng Yin (Carnegie Mellon University, USA)
Yongjing Yin (Xiamen University, China)
Dong Yu (Beijing Language and Culture University, China)
Wei Yu (Carnegie Mellon University, USA)
Heng Yu (Alibaba, China)
Tao Yu (Yale University, USA)
Lu Yu (King Abdullah University of Science and Technology, Saudi Arabia)
Jiali Zeng (Tencent, China)
Feifei Zhai (Fanyu, China)
Wei-Nan Zhang (Harbin Institute of Technology, China)
Yue Zhang (Westlake University, China)
Fuzheng Zhang (Meituan-Dianping Group, China)
Peng Zhang (Tianjin University, China)
Chengzhi Zhang (Nanjing University of Science and Technology, China)
Xiaowang Zhang (Tianjin University, China)
Dongxu Zhang (University of Massachusetts Amherst, USA)
Ning Zhang (Tsinghua University, China)
Meishan Zhang (Tianjin University, China)
Wen Zhang (Tencent, China)
Guanhua Zhang (Harbin Institute of Technology and Tencent, China)
Dakun Zhang (SYSTRAN, France)
Biao Zhang (The University of Edinburgh, UK)
Boliang Zhang (DiDi Labs, USA)
Dongdong Zhang (Microsoft Research Asia, China)
Wayne Xin Zhao (RUC, China)
Jieyu Zhao (University of California, Los Angeles, USA)
Tiejun Zhao (Harbin Institute of Technology, China)
Jie Zhao (The Ohio State University, USA)
Xiaoqing Zheng (Fudan University, China)
Zihao Zheng (Harbin Institute of Technology, China)
Junsheng Zhou (Nanjing Normal University, China)
Guangyou Zhou (Central China Normal University, China)
Hao Zhou (ByteDance, China)
Ganbin Zhou (Tencent, China)
Guodong Zhou (Soochow University, China)
Luowei Zhou (Microsoft, USA)
Muhua Zhu (Tencent, China)
Haichao Zhu (Harbin Institute of Technology, China)
Yanyan Zou (Singapore University of Technology and Design, Singapore)
Jinsong Su (Xiamen University, China)
Congying Xia (University of Illinois at Chicago, USA)
Cheng Yang (Beijing University of Posts and Telecommunications, China)
Qiang Yang (KAUST, Saudi Arabia)
Mo Yu (IBM Research, USA)
Jianguo Zhang (University of Illinois at Chicago, USA)
Huaping Zhang (Beijing Institute of Technology, China)
Yunbo Cao (Tencent, China)
Junyi Li (China Academy of Electronics and Information Technology, China)
Min Yang (Chinese Academy of Sciences, China)
Xuefeng Yang (ZhuiYi Technology, China)
Sreya Dey (SAP, India)
Yangzhou Du (Lenovo, China)
Shipra Jain (Uttar Pradesh Technical University, India)
Yao Meng (Lenovo, China)
Wenli Ouyang (Lenovo, China)

Organizers

Organized by China Computer Federation
Hosted by Zhengzhou University
State Key Lab of Digital Publishing Technology
In cooperation with: Lecture Notes in Computer Science
Springer
ACTA Scientiarum Naturalium Universitatis Pekinensis
Sponsoring Institutions

Primary Sponsors: Zoneyet
Diamond Sponsors: JD Cloud & AI, AISpeech, Alibaba
Platinum Sponsors: Microsoft, Baidu, Huawei, Lenovo, China Mobile, PingAn
Golden Sponsors: Niutrans, Tencent AI Lab, Xiaomi, Gridsum
Silver Sponsors: Leyan, Speech Ocean
Contents – Part I
Oral - Conversational Bot/QA

FAQ-Based Question Answering via Knowledge Anchors
Ruobing Xie, Yanan Lu, Fen Lin, and Leyu Lin

Deep Hierarchical Attention Flow for Visual Commonsense Reasoning
Yuansheng Song and Ping Jian

Dynamic Reasoning Network for Multi-hop Question Answering
Xiaohui Li, Yuezhong Liu, Shenggen Ju, and Zhengwen Xie

Memory Attention Neural Network for Multi-domain Dialogue State Tracking
Zihan Xu, Zhi Chen, Lu Chen, Su Zhu, and Kai Yu

Learning to Answer Word-Meaning-Explanation Questions for Chinese Gaokao Reading Comprehension
Hongye Tan, Pengpeng Qiang, and Ru Li

Enhancing Multi-turn Dialogue Modeling with Intent Information for E-Commerce Customer Service
Ruixue Liu, Meng Chen, Hang Liu, Lei Shen, Yang Song, and Xiaodong He

Robust Spoken Language Understanding with RL-Based Value Error Recovery
Chen Liu, Su Zhu, Lu Chen, and Kai Yu

A Large-Scale Chinese Short-Text Conversation Dataset
Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang

DVDGCN: Modeling Both Context-Static and Speaker-Dynamic Graph for Emotion Recognition in Multi-speaker Conversations
Shuofeng Zhao and Pengyuan Liu
Fundamentals of NLP

Nominal Compound Chain Extraction: A New Task for Semantic-Enriched Lexical Chain
Bobo Li, Hao Fei, Yafeng Ren, and Donghong Ji
A Hybrid Model for Community-Oriented Lexical Simplification
Jiayin Song, Yingshan Shen, John Lee, and Tianyong Hao

Multimodal Aspect Extraction with Region-Aware Alignment Network
Hanqian Wu, Siliang Cheng, Jingjing Wang, Shoushan Li, and Lian Chi

NER in Threat Intelligence Domain with TSFL
Xuren Wang, Zihan Xiong, Xiangyu Du, Jun Jiang, Zhengwei Jiang, and Mengbo Xiong

Enhancing the Numeracy of Word Embeddings: A Linear Algebraic Perspective
Yuanhang Ren and Ye Du

Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?
Houquan Zhou, Yu Zhang, Zhenghua Li, and Min Zhang

A Span-Based Distantly Supervised NER with Self-learning
Hongli Mao, Hanlin Tang, Wen Zhang, Heyan Huang, and Xian-Ling Mao
Knowledge Base, Graphs and Semantic Web

A Passage-Level Text Similarity Calculation
Ming Liu, Zihao Zheng, Bing Qin, and Yitong Liu

Using Active Learning to Improve Distantly Supervised Entity Typing in Multi-source Knowledge Bases
Bo Xu, Xiangsan Zhao, and Qingxuan Kong

TransBidiFilter: Knowledge Embedding Based on a Bidirectional Filter
Xiaobo Guo, Neng Gao, Jun Yuan, Lin Zhao, Lei Wang, and Sibo Cai

Applying Model Fusion to Augment Data for Entity Recognition in Legal Documents
Hu Zhang, Haihui Gao, Jingjing Zhou, and Ru Li

Combining Knowledge Graph Embedding and Network Embedding for Detecting Similar Mobile Applications
Weizhuo Li, Buye Zhang, Liang Xu, Meng Wang, Anyuan Luo, and Yan Niu

CMeIE: Construction and Evaluation of Chinese Medical Information Extraction Dataset
Tongfeng Guan, Hongying Zan, Xiabing Zhou, Hongfei Xu, and Kunli Zhang
Document-Level Event Subject Pair Recognition
Zhenyu Hu, Ming Liu, Yin Wu, Jiexin Xu, Bing Qin, and JinLong Li

Knowledge Enhanced Opinion Generation from an Attitude
Zhe Ye, Ruihua Song, Hao Fu, Pingping Lin, Jian-Yun Nie, and Fang Li

MTNE: A Multitext Aware Network Embedding for Predicting Drug-Drug Interaction
Fuyu Hu, Chunping Ouyang, Yongbin Liu, and Yi Bu
Machine Learning for NLP

Learning to Generate Representations for Novel Words: Mimic the OOV Situation in Training
Xiaoyu Xing, Minlong Peng, Qi Zhang, Qin Liu, and Xuanjing Huang

Reinforcement Learning for Named Entity Recognition from Noisy Data
Jing Wan, Haoming Li, Lei Hou, and Juaizi Li

Flexible Parameter Sharing Networks
Chengkai Piao, Jinmao Wei, Yapeng Zhu, and Hengpeng Xu

An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models
Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma, and Kai Yu

A Survey of Sentiment Analysis Based on Machine Learning
Pingping Lin and Xudong Luo
Machine Translation and Multilinguality

Incorporating Named Entity Information into Neural Machine Translation
Leiying Zhou, Wenjie Lu, Jie Zhou, Kui Meng, and Gongshen Liu

Non-autoregressive Neural Machine Translation with Distortion Model
Long Zhou, Jiajun Zhang, Yang Zhao, and Chengqing Zong

Incorporating Phrase-Level Agreement into Neural Machine Translation
Mingming Yang, Xing Wang, Min Zhang, and Tiejun Zhao

Improving Unsupervised Neural Machine Translation with Dependency Relationships
Jia Xu, Na Ye, and GuiPing Zhang
NLP Applications

Incorporating Knowledge and Content Information to Boost News Recommendation
Zhen Wang, Weizhi Ma, Min Zhang, Weipeng Chen, Jingfang Xu, Yiqun Liu, and Shaoping Ma

Multi-domain Transfer Learning for Text Classification
Xuefeng Su, Ru Li, and Xiaoli Li

A Cross-Layer Connection Based Approach for Cross-Lingual Open Question Answering
Lin Li, Miao Kong, Dong Li, and Dong Zhou

Learning to Consider Relevance and Redundancy Dynamically for Abstractive Multi-document Summarization
Yiding Liu, Xiaoning Fan, Jie Zhou, Chenglong He, and Gongshen Liu

A Submodular Optimization-Based VAE-Transformer Framework for Paraphrase Generation
Xiaoning Fan, Danyang Liu, Xuejian Wang, Yiding Liu, Gongshen Liu, and Bo Su

MixLab: An Informative Semi-supervised Method for Multi-label Classification
Ye Qiu, Xiaolong Gong, Zhiyi Ma, and Xi Chen

A Noise Adaptive Model for Distantly Supervised Relation Extraction
Xu Huang, Bowen Zhang, Yunming Ye, Xiaojun Chen, and Xutao Li

CLTS: A New Chinese Long Text Summarization Dataset
Xiaojun Liu, Chuang Zhang, Xiaojun Chen, Yanan Cao, and Jinpeng Li

Lightweight Multiple Perspective Fusion with Information Enriching for BERT-Based Answer Selection
Yu Gu, Meng Yang, and Peiqin Lin

Stance Detection with Stance-Wise Convolution Network
Dechuan Yang, Qiyu Wu, Wei Chen, Tengjiao Wang, Zhen Qiu, Di Liu, and Yingbao Cui

Emotion-Cause Joint Detection: A Unified Network with Dual Interaction for Emotion Cause Analysis
Guimin Hu, Guangming Lu, and Yi Zhao

Incorporating Temporal Cues and AC-GCN to Improve Temporal Relation Classification
Xinyu Zhou, Peifeng Li, Qiaoming Zhu, and Fang Kong
Event Detection with Document Structure and Graph Modelling
Peipei Zhu, Zhongqing Wang, Hongling Wang, Shoushan Li, and Guodong Zhou

AFPun-GAN: Ambiguity-Fluency Generative Adversarial Network for Pun Generation
Yufeng Diao, Liang Yang, Xiaochao Fan, Yonghe Chu, Di Wu, Shaowu Zhang, and Hongfei Lin

Author Name Disambiguation Based on Rule and Graph Model
Lizhi Zhang and Zhijie Ban

Opinion Transmission Network for Jointly Improving Aspect-Oriented Opinion Words Extraction and Sentiment Classification
Chengcan Ying, Zhen Wu, Xinyu Dai, Shujian Huang, and Jiajun Chen

Label-Wise Document Pre-training for Multi-label Text Classification
Han Liu, Caixia Yuan, and Xiaojie Wang

Hierarchical Sequence Labeling Model for Aspect Sentiment Triplet Extraction
Peng Chen, Shaowei Chen, and Jie Liu

Knowledge-Aware Method for Confusing Charge Prediction
Xiya Cheng, Sheng Bi, Guilin Qi, and Yongzhen Wang
Social Media and Network
Aggressive Language Detection with Joint Text Normalization via Adversarial Multi-task Learning
    Shengqiong Wu, Hao Fei, and Donghong Ji
A Cross-Modal Classification Dataset on Social Network
    Yong Hu, Heyan Huang, Anfan Chen, and Xian-Ling Mao
Sentiment Analysis on Chinese Weibo Regarding COVID-19
    Xiaoting Lyu, Zhe Chen, Di Wu, and Wei Wang
Text Mining
Pairwise Causality Structure: Towards Nested Causality Mining on Financial Statements
    Dian Chen, Yixuan Cao, and Ping Luo
Word Graph Network: Understanding Obscure Sentences on Social Media for Violation Comment Detection
    Dan Ma, Haidong Liu, and Dawei Song
Data Augmentation with Reinforcement Learning for Document-Level Event Coreference Resolution
    Jie Fang and Peifeng Li
An End-to-End Multi-task Learning Network with Scope Controller for Emotion-Cause Pair Extraction
    Rui Fan, Yufan Wang, and Tingting He
Clue Extraction for Fine-Grained Emotion Analysis
    Hongliang Bi and Pengyuan Liu
Multi-domain Sentiment Classification on Self-constructed Indonesian Dataset
    Nankai Lin, Boyu Chen, Sihui Fu, Xiaotian Lin, and Shengyi Jiang
Extracting the Collaboration of Entity and Attribute: Gated Interactive Networks for Aspect Sentiment Analysis
    Rongdi Yin, Hang Su, Bin Liang, Jiachen Du, and Ruifeng Xu
Sentence Constituent-Aware Aspect-Category Sentiment Analysis with Graph Attention Networks
    Yuncong Li, Cunxiang Yin, and Sheng-hua Zhong
SciNER: A Novel Scientific Named Entity Recognizing Framework
    Tan Yan, Heyan Huang, and Xian-Ling Mao
Learning Multilingual Topics with Neural Variational Inference
    Xiaobao Wu, Chunping Li, Yan Zhu, and Yishu Miao
Author Index
Contents – Part II
Trending Topics (Explainability, Ethics, Privacy, Multimodal NLP)
DCA: Diversified Co-attention Towards Informative Live Video Commenting
    Zhihan Zhang, Zhiyi Yin, Shuhuai Ren, Xinhang Li, and Shicheng Li
The Sentencing-Element-Aware Model for Explainable Term-of-Penalty Prediction
    Hongye Tan, Bowen Zhang, Hu Zhang, and Ru Li
Referring Expression Generation via Visual Dialogue
    Lingxuan Li, Yihong Zhao, Zhaorui Zhang, Tianrui Niu, Fangxiang Feng, and Xiaojie Wang
Hierarchical Multimodal Transformer with Localness and Speaker Aware Attention for Emotion Recognition in Conversations
    Xiao Jin, Jianfei Yu, Zixiang Ding, Rui Xia, Xiangsheng Zhou, and Yaofeng Tu
Poster
Generating Emotional Social Chatbot Responses with a Consistent Speaking Style
    Jun Zhang, Yan Yang, Chengcai Chen, Liang He, and Zhou Yu
An Interactive Two-Pass Decoding Network for Joint Intent Detection and Slot Filling
    Huailiang Peng, Mengjun Shen, Lei Jiang, Qiong Dai, and Jianlong Tan
RuKBC-QA: A Framework for Question Answering over Incomplete KBs Enhanced with Rules Injection
    Qilin Sun and Weizhuo Li
Syntax-Guided Sequence to Sequence Modeling for Discourse Segmentation
    Longyin Zhang, Fang Kong, and Guodong Zhou
Macro Discourse Relation Recognition via Discourse Argument Pair Graph
    Zhenhua Sun, Feng Jiang, Peifeng Li, and Qiaoming Zhu
Dependency Parsing with Noisy Multi-annotation Data
    Yu Zhao, Mingyue Zhou, Zhenghua Li, and Min Zhang
Joint Bilinear End-to-End Dependency Parsing with Prior Knowledge
    Yunchu Gao, Ke Zhang, and Zhoujun Li
Multi-layer Joint Learning of Chinese Nested Named Entity Recognition Based on Self-attention Mechanism
    Haoru Li, Haoliang Xu, Longhua Qian, and Guodong Zhou
Adversarial BiLSTM-CRF Architectures for Extra-Propositional Scope Resolution
    Rongtao Huang, Jing Ye, Bowei Zou, Yu Hong, and Guodong Zhou
Analyzing Relational Semantics of Clauses in Chinese Discourse Based on Feature Structure
    Wenhe Feng, Xi Huang, and Han Ren
Efficient Lifelong Relation Extraction with Dynamic Regularization
    Hangjie Shen, Shenggen Ju, Jieping Sun, Run Chen, and Yuezhong Liu
Collective Entity Disambiguation Based on Deep Semantic Neighbors and Heterogeneous Entity Correlation
    Zihan He, Jiang Zhong, Chen Wang, and Cong Hu
Boosting Cross-lingual Entity Alignment with Textual Embedding
    Wei Xu, Chen Chen, Chenghao Jia, Yongliang Shen, Xinyin Ma, and Weiming Lu
Label Embedding Enhanced Multi-label Sequence Generation Model
    Yaqiang Wang, Feifei Yan, Xiaofeng Wang, Wang Tang, and Hongping Shu
Ensemble Distilling Pretrained Language Models for Machine Translation Quality Estimation
    Hui Huang, Hui Di, Jin’an Xu, Kazushige Ouchi, and Yufeng Chen
Weaken Grammatical Error Influence in Chinese Grammatical Error Correction
    Jinggui Liang and Si Li
Encoding Sentences with a Syntax-Aware Self-attention Neural Network for Emotion Distribution Prediction
    Chang Wang and Bang Wang
Hierarchical Multi-view Attention for Neural Review-Based Recommendation
    Hongtao Liu, Wenjun Wang, Huitong Chen, Wang Zhang, Qiyao Peng, Lin Pan, and Pengfei Jiao
Negative Feedback Aware Hybrid Sequential Neural Recommendation Model
    Bin Hao, Min Zhang, Weizhi Ma, Shaoyun Shi, Xinxing Yu, Houzhi Shan, Yiqun Liu, and Shaoping Ma
MSReNet: Multi-step Reformulation for Open-Domain Question Answering
    Weiguang Han, Min Peng, Qianqian Xie, Xiuzhen Zhang, and Hua Wang
ProphetNet-Ads: A Looking Ahead Strategy for Generative Retrieval Models in Sponsored Search Engine
    Weizhen Qi, Yeyun Gong, Yu Yan, Jian Jiao, Bo Shao, Ruofei Zhang, Houqiang Li, Nan Duan, and Ming Zhou
LARQ: Learning to Ask and Rewrite Questions for Community Question Answering
    Huiyang Zhou, Haoyan Liu, Zhao Yan, Yunbo Cao, and Zhoujun Li
Abstractive Summarization via Discourse Relation and Graph Convolutional Networks
    Wenjie Wei, Hongling Wang, and Zhongqing Wang
Chinese Question Classification Based on ERNIE and Feature Fusion
    Gaojun Liu, Qiuxia Yuan, Jianyong Duan, Jie Kou, and Hao Wang
An Abstractive Summarization Method Based on Global Gated Dual Encoder
    Lu Peng, Qun Liu, Lebin Lv, Weibin Deng, and Chongyu Wang
Rumor Detection on Hierarchical Attention Network with User and Sentiment Information
    Sujun Dong, Zhong Qian, Peifeng Li, Xiaoxu Zhu, and Qiaoming Zhu
Measuring the Semantic Stability of Word Embedding
    Zhenhao Huang and Chenxu Wang
Task-to-Task Transfer Learning with Parameter-Efficient Adapter
    Haiou Zhang, Hanjun Zhao, Chunhua Liu, and Dong Yu
Key-Elements Graph Constructed with Evidence Sentence Extraction for Gaokao Chinese
    Xiaoyue Wang, Yu Ji, and Ru Li
Knowledge Inference Model of OCR Conversion Error Rules Based on Chinese Character Construction Attributes Knowledge Graph
    Xiaowen Zhang, Hairong Wang, and Wenjie Gu
Explainable AI Workshop
Interpretable Machine Learning Based on Integration of NLP and Psychology in Peer-to-Peer Lending Risk Evaluation
    Lei Li, Tianyuan Zhao, Yang Xie, and Yanjie Feng
Algorithm Bias Detection and Mitigation in Lenovo Face Recognition Engine
    Sheng Shi, Shanshan Wei, Zhongchao Shi, Yangzhou Du, Wei Fan, Jianping Fan, Yolanda Conyers, and Feiyu Xu
Path-Based Visual Explanation
    Mohsen Pourvali, Yucheng Jin, Chen Sheng, Yao Meng, Lei Wang, Masha Gorkovenko, and Changjian Hu
Feature Store for Enhanced Explainability in Support Ticket Classification
    Vishal Mour, Sreya Dey, Shipra Jain, and Rahul Lodhe
Student Workshop
Incorporating Lexicon for Named Entity Recognition of Traditional Chinese Medicine Books
    Bingyan Song, Zhenshan Bao, YueZhang Wang, Wenbo Zhang, and Chao Sun
Anaphora Resolution in Chinese for Analysis of Medical Q&A Platforms
    Alena Tsvetkova
Evaluation Workshop
Weighted Pre-trained Language Models for Multi-Aspect-Based Multi-Sentiment Analysis
    Fengqing Zhou, Jinhui Zhang, Tao Peng, Liang Yang, and Hongfei Lin
Iterative Strategy for Named Entity Recognition with Imperfect Annotations
    Huimin Xu, Yunian Chen, Jian Sun, Xuezhi Cao, and Rui Xie
The Solution of Huawei Cloud & Noah’s Ark Lab to the NLPCC-2020 Challenge: Light Pre-Training Chinese Language Model for NLP Task
    Yuyang Zhang, Jintao Yu, Kai Wang, Yichun Yin, Cheng Chen, and Qun Liu
DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios
    Xinyu Li, Fayuan Li, Lu Pan, Yuguang Chen, Weihua Peng, Quan Wang, Yajuan Lyu, and Yong Zhu
Transformer-Based Multi-aspect Modeling for Multi-aspect Multi-sentiment Analysis
    Zhen Wu, Chengcan Ying, Xinyu Dai, Shujian Huang, and Jiajun Chen
Overview of the NLPCC 2020 Shared Task: AutoIE
    Xuefeng Yang, Benhong Wu, Zhanming Jie, and Yunfeng Liu
Light Pre-Trained Chinese Language Model for NLP Tasks
    Junyi Li, Hai Hu, Xuanwei Zhang, Minglei Li, Lu Li, and Liang Xu
Overview of the NLPCC 2020 Shared Task: Multi-Aspect-Based Multi-Sentiment Analysis (MAMS)
    Lei Chen, Ruifeng Xu, and Min Yang
Author Index
Oral - Conversational Bot/QA
FAQ-Based Question Answering via Knowledge Anchors
Ruobing Xie(B), Yanan Lu, Fen Lin, and Leyu Lin
WeChat Search Application Department, Tencent, Beijing, China
[email protected]
Abstract. Question answering (QA) aims to understand questions and find appropriate answers. In real-world QA systems, Frequently Asked Question (FAQ) based QA is usually a practical and effective solution, especially for complicated questions (e.g., How and Why). Recent years have witnessed the great success of knowledge graphs (KGs) in KBQA systems, yet few works focus on making full use of KGs in FAQ-based QA. In this paper, we propose a novel Knowledge Anchor based Question Answering (KAQA) framework for FAQ-based QA to better understand questions and retrieve more appropriate answers. Specifically, KAQA consists of three modules: knowledge graph construction, query anchoring and query-document matching. We treat entities and triples of KGs that occur in texts as knowledge anchors to precisely capture the core semantics, which brings higher precision and better interpretability. The multi-channel matching strategy also allows most sentence matching models to be flexibly plugged into the KAQA framework to fit different real-world computation limits. In experiments, we evaluate our models on both offline and online query-document matching tasks on a real-world FAQ-based QA system in WeChat Search, with detailed analysis, ablation tests and case studies. The significant improvements confirm the effectiveness and robustness of the KAQA framework in real-world FAQ-based QA.
1 Introduction
Question answering (QA) aims to find appropriate answers to users' questions. According to the type of answer, there are mainly two kinds of QA systems. For simple questions like "Who writes Hamlet?", users tend to want the answer directly, as a few entities or a short sentence. KBQA is designed for these questions [4]. For complicated questions like "How to cook a risotto?", users usually seek detailed step-by-step instructions. In this case, an FAQ-based QA system is a more effective and practical solution. It attempts to understand user questions and retrieve related documents as answers, which makes it closer to a sentence matching task between questions and answers [11]. QA systems always pursue higher precision and better interpretability, because users of QA systems are much more critical of the results than users in IR or dialog tasks.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 3–15, 2020. https://doi.org/10.1007/978-3-030-60450-9_1
Recent years have witnessed the rapid growth of knowledge graphs (KGs). A typical knowledge graph consists of entities, relations and triples, which can provide structural information for QA. KGs have been widely used in KBQA for simple questions [4]. However, few works focus on introducing KGs to FAQ-based QA for complicated questions. The main challenge of FAQ-based QA is that its queries and answers are more difficult to understand, since complicated questions often involve professional terms, domain-related operations and conditions. A small semantic shift in the question may lead to a completely different answer. Moreover, informal representations (e.g., VX), abbreviations (e.g., Msg) and domain-specific restrictions may further confuse understanding and matching. Simply relying on conventional sentence matching models may not work well in this situation, and dependency parsers and term weights trained on general corpora may also introduce errors.
To address these problems, we introduce KGs to FAQ-based QA systems. Unlike KBQA, we bring in KGs not to directly answer questions, but to better understand and match queries and titles. A query/title in FAQ-based QA usually contains essential factors such as entities and triples (i.e., entity pairs in texts with their relations) that derive from KGs. We consider such KG factors in a query/title as knowledge anchors, which anchor the core semantics of the query and title for NLU and sentence matching. Knowledge anchors bring higher precision and better interpretability, and also make the FAQ system more robust and controllable. Figure 1 gives an example of knowledge anchors in real-world queries and titles. The knowledge anchors bring in prior knowledge and highlight the core semantics as well as restrictions for matching.

Fig. 1. An example of knowledge anchors (triples) in FAQ-based QA.
    Query:  Delete WeChat contactor recover
            (contactor, has_operation, recover); (contactor, component_of, WeChat)
    Title1: How to find back your WeChat friend
            (friend, has_operation, find_back); (friend, component_of, WeChat)
    Title2: How to recover deleted WeChat Msg
            (Msg, has_operation, recover); (Msg, component_of, WeChat)
    Title3: Recover deleted contactor in Tiktok
            (contactor, has_operation, recover); (contactor, component_of, Tiktok)

In this paper, we propose a novel Knowledge Anchor based Question Answering (KAQA) framework for FAQ-based QA.
Specifically, KAQA consists of three modules: (1) knowledge graph construction, which stores prior domain-specific knowledge; (2) query anchoring, which extracts core semantics in queries and documents with three triple disambiguation modules; and (3) multi-channel query-document matching, which calculates the semantic similarities between queries and documents with token and knowledge anchor sequences. KAQA has two main advantages: (1) KAQA is a simple and effective framework, which cooperates well with almost all sentence matching algorithms and can be easily applied to other industrial domains. (2) Knowledge anchors in KAQA make it possible to understand queries and titles accurately with fine-grained domain-specific knowledge. The structured, interpretable KG also makes the FAQ-based QA system more robust and controllable.
In experiments, we build a new dataset from a real-world Chinese FAQ-based QA system, and conduct both online and offline evaluations. The results show that knowledge anchors and KAQA are essential in NLU and sentence matching. We further analyze query anchoring and knowledge anchors with detailed cases to better interpret KAQA's pros and cons as well as its effective mechanisms. The main contributions are summarized as follows:
– We propose a novel KAQA framework for real-world FAQ-based QA. The multi-channel matching strategy also enables KAQA to cooperate well with both simple and sophisticated matching models for different real-world scenarios. To the best of our knowledge, KAQA is the first to explicitly utilize knowledge anchors for NLU and matching in FAQ-based QA.
– We conduct extensive online and offline experiments to evaluate KAQA with detailed analyses and cases. The significant improvements confirm the effectiveness and robustness of KAQA. Currently, KAQA has been deployed on a well-known FAQ system in WeChat Search affecting millions of users.
2 Related Work
Question Answering. FAQ-based QA is practical and widely used for complicated questions. [2] gives a classical n-gram based text categorization method for FAQ-based QA. Since the performance of FAQ-based QA is strongly influenced by query-document matching, many efforts focus on improving similarity calculations [22]. Knowledge graphs have been widely used in QA. Semantic parsers [12], information extraction [19] and templates [21] are powerful tools to combine with KGs. Recently, pre-trained models and Transformers have also been used for QA and reasoning [3]. [20] focuses on multi-hop knowledge reasoning in KBQA, and [7] explores knowledge embeddings for simple QA. However, models in FAQ-based QA usually ignore entities or merely use them as features for lexical weighting or matching [1]. To the best of our knowledge, KAQA is the first to use knowledge anchors for NLU and matching in FAQ-based QA.
Sentence Matching. Measuring semantic similarities between questions and answers is essential in FAQ-based QA. Conventional methods usually rely on lexical similarity techniques [15]. Inspired by the Siamese network, DSSM [6] and Arc-I [5] extract high-order features and then calculate similarities in semantic spaces. Arc-II [5] and MatchPyramid [14] extract features from the lexical interaction matrix. IWAN [16] explores an orthogonal decomposition strategy for matching. Pair2vec [8] further considers compositional word-pair embeddings. [10] also considers recurrent and co-attentive information. Our multi-channel model enables most sentence matching models to be plugged into KAQA flexibly.
3 Methodology
We first introduce the notation used in this paper. For a knowledge graph $\{E, R, T\}$, $E$ and $R$ represent the entity set and the relation set. We use $(e_h, r, e_t) \in T$ to represent a triple in the KG, in which $e_h, e_t \in E$ are the head and tail entities, and $r \in R$ is the relation. We consider the query $q$ and the document $d$ as inputs, and simply use titles to represent documents for online efficiency. In KAQA, both queries and titles are labelled with knowledge anchors by the query anchoring module. The knowledge anchor set of a query, $A^q = \{E^q, T^q\}$, contains two sequences, namely the entity sequence $E^q$ and the triple sequence $T^q$, where entities and triples are arranged by their positions. The knowledge anchor set $A^d = \{E^d, T^d\}$ of a document is defined in the same way.
3.1 Overall Architecture
KAQA consists of three modules, namely knowledge graph construction, query anchoring and query-document matching. Figure 2 shows the overall architecture of KAQA. Knowledge graph construction is the fundamental step that learns and stores prior knowledge. Next, the query anchoring module scans queries and titles to extract knowledge anchors. Multiple disambiguation models are used to improve the reliability of the extracted knowledge anchors. Finally, the query-document matching module measures the semantic similarity between queries and titles via their token, entity and triple sequences.
Fig. 2. The overall architecture of the KAQA framework.
    KG construction (offline KGC): pattern-based bootstrapping, neural relation extraction, CKRL + human annotation.
    Query anchoring: rule-based model, knowledge reasoning, neural triple disambiguation.
    Multi-channel matching (online query/title matching): token, entity and triple sequences of $q$, $A^q = \{E^q, T^q\}$ and $d$, $A^d = \{E^d, T^d\}$, matched with Architecture-I, MatchPyramid or IWAN.
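As a minimal sketch of the notation above, the anchor set $A^q = \{E^q, T^q\}$ can be held in a small data structure with one field per sequence. The class and field names below (`Triple`, `KnowledgeAnchors`, `AnchoredText`, `channels`) are illustrative assumptions and are not taken from the KAQA implementation; the triples come from the query in Fig. 1, while the entity list is an assumed anchoring result.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    """A KG triple (e_h, r, e_t)."""
    head: str
    relation: str
    tail: str


@dataclass
class KnowledgeAnchors:
    """Anchor set A = {E, T}; both sequences are kept in text order."""
    entities: List[str] = field(default_factory=list)
    triples: List[Triple] = field(default_factory=list)


@dataclass
class AnchoredText:
    """A query or title together with its knowledge anchors."""
    tokens: List[str]
    anchors: KnowledgeAnchors

    def channels(self) -> Tuple[List[str], List[str], List[str]]:
        """Return the token, entity and triple channels as string sequences."""
        triple_ids = [f"{t.head}|{t.relation}|{t.tail}" for t in self.anchors.triples]
        return self.tokens, self.anchors.entities, triple_ids


# Query from Fig. 1; the entity list is an assumption made for illustration.
query = AnchoredText(
    tokens=["Delete", "WeChat", "contactor", "recover"],
    anchors=KnowledgeAnchors(
        entities=["WeChat", "contactor", "recover"],
        triples=[
            Triple("contactor", "has_operation", "recover"),
            Triple("contactor", "component_of", "WeChat"),
        ],
    ),
)
token_channel, entity_channel, triple_channel = query.channels()
```

Keeping the three channels as parallel sequences mirrors the token, entity and triple inputs that the matching module consumes; how triples are actually encoded as model inputs is not specified here.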
3.2 Knowledge Graph Construction
In KAQA, KGs are mainly used for better NLU and matching, not for directly answering questions. Therefore, instead of directly using existing open-domain KGs, we build a customized domain-specific KG, which focuses more on triples that represent core semantics in specific target domains. In most domains, the core semantics of a sentence are captured by triples like (contactor, has_operation, recover) as in Fig. 1, which often imply certain actions on objects. Specifically, we focus on the domain of software customer service. We mainly use four types of relations to anchor core semantics, namely has_operation, component_of, synonym and hypernym/hyponym. has_operation is responsible for the main operation, component_of reveals important relatedness, while synonym and hypernym/hyponym are used for alignment and generalization.

Table 1. An example of a query with its entities and triple candidates. The bold triples indicate the correct triples which should be selected by the query anchoring module.
    Query:             How to recover WeChat friend if she has deleted me?
    Entity:            delete; WeChat; friend; recover
    Triple candidates: (WeChat, has_operation, delete)
                       (WeChat, has_operation, recover)
                       (friend, has_operation, delete)
                       (friend, has_operation, recover)
                       (friend, component_of, WeChat)
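The candidate list in Table 1 can be read as every KG triple whose head and tail both occur among the entities found in the query. The following is a minimal sketch of that enumeration, not the authors' implementation: `TOY_KG` holds only the five triples shown in Table 1, and the function name `triple_candidates` is hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy KG containing only the triples listed in Table 1 (an assumption).
TOY_KG: Set[Triple] = {
    ("WeChat", "has_operation", "delete"),
    ("WeChat", "has_operation", "recover"),
    ("friend", "has_operation", "delete"),
    ("friend", "has_operation", "recover"),
    ("friend", "component_of", "WeChat"),
}


def triple_candidates(entities: List[str], kg: Set[Triple]) -> List[Triple]:
    """Return every KG triple whose head and tail are both query entities."""
    # Index relations by (head, tail) so each entity pair is a dictionary lookup.
    index: Dict[Tuple[str, str], List[str]] = {}
    for head, relation, tail in kg:
        index.setdefault((head, tail), []).append(relation)

    candidates: List[Triple] = []
    for head, tail in permutations(entities, 2):  # ordered entity pairs
        for relation in sorted(index.get((head, tail), [])):
            candidates.append((head, relation, tail))
    return candidates


table1_entities = ["delete", "WeChat", "friend", "recover"]
candidates = triple_candidates(table1_entities, TOY_KG)
```

On this toy input the function returns the five candidates of Table 1; choosing the correct ones among them is the job of the disambiguation step, which this sketch does not cover.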
In KG construction, we first set dozens of seed entities in the target domain, and then use pattern-based models together with conventional NER models such as CRF to obtain the final entity set. Extracting useful entities from existing knowledge bases is also a good supplement in practice. Based on these entities, we combine several models to get triple candidates. (1) We first use pattern-based bootstrapping methods with count-based features, lexical features (e.g., term weight and POS tag) and semantic parser results to generate high-frequency triples. (2) Next, we implement neural relation extraction models (e.g., CNN/PCNN with attention [13]). (3) We jointly consider all models with a linear transformation to rank all triple candidates. (4) Finally, we conduct CKRL [18] assisted by human annotation to make sure that the accuracy of the KG is above 95%. In real-world scenarios, KG customization is labor-intensive but indispensable for high precision and interpretability in QA systems.
3.3 Query Anchoring
Query anchoring attempts to extract core semantics via knowledge anchors. Relying simply on string matching or a semantic parser is straightforward, but it introduces ambiguity and noise. Moreover, semantic parsers usually perform unsatisfactorily on irregular queries. Hence, we use both entity and triple disambiguation models to address this issue. For entity disambiguation, we first conduct string matching to retrieve all possible entity candidates. For efficiency, we directly implement a forward maximum matching algorithm [17] for entity disambiguation, whose accuracy is acceptable in our software scenario. For triples, we first extract all possible connections between any two entities as triple candidates if the entity pair appears in the KG. As in Table 1, there are four triple candidates of has_operation that reflect different core semantics. The triple disambiguation model needs to find the true purpose of the query.
We build an ensemble triple disambiguation model from three models. (1) The rule-based model (RB) considers simple syntactic rules, patterns, lexical features (e.g., token weights, POS tags, entity types), and triple-level features (e.g., relation types, entity pair distances). This model captures valuable empirical observations and is simple and effective, and many of its general rules can be easily transferred to other domains. (2) The knowledge reasoning model (KR) enables some heuristic multi-hop knowledge reasoning patterns over KGs. For example, since friend is a component of WeChat, the target object of recover in Fig. 1 is more likely to be friend rather than WeChat. (3) For the neural triple disambiguation model (NTD), we build a supervised model based on FastText [9], which takes a sentence with its target triple as the input and outputs a confidence score. The inputs are: (a) the target triple, which indicates which triple candidate we focus on; (b) position features, which give the distances from the current token to the two entities in the target triple, so that there are two position features for each token; (c) conflict entity features: if $(e_A, r, e_B)$ is a triple candidate while $e_B$ is already in the target triple $(e_C, r, e_B)$, then $(e_A, e_C)$ is a conflict entity pair; (d) conflict triple features: if a triple (other than the target triple itself) shares any entity with the target triple, it is viewed as a conflict triple. All features are aggregated and fed into FastText.
In practice, the knowledge reasoning model first works as a high-confidence filter to remove obviously illogical results. The final triple confidence score is the weighted sum of the rule-based and neural model scores, with the weights empirically set to 0.3 and 0.7.
3.4 Multi-channel Query-Document Matching
The query-document matching module takes queries and document titles with their knowledge anchors as inputs, and outputs the query-document similarity features. The input of a query contains three channels, namely the token sequence $W^q$, the entity sequence $E^q$ and the triple sequence $T^q$, and the same holds for document titles. The final similarity vector $s$ is formalized as follows:

$$s = \mathrm{softmax}(\mathrm{MLP}(f^{(q,d)})), \quad f^{(q,d)} = \mathrm{Concat}(f_w^{(q,d)}, f_e^{(q,d)}, f_t^{(q,d)}), \qquad (1)$$

where $\mathrm{MLP}(\cdot)$ is a 2-layer perceptron and $f^{(q,d)}$ is the aggregated query-document similarity feature. $f_w^{(q,d)}$, $f_e^{(q,d)}$ and $f_t^{(q,d)}$ indicate the hidden states of query-title pairs for the token, entity and triple channels respectively. The multi-channel matching strategy jointly considers the matching degrees from the token, entity and triple aspects. To show the flexibility and robustness of KAQA in various situations, we learn $f_w^{(q,d)}$, $f_e^{(q,d)}$ and $f_t^{(q,d)}$ based on three representative sentence matching models: ARC-I, MatchPyramid and IWAN. It is not difficult for KAQA to use other sentence matching models.
Architecture-I (ARC-I). ARC-I is a classical sentence matching model following the siamese architecture [5]. It first uses neural networks such as CNNs to get the sentence representations of the query and the title separately, and then calculates their similarities. Here, $f_w^{(q,d)}$ is the concatenation of the final query and title representations computed from the token sequences, and $f_e^{(q,d)}$ and $f_t^{(q,d)}$ are built in the same way:

$$f_w^{(q,d)} = \mathrm{Concat}(\mathrm{CNN}(W^q), \mathrm{CNN}(W^d)). \qquad (2)$$
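The channel fusion of Eq. (1) can be sketched in a few lines, assuming the three channel features have already been produced by some matching model such as ARC-I. This is an illustration only: the ReLU activation, the parameter names and the toy dimensions are assumptions, since the text specifies just a 2-layer perceptron followed by softmax.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> Vector:
    """Affine map: one output per (row of weight, bias entry) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(z: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_vector(f_w: Vector, f_e: Vector, f_t: Vector,
                      w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """s = softmax(MLP(Concat(f_w, f_e, f_t))) with a 2-layer MLP."""
    f_qd = list(f_w) + list(f_e) + list(f_t)                 # Concat of the three channels
    hidden = [max(0.0, h) for h in linear(f_qd, w1, b1)]     # ReLU is an assumption
    return softmax(linear(hidden, w2, b2))


# Toy sizes: 2 features per channel, 4 hidden units, 3 output labels.
f_w, f_e, f_t = [0.2, -0.1], [0.5, 0.3], [0.0, 0.9]
w1 = [[0.0] * 6 for _ in range(4)]
b1 = [0.0] * 4
w2 = [[0.0] * 4 for _ in range(3)]
b2 = [0.0, 0.0, math.log(2.0)]
s = similarity_vector(f_w, f_e, f_t, w1, b1, w2, b2)
```

With all weights set to zero, the output depends only on the final bias, which makes the toy result easy to check by hand; a trained model would of course learn all four parameter groups.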
MatchPyramid. Unlike ARC-I, MatchPyramid calculates the sentence similarity directly from the token-level interaction matrix [14]. We use the cosine similarity to build the 2D interaction matrix $M$. The similarity features are the hidden states after the final 2D pooling and convolution layers:

$$f_w^{(q,d)} = \mathrm{CNN}(M), \quad M_{ij} = \mathrm{cosine\_sim}(w_i^q, w_j^d). \qquad (3)$$
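The interaction matrix $M$ of Eq. (3) is simple to compute once token embeddings are available. The sketch below covers only that step, with made-up 2-dimensional embeddings; the function names are hypothetical, and the subsequent 2D convolution and pooling layers are left out.

```python
import math
from typing import List, Sequence


def cosine_sim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def interaction_matrix(query_emb: Sequence[Sequence[float]],
                       title_emb: Sequence[Sequence[float]]) -> List[List[float]]:
    """M[i][j] = cosine similarity of the i-th query token and the j-th title token."""
    return [[cosine_sim(wq, wd) for wd in title_emb] for wq in query_emb]


# Toy embeddings (assumed values): 2 query tokens, 3 title tokens.
query_emb = [[1.0, 0.0], [0.0, 1.0]]
title_emb = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
M = interaction_matrix(query_emb, title_emb)
```

The matrix has one row per query token and one column per title token, so its shape varies with sentence length; a practical model would pad or pool it to a fixed size before the CNN.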
Inter-weighted Alignment Network (IWAN). IWAN is an effective sentence matching model that uses an orthogonal decomposition strategy [16]. It calculates query-document similarity based on the orthogonal and parallel components of the sentence representations. For a query, IWAN first uses a Bi-LSTM layer to get the hidden states $q^h$ (and correspondingly $d^h$ for the document). Next, a query-document attention mechanism generates the alignment representation $q^a$ of the query from all hidden embeddings of the document. The parallel and orthogonal components are formalized as follows:

$$q_i^p = \frac{q_i^h \cdot q_i^a}{q_i^a \cdot q_i^a}\, q_i^a, \quad q_i^o = q_i^h - q_i^p, \qquad (4)$$

in which $q_i^p$ is the parallel component, which captures the parts that are semantically similar to the document, while $q_i^o$ is the orthogonal component, which captures the conflicts between query and document. Finally, the orthogonal and parallel components of both query and document are concatenated to form the final query-document similarity features $f_w^{(q,d)} = \mathrm{MLP}(\mathrm{Concat}(q^p, q^o, d^p, d^o))$.
3.5 Implementation Details
The query-document matching module is considered as a classification task. We utilize a softmax layer which outputs three labels: similar, related and unrelated. We use cross-entropy as our loss function, which is formalized as follows: n 3 1 1{yi = j} log pi (lj |s)]. J(θ) = − [ n i=1 j=1
(5)
$n$ represents the number of training pair instances. $1\{y_i = j\}$ equals 1 only if the annotated label of the $i$-th instance is the $j$-th label, and equals 0 otherwise. In this paper, we apply KAQA to the field of software customer service. In the query anchoring module, the synonym and hypernym/hyponym relations are directly utilized for entity and triple normalization and generalization, while the component of relation is mainly utilized for knowledge reasoning in triple disambiguation. In query-document matching, we empirically consider only the instances with the has operation relation as the triple part of knowledge anchors, for they exactly represent the core semantics of an operation. It is not difficult to consider more relation types in our multi-channel matching framework.
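To make the loss in Eq. (5) concrete, the sketch below computes the mean negative log-likelihood of the annotated labels for a toy batch. The label order (index 0, 1, 2) and the probability values are assumptions made for illustration only.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the annotated label of each pair.

    probs[i][j] is the softmax probability of label j for pair i;
    labels[i] is the index (0, 1 or 2) of the annotated label.
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

# Two toy query-title pairs with hypothetical softmax outputs.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = cross_entropy(probs, labels)
```

Because the indicator in Eq. (5) selects exactly one label per instance, the double sum reduces to a single log-probability per pair, which is what the function computes.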
4 Experiments

4.1 Dataset and Knowledge Graph
In this paper, we construct a new dataset, FAQ-SCS, for evaluation, since there are few large open-source FAQ datasets. FAQ-SCS contains 29,134 query-title pairs extracted from a real-world software customer service FAQ-based QA system in WeChat Search. All query-title pairs are manually annotated with similar, related and unrelated labels. Overall, FAQ-SCS has 12,623 similar, 7,270 related and 9,241 unrelated labels. For evaluation, we randomly split all instances into train, valid and test sets with the proportion of 8:1:1. We also build a knowledge graph, KG-SCS, in the software customer service domain for KAQA. KG-SCS contains 4,530 entities and 4 relations. After entity normalization via alignments with synonym relations, there are 1,644 entities and 10,055 triples in total. After query anchoring, 1,652 entities and 2,877 triples appear in FAQ-SCS; 83.1% of queries and 86.7% of titles have at least one triple.
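A random 8:1:1 split of the kind described above could be sketched as follows. The seed, the rounding rule and the function name are assumptions; the exact procedure and the resulting set sizes are not reported in the text.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle items and cut them into train/valid/test parts (roughly 8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility (assumed)
    n_train = int(0.8 * len(items))
    n_valid = int(0.1 * len(items))
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]     # the remainder goes to the test set
    return train, valid, test

# Indices stand in for the 29,134 query-title pairs.
train, valid, test = split_8_1_1(range(29134))
```

Under these rounding assumptions every instance lands in exactly one of the three sets.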
4.2 Experimental Settings
In KAQA, we implement three representative models for sentence matching in our multi-channel matching module: the siamese architecture model ARC-I [5], the lexical interaction model MatchPyramid [14], and the orthogonal decomposition model IWAN [16]. Their original models are considered as baselines. We do not compare with KBQA models because they address a different task. All models share the same hidden state dimension of 128. In training, the batch size is set to 512 and the learning rate to 0.001. For ARC-I and MatchPyramid, the dimension of input embeddings is 128; the number of filters is 256 and the window size is 2 in the CNN encoder. For IWAN, the dimension of input embeddings is 256. All parameters are optimized on the valid set with grid search. For fair comparison, all models follow the same experimental settings.

4.3 Online and Offline Query-Document Matching
Offline Evaluation. We treat the evaluation as a classification task with three labels: unrelated, related or similar. We report the average accuracy across 3 runs for all models.

Table 2. Offline evaluation on query-document matching.

| Model | Accuracy |
|---|---|
| MatchPyramid [14] | 0.714 |
| ARC-I [5] | 0.753 |
| IWAN [16] | 0.778 |
| KAQA (MatchPyramid) | 0.747 |
| KAQA (ARC-I) | 0.773 |
| KAQA (IWAN) | 0.797 |

From Table 2 we can observe that: (1) The KAQA models significantly outperform all their corresponding original models on FAQ-SCS, among which KAQA (IWAN) achieves the best accuracy. It indicates that knowledge anchors and KAQA can capture core semantics precisely. We also find that pre-trained models are beneficial for this task. Moreover, KAQA performs better when there are multiple triple candidates, which implies that KAQA can distinguish useful information from noise and handle informality and ambiguity in natural language. (2) All KAQA models with different types of sentence matching models improve over their original models. Specifically, we evaluate our KAQA framework with a siamese architecture model (ARC-I), a lexical interaction model (MatchPyramid) and an orthogonal decomposition model (IWAN). The consistent improvements reconfirm the robustness of KAQA with different types of matching models. In real-world scenarios, KAQA can flexibly select simple or sophisticated matching models to balance effectiveness and efficiency.

Online Evaluation. To further confirm the power of the KAQA framework in real-world scenarios, we conduct an online A/B test on WeChat Search. We deploy the KAQA framework alongside its corresponding baseline model in the online evaluation. The online A/B test runs for 7 days, with approximately 14 million requests influenced by our online models. The experimental results show that KAQA achieves a 1.2% improvement in click-through rate (CTR) compared to the baseline model, with the significance level α = 0.01. With the help of knowledge anchors, KAQA also performs better in interpretability, cold start and immediate manual intervention. It has also been successfully used in other domains such as the medical and digital fields.

4.4 Analysis on Query Anchoring
In this subsection, we evaluate the effectiveness of different triple disambiguation models. We construct a new triple disambiguation dataset for query anchoring evaluation. Specifically, we randomly sample queries from a real-world software customer service system. To make this task more challenging, we only select the complicated queries which have at least two triple candidates with the has operation relation before triple disambiguation. At last, we sample 9,740 queries with 20,267 triples. After manual annotation, there are 10,437 correct triples that represent the core semantics, while the remaining 9,830 triples are incorrect. We randomly select 1,877 queries for evaluation. There are mainly three triple disambiguation components. We use RB to indicate the basic rule-based model, KR to indicate the knowledge reasoning model, and NTD to indicate the neural triple disambiguation model. We evaluate three combinations to show the contributions of different models, using Accuracy and AUC as our evaluation metrics.

Table 3. Results of triple disambiguation.

| Model | Accuracy | AUC |
|---|---|---|
| KAQA (RB) | 0.588 | 0.646 |
| KAQA (RB+KR) | 0.619 | 0.679 |
| KAQA (RB+KR+NTD) | 0.876 | 0.917 |

In Table 3, we can find that:
(1) The ensemble model RB+KR+NTD, which combines all three disambiguation components, achieves the best performance on both accuracy and AUC. User queries in FAQ-based QA usually suffer from abbreviations, informal expressions and domain-specific conditions. The results reconfirm that our triple disambiguation model is capable of capturing user intention precisely, even for complicated queries containing multiple triple candidates. We give a detailed analysis of such complicated queries in the case study. (2) The neural triple disambiguation component brings large improvements compared to the rule-based and knowledge reasoning models. It indicates that the supervised information and the generalization ability introduced by neural models are essential in triple disambiguation. Moreover, the RB+KR model significantly outperforms the RB model, which verifies that knowledge-based filters work well.
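For readers unfamiliar with the two metrics, the sketch below computes accuracy and AUC for binary triple labels (1 for a correct triple, 0 for an incorrect one). The 0.5 decision threshold and the toy scores are assumptions; the text does not state how scores are thresholded.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of triples whose thresholded score matches the annotation."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(scores, labels):
    """Probability that a random correct triple outscores a random incorrect one.

    Ties count as half a win. Quadratic in the number of triples, which is
    fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical disambiguation scores for four candidate triples.
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
acc, area = accuracy(scores, labels), auc(scores, labels)
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the candidates are ranked, which is one reason to report both.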
4.5 Ablation Tests on Knowledge Anchors
In this subsection, we verify that all components of KAQA are effective in our task. We design two settings: the first removes triples from the knowledge anchors, while the second removes entities. We report the accuracies of these two settings on KAQA (ARC-I) in Table 4. Both settings consistently improve over the original model, which implies that both entities and triples are useful for matching. Moreover, triples seem to play a more essential role in knowledge anchors.
Table 4. Results of different knowledge anchors.

| Model | Accuracy |
|---|---|
| ARC-I | 0.753 |
| KAQA (ARC-I) (entity) | 0.762 |
| KAQA (ARC-I) (triple) | 0.766 |
| KAQA (ARC-I) (all) | 0.773 |

Table 5. Examples of query-title matching with triples and labels. Label 0/1/2 indicates unrelated/related/similar. We only show the triples that indicate core semantics.

| Query | Title | ARC-I | KAQA | Label |
|---|---|---|---|---|
| How to delete WeChat’s chat logs. (chat log, OP, delete) | Can WeChat recover chat logs that have been deleted? (chat log, OP, recover) | 2 | 0 | 0 |
| How can I not add pictures (when sending messages) in Moments? (picture, OP, (not) add) | In Moments, can I only share textual messages without attaching figures? (figure, OP, (not) attach) | 0 | 2 | 2 |
| How to log in WeChat with new account? ((new) account, OP, log in) | Can I log in WeChat with two accounts simultaneously? ((two) account, OP, log in) | 2 | 2 | 1 |
| What should I do to set up administrators in the group? (administrator, OP, set up) | How to change the administrator in my chatting group? (administrator, OP, change) | 0 | 0 | 2 |
4.6 Case Study
In Table 5, we give some representative examples to show the pros and cons of using knowledge anchors. In the first case, KAQA successfully finds the correct knowledge anchor (chat log, OP, recover) in the title via the triple disambiguation model, avoiding the confusion caused by the candidate operation delete, while the original ARC-I model makes a mistake by judging only from tokens. In the second case, there is a semantic ellipsis (send messages) in the user query, a phenomenon that often occurs in QA systems and that confuses ARC-I. However, KAQA successfully captures the core semantics (picture, OP, (not) add) to get the right prediction. The synonym relation also helps the alignment between “figure” and “picture”. However, KAQA also has limitations. In the third case, knowledge anchors merely concentrate on the core semantic operation log in WeChat account, paying less attention to the difference between “new” and “two”. Therefore, KAQA gives a wrong prediction of similar. A more complete KG is needed. In the last case, KAQA does extract the correct knowledge anchors. However, although set up and change have different meanings, set up/change administrator should indicate the same operation in such a scenario. Considering the synonym and hypernym/hyponym relationships between triples would partially solve this issue.
5 Conclusion and Future Work
In this paper, we propose a novel KAQA framework for real-world FAQ systems. We consider entities and triples in texts as knowledge anchors to precisely capture core semantics for NLU and matching. KAQA is effective for real-world FAQ systems that pursue high precision and better interpretability with faster and more controllable human intervention, and it can be rapidly adapted to other domains. Experimental results confirm the effectiveness and robustness of KAQA. We will explore the following research directions in the future: (1) We will consider more sophisticated and general methods to fuse knowledge anchors into the multi-channel matching module. (2) We will explore the relatedness between entities and triples to better model knowledge anchor similarities.
References

1. Bedué, P., Graef, R., Klier, M., Zolitschka, J.F.: A novel hybrid knowledge retrieval approach for online customer service platforms. In: Proceedings of ECIS (2018)
2. Cavnar, W.B., Trenkle, J.M., et al.: N-gram-based text categorization. In: Proceedings of SDAIR (1994)
3. Clark, P., Tafjord, O., Richardson, K.: Transformers as soft reasoners over language. In: Proceedings of IJCAI (2020)
4. Cui, W., Xiao, Y., Wang, W.: KBQA: an online template based question answering system over Freebase. In: Proceedings of IJCAI (2016)
5. Hu, B., Lu, Z., Li, H., Chen, Q.: Convolutional neural network architectures for matching natural language sentences. In: Proceedings of NIPS (2014)
6. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of CIKM (2013)
7. Huang, X., Zhang, J., Li, D., Li, P.: Knowledge graph embedding based question answering. In: Proceedings of WSDM (2019)
8. Joshi, M., Choi, E., Levy, O., Weld, D.S., Zettlemoyer, L.: pair2vec: compositional word-pair embeddings for cross-sentence inference. In: Proceedings of NAACL (2019)
9. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
10. Kim, S., Kang, I., Kwak, N.: Semantic sentence matching with densely-connected recurrent and co-attentive information. In: Proceedings of AAAI (2019)
11. Kothari, G., Negi, S., Faruquie, T.A., Chakaravarthy, V.T., Subramaniam, L.V.: SMS based interface for FAQ retrieval. In: Proceedings of ACL (2009)
12. Kwiatkowski, T., Choi, E., Artzi, Y., Zettlemoyer, L.: Scaling semantic parsers with on-the-fly ontology matching. In: Proceedings of EMNLP (2013)
13. Lin, Y., Shen, S., Liu, Z., Luan, H., Sun, M.: Neural relation extraction with selective attention over instances. In: Proceedings of ACL (2016)
14. Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., Cheng, X.: Text matching as image recognition. In: AAAI (2016)
15. Robertson, S., Zaragoza, H., et al.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval (2009)
16. Shen, G., Yang, Y., Deng, Z.H.: Inter-weighted alignment network for sentence pair modeling. In: Proceedings of EMNLP (2017)
17. Wang, R., Luan, J., Pan, X., Lu, X.: An improved forward maximum matching algorithm for Chinese word segmentation. Comput. Appl. Softw. (2011)
18. Xie, R., Liu, Z., Lin, F., Lin, L.: Does William Shakespeare really write Hamlet? Knowledge representation learning with confidence. In: Proceedings of AAAI (2018)
19. Yao, X., Van Durme, B.: Information extraction over structured data: question answering with Freebase. In: Proceedings of ACL (2014)
20. Zhang, Y., Dai, H., Kozareva, Z., Smola, A.J., Song, L.: Variational reasoning for question answering with knowledge graph. In: Proceedings of AAAI (2018)
21. Zheng, W., Yu, J.X., Zou, L., Cheng, H.: Question answering over knowledge graphs: question understanding via template decomposition. Proc. VLDB 11(11), 1373–1386 (2018)
22. Zhou, G., Zhou, Y., He, T., Wu, W.: Learning semantic representation with neural networks for community question answering retrieval. Knowl.-Based Syst. 93, 75–83 (2016)
Deep Hierarchical Attention Flow for Visual Commonsense Reasoning

Yuansheng Song1 and Ping Jian1,2(✉)

1 School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{yssong,pjian}@bit.edu.cn
2 Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing, China
Abstract. Visual Commonsense Reasoning (VCR) requires a thorough understanding of general information connecting language and vision, as well as background world knowledge. In this paper, we introduce a novel yet powerful deep hierarchical attention flow framework, which takes full advantage of the text information in the query and candidate responses to perform reasoning over the image. Moreover, inspired by the success of machine reading comprehension, we also model the correlation among candidate responses to obtain better response representations. Extensive quantitative and qualitative experiments are conducted to evaluate the proposed model. Empirical results on the benchmark VCR1.0 show that the proposed model outperforms existing strong baselines, which demonstrates the effectiveness of our method.
Keywords: Hierarchical attention flow · Visual commonsense reasoning · Visual question answering

1 Introduction
In this paper we focus on the multiple-choice Visual Commonsense Reasoning (VCR) task [22]. Given an image, the machine is required not only to select an answer to the question among several candidates, but also to provide a rationale justifying why that answer is correct. Usually the machine should understand the image and the text jointly and refer to background world knowledge to get the right answer and rationale. As shown in Fig. 1, the question Q, answer Ai and rationale Ri (i = 0, 1, 2, 3) are all in the mixture form of natural language and explicit references to image regions. Each token in them is either a word in a vocabulary, or a tag referring to an object in the image. The VCR task can be divided into two sub-tasks, visual question answering (Q → A) and answer justification (QA → R). The holistic setting (Q → AR) requires first choosing the right answer and then selecting the correct rationale based on that answer. Here we unify the two sub-tasks. In Q → A, the question is the query and the answer choices are the responses. In QA → R, the concatenated question and correct answer form the query, while the rationale choices are the responses.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 16–28, 2020. https://doi.org/10.1007/978-3-030-60450-9_2

Fig. 1. One example from the VCR dataset. Given an image I and a related question Q, the system is required to provide the correct answer Ai and then pick a rationale Ri to justify the answer.

VCR is similar to Visual Question Answering (VQA) [5]. Both need to answer the question correctly according to the given image. Although great progress has been made in the VQA research field, typical state-of-the-art systems [1,3,8,11,14,15,21] still struggle with difficult inferences because they lack common knowledge. Some approaches [10,20] have been proposed to generate image captions as common knowledge to enhance VQA performance. However, they are likely to generate information that is irrelevant to the question or image. Different from the above VQA methods, we observe that candidate responses can provide additional knowledge that helps to reason about the image content. For example, in Fig. 1, referring only to the question and image information, we could hardly conjecture what exactly [person1] is thinking. But with the help of response A1 and the comparison among these responses, we can infer that the scientific expedition in this image is struggling to work in the harsh environment. Furthermore, inspired by advanced models of Machine Reading Comprehension (MRC) [18,19,23], we also make candidate responses attend to each other to obtain better response representations. In this paper we propose a novel approach that adequately leverages the query and response text information to better understand and reason about the image. Firstly, a BiLSTM is applied to encode the query and candidate responses and to fuse the visual and text features at the token level. Secondly, a bilinear attention mechanism [13] models the relevance among the query, candidate responses and image. Here we obtain the query-aware and image-aware response representations. Thirdly, we employ the Transformer [17] to model the dense interaction of response representations.
So far we have fully utilized the response information as a kind of world knowledge to aid reasoning over the image. Finally, following advanced MRC methods, the correlation among candidate responses is computed to obtain better response representations. Our main contributions are summarized as follows.

– We propose a deep hierarchical attention flow framework for the multiple-choice visual commonsense reasoning task. The model takes full advantage of text information, including the query and candidate responses, to reason about and understand the image content.
– To the best of our knowledge, this is the first work to introduce a correlation comparison module in the VCR task, following advanced MRC models, so that candidate responses can be compared with each other to aggregate useful information.
– Empirical evaluation shows that the proposed model outperforms previous strong baselines. On the VCR1.0 benchmark, we improve the Q → AR overall accuracy by 2.5% and 2.0% on the dev and test sets, respectively.
2 Related Work

2.1 Visual Question Answering
VQA has been one of the most attractive research areas in recent years, and many effective deep learning methods have been proposed for it. Previous typical methods [1,3,8,11] usually apply a pre-trained CNN such as ResNet [7] to extract image features and use an RNN to encode the query information. Then, an attention mechanism is utilized to fuse the image features and query information. Finally, the answer label is predicted based on these fused features. However, because the answer label lacks context information, it cannot help these VQA methods reason about and understand the image content.

2.2 Visual Commonsense Reasoning
VCR is more difficult than VQA. In order to answer the question correctly, the system is expected to conduct commonsense reasoning over background world knowledge in addition to reading the scene in the image. Besides, the system should also provide a rationale to explain why it chooses the answer. It can be seen that typical VQA models are inadequate for the VCR task. There are few related works for VCR. In order to move towards cognition-level understanding, [22] introduce Recognition to Cognition Networks (R2C), which contain three modules: grounding, contextualization and reasoning. The grounding module is used to learn a joint image-language representation for each token of the query and candidate response sequences. After grounding, an attention mechanism is applied to contextualize the text information and image context. Finally, the reasoning module reasons over the shared representations. Because of these three modules, the responses contain sufficient query and image information, which makes R2C outperform state-of-the-art VQA systems on the VCR task. However, R2C ignores the importance of the responses and their correlation, which carry abundant semantic information. In this paper, we mainly focus on the contextual information and additional knowledge existing in the query and response text. This leads to a better understanding of the text semantics as well as the image content.

Fig. 2. Overall structure of the proposed model for the multiple-choice VCR task.
3 Approach

3.1 Overview
As illustrated in Fig. 2, the proposed model is mainly composed of four modules: encoding, language & vision fusion, response modeling and correlation comparison. We introduce the image object features and language text feature representations in Sect. 3.2. In Sect. 3.3, following [22], we compute the object-aware query and candidate response representations and then apply an attention mechanism to fuse both query and object information into the response representations. Based on this representation, in Sect. 3.4, Transformer blocks are employed to learn generalizable contextualized visual and linguistic response representations. Next, in Sect. 3.5, each candidate response representation is compared with the others to collect supportive information, and a score is then computed for it in the correlation comparison module. Finally, the implementation and training details are provided in Sect. 3.6.

3.2 Feature Representation
Image Features. [22] apply Mask-RCNN [6] to detect objects by bounding boxes. Following that, we adopt a ResNet-50 backbone, pretrained on ImageNet, to extract image object features. We keep the parameters of the first three blocks of ResNet frozen and fine-tune the final blocks (after the RoiAlign is applied). Finally we obtain a 2,048-dimensional visual feature vector for each object. Besides, the label of each object’s category is also viewed as grounded evidence. The sequence of object features is $o = [(v_1; c_1), (v_2; c_2), \ldots, (v_K; c_K)] \in \mathbb{R}^{|o| \times (2048 + d_c)}$, where $v$ is the visual feature and $c$ is the label feature. $|o|$ is the number of objects in one image, and $d_c$ is the dimension of the label semantic embedding.

Text Features. Considering BERT’s success in the natural language processing field [4,9], we use the publicly available pre-trained BERT-Base model to get the contextualized text embeddings. We extract text features from the second-to-last layer of the frozen BERT-Base representation, which is proven to work well. It should be noted that the query and the responses are both in the mixture form of natural language and object tags in VCR. One example might look like this:

[CLS] What is person1 thinking ? [SEP] person1 has never seen bird6 up close before. [SEP]

In the above example, person tags are replaced with gender-neutral names (person1 → Jackie) and object detections are replaced by their class names (bird6 → bird) to minimize the domain shift from BERT’s pre-training data.
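The tag replacement described above can be illustrated with a small text-preprocessing sketch. Only the mappings person1 → Jackie and bird6 → bird come from the text; the other names in the list, the regular expression and the function name are hypothetical.

```python
import re

# Gender-neutral names indexed by person number. Only "Jackie" for person1
# appears in the text; the remaining names are placeholders.
NEUTRAL_NAMES = ["Casey", "Jackie", "Riley", "Jessie"]

# Matches tags such as person1, bird6 or [chair2] (optional square brackets).
TAG = re.compile(r"\[?\b([a-z]+)(\d+)\b\]?")

def replace_tags(text):
    """Replace person tags with neutral names and other tags with their class name."""
    def substitute(match):
        cls, idx = match.group(1), int(match.group(2))
        if cls == "person":
            return NEUTRAL_NAMES[idx % len(NEUTRAL_NAMES)]
        return cls
    return TAG.sub(substitute, text)

query = replace_tags("What is person1 thinking ?")
response = replace_tags("person1 has never seen bird6 up close before.")
```

The resulting strings contain only ordinary words, so they can be passed to a standard BERT tokenizer without introducing unseen tag tokens.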
3.3 Language and Vision Fusion
In this section, we describe the details of the language & vision fusion module. Since the query and responses all consist of object tags and natural language words, this module first learns a joint image-language representation for each token. The given query $\{[e_t^q; o_t^q]\}_{t=1}^{|q|}$ consists of $|q|$ words and object tags, where $e_t$ is the BERT representation of the $t$-th word and $o_t$ is the object visual feature extracted by ResNet-50. Note that if a word does not refer to an object, its object tag is designated as the entire image, as illustrated in Fig. 2. The $i$-th response is represented as $\{[e_t^{r^i}; o_t^{r^i}]\}_{t=1}^{|r^i|}$. Then a BiLSTM is employed to encode the query and response.

$$u_t^q = \mathrm{BiLSTM}(u_{t-1}^q, [e_t^q; o_t^q]) \tag{1}$$

$$u_t^{r^i} = \mathrm{BiLSTM}(u_{t-1}^{r^i}, [e_t^{r^i}; o_t^{r^i}]) \tag{2}$$

After the encoding stage, we adopt an attention mechanism to make the response attend to the visual objects and the query. To get the query-aware response representation, we apply a bilinear function to compute the similarity matrix between the query $u^q$ and the response $u^{r^i}$. Then the attention weight matrix is obtained via a row-wise softmax function. At last, the query-aware response representation $\tilde{u}^q$ is computed using the following equations.

$$S = u^{r^i} W u^q \tag{3}$$

$$\tilde{u}^q = \mathrm{softmax}(S)\, u^q \tag{4}$$

Similarly, we apply another bilinear function between the response and the object visual features to get the object-aware response representation $\tilde{o}$. Finally, we concatenate $u^{r^i}$, $\tilde{u}^q$ and $\tilde{o}$, and then feed them to a nonlinear layer.

$$\tilde{u}^{r^i} = [u^{r^i}; \tilde{u}^q; \tilde{o}] \tag{5}$$

$$R^i = \mathrm{ReLU}(W \tilde{u}^{r^i} + b) + u^{r^i} \tag{6}$$
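The attention and fusion steps of Eqs. (3) to (6) can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the authors' code: token representations are stored as rows, so the similarity is computed against the transposed query matrix to make the shapes agree; all weights and inputs are toy values; and the helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def row_softmax(S):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attend(u_r, u_x, W):
    """Bilinear attention: every response token attends over the rows of u_x."""
    S = matmul(matmul(u_r, W), transpose(u_x))   # |r| x |x| similarity matrix
    return matmul(row_softmax(S), u_x)           # |r| x d attended summary

def fuse(u_r, u_q_att, o_att, W_f, b):
    """Concatenate the three views per token, apply ReLU(Wx + b), add a residual."""
    fused = []
    for r, q, o in zip(u_r, u_q_att, o_att):
        x = r + q + o                            # list concatenation, length 3d
        h = [max(0.0, sum(w * xi for w, xi in zip(w_row, x)) + bi)
             for w_row, bi in zip(W_f, b)]
        fused.append([hi + ri for hi, ri in zip(h, r)])
    return fused

# Toy inputs with d = 2: two response tokens, three query tokens, two objects.
u_r = [[1.0, 0.0], [0.0, 1.0]]
u_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
objs = [[0.5, 0.5], [1.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
q_att = attend(u_r, u_q, identity)
o_att = attend(u_r, objs, identity)
R = fuse(u_r, q_att, o_att, [[0.0] * 6, [0.0] * 6], [0.0, 0.0])
```

With the all-zero fusion weights used in this toy call, the nonlinear branch contributes nothing and the residual connection returns the original response representation, which makes the role of the skip connection in Eq. (6) easy to see.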
3.4 Response Modeling
Due to its effectiveness and increasing ubiquity, the Transformer has been used in a wide variety of NLP tasks. Each Transformer layer contains identical blocks that transform the input in the following way.

$$\tilde{g}^l = \mathrm{MultiHeadAttention}(h^{l-1}) \tag{7}$$

$$g^l = \mathrm{LayerNorm}(\tilde{g}^l + h^{l-1}) \tag{8}$$

$$\tilde{h}^l = \mathrm{FFN}(g^l) \tag{9}$$

$$h^l = \mathrm{LayerNorm}(\tilde{h}^l + g^l) \tag{10}$$

where MultiHeadAttention is a multi-headed self-attention mechanism, LayerNorm represents layer normalization [2], and FFN is a two-layer feed-forward network. Here we employ multiple Transformer layers to encode the response representation that has attended to the query information and object visual features. Then, the attention pooling vector $r^i$ is obtained as follows.

$$\tilde{R}^i = \mathrm{MultiLayerTF}(R^i) \tag{11}$$

$$a = w^T \tilde{R}^i \tag{12}$$

$$\alpha = \mathrm{softmax}(a) \tag{13}$$

$$r^i = \sum_{j=1}^{|\tilde{R}^i|} \alpha_j \tilde{R}^i_j \tag{14}$$

where MultiLayerTF represents multiple Transformer layers, and $r^i$ is the pooling vector of the $i$-th response.
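The attention pooling of Eqs. (12) to (14) reduces a sequence of token vectors to a single vector. A minimal sketch follows, assuming the Transformer output is given as a list of rows and using a hypothetical scoring vector w; the Transformer layers themselves are not reproduced.

```python
import math

def attention_pool(R, w):
    """Collapse token vectors R (a list of d-dimensional rows) into one vector.

    a_j = w . R_j, alpha = softmax(a), pooled = sum_j alpha_j * R_j.
    """
    a = [sum(wi * x for wi, x in zip(w, row)) for row in R]
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    alpha = [e / total for e in exps]
    d = len(R[0])
    pooled = [sum(alpha[j] * R[j][k] for j in range(len(R))) for k in range(d)]
    return pooled, alpha

# With a zero scoring vector every token gets the same weight (mean pooling).
pooled, alpha = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

A learned w lets the model weight informative tokens more heavily than a plain average would.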
3.5 Correlation Comparison
Given the representation $r^i$ of each response, we compare all the responses with each other through an attention mechanism. In this way, the responses can exchange information mutually and verify each other. In order to avoid comparing a response with itself, we set the diagonal attention weights to zero.

$$\tilde{S}_{ij} = r^i W (r^j)^T \tag{15}$$

$$\beta_{ij} = \frac{1(i \neq j)\exp(\tilde{S}_{ij})}{\sum_{j'} 1(i \neq j')\exp(\tilde{S}_{ij'})} \tag{16}$$

$$\tilde{r}^i = \sum_{j} \beta_{ij}\, r^j \tag{17}$$

$$z^i = [r^i; r^i * \tilde{r}^i] \tag{18}$$

The final response representation $z^i$ is passed through a multilayer perceptron. We train our model by minimizing the cross entropy between the prediction and the gold label.
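The masked comparison in Eqs. (15) to (18) can be sketched as follows. Two points are assumptions of this sketch: the operator in Eq. (18) is read as an element-wise product, and excluding index i from the softmax is used as an equivalent way of zeroing the diagonal weight. The function name and toy inputs are hypothetical, and at least two responses are required.

```python
import math

def compare_responses(r, W):
    """r: list of n >= 2 response vectors of dimension d; W: d x d bilinear weight.

    Returns z, where z[i] concatenates r[i] with r[i] * r_tilde[i] (element-wise).
    """
    n, d = len(r), len(r[0])
    rW = [[sum(ri[k] * W[k][l] for k in range(d)) for l in range(d)] for ri in r]
    S = [[sum(a * b for a, b in zip(rW[i], r[j])) for j in range(n)] for i in range(n)]
    z = []
    for i in range(n):
        others = [j for j in range(n) if j != i]   # mask out self-comparison
        m = max(S[i][j] for j in others)
        exps = {j: math.exp(S[i][j] - m) for j in others}
        total = sum(exps.values())
        r_tilde = [sum(exps[j] / total * r[j][k] for j in others) for k in range(d)]
        z.append(list(r[i]) + [a * b for a, b in zip(r[i], r_tilde)])
    return z

# With only two responses, each one is summarised by the other.
identity = [[1.0, 0.0], [0.0, 1.0]]
z = compare_responses([[1.0, 0.0], [0.0, 1.0]], identity)
```

In the VCR setting there are four candidate responses, so each summary vector is a weighted mixture of the three competing candidates.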
3.6 Implementation and Training Details
The parameters of our model are as follows. Object features are projected from a hidden size of 2,176 to 512-dimensional vectors. We also use BERT-Base to initialize the word embedding matrix and get 768-dimensional vectors. The BiLSTM has a single layer with 256-dimensional hidden states, and the input dropout is set to 0.3. In the response modeling module, we use 3 Transformer layers to model the query-aware and object-aware response representation. The multi-head attention size is 512 and the number of heads is set to 8, so the latent dimension of each head is 64. The first 3 blocks of ResNet-50 are frozen and the remaining parameters are fine-tuned during training. We also filter out the objects that are not referred to in the query and response set. We train our model using the Adam optimizer [12] with a batch size of 64 and an initial learning rate of 0.0002. The weight decay is set to 0.0001. The validation set of VCR1.0 is utilized to find the hyper-parameters that yield the highest overall accuracy. We clip the gradients to have a total L2 norm of at most 1.0, and lower the learning rate by a factor of 2 when noticing a plateau (validation accuracy not increasing for two epochs). The model is trained for 20 epochs, which takes approximately 20 h on 2 NVIDIA Titan X GPUs.
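The two training heuristics mentioned above, gradient clipping by total L2 norm and halving the learning rate on a validation plateau, can be sketched framework-free as below. The class and function names are hypothetical, and the exact plateau bookkeeping (resetting the counter after each reduction) is an assumption of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

class HalveOnPlateau:
    """Halve the learning rate when validation accuracy has not improved for
    `patience` consecutive epochs."""

    def __init__(self, lr=2e-4, patience=2, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

clipped = clip_by_global_norm([3.0, 4.0])          # norm 5 is scaled down to norm 1
scheduler = HalveOnPlateau()
lrs = [scheduler.step(acc) for acc in [0.50, 0.50, 0.40, 0.60]]
```

In the toy run, the second and third epochs fail to improve on the first, so the rate is halved once after the third epoch and then kept when the fourth epoch improves.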
4 Experiments

4.1 Datasets and Evaluation Metrics
The visual commonsense reasoning corpus VCR1.0 [22] contains 290k multiple-choice QA problems extracted from 110k movie scenes. Given one image and the corresponding question, the system should select an answer from four candidate answers and then pick a rationale from four candidate rationales. Only one candidate answer or rationale is correct. The problem formulation is defined as a tuple [I, O, Q, A, R]. I represents the image. O is a sequence of objects detected by bounding boxes [16]. Q is the question, A is the set of answers (A0, A1, A2, A3) and R is the set of rationales (R0, R1, R2, R3). The question Q, answer Ai and rationale Ri (i = 0, 1, 2, 3) are all in the mixture form of natural language and explicit references to image regions.
Table 1. Experimental results on VCR. Accuracy in percentage (%) is reported.

| Model | Q → A (Val) | Q → A (Test) | QA → R (Val) | QA → R (Test) | Q → AR (Val) | Q → AR (Test) |
|---|---|---|---|---|---|---|
| Chance | 25.0 | 25.0 | 25.0 | 25.0 | 6.2 | 6.2 |
| BERT [4] | 53.8 | 53.9 | 64.1 | 64.5 | 34.8 | 35.0 |
| RevisitedVQA [8] | 39.4 | 40.5 | 34.0 | 33.7 | 13.5 | 13.8 |
| BottomUpTopDown [1] | 42.8 | 44.1 | 25.1 | 25.1 | 10.7 | 11.0 |
| MLB [11] | 45.5 | 46.2 | 36.1 | 36.8 | 17.0 | 17.2 |
| MUTAN [3] | 44.4 | 45.5 | 32.0 | 32.2 | 14.6 | 14.6 |
| R2C [22] | 63.8 | 65.1 | 67.2 | 67.3 | 43.1 | 44.0 |
| Ours | 66.3 | 66.9 | 68.6 | 68.7 | 45.6 | 46.0 |
The VCR task consists of visual question answering (Q → A) and answer justification (QA → R). The ultimate goal is Q → AR. There are therefore three evaluation metrics in the VCR task: Q → A accuracy, QA → R accuracy and Q → AR overall accuracy. To evaluate the capability of the proposed model, we report its performance using the official VCR metric [22].
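The relationship between the three metrics can be illustrated with a short sketch in which a Q → AR prediction counts as correct only when both the answer and the rationale are right. The function name and the toy predictions are hypothetical; this is not the official evaluation script.

```python
def vcr_accuracies(pred_answers, gold_answers, pred_rationales, gold_rationales):
    """Return (Q->A, QA->R, Q->AR) accuracies in percent."""
    n = len(gold_answers)
    a_ok = [p == g for p, g in zip(pred_answers, gold_answers)]
    r_ok = [p == g for p, g in zip(pred_rationales, gold_rationales)]
    q2a = 100.0 * sum(a_ok) / n
    qa2r = 100.0 * sum(r_ok) / n
    # Joint metric: both choices must be correct for the same problem.
    q2ar = 100.0 * sum(a and r for a, r in zip(a_ok, r_ok)) / n
    return q2a, qa2r, q2ar

# Four toy problems, each with choices indexed 0..3.
scores = vcr_accuracies([0, 1, 2, 3], [0, 1, 2, 0], [1, 1, 2, 3], [1, 0, 2, 3])

# Random guessing among four answers and four rationales.
chance_q2ar = 100.0 * 0.25 * 0.25
```

The chance level of the joint metric is 6.25%, which is consistent with the 6.2 reported for the Chance row of Table 1.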
4.2 Results, Ablations, and Analysis
In this section, we first present the empirical results on the VCR task, which show the proposed model’s superiority over the previous strong baselines. Then ablation studies are conducted to determine the contribution of each component. Finally, an extensive qualitative evaluation is carried out to explain the behavior of our model.

Results on VCR. As demonstrated in Table 1, the proposed model consistently and significantly outperforms the previous state-of-the-art model R2C [22] on all three evaluation metrics. In particular, it improves the Q → AR overall accuracy by 2.5% and 2.0% on the dev and test sets, respectively. As shown in Table 1, RevisitedVQA [8], BottomUpTopDown [1], MLB [11] and MUTAN [3] are all strong VQA baselines, but they struggle on the VCR task. We note that they do not use the candidate responses to help understand the image content. R2C [22] performs much better than the VQA models, but is still unable to fully utilize the query and response information, which shows the importance of text knowledge in aiding reasoning over the image. We note that the text-only model BERT [4] also obtains good performance on this task but is still much worse than our model, which indicates the necessity of understanding the image context and text information simultaneously to get the right response.
Fig. 3. Some qualitative examples from our model. The final probabilities show how confident our model is in choosing the corresponding response.
Ablation Studies. We run several ablation experiments to investigate which components are effective. The ablation results are summarized in Table 2. They are discussed in detail below, where we mainly focus on the Q → AR accuracy on the VCR dev set.

Table 2. Ablations for our model on the validation set. We compare some important components with R2C [22].

| Model | Q → A (R2C) | Q → A (Ours) | QA → R (R2C) | QA → R (Ours) | Q → AR (R2C) | Q → AR (Ours) |
|---|---|---|---|---|---|---|
| Complete model | 63.8 | 66.3 | 67.2 | 68.6 | 43.1 | 45.6 |
| No correlation comparison | – | 65.4 | – | 68.3 | – | 44.6 |
| No response modeling | – | 65.8 | – | 68.1 | – | 45.0 |
| No query | 48.3 | 62.0 | 43.5 | 65.1 | 21.5 | 40.6 |
| No vision representation | 53.1 | 53.7 | 63.2 | 63.3 | 33.8 | 34.1 |
To evaluate the effectiveness of the response correlation comparison module, we remove it and directly use the output logit of each response after self-attention pooling. The overall Q → AR accuracy on the dev set drops by 1.0 point, which supports our hypothesis that comparing the responses with each other helps aggregate useful information for a better response representation. The result also validates the necessity of fully utilizing the candidate responses in the VCR task. We then investigate the importance of response modeling, which collects evidence dynamically from the response itself. Removing it lowers the final result by 0.6 point, which indicates that it is beneficial but not critical to the final performance. In the language & vision module, we first remove the query representation. The accuracy of our model drops by 5.0 points, while R2C [22] suffers heavily with a loss
[Figure 4: attention heatmap. Question tokens: "Why", "is", "[chair2]", "empty", "?". Attended objects: [image], [chair2], [person0]. Answer choices (rows): "The person who drove [person0] here has gone inside the building."; "Everyone is standing up."; "[chair2] is empty because [person0] stood up."; "The power went out."]
Fig. 4. One example from the Q → A task. The first super-column is the question: Why is [chair2] empty? The second super-column represents the objects attended by our model. Each row represents an answer choice. Each block shows a heatmap of attention between the question or objects and each answer choice. A darker color means greater attention weights. (Color figure online)
of 21.6 points. The result shows that our model performs much better than R2C when the query is unavailable. After removing the query representation, the model has to rely only on the response information to reason about the image. We argue that the reason why
our model performs well is that it treats the responses as additional knowledge, in addition to modeling the correlation among them. It is therefore worthwhile to make the most of the response text to understand the semantic information of the image. After removing the vision representation, the performance of both our model and R2C declines sharply on the Q → A task. We argue that the model needs to draw on more visual information there because the evidence provided by the query alone is too scarce to identify the correct response. The QA → R accuracy is only slightly affected because the query already carries much semantic information.
Qualitative Evaluation. Figure 3 presents some qualitative examples. In the first row, our model infers with high confidence why [chair2] is empty. Moreover, it chooses the correct rationale even though all rationales are more or less relevant to the question, which suggests that it uses the vision and language context rather than guessing. The second example also shows the model's capability on the VCR task. To gain better insight into the behavior of the proposed model, we visualize the attention weights, including the response attending over the query and the response attending over the image objects. The visualized attention map helps us identify which token in the query is more important and which object is more useful in the visual commonsense reasoning process. Figure 4 shows the attention map of the first example (Q → A) in Fig. 3. For the query “Why is [chair2] empty?”, the attention generated by our model is focused on the key words “[chair2]” and “empty”. The correct response also attends to the object “[person0]” more strongly than the wrong responses do.
5 Conclusion
In this paper, we propose a novel deep hierarchical attention flow model for visual commonsense reasoning. In particular, the query and candidate response information is fully exploited to aid reasoning over the image content. Furthermore, we make the responses attend to each other to gather important evidence, which enhances the candidate response representation by modeling the correlation among responses. Experimental results on the VCR dataset show that our model outperforms previous strong baselines. We believe that further benefits can be obtained from modeling the interaction among answers, rationales and the image in VCR. We will continue our study from this perspective, for example by modeling the causal relationship between these components. In addition, cross-modal pre-training will be considered to further improve VCR performance.
Acknowledgements. The authors would like to thank the organizers of NLPCC 2020 and the reviewers for their helpful suggestions. This research work is supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002103.
References
1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
2. Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR (2016)
3. Benyounes, H., Cadene, R., Cord, M., Thome, N.: MUTAN: multimodal tucker fusion for visual question answering. arXiv: Computer Vision and Pattern Recognition (2017)
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
5. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
6. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017, pp. 2980–2988 (2017). https://doi.org/10.1109/ICCV.2017.322
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
8. Jabri, A., Joulin, A., van der Maaten, L.: Revisiting visual question answering baselines. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 727–739. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_44
9. Jawahar, G., Sagot, B., Seddah, D.: What does BERT learn about the structure of language? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3651–3657. Association for Computational Linguistics, Florence, July 2019
10. Kim, H., Bansal, M.: Improving visual question answering by referring to generated paragraph captions. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3606–3612. Association for Computational Linguistics, Florence, July 2019
11. Kim, J., On, K.W., Lim, W., Kim, J., Ha, J., Zhang, B.: Hadamard product for low-rank bilinear pooling. In: 5th International Conference on Learning Representations (2017)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015 (2015). Conference Track Proceedings
13. Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. CoRR (2015)
14. Nguyen, D.K., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
15. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. Litoral Revista De La Poesía Y El Pensamiento, pp. 2953–2961 (2015)
16. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
17. Vaswani, A., et al.: Attention is all you need. CoRR (2017)
18. Wang, W., Yang, N., Wei, F., Chang, B., Zhou, M.: Gated self-matching networks for reading comprehension and question answering. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–198. Association for Computational Linguistics, Vancouver, July 2017
19. Wang, Y., et al.: Multi-passage machine reading comprehension with cross-passage answer verification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1918–1927. Association for Computational Linguistics, Melbourne, July 2018. https://doi.org/10.18653/v1/P18-1178
20. Wu, J., Hu, Z., Mooney, R.: Generating question relevant captions to aid visual question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3585–3594. Association for Computational Linguistics, Florence, July 2019
21. Wu, J., Mooney, R.J.: Self-critical reasoning for robust visual question answering. arXiv: Computer Vision and Pattern Recognition (2019)
22. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
23. Zhu, H., Wei, F., Qin, B., Liu, T.: Hierarchical attention flow for multiple-choice reading comprehension. In: AAAI Conference on Artificial Intelligence (2018)
Dynamic Reasoning Network for Multi-hop Question Answering
Xiaohui Li1, Yuezhong Liu2, Shenggen Ju1(B), and Zhengwen Xie1
1 College of Computer Science, Sichuan University, Chengdu 610065, China
[email protected]
2 Enterprise Service, Commonwealth Bank of Australia, Sydney, NSW 2000, Australia
Abstract. Multi-hop reasoning question answering is a sub-task of machine reading comprehension (MRC) that aims to find the answer to a given question across multiple passages. Most existing models obtain the answer by visiting the question only once, so they may not extract adequate text information. In this paper, we propose the Dynamic Reasoning Network (DRN), a novel approach that obtains answers by multi-hop reasoning among multiple passages. We establish a query reshaping mechanism that visits the query repeatedly to mimic people’s reading habits. The model dynamically reasons over an entity graph with graph attention (GAT) and the query reshaping mechanism to improve its comprehension and reasoning ability. Experimental results on the HotpotQA and TriviaQA datasets show that our DRN model achieves significant improvements over prior state-of-the-art models.
Keywords: Machine reading comprehension · Multi-hop reasoning · Query reshaping mechanism
1 Introduction
Machine reading comprehension (MRC) is the task of obtaining the correct answer to a given question by reasoning over a set of texts. Most MRC models focus on answering a question within a single passage [1–4]; they are ineffective when the evidence is spread over multiple passages, since a single passage is then not enough to find the correct answer. To equip models with the ability to reason over a set of passages, multi-hop reasoning QA models have been proposed. These models are usually trained to obtain the answer by reasoning across multiple passages.
There are three main research directions in multi-hop reasoning QA. The first is based on memory networks [5, 6] but lacks a clear reasoning process. The second [7–9] reasons directly across all given passages, including unrelated ones, by building explicit reasoning chains or obtaining additional information. However, the amount of data to be processed is large, since not every passage is related to the answer. The last direction constructs an entity graph to realize multi-hop reasoning across multiple passages and has achieved remarkable performance [10–12].
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 29–40, 2020. https://doi.org/10.1007/978-3-030-60450-9_3
Despite these successes, current models still have several limitations. First, most existing models reason directly across all given passages without a selection step, although some passages are irrelevant to the correct answer. This introduces distracting information and increases the number of passages to be processed, so passage selection before reasoning is indispensable. Second, each text, whether the question or a passage, is visited only a few times after being encoded into representation vectors. Models may not absorb enough information from so few visits and therefore fail to make full use of the text.
To overcome these limitations, we propose a novel model named Dynamic Reasoning Network (DRN) for the multi-hop reasoning QA task. First, we build a passage selector to discard as many answer-irrelevant passages as possible, which reduces the amount of information to be processed. Next, based on the chosen passages, we recognize named entities and construct an entity graph for the subsequent reasoning phase. Then, following the intuition that people cannot focus on too much content at the same time, we propose a dynamic reasoning network, consisting of dynamic graph attention and a query reshaping mechanism, to reason over the entity graph, which facilitates a full understanding of the information. Finally, a prediction module is used for answer prediction. Our model is evaluated on the HotpotQA and TriviaQA datasets, and the experimental results show its effectiveness compared with other baseline models.
In summary, the contributions of this paper are as follows:
1. We propose the Dynamic Reasoning Network (DRN), an effective method for multi-hop reasoning in the MRC task.
2. We propose a query reshaping mechanism for the reasoning step, which facilitates a full understanding of the information.
3. We provide an experimental study on public datasets (HotpotQA and TriviaQA) to illustrate the effectiveness of our proposed model.
2 Related Work
Machine reading comprehension (MRC) aims to make a machine read and understand natural-language text and find the answer to a given question. In the past few years, many MRC approaches [13–17] have achieved remarkable improvements. There are three research directions with respect to multi-hop reasoning.
The first direction is based on memory networks [5, 6], which use a memory unit to combine the question with the information obtained in each round and predict the answer after several rounds of iterative inference. These models are trained end to end and dynamically determine the number of inference rounds. However, they cannot give a clear reasoning process.
The second direction reasons directly over all the given passages to find the answer, either by constructing reasoning chains [7, 8] or by obtaining additional information about questions and contexts [9]. These methods need to process a large number of passages because irrelevant ones are not filtered out.
The third direction builds an entity graph from questions and documents and obtains the answer through multi-hop reasoning over the graph, which gives a clear reasoning process. [10] extracted entities from documents and constructed entity graphs through named entity recognition and coreference resolution modules, and then performed multi-hop reasoning on the entity graphs. [11] constructed graph networks based on the entities co-occurring in the query and candidates and used graph convolutional networks to achieve multi-hop inference among documents. However, these models construct entity graphs directly from all documents, so the amount of data to be processed is large. [12] proposed a dynamically fused graph network (DFGN), based on graph neural networks, that dynamically merges and updates query and context information to complete the reasoning. However, that work did not consider the links between documents when constructing the entity graph.
Our model follows the third direction. Unlike the above models, we first build a passage selector to reduce the information to be processed. We also consider the connections between documents when constructing the entity graph. In addition, we propose a dynamic reasoning network with a query reshaping mechanism that mimics people’s reading habits to improve the performance of the model.
3 Model
We introduce our Dynamic Reasoning Network (DRN) in detail in this section. We first give a general overview of the model’s framework. We then introduce the passage selector, the encoder, the construction of the entity graph and the dynamic reasoning network in turn. Finally, we describe the prediction layer.
3.1 Task Definition and Overview
Our problem is formulated as follows: given a query and a set of passages, some of which are distractors, the goal is to produce the correct answer from these passages. We propose a dynamic reasoning network to solve this multi-hop reasoning problem. An overview of our DRN model is presented in Fig. 1.
Fig. 1. Overview of the model
3.2 Passage Selector
The first step in our approach is to select gold passages from the set of given passages in order to discard distracting ones. We build the passage selector in the same way as [12]. A pre-trained BERT model [17] with a sentence classification layer is applied to predict the similarity between a passage and the question. The selector network takes a query and a passage as input, concatenated as “[CLS]” + question + “[SEP]” + document + “[SEP]”, and outputs a matching score between 0 and 1. We define the training label of each passage by the following rule: the label is 2 if the passage contains the answer span, 1 if the passage contains at least one supporting sentence, and 0 otherwise. Passages whose matching score is greater than a threshold η (η = 0.1, as given in [12]) are chosen as gold passages for the downstream tasks.
3.3 Encoding Query and Context
The query Q and its corresponding gold passage P are fed through a pre-trained BERT model in the format “[CLS]” + Q + “[SEP]” + P + “[SEP]” to obtain the representations $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{m \times h}$ and $P = [p_1, p_2, \ldots, p_n] \in \mathbb{R}^{n \times h}$, where m and n are the lengths of the query and the passage, and h is the size of the hidden states. We then use co-attention to improve the interaction between the query Q and the passage P. We first compute the attention matrix $A \in \mathbb{R}^{m \times n}$, which captures the similarity between the passage and the query. Then, we calculate a passage-aware query representation $Q_p \in \mathbb{R}^{m \times h}$. Similarly, we obtain a query-aware passage representation $P_q \in \mathbb{R}^{n \times h}$. Further, we compute another passage representation $P' \in \mathbb{R}^{n \times h}$, which captures the interactions between the passage-aware query and the passage and thus focuses on the connections between the passage-relevant words of the query and the passage. The final passage representation is $P_{final} = [P, P'] \in \mathbb{R}^{n \times 2h}$, where ‘,’ denotes concatenation:
$$A = QP^{T} \tag{1}$$
$$Q_p = \mathrm{softmax}(A)\,P \tag{2}$$
$$P_q = \mathrm{softmax}(A^{T})\,Q \tag{3}$$
$$P' = \mathrm{softmax}(A^{T})\,Q_p \tag{4}$$
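The co-attention of Eqs. (1)–(4) can be illustrated with the following minimal, dependency-free sketch. It is not the authors' implementation: the row-wise softmax axis is an assumption (the text does not state it), the final concatenation follows our reading of $P_{final}$, and all function names are illustrative.

```python
import math

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def softmax_rows(X):
    # Assumption: softmax is applied independently to every row.
    out = []
    for row in X:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def co_attention(Q, P):
    """Q: m x h query matrix, P: n x h passage matrix (lists of lists)."""
    A = matmul(Q, transpose(P))            # Eq. (1): m x n
    Qp = matmul(softmax_rows(A), P)        # Eq. (2): m x h
    At = softmax_rows(transpose(A))        # softmax(A^T): n x m
    Pq = matmul(At, Q)                     # Eq. (3): n x h
    P2 = matmul(At, Qp)                    # Eq. (4): n x h
    # Assumed reading of P_final = [P, P']: concatenate along the feature axis.
    P_final = [p + p2 for p, p2 in zip(P, P2)]   # n x 2h
    return Qp, Pq, P_final
```

With a 2-token query and a 3-token passage of hidden size 2, `co_attention` returns a 2 × 2 matrix for $Q_p$, a 3 × 2 matrix for $P_q$ and a 3 × 4 matrix for $P_{final}$, in agreement with the dimensions stated above.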
3.4 Entity Graph Construction
We construct an entity graph to facilitate multi-hop reasoning among the gold passages. We recognize named entities and noun phrases in the query and the passages with the Stanford CoreNLP toolkit [18]. The entity graph is built at three levels according to the following rules:
Question-based level. Add an edge between two nodes if both of their sentence representations contain a named entity or noun phrase from the same query. The goal is to capture the interaction between the query and the passages.
Context-based level. Add an edge between two nodes from the same passage. In this way, we obtain connections within a passage.
Passage-based level. Add an edge between two nodes if their sentence representations share at least one named entity or noun phrase. This builds relations among different passages.
3.5 Dynamic Reasoning Network
The dynamic reasoning block is designed to reason over the entity graph. We propose a dynamic reasoning network (Fig. 2) that mimics the way people analyze information by reading messages repeatedly. We construct a query reshaping mechanism to read the important parts of the query repeatedly. A graph neural network (GNN) is employed to pass information over the entity graph, and dynamic graph attention (GAT) is used to ensure that information is disseminated among the entity nodes relevant to the current query.
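As an illustration of the three edge rules of Sect. 3.4, the sketch below builds the edge set from already-extracted entities. The node layout (a dict with the hypothetical keys `entity`, `passage` and `sentence_entities`) is an assumption made for this example; entity and noun-phrase recognition itself is assumed to have been done beforehand.

```python
from itertools import combinations

def build_entity_graph(nodes, query_entities):
    """Return {(i, j): set of levels} for every connected node pair.

    nodes: list of dicts with hypothetical keys
        'entity'            - surface form of the node
        'passage'           - id of the passage containing it
        'sentence_entities' - set of entities / noun phrases in its sentence
    query_entities: entities / noun phrases found in the query.
    """
    query_entities = set(query_entities)
    edges = {}
    for i, j in combinations(range(len(nodes)), 2):
        a, b = nodes[i], nodes[j]
        levels = set()
        # Question-based level: both sentences mention something from the query.
        if a["sentence_entities"] & query_entities and b["sentence_entities"] & query_entities:
            levels.add("question")
        # Context-based level: both nodes come from the same passage.
        if a["passage"] == b["passage"]:
            levels.add("context")
        # Passage-based level: the two sentences share an entity or noun phrase.
        if a["sentence_entities"] & b["sentence_entities"]:
            levels.add("passage")
        if levels:
            edges[(i, j)] = levels
    return edges
```

Keeping the level labels on each edge is a design choice of this sketch; it makes it easy to drop one edge type at a time, as done in the ablation study.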
Fig. 2. The architecture of DRN
Fig. 3. Framework of query reshaping
Query Reshaping Mechanism. When people answer a question based on reading materials, they usually read both the materials and the question several times, especially when the question is complex. Meanwhile, the important part of the text receives more attention in the next reading pass. We therefore propose a query reshaping mechanism (Fig. 3) that dynamically combines the original query with its important part to mimic human reading habits. Given the original query representation $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{m \times h}$, we first select the most important part of Q, taking into account the previous query information and the message from the entity graph. Since the message from the entity graph is passed to the query in the query update phase, we focus here on the previous query information $Q_t$. After computing the attention weights, we apply a further calculation that enlarges the weights of the most important parts and shrinks the weights of the parts that matter little. This selection mechanism can be formulated with attention as follows:
$$\mu = W_1^{T} \tanh(W_2 Q_t + W_3 q_{t-1} \otimes e_m) \tag{5}$$
$$\alpha_j = \frac{\exp(\mu_j)}{\sum_{k=1}^{m} \exp(\mu_k)} \tag{6}$$
$$\tilde{q}_t = \sum_{i=1}^{m} \frac{\alpha_i^{3}}{\sum_{k=1}^{m} \alpha_k^{3}}\, q_i \tag{7}$$
where $W_1$, $W_2$, $W_3$ are trainable parameters and $e_m \in \mathbb{R}^{m}$ is a row vector of ones. The outer product $W_3 q_{t-1} \otimes e_m$ repeats $W_3 q_{t-1}$ m times. We then use a GRU to encode the selected part $\tilde{q}_t$ together with its context, which yields a representation $q_t$ at each step. After re-visiting the query N times, we obtain a reshaped query representation $Q_t$, which is sent to the next reasoning step:
$$q_t = \mathrm{GRU}(q_{t-1}, \tilde{q}_t) \tag{8}$$
$$Q_t = [q_1, q_2, \ldots, q_N] \tag{9}$$
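The sharpening step of Eqs. (6) and (7) can be sketched as follows. The sketch takes the attention scores μ as given, so the scoring function of Eq. (5) and the GRU of Eq. (8) are deliberately left out; the function names and the `power` argument are illustrative assumptions.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sharpened_query_vector(scores, Q, power=3):
    """scores: m attention scores (mu); Q: m x h query token vectors."""
    alpha = softmax(scores)                      # Eq. (6)
    sharp = [a ** power for a in alpha]          # cube each weight
    norm = sum(sharp)
    weights = [s / norm for s in sharp]          # renormalise to sum to 1
    h = len(Q[0])
    # Eq. (7): weighted sum of the query token vectors.
    q_tilde = [sum(w * q[k] for w, q in zip(weights, Q)) for k in range(h)]
    return q_tilde, weights
```

For example, softmax weights of (2/3, 1/3) become (8/9, 1/9) after cubing and renormalising, so the dominant token contributes even more to the selected vector, while uniform weights are left unchanged.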
Dynamic Graph Attention. After obtaining the entity graph, we use a graph neural network to propagate information among the entity nodes. An entity filter is constructed to select the entities most related to the current query, so that information is passed only through query-aware nodes:
$$q_{t-1} = \mathrm{MeanPooling}(Q_{t-1}) \tag{10}$$
$$\gamma_t^{i} = \frac{q_{t-1} V_t\, e_{t-1}^{i}}{\sqrt{d_2}} \tag{11}$$
$$m_t = \sigma(\gamma_t^{1}, \gamma_t^{2}, \ldots, \gamma_t^{n}) + 1 \tag{12}$$
$$e^{i} = m_t^{i}\, e_{t-1}^{i} \tag{13}$$
where $V_t$ is a linear projection matrix and $\sigma$ is the sigmoid function.
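A minimal sketch of the entity filter of Eqs. (10)–(13) is given below. It assumes that $d_2$ is the dimension of the entity embeddings and that $V_t$ has shape h × $d_2$; neither is stated explicitly in the text, and the function name is illustrative.

```python
import math

def entity_filter(Q_prev, E_prev, V):
    """Q_prev: m x h query matrix, E_prev: n x d2 entity matrix, V: h x d2."""
    h = len(Q_prev[0])
    d2 = len(E_prev[0])  # assumption: d2 is the entity embedding size
    # Eq. (10): mean pooling over the query tokens.
    q = [sum(row[k] for row in Q_prev) / len(Q_prev) for k in range(h)]
    # q V, a vector of length d2.
    qV = [sum(q[a] * V[a][b] for a in range(h)) for b in range(d2)]
    # Eq. (11): scaled relevance score of every entity.
    gammas = [sum(x * y for x, y in zip(qV, e)) / math.sqrt(d2) for e in E_prev]
    # Eq. (12): sigmoid plus one, so every mask value lies in (1, 2).
    mask = [1.0 / (1.0 + math.exp(-g)) + 1.0 for g in gammas]
    # Eq. (13): rescale each entity embedding by its mask value.
    filtered = [[m * v for v in e] for m, e in zip(mask, E_prev)]
    return filtered, mask
```

Because of the added 1 in Eq. (12), no entity is ever zeroed out in this sketch; query-relevant entities are amplified relative to the others rather than the others being removed.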
We use GAT [19] to disseminate information dynamically across the graph. The attention between two entity nodes is computed as follows:
$$h_t^{i} = U_t\, e_{t-1}^{i} + b_t \tag{14}$$
$$\beta_t^{i,j} = \mathrm{LeakyReLU}(W_t^{T} [h_t^{i}, h_t^{j}]) \tag{15}$$
$$\alpha_t^{i,j} = \frac{\exp(\beta_t^{i,j})}{\sum_{k} \exp(\beta_t^{i,k})} \tag{16}$$
where $U_t$, $W_t$ are learnable parameters and $\alpha$ represents the proportion of information that is passed to the neighbor nodes of each entity. To update the entity nodes, each node sums over all the information received from its neighbor nodes:
$$e_t^{i} = \mathrm{ReLU}\Big(\sum_{j \in N_i} \alpha_t^{j,i}\, h_t^{j}\Big) \tag{17}$$
where $N_i$ is the set of neighbor nodes of entity $e^{i}$. To propagate graph information to the query, which is crucial for correct reasoning, we use a bi-attention network [1] to update the query representation:
$$Q_t = \text{Bi-attention}(Q_{t-1}, E_t) \tag{18}$$
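The graph attention update of Eqs. (14)–(17) can be sketched in plain Python as follows. This is a single-head illustration, not the authors' code: the LeakyReLU slope of 0.2, the symmetric neighbor lists and all names are assumptions, and the softmax for node i is taken over the neighbors of i.

```python
import math

def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return x if x > 0 else slope * x

def gat_layer(E_prev, neighbors, U, b, w):
    """E_prev: n x d entity matrix; neighbors: {i: [j, ...]} (assumed symmetric);
    U: d_out x d matrix; b: d_out bias; w: attention vector of length 2 * d_out."""
    # Eq. (14): linear projection of every entity.
    H = [[sum(u * x for u, x in zip(row, e)) + bias for row, bias in zip(U, b)]
         for e in E_prev]
    alpha = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        # Eq. (15): unnormalised attention between node i and each neighbor j.
        betas = {j: leaky_relu(sum(wk * v for wk, v in zip(w, H[i] + H[j])))
                 for j in nbrs}
        top = max(betas.values())
        exps = {j: math.exp(v - top) for j, v in betas.items()}
        total = sum(exps.values())
        # Eq. (16): share of node i's information sent to neighbor j.
        for j in nbrs:
            alpha[(i, j)] = exps[j] / total
    E_new = []
    for i in range(len(E_prev)):
        agg = [0.0] * len(b)
        for j in neighbors.get(i, []):
            for k in range(len(agg)):
                agg[k] += alpha[(j, i)] * H[j][k]
        # Eq. (17): ReLU over the information received from the neighbors.
        E_new.append([max(0.0, v) for v in agg])
    return E_new, alpha
```

Note that the weights received by a node in Eq. (17) are normalised per sender, not per receiver, so they need not sum to one at the receiving node.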
3.6 Answer Prediction
We follow the same structure of prediction layers as [16]. The outputs contain four types of predictions: the supporting sentences, the start token of the answer, the end token of the answer, and the answer type. The prediction framework is shown in Fig. 4. The passage representation is sent to the prediction layer sentence by sentence. For supporting sentence prediction, we use a binary classifier to predict the probability that the current sentence is a supporting sentence. The answer types are defined as “span”, “yes”, and “no”. GRUs are used to output the probabilities $P_i$ of these four prediction types:
$$P_{sup} = \mathrm{GRU}(C_t) \tag{19}$$
$$P_{start} = \mathrm{GRU}([C_t, P_{sup}]) \tag{20}$$
$$P_{end} = \mathrm{GRU}([C_t, P_{sup}, P_{start}]) \tag{21}$$
$$P_{type} = \mathrm{GRU}([C_t, P_{sup}, P_{start}, P_{end}]) \tag{22}$$
The loss function is jointly optimized in a multi-task learning setting:
$$L = \eta_1\, \mathrm{BCE}(P_{sup}, \hat{P}_{sup}) + \mathrm{CE}(P_{start}, \hat{P}_{start}) + \mathrm{CE}(P_{end}, \hat{P}_{end}) + \eta_2\, \mathrm{CE}(P_{type}, \hat{P}_{type}) \tag{23}$$
where $\eta_1$, $\eta_2$ are weights used to control the effects of the different loss terms and $\hat{P}$ denotes the corresponding ground-truth labels. BCE denotes the binary cross-entropy loss and CE the cross-entropy loss.
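The weighted multi-task loss of Eq. (23) can be written out for a single example as in the sketch below. It assumes that the model outputs are already probabilities, that the start, end and type targets are given as indices, and that the BCE term is averaged over sentences; the default weights of 1.0 are placeholders, not values from the paper.

```python
import math

def bce(probs, labels, eps=1e-12):
    """Mean binary cross entropy over sentences (averaging is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def ce(probs, gold_index, eps=1e-12):
    """Cross entropy of a probability vector against a gold class index."""
    return -math.log(max(probs[gold_index], eps))

def joint_loss(p_sup, y_sup, p_start, y_start, p_end, y_end, p_type, y_type,
               eta1=1.0, eta2=1.0):  # eta defaults are placeholders
    # Eq. (23): weighted sum of the four task losses.
    return (eta1 * bce(p_sup, y_sup)
            + ce(p_start, y_start)
            + ce(p_end, y_end)
            + eta2 * ce(p_type, y_type))
```

Setting `eta1` above 1 puts more emphasis on supporting sentence prediction relative to the span losses, which is how such weights are typically used to balance tasks of different difficulty.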
Fig. 4. Framework of answer prediction
4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our DRN model on HotpotQA [16] in the distractor setting and on unfiltered TriviaQA [20]. Models are evaluated by Exact Match (EM) and F1 score. For HotpotQA, joint EM and F1 scores are used as the overall performance measures, which encourages the model to be accurate on both answers and supporting facts.
HotpotQA is a recent benchmark dataset for multi-hop reasoning across multiple passages. Each question is designed so that the answer can be obtained only by multi-hop reasoning over predefined passages, and some distracting passages are also given. Fine-grained supporting facts are collected for each question-answer pair to promote the explainability of models.
TriviaQA is a popular benchmark dataset built on information retrieval (IR). Different from HotpotQA, each answer needs to be found by multi-hop reasoning across sentences within a single passage. Its questions are relatively complex, so we use it to demonstrate the effectiveness of the query reshaping mechanism.
4.2 Implementation Details
We use the uncased version of BERT to encode question-answer pairs and tokenize all texts with its tokenizer. In the passage selection stage, we use the same low threshold (η = 0.1) as [12] to balance high recall against reasonable precision on supporting facts. The Stanford CoreNLP toolkit [18] is employed to recognize named entities and noun phrases in questions and passages. For optimization, we use the Adam optimizer [21] with an initial learning rate of 1e−4.
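The passage selection rule used here (training labels from Sect. 3.2 and the threshold η = 0.1) can be sketched as follows. The matching scores are assumed to come from the BERT-based selector, which is not reproduced; the function names are illustrative.

```python
def training_label(contains_answer_span, contains_supporting_sentence):
    """Training label of a passage, following the rule in Sect. 3.2."""
    if contains_answer_span:
        return 2
    if contains_supporting_sentence:
        return 1
    return 0

def select_gold_passages(passages, scores, threshold=0.1):
    """Keep passages whose matching score is strictly greater than the threshold.

    scores are assumed to be selector outputs in [0, 1], one per passage.
    """
    return [p for p, s in zip(passages, scores) if s > threshold]
```

A low threshold such as 0.1 keeps borderline passages, which trades some precision for the high recall on supporting facts mentioned above.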
4.3 Overall Results
Table 1 and Table 2 present the results of our model and the compared models on the HotpotQA and TriviaQA datasets, respectively. We select four models for comparison. GRN is an entry on the HotpotQA leaderboard with no published paper; as a result, we do not use it as a baseline when testing on the TriviaQA dataset. The tables show that DRN achieves the best answer and joint scores on HotpotQA and the best EM and F1 on TriviaQA; on supporting-fact prediction alone, QFE [22] remains ahead. As described in Sect. 3, DRN applies the attention mechanism repeatedly to model people’s reading habits, so that the text information is understood as fully as possible. In addition, GAT is employed to pass information dynamically across the entity graph. The following results indicate the efficacy of our approach.

Table 1. Performance on HotpotQA dataset

Model           Answer          Sup Fact        Joint
                EM      F1      EM      F1      EM      F1
Baseline [16]   44.45   58.31   20.36   64.52   11.12   40.23
GRN             52.92   66.71   52.37   84.11   31.77   58.47
QFE [22]        53.71   66.28   57.80   84.37   34.89   60.17
DFGN [12]       55.32   69.27   52.10   81.63   33.61   59.76
DRN (ours)      58.49   72.42   55.89   83.56   36.04   63.13
Table 2. Performance on TriviaQA dataset

Model           EM      F1
Baseline [16]   44.94   46.85
QFE [22]        55.14   56.26
DFGN [12]       56.50   59.13
DRN (ours)      59.73   62.21
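For readers unfamiliar with the EM and F1 scores reported in Tables 1 and 2, the sketch below shows one common way to compute them for a single answer string. The normalization (lowercasing, removing punctuation and English articles) follows widespread practice in extractive QA evaluation and is an assumption here, not a description of the exact scripts used for these tables.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Token overlap counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match after normalization, whereas F1 gives partial credit for overlapping tokens, which is why F1 is always at least as high as EM in the tables.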
4.4 Ablation Studies
To evaluate the contribution of the different components of our model, we perform an ablation study on the development set of HotpotQA. The results are shown in Fig. 5, where ‘w/o’ stands for without. Removing any one of the three types of edges lowers the results to some extent. Among them, the context-based edges contribute the most, which illustrates the importance of information interaction within a single passage. Furthermore, if the entity filter, which restricts message passing to the query-aware entity nodes, is not used, the performance decreases by 1.06% in EM and 1.09% in F1 score, which demonstrates the effectiveness of the entity filter. Without the query reshaping mechanism, the performance degrades by 1.28% in EM and 1.12% in F1 score. This supports our claim that visiting a text repeatedly, each time taking its most important part into consideration, improves the understanding of the text. Without the query update in the reasoning step, the performance drops by 1.02% in EM and 0.61% in F1 score. Passing information from the entity graph to the query therefore helps reasoning.
Fig. 5. Ablation studies on HotpotQA
4.5 Results Analysis
Figure 6 shows the results for different numbers of re-visits during the query reshaping phase. We obtain the best performance when reshaping the query 5 times. As claimed above, re-visiting a query multiple times facilitates a full understanding of the information. However, when the query is read too many times, the different parts of the query are no longer distinguished in terms of importance.
Fig. 6. Performance of different re-visiting times during query reshaping phase
5 Conclusion
Most existing MRC models obtain the answer within a single passage. However, many complex questions can only be answered by reasoning across multiple passages because the information in a single passage is insufficient. Therefore, a number of multi-hop reasoning QA models have been proposed since 2018. In this paper, we propose the Dynamic Reasoning Network (DRN) for the MRC task based on multi-hop reasoning. We first introduce a passage selector that removes distracting passages and thus reduces the amount of information to be processed. We also build a dynamic reasoning network with dynamic graph attention (GAT) and a query reshaping mechanism to make full use of the text information. We evaluate DRN on the HotpotQA and TriviaQA datasets. The experimental results demonstrate the effectiveness of our model in terms of both EM and F1 score. In the future, we will focus on improving the passage selector so that it chooses gold passages more precisely, which is essential for the downstream tasks.
Acknowledgments. The work was partially supported by the Sichuan Science and Technology Program under Grant Nos. 2018GZDZX0039 and 2019YFG0521.
References
1. Seo, M., et al.: Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 (2016)
2. Liu, X., et al.: Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556 (2017)
3. Wang, W., et al.: Gated self-matching networks for reading comprehension and question answering. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017)
4. Yu, A.W., et al.: QANet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018)
5. Munkhdalai, T., Yu, H.: Reasoning with memory augmented neural networks for language comprehension. arXiv preprint arXiv:1610.06454 (2016)
6. Shen, Y., et al.: ReasoNet: learning to stop reading in machine comprehension. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)
7. Chen, J., Lin, S., Durrett, G.: Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610 (2019)
8. Kundu, S., et al.: Exploiting explicit paths for multi-hop reading comprehension. arXiv preprint arXiv:1811.01127 (2018)
9. Dhingra, B., et al.: Neural models for reasoning over multiple mentions using coreference. arXiv preprint arXiv:1804.05922 (2018)
10. Song, L., et al.: Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040 (2018)
11. De Cao, N., Aziz, W., Titov, I.: Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920 (2018)
12. Xiao, Y., et al.: Dynamically fused graph network for multi-hop reasoning. arXiv preprint arXiv:1905.06933 (2019)
13. Wang, S., Jiang, J.: Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905 (2016)
14. Wang, W., et al.: R-NET: machine reading comprehension with self-matching networks. Natural Language Computing Group, Microsoft Research Asia, Beijing, China, Technical Report 5 (2017)
15. Clark, C., Gardner, M.: Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723 (2017)
16. Yang, Z., et al.: HotpotQA: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600 (2018)
17. Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
18. Manning, C.D., et al.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2014)
19. Veličković, P., et al.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
20. Joshi, M., et al.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017)
21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
22. Nishida, K., et al.: Answering while summarizing: multi-task learning for multi-hop QA with evidence extraction. arXiv preprint arXiv:1905.08511 (2019)
Memory Attention Neural Network for Multi-domain Dialogue State Tracking
Zihan Xu1,2, Zhi Chen1,2, Lu Chen1,2(B), Su Zhu1,2, and Kai Yu1,2(B)
1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
2 SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{zihan.xu,zhenchi713,chenlusz,paul2204,kai.yu}@sjtu.edu.cn
Abstract. In a task-oriented dialogue system, the dialogue state tracker aims to generate a structured summary (domain-slot-value triples) of the whole dialogue. However, existing approaches generally fail to make good use of pre-defined ontologies. In this paper, we propose a novel Memory Attention State Tracker that treats ontologies as prior knowledge and uses a Memory Network to store this information. Our model is composed of an utterance encoder, an attention-based query generator, a slot gate classifier, and ontology Memory Networks for every domain-slot pair. To make a fair comparison with previous approaches, we also conduct experiments with an RNN instead of pre-trained BERT as the encoder. Empirical results show that our model achieves competitive joint accuracy on the MultiWoz 2.0 and MultiWoz 2.1 datasets.
1 Introduction
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems and has attracted more and more attention [2]. DST is expected to accurately summarize the intentions of a user and extract a compact representation of dialogue states. The dialogue state is a set of domain-slot-value triples, which records all the user’s conditions (e.g., train-leaveat-13:45 means that the user wants to book a ticket for the train leaving at 13:45). Traditional approaches for DST rely on hand-crafted rules and rich expert knowledge [18]. Recently, data-driven dialogue state trackers have achieved substantial performance improvements. Current deep-learning based models fall into two categories: classification and generation. Classification approaches [20, 21] assume that all ontologies are predefined in the task, so that DST can be treated as a classification problem. For example, for the slot bookday in the domain train, which tells us when the train leaves, all possible values can be known in advance. Generation approaches use an open-vocabulary setup for value searching [19]. They assume that some values can be copied from the dialogue context and that all values can be extracted from an open vocabulary.
Z. Xu and Z. Chen are co-first authors and contributed equally to this work.
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 41–52, 2020. https://doi.org/10.1007/978-3-030-60450-9_4
Fig. 1. An example of multi-domain dialogue state tracking
However, it is difficult for generation approaches to handle cases where complex reasoning is required to get the right answer. A large-scale multi-domain dialogue dataset (MultiWOZ 2.0) was recently introduced by [1], and MultiWoz 2.1 [7] was proposed by fixing many annotation errors in MultiWoz 2.0. These datasets are two of the largest multi-domain dialogue datasets available at present. There are mainly two challenges that make the task difficult. The first challenge is its cross-domain character. As shown in Fig. 1, a user can first ask the system about booking a train and then switch to asking about hotel rooms, which requires the tracker to handle domain transitions flexibly. Moreover, slots like train-departure and train-destination are easily confused since they share many similarities. In total, there are 1879 possible values in MultiWoz 2.0 and 1934 in MultiWoz 2.1, which creates a large number of combinations. In this case, our model has to first determine the correct domain-slot pairs to track and then accurately point out the right values. The second challenge is that complex reasoning is required in multi-turn dialogues. In Fig. 1, the dotted arrows show how multi-turn reasoning may be needed throughout a dialogue. For example, to predict the state hotel-area-dontcare in this dialogue, we even need cross-domain reasoning. In this work, we formulate dialogue state tracking as a classification problem based on predefined ontologies. We employ the Memory Network to store the predefined ontologies and use an attention-based query generator to generate a slot-aware summary of the dialogue context.
Our model decomposes dialogue state tracking into three separate sub-tasks: (1) dialogue summary generation, which generates a query based on attention scores between the domain-slot pair and the dialogue utterance, (2) slot value prediction, which calculates a probability distribution over the possible values stored in the memory layers and updates the
query for each hop, and (3) slot state prediction, which predicts the slot state among {value, don’t care, none}. By sequentially updating the query, both dialogue information and ontology knowledge are integrated. This is intuitive, because sometimes we need to traverse all the possible values in (2) to determine the answer in (3). In order to improve the reasoning ability of the dialogue state tracker, we propose a novel architecture that incorporates predefined ontologies into memory networks. The Memory Network [17] is a general paradigm used in machine reading tasks. In this work, we employ the Gated Memory Network [13], which adds Memory Gates between adjacent memory layers. We propose the Memory Attention Neural Network for multi-domain task-oriented dialogue tracking. Explainability and compatibility are the two main advantages of our proposed model. The contributions of this work are summarized as follows:
– We utilize prior knowledge by setting up a state classifier Memory Network, which is the first work to model the predefined ontologies in Dialogue State Tracking.
– We build a computationally efficient multi-domain DST model by proposing an architecture that is implemented purely with the Attention Mechanism, which also shows a performance boost.
– We adopt recent progress in large pre-trained contextual word embeddings [6] into dialogue state tracking, and obtain competitive performance.
– We conduct ablation studies on memory hops and memory gates and provide a comprehensive error analysis of the results.
2 Related Work
Dialogue State Tracking. Traditional methods of dialogue state tracking are often based on hand-crafted rules [8], where the principal focus is on building a set of semantic parsing rules. These rule-based methods have been widely used in commercial systems. However, the performance of traditional models relies heavily on a huge amount of expert knowledge, which is labor-intensive and expensive to obtain. Besides, due to the inherent limitations of hand-crafted rules, traditional methods are prone to mistakes in practice. Statistical dialogue state trackers calculate a probability distribution over hypotheses of the real dialogue state, based on the outputs of a semantic parsing module. Compared with traditional methods, statistical models are more robust in many scenarios. Recent approaches use end-to-end architectures [3,9,19] to combine Natural Language Understanding and the Dialogue State Tracker. These approaches can be further classified into fixed-vocabulary and open-vocabulary. Fixed-vocabulary approaches usually assume that all ontologies are known in advance and the model only has to pick the correct value from the ontologies. Open-vocabulary approaches often start with zero knowledge of possible values and generate the candidate values from an open vocabulary.
Fig. 2. An overview of our proposed model
Memory Networks Based DST. The Memory Network has proven useful in machine reading tasks, where the model must be capable of performing complex reasoning. Some previous works have used Memory Networks for dialogue state tracking. For example, an early work [16] formulated dialogue state tracking as a machine reading task, where the dialogue state is retrieved by solving a reading comprehension task. In this work, we use the Memory Network as the decoder of the dialogue state tracker. We use a dialogue encoder to generate a query as the input of the Memory Network. We use the attention scores derived at the last hop as a probability distribution over dialogue states, and we use the query from the last hop for dialogue state gate classification.
3 Proposed Framework
In a multi-domain dialogue system, there is a set of domains $D$ that users and the system can converse about. For each domain $d \in D$, there are $n_d$ slots. Each slot $s$ corresponds to a specific aspect of the user intent (e.g. price) and can take a value $v$ (e.g. cheap) from a candidate value set defined by a domain ontology $O$. The dialogue state $S$ can be defined as a set of $(d, s, v)$ triples, e.g. {(hotel, price-range, cheap), (restaurant, area, west)}. At the $t$-th dialogue turn, the dialogue state is $S_t$, which is used as a constraint to frame a database query. Based on the results of the database query and $S_t$, the system gives a response $R_{t+1}$ to the user. Then the user inputs a new sentence $U_{t+1}$. A state tracker then updates the dialogue state from $S_t$ to $S_{t+1}$ according to $R_{t+1}$ and $U_{t+1}$. The whole dialogue process can be represented as a set of triples $\{(S_0, R_1, U_1), (S_1, R_2, U_2), \cdots, (S_{T-1}, R_T, U_T)\}$. Traditionally, many DST models predict the dialogue state from the whole dialogue context up to the current turn. They do not explicitly model the dialogue state update process. In contrast, our proposed model explicitly updates the dialogue state from $S_t$ to $S_{t+1}$ depending on $R_{t+1}$ and $U_{t+1}$:

$$S_{t+1} = f_{dst}(S_t, R_{t+1}, U_{t+1}). \tag{1}$$
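The state representation and the update in Eq. (1) can be illustrated with a small Python sketch. This is not the authors' implementation: the function names `update_state` and `flatten_state` are hypothetical, and the rule that a predicted "none" clears a slot is an assumption made only for illustration. The flattening step mirrors the serialisation of the state into a token sequence that Sect. 3.1 describes.

```python
from typing import Dict, Tuple

# (domain, slot) -> value; one entry per (d, s, v) triple in the dialogue state.
DialogueState = Dict[Tuple[str, str], str]


def update_state(state: DialogueState,
                 predictions: Dict[Tuple[str, str], str]) -> DialogueState:
    """Hypothetical last step of f_dst: merge per-slot predictions into S_t.

    Assumption: a predicted value of "none" clears the slot; any other
    prediction overwrites the stored value.
    """
    new_state = dict(state)  # copy, so S_t itself is left untouched
    for pair, value in predictions.items():
        if value == "none":
            new_state.pop(pair, None)
        else:
            new_state[pair] = value
    return new_state


def flatten_state(state: DialogueState) -> str:
    """Serialise the state as "domain slot value ...", splitting hyphens into words."""
    parts = []
    for (domain, slot), value in state.items():
        parts.append(f"{domain} {slot} {value}".replace("-", " "))
    return " ".join(parts)


s_t = {("hotel", "price-range"): "cheap"}
s_next = update_state(s_t, {("restaurant", "area"): "west"})
print(flatten_state(s_next))  # hotel price range cheap restaurant area west
```

Keeping the state as a mapping keyed by (domain, slot) makes the turn-level update a local overwrite, which is the property the explicit update in Eq. (1) relies on.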
Concretely, our proposed DST model takes the triple $(S_t, U_{t+1}, R_{t+1})$ as input, and predicts a distribution over all candidate values for each slot. As shown in Fig. 2, our proposed model consists of four components: an input encoding module, a context-aware slot embedding, a multi-hop gated memory network and a slot gate classifier.
3.1 Input Encoding Module
The input encoding module takes the previous dialogue state $S_{t-1}$ and the current dialogue utterances $R_t$ and $U_t$ as input, and outputs a representation for each token in the input. As mentioned above, the dialogue state $S_{t-1}$ is a set of domain-slot-value triples. Therefore, we denote the previous state as $S_{t-1} = Z_{t-1}^{1} \oplus Z_{t-1}^{2} \oplus \cdots \oplus Z_{t-1}^{J}$, where $J$ is the number of triples in the dialogue state. Each triple $Z_{t-1}^{i}$ is denoted as a sub-sequence, i.e. $Z_{t-1}^{i} = d \oplus s \oplus v$; e.g. {(hotel, price-range, cheap), (restaurant, area, west)} is translated to the sequence “hotel price range cheap restaurant area west”. The state sequence is then concatenated with the dialogue utterances to form the input of the encoder. Here, we utilize the deep contextual pre-trained language model BERT [6] as the encoder:

$$H_t = \mathrm{BERT}([\mathrm{CLS}] \oplus S_{t-1} \oplus [\mathrm{SEP}] \oplus R_t \oplus U_t \oplus [\mathrm{SEP}]), \tag{2}$$

where [CLS] and [SEP] are special tokens for separation in BERT. $H_t = \{h_t^k\}_{k=1}^{N}$, where $h_t^k$ is the representation vector of the $k$-th token in the joint input sequence and $N$ is the number of tokens in the input sequence.¹
3.2 Context-Aware Slot Embedding
The input sequence described in Sect. 3.1 usually contains information about more than one slot. For the $j$-th (domain, slot) pair, we need to harvest its related information. First, we use the pre-trained BERT to encode the domain and slot tokens, and take the corresponding representation vector of “[CLS]” as the initial slot embedding $s_j$. Then, we obtain the context-aware slot representation $c_j$ according to the attention mechanism,

$$c_j = \sum_{k=1}^{N} \alpha_k h_k, \tag{3}$$

$$\alpha_k = \mathrm{softmax}\left(s_j^{T} W_{att} h_k + b_{att}\right), \tag{4}$$

where $W_{att} \in \mathbb{R}^{d \times d}$ and $b_{att} \in \mathbb{R}^{d}$ are trainable parameters, $d$ is the dimension of the vectors, and $s_j$ is also updated during training. This context-aware slot embedding $c_j$ can be regarded as a summary of the dialogue context from the view of the $j$-th (domain, slot) pair, and it can be used as the query vector to retrieve the related values in the ontology memory.

¹ For brevity, the subscript $t$ of $h_t^k$ will be omitted in the following sections.
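Equations (3) and (4) amount to attention pooling of the token vectors with the slot embedding as the query. The following pure-Python sketch shows the computation on plain lists. It is an illustration only: the function names are hypothetical, and because $b_{att}$ is stated to be a vector in $\mathbb{R}^d$, the sketch assumes the bias is added inside the projection, i.e. the score is $s_j^T (W_{att} h_k + b_{att})$, which is one dimensionally consistent reading of Eq. (4).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def context_aware_slot_embedding(s_j, H, W_att, b_att):
    """Return (c_j, alpha) following Eqs. (3)-(4).

    s_j: slot embedding (length d); H: list of N token vectors (each length d).
    Assumption: score_k = s_j^T (W_att h_k + b_att).
    """
    scores = []
    for h_k in H:
        projected = [p + b for p, b in zip(matvec(W_att, h_k), b_att)]
        scores.append(dot(s_j, projected))
    alpha = softmax(scores)  # normalised over the N tokens
    d = len(H[0])
    c_j = [sum(a * h_k[i] for a, h_k in zip(alpha, H)) for i in range(d)]
    return c_j, alpha


# Toy example: two tokens, d = 2, identity projection, zero bias.
identity = [[1.0, 0.0], [0.0, 1.0]]
c_j, alpha = context_aware_slot_embedding(
    s_j=[1.0, 0.0], H=[[1.0, 0.0], [0.0, 1.0]], W_att=identity, b_att=[0.0, 0.0]
)
```

Under this reading the bias contributes the same term $s_j^T b_{att}$ to every token score, so it does not change the softmax weights; the token that aligns best with the slot embedding after projection receives the largest weight.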
3.3 Multi-hop Gated Memory Network
For the $j$-th (domain, slot) pair², we utilize the multi-hop gated memory network (MH-GMN) to find the related value according to the context-aware slot embedding $c_j$. MH-GMN consists of multi-layer supporting memories, each of which is in turn comprised of a set of input and output memory representations with memory cells. At the $l$-th layer, the input and output memory cells are obtained by transforming the candidate values $\{v_1, \cdots, v_K\}$ in the ontology using two trainable embeddings $A^l \in \mathbb{R}^{d \times K}$ and $E^l \in \mathbb{R}^{d \times K}$. For both embeddings, their $i$-th column vectors can be initialized with the contextual representation from BERT, i.e. the domain, slot and $i$-th candidate value $v_i$ as well as the two tokens “[CLS]” and “[SEP]” are concatenated as the input of BERT:

$$h_i^{[\mathrm{CLS}]} = \mathrm{BERT}([\mathrm{CLS}] \oplus \mathrm{domain} \oplus \mathrm{slot} \oplus v_i \oplus [\mathrm{SEP}]), \tag{5}$$

where $h_i^{[\mathrm{CLS}]}$ is the contextual representation of “[CLS]” and it is used as the initial embedding vector³. MH-GMN takes $c_j$ as the initial query vector $q^0$, and updates it from one hop to the next. At the $l$-th hop, we use $q^l$ to compute dot product attention scores $p_i^l$ $(1 \le i \le K)$ over each entry of the input memory cells,

$$p_i^l = \mathrm{softmax}\left((q^l)^{T} \cdot a_i^l\right), \tag{6}$$

where $a_i^l$ is the $i$-th column vector of the input embedding matrix $A^l$, i.e. it is the input embedding vector of the $i$-th candidate value $v_i$. Subsequently, we calculate the output memory $o^l$ as the sum of the output memory cells weighted by the attention scores:

$$o^l = \sum_{i=1}^{K} p_i^l e_i^l, \tag{7}$$

where $e_i^l$ is the $i$-th column vector of the output embedding matrix $E^l$. We use an end-to-end memory access regulation mechanism in the updating procedure. We define the forget gate as:

$$g^l = \sigma(W_g^l q^l + b_g^l), \tag{8}$$

where $W_g^l \in \mathbb{R}^{d \times d}$ and $b_g^l \in \mathbb{R}^{d}$ are the hop-specific trainable parameters for the $l$-th hop and $\sigma$ is the sigmoid function. The updating rule of the query vector is defined by:

$$q^{l+1} = o^l \odot (1 - g^l) + q^l \odot g^l, \tag{9}$$

where $\odot$ denotes the element-wise product of two vectors. Finally, we take the attention scores $p_i^L$ $(1 \le i \le K)$ of the last hop as the slot value distribution.

² For brevity, the subscript indicating the (domain, slot) pair is omitted in this section and the next section.
³ When the size of the embedding vector and the size of the BERT embedding are different, a linear transformation layer will be used.
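The hop-by-hop computation of Eqs. (6)-(9) can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `gated_memory_hops` is hypothetical, each memory is stored as a list of $K$ column vectors, and the indexing convention (the returned distribution comes from the last hop executed, the returned query is the one produced after that hop) is an assumption, since the text leaves the exact hop indexing open.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    # Branch on the sign so that math.exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_memory_hops(q0, A, E, W_g, b_g):
    """Run one gated update per hop and return (p, q).

    A[l], E[l]: lists of K vectors (columns a_i^l and e_i^l) for hop l.
    W_g[l]: d x d gate matrix (list of rows); b_g[l]: gate bias of length d.
    """
    q = list(q0)
    p = []
    for A_l, E_l, W_l, b_l in zip(A, E, W_g, b_g):
        p = softmax([dot(q, a_i) for a_i in A_l])                    # Eq. (6)
        o = [sum(p_i * e_i[n] for p_i, e_i in zip(p, E_l))
             for n in range(len(q))]                                 # Eq. (7)
        g = [sigmoid(dot(row, q) + b) for row, b in zip(W_l, b_l)]   # Eq. (8)
        q = [o_n * (1.0 - g_n) + q_n * g_n
             for o_n, g_n, q_n in zip(o, g, q)]                      # Eq. (9)
    return p, q


# Toy example: one hop, d = 2, K = 2, gate fixed at 0.5 (zero weights and bias).
cells = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
p, q = gated_memory_hops([1.0, 0.0], [cells], [cells], [zeros], [[0.0, 0.0]])
```

With the gate at 0.5 the new query is the average of the old query and the retrieved memory; a gate close to 1 keeps the old query almost unchanged, which is how the gate regulates how much ontology information overwrites the dialogue summary.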
3.4 Slot Gate Classifier
Following the previous work [19], a slot gate is also utilized here to predict the two specific slot values dontcare and none. We use a context-aware gate classifier to map the context query into three classes {value, dontcare, none}. For the $j$-th (domain, slot) pair, the gate classifier takes the final query vector $q^L$ as input and predicts which class it belongs to:

$$p^c = \mathrm{softmax}(W_c \cdot q^L), \tag{10}$$

where $W_c \in \mathbb{R}^{3 \times d}$ is a parameter for the three-way linear layer.
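A short sketch of Eq. (10) and of how the gate could be combined with the value distribution from the memory network is given below. The class order in `GATE_CLASSES`, the helper names, and the decision rule in `resolve_slot` (use the most probable candidate only when the gate says "value") are assumptions for illustration; the text specifies only the three-way softmax itself.

```python
import math

# Assumed ordering of the three gate classes (rows of W_c).
GATE_CLASSES = ("value", "dontcare", "none")


def slot_gate(q_L, W_c):
    """Three-way softmax over W_c q^L, as in Eq. (10). Returns (label, probs)."""
    logits = [sum(w * x for w, x in zip(row, q_L)) for row in W_c]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return GATE_CLASSES[best], probs


def resolve_slot(gate_label, value_probs, candidates):
    """Hypothetical decision rule: defer to the memory attention only for 'value'."""
    if gate_label != "value":
        return gate_label
    best = max(range(len(value_probs)), key=value_probs.__getitem__)
    return candidates[best]


label, probs = slot_gate([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
prediction = resolve_slot(label, [0.1, 0.7, 0.2], ["north", "west", "east"])
```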
4 Experimental Setup
4.1 Dataset
We conduct experiments on MultiWoz 2.0 [1] and MultiWoz 2.1 [7], which are among the largest publicly available multi-domain task-oriented dialogue datasets. Both datasets include over 10,000 dialogues covering seven domains. MultiWoz 2.1 is a revised version of MultiWoz 2.0 that corrects some false annotations and improper utterances. As reported by [7], MultiWoz 2.1 corrected over 32% of state annotations across 40% of the dialogue turns and fixed 146 dialogue utterances by canonicalizing slot values in the utterances to the values in the dataset ontology. Following [19], we use only five domains (restaurant, train, hotel, taxi, attraction) out of all seven, since the other two domains (hospital, bus) make up only a small portion of the training set and do not appear in the test set at all. Instead of feeding in the whole dialogue history, we preprocess the dataset by concatenating the current dialogue state and the utterances of the current turn. In most previous works, the longest input sequence is 879 tokens, while we only have to process at most 150 tokens, which makes the training process more efficient.
4.2 Training Details
We employed the pretrained BERT-base-uncased for the utterance encoder, memory layer initialization, and domain-slot embedding initialization. The embedding size of memory layers and the domain-slot embedding size is 768, which is predefined in BERT-base-uncased. We use Adam as our optimizer. For memory layer initialization and domain-slot embedding initialization, we use the BERT output corresponding to [CLS] token. For example, we use the output h[CLS] of “[CLS] hotel price range cheap [SEP]” as a representation for the triplet “hotel-pricerange-cheap”. We use different learning rates for utterance encoder and memory networks. We set the learning rate to 1e−5 for utterance encoder and 1e−4 for memory networks. We use a batch size of 16 and set the dropout rate to 0.2. The max sequence length is fixed to 256.
Table 1. Main results of our approaches and several baselines on the test sets of MultiWoz 2.0 and MultiWoz 2.1 (joint goal accuracy, %).

Models             Predefined Ontology  BERT used  MultiWoz 2.0  MultiWoz 2.1
HJST [7]           Y                    N          38.40         35.55
FJST [7]           Y                    N          40.20         38.00
TRADE [19]         N                    N          48.60         45.60
Ours (GRU)         Y                    N          49.85         51.79
SOM-DST [12]       N                    Y          51.38         52.57
DS-DST [20]        Y                    Y          –             51.21
DST-picklist [20]  Y                    Y          –             53.30
SST [4]            Y                    Y          51.17         55.23
Trippy [11]        N                    Y          –             55.29
Ours (BERT)        Y                    Y          50.15         52.70
During training, we regard the model as a branched multi-layer architecture, where all domain-slot subtasks share the encoder parameters and each has its own memory network. Since we use the adjacent architecture in our Memory Network implementation, there are L + 1 different layers for an L-hop memory network. To prevent overfitting, we sequentially freeze all but one layer and train only that layer for one epoch. We use teacher forcing to train the memory network. In our model, the dialogue context is encoded with BERT [6], and the memory embeddings and domain-slot embeddings are also initialized with BERT. To make a fair comparison with previous models, we also conducted experiments in which BERT is replaced with a BiGRU [5]; in this setting we initialize the utterance word embeddings, memory embeddings, and domain-slot embeddings by concatenating GloVe embeddings [15] and character embeddings [10], where the dimension is 400 for each vocabulary word.
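The layer-wise freezing described above can be pictured as a schedule that marks exactly one of the L + 1 memory layers as trainable in each epoch. The sketch below is a hedged illustration: the function name `freezing_schedule` is hypothetical, and the round-robin order (layer 0, 1, ..., L, then back to 0) is an assumption, since the text states only that the layers are unfrozen sequentially, one per epoch.

```python
def freezing_schedule(num_hops, num_epochs):
    """Yield (epoch, trainable_flags) with exactly one trainable layer per epoch.

    An L-hop network with adjacent weight sharing has L + 1 layers.
    Assumption: layers are visited in round-robin order.
    """
    num_layers = num_hops + 1
    for epoch in range(num_epochs):
        active = epoch % num_layers
        yield epoch, [layer == active for layer in range(num_layers)]


# Two hops -> three layers; the fourth epoch returns to the first layer.
schedule = list(freezing_schedule(num_hops=2, num_epochs=4))
```

In a deep-learning framework the boolean flags would typically be applied by enabling or disabling gradient updates for each layer's parameters before the epoch starts.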
5 Experimental Results
5.1 Baseline Models
TRADE encodes the dialogue context and decodes the value for every slot using a copy mechanism in dialogue state sequence generation. It is also capable of transferring to unseen domains [19]. DST-picklist was proposed together with DS-DST, but it assumes that the full ontology is available and performs only picklist-based DST [20]. SST predicts dialogue states from dialogue utterances and schema graphs, which contain slot relations in their edges. It uses a graph attention matching network to fuse information from utterances and graphs, and a recurrent graph attention network to control state updating [4].
Trippy makes use of various copy mechanisms to fill slots with values. It combines the advantages of span-based slot filling methods with memory methods to avoid the use of value picklists altogether [11].
5.2 Joint Goal Accuracy
We compare our approach with several baselines. For performance evaluation, we use joint goal accuracy, an evaluation metric that checks whether the predicted values of all slots exactly match those of the ground truth. The experimental results on the test sets of MultiWoz 2.0 and MultiWoz 2.1 are reported in Table 1. As we can see, our model achieves 50.15% joint accuracy on MultiWoz 2.0 and 52.70% on MultiWoz 2.1, which is competitive among models using BERT. To make a fair comparison with previous non-BERT models, we also conducted experiments using a GRU instead of BERT. The non-BERT version achieves 49.85% and 51.79% joint accuracy on MultiWoz 2.0 and MultiWoz 2.1 respectively (Table 1), which is higher than the previous non-BERT models we compare with. Interestingly, contrary to most previous works, our model achieves higher performance on MultiWoz 2.1 than on MultiWoz 2.0. This phenomenon is consistent with the report of SOM-DST [12]. As its authors suggest, models that explicitly use the dialogue state labels as input, like SOM-DST and our model, benefit more from the error correction on the state annotations done in MultiWOZ 2.1.
5.3 Slot-Specific Accuracy
Figure 3 shows the comparison between the slot-specific accuracy of TRADE and the non-BERT version of our model on the test set of MultiWoz 2.1. Our model achieves much better performance than TRADE on almost all sub-tasks. Specifically, our model outperforms TRADE by more than 2% accuracy on train-leaveat, restaurant-name, taxi-destination and taxi-departure, where many possible values are provided. This suggests that our model remains robust even when confronted with many distractors.
Fig. 3. Slot-specific accuracy of TRADE and the RNN version of our model on the test set of MultiWoz 2.1. The numbers in brackets indicate how many possible values there are under each domain-slot pair.
5.4 Ablation Study on Memory Gates
We conducted an ablation study on the use of Memory Gates; the result is shown in Fig. 4. As can be seen from the figure, using Memory Gates generally benefits performance. With the two-hop memory network, using Memory Gates achieves a gain of over 4% in joint accuracy. Interestingly, with the one-hop memory network, the use of Memory Gates has only a slight impact on performance. This is presumably because a one-hop memory network uses only one Memory Gate and therefore gains little from the memory updating mechanism.
Fig. 4. Joint Accuracy variation with and without gates using different number of memory hops
Fig. 5. Attention scores taken from different hops related to . The turn utterance is ‘i would recommend express by holiday inn cambridge. from what day should i book ? ; starting saturday. i need 5 nights for 6 people by the way.’
5.5 Ablation Study on Number of Memory Hops
We conducted experiments with different numbers of memory hops; the result is shown in Fig. 4. With or without memory gates, our model achieves the best performance using two memory hops. Interestingly, the
performance of our model drops when the number of memory hops increases further, which is counterintuitive. This is presumably because the dialogue information is blurred by repeated query updating. When 4 hops of memory layers are used, the performance of our model drops dramatically, indicating that properly understanding the dialogue information is as important as performing complex reasoning. Figure 5 presents an example showing how memory networks can be powerful in complex reasoning. This example is taken from the test set of MultiWoz 2.1 and involves a customer trying to book a hotel room. At the first hop, the memory network seems to be misled by the information that the customer wants to book for 6 people. With the help of the query updating mechanism, however, the model corrects itself at the second hop.
6 Conclusion
In this work, we present a novel model for multi-domain dialogue state tracking, which combines the Gated Memory Network and pre-trained BERT [6] to increase the ability of complex reasoning. We also conducted experiments with an RNN-based implementation, instead of using BERT as the utterance encoder, to make a fair comparison with former non-BERT models. Experiments show that our approach achieves competitive performance compared with previous approaches and reaches state-of-the-art performance among non-BERT models.
Acknowledgement. We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).
References
1. Budzianowski, P., et al.: MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026 (2018)
2. Chen, H., Liu, X., Yin, D., Tang, J.: A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explor. Newslett. 19(2), 25–35 (2017)
3. Chen, L., Chen, Z., Tan, B., Long, S., Gašić, M., Yu, K.: AgentGraph: toward universal dialogue management with structured deep reinforcement learning. IEEE/ACM Trans. Audio Speech Lang. Process. 27(9), 1378–1391 (2019)
4. Chen, L., Lv, B., Wang, C., Zhu, S., Tan, B., Yu, K.: Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In: AAAI, pp. 7521–7528 (2020)
5. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
7. Eric, M., et al.: MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669 (2019)
8. Goddeau, D., Meng, H., Polifroni, J., Seneff, S., Busayapongchai, S.: A form-based dialogue manager for spoken language applications. In: Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP 1996, vol. 2, pp. 701–704. IEEE (1996)
9. Goel, R., Paul, S., Hakkani-Tür, D.: HyST: a hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883 (2019)
10. Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587 (2016)
11. Heck, M., et al.: TripPy: a triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877 (2020)
12. Kim, S., Yang, S., Kim, G., Lee, S.W.: Efficient dialogue state tracking by selectively overwriting memory. arXiv preprint arXiv:1911.03906 (2019)
13. Liu, B., Lane, I.: An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In: INTERSPEECH (2017)
14. Paul, S., Goel, R., Hakkani-Tür, D.: Towards universal dialogue act tagging for task-oriented dialogues. arXiv preprint arXiv:1907.03020 (2019)
15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
16. Perez, J., Liu, F.: Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052 (2016)
17. Sukhbaatar, S., et al.: End-to-end memory networks. In: Advances in Neural Information Processing Systems, pp. 2440–2448 (2015)
18. Wen, T.H., et al.: A network-based end-to-end trainable task-oriented dialogue system. In: EACL (2016)
19. Wu, C.S., Madotto, A., Hosseini-Asl, E., Xiong, C., Socher, R., Fung, P.: Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743 (2019)
20. Zhang, J.G., et al.: Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544 (2019)
21. Zhong, V., Xiong, C., Socher, R.: Global-locally self-attentive dialogue state tracker. arXiv preprint arXiv:1805.09655 (2018)
Learning to Answer Word-Meaning-Explanation Questions for Chinese Gaokao Reading Comprehension
Hongye Tan(B), Pengpeng Qiang, and Ru Li
School of Computer and Information Technology, Shanxi University, Taiyuan, China
Abstract. Word sense understanding (WSU) is fundamental to human reading, and the word-meaning-explanation question is an important kind of question in Chinese reading comprehension (RC) in the college entrance exams of China (called ‘Gaokao’ for short), which requires students to explain the meaning of a target word. This paper proposes a method to answer word-meaning-explanation questions, which combines the VAE framework with BERT and the Transformer to learn rich, nonlinear representations for producing a high-quality explanation of a target word within a certain context. In order to generate multi-style explanations, we construct not only Chinese dictionary-style datasets, but also an essay-style dataset as a supplement for the Chinese Gaokao application. We also build a Gaokao-style test set to evaluate our model. The experimental results show that our model performs better than the baseline models. The code and the relevant dataset will be released on GitHub.
Keywords: Word meaning explanation · Reading comprehension · Natural language generation
1 Introduction
Machine Reading Comprehension (MRC) is a challenging problem and has received much attention. One line of MRC research attempts to answer questions from standard tests [1–8]. For example, RACE [1] is a typical benchmark of this kind for the multi-choice MRC task, collected from English exams for Chinese middle and high school students. ARC [2] is another dataset, consisting of science questions ranging from 3rd grade to 9th grade. There are also projects aiming to build systems that pass university entrance exams, such as the Todai Robot Project in Japan [3] and the 863 Program project in China [4, 8]. In China, the college entrance exams (called ‘Gaokao’ for short) involve Chinese reading comprehension (RC), which provides some texts and related questions for students to answer. The main question types are multi-choice or free-description style questions, which involve understanding of word senses, rewriting of specific details, interpretation of complex sentences, comprehension of the main idea, inferences about the author’s intention and sentiment, and language appreciation. Different questions pose distinct challenges, so it is impossible for a general model to solve all the problems.
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 53–64, 2020. https://doi.org/10.1007/978-3-030-60450-9_5
Word sense understanding (WSU) is fundamental to human reading and is an important kind of question in RC. The WSU questions in Chinese Gaokao RC usually take one of the following two forms: (1) word meaning explanation: given a target word (the word to be explained) and its context, explain the word’s meaning; (2) word meaning analysis: given the explanation of a target word and its context, decide whether the explanation is correct. This paper explores how to answer word-meaning-explanation questions. An example is shown in Table 1. Word-meaning-explanation questions usually have the following characteristics: (1) The target word can be an out-of-vocabulary word, or can carry a new meaning in the text. For example, in Table 1, the target word (‘ruins’) has the new meaning of ‘the architectural remains with historical and cultural information, cultural relic value and aesthetic connotation’, and its sentiment polarity changes from negative to positive. (2) The style of the explanations can differ, because a target word can be explained with a synonym or with its definition, as in the example in Table 1. (3) Usually, the meaning of a word can be inferred from the sentence-level context.
Table 1. Word-meaning-explanation question examples (translated from Chinese into English)
In this paper, we propose a model to answer word-meaning-explanation questions for Chinese Gaokao RC. The model combines the VAE framework with BERT and the Transformer to learn rich, nonlinear representations for producing a high-quality explanation of a target word within a given context. To generate diverse explanations, we construct multi-style datasets for model training. We use the glosses and example sentences in the Contemporary Chinese Dictionary (CCD) and in a Chinese four-character idiom dictionary to build the CCD-style dataset and the Idiom-style dataset. To simulate a real Chinese Gaokao application, we collect 1000 essays and select sentences containing monosemous words to construct an Essay-style dataset as a supplement. Additionally, we build a Gaokao-style test set to evaluate our model. The experimental results show that our model performs better than the baselines. The contributions of this work are as follows:
• We address the word-meaning-explanation problem for Chinese Gaokao RC and propose a model based on the VAE framework to solve it. This work has a certain general significance and can be extended to explain new words in more general application scenarios.
• We construct multi-style datasets based on Chinese dictionaries for generating diverse and high-quality explanations.
• We build a Gaokao-style test set to verify our proposed method. The results show that the method is more effective than the baselines.
2 Related Work
In this section, we describe existing tasks that are related to our work. One line of MRC research aims to answer questions from standardized tests [1–8]. The most representative task is RACE [1], an MRC benchmark collected from English exams for Chinese middle and high school students. The questions in RACE are multi-choice, and existing methods use a general model to answer all the questions, including WSU questions. In contrast, our task comes from Chinese Gaokao RC, which contains both multi-choice and free-description questions, each posing distinct challenges, so it is necessary to build different models for different problems. For example, Tan and Zhao explored answering complicated free-description questions in Chinese Gaokao RC [8]. This paper aims to answer the word-meaning-explanation questions, which is a generation problem rather than a multi-choice one.
Our task is related to paraphrase generation, definition modeling and word explanation. Paraphrase generation produces a new expression that conveys the same meaning as the original one [9–11], and synonymous words or phrases are often used to paraphrase a word. Our task is different because, besides synonyms, a word can also be explained with its definition (as in the example in Table 1), which cannot directly substitute for the original word in the context. Definition modeling is a task introduced by Noraset et al. [12], which aims to generate a definition sentence for a word. That task does not consider the word’s local context, so the related methods are not suitable for generating the explanation of a word within a specific context. Other researchers have proposed generating explanations of words from contexts. For example, Ni and Wang proposed a sequence-to-sequence model with attention mechanisms to generate explanations for non-standard English phrases [13].
Ishiwatari et al. proposed a more general and practical model that takes advantage of local and global contexts to generate explanations for given expressions from various domains [14]. Their task is similar to ours, but our method differs from theirs: we generate explanations for Chinese words and use the VAE framework to capture the information in the context and obtain better results.
3 Methods
In this section, we first give the formulation of our task. Then, we describe the proposed model in detail.
3.1 Task Formulation
As shown in Table 1, the word-meaning-explanation task can be defined as follows: given a target word (or phrase) $W = \{w_j, \ldots, w_k\}$ with its context $C = \{w_1, \ldots, w_I\}$ ($1 \le j \le k \le I$), output the target word’s explanation $T = \{t_1, \ldots, t_n\}$. Here, $W$ is a word or a phrase (called the target word for simplicity) contained in $C$, and $T$ is a phrase or a short sentence describing the meaning of $W$.
3.2 The Model
Figure 1 shows the overall architecture of our proposed model, which combines a variational autoencoder (VAE) with BERT and the Transformer to generate the explanation of a target word as the answer to the word-meaning-explanation question.
Fig. 1. Architecture of our model
Here, we build on the idea of the VAE-LSTM model proposed by Gupta et al. [15], which is used for paraphrase generation for specific sentences. To capture the essence of the original sentence, their model conditions both the encoder and the decoder of the variational autoencoder (VAE) on an intermediate representation of the input. Following them, our model uses a deep generative framework that combines a sequence-to-sequence model and a VAE to generate the explanation of a target word, given a certain context. However, we use BERT for the encoder to better capture the context information, and adopt a Transformer with the attention mechanism for the decoder to capture the contribution of the contextual words to the explanations.
3.2.1 The Variational Autoencoder (VAE)
The VAE is a deep generative latent variable model [16] that combines neural latent variables and amortized variational inference. Specifically, the VAE learns a latent representation $z$ for an input $x$ on the encoder side, and reconstructs the input $x$ from the latent variable $z$ on the decoder side. Through the latent representation $z$, the VAE learns rich, nonlinear representations from massive text data, such that the decoder can produce data similar to the inputs. This makes the VAE an attractive generative model.
In the VAE, a variational posterior $q_\phi(z|x)$ is used to approximate the intractable true posterior $p_\theta(z|x)$, and $q_\phi(z|x)$ is assumed to be a Gaussian distribution $\mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))$. Moreover, the VAE requires the posterior $q_\phi(z|x)$ to be close to the prior $p(z)$. The VAE infers the posterior distribution for each instance with a shared inference network and optimizes the evidence lower bound (ELBO) instead of the intractable marginal log-likelihood. The objective function is shown in Formula (1), where KL is the KL divergence.
$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x) \,\|\, p(z))$ (1)
3.2.2 The VAE Encoder
On the VAE input side, the encoder converts the target word $W$, its context $C$, and its explanation $T$ into the corresponding vector representations. In our implementation, the encoder is BERT, which produces dynamic vector representations for a word based on its context using the Transformer architecture and self-attention mechanisms [17]. The vector representation of the input $W$ is as follows:
$H^W = \mathrm{Encode}(W)$ (2)
where $\mathrm{Encode}(\cdot)$ is the output of the last layer of BERT. Similarly, we obtain $H^C$ and $H^T$ for the context $C$ and the explanation $T$. We then concatenate these vector representations after max-pooling, and use a linear transformation to produce the mean and variance parameters (i.e., $\phi$) of the VAE encoder.
3.2.3 The VAE Decoder
On the VAE output side, the decoder generates the explanation conditioned on the latent representation $z$ and on the vector representations of the target word and its context, which are produced by BERT and combined through an attention mechanism. To further capture the influence of contextual information on the generation of the explanation, we use the attention mechanism as follows:
$V_t = \mathrm{Attention}(H^C, H^W)$ (3)
More specifically, the decoder is implemented with the Transformer [18]. Inside the Transformer, $n$ encoders and decoders ($n = 6$) are stacked. The decoders use the encoder-decoder attention mechanism to capture the correlation between the input sentence and the output at the current time step.
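The encoder head and the KL term of Formula (1) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy vectors stand in for the BERT outputs $H^W$, $H^C$ and $H^T$, the weight matrices `w_mu` and `w_logvar` are hypothetical, and the prior $p(z)$ is assumed to be a standard normal $\mathcal{N}(0, I)$, which the text does not state explicitly.

```python
import math
import random


def max_pool(token_vectors):
    """Element-wise max over a sequence of equal-length token vectors."""
    return [max(column) for column in zip(*token_vectors)]


def linear(x, weights, bias):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gaussian_kl(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)); assumes a standard normal prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


# Toy stand-ins for the BERT outputs H^W, H^C, H^T (hidden size 2).
h_w = [[0.1, 0.4]]
h_c = [[0.3, -0.2], [0.5, 0.1], [0.0, 0.7]]
h_t = [[-0.1, 0.2], [0.6, 0.0]]

# Max-pool each sequence, then concatenate into one vector of size 6.
pooled = max_pool(h_w) + max_pool(h_c) + max_pool(h_t)

# Hypothetical linear heads mapping the pooled vector to a 2-d latent space.
rng = random.Random(0)
w_mu = [[rng.uniform(-0.1, 0.1) for _ in pooled] for _ in range(2)]
w_logvar = [[rng.uniform(-0.1, 0.1) for _ in pooled] for _ in range(2)]
mu = linear(pooled, w_mu, [0.0, 0.0])
logvar = linear(pooled, w_logvar, [0.0, 0.0])

z = reparameterize(mu, logvar, rng)   # latent code passed to the decoder
kl = gaussian_kl(mu, logvar)          # second term of Formula (1)
```

Predicting the log-variance rather than the variance is a common design choice because it keeps $\sigma^2$ positive without an explicit constraint; whether the paper's model does so is not stated.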
4 Experiments
4.1 Datasets
We construct multi-style datasets to train our model to generate high-quality and diverse explanations. In the datasets, each sample is a triple (word, context, explanation).
CCD-Style Dataset. We use the Contemporary Chinese Dictionary (fifth edition, CCD) to construct the CCD-style dataset. In the CCD, each entry usually involves three elements: a word, its explanation (or definition) and usage examples. If a word has multiple definitions or examples, we treat them as different entries. In this way, we extract 69,354 triples for the CCD-style dataset.
Idiom-Style Dataset. We also construct the Idiom-style dataset based on a Chinese four-character idiom dictionary from a website1, which provides the idioms, their phonetic transcriptions (Pinyin), sources, explanations and usage examples. We keep the idioms, explanations and examples to construct the Idiom-style dataset, which includes 12,631 triples.
Essay-Style Dataset. To simulate a real Chinese Gaokao application, we construct the Essay-style dataset. We collect 1000 essays and select the sentences containing monosemous words. We extract the monosemous words one by one, take each of them as a target word, use the sentence containing it as the context, and take its dictionary definition as the explanation. In this way, we obtain 12,138 triples in total for the Essay-style dataset.
Gaokao-Style Test Set. Additionally, we collect WSU questions from real Gaokao exams and practice exams. We transform these questions into triples (word, context, explanation) and use them as another test set (the Gaokao-style test set). The size of this test set is 264.
Table 2. Statistics of the datasets for our model
| Dataset | Explanation Avg. Len. | Context Avg. Len. | Size |
| CCD-style dataset | 9.3 | 8.41 | 69,354 |
| Idiom-style dataset | 22.55 | 26.40 | 12,631 |
| Essay-style dataset | 9.47 | 31.20 | 12,138 |
| Gaokao-style test set | 7.9 | 43.86 | 264 |
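The rule that a word with several definitions or examples yields several entries can be sketched as follows. This is only an illustration: the `Triple` type, the nested entry schema (`senses`, `definition`, `examples`) and the English toy entry are all hypothetical, since the dictionaries' actual storage format is not described.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    word: str
    context: str
    explanation: str


def entries_to_triples(entries):
    """Emit one (word, context, explanation) triple per definition/example pair."""
    triples = []
    for entry in entries:
        for sense in entry["senses"]:
            for example in sense["examples"]:
                triples.append(Triple(entry["word"], example, sense["definition"]))
    return triples


# Hypothetical toy entry with two senses; the first sense has two examples.
toy_dictionary = [
    {
        "word": "bank",
        "senses": [
            {"definition": "the land alongside a river",
             "examples": ["We walked along the bank.", "The bank was muddy."]},
            {"definition": "an institution that keeps and lends money",
             "examples": ["She works at a bank."]},
        ],
    }
]

triples = entries_to_triples(toy_dictionary)
```

Here the single toy entry expands into three triples, one for each definition and example pairing.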
Table 3. Splitting of the datasets
| Dataset | Train | Validation | Test |
| CCD-style dataset | 55500 | 6927 | 6927 |
| Idiom-style dataset | 10105 | 1250 | 1276 |
| Essay-style dataset | 9711 | 1000 | 1427 |
1 www.zd9999.com/cy/.
The above datasets have different styles. For example, compared with the other datasets, the example sentences in the CCD-style dataset are shorter and contain more common words. Most of the target words in the Idiom-style dataset are four-character idioms with incisive meanings, mostly derived from historical allusions or stories. In the Essay-style dataset, the meanings of the target words are often implicit and profound. In the Gaokao-style test set, the target words often carry new meanings. Detailed information about the datasets is shown in Table 2 and Table 3.
4.2 Settings and Baselines
Table 4. Hyperparameters of our model and baselines
| Hyperparameter | Dual-Encoder | I-Attention | LOG-CaD | UniLM | Ours |
| Layers of Enc-LSTMs | 2 | 2 | 2 | – | – |
| Dim. of Enc-LSTMs | 600 | 600 | 600 | – | – |
| Dim. of input word Emb. | 300 | 300 | 300 | 768 | 768 |
| Dim. of char. Emb. | 160 | – | 160 | – | – |
| Layers of Dec-LSTMs | 2 | – | 2 | – | – |
| Dropout rate | 0.5 | 0.5 | 0.5 | 0.1 | 0.1 |
| Learning rate | 0.001 | 0.001 | 0.001 | 5e-5 | 5e-5 |
Settings. In this work, considering the cost in memory and time, we use BERT-Base for our model. We use the Adam [19] optimizer to train the models. We set the learning rate to 5e-5, the dropout rate [20] to 0.1, the batch size to 128 and the number of warmup steps to 100. We train the model for 30 epochs.
Baselines. We take the following models as baselines, which are typical models for explaining word meanings in English.
Dual-Encoder [13]: the model is based on a sequence-to-sequence framework with the attention mechanism. The encoder has a dual structure, realized by word-level and character-level LSTMs.
I-Attention [21]: the model adopts a feed-forward neural network to indirectly use the local context to disambiguate the phrase, which improves performance.
LOG-CaD [14]: a state-of-the-art model that consists of a description decoder and two context encoders, one for the local context and one for the global context. It also uses the attention mechanism to help the decoder focus on important words in the local context.
UniLM [22]: a pre-trained model that can handle both natural language understanding and generation tasks. The pre-training of UniLM is based on three objectives: unidirectional, bidirectional, and sequence-to-sequence prediction. The model uses a Transformer network with shared parameters and specific self-attention masks to control the context information used in prediction. When fine-tuning it, we use the sequence-to-sequence configuration to adapt it to our task.
We obtained the open-source implementations of the models from the repositories provided by the papers [14]2 and [22]3, and we use word embeddings trained on Chinese Wikipedia. We train and evaluate the baselines on our datasets with the same hyperparameters as in the original papers. The hyperparameters of our model and the baselines are shown in Table 4.
4.3 Results and Analysis
Following previous work, we use the BLEU [23] metric to automatically evaluate the explanations produced by the models. We use the whole training set (65605 samples in total) to train our model and the baselines, and report the BLEU scores on the test sets of the different styles in Table 5. From Table 5, we can see that our model consistently performs better than all baselines on the four test sets. This suggests that the model based on the VAE framework is effective because it can learn rich representations from massive text data. We also find that all models achieve their lowest BLEU score on the Gaokao-style test set. The main reason is that, compared with the other datasets, generating explanations for the Gaokao-style test set is more difficult: many more words in this set carry new meanings and need more context, sometimes the whole text, to be understood. Moreover, the author’s emotions, intentions and inner activities sometimes need to be known to decide the meaning of a word. This is very different from dictionary definitions, where the meaning of a word can be decided from a short sentence-level context. We can also see that none of the models performs very well on the Idiom-style dataset, because Chinese four-character idioms express abstract meanings and their explanations are longer, which makes generating the explanation more difficult.
We also evaluate the performance manually. Following Ishiwatari et al. [14], we ask raters to score each explanation on 4 levels (0–3): “0” means completely wrong, “1” means the correct topic with wrong or incomplete information, “2” means small details are missing, and “3” means correct. Two raters score each explanation. If their scores differ by no more than 1, the final score is their average. Otherwise, a third rater re-scores the explanation and the final score is the average of the two closer scores. Table 6 shows the manual evaluation results. We find that our model is better than all the baselines. From Table 5 and Table 6, we also find that LOG-CaD performs well although it is not a pre-trained model; the reason is that it uses both a local and a global context encoder, which helps in understanding the words’ meanings.
2 https://github.com/shonosuke/ishiwatari-naacl2019.
3 https://github.com/microsoft/unilm.
Table 5. BLEU scores on four datasets
| Model | CCD-style test set | Idiom-style test set | Essay-style test set | Gaokao-style test set |
| Dual-Encoder [13] | 22.78 | 19.54 | 20.47 | 13.78 |
| I-Attention [21] | 21.35 | 19.73 | 21.06 | 12.67 |
| LOG-CaD [14] | 22.15 | 20.16 | 21.38 | 15.24 |
| UniLM [22] | 24.81 | 20.84 | 22.61 | 16.28 |
| Ours | 25.74 | 21.35 | 24.74 | 18.1 |
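For readers unfamiliar with the metric behind Table 5, the following is a minimal sketch of sentence-level BLEU with a single reference: clipped n-gram precisions up to 4-grams combined by a geometric mean and a brevity penalty. It is an illustration only; the tokenization, smoothing and corpus-level aggregation used to obtain the reported scores are not specified in the text, and this unsmoothed version returns 0 whenever any n-gram order has no match.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference, in [0, 1]."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on a mat".split()
candidate = "the cat sat on the mat".split()
score = bleu(candidate, reference)
```

Scores such as those in Table 5 are conventionally this quantity multiplied by 100.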
Table 6. Human annotated scores on the Gaokao-style test set
| Model | Annotated score |
| Dual-Encoder [13] | 1.52 |
| I-Attention [21] | 1.37 |
| LOG-CaD [14] | 1.83 |
| UniLM [22] | 1.87 |
| Ours | 2.01 |
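The score aggregation rule used for the manual evaluation can be written down as a short function. This is a sketch of the rule as described in the text; the function name is hypothetical, and the tie-breaking behavior (when the third score is equally close to both earlier scores, the first such pair in the list is used) is an assumption, since the text does not cover that case.

```python
def final_score(first, second, third=None):
    """Combine rater scores on the 0-3 scale into one final score."""
    # Two raters agree closely enough: average their scores.
    if abs(first - second) <= 1:
        return (first + second) / 2
    if third is None:
        raise ValueError("a third rating is needed when the first two differ by more than 1")
    # Otherwise average the two closest scores among the three ratings.
    pairs = [(first, second), (first, third), (second, third)]
    closest = min(pairs, key=lambda pair: abs(pair[0] - pair[1]))
    return sum(closest) / 2
```

For instance, ratings of 2 and 3 give 2.5 directly, while ratings of 0 and 3 require a third rating; if that rating is 2, the two closest scores are 3 and 2 and the final score is again 2.5.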
Case Study. Table 7 shows two explanation examples for words in the Gaokao-style test set. In Example 1, the target word (‘a fellow townsman’ in English), an expression in Shaanxi Province dialect, is an OOV word and its meaning must be generated from the context. None of the baselines captures the meaning of the word, but our model generates the correct explanation for the target word. In Example 2, the target word (‘Grass grows’), a network slang term and an OOV word, describes increasing possessiveness about something. Although the explanation produced by our model is not correct, it describes a way of behaving that is close to the reference answer. From the examples, we find that, compared with the baselines, the quality of the explanations generated by our model is better. The main reason is that our model uses the VAE framework to learn richer and better representations from massive text data and to capture the essence of the original sentence, which is very helpful for explaining the words.
Table 7. Explanation examples generated by our model and baselines
5 Conclusions
This paper proposes a method to answer word-meaning-explanation questions, which combines the VAE framework with BERT and the Transformer to produce a high-quality explanation for a target word within a given context. Multi-style datasets are constructed to train the model to generate diverse explanations, and a Gaokao-style test set is built to evaluate the model. The experimental results show that the proposed model performs better than the baseline models. In the future, we will consider using more context in the model to capture more information for deciding a word’s meaning. We will also try to model the author’s emotions and intentions to improve the explanation of words with new meanings.
Acknowledgments. We thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key Research and Development Program of China (No. 2018YFB1005103) and the National Natural Science Foundation of China (No. 61673248, No. 61772324).
References
1. Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: RACE: large-scale reading comprehension dataset from examinations. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794 (2017)
2. Clark, P., et al.: Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv: Artificial Intelligence. Springer (2018)
3. Fujita, A., Kameda, A., Kawazoe, A., Miyao, Y.: Overview of Todai Robot Project and evaluation framework of its NLP-based problem solving. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, pp. 2590–2597 (2014)
4. Cheng, G., Zhu, W., Wang, Z., Chen, J., Qu, Y.: Taking up the Gaokao challenge: an information retrieval approach. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2479–2485 (2016)
5. Rodrigo, A., Peñas, A., Miyao, Y., et al.: Overview of CLEF QA entrance exams task 2015. In: Working Notes of CLEF 2015 (2015)
6. Sun, K., Yu, D., Chen, J., et al.: DREAM: a challenge dataset and models for dialogue-based reading comprehension. Trans. Assoc. Comput. Linguist. 7, 217–231 (2019)
7. Yu, W., Jiang, Z., Dong, Y., et al.: ReClor: a reading comprehension dataset requiring logical reasoning. In: Proceedings of 8th International Conference on Learning Representations (2020)
8. Tan, H., Zhao, H.: A pipeline approach to free-description question answering in Chinese Gaokao reading comprehension. Chin. J. Electron. 28(1), 113–119 (2019)
9. McKeown, K.R.: Paraphrasing questions using given and new information. Comput. Linguist. 9(1), 1–10 (1983)
10. Madnani, N., Dorr, B.J.: Generating phrasal and sentential paraphrases: a survey of data-driven methods. Comput. Linguist. 36(3), 341–387 (2010)
11. Wubben, S., Van Den Bosch, A., Krahmer, E.: Paraphrase generation as monolingual translation: data and evaluation. In: Proceedings of the 6th International Natural Language Generation Conference, pp. 203–207 (2010)
12. Noraset, T., Liang, C., Birnbaum, L., Downey, D.: Definition modeling: learning to define word embeddings in natural language. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 3259–3266 (2017)
13. Ni, K., Wang, W.Y.: Learning to explain non-standard English words and phrases. In: Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 413–417 (2018)
14. Ishiwatari, S., et al.: Learning to describe unknown phrases with local and global contexts. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 3467–3476 (2019)
15. Gupta, A., Agarwal, A., Singh, P., Rai, P.: A deep generative framework for paraphrase generation (2017)
16. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Proceedings of the Second International Conference on Learning Representations (2014)
17. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long and Short Papers), vol. 1, pp. 4171–4186 (2019)
18. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Proceedings of the Neural Information Processing Systems, pp. 5998–6008 (2017)
19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR, abs/1412.6980 (2014)
20. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
21. Gadetsky, A., Yakubovskiy, I., Vetrov, D.: Conditional generators of words definitions. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Short Papers, pp. 266–271 (2018)
22. Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197 (2019)
23. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Enhancing Multi-turn Dialogue Modeling with Intent Information for E-Commerce Customer Service Ruixue Liu1 , Meng Chen1(B) , Hang Liu1 , Lei Shen2 , Yang Song1 , and Xiaodong He1 1
JD AI, Beijing, China {liuruixue,chenmeng20,liuhang55,songyang23,xiaodong.he}@jd.com 2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected]
Abstract. Building intelligent conversational bots for customer service is currently a hot topic in many industries. A key requirement for these dialogue systems is to understand the diverse and changing intents of customers accurately. However, few studies have focused on intent information, due to the lack of large-scale dialogue corpora with intent labels. In this paper, we propose to leverage intent information to enhance multi-turn dialogue modeling. First, we construct a large-scale Chinese multi-turn E-commerce conversation corpus with intent labels, namely E-IntentConv, which covers 289 fine-grained intents in the after-sales domain. Specifically, we use the attention mechanism to extract Intent Description Words (IDW) for representing each intent explicitly. Then, based on E-IntentConv, we propose to integrate intent information into both a retrieval-based model and a generation-based model to verify its effectiveness for multi-turn dialogue modeling. Experimental results show that the extra intent information is useful for improving both the response selection and the response generation tasks.
Keywords: Multi-turn dialogue modeling · Large-scale dialogue corpus · Intent information
1 Introduction
With the rapid development of artificial intelligence, many conversational bots have been built for customer service, especially in E-commerce. Building a human-like dialogue agent has many benefits for the E-commerce customer service industry. It can not only improve the working efficiency of professional customer service staff, but also save substantial labor costs for the E-commerce company.
R. Liu and M. Chen: Equal contribution.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 65–77, 2020. https://doi.org/10.1007/978-3-030-60450-9_6
Existing approaches to building end-to-end conversational systems mainly concentrate on retrieval-based models [19,25,26], generative-based models [5,6,13] and hybrid models [14,21]. Impressive progress has been made on modeling the context [19,24], leveraging external knowledge [6], and promoting the language diversity of responses [5,7]. However, previous work did not pay enough attention to the user intent in conversations. There are two major issues: 1) existing dialogue datasets are deficient for intent-augmented dialogue modeling, as there is basically no large-scale multi-turn dialogue corpus with intent labels; 2) most existing neural conversation models do not explicitly model user intent in conversations. More research is needed to understand user intent and to develop intent-augmented response selection and generation models, which is exactly the target of this paper.
To tackle the above two obstacles, we first construct a large-scale multi-turn dialogue corpus with intent labels, namely E-IntentConv, which consists of more than 1 million conversations about after-sales topics between users and customer service staff in an E-commerce scenario. Nearly three hundred fine-grained intents are summarized from the real business of E-commerce customer service and provided for understanding the user intents accurately. To represent the user intents explicitly, we also extract tens of words (Intent Description Words, denoted as IDW) to depict each intent with an attention mechanism. Then, we propose a novel intent-aware response ranking method and an intent-augmented neural response generator to leverage the extra intent information for dialogue modeling. For the response ranking model, we design an ensemble modeling paradigm with three intent-aware features. For the neural dialogue generator, an extra intent classifier is integrated into the decoder to enhance the intent expression.
Experimental results validate that both the response selection and the response generation tasks can be improved with our proposed models. To the best of our knowledge, this is the first work to build dialogue systems with the intents of large-scale multi-turn conversations. To sum up, our contributions are two-fold: 1) we collect a very large-scale multi-turn E-commerce dialogue corpus with intent labels, and we release it to the NLP community for free1; 2) we propose two intent-aware dialogue models and design experiments to show that both response selection and response generation can be improved by incorporating intent information.
2 Related Work
There are two lines of research that are closely related to and motivate our work: multi-turn dialogue corpora and end-to-end dialogue systems.
2.1 Multi-turn Dialogue Corpus
Research on chatbots and dialogue systems has been active for decades. The growth of this field has been consistently supported by the development of new datasets and novel approaches. Especially for the popular deep learning based approaches, a large-scale training corpus from a real scenario becomes decisive. More recently, some researchers have constructed dialogue datasets from social media networks (e.g., the Twitter Dialogue Corpus [11] and the Chinese Weibo dataset [18]) or online forums (e.g., the Chinese Douban dataset [19] and the Ubuntu Dialogue Corpus [9]). Despite the massive number of utterances included in these datasets, they differ from real-scenario conversations: posts and replies are informal, single-turn or only related over a short span. Among dialogue datasets from real scenarios, the ECD [25] corpus is collected from a real E-commerce scenario and keeps the nature of the conversation flow and the bi-turn information of real conversations. However, the ECD corpus provides little annotated information for each query, such as the intent of the customer. Compared to ECD [25], our E-IntentConv corpus provides extra intent information, together with description words extracted in an interpretable way to help understand those intents. These intents contain beneficial information for dialogue systems to understand queries under complicated after-sales circumstances.
1 We have the license to redistribute this corpus and third-party users can download it from our official website for research purposes: http://jddc.jd.com.
2.2 End-to-end Dialogue System
With the development of dialogue datasets, a number of data-driven dialogue systems have been designed, which can be divided into retrieval-based models [19,25,26], generative-based models [5–7,13,24] and hybrid models [14,21]. For retrieval-based models [19,25,26], various text matching techniques have been proposed to capture the semantic relevance between context and response, but they ignore the constraint between query and response in the dimension of intent. For generative-based models, modeling the dialogue context [24], leveraging external knowledge [6] and promoting the language diversity of responses [5,7] have become hot research topics, but all of them neglect the importance of understanding the user’s intents. Different from previous work, we propose two intent-enhanced dialogue models and verify their effectiveness on the E-IntentConv corpus.
3 E-IntentConv Construction
The construction of E-IntentConv includes data collection, intent annotation, and intent description word extraction, which describe how we collect the large-scale multi-turn dialogue corpus, how we annotate the intent of each query, and how we mine the intent description words.
3.1 Dataset Collection and Statistics
We collect the conversations between customers and customer service staff from a leading E-commerce website2 in China. After crawling, we de-duplicate the raw data, and desensitize and anonymize the private information based on very detailed rules. For example, we replace all numbers, order ids, names and addresses with corresponding special tokens. Then, we adopt the Jieba3 toolkit to perform Chinese word segmentation. Table 1 summarizes the main information about the dataset. The amount of data is large enough to support building the current mainstream data-driven dialogue models. Meanwhile, the average number of intents per session is 4, which indicates that the customers’ intents change constantly as the conversations go on. Thus, understanding these changing intents accurately is critical for solving the customers’ problems.
Table 1. Overview of E-IntentConv.
| Total number of sessions | 1,134,487 |
| Total number of utterances | 22,495,387 |
| Total number of words | 253,523,129 |
| Average number of intents per session | 4 |
| Average number of turns per session | 20 |
| Average number of words per utterance | 11 |
Table 2. An example from E-IntentConv corpus. Best viewed in color
2 http://www.jd.com.
3 https://github.com/fxsjy/jieba.
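The de-duplication and desensitization steps described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the authors' rules: the placeholder tokens `[ORDER_ID]` and `[NUM]` and the "ten or more digits" pattern for order ids are hypothetical, names and addresses are not handled because they would need more than regular expressions, and word segmentation with Jieba is omitted.

```python
import re

# Hypothetical patterns: a long digit run is treated as an order id,
# any remaining integer or decimal as a generic number.
ORDER_ID = re.compile(r"\d{10,}")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def desensitize(utterance):
    """Replace order ids first, then any remaining numbers, with placeholder tokens."""
    utterance = ORDER_ID.sub("[ORDER_ID]", utterance)
    return NUMBER.sub("[NUM]", utterance)


def deduplicate(sessions):
    """Drop sessions whose utterance sequence was already seen, keeping first occurrences."""
    seen = set()
    unique = []
    for session in sessions:
        key = tuple(session)
        if key not in seen:
            seen.add(key)
            unique.append(session)
    return unique


raw_sessions = [
    ["order 12345678901 costs 99.5 yuan", "ok"],
    ["order 12345678901 costs 99.5 yuan", "ok"],
    ["where is my parcel"],
]
clean_sessions = [[desensitize(u) for u in s] for s in deduplicate(raw_sessions)]
```

Order ids are substituted before generic numbers so that a long id is not broken up by the number pattern.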
E-IntentConv illustrates the complexity of conversations in E-commerce. It covers different kinds of dialogues, including: a) task completion, e.g., changing the order address; b) knowledge-based question answering, e.g., asking about the warranty and return policy or about how to use the product; and c) feeling connection with the user, e.g., actively responding to the user's complaints and soothing his/her emotions. It therefore differs substantially from previous dialogue datasets. Table 2 shows a typical example from the corpus: q1, q2 in blue refer to task completion, q3 in red is knowledge-based question answering, while q4, q5 in purple require feeling connection.
3.2 Intent Annotation
As the number of queries in the dataset is huge, it is infeasible to annotate the intents of all queries manually. Here we use a high-quality intent classifier to label the intent of each query automatically. The classifier covers 289 classes in total, which are summarized from the real business of E-commerce customer service. The classes are fine-grained and helpful for understanding the user's intents in after-sales circumstances. To train the intent classifier, we sample 600,000 instances from the corpus and annotate them manually under the user intent taxonomy. Each instance consists of at most three consecutive utterances from users (e.g., q1, q2, q3 in Table 2). The former two utterances are context and the last one is the query. Three professional customer service staff members are invited to annotate the training data. The inter-annotator agreement score4 is 0.723 and the final label is decided by a voting strategy. In the end, 578,127 training samples in total are annotated manually under the user intent taxonomy. Considering the challenges of short text classification and of language understanding in dialogue, we train the intent classifier with a Hierarchical Attention Network (HAN) [22], so that each utterance in a training sample is weighted differently. The classification accuracy on the test set reaches 93% and the Macro-F1 score is 84.22%, which indicates that the predicted intents for the user queries are reliable. In this way, we label the intents of all user queries automatically. Figure 1 shows the distribution of the top 15 intents in E-IntentConv.
Fig. 1. Distribution of top 15 intents in E-IntentConv.
4 The Fleiss' Kappa score is calculated, and above 0.2 is acceptable.
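The agreement and voting steps described above can be illustrated with a minimal sketch. It assumes every item is labelled by the same number of annotators; the function names and the toy labels are hypothetical and are not taken from the authors' annotation pipeline.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label for one item.

    Ties are broken by first occurrence, which is an arbitrary choice
    made for this sketch only.
    """
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`, a list of per-item label lists.

    Assumes the same number of raters for every item and that the
    raters use at least two different categories overall.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)
```

With three annotators per item, `majority_vote` returns the label chosen by at least two of them whenever such a label exists.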
3.3 Intent Description Words
Intuitively, an intent is a high-level abstraction, so making the system understand the intent becomes important. Here, we try to depict the abstract intent with tens of explicit words, which can be seen as descriptions or explanations of each intent. We call these words Intent Description Words (IDWs). Ideally, IDWs should be the words that act as features from the perspective of the classifier, so that they represent the exact meaning of the corresponding intent. Specifically, we utilize the attention mechanism of the HAN model [22] to extract those feature words, which is interpretable, as Fig. 2 shows. The words with the top K highest attention weights in each training instance are picked out as IDW candidates. After processing all training instances, we collect a set of IDW candidates for each intent. We rank those words by frequency and filter out the stop words; the top N words are then chosen as the final IDWs for each intent. IDWs are also provided along with our dataset5 .
Fig. 2. Visualization of attention weights in HAN. Words with higher attention weights are assigned with deeper colors. English translation is provided for understanding.
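The IDW selection procedure of Sect. 3.3 can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the input format (an intent label plus word/attention-weight pairs per training instance) and the function name are assumptions, while the defaults K = 4 and N = 50 follow the values reported in the paper's footnote.

```python
from collections import Counter, defaultdict


def extract_idws(instances, stop_words, k=4, n=50):
    """Select intent description words (IDWs) for each intent.

    `instances` is an iterable of (intent, [(word, attention_weight), ...]).
    For each instance the k words with the highest attention weight become
    candidates; candidates are then ranked by frequency per intent, stop
    words are filtered out, and the top n words are kept.
    """
    candidate_counts = defaultdict(Counter)
    for intent, weighted_words in instances:
        top_k = sorted(weighted_words, key=lambda pair: pair[1], reverse=True)[:k]
        candidate_counts[intent].update(word for word, _ in top_k)

    idws = {}
    for intent, counter in candidate_counts.items():
        ranked = [word for word, _ in counter.most_common() if word not in stop_words]
        idws[intent] = ranked[:n]
    return idws
```

In practice the attention weights would come from the word-level attention layer of the trained classifier; here they are simply passed in as numbers.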
4 Methods
Based on the intent-labelled corpus above, in this section we validate the effectiveness of intent information for dialogue modeling. For the retrieval-based model, we propose a novel ensemble modeling paradigm and design three intent-aware features to facilitate the response ranking task. For the generation-based model, we integrate a special intent classifier into the encoder-decoder framework to promote the expression of the user intent.
4.1 Retrieval-Based Model
Existing retrieval-based models mainly focus on calculating the semantic similarity between context and response, and treat response selection as a ranking problem. Both traditional learning-to-rank models [14,20] and neural matching models (e.g., DAM [26], ESIM [2], and BERT [4]) have been proposed. Here, we propose an ensemble modeling approach with GBDT [23] to combine the advantages of existing neural matching models. Taking BM25 [3], DAM, ESIM and BERT as input features, we build a regression model (denoted as Ensemble) to predict the final similarity. Meanwhile, we also design several new features to capture the intent consistency between the query and a response candidate, as follows:
IntentFeat 1: we represent the intent by averaging the word embeddings of all IDWs, and represent the response candidate by averaging all word embeddings in the response utterance; the cosine similarity of the two representations is then calculated as the first intent-aware feature.
IntentFeat 2: we calculate the ratio of words in the response candidate that are 'covered' by the IDWs. A word is considered 'covered' by the intent as long as the similarity between it and any word in the IDWs is greater than the threshold t6 . Here, the similarity score is also calculated based on the word embeddings.
IntentFeat 3: for each word in the response candidate, the largest similarity score between it and all the IDWs is chosen as its final similarity score. We then average the similarity scores of all words to represent the similarity between the response candidate and the intent.
We denote the model with these extra intent features as Ensemble-IDW.
5 Empirically, we set K to 4 and N to 50 in this work.
4.2 Generation-Based Model
Popular generation-based models are based on the standard seq2seq model [16] with attention mechanism [1]. To better utilize the IDWs and make the generated response Y more informative and consistent with the context C, we propose a model named S2S-IDW. First, we represent each intent z as the average word embedding of the corresponding IDWs. Then z is concatenated to each decoder input and used to update the decoder hidden state. Inspired by [15], we use an intent classifier to enhance the intent expression, and the classification loss is defined as:

$$L_{CLS} = -p(z) \log q(z \mid Y) \tag{1}$$

$$q(z \mid Y) = f\Big(W \cdot \frac{1}{T} \sum_{t=1}^{T} Ewe(t; C, z)\Big) \tag{2}$$

where $f(\cdot)$ is the softmax activation function, $p(z)$ is a one-hot vector that represents the desired intent distribution for an instance, and $Ewe(t; C, z)$ is the expected word embedding at time step $t$, which is calculated as:

$$Ewe(t; C, z) = \sum_{y_t \in V} p(y_t) \cdot Emb(y_t) \tag{3}$$

that is, for each decoding step $t$, we enumerate all possible words in the vocabulary $V$. Finally, the loss function can be written as:

$$L = L_{CE} + \lambda L_{CLS} \tag{4}$$

where $\lambda$ is a hyper-parameter that controls the trade-off between the cross entropy loss $L_{CE}$ and the classification loss $L_{CLS}$.

6 Empirically, we set t to 0.6 in our experiments.
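To make the classification loss of Sect. 4.2 concrete, the following minimal sketch evaluates Eqs. (1) to (4) on plain Python lists. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the list-based tensors and the toy weight matrix W are hypothetical, and a real model would use an automatic differentiation framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_word_embedding(word_probs, embeddings):
    """Eq. (3): sum over the vocabulary of p(y_t) * Emb(y_t)."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(word_probs, embeddings)) for d in range(dim)]


def intent_classification_loss(step_probs, embeddings, weight, target_intent):
    """Eqs. (1)-(2): classify the averaged expected embeddings, then take
    the negative log-probability of the desired (one-hot) intent."""
    steps = len(step_probs)
    dim = len(embeddings[0])
    expected = [expected_word_embedding(p, embeddings) for p in step_probs]
    averaged = [sum(e[d] for e in expected) / steps for d in range(dim)]
    logits = [sum(w * a for w, a in zip(row, averaged)) for row in weight]
    q = softmax(logits)
    return -math.log(q[target_intent])


def total_loss(cross_entropy_loss, classification_loss, lam):
    """Eq. (4): L = L_CE + lambda * L_CLS."""
    return cross_entropy_loss + lam * classification_loss
```

Using the expected embedding instead of a sampled word keeps the classifier input a smooth function of the decoder's output distribution, which is presumably why the loss is formulated this way.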
5 Experiments
In this section, we perform extensive experiments on the E-IntentConv dataset. We first introduce the dataset preparation, then present the experimental setup and results for the response selection and generation tasks.
5.1 Dataset Preparation
We first divide the roughly 1 million conversation sessions into training, validation and testing sets with a ratio of 8:1:1. Then we construct I-R pairs from each set in the format {I, R} = {q1, r1, q2, r2, ..., qi, ri, Q, R}, where I = {C, Q} stands for the input, C is the dialogue context, Q is the last query, and R represents the response. i is set to 5, so the most recent five rounds of dialogue are kept as context. Finally, there are 2,852,620 I-R pairs for training, and 176,600 and 177,412 I-R pairs for validation and testing respectively.
5.2 Response Selection
Following [19] and [25], we randomly sample negative responses for the above training, validation, and testing sets. The ratio of positive to negative samples is 1:1 for the training set, and 1:9 for the validation/testing sets. In total, 1 million I-R pairs (Train Set I) are sampled for training the neural matching models, and 200k I-R pairs (Train Set II) are sampled for training the ensemble models. It is worth noting that there is no overlap between Train Set I and II. Following [9], recall at position k in n candidates (denoted R10@1, R10@2, R10@5) is taken as the evaluation metric. Our baselines are as follows: 1) BM25: The standard retrieval technique BM25 [3] is used to rank candidates. 2) DAM: The Deep Attention Matching Network proposed by [26], which matches a response with its multi-turn context using dependency information learned by Transformers. 3) ESIM: The Enhanced Sequential Inference Model proposed by [2], which matches local and composition information by designing a sequential LSTM inference model. 4) BERT: We fine-tune BERT [4] with I-R pairs and use the predicted similarity score to rank all response candidates. From Table 3, it can be seen that the proposed Ensemble model outperforms existing neural semantic matching models significantly, which suggests that the different features are complementary to each other. By adding the three intent-aware features, the performance of Ensemble-IDW is further improved, which indicates their helpfulness in the response selection task.
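The R10@k metric used above can be sketched in a few lines. The sketch assumes that each test context comes with n scored candidates of which exactly one is the positive response (consistent with the 1:9 sampling ratio); the function names and data layout are hypothetical.

```python
def recall_at_k(scores, labels, k):
    """R_n@k for one context: 1.0 if the positive response (label 1)
    is ranked among the top k of the n candidates, else 0.0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return 1.0 if any(label == 1 for _, label in ranked[:k]) else 0.0


def mean_recall_at_k(groups, k):
    """Average R_n@k over a list of (scores, labels) candidate groups."""
    return sum(recall_at_k(scores, labels, k) for scores, labels in groups) / len(groups)
```

For R10@1, R10@2 and R10@5, each group would hold ten candidates and k would be 1, 2 and 5 respectively.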
Table 3. The comparison of retrieval-based models. ‡ means statistically significant difference over baseline with p < 0.05.
Model          R10@1     R10@2     R10@5
BM25           0.3618    0.4982    0.7570
DAM            0.7634    0.9067    0.9883
BERT           0.7968    0.9249    0.9926
ESIM           0.8114    0.9330    0.9941
Ensemble       0.8628    0.9598    0.9974
Ensemble-IDW   0.8657‡   0.9610‡   0.9975

Table 4. Case study for retrieval-based models. The user intent is return policy.
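The three intent-aware features that separate Ensemble-IDW from Ensemble in Table 3 (defined in Sect. 4.1) can be sketched as below. This is a hedged illustration: the paper states that similarities are computed from word embeddings, and cosine similarity is assumed here for all three features; the function names and the list-based vectors are hypothetical. The default threshold 0.6 follows the value reported in the paper's footnote.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def intent_features(idw_vectors, response_vectors, threshold=0.6):
    """Return (IntentFeat 1, IntentFeat 2, IntentFeat 3) for one candidate."""
    # IntentFeat 1: cosine between averaged IDW and averaged response embeddings
    feat1 = cosine(mean_vector(idw_vectors), mean_vector(response_vectors))
    # Best similarity of every response word against any IDW
    best = [max(cosine(word, idw) for idw in idw_vectors) for word in response_vectors]
    # IntentFeat 2: ratio of response words 'covered' by the IDWs
    feat2 = sum(1 for score in best if score > threshold) / len(best)
    # IntentFeat 3: average of the per-word best similarities
    feat3 = sum(best) / len(best)
    return feat1, feat2, feat3
```

The three values would then be appended to the BM25, DAM, ESIM and BERT scores as inputs to the GBDT regressor.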
5.3 Response Generation
In total, 500k I-R pairs (Train Set III) are sampled to train the various generation models. We choose the BLEU [10] score, the Rouge [8] score, and Distinct-1/2 [7] (denoted as Dist-1/2) to evaluate the quality and the diversity of the generated responses. The baselines are: 1) S2S-Attn: The classical seq2seq model with attention mechanism [13], which uses a Bi-LSTM as encoder and an LSTM as decoder. 2) TF: The Transformer model with multi-head self-attention mechanism [17]. For these two models, the utterances in the context are concatenated into one sentence. 3) HRED: The hierarchical recurrent encoder-decoder model proposed by [12], which considers the hierarchical structure of a dialogue session.
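As an illustration of the diversity metric, the sketch below computes Distinct-n as the number of distinct n-grams divided by the total number of n-grams. It assumes a corpus-level computation over all generated responses, which is one common convention; whether the paper aggregates per corpus or per response is not stated, and the function name is hypothetical.

```python
def distinct_n(responses, n):
    """Corpus-level Distinct-n over tokenized responses (lists of tokens):
    distinct n-grams divided by the total number of n-grams."""
    unique_ngrams = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique_ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique_ngrams) / total if total else 0.0
```

Low values such as those reported for Dist-1 indicate that the same few words recur across many generated responses.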
Table 5. The comparison of generation-based models.
Model      BLEU    Rouge-L   Dist-1   Dist-2
HRED       7.43    16.18     0.37%    3.60%
TF         8.08    18.03     0.20%    1.21%
S2S-Attn   11.02   22.33     0.12%    0.85%
S2S-IDW    11.67   23.03     0.24%    1.83%
Table 5 shows the results of the generation-based models. HRED has the best performance on diversity; however, it performs the worst on the BLEU and Rouge-L metrics. We argue that the hierarchical modeling of context keeps more information but also brings in more noise. The S2S models have the highest similarity scores (BLEU and Rouge-L). With the support of IDWs, S2S-IDW improves all metrics compared with S2S-Attn (Sign Test, with p-value < 0.05).
5.4 Case Study
To further illustrate the effectiveness of intent information, we compare the responses of the retrieval-based and generation-based models in Table 4 and Table 6. We can see that responses enhanced by intents are more informative, diverse and appropriate. Table 4 discusses the phone return policy and postage charges. The Ensemble model selects the wrong answer due to the misleading context information about returning goods within 7 days without reason. Meanwhile, the IDWs postage, returns and postage charges help the Ensemble-IDW model to focus accurately on answers related to postage charges instead. In Table 6, the user is enquiring about the refund period for his product. As we can see, both models can generate a correct answer; however, with the IDWs refund, Wechat and JD IOU, the S2S-IDW model generates a much more informative and diverse response that refers to various payment methods.
Table 6. Case study for generation-based models. The user intent is refund period.
6 Conclusion and Future Work
In this paper, we focus on enhancing dialogue modeling with intent information for E-commerce customer service. To facilitate the research, we first construct the E-IntentConv dataset, which not only includes large-scale, multi-turn dialogues from a real scenario but also contains rich and accurate intent information. We also propose two novel dialogue models and verify the effectiveness of intents in both the response selection and generation tasks. This work is a first step towards intent-augmented multi-turn dialogue modeling. It has several limitations and much room for further improvement. For example, the dialogue dataset is collected from only one company and the intents are rather domain-specific. In addition, the definitions of the intents could be made clearer, and intents could be discovered automatically by popular topic modeling approaches. In the future, we will improve the above aspects, enrich the dataset with more annotations, and explore more effective approaches to utilizing this information.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Chen, Q., Zhu, X., Ling, Z.H., Wei, S., Jiang, H., Inkpen, D.: Enhanced LSTM for natural language inference. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668 (2017)
3. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. Inf. Retr. 13(2), 192–195 (2010)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
5. Gao, X., et al.: Jointly optimizing diversity and relevance in neural response generation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1229–1238 (2019)
6. Ghazvininejad, M., et al.: A knowledge-grounded neural conversation model. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
7. Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119 (2016)
8. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
9. Lowe, R., Pow, N., Serban, I., Pineau, J.: The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294 (2015)
10. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
11. Ritter, A., Cherry, C., Dolan, W.B.: Unsupervised modeling of twitter conversations. In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, Los Angeles, CA, USA, 2–4 June 2010, pp. 172–180 (2010)
12. Serban, I.V., Sordoni, A., Bengio, Y., Courville, A., Pineau, J.: Building end-to-end dialogue systems using generative hierarchical neural network models. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3776–3783 (2016)
13. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1577–1586 (2015)
14. Song, Y., Li, C.T., Nie, J.Y., Zhang, M., Zhao, D., Yan, R.: An ensemble of retrieval-based and generation-based human-computer conversation systems. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4382–4388 (2018)
15. Song, Z., Zheng, X., Liu, L., Xu, M., Huang, X.J.: Generating responses with a specific emotion in dialog. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3685–3695 (2019)
16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pp. 3104–3112 (2014)
17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
18. Wang, H., Lu, Z., Li, H., Chen, E.: A dataset for research on short-text conversations. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 935–945 (2013)
19. Wu, Y., Wu, W., Xing, C., Zhou, M., Li, Z.: Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 496–505 (2017)
20. Yan, Z., et al.: Docchat: an information retrieval approach for chatbot engines using unstructured documents. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 516–525 (2016)
21. Yang, L., et al.: A hybrid retrieval-generation neural conversation model. arXiv: Information Retrieval (2019)
22. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)
23. Ye, J., Chow, J.H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 2061–2064 (2009)
24. Zhang, H., Lan, Y., Pang, L., Guo, J., Cheng, X.: Recosa: detecting the relevant contexts with self-attention for multi-turn dialogue generation. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 3721–3730 (2019)
25. Zhang, Z., Li, J., Zhu, P., Zhao, H., Liu, G.: Modeling multi-turn conversation with deep utterance aggregation. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752 (2018)
26. Zhou, X., et al.: Multi-turn response selection for chatbots with deep attention matching network. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 1118–1127 (2018)
Robust Spoken Language Understanding with RL-Based Value Error Recovery
Chen Liu, Su Zhu, Lu Chen(B), and Kai Yu(B)
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{chris-chen,paul2204,chenlusz,kai.yu}@sjtu.edu.cn
Abstract. Spoken Language Understanding (SLU) aims to extract structured semantic representations (e.g., slot-value pairs) from speech recognized texts, and it suffers from the errors of Automatic Speech Recognition (ASR). To alleviate the problems caused by ASR errors, previous works either apply input adaptations to the speech recognized texts, or correct ASR errors in predicted values by searching for the most similar candidates in pronunciation. However, these two methods are applied separately and independently. In this work, we propose a new robust SLU framework that guides the SLU input adaptation with a rule-based value error recovery module. The framework consists of a slot tagging model and a rule-based value error recovery module. We pursue an adapted slot tagging model that can extract potential slot-value pairs mentioned in ASR hypotheses and is suitable for the existing value error recovery module. After the value error recovery, we obtain a supervision signal (reward) by comparing the refined slot-value pairs with the annotations. Since the operations of the value error recovery are non-differentiable, we exploit policy gradient based Reinforcement Learning (RL) to optimize the SLU model. Extensive experiments on the public CATSLU dataset show the effectiveness of our proposed approach, which improves the robustness of SLU and outperforms the baselines by significant margins.
Keywords: Spoken Language Understanding · Robustness · RL
1 Introduction
The Spoken Language Understanding (SLU) module is a key component of a Spoken Dialogue System (SDS), parsing the user's utterances into structured semantic forms. For example, "I want to go to Suzhou not Shanghai" can be parsed into "{inform(dest=Suzhou), deny(dest=Shanghai)}". It is usually formulated as a sequence labelling problem that extracts values (e.g., Suzhou and Shanghai) for certain semantic slots (attributes, e.g., inform-dest and deny-dest).
C. Liu and S. Zhu contributed equally to this work.
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 78–90, 2020. https://doi.org/10.1007/978-3-030-60450-9_7
Robust SLU with RL-Based Value Error Recovery
Fig. 1. An overview of the robust SLU framework, which is composed of two main components: a slot tagging model and a rule-based value error recovery module. During evaluation, only ASR hypotheses are fed into the two modules to generate the final semantic form.
It is crucial for SLU to be robust to speech recognition errors, since ASR errors are propagated to the downstream SLU model. Ignoring ASR errors has promoted the rapid development of natural language processing (NLP) algorithms for SLU [2,5,7,16], where SLU models are trained and evaluated on manual transcriptions or even natural language texts. Once ASR hypotheses are used as input for evaluation, however, SLU performance decreases sharply [20]. ASR errors give rise to two issues: 1) the inputs for training and evaluation are mismatched; 2) sequence labelling models extract values directly from ASR hypotheses, which may contain wrong words. Previous works try to overcome these problems in two ways: 1) Adaptive training approaches are introduced to transfer the SLU model trained on manual transcriptions to ASR hypotheses [9,18]. 2) Other works adopt rule-based post-processing techniques to refine the predicted values with the most similar candidates in pronunciation [4,11,13]. However, this value error recovery module is usually fixed and independent of the SLU model. To overcome the above problems, we propose a new robust SLU framework in which the training of the SLU model is guided by a rule-based value error recovery module. As illustrated in Fig. 1, it consists of a slot tagging model and a value error recovery module. The slot tagging model, which treats SLU as a sequence labelling problem, is pre-trained on manual transcriptions. To alleviate the input mismatch issue, it is adaptively trained on ASR hypotheses. The value error recovery module, which is built upon a pre-defined domain ontology, is exploited to correct potential ASR errors in the values predicted by the slot tagging model. However, there are no word-aligned annotations on ASR hypotheses for fine-tuning the slot tagging model. Thus, we indirectly guide the adaptive training of the slot tagging model on ASR hypotheses by utilizing the supervision of the value error recovery. Concretely, we compute a reward by comparing the predicted semantic forms after the value error recovery with the annotations, and then optimize the slot tagging model by maximizing the expected reward. Since the operations in the value error recovery are non-differentiable, we use a policy gradient [10] based reinforcement learning (RL) approach for optimization.
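For readers unfamiliar with policy gradient methods, the generic REINFORCE-style estimator behind "maximizing the expected reward" is sketched below. The notation is an assumption made for illustration: $\theta$ denotes the parameters of the slot tagging model, $o$ a tag sequence sampled for the ASR hypothesis $x$, and $r(o)$ the reward obtained after value error recovery. The exact objective, sampling scheme and any baseline term used in this work may differ.

```latex
J(\theta) = \mathbb{E}_{o \sim p_\theta(\cdot \mid x)}\big[\, r(o) \,\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{o \sim p_\theta(\cdot \mid x)}\big[\, r(o)\, \nabla_\theta \log p_\theta(o \mid x) \,\big]
```

Because the gradient only requires the scalar value $r(o)$, the rule-based recovery step can be treated as a black box and need not be differentiable.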
C. Liu et al.
Table 1. An example of user utterance (manual transcription and ASR hypothesis) and its semantic annotations.
x̂   I want to go to Suzhou not Shanghai
x    I one goal to Suizhou not Shanghai
y    inform(dest="Suzhou"); deny(dest="Shanghai")
ô    I[O] want[O] to[O] go[O] to[O] Suzhou[B-inform-dest] not[O] Shanghai[B-deny-dest]
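The relation between the word-level tags and the act(slot=value) triplets in Table 1 can be illustrated with a small decoder. This is a sketch under assumptions: tags are taken to have the form B-act-slot / I-act-slot / O as in the table, multi-word values are joined with a space (a Chinese system would more likely join characters without a separator), and the function name is hypothetical.

```python
def decode_bio(tokens, tags, sep=" "):
    """Collect (act, slot, value) triplets from BIO tags such as
    'B-inform-dest' and 'I-inform-dest'."""
    spans = []       # finished (label, [tokens]) spans
    current = None   # span currently being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # 'O' tag or an inconsistent 'I-' tag closes the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)

    triplets = []
    for label, span_tokens in spans:
        act, slot = label.split("-", 1)
        triplets.append((act, slot, sep.join(span_tokens)))
    return triplets
```

Applied to the ASR hypothesis x instead of the transcription, the same decoding would yield the value "Suizhou", which is the kind of error the value error recovery module is meant to correct.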
We conduct an empirical study of our proposed method and a set of carefully selected state-of-the-art baselines on the 1st CATSLU challenge dataset [20], which is a large-scale Chinese SLU dataset collected from a real-world SDS application. Experiment results confirm that our proposed method can outperform the baselines significantly. In summary, this paper makes the following contributions:
– To the best of our knowledge, this is the first work to train a slot tagging model guided by a rule-based value error recovery module. It tends to learn a robust slot tagging model for easier and more accurate value error recovery.
– We propose to optimize the slot tagging model with indirect supervision and an RL approach, which does not require word-aligned annotations on ASR hypotheses. An ablation study confirms that RL training can give improvements even without the value error recovery module.
2 Proposed Method
In this section, we provide details of our proposed robust SLU framework, which consists of a sequence labelling based slot tagging model and a rule-based value error recovery (VER) module. To guide the training of the slot tagging model on ASR hypotheses with the VER, we propose an RL-based training algorithm. Let $x = (x_1 \cdots x_{|x|})$ and $\hat{x} = (\hat{x}_1 \cdots \hat{x}_{|\hat{x}|})$ denote the ASR 1-best hypothesis and the manual transcription of one utterance respectively. Its semantic representation (i.e., act(slot=value) triplets) is annotated on $\hat{x}$. Thus, it is easy to get the word-level tags on $\hat{x}$, $\hat{o} = (\hat{o}_1 \cdots \hat{o}_{|\hat{x}|})$, which are in the Begin/In/Out (BIO) schema (e.g., O, B-inform-dest, B-deny-dest), as shown in Table 1.
2.1 Slot Tagging Model
For slot tagging, we adopt an encoder-decoder model with focus mechanism [19] to model the label dependency. A BLSTM encoder reads an input sequence $\hat{x}$, and generates the hidden states at the t-th time-step via

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]; \quad \overrightarrow{h}_t = \mathrm{LSTM}_f(\overrightarrow{h}_{t-1}, \phi(\hat{x}_t)); \quad \overleftarrow{h}_t = \mathrm{LSTM}_b(\overleftarrow{h}_{t+1}, \phi(\hat{x}_t)) \tag{1}$$

where $\phi(\cdot)$ is a word embedding function, $[\cdot;\cdot]$ denotes vector concatenation, and $\mathrm{LSTM}_f$ and $\mathrm{LSTM}_b$ represent the forward and backward LSTMs, respectively.
Then, an LSTM decoder updates its hidden states at the t-th time-step recursively by $s_t = \mathrm{LSTM}(s_{t-1}, [\psi(\hat{o}_{t-1}); h_t])$, where $\psi(\cdot)$ is a label embedding function, and $s_0$ is initialized with $\overleftarrow{h}_1$. Finally, the slot tag $\hat{o}_t$ is generated by $P(\hat{o}_t \mid \hat{o}$ [...] $\lambda$ ($\lambda \in (0, 1)$ is a tunable threshold), $c_k$ can be the ancestor of $c_i$, and $c_i$ is thus added into the same chain. Otherwise, if none of the candidates is assigned as the ancestor of $c_i$, we build a new chain and set $c_i$ as the first element of the newly created chain. Performing the above heuristic search for all elements, we finally obtain all possible nominal compound chains.
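The heuristic chain construction described above can be sketched as follows. Because the preceding text is incomplete, several details are assumptions made for illustration: candidates for $c_i$ are taken to be all earlier compounds, the ancestor is chosen as the highest-scoring candidate above the threshold, and `score` is a hypothetical stand-in for the model's pairwise probability.

```python
def build_chains(compounds, score, lam=0.5):
    """Greedily group compounds into chains.

    For each compound c_i, every earlier compound c_k is a candidate
    ancestor. If the best pairwise score exceeds `lam`, c_i joins the
    chain of that ancestor; otherwise c_i starts a new chain.
    """
    chains = []
    chain_index = {}  # position of a compound -> index of its chain
    for i, compound in enumerate(compounds):
        best_k, best_score = None, lam
        for k in range(i):
            s = score(compounds[k], compound)
            if s > best_score:
                best_k, best_score = k, s
        if best_k is None:
            chain_index[i] = len(chains)
            chains.append([compound])
        else:
            chain_index[i] = chain_index[best_k]
            chains[chain_index[i]].append(compound)
    return chains
```

For example, with a scoring function that returns 0.9 for related compounds and 0.1 otherwise, compounds are grouped into one chain per related set, in order of first appearance.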
Nominal Compound Chain Extraction
4.5 Learning
The learning targets of the compound extractor and the chain detector are to minimize the following losses, respectively:

$$L_l = -\sum_{j=0}^{n} \sum_{i=0}^{k} p_j(\hat{o}_i) \log(p_j(o_i)), \qquad L_r = -\frac{1}{R} \sum_{i=1}^{R} \sum_{j=0}^{2} \hat{p}_{ij} \log(p_{ij}). \tag{10}$$

where $p(\hat{o}_i)$ is the ground-truth probability, $k$ is the size of the tag set and $n$ is the number of tokens. $R$ is the count of compound pairs in a document, and $\hat{p}_{ij}$ and $p_{ij}$ are the gold and predicted probabilities that $c_i$ and $c_j$ belong to one chain. In our joint training, we optimize the final loss $L = L_l + \mu L_r$, where $\mu$ is a coupling coefficient.
5 Experiments 5.1 Settings We use the pre-trained weights in BERT-base-chinese version5 to initialize the BERT encoder, which has 12 layers with 768 dimensions of hidden state. We use Adam as the optimizer with an initial learning rate of 1e-5 with warm-up rate of 0.1 and L2 weight decay of 0.01, training with early-stop strategy. The batch size is 1. The maximum length of sentence is 128. A dropout layer with 0.2 is used after the encoder and the fusion layer. The factors λ and μ are set as 0.5 and 0.4, according to our development experiments. The sememes of words are obtained by an open API, OpenHowNet6 . We also re-implement the joint model for co-reference resolution [14, 15] as our strong baseline. For the nominal compound detection, we adopt precision, recall and F1-score as metrics. We use MUC, B3 and CEAFφ4 to evaluate the chain detection7 . 5.2 Main Results Table 2 shows the results of pipeline and joint methods under different setting. First of all, we can find that our proposed joint model consistently outperforms the pipeline counterpart under all settings. In contrast to pipeline, the joint model achieves the improvements of 0.1% F1 score (70.2–70.1) on compound extraction, and 1.6% F1 score (59.3–57.7) on chain detection, respectively. In addition, the improvements in the second chine detection stage are more significant than that in the first compound extraction stage. The possible reason is that, the joint model can mitigate the error propagation from the first extraction step, and perform dynamic adjustment for chain detection in 5 6 7
https://github.com/google-research/bert. https://openhownet.thunlp.org/. The scores are evaluated by the standard scripts of CONLL12: http://conll.cemantix.org/2012.
126
B. Li et al.
Table 2. Results on the NCCE task. w/o BERT denotes that replace the BERT encoder with BiLSTM, and w/o HowNet denotes removing the HowNet resource by the GCN encoder. w/o gate indicates replacing the gate fusion mechanism (Eq. 6) with direct concatenation. Compound Extraction Chain Detection Precision Recall F1
MUC(F1) B3 (F1) CEAFφ4 (F1) Avg. (F1)
68.8
71.4
70.1
60.5
51.2
61.4
47.6
56.6
51.7
39.1
31.3
41.4
37.3
w/o HowNet 68.9
68.3
68.6
59.6
50.3
60.0
56.6
Pipeline: Ours w/o BERT
57.7
Joint: CoRef
-
-
-
48.7
40.7
50.7
46.7
CoRef+BERT -
-
-
59.5
50.6
59.7
56.6
70.3
70.0
70.2
61.6
53.7
63.7
59.7
45.6
60.2
51.9
43.1
33.8
42.0
39.6
w/o HowNet 67.8
69.3
68.5
60.1
51.0
60.2
57.1
w/o gate
71.4
69.4
60.9
52.3
63.1
58.8
Ours w/o BERT
67.4
training. Most prominently, when BERT is unavailable, we can notice that the performance drastically drops, with roughly 20% F1 score decrease, for both two sub-tasks in pipeline and joint schemes. This can be explained by that the pre-trained contextualized representation in BERT can greatly enrich the information capacity of documents, relieving the polysemy problem to some extent. Such observation is consistent with the recent findings of BERT applications [3, 10]. We also see that if the HowNet module is removed, both the pipeline and joint methods can witness notable performance drops. However, the influence from HowNet seems comparably weaker, compared with the BERT encoder. In addition, the usefulness of the HowNet resource is more significant for chain detection, compared with the one for the nominal compound extraction. For example, the gap is 2.6% F1(59.7–57.1) for chain detection while the drop is 1.7% F1(70.2–68.5) for compound extraction in the joint model. This is partially due to the fact that, the enhanced sememes information can promote the interactions between different nominal mentions chain, being much informative for the chain clustering, which is consistent with our initial intuition introduced in §4.2. Furthermore, we compare our joint model with a strong baseline, CoRef, a joint model for co-reference resolution8 . From the results we can learn that the CoRef model is much competitive, and with BERT, it reaches a close equivalent-level results to ours (without HowNet version for fair comparison), with 56.6% F1 score. Nevertheless, our model with the help of HowNet can outperform CoRef by 3.1% F1 score on NCCE. Also the gate mechanism (Eq. 6) can bring positive effects for the results.
⁸ Since the original CoRef model does not support the pipeline scheme and cannot extract mentions on its own, we only present its chain detection results.
Table 3. An example illustrating nominal compound chains (colored tokens) extracted by the models with/without HowNet resource, respectively.
Fig. 4. F1 score against nominal mention length. The score is the average in a sliding window of size 7.
Fig. 5. Performance of chain detection under varying chain size. (Bar chart; x-axis: chain size in bins 1–3, 4–6, 7–9, 10–12, 13+; y-axis: Avg. F1 (%); series: full model and w/o HowNet.)
5.3 Discussion

Influence of Compound Lengths. One key challenge of NCCE lies in extracting longer nominal compounds, which is trickier than extracting the shorter lexical words in LCE. Here we study the influence of compound length on nominal mention detection under different settings, including the joint/pipeline models with/without HowNet and with/without BERT. Figure 4 illustrates the results for different nominal compound lengths. First of all, nominal compounds with lengths of 5–10 words increase the extraction difficulty the most, while the results decrease when lengths are larger than 14. In addition, with HowNet or BERT, both the pipeline and joint models consistently handle longer compounds better, especially those with length ≥ 13. This is partially because the external sememes from HowNet improve the understanding of the document context, which facilitates detection. In particular, the improvements from BERT are more significant for compounds with length ≥ 14.

Chain Detection in Varying Chain Size. Chain size refers to the number of compounds within a chain. We further investigate the influence of HowNet on chain detection under different chain sizes. As shown in Fig. 5, chains that are too short or too long are more difficult to recognize, while chains of size 4 to 12 obtain better results. We also see that detection with HowNet is better, especially for longer chains (≥ 13).
Table 4. Performance of the sentence ordering task under different input resources. With/without 'type' indicates whether the syntactic role of the word in the sentence is considered when building the graph.

Input resource             Accuracy   PMR     τ
Sentence                   48.72      21.00   66.57
Sentence+CW                49.64      19.49   66.62
Sentence+LC (w/o type)     49.84      21.50   67.99
Sentence+LC (with type)    50.54      21.00   68.41
Sentence+NCC               51.87      26.50   68.68
The underlying reason is largely that external sememe information from HowNet provides more hints for inference.

Case Study. We explore how HowNet helps the NCCE task. Specifically, Table 3 shows an example from the test set, extracted by our joint model. We find that when the sememes from HowNet are employed, the extraction results become more complete. For example, the surface compound words ('the body') are enriched by their sememes, as listed in Table 3, which then leads the model to yield the correct extraction. Without such links from words to the external HowNet resource, inference becomes harder.

5.4 Application
As we emphasized earlier, compared with the traditional lexical chain, the nominal compound chain is more expressive in rendering the underlying topics and provides details about the semantics of documents, which can consequently better facilitate downstream NLP tasks. To quantify the usefulness of this characteristic, we exploit the nominal compound chains extracted by our model for sentence ordering, a semantic-dependent task [25]. Based on the state-of-the-art graph model of Yin et al. (2019) [25], we first implement the task with raw sentence inputs, and we additionally leverage common words (CW) as an external resource⁹. We then extend the inputs with the lexical chains (LC) and the nominal compound chains (NCC), respectively. Technically, we utilize these external resources by building graphs that connect the surface words with the corresponding nodes from the chains. We follow the same metrics as Yin et al. (2019), including accuracy, PMR, and τ.

Table 4 shows the main results. First, the comparison between the top two rows indicates that integrating enhanced resources can benefit the sentence ordering task. We also find that lexical chains give better task performance than common words. Specifically, the more fine-grained the information, the more evident the benefit, as can be inferred from the results with and without lexicon type. Most importantly, our proposed nominal compound chains improve the results the most. Notably, PMR increases by 5 (26.50 − 21.50) compared with Sentence+LC and by 5.5 (26.50 − 21.00) compared with the raw sentence input. This shows the usefulness of the nominal compound chain we introduce.

⁹ For more technical details, please refer to the original paper of Yin et al. (2019) [25].
6 Related Work

Morris and Hirst (1991) pioneer the lexical cohesion (chain) task, a concept that arises from semantic relationships between words and provides a cue to text structure [20]. Based on the concept of cohesive harmony, Remus and Biemann (2013) propose three knowledge-free methods to extract lexical chains from documents automatically [22]. Yunpeng and Wenling (2016) develop a chain extraction method with a semantic model and show that semantic features can be key signals for improving the task [26]. Mascarell (2017) uses word embeddings to compute semantic similarity and improves the results by considering contextual information [19]. Nevertheless, lexical chain extraction involves only shallow lexical knowledge and lacks the use of latent semantic information [11, 12], such as topic information [7], which limits its usefulness for downstream tasks. This motivates us to propose a novel task, Nominal Compound Chain Extraction.

Lexical chain extraction, as one of the information extraction tasks [9], shares many technical similarities with the co-reference resolution task [5], as both model the task as a chain prediction problem. While lexical chain extraction focuses more on the semantic coherence between mentions, co-reference resolution aims to group mentions of the same entity, event, or pronoun. Recently, an increasing number of neural co-reference resolution models have been proposed [14, 15], and they greatly outperform previous machine learning models with hand-crafted features. For example, Lee et al. (2017, 2018) first propose an end-to-end neural model that relies on pairwise scoring of entity mentions.
7 Conclusion

In this work, we proposed a novel task, nominal compound chain extraction, as an extension of lexical chain extraction. The nominal compound chain can provide richer semantic information for rendering the underlying topics of documents. To accomplish the extraction, we proposed a joint model that formulates the task as a two-step prediction problem consisting of Nominal Compound Extraction and Chain Detection. We made use of the BERT contextualized language model and enriched the semantics of input documents by leveraging the HowNet resource. We manually annotated a dataset for the task, including 2,450 documents, 26,760 nominal compounds and 5,096 chains. The experimental results showed that our proposed joint model performed better than the pipeline baselines and other joint models, offering a benchmark method for nominal compound chain extraction. Further analysis indicated that both BERT and the external HowNet resource benefit the task, especially the BERT language model.
Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 61772378, No. 61702121), the National Key Research and Development Program of China (No. 2017YFC1200500), the Humanities-Society Scientific Research Program of Ministry of Education (No. 20YJA740062), the Research Foundation of Ministry of Education of China (No. 18JZD015), the Major Projects of the National Social Science Foundation of China (No. 11&ZD189), the Key Project of State Language Commission of China (No. ZDI135-112) and Guangdong Basic and Applied Basic Research Foundation of China (No. 2020A151501705).
References
1. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Advances in Automatic Text Summarization, pp. 111–121 (1999)
2. Carthy, J.: Lexical chains versus keywords for topic tracking. In: Gelbukh, A. (ed.) CICLing 2004. LNCS, vol. 2945, pp. 507–510. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24630-5_63
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the NAACL, pp. 4171–4186 (2019)
4. Dong, Z., Dong, Q.: HowNet and the Computation of Meaning (2006)
5. Elango, P.: Coreference resolution: A survey. University of Wisconsin (2005)
6. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Inf. Process. Manag. 43(6), 1705–1714 (2007)
7. Fei, H., Ji, D., Zhang, Y., Ren, Y.: Topic-enhanced capsule network for multi-label emotion classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1839–1848 (2020)
8. Fei, H., Ren, Y., Ji, D.: Boundaries and edges rethinking: an end-to-end neural model for overlapping entity relation extraction. Inf. Process. Manag. 57(6), 102311 (2020)
9. Fei, H., Ren, Y., Ji, D.: Negation and speculation scope detection using recursive neural conditional random fields. Neurocomputing 374, 22–29 (2020)
10. Fei, H., Ren, Y., Zhang, Y., Ji, D., Liang, X.: Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief. Bioinform. (2020)
11. Fei, H., Zhang, M., Ji, D.: Cross-lingual semantic role labeling with high-quality translated training corpus. In: Proceedings of the ACL, pp. 7014–7026 (2020)
12. Fei, H., Zhang, Y., Ren, Y., Ji, D.: Latent emotion memory for multi-label emotion classification. In: Proceedings of the AAAI, pp. 7692–7699 (2020)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
14. Lee, K., He, L., Lewis, M., Zettlemoyer, L.: End-to-end neural coreference resolution. arXiv preprint arXiv:1707.07045 (2017)
15. Lee, K., He, L., Zettlemoyer, L.: Higher-order coreference resolution with coarse-to-fine inference. In: Proceedings of the NAACL, pp. 687–692 (2018)
16. Li, Z., Ding, N., Liu, Z., Zheng, H., Shen, Y.: Chinese relation extraction with multi-grained information and external linguistic knowledge. In: Proceedings of the ACL, pp. 4377–4386 (2019)
17. Liu, X., Luo, Z., Huang, H.: Jointly multiple events extraction via attention-based graph information aggregation. arXiv preprint arXiv:1809.09078 (2018)
18. Mallick, C., Dutta, M., Das, A.K., Sarkar, A., Das, A.K.: Extractive summarization of a document using lexical chains. In: Nayak, J., Abraham, A., Krishna, B.M., Chandra Sekhar, G.T., Das, A.K. (eds.) Soft Computing in Data Analytics. AISC, vol. 758, pp. 825–836. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-0514-6_78
19. Mascarell, L.: Lexical chains meet word embeddings in document-level statistical machine translation. In: Proceedings of the Workshop on Discourse in Machine Translation, pp. 99–109 (2017)
20. Morris, J., Hirst, G.: Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Comput. Linguist. 17(1), 21–48 (1991)
21. Niu, Y., Xie, R., Liu, Z., Sun, M.: Improved word representation learning with sememes. In: Proceedings of the ACL, pp. 2049–2058 (2017)
22. Remus, S., Biemann, C.: Three knowledge-free methods for automatic lexical chain extraction. In: Proceedings of the NAACL, pp. 989–999 (2013)
23. Sun, R., Zhang, Y., Zhang, M., Ji, D.: Event-driven headline generation. In: Proceedings of the ACL, pp. 462–472 (2015)
24. Xu, S., Yang, S., Lau, F.C.: Keyword extraction and headline generation using novel word features. In: Proceedings of the AAAI (2010)
25. Yin, Y., Song, L., Su, J., Zeng, J., Zhou, C., Luo, J.: Graph-based neural sentence ordering. arXiv preprint arXiv:1912.07225 (2019)
26. Yunpeng, Q., Wenling, W.: Using semantic model to build lexical chains. Data Anal. Knowl. Discov. 32(9), 34–41 (2016)
A Hybrid Model for Community-Oriented Lexical Simplification
Jiayin Song¹, Yingshan Shen¹, John Lee², and Tianyong Hao¹(✉)
¹ School of Computer Science, South China Normal University, Guangzhou, China
² Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
Abstract. Generally, lexical simplification replaces complex words in a sentence with simpler, synonymous words. Most current methods improve lexical simplification by optimizing the ranking algorithm, and their performance is limited. This paper proposes a hybrid model that merges the candidate words generated by a Context2vec neural model and a Context-aware model using a weighted average method. The model consists of four steps: candidate word generation, candidate word selection, candidate word ranking, and candidate word merging. In an evaluation on standard datasets, our hybrid model outperforms a list of baseline methods, including the Context2vec method, the Context-aware method, and the state-of-the-art semantic-context ranking method, indicating its effectiveness in the community-oriented lexical simplification task.
Keywords: Lexical simplification · Context2vec · Context-aware
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 132–144, 2020. https://doi.org/10.1007/978-3-030-60450-9_11

1 Introduction

Lexical simplification is a task that replaces complex words in text with simpler synonyms that are easier for readers, including children and non-native speakers, to understand [1, 2]. Generally, given a text to be simplified, a lexical simplification model first generates a list of candidate words and then obtains a final list of candidates through filtering criteria such as part-of-speech, semantic similarity, and contextual relevance. Most lexical simplification methods use a single model, which has the disadvantage that a model trained on a single dataset may not be effective on other datasets. Meanwhile, such a model produces substitution words that are as simple as possible in order to reduce word complexity. However, lexical simplification for a specific community has a list of target words that are treated as simple. For instance, the Hong Kong Education Bureau requires complex English literature to be transformed so that it is compatible with a defined simplified word list (EDB List), which includes approximately 4,000 words, covering the letters A to Z, that all students in Hong Kong are expected to know upon graduation from primary school [3]. Due to the limited scope of candidate words, this lexical simplification task is more challenging than the common lexical substitution task. For example, in the sentence “I assume he was not aware of the fact that he was in a church and being extremely obnoxious”, a model may generate the candidate word “offensive” as the best substitute for the target word “obnoxious”. However, “offensive” may be filtered out in the candidate selection step because the word does not exist in the EDB List. Therefore, how to ensure that simplified words are within the scope of a specific community is a challenging research topic.

This paper develops a strategy for merging models for lexical simplification by combining, with a weighted average, the candidate words generated by the Context-aware model [4] and the Context2vec model [5]. The whole method consists of four steps: 1) candidate generation, 2) candidate selection, 3) candidate ranking, and 4) candidate merging. The evaluation results show the effectiveness of our hybrid model for the community-oriented lexical simplification task. The main contributions of the paper are: 1) a weighted average method is proposed to integrate the Context2vec and Context-aware models, 2) a strategy is implemented to merge the candidate word lists generated by different models, and 3) our hybrid model outperforms a list of baseline methods, including state-of-the-art methods, on publicly available datasets.
2 Related Work

The lexical simplification task [6] commonly contains the steps of target word (usually complex word) identification and candidate word (simplified word) generation. The semantic meaning of the generated candidate words is required to be similar to that of the target words. Meanwhile, the candidate words need to fit the grammar of the same sentence [7, 8]. Typically, target word identification classifies the words in a sentence as complex or non-complex [9, 10], and the complex words are treated as target words for lexical simplification. The identification of candidate words contains several sub-tasks, including generating, selecting, and sorting candidate words. For example, Hintz et al. developed a system that generated substitution candidates from WordNet and cast lexical substitution as a ranking task [11]. Substitution ranking is the task of sorting candidate words according to semantic similarity and word simplicity [12]. Generally, lexical simplification systems use simplicity, or word complexity, as the basis for ranking. Hao et al. [4] used a concise strategy that combined n-gram and word2vec models for semantic and context ranking. The n-gram model was applied to candidate words to retain the meaning of the original sentence and to ensure the correct grammar of the generated sentence by looking at a fixed window around the target word. The word2vec model was applied to learn two distinct representations, one as a target and the other as a context, both embedded in the same space [13], while candidate words were re-ranked by similarity. Kriz et al. [14] proposed the Context2vec model, which was based on a neural network structure. Context2vec embedded the context around target words into a low-dimensional space [15] and used a bidirectional neural network to obtain a complete representation of a context sentence. Peters et al. [16] proposed a new deep contextualized word representation called ELMo (Embeddings from Language Models), in which lower layers efficiently encoded syntactic information while higher layers captured semantics [17].
Recently, researchers have begun to study lexical simplification models for a specific community, which predict whether a word can be understood by users in the community [7, 18, 19]. Traditional lexical simplification systems provided all users with the same substitution words regardless of their English proficiency, since the systems did not specify target user groups during training. In practice, users with different language backgrounds may draw different distinctions between complex and non-complex words. Lee et al. [20] proposed a personalized substitution ranking method to identify candidate words that were consistent with a user's language level. The method used a combination strategy that took into account both the complexity of words and the faithfulness of word meaning. However, this method was more beneficial for users with higher language skills. Due to their large vocabulary, the system was more likely to recommend high-quality synonyms that were more difficult but still familiar to them. Hao et al. [21] proposed a semantic-context combination ranking strategy for English lexical simplification to a restricted vocabulary. After that, Song et al. [4] further improved the method through optimization, e.g., handling the case in which multiple candidate words with the same similarity were generated when using WordNet to calculate similarity values. However, these methods still used a single model, and performance could be further improved by enhancing candidate word generation.
3 The Method

In the lexical simplification task, substitutions should retain both semantic meaning and context appropriateness. This paper proposes a hybrid method for community-oriented lexical simplification that is based on the Context2vec and Context-aware models and uses two ranking strategies. Context2vec captures context information through a bidirectional long short-term memory (Bi-LSTM) network. The hybrid method consists of four steps. The first step uses the Context2vec method and the Context-aware method to generate candidate words separately. For a given target word w, both methods generate a list of candidate words c1, c2, …, cn. Candidates generated by the Context-aware method are similar in meaning to the target word, while candidates generated by the Context2vec method take sentence context relevance into account. The second step is substitution selection, which selects the best candidates for the target word. The third step is substitution ranking, which re-ranks the selected candidates in terms of their simplicity or other metrics. The final step merges the candidate words of the Context2vec and Context-aware methods. The overall framework of our proposed method is shown in Fig. 1.

3.1 Substitution Generation

The Context2vec method applies a bidirectional LSTM to variable-length sentential contexts around target words. Full context information is represented by feeding an LSTM network with the sentence words in two directions: from left to right and from right to left. The target words are embedded at the same time. Two similarity metrics, context-to-context (c2c) and target-to-context (t2c), are applied for generating candidates. For a pair of sentences sen1 and sen2, the context-to-context similarity is measured by the vector cosine value between the embedding representations of sen1 and sen2, as shown in Eq. (1). The target-to-context similarity is calculated as the cosine value between the embedding representations of the target word wt in sen1 and the context of sen2, as shown in Eq. (2). Finally, the overall similarity of a candidate word is Sim_t2c * Sim_c2c.
Fig. 1. Framework of our proposed method for community-oriented lexical simplification.
The candidate words generated by Context2vec are obtained according to semantics and context, so their morphology conforms to the current sentence. However, the gold candidates provided by the 50 independent annotators, introduced in the experiment section, use the lemmas of the words, so lemmatization is performed.

$$\mathrm{Sim}_{c2c}(sen_1, sen_2) = sen_2 \cdot sen_1 \tag{1}$$

$$\mathrm{Sim}_{t2c}(w_t, sen_2) = sen_2 \cdot w_t \tag{2}$$
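As an illustration of Eqs. (1) and (2), the following minimal sketch computes the two similarities and their product on toy vectors. It assumes that the dot products in the equations are taken over length-normalized embeddings, so that they coincide with the cosine values mentioned in the text; the vectors and the function names are hypothetical and do not come from a trained Context2vec model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def overall_similarity(target_vec, sen1_context_vec, sen2_context_vec):
    """Sim_t2c * Sim_c2c for a target word in sen1 and the context of sen2."""
    sim_c2c = cosine(sen2_context_vec, sen1_context_vec)  # Eq. (1)
    sim_t2c = cosine(sen2_context_vec, target_vec)        # Eq. (2)
    return sim_t2c * sim_c2c

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
sen1_context = [1.0, 0.0, 0.0]
sen2_context = [1.0, 1.0, 0.0]
target_word = [0.0, 1.0, 0.0]

score = overall_similarity(target_word, sen1_context, sen2_context)
```

Both cosines equal 1/√2 for these toy vectors, so the product is 0.5; in practice the vectors would be produced by the trained context and target encoders.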
With respect to the Context-aware model, candidate words are selected from a list of words that the community considers simple. Thus, we use all words in a vocabulary list related to the requirements of the community, except for the target words, as the initial candidate words and calculate their semantic similarity to the target words. A WordNet-based semantic similarity measure is applied to calculate the semantic relevance between candidate words and the target word. For a pair of words wt and wc, the measure calculates the shortest path between them. The calculation using the path strategy is shown in Eq. (3), where len(wt, wc) is the length of the shortest path between the two synsets.

$$\mathrm{Sim}_{Path}(w_t, w_c) = \frac{1}{1 + len(w_t, w_c)} \tag{3}$$
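The path strategy of Eq. (3) can be sketched with a breadth-first search over a small graph. The graph below is a hypothetical stand-in for WordNet (its nodes and edges are invented for illustration), and returning 0.0 for unconnected words is an assumption of this sketch rather than something stated in the text.

```python
from collections import deque

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def sim_path(graph, w_t, w_c):
    """Eq. (3): 1 / (1 + len(w_t, w_c)); 0.0 if no path exists (assumption)."""
    length = shortest_path_length(graph, w_t, w_c)
    return 0.0 if length is None else 1.0 / (1.0 + length)

# Toy undirected graph (hypothetical, not real WordNet structure).
TOY_GRAPH = {
    "obnoxious": ["unpleasant"],
    "unpleasant": ["obnoxious", "bad", "rude"],
    "bad": ["unpleasant"],
    "rude": ["unpleasant"],
    "table": [],
}

similarity = sim_path(TOY_GRAPH, "obnoxious", "rude")
```

Here "obnoxious" and "rude" are two edges apart, so the similarity is 1/3. NLTK's WordNet interface exposes a comparable `path_similarity` method on synsets, which would be the natural choice with real data.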
After calculating semantic similarity, both models obtain an answer-similarity set, and the candidate words are sorted according to their similarity values.

3.2 Substitution Selection

In order to obtain more suitable candidate words, we use different filtering strategies to select candidates for the Context2vec model and the Context-aware model. First, part-of-speech (POS) tag matching is applied to the candidate words of both models. This matching reduces grammatically inappropriate candidates and reduces the volume of candidates. For the Context-aware model, we also set a similarity threshold ξ in this step to filter out candidate words with low similarity. After that, some of the remaining generated words still share the same stem with the target word. Therefore, we apply lemmatization and stemming to the remaining candidate words to extract their stems and lemmas. Each remaining candidate is then compared with the stem and lemma of the target word and filtered accordingly. The filtering process is shown in Eq. (4), where wt and wc are a target word and a candidate word, respectively, stem(wt) denotes the stem of word wt obtained with the Porter stemming tool, and lemma(wt) denotes its lemma obtained with the morphy method.

$$\mathrm{filter}(w_c) = \begin{cases} 1, & \mathrm{stem}(w_t) = \mathrm{stem}(w_c) \ \text{or} \ \mathrm{lemma}(w_t) = \mathrm{lemma}(w_c) \\ 0, & \mathrm{stem}(w_t) \neq \mathrm{stem}(w_c) \ \text{and} \ \mathrm{lemma}(w_t) \neq \mathrm{lemma}(w_c) \end{cases} \tag{4}$$

These two filtering strategies are used in the Context-aware model, while Context2vec has more filtering constraints. For community-oriented lexical simplification, candidate words must be taken from a vocabulary list related to the requirements of the community. Thus, a further filtering strategy that considers the scope limitation of the simplified word list, e.g., the EDB List, is applied.

3.3 Substitution Ranking

Keeping a similar semantic meaning and fitting the original sentence grammatically are essential to lexical simplification. Empirically, the order of candidate words has a dramatic impact on the models. For a candidate word wc and its target word wt in the sentence sen, both models first use relevance values calculated by word2vec to initially re-rank the candidate words. As shown in Eq. (5), Word2Vec_similarity(wc, wt) is the similarity of wc and wt computed from word embeddings produced by word2vec, a group of related models. Since the Context2vec model is based on a bidirectional LSTM neural network, contextual information has already been embedded when training the model. Thus the word2vec order is the final order of its candidate words. In contrast, the Context-aware model only considers the similarity of word meaning at this stage.

$$R1(w_c) = \mathrm{Word2Vec\_similarity}(w_c, w_t) \tag{5}$$
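The stem/lemma indicator of Eq. (4) and the initial re-ranking of Eq. (5) can be illustrated together. In the sketch below, the two lookup tables are hypothetical stand-ins for a Porter stemmer and the morphy lemmatizer, the word vectors are invented, and cosine similarity is assumed for Word2Vec_similarity. The sketch only computes the indicator value as written in Eq. (4) and leaves the decision of how that value is consumed to the surrounding pipeline.

```python
import math

# Hypothetical lookup tables standing in for a Porter stemmer and a
# WordNet-style lemmatizer; a real system would call those tools instead.
TOY_STEM = {"walked": "walk", "walking": "walk", "strolled": "stroll", "ran": "ran"}
TOY_LEMMA = {"walked": "walk", "walking": "walk", "strolled": "stroll", "ran": "run"}

def eq4_indicator(w_t, w_c):
    """Value of filter(w_c) in Eq. (4): 1 if the stems or the lemmas agree, else 0."""
    same_stem = TOY_STEM[w_t] == TOY_STEM[w_c]
    same_lemma = TOY_LEMMA[w_t] == TOY_LEMMA[w_c]
    return 1 if (same_stem or same_lemma) else 0

def cosine(u, v):
    """Cosine similarity, assumed here as the word2vec similarity of Eq. (5)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_by_r1(target_vec, candidate_vecs):
    """Sort candidate words by R1 (similarity to the target), highest first."""
    return sorted(candidate_vecs,
                  key=lambda word: cosine(candidate_vecs[word], target_vec),
                  reverse=True)

# Toy 2-dimensional word vectors (hypothetical values).
TARGET_VEC = [1.0, 0.0]
CANDIDATE_VECS = {"ran": [0.1, 0.9], "strolled": [0.9, 0.1]}

indicators = {w: eq4_indicator("walked", w) for w in ("walking", "strolled", "ran")}
ranking = rank_by_r1(TARGET_VEC, CANDIDATE_VECS)
```

With these toy inputs, only "walking" shares a stem and lemma with the target "walked", and "strolled" is ranked ahead of "ran" because its vector is closer to the target vector.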
After that, we sort the candidate words of the Context-aware model based on their contextual relevance. First, candidate strings are extracted as n-grams, where n is set to 2. Then the target word in each string is replaced with a candidate word to calculate the probability of the candidate fitting the string. In every round of calculation, the resulting 2-gram strings are looked up in a reference corpus to obtain their frequencies. For example, for the target word “trustworthy” in the sentence “Many people see Al Jazeera as a more trustworthy source of information than government and foreign channels”, the extracted strings are “more trustworthy” and “trustworthy source”. Consequently, a candidate word can be chosen to replace the target word according to its context relevance value. Finally, the top 10 candidates of the ranking list are selected as the best substitutions. The candidate word rank is calculated by Eqs. (6) and (7), where Match_tw(wc, wt) is a binary value denoting whether the POS tags of wc and wt match. Sem(wc, wt) is the semantic similarity between the candidate word and the target word, while Sem_syn is the similarity with SYNPOS after filtering with ξ. Relv_con(wc, sen) is the context fit of the word wc to a given sentence sen. It is defined as the maximum frequency value of a candidate word with its surrounding context divided by its maximum frequency value. The square root is applied to normalize the value range, since the relevance value is usually small due to the large value of the maximum frequency. Additionally, we introduce a parameter β to balance the weights of similarity and context relevance in the calculation.

$$R2(w_c) = \mathrm{Match}_{tw}(w_c, w_t) \left( (1 - \beta)\, \mathrm{Sem}_{syn}(w_c, w_t) + \beta \sqrt{\mathrm{Relv}_{con}(w_c, sen)} \right) \tag{6}$$

$$\mathrm{Match}_{tw}(w_c, w_t) = \begin{cases} 1, & pos(w_c) = pos(w_t) \\ 0, & pos(w_c) \neq pos(w_t) \end{cases} \tag{7}$$
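The 2-gram extraction and the score of Eqs. (6) and (7) can be sketched as follows. The sketch assumes that the POS match multiplies the whole weighted sum and that the square root applies to the context relevance term, as the surrounding text suggests; the similarity and relevance values passed in are invented numbers, and the helper names are hypothetical.

```python
import math

def bigrams_around(tokens, index):
    """2-gram strings that contain the token at position `index`."""
    grams = []
    if index > 0:
        grams.append((tokens[index - 1], tokens[index]))
    if index < len(tokens) - 1:
        grams.append((tokens[index], tokens[index + 1]))
    return grams

def substitute(grams, target, candidate):
    """Replace the target word with a candidate word in every 2-gram."""
    return [tuple(candidate if tok == target else tok for tok in gram)
            for gram in grams]

def r2(pos_match, sem_syn, relv_con, beta=0.5):
    """Eq. (6) with the binary POS match of Eq. (7) as a multiplier."""
    match = 1 if pos_match else 0
    return match * ((1.0 - beta) * sem_syn + beta * math.sqrt(relv_con))

tokens = "a more trustworthy source of information".split()
grams = bigrams_around(tokens, tokens.index("trustworthy"))
replaced = substitute(grams, "trustworthy", "reliable")

# Hypothetical inputs: semantic similarity 0.5 and context relevance 0.04.
score_matching_pos = r2(True, 0.5, 0.04)
score_other_pos = r2(False, 0.5, 0.04)
```

For these inputs the matching-POS score is 0.5 · 0.5 + 0.5 · 0.2 = 0.35, while a POS mismatch yields 0. The frequencies of the substituted 2-grams would be looked up in a reference corpus to obtain the relevance value.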
3.4 Substitution Merging

To address the low candidate generation ratio of the Context-aware model and to improve the candidate generation accuracy of the Context2vec model, we use a model merging strategy based on a weighted average method and a boosting method. For the candidate words w_i (i = 1, 2, …, n) from the Context-aware model and the candidate words w_j (j = 1, 2, …, m) from Context2vec, the merged candidate words are w_c (c = 1, 2, …, l; l ≤ m + n). The score of a candidate word w_c is calculated by Eq. (8).

$$score(w_c) = \begin{cases} score(w_i) \cdot \eta, & w_i \neq w_j \ \text{(only generated by the Context-aware model)} \\ score(w_j) \cdot (1 - \eta), & w_i \neq w_j \ \text{(only generated by Context2vec)} \\ \dfrac{score(w_i) \cdot \eta + score(w_j) \cdot (1 - \eta)}{2}, & w_i = w_j \end{cases} \tag{8}$$
In Eq. (8), score(wc) is the value of wc after merging, and the parameter η balances the weights of the two methods in the final score. If a candidate word is generated by only one of the two methods, its score is simply multiplied by the weight of that method. When a candidate word is generated by both methods, we add the two weighted scores and take their average.
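The merging rule described above can be sketched in a few lines. The candidate scores below are invented, the function name is hypothetical, and the sketch follows the reading in which the weighted sum for shared candidates is divided by two.

```python
def merge_candidates(context_aware, context2vec, eta):
    """Merge two {word: score} dictionaries following the rule of Eq. (8)."""
    merged = {}
    for word in set(context_aware) | set(context2vec):
        in_context_aware = word in context_aware
        in_context2vec = word in context2vec
        if in_context_aware and in_context2vec:
            # Shared candidate: average of the two weighted scores.
            merged[word] = (eta * context_aware[word]
                            + (1.0 - eta) * context2vec[word]) / 2.0
        elif in_context_aware:
            merged[word] = eta * context_aware[word]
        else:
            merged[word] = (1.0 - eta) * context2vec[word]
    # Highest merged score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical candidate scores for one target word.
context_aware_scores = {"rude": 0.8, "bad": 0.6}
context2vec_scores = {"rude": 0.9, "nasty": 0.7}

merged_ranking = merge_candidates(context_aware_scores, context2vec_scores, eta=0.5)
```

With η = 0.5, chosen only to keep the arithmetic simple, the shared word "rude" receives (0.4 + 0.45) / 2 = 0.425, "nasty" receives 0.35, and "bad" receives 0.3.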
4 Evaluation and Results

4.1 Dataset and Baselines

We train an unsupervised Context2vec model, which uses a bidirectional LSTM neural network, and a Context-aware model on a standard dataset [6] from SemEval 2007 as the Training dataset. The dataset includes 295 sentences in total, extracted from the English Internet Corpus [22], and the average sentence length is 28.5 words. When training the Context-aware model, words in lexical samples are selected to ensure a variety of senses. At the same time, each sense of a target word with the same part-of-speech tag has multiple instances. Finally, we evaluate the effectiveness of our method on a Wikipedia dataset, which contains 249 sentences annotated by 50 independent annotators, and name it Testing dataset A. We then remove sentences whose annotation agreement on gold substitution candidates is below 20%, i.e., we keep sentences with at least 10 agreements on the gold substitutions of the target word among the 50 independent annotators, and obtain Testing dataset B, containing 119 sentences, to further test the stability of the methods.

Since this paper proposes a hybrid method that combines the Context-aware method and the Context2vec method, we use the two individual methods as baselines. In addition, the following two state-of-the-art methods are also used as baselines. A semantic-context ranking method [21] combines both semantic and context information for candidate ranking in lexical simplification, where WordNet-based semantic similarity measures, part-of-speech matching, and n-grams are utilized. To our knowledge, this method is currently the best-performing model on the same datasets. A boosting strategy, as another model merging method, is also used as a baseline. A boosting method usually combines multiple homogeneous learners and learns sequentially in a highly adaptive manner. Considering that the inputs of the Context2vec model are sentences while the outputs of the Context-aware model are words, the boosting method in our experiment uses the output of the Context2vec model as the input of the Context-aware model.

4.2 Evaluation Metrics

We apply the commonly used metrics Accuracy@N [6], Best, Oot (out-of-ten), and CWGR (candidate word generation rate) as standard evaluation metrics.

Accuracy@N Measure. The metric determines whether any of the generated top N (N = 1, 2, …, 10) candidate words are included in the gold set. If one is included, the current candidate word is marked as correct for the target word. The final accuracy is calculated as the number of correct matches divided by the total number of sentences. We calculate top 1 to 10 accuracy scores, which show how often a gold-standard word is selected as the best fitting candidate or among the 10 highest-ranked candidates, and how gold set coverage changes in this process.

Best Measure. The system returns all words that it believes are fitting, so the credit for each correct guess is divided by the number of guesses. The first guess in the list, with the highest overall rating, is taken as the best guess. Best measures evaluate the quality of the best guess. We calculate recall and precision as the average annotator response frequency of the substitutes identified by a model over all items in the datasets. The metrics are given in Eq. (9).

$$\mathrm{Precision}_{best} = \frac{\sum_{a_i : i \in A} \frac{\sum_{res \in a_i} freq_{res}}{|a_i| \, |H_i|}}{|A|}, \qquad \mathrm{Recall}_{best} = \frac{\sum_{a_i : i \in T} \frac{\sum_{res \in a_i} freq_{res}}{|a_i| \, |H_i|}}{|T|} \tag{9}$$
Oot Measure. The Oot measure allows the system to make up to 10 guesses, and the credit for each correct guess is not divided by the number of guesses. With 10 guesses, the system is more likely to find the responses of the gold annotations. The metric is given in Eq. (10).

$$\mathrm{Precision}_{oot} = \frac{\sum_{a_i : i \in A} \frac{\sum_{res \in a_i} freq_{res}}{|H_i|}}{|A|}, \qquad \mathrm{Recall}_{oot} = \frac{\sum_{a_i : i \in T} \frac{\sum_{res \in a_i} freq_{res}}{|H_i|}}{|T|} \tag{10}$$
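As a rough illustration of Eqs. (9) and (10), the following Python sketch computes both measures. It assumes the reading of the symbols used in the task definition cited as [6]: H_i is the multiset of annotator responses for item i, freq_res is the count of a response in H_i, a_i is the system's answer list for item i, A is the set of items the system attempted, and T is the set of all items. The function names and the dictionary-based input format are hypothetical and chosen only for this example.

```python
from collections import Counter


def item_credit(answers, gold_counts, divide_by_guesses):
    """Credit for one item: summed annotator frequency of the system's answers,
    normalised by |H_i| and, for the best measure, also by |a_i|."""
    h_size = sum(gold_counts.values())                      # |H_i|, assumed > 0
    hits = sum(gold_counts.get(res, 0) for res in answers)  # sum of freq_res
    credit = hits / h_size
    return credit / len(answers) if divide_by_guesses else credit


def precision_recall(system, gold, mode="best"):
    """system: item id -> ranked list of substitutes (hypothetical format).
    gold: item id -> Counter of annotator responses H_i."""
    if mode == "oot":
        # Oot allows at most ten guesses per item.
        system = {i: answers[:10] for i, answers in system.items()}
    attempted = [i for i in gold if system.get(i)]           # the set A
    total = sum(item_credit(system[i], gold[i], mode == "best") for i in attempted)
    precision = total / len(attempted) if attempted else 0.0
    recall = total / len(gold) if gold else 0.0              # |T| = all items
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    "s1": Counter({"clever": 3, "smart": 2}),
    "s2": Counter({"big": 4, "large": 1}),
}
system = {"s1": ["clever", "bright"], "s2": []}
print(precision_recall(system, gold, "best"))
print(precision_recall(system, gold, "oot"))
```

In the toy data the system attempts only one of the two items, so recall is half of precision under both measures.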
CWGR Measures. Candidate word generation rate measure counts the number of target words divided by the total number of candidate words.
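The Accuracy@N measure described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures (a ranked candidate list and a gold set per sentence), and the function name is hypothetical.

```python
def accuracy_at_n(ranked, gold, n):
    """ranked: sentence id -> ranked candidate list (hypothetical format).
    gold: sentence id -> set of gold substitutes.
    A sentence counts as correct if any of its top-n candidates is in the gold set."""
    correct = sum(
        1
        for sid, gold_set in gold.items()
        if any(cand in gold_set for cand in ranked.get(sid, [])[:n])
    )
    return correct / len(gold)


# Toy data, invented for illustration only.
ranked = {"s1": ["a", "b", "c"], "s2": ["x", "y"]}
gold = {"s1": {"c"}, "s2": {"x"}, "s3": {"q"}}
for n in (1, 3):
    print(n, accuracy_at_n(ranked, gold, n))
```

Sentences for which no candidate is generated (such as "s3" above) still count in the denominator, which matches the definition of accuracy as correct matches over the total number of sentences.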
4.3 Parameter Tuning

The parameters used in training the Context2vec model and the Context-aware model are summarized in Table 1. For the Context2vec model, we use 100, 300 and 500 units for context representation in preliminary training. When the dimension equals 300, the result improves significantly and remains relatively stable after that. For the smoothing factor α, we vary it from 0.25 to 1 with an interval of 0.25. The performance with the different values of α is calculated, and 0.75 is selected as the best value. The other parameters keep their default values. For the Context-aware model, we vary ξ from 0 to 0.9 with an interval of 0.1 during training. The overall performance is best when ξ equals 0.3. We then use this best value of ξ to tune β. We vary β from 0.05 to 0.95 with an interval of 0.05 and find that the method achieves the best performance when β equals 0.5. Finally, 2-grams and 3-grams are applied and compared, and 2-grams are selected for their higher performance.

Using the optimized parameters, the candidate words of the same target word are merged using the weighted average method. The performance of our method with different values of the parameter η is shown in Fig. 2. The overall performance is highest, and similar, when η equals 0.65 and 0.7. We therefore further subdivide the interval 0.65–0.7 with a step of 0.01, which adds four intermediate values. As shown in Table 2, the highest overall performance on the training dataset is achieved when η equals 0.67.
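The merging step and the η search can be illustrated with a short Python sketch. It rests on several assumptions that this section does not confirm: that the weighted average is a linear interpolation of per-candidate scores, that η weights the Context-aware score and 1 − η the Context2vec score, that both score sets are on a comparable scale, and that a candidate missing from one model scores 0 there. All function names are hypothetical.

```python
def merge_candidates(aware_scores, c2v_scores, eta):
    """Rank the union of candidates by eta * aware + (1 - eta) * c2v.
    Which model eta weights is an assumption made for this sketch."""
    words = set(aware_scores) | set(c2v_scores)
    merged = {
        w: eta * aware_scores.get(w, 0.0) + (1.0 - eta) * c2v_scores.get(w, 0.0)
        for w in words
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(merged, key=lambda w: (-merged[w], w))


def eta_grid(start, stop, step):
    """Evenly spaced eta values from start to stop, inclusive."""
    n = round((stop - start) / step)
    return [round(start + k * step, 2) for k in range(n + 1)]


# Toy scores, invented for illustration only.
aware = {"help": 0.9, "aid": 0.4}
c2v = {"aid": 0.8, "assist": 0.6}
for eta in eta_grid(0.65, 0.70, 0.01):
    print(eta, merge_candidates(aware, c2v, eta))
```

In a tuning loop, each η from the grid would be scored with the metrics of Sect. 4.2 on the training data and the best value retained.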
Table 1. Parameter settings on the training dataset for the Context2vec and Context-aware models.

Models                Parameters                         Values
Context2vec model     Context word units (dimensions)    300
                      Learning mini-batch size           1
                      Learning rate                      0.001
                      Smoothing factor                   0.75
                      Epoch                              10
Context-aware model   Similarity threshold ξ             0.3
                      Balance parameter β                0.5
                      n of n-gram                        2
Fig. 2. Performance with different η values from 0 to 1 with the interval as 0.05

Table 2. Performance by setting various η on the training dataset.

η      Accuracy               Best                   Oot
       @1     @2     @3       P      R      F1       P      R      F1
0.65   0.349  0.458  0.506    0.203  0.189  0.196    0.501  0.467  0.483
0.66   0.361  0.450  0.514    0.210  0.195  0.202    0.501  0.467  0.483
0.67   0.365  0.454  0.510    0.210  0.196  0.203    0.501  0.467  0.483
0.68   0.353  0.454  0.510    0.203  0.189  0.196    0.501  0.467  0.483
0.69   0.353  0.454  0.510    0.2    0.186  0.193    0.501  0.467  0.483
0.70   0.361  0.458  0.510    0.205  0.191  0.198    0.501  0.467  0.483
4.4 The Results

The performance of all methods is evaluated on the same testing datasets. The results are shown in Table 3 and Table 4, respectively. Overall, the Context2vec method has the lowest overall performance among all baseline methods. When the Context-aware method is used individually, its overall performance is slightly higher than that of the Context2vec method. Compared with the Boosting method and the other baselines, our hybrid method, which combines the Context-aware and Context2vec methods, achieves the highest performance.

Table 3. Performance of all methods on the testing dataset A using various evaluation metrics.

Accuracy on testing dataset A
Methods                           @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
Semantic-context ranking method   0.201  0.289  0.329  0.357  0.378  0.398  0.410  0.410  0.414  0.426
Context2vec method                0.217  0.265  0.321  0.333  0.357  0.369  0.373  0.382  0.382  0.382
Context-aware method              0.261  0.325  0.365  0.394  0.410  0.418  0.438  0.438  0.446  0.450
Boosting strategy                 0.269  0.345  0.361  0.369  0.373  0.378  0.378  0.378  0.378  0.378
Our hybrid method                 0.365  0.454  0.510  0.538  0.554  0.570  0.590  0.590  0.590  0.594

Methods                           Best P  Best R  Best F1  Oot P  Oot R  Oot F1
Semantic-context ranking method   0.152   0.109   0.127    0.423  0.304  0.354
Context2vec method                0.159   0.137   0.147    0.254  0.219  0.235
Context-aware method              0.179   0.136   0.155    0.439  0.335  0.380
Boosting strategy                 0.161   0.139   0.149    0.254  0.219  0.235
Our hybrid method                 0.210   0.196   0.203    0.501  0.467  0.483
On Dataset A, our hybrid method improves on the Context-aware method in Accuracy@1 from 0.261 to 0.365 (39.8%), in Best F1 from 0.155 to 0.203 (31.0%) and in Oot F1 from 0.380 to 0.483 (27.1%). On Dataset B, our hybrid model improves on the Context-aware method in Accuracy@1 from 0.269 to 0.361 (34.2%), in Best F1 from 0.307 to 0.356 (16.0%) and in Oot F1 from 0.506 to 0.613 (21.1%). The Best P is slightly lower on Dataset B because, in its calculation, the numerator increases by 34% while the denominator, i.e., the rate of candidate word generation, increases by 40%. On the other hand, given the increased generation rate of candidate words, Accuracy@1–10 is improved, so more simple vocabulary options can be provided for a target word. Overall, our hybrid method based on the weighted average strategy achieves the best performance on both datasets.

Table 4. Performance of all methods on the testing dataset B using various evaluation metrics.

Accuracy on testing dataset B
Methods                           @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
Semantic-context ranking method   0.218  0.311  0.319  0.361  0.403  0.420  0.420  0.420  0.420  0.420
Context2vec method                0.235  0.294  0.311  0.328  0.345  0.345  0.345  0.345  0.345  0.345
Context-aware method              0.269  0.328  0.345  0.378  0.378  0.387  0.387  0.387  0.395  0.403
Boosting strategy                 0.210  0.311  0.311  0.319  0.336  0.345  0.345  0.345  0.345  0.345
Our hybrid method                 0.361  0.429  0.454  0.496  0.513  0.538  0.538  0.546  0.555  0.563

Methods                           Best P  Best R  Best F1  Oot P  Oot R  Oot F1
Semantic-context ranking method   0.335   0.2     0.250    0.686  0.409  0.512
Context2vec method                0.261   0.219   0.238    0.4    0.336  0.365
Context-aware method              0.382   0.257   0.307    0.629  0.423  0.506
Boosting strategy                 0.237   0.199   0.216    0.4    0.336  0.365
Our hybrid method                 0.367   0.345   0.356    0.633  0.595  0.613

The CWGR is computed to evaluate the quality of the candidate words compared with the gold standard data. When the CWGR is high, the generated candidate words contain more words in the gold standard, which is preferred. The CWGR results of all methods on the same datasets are shown in Table 5. The semantic-context ranking method has the lowest CWGR on both datasets (76.3% and 67.2%), while the Context2vec method has a higher rate, with 86% on testing dataset A and 84% on testing dataset B. Our hybrid method has the highest CWGR, with 93.2% on dataset A and 94.1% on dataset B. The comparison demonstrates that our model is effective in improving the performance of the lexical simplification task for specific communities.

Table 5. Performance of the methods in CWGR on the testing datasets.

Methods                           CWGR on testing dataset A   CWGR on testing dataset B
Semantic-context ranking method   76.3%                       67.2%
Context2vec method                86%                         84%
Context-aware method              77%                         67.2%
Boosting strategy                 85.9%                       84%
Our hybrid method                 93.2%                       94.1%
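The percentages quoted in Sect. 4.4 are relative improvements over the Context-aware baseline, i.e., (new − old) / old. The short Python check below recomputes them from the scores reported in Tables 3 and 4; the helper name is hypothetical.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (Context-aware score, hybrid score) pairs taken from Tables 3 and 4.
comparisons = {
    "A Accuracy@1": (0.261, 0.365),
    "A Best F1": (0.155, 0.203),
    "A Oot F1": (0.380, 0.483),
    "B Accuracy@1": (0.269, 0.361),
    "B Best F1": (0.307, 0.356),
    "B Oot F1": (0.506, 0.613),
}
for name, (old, new) in comparisons.items():
    print(f"{name}: {relative_gain(old, new):.1f}%")
```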
5 Conclusions

This paper proposes a hybrid method for lexical simplification based on a Context-aware model and a Context2vec model. It combines the candidate words generated by the two models according to a weighted average strategy to regenerate the candidate words of target words. Evaluations on the standard Wikipedia datasets, in comparison with a list of baseline methods, show that our proposed method achieves the best performance on the community-oriented lexical simplification task.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61772146) and the Natural Science Foundation of Guangdong Province (2018A030310051).
References

1. Kajiwara, T., Matsumoto, H., Yamamoto, K.: Selecting proper lexical paraphrase for children. In: The 25th Conference on Computational Linguistics and Speech Processing (ROCLING), pp. 59–73 (2013)
2. Zeng, Q., Kim, E., Crowell, J., Tse, T.: A text corpora-based estimation of the familiarity of health terminology. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds.) ISBMDA 2005. LNCS, vol. 3745, pp. 184–192. Springer, Heidelberg (2005). https://doi.org/10.1007/11573067_19
3. Education Bureau: Enhancing English vocabulary learning and teaching at secondary level. http://www.edb.gov.hk/vocab_learning_sec. Accessed: 05 2020
4. Song, J., Hu, J., Hao, T.: A new context-aware method based on hybrid ranking for community-oriented lexical simplification. In: The 6th International Symposium on Semantic Computing and Personalization (SeCoP). Springer (2020, in press)
5. Melamud, O., Goldberger, J., Dagan, I.: context2vec: learning generic context embedding with bidirectional LSTM. In: The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61 (2016)
6. McCarthy, D., Navigli, R.: Semeval-2007 task 10: English lexical substitution task. In: SemEval, pp. 48–53. ACL (2007)
7. Qiang, J., Li, Y., Zhu, Y., Yuan, Y., Wu, X.: Lexical simplification with pretrained encoders. In: AAAI, pp. 8649–8656 (2020)
8. Qiang, J., Li, Y., Zhu, Y., Yuan, Y., Wu, X.: A simple BERT-based approach for lexical simplification. arXiv preprint arXiv:1907.06226 (2019)
9. Paetzold, G., Specia, L.: Semeval 2016 task 11: complex word identification. In: SemEval, pp. 560–569 (2016)
10. Yimam, S.M., Stajner, S., Riedl, M., Biemann, C.: Multilingual and cross-lingual complex word identification. In: Recent Advances in Natural Language Processing, pp. 813–822 (2017)
11. Hintz, G., Biemann, C.: Language transfer learning for supervised lexical substitution. In: The 54th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, pp. 118–129 (2016)
12. Paetzold, G., Specia, L.: Lexenstein: a framework for lexical simplification. In: ACL-IJCNLP 2015 System Demonstrations, pp. 85–90 (2015)
13. Melamud, O., Levy, O., Dagan, I.: A simple word embedding model for lexical substitution. In: The Workshop on Vector Space Modeling for Natural Language Processing, pp. 1–7 (2015)
14. Kriz, R., Miltsakaki, E., Apidianaki, M., Callison-Burch, C.: Simplification using paraphrases and context-based lexical substitution. In: NAACL, vol. 1, pp. 207–217 (2018)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
17. Peters, M.E., Neumann, M., Zettlemoyer, L., Yih, W.-T.: Dissecting contextual word embeddings: architecture and representation. In: The 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, pp. 1499–1509 (2018b)
18. Ehara, Y., Miyao, Y., Oiwa, H., Sato, I., Nakagawa, H.: Formalizing word sampling for vocabulary prediction as graph-based active learning. In: EMNLP, pp. 1374–1384 (2014)
19. Lee, J., Yeung, C.Y.: Personalizing lexical simplification. In: The 27th International Conference on Computational Linguistics (COLING), pp. 224–232 (2018)
20. Lee, J., Yeung, C.Y.: Personalized substitution ranking for lexical simplification. In: The 12th International Conference on Natural Language Generation, pp. 258–267 (2019)
21. Hao, T., Xie, W., Lee, J.: A semantic-context ranking approach for community-oriented English lexical simplification. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Yu. (eds.) NLPCC 2017. LNCS (LNAI), vol. 10619, pp. 784–796. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73618-1_68
22. Sharoff, S.: Open-source corpora: using the net to fish for linguistic data. Int. J. Corpus Linguist. 11(4), 435–462 (2006)
Multimodal Aspect Extraction with Region-Aware Alignment Network

Hanqian Wu¹,², Siliang Cheng¹,², Jingjing Wang³(✉), Shoushan Li³, and Lian Chi⁴

1 School of Computer Science and Engineering, Southeast University, Nanjing, China
{hanqian,slcheng}@seu.edu.cn
2 Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, China
3 School of Computer Science and Technology, Soochow University, Suzhou, China
{djingwang,lishoushan}@suda.edu.cn
4 Nanjing University of Information Science and Technology, Nanjing, China
[email protected]
Abstract. Fueled by the rise of social media, documents on these platforms (e.g., Twitter, Weibo) are increasingly multimodal in nature, with images in addition to text. To automatically analyze the opinion information in multimodal data, it is crucial to perform aspect term extraction (ATE) on them. However, research focusing on multimodal ATE has so far been rare. In this study, we take a step further than previous studies by proposing a Region-aware Alignment Network (RAN) that aligns text with the object regions shown in an image for the multimodal ATE task. Experiments on the Twitter dataset showcase the effectiveness of our proposed model. Further analysis shows that our model performs better when extracting emotion-polarized aspect terms.

Keywords: Aspect term extraction · Multimodal learning · Image-text alignment

1 Introduction
In recent years, social media, as platforms that allow users to share and exchange ideas, have become more and more popular. As a popular social media platform may have an enormously wide range of users and a large amount of valuable information, mining data from such platforms has become a research focus and has attracted many researchers' attention. One particular research area in social media is aspect-based sentiment analysis (ABSA), which aims to detect the aspect terms explicitly mentioned in sentences and predict the sentiment polarities over the aspect terms [3,15,16]. Aspect term extraction (ATE) is a fundamental task in ABSA; it focuses on extracting all aspect terms in a sentence. However, social media like Twitter have their own characteristics. First, they contain informal language, slang and typos. Second, users tend to post in multimodal format. As the short textual components of social media posts provide only limited contextual information, in most cases images can help. Figure 1 shows an example of a tweet and its corresponding image. We can notice that, for the ATE task, the aspect terms we would like to extract are always eye-catching in the image. This attribute makes the task more challenging and promising.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 145–156, 2020. https://doi.org/10.1007/978-3-030-60450-9_12

Fig. 1. An example of a multimodal tweet. In this tweet, the aspect terms Mario and Luigi are shown in the corresponding image.

For decades, much research has been done on ATE tasks. However, despite its importance, most of the research on ATE still focuses on traditional news articles and longer text documents, and much less on multimodal social media data. Multimodal ATE is a new dimension of traditional text-based ATE, which goes beyond the analysis of plain text and includes other modalities such as images. In recent years, several studies have been done to solve the task [23],
while they still suffer from some limitations. Building on previous studies, there are some other important problems that need to be addressed in ATE tasks. Firstly, previous methods mainly use pre-trained Word2Vec embeddings as the input of the proposed model, which performed well on the task. However, Word2Vec represents each token with only one embedding, which may cause confusion when the model meets polysemous words. More importantly, it is valuable to capture the entities within images. Normally, the image and the text in the same tweet are highly correlated. According to prior work [23], in most cases a word is related to a small region of the input image. This idea inspires us to further match each word with certain object regions in the image, because the entities we would like to tag in the sentence are likely to relate to certain objects in the image, not to the whole picture. Aside from that, for efficiency, we can also reduce the complexity of extracting image features by obtaining the feature representations of objects beforehand. To address the above limitations, we build our model on top of the recent BERT [4] architecture, which helps obtain contextualized word representations. For image features, we adopt Faster-RCNN [14] to extract object features in images. In this work, by analyzing the problems mentioned above, we offer the following contributions:
– To alleviate the data sparseness and polysemy problems, we adopt BERT to embed our input text.
– The task of multimodal ATE has not been carefully studied in previous work. In this paper, we propose a better solution for aligning text with its corresponding image and evaluate several models for the task.
– We achieve state-of-the-art results for the ATE task on the multimodal dataset [23].
2 Related Work

2.1 Aspect Term Extraction

The ATE task aims at extracting the aspect and opinion terms that explicitly occur in a sentence. Early research mainly focused on rule-based methods for ATE. Hu et al. [7] leveraged frequent nouns or noun phrases to extract aspect terms and tried to identify opinion terms by exploiting the relationships and occurrences between aspect terms and opinion terms. Later works treat ATE as sequence labeling and apply machine learning methods such as Hidden Markov Models or Conditional Random Fields. However, the performance of these methods relies on hard-coded rules or on the quality of the features. In recent years, many studies have outperformed traditional systems by introducing neural network architectures. A large amount of work used pre-trained word embeddings as the input of a Recurrent Neural Network (RNN) for ATE [9]. To make the model simpler and faster, Xu et al. [18] proposed a Convolutional Neural Network (CNN) model with a double embeddings mechanism. The model enhanced the performance but has the flaw of ignoring the different importance of the two embeddings. He et al. [6] introduced the attention mechanism into the ATE task by proposing an attention-based model and proved its efficiency. Recently, in consideration of the relationship between Aspect Category Classification (ACC) and ATE, multi-task models [17,20] have emerged to handle the two tasks simultaneously.

2.2 Multimodal Learning

With the prevalence of multiform user-generated data (text, image, video, etc.), multimodal sentiment analysis is drawing research attention. Finding alignments between text and other forms of resources is explored as multimodal learning. The text-image pair is the most common case. As with ATE, multimodal learning also relied on feature-based models in the early years [2]. With the development of deep learning, neural network-based models gradually emerged. Yu et al. [22] use pre-trained text and image CNNs to extract feature representations from both modalities and use their combination as the input of their proposed model. Nowadays, multimodal learning has been successfully applied to tasks such as Image Captioning [8,19] and Visual Question Answering [11,21]. Different from those tasks, which have 1-to-1 alignments between text and images, in multimodal ATE each word can be related to different parts of one image. To fill the gap between multimodal sentiment analysis and aspect-based sentiment analysis, we propose a Region-aware Alignment Network (RAN) for the ATE task in this paper. Our model takes image objects into consideration and builds an attention network that captures both image and text features before final decoding.
3 Methods

In this section, we describe our proposed model, inspired by [23], for multimodal datasets. Suppose the multimodal input includes a sentence T = {W_1, W_2, ..., W_L} and a corresponding image I. The goal of our model is to predict the label of each word in the scheme {B, I, O}, where B indicates the beginning of an aspect term, I indicates the inside and the end of an aspect term, and O marks non-target words. Figure 2 shows the architecture of the multimodal ATE framework.

3.1 Feature Extraction
Image Feature Extraction. Previous studies [5] use features from the last layer of VGGNet or ResNet as the global feature representation of the input image. However, each aspect term in the text refers to a different region of the image. To solve this problem, the works in [1,23] chose the features from the last pooling layer, which have a dimension of 512 × 7 × 7, where 7 × 7 is the number of image regions. However, we notice that the aspect terms are strongly aligned with the objects in the corresponding image, whereas other regions have little effect. Applying the attention mechanism to all regions in the image may not only introduce noise but also make it more difficult for the model to extract useful features from the image. Based on this hypothesis, we chose the Faster-RCNN model to extract object features. An image can be represented as $\tilde{o}_I = \{\tilde{o}_i \mid \tilde{o}_i \in \mathbb{R}^{d_o}, i = 1, 2, \ldots, N\}$, where N is the number of objects, more precisely the top N features after Non-Maximum Suppression (NMS), and $\tilde{o}_i$ is the feature representation of object i. For convenience of calculation, we transform each object feature vector with a single-layer perceptron before sending it into the model:

$$o_I = \tanh(W_I \tilde{o}_I + b_I) \tag{1}$$

where $W_I$ and $b_I$ are parameters and $\tilde{o}_I$ is the input feature vector.

Fig. 2. (a) The architecture of our proposed model; (b) the detail of adaptive co-attention network.

Text Representation. For the word-level representation, we leverage BERT contextual embeddings to encode the original text. Specifically, we use the output embedding from the last transformer layer. Since the word-level representation suffers from the out-of-vocabulary problem, we build a character-level representation, which also helps capture morphological information. In detail, each word is projected into a character vector set $[c_1, c_2, \cdots, c_m]$, where m is the length of the word. k groups of filters with different kernel sizes $[l_1, l_2, \cdots, l_k]$ are applied for the convolution. After the transformation of a word, we obtain the sequence for each filter:

$$F_j = [\cdots; \tanh(C_j \cdot F_{[i:i+l_j-1]} + b_j); \cdots] \tag{2}$$

where i is the index of the convolutional window. Finally, after applying a max-pooling layer $w_j = \max(F_j)$, we obtain the final representation w of the word by concatenation:

$$w = [w_1 \oplus w_2 \oplus \cdots \oplus w_k] \tag{3}$$

Then we concatenate the word-level and character-level vectors of each word and feed them into a bidirectional LSTM layer to generate a hidden state matrix.
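The character-level convolution and max-pooling of Eqs. (2) and (3) can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: each filter here produces a single scalar, whereas the paper uses groups of filters per kernel size, and the fallback value of 0.0 for words shorter than a kernel is an assumption. The function name and the filter format are hypothetical.

```python
import math


def char_cnn_word_vector(char_vectors, filters):
    """char_vectors: list of m character embeddings (lists of floats).
    filters: list of (kernel_size, weights, bias), where weights is a flat list
    of length kernel_size * char_dim (hypothetical format).
    Returns the concatenation of one max-pooled activation per filter."""
    pooled = []
    for size, weights, bias in filters:
        activations = []
        for i in range(len(char_vectors) - size + 1):
            # Flatten the window of `size` consecutive character vectors.
            window = [x for vec in char_vectors[i:i + size] for x in vec]
            pre = sum(w * x for w, x in zip(weights, window)) + bias
            activations.append(math.tanh(pre))
        # Max-pooling over window positions; 0.0 if the word is too short (assumption).
        pooled.append(max(activations) if activations else 0.0)
    return pooled


# Toy 2-dimensional character embeddings and two filters, invented for illustration.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    (2, [0.5, 0.5, 0.5, 0.5], 0.0),
    (3, [1.0, 0.0, 0.0, 0.0, 0.0, -1.0], 0.1),
]
print(char_cnn_word_vector(chars, filters))
```

The resulting vector would then be concatenated with the word-level embedding before the bidirectional LSTM.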
At each time step, we obtain the forward and backward hidden states ($\overrightarrow{h_t}$ and $\overleftarrow{h_t}$). The final output is the concatenation of the forward and backward hidden states:

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}] \tag{4}$$

3.2 Adaptive Co-attention Network
The Adaptive Co-attention Network [23] is an architecture that can learn the shared information between text and images. In our assumption, each word, especially an aspect term, is connected to several objects of an input image. Thus, attention is applied to decide which objects a word should attend to:

$$\begin{aligned} z_t &= \tanh(W_{o_I} o_I \oplus (W_{h_t} h_t + b_{h_t})) \\ \alpha_t &= \mathrm{softmax}(W_{\alpha_t} z_t + b_{\alpha_t}) \end{aligned} \tag{5}$$

where $W_{o_I}$, $W_{h_t}$ and $W_{\alpha_t}$ are parameters, $h_t \in \mathbb{R}^d$ is the input text representation from Eq. (4), d is the dimension of the input feature, $o_I \in \mathbb{R}^{d \times N}$ is the input feature map obtained by Eq. (1), and N is the number of object feature vectors. ⊕ denotes the concatenation of one word feature vector with each object feature vector of the image, and $\alpha_t$ is the attention probability vector. Thus, the new image-word vector $\hat{o}_t$ is:

$$\hat{o}_t = \sum_i \alpha_{t,i} o_i \tag{6}$$

where $\alpha_{t,i}$ is the attention weight of word vector $h_t$ on object i. After the previous step, we similarly decide which words the word $h_t$ should attend to:

$$\begin{aligned} z_t &= \tanh(W_x x \oplus (W_{x,\hat{o}_t} \hat{o}_t + b_{x,\hat{o}_t})) \\ \beta_t &= \mathrm{softmax}(W_{\beta_t} z_t + b_{\beta_t}) \end{aligned} \tag{7}$$

where $W_x$, $W_{x,\hat{o}_t}$ and $W_{\beta_t}$ are parameters, $x \in \mathbb{R}^{d \times n}$, d is the dimension of the features and n is the text length. $\hat{o}_t \in \mathbb{R}^d$ is the image-word vector obtained in the previous step. ⊕ denotes the concatenation of one image-word vector with each word vector, and $\beta_t \in \mathbb{R}^n$ is the attention probability vector. We obtain the word-word vector by:

$$\hat{h}_t = \sum_j \beta_{t,j} h_j \tag{8}$$

where $\beta_{t,j}$ is the attention weight of image-word vector $\hat{o}_t$ on word j. Then the multimodal vector is obtained through the fusion of the image-word vector and the word-word vector:

$$\begin{aligned} h_{\hat{o}_t} &= \tanh(W_{\hat{o}_t} \hat{o}_t + b_{\hat{o}_t}) \\ h_{\hat{h}_t} &= \tanh(W_{\hat{h}_t} \hat{h}_t + b_{\hat{h}_t}) \\ g_t &= \sigma(W_{g_t} (h_{\hat{o}_t} \oplus h_{\hat{h}_t})) \\ m_t &= g_t h_{\hat{o}_t} + (1 - g_t) h_{\hat{h}_t} \end{aligned} \tag{9}$$
where $W_{\hat{o}_t}$, $W_{\hat{h}_t}$ and $W_{g_t}$ are parameters and σ is the logistic sigmoid activation. The fused vector $m_t$ is obtained through the gate $g_t$. As text is the most important information in the ATE task, images may sometimes be unnecessary and introduce noise. To filter out this noise, a filtration gate is applied to combine the text vector and the multimodal vector:

$$\begin{aligned} s_t &= \sigma(W_{s_t,h_t} h_t \oplus (W_{m_t,s_t} m_t + b_{m_t,s_t})) \\ u_t &= s_t (\tanh(W_{m_t} m_t + b_{m_t})) \\ \hat{m}_t &= W_{\hat{m}_t} (h_t \oplus u_t) \end{aligned} \tag{10}$$

where $h_t$ is the hidden state of the bidirectional LSTM at time step t and $u_t$ is the multimodal feature after filtration.

3.3 Decoder
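The decoder operates on vectors that are ultimately derived from attention-weighted object features. As a rough illustration of that input, the word-guided visual attention of Eqs. (5) and (6) can be sketched in plain Python as below. This is a minimal sketch rather than the authors' implementation: the weights are passed in explicitly instead of being learned, the attention parameter is taken to be a single weight vector, and all function and argument names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_objects(h_t, objects, W_o, W_h, b_h, w_alpha, b_alpha):
    """Word-guided attention over object features.
    h_t: word hidden state; objects: list of N projected object vectors o_i.
    Returns the attention weights alpha_t and the image-word vector o_hat_t."""
    word_part = [v + b for v, b in zip(matvec(W_h, h_t), b_h)]
    scores = []
    for o_i in objects:
        # Concatenate the projected object with the projected word, then tanh (Eq. 5).
        z = [math.tanh(v) for v in matvec(W_o, o_i) + word_part]
        scores.append(sum(w * v for w, v in zip(w_alpha, z)) + b_alpha)
    alpha = softmax(scores)
    dim = len(objects[0])
    # Weighted sum of object vectors (Eq. 6).
    o_hat = [sum(a * o[k] for a, o in zip(alpha, objects)) for k in range(dim)]
    return alpha, o_hat


# Toy 2-dimensional example with identity projections, invented for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]
alpha, o_hat = attend_objects(
    h_t=[1.0, 0.0], objects=objects, W_o=identity, W_h=identity,
    b_h=[0.0, 0.0], w_alpha=[1.0, 0.0, 0.0, 0.0], b_alpha=0.0,
)
print(alpha, o_hat)
```

The text-side attention of Eqs. (7) and (8) follows the same pattern with the roles of word vectors and the image-word vector exchanged.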
For tag prediction, the final output $\hat{m} = \{\hat{m}_1, \hat{m}_2, \ldots, \hat{m}_n\}$ is passed to a Conditional Random Field (CRF) layer. CRF has proved useful in tasks where the output labels have strong dependencies (e.g., "I" cannot follow "O"). Given a text sequence $X = [x_1, x_2, \ldots, x_n]$ and the corresponding label sequence $y = [y_1, y_2, \ldots, y_n]$, the probability of the tag sequence is calculated as:

$$p(y|X) = \frac{\prod_{i=1}^{T} \Omega_i(y_{i-1}, y_i, X)}{\sum_{y' \in Y} \prod_{i=1}^{T} \Omega_i(y'_{i-1}, y'_i, X)} \tag{11}$$

where $\Omega_i(\cdot)$ is the potential function. We use maximum conditional likelihood to learn the best parameters, which maximize the log-likelihood:

$$L(p(y|X)) = \sum_i \log p(y|X) \tag{12}$$

While decoding, we predict the output sequence by:

$$y^* = \arg\max_{y' \in Y} p(y'|X) \tag{13}$$

4 Experiments

4.1 Experimental Settings
Datasets. We conduct our experiments on the Twitter TMSC datasets provided by [10,23], which were collected and selected through Twitter's API. The dataset includes 6407 items in total, which we randomly divide into a training set (80%, 5125), a development set (10%, 641) and a test set (10%, 641). Each item contains a sentence split into a word sequence, the label of each word, and the image related to the sentence. In addition, our dataset contains 4060 positive aspect terms, 1358 negative aspect terms and 5884 neutral aspect terms in total.
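The split sizes above can be reproduced with a simple random partition. The sketch below is only an illustration: the seed, the rounding rule (10% rounded to the nearest integer for the development and test sets, remainder for training) and the function name are assumptions, not details given by the authors.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle and split into train/dev/test with roughly 80/10/10 proportions.
    Dev and test each take round(10%) of the items; training takes the rest."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility (assumption)
    n_dev = n_test = round(len(items) * 0.1)
    n_train = len(items) - n_dev - n_test
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_dataset(range(6407))
print(len(train), len(dev), len(test))
```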
Hyper-parameters. For images, we apply zero-padding when an image has fewer than k extracted objects. For the BERT word embedding, the vector dimension is fixed to 768. The embeddings of out-of-vocabulary words are randomly initialized from the uniform distribution over [−0.25, 0.25]. For the character-level embedding, the dimension is fixed to 30, and the embeddings are also initialized randomly from a uniform distribution over [−0.25, 0.25]. We use three groups of 32 kernels with sizes of 2, 3 and 4 for the character-level CNN. The output dimensions of the bidirectional LSTM and the character-level CNN are 200 and 50, respectively. The sentence length and word length are set to 35 and 30, respectively. We use the Adam optimizer with a learning rate of 0.001. The batch size is set to 16 for training.

Baselines. To comprehensively verify the effectiveness and advantages of our proposed model, we compare it with previous state-of-the-art models on the same corpus. The models are listed as follows:
– CRF trains a model with basic word embeddings.
– BiLSTM+CRF uses a bidirectional LSTM to extract word features with both forward and backward information. It is the basic structure in many sequence labeling tasks and is an end-to-end system that requires no feature engineering.
– CNN+BiLSTM+CRF was proposed in [12]. Based on the previous model, it adopts a CNN to extract character-level features. It is widely used in sequence labeling tasks and was reported to have achieved the best result on the CoNLL 2003 test set, with an F1-measure of 91.21%.
– Multi-task was proposed in [13]. It is a framework that deals with the aspect term extraction and opinion polarity classification tasks simultaneously.
– ACN+VGG+BiLSTM+CRF was proposed in [23]. The model introduced an adaptive co-attention network that combines visual and textual information to recognize named entities.
– BERT+CNN+BiLSTM+CRF is our model without image information. This baseline is used to demonstrate the effectiveness of images in the ATE task.
– RAN(Word2Vec) is our model with Word2Vec vectors as the input text embedding. For the word-level representation, we use traditional word embeddings pre-trained on tweets, which represent each word with 200 dimensions.

4.2 Experimental Results and Analysis
We performed experiments on each method above and our proposed model. The results are shown in Table 1. In consideration of fairness, we take the mean value of 10 times experiments. For our proposed model, we compared the performance by using different number of extracted object features (4, 9, 12, 16, 25, 49) as the input of the model. As results shown in Table 2, the F1-measure enhances when extract more
Multimodal Aspect Extraction with Region-Aware Alignment Network
153
Table 1. Experimental results. The best scores are in bold. Model
Metric Accuracy Precision Recall F1-measure
CRF BiLSTM+CRF CNN+BiLSTM+CRF Multi-task VGG+ACN BERT+CNN+BiLSTM+CRF
0.902 0.945 0.951 0.953 0.953 0.953
0.537 0.782 0.788 0.799 0.800 0.795
0.569 0.770 0.831 0.814 0.820 0.828
Our model (Word2Vec) Our model (BERT)
0.954 0.955
0.809 0.813
0.834 0.821 0.845 0.829
0.553 0.776 0.805 0.806 0.810 0.811
Table 2. Results of our proposed model with different number of object vectors. Object number
Accuracy Precision Recall F1-measure
4 objects
0.950
9 objects
0.955
0.793
0.820
0.807 0.817
0.810
0.825
16 objects (our choise) 0.955
0.813
0.845 0.829
25 objects
0.955
0.808
0.817
0.812
49 objects
0.951
0.780
0.808
0.794
objects in the beginning, reaches its peak, and then declines as more objects are extracted. The optimum object number is 16. This is interpretable because most sentences in our dataset have fewer than 6 aspect terms. Since the extracted features may contain repeats, using slightly more features than the number of aspect terms compensates for the duplication. As the object number increases further, noise is introduced to the model, and the performance degrades.

To compare our model with other baselines, we use the same pre-trained Twitter word embeddings provided by [23] for all baselines that require word embedding input, on all of the datasets. We also align the train/dev/test configurations for all methods. The experimental results suggest that our proposed framework consistently gives the best F1 score and outperforms the other baselines in most cases. In particular, using 16 extracted object features, our proposed RAN (Word2Vec) improves precision by 1.1%, recall by 1.7% and F1-measure by 1.4% over the network proposed in [23], which is the state-of-the-art model for the multimodal named entity recognition task. This suggests that using object features as the input aligns better with the text features. Finally, the performance improves considerably when image information is introduced. Our model is based on the BERT+CNN+BiLSTM+CRF architecture, with image information added to it, and the result demonstrates the contribution of image features. Further, we rebuilt our model with BERT instead of Word2Vec and achieved the best scores on all evaluation metrics, with gains of 0.4% in precision, 1.3% in recall and 1.0% in F1-measure over RAN (Word2Vec), which shows that the BERT representation improves performance on our ATE task.

To better unveil the potential of our proposed method, we analyzed the model's ability to extract aspect terms of different sentiment polarities. Because our model does not predict sentiment polarity, we use recall alone to measure its performance. Concretely, suppose there are k positive aspect terms in our corpus and m of them are extracted successfully by our model; the recall rate is then m/k. Experiments show that the average recall rates of positive, negative and neutral aspect terms are 0.947, 0.851 and 0.782, respectively. We can see that the recall on positive aspect terms is 2.26% higher in comparison with the recall on neutral aspect terms. The improvement on negative aspect terms is a bit lower but still improves the result by 1.46%. The lower performance might be caused by a limitation of our dataset, which contains fewer negative aspect terms than others. The result indicates that our proposed model, which aligns text with object features, is better able to capture the emotions associated with an object in the image, which benefits the extraction of aspect terms. Since these are correlated tasks in ABSA, the finding sheds light on our future research in aspect-based sentiment classification.
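The per-polarity recall m/k described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the function name, the `(term_id, polarity)` input format and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def recall_by_polarity(gold_terms, extracted_ids):
    """Compute m/k per polarity.

    gold_terms: list of (term_id, polarity) pairs (hypothetical format).
    extracted_ids: set of term ids the model extracted successfully.
    """
    total = defaultdict(int)  # k: gold aspect terms per polarity
    hits = defaultdict(int)   # m: gold aspect terms that were extracted
    for term_id, polarity in gold_terms:
        total[polarity] += 1
        if term_id in extracted_ids:
            hits[polarity] += 1
    return {polarity: hits[polarity] / total[polarity] for polarity in total}


# Toy data, invented for illustration only.
gold = [(0, "positive"), (1, "positive"), (2, "negative"),
        (3, "neutral"), (4, "positive"), (5, "neutral")]
found = {0, 1, 2, 5}
recalls = recall_by_polarity(gold, found)
```

Because the polarity comes from the gold annotation only, the model itself never needs to predict sentiment for this measurement.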
5 Conclusion
In this paper, we proposed a novel approach to the task of multimodal aspect term extraction. To better utilize hidden information in images, we use object and text features as the input and enhance the performance of ATE through the RAN model on the Twitter dataset. The experimental results show that our proposed model outperforms the other baseline models. For the benefit of future work, we also analyze the performance of our model on aspect terms of different sentiment polarities and demonstrate the model's potential in other correlated ABSA tasks.

Acknowledgements. We would like to thank the Jiangsu Provincial Key Laboratory of Network and Information and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, for providing the environment for our experiments. This work is supported in part by the Industrial Prospective Project of Jiangsu Technology Department under Grant No. BE2017081 and the National Natural Science Foundation of China under Grant No. 61572129. This work is also supported by a project funded by the China Postdoctoral Science Foundation, No. 2019-M661930.
References

1. Arshad, O., Gallo, I., Nawaz, S., Calefati, A.: Aiding intra-text representations with visual context for multimodal named entity recognition. In: 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, pp. 337–342 (2019)
2. Borth, D., Ji, R., Chen, T., Breuel, T.M., Chang, S.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: ACM Multimedia Conference, MM 2013, pp. 223–232 (2013)
3. Chen, X., et al.: Aspect sentiment classification with document-level sentiment preference modeling. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3667–3677. Association for Computational Linguistics (2020)
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
5. Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? Dataset and methods for multilingual image question. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 2296–2304 (2015)
6. He, R., Lee, W.S., Ng, H.T., Dahlmeier, D.: An unsupervised neural attention model for aspect extraction. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 1: Long Papers, pp. 388–397 (2017)
7. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 2004, pp. 168–177 (2004)
8. Karpathy, A., Li, F.: Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 2015, pp. 3128–3137 (2015)
9. Liu, P., Joty, S.R., Meng, H.M.: Fine-grained opinion mining with recurrent neural networks and word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pp. 1433–1443 (2015)
10. Lu, D., Neves, L., Carvalho, V., Zhang, N., Ji, H.: Visual attention model for name tagging in multimodal social media. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Volume 1: Long Papers, pp. 1990–1999 (2018)
11. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 289–297 (2016)
12. Ma, X., Hovy, E.H.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers (2016)
13. Nguyen, H., Shirai, K.: A joint model of term extraction and polarity classification for aspect-based sentiment analysis. In: 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pp. 323–328 (2018)
14. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 91–99 (2015)
15. Wang, J., et al.: Aspect sentiment classification with both word-level and clause-level attention networks. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4439–4445. International Joint Conferences on Artificial Intelligence Organization (2018)
16. Wang, J., et al.: Aspect sentiment classification towards question-answering with reinforced bidirectional attention network. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3548–3557. Association for Computational Linguistics (2019)
17. Wu, H., Wang, Z., Liu, M., Huang, J.: Multi-task learning based on question-answering style reviews for aspect category classification and aspect term extraction. In: Seventh International Conference on Advanced Cloud and Big Data, CBD, vol. 2019, pp. 272–278 (2019)
18. Xu, H., Liu, B., Shu, L., Yu, P.S.: Double embeddings and CNN-based sequence labeling for aspect extraction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 592–598. Association for Computational Linguistics (2018)
19. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pp. 2048–2057 (2015)
20. Yang, H., Zeng, B., Yang, J., Song, Y., Xu, R.: A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. CoRR abs/1912.07976 (2019)
21. Yu, D., Fu, J., Mei, T., Rui, Y.: Multi-level attention networks for visual question answering. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 4187–4195 (2017)
22. Yu, Y., Lin, H., Meng, J., Zhao, Z.: Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms 9(2), 41 (2016)
23. Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: AAAI Conference on Artificial Intelligence (2018)
NER in Threat Intelligence Domain with TSFL

Xuren Wang1,2(B), Zihan Xiong1,2(B), Xiangyu Du2, Jun Jiang2, Zhengwei Jiang2, and Mengbo Xiong1,2

1 Information Engineering College, Capital Normal University, Beijing 100048, China
{wangxuren,2181002064}@cnu.edu.cn
2 Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
Abstract. To deal with increasingly sophisticated Advanced Persistent Threat (APT) attacks, it is indispensable to convert cybersecurity threat intelligence into structured or semi-structured data specifications. In this paper, we cast the task of extracting indicators of compromise (IOC) as a sequence labeling task, namely named entity recognition. We construct a dataset for named entity recognition in the threat intelligence domain and train word vectors for that domain. We also propose a new loss function, TSFL, which combines a triplet loss function based on metric learning with a sorted focal loss function, to address the unbalanced distribution of data labels. Experiments show that the F1 value improves on both public-domain datasets and threat intelligence data.

Keywords: Cybersecurity threat intelligence · Advanced persistent threat · Metric learning · Focal loss
1 Introduction

Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications and actionable advice, about an existing or emerging menace or hazard to assets that can be used to inform decisions regarding the subject's response to that menace or hazard. That is to say, it is a dataset collected against indicators of cyber security threats, attackers, malware, vulnerabilities, etc. Threat intelligence is often shared in unstructured text, as shown in Fig. 1.
Fig. 1. A sample for unstructured text
To deal with more sophisticated Advanced Persistent Threat (APT) attacks, which are attacks funded by a state, government or intelligence agency, or carried out by an attack group with a relevant background, it is necessary to exchange cybersecurity threat intelligence via structured or semi-structured data specifications.

© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 157–169, 2020. https://doi.org/10.1007/978-3-030-60450-9_13
Recently, researchers have sought to convert unstructured threat intelligence into semi-structured or structured data formats by applying NLP sequence labeling technology to extract the attack indicators in threat intelligence. As shown in Fig. 1, the attack group "TG-3390" exploits the vulnerability "CVE-2011-3544" in the "Java Runtime Environment", with the ultimate goal of delivering the malware "HttpBrowser". A sequence labeling model can identify the attack indicators that we have defined and thus transform the text into semi-structured or structured data. However, current sequence labeling in the field of threat intelligence still faces the following challenges: 1) the lack of open-source datasets for named entity recognition research in the threat intelligence field; 2) the particularity of unstructured APT report data.

The particularity of APT report data is reflected in two aspects. The first is the large number of domain-specific words, such as advanced persistent threat groups (APT groups), common vulnerabilities and exposures (CVE), malware names and hash values, which are prone to serious out-of-vocabulary (OOV) problems during training. The second is the structure of APT reports, which differs from that of general-domain NER datasets. APT reports are usually extremely long, and so are their sentences. Correspondingly, the frequency of entities in a sentence is very low, which leads to a serious imbalance in the distribution of data labels. In addition, the reports contain tables that matter for threat intelligence but do not conform to the syntactic structures of natural language. Existing NER models therefore find it difficult to directly identify the semantic information expressed in such tables, resulting in poor recognition results.
In this paper, we cast the task of extracting indicators of compromise (IOC) as an NLP sequence labeling task, namely named entity recognition. The commonly used model in named entity recognition is BiLSTM-CRF [2, 3]. Although BiLSTM can handle long-term dependencies in sequences, it has difficulty capturing syntactic information in sentences. The Transformer has shown that it can capture information from more channels in a sentence, and the pre-trained models of [4, 5] play an important role in downstream NLP tasks. However, due to the particularity of the data in the threat intelligence field mentioned above, simply using an existing NER model cannot solve the IOC extraction task in this field. Therefore, we make the following contributions, based on the characteristics of data in the threat intelligence field:

• First, we use the Word2vec model to train word vectors in the threat intelligence field, and then concatenate them with the word vectors of the BERT pre-trained model as the final word vector feature to alleviate the OOV problem.
• Then, we add an attention mechanism after the BiLSTM to redistribute the weights over the sequence. Since table-formatted content in APT reports usually contains a large number of entities, we pay more attention to words with entity tags.
• Finally, we improve the loss function of the CRF layer. We are the first to propose applying metric learning to the NER domain. We introduce the sorted focal loss function and the triplet loss function based on metric learning to balance the uneven distribution of tags.
2 Related Works

2.1 NER Task

Collobert et al. [1] proposed two network structures, the window approach and the sentence approach, to perform NER; this is one of the representative works on NN/CNN-CRF neural network models. To make the vector representation of text more accurate, Lample et al. [2] and Ma and Hovy [3] introduced character-level vector representations, using BiLSTM and CNN respectively to capture the character-level information inside a word. These works gave rise to two classic neural NER models, BiLSTM-CRF and BiLSTM-CNNs-CRF. Research in the past two years shows that pre-trained models using a transfer learning strategy play a crucial role in the NER task. ELMo, proposed by Peters et al. [4], uses multi-layer BiLSTM to represent words, which not only characterizes the grammatical and semantic features of vocabulary but also changes with the context. Devlin et al. [5] proposed the language representation model BERT, whose architecture is a multi-layer bidirectional Transformer encoder that pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. BERT Large also surpassed previous mainstream models.

2.2 NLP for Cyberthreat Detection

Joshi et al. [6] implemented an information recognition method for network text data, using the CRF algorithm to identify network security related entities and relationships. Their data is derived from national vulnerability databases, security bulletins, news reports and network security blogs. With the extracted network security concepts and vulnerability descriptions, CRF-based systems are used to identify text-related entities, concepts and relationships. Sabottke et al. [7] proposed a Twitter-based vulnerability detector using an SVM classifier, which extracts vulnerability-related information from Twitter, supplements it with other sources, and predicts whether the vulnerability will be exploited in practice. Liao et al. [8] developed iACE, an automated tool for IOC extraction. The ChainSmith system designed by Zhu et al. [9] automatically extracts IOCs from technical articles and industry reports and classifies them into different stages of attack activity (baiting, exploitation, installation, and command and control). Rule-based methods are used in the system to identify information such as hashes, IPs and URLs. Nuno et al. [10] designed a Bi-LSTM-CRF model to extract IOC entities related to threat intelligence from Twitter data. Unlike other information extraction work in the field of cybersecurity, a binary classifier based on Convolutional Neural Networks (CNN) is added before entity extraction to determine whether a short text is relevant to threat intelligence. IOC extraction is applied if the text is relevant; otherwise the data is discarded. Zhou et al. [11, 12] applied end-to-end sequence labeling to IOC recognition, introducing an attention mechanism and regularization features of entity spelling, and recognized IOCs in the long sentences of network security reports.
2.3 Metric Learning

In mathematics, a metric is a function that defines the distance between elements of a set, and a set equipped with a metric is called a metric space. Xing et al. [13] first proposed the concept of metric learning. The basic principle is to learn a task-specific distance function automatically rather than fixing it in advance. Metric learning was originally applied to face recognition and was later carried over to text classification. For text processing of high-dimensional data in particular, metric learning classifies well. Deep metric learning usually consists of three parts: a feature extraction network that maps inputs to embeddings, a sampling strategy that combines the samples in a mini-batch into many subsets, and a loss function that computes the loss on each subset. Contrastive loss [14] first introduced deep neural networks to metric learning. Triplet loss [15] further considers the relative relationship between pairs within a class and pairs between classes.
3 Models

Our model is shown in Fig. 2. It has four parts: a representation layer, a BiLSTM layer with self-attention, a fully connected layer based on metric learning, and a CRF layer with a new loss function. The representation layer concatenates the word-level and character-level feature representations, where the word-level part concatenates the word vectors generated by APTWord2vec and BERT. The BiLSTM layer with self-attention extracts deeper features from the output of the representation layer. The fully connected layer based on metric learning maps the feature vectors to a metric space. The CRF layer with the new loss function infers the labels of the sequence and deals with the imbalance problem.
Fig. 2. Model for APT NER
3.1 Representation Layer

The input of the representation layer is the sequence of words $S = \{x_1, x_2, \ldots, x_i, \ldots, x_T\}$, where $S$ represents a sentence, $T$ represents the length of the sentence, and $x_i$ represents the i-th word in the sentence. The word-level feature representation fuses the Word2vec and BERT representations. After vector conversion, the word-level representation becomes $S = \{w_1, w_2, \ldots, w_i, \ldots, w_T\}$, where $w_i$ is the word-level representation of the i-th word. The corresponding character-level representation of the sequence is $S = \{c_1, c_2, \ldots, c_i, \ldots, c_T\}$, where $c_i$ is the character-level representation of the i-th word. We concatenate the word-level and character-level feature representations to obtain the final representation vector $e_i$:

$$e_i = \mathrm{concat}[w_i, c_i] \tag{1}$$
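The concatenation in Eq. (1) can be illustrated with a minimal sketch. The vectors below are toy values and the function names are hypothetical; in the paper, the word-level part comes from APTWord2vec and BERT and the character-level part from a character encoder.

```python
def concat(*vectors):
    """Concatenate feature vectors end to end."""
    merged = []
    for vec in vectors:
        merged.extend(vec)
    return merged


def represent_sentence(word_vecs, char_vecs):
    """Build e_i = concat[w_i, c_i] for every position i of a sentence."""
    if len(word_vecs) != len(char_vecs):
        raise ValueError("word-level and character-level sequences differ in length")
    return [concat(w, c) for w, c in zip(word_vecs, char_vecs)]


# Toy sentence with T = 2 words (all numbers invented for illustration).
apt_w2v = [[0.1, 0.2], [0.3, 0.4]]   # stand-in for APTWord2vec vectors
bert = [[0.5], [0.6]]                # stand-in for BERT vectors
W = [concat(a, b) for a, b in zip(apt_w2v, bert)]   # word-level w_i
C = [[1.0, 2.0], [3.0, 4.0]]         # character-level c_i
E = represent_sentence(W, C)
```

Each `E[i]` has dimension equal to the sum of the word-level and character-level dimensions, which is the input size the following BiLSTM layer would see.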
3.2 BiLSTM Layer with Self-attention

The sentence representation produced by the representation layer is passed to the BiLSTM layer for further feature extraction. The sentences of security reports are usually extremely long and contain few threat intelligence related entities, and some key context words usually appear around these entities. For example, "malware" or "toolset" may appear around "MiniDuke" to indicate that it is a malware. We therefore introduce the multi-head self-attention mechanism to focus attention on these context words. Each $e_i$ is mapped by the BiLSTM to the i-th hidden state $h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]$, and the multi-head self-attention vector is then computed with the following formulas:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concatenate}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \tag{3}$$

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{4}$$
where $\sqrt{d_k}$ is the scaling factor and the parameter matrices are $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$ and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$. In this work, we set h = 8 parallel attention layers, as in the original Transformer work. For each head, we use $d_k = d_v = d_{model}/h = 64$, and to speed up computation we share the weight parameters among the 8 heads, following the strategy of Lan [16].

3.3 Fully Connected Layer Based on Metric Learning

The idea of triplet loss [15] is to make the distance between negative pairs larger than the distance between positive pairs. During training, both a positive pair and a negative pair are selected, and the two pairs share one sample. Applied to named entity recognition, we define words with the same entity label as positive samples and words with other entity labels as negative samples. Training selects a current word $h_i$ from a sentence within a batch as the anchor, and then chooses two other words $h_j$ and $h_k$ from the batch. We take $h_j$ to be a positive sample with the same entity label as $h_i$. In contrast, $h_k$ is a negative sample with a different entity label. Together they form a triplet. The fully connected layer maps each entity word into the space in which distances are measured. We define the following loss function, measuring the distance between samples by the cosine angle:

$$\mathrm{Loss}_{Triplet} = \sum_{(i,j,k),\; h_{ij}=1,\; h_{ik}=0} \left[ D_{ij} - D_{ik} + m \right]_{+} \tag{5}$$

Here $D_{ij}$ is the distance between words i and j, m is the margin, $h_{ij} = 1$ marks a positive pair, $h_{ik} = 0$ marks a negative pair, and $[\cdot]_{+}$ denotes $\max(0, \cdot)$. The formula shows that the loss is 0 when the distance between the negative pair exceeds the distance between the positive pair by at least m. At this point, the model is considered to have learned well and needs no further update. The end result is that words with the same entity label move closer to each other, while words with different entity labels move farther apart. As shown in Fig. 3, the blue circles represent words labeled VULID. We choose the anchor word "CVE-2016-1063" and the positive sample word "CVE-2016-1064", which has the same entity label VULID as the anchor. The red square is the negative sample word "APT28", whose label is APT. The left figure shows the distances between the three words in the original space, and the right figure shows the distances after learning. We want words with the same entity label to become closer after learning, and words with different entity labels to become as far apart as possible.
Fig. 3. The triplet loss minimizes the distance between an anchor and a positive, both of which have the same entity label, and maximizes the distance between the anchor and a negative of different entity labels. (Color figure online)
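The triplet objective of Sect. 3.3 can be sketched for a single (anchor, positive, negative) triplet as follows. This is an illustration under stated assumptions, not the authors' implementation: the distance is taken to be 1 minus cosine similarity (one reading of "cosine angle"), the margin value 0.5 is hypothetical since the paper does not report m, and the vectors are toy values.

```python
import math


def cosine_distance(u, v):
    """Assumed distance: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on D(anchor, positive) - D(anchor, negative) + margin."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: anchor and positive share a label (e.g. VULID),
# the negative carries a different label (e.g. APT).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [0.0, 1.0]
well_separated = triplet_loss(anchor, positive, negative)
violating = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```

The first call yields zero loss because the negative is already farther than the positive by more than the margin; the second call, with the roles swapped, is penalized.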
3.4 CRF Layer with Sorted Focal Loss Function

In this part, we use a CRF to infer the final sequence labels and assign the most likely label $y_i$ to each word $w_i$. The loss function in the CRF layer involves two types of scores: the emission score and the transition score. The emission score is the output matrix $P_{i,y_i}$ of the BiLSTM layer, which represents the probability that the output label corresponding to word i is $y_i$. The transition score is the matrix $A_{y_i,y_{i+1}}$ of the CRF layer, which represents the transition relationship between labels. The CRF layer constrains the predicted labels to follow the labeling rules (BIOES rules). We define the score as follows:

$$\mathrm{score}(x, y) = \sum_{i=1}^{T} P_{i,y_i} + \sum_{i=0}^{T} A_{y_i,y_{i+1}} \tag{6}$$
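A small sketch of the path score in Eq. (6) may help. It assumes, as is common in BiLSTM-CRF implementations but not stated in the text, that $y_0$ and $y_{T+1}$ are special start and end tags; the tag indices and all score values are invented for illustration.

```python
def path_score(emissions, transitions, tags, start_tag, end_tag):
    """Sum emission scores P[i][y_i] and transition scores A[y_i][y_{i+1}].

    start_tag and end_tag play the role of y_0 and y_{T+1} (assumed boundary tags).
    """
    padded = [start_tag] + list(tags) + [end_tag]
    emission_total = sum(emissions[i][tag] for i, tag in enumerate(tags))
    transition_total = sum(transitions[a][b] for a, b in zip(padded, padded[1:]))
    return emission_total + transition_total


# Toy tag set: indices are hypothetical.
O, S_MAL, START, END = 0, 1, 2, 3
emissions = [[2.0, 0.5, 0.0, 0.0],    # word 1: scores per tag
             [0.25, 1.5, 0.0, 0.0]]   # word 2: scores per tag
transitions = [[0.0] * 4 for _ in range(4)]
transitions[START][O] = 0.5
transitions[O][S_MAL] = 0.25
transitions[S_MAL][END] = 0.25
score = path_score(emissions, transitions, [O, S_MAL], START, END)
```

For a sentence of T words the transition sum has T + 1 terms, matching the index range i = 0, ..., T in Eq. (6).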
During training, we minimize the log loss function to obtain the correct label sequence prediction:

$$\mathrm{Loss}_{CRF} = \log \sum_{\tilde{y} \in Y_x} e^{\mathrm{score}(x, \tilde{y})} - \mathrm{score}(x, y) \tag{7}$$
where $Y_x$ represents all possible tag sequences for a sentence. Because most APT reports are long texts, the data contain a large number of non-entity words, which are labeled O. This causes a serious label imbalance problem for named entity recognition in the threat intelligence domain. Named entity recognition is essentially a multi-class classification task, so the label imbalance problem becomes a multi-class imbalance problem. We add the focal loss [18] function to balance the label distribution in the CRF layer. The formula is shown below.

$$\mathrm{Loss}_{Focal} = -\alpha \left(1 - P(y|x)\right)^{\gamma} \log P(y|x) \tag{8}$$
where α is the balance factor, which is used to balance the uneven numbers of positive and negative samples, and γ is used to reduce the loss of O-label samples so that the model focuses more on samples with entity labels. $P(y|x)$ is the probability that the word x belongs to the label y. In addition, we sort the loss values. We believe that samples with entity labels are harder to train on in NER tasks. We call them difficult samples; they tend to have high loss values and are more helpful for NER training. After sorting, the samples with high loss values are selected and fed back to the neural network for training. In this way, we train more on the samples whose labels are under-represented. Finally, the loss function of the CRF layer becomes a sum of three losses, which we define as the TSFL function, where $\alpha_1, \alpha_2, \alpha_3$ are hyperparameters used to balance the three loss functions:

$$\mathrm{Loss}_{TSFL} = \alpha_1 \mathrm{Loss}_{CRF} + \alpha_2 \mathrm{Loss}_{Focal} + \alpha_3 \mathrm{Loss}_{Triplet} \tag{9}$$
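The focal term, the sorting step and the weighted sum of Eq. (9) can be sketched together. This is one possible reading, not the authors' code: the text does not specify how many high-loss samples are kept or how they are aggregated, so `keep_ratio` and the averaging are hypothetical choices. The default values α = 0.5, γ = 2 and (α1, α2, α3) = (1, 1, 0.1) are the settings reported in Sect. 4.2.

```python
import math


def focal_loss(p_true, alpha=0.5, gamma=2.0):
    """Eq. (8) for one token; p_true is P(y|x) for the gold label."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def sorted_focal_loss(p_true_per_token, keep_ratio=0.5, alpha=0.5, gamma=2.0):
    """Sort token losses in descending order and average the hardest ones.

    keep_ratio is a hypothetical knob; the paper does not report a value.
    """
    losses = sorted((focal_loss(p, alpha, gamma) for p in p_true_per_token),
                    reverse=True)
    k = max(1, int(len(losses) * keep_ratio))
    return sum(losses[:k]) / k


def tsfl(loss_crf, loss_focal, loss_triplet, a1=1.0, a2=1.0, a3=0.1):
    """Eq. (9): weighted sum of the three loss terms."""
    return a1 * loss_crf + a2 * loss_focal + a3 * loss_triplet


# Toy gold-label probabilities: two easy tokens, two hard tokens.
probs = [0.95, 0.9, 0.2, 0.1]
sfl = sorted_focal_loss(probs)
total = tsfl(loss_crf=2.0, loss_focal=sfl, loss_triplet=1.0)
```

With these toy probabilities only the two low-confidence tokens contribute to `sfl`, which is the intended effect of focusing training on difficult samples.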
4 Experiments

4.1 Dataset

Since there is no word vector model for the cybersecurity domain, we crawled 250 MB of reports, blogs, malware descriptions and vulnerability descriptions from numerous security company websites. We also added 131 MB of vulnerability descriptions from Common Vulnerabilities and Exposures as source data for training APTWord2vec. The number of articles crawled from each company is shown in Table 1.
Table 1. Crawled source data from security companies

Security company              Span        Number
Kaspersky                     2010–2020   215
FireEye                       2018–2020   123
Cisco                         2012–2020   1336
Microsoft security            2006–2020   2425
Threat micro                  2007–2020   1394
OpSec                         X–2020      804
McAfee                        2010–2020   591
Palo Alto Network & Unit42    2010–2020   651
Naked Security                2006–2020   8548
Avira                         2014–2020   807
Webroot                       2009–2020   1274
We manually annotated threat intelligence reports drawn from the collections of threat intelligence reports available in open-source GitHub repositories. Because manually labeled data are limited, we used the Easy Data Augmentation (EDA) [17] technique to augment the labeled data, mainly synonym replacement (SR): non-stop words are randomly selected from a sentence and replaced with randomly chosen synonyms. Table 2 gives an overview of the sizes of the data files after data augmentation. The unannotated data contain 260,258 tokens.

Table 2. Number of articles, sentences and tokens in each data file

Threat intelligence data   Sentences   Tokens
Training set               8,012       172,494
Development set            1,719       42,858
Test set                   1,807       44,906
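The synonym replacement step of EDA can be sketched as follows. This is a toy illustration only: EDA draws its synonyms from WordNet, whereas the tiny stop-word list and thesaurus below are invented so that the example stays self-contained, and the function name is hypothetical.

```python
import random

# Hypothetical miniature resources for illustration.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}
SYNONYMS = {"delivers": ["drops", "installs"], "malicious": ["harmful"]}


def synonym_replacement(tokens, n, rng):
    """Replace up to n randomly chosen non-stop words with a random synonym."""
    candidates = [i for i, tok in enumerate(tokens)
                  if tok.lower() not in STOP_WORDS and tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    augmented = list(tokens)
    for i in candidates[:n]:
        augmented[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return augmented


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = ["The", "group", "delivers", "a", "malicious", "payload"]
augmented = synonym_replacement(tokens, n=1, rng=rng)
```

Because each replacement is token for token, the sentence length is unchanged and the existing sequence labels stay aligned with the augmented tokens.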
Since there is no publicly available threat intelligence dataset for named entity recognition, we had to build the dataset ourselves. The definition of the 19 annotated entity types follows the OpenIOC specification, as shown in Table 3. We use the BIOES notation to mark entity boundaries, where B (Begin) marks the first word of an entity, I (Intermediate) a middle word, E (End) the last word, S (Single) a single-word entity, and O (Other) a non-entity word. For X ∈ {B, I, E, S} and entity label TAG, the final label of a word is X-TAG. A non-entity word is labeled O.
Table 3. Examples of the entities in cyber threat domain

Entity                 Label     Example
APT group              APT       APT32, OceanLotus
Security team          SECTEAM   Cisco
Organization           COMP      Google
OS                     OS        Windows
Email                  EMAIL     [email protected]
Location               LOC       China
IP address             IP        109.248.148.42
Domain                 DOM       globalowa.com
URL                    URL       http://shwoo.gov.taipei/buyer_flowchart.asp
Protocol               PROT      HTTP
File name              FILE      at.exe
Tool                   TOOL      PowerShell, EXE, Java script
MD5 value              MD5       11a9f798227be8a53b06d7e8943f8d68
SHA1 value             SHA1      906dc86cb466c1a22cf847dda27a434d04adf065
SHA2 value             SHA2      4741c2884d1ca3a40dadd3f3f61cb95a59b11f99a0f980dbadc663b85eb77a2a
Malware                MAL       CobaltStrike, Trojan.SH.MALXMR.UWEJP
Encryption algorithm   ENCR      DES
Vulnerability          VULNAME   zero-day
CVE ID                 VULID     CVE-2016-4117
We take Table 4 as an example of BIOES annotation. "Lazarus Group" is an APT group entity: "Lazarus" is its first word and is labeled B-APT, and "Group" is its last word and is labeled E-APT. "WhiskeyAlfa" is a single-word malware entity and is therefore labeled S-MAL. The remaining non-entity words are labeled O.

Table 4. Examples for BIOES labels

Sentence       Lazarus  Group  Malware  WhiskeyAlfa  Contains  …
BIOES labels   B-APT    E-APT  O        S-MAL        O         …
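The BIOES scheme illustrated in Table 4 can be written as a short conversion from entity spans to per-token labels. The function name and the `(start, end, label)` span format (token offsets, end exclusive) are assumptions made for this sketch.

```python
def bioes_tags(num_tokens, entities):
    """Convert entity spans into BIOES labels.

    entities: list of (start, end_exclusive, label) token spans (hypothetical format).
    """
    tags = ["O"] * num_tokens
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-word entity
        else:
            tags[start] = f"B-{label}"          # first word
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # middle words
            tags[end - 1] = f"E-{label}"        # last word
    return tags


tokens = ["Lazarus", "Group", "Malware", "WhiskeyAlfa", "Contains"]
tags = bioes_tags(len(tokens), [(0, 2, "APT"), (3, 4, "MAL")])
```

Applied to the sentence of Table 4, the spans for "Lazarus Group" (APT) and "WhiskeyAlfa" (MAL) reproduce the labels listed there.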
4.2 Training

Representation Layer. For word embedding, we experimented with GloVe, Word2vec, APTWord2vec and BERT, and chose 100 dimensions as the word vector size. For character embedding, we tested two different neural networks, LSTM and CNN, on the experimental data. The results show that using CNN with the baseline model achieves better results, as shown in Table 6. The CNN uses one convolution layer with the kernel size set to 3 and the dropout parameter set to 0.4.

AttBiLSTM-CRF Layer. For the neural network, we set the learning rate to 0.15. We used stochastic gradient descent for back propagation and 8-fold cross-validation. The training, validation and test datasets were split in the ratio 7 : 1.5 : 1.5. For the focal loss function, since the sequence labeling task is equivalent to a multi-class classification problem, the balance factor α was set to 0.5. The γ factor follows the setting of Lin et al. [18], γ = 2. For the final loss function, we set α1, α2, α3 to 1, 1 and 0.1, respectively.
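For reference, the hyperparameters reported in this section can be gathered into a single configuration object. The class and field names below are hypothetical; only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in Sect. 4.2 (field names are illustrative)."""
    word_dim: int = 100
    char_cnn_layers: int = 1
    char_cnn_kernel_size: int = 3
    char_cnn_dropout: float = 0.4
    learning_rate: float = 0.15
    optimizer: str = "sgd"
    cv_folds: int = 8
    split_ratio: tuple = (7.0, 1.5, 1.5)    # train : validation : test
    focal_alpha: float = 0.5
    focal_gamma: float = 2.0
    loss_weights: tuple = (1.0, 1.0, 0.1)   # alpha_1, alpha_2, alpha_3

    def split_fractions(self):
        """Normalize the split ratio into fractions that sum to 1."""
        total = sum(self.split_ratio)
        return tuple(part / total for part in self.split_ratio)


config = TrainingConfig()
fractions = config.split_fractions()
```

Normalizing the 7 : 1.5 : 1.5 ratio gives 70% training, 15% validation and 15% test data.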
4.3 Result

We first examined the effect of different word vector models on the NER task in the threat intelligence domain, as shown in Table 5. We also merged APTWord2vec with other word vectors. For this experiment we used the BiLSTM-CRF model with self-attention (AttBiLSTM-CRF), which we call Baseline1. The results show that APTWord2vec+BERT obtains the highest F1 value.

Table 5. F1 value with Baseline1. Baseline1 is the AttBiLSTM-CRF model.

Model               Dev dataset (F1)   Testing data (F1)
GloVe               70.25              68.32
Word2vec            69.56              68.43
APTWord2vec         75.77              74.69
BERT                83.33              82.65
APTWord2vec+BERT    85.56              84.20
As shown in Table 6, we also tested the impact of the character level representation with different neural networks. We fused Baseline1 with APTWord2vec+BERT (the best performance of word vector models). We called the new model as Baseline2. The results show that Baseline2 with CNN perform better.
Table 6. Impact of the character-level representation with different neural networks. We used Baseline2, which is Baseline1 with APTWord2vec+BERT.

Model                    Dev dataset (F1)   Testing data (F1)
Baseline2 with CNN       85.61              84.32
Baseline2 with BiLSTM    84.62              83.77
Secondly, we verified the impact of the loss function on the general-domain dataset CoNLL2003, using Baseline1. The experiments indicate that adding the TSFL function improves NER results on public-domain datasets, as shown in Table 7.

Table 7. Impact of the loss function. FL means focal loss function. SFL means sorted focal loss function. TSFL means the loss that fuses the triplet loss function based on metric learning with the sorted focal loss function.

Model                  Dev dataset (F1)   Testing data (F1)
Baseline1              90.13              91.89
Baseline1 with FL      91.23              92.57
Baseline1 with SFL     92.43              92.76
Baseline1 with TSFL    93.56              93.29
Finally, we evaluated the impact of the loss function on the cyber threat domain dataset, as shown in Table 8. We used Baseline2 with the CNN character-level representation and call this model Baseline3. The experimental results further show that the TSFL function improves the NER task in the threat intelligence domain, raising the test F1 from 84.32 to 85.16.

Table 8. F1 value with Baseline3. Baseline3 is Baseline2 with CNN.

Model with different      Dev dataset                    Test dataset
loss function             Precision   Recall   F1        Precision   Recall   F1
Baseline3                 86.76       84.49    85.61     85.91       82.79    84.32
Baseline3 with FL         87.36       86.03    86.03     86.54       82.96    84.71
Baseline3 with SFL        87.43       84.68    86.65     86.67       83.34    84.97
Baseline3 with TSFL       87.71       87.45    87.58     86.71       83.66    85.16
5 Conclusion

In this paper, we constructed a dataset for named entity recognition in the threat intelligence domain. We also trained APTWord2vec for the threat intelligence domain to support subsequent NLP research in this field. In addition, we proposed a new loss function, TSFL, which combines a triplet loss function based on metric learning with a sorted focal loss function, to address the unbalanced distribution of data labels. We evaluated it on both the public CoNLL2003 dataset and the threat intelligence NER dataset that we built. Experiments show that the F1 value improves on both the public-domain dataset and the threat intelligence dataset.

Acknowledgments. We thank the corresponding authors Xuren Wang and Zihan Xiong for their help. This work is supported by the National Key Research and Development Program of China (Grant No. 2018YFC0824801, Grant No. 2016QY06X1204).
References

1. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
2. Lample, G., Ballesteros, M., Subramanian, S., et al.: Neural architectures for named entity recognition (2016)
3. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF (2016)
4. Peters, M.E., Neumann, M., Iyyer, M., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
5. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
6. Joshi, A., Lal, R., Finin, T., et al.: Extracting cybersecurity related linked data from text. In: IEEE Seventh International Conference on Semantic Computing, pp. 252–259. IEEE (2013)
7. Sabottke, C., Suciu, O., Dumitras, T.: Vulnerability disclosure in the age of social media: exploiting Twitter for predicting real-world exploits. In: Proceedings of the 24th USENIX Security Symposium (USENIX Security 2015). USENIX Association (2015)
8. Liao, X., Yuan, K., Wang, X., Li, Z., Xing, L., Beyah, R.: Acing the IOC game: toward automatic discovery and analysis of open-source cyber threat intelligence. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS). Association for Computing Machinery (2016)
9. Zhu, Z., Dumitras, T.: ChainSmith: automatically learning the semantics of malicious campaigns by mining threat intelligence reports. In: IEEE European Symposium on Security and Privacy. IEEE (2018)
10. Dionísio, N., Alves, F., et al.: Cyberthreat detection from twitter using deep neural networks. In: IEEE International Joint Conference on Neural Networks. IEEE (2019)
11. Tan, S., Long, Z., Tan, L., Guo, H.: Automatic identification of indicators of compromise using neural-based sequence labelling (2018)
12. Zi, L., et al.: Collecting indicators of compromise from unstructured text of cybersecurity articles using neural-based sequence labelling. In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE (2019)
13. Xing, E.P., Ng, A.Y., Jordan, M.I., et al.: Distance metric learning with application to clustering with side-information. In: International Conference on Neural Information Processing Systems. MIT Press (2002)
14. Hadsell, R., Chopra, S., Lecun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1735–1742, New York, USA (2006)
15. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
16. Lan, Z., et al.: ALBERT: A lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2019)
17. Wei, J.W., Kai, Z.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019)
18. Lin, T.Y., Goyal, P., Girshick, R., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Enhancing the Numeracy of Word Embeddings: A Linear Algebraic Perspective

Yuanhang Ren1 and Ye Du2(B)

1 University of Electronic Science and Technology of China, Chengdu, China [email protected]
2 Southwestern University of Finance and Economics, Chengdu, China [email protected]
Abstract. To reason over the embeddings of numbers, the embeddings should capture numeracy information. In this work, we consider the magnitude aspect of numeracy. We show that one can find a subspace of the original embedding space, as well as a direction in a high dimensional space, such that projecting the original embeddings of numbers onto that subspace or direction significantly enhances the magnitude information. This paper therefore proposes a new angle for studying the numeracy of word embeddings, one that is interpretable and has nice mathematical formulations.
Keywords: Word embedding numeracy · Subspace projection · Interpretable models

1 Introduction
With the help of deep learning and word embeddings, impressive progress has been made in different areas of NLP, such as parsing [4,7], machine translation [17,30] and natural language generation [10,28]. In this paper, we focus on reasoning over numbers with word embeddings, which are fundamental building blocks of current NLP models. To reason numerically, word embeddings should contain numeracy information. According to [21], numeracy refers to properties that numbers hold, such as magnitude [21], numeration [21], and ordering [27]. If some of these properties are reflected by the embeddings of numbers, we say the numeracy information is kept to some degree. In our work, we focus on magnitude.

Magnitude is one essential property of numeracy [3,5,21,29]. Roughly speaking, magnitude concerns how well the embeddings of numbers encode their values. [21] propose a test called SC to measure the magnitude of the embeddings of numbers in the original embedding space. They show that popular embeddings of numbers contain magnitude information to some extent, although it is not very strong. [27] introduce a decoding test to measure magnitude. Basically, it measures how much magnitude information is contained that helps to map the embedding of a number to its value. Their results show that the values of numbers can be approximated quite well by a neural network. They also investigate a linear regression model and demonstrate that a linear subspace can partially reflect the magnitude.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 170–178, 2020. https://doi.org/10.1007/978-3-030-60450-9_14

Given a set of number embeddings $W_N$, a few numeracy tests can be run on $W_N$ itself, as [21] did. However, if there is a mapping $f: W_N \to \hat{W}_N$, the numeracy information might be better revealed. For instance, [27] utilize neural networks to map the original number embeddings to real numbers. We believe mappings can be employed to investigate whether the original embeddings contain magnitude information, as long as the mappings only use the information of the original embeddings on the testing set. Such a mapping gives us a new perspective from which to investigate numeracy. Hence, we ask the following question: what is an appropriate mapping (perspective) that can enhance the magnitude? We hope that the original embeddings indeed contain numeracy information, in which case an appropriate mapping could help to enhance it. Otherwise, any mapping would be of little help. In this paper, we are interested in mappings that are interpretable and have nice mathematical formulations.

Our contributions are as follows:

1. We extend the Strict Contrastive (SC) magnitude test [21] to a set of new tests called SC-k. We show that there exists a low dimensional subspace of the original space of number embeddings such that projecting the embeddings of numbers onto it uncovers a larger degree of magnitude, as quantified by SC-k.
2. We can also find a direction in a high dimensional space such that projecting the embeddings of numbers onto it reveals a larger degree of magnitude, as measured by the test of [27]. Interestingly, that high dimensional space is implied by a kernel used to separate numbers from non-numeric words.
In all, the magnitude information of word embeddings can be enhanced by our linear mappings.
2 A Subspace to Enhance the Magnitude of Numbers
[21] propose a magnitude test called the SC test, as follows. Given a number $w$,¹ one can find the number $w^+$ closest to $w$ according to its actual value, i.e., $|w - w^+|$ is the smallest among all numbers different from $w$. Similarly, the number $w^-$ that is the second closest to $w$ can be found, so that $|w - w^+| < |w - w^-|$. It is then natural to ask whether

\[ d(v_w, v_{w^+}) < d(v_w, v_{w^-}) \tag{1} \]

holds, where $v_w$, $v_{w^+}$, $v_{w^-}$ are the embeddings of $w$, $w^+$, $w^-$ respectively and $d$ is a distance metric. If inequality (1) is satisfied for every number $w$, the test is passed. Unfortunately, according to their results [21], even a set of random embeddings achieves a high accuracy. Hence, we propose a novel and more discriminative test to evaluate the magnitude, called SC-k (Strict Contrastive up to the k-th closest). The idea is to choose a set $S$ consisting of the second closest up to the $k$-th closest numbers to $w$. Then $w^-$ is picked as the number in $S$ whose embedding has the smallest distance to that of $w$, i.e., $w^- = \arg\min_{w' \in S} d(v_w, v_{w'})$. If inequality (1) holds for every $w$, the test is passed.

¹ In our work, we restrict our scope to Arabic numbers.

2.1 The Subspace Model
Given a set of numbers $W$ and their corresponding embeddings $V$, we want to find a matrix $Q \in \mathbb{R}^{l \times s}$ whose columns form an orthonormal basis of a subspace, where $l$ and $s$ are the dimensions of the original space and the subspace. After projecting the vectors of two numbers $x$ and $y$ onto the subspace spanned by $Q$, the cosine distance [11] between their embeddings is

\[ d(v_x, v_y) = 1 - \cos(Q^T v_x, Q^T v_y), \]

where $\cos$ denotes the cosine similarity. To measure how many numbers pass the SC-k test, we introduce the accuracy metric

\[ \mathrm{Acc}(V, Q) = \frac{1}{|W|} \sum_{w \in W} I(N_w - P_w), \]

where $I$ is the indicator function

\[ I(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases} \]

and $P_w$ and $N_w$ denote $d(v_w, v_{w^+})$ and $d(v_w, v_{w^-})$. The subspace $Q$ is found by solving the following problem:

\[ \max_{Q} \ \mathrm{Acc}(V, Q) \quad \text{s.t.} \quad Q^T Q = E_s \tag{2} \]

where $E_s$ is the $s \times s$ identity matrix. It is important to note that such a subspace is found on the training set and tested on the testing set. Hence, if the subspace indeed reflects some intrinsic properties of number embeddings, improvements should appear not only on the training set but also on the testing set.
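The accuracy metric of this section, combined with the SC-k choice of the contrast number, can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' released code: the function names, the toy values and the two toy embeddings are assumptions, and the sketch expects k ≥ 2 and at least three numbers. The toy values are spaced so that no two numbers are equally close to a third, which sidesteps the question of how ties are broken.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def sc_k_accuracy(values, V, Q, k):
    """Share of numbers w with d(v_w, v_w+) < d(v_w, v_w-) after projecting onto Q.

    values: (n,) numeric values of the numbers
    V:      (n, l) embeddings, one row per number
    Q:      (l, s) matrix with orthonormal columns
    k:      w- is searched among the 2nd to k-th closest numbers by value
    """
    values = np.asarray(values, dtype=float)
    Z = np.asarray(V, dtype=float) @ Q               # row i is Q^T v_i
    passed = 0
    for i in range(len(values)):
        gaps = np.abs(values - values[i])
        # indices of the other numbers, closest by value first
        order = [j for j in np.argsort(gaps, kind="stable") if j != i]
        P = cosine_distance(Z[i], Z[order[0]])       # distance to w+
        N = min(cosine_distance(Z[i], Z[j]) for j in order[1:k])  # distance to w-
        passed += int(N - P > 0)                     # indicator I(N_w - P_w)
    return passed / len(values)

def embed(angles):
    """Hypothetical 2-d embeddings: unit vectors at the given angles."""
    return np.array([[np.cos(t), np.sin(t)] for t in angles])

values = [0, 1, 3, 7, 15, 31]
ordered = embed([0.01 * v for v in values])          # angle grows with the value
scrambled = embed([0.0, 0.3, 0.1, 0.2, 0.05, 0.25])  # angle unrelated to the value
Q = np.eye(2)                                        # identity: no dimension is dropped
acc_ordered = sc_k_accuracy(values, ordered, Q, k=3)
acc_scrambled = sc_k_accuracy(values, scrambled, Q, k=3)
```

In the toy data, the embedding whose angle grows with the value passes the test for every number, while the scrambled one does not. Replacing `Q` by a learned orthonormal basis gives the quantity that problem (2) maximizes.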
2.2 The Optimization Procedure
Problem (2) is hard to solve directly, since the indicator function is piecewise constant and provides no useful gradient. Thus, we relax it and optimize the relaxed problem to obtain an approximate solution to the original one:

\[ \max_{Q} \ \frac{1}{|W|} \sum_{w \in W} \hat{I}(N_w - P_w) \quad \text{s.t.} \quad Q^T Q = E_s \tag{3} \]

where $\hat{I}$ is a soft indicator function, $\hat{I}(x) = (1 + e^{-\beta x})^{-1}$, and $\beta$ is a hyperparameter. One algorithm for solving constrained optimization problems such as (3) is Projected Gradient Descent (PGD). Generally speaking, PGD alternates between two steps. The parameter update step is performed by gradient descent. The projection step restricts the solution to the feasible region. For the parameter update step, plain gradient descent might not be a good choice, since it is slow and uses the same learning rate for all parameters. Hence, we split the data into mini-batches and replace gradient descent with Adam [15]. We formalize the projection step as

\[ \min_{Q} \ \|A - Q\|_F \quad \text{s.t.} \quad Q^T Q = E_s \tag{4} \]

where $A \in \mathbb{R}^{l \times s}$ is the parameter matrix updated by Adam. Problem (4) is known as the nearest orthonormal matrix problem [12,13]. One closed-form solution of problem (4) is

\[ Q = U \begin{bmatrix} E_s \\ 0 \end{bmatrix} V^T = U_s V^T, \]

where the columns of $U$ and $V$ are the left and right singular vectors of $A$, and $U_s$ consists of the first $s$ columns of $U$. After iterating between these two steps, we obtain an approximate solution to the original problem.

2.3 Experiments
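Before the experimental setup, the projection step of Sect. 2.2 can be illustrated numerically. The sketch below computes the nearest orthonormal matrix from the thin SVD of a parameter matrix; it covers only the projection step, not the Adam update, and the function name and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def nearest_orthonormal(A):
    """Return Q minimizing ||A - Q||_F subject to Q^T Q = I, for A of shape (l, s), l >= s."""
    # Thin SVD: U_s holds the first s left singular vectors, Vt is V^T.
    U_s, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U_s @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))        # stand-in for a parameter matrix after an update
Q = nearest_orthonormal(A)         # columns of Q are orthonormal
Q_qr, _ = np.linalg.qr(A)          # another orthonormal matrix, for comparison
```

Any other matrix with orthonormal columns, such as the QR factor `Q_qr`, should be at least as far from `A` in Frobenius norm as `Q` is.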
The SkipGram [18], Glove [22] and FastText [2,19] embeddings are evaluated on the test. Each has two variants, trained on the Wikipedia and Gigaword corpora respectively.² A random embedding is used as the baseline. The values of k for the SC-k dataset are taken from {10, 100, 500}, and the dataset is derived from the numbers used in SC with a small modification: the numbers are selected such that, after sorting them in ascending order, the difference between every pair of adjacent numbers is 1. Otherwise, the difficulty of passing the test would differ across numbers. The SC-k dataset has 2055 numbers and is randomly split into training, validation, and testing sets with ratios of 65%, 15%, and 20%.

² Embeddings are available at http://vectors.nlpl.eu/repository with ids 5, 11, 7, 13, 9, and 15 [8].

The dimension $s$ of the subspace, the learning rate $\eta$ of Adam, and $\beta$ in the relaxation are the three hyperparameters of the model. Their ranges are $s \in \{2, 3, \ldots, 256\}$, $\eta \in [10^{-5}, 1]$ and $\beta \in \{2, 3, \ldots, 20\}$. For a given hyperparameter setting, the model is fitted on the training set and evaluated on the validation set. The number of training epochs is 50 and the mini-batch size is 256. The hyperparameters that achieve the best score on the validation set are chosen. Bayesian optimization with Gaussian Processes [9] is used to search the hyperparameters. Given the best hyperparameters, the model is refitted on the training and validation sets and tested on the testing set. We repeat the random split five times and report the mean and standard deviation of the accuracy. Since the results are reported on the testing set and the split is random, the results of the original space also vary across splits.

Table 1. Accuracy of various embeddings on the SC-k magnitude test

                 k=10                          k=100                         k=500
Embeddings       Subspace       Original       Subspace       Original       Subspace       Original
SkipGram-wiki    38.78 ± 1.43   25.64 ± 1.08   26.47 ± 1.76   18.54 ± 1.17   25.11 ± 2.47   18.3 ± 1.07
SkipGram-giga    27.79 ± 2.32   16.16 ± 1.91   13.14 ± 1.38   7.35 ± 0.68    11.34 ± 1.65   7.98 ± 1.12
Glove-wiki       54.84 ± 2.99   43.7 ± 1.86    48.03 ± 1.27   37.62 ± 2.49   48.03 ± 1.84   35.96 ± 1.55
Glove-giga       32.85 ± 1.55   19.32 ± 1.18   20.15 ± 1.46   12.94 ± 1.48   20.83 ± 1.5    13.09 ± 2.25
FastText-wiki    43.16 ± 1.41   27.69 ± 1.07   35.52 ± 2.43   23.41 ± 1.37   32.02 ± 1.24   22.38 ± 1.48
FastText-giga    30.41 ± 1.64   17.86 ± 1.88   17.08 ± 1.24   11.58 ± 0.57   14.74 ± 3.15   11.44 ± 1.79
random           28.03 ± 1.35   8.08 ± 0.54    1.61 ± 0.5     1.17 ± 0.28    0.39 ± 0.19    0.1 ± 0.12
The results are listed in Table 1. First, we check whether magnitude information is contained in the embeddings. The accuracies of the pre-trained embeddings are at least about 6 points higher than the random baseline in the original space, and are almost always higher than the random baseline in the subspace. This indicates that magnitude information is indeed included in the pre-trained embeddings. Next, we turn to the enhancement of the magnitude. The accuracies of the pre-trained embeddings in the subspaces exceed their original-space counterparts for all values of k. Moreover, the accuracy improvement decreases as k increases. Note that the random embedding also obtains a significant improvement when k is 10; we believe the reason is that the SC-10 test is too simple. Although the subspace dimension is only about 155 on average, the accuracies of word embeddings in the subspace are higher than their original-space counterparts by about 5.5 to 12.1 points when k is 100. Meanwhile, the random baseline shows only a very small improvement when k is greater than or equal to 100. From the results of SC-100 and SC-500, we conclude that the magnitude information is indeed enhanced by projecting word embeddings onto a proper low dimensional subspace.
3 A Direction to Reflect the Magnitude of Numbers
Since the exact values of numbers naturally lie on an axis, we are interested in the following question: is there a vector $w$ acting as an axis such that the magnitude of a number is revealed by projecting its embedding $x$ onto $w$? Namely, given an embedding $x$ of a number, its magnitude is obtained by

\[ y = w^T x + b \]

where $y$, $w$, and $b$ denote the predicted magnitude, the direction, and the intercept. It is not surprising that such a direction could not be easily found in the original space of word embeddings. Thus, we look into the high dimensional space implied by kernels. [16] indicate that numbers and non-numeric words can be well separated. According to our experiments, an SVM equipped with the polynomial kernel

\[ K(v_i, v_j) = \left( \frac{1}{300} v_i^T v_j \right)^3 \]

does the separation task pretty well. The high dimensional space implied by this kernel could therefore be a reasonable perspective from which to look at number embeddings. Ridge regression (Ridge) is used as the linear mapping model, and Ridge equipped with the pre-defined $K$ is called Kernel Ridge (separability). To quantify how much magnitude is revealed, we adopt the decoding test proposed by [27]. Ridge with kernels selected on the validation set of this test serves as another baseline, called Kernel Ridge. As in the previous section, the direction is found on the training set and tested on the testing set.

The embeddings used for this test are the same as in the previous section. We choose the same set of numbers across all pre-trained embeddings, within the range [0, 10000] (since numbers outside this range are rare). We then split 65% of the numbers into a training set, 15% into a validation set, and 20% into a testing set. The hyperparameter of Ridge is $\alpha$; those of Kernel Ridge are $\alpha$, the kernel type, $\gamma$, $r$ and $d$; that of Kernel Ridge (separability) is $\alpha$. All values of $\alpha$ are within the range $[10^{-3}, 10^{3}]$, the kernel types are polynomial, rbf, and sigmoid, $\gamma \in [10^{-6}, 10]$, $d \in \{1, 2, \ldots, 8\}$, and $r \in [-10, 10]$. Once the best hyperparameters are found, we refit the model on the training and validation sets. Then, we test the model on the testing set. We report the mean and standard deviation of the metric, Root Mean Squared Error (RMSE), across ten different random splits, using the same split for all embeddings and models.

Table 2. RMSE of various embeddings on the magnitude test of [27]

Embeddings       Ridge               Kernel Ridge (separability)   Kernel Ridge
SkipGram-wiki    443.54 ± 50.90      355.26 ± 92.26                289.82 ± 65.35
SkipGram-giga    419.33 ± 45.81      354.73 ± 87.96                297.74 ± 61.54
Glove-wiki       554.95 ± 63.87      429.35 ± 58.67                413.59 ± 54.69
Glove-giga       534.98 ± 58.93      431.21 ± 97.56                374.81 ± 75.44
FastText-wiki    447.36 ± 50.35      336.07 ± 77.83                301.49 ± 76.87
FastText-giga    373.10 ± 41.73      351.66 ± 90.52                269.36 ± 81.01
random           1081.03 ± 105.16    1232.64 ± 109.54              1040.32 ± 109.60

The results are listed in Table 2. First, the RMSEs of all pre-trained embeddings are far lower than those of the random baseline. This indicates that pre-trained embeddings indeed maintain the magnitude. Second, the RMSEs of Kernel Ridge are about 140 lower than those of Ridge, while the improvement on the random embedding is not obvious. This illustrates that the magnitude can be enhanced by mapping embeddings with kernels. Finally, the RMSEs of Kernel Ridge (separability) are about 85 lower than those of Ridge on average, while the random embedding gets worse; meanwhile, the RMSEs of Kernel Ridge (separability) are not much worse than those of Kernel Ridge. This demonstrates that the high dimensional space implied by $K$ gives us a proper perspective from which to look at those numbers. By projecting embeddings onto a direction in that space, the magnitude is enhanced significantly. Note that the kernel selection of Kernel Ridge (separability) does not use the magnitude information at all, which makes this observation intriguing.
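As a rough sketch of how a ridge model with the fixed kernel $K(v_i, v_j) = (\frac{1}{300} v_i^T v_j)^3$ can be fitted and scored by RMSE, consider the NumPy code below. It is not the authors' implementation: the function names are hypothetical, the intercept is handled by centring the targets (a simplification), and the data are random stand-ins rather than real number embeddings.

```python
import numpy as np

def poly_kernel(X, Y, scale=1.0 / 300, degree=3):
    """K(v_i, v_j) = (scale * v_i^T v_j)^degree for all row pairs of X and Y."""
    return (scale * (X @ Y.T)) ** degree

def fit_kernel_ridge(X, y, alpha):
    """Solve (K + alpha I) a = y - mean(y) for the dual coefficients a."""
    b = float(np.mean(y))                      # intercept via centring (a simplification)
    K = poly_kernel(X, X)
    a = np.linalg.solve(K + alpha * np.eye(len(y)), y - b)
    return a, b

def predict(X_train, a, b, X_new):
    """Prediction y = sum_i a_i K(x, x_i) + b, a projection in the kernel space."""
    return poly_kernel(X_new, X_train) @ a + b

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in data (not the paper's embeddings): 40 random 300-d vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))
y = rng.uniform(0, 100, size=40)
a_weak, b_weak = fit_kernel_ridge(X, y, alpha=1e-6)     # almost no regularization
a_strong, b_strong = fit_kernel_ridge(X, y, alpha=1e3)  # heavy regularization
train_rmse_weak = rmse(y, predict(X, a_weak, b_weak, X))
train_rmse_strong = rmse(y, predict(X, a_strong, b_strong, X))
```

The regularization strength `alpha` plays the role of the hyperparameter α tuned on the validation set; on real data the RMSE would be computed on held-out numbers rather than on the training set used here.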
4 Related Works
Both our work and [27] use linear models. However, there are several differences between our methods and theirs. First, we use kernels to show a clearer linear structure in the high dimensional space. Second, we draw our conclusions from a larger range of numbers than they do. In addition to linear models, they also use a neural network. It could be harder to analyze the structure through the mapping of a neural network. Thus, we focus on more explainable models in this work.

Besides the tests of magnitude, other tests have been proposed to examine other aspects of numeracy. For example, the tests of ordering [27,31] and decoding [26] are used to check the properties of number embeddings or to design better models. Apart from these tests, a downstream task can be used as a unified test of numeracy, such as DROP QA in [27]. Through these tests, one can learn more about the numeracy of number embeddings. To facilitate reasoning over numbers, various numerical question answering datasets have been proposed, such as DROP [6], EQUATE [23], and Mathematics Questions [24]. Questions in these datasets require models to compare, sort, and add numbers.

Apart from our work, many other works try to understand and utilize the structures of word embeddings. [20] noticed strange geometric structures of word embeddings. [25] proposed a statistical method to uncover the underlying latent structures, which can help to interpret word embeddings. [1] utilized linear structures to recover vectors that approximately capture senses.
5 Conclusion and Future Work
In this paper, we investigate linear mappings of word embeddings with respect to magnitude. By projecting word embeddings onto specifically designed linear subspaces, magnitude is enhanced. In particular, these linear mappings are interpretable and have nice mathematical formulations. In the future, applying these mappings to downstream applications needs to be explored. For instance, one can learn out-of-vocabulary (OOV) numeral embeddings [14] with the subspaces described in Sect. 2. Probing other interesting properties of embeddings could be another direction.
Reproducibility. All code and data for this paper are available at GitHub: https://github.com/ ryh95/low dim numeracy.

Acknowledgements. We would like to thank the anonymous reviewers for their valuable comments.
References

1. Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: Linear algebraic structure of word senses, with applications to polysemy. Trans. Assoc. Comput. Linguist. 6, 483–495 (2018)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
3. Cantlon, J.F., Brannon, E.M.: Shared system for ordering small and large numbers in monkeys and humans. Psychol. Sci. 17(5), 401–406 (2006)
4. Chen, D., Manning, C.: A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 740–750 (2014)
5. Dehaene, S., Dehaene-Lambertz, G., Cohen, L.: Abstract representations of numbers in the animal and human brain. Trends Neurosci. 21(8), 355–361 (1998)
6. Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., Gardner, M.: Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161 (2019)
7. Dyer, C., Ballesteros, M., Ling, W., Matthews, A., Smith, N.A.: Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075 (2015)
8. Fares, M., Kutuzov, A., Oepen, S., Velldal, E.: Word vectors, reuse, and replicability: towards a community repository of large-text resources. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 271–276. Association for Computational Linguistics, Gothenburg, Sweden, May 2017
9. Frazier, P.I.: A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018)
10. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res. 61, 65–170 (2018)
11. Glavaš, G., Vulić, I.: Explicit retrofitting of distributional word vectors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 34–45 (2018)
12. Higham, N.J.: Matrix nearness problems and applications
13. Horn, B.K., Hilden, H.M., Negahdaripour, S.: Closed-form solution of absolute orientation using orthonormal matrices. JOSA A 5(7), 1127–1135 (1988)
14. Jiang, C., et al.: Learning numeral embeddings. arXiv preprint arXiv:2001.00003 (2019)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
16. Kutuzov, A., Velldal, E., Øvrelid, L.: Redefining part-of-speech classes with distributional semantic models. arXiv preprint arXiv:1608.03803 (2016)
17. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
19. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pretraining distributed word representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
20. Mimno, D., Thompson, L.: The strange geometry of skip-gram with negative sampling. In: Empirical Methods in Natural Language Processing (2017)
21. Naik, A., Ravichander, A., Rose, C., Hovy, E.: Exploring numeracy in word embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3374–3380 (2019)
22. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Ravichander, A., Naik, A., Rose, C., Hovy, E.: Equate: A benchmark evaluation framework for quantitative reasoning in natural language inference. arXiv preprint arXiv:1901.03735 (2019)
24. Saxton, D., Grefenstette, E., Hill, F., Kohli, P.: Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557 (2019)
25. Şenel, L.K., Utlu, I., Yücesoy, V., Koc, A., Cukur, T.: Semantic structure and interpretability of word embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1769–1779 (2018)
26. Trask, A., Hill, F., Reed, S.E., Rae, J., Dyer, C., Blunsom, P.: Neural arithmetic logic units. In: Advances in Neural Information Processing Systems, pp. 8035–8044 (2018)
27. Wallace, E., Wang, Y., Li, S., Singh, S., Gardner, M.: Do NLP models know numbers? Probing numeracy in embeddings. arXiv preprint arXiv:1909.07940 (2019)
28. Wen, T.H., Gasic, M., Mrksic, N., Su, P.H., Vandyke, D., Young, S.: Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745 (2015)
29. Whalen, J., Gallistel, C.R., Gelman, R.: Nonverbal counting in humans: the psychophysics of number representation. Psychol. Sci. 10(2), 130–137 (1999)
30. Wu, Y., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
31. Yang, Y., Birnbaum, L., Wang, J.P., Downey, D.: Extracting commonsense properties from embeddings with limited human guidance. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 644–649 (2018)
Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?

Houquan Zhou, Yu Zhang, Zhenghua Li(B), and Min Zhang

Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China [email protected], [email protected], {zhli13,minzhang}@suda.edu.cn
Abstract. In the pre-deep-learning era, part-of-speech (POS) tags were considered indispensable ingredients for feature engineering in dependency parsing, and quite a few works focused on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations. Yet there has not been enough investigation of this issue, either empirically or linguistically. To address it, we design and compare three typical multi-task learning frameworks, i.e., Share-Loose, Share-Tight, and Stack, for joint tagging and parsing based on the state-of-the-art biaffine parser. Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data. We conduct experiments on both English and Chinese datasets, and the results clearly show that POS tagging (both homogeneous and heterogeneous) can still significantly improve parsing performance when using the Stack joint framework. We conduct a detailed analysis and gain more insights from the linguistic aspect.
1 Introduction
Among different NLP tasks, syntactic parsing is the first to convert sequential utterances into full tree structures. Due to its simplicity and multi-lingual applicability, dependency parsing has attracted extensive research interest as a mainstream syntactic formalism [21,24], and has been widely used in semantic parsing [7], information extraction [22], machine translation [25], etc. Given an input sentence $S = w_0 w_1 \ldots w_n$, dependency parsing constructs a tree $T = \{(h, d, l), 0 \le h \le n, 1 \le d \le n, l \in L\}$, as depicted in Fig. 1, where $(h, d, l)$ is a dependency from the head $w_h$ to the dependent $w_d$ with the relation label $l$, and $w_0$ is a pseudo root node.

H. Zhou and Y. Zhang: equal contributions to this work. This work was supported by National Natural Science Foundation of China (Grant No. 61525205, 61876116) and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 179–191, 2020. https://doi.org/10.1007/978-3-030-60450-9_15
H. Zhou et al.
Fig. 1. An example dependency tree with both homogeneous (CTB) and heterogeneous (PKU) POS tags.
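The tree definition given in the introduction can be made concrete with a small check. The following is a minimal sketch, not part of the paper's released code: it represents a parse as a list of (h, d, l) triples and verifies the index ranges, the single-head constraint, and that every word reaches the pseudo root w_0. The function name `is_valid_tree` and the relation labels in the example are assumptions chosen for illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, int, str]  # (head index h, dependent index d, relation label l)


def is_valid_tree(arcs: List[Arc], n: int) -> bool:
    """Check that arcs form a dependency tree over w_1..w_n rooted at w_0."""
    heads = {}
    for h, d, _ in arcs:
        # Indices must respect 0 <= h <= n and 1 <= d <= n; no self-loops.
        if not (0 <= h <= n and 1 <= d <= n) or h == d:
            return False
        if d in heads:  # every word has exactly one head
            return False
        heads[d] = h
    if len(heads) != n:  # every word w_1..w_n needs a head
        return False
    # Every word must reach the pseudo root w_0 without entering a cycle.
    for d in range(1, n + 1):
        seen = set()
        node = d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Hypothetical three-word sentence in which w_2 is attached to the pseudo root.
# The label names are illustrative only.
example = [(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "dobj")]
print(is_valid_tree(example, 3))
```

The check does not enforce projectivity or a single root word, since the definition above does not state such constraints.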
In the pre-deep-learning (DL) era, part-of-speech (POS) tags were considered indispensable ingredients for feature engineering in dependency parsing. POS tags function as word classes in the sense that words with the same POS tag usually play similar syntactic roles in language utterances. For example, ordinary verbs are tagged as VV and ordinary nouns as NN in Penn Chinese Treebank (CTB). Yet some tags are designed to serve other tasks such as information extraction. For instance, proper nouns and temporal nouns are distinguished from NN as NR and NT respectively. In this sense, we refer to tag pairs like {NN,VV} as syntax-sensitive and {NN,NR} as syntax-insensitive. Parsing performance drops dramatically when POS-related features are removed, since POS tags play a key role in reducing the data sparseness problem of using pure word-based lexical features. Meanwhile, to alleviate error propagation in the first-tagging-then-parsing pipeline, researchers proposed to jointly model POS tagging and dependency parsing under both graph-based [18] and transition-based [9] frameworks.
In the past five years, dependency parsing has achieved tremendous progress thanks to the strong capability of deep neural networks in representing words and long-range contexts [1,2,4,6,12,28]. Yet all those works hold the assumption that POS tags are important and concatenate word and POS tag embeddings as input. Moreover, researchers show that using Character-level Long Short-term Memory (CharLSTM) based word representations is helpful for named entity recognition [15], dependency parsing [5], and constituency parsing [13]. The idea is first to run a Long Short-term Memory (LSTM) network over the characters of each word and then to add (or concatenate) word embeddings and CharLSTM word representations as model inputs. In particular, Kitaev and Klein (2018) [13] show that with CharLSTM word representations, POS tags are useless for constituency parsing. We believe the reason may be two-fold. First, word embeddings, unlike lexical features, suffer from much less data sparseness, since syntactically similar words can be associated via similar dense vectors. Second, CharLSTM word representations can effectively capture morphological inflections by looking at lemma/prefix/suffix, which provides information similar to POS tags. However, a full and systematic study on the usefulness of POS tags for dependency parsing is still lacking.
Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?
In this work, we try to answer the question of whether POS tagging is necessary or even helpful for neural dependency parsing, and make the following contributions.¹
• We design three typical multi-task learning (MTL) frameworks (i.e., Share-Loose, Share-Tight, and Stack) for joint POS tagging and dependency parsing based on the state-of-the-art biaffine parser.
• Considering that there exist large-scale heterogeneous POS-tag data for Chinese, partly because it is much cheaper to annotate POS tags than parse trees, we also investigate the helpfulness of such heterogeneous data, besides homogeneous POS-tag data that are annotated together with parse trees.
• We conduct experiments on both English and Chinese benchmark datasets, and the results show that POS tagging, both homogeneous and heterogeneous, can still significantly improve parsing accuracy when using the Stack joint framework. Detailed analysis sheds light on the reasons behind the helpfulness of POS tagging.
2 Basic Tagging and Parsing Models
This section presents the basic POS tagging and dependency parsing models separately in a pipeline architecture. In order to make a fair comparison with the joint models, we make the encoder-decoder architectures of the tagging and parsing models as similar as possible. The input representation contains both word embeddings and CharLSTM word representations. The encoder part adopts three BiLSTM layers.
2.1 The Encoder Part
The Input Layer. Given a sentence $S = w_0 w_1 \ldots w_n$, the input layer maps each word $w_i$ into a dense vector $x_i$.
$$x_i = e^w_i \oplus e^c_i \qquad (1)$$
where $e^w_i$ is the word embedding, $e^c_i$ is the CharLSTM word representation vector, and $\oplus$ means vector concatenation.² The CharLSTM word representations $e^c_i$ are obtained by applying a BiLSTM to the word characters and concatenating the final hidden output vectors. Following Dozat and Manning (2017) [4], the word embedding $e^w_i$ is the sum of a fixed pretrained word embedding and a trainable word embedding initialized as zero. Infrequent words in the training data (occurring fewer than 2 times) are treated as a special OOV token in order to learn its embedding. Under the pipeline framework, the parsing model may use extra POS tags as input.
$$x_i = e^w_i \oplus e^c_i \oplus e^p_i \oplus e^{p'}_i \qquad (2)$$
¹ We release our code at https://github.com/Jacob-Zhou/stack-parser.
² We have also tried the sum of $e^w_i$ and $e^c_i$, leading to slightly inferior performance.
where $e^p_i$ and $e^{p'}_i$ are the embeddings of the homogeneous and heterogeneous POS tags, respectively. For dropout, we follow Dozat and Manning (2017) [4] and drop the different components of the input vector $x_i$ independently.
The BiLSTM Encoder. We employ the same N = 3 BiLSTM layers over the input layer to obtain context-aware word representations for both tagging and parsing. We follow the dropout strategy of Dozat and Manning (2017) [4] and share the same dropout masks at all time steps of the same unidirectional LSTM. The hidden outputs of the top-layer BiLSTM are used as the encoded word representations, denoted as $h_i$.
2.2 The Tagging Decoder
For the POS tagging task, we use two MLP layers to compute the score vector over the different tags and obtain the optimal tag via softmax. The first MLP layer uses the leaky ReLU [19] activation, while the second MLP layer is linear without activation. During training, we use the local cross-entropy loss.
2.3 The Parsing Decoder
We adopt the state-of-the-art biaffine parser of Dozat and Manning (2017) [4]. We apply an MLP layer with leaky ReLU activation to obtain the representations of each word as a head ($r^h_i$) and as a dependent ($r^d_i$).
$$r^h_i;\ r^d_i = \mathrm{MLP}(h_i) \qquad (3)$$
As discussed in Dozat and Manning (2017) [4], this MLP layer on the one hand reduces the dimensionality of hi , and more importantly on the other hand strips away syntax-unrelated information and thus avoids the risk of over-fitting. Then a biaffine layer is used to compute scores of all dependencies.
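$$\mathrm{score}(i \leftarrow j) = \begin{bmatrix} r^d_i \\ 1 \end{bmatrix}^{\mathrm{T}} W\, r^h_j \qquad (4)$$
where $\mathrm{score}(i \leftarrow j)$ is the score of the dependency $i \leftarrow j$, and $W$ is a weight matrix. During training, supposing the gold-standard head of $w_i$ is $w_j$, we use the cross-entropy loss to maximize the probability of $w_j$ being the head against all words, i.e., $\frac{e^{\mathrm{score}(i \leftarrow j)}}{\sum_{0 \le k \le n} e^{\mathrm{score}(i \leftarrow k)}}$.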
For dependency labels, we use extra MLP and Biaffine layers to compute the scores and also adopt cross-entropy classification loss. We omit the details due to space limitation.
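The arc scoring and its training loss described in this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the MLPs are single linear layers without biases, the leaky ReLU slope of 0.1 is an assumed value, the loss is averaged over words (the text does not say whether it is summed or averaged), and the function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    # The slope value is an assumption; the text only names the activation.
    return np.where(x > 0, x, slope * x)


def biaffine_arc_scores(H, W_h, W_d, W):
    """Return S with S[i, j] = score(i <- j), i.e. w_j as the head of w_i.

    H:   (n+1, k) encoder outputs h_0..h_n (row 0 is the pseudo root).
    W_h: (k, m) and W_d: (k, m) single-layer MLP weights (biases omitted).
    W:   (m+1, m) biaffine weight matrix.
    """
    R_h = leaky_relu(H @ W_h)                    # r^h_j: word as a head
    R_d = leaky_relu(H @ W_d)                    # r^d_i: word as a dependent
    ones = np.ones((R_d.shape[0], 1))
    R_d1 = np.concatenate([R_d, ones], axis=1)   # [r^d_i ; 1]
    return R_d1 @ W @ R_h.T                      # shape (n+1, n+1)


def head_cross_entropy(S, gold_heads):
    """Mean negative log-probability of the gold heads of w_1..w_n.

    gold_heads[i - 1] is the index of the gold head of w_i.
    """
    S = S[1:]                                    # w_0 has no head
    S = S - S.max(axis=1, keepdims=True)         # numerical stability
    log_prob = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    rows = np.arange(S.shape[0])
    return -log_prob[rows, gold_heads].mean()


# Toy usage with random inputs: a sentence with n = 3 words plus w_0.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
S = biaffine_arc_scores(H, rng.normal(size=(5, 3)), rng.normal(size=(5, 3)),
                        rng.normal(size=(4, 3)))
loss = head_cross_entropy(S, [2, 0, 2])
```

The softmax is taken over all candidate heads 0..n for each word, matching the normalization in the probability above; decoding a well-formed tree from S is a separate step that this sketch does not cover.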
3 Joint Tagging and Parsing Models
The pipeline framework suffers from the error propagation problem, meaning that POS tagging mistakes badly influence parsing performance. In the pre-DL
era, researchers proposed joint tagging and parsing models under both graph-based and transition-based parsing architectures [9,18]. The key idea is to define the joint score of a tag sequence and a parse tree and to find the optimal joint result in the enlarged search space. In the neural network era, jointly modeling two tasks becomes much easier thanks to the commonly used encoder-decoder architecture and the MTL framework. In this work, we design and compare three typical MTL frameworks for joint POS tagging and dependency parsing, i.e., Share-Loose, Share-Tight, and Stack, as illustrated in Fig. 2. The Share-Loose and Share-Tight methods treat tagging and parsing as two parallel tasks, whereas the Stack method considers parsing as the main task and derives POS tag-related information as the inputs of the parsing component. For all joint models, the inputs only include the word embeddings and CharLSTM word representations, as shown in Eq. 1.
[Figure 2 diagram. Panels: Share-Loose, Share-Tight, Stack. Components shown: inputs x_i ... x_j, BiLSTM layers, MLP, Biaffine, and POS/DEP outputs; in the Stack panel, r^p is concatenated (⊕) into the parser input.]
Fig. 2. The framework of three variants of the joint model.
Share-Loose. The tagging and parsing tasks use nearly separate networks and only share the word and char embeddings. To incorporate heterogeneous POS tagging data, we add another scoring MLP at the top to compute scores of the different heterogeneous POS tags. Under such an architecture, the loosely connected tagging and parsing components can only influence each other in a very limited manner.
Share-Tight. This is the most commonly used MTL framework, in which the tagging and parsing components share not only the embeddings but also the BiLSTM encoder. Different decoders are then used for different tasks. In this tightly joint model, the tagging and parsing components can interact with and mutually help each other to a large extent. The shared parameters are trained to capture the commonalities of the two tasks.
Stack. The Stack method takes the BiLSTM hidden outputs of the tagger, denoted as $r^p_i$, as the extra input of the parser. In this way, the error propagation problem can be better handled.
$$x^{\mathrm{parse}}_i = x_i \oplus r^p_i \qquad (5)$$
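The difference between the three frameworks is mainly a matter of which encoder feeds which decoder. The sketch below illustrates that data flow with NumPy; it is a simplified assumption-laden illustration rather than the paper's model. In particular, `make_encoder` is a hypothetical stand-in that replaces each BiLSTM stack with a fixed random tanh projection, the dimensions are arbitrary, and neither the decoders nor training are modeled.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_encoder(d_in, d_out):
    """Stand-in for a stack of BiLSTM layers: a fixed random tanh projection."""
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    return lambda X: np.tanh(X @ W)


d_x, d_h = 8, 6                        # hypothetical sizes
X = rng.normal(size=(5, d_x))          # x_i from Eq. 1 for a 5-token sentence

# Share-Loose: separate encoders; only the embeddings X are shared.
tag_enc, parse_enc = make_encoder(d_x, d_h), make_encoder(d_x, d_h)
loose_tag_in, loose_parse_in = tag_enc(X), parse_enc(X)

# Share-Tight: one shared encoder feeds both the tagging and parsing decoders.
shared_enc = make_encoder(d_x, d_h)
tight_tag_in = tight_parse_in = shared_enc(X)

# Stack: the tagger's encoder outputs r^p_i are concatenated to x_i (Eq. 5)
# and the result is fed to a separate parser encoder.
stack_tag_enc = make_encoder(d_x, d_h)
R_p = stack_tag_enc(X)
X_parse = np.concatenate([X, R_p], axis=1)     # x_i^parse = x_i ⊕ r_i^p
stack_parse_enc = make_encoder(d_x + d_h, d_h)
stack_parse_in = stack_parse_enc(X_parse)
```

In the Stack layout the parser sees continuous tagger states rather than discrete predicted tags, which is the property the text credits for better handling of error propagation.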
The idea of the Stack joint method is mainly borrowed from Zhang and Weiss (2016) [26]. They propose the stack-propagation approach to avoid using explicit POS tags in dependency parsers. They employ simple feed-forward networks for both tagging and parsing [2]. Without BiLSTM encoders, they use the hidden outputs of a single-layer MLP of the tagging component as extra inputs of the parsing component.
Training Loss. During training, we directly add together the losses of the different tasks, i.e., the parsing loss, the homogeneous POS tagging loss, and the heterogeneous POS tagging loss.
$$L = L_{\mathrm{DEP}} + L_{\mathrm{POS}} + L_{\mathrm{POS}'} \qquad (6)$$
4 Experiments
In this section, we conduct experiments and detailed analysis to fully investigate the usefulness of POS tagging for dependency parsing.
4.1 Experimental Settings
Data. We conduct experiments on the English Penn Treebank (PTB), the Chinese dataset of the CoNLL-2009 shared task (CoNLL09) [7], and the larger-scale Chinese Penn Treebank 7 (CTB7). For PTB, we adopt the same settings, such as the data split and Stanford dependencies, as Chen and Manning (2014) [2]. We follow the official settings for CoNLL09. We use the Stanford Parser v3.0 to obtain Stanford dependencies for CTB7.³ For Chinese, besides the homogeneous POS tags, we also incorporate the large-scale People's Daily corpus of Peking University (PKU) as heterogeneous POS tagging data.
Evaluation Metrics. We use POS tagging accuracy (TA) for tagging, and unlabeled attachment score (UAS) and labeled attachment score (LAS) for dependency parsing. For UAS and LAS computation, we follow Dozat and Manning (2017) [4] and ignore all punctuation marks for PTB.
Hyper-parameters. We follow most hyper-parameter settings of Dozat and Manning (2017) [4] for all our models. For CharLSTM word representations, we set the dimension of the character embeddings to 50 and the dimension of the CharLSTM outputs to 100. We train each model for at most 1,000 iterations, and stop training if the peak performance on the dev data does not increase in 100 (50 for models with BERT) consecutive iterations.
4.2 Results on the Dev Data
Results of the Pipeline Framework. Table 1 shows the influence of using homogeneous and heterogeneous POS tags in the pipeline framework. More results are also presented to understand the contribution of each of the four components in the input layer. The tagging accuracy is 97.58, 96.59, 96.72, and 97.85 on the dev data of PTB, CoNLL09, CTB7, and PKU, respectively. We perform 5-fold jack-knifing to obtain the automatic homogeneous POS tags on the training data to avoid closed testing, and use the PKU tagger to produce heterogeneous POS tags for sentences of CoNLL09 and CTB7. The results of using only one component clearly show that lexical information (i.e., e^w and e^c) is most crucial for parsing, and that using only POS tag embeddings leads to a very large accuracy drop. When using two components at the same time, using CharLSTM word representations (e^c) is slightly yet consistently better than using POS tag embeddings (e^p), both substantially outperforming the model using only word embeddings (e^w) by at least 0.5 on all three datasets. Moreover, using three components leads to a slight improvement on both CoNLL09 and CTB7, but hurts performance on PTB. Further using heterogeneous tag embeddings slightly degrades the performance. All those results indicate that under the pipeline framework, POS tags become unnecessary and can be well replaced by the CharLSTM word representations. We believe the reasons are two-fold. First, CharLSTM can effectively capture morphological inflections by looking at lemma/prefix/suffix, and thus plays a similar role as POS tags in terms of alleviating the data sparseness problem of words. Second, the error propagation issue makes predicted POS tags less reliable.

Table 1. Parsing performance (LAS) on dev data under the pipeline framework.
Input                       PTB     CoNLL09   CTB7
e^p                         87.79   75.94     75.72
e^w                         93.42   85.30     84.43
e^c                         93.34   84.42     83.73
e^w ⊕ e^p                   93.92   85.94     85.12
e^w ⊕ e^c                   93.97   86.09     85.23
e^w ⊕ e^c ⊕ e^p             93.88   86.17     85.32
e^w ⊕ e^c ⊕ e^p ⊕ e^p'      –       86.01     85.23

Table 2. Parsing performance (LAS) comparison on dev data for the three joint methods.
Tagging data   Method        PTB     CoNLL09   CTB7
homo           Share-Loose   93.95   86.28     85.56
homo           Share-Tight   93.93   86.17     85.56
homo           Stack         94.09   86.26     85.79
hetero         Share-Loose   –       86.05     85.62
hetero         Share-Tight   –       86.25     85.76
hetero         Stack         –       86.16     85.86
homo+hetero    Share-Loose   –       86.30     85.57
homo+hetero    Share-Tight   –       86.62     85.86
homo+hetero    Stack         –       86.69     85.88

³ https://nlp.stanford.edu/software/stanford-dependencies.shtml

Results of the Joint Methods. Table 2 presents the results of the three joint tagging and parsing methods without or with heterogeneous POS tagging. When using only homogeneous POS tagging, we find that the performance gaps between different joint methods are very small. The best joint methods outperform the basic model by 0.1, 0.3, and 0.6 on the three datasets, respectively.
A similar situation arises when using heterogeneous tagging only. When using both homogeneous and heterogeneous tagging, i.e., the w/ hetero setting, we can see that the overall performance is further improved by a large margin. The best Stack method outperforms the basic model by 0.6 on CoNLL09
and 0.7 on CTB7, showing that heterogeneous labeled data can inject useful knowledge into the model. Overall, we can see that the Stack method is more stable and superior compared with the other two methods, and it is adopted for the following experiments and analysis.
4.3 Final Results on the Test Data
Table 3 shows the results on the test data. For the scenario of not using BERT, the pipeline method using homogeneous POS tags is slightly yet consistently inferior to the basic model. The Stack method using only homogeneous POS tags significantly outperforms the basic method by 0.2 (p < 0.005), 0.4 (p < 0.0005), and 0.5 (p < 0.0005) in LAS on the three datasets respectively. Utilizing heterogeneous POS tags on Chinese further boosts parsing performance, leading to large overall improvements of 0.9 (p < 0.0001) on both datasets.

Table 3. Final results on the test data. Note that we report a single run for each model on each dataset, since preliminary experiments, in which we trained Stack w/ hetero and Basic on CTB7 four times each, show that the variance of performance is small (σ² < 0.01).
                                PTB                   CoNLL09               CTB7
                                TA     UAS    LAS     TA     UAS    LAS     TA     UAS    LAS
w/o BERT
Andor et al. (2016) [1]         97.44  94.61  92.79   –      84.72  80.85   –      –      –
Dozat and Manning (2017) [4]    97.3   95.74  94.08   –      88.90  85.38   –      –      –
Ji et al. (2019) [11]           97.3   95.97  94.31   –      –      –       –      –      –
Li et al. (2019) [17]           97.3   95.93  94.19   –      88.77  85.58   –      –      –
Basic (e^w ⊕ e^c)               97.50  95.97  94.34   96.42  89.12  86.00   96.48  88.58  85.40
Pipeline (e^w ⊕ e^c ⊕ e^p)      97.50  95.88  94.27   96.42  89.12  85.98   96.48  88.42  85.28
Stack                           97.91  96.13  94.53   96.55  89.46  86.44   96.62  88.86  85.88
Stack w/ hetero                 –      –      –       96.66  89.85  86.85   96.72  89.26  86.27
w/ BERT
Li et al. (2019) [17]           –      96.67  95.03   –      92.24  89.29   –      –      –
Basic (e^w ⊕ e^c)               97.42  96.85  95.14   97.29  92.21  89.42   97.22  91.66  88.75
Stack                           97.57  96.85  95.25   97.36  92.44  89.68   97.32  91.67  88.84
Stack w/ hetero                 –      –      –       97.39  92.46  89.76   97.40  91.81  89.04
When using BERT, the parsing accuracy of the basic method increases by a very large margin. Compared with the stronger baseline, the improvement introduced by POS tagging becomes smaller. Overall, using both homogeneous and heterogeneous POS tagging, the Stack method significantly outperforms the basic method by 0.3 (p < 0.005) on both CoNLL09 and CTB7. For POS tagging, the trend of performance change is similar. First, the joint method also improves tagging accuracy, especially with heterogeneous POS tagging. Second, using BERT substantially improves TA on both Chinese datasets. However, it is surprising to see a slight decrease in TA on English when using BERT, which is possibly due to over-fitting, considering that TA is already very high on English.
We also list the results of recent previous works. We can see that our final joint models achieve competitive parsing accuracy on PTB and CoNLL09 with or without BERT.
4.4 Detailed Analysis
In the following, we conduct detailed analysis on the CoNLL09 test data, in order to understand or gain more insights on the interactions and mutual influence between POS tagging and dependency parsing. For the joint method, we adopt the Stack model with both homogeneous and heterogeneous POS tagging without using BERT, to jointly produce automatic POS tags and parse trees. For the pipeline method, we use the two basic tagging and parsing models separately to produce automatic results.
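[Figure 3: bar chart of performance changes (%) in tagging and parsing for the tags NN, VV, PU, AD, NR, P, M, JJ, DEG, and DEC; the vertical axis ranges from −0.5 to 2.5.]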
Fig. 3. Changes of tagging accuracy and parsing accuracy (LAS) on the CoNLL09 test set for words of different POS tags. Words/arcs are categorized according to their/their dependent’s gold-standard POS tags.
Correlation of Performance Changes Between Tagging and Parsing. Overall, the joint method outperforms the pipeline method by 0.2 in TA, and 0.7/0.9 in UAS/LAS, as shown in Table 3. To gain more insights, we categorize all words according to their gold-standard POS tags and compare the accuracy changes for each set. Figure 3 shows the most frequent tags. We can see that there is a clear positive correlation of absolute performance changes between tagging and parsing. For instance, for the most frequent tags NN and VV, tagging accuracy increases by 0.2 and 0.7, and parsing accuracy increases by 0.7 and 1.4, respectively. The most notable exception is NR, with opposite changes in tagging and parsing accuracy (−0.6 vs. +1.3), which can be explained from two aspects. First, we find that most NR mistakes are due to the {NR, NN} ambiguous pair, which is syntax-insensitive and thus has a very small impact on parsing decisions. Second, the Stack model may be more robust to tagging errors.
Table 4. The impact of specific POS tagging error patterns on parsing.
Pattern      UAS     LAS       Pattern      UAS     LAS
NN → NN      91.73   89.69     NR → NR      91.73   86.96
NN → VV      67.25   44.98     NR → NN      86.39   83.67
NN → NR      90.43   86.96     JJ → JJ      95.40   94.33
NN → JJ      91.96   20.54     JJ → NN      92.82   14.92
VV → VV      85.92   84.12     DEG → DEG    96.75   95.91
VV → NN      65.60   40.07     DEG → DEC    92.06   26.56
VV → VA      84.75   83.05     DEC → DEC    94.28   92.39
VV → AD      55.32   25.53     DEC → DEG    96.88   22.22
Overall, we conclude that tagging and parsing performance is highly correlated due to the close relationship between the two tasks.
Influence of Tagging Errors on Parsing. In the Stack method, the hidden representations from the tagging encoder are fed into the parsing encoder as extra inputs. We would like to understand how tagging decisions influence parsing. Overall, UAS/LAS are 90.42/88.38 for words getting correct POS tags, whereas they are 73.40/42.59 for wrongly tagged words. We observe a dramatic drop of 17.0/45.8, indicating that POS tags have a much larger influence on LAS than on UAS. Looking deeper into this issue, Table 4 shows the parsing accuracy for words of different POS tagging patterns. A tagging pattern X → Y represents the set of words whose correct tag is X and which are tagged as Y. We can see that higher parsing accuracy is usually achieved by correct tagging patterns X → X than by wrong patterns X → NOT-X, except DEC → DEG in UAS.⁴
The tagging ambiguities can be classified into three types. First, the syntax-sensitive ambiguous pairs such as {NN, VV} and {VV, AD} lead to a large performance decrease in both UAS and LAS if wrongly tagged. Second, the syntax-insensitive ambiguous pairs such as {NN, NR} and {VV, VA} have a very small influence on parsing accuracy. Finally, some ambiguous pairs only greatly influence LAS but have little effect on UAS, such as {NN, JJ} and {DEC, DEG}.
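The per-pattern breakdown used in this analysis can be computed with a simple grouping step. The sketch below is an assumed reconstruction of that bookkeeping, not the authors' evaluation script: words are grouped by tagging pattern (gold tag X, predicted tag Y), and UAS/LAS are computed within each group. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def attachment_by_tag_pattern(gold_tags, pred_tags, gold_arcs, pred_arcs):
    """Group words by tagging pattern X -> Y and compute UAS/LAS per group.

    gold_arcs / pred_arcs hold one (head, label) pair per word.
    Returns {(X, Y): (uas, las)} with scores in percent.
    """
    # Per pattern: [word count, correct heads, correct heads with correct labels]
    counts = defaultdict(lambda: [0, 0, 0])
    for gt, pt, (gh, gl), (ph, pl) in zip(gold_tags, pred_tags, gold_arcs, pred_arcs):
        c = counts[(gt, pt)]
        c[0] += 1
        if gh == ph:
            c[1] += 1
            if gl == pl:
                c[2] += 1
    return {
        pattern: (100.0 * n_head / total, 100.0 * n_both / total)
        for pattern, (total, n_head, n_both) in counts.items()
    }


# Toy data for illustration only; tags follow CTB names, labels are made up.
gold_tags = ["NN", "NN", "VV", "NN"]
pred_tags = ["NN", "VV", "VV", "NN"]
gold_arcs = [(2, "nsubj"), (3, "dobj"), (0, "root"), (3, "dobj")]
pred_arcs = [(2, "nsubj"), (3, "nsubj"), (0, "root"), (1, "dobj")]
scores = attachment_by_tag_pattern(gold_tags, pred_tags, gold_arcs, pred_arcs)
```

In this toy input the mistagged word (NN → VV) keeps its head but loses its label, which mirrors the LAS-only type of degradation discussed above.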
5 Related Works
Previous studies [16,20] show POS tags are indispensable ingredients for composing different features in the traditional pre-DL dependency parsers [14,27].
⁴ DEG and DEC are two tags for the frequently used auxiliary word "的" (dē, translated as "of" or "that") in Chinese. "的" is tagged as DEG in the phrase "…", while as DEC in "…".
Meanwhile, Li et al. (2011) [18] show that the error propagation problem introduced by predicted POS tags degrades parsing accuracy by about 6 (UAS) on different Chinese datasets. Therefore, researchers propose to jointly model the POS tagging and dependency parsing tasks in the graph-based [18] and transition-based [9] frameworks, leading to promising results. The key challenge is to define scoring functions on the joint results, and to design effective search algorithms to determine optimal joint answers in the enlarged search space. Furthermore, joint models of word segmentation, POS tagging, and parsing are also proposed [10].
In the DL era, joint modeling of multiple related tasks becomes much easier under the MTL framework [3]. In fact, MTL has become an extensively used and powerful technique for many problems. The basic idea is sharing the encoder part while using separate decoders for different tasks. The major advantages of employing MTL are two-fold, i.e., 1) exploiting the correlation and mutual helpfulness among related tasks, and 2) making direct use of all (usually non-overlapping) labeled data of different tasks. The Share-Loose and Share-Tight methods are both typical MTL frameworks, and the main difference lies in the amount of shared parameters. Actually, there are still many other variants due to the flexibility of MTL. For example, Straka (2018) [23] stacks task-specific private BiLSTMs over shared BiLSTMs for joint tagging and parsing. Based on the current results, we expect that such variants may achieve very similar performance.
The Stack method is similar to the stack-propagation method of Zhang and Weiss (2016) [26]. Their basic idea is to use the hidden outputs of the POS tagging components as extra inputs of the parsing components, forming a stacked structure. During training, the parsing loss is directly propagated into the full tagging component, whereas the tagging loss only indirectly influences the parsing components via their shared parts.
Their pioneering work employs a simple feed-forward network for both tagging and parsing [2], and only achieves an LAS of 91.41 on PTB. Another inspiring work related to the Stack method is Hashimoto et al. (2017) [8], who propose to jointly train many tasks of different complexity in a very deep and cascaded network architecture, where higher levels are used for more complex tasks.
6 Conclusions
Unlike the findings in traditional pre-DL dependency parsing, recent studies indicate that POS tagging becomes much less important and can be replaced by CharLSTM word representations in neural dependency parsers. However, a full and systematic investigation of this interesting issue, from both empirical and linguistic perspectives, is lacking. In this paper, we investigate the role of POS tagging for neural dependency parsing in both pipeline and joint frameworks. We design and compare three typical joint methods based on the state-of-the-art biaffine parser. We try to accommodate both homogeneous and heterogeneous POS tagging, considering it is much cheaper to annotate POS tags than parse trees and there exist large-scale heterogeneous POS tag datasets for
Chinese. Based on the experiments and analysis on three English and Chinese benchmark datasets, we can draw the following conclusions.
• For the pipeline method, both homogeneous and heterogeneous POS tags provide little help to the basic parser with both word embeddings and CharLSTM, due to error propagation and the overlapping role in reducing data sparseness.
• The three joint methods investigated in this work perform better than the pipeline method. Among them, the Stack method is more stable and superior compared with the other two, leading to significant improvement over the basic model on all datasets.
• POS tagging is still helpful for dependency parsing under the joint framework even if the parser is enhanced with BERT, especially with heterogeneous POS tagging.
• Detailed analysis shows that POS tagging and dependency parsing are two closely correlated tasks. In particular, if the joint model fails to resolve syntax-sensitive POS tagging ambiguities, it usually makes wrong parsing decisions as well.
References
1. Andor, D., et al.: Globally normalized transition-based neural networks. In: Proceedings of ACL (2016)
2. Chen, D., Manning, C.: A fast and accurate dependency parser using neural networks. In: Proceedings of EMNLP (2014)
3. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of ICML (2008)
4. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. In: Proceedings of ICLR (2017)
5. Dozat, T., Qi, P., Manning, C.D.: Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In: Proceedings of CoNLL (2017)
6. Dyer, C., Ballesteros, M., Ling, W., Matthews, A., Smith, N.A.: Transition-based dependency parsing with stack long short-term memory. In: Proceedings of ACL (2015)
7. Hajič, J., et al.: The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages. In: Proceedings of CoNLL (2009)
8. Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: growing a neural network for multiple NLP tasks. In: Proceedings of EMNLP (2017)
9. Hatori, J., Matsuzaki, T., Miyao, Y., Tsujii, J.: Incremental joint POS tagging and dependency parsing in Chinese. In: Proceedings of IJCNLP (2011)
10. Hatori, J., Matsuzaki, T., Miyao, Y., Tsujii, J.: Incremental joint approach to word segmentation, POS tagging, and dependency parsing in Chinese. In: Proceedings of ACL (2012)
11. Ji, T., Wu, Y., Lan, M.: Graph-based dependency parsing with graph neural networks. In: Proceedings of ACL (2019)
12. Kiperwasser, E., Goldberg, Y.: Simple and accurate dependency parsing using bidirectional LSTM feature representations. Trans. ACL (2016)
13. Kitaev, N., Klein, D.: Constituency parsing with a self-attentive encoder. In: Proceedings of ACL (2018)
14. Koo, T., Collins, M.: Efficient third-order dependency parsers. In: Proceedings of ACL (2010)
15. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of NAACL (2016)
16. Lei, T., Xin, Y., Zhang, Y., Barzilay, R., Jaakkola, T.: Low-rank tensors for scoring dependency structures. In: Proceedings of ACL (2014)
17. Li, Y., Li, Z., Zhang, M., Wang, R., Li, S., Si, L.: Self-attentive biaffine dependency parsing. In: Proceedings of IJCAI (2019)
18. Li, Z., Zhang, M., Che, W., Liu, T., Chen, W., Li, H.: Joint models for Chinese POS tagging and dependency parsing. In: Proceedings of EMNLP (2011)
19. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, p. 3 (2013)
20. McDonald, R., Petrov, S., Hall, K.: Multi-source transfer of delexicalized dependency parsers. In: Proceedings of EMNLP (2011)
21. Nivre, J., et al.: Universal dependencies v1: a multilingual treebank collection. In: Proceedings of LREC (2016)
22. Roller, S., Kiela, D., Nickel, M.: Hearst patterns revisited: automatic hypernym detection from large text corpora. In: Proceedings of ACL (2018)
23. Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of CoNLL (2018)
24. Zeman, D., et al.: CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies. In: Proceedings of CoNLL (2018)
25. Zhang, M., Li, Z., Fu, G., Zhang, M.: Syntax-enhanced neural machine translation with syntax-aware word representations. In: Proceedings of NAACL (2019)
26. Zhang, Y., Weiss, D.: Stack-propagation: improved representation learning for syntax. In: Proceedings of ACL (2016)
27. Zhang, Y., Nivre, J.: Transition-based dependency parsing with rich non-local features. In: Proceedings of ACL (2011)
28. Zhou, H., Zhang, Y., Huang, S., Chen, J.: A neural probabilistic structured-prediction model for transition-based dependency parsing. In: Proceedings of ACL (2015)
A Span-Based Distantly Supervised NER with Self-learning
Hongli Mao¹, Hanlin Tang¹, Wen Zhang², Heyan Huang¹, and Xian-Ling Mao¹(B)
¹ School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
[email protected], {hltang,hhy63,maoxl}@bit.edu.cn
² Huazhong University of Science and Technology, Wuhan, China
[email protected]
Abstract. The lack of labeled data is one of the major obstacles for named entity recognition (NER). Distant supervision, which automatically generates annotated training datasets from dictionaries, is often used to alleviate this problem. However, as far as we know, existing distant-supervision-based methods do not consider the latent entities which are not in the dictionaries. Intuitively, entities of the same type have similar contextualized features, so we can use these features to extract latent entities from the corpus into the corresponding dictionaries and thereby improve the performance of distant-supervision-based methods. Thus, in this paper, we propose a novel span-based self-learning method, which employs span-level features to update the corresponding dictionaries. Specifically, the proposed method directly takes all possible spans into account and scores them for each label, then picks latent entities from the candidate spans into the corresponding dictionaries based on both local and global features. Extensive experiments on two public datasets show that our proposed method performs better than the state-of-the-art baselines.
Keywords: Named entity recognition · Distant supervision · Span-level · Self-learning
1 Introduction
H. Mao and H. Tang: equal contribution.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 192–203, 2020. https://doi.org/10.1007/978-3-030-60450-9_16
Named Entity Recognition (NER) is a task that takes an utterance as the input and outputs identified entities, such as person names, locations, and organizations. It is one of the fundamental components in many natural language processing tasks such as syntactic parsing [9], relation extraction [13] and co-reference resolution [3]. A major issue encountered in the development of NER is data sparsity. It is challenging to obtain a large amount of labeled data in new
A Span-Based Distantly Supervised NER with Self-learning
193
domains. Recently, distantly supervised methods [6] have been applied to automatically generate labeled data according to domain specific dictionaries. These methods first identify entity mentions by exact string matching with the dictionary, and then assign corresponding types to the entity mentions [7]. Although distant supervision is effective to label data automatically, it suffers from the noisy labeling problem due to limited coverage of the dictionary. Traditional token-level models with CRF architecture are sequential, which lead to cascading errors and are sensitive to these noises. To address the issue of noisy labeling, previous studies usually use external knowledge to alleviate the negative impact of noises. For example, Yang et al. [23] and Nooralahzadeh et al. [14] utilize golden labeled data to train a reinforcement learning agent for filtering out noisy sentences. Shang et al. [19] and Liu et al. [11] introduce high-quality phrases to expand the dictionary and reduce the false negative labels. However, nearly all existing methods ignore latent entities within corpuses. External knowledge is sometimes hard to obtain, we can just use the distribution of existing entities to mine latent entities. Intuitively, entities of the same type constantly appear in the same syntactic structure. For example, suppose raw text contains “Xiaomi mobile” and “NOKIA mobile”, brand entity “Xiaomi” and product entity “mobile” are included in the dictionary, hence “NOKIA” is more likely to be a brand entity. In order to detect latent entities with this property, we try to treat NER task as span classification and score spans for each label. Then we can add those high-confidence spans into the dictionary to improve the performance of the distantly supervised NER. In additional, different from sequence labeling framework, the span-based approach models each possible span independently and does not suffer from cascading errors, thus it is more robust to noises. 
Based on the above considerations, we propose a novel span-based model with self-learning for the distantly supervised NER task. Specifically, our span-based model first directly predicts the type distribution of spans based on span representations induced from neural networks. Then a self-learning algorithm iteratively selects latent entities from the candidate spans into the dictionary based on our proposed confidence measure. This confidence measure considers not only the probability of the span from a local view, but also its frequency from a global view. Finally, in order to find a non-overlapping span partition of the sentence in the inference process, a greedy algorithm is applied to pick the higher scoring labeled spans. Extensive experiments on two public datasets illustrate that our method can effectively learn span-level information and recognize latent entities, achieving new state-of-the-art results. Our contributions in this work include:
• We propose a novel span-based self-learning method for the distantly supervised NER task, which mines latent entities by span-level features.
• Extensive experiments on two benchmark datasets empirically demonstrate the effectiveness of our method.
• A multi-criterion confidence measure is designed to add high-confidence entities to the corresponding dictionaries by considering both local and global features.
2 Related Work
2.1 Supervised NER
The task of supervised named entity recognition (NER) is generally treated as a sequence labeling problem. Traditional methods based on probability statistics, such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF), require a large number of features and additional resources [16,17]. Neural network based methods using popular architectures like BiLSTM-CRF report state-of-the-art results on this task without any feature engineering [10,12]. Recently, contextualized word embeddings [1,5] have been explored to further improve the state-of-the-art results in NER. Unlike these existing methods, our method is devoted to learning NER in the distantly supervised setting, using only the dictionary.
2.2 Distantly Supervised NER
In general, a large amount of labeled data is required to train neural models, and such data is quite expensive to obtain. Distant supervision [13] was proposed to address this issue using existing knowledge resources. It has been successfully applied to tasks like relation extraction [2,13] by assuming that all sentences that mention the two entities of a fact triple describe the relation in the triple. Recently, distant supervision [6,11,14,19,23] has also attracted attention for NER. Distant-LSTM-CRF [6] uses syntactic rules and quality phrases to automatically label tokens, then trains a BiLSTM-CRF classifier on the labeled datasets. Yang et al. [23] and Nooralahzadeh et al. [14] propose methods combining the Partial-CRF approach and reinforcement learning for distant supervision. However, their methods need golden labeled data to guide the selector in reinforcement learning. Shang et al. [19] employ a new type of tagging scheme (i.e., Tie or Break) to determine entity boundaries before predicting entity types, and use a set of high-quality phrases to reduce the false negative labels. Liu et al. [11] propose a span-level model with a dynamic programming inference algorithm; however, their dynamic programming suffers from lower precision, and they still need a set of high-quality phrases to extend the dictionary. Different from the above methods, our method applies a self-learning algorithm on top of the span-level model to detect latent entities; it needs neither a set of high-quality phrases nor golden labeled data, and is more robust to noise in distant supervision.
2.3 Span-Based Models in Other NLP Tasks
Span-based models have played a fundamental part in the success of recent work on NLP tasks. Kitaev and Klein [8] incorporate the LSTM-Minus feature into their parsing model and achieve the best results in the constituency parsing task. As for Semantic Role Labeling (SRL), Ouchi et al. [15] treat SRL as span selection and seek to select appropriate spans for each label. In coreference resolution, Wu et al. [22] present a state-of-the-art model that casts anaphora identification as the task of query-based span prediction.
Fig. 1. Example of the sentence generating candidate spans.
3 The Proposed Method
Our method consists of four components. First, the spans generator is applied to obtain candidate spans along with their types. Then the span-based neural model is trained on these annotated spans. Next, the high scoring labeled spans are iteratively picked into the dictionary via a self-learning algorithm. Finally, in the inference process, a greedy algorithm is adopted to find a non-overlapping span partition of the sentence. In the following, we introduce the four components of our method in detail.
3.1 Spans Generator
Given a sentence that consists of $T$ words $w_{1:T} = w_1, \cdots, w_T$ and a dictionary $D$, we need to generate all possible spans $Y = \{(i, j, r)_k\}_{k=1}^{|Y|}$. Each span $(i, j, r)$ consists of word indices $i$ and $j$ in the sentence ($1 \le i \le j \le T$) and a span label $r \in R$, where $R$ represents the set of pre-defined types plus the None type. Here we focus mainly on Chinese NER. Unlike languages such as English, Chinese takes characters as units. If we generated all possible spans based on Chinese characters, a large number of spans labeled None would be obtained. So, unlike previous work [11], we first segment the sentence into phrases using the Jieba segmentation tool¹, and then generate all possible spans by combining adjacent phrases whose number is no more than a specified threshold. In addition, because some entities may not be recognized by the word segmentation tool, we add the dictionary $D$ to Jieba’s user dictionary so that the candidate spans can cover more entities. The procedure for generating spans from a sentence is depicted in Fig. 1: single spans are obtained by word segmentation, and combination spans are generated here by combining up to two adjacent single spans. Compared with generating spans based on Chinese characters, our method filters out a large number of meaningless spans while retaining entities. In order to get the labels of the spans, we annotate the unlabeled sentences by exact string matching of the entity mentions in the dictionary, where conflicting matches are resolved by maximizing the total number of matched tokens [7,19]. Therefore, we sort the entity mentions in the dictionary by their length, and match the spans in the sentences greedily, starting from the longest one.
¹ https://github.com/fxsjy/jieba
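The span generation and dictionary labeling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence is assumed to be already segmented into phrases (the Jieba step is left out), indices are 0-based rather than 1-based, "length" is taken to be the character length of the joined phrases, and the function names `generate_spans` and `label_spans`, the `sep` parameter and the "location" label are hypothetical.

```python
from typing import Dict, List, Tuple


def generate_spans(phrases: List[str], max_phrases: int = 2) -> List[Tuple[int, int]]:
    """Return all (i, j) spans covering at most max_phrases adjacent phrases."""
    spans = []
    for i in range(len(phrases)):
        for j in range(i, min(i + max_phrases, len(phrases))):
            spans.append((i, j))
    return spans


def label_spans(phrases: List[str], dictionary: Dict[str, str],
                max_phrases: int = 2, sep: str = "") -> List[Tuple[int, int, str]]:
    """Label candidate spans by exact dictionary matching, longest match first.

    A span that overlaps an already matched (longer) span is labeled "None",
    so annotations are never nested or overlapping.
    """
    candidates = generate_spans(phrases, max_phrases)

    def text(span: Tuple[int, int]) -> str:
        return sep.join(phrases[span[0]:span[1] + 1])

    taken = set()   # phrase positions already covered by a matched entity
    labels = {}
    for span in sorted(candidates, key=lambda s: len(text(s)), reverse=True):
        covered = set(range(span[0], span[1] + 1))
        if text(span) in dictionary and not (covered & taken):
            labels[span] = dictionary[text(span)]
            taken |= covered
        else:
            labels[span] = "None"
    return [(i, j, labels[(i, j)]) for i, j in candidates]


# Illustrative input only (English tokens joined by a space for readability).
phrases = ["Yangtze", "River", "Bridge"]
dictionary = {"Yangtze River": "location", "Yangtze River Bridge": "location"}
labeled = label_spans(phrases, dictionary, max_phrases=3, sep=" ")
```

In this toy input, only the longer mention is labeled as an entity, and the shorter dictionary entry that overlaps it falls back to the None type.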
Fig. 2. The overall architecture of our span-based model
Moreover, the annotations are not allowed to be nested or overlapping. For instance, if both “Yangtze River” and “Yangtze River Bridge” are included in the dictionary and the raw text contains “Yangtze River Bridge”, only the longer one is recognized as the entity mention.
3.2 Span-Based Neural Model
Figure 2 illustrates the overall architecture of our span-based model. The word encoder layer creates token embeddings for each raw token. From the token embeddings, we use a bidirectional LSTM (BiLSTM) [10,18] to calculate contextualized features. The span representations are extracted from these features. Based on them, we obtain the probability distribution for each span via the softmax function.
Span Representation. We use $x_t$ to represent the raw token embedding of token $t$ ($1 \le t \le T$). In order to obtain the contextualized word representation $x_t$, we apply the first six layers of pre-trained BERT [4] as the word encoder. Then we feed the token embeddings $x_t$ into the BiLSTM to obtain $h_t$,
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \qquad (1)$$
where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the last layer of the forward and backward LSTM respectively, and $h_t$ is their concatenation.
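As a rough numerical sketch of this section's pipeline (hidden states, then a span representation, then a softmax over labels and a cross-entropy loss), the snippet below uses plain Python lists in place of tensors. It follows the sum/difference span representation of Eq. (2), the softmax classifier of Eq. (3) and the loss of Eq. (4), which the following paragraphs define. The hidden states and weights are made-up numbers, a single linear layer stands in for the MLP, and all function names are hypothetical; the BERT encoder and BiLSTM are not reproduced.

```python
import math
from typing import List


def span_representation(h: List[List[float]], i: int, j: int) -> List[float]:
    """g = [h_i + h_j ; h_i - h_j] for 0-based token indices i <= j."""
    added = [a + b for a, b in zip(h[i], h[j])]
    subtracted = [a - b for a, b in zip(h[i], h[j])]
    return added + subtracted  # list concatenation


def linear(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """One linear layer, used here as a stand-in for the MLP (an assumption)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over label scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities: List[List[float]], gold: List[int]) -> float:
    """J = -sum over candidate spans of log p(gold label)."""
    return -sum(math.log(p[r]) for p, r in zip(probabilities, gold))


# Made-up hidden states for a two-token sentence, two labels (entity type vs. None).
h = [[1.0, 2.0], [3.0, -1.0]]
g = span_representation(h, 0, 1)
W = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
b = [0.0, 0.0]
p = softmax(linear(W, b, g))
loss = cross_entropy([p], [0])
```

The span vector has twice the dimension of a hidden state, which is why the stand-in weight matrix has four columns for two-dimensional hidden states.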
Inspired by previous work [20,21], for each span $(i, j)$, its span representation $g$ is defined as:
$$g = [h_i + h_j ; h_i - h_j] \qquad (2)$$
where the addition and subtraction features of the $i$-th and $j$-th hidden states are concatenated and used for representing span $(i, j)$. The middle part of Fig. 2 illustrates an example of this process. For the span (4, 5), the 4th and 5th features ($h_4$ and $h_5$) are calculated from the BiLSTM. Then these two vectors are added, and the 5th vector is subtracted from the 4th vector. The resulting vectors are concatenated to form the span representation.
Loss Function. For the span $i$, to compute its probability distribution over all types $p_i$, the span representation $g_i$ is fed into a softmax classifier:
$$p_i = \mathrm{softmax}(\mathrm{MLP}(g_i)) \qquad (3)$$
where MLP is a multilayer perceptron. The output size of the MLP, and hence the size of $p_i$, is equal to the number of NER classes. To train the parameters $\theta$ of our span-based model, we minimize the cross-entropy loss function:
$$J(\theta) = -\sum_{c \in C} \log p_c(r \mid \theta) \qquad (4)$$
where $C$ indicates all the candidate spans in the training set and $r$ is the label of span $c$.
3.3 Self-learning
Due to the limited coverage of the dictionary, the labels of the training set are noisy. In order to expand the dictionary, we design a self-learning algorithm that iteratively mines latent entities by making full use of span-level features.
Algorithm 1. Detecting entities with self-learning algorithm
Input: Raw sentences S and dictionary D
repeat
  1. generate all possible spans C = {(i, j, r)} based on S and D
  2. train span-based model M with C
  3. select top k entities tagged by M but not appearing in D, based on the confidence measure
  4. add all top k entities into D
until the stopping criterion is met
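The loop of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under several assumptions rather than the authors' code: steps 1 and 2 (span generation and model training) are hidden behind a caller-supplied `train_and_tag` function, the confidence score is the product of a local probability and a global frequency as in Eqs. (5) to (7) of this section, an entity seen several times is given its highest predicted probability as local confidence, and its frequency is counted over the spans the model tags as entities. The names `select_top_k`, `self_learning` and `fake_tagger` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, str, float]  # (span text, predicted label, probability)


def select_top_k(predictions: List[Prediction], dictionary: Dict[str, str],
                 k: int) -> List[Tuple[str, str, float]]:
    """Step 3: rank tagged spans not yet in the dictionary by combined confidence."""
    counts = Counter(text for text, _, _ in predictions)  # global frequency n
    best: Dict[str, Tuple[str, float]] = {}
    for text, label, prob in predictions:
        if text in dictionary:
            continue
        # Assumption: keep the most confident prediction per distinct entity string.
        if text not in best or prob > best[text][1]:
            best[text] = (label, prob)
    scored = [(text, label, prob * counts[text]) for text, (label, prob) in best.items()]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


def self_learning(sentences: List[str], dictionary: Dict[str, str],
                  train_and_tag: Callable[[List[str], Dict[str, str]], List[Prediction]],
                  k: int, iterations: int) -> Dict[str, str]:
    """Repeat: label and train (steps 1-2), select top k (step 3), grow D (step 4)."""
    dictionary = dict(dictionary)  # do not modify the caller's dictionary
    for _ in range(iterations):
        predictions = train_and_tag(sentences, dictionary)
        for text, label, _ in select_top_k(predictions, dictionary, k):
            dictionary[text] = label
    return dictionary


def fake_tagger(sentences: List[str], dictionary: Dict[str, str]) -> List[Prediction]:
    """Stand-in for the trained span model; returns fixed, made-up predictions."""
    return [("NOKIA", "brand", 0.9), ("NOKIA", "brand", 0.8),
            ("case", "product", 0.95), ("Xiaomi", "brand", 0.99)]


grown = self_learning(["raw sentence"], {"Xiaomi": "brand"}, fake_tagger, k=1, iterations=2)
```

With these made-up predictions, the frequent span is added in the first iteration even though a rarer span has a higher probability, which is the effect the global term is meant to have.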
The input of the algorithm is raw sentences and a dictionary containing entity mentions along with their types. First, all possible spans are generated from the sentences, and the dictionary is used to assign labels to each span by the longest matching algorithm, where matching spans are not allowed to be nested or overlapping. Then, our span-based neural model is trained on these candidate spans to model their type distributions. Finally, the top k latent entities ranked by a confidence measure are added into the dictionary. The stop condition is reached after this process has been repeated E times. Algorithm 1 presents the details of the self-learning procedure.
In every iteration, the top k ranked entities tagged by the span-based model are added to the dictionary. Therefore, a reliable confidence measure is crucial to the success of the self-learning algorithm. If one bogus entity is selected, it will lead to the generation of wrongly labeled spans and the selection of many other bogus entities. Hence, we propose a confidence measure that considers how confidently an entity is labeled not only locally but also globally. The local confidence of an entity $I$ is defined as the probability of the entity type calculated by the span-based model:
$$\mathrm{LocalConf}(I) = \max_{r \in R} p_i(r \mid \theta) \qquad (5)$$
where $R$ represents the set of NER classes. The global confidence concerns the frequency with which the entity occurs in all candidate spans. The linguistic intuition here is that an entity that is labeled by the model and occurs frequently is usually a latent entity. The global confidence is computed as:
$$\mathrm{GlobalConf}(I) = n \qquad (6)$$
where $n$ is the number of times the entity $I$ appears in the candidate spans. We then combine the two confidence measures by simply taking their product:
$$\mathrm{ComConf}(I) = \mathrm{LocalConf}(I) \times \mathrm{GlobalConf}(I) \qquad (7)$$
3.4 Inference
After iterative training, the increasing coverage of the dictionary improves the performance of our span-based neural model. In the inference process, we need to find a non-overlapping span partition of the sentence. Liu et al. [11] propose a dynamic programming inference algorithm to find a non-overlapping span partition such that the joint probability of each span being of the None type is minimized. However, this dynamic programming based approach ignores the comparison of probabilities between labels and increases recall at the cost of lower precision. Here, we use a greedy algorithm to deal with the redundant, overlapping entity proposals tagged by the model and to output the real entities. The idea of our algorithm is simple but effective: greedily select the entity proposal with the maximum probability, delete the conflicting entity proposals, and repeat this process until all the proposals are processed.
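The greedy selection described above can be sketched in a few lines. This is a minimal illustration, assuming each proposal is a tuple of 0-based inclusive boundaries, a non-None label and the model probability; the function name `greedy_decode` and the example proposals are hypothetical.

```python
from typing import List, Tuple

Proposal = Tuple[int, int, str, float]  # (start, end inclusive, label, probability)


def greedy_decode(proposals: List[Proposal]) -> List[Proposal]:
    """Keep the highest-probability proposals that do not overlap each other."""
    selected: List[Proposal] = []
    for start, end, label, prob in sorted(proposals, key=lambda p: p[3], reverse=True):
        # A proposal is compatible if it lies entirely before or after every kept span.
        if all(end < s or start > e for s, e, _, _ in selected):
            selected.append((start, end, label, prob))
    return sorted(selected)  # return in sentence order


# Made-up proposals: the first two overlap on position 1.
proposals = [(0, 1, "brand", 0.6), (1, 2, "product", 0.9), (3, 3, "model", 0.7)]
entities = greedy_decode(proposals)
```

Processing proposals in descending probability order and skipping any that conflict with an already kept span is equivalent to repeatedly picking the best proposal and deleting its conflicts.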
4 Experiments
4.1 Datasets
We perform experiments on two public datasets provided by Yang et al. [23] to compare our method with other approaches.
• EC is a Chinese dataset in the e-commerce domain. There are five entity types: “brand”, “product”, “model”, “material” and “specification”. This corpus contains 2,400 sentences tagged by annotators: 1,200 instances for training, 400 for dev and 800 for test. Yang et al. [23] provide a dictionary of 927 entries and 2,500 sentences as raw text.
• NEWS is another Chinese dataset, from the news domain. It is labeled only with the PERSON type. This corpus contains 3,000 human annotated sentences as training data, 3,328 as dev data, and 3,186 as test data. Yang et al. [23] apply distant supervision to raw data and obtain 3,722 annotated sentences.
4.2 Baselines
To evaluate our approach, we compare it with the following baselines:
• Dict-based: The collected entity dictionary is directly used to match the strings in the test data.
• LSTM-CRF [10]: A supervised model achieving state-of-the-art performance in the NER task.
• LSTM-CRF-PA+RL1 [23]: A distantly supervised approach combining partial annotation learning and reinforcement learning.
• LSTM-CRF-PA+RL2 [14]: Another reinforcement model, which formulates a new reward function in RL that differs from [23].
To make a fair comparison, we implement LSTM-CRF, LSTM-CRF-PA+RL1 and LSTM-CRF-PA+RL2 with the same word encoder as in our work.
4.3 Metrics
Following the standard setting [10], we evaluate the methods using the micro-averaged F1 score and also report precision (Pre) and recall (Rec), all in percentages.
4.4 Parameter Settings
In order to obtain contextualized word representations, we apply the first six layers of pre-trained BERT² as the word encoder and fine-tune BERT during the training procedure. For model parameters, we empirically set the batch size Bs = 32, the learning rate $\lambda = 2 \times 10^{-5}$, the dimension of the LSTM hidden features to 300, the top k value to 30 and the number of iterations E = 5. The multilayer perceptron (MLP) has two hidden layers with 600 dimensions, each followed by a ReLU activation. In the training process, we employ dropout with a drop rate of 0.5 to guard against overfitting and use Adam with default parameters as the optimizer.
² https://github.com/ymcui/Chinese-BERT-wwm
4.5 Results on Human Annotated Data and Distantly Labeled Data
When the training set contains human annotated data (H), we first use the clean data to initialize the span-based model, then train on the human annotated (H) and distantly labeled data (A) together. Table 1 shows the performance of the different methods.
Table 1. NER Performance Comparison. The proposed method is trained on human annotated (H) and distantly labeled data (A).
Model | Training set | EC Pre | EC Rec | EC F1 | NEWS Pre | NEWS Rec | NEWS F1
Dict-based | \ | 75.60 | 31.05 | 44.02 | 96.08 | 31.77 | 47.75
LSTM-CRF | H | 61.16 | 62.09 | 61.62 | 80.08 | 76.73 | 78.37
LSTM-CRF | H + A | 60.72 | 53.45 | 56.85 | 82.60 | 62.27 | 71.01
LSTM-CRF-PA+RL1 | H + A | 63.42 | 62.37 | 62.89 | 83.42 | 81.53 | 82.46
LSTM-CRF-PA+RL2 | H + A | 66.75 | 62.36 | 64.48 | 83.75 | 82.12 | 82.93
Our method | H + A | 67.55 | 63.94 | 65.70 | 85.63 | 84.84 | 85.23
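As a small worked check of the metrics, the F1 values in Table 1 follow from the reported precision and recall through the harmonic mean, for example 75.60 and 31.05 give 44.02 for the Dict-based row on EC. The sketch below computes this, together with micro-averaged precision, recall and F1 over sets of predicted and gold spans. Exact matching of (start, end, label) triples is assumed as the matching criterion, and the function names `f1` and `micro_prf` and the toy spans are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def f1(pre: float, rec: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * pre * rec / (pre + rec) if pre + rec else 0.0


def micro_prf(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> Tuple[float, float, float]:
    """Micro-average: pool the counts over all sentences before dividing."""
    true_positive = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    pre = true_positive / n_pred if n_pred else 0.0
    rec = true_positive / n_gold if n_gold else 0.0
    return pre, rec, f1(pre, rec)


# Check against two rows of Table 1 (values in percent).
dict_based_ec = f1(75.60, 31.05)
our_method_news = f1(85.63, 84.84)

# Toy spans: two sentences, three gold entities, two predictions, one correct.
gold = [{(0, 0, "brand"), (1, 2, "product")}, {(3, 3, "model")}]
pred = [{(0, 0, "brand")}, {(2, 3, "model")}]
pre, rec, score = micro_prf(gold, pred)
```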
Among all methods, the dictionary based method achieves the highest precision, which is not surprising as entities in the dictionary are always correct. However, it suffers from low recall due to the low coverage of the dictionary. The fine-grained CRF focuses more on word-level features, which makes it more vulnerable to noise. So, when trained on H and A, the LSTM-CRF system obtains much lower performance on both datasets than LSTM-CRF trained on H alone. LSTM-CRF-PA+RL1 and LSTM-CRF-PA+RL2 use reinforcement learning to filter out noisy sentences. However, they cannot detect latent entities and are also limited by the fine-grained CRF. Our method achieves the best recall and F1 values among all methods, which demonstrates that it is more robust to noise. In addition, our design surpasses LSTM-CRF-PA+RL2 by 2.3% F1 score on the NEWS dataset. This result can be explained by the fact that our model casts the NER task as a binary classification problem, as there is only the PERSON entity type in the NEWS dataset.
4.6 Results on Distantly Labeled Data
In a specific NER domain, gold supervision data is often hard to obtain. We therefore examine the performance of the proposed model trained only with distantly labeled data (A) on the EC dataset. Experimental results for the different methods are reported in Table 2. LSTM-CRF clearly suffers from low recall because of the limited coverage of the dictionary. When trained without human labeled data, the improvement of LSTM-CRF-PA+RL2 in F1 over LSTM-CRF is not significant. This reveals that the guidance of prior knowledge about which sentences are labeled correctly is key to the selector in reinforcement learning. It is worth mentioning that even without the self-learning algorithm, our span-based model still achieves competitive performance, which reveals that the span-based model is more suitable for the distantly supervised NER task than CRF methods. This is because the span-based model does not suffer from cascading errors and is more robust to noise. However, the recall of the span-based model is still low. By detecting latent entities and adding them to the dictionary via self-learning, the Span-based Model+SL method boosts recall from 49.78 to 54.68 and achieves the best F1 among all methods.
Table 2. NER Performance Comparison. The proposed method is only trained on distantly labeled data (A).
Model | EC Pre | EC Rec | EC F1
LSTM-CRF | 56.67 | 48.86 | 52.48
LSTM-CRF-PA+RL2 | 56.38 | 49.55 | 52.74
Span-based Model | 60.15 | 49.78 | 54.48
Span-based Model+SL | 59.39 | 54.68 | 56.94
4.7 Case Study
In this section, we provide some samples of latent entities detected by our method in Table 3. In sample 1, the latent entity […] discovered by our method appears in the same sentence structure as the existing entities […] and […], which demonstrates that our model can really learn the semantic information of spans. In sample 2, we find that some entities indeed appear more than once in the domain-specific corpus. This fact validates the effectiveness of our confidence measure combining local and global features. Sample 3 indicates that adding latent entities to the dictionary can not only address false negative cases, but also reduce the negative impact of false positive instances. For example, […] is initially only tagged […], but when our model adds the detected entity to the dictionary, we can get its correct label by using the longest string matching algorithm.
Table 3. Samples of latent entities. In sample 1 and sample 3, “PDT” and “BAD” mean product and brand entities tagged by the dictionary matching, and one string with underline should be a real entity. In sample 2, the number of times the latent entities occur in raw text is reported.
5 Conclusion and Future Work
In this paper, we introduce a span-based self-learning method for distantly supervised NER. The core of our method is using span-level features to iteratively mine latent entities into the dictionary. Experimental results show our model outperforms previous state-of-the-art methods. In the future, our work will extend the study to additional domains and languages.
Acknowledgement. The work is supported by National Key R&D Plan (No. 2018YFB1005100), NSFC (61772076, 61751201 and 61602197, No. U19B2020), NSFB (No. Z181100 008918002). We also thank Yuming Shang, Jiaxin Wu, Maxime Hugueville and the anonymous reviewers for their helpful comments.
References
1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: COLING 2018: 27th International Conference on Computational Linguistics, pp. 1638–1649 (2018)
2. Augenstein, I., Maynard, D., Ciravegna, F.: Relation extraction from the web using distant supervision. In: Janowicz, K., Schlobach, S., Lambrix, P., Hyvönen, E. (eds.) EKAW 2014. LNCS (LNAI), vol. 8876, pp. 26–41. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13704-9_3
3. Chang, K.W., Samdani, R., Roth, D.: A constrained latent variable model for coreference resolution. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 601–612 (2013)
4. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
6. Giannakopoulos, A., Musat, C., Hossmann, A., Baeriswyl, M.: Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. In: Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 180–188 (2017)
7. He, W.: Autoentity: automated entity detection from massive text corpora (2017)
8. Kitaev, N., Klein, D.: Constituency parsing with a self-attentive encoder. arXiv preprint arXiv:1805.01052 (2018)
9. Koo, T., Collins, M.: Efficient third-order dependency parsers. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1–11 (2010)
10. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270 (2016)
11. Liu, S., Sun, Y., Li, B., Wang, W., Zhao, X.: Hamner: headword amplified multi-span distantly supervised method for domain specific named entity recognition. In: AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence (2020)
12. Ma, X., Hovy, E.H.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1064–1074 (2016)
13. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011 (2009)
14. Nooralahzadeh, F., Lønning, J.T., Øvrelid, L.: Reinforcement-based denoising of distantly supervised NER with partial annotation. In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 225–233 (2019)
15. Ouchi, H., Shindo, H., Matsumoto, Y.: A span selection model for semantic role labeling. arXiv preprint arXiv:1810.02245 (2018)
16. Passos, A., Kumar, V., McCallum, A.: Lexicon infused phrase embeddings for named entity resolution. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 78–86 (2014)
17. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pp. 147–155 (2009)
18. Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. IEEE Trans. Sig. Process. 45(11), 2673–2681 (1997)
19. Shang, J., Liu, L., Gu, X., Ren, X., Ren, T., Han, J.: Learning named entity tagger using domain-specific dictionary. In: EMNLP 2018: 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2054–2064 (2018)
20. Stern, M., Andreas, J., Klein, D.: A minimal span-based neural constituency parser. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 818–827 (2017)
21. Wang, W., Chang, B.: Graph-based dependency parsing with bidirectional LSTM. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2306–2315 (2016)
22. Wu, W., Wang, F., Yuan, A., Wu, F., Li, J.: Coreference resolution as query-based span prediction. arXiv preprint arXiv:1911.01746 (2019)
23. Yang, Y., Chen, W., Li, Z., He, Z., Zhang, M.: Distantly supervised NER with partial annotation learning and reinforcement learning. In: COLING 2018: 27th International Conference on Computational Linguistics, pp. 2159–2169 (2018)
Knowledge Base, Graphs and Semantic Web
A Passage-Level Text Similarity Calculation Ming Liu1,2 , Zihao Zheng1 , Bing Qin1,2(B) , and Yitong Liu3 1 Harbin Institute of Technology, Harbin, China
[email protected] 2 PENG CHENG Laboratory, Shenzhen, China 3 Tencent Technology (Beijing) Co., Ltd., Beijing, China
Abstract. With the explosion of web information, information flow services have attracted the attention of users. In this kind of service, measuring the similarity between texts and filtering the redundant information collected from multiple sources is key to meeting users’ needs. One text often mentions several events. The core event largely determines the main content carried by the text and should take the pivotal position. For this reason, this paper constructs a passage-level event connection graph to model the relations among the events mentioned in one text. The core event can then be revealed and used to measure the similarity between two texts. As shown by the experimental results, by measuring text similarity from a passage-level event representation perspective, our unsupervised method outperforms other unsupervised methods by a large margin and is even comparable with some popular supervised neural methods.
Keywords: Text similarity calculation · Passage-level event representation · Event connection graph · Vector tuning
1 Introduction
With the rapid advance of internet technology, many web applications have appeared to enhance the way users experience the web. Text is the most prevalent data format on the web, which makes measuring text similarity a primary problem in many web applications, such as news recommendation and Q&A systems. Traditional text similarity measurements fall into two categories. Supervised methods turn a text into a vector representation and place two similar (or, as most articles put it, correlated) texts close together in a high-dimensional space. Unsupervised methods score two texts by the coverage of the words (or some other statistics) shared by the two texts. In general, supervised methods achieve high performance, since with the help of training data they can accurately draw the boundary that separates similar texts from dissimilar ones. Accordingly, their high-quality results depend heavily on training data: when the domain changes, the performance of supervised methods degrades sharply. Unsupervised methods do not suffer from this limitation, since they do not rely on any prior knowledge, and they are therefore robust to domain transfer. There are numerous types of texts on the web, and it is impossible to collect all kinds of texts as training data for supervised methods in advance. Therefore, in this paper, we aim to design an unsupervised calculation that can fit any kind of text.
As we know, one text mentions several events. Traditional event extraction tasks, such as ACE and FrameNet, treat events at the sentence level: one sentence mentions one event, and the trigger of the event is often the verb of the sentence. In contrast, at the passage level, although one text may mention many events, it has only one core event. The other events serve the core event and play auxiliary roles, such as explaining the core event or stating its details. In the text similarity calculation task, we should consider events at the passage level, since the core event largely determines the similarity between two texts. For example, the similarity between these two articles¹ is high, since both take the event “the damage of Typhoon Mangkhut” as the core event, although one details the degree of the damage and the other does not. As text similarity calculation concerns whether two texts stress the same core event, we need a way to model the relations among the events mentioned in one text and finally locate the core event. It is worth noticing that the core event mentioned in one text may span several sentences. Therefore, traditional sentence-level event extraction methods are not appropriate for extracting the core event. For this reason, this paper constructs an event connection graph to cover the multiple relations among the events mentioned in one text. The graph is composed of a set of polygons, each with a trigger as its center and the arguments as its surrounding nodes. PageRank is adopted to score the nodes and locate the core event, and text similarity is calculated in terms of the correlation between the two core events mentioned in the two texts. Besides, to better model the event connection graph, one improvement is made to detect related triggers so as to fully cover the relations among events. As shown by the experimental results, our similarity calculation outperforms other unsupervised methods by a large margin and is even comparable with supervised neural methods. Notably, our calculation is unsupervised, so it can be applied in any domain without a drop in performance.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 207–218, 2020. https://doi.org/10.1007/978-3-030-60450-9_17
2 Related Work

The popularity of the internet causes novel information (mostly textual data) to appear every day. Facing this tremendous amount of raw data, internet users need automatic tools to help them analyze and process it. Text is the most prevalent format on the web, so a large proportion of web applications have to deal with textual data. Text similarity calculation is a fundamental task needed by almost all text-related applications, such as clustering, dialogue, recommendation, and Q&A. Broadly, text similarity calculations fall into two categories: supervised and unsupervised.

Supervised methods treat two texts as one pair-point. A classification function or a score function is then trained to divide the points into two kinds, similar and dissimilar. Due to the iterative training process, supervised methods often achieve high quality. Before the appearance of neural networks, a text was often encoded as a one-hot vector, which is high-dimensional and sparse and degrades the quality of many classification functions. Neural word embedding replaces the one-hot vector with a dense distributed vector [1]. On the basis of word embedding, neural models such as CNN, GRU, and LSTM, or Transformer-based pre-trained models such as GPT, BERT, XLNet, and RoBERTa, can produce more reasonable text representations. To better model the interaction between texts, the attention mechanism is utilized [2]. Neural models bring a large improvement in performance. However, supervised methods have one obvious defect: they have to form a hypothesis about the distribution of the input data from the prior knowledge derived from training data. There are countless types of texts on the web, and it is impossible to collect all of them as training data for supervised methods beforehand. In this paper, we want a similarity calculation that can fit any kind of text. Therefore, we design an unsupervised text similarity calculation.

Unsupervised similarity calculations are free from training data. They also encode a text as a vector, but apply untrained score functions to measure vector similarity as text similarity. Euclidean distance, KL divergence, and entropy are typical examples. Several ensemble calculations have also been proposed, in which parameters tune the proportion of the scores acquired from different parts [3]. TF/IDF, TextRank, and LDA are applied to extract features, and word embedding is used to encode the features as vectors. Following the procedure of self-training, some recent works try to turn unsupervised similarity calculation into a supervised task.

1 The two articles are, respectively, https://www.reuters.com/article/us-asia-storm/super-typhoon-slams-into-china-after-pummeling-philippines-idUSKCN1LW00F and https://www.wunderground.com/cat6/Typhoon-Mangkhut-Causes-Heavy-Damage-Hong-Kong-China-and-Macau.
The outputs generated by the similarity calculation are in turn applied as training data [4]. This kind of method, however, suffers from the cold-start issue.

Previous supervised and unsupervised methods ignore the fact that most web texts are used to record events. One text holds one core event, and the other mentioned events either help explain the core event or provide supplementary details. In fact, the core event mostly decides the similarity between two texts. Thus, text similarity can be calculated by comparing the core events mentioned in the two texts. Event extraction and representation have been researched for a long time. In MUC (Message Understanding Conference) and ACE (Automatic Content Extraction), the two most famous event extraction tasks, events are defined at the sentence level, and the events contained in different sentences are treated independently. The algorithms designed for sentence-level event extraction are not appropriate for extracting the passage-level event (the core event of a text), since most recent state-of-the-art event extraction algorithms aim to learn a better representation of a single sentence [5] and do not consider the relations among events. Although some algorithms consider cross-sentence event extraction, they focus only on several adjacent sentences in a sliding window [6]. None of them is suitable for extracting the core event at the passage level. For this reason, this paper constructs an event connection graph that comprehensively covers the relations among all the events mentioned in one text. The core event is located in the graph and used to calculate text similarity.
3 Task Description

Given two texts, denoted s1 and s2, the aim of this paper is to calculate their similarity according to whether s1 and s2 mention the same core event. To model the events and the relations among them, an event connection graph is constructed and denoted G(V, E), where V is the node set and E is the arc set. The node set V includes the triggers and arguments extracted sentence by sentence with a sentence-level event extraction algorithm, following [7]. The extracted triggers and arguments form polygons. The arc set E includes the arcs connecting triggers and arguments. PageRank runs on the graph to locate the core event, which is then used to calculate text similarity.
4 Construction of Event Connection Graph

To comprehensively model the events mentioned in one text and their relations, we construct an event connection graph. The graph is composed of polygons, and each polygon indicates a sentence-level event. The center of a polygon is the trigger of the event, and the nodes surrounding the center are the arguments of the event, which are connected in the order in which they appear in the sentence. The trigger is the core element of an event, since it decides the type of the event and also decides which arguments need to be extracted. For this reason, we put the trigger in the center of the polygon. Figure 1 is an example of one polygon formed from the sentence “The Queen holds a welcome banquet at Buckingham Palace”.
Fig. 1. The polygon constructed from the example sentence.
To reveal the relations between different events, we connect the polygons to form an event connection graph. Two polygons are connected through the arguments they share. Figure 2 shows an example graph formed from the following four sentences. “Trump and his wife visit Britain.” “Trump and the Queen are jointly interviewed by reporters at afternoon.” “The Queen holds a welcome banquet at Buckingham Palace.” “The reporter broadcasts on BBC One during banquet.”
Fig. 2. The event connection graph formed from the given sentences.
In this figure, the polygons formed from the given sentences are connected via overlapping arguments, such as “Trump” and “Queen”. Overlapping arguments reveal only shallow relations among events. Since the trigger is the core element of an event, we design a vector tuning method in Sect. 6 that finds related triggers and thereby reveals deeper relations among events.
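The construction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the triggers and arguments below are assigned by hand for the four example sentences (the paper obtains them with the extractor of [7]), nodes are identified by their surface string, consecutive arguments are linked without closing the polygon, and the function name `build_event_graph` is hypothetical.

```python
def build_event_graph(events):
    """Build an undirected event connection graph.

    `events` is a list of (trigger, [arguments in sentence order]) pairs.
    The result maps every node to the set of its neighbours.
    """
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for trigger, arguments in events:
        graph.setdefault(trigger, set())      # keep triggers without arguments
        for argument in arguments:
            link(trigger, argument)           # center of the polygon to each corner
        for left, right in zip(arguments, arguments[1:]):
            link(left, right)                 # arguments in order of appearance
    return graph


# Hand-assigned events for the four example sentences (an assumption).
events = [
    ("visit", ["Trump", "wife", "Britain"]),
    ("interview", ["Trump", "Queen", "reporter", "afternoon"]),
    ("hold", ["Queen", "banquet", "Buckingham Palace"]),
    ("broadcast", ["reporter", "BBC One", "banquet"]),
]
graph = build_event_graph(events)

# Arguments that occur in more than one polygon connect the polygons.
shared = sorted(
    node for node in graph
    if sum(node in arguments for _, arguments in events) > 1
)
print(shared)
```

Because polygons are merged on identical node strings, a real pipeline would presumably need some normalization (for example "reporters" versus "reporter") before building the graph.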
5 Weight Evaluation

The constructed event connection graph covers the events mentioned in one text and their relations. As given in [8], “if the author emphases something (or a clue), everything in his article is relevant to this thing (or clue)”. Thus, it is reasonable to assume that the core event of an article is surrounded by the other events. Besides, in each polygon, the trigger is surrounded by its arguments. We can therefore use a centrality measurement to choose the core event, and we choose PageRank. PageRank was proposed by Google and is used to rank web pages in search engines. The principle behind PageRank is the random walk: when a surfer randomly surfs the graph, the node visited most frequently is the central node (the one with the largest PageRank value). It is calculated by

$$PR(\pi) = c\,PR(\pi)\,A + (1 - c)\,\frac{1}{n}\,\mathbf{1}V^{T} \tag{1}$$

where $PR(\pi)$ denotes the PageRank vector, in which each node has one entry, and $c$ denotes the jumping probability: with probability $c$ the surfer moves to an adjacent node, and with probability $(1 - c)$ the surfer jumps to a random node according to $V^{T}$. $V^{T}$ denotes the personalized vector, which holds the preference for each node when a random jump occurs. In general cases, $V^{T}$ is simply a vector whose entries all equal 1. $A$ is the transition matrix formed from the event connection graph. The size of $A$ is $v \times v$, where $v$ denotes the number of nodes. Each entry of $A$ is the transition probability from the node of its row to the node of its column. $A_{ij}$ is calculated by Eq. 2.

$$A_{ij} = \begin{cases} 1/out_i, & \text{if } (i, j) \in E \\ 0, & \text{if } (i, j) \notin E \end{cases} \tag{2}$$

where $out_i$ denotes the out-degree of node $v_i$. Via PageRank, each node in the event connection graph receives a value, which is used to locate the core event. There are two types of nodes, triggers and arguments. If the node with the largest value is a trigger, we take the trigger and the arguments belonging to it as the core event; that is, the polygon whose center is this trigger is the core event. Conversely, if the node with the largest value is an argument, we take the nodes of all the polygons that share this argument as an overlapping node. Figure 3 shows the PageRank values of the nodes in Fig. 2. In this figure, the node “Queen” has the largest value. Since “Queen” is an argument, we choose the nodes in the polygons that take “Queen” as their overlapping node. The chosen polygons are marked in yellow in Fig. 3.
Fig. 3. PageRank values of the nodes in the event connection graph.
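The weighting and selection steps of this section (Eqs. 1 and 2, followed by the choice of the core event) can be sketched as a plain power iteration. The sketch makes several assumptions that the text does not fix: the jumping probability c = 0.85, a uniform personalized vector, a fixed number of iterations, and a toy graph with two polygons that share the argument "B". The function names are hypothetical.

```python
def pagerank(graph, c=0.85, iterations=100):
    """Power iteration for Eq. 1 with a uniform personalized vector."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random-jump part: (1 - c) * (1 / n) for every node.
        new_rank = {node: (1.0 - c) / n for node in nodes}
        for node in nodes:
            # Isolated nodes (an edge case Eq. 2 does not cover) spread evenly.
            targets = graph[node] or nodes
            share = c * rank[node] / len(targets)   # A_ij = 1 / out_i
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def core_event_nodes(events, rank):
    """Return the top-ranked node and the nodes of the chosen polygons."""
    top = max(rank, key=rank.get)
    triggers = {trigger for trigger, _ in events}
    if top in triggers:
        polygons = [(t, args) for t, args in events if t == top]
    else:
        polygons = [(t, args) for t, args in events if top in args]
    chosen = set()
    for trigger, arguments in polygons:
        chosen.add(trigger)
        chosen.update(arguments)
    return top, chosen


# Toy example: two polygons, "meet" and "call", sharing the argument "B".
events = [("meet", ["A", "B"]), ("call", ["B", "C"])]
graph = {
    "meet": {"A", "B"},
    "call": {"B", "C"},
    "A": {"meet", "B"},
    "B": {"meet", "A", "call", "C"},
    "C": {"call", "B"},
}
rank = pagerank(graph)
top, chosen = core_event_nodes(events, rank)
print(top, sorted(chosen))
```

In the toy graph the shared argument receives the largest value, so both polygons are chosen, which mirrors the "Queen" case discussed above.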
Let $S_i$ denote the set of chosen nodes in the event connection graph of a given text, $text_i$, and let $S_j$ be the corresponding set for $text_j$. To calculate the similarity between $text_i$ and $text_j$, we form a similarity matrix, denoted $TS_{ij}$. Each element of this matrix is the similarity between one chosen node in $S_i$ and one chosen node in $S_j$. A node in the graph is either a trigger or an argument and can be represented as a vector via word embedding (GloVe [9] is applied), so the similarity of two nodes is measured by the cosine similarity of their vectors. Some triggers or arguments are phrases; for these, we average the vectors of the words in the phrase to obtain the vector representation. The mean of all the elements in $TS_{ij}$ is treated as the similarity between the two texts, $text_i$ and $text_j$:

$$sim(text_i, text_j) = \frac{\sum_{k=1}^{n} TS_{ij}(k)}{n} \tag{3}$$

where $n$ denotes the number of elements in $TS_{ij}$, and $TS_{ij}(k)$ denotes one element of $TS_{ij}$.
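Equation 3 can be illustrated with a short sketch. The two-dimensional vectors below are made up for the example and stand in for the pre-trained GloVe vectors; the function names are hypothetical, and a word missing from the embedding table simply raises a `KeyError` here.

```python
import math


def node_vector(node, embeddings):
    """Average the word vectors of a node; a node may be a multi-word phrase."""
    vectors = [embeddings[word] for word in node.split()]
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / norm


def text_similarity(nodes_i, nodes_j, embeddings):
    """Mean of all elements of the similarity matrix TS_ij (Eq. 3)."""
    elements = [
        cosine(node_vector(a, embeddings), node_vector(b, embeddings))
        for a in nodes_i
        for b in nodes_j
    ]
    return sum(elements) / len(elements)


# Toy embeddings (an assumption; the paper uses pre-trained GloVe vectors).
embeddings = {
    "typhoon": [1.0, 0.0],
    "storm": [1.0, 0.0],
    "damage": [0.0, 1.0],
    "heavy": [1.0, 1.0],
}
similarity = text_similarity(["typhoon", "damage"], ["storm", "damage"], embeddings)
phrase = node_vector("heavy damage", embeddings)
print(similarity, phrase)
```

Since every node of one text is compared with every node of the other, unrelated node pairs pull the mean down even when the two core events coincide, so the absolute value is best read relative to other text pairs.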
6 Tuning Trigger Words

The event connection graph constructed so far relies only on the overlapping arguments shared by different polygons to reveal the relations among events. This kind of relation is too vague and not sufficient, since the relation between events is mainly carried by the trigger words. For this reason, we detect semantically similar triggers and link them, so that the event connection graph covers more relations among events. As shown in [10], about ninety percent of trigger words are nouns and verbs (or noun and verb phrases). Popular pre-trained word embeddings reveal word similarity according to whether two words have similar contexts. However, in an event, the trigger and its arguments often form commonly used collocations, e.g. “win the game” or “beat the opponent”. As a result, two trigger words that are semantically similar may have different contexts, and we cannot depend on pre-trained word embeddings alone to reveal the semantic similarity between triggers. To detect semantic similarity between trigger words, we use synonym dictionaries, namely VerbNet (footnote 2) and WordNet (footnote 3), to fine-tune the vector representations of the trigger words so that semantically similar triggers have close vector representations. Two triggers whose cosine similarity exceeds a threshold (0.8) are connected by an arc in the event connection graph, which brings more reasonable relations among events into the graph. The threshold (0.8) is obtained from experimental results. Figure 4 shows the graph after connecting the similar triggers “interview” and “broadcast”. As shown in Fig. 4, with the inserted arc (in red), the node with the largest PageRank value changes to “interview”, and the core event of the graph is revealed more correctly.
Fig. 4. The novel event connection graph by linking semantically similar triggers. (Color figure online)
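The linking step can be sketched as follows. The trigger vectors are invented for the illustration (in the paper they are the fine-tuned GloVe vectors), the graph is a plain adjacency dictionary, and the function names are hypothetical; only the threshold of 0.8 comes from the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def link_similar_triggers(graph, trigger_vectors, threshold=0.8):
    """Add an arc between every pair of triggers whose cosine similarity
    exceeds the threshold, and return the list of added arcs."""
    added = []
    for first, second in combinations(sorted(trigger_vectors), 2):
        if cosine(trigger_vectors[first], trigger_vectors[second]) > threshold:
            graph.setdefault(first, set()).add(second)
            graph.setdefault(second, set()).add(first)
            added.append((first, second))
    return added


# Made-up vectors standing in for the tuned trigger embeddings.
trigger_vectors = {
    "interview": [0.9, 0.1],
    "broadcast": [0.8, 0.3],
    "visit": [0.1, 0.9],
}
graph = {
    "interview": {"Trump", "Queen"},
    "broadcast": {"reporter"},
    "visit": {"Trump"},
}
added = link_similar_triggers(graph, trigger_vectors)
print(added)
```

After the new arcs are added, PageRank would be run again on the enlarged graph, which is how the top-ranked node can move from an argument to a trigger as described for Fig. 4.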
The synonymous pairs in VerbNet and WordNet are used as training data, denoted $B_c$. We tune the vectors of the triggers according to $B_c$ via the following formulas.

$$O(B_c) = O_c(B_c) + R(B_c) \tag{4}$$

$$O_c(B_c) = \sum_{(x_l, x_r) \in B_c} \left[ \tau(att + x_l t_l - x_l x_r) + \tau(att + x_r t_r - x_l x_r) \right] \tag{5}$$

$$R(B_c) = \sum_{x_i \in B_c} \lambda \left\lVert x_i(int) - x_i \right\rVert_2 \tag{6}$$

2 https://verbs.colorado.edu/verbnet/.
3 https://wordnet.princeton.edu/.
where $(x_l, x_r)$ denotes a synonymous pair in $B_c$. $t_l$ is a word randomly sampled from a synset that does not contain $x_l$, and $t_r$ is sampled in the same way for $x_r$. $att$ denotes the predefined deviation and is set to 0.6. $\tau$ denotes the max-margin loss, $\tau(x) = \max(0, x)$. $x_i(int)$ denotes the pre-trained GloVe vector. $\lambda$ is a predefined regularization parameter and is set to 0.0006. The predefined parameters are set according to [11]. The tuning objective (Eq. 4) has two parts. The former, $O_c(B_c)$ in Eq. 5, makes semantically similar triggers have close vector representations. The latter, $R(B_c)$ in Eq. 6, keeps the tuned vectors close to their pre-trained values. We tune only the vectors of the triggers included in VerbNet and WordNet and do not extend the range beyond the dictionaries. The reason is that the pre-trained vector representations are acquired from a large-scale corpus, so we consider them credible unless there is enough evidence that they fail to capture the similarity between words, for instance when two words belong to the same synset in VerbNet or WordNet but are dissimilar according to their pre-trained vectors. If a trigger is a phrase, we simply take the mean of the vectors of all the words in the phrase as its representation.
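To make the objective concrete, the following sketch evaluates Eqs. 4 to 6 for given vectors. It only computes the value of the objective and does not perform the optimization. Several points are assumptions: the products in Eq. 5 are read as dot products, the regularization term is read as an L2 norm, the negative examples are passed in explicitly instead of being sampled at random, and the vectors and function names are invented for the illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tau(x):
    """Max-margin loss: tau(x) = max(0, x)."""
    return max(0.0, x)


def attract_cost(pairs, negatives, vectors, att=0.6):
    """O_c(B_c) of Eq. 5; negatives[k] holds (t_l, t_r) for pairs[k]."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        similar = dot(vectors[x_l], vectors[x_r])
        total += tau(att + dot(vectors[x_l], vectors[t_l]) - similar)
        total += tau(att + dot(vectors[x_r], vectors[t_r]) - similar)
    return total


def regularization(pairs, vectors, pretrained, lam=0.0006):
    """R(B_c) of Eq. 6, summed over the words that occur in the pairs."""
    words = {word for pair in pairs for word in pair}
    return sum(
        lam * math.sqrt(sum((p - x) ** 2 for p, x in zip(pretrained[w], vectors[w])))
        for w in words
    )


def objective(pairs, negatives, vectors, pretrained):
    """O(B_c) of Eq. 4."""
    return attract_cost(pairs, negatives, vectors) + regularization(
        pairs, vectors, pretrained
    )


pairs = [("win", "beat")]                 # one synonymous pair (x_l, x_r)
negatives = [("lose", "lose")]            # hand-picked t_l and t_r
pretrained = {"win": [1.0, 0.0], "beat": [0.0, 1.0], "lose": [1.0, 0.0]}

# Before tuning, the tuned vectors equal the pre-trained ones.
before = objective(pairs, negatives, pretrained, pretrained)

# Moving "beat" onto "win" lowers the attract cost at a small regularization cost.
tuned = {"win": [1.0, 0.0], "beat": [1.0, 0.0], "lose": [1.0, 0.0]}
after = objective(pairs, negatives, tuned, pretrained)
print(before, after)
```

In this toy case the attract term falls from 2.2 to 1.2 while the regularization term adds only about 0.0008, which shows how the small value of λ lets the synonym constraint dominate.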
7 Experiments and Analyses

7.1 Experimental Setting

Our similarity calculation measures the similarity between two texts based on a passage-level event representation. The calculation is unsupervised and is not limited to any particular language or domain. To test its compatibility, we choose testing corpora in English, Chinese, and Spanish. For English, there are open tasks on text similarity measurement, such as query match and paraphrase in the GLUE (General Language Understanding Evaluation) benchmark [12], and we choose these two tasks to test our calculation. Five thousand text pairs are sampled from the query match data set and one thousand text pairs from the paraphrase data set. In each sampled data set, one half consists of similar text pairs and the other half of dissimilar text pairs. The corpora for these tasks include only short sentences, and one sentence mostly indicates one event. Such corpora cannot fully demonstrate the ability of our calculation to handle long texts, since only a long text mentions several events, among which the core event must be chosen to measure text similarity. For this reason, we manually annotate a testing corpus of one thousand text pairs chosen from Daily news published in the latest month (one half similar pairs, the other half dissimilar pairs). For Chinese, we choose two testing corpora. One is published by Alibaba for the query match task (processed in the same way as the English corpus), and the other is manually constructed from one thousand text pairs chosen from Tencent news, also published in the latest month. For Spanish, there is no suitable open corpus, so we manually annotate one corpus of one thousand text pairs chosen from a Kaggle contest. In every manually annotated corpus, one half of the pairs is positive (similar) and the other half is negative (dissimilar).
The criterion used for evaluation is F1, which is computed as follows.

$$P = \frac{r(n)}{t(n)} \tag{7}$$

$$R = \frac{r(n)}{a(n)} \tag{8}$$

$$F1 = 2 \cdot \frac{P \cdot R}{P + R} \tag{9}$$

where $P$ denotes precision, which compares the number of correctly noted similar (also called positive) text pairs, $r(n)$, with the total number of text pairs noted as similar, $t(n)$. $R$ denotes recall, which compares the number of correctly noted similar text pairs with the total number of truly similar text pairs, $a(n)$. F1 combines precision and recall.

There are large-scale corpora for the open tasks on text similarity measurement, such as paraphrase and query match, so on these corpora we can compare our calculation with supervised algorithms. We take three popular neural algorithms for these two tasks: multilayer CNN (i.e. TextCNN, with one convolutional layer, one max-pooling layer, and one softmax output layer), LSTM (LSTM encodes the text and softmax is the output layer), and LSTM+bidirectional attention (LSTM encodes the text and a bidirectional attention layer models the interaction between the two input texts). In all three algorithms, a softmax layer outputs a value that indicates the similarity between the two texts. The pre-trained model BERT (base version) is also taken as a baseline: following its fine-tuning procedure, we input the two sentences into BERT separated by the segmentation tag [SEP] and add a softmax layer on [CLS] as output. For the unsupervised baselines, we represent the input text as a vector and apply cosine similarity to calculate text similarity. The applied unsupervised representations are as follows.
1) Average: the mean of all the word vectors in the input text.
2) TextRank+average: TextRank chooses keywords from the input text, and the mean of the chosen keyword vectors is the representation.
3) TextRank+concatenation: TextRank chooses keywords, and the word vectors of all the chosen keywords are concatenated to form a long vector.
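The evaluation criterion of Eqs. 7 to 9 can be sketched as follows, using the standard definitions of precision (correct among the pairs noted as similar) and recall (correct among the truly similar pairs). The function name and the toy labels are assumptions made for the illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compute P, R and F1 for boolean labels (True means "similar")."""
    correct = sum(1 for p, g in zip(predicted, gold) if p and g)
    noted = sum(1 for p in predicted if p)     # pairs noted as similar
    actual = sum(1 for g in gold if g)         # pairs that are truly similar
    precision = correct / noted if noted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two pairs noted as similar, one of them correctly;
# three pairs are truly similar.
predicted = [True, True, False, False]
gold = [True, False, True, True]
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

F1 is the harmonic mean of the two values, so it stays low whenever either precision or recall is low.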
All the models are given pre-trained GloVe vectors.

7.2 Experimental Results

Testing on Threshold

In our calculation, a threshold decides whether two triggers are connected in the event connection graph. Figure 5 examines the threshold setting used in this paper. It shows the F1 values obtained when the threshold changes from 0.1 to 1.0.
As shown in Fig. 5, the calculated results change as the threshold changes. The performance curves (measured by F1) reach their peak at a threshold of 0.8 (or close to it) in all three testing corpora. The reason can be explained by an intuitive assumption: if the semantic similarity between two triggers exceeds some threshold, the two triggers indicate two similar or related events. This assumption is roughly supported by [13], which states that, according to human judgement, for most word pairs, two semantically similar words have close pre-trained embeddings. However, as indicated in Sect. 6, this assumption does not always hold, because some words occur in fixed collocations. Thus, in Sect. 6, we tune the pre-trained embeddings of the triggers with training samples from synonym dictionaries so that semantically similar triggers have close embedding vectors. After this tuning, it becomes feasible to find a threshold that decides whether two triggers are similar. Based on the experimental results shown in Fig. 5, 0.8 is a reasonable choice. When the threshold is too small, many dissimilar triggers are incorrectly connected; when it is too large, some similar triggers are left unconnected. This is why the performance curves first climb and finally drop.
Fig. 5. F1 values when we change the threshold from 0.1 to 1.0.
Comparison of Different Algorithms

In Table 1, we compare our calculation with the supervised and unsupervised baseline algorithms. The supervised baselines are multilayer CNN (abbreviated as MCNN), LSTM, LSTM+bidirectional attention (abbreviated as LSTM+BIA), and BERT base. The unsupervised baselines are Average (abbreviated as AVE), TextRank+average (abbreviated as TR+AVE), and TextRank+concatenation (abbreviated as TR+CON). The details of the baseline algorithms are given in Sect. 7.1. Every tested algorithm outputs a value that measures the similarity between two texts. Since each testing corpus can be separated into two halves, similar and dissimilar, we record the similarity value that each algorithm assigns to each text pair in a testing corpus, and the mean of all these values is treated as the threshold that decides whether the two texts of a pair are similar. To make the results more persuasive, we add a significance test to the experiments. We separate each testing corpus into ten parts and record the results on each part. A two-tailed paired t-test is applied to determine whether the results obtained by different algorithms over the ten parts are significantly different. We set three significance levels, 0.01, 0.05, and 0.1 (labelled as ***, **, and *). The corpora for paraphrase, query match, and manual annotation are abbreviated as Para, Q&Q, and MA. Since the manually annotated corpora do not have enough training data, we do not run the supervised baselines on them.

Table 1. The comparison between our calculation and the baseline algorithms

Type          Methods     English                          Chinese               Spanish
                          Para       Q&Q        MA         Q&Q        MA         MA
Supervised    MCNN        0.83 ***   0.59 ***   –          0.57 ***   –          –
Supervised    LSTM        0.81 **    0.61 **    –          0.63 **    –          –
Supervised    LSTM+BIA    0.83 **    0.62 **    –          0.60 **    –          –
Supervised    BERT base   0.87 ***   0.71 ***   –          0.71 ***   –          –
Unsupervised  AVE         0.68 **    0.49 *     0.39 *     0.43 *     0.37 *     0.38 *
Unsupervised  TR+AVE      0.69 **    0.51 **    0.42 **    0.48 **    0.39 **    0.40 **
Unsupervised  TR+CON      0.71 **    0.47 **    0.43 **    0.47 **    0.41 **    0.42 *
Unsupervised  Ours        0.80 ***   0.56 ***   0.54 ***   0.53 ***   0.51 ***   0.53 ***
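The decision rule used for this comparison, in which the mean of all similarity values of a corpus serves as the threshold, can be sketched as follows. Whether a value equal to the mean counts as similar is not stated in the text; the strict comparison below is an assumption, as are the function name and the toy scores.

```python
def label_by_mean(scores):
    """Label a pair as similar when its score is above the corpus mean."""
    threshold = sum(scores) / len(scores)
    return threshold, [score > threshold for score in scores]


# Toy similarity values for four text pairs.
scores = [0.9, 0.8, 0.2, 0.1]
threshold, labels = label_by_mean(scores)
print(threshold, labels)
```

This rule suits corpora that are balanced between similar and dissimilar pairs, as the testing corpora here are, because the mean then tends to fall between the two groups of scores.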
Table 1 lists the results obtained in different languages and on different tasks. Across all the corpora with training data, the supervised algorithms outperform the unsupervised baselines by a large margin. The reason is straightforward: supervised algorithms derive useful knowledge from training data, whereas unsupervised algorithms cannot learn any knowledge that helps model the interaction between texts, which leads to lower performance. The results of the unsupervised algorithms on short texts are much better than their results on the manually annotated corpora, which consist of long texts. This is because unsupervised algorithms either choose some words or aggregate all the words of a text to generate the text representation, while a long text contains many words that have little relevance to its main content. This lowers the performance of unsupervised algorithms on long texts. Our calculation obtains results comparable with the supervised algorithms and performs much better than the unsupervised baselines, especially on the corpora of long texts. We attribute this to the event connection graph. Based on this graph, we extract the nodes (or words) that represent the core event mentioned in the text, so irrelevant noisy words are ignored when text similarity is calculated. As a result, the calculation handles both long and short texts. The significance test results also support the reliability of these results.
8 Conclusion

Text similarity calculation is a fundamental task for improving many high-level text-related applications. However, traditional methods either make two similar texts close in a high-dimensional space (supervised ones) or measure the number of frequently co-occurring words shared by two texts (unsupervised ones). In general, text similarity is decided by whether two texts mention the same core event. This paper proposes a novel text similarity calculation that constructs an event connection graph to disclose the core event mentioned in one text. Besides, to better model the relations among events, it tunes the vectors of the triggers to detect more related events and link them in the event connection graph. This approach locates the core event more accurately.

Acknowledgement. The research in this article is supported by the Science and Technology Innovation 2030 - “New Generation Artificial Intelligence” Major Project (2018AA0101901), the National Key Research and Development Project (2018YFB1005103), the Key Project of National Science Foundation of China (61632011), the National Science Foundation of China (61772156, 61976073) and the Foundation of Heilongjiang Province (F2018013).
References

1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
2. Duan, C., Cui, L., Chen, X., Wei, F., Zhu, C., Zhao, T.: Attention-fused deep matching network for natural language inference. In: IJCAI, pp. 4033–4040 (2018)
3. Ou, S., Kim, H.: Unsupervised citation sentence identification based on similarity measurement. In: iConference, pp. 384–394 (2018)
4. Pavlinek, M., Podgorelec, V.: Text classification method based on self-training and LDA topic models. Expert Syst. Appl. 80(1), 83–93 (2017)
5. Wang, X., et al.: HMEAE: hierarchical modular event argument extraction. In: EMNLP-IJCNLP, pp. 5781–5787 (2019)
6. Yang, H., Chen, Y., Liu, K., Xiao, Y., Zhao, J.: DCFEE: a document-level Chinese financial event extraction system based on automatically labeled training data. In: ACL, pp. 50–55 (2018)
7. Qiu, L.K., Zhang, Y.: ZORE: a syntax-based system for Chinese open relation extraction. In: EMNLP, pp. 1870–1880 (2014)
8. Liao, S., Grishman, R.: Using document level cross-event inference to improve event extraction. In: ACL, pp. 789–797 (2010)
9. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
10. Li, P., Zhou, G., Zhu, Q., Hou, L.: Employing compositional semantics and discourse consistency in Chinese event extraction. In: EMNLP-CoNLL, pp. 1006–1016 (2012)
11. Amir, H., Béatrice, D.: Word embedding approach for synonym extraction of multi-word terms. In: LREC, pp. 297–303 (2018)
12. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: ICLR, pp. 1–20 (2019)
13. Lan, W., Xu, W.: Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In: COLING, pp. 3890–3902 (2018)
Using Active Learning to Improve Distantly Supervised Entity Typing in Multi-source Knowledge Bases

Bo Xu1(B), Xiangsan Zhao1, and Qingxuan Kong2

1 School of Computer Science and Technology, Donghua University, Shanghai, China
[email protected], [email protected]
2 Glorious Sun School of Business and Management, Donghua University, Shanghai, China
[email protected]

Abstract. Entity typing in the knowledge base is an essential task for constructing a knowledge base. Previous models mainly rely on manually annotated data or distant supervision. However, human annotation is expensive, and distantly supervised data suffers from the label noise problem. In addition, it suffers from the semantic heterogeneity problem in the multi-source knowledge base. To address these issues, we propose to use an active learning method to improve distantly supervised entity typing in the multi-source knowledge base, which aims to combine the benefits of human annotation for difficult instances with the coverage of large distantly supervised data. However, existing active learning criteria do not consider the label noise and semantic heterogeneity problems, so much annotation effort is wasted on useless instances. In this paper, we develop a novel active learning pipeline framework to tackle the most difficult instances. Specifically, we first propose a noise reduction method to re-annotate the most difficult instances in distantly supervised data. Then we propose a data augmentation method to annotate the most difficult instances in unlabeled data. We propose two novel selection criteria to find the most difficult instances in the two phases, respectively. Moreover, we propose a hybrid annotation strategy to reduce human labeling effort. Experimental results show the effectiveness of our method.
1 Introduction

Entity typing in the knowledge base (ETKB) is an essential task for constructing a knowledge base and has seen a surge of interest in recent years [4,5,12,14]. Knowing the type labels of entities is essential for many downstream applications. Given an entity and all its factual triples in the knowledge base, the task aims to classify the entity into predefined type labels. Typically, each ETKB task requires its own annotated data for training the model, which is expensive and time-consuming to produce. To address this problem, distant supervision has been proposed to automatically annotate a large number of unlabeled entities in the knowledge base [14]. The distant supervision technique assumes that equivalent entities in different knowledge bases have the same type labels. The automatic annotation method is therefore as follows: first, link the unlabeled entities in the target knowledge base with their equivalent entities in the source knowledge base; then, use all the type labels of the equivalent entities to annotate the unlabeled entities.

However, despite its efficiency, distantly supervised ETKB often suffers from the well-known label noise problem and produces false positive and false negative noise. For example, in the upper part of Fig. 1, the expected type labels of the entity Donald Trump in KB1 are {Agent, Person, Politician and Businessman}, while in KB2 they are {Agent, Person, Politician and President}. If we regard KB2 as the source knowledge base and apply distant supervision to annotate the entity Donald Trump in KB1, then the type President is false positive noise and the type Businessman is false negative noise.

This paper was supported by the National Natural Science Foundation of China under Grant 61906035 and Shanghai Sailing Program under Grant 19YF1402300.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 219–231, 2020. https://doi.org/10.1007/978-3-030-60450-9_18
Fig. 1. Distantly supervised ETKB suffers from Label Noise and Semantic Heterogeneity problems in the multi-source knowledge base.
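The distant supervision procedure and the resulting label noise can be illustrated with a short sketch built on the Donald Trump example. The dictionaries, identifiers such as "kb1:Donald_Trump", and function names are hypothetical and serve only the illustration; in practice the expected labels are of course unknown, which is why noisy labels are hard to detect.

```python
def distant_labels(entity, links, source_types):
    """Annotate a target-KB entity with all type labels of its equivalent
    entity in the source KB; return None when no equivalent entity is linked."""
    equivalent = links.get(entity)
    if equivalent is None:
        return None
    return set(source_types.get(equivalent, ()))


def label_noise(assigned, expected):
    """Split the disagreement into false positive and false negative labels."""
    return {
        "false_positive": assigned - expected,
        "false_negative": expected - assigned,
    }


# KB2 is the source knowledge base, KB1 is the target knowledge base.
links = {"kb1:Donald_Trump": "kb2:Donald_Trump"}
source_types = {
    "kb2:Donald_Trump": {"Agent", "Person", "Politician", "President"},
}
expected = {"Agent", "Person", "Politician", "Businessman"}

assigned = distant_labels("kb1:Donald_Trump", links, source_types)
noise = label_noise(assigned, expected)
print(noise)
```

An entity without a linked equivalent receives no labels at all, which corresponds to the unlabeled data discussed in the remainder of the paper.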
Besides the well-known label noise problem, distantly supervised ETKB suffers from the semantic heterogeneity problem in the multi-source knowledge base: the instances in the distantly supervised data and those in the target knowledge base do not share the same type distribution or the same feature space. The reason is that a multi-source knowledge base extracts knowledge from multiple data sources that use different semantic schemas and naming conventions to describe entities. For example, the lower part of Fig. 1 shows the semantic representations of two entities. The entity Gone With the Wind comes from an online shopping website, while the entity Coiling Dragon comes from an online novel website. Although both entities have the same type labels {Work, Written Work, Book and Novel}, their semantic representations are completely different. Even if one of them is labeled, it cannot be used to infer the type of the other.
Using Active Learning to Improve Distantly Supervised Entity Typing
To address these issues, we propose to use active learning to improve distantly supervised entity typing in the multi-source knowledge base (ETMKB), combining the benefits of human annotation for difficult instances (entities) with the coverage of large distantly supervised data. However, existing active learning criteria do not consider the label noise and semantic heterogeneity problems, so much of the annotation effort is wasted on useless instances. In this paper, we develop a novel active learning pipeline framework to tackle the most difficult instances. To solve the label noise problem, we propose a novel selection criterion to find the noisiest instances in distantly supervised data and a hybrid annotation strategy to relabel them. To solve the semantic heterogeneity problem, we propose another novel selection criterion to label the most mismatched instances in unlabeled data to augment the training data. To the best of our knowledge, we are the first to use an active learning method to improve distantly supervised entity typing in the multi-source knowledge base. Our approach improves the quality of training data by relabeling the noisiest entities in the distantly supervised data and annotating the most mismatched entities in the unlabeled data. The hybrid annotation strategy further reduces human labeling effort. Experimental results on a real-world dataset show the effectiveness of our method.
2 Overview

2.1 System Framework
Fig. 2. Framework of our two-phase active learning method to improve distantly supervised entity typing in the multi-source knowledge base.
Our framework is shown in Fig. 2. To solve the label noise and semantic heterogeneity problems, we propose a two-phase active learning framework to annotate the most difficult instances.
B. Xu et al.
In the noise reduction phase, we propose an algorithm to iteratively re-annotate the noisiest instances in distantly supervised (DS) data to solve the label noise problem. In the data augmentation phase, we propose another algorithm to iteratively label the most mismatched instances in unlabeled data to solve the semantic heterogeneity problem. Finally, we obtain improved DS data by revising the noisiest entities in the raw DS data and adding the most mismatched entities from the unlabeled data. The improved DS data is used to train the final entity typing model.

2.2 Entity Typing Model
We define a multi-source knowledge base as $G = \{E, T, F\}$, where $E$, $T$ and $F$ are sets of entities, types and facts, respectively. A fact is denoted as a triple $(s, p, o) \in F$, such as (Donald Trump, Occupation, Politician), and the set of all factual triples of entity $e$ is denoted as $F_e = \{(s, p, o) \mid (s, p, o) \in F, s = e\}$. Given an entity $e \in E$ and all its factual triples $F_e$ in the multi-source knowledge base $G$, the ETMKB task aims to predict a type set $ET(e) \subset T$ that entity $e$ belongs to. Similar to CUTE [14], we extract two types of features from the factual triples, namely property features (e.g., Born, Occupation) and property-object pair features (e.g., Occupation-Politician, Category-Presidents of the United States). Specifically, each entity $e$ is represented by a one-hot vector $\mathbf{e} \in \mathbb{R}^d$:

$$\mathbf{e} = (x_1, x_2, \ldots, x_i, \ldots, x_d), \tag{1}$$

where $x_i$ represents the existence of the $i$-th feature $f_i$ in entity $e$.

Note that the focus of this work is to propose an effective active learning method to improve the performance of existing ETMKB models, rather than to propose a better ETMKB model. Therefore, in all subsequent experiments, we use a simple-yet-effective multilayer perceptron (MLP) neural network model with one hidden layer [15] for entity-level typing:

$$[P(t_1 \mid e), \ldots, P(t_{|T|} \mid e)] = \sigma\left(W^{(o)}\, g\left(W^{(i)} \mathbf{e}\right)\right), \tag{2}$$

where $W^{(i)} \in \mathbb{R}^{h \times d}$ is the weight matrix connecting the input one-hot vector to the hidden layer of size $h$, and $g$ is the rectifier activation function. $W^{(o)} \in \mathbb{R}^{|T| \times h}$ is the weight matrix connecting the hidden vector to the output layer of size $|T|$, $\sigma$ is the sigmoid activation function, and $P(t \mid e)$ is the probability that entity $e$ belongs to type $t$. Since ETMKB is a multi-class multi-label classification problem, we use binary cross-entropy as the loss function.
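The forward pass of Eq. 2 and the binary cross-entropy loss can be sketched in plain Python as follows. The toy dimensions, the hand-picked weights and the helper names `matvec`, `predict` and `bce_loss` are illustrative assumptions, not the configuration used in the experiments.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(W: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def predict(W_in: Matrix, W_out: Matrix, e: List[float]) -> List[float]:
    """Eq. 2: sigmoid(W_out * relu(W_in * e)), one probability per type."""
    hidden = [max(0.0, z) for z in matvec(W_in, e)]  # rectifier g
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_out, hidden)]


def bce_loss(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over all types (multi-label setting)."""
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


# Toy setup (assumed): d = 4 features, h = 3 hidden units, |T| = 2 types.
e = [1.0, 0.0, 1.0, 0.0]
W_in = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 1.0], [-1.0, 0.0, 0.0, 0.0]]
W_out = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
probs = predict(W_in, W_out, e)
loss = bce_loss(probs, [1, 0])
print(probs, loss)
```

Because each type has its own sigmoid output, an entity can receive several types at once, which is why the loss is computed per type rather than with a softmax.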
3 Noise Reduction
In this section, we show how to reduce the noisiest instances (entities) in distantly supervised data. We first propose a selection criterion to find the noisiest instances from the data. In order to reduce human labeling effort in similar noisy instances, we propose a hybrid annotation strategy to re-annotate them.
3.1 Selection Criterion
A common selection criterion for active learning is uncertainty sampling [1], which selects those instances with the highest uncertainty in their predictions. However, uncertain instances are not necessarily noisy. Therefore, we propose a new selection criterion which aims to find noisy instances in distantly supervised data. We use $DS(e)$ to represent the type labels of entity $e$ in distant supervision data. As mentioned above, there are two kinds of noise in distant supervision data, namely false positive and false negative. We use $FP(e)$ to represent the False Positive type set of entity $e$, whose types belong to the distantly supervised label set but not to the prediction set, and use $FN(e)$ to represent the False Negative type set of entity $e$, whose types belong to the prediction set but not to the distantly supervised label set.

$$FP(e) = \{t \mid t \in T,\ t \notin ET(e),\ t \in DS(e)\} \tag{3}$$

$$FN(e) = \{t \mid t \in T,\ t \in ET(e),\ t \notin DS(e)\} \tag{4}$$
Our selection criterion is based on the difference between the predicted and the distantly supervised results of the entity. The prediction result $ET(e)$ is the set of types whose probability $P(t \mid e) \geq 0.5$. We first propose two scoring functions to evaluate false positive and false negative labels in an entity, namely $Score_{FalsePositive}$ and $Score_{FalseNegative}$.

$$Score_{FalsePositive}(e) = \sum_{t \in FP(e)} \left(0.5 - P(t \mid e)\right) \tag{5}$$

$$Score_{FalseNegative}(e) = \sum_{t \in FN(e)} \left(P(t \mid e) - 0.5\right) \tag{6}$$
Then we use a combined scoring function $Score_{NoiseReduction}$ as our selection criterion for noise reduction, as defined in Eq. 7.

$$Score_{NoiseReduction}(e) = Score_{FalsePositive}(e) + Score_{FalseNegative}(e) \tag{7}$$
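As a minimal sketch, the selection criterion of Eqs. 3 to 7 can be computed from the predicted probabilities and the distantly supervised labels of one entity. The function name, the toy probabilities and the choice to treat a label missing from `probs` as probability 0 are assumptions made for illustration.

```python
from typing import Dict, Set


def noise_reduction_score(probs: Dict[str, float], ds_labels: Set[str],
                          threshold: float = 0.5) -> float:
    """Eq. 7: sum of the false positive and false negative scores."""
    predicted = {t for t, p in probs.items() if p >= threshold}  # ET(e)
    fp = ds_labels - predicted  # Eq. 3: in DS(e) but not predicted
    fn = predicted - ds_labels  # Eq. 4: predicted but not in DS(e)
    # Assumption: a DS label absent from `probs` counts as probability 0.
    score_fp = sum(threshold - probs.get(t, 0.0) for t in fp)  # Eq. 5
    score_fn = sum(probs[t] - threshold for t in fn)           # Eq. 6
    return score_fp + score_fn


# Toy probabilities (assumed) for the Donald Trump example.
probs = {"Agent": 0.9, "Person": 0.8, "Politician": 0.7,
         "President": 0.1, "Businessman": 0.95}
noisy = noise_reduction_score(
    probs, {"Agent", "Person", "Politician", "President"})
clean = noise_reduction_score(
    probs, {"Agent", "Person", "Politician", "Businessman"})
print(noisy, clean)
```

An entity whose distantly supervised labels agree with the prediction gets a score of 0, while confident disagreements in either direction push the score up.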
The higher the score is, the more likely this entity is to have false positive or false negative labels.

3.2 Annotation Strategy
Note that training neural networks usually takes a long time. In order to reduce the total time that humans spend waiting for the next instance to label, we use the batch active learning strategy and select the batch of instances with the highest scores in each iteration. It has been shown that, even compared to a fully sequential active learning strategy, an appropriate batch size can produce competitive performance while significantly reducing training time [7].
Traditional batch active learning usually needs to consider the diversity among selected instances. However, in the Noise Reduction scenario, similar noisy instances also need to be revised. Otherwise, these errors will still exist in the distantly supervised data. In order to reduce the human labeling effort on similar noisy instances, we propose a hybrid annotation method that allows humans to re-annotate a fixed number of diverse noisy instances, while similar noisy instances are annotated by an automatic method.

Considering that the automatic method may introduce new errors, we propose a strict variant of the KNN method for automatically labeling similar noisy instances. Specifically, for a selected instance, we first use a similarity measure (e.g., Euclidean Distance or Jaccard Coefficient) to calculate the similarity between it and the instances in the existing re-annotation set. If we can find some very similar instances (for example, instances whose similarity value is greater than a certain threshold), and their type labels are exactly the same, then we automatically annotate this new noisy instance with the same type labels.

Finally, we propose a hybrid annotation strategy to iteratively re-annotate the noisiest entities in the distantly supervised data. We first initialize an empty set as the re-annotation instance set. Then, in each iteration, we divide the DS data into two parts. One part is used to train the entity typing model, and the remaining part is the one whose noise is to be reduced. Next, we compute the noise reduction scores for instances in the remaining part by Eq. 7 and rank them in descending order of score. After that, we first use the automatic annotation method to re-annotate the noisy instances that are similar to instances in the re-annotation instance set, and then we manually re-annotate K instances in rank order. These instances are then added to the re-annotation instance set.
The annotation process repeats until a certain stopping criterion is met, such as a maximum number of iterations or a total number of human-annotated entities.
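One iteration of the hybrid annotation step can be sketched as below, using the Jaccard coefficient on feature sets. The threshold of 0.8, the callback `ask_human`, and the decision to add only the manually labeled instances to the re-annotation set are assumptions for illustration; the text does not fix these details.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Labeled = Tuple[Set[str], FrozenSet[str]]  # (feature set, type labels)


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def auto_annotate(features: Set[str], reannotated: List[Labeled],
                  threshold: float = 0.8) -> Optional[FrozenSet[str]]:
    """Strict KNN variant: label only if every very similar instance agrees."""
    neighbor_labels = {labels for feats, labels in reannotated
                       if jaccard(features, feats) > threshold}
    if len(neighbor_labels) == 1:  # neighbors exist and labels are identical
        return next(iter(neighbor_labels))
    return None


def hybrid_step(ranked: List[Tuple[str, Set[str]]], reannotated: List[Labeled],
                ask_human: Callable[[str], Set[str]], k: int):
    """ranked: (entity id, features) pairs sorted by descending Eq. 7 score."""
    auto: Dict[str, FrozenSet[str]] = {}
    manual: Dict[str, FrozenSet[str]] = {}
    for entity_id, feats in ranked:
        labels = auto_annotate(feats, reannotated)
        if labels is not None:
            auto[entity_id] = labels
        elif len(manual) < k:  # human budget of K instances per iteration
            manual[entity_id] = frozenset(ask_human(entity_id))
    features_by_id = dict(ranked)
    # Assumption: only human-labeled instances extend the re-annotation set.
    reannotated.extend((features_by_id[i], y) for i, y in manual.items())
    return auto, manual


reannotated: List[Labeled] = [({"a", "b", "c", "d", "e"}, frozenset({"Book"}))]
ranked = [("x", {"a", "b", "c", "d", "e"}), ("y", {"p", "q"}), ("z", {"r"})]
auto, manual = hybrid_step(ranked, reannotated, lambda _id: {"Person"}, k=1)
print(auto, manual)
```

Requiring all sufficiently similar neighbors to carry identical labels keeps the automatic step conservative, so it abstains whenever the re-annotation set gives conflicting evidence.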
4 Data Augmentation
In this section, we show how to augment the labeled data (the denoised distantly supervised data after the noise reduction phase) with the most mismatched entities in unlabeled data. We first propose a selection criterion to find those instances in the data. Then we introduce our complete data augmentation algorithm.

4.1 Selection Criterion
Our selection criterion for data augmentation is based on two aspects. One is whether the entity suffers from imbalanced feature distribution, and the other is whether its prediction result is incomplete. If an entity suffers from both imbalanced feature distribution and an incomplete prediction result, we consider it to suffer from the semantic heterogeneity problem.
We first propose a score function $Score_{Imbalance}$ to evaluate the extent to which an entity suffers from imbalanced feature distribution. If a feature is popular in unlabeled data but rare in labeled data, we consider the feature to be imbalanced. As shown in Eq. 8, the scoring function is the sum of the distribution differences of all its features.

$$Score_{Imbalance}(e) = \sum_{i=1}^{d} x_i \times \max\left(P(f_i \mid D_{unlabel}) - P(f_i \mid D_{label}),\ 0\right), \tag{8}$$
where $P(f_i \mid D_{label})$ and $P(f_i \mid D_{unlabel})$ are the proportions of entities containing feature $f_i$ in labeled and unlabeled data, respectively. The larger the value is, the more likely the entity is to suffer from imbalanced feature distribution.

Then we propose a score function $Score_{Incomplete}$ to evaluate the probability that an entity's prediction result is incomplete. We use $L(e')$ to represent the type set of entity $e'$ in labeled data $E_{label}$, and $FG(e)$ to represent the entities in labeled data whose type sets are supersets of or equal to the prediction set of entity $e$, as shown in Eq. 9:

$$FG(e) = \{e' \mid e' \in E_{label},\ ET(e) \subseteq L(e')\} \tag{9}$$
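The imbalance score of Eq. 8 can be sketched as follows. Since $x_i$ is binary, the sum only needs to run over the features the entity actually has. The feature names and the two tiny data sets are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def feature_proportions(entities: List[Set[str]]) -> Dict[str, float]:
    """P(f | D): share of entities in D that contain feature f."""
    counts = Counter(f for feats in entities for f in feats)
    n = len(entities)
    return {f: c / n for f, c in counts.items()} if n else {}


def imbalance_score(features: Set[str], p_unlabeled: Dict[str, float],
                    p_labeled: Dict[str, float]) -> float:
    """Eq. 8, summed over the features present in the entity (x_i = 1)."""
    return sum(max(p_unlabeled.get(f, 0.0) - p_labeled.get(f, 0.0), 0.0)
               for f in features)


# Hypothetical feature sets from a shopping site (labeled) and a novel site.
labeled = [{"author", "price"}, {"author", "price"}]
unlabeled = [{"author", "chapters"}, {"chapters", "status"}]
p_lab = feature_proportions(labeled)
p_unl = feature_proportions(unlabeled)
novel_score = imbalance_score({"author", "chapters"}, p_unl, p_lab)
shop_score = imbalance_score({"author", "price"}, p_unl, p_lab)
print(novel_score, shop_score)
```

The clipping at zero means that features that are more frequent in labeled data never lower the score; only features under-represented in labeled data contribute.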
The score function $Score_{Incomplete}$ is calculated in Eq. 10. We use the difference in the number of types to measure the degree of incompleteness of the predicted result of an entity.

$$Score_{Incomplete}(e) = \frac{\sum_{e' \in FG(e)} |L(e')| - |ET(e)|}{|FG(e)|} \tag{10}$$

Finally, we use the rank products [3] method to combine the rankings of the two scoring functions, as shown in Eq. 11:

$$Score_{DataAugmentation}(e) = Rank_{Imbalance}(e) \times Rank_{Incomplete}(e), \tag{11}$$

where $Rank_{Imbalance}(e)$ and $Rank_{Incomplete}(e)$ are the rankings of these score values, respectively. Instances with the same score value form a group, and each of them receives the lowest rank in the group. The lower the value is, the more likely this entity is to be mismatched with the labeled data.

4.2 Annotation Strategy
We also propose a hybrid annotation strategy to iteratively annotate the most mismatched entities in the unlabeled data. As the unlabeled data is very large, we sample M instances from it in each iteration. Specifically, in each iteration, we first train the entity typing model with the labeled data. Next, we compute the data augmentation scores for the sampled instances by Eq. 11 and rank them in ascending order, since a lower rank product indicates a more mismatched instance. After that, we use a hybrid annotation method to annotate the most mismatched instances, which is similar to Sect. 3.2. The annotation process repeats until a certain stopping criterion is met.
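The incompleteness score and the rank product used for selection can be sketched as below. Three readings are assumptions: the numerator of Eq. 10 is taken as the sum of the per-entity differences (so the score is the average number of missing types), tied scores share the numerically smallest rank of their group, and an entity with an empty $FG(e)$ gets score 0.

```python
from typing import Dict, List, Set


def incomplete_score(predicted: Set[str],
                     labeled_type_sets: List[Set[str]]) -> float:
    """Eq. 10 under the averaging reading described in the lead-in."""
    fg = [types for types in labeled_type_sets if predicted <= types]  # Eq. 9
    if not fg:
        return 0.0  # assumption: no superset found means no evidence
    return sum(len(types) - len(predicted) for types in fg) / len(fg)


def min_ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score; ties share the smallest rank of the group."""
    first_position: Dict[float, int] = {}
    for pos, s in enumerate(sorted(scores.values(), reverse=True), start=1):
        first_position.setdefault(s, pos)
    return {e: first_position[s] for e, s in scores.items()}


def rank_product(imbalance: Dict[str, float],
                 incomplete: Dict[str, float]) -> Dict[str, int]:
    """Eq. 11: product of the two ranks, lower means more mismatched."""
    r_imb, r_inc = min_ranks(imbalance), min_ranks(incomplete)
    return {e: r_imb[e] * r_inc[e] for e in imbalance}


gap = incomplete_score({"Work"},
                       [{"Work", "WrittenWork", "Book"}, {"Agent", "Person"}])
scores = rank_product({"a": 1.0, "b": 0.2, "c": 0.2},
                      {"a": 2.0, "b": 0.0, "c": 1.0})
selection_order = sorted(scores, key=scores.get)  # most mismatched first
print(gap, scores, selection_order)
```

Combining ranks instead of raw scores avoids having to put the two scoring functions on a common scale.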
5 Experiment

5.1 Settings
Data. Inspired by [14], we conduct a real-world experiment of cross-lingual ETMKB, which classifies Chinese entities in CN-DBpedia [13] (a Chinese multi-source knowledge base) into the predefined type set from DBpedia (an English knowledge base). Specifically, we collect Chinese entities and their factual triples from CN-DBpedia and English entity type information from the latest DBpedia Dump¹. We obtain about 11 million Chinese entities from CN-DBpedia and 4 million English entities from DBpedia, respectively. By using the distant supervision technique, 60,000 Chinese entities are annotated with English types. To evaluate the performance of noise reduction on distantly supervised data, we randomly select 25,000 of these entities as distantly supervised data, select 1,000 of the remaining ones, and have human annotators re-annotate them as test data. To evaluate the performance of data augmentation on distantly supervised data, we randomly select 1,000 entities from 101,000 unlabeled entities in CN-DBpedia and have human annotators annotate them as test data. To distinguish the two test sets, we refer to the first as NR1000 and the second as DA1000².

Metrics. To evaluate the performance of the entity typing model trained on different training data, we use Accuracy (Strict-F1), Micro-averaged F1 (Mi-F1) and Macro-averaged F1 (Ma-F1) metrics, which have been widely used in entity typing tasks [4–6,12,15].

5.2 Performance of Noise Reduction
Baselines. We compare the performance of the entity typing model trained on different denoised distantly supervised data, which are obtained through different selection criteria (including uniform, uncertainty and our noise reduction selection criterion) and annotation strategies (including human and hybrid annotation): (1) DS, the raw distantly supervised data; (2) DS + NR1 (Uniform + Human); (3) DS + NR2 (Uncertainty + Human); (4) DS + NR3 (Our + Human); (5) DS + NR4 (Our + Hybrid). We use NR1000 as the test data. Each method performs 40 iterations, and 10 instances are manually labeled per iteration.
1 https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30.
2 Data can be downloaded at: https://github.com/xubodhu/ETMKB.
Fig. 3. The comparison results of different methods for noise reduction.
Fig. 4. The comparison results of different methods for data augmentation.
Performance Comparison and Analysis. The comparison results are shown in Fig. 3. First, DS + NR1 and DS + NR2 perform similarly to DS, which demonstrates that traditional selection criteria cannot improve the performance of distantly supervised ETMKB. Second, DS + NR3 performs better than DS + NR1, DS + NR2 and DS, which demonstrates that re-annotating the noisiest instances in the distantly supervised data is an effective way to improve the performance of distantly supervised ETMKB. Third, DS + NR4 performs better than DS + NR3, which demonstrates that the hybrid annotation strategy can effectively improve the quality of distantly supervised data under a fixed human labeling budget.

5.3 Performance of Data Augmentation
Baselines. We also compare the performance of the entity typing model trained on different augmented data, which are obtained through different selection criteria (including uniform, uncertainty and our data augmentation selection criterion) and annotation strategies (including human and hybrid annotation): (1) DS, the raw distantly supervised data; (2) DS + NR4; (3) DS + NR4 + DA1 (Uniform + Human); (4) DS + NR4 + DA2 (Uncertainty + Human); (5) DS + NR4 + DA3 (Our + Human); (6) DS + NR4 + DA4 (Our + Hybrid). In this experiment, we use DA1000 as the test data. Each method performs 20 iterations, and 20 instances are manually labeled per iteration.

Performance Comparison and Analysis. The comparison results are shown in Fig. 4. First, DS + NR4 performs better than DS on the DA1000 test data, which again verifies the effectiveness of our noise reduction method. Second, all data augmentation selection criteria perform better than DS and DS + NR4, which demonstrates that data augmentation can address the semantic heterogeneity problem in distantly supervised data and improve the performance of distantly supervised ETMKB. Third, DS + NR4 + DA3 performs better than DS + NR4 + DA1 and DS + NR4 + DA2, which demonstrates the effectiveness of our selection criterion for data augmentation. Fourth, DS + NR4 + DA4 performs better than DS + NR4 + DA3, which also demonstrates the effectiveness of the hybrid annotation strategy.

5.4 Entity Typing in CN-DBpedia
Finally, we use our best method (DS + NR4 + DA4) to classify entities in the multi-source knowledge base CN-DBpedia. In total, we obtain 28,438,100 types for 10,900,000 entities with this method, an average of 2.6 types per entity. In contrast, we only get 23,944,139 types for these entities by using the distant supervision method, an average of 2.2 types per entity. We also compare the top 10 types obtained by the two methods. Specifically, we count the number of entities included in each type and evaluate the accuracy of each type by randomly selecting 50 entities that belong to the type and checking whether the type is correct. The result is shown in Table 1. From the table we can observe the following. First, our method can discover more entities, such as Book and Music entities. The reason is that CN-DBpedia collects entities from different data sources, so the semantic representations of these entities differ. By using our data augmentation method, we discover more diverse entities and address the semantic heterogeneity problem. Second, our method can improve the accuracy of the types, such as type Place. The reason is that there are many noisy entities in distantly supervised data. By using our noise reduction method, we revise some noisy entities and address the label noise problem.
Table 1. The prediction results of Top-10 types in CN-DBpedia by using our best method and DS method.

      DS + NR4 + DA4                        DS
Rank  Types         Count      Accuracy    Types           Count      Accuracy
1     Agent         5,354,279  1.00        Agent           5,348,411  1.00
2     Organisation  4,415,545  1.00        Organisation    4,367,970  0.98
3     Company       4,198,294  1.00        Company         3,937,844  1.00
4     Work          2,985,281  0.98        Work            2,312,920  1.00
5     WrittenWork   1,952,698  1.00        Place           1,380,592  0.52
6     Book          1,808,219  1.00        Person          829,667    1.00
7     Person        847,027    1.00        WrittenWork     461,946    1.00
8     Place         789,480    1.00        PopulatedPlace  421,349    0.66
9     MusicalWork   607,744    1.00        Book            335,943    1.00
10    Song          543,236    1.00        Settlement      262,153    0.62

6 Related Work
There have been extensive studies on entity typing. In terms of entity granularity, existing work can be categorized into mention-level typing in the text corpus and entity-level typing in the knowledge base. Mention-level typing is a task that classifies an entity mention in a sentence into a predefined set of types [10], while entity-level typing is a task that classifies an entity in a knowledge base into a predefined set of types. In terms of the existence of labeled data, it can be categorized into supervised entity-level typing and distantly supervised entity-level typing. In this paper, we focus on the task of distantly supervised entity typing in the knowledge base.

Distant supervision is a technique of labeling data using an existing knowledge base [9]. It has been widely used in many NLP fields. These tasks all use entity-level knowledge in the knowledge base to annotate mention-level data in the text corpus, so they inevitably produce false positive noise. Moreover, because the knowledge base is incomplete, they may also produce false negative noise [8]. However, distantly supervised entity typing in the knowledge base is a different task that uses type labels of entities in the source knowledge base to annotate entities in the target knowledge base. Besides the well-known noisy labeling problem, it also introduces a new challenge of semantic heterogeneity.

Active learning is another technique of labeling data, which selects the most beneficial instances to learn a good classifier with a minimal amount of human supervision [1]. However, despite its high quality, it is often limited by the data size. To overcome the drawbacks of both distant supervision and active learning, researchers have recently explored the idea of augmenting distant supervision with a small amount of human-annotated data to improve the performance of NLP tasks [2,11]. They propose different strategies to integrate the distantly supervised data and the human-annotated data, such as simple union, transfer learning [11] and multitask learning [2]. However, existing active learning criteria do not make full use of distantly supervised data, so much of the annotation effort is wasted on useless instances. In this work, we propose two novel active learning selection criteria to select the most difficult instances and use a hybrid annotation strategy to label them to improve the performance of ETMKB.
7 Conclusion
In this paper, we propose to use an active learning method to improve distantly supervised entity typing in the multi-source knowledge base. To solve the label noise problem, we propose a novel selection criterion to find the noisiest instances in distantly supervised data and propose a hybrid annotation strategy to relabel them. To solve the semantic heterogeneity problem, we propose another novel selection criterion to label the most mismatched instances in unlabeled data to augment the training data. Experimental results show the effectiveness of our method.
References
1. Aggarwal, C.C., Kong, X., Gu, Q., Han, J., Philip, S.Y.: Active learning: a survey. In: Data Classification, pp. 599–634. Chapman and Hall/CRC (2014)
2. Beltagy, I., Lo, K., Ammar, W.: Combining distant and direct supervision for neural relation extraction. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1858–1867 (2019)
3. Breitling, R., Armengaud, P., Amtmann, A., Herzyk, P.: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 573(1–3), 83–92 (2004)
4. Jin, H., Hou, L., Li, J., Dong, T.: Attributed and predictive entity embedding for fine-grained entity typing in knowledge bases. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 282–292 (2018)
5. Jin, H., Hou, L., Li, J., Dong, T.: Fine-grained entity typing via hierarchical multi graph convolutional networks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 4970–4979 (2019)
6. Ling, X., Weld, D.S.: Fine-grained entity recognition. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 94–100. AAAI Press (2012)
7. Lourentzou, I., Gruhl, D., Welch, S.: Exploring the efficiency of batch active learning for human-in-the-loop relation extraction. In: Companion Proceedings of the Web Conference, pp. 1131–1138 (2018)
8. Min, B., Grishman, R., Wan, L., Wang, C., Gondek, D.: Distant supervision for relation extraction with an incomplete knowledge base. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pp. 777–782 (2013)
9. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, pp. 1003–1011. Association for Computational Linguistics (2009)
10. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
11. Su, P., Li, G., Wu, C., Vijay-Shanker, K.: Using distant supervision to augment manually annotated data for relation extraction. PloS One 14(7), 1–17 (2019)
12. Xu, B., et al.: METIC: multi-instance entity typing from corpus. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 903–912. ACM (2018)
13. Xu, B., et al.: CN-DBpedia: a never-ending Chinese knowledge extraction system. In: Benferhat, S., Tabia, K., Ali, M. (eds.) IEA/AIE 2017. LNCS (LNAI), vol. 10351, pp. 428–438. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60045-1_44
14. Xu, B., Zhang, Y., Liang, J., Xiao, Y., Hwang, S., Wang, W.: Cross-lingual type inference. In: Navathe, S.B., Wu, W., Shekhar, S., Du, X., Wang, X.S., Xiong, H. (eds.) DASFAA 2016. LNCS, vol. 9642, pp. 447–462. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32025-0_28
15. Yaghoobzadeh, Y., Schütze, H.: Multi-level representations for fine-grained typing of knowledge base entities. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 578–589 (2017)
TransBidiFilter: Knowledge Embedding Based on a Bidirectional Filter

Xiaobo Guo1,2, Neng Gao1, Jun Yuan3, Lin Zhao1, Lei Wang1(B), and Sibo Cai4

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China {guoxiaobo,gaoneng,liuweile,wangxin}@iie.ac.cn
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 College of Traffic Engineering, Hunan University of Technology, Zhuzhou, China [email protected]
4 Beijing Internetware Limited Corporation, Beijing, China [email protected]
Abstract. A large-scale knowledge base can support a large number of practical applications, such as intelligent search and intelligent question answering. As the completeness of the information in a knowledge base may have a direct impact on the quality of downstream applications, its automatic completion has become a crucial task for many researchers and practitioners. To address this challenge, knowledge representation learning, which represents entities and relations as low-dimensional dense real-valued vectors, has developed rapidly in recent years. Although researchers continue to improve knowledge representation learning models using increasingly complex feature engineering, we find that the most advanced models can be outdone by simply considering the interactions from entities to relations and from relations to entities, without requiring a huge number of parameters. In this work, we present a knowledge embedding model based on a bidirectional filter called TransBidiFilter. By learning a global shared parameter set based on the traditional gate structure, TransBidiFilter captures the restriction rules from entities to relations and from relations to entities, respectively. It achieves better automatic completion ability by modifying the standard translation-based loss function. In doing so, though with far fewer discriminate parameters, TransBidiFilter performs better than state-of-the-art semantic discriminate baselines on most indicators on many datasets.

Keywords: Knowledge representation · Entity-based gate · Relation-based gate

1 Introduction
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 232–243, 2020. https://doi.org/10.1007/978-3-030-60450-9_19

A knowledge base organizes social knowledge into structured and systematic knowledge and is a crucial resource for many artificial intelligence applications, including intelligent search, question answering and so on. Many famous companies have launched knowledge base products, such as Google Knowledge Graph, IBM Watson and Apple Siri. A knowledge base is mainly organized in the form of triplets, and each triplet corresponds to a social fact. For example, (Steve Jobs, is_founder_of, Apple Inc.) indicates that the person entity Steve Jobs founded the company entity Apple Inc. Currently, there are many large-scale knowledge bases, such as WordNet [1], Freebase [2] and Yago [3]. However, they are far from perfect and suffer from missing fact records. According to Google, 71% of the people in Freebase [2] lack birthplace records and 75% lack nationality records. The link prediction task can be used to complete the missing part of a knowledge base and can greatly improve its quality, thereby supporting higher-quality downstream applications.

Knowledge representation learning is an automatic learning technique that represents entities and relations in a knowledge base as low-dimensional dense real-valued vectors. After vectorization, the semantic correlations between entities and relations are contained in the interaction of the vector values of different dimensions. Then, through vector calculations, the knowledge base can be automatically completed and improved. With the development of machine learning, data mining and related fields, knowledge representation learning has become a research hotspot in artificial intelligence.

Currently, many knowledge representation learning models have been developed. The original models, such as TransE [4] and DistMult [19], assume that the vector representations of entities and relations are consistent in any semantic environment. Linear transformation and the translation principle are mainly used to realize the learning. These models are easy to apply to real knowledge bases, but they have difficulty dealing with multi-semantic situations.
Later, researchers introduced the semantic discriminate mechanism so that an entity or a relation takes on a specific semantic expression in a given semantic environment. The semantic discrimination can be realized through space projection. Representative models are TransH [17], TransR [11], TranSparse [10] and TransD [21]. In KG2E [22] and TransG [18], semantic discrimination is instead obtained by using Gaussian distribution functions to simulate a variety of semantic uncertainties. These discriminate models have higher accuracies because they introduce much more complex feature engineering. At the same time, the number of discriminate parameters also increases dramatically. As a result, these discriminate models are difficult to apply to real knowledge bases. In order to reduce the large number of semantic discriminate parameters, the TransGate [20] model establishes two global shared parameter gates for the head entities and tail entities, respectively. The sizes of the two gate structures are fixed: they are determined only by the model embedding dimension and do not depend on the size of the dataset. This model not only alleviates the huge parameter pressure of traditional semantic discriminate models, but also achieves better performance.
However, we find that TransGate [20] only considers the multiple semantics of entities under different relations, but does not consider the multiple semantics of relations under different entity pairs. Moreover, using the concatenation of an entity vector and a relation vector as the input of the gate structure makes the interaction between entities and relations unclear and poorly interpretable. TransBidiFilter is proposed to realize the multiple semantic expressions of both entities and relations, and to obtain richer semantic interactions between them. Four aspects distinguish it from related work:

1. Unlike most existing semantic discriminate models, which only consider the multiple semantics of entities, we argue that a relation should also show corresponding semantics according to different head and tail entity pairs.
2. Instead of measuring the distance of a triplet (h, r, t) on a relation-specific plane, we choose to measure the distance after projecting the entities and relations onto a unified triplet-specific plane. Each triplet-specific plane is produced by a bidirectional global shared parameter gate structure.
3. Unlike many related models that require pre-trained embeddings from basic prerequisite models, TransBidiFilter is a self-contained model.
4. Rather than leaving the interaction between entities and relations unclear, the filtering vectors produced by the TransBidiFilter gate structure clearly tell how the value of each dimension in an entity vector or a relation vector should be obtained in a particular interaction.
2 Related Work
Our work is related to classical translation-based methods of knowledge representation learning. Inspired by word2vec [23], TransE [4] regards every entity and relation as a vector in the embedding space. It is an energy-based model and represents a relation as a translation vector indicating the semantic translation from the head entity to the tail entity. In this way, the pair of embedded entities (h, t) in a triplet (h, r, t) can be connected by r with low error. The score function is given in Eq. (1):

$$f_r(h, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|_{L_1/L_2} \tag{1}$$
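As a minimal illustration of Eq. (1), the sketch below computes the TransE score for toy two-dimensional vectors. The function name `transe_score` and the example vectors are hypothetical and chosen only for illustration; they do not come from the TransE implementation.

```python
import math

def transe_score(h, r, t, norm="L1"):
    """Return ||h + r - t|| under the L1 or L2 norm (lower means more plausible)."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == "L1":
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))

# Toy embeddings (illustrative values only).
h = [0.2, 0.5]
r = [0.3, -0.1]
t_good = [0.5, 0.4]   # close to h + r, so the score is near zero
t_bad = [-0.5, 0.9]   # far from h + r, so the score is larger

print(transe_score(h, r, t_good))
print(transe_score(h, r, t_bad))
print(transe_score(h, r, t_bad, norm="L2"))
```

A tail vector close to h + r yields a score near zero, which is the sense in which r connects h and t "with low error".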
TransH [17] is proposed to enable an entity to have distinct distributed representations when involved in different relations. TransR [11] assumes that entities and relations should be in two different vector spaces and that an entity is a complex of attributes. Different relations have different semantic spaces and focus on different attributes of the entities. For each triplet, the entities should first be projected into the corresponding relation space, and then the translation can be established between the entities and the relation. TransD [21] argues that it is unreasonable for TransR [11] to make the head entities and tail entities share the same projection matrix and to make the projection matrix determined only by relations. It holds that the types or attributes of head entities and tail entities can vary greatly for the same relation, so the projection matrix should be determined jointly by entities and relations. It designs projection matrices correlated with both entities and relations and projects the head entities and the tail entities into the relation space separately. TranSparse [10] points out that knowledge bases suffer from a heterogeneity problem (the number of entity pairs linked by a relation varies greatly between relations) and an imbalance problem (the number of head entities and the number of tail entities linked by the same relation differ greatly). To address heterogeneity, TranSparse [10] uses a sparse matrix to carry out the projection, with the sparse degree determined by the number of entity pairs connected by the relation. To address imbalance, TranSparse [10] uses a head entity-specific matrix and a tail entity-specific matrix to carry out the corresponding projections. Because the discriminate operation is relation-specific, the above translation-based semantic discriminate models all face a large number of discriminate parameters, and this number grows with the number of relations and entities.

TransGate [20] is proposed to relieve the parameter pressure of semantic discriminate models. Across the whole knowledge base, it establishes two fixed-size shared parameter gates, one for head entities and one for tail entities, by utilizing the inherent correlations between relations. The shared parameter gate structure is used to produce filtering vectors for a given triplet. The head entity vector and the tail entity vector are then multiplied by their respective filtering vectors to obtain the filtered entity vectors, which represent only the semantics relevant to the current semantic environment. Compared to the above semantic discriminate models, TransGate [20] obtains better performance with far fewer parameters.
However, it does not consider the multiple semantics of relations, and the concatenation of entity vectors and relation vectors blurs the interactions between them. Besides building semantic discriminate mechanisms, other models improve performance by incorporating additional features. For example, PTransE [19] and RTransE [24] obtain multi-step relations between two entities by random walks, and improve performance by considering the shallow structural features of a knowledge graph. R-GCN [13], ConvE [6] and ConvKB [12] learn the associations between entities and relations through convolutional neural networks, and achieve higher accuracy by considering deep features. However, the above models all suffer from poor interpretability. NLFeat [15] and RUGE [8] use data mining methods to discover the correlations between relations or triplets and achieve better performance. Nevertheless, they depend heavily on the accuracy of the mining method, and the extra data mining algorithms increase the time complexity. At present, there is no semantic discriminate model with a good trade-off among model accuracy, clear semantic interactions and the number of discriminate parameters. In this work, we propose TransBidiFilter, which meets all the above needs.
3 Methodology
The gate structure is the core mechanism of LSTM (Long Short-Term Memory) [9] and is mainly used to allow information elements to be selectively expressed. A gate consists of a fully connected layer and a sigmoid activation function. In this section we first describe the TransBidiFilter framework, followed by two kinds of gates. The relation-based gate is used to filter entity vectors according to relations, and the entity-based gate is used to filter relation vectors according to entities. The filtered vector representations of entities and relations are then the inputs of the loss function. In the experiments section, we demonstrate that TransBidiFilter outperforms existing models on most indicators despite having a relatively small parameter space.

3.1 Model Framework

The main insights of TransBidiFilter are shown in Fig. 1 and include the following seven points:
Fig. 1. TransBidiFilter framework (Color figure online)
1. Each entity and relation in a knowledge base is embedded into the same vector space, that is, they have the same dimension.
2. Throughout the whole knowledge base, each entity and relation has a unified complex vector representation. For a particular triplet, each part of it corresponds to a complex vector, which needs to be filtered by the corresponding filtering vector produced by the bidirectional filter. In this way, the triplet-specific head entity vector, tail entity vector and relation vector can be generated by three filtering operations respectively. The filtering operation is an element-wise multiplication between the filtering vector and the complex vector.
3. The bidirectional filter is globally shared and consists of two relation-based gates and one entity-based gate. Each gate structure contains a fully connected layer for recording semantic interactions between entities and relations and a sigmoid layer for producing filtering vectors with values between zero and one. Entities and relations are different in nature: entities are the initiators and acceptors of actions, and relations can be understood as actions. As a result, the learning of semantic interactions needs to be carried out from the perspective of entities and of relations respectively, that is, a two-way semantic filtering is needed.
4. The two relation-based gates (the orange portion of Fig. 1) are used to produce two filtering vectors, one for the head entity complex vector and one for the tail entity complex vector, according to the specific relation. Since most of the relations in the knowledge base are not reflexive, it is necessary to establish fully connected layers for the head entities and the tail entities separately to learn the constraint rules from relations to entities.
5. The entity-based gate (the green portion of Fig. 1) is used to produce a filtering vector for the relation complex vector according to the specific entity pair. Only when the head and tail entity pair is determined can the semantics of a relation be clear. Therefore, the concatenation of the head entity vector and the tail entity vector is used as the input of the fully connected network to learn the constraint rules from head and tail entity pairs to relations.
6. The vectors generated by the fully connected layer inside the gate structure form the filtering vectors through a sigmoid operation. The values of these filtering vectors represent the proportion of the information in each dimension of the corresponding complex vector that is allowed to pass through. They lie between zero and one: zero means the information cannot pass through at all, and one means it passes through completely. The value of each dimension of a complex vector can pass through the gate in different proportions according to the interaction rules between entities and relations in a specific triplet. By taking an element-wise multiplication between a complex vector and its filtering vector, information filtering is realized. In this way, the semantic representation specific to a certain triplet can be obtained.
7. Embedding models based on translations expect the sum of the head entity vector and the relation vector to be as close as possible to the tail entity vector. In this model, the translation distance is calculated using the three triplet-specific filtered vectors. Multiple semantics of entities and relations are ubiquitous in a knowledge base. Directly using the complex vectors of entities or relations to calculate the translation distance implies a coarse-grained semantic confusion. TransBidiFilter realizes global discriminate parameter learning based on a fine-grained semantic interaction mechanism between entities and relations, so it can achieve higher accuracy.
3.2 Formal Description
The formal description of TransBidiFilter is as follows. Denote the set of triplets as $T$, the set of entities as $E$, the set of relations as $R$, and the dimension of the embedding space as $m$. For every $(h, r, t) \in T$, the corresponding embeddings are $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^m$. There are two relation-based gates in TransBidiFilter. One records the interactions from relations to head entities, and the other records those from relations to tail entities. Take the former as an example. The discriminate parameters are $\mathbf{W}_{rh} \in \mathbb{R}^{m \times m}$ and $\mathbf{b}_h \in \mathbb{R}^m$. The input of the fully connected network is the relation vector $\mathbf{r}$ of a triplet. Through a sigmoid operation, its output produces a filtering vector $\mathbf{f}^r_h$ for the head entity $h$ in the triplet. Similarly, the filtering vector $\mathbf{f}^r_t$ for the tail entity $t$ can also be obtained:

$$\mathbf{f}^r_h = \sigma(\mathbf{W}_{rh} \cdot \mathbf{r} + \mathbf{b}_h) \tag{2}$$

$$\mathbf{f}^r_t = \sigma(\mathbf{W}_{rt} \cdot \mathbf{r} + \mathbf{b}_t) \tag{3}$$
Then the filtering vectors produced by the two relation-based gates are used to decide how much of each dimension of the head or tail entity vector can pass through. This filtering operation is realized through an element-wise multiplication (denoted $\odot$) between the filtering vector and the entity vector to be filtered:

$$\mathbf{h}_r = \mathbf{h} \odot \mathbf{f}^r_h \tag{4}$$

$$\mathbf{t}_r = \mathbf{t} \odot \mathbf{f}^r_t \tag{5}$$
There is one entity-based gate in TransBidiFilter. It records the interactions from entities to relations, that is, the rules that determine what semantics a relation should represent under different entity pairs. The discriminate parameters are $\mathbf{W}_r \in \mathbb{R}^{m \times 2m}$ and $\mathbf{b}_r \in \mathbb{R}^m$. The input of the fully connected network is the concatenation of the head entity complex vector and the tail entity complex vector of a triplet. The filtering vector for a relation is obtained as:

$$\mathbf{f}^{ht}_r = \sigma(\mathbf{W}_r \cdot [\mathbf{h}, \mathbf{t}] + \mathbf{b}_r) \tag{6}$$
Then the filtering vector $\mathbf{f}^{ht}_r$ produced by the entity-based gate is used to decide how much of each dimension of the relation vector can pass through. This filtering operation is realized through an element-wise multiplication between the filtering vector $\mathbf{f}^{ht}_r$ and the relation vector $\mathbf{r}$ to be filtered:

$$\mathbf{r}_{(h,t)} = \mathbf{r} \odot \mathbf{f}^{ht}_r \tag{7}$$

With the filtered vectors, translation in the plane specific to a triplet (h, r, t) can be realized. The distance function is:

$$d(h, r, t) = \| \mathbf{h}_r + \mathbf{r}_{(h,t)} - \mathbf{t}_r \|_{L_1/L_2} \tag{8}$$

Correct triplets are expected to have a smaller distance, and wrong triplets a larger one.
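The three gates and the distance function of Eqs. (2) to (8) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`gate`, `hadamard`, `distance`), the parameter dictionary keys and the all-zero toy parameters are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(W, b, x):
    """Fully connected layer followed by a sigmoid: sigma(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def distance(h, r, t, p, norm="L1"):
    f_h = gate(p["W_rh"], p["b_h"], r)      # Eq. (2): filter for the head entity
    f_t = gate(p["W_rt"], p["b_t"], r)      # Eq. (3): filter for the tail entity
    f_r = gate(p["W_r"], p["b_r"], h + t)   # Eq. (6): list "+" concatenates [h, t]
    h_r = hadamard(h, f_h)                  # Eq. (4)
    t_r = hadamard(t, f_t)                  # Eq. (5)
    r_ht = hadamard(r, f_r)                 # Eq. (7)
    diff = [a + b - c for a, b, c in zip(h_r, r_ht, t_r)]
    if norm == "L1":
        return sum(abs(d) for d in diff)    # Eq. (8) with the L1 norm
    return math.sqrt(sum(d * d for d in diff))

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy setting (hypothetical): m = 2 and all-zero gate parameters, so every
# filter value is sigmoid(0) = 0.5 and half of each dimension passes through.
m = 2
params = {
    "W_rh": zeros(m, m), "b_h": [0.0] * m,
    "W_rt": zeros(m, m), "b_t": [0.0] * m,
    "W_r": zeros(m, 2 * m), "b_r": [0.0] * m,
}
h, r, t = [0.2, 0.5], [0.3, -0.1], [-0.5, 0.9]
d = distance(h, r, t, params)
print(d)
```

With untrained all-zero parameters the distance is simply half of the unfiltered TransE distance; after training, the gates would pass each dimension in a triplet-specific proportion.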
The globally shared discriminate parameter set consists of all the parameters of the fully connected networks of the three gates. From the above formulas, a triplet-specific semantic representation is obtained because the inputs of the three fully connected networks all come from the corresponding triplet.

3.3 Training Method
In the training process, the max-margin method is used to optimize the objective function to speed up learning. For each (h, r, t) and its negative sample (h', r, t'), the aim of TransBidiFilter is to minimize the global hinge loss:

$$L = \sum_{(h,r,t) \in T,\ (h',r,t') \in T'} \max\left(\gamma + d(h, r, t) - d(h', r, t'),\ 0\right) \tag{9}$$

where $\gamma$ is a margin hyper-parameter and (h', r, t') is a negative sample from the negative sampling set $T'$. A negative sample is obtained by randomly replacing the head or the tail of a correct triplet with an entity from the entity list. The training process of TransBidiFilter is carried out using the Adam optimizer with a constant learning rate.
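The negative sampling step and the hinge loss of Eq. (9) can be sketched as below. This is a simplified illustration, assuming one negative sample per positive triplet; the function names `corrupt` and `hinge_loss` are hypothetical, and the sampler does not check whether a corrupted triplet happens to be a true one.

```python
import random

def corrupt(triplet, entities, rng):
    """Replace the head or the tail (chosen at random) with a random entity."""
    h, r, t = triplet
    e = rng.choice(entities)
    return (e, r, t) if rng.random() < 0.5 else (h, r, e)

def hinge_loss(pos_dists, neg_dists, gamma):
    """Sum of max(gamma + d_pos - d_neg, 0) over paired positive/negative distances."""
    return sum(max(gamma + dp - dn, 0.0) for dp, dn in zip(pos_dists, neg_dists))

rng = random.Random(0)
negative = corrupt(("a", "r", "b"), ["a", "b", "c"], rng)
print(negative)

# A negative that is already farther than the margin contributes no loss;
# a negative that is too close is penalized.
print(hinge_loss([0.5], [2.0], gamma=1.0))
print(hinge_loss([0.5], [1.0], gamma=1.0))
```

The loss is zero once every negative triplet is at least the margin farther away than its positive counterpart, so only violating pairs produce gradient signal.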
3.4 Complexity Analysis
Table 1 shows the size of the discriminate parameter set and the prerequisites of each translation-based model. The parameter statistics of these models are taken from [20]. $N_e$ is the number of entities, $N_r$ is the number of relations, $m$ is the dimension of entities, $n$ is the dimension of relations, and $\hat{\theta}$ denotes the average sparse degree of all transfer matrices. According to the table, the size of the discriminate parameter set of the semantic discriminate models that precede TransGate [20] and TransBidiFilter grows with the number of relations and entities in the knowledge base. In contrast, the numbers of discriminate parameters of TransGate [20] and TransBidiFilter are fixed and do not increase with the size of the knowledge base. Moreover, neither of them needs pre-training. The following experiments show that TransBidiFilter obtains better performance with fewer parameters than all semantic discriminate models except TransGate [20]. Finally, it gains both better learning ability and clearer interpretability at the cost of a slightly larger discriminate parameter set than TransGate [20].
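To make the comparison concrete, the sketch below evaluates the parameter formulas of Table 1 for one setting: the FB15K sizes reported in Table 2 (14,951 entities, 1,345 relations) and the embedding dimension m = 200 used in the experiments. Setting n = m is an assumption made here for illustration, and TranSparse is omitted because its count depends on the sparse degree.

```python
def discriminate_params(n_ent, n_rel, m, n):
    """Discriminate parameter counts according to the formulas in Table 1."""
    return {
        "TransH": n_rel * n,
        "TransR": n_rel * m * n,
        "TransD": n_ent * m + n_rel * n,
        "TransGate": 4 * m * m + 2 * m,
        "TransBidiFilter": 6 * m * m + 3 * m,
    }

counts = discriminate_params(n_ent=14951, n_rel=1345, m=200, n=200)
for name, count in counts.items():
    print(f"{name:16s}{count:>12,d}")
```

Under these assumptions the gate-based models stay in the hundreds of thousands of discriminate parameters, while the relation-specific projections of TransR reach tens of millions.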
4 Experiments
In this section, we evaluate TransBidiFilter on the link prediction task. Results show that TransBidiFilter outperforms state-of-the-art baselines on most evaluation indicators.
Table 1. Complexity analysis

Models            Discriminate parameters                                   Prerequisites
TransE [4]        None                                                      None
TransH [17]       $N_r n$                                                   TransE
TransR [11]       $N_r m n$                                                 TransE
CTransR [11]      $N_r m n$                                                 TransE
TransD [21]       $N_e m + N_r n$                                           TransE
TranSparse [10]   $2 N_r (1 - \hat{\theta}) m n$, $0 \le \hat{\theta} \le 1$   TransE
TransGate [20]    $4m^2 + 2m$                                               None
TransBidiFilter   $6m^2 + 3m$                                               None

4.1 Datasets
The link prediction task is implemented on two large-scale knowledge bases: WordNet [1] and Freebase [2]. WordNet [1] is a large lexical database of the English language. Freebase [2] is a large collaborative knowledge graph of general world facts. WN18RR [6] from WordNet [1], and FB15K [4] and FB15K-237 [15] from Freebase [2], are used for the evaluation (Table 2).

Table 2. Statistics of datasets

Datasets         Rel     Ent      Train     Valid    Test
WN18RR [6]       11      40,943   86,835    3,034    3,134
FB15K [4]        1,345   14,951   483,142   50,000   59,071
FB15K-237 [15]   237     14,541   272,115   17,535   20,466

4.2 Link Prediction
The aim of link prediction is to predict the missing h or t for a triplet (h, r, t), i.e., predicting t given (h, r) or predicting h given (r, t). For each testing triplet (h, r, t), we corrupt it by replacing the head or the tail with each entity in the knowledge base and calculate the distance scores of all resulting triplets. Then we rank the scores in ascending order and get the rank of the correct entity. A corrupted triplet may itself exist in the knowledge base, in which case it should also be considered correct. We therefore filter out the corrupted triplets that already exist in the knowledge base, so that only truly corrupted triplets remain. Three filtered metrics are adopted for evaluation: the mean rank of all correct entities (MR), the mean reciprocal rank of all correct entities (MRR), and the proportion of correct entities ranked in the top 10 (Hits@10). A good link prediction result has a lower Mean Rank, a higher MRR and a higher Hits@10.
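The filtered ranking protocol and the three metrics described above can be sketched as follows. The function names `filtered_rank` and `summarize` and the toy scores are hypothetical; ties are broken optimistically here (only strictly smaller distances count as better), which is an assumption rather than something stated in the text, and Hits@10 is returned as a fraction rather than a percentage.

```python
def filtered_rank(scores, correct, known):
    """Rank of `correct` by ascending distance, ignoring other known-true entities.

    scores: dict mapping each candidate entity to its distance score.
    known:  entities (other than `correct`) that also form true triplets.
    """
    target = scores[correct]
    better = sum(1 for e, s in scores.items()
                 if e != correct and e not in known and s < target)
    return 1 + better

def summarize(ranks):
    """Return MR, MRR and Hits@10 for a list of ranks."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Hits@10": sum(1 for r in ranks if r <= 10) / n,
    }

scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
raw = filtered_rank(scores, correct="c", known=set())        # "a" and "b" rank higher
filtered = filtered_rank(scores, correct="c", known={"a"})   # "a" is a true answer, skipped
metrics = summarize([1, 2, 20])
print(raw, filtered, metrics)
```

Filtering can only improve (lower) the rank of the correct entity, because it removes competitors that are themselves valid answers.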
For the three datasets, we search the learning rate α among {0.001, 0.01, 0.1}, the margin γ among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, the embedding dimension m among {32, 50, 100, 200}, and the batch size b among {1440, 2880, 5760}. The optimal configurations are as follows: on WN18RR, γ = 8, α = 0.01, m = 200, b = 5760 with the L1 distance; on FB15K, γ = 1, α = 0.01, m = 200, b = 5760 with the L1 distance; on FB15K-237, γ = 6, α = 0.01, m = 200, b = 5760 with the L1 distance.

Table 3. Evaluation results on link prediction

                   WN18RR                   FB15K                    FB15K-237
Models             MRR     MR     Hits@10   MRR     MR     Hits@10   MRR     MR     Hits@10
TransE [4]         0.226   3384   50.1      0.220   125    47.1      0.294   347    46.5
DistMult [19]      0.43    5110   49.0      0.654   97     82.4      0.241   254    41.9
TransD [21]        –       –      42.8      0.252   67     77.3      –       –      45.3
CombineE [14]      –       –      –         0.283   –      85.2      –       –      –
ComplEX [16]       0.44    5261   51.0      0.692   –      84.0      0.247   339    42.8
KB-LRN [7]         –       –      –         0.794   44     87.5      0.309   209    49.3
NLFeat [15]        –       –      –         0.822   –      87.0      0.249   –      41.7
RUGE [8]           –       –      –         0.768   –      86.5      –       –      –
KBGAN [5]          0.213   –      48.1      –       –      –         0.278   –      45.8
R-GCN [13]         –       –      –         0.696   –      84.2      0.248   –      41.7
TransG [18]        –       –      –         0.657   51     83.1      –       –      –
ConvE [6]          0.43    4187   52.0      0.657   51     83.1      0.325   244    50.1
ConvKB [12]        0.248   2554   52.5      0.768   –      –         0.396   257    51.7
TransGate [20]     0.409   3420   51.0      0.832   33     91.4      0.404   177    58.1
TransBidiFilter    0.226   3362   54.0      0.834   27     94.5      0.452   225    60.0
In Table 3, the best indicators are in bold and the second-best indicators are underlined. We observe that: (1) on WN18RR, TransBidiFilter outperforms state-of-the-art baselines on Hits@10 and is second only to TransGate on Mean Rank; (2) on FB15K, TransBidiFilter outperforms state-of-the-art baselines on all indicators; (3) on FB15K-237, TransBidiFilter outperforms state-of-the-art baselines on both Mean Reciprocal Rank and Hits@10, and is ranked second on Mean Rank. Overall, among the 9 indicators on the three datasets, TransBidiFilter is ranked first on 6 indicators and second on 2 indicators. This means that TransBidiFilter has better link prediction performance than existing methods in most cases, which supports, to some extent, the validity of the bidirectional semantic filtering mechanism. It also shows that TransBidiFilter can perform automatic knowledge base completion well.
5 Conclusions
To recap, the contributions of the present work are as follows. We present a bidirectional interaction mechanism between entities and relations. By building three filtering gates, we obtain a specific semantic representation of each part of a triplet. Moreover, clear interactions between entities and relations can also be obtained. This model achieves better performance on the link prediction task than most semantic discriminate models. For future work, we will try to incorporate indirect interactions between relations and between entities (for example, the hierarchy of relations and the category of entities) into the learning process.
References

1. Miller, G.: WordNet: a lexical database for English. Commun. ACM 38, 39–41 (1995)
2. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250 (2008)
3. Fabian, M., Gjergji, K., Gerhard, W., et al.: YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In: 16th International World Wide Web Conference, WWW, pp. 697–706 (2007)
4. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795 (2013)
5. Cai, L., Wang, W.Y.: KBGAN: adversarial learning for knowledge graph embeddings. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1470–1480 (2018)
6. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: The Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1811–1818 (2018)
7. Garcia-Duran, A., Niepert, M.: KBLRN: end-to-end learning of knowledge base representations with latent, relational, and numerical features. In: Proceedings of UAI (2017)
8. Guo, S., Wang, Q., Wang, L., Wang, B., Guo, L.: Knowledge graph embedding with iterative guidance from soft rules. In: Thirty-Second AAAI Conference on Artificial Intelligence, pp. 4816–4823 (2018)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
10. Ji, G., Liu, K., He, S., Zhao, J.: Knowledge graph completion with adaptive sparse transfer matrix. In: Thirtieth AAAI Conference on Artificial Intelligence, pp. 985–991 (2016)
11. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2181–2187 (2015)
12. Nguyen, D.Q., Nguyen, T.D., Nguyen, D.Q., Phung, D.: A novel embedding model for knowledge base completion based on convolutional neural network. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 327–333 (2017)
13. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 10843, pp. 593–607. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93417-4_38
14. Tan, Z., Zhao, X., Wang, W.: Representation learning of large-scale knowledge graphs via entity feature combinations. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1777–1786 (2017)
15. Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66 (2015)
16. Trouillon, T., Welbl, J., Riedel, S., Gaussier, E., Bouchard, G.: Complex embeddings for simple link prediction. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 2071–2080 (2016)
17. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1112–1119 (2014)
18. Xiao, H., Huang, M., Zhu, X.: TransG: a generative model for knowledge graph embedding. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 2316–2325 (2016)
19. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: Proceedings of the International Conference on Learning Representations (2014)
20. Yuan, J., Gao, N., Xiang, J.: TransGate: knowledge graph embedding with shared gate structure. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3100–3107 (2019)
21. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, pp. 687–696 (2015)
22. He, S., Liu, K., Ji, G., Zhao, J.: Learning to represent knowledge graphs with Gaussian embedding. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 623–632. ACM (2015)
23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
24. García-Durán, A., Bordes, A., Usunier, N.: Composing relationships with translations. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 286–290 (2015)
Applying Model Fusion to Augment Data for Entity Recognition in Legal Documents

Hu Zhang1(B), Haihui Gao1, Jingjing Zhou1, and Ru Li1,2

1 School of Computer and Information Technology, Shanxi University, Taiyuan, China
{zhanghu,liru}@sxu.edu.cn, [email protected], [email protected]
2 Key Laboratory of Computing Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan, China
Abstract. Named entity recognition for legal documents is a basic and crucial task, which can provide important knowledge for related tasks in the field of wisdom justice. However, it is still difficult to augment the labeled data of named entities for legal documents automatically. To address this issue, we propose a novel data augmentation method for named entity recognition by fusing multiple models. First, we train a total of ten models by conducting 5-fold cross-training on the small-scale labeled datasets, based on the BiLSTM-CRF and BERT-BiLSTM-CRF models separately. Next, we apply single-model fusion and multi-model fusion: single-model fusion votes on the prediction results of the five models of the same baseline, while multi-model fusion votes on the prediction results of the ten models from the two different baselines. Further, we take the entities identified with high correctness across the multiple experimental results as effective entities, and add them to the training set for the next round of training. Finally, we conduct experiments on two public datasets and on our own judicial dataset. The results obtained with data augmentation are close to those obtained with a labeled dataset five times as large, and clearly better than those obtained on the initial small-scale labeled datasets.

Keywords: Wisdom justice · Named entity recognition · Legal document · Data augmentation · Model fusion
© Springer Nature Switzerland AG 2020
X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 244–255, 2020. https://doi.org/10.1007/978-3-030-60450-9_20

1 Introduction

With the development of artificial intelligence and big data technology, “wisdom justice” has become an interesting research focus, which can promote the transformation and upgrading of benefiting from the rule of law and achieve high-quality development of the judicial area. In March 2019, Zhou Qiang, President of the Supreme People’s Court, reported on the work of the Supreme People’s Court, emphasizing the need to deepen the reform of the judicial system and the construction of wisdom courts and to promote the modernization of the judicial system and judicial capacity. Nowadays, courts in various regions are actively responding to the national action plan and carrying out the construction of wisdom courts in three stages: (1) Assist in some simple, mechanical
and repetitive tasks. (2) Learn artificial intelligence technology to assist in judicial trials. (3) Carry out judicial-related convenience services, such as online legal advisory services and intelligent judicial trials.

In the meantime, to promote the development and application of wisdom justice-related technologies, the “China Law Research Cup” Judicial Artificial Intelligence Challenge (CAIL) has been held for three consecutive years since 2018 by China Judicial Big Data Research Institute, Chinese Information Society of China, Tsinghua University, Peking University, Institute of Software of Chinese Academy of Sciences, and other institutions. It has mainly organized public evaluation tasks such as judicial judgment prediction, judicial reading comprehension, element extraction and similar case matching. Further, the tasks of judicial summary, judicial examination and argumentation mining were recently launched in CAIL 2020.

In recent years, relevant studies for wisdom judicial services have shown that named entity recognition in legal documents provides important knowledge support for a series of key tasks. Named entity recognition identifies the entities in a text and indicates their categories. Quick and effective identification of the relevant entities in the documents will bring great convenience to wisdom judicial services, and improve the efficiency of handling judicial cases. At present, there is relatively little research specifically on named entity recognition in legal documents, and there is an obvious lack of labeled named entity datasets. In this study, we refer to the entity labeling specifications of the Automatic Content Extraction (ACE) Program and the Message Understanding Conferences (MUC) for the named entity recognition task, define entity labeling specifications for legal documents, and construct a labeled named entity dataset.
Based on this, we propose a novel data augmentation method for named entity recognition by fusing multiple models, which achieves nearly the same performance as training on a labeled dataset five times as large (Fig. 1).
Fig. 1. Data augmentation process by fusing the experimental results of multiple models. [Figure panels: Input Text; Models by cross-training; Tagging by Multiple Models; Voting on the Prediction Results. In the example sentence, one model tags "Yang Zhennan" as LOC while the other models tag it as PER, all models tag "marriage registration office of the Luchuan County Civil Affairs Bureau" as ORG, and the voted output keeps the PER and ORG tags.]
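The voting step illustrated in Fig. 1 can be sketched as a simple vote count over the entities predicted by each model. This is a hedged illustration only: the function name `vote`, the representation of an entity as a (text, label) pair, and the `min_votes` threshold are assumptions made here, since the exact voting rule is not specified in this part of the paper.

```python
from collections import Counter

def vote(predictions, min_votes):
    """Keep the (entity text, label) pairs predicted by at least `min_votes` models.

    predictions: one set of (entity text, label) pairs per model.
    """
    counts = Counter(item for pred in predictions for item in set(pred))
    return {item for item, c in counts.items() if c >= min_votes}

ORG_TEXT = "marriage registration office of the Luchuan County Civil Affairs Bureau"
predictions = [
    {("Yang Zhennan", "LOC"), (ORG_TEXT, "ORG")},   # one model mislabels the person
    {("Yang Zhennan", "PER"), (ORG_TEXT, "ORG")},
    {("Yang Zhennan", "PER"), (ORG_TEXT, "ORG")},
    {("Yang Zhennan", "PER"), (ORG_TEXT, "ORG")},
]
kept = vote(predictions, min_votes=3)
print(sorted(kept))
```

Entities that survive the vote would then be added to the training set for the next round; the threshold trades off the amount of new data against the risk of adding wrong labels.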
Contributions. The main contributions of our work can be summarized as follows:
(1) Model fusion is used for the first time to augment a named entity recognition dataset in the field of justice.
(2) The proposed method achieves excellent results on three different datasets, which shows that our method has good generality.
(3) The experimental results on two public datasets show that our method, using a small-scale labeled dataset, achieves results close to those obtained with a labeled dataset five times as large.
2 Related Work

To improve the level of wisdom judicial service, many researchers have explored valuable tasks based on legal documents in recent years, such as judicial judgment prediction, similar case matching and judicial reading comprehension. Luo used an attention-based neural network to model the charge prediction task and the relevant article extraction task in a unified framework [1]. Jiang proposed a neural-based system that jointly extracts readable rationales and improves charge prediction accuracy through a rationale augmentation mechanism [2]. Lauderdale used similarity information between different cases to estimate the similarity of different legal issues [3]. Duan proposed the first Chinese judicial reading comprehension dataset, CJRC, to fill gaps in legal reading comprehension research [4]. Named entity recognition is a basic task in justice research, which can provide important knowledge for the above tasks. Wang proposed a joint learning approach, namely Aux-LSTM, which uses a large amount of auto-annotated data to help a small amount of human-annotated data for person name recognition in legal documents [5]. Cardellino tried to improve information extraction in legal texts by creating a legal named entity recognizer [6]. Leitner described an approach to named entity recognition in German-language documents from the legal domain [7]. There are few existing studies specifically on named entity recognition for legal documents. For general datasets, with the development of deep learning in recent years, named entity recognition based on deep learning methods has gradually become dominant. Compared with feature-based methods, deep learning finds hidden features more easily. Collobert was one of the first to use neural networks for named entity recognition, and proposed a window architecture for the task [8].
Most of these researches combine bi-directional long-short-term memory models with convolutional neural networks (CNN), conditional random field (CRF) [9–11], and attention mechanism [12–14] for named entity recognition. Cetoli used a set of graph convolutional networks (GCN) to study the role of dependency trees in named entity recognizers [15]. Zhang investigated a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon [16]. Recently, the large pretrained language models have been used in many downstream tasks of natural language processing, such as ELMO [17], Bert [18], Albert [19], etc. Li pre-trained BERT model on the unlabeled Chinese clinical records, which can leverage the unlabeled domainspecific knowledge to the clinical named entity recognition [20]. Moon investigated a single named entity recognition model based on multilingual BERT, which is trained jointly on many languages simultaneously, and it can decode these languages with better accuracy than the models trained only on one language [21]. Aiming at the lack of labeled data, there are many studies for data augmentation in NLP. One popular study generated new data by back-translation from a neural
machine translation model [22]. Some work used predictive language models for synonym replacement [23]. Other work focused on creating adversarial examples by replacing words or characters [24]. Named entity recognition, however, often uses semi-supervised methods to augment data. Wu et al. used bootstrapping for named entity recognition, learning from a large number of unlabeled texts to enlarge the training dataset [25]. Neelakantan and Collins employed a large amount of unlabeled data and a small number of seed samples to automatically construct a dictionary for named entity recognition [26]. Peters et al. enhanced the generalization ability of the model by adding word vectors obtained from a pre-trained language model [27].

Existing research on named entity recognition for legal documents mainly has the following problems: (1) Because labeled data for legal documents are scarce, the experimental results obtained on small-scale datasets are unsatisfactory. (2) Existing data augmentation methods usually depend on specific domains, so their generality is relatively poor. To address these issues, this paper proposes a novel data augmentation method for entity recognition that fuses multiple models.
3 Entity Dataset Construction

3.1 Entity Labeling Specifications

At present, there are no public labeling specifications or datasets for named entity recognition in the judicial field. By analyzing legal documents and the entity labeling specifications of ACE and MUC, we define six types of entities. Each entity type and its label are shown in Table 1. In addition to the four common entity types (person, location, organization, and time), we add two domain-specific types: statute and case number. For example, “Subparagraph (5) of Article 32, Paragraph 3 of the Marriage Law of the People’s Republic of China” is labeled as a statute, and “(2015) Zhongminyichuzi No. 168” is labeled as a case number.

Table 1. Types of entity.

| Entity type | Person | Location | Organization | Time | Statute | Case | Others |
|---|---|---|---|---|---|---|---|
| Entity label | PER | LOC | ORG | TIM | STA | CAS | O |
3.2 Experimental Dataset

In this paper, we build a labeled judicial entity dataset based on the above specifications. In addition to this judicial dataset, we introduce two public datasets to verify our proposed model.

Our Dataset. According to the above entity labeling specifications, we manually tagged 300 legal documents and constructed a judicial entity corpus. Each word in the dataset
uses the BIO notation, where “B” represents the beginning of an entity, “I” represents the inside of an entity, and “O” represents others. The data allocation is shown in Table 2. 200 legal documents (about 6000 sentences) are selected as the initial training set, 50 legal documents (about 1500 sentences) form the test set, and the remaining 50 form the development set.

Table 2. Experimental datasets.

| Datasets | MSRA | People’s Daily | Judicial data |
|---|---|---|---|
| Initial training set | 9240 | 3600 | 6000 |
| Test set | 2310 | 900 | 1500 |
| Dev set | 2219 | 584 | 1500 |
| Number of sentences per fusion | 9240 | 3600 | 6000 |
| Times of fusion | 4 | 4 | 4 |
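The BIO notation used for the corpus can be illustrated with a short sketch. The `B-PER` / `I-PER` tag spelling, the `bio_tags` helper, and the example sentence are assumptions made for illustration; the paper only states that each of the six entity types is marked with “B” and “I” and that everything else is “O”.

```python
def bio_tags(tokens, spans):
    """Convert entity spans into BIO tags.

    tokens: list of tokens (for Chinese text these would typically be characters).
    spans:  list of (start, end, label) triples with `end` exclusive;
            spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example using the PER and LOC labels from Table 1.
tokens = ["Plaintiff", "Qi", "lives", "in", "Dawukou", "District"]
spans = [(1, 2, "PER"), (4, 6, "LOC")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

With six entity types, this scheme yields twelve B/I tags plus “O”, which matches the thirteen labels mentioned in the fusion function of Sect. 4.2.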
Public Datasets. To verify the validity of the model and the reliability of the experimental results, we also use two public datasets in our experiments: People’s Daily (January 1998) and MSRA. Like the judicial dataset, both are Chinese named entity datasets, and their data allocation is shown in Table 2. In addition to the sizes of the three datasets, Table 2 lists the number of sentences added per fusion and the number of fusions.
4 Data Augmentation Model for Named Entity Recognition

In the following parts, we present the architecture of the entire model and describe the model fusion process in detail.

4.1 Model Structure

As illustrated in Fig. 2, the model consists of four layers. The first layer is the input data layer. The second layer is the baseline training layer, which uses two baseline models in this paper, Bilstm-CRF and Bert-Bilstm-CRF. The third layer is the single-model fusion layer, in which the n prediction results of n trained models of the same baseline are fused (n = 5 in our experiments), and the high-confidence entities are added to the training set for the next round of training. The fourth layer is the multi-model fusion layer, in which the n prediction results of n trained models from the two different baselines are fused (n = 10 in our experiments), and the high-confidence entities are again added to the training set for the next round of training. In Fig. 2, “1” indicates that the five prediction results of the five trained Bilstm-CRF models are fused, “2” indicates that the five prediction results of the Bert-Bilstm-CRF models are fused, and “3” indicates that the ten prediction results of the two baseline models are fused. To examine the effect of the number of iterations, we repeat this four-layer procedure several times on all datasets.
Fig. 2. Data augmentation model.
4.2 Model Fusion Process

The model fusion process is as follows:

(1) Use the Bilstm-CRF model and the Bert-Bilstm-CRF model to perform 5-fold cross-training on the small-scale labeled dataset, obtaining ten different entity recognizers.
(2) Use n different entity recognizers to predict the input text and obtain n corresponding prediction results, where n is 5 or 10 in this paper.
(3) Fuse the n prediction results. If at least m of the predictions agree on an entity label, that label is assigned to the entity; otherwise, the entity label is “O”. The resulting valid entities are then added to the training set for the next training iteration. Let the label set be Tags = {t_i | i = 1, 2, 3, ..., 13}, and let sum(t_i) count the number of times label t_i appears among the n predictions. The fusion function is:

$$\mathrm{Merge}(t_i) = \begin{cases} t_i, & \mathrm{sum}(t_i) \ge m \\ \mathrm{O}, & \text{otherwise} \end{cases} \qquad (1)$$

where m is 3 or 6.
(4) Stop the data fusion iteration when the F1 value of the model becomes stable.
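Step (3) and formula (1) can be sketched as a simple voting routine. This is a minimal illustration, not the authors' implementation: it assumes that voting is done per token on BIO tags, that all n predictions are aligned to the same token sequence, and that m = 3 is paired with n = 5 and m = 6 with n = 10 (the paper lists the values but not the pairing). The function names are hypothetical.

```python
from collections import Counter


def merge_labels(labels, m):
    """Formula (1): keep the most frequent label if at least m recognizers agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= m else "O"


def fuse_predictions(predictions, m):
    """Fuse n aligned tag sequences (one per recognizer) token by token."""
    return [merge_labels(column, m) for column in zip(*predictions)]


# Five hypothetical recognizer outputs for a four-token sentence.
predictions = [
    ["B-PER", "O", "B-LOC", "I-LOC"],
    ["B-PER", "O", "B-LOC", "I-LOC"],
    ["B-PER", "O", "B-LOC", "I-LOC"],
    ["O",     "O", "B-ORG", "I-ORG"],
    ["O",     "O", "B-ORG", "O"],
]
print(fuse_predictions(predictions, m=3))
```

Because m is larger than n / 2 in both assumed settings, at most one label can reach the threshold, so the vote is unambiguous. Token-level voting can in principle produce an “I” tag without a preceding “B” tag; an entity-level vote would avoid this, and the paper does not say which granularity is used.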
5 Experiments

In this section, we analyze the experimental results of model fusion, including single-model fusion and multi-model fusion. To ensure the quality of data augmentation and the reliability of the experimental results, 5-fold cross-validation is used during model training. Cross-validation samples without repetition so that every sample takes part in both training and testing, which extracts more useful information from limited labeled data.

5.1 Experimental Evaluation

We use precision (p), recall (r), and the f1 value (f1) to evaluate the entity recognizers. The f1 value is defined in formula (2):

$$f1 = \frac{2 \cdot p \cdot r}{p + r} \qquad (2)$$

To make the evaluation more objective, we average the results of n experiments to obtain the final precision (P), recall (R), and F1 value (F1), as shown in formulas (3)–(5):

$$P = \frac{1}{n} \sum_{i=1}^{n} p_i \qquad (3)$$

$$R = \frac{1}{n} \sum_{i=1}^{n} r_i \qquad (4)$$

$$F1 = \frac{1}{n} \sum_{i=1}^{n} f1_i \qquad (5)$$
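Formulas (2)–(5) can be written as a small sketch. The function names and the zero-division guard are assumptions added for illustration; the inputs are per-recognizer precision and recall values on the test set.

```python
def f1_score(p, r):
    """Formula (2); returns 0.0 when p + r is zero (guard added as an assumption)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def average_metrics(results):
    """Formulas (3)-(5): average precision, recall, and f1 over n recognizers.

    results: list of (precision, recall) pairs, one per entity recognizer.
    """
    n = len(results)
    P = sum(p for p, _ in results) / n
    R = sum(r for _, r in results) / n
    F1 = sum(f1_score(p, r) for p, r in results) / n
    return P, R, F1


# Hypothetical per-recognizer scores, only to show the call.
print(average_metrics([(0.90, 0.88), (0.92, 0.86), (0.89, 0.90)]))
```

Note that F1 here is the mean of the individual f1 values, as in formula (5), which in general differs slightly from the f1 value computed from the averaged P and R.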
In formulas (3)–(5), n is the number of entity recognizers, and p_i, r_i, and f1_i denote the precision, recall, and f1 value of the i-th entity recognizer on the test set.

5.2 Experimental Results and Analysis

(1) Single-model fusion

In single-model fusion, we fuse the five results predicted by the models obtained from 5-fold cross-training on the initial labeled training set; the experimental results are given in Table 3. There, “initial” denotes the initial small-scale labeled data; “1st fusion”, “2nd fusion”, “3rd fusion”, and “4th fusion” denote the datasets after one, two, three, and four fusions, respectively; and “public” denotes real labeled data of the same size as “4th fusion”. As Table 3 shows, with the Bilstm-CRF model the F1 of “4th fusion” on the MSRA dataset exceeds that of “public” by 1.12 percentage points (85.21 vs. 84.09), while on the People’s Daily dataset the “4th fusion” result is almost as good as that of “public”. Likewise, with the Bert-Bilstm-CRF model the results of “4th fusion” and “public” are very close on both public datasets. All experimental results suggest that the fused data can
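Table 3. Results of single-model fusion.

| Datasets | Number of sentences | Bilstm-CRF P/% | Bilstm-CRF R/% | Bilstm-CRF F1/% | Bert-Bilstm-CRF P/% | Bert-Bilstm-CRF R/% | Bert-Bilstm-CRF F1/% |
|---|---|---|---|---|---|---|---|
| MSRA | 9240 (initial) | 73.82 | 68.17 | 70.88 | 88.82 | 88.60 | 88.70 |
| MSRA | 18480 (1st fusion) | 82.18 | 75.20 | 78.53 | 90.34 | 90.64 | 90.48 |
| MSRA | 27720 (2nd fusion) | 84.90 | 79.55 | 82.13 | 91.89 | 91.12 | 91.50 |
| MSRA | 36960 (3rd fusion) | 87.13 | 81.91 | 84.43 | 92.23 | 91.59 | 91.85 |
| MSRA | 46200 (4th fusion) | 87.49 | 83.06 | 85.21 | 91.96 | 91.98 | 91.97 |
| MSRA | 46200 (public) | 85.61 | 82.64 | 84.09 | 92.32 | 92.38 | 92.34 |
| People’s Daily | 3600 (initial) | 77.96 | 75.74 | 76.83 | 87.33 | 92.27 | 89.73 |
| People’s Daily | 7200 (1st fusion) | 84.24 | 83.21 | 83.72 | 90.72 | 94.00 | 92.33 |
| People’s Daily | 10800 (2nd fusion) | 87.57 | 85.53 | 86.53 | 92.21 | 94.57 | 93.37 |
| People’s Daily | 14400 (3rd fusion) | 88.17 | 87.14 | 87.65 | 92.47 | 94.91 | 93.67 |
| People’s Daily | 18000 (4th fusion) | 89.50 | 88.55 | 89.02 | 93.06 | 95.49 | 94.26 |
| People’s Daily | 18000 (public) | 89.95 | 88.78 | 89.36 | 94.07 | 94.59 | 94.33 |
| Judicial data | 6000 (initial) | 91.46 | 90.17 | 90.81 | 93.09 | 94.15 | 93.67 |
| Judicial data | 12000 (1st fusion) | 93.75 | 92.94 | 93.33 | 93.97 | 95.63 | 94.79 |
| Judicial data | 18000 (2nd fusion) | 94.68 | 93.97 | 94.32 | 94.72 | 96.09 | 95.34 |
| Judicial data | 24000 (3rd fusion) | 95.38 | 94.92 | 95.14 | 94.79 | 96.64 | 95.70 |
| Judicial data | 30000 (4th fusion) | 95.43 | 94.95 | 95.18 | 95.11 | 96.78 | 95.94 |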
also obtain results comparable to, or even better than, those of the public labeled data. Moreover, on our judicial data, all three evaluation metrics improve considerably after fusion, and the F1 of the Bert-Bilstm-CRF model reaches a maximum of 95.94%. To further illustrate the trend after fusion, Fig. 3 plots the F1 value of the two baseline models on the different datasets. As the number of fusions increases, the F1 values of both models rise and then gradually stabilize. In addition, the results of “4th fusion” are very close to those of “public” on the two public datasets. The performance on the three datasets further supports the applicability of our proposed method.

(2) Multi-model fusion

The ten results predicted by the trained models of the two different baselines were fused, and the experimental results are shown in Table 4. There, “initial” denotes the initial small-scale labeled data, and “4th fusion” denotes the dataset after four fusions. Table 4 shows that multi-model fusion also reaches promising performance. When the judicial training set is expanded to 30000 sentences, the F1 value of the model reaches 95.30%. However, the results of multi-model fusion are not as good as those of the Bert-Bilstm-CRF model in Table 3. Because the gap between the two models is relatively
Fig. 3. F1 of single-model fusion with different data composition. (Two panels: Bilstm-CRF model and Bert-Bilstm-CRF model. x-axis: data composition, from initial through 1st, 2nd, 3rd, and 4th fusion to public; y-axis: F1/%; series: MSRA, People’s Daily, Judicial Data.)
Table 4. Results of multi-model fusion.

| Datasets | Number of sentences | Multi-model P/% | Multi-model R/% | Multi-model F1/% |
|---|---|---|---|---|
| MSRA | 9240 (initial) | 81.32 | 78.38 | 79.79 |
| MSRA | 46200 (4th fusion) | 90.49 | 84.91 | 87.61 |
| People’s Daily | 3600 (initial) | 82.64 | 84.00 | 83.28 |
| People’s Daily | 18000 (4th fusion) | 91.74 | 90.49 | 91.11 |
| Judicial data | 6000 (initial) | 92.28 | 92.16 | 92.24 |
| Judicial data | 30000 (4th fusion) | 95.26 | 95.35 | 95.30 |
large, the results of multi-model fusion are lower than those of single-model fusion with the Bert-Bilstm-CRF model. To compare performance on the initial and fused datasets visually, the F1 values are presented in Fig. 4. As Fig. 4 shows, among the two single-model fusions and the multi-model fusion, the Bert-Bilstm-CRF model always performs best. In addition, compared with the “initial” results, the results of “4th fusion” improve markedly for all models. These results indicate that the data augmentation is effective.

5.3 Data Analysis

By analyzing the entity annotation results on the legal documents, we find that the results for all entity types improve to some extent after data augmentation; a specific example is shown in Fig. 5.
Fig. 4. F1 value of “initial” and “4th fusion” with the different models on three datasets. (x-axis: datasets MSRA, People’s Daily, and Judicial Data; y-axis: F1/%; series: Bilstm-CRF (initial), Bilstm-CRF (4th fusion), Bert-Bilstm-CRF (initial), Bert-Bilstm-CRF (4th fusion), Multi-model (initial), Multi-model (4th fusion).)
Left panel (initial data): Ningxia Hui [The People’s Court of Dawukou District, Shizuishan City, Autonomous Region] ORG Civil Judgment [(2016) Ning 0202 Minchu No. 1614] CAS Plaintiff [Qi] PER Moumou, female, was born in [March 25, 1980] TIM, Han nationality, has a part-time job, live in [Dawukou District] LOC. Defendant [Meng] PER, XX, male, was born in [September 10, 1974] TIM, Han nationality, has a part-time job, live in [Alashan, Inner Mongolia Autonomous Region] LOC Left Banner. … In summary, in accordance with [Article 32 (3) (5), Article 36 (2), Article 37 (1) and Article 39 (1) Paragraph of Marriage Law of the People’s Republic of China] STA and [Several Opinions of the Supreme People’s Court on People’s Courts Dealing with Child Raising Cases in the Trial of Divorce Cases Article 3 (4)] STA ’s rule is as follows: 1. The plaintiff [Qi] PER is granted Divorced from the defendant [Meng] PER …

Right panel (after data augmentation): [The People’s Court of Dawukou District, Shizuishan City, Ningxia Hui Autonomous Region] ORG Civil Judgment [(2016) Ning 0202 Minchu No. 1614] CAS Plaintiff [Qi] PER Moumou, female, was born in [March 25, 1980] TIM, Han nationality, has a part-time job, live in [Dawukou District] LOC. Defendant [Meng] PER, XX, male, was born in [September 10, 1974] TIM, Han nationality, has a part-time job, live in [Alashan Left Banner, Inner Mongolia Autonomous Region] LOC. … In summary, in accordance with [Article 32 (3) (5), Article 36 (2), Article 37 (1) and Article 39 (1) Paragraph of Marriage Law of the People’s Republic of China] STA and [Several Opinions of the Supreme People’s Court on People’s Courts Dealing with Child Raising Cases in the Trial of Divorce Cases Article 3 (4)] STA ’s rule is as follows: 1. The plaintiff [Qi] PER is granted Divorced from the defendant [Meng] PER …

Fig. 5. Entity annotation results on initial data and after data augmentation.
As shown in Fig. 5, “(2016) Ning 0202 Minchu No. 1614” is a case number, and “Article 32 (3) (5), Article 36 (2), Article 37 (1) and Article 39 (1) Paragraph of Marriage Law of the People’s Republic of China” is a statute. Both types of entities are predicted correctly with the initial small-scale labeled data and with the data after each round of fusion. Because case numbers and statutes have relatively regular structures, small-scale data alone is enough to recognize them well. However, location and organization entities with complex compositions are difficult to identify, and the recognition results on small-scale data are not very good. As shown in the left part of Fig. 5, the correct organization and location are “The People’s Court of Dawukou District, Shizuishan City, Ningxia Hui Autonomous Region” and “Alashan Left Banner, Inner Mongolia Autonomous Region”, but the corresponding predictions are “The People’s Court of Dawukou District, Shizuishan City, Autonomous Region” and “Alashan, Inner Mongolia Autonomous Region”, which
are different from the correct entities; that is, the entities are tagged incorrectly. As the number of fusions increases, the recognition of these two entity types improves markedly, as shown in the right part of Fig. 5. Overall, the data augmentation method based on fusing multiple models improves the recognition of the different entity types to a certain extent. However, fusion also introduces some wrongly labeled entities, and the more rounds of fusion are performed, the more errors accumulate, so the performance of the model cannot be improved by stacking data indefinitely. Controlling the introduced errors while expanding the data is therefore worth exploring.
6 Conclusion

In this paper, we draw on the ACE and MUC entity labeling specifications to define and manually label the entities in legal documents. To address the lack of labeled data for named entities in legal documents, we propose a novel data augmentation method for named entity recognition that fuses multiple models. The experimental results on the two public datasets show that the results with data augmentation are close to those obtained with five times as much real labeled data, and on all three datasets, including our judicial dataset, they are better than those obtained with the initial small-scale labeled data. In the future, we will further explore transfer learning methods on different types of legal documents and study other semi-supervised named entity recognition methods for legal documents.

Acknowledgments. This research was supported by the National Social Science Fund of China (No. 18BYY074).
References

1. Luo, B., Feng, Y., Xu, J., Zhao, D.: Learning to predict charges for criminal cases with legal basis. In: Empirical Methods in Natural Language Processing, pp. 2727–2736 (2017)
2. Jiang, X., Ye, H., Luo, Z., Chao, W., Ma, W.: Interpretable rationale augmented charge prediction system. In: International Conference on Computational Linguistics, pp. 146–151 (2018)
3. Lauderdale, B.E., Clark, T.S.: The supreme court’s many median justices. Am. Polit. Sci. Rev. 106(04), 847–866 (2012)
4. Duan, X., et al.: CJRC: a reliable human-annotated benchmark dataset for Chinese judicial reading comprehension. In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) CCL 2019. LNCS (LNAI), vol. 11856, pp. 439–451. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_36
5. Wang, L., Yan, Q., Li, S., Zhou, G.: Employing auto-annotated data for person name recognition in judgment documents. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds.) CCL/NLP-NABD 2017. LNCS (LNAI), vol. 10565, pp. 13–23. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69005-6_2
6. Cardellino, C., Teruel, M., Alemany, L.A., Villata, S.: A low-cost, high-coverage legal named entity recognizer, classifier and linker. In: 16th International Conference on Artificial Intelligence and Law, pp. 9–18. London, United Kingdom (2017)
7. Leitner, E., Rehm, G., Moreno-Schneider, J.: Fine-grained named entity recognition in legal documents. In: 15th International Conference on Semantic Systems, pp. 272–287. Karlsruhe, Germany (2019)
8. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(1), 2493–2537 (2011)
9. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991 (2015)
10. Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4(1), 357–370 (2016)
11. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1064–1074 (2016)
12. Rei, M., Crichton, G.K., Pyysalo, S.: Attending to characters in neural sequence labeling models. In: International Conference on Computational Linguistics, pp. 309–318 (2016)
13. Bharadwaj, A., Mortensen, D.R., Dyer, C., Carbonell, J.G.: Phonologically aware neural model for named entity recognition in low resource transfer settings. In: Empirical Methods in Natural Language Processing, pp. 1462–1472 (2016)
14. Tan, Z., Wang, M., Xie, J.: Deep Semantic Role Labeling with Self-Attention. arXiv preprint arXiv:1712.01586 (2017)
15. Cetoli, A., Bragaglia, S., O’Harney, A.D., Sloan, M.: Graph Convolutional Networks for Named Entity Recognition. arXiv preprint arXiv:1709.10053 (2017)
16. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: Meeting of the Association for Computational Linguistics, pp. 1554–1564 (2018)
17. Peters, M.E., et al.: Deep contextualized word representations. In: North American Chapter of the Association for Computational Linguistics, pp. 2227–2237 (2018)
18. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2018)
19. Lan, Z., et al.: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942 (2019)
20. Li, X., Zhang, H., Zhou, X.-H.: Chinese clinical named entity recognition with variant neural structures based on BERT methods. J. Biomed. Inform. 107 (2020). https://doi.org/10.1016/j.jbi.2020.103422
21. Moon, T., Awasthy, P., Ni, J., Florian, R.: Towards Lingua Franca Named Entity Recognition with BERT. arXiv preprint arXiv:1912.01389 (2019)
22. Yu, A.W., et al.: QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. arXiv preprint arXiv:1804.09541 (2018)
23. Kobayashi, S.: Contextual augmentation: data augmentation by words with paradigmatic relations. North Am. Chapter Assoc. Comput. Linguist. 2, 452–457 (2018)
24. Samanta, S., Mehta, S.: Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812 (2017)
25. Wu, D., Lee, W.S., Ye, N., Chieu, H.L.: Domain adaptive bootstrapping for named entity recognition. In: Empirical Methods in Natural Language Processing, pp. 1523–1532. Singapore (2009)
26. Neelakantan, A., Collins, M.: Learning dictionaries for named entity recognition using minimal supervision. In: Conference of the European Chapter of the Association for Computational Linguistics, pp. 452–461 (2014)
27. Peters, M.E., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. Meet. Assoc. Comput. Linguist. 1, 1756–1765 (2017)
Combining Knowledge Graph Embedding and Network Embedding for Detecting Similar Mobile Applications

Weizhuo Li 1,3, Buye Zhang 2(B), Liang Xu 1, Meng Wang 1, Anyuan Luo 2, and Yan Niu 4

1 School of Computer Science and Engineering, Southeast University, Nanjing, China
[email protected], [email protected]
2 School of Cyber Science and Engineering, Southeast University, Nanjing, China
3 School of Modern Posts and Institute of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing, China
4 China Academy of Industrial Internet, Beijing, China
[email protected]
Abstract. With the popularity of mobile devices, large numbers of mobile applications (a.k.a. “apps”) have been developed and published. Detecting similar apps in a large pool of apps is a fundamental and important task because it serves many purposes. Several existing works try to combine different metadata of apps to measure the similarity between apps. However, few of them pay attention to who the service is for. Besides, existing methods do not distinguish the nature of the contents in the metadata. Therefore, it is hard to obtain accurate semantic representations of apps and capture their fine-grained correlations. In this paper, we propose a novel framework based on knowledge graph (KG) techniques and a hybrid embedding strategy to fill these gaps. To construct the KG, we design a lightweight ontology tailored to the needs of cybersecurity analysts. Thanks to a defined schema, more linkages can be shared among apps. To detect similar apps, we divide the relations in the KG into structured and unstructured ones according to their related content. The TextRank algorithm is then employed to extract important tokens from unstructured texts and transform them into structured triples. In this way, the representations of apps in our framework can be learned iteratively by combining KG embedding methods and network embedding models, which improves the performance of similar app detection. Preliminary results indicate the effectiveness of our method compared with existing models in terms of reciprocal ranking and minimum ranking.
1 Introduction
With the popularity of mobile devices, the number of mobile applications (a.k.a. “apps”) has been growing rapidly, providing great convenience to users for online shopping, education, entertainment, financial management, etc. [1]. According to a recent report (https://www.appannie.com/cn/insights/market-data/the-state-of-mobile-2019/), as of August 2018, there were over 9.8 and 4.5
© Springer Nature Switzerland AG 2020
X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 256–269, 2020. https://doi.org/10.1007/978-3-030-60450-9_21
million apps available on Google Play and App Store, respectively, and global downloads of mobile apps had exceeded 194 billion. With so many apps, given a specific app as a query, it is difficult to find all other apps that are similar to it. Detecting similar apps in a large pool of apps is a basic and important task because it benefits different stakeholders in the mobile app ecosystem [2]. For example, it helps app platforms improve their app recommendation systems and enhance the user experience of app search engines. For app developers, detecting similar apps is useful for purposes such as identifying directly competing apps and assessing reusability. Meanwhile, many apps have also become hotbeds for cybercrime, such as stealing private data, spreading false news and pornography, and online scams. Hence, it is essential for cybersecurity analysts to supervise apps and prevent the potential cybercrime derived from them.

However, detecting similar apps is a nontrivial problem. One of the key challenges is how to explore and combine different modalities of data in app markets to measure the similarity between apps in a principled way. Previous studies calculated app similarity with bag-of-words [3] or topic models [4,5], relying on the description texts, titles, and user reviews of apps. Recently, Chen et al. [2] and Lin et al. [6] proposed hybrid frameworks for this task, defining kernel functions and decision trees, respectively, to integrate different metadata and improve similar app detection.

Although existing methods have obtained some encouraging results, they still suffer from two limitations. First, different users of the service (e.g., app users, developers) expect different results from it [7]. Therefore, it may not be suitable to directly apply these algorithms to similar app detection for cybersecurity analysts, who pay more attention to sensitive apps than to entertainment ones. Second, existing works focus on basic features of the metadata and do not distinguish the nature of its contents, such as structured labels (e.g., developers of apps) and unstructured texts (e.g., descriptions of apps). Therefore, it is hard to obtain accurate semantic representations of apps and capture their fine-grained correlations.

To fill these gaps, in this paper we present a novel framework for detecting similar apps using knowledge graph techniques and a hybrid embedding strategy. To provide better detection services for cybersecurity analysts, we focus on one kind of app, namely sensitive apps, which are more likely than ordinary apps to become hotbeds for cybercrime. We define a lightweight ontology, including basic classes and properties (or relations), from the viewpoint of cybersecurity analysts and construct a knowledge graph (KG) of sensitive apps. Thanks to a well-defined schema, more linkages can be shared among apps. To detect similar apps, we divide the relations in the KG into structured and unstructured ones according to their related content. The TextRank algorithm [8] is then employed to extract important tokens from unstructured texts and transform them into structured triples. In this way, the representations of apps in our framework can be learned iteratively by combining KG embedding methods [9] and network embedding models [10], which improves the performance of similar app detection. The main contributions of our work are summarized as follows.
1. We study the problem of detecting similar apps as a service for cybersecurity analysts. To the best of our knowledge, this is the first work that focuses on this problem.
2. We present a novel framework to tackle this problem, in which a knowledge graph is constructed based on a defined ontology so that more linkages can be shared among apps. Moreover, KG embedding methods and network embedding models are combined to iteratively learn the representations of apps and improve performance.
3. We construct a new dataset based on the constructed knowledge graph for evaluation. Compared with several existing methods, the preliminary results indicate that our approach to detecting apps similar to a new one obtains better performance in terms of reciprocal ranking and minimum ranking.
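The TextRank step mentioned in the introduction (extracting important tokens from unstructured text and turning them into triples) can be sketched as follows. This is a simplified keyword variant of TextRank written for illustration, not the authors' pipeline: the co-occurrence window, damping factor, iteration count, and the `hasKeyword` relation name are all assumptions, and tokenization and stop-word filtering are left out.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank tokens with PageRank on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each token to the distinct tokens within the next `window` positions.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Synchronous update: the right-hand side reads the previous scores.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def keywords_to_triples(app_id, keywords, relation="hasKeyword"):
    """Turn extracted keywords into (head, relation, tail) triples (hypothetical schema)."""
    return [(app_id, relation, keyword) for keyword in keywords]


# Hypothetical tokenized app description.
tokens = ["loan", "app", "fast", "loan", "cash", "loan", "credit"]
keywords = textrank_keywords(tokens, top_k=2)
print(keywords_to_triples("app_1", keywords))
```

Triples of this form could then sit alongside the structured relations in the knowledge graph, so that apps sharing a keyword are linked through a common tail entity.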
2 Related Work

2.1 Detecting Similar Mobile Applications
Detecting semantically similar apps in a large pool of apps is a fundamental and important problem, as it benefits various applications such as app classification, app recommendation, and app search. Previous studies calculated the similarity of apps with bag-of-words or topic models. Bhandari et al. [3] concatenated the title, description, and user reviews of an app into one document and built its vector with the TF-IDF weighting scheme. They then used cosine similarity to calculate the pairwise similarity. Yin et al. [4] treated the description of an app as a document and applied LDA to learn its latent topic distribution. In this way, each app was represented as a fixed-length vector, and the similarity between two apps was computed as the cosine similarity of their vectors. More recently, several works have employed hybrid strategies for this task. Chen et al. [2] proposed a framework called SimApp that detects similar apps by constructing kernel functions over the multi-modal heterogeneous data of each app (e.g., description texts, images, user reviews) and learning optimal weights for the kernels. Park et al. [5] exclusively leveraged text information such as reviews and descriptions (written by users and developers, respectively) and designed a topic model that bridges the vocabulary gap between them to improve app retrieval. Lin et al. [6] developed a hybrid framework that integrates a variety of app-related features and recommendation techniques and identifies the most important indicators for detecting similar apps. The authors employed a gradient tree boosting model as the core to integrate the scores, using user features and app metadata as additional features for the decision trees.
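The bag-of-words baseline described above (TF-IDF vectors compared by cosine similarity) can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions rather than the implementation in [3]: the smoothed IDF variant, the whitespace-free pre-tokenized input, and the toy app documents are all chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (an assumed variant).
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents: title, description, and reviews already concatenated and tokenized.
docs = [
    ["photo", "editor", "filter"],
    ["photo", "editor", "camera"],
    ["bank", "loan", "payment"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The first pair shares two of three terms and scores well above zero, while the second pair shares no terms and scores exactly zero, which is the behavior that makes purely lexical similarity brittle when two apps describe the same function in different words.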
Although existing methods have obtained encouraging results, they may not be suitable for providing the same service to cybersecurity analysts, who pay more attention to sensitive apps than to entertainment ones, because different users expect different results from this service [7]. Without domain knowledge, existing methods may not meet the requirements of similar apps detection in the cybersecurity domain. Moreover, existing methods do not distinguish the characteristics of the contents of app metadata (e.g., structured labels versus unstructured texts), which makes it hard to obtain accurate semantic representations of apps and capture their fine-grained correlations. In contrast,
we define a lightweight ontology from the viewpoint of cybersecurity analysts and use the existing metadata of apps to construct a knowledge graph. Moreover, we propose a hybrid strategy that combines KG embedding methods and network embedding models to iteratively learn the representations of apps, which further improves the performance of similar apps detection.
2.2 Knowledge Bases for Mobile Applications
Many interesting insights can be learned from data on application markets and aggregations of that data, which have attracted remarkable attention from academia and industry [11]. Drebin [12] provides a considerable number (5,560) of malware samples to the public, in which the specific malicious behaviors were identified and classified by a machine learning method. These samples were categorized into 149 families according to the malicious behaviors they contain. AndroZoo++ [13] is an ongoing effort to gather executable Android applications from as many sources as possible and make them available to analysts. In addition, its authors identified 20 types of app metadata to share with the research community. AndroVault [14] is a knowledge graph of information on over five million Android apps, crawled since 2013 from diverse sources including Google Play and F-Droid. AndroVault computes several attributes for each app from the downloaded Android application package, by which entities can be heuristically clustered and correlated.
AndroZoo++ mainly focuses on the scale of apps, and its goal in collecting apps is to share them with the research community. In contrast, Drebin and AndroVault are dedicated to providing abundant apps for malware detection. None of these knowledge bases pays much attention to similar apps detection for cybersecurity analysts, although this service is essential for analysts to supervise apps and guard against potential cybercriminals. To the best of our knowledge, our work is the first step towards employing knowledge graph techniques and a hybrid embedding strategy to tackle this problem.
[Figure 1: framework diagram. Labels: App Crawling, Knowledge Extraction, Knowledge Storage, KG Embedding, Network Embedding, query app, detect similar apps, app1, app2, ..., appn.]
Fig. 1. The framework of detecting similar mobile applications
3 Detecting Similar Mobile Applications
Figure 1 presents our framework for detecting similar apps. After extracting the metadata of apps from application markets and external resources, we construct a knowledge graph tailored for cybersecurity analysts, in which a lightweight ontology is defined to formalize the basic classes and relations. Thanks to this well-defined schema, more linkages can be shared among apps. To detect similar apps, the underlying idea is to divide the relations in the KG into structured and unstructured ones according to their content. We then employ the TextRank algorithm to extract important tokens from unstructured texts and transform them into structured triples. In this way, the representations of apps can be learned iteratively by combining KG embedding methods and network embedding models, which improves the performance of similar apps detection. Next, we describe each part in detail.
3.1 The Construction of Mobile Application Knowledge Graph
Ontology Definition. To model a well-defined schema of apps according to their sensitivity, we consulted analysts working at the China Academy of Industrial Internet and found that the vast majority of conceptualizations (e.g., function point, interaction mode) describing sensitive apps are not available online. Hence, we select appropriate terms based on a survey of existing conceptualizations and define a set of properties (or relations) with Protégé² to cover the sensitivity of apps.
[Figure 2: ontology diagram. Classes: App (with subclass Social App), Company, Developer, Language, Market. Instances linked by rdfs:type: Chinese, English, Japanese, Korean, Arabic (Language); Google Play, App Store (Market). Object properties: memberOf, copyright, develop, hasLanguage, publish, source. Data properties: description, function Point, download Times, updated Time, available State. Datatypes: Integer, DataTime, String, Boolean.]
Fig. 2. The overview of lightweight ontology
Figure 2 shows an overview of the lightweight ontology, where red and blue edges represent the subclassof and rdfs:type relations, respectively, which are the two basic relations. Green edges represent object properties, and violet edges represent data properties. Overall, we define 9 basic concepts and 21 properties in the ontology. Thanks to this well-defined schema, apps can present more comprehensive properties to analysts, and more shared linkages can be generated among apps.
² https://protege.stanford.edu/.
App Crawling. With the help of the Scrapy framework, we crawl the descriptive information of apps published in app markets. Note that we do not download application packages, because we focus on supervising sensitive apps rather than detecting malicious code in them. To this end, we consulted analysts and designed dozens of heuristic principles to guide the crawling of sensitive apps. Intuitively, the more principles an app triggers, the more sensitive it is. The four main principles are listed as follows.
– If the state of an app is not available (e.g., it has been taken off the shelf), it may be a sensitive app.
– If an app has been downloaded more than one thousand times, it may be a sensitive app.
– If the description of an app contains sensitive tokens (e.g., belle, lottery), it may be a sensitive app.
– If an app shares companies or developers with sensitive apps, it may be a sensitive app.
Knowledge Extraction. Crawling the data of apps from application markets is the most direct way to build the KG. However, the known labels often cover the values of the properties in our designed ontology inadequately, which impedes the discovery of shared linkages among apps. Therefore, we extract related web pages from external resources (e.g., Baidu Baike³ and Wikipedia⁴) to fill in the missing values of these properties. Due to space limitations, please refer to [15] for more details.
Knowledge Storage. After app crawling and knowledge extraction, we transform the results into triples {(h, r, t)} with Jena⁵. For knowledge storage, we employ AllegroGraph⁶, an efficient graph database that stores triples and supports SPARQL queries⁷ seamlessly. Thanks to SPARQL queries and the inference rules implied in the ontology, comprehensive information about apps can be presented to analysts. To keep the KG in sync with the evolving apps, we periodically update apps by crawling the above sources and record the update logs.
3.2 Similar Apps Detection Based on Hybrid Embedding Strategy
To better detect similar apps, we propose an iterative architecture with a hybrid embedding strategy, as shown in Fig. 3. Given the descriptive information of apps, we divide the relations stored in the KG into structured and unstructured ones according to their contents. Then, the TextRank algorithm is employed to extract important tokens from unstructured texts and transform them into structured triples. We then apply KG embedding methods [9] and network embedding (NE) models [10] successively to learn the representations of apps. The learning process is iterative: it terminates when the loss function converges or the number of iterations exceeds a preset value.
³ https://baike.baidu.com.
⁴ http://en.wikipedia.org/wiki/Wiki.
⁵ http://jena.apache.org/.
⁶ https://allegrograph.com/.
⁷ https://www.w3.org/2001/sw/wiki/SPARQL.
[Figure 3: flowchart. Labels: TextRanking, Triple Extraction, Structured Triples, KG Embedding (embeddings of objects: apps, relations, value), Network Embedding (embeddings of apps, value), decision "loss converge or beyond iteration number" (Y/N), Representation Strategy, Embeddings.]
Fig. 3. The architecture of similar apps detection based on a hybrid embedding strategy
Extracting Structured Triples from Unstructured Texts. Note that KG embedding methods and NE models cannot make use of the description texts of apps to enhance the potential correlations among apps. To address this problem, we extract the important tokens of app description texts with the TextRank algorithm [8], a graph-based ranking model for text processing. The corresponding formula is defined as follows.

$$Sim(S_i, S_j) = \frac{|\{t_k \mid t_k \in S_i \wedge t_k \in S_j\}|}{\log(N_i) + \log(N_j)} \tag{1}$$

where $S_i = \{t_i^1, t_i^2, \ldots, t_i^{N_i}\}$ and $S_j = \{t_j^1, t_j^2, \ldots, t_j^{N_j}\}$ are two sentences in the description text of one app, $N_i$ and $N_j$ are the numbers of tokens in $S_i$ and $S_j$, and $t_k$ is a token shared by the two sentences. After iteratively calculating the TextRank value of each token, we select several important tokens according to a threshold θ to represent the description text of this app. Then, we introduce a new relation relatedTo tailored for these tokens and generate new triples.
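The similarity in Eq. 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes sentences are already tokenized into lists, counts each shared token once, and returns 0.0 when the denominator is not positive (for example, two one-token sentences), a case the formula leaves undefined. The function name and sample tokens are hypothetical, and the ranking iteration and the threshold θ are not shown.

```python
import math


def sentence_similarity(s_i, s_j):
    """Eq. 1: number of shared tokens divided by log(N_i) + log(N_j)."""
    if not s_i or not s_j:
        return 0.0
    shared = set(s_i) & set(s_j)  # distinct tokens that occur in both sentences
    denom = math.log(len(s_i)) + math.log(len(s_j))
    # Assumption: a non-positive denominator yields similarity 0.0.
    return len(shared) / denom if denom > 0 else 0.0


# Hypothetical tokenized sentences from one app description.
s1 = ["free", "lottery", "game"]
s2 = ["lottery", "game", "online", "chat"]
print(sentence_similarity(s1, s2))  # 2 shared tokens over log(3) + log(4)
```

The logarithmic normalization keeps long sentences from dominating simply because they contain more tokens.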
Hybrid Embedding Strategy for Structured Triples. For structured triples, we combine KG embedding methods and network embedding models to iteratively learn the representations of apps. KG embedding aims to encode a relational knowledge graph into a low-dimensional continuous vector space and has achieved success on downstream tasks such as link prediction and triple classification [9]. Network embedding preserves the network structure and captures higher-order similarities among entities [10]. Although both are suitable for modeling structured triples, their merits differ. KG embedding learns the representations of entities and relations in the KG simultaneously, whereas network embedding sacrifices the semantics of edges to capture higher-order semantic similarities among entities.
Precisely, for a set of triples {(h, r, t)}, KG embedding methods are used to pre-train the vector representations of objects so that the semantics of relations are encoded to some extent. We then treat these pre-trained embeddings as the initial vectors for NE models so that fine-grained semantic representations of apps can be learned. To construct the network for NE models from the structured triples {(h, r, t)}, we treat h and t as two nodes $v_i$, $v_j$ and regard each relation r as an edge connecting them. A network G = (V, E) for NE models can then be built quickly, where V and E are the sets of nodes and edges, respectively. Because our framework is iterative, the representations of apps and their values learned by NE models can in turn be treated as inputs for KG embedding methods, which helps to adjust the vector representations of relations in the KG. Finally, the process terminates when the loss function converges or the number of iterations exceeds a preset value. The loss functions⁸ of KG embedding methods and NE models are defined in Eq. 2 and Eq. 3.

$$L_{KGE(k+1)} = \sum_{(h,r,t)\in\xi} \sum_{(h',r,t')\in\xi'} \left[ f_{KGE(k)}(h, r, t) - f_{KGE(k)}(h', r, t') + \gamma \right]_+ \tag{2}$$

where ξ is the set of structured triples of the KG, ξ' is the set of negative triples generated by negative sampling [16], γ is the margin, $[x]_+ = \max(x, 0)$, and $f_{KGE(k)}$ is the score function of the KG embedding method employed in the kth iteration. For the vector representations $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^d$ of each triple (h, r, t), $\mathbf{h}$ and $\mathbf{t}$ are replaced with $\mathbf{v}_{NE(k)}, \mathbf{u}_{NE(k)} \in \mathbb{R}^d$, the vector representations of nodes trained by NE models in the kth iteration, and $\mathbf{r}$ is replaced with $\mathbf{r}_k$, which is trained by KG embedding methods in the kth iteration.

$$L_{NE(k+1)} = \sum_{i \in V} \sum_{j \in N(v_i)} f_{NE(k)}(v_i, v_j) \tag{3}$$

where V is the set of nodes in the network for NE models, $N(v_i)$ is the set of out-neighbors of node $v_i$, and $f_{NE(k)}$ is the score function of the NE model employed in the kth iteration. For the vector representations $\mathbf{v}_i, \mathbf{v}_j \in \mathbb{R}^d$ of nodes $v_i, v_j \in V$, $\mathbf{v}_i$ and $\mathbf{v}_j$ are replaced with $\mathbf{h}_{KG(k)}, \mathbf{t}_{KG(k)} \in \mathbb{R}^d$, the vector representations of entities trained by KG embedding methods in the kth iteration.
Note that joint learning [17] of the above embedding techniques is an alternative architecture, including merging pre-training models (e.g., BERT [18]). Nevertheless, the experimental results indicate that KG embedding methods and pre-training models do not perform well for detecting similar apps (Sect. 4.2). Therefore, it may not be suitable to combine them with NE models for joint learning.
⁸ In this paper, the KG embedding methods are translation-based methods, and the NE models mainly consider the effects of the out-neighbors of nodes in the network.
The Representations of New Apps. With the help of the above embedding techniques, the similarities between apps can be calculated with the cosine measure. However, it is still challenging for KG embedding methods and NE models to obtain accurate embeddings for new apps, because these apps are not fed into the training process. Existing NE models utilize the related information (or entities) of these apps to calculate their similarity. Arithmetic mean [19] and property concatenation [20] are two common strategies for representing the embeddings of new apps. Nevertheless, these two strategies ignore the semantics of the properties in triples, as they assume that all related information and entities of a new app contribute equally. Hence, they may not reflect the true embedding representations of new apps. To address this problem, we further optimize the property concatenation strategy based on entropy. Intuitively, this strategy uses the values or entities of each property to measure the importance of that property. Given a new app $v_{n+1}$ and its related information (or entities) denoted by $\{(v_{n+1}, r_k, v_k)\}$, we formalize our strategy by Eq. 4 and Eq. 5.

$$\mathbf{v}_{n+1} = w_1 \mathbf{v}_1 \oplus w_2 \mathbf{v}_2 \oplus \cdots \oplus w_m \mathbf{v}_m, \quad \text{s.t. } v_1, v_2, \ldots, v_m \in V \tag{4}$$

$$w_i = \frac{H(p_i)}{\sum_{t=1}^{l} H(p_t)} \tag{5}$$

where $\mathbf{v}_{n+1}$ is the embedding representation of the new app, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m \in \mathbb{R}^d$ are the embedding representations of the related information (or entities) $v_1, v_2, \ldots, v_m$, which belong to the node set V of the network, ⊕ is the concatenation operation $\mathbb{R}^{a \times d} \oplus \mathbb{R}^{b \times d} \to \mathbb{R}^{(a+b) \times d}$, $w_i$ is a weight calculated from the entropies of all app properties, $H(p_i)$ is the entropy of all the values and entities of property $p_i$, and l is the number of properties.
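The entropy-weighted concatenation of Eq. 4 and Eq. 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: H(p_i) is taken to be the Shannon entropy of the values observed for property p_i in the training triples, uniform weights are used when all entropies are zero, and the property names, entities, and embedding values are made up.

```python
import math
from collections import Counter


def property_entropy(values):
    """Shannon entropy (natural log) of the observed values of one property."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def entropy_weights(values_by_property):
    """Eq. 5: w_i = H(p_i) / sum_t H(p_t)."""
    entropies = {p: property_entropy(v) for p, v in values_by_property.items()}
    normalizer = sum(entropies.values())
    if normalizer == 0:  # assumption: fall back to uniform weights
        return {p: 1.0 / len(entropies) for p in entropies}
    return {p: h / normalizer for p, h in entropies.items()}


def represent_new_app(related, embeddings, weights):
    """Eq. 4: stack the weighted embeddings of the related entities, one row each."""
    return [
        [weights[prop] * x for x in embeddings[entity]]
        for prop, entity in related
    ]


# Hypothetical training-set values per property and toy 2-dimensional embeddings.
values_by_property = {
    "develop": ["dev_a", "dev_b", "dev_c", "dev_d"],
    "hasLanguage": ["Chinese", "Chinese", "English", "English"],
}
weights = entropy_weights(values_by_property)
embeddings = {"dev_a": [1.0, 2.0], "Chinese": [3.0, 4.0]}
new_app = represent_new_app(
    [("develop", "dev_a"), ("hasLanguage", "Chinese")], embeddings, weights
)
print(weights)
print(new_app)
```

In this toy setting the developer property takes four distinct values while the language property takes only two, so the developer embedding receives the larger weight: a property whose values vary more across apps is treated as more discriminative.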
4 Evaluation
In this section, we report the statistics of the constructed KG for mobile applications, called MAKG, and verify the effectiveness of our proposed framework for detecting similar apps. Our approach⁹ can be downloaded together with the datasets. A technical report with more details of our algorithms and results can be downloaded from the same address.
4.1 Statistics of the Constructed Knowledge Graph
Table 1 lists the statistics of MAKG, in which more than 241 thousand apps are collected and divided into four categories: Tools, Social, News, and Newspaper & Magazine. The last column lists the total numbers of apps, relations, and entities in MAKG. Because all apps share the defined schema, the number of relations in each category is the same. Note that MAKG is a multilingual KG, because the names and descriptions of its apps are written in Chinese, English, Japanese, Korean, and Arabic.

Table 1. The detailed statistics of MAKG
Category    Tools     Social    News     Newspaper & Magazine   Total
Apps        129,730   70,948    36,325   4,422                  241,425
Relations   30        30        30       30                     30
Entities    235,333   146,598   70,386   4,369                  445,028
Table 2. Statistics of datasets for evaluation
Dataset    Train Apps   Train Nodes   Train Edges   Test Apps
MAKG-E     61773        126817        432411        100
MAKG-E+    61773        165838        628458        100
Datasets. To evaluate our method, we select apps with Chinese and English descriptions from MAKG and build a benchmark dataset named MAKG-E, listed in Table 2. For the test set, we invited several experienced analysts to select 100 apps that are disjoint from the training set; they simulate special apps that have become hotbeds for cybercriminals. For each special app, the analysts selected its 20 most similar apps in the training set as a standard set, chosen from candidate apps generated by the TF-IDF algorithm based on their textual descriptions. MAKG-E+ is an enhanced version that integrates the important tokens and corresponding relationships of apps extracted by the TextRank algorithm, in which the threshold θ for selecting important tokens is set to 0.5.
Metrics. Based on the built benchmarks, we introduce two metrics from the field of information retrieval to evaluate ranking methods. They are formally defined as follows.

$$RR = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{Rank_{ij}}$$

$$Rank_{min} = \frac{1}{n} \sum_{i=1}^{n} \min_{l} Rank_{il}$$

The first metric is the reciprocal rank, written RR, which is defined as the sum of the reciprocals of $Rank_{ij}$. $Rank_{ij}$ refers to the jth similar app, in descending order of similarity, returned for the ith tested app. If the jth similar app belongs to the standard set, then $Rank_{ij} = j$. Otherwise, the rank value is set to 0 and the corresponding term is omitted from the sum. The second metric, written $Rank_{min}$, is the minimum rank, in descending order of similarity, of a returned app that belongs to the standard set, averaged over the tested apps. The larger RR is, the closer the returned list is to the ideal one. Conversely, the smaller $Rank_{min}$ is, the earlier people can see similar apps. Note that, as the similar apps in the standard set are not unique, we do not employ AUC (Area Under Curve), one of the important indicators for evaluating classification models, as a metric.
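The two metrics can be sketched as follows, as a reading of the definitions above rather than the authors' evaluation script. Two points are assumptions: positions that miss the standard set contribute nothing to RR, and a tested app with no hit at all is scored one position past the end of its list (overridable through the hypothetical miss_rank parameter), since the text does not specify this case. The app identifiers are made up.

```python
def reciprocal_rank(ranked_lists, standard_sets):
    """RR: sum over tested apps of 1/j for every position j that hits the standard set."""
    total = 0.0
    for ranked, standard in zip(ranked_lists, standard_sets):
        for j, app in enumerate(ranked, start=1):
            if app in standard:
                total += 1.0 / j
    return total


def mean_min_rank(ranked_lists, standard_sets, miss_rank=None):
    """Rank_min: average over tested apps of the first position that hits the standard set."""
    ranks = []
    for ranked, standard in zip(ranked_lists, standard_sets):
        hits = [j for j, app in enumerate(ranked, start=1) if app in standard]
        if hits:
            ranks.append(min(hits))
        else:
            # Assumption: a list without any hit is scored one past its end.
            ranks.append(miss_rank if miss_rank is not None else len(ranked) + 1)
    return sum(ranks) / len(ranks)


# Two hypothetical tested apps with their ranked results and standard sets.
ranked_lists = [["a", "b", "c", "d"], ["x", "y", "z", "w"]]
standard_sets = [{"a", "c"}, {"z"}]
print(reciprocal_rank(ranked_lists, standard_sets))  # 1/1 + 1/3 for app 1, 1/3 for app 2
print(mean_min_rank(ranked_lists, standard_sets))    # (1 + 3) / 2
```

Note that RR is a sum over all tested apps rather than a mean, which is why it can exceed 1.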
4.2 Evaluation of Similar Apps Detection
Implementation Details. For structured triples, we train TransE [16], TransH [21], and TransD [22] on the OpenKE platform¹⁰ and obtain the vector representations of apps with the average strategy in [23]. The network embedding models DeepWalk [24], LINE [25], and Node2Vec [26] are implemented on the OpenNE platform¹¹. For our iterative framework, we employ TransE as the KG embedding method for pre-training the vector representations of apps and LINE as the NE model for learning the embeddings of apps and their values, because both are efficient for large-scale representation learning. The number of iterations in our framework is set to 3 by default. In addition, we employ a feature matching method (abbreviated as FM) and the pre-training model BERT [18] as baselines to verify the effectiveness of our framework. FM calculates the overlap of the entities related to apps based on Jaccard similarity. For BERT, we transform all the triples into textual descriptions and feed them into the model to detect apps similar to new ones. To ensure a fair comparison, we tune the hyperparameters (e.g., dimension, learning rate, number of negative samples) of the above models to obtain their best results.
⁹ https://github.com/zbyzby11/MAKG4Embedding.
¹⁰ https://github.com/thunlp/OpenKE.
The Evaluation Results. Table 3 lists the comparison results of different embedding methods in terms of RR and Rankmin. From the table, we can observe the following.
– Thanks to the TextRank algorithm applied in MAKG-E+, FM and NE-based models gain significant improvements over the original MAKG-E, because the important tokens and relations enrich the contexts of apps and help to capture more semantic correlations among apps.
– NE-based models (e.g., LINE, Node2Vec, our method) outperform other models on both datasets. This indicates that NE-based models can obtain fine-grained correlations among apps from the network constructed in these models. Note that our method is slightly better than LINE and Node2Vec because its representations of apps additionally encode the semantics of relations learned by TransE.
– KG embedding methods and BERT do not perform well for detecting similar apps. Our analysis is that the inherent characteristics of the constructed datasets (e.g., multilingual textual descriptions, insufficient triples) may hinder them from capturing the semantic correlations of apps. Besides, as the TextRank algorithm transforms descriptions into tokens that all share the same relation, it may lose the original semantics of sentences and affect the representations of entities. Consistent with this, the KG embedding methods perform worse on MAKG-E+ than on MAKG-E, while BERT remains far behind the other models on both datasets.
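The FM baseline mentioned in the implementation details (entity overlap scored by Jaccard similarity) can be sketched as below. This is a guess at its simplest form, not the authors' code: each app is assumed to be represented by the set of entities it is linked to in the KG, and the app names, entity names, and the function rank_by_feature_matching are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_feature_matching(query_entities, candidates):
    """Sort candidate apps by Jaccard similarity of their related entities (descending)."""
    scored = [(jaccard(query_entities, ents), app) for app, ents in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by app name
    return [(app, score) for score, app in scored]


# Hypothetical entity sets taken from triples of the form (app, relation, entity).
candidates = {
    "app1": {"CompanyA", "Chinese", "Social"},
    "app2": {"CompanyB", "English", "News"},
    "app3": {"CompanyA", "English", "Social"},
}
query = {"CompanyA", "Chinese", "Social", "lottery"}
ranking = rank_by_feature_matching(query, candidates)
print(ranking)
```

Such a baseline only rewards exact entity matches, which is consistent with the observation that it benefits from the extra relatedTo tokens in MAKG-E+.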
The Results of Different Representation Strategies of New Apps. Table 4 and Table 5 show the results of different representation strategies for new apps. Overall, NE-based models with the entropy-based concatenation strategy perform better than those with the original concatenation strategy or the arithmetic mean. This indicates that our optimized concatenation strategy can solve the representation problem of new apps to some extent. Our method obtains better performance than other models in terms of RR and Rankmin because the fine-grained semantics of relations are encoded by the hybrid embedding strategy in our iterative framework. Nevertheless, Node2Vec equipped with our strategy does not perform satisfactorily on MAKG-E+. We find that the test apps for which the arithmetic mean gives good results differ from those for which our strategy does. It therefore makes sense to combine these two strategies for detecting similar apps in our future work.
¹¹ https://github.com/thunlp/OpenNE.

Table 3. Comparison results in terms of RR and Rankmin
Methods      MAKG-E RR   MAKG-E Rankmin   MAKG-E+ RR   MAKG-E+ Rankmin
FM           93.60       5.14             165.10       2.37
BERT         0.92        20.47            3.03         20.00
TransE       22.83       17.30            12.10        18.44
TransH       22.86       17.38            11.90        18.51
TransD       23.15       17.31            11.35        18.86
DeepWalk     83.55       9.07             117.68       4.18
LINE         99.38       5.05             188.93       2.17
Node2Vec     98.28       6.24             173.13       2.21
Our method   100.40      5.98             190.80       1.87

Table 4. Comparison results of different representation strategies in terms of RR
Table 5. Comparison results of different representation strategies in terms of Rankmin
5 Conclusion and Future Work
In this paper, we presented a novel framework that uses KG techniques and a hybrid embedding strategy to help cybersecurity analysts find similar apps. We designed a lightweight ontology for KG construction, which generates more shared linkages among apps. Moreover, KG embedding methods and network embedding models are combined to iteratively learn the representations of apps and improve performance. Preliminary results indicate the effectiveness of our approach compared with several existing methods. In future work, we will collect more apps to enrich MAKG and employ techniques for aligning multilingual texts, which should help to improve the performance of similar apps detection.
Acknowledgements. This work was partially supported by the Natural Science Foundation of China grants (U1736204, 61906037), the National 242 Information Security Plan grant (6909001165).
References
1. Meng, G., Patrick, M., Xue, Y., Liu, Y., Zhang, J.: Securing Android app markets via modeling and predicting malware spread between markets. IEEE Trans. Inf. Forensics Secur. 14(7), 1944–1959 (2019)
2. Chen, N., Hoi, S.C., Li, S., Xiao, X.: SimApp: a framework for detecting similar mobile applications by online kernel learning. In: WSDM, pp. 305–314 (2015)
3. Bhandari, U., Sugiyama, K., Datta, A., Jindal, R.: Serendipitous recommendation for mobile apps using item-item similarity graph. In: Banchs, R.E., Silvestri, F., Liu, T.-Y., Zhang, M., Gao, S., Lang, J. (eds.) AIRS 2013. LNCS, vol. 8281, pp. 440–451. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45068-6_38
4. Yin, P., Luo, P., Lee, W.-C., Wang, M.: App recommendation: a contest between satisfaction and temptation. In: WSDM, pp. 395–404 (2013)
5. Park, D.H., Liu, M., Zhai, C., Wang, H.: Leveraging user reviews to improve accuracy for mobile app retrieval. In: SIGIR, pp. 533–542 (2015)
6. Lin, J., Sugiyama, K., Kan, M.-Y., Chua, T.-S.: Scrutinizing mobile app recommendation: identifying important app-related indicators. In: Ma, S., et al. (eds.) AIRS 2016. LNCS, vol. 9994, pp. 197–211. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48051-0_15
7. Al-Subaihin, A., Sarro, F., Black, S., Capra, L.: Empirical comparison of text-based mobile apps similarity measurement techniques. Empirical Softw. Eng. 24(6), 3290–3315 (2019). https://doi.org/10.1007/s10664-019-09726-5
8. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: EMNLP, pp. 404–411 (2004)
9. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017)
10. Cui, P., Wang, X., Pei, J., Zhu, W.: A survey on network embedding. IEEE Trans. Knowl. Data Eng. 31(5), 833–852 (2019)
11. Geiger, F.-X., Malavolta, I.: Datasets of Android applications: a literature review. CoRR, abs/1809.10069 (2018)
12. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: DREBIN: effective and explainable detection of Android malware in your pocket. In: NDSS (2014)
13. Li, L., et al.: AndroZoo++: collecting millions of Android apps and their metadata for the research community. CoRR, abs/1709.05281 (2017)
14. Meng, G., Xue, Y., Siow, J.K., Su, T., Narayanan, A., Liu, Y.: AndroVault: constructing knowledge graph from millions of Android apps for automated analysis. CoRR, abs/1711.07451 (2017)
15. Niu, X., Sun, X., Wang, H., Rong, S., Qi, G., Yu, Y.: Zhishi.me - weaving Chinese linking open data. In: Aroyo, L., et al. (eds.) ISWC 2011. LNCS, vol. 7032, pp. 205–220. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25093-4_14
16. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
17. Gao, Y., Yue, X., Huang, H., Liu, Q., Wei, L., Liu, L.: Jointly learning topics in sentence embedding for document summarization. IEEE Trans. Knowl. Data Eng. 32(4), 688–699 (2020)
18. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
19. Tang, J., Qu, M., Mei, Q.: PTE: predictive text embedding through large-scale heterogeneous text networks. In: SIGKDD, pp. 1165–1174 (2015)
20. Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., Lee, D.L.: Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In: SIGKDD, pp. 839–848 (2018)
21. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: AAAI, pp. 1112–1119 (2014)
22. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: ACL, pp. 687–696 (2015)
23. Wang, M., Wang, R., Liu, J., Chen, Y., Zhang, L., Qi, G.: Towards empty answers in SPARQL: approximating querying with RDF embedding. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11136, pp. 513–529. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00671-6_30
24. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: SIGKDD, pp. 701–710 (2014)
25. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: WWW, pp. 1067–1077 (2015)
26. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: SIGKDD, pp. 855–864 (2016)
CMeIE: Construction and Evaluation of Chinese Medical Information Extraction Dataset
Tongfeng Guan1,2, Hongying Zan1,2(B), Xiabing Zhou3(B), Hongfei Xu4,5, and Kunli Zhang1,2
1 School of Information Engineering, Zhengzhou University, Zhengzhou, China [email protected]
2 Peng Cheng Laboratory, Shenzhen, China
3 School of Computer Science and Technology, Soochow University, Suzhou, China [email protected]
4 Saarland University, Saarland, Germany
5 German Research Center for Artificial Intelligence, Saarland, Germany
Abstract. In this paper, we present the Chinese Medical Information Extraction (CMeIE) dataset, consisting of 28,008 sentences, 85,282 triplets, 11 entities, and 44 relations derived from medical textbooks and clinical practices, constructed by several rounds of manual annotation. Additionally, we evaluate the performance of the most recent state-of-the-art frameworks and pre-trained language models for the joint extraction of entities and relations on the CMeIE dataset. Experimental results show that even these most advanced models still have large room for improvement on our dataset; currently, the best F1 score on the dataset is 58.44%. Our analysis points out several challenges and multiple potential future research directions for this task in the medical domain.
Keywords: Chinese Medical Information Extraction · Joint extraction of entities and relations · Pre-trained language models
1 Introduction
The Medical Knowledge Graph stores a large number of structured medical facts, most of which consist of two medical entities connected by a semantic relation. These facts are represented by relational triplets in the form of , such as . The medical facts are normally derived from unstructured or semi-structured medical texts by entity recognition and relation extraction approaches.
Entity and relation extraction aims to extract structured information from unstructured medical texts and consists of two steps: entity extraction and relation extraction. In the medical field, entities usually refer to nominal phrases in the text, e.g., diseases, drugs, treatments, symptoms, causes, and risk factors. From the viewpoint of Natural Language Processing (NLP), entity extraction is usually regarded as a Named Entity Recognition (NER) [18] task, which extracts the aforementioned medical entities from raw medical texts. Relation extraction [26] judges the semantic relations (such as clinical sign, complication, auxiliary examination, and drug therapy) of entity pairs.
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 270–282, 2020. https://doi.org/10.1007/978-3-030-60450-9_22
Compared with general texts such as Wikipedia or Baidu Baike, unstructured medical texts are characterized by complex types and a high density of entity relations, distinct reference situations, and continuous descriptions of a specific relation (such as symptom, treatment, and examination). They consist of passages from medical textbooks [24], disease topics in clinical practice [15], and parts of electronic medical record data [8]. As shown in Fig. 1, we divide the sentences into three types according to their degree of triplet overlap: Normal, EntityPairOverlap (EPO), and SingleEntityOverlap (SEO). Thus, the joint entity and relation extraction task in the medical field requires prior medical knowledge and a special design of the model structure.
Fig. 1. Medical triplets overlapping examples
Until now, most benchmark datasets for the joint extraction of entities and relations have been in English, such as NYT [17] and WebNLG [6]. To the best of our knowledge, there is no such dataset specialized in the Chinese medical field. Therefore, we present a Chinese Medical Information Extraction (CMeIE) dataset in this paper. Our contributions are as follows:
– We collect medical texts from multiple sources, refer to annotation schemas for named entities and relations in medical texts [2,27,28], and annotate a Chinese medical information extraction corpus using the labeling platform developed by Zhang et al. [32].
T. Guan et al.
– We employ the most recent state-of-the-art frameworks and pre-trained language models for the joint extraction of entities and relations to perform detailed evaluations on our dataset.
We suggest that our work provides both a dataset and baselines for future research on Chinese medical entity and relation extraction. Our trained models can also serve as a tool for Chinese medical entity and relation extraction.
2 Related Work
Early work on bio-entity name recognition in biomedical texts and electronic medical records can be categorized into rule-based approaches and machine learning approaches (Hanisch et al. [8]; Savova et al. [18]). The 2010 Informatics for Integrating Biology and the Bedside (i2b2) challenge [23] introduces the relation classification task, which focuses on assigning relation types that hold between medical problems, tests, and treatments. To extract the temporal relations between pairs of events or time expressions in clinical notes, Nikfarjam et al. [15] design separate extraction components for different types of temporal relations, which utilize both machine learning and graph-based inference. Yang et al. [26] propose a hybrid temporal relation extraction approach that combines patient-record-specific rules and the Conditional Random Field (CRF) model for the processing of patient records. Seol et al. [19] propose methods that extract clinical events related to the patient using CRF, extract relations between events using a Support Vector Machine (SVM), and extract an event causality pattern list in a semantic unit of events.
To extract both entities and relations, early work follows a pipeline (Zelenko et al. [29]; Zhou et al. [7]), which first conducts NER and then performs relation classification for each entity pair. However, the pipeline ignores the interaction and dependencies between the two sub-tasks. Thus, its performance is likely to be affected by error propagation (Li and Ji [11]). To address this issue, subsequent work attempts to extract entities and relations jointly. Li and Ji [11] and Miwa and Sasaki [14] propose to bridge the two sub-tasks using several elaborate features. Zeng et al. [30] and Zheng et al. [34] propose neural network-based extraction frameworks. However, as discussed, a sentence in medical texts may contain multiple triplets that overlap each other, as illustrated in Fig. 1.
Conventional sequence tagging schemes, which assume that each token bears only one tag (Zheng et al. [34]), have difficulty tagging an entity with multiple types when the entity is overlapped. This brings significant challenges to relation classification approaches in which each entity pair is assumed to be involved in only one relation (Katiyar and Cardie [10]). To address this problem, Zeng et al. [31] propose a sequence-to-sequence model with a copy mechanism. Fu, Li and Ma [5] present GraphRel, an end-to-end relation extraction model based on the Graph Convolutional Network (GCN). Wei et al. [25] propose a novel cascade binary tagging framework (CasRel), derived from a principled problem formulation, which employs the pre-trained BERT as its encoder. Their framework models relations as functions that map subjects to objects in a sentence.
3 CMeIE Dataset
In this section, we describe the process for the construction of CMeIE. The whole procedure can be divided into three steps: 1) setting up the annotation standard; 2) machine-aided human annotation; 3) corpus processing.
3.1 Annotation Schema
To ensure the authority and practicability of the corpus, we select the medical textbooks Pediatrics [24] and Clinical Pediatrics [20], together with clinical practice texts, as the corpus to annotate. Medical textbooks, which are highly authoritative and reliable, are compiled by professional doctors under the guidance of the Ministry of Health. Clinical practice texts, which are characterized by a standard structure, rich content, and timely updates, are based on specific clinical situations and are systematically formulated to help clinicians and patients make appropriate treatment decisions. To design a reasonable medical entity concept and entity-relation classification system, we refer to authoritative Chinese and international standard medical terminologies, including ICD-10 [22], ATC [1], MeSH [12], the Medical Insurance Drug Catalog, and the Treatment Project Catalog. By pre-annotating and analyzing a part of the corpus, we collaborate with medical experts to settle the standards for the annotation of the corpus.
3.2 Machine-Aided Human Annotation
We select 627 common diseases and pediatric diseases for annotation. Each annotated text is composed of titles, subheadings, and paragraph-level information that describe the characteristics of a specific disease. Compared with medical textbooks, clinical practice texts have distinct reference relations and usually do not directly mention the subject disease in the paragraph. Thus, we pre-process the clinical practice corpus: we design rules to add the subject disease entity before each sentence and separate the subject from the original text with the special indicator "@". The relation between medical entities usually holds across sentences, sometimes even across paragraphs, so we annotate it at the chapter level, and we segment chapters into sentences when constructing the dataset. Consider a triplet Tj = <s, r, o>, where s is the subject, o is the object, and r is the semantic relation of the entity pair. When s and o both belong to a single sentence, we take the sentence and all triplets belonging to that sentence as one training sample. For cross-sentence expressions, i.e., when s and o come from two sentences, we concatenate the two sentences involved and take the concatenation and all related triplets as one training sample {xi + xj : <s, r, o>}.
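The sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names `add_subject_prefix` and `build_samples`, the substring-based entity lookup, and the English example sentences are all hypothetical.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def add_subject_prefix(disease: str, sentence: str) -> str:
    """Prepend the subject disease, separated by the '@' indicator."""
    return f"{disease}@{sentence}"


def build_samples(sentences: List[str], triplets: List[Triplet],
                  joiner: str = "") -> List[Dict]:
    """Group triplets by the sentence, or pair of sentences, that holds them.

    Assumption: entities are located by plain substring search, and the
    first matching sentence is used.
    """
    grouped: Dict[Tuple[int, ...], List[Triplet]] = {}
    for s, r, o in triplets:
        # Prefer a single sentence that contains both entities.
        both = [i for i, sent in enumerate(sentences) if s in sent and o in sent]
        if both:
            key: Tuple[int, ...] = (both[0],)
        else:
            s_idx = next((i for i, sent in enumerate(sentences) if s in sent), None)
            o_idx = next((i for i, sent in enumerate(sentences) if o in sent), None)
            if s_idx is None or o_idx is None:
                continue  # entity not found; skipped in this sketch
            key = tuple(sorted((s_idx, o_idx)))  # cross-sentence case
        grouped.setdefault(key, []).append((s, r, o))
    return [
        {"text": joiner.join(sentences[i] for i in key), "triplets": grouped[key]}
        for key in sorted(grouped)
    ]


# Hypothetical example (English stand-ins for Chinese text).
sentences = ["Asthma often presents with wheezing.",
             "Salbutamol is commonly used for relief."]
triplets = [("Asthma", "clinical manifestation", "wheezing"),
            ("Asthma", "drug therapy", "Salbutamol")]
samples = build_samples(sentences, triplets, joiner=" ")
```

Here the first triplet yields a single-sentence sample, while the second yields a sample whose text is the concatenation of both sentences.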
Table 1. Statistics of the CMeIE dataset
Category                  Train   Validation   Test
# relations               44      44           44
# sentences               17924   4482         5602
# tuples                  54286   13484        17512
# entity overlap type
  Normal                  6931    1718         2116
  EPO                     1572    197          268
  SEO                     10993   2764         3486
# tuples in a sentence
  1                       6713    1663         2036
  2                       3711    962          1147
  3                       2304    583          699
  4                       1635    396          494
  ≥5                      3561    878          1223
For machine-aided annotation, we first use the collected entity resource library and the bi-directional maximum matching algorithm to annotate the corpus automatically. To ensure consistency and accuracy, we iteratively annotate the corpus for multiple rounds, and each text is annotated by at least two annotators. If the two annotators disagree on a text, medical experts make the decision, and the revised version becomes the final annotated version. To filter the dataset, we collect a table of the triplet schemas of the dataset. Each schema consists of four elements (stype, otype, r, num), where stype is the subject type, otype is the object type, r is the semantic relation, and num is the number of occurrences of the relation. We remove the relations that occur fewer than 50 times and merge some similar sub-relations. For example, we combine clinical symptom and clinical sign into clinical manifestation. The Inter-Annotator Agreement (IAA) [9] is applied to measure the consistency of the annotated corpus between annotators. We use the F1 value [16] as the metric to evaluate the consistency of relation annotation. The triplet annotations of two annotators are regarded as consistent only when the relation type and the two entities are the same. The consistency rate of the CMeIE dataset is 0.82, which supports the reliability of the constructed dataset.
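As an illustration of the automatic pre-annotation step, the following is a minimal sketch of bi-directional maximum matching against an entity lexicon. The paper does not specify how ties between the forward and backward segmentations are resolved; the rule used here (fewer tokens, then fewer single-character tokens) is a common heuristic and an assumption, as are all function names.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy left-to-right matching, longest lexicon entry first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # fall back to one character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy right-to-left matching, longest lexicon entry first."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text: str, lexicon: Set[str]) -> List[str]:
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def cost(tokens: List[str]) -> Tuple[int, int]:
        # Assumed heuristic: fewer tokens, then fewer single characters.
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) <= cost(bwd) else bwd


def pre_annotate(text: str, lexicon: Set[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, mention) spans for tokens found in the lexicon."""
    spans, pos = [], 0
    for tok in bidirectional_max_match(text, lexicon):
        if tok in lexicon:
            spans.append((pos, pos + len(tok), tok))
        pos += len(tok)
    return spans
```

With the toy lexicon `{"ab", "abc", "cd"}` and the text `"abcd"`, forward matching gives `["abc", "d"]` and backward matching gives `["ab", "cd"]`; the heuristic prefers the latter because it contains no single-character token. Spans produced this way would only be candidates, to be corrected by the human annotators.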
3.3 Dataset Statistics
The statistics of the dataset are shown in Table 1. The CMeIE dataset, which centers on disease descriptions, consists of 28,008 sentences, 85,282 triplets, 11 entity types, and 44 relations. The average number of tokens in each sentence is 85.53, and the maximum length is 300.
For the joint entity and relation extraction task, we separate the dataset into three parts: 17,924 sentences as the training set, 4,482 as the validation set, and 5,602 as the test set. Note that a sentence can belong to both the EPO class and the SEO class.
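To make the three overlap classes concrete, the sketch below assigns overlap types to the triplets of one sentence. It is an illustration under assumptions: the function name `overlap_types` is hypothetical, and entity pairs are compared without regard to subject/object order, which the paper does not specify.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def overlap_types(triplets: Iterable[Triplet]) -> Set[str]:
    """Return the overlap classes of a sentence: Normal, or EPO and/or SEO."""
    types: Set[str] = set()
    # Compare every pair of distinct triplets in the sentence.
    for (s1, _, o1), (s2, _, o2) in combinations(sorted(set(triplets)), 2):
        pair1, pair2 = {s1, o1}, {s2, o2}
        if pair1 == pair2:
            types.add("EPO")   # same entity pair, different relation
        elif pair1 & pair2:
            types.add("SEO")   # exactly one shared entity
    return types or {"Normal"}
```

A sentence with triplets (A, r1, B), (A, r2, B), and (A, r3, C) falls into both EPO and SEO. This is consistent with Table 1, where the Normal, EPO, and SEO counts of the training set sum to 19,496, which exceeds the 17,924 training sentences.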
4 Chinese Medical Entity and Relation Extraction
The goal of joint entity and relation extraction is to identify all possible triplets in a sentence.
4.1 The Objective for Overlapping Entities
Due to the high density of triplets and overlapping entities in medical texts, we introduce the idea of the probability graph [25] to design the training objective at the triplet level, and then decompose it down to the entity and relation levels. Given the training set $D$ and the set of all possible relations $R$, the framework aims to maximize the data likelihood of $D$:
$$J(\Theta) = \sum_{j=1}^{|D|} \Big[ \sum_{s \in T_j} \log p_\theta(s \mid x_j) + \sum_{r \in T_j|s} \log p_r(o \mid s, x_j) + \sum_{r \in R \setminus T_j|s} \log p_r(o_\varnothing \mid s, x_j) \Big] \tag{1}$$
where the parameters are $\Theta = \{\theta, r\}$, $r \in R$. $x_j$ is a sentence from the training set $D$, and $T_j = \{(s, r, o)\}$ is the set of potentially overlapping triplets belonging to $x_j$. $s \in T_j$ denotes a subject appearing in the triplets in $T_j$. $p_\theta(s \mid x_j)$ is the probability of extracting the subject $s$ from the given $x_j$. $r \in T_j|s$ ranges over the semantic relations led by the subject $s$. $p_r(o \mid s, x_j)$ is the probability of the object $o$ given the sentence representation $x_j$ and the subject $s$. $r \in R \setminus T_j|s$ ranges over all relations except those led by $s$. $o_\varnothing$ is the "null" object, and $p_r(o_\varnothing \mid s, x_j)$ is the probability that no object is connected to the subject $s$ by the relation $r$. We formulate relations as functions $o = r(s)$ that map subjects to objects in a sentence, which handles the overlapping triplet problem by independently modeling the subject tagger $p_\theta(s \mid x_j)$ and the object tagger $p_r(o \mid s, x_j)$. To extract multiple triplets at once, we divide entity and relation extraction into two steps. First, the model detects subjects from the input sentence. Then, for each candidate subject, it checks all possible relations to see whether a relation can associate objects in the sentence with that subject.
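The two-step procedure above can be sketched as a decoding loop. This is a simplified illustration, not the evaluated implementation: the taggers are passed in as hypothetical callables that return per-token start and end probabilities, and both the 0.5 threshold and the rule that pairs each start with the nearest following end are assumptions that the text does not state.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)
Probs = Tuple[Sequence[float], Sequence[float]]  # (start_probs, end_probs)


def decode_spans(start_probs: Sequence[float], end_probs: Sequence[float],
                 threshold: float = 0.5) -> List[Span]:
    """Pair each start above the threshold with the nearest end at or after it."""
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        for j in range(i, len(end_probs)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


def extract_triplets(tokens: List[str],
                     subject_tagger: Callable[[List[str]], Probs],
                     object_taggers: Dict[str, Callable[[List[str], Span], Probs]],
                     threshold: float = 0.5) -> List[Tuple[Span, str, Span]]:
    """Step 1: detect subjects. Step 2: for each subject, query every relation."""
    triplets = []
    s_start, s_end = subject_tagger(tokens)
    for subj in decode_spans(s_start, s_end, threshold):
        for relation, tagger in object_taggers.items():
            o_start, o_end = tagger(tokens, subj)
            # An empty span list plays the role of the "null" object.
            for obj in decode_spans(o_start, o_end, threshold):
                triplets.append((subj, relation, obj))
    return triplets


# Toy taggers with hard-coded outputs, for illustration only.
tokens = ["asthma", "causes", "wheezing", "and", "cough"]


def toy_subject_tagger(toks):
    return [0.9, 0.0, 0.0, 0.0, 0.0], [0.9, 0.0, 0.0, 0.0, 0.0]


def toy_manifestation_tagger(toks, subj):
    if subj == (0, 0):
        return [0.0, 0.0, 0.8, 0.0, 0.7], [0.0, 0.0, 0.8, 0.0, 0.7]
    return [0.0] * len(toks), [0.0] * len(toks)


def toy_null_tagger(toks, subj):
    return [0.0] * len(toks), [0.0] * len(toks)


result = extract_triplets(
    tokens, toy_subject_tagger,
    {"clinical manifestation": toy_manifestation_tagger,
     "drug therapy": toy_null_tagger},
)
```

Because every relation has its own object tagger, one subject can be paired with several objects and several relations, which is how overlapping triplets are produced.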
4.2 Models
To perform the joint extraction of entities and relations, we evaluate two frameworks: 1) a Lattice LSTM [33] encoder with a transition-based network; 2) CasRel [25] built upon different pre-trained encoders.
Lattice LSTM is based on the character-level LSTM framework. The embedding layer first looks up the character embeddings and word embeddings of the text. Then the word information obtained by matching each character in the input word sequence is added to the character representations to enrich the character embeddings during the encoding of the input sentence. The transition-based neural network maps the word sequence to an action sequence, so that the extraction of named entities and relations is transformed into the generation of a transition action sequence. Each state represents an intermediate result, and the next transition action is predicted based on the current state. Transition actions continuously consume the input while generating the output, until the model reaches the end condition.
The hierarchical binary tagging framework is composed of two parts: the encoder module and the hierarchical decoder module. The encoder module uses the pre-trained model to extract feature information from a sentence, which is fed into the subsequent tagging modules. The hierarchical decoder consists of two modules: a subject tagger and a set of relation-specific object taggers. The low-level tagging module, the subject tagger, is designed to recognize all possible subjects in the input sentence. The high-level tagging module simultaneously identifies the objects and the involved relations with respect to the subjects obtained at the lower level.
4.3 Pre-trained Models
We investigate the effects of several popular pre-trained models as encoders of the hierarchical binary tagging framework. All pre-trained models are publicly available Chinese versions.
BERT. The pre-trained language model BERT (Devlin et al. [4]) stands for Bidirectional Encoder Representations from Transformers (see footnote 1). Fine-tuning the pre-trained BERT has established state-of-the-art performance in a wide range of tasks.
BERT-WWM. Cui et al. [3] adopt whole word masking to change the training sample generation strategy in Chinese BERT (see footnote 2). Whole word masking forces the model to recover the whole word in the Masked Language Model (MLM) pre-training task, instead of just recovering word pieces, which is more challenging.
RoBERTa-WWM. Liu et al. [13] present a replication study of BERT pre-training that carefully measures the impact of key hyper-parameters and the training data size (see footnote 2). They find that BERT is significantly undertrained and can match or exceed the performance of many models published after it.
1 https://github.com/google-research/bert
2 https://github.com/ymcui/Chinese-BERT-wwm
Table 2. Results of different methods on the CMeIE dataset
                        Validation              Test
Models                  P      R      F1        P      R      F1
Lattice LSTM+Trans      88.62  16.20  27.39     87.54  15.86  26.86
CasRel_ERNIE            59.40  51.64  55.25     56.78  50.76  53.60
CasRel_BERT             63.06  56.94  59.83     60.61  55.09  57.72
CasRel_BERT-wwm         62.29  56.55  59.28     60.80  55.02  57.76
CasRel_RoBERTa-wwm      63.67  56.71  59.96     60.45  56.57  58.44
ERNIE. Sun et al. [21] present a language representation model enhanced by knowledge, namely ERNIE (see footnote 3). ERNIE is designed to learn language representations enhanced by knowledge masking strategies, including entity-level masking and phrase-level masking. ERNIE adopts a heterogeneous corpus for pre-training, which contains Chinese Wikipedia, Baidu Baike, Baidu news, and Baidu Tieba.
5 Experiment and Analysis
To verify the performance of the approaches described in Sect. 4 and to establish baselines for future research based on CMeIE, we conducted Chinese medical entity and relation extraction experiments on our CMeIE dataset. We used the standard Precision (P), Recall (R), and F1 score as the evaluation metrics. An extracted relational triple is considered correct only if the relation and the heads of both the subject and the object all match the corresponding ground truth. Hyper-parameters of each model were selected via a grid search on the validation set.
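The evaluation metrics above can be computed as in the following sketch. It assumes micro-averaging over all sentences and represents each triplet as a (subject, relation, object) tuple compared by exact equality; how the heads of entities are obtained is left out, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: List[Iterable[Triplet]],
             gold: List[Iterable[Triplet]]) -> Tuple[float, float, float]:
    """Micro-averaged P, R, F1 over per-sentence triplet sets."""
    n_pred = n_gold = n_correct = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        n_pred += len(pred_set)
        n_gold += len(ref_set)
        n_correct += len(pred_set & ref_set)  # exact triplet matches
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

As a check of the formula, P = 88.62 and R = 16.20 give F1 ≈ 27.39, which matches the validation row of Lattice LSTM+Trans in Table 2.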
5.1 Results
Results of different models for the joint extraction of entities and relations on the CMeIE dataset are shown in Table 2; all results are averages over three experiments. Table 2 shows that all CasRel variants outperform the Lattice LSTM encoder with the transition-based neural network. We suppose that possible reasons are: 1) the word embeddings in the input Lattice LSTM layer are obtained by matching words to the word sequence, and the general word embedding space differs from the word segmentation and word meanings in the medical field; 2) the Lattice LSTM encoder does not involve any knowledge from pre-training on a large dataset. Among the pre-trained models in the CasRel framework, CasRel_ERNIE surprisingly has the worst performance. We conjecture that this is because, compared with Google's Chinese BERT, which is trained on formal texts, ERNIE's training data adds informal texts such as Baidu Tieba. There are significant semantic differences between Tieba and medical textbooks and clinical practice. The CasRel_RoBERTa-wwm model performs best in terms of F1, surpassing both CasRel_BERT and CasRel_BERT-wwm.
3 https://github.com/PaddlePaddle/ERNIE
Fig. 2. F1-scores of extracting from different types of sentences
Table 3. Results of relational triple element prediction
Element   P      R      F1
          71.15  63.28  66.99
          68.68  62.24  65.30
          64.95  60.10  62.43
          60.45  56.57  58.44
5.2 Analysis
To further study the models' capability of extracting overlapping relational triplets, we additionally conduct two experiments on different types of sentences. Figure 2(a) shows the detailed results on the three overlapping patterns. The performance of the Lattice LSTM model on Normal, EPO, and SEO shows a decreasing trend, reflecting the increasing difficulty of extracting relational triplets from sentences with different overlapping patterns. In contrast, the CasRel framework consistently obtains good performance over all three overlapping patterns, especially on the hard patterns. Possible reasons are: 1) the CasRel framework introduces a two-layer decoder to achieve independent decoding of the subject and the object, and thus extracts more overlapping entities; 2) the pre-trained model provides prior knowledge. To compare the models' ability to extract relational triplets from sentences containing different numbers of triplets, we split the sentences into five classes, as shown in Fig. 2(b). The performance comparison of the models is similar to that in Fig. 2(a).
To explore the factors that affect the accuracy of triplet extraction on the CMeIE dataset, we examine the performance of the CasRel_RoBERTa-wwm model on predicting different tuples of the triplet, i.e., the two pairs that contain the relation and the subject-object pair, as shown in Table 3. The performance on the two pairs that contain the relation is higher than that on the subject-object pair, demonstrating the effectiveness of the framework in identifying the relation.
Table 4. Examples from relation "Department". Blue for subject, Red for object.
Relation    Reasoning          Sentence
Department  Consultation       Laryngeal cancer@ hoarseness, dysphonia, ... or lymphadenopathy for more than three weeks is its early manifestation and requires ENT consultation.
Department  Referral           Osteoarthritis@ Need to be referred to spine orthopedics for spinal cord decompression.
Department  Logical reasoning  Squamous cell carcinoma of the skin@ * There is currently a debate about the potential malignancy of keratoacanthoma, but most dermatologists tend to remove the tumor because...
We suggest that, in the CMeIE dataset, the most challenging characteristic is the diversity in expressing the same relation, and some examples are shown in Table 4. The department relation can be obtained through reasoning such as consultation, referral, and the department doctor’s opinion. Based on our analysis, we suggest that future medical relation extraction research may study integrating commonsense medical knowledge or improving the architecture of the extraction model.
6 Conclusion
In this paper, we construct CMeIE, a large, high-quality dataset for the entity and relation extraction task in the medical field. We adopt the most recent state-of-the-art frameworks and pre-trained language models for the joint extraction of entities and relations in the Chinese medical domain, and we offer several trained models as part of a toolkit for Chinese medical entity and relation extraction. Experimental results show that even the most advanced models leave substantial room for improvement on the CMeIE dataset. Based on our analysis, we suggest that future research may integrate commonsense medical knowledge or improve the current extraction models.
Acknowledgements. We thank the anonymous reviewers for their hard work and insightful suggestions. This work is supported by the National Key Research and Development Project (Grant No. 2017YFB1002101), the Science and Technique Program of Henan Province (Grant No. 192102210260), and the Medical Science and Technique Program Co-sponsored by Henan Province and Ministry (Grant No. SB201901021). Hongfei Xu acknowledges the support of the China Scholarship Council ([2018]3101, 201807040056).
References
1. WATC: Anatomical therapeutic chemical classification system. WHO Collaborating Center for Drug Statistics (2009)
2. Byambasuren, O., et al.: Preliminary study on the construction of Chinese medical knowledge graph. J. Chin. Inf. Process. 33(10), 1–7 (2019)
3. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
5. Fu, T.J., Li, P.H., Ma, W.Y.: GraphRel: modeling text as relational graphs for joint entity and relation extraction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1409–1418 (2019)
6. Gardent, C., Shimorina, A., Narayan, S., Perez-Beltrachini, L.: Creating training corpora for NLG micro-planners. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 179–188 (2017)
7. GuoDong, Z., Jian, S., Jie, Z., Min, Z.: Exploring various knowledge in relation extraction. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 427–434. Association for Computational Linguistics (2005)
8. Hanisch, D., Fundel, K., Mevissen, H.T., Zimmer, R., Fluck, J.: ProMiner: rule-based protein and gene entity recognition. BMC Bioinform. 6(1), S14 (2005)
9. Hripcsak, G., Rothschild, A.S.: Agreement, the f-measure, and reliability in information retrieval. J. Am. Med. Inf. Assoc. 12(3), 296–298 (2005)
10. Katiyar, A., Cardie, C.: Going out on a limb: joint extraction of entity mentions and relations without dependency trees. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 917–928 (2017)
11. Li, Q., Ji, H.: Incremental joint extraction of entity mentions and relations. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 402–412 (2014)
12. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Lib. Assoc. 88(3), 265 (2000)
13. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
14. Miwa, M., Sasaki, Y.: Modeling joint entity and relation extraction with table representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1858–1869 (2014)
15. Nikfarjam, A., Emadzadeh, E., Gonzalez, G.: Towards generating a patient's timeline: extracting temporal relationships from clinical notes. J. Biomed. Inf. 46, S40–S47 (2013)
16. Ogren, P.V., Savova, G.K., Chute, C.G., et al.: Constructing evaluation corpora for automated clinical named entity recognition. LREC 8, 3143–3150 (2008)
17. Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without labeled text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010. LNCS (LNAI), vol. 6323, pp. 148–163. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_10
18. Savova, G.K., et al.: Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inf. Assoc. 17(5), 507–513 (2010)
19. Seol, J.W., Yi, W., Choi, J., Lee, K.S.: Causality patterns and machine learning for the extraction of problem-action relations in discharge summaries. Int. J. Med. Inf. 98, 1–12 (2017)
20. Shen, X., Gui, Y.: Clinical Pediatrics, 2nd edn. People's Medical Publishing House (2013)
21. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019)
22. Sundararajan, V., Henderson, T., Perry, C., Muggivan, A., Quan, H., Ghali, W.A.: New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J. Clinical Epidemiol. 57(12), 1288–1294 (2004)
23. Uzuner, Ö., South, B.R., Shen, S., DuVall, S.L.: 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inf. Assoc. 18(5), 552–556 (2011)
24. Wang, W., Sun, K., Chang, L.: Pediatrics, 9th edn. People's Medical Publishing House (2018)
25. Wei, Z., Su, J., Wang, Y., Tian, Y., Chang, Y.: A novel cascade binary tagging framework for relational triple extraction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1476–1488 (2020)
26. Yang, Y.L., Lai, P.T., Tsai, R.T.H.: A hybrid system for temporal relation extraction from discharge summaries. In: Cheng, S.M., Day, M.Y. (eds.) International Conference on Technologies and Applications of Artificial Intelligence, vol. 8916, pp. 379–386. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-319-13987-6_35
27. Zan, H., et al.: Construction of Chinese medical knowledge graph based on multisource corpus. J. Zhengzhou Univ. (Nat. Sci. Edn.) 52(02), 45–51 (2020)
28. Zan, H., Liu, T., Chen, J., Li, J., Niu, C., Zhao, Y.: Corpus construction for named-entity and entity relations for paediatric diseases. J. Chin. Inf. Process. 34(5), 19–26 (2020)
29. Zelenko, D., Aone, C., Richardella, A.: Kernel methods for relation extraction. J. Mach. Learn. Res. 3(Feb), 1083–1106 (2003)
30. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 2335–2344 (2014)
31. Zeng, X., Zeng, D., He, S., Liu, K., Zhao, J.: Extracting relational facts by an end-to-end neural model with copy mechanism. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514 (2018)
32. Zhang, K., Zhao, X., Guan, T., Shang, B., Zan, H.: Construction and application of entity and relationship labeling platform for medical texts. J. Chin. Inf. Process. 34(6), 117–125 (2020)
33. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1554–1564 (2018)
34. Zheng, S., Wang, F., Bao, H., Hao, Y., Zhou, P., Xu, B.: Joint extraction of entities and relations based on a novel tagging scheme. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1227–1236 (2017)
Document-Level Event Subject Pair Recognition
Zhenyu Hu1, Ming Liu1,2(B), Yin Wu3, Jiexin Xu3, Bing Qin1,2, and JinLong Li3
1 Harbin Institute of Technology, Harbin, China
[email protected]
2 PENG CHENG Laboratory, Shenzhen, China
3 China Merchants Bank, Shenzhen, China
Abstract. In recent years, financial events in the stock market have increased dramatically. Automatically extracting valuable information from massive financial documents can provide effective support for the analysis of financial events. This paper proposes an end-to-end document-level subject pair recognition method. It aims to recognize the subject pair, i.e., the subject and the object of an event. Given a document and the predefined event type set, the method outputs all subject pairs related to each event type. Subject pair recognition is a document-level extraction task, since it needs to scan the entire document to output the desired subject pairs. This paper constructs a global document-level vector based on sentence-level vectors encoded by BERT. The global document-level vector aims to cover the information carried by the entire document. It is utilized to guide the extraction process, which is conducted sentence by sentence. By considering global information, our method obtains superior experimental results.
Keywords: Subject pair recognition · Document-level encoding · Event type
1 Introduction
With the rapid development of the market economy, we have witnessed the explosive growth of digital financial documents. The financial announcement is a major type of financial document. Companies need to perceive risk when making investment decisions. However, relying solely on human effort to extract valuable information from massive financial documents costs a lot of manpower and time. Thus, it is very important to design a tool that extracts valuable information automatically. This paper focuses on document-level event extraction for financial documents: we propose a method to extract the subject pairs related to a given event type (e.g., Resignation, Bankruptcy, Pledge, and so on) in a financial document. A subject pair consists of the subject and the object of an event. Compared with traditional event extraction tasks, this task has the following challenges:
1. When extracting event subject pairs from a document, a subject may correspond to multiple objects, or an object may correspond to multiple subjects.
2. As an event subject pair may be scattered across different sentences, the extraction of event subject pairs should be conducted at the document level, which distinguishes it from traditional extraction methods conducted at the sentence level. Recognition based solely on one sentence lacks the information carried by the whole document.
Since most traditional methods for event extraction operate at the sentence level, they cannot be aware of the information carried by other sentences when handling document-level extraction. Thus, we construct a global document-level vector. By combining the document-level vector and the sentence-level vector, we can integrate the global information carried by the entire document into one sentence to guide the subject pair extraction process. Besides, we treat the event type as a prior signal to trigger the subject pairs related to the given event type. In our data set, the subject and object of a pair basically appear in one sentence or in several adjacent sentences. Therefore, we simply use an order matching approach to assemble the appropriate subject and object of each subject pair.
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 283–293, 2020. https://doi.org/10.1007/978-3-030-60450-9_23
2 Related Work
Recent methods for event extraction can be roughly classified into three categories: pattern matching-based, machine learning-based, and neural network-based.
Pattern matching-based methods aim to extract certain types of events under the guidance of predefined patterns. They use various pattern matching algorithms to extract an event and its related arguments from matched sentences according to the established template. Pattern matching-based methods can achieve high performance in a specific domain, but they suffer from poor portability. The open-domain event extraction system FSA of Surdeanu and Harabagiu [1] is an example.
Machine learning-based methods draw on experience from text classification and convert event extraction into a classification task. The keys to this kind of method are the construction of the classifier and the selection of features. Chieu et al. introduced the maximum entropy model into event element recognition. Their work is conducted on texts about Seminar Announcements and Management Succession [2], and the extraction process is conducted sentence by sentence. To improve the extraction effect, a variety of machine learning algorithms are sometimes applied together. Ahn et al. combined MegaM, a maximum entropy learner, and TiMBL, a memory-based (nearest neighbor) learner [3], to accomplish event type classification and event argument role extraction, respectively. Experimental results on the ACE corpus show that this method is better than using a single algorithm.
Neural network-based methods regard event extraction as a supervised multi-class classification task and can be divided into two kinds: pipeline-based and joint model-based. The pipeline-based approach separates event recognition and event classification into two steps and trains a separate model for each. The input of the classification model is the output of the recognition model. Unlike traditional methods that use discrete features, neural network-based methods use continuous vectors as input and learn embeddings containing semantic information from them. Chen et al. [4] and Nguyen et al. [5] applied neural networks to event type classification and event argument extraction, respectively, and achieved satisfactory results. Their work verifies the effectiveness of neural networks in event-related tasks. Feng et al. used an RNN to model the sequence input and a convolutional layer to model local phrase blocks, and then merged these two types of features for recognition [6]. Using neural networks for end-to-end learning can effectively reduce the complexity of feature engineering.
In this paper, we propose a neural network-based method to extract subject pairs from long financial documents. Our method uses BERT and CNN as the encoder layer to represent an input sentence as a sentence-level vector. Furthermore, to embed the information carried by the entire document, we construct a global document-level vector from the sentence-level vectors encoded by BERT. The global document-level vector enables our method to be aware of the global information carried by the entire document when recognizing subject pairs. In addition, the event type is encoded as a feature vector, and we integrate it into the recognition process to guide the extraction of subject pairs related to this event type.
3 Problem Definition
3.1 Definition 1 (Subject Pair)
Given a known event, among the entities which occur in this event, two entities are the subject Ms and the object Mo of the given event. We name the subject pair of this event (Ms, Mo). Depending on the corresponding event type, one element in the subject pair can be NULL. Given a document D and the set of event types contained in the document, T = {t1, t2, t3, ..., tn}, our method outputs the set of all event subject pairs corresponding to each event type in the document. The set of all event subject pairs corresponding to event type ti is:
$$M(t_i) = \{(M_{s1}^{i}, M_{o1}^{i}), (M_{s2}^{i}, M_{o2}^{i}), \ldots, (M_{sn}^{i}, M_{on}^{i})\}.$$
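Definition 1 can be made concrete with a small data structure. The following is only a sketch: the class name `SubjectPair`, the helper `collect_pairs`, and the use of `None` for NULL are illustrative assumptions, not part of the original method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SubjectPair:
    """One (Ms, Mo) pair; None stands in for the NULL element (an assumption)."""
    subject: Optional[str]
    obj: Optional[str]

    def is_valid(self) -> bool:
        # Definition 1 allows at most one NULL element per pair.
        return self.subject is not None or self.obj is not None


# Hypothetical output for one document: event type -> list of subject pairs.
DocumentResult = Dict[str, List[SubjectPair]]


def collect_pairs(raw: Dict[str, List[tuple]]) -> DocumentResult:
    """Build M(t_i) for every event type t_i, dropping pairs that are entirely NULL."""
    result: DocumentResult = {}
    for event_type, pairs in raw.items():
        kept = [SubjectPair(s, o) for s, o in pairs]
        result[event_type] = [p for p in kept if p.is_valid()]
    return result
```

Keeping one list per event type mirrors the set M(t_i): the event type is known in advance, so it serves naturally as the lookup key.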
3.2 Definition 2 (Event Types)
According to the task setting, we have 11 predefined event types. We need to recognize the subject pairs corresponding to each event type in a given document. The event type is known during the recognition process. For 7 event types, one element of the subject pair is set to NULL (that is, the subject or the object is empty). For the other 4 event types, both the subject and the object are non-empty (neither is NULL). Based on this setting, we divide the event type set into two categories. The first category covers event types for which one element of the subject pair is NULL. The second category covers event types for which both the subject and the object are not NULL.
4 Proposed Method
Figure 2 gives the overall workflow of our model. As shown in Fig. 2, the whole model consists of two parts. The first part is used to extract subject pairs (model 1) and it is the major part. The second part is used to obtain a global document-level vector (model 2). The output of model 2 is fed into model 1 to guide the extraction process. The input of the whole model is a document and the related event type. For example, in Fig. 1, the input document is shown at the top left corner and the given event type to be recognized is punishment. The document is first divided into sentences. Then the document is sent to model 1 sentence by sentence to get sentence-level embeddings, and the entire document is sent to model 2 at once to get the document-level embedding. Finally, the model outputs all subject pairs corresponding to punishment in the document.
Our model shown in Fig. 2 has no trigger word extraction step. This design draws on the work of Zheng et al. [7], who propose a novel DEE (Document-level Event Extraction) formalization that removes trigger word labeling. This trigger-free design does not rely on any predefined trigger word set and still matches the ultimate goal of DEE. In addition, trigger words are mainly used to recognize the event type, which is already known in our task. Therefore, to ease document-level extraction, our method does not involve trigger word extraction.
4.1 Subject Pair Recognition Model (Model 1)
In this paper, we convert the subject pair recognition task into a sequence labeling task. As shown in Fig. 2, model 1 uses a BERT + CNN + CRF architecture as its basic model.
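To illustrate what casting recognition as sequence labeling means in practice, the sketch below turns a predicted tag sequence into subject and object spans. The BIO tag names (`B-SUB`, `I-SUB`, `B-OBJ`, `I-OBJ`, `O`) and the whitespace joining of tokens are assumptions made for illustration; the paper does not list its label set.

```python
from typing import List, Optional, Tuple


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn a BIO tag sequence into (role, text) spans in document order."""
    spans: List[Tuple[str, str]] = []
    role: Optional[str] = None
    buffer: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; flush the one in progress, if any.
            if role is not None:
                spans.append((role, " ".join(buffer)))
            role, buffer = tag[2:], [token]
        elif tag.startswith("I-") and role == tag[2:]:
            buffer.append(token)
        else:
            # "O" or an inconsistent I- tag closes the current span.
            if role is not None:
                spans.append((role, " ".join(buffer)))
            role, buffer = None, []
    if role is not None:
        spans.append((role, " ".join(buffer)))
    return spans
```

The spans come out in the order they appear in the text, which is the order the later matching step relies on.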
Fig. 1. An example of how our model processes one input document. The given event type is punishment. The words in red are subjects and the words in orange are objects. The document is first divided into sentences. Then the sentences are input into our model to get all subjects and objects contained in the document. The recognized subjects and objects are listed in the order in which they appear in the document. After the matching step described in Sect. 4.3, we get the final subject pairs. (Color figure online)
The input of model 1 is each sentence in the input document together with the given event type. The output of model 1 is all subjects and objects contained in the input sentence. Figure 2 shows that model 1 takes BERT as a sentence encoder to get the sentence-level vector. In this paper, the ith sentence of the document, s_i, is embedded as e_{s_i} = [w_{i1}; w_{i2}; ...; w_{iN}], where w_{ik} is the embedding of the kth word in the ith sentence. We feed e_{s_i} into BERT and use the hidden layer vector h_i to represent the ith sentence. Since BERT [8] learns a large amount of general semantic knowledge from massive data and uses a multi-layer Transformer [9] structure to handle long-distance dependencies, the vector h_i can represent the sentence well. At the same time, the embedding e_{s_i} is fed into a CNN layer. As Fig. 2 shows, the CNN layer has three filters and outputs three vectors, which are concatenated into a single vector C_i. We apply the CNN to encode local contexts within sentence scope, and the vector C_i supplements the BERT embedding h_i. For the parameter setting of the three filters, we follow the work of Kim [10] and set the filters to (1,768), (3,770), (5,772). In this way, we obtain an embedding that encodes local context within sentence scope. To make our model aware of document-level contexts, we input the entire document into model 2 and obtain the global document-level vector e_global. How this vector is acquired is described in Sect. 4.2. The given event type is embedded as a vector t. Since the number of non-empty elements in subject pairs differs across event types (as described in Sect. 3.2), we use the feature vector t to capture this difference.
Fig. 2. The overall workflow of our model. Model 1 aims to recognize all subjects and objects related to a given event type in a document. Model 2 aims to get a vector that is aware of document-level contexts. The global vector obtained from model 2 is sent to model 1 to improve recognition. After concatenating h_i, C_i, t, and e_global (note that ⊗ denotes the concatenation operation and d denotes dimensionality), we get the vector h_final. The vector h_final is fed into the FFN to get the matrix P. Then P is sent into the CRF layer to get the labeling result.
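The concatenate-then-FFN stage in the caption can be sketched in plain Python. This is a toy illustration with tiny placeholder sizes and random weights, assuming the FFN is two linear maps with a ReLU in between and is applied to one position at a time; all helper names (`linear`, `relu`, `init`, `ffn`) are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a weight matrix of shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def init(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def ffn(h_final: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two linear transformations with a ReLU in between."""
    return linear(relu(linear(h_final, w1, b1)), w2, b2)


rng = random.Random(0)
# Toy sizes only (assumed); they are not the dimensions used in the paper.
h_i, c_i, t, e_global = [0.1] * 4, [0.2] * 3, [0.3] * 2, [0.4] * 4
h_final = h_i + c_i + t + e_global          # concatenation, length 13
d_in, d_ff, d_out = len(h_final), 6, 11
w1, b1 = init(d_ff, d_in, rng), [0.0] * d_ff
w2, b2 = init(d_out, d_ff, rng), [0.0] * d_out
scores = ffn(h_final, w1, b1, w2, b2)       # one score per label
```

In this reading, `scores` would be one row of P, and a CRF layer would then pick the best label sequence over all rows.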
The feature vector t is also treated as a prior signal to trigger the subject pairs related to this event type. So far, we have all the information we need. As shown in Fig. 2, we concatenate h_i, C_i, t, and e_global to obtain the final representation vector h_final, which is sent into an FFN (Feed-Forward Network) to get the matrix P. The FFN consists of two linear transformations with a ReLU activation in between. The dimensionality of the input is d_input = 2048, the dimensionality of the output is d_output = 11 (see Fig. 2), and the inner layer of the FFN has dimensionality d_ff = 1024. Finally, we use a CRF (Conditional Random Field) [11] to model the interaction among tags [12]: the matrix P is sent into the CRF layer, which outputs the labels indicating the subjects and the objects corresponding to the given event type. At last, label matching is conducted to assemble each subject and its corresponding object into a subject pair.
4.2 Global Document-Level Vector Model (Model 2)
In our dataset, a subject pair may span sentences, but the input of model 1 is a single sentence of the input document. We therefore design model 2, which takes the entire document as input to get a global document-level vector containing the information carried by the entire document. We apply this vector to guide model 1. Figure 3 gives the overall structure of model 2. As shown in Fig. 3, the input of model 2 is an entire document D = [s_1; s_2; ...; s_Ns]. The ith sentence s_i is embedded as the sequence of word embeddings e_{s_i} = [w_{i,1}; w_{i,2}; ...; w_{i,N}], in the same way as in Sect. 4.1. So the document D is embedded as D = [e_{s_1}; e_{s_2}; ...; e_{s_Ns}].
Fig. 3. Global document-level vector model structure. Though we get embeddings for all sentences by feeding the whole document into BERT sentence by sentence, these embeddings only encode local contexts within sentence scope. To enable the awareness of document-level contexts, we employ a Transformer module. Before feeding the embeddings into the Transformer, we add a sentence position embedding. This position embedding is learnable (note that ⊗ denotes the add operation).
Firstly, the document D is fed into BERT to obtain the embedding for each sentence. It is represented as [h_1; h_2; ...; h_Ns], where Ns denotes the number of sentences. Since different sentences may have different lengths, we compute a fixed-size embedding c_i for each sentence by conducting a max-pooling operation over h_i. Then the document D can be represented as D = [c_1; c_2; ...; c_Ns].
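The max-pooling step that gives every sentence a fixed-size vector can be sketched as follows. The numbers are made up, and it is an assumption here that h_i is a list of per-token vectors and that pooling takes the element-wise maximum over tokens.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("max_pool needs at least one vector")
    return [max(column) for column in zip(*vectors)]


# Two sentences of different lengths (3 and 2 tokens), hidden size 4 (toy values).
h_1 = [[0.1, 0.9, -0.3, 0.0], [0.4, 0.2, 0.5, -0.1], [0.0, 0.3, 0.1, 0.7]]
h_2 = [[-0.2, 0.1, 0.6, 0.3], [0.8, -0.5, 0.2, 0.1]]
document = [max_pool(h) for h in (h_1, h_2)]   # D = [c_1; c_2]
```

Whatever the sentence length, each c_i has the hidden size as its length, so the sentence vectors can be stacked into one sequence for the document.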
The previous representation of the document, D = [c_1; c_2; ...; c_Ns], only encodes local information. To make our model aware of document-level contexts, we employ a Transformer module to facilitate the information flow across all sentences. Before feeding the sentence embeddings into the Transformer, we add sentence position embeddings to them to specify the sentence order. After Transformer encoding, information has been exchanged between adjacent sentences, and the document D is expressed as c^d = [c_1^d; c_2^d; ...; c_Ns^d]. Finally, we apply the max-pooling operation again to get the final document-level vector e_global.
4.3 Subject Pair Matching
After the previous steps, we have the subjects and the objects, but they are not yet matched. We need to match them to obtain the correct subject pairs (one subject with its corresponding object). According to the definition in Sect. 3.2, we divide event types into two categories. In the first category, a subject pair includes only one element (subject or object). In the second category, a subject pair includes both subject and object. If the recognized subject or object is related to an event type in the first category, we fill it into the corresponding position and leave the other element NULL. For the second category, we use an order matching approach. As shown in Fig. 1, for the event type punishment, one subject corresponds to multiple objects.
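The text does not spell out the order matching rule, so the sketch below shows one plausible reading only: each object is paired with the most recent subject in document order, and objects seen before any subject wait for the first one. The function name `match_pairs` and the role labels `SUB` and `OBJ` are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, str]                       # (role, text); role is "SUB" or "OBJ"
Pair = Tuple[Optional[str], Optional[str]]   # (subject, object); None means NULL


def match_pairs(spans: List[Span], two_elements: bool) -> List[Pair]:
    if not two_elements:
        # First category: keep the recognized element, leave the other NULL.
        return [(text, None) if role == "SUB" else (None, text) for role, text in spans]

    # Second category: assumed rule, attach each object to the latest subject.
    pairs: List[Pair] = []
    current_subject: Optional[str] = None
    pending_objects: List[str] = []          # objects seen before any subject
    for role, text in spans:
        if role == "SUB":
            current_subject = text
            pairs.extend((text, obj) for obj in pending_objects)
            pending_objects = []
        elif current_subject is None:
            pending_objects.append(text)
        else:
            pairs.append((current_subject, text))
    return pairs
```

Under this rule one subject can pair with several objects, which matches the punishment example in Fig. 1, but other orderings are equally consistent with the description.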
5 Experiment and Analysis
5.1 Data Set
The data set for training and testing in this paper is provided by China Merchants Bank and is annotated manually. There are 1579 documents in total, with an average of 22 sentences and 1284 words per document.
5.2 Evaluation Criteria
The evaluation is performed by comparing the recognition results with the manually annotated results. Note that a recognized subject or object may be an abbreviation; in our evaluation, an abbreviation also counts as correct. We calculate the F1 value for each event type. Then we calculate the weighted average of the F1 values across all event types, where the weight of each event type is the frequency of documents of that event type.
5.3 Results and Analysis
The data set is divided into training, development, and testing sets at a ratio of 8:1:1. As Table 1 shows, our method achieves significant improvements
over all baselines. The high quality of our method mainly originates from the importation of the document-level vector. With this vector, we can utilize document-level information to better model the sentences in the document and extract the subject and the object with high accuracy.

Table 1. F1 value of each event type. (RS: Resignation, VI: Violation of law, MA: M&A, UN: Unqualified, BA: Bankruptcy, RT: Return risk, DE: Default, PL: Pledge, RG: Reorganization, GU: Guarantee, SH: Share equity transfer)

Model       RS    VI    MA    UN    BA    RT
LSTM+CRF    55.4  58.5  23.8  32.2  58.8  50.6
BERT        76.0  61.1  55.6  67.6  66.7  55.8
Our Method  82.1  81.0  69.4  69.3  66.7  64.6

Model       DE    PL    RG    GU    SH    Avg.
LSTM+CRF    18.4  26.8  55.7  44.7  11.8  40.6
BERT        56.0  51.6  47.1  45.1  47.1  56.8
Our Method  64.2  63.1  58.8  58.6  58.1  67.2
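The Avg. column follows the weighted average described in Sect. 5.2. A minimal sketch of that computation is given below; the function names are hypothetical, and the per-type document counts must be supplied by the caller, since the paper does not report them.

```python
from typing import Dict


def f1_score(true_positive: int, false_positive: int, false_negative: int) -> float:
    """Standard F1 from counts; returns 0.0 when nothing was predicted or expected."""
    denominator = 2 * true_positive + false_positive + false_negative
    return 2 * true_positive / denominator if denominator else 0.0


def weighted_average_f1(f1_by_type: Dict[str, float], docs_by_type: Dict[str, int]) -> float:
    """Average the per-type F1 values, weighting each type by its document count."""
    total = sum(docs_by_type[t] for t in f1_by_type)
    if total == 0:
        raise ValueError("no documents to weight by")
    return sum(f1_by_type[t] * docs_by_type[t] for t in f1_by_type) / total
```

For instance, with two made-up event types scoring 80.0 and 60.0 on 3 and 1 documents, `weighted_average_f1` gives 75.0, so frequent event types dominate the average.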
As Table 2 shows, we sort the F1 values obtained by our method on each event type. The recognition results for event types in the first category are better than those for event types in the second category (in the first category a subject pair includes only one subject or object, while in the second category it includes both subject and object). This means that recognizing two elements is more difficult than recognizing a single element.

Table 2. F1 value of each event type. If one of subject or object is set to NULL, the number related to this event type is 1. If subject and object are both not NULL, the number is 2.

Event type  RS    VI    MA    UN    BA    RT
F1          82.1  81.0  69.4  69.3  66.7  64.6
Number      1     1     1     1     2     1

Event type  DE    PL    RG    GU    SH
F1          64.2  63.1  58.8  58.6  58.1
Number      1     2     1     2     2
Specifically, we divide the testing set into two parts according to the definition in Sect. 3.2. Table 3 shows the results. The result obtained on the part containing samples with only one subject or object is better than that obtained on the part containing samples with both subject and object.

Table 3. F1 values on the testing set. The testing set is separated into two parts. The first part includes the samples that have only one subject or object. The second part includes the samples that have both subject and object.

Model                First part  Second part
LSTM+CRF             48.5        30.3
BERT                 61.1        49.4
Our Method (-G-Vec)  69.2        55.2
Our Method           70.3        61.7

To demonstrate the key elements of our method, we conduct an ablation test by evaluating three variants: 1) -CNN, removing the CNN layer used to encode the input sentence. 2) -Event Type Feature (ET-F), removing the embedding of the event type. 3) -Global Vector (G-Vec), removing model 2, which produces a vector modeling the global information carried by the entire document.

Table 4. Ablation testing results.

Model   F1
Full    67.2
-CNN    64.6 (−2.6)
-ET-F   61.1 (−6.1)
-G-Vec  63.9 (−3.3)

From Table 4, we can observe that 1) The Global Vector is of prime importance, as removing it causes a sharp decline in performance. As the third column of Table 3 indicates, this vector effectively improves our method’s F1 score on the second part of the testing set. The reason is that the subject pairs in the second part contain two elements, which may be scattered across different sentences. Extracting scattered subject pairs is more difficult because it requires the information of the current sentence and of some adjacent sentences. The improvement indicates that the global vector does contain the information carried by the entire document and that it effectively guides the extraction process. 2) The Event Type Feature is a significant feature: as Table 4 shows, it contributes 6.1 F1 points on average. 3) The CNN layer is less important, but it still contributes 2.6 F1 points on average, which indicates that the CNN layer captures semantic features within sentence scope.
By examining the bad cases, we find that if a document has more than one event type, the result is prone to error. As Fig. 4 shows, our goal is to recognize the subject pairs for the event type ‘M&A’, but our model also recognizes the subject pairs for the event type ‘Reorganization’. Our model is easily confused when processing a document with multiple event types. According to our statistics, this kind of error accounts for about 60% of the errors. So, our future work is to solve this problem.
Fig. 4. One bad case. This document contains two events and one is related to ‘Reorganization’ and the other one is related to ‘M&A’. When our method extracts the subject pairs related to ‘M&A’, it extracts not only the subject pair of ‘M&A’ but also the subject pair of ‘Reorganization’.
6 Conclusions
In this paper, we propose a method to extract subject pairs related to a given event type from a document. Our method performs extraction at the document level and outperforms several baseline models. The performance gain mostly results from the construction of a global document-level vector which covers the information carried by the entire document. This global vector is combined with the sentence-level vector to guide the extraction process, and the experimental results confirm its effectiveness. In our method, the event type is treated as given knowledge, whereas in some applications such prior knowledge is not available. In future work, we will try to deal with this situation: we will first recognize the event type and then extract the subject pairs related to the recognized event type.
Acknowledgement. The research in this article is supported by the Science and Technology Innovation 2030 - “New Generation Artificial Intelligence” Major Project (2018AA0101901), the National Key Research and Development Project (2018YFB1005103), the Key Project of National Science Foundation of China (61632011), the National Science Foundation of China (61772156, 61976073) and the Foundation of Heilongjiang Province (F2018013).
References
1. Surdeanu, M., Harabagiu, S.: Infrastructure for open-domain information extraction. In: Proceedings of the Human Language Technology, pp. 325–330 (2002)
2. Chieu, H.L., Ng, H.T.: A maximum entropy approach to information extraction from semistructured and free text. In: Proceedings of the 18th National Conference on Artificial Intelligence, pp. 786–791 (2002)
3. Ahn, D.: The stages of event extraction. In: Proceedings of the Workshop on Annotations and Reasoning About Time and Events, pp. 1–8 (2006)
4. Chen, Y., Xu, L., Liu, K., et al.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Association for Computational Linguistics, pp. 167–176 (2015)
5. Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Association for Computational Linguistics, pp. 365–371 (2015)
6. Feng, X., Huang, L., Tang, D., et al.: A language independent neural network for event detection. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 66–71 (2016)
7. Zheng, S., Cao, W., Xu, W., et al.: Doc2EDAG: an end-to-end document-level framework for Chinese financial event extraction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 337–346 (2019)
8. Devlin, J., Chang, M., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
9. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 6000–6010 (2017)
10. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751 (2014)
11. Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289 (2001)
12. Lample, G., Ballesteros, M., Subramanian, S., et al.: Neural architectures for named entity recognition. In: Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 260–270 (2016)
Knowledge Enhanced Opinion Generation from an Attitude
Zhe Ye1,3, Ruihua Song2(B), Hao Fu3, Pingping Lin3, Jian-Yun Nie4, and Fang Li1
1 Shanghai Jiao Tong University, Shanghai, China [email protected]
2 Renmin University of China, Beijing, China songruihua [email protected]
3 Microsoft, Beijing, China {zheye,fuha,pinlin}@microsoft.com
4 University of Montreal, Montreal, Canada [email protected]
Abstract. Mining opinion is essential for consistency and persona of a chatbot. However, mining existing opinions suffers from data sparsity. Toward a given entity, we cannot always find a proper sentence that expresses desired sentiment. In this paper, we propose to generate opinion sentences for a given attitude, i.e., an entity and sentiment polarity pair. We extract attributes of a target entity from a knowledge base and specific keywords from its description. The attributes and keywords are integrated with knowledge graph embeddings, and fed into an encoder-decoder generation framework. We also propose an auxiliary task that predicts attributes using the generated sentences, aiming to avoid common opinions. Experimental results indicate that our approach significantly outperforms baselines in automatic and human evaluation.
Keywords: Opinion · Generation · Chatbot · Knowledge
1 Introduction
Conversation systems have advanced in recent years due to the progress of deep learning techniques and the accumulation of conversation data on the Internet. However, it is challenging for a conversation system to produce responses that are consistent with a specified persona. [17] found that 92.3% of the persona profiles and 49.2% of the sentences in persona profiles in the PersonaChat study [20] contain at least one sentiment word (see www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon) such as like, enjoy, and hate. This indicates that opinions for a given attitude are in demand for personalising chatbots and ensuring consistency. Mining existing opinions is one way to obtain them, but it has some issues. As Fig. 1 shows, the number of opinions per entity is imbalanced. One third of the entities have fewer than 10 opinions and one third of the entities do not have any negative opinion. For new entities,
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 294–305, 2020. https://doi.org/10.1007/978-3-030-60450-9_24
Fig. 1. Left: Number of opinions per entity, plotted as ln(#opinions + 1) against entity rank. Right: Fraction of negative opinions per entity, plotted against entity rank.
one cannot find any opinion in an existing corpus. In contrast, humans can easily adopt opinions from similar entities to express their feelings about new ones. Generation-based models provide the flexibility to address the above issues, and knowledge about entities and the relations between them may help. In this paper, we propose a new way of generating opinion sentences from a given attitude. For example, from Shaquille O’Neal and positive sentiment polarity, we aim to generate specific opinions like “[entity] is the forever star on the NBA All-Star stage.”, rather than common opinions like “[entity] is good.”
Some previous studies propose to generate opinions in specific domains. For example, the approach in [3] generates reviews of a book for a given user and rating. Within this “book” domain, generation patterns learned from one book can be easily transferred to another one. We propose a more generic method of generating opinions in mixed domains, where the entities could be persons, cities, TV series, novels, games, etc. We enhance the model’s transfer ability by incorporating a knowledge base, in which similar entities can be identified by their attributes. Moreover, we improve the specificity of generated opinions to avoid common but boring ones.
In this paper, we propose a new generic framework that uses knowledge to generate opinion sentences from an attitude. We first represent a target entity by its general attributes in a knowledge base and by specific keywords extracted from its description. Then we integrate knowledge graph embeddings into the encoder-decoder framework to generate opinion sentences. Next, we propose an auxiliary task that uses opinion sentences to predict attribute values, in order to enhance specificity. Evaluation indicates that our proposed approach significantly outperforms baselines in generating more interesting and specific opinion sentences.
2 Related Work
2.1 Opinion Mining
There is a long history of opinion mining, or sentiment analysis. As [10] describes, opinion mining aims to identify and extract subjective content in text, and thus most works focus on sentiment classification. What we address is not classification but generation. Some works studied generation, e.g., concept-to-text generation. For example, [12] generates weather forecasts or sports reports from structured input data. They regard the input data and output sentences as sequences and apply an RNN-based encoder-decoder framework to address the problem. These works are similar to ours in taking structured data as input and in the basic framework, but we have different goals. Concept-to-text tasks require the output sentences to convey the information represented by the input data, and there are relatively few templates that can be mapped to the given database schema. In contrast, generated opinions take more forms, so the one-to-many issue is more serious. What we need is to generate appropriate and specific opinion sentences to express a chatbot’s persona. [3] proposes a new task of generating reviews from a triple of userID, productID, and rating. Their goal is close to ours, but they conduct experiments on one category of products, i.e., books. They do not leverage a knowledge graph to extend the entity, perhaps because their data is relatively rich within a single category.
2.2 Generation Models
Many generation models have been proposed for sequence-to-sequence generation. The two main applications are machine translation and conversation generation. Our task is more relevant to conversation generation because our input also has many proper outputs (one-to-many). We share the same issue that common results are much more easily generated but lack information and diversity [8,16,19]. To solve the issue, we propose using P(X|Y), where X is the source sequence and Y is the target sequence, to balance frequency and specificity. The main idea is similar to that used by [9,18], but it is realized in different ways due to the different tasks. To the best of our knowledge, we are the first to introduce such methods into opinion generation. Similar to [9], we face practical issues when we try to integrate P(X|Y) into the objective function: intractable models and ungrammatical generation. We propose our own solution to these issues.
2.3 Knowledge Graphs
A typical knowledge graph contains millions of triples (h, r, t) where h is the head entity, t is the tail entity and r is the relation from h to t. Knowledge graph embedding models learn low-dimensional representation vectors for entities or relations. The embedding preserves the structural information of the original knowledge graph. Translation-based knowledge graph embedding models, such
Fig. 2. The overview of KNowledge Enhanced Opinion Generation Model. (Components shown: attributes and keywords, tag embedding, knowledge graph embedding, entity representation, Transformer encoder, sentiment polarity, Transformer decoder, decoder output embeddings, BiGRU, AttributePredicter.)
as TransE, have proved effective [1]. Recently, graph neural networks (GNNs) have attracted a lot of attention. Graph Convolutional Networks (GCN) [7], one kind of GNN, can be used to encode graph information. GCN has shown promising performance in graph node classification [5] and semantic role labeling [11]. We apply GCN to embed knowledge graphs in our approach.
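As a rough illustration of how a GCN encodes graph structure, the sketch below implements one layer of the commonly used propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. It is a simplification under stated assumptions: edge types are ignored, the adjacency matrix is assumed non-negative, and the function names are hypothetical; the paper does not say which GCN variant it uses.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalisation by node degree (degree >= 1 thanks to self-loops).
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]
```

Each output row mixes a node's own features with those of its neighbours, which is how entities that share attribute values end up with similar embeddings.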
3 Problem Formulation
Given an attitude, i.e., an entity e and its sentiment polarity p ∈ {+1, −1}, the task is to generate opinions of the entity e that express the sentiment polarity p. The generated opinions are expected to be 1) fluent, 2) coherent with the sentiment polarity, and 3) relevant and specific to the entity e. A training sample is a triple (e, p, Y), which denotes that the sentence Y = [y_0, y_1, ..., y_N] expresses a sentiment polarity of p toward the entity e. We also make use of a knowledge graph G. The graph contains three types of nodes, namely entities, attribute values, and keywords. Nodes are connected by different types of edges, which correspond to keywords and the different types of attributes. For each entity node, there is a corresponding description document d.
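The inputs named in this formulation can be written down as simple data structures. This is a sketch for orientation only; the class names `KnowledgeGraph` and `TrainingSample`, the edge-triple layout, and the example edge types are assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KnowledgeGraph:
    """Nodes are entities, attribute values or keywords; every edge carries a type."""
    edges: List[Tuple[str, str, str]] = field(default_factory=list)   # (head, edge_type, tail)
    descriptions: Dict[str, str] = field(default_factory=dict)        # entity -> document d

    def add_edge(self, head: str, edge_type: str, tail: str) -> None:
        self.edges.append((head, edge_type, tail))

    def neighbours(self, entity: str, edge_type: str) -> List[str]:
        """Tails reachable from an entity through edges of one type."""
        return [t for h, r, t in self.edges if h == entity and r == edge_type]


@dataclass
class TrainingSample:
    entity: str            # e
    polarity: int          # p, either +1 or -1
    sentence: List[str]    # Y = [y_0, ..., y_N]

    def __post_init__(self) -> None:
        if self.polarity not in (1, -1):
            raise ValueError("polarity must be +1 or -1")
```

A sample such as `TrainingSample("Shaquille O'Neal", 1, ["[entity]", "is", "good"])` then pairs an attitude with one sentence that expresses it.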
4 Our Approach
As Fig. 2 shows, we propose a generic framework to solve the problem. We provide details in this section. We first describe how to represent an entity as the input of our model. Then we integrate a knowledge graph with the encoder-decoder framework. Finally, we describe how to improve the quality of generated opinions by avoiding common opinion sentences.
4.1 Entity Representation
An entity itself does not provide much information for generation. We extend the representation of an entity with its attribute values in a knowledge base G. For example, the entity Shaquille O’Neal has attributes such as entity type, nationality, gender, and occupation, so it can be represented as [person, American, male, basketball player, ...]. Song Xu is a Chinese singer, and his representation could be [person, Chinese, male, singer, ...]. We find that the attributes are not specific enough for entities. For example, many basketball players share the same attribute values, such as [person, American, male, basketball player, ...]. This results in common generation results. We therefore further extend the representation of an entity with keywords extracted from its description document d. We extract the top k most frequent keywords (excluding stop words). For example, Shaquille O’Neal has keywords like NBA, center, and star. Tim Duncan has keywords like Spurs, history, and champion. These keywords clearly distinguish the two basketball players. Therefore, an entity e is represented as X = (attr_1, attr_2, ..., attr_M, keyword_1, keyword_2, ..., keyword_K), and we denote the length of X by T = M + K. In our experiments, M = 11 and K = 10.
4.2 Encoder-Decoder Framework with Knowledge Graph Integrated
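Before the encoder can consume it, the input sequence X of Sect. 4.1 has to be assembled from attribute values and description keywords. A minimal sketch of that assembly follows; whitespace tokenisation, lower-casing, and the function names are assumptions for illustration (Chinese text would need a word segmenter instead).

```python
from collections import Counter
from typing import Iterable, List


def top_keywords(description: str, stop_words: Iterable[str], k: int) -> List[str]:
    """Return the k most frequent non-stop-word tokens of a description."""
    stop = {w.lower() for w in stop_words}
    tokens = [tok for tok in description.lower().split() if tok not in stop]
    return [word for word, _ in Counter(tokens).most_common(k)]


def entity_input(attributes: List[str], keywords: List[str]) -> List[str]:
    """X = (attr_1..attr_M, keyword_1..keyword_K), so len(X) = M + K."""
    return list(attributes) + list(keywords)
```

Placing attributes first and keywords second keeps the positions of the M attribute slots fixed across entities, which matches the order used in the definition of X.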
Encoder-Decoder Framework. We choose a transformer-based encoder-decoder model as a starting point. We define e_i(·), e_o(·) and e_s(·) as three functions that look up embeddings for inputs, outputs and sentiment polarities. An L-layer transformer is used as the encoder. Given the input sequence X = (attr_1, attr_2, ..., attr_M, keyword_1, keyword_2, ..., keyword_K), we first pack their embeddings into
$$H_0 = [e_i(\mathrm{attr}_1), \ldots, e_i(\mathrm{attr}_M), e_i(\mathrm{keyword}_1), \ldots, e_i(\mathrm{keyword}_K)]. \tag{1}$$
The output of the last layer, H = [h_1, h_2, ..., h_T], is used as the encoded representation of X and is calculated by
$$H = \mathrm{Trans}_e^{L}(H_0), \tag{2}$$
where Trans_e^L(·) represents the transformer encoder. Another L-layer transformer is used as the decoder. The decoding procedure of the i-th step is as follows:
$$E_i = [e_o(y_0), e_o(y_1), \ldots, e_o(y_{i-1})] \tag{3}$$
$$S_i = \mathrm{Trans}_d^{L}(H, E_i, e_s(p)) \tag{4}$$
where $\mathrm{Trans}_d^L(\cdot)$ is the L-layer transformer decoder and $E_i$ is composed of the embeddings of the decoded words. $S_i$ is composed of the i output embeddings $[s_1^i, s_2^i, \ldots, s_i^i]$ of the i-th step. Our pilot experiments show that the sentiment signal may fade away while being passed from the encoder to the decoder, in which case the generated opinion sentences are poorly coherent with the sentiment. We therefore feed the sentiment polarity p to the decoder at every step instead of treating it as an input of the encoder. The unnormalized generation probability $P(y_i)$ is conditioned on the output embedding $s_i^i$:

$$P(y_i = w) = P_V(y_i = w) = \mathbf{w}^T \cdot \mathrm{MLP}_V(s_i^i), \tag{5}$$

where $\mathbf{w}$ is the one-hot indicator for word w.

Integrating Knowledge Graph Embedding. To further leverage the knowledge graph as a whole, we propose using knowledge graph embeddings to represent the attribute values of an entity e. We incorporate a Graph Convolutional Network (GCN), a neural network model designed for graph-structured data [7], into our opinion generation model. Let $e_g(attr_i)$ and $e_g(keyword_j)$ be the graph embeddings of $attr_i$ and $keyword_j$. We use a linear transformation to merge the graph embedding with the original tag embedding $e_i(attr_i)$ as follows:

$$e_m(attr_i) = W^T [e_g(attr_i); e_i(attr_i)] \tag{6}$$
Then we replace the $e_i(attr_i)$ of Eq. 1 with $e_m(attr_i)$ to encode graph information into the opinion generation model. We update the parameters of the GCN along with the parameters of the main model.

4.3 Promoting Specificity by Enhancing Knowledge
In dialogue generation, generation-based models tend to generate common responses. A common response can be coherent with many different input utterances [8,19]. An opinion generation model based on a vanilla encoder-decoder framework also suffers from generating common opinions. A common opinion sentence is coherent with many different entities: we can use the pattern “[entity] is good” to generate “[O’Neal] is good” and “[Paris] is good”. In contrast, “[entity] is the forever star on the NBA All-Star stage” is a specific opinion. One can infer that it expresses a positive sentiment about an NBA basketball player. If the generation model knows how specific an opinion sentence is, it will be able to avoid generating common opinion sentences. Inspired by recent studies on diversity and specificity in the dialogue generation task [8,14,16,19,21], we propose methods that improve our opinion generation model by promoting specificity with the help of knowledge information (attributes).

The main idea is to predict attribute values from a generated opinion sentence. For an opinion sentence with good specificity, the attribute values should be predicted accurately. We use cross entropy to measure the difference between the predicted attribute distribution and the ground-truth attribute distribution. A small difference means that the attribute prediction model can easily infer the ground-truth attribute values, which in turn indicates that the given opinion sentence is specific. The specificity of Y is therefore the exponential of the negative cross entropy:

$$\mathrm{spec}(Y) = \exp\Big(\sum_{i=1}^{M} P(attr_i \mid Y)\,\ln \hat{P}(attr_i \mid Y)\Big) = \prod_{i=1}^{M} \hat{P}(attr_i \mid Y), \tag{7}$$
where $P(attr_i \mid Y)$ is the ground-truth distribution of $attr_i$ and $\hat{P}(attr_i \mid Y)$ is the predicted distribution of $attr_i$. Because $attr_i$ is the true attribute value of Y’s entity, $P(attr_i \mid Y)$ equals 1.

Intuitively, if the model “sees” more specific training samples and fewer common ones, it will tend to generate specific opinion sentences. We assign every training sample a sampling probability. Before every training epoch, we re-sample the training dataset according to the sampling probabilities to obtain a new training dataset of the same size. A training sample with a higher sampling probability has more chances to be seen by the model. We use spec(Y) as the sampling probability of an opinion sentence Y. We use bi-directional GRUs to encode the opinion sentence Y,

$$h_t^o = \mathrm{BiGRU}(e_i(y_t), h_{t-1}^o), \quad t \in [1, N], \tag{8}$$

and then use M (the number of attributes) softmax-based classifiers to obtain the attribute distributions $\hat{P}(attr_1 \mid Y), \hat{P}(attr_2 \mid Y), \ldots, \hat{P}(attr_M \mid Y)$. The model gives “[entity] is the forever star on the NBA All-Star stage” a score of 0.997 and “[entity] is good” a score of 0.021.

Joint Learning: We regard opinion generation as the main task and attribute prediction as the auxiliary task. Joint learning is expected to increase the specificity of generated opinion sentences. However, if we take the generated opinion sentences as the input of the attribute prediction model, the training procedure is intractable. We therefore use the decoder output embeddings $s_1^1, s_2^2, \ldots, s_N^N$ of the opinion generation model as the representation of the input opinions for the attribute prediction models (see the upper right part of Fig. 2):

$$h_t^o = \mathrm{BiGRU}(s_t^t, h_{t-1}^o), \quad t \in [1, N]. \tag{9}$$
Then the convergence of the auxiliary task could “force” the main model to produce more specific opinion sentences. We denote the attribute prediction distribution as $\hat{P}(attr_i \mid Y)$.

Re-ranking: When decoding, we use beam search to find all candidates according to the scores from the main model. After that, we re-rank all candidates by a combination of the specific degree and the main model output scores as follows:

$$\mathrm{score}(\hat{Y}) = \log P(\hat{Y} \mid X) + \alpha \prod_{i=1}^{M} \hat{P}(attr_i \mid Y) + \beta \prod_{i=1}^{M} \hat{P}(attr_i \mid Y). \tag{10}$$
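The specificity score, the per-epoch re-sampling, and the re-ranking score of this section can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not spell out: re-sampling is done with replacement, the predicted probabilities are strictly positive, and the two specificity terms of the re-ranking score come from two attribute predictors whose outputs are stored under the hypothetical keys `spec_a` and `spec_b`. The function names are also hypothetical.

```python
import math
import random

def specificity(true_attr_probs):
    """spec(Y): product of the predicted probabilities of the M true attributes.

    Computed as exp(sum(log p)), i.e. the exponential of the negative cross
    entropy for a one-hot ground truth. Assumes every p is strictly positive.
    """
    return math.exp(sum(math.log(p) for p in true_attr_probs))

def resample(dataset, spec_scores, rng):
    """Draw a new training set of the same size, weighting samples by spec(Y).

    Sampling with replacement is an assumption of this sketch.
    """
    return rng.choices(dataset, weights=spec_scores, k=len(dataset))

def rerank(candidates, alpha, beta):
    """Sort beam candidates by log P(Y|X) + alpha * spec_a + beta * spec_b."""
    def score(c):
        return c["log_prob"] + alpha * c["spec_a"] + beta * c["spec_b"]
    return sorted(candidates, key=score, reverse=True)

rng = random.Random(0)
data = ["[entity] is good",
        "[entity] is the forever star on the NBA All-Star stage"]
scores = [0.021, 0.997]  # specificity scores quoted in the text
epoch_data = resample(data, scores, rng)

# Hypothetical beam with made-up log-probabilities, for illustration only.
beam = [
    {"text": data[0], "log_prob": -1.0, "spec_a": 0.021, "spec_b": 0.021},
    {"text": data[1], "log_prob": -1.5, "spec_a": 0.997, "spec_b": 0.997},
]
best = rerank(beam, alpha=0.5, beta=0.5)[0]
```

In this toy beam the generic sentence has the higher model score, but the specificity terms move the specific sentence to the top, which is the intended effect of re-ranking.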
Table 1. The statistics on the data for the experiments.

Split               #Entity   #Attitude   #Opinion   #Labeled
Train               1,314     2,141       104,823    -
Dev                 100       162         7,751      -
Test.seen           61        61          2,484      -
Test.Unseen         130       217         7,386      -
Test.seen.Human     18        18          -          1,310
Test.Unseen.Human   18        30          -          2,227

5 Experiment

5.1 Dataset
To construct training samples, we use a pre-trained attitude detector [17] to detect the sentiment polarity p and the associated entity e in a Chinese conversation corpus. Responses with a positive or negative attitude were kept as opinion sentences. To obtain the sentence Y, the entity e in an opinion sentence was replaced with the special token [entity]. In this way, we obtained triples (e, p, Y) for training. We split the data into four parts (see Table 1). Entities in Test.Unseen were not included in either Train or Dev. Entities in Test.seen were included in Train with the opposite polarity. Because of the annotation cost, we selected 30 attitudes from Test.Unseen, denoted as Test.Unseen.Human, and another 18 attitudes from Test.seen, denoted as Test.seen.Human, for human evaluation. The top ten generated opinions from the different methods were pooled together and shuffled before being shown to every annotator. Even so, each annotator had to label more than 3,500 opinions.

5.2 Baselines
Retrieval: Given an entity $E_i$ and the expected polarity, we find the entity $E_j$ in the training data that is most similar to $E_i$, and choose N opinions in descending order of similarity. We define the similarity between two entities as the weighted sum of their matched attributes, giving larger weights to more important attributes.

LibFM: A recommendation model [13] is used to “recommend” opinions for the given entity and sentiment polarity. Embeddings of the sentiment, attributes, keywords, graph nodes and words in opinions are used as the side information.

Att2Seq: We adopt the model proposed by [3] to generate opinions conditioned on the attitude polarity and the attributes of an entity.

Transformer: We use a 6-layer transformer as the encoder and another 6-layer transformer as the decoder. The whole structure is similar to [15].
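For the Retrieval baseline, the similarity between two entities is a weighted sum over matched attributes. A minimal Python sketch follows; the attribute names, the weight values, and the candidate entry `another_player` are hypothetical, since the paper does not list the actual attributes or weights.

```python
def entity_similarity(attrs_a, attrs_b, weights):
    """Weighted sum over attribute slots whose values match in both entities."""
    return sum(
        weights[name]
        for name, value in attrs_a.items()
        if name in weights and attrs_b.get(name) == value
    )

def most_similar(query, candidates, weights):
    """Return the name of the candidate entity most similar to the query."""
    return max(
        candidates,
        key=lambda name: entity_similarity(query, candidates[name], weights),
    )

# Hypothetical weights: occupation is treated as the most important attribute.
weights = {"type": 1.0, "nationality": 1.0, "gender": 1.0, "occupation": 3.0}

query = {"type": "person", "nationality": "American",
         "gender": "male", "occupation": "basketball player"}
candidates = {
    "another_player": {"type": "person", "nationality": "American",
                       "gender": "male", "occupation": "basketball player"},
    "Song Xu": {"type": "person", "nationality": "Chinese",
                "gender": "male", "occupation": "singer"},
}
nearest = most_similar(query, candidates, weights)
```

The opinions attached to the nearest entity (with the expected polarity) would then be returned as the retrieved result.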
Table 2. The second, third and fourth columns show the ratios of generated opinions with overall scores of +2, +1 and 0. The Spec column shows the ratio of specific opinions. The Avg column is the average overall score based on the human evaluation scores. The nDCG column shows the quality of the generated opinions from the view of ranking over the pooled list of opinions generated by all models. Bold means the model outperforms all other models in terms of that metric. ∗ indicates KNOG outperforms Transformer significantly (p < 0.05).

Model            +2        +1        0         Spec      Avg       nDCG      NIST       Dist-1    Dist-2
Retrieval        0.172     0.475     0.353     0.309     0.818     0.376     0.714      0.114     0.347
LibFM            0.099     0.412     0.490     0.216     0.609     0.274     0.009      0.034     0.095
Att2Seq+A        0.193     0.565     0.242     0.324     0.951     0.443     0.505      0.040     0.174
Transformer      0.156     0.572     0.272     0.247     0.885     0.409     0.470      0.038     0.165
KNOG             0.279     0.458     0.263     0.420     1.017*    0.484*    1.240      0.047     0.205
vs.Att2Seq+A     ↑ 44.6%   ↓ 18.8%   ↑ 8.3%    ↑ 29.8%   ↑ 6.9%    ↑ 9.3%    ↑ 145.6%   ↑ 18.8%   ↑ 17.9%
vs.Transformer   ↑ 78.7%   ↓ 19.9%   ↓ 3.3%    ↑ 69.9%   ↑ 14.9%   ↑ 18.5%   ↑ 163.9%   ↑ 23.9%   ↑ 24.6%
-Reranking       0.239     0.463     0.298     0.367     0.941     0.441     1.051      0.049     0.208
-Joint learning  0.238     0.467     0.295     0.340     0.943     0.449     0.772      0.044     0.184
-Reranking       0.223     0.460     0.317     0.326     0.906     0.432     0.861      0.045     0.183
5.3 Experiment Settings
Our models and baselines are implemented in PyTorch (footnote 2). The sizes of the embeddings and hidden states in our encoder-decoder framework are set to 768. We use a 1-layer bidirectional GRU with a hidden size of 768 to encode the opinions for predicting the distribution of attributes. We use another 1-layer bidirectional GRU with a hidden size of 768 to encode the decoder output embeddings of the generated opinions in joint learning. We tune α and β based on the performance of our model on Dev in terms of the automatic metrics. We use the Adam optimizer [6] to train the models with a learning rate of 1e-4. All trainable models except LibFM are trained for 50 epochs; LibFM is trained for 100 epochs because it needs more epochs to converge.

5.4 Evaluation Methodology
We conduct automatic and human evaluations to compare our approach with the baselines. In the automatic evaluation, we employ NIST [2], Distinct-1 and Distinct-2 [8] as metrics. NIST and BLEU are two variants of N-gram scoring metrics that are widely used in machine translation; NIST gives larger weights to N-grams that are more informative. Distinct-1 and Distinct-2 measure the diversity of generated sentences as the ratio of unique unigrams and bigrams. In the human evaluation, we recruited three human annotators who are independent of the authors and do not major in computer science. Each sentence is judged by the following criteria:

– Good (+2): The sentence is fluent. The opinion exactly expresses the given attitude. And the opinion is interesting and appropriate.
– Fair (+1): The sentence is fluent. The opinion exactly contains the given attitude. The opinion is not interesting.
– Bad (+0): The sentence cannot be understood. Or the generated opinion is not consistent with the given attitude or not reasonable in terms of facts.

The annotators also judged whether a sentence is specific or not. The annotators completed the two tasks with Fleiss’ kappa [4] of 0.379 and 0.411, which indicate fair and moderate agreement, respectively.

Table 3. Generated results by Transformer (left) and our knowledge enhanced model KNOG (right).

Footnote 2: https://pytorch.org.
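The Distinct-n metric used in the automatic evaluation can be sketched as follows. This is a minimal illustration that assumes a corpus-level ratio (unique n-grams over all generated sentences divided by the total number of n-grams) and pre-tokenized sentences; the paper does not state whether the ratio is computed per sentence or over the whole output, and the function name is hypothetical.

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over tokenized sentences."""
    total = 0
    unique = set()
    for tokens in sentences:
        # Slide a window of length n over the sentence.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

generated = [
    ["[entity]", "is", "good"],
    ["[entity]", "is", "great"],
]
dist1 = distinct_n(generated, 1)  # 4 unique unigrams out of 6
dist2 = distinct_n(generated, 2)  # 3 unique bigrams out of 4
```

Outputs that reuse the same generic pattern share most of their n-grams, so they lower both ratios, which is why the metric serves as a proxy for diversity.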
5.5 Result and Analysis
Table 2 shows all experimental results. Att2Seq outperforms the other baseline models by generating more Good opinions and fewer Bad opinions. In the human evaluation, our model KNOG outperforms Att2Seq by 6.9% and 9.3% in terms of average score and nDCG, respectively. In the automatic evaluation, our model also significantly outperforms Att2Seq in terms of NIST, Distinct-1 and Distinct-2, by 145.6%, 18.8% and 17.9%, respectively. Our model is based on Transformer, and it also outperforms Transformer in terms of the metrics used in our experiment. KNOG generates 78.7% more high-quality opinion sentences labeled as Good (+2) and 3.3% fewer Bad opinion sentences. The knowledge enhancement also enables the model to generate more specific, interesting and coherent opinions. The comparison between KNOG and Transformer shows that our model indeed promotes the diversity of generated opinions.

To further study the impact of Reranking and Joint Learning, we conduct an ablation study. The last three rows of Table 2 show the ablation results. We find that Reranking and Joint learning each make the model generate more Good opinion sentences and fewer Bad opinion sentences, and combining them enhances these effects. Reranking appears to improve the model in terms of NIST, Avg and nDCG but slightly reduces diversity. In general, reranking also improves the overall performance. Table 3 shows some cases generated by a baseline and by KNOG. We find that KNOG generates more specific opinions and can produce detailed attributes of the entity.
6 Conclusion and Future Work
In this paper, we propose a knowledge enhanced opinion generation model, built on a transformer-based encoder-decoder model, to address the problem of generating opinion sentences from a given attitude. We leverage a knowledge base and descriptions to extend entity names to tags, and we integrate knowledge graph embedding methods into our model to further exploit the knowledge graph. Moreover, we propose reranking and joint learning to enhance the knowledge in generated opinions. Experimental results show that our model significantly improves the generated opinions by increasing Good opinions and decreasing Bad opinions at the same time. As future work, we plan to investigate how to combine the knowledge graph with the main model more closely.
References

1. Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held 5–8 December 2013, Lake Tahoe, Nevada, United States, pp. 2787–2795 (2013)
2. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 138–145. Morgan Kaufmann Publishers Inc. (2002)
3. Dong, L., Huang, S., Wei, F., Lapata, M., Zhou, M., Xu, K.: Learning to generate product reviews from attributes. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 623–632 (2017)
4. Fleiss, J.L., Cohen, J.: The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Measur. 33(3), 613–619 (1973)
5. Hammond, D.K., Vandergheynst, P., Gribonval, R.: Wavelets on graphs via spectral graph theory. CoRR abs/0912.3848 (2009)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)
7. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907 (2016)
8. Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. abs/1510.03055 (2015). http://arxiv.org/abs/1510.03055
9. Li, J., Galley, M., Brockett, C., Spithourakis, G.P., Gao, J., Dolan, W.B.: A persona-based neural conversation model. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, 7–12 August 2016, Volume 1: Long Papers (2016)
10. Liu, B.: Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, Cambridge (2015)
11. Marcheggiani, D., Titov, I.: Encoding sentences with graph convolutional networks for semantic role labeling. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, 9–11 September 2017, pp. 1506–1515 (2017)
12. Mei, H., Bansal, M., Walter, M.R.: What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. arXiv preprint arXiv:1509.00838 (2015)
13. Rendle, S.: Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. 3(3), 57:1–57:22 (2012)
14. Shen, X., et al.: A conditional variational framework for dialog generation. abs/1705.00316 (2017). http://arxiv.org/abs/1705.00316
15. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 6000–6010 (2017)
16. Xing, C., et al.: Topic aware neural response generation. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 4–9 February 2017, pp. 3351–3357 (2017)
17. Zeng, Z., Song, R., Lin, P., Sakai, T.: Attitude detection for one-round conversation: jointly extracting target-polarity pairs. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 285–293. ACM (2019)
18. Zhang, H., Lan, Y., Guo, J., Xu, J., Cheng, X.: Reinforcing coherence for sequence to sequence model in dialogue generation. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018, pp. 4567–4573. ijcai.org (2018)
19. Zhang, R., Guo, J., Fan, Y., Lan, Y., Xu, J., Cheng, X.: Learning to control the specificity in neural response generation. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, 15–20 July 2018, Volume 1: Long Papers, pp. 1108–1117. Association for Computational Linguistics (2018)
20. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: I have a dog, do you have pets too? abs/1801.07243 (2018). http://arxiv.org/abs/1801.07243
21. Zhou, G., Luo, P., Cao, R., Lin, F., Chen, B., He, Q.: Mechanism-aware neural machine for dialogue response generation. In: Singh, S.P., Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 4–9 February 2017, pp. 3400–3407. AAAI Press (2017)
MTNE: A Multitext Aware Network Embedding for Predicting Drug-Drug Interaction

Fuyu Hu (1,2), Chunping Ouyang (1,2)(B), Yongbin Liu (1,2), and Yi Bu (3)

1 School of Computer, University of South China, Hengyang 421001, Hunan, China
2 Hunan Medical Big Data International Sci. & Tech. Innovation Cooperation Base, Changsha, China
3 Department of Information Management, Peking University, Beijing 100871, China
Abstract. Identifying drug-drug interactions (DDIs) is an important research topic in drug discovery. Accurate predictions of DDIs reduce unexpected interactions during the drug development process and play a significant role in drug safety surveillance. Many existing methods use drug properties to predict the unobserved interactions between drugs. However, semantic relations between drug features have seldom been considered, which has resulted in low prediction accuracy. In addition, incomplete annotated data and sparse drug characteristics have greatly hindered the performance of DDI predictions. In this paper, we propose a network embedding method named MTNE (MultiText Aware Network Embedding) that considers multiple external information sources. MTNE learns the dynamic representation of the drug description and the pharmacodynamics through a mutual attention mechanism. It effectively maps a high-dimensional drug-drug interaction network to low-dimensional vector spaces by taking advantage of both the textual information of drugs and the topological information of the drug-drug interaction network. We conduct experiments on the DrugBank dataset. The results show that MTNE improves the performance of DDI predictions with an AUC value of 76.1% and outperforms other state-of-the-art methods. Moreover, MTNE can also achieve high-quality prediction results on sparse datasets.

Keywords: Drug-drug interaction · Network embedding · Text information · Topological information · Dynamic representation
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 306–318, 2020. https://doi.org/10.1007/978-3-030-60450-9_25

1 Introduction

Drugs are substances that are used to prevent, treat, or diagnose diseases. Interactions among drugs may occur when people take multiple drugs simultaneously. In some cases, these interactions include adverse reactions, such as death and allergic reactions, and they may seriously threaten human life and safety. Thus, accurately identifying drug-drug interactions (DDIs) is extremely important for improving medical quality and for establishing safe and effective combinations of drugs. In general, wet methods [11] for identifying DDIs, such as in vitro methods, in vivo experiments, and clinical trials, are time-consuming and labor-intensive [11]. In the past, text mining and statistical methods [3,12] have been used to detect whether an increased risk of certain adverse events [23] is caused by a combination (interaction) of two drugs by analyzing insurance claim databases, spontaneous reports, the public literature, and electronic medical records. However, these methods cannot be used to predict drug-drug interactions. In recent years, to take full advantage of domain knowledge, a large number of machine learning methods have been proposed, which are mainly based on learning and calculating drug features. For example, Kastrin et al. [5] considered the similarity of drug semantic features [...] and cannot consider the semantic relationship of the context and neglect the different roles that drugs play when they interact (as illustrated in Fig. 1). Therefore, we propose a multitext aware network embedding method (MTNE) for predicting DDIs.
Fig. 1. An example of the descriptive information of a drug network (Inside the box is the descriptive information of the drug, and the red and purple fonts represent the attention of the drug on the left and right, respectively.) (Color figure online)
In our proposed MTNE, we build a drug interaction network. We conduct DDI prediction experiments on the DrugBank dataset. The results show that MTNE can effectively perform network embedding with text information, achieving high accuracy and outperforming other benchmark methods. The main contributions of this paper are as follows:

(1) We leverage the drug textual descriptions and drug textual pharmacodynamic information within the network embedding model to alleviate the limitation of inaccurate or incomplete public data sources. Moreover, our method learns dynamic embeddings with a mutual attention mechanism and is expected to acquire the semantic relationships from textual information more precisely.
(2) We propose a multitext aware network embedding method for precisely modeling the relationship between drugs, which takes into account both the textual information of drugs and the topological information of the drug-drug interaction network.
(3) Given the sparseness of the drug dataset, our experimental results show that as the training set shrinks, MTNE achieves higher prediction accuracy than some current methods.
2 Related Work
Existing methods to predict DDIs can be divided into two types: those that consider the domain knowledge of drugs, and those that consider drug features.

Considering Domain Knowledge of Drugs: Zhang et al. [22] proposed an integrative label propagation framework that integrates the side effects extracted from the package inserts of prescription drugs to predict DDIs. This method uses text-based knowledge about the drug. Shen et al. [16] proposed a knowledge-oriented, feature-driven method that learns drug-related knowledge with an accurate representation (KMR). However, this method does not consider dynamic representations based on different interacting objects, nor does it consider the structural information of the network.

Considering Drug Features: Previous DDI studies focus more on the features of drugs, such as their molecular structures, enzymes, and pathways, and prediction methods based on drug features are now the more common approach. Ryu et al. [14] proposed a deep learning method based on drug names and structures to predict drug-drug interactions. Zhang et al. [24] proposed a sparse feature learning ensemble method with linear neighborhood regularization, abbreviated as SFLLN, to predict DDIs. However, methods that use drug characteristics do not consider the topological information of the drug-drug interaction network or the knowledge of the drug. Other DDI prediction methods calculate the similarity of drugs from multiple drug characteristics. Rohani et al. [13] proposed a neural network-based method for DDI prediction. Takako Taketa et al. [17] proposed predicting DDIs through drug structural similarities and interaction networks. Both methods predict DDIs by calculating the similarity of drug characteristics, and neither considers the topological structure of the drug network.

In recent years, a large number of network embedding methods have been proposed to learn efficient vertex embeddings, such as Node2vec [2] and LINE [18]. These methods represent high-dimensional networks with low-dimensional vectors while maintaining the relevant topological structure of the original network. Later, researchers added the text information of nodes to the network, further enhancing its representation ability; for example, Tu et al. [19] presented context-aware network embedding (CANE). These methods are usually applied to social networks [20], scholar networks, etc. Because drug networks are structurally very similar to these networks, we try to solve the DDI prediction problem in the same way. Hence, we apply a network embedding method to the prediction of drug interactions for the first time and name it MTNE. The greatest advantage of a method based on network embedding is that it can make full use of the structural information of the drug network.
3 Method

3.1 Problem Formulation
First, we give the basic symbols and definitions of MTNE. Network embedding (NE), i.e., network representation learning (NRL), learns a low-dimensional embedding $v \in \mathbb{R}^d$ for each vertex $v \in V$ of a high-dimensional complex network according to its network topology and external information, where $d \ll |V|$ is the dimension of the representation space. Assume there is a network of drug interactions, $I = (V, E, T)$, where V represents the set of drug vertices, $E \subseteq V \times V$ indicates the interactions between drugs, and $T = t_1 + t_2$, where $t_1$ represents the descriptive information of a drug and $t_2$ represents its pharmacodynamic information. $e_{v_x,v_y}$ indicates an interaction between two drugs $(v_x, v_y)$ with an associated weight $w_{v_x,v_y}$, and $T_x$ and $T_y$ denote the text information of the two drugs. For a specific drug vertex $v \in V$, we can represent its text as a word sequence $S_t = \{w_1, w_2, \ldots, w_n\}$, where $n = |S_t|$ is the number of words in $S_t$. We now introduce two important definitions.

Definition 1: Topology-based embedding. The relationships between given nodes and their neighbors are expressed so as to preserve the topological structural characteristics of the network. We represent it as $V^s$.

Definition 2: Dynamic text embedding. MTNE learns different embeddings of a vertex according to the external information of the vertices it interacts with. Specifically, for an edge $e_{v_x,v_y}$, MTNE learns the dynamic embeddings $v_{x(v_y)}$ and $v_{y(v_x)}$. We represent it as $V^t$.

Fig. 2. Overall architecture of the proposed model (We take the drug pair and the external information of the drug as input. After the presentation layer, we get two different embeddings. We connect them to the final drug embedding.)

3.2 Overall Framework
MTNE makes full use of the topological information and the related text information of the drug-drug interaction network. We propose two embeddings for each drug vertex v: the topology-based embedding $v^s$ and the text-based embedding $v^t$. We combine the two types of embeddings into the final vertex embedding $v = v^s \oplus v^t$, where $\oplus$ indicates the concatenation operator. In the following sections, we introduce each of the two types of embeddings in detail. The overall architecture of the model is illustrated in Fig. 2.

3.3 Topology-Based Embedding
The Important Node in the Generation Path. The important node is the next hop in the process of reducing the dimensions of the drug-drug interaction network. The important node in the generation path for each vertex in the network is denoted as I. For an anchor vertex $v_i \in V$, the value of the vertex importance for $v_i$ is represented as $I(v_i)$.

Network Embedding. In drug interaction networks, we define the topology-based embedding as $V^s$, which captures the information of the network topology. Specifically, network embedding maps the network data into a low-dimensional latent space and learns a low-dimensional embedding $v \in \mathbb{R}^d$ according to the topological information of the drug-drug interaction network, where $d \ll |V|$ is the dimension of the latent representation space. In this paper, we embed the DDI network into a low-dimensional space, which is useful for the analysis of drug interactions. During this process, the topological structure and properties are encoded and preserved. We define the topology-based objective function as follows:

$$E_s = w_{v_x,v_y} \times \log p(v_x^s \mid v_y^s) \tag{1}$$

This objective uses the topology-based embeddings to measure the log-likelihood of a directed edge. Following LINE [18], we model the conditional probability of $v_x$ generated by $v_y$ in Eq. (1) as:

$$p(v_x^s \mid v_y^s) = \frac{\exp(v_x^s \cdot v_y^s)}{\sum_{v_z \in V} \exp(v_z^s \cdot v_y^s)} \tag{2}$$

The formula can be interpreted as the probability of detecting the edge from $v_y$ to $v_x$, which represents the reconstructed distribution. We use the softmax function to calculate this probability.
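The topology-based objective can be illustrated with a small Python sketch. It assumes the usual LINE-style normalization, in which the softmax runs over all vertices for the conditioning vertex $v_y$, and it sums the per-edge term of Eq. (1) over a list of weighted edges; the summation, the vertex names, the embedding values, and the function names are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def edge_log_prob(emb, x, y):
    """log p(v_x | v_y): log-softmax over all vertices z of emb[z] . emb[y]."""
    scores = {z: dot(emb[z], emb[y]) for z in emb}
    top = max(scores.values())  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[x] - log_norm

def topology_objective(emb, weighted_edges):
    """Sum of w_{x,y} * log p(v_x | v_y) over the given (x, y, w) edges."""
    return sum(w * edge_log_prob(emb, x, y) for x, y, w in weighted_edges)

# Toy 2-dimensional structure embeddings for three hypothetical drug vertices.
emb = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.8, 0.2],
    "drug_c": [0.0, 1.0],
}
edges = [("drug_b", "drug_a", 1.0), ("drug_a", "drug_b", 1.0)]
objective = topology_objective(emb, edges)
```

Because every term is a log-probability, the objective is never positive; training would move the embeddings of interacting drugs closer so that the value rises toward zero. The normalization over all vertices is the expensive part for a large network.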
3.4 Dynamic Text Embedding
As mentioned earlier, each vertex should have a different focus depending on the particular vertex it interacts with, which produces a dynamic representation. To achieve this dynamic representation, we use a mutual attention mechanism [15], a popular model for machine translation. This mechanism enables the CNN pooling layer to attend to the text of the other node connected by the edge when generating the text representation of a node. We studied the application of different neural networks to text modeling, such as CNN [7] and GRU [1], and found from the experimental results that CNN performs best because it can capture the local semantic dependencies between words. Thus, we use a CNN to implement the text-based embedding. Figure 3 shows the framework of the dynamic text embedding. In the following, we describe how the dynamic embedding is generated.
Fig. 3. An illustration of dynamic text embedding for the drug node
Encoder and Looking-Up. First, we map all words in the text network to a sequence of word IDs. Thus, we get an ID sequence $L = (n_1, n_2, \ldots, n_i)$ for $t \in T$. Then, we feed each word ID $n_i \in L$ into the looking-up layer and obtain two vector sequences $T_i = (t_i, \ldots, t_{i,m/2}, \ldots, t_{i,n/2})$ and $T_j = (t_j, \ldots, t_{j,m/2}, \ldots, t_{j,n/2})$.

Concatenation. After getting the vector sequences $T_i$ and $T_j$, we concatenate them to obtain the final matrix sequence $T$:

$$T = T_i \oplus T_j \tag{3}$$
F. Hu et al.
Convolution. Consider two interacting drug nodes, $drug_x$ and $drug_y$. Each drug has two external pieces of knowledge with two corresponding text sequences $T_1$ and $T_2$. After the processing described above, we obtain two feature matrices $D_x \in \mathbb{R}^{d \times m}$ and $D_y \in \mathbb{R}^{d \times n}$. Here, $m$ and $n$ represent the sum of the lengths of $T_1$ and $T_2$, respectively. By introducing the mutual attention matrix $F \in \mathbb{R}^{d \times d}$ mentioned above, we get a correlation matrix $CM \in \mathbb{R}^{m \times n}$ for the two nodes, which is computed as follows:

$$CM = \tanh(D_x^{T} \cdot F \cdot D_y) \tag{4}$$
Max-Pooling. Here, each element $CM_{i,j}$ represents the relevance score between $D_x^i$ and $D_y^j$. We then perform row pooling and column pooling along the rows and columns of $CM$ to obtain importance vectors. We tested both max-pooling and mean-pooling in our experiments and found that mean-pooling performs better. The mean vectors of $D_x$ and $D_y$ are denoted $m^{d_x}$ and $m^{d_y}$, and we obtain them with the mean-pooling operation as follows:

$$m_i^{d_x} = \mathrm{mean}(CM_{i,1}, CM_{i,2}, \ldots, CM_{i,n}), \qquad m_j^{d_y} = \mathrm{mean}(CM_{1,j}, CM_{2,j}, \ldots, CM_{m,j}) \tag{5}$$
Then, we use the SoftMax function to convert the mean vectors $m^{d_x}$ and $m^{d_y}$ into attention vectors $f^{d_x}$ and $f^{d_y}$. The i-th element of $f^{d_x}$ is as follows:

$$f_i^{d_x} = \frac{\exp(m_i^{d_x})}{\sum_{j \in [1,m]} \exp(m_j^{d_x})} \tag{6}$$
Finally, calculate the dynamic text representation of $d_x$ and $d_y$ as follows:

$$(d_x^t \mid d_y) = D_x \cdot f^{d_x}, \qquad (d_y^t \mid d_x) = D_y \cdot f^{d_y} \tag{7}$$
The final representation of a vertex pair $(d_x, d_y)$ consists of a dynamic text embedding and topology-based embedding as follows:

$$d_x = d_x^s \oplus d_x^{t(d_y)}, \qquad d_y = d_y^s \oplus d_y^{t(d_x)} \tag{8}$$
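The attention steps of Eqs. (4) to (8) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the matrices are toy values, the identity matrix stands in for the learned attention matrix F, and all function and variable names are hypothetical rather than taken from the authors' implementation.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain-Python product of a (p x q) and b (q x r), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mutual_attention(d_x, d_y, f):
    """d_x is d x m, d_y is d x n, f is d x d."""
    scores = matmul(matmul(transpose(d_x), f), d_y)             # m x n
    cm = [[math.tanh(s) for s in row] for row in scores]         # Eq. (4)
    m_x = [sum(row) / len(row) for row in cm]                    # row means, length m
    m_y = [sum(col) / len(col) for col in zip(*cm)]              # column means, length n
    f_x, f_y = softmax(m_x), softmax(m_y)                        # Eq. (6)
    t_x = [sum(w * v for w, v in zip(f_x, row)) for row in d_x]  # D_x . f_x, Eq. (7)
    t_y = [sum(w * v for w, v in zip(f_y, row)) for row in d_y]  # D_y . f_y, Eq. (7)
    return t_x, t_y, f_x, f_y


# Toy inputs (hypothetical): d = 2, m = 3 words for drug x, n = 2 words for drug y.
D_x = [[0.1, 0.4, -0.2], [0.3, -0.1, 0.5]]
D_y = [[0.2, -0.3], [0.6, 0.1]]
F = [[1.0, 0.0], [0.0, 1.0]]  # identity used in place of a learned matrix

t_x, t_y, f_x, f_y = mutual_attention(D_x, D_y, F)
s_x = [0.05, -0.02]  # hypothetical topology-based embedding of drug x
final_x = s_x + t_x  # Eq. (8): list concatenation plays the role of the ⊕ operator
```

The text representation of drug x depends on D_y through the correlation matrix, so the same drug receives a different text embedding for each neighbor it is paired with.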
Then, we optimize MTNE to achieve the best performance. For Eq. (1), the purpose of MTNE is to maximize the conditional probability between $d_x$ and $d_y$. Because the SoftMax function is used for all vertices, the calculation costs are high. To solve this problem, we use the negative sampling [10] method to approximate the objective function as the following form:

$$\log \sigma(v_y^{T} \cdot v_x) + \sum_{x=1}^{m} E_{z \sim P(v)}\left[\log \sigma(-v_y^{T} \cdot z)\right] \tag{9}$$
Here, $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function, and $m$ is the number of negative sampled vertices. The noise distribution is $P(v) \propto d_v^{3/4}$, where $d_v$ is the out-degree of $v$. Finally, we use the Adam [8] algorithm to optimize Eq. (9) and set the learning rate to 0.001.
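To make the negative-sampling approximation concrete, the sketch below draws m noise vertices with probability proportional to the out-degree raised to the power 3/4 and evaluates the resulting objective, using one sampled vertex per expectation term. The embeddings, out-degrees, seed, and function names are hypothetical, and this sketch does not filter out the positive vertex when sampling.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def noise_distribution(out_degree):
    """P(v) proportional to out-degree ** 0.75, normalized to sum to 1."""
    weights = {v: deg ** 0.75 for v, deg in out_degree.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def negative_sampling_objective(v_x, v_y, negatives):
    """log sigma(v_y . v_x) plus one log sigma(-v_y . z) term per sampled z."""
    positive = math.log(sigmoid(dot(v_y, v_x)))
    negative = sum(math.log(sigmoid(-dot(v_y, z))) for z in negatives)
    return positive + negative


# Hypothetical embeddings and out-degrees for four vertices.
emb = {"a": [0.2, 0.1], "b": [0.4, -0.3], "c": [-0.1, 0.5], "d": [0.3, 0.3]}
out_degree = {"a": 8, "b": 1, "c": 2, "d": 4}

noise = noise_distribution(out_degree)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
m = 3                   # number of negative samples
names = list(noise)
sampled = rng.choices(names, weights=[noise[v] for v in names], k=m)
objective = negative_sampling_objective(emb["a"], emb["b"],
                                        [emb[v] for v in sampled])
```

Each evaluation touches only m + 1 vertices instead of the whole vertex set, which is the source of the cost saving over the full softmax.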
4 Experiments

4.1 Datasets and Evaluation Metrics
The DrugBank [21] database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target and drug-drug interaction information. We use the latest release of DrugBank (version 5.1.4 released on 2019-07-02).

Table 1. Benchmark dataset in this work

Drug   Interactions   Drug knowledge
                      Description   Pharmacodynamics   Target   Enzyme   Substructure   Pathway
615    53521          615           583                693      150      881            279
KEGG [4] is a database that integrates genomic, chemical, and system function information. It is one of the most commonly used biological information databases in the world. Because the original data contains much noisy information, we performed data preprocessing and obtained 615 drugs with interactions. Then, we extracted the descriptive information and pharmacodynamic information for each drug. In addition, for the comparative experiments, we also obtained other information about the drugs from DrugBank and KEGG, such as targets, structures, pathways, and enzymes, of which the pathways of the drug are only applicable to 615 drugs. The details are shown in Table 1. To measure the prediction performance, we use the AUC (area under the ROC curve) evaluation metric.

4.2 Comparison with State-of-the-Art Methods
In the experiment, we use the following advanced DDI prediction methods for comparison. SFLLN [24] maps the different features of drugs into a common interaction space through sparse feature learning, and then uses linear neighborhood regularization to describe the interactions between drugs in order to predict them. CANE [19] learns context-aware embeddings of vertices using a mutual attention mechanism in order to model the semantic relationships between vertices more accurately. Node2vec [2] improves the random walk strategy: it defines two parameters p and q and introduces a biased random walk procedure that combines breadth-first and depth-first sampling strategies. We also use some traditional methods for comparison. Katz [6] considers all the paths between two vertices, gives larger weights to short paths and smaller weights to long paths, and then makes a prediction based on path similarity. LHN-1 [9] is a link prediction method based on the similarity of local information; the simplest indicator of such similarity is having common neighbors (CNs).
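The path-weighting idea behind the Katz baseline can be illustrated with a short Python sketch that sums damped counts of paths (walks) of increasing length between two vertices. The damping factor beta, the truncation at max_len, and the toy graph are assumptions made for illustration; the original index is defined as an infinite series, and the setting used in the experiments is not specified here.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_scores(adj, beta=0.1, max_len=4):
    """Truncated Katz index: sum over l = 1..max_len of beta**l * A**l."""
    n = len(adj)
    power = [row[:] for row in adj]          # A**1
    scores = [[0.0] * n for _ in range(n)]
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = matmul(power, adj)           # move on to the next path length
        weight *= beta                       # longer paths get smaller weights
    return scores


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = katz_scores(adj)
```

In this toy graph, vertex 0 is closer to vertex 2 than to vertex 3, so the pair (0, 2) receives the higher score and would be ranked as the more likely missing link.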
Table 2. AUC values of different models based on different features

%Training edges                   15%   25%   35%   45%   55%   65%   75%   85%   95%
Katz                              15.2  19.1  25.6  33.5  43.7  54.9  65.3  67.3  67.0
LHN-1                             55.7  57.1  56.0  56.3  58.2  57.5  57.9  58.1  58.6
Node2vec                          55.1  57.7  59.5  61.8  62.5  65.6  66.4  66.7  67.3
Target & Enzyme-based SFLLN       50.1  50.5  50.3  50.9  50.6  50.7  51.3  52.9  60.6
Structure & Pathway-based SFLLN   50.3  50.4  50.6  50.5  50.9  51.3  51.6  51.5  55.3
Description based CANE            71.0  72.4  72.8  73.0  73.1  73.9  74.1  74.2  69.9
Pharmacodynamics based CANE       69.8  70.8  71.9  72.9  73.1  73.5  73.6  73.8  73.9
Target & Enzyme-based MTNE        70.5  71.2  72.5  73.2  73.8  74.0  74.1  74.3  74.6
MTNE                              70.5  71.8  73.4  73.6  73.9  74.5  75.1  75.9  76.1
4.3 Performances Based on Different Methods
First, we compare the performances of different methods with different features. During the experiment, we vary the proportion of edges used for training, and the results are shown in Table 2. Note that when only 5% of the edges are used as the training set, most of the vertices are isolated, which makes the results meaningless for all experimental methods; therefore, we omit the experimental results for this ratio. We find that the value of the AUC increases steadily with the number of training edges. According to the experimental results in Table 2, our method performs better than SFLLN for the target and enzyme features. As we predicted, our model performs better in the cases of pharmacological and textual description features than in the case of target and enzyme features. Compared with Node2vec, LHN-1, and Katz, the advantage of MTNE lies in that it uses a large amount of external knowledge about drugs, which greatly improves the prediction accuracy. Generally, it adopts all contributing information (i.e., pharmacological information and description information), and it provides a greater performance boost for DDI predictions.

The previous experiments use the full dataset outlined in Table 1. Nevertheless, to verify whether our method can achieve good results on a small-scale dataset, we randomly took subsets of different sizes from the dataset (53521 records). The SFLLN method we compare with here is based on four features of drugs, including targets, enzymes, chemical substructures, and pathways; however, our method uses only two pieces of text information, the drug description and the pharmacological information.

Table 3. AUC values of different methods on different scale datasets

%Data scale   10%   20%   30%   40%   50%
SFLLN         51.2  64.1  79.6  82.6  86.1
MTNE          66.9  70.8  72.1  73.8  73.9
As shown in Table 3, MTNE has AUC values of 66.9% and 70.8% for 5352 and 10704 pairs of interactions, respectively. At these two scales, our method outperforms the current best system (SFLLN) by more than 6 percentage points. Here, we use up to 50% of the original data set for experiments. We believe that if the data set continues to increase in size, it will no longer be sparse data. Note that when we use 30% or more of the data set, the AUC value of our model is lower than that of the SFLLN method because the SFLLN method uses more drug features; therefore, as the size of the data set increases, its AUC value increases more markedly.

4.4 Performance Based on Different Datasets
To illustrate whether our model works well on datasets in other domains, we also performed experiments on the following datasets.

Table 4. Statistics of the datasets

Datasets   Cora   Hep-Th   Zhihu   DrugBank
Vertices   2277   1038     10000   615
Edges      5214   1990     43894   53521
Labels     7      –        –       –
Table 5. AUC values of different datasets

%Training edges   15%   25%   35%   45%   55%   65%   75%   85%   95%
Cora              86.5  91.4  93.1  93.8  94.3  94.8  95.4  96.3  97.5
Hep-Th            89.8  91.1  91.9  92.7  94.3  94.5  95.3  95.6  96.1
Zhihu             56.7  59.2  61.5  64.1  68.7  70.3  71.2  73.1  75.2
DrugBank          70.5  71.8  73.4  73.6  73.9  74.5  75.1  75.9  76.1
Cora is a typical paper citation network that consists of 2277 scientific publications, each belonging to one of seven classes. There are 5214 links in this network. Hep-Th is another citation network that includes all papers in the Hep-Th portion of arXiv. Each paper is identified by a unique arXiv ID. We used some articles with summary information. Zhihu is the largest online Q&A website in China. It connects users from all walks of life, and users have relevant discussions around a topic of interest. We randomly selected 10,000 users and the related topics they discussed. The details of the different datasets are shown in Table 4.

From Table 5, it can be observed that the AUC value for DrugBank is lower than those for Cora and Hep-Th, mainly due to the network they build having many vertexes and few edges, thus making the network sparse. Moreover, the method performs much better on the DrugBank dataset than on the Zhihu dataset. This occurs because the text information in Zhihu is Chinese, the Chinese semantic structure is more complicated, and the number of nodes far exceeds that of DrugBank. From the prediction results on different datasets, we can conclude that our model is robust and generalizes well.

4.5 Experiment Settings
SFLLN has three parameters: α, δ, and γ. During the experiment, we set $\alpha = 10^{-4}$, $\delta = 10^{-1}$, and $\gamma = 1$, which gives the best AUC score. To be fair, we set the embedding dimension to 200 and the learning rate to $10^{-3}$ for all network embedding methods. In Node2vec, we use grid search and select the best p and q values for training.
Fig. 4. Influence of the parameters for link predictions
MTNE also has three parameters, including α, β, and γ, and the parameters may influence the performances of the MTNE method. Here, we consider combinations of parameters: α, β, γ ∈ {0 ∼ 1}. We use different parameter combinations to build prediction models, and analyze the influence of the parameters based on the performance of the prediction models. Through a series of experiments, we finally concluded that the performance of MTNE is the best when α = 0.3, β = 0.3, γ = 0.3. During the experiment, we fix two parameters and discuss the performance of the remaining parameter. The AUC scores of MTNE are shown in Fig. 4.
5 Conclusions
The prediction of drug-drug interactions plays an important role in drug discovery since it can reduce potential adverse reactions between drugs. In this research, we investigated easy-to-obtain external drug knowledge and proposed a multitext aware network embedding method with a mutual attention mechanism to capture the dynamic medical text information, the semantic information of drugs, and the topological structure information of the drug-drug interaction network. MTNE improves the performance compared with existing DDI prediction methods. Compared with other state-of-the-art methods, MTNE achieves a higher AUC on sparse datasets and can address the problem of sparse drug data in public datasets. In the future, we will utilize a heterogeneous network and treat the different information of drugs as different types of vertices to predict DDIs.

Acknowledgements. This research was funded by the National Natural Science Foundation of China, grant number 61402220, the Philosophy and Social Science Foundation of Hunan Province, grant number 16YBA323, the Scientific Research Fund of Hunan Provincial Education Department for excellent talents, grant number 18B279, the key program of Scientific Research Fund of Hunan Provincial Education Department, grant number 19A439, and the Project supported by the Natural Science Foundation of Hunan Province, China, grant number 2020JJ4525.
References

1. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
2. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
3. Jin, H., Li, C., Zhang, J., Hou, L., Li, J., Zhang, P.: XLORE2: large-scale cross-lingual knowledge graph construction and application. Data Intell. 1(1), 77–98 (2019)
4. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M.: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38(Suppl. 1), D355–D360 (2010)
5. Kastrin, A., Ferk, P., Leskošek, B.: Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PLoS ONE 13(5), e0196865 (2018). https://doi.org/10.1371/journal.pone.0196865
6. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
7. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Leicht, E.A., Holme, P., Newman, M.E.: Vertex similarity in networks. Phys. Rev. E 73(2), 026120 (2006)
10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
11. Percha, B., Altman, R.B.: Informatics confronts drug-drug interactions. Trends Pharmacol. Sci. 34(3), 178–184 (2013)
12. Raja, K., Patrick, M., Elder, J.T., Tsoi, L.C.: Machine learning workflow to enhance predictions of adverse drug reactions (ADRS) through drug-gene interactions: application to drugs for cutaneous diseases. Sci. Rep. 7(1), 1–11 (2017)
13. Rohani, N., Eslahchi, C.: Drug-drug interaction predicting by neural network using integrated similarity. Sci. Rep. 9(1), 1–11 (2019)
14. Ryu, J.Y., Kim, H.U., Lee, S.Y.: Deep learning improves prediction of drug-drug and drug-food interactions. Proc. Nat. Acad. Sci. 115(18), E4304–E4311 (2018). https://doi.org/10.1073/pnas.1803294115
15. Santos, C.D., Tan, M., Xiang, B., Zhou, B.: Attentive pooling networks. arXiv preprint arXiv:1602.03609 (2016)
16. Shen, Y., et al.: KMR: knowledge-oriented medicine representation learning for drug-drug interaction and similarity computation. J. Cheminform. 11(1), 22 (2019)
17. Takeda, T., Hao, M., Cheng, T., Bryant, S.H., Wang, Y.: Predicting drug-drug interactions through drug structural similarities and interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge. J. Cheminform. 9(1), 1–9 (2017). https://doi.org/10.1186/s13321-017-0200-8
18. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
19. Tu, C., Liu, H., Liu, Z., Sun, M.: Cane: context-aware network embedding for relation modeling. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1722–1731 (2017)
20. Wan, H., Zhang, Y., Zhang, J., Tang, J.: AMiner: search and mining of academic social networks. Data Intell. 1(1), 58–76 (2019)
21. Wishart, D.S., et al.: DrugBank 5.0: a major update to the drugbank database for 2018. Nucleic Acids Res. 46(D1), D1074–D1082 (2018)
22. Zhang, P., Wang, F., Hu, J., Sorrentino, R.: Label propagation prediction of drug-drug interactions based on clinical side effects. Sci. Rep. 5(1), 1–10 (2015)
23. Zhang, T., Ji, H., Sil, A.: Joint entity and event extraction with generative adversarial imitation learning. Data Intell. 1(2), 99–120 (2019)
24. Zhang, W., et al.: SFLLN: a sparse feature learning ensemble method with linear neighborhood regularization for predicting drug-drug interactions. Inf. Sci. 497, 189–201 (2019)
Machine Learning for NLP
Learning to Generate Representations for Novel Words: Mimic the OOV Situation in Training

Xiaoyu Xing, Minlong Peng, Qi Zhang, Qin Liu, and Xuanjing Huang
School of Computer Science, Fudan University, Shanghai, China
{xyxing18,mlpeng16,qz,liuq19,xjhuang}@fudan.edu.cn
Abstract. In this work, we address the out-of-vocabulary (OOV) problem in sequence labeling using only training data of the task. A typical solution in this field is to represent an OOV word using the mean-pooled representations of its surrounding words at test time. However, such a pipeline approach often suffers from the error propagation problem, since training of the supervised model is independent of the mean-pooling operation. In this work, we propose a novel training strategy to address the error propagation problem suffered by this solution. It is designed to mimic the OOV situation during model training and trains the supervised model to fit the OOV word representations generated by the mean-pooling operation. Extensive experiments on different sequence labeling tasks, including part-of-speech tagging (POS), named entity recognition (NER), and chunking, verified the effectiveness of our proposed method.
Keywords: Word representation · Sequence labeling

1 Introduction
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 321–332, 2020. https://doi.org/10.1007/978-3-030-60450-9_26

Word representation is a fundamental component in neural sequence labeling systems [9,18]. However, natural language yields a Zipfian distribution [30] over words, which means that a significant number of words are rare and out of the training vocabulary (OOV). Learning representations for those rare or OOV words is challenging, since standard end-to-end supervised learning models require multiple observations of a word to obtain a reliable representation [2]. Many works [18,19] have shown that performance on sequence labeling usually drops considerably when encountering OOV words. This is commonly referred to as the OOV problem, which we address in this work.

In the past few years, many methods have been proposed to deal with the OOV problem. A typical solution is to first train the supervised model on the within-vocabulary words, and then represent an OOV word using the mean-pooled representations of its surrounding within-vocabulary words [11,24]. In the following, we refer to this solution as MeanPool. Intrinsically, MeanPool can be seen as a pipeline, with the supervised model and the mean-pooling operation being its two cascaded components. Because the training objective of the supervised model is independent of the pooling operation used for generating the OOV word representations, this solution often suffers from the typical error propagation problem of the pipeline paradigm [6,8], resulting in a functional gap between the generated OOV word representations and the within-vocabulary word representations obtained by supervised training.

To address this problem, some works proposed to replace the heuristic pooling operation with a learnable predicting model. The predicting model was trained to predict the representation of a within-vocabulary word given its context and surface form. For example, [14] applied a learnable linear transformation to the mean-pooled representation to get the representation of an OOV word. Based on this work, [28] additionally modelled the surface form (subword n-grams) of the word for predicting its representation. Though these methods have achieved considerable improvement over MeanPool, they still suffer from the error propagation problem, since the auxiliary objective for training the predicting model is not guaranteed to be compatible with that for training the supervised model.

In this work, we propose a novel method to deal with the error propagation within MeanPool. Different from previous works that rely on training an additional predicting model to achieve this purpose, we instead adjust the training process of the supervised model. Specifically, in the training procedure of the supervised model, we mimic the OOV situation by randomly sampling some within-vocabulary words as OOV words (referred to as fake OOV words). For these fake OOV words, we generate their representations using MeanPool based on the current training state, instead of looking them up in the word embedding table.
These generated representations are then fed as input to the higher sentence modeling layer to get the training signal of the supervised model. At test time, when encountering OOV words, the model directly applies MeanPool to obtain their representations.

We argue that our proposed method has several advantages over previous works. First, it does not introduce any additional parameters for generating word representations. This keeps the method concise while remaining effective. Moreover, because the training of the supervised model is nested with the pooling operation, our proposed method does not suffer from error propagation between supervised learning and the step for generating OOV word representations.

The contributions of this work can be summarized as follows: (i) We address the OOV problem in sequence labeling using only training data of the task. (ii) We propose a novel training strategy to address the error propagation problem suffered by MeanPool, which represents an OOV word with the mean-pooled representations of its surrounding within-vocabulary words. (iii) We perform extensive experiments on different sequence labeling tasks, including part-of-speech tagging (POS), named entity recognition (NER), and chunking. The experimental results verify the effectiveness of our proposed method.
2 Methodology
In the following, we first briefly describe a generic neural architecture for implementing the supervised model. Then, we illustrate a small modification to the mean-pooling based approach. Finally, we show how to adjust the training of the supervised model to narrow the functional gap between the generated word representations and the embeddings of training words.

2.1 Notation and Task Definition
Throughout this work, $V_T$ denotes the training vocabulary, $X = \{w_1, \cdots, w_n\}$ denotes a word sequence, and $Y = \{y_1, \cdots, y_n\}$ denotes a label sequence. For a function $f(w)$ defined on a word, $f(X)$ is a shorthand for $\{f(w_1), \cdots, f(w_n)\}$. The task of sequence labeling is to assign a label sequence $Y$ to a word sequence $X$, and in the supervised setting, we are given a training dataset $D_T = \{(X_1, Y_1), \cdots, (X_n, Y_n)\}$ for model training on $V_T$. The OOV problem refers to the situation in which some words in a testing sequence do not belong to $V_T$.

2.2 Generate OOV Word Representations with Mean-Pooling
We implement the supervised model with the popular BiLSTM-CNN-CRF architecture [18]. Briefly, this architecture represents a word $w$ with the concatenation of a unique word embedding $e_w(w)$ and a vector $e_c(w)$ modeled from its character sequence using a character-level convolutional neural network (CNN). Then a bidirectional long short-term memory network (BiLSTM) [12] is used to model word dependency within a sentence. On top of the BiLSTM, it uses a sequential conditional random field (CRF) [16] to jointly decode labels for the whole sentence. For this architecture, the OOV problem results from the absence of $e_w(w)$ for a word $w$ not occurring in $V_T$. To address this problem, we follow the idea of MeanPool, which represents an OOV word with the mean-pooled representations of its surrounding training words:

$$v_w(w \mid C(w)) = \frac{1}{|C(w) \cap V_T|} \sum_{w' \in C(w) \cap V_T} e_w(w') \tag{1}$$
where C(w) is a context of w. Note that there is a small difference from MeanPool [11]. Unlike previous work, which represents a word w in different contexts with a single consistent representation, we allow the representation of w to differ across contexts. This is mainly motivated by the idea behind [10,22,23], which represent a word differently in different contexts. In our experience, this adjustment generally improves performance. Therefore, we applied this adjustment to all compared methods of this paradigm (i.e., MeanPool, LinearTransform, and GatedTransform defined in Sect. 3.3).
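The mean-pooling step can be sketched in a few lines of Python. The window-based definition of the context, the window size, the toy embedding table, and the behavior when no context word is in the vocabulary are all assumptions of this illustration; the text above only requires that C(w) be some context of w.

```python
def context_window(tokens, index, size=2):
    """Tokens within `size` positions of tokens[index], excluding the token itself."""
    return tokens[max(0, index - size):index] + tokens[index + 1:index + 1 + size]


def mean_pool(context, embeddings):
    """Average the embeddings of the distinct context words found in the vocabulary."""
    known_words = [w for w in dict.fromkeys(context) if w in embeddings]
    if not known_words:
        return None  # assumption: the caller decides how to handle an empty context
    vectors = [embeddings[w] for w in known_words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional embedding table for the training vocabulary.
embeddings = {"the": [1.0, 0.0], "barked": [0.0, 1.0], "loudly": [0.5, 0.5]}
tokens = ["the", "zorblax", "barked", "loudly"]

# "zorblax" is out of vocabulary; represent it from its immediate neighbors.
oov_vector = mean_pool(context_window(tokens, 1, size=1), embeddings)
```

Because the context is taken from the sentence at hand, the same OOV word receives a different vector in each sentence in which it appears.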
2.3 Embed MeanPool into Model Training
According to the discussion in Sect. 1, the above solution for dealing with the OOV problem suffers from the error propagation problem typical of a pipeline. We argue that this is mainly because the training of $e_w(w)$ on the supervised task does not take the mean-pooling post-processing into account. During the training of the supervised model, we do not apply any constraint on $e_w(\cdot)$ to make it reliable to represent an OOV word with its mean-pooled context word representations. Therefore, the MeanPool solution may result in very poor OOV word representations for the supervised model, which can indeed be observed in our experimental results.

To address this problem, in this work, we propose to embed MeanPool into the training process of the supervised model. That is, we train the supervised model, on the one hand, to minimize the supervised task loss L and, on the other hand, to fit the OOV word representations generated by MeanPool. The overall framework of our proposed method is illustrated in Fig. 1. Specifically, let $V_S \subset V_T$ denote a support vocabulary and $V_E \subset V_T$ an evaluation vocabulary, with $V_S \cap V_E = \emptyset$ and $V_S \cup V_E = V_T$. At each training step of the supervised model, we randomly separate $V_T$ into $V_S$ and $V_E$. Then, for a training sentence $X = \{w_1, \cdots, w_m\}$, we represent $w$ at the word level with $e_w(w)$ if $w \in V_S$, and otherwise with $v_w(w)$. This representation is concatenated with the corresponding character-level representation $e_c(w)$ to get the complete representation $v(w)$ of $w$:

$$v(w) = \begin{cases} [e_w(w) \oplus e_c(w)], & \text{if } w \in V_S \\ [v_w(w) \oplus e_c(w)], & \text{otherwise.} \end{cases} \tag{2}$$

The corresponding sentence representation $v(X)$ is then fed into the BiLSTM to get the context representation $h_t$ and finally the conditional probability $p(Y \mid X; \theta, V_S, V_E)$ of a label sequence $Y$ given $X$.

Training Strategy. The training objective of the model can then be formulated as:

$$\theta^{*} = \arg\min_{\theta} \; E_{(V_S, V_E) \sim P} \left[ \sum_{(X,Y) \in D_T} \log p(Y \mid v(X); \theta, V_S, V_E) \right] \tag{3}$$
Here, $(V_S, V_E) \sim P$ is a random separation of $V_T$ according to a distribution $P$, which we describe below. From this definition, we can see that any specific separation of $V_T$ corresponds to a unique optimization task defined by:

$$\theta^{*}(V_S, V_E) = \arg\min_{\theta} \sum_{(X,Y) \in D_T} \log p(Y \mid v(X); \theta, V_S, V_E) \tag{4}$$

Therefore, the proposed method can be seen as a pseudo-ensemble method [3], which implicitly ensembles an infinite number of models (here, $\theta^{*}(V_S, V_E)$) to obtain the final model $\theta^{*}$, just as the dropout technique [29] does. Alternatively, it can also be seen as a multi-task learning method, with each task corresponding to a specific optimization task.

Fig. 1. The overall framework of our proposed method. w1 · · · w5 denote the words in the training sentence. Vs is the support vocabulary and Ve is the evaluation vocabulary, which are separated from the training vocabulary during each step of training.

For simulating the realistic OOV situation, we design $P$ considering the following two principles. First, the ratio $|V_E|/|V_T|$ should be close to the real OOV rate (number of OOV words in the testing data set / number of words occurring in both the training and testing data sets) at test time. Second, a word with a higher frequency in the training set should be less likely to be sampled as an item of $V_E$. Let $z(w)$ denote the frequency of $w$ in the training set. We estimate the OOV ratio $\pi_o$ from the training set with:

$$\pi_o = c(k)/|V_T| \tag{5}$$
Here, $c(k) = \sum_{w \in V_T} I(z(w) \le k)$, where $I$ is the indicator function, denotes the number of words occurring no more than $k$ times in the training set. In this work, we heuristically set $k = 1$. Then, we define the probability of a word $w$ being of $V_E$ with:

$$p(w \in V_E) = \pi_o \, \frac{k / z(w)}{\sum_{w' \in V_T} k / z(w')} \tag{6}$$
We assume that the sampling of a word is independent of the other words. Therefore, $P(V_S, V_E)$ is defined by:

$$P(V_S, V_E) = \prod_{w \in V_S} \bigl(1 - p(w \in V_E)\bigr) \prod_{w' \in V_E} p(w' \in V_E) \tag{7}$$
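A small Python sketch may help to see how the vocabulary split could be sampled. It estimates the OOV ratio from word frequencies, assigns each word a probability of falling into the evaluation vocabulary, draws an independent split, and evaluates the probability of that split. The per-word probability follows one reading of Eq. (6), namely the OOV ratio times k/z(w) normalized over the vocabulary; that reading, the toy frequency table, the seed, and all function names are assumptions of this sketch.

```python
import random


def oov_ratio(freq, k=1):
    """Fraction of vocabulary words occurring at most k times: c(k) / |V_T|."""
    rare = sum(1 for count in freq.values() if count <= k)
    return rare / len(freq)


def eval_probabilities(freq, k=1):
    """Per-word probability of being placed in V_E (assumed reading of Eq. (6))."""
    pi_o = oov_ratio(freq, k)
    weights = {w: k / count for w, count in freq.items()}
    total = sum(weights.values())
    return {w: pi_o * weight / total for w, weight in weights.items()}


def sample_split(p_eval, rng):
    """Draw each word independently; return (support, evaluation) vocabularies."""
    v_e = {w for w, p in p_eval.items() if rng.random() < p}
    v_s = set(p_eval) - v_e
    return v_s, v_e


def split_probability(p_eval, v_s, v_e):
    """Probability of one particular split under independent sampling."""
    prob = 1.0
    for w in v_s:
        prob *= 1.0 - p_eval[w]
    for w in v_e:
        prob *= p_eval[w]
    return prob


# Hypothetical training-set word frequencies.
freq = {"the": 10, "cat": 2, "zorblax": 1, "quux": 1}
pi_o = oov_ratio(freq)
p_eval = eval_probabilities(freq)
v_s, v_e = sample_split(p_eval, random.Random(0))
prob_of_split = split_probability(p_eval, v_s, v_e)
```

Under this reading, rare words are the most likely candidates for the evaluation vocabulary, which matches the second design principle stated above.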
The training process of the model is illustrated in Algorithm 1 and can be summarized as follows. For each training example (X, Y), we randomly separate $V_T$ into $V_S$ and $V_E$ according to $P$. Then, according to the separation, we obtain the sentence matrix representation v(X). Based on v(X), we get the task loss L(X, Y) over (X, Y) under this separation, and update the model parameters θ to minimize L(X, Y).

Algorithm 1: Training process of the proposed method.
Input: training dataset D_T; training vocabulary V_T; vocabulary separation probability P
Output: the supervised model θ
while θ does not converge do
    for (X, Y) ∈ D_T do
        separate V_T into V_S and V_E according to P;
        represent w ∈ V_S with e(w);
        represent w ∈ V_E with v(w);
        get loss L(X, Y) = log p(Y | v(X); θ, V_S, V_E);
        update θ in the direction that minimizes L(X, Y).
    end
end
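The representation step inside one training iteration can be sketched as follows: words in the support vocabulary keep their embeddings, while words treated as fake OOV words are replaced by the mean of their context embeddings before the character-level part is appended. Pooling only over context words in the support vocabulary, the zero-vector fallback, the window size, and the stand-in character function are assumptions of this sketch; the tagger, the loss, and the parameter update are omitted.

```python
def represent_sentence(tokens, word_emb, char_repr, support, window=2):
    """Build v(X): one vector per token, following the two cases of Eq. (2)."""
    dim = len(next(iter(word_emb.values())))
    sentence = []
    for i, word in enumerate(tokens):
        if word in support:
            word_part = word_emb[word]  # regular embedding lookup
        else:
            # Fake OOV word: mean-pool the neighbors that are in the support set.
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            known = [word_emb[c] for c in context if c in support]
            if known:
                word_part = [sum(col) / len(known) for col in zip(*known)]
            else:
                word_part = [0.0] * dim  # assumed fallback for an empty context
        sentence.append(list(word_part) + list(char_repr(word)))
    return sentence


def toy_char_repr(word):
    """Hypothetical stand-in for the character-level CNN output e_c(w)."""
    return [len(word) / 10.0]


# Hypothetical embedding table; "chase" is sampled as a fake OOV word this step.
word_emb = {"dogs": [1.0, 0.0], "chase": [0.3, 0.7], "cats": [0.0, 1.0]}
support = {"dogs", "cats"}
v_x = represent_sentence(["dogs", "chase", "cats"], word_emb, toy_char_repr, support)
```

In a full training loop the pooled vector would be computed from the current embedding parameters, so gradients from the task loss flow back into the context word embeddings and push them toward values that also work as OOV representations.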
3 Experiments
We carry out experiments on three kinds of sequence labeling tasks: part-of-speech tagging (POS), named entity recognition (NER), and chunking. For each task, we consider several datasets with varying OOV rates, where the OOV rate is defined as the percentage of words in the testing data set that never appear in the training data set.

3.1 Dataset
POS. For POS, we conducted experiments on three benchmark datasets in different languages: (1) PTB-English: the Wall Street Journal portion of the English Penn Treebank dataset [20]. We followed the standard splits: sections 2-21 for training, section 22 for validation, and section 23 for testing. (2) GSD-Russian: the Russian Universal Dependencies Treebank v2 (see footnote 1) with the given data splits. (3) RRT-Romanian: the Romanian UD treebank (called RoRefTrees) [4] v2 (see footnote 1) with the given data splits.

NER. For NER, we performed experiments on four benchmark datasets in different languages: (1) CoNLL02-Spanish, (2) CoNLL02-Dutch, (3) CoNLL03-English, and (4) CoNLL03-German. The CoNLL02 dataset [27] was used for the shared task on language-independent named entity recognition and covers Spanish and Dutch. The data is annotated with four entity types: PER, LOC, ORG, and MISC. CoNLL03 [26] is an NER dataset structurally similar to CoNLL02, but its data is in English and German. For these four datasets, we use the official training set for model training, testa for validation, and testb for testing.

Chunking. For chunking, we performed experiments on the CoNLL00-English dataset [25], which was introduced as part of a shared task on chunking in English. It contains 12 different labels (22 with the IOB prefix included). Because the dataset lacks development data, we randomly sampled 10% of the training set for this purpose.

Table 1 reports the OOV rates of these datasets. From the table, we can see that even for the same task, the OOV rate varies considerably across datasets.

Footnote 1: https://universaldependencies.org/.

Table 1. The number of OOV words (for POS), entities (for NER) and phrases (for Chunking) in the testing sets, when treating words that never occur in the training set as OOV words. An entity or a phrase is treated as OOV if it contains at least one OOV word.

Task       Dataset           #OOV   OOV Rate
POS        PTB-English       3240   2.49%
POS        GSD-Russian       2875   24.89%
POS        RRT-Romanian      1516   9.28%
NER        CoNLL02-Spanish   864    24.27%
NER        CoNLL02-Dutch     1822   46.23%
NER        CoNLL03-English   1867   33.03%
NER        CoNLL03-German    1922   52.32%
Chunking   CoNLL00-English   2817   11.81%

3.2 Metrics
We partitioned the test set into a training-word subset and an out-of-vocabulary (OOV) word subset. We consider a word to be OOV if it never appears in the training set. An entity or a phrase is considered to be OOV if it contains at least one OOV word. We report performance on both the whole test set and the OOV subset.

Metrics on the Whole Test Set. Following previous work, we use per-word accuracy (ACC) as the overall metric for POS tagging, and the F1 score (F1) for NER and Chunking [13].

Metrics on the OOV Subset. For POS, the accuracy on the OOV subset (OOV-ACC) is defined as the ratio of the number of correct predictions on occurrences of OOV words to the total number of occurrences of OOV words. For NER and Chunking, the precision (P) on the OOV subset is defined as the ratio of the number of correct predictions on occurrences of OOV entities (referred to as TP in the following) to the number of predicted entities that contain at least one OOV word, the recall (R) is defined as the ratio of TP to the total number of occurrences of OOV entities, and the F1 score (OOV-F1) is defined as 2 × P × R/(P + R).

3.3 Baseline Methods
We compare our proposed method with the following baselines. We study their performance with randomly initialized word embeddings in the “Results” section and with pre-trained word embeddings in the “Ablation study” section.
X. Xing et al.
– RandomEmb: It trains the supervised model on all words occurring in the training set. At test time, if pre-trained word embeddings are not used, it assigns a single consistent random embedding e_w(RAND) to all words that do not occur in the training set. Otherwise, it instantiates the representation of an OOV word with its pre-trained word embedding.
– SingleUNK: It trains on words that occur more than five times within the training set. All other, infrequent words are mapped to a single trainable embedding e_w(UNK), which is learned during model training and assigned to all words not occurring in the training vocabulary.
– MeanPool [11]: Based on RandomEmb, it represents an OOV word at test time using the mean-pooled representation v_w(w) of its surrounding training words.
– LinearTransform [14]: Based on RandomEmb, it applies a linear transformation A to the mean-pooled representation to narrow the gap between v_w(w) and e_w(w). A is trained on the words that occur more than five times within the training set.
– GatedTransform [28]: Based on RandomEmb, it models both the context and the surface form of a word with a prediction network to obtain its representation. Similarly, the prediction network is trained on words that occur more than five times within the training set.
– Char-level LSTM [1]: It models words and context fundamentally as sequences of characters. Instead of using a pre-trained character language model, we simply use a character-level LSTM whose input is the sentence as a character sequence.

3.4 Implementation Details
For data processing, all digits were replaced with the special token “”, and all URLs were replaced with the special token “”. The dimensions of the character embeddings and word embeddings were set to 25 and 100, respectively. All embeddings were randomly initialized. The hidden size of the LSTM was set to 200, and the kernel size of the character-level CNN was set to 25 for kernel widths 3 and 5. Optimization was performed using the Adam step rule [15] with an initial learning rate of 1e−3. To reduce the effect of exploding gradients, we use gradient clipping with a threshold of 5.0 [21]. Early stopping [7] was performed on the development set of each dataset.

3.5 Results
Tables 2 and 3 show model performance on the OOV subset and on the whole test set, for POS tagging and Chunking (Table 2) and for NER (Table 3). From these two tables, we make the following observations. First, on most tasks, the methods that are specifically designed to deal with the OOV problem outperform the RandomEmb baseline. This indicates that it is necessary to deal with the OOV problem in sequence labeling tasks. Second, LinearTransform and GatedTransform, in general, outperform MeanPool. This suggests the effectiveness of a trainable mapping
Table 2. POS (accuracy) and Chunking (F1 score) performance on the OOV subset (OOV-ACC, OOV-F1) and on the whole test set (ACC, F1).

                       POS                                                   Chunking
                       PTB-English       GSD-Russian       RRT-Romanian      CoNLL00-English
Model                  OOV-ACC  ACC      OOV-ACC  ACC      OOV-ACC  ACC      OOV-F1   F1
RandomEmb              86.66    96.96    85.47    95.04    85.02    96.25    85.50    92.37
SingleUNK              89.90    97.12    87.60    95.05    86.91    96.47    88.27    93.34
MeanPool [11]          86.41    97.02    85.91    94.63    84.36    96.22    85.10    92.19
LinearTransform [14]   89.10    97.05    87.75    95.11    85.36    96.35    88.36    93.39
GatedTransform [28]    88.91    97.05    88.31    95.30    85.96    96.45    88.46    93.42
Char-level LSTM [1]    87.34    96.83    86.92    93.63    84.36    95.66    83.27    89.42
Proposed               90.30    97.19    88.38    95.30    87.73    96.43    89.00    93.45
Table 3. NER performance (F1) on the OOV subset and on the whole test set.

                       CoNLL02-Spanish   CoNLL02-Dutch     CoNLL03-English   CoNLL03-German
Model                  OOV-F1   F1       OOV-F1   F1       OOV-F1   F1       OOV-F1   F1
RandomEmb              65.22    80.30    57.22    74.09    76.27    83.54    53.96    64.90
SingleUNK              69.51    80.72    64.83    75.85    77.93    82.07    58.08    65.69
MeanPool [11]          66.65    80.15    59.30    75.03    71.44    82.08    41.97    60.28
LinearTransform [14]   67.45    79.86    59.52    75.25    72.77    82.43    51.93    64.13
GatedTransform [28]    67.56    79.41    62.28    76.40    76.46    83.60    55.47    65.43
Char-level LSTM [1]    64.05    75.03    57.67    72.22    65.12    75.17    53.42    61.38
Proposed               70.90    81.05    65.88    77.65    79.27    84.68    60.93    67.03
onto the MeanPool in narrowing the functional gap between the OOV representations generated by mean-pooling and the learned word embeddings. Third, our proposed method outperforms LinearTransform and GatedTransform on the OOV subset of every dataset. This shows the advantage of our proposed method over LinearTransform and GatedTransform in dealing with the discrepancy problem suffered by MeanPool. Fourth, Char-level LSTM generally performs worse than LinearTransform and GatedTransform on the OOV subsets and achieves the worst overall performance on most datasets. This indicates that, without additional resources such as unlabeled data, it is hard for a char-level LSTM to capture the characteristics of rare words. Fifth, compared with the other baseline methods, SingleUNK achieves better results. We believe this is because SingleUNK also simulates the OOV situation during model training. However, our proposed method outperforms SingleUNK. The reason is that in SingleUNK the training vocabulary is partitioned in a fixed way, whereas in our proposed method there is an unlimited number of possible partitions of the training vocabulary. In other words, SingleUNK simulates the OOV situation using only the infrequent words, while our proposed method simulates it using the whole training vocabulary. Moreover, SingleUNK represents all infrequent words with a shared embedding, whereas the proposed method generates a separate representation for each OOV word. Finally, our proposed method outperforms the compared methods in most cases, verifying its effectiveness in dealing with the OOV problem in the closed test. And from
the improvement in overall performance, we can infer that performance on training words does not suffer when our method is applied.
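The OOV-subset scores reported in Tables 2 and 3 follow the definitions of Sect. 3.2. As a minimal sketch of those definitions (not the authors' evaluation script), the following Python functions compute OOV-ACC for tagging and OOV precision, recall, and F1 for entities or phrases. The function names, the (start, end, type) span encoding with an exclusive end index, and the toy inputs are assumptions made for illustration.

```python
def oov_accuracy(tokens, gold_tags, pred_tags, train_vocab):
    """OOV-ACC: accuracy restricted to tokens that never occur in training."""
    pairs = [(g, p) for tok, g, p in zip(tokens, gold_tags, pred_tags)
             if tok not in train_vocab]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


def oov_prf(tokens, gold_spans, pred_spans, train_vocab):
    """OOV precision, recall, and F1 over (start, end, type) spans.

    A span is OOV if it covers at least one token outside train_vocab.
    """
    def is_oov(span):
        start, end, _ = span
        return any(tok not in train_vocab for tok in tokens[start:end])

    gold_oov = {s for s in gold_spans if is_oov(s)}
    pred_oov = {s for s in pred_spans if is_oov(s)}
    tp = len(gold_oov & pred_oov)  # correct predictions on OOV spans
    precision = tp / len(pred_oov) if pred_oov else 0.0
    recall = tp / len(gold_oov) if gold_oov else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Spans that contain only training words are ignored on both the gold and the predicted side, so the score isolates behavior on entities touched by at least one unseen word.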
4 Related Works
In the past few years, many methods have been proposed to deal with the OOV problem. A popular solution is to represent all OOV words with a single shared embedding. However, this heuristic conflates many words and thus loses the specific information of each OOV word. Another popular solution is to obtain the word representation from its surface form, such as its subword, character, or n-gram sequence [5,17,24]. However, it is difficult for the encoder to capture semantic distinctions among syntactically similar but semantically unrelated words. Therefore, [11,14] and [28] represent an OOV word with its surrounding training words based on the well-trained supervised model. [11] represent the OOV word with the mean-pooled representations of its surrounding words. Though [14] and [28] achieve considerable improvement over mean-pooling, they still suffer from the gap problem, since the auxiliary training objective used to obtain the representation of an OOV word is not guaranteed to be compatible with the training objective of the task.
5 Conclusion
In this work, we propose a novel training strategy that uses only the training data of the task to address the out-of-vocabulary (OOV) problem in sequence labeling. We embed the mean-pooling operation into the training process of the supervised model by randomly selecting training words as OOV words, with the probability related to their frequency. Our proposed method can mitigate the functional gap between the representations of OOV words generated at inference and the learned word embeddings appropriate for a particular downstream task. We verified the effectiveness of our method on three kinds of sequence labeling tasks, including part-of-speech tagging, named entity recognition, and chunking. In particular, we observe that our proposed method works well even without good pre-trained word vectors, which is helpful when dealing with low-resource languages.
References

1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649 (2018)
2. Ataman, D., Federico, M.: Compositional representation of morphologically-rich input for neural machine translation. In: 56th Annual Meeting of the Association for Computational Linguistics, pp. 305–311 (2018)
3. Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in Neural Information Processing Systems, pp. 3365–3373 (2014)
4. Barbu Mititelu, V., Ion, R., Simionescu, R., Irimia, E., Perez, C.: The Romanian treebank annotated according to universal dependencies. In: Proceedings of The Tenth International Conference on Natural Language Processing (HrTAL2016) (2016)
5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
6. Bojarski, M., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
7. Caruana, R., Lawrence, S., Giles, C.L.: Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In: Advances in Neural Information Processing Systems, pp. 402–408 (2001)
8. Caselli, T., et al.: When it’s all piling up: investigating error propagation in an NLP pipeline. In: WNACP@ NLDB (2015)
9. Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long short-term memory neural networks for Chinese word segmentation. In: EMNLP, pp. 1197–1206 (2015)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
11. Herbelot, A., Baroni, M.: High-risk learning: acquiring new word vectors from tiny data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 304–309 (2017)
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
13. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
14. Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., Arora, S.: A La Carte embedding: cheap but effective induction of semantic feature vectors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 12–22 (2018)
15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR), vol. 5 (2015)
16. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic models for segmenting and labeling sequence data (2001)
17. Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1520–1530 (2015)
18. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1064–1074 (2016)
19. Madhyastha, P.S., Bansal, M., Gimpel, K., Livescu, K.: Mapping unseen words to task-trained embedding spaces. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 100–110 (2016)
20. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn treebank. Comput. Linguist. 19(2), 313–330 (1993)
21. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp. 1310–1318 (2013)
22. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
24. Pinter, Y., Guthrie, R., Eisenstein, J.: Mimicking word embeddings using subword RNNs. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 102–112 (2017)
25. Sang, E.F., Buchholz, S.: Introduction to the CoNLL-2000 shared task: chunking. arXiv preprint cs/0009008 (2000)
26. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050 (2003)
27. Sang, E.F.T.K.: Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. Computer Science, pp. 142–147 (2002)
28. Schick, T., Schütze, H.: Learning semantic representations for novel words: leveraging both form and context. arXiv preprint arXiv:1811.03866 (2018)
29. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
30. Zipf, G.K.: Human behavior and the principle of least effort (1949)
Reinforcement Learning for Named Entity Recognition from Noisy Data

Jing Wan1, Haoming Li1, Lei Hou2,3(B), and Juanzi Li2,3

1 Beijing University of Chemical Technology, Beijing, China
[email protected], [email protected]
2 Department of Computer Science and Technology, BNRist, Beijing, China
3 KIRC, Institute for Artificial Intelligence, Tsinghua University, Beijing 100084, China
{houlei,lijuanzi}@tsinghua.edu.cn
Abstract. Named entity recognition (NER) is an important task in natural language processing, and is often formalized as a sequence labeling problem. Deep learning has become the state-of-the-art approach for NER, but the lack of high-quality labeled data remains the bottleneck for model performance. To solve this problem, we employ the distant supervision technique to obtain noisy labeled data, and propose a novel model based on reinforcement learning to revise the wrong labels and distill high-quality data for learning. Specifically, our model consists of two modules, a Tag Modifier and a Tag Predictor. The Tag Modifier corrects the wrong tags with reinforcement learning and feeds the corrected tags into the Tag Predictor. The Tag Predictor makes the sentence-level prediction and returns rewards to the Tag Modifier. The two modules are trained jointly to optimize the tag correction and prediction processes. Experimental results show that our model can effectively deal with noise using a small amount of correctly labeled data and thus outperforms state-of-the-art baselines.
Keywords: Named entity recognition · Noisy data · Reinforcement learning

1 Introduction
Named entity recognition (NER), which aims to recognize the named entities in a given text, is a preliminary and important problem in natural language processing, particularly for relation extraction [2], event extraction [3] and question answering [26]. Numerous methods have been proposed to solve this problem, including Hidden Markov Models (HMM) [1], Support Vector Machines (SVM) [10] and Conditional Random Fields (CRF) [12]. Recently, deep learning methods have yielded state-of-the-art performance [18]. However, deep learning requires large amounts of high-quality annotated data, which is very expensive to obtain as it requires multi-stage pipelines with sufficiently well-trained annotators.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 333–345, 2020. https://doi.org/10.1007/978-3-030-60450-9_27
To overcome this problem, rule-based and semi-supervised approaches have been proposed. But these methods usually result in low recall, and they are difficult to transfer across domains. A promising method is to align a large corpus with domain dictionaries [21] to generate labeled data, which is similar to the idea of distant supervision [17]. Although these approaches are effective at labeling data automatically, they suffer from the wrong label issue. Table 1 presents three typical kinds of noise found in preliminary observations.

– Partially-labeled noise: It occurs when an entity is a long word that only partially exists in the domain dictionary. In the first sentence, the full name of the entity (glossed as “F-22 fighter”) is only partially included in the domain dictionary (as “F-22”), and thus this approach incorrectly labels only “F-22” as the entity.
– Non-labeled noise: It occurs when the sentence contains entities that are not included in the domain dictionary. “F-35” in the second sentence is not included in the dictionary, so it is wrongly marked as a non-entity.
– Ambiguous-labeled noise: It occurs when the domain words in the dictionary are wrong or polysemous. In the third sentence, the word “F22” refers to a display, but it is still labeled as a military entity.
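The dictionary alignment described above, and the way it produces partially-labeled and non-labeled noise, can be illustrated with a minimal sketch. The greedy longest-match strategy, the function name `label_with_dictionary`, and the toy dictionary are assumptions for illustration only; the text does not specify the matching procedure that was actually used.

```python
def label_with_dictionary(tokens, dictionary):
    """Assign BIO tags by greedy longest match against a domain dictionary.

    `dictionary` is a set of entity names, each stored as a tuple of tokens.
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + span]) in dictionary:
                matched = span
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


toy_dictionary = {("F-22",)}  # the full name "F-22 fighter" is missing

# Partially-labeled noise: only "F-22" is tagged, "fighter" is left as O.
partial = label_with_dictionary(["The", "F-22", "fighter", "took", "off"], toy_dictionary)

# Non-labeled noise: "F-35" is not in the dictionary, so nothing is tagged.
missing = label_with_dictionary(["The", "F-35", "landed"], toy_dictionary)
```

Ambiguous-labeled noise arises from the same lookup: a string match cannot tell whether a dictionary word is used in its domain sense.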
Table 1. Examples of three kinds of noise labeled by BIO [19]
The noisy labels will inevitably have a negative impact on the recognition models, and thus it is critical to deal with noisy data. In this paper, we propose a novel framework based on reinforcement learning named RLNER, which is able to utilize a large amount of noisy data for better entity recognition. Specifically, our model consists of two modules, a tag modifier and a tag predictor. The tag modifier tries to revise the noisy labels obtained by distant supervision as correctly as possible, and the tag predictor is to complete the recognition
over the revised inputs. Although we do not have explicit supervision for whether each label is right or not, we can know whether a modification is reasonable by measuring the utility of the whole sentence (estimated by the tag predictor). Such trial-and-error search and delayed reward properties inspire us to formalize the tag modifier as a reinforcement learning process. The tag predictor is a normal sequence labeling task, and any NER model can be applied. In this paper, we choose the widely used architecture with a BiLSTM for encoding and a CRF for labeling. Our contributions in this work can be summarized as follows:

– We propose a novel model for NER, which is able to revise wrong labels and thus make full use of noisy labeled data for model learning.
– We formulate the tag modification as a reinforcement learning problem, which enables it to modify the wrong tags without explicit annotations.
– We construct two noisy datasets for NER, i.e., MIL in the military domain and MSRA in the general domain. Experimental results show that our proposed model significantly and consistently outperforms existing state-of-the-art methods on both datasets.
2 Methodology

2.1 Overview
Our task is to perform NER on noisy data, i.e., labeled sentences with partially wrong tags. More formally, given a sentence composed of a sequence of words $X = x_1, x_2, \ldots, x_n$, with each $x_i$ denoting a word chosen from a vocabulary $V$, and the corresponding tags $T = t_1, t_2, \ldots, t_n$, which are partially wrong, we expect to obtain a sequence of predictions $Y = y_1, y_2, \ldots, y_n$. Note that $t_i$ and $y_i$ are labels based on the BIO schema [19].

Figure 1 illustrates the proposed framework. It consists of two modules, a Tag Modifier, which corrects the tags at the word level, and a Tag Predictor, which performs the tag prediction. The Tag Modifier tries to change the wrong tags to correct ones via a stochastic policy. Specifically, it keeps sampling an action at each state until the end of a sentence to produce an action sequence, and then a new list of tags is generated accordingly. The Tag Predictor predicts the sequence tags and provides the reward computation for the Tag Modifier. Since the reward can be computed once the final representation is available (which is completely determined by the action sequence), the process can be naturally addressed by the policy gradient method [23].

The two modules interact with each other during the training process. The state representation of the Tag Modifier is derived from the Tag Predictor, which in turn relies on the final tags obtained from the Tag Modifier. The Tag Modifier also obtains rewards from the Tag Predictor to guide the learning of its policy.
Fig. 1. Overall process. The Tag Modifier (left) corrects tags according to a policy function. The new tags are used to train a better Tag Predictor (right). The reward computed by the Tag Predictor will help update the policy function in the Tag Modifier.
2.2 Tag Modifier
We cast tag modification as a reinforcement learning problem. The modifier is the agent, which interacts with the environment (i.e., the data and the Tag Predictor). It follows a policy to decide which action to take (modifying the current tag or not) at each state. Specifically, it adopts a stochastic policy and uses a delayed reward to guide policy learning. It samples an action with a probability derived from the state, which encodes the current input and previous contexts. We introduce the state, action and policy, reward, and objective function as follows.

State: The state $s_t$ represents the current input and previous context when deciding the $t$-th tag of the sentence. We represent the state as a continuous real-valued vector

$$s_t = c_{t-1} \oplus h_{t-1} \oplus x_t \tag{1}$$

which encodes the following information: (1) the vector representation of the current word $x_t$; (2) the representation of the previous contexts, i.e., the last hidden state of the LSTM obtained from the sequence tagging network, where $c_t$ is the memory cell and $h_t$ is the hidden state. $\oplus$ indicates vector concatenation.

Action and Policy: The action space $A = \{\text{Retain}, \text{Change}\}$ indicates whether the tag of a word is kept unchanged or not. Each action sequence thus determines the new tags of a sentence. Table 2 gives some examples; note that nearby tags may also be changed when necessary. We adopt a stochastic policy. Let $a_t$ denote the action at state $s_t$; the policy is defined as

$$\pi(a_t \mid s_t; \Theta) = \sigma(W * s_t + b) \tag{2}$$
Table 2. Examples of tag modification (R = Retain, C = Change)

Original tags   Actions      Revised tags
O B I I O       C R R R R    B I I I O
O B I I O       R C R R R    O O B I O
O B I I O       R R C R R    O B O B O
O B I I O       R R R C R    O B I O O
O B I I O       R R R R C    O B I I I
where $\pi(a_t \mid s_t; \Theta)$ denotes the probability of choosing $a_t$, $\sigma$ denotes the sigmoid function, and $\Theta = \{W, b\}$ denotes the parameters of the policy network. During training, the action is randomly sampled according to Eq. 2. During testing, the action with the maximal probability (i.e., $a_t^* = \arg\max_a \pi(a \mid s_t; \Theta)$) is chosen in order to obtain superior prediction.

Reward: The reward function is an indicator of the performance of the corrected tags. The corrected tags are passed to the Tag Predictor for training, and the reward is calculated using the loss on the validation dataset (i.e., a small amount of data without noise). This is a typical delayed reward because the loss of the Tag Predictor on the validation dataset cannot be obtained until the tags of the whole sentence have been modified. The loss is calculated by Eq. 8, which is detailed in the next section.

Objective Function: We optimize the parameters of the policy network using the REINFORCE algorithm [24] and the policy gradient method, aiming to maximize the expected reward as follows.

$$
\begin{aligned}
J(\Theta) &= \mathbb{E}_{(s_t, a_t) \sim P_\Theta(s_t, a_t)} \, r(s_1 a_1 \ldots s_n a_n) \\
&= \sum_{s_1 a_1 \ldots s_n a_n} P_\Theta(s_1 a_1 \ldots s_n a_n) R_n \\
&= \sum_{s_1 a_1 \ldots s_n a_n} p(s_1) \prod_t \pi_\Theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t) R_n \\
&= \sum_{s_1 a_1 \ldots s_n a_n} \prod_t \pi_\Theta(a_t \mid s_t) R_n
\end{aligned}
\tag{3}
$$

Note that this reward is computed over just one sample. Since the current state is fully determined by the previous state and the action, the probabilities $p(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$ are equal to 1. By applying the likelihood ratio trick, we update the policy network with the gradient as follows.

$$\nabla_\Theta J(\Theta) = \sum_{t=1}^{n} R_n \nabla_\Theta \log \pi_\Theta(a_t \mid s_t) \tag{4}$$
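A minimal sketch of the gradient in Eq. 4 for the sigmoid policy of Eq. 2 is given below. It is an illustration under stated assumptions, not the authors' implementation: the sigmoid output is taken to be the probability of the Change action (encoded as 1, with Retain as 0), states are plain feature vectors, one scalar reward $R_n$ is shared by all steps of the sentence, and all function names are hypothetical. For a Bernoulli policy with $p = \sigma(W \cdot s + b)$, the derivative of $\log \pi(a \mid s)$ with respect to the pre-activation is $a - p$, which the code uses directly.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def change_probability(state, weights, bias):
    """Eq. 2 under the assumption that sigma(W * s + b) is P(Change | s)."""
    return sigmoid(sum(w * x for w, x in zip(weights, state)) + bias)


def sample_actions(states, weights, bias, rng):
    """Sample one action per word: 1 = Change, 0 = Retain."""
    probs = [change_probability(s, weights, bias) for s in states]
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, probs


def reinforce_gradient(states, actions, probs, reward):
    """Eq. 4: sum over t of R_n * grad log pi(a_t | s_t) for W and b."""
    grad_w = [0.0] * len(states[0])
    grad_b = 0.0
    for state, action, prob in zip(states, actions, probs):
        delta = action - prob  # d log pi / d pre-activation
        for i, x in enumerate(state):
            grad_w[i] += reward * delta * x
        grad_b += reward * delta
    return grad_w, grad_b


# Toy usage with hypothetical two-dimensional states.
rng = random.Random(0)
toy_states = [[1.0, 0.0], [0.0, 1.0]]
toy_actions, toy_probs = sample_actions(toy_states, [0.0, 0.0], 0.0, rng)
```

A gradient-ascent step then adds a multiple of `grad_w` and `grad_b` to the parameters, so actions that preceded a high reward become more likely.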
2.3 Tag Predictor
The predictor adopts a BiLSTM-CRF architecture, which consists of an input layer, a BiLSTM layer and a CRF layer.

Input Layer: The first step is to map discrete words into distributed representations. For a given sentence $X$, we get an embedding matrix $M = m_1 m_2 \ldots m_n$ by looking up pre-trained models for each word $x_i$ as $w_i$ ($w_i \in \mathbb{R}^d$, where $d$ is the dimension of the embedding). Then these vectors are fed into the next layer.

BiLSTM Layer: Since the BiLSTM has become common ground in the NLP community, we do not present the details and simply denote this layer as

$$P = \mathrm{BiLSTM}(M) \tag{5}$$

where $P \in \mathbb{R}^{n \times k}$ is the output of a linear layer that maps the hidden state of the BiLSTM model from $m$ to $k$ dimensions, with $m$, $k$, and $n$ denoting the number of BiLSTM cells, distinct tags, and words in the sentence, respectively.

CRF Layer: In the CRF layer, we consider $P$ as the score matrix generated by the BiLSTM layer. $P_{i,j}$ corresponds to the score of the $j$-th tag for the $i$-th word in a sentence. For a sequence of predictions $Y$, $y_i$ is the tag modified by the Tag Modifier during training and the original tag during validation or testing. We define its score as:

$$S(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=0}^{n} P_{i, y_i} \tag{6}$$

where $A$ is a square transition matrix of size $k$, with $A_{i,j}$ representing the transition score from tag $i$ to tag $j$. A softmax over all possible tag sequences yields a probability for the sequence $Y$, i.e.,

$$P(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}} \tag{7}$$

and we maximize the log-probability of the modified tag sequence during training:

$$
\begin{aligned}
\log P(Y \mid X) &= S(X, Y) - \log\Big(\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}\Big) \\
&= S(X, Y) - \operatorname*{logadd}_{\tilde{y} \in Y_X} S(X, \tilde{y})
\end{aligned}
\tag{8}
$$

After training, we return the log-probability of the validation set with original tags as the reward to the Tag Modifier. While decoding, we use the Viterbi algorithm to predict the output sequence that obtains the maximum score:

$$y^* = \arg\max_{\tilde{y} \in Y_X} S(X, \tilde{y}) \tag{9}$$
Note that the key difference between our Tag Predictor and other sequence tagging models is that we use the modified tags provided by our Tag Modifier, while others use the original tags. As the correctness of the predicted tags is unknown, we return the performance on the validation set to improve the performance of the next batch.
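The sequence score of Eq. 6 and the Viterbi decoding of Eq. 9 can be sketched in a few lines of plain Python. This is a simplified illustration rather than the authors' code: the special start and end transitions are omitted, `emissions[i][j]` plays the role of $P_{i,j}$, `transitions[a][b]` plays the role of $A_{a,b}$, and the function names and toy scores are assumptions.

```python
def sequence_score(emissions, transitions, tags):
    """Score of one tag sequence: emission scores plus tag-to-tag transitions."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(emissions), len(emissions[0])
    best = list(emissions[0])  # best[j]: best score of a path ending in tag j
    backpointers = []
    for i in range(1, n):
        step, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + transitions[p][j])
            pointers.append(prev)
            step.append(best[prev] + transitions[prev][j] + emissions[i][j])
        best = step
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example with two tags; the 0 -> 1 transition is heavily penalized.
toy_emissions = [[1, 0], [0, 1], [0, 1]]
toy_transitions = [[0, -5], [0, 0]]
toy_path, toy_score = viterbi_decode(toy_emissions, toy_transitions)
```

In the toy example the emission scores alone would favor the sequence 0, 1, 1, but the transition penalty makes staying in tag 1 throughout the better path, which is the kind of label-consistency effect the CRF layer provides.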
Algorithm 1. Overall Training Procedure
1. Initialize the Tag Predictor and the Tag Modifier
2. Pre-train the Tag Predictor by maximizing log p(y_i | x_i)
3. Pre-train the Tag Modifier with Algorithm 2 using the fixed BiLSTM-CRF
4. Jointly train the BiLSTM-CRF and the policy network with Algorithm 2 until convergence
2.4 Training Details
Since Tag Modifier and Tag Predictor are interleaved together, they should be trained jointly. As described in Algorithm 1 [5], the entire training process consists of three major steps. First, we pre-train the Tag Predictor with the noisy data. Then we train the Tag Modifier while keeping the parameters of the Tag Predictor fixed. Finally, we train them jointly. Since training a reinforcement learning model from scratch is extremely difficult and has a high variance, we complete the training in a warm-start manner. For the Tag Predictor, we use the original tags to perform the pre-training.
3 Experiments

3.1 Experimental Setting
Dataset. In order to evaluate the effectiveness of our model, we design experiments with two datasets: our own military data and the MsraNER dataset [14]. The military data, obtained by aligning a military dictionary with military news crawled from Wikipedia, includes the three types of noise mentioned in Sect. 1. There are 2,498 sentences with 3,558 entities in total. For the MsraNER dataset, we inject the three types of noise separately to obtain three noisy versions named MsraNER-N1, MsraNER-N2 and MsraNER-N3.

Baseline. We compare the proposed method with the following methods:

– BiLSTM+CRF is the most classical method for named entity recognition.
– Ma and Hovy [15]: LSTM-CRF is used as the main network structure and a CNN is used to represent the character sequences of words.
– Lample et al. [13]: An LSTM is used to represent the character sequences of words on top of the BiLSTM+CRF model.
Algorithm 2. Reinforcement Learning for the Tag Modifier
Input: Episode number $L$; training data $X = \{X_1, X_2, \ldots, X_N\}$; parameters $\Phi$ of the BiLSTM-CRF and $\Theta$ of the policy network
Initialize target networks: $\tilde{\Phi} = \Phi$, $\tilde{\Theta} = \Theta$
for $l = 1$ to $L$ do
    Shuffle $X$
    foreach $X_k \in X$ do
        Sample tag modification actions for each word in $X_k$ with $\tilde{\Theta}$ (we omit the subscript $k$ below for convenience):
            $A = \{a_1, a_2, \ldots, a_{|X|}\}$, $a_t \sim \pi(a_t \mid s_t; \tilde{\Theta})$
        Train the target BiLSTM-CRF network
        Compute the delayed reward $r(s_{|X|+1} \mid X)$
        Update the parameter $\Theta$ of the Tag Modifier:
            $\Theta \leftarrow \Theta + \alpha \sum_i v_i \nabla_\Theta \log \pi_\Theta(a_i \mid s_i)$, where $v_i = r(s_{|X|+1} \mid X)$
        if $k \bmod \text{batchsize} = 0$ then
            Update $\Phi$ in the BiLSTM model
            Update the weights of the target networks:
                $\tilde{\Phi} = \tau \Phi + (1 - \tau) \tilde{\Phi}$
                $\tilde{\Theta} = \tau \Theta + (1 - \tau) \tilde{\Theta}$
        end
    end
end
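The last step of Algorithm 2 is a soft update that moves each target network a fraction $\tau$ of the way towards the corresponding online network. A minimal sketch on flat parameter lists follows; the function name, the flat-list representation, and the value of $\tau$ are hypothetical, since the text does not report the value that was used.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_params, target_params)]


# Toy usage: with tau = 0.1 the target moves 10% of the way to the online values.
toy_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small $\tau$ keeps the networks used for action sampling and reward computation changing slowly, which is one common way to reduce the variance of policy-gradient training.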
Word Embeddings. We adopted word2vec to train 50-dimensional word embeddings on the Wikipedia data [16].

Parameter Setting. The dimension of the hidden state in the BiLSTM model is 50. In order to smooth the update of the policy gradient, Eq. 2 is multiplied by a suppression factor, which is set to 0.1. We adopted dropout with a probability of 0.5 before the BiLSTM layer in sequence tagging. During training, the Adam algorithm [11] is used to optimize the parameters with a learning rate of 0.005. The mini-batch size is 5.

3.2 Result and Analysis
We first report the overall results compared with the baselines, then verify the modification decisions of the Tag Modifier by manual checking, and finally study several typical cases.

Overall Results. The results in Tables 3 and 4 reveal the following observations.

– RLNER obtains better performance than the baseline BiLSTM+CRF on all noisy datasets, showing that correcting the wrong tags in noisy data with our Tag Modifier module greatly improves performance.
– RLNER on noisy data achieves performance competitive with baseline models that are fed with clean data. This shows that models relying solely on high-quality labels do not perform well in noisy settings.
Table 3. Overall results on the Military dataset

Method              P(%)    R(%)    F1(%)
BiLSTM CRF          77.44   75.29   76.35
Ma and Hovy [15]    80.99   78.15   79.54
Lample et al. [13]  80.20   79.16   79.68
RLNER               85.33   84.29   84.81
– The partially-labeled noise has a clear effect on both precision and recall, because partial labels make the model predict many wrong entities. The non-labeled noise has a low impact on precision and a great impact on recall, because the absence of labels makes the model predict only a limited number of correct entities. The influence of the ambiguous-labeled noise is just the opposite, because the model predicts many entities and only part of them are correct.
Table 4. Overall results on the Msra dataset

Dataset      Metric   Bi-LSTM+CRF   Ma and Hovy   Lample et al.   RLNER
MsraNER      P(%)     75.80         83.88         83.36           -
             R(%)     82.57         81.52         82.16           -
             F1(%)    74.14         82.68         82.75           -
MsraNER-N1   P(%)     12.27         15.89         15.45           61.91
             R(%)     10.43         14.97         15.25           59.70
             F1(%)    11.27         15.42         15.35           60.79
MsraNER-N2   P(%)     71.88         71.72         71.81           72.03
             R(%)     37.68         47.30         45.78           63.32
             F1(%)    49.45         57.00         55.92           67.39
MsraNER-N3   P(%)     58.80         57.84         58.17           76.29
             R(%)     63.98         70.24         68.67           71.67
             F1(%)    61.28         63.44         62.99           73.91
Performance of Tag Modifier. To assess the accuracy of the Tag Modifier, we manually compared the modified labels with the original labels. For each tag, we check whether the Tag Modifier has made a correct decision by examining whether an original noisy tag has been assigned a “change” action or not. If it is labeled as “change”, it is viewed as a correct decision; otherwise it is viewed as a wrong decision. Table 5 presents the performance of the Tag Modifier, from which we can see that RLNER corrects over half of the wrong tags inserted in the MsraNER dataset. This validates the effectiveness of the Tag Modifier.
Table 5. Entity correction rate on MsraNER dataset
Dataset            Wrong tags   Corrected tags   Correction rate
RLNER+MsraNER-N1   1529         1026             0.6919
RLNER+MsraNER-N2   931          604              0.6488
RLNER+MsraNER-N3   1014         932              0.9191
Table 6. Corrected entity examples by our model on Military dataset
Case Study. Table 6 shows some examples of wrong tags that are corrected by the Tag Modifier in the Military dataset. We also observe that the Tag Modifier may introduce a small amount of new noise. Our investigation shows that the new noisy tags resemble tags in the validation dataset, and thus the Tag Modifier changes the correct tags to wrong ones so that the sequence tagging network performs better on the validation data. However, this kind of new noise has little impact because there is so little of it.
4 Related Work
Many approaches have been proposed for NER. Early studies on NER often exploit SVMs [10], HMMs [1] and CRFs [12], which rely heavily on human annotations and handcrafted features. Recent advances in neural models have freed domain experts from handcrafting features. [7] attempted to solve the problem by using a unidirectional LSTM, which was the first neural NER model. [4] used a CNN-CRF structure, obtaining results competitive with the best statistical models. [9] exploited a BiLSTM to extract features and feed them into the CRF decoder. Since then, the BiLSTM-CRF model has usually been used as the baseline. However, it is expensive to obtain enough labeled data for deep learning models. NER in industry, where domain-specific labeled data is insufficient, has received continued research attention. [13] proposed a method that uses a small amount of labeled data combined with a large number of unlabeled corpora for model training. [22] proposed an approach that combines active learning with deep
learning, which drastically reduces the amount of required labeled training data. Recently, various distant supervision models were proposed for NER [6,8,20]. However, such approaches suffer from the noisy labeling issue, which heavily affects the performance. In this paper, we propose a reinforcement learning approach to address the above issues. [25] also adopts reinforcement learning to solve the low-resource NER task, but their method selects instances for noisy annotation. In contrast, our model tries to correct the noisy labels. We make use of the correctly labeled resources as much as possible while learning an independent Tag Modifier to correct the wrong tags. In our approach, the reward is intuitively reflected by the performance change of the sequence tagging network.
5 Conclusion
In this paper, we propose a novel model for NER with noisy data using a reinforcement learning framework. The model consists of a Tag Modifier and a Tag Predictor. The Tag Modifier corrects the wrong tags of noisy data. The Tag Predictor predicts the sequence of tags for the sentence and provides rewards to the modifier as a weak signal to supervise the tag modification. Extensive experiments demonstrate that our model can correct the wrong tags and outperform state-of-the-art baselines in named entity recognition with noisy data.

Acknowledgements. This work is supported by the Key-Area Research and Development Program of Guangdong Province (2019B010153002), NSFC Projects (U1736204, 61533018) and grants from Beijing Academy of Artificial Intelligence (BAAI2019ZD0502) and Institute for Guo Qiang, Tsinghua University (2019GQB0003).
References
1. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance learning name-finder. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194–201 (1998)
2. Bunescu, R.C., Mooney, R.J.: A shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731 (2005)
3. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 167–176 (2015)
4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
5. Feng, J., Huang, M., Zhao, L., Yang, Y., Zhu, X.: Reinforcement learning for relation classification from noisy data. In: Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5779–5786 (2018)
6. Giannakopoulos, A., Musat, C., Hossmann, A., Baeriswyl, M.: Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. arXiv preprint arXiv:1709.05094 (2017)
7. Hammerton, J.: Named entity recognition with long short-term memory. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 172–175 (2003)
8. He, W.: Autoentity: automated entity detection from massive text corpora. Ph.D. thesis, University of Illinois at Urbana-Champaign (2017)
9. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
10. Isozaki, H., Kazawa, H.: Efficient support vector classifiers for named entity recognition. In: Proceedings of the 19th International Conference on Computational Linguistics, pp. 1–7 (2002)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289 (2001)
13. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270 (2016)
14. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117 (2006)
15. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
17. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011 (2009)
18. Peng, N., Dredze, M.: Improving named entity recognition for Chinese social media with word segmentation representation learning. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 149–155 (2016)
19. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 147–155 (2009)
20. Ren, X., El-Kishky, A., Wang, C., Tao, F., Voss, C.R., Han, J.: Clustype: effective entity recognition and typing by relation phrase-based clustering. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 995–1004 (2015)
21. Shang, J., Liu, L., Ren, X., Gu, X., Ren, T., Han, J.: Learning named entity tagger using domain-specific dictionary. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2054–2064 (2018)
22. Shen, Y., Yun, H., Lipton, Z.C., Kronrod, Y., Anandkumar, A.: Deep active learning for named entity recognition. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 252–256 (2017)
23. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
24. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
25. Yang, Y., Chen, W., Li, Z., He, Z., Zhang, M.: Distantly supervised NER with partial annotation learning and reinforcement learning. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2159–2169 (2018)
26. Yao, X., Van Durme, B.: Information extraction over structured data: question answering with freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 956–966 (2014)
Flexible Parameter Sharing Networks

Chengkai Piao1, Jinmao Wei1,2(B), Yapeng Zhu1, and Hengpeng Xu1
1 College of Computer Science, Nankai University, Tianjin 300071, China
{apark,zhuyapeng,xuhengpeng}@mail.nankai.edu.cn, [email protected]
2 Institute of Big Data, Nankai University, Tianjin 300071, China
Abstract. Deep learning models have flourished in recent years, but training them remains a complex optimization problem because the parameters of each layer are independent. Although this problem can be alleviated by coefficient-vector-based parameter sharing methods, these bring up a new problem: parameters of different sizes cannot be generated from a fixed-size global parameter template, which may truncate latent connections among parameters. In order to generate parameters of different sizes from the same parameter template, a Flexible Parameter Sharing Scheme (FPSS) is proposed. We exploit the asymmetric characteristic of convolution operations to resize and transform the template into specific parameters. As a generalization of the coefficient-vector-based methods, FPSS incorporates 2-dimensional convolution operations rather than linear combinations to transform the global template. Since all parameters are generated from the same template, FPSS can be viewed as building latent connections among all parameters through the global template. Meanwhile, each layer needs far fewer parameters, which reduces the search space and makes the model easier to train. Furthermore, we present two deep models as applications of FPSS, Hybrid CNN and Adaptive DenseNet, which share the global template across different modules and blocks. One can easily find the similar parts of a deep network through our method. Experimental results on several text datasets show that the proposed models are comparable to or better than state-of-the-art models.
Keywords: Deep learning · Parameter sharing · Template

1 Introduction
Convolutional Neural Network (CNN) is an outstanding kind of Neural Network (NN) model that is often trained to perform Machine Learning (ML) tasks such as Text Classification [15], Sentiment Analysis [17] and Object Detection [3]. Many CNNs learn hierarchical features through their hand-designed, fixed-depth, feed-forward architectures. Although many promising achievements have been reached, the parameters of each layer of CNN-based models are trained separately, which may result in a complex optimization problem [22]. Deep CNNs usually perform better than shallow ones, but are difficult to optimize due to a vast number of independent parameters.

To address this problem, Hyper Network [7], a simple but efficient method, was proposed to evolve a smaller network to generate the structure of weights for a larger network, so that the search is constrained within a much smaller weight space. Since the Hyper Network can be viewed as the relation information among all layers, it can produce context-dependent weight changes for the target network. Coefficient Network [22] is a special case of Hyper Network, in which parameters of various sizes are regarded as global templates. Each layer's parameters are generated through linear combinations of templates of a specific size. Consequently, Coefficient Network significantly reduces the number of parameters in deep models. Despite this success, a problem remains: parameters of different sizes cannot be generated from the same template, e.g. the number of channels of different Blocks in DenseNet may not be the same. The integrity of the template is therefore broken, because multiple templates must be involved.

We present a Flexible Parameter Sharing Scheme (FPSS), which makes it possible to learn parameters of different sizes from a fixed-size template. Consequently, our method allows different modules, e.g. a convolution module followed by a linear module, to be jointly trained. Since all parameters are generated from the global template, the optimizer is constrained to a smaller search space. The differences between our method and previous methods are shown in Fig. 1.

This work is supported by the National Natural Science Foundation of China (61772288) and the Natural Science Foundation of Tianjin City (18JCZDJC30900).
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 346–358, 2020. https://doi.org/10.1007/978-3-030-60450-9_28
Fig. 1. Differences between our method and previous methods
Figure 1a illustrates that hierarchical features can be seen as a process of gradual abstraction of knowledge, in which deeper features depend on shallower features. This procedure is a single-direction feed-forward paradigm, similar to a computer program. Figure 1b presents a partially joint version of CNN [7,22], which generates parameters by learning linear combinations of a specific template. Weight sharing is allowed only between modules with the same parameter size. That is, it cannot deal with different modules because they may
have different parameter sizes. Figure 1c diagrams our flexible method, in which all parameters, even those of different sizes, are generated from the global template. Consequently, our method preserves the integrity of parameter sharing.

With the help of FPSS, we present two variants of CNN: Hybrid CNN (H-CNN) and Adaptive DenseNet (A-DenseNet). In H-CNN, each layer is composed of Pre-Norm, a residual connection and a convolution module followed by a linear module. In A-DenseNet, each block is composed of several convolution modules, while transition layers are used to connect blocks. All parameters in each model are generated from its global parameter template. Experimental results show that FPSS-based models are competitive with, and sometimes better than, state-of-the-art models.

The main contributions of this paper are as follows:
1) With a trainable 3-dimensional matrix as the global parameter template, we propose a Flexible Parameter Sharing Scheme that is compatible with parameters of different sizes.
2) We address the problem that modules with parameters of different sizes cannot be unified, and present a Hybrid CNN composed of a CNN module and a linear module, in which all parameters are shared from the same template.
3) We present a compatible version of DenseNet, a highly rated CNN model, in which latent connections are built among all convolution kernels through FPSS.
2 Related Work
Deep models have been applied in many tasks such as text classification [15], sentiment analysis [17] and relation classification [6]. Deep models perform better than others [21,23] partly because of feature reuse and hierarchical abstraction. The authors of [12] proposed a Text CNN model which extracts semantic information through feature maps. But one limitation of CNN is that long-range dependencies are neglected [4,28]. To deal with this problem, RNNs and LSTMs were incorporated, since they are able to preserve sequence information. Inspired by this, researchers have developed many recurrent variants of CNN [18,24,25]. Some researchers insisted that hierarchical abstraction plays an important role in deep models. The authors of [27] proposed a hierarchical model corresponding to the hierarchical structure of documents. In [5], the authors proposed a simplified gating mechanism and built a hierarchical representation to capture long-range dependencies. Indeed, stacking more layers and feature maps helps collect context information, but may result in optimization problems. To alleviate this dilemma, two main problems should be solved: preserving gradient information in recurrent structures and reducing the search space. Residual Networks [9] and Hyper Networks [7] were proposed to deal with these concerns. Residual Networks apply a stable derivative $1 + f'(x)$ to replace $f'(x)$ to avoid gradient explosions. Meanwhile, the authors of [26] insisted that a deep model with Pre-Norm and Residual Connections can prevent the model
from gradient explosion. In fact, this idea has been adopted in our H-CNN model and improves the model's performance. Hyper Network exploits a smaller network to control the parameters of the main network, which is a natural way to reduce the number of parameters. As a special case of Hyper Network, the coefficient-vector-based parameter sharing scheme [22] learns, for each layer, a linear combination across several templates to generate parameters. This scheme significantly reduces the number of parameters, especially in deep models. However, it cannot adapt to parameters of different sizes: different convolution kernels require templates of different sizes, which may break the integrity of the template. In this paper, we present a Flexible Parameter Sharing Scheme to generate parameters of different sizes from the same global template. Specifically, we exploit the asymmetric character of convolution operations to prune the global template into parameters of different sizes. Recent works have shown that Hyper Network based methods outperform sparse connections [10]. Meanwhile, pruning methods [8] can also reduce the number of parameters, and share a similar idea with [7]. In fact, our method can be viewed as an extension of [22], because it learns a 2-dimensional convolution kernel, instead of a 1-dimensional coefficient vector, to transform the global template. Furthermore, with the help of FPSS, we present two variants of CNN: Hybrid CNN and Adaptive DenseNet. The former incorporates different modules and a residual connection in each layer, which helps avoid optimization and gradient explosion problems, while the latter unifies the different Blocks and transitions in DenseNet.
3 Model
We present the Flexible Parameter Sharing Scheme to address the problems of parameter independence [10,12] and template truncation [15,22]; it can transform the global template into parameters of a specified size. Specifically, FPSS is a 2-dimensional convolution based algorithm that can generate parameters of different shapes for specific models (theoretically, FPSS can generate parameters of any shape; this paper mainly studies linear models and 1-dimensional convolution models). The comparison between FPSS and the coefficient-vector-based model is shown in Fig. 2. As shown in Fig. 2a, to generate parameters $W_1$ and $W_2$ of different sizes, coefficient-vector-based models have to incorporate different templates, which may truncate the latent connections between $W_1$ and $W_2$. In contrast, our method, shown in Fig. 2b, uses parameters (kernels) of different sizes to transform the global template $T$ into specific sizes. That is, we exploit the output asymmetry of convolution operations. Our method can be viewed as a high-dimensional extension of [22], in that we handle different parameters by adjusting the kernels (coefficient vectors) rather than the templates. The process of generating parameters is shown in Eq. 1.

$$W^{(i)} = \mathrm{kernel}^{(i)} \otimes T. \tag{1}$$
Fig. 2. The comparison of FPSS and coefficient vector based model.
where $W^{(i)}$ denotes the parameter matrix of the specified model (i.e. linear model or convolutional model) of the $i$-th layer, $\mathrm{kernel}^{(i)}$ represents the template kernel used to generate $W^{(i)}$, and $\otimes$ denotes the convolution operation. Specifically, given the global template $T \in \mathbb{R}^{r \times c \times k}$ and a target parameter matrix $W^{*} \in \mathbb{R}^{r^{*} \times c^{*} \times k^{*}}$, we need $k^{*}$ kernels with shape $[r - r^{*} + 1, c - c^{*} + 1, k]$ (the third dimension of the kernel and the template must be the same, so we use $k$ to represent both). Compared with computing linear combinations of templates, our method uses template kernels to replace the coefficient vector in [22]. One can adjust the size of the kernel to generate parameters of various sizes. Consequently, our method is more flexible, and it is equivalent to building latent connections among all parameters $W$.

3.1 Hybrid CNN
We present an upgraded version of [22], the Hybrid Convolution Neural Network (H-CNN). With the help of FPSS, all modules in H-CNN share parameters through the global template $T$. Each layer of H-CNN consists of a LayerNorm module [1], a Convolution module and a Linear module. Additionally, an activation function and a residual connection are included. Figure 3a illustrates the structure of H-CNN. For layer $i$, Eq. 2 illustrates the process of generating $W_{*}^{(i)}$ for each module through the global template.

$$\begin{aligned}
\mathrm{Conv}(X) &= \mathrm{conv1d}(X; W_c^{(i)}), & W_c^{(i)} &= \mathrm{kernel}_c^{(i)} \otimes T, \\
\mathrm{Linear}(X) &= \mathrm{linear}(X; W_l^{(i)}), & W_l^{(i)} &= \mathrm{kernel}_l^{(i)} \otimes T.
\end{aligned} \tag{2}$$

Fig. 3. Differences between our method and previous methods
where $\mathrm{Conv}(X)$ and $\mathrm{Linear}(X)$ are the convolution and linear modules, with $W_c^{(i)}$ and $W_l^{(i)}$ as parameters respectively, and $\mathrm{kernel}_c^{(i)}$ and $\mathrm{kernel}_l^{(i)}$ are the parameters used to control the sizes of $W_c^{(i)}$ and $W_l^{(i)}$. H-CNN is an instance of parameter sharing among layers and modules. In particular, the Pre-Norm and residual connection modules are used to accelerate model convergence [26]. They also allow the training procedure to omit the warm-up stage with little performance loss. At the end of the network, a feed-forward fully connected layer is adopted, which is independent of the template.

3.2 Adaptive DenseNet
DenseNet [11] is a progressively growing hierarchical architecture, in which each layer is directly connected to all preceding layers. Since the number of dimensions varies from layer to layer, its parameters are separate. Coefficient-vector-based methods fail to share parameters across different layers, not to mention different blocks. We present a 1-dimensional adapted version of DenseNet (Adaptive DenseNet, A-DenseNet), in which all parameters are connected through FPSS. Figure 3b diagrams the structure of A-DenseNet. In DenseNet, the dimension increases through blocks and decreases through transition modules. Consequently, most parameters have different sizes and coefficient-vector-based methods may fail. For all Blocks and transitions, Eq. 3 indicates the process of generating parameters.
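$$\begin{aligned}
\mathrm{Conv}^{(i,j)}(X) &= \mathrm{conv1d}(X; W_c^{(i,j)}), & W_c^{(i,j)} &= \mathrm{kernel}_c^{(i,j)} \otimes T, \\
\mathrm{Trans}^{(i)}(X) &= \mathrm{conv1d}(X; W_t^{(i)}), & W_t^{(i)} &= \mathrm{kernel}_t^{(i)} \otimes T.
\end{aligned} \tag{3}$$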
where $\mathrm{Conv}^{(i,j)}(X)$ represents the $j$-th convolution module of Block $i$, $\mathrm{Trans}^{(i)}(X)$ denotes the transition layer between Blocks $i$ and $i+1$, and $W_c^{(i,j)}$ and $W_t^{(i)}$ are the corresponding parameters. As a generalized version of the coefficient-vector-based parameter sharing scheme, our method rebuilds DenseNet as a parameter sharing model. To generate parameters of various sizes to fit different layers, the size of the kernel has to be adjusted accordingly. Generally, shallower layers are prone to learn less abstract features and deeper layers learn more abstract features. According to Eq. 1, deeper layers of A-DenseNet have more channels than shallower ones but fewer parameters, which means the search space of A-DenseNet shrinks as the level of abstraction increases. Consequently, A-DenseNet is good at collecting hierarchical features and converges faster than its classical version.

3.3 Interpretation
Our method can be viewed as learning a non-linear transformation from the global template. For example, given any module $f(X)$, Eq. 4 illustrates the relationship between the module and the template.

$$f(X) = \sigma(f(X; W, b)) = \sigma(f(X; \mathrm{kernel}_W \otimes T; b)). \tag{4}$$

where $X$ is the input data, $W$ denotes the result of convolving the template $T$ with the filter $\mathrm{kernel}_W$, $\sigma$ denotes the sigmoid activation function, and $b$ is the bias. FPSS-based models have far fewer parameters than the others. Consider an $l$-layer H-CNN model in which each layer contains a conv1d module (parameter shape $c \times n \times n$) and a linear module (parameter shape $n \times n$); biases are omitted for simplicity. Without FPSS, it has $(c+1)ln^2$ parameters, where $c$ is the kernel size of conv1d. In contrast, with FPSS, the number of parameters becomes $lk[(c+1)s^2 + (s+n-1)^2]$, where $s$ and $k$ are parameters used to control the size of the template. In our experiments, $s$ and $k$ are less than 10. Consequently, FPSS can not only handle parameters of different sizes but also significantly reduce the number of parameters. Since template kernels have different sizes, our method does not support directly calculating the similarity among layers. As an alternative, distribution similarities are adopted. The detailed similarity measurements are shown in Sect. 4.4.
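The way Eqs. 1 and 4 derive weights of different sizes from one shared template can be illustrated with a small plain-Python sketch. It is not the authors' implementation: it assumes that ⊗ is an unpadded, stride-1 sliding-window product (cross-correlation, as "convolution" is usually implemented in deep learning libraries) and that each of the k* kernels produces one slice of the target along its third dimension. The helper names and the toy shapes are hypothetical.

```python
import random


def make_tensor(shape, rng):
    """Nested lists of shape (rows, cols, depth) with small random values."""
    r, c, k = shape
    return [[[rng.uniform(-0.1, 0.1) for _ in range(k)]
             for _ in range(c)] for _ in range(r)]


def kernel_shape(template_shape, target_shape):
    """Kernel shape [r - r* + 1, c - c* + 1, k] needed to reach the target size."""
    r, c, k = template_shape
    r_t, c_t, _ = target_shape
    if r_t > r or c_t > c:
        raise ValueError("target must not exceed the template in rows or columns")
    return (r - r_t + 1, c - c_t + 1, k)


def generate_weight(template, kernels):
    """Slide every kernel over the template without padding.

    Kernel m fills slice m of the generated weight, so the number of
    kernels equals the third dimension k* of the target.
    """
    r, c, k = len(template), len(template[0]), len(template[0][0])
    kr, kc = len(kernels[0]), len(kernels[0][0])
    out_r, out_c = r - kr + 1, c - kc + 1
    weight = [[[0.0] * len(kernels) for _ in range(out_c)] for _ in range(out_r)]
    for m, kernel in enumerate(kernels):
        for i in range(out_r):
            for j in range(out_c):
                total = 0.0
                for a in range(kr):
                    for b in range(kc):
                        for d in range(k):
                            total += kernel[a][b][d] * template[i + a][j + b][d]
                weight[i][j][m] = total
    return weight


rng = random.Random(0)
template_shape = (12, 12, 2)  # hypothetical r x c x k
template = make_tensor(template_shape, rng)

# Two targets of different size are generated from the same template.
for target_shape in [(8, 8, 3), (10, 10, 1)]:
    k_shape = kernel_shape(template_shape, target_shape)
    kernels = [make_tensor(k_shape, rng) for _ in range(target_shape[2])]
    weight = generate_weight(template, kernels)
    generated = (len(weight), len(weight[0]), len(weight[0][0]))
    print("target", target_shape, "kernel", k_shape, "generated", generated)
```

In this sketch only the kernels differ between the two targets, while the template is shared. Because the kernel's first two dimensions are r − r* + 1 and c − c* + 1, a larger target leaves a smaller kernel, which is in line with the remark in Sect. 3.2 that deeper A-DenseNet layers need fewer parameters.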
4 Experiments
We empirically compare our methods with existing models on different classification tasks. All experiments are conducted using a GeForce GTX 1080Ti GPU with 11 GB memory.
4.1 Datasets
To make an extensive evaluation, we choose a group of 16 different datasets [16] from several popular review corpora. Each dataset contains a set of reviews to be classified as either positive or negative. The detailed statistics of all the datasets are listed in Table 1.

Table 1. Statistics of datasets [16].
Dataset      Trn    Dev   Tst   Avg.L   Vocab
Apparel      1400   200   400   57      21K
Baby         1400   200   400   104     26K
Books        1400   200   400   159     62K
Camera       1397   200   400   130     26K
DVD          1400   200   400   173     69K
Electronic   1398   200   400   101     30K
Health       1400   200   400   81      26K
IMDB         1400   200   400   269     44K
Kitchen      1400   200   400   89      28K
Magazine     1370   200   400   117     30K
MR           1400   200   400   21      12K
Music        1400   200   400   136     60K
Software     1315   200   400   129     26K
Sports       1400   200   400   94      30K
Toys         1400   200   400   90      28K
Video        1400   200   400   156     57K

Table 2. Hyper parameters of Hybrid CNN
Model                   Acc (%)   Param     Time (ms)
(layer = 3, k = 0)      84.75     906.9 K   129/epoch
(layer = 5, k = 0)      82.50     1271 K    140/epoch
(layer = 7, k = 0)      83.25     1634 K    151/epoch
(layer = 10, k = 0)     82.25     2180 K    173/epoch
(layer = 3, k = 2)      85.75     362.8 K   129/epoch
(layer = 5, k = 2)      84.75     364.0 K   146/epoch
(layer = 7, k = 2)      84.50     365.3 K   164/epoch
(layer = 10, k = 2)     84.00     365.3 K   182/epoch
(layer = 3, k = 6)      85.75     362.8 K   131/epoch
(layer = 5, k = 6)      84.50     364.1 K   150/epoch
(layer = 7, k = 6)      82.25     365.4 K   164/epoch
(layer = 10, k = 6)     86.25     366.7 K   185/epoch
(layer = 3, k = 10)     86.25     362.8 K   131/epoch
(layer = 5, k = 10)     86.25     364.1 K   150/epoch
(layer = 7, k = 10)     84.50     365.5 K   176/epoch
(layer = 10, k = 10)    87.50     366.8 K   185/epoch
These datasets are online Amazon product reviews from different domains, such as Books, DVDs, Electronics, etc., collected by the authors of [2]. In particular, IMDB and MR are reviews from the Rotten Tomatoes website [19,20]. All of these datasets have been preprocessed with the Stanford tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) and partitioned into training, development and testing sets with proportions of 70%, 10% and 20% respectively by the authors of [16].

4.2 Competitor Models
Previous works have proposed various models, but not all of them can be applied to the tasks we focus on. Therefore, we chose several highly related models and implemented them as competitor methods.
1) Text-CNN: This model was proposed by the authors of [12]. It applies CNN to text classification and extracts key information in sentences (similar to N-grams with multiple window sizes) by using multiple kernels of different sizes, so as to better capture local correlations. This is a typical parameter-independent model and serves as a baseline.
2) Coefficient CNN: This model was proposed by the authors of [22]. It applies a linear combination of templates as the parameters of a convolution layer and thus restricts the search to a smaller space. We use this model as a competitor to show that our method has an advantage in handling parameters of different sizes.
3) DenseNet: This model was proposed by the authors of [11]. It gradually increases the number of dimensions from layer to layer to represent more complex features. Meanwhile, [14] showed that DenseNet also performs well on text datasets. We choose this model to verify whether FPSS can unify its parameters.
4) S-LSTM: This model was proposed by the authors of [29]. It achieved state-of-the-art results on these 16 datasets. The comparison with S-LSTM shows that our model is competitive with the state-of-the-art model.

4.3 Hyper Parameters
Table 2 shows the results of Hybrid CNN on the Apparel dataset across different hyper parameters. Without the global template (k = 0), the accuracy of Hybrid CNN drops to 84.25%, demonstrating the necessity of global parameter sharing. With FPSS, as described in Sect. 3, the accuracy tends to improve with the number of layers and the hyper parameter k. Although the number of parameters and the training time increase accordingly, the increase in training time is marginal. Compared to existing methods (906K∼2180K), FPSS needs far fewer parameters (362.8K∼366.8K). The number of parameters also grows only marginally, so the space cost of FPSS is lower. We fix the hidden size to 300 because accuracy increases as the dimension of the models increases from 100 to 300, which agrees with many previous works and is not listed here. The batch size is set to 5, the learning rate is set to 0.0001 and the Adam optimizer [13] is adopted. For comparison models, the hyper parameters are set according to the development data.

4.4 Similarities Among Layers
Since the similarity of two matrices of different sizes cannot be measured directly, histogram distances are used as an alternative to build the layer similarity matrices and inspect whether there is a group of structurally similar layers. Figure 4 illustrates the similarities among layers with different k. For clarity, we fix the number of layers to 10 and vary k from 1 to 10. As the global template parameter k increases, Hybrid CNN outperforms the baselines. Accordingly, we observe that many structurally similar layers emerge in flexible parameter shared networks without any regularizer. As k increases, the network is prone to learn different parameters for each layer (the similarities among layers decrease), since layer similarities depend on the learned kernels, leading to a more flexible network, and vice versa. Usually, the similarities between layers are high, but each model has some special layers that differ from all others (marked with red rectangles), which may be the key structures for abstracting semantic information.
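A histogram-based comparison of layers with different parameter counts can be sketched as follows. The text does not say which histogram distance is used, so this example assumes histogram intersection over shared equal-width bins; the function names, the bin count and the random toy "layers" are all hypothetical.

```python
import random


def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # keep the upper edge inside the last bin
        counts[idx] += 1
    return [count / len(values) for count in counts]


def histogram_similarity(weights_a, weights_b, bins=20):
    """Histogram intersection in [0, 1]; 1 means identical binned distributions."""
    lo = min(min(weights_a), min(weights_b))
    hi = max(max(weights_a), max(weights_b))
    if hi == lo:
        return 1.0
    hist_a = histogram(weights_a, bins, lo, hi)
    hist_b = histogram(weights_b, bins, lo, hi)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Toy flattened weights for three layers with different numbers of parameters.
rng = random.Random(0)
layers = [
    [rng.gauss(0.0, 1.0) for _ in range(300)],
    [rng.gauss(0.0, 1.0) for _ in range(500)],
    [rng.gauss(3.0, 0.5) for _ in range(400)],
]

# Pairwise layer similarity matrix.
matrix = [[histogram_similarity(a, b) for b in layers] for a in layers]
for row in matrix:
    print(" ".join(f"{s:.2f}" for s in row))
```

Because both histograms are normalized, the measure does not depend on how many parameters each layer has, which is the property needed to compare layers whose kernels differ in size.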
Fig. 4. The similarity matrices with different k. The rectangles are special layers.

Table 3. Accuracy and training batch time on 16 datasets (percent/ms).

| Dataset | Text-CNN | Coef-CNN | DenseNet | S-LSTM | H-CNN | A-DenseNet |
|---|---|---|---|---|---|---|
| Apparel | 85.25/430 | 85.50/164 | 84.50/308 | 85.75/2830 | 87.50/176 | 85.50/312 |
| Baby | 84.25/460 | 83.50/177 | 85.50/351 | 86.25/2630 | 86.75/188 | 85.50/365 |
| Books | 81.75/641 | 83.25/211 | 81.25/377 | 83.44/3640 | 85.75/219 | 81.25/381 |
| Camera | 88.50/442 | 88.25/198 | 89.25/362 | 90.02/2850 | 88.00/210 | 90.50/203 |
| DVD | 81.75/508 | 82.75/214 | 81.75/344 | 85.52/5290 | 83.25/235 | 85.50/356 |
| Electronic | 81.75/475 | 82.00/166 | 82.25/309 | 83.25/2550 | 82.50/181 | 83.25/312 |
| Health | 85.25/422 | 85.00/153 | 83.25/320 | 86.50/2170 | 86.75/170 | 85.00/322 |
| IMDB | 81.25/490 | 86.25/185 | 85.50/388 | 87.15/3690 | 87.75/190 | 87.25/400 |
| Kitchen | 80.25/429 | 83.25/175 | 81.50/315 | 84.54/2500 | 84.50/179 | 86.00/371 |
| Magazine | 88.75/451 | 89.50/169 | 92.75/359 | 93.75/2930 | 92.75/184 | 90.25/371 |
| MR | 74.50/390 | 76.25/157 | 76.00/391 | 76.20/1250 | 75.25/163 | 73.75/411 |
| Music | 80.00/455 | 83.75/164 | 82.25/356 | 82.04/3440 | 84.50/166 | 83.25/374 |
| Software | 86.50/498 | 83.50/193 | 85.50/402 | 87.75/2980 | 90.25/201 | 88.75/423 |
| Sports | 84.25/453 | 83.75/185 | 84.25/371 | 85.75/2640 | 86.25/185 | 84.25/382 |
| Toys | 85.50/466 | 85.50/172 | 84.25/357 | 85.25/2420 | 87.25/182 | 87.00/367 |
| Video | 81.50/587 | 84.50/184 | 86.00/473 | 86.75/3950 | 86.50/196 | 86.25/480 |
| Average | 83.20/475 | 84.16/179 | 84.11/361 | 85.62/2985 | 85.97/189 | 85.20/361 |

4.5 Result Statistics
We compare our models with competitor models on all datasets. We use the best settings on the development dataset for Text-CNN, Coefficient CNN, DenseNet and our models. The results of the state-of-the-art model, S-LSTM, are from [29]. As shown in Table 3, the final results are consistent with the development results, where Hybrid CNN and Adaptive DenseNet outperform Text-CNN, Coefficient CNN and DenseNet significantly. Our methods also give competitive results compared with S-LSTM. Although the accuracy gains of our method are relatively marginal, it achieves a significant reduction in training time. Specifically, S-LSTM gives the best results on 5 datasets, while Hybrid CNN and Adaptive DenseNet give the best results on 9 and 3 datasets, respectively. Compared with the original CNNs, FPSS achieves accuracy gains of 1.81% and 1.17% on average. For training time across these 16 datasets, Coefficient-CNN has the lowest time cost (179 ms/epoch) among all models. The average time cost of Hybrid-CNN is 189 ms/epoch, which is only negligibly worse. Meanwhile, the average training time of Adaptive DenseNet is 364 ms/epoch, almost the same as DenseNet (361 ms/epoch). Consequently, the proposed method incurs little extra computational cost. Although the accuracy improvement over S-LSTM seems marginal on several datasets (DVD, Kitchen, MR, etc.), our models have a vast advantage in computational cost. Moreover, their flexibility and explainability make development easier.
5 Conclusion
In this paper, we present a Flexible Parameter Sharing Scheme (FPSS), which shares parameters from a global template with all modules. To build latent connections among modules, we exploit the output-asymmetric character of convolution operations to generate the parameters of each model from the template, and we present two FPSS-based CNN models. Experimentally, flexible parameter sharing yields models with higher accuracy on several datasets with a negligible increase in training time. Meanwhile, the number of parameters is reduced significantly with FPSS equipped. In addition, FPSS-based models are expected to capture the layers that differ significantly from the others, enabling us to study the abstraction of semantic information. The proposed models are also competitive with the state-of-the-art model while offering stronger explainability.
References

1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
2. Blitzer, J., Dredze, M., Pereira, F.: Domain adaptation for sentiment classification. In: 45th Annual Meeting of the Association for Computational Linguistics (ACL) (2007)
3. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
4. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1107–1116 (2017)
5. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: International Conference on Machine Learning, pp. 933–941 (2017)
6. Guo, X., Zhang, H., Yang, H., Xu, L., Ye, Z.: A single attention-based combination of CNN and RNN for relation classification. IEEE Access 7, 12467–12475 (2019)
7. Ha, D., Dai, A., Le, Q.V.: Hypernetworks. arXiv preprint arXiv:1609.09106 (2016)
8. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
10. Huang, G., Liu, S., Van der Maaten, L., Weinberger, K.Q.: CondenseNet: an efficient DenseNet using learned group convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761 (2018)
11. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
12. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)
13. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
14. Le, H.T., Cerisara, C., Denis, A.: Do convolutional networks need to be deep for text classification? In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 29–36 (2018)
15. Lee, J.Y., Dernoncourt, F.: Sequential short-text classification with recurrent and convolutional neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515–520 (2016)
16. Liu, P., Qiu, X., Huang, X.: Adversarial multi-task learning for text classification. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1–10 (2017)
17. Luo, L.: Network text sentiment analysis method combining LDA text representation and GRU-CNN. Pers. Ubiquit. Comput. 23(3–4), 405–412 (2019)
18. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1064–1074 (2016)
19. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 142–150. Association for Computational Linguistics (2011)
20. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics (2005)
21. Qiu, X., Huang, X.: Convolutional neural tensor network architecture for community-based question answering. In: Proceedings of the 24th International Conference on Artificial Intelligence, pp. 1305–1311 (2015)
22. Savarese, P., Maire, M.: Learning implicitly recurrent CNNs through parameter sharing. arXiv preprint arXiv:1902.09701 (2019)
23. Wan, S., Lan, Y., Guo, J., Xu, J., Pang, L., Cheng, X.: A deep architecture for semantic matching with multiple positional sentence representations. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2835–2841 (2016)
24. Wang, X., Jiang, W., Luo, Z.: Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2428–2437 (2016)
25. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)
26. Xiong, R., et al.: On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745 (2020)
27. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)
28. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 2335–2344 (2014)
29. Zhang, Y., Liu, Q., Song, L.: Sentence-state LSTM for text representation. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 317–327 (2018)
An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models

Zihan Zhao1,2, Yuncong Liu1,2, Lu Chen1,2(B), Qi Liu1,2, Rao Ma1,2, and Kai Yu1,2(B)

1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
{zhao mengxin,assassain lyc,chenlusz,liuq901,rm1031,kai.yu}@sjtu.edu.cn
2 SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Abstract. Recently, pre-trained language models like BERT have shown promising performance on multiple natural language processing tasks. However, the application of these models has been limited by their huge size. A popular and efficient way to reduce their size is quantization. Nevertheless, most works on BERT quantization adopt basic linear clustering as the quantization scheme, and few try to upgrade it. This limits the performance of quantization significantly. In this paper, we implement k-means quantization and compare its performance with that of linear quantization for the fixed-precision quantization of BERT. Through the comparison, we verify that the effect of upgrading the underlying quantization scheme is underestimated and that k-means quantization has great development potential. In addition, we compare the two quantization schemes on ALBERT models to explore the differences in robustness between pre-trained models.
Keywords: K-means quantization · Linear quantization · Pre-trained language model · GLUE

1 Introduction
Zihan Zhao and Yuncong Liu are co-first authors and contribute equally to this work.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 359–371, 2020. https://doi.org/10.1007/978-3-030-60450-9_29

Pre-trained transformer-based models [13] have recently achieved state-of-the-art performance on a variety of natural language processing (NLP) tasks, such as sequence tagging and sentence classification. Among them, BERT models [3], based on the transformer architecture [13], have drawn even more attention because of their strong performance and generality. However, the memory and computing consumption of these models are prohibitive. Even the relatively small versions of BERT models (e.g., BERT-base) contain more than 100 million parameters. This over-parameterization makes it challenging to deploy BERT models on devices with constrained resources, such as smartphones and robots. Therefore, compressing these models is an important demand in industry.

One popular and efficient method for model compression is quantization. To reduce model size, quantization represents the parameters of the model with fewer bits than the original 32 bits. With proper hardware, quantization can significantly reduce the memory footprint while accelerating inference. Many works have focused on quantizing models in the computer vision area [4,5,8,15,17,18], while far fewer have addressed NLP [1,2,9,10,12]. Pilot works on transformer quantization include [1,2,10]. They successfully quantized transformer models to 8 or 4 bits while maintaining comparable performance. Moreover, to the best of our knowledge, there are only two published works focusing on BERT quantization [11,16]. [16] applied 8-bit fixed-precision linear quantization to BERT models and achieved a compression ratio of 4× with little accuracy drop. [11] improved quantization performance with group-wise mixed-precision linear quantization based on the Hessian matrix of the parameter tensors.

However, for the underlying quantization scheme, most of the above transformer quantization works, especially the BERT quantization works, used linear clustering, which is a basic clustering method. Although it is fast and simple, the quantized results cannot represent the original data distribution well. As a result, [16] only manages to quantize BERT to 8 bits. Although the other BERT quantization work [11] achieves much higher compression ratios without upgrading the quantization scheme, the group-wise method it developed is rather time-consuming and increases latency significantly. Although it is believed that replacing linear clustering with a better clustering method can improve the performance of quantized models, the effect of upgrading the quantization scheme is rather underestimated.
Therefore, in this paper, we explore the effect of simply upgrading the quantization scheme from linear clustering to k-means clustering, and compare the performance of the two schemes. Furthermore, to see the effect on other pre-trained language models, we also compare the two quantization schemes on ALBERT models [7], an improved version of BERT. In summary, we apply k-means and linear quantization to BERT and ALBERT and test their performance on the GLUE benchmark. Through this, we verify that simply upgrading the quantization scheme can yield large performance gains and that simple k-means clustering has great potential as a BERT quantization scheme. Moreover, we show that the number of k-means iterations plays an important role in k-means quantization. Through further comparison, we find that ALBERT is less robust than BERT with respect to quantization, as parameter sharing has reduced the redundancy of its parameters.
2 Background: BERT and ALBERT
In this section, we briefly introduce the architectures of BERT and ALBERT models and point out the version of the models we used in our experiments.
2.1 BERT
BERT models [3] are a special kind of pre-trained transformer-based network. They mainly consist of embedding layers, encoder blocks, and output layers. There is no decoder block in BERT models. Each encoder block contains one self-attention layer (which includes three parallel linear layers corresponding to query, key, and value) and 3 feed-forward layers (each of which includes one linear layer). For each self-attention layer, BERT uses the multi-head technique to further improve its performance. Each self-attention head has 3 weight matrices $W_q$, $W_k$, and $W_v$, where $W_q, W_k, W_v \in \mathbb{R}^{d \times \frac{d}{h}}$ ($h$ is the number of heads in each self-attention layer). Let $X \in \mathbb{R}^{n \times d}$ denote the input of the corresponding self-attention layer. The output of the self-attention head is then calculated as:

$$
\begin{aligned}
Q &= XW_q, \quad K = XW_k, \quad V = XW_v, \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V.
\end{aligned}
\tag{1}
$$

Then, for each self-attention layer, the outputs of all its self-attention heads are concatenated sequentially to generate the output of the layer. Specifically, in our work, we use the bert-base-uncased version of BERT, which has 12 encoder blocks and 12 heads per self-attention layer, to carry out the following experiments.

2.2 ALBERT
Compared to BERT, ALBERT contributes three main improvements. First, ALBERT models decompose the embedding parameters into the product of two smaller matrices. Second, they adopt cross-layer parameter sharing to improve parameter efficiency. These two improvements significantly reduce the total number of parameters and make the model more efficient. Moreover, parameter sharing can also stabilize network parameters. Third, they replace the next-sentence prediction (NSP) loss with a sentence-order prediction (SOP) loss during pre-training. This makes the models focus on modeling inter-sentence coherence instead of topic prediction and improves performance on multi-sentence encoding tasks. Specifically, in this paper, we use the albert-base-v2 version of ALBERT, which also has 12 encoder blocks (where all parameters are shared across layers) and 12 heads per self-attention layer.
3 Methodology
In this section, we first introduce the quantization process in our experiments (Sect. 3.1), then explain the two quantization schemes we used in detail (Sect. 3.2, 3.3).
3.1 Overview
To compare the linear and k-means quantization schemes on pre-trained transformer-based models, we test the performance of quantized models on different downstream tasks. Specifically, for each chosen task, the following steps are carried out sequentially: fine-tuning the pre-trained models (BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each chosen task.

To avoid the effect of other tricks, we simply apply the two quantization schemes (linear and k-means) following a fixed-precision quantization strategy without any tricks. We quantize all the weights of the embedding layers and the fully connected layers (except the classification layer). After quantization, each weight vector is represented by a corresponding cluster index vector and a centroid value vector, and each parameter of the weight vector is replaced with the centroid of the cluster to which it belongs.

After the model is quantized, we further fine-tune it on the corresponding downstream tasks while keeping it quantized. For the forward pass, we reconstruct each quantized layer from its cluster index vector and centroid value vector. For the backward pass, while updating the remaining parameters normally, we update the quantized parameters by training the centroid vectors. More specifically, the gradient of each parameter in the centroid vectors is calculated as the average of the gradients of the parameters that belong to the corresponding cluster. Then, the centroid value vectors are updated by the same back-propagation methods.

3.2 Linear Quantization
Suppose that we need to quantize a vector $v$ to $k$ bits ($k$-bit quantization). We first search for its minimum value $v_{\min}$ and maximum value $v_{\max}$. The range $[v_{\min}, v_{\max}]$ is then divided into $2^k$ clusters of width

$$\mathrm{width} = \frac{v_{\max} - v_{\min}}{2^k}. \tag{2}$$

Define the function $\hat{Q}$ as

$$\hat{Q}(v_i) = \left\lfloor \frac{v_i - v_{\min}}{\mathrm{width}} \right\rfloor, \tag{3}$$

whose value lies between 0 and $2^k - 1$, so that each parameter $v_i$ belongs to the $\hat{Q}(v_i)$-th cluster. Each $v_i$ is then replaced with the centroid of the $\hat{Q}(v_i)$-th cluster, i.e., the average of all parameters belonging to it. Therefore, the quantization function is

$$Q(v_i) = \frac{\sum_j 1\{\hat{Q}(v_j) = \hat{Q}(v_i)\}\, v_j}{\sum_j 1\{\hat{Q}(v_j) = \hat{Q}(v_i)\}}, \tag{4}$$

where $1\{\text{statement}\}$ equals 1 when the statement is true and 0 otherwise.
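The linear scheme above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name is hypothetical, and clipping the index of the maximum value into the last cluster is an assumption made so that every index stays within 0 to 2^k − 1.

```python
import numpy as np


def linear_quantize(v, k):
    """Linear k-bit quantization of a weight vector.

    Returns (index, centroids): the cluster index of every parameter and
    the mean of every cluster. The dequantized vector is centroids[index].
    """
    v = np.asarray(v, dtype=np.float64).ravel()
    n_clusters = 2 ** k
    v_min, v_max = v.min(), v.max()
    width = (v_max - v_min) / n_clusters
    if width == 0.0:
        # Constant vector: everything falls into cluster 0.
        index = np.zeros(v.shape, dtype=np.int64)
    else:
        index = np.floor((v - v_min) / width).astype(np.int64)
        # Assumption: v_max would get index 2**k, so clip it into the last cluster.
        index = np.minimum(index, n_clusters - 1)
    # Centroid of a cluster = average of its members; empty clusters stay at 0.
    sums = np.bincount(index, weights=v, minlength=n_clusters)
    counts = np.bincount(index, minlength=n_clusters)
    centroids = np.zeros(n_clusters)
    nonempty = counts > 0
    centroids[nonempty] = sums[nonempty] / counts[nonempty]
    return index, centroids
```

The cluster boundaries depend only on the minimum and maximum of the vector, so a few outliers can leave most clusters nearly empty while the bulk of the parameters shares one or two centroids.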
3.3 K-Means Quantization
Suppose that we need to quantize a vector $v$ to $k$ bits ($k$-bit quantization). For k-means quantization, we use k-means clustering with k-means++ initialization to partition the vector $v$ into $2^k$ clusters. We first use the k-means++ initialization method to initialize the $2^k$ centroids ($\mu_1, \mu_2, \ldots, \mu_{2^k}$) of the clusters ($c_1, c_2, \ldots, c_{2^k}$). Then, each parameter $v_i$ is assigned to its nearest cluster. After all the parameters in $v$ are assigned, each centroid is updated to the average of the parameters that belong to its cluster. Re-assigning parameters and updating centroids are then repeated until convergence or until the maximum number of iterations is reached. The k-means++ initialization proceeds as follows: first, choose a random parameter from the vector $v$ as the first centroid; then assign each remaining parameter a probability of becoming the next centroid according to its smallest distance to the existing centroids, and choose the next centroid based on these probabilities; finally, repeat the probability assignment and centroid selection until all $2^k$ centroids are generated.
Algorithm 1. k-means clustering
Require: #bits $k$, vector $v$
Ensure: the $2^k$ centroids and the corresponding label vector
1: Initialize the $2^k$ centroids
2: repeat
3:   Calculate the distance $d_{i,j}$ between each parameter $v_i$ and each centroid $\mu_j$
4:   Classify each parameter $v_i$: $v_i \in c_m$ where $m = \arg\min_j d_{i,j}$
5:   Update each centroid $\mu_j$: $\mu_j = \frac{\sum_i 1\{v_i \in c_j\}\, v_i}{\sum_i 1\{v_i \in c_j\}}$
6: until convergence is met or the maximum iteration is reached
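Algorithm 1 together with the k-means++ initialization can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names and the `seed` argument are hypothetical, the next centroid is sampled with probability proportional to the squared distance (the usual k-means++ convention, which the text does not spell out), and `max_iter` is assumed to be at least 1.

```python
import numpy as np


def kmeans_pp_init(v, n_clusters, rng):
    """k-means++ initialization for scalar data."""
    centroids = [v[rng.integers(len(v))]]
    for _ in range(n_clusters - 1):
        # Squared distance of every parameter to its nearest existing centroid.
        d2 = np.min((v[:, None] - np.asarray(centroids)[None, :]) ** 2, axis=1)
        total = d2.sum()
        if total == 0.0:
            # Fewer distinct values than clusters: reuse an existing centroid.
            centroids.append(centroids[-1])
            continue
        centroids.append(v[rng.choice(len(v), p=d2 / total)])
    return np.asarray(centroids, dtype=np.float64)


def kmeans_quantize(v, k, max_iter=3, seed=0):
    """k-bit k-means quantization; returns (labels, centroids)."""
    v = np.asarray(v, dtype=np.float64).ravel()
    rng = np.random.default_rng(seed)
    n_clusters = 2 ** k
    centroids = kmeans_pp_init(v, n_clusters, rng)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign each parameter to its nearest centroid.
        new_labels = np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Step 5: move each non-empty centroid to the mean of its members.
        counts = np.bincount(labels, minlength=n_clusters)
        sums = np.bincount(labels, weights=v, minlength=n_clusters)
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty]
    return labels, centroids
```

The distance matrix has one column per centroid, so memory grows as the vector length times 2^k; for large weight tensors a chunked or library implementation would be preferable to this sketch.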
To limit the efficiency drop caused by upgrading the quantization scheme, we set the maximum number of k-means iterations to only 3. After k-means clustering is finished, we use the resulting label vector as the cluster index vector and the resulting centroids as the corresponding centroid value vector. Each parameter $v_i$ is replaced by the centroid of the cluster to which it belongs.
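Section 3.1 describes how a quantized layer is rebuilt from these two vectors in the forward pass and how each centroid receives the average gradient of its cluster in the backward pass. The sketch below illustrates both steps with NumPy; the function names are hypothetical, and no optimizer step is shown because that part depends on the training setup.

```python
import numpy as np


def reconstruct(labels, centroids, shape):
    """Forward pass: rebuild a weight tensor from index and centroid vectors."""
    return np.asarray(centroids)[np.asarray(labels)].reshape(shape)


def centroid_gradients(weight_grad, labels, n_clusters):
    """Backward pass: average the weight gradients within each cluster.

    Follows the rule stated in Sect. 3.1 (average, not sum, of member
    gradients). Clusters without members receive a zero gradient.
    """
    g = np.asarray(weight_grad, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel()
    sums = np.bincount(labels, weights=g, minlength=n_clusters)
    counts = np.bincount(labels, minlength=n_clusters)
    return np.divide(sums, counts, out=np.zeros(n_clusters), where=counts > 0)
```

Only the centroid vector is trainable here; the index vector stays fixed during fine-tuning, which is what keeps the layer at the chosen bit-width.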
4 Experiments
In this section, we first introduce the dataset we used in our experiments (Sect. 4.1), then explain the experimental details of our experiments on BERT and ALBERT (Sect. 4.2), finally show the results and the corresponding discussion (Sect. 4.3).
4.1 Dataset
We test the performance of our quantized models on the General Language Understanding Evaluation (GLUE) benchmark [14], which contains NLU tasks including question answering, sentiment analysis, and textual entailment. Specifically, we use 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI, and QQP) to test the performance of different quantization schemes. The evaluation metrics of each task are as follows: Matthews correlation coefficient (mcc) for CoLA; accuracy (acc) for QNLI, RTE, SST-2, and MNLI; accuracy (acc) and F1 score for MRPC and QQP; Pearson and Spearman correlation coefficients (corr) for STS-B. We follow the default split of the dataset. The datasets are available for download at https://gluebenchmark.com/tasks.

4.2 Experimental Setup
Before quantization, the bert-base-uncased version of BERT is fine-tuned on the 8 tasks with the Adam optimizer [6] and a linear schedule with a learning rate of 5e−5. As for ALBERT, we first fine-tune the albert-base-v2 model on QNLI, CoLA, SST-2, MNLI, and QQP, and then fine-tune on RTE, MRPC, and STS-B starting from the MNLI checkpoint (following the same process as [7]). We use the Adam optimizer and a linear schedule to fine-tune ALBERT, and the learning rate for each task is searched in {1e−5, 2e−5, 3e−5, 4e−5, 5e−5}. After quantization, we further fine-tune the quantized models on the corresponding tasks. In particular, the learning rates of the quantized layers are multiplied by 10 (i.e., 5e−4 for all the quantized BERT models), while those of the other layers remain the same.

Table 1. The results of fixed-precision linear quantization for BERT on GLUE benchmark.

| #bits | QNLI | CoLA | RTE | SST-2 | MRPC | STS-B | MNLI-m/mm | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|
| 32 bits | 91.7 | 59.2 | 72.2 | 93.1 | 86.3/90.4 | 89.7 | 85.0/84.8 | 91.6/88.8 | 83.7 |
| 5 bits | 88.5 | 48.4 | 69.3 | 89.6 | 83.8/88.7 | 88.7 | 79.8/80.4 | 88.9/85.3 | 79.7 |
| 4 bits | 81.8 | 19.9 | 57.0 | 81.4 | 75.7/84.5 | 84.9 | 71.4/71.9 | 80.8/75.9 | 69.4 |
| 3 bits | 61.3 | 11.9 | 56.3 | 78.9 | 70.8/81.9 | 68.6 | 59.6/61.6 | 76.5/71.1 | 60.6 |
| 2 bits | 60.7 | 6.6 | 55.2 | 77.9 | 69.6/81.4 | 47.4 | 49.6/50.8 | 74.2/63.2 | 54.7 |
| 1 bit | 59.5 | 0 | 54.9 | 77.5 | 69.9/81.4 | 37.8 | 47.3/48.8 | 74.3/63.3 | 52.2 |

Table 2. The results of fixed-precision k-means quantization for BERT on GLUE benchmark.

| #bits | QNLI | CoLA | RTE | SST-2 | MRPC | STS-B | MNLI-m/mm | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|
| 32 bits | 91.7 | 59.2 | 72.2 | 93.1 | 86.3/90.4 | 89.7 | 85.0/84.8 | 91.6/88.8 | 83.7 |
| 5 bits | 91.5 | 60.2 | 70.8 | 94.0 | 87.3/91.0 | 89.6 | 84.7/84.9 | 91.7/88.8 | 83.9 |
| 4 bits | 91.7 | 57.4 | 70.8 | 93.6 | 87.0/91.0 | 89.6 | 84.8/84.8 | 91.6/88.7 | 83.5 |
| 3 bits | 91.3 | 56.9 | 70.0 | 93.1 | 86.0/90.2 | 89.4 | 84.4/84.1 | 91.2/88.1 | 82.9 |
| 2 bits | 89.5 | 50.2 | 66.1 | 91.3 | 84.6/89.2 | 88.3 | 81.6/81.9 | 90.3/87.0 | 80.4 |
| 1 bit | 62.2 | 13.7 | 54.5 | 83.0 | 70.8/81.7 | 52.2 | 62.0/62.6 | 77.1/65.9 | 59.8 |

Table 3. The results of fixed-precision linear quantization for ALBERT on GLUE benchmark.

| #bits | QNLI | CoLA | RTE | SST-2 | MRPC | STS-B | MNLI-m/mm | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|
| 32 bits | 91.5 | 58.9 | 81.6 | 92.8 | 90.2/93.1 | 90.9 | 84.9/85.1 | 90.8/87.7 | 85.2 |
| 5 bits | 60.1 | 0 | 53.1 | 74.8 | 68.4/81.2 | 39.9 | 43.6/45.6 | 72.6/65.8 | 50.9 |
| 4 bits | 52.3 | 0 | 52.7 | 50.9 | 68.4/81.2 | 6.8 | 35.5/35.2 | 67.9/56.5 | 41.1 |
| 3 bits | 51.4 | 0 | 54.2 | 54.9 | 68.4/81.2 | 16.7 | 35.5/35.4 | 68.2/56.7 | 42.7 |
| 2 bits | 54.0 | 0 | 52.7 | 50.9 | 68.4/81.2 | 18.8 | 35.4/35.3 | 67.5/53.2 | 42.6 |
| 1 bit | 54.3 | 0 | 55.6 | 50.9 | 68.4/81.2 | 9.7 | 35.5/35.3 | 67.3/52.5 | 41.9 |

Table 4. The results of fixed-precision k-means quantization for ALBERT on GLUE benchmark.

| #bits | QNLI | CoLA | RTE | SST-2 | MRPC | STS-B | MNLI-m/mm | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|
| 32 bits | 91.5 | 58.9 | 81.6 | 92.8 | 90.2/93.1 | 90.9 | 84.9/85.1 | 90.8/87.7 | 85.2 |
| 5 bits | 91.0 | 55.9 | 78.3 | 92.7 | 90.7/93.4 | 90.8 | 84.2/85.1 | 90.3/87.1 | 84.3 |
| 4 bits | 90.1 | 48.9 | 75.5 | 87.0 | 84.8/89.3 | 75.8 | 82.1/83.1 | 89.2/85.5 | 79.6 |
| 3 bits | 63.5 | 4.6 | 53.8 | 76.5 | 68.1/80.8 | 77.7 | 63.7/65.8 | 82.9/77.9 | 61.8 |
| 2 bits | 61.4 | 0 | 59.9 | 71.6 | 70.8/82.2 | 20.4 | 45.0/45.6 | 72.7/61.5 | 49.7 |
| 1 bit | 50.6 | 0 | 56.0 | 52.2 | 68.4/81.2 | 6.3 | 35.4/35.2 | 69.8/58.8 | 41.5 |

4.3 Experimental Results and Discussion
We mainly focus on 1–5 bit fixed-precision quantization. The results of linear and k-means quantization for BERT are shown in Table 1 and Table 2 respectively, and a further comparison between the average scores of the two sets of experiments is shown in Fig. 1. Similarly, the results and comparison for ALBERT are shown in Table 3, Table 4, and Fig. 2 respectively.

Fig. 1. The comparison of average scores of the 8 GLUE tasks for linear and k-means quantization on BERT models.

Fig. 2. The comparison of average scores of the 8 GLUE tasks for linear and k-means quantization on ALBERT models.

BERT. The Improvements Brought by Quantization Scheme Upgrading. As shown in Table 1, Table 2 and Fig. 1, although the models perform worse with lower bits no matter which quantization scheme is used, the models quantized with k-means quantization perform significantly better than those using linear quantization in each bit setting, across all 8 tasks and their average. On the average of the 8 tasks, merely by upgrading the quantization scheme from linear to k-means, the performance degradation relative to the full-precision model drops from (38.8%, 34.7%, 27.6%, 17.1%, 4.8%) to (28.6%, 3.94%, 0.9%, 0.3%, −0.2%) for 1–5 bit quantization respectively. This result shows that great performance improvements can be achieved by only upgrading the quantization scheme, which indicates that the room for improvement in the quantization scheme is much underestimated. To further illustrate this, we repeated several experiments using the group-wise linear quantization scheme developed by [11], which is an improvement based on linear quantization and achieves much higher performance than simple linear quantization. The results are shown in Table 5. Compared to group-wise linear quantization, simple k-means quantization achieves higher or comparable performance while saving a huge amount of time.¹

¹ In group-wise quantization, each matrix is partitioned into different groups and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantized group separately for each layer instead of reconstructing the entire weight matrix of each quantized layer directly. That explains why group-wise quantization is quite time-consuming. Specifically, in our group-wise quantization experiments, we partition each matrix into 128 groups.

Table 5. The comparison between k-means quantization and group-wise linear quantization on BERT. The rightmost column is the average acceleration of k-means quantization compared to group-wise linear quantization on RTE and MRPC. The experiments are carried out using four NVIDIA 2080 Ti.

| Model | RTE | MRPC | Acceleration |
|---|---|---|---|
| 3 bits k-means | 70.0 | 86.0/90.2 | 22× |
| 3 bits group-wise | 72.6 | 84.8/89.6 | |
| 2 bits k-means | 66.1 | 84.6/89.2 | 16× |
| 2 bits group-wise | 58.5 | 72.3/81.1 | |
| 1 bit k-means | 54.5 | 70.8/81.7 | 10× |
| 1 bit group-wise | 53.1 | 70.6/81.4 | |

The Potential of K-means Quantization. As shown in Table 2, the model can be compressed well simply by using k-means quantization with a fixed-precision strategy, and the quantized models still perform well even in some particularly low bit settings. For instance, on the RTE task, the model quantized to 3 bits with k-means quantization shows only a 2.16% performance degradation. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI, and QQP, the performance of the quantized models only shows a significant drop in the 1-bit setting. It is worth noting that these results were achieved by simple k-means quantization with a maximum of only 3 iterations and without any tricks, which indicates the great development potential of k-means quantization.

ALBERT. Generally speaking, the two main conclusions drawn from the BERT experiments still hold, as shown in Table 3, Table 4, and Fig. 2. We again see great improvements brought by upgrading the quantization scheme and the great potential of k-means quantization. However, there are some abnormal results that are worth discussing.

The Influence of the Number of K-means Iterations. The first set of abnormal results is from the 1-bit quantization of QNLI, MRPC, and STS-B. While k-means normally outperforms linear quantization, these results violate this pattern. We believe that is because the distribution of parameters is so complicated that 3 iterations of k-means do not work well. To validate this hypothesis and further explore the influence of the number of iterations, we repeated the experiments with these abnormal results while extending the number of iterations to 5, 10, and 20. The corresponding results are shown in Table 6. With more iterations, the accuracy of k-means quantization increases and outperforms linear quantization. However, over-fitting might be troublesome, as the performance decreases for QNLI and STS-B when the number of iterations increases from 10 to 20. Therefore, in k-means quantization, the number of k-means iterations is also an important hyperparameter that needs to be searched carefully.

Table 6. The performance of 1-bit quantization with different numbers of k-means iterations on ALBERT.

| Iteration | QNLI | MRPC | STS-B |
|---|---|---|---|
| 3 | 50.56 | 68.38/81.22 | 6.29 |
| 5 | 50.63 | 68.38/81.22 | 6.93 |
| 10 | 60.63 | 68.87/81.30 | 13.76 |
| 20 | 60.19 | 69.85/81.83 | 11.10 |

The Special Number of CoLA and MRPC. Another set of abnormal results is from the linear quantization of CoLA and MRPC, which are binary classification tasks. We find that the quantized models output "1" all the time after being fine-tuned. The two values 0 and 68.4 are determined only by the data distribution of the dev sets. In other words, after the model is quantized to 1–5 bits with linear quantization, it almost loses its functionality and becomes difficult to train on the two tasks. Moreover, we ran further experiments in higher bit settings on the two tasks and find that the results of the quantized models are no longer these two values starting from 6 bits.

The Comparison Between BERT and ALBERT. Moreover, we compare the performance of k-means quantization for BERT and ALBERT, and the results are shown in Fig. 3 and Fig. 4. Compared with BERT, which retains 96.1% of its original performance after k-means 2-bit quantization, ALBERT is much less robust with respect to quantization (in our work, robustness towards quantization means the ability to quantize to a low bit-width while maintaining high performance).
The performance of ALBERT falls to 93.4% and 72.5% after k-means 4-bit and 3-bit quantization, respectively. Considering that the major improvement of ALBERT over BERT is parameter sharing, and that quantization can also be regarded as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, which means that the redundant information removed by parameter sharing and by quantization partially overlaps. Moreover, after parameter sharing, ALBERT has removed a great amount of redundant information compared to BERT (the total number of parameters falls from 108M to 12M). Therefore, further applying quantization
Fig. 3. The comparison of average scores of the 8 GLUE tasks for BERT and ALBERT models with k-means quantization.
Fig. 4. The comparison of performance for BERT and ALBERT models with k-means quantization. Each value refers to the percentage of the average score of the quantized model compared to the score of the full precision model.
upon ALBERT will easily damage the useful information, and the robustness of ALBERT towards quantization is therefore rather low. However, from another point of view, parameter sharing has already significantly reduced the number of parameters and thus can also be considered a model compression method. Moreover, considering that the performance of full-precision ALBERT is better than that of the 4-bit and 3-bit BERT models, which occupy a similar amount of GPU memory, parameter sharing can even achieve better compression performance
than simple quantization. However, as a compression method, parameter sharing has a non-negligible drawback: it can only reduce memory consumption, while most other compression methods can reduce both memory consumption and computation cost (i.e., the inference time).
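The role of the iteration count discussed above can be illustrated with a minimal sketch. It assumes a plain one-dimensional k-means over a flat list of weights, initialised on the linear quantization grid, and measures only the weight reconstruction error. The function names, the synthetic weights, and the initialisation are illustrative assumptions, not the exact procedure used in the experiments.

```python
import random


def linear_quantize(weights, bits):
    """Map each weight to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def kmeans_quantize(weights, bits, iterations):
    """Map each weight to the nearest of 2**bits centroids found by 1-D k-means."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Assumption: centroids start on the linear quantization grid.
    centroids = [lo + j * (hi - lo) / (k - 1) for j in range(k)]
    for _ in range(iterations):
        sums, counts = [0.0] * k, [0] * k
        for w in weights:
            j = nearest(w, centroids)
            sums[j] += w
            counts[j] += 1
        # Empty clusters keep their previous centroid.
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(k)]
    return [centroids[nearest(w, centroids)] for w in weights]


def mse(original, quantized):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(original, quantized)) / len(original)


# Synthetic, non-uniform "weights" (hypothetical data, for illustration only).
rng = random.Random(0)
weights = ([rng.gauss(0.0, 0.05) for _ in range(900)]
           + [rng.gauss(0.6, 0.2) for _ in range(100)])

errors = {"linear": mse(weights, linear_quantize(weights, 3))}
for n_iter in (3, 5, 10, 20):
    quantized = kmeans_quantize(weights, 3, n_iter)
    errors[f"k-means, {n_iter} iterations"] = mse(weights, quantized)

for name, err in errors.items():
    print(f"{name}: {err:.6f}")
```

Note that the reconstruction error of this sketch can only stay the same or decrease with more iterations, so it cannot reproduce the drop observed for QNLI and STS-B between 10 and 20 iterations; that effect concerns task accuracy after fine-tuning, which still has to be searched experimentally.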
5 Conclusion
In this paper, we compare k-means and linear quantization on BERT and ALBERT models and obtain three main results. First, we find that the models quantized with k-means significantly outperform those using linear quantization. Great performance improvements can be achieved by simply upgrading the quantization scheme. Second, the model can be compressed to a relatively low bit-width using only k-means quantization, even with a simple fixed-precision strategy and without any tricks. This indicates the great potential of k-means quantization. Third, the number of k-means iterations plays an important role in the performance of quantized models and should be determined carefully. Besides, by comparing the results of k-means quantization for BERT and ALBERT, we discover that ALBERT is much less robust towards quantization than BERT. This indicates that parameter sharing and quantization have some effects in common. Therefore, further applying quantization to models with extensive parameter sharing will easily damage the useful information and thus lead to a significant performance drop.
Acknowledgement. We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).
A Survey of Sentiment Analysis Based on Machine Learning
Pingping Lin and Xudong Luo(B)
Guangxi Key Lab of Multi-source Information Mining and Security, College of Computer Science and Information Engineering, Guangxi Normal University, Guilin 541004, China
Abstract. Every day, Facebook, Twitter, Weibo, and other social network sites and major e-commerce sites generate a large number of online reviews with emotions. Analysing people’s opinions from these reviews can assist a variety of decision-making processes in organisations, products, and administrations. Therefore, it is practically and theoretically important to study how to analyse online reviews with emotions. To help researchers study sentiment analysis, in this paper we survey the machine learning based methods for sentiment analysis of online reviews. These methods are mainly based on Support Vector Machine, Neural Networks, Naïve Bayes, Bayesian network, Maximum entropy, and some hybrid methods. In particular, we point out the main problems in the machine learning based methods for sentiment analysis and the problems to be solved in the future.
Keywords: Sentiment analysis · Machine learning · Integrated learning · Transfer learning
1 Introduction
The rapid development and popularity of the Internet inevitably lead to a significant increase in the amount of online data [7]. Much of the data concerns opinions that people express on public forums such as Facebook, Twitter, Microblog, Blogs, and e-commerce websites. In particular, online comment texts on e-commerce websites reflect buyers’ real feelings or experiences regarding the quality of the purchased goods, business services, and logistics services; they convey not only consumers’ satisfaction with their shopping, but also their acceptance and expectation of new products or services. The insights in online comments significantly affect consumers’ desires and decisions, which in turn impacts the efficiency of e-commerce platforms. Therefore, it is crucial to quickly mine and effectively take advantage of the comments. However, it is difficult to extract valuable information from these massive online texts. Therefore, the academic community and the industry pay much attention to the issue [49].
© Springer Nature Switzerland AG 2020 X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 372–387, 2020. https://doi.org/10.1007/978-3-030-60450-9_30
Generally speaking, sentiment analysis refers to the task of detecting, analysing, and extracting attitudes, opinions, and emotions expressed by people
in a given dataset [4,21]. So sentiment analysis is also called opinion mining, orientation analysis, emotion classification, and subjective analysis. Sentiment analysis tasks involve many problems in the field of natural language processing, including named entity recognition, word polarity disambiguation, satire detection, and aspect extraction. The number of problems involved in a sentiment analysis task is directly proportional to the difficulties users face in their application (Fig. 1).
Fig. 1. Machine learning based methods
The term sentiment analysis was coined by Nasukawa and Yi [28], but it was Pang and Lee [30] who first proposed the task of sentiment analysis. They define the subjective calculation process of the text as sentiment analysis and opinion mining, yet they do not give a more detailed definition of sentiment analysis. Later on, Liu [20] defines an emotional expression as a 4-tuple (Holder, Target, Polarity, Time), where Holder represents the opinion holder, Target refers to the object to be evaluated, Polarity stands for the expressed emotion category, and Time is the evaluation time. Among them, the sentiment categories involved may vary with the sentiment analysis task. For example, in some sentiment analysis tasks, they are positive and negative only; in other tasks, they may be positive, negative, and neutral; or happiness, anger, sorrow, and fear; or just some scores (such as 1–6 points).
Specifically, the significance of sentiment analysis of online commentary texts is twofold.
1) Practical significance. Online comments could be consumers’ opinions that support their decisions to buy a product or a service, or people’s opinions about social issues, which are the concerns of a government. Given this impact on decision-making, the sentiment analysis of online comments has important practical significance. Alfaro et al. [1] illustrate the effect of sentiment analysis technology for government and public institutions. Generally speaking, the application of sentiment analysis to massive data can help to improve the Internet’s potential public opinion monitoring systems, expand a company’s marketing capabilities, and detect world anomalies or emergencies. Moreover, it can also be applied to the research fields of psychology, sociology, and financial forecasting.
2) Theoretical significance. The analysis of texts with emotions needs the expertise of multiple disciplines, such as natural language processing, machine learning, and text classification. It can help establish a reliable emotion dictionary, sub-area sentiment dictionaries, and abundant corpus resources. Moreover, it can improve the accuracy of various classification algorithms.
At present, there are mainly four kinds of methods for sentiment analysis. The first kind is based on a sentiment lexicon. The rules can be manually formulated according to specific needs. This kind of method is very dependent on the emotion dictionary. The second is based on machine learning. This kind of method first has to extract the word features and then chooses different machine learning algorithms to analyse the sentiment of the text. Methods of this kind rely heavily on the feature extraction of texts. The third is based on deep learning, i.e., using different neural network models to map a piece of text into a vector space and then inputting the digitised text into a classifier model. The fourth is based on transfer learning.
In this paper, we focus on machine learning-based methods for sentiment analysis. Although there are some surveys on sentiment analysis, ours in this paper is different from them.
For example, in 2019, Sasikala and Sukumaran [36] surveyed machine learning-based methods for sentiment analysis. However, their survey does not cover the studies in 2019 and 2020, nor does it concern the sentiment analysis of Chinese texts, whereas ours covers both. In 2020, Nazir et al. [29] conduct a survey on the problems and challenges of aspect-based sentiment analysis, and Yadav et al. [47] survey deep learning based sentiment analysis. However, their surveys do not focus on machine learning based methods for sentiment analysis, as ours does. The rest of this paper is organised as follows. Section 2 briefly reviews some of the linear classifiers based methods. Section 3 recaps some probabilistic classifiers based methods. Section 4 analyses some other methods based on machine learning. Section 5 discusses the challenges of machine learning-based methods. Finally, Sect. 6 summarises this paper.
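The 4-tuple (Holder, Target, Polarity, Time) that Liu [20] uses to define an emotional expression maps naturally onto a small record type. The following is a minimal sketch; the class name `Opinion`, the helper `polarity_counts`, and the sample records are hypothetical and only illustrate how such tuples could be stored and aggregated.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Opinion:
    """One emotional expression as a (Holder, Target, Polarity, Time) tuple."""
    holder: str    # who expresses the opinion
    target: str    # the object being evaluated
    polarity: str  # emotion category; the label set depends on the task
    time: date     # when the evaluation was made


def polarity_counts(opinions, target):
    """Count the polarity labels of all opinions about one target."""
    return dict(Counter(o.polarity for o in opinions if o.target == target))


# Hypothetical records using a positive/negative/neutral label set.
opinions = [
    Opinion("user_a", "phone", "positive", date(2020, 1, 5)),
    Opinion("user_b", "phone", "negative", date(2020, 1, 6)),
    Opinion("user_c", "phone", "positive", date(2020, 1, 7)),
    Opinion("user_a", "delivery", "neutral", date(2020, 1, 5)),
]

print(polarity_counts(opinions, "phone"))
```

Because the polarity field is a free label, the same structure covers binary, three-way, emotion-category, and score-based tasks.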
2 Linear Classifiers Based Methods
In this section, we discuss two kinds of linear classifiers based methods.
2.1 Support Vector Machine Based Methods
Yang [48] constructs a set of benchmark sentiment words and analyses their characteristics to explore potential sentiment words. Specifically, Yang uses the Support Vector Machine (SVM) classification method to fuse word features, part-of-speech features, and semantic features in order to identify and classify sentences. The work focuses on the sentiment analysis of four emotions, namely happiness, anger, sorrow, and fear, and can analyse the different emotional tendencies of the same words in different contexts.
Wang and Yang [41] consider the high-dimensional data characteristics of commodity comments (an important kind of sentiment analysis problem) and propose a new RS-SVM (Random Subspace-SVM), hoping to combine the advantages of SVM and Random Subspace. The main reason for choosing the Random Subspace method is that commodity comment sentiment analysis problems often have tens of thousands of classification features, some of which are noisy. Compared with data partitioning based methods (e.g., Bagging and Boosting), the feature partitioning method of Random Subspace can mitigate the high-dimensionality and noise problems by dividing the classification features into different subsets. Their experimental results show that RS-SVM outperforms SVM, NB (Naïve Bayes), Boosting-SVM, and other classifiers.
2.2 Neural Networks
Chen and Huang [6] propose a knowledge-enhanced neural network for sentiment analysis. It combines aspect-opinion recognition with aspect-level sentiment classification and integrates external knowledge into the neural network to compensate for insufficient training data. Finally, they classify the sentiment polarity of a Chinese car review dataset as positive, negative, or neutral. The experimental results show that the knowledge-enhanced neural network gives more detailed sentiment analysis results and is consistently better than conventional models. It shows better performance especially when the training data is insufficient or the corpus is limited.
Liu and Shen [22] propose a novel neural network structure, named the Gated Alternate Neural Network (GANN), to learn informative aspect-dependent sentiment clue representations. Their method addresses some weaknesses of previous sentiment analysis methods, such as lacking position invariance and sensitivity to local key patterns, being weak at capturing long-distance dependency and modelling sequence information, and having a poor ability to deal with noise when capturing important sentiment expressions. To verify the effect and generalization of GANN, they conduct extensive experiments on four Chinese and three English datasets, classifying the sentiment polarity of the reviews as positive, negative, or neutral. They find that GANN achieves state-of-the-art results, suggesting that their method is language-independent.
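The feature-partitioning idea behind the Random Subspace method in Sect. 2.1 can be sketched as follows. Each base learner is trained on a random subset of the features, and the ensemble predicts by majority vote. To keep the sketch self-contained, a nearest-centroid classifier is swapped in as the base learner in place of an SVM; all function names and the toy data are hypothetical, so this is not the RS-SVM of Wang and Yang [41] itself.

```python
import random
from collections import Counter


def train_centroid(X, y, feats):
    """Stand-in base learner: per-class mean vector over the chosen features."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, f in enumerate(feats):
            acc[i] += row[f]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, feats, row):
    """Predict the class whose centroid is closest on the chosen features."""
    def dist(centroid):
        return sum((row[f] - centroid[i]) ** 2 for i, f in enumerate(feats))
    return min(model, key=lambda label: dist(model[label]))


def train_random_subspace(X, y, n_learners, subspace_size, seed=0):
    """Train each base learner on its own random subset of feature indices."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_learners):
        feats = rng.sample(range(len(X[0])), subspace_size)
        ensemble.append((feats, train_centroid(X, y, feats)))
    return ensemble


def predict_random_subspace(ensemble, row):
    """Majority vote over the base learners."""
    votes = Counter(predict_centroid(model, feats, row) for feats, model in ensemble)
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical); real review data would have thousands of features.
X = [[1.0, 1.0, 1.0, 1.0, 0.9, 1.1], [0.9, 1.1, 1.0, 0.8, 1.0, 1.0],
     [0.0, 0.0, 0.0, 0.0, 0.1, -0.1], [0.1, 0.0, 0.2, 0.0, 0.0, 0.1]]
y = ["pos", "pos", "neg", "neg"]
ensemble = train_random_subspace(X, y, n_learners=5, subspace_size=3)
print(predict_random_subspace(ensemble, [0.9, 1.0, 1.0, 0.8, 1.0, 1.0]))
```

Replacing `train_centroid` and `predict_centroid` with an SVM trainer and predictor would give the RS-SVM structure described in the text.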
3 Probabilistic Classifiers Based Methods
In this section, we discuss four kinds of probabilistic classifiers based methods.
3.1 Naïve Bayes Based Methods
Taking into account the degree of contribution of emotional characteristics to the emotional orientation of comment texts, Zeng et al. [50] develop a Naïve Bayes (NB) classification model based on an extended sentiment dictionary to analyse the sentiment of comment texts. Its key is to use a weighting factor to reflect the different importance of different sentiment words. This method is actually a weighted integration of the ordinary NB and the basic NB model. On data sets regarding different things (e.g., hotels and notebooks), their model yields better classification results in terms of accuracy, recall, and F1 value.
Rout, Choo, and Dash [33] evaluate the utility of unsupervised and supervised algorithms in the emotional classification of unstructured data. In the unsupervised method, they use the SentiWordNet dictionary to evaluate the performance of this method. In the supervised method, they use the NB model to perform sentiment analysis on tweets. The performance of the classifier is evaluated based on accuracy, recall, and F-measure. In the future, it is worth applying their method to larger data sets regarding, for example, smart recommendations. Similarly, Chang and Huo [5] propose a fine-grained short text sentiment analysis method based on NB. Their experiments show that the classification accuracy of their approach outperforms that of other approaches.
Sentiment analysis requires a lot of manual annotation of a corpus, which is a difficult task. To solve this problem, Su et al. [38] propose the NB-LDA model based on the Topic Model and the NB model. Their model is actually an unsupervised classification model, capable of performing sentiment analysis at the sentence level and the discourse level simultaneously, and it combines the emotional relationship between the two levels to improve the performance of sentiment analysis. The test results show that the accuracy of the NB-LDA model is significantly better than that of other unsupervised methods, and even close to that of some semi-supervised or supervised methods.
3.2 Bayesian Network Based Methods
Ruz, Henriquez, and Mascareno [34] analyse Twitter data in events such as natural disasters or social movements. More specifically, they use a Bayesian Network (BN) to perform sentiment analysis on two Spanish datasets (the 2010 Chile earthquake and the 2017 Catalonia independence referendum) to understand the qualitative information of the events from historical and social perspectives. Compared with SVM and random forests, their method is effective. Besides, their method can identify the relations amongst words, and so reveal interesting qualitative information for understanding, historically and socially, the main features of the event dynamics.
Most of the current work focuses on the classification of emotions. Liang, Ganeshbabu, and Thorne [18] find that most researchers ignore how the emotional orientation of a topic is affected by other topics, or the dynamic interaction of topics from the perspective of emotion analysis. They build a Gaussian process dynamic BN and analyse 9.72 million tweets from Twitter to study the emotional dynamics of topics related to Brexit. This is probably the first work to use a dynamic BN to model the dynamics and interactions of topic emotions. In the future, one may use this method to provide more in-depth sentiment analysis for social media.
3.3 Maximum Entropy Based Methods
Zhang, Zheng, and Chen [51] propose a method for sentiment component analysis and public opinion prediction of Chinese microblog posts based on the Maximum Entropy (ME) model. Their model uses fine-grained emotion classification. First, it filters the noise data in the Chinese microblog posts. Then it extracts the text features by using the document frequency method and the information gain principle. Finally, it trains the classifier with the maximum entropy model and integrates multiple classifiers to analyse emotions. Xie et al. [45] propose a novel ME classification method based on probabilistic latent semantic analysis, capturing the relevance of words and parts of speech in context, the relevance of adverbs of degree, and the similarity between reference emotion words.
3.4 Probability Graph Based Method
Wu, Zhu, and Zhou [42] propose a method of sentiment analysis based on the probability graph model, which solves the emotional judgment problem of commentary sentences that contain both evaluation words and evaluation objects. It improves the classification accuracy compared with the traditional SVM method.
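The idea in Sect. 3.1 of weighting sentiment words more heavily inside a Naïve Bayes classifier can be sketched as below. The sketch assumes a multinomial NB with add-one smoothing in which the log-likelihood of every lexicon word is multiplied by a weighting factor. The exact weighting scheme of Zeng et al. [50] is not given in this survey, so the formula, the function names, the lexicon, and the toy reviews are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_weighted_nb(model, tokens, lexicon, lexicon_weight):
    """Score each class; sentiment-lexicon words count lexicon_weight times."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing.
            prob = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            weight = lexicon_weight if tok in lexicon else 1.0
            score += weight * math.log(prob)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy hotel reviews and a tiny sentiment lexicon (hypothetical data).
docs = [["good", "room", "clean"], ["great", "service", "room"],
        ["bad", "room", "dirty"], ["terrible", "service", "noisy"]]
labels = ["pos", "pos", "neg", "neg"]
lexicon = {"good", "great", "bad", "terrible"}
model = train_nb(docs, labels)

review = ["bad", "room", "room"]
print(predict_weighted_nb(model, review, lexicon, lexicon_weight=1.0))
print(predict_weighted_nb(model, review, lexicon, lexicon_weight=2.0))
```

With a weight of 1.0 the sketch reduces to ordinary multinomial NB, and the neutral word "room" dominates the toy review; with a weight of 2.0 the sentiment word "bad" decides the label.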
4 Other Models
In this section, we discuss three kinds of other machine learning based methods.
4.1 Models of Integrating Machine Learning and Other Methods
For sentiment analysis of Weibo (the most popular Chinese social media platform), Li et al. [16] integrate machine learning methods with lexicons of 448 everyday Internet slang expressions and 109 Weibo emoticons. The machine learning approaches they consider include k-Nearest Neighbors (KNN), Decision Tree, Random Forest, Logistic Regression, NB, and SVM. Their experiments show that their method can significantly improve the performance in detecting expressions that are difficult to polarise into positive-negative categories. Basiri et al. [3] propose a fusion model for sentiment analysis. Specifically, they use a deep learning method as the primary classifier and a traditional learning method as the secondary classifier, which is applied when the confidence of the deep method in classifying a test sample is low. Their experiments on reviews from the Drugs.com dataset show that their method outperforms the traditional deep learning based methods in terms of accuracy and F1-measure.
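The confidence-based fusion described for Basiri et al. [3] can be sketched as a simple fallback rule: use the primary classifier when its confidence is high enough, and otherwise defer to the secondary classifier. The function name, the threshold of 0.7, and the two stand-in classifiers below are hypothetical; they are not the models or settings of the original work.

```python
def fuse_predict(primary, secondary, sample, threshold=0.7):
    """Return (label, source), falling back to the secondary classifier
    when the primary classifier's top probability is below the threshold."""
    probs = primary(sample)
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        return label, "primary"
    fallback = secondary(sample)
    return max(fallback, key=fallback.get), "secondary"


# Stand-in classifiers returning class probabilities (hypothetical behaviour).
def deep_model(tokens):
    if "excellent" in tokens:
        return {"positive": 0.95, "negative": 0.05}
    return {"positive": 0.55, "negative": 0.45}  # unsure


def traditional_model(tokens):
    if "headache" in tokens:
        return {"positive": 0.2, "negative": 0.8}
    return {"positive": 0.6, "negative": 0.4}


print(fuse_predict(deep_model, traditional_model, ["excellent", "relief"]))
print(fuse_predict(deep_model, traditional_model, ["caused", "headache"]))
```

The threshold controls how often the secondary method is consulted and would have to be tuned on held-out data.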
In order to make full use of the results of rule-based sentiment analysis, Jiang and Xia [15] develop a Microblog emotion classification algorithm that integrates machine learning and rules. They use the rule algorithm to obtain sufficient emotion information and turn it into a multidimensional vector that is embedded in the vector space model. The feature template embeds the rule features to obtain a fusion feature template, so that the machine learning algorithm can fully utilise the rule features to achieve better classification performance. However, two points need to be improved further. One is to adjust the rule method so as to mine more prior emotional information (such as emotion recognition in negative and ironic sentences). The other is to automatically learn the emotion dictionary from an unlabelled corpus so that the performance gain is higher than that of the manually labelled sentiment dictionary.
Xia et al. [44] believe that using emotion feature segments instead of the entire review, together with asymmetrically weighted feature words, can improve the accuracy of emotion classification. They propose a method for sentiment classification of online reviews. This method uses a conditional random field algorithm to extract comment fragments from the text and extracts the emotional characteristics of the comment fragments. Then, the sentiment feature words are asymmetrically weighted, and finally, the sentiment direction of the reviews is classified using an SVM classifier. The average accuracy of their experimental classification results reaches 90%.
4.2 Integrated Learning Based Methods
Sisodia et al. [37] compare the accuracy, precision, recall, and F-measure of different machine learning algorithms for the sentiment analysis of positive or negative polarity in movie review data sets. These methods include the naïve Bayes classifier, Support Vector Machine, Decision trees, and ensemble learners. They try to find the best classifier for the sentiment analysis of movie reviews. However, they find that using a set of classifiers collaboratively can yield more effective results. So it is necessary to integrate several machine learning methods for sentiment analysis.
Generally, integrated learning accomplishes learning tasks by first generating a set of individual learners and then combining them with a particular strategy. There are mainly three kinds of such strategies. The first one is the standard method, i.e., averaging the outputs of several weak learners to get the final predicted output. The second is the voting method (with either equal or unequal weights). Mungra, Agrawal, and Thakkar [27] propose a voting-based ensemble model for sentiment analysis. They use five supervised machine learning classifiers (logistic regression, support vector machine, artificial neural network, decision tree, and random forest) as base classifiers and a majority voting rule-based mechanism to get the final prediction. In terms of the minimum, maximum, mean, and median values of precision, recall, F-measure, and accuracy, their method outperforms the individual classifiers in the majority of the cases. The third is the learner method. In this method, the weak learner is called the primary learner, and the learner used for the combination is called the secondary level learner. For a test set, the primary learner first predicts once to produce the input samples for the secondary level learner, and the secondary level learner then predicts once to give the final prediction.
Integrated learning of multiple classifiers can be done through different training data sets. For example, Wan [40] proposes a new method of using Chinese emotional resources to analyse Chinese sentiment (using bilingual knowledge resources), combined with machine translation (Google translation) and integrated learning methods. Using an integrated approach to fuse the individual analysis results for each language can yield better results than single-language analysis. The specific process is first to construct the basic association between languages. That is, the machine translates the annotated text in the source language into the target language to obtain pseudo-labelled data in the target language. Secondly, the emotional vocabulary of the source language and the limited emotion vocabulary of the target language are used to learn several sentiment classifiers. Finally, the cross-language sentiment classifier is obtained through integration. Xu et al. [46] propose a sentiment analysis model, MF-CSEL, for Chinese-English mixed text. The model uses word vectors, bilingual sentiment features, and TF-IDF as the input features of the baseline classifiers. A cost-sensitive ensemble learning method is used to fuse the classification results of different base classifiers to achieve multi-lingual text sentiment classification into happiness, sadness, anger, fear, and surprise. In the future, it is worth trying different classifiers for multi-lingual text sentiment analysis tasks, because there is still a certain distance from the optimal results of the experimental evaluation of NLPCC2018.
Nevertheless, it is worth noticing that sometimes one machine learning method is the best in dealing with a specific task of sentiment analysis.
For example, López-Chau, Valle-Cruz, and Sandoval-Almazán [23] employ three classifiers for sentiment analysis of tweets that belong to the same topic. However, they find that the classifiers with the best accuracy in predicting emotions are NB and SVM. Samuel et al. [35] use two machine learning methods for sentiment analysis of Tweets posted during the Coronavirus pandemic. They observe a reliable classification accuracy of 91% for short Tweets using NB and a reasonable accuracy of 74% for shorter Tweets using the logistic regression classification method, while both methods show relatively weaker performance for longer Tweets. Similarly, Lim et al. [19] examine different machine learning methods in sentiment analysis of the same object: business news headlines. The methods they examine include a multi-layer perceptron classifier, multinomial naïve Bayes, complement naïve Bayes, decision trees, the typical RNN architecture, and the encoder-decoder architecture. Mostafa [26] proposes a traveller review sentiment classifier to analyse a total of 11,458 travellers’ reviews on five hotels located in Aswan, Egypt, providing a classification of each sentiment based on hotel features. Mostafa compares the classification techniques of SVM, NB, KNN, J48, and logistic regression in terms of recall, precision, and F1-measure, and finds that NB outperforms the other classification methods with a precision of about 94.0%.
4.3 Transfer Learning Based Methods
Sentiment analysis is the process of identifying human emotion from signals such as facial expression, speech, and text. It is difficult for people to manually collect and classify these signals because the tasks are often tedious and time-consuming, and require expert-level knowledge. Transfer learning is an effective way to address the challenges related to the scarcity of data and the lack of human labels. This is because it can find labelled data from relevant fields for training when the target field has little data, i.e., it transfers the knowledge learned in one field to another [31]. Feng and Chaspari [13] recap fundamental concepts in the field of transfer learning, review work that has successfully applied transfer learning to sentiment analysis, and point out future research directions for using transfer learning for sentiment analysis.
For cross-language sentiment analysis, the transfer learning-based method can transfer knowledge from one language to another. The fundamental challenge of cross-language learning for sentiment analysis is that the feature space of the source language data has almost no overlap with that of the target language data. The translation of a source language into a target language faces several problems. One is that the polarity of emotions may change. For example, the English sentence “it is too beautiful to be true” originally has a negative meaning: “it is not true because it is too beautiful”. Nevertheless, when Google translates it into Chinese, it turns into a positive meaning: “it is so beautiful and true”. Another is that the vocabulary overlap between the documents translated into the target language and the original target-language documents is low. Meng et al. [25] propose a hybrid model of cross-language sentiment classification using unlabeled parallel data. It can learn previously unseen emotion words from a large amount of data, thereby improving vocabulary coverage.
Duh, Fujino, and Nagata [11] show that vocabulary coverage has a strong correlation with the accuracy of sentiment classification. Data sparseness and the curse of dimensionality are issues for natural language processing tasks. For data-sparse problems, Popat et al. [32] study feature clustering of unlabelled parallel data, alleviating the issue of data sparsity faced by supervised learning methods. Fang and Tao [12,39] propose a transfer learning-based multi-label classification method for Aspect Based Sentiment Analysis (ABSA) at a fine-grained level (e.g., different features or characteristics of products/services). Their experiments confirm that their method outperforms other mainstream multi-classification methods in the context of ABSA on online restaurant reviews.
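The three combination strategies of integrated learning described in Sect. 4.2 (averaging, voting with equal or unequal weights, and the learner method with a secondary level learner) can be sketched in a few lines. All function names and the stand-in learners below are hypothetical and serve only to make the three strategies concrete.

```python
from collections import defaultdict


def average_scores(scores):
    """Standard method: average the numeric outputs of several learners."""
    return sum(scores) / len(scores)


def weighted_vote(labels, weights=None):
    """Voting method: equal weights by default, unequal weights if given."""
    if weights is None:
        weights = [1.0] * len(labels)
    tally = defaultdict(float)
    for label, weight in zip(labels, weights):
        tally[label] += weight
    return max(tally, key=tally.get)


def stacked_predict(primary_learners, secondary_learner, sample):
    """Learner method: primary outputs become the secondary learner's input."""
    meta_features = [learner(sample) for learner in primary_learners]
    return secondary_learner(meta_features)


# Stand-in learners (hypothetical): each primary learner outputs 1 or 0.
primary_learners = [
    lambda tokens: 1 if "good" in tokens else 0,
    lambda tokens: 0 if "bad" in tokens else 1,
    lambda tokens: 1 if len(tokens) > 2 else 0,
]


def secondary_learner(meta_features):
    """A trivial secondary level learner over the primary outputs."""
    return "positive" if sum(meta_features) >= 2 else "negative"


print(average_scores([0.9, 0.6, 0.3]))
print(weighted_vote(["pos", "neg", "neg"]))
print(weighted_vote(["pos", "neg", "neg"], weights=[3.0, 1.0, 1.0]))
print(stacked_predict(primary_learners, secondary_learner, ["good", "food"]))
```

In practice the secondary level learner would itself be trained on the primary learners' predictions rather than hand-written as here.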
Table 1. Sentiment analysis algorithms based on machine learning.
Algorithm | Advantage | Disadvantage | Application field
SVM [48] | Simple | Long training time | Product and hotel reviews
RS-SVM [41] | Based on random subspace | Long training time | Movie reviews
NN [6, 22] | Self-learning | Unexplainable | -
DNN [8] | Strong learning ability | Lack of interpretability | YouTube
KNN [6] | Knowledge-enhanced | Unexplainable | Chinese car reviews
NB [33] | Simple, easy to implement | Poor precision | Hotels reviews and tweets
NB-LDA [38] | Based on LDA | Insensitive to missing data | Movie reviews
BN [34] | Deal with uncertainty | Problems with input variables | Major events comments
ME [45, 51] | Flexible constraints | Slow down iterations | Microblog
LR [14] | Time efficiency | Low classification accuracy | Movie reviews
5 Discussion
Table 1 lists several common machine learning methods for sentiment analysis. The idea behind Support Vector Machine (SVM) classification is simple: maximise the margin between the sentiment classification samples and the decision surface. SVM classifies well and can handle machine learning problems with small samples, such as many sentiment analysis tasks. However, training an SVM on large-scale data is difficult and slow, and SVM does not directly support multi-class classification, although this can be achieved with indirect methods. Naïve Bayes (NB) is relatively simple, and its classification efficiency is stable. It performs well on small-scale data and can handle sentiment analysis and multi-class classification tasks. NB classifies poorly when the number of attributes of the comment object is large or when the attributes are strongly correlated, and it performs best when the attribute correlation is small. It is insensitive to missing data but very sensitive to the representation of the input data. Because NB assumes that the features of the samples are independent of each other, which is often not true in practice, it cannot learn the interactions between features. Extracted emotional features are usually closely related, which reduces the accuracy of classification. Compared with NB, Logistic Regression (LR) does not need to worry about whether the extracted emotional features are correlated, and it can easily be updated with new data. Its time efficiency is very high, so it is suitable for large-scale data sets, but its classification accuracy is low. Both LR and SVM mitigate the problem that the classification error can be substantial when the data is unevenly distributed.
Specifically, LR applies a sigmoid function as a non-linear mapping to all data, which weakens the influence of data far from the decision plane; SVM ignores such data altogether, since only the support vectors influence the decision. We can choose between LR and SVM by the following rules: (i) if the number of features is large and similar to the number of samples, then use LR or a linear-kernel SVM; (ii) if the number of features is relatively small
382
P. Lin and X. Luo
and the number of samples is moderate, then choose an SVM with a Gaussian kernel; and (iii) if the number of features is small and the number of samples is large, then manually add features so that the problem falls into the first case. A Bayesian network (BN) has a strong ability to deal with uncertainty, so it can handle sentiment classification problems with fuzzy class boundaries. The Maximum Entropy (ME) model can set constraints flexibly. The degree of constraint adjusts how well the model fits known data and how well it adapts to unknown data. Because the number of constraints is often related to the number of samples, the number of constraints grows as the number of samples increases. This leads to more and more computation and slows down the iterations, which makes the model challenging to apply to sentiment analysis in practice.

Table 2. State-of-the-art sentiment analysis algorithms based on machine learning.

| Research | Algorithm | Data description | Sentiment category | Performance |
|---|---|---|---|---|
| Hamdan et al. [14] | LR | SemEval 2015, target-level sentiment analysis | Positive, negative and neutral | F1: 0.62, Recall: 0.55, Precision: 0.72 |
| Xia et al. [43] | SVM | http://www.autohome.com.cn; http://www.amazon.co.uk, chapter-level sentiment analysis | Positive and negative | Average accuracy: 90% |
| Li et al. [17] | NB | Barrage data from Danmaku data, mix audio and text | Positive: like, happiness; negative: surprise, fear, anger, sadness and disgust | Accuracy: 0.882; Positive: F1: 0.823, Precision: 0.867; Negative: F1: 0.912, Precision: 0.889 |
| Xie et al. [45] | ME | Citysearch New York: http://newyork.citysearch.com/ | Positive and negative | Precision: 87.11%, Recall: 91.42%, F-measure: 89.21% |
| Ruz et al. [34] | BN | Dataset 1: Chilean earthquake, Dataset 2: Catalan independence referendum, both mix text and emoticons | Negative and positive | Dataset 1: Accuracy: 0.721, Precision: 0.896; Dataset 2: Accuracy: 0.808, Precision: 0.906 |
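The rule of thumb for choosing between LR and SVM given in the discussion above can be written as a small decision function. This is only a sketch: the function name and all numeric thresholds (what counts as "similar", "small", or "large") are hypothetical choices made here for illustration, since the text states the rules only qualitatively.

```python
def choose_classifier(n_features: int, n_samples: int,
                      similar_ratio: float = 0.5,
                      small_features: int = 1_000,
                      many_samples: int = 50_000) -> str:
    """Suggest a model family from the three qualitative rules (i)-(iii).

    All thresholds are illustrative assumptions, not values from the survey.
    """
    # (i) many features, comparable to the number of samples
    if n_features >= similar_ratio * n_samples:
        return "LR or linear-kernel SVM"
    # (iii) few features but very many samples: engineer more features,
    # which turns the problem into case (i)
    if n_features <= small_features and n_samples >= many_samples:
        return "add features, then LR or linear-kernel SVM"
    # (ii) everything else is treated as few features and a moderate sample size
    return "SVM with Gaussian kernel"


print(choose_classifier(10_000, 12_000))
print(choose_classifier(200, 5_000))
print(choose_classifier(200, 1_000_000))
```

In practice such a rule only narrows the search; the final choice should still be validated on held-out data for the sentiment task at hand.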
Recently, Diaz et al. [10] present an interesting study that deserves in-depth consideration. Most researchers use off-the-shelf tools to solve the problems they face without considering the shortcomings of these tools, unless they notice those shortcomings in use. We should adequately consider whether existing sentiment analysis tools are biased (e.g., with respect to gender, race, and geography). Identifying these biases and reducing their impact benefits sentiment analysis. When the data set is large enough, machine learning-based sentiment analysis models become more mature and achieve better classification results. However, training such a model requires a large amount of correctly labelled data, and the training time is long. It is necessary to keep experimenting to select the algorithm that best suits the data set used in a sentiment analysis task. The size and structure of the data have a significant impact on the effectiveness of the algorithm. Category imbalance in the data often limits the performance of machine learning models in sentiment analysis tasks, which leads to underrepresented categories often being classified
incorrectly. One interesting direction is finding the optimal combination of data features. It is also interesting to consider the multi-domain nature of the data set and to mine implicit emotions. Besides, we can try to find resources that are suitable for various fields. Table 2 summarises state-of-the-art machine learning-based algorithms in terms of their data description, sentiment categories, and performance. The IEMOCAP¹, IMDB², and SemEval data sets are widely used in sentiment analysis. The SemEval tasks include SemEval-2014 Task 4³, SemEval-2015 Task 12⁴, and SemEval-2016 Task 5⁵, which cover restaurants, hotels, laptops, consumer electronics, telecom, museums, and other fields. The sentiment polarity labels of these data sets are all positive, negative, and neutral. From Table 2, we can see that researchers are more inclined to use machine learning algorithms to study coarse-grained sentiment analysis problems. However, industry and academia need more fine-grained sentiment research results, such as aspect-level, attribute-level, and target-level sentiment analysis. A pre-trained language model [1,2,9,24] might be a feasible solution to fine-grained sentiment analysis problems.
6 Conclusion
This paper surveys the state of the art in sentiment analysis based on machine learning, analyses the advantages and disadvantages of these methods, and points out problems that need to be addressed in the future. In particular, we discuss the problem of cross-lingual sentiment analysis, which is different from cross-modal sentiment analysis, and point out that machine learning methods play an essential role in this problem. There are many open issues. Most studies on sentiment analysis focus on text, and few on fusion with images or graphics. It is also necessary to improve sentiment dictionaries, especially for specific domains. It is difficult to compare and evaluate methods across different sentiment analysis tasks. For the needs of different evaluators, the challenge lies in the breakdown of sentiment analysis. At present, researchers pay little attention to the problem of emotional level. To deal with the emotional level and ambiguity, we may use fuzzy theory to calculate the strength of emotion polarity and establish a fuzzy membership function of emotion. Future research may consider using transfer learning and fuzzy approaches to apply existing research results and sentiment analysis technologies to Chinese data sets. Emotion is an essential aspect of natural language semantics, so linguists are encouraged to join the research field. We hope to see more parsing algorithms proposed for other languages in future research to better deal with the emerging challenges of sentiment analysis. The best result might be a language-independent processing platform. As long as the user inputs some language knowledge about opinions or emotions, the system can automatically transfer to the target language for sentiment analysis.

¹ https://blog.csdn.net/qq_33472146/article/details/90665196.
² http://ai.stanford.edu/~amaas/data/sentiment/.
³ http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools.
⁴ http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools.
⁵ http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools.

Acknowledgement. This work was supported by the National Natural Science Foundation of China (No. 61762016), and Guangxi Key Lab of Multi-Source Information Mining & Security (No. 19-A-01-01).
References
1. Araci, D.: FinBERT: financial sentiment analysis with pre-trained language models. In: Proceedings of the 2019 Computing Research Repository, pp. 1–10 (2019)
2. Azzouza, N., Akli-Astouati, K., Ibrahim, R.: TwitterBERT: framework for Twitter sentiment analysis based on pre-trained language model representations. In: Saeed, F., Mohammed, F., Gazem, N. (eds.) IRICT 2019. AISC, vol. 1073, pp. 428–437. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-33582-3_41
3. Basiri, M.E., Abdar, M., Cifci, M.A., Nemati, S., Acharya, U.R.: A novel method for sentiment classification of drug reviews using fusion of deep and machine learning techniques. Knowl.-Based Syst. 105949 (2020)
4. Cambria, E.: Affective computing and sentiment analysis. IEEE Intell. Syst. 31(2), 102–107 (2016)
5. Chang, G., Huo, H.: A method of fine-grained short text sentiment analysis based on machine learning. Neural Netw. World 28(4), 325–344 (2018)
6. Chen, F., Huang, Y.-F.: Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews. Neurocomputing 368, 51–58 (2019)
7. Coffman, K.G., Odlyzko, A.M.: Internet growth: is there a “Moore’s Law” for data traffic? In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds.) Handbook of Massive Data Sets. MC, vol. 4, pp. 47–93. Springer, Boston, MA (2002). https://doi.org/10.1007/978-1-4615-0005-6_3
8. Cunha, A.A.L., Costa, M.C., Pacheco, M.A.C.: Sentiment analysis of Youtube video comments using deep neural networks. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) ICAISC 2019. LNCS (LNAI), vol. 11508, pp. 561–570. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20912-4_51
9. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171–4186 (2019)
10. Diaz, M., Johnson, I., Lazar, A., et al.: Addressing age-related bias in sentiment analysis. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 6146–6150 (2019)
11. Duh, K., Fujino, A., Nagata, M.: Is machine translation ripe for cross-lingual sentiment classification? In: Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, vol. 2, pp. 429–443 (2011)
12. Fang, X., Tao, J.: A transfer learning based approach for aspect based sentiment analysis. In: 2019 Sixth International Conference on Social Networks Analysis, Management and Security, pp. 478–483 (2019)
13. Feng, K., Chaspari, T.: A review of generalizable transfer learning in automatic emotion recognition. Front. Comput. Sci. 2, 9 (2020)
14. Hamdan, H.: Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In: Bellot, P., Bechet, F. (eds.) Proceedings of the 9th International Workshop on Semantic Evaluation, pp. 753–758. Association for Computational Linguistics (2015)
15. Jiang, J., Xia, R.: Microblog sentiment classification via combining rule-based and machine learning methods. Acta Scientiarum Naturalium Universitatis Pekinensis 53(2), 247–254 (2017). (In Chinese)
16. Li, D., Rzepka, R., Ptaszynski, M., Araki, K.: A novel machine learning-based sentiment analysis method for Chinese social media considering Chinese slang lexicon and emoticons. In: Proceedings of the 2nd Workshop on Affective Content Analysis, pp. 88–103 (2019)
17. Li, Z., Li, R., Jin, G.-H.: Sentiment analysis of danmaku videos based on naïve bayes and sentiment dictionary. IEEE Access 8, 75073–75084 (2020)
18. Liang, H., Ganeshbabu, U., Thorne, T.: A dynamic Bayesian network approach for analysing topic-sentiment evolution. IEEE Access 8, 54164–54174 (2020)
19. Lim, S.L.O., Lim, H.M., Tan, E.K., Tan, T.P.: Examining machine learning techniques in business news headline sentiment analysis. In: Alfred, R., Lim, Y., Haviluddin, H., On, C. (eds.) Computational Science and Technology. Lecture Notes in Electrical Engineering, vol. 603, pp. 363–372 (2020)
20. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5(1), 1–167 (2012)
21. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_13
22. Liu, N., Shen, B.: Aspect-based sentiment analysis with gated alternate neural network. Knowl.-Based Syst. 188, 105010 (2020)
23. López-Chau, A., Valle-Cruz, D., Sandoval-Almazán, R.: Sentiment analysis of Twitter data through machine learning techniques. In: Ramachandran, M., Mahmood, Z. (eds.) Software Engineering in the Era of Cloud Computing. CCN, pp. 185–209. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-33624-0_8
24. Lu, Z.-Y., Cao, L.-L., Zhang, Y., Chiu, C.-C., Fan, J.: Speech sentiment analysis via pre-trained features from end-to-end ASR models. In: Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7149–7153 (2020)
25. Meng, X.F., Wei, F.R., Liu, X.H., Zhou, M., Wang, H.F.: Cross-lingual mixture model for sentiment classification. In: Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 572–581 (2013)
26. Mostafa, L.: Machine learning-based sentiment analysis for analyzing the travelers reviews on Egyptian hotels. In: Hassanien, A.-E., Azar, A.T., Gaber, T., Oliva, D., Tolba, F.M. (eds.) AICV 2020. AISC, vol. 1153, pp. 405–413. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44289-7_38
27. Mungra, D., Agrawal, A., Thakkar, A.: A voting-based sentiment classification model. In: Choudhury, S., Mishra, R., Mishra, R.G., Kumar, A. (eds.) Intelligent Communication, Control and Devices. AISC, vol. 989, pp. 551–558. Springer, Singapore (2020). https://doi.org/10.1007/978-981-13-8618-3_57
28. Nasukawa, T., Yi, J.: Sentiment analysis: capturing favorability using natural language processing. In: Proceedings of the 2nd International Conference on Knowledge Capture, pp. 70–77 (2003)
29. Nazir, A., Rao, Y., Wu, L.-W., Sun, L.: Issues and challenges of aspect-based sentiment analysis: a comprehensive survey. IEEE Trans. Affect. Comput. 1 (2020)
30. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retrieval 2(1–2), 1–135 (2008)
31. Patel, V.M., Gopalan, R., Li, R.N.: Visual domain adaptation: an overview of recent advances. Umiacs.umd.edu (3), 53–59 (2015)
32. Popat, K., Balamurali, A.R., Bhattacharyya, P., Haffari, G.: The haves and the have-nots: leveraging unlabelled corpora for sentiment analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 412–422 (2014)
33. Rout, J.K., Choo, K.K.R., Dash, A.K., Bakshi, S., Jena, S.K., Williams, K.L.: A model for sentiment and emotion analysis of unstructured social media text. Electron. Commer. Res. 18(1), 181–199 (2018)
34. Ruz, G.A., Henriquez, P.A., Mascareno, A.: Sentiment analysis of Twitter data during critical events through Bayesian networks classifiers. Future Gener. Comput. Syst. 106, 92–104 (2020)
35. Samuel, J., Ali, G.G.M.N., Rahman, M.M., Esawi, E., Samuel, Y.: Covid-19 public sentiment insights and machine learning for tweets classification. Information 11(6), 314 (2020)
36. Sasikala, D., Sukumaran, S.: A survey on lexicon and machine learning based classification methods for sentimental analysis. Int. J. Res. Anal. Rev. 6(2), 256–259 (2019)
37. Sisodia, D.S., Bhandari, S., Reddy, N.K., Pujahari, A.: A comparative performance study of machine learning algorithms for sentiment analysis of movie viewers using open reviews. In: Pant, M., Sharma, T., Basterrech, S., Banerjee, C. (eds.) Performance Management of Integrated Systems and its Applications in Software Engineering, Asset Analytics (Performance and Safety Management), pp. 107–117 (2020)
38. Su, Y., Zhang, Y., Hu, P., Tu, X.H.: Sentiment analysis research based on combination of Naive Bayes and Latent Dirichlet Allocation. J. Comput. Appl. 36(6), 1613–1618 (2016). (In Chinese)
39. Tao, J., Fang, X.: Toward multi-label sentiment analysis: a transfer learning based approach. J. Big Data 7(1), 1–26 (2020)
40. Wan, X.-J.: Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 553–561 (2008)
41. Wang, G., Yang, S.-L.: Study of sentiment analysis of product reviews in internet based on RS-SVM. Comput. Sci. 40(Z11), 274–277 (2013). (In Chinese)
42. Wu, Y.-J., Zhu, F.-X., Zhou, J.: Using probabilistic graphical model for text sentiment analysis. J. Chin. Comput. Syst. 36(7), 1421–1425 (2015). (In Chinese)
43. Xia, H.-S., Yang, Y.-T., Pan, X.-T., An, W.-Y.: Sentiment analysis for online reviews using conditional random fields and support vector machines. Electron. Commer. Res. 20(2), 343–360 (2020)
44. Xia, H.-S., Yang, Y.-T., Pan, X.-T., Zhang, Z.-P., An, W.-Y.: Sentiment analysis for online reviews using conditional random fields and support vector machine. Electron. Commer. Res. 1–18 (2019)
45. Xie, X., Ge, S.-L., Hu, F.-P., Xie, M.-Y., Jiang, N.: An improved algorithm for sentiment analysis based on maximum entropy. Soft. Comput. 23(2), 599–611 (2019)
46. Xu, Y.-Y., Chai, Y.-M., Wang, L.-M., Liu, Z.: Multilingual text emotional analysis model MF-CSEL. J. Chin. Comput. Syst. 40(5), 1026–1033 (2019). (In Chinese)
47. Yadav, A., Vishwakarma, D.K.: Sentiment analysis using deep learning architectures: a review. Artif. Intell. Rev. 53, 4335–4385 (2020)
48. Yang, J.: Emotion analysis on text words and sentences based on SVM. Comput. Appl. Softw. 28(9), 225–228 (2011). (In Chinese)
49. Yu, C.-M.: Mining opinions from product review: principles and algorithm analysis. Inf. Stud.: Theory Appl. 32(7), 124–128 (2009). (In Chinese)
50. Zeng, Y., Liu, P.-Y., Liu, W.-F., Zhu, Z.-F.: Naive Bayesian algorithm for text sentiment classification based feature weighting integration. J. Northwest Normal Univ. 53(04), 56–60 (2017). (In Chinese)
51. Zhang, M.-C., et al.: Emotional component analysis and forecast public opinion on micro-blog posts based on maximum entropy model. Clust. Comput. 22(3), 6295–6304 (2019)
Machine Translation and Multilinguality
Incorporating Named Entity Information into Neural Machine Translation
Leiying Zhou, Wenjie Lu, Jie Zhou, Kui Meng(✉), and Gongshen Liu(✉)
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
{zhouleiying,jonsey,sanny02,mengkui,lgshen}@sjtu.edu.cn
Abstract. Most neural machine translation (NMT) models take the subword-level sequence as input to address the problem of representing out-of-vocabulary words (OOVs). However, using subword units as input may omit the information carried by larger text granularity, such as named entities, which leads to a loss of important semantic information. In this paper, we propose a simple but effective method to incorporate the named entity (NE) tags information into the Transformer translation system. The encoder of our proposed model takes both the subwords and the NE tags of source sentences as inputs. Furthermore, we introduce a novel entity-aligned attention mechanism to make full use of the chunk information of NE tags. The proposed approach can be easily integrated into the existing framework of Transformer. Experimental results on two public translation tasks demonstrate that our proposed method can achieve significant translation improvements over the basic Transformer model and also outperforms the existing competitive systems.
Keywords: Neural machine translation · Named entity · Transformer · Self-attention

1 Introduction
Neural machine translation (NMT) systems based on the encoder-decoder framework [2,17] have achieved great progress. However, NMT systems often make mistakes when translating rare and unseen words or phrases, such as named entities, which are essential to understanding the meaning of the sentence. This is mainly because these low-frequency words, which are also called out-of-vocabulary words (OOVs), are often replaced by the symbol unk due to the limited vocabulary size. Splitting words into subword units [15] to solve the problem of representing OOVs has been widely used in current NMT models. However, such words are still often inaccurately or inadequately translated, because using subword units omits information carried by larger text granularity, including word-level and phrase-level information. The encoder cannot easily adapt to certain combinations of subwords, which may lead to a loss of important semantic information.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 391–402, 2020. https://doi.org/10.1007/978-3-030-60450-9_31
392
L. Zhou et al.
Previous studies addressing this problem are mainly based on RNN-based NMT models. Sennrich and Haddow [14] have improved the attention-based NMT model by incorporating the linguistic features of source-language sentences, including the lemma, subword tags, and dependency labels. Ugawa et al. [18] proposed a model that encodes the input word based on its named entity (NE) tag at each time step and introduces a chunk-level LSTM layer over a word-level LSTM layer to hierarchically encode a source sentence. Recently, the self-attention based Transformer model [19] has achieved a new state-of-the-art performance on multiple language pairs. Some recent studies also show that incorporating word segmentation information [11] or N-gram representations [5] into character-based pre-trained language models can enhance the character representations. Inspired by this, we propose a simple but effective way to address the NE translation problem in Transformer, which has not been studied extensively. In this paper, we present an effective method to incorporate the NE tags information of source sentences into the Transformer translation system. Specifically, we integrate an extra NE tags embedding layer into the existing system to model the NE tags information in the source sentence. The NE tags embedding layer generates an NE tag embedding for each token based on its NE tag, which is then added to its token embedding. Furthermore, we introduce a novel entity-aligned attention mechanism, which aggregates the attention weights of the subwords in one named entity into a unified value with the mixed pooling strategy [23]. During training and testing, the encoder generates an enhanced entity-aware sentence representation, which helps the decoder generate higher quality translations. The proposed method can be cast as a special case of extending the self-attention mechanism of Transformer to consider the NE tags knowledge.
We conduct extensive experiments on the WMT 2019 English-Chinese and the IWSLT 2016 English-German datasets, and the experimental results demonstrate that our approach outperforms the baseline Transformer model and the existing competitive systems, which verifies the effectiveness of the method we propose. This paper primarily makes the following contributions:
– We propose an effective approach to incorporate the NE tags information into the Transformer translation system, which has not been studied extensively.
– The proposed method can be seen as an extension of the self-attention mechanism of Transformer to consider the NE tags knowledge and it can be easily integrated into the encoder of Transformer.
– Experimental results demonstrate that our method outperforms both the basic Transformer model and several existing competitive models.
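The two ingredients described in this introduction (adding an NE tag embedding to each token embedding, and pooling the attention weights of the subwords of one entity into a single value) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration and not the authors' implementation: the helper names, the rule that later subwords of a B- word receive the matching I- tag, the mixing coefficient `lam`, and the choice of pooling one row of attention weights as `lam * max + (1 - lam) * mean` without renormalising.

```python
def propagate_tags(words, tags, segment):
    """Copy word-level BIO tags onto subwords (assumed convention: the first
    subword keeps the word's tag, later subwords of a B- word become I-)."""
    subwords, sub_tags = [], []
    for word, tag in zip(words, tags):
        for k, piece in enumerate(segment(word)):
            subwords.append(piece)
            if k > 0 and tag.startswith("B-"):
                sub_tags.append("I-" + tag[2:])
            else:
                sub_tags.append(tag)
    return subwords, sub_tags


def entity_spans(sub_tags):
    """Return [start, end) index pairs of maximal B-/I- chunks."""
    spans, start = [], None
    for i, tag in enumerate(list(sub_tags) + ["O"]):  # sentinel closes a trailing chunk
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def add_tag_embeddings(token_emb, sub_tags, tag_table):
    """Element-wise sum of token embeddings and NE tag embeddings (same dimension)."""
    return [[t + e for t, e in zip(tok, tag_table[tag])]
            for tok, tag in zip(token_emb, sub_tags)]


def mixed_pool_row(weights, spans, lam=0.5):
    """Replace the weights inside each entity span by lam * max + (1 - lam) * mean."""
    pooled = list(weights)
    for start, end in spans:
        chunk = weights[start:end]
        value = lam * max(chunk) + (1 - lam) * sum(chunk) / len(chunk)
        pooled[start:end] = [value] * (end - start)
    return pooled


# Toy example with a hypothetical subword segmentation of "Gallo".
toy_segments = {"Gallo": ["Gal@@", "lo"]}
subwords, sub_tags = propagate_tags(
    ["David", "Gallo", "comes"], ["B-PERSON", "I-PERSON", "O"],
    lambda w: toy_segments.get(w, [w]))
spans = entity_spans(sub_tags)
tag_table = {"B-PERSON": [0.1, 0.0], "I-PERSON": [0.0, 0.1], "O": [0.0, 0.0]}
enriched = add_tag_embeddings([[1.0, 1.0]] * 4, sub_tags, tag_table)
pooled = mixed_pool_row([0.1, 0.3, 0.2, 0.4], spans, lam=0.5)
print(subwords, sub_tags, spans, pooled)
```

After pooling, the three subwords of the entity share one attention value, which is the intuition behind treating a named entity as a single chunk; how the pooled weights are combined with the rest of the attention computation is left open in this sketch.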
2 Related Work
Our work incorporates the NE tags information for NMT to improve the translation quality of named entities. There are two kinds of closely related studies, NMT based on phrases, and incorporating linguistic features into NMT by modeling them with certain functions or mechanisms.
NMT based on phrases. Phrases play an important role in traditional statistical machine translation (SMT) [10,12]. However, current NMT models do not treat phrases explicitly. Thus, a growing number of research works try to translate phrases directly in NMT to improve the translation of phrases. In many cases, it is possible to just copy the named entities in the source sentence to the target side instead of translating them. Gulcehre et al. [6] propose a pointer network based [20] NMT system which can learn when to translate and when to copy. It is also possible to use lexically-constrained decoders [7] to force the network to generate certain words or phrases. Wang et al. [21] integrate a phrase memory into the encoder-decoder architecture of NMT, which stores target phrases provided by an SMT model, to perform a phrase-by-phrase translation rather than a word-by-word translation. Their model dynamically selects a word or phrase to be output at each decoding step. Besides, Huck et al. [8] improve the translation of OOVs in NMT by using the back-translation approach to generate parallel sentences which contain OOVs.
Incorporating linguistic features into NMT. Different from translating phrases directly in NMT, some works show that introducing linguistic features into NMT systems can also improve the translation of source sentences. Sennrich and Haddow [14] have improved the attention-based NMT model by incorporating the linguistic features of source-language sentences. The encoder receives the lemma, subword tags, POS tags, and dependency labels of source-language sentences in addition to source-language words. Ugawa et al. [18] propose a model that encodes the input word based on its NE tag at each time step, the encoder of which introduces a chunk-level LSTM layer over a word-level LSTM layer to hierarchically encode a source sentence and capture a compound NE as a chunk on the basis of the NE tags. To handle the problem that most NMT systems represent the sentence at only one single level of granularity, Chen et al. [4] incorporate multiple levels of granularity into the RNN-based NMT models. Xiao et al. [22] propose lattice-based encoders to explore effective word or subword representations automatically during training, and experimental results show that the lattice-based encoders with word-level and subword-level representations outperform the conventional Transformer encoder.
Our work is different from those mentioned above, as our approach models the NE tags information of source-language sentences through an extra NE tags embedding layer and we introduce a novel entity-aligned attention mechanism to make full use of the chunk information of NE tags.
3 Preliminaries
3.1 Transformer
The Transformer model [19] employs an encoder-decoder structure. The encoder is composed of a stack of identical encoder layers. Each layer consists of two sublayers: multi-head self-attention sublayer and position-wise feed-forward sublayer. The decoder is also composed of a stack of identical decoder layers. The
difference between the decoder layer and encoder layer is that each decoder layer has one more multi-head attention sublayer which performs attention over the output of the encoder stack. Residual connections are employed around each of the two sublayers, followed by layer normalization [1]. The self-attention sublayer in the decoder stack is modified by a method called masking to prevent positions from attending to subsequent positions during training. Because of these unique structures, the Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality on multiple language pairs.
3.2 Self-attention
The self-attention sublayers of Transformer employ $h$ attention heads, which allow the model to jointly attend to information from different representation subspaces at different positions. The results from each head are then concatenated, and a parameterized linear transformation is applied to form the output of this sublayer. Assuming that $H^{n-1}$ is the input representation of the $n$-th stacked encoder layer, an intermediate representation vector $\bar{H}^{n}$ can be obtained from the multi-head self-attention sublayer, which contains the scaled dot-product attention, residual connection, and layer normalization operations. The order of these operations is as follows:

\[ \bar{H}^{n} = \mathrm{LN}(\mathrm{SelfAtt}(H^{n-1}) + H^{n-1}), \tag{1} \]

where $\mathrm{LN}(\cdot)$ represents the layer normalization operation and $\mathrm{SelfAtt}(\cdot)$ denotes the self-attention mechanism. The specific calculation formula of $\mathrm{SelfAtt}(\cdot)$ is:

\[ \mathrm{SelfAtt}(H^{n-1}) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{2} \]

where $Q$, $K$, $V$ are linear transformations of the input representation $H^{n-1}$, respectively called the query, key and value vectors, and $d_k$ is the dimension of the query and key vectors, which is used to scale the dot products to avoid pushing the softmax function into regions with extremely small gradients. The output intermediate representation vector $\bar{H}^{n}$ of the multi-head self-attention sublayer is then fed to the position-wise feed-forward sublayer to generate the final output $H^{n}$ of this encoder layer:

\[ H^{n} = \mathrm{LN}(\mathrm{FFN}(\bar{H}^{n}) + \bar{H}^{n}), \tag{3} \]

where $\mathrm{FFN}(\cdot)$ represents the fully connected feed-forward network. The output representation vector $H^{n}$ is used as the input of the next stacked encoder layer. If there is a stack of $N$ encoder layers in the Transformer architecture, the initial representation vector $H^{0}$ is obtained from the positional encoding layer, and the output representation vector $H^{N}$ of the $N$-th encoder layer is the final representation of the source sentence. For the stacked decoder layers, there are two different multi-head attention sublayers in each decoder layer. One is the masked multi-head self-attention
sublayer, which performs self-attention among the words in target sentences and allows each position to attend to all positions up to and including that position. The other is the encoder-decoder attention sublayer, in which the queries come from the previous decoder layer and the keys and values are the final output of the encoder layers. This allows every position in the decoder to attend over all positions in the input sequence.
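The attention sublayer of Eqs. (1) and (2), together with the decoder-side masking described above, can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single head instead of $h$ heads, layer normalization without learned gain and bias, and hypothetical function names; it is not the reference Transformer implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-6):
    """Normalise one position to zero mean and unit variance (no learned parameters)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def self_attention(h, w_q, w_k, w_v, causal=False):
    """Scaled dot-product attention, Eq. (2); `causal=True` applies the decoder mask."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # position i may attend only to positions up to and including i
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights


def attention_sublayer(h, w_q, w_k, w_v, causal=False):
    """Residual connection followed by layer normalization, Eq. (1)."""
    att, weights = self_attention(h, w_q, w_k, w_v, causal)
    out = [layer_norm([a + x for a, x in zip(att_row, h_row)])
           for att_row, h_row in zip(att, h)]
    return out, weights


# Toy input: three positions with two-dimensional representations, identity projections.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention_sublayer(h, identity, identity, identity, causal=True)
print(weights[0])
```

With the causal mask enabled, the first position can attend only to itself, so its attention row is `[1.0, 0.0, 0.0]`; without the mask, every row spreads its weight over all three positions, as in the encoder.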
4 Approach
To address the problem that named entities are often inaccurately or inadequately translated, we propose a simple but effective way to incorporate the NE tag information of the source sentences into the existing Transformer system. The overall architecture of our proposed approach is shown in Fig. 1. As the figure shows, we use two methods to incorporate the NE tag information. First, we assume that each word in the source sentences has a corresponding NE tag generated by an NE tagger, so that we can also obtain an NE tag for each token in the segmented source sentences. The encoder of our proposed model receives the sequence of NE tags along with the sequence of source-language subwords as inputs. Second, when we read or translate a sentence, we intuitively pay similar attention to the different subwords within an entity, and the dependencies between the subwords within the same entity and the words outside the entity are often similar. Inspired by this, we introduce a novel entity-aligned attention mechanism to enhance the final source sentence representation with the NE chunk information.
4.1 NE Tags Embeddings
Named entities (NEs) are words or phrases used to express specific meanings, such as person names, organizations, locations, time expressions and numerical representations. NE recognition is a classification task that locates and classifies NE words or phrases in given input sentences. NE tags generated by commonly used named entity recognition tools or methods usually include information about NE types and NE chunks (e.g., IO or BIO). In this work, we used BIO tags, which indicate the beginning, inside and outside of NEs, respectively. For example, the sentence ‘David Gallo comes from the United States.’ is tagged as follows: ‘David:B-PERSON Gallo:I-PERSON comes:O from:O the:B-GPE United:I-GPE States:I-GPE .:O’, where the tag GPE denotes countries, cities or states. As the tagged sentence shows, the NE tags include two kinds of features: the semantic class of words or phrases (e.g., PERSON and GPE) and the chunk information (e.g., B, I and O). By using these two features, we aim to improve the translation of the NEs in source sentences. In order to incorporate the NE tag information into the Transformer, the first method we try is to map the NE tags to NE tags embeddings, whose dimension is the same as that of the original input token embeddings, and then feed both to the encoder together. Given a source sentence of
L. Zhou et al.
Fig. 1. The architecture of Transformer’s encoder with entity-aligned attention sublayer.
length T, the encoder receives its corresponding token embedding sequence $X = \{x_1, \ldots, x_T\}$, along with the sequence of NE tags embeddings $NE = \{t_1, \ldots, t_T\}$, where $x_i, t_i \in \mathbb{R}^{d_{model}}$. The NE tags embeddings NE are then added to the corresponding token embeddings X:
$$\bar{X} = X + \lambda_1 NE, \tag{4}$$
where $\lambda_1 \in \mathbb{R}^{1}$ is a trainable variable that controls the influence of the NE tags. As a result, the enhanced token embeddings $\bar{X}$ contain both the semantic information and the NE tag information of the source sentence. Finally, the positional embeddings are added to $\bar{X}$, and the result is the input of the encoder layers, which learn the final sentence representation.
4.2 Entity-Aligned Attention
Although we directly incorporate the NE tags information into the input part of the encoder, there are no explicit structures to exploit the NE tags information in the subsequent neural networks of Transformer. Therefore, in order to make full use of the chunk information of NE tags, we introduce an entity-aligned attention mechanism on the basis of the self-attention networks, which can aggregate the
attention weights of subwords in one named entity into a unified value with the mixed pooling strategy [23]. As described in Sect. 3.2, the self-attention sublayers in the encoder of the Transformer aim to learn the dependencies between each token and the other tokens in a sentence. Given a sentence X of length T, we assume that $H^{n-1}$ is the input representation of the n-th stacked encoder layer. Following Eq. 2, we first calculate the subword-level attention score matrix $AS^{n}$ based on the self-attention mechanism with the input representation $H^{n-1}$:
$$AS^{n} = \mathrm{softmax}\left(\frac{(H^{n-1}W_Q)(H^{n-1}W_K)^{T}}{\sqrt{d_{model}}}\right), \tag{5}$$
where $W_Q, W_K \in \mathbb{R}^{d_{model} \times d_{model}}$ are trainable parameters and $d_{model}$ is the model dimension. Each element of $AS^{n} \in \mathbb{R}^{T \times T}$ is a probability value between zero and one, which represents the dependency between the corresponding subwords without considering the NE boundaries. We argue that treating an NE word or phrase as a whole can better express the semantics of the NE when calculating attention scores, as the meaning of each subword and that of the whole NE may be quite different, and it is not easy to capture dependencies between larger text units by a simple weighted sum at the subword level. Therefore, we propose to align this subword-level attention score matrix $AS^{n}$ to the NE level. We denote $AS^{n}$ as $\{a_1^{n}, a_2^{n}, a_3^{n}, \ldots, a_{T-2}^{n}, a_{T-1}^{n}, a_T^{n}\}$, where $a_i^{n} \in \mathbb{R}^{T}$ is the i-th row vector of $AS^{n}$, which represents the dependencies between the i-th token and the other tokens in the sentence. Then, we segment $AS^{n}$ according to the chunk information of the NE tags:
$$\mathrm{SEG}(AS^{n}) = \{\{a_1^{n}, a_2^{n}\}, \{a_3^{n}\}, \ldots, \{a_{T-2}^{n}, a_{T-1}^{n}, a_T^{n}\}\}. \tag{6}$$
As a result, several submatrices are obtained by segmenting the attention score matrix, and each submatrix represents the attention scores of a single token, NE word or NE phrase. Then, we add an aggregation module to fuse the subword attention within each NE. Concretely, we assume that there is an NE in the source sentence whose corresponding attention score submatrix is $\{a_s^{n}, \ldots, a_{s+l-1}^{n}\}$. We transform it into one attention vector $\bar{a}_i^{n}$ with the mixed pooling strategy:
$$\bar{a}_i^{n} = \lambda_2\,\mathrm{Maxpooling}(\{a_s^{n}, \ldots, a_{s+l-1}^{n}\}) + (1 - \lambda_2)\,\mathrm{Meanpooling}(\{a_s^{n}, \ldots, a_{s+l-1}^{n}\}), \tag{7}$$
where $\lambda_2 \in \mathbb{R}^{1}$ is a trainable variable that balances the mean and max pooling. Next, we apply the Kronecker product to each pooled attention vector $\bar{a}_i^{n}$ to keep the input and output dimensions unchanged:
$$\overline{AS}^{n}[s : s + l - 1] = e_l \otimes \bar{a}_i^{n}, \tag{8}$$
where $\overline{AS}^{n} \in \mathbb{R}^{T \times T}$ is the entity-aligned attention score matrix, l is the length of the NE word or NE phrase, $e_l = [1, \ldots, 1]^{T}$ is an l-dimensional all-ones vector, and $e_l \otimes \bar{a}_i^{n} = [\bar{a}_i^{n}, \ldots, \bar{a}_i^{n}]$ denotes the Kronecker product of $e_l$ and $\bar{a}_i^{n}$. Eq. 7 and Eq. 8 incorporate the chunk information of the NE tags into the subword-level attention calculation and determine the attention vector of each subword from the perspective of the whole NE, which helps to eliminate the attention bias caused by subword ambiguity. Finally, we obtain the enhanced token representation $\bar{H}^{n}$ produced by the entity-aligned attention sublayer:
$$\bar{H}^{n} = \overline{AS}^{n}(H^{n-1}W_V), \tag{9}$$
where $W_V \in \mathbb{R}^{d_{model} \times d_{model}}$ is a trainable weight matrix. The dimension of $\bar{H}^{n}$ is $T \times d_{model}$, the same as that of $H^{n-1}$. As described in Eq. 3, the enhanced token representation $\bar{H}^{n}$ is then fed to the position-wise feed-forward sublayer to generate the output $H^{n}$ of this encoder layer. These steps are repeated until the final enhanced sentence representation $H^{N}$ is generated, which is used by the decoder to generate the final translation.
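The segmentation, mixed pooling and Kronecker-product steps of Eqs. 6 to 8 can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, λ2 is passed as a fixed number rather than learned, and max and mean pooling are read as element-wise operations across the rows of one chunk, which is an assumption about Eq. 7.

```python
def bio_to_spans(tags):
    """Eq. 6: group positions into chunks as (start, length) pairs.

    A B- tag followed by I- tags forms one chunk; every other token is
    its own chunk. Entity types are not checked (simplification).
    """
    spans = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and i > 0 and tags[i - 1] != "O":
            start, length = spans[-1]
            spans[-1] = (start, length + 1)
        else:
            spans.append((i, 1))
    return spans


def mixed_pool(rows, lam):
    """Eq. 7: lam * max + (1 - lam) * mean, element-wise over the rows."""
    columns = list(zip(*rows))
    return [lam * max(c) + (1 - lam) * sum(c) / len(c) for c in columns]


def entity_align(scores, spans, lam=0.5):
    """Eq. 8: repeat the pooled vector once per subword of the chunk."""
    aligned = []
    for start, length in spans:
        pooled = mixed_pool(scores[start:start + length], lam)
        aligned.extend(list(pooled) for _ in range(length))
    return aligned


# Toy example: a two-subword PERSON entity followed by two ordinary tokens.
tags = ["B-PERSON", "I-PERSON", "O", "O"]
scores = [
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.1, 0.1, 0.1],
]
spans = bio_to_spans(tags)
aligned = entity_align(scores, spans, lam=0.5)
```

In this sketch the two rows of the entity receive the same pooled attention vector, while single-token chunks keep their original rows. Note that with λ2 > 0 a pooled row is not guaranteed to sum to one.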
5 Experiments
5.1 Datasets
We conducted experiments on the WMT 2019¹ English-Chinese (En-Zh) and the IWSLT 2016 [3] English-German (En-De) datasets. For the WMT 2019 En-Zh task, we used the casia2015 corpus, which consists of approximately 1.01M sentence pairs, as the training set, newsdev2017 as the validation set, and newstest2017, newstest2018 and newstest2019 as test sets. For the IWSLT 2016 En-De task, the corpus consists of 197K sentence pairs; tst2013 was used as the validation set and tst2010, tst2011, tst2012 and tst2014 were used as test sets. For both translation tasks, we used the spaCy NE tagger² to identify the NE tags of the English sentences in the proposed model.
5.2 Compared Methods
In our experiments, the baseline system was the vanilla Transformer model. To validate the effectiveness of our proposed method, we designed three Transformer-based translation models: +NE tags Embedding, +Entity-aligned Attention and +Both. We also compared our work with the following existing competitive models:
RNNsearch [2]: The first work to incorporate the attention mechanism into the RNN-based NMT model; its idea has been widely used in the field of machine translation.
¹ http://www.statmt.org/wmt19/translation-task.html.
² https://spacy.io/usage/linguistic-features#named-entities.
DTMT [13]: A deep transition RNN-based architecture for NMT that is reported to outperform state-of-the-art NMT systems, including the Transformer model and RNN-based NMT models.
Evolved Transformer [16]: A recently proposed architecture that applies neural architecture search based on the Transformer model to design better feed-forward architectures for sequence-to-sequence tasks.
Since the datasets we used differ from those in these papers, we ran the source code released with these papers on our datasets to obtain the comparative test results.
5.3 Setup
For both translation tasks, the byte pair encoding (BPE) algorithm [15] was used to encode the sentences on both the source and target sides, and the shared source-target vocabulary size was set to 32k tokens. Sentence pairs with either side longer than 100 tokens were dropped. For the En-Zh translation task, we used the Jieba toolkit³ to segment the Chinese words in the target sentences before the joint BPE operation. For all experiments, the embedding size was 512, the hidden size was 1024, and the number of heads was 8. The encoder and the decoder each had six layers. The parameter λ1, which controls the effect of the NE tags, and λ2, which balances the mean and max pooling, were both initialized to 0.5. For training, we used the Adam optimizer [9] with a learning rate of 0.0001. The learning rate was varied under a warm-up strategy with 4,000 warm-up steps. The attention dropout and residual dropout were both 0.3. We set the batch size to 128 sentence pairs. The numbers of training epochs for WMT 2019 and IWSLT 2016 were 20 and 50, respectively. For evaluation, we validated the model on the dev set after every epoch. After all training epochs, the model with the highest BLEU score on the dev set was selected for evaluation on the test sets. We used multi-bleu.perl⁴ to compute BLEU for both translation tasks. We implemented all methods based on a TensorFlow implementation of the Transformer⁵ and trained and evaluated all models on a single NVIDIA GeForce GTX 1080 Ti GPU.
5.4 Results and Analysis
Evaluation results for the En-Zh and En-De translation tasks are presented in Tables 1 and 2, respectively. The results show that the three proposed models all improve over the basic Transformer model and that our +Both model also outperforms the existing competitive systems, which supports the validity of our approach.
³ https://github.com/fxsjy/jieba.
⁴ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl.
⁵ https://github.com/Kyubyong/transformer.
Table 1. Evaluation results (BLEU) on WMT 2019 En-Zh dataset
System                      dev17   test17  test18  test19  Average
RNNsearch                   23.78   24.92   24.17   24.20   24.27
DTMT                        26.35   28.07   26.10   27.34   26.97
Evolved transformer         26.11   27.84   25.98   27.25   26.80
Transformer                 25.04   26.37   25.09   25.76   25.57
+NE tags embedding          26.41   27.86   26.46   26.88   26.90
+Entity-aligned attention   25.92   27.63   26.50   26.69   26.69
+Both                       27.10   28.95   27.34   27.75   27.79
Table 2. Evaluation results (BLEU) on IWSLT 2016 En-De dataset
System                      tst13   tst10   tst11   tst12   tst14   Average
RNNsearch                   23.18   21.58   23.61   21.45   20.42   22.05
DTMT                        26.94   25.11   27.10   25.26   24.28   25.74
Evolved transformer         26.72   24.93   26.89   24.90   23.62   25.41
Transformer                 25.51   23.77   25.47   23.82   22.50   24.21
+NE tags embedding          26.47   24.71   26.69   25.09   23.69   25.33
+Entity-aligned attention   26.39   24.89   26.71   24.65   23.54   25.24
+Both                       27.28   25.88   27.70   25.55   24.45   26.17
Comparison to Baseline System and Existing Competitive Systems. More concretely, our method +Both, which integrates the NE tags embedding and the entity-aligned attention mechanism, outperforms the vanilla Transformer model by 2.22 and 1.96 BLEU points on average for the En-Zh and En-De translation tasks, respectively, which shows that the NE tag information is beneficial for the Transformer model. Our proposed model also outperforms the three existing competitive systems on both translation tasks. On the En-Zh translation task (Table 1), our method outperforms the second-best model, DTMT, by 0.82 BLEU points on average. On the En-De translation task (Table 2), our method is also better than DTMT, by 0.43 BLEU points. These results indicate that incorporating the NE tag information into the existing Transformer system is effective and can improve the translation of source sentences.
Effect of NE Tags Information. To show the effect of our two proposed methods on the Transformer model more intuitively, we compare the evaluation results of our proposed models and the vanilla Transformer model on the dev and test sets of the two translation tasks in Fig. 2. As the figure shows, the results of +NE tags Embedding are superior to those of the basic Transformer model, which indicates that the NE tag information, including the semantic class information and the chunk information, is helpful to the Transformer model. Besides, the model +Entity-aligned Attention, which only
introduces the chunk information of the NE tags, achieves performance comparable to +NE tags Embedding. This shows the effectiveness of our proposed entity-aligned attention mechanism based on the mixed pooling strategy. The model +Both, which integrates both the NE tags embedding and the entity-aligned attention mechanism into the Transformer, achieves the best performance, which indicates that the two methods are complementary.
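As a quick check of the average gains quoted in this section, the differences can be recomputed from the Average columns of Tables 1 and 2. The snippet below is only a bookkeeping sketch; the dictionary layout and the `gain` helper are hypothetical names introduced for illustration.

```python
# Average BLEU scores copied from the "Average" columns of Tables 1 and 2.
averages = {
    "En-Zh": {"Transformer": 25.57, "DTMT": 26.97, "+Both": 27.79},
    "En-De": {"Transformer": 24.21, "DTMT": 25.74, "+Both": 26.17},
}


def gain(task, system, reference):
    """Difference in average BLEU between two systems, rounded to 2 decimals."""
    return round(averages[task][system] - averages[task][reference], 2)


for task in averages:
    print(task,
          "vs Transformer:", gain(task, "+Both", "Transformer"),
          "vs DTMT:", gain(task, "+Both", "DTMT"))
```

The recomputed differences agree with the figures reported in the text: 2.22 and 1.96 BLEU over the Transformer baseline, and 0.82 and 0.43 BLEU over DTMT.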
Fig. 2. Comparison of evaluation results of the three models we proposed and the basic Transformer model for En-Zh (left) and En-De (right) translation tasks.
6 Conclusion
In this paper, we propose an effective method to incorporate the NE tag information of source sentences into the Transformer translation system. The encoder of our proposed model takes both the subwords and the NE tags of source sentences as input. Furthermore, we introduce a novel entity-aligned attention mechanism that aggregates the attention weights of the subwords in one named entity into a unified value with the mixed pooling strategy, which makes full use of the chunk information of the NE tags. The proposed approach can be easily integrated into the encoder of the Transformer. The experimental results on two public datasets show that our method is effective and outperforms the Transformer baseline system and several existing competitive models.
Acknowledgments. This research work has been funded by the National Key Research and Development Program of China (No. 2016QY03D0604 and No. 2018YFC0830803) and the National Natural Science Foundation of China (Grant No. 61772337).
References
1. Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
3. Cettolo, M., Girardi, C., Federico, M.: WIT3: web inventory of transcribed and translated talks. In: Conference of European Association for Machine Translation, pp. 261–268 (2012)
4. Chen, H., Huang, S., Chiang, D., Dai, X., Chen, J.: Combining character and word information in neural machine translation using a multi-level attention. In: NAACL-HLT, pp. 1284–1293. Association for Computational Linguistics (2018)
5. Diao, S., Bai, J., Song, Y., Zhang, T., Wang, Y.: ZEN: pre-training Chinese text encoder enhanced by N-gram representations. CoRR abs/1911.00720 (2019)
6. Gülçehre, Ç., Ahn, S., Nallapati, R., Zhou, B., Bengio, Y.: Pointing the unknown words. In: ACL, vol. 1. The Association for Computer Linguistics (2016)
7. Hasler, E., de Gispert, A., Iglesias, G., Byrne, B.: Neural machine translation decoding with terminology constraints. In: NAACL-HLT, vol. 2, pp. 506–512. Association for Computational Linguistics (2018)
8. Huck, M., Hangya, V., Fraser, A.M.: Better OOV translation with bilingual terminology mining. In: ACL, vol. 1, pp. 5809–5815. Association for Computational Linguistics (2019)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
10. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: HLT-NAACL. The Association for Computational Linguistics (2003)
11. Li, Y., Yu, B., Xue, M., Liu, T.: Enhancing pre-trained Chinese character representation with word-aligned attention. CoRR abs/1911.02821 (2019)
12. Lopez, A.: Statistical machine translation. ACM Comput. Surv. 40(3), 8:1–8:49 (2008)
13. Meng, F., Zhang, J.: DTMT: a novel deep transition architecture for neural machine translation. In: AAAI, pp. 224–231. AAAI Press (2019)
14. Sennrich, R., Haddow, B.: Linguistic input features improve neural machine translation. In: WMT, pp. 83–91. The Association for Computer Linguistics (2016)
15. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: ACL, vol. 1. The Association for Computer Linguistics (2016)
16. So, D.R., Le, Q.V., Liang, C.: The evolved transformer. In: Proceedings of Machine Learning Research, PMLR, ICML, vol. 97, pp. 5877–5886 (2019)
17. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS, pp. 3104–3112 (2014)
18. Ugawa, A., Tamura, A., Ninomiya, T., Takamura, H., Okumura, M.: Neural machine translation incorporating named entity. In: COLING, pp. 3240–3250. Association for Computational Linguistics (2018)
19. Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
20. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: NIPS, pp. 2692–2700 (2015)
21. Wang, X., Tu, Z., Xiong, D., Zhang, M.: Translating phrases in neural machine translation. In: EMNLP, pp. 1421–1431. Association for Computational Linguistics (2017)
22. Xiao, F., Li, J., Zhao, H., Wang, R., Chen, K.: Lattice-based transformer encoder for neural machine translation. In: ACL, vol. 1, pp. 3090–3097. Association for Computational Linguistics (2019)
23. Yu, D., Wang, H., Chen, P., Wei, Z.: Mixed pooling for convolutional neural networks. In: Miao, D., Pedrycz, W., Ślęzak, D., Peters, G., Hu, Q., Wang, R. (eds.) RSKT 2014. LNCS (LNAI), vol. 8818, pp. 364–375. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11740-9_34
Non-autoregressive Neural Machine Translation with Distortion Model
Long Zhou¹,²(✉), Jiajun Zhang¹,², Yang Zhao¹,², and Chengqing Zong¹,²,³
¹ National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, People’s Republic of China
{long.zhou,jjzhang,yang.zhao,cqzong}@nlpr.ia.ac.cn
² School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
³ CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, People’s Republic of China
Abstract. Non-autoregressive translation (NAT) has attracted attention recently due to its high efficiency during inference. Unfortunately, it performs significantly worse than the autoregressive translation (AT) model. We observe that the gap between NAT and AT can be remarkably narrowed if we provide the inputs of the decoder in the same order as the target sentence. However, existing NAT models still initialize the decoding process by copying source inputs from left to right, and lack an explicit reordering mechanism for decoder inputs. To address this problem, we propose a novel distortion model to enhance the decoder inputs so as to further improve NAT models. The distortion model, incorporated into the NAT model, reorders the decoder inputs to bring them close to the word order of the decoder outputs, which can reduce the search space of the non-autoregressive decoder. We verify our approach empirically through a series of experiments on three similar language pairs (En⇒De, En⇒Ro, and De⇒En) and two dissimilar language pairs (Zh⇒En and En⇒Ja). Quantitative and qualitative analyses demonstrate the effectiveness and universality of our proposed approach.
Keywords: Neural machine translation · Non-autoregressive translation · Distortion model
1 Introduction
Neural encoder-decoder architectures have gained significant popularity for machine translation [2,8,25,27,32,33]. Recent approaches to sequence-to-sequence learning typically leverage recurrence [25], convolution [8], or attention [27] as basic units. All these models translate a source sentence in an autoregressive manner, generating a target sentence token by token from left to right. A well-known limitation of autoregressive translation (AT) models is that the inference process can hardly be parallelized. To alleviate the inference latency, non-autoregressive translation (NAT) [10] models have been proposed, which initialize the decoder inputs using source inputs copied in a left-to-right manner and generate all target tokens independently and simultaneously, as shown in Fig. 1(b). As a result, NAT models achieve their speedup at the cost of a significant drop in translation quality due to the large search space.
[Figure 1 content: (a) Target language: so where do we go ?; source with fertility: wohin(1) gehen(2) wir(1) also(1) ?(1). (b) Conventional decoder input: wohin gehen gehen wir also ?; decoder output (parallel generation): so where do we go ?. (c) Reordered decoder input: also wohin gehen wir gehen ?; decoder output (parallel generation): so where do we go ?]
Fig. 1. Examples of German-English translation. (a) alignment between source and target language; (b) conventional decoder inputs for NAT; (c) reordered decoder inputs, with which NAT is easier to decode.
Our preliminary experiments show that using gold reordered decoder inputs significantly reduces the gap between NAT and AT. Intuitively, it is easier for NAT to translate the reordered inputs of Fig. 1(c) than the original inputs, since the inputs and outputs are aligned at each position. However, existing NAT models cannot explicitly model the reordering of the decoder inputs. Accordingly, we propose in this paper an explicit distortion mechanism to reorder the decoder inputs so that they approach the order of the decoder outputs, which reduces the decoding search space and increases the certainty of the translation results [10]. Inspired by the IBM distortion model [3], we explore and compare explicit-absolute reordering and explicit-relative reordering methods. Then, we introduce permute-then-copy and copy-then-permute strategies for different explicit reordering capabilities. We extensively evaluate the proposed approach on three similar language pairs (WMT14 En⇒De, WMT16 En⇒Ro, and IWSLT14 De⇒En) and two dissimilar language pairs (NIST Zh⇒En and KFTT En⇒Ja). Extensive experiments demonstrate the effectiveness and universality of our model, which achieves substantial translation improvements on all language pairs while maintaining fast inference speed.
© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 403–415, 2020. https://doi.org/10.1007/978-3-030-60450-9_32
2 Background and Motivation
In order to speed up the inference process of the neural Transformer, NAT modifies the autoregressive architecture so that target words are generated directly in parallel:
$$P(Y|X, \theta) = \prod_{t=1}^{T} p(y_t | \tilde{X}, X, \theta), \tag{1}$$
where $\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T)$ is the sequence of source tokens copied sequentially from the encoder side, as shown in Fig. 1(b). For the sake of brevity, we refer the reader to [10] for more details. Without considering the target translation history, NAT models are weak at exploiting knowledge about target word permutation and tend to generate repeated and incomplete translations [30]. In addition, NAT models suffer from a much larger search space than AT models, in which the target word to be predicted at the current step is relatively definite given the previously generated words [31]. Although NAT models achieve a significant speedup over AT models, they suffer from accuracy degradation without an explicit reordering mechanism. We argue that NAT can be further improved by taking the distortion model into account:
$$P(Y|X, \theta) = \prod_{t=1}^{T} p(y_t | \hat{X}, X, \theta), \tag{2}$$
where $\hat{X}$ is our reordered sequence of source tokens used as decoder inputs. Taking Fig. 1 as an example, it is clear that prediction becomes easier for the NAT model if the decoder inputs are reordered from Fig. 1(b) to Fig. 1(c), because deterministic fertility and reordering narrow the search space of the translation results. Moreover, we find that NAT behaves especially poorly when translating between languages with very different word orders, which accords with our conjecture that NAT models need an explicit distortion mechanism more than AT models do. To further verify this conjecture, we use gold reordered input sequences for training and testing; this model achieves a significant improvement over the baselines and comes close to the performance of the AT model, as suggested by our experiments in Fig. 4.
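The difference between the conventional decoder inputs of Fig. 1(b) and the reordered inputs of Fig. 1(c) can be sketched in a few lines of Python. The function names are hypothetical, and the permutation below is written by hand from the figure rather than predicted by a model; the sketch only illustrates what copying by fertility and reordering do to the token sequence.

```python
def copy_by_fertility(tokens, fertilities):
    """Copy each source token as many times as its fertility, left to right."""
    copied = []
    for token, fertility in zip(tokens, fertilities):
        copied.extend([token] * fertility)
    return copied


def reorder(tokens, order):
    """order[j] is the index in `tokens` of the token placed at output slot j."""
    return [tokens[i] for i in order]


source = ["wohin", "gehen", "wir", "also", "?"]
fertilities = [1, 2, 1, 1, 1]          # as annotated in Fig. 1(a)

conventional = copy_by_fertility(source, fertilities)
# Hand-written permutation that matches the target "so where do we go ?".
reordered = reorder(conventional, [4, 0, 1, 3, 2, 5])

print(" ".join(conventional))
print(" ".join(reordered))
```

After reordering, each decoder input position holds the source word aligned to the target word at the same position, which is the property the distortion model is meant to provide.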
3 Model Description
3.1 Distortion Model
Inspired by the success of the distortion model in SMT [1,3,14], we propose using an explicit distortion model to dynamically reorder the input sequence such that the decoder inputs and the decoder outputs have similar word order, and then to monotonically decode the reordered input sequence. More specifically, we first employ a position-wise neural network to obtain the position information for reordering. Then, with discrete absolute or relative position information, we can directly reorder the input sequence X to get the output discrete tokens $\hat{X}$, which is similar to the IBM distortion model [3]. We refer to the reordering process as a function Mapping(·). Specifically, we propose explicit-absolute reordering and explicit-relative reordering to model the absolute and relative position relations between inputs and outputs, respectively.
Explicit-Absolute Reordering (EAR). We use a pointer network [28] to obtain the absolute position information of the current sequence tokens, i.e., the output values shown in Fig. 2(a). Formally, the pointer network uses attention weights (Eq. 3) as a pointer to select a member of the input sequence as the output (Eq. 4):
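[Figure 2 panels: (a) Explicit-Absolute Reordering: a neural network maps the input (x1, x2, x3, x4, x5) to the absolute positions (2, 3, 4, 1, 5), giving the output (x4, x1, x2, x3, x5). (b) Explicit-Relative Reordering: a neural network maps the same input to the relative offsets (+1, +1, +1, −3, 0), giving the same output.]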
Fig. 2. Different distortion models. Note that the dotted arrow denotes the Mapping(·) which reorders the inputs based on the position information.
$$\mathrm{AbsPosiInfo} = \mathrm{softmax}\left(\frac{Q(K + \alpha)^{T}}{\sqrt{d_k}}\right) \tag{3}$$
$$\hat{X} = \mathrm{Mapping}(X, \mathrm{AbsPosiInfo}) \tag{4}$$
where α denotes the relative position encoding¹, and softmax normalizes the attention weights into an output distribution over the input tokens. Q and K are query and key vectors that are transformed from the hidden state of the previous layer (e.g., the top encoder layer). For example, using the absolute position information (2, 3, 4, 1, 5) obtained from the index of the highest-probability weight, Mapping(·) reorders the inputs (x1, x2, x3, x4, x5) into the output tokens (x4, x1, x2, x3, x5).
Explicit-Relative Reordering (ERR). ERR allows the model to predict the relative position through position-wise self-attention. We adapt the method of [24] to use relative positions¹ (α, β) in the distortion layer (Eq. 5), then use a distortion predictor (Eq. 6) to generate the probability distribution over different distortion numbers (e.g., −2, −1, 0, 1, 2), and use the most probable numbers as the relative reordering information.
$$\hat{H} = \mathrm{softmax}\left(\frac{Q(K + \alpha)^{T}}{\sqrt{d_k}}\right) \cdot (V + \beta) \tag{5}$$
$$\mathrm{RelPosiInfo} = \mathrm{softmax}(\hat{H} \cdot W) \tag{6}$$
where $W \in \mathbb{R}^{d \times (2k+1)}$ is a parameter, d denotes the hidden state size and k denotes the clipping distance (see footnote 1). Finally, we obtain the reordered tokens according to the above position information:
$$\hat{X} = \mathrm{Mapping}(X, \mathrm{RelPosiInfo}) \tag{7}$$
Figure 2(b) demonstrates a relative reordering situation where +1 means to move one step to the right, −3 means to move three steps to the left, and 0 means not to move. We keep the original relative order if the reordered positions of two tokens conflict.
¹ The relative position encodings α and β are computed as $\alpha_{ij} = w^{K}_{\mathrm{clip}(j-i,k)}$ and $\beta_{ij} = w^{V}_{\mathrm{clip}(j-i,k)}$, where clip(x, k) = max(−k, min(k, x)), and i and j denote the absolute positions of two tokens. $w^{K}$ and $w^{V}$ are learnable parameters, and we use k = 100 in our experiments.
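The two variants of Mapping(·) and the clip function from the footnote can be illustrated with a small Python sketch. It rests on stated assumptions: the function names are hypothetical, the absolute position information is read as the 1-based target slot of each input token, and conflicts in the relative variant are resolved with a stable sort, which is one way to "keep the original relative order".

```python
def clip(x, k):
    """clip(x, k) = max(-k, min(k, x)), as defined in the footnote."""
    return max(-k, min(k, x))


def mapping_absolute(tokens, positions):
    """positions[i] is the 1-based output slot assigned to tokens[i]."""
    order = sorted(range(len(tokens)), key=lambda i: positions[i])
    return [tokens[i] for i in order]


def mapping_relative(tokens, offsets):
    """offsets[i] moves tokens[i] right (positive) or left (negative).

    Python's sort is stable, so tokens whose target slots collide keep
    their original relative order (assumed conflict rule).
    """
    targets = [i + offset for i, offset in enumerate(offsets)]
    order = sorted(range(len(tokens)), key=lambda i: targets[i])
    return [tokens[i] for i in order]


inputs = ["x1", "x2", "x3", "x4", "x5"]
absolute_out = mapping_absolute(inputs, [2, 3, 4, 1, 5])
relative_out = mapping_relative(inputs, [+1, +1, +1, -3, 0])
print(absolute_out)
print(relative_out)
```

Both calls reproduce the output (x4, x1, x2, x3, x5) of Fig. 2, one from absolute positions and one from relative offsets.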
Fig. 3. The architecture of the NAT with distortion model (PTC-ERR). The distortion predictor is proposed to model the reordering information in NAT.
3.2 Integrating Distortion Model into NAT
Next, we consider how to integrate the distortion model into an NAT model. We introduce two integration strategies for explicit reordering: permute-then-copy and copy-then-permute.
Permute-Then-Copy (PTC). Figure 3 illustrates the NAT model with the PTC approach. We model the distortion at each position independently using the proposed distortion network on the top encoder layer, and then use a softmax layer to predict the distortion. Using the encoder distortion function dis(·) and the fertility function fer(·), we obtain the decoder inputs $\hat{X}$ by reordering and copying source tokens: $\hat{X} = \mathrm{fer}(\mathrm{dis}(X))$.
Copy-Then-Permute (CTP). Another way is to reorder the decoder inputs at the bottom of the decoder. Unlike PTC, CTP first generates a preliminary input sequence from the source input tokens using the fertility predictor. It then reorders the preliminary sequence using the explicit reordering methods above. Formally, the decoder inputs are calculated as $\hat{X} = \mathrm{dis}(\mathrm{fer}(X))$.
3.3 Joint Training
Given the training set $D = \{X^{N}, Y^{N}\}$ with N sentence pairs, we follow [10] in modeling orders as latent variables for explicit reordering and optimize a variational bound on the overall maximum likelihood loss, which consists of a translation loss, a fertility loss and a distortion loss:
$$\mathcal{L} = \underbrace{\sum_{n=1}^{N}\sum_{t=1}^{T} \log p(y_t | x_{1:m}; \theta_{enc}; \theta_{dec}; \theta_{fer}; \theta_{dis})}_{\text{Translation Loss}} + \underbrace{\sum_{n=1}^{N}\sum_{t=1}^{M} \log p_f(f_t | x_{1:m}; \theta_{enc}; \theta_{fer})}_{\text{Fertility Loss}} + \underbrace{\sum_{n=1}^{N}\sum_{t'=1}^{M} \log p_d(d_{t'} | z_{1:m}; \theta_{enc}; \theta_{dis})}_{\text{Distortion Loss}} \tag{8}$$
where $\theta_{enc}$, $\theta_{dec}$, $\theta_{fer}$ and $\theta_{dis}$ denote the parameters of the encoder, decoder, fertility predictor and distortion predictor, respectively. $z_{1:m}$ equals $x_{1:m}$ under the PTC strategy and $\mathrm{fer}(x_{1:m})$ under the CTP strategy. The reference fertility f and distortion d can be computed from an external aligner or from the attention weights of the autoregressive teacher model.
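One way to derive reference fertilities and distortions from word alignments can be sketched as follows. This is an illustrative assumption rather than the procedure used in the paper: the alignment is taken to be a list of (source index, target index) links in which every target word is aligned exactly once, distortions are expressed as relative offsets of the copied tokens (the copy-then-permute view), and the function name is hypothetical.

```python
def fertility_and_offsets(num_source, links):
    """Derive fertilities and relative offsets from alignment links.

    links: (source_index, target_index) pairs, 0-based, with each target
    index appearing exactly once (assumption).
    """
    targets = [[] for _ in range(num_source)]
    for src, tgt in links:
        targets[src].append(tgt)

    # Fertility: number of target words aligned to each source word.
    fertilities = [len(t) for t in targets]

    # Offset of each copied token: aligned target slot minus its position
    # in the left-to-right copied sequence.
    offsets = []
    position = 0
    for aligned in targets:
        for tgt in sorted(aligned):
            offsets.append(tgt - position)
            position += 1
    return fertilities, offsets


# Alignment of Fig. 1(a): "wohin gehen wir also ?" -> "so where do we go ?"
source = ["wohin", "gehen", "wir", "also", "?"]
links = [(3, 0), (0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]
fertilities, offsets = fertility_and_offsets(len(source), links)
print(fertilities)
print(offsets)
```

The fertilities match the annotation in Fig. 1(a), and applying the offsets to the copied sequence "wohin gehen gehen wir also ?" yields the reordered input of Fig. 1(c).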
4 Experiments
4.1 Dataset
To compare with the results reported in previous work [10,12,15,23,31], we first choose three similar language pairs: WMT14 English-German² (En⇒De), WMT16 English-Romanian³ (En⇒Ro), and IWSLT14 German-English⁴ (De⇒En), whose training sets consist of 4.5M, 600K and 153K sentence pairs, respectively. We tokenize the corpora using a script from Moses and segment each word into subword units using BPE [22,29]. We also evaluate our model on two dissimilar language pairs: Chinese-English (Zh⇒En), with about 2M sentence pairs extracted from the LDC corpus⁵, and English-Japanese (En⇒Ja), with 440K sentence pairs from the KFTT dataset⁶.
4.2 Setting
We implement our model on top of the open-source tensor2tensor⁷ toolkit for training and evaluation. For the NAT model, we use the same network architecture as in [10]. For all translation tasks except De⇒En, we use the hyperparameter settings of the base Transformer model [27], whose encoder and decoder each have 6 layers, a hidden size of 512, 8 attention heads, and a feed-forward inner-layer dimension of 2048. As IWSLT14 is a smaller dataset, we follow [10] and use the same small Transformer setting. For a fair comparison with NAT, we constrain the number of parameters to be similar to that of NAT, so we replace one encoder layer (for the PTC model) or one decoder layer (for the CTP model) with our reordering layer. During training, we also use sequence-level knowledge distillation [10,13] to teach NAT with a distilled corpus, in which the target side of the training corpus is replaced by the output of an AT model. Additionally, we supervise the fertility and distortion predictions at training time using a fixed aligner, fast_align⁸, which produces a deterministic sequence of integer fertilities and distortions for each sentence pair. For evaluation, we use argmax decoding without re-scoring for a fair comparison.⁹ We employ three Titan Xp GPUs to train En⇒De and one GPU for the other tasks. We use 4-gram NIST BLEU to evaluate our proposed model.

² http://www.statmt.org/wmt14/translation-task.html
³ http://www.statmt.org/wmt16/translation-task.html
⁴ https://wit3.fbk.eu/
⁵ The corpora include LDC2000T50, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17 and LDC2004T07.
⁶ http://isw3.naist.jp/~philip-a/emnlp2016/
⁷ https://github.com/tensorflow/tensor2tensor
⁸ https://github.com/clab/fast_align

Table 1. Comparison with existing NAT systems on WMT14 En⇒De, WMT16 En⇒Ro, and IWSLT14 De⇒En tasks. PTC-ERR and CTP-ERR denote explicit-relative reordering using the permute-then-copy and copy-then-permute strategies, respectively. Latency is computed as the average per-sentence decoding time on the De⇒En test set without minibatching. All results of our model are significantly better than NAT without distortion (p < 0.01).

System      Architecture              En⇒De   En⇒Ro   De⇒En   Latency   Speedup
AT systems
[27]        Transformer (b = 4)       27.06   32.28   32.87   640 ms    1.00×
[27]        Transformer (b = 1)       26.23   32.10   31.95   566 ms    1.13×
Existing NAT systems
[10]        NAT                       17.35   26.22   27.10   39 ms     15.6×
[15]        NAT-IR                    18.91   29.66   -       -         1.98×
[12]        NAT-LT                    19.80   -       -       182 ms    3.89×
[31]        Imitate-NAT               22.44   28.61   -       -         18.6×
[23]        Reinforce-NAT             19.15   27.09   -       -         10.73×
Our NAT systems
This work   NAT without distortion    21.03   26.68   27.10   35 ms     18.29×
This work   NAT with PTC-ERR          21.88   28.29   28.94   34 ms     18.82×
This work   NAT with CTP-ERR          22.67   29.05   29.58   37 ms     17.30×

4.3 Results on Similar Language Translation
In this section, we verify our proposed approach on En⇒De, En⇒Ro, and De⇒En, as listed in Table 1.

Translation Quality. Across different datasets, our method achieves significant improvements over previously proposed non-autoregressive models. Specifically, our method (CTP-ERR) outperforms NAT without distortion by 1.64, 2.37, and 2.48 BLEU on En⇒De, En⇒Ro, and De⇒En, respectively. These results demonstrate that the proposed method makes prediction easier for NAT by providing decoder inputs whose word order is close to that of the target tokens, and that it reduces the gap between NAT and AT models. In addition, CTP-ERR performs better than PTC-ERR on most test sets, which suggests that CTP-ERR benefits from deeper and broader reordering.

Inference Latency. Apart from translation accuracy, our proposed models achieve a speedup of 17–19 times over the AT counterpart, and decoding with PTC-ERR is slightly faster than with CTP-ERR. The experiments demonstrate that NAT with an additional distortion model retains an inference speed comparable to that of the standard NAT model, and the speedups of our models are considerably larger than those of the previous NAT-IR [15], NAT-LT [12], and Reinforce-NAT [23].

⁹ We think that comparing NAT with re-scoring to a standard AT model is unfair, because the AT model can also improve its performance by reranking the beam-search results [17].

Table 2. Evaluation of translation quality for Chinese-English and English-Japanese translation tasks using case-insensitive BLEU scores. All results of our model are significantly better than NAT (p < 0.01).

                             NIST Zh⇒En                            KFTT En⇒Ja
Model                        DEV     MT02    MT03    MT04    AVE     DEV     TEST
Transformer [27]             45.29   45.54   47.82   46.57   46.64   29.27   32.17
NAT [10]                     28.06   29.74   29.27   30.44   29.82   13.60   17.42
Our model (NAT + PTC-ERR)    30.50   31.21   32.76   33.28   32.42   16.12   19.69
Our model (NAT + CTP-ERR)    30.98   32.47   33.14   32.95   32.85   16.19   20.31
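The speedup and BLEU figures quoted in Section 4.3 can be re-derived from the rows of Table 1. The short Python sketch below does so under the assumption that speedup is the ratio of the beam-4 Transformer latency (640 ms) to each system's latency, and that the BLEU gains are measured against the "NAT without distortion" row; the dictionary layout and function names are illustrative only.

```python
AT_LATENCY_MS = 640.0  # Transformer with beam size 4, from Table 1

# (latency in ms, BLEU on En-De, En-Ro, De-En), copied from Table 1.
systems = {
    "NAT without distortion": (35.0, (21.03, 26.68, 27.10)),
    "NAT with PTC-ERR": (34.0, (21.88, 28.29, 28.94)),
    "NAT with CTP-ERR": (37.0, (22.67, 29.05, 29.58)),
}


def speedup(latency_ms: float, baseline_ms: float = AT_LATENCY_MS) -> float:
    """Speedup relative to the autoregressive baseline latency."""
    return baseline_ms / latency_ms


def bleu_gains(system, baseline):
    """Per-task BLEU difference, rounded to two decimals."""
    return tuple(round(s - b, 2) for s, b in zip(system, baseline))


speedups = {name: round(speedup(lat), 2) for name, (lat, _) in systems.items()}
ctp_gains = bleu_gains(
    systems["NAT with CTP-ERR"][1], systems["NAT without distortion"][1]
)

print(speedups)   # speedups of the three NAT variants over the AT baseline
print(ctp_gains)  # (1.64, 2.37, 2.48)
```

The computed speedups fall between 17 and 19, in line with the range stated in the text. The speedups listed for previously published systems in Table 1 are taken from the cited papers, so they need not match this ratio.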
4.4 Results on Dissimilar Language Translation

We conduct additional experiments on two dissimilar language pairs, Zh⇒En and En⇒Ja, to better evaluate our model. As shown in Table 2, our proposed models significantly outperform the NAT model on all test sets. In particular, our approach with CTP-ERR improves over a strong NAT model by 3.03 and 2.89 BLEU points on the two tasks, respectively. These results demonstrate that the proposed method remains effective for distant language pairs. Comparing Table 1 and Table 2, the gap between NAT and AT is larger for dissimilar language pairs than for similar ones, which indicates that determining the correct word order of translated words is a greater challenge for NAT. Although our model achieves a clear improvement on both similar and dissimilar language pairs, there is still considerable room for improvement on dissimilar pairs, which we leave for future work.

Table 3. Experimental results of combining different integration strategies and reordering methods on the De⇒En dev set.

#   Strategies                 Models                                 DEV
1   -                          NAT [10]                               27.11
2   Permute-Then-Copy (PTC)    Explicit-Absolute Reordering (EAR)     27.52
3   Permute-Then-Copy (PTC)    Explicit-Relative Reordering (ERR)     28.42
4   Copy-Then-Permute (CTP)    Explicit-Absolute Reordering (EAR)     28.02
5   Copy-Then-Permute (CTP)    Explicit-Relative Reordering (ERR)     28.99
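The gap comparison made in Section 4.4 can be spelled out with the numbers in Tables 1 and 2. The worked differences below assume that the AT reference is the beam-4 Transformer row, that the NAT baseline is "NAT without distortion" in Table 1 and "NAT [10]" in Table 2, and that Zh⇒En is read from the AVE column and En⇒Ja from the TEST column; the text does not state which rows were compared, so this is one plausible reading.

$$
\begin{aligned}
\text{En⇒De:}\quad & 27.06 - 21.03 = 6.03 \\
\text{En⇒Ro:}\quad & 32.28 - 26.68 = 5.60 \\
\text{De⇒En:}\quad & 32.87 - 27.10 = 5.77 \\
\text{Zh⇒En (AVE):}\quad & 46.64 - 29.82 = 16.82 \\
\text{En⇒Ja (TEST):}\quad & 32.17 - 17.42 = 14.75
\end{aligned}
$$

Under these assumptions the AT–NAT gap is roughly 6 BLEU on the similar pairs and roughly 15 to 17 BLEU on the dissimilar pairs.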
4.5 Analysis

We conduct extensive analysis from different perspectives to better understand our model.

[Figure 4: bar chart of translation accuracy for German-English and Chinese-English. Series: AT; NAT; Our Model; NAT + Oracle Fertility; NAT + Oracle Fertility + Oracle Encoder Distortion; NAT + Oracle Fertility + Oracle Decoder Distortion.]

Fig. 4. Results of oracle distortion in similar translation pairs De⇒En and dissimilar pairs Zh⇒En.
Explicit-Absolute vs. Explicit-Relative Reordering. We compare different reordering methods and integration strategies on the De⇒En dev set. As listed in Table 3 (row 2 vs. 3 and row 4 vs. 5), explicit-relative reordering with PTC improves over explicit-absolute reordering by 0.90 BLEU, and by 0.97 BLEU with CTP. One possible reason is that, similar to the distortion model in SMT, modeling the absolute position with a simple and independent point network is more difficult than modeling the relative position.

Effect of Oracle Reordering. Our distortion model is relatively simple and could be substantially improved in future work, so we also explore the potential effect of a perfect distortion model. To this end, we evaluate the performance of the NAT model when it is provided with external inputs from an oracle. As illustrated in Fig. 4, the NAT model with oracle decoder distortion substantially improves over our NAT model and approaches the quality of the AT model. These results demonstrate the importance of the distortion mechanism for the NAT model and point to a direction for future research.

Evaluation of Repeated Translation. Existing NAT models suffer from the repeated translation issue, in which the same token is generated several times in succession, e.g., “where where we ?” in the example of Table 5. To further understand our model, we count the repeated translations produced by NAT and by our model, as listed in Table 4. Our model alleviates the repeated translation problem by integrating reordering information into NAT and enhancing the ability to distinguish adjacent target-side hidden states, reducing repeated translation errors by 27%.

Table 4. Evaluation of repeated translation on the De⇒En dev set. “#Nums-all” is the total number of repeated translation words, and “#Nums-ave” is the average number of repeated translations per sentence. Δ indicates the relative improvement.

Models       #Nums-all   #Nums-ave   Δ
NAT          3840        0.53        -
Our model    2818        0.39        −27%

Case Study. Table 5 shows the example mentioned in Fig. 1 for the standard NAT model and our models. Without a distortion mechanism, the decoder inputs are in the order of the source tokens, so the standard NAT model translates “gehen wir” into a set of meaningless tokens. Our proposed model with reordered decoder inputs obtains a superior translation, although the reordered inputs are not perfect. The result demonstrates that the distortion model can relieve the reordering burden of NAT and help it generate better target sentences from reordered decoder inputs.

Table 5. Translation examples from the De⇒En task. Input and Output denote the decoder inputs and decoder outputs of NAT, respectively. We use blue font to indicate the correct reordering.

Source:      wohin gehen wir also ?
Reference:   so where do we go ?
NAT          Input:  wohin gehen wir ?
             Output: where where we ?
+ PTC-ERR    Input:  wohin wir gehen also ?
             Output: so where we go ?
+ CTP-ERR    Input:  wohin wohin wir also gehen ?
             Output: so where do we go ?
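The repeated-translation statistic reported in Table 4 can be approximated with a simple count of adjacent duplicate tokens. The Python sketch below assumes whitespace tokenization and counts a token as repeated when it equals the token immediately before it; the exact counting procedure used for Table 4 is not specified in the text, so the function names and this definition are assumptions.

```python
from typing import Iterable, List, Tuple


def count_repeats(tokens: List[str]) -> int:
    """Number of tokens identical to the token immediately before them."""
    return sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)


def repeat_statistics(sentences: Iterable[str]) -> Tuple[int, float]:
    """Return (total repeats, average repeats per sentence)."""
    counts = [count_repeats(sentence.split()) for sentence in sentences]
    total = sum(counts)
    average = total / len(counts) if counts else 0.0
    return total, average


# Decoder outputs taken from the case study in Table 5.
nat_total, nat_avg = repeat_statistics(["where where we ?"])
ours_total, ours_avg = repeat_statistics(["so where do we go ?"])

# Relative change recomputed from the totals in Table 4.
relative_change = (2818 - 3840) / 3840

print(nat_total, ours_total, f"{relative_change:.0%}")  # 1 0 -27%
```

Applied to a full set of system outputs, `repeat_statistics` yields the two quantities that Table 4 labels #Nums-all and #Nums-ave under the stated definition.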
5 Related Work

Our work is built upon a NAT model, but it is motivated by the distortion models in SMT. We discuss the two topics below.

5.1 Non-autoregressive Translation

To speed up inference, a line of work has developed non-autoregressive translation models [9–11,15,18,23,30,31,36]. Because the quality of the decoder inputs is crucial and largely determines model accuracy, [10] proposed to generate decoder inputs by a uniform mapping method or a fertility prediction function, which copies source inputs from the encoder side from left to right. Recently, [11] boosted the performance of NAT by enhancing the decoder inputs with a phrase table and an embedding transformation. However, the decoder inputs of the above models are still in the order of the source language, lacking an explicit distortion mechanism. In parallel with our work, [21] proposed to guide NAT decoding with reordering information. One main difference is that our model with the CTP strategy allows deeper reordering; in contrast, their approach can only reorder the original inputs rather than the copied inputs, and it greatly reduces the decoding speed.

5.2 Distortion Models

Our work is also inspired by the distortion models that are widely used in SMT [1,3,14,20,26]. Another line of study related to our idea is pre-ordering for SMT or NMT [4–6,16,19,35]. [6] first investigated the impact of pre-reordering methods on NMT, with speed degradation. To alleviate the reordering problem of AT models, [7] introduced a recurrent attention mechanism as an implicit distortion model. [34] presented a distortion model to enhance attention-based NMT by incorporating word reordering knowledge. [4] proposed to learn the reordering embedding of a word based on its contextual information. Nevertheless, little attention has been paid to exploiting reordering information in NAT by using distortion models.
6 Conclusion

We find that gold reordering of decoder inputs can significantly reduce the gap between NAT and AT, bringing the quality of NAT close to that of AT. Inspired by these findings, we propose using explicit distortion models to effectively reorder the decoder inputs and reduce the decoding search space of NAT. We compare different reordering models and extensively evaluate the proposed model on three similar language pairs and two dissimilar pairs. Experimental results show that NAT with a distortion model achieves consistent and substantial improvements in translation quality while maintaining the speed advantage of NAT.

Acknowledgments. The research work has been funded by the Natural Science Foundation of China under Grant No. U1836221 and 61673380. The research work in this paper has also been supported by Beijing Advanced Innovation Center for Language Resources and Beijing Academy of Artificial Intelligence (BAAI2019QN0504).
References

1. Al-Onaizan, Y., Papineni, K.: Distortion models for statistical machine translation. In: ACL 2006 (2006)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR 2015 (2015)
3. Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
4. Chen, K., Wang, R., Utiyama, M., Sumita, E.: Neural machine translation with reordering embeddings. In: ACL 2019 (2019)
5. De Gispert, A., Iglesias, G., Byrne, B.: Fast and accurate preordering for SMT using neural networks. In: NAACL 2015 (2015)
6. Du, J., Way, A.: Pre-reordering for neural machine translation: helpful or harmful? Prague Bull. Math. Linguist. 108(1), 171–182 (2017)
7. Feng, S., Liu, S., Yang, N., Li, M., Zhou, M., Zhu, K.Q.: Improving attention modeling with implicit distortion and fertility for machine translation. In: COLING 2016 (2016)
8. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: ICML 2017 (2017)
9. Ghazvininejad, M., Levy, O., Liu, Y., Zettlemoyer, L.: Mask-predict: parallel decoding of conditional masked language models. In: EMNLP-IJCNLP 2019 (2019)
10. Gu, J., Bradbury, J., Xiong, C., Li, V.O., Socher, R.: Non-autoregressive neural machine translation. In: ICLR 2017 (2017)
11. Guo, J., Tan, X., He, D., Qin, T., Xu, L., Liu, T.Y.: Non-autoregressive neural machine translation with enhanced decoder input. In: AAAI 2019 (2019)
12. Kaiser, Ł., et al.: Fast decoding in sequence models using discrete latent variables. In: ICML 2018 (2018)
13. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: EMNLP 2016 (2016)
14. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: HLT-NAACL 2003 (2003)
15. Lee, J., Mansimov, E., Cho, K.: Deterministic non-autoregressive neural sequence modeling by iterative refinement. In: EMNLP 2018 (2018)
16. Lerner, U., Petrov, S.: Source-side classifier preordering for machine translation. In: EMNLP 2013 (2013)
17. Liu, Y., Zhou, L., Wang, Y., Zhao, Y., Zhang, J., Zong, C.: A comparable study on model averaging, ensembling and reranking in NMT. In: Zhang, M., Ng, V., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2018. LNCS (LNAI), vol. 11109, pp. 299–308. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99501-4_26
18. Ma, X., Zhou, C., Li, X., Neubig, G., Hovy, E.: FlowSeq: non-autoregressive conditional sequence generation with generative flow. In: EMNLP-IJCNLP 2019 (2019)
19. Nakagawa, T.: Efficient top-down BTG parsing for machine translation preordering. In: ACL-IJCNLP 2015 (2015)
20. Och, F.J., et al.: A smorgasbord of features for statistical machine translation. In: NAACL 2004 (2004)
21. Ran, Q., Lin, Y., Li, P., Zhou, J.: Guiding non-autoregressive neural machine translation decoding with reordering information. arXiv preprint arXiv:1911.02215 (2019)
22. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: ACL 2016 (2016)
23. Shao, C., Feng, Y., Zhang, J., Meng, F., Chen, X., Zhou, J.: Retrieving sequential information for non-autoregressive neural machine translation. In: ACL 2019 (2019)
24. Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. In: NAACL 2018 (2018)
25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS 2014 (2014)
26. Tillmann, C.: A unigram orientation model for statistical machine translation. In: NAACL 2004 (2004)
27. Vaswani, A., et al.: Attention is all you need. In: NIPS 2017 (2017)
28. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: NIPS 2015 (2015)
29. Wang, Y., Zhou, L., Zhang, J., Zong, C.: Word, subword or character? An empirical study of granularity in Chinese-English NMT. In: Wong, D.F., Xiong, D. (eds.) CWMT 2017. CCIS, vol. 787, pp. 30–42. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-7134-8_4
30. Wang, Y., Tian, F., He, D., Qin, T., Zhai, C., Liu, T.Y.: Non-autoregressive machine translation with auxiliary regularization. In: AAAI 2019 (2019)
31. Wei, B., Wang, M., Zhou, H., Lin, J., Sun, X.: Imitation learning for non-autoregressive neural machine translation. In: ACL 2019 (2019)
32. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
33. Zhang, J., Zong, C.: Neural machine translation: Challenges, progress and future. arXiv preprint arXiv:2004.05809 (2020)
34. Zhang, J., Wang, M., Liu, Q., Zhou, J.: Incorporating word reordering knowledge into attention-based neural machine translation. In: ACL 2017 (2017)
35. Zhao, Y., Zhang, J., Zong, C.: Exploiting pre-ordering for neural machine translation. In: LREC 2018 (2018)
36. Zhou, L., Zhang, J., Yu, H., Zong, C.: Sequence generation: from both sides to the middle. In: IJCAI 2019 (2019)
Incorporating Phrase-Level Agreement into Neural Machine Translation

Mingming Yang¹(✉), Xing Wang², Min Zhang³, and Tiejun Zhao¹

¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China [email protected], [email protected]
² Tencent AI Lab, Shenzhen, China [email protected]
³ School of Computer Science and Technology, Soochow University, Suzhou, China [email protected]
Abstract. Phrase information has been successfully integrated into current state-of-the-art neural machine translation (NMT) models. However, the natural property of the source and target phrase alignment has not been explored. In this paper, we propose a novel phrase-level agreement method to deal with this problem. First, we explore n-gram models over minimal translation units (MTUs) to explicitly capture aligned bilingual phrases from the parallel corpora. Then, we propose a phrase-level agreement loss that directly reduces the difference between the representations of the source-side and target-side phrase. Finally, we integrate the phrase-level agreement loss into the NMT models, to improve the translation performance. Empirical results on the NIST Chinese-to-English and the WMT English-to-German translation tasks demonstrate that the proposed phrase-level agreement method achieves significant improvements over state-of-the-art baselines, demonstrating the effectiveness and necessity of exploiting phrase-level agreement for NMT.

Keywords: Phrase-level agreement · Neural machine translation · Minimal translation units

1 Introduction
In recent years, neural machine translation (NMT) [1,10] has achieved remarkable progress under the encoder-decoder framework. The attention-based Transformer [26] model has achieved state-of-the-art performance on multiple language pairs. Previous studies demonstrate that phrase modeling, which plays a crucial role in traditional statistical machine translation (SMT) [13], can also improve NMT translation performance. Wang et al. [30] and Zhao et al. [37] introduce external phrases into the NMT decoding process to guide phrase generation. For the Transformer model, localness modeling [32], target-side attention [34], phrase-based attention [2,19], and multi-granularity self-attention [6] have been proposed to help the model capture phrase structure. In addition, Park et al. [21] introduce a phrase-based NMT model built upon continuous-output NMT, in which the decoder generates embeddings of words or phrases. However, a major problem is that the above methods capture the phrase structure of only the source or the target side and do not consider the direct connection between source and target phrases.

To alleviate this problem, and inspired by the sentence-level agreement method [33], which brings the source and target sentence-level semantic representations closer, we propose a novel phrase-level agreement architecture for the NMT model. It works by directly minimizing the representation difference between the source and target phrases. Specifically, bilingual phrase segmentation is first needed to obtain aligned phrases as elementary units. To this end, we adopt minimal translation units (MTUs), a highly effective statistical phrase alignment method that has been successfully applied in translation systems [23,35], to align the bilingual phrase segmentation. We then propose a phrase-level agreement loss based on the phrase representations and integrate it into the NMT models.

We validate the proposed phrase-level agreement on the state-of-the-art Transformer model [26]. Experimental results on the benchmark NIST Chinese-to-English and WMT14 English-to-German translation tasks show that the proposed phrase-level agreement consistently improves translation performance across language pairs. Linguistic analyses [4] reveal that the proposed phrase-level agreement exploits richer linguistic information for the NMT model. This paper primarily makes the following contributions:

1. We design a novel architecture that integrates the alignment agreement of bilingual phrases into the state-of-the-art Transformer model so that the translation model can explicitly handle phrases as the basic translation unit.
2. Experimental results demonstrate that the proposed phrase-level agreement architecture achieves significant performance improvements over the state-of-the-art baseline Transformer models, demonstrating the effectiveness and necessity of exploiting phrase-level agreement for NMT.

© Springer Nature Switzerland AG 2020. X. Zhu et al. (Eds.): NLPCC 2020, LNAI 12430, pp. 416–428, 2020. https://doi.org/10.1007/978-3-030-60450-9_33
2 Background

2.1 Neural Machine Translation
In recent NMT work, an encoder-decoder framework [10] is proposed. It employs a recurrent neural network (RNN) encoder to represent the source-side sentence as a sequence of vectors, which are then fed into an RNN decoder to generate the target translation word by word. In particular, NMT with an attention mechanism was proposed to dynamically acquire a context vector over the sequence of vectors at each decoding step, thus improving the performance of NMT [1]. Unlike the conventional RNN-based model, which uses recurrence as the basic building module, the Transformer replaces the RNN with the self-attention network (SAN) to process input elements in parallel. In this section, we take the Transformer architecture proposed by Vaswani et al. [26], which is the state-of-the-art translation architecture, as the baseline system. In this encoder-decoder architecture, X = {x1, x2, ..., xJ} represents a source-side sentence and Y = {y1, y2, ..., yI} represents a target-side sentence. The model learns to estimate the conditional probability of the target-side sentence given the source-side sentence word by word: P (y|x; θ) =
I
P (yi |y