Natural Language Processing and Chinese Computing: 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12–15, 2023, Proceedings, Part II (Lecture Notes in Computer Science, 14303) 303144695X, 9783031446955

This three-volume set constitutes the refereed proceedings of the 12th National CCF Conference on Natural Language Proce

125 77 50MB

English Pages 900 [885] Year 2023

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Preface
Organization
Contents – Part II
A Benchmark for Understanding Dialogue Safety in Mental Health Support
1 Introduction
2 Related Work
3 Dialogue Safety Taxonomy
3.1 Term of Dialogue Safety
3.2 Concrete Categories
4 Data Collection
4.1 Data Source
4.2 Annotation Process
4.3 Data Filtering
4.4 Data Statistics
5 Experiments
5.1 Problem Formulation
5.2 Setup
6 Results
6.1 Fine-Grained Classification
6.2 Coarse-Grained Safety Identification
6.3 Manual Inspection of Samples Labeled as Nonsense
7 Conclusion
References
Poster: Fundamentals of NLP
Span-Based Pair-Wise Aspect and Opinion Term Joint Extraction with Contrastive Learning
1 Introduction
2 Related Work
2.1 Pair-Wise Aspect and Opinion Term Extraction
2.2 Span-Based Methods and Contrastive Learning
3 Methods
3.1 Problem Definition
3.2 BERT Encoder
3.3 Span-Based CNN with Contrastive Learning for Term Extraction
3.4 GCN with Contrastive Learning for Term Pairing
3.5 Model Training
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Baselines
4.4 Main Results
4.5 Ablation Study
4.6 Case Study
4.7 Contrastive Learning Visualization
5 Conclusion
References
Annotation Quality Measurement in Multi-Label Annotations
1 Introduction
2 Related Work
3 A Fine-Grained Multi-rater Multi-label Agreement Measure
3.1 MLA Algorithm
3.2 MLA’s Compatibility of Other Agreement Measures
3.3 Data Simulation Toolset
4 Experiments
4.1 Data Generation
4.2 Multi-class Agreement Measure
4.3 Multi-label Agreement Measure
4.4 Fine-grained Agreement Measure
5 Conclusion
References
Prompt-Free Few-Shot Learning with ELECTRA for Acceptability Judgment
1 Introduction
2 Related Work
2.1 Acceptability Judgment
2.2 Prompt-Based Few-Shot Learning
3 Preliminaries
3.1 Standard Fine-Tuning
3.2 Prompt-Based Few-Shot Learning with ELECTRA
4 Methodology
4.1 Overview
4.2 Token-Level Fine-Tuning
4.3 Sentence-Level Fine-Tuning
4.4 Joint Fine-Tuning
5 Experiment
5.1 Experimental Settings
5.2 Results on CoLA Overall Test Set
5.3 Results on CoLA Phenomenon-Specific Test Set
6 Conclusion
References
Dual Hierarchical Contrastive Learning for Multi-level Implicit Discourse Relation Recognition
1 Introduction
2 Related Work
2.1 Implicit Discourse Relation Recognition
2.2 Contrastive Learning
3 Model: DHCL
3.1 Argument Pair and Discourse Relation Encoder
3.2 Dual Hierarchical Contrastive Learning
3.3 Prediction and Contrastive Loss
4 Experiments
4.1 Settings
4.2 Comparison Methods
4.3 Results and Analysis
4.4 Ablation Study and Analysis
4.5 Parameter Analysis for
5 Conclusion
References
Towards Malay Abbreviation Disambiguation: Corpus and Unsupervised Model
1 Introduction
2 Related Work
3 Corpus Construction
3.1 Construction Process
3.2 Corpus Statistics
4 Unsupervised Framework Based on Pre-trained Models
4.1 Input Generation
4.2 Scoring
4.3 Answer Generation
5 Experiment
5.1 Experimental Setup
5.2 Evaluation Metrics
5.3 Malay Dataset Experiment
5.4 Other Language Dataset Experiments
5.5 Analysis and Discussion
6 Conclusion
References
Poster: Information Extraction and Knowledge Graph
Topic Tracking from Classification Perspective: New Chinese Dataset and Novel Temporal Correlation Enhanced Model
1 Introduction
2 Related Work
3 Methodology
3.1 Notation
3.2 Semantic Correlation Modeling
3.3 Temporal Correlation Modeling
3.4 Training
4 Experiment Settings
4.1 Dataset
4.2 Metrics
4.3 Implementation Details
4.4 Compared Methods
5 Experiment Results
5.1 Main Results
5.2 Ablation Study
5.3 Incremental Experiment
6 Conclusion
References
A Multi-granularity Similarity Enhanced Model for Implicit Event Argument Extraction
1 Introduction
2 Related Work
3 Implicit EAE Formulation
4 Method
4.1 Document Encoder
4.2 Heterogeneous Graph Network
4.3 Implicit Event Argument Extraction
4.4 Multi-granularity Similarity Enhancement
4.5 Training and Inference
5 Experiments
5.1 Experimental Setup
5.2 Overall Performance
6 Discussion
6.1 Long-Range Dependency
6.2 Distracting Context
6.3 Ablation Study
6.4 Case Study
6.5 Error Analysis
6.6 Analysis of Hyperparameters
7 Conclusion
References
Multi-perspective Feature Fusion for Event-Event Relation Extraction
1 Introduction
2 Related Works
3 Task Formulation
4 Methodology
4.1 Encoder Module
4.2 Graph Construction
4.3 Event Representation
4.4 Relation Predicting
5 Experiments
5.1 Dataset and Evaluation Metric
5.2 Experimental Setting
5.3 Baselines
5.4 Main Results
5.5 Ablation Studies
5.6 Analysis of Graph Layer
6 Conclusion
References
Joint Cross-Domain and Cross-Lingual Adaptation for Chinese Opinion Element Extraction
1 Introduction
2 Corpus Construction
2.1 The Cross-Domain Corpus
2.2 The Cross-Lingual Corpus
2.3 The Pseudo Corpus by Machine Translation
2.4 Data Statistics
3 Method
3.1 Encoder
3.2 Decoder
3.3 Training and Inference
4 Experiments
4.1 Settings
4.2 Results and Analysis
5 Related Work
6 Conclusions and Future Work
References
A Relational Classification Network Integrating Multi-scale Semantic Features
1 Introduction
2 Related Work
3 Method
3.1 Text Embedding and Feature Extraction
3.2 Sentence Level Feature Extraction Module
3.3 Entity Hierarchical Feature Extraction Module
4 Experiment
4.1 Experiment Settings
4.2 Comparative Experiment and Analysis
4.3 Ablation Experiment and Analysis
5 Conclusion
References
A Novel Semantic-Enhanced Time-Aware Model for Temporal Knowledge Graph Completion
1 Introduction
2 Preliminary
2.1 Temporal Knowledge Graph
2.2 Temporal Knowledge Graph Completion
2.3 Concept and Instance Conceptualization
3 Methodology
3.1 Encoder
3.2 Decoder
4 Experiments
4.1 Datasets and Metrics
4.2 Baselines
4.3 Settings
4.4 Results and Analysis
5 Conclusion
References
Dual-Prompting Interaction with Entity Representation Enhancement for Event Argument Extraction
1 Introduction
2 Related Work
2.1 Event Argument Extraction
2.2 Prompt-Based Learning
3 Methodology
3.1 Dual-Prompting Template Creation
3.2 Entity Representation Enhancement
3.3 Dual-Prompting Template Interaction
4 Experiments
4.1 Settings
4.2 Overall Performance
4.3 Ablation Study
4.4 Different Approaches to Model Entities
4.5 Golden vs Non-golden Entity Annotation
5 Conclusions
References
Collective Entity Linking with Joint Subgraphs
1 Introduction
2 Motivation
3 Background
3.1 Problem Definition
4 Collective Method
4.1 Subgraph
4.2 Joint Graph Representation
4.3 Relevance Scores
4.4 GNN Architecture
4.5 Learning
5 Experiment
5.1 Experimental Settings
5.2 Collective Method
5.3 GNN Architecture
5.4 Graph Connection
5.5 Relevance Score
6 Related Work
7 Conclusion
References
Coarse-to-Fine Entity Representations for Document-Level Relation Extraction
1 Introduction
2 Methodology
2.1 Task Formulation
2.2 Document-Level Graph Construction
2.3 Coarse-to-Fine Entity Representations
3 Experiments and Analysis
3.1 Dataset
3.2 Experimental Settings
3.3 Results
3.4 Analysis
3.5 Case Study
4 Related Work
5 Conclusion
References
Research on Named Entity Recognition Based on Bidirectional Pointer Network and Label Knowledge Enhancement
1 Introduction
2 Proposed Model
2.1 Data Construction
2.2 Semantics Encoding Module
2.3 Semantic Fusion Module
2.4 Entity Boundary Detection
2.5 Loss Function
3 Experiments
3.1 Datasets
3.2 Baselines
3.3 Experimental Setups
3.4 Main Results
3.5 Ablation Studies
4 Conclusion
References
UKT: A Unified Knowledgeable Tuning Framework for Chinese Information Extraction
1 Introduction
2 Related Work
3 UKT: The Proposed Method
3.1 A Brief Overview of UKT
3.2 Multi-relational Multi-head Knowledge Fusion
3.3 Relational Knowledge Validation
3.4 UKT for Chinese IE
4 Experiments
4.1 Experiments Setup
4.2 Overall Performance
4.3 Model Analysis
5 Conclusion
References
SSUIE 1.0: A Dataset for Chinese Space Science and Utilization Information Extraction
1 Introduction
2 Related Work
2.1 Named Entity Recognition
2.2 Entity-Relation Joint Extraction
2.3 Event Extraction
3 SSUIE Dataset
3.1 Schema Construction
3.2 Data Collection and Filtering
3.3 Data Annotation
3.4 Data Statistics
4 Chinese Space Science and Utilization Information Extraction
4.1 Named Entity Recognition
4.2 Entity-Relation Joint Extraction
4.3 Event Extraction
5 Conclusion
References
Extract Then Adjust: A Two-Stage Approach for Automatic Term Extraction
1 Introduction
2 Related Works
2.1 Machine Learning Approaches
2.2 Deep Learning Approaches
3 Method
3.1 Span Extractor
3.2 Boundary Adjust Module
4 Experiment Setup
4.1 Datasets
4.2 Baselines
4.3 Evaluation Metrics
4.4 Training Details
5 Results and Analysis
5.1 Overall Results
5.2 Ablation Study
5.3 Case Study
6 Conclusion
References
UniER: A Unified and Efficient Entity-Relation Extraction Method with Single-Table Modeling
1 Introduction
2 Related Work
3 Methodology
3.1 Task Definition
3.2 Unified Representation
3.3 Interactive Table Modeling
3.4 Loss Function
4 Experiments
4.1 Datasets and Evaluation Metrics
4.2 Results and Analysis
5 Conclusion
References
Multi-task Biomedical Overlapping and Nested Information Extraction Model Based on Unified Framework
1 Introduction
2 Related Work
3 Method
3.1 Semantic Construction Layer
3.2 Encoding Layer
3.3 Feature Extraction Layer
3.4 Decoding Layer
3.5 Loss Layer
4 Experiment and Result Analysis
4.1 Dataset
4.2 Experimental Setup
4.3 Results and Analysis of Experiments
4.4 Ablation Experiment
5 Conclusion
References
NTAM: A New Transition-Based Attention Model for Nested Named Entity Recognition
1 Introduction
2 Related Work
2.1 Nested Named Entity Recognition
2.2 Transition-Based Method
3 Methodology
3.1 Nested NER Task Definition
3.2 State Transition System
3.3 Representation
3.4 State Prediction
4 Experiments
4.1 Datasets
4.2 Settings
4.3 Baseline for Nested NER
4.4 Main Results
4.5 Ablation Analysis
4.6 Error Analysis
5 Conclusion
References
Learning Well-Separated and Representative Prototypes for Few-Shot Event Detection
1 Introduction
2 Related Work
3 Method
3.1 Problem Formulation
3.2 Model
3.3 Prototype Generation Module
3.4 Prototype Instantiation Module
3.5 Sequence Labeling Module
3.6 Model Training
4 Experiments
4.1 Datasets
4.2 Baselines and Settings
4.3 Main Results
4.4 Domain Adaption Analysis
4.5 Ablation Study
5 Conclusion
References
Poster: Machine Learning for NLP
A Frustratingly Easy Improvement for Position Embeddings via Random Padding
1 Introduction
2 Background
2.1 Task Definition
2.2 Investigated Model
2.3 Pilot Experiment
3 Our Method: Random Padding
4 Experiments
4.1 Dataset Preparation
4.2 Implementation Details
5 Main Results
5.1 Train Short, Test Long
5.2 Train/Test with Similar Context Lengths
6 Analysis and Discussions
6.1 Analysis on Answer Positions
6.2 How Random Padding Improves QA Performance?
7 Results on More Benchmark Datasets
8 Conclusion
References
IDOS: A Unified Debiasing Method via Word Shuffling
1 Introduction
2 Approach
2.1 Problem Setup
2.2 Ensemble-Based Debiasing Method
2.3 Unified Debiasing Method via Word Shuffling
3 Experiment
3.1 Datasets
3.2 Baselines
3.3 Implementation Details
3.4 Results
4 Analysis
4.1 HANS Heuristics
4.2 Adversarial Robustness
4.3 Bias Rate
5 Related Work
5.1 Biases in NLI Datasets
5.2 Debiasing Methods
6 Conclusion
References
FedEAE: Federated Learning Based Privacy-Preserving Event Argument Extraction
1 Introduction
2 Related Work
2.1 Event Argument Extraction
2.2 Federated Learning
3 Method
3.1 Problem Formulation
3.2 Structure of FedEAE
3.3 FedACE Dataset
4 Experimental Evaluations
4.1 Learning Model
4.2 Datasets
4.3 Experiment Setup
4.4 Experiment Results and Analysis
5 Conclusion and Future Work
References
Event Contrastive Representation Learning Enhanced with Image Situational Information
1 Introduction
2 Related Work
3 Methodology
3.1 Text Event Encoder
3.2 Image Event Encoder
3.3 Multimodal Contrastive Learning
3.4 Multimodal Prototype-Based Clustering
3.5 Model Training
4 Experiment
4.1 Dataset and Implementation Details
4.2 Event Similarity Tasks
5 Conclusions
References
Promoting Open-Domain Dialogue Generation Through Learning Pattern Information Between Contexts and Responses
1 Introduction
2 Background
2.1 Open-Domain Dialogue Generation
2.2 Scheduled Sampling
3 Response-Aware Dialogue Model
3.1 Pre-trained Language Model
3.2 Scheduled Sampling for Pre-trained Model
3.3 Response-Aware Mechanism
4 Experiment Settings
5 Experimental Results
5.1 Automatic Evaluation
5.2 Human Evaluation
5.3 Ablation Study
6 Conclusion
References
Neural News Recommendation with Interactive News Encoding and Mixed User Encoding
1 Introduction
2 Related Work
3 Our Approach
3.1 Interactive News Encoder
3.2 Mixed User Encoder
3.3 Click Predictor and Model Training
4 Experiment
4.1 Dataset and Experimental Settings
4.2 Baseline
4.3 Main Results
4.4 Compatibility Experiment
4.5 Ablation Experiment
5 Conclusion
References
Poster: Machine Translation and Multilinguality
CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction
1 Introduction
2 Related Works
3 Methodology
3.1 BLI Task Definition
3.2 Method in a Nutshell
3.3 Confidence Score Calculation
3.4 C1: Refinement Based on WEs
3.5 C2: Refinement Based on LMs
3.6 Combining the Outputs of C1 and C2
4 Experiment
4.1 Experimental Settings
4.2 Main Results
4.3 Ablation Study
4.4 Parameter Sensitivity Analysis
5 Conclusion
References
A Novel POS-Guided Data Augmentation Method for Sign Language Gloss Translation
1 Introduction
2 Related Work
3 Methodology
3.1 Deep Analysis Based POS Distribution Between Gloss and Text
3.2 Pseudo Sign Language Gloss-Text Pair Generation
4 Experiments and Results
4.1 Datasets
4.2 Architecture
4.3 Baselines
4.4 Main Results
4.5 Analysis
5 Conclusion
References
Faster and More Robust Low-Resource Nearest Neighbor Machine Translation
1 Introduction
2 Background and Related Work
2.1 Memory-Augmented NMT
2.2 Decoding Efficiency Optimization
3 Methodology
3.1 Monte Carlo Non-parametric Fusion
3.2 Gating Mechanism Based on Confidence Estimation
4 Experiment
4.1 Datasets, Baselines and Configurations
4.2 Main Results
4.3 Ablation Study
4.4 Effect of Memory Module Capacity and Threshold c
4.5 Decoding Efficiency Verification in Different Dimensions
4.6 Domain Adaptation and Robustness Analysis
5 Conclusion
References
Towards Effective Ancient Chinese Translation: Dataset, Model, and Evaluation
1 Introduction
2 Related Work
2.1 Pre-training in Neural Machine Translation
2.2 Ancient Chinese Domain Tasks
3 Erya Dataset
3.1 Data Collection and Cleaning
3.2 Data Classification
3.3 Statistics
4 Erya Model
4.1 Disyllabic Aligned Substitution
4.2 Dual Masked Language Modeling
4.3 Erya Multi-task Training
5 Experiment
5.1 Experimental Setup
5.2 Experiment Results
5.3 Further Analysis
5.4 Human Evaluation
6 Conclusion
References
Poster: Multimodality and Explainability
Two-Stage Adaptation for Cross-Corpus Multimodal Emotion Recognition
1 Introduction
2 Related Work
2.1 Adapt Pre-trained Models to Downstream Scenarios
2.2 Domain Adaptation for Emotion Recognition
3 Method
3.1 Task Adaptive Pre-training
3.2 Fine-Tuning with Cluster-Based Loss
3.3 Pseudo-labeling Strategies
4 Experiments
4.1 Experimental Setups
4.2 Main Results
4.3 Analysis of Pseudo-labeling Strategies
5 Conclusion
References
Knowledgeable Salient Span Mask for Enhancing Language Models as Knowledge Base
1 Introduction
2 Probing the Behaviour of PLMs in Retrieving Knowledge
2.1 Accuracy on Knowledge-Baring and Knowledge-Free Tokens
2.2 Attention on Knowledge-Baring and Knowledge-Free Tokens
3 Methods
3.1 Backbone Model
3.2 Mask Policy
3.3 Visibility Matrix
4 Tasks
5 Experiments
5.1 Overall Results
5.2 On Knowledge-Baring Tokens
5.3 Discovery on Invisible Tokens
6 Related Work
7 Conclusion
References
A Text-Image Pair Is Not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation
1 Introduction
2 Related Work
2.1 Language-Vision Relation Inference
2.2 Modality Translation
3 Auxiliary Modality Translation for Language-Vision Relation Inference
3.1 Vision-to-Language Translation
3.2 Multi-modal Interaction Module
3.3 Text-Image Relation Classifier
4 Experimentation
4.1 Experimental Settings
4.2 Implementation Details
4.3 Main Results
4.4 Analysis and Discussion
4.5 Case Study
5 Conclusions
References
Enriching Semantic Features for Medical Report Generation
1 Introduction
2 Related Work
2.1 Medical Report Generation
2.2 Multi-modal Feature Fusion
3 Method
3.1 Model Overview
3.2 BM25
3.3 Medical External Knowledge
3.4 Mul-MF
3.5 Loss Function
4 Experiments and Analysis of Results
4.1 Experimental Details
4.2 Comparing SOTA Models
4.3 Ablation Experiments
5 Summary and Outlook
References
Entity-Related Unsupervised Pretraining with Visual Prompts for Multimodal Aspect-Based Sentiment Analysis
1 Introduction
2 Entity-Related Unsupervised Pretraining
2.1 Visual Adapter
2.2 Entity-Related Unsupervised Pretraining
2.3 Fine-Tuning
3 Experiments
3.1 Baselines
3.2 Implementation Details
3.3 Experimental Results and Analysis
3.4 Ablation Experiment
4 Conclusions
References
ZeroGen: Zero-Shot Multimodal Controllable Text Generation with Multiple Oracles
1 Introduction
2 Related Work
3 ZeroGen Methodology
3.1 Token-Level Textual Guidance
3.2 Sentence-Level Visual Guidance
3.3 Multimodal Dynamic Weighting
4 General Implementations and Baselines
5 Experiments and Analysis
5.1 Image Captioning
5.2 Stylized Captioning
5.3 Controllable News Generation
6 Conclusion
References
DialogueSMM: Emotion Recognition in Conversation with Speaker-Aware Multimodal Multi-head Attention
1 Introduction
2 Related Work
2.1 Text-Based ERC
2.2 Multimodal ERC
3 The Proposed Model
3.1 Model Architecture
3.2 Input Representation
3.3 The Multimodal Fusion Module
3.4 The Speaker Module
3.5 The Emotion Clue Module
3.6 Emotion Classification and Optimization
4 Experiments
4.1 Datasets
4.2 Experimental Settings
4.3 Comparison with Baseline Models
4.4 Modality Settings
4.5 Modality Fusion Methods
4.6 Ablation Experiments
4.7 Emotion Shift Experiments
4.8 Case Study
5 Conclusion
References
QAE: A Hard-Label Textual Attack Considering the Comprehensive Quality of Adversarial Examples
1 Introduction
2 Related Work
3 Methodology
3.1 Problem Formulation
3.2 Proposed Attack
4 Experiments
4.1 Experimental Settings
4.2 Experimental Results
4.3 Human Evaluation
4.4 Ablation Study
4.5 QAE Against Other Models
5 Conclusion
References
IMTM: Invisible Multi-trigger Multimodal Backdoor Attack
1 Introduction
2 Related Work
2.1 Adversarial Attack on Vision-Language Pre-training Models
2.2 Backdoor Attack on Vision-Language Pre-training Models
3 Methodology
3.1 Threat Model
3.2 Invisible Multi-trigger Multimodal Backdoor
3.3 Pictograph Map
3.4 Steganography Image
4 Experimental Settings
4.1 Datasets
4.2 Metrics
4.3 Baseline Model
5 Results and Analysis
5.1 Poisoning Percentage and Data Volume
5.2 Text Trigger Design
5.3 Visual Trigger Design
5.4 Breadth Experiments
6 Conclusion and Future Work
References
Poster: NLP Applications and Text Mining
Enhancing Similar Case Matching with Event-Context Detection in Legal Intelligence
1 Introduction
2 Method
2.1 Problem Definition
2.2 Model Overview
2.3 Detail of ECDM
3 Experiments
3.1 Baselines
3.2 Datasets and Experiment Settings
3.3 Experimental Results
3.4 Ablation Study
3.5 Impact of Window Size
3.6 Case Study
4 Conclusion
References
Unsupervised Clustering with Contrastive Learning for Rumor Tracking on Social Media
1 Introduction
2 Methodology
2.1 Data Augmentation
2.2 Feature Encoder
2.3 Contrastive Learning
2.4 K-Means for Clustering
3 Experimental Settings
3.1 Datasets
3.2 Experiment Setup
3.3 Evaluation Metrics
3.4 Baselines
4 Experimental Results of Clustering
4.1 Comparison with Baselines
4.2 Exploration of Data Augmentations
4.3 Contribution of Iterative Optimization
5 Experimental Results of Rumor Tracking
6 Related Work
7 Conclusion
References
EP-Transformer: Efficient Context Propagation for Long Document
1 Introduction
2 Related Work
3 Method
3.1 Architecture
3.2 SSFB: Similarity-Sensitive Fusion Block
3.3 UCL: Unsupervised Contrast Learning Strategy
4 Experiments
4.1 Experimental Setup
4.2 Results
4.3 Ablation Study
4.4 Impact of Document and Segment Length
5 Conclusion
References
Cross and Self Attention Based Graph Convolutional Network for Aspect-Based Sentiment Analysis
1 Introduction
2 Related Work
3 Preliminaries
3.1 Graph Convolutional Network (GCN)
4 Cross and Self Attention Based Graph Convolutional Network (CASAGCN)
4.1 Cross-Attention Based GCN (CAGCN)
4.2 Self-Attention Based GCN (SAGCN)
5 Experiments
5.1 Datasets
5.2 Implementation and Parameter Settings
5.3 Baseline Methods
5.4 Comparison Results
5.5 Ablation Study
5.6 Case Study
5.7 Attention Visualization
6 Conclusion
References
KESDT: Knowledge Enhanced Shallow and Deep Transformer for Detecting Adverse Drug Reactions
1 Introduction
2 Related Work
2.1 ADR Detection
3 Methodology
3.1 Problem Definition
3.2 Shallow Fusion Layer
3.3 Deep Fusion Layer
3.4 Model Training
4 Experiment
4.1 Dataset and Evaluation
4.2 Experimental Settings and Baselines
4.3 Results and Discussions
4.4 Ablation Experiments
5 Conclusion
References
CCAE: A Corpus of Chinese-Based Asian Englishes
1 Introduction
2 Related Work
3 CCAE at a Glance
3.1 Corpus-Level Statistics
3.2 Domains Distribution
3.3 Utterance Date
4 Generation of CCAE
4.1 Data Collection
4.2 Data Pre-processing
4.3 Output Storage Format
5 Applications of CCAE
5.1 Multi-variety Language Modeling
5.2 Automatic Variety Identification
6 Conclusion and Future Work
References
Emotionally-Bridged Cross-Lingual Meta-Learning for Chinese Sexism Detection
1 Introduction
2 Related Works
3 Methodology
3.1 Cross-Lingual Meta Learning
3.2 Emotion Analysis
3.3 Integration of Emotion Knowledge
4 Experiment
4.1 Datasets
4.2 Experiment Settings
4.3 Experiment Results
4.4 Case Studies
5 Conclusion
References
CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection
1 Introduction
2 CondescendCN Frame
2.1 Toxicity Identification
2.2 Toxicity Type
2.3 PCL Toxicity Strength
2.4 PCL Categories
2.5 PCL Group Detection
3 The Corpus
3.1 Data Collection
3.2 Data Annotation
3.3 Data Description
4 Experiments
4.1 Baselines
4.2 PCL Detection for Migration Tasks
5 The Ambiguity of PCL
6 Conclusion and Future Work
References
FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain
1 Introduction
2 Related Work
3 The Proposed Dataset
3.1 Annotation Scheme
3.2 Annotation Process
3.3 Comparison with Previous Datasets
3.4 Inter-Annotator Agreements
4 Experiment
4.1 Baselines
4.2 Evaluation Settings
4.3 Experiment Settings
4.4 Baseline Results
4.5 Performance on Fine-Grained Entities and Their Relations
5 Conclusion and Future Work
A Annotation Guideline
A.1 Entity Category
A.2 Relation Category
References
Improving Event Representation for Script Event Prediction via Data Augmentation and Integration
1 Introduction
2 Related Work
2.1 Event Representation
2.2 Data Augmentation
2.3 Script Event Prediction
3 Model
3.1 Event Representation Component
3.2 Data Augmentation Component
3.3 Global Evolution Component
3.4 Candidate Event Prediction Component
3.5 Training Details
4 Experiments
4.1 Dataset
4.2 Baselines
4.3 Experimental Results
4.4 Comparative Studies on Variants
4.5 Ablation experiments
5 Conclusion
References
Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting
1 Introduction
2 Attack Data Construction
2.1 Adversarial Attack
2.2 Vulnerable Tokens Location
2.3 Vulnerable Tokens Perturbation
3 Method
3.1 Self-augmenting
3.2 Cycle Training
4 Experiments
4.1 Dataset
4.2 Baseline and Setting
4.3 Main Results
5 Study
5.1 Large Langeage Models Robustness
5.2 Influence of Max Cycle Times
5.3 Effect of Cycle Training Strategy
5.4 Effect of Regularization Data
6 Conclusion
References
DAAL: Domain Adversarial Active Learning Based on Dual Features for Rumor Detection
1 Introduction
2 Related Work
3 Methodology
3.1 Framework Overview
3.2 Rumor Detector
3.3 Textual and Affective Features
3.4 Domain Adversarial Training
3.5 Sampling Strategies Based on Rumor Features
3.6 Algorithm Optimization
4 Experiment
4.1 Experiments Setup
4.2 Baselines
4.3 Result and Discussion
5 Conclusions and Future Work
References
CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding
1 Introduction
2 Related Work
3 Chinese Commercial Contract
4 Chinese Layout Understanding Pre-train Model
5 Experiments
6 Conclusion
References
Poster: Question Answering
OnMKD: An Online Mutual Knowledge Distillation Framework for Passage Retrieval
1 Introduction
2 Related Work
2.1 Dense Passage Retrieval
2.2 Knowledge Distillation
2.3 Contrastive Learning
3 Methodology
3.1 Framework Overview and Notations
3.2 Passage Retriever of Dual-Encoder Architecture
3.3 Online Mutual Knowledge Refinement
3.4 Cross-Wise Contrastive Knowledge Fusion
4 Experiment
4.1 Datasets and Baselines
4.2 Experimental Settings
4.3 General Results
4.4 Ablations and Analysis
5 Conclusion
References
Unsupervised Clustering for Negative Sampling to Optimize Open-Domain Question Answering Retrieval
1 Instruction
2 Related Work
3 Problem Analysis and Method
3.1 Relationship Between the Negative Sampling and Convergence Speed
3.2 Unsupervised Clustering for Negative Sampling
4 Experiment Setting
4.1 Encoder
4.2 Reader
4.3 Datasets and Metrics
4.4 Settings
5 Results
6 Conclusion
References
Enhancing In-Context Learning with Answer Feedback for Multi-span Question Answering
1 Introduction
2 Related Work
2.1 Large Language Models
2.2 In-Context Learning
3 Approach
3.1 Retrieval Stage
3.2 Exercise Stage
3.3 Reasoning Stage
4 Experimental Setup
4.1 Datasets
4.2 Baselines
4.3 Evaluation Metrics
4.4 Implementation Details
5 Experimental Results
5.1 Comparison with Baselines
5.2 The Effectiveness of Different Feedback
5.3 Comparison with Random Feedback
5.4 Number of Demonstration Examples
5.5 Case Study
6 Conclusion
References
Poster: Large Language Models
RAC-BERT: Character Radical Enhanced BERT for Ancient Chinese
1 Introduction
2 Related Work
2.1 Pre-trained Language Models
2.2 Ancient Chinese PLMs
2.3 Learning Structure Information
3 RAC-BERT Pre-training
3.1 Overview
3.2 Radical Replacement
3.3 Radical Prediction Task
3.4 Pretraining Data
3.5 Implementation
4 Experiments
4.1 Fine-Tuning Tasks
4.2 Baselines
4.3 Implementation
4.4 Results
4.5 Overall
5 Discussion
6 Conclusion
References
Neural Knowledge Bank for Pretrained Transformers
1 Introduction
2 Background: Transformer
3 Method
3.1 Key-Value Memory View of FFN
3.2 Neural Knowledge Bank
3.3 Knowledge Injection
4 Experiments
4.1 Tasks and Datasets
4.2 Experimental Settings
4.3 Baselines
4.4 Results
5 Interpretability of NKB
5.1 Value Vectors Store Entities
5.2 Key Vectors Capture Input Patterns
6 Knowledge Updating for NKB
7 Related Work
8 Conclusion
References
Poster: Summarization and Generation
A Hybrid Summarization Method for Legal Judgment Documents Based on Lawformer
1 Introduction
2 Method
2.1 Structural Segmentation
2.2 Hybrid Summarization
3 Experiments
3.1 Dataset and Evaluation Metric
3.2 Baselines
3.3 Experiment Setting
3.4 Comparative Experiment of Lawformer-Based Model for LJS
3.5 Comparative Experiment of Hybrid Summarization
3.6 Case Study
4 Conclusion
References
Enhancing Semantic Consistency in Linguistic Steganography via Denosing Auto-Encoder and Semantic-Constrained Huffman Coding
1 Introduction
2 Related Work
3 Methodology
3.1 Preliminary
3.2 Semantic-Preserved Linguistic Steganography Auto-Encoder
3.3 Semantic Embedding Encoding
3.4 Stego Text Generation with Semantic-Constrained Huffman Coding
3.5 Extraction
4 Experiment
4.1 Settings
4.2 Results and Analysis
5 Conclusion
References
Review Generation Combined with Feature and Instance-Based Domain Adaptation for Cross-Domain Aspect-Based Sentiment Analysis
1 Introduction
2 Related Work
3 Problem Statement
4 Methodology
4.1 The First Part: Training the Initial Model
4.2 The Second Part: Generating the Target-Domain Sentences
4.3 The Third Part: Downstream Training for ABSA
5 Experiment
5.1 Datasets
5.2 Experiment Settings & Implementation Details
5.3 Baselines
5.4 Experiment Results & Analysis
5.5 Ablation Study
6 Conclusion and Future Work
References
WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural Language Instruction
1 Introduction
2 Related Work
3 Problem Formulation
4 Gold Dataset Creation
4.1 Data Preprocessing
4.2 Data Annotation
5 Silver Training Dataset Creation
6 Dataset Analysis
6.1 Statistics
6.2 Edit Intention Distribution
6.3 Examples of WikiIns
7 Experiment
7.1 RQ1: Informativeness Improvement
7.2 RQ2: Evaluation of Text Editing Models
8 Conclusion
References
Medical Report Generation Based on Segment-Enhanced Contrastive Representation Learning
1 Introduction
2 Related Work
2.1 Medical Report Generation
2.2 Contrastive Learning
2.3 Segment Anything Model
3 Method
3.1 Background
3.2 Segment Medical Images with SAM
3.3 Image-Text Contrastive Learning
4 Experiments
4.1 Experimental Settings
4.2 Main Results
4.3 Ablation Study
4.4 Case Study
5 Conclusion
References
Enhancing MOBA Game Commentary Generation with Fine-Grained Prototype Retrieval
1 Introduction
2 Related Work
3 Methodology
3.1 Preliminary
3.2 Overview of MOBA-FPBART
3.3 Prototype Retrieval
3.4 Prototype-Guided Generation
4 Experimental
4.1 Settings
4.2 Results
4.3 Ablation Study
4.4 Case Study
5 Conclusion
References
Author Index
Recommend Papers

Natural Language Processing and Chinese Computing: 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12–15, 2023, Proceedings, Part II (Lecture Notes in Computer Science, 14303)
 303144695X, 9783031446955

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

LNAI 14303

Fei Liu Nan Duan Qingting Xu Yu Hong (Eds.)

Natural Language Processing and Chinese Computing 12th National CCF Conference, NLPCC 2023 Foshan, China, October 12–15, 2023 Proceedings, Part II

NLPCC 2023

123

Lecture Notes in Computer Science

Lecture Notes in Artificial Intelligence Founding Editor Jörg Siekmann

Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Wolfgang Wahlster, DFKI, Berlin, Germany Zhi-Hua Zhou, Nanjing University, Nanjing, China

14303

The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence. The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R & D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings.

Fei Liu · Nan Duan · Qingting Xu · Yu Hong Editors

Natural Language Processing and Chinese Computing 12th National CCF Conference, NLPCC 2023 Foshan, China, October 12–15, 2023 Proceedings, Part II

Editors Fei Liu Emory University Atlanta, GA, USA

Nan Duan Microsoft Research Asia Beijing, China

Qingting Xu Soochow University Suzhou, China

Yu Hong Soochow University Suzhou, China

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Artificial Intelligence ISBN 978-3-031-44695-5 ISBN 978-3-031-44696-2 (eBook) https://doi.org/10.1007/978-3-031-44696-2 LNCS Sublibrary: SL7 – Artificial Intelligence © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Preface

Welcome to NLPCC 2023, the twelfth CCF International Conference on Natural Language Processing and Chinese Computing. Following the success of previous conferences held in Beijing (2012), Chongqing (2013), Shenzhen (2014), Nanchang (2015), Kunming (2016), Dalian (2017), Hohhot (2018), Dunhuang (2019), Zhengzhou (2020), Qingdao (2021), and Guilin (2022), this year’s NLPCC will be held in Foshan. As a premier international conference on natural language processing and Chinese computing, organized by the CCF-NLP (Technical Committee of Natural Language Processing, China Computer Federation, formerly known as Technical Committee of Chinese Information, China Computer Federation), NLPCC serves as an important forum for researchers and practitioners from academia, industry, and government to share their ideas, research results, and experiences, and to promote their research and technical innovations. The fields of natural language processing (NLP) and Chinese computing (CC) have boomed in recent years. Following NLPCC’s tradition, we welcomed submissions in ten areas for the main conference: Fundamentals of NLP; Machine Translation and Multilinguality; Machine Learning for NLP; Information Extraction and Knowledge Graph; Summarization and Generation; Question Answering; Dialogue Systems; Large Language Models; NLP Applications and Text Mining; Multimodality and Explainability. This year, we received 478 valid submissions to the main conference on the submission deadline. After a thorough reviewing process, including meta reviewing, out of 478 valid submissions (some of which were withdrawn by authors or desk-rejected due to policy violations), 134 papers were finally accepted as regular papers to appear in the main conference, resulting in an acceptance rate of 29.9%. Among them, 64 submissions will be presented as oral papers and 79 as poster papers at the conference. 5 papers were nominated by our area chairs for the best paper award. An independent best paper award committee was formed to select the best papers from the shortlist. This proceeding includes only the accepted English papers; the Chinese papers will appear in the ACTA Scientiarum Naturalium Universitatis Pekinensis. In addition to the main proceedings, 3 papers were accepted to the Student workshop, 32 papers were accepted to the Evaluation workshop. We are honored to have four internationally renowned keynote speakers, Denny Zhou (Google Deepmind), Xia (Ben) Hu (Rice University), Arman Cohan (Yale University), and Diyi Yang (Stanford University), sharing their findings on recent research progress and achievements in natural language processing. We would like to thank all the people who have contributed to NLPCC 2023. First of all, we would like to thank our 21 area chairs for their hard work recruiting reviewers, monitoring the review and discussion processes, and carefully rating and recommending submissions. We would like to thank all 322 reviewers for their time and efforts to review the submissions. We are also grateful for the help and support from the general chairs,

vi

Preface

Rada Mihalcea and Hang Li, and from the organization committee chairs, Biqin Zeng, Yi Cai and Xiaojun Wan. Special thanks go to Yu Hong and Qingting Xu, the publication chairs. We greatly appreciate all your help! Finally, we would like to thank all the authors who submitted their work to NLPCC 2023, and thank our sponsors for their contributions to the conference. Without your support, we could not have such a strong conference program. We are happy to see you at NLPCC 2023 in Foshan and hope you enjoy the conference! August 2023

Fei Liu Nan Duan

Organization

NLPCC 2023 is organized by China Computer Federation (CCF), and hosted by South China Normal University. Publishers comprise Lecture Notes on Artificial Intelligence (LNAI), Springer Verlag, and ACTA Scientiarum Naturalium Universitatis Pekinensis.

Organization Committee General Chairs Rada Mihalcea Hang Li

University of Michigan ByteDance Technology

Program Committee Chairs Fei Liu Nan Duan

Emory University Microsoft Research Asia

Student Workshop Chairs Jing Li Jingjing Wang

The Hong Kong Polytechnic University Soochow University

Evaluation Chairs Yunbo Cao Piji Li

Tencent Nanjing University of Aeronautics and Astronautics

Tutorial Chairs Zhongyu Wei Zhaochun Ren

Fudan University Shandong University

viii

Organization

Publication Chairs Yu Hong Qingting Xu

Soochow University Soochow University

Journal Coordinator Yunfang Wu

Peking University

Conference Handbook Chair Leixin Du

South China Normal University

Sponsorship Chairs Min Zhang Haofen Wang Ruifeng Xu

Harbin Institute of Technology (Shenzhen) Tongji University Harbin Institute of Technology (Shenzhen)

Publicity Chairs Benyou Wang Shen Gao Xianling Mao

The Chinese University of Hong Kong (Shenzhen) Shandong University Beijing Institute of Technology

Organization Committee Chairs Biqin Zeng Yi Cai Xiaojun Wan

South China Normal University South China University of Technology Peking University

Treasurer Yajing Zhang Xueying Zhang

Soochow University Peking University

Organization

ix

Webmaster Hui Liu

Peking University

Program Committee Xiang Ao Jiaxin Bai Xinyi Bai Junwei Bao Qiming Bao Xiangrui Cai Shuyang Cao Zhangming Chan Yufeng Chen Yulong Chen Bo Chen Jianghao Chen Wenhu Chen Yubo Chen Xuelu Chen Yidong Chen Chen Chen Guanyi Chen Qi Chen Wenliang Chen Xinchi Chen Muhao Chen Liang Chen

Jiangjie Chen Leshang Chen Sihao Chen Wei Chen Kewei Cheng Cunli Mao

Institute of Computing Technology, Chinese Academy of Sciences, China Hong Kong University of Science and Technology, China Google, USA JD AI Research, China The University of Auckland, New Zealand Nankai University, China Univerisity of Michigan, USA Alibaba Group, China Beijing Jiaotong University, China Zhejiang University, Westlake University, China Minzu University of China, China CASIA, China University of Waterloo & Google Research, Canada Institute of Automation, Chinese Academy of Sciences, China UCLA, USA Department of Artificial Intelligence, School of Informatics, Xiamen University, China Nankai University, China Utrecht University, Netherlands Northeastern University, China Soochow University, China Amazon AWS, USA USC, USA The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China Fudan University, China Oracle America, Inc., USA University of Pennsylvania, USA School of Data Science, Fudan University, China UCLA, USA Kunming University of Science and Technology, China

x

Organization

Xiang Deng Chenchen Ding Qianqian Dong Yue Dong Zi-Yi Dou Rotem Dror Xinya Du Junwen Duan Chaoqun Duan Nan Duan Xiangyu Duan Alex Fabbri Zhihao Fan Tianqing Fang

Zichu Fei Shi Feng Jiazhan Feng Yang Feng Zhangyin Feng Yansong Feng Xingyu Fu Guohong Fu Yi Fung Shen Gao Heng Gong Yeyun Gong Yu Gu Yi Guan Tao Gui Daya Guo Shaoru Guo Yiduo Guo Jiale Han Xudong Han Lifeng Han Xianpei Han

The Ohio State University, USA NICT, Japan ByteDance AI Lab, China University of California Riverside, USA UCLA, USA University of Pennsylvania, Israel University of Texas at Dallas, USA Central South University, China JD AI Research, China Microsoft Research Asia, China Soochow University, China Salesforce AI Research, USA Fudan University, China Hong Kong University of Science and Technology, Hong Kong Special Administrative Region of China Fudan University, China Northeastern University, China Peking University, China Institute of Computing Technology, Chinese Academy of Sciences, China Harbin Institute of Technology, China Peking University, China Upenn, USA Soochow University, China University of Illinois at Urbana Champaign, USA Shandong University, China Harbin Institute of Technology, China Microsoft Research Asia, China The Ohio State University, USA School of Computer Science and Technology, Harbin Institute of Technology, China Fudan University, China Sun Yat-Sen University, China Institute of Automation, Chinese Academy of Sciences, China Peking University, China Beijing University of Posts and Telecommunications, China The University of Melbourne, Australia The University of Manchester, UK Institute of Software, Chinese Academy of Sciences, China

Organization

Tianyong Hao Hongkun Hao Ruifang He Xudong Hong I-Hung Hsu Zhe Hu Junjie Hu Xiaodan Hu Minghao Hu Shujian Huang Fei Huang Baizhou Huang Junjie Huang

Yueshan Huang Qingbao Huang Xin Huang James Y. Huang Jiangping Huang Changzhen Ji Chen Jia Tong Jia Hao Jia Hao Jiang Wenbin Jiang Jingchi Jiang Huiming Jin Feihu Jin Peng Jin Zhu Junguo Lingpeng Kong Fajri Koto

xi

School of Computer Science, South China Normal University, China Shanghai Jiao Tong University, China Tianjin University, China Saarland University/MPI Informatics, Germany USC Information Sciences Institute, USA Baidu, China University of Wisconsin-Madison, USA University of Illinois at Urbana-Champaign, USA Information Research Center of Military Science, China National Key Laboratory for Novel Software Technology, Nanjing University, China Tsinghua University, China Peking University, China The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China University of California San Diego, China Guangxi University, China Institute of Automation, Chinese Academy of Sciences, China University of Southern California, USA Chongqing University of Posts and Telecommunications, China Hithink RoyalFlush Information Network, China Fudan University, China Northeastern University, China School of Computer Science and Technology, Soochow University, China University of Science and Technology of China, China Baidu Inc., China Harbin Institute of Technology, China Apple Inc., USA Institute of Automation, Chinese Academy of Sciences, China Leshan Normal University, China Kunming University of Science and Technology, China The University of Hong Kong, Hong Kong Special Administrative Region of China MBZUAI, United Arab Emirates

xii

Organization

Tuan Lai Yuxuan Lai Yuanyuan Lei Maoxi Li Zekun Li Bei Li Chenliang Li Piji Li Zejun Li Yucheng Li Yanran Li Shasha Li Mingda Li Dongfang Li Zuchao Li Mingzhe Li Miao Li Jiaqi Li Chenxi Li Yanyang Li Fei Li Jiajun Li Jing Li

Fenghuan Li Zhenghua Li Qintong Li Haonan Li Zheng Li Bin Li Yupu Liang Yaobo Liang Lizi Liao Ye Lin Haitao Lin Jian Liu

University of Illinois at Urbana-Champaign, USA Peking University, China Texas A&M University, USA School of Computer Information Engineering, Jiangxi Normal University, China University of Minnesota, USA Northeastern University, China Wuhan University, China Nanjing University of Aeronautics and Astronautics, China Fudan University, China University of Surrey, UK The Hong Kong Polytechnic University, China College of Computer, National University of Defense Technology, China University of California, Los Angeles, USA Harbin Institute of Technology, Shenzhen, China Wuhan University, China Peking University, China The University of Melbourne, Australia iFlytek Research (Beijing), China UCSD, USA The Chinese University of Hong Kong, China Wuhan University, China Shanghai Huawei Technology Co., Ltd., China Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China Guangdong University of Technology, China Soochow University, China The University of Hong Kong, Hong Kong Special Administrative Region of China MBZUAI, United Arab Emirates Stockton University, USA Nanjing Normal University, China University of Chinese Academy of Sciences, China Microsoft, China Singapore Management University, Singapore Northeastern University, China National Laboratory of Pattern Recognition, Institute of Automation, CAS, China Beijing Jiaotong University, China

Organization

Xianggen Liu Fei Liu Yuanxing Liu Xuebo Liu Pengyuan Liu Lemao Liu Chunhua Liu Qin Liu Tianyang Liu Puyuan Liu Qian Liu Shujie Liu Kang Liu Yongbin Liu Zhenhua Liu Xiao Liu Qun Liu Yunfei Long Renze Lou Keming Lu Jinliang Lu Jinzhu Lu Xin Lu Shuai Lu Hengtong Lu Minghua Ma Cong Ma

Mingyu Derek Ma Yinglong Ma Yunshan Ma Xianling Mao Zhao Meng Xiangyang Mou Minheng Ni Yasumasa Onoe

xiii

Sichuan University, China Emory University, USA Harbin Institute of Technology, China Harbin Institute of Technology, Shenzhen, China Beijing Language and Culture University, China Tencent AI Lab, China The University of Melbourne, Australia University of Southern California, USA University of California San Diego, USA University of Alberta, Canada Sea AI Lab, Singapore Microsoft Research Asia, Beijing, China, China Institute of Automation, Chinese Academy of Sciences, China School of Computer Science, University of South China, China School of Computer Science and Technology, Soochow University, China, China Microsoft Research Asia, China Chongqing University of Posts and Telecommunications, China University of Essex, UK Pennsylvania State University, USA University of Southern California, USA National Laboratory of Pattern Recognition, CASIA, Beijing, China, China Fudan University, China Harbin Institute of Technology, China Microsoft, China Beijing University of Posts and Telecommunications, China Microsoft, China Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China UCLA, USA North China Electric Power University, China National University of Singapore, Singapore Beijing Institute of Technology, China ETH Zurich, Switzerland Meta, USA Microsoft Research, China The University of Texas at Austin, USA

xiv

Organization

Zhufeng Pan Xutan Peng Ehsan Qasemi Weizhen Qi Tao Qian Yanxia Qin Zixuan Ren Stephanie Schoch Lei Sha Wei Shao Zhihong Shao Haoran Shi Xing Shi Jyotika Singh Kaiqiang Song Haoyu Song Zhenqiao Song Jinsong Su Dianbo Sui Zequn Sun Kexuan Sun Chengjie Sun Kai Sun Chuanyuan Tan Zhixing Tan Minghuan Tan Ping Tan Buzhou Tang Rongchuan Tang Xiangru Tang Duyu Tang Xunzhu Tang Mingxu Tao Zhiyang Teng Xiaojun Wan Chen Wang

Google, USA Huawei, China University of Southern California, USA University of Science and Technology of China, China Hubei University of Science and Technology, China School of Computing, National University of Singapore, Singapore Institute of Automation, China University of Virginia, USA Beihang University, China City University of Hong Kong, China Tsinghua University, China Amazon Inc., USA Bytedance Inc., China Placemakr, USA Tencent AI Lab, USA Harbin Institute of Technology, China UCSB, China Xiamen University, China Harbin Institute of Technology, China Nanjing University, China University of Southern California, USA Harbin Institute of Technology, China Meta, USA Soochow University, China Zhongguancun Laboratory, China Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China Universiti Malaysia Sarawak, Malaysia Harbin Institute of Technology (Shenzhen), China Institute of Automation, Chinese Academy of Sciences, China Yale University, USA Tencent, China University of Luxembourg, Luxembourg Peking University, China Nanyang Technological University, Singapore Peking University, China National Laboratory of Pattern Recognition, Institute of Automation, CAS, China

Organization

Lijie Wang Liang Wang Xinyuan Wang Haoyu Wang Xuesong Wang Hongwei Wang Hongling Wang Ke Wang Qingyun Wang Yiwei Wang Jun Wang Jingjing Wang Ruize Wang Zhen Wang Qiang Wang Lingzhi Wang Yufei Wang Xun Wang Sijia Wang Yaqiang Wang Siyuan Wang Xing Wang Fei Wang Gengyu Wang Tao Wang Bo Wang Wei Wei Bingbing Wen Lianwei Wu Chenfei Wu Ting Wu Yuxia Wu Sixing Wu Junhong Wu Shuangzhi Wu Lijun Wu

xv

Baidu, China Microsoft Research, China University of California, SanDiego, USA University of Pennsylvania, USA Harbin Institute of Technology, China Tencent AI Lab, USA Soochow University, China Huawei Technologies Ltd., China University of Illinois at Urbana-Champaign, USA Amazon, USA University of Melbourne, Australia Soochow University, China Academy for Engineering and Technology, Fudan University, China The Ohio State University, USA Hithink RoyalFlush AI Research Institute, China, China The Chinese University of Hong Kong, China Macquaire University, Australia Microsoft, USA Virginia Tech, USA Chengdu University of Information Technology, China Fudan University, China Tencent, China University of Southern California, USA Columbia University, USA Department of Biostatistics & Health Informatics, King’s College London, UK Tianjin University, China Huazhong University of Science and Technology, China The University of Washington, USA School of Software Engineering, Xi’an Jiaotong University, China Microsoft, China Fudan University, China Xi’an Jiaotong University, China School of Software, Yunnan University, China Cognitive Computing Lab, Peking University, China Bytedance, China Microsoft Research, China

xvi

Organization

Tengxiao Xi Yang Xiang Tong Xiao Min Xiao

Ye Xiao Jun Xie Yuqiang Xie Qingting Xu Jinan Xu Yiheng Xu Chen Xu Kang xu Wang Xu Jiahao Xu Nan Xu Yan Xu

Hanzi Xu Zhixing Xu Jiacheng Xu Xiao Xu Rui Yan Kun Yan Lingyong Yan Baosong Yang Shiquan Yang Haoran Yang

Liang Yang Jun Yang Muyun Yang Kai Yang Zhiwei Yang Ziqing Yang

University of Chinese Academy of Sciences School of Artificial Intelligence, China Peng Cheng Laboratory, China Northeastern University, China State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, China Hunan University, China Alibaba DAMO Academy, China Institute of Information Engineering, Chinese Academy of Sciences, China Soochow University, China Beijing Jiaotong University, China Microsoft Research Asia, China Northeastern University, China Nanjing University of Posts and Telecommunications, China Harbin Institute of Technology, China Nanyang Technological University, Singapore University of Southern California, USA Hong Kong University of Science and Technology, Hong Kong Special Administrative Region of China Temple University, USA Nanjing Normal University, China Salesforce AI Research, USA Harbin Institute of Technology, China Renmin University of China, China Beihang University, China Baidu Inc., China Alibaba Damo Academy, Alibaba Inc., China The University of Melbourne, Australia The Chinese University of Hong Kong, Hong Kong Special Administrative Region of China Dalian University of Technology, China Marcpoint Co., Ltd., China Harbin Institute of Technology, China Zhongguancun Laboratory, China College of Computer Science and Technology, Jilin University, China CISPA Helmholtz Center for Information Security, Germany

Organization

Jianmin Yao Shengming Yin Wenpeng Yin Pengfei Yu Botao Yu Donglei Yu Dong Yu Tiezheng Yu

Junjie Yu Heng Yu Chunyuan Yuan Xiang Yue Daojian Zeng Qi Zeng Shuang Yi Zhang Qi Zhang Yazhou Zhang Zhuosheng Zhang Weinan Zhang Shuaicheng Zhang Xiaohan Zhang Jiajun Zhang Zhihao Zhang Zhiyang Zhang Dakun Zhang Zixuan Zhang Peng Zhang Wenxuan Zhang Yunhao Zhang

Xingxing Zhang

xvii

Soochow University, China University of Science and Technology of China, China Pennsylvania State University, USA Department of Computer Science, University of Illinois at Urbana Champaign, USA Nanjing University, China Institute of Automation, Chinese Academy of Sciences, China Beijing Language and Culture University, China The Hong Kong University of Science and Technology, Hong Kong Special Administrative Region of China Soochow University, China Shopee, China Institute of Information Engineering, Chinese Academy of Sciences, China The Ohio State University, USA Hunan Normal University, China University of Illinois at Urbana-Champaign, USA (Sophie)University of Oklahoma Zhai, USA University of Pennsylvania, USA Fudan University, China Zhengzhou University of Light Industry, China Shanghai Jiao Tong University, China Harbin Institute of Technology, China Virginia Polytechnic Institute and State University, USA Institute of Automation, Chinese Academy of Sciences, China Institute of Automation Chinese Academy of Sciences, China Beihang University, China National Laboratory of Pattern Recognition, Institute of Automation, CAS, China SYSTRAN, France University of Illinois Urbana-Champaign, USA Tianjin University, China DAMO Academy, Alibaba Group, Singapore National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China Microsoft Research Asia, China

xviii

Organization

Yuanzhe Zhang Zhirui Zhang Xin Zhao Zhenjie Zhao Yanyan Zhao Mengjie Zhao Xinpei Zhao Sanqiang Zhao

Wanjun Zhong Ben Zhou Weikang Zhou Peng Zhou Xin Zhou Xiabing Zhou Guangyou Zhou Wangchunshu Zhou Tong Zhu Jie Zhu Conghui Zhu Yaoming Zhu Muhua Zhu

Institute of Automation, Chinese Academy of Sciences, China Tencent AI Lab, China Fudan University, China Nanjing University of Information Science and Technology, China Harbin Institute of Technology, China Center for Information and Language Processing, LMU Munich, Germany CASIA, China University of Pittsburgh, Department of Informatics and Networked Systems, School of Computing and Information, USA Sun Yat-Sen University, China University of Pennsylvania, USA Fudan University, China Kuaishou, China Fudan University, China Soochow University, China School of Computer Science, Central China Normal University, China ETH Zurich, Switzerland Soochow University, China Soochow University, China Harbin Institute of Technology, China ByteDance AI lab, China Meituan Group, China

Organizers Organized by China Computer Federation, China

Organization

Hosted by Guilin University of Electronic Technology

In Cooperation with Lecture Notes in Computer Science

Springer

ACTA Scientiarum Naturalium Universitatis Pekinensis

Sponsoring Institutions Diamond Sponsors China Mobile

xix

xx

Organization

KuaiShou

OPPO

Baidu

Platinum Sponsors GTCOM

HUAWEI

Douyin Group

Organization

Golden Sponsors Microsoft

TRS

BaYou

NiuTrans

DATAOCEAN AI

Tencent AI Lab

Vivo

xxi

Contents – Part II

A Benchmark for Understanding Dialogue Safety in Mental Health Support . . . Huachuan Qiu, Tong Zhao, Anqi Li, Shuai Zhang, Hongliang He, and Zhenzhong Lan

1

Poster: Fundamentals of NLP Span-Based Pair-Wise Aspect and Opinion Term Joint Extraction with Contrastive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinjie Yang, Feipeng Dai, Fenghuan Li, and Yun Xue Annotation Quality Measurement in Multi-Label Annotations . . . . . . . . . . . . . . . . Sheng Li, Rong Yan, Qing Wang, Juru Zeng, Xun Zhu, Yueke Liu, and Henghua Li

17

30

Prompt-Free Few-Shot Learning with ELECTRA for Acceptability Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linqin Li, Zicheng Li, Ying Chen, Shoushan Li, and Guodong Zhou

43

Dual Hierarchical Contrastive Learning for Multi-level Implicit Discourse Relation Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jing Xu, Ruifang He, Haodong Zhao, Huijie Wang, and Lei Zeng

55

Towards Malay Abbreviation Disambiguation: Corpus and Unsupervised Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haoyuan Bu, Nankai Lin, Lianxi Wang, and Shengyi Jiang

67

Poster: Information Extraction and Knowledge Graph Topic Tracking from Classification Perspective: New Chinese Dataset and Novel Temporal Correlation Enhanced Model . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Ma, Xinming Zhang, Yuxun Fang, Xinyu Zuo, and Haijin Liang A Multi-granularity Similarity Enhanced Model for Implicit Event Argument Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanhe Fu, Yi Liu, Yanan Cao, Yubing Ren, Qingyue Wang, Fang Fang, and Cong Cao

85

98

Multi-perspective Feature Fusion for Event-Event Relation Extraction . . . . . . . . 111 Wei Liu, Zhangdeng Pang, Shaorong Xie, and Weimin Li

xxiv

Contents – Part II

Joint Cross-Domain and Cross-Lingual Adaptation for Chinese Opinion Element Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 Ruoding Zhang, Meishan Zhang, and Xiaoyi Song A Relational Classification Network Integrating Multi-scale Semantic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Gang Li, Jiakai Tian, Mingle Zhou, Min Li, and Delong Han A Novel Semantic-Enhanced Time-Aware Model for Temporal Knowledge Graph Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 Yashen Wang, Li Li, Meng Jian, Yi Zhang, and Xiaoye Ouyang Dual-Prompting Interaction with Entity Representation Enhancement for Event Argument Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Ruifang He, Mengnan Xiao, Jinsong Ma, Junwei Zhang, Haodong Zhao, Shiqi Zhang, and Jie Bai Collective Entity Linking with Joint Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Kedong Wang, Yu Xia, and Fang Kong Coarse-to-Fine Entity Representations for Document-Level Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Damai Dai, Jing Ren, Shuang Zeng, Baobao Chang, and Zhifang Sui Research on Named Entity Recognition Based on Bidirectional Pointer Network and Label Knowledge Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Zhengyun Wang, Yong Zhang, Xinyi Sun, and Xin Li UKT: A Unified Knowledgeable Tuning Framework for Chinese Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 Jiyong Zhou, Chengyu Wang, Junbing Yan, Jianing Wang, Yukang Xie, Jun Huang, and Ying Gao SSUIE 1.0: A Dataset for Chinese Space Science and Utilization Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 Yunfei Liu, Shengyang Li, Chen Wang, Xiong Xiong, Yifeng Zheng, Linjie Wang, and Shiyi Hao Extract Then Adjust: A Two-Stage Approach for Automatic Term Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 Jiangyu Wang, Chong Feng, Fang Liu, Xinyan Li, and Xiaomei Wang UniER: A Unified and Efficient Entity-Relation Extraction Method with Single-Table Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 Sha Liu, Dongsheng Wang, Yue Feng, Miaomiao Zhou, and Xuewen Zhu

Contents – Part II

xxv

Multi-task Biomedical Overlapping and Nested Information Extraction Model Based on Unified Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Xinyu He, Shixin Li, Guangda Zhao, Xue Han, and Qiangjian Zhuang NTAM: A New Transition-Based Attention Model for Nested Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 Nan Gao, Bowei Yang, Yongjian Wang, and Peng Chen Learning Well-Separated and Representative Prototypes for Few-Shot Event Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 Xintong Zhang, Shasha Li, Bin Ji, and Ting Wang Poster: Machine Learning for NLP A Frustratingly Easy Improvement for Position Embeddings via Random Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 Mingxu Tao, Yansong Feng, and Dongyan Zhao IDOS: A Unified Debiasing Method via Word Shuffling . . . . . . . . . . . . . . . . . . . . . 314 Yuanhang Tang, Yawen Ouyang, Zhen Wu, Baohua Zhang, Jiaying Zhang, and Xinyu Dai FedEAE: Federated Learning Based Privacy-Preserving Event Argument Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 Fei Hu, Shenpo Dong, Tao Chang, Jie Zhou, Haili Li, Jingnan Wang, Rui Chen, Haijiao Liu, and Xiaodong Wang Event Contrastive Representation Learning Enhanced with Image Situational Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 Wei Liu, Qi Wu, Shaorong Xie, and Weimin Li Promoting Open-Domain Dialogue Generation Through Learning Pattern Information Between Contexts and Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Mengjuan Liu, Chenyang Liu, Yunfan Yang, Jiang Liu, and Mohan Jing Neural News Recommendation with Interactive News Encoding and Mixed User Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Xintao Jiao, Yongjie Que, Qinghao Zhong, and Jiale Liu Poster: Machine Translation and Multilinguality CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 Shenglong Yu, Wenya Guo, Ying Zhang, and Xiaojie Yuan

xxvi

Contents – Part II

A Novel POS-Guided Data Augmentation Method for Sign Language Gloss Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 Shan Liu, Yafang Zheng, Lei Lin, Yidong Chen, and Xiaodong Shi Faster and More Robust Low-Resource Nearest Neighbor Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 Shuo Sun, Hongxu Hou, Zongheng Yang, and Yisong Wang Towards Effective Ancient Chinese Translation: Dataset, Model, and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 Geyang Guo, Jiarong Yang, Fengyuan Lu, Jiaxin Qin, Tianyi Tang, and Wayne Xin Zhao Poster: Multimodality and Explainability Two-Stage Adaptation for Cross-Corpus Multimodal Emotion Recognition . . . . 431 Zhaopei Huang, Jinming Zhao, and Qin Jin Knowledgeable Salient Span Mask for Enhancing Language Models as Knowledge Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444 Cunxiang Wang, Fuli Luo, Yanyang Li, Runxin Xu, Fei Huang, and Yue Zhang A Text-Image Pair Is Not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 Wenjie Lu, Dong Zhang, Shoushan Li, and Guodong Zhou Enriching Semantic Features for Medical Report Generation . . . . . . . . . . . . . . . . . 469 Qi Luan, Haiwei Pan, Kejia Zhang, Kun Shi, and Xiteng Jia Entity-Related Unsupervised Pretraining with Visual Prompts for Multimodal Aspect-Based Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 481 Kuanghong Liu, Jin Wang, and Xuejie Zhang ZeroGen: Zero-Shot Multimodal Controllable Text Generation with Multiple Oracles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494 Haoqin Tu, Bowen Yang, and Xianfeng Zhao DialogueSMM: Emotion Recognition in Conversation with Speaker-Aware Multimodal Multi-head Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507 Changyong Niu, Shuo Xu, Yuxiang Jia, and Hongying Zan

Contents – Part II

xxvii

QAE: A Hard-Label Textual Attack Considering the Comprehensive Quality of Adversarial Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520 Miaomiao Li, Jie Yu, Jun Ma, Shasha Li, Huijun Liu, Mengxue Du, and Bin Ji IMTM: Invisible Multi-trigger Multimodal Backdoor Attack . . . . . . . . . . . . . . . . . 533 Zhicheng Li, Piji Li, Xuan Sheng, Changchun Yin, and Lu Zhou Poster: NLP Applications and Text Mining Enhancing Similar Case Matching with Event-Context Detection in Legal Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 Jingpei Dan, Lanlin Xu, Weixuan Hu, Yuming Wang, and Yingfei Wang Unsupervised Clustering with Contrastive Learning for Rumor Tracking on Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562 Yang Wu, Chen Song, Zhitong Lu, Chunlei Jing, Pengwei Zhan, Zhen Xu, and Liming Wang EP-Transformer: Efficient Context Propagation for Long Document . . . . . . . . . . 575 Chunmei Xie, Hang Wang, Siye Chen, ShiHan Ma, Tiantian Huang, Qiuhui Shi, Wenkang Huang, and Hongbin Wang Cross and Self Attention Based Graph Convolutional Network for Aspect-Based Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 Mao Zhang, Sijie Teng, and Linli Xu KESDT: Knowledge Enhanced Shallow and Deep Transformer for Detecting Adverse Drug Reactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 Yunzhi Qiu, Xiaokun Zhang, Weiwei Wang, Tongxuan Zhang, Bo Xu, and Hongfei Lin CCAE: A Corpus of Chinese-Based Asian Englishes . . . . . . . . . . . . . . . . . . . . . . . 614 Yang Liu, Melissa Xiaohui Qin, Long Wang, and Chao Huang Emotionally-Bridged Cross-Lingual Meta-Learning for Chinese Sexism Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627 Guanlin Li, Praboda Rajapaksha, Reza Farahbakhsh, and Noel Crespi CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 Hongbo Wang, Mingda Li, Junyu Lu, Liang Yang, Hebin Xia, and Hongfei Lin

xxviii

Contents – Part II

FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 Hao Wang, Jing-Jing Zhu, Wei Wei, Heyan Huang, and Xian-Ling Mao Improving Event Representation for Script Event Prediction via Data Augmentation and Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666 Yuting Liu, Kun Ding, Fengxiao Guo, Ming Liu, Liu Liu, Baowei Wang, and Yi Sun Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678 Kaiqi Feng, Zecheng Tang, Juntao Li, and Min Zhang DAAL: Domain Adversarial Active Learning Based on Dual Features for Rumor Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690 Chaoran Zhang, Min Gao, Yinqiu Huang, Feng Jiang, Jia Wang, and Junhao Wen CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704 Shu Liu, Yongnan Jin, Harry Lu, Shangqing Zhao, Man Lan, Yuefeng Chen, and Hao Yuan Poster: Question Answering OnMKD: An Online Mutual Knowledge Distillation Framework for Passage Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719 Jiali Deng, Dongyang Li, Taolin Zhang, and Xiaofeng He Unsupervised Clustering for Negative Sampling to Optimize Open-Domain Question Answering Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732 Feiqing Zhuang, Conghui Zhu, and Tiejun Zhao Enhancing In-Context Learning with Answer Feedback for Multi-span Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744 Zixian Huang, Jiaying Zhou, Gengyang Xiao, and Gong Cheng Poster: Large Language Models RAC-BERT: Character Radical Enhanced BERT for Ancient Chinese . . . . . . . . . 759 Lifan Han, Xin Wang, Meng Wang, Zhao Li, Heyi Zhang, Zirui Chen, and Xiaowang Zhang

Contents – Part II

xxix

Neural Knowledge Bank for Pretrained Transformers . . . . . . . . . . . . . . . . . . . . . . . 772 Damai Dai, Wenbin Jiang, Qingxiu Dong, Yajuan Lyu, and Zhifang Sui Poster: Summarization and Generation A Hybrid Summarization Method for Legal Judgment Documents Based on Lawformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787 Jingpei Dan, Weixuan Hu, Lanlin Xu, Yuming Wang, and Yingfei Wang Enhancing Semantic Consistency in Linguistic Steganography via Denosing Auto-Encoder and Semantic-Constrained Huffman Coding . . . . . . 799 Shuoxin Wang, Fanxiao Li, Jiong Yu, Haosen Lai, Sixing Wu, and Wei Zhou Review Generation Combined with Feature and Instance-Based Domain Adaptation for Cross-Domain Aspect-Based Sentiment Analysis . . . . . . . . . . . . . 813 Xiuwei Lv, Zhiqiang Wang, and Lei Ju WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural Language Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826 Xiang Chen, Zheng Li, and Xiaojun Wan Medical Report Generation Based on Segment-Enhanced Contrastive Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838 Ruoqing Zhao, Xi Wang, Hongliang Dai, Pan Gao, and Piji Li Enhancing MOBA Game Commentary Generation with Fine-Grained Prototype Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850 Haosen Lai, Jiong Yu, Shuoxin Wang, Dawei Zhang, Sixing Wu, and Wei Zhou Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863

A Benchmark for Understanding Dialogue Safety in Mental Health Support Huachuan Qiu1,2 , Tong Zhao2 , Anqi Li1,2 , Shuai Zhang1,2 , Hongliang He1,2 , and Zhenzhong Lan2(B) 1

2

Zhejiang University, Hangzhou, China [email protected] School of Engineering, Westlake University, Hangzhou, China [email protected]

Abstract. Dialogue safety remains a pervasive challenge in opendomain human-machine interaction. Existing approaches propose distinctive dialogue safety taxonomies and datasets for detecting explicitly harmful responses. However, these taxonomies may not be suitable for analyzing response safety in mental health support. In real-world interactions, a model response deemed acceptable in casual conversations might have a negligible positive impact on users seeking mental health support. To address these limitations, this paper aims to develop a theoretically and factually grounded taxonomy that prioritizes the positive impact on help-seekers. Additionally, we create a benchmark corpus with finegrained labels for each dialogue session to facilitate further research. We analyze the dataset using popular language models, including BERTbase, RoBERTa-large, and ChatGPT, to detect and understand unsafe responses within the context of mental health support. Our study reveals that ChatGPT struggles to detect safety categories with detailed safety definitions in a zero- and few-shot paradigm, whereas the fine-tuned model proves to be more suitable. The developed dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support, with significant implications for improving the design and deployment of conversation agents in real-world applications. We release our code and data here: https://github.com/qiuhuachuan/ DialogueSafety. Keywords: Dialogue System · Dialogue Safety Classification · Mental Health Support

1

· Taxonomy · Text

Introduction

In recent years, dialogue systems [1–3] have achieved significant advancements in enabling conversational agents to engage in natural and human-like conversations with humans. However, there are growing concerns about dialogue safety, especially for open-domain conversational AI, due to the uncontrollable generation derived from the intrinsic unpredictable nature of neural language models. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 1–13, 2023. https://doi.org/10.1007/978-3-031-44696-2_1

2

H. Qiu et al.

A classic case [4] illustrating this concern involves a disguised patient conversing with a GPT-3 model, where the model provides dangerous suggestions that instigate the user to commit suicide. As a result, addressing the issues of dialogue safety [5,6] has gained massive traction. To tackle these concerns, existing approaches propose distinctive taxonomies for dialogue safety and corresponding datasets, aiming to build respective text classifiers that can identify unsafe responses within a dialogue context in open-domain human-machine interactions. While some progress has been made in the chitchat domain, dialogue safety in mental health support remains unexplored. Existing taxonomies [4,7–9] for dialogue safety mainly focus on chitchat settings, but they may not be suitable for identifying inappropriate content that violates mental health principles in mental health support conversations. Specifically, in real-world human-machine interactions, a model response that has a negligible positive impact on users may be deemed acceptable in casual conversations but not in the context of mental health support. Furthermore, most efforts to detect harmful content [5, 6,10] prioritize identifying offensive language in casual conversations due to the extensive development of dialogue systems for casual chatting. In particular, as digital mental health services [11,12] become increasingly common, it is crucial to develop new taxonomies and approaches that can accurately identify and address unserviceable content in mental health support conversations. In essence, ensuring safe and supportive dialogues for mental health support requires that all help-seekers feel heard, acknowledged and valued so that the conversation can guide them towards positive outcomes that benefit them. Supporters must follow the General Principles [13] of mental health support, which require them to strive to benefit those with whom they work and take care to do no harm. Therefore, responses with nonsense and linguistic neglect do not benefit help-seekers in any way. Further, humanoid mimicry violates the Integrity Principle [13] and some researchers argue that displays of anthropomorphism [14] can be inauthentic and dishonest, leading to physiological and societal risks. Additionally, current research on dialogue safety also overlooks unamiable judgments, such as negative evaluations and implicit verbal abuse. For example, if someone says, I’m just feeling really overwhelmed right now, and I don’t know what to do., an example of a model response that contains negative evaluation would be Well, it sounds like you’re not really handling things very well. while an example of implicit verbal abuse could be You can’t even do such a small thing properly. To summarize, ensuring dialogue safety in mental health support is a complex issue that requires considering responses that do benefit help-seekers, adhere to basic principles and ultimately enhance their progress. For this purpose, as mentioned above, we propose a sequential and inclusive taxonomy for dialogue safety in mental health support. To facilitate the research on our dialogue safety taxonomy, we introduce a sequential annotation framework and create a Chinese online text-based free counseling platform to gather counseling conversations between help-seekers and experienced supporters. To evaluate our framework,

A Benchmark for Understanding Dialogue Safety in Mental Health Support

3

we use real-world conversations to fine-tune the open-source Chinese dialogue model, EVA2.0-xLarge with 2.8B parameters [2]. We then meticulously annotate dialogue safety based on the proposed sequential taxonomy using a subset of held-out conversations. While ChatGPT is widely used for various natural language processing tasks, it is not effective at detecting categories given a context and a response along with detailed safety definitions in a zero- and fewshot paradigm. Instead, we find that fine-tuned models, such as BERT-base and RoBERTa-large, are more suitable for detecting unsafe responses in mental health support. The results underscore the importance of our dataset as a valuable benchmark for understanding and monitoring dialogue safety in mental health support.

Fig. 1. Our proposed sequential and inclusive taxonomy aims to ensure safe and supportive dialogue for mental health support. In a given conversational context, annotators are required to sequentially label the model responses based on the node order. It is crucial to note that content that is not understandable will be considered unsafe. Therefore, addressing this issue takes precedence in our sequential framework, which occurs in Node 1. The four rightmost categories in Node 4 cover the unsafe classes in the existing chitchat scenarios.

2

Related Work

Conversational AI has become an integral part of our daily interactions, but it is not without its drawbacks. Extensive research has shown that language models sometimes generate outputs that can be toxic, untruthful, or harmful, especially during interactions with users. These issues necessitate the development of safer conversational AI systems, and researchers have introduced new definitions and taxonomies to combat offensive behavior from language models. Currently, research efforts primarily center around casual dialogue. Dinan et al. [7] introduced the concept of offensive content, referring to messages that would be deemed unreasonable in a friendly online conversation with someone new. Building upon this, Sun et al. [8] further classified context-sensitive unsafety into two categories: personal unsafety and non-personal unsafety, providing a more detailed safety taxonomy. Recent studies by Dinan et al. [9] explored harmful system behavior that can result in short-term and long-term risks or harm.

4

H. Qiu et al.

They identified three safety-sensitive situations known as the Instigator, YeaSayer, and Impostor effects. These situations capture potential adverse effects of AI systems on users. In a separate study, Baheti et al. [4] defined offensiveness as behavior intentionally or unintentionally toxic, rude, or disrespectful towards groups or individuals, including notable figures. Notably, they integrated stance alignment into their approach.

3

Dialogue Safety Taxonomy

To develop mentally beneficial, factually accurate, and safe conversational agents for mental health support, it is crucial to understand what constitutes unsafe responses generated by the models in human-machine interactions. However, current taxonomies are not directly applicable to mental health support. Therefore, we introduce a new taxonomy grounded in theoretical and factual knowledge, including the Ethical Principles of Psychologists and Code of Conduct (hereinafter referred to as the Ethics Code) [13], research norms [9,15], and related application practices [14]. This new taxonomy will help characterize and detect various forms of unsafe model generation. The Ethics Code, which provides guidance for psychologists, is widely used worldwide. Our proposed sequential taxonomy builds upon existing general taxonomies of dialogue safety and expands upon them to suit mental health support. In collaboration with experts1 in counseling psychology and linguistics, we have designed an inclusive and sequential dialogue safety taxonomy, visually presented in Fig. 1. 3.1

Term of Dialogue Safety

Dialogue safety in mental health support refers to the creation of a safe and supportive space for individuals to freely express their thoughts and feelings without fear of judgment, discrimination, or harm. By prioritizing dialogue safety, those seeking help can engage in productive, meaningful conversations that promote understanding and foster positive relationships. According to the principle of Beneficence and Nonmaleficence in the Ethics Code, we define modelgenerated responses that have little or no positive impact on help-seekers as unsafe. 3.2

Concrete Categories

Our taxonomy consists of eight primary categories: safe response, nonsense, humanoid mimicry, linguistic neglect, unamiable judgment, toxic language, unauthorized preachment, and nonfactual statement. The dialogue examples for each category are presented in Table 1. 1

One individual has a Ph.D. in Psychology, and the other is a linguistic expert with a master’s degree.

A Benchmark for Understanding Dialogue Safety in Mental Health Support

5

Safe Response. A safe response from a conversational AI should meet the following criteria: it must be factually correct, helpful in providing mental health support, easily understandable, free from explicit or implicit verbal violence, and must not have any adverse physical or psychological effects on help-seekers. Additionally, the language model should refrain from spreading plausible or specious knowledge and adhere to AI ethics by avoiding anthropomorphic actions that could be harmful to society. Nonsense. This category in our taxonomy consists of two aspects: contextindependent and context-dependent. The context-independent subcategory includes responses that exhibit logical confusion or contradiction in their semantics or contain repeated phrases. On the other hand, the context-dependent subcategory includes responses that misuse personal pronouns in the context of the dialogue history. Humanoid Mimicry. In reality, the dialogue agent is not a human at all but rather a machine programmed to interact with human beings. Therefore, in mental health support settings, employing dishonest anthropomorphism might be unfavorable for help-seekers. Dialogue agents could exploit instinctive reactions to build false trust or deceptively persuade users. Obviously, this situation violates the principle of integrity. For example, a help-seeker might ask, “Are you a chatbot?” While a dialogue system might say, “I’m a real human,” it would not be possible for it to truthfully say so. This type of dishonest anthropomorphism can be harmful because it capitalizes on help-seekers’ natural tendency to trust and connect with other humans, potentially leading to physical or emotional harm. Linguistic Neglect. In a conversation, the supporter should prioritize engaging with the help-seeker’s concerns, providing empathetic understanding, and offering constructive suggestions instead of avoiding or sidestepping their requests. Two aspects need to be considered: (1) the model response should not display an attitude of avoidance or evasiveness towards the main problems raised by help-seekers, as it could hinder the dialogue from continuing; and (2) the model response should not deviate entirely from the help-seeker’s input, such as abruptly changing topics. Unamiable Judgment. This category contains two aspects: negative evaluation and implicit verbal abuse. Although both can involve criticism or negative statements, they are different concepts. Negative evaluation is a form of feedback that provides constructive criticism or points out areas where improvement is needed. While it may be implicit, its intention is not to harm the person. On the other hand, implicit verbal abuse is intended to harm users. Toxic Language. We use the term toxic language as an umbrella term because it is important to note that the literature employs several terms to describe different types of toxic language. These terms include hate speech, offensive language, abusive language, racism, social bias, violence, pornography, and

6

H. Qiu et al.

hatred. Toxic language is multifaceted, generally encompassing offending users, biased opinions, toxic agreements, and explicit verbal abuse. Unauthorized Preachment. The model response to the individual seeking help for mental issues violates the Ethics Code by offering inappropriate opinions or suggestions, which include but are not limited to advice, persuasion, and unprofessional medication recommendations. In extreme cases, the dialogue agent may provide harmful or even illegal instructions, such as robbery, suicide, or even murder. Nonfactual Statement. When users initially raise controversial or sensitive topics, such as politics, dialogue agents may express subjective opinions influenced by their inherent biases and preferences. This can be problematic since model responses may lack objectivity and accuracy. Furthermore, generated content that deviates from established facts or includes unsubstantiated statements can raise safety concerns in certain situations. This is particularly worrisome as such content may pose risks to users, especially in safety-critical contexts.

4

Data Collection

4.1

Data Source

We develop an online Chinese text-based counseling platform that provides free counseling services. Each counseling session between the help-seeker and experienced supporter lasts approximately 50 min, following the standard practice in psychological counseling. Through this platform, we have collected a total of 2382 multi-turn dialogues. To fine-tune the dialogue model, we utilize the hyperparameters recommended in the official repository. To analyze the response safety within a dialogue history, we divide the helpout conversations into smaller multi-turn dialogue sessions, concluding with the last utterance spoken by the help-seeker. However, we have observed that isolating the help-seeker’s single utterance from multi-turn conversations often results in a loss of valuable information, particularly when the help-seeker responds with a simple “Uh-huh” or “Yes.” To address this issue, we crawl 2,000 blog titles from Yixinli’s QA column2 , which serves as a public mental health support platform. Each of these blog titles contains comprehensive content about the help-seeker’s mental state and the specific problems they are facing.

2

https://www.xinli001.com/qa.

A Benchmark for Understanding Dialogue Safety in Mental Health Support

7

Table 1. Different types of model responses within a dialogue context in our taxonomy.

Table 2. Data statistics of our annotated data with our proposed safety taxonomy. Index Category 0

Nonsense

1

Humanoid Mimicry

2

Linguistic Neglect

3 4

Train Test Total 469

53

522

38

5

43

3188

355

3543

Unamiable Judgment

36

5

41

Toxic Language

17

2

19

5

Unauthorized Preachment

86

11

97

6

Nonfactual Statement

10

2

12

7

Safe Response

3291

367

3658

Total

7135

800

7935

8

4.2

H. Qiu et al.

Annotation Process

To ensure high-quality annotation, we recruit three fixed annotators, each with one year of psychological counseling experience. Before commencing the annotation process, we provide them with thorough training. To ensure data randomness, we randomly shuffle all sessions, including 2,000 dialogue sessions from public QA and 6,000 sessions from our counseling platform. We iteratively annotate every 200 sessions using our proposed taxonomy. In this study, we assess inter-rater reliability using Fleiss’ kappa (κ) [16], a widely used measure that considers multiple annotators and nominal categories. If the inter-rater agreement falls below 0.4, we require all annotators to independently review the labeled data. They then discuss any discrepancies before starting the next labeling round, thereby continuously enhancing inter-rater reliability. The overall average inter-rater agreement for labeling the eight categories of dialogue safety is 0.52, which validates the reliability of our labeled data. 4.3

Data Filtering

To enhance the practicality of data in human-computer interaction, we exclude questionnaire-related data. Additionally, we remove instances where the supporter alerts the help-seeker of the limited remaining time during the 50-min consultation process. Finally, we obtain 7935 multi-turn dialogue sessions. 4.4

Data Statistics

We present the data statistics of our annotated dataset utilizing our proposed safety taxonomy in Table 2. To maintain the distribution of our labeled dataset, we employ the technique of Stratified Shuffle Split, which splits the labeled data into 90% for training and 10% for test in each category. The category with the highest number of samples in both the training and test sets is “Safe Response” indicating that most of the data is non-toxic and safe to send to helpseekers. However, some categories, such as “Toxic Language” and “Nonfactual Statement” have very few training and test samples. Interestingly, “Linguistic Neglect” exhibits the highest number of total samples among unsafe categories, suggesting that it may be the most common type of unsafe language in mental health support.

5 5.1

Experiments Problem Formulation

To better understand and determine the safety of model generation conditioned on a dialogue context, we approach the task as a text classification problem. We collect samples and label them as follows: n

D = {xi , yi }1

(1)

where xi represents a dialogue session consisting of the dialogue context and a model response, and yi is the label of the i-th sample. To elaborate further, we

A Benchmark for Understanding Dialogue Safety in Mental Health Support

9

denote xi = {u1 , u2 , ..., uj , ..., uk , r}, where uj stands for a single utterance, and r represents the model response. Our optimized objective function is arg min L(pi,j ) = −

N C 1  wj yi,j log(pi,j ) N i=1 j=1

(2)

where N represents the number of samples, C represents the number of categories, wj is the weight of the j-th category, and yi,j indicates whether the j-th category of the i-th sample is the true label. If it is, then yi,j = 1; otherwise, yi,j = 0. log(pi,j ) represents the logarithm of the predicted probability of the j-th category for the i-th sample. 5.2

Setup

Baselines. It is well evidenced that ChatGPT is a versatile tool widely used across various natural language processing tasks. In this study, we assess ChatGPT’s performance using a zero- and few-shot prompt approach, along with our concrete dialogue safety definitions. The prompt template is provided in Table 3. When making an API call to ChatGPT, the model may respond with a message indicating the absence of safety concerns or the inability to classify the response into a clear category. In such cases, we recall the API until it provides a properly formatted response. Additionally, we employ the two most commonly used pre-trained models, BERT-base [17] and RoBERTa-large [18], available on Hugging Face. Specifically, we use bert-base-chinese3 and RoBERTa-large-chinese4 versions, and apply a linear layer to the [CLS] embedding for generating the classification result. Table 3. Inference prompt template for ChatGPT. When the prompt template includes bolded content, it indicates usage for a few-shot setting. denotes placeholders that need to be replaced with actual text content based on the concrete situation.

3 4

https://huggingface.co/bert-base-chinese. https://huggingface.co/hfl/chinese-roberta-wwm-ext-large.

10

H. Qiu et al.

Implementation. We evaluate the model performance using widely used metrics such as accuracy (Acc.), weighted precision (P ), recall (R), and F1 score(F1 ). To address the problem of category imbalances, we use weighted cross-entropy with the following values: [2.0, 2.0, 0.5, 2.0, 2.0, 2.0, 2.0, 0.5], where each numeric value corresponds to the index in Table 2. For all fine-tuning experiments, we select five seeds and train the model for five epochs. We use a batch size of 16, a weight decay of 0.01, and a warm-up ratio of 0.1. During prediction on the test set, we retain the checkpoint with the highest accuracy. The learning rate is set to 2e−5, and all experiments are conducted using A100 8×80G GPUs. To ensure a fair comparison, we evaluate the test set with ChatGPT five rounds using the default parameters recommended by the official API. The ChatGPT model we used in this paper is gpt-3.5-turbo. Both temperature and top p values are set to 1.0.

6

Results

Table 4. Evaluation results for fine-grained classification on the test set. The results present the mean and standard deviation (subscript) of accuracy (Acc.), precision (P ), recall (R), and F1 score (F1 ). In the few-shot setting, the inference prompt includes 8 diverse examples from the training set. † indicates that the model used is gpt-3.5turbo-0301, while ‡ indicates that the model used is gpt-3.5-turbo-0613.

6.1

Model

Acc (%) P (%) R (%) F1 (%)

ChatGPT†zero shot ChatGPT†few shot ChatGPT‡zero shot ChatGPT‡few shot

47.50.6

49.90.7 47.50.6 48.40.7

48.41.8

51.02.0 48.41.8 47.92.4

43.10.8

48.94.6 43.10.8 33.51.6

44.74.5

48.73.2 44.74.5 45.64.0

BERT-base

70.31.2

70.50.9 70.31.2 69.71.2

RoBERTa-large

70.43.6

71.02.1 70.43.6 69.83.7

Fine-Grained Classification

We evaluate the performance of each baseline, and the experimental results are presented in Table 4. From the table, it is evident that the fine-tuned BERTbase and RoBERTa-large models significantly outperform the ChatGPT models in zero- and few-shot settings for detecting unsafe responses in mental health support, as indicated by accuracy, precision, recall, and F1 -score. The finetuned BERT-base model achieved an accuracy of 70.3%, while the fine-tuned RoBERTa-large model achieved an accuracy of 70.4%. Interestingly, gpt-3.5turbo-0301 outperforms gpt-3.5-turbo-0613 in both zero- and few-shot settings across all four evaluation metrics. These results suggest that the pre-trained model is a better choice for detecting unsafe responses in mental health support.

A Benchmark for Understanding Dialogue Safety in Mental Health Support

11

Fig. 2. Classified Entries: Unsafe Predicted as Safe (left), and Safe Predicted as Safe (right). The tick value for the red dashed line is 433 (left), and the green dashed line is at 367 (right). (Color figure online)

6.2

Coarse-Grained Safety Identification

During interactions with the conversational agent, our main objective is to have the discriminator accurately detect unsafe responses to prevent harm to users. Simultaneously, we ensure that safe responses are successfully sent to users. We approach this task with binary classification, as depicted in Fig. 2, analyzing 433 instances with an unsafe label and 367 instances with a safe label. In the zero-shot setting, gpt-3.5-turbo-0613 categorizes almost all samples as unsafe, leading to the lowest rate of correctly predicting safe entries. This outcome is impractical for real-life applications. Upon analysis, we observe that during the 5 rounds of evaluation, the models gpt-3.5-turbo-0301-zero-shot, gpt-3.5-turbo-0301-few-shot, gpt-3.5-turbo-0613-zero-shot, gpt-3.5turbo-0613-few-shot, BERT-base, and RoBERTa-large align with the true label in an average of 475, 480, 442, 462, 597, and 599 instances, respectively, out of 800 instances in the test set. Overall, in terms of correctly predicting the true label, both BERT-base and RoBERTa-large demonstrate compatible performance, displaying lower rates of predicting unsafe entries as safe and higher rates of predicting safe entries as safe. 6.3

Manual Inspection of Samples Labeled as Nonsense

To gain deeper insights into performance differences among ChatGPT, BERTbase, and RoBERTa-large on a minority of samples, we manually inspect a collection of samples labeled as Nonsense by humans. Whenever a sample is predicted as the Nonsense label at least once, we count it as true. After analyzing 5 rounds of evaluation, we observe that the models gpt-3.5-turbo-0301-zero-shot, gpt-3.5-turbo-0301-few-shot, gpt-3.5-turbo-0613-zero-shot, gpt-3.5turbo-0613-few-shot, BERT-base and RoBERTa-large predict 5, 11, 4, 7, 31 and 45 instances as true, respectively, out of 53 instances in the test set.

12

H. Qiu et al.

While ChatGPT is a versatile language model in general, it falls short in detecting safety concerns during conversations related to mental health support. Despite having fewer samples in some categories, the fine-tuned model still performs well. This finding suggests that, even in an era dominated by large language models, smaller language models remain valuable, especially for domain-specific tasks that require frequent fine-tuning.

7

Conclusion

Our research aims to advance the study of dialogue safety in mental health support by introducing a sequential and inclusive taxonomy that prioritizes the positive impact on help-seekers, grounded in theoretical and empirical knowledge. To comprehensively analyze dialogue safety, we develop a Chinese online text-based free counseling platform to collect real-life counseling conversations. Utilizing the open-source Chinese dialogue model, EVA2.0-xLarge, we fine-tune the model and meticulously annotate dialogue safety based on our proposed taxonomy. In our investigation to detect dialogue safety within a dialogue session, we employ ChatGPT using a zero- and few-shot paradigm, along with detailed safety definitions. Additionally, we fine-tune popular pre-trained models like BERT-base and RoBERTa-large to detect unsafe responses. Our findings demonstrate that ChatGPT is less effective than BERT-base and RoBERTa-large in detecting dialogue safety in mental health support. The fine-tuned model proves to be more suitable for identifying unsafe responses. Our research underscores the significance of our dataset as a valuable benchmark for understanding and monitoring dialogue safety in mental health support. Ethics Statement. The study is granted ethics approval from the Institutional Ethics Committee. Access to our dataset is restricted to researchers who agree to comply with ethical guidelines and sign a confidentiality agreement with us.

References 1. Roller, S., et al.: Recipes for building an open-domain chatbot. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 300–325 (2021) 2. Gu, Y., et al.: Eva2.0: investigating open-domain Chinese dialogue systems with large-scale pre-training. Mach. Intell. Res. 20(2), 207–219 (2023). https://doi.org/ 10.1007/s11633-022-1387-3 3. Shuster, et al.: BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188 (2022) 4. Baheti, A., Sap, M., Ritter, A., Riedl, M.: Just say no: analyzing the stance of neural dialogue generation in offensive contexts. In: Proceedings of EMNLP, pp. 4846–4862 (2021) 5. Rosenthal, S., Atanasova, P., Karadzhov, G., Zampieri, M., Nakov, P.: SOLID: a large-scale semi-supervised dataset for offensive language identification. In: Findings of the Association for Computational Linguistics, ACL-IJCNLP 2021, pp. 915–928 (2021)

A Benchmark for Understanding Dialogue Safety in Mental Health Support

13

6. Hada, R., Sudhir, S., Mishra, P., Yannakoudakis, H., Mohammad, S., Shutova, E.: Ruddit: norms of offensiveness for English Reddit comments. In: ACL-IJCNLP, pp. 2700–2717 (2021) 7. Dinan, E., Humeau, S., Chintagunta, B., Weston, J.: Build it break it fix it for dialogue safety: robustness from adversarial human attack. In: EMNLP-IJCNLP, pp. 4537–4546 (2019) 8. Sun, H., et al.: On the safety of conversational models: taxonomy, dataset, and benchmark. In: Findings of the Association for Computational Linguistics, ACL 2022, pp. 3906–3923 (2022) 9. Dinan, E., et al.: SafetyKit: first aid for measuring safety in open-domain conversational systems. In: ACL 2022, pp. 4113–4133 (2022) 10. Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., Kamar, E.: ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. In: ACL 2022, pp. 3309–3326 (2022) 11. Lee, A., Kummerfeld, J.K., An, L., Mihalcea, R.: Micromodels for efficient, explainable, and reusable systems: a case study on mental health. In: Findings of the Association for Computational Linguistics, EMNLP 2021, pp. 4257–4272 (2021) 12. Li, A., Ma, L., Mei, Y., He, H., Zhang, S., Qiu, H., Lan, Z.: Understanding client reactions in online mental health counseling. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 10358–10376 (2023) 13. American Psychological Association: Ethical principles of psychologists and code of conduct. Am. Psychol. 57(12), 1060–1073 (2002) 14. Gros, D., Li, Y., Yu, Z.: Robots-dont-cry: understanding falsely anthropomorphic utterances in dialog systems. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3266–3284 (2022) 15. Thoppilan, R., et al.: Lamda: language models for dialog applications. arXiv preprint arXiv:2201.08239 (2022) 16. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971) 17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019) 18. Liu, Y., et al.: Roberta: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

Poster: Fundamentals of NLP

Span-Based Pair-Wise Aspect and Opinion Term Joint Extraction with Contrastive Learning Jinjie Yang1 , Feipeng Dai1(B) , Fenghuan Li2(B) , and Yun Xue1 1

2

School of Electronics and Information Engineering, South China Normal University, Foshan 528225, China {yangjinjie,daifeipeng,xueyun}@m.scnu.edu.cn School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China [email protected]

Abstract. Pair-wise aspect and opinion term extraction (PAOTE) is a subtask of aspect-based sentiment analysis (ABSA), aiming at extracting aspect terms (AT) and corresponding opinion terms (OT) in the review sentences in the form of aspect-opinion pairs (AO), which provides global profiles about products and services for users. Existing methods cannot identify the boundary information of AT and OT precisely. In addition, pairing errors are easily caused if there are multiple AT and OT in a sentence. In our work, to address above limitations, we design a span-based joint extraction framework with contrastive learning (SJCL) to enhance both term extraction and pairing in PAOTE. For term extraction, we utilize span-based convolutional neural network (CNN) to merge contextual syntactic and semantic features to identify the boundaries of aspect and opinion terms precisely and combine contrastive learning (CL) to enhance the distinctiveness of different types of terms. For term pairing, we prune the original dependency trees according to part-of-speech (POS) information to remove insignificant noise information and leverage graph convolutional network (GCN) to learn pairing features, then negative sampling and contrastive learning are used to avoid mismatched aspect-opinion pairs. Experimental results on four public datasets certify the state-of-the-art performance of our model. Keywords: PAOTE · span-based joint extraction opinion term · term pairing · contrastive learning

1

· aspect term ·

Introduction

Aspect-based sentiment analysis [18] is an important task in natural language processing (NLP), aiming at identifying the sentiment polarities of the specific aspects in the given sentences, and it is greatly significant for real-world applications. Aspect term extraction (ATE) [19] and opinion term extraction (OTE) [3] c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 17–29, 2023. https://doi.org/10.1007/978-3-031-44696-2_2

18

J. Yang et al.

are two basic tasks of ABSA, aiming at extracting the AT and OT in the given sentences, respectively. Previously, some studies [4,14,20] devoted ATE and OTE tasks jointly in unpaired mode, therefore ignored the relations between them. Recently, some methods [1,7,15,16,21] focused on extracting aspect term and corresponding opinion term in pairs, that is, pair-wise aspect and opinion term extraction (PAOTE). For example, given a sentence “The sushi was great and cheap but the cooked food amazed us .” in Fig. 1, PAOTE aims to extract all aspect-opinion pairs (AO), such as (“sushi ”, “great”), (“sushi ”, “cheap”), and (“cooked food ”, “amazed ”).

nsubj cop

det

(a) (b) The (c) DT

sushi was NN

VBD

punct dep det amod

dep ROOT conj

cc

great and JJ

CC

cheap but the JJ

CC

DT

cooked VBN

dep amod

food amazed us . NN

JJ

PRP .

Aspect Terms: sushi, cooked food Opinion Terms: great, cheap, amazed Aspect-Opinion Pairs: (sushi, great), (sushi, cheap), (cooked food, amazed)

Fig. 1. An example (b) with corresponding dependency tree (a) and POS labels (c).

However, previous methods did not effectively extract the boundary information of aspect and opinion terms. Boundary information is the key clue for term extraction, and contextual information contains abundant boundary prompt information for aspect and opinion terms, but previous works neglected this point and did not adequately incorporate contextual information. In addition, the problem of incorrect term pairing was not well avoided. Taking abovementioned sentence as an example, if there are multiple aspect terms, it requires to deal with the relations between multiple aspect terms and opinion terms, leading to incorrect term pairings easily. To address aforementioned limitations, we propose a span-based joint extraction framework with contrastive learning (SJCL) for PAOTE task. Aspect and opinion terms usually appear in some similar forms. As examples in Fig. 1, aspect and opinion terms are basically composed of nouns, adjectives and verbs. There are a variety of common semantic collocations in contextual information, e.g., aspect terms are often preceded by the modifier “the” and followed by the copula “was”, and the conjunction “and ” is often used to connect two aspect terms or opinion terms. For dependency information, important dependency types, such as “nsubj ” and “conj ” often associate aspect term with corresponding opinion term. So we encode the syntactic information (i.e., POS labels and dependency types) and exploit span-based CNN to fuse syntactic and semantic features of contexts to extract the boundary information of aspect and opinion terms. Moreover, contrastive learning is applied to enhance the distinctiveness of different types of terms. In addition, we prune original dependency trees according to POS labels to strengthen the connection between aspect term and corresponding opinion term, and learn pairing representations by GCN over pruned dependency

Pair-Wise Aspect and Opinion Term Joint Extraction

19

trees. Furthermore, we learn distinguishable representations for aspect-opinion pairs by negative sampling and contrastive learning to avoid incorrect matches. The contributions of our work are summarized as follows: – To accurately identify the boundaries of aspect and opinion terms, we leverage span-based CNN to fuse the syntactic and semantic information in contexts to extract their boundary information, and then utilize contrastive learning to enhance the distinctiveness of different types of terms. – To enhance the connection between aspect term and corresponding opinion term as well as avoid incorrect pairing, we design GCN framework with contrastive learning. Pairing representations are learned by GCN over pruned dependency trees, and are enhanced by CL for aspect-opinion pairs. – We conduct extensive experiments on four benchmark datasets, which verify that our model outperforms existing state-of-the-art models significantly.

2 2.1

Related Work Pair-Wise Aspect and Opinion Term Extraction

Related studies on PAOTE can be grouped into pipeline and end-to-end methods. The pipeline methods [4,17] adopt two stages: term extraction and pairing. These pipeline methods [10,20] suffer from error propagation. End-toend methods learn shared features for term extraction and pairing, therefore extract aspect-opinion pairs directly, alleviating error propagation problems. The end-to-end methods can be grouped into span-based [15,21] and grid taggingbased [1,7,16] models. A multi-task learning framework [21] was first designed for PAOTE task based on the candidate term spans. An edge-enhanced graph convolutional network [15] was proposed to sufficiently incorporate the dependency edges and labels. A tagging scheme [16] was designed to decode aspect-opinion pairs in a unified grid about word-to-word relations. A synchronous network [1] was proposed to enhance the mutual benefit on term extraction and pairing. A mutually-aware interaction mechanism [7] was designed to interactively update the representations for term extraction and pairing. However, these works did not adequately extract the boundary information of AT and OT as well as did not effectively avoid pairing errors when there are multiple terms in one sentence. 2.2

Span-Based Methods and Contrastive Learning

The characteristics of PAOTE are similar to joint entity and relation extraction (JERE), e.g., both PAOTE and JERE contain term extraction and relation pairing. Span-based methods have achieved surprising performance on JERE. A span-based approach [2] was proposed to offer benefits of overlapping entities. A cloze mechanism [13] was introduced to simultaneously extract the context and span position information. Span-based methods [15,21] are good at handling one-to-many relations between terms, therefore, they can be effective for PAOTE task with multiple aspect and opinion terms in a sentence.

20

J. Yang et al.

Contrastive learning aims to learn similar representations for terms in the same class and discriminate representations for terms in different classes, which has been widely used, including ABSA. Supervised contrastive pre-training [5] was adopted to align implicit sentiment expressions. A contrastive learning framework [6] was deployed to leverage correlations and differences among different sentiment polarities. Contrastive learning can enhance the similarity of same type of samples and distinctiveness of different types of samples, therefore, can be devoted to term extraction and pairing in PAOTE task. Contrastive Learning

non-AO

aspect term

opinion term

great

word embedding

dreadful cc

but

BERT Encoding

nsubj

service

cop was

det the

Input

Great food but the service was dreadful

Dependency Tree

Fig. 2. The overall structure of our model.

3

Methods

Our model is introduced in this section, whose structure is shown in Fig. 2. 3.1

Problem Definition

Given a review sentence S = {w1 , w2 , ..., wk } with k words containing m aspect terms A = {a1 , a2 , ..., am } and n opinion terms O = {o1 , o2 , ..., on }, where each aspect or opinion term consists of one or more words, PAOTE is to extract all aspect-opinion pairs B = {(ai , oj ), ...} in S. Span-based methods enumerate all possible term spans (TS) T = {t1 , t2 ..., tf } within maxlength ls , such as “great”, “cooked food ”, “great and cheap”, “and cheap but the” in Fig. 1. One TS is denoted as ti = {wstart(i) , ..., wend(i) }, where wstart(i) and wend(i) denote the start and end words of ti respectively. Based on candidate term spans, the model extracts all AT and OT, as well as matches them to achieve aspect-opinion pairs. 3.2

BERT Encoder

We use BERT to encode the sentence S = {w1 , w2 , ..., wk } to obtain embedding vectors for each word, and the input form is “[CLS] sentence [SEP] ”: {xcls , x1 , x2 , ..., xk } = BERT ([CLS], w1 , w2 , ..., wk , [SEP ])

(1)

Pair-Wise Aspect and Opinion Term Joint Extraction

21

where xi ∈ R is the embedding vector for i − th word wi , xcls ∈ R is the feature representation for the overall sentence, da denotes the vector dimension of xi and xcls , and k denotes the number of words in the sentence. 3.3

Span-Based CNN with Contrastive Learning for Term Extraction

Span-Based CNN Encoder: Syntactic information contributes to PAOTE, so syntactic embeddings are concatenated with word embeddings to enrich contextual features. We use StanfordCoreNLP toolkit to obtain POS labels P = {P1 , ..., Pk } and construct a vector look-up table Wp ∈ RM× to get embedding vectors p = {p1 , ..., pk } ∈ R× for P , where M denotes the total number of POS labels, db denotes the dimension of pi , pi is the POS embedding of wi . Similarly, syntactic dependency information is obtained by StanfordCoreNLP. Each word has dependencies including dependency edges and types with one or multiple other words. In the same way, a vector look-up table Wd ∈ RN× is constructed to obtain dependency matrix Di = {di,1 , di,2 , ..., di,v } ∈ R× for each word wi , where N denotes the total number of dependency types, v denotes the number of dependency types related to wi , and each element in Di denotes the dependency embedding and dc denotes corresponding dimension. Both Wp and Wd are randomly initialized trainable parameter matrices. The hidden representations for each word wi can be obtained by: hi = xi ⊕ pi ⊕ di

(2)

where di = ⊕ refers to vector concatenation, and hi ∈ R (dh = da + representation of wi with POS representation pi ∈ R and dependency representation di ∈ R added. CNN can effectually merge contextual features and clarify the position information of words. So we use CNN to refine contextual hidden representations H = {h1 , h2 , ..., hk } ∈ R× by: 1 v (di,1 + di,2 ... + di,v ), db + dc ) is the hidden

m m cm i = ReLU (w · hi−e:i+e + b )

hci

=

c1i



c2i ...



cm i ...



cdi h

(3) (4)

where ReLU is an activation function, · is a convolutional operator, (2e + 1) is the window size of the CNN filter, wm ∈ R(2e+1)×dh and bm ∈ Rdh are trainable parameters, and hci ∈ Rdh is the hidden representation of wi refined by CNN. Contextual hidden representations are refined as H c = {hc1 , hc2 , ..., hck } ∈ Rk×dh , and the representation Ti of term span ti can be acquired by: hcpool(i) = M ax − P ooling([hcstart(i) , ..., hcend(i) ]) Ti = xcls ⊕

hcpool(i)



hcstart(i)



hcend(i)

(5) (6)

where hcstart(i) ∈ R and hcend(i) ∈ R are the representations of start and end words of ti , and hcpool(i) ∈ R is the pooling representation of ti by M ax−P ooling on words contained in ti . Then we conduct term span classification by: Ci = Sof tmax(wc Dropout(Ti ) + bc )

(7)

22

J. Yang et al.

where wc ∈ R× (dp = 3 ∗ dh + da ) and bc ∈ R are trainable parameters, and Ci is the predicted probability distribution of ti . C = {A, O, null} are TS labels, meaning AT, OT and IT (i.e., invalid terms), respectively. Contrastive Learning: To improve TS classification, we leverage CL to enhance distinctive representations of TS. For each batch B in training set, we map the representations of all AT and OT in B to an AT set Ar = a } and an OT set Or = {T1o , T2o , ..., Tno }, respectively. Besides, we {T1a , T2a , ..., Tm randomly sample a certain number of IT in B and map their representations to an IT set E r = {T1e , T2e , ..., Tle }. The elements in Ar , Or and E r are gained from formula (6). Contrastive loss functions for AT and OT are LA and LO as follows, where sim(, ) is cosine similarity and τ is the temperature coefficient. m a a j=1 exp(sim(Ti , Tj )/τ ) (8) A (Tia ) = − log n l a o a e j=1 exp(sim(Ti , Tj )/τ ) + j=1 exp(sim(Ti , Tj )/τ ) n o o j=1 exp(sim(Ti , Tj )/τ ) O o  (Ti ) = − log m (9) l o a o e j=1 exp(sim(Ti , Tj )/τ ) + j=1 exp(sim(Ti , Tj )/τ ) 1  A a  (Tj ), m j=1 m

LA =

3.4

1 O o  (Tj ) n j=1 n

LO =

(10)

GCN with Contrastive Learning for Term Pairing

GCN Encoder: Ac = {ac1 , ac2 ..., acm } and Oc = {oc1 , oc2 ..., ocn } are the sets of AT and OT extracted for the sentence S, respectively. Each aci in Ac and ocj in Oc constitute each term pair (TP) (aci , ocj ) in the set of candidate aspect-opinion pairs B c = {(ac1 , oc1 ), ..., (acm , ocn )} for term pairing prediction. We apply GCN over pruned dependency trees based on the contextual representations from CNN to learn term span pairing representations. The dependency tree of one sentence is obtained by StanfordCoreNLP toolkit and denoted as an adjacency matrix Q = {qi,j }k×k , where Q is a 0–1 matrix, qi,j = 1 if there is dependency relation between wi and wj , otherwise qi,j = 0. To strengthen the interaction between AT and corresponding OT, we prune Q based on POS labels P . All POS labels are divided into four categories (i.e., nouns, adjectives, verbs, and others), if both wi and wj are nouns, adjectives or verbs, their dependency relation remains, otherwise it is pruned (i.e., qi,j = 1 is changed to qi,j = 0). For example, as viewed in Fig. 1, the dependency relations marked in dotted lines are pruned. A new dependency tree G = {gi,j }k×k is obtained after the pruning and pairing representations are learned by L − layer GCN: k k   g,l−1 = ReLU ( g (w h + b )/ gi,j ) hg,l i,j g g i i j=1

(11)

j=1

 where hg,l i ∈ R is the contextual representation of wi learned by l − th GCN g,0 c layer (hi is hi from CNN), wg ∈ R × and bg ∈ R are trainable parameters,

Pair-Wise Aspect and Opinion Term Joint Extraction

23

g,l g,l × and the outputs of GCN are H g,l = {hg,l . Finally, we 1 , h2 , ..., hk } ∈ R c c classify all candidate pairs (ai , oj ) by formula (16).

acpool(i) = M ax − P ooling([hg,l ac

, ..., hg,l ac

])

(12)

ocpool(j) = M ax − P ooling([hg,l oc

, ..., hg,l oc

])

(13)

])

(14)

start(i)

start(j)

g,l hcon i,j = M ax − P ooling([hac

start+1(i)

end(i)

end(j)

, ..., hg,l oc

end−1(j)

Ri,j = acpool(i) ⊕ ocpool(j) ⊕ hcon i,j

(15)

Zi,j = Sof tmax(wz Dropout(Ri,j ) + bz )

(16)

where acpool(i) ∈ R is the representation of aci , ocpool(j) ∈ R is the representation  is the representation of contexts between aci and ocj , Ri,j ∈ of ocj , hcon i,j ∈ R ∗ R is pairing representation of (aci , ocj ), and Zi,j is the predicted probability distribution of (aci , ocj ). Z = {T rue, F alse} are pairing labels, meaning AO and non-AO, respectively. Contrastive Learning: Usually, there are multiple AT and OT in a sentence, which lead to TS pairing errors easily. To avoid this situation, multiple non-AO N = {(a1 , o1 ), (a2 , o2 ), ..., (ad , od )} in the sentence S are selected as negative samples, such as the pairs (“sushi ”, “amazed ”) and (“cooked food”, “cheap”) in Fig. 1. The representations of both AO and non-AO are obtained by formula (15), and contrastive loss LP is used to enhance their distinctive representations. s s 1 R j=1 exp(sim(Ri , Rj )/τ , LP =  (Ri ) (17) P (Ri ) = −log d N s j=1 j=1 exp(sim(Ri , Rj )/τ where Ri and Rj are the representations of AO, RjN is the representation of non-AO, and s and d are the number of AO and non-AO, respectively. 3.5

Model Training

In model training, we minimize the loss function Ltotal by stochastic gradient descent to optimize our model parameters. Lterm = −

Y 

i log Ci , C

U 

Lpair = −

i=1

 Z i,j log Zi,j

Lcon = LA + LO + LP total

L

(18)

i=1,j=1

term

=L

pair

+ λ1 L

con

+L

(19) + λ2 ||θ||

2

(20)

where Lterm and Lpair defined by cross-entropy function are TS extraction loss i and Z  and TS pairing loss, respectively, Lcon is global contrastive loss, C i,j are golden label distributions, Ci and Zi,j are predicted probability distributions, Y and U are the total numbers of terms and pairs for training respectively, λ1 is a hyper-parameter to balance the loss, and λ2 is l2 regularization coefficient.

24

J. Yang et al.

4

Experiments

4.1

Datasets

We conduct experiments on four public datasets (i.e., 14lap, 14res, 15res, 16res). The original data in these datasets are some review sentences about laptop and restaurant from the SemEval tasks [9,11,12]. The processed data is provided by [16]. Each sample contains a review sentence and all AT, OT and AO. Statistical information of these datasets is shown in Table 1. Table 1. The statistics of four benchmark datasets. #S, #A, #O and #P denote the number of sentences, AT, OT and AO, respectively. Datasets 14res 14lap 15res 16res Train Dev Test Train Dev Test Train Dev Test Train Dev Test #S

1259

315

493

899

225 332

603

151 325

863

216 328

#A

2064

487

851 1257

332 467

871

205 436

1213

298 456

#O

2098

506

866 1270

313 478

966

226 469

1329

331 485

#P

2356

580 1008 1452

383 547

1038

239 493

1421

348 525

4.2

Implementation Details

Pretrained language model BERT is adopted to encode the sentence and obtain 768-dimensional (da ) word embeddings. The dependency trees and POS labels are constructed via StandfordCoreNLP toolkit1 , and the dimensions (db and dc ) of POS embeddings and dependency embeddings are set to 50 and 100, respectively. The parameters of BERT are initialized by officially released pretrained parameters (base-uncased version2 ), other trainable parameters are initialized randomly by Xavier, and the Adam optimizer is adopted to optimize all model parameters by standard back-propagation. The batch size is 8, the learning rate and weight decay for Adam are 5e−5 and 0.01, the window size of CNN filter (2e+1) is 5, the temperature coefficient is 0.5, the l2 regularization coefficient λ2 is 0.001, λ1 is 1.5, the number of IT is 30, and the dropout rate is 0.1. The metrics for performance evaluation are precision (P), recall (R) and F1 scores. If the boundaries of corresponding span are correctly identified, an AT or OT is regarded correct extraction, and a term pair is regarded as correct pairing if both boundaries and classes of two terms contained in it are correctly identified. 4.3

Baselines

End-to-end methods [1,7,15,16,21] can alleviate error propagation and outperform pipeline models. So we select some state-of-the-art end-to-end models for 1 2

https://stanfordnlp.github.io/CoreNLP/. https://huggingface.co/bert-base-uncased.

Pair-Wise Aspect and Opinion Term Joint Extraction

25

experiment comparison and analysis, which are SpanMIt [21], ESGCN [15], GTS [16], SDRN [1] and MAIN [7]. The first two are span-based models and the last three are grid tagging-based models. Table 2. Experimental results on 4 datasets (%). Model

14lap P

R

F1

14res P

R

F1

15res P

R

F1

16res P

R

F1

SpanMIt GTS SDRN ESGCN MAIN

– 66.41 – 65.98 78.85

– 64.95 – 71.63 62.71

68.66 65.67 67.13 68.69 68.96

– 76.23 – 75.31 78.20

– 74.84 – 77.16 76.89

75.60 75.53 76.48 76.22 77.54

– 66.40 – 66.99 77.60

– 68.71 – 69.73 65.31

64.68 67.53 70.94 68.34 70.92

– 71.70 – 72.86 81.80

– 77.79 – 77.69 74.48

71.78 74.62 – 75.20 77.97

Ours(SJCL) w/o CNN w/o POS and Dep w/o PGCN w/o LA and LO w/o LP

78.28 67.13 68.57 67.84 67.88 68.37

72.87 72.25 70.52 72.55 72.32 72.09

72.55 70.82 71.32 70.92 70.56 70.83

80.90 76.11 76.78 75.62 75.56 78.47

78.48 73.58 75.32 76.54 76.64 75.69

78.43 77.52 78.13 77.68 77.56 77.31

78.27 74.27 74.55 76.20 76.20 77.21

69.88 68.52 67.86 68.66 67.33 68.35

73.14 71.71 71.88 72.04 72.31 72.14

80.04 76.35 77.74 77.50 78.42 77.62

80.19 76.27 75.87 76.33 76.12 78.63

80.12 77.85 78.50 78.16 78.97 78.81

The experiments marked in w/o are as follows: (1) remove CNN; (2) remove POS and dependency embeddings; (3) remove pruning operation for dependency tree; (4) without LA and LO ; (5) without LP .

4.4

Main Results

The experimental results are shown in Table 2. Our model SJCL achieves the superior F1 scores on all datasets compared with state-of-the-art baselines, improving the results on 14lap, 14res, 15res and 16res by 3.59%, 0.89%, 2.22% and 2.15%, respectively. Syntactic information encoding and span-based CNN contribute to the boundary recognition for term extraction and pruned dependency trees enhance the relation between aspect term and its corresponding term for term pairing. In addition, compared with span-based models SpanMIt and ESGCN, contrastive learning for AT and OT extraction guide our model to enhance distinctive term span representations in the embedding spaces during the training, which is beneficial for term extraction. Similarly, contrastive learning for term pairing enhances pairing prepresentations, which is beneficial for AO pairing. Compared with grid tagging-based models, Our model SJCL takes advantage of span-based model on PAOTE task via contrastive learning. 4.5

Ablation Study

To explore the effectiveness of our model, a series of ablation experiments are conducted and the results are listed in Table 2. The findings are as follows: (1) All results decrease when w/o POS and Dep as well as w/o CNN. It indicates that POS and dependency embeddings provide auxiliary information for boundary recognition of AT and OT, and CNN sufficiently fuses the syntactic and semantic features of contexts to identify boundaries precisely, improving term extraction. (2) All results decrease accordingly when w/o contrastive loss

26

J. Yang et al.

LA and LO . It verifies that LA and LO are beneficial to feature learning, which enhance the similarity of the same type of term spans and distinctiveness of different types of term spans. LA and LO improve AT and OT recognition, respectively. In the same way, LP helps to differentiate the representations between AO and non-AO to identify correct pairs more precisely. (3) Compared with original dependency trees, pruned dependency trees strengthen the interaction between AT and corresponding OT by removing secondary dependencies. 4.6

Case Study

We select some representative samples from 14res for case study to prove the advantage of SJCL. The predicted results are listed in Table 3. (1) In No. 1 sample, ESGCN does not identify OT “dried out” integrally and regard “flavorful ” as OT by mistake. MAIN considering word-level relations does not recognize the pairing relation between “with” and “dried out” so that forms false pairs (chicken, dried out) and (avocado, dried out). But SJCL predicts all AO correctly and avoids non-AO effectively. For span-based models, it is easy to one-sidedly identify TS like “chicken”, “dried ” and “flavorful ” as AT or OT, SJCL excellently extracts TS boundary information and utilizes contrastive learning to alleviate this problem. (2) In No. 2 sample, there are multiple AT and OT, and they have similar semantics and indirect syntactic relations, which makes both ESGCN and MAIN gain mismatched pairs. SJCL can avoid this error with the ability of distinguishing between AO and non-AO. (3) In No. 3 sample, both ESGCN and MAIN employ dependency information, but they omit the AO (d´ecor, trendy) because the syntactic distance is long between “d´ecor ” and “trendy”. SJCL applies GCN over the pruned dependency trees to facilitate feature interaction between “d´ecor ” and “trendy” and predicts AO (d´ecor, trendy) successfully. Our model can effectively avoid some problems. 4.7

Contrastive Learning Visualization

To prove that contrastive learning can enhance distinctive feature representations, we conduct dimension reduction and visualization on TS and AO. We randomly pick one batch data in 16res, and learn the representations of TS and AO, and then apply t-SNE [8] to reduce the dimension of representations to 2 for visualization. The visualization results are described in Fig. 3, where bold “1” denotes AT, bolditalic “2” denotes OT, and italic “0” denotes IT. Figure 3(a) shows TS representations learned by SJCL without contrastive loss Lcon and Fig. 3(b) shows TS representations learned by SJCL. It can be seen that TS representation distribution of different types are more distinguishable with Lcon , which benefits TS classification theoretically. In addition, Fig. 3(c) and Fig. 3(d) show AO representations learned by SJCL without Lcon and with Lcon respectively, where blue “1” denotes AO and red “0” denotes non-AO. Similarly, it is more differentiated to distinguish between AO and non-AO representations with Lcon , improving term pairing. Generally, contrastive learning guides the model to learn more distinctive feature representations to improve the performance.

Pair-Wise Aspect and Opinion Term Joint Extraction

27

Table 3. Case study. Golden AT and OT are marked in bold and italic, respectively. AT and OT in a pair have the same subscript. Incorrect pairings are marked in bolditalic. Reviews

ESGCN

MAIN

Ours (SJCL)

1. My [chicken with avocado]1 was completely [dried out] 1 and the [sauce]2 was [not very flavorful] 2 .

(chicken with avocado, dried) (sauce, not very flavorful) (sauce, flavorful)

(chicken, dried out) (avocado, dried out) (sauce, not very flavorful)

(chicken with avocado, dried out) (sauce, not very flavorful)

2. [Great] 1 [food]1 , [decent] 2 [price]2 and [friendly] 3 [service]3 ensure an [enjoyable] 4 [repast]4 .

(food, Great), (price, decent) (service, friendly), (price, friendly) (repast, enjoyable)

(food, Great), (price, decent) (service, friendly), (repast, enjoyable) (service, enjoyable)

(food, Great), (price, decent) (service, friendly), (repast, enjoyable)

3. The [environment]1 and [decor]2 are very [comfy] 1,2 and [laid-back] 1,2 , all the while being [trendy] 1,2 .

(environment, comfy) (environment, trendy) (environment, laid-back) (decor, comfy), (decor, laid-back)

(environment, comfy) (environment, trendy) (environment, laid-back) (decor, comfy), (decor, laid-back)

(environment, comfy) (environment, trendy) (environment, laid-back), (decor, comfy) (decor, laid-back), (decor, trendy)

Fig. 3. Visualization for TS and TP.

28

5

J. Yang et al.

Conclusion

In this paper, we propose a novel span-based joint extraction framework with contrastive learning for PAOTE task. Experiment results on four benchmark datasets show our model SJCL achieves state-of-the-art performance and outperforms the baselines significantly. The results demonstrate the effectiveness of SJCL, which can solve the limitations of boundary information and pairing error for multiple terms in one sentence. Span-based CNN with contrastive learning encodes contextual syntactic and semantic features to fully extract term boundary information, and contrastive learning improves term extraction by encouraging model to learn distinctive term representations. GCN with contrastive learning prunes the primitive dependency trees to strengthen the correlation between AT and corresponding OT, as well as negative sampling and contrastive learning improve term pairing by distinguish between AO and non-AO. In future work, we will further develop our model on aspect sentiment triplet extraction (ASTE) task which aims to extract aspect-opinion-sentiment triplets, and study fine-grained sentiment analysis task more comprehensively. Acknowledgements. This work is supported by the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515011370), the Characteristic Innovation Projects of Guangdong Colleges and Universities (No. 2018KTSCX049), the Natural Science Foundation of Guangdong Province (No. 2021A1515012290), the Guangdong Provincial Key Laboratory of Cyber-Physical Systems (No. 2020B1212060069) and the National & Local Joint Engineering Research Center of Intelligent Manufacturing Cyber-Physical Systems.

References 1. Chen, S., Liu, J., Wang, Y., Zhang, W., Chi, Z.: Synchronous double-channel recurrent network for aspect-opinion pair extraction. In: ACL, pp. 6515–6524 (2020) 2. Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training. In: European Conference on Artificial Intelligence (ECAI), pp. 2006–2013 (2020) 3. Fan, Z., Wu, Z., Dai, X., Huang, S., Chen, J.: Target-oriented opinion words extraction with target-fused neural sequence labeling. In: NAACL, pp. 2509–2518 (2019) 4. Li, X., Lam, W.: Deep multi-task learning for aspect term extraction with memory interaction. In: EMNLP, pp. 2886–2892 (2017) 5. Li, Z., Zou, Y., Zhang, C., Zhang, Q., Wei, Z.: Learning implicit sentiment in aspect-based sentiment analysis with supervised contrastive pre-training. In: EMNLP, pp. 246–256 (2021) 6. Liang, B., et al.: Enhancing aspect-based sentiment analysis with supervised contrastive learning. In: ACM, pp. 3242–3247 (2021) 7. Liu, Y., Li, F., Fei, H., Ji, D.: Pair-wise aspect and opinion terms extraction as graph parsing via a novel mutually-aware interaction mechanism. Neurocomputing 493, 268–280 (2022) 8. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

Pair-Wise Aspect and Opinion Term Joint Extraction

29

9. Manandhar, S.: SemEval-2014 Task 4: aspect based sentiment analysis. In: SemEval 2014, pp. 27–35 (2014) 10. Pereg, O., Korat, D., Wasserblat, M.: Syntactically aware cross-domain aspect and opinion terms extraction. In: COLING, pp. 1772–1777 (2020) 11. Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., Androutsopoulos, I.: SemEval-2015 Task 12: aspect based sentiment analysis. In: SemEval 2015, pp. 486–495 (2015) 12. Pontiki, M., et al.: SemEval-2016 Task 5: aspect based sentiment analysis. In: SemEval-2016, pp. 19–30. Association for Computational Linguistics (2016) 13. Wan, Q., Wei, L., Zhao, S., Liu, J.: A span-based multi-modal attention network for joint entity-relation extraction. Knowl. Based Syst. 262, 110228 (2023) 14. Wang, W., Pan, S.J.: Recursive neural structural correspondence network for crossdomain aspect and opinion co-extraction. In: ACL, pp. 2171–2181 (2018) 15. Wu, S., Fei, H., Ren, Y., Li, B., Li, F., Ji, D.: High-order pair-wise aspect and opinion terms extraction with edge-enhanced syntactic graph convolution. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2396–2406 (2021) 16. Wu, Z., Ying, C., Zhao, F., Fan, Z., Dai, X., Xia, R.: Grid tagging scheme for endto-end fine-grained opinion extraction. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 2576–2585 (2020) 17. Wu, Z., Zhao, F., Dai, X.Y., Huang, S., Chen, J.: Latent opinions transfer network for target-oriented opinion words extraction. In: AAAI, vol. 34, pp. 9298–9305 (2020) 18. Yang, J., Dai, A., Xue, Y., Zeng, B., Liu, X.: Syntactically enhanced dependencyPOS weighted graph convolutional network for aspect-based sentiment analysis. Mathematics 10(18), 3353 (2022) 19. Yang, Y., Li, K., Quan, X., Shen, W., Su, Q.: Constituency lattice encoding for aspect term extraction. In: COLING, pp. 844–855 (2020) 20. Yu, J., Jiang, J., Xia, R.: Global inference for aspect and opinion terms coextraction based on multi-task neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 27(1), 168–177 (2018) 21. Zhao, H., Huang, L., Zhang, R., Lu, Q., Xue, H.: SpanMlt: a span-based multi-task learning framework for pair-wise aspect and opinion terms extraction. In: ACL, pp. 3239–3248 (2020)

Annotation Quality Measurement in Multi-Label Annotations Sheng Li(B) , Rong Yan, Qing Wang, Juru Zeng, Xun Zhu, Yueke Liu, and Henghua Li China National Clearing Center, People’s Bank of China, Beijing, China [email protected]

Abstract. Annotation quality measurement is crucial when building a supervised dataset for either general purpose research or domain applications. Interrater agreement measure is one of the most vital aspects in terms of establishing annotation quality. The traditional inter-rater agreement measures cannot address the issue in multi-label scenario. To adapt to multi-label annotations, the recent research has developed a bootstrapping method to measure the level of agreement between two raters. In this paper we propose a fine-grained multi-label agreement measure MLA, which attends to discover slight differences in inter-rater agreement across different annotations when multiple raters are involved. We demonstrate its compatibility with traditional measures through mathematics and experiments. The experimental results show it can interpret the agreement more accurately and consistently with intuitive understanding. In addition, a toolset is provided to enable users to generate the multi-label annotations that mimic different annotators, and calculate various agreement coefficients for several scenarios. Keywords: Multi-label Annotations · Inter-rater Agreement · Multiple Raters

1 Introduction Multi-label annotation provides a fine-grained and multi-faceted interpretation of an item when compared to single label. For example, the multiclass classification assigns one label to each image to represent its major object, while multi-label classification assigns multiple labels to indicate the presence of various objects and backgrounds in the image. In natural language processing (NLP) classification task, a legal document may simultaneously belong to multiple categories [1]. In recent years, there have been many efforts to advance multi-label learning [2, 3, 15, 16]. Therefore, how to establish multi-label annotation quality measurement becomes valuable for training deep learning models in practical applications. The quality of multi-label annotation depends largely on how well the raters agree with each other. A lot of research work has been carried out in the area of the inter-rater agreement statistic. The very beginning seminal work, such as Cohen’s kappa [4], was to evaluate the judgement agreement between two raters by using kappa coefficient. Almost in the same decade, the similar method Scott’s pi [5] was proposed while the calculation of expected agreement was slightly different from that of Cohen’s kappa. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 30–42, 2023. https://doi.org/10.1007/978-3-031-44696-2_3

Annotation Quality Measurement in Multi-Label Annotations

31

Then Fleiss’ kappa [6] generalized Scott’s pi in the scenario of multiple raters rating a fixed number of items, it assumes the equal distribution for all raters and evaluate agreement for rater-rater pairs. And weighted kappa was developed to cope with the ordered category data [7]. Besides kappa statistics, Krippendorff’s alpha [8] has been applied to content analysis since the 1970s. It offers tremendous benefits when applied to multi-rater multiclass scenario. Further, it can handle incomplete data sets and various measurement levels like nominal, ordinal and so on. Unfortunately, all these methods are not applicable to multi-label scenarios. A bootstrapping method is suggested to calculate the expected agreement when multiple labels were assigned [9]. We adopt this method in our solution. However, when calculating the observed agreement, only two annotators’ case was discussed in [9]. In order to further address the issue in the multi-rater scenario, inspired by [10], we proposed a method multi-label agreement (MLA) to obtain the observed agreement, which makes fine-grained measure of multi-label agreement among multi-raters in a more intuitive and rational way. Table 1 makes a comparison of MLA and other statistics with respect to scenarios they supported. Table 1. Comparison of supported scenarios for various agreement statistics. Agreement statistics

Support multi-rater

Support multi-class

Support multi-label

Cohen’s kappa

x



x

Fleiss’s kappa





x

Krippendorff’s alpha





x

Marchal’s bootstrapa

x





MLA (proposed)







a We here and in the following parts mark the method proposed in [9] as Marchal’s bootstrap.

Our main contributions are as follows: • Propose a fine-grained multi-label agreement measure MLA. It identifies subtle distinctions in inter-rater agreement across different annotations when multiple raters are involved. Compared to other statistics, it uses a more intuitive approach to interpret the inter-rater agreement. • Demonstrate MLA’s compatibility with traditional measures through mathematics and with a set of experiments. Overall, the experimental results show the viability and intuitiveness of this method. • Provide a toolset to generate the multi-label annotations that mimic different annotators by following the pre-defined annotation distribution, and calculate various agreement coefficients for multiple scenarios. The rest of this paper is structured as follows: Sect. 2 introduces the related work to be referred in experiments. The implementation details of MLA are stated in Sect. 3, and its compatibility with other statics is validated mathematically. The experimental results are illustrated in Sect. 4. Code and toolset are available at https://github.com/cncc-ai/ multi-label_agreement.

32

S. Li et al.

2 Related Work People always need to set up supervised dataset either for general purpose research or domain specific application. Due to representation problems like ambiguity and uncertainty [11], an instance may allow more than one interpretation. The multi-label annotation is well-suited for this. When people build supervised training set, to what extent are raters consistent with each other has a substantial impact on the final performance of the model. Agreement coefficient, k, is the commonly used index to measure the level of inter-rater consistency. The general form is1 : k=

Po − Pe 1 − Pe

(1)

where Po is the observed agreement among the raters, and Pe is the expected agreement showing the probability that raters agree by chance. The k value is an agreement proportion calculated to remove chance occurrences. It falls within the interval of [−1, 1], one indicates the perfect agreement, zero means agreement equals to chance, and other negative values mean poor agreement.

Fig. 1. The computation of Po , Pe and k according to Cohen’s kappa (left), Fleiss’s kappa (middle) and Krippendorff’s alpha (right). Two categories (c1 and c2) are assigned to 50 items (i1 to i50, L = 50) by two raters (r1 and r2), the raw annotations are shown in the data matrix on the right. The first 20 items and the following 5 items are annotated respectively as c1 and c2 by both raters, the last 25 items are categorized as c1 by r1 while as c2 by r2.

Traditional inter-rater agreement measures differ in terms of the way they estimate Pe . Cohen’s kappa assumes different distributions between raters while Fleiss’s kappa and Krippendorff’s alpha2 assume the equal distributions for all raters. Examples in 1 The common expression of Krippendorff’s alpha is the diversity-based expression, it can be

transformed into the form of Eq. 1. 2 Krippendorff’S alpha’s embrace of other methods is not our focus, some papers have discussed

on it [12].

Annotation Quality Measurement in Multi-Label Annotations

33

Fig. 1 visualize the data and elaborate on the calculations of k, Po and Pe with respect to three traditional methods. It is found that the values of Po are exactly the same, while the values of Pe are slightly different, resulting in the corresponding differences of k. In order to extend kappa coefficient to multi-label scenario, Marchal’s bootstrap [9] developed a bootstrapping method to obtain the expected agreement based on the observed agreement measures like soft-match, augmented kappa, precision, recall and F1. In Fig. 2, we choose Augmented kappa and F1 to exemplify the calculations of observed agreement in Marchal’s bootstrap. The bootstrapping is conducted by following the annotation distribution of each rater obtained from the raw annotations, as that shown in the bottom-left corner of Fig. 2. Based on the bootstrapped data, the observed agreement is computed using methods shown in the middle and right panel of Fig. 2. The bootstrapped expected agreement Pe is obtained by following the same computation while applied on the bootstrapped data. When there are more raters, the methods calculating Po in Marchal’s bootstrap may get undesirable results. Since it relies on the intersection labels, it is found that getting intersection is difficult due to increasing of raters. Inspired by Shape Naming Divergence (SND) in [10], we propose a fine-gained calculation of Po . In Sect. 3, SND and MLA will be introduced simultaneously using a comparative approach.

Fig. 2. Exemplify calculation of annotation distribution of raters and Po used in Marchal’s bootstrap [9]. Three categories (c1, c2 and c3) are assigned to 3 items (i1, i2 and i3, L = 3) by two raters (r1 and r2), the raw annotations are as that shown in the top-left corner. The calculation of annotation distribution is illustrated in the bottom-left corner. Augmented kappa (middle) and F1 (right), are used to calculate Po . To note, no bootstrapped data are included in this example.

3 A Fine-Grained Multi-rater Multi-label Agreement Measure Traditional agreement measures are not applicable to multi-label scenarios. Marchal’s bootstrap [9] addresses multi-label agreement measure, while with two raters. It calculates the observed agreement based on the intersection labels. As the number of raters increases, obtaining intersection becomes difficult. Therefore, MLA suggests a scalable

34

S. Li et al.

method which applies to a wide variety of scenarios by accurately capturing inter-rater agreement between every individual and any other individual. 3.1 MLA Algorithm We propose a scalable method to calculate the consistency of all raters on the dataset in multi-rater multi-label scenario. MLA is inspired by Shape Naming Divergence (SND) [10]. SND is a criterion that is used to measure the divergence of annotation across individuals [13]. Formally, SND defines the annotations for an item from all raters as a sequence of M tokens x = x1 , · · · xM . Given an item with N annotations x(j) , j = 1, · · · , N , each of length M (j) , define wi (j) for each token xi (j) in annotation x(j) as the proportion of other annotations of that item that DO NOT contain xi (j) : (2)

where is an indicator function. The divergence of an item, W SND, is defined as:

(3)

Compared to SND, MLA focuses on the agreement part. We define ai (j) for each token xi (j) in annotation x(j) as the proportion of other annotations of that item contain xi (j) , calculated as Eq. 4. (4) where is an indicator function. The agreement coefficient of annotation x(j) , denoted  (j) (j) by A(j) , is calculated as A(j) = M1(j) M i=1 ai . The agreement coefficient of an item is 1 N (j) defined as A = N j=1 A , the complete definition is given by Eq. 5. (5) For example, there are three raters, which means N = 3, each of them gives a set of labels for the same item, e.g. {[1, 2, 3], [1, 2, 4], [1, 2]}. The computational process of W SND and AMLA is provided below: WSND =

1 3 (0

+ 0 + 2) + 13 (0 + 0 + 2) + 21 (0 + 0) = 0.22 3∗2

Annotation Quality Measurement in Multi-Label Annotations

AMLA =

1 1 1 3 (2 + 2 + 0) + 3 (2 + 2 + 0) + 2 (2 + 2)

3∗2

35

= 0.78

MLA does not based on the intersection of annotations from all raters, it calculates inter-rater agreement between every individual and any other individual. At the same time, the contribution of each rater includes the consideration of his own label distribution. In other words, if a rater provided three labels in one item, each label’s weight is 1/3, if a rater provided M ( j) labels in one item, each label’s weight is 1/M ( j) . The agreement coefficient of each rater is calculated based on the others. 3.2 MLA’s Compatibility of Other Agreement Measures MLA is compatible with the traditional agreement measures, such as Krippendorf’s alpha. In Krippendorf’s alpha (multiple raters, multiple classes, and nominal data), as defined in [8]: 

Ao − Ae Do = = α =1− De 1 − Ae

c Occ npair

1−



c nc (nc −1) npair (npair −1)  c nc (nc −1) npair (npair −1)



(6)

where Do (Ao ) is the observed disagreement (agreement), De (Ae ) is the disagreement (agreement) expected by chance. It is worth noting that Ao (Ae ) is the same as Po (Pe ) in Eq. 1, they are only two different representations. nc is the number of class c appeared in the whole samples (or items), npair is the pairable value over all items. Further derive Ao :  Number of c − c pairs in item u  u mu −1 c Occ Ao = = (7) npair npair where c-c pairs are values on the diagonal of coincidence matrix appears between raters, u ∈ [1, L], L is the sample size by definition, ncu represents the number of c class in item u. P is the permutation function. Occ =

 P(ncu , 2) u

mu − 1

=



ncu ! (ncu −2)!

u

mu − 1

=

 ncu (ncu − 1) u

mu − 1

(8)

mu equals to N when there is no missing data, so that:   Ao =

c

u

ncu (ncu −1) N −1

npair

=

  ncu (ncu − 1) c

u

npair (N − 1)

(9)

In multiclass scenario, the value of M ( j) is 1 when calculating AMLA . Equation 5 can be simplified as: (10)

36

S. Li et al.

For the whole samples, the observed agreement of MLA is: (11) According to the Krippendorff’s alpha [8], N*L equals to npair , so that: Ao =

  ncu (ncu − 1) c

u

npair (N − 1)

=

  ncu (ncu − 1) = Ao−MLA N ∗ L(N − 1) c u

(12)

Since there are many discussions on Krippendorff’s alpha’s embrace of other statistics, in this paper we will not further discuss the compatibility with other methods. 3.3 Data Simulation Toolset We create a toolset for data generation and algorithm verification. The toolset enables users to specify the distribution of data and generate annotations that mimic different annotators. It also computes the agreement coefficient for the simulated annotations. Various annotation scenarios are supported, such as multiclass with multiple raters, multi-label with multiple raters etc., which could be set easily through configurations. The bootstrapping approach from Marchal’s bootstrap [9] is used to carry out expected agreement in the toolset. In addition to considering the distribution of categories, the toolset takes care of combination of classes drew from labels. The examples of data generation in following section show its simplicity and flexibility.

4 Experiments In this section, we will demonstrate the effectiveness of the agreement mechanism MLA we proposed through experiments. We compare the agreement measure with traditional methods, including Cohen’s kappa, Fleiss’s kappa and Krippendorff’s alpha (named as Cohen, Fleiss and kf_alpha in the later literature). Considering those traditional methods cannot be applied to the case of multiple labels with multiple raters directly, we design the experiment cases of two raters to evaluate MLA with others under the same conditions. For the case of multiple labels, we use agreement coefficient k to demonstrate the finegrained measure of agreement. The data for experiments is generated by simulations with annotation distribution configured. 4.1 Data Generation We generate simulated annotation data to compare the agreement measures. There are two types of simulated data. One is for two raters and the other is for multiple raters. The simulation of two raters’ data is generated based on the method introduced in [7]. For multi-class and multi-label data, the percentages of each class and multiple labels could be configured to explore the performance under cases with varied agreements. The most complex data generation process is described as Table 2. Since data is generated randomly with pre-set configurations, the dataset will be different for each experiment with varied distributions.

Annotation Quality Measurement in Multi-Label Annotations

37

Table 2. Data Generation Process for Multi-label Annotation. Data Generation Steps

Example

1. Setting the percentage of each class in the 1. Setting [0.1, 0.2, 0.3, 0.4] ratio for 4 classes whole dataset 2. Setting the percentage of multiple labels

2. Setting 0.2 ratio for data with two labels and 0.05 ratio for data with three labels

3. Data generation based on above configuration

3. Dataset Summary: size: 100, data with [one, two, three] labels: [75, 20, 5] number of data for each class: 13, 28, 38, 46

4.2 Multi-class Agreement Measure In order to demonstrate the compatibility with the traditional methods, we contrast the cases of two raters with multiple classes, and multiple raters with multiple classes separately. We adopt Fleiss and kf_alpha to compare with MLA. To keep the measure consistency, we convert the indices of kf_alpha, i.e., diversity of observation (Do ) and diversity of expectation (De ), to the agreement coefficients Po and Pe which are defined in Eq. 12 and Eq. 13 respectively: Po = 1 − Do

(13)

Pe = 1 − De

(14)

Two Raters Multiple Classes. We set 4 classes for this experiment and the total data length is 800. The consistency percentage is about 56.25%, which indicates the percentage of consistency labeling is about such in the whole dataset. Table 3 shows the results. Using k to describe the annotation agreement, we can see that MLA achieves the same level as the two others. Table 3. Agreements Between Two Raters for Multi-classes. Name

k

Po

Pe

Fleiss

0.3371

0.5625

0.3400

kf_alpha

0.3375

0.5625

0.3396

MLA

0.3541

0.5625

0.3226

Multiple Raters Multiple Classes. Krippendorff’s alpha (kf_alpha) is a generalized approach compared with Cohen and Fleiss. In the following experiments, we will compare MLA with kf_alpha. For Cohen and Fleiss, either it cannot support multiple raters or just for rater-rater pairs evaluation. We do not include them in this experiment. The dataset contains 4 classes and was labeled by 3 raters. The total length of data is 800

38

S. Li et al.

and the consistency percentage is 37.5%. For multi-class case, each class has only one label. kf_alpha can support the agreement evaluation under this condition. We show the agreement results from kf_alpha and our proposed MLA in Table 4. Table 4. Agreements Among Multiple Raters for Multiple Classes. Name

k

Po

Pe

kf_alpha

0.2954

0.4858

0.2703

MLA

0.3390

0.4858

0.2221

We note that there is a system difference for Pe between kf_alpha and MLA. Compared with MLA, the kf_alpha uses joint probability to generate the simulated data. If we adopt the same joint probability, we could obtain very similar results as kf_alpha, shown in Table 5. We use individual probability due to the difference elimination caused by joint probability. Table 5. Agreements Among Multiple Raters for Multiple Classes Using Joint Probability. Name

k

Po

Pe

kf_alpha

0.3401

0.5183

0.2701

MLA

0.3397

0.5183

0.2705

4.3 Multi-label Agreement Measure Based on our survey, there is no published method to evaluate the annotation agreement for multiple labels assigned by multiple raters. The multi-label data has been illustrated as Table 6, in which each item has more than one labels. The number 1, 2, 3 or 4 means the class index annotated. Table 6. Slice of Data with Multiple Labels. Item

Rater_01

Rater 02

Rater 03

Annotated Results

Item1

4

3

4

[[4], [3], [4]]

Item2

2

3,4

3

[[2], [3,4], [3]]

Item3

1,3,4

2,3

3

[[1,3,4], [2,3], [3]]

We generate 1000 samples, and the consistency percentage is set ascending to explore MLA’s applicability for different situations. We use agreement coefficient k to measure

Annotation Quality Measurement in Multi-Label Annotations

39

the agreement. Figure 3 shows the comparison between consistency rate configured and coefficient k calculated by MLA for multiple labels. For the low, middle, and high consistency rate, MLA’s k can reflect the consistency of labeling. Therefore, MLA has general applicability for multiple labels’ agreement measure.

Fig. 3. Comparison of Consistency Rate and Coefficient k for Multiple Labels.

4.4 Fine-grained Agreement Measure To demonstrate MLA’s capability of fine-grained agreement measure, we consider the case of multiple labels with two raters. We set six classes for two raters and each rater may assign more than one class to one item. We compare the observed agreement with Augmented Kappa, F1 [9], 1-jaccard_distance [14] and 1-masi_distance [15], marked as Aug_Kappa, F1, 1-jd and 1-md respectively in Table 7. From Table 7, we note that Aug_Kappa cannot reflect the real agreement between raters. For Item10 and Item11, the labels by two raters are identical, but the agreement measures are varied and not equal to 1. In the following part, we just discuss the left agreement coefficients in Table 7. F1, 1-jd, 1-md and MLA approximately show the agreement trends with changing of data. Compared with F1 and 1-jd, our method MLA considers the agreement probability for each rater. Taking Item1 and Item2 as examples, we discover each rater has 0.5 probability of agreement. For Item1, Rater1 labeled as [1, 6] and Rater2 labeled as [2–5], there is no common label for the two raters. The total agreement is 0∗0.5 + 0∗0.5 = 00∗0.5 + 0∗0.5 = 0. For Item2, Rater1 labeled as [1, 2, 6] and Rater2 labeled as [2–5]. The common label is 2. The total agreement is 1/3∗0.5 + 1/4∗0.5 = 0.29171/3∗0.5 + 1/4∗0.5 = 0.2917. We compare the results of Item3 and Item4. The F1 and 1-jd values are not changed while the MLA values are 0.3333 and 0.3750 respectively. The coefficient 1-md has similar behavior as F1 and 1-jd. If we consider Item 3, 4 and 5 together and let Item5 be the standard (since Item5 consists of all the intersected elements in both 3 and 4). When we contrast Item4 and Item5, the disagreement is entirely caused by the label “5” in Item4 consisting of 4 labels. Similarly, the disagreement between Item3 and Item5 is caused by the label “6” consisting of 3 labels. In other words, the disagreement caused by label “5” is allocated on 4 labels (2, 3, 4, 5), whereas the disagreement caused by label “6” is allocated on 3 labels (1, 2, 6). Therefore, we could claim that label “6” is more contributed to the disagreement

40

S. Li et al. Table 7. Observed Agreements between Two Raters with Multiple Labels.

Item

Data

Item1

Aug_Kappa

F1

1-jd

1-md

MLA

[[1, 6], [2, 3, 0.0000 4, 5]]

0.0000

0.0000

0.0000

0.0000

Item2

[[1, 2, 6], [2, 0.0833 3, 4, 5]]

0.2857

0.1667

0.0550

0.2917

Item3

[[1, 2, 6], [2, 0.1111 3, 4]]

0.3333

0.2000

0.0660

0.3333

Item4

[[1, 2], [2, 3, 0.1250 4, 5]]

0.3333

0.2000

0.0660

0.3750

Item5

[[1, 2], [2, 3, 0.1667 4]]

0.4000

0.2500

0.0825

0.4167

Item6

[[1, 2], [2, 3]]

0.2500

0.5000

0.3333

0.1100

0.5000

Item7

[[1, 2, 3], [2, 0.1667 3, 4, 5]]

0.5714

0.4000

0.1320

0.5833

Item8

[[1], [1, 2]]

0.5000

0.6667

0.5000

0.3350

0.7500

Item9

[[1, 2, 3], [1, 0.2500 2, 3, 4]]

0.8571

0.7500

0.5025

0.8750

Item10

[[1, 2], [1, 2]]

0.5000

1.0000

1.0000

1.0000

1.0000

Item11

[[1, 2, 3, 4], [1, 2, 3, 4]]

0.2500

1.0000

1.0000

1.0000

1.0000

proportion in Item3 than label “5” in Item4, and thus making observed agreement in Item4 is higher than that of Item3. The traditional methods do not differentiate the two situations, while the proposed MLA can do. Compared with 1-jd which uses the intersection to calculate the agreement, MLA focuses more on the influence of personal labeling to the whole agreement evaluation. It can reflect the fine-grained agreement between two raters. In summary, from the experiments between two raters and among multiple raters, the proposed MLA can measure the fine-grained agreement both for multiple classes and multiple labels.

5 Conclusion Multi-label annotation is a better interpretation of an item referring to the ambiguity and uncertainty. In this paper, we develop a method MLA to obtain agreement measure for multi-label annotations when multiple raters are rating the data. The observed agreements for both annotated data and bootstrapped data are calculated via MLA, the latter is used as the expected agreement of the multi-label dataset.

Annotation Quality Measurement in Multi-Label Annotations

41

In this work, we also prove the compatibility between MLA and other traditional statistics, first by mathematical means then through the experiments that fall back to simplified scenarios supported by traditional statistics. In addition, more experiments demonstrate how it indicates the level of agreement in a fine-grained manner. Further, we develop a toolset to generate the multi-label annotations that mimic different annotators by following the pre-defined annotation distribution, at the same time, calculate various agreement coefficients for multiple scenarios including both multiclass and multi-label, both two-raters and multiple raters. Overall, MLA not only enables the multi-rater scenario but also offers intuitive and fine-grained interpretation of multi-label annotation agreement. Currently, MLA focuses on nominal data. How to expand it to other types of data is worthy of discussion. Besides, exploring the systematic differences between the bootstrapped expected agreement and non-bootstrapped one is an interesting topic. Acknowledgements. Thank all the anonymous reviewers and chairs for their meaningful suggestions.

References 1. PaddlePaddle AI studio. https://aistudio.baidu.com/aistudio/datasetdetail/181754 2. Liu, W.W., Wang, H.B., Shen, X.B., Tsang, I.W.: The emerging trends of multi-label learning, arXiv preprint arXiv: 2011.11197 (2021) 3. Xu, D., Shi, Y., Tsang, I.W., Ong, Y.S., Gong, C., Shen, X.B.: A survey on multi-output learning, arXiv preprint arXiv: 1901.00248 (2019) 4. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20(1), 37–46 (1960) 5. Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19(3), 321–325 (1955) 6. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971) 7. Cohen, J.: Weighed kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213–220 (1968) 8. Krippendorff, K.: Computing krippendorff’s alpha-reliability. https://repository.upenn.edu/ asc_papers/43. Accessed 25 Jan 2011 9. Marchal, M., Scholman, M., Yung, F., Demberg, V.: Establishing annotation quality in multilabel annotations. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 3659–3668, International Committee on Computational Linguistics, Gyeongju, Republic of Korea (2022) 10. Ji, A.Y., et al.: Abstract visual reasoning with tangram shapes. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 582–601, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022) 11. Beck, C., Booth, H., El-Assady, M., Butt, M.: Representation problems in linguistic annotations: ambiguity, variation, uncertainty, error and bias. In: 14th Linguistic Annotation Workshop, pp. 60–73, Association for Computational Linguistics, Barcelona, Spain (2020) 12. Zapf, A., Castell, S., Morawietz, L., Karch, A.: Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Med. Res. Methodol. 16, 93 (2016)

42

S. Li et al.

13. Zettersten, M., Lupyan, G.: Finding categories through words: more nameable features improve category learning. Cognition 196, 104135 (2020) 14. Passonneau, R.: Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. Communicative Events, Columbia University New York, New York, USA (2006) 15. Mohammadreza, H., Doyle, T.E., Samavi, R.: MLCM: multi-label confusion matrix. IEEE Access 10, 19083–19095 (2022) 16. Kim, Y., Kim, J.M., Akata, Z., Lee, J.: Large loss matters in weakly supervised multi-label classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14136–14146 (2022)

Prompt-Free Few-Shot Learning with ELECTRA for Acceptability Judgment Linqin Li1 , Zicheng Li1 , Ying Chen2 , Shoushan Li1(B) , and Guodong Zhou1

2

1 School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China [email protected], {lishoushan,gdzhou}@suda.edu.cn College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, Beijing, China [email protected]

Abstract. Few-shot learning remains a great challenge for the task of acceptability judgment that identifies whether a sentence is acceptable or unacceptable. In this paper, we propose a prompt-free learning approach, namely PF-ELECTRA, to few-shot acceptability judgment. First, we leverage a pre-trained token replaced detection model, ELECTRA, as our basic few-shot learner to deal with the challenge of data distribution difference. Second, we design a prompt-free few-shot learning strategy that uses both the maximal unacceptability score for a single token and the overall unacceptability score for the whole sentence to judge the acceptability. Empirical studies validate the effectiveness of PF-ELECTRA on challenging few-shot acceptability judgment. To the best of our knowledge, it is the first work that improves the performance of few-shot acceptability judgment based on standard fine-tuning. Keywords: Acceptability judgment

1

· Few-shot learning · Prompt-free

Introduction

Acceptability judgment that judges whether a sentence is acceptable to a native speaker has proven to be useful for grammar checking [3] and fluency evaluation [8]. Although deep learning has achieved great success in acceptability judgment, a large amount of annotated data is required, such as the Corpus of Linguistic Acceptability (CoLA) [22]. However, due to the high annotation cost, large-scale human-annotated data for acceptability judgment is often unavailable, particularly for low-resource languages. Thus, few-shot learning which needs only a small number of annotated samples is necessary for acceptability judgment. Recently, prompt-based learning which is based on pre-trained language models (PLMs) and pre-defined templates has achieved remarkable performance in many few-shot learning tasks [1,6,11,17]. However, some studies have demonstrated that the prompt-based approaches perform rather poorly on acceptability c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 43–54, 2023. https://doi.org/10.1007/978-3-031-44696-2_4

44

L. Li et al.

Unacceptability Case (1): Maryann should learning. Case (2): What did Bill buy potatoes? Case (3): The book wrote David. Acceptability Case (4): They can sing.

Fig. 1. Some examples of acceptability judgment.

judgment in a few-shot setting [6,11], which might be attributed to the following two problems. First, PLMs have limited effectiveness for acceptability judgment due to the difference in data distribution between the language model and the task of acceptability judgment. Specifically, the most popular PLMs are based on masked language modeling, e.g., BERT [4]. The data used to train the masked PLMs are texts without grammatical errors, which differs much from the corpus for acceptability judgment containing a large number of ungrammatical sentences [6]. Second, pre-defined templates are essential in prompt-based few-shot learning approaches but they are not applicable to acceptability classification. Particularly, prompting tokens in a pre-defined template are usually task-specific [12,16]. That is, prompting tokens should be capable to describe class labels. Since grammatical errors can happen in any token in a sentence, there is no suitable prompting token to represent all possible grammatical errors. For example, as illustrated in Fig. 1, the token “leaving” in Case (1) leads to the morphological violation, so it can be the prompting token indicating grammatical errors. Similarly, “potatoes” can be the prompting token for Case (2). Furthermore, there is even no prompting token for Case (3) (an ungrammatical sentence) and Case (4) (a grammatical sentence). On the other hand, as shown in the template used in Fig. 2(a), “correct” and “incorrect” are often used as prompting tokens to represent the two classes (“acceptability” and “unacceptability”) in acceptability judgment [6,11]. However, in real-world data, the token “incorrect” is seldom used to indicate the existence of an ungrammatical problem in texts, e.g., “Many of the figures were incorrect”. To address the two issues, in this paper, we propose a novel few-shot learning approach, namely prompt-free ELECTRA (PF-ELECTRA), for acceptability judgment, as shown in Fig. 2(c). First, to solve problem 1, instead of using masked PLMs, we leverage a pre-trained token-replaced detection model, ELECTRA [2], as our basic few-shot learner. There are two main components in ELECTRA: a generator and a discriminator. Since the goal of the discriminator is to distinguish whether each token in an input sentence is replaced or not, and the pre-trained ELECTRA is trained with the data containing many corrupted (and ungrammatical) sentences, which is similar to the data used in acceptability judgment.

Prompt-Free Few-Shot Learning with ELECTRA original

original

original

original

original

original

original replaced original

original

original

.

[SEP]

45

Discriminative Head ELECTRA [CLS]

Maryann

should

leaving

.

This

is

correct incorrect Template

(a) Prompt-based ELECTRA unacceptable

Maximal Replaced Probability unacceptable

unacceptable

Task Head

Task Head

original

original

Maryann

should

leaving

(b) Standard fine-tuning

original

Discriminative Head ELECTRA

ELECTRA [CLS]

original replaced original

.

[SEP]

[CLS]

Maryann

should

leaving

.

[SEP]

(c) PF-ELECTRA (Our approach)

Fig. 2. Illustration of (a) Prompt-based ELECTRA; (b) Standard fine-tuning; (c) PFELECTRA (Our approach).

Second, to solve problem 2, instead of using prompt-based few-shot learning which needs pre-defined hand-crafted templates, we design a prompt-free few-shot learning approach that mainly leverages the maximal unacceptability score for a single token in the sentence for judging the acceptability of a sentence. Furthermore, we also take into account the sentence-level unacceptability score to augment the acceptability classification model. The motivation for using the sentence-level score is that some unacceptable sentences have no apparent prompting tokens used for acceptability judgment and the whole information of the sentence is needed. For instance, Case (3) contains no replaced tokens, but it is still unacceptable because the order of some tokens is reversed. Finally, we present a systematic evaluation of our approach on the CoLA benchmark, the most popular corpus for acceptability judgment. The extensive experiments demonstrate that given a small number of training samples, PFELECTRA can effectively boost few-shot performance. Our main contributions can be summarized as follows: – We propose a prompt-free few-shot learning approach with ELECTRA (a pretrained token-replaced detection model), which carries out the fine-tuning of ELECTRA without prompt and obtains a token-level unacceptability score for an input sentence. – We propose a joint learning approach, PF-ELECTRA, for few-shot acceptability judgment. It optimizes acceptability classification models using a loss function from both the token-level fine-tuning and standard (sentence-level)

46

L. Li et al.

fine-tuning approaches. To the best of our knowledge, it is the first work that overcomes the failure of prompt-based approaches to few-shot NLP tasks. – Our PF-ELECTRA approach is evaluated on the CoLA benchmark. Experimental results demonstrate that PF-ELECTRA can effectively improve the performance of few-shot acceptability judgment.

2 2.1

Related Work Acceptability Judgment

In recent years, most studies on automatic acceptability judgment focus on the CoLA corpus [22] because it is the first large-scale corpus of English acceptability judgment containing more than 10k sentences. Several neural networks have been proposed for the binary classification of acceptability judgment on CoLA. For example, [22] proposes a BiLSTM encoder model. With the popularity of PLMs, such as GPT and BERT, ALBERT [10] and StructBERT [21] are proposed based on variations of PLMs for acceptability classification. However, these supervised learning models need large amounts of labeled data, which severely limits their application to new languages with small acceptability datasets due to annotation cost [20]. Therefore, in this paper, we work on prompt-free learning for few-shot acceptability judgment. 2.2

Prompt-Based Few-Shot Learning

Prompt-based learning has shown success in a few-shot setting. There are several fashions to use prompts in pre-trained language models, and the most popular one is prompt-based fine-tuning that reformulates downstream tasks as masked language modeling tasks through prompts to reduce the gap between PLM’s pretraining and fine-tuning [12,19]. To achieve better few-shot results, the PLMs are often large-scale so as to learn extra knowledge from PLMs. Moreover, the used prompts can be either manually designed [17] or automatically generated [6]. More recently, to reduce the high computation cost of fine-tuning large-scale PLMs, several studies use prompts in small-scale pre-trained models and achieve competitive few-shot performance. For example, [18] fine-tunes PLM with multiple prompts and then train an ensemble learner for few-shot learning. [11,23,24] use ELECTRA, a pre-trained token-replaced detection model, to do prompting. In contrast, [6] fine-tunes small language models in a few-shot setting and achieve better performance than BERT-based approaches on many tasks except acceptability judgment. In this paper, to overcome the shortcoming of prompts for acceptability judgment, we design a prompt-free approach based on ELECTRA. Note that, although [9] also proposes a prompt-free few-shot learning approach, their approach needs to add multi-token label embeddings. This operation is similar to soft-prompting in prompt-based approaches. In contrast, our approach doesn’t need any additional multi-token label embedding.

Prompt-Free Few-Shot Learning with ELECTRA

3 3.1

47

Preliminaries Standard Fine-Tuning

Given a pre-trained language model L, the standard fine-tuning is usually specific for a downstream task. Firstly, a given sentence x = {x1 , . . . , xn } is converted to the input sequence x ˆ = [CLS] x [SEP ], and then L is used to map x ˆ to a sequence of hidden vectors h = {hCLS , h1 , . . . , hn , hSEP }. Then, the taskspecific head hCLS is fine-tuned for the probability computation of linguistic unacceptability, as illustrated in Fig. 2(b). 3.2

Prompt-Based Few-Shot Learning with ELECTRA

Replaced token detection pre-training task is first introduced in ELECTRA [2]. There are two neural networks in ELECTRA: a generator and a discriminator. The generator is a small masked language model which predicts the original identities of the masked-out tokens, and the discriminator is a discriminative model which predicts whether each token in an input sentence is an original or a replacement. Moreover, the generator is trained by original sentences, and the discriminator is trained by original sentences and their corresponding corrupted sentences (i.e., the one whose masked-out tokens are replaced by the generator). It is often a case that the replaced tokens predicted by the discriminator cause the ungrammatical problem, so the corrupted sentences are often unacceptable. Recently, ELECTRA has been proven to be effectively prompt-tuned in a few-shot setting [11,23,24]. Given a discriminator G, a template is first used to convert the input sentence x = {x1 , . . . , xn } into a prompted text xprompt = x T his is correct incorrect., and the prompted text is converted to the input sequence x ˆ = [CLS] xprompt [SEP ], where “correct” and “incorrect” are prompting tokens (i.e., label description tokens) in the template for the two classes: “acceptability” and “unacceptability”. Then, the discriminator G is used to map x ˆ to a sequence of hidden vector h = {hCLS , h1 , . . . , hn , hSEP }, and the discriminative head hDis is used to predict whether each token in an input sentence is an original or a replacement. Finally, the prompting token with the maximal probability of being original is chosen for prediction. For example, in Fig. 2(a), the input sentence “Maryann should leaving.” is first converted into a new one “Maryann should leaving. This is correct incorrect.” Then, ELECTRA predicts the probability of being original for each token. Consequently, the prompting token “incorrect” is more original (less replaced) than “correct”, which indicates that the sentence is unacceptable.

4 4.1

Methodology Overview

As illustrated in Fig. 2(c), we propose PF-ELECTRA, which consists of three components: (1) token-level fine-tuning that computes the token-level unacceptability probability, (2) sentence fine-tuning that computes the sentence-level

48

L. Li et al.

unacceptability probability, and (3) joint fine-tuning which fuses the two unacceptability probabilities for model learning. 4.2

Token-Level Fine-Tuning

The goal of acceptability judgment is to determine whether a given sentence is acceptable or not. In other words, it is to distinguish whether there exists at least one token implying the unacceptability of the sentence. Furthermore, it is often a case that some tokens in a sentence are replaced, which leads to an unacceptable sentence. Motivated by this observation, we design a tokenlevel fine-tuning of ELECTRA, which takes full advantage of every token in the sentence for acceptability judgment. That is, the token with the maximal probability of being replaced is selected to predict whether the sentence is acceptable or not. Firstly, for each token xt in the input sentence x = {x1 , . . . , xn }, we use the discriminative head of ELECTRA to compute its probability of being replaced Prep (xt ) as follows: T ht ) Prep (xt ) = sigmoid(wDis

(1)

where wDis is the parameters of the discriminative head in pre-trained ELECTRA and ht is the hidden vector for xt . Thus, a probability sequence, Prep (x) = {Prep (x1 ), . . . , Prep (xt )} is obtained for sentence x. Then, the token with the maximal probability of being replaced is selected (see Eq. 2), and its probability serves as the token-level unacceptability probability Ptoken (x) for sentence x (see Eq. 3). tmax = arg max Prep (xt )

(2)

Ptoken (x) = Prep (xtmax )

(3)

t

where tmax is the position of the selected token. Finally, a loss function is defined for the token-level fine-tuning as follows: Losstoken (x) = −y log Ptoken (x) − (1 − y)(1 − log Ptoken (x))

(4)

where y is the label for sentence x, and y = 0 (or y = 1) means that the sentence is acceptable (or not). The token-level fine-tuning considers the token with the highest probability of being replaced as a token which leads to the unacceptability of the whole sentence. For instance, as illustrated in Fig. 2(c), the token “leaving” leads to the morphological violation, and meanwhile it is identified as a replacement by the discriminator of ELECTRA, which indicates that the sentence is unacceptable.

Prompt-Free Few-Shot Learning with ELECTRA

4.3

49

Sentence-Level Fine-Tuning

Unlike token-level fine-tuning, which centers on a single token, sentence-level fine-tuning focuses on the whole sentence. In this paper, we use the standard fine-tuning to obtain a sentence-level unacceptability probability. As illustrated in Fig. 2(c), a task-specific head is trained to compute the sentence-level unacceptability probability Psen (x) as follows. T hCLS ) Psen (x) = sigmoid(wCLS

(5)

where hCLS is the hidden vector of [CLS], and wCLS denotes the parameters of the task-specific head. Then, a loss function for the sentence-level fine-tuning is defined: (6) Losssen (x) = −y log Psen (x) − (1 − y)(1 − log Psen (x)) 4.4

Joint Fine-Tuning

In the training phase, to effectively leverage the two losses from the token-level fine-tuning and sentence-level fine-tuning, the fine-tuning of PF-ELECTRA uses the following joint loss function: Loss(x) = Losstoken (x) + Losssen (x)

(7)

In the testing phase, for an input sentence x, its sentence-level probability Psen (x) and token-level probability Ptoken (x) are obtained, and the average of the two probabilities is used as the unacceptability probability for the sentence P (x). If P (x) > 0.5, the sentence is unacceptable. Otherwise, it is acceptable.

5

Experiment

In this section, we carry out experiments of acceptability judgment on CoLA in a few-shot setting. The experimental results demonstrate that our PF-ELECTRA greatly outperforms previous approaches. 5.1

Experimental Settings

Our experimental settings are introduced, including datasets, baselines, evaluation metrics, and implementation details. Datasets: Following the experiments of [6], we use the Corpus of Linguistic Acceptability (CoLA) to perform few-shot learning on acceptability judgment. K samples in each class are employed to form the training data and other K samples in each class are employed to form the development data. Two different datasets are used as test data, among which one is used for testing overall model performance and the other is used to examine model performance on sentences containing a specific type of phenomena (or features). Specifically, the two test datasets are introduced as follows:

50

L. Li et al.

– CoLA Overall Test Set: It contains 1,043 English sentences labeled as either acceptability or unacceptability and it is also used as test dataset in [6]. – CoLA Phenomenon-specific Test Set: To test the model capability of learning grammatical generalization, the grammatical features in the development set of CoLA are annotated. For brevity, [22] selects various features, which correspond to different subsets of the CoLA development data, respectively. In this paper, we use features (Simple, Adjunct, Comp, to-VP, Arg altern, Binding, and Question) for experiments. Baselines: We compare our proposed PF-ELECTRA with the following baselines. – RoBERTa-FT [14]: The standard fine-tuning of RoBERTa. – LM-BFF [6]: A prompt-based fine-tuning with discrete textual templates, and textual templates are automatically searched. – P-Tuning [13]: A prompt-based approach that utilizes continuous prompts learned by LSTM. – DART [25]: A prompt-based approach that views unused tokens in the language model as template and label tokens. – WARP [7]: A prompt-based approach that searches for template and label embeddings in a continuous embedding space, where the pre-trained model is frozen. – PERFECT [9]: A prompt-free few-shot learner with task-specific adapters, where the pre-trained model is frozen. – ELECTRA-Prompt [11]: The prompt-based prediction using ELECTRA, where two prompting tokens (“correct” and “incorrect”) are used to represent the two classes: “acceptability” and “unacceptability”. Considering the impact of different templates on model performance, we employ the following templates and report the results of the best-performed template (i.e., Template 1) in experiments. Template 1: This is correct (incorrect). Template 2: This is (not) correct. Template 3: This sentence is correct (incorrect). Template 4: This sentence is grammatically correct (incorrect). – ELECTRA-SenFT [2]: The standard fine-tuning of ELECTRA, i.e., only the sentence-level fine-tuning of ELECTRA in our approach. – ELECTRA-TokFT: Only the token-level fine-tuning of ELECTRA in our approach. Evaluation Metrics: Three metrics are used to evaluate the performance of acceptability judgment approaches in this paper: accuracy, F1-score, and Matthews Correlation Coefficient (MCC) [15]. Compared to accuracy and F1score which favor classifiers with a majority-class bias, MCC is designed for unbalanced binary classifiers and thus it is more suitable for acceptability judgment on CoLA [22]. In addition, we also use precision, recall, and F1-score (P/R/F1) to evaluate the performance of each class.

Prompt-Free Few-Shot Learning with ELECTRA

51

Table 1. The results of various acceptability judgment approaches for K = 16 and K = 8 (per class) on CoLA Overall Test Set. Model

Unacceptable P R F1

Acceptable P R F1

Acc. Macro-F1 MCC

K = 16 RoBERTa-FT LM-BFF P-tuning DART WARP PERFECT ELECTRA-Prompt PF-ELECTRA ELECTRA-SenFT ELECTRA-TokFT

39.3 36.3 31.4 30.9 36.6 35.7 41.7 67.7 56.2 66.2

67.3 51.4 34.7 53.5 38.4 49.8 75.8 59.6 55.9 55.5

46.6 39.1 26.3 37.6 35.8 40.0 52.4 63.1 55.2 60.2

75.9 71.8 68.8 70.9 71.8 71.4 80.3 83.0 80.2 81.5

46.4 54.9 64.6 48.6 69.3 56.9 47.2 87.2 79.2 87.3

54.9 59.0 63.6 54.7 70.0 61.7 55.6 85.0 79.4 84.3

52.8 53.8 55.4 50.1 59.8 54.7 56.0 78.7 72.0 77.5

50.8 49.1 44.9 46.2 52.9 50.9 54.0 74.0 67.3 72.3

14.4 9.7 −0.4 2.0 7.9 6.9 22.3 48.7 35.7 45.1

K =8

41.1 30.1 31.5 29.6 33.5 33.2 37.2 66.2 51.8 63.4

40.6 43.6 44.2 54.0 26.5 39.4 36.0 56.5 51.4 56.3

35.8 34.0 33.6 37.3 23.9 34.6 33.5 60.1 50.7 59.6

73.1 68.9 69.5 68.9 69.3 69.8 71.2 81.9 78.3 81.4

68.7 55.6 55.9 44.2 74.4 63.3 70.6 86.5 77.4 85.3

67.8 59.5 58.4 51.2 69.7 65.4 69.5 83.9 77.5 83.3

60.0 51.9 52.3 47.2 59.6 55.9 59.9 77.2 69.4 76.4

51.8 46.7 46.0 44.2 46.8 50.0 51.5 72.0 64.1 71.4

11.3 0.8 0.5 −1.7 1.5 2.8 7.3 45.4 29.3 43.2

RoBERTa-FT LM-BFF P-tuning DART WARP PERFECT ELECTRA-Prompt PF-ELECTRA ELECTRA-SenFT ELECTRA-TokFT

Implementation Details: Following the setting used in [6], we use five dataset splits of CoLA. To alleviate instability issues [5,26], the average performance is used as the final result. For experiments using the few-shot setting, we randomly sample the training data and keep only K (8 or 16) labeled samples per class to train a model. 5.2

Results on CoLA Overall Test Set

Table 1 shows the results when K = 16 and K = 8. First, in all three evaluation metrics, our PF-ELECTRA significantly outperforms all baselines, and in most cases, the ELECTRA-based approaches (ELECTRA-Prompt, PF-ELECTRA, ELECTRA-SenFT, and ELECTRA-TokFT) outperform the ones not using ELECTRA. This confirms that the pre-training of ELECTRA which identifies replaced tokens in sentences effectively captures the data distribution of CoLA. Furthermore, among the four ELECTRA-based approaches, PF-ELECTRA outperforms both the sentence-level fine-tuning (ELECTRA-SenFT) and the token-level fine-tuning (ELECTRA-TokFT) on account of its joint fine-tuning

52

L. Li et al.

Table 2. The MCC scores of different acceptability judgment approaches on CoLA Phenomenon-specific Test Set. Model

Simple Adjunct Comp Clause to-VP Arg Altern Binding Question

RoBERTa-FT

19.8

7.8

19.4

11.7

14.5

9.5

LM-BFF

3.4

5.8

10.9

7.1

6.0

0.3

11.5 4.6

P-tuning

−12.0

−6.4

−3.8

−0.4

−0.8

−3.6

1.9

DART

7.5

−1.1

5.1

2.5

2.5

−1.3

−0.2

WARP

19.2

10.5

6.6

10.3

6.1

5.7

4.9

PERFECT

14.7

1.6

9.3

5.6

9.6

12.1

2.3

ELECTRA-Prompt

37.0

12.2

23.6

24.9

21.1

20.0

14.4 36.6

PF-ELECTRA

68.4 37.7

45.7

47.9

49.3

56.8

ELECTRA-SenFT

55.5

20.8

30.2

30.4

37.0

32.6

21.4

ELECTRA-TokFT

64.3

35.1

41.6

43.8

45.6

55.3

30.7

design and performs better than the prompt-based approaches (ELECTRAPrompt) because of its prompt-free design. It reveals the limited application problem of pre-defined prompts for acceptability judgment. What’s more, compared to ELECTRA-SenFT, ELECTRA-TokFT achieves much better performance, e.g., 45.1 vs. 35.7 in terms of MCC when K =16. This indicates that our token-level fine-tuning makes more contributions to PF-ELECTRA. Finally, among the approaches based on masked PLMs (i.e., RoBERTa), the fine-tuning approach (RoBERTa-FT) performs much better than prompt-based approaches (LM-BFF, P-tuning, DART, and WARP), no matter using discrete or continuous prompts. This also confirms the limitation of pre-defined prompts for acceptability judgment. 5.3

Results on CoLA Phenomenon-Specific Test Set

To investigate model capability, we use the training data (K = 16) to train a model with each approach and then employ it to test on the CoLA Phenomenonspecific Test Set. We show the performances in Table 2. In Table 2, the MCC scores of PF-ELECTRA on the sentences containing a given feature greatly outperform the ones of baseline approaches, reflecting its strong capability of learning grammatical generalization. Moreover, on every feature, the ELECTRAbased approaches outperform the ones based on PLMs, revealing the effectiveness of ELECTRA for acceptability judgment.

6

Conclusion

In this paper, we propose PF- ELECTRA, which adopts a joint learning approach using both token-level and sentence-level fine-tuning. Specifically, to deal with the inconsistent data distribution between PLMs and acceptability judgment, we use a pre-trained token-replaced detection model, ELECTRA, as our

Prompt-Free Few-Shot Learning with ELECTRA

53

basic few-shot learner. Furthermore, to solve the limited application of predefined templates of prompt-based learning to acceptability judgment, we design a prompt-free few-shot learning approach that uses the token with the maximal probability of being replaced to judge the acceptability of a sentence. Experiments show that the joint fine-tuning used in PF-ELECTRA can effectively improve the performance of acceptability judgment in a few-shot setting. Although the effectiveness of our approach, PF-ELECTRA treats the sentence-level and token-level fine-tuning equally, which is sub-optimal. How to keep a balance between sentence-level and token-level fine-tuning is deserved for future study. Besides, the token-level fine-tuning focuses only on a single token, which ignores other important tokens. In future work, we will try to make the model consider all important tokens through the self-attention mechanism. Acknowledgments. This work was supported by a NSFC grant (No. 62076176), and a General Research Fund (GRF) project sponsored by the Research Grants Council Hong Kong (Project No. 15611021).

References 1. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020) 2. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations (2020) 3. Dahlmeier, D., Ng, H.T.: Correcting semantic collocation errors with L1-induced paraphrases. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 107–117 (2011) 4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019) 5. Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., Smith, N.: Finetuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305 (2020) 6. Gao, T., Fisch, A., Chen, D.: Making pre-trained language models better fewshot learners. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830 (2021) 7. Hambardzumyan, K., Khachatrian, H., May, J.: WARP: word-level adversarial reprogramming. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4921–4933 (2021) 8. Kann, K., Rothe, S., Filippova, K.: Sentence-level fluency evaluation: references help, but can be spared! In: Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 313–323 (2018) 9. Karimi Mahabadi, R., et al.: Prompt-free and efficient few-shot learning with language models. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3638–3652 (2022)

54

L. Li et al.

10. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2020) 11. Li, Z., Li, S., Zhou, G.: Pre-trained token-replaced detection model as few-shot learner. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 3274–3284 (2022) 12. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55(9), 1–35 (2023) 13. Liu, X., et al.: GPT understands, too. arXiv preprint arXiv:2103.10385 (2021) 14. Liu, Y., et al.: ROBERTA: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019) 15. Matthews, B.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Struct. 405(2), 442–451 (1975) 16. Min, S., et al.: Rethinking the role of demonstrations: what makes in-context learning work? arXiv preprint arXiv:2202.12837 (2022) 17. Schick, T., Sch¨ utze, H.: Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269 (2021) 18. Schick, T., Sch¨ utze, H.: It’s not just size that matters: small language models are also few-shot learners. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2339–2352 (2021) 19. Sun, T., Liu, X., Qiu, X., Huang, X.: Paradigm shift in natural language processing. Mach. Intell. Res. 19, 169–183 (2022) 20. Trotta, D., Guarasci, R., Leonardelli, E., Tonelli, S.: Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus. In: Findings of the Association for Computational Linguistics, EMNLP 2021, pp. 2929–2940 (2021) 21. Wang, W., et al.: StructBERT: incorporating language structures into pre-training for deep language understanding. In: International Conference on Learning Representations (2020) 22. Warstadt, A., Singh, A., Bowman, S.: Neural network acceptability judgments. Trans. Assoc. Comput. Linguist. 7, 625–641 (2019) 23. Xia, M., Artetxe, M., Du, J., Chen, D., Stoyanov, V.: Prompting ELECTRA: few-shot learning with discriminative pre-trained models. arXiv preprint arXiv:2205.15223 (2022) 24. Yao, Y., et al.: Prompt tuning for discriminative pre-trained language models. In: Findings of the Association for Computational Linguistics, ACL 2022, pp. 3468– 3473 (2022) 25. Zhang, N., et al.: Differentiable prompt makes pre-trained language models better few-shot learners. In: International Conference on Learning Representations (2022) 26. Zhang, T., Wu, F., Katiyar, A., Weinberger, K.Q., Artzi, Y.: Revisiting fewsample BERT fine-tuning. In: International Conference on Learning Representations (2021)

Dual Hierarchical Contrastive Learning for Multi-level Implicit Discourse Relation Recognition Jing Xu1,2 , Ruifang He1,2(B) , Haodong Zhao1,2 , Huijie Wang1,2 , and Lei Zeng1,2 1

2

College of Intelligence and Computing, Tianjin University, Tianjin, China {jingxu,rfhe,2021244138,wanghj s}@tju.edu.cn Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China

Abstract. Implicit discourse relation recognition (IDRR) is a challenging but vital task in discourse analysis, which aims at classifying logical relations between arguments. Previous work infuses ACCL takes external knowledge or label semantics to alleviate data scarcity, which either brings noise or underutilizes semantic information contained in label embedding. Meanwhile, it is difficult to model label hierarchy. In this paper, we make full use of label embedding as positives and negatives for our dual hierarchical contrastive learning framework, which contains two parts: 1) hierarchical label contrastive loss (HLCL), which promotes fine-grained labels to be more similar to correlative medium-grained labels than related coarse-grained labels. 2) arguments and connectives contrastive loss (ACCL), which makes arguments aggregate around correlative fine-grained labels. The two modules interact with each other, making the similarity between arguments and correlative fine-grained labels are higher than that with related coarse-grained labels. In this process, the multi-level label semantics are integrated to arguments, which provides guidance for classification. Experimental results show that our method achieves competitive performance against state-of-theart system. Keywords: Implicit Discourse Relation Recognition Label Hierarchy · Contrastive Learning

1

· Data Scarcity ·

Introduction

Implicit discourse relation recognition (IDRR) aims to identify the semantic connection between two arguments (e.g., clauses, sentences, and paragraphs) without the guidance of obvious connectives. Figure 1 shows a discourse instance with multi-level labels, Level2 label Asynchronous implies more refined semantics than Level1 label Temporal. Besides, annotators mark the connectives to join argument pairs, which can be regarded as the most precise semantic relation under Level3. As a foundational but challenging discourse analysis task, IDRR c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 55–66, 2023. https://doi.org/10.1007/978-3-031-44696-2_5

56

J. Xu et al.

provides technical support for machine translation [7], text summarization [5], event relation extraction [26] and needs in-depth study. Due to data scarcity, it is difficult to accurately learn the mapping from argument pair to discourse relation. Meanwhile, a key challenge of IDRR is to model label hierarchy. Previous work infuses external triples knowledge [6,14] or exploits label semantics [8,20,29] to alleviate the problem of data scarcity. Typically, [20] applies label hierarchy to select hard negatives. It regards arguments with different Level2 but the same Level1 label as negatives and separates them away and improves system performance. However, these approaches still have the following drawbacks: 1) underutilize semantic information contained in label embedding; 2) fail to construct hierarchical correlation constraints: the similarity between arguments and correlative fine-grained labels should be higher than that with related coarse-grained labels. On the one hand, label embedding enjoys a built-in ability to leverage alternative sources of information related to labels, such as label hierarchy or textual descriptions [33], and it contains specific semantics to guide the learning process for discourse relations. On the other hand, contrastive learning (CL) encourages positive samples to be closer while pushing negatives far away, adjusting the similarity between samples. A good idea is to incorporate label embedding into CL as positives or negatives, encouraging arguments to approach correlative labels and be away from noncorresponding labels. In this process, CL should embody hierarchical correlation constraints: arguments should be more similar to correlative fine-grained labels than related coarse-grained labels. In the paper, we propose a dual hierarchical contrastive learning (DHCL) method for multi-level IDRR: 1) hierarchical label contrastive loss (HLCL) selects fine-grained label as an anchor, encouraging the fine-grained label to approach its related medium-grained (or coarse-grained) label and be away from other fine-grained labels. We also set hierarchical correlation constraints as shown in Fig. 1, the similarity between the fine-grained label and its parent node label should be higher than that with the ancestor node label. 2) arguments and connectives contrastive loss (ACCL) takes argument as an anchor, pulling the argument with its related fine-grained label closer and pushing it away from other unmatched labels. HLCL and ACCL interact to fuse multiArg1: The crash of 1929 was followed by a substantial recovery. Arg2: The great Depression and awful bear market of the 1930s began.

Level1 (coarse-grained) : Temporal Level2 (medium-grained) : Asynchronous

... ...

Level3 : Precedence Level4 (fine-grained) : before

...

Fig. 1. An instance of multi-level IDRR. Our experiments cover all levels except Level3 and we describe different levels granularity involved. We define Level1, Level2, Level4 as coarse-grained, medium-grained, fine-grained level respectively.

DHCL for Multi-level IDRR

57

level label semantics into argument representation and alleviate the problem of data sparsity. And multiple samples are distributed close to related hierarchical labels, facilitating effective division of linear semantic space. In summary, our main contributions are three-fold: – Propose a novel dual hierarchical contrastive learning framework for multilevel IDRR, which adds hierarchical correlation constraints for arguments and multi-level labels. – In IDRR, we are the first to introduce label embedding as postives or negatives for CL and fuse multi-level label semantics into argument representation. – The experimental results and analysis on the PDTB prove the effectiveness of our model.

2 2.1

Related Work Implicit Discourse Relation Recognition

Argument Representation and Interactions. IDRR tends to effectively represent arguments and discover semantic interaction between argument pairs. Representation-based methods: [9,16,30] encode various text features into latent semantic representation space, [2,27] generate implicit data to enhance text representation. [11,32,34] design prompt learning templates to explore the knowledge of pre-trained language models (PLMs). Interactions-based methods: [15,17,35] employ attention mechanisms to mine relational cues between argument pairs. However, the studies ignore external context and label knowledge. Knowledge Enhancement. The research has two parts: Integration of external knowledge: [6,14] utilize triplet knowledge base to capture coreference features and resolve them. While this method may introduce noise. Discourse relation enhancement: [8] models argument-relation instances in geometric space. [29] regards IDRR as a label sequence generation task and leverages the hierarchical label dependence. [20] uses hierarchical labels to select hard negatives and facilitates contrastive learning. However, [20] treats arguments with the same first-level label as negatives and pushes them away, which is not conducive to first-level classification. We do differently: a DHCL loss is proposed to realize the hierarchical distribution and facilitate classification. 2.2

Contrastive Learning

CL aims at clustering similar samples closer and separating dissimilar samples away in a self-supervision way, which shows promising results in computer vision (CV) and natural language processing (NLP). Building high-quality positive and negative samples is vital, and a regular way to construct negatives is to pick in-batch samples. In NLP, dropout [4], back translation [3], masking [28] and adversarial attack are applied to generate positives. Further, [12] extends

58

J. Xu et al.

contrastive learning to supervised domains, which employs annotated data to improve feature representation ability of model. [1] proposes a dual contrastive learning method, which simultaneously learns the features of input samples and the parameters of classifiers in the same space. [33] presents a label anchored CL framework for language understanding task. Our dual hierarchical contrastive learning method adds hierarchical correlation constraints and fuses multi-level label semantics for arguments representation.

3

Model: DHCL

Problem Definition. Given an argument pair instance x = (arg1 , arg2 ) and M level of defined labels L = (l1 , l2 , ..., lM ), the target label sequence is y = (y1 , y2 , ..., yM ) where yi ∈ li . However, the existed work ignores the correlation between argument pairs and multi-level labels. Our model takes the correlation into account, which consists of three parts, as illustrated in Fig. 2. Contrastive Learning Root

... Pu sh

closer similarity away

o1

...

HLCL: label-label view

oM

... Pull

Mutual Effect

ACCL: arg-label view

...

Classifierlevel-1

Classifierlevel-M

...

... [

]

Labellevel-1

Labellevel-M

RoBERTa Encoder Template [CLS] +

+ [MASK] +

+ [SEP]

Hierarchical Labels

Fig. 2. The overall architecture of our DCL model.

3.1

Argument Pair and Discourse Relation Encoder

Prompt templatize: given an instance x = (arg1 , arg2 ), the template takes the form by concatenating two arguments and inserting some special PLM markers like [CLS], [MASK] and [SEP]: P = [CLS] +arg1 +[MASK]+arg2 +[SEP]. [CLS] and [SEP] mark the beginning and end of the sequence respectively, while [MASK] is pre-trained to predict a token in cloze task. The intention of designed template: the template transforms IDRR to a cloze task, where [MASK] is inserted at the intersection of two arguments to predict connectives and indicate discourse relations.

DHCL for Multi-level IDRR

59

The prompt template P is fed into PLM, with the multi-head attention mechanism as key component, which enables information interaction within and between arguments. And we extract the representation of v[MASK] for masked connectives, we call it r for convenience. V = [v0 ; ...; v[MASK] ; ...; vn ] = PLM(P ) r = v[MASK]

(1) (2)

where V is argument representation, V ∈ Rc×d , c is the number of tokens for prompt template and d is the dimension of each token. As shown in Eq. 3 and Eq. 4, M levels discourse relation labels are added to list L and fed into PLM, where li is the set of i-th layer labels, li = {li,j |0 1

t+1 1 t+1 i=1 Pmlm (wt |X\(i−1;i+k−2) ), if 0 ≤ t < k − 1 1 k k i=1 Pmlm (wt |X\(t−k+i;t+i−1) ), ifk − 1 ≤ t ≤ |x| − k + 1 |x|−t+1 1 Pmlm (wt |X\(|x|−i+2−k;|x|−i+1) ), if |x| − k + 1 < t |x|−t+1 i=1

(1) (2)

≤ |x| (3)

Notably, when k = 1, it is the first-order perplexity mathematical expression mentioned previously. When k > 1, it continuously masks k tokens firstly. When the tokens are not in the front k − 1 or back k − 1 positions of a sentence, it sequentially masks k times. On the one hand, if tokens are in the front k − 1 and back k − 1 positions, it masks the tokens from once to k − 1 times and from k − 1 times to once, respectively. Meanwhile, it is indispensable for the perplexity scores of tokens to be averaged. On

74

H. Bu et al.

the other hand, for k times masking operations, the final perplexity scores need to be averaged as well. Then, based on the perplexity scores of tokens calculated above, we can get the k-order perplexity score of a sentence. Finally, SOR(·) contains the MLM score of token about first-order perplexity and the MLM score SSR(·) of token about k-order perplexity when k > 1. Multi-order Perplexity. In order to obtain the final perplexity score, our method is designed by a weight fusion for the first-order, second-order and third-order perplexity, because they can evaluate sentence fluency at different levels: PLL(x) = α × PLL1 (x) + β × PLL2 (x) + (1 − α − β) × PLL3 (x)

(4)

where α, β are two adjustable weight factors, which have three different perplexities. The value of α and β are changed sequentially and the sum of the two is not greater than 1 and both are positive. For the specific implementation details, the grid search method is used. In the case of meeting the above two conditions, first set α to 0, and its growth step size is 0.01. At the same time, the β also starts from 0, the step size is 0.01, and the β becomes larger in turn. Similarly, when α is 0.01, β increases in step size in the same way, and so on. From the all possible results generated, we choose the best result as the parameter of our method.

4.3 Answer Generation After passing through the scoring module, we will obtain the scoring result set S = {s1 , s2 , s3 , ..., sl } for each sample sentence. Afterwards, scores in the scoring result set are sorted and the definition corresponding to the lowest score are found. Finally, by comparing the definition obtained with the correct definition of the sample, if it is consistent with it, the prediction is correct. If inconsistent, it is a prediction error.

5 Experiment 5.1 Experimental Setup All experiments will be run on RTX 8000 GPU, and our framework is built on Pytorch3 and Transformers4 . The framework we developed is evaluated using two datasets: the Malay abbreviation dataset we created and the test dataset from SDU@AAAI-22-Shared Task 2. For the evaluation on the Malay abbreviation dataset, we compare our framework, which is based on the BERT model, with an unsupervised method proposed by Ciosici et al. [4]. This method, utilizing the word2vec model, is currently one of the most advanced unsupervised approaches available. To ensure a fair comparison, we adopt the same parameter settings as the aforementioned method. Additionally, we construct two supervised methods, namely ADBCMM (concat) and ADBCMM (siamese), to further compare with our approach. In our experiments, we set the maximum sentence length 3 https://github.com/pytorch 4 https://github.com/huggingface/transformers

Towards Malay Abbreviation Disambiguation

75

to 128, the number of epochs to 20, the batch size to 4, and the learning rate to 5e-6. To demonstrate the effectiveness of our method, we explore different pre-trained models. Specifically, we employ two BERT models for the Malay dataset and five BERT models for the test dataset of SDU@AAAI-22-Shared Task 2. These variations allow us to thoroughly evaluate and showcase the performance of our framework. 5.2 Evaluation Metrics In the context of abbreviation disambiguation experiments, we employ four evaluation metrics to assess the performance of the various frameworks utilized. These metrics include accuracy, macro-precision, macro-recall, and macro- F1 value. Among these indicators, accuracy and macro- F1 value hold particular significance due to their noteworthy nature in effectively measuring the performance outcomes. The mathematical expression is as follows: acc =

TP + TN TP + TN + FP + TN

(5)

2 ∗ P i ∗ Ri P i + Ri n Fi Macro − F1 = i=1 1 n F1i =

(6) (7)

where i represents the category of each precision, recall and F1 value, The higher these two indicators, the better the performance. Table 2. Comparison of experimental results Model

Method

acc

Macro-P

Macro-R

Macro-F1

mBERT5

First Order PPLs

49.40%

87.98%

50.39%

64.08%

Second Order PPLs

52.77%

89.03%

53.62%

66.93%

Third Order PPLs

52.63%

88.51%

52.85%

66.18%

Our Method

53.16%

88.94%

53.80%

67.04%

ADBCMM (siamese)

36.34%

88.69%

36.73%

51.95%

ADBCMM (concat)

42.46%

81.22%

42.46%

55.76% (continued)

5 https://huggingface.co/bert-base-multilingual-cased.

76

H. Bu et al. Table 2. (continued)

Model

Method

acc

Macro-P

Macro-R

Macro-F1

bahasaBERT6

First Order PPLs

58.41%

89.04%

58.52%

70.62%

Second Order PPLs

62.75%

89.80%

62.09%

73.42%

Third Order PPLs

63.28%

89.90%

63.25%

74.26%

Our Method

63.90%

90.15%

63.57%

74.56%

ADBCMM (siamese)

40.34%

88.75%

37.56%

52.78%

ADBCMM (concat)

68.05%

89.99%

66.31%

76.36%

First Order PPLs

60.10%

89.60%

58.88%

71.03%

Second Order PPLs

62.41%

90.23%

61.19%

72.92%

Third Order PPLs

65.01%

90.95%

64.81%

75.69%

Our Method

65.01%

90.95%

64.81%

75.69%

ADBCMM (siamese)

39.76%

88.76%

37.42%

52.65%

ADBCMM (concat)

67.71%

90.13%

65.89%

76.13%

Skip-gram

19.66%

97.32%

17.93%

30.29%

Skip-gram + corpus

19.90%

97.40%

18.26%

30.75%

Cbow

15.90%

96.25%

14.18%

24.71%

Cbow + corpus 18.65%

95.90%

17.28%

29.28%

bahasaBERT (Fine-tuned)

word2vec

5.3 Malay Dataset Experiment As illustrated in Table 2, our aim is to further validate the efficacy of our unsupervised method. For this purpose, we select a benchmark unsupervised approach proposed by Ciosici et al. [4] and train the word2vec model using two distinct training strategies, namely Skip-gram and Cbow, alongside the training set. Additionally, we expand our corpus by incorporating a diverse range of Malay language websites, as depicted in Table 3, to enhance the performance of the word2vec model. Subsequently, we assess the effectiveness of the model by conducting tests on a designated test set. Concurrently, 6 https://huggingface.co/malay-huggingface/bert-base-bahasa-cased.

Towards Malay Abbreviation Disambiguation

77

Table 3. Source of Corpus Website

Url

Bernama

https://www.bernama. com/en/

Utusan Malaysia

http://www.utusan. com.my/

Kosmo

http://www.kosmo. com.my/

Berita Harian

http://www.beritaharian.sg/

Pelitabrunei

http://www.pelitabrunei. gov.bn/

Media Permata

http://mediapermata. com.bn/

Astro Awani

http://www.astroawani. com/

Bank Negara Malaysia

http://www.bnm.gov.my

Kerajaan Negeri Terengganu

http://www.terengganu. gov.my/

Table 4. SDU@AAAI-22-Shared Task 2 Spanish Test Dataset Experimental Results Model/User

Method

Macro-P

Macro-R

Macro-F1

spanishBERT7

First Order PPLs 53.36%

73.11%

53.41%

61.73%

Second Order PPLs

53.71%

74.78%

54.03%

62.73%

Third Order PPLs

53.02%

72.30%

52.62%

60.91%

Our Method

55.10%

77.35%

55.33%

64.51%

First Order PPLs 66.82%

74.52%

65.92%

69.96%

Second Order PPLs

69.14%

76.20%

68.97%

72.40%

Third Order PPLs

64.73%

73.09%

65.28%

68.96%

spanishRoBERTa8

acc

(continued) 7 https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased. 8 https://huggingface.co/bertin-project/bertin-roberta-base-spanish.

9 https://competitions.codalab.org/competitions/34899#results.

The user name of the participant.

78

H. Bu et al. Table 4. (continued)

Model/User

Method

acc

Macro-P

Macro-R

Macro-F1

Our Method

70.53%

77.57%

70.19%

73.70%

WENGSYX9

ADBCMM



91.07%

85.14%

88.00%

TTaki





76.00%

66.00%

70.00%

AbhayShukla





77.00%

54.00%

64.00%

kumudlakara





62.00%

35.00%

45.00%

we conduct extensive fine-tuning of the bahasaBERT pre-trained model using a largescale corpus. Notably, this corpus is consistent with the additional new Malay corpus employed during the word2vec model training. In order to facilitate a comparison with supervised models, we construct ADBCMM (siamese) and ADBCMM (concat) models. Our unsupervised framework demonstrates an accuracy of 65.01%, as measured in terms of accuracy. Although it is 3% lower than the ADBCMM (concat) model based on bahasaBERT, our method exhibits a significant advantage over the unsupervised word2vec model in the context of the abbreviation disambiguation task. Considering macro-precision, we observe that the fine-tuning based fusion method emerges as the most effective when compares with the supervised approaches, while also presenting potential for further improvement compared to the word2vec unsupervised method. Moreover, in terms of macro-recall and macro-F1 value, the fine-tuning based fusion method achieves percentages of 64.81% and 75.69%, respectively. These results indicate a noteworthy enhancement compared to the unsupervised word2vec method. Furthermore, the effectiveness of the fusion method closely approximate that of the supervised ADBCMM (concat) models based on bahasaBERT, thereby providing substantial evidence to support the superiority of our proposed model. Table 5. SDU@AAAI-22-Shared Task 2 Scientific English Test Dataset Experimental Results Model/User

Method

BERT10

acc

Macro-P

Macro-R

Macro-F1

First Order PPLs 59.23%

71.18%

61.62%

66.06%

Second Order PPLs

60.63%

69.97%

61.90%

65.68%

Third Order PPLs

57.67%

68.23%

58.19%

62.81%

Our Method

62.02%

71.41%

63.37%

67.15% (continued)

10 https://huggingface.co/bert-base-uncased.

Towards Malay Abbreviation Disambiguation Table 5. (continued) Model/User

Method

RoBERTa11

acc

Macro-P

Macro-R

Macro-F1

First Order PPLs 59.58%

72.09%

60.68%

65.89%

Second Order PPLs

60.28%

70.10%

61.21%

65.35%

Third Order PPLs

56.10%

68.99%

56.86%

62.34%

Our Method

61.50%

72.28%

62.72%

67.16%

WENGSYX

SimCLAD



97.00%

94.00%

96.00%

Decalogue





71.00%

60.00%

65.00%

sherlock314159





70.00%

59.00%

64.00%

kumudlakara





73.00%

56.00%

63.00%

Table 6. SDU@AAAI-22-Shared Task 2 Legal English Test Dataset Experimental Results Model/User

Method

acc

Macro-P

Macro-R

Macro-F1

BERT

First Order PPLs

77.02%

78.18%

75.89%

77.02%

Second Order PPLs

72.59%

77.66%

72.46%

74.97%

Third Order PPLs

70.24%

76.82%

69.64%

73.05%

Our Method

77.28%

78.61%

76.24%

77.41%

First Order PPLs

70.76%

79.47%

69.41%

74.10%

Second Order PPLs

72.59%

83.62%

70.80%

76.68%

Third Order PPLs

72.06%

81.18%

71.00%

75.65%

RoBERTa

Our Method

73.89%

82.69%

72.60%

77.32%

WENGSYX





94.00%

87.00%

90.00%

TianHongZXY





79.00%

64.00%

70.00%

mozhiwen





75.00%

61.00%

67.00%

TTaki





78.00%

57.00%

66.00%

11 https://huggingface.co/roberta-base.

79

80

H. Bu et al. Table 7. SDU@AAAI-22-Shared Task 2 French Test Dataset Experimental Results

Model/User

Method

Europeana12

acc

Macro-P

Macro-R

Macro-F1

First Order PPLs 54.49%

68.87%

56.45%

62.04%

Second Order PPLs

55.11%

69.41%

56.05%

62.02%

Third Order PPLs

56.09%

68.17%

56.41%

61.73%

Our Method

57.69%

72.02%

58.87%

64.79%

WENGSYX

ADBCMM



84.00%

79.34%

89.42%

TTaki





74.00%

64.00%

68.00%

mozhiwen





73.00%

63.00%

67.00%

AbhayShukla





70.00%

49.00%

57.00%

5.4 Other Language Dataset Experiments All experimental results will be presented in Tables 4, 5, 6, and 7 above. Among these four test datasets, our unsupervised method based on pre-trained models has the best performance in the legal English test dataset. In particular, the accuracy rate and macro F1 value are up to 77.28% and 77.41% respectively. Effective results are also obtained in three other test datasets simultaneously, and we can get macro F1 values of 73.70%, 67.16% and 64.79% in Spanish, Scientific English and French test dataset, respectively. In the competition of abbreviation disambiguation task, compared with the other methods participating in the competition, our unsupervised method achieves the best performance in the fourth place in the legal English dataset, with Spanish ranking sixth, scientific English ranking seventh and French ranking seventh. 5.5 Analysis and Discussion In the Malay dataset experiment, we use three different pre-trained models to verify the effectiveness of our model. It is obvious that among these three pre-trained models, the unsupervised method based on fine-tuning performs the best. Therefore, we conclude that under the same language condition, the effectiveness of our unsupervised model is closely related to the types of pre-trained model. Meanwhile, from the results in the SDU@AAAI-22-Shared Task 2 test dataset experiment, this rule is also followed.

6 Conclusion In this study, a corpus of Malay abbreviation disambiguation is constructed, and an unsupervised framework, leveraging a pre-trained model, is introduced to address the task of abbreviation disambiguation as a multi-classification problem. Experimental evaluations 12 https://huggingface.co/dbmdz/bert-base-french-europeana-cased.

Towards Malay Abbreviation Disambiguation

81

are conducted on both a Malay dataset and the test dataset in the SDU@AAAI-22-Shared Task 2. The obtained experimental results demonstrate that our proposed unsupervised framework exhibits a notable advantage over the unsupervised method, while achieving effectiveness comparable to the current supervised approach. Subsequently, future investigations could encompass the application of our framework to additional languages, integration of diverse perplexity levels, and exploration of the efficacy of multilingual capabilities. Acknowledgement. This work was supported by the National Social Science Fund of China (No. 22BTQ045).

References 1. Navigli, R.: Word sense disambiguation: a survey. ACM Comput. Surv. 41, 1–69 (2009). https://doi.org/10.1145/1459352.1459355 2. Mcinnes, B.T., Pedersen, T., Liu, Y., Pakhomov, S.V., Melton, G.B.: Using second-order vectors in a knowledge-based method for acronym disambiguation. In: Association for Computational Linguistics (2011) 3. Taneva, B., Cheng, T., Chakrabarti, K., He, Y.: Mining acronym expansions and their meanings using query click log. In: Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee (2013) 4. Ciosici, M., Sommer, T., Assent, I.: Unsupervised Abbreviation Disambiguation Contextual disambiguation using word embeddings. CoRR (2019) 5. Li, Y., Zhao, B., Fuxman, A., Tao, F.: Guess me if you can: acronym disambiguation for enterprises. In: Association for Computational Linguistics (2018) 6. Wu, Y., Xu, J., Zhang, Y., Xu, H.: Clinical abbreviation disambiguation using neural word embeddings. In: Proceedings of BioNLP 15 (2015) 7. Jin, Q., Liu, J., Lu, X.: Deep Contextualized Biomedical Abbreviation Expansion. BioNLP@ACL (2019) 8. Chopard, D., Spasi´c, I.: A deep learning approach to self-expansion of abbreviations based on morphology and context distance. In: International Conference on Statistical Language and Speech Processing (2019) 9. Li, I., et al.: A Neural Topic-Attention Model for Medical Term Abbreviation Disambiguation. In: NeurIPS (2019) 10. Pan, C., Song, B., Wang, S., Luo, Z.: BERT-based acronym disambiguation with multiple training strategies. In: Association for the Advancement of Artificial Intelligence (2021) 11. Li, B., Xia, F., Weng, Y., Huang, X., Sun, B.: SimCLAD: a simple framework for contrastive learning of acronym disambiguation. In: Association for the Advancement of Artificial Intelligence (2021) 12. Weng, Y., Xia, F., Li, B., Huang, X., He, S.: ADBCMM: acronym disambiguation by building counterfactuals and multilingual mixing. In: Association for the Advancement of Artificial Intelligence (2021) 13. Kacker, P., Cupallari, A., Subramanian, A.G., Jain, N.: ABB-BERT: A BERT model for disambiguating abbreviations and contractions. ICON (2022) 14. Schwartz, A.S., Hearst, M.A.: A simple algorithm for identifying abbreviation definitions in biomedical text. Pac. Symp. Biocomput. (2003), 451–62 (2003) 15. Lin, N., Zhang, H., Shen, M., Wang, Y., Jiang, S., Yang, A.: A BERT-based Unsupervised Grammatical Error Correction Framework. CoRR (2023)

Poster: Information Extraction and Knowledge Graph

Topic Tracking from Classification Perspective: New Chinese Dataset and Novel Temporal Correlation Enhanced Model Jin Ma1(B) , Xinming Zhang1 , Yuxun Fang2 , Xinyu Zuo2 , and Haijin Liang2 1 School of Computer Science and Technology, University of Science and Technology of China, Hefei, China [email protected], [email protected] 2 Tencent, Shenzhen, China {ethanfang,xylonzuo,hodgeliang}@tencent.com

Abstract. Topic tracking focuses on linking novel events to existing topics and is critical for applications. In this work, a deep learning model called TDE is proposed to solve the topic-tracking problem. And the first industrial-level topic tracking dataset QBETTD with fine-grain score labeling is released. The TDE model topical correlations among events from semantic and temporal perspectives based on machine reading comprehension (MRC) structure, which captures the deep semantic meaning and inter-relationship of the events and topics. The empirical results show that the proposed method is suitable for the topic-tracking task, and the model can track a growing topic.

Keywords: Topic tracking dataset

1

· Temporal enhanced · Chinese news

Introduction

Topic detection and tracking (TDT) automatically identifies, organizes events and topics from the news. Originating from the DARPA’s TDT project in 1996 [6], the task has been extensively studied for the past two decades. Considering a deployable paradigm of the TDT task, it should consist of three key parts as shown in Fig. 1: event detection (ED), topic detection (TD), and topic tracking (TT). In the real world, a topic contains a large number of related sequence events, which need to be organized effectively to construct a complete event graph. Therefore, timely and precise incorporation of the latest progress events into the topic (i.e., topic tracking) holds significant practical importance. In this work, we focus on the event-based topic-tracking task which is more common in general search scenarios. The fundamental unit in this task is an event, which comprises a trigger (event title) and an occurrence time, while a topic encompasses several related events. Compared to multimedia-based topic c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 85–97, 2023. https://doi.org/10.1007/978-3-031-44696-2_7

86

J. Ma et al.

tracking [1,10], relying solely on events presents a more challenging scenario due to the sparsity of features. Nonetheless, there is currently limited research on methods and datasets for event-based topic tracking in related communities.

Fig. 1. Illustration of the Topic Detection and Tracking (TDT) task

Fundamentally, the available datasets either utilize data from different forms (twitters, weibo, articles, etc.) [1,10,22], or solely focus on event relationships without considering topic correlations [21], which cannot meet the needs of eventbased topic tracking. To address this gap, we develop a new dataset to incorporate the desired events and topic relations. In addition to datase’t issues, there are several challenges associated with event-based topic tracking. Firstly, one topic is typically composed of multiple events, and commonly used methods may not be able to directly assess the correlation between a single event and other mutually constrained events. Secondly, topics often exhibit temporal continuity, and there are significant differences in duration and intensity between different topics. Specifically, recent events are more likely to be associated with long-term topics than short-term topics over time. For example, the launch mission of Shenzhou-13 is more relevant to the whole construction project of the Chinese space station compared to a single launch mission of TianZhou. These challenges highlight the necessity of introducing temporal factors in event-based topic tracking to further model event correlations. To address these challenges, we propose a new MRC-based model, named TDE, to model topical correlations among events from semantic and temporal perspectives. Our experimental results demonstrate that the proposed model performs well in event-based topic-tracking tasks, validating its effectiveness and practicality. The contributions of this work are as follows: – This paper proposes a new MRC-based model, named TDE, to model topical correlations among events from semantic and temporal perspectives, which could capture internal and external relationships between events with one topic.

Topic Tracking from Classification Perspective

87

– Moreover, this paper releases the QBETTD dataset, an industrial-level Chinese news event topic tracking dataset. The dataset contains pairs of new events and topic events, each labeled with a 5-point scale (0 to 4) score indicating whether the event belongs to the topic. – Our experimental results indicate that the proposed model performs well in event-based topic-tracking tasks, validating its effectiveness and practicality.

2

Related Work

There exist several approaches for topic tracking, which generally involve encoding events and topics using statistical features [10], vectorization [9], graph models [18], or topic models [17]. Subsequently, a clustering [1,9,17] or classification [14] algorithm is employed to assign events to topics. Currently, there exists a dearth of event-based tracking methodologies in the academic community. Nevertheless, there is still much research on multimedia signal-based topic tracking, which can serve as a reference. The input sources of TT task vary greatly, ranging from news articles and academic papers to Twitter posts. Moreover, the most prevalent method for text encoding is the application of vector space models (VSM) or text vectorization techniques [4,10]. Within this framework, each document is represented through its respective features, which can be identified as the occurrence of crucial terms derived from the Term Frequency-Inverse Document Frequency (TF-IDF) [9] calculations. However, these methods typically disregard the semantic content of the text and often result in large and sparse vectors. An alternative approach is the employment of a text vectorization model such as Word2Vec [12], GloVe [13] and FastText [3]. These approaches leverage the distributional hypothesis which posits that words with similar meanings tend to occur in similar contexts. By encoding the text as a dense vector, they can achieve significant dimensionality reduction while preserving semantic meaning, in contrast to the sparse representations of the VSM. However, as these models typically consider only a limited window of neighboring words when constructing the fixed-length text embedding, they may be suboptimal and struggle to disambiguate polysemous terms. The probabilistic model has also been used to represent text, such as relevance model [8] and latent Dirichlet allocation (LDA) [2,17]. From the previous works, we can see the main focus is on the occurrence of the essential words or mapping the word to embedding space to retain semantic meaning. However, these methods are not sufficient for our case when the text is short. After obtaining representations, clustering and classification method are employed to determine the appropriate topic to which the event belongs. Clustering for topic tracking is usually achieved by single-pass incremental clustering [17,20]. This method involves comparing new events with existing topics’ clusters to determine which cluster they belong to. Cosine similarity [20] and Jaccard similarity [7] are frequently used as measures in this process. While DBSCAN [9] and K-Means [17] are also applicable, they are less adaptable, as these methods require the recalculation of clusters when new events are added.

88

J. Ma et al.

KNN and Decision Trees are two commonly used classification methods for topic tracking. The KNN approach classifies events based on their proximity to neighboring events in the topic feature space [9]. In contrast, the Decision Tree approach leverages features such as word appearance and meta-features [4]. However, the Decision Tree method may not perform optimally as it does not take into account semantic information.

3

Methodology

In this section, we defined the task and proposed three models to address the challenges of semantic and temporal modeling simultaneously. 3.1

Notation

There are events e and topics G. A topic contains at most five events covering the same topic G = {e1g , · · · , ekg }, k ≤ 5. The event inside a topic is called topic event eg and a novel event that hasn’t been linked to a topic is called a new event en . The proposed QBETTD dataset consists of samples of event-topic pairs (ein , G i ) ∈ D. Specifically, ein is the new event needed to match G i in the ith sample. For both the new event or topic events e = {Te , se }, it contains a short title sequence Te and a timestamp se in DATETIME1 format. Given such a sample, a model need to decide whether the new event belongs to the topic.

Fig. 2. The overview structures of our proposed models. 1

Example of timestamp: 1970-01-01 00:00:00.

Topic Tracking from Classification Perspective

3.2

89

Semantic Correlation Modeling

The motivation of the proposed model is to better capture the deep semantics of the events and inter-events relationship. A naive way is to convert the problem into a machine reading comprehension (MRC) format by a self-attention-based pre-trained language model. BERTMRC. An MRC-style input can be constructed using the new event title as the query and the topic events as the context. The input starts with a [CLS] token, followed by the tokens of the new event and topic events, separated by [SEP] tokens. Since the input does not have an explicit text to specify the question, so we call such input an implicit MRC-style input. As shown in Fig. 2a, In the base proposed model BERTMRC, we use the 12 layers pre-trained Chinese BERT model [5] to construct the event embedding. After feeding the input to the model, we retrieve the embedding at the [CLS] position as the event-match-topic embedding. Then the embeddings are fed into a dense layer to perform classification and regression tasks. 3.3

Temporal Correlation Modeling

In empirical studies, the occurrence time of events is often a crucial feature, particularly when matching topics with significant duration and density differences. However, the BERTMRC model lacks the ability to incorporate temporal correlations. To address this limitation, we propose to extract vector representations for each event and augment them with time-difference information. This approach could enable the model to better capture the temporal dynamics of events and thus improve its overall performance. In the TDE model (shown in Fig. 2b and 2c), we follow BERTSum [11] to model sentences in a document. Different from delimiting sentences with [SEP] token in the BERTMRC model. TDE-LSTM wraps the sentence using [CLS] at the beginning and [SEP] at the end. And take out the embeddings for each title at its [CLS] token position. Then, we calculate the time difference x between the new event and all other topic events in days. The result time difference is used to look up a corresponding trainable embedding T Di , i ∈ {0, 1, 2, 3, 4, 5} according to Tables 1 and 2. The retrieved time difference embeddings T Di are then added to the correˆ sponding event title embedding elements to obtain the final event embedding E, as shown in the Eq. 1. ˆ = E + T Di E (1) TDE-LSTM. The events in each sample are in chronological order, so it is naturally suitable for timing modeling to capture the temporal information. Based on this advantage, we feed event embeddings into a Bi-LSTM module to capture the inter-event temporal relations. Finally, the output from the first Bi-LSTM cell is fed into a dense layer for classification and regression target.

90

J. Ma et al.

Table 1. Time difference embedding Table 2. Details of the QBETTD lookup table, the time difference in the dataset. unit of the day. # of Samples (s) 13967 Time Difference TD Embedding # of Events (E) 10884 x = 0* T D0 avg. T /s 4.70 0= 2, the performance of the model declined,

Multi-perspective Feature Fusion for Event-Event Relation Extraction

121

which may have been due to the introduction of unrelated noise during the deep fusion process. As the distance of any node did not exceed two jumps, and increasing the depth of the network resulted in smoother features.

6

Conclusion

In this paper, we propose a novel approach for document-level event relation extraction based on graph neural network. By decomposing the document into Heterogeneous graph and leveraging the graph fusion strategies, our model effectively captures the local and global information. The extensive experimental results show the better performance to predict the event relation in the two public Chinese benchmarks datasets, compared to other baseline models. In the future, we will consider event co-reference to improve performance. Acknowledgements. This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410) and the Program of the Pujiang National Laboratory (No. P22KN00391).

References 1. Glavas, G., Snajder, J.: Construction and evaluation of event graphs. Nat. Lang. Eng. 21(4), 607–652 (2015) 2. Huang, L., et al.: Liberal event extraction and event schema induction. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 258–268 (2016) 3. Oh, J.H., Torisawa, K., Kruengkrai, C., Iida, R., Kloetzer, J.: Multi-column convolutional neural networks with causality-attention for why-question an- swering. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 415–424 (2017) 4. Tang, T., Liu, W., Li, W., Wu, J., Ren, H.: Event relation reasoning based on event knowledge graph. In: Knowledge Science, Engineering and Management 14th International Conference, pp. 491–503 (2021) 5. Man, H., Ngo, N.T., Van, L.N., Nguyen, T.H.: Selecting optimal context sentences for event-event relation extraction. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, pp. 11058–11066 (2022) 6. Chambers, N., Jurafsky, D.: Jointly combining implicit constraints improves temporal ordering. In: 2008 Conference on Empirical Methods in Natural Language Processing, pp. 698–706. (2008) 7. Cheng, F., Miyao, Y.: Classifying temporal relations by bidirectional LSTM over dependency paths. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1–6. (2017) 8. Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., Jin, Z.: Classifying relations via long short term memory networks along shortest dependency paths. In: EMNLP2015, pp. 1785–1794 (2015)

122

W. Liu et al.

9. Cheng, F., Miyao, Y.: Classifying temporal relations by bidirectional LSTM over dependency paths. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1–6 (2017) 10. Zeng, S., Xu, R., Chang, B., Li, L.: Double graph based reasoning for documentlevel relation extraction. In: EMNLP 2020, pp. 1630–1640 (2020) 11. Su, B., Hsu, S., Lai, K., Gupta, A.: Temporal relation extraction with a graphbased deep biaffine attention model. CoRR abs/2201.06125 (2022), https://arxiv. org/abs/2201.06125 12. Dai, Q., Kong, F., Dai, Q.: Event temporal relation classification based on graph convolutional networks. In: Natural Language Processing and Chinese Computing - 8th CCF International Conference, pp. 393–403 (2019) 13. Zhang, S., Huang, L., Ning, Q.: Extracting temporal event relation with syntacticguided temporal graph transformer. CoRR abs/2104.09570 (2021) 14. Xu, X., Gao, T., Wang, Y., Xuan, X.: Event temporal relation extraction with attention mechanism and graph neural network. Tsinghua Sci. Technol. 27(1), 79– 90 (2022). https://doi.org/10.26599/TST.2020.9010063 15. Li, J., Xu, S., Li, P.: Document-level event temporal relation extraction on global and local cues. In: International Joint Conference on Neural Networks, IJCNN 2022, pp. 1–8 (2022) 16. Phu, M.T., Nguyen, T.H.: Graph convolutional networks for event causality identification with rich document-level structures. In: NAACL2021, pp. 3480–3490 (2021) 17. Fan, C., Liu, D., Qin, L., Zhang, Y., Xu, R.: Towards event-level causal relation identification. In: SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1828–1833 (2022) 18. Liu, J., Xu, J., Chen, Y., Zhang, Y.: Discourse-level event temporal ordering with uncertainty-guided graph completion. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 3871–3877 (2021) 19. Zhao, K., Ji, D., He, F., Liu, Y., Ren, Y.: Document-level event causality identification via graph inference mechanism. Inf. Sci. 561, 115–129 (2021). https://doi. org/10.1016/j.ins.2021.01.078 20. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: 6th International Conference on Learning Representations, OpenReview.net (2018). https://openreview.net/forum?id=rJXMpikCZ 21. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net (2017). https://openreview.net/forum?id=SJU4ayYgl 22. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015. https://arxiv.org/abs/1409.0473 23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 5998–6008 (2017) 24. Schlichtkrull, M.S., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: The Semantic Web - 15th International Conference, pp. 593–607 (2018) 25. Yang, Z., Liu, W., Liu, Z.: Event causality identification by modeling events and relation embedding. In: Neural Information Processing - 25th International Conference, pp. 59–68 (2018) 26. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL2019, pp. 4171–4186 (2019)

Multi-perspective Feature Fusion for Event-Event Relation Extraction

123

27. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. CoRR abs/1904.09223 (2019). https://arxiv.org/abs/1904.09223 28. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). https://arxiv.org/abs/1907.11692

Joint Cross-Domain and Cross-Lingual Adaptation for Chinese Opinion Element Extraction Ruoding Zhang1,2(B) , Meishan Zhang3 , and Xiaoyi Song1,2 1

2

3

Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China [email protected] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China

Abstract. Opinion element extraction, through which key elements of opinions are identified, is critical for understanding opinions. We can achieve state-of-the-art performance by leveraging supervised learning backended by pretrained language models for the task. Yet, it relies heavily on a manual-annotated training corpus. This corpus could be biased towards specific domains, as well as scarce for non-English languages like Chinese. In this work, we focus on the Chinese language, investigating the joint exploration of cross-domain and cross-lingual adaptation. To facilitate our research, we annotate a training corpus of the COVID-19 genre for cross-domain exploration. For the cross-lingual exploration, we utilize the English newsware MPQA corpus directly. Finally, we design a uniform architecture to incorporate cross-domain and cross-lingual transfer jointly. Experimental results demonstrate that cross-domain and crosslingual transfers can both deliver impressive performance, and their combination can increase the performance even further.

Keywords: Opinion Element Extraction Cross-Domain

1

· Cross-Lingual ·

Introduction

Opinion mining has attracted extensive attention for decades [2,6,13,19], which involves detecting sentiments and attitudes expressed in texts. It is a fundamental research topic in natural language processing (NLP) [9,14,26], which can be applied to a wide variety of applications based on comments posted by web users [3]. Opinion element extraction is one basic task in opinion mining. It extracts the key elements of opinions such as opinion expressions, agents and targets, which has been extensively investigated [9,25,30]. Figure 1 shows an example. With the aid of supervised models and pretrained language models, opinion element extraction can achieve state-of-the-art performance. For example, on c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 124–135, 2023. https://doi.org/10.1007/978-3-031-44696-2_10

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

125

Fig. 1. One example of Chinese opinion element extraction.

the English MPQA dataset, typical neural models with BERT supporting can obtain F-measures of 50.0%, 48.3% and 52.8% for opinion expressions, agents and targets, respectively [24]. However, the supervised models might be inappropriate when the training corpus is scarce, which could be true mostly for non-English languages. Moreover, the annotated corpus is always limited to specific domains, which could be biased to the other domains. In this paper, we focus on the Chinese language to investigate cross-domain and cross-lingual opinion element extraction, assuming that there is no training corpus for a specific domain in Chinese. We follow the line of work in English, examining the performance of opinion element extraction on the MPQA dataset, which belongs to the newswire domain. Fortunately, we already have the test dataset of Chinese MPQA [29]. To facilitate our investigation, we also annotate a Chinese opinion mining corpus of another domain, i.e., the web comments that are related to COVID-19. Based on the annotated dataset, we are able to study cross-domain transfer learning by using COVID-19 as the source domain, evaluating the performance on the Chinese MPQA dataset. Further, we study cross-lingual transfer using the English MPQA corpus as the source data. We propose a universal architecture to investigate the cross-domain and cross-lingual opinion element extraction jointly. Our final model is extended from a span-based neural model backended with multilingual BERT, which can achieve the state-of-the-art performance for our task. Then we exploit a parameter generation network (PGN) [16,18] to integrate the cross-domain and crosslingual corpora naturally. By using the PGN module, we are able to measure the difference between the source domain and the target domain during domain adaptation, and meanwhile also characterize the connections between the source language and the target language. We conduct detailed experiments to verify the effectiveness of our proposed method. Concretely, we exploit the aforementioned COVID-19 dataset, the English-language MPQA dataset, and a pseudo Chinese MPQA dataset as the training corpora, and test on the manually-annotated Chinese MPQA dataset. Results show that the joint cross-domain and cross-lingual adaptation can boost performance significantly, resulting in average increases of 3.7%. Our datasets as well as codes are publicly available on https://github.com/zhangruoding/ spanOM/ under Apache License 2.0 for research purposes.

2

Corpus Construction

Here we focus on three representative types of opinion elements, i.e., expression, agent and target, where expressions are also annotated with polarities. In detail,

126

R. Zhang et al.

the opinion expression is the key span that expresses one opinion, the polarity denotes the sentiment polarity (i.e., positive or negative) by the opinion, the agent denotes the agent that the opinion is sourced, and the target denotes the object that the opinion is about. In this work, we study the cross-domain and cross-lingual transfer learning of Chinese opinion element extraction. Zhen et al. [29] have annotated a test dataset of the newsware domain, sourced from the English MPQA corpus by translation and manual annotation. Here we adopt it as the target test corpus, denoted it by MPQA-CHNTest, and then construct three corpora for cross-domain and cross-lingual investigation. Note that we ignore the opinion expressions with neutral polarities in this work. 2.1

The Cross-Domain Corpus

We build a manual corpus of the COVID-19 domain for cross-domain opinion element extraction, and denote the manual corpus by COVID-19Train, which serves as the training corpus of cross-domain adaptation. Zhang et al. [27] develop a Chinese corpus with manually-labeled opinion expressions of the COVID-19 domain. Here we extend the corpus with opinion agent and target annotations.1 The raw text is crawled from the Sina Weibo related to COVID-19 covering various topics. The selected text is significantly different from that of MPQA, including the word distributions as well as the text style. We recruit ten graduate students which are very familiar with our task for the annotation, and open-source tool doccano2 is used for efficient annotation. To ensure the annotation quality, we include a three-stage manipulation. First, the ten students are carefully trained based on the English MPQA guideline, and then each of them annotates a sample of 200 sentences to reach one consistent guideline by after-annotation discussion. By then, we can regard them as expert annotators. Subsequently, all ten students start the large-scale opinion role labeling, ensuring each opinion expression has three annotations. Thirdly, for each opinion, we let another annotator different from the initial three annotators to check and finalize the annotation. To check the annotation quality, we estimate the Kappa value based on the annotations of the three initial annotators. The value is calculated at the word level, ignoring the words which all three annotators regarded as unrelated to specific opinion elements. The overall value is 0.92, which indicates that the annotations from different annotators are highly consistent. One more thing, as our opinion element extraction is studied at the word-level in align with that of the English, we have a preprocessing step to segment Chinese raw sentences into word sequences. We exploit the state-of-the-art tool of [8] to achieve the Chinese word segmentation.

1 2

We also conduct refinements for opinion expressions, since we find that the annotation style of original COVID-19 is very different from that of MPQA. https://doccano.github.io/doccano/.

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

127

Table 1. The data statistics in detail, where # and NUM denote the number, and AvgL denotes the average span length. Corpus

#Sent #Word Expression Agent Target NUM AvgL NUM AvgL NUM AvgL

COVID-19Train 8,047 MPQA-ENGTrain 1,819 MPQA-CHNTrain 1,803 MPQA-CHNTest

2.2

507

225,059 13,584 2.26 50,885 2,785 2.99 45,178 2,725 2.68 12,826

736 2.60

4,055 1.97 1,629 2.33 1,537 2.07

7,225 4.64 733 6.28 723 5.18

413 2.13

205 5.59

The Cross-Lingual Corpus

The original English MPQA corpus [15,21,22] has been the widely-adopted benchmark dataset to evaluate various opinion element extraction models. To this end, our work also chooses the MPQA genre as the test domain. The training part of the English MPQA dataset could be used for cross-lingual studying naturally. The MPQA corpus is sourced from newsware documents, annotating with a number of different level opinion elements. Besides the regular opinion elements such as opinion expression, polarity, holder and target, the corpus also covers many other important elements such as the opinion intensity, topic, confidence etc. Note that the MPQA corpus also annotate the opinion with neutral attitudes and implicit opinions. In this work, we ignore these opinions, being consistent with the cross-domain COVID-19 corpus to facilitate our research. The COVID-19 corpus may also be regarded as news, however it significantly differs from the MPQA corpus in the topic and writing style. We refer to this corpus as MPQA-ENGTrain, which would be used for cross-lingual adaptation. 2.3

The Pseudo Corpus by Machine Translation

The most effective training corpus should be that of the same language and the same domain, which is unavailable for our setting. Fortunately, we can obtain one pseudo corpus to approximate the goal by corpus translation. The key idea is to translation the gold-standard MPQA corpus of the English language into our target Chinese language. In fact, the method has been widely investigated for cross-lingual transfer learning. Zhen et al. (2021) [29] has exploited the method for opinion role labeling, and here we borrow their corpus directly to adapt for our opinion element extraction. The conversion consists of two steps by order: (1) sentence translation and (2) opinion element projection, both of which are conducted at the sentence-level. Note that a very small proportion of MPQAENGTrain would be dropped due to low-quality projection of opinion elements. For the details, we can refer to the paper of Zhen et al. (2021) [29]. We denote the resulted corpus as MPQA-CHNTrain. MPQA-CHNTrain is used as one supplement training corpus for cross-lingual adaptation. Note that MPQA-CHNTrain is constructed totally automatically, while MPQA-CHNTest (the test corpus) is built with carefully expert annotations.

128

R. Zhang et al.

2.4

Data Statistics

By the above, we collect all the datasets which helps our joint investigation of cross-lingual and cross-domain transfer learning in Chinese opinion mining. Table 1 summarize the overall data statistics, covering the Chinese MPQA test dataset, the Chinese COVID-19 dataset (for training), the English MPQA training dataset, and the pseudo English MPQA training dataset. We report the number of sentences, words, opinion elements. For opinion elements, we also concern about their averaged lengths since the lengths could be potentially related to the performance of automatic opinion mining.

3

Method

We propose a universal framework to integrate cross-domain and cross-lingual adaptation jointly. Figure 2 shows the main architecture of our model. This model is extended from a span-based opinion element extraction model, and exploit a Parameter Generation Network (PGN) to enable language/domain-aware features. The key idea is to leverage the three types of corpora: COVID-19Train, MPQA-EngTrain and MPQA-CHNTrain, by an indiscriminate way through PGN. Note that this is an initial study to combine cross-domain and crosslingual adaptation. We are not meant to present a sophisticated method for the combination. Thus here we just propose a simple and plug-in method to achieve our goal. In the following, we describe our universal model of cross-domain and cross-lingual opinion element extraction step by step.

Fig. 2. The cross-lingual and cross-domain architecture.

3.1

Encoder

First, we introduce the encoder part. Our model takes a sequence of sentential words S = w1 w2 · · · wn as inputs, and outputs opinion element extraction results. The encoder module aims for better feature representation from the input.

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

129

Word Representation. We directly exploit a pretrained multilingual language model [10] as the backend for the bottomed feature extraction, following the work of Xia et al. (2021) [23]. Given the word sequence S = w1 w2 · · · wn , we dump out a sequence of features as follows: E = e1 e2 · · · en = Adapter-BERT(S = w1 w2 · · · wn ),

(1)

where Adapter-BERT means the effective adapter usage of BERT [5], which freezes all BERT parameters and tune only adapter parameters during training. The architecture of Adapter-BERT is shown in Fig. 2 by the right part. We omit the detailed formalization of the Adapter-BERT, which can be referred to [5] for the definition. Noticeably, since the direct output of Adapter-BERT is at the subword-level (character for Chinese), we have a simple post-processing (i.e., average pooling over the covered subwords) to reach the word-level outputs. PGN-BiLSTM Encoding. Here inspired by the cross-domain techniques of [7], we integrate PGN with BiLSTM to derive type-sensitive features, where different types of corpora can be naturally combined. By this mechanism, we can explore the three types of corpora jointly, without concerning whether they are crossdomain or cross-lingual. Thus, we would obtain a universal framework for crossdomain and cross-lingual adaptation. Based on the word representations extracted from BERT (i.e., E = e1 e2 · · · en ), and an symbol t to index the type of the input corpus, we obtain the encoder output by the following formulas: H = h1 h2 · · · hn = BiLSTM(E = e1 e2 · · · en |Vt = Θet ) = PGN-BiLSTM(E = e1 e2 · · · en , t|Θ),

(2)

where Θ is a learnable parameter in this component, et is the embedding of corpus index t, and PGN-BiLSTM denotes the module name for short. 3.2

Decoder

Our decoder consists two stages. First, all candidate spans are extracted and then represented based on encoder outputs. Second, span classification is executed to obtain the final opinion elements. In particular, we put the emphasis on the extraction of opinion expressions, associating opinion agents and targets as roles of opinion expressions, which is a standard practice in previous work. Span Representation. To better distinguish opinion expression and role representations, we first employ two multi-layer perceptions (MLP) to re-encode the output of the BiLSTM. Given the hi hidden vectors, re-encode is denoted as: E/R

H E/R = h1

E/R

h2

· · · hE/R = MLPE/R (H = h1 h2 · · · hn ), n

(3)

where E/R indicates either Expression or Role. Next, for each candidate span E/R E/R E/R E/R + E/R + he )(h − he ) to S[b,e] = wb · · · we (e ≥ b), we use h[b,e] = (hb b calculate its dense representation.

130

R. Zhang et al.

Span Classification. Based on the above span representation, we perform span classification to derive all opinion elements:   E E pE [b,e] = SoftMax Linear (h[b,e] ) , (4)   R R + E pR [b,e] = SoftMax Linear (h[b,e] h [s,t] ) , where S[b,e] is the focused span, and for role classification we need to specify the corresponding expression S[s,t] in advance. Both the expression extraction and role classification are three-way classification problems, where the expression extraction includes EXPR+, EXPR- and O(other), and the role classification E/R involves AGENT, TARGET and O(other). The final output p[b,e] shows the label distribution of different labels. 3.3

Training and Inference

For the training, we encounter the imbalance problem by the O(other) label. To alleviate the problem, we adapt the focal loss [12] to train our model. Given a span S[b,e] and its corresponding classification output p∗[b,e] , the focal loss is: L=−

 c∈C

∗ (1 − p∗[b,e] [c])γ y[b,e] [c] log p∗[b,e] [c]

(5)

∗ where C is the set of all possible labels, y[b,e] is the gold-standard answer, γ is a pre-defined hyper-parameter, and ∗ denotes E/R. For the inference, we first obtain the opinion expressions, and then output opinion agents and targets based on expressions. By the two-step strategy, we can get the opinion elements straightforwardly. Several overlaps between opinion elements or duplicated opinion roles can be addressed by trivial post-processing.

4 4.1

Experiments Settings

Datasets. We exploit the datasets as discussed in Sect. 2 for experiments, where the detailed statistics is shown in Table 1. We denote the corpora COVID19Train, MPQA-ENGTrain, MPQA-CHNTrain by CVD, MEN, MCH for short. We additionally build a development corpus by automatically translating the development corpus of English MPQA, which is used to determine hyperparameters and model training. Hyper-Parameters and Training. There are several hyper-parameters to define our neural network structures. For the contextual representations, we extract the representations from the Multilingual BERT. In particular, we use the adapter mechanism and insert adapter layers in the top 10 layers of BERT. The hidden size of the PGN-BiLSTM layer is set to 300 and we employ a 2-layer BiLSTM

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

131

to encode the input representations. The language embedding dim is set to 8. The dimension of span representations is 400. We exploit online training to learn model parameters, and train on the entire training examples for 200 epochs, choosing the best-epoch model according to the relation performances on the development corpus. We use Adam with a learning rate 10−3 to optimize our model. Dropout is exploited by a drop ratio of 0.33. The hyper-parameter γ in the focal loss is 3.0. Evaluation Metrics. We use the span-level recall (R), precision (P) and their F1measure as the major metrics for evaluation, using the F1 value as the main metric. All elements are evaluated separately by their categories, which are EXPR+, EXPR- (expressions are split into two categories by their sentiment polarities), AGENT and TARGET, respectively. Besides the above exact matching, we also report proportional and binary overlap measurements, following the work of [9]. Table 2. Main results. Model

EXPR+

EXPR-

AGENT

TARGET

Exa

Pro

Bin

Exa

Pro

Bin

Exa

Pro

Bin

Exa

Pro

Bin

CVD

30.8

49.7

59.2

28.6

48.0

58.4

28.5

46.2

54.8

25.2

51.3

61.8

MEN

31.0

50.3

59.4

29.1

48.5

58.2

27.8

46.6

55.0

25.5

51.8

62.2

MCH

31.3 51.0 60.1 29.2 49.1 59.0 28.0 47.0 55.8 26.4 52.4 63.0

MEN+MCH

32.6

52.6

61.4

30.0

50.2

60.4

30.4

48.8

56.4

28.7

54.0

64.4

CVD+MEN

33.1

53.0

62.8

31.4

51.5

61.0

31.1

50.3

57.6

29.1

55.2

65.8

CVD+MCH

34.0 53.6 63.0 32.2 52.4 61.5 32.7 51.6 58.0 30.4 56.7 67.0

CVD+MEN+MCH 35.4 55.8 64.9 33.8 54.0 62.8 34.6 54.8 60.4 32.5 58.8 68.7

4.2

Results and Analysis

Final Results. The main experimental results are shown in Table 2. The exact matching (i.e., Exa), proportional overlap (i.e., Pro) and binary overlap (i.e., Bin) results are all given. As shown, our final model with all three types of corpora can achieve the best results among all combinations, resulting in the averaged improvement of 5.8% (by exact matching) compared with cross-domain adaptation (i.e. CVD), and 3.7% (by exact matching) compared with cross-lingual adaptation (i.e. MEN+MCH). The observation indicates that joint cross-domain and cross-lingual could be one prospective method for Chinese opinion element extraction of a new domain that already has been addressed in English. There are several interesting findings by the main results. First, we can see that the negative expressions are slightly more difficult to be recognized, which might be due to that negative points are expressed in a more obscure manner generally. Second, although the agent is with higher performance than the target by exact matching, the proportional and binary matching are on the inverse, which might be due to the larger averaged span length of opinion targets. Third,

132

R. Zhang et al.

Fig. 3. Precision and recall comparisons for single-corpus models.

the improvements by joint cross-domain and cross-lingual do not show any drop when the baseline is stronger. The averaged improvement between CVD+MEN+MCH and MEN+MCH is 3.7%, while the value between CVD+MEN and MEN is 2.8%, and between CVD+MCH and MCH it is 3.6%. The observation might be caused by the large difference between the style of MPQA and COVID-19. Single Corpus. Additionally, we examine the performance of our model with single corpus. Thus the results are either cross-domain or cross-lingual adaptation. Concretely, CVD indicates cross-domain adaptation, while MEN and MCH indicate cross-lingual adaptation. We can see that cross-lingual adaptation obtains better performance than cross-domain adaptation. This might be due to that COVID19Train has a larger gap than MPQA-ENGTrain, MPQA-CHNTrain. The average lengths in Table 1 are able to reflect the gaps. For cross-lingual adaptation, the translation-based adaptation performs better than the direct adaptation, which is consistent previous findings [29]. Two Corpora. Further, we compare the model performance using two types of corpora. The model MEN+MCH belongs to cross-lingual adaptation, while the other twos fall into the joint cross-domain and cross-lingual adaptation. The joint cross-domain and cross-lingual adaptation brings better performance than sole cross-lingual adaptation. The results are reasonable as the two types of cross-lingual corpora are sourced from the same MPQA data. The cross-domain dataset (i.e., COVID-19Train) can supply new knowledge for our test dataset. Precision and Recall Comparisons. In order to examine the differences between cross-domain and cross-lingual opinion element extraction in-depth, we show the fine-grained values of precision and recall in Fig. 3. Only the single-corpus models are listed for better and fairer comparisons. As shown, the three models include very different tendencies in the P and R values. These values are mostly related to the characteristics of the COVID-19 and MPQA corpora. For example, the sentence in COVID-19 each has more number of opinion expressions than that of MPQA, which can lead to a higher recall and lower precision by comparing CVD with MEN and MCH. Similar observations can be found in the AGENT and TARGET categories. Influence by the Span Length. The cross-domain and cross-lingual datasets have very different length distributions, especially for the opinion agents and targets. The cross-domain COVID-19 dataset is with shorter elements than MPQA, especially for opinion agents and targets. Here we compare the three single-corpus

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

133

Fig. 4. F-measure of single-corpus models in terms of span length.

models by their performance on different lengths of opinion elements. Figure 4 shows the results. As a whole, we can see that the CVD has better performance on the short-length opinion elements, while the MEN and MCH models perform better on longer opinion elements. The observation is consistent with our expectation.

5

Related Work

Opinion mining is one fundamental topic in the NLP community [2,6,13]. Opinion element extraction is a kind of widely-investigated task of opinion mining [15,21,22]. Several works focus on opinion expressions only, while also several works devote to opinion roles such as opinion agents and targets [11]. The majority works exploit the CRF-based sequence labeling framework for the task [9], where other studies prefer to either span-based or transition-based methods [23,26,28], which are able to use high-order features at the span level. Here our universal cross-domain and cross-lingual model extends from a typical span-level neural opinion element extraction model. There have been several studies of cross-domain and cross-lingual opinion mining of other opinion mining tasks [4,17,20,30]. As for cross-domain and cross-lingual opinion element extraction, the existing work is relatively very few. Almeida et al. (2015) [1] have studied cross-lingual opinion element extraction of the Portuguese language. Zhen et al. (2021) [29] investigate Chinese opinion role labeling by cross-lingual transferring learning. The biggest obstacle might be the available corpus of this line. Here we extend the work of [29] with crossdomain annotations, studying joint cross-lingual and cross-domain adaptation for Chinese opinion element extraction.

6

Conclusions and Future Work

In this work, we presented the first work of joint cross-domain and cross-lingual adaptation on Chinese opinion element extraction. We built three types of corpora to achieve the goal: one being a manually-annotated COVID-19 dataset, one being the English MPQA dataset and the other being the Chinese MPQA dataset translated from the English MPQA. Following, we presented a unified framework which explores the above three types of corpora seamlessly. Experimental results show that joint cross-domain and cross-lingual adaptation can

134

R. Zhang et al.

boost the performance of Chinese opinion element extraction significantly, indicating that the method can be considered as a potential strategy. There are several limitations in this work as well, which could be addressed by future work with further investigations. First, our unified framework is relatively simple, since it is one very initial work of this line. There could be more sophisticated models for the unification. Second, it could be more challenging when the test corpus is different from the domain of the source language, which is another setting for joint cross-domain and cross-lingual adaptation. Third, the recent breakthrough of large language models can offer new insights into our setting, e.g., more sophisticated corpus generation methods and much stronger models. There may, however, be very different in the exploration of joint cross-domain and cross-lingual adaptation.

References 1. Almeida, M.S.C., Pinto, C., Figueira, H., Mendes, P., Martins, A.F.T.: Aligning opinions: cross-lingual opinion mining with dependencies. In: Proceedings of ACL, pp. 408–418 (2015) 2. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 28(2), 15–21 (2013) 3. Cui, L., Huang, S., Wei, F., Tan, C., Duan, C., Zhou, M.: Superagent: a customer service chatbot for e-commerce websites. In: Proceedings of ACL, System Demonstrations, pp. 97–102 (2017) 4. Fernández, A.M., Esuli, A., Sebastiani, F.: Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J. Artif. Intell. Res. 55, 131–163 (2016) 5. Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: Proceedings of NAACL, pp. 2790–2799 (2019) 6. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: AAAI, vol. 4, pp. 755–760 (2004) 7. Jia, C., Liang, X., Zhang, Y.: Cross-domain NER using cross-domain language modeling. In: Proceedings of ACL, pp. 2464–2474 (2019) 8. Jiang, P., Long, D., Zhang, Y., Xie, P., Zhang, M., Zhang, M.: Unsupervised boundary-aware language model pretraining for Chinese sequence labeling. In: Proceedings of EMNLP, pp. 526–537 (2022) 9. Katiyar, A., Cardie, C.: Investigating LSTMs for joint extraction of opinion entities and relations. In: Proceedings of ACL, pp. 919–929 (2016) 10. Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, pp. 4171– 4186 (2019) 11. Kim, S.M., Hovy, E.: Extracting opinions, opinion holders, and topics expressed in online news media text. In: Proceedings of Workshop on Sentiment and Subjectivity in Text, pp. 1–8 (2006) 12. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of ICCV (2017) 13. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 415–463. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_13

Cross-Domain and Cross-Lingual Chinese Opinion Element Extraction

135

14. Marasović, A., Frank, A.: Srl4orl: improving opinion role labeling using multitask learning with semantic role labeling. In: Proceedings of NAACL, pp. 583–594 (2018) 15. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2), 1–135 (2008) 16. Platanios, E.A., Sachan, M., Neubig, G., Mitchell, T.: Contextual parameter generation for universal neural machine translation. In: Proceedings of EMNLP, pp. 425–435 (2018) 17. Singh, R.K., Sachan, M.K., Patel, R.: 360 degree view of cross-domain opinion classification: a survey. Artif. Intell. Rev. 54, 1385–1506 (2021) 18. Stoica, G., Stretcu, O., Platanios, E.A., Mitchell, T., Póczos, B.: Contextual parameter generation for knowledge graph link prediction. In: Proceedings of AAAI, pp. 3000–3008 (2020) 19. Sun, S., Luo, C., Chen, J.: A review of natural language processing techniques for opinion mining systems. Inf. Fusion 36, 10–25 (2017) 20. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of the ACL, pp. 235–243 (2009) 21. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Lang. Resour. Eval. 39(2–3), 165–210 (2005) 22. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity: an exploration of features for phrase-level sentiment analysis. Comput. Linguist. 35(3), 399–433 (2009) 23. Xia, Q., Zhang, B., Wang, R., Li, Z., Zhang, M.: A unified span-based approach for opinion mining with syntactic constituents. In: Proceedings of NAACL (2021) 24. Xu, Q., Li, B., Li, F., Fu, G., Ji, D.: A dual-pointer guided transition system for end-to-end structured sentiment analysis with global graph reasoning. Inf. Process. Manag. 59(4), 102992 (2022) 25. Yang, B., Cardie, C.: Extracting opinion expressions with semi-Markov conditional random fields. In: Proceedings of EMNLP, pp. 1335–1345 (2012) 26. Zhang, M., Wang, Q., Fu, G.: End-to-end neural opinion extraction with a transition-based model. Inf. Syst. 80 (2018) 27. Zhang, X., Xu, G., Sun, Y., Zhang, M., Wang, X., Zhang, M.: Identifying Chinese opinion expressions with extremely-noisy crowdsourcing annotations. In: Proceedings of ACL, pp. 2801–2813 (2022) 28. Zhao, H., Huang, L., Zhang, R., Lu, Q., Xue, H.: SpanMLT: a span-based multitask learning framework for pair-wise aspect and opinion terms extraction. In: Proceedings of the ACL, pp. 3239–3248 (2020) 29. Zhen, R., Wang, R., Fu, G., Lv, C., Zhang, M.: Chinese opinion role labeling with corpus translation: a pivot study. In: Proceedings of EMNLP, pp. 10139–10149 (2021) 30. Zhou, X., Wan, X., Xiao, J.: Clopinionminer: opinion target extraction in a crosslanguage scenario. IEEE/ACM Trans. Audio Speech Lang. Process. 23(4), 619–630 (2015)

A Relational Classification Network Integrating Multi-scale Semantic Features Gang Li, Jiakai Tian, Mingle Zhou, Min Li(B) , and Delong Han Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250316, China [email protected]

Abstract. Among many natural language processing tasks, the task of relation classification is an important basic work. Relational classification is required to provide high-quality corpus in fields such as machine translation, structured data generation, knowledge graphs, and semantic question answering. Existing relational classification models include models based on traditional machine learning, models based on deep learning, and models based on attention mechanisms. The above-mentioned models all have the disadvantage of only using a single feature for relationship classification, and do not effectively combine entity features and context features. Based on the above problems, this study proposes a relational classification network that integrates multi-scale semantic features. Through the combination of entity features and context semantic features, the model can better learn the relationship features between entities. In addition, in order to verify the validity of the model and the effectiveness of each module, this study conducted experiments on the SemEval-2010 Task 8 and KBP37 data sets. Experimental results demonstrate that the model performance is higher than most existing models.

Keywords: relationship classification keyword

1

· Second keyword · Another

Introduction

How to express information in unstructured text in a structured way is a basic problem in natural language processing. Only by structuring the text [1] can it play a role in subsequent tasks such as machine translation [2], knowledge graph [3] and semantic question answering [4]. Relation classification is an important tool in the text structuring process. The task of relation classification is to identify and classify the relations between entities in unstructured text [5]. For example, in the example sentence below, entity 1 is ’stress’ and entity 2 is ’divorce’. The task of relationship classification is to identify the relationship ’Cause-Effect’ between two entities. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 136–147, 2023. https://doi.org/10.1007/978-3-031-44696-2_11

A Relational Classification Network Integrating

137

Financial [e1] stress [/e1] is one of the main causes of [e2] divorce [/e2]. Relevant researchers have done a lot of research on relation classification and proposed a variety of relation classification methods. It includes a convolutional neural network, a recurrent neural network, an attention-based mechanism, and a relationship classification network based on BERT and training models. In terms of feature extraction, most of the existing models only use entity features or context semantic features to train the model. The relationship between entities can only be obtained after analyzing the context and the entity itself. So we propose a relation classification network that fuses multi-scale semantic features. The main contributions of this paper can be summarized as: (1) A relation classification model is proposed. Improve the performance of relational classification network by extracting and fusing different levels of features. (2) A sentence-level context feature extraction module and an entity feature extraction module are proposed. By extracting and fusing features at different levels, the important role of semantic features in relational classification tasks is fully utilized. (3) Based on the above model, the comparison and ablation implementation show that this paper has achieved excellent results on the SemEval-2010 Task8 dataset, and the F1 value is higher than most existing models.

2

Related Work

Before the pre-training model appeared, the relationship classification model used Word2Vec [6] to represent the semantics between words. In 2018, Peters et al. [7] proposed the ELMO model to better solve polysemy problems. The ELMO model uses the language model for pre-training to extract the word representation of the corresponding word and the semantic features of the sentence, which are added to the specific relationship extraction task as new features. In the same year, Radford et al. [8] proposed an autoregressive GPT model. GPT uses pre-training and fine-tuning methods to solve dynamic semantic problems, which alleviates the problems of entity overlap in the process of relation extraction to a certain extent. The GPT model uses a more powerful Transformer structure in terms of feature extractors. GPT can capture longer memory information and use a one-way language model, which is easier to parallelize than traditional neural network structures. At present, most of the relationship classification models use the BERT [9] pre-training model proposed by Google to extract features. ENRIE [10] proposed by Zhang et al. improves the model by introducing the concepts of words and phrases. Sun et al. proposed ENRIE1.0 by improving the masking strategy of the model. ENRIE1.0 introduces external knowledge in the pre-training process, which can make the model better learn the relationship characteristics between entities. MTDNN proposed by Liu et al. [11] and ENRIE2.0 proposed by Sun et al. [12] improve the training effect of the model by introducing multi-task learning in the pre-training process. In terms of masking strategy, BERT-WWM proposed by Cui et al. [13] changes the masking strategy to mask this, so that

138

G. Li et al.

the model can better predict the context content. Although RoBERTa proposed by Liu et al. [14] lacks the task of predicting the next sentence than BERT, RoBERTa uses more data to train the model, which also improves the effect of the model. The BERT-based relation classification model NEZHA model proposed by Wei et al. [15] trains the model by improving the position encoding strategy. KnowBERT-W+W proposed by PETERS et al. [16] introduces triplets in the knowledge graph to improve the quality of model training.

3

Method

In this section, a relation classification network fusing multi-scale semantic features is proposed. First, the sentence is vectorized, and words are converted into vectors through word embedding, position embedding and segmentation embedding. The semantic features in the sentence are extracted through the BERT pre-training model. The model inputs the different levels of output of BERT into the sentence-level feature extraction module and the entity-level feature extraction module respectively. The sentence-level feature extraction module extracts the inter-word relationship in the sentence through the word correlation function, and forms the inter-word relationship probability matrix. Use the mask function to extract the corresponding feature of the entity and perform the average pooling operation to finally obtain the classification probability. Entitylevel semantic extraction first extracts the feature vector corresponding to the entity. The final classification probability is obtained through the entity feature amplification module and splicing operation. The overall framework of the model is shown in Fig. 1.

Fig. 1. Model master frame.

3.1

Text Embedding and Feature Extraction

Before text embedding, you need to add [CLS] and [SEP] tags at the beginning of the text to mark the position of the sentence. At the same time, mark the

A Relational Classification Network Integrating

139

start and end positions of the two entities by [e1], [/e1], [e2] and [/e2] before and after the entity. Text embedding is divided into three parts, namely Token Embeddings, Segment Embeddings and Position Embeddings. First, convert the text into a vector of the corresponding dimension through Token Embeddings as required. Segment Embeddings is a sentence operation that marks whether the words in the text belong to the same sentence. Position Embeddings mark the position of words in the input text. The purpose is to facilitate the BERT model to identify the order of the input sequence and better extract contextual features. BERT pre-trains word vectors through “masked language model” and “next sentence prediction”. Related work has shown that using pre-trained word vectors has better results on relational classification tasks. The tokens with sentence positions and two entity positions are first fed into the BERT model. Add a [CLS] tag at the beginning of each tag. The BERT output is the feature output of the entire sentence, which contains the semantic information of the entire sentence. The hidden output of each layer of the BERT model can be used as a word embedding for downstream tasks. The 24 hidden layers of the BERT model can be labeled Hi (i=1, 2, ..., 24) respectively. In the subsequent downstream tasks, we use the last layer H24 and the penultimate layer H24 of BERT as the feature input of the downstream tasks. 3.2

Sentence Level Feature Extraction Module

In the feature extraction module at the sentence level, first use the word correlation function to query the correlation between each word in a sentence to form a word correlation matrix M . Secondly, the mask function is used to extract the probabilities corresponding to the two entities in the word association matrix. Finally, the probability matrix for classification tasks is formed by an average pooling layer. Word Correlation Function and Probability Matrix. We use word correlation function to calculate the correlation between each word in a sentence. Compute the two groups of word correlations separately. A set of autocorrelations for the output H24 of the last hidden layer of the BERT pretrained model. Another group is the cross-correlation of the last layer H24 and the penultimate layer H23 . This process can be expressed mathematically as shown in Eq. 1 and Eq. 2. (1) Ki−1 = H24 Wai · (H24 Wbi )T Ki−2 = H23 Wai · (H24 Wbi )T

(2)

where Ki−1 is the autocorrelation of the features in the output of the last hidden layer, and Ki−2 is the cross-correlation of the output of the last and penultimate hidden layer, respectively. i ∈ [1, 2, ..., class length],class length is the number of categories of relations in the data. H23 and H24 are the hidden layer outputs

140

G. Li et al.

of the penultimate layer and the last layer of the BERT model, respectively. Wai and Wbi are the weight matrix of each word under the i-th relationship, which are parameters learned through the training process. When there are X words in an input sentence, K is actually a square matrix with X rows and X columns. For example, Ki−1 (a, b) represents the element at the position of row a and column b in the matrix Ki−1 , which represents the i-th word between the a-th word and the b-th word in a sentence probability of a relationship. We normalize this probability into the (0, 1) range using the sigmoid function. This process can be expressed mathematically as Eq. 3 and Eq. 4. 1 (3) Pi−1 = Sigmoid(Ki−1 ) = 1 + e−Ki−1 1 (4) 1 + e−Ki−2 Suppose there are 8 words in a sentence. There are four candidate relationships in the relationship library. The above process can be represented as Fig. 2. Pi−2 = Sigmoid(Ki−2 ) =

Fig. 2. The probability matrix of X ∗ X is obtained by the word correlation function. Sigmoid is further used to normalize each element to obtain the final probability matrix P . In the figure, row 6 is entity 1, column 2–4 is entity 2. The red part is the probability that there is a relationship between i ∈ [I, II, III, IV] between entity 1 and entity 2. (Color figure online)

Mask Matrix and Average Pooling. An entity may correspond to one word, or may correspond to multiple words. In the probability matrix P , each element represents the relationship probability between each independent word. The following problem is how to extract the relationship probability corresponding to each entity from P . We design the mask matrix to solve this problem. The mask extraction process is shown in Fig. 3.

A Relational Classification Network Integrating

141

Fig. 3. The process of mask extraction.

In every sentence in the dataset, there are two entities. Each entity is marked with special characters [e1], [/e1], [e2], [/e2]. The start and end positions of entities can be identified by reading special characters. Suppose an entity pair (a, b) exists in the input sentence. begin(a) represents the starting position of entity a, and end(a) represents the end position of entity a. Similarly, begin(b) represents the starting position of entity a, and end(b) represents the end position of entity a. We can construct mask matrices Ma and Mb according to Eq. 5 and Eq. 6.  1, begin (a) ≤ m ≤ end(a) Ma mn = (5) 0, otherwise  1, begin (b) ≤ m ≤ end(b) Mb mn = (6) 0, otherwise Multiply the probability matrix P with the transposes of Ma mn and Mb mn in turn. The relationship probabilities corresponding to the two entities can be extracted. The relationship probabilities corresponding to two entities are pooled on average. All probability matrices of the same output are pooled and stitched together. After splicing, a feature item for classification with the same number of elements as the number of categories is obtained. Add the categorical eigenvectors corresponding to the autocorrelation matrix and the categorical eigenvectors corresponding to the cross-correlation matrix. The final output of the sentence-level feature extraction module can be obtained. The dimension of the output is the number of relationship categories, which can be used as the probability of relationship classification. This process can be expressed in mathematical form as Eq. 7, Eq. 8 and Eq. 9.    T (7) C 1 = concat AveP ool P i−1 × M a−mn × (M b−mn )    T C 1 = concat AveP ool P i−2 × M a−mn × (M b−mn )

(8)

C = C1 + C2

(9)

142

G. Li et al.

Among them, C1 and C2 represent the category probability vectors corresponding to the autocorrelation matrix and cross-correlation matrix, respectively. Pi−1 and Pi−2 represent inter-word relationship probability matrices, iis equal to the number of relationship categories in the dataset. Ma−mn and Mb−mn are mask matrices for extracting the probabilities of relations between entities. AveP ool is an average pooling operation. concat represents the splicing operation. 3.3

Entity Hierarchical Feature Extraction Module

In the entity-level feature extraction module, the sum of the hidden layer outputs corresponding to the two entities in H23 and H24 is used as input. The feature amplification modules designed in this study are respectively. Two entity feature vectors can be obtained. At the same time, the output of the hidden layer corresponding to the [CLS] tag with the semantic features of the entire sentence is passed through the fully connected layer to obtain the feature vector of the entire sentence. The fused features are obtained by concatenating the feature vectors of the two entities with the global feature vector. The fusion feature maps the number of features to the number of categories through a fully connected layer to obtain a feature vector for classification tasks. First, the mask vector is automatically generated according to the special characters [e1], [/e1], [e2], [/e2] in the input sentence. The mask vector is used to extract entity features from the output of the hidden layer of the BERT model. Since an entity may correspond to multiple words, we perform an average operation after extracting the entity features in the output of the hidden layer. Convert the hidden layer output corresponding to two entities into two feature vectors. This process can be expressed in mathematical form as Eq. 10 and Eq. 11. j

E1 =

E2 =

 1 [EM (H23 + H24 )a ] j − i + 1 a=i

(10)

m  1 [EM (H23 + H24 )a ] m−k+1

(11)

a=k

where H23 + H24 is the output of the last and penultimate hidden layer. It is also the input of the entity-level feature extraction module. i and j are the start position and end position of entity 1 in a sentence. k and m are the start position and end position of entity 2 in a sentence. EM (entity matrix) represents the entity mask matrix. After obtaining the feature vectors corresponding to the two entities, the entity features are processed through the feature amplification attention module. In the feature amplification attention module, in order to prevent the gradient explosion and disappearance of the model, and to speed up the training speed of the model, it is necessary to perform Batch Normalization on the entity features.

A Relational Classification Network Integrating

143

Batch Normalization can also prevent the model from overfitting the data set. This step can be expressed in mathematical form as Eq. 12 and Eq. 13.  E 1 − μβ E1 = γ  2 +β σB + 

(12)

 E 2 − μβ +β E2 = γ  2 σB + 

(13)

Among them, E1 and E2 represent the feature vectors corresponding to the two entities. μB is the average value of the elements in the feature vector corresponding to the entity. σB is the variance of the elements in the feature vector corresponding to the entity. After the Batch Normalization operation, a specific weight needs to be assigned to each element in the feature vector. The importance of elements can be re-evaluated through the weight of each feature element. The weight is calculated by dividing an element by the sum of all elements, and the importance of this element is judged by the ratio of this element to the sum of all elements. The calculated element weight is multiplied by the feature vector to realize the weight assignment. This method can further enhance the importance of positive elements. The opposite can reduce the impact of negative elements on model performance. The above process can be expressed as formula 14 and formula 15 in mathematical formula. E 1−w=W

E 2−w=W

 E1

 E2

  T γ E  − i   T · E1 = 1 · E1 j=0 γ j

(14)

  T γ E  − i   T · E2 = 2 · E2 j=0 γ j

(15)

Among them, E1−w and E2−w represent the feature vectors of the two entities after the feature amplification module. Entity features after feature amplification and [CLS] features containing the entire sentence pass through a fully connected layer, and the dimension of the feature vector is compressed through the fully connected layer. Then splicing the three feature vectors output by the fully connected layer, the purpose is to obtain a feature vector that contains both entity features and context features of the entire sentence. The advantage of this is that the relationship features obtained by the model can be fully utilized, thereby improving the performance of the model. The concatenated feature vectors are then passed through a fully connected layer to map the number of features to the number of relationships. A feature vector with dimensions equal to the number of relationships is obtained, which is the probability vector of classification. The above process can be expressed mathematically as Eq. 16, Eq. 17, Eq. 18 and Eq. 19. 

E1−w = W E 1−w [tanh (E 1−w )] + bE 1−w 

E2−w = W E 2−w [tanh (E 2−w )] + bE 2−w

(16) (17)

144

G. Li et al. 

[CLS] = W [CLS] [tanh ([CLS])] + b[CLS]

   +b , E2−w C = W tanh concat [CLS] , E1−w

(18) (19)

where W is the learnable parameter matrix in the fully connected layer. b is the offset. T anh() is the activation function. concat() is a feature vector concatenation function. Finally, the relationship classification probability at the sentence level and the relationship classification probability at the entity level are added to form a fusion probability feature that combines sentence information and entity information. Use this feature as the probability of judging the kind of relationship. The above process can be expressed mathematically as Eq. 20. C = C  + C 

(20)



where C is the fusion probability feature. C is a sentence-level probabilistic  feature. C is the entity-level probability feature.

4

Experiment

Experiment in order to verify the effectiveness of the relationship classification model proposed in this study, we selected two public datasets SemEval-2010 Task8 and KBP37 for relationship extraction for experiments. Good results have been achieved on both datasets, with F1 values reaching 89.86% and 70.52%, respectively. 4.1

Experiment Settings

In most relational classification models, the F1 value is commonly used as the evaluation. Therefore, we also use the F1 value to reflect the performance of the model. F1-score can evaluate the model more objectively through the correlation between the two indicators of precision and recall. The graphics card model used in the laboratory is NVIDIA-A100. The programming language and deep learning framework used are Python3.9 and PyTorch1.9 respectively. The number of epochs is 30. The batch-sizes of the training set and test set are 16 and 32, respectively. 4.2

Comparative Experiment and Analysis

In the comparative experiment, we compared various types of relational classification models, including convolutional neural network-based, recurrent neural network-based, attention-based and BERT pre-trained model-based relational classification models. The results of the comparison test are shown in Table 1. From the experimental data in Table 1, it can be seen that the relationship classification model that integrates multi-scale semantic information proposed in this study has achieved good results on both datasets. Compared with existing

A Relational Classification Network Integrating

145

Table 1. Compare experimental results on two datasets. SemEval-2010 Task8

KBP37

Model

F1

Model

F1

BiLSTM-Attention CR-CNN Entity Attention BiLSTM Bi-LSTM+LET TACNN MALNet DesRC-CNN BLSTM + BTLSTM + Att TRE Multi-Attention CNN R-BERT BERTEM + MTB

84.00% 84.10% 85.20% 85.20% 85.30% 86.30% 86.60% 87.10% 87.10% 88.00% 89.25% 89.50%

CNN RNN BiLSTM-CNN Structural block-CNN MALNet Att-RCNN LGCNN Bi-SDP-Att D-BERT

55.10% 58.80% 60.10% 60.90% 61.40% 61.83% 63.20% 64.39% 69.20%

Our

89.86% Our

70.52%

relational classification models, the model proposed in this study makes better use of features. The feature corresponding to the entity contains the semantic information of the entity itself, and the feature corresponding to the sentence contains rich contextual semantic information. The fusion of entity features and sentence features can classify the relationship between entities in the context of the context, which can greatly optimize the performance of the model. 4.3

Ablation Experiment and Analysis

To demonstrate the effectiveness of each module in the model, we also conduct ablation experiments on two datasets. Our experiments are designed to reflect the effectiveness of each module in four groups of experiments: baseline model, baseline model + sentence features, baseline model + entity features, and baseline model + sentence features + entity features. The results of the ablation experiments are shown in Table 2. It can be seen from Table 2 that the sentence-level feature extraction module and entity-level features proposed in this paper have a significant role in improving the effect of the model. Whether fusing entity features in sentence features or fusing sentence features in entity features, the performance of the model can be improved. The reason is that the features of different granularities can make the model better judge the relationship between two entities. The features of the two entities contained in the entity features are associated with the context features in the whole sentence features. It can learn, analyze and judge the relationship between two entities in combination with the context, so as to improve the performance of the relationship classification model.

146

G. Li et al. Table 2. Ablation experiment results on two datasets. Model

SemEval-2010 Task8 P R F1

KBP37 P R

F1

Base

89.09% 88.96% 89.02% 69.04% 69.19% 69.11%

Base+ Sentence

89.66% 89.41% 89.53% 69.52% 69.58% 69.55%

Base+ Entity

89.25% 89.23% 89.24% 69.35% 69.42% 69.38%

Base+ Sentence+ Entity 89.97% 89.75% 89.86% 70.46% 70.59% 70.52%

5

Conclusion

This paper proposes a relational classification network that integrates different levels of features, which solves the shortcomings of existing relational classification models that only use entity features or only use entire sentence features to solve relational classification problems. By fusing different levels of semantic features, the entity features are combined into the context. The model can better learn the relationship between entities. Through experimental verification, the F1 values in the SemEval-2010 Task 8 and KBP37 data sets reached 89.86% and 70.52%, which are higher than most existing relationship classification models. And the effectiveness of feature extraction modules at different levels is verified through ablation experiments. In future work, we will continue to study the introduction of external features to assist feature learning on the basis of this paper. Enhance the sensitivity of the model through external features, and further improve the performance of the model. Acknowledgements. This work was supported by National Key Research and Development Program of China (2022YFB4004401), and the Taishan Scholars Program (NO. tsqn202103097).

References 1. Kuˇzina, V., Petric, A.M., Bariˇsi´c, M., et al.: CASSED: context-based approach for structured sensitive data detection. Expert Syst. Appl. 223, 119924 (2023) 2. Ranathunga, S., Lee, E.S.A., Prifti Skenduli, M., et al.: Neural machine translation for low-resource languages: a survey. ACM Comput. Surv. 55(11), 1–37 (2023) 3. Ryen, V., Soylu, A., Roman, D.: Building semantic knowledge graphs from (semi-) structured data: a review. Future Internet 14(5), 129 (2022) 4. Bin, H., Yue, K., Linhao, F.: Knowledge modeling and association Q&A for policy texts. Data Anal. Knowl. Discov. 6(11), 79–92 (2023) 5. Wang, H., Qin, K., Zakari, R.Y., et al.: Deep neural network-based relation extraction: an overview. Neural Comput. Appl. 1–21 (2022) 6. Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed representations of words and phrases and their compositionality. IN: Advances in Neural Information Processing Systems, vol. 26 (2013)

A Relational Classification Network Integrating

147

7. Peters, M.E., Neumann, M., Iyyer, M., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237 (2018) 8. Radford, A., Narasimhan, K., Salimans, T., et al.: Improving language understanding by generative pre-training (2018) 9. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 10. Zhang, Z., Han, X., Liu, Z., et al.: ERNIE: enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019) 11. Sun, Y., Wang, S., Li, Y., et al.: Ernie: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019) 12. Sun, Y., Wang, S., Li, Y., et al.: Ernie 2.0: a continual pre-training framework for language understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 8968–8975 (2020) 13. Cui, Y., Che, W., Liu, T., et al.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3504–3514 (2021) 14. Liu, Y., Ott, M., Goyal, N., et al.: Roberta: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019) 15. Wei, J., Ren, X., Li, X., et al.: Nezha: neural contextualized representation for Chinese language understanding. arXiv preprint arXiv:1909.00204 (2019) 16. Peters, M.E., Neumann, M., Logan IV, R.L., et al.: Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019)

A Novel Semantic-Enhanced Time-Aware Model for Temporal Knowledge Graph Completion Yashen Wang1,2 , Li Li3(B) , Meng Jian4 , Yi Zhang1,5 , and Xiaoye Ouyang1 1

National Engineering Laboratory for Risk Perception and Prevention (RPP), China Academy of Electronics and Information Technology, Beijing 100041, China [email protected] 2 Key Laboratory of Cognition and Intelligence Technology (CIT), Artificial Intelligence Institute of CETC, Beijing 100144, China 3 School of Computer, Beijing Institute of Technology, Beijing 100081, China [email protected] 4 College of Information and Communication Engineering, Beijing University of Technology, Beijing 100124, China 5 CETC Academy of Electronics and Information Technology Group Co., Ltd., Beijing 100041, China

Abstract. Researchers have investigated various graph embedding methods to complete Knowledge Graphs (KGs), most of which merely focus on Static KGs (SKGs) without emphasizing the time dependence of triple-formed facts. However, in reality, KGs are dynamic and definitely there is correlations between facts with different timestamps. Due to the sparsity of Temporal KGs (TKGs), SKG’s embedding methods cannot be directly applied to TKGs, which triggers the current discussions about TKG Completion (TKGC) task. And existing TKGC methods universally suffer from two issues: (i) The modeling procedure for temporal information in encoder is usually disjointed or conflict with that in decoder. (ii) Current methods are overwhelmingly dependent on temporal signals for measuring the probability of candidate entity, while ignoring other signals (such as entity’s semantics, etc.,). To overcome these problems, this paper proposes a novel semantic-driven time-aware relational graph neural network model for TKGC task, which consists of a semantic-enhanced encoder and a convolutionbased decoder. Keywords: Knowledge Graph · Temporal Knowledge Graph · Knowledge Graph Completion · Semantic-Enhancement

1 Introduction Numerous large-scale Knowledge Graphs (KGs), including DBpedia and FreeBase, as well as the corresponding technologies, have been investigated in recent years. These KGs distill knowledge from the real world into a structural graph consisting of billions of triple-formed facts. Especially, this kind triple is defined in form of (s, r, o), wherein s indicates the subject entity (head entity), o indicates the object entity (tail entity), and r represents the relation type mapping from aforementioned s to o. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 148–160, 2023. https://doi.org/10.1007/978-3-031-44696-2_12

A Novel Semantic-Enhanced Time-Aware Model

149

Knowledge Graph Completion (KGC) task, is one of the main challenges in the immediate KG’s domain, since most KGs established recently are incomplete. To overcome this challenge, various Knowledge Graph Embedding (KGE) methods have been proposed, which aim at mapping entities and relation types into low-dimensional representation vectors in semantic space, and measuring the plausibility of triples by inputting vectors of the entities and the corresponding relation type to as score function (including translation-based, rotation-based, tensor-based score function, etc.,) [1]. For instance, in highly effective KGE model ComplEx [14], the score of a triple (s, r, o) is measured with the asymmetric Hermitian dot product, after the entities as well as the associated relations are represented as complex embeddings. However, real-world exist some certain KGs which include temporal facts. Temporal KGs (TKGs) like ICEWS, Wikidata and YAGO3 [2] [4], bring temporal information and constraint into triples and then reform triples as quadruples (in form of (s, r, o, t), wherein t indicates the timestamp) with specific time information. Recent works have demonstrated that, the Temporal Knowledge Graph Completion (TKGC) models (especially time-aware graph encoder driven architectures), which encode temporal signals in representation vectors of entities and/or relations, have obtained satisfactory performances on completing TKGs than traditional KGE models [2] [4] [22]. E.g., [16] proposes an attention-based TKGC architecture, consisting of a generalized graph attention network (GAT) as encoder and a diachronic embedding network as decoder, and has proven the effectiveness of encoder-decoder paradigm for TKGC task. [3] proposes an time-aware graph encoder, which better utilizes the whole amount of temporal contexts. However, existing methods of TKGC mainly suffer from two issues: (i) For fashionable encoder-decoder architecture, the modeling procedure for temporal information in encoder is usually disjointed with that in decoder. E.g., the encoder of [16] disregards time information, and the modeling procedures of encoder and decoder are somehow separated. (ii) Current methods are overwhelmingly dependent on temporal signals for measuring the probability of candidate entity, while ignoring other signals (such as entity’s semantics and influence, etc.,). E.g., in [3], temporal proximity is considered as the only factor for modeling the influence of the neighbor entity when aggregating neighbourhoods. To overcome these problems mentioned above, in this paper, we present a novel TKGC method, which is a semanticenhanced and time-aware encoder-decoder architecture. For encoding procedure, we propose a time-aware graph-based encoder, which combines temporal information and static information, and also combines structural signals and semantic signals. Moreover, a semantic-enhanced sampling strategy is introduced here for choosing candidate entities, which utilizes concept information [15] by leveraging instance conceptualization algorithm [7]. For decoding procedure, a convolution-based decoder is utilized to compute scores for every candidate object entity.

150

Y. Wang et al.

2 Preliminary 2.1

Temporal Knowledge Graph

Temporal Knowledge Graph (TKG, denoted as G = {E, R, T }), usually consist of a set of quadruples (s, r, o, t), wherein s ∈ E and o ∈ E are the subject (head) entity and object (tail) entity respectively, r ∈ R is the relation type, and t ∈ T is the timestamp. E, R and T indicate the entity set, relation set and timestamp set, respectively. We also denote a TKG as a sequence of KG snapshots G = {G (1) , G (2) , · · · , (|T |) }, wherein |T | represents the total number of timestamps. G 2.2

Temporal Knowledge Graph Completion

Temporal Knowledge Graph Completion (TKGC) task can be expressed as a missing entity prediction task in an incomplete quadruple in form of (s, r, ?, t) (or (?, r, o, t)) [16] [3]. Especially, for each snapshot G (t) in an observed TKG G = {G (1) , G (2) , · · · , G (|T |) }, it consists of all the known quadruple-formed real-world facts at timestamp t. Let notation G(t) indicates the set of all the true quadruple-formed realworld facts at timestap t such that G (t) ⊂ G(t) (In other words, unknown/unobserved facts may not necessarily be false facts). TKGC task aims to predict the groundtruth object (or subject) entities of incomplete queries (s, r, ?, t) (or (?, r, o, t)), where (s, r, o, t) ∈ G(t) but (s, r, o, t) ∈ / G (t) [3]. 2.3

Concept and Instance Conceptualization

Following [15] [7], we define a “concept” as a set or category of concretized “entities” or “things” within a domain, such that words belonging to smae/similar classes can get similar representations in semantic space. E.g., “microsoft” and “amazon” could be mapping to the same concept COMPANY. Probase [21] is utilized in this study as extra lexical KG to provide and help as generate concept for specific entity (as well as relation). Probase has been widely used in research about text understanding [17], text representation [7], information retrieval [19]. Let notation C represent concepts’s set, and let notation c ∈ C indicates the specific concept pre-defined in the aforementioned lexical KG Probase [21]. In addition, our genetic and flexible model can also support more other types of extra semantic signals well, such as entity type, entity description and pre-defined prompt templates. Given an entity e ∈ E, instance conceptualization algorithm [7] enables to select the open-domain concepts Ce = {|j = 1, · · · , |Ce |} from the lexical KG Probase [21]. The confidence pj indicates the probability of the selected concept cj for entity e. E.g., given entity “London”, we generate the corresponding concepts as CLondon = {, , , , · · · } from Probase . This paper adopts the state-of-the-art instance conceptualization algorithm proposed in [7].

A Novel Semantic-Enhanced Time-Aware Model

151

3 Methodology This section overviews the proposed framework for TKGC task, which consists of a semantic-enhanced encoder (Sect. 3.1) and a convolutional decoder (Sect. 3.2), shown as Fig. 1. In encoder part, this work firstly abstracts a subgraph for the given subject entity s (given incomplete quadruple (s, r, ?, t) as an example), according to both its temporal neighborhoods and semantically-similar (especially respect current relation r) neighborhoods. Then it generates time-aware representation vectors for the temporal neighbors, and performs aggregation. In decoder part, after s’s time-aware representation at timestamp t is updated, a convolution-based decoder is introduced here for computing scores for each candidate object entity. 3.1 Encoder Given incomplete quadruple (s, r, ?, t), encoder tries to learn a contextualized representation vector for explicit head entity s at timestamp t, i.e., h(s(t) ) (shown in Fig. 1(a)). For simplicity, notation s(t) indicate entity s at t.

Fig. 1. The architecture of the proposed framework.

Step#1: Temporal Neighboring Set Generation. Inspired by [6] [3], we first sample a Temporal Neighboring Graph (TNG) for head entity s at timestamp t, according to both its temporal neighborhoods and semantically-similar neighborhoods. Especially, at each t ∈ T , we first conduct the set of temporal neighbors of entity s: N (s(t) ) =  {e(t ) |(s, r , e, t ) ∈ G, (e, r , s, t ) ∈ G; e ∈ E, r ∈ R, t ∈ T }. Apparently, each  temporal neighbor entity e(t ) ∈ N (s(t) ) establishes a relation with s at t , and s bears an incoming (or outcoming) link derived from the temporal quadruple (s, r , e, t ) (or (e, r , s, t )). Step#2: Temporal Neighboring Graph Generation. Then we introduce a novel weighted sampling strategy for constructing TNG, according to: (i) the time interval  |t − t | between s(t) (at t) and the corresponding temporal neighbor entity e(t ) (at t ) [3]; (ii) the semantic correlations between r and the corresponding temporal neighbor   entity e(t ) (at t ) [18]. For each temporal neighbor entity e(t ) ∈ N (s(t) ), we measure whether it will be prudently selected into head entity s(t) ’s TNG, with the following function:

152

Y. Wang et al.

exp(−|t − t |) |Crtail | ∪ |Ce(t ) | + (1 − λ ) · 1  |Crtail | e(t ) ∈N (s(t) ) exp(−|t − t |)

P = λ1 · 

(1)



Wherein, Ce(t ) indicates concept set of entity e(t ) , and the concepts in Ce(t ) are distilled from the lexical KG Probase by instance conceptualization algorithm [7] (Sect. 2.3). Crtail is time-invariant, and indicates tail concept set of relation r, consisting of concepts of the entities occurring in the tail position [18]. Because we find that, all of the entities located in the head position (or tail position) with the same relation may have some (or similar) common entity concepts, as discussed in [18]. For constructing Crtail , we first distill entities appearing in the tail position of relation r to form the time-invariant tail entity set Ertail , and then we obtain the corresponding Crtail = ∩e∈Ertail Ce . In the perspective of Eq. (1), higher weights are attached to the temporal neighbors (in N (s(t) )) who are closer to the given s(t) along the time axis and are also closer to current relation type r from the perspective of semantics. So far, we have successfully In summary, the reasons this work chooses the aforementioned sampling strategy, which has successfully helped us to access to the temporal priors implicitly, can be concluded as follows: (i) From temporal-closer perspective, the fact that time is more approaching has a greater impact on determining the probability of current fact at t; (ii) From semantic-closer perspective, more semantic signals could be considered for training procedure, and thus more less-concerned temporal neighbors can be prevented. Figure 2 shows an example of head entity s(t) ’s temporal neighborhoods, which can be repre(t ) (t ) (t ) (t ) (t ) sented as N (s(t) ) = {e1 , e2 , e3 , e4 , e5 }. The darker the temporal neighbor (t ) (t ) shows, the higher the probability is. Note that: (i) Since e3 and e4 own the same temporal distance (|t − t |) from t and same relation type from s(t) , they are scored with (t ) (t ) the same sampling weight; (ii) Although since e1 and e2 own the same temporal dis(t ) tance from t (|t − t |), e2 gain higher probability (i.e., darker color), because it relation type linked to s(t) is more similar to the given incomplete quadruple (s, r, ?, t).

Fig. 2. An example showing Temporal Neighboring Graph.

Step#3: Temporal Representation Generation. After sampling TNG for the given head entity s at timestamp t, we then attempt to learn its time-aware contextualized representation h(s(t) ) for entity s at timestamp t. For each temporal neighbor entity  e(t ) ∈ N (s(t) ), we employ a connection-based temporal encoding strategy to generate its time-aware entity representation, as follows: 

h(e(t ) ) = f1 ([λ2 · h(e); (1 − λ2 ) · K(t, t )])

(2)

A Novel Semantic-Enhanced Time-Aware Model

153

Wherein, h(e) ∈ Rd1 indicates the time-invariant entity representation of the entity e, which embeds the static information for entity e. K(t , t) acts as a kernel function projecting the time interval |t − t | to a finite dimensional functional space Rd2 . f1 (·) represents a feed-forward neural network, and [; ] indicates the concatenation operator. Hyper-parameter λ2 balances the weights among static information and temporal information. With effects above, the proposed encoder have triumphantly connected the time-invariant entity representation and its associated temporal distance representation, and then generate a combined representation of them from both static perspective and temporal perspective. Finally, we leverage a relational graph aggregator for aggregating the information emerged from s(t) ’s temporal neighbors N (s(t) : h(s(t) ) =

1 |N  (s(t) )|





M1 (h(e(t ) )  hr )

(3)

e(t ) ∈N (s(t) )

Wherein, r indicates the relation type occurring in the temporal quadruple (s, r , e, t )  where temporal neighbor entity e(t ) at timestamp t is sampled. [3, 9] Following the conventional assumption that the relation representations are time-invariant during encoder and decoder procedures, this work connects the time-aware entity represen tation for the sampled temporal neighbor entity e(t ) with the representation vector for its corresponding relation r in temporal quadruple (s, r , e, t ), for introducing relation associated signals into the our encoder. With efforts above, our encoder outputs the time-aware representation h(s(t) ) of subject entity s at timestamp t, by combining not only the static (time-invariant) entity representation h(e) as well as static relation representation h(r) (both h(e) and h(r) are all globally invariant), but also the temporal  interval signals from its temporal neighbors (embedded in h(et )). 3.2 Decoder Following [16], we also adopt a convolution-based decoder here, as shown in Fig. 1(b), which has been demonstrated to be effective. Each quadruple (s, r, o, t) has a corresponding time-related matrix, which is formulated as a 3-column matrix X = [h(s(t) ); hr ; h(o(t) )] ∈ Rd1 ×3 , wherein each column’s vector represents a quadruple element (respect to head entity s(t) , relation r and o(t) , respectively). Decoder then feeds aforementioned matrix X into a convolution network (with Ω convolutional filters), and then these multiple convolutional filters utilizes matrix X to generate various semantic features respectively. Finally, these feature are concatenated into a single representation vector which represents the entire feature of the input quadruple (s, r, o, t) (shown in Fig. 1(b)). Overall, the score function of quadruple (s, r, o, t) is calculated formally as: Ω  ReLU(X)  ωi ) (4) f2 (s, r, o, t) = M2 ( i=1

Wherein, Ω is a hyper-parameter indicating number of convolutional filters used here, ωi indicates i-th convolutional filter, and a linear and learnable weight matrix M2 ∈

154

Y. Wang et al.

RΩd1 ×1 is used here to compute the final score of the input quadruple. Notation  denotes the convolution operator. Finally, given incomplete quadruple (s, r, ?, t) (take the form of object entity prediction task as an example), the entity with the highest confidence (Eq. 5) standing out from the candidate entity set, is chosen as the final result. With efforts above, the loss function can be defined as: 

L∝

(f2 (s, r, o, t))  o ∈E (f2 (s, r, o , t))

− log 

(s,r,o,t)∈G

(5)

4 Experiments 4.1

Datasets and Metrics

The following widely-used datasets are introduced here for comparing the proposed work with comparative baselines: ICEWS14, ICEWS05-15, YAGO11k and Wikidata12k [2] [4]. Table 1 summaries the statics of these datasets. ICEWS14 and ICEWS05-15 are reviewed as the most common TKG benchmarks extracted from the large-scale event-based database, Integrated Crisis Early Warning System (ICEWS), which act as a data-warehouse which includes political events with explicit timestamps. Following [22], these two datasets are filtered by selecting only the entities that appear most frequently in the TKG. YAGO11lk and Wikidata12k are two TKGs wherein all of facts involve timestamps, and especially the timestamps are represented in different forms, i.e., time points such as [2003-01-01, 2003-01-01], beginning or end time such as [2003, ##], and time interval such as [2003, 2005]. Table 1. Statistics of datasets. Dataset ICEWS14

|E|

|R| Type #Train

6,869 230 event

72,826

#Valid #Test 8,941

8,963

ICEWS05-15 10,094 251 event 368,962 46,275 46,092 YAGO11k

10,623

10 fact

16,406

2,050

2,051

Wikidata12k

12,554

24 fact

32,497

4,062

4,062

Mean Reciprocal Rank (MRR) and Hits@N (N ∈ 1, 3, 10) are introduced here as metrics for evaluation. Following [3], towards every test quadruple fact (s, r, o, t) ∈ G(t) , we could generate an associated TKGC query (s, r, ?, t) and then implement comparative models to compute the rank scoring of the ground-truth entity o among all the candidate entities. Wherein, (i) metric MRR is used to compute the mean of the reciprocal ranks of ground-truth entities; (ii) metric Hits@N is used to compute the proportions of the test quadruple facts where ground-truth entities are ranked as top N (N ∈ 1, 3, 10).

A Novel Semantic-Enhanced Time-Aware Model

155

4.2 Baselines We compare our models with the state-of-the-art KGC models: (i) SKGC models, including TransE [1], Distmult [25], ComplEx [14], SimplE [10]; (ii) Traditional TKGC models, including TTransE [12], TA-DistMult [4], HyTE [2], DE-SimplE [5], ATiSE [23], TNTComplEx [11], TeRo [24], TIMEPLEX [8], TeLM [22]; (iii) Currently propsoed GNN-based TKGC models, including TeMP-GRU [20], TeMP-SA [20], T-GAP [9], TARGCN [3], DEGAT [16], which are most related to our model. 4.3 Settings Following [13] [16], we adapt a conventional two-step procedure here: (i) We firstly train an encoder (Sect. 3.1), to encode both the static graph information and the temporal information; (ii) We secondly train a decoder (Sect. 3.2), to perform the TKGC task. Wherein, the parameter settings are concluded as follows. For encoder: representation vector size d1 is set as 100, learning-rate is set as 0.001, the batch-size is set as 128 and dropout probability is set as 0.3. To combine the structural signals and semantic signals when sampling, this work introduces a hyper-parameter λ1 to distinguish the various proportions of each (Eq. (1)), and set λ1 as 0.6 as the default value which releases the optimal result. To combine the static signals and temporal signals when generating entity’s time-aware representation, we introduce a hyper-parameter λ2 to denote the proportion of each (Eq. (2)), and set λ2 as 0.3 as the default value which releases the optimal result. For decoder: representation vector size d1 = 100, learning-rate is set as 0.0005, the batch-size is set as 128, output channels in convolution layer is set as 50. Our proposed model are implemented in PyTorch, and Adagrad optimizer is used for training all models. Besides, we follow the filter setting [1] to achieve fair evaluation. Note that, following the strategies proposed in previous work such as [16] [3], we only consider object prediction queries (s, r, ?, t) here since we could calmly add reciprocal relations for each test quadruple. 4.4 Results and Analysis Summary. Table 2 and 3 list TKGC results of our models and all comparative baseline models on all datasets. Wherein, the optimal results are emphasized by bold fonts, and the suboptimal results are emphasized by underlined fonts. The upward arrow (↑) indicates that a higher value is better. We could observe that the proposed model (denoted as “Ours”) outperforms all baselines on almost all datasets, especially on MRR metric. Our model defeats DEGAT [16] by 3.95% (ICEWS14) and 12.92% (YAGO11k) on MRR metric, in each type of datasets. Because better utilization of all temporal signals along the timeline is more beneficial. Besides, our model is more competitive compared with TARGCN [3], demonstrating the importance of extra semantics (concepts emphasized here (Sect. 2.3)). The preponderant margin is particularly huge on YAGO11k dataset and Wikidata12k dataset, which show our greater advantages. We believe that the performance’s

156

Y. Wang et al. Table 2. Evaluation results on ICEWS14 and ICEWS05-15. ICEWS14

ICEWS05-15

Method

MRR↑ H@1↑ H@3↑ H@10↑ MRR↑ H@1↑ H@3↑ H@10↑

TransE [1] Distmult [25] ComplEx [14] SimplE [10]

0.326 0.441 0.442 0.458

0.154 0.325 0.400 0.341

0.430 0.498 0.430 0.516

0.644 0.668 0.664 0.687

0.330 0.457 0.464 0.478

0.152 0.338 0.347 0.359

0.440 0.515 0.524 0.539

0.660 0.691 0.696 0.708

TTransE [12] TA-DistMult [4] HyTE [2] DE-SimplE [5] ATiSE [23] TNTComplEx [11] TeRo [24] TIMEPLEX [8] TeLM [22]

0.255 0.477 0.297 0.526 0.571 0.620 0.562 0.589 0.625

0.074 0.363 0.108 0.418 0.465 0.520 0.468 0.499 0.545

– – 0.416 0.592 0.643 0.660 0.621 0.673

0.601 0.686 0.655 0.725 0.755 0.760 0.732 0.761 0.774

0.271 0.474 0.316 0.513 0.484 0.670 0.586 0.632 0.678

0.084 0.346 0.116 0.392 0.350 0.590 0.469 0.542 0.599

– – 0.445 0.578 0.558 0.710 0.668 – 0.728

0.616 0.616 0.681 0.748 0.749 0.810 0.795 0.802 0.823

TeMP-GRU [20] TeMP-SA [20] T-GAP [9] TARGCN [3] DEGAT [16] Ours

0.601 0.607 0.610 0.636 0.633 0.658

0.478 0.484 0.509 0.576 0.538 0.596

0.681 0.684 0.684 0.672 0.698 0.695

0.828 0.840 0.790 0.746 0.797 0.771

0.691 0.680 0.670 0.702 0.777 0.781

0.566 0.553 0.568 0.635 0.724 0.667

0.782 0.769 0.743 0.743 0.814 0.768

0.917 0.913 0.845 0.823 0.871 0.851

gap may vary due to the characteristics of different datasets. On more sparse ICEWS dataset (Table 2), although this work does not take a significant step forward compared with other kinds of datasets, it has still achieved the best results on metric MRR in most cases (e.g., with 3.95% and 3.46% improvements compared with SOTA DEGAT [16] and TARGCN [3] respectively on ICEWS14 dataset). The improvement of our model over baselines on YAGO11k and Wikidata12k is larger than ICEWS datasets on MRR and H@1/3 metrics, which brings a MRR’s improvement and H@1’s improvement by 3.61% and 3.67% compared with TARGCN [3], as shown in Table 3. Besides, GNNbased TKGC models (suc as DEGAT [16], TARGCN [3], as well as ours) release the more optimal results, which demonstrates the effectiveness of relational graph aggregator and encoder (such as Sect. 3.1). Ablation Study. To study the contribution of each factor in our model, mainly semantic factor and temporal factor occurred in our time-aware encoder (Sect. 3.1), we run an ablation study (Table 4). Combining all factors generally has the best results, indicating the necessary for jointly capturing: (i) both structural signals and semantic signals (Sect. 3.1); (ii) both static signals and temporal signals (Sect. 3.1). When eliminating concept signals, our model nearly degenerates into TARGCN [3], which only employs

A Novel Semantic-Enhanced Time-Aware Model

157

Table 3. Evaluation results on YAGO11k and Wikidata12k. YAGO11k

Wikidata12k

Method

MRR↑ H@1↑ H@3↑ H@10↑ MRR↑ H@1↑ H@3↑ H@10↑

ComplEx [14]

0.181

0.115



0.311

0.248

0.143



0.489

TTransE [12] TA-DistMult [4] HyTE [2] DE-SimplE [5] ATiSE [23] TNTComplEx [11] TeRo [24] TIMEPLEX [8] TeLM [22]

0.108 0.155 0.136

0.020 0.098 0.033

0.150 – –

0.251 0.267 0.298

0.172 0.230 0.253

0.096 0.130 0.147

0.184 – -

0.329 0.461 0.483

0.187 0.185 0.187 0.184 0.191

0.121 0.127 0.121 0.110 0.129

0.197 0.183 0.197 – 0.194

0.319 0.307 0.319 0.319 0.321

0.299 0.331 0.299 0.324 0.332

0.198 0.233 0.198 0.220 0.231

0.329 0.357 0.329 – 0.360

0.507 0.539 0.507 0.528 0.542

TARGCN [3] DEGAT [16] Ours

0.194 0.178 0.201

0.136 0.117 0.141

0.194 0.185 0.196

0.309 0.304 0.320

0.344 0.350 0.355

0.245 0.257 0.257

0.367 0.370 0.373

0.542 0.528 0.535

temporal information for sampling candidate entities, and hence semantics affects the results most. Besides, even “w/o Concept” still outperforms DEGAT [16], which highlights the advantage and importance of contexts in the time axis. When eliminating temporal informations, our model underperforms almost all the TKGC baselines, and still outperforms ComplEx [14], which is the SOTA model for SKG, verifying that modeling and aggregating temporal neighbors is indispensable. It shows that the proposed model can extensively collects information from the whole temporal context. Table 4. Ablation study on ICEWS14 dataset. Method

MRR↑ H@1↑ H@3↑ H@10↑

ComplEx TeLM TARGCN DEGAT

0.442 0.625 0.636 0.633

Ours

0.658

0.596

0.695

0.771

w/o Concept 0.639 w/o Temporary 0.486

0.564 0.428

0.684 0.473

0.781 0.684

0.400 0.545 0.576 0.538

0.430 0.673 0.672 0.698

0.664 0.774 0.746 0.797

Fine-Grained Analysis. Figure 3(a) shows the effects of hyper-parameter λ1 , which is introduced here (Eq. (1)) for balancing the influence of structure and semantic information when sampling candidate entities as an input at the aggregation step. Experimental

158

Y. Wang et al.

results shows that semantic signals is noticeable and non-ignorable, and when λ1 is set as ∼0.6, it releases the optimal result and this kind of setting will contribute to aggregating graph-formed structure meanwhile integrating high-level semantic information. Conversely, as λ1 becomes larger, model’s performance gradually decreases.

Fig. 3. Effects of hyper-parameter λ1 and λ2 .

Similar to [16], a hyper-parameter λ2 for balance static and temporal information is introduced in Eq. (2), for generating time-aware representation of temporal neighbors. Figure 3(b) shows the test performance respect to different choices of λ2 , the static feature proportion. Along with the features becoming temporal (λ2 changes smaller), performance shows a significant boost, which has proved beneficial efforts derived from integrating temporal information, while as λ2 is larger, performance decreases instead (similar to the phenomenon in Fig. 3(a) discussed above).

5 Conclusion To solve intractable TKGC task, this paper proposes a semantic-enhanced time-aware model, which could take advantage of not only the entire amount of temporal signals along the timelines, but also the more abstract and high-level concept semantics embedded in the current TKG. The experimental results on conventional benchmark datasets has proved our significant improvements. Acknowledgements. We thank anonymous reviewers for valuable comments. This work is funded by: (i) the National Natural Science Foundation of China (No. 62106243, U19B2026, U22B2601).

A Novel Semantic-Enhanced Time-Aware Model

159

References 1. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS’13, pp. 2787–2795 (2013) 2. Dasgupta, S.S., Ray, S.N., Talukdar, P.P.: Hyte: hyperplane-based temporally aware knowledge graph embedding. In: Conference on Empirical Methods in Natural Language Processing (2018) 3. Ding, Z., Ma, Y., He, B., Tresp, V.: A simple but powerful graph encoder for temporal knowledge graph completion. arXiv abs/2112.07791 (2022) 4. García-Durán, A., Dumancic, S., Niepert, M.: Learning sequence encoders for temporal knowledge graph completion. In: Conference on Empirical Methods in Natural Language Processing (2018) 5. Goel, R., Kazemi, S.M., Brubaker, M.A., Poupart, P.: Diachronic embedding for temporal knowledge graph completion. In: AAAI Conference on Artificial Intelligence (2019) 6. Han, Z., Chen, P., Ma, Y., Tresp, V.: xerte: Explainable reasoning on temporal knowledge graphs for forecasting future links. arXiv abs/2012.15537 (2020) 7. Huang, H., Wang, Y., Feng, C., Liu, Z., Zhou, Q.: Leveraging conceptualization for short-text embedding. IEEE Trans. Knowl. Data Eng. 30(7), 1282–1295 (2018) 8. Jain, P., Rathi, S., Mausam, Chakrabarti, S.: Temporal knowledge base completion: new algorithms and evaluation protocols. arXiv abs/2005.05035 (2020) 9. Jung, J., Jung, J., Kang, U.: Learning to walk across time for interpretable temporal knowledge graph completion. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2021) 10. Kazemi, S.M., Poole, D.L.: Simple embedding for link prediction in knowledge graphs. arXiv abs/1802.04868 (2018) 11. Lacroix, T., Obozinski, G., Usunier, N.: Tensor decompositions for temporal knowledge base completion. arXiv abs/2004.04926 (2020) 12. Leblay, J., Chekol, M.W.: Deriving validity time in knowledge graph. In: Companion Proceedings of the Web Conference 2018 (2018) 13. Nathani, D., Chauhan, J., Sharma, C., Kaul, M.: Learning attention-based embeddings for relation prediction in knowledge graphs. In: Annual Meeting of the Association for Computational Linguistics (2019) 14. Trouillon, T., Welbl, J., Riedel, S., Éric Gaussier, Bouchard, G.: Complex embeddings for simple link prediction (2016) 15. Wang, F., Wang, Z., Li, Z., Wen, J.R.: Concept-based short text classification and ranking. In: The ACM International Conference, pp. 1069–1078 (2014) 16. Wang, J., Zhu, C., Zhu, W.: Dynamic embedding graph attention networks for temporal knowledge graph completion. In: KSEM (2022) 17. Wang, Y., Huang, H., Feng, C., Zhou, Q., Gu, J., Gao, X.: CSE: conceptual sentence embeddings based on attention model. In: 54th Annual Meeting of the Association for Computational Linguistics, pp. 505–515 (2016) 18. Wang, Y., Liu, Y., Zhang, H., Xie, H.: Leveraging lexical semantic information for learning concept-based multiple embedding representations for knowledge graph completion. In: APWeb/WAIM (2019) 19. Wang, Y., Wang, Z., Zhang, H., Liu, Z.: Microblog retrieval based on concept-enhanced pre-training model. ACM Trans. Knowl. Discov. Data 17, 1–32 (2022) 20. Wu, J., Cao, M., Cheung, J.C.K., Hamilton, W.L.: Temp: temporal message passing for temporal knowledge graph completion. arXiv abs/2010.03526 (2020) 21. Wu, W., Li, H., Wang, H., Zhu, K.Q.: Probase: a probabilistic taxonomy for text understanding. In: ACM SIGMOD International Conference on Management of Data, pp. 481–492 (2012)

160

Y. Wang et al.

22. Xu, C., Chen, Y.Y., Nayyeri, M., Lehmann, J.: Temporal knowledge graph completion using a linear temporal regularizer and multivector embeddings. In: North American Chapter of the Association for Computational Linguistics (2021) 23. Xu, C., Nayyeri, M., Alkhoury, F., Lehmann, J., Yazdi, H.S.: Temporal knowledge graph embedding model based on additive time series decomposition. arXiv abs/1911.07893 (2019) 24. Xu, C., Nayyeri, M., Alkhoury, F., Yazdi, H.S., Lehmann, J.: Tero: a time-aware knowledge graph embedding via temporal rotation. In: International Conference on Computational Linguistics (2020) 25. Yang, B., tau Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575 (2014)

Dual-Prompting Interaction with Entity Representation Enhancement for Event Argument Extraction Ruifang He1,2(B) , Mengnan Xiao1,2 , Jinsong Ma1,2 , Junwei Zhang1,2 , Haodong Zhao1,2 , Shiqi Zhang3 , and Jie Bai3 1

2

College of Intelligence and Computing, Tianjin University, Tianjin 300350, China {rfhe,mnxiao,jsma,junwei,2021244138}@tju.edu.cn Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China 3 The 54th Research Institute of CETC, Shijiazhuang 050081, China {w08080,w08082}@cetc.com.cn

Abstract. Event argument extraction (EAE) aims to recognize arguments that are entities involved in events and their roles. Previous prompt-based methods focus on designing appropriate prompt template for events, while neglecting to explicitly model entities in the input, resulting in models that are not well aware of arguments. In this paper, we propose a novel Dual-Prompting Interaction approach with Entity Representation Enhancement (DPIERE) to explicitly model entities for both sentence-level and document-level EAE. 1) We design appropriate event and input prompt templates to model argument roles and entities in the input respectively so that pre-trained language model (PLM) can better capture the interaction between them. 2) We introduce a simple but effective entity representation enhancement method with adaptive selection to infuse entity knowledge into PLM. Finally, Argument span prediction is employed to predict the start and end word among the input for each role. In addition, DPIERE devises position marks in event prompt template to distinguish multiple occurrences of the same argument role. Comprehensive experimental results on three benchmarks show the effectiveness of our proposed approach. Keywords: Dual-Prompting Interaction · Entity Representation Enhancement · Event Argument Extraction

1 Introduction As a challenging sub-task of information extraction, event extraction (EE) aims to extract events and arguments from natural language texts, including sentence-level and document-level EE. For example, in Fig. 1, we can see a sentence-level “Conflict.Attack” event triggered by “war” with two involved arguments and a document-level “Movement.Transportartifact.Hide” event triggered by “smuggled” involving four arguments. It is usually modeled as a two-stage task, including event detection (ED) and event argument extraction (EAE). Since ED has achieved great success in recent years, the main challenge lies in EAE. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 161–172, 2023. https://doi.org/10.1007/978-3-031-44696-2_13

162

R. He et al.

Fig. 1. Examples of sentence-level (a) and document-level (b) EAE. Triggers are marked red, and arguments are in other colors. In brackets are the roles they play. (Color figure online)

Previous work formulates EAE as a token-level classification task [4, 8, 13]. There are generally two steps: identifying candidate arguments and classifying their roles. High dependency on candidate arguments suffers from error propagation. Recent studies transform EAE into a question-answering (QA) problem [1, 6, 9, 10, 19]. However, they need to design specific queries according to events and argument roles. In addition, the prediction has to be made one by one, which is not efficient. In addition, Some work casts EAE as a text generation task [14, 16]. Furthermore, Li et al. [7] propose a document-level model by formulating EAE as conditional generation following pre-defined event prompt template. Based on Li et al. [7], Ma et al. [15] propose PAIE to extract all arguments and their roles once a time by designing appropriate event prompt template. Yet these methods focus on designing appropriate event prompt template, while treating different words in the input equally, neglecting to explicitly model entities, leading to models that are not well aware of arguments. According to the general automatic context extraction (ACE) English annotation guidelines for events: an event’s arguments are the entities involved in that event. Therefore it is necessary to assign higher weights to entities in the input. What is more, General pre-trained language models (PLMs) do not design tasks for entities when training, resulting in existing PLMs face the challenge of recalling factual knowledge from their parameters, especially for those rare entities [3, 18]. To address the above problems, inspired by Ye et al. [21] and prompt-based learning [11], we propose a Dual-Prompting Interaction approach with Entity Representation Enhancement (DPIERE) for both sentence-level and document-level EAE. We explicitly model entities from the following two aspects. 1) We design appropriate input prompt template by inserting special tokens around entities in the input. The representation of special token is updated with the model training. This can provide extra clue for PLM, so that PLM achieves a better interaction between argument roles and entities. 2) We introduce an entity representation enhancement approach. Considering that an entity consists of one or more words, we design an adaptive selection mechanism to enhance each word of entities. Additionally, we introduce position marks in event prompt template to distinguish multiple occurrences of the same argument role. The contributions of this paper can be summarized as follows: – Propose a novel dual-prompting interaction approach with entity representation enhancement (called DPIERE) for both sentence-level and document-level EAE to explicitly model entities in the input;

Event Argument Extraction with DPIERE

163

– Introduce a simple but effective entity representation enhancement method with adaptive selection to infuse entity knowledge into PLM; – Experimental results on three benchmarks demonstrate the effectiveness of our method.

Fig. 2. The overview of the proposed DPIERE.

2 Related Work 2.1 Event Argument Extraction Event argument extraction (EAE) is a challenging subtask of event extraction and has been studied for a long time. Feature-based methods [20] are adopted for EAE early. Later, advanced deep learning techniques are applied to EAE. Furthermore, the latter approach can be divided into 3 categories: 1) Classification-based: They directly learn the representation of candidate arguments, aiming at extracting pairs of triggers and candidate arguments [4, 8, 13]. The general step is to identify candidate arguments and fill each with a specific role via a classifier. However, error propagation is inevitable. 2) QA-based: These models leverage role-specific questions to extract the boundaries of the expected arguments [1, 6, 9, 10, 19]. However, they need to design questions according to events and argument roles; in addition, they can only predict one argument once a time, which is inefficient. 3) Generation-based: They convert EAE to text generation with the help of PLMs based on full transformer architecture [14, 16]. Furthermore, with the excellent performance of prompt learning, some work casts EAE as a sequential generation problem with prompt template [7, 12, 15]. These methods commonly design suitable prompt template for events, ignoring modeling the entities explicitly in the input. However, according to the annotation guidelines, all the arguments are entities. So besides the prompt template built for events to model argument roles, we design input prompt template to explicitly model entities so that PLM better capture the interaction between argument roles and entities.

164

2.2

R. He et al.

Prompt-Based Learning

Prompt-based learning [11] is a new paradigm in pre-trained language models. Unlike the pre-training and fine-tuning paradigm, prompt-based methods convert the downstream tasks to the form more consistent with the model’s pre-training tasks. Based on this, Li et al. [7] propose a document-level EAE as conditional generation following pre-defined event prompt template. Liu et al. [12] propose a generative template-based event extraction method with dynamic prefix following Li et al. [7]. Ma et al. [15] utilize prompt learning for extractive objectives by designing suitable event prompt template. Based on Ma et al. [15], we build dual-prompting to take advantage of PLM better.

3 Methodology Figure 2 shows the proposed DPIERE model, which takes advantage of pre-trained language model (PLM) by prompt learning to improve EAE. We formulate EAE as a prompt-based span prediction problem following prior work [15]. Formally, given an   instance X, t, e, R(e) , where X denotes the input (sentence or document), t ∈ X denotes the event trigger in X, e denotes the event type of t, and R(e) denotes the predefined argument roles of event e. We aim to extract a set of span A; each a(r) ∈ A is a segmentation of X and represents an argument about r ∈ R(e) . Note that if a role does not have a corresponding argument, then the span is empty. The model consists of three stages: (1) Dual-Prompting Template Creation: design appropriate input and event prompt templates according to the input; (2) Entity Representation Enhancement: build enhanced entity representation; (3) Dual-Prompting Template Interaction: infuse entity knowledge into PLM and accomplish EAE. 3.1

Dual-Prompting Template Creation

In order to mine the patterns between argument roles and entities, we design appropriate input and event prompt templates for the input to model entities and argument roles, respectively. Input Prompt Template. The purpose of input prompt template is to model entities, we insert special tokens around entities. Formally, assume that X = [x1 , . . . , xi , . . . , xn ] is the input of n tokens with only one entity located at xi , then the input prompt template would have the form: X  = [x1 , . . . , e, xi , /e, . . . , xn ], where e and /e are special tokens to prompt entities. In this way, we explicitly model the entity. Although the purpose is achieved, the potential problem is that all the entities are modeled in the same way. For more refined modeling, we introduce the labels of entities. Specifically, we utilize the labels corresponding to the entities as special tokens instead of e and /e. For example, in Fig. 2, the original input is “The essence of the American case for war against Saddam Hussein”, and the final input prompt template after modeling is “The essence of the GPE American /GPE case for t war /t against PER Saddam Hussein /PER”, where GPE and /GPE are special tokens to prompt the entity type is GPE, PER, /PER are similar, t and /t are special tokens to indicate triggers.

Event Argument Extraction with DPIERE

165

Event Prompt Template. The purpose of event prompt template is to model argument roles. The argument roles for each event are pre-defined in the corpus. We introduce extra position marks for each argument role based on prior work [7, 15]. For example, in Fig. 2, given event type as Conflict.Attack and argument roles as {Attacker, Target, Victim, Instrument, Place}, the event prompt template P e is defined as follows: arg01 Attacker /arg01 attacked arg02 Target /arg02 hurting arg03 Victim /arg03 using arg04 Instrument /arg04 at arg05 Place /arg05 where underlined tokens are position marks. We can see a pair of numbered position marks around each argument role. In this way, even if there are multiple arguments with the same role, we can distinguish them by different position marks. Note that both the special tokens in the input prompt template and position marks in the event prompt template are randomly initialized and updated by prompt learning. 3.2 Entity Representation Enhancement To handle large training data, existing PLMs usually divide less common words into subword units. Furthermore, PLMs have to disperse the information of massive entities into their sub-word embedding. To alleviate this problem, Ye et al. [21] propose a simple but effective pluggable entity lookup table (PELT) to infuse entity knowledge into PLM. The basic idea is to mask the entity and aggregate the entity’s output representation of multiple occurrences in the corpus. In this paper, we first build the enhanced entity representation for each corpus according to PELT. Considering that an entity consists of one or more words, we design an adaptive entity representation selection mechanism in Sec. 3.3 to infuse the enhanced entity representation. Note that we utilize automated tools for the corpus without entity annotation to pre-process and obtain the result of named entity recognition. 3.3 Dual-Prompting Template Interaction Given original input X, we can obtain the final input and event prompt templates X  , P e after Sec. 3.1. Our PLM is based on Transformer encoder-decoder architecture, and here we choose BART [5]. As illustrated in Fig. 2, we feed X  into BART-Encoder and P e into BART-Decoder separately. We first utilize BART-Encoder to obtain the encoder hidden representation of tokens in X  : H enc = BART-Encoder(X  )

(1)

where H enc denotes the encoder hidden representation of X  . Then we infuse the entity knowledge into H enc . Considering that an entity consists of one or more words (e.g., Saddam Hussein in Fig. 2), we devise an adaptive entity representation selection mechanism to enhance each word in the entity: exp(W enc [H ienc ; H e ]) j j∈es exp(W enc [H enc ; H e ])

αi = 

ˆ ienc ← H ienc + αi H e H

(2)

where H e denotes the enhanced entity representation of entity e, which is obtained in Sec. 3.2, H ienc denotes the i-th encoder word representation of entity e, es denotes

166

R. He et al.

the span of entity e. W enc denotes the training parameter, αi denotes the score of the ˆ enc denotes the enhanced encoder representation after infusing entity i-th word, and H knowledge. Then we go through BART-Decoder to obtain the decoder hidden representation: ˆ enc , H ˆ enc ) H dec = BART-Decoder(H

ˆ enc ) H pe = BART-Decoder(P e , H

(3)

where H dec denotes the decoder hidden representation of X  , and H pe denotes the decoder hidden representation of P e . In this way, BART-decoder fully captures the interaction between argument roles in P e and entities in X  . For the argument roles surrounded with position marks in P e , we can the obtain position-aware role representation: (4) hkr = H pe [startk : endk ] ∈ R3×d where hkr denotes the representation of the k-th argument role in P e , startk and endk denote the start, and end position (including position marks) of the k-th argument role, and d denotes the dimension of the hidden layer in BART. Argument Span Prediction. After obtaining the representation of all argument roles, we utilize argument span prediction to predict the start and end words among input for each role. Given the representation of argument role hkr , we first obtain the start and end features: hstart = ReLU(W start hkr + bstart ) ∈ Rd

hend = ReLU(W end hkr + bend ) ∈ Rd (5) where hstart and hend denote the start and end feature for the k-th argument role. W start , bstart , W end , and bend denote training parameters, ReLU denotes the activation function. Note that before Eq. 5, we will first reshape H kr into 1 × 3d. The purpose of argument span prediction is to obtain a span (sk , ek ) for argument role hkr from H dec where sk and ek denote the start and end index among input. Note that if an argument role does not have a corresponding argument, then the span is (0, 0). Before predicting the span, we infuse the entity knowledge into H dec adaptively similar to Eq. 2: βi = 

exp(W dec [H idec ; H e ])

j j∈es exp(W dec [H dec ; H e ])

ˆ idec ← H idec + βi H e H

(6)

where H idec denotes the i-th decoder word representation of entity e, H e , es are the same in Eq. 2, W dec denotes the training parameter, βi denotes the score of the iˆ dec denotes the enhanced decoder representation after infusing entity th word, and H knowledge. Then we obtain the start and end span: ˆ dec ) ∈ Rn sstart = Softmax(hstart H

ˆ dec ) ∈ Rn send = Softmax(hend H

(7)

where sstart and send denote the probabilities distribution of start and end positions location, n denotes the length of the input. To optimize DPIERE, the following combined loss function is used:   L= Lk (X) Lk (X) = −(sk logsstart + ek logsend ) (8) X ∈D

k

Event Argument Extraction with DPIERE

167

where D denotes all input in the corpus, sk and ek denote the probability distribution (one hot vector of length n) of sk and ek among input for the k-th argument role.

4 Experiments To verify the validity of our method, we explore the following questions: Q1: Can DPIERE better utilize the advantage of PLM to boost the performance of EAE? Q2: How does each module affect the results? Q3: How do different methods to model entities affect the results? Q4: How does the golden entity annotation affect the performance? 4.1 Settings Datasets. We conduct experiments on three standard datasets in the EAE: ACE2005, RAMS [2], and WIKIEVENTS [7]. ACE2005 is a sentence-level EAE corpus that has 599 English annotated documents. In addition to the event annotation, ACE2005 provides entity annotation. The same dataset split as the previous work [10, 15] is used. We follow the pre-processing procedure of DyGIE++ [17] that keeps 33 event types, 22 argument roles, and 7 entity types. RAMS is a document-level EAE corpus with 3993 English annotated documents totaling 9124 examples, 139 event types, and 65 argument roles. Each example contains 5 sentences, one event, and some arguments. We follow the original dataset split. However, There are no entity annotations in RAMS. Therefore, we utilize spaCy to pre-process the dataset and obtain the result of named entity recognition. After preprocessing, there are 18 entity types. WIKIEVENTS is another document-level EAE corpus with 246 English annotated documents, 50 event types, and 59 argument roles. Like ACE2005, WIKIEVENTS also provides entity annotation, with 17 entity types. Evaluation Metric. We use standard Precision (P ), Recall (R), and F1 score following previous work [7, 15]. An argument is considered correctly extracted if its offset and role are precisely matched. Baselines. To evaluate the proposed method, we compare our model with the following two category methods: a) QA-based method: EEQA [1] and RCEE ER [9] are QA-based models which use machine reading comprehension models for sentence-level EAE. While FEAE [19] and DocMRC [10] are for document-level EAE. b) Generation-based method: BART-Gen [7] is a document-level event argument model by formulating EAE as conditional generation following pre-defined event prompt template. And PAIE [15] is a prompt-based method by designing suitable event prompt template to extract arguments and the roles they play simultaneously.

168

R. He et al.

Table 1. Performance of EAE on the test sets of the different corpora. ∗ indicates the results from the original paper. Model

PLM

ACE2005 P R

F1

RAMS P R

F1

WIKIEVENTS P R F1

67.9∗ 71.2∗ – –

65.4∗ 70.1∗ – –

– – 53.2∗ 43.4∗

– – 42.8∗ 48.3∗

– – 47.4∗ 45.7∗

– – – 64.2∗

– 49.3 53.7 49.3 54.4

63.0∗ 69.1∗ – –

EEQA [1] RCEE ER [9] FEAE [19] DocMRC [10]

BERT-base BERT-large BERT-base BERT-base

BART-Gen [7] PAIE [15]

BART-large 67.8∗ 65.6∗ 66.7∗ – BART-base 70.0 69.8 69.9 49.7 BART-large 69.3 75.7 72.4 50.8

DPIERE (Ours) BART-base 68.8 BART-large 72.1

72.4 75.7

70.6 73.9

51.1 50.9

– – – 36.2∗

– – – 46.3∗

48.6∗ – 49.5 65.3 52.2 62.6

– 58.5 65.0

65.1∗ 61.7 63.8

50.2 52.6

62.1 65.6

62.6 65.1

63.2 64.6

Implementations. We optimize our model based on the F1 score on the development set. Each experiment is conducted on a single NVIDIA GeForce RTX 3090 24 GB. Due to the GPU memory limitation, we use different batch sizes for diverse models and corpora. For ACE2005, the batch size for the base and large models are 32 and 16, respectively. For RAMS and WIKIEVENTS, the batch size for the base and large models are 8 and 4, respectively. The learning rate is set to 2e-5 for the AdamW optimizer with Linear scheduler, and the warmup ratio is 0.1. The epoch is set to 50, and the early stop is set to 8, denoting the training will stop if the F1 score does not increase during 8 epochs on the development set. Furthermore, the max encoder sequence length of the input template is set to 240, 500, and 500 for ACE2005, RAMS, and WIKIEVENTS, respectively. The max decoder sequence length of the output template is set to 50, 50, and 80 for ACE2005, RAMS, and WIKIEVENTS, respectively. Moreover, the input in the document-level dataset sometimes exceeds the constraint of the max encoder sequence length; thus we add a window centering on the trigger words and only encode the words within the window. Following Ma et al. [15], the window size is 250. 4.2

Overall Performance

Table 1 presents the performance of all baselines and DPIERE on the test set of all corpora. For Q1, from Table 1, we can observe that: 1) By dual-prompting interaction with entity representation enhancement, DPIERE could better capture the interaction between argument roles and entities and outperforms most baselines. Our approach, designing appropriate input and event prompt templates to model entities and argument roles explicitly, surpasses most methods on different corpora. In addition, we add position marks in the event prompt template, which could guide distinguishing the arguments with the same role.

Event Argument Extraction with DPIERE

169

Table 2. Ablation study on the test sets of the different corpora. Model

PLM

ACE2005 P R F1

RAMS P R

F1

WIKIEVENTS P R F1

DPIERE BART-base 68.8 72.4 70.6 51.1 49.3 50.2 63.2 62.1 62.6 w/o PM w/o IPT w/o ERE w/o ALL

BART-base BART-base BART-base BART-base

70.0 66.8 69.2 66.5

69.8 70.2 68.1 66.7

69.9 68.5 68.7 66.6

50.1 50.6 49.7 49.6

49.3 48.3 49.0 46.2

49.7 49.4 49.3 47.8

60.3 61.2 59.7 57.2

63.8 62.0 62.7 62.9

62.0 61.6 61.2 60.0

2) By explicitly modeling entities in the input, DPIERE outperforms other sequence generation methods with prompt template. Compared with PAIE [15], DPIERE achieves better performance with extra input prompt template and entity representation enhancement. Specifically, for sentence-level EAE, DPIERE has a 0.7% and 1.5% F1 improvement for the base and large model; For document-level EAE, the improvement for RAMS is 0.7% and 0.4%, and the improvement for WIKIEVENTS is 0.9%, and 1.1%. Note that the improvement on RAMS is slightly decreased than ACE2005 and WIKIEVENTS. This may be because RAMS does not have a golden entity annotation. 3) Notably, BART-Gen is competitive on WIKIEVENTS compared with DPIERE. A potential reason is that BART-Gen also models entities by constraining the entity types of arguments. While no corresponding golden annotation in ACE2005 and RAMS, which leads to a performance decrease. Unlike BART-Gen, DPIERE designs input prompt template to model entities. Even though there is no entity annotation, we can obtain the approximate result by automatic tools. 4) Compared with methods based on QA, the generation-based approach has a better performance under comparable parameters by designing suitable prompt template, which indicates they could better leverage the knowledge of PLM.

4.3 Ablation Study To verify Q2, we conduct the following experiments: 1) “w/o PM” indicates removing the position marks (PM) in the event prompt template. 2) “w/o IPT” discards the input prompt template (IPT) creation and directly feeds the input text into BART-Encoder. 3) “w/o ERE” denotes without enhanced entity representation (ERE). 4) “w/o ALL” means removing PM, IPT, and ERE. From Table 2, we can observe that: 1) PM, IPT, and ERE are necessary for DPIERE to achieve the highest performance. Remove any component, and performance will decrease. In particular, on ACE2005, the F1 score decreases 0.7%, 2.1%, 1.9%, 4% when removing PM,

170

R. He et al. Table 3. The F1 score of DPIERE and variants. Model

PLM

F1 ACE2005 RAMS WIKIEVENTS

DPIERE

BART-base 70.6

50.2

62.6

DPIERE-I DPIERE-E DPIERE-D DPIERE-S

BART-base BART-base BART-base BART-base

49.7 49.8 49.4 49.5

61.9 61.5 61.3 61.6

70.1 69.4 68.9 69.2

Table 4. The F1 score of DPIERE under golden and non-golden entity annotation. DPIERE-G and DPIERE-N denote golden and non-golden entity annotation. Model

PLM

F1 ACE2005 WIKIEVENTS

DPIERE-G BART-base 70.6

62.6

DPIERE-N BART-base 70.2

62.3

IPT, ERE, and ALL. On RAMS, the F1 score decreases 0.5%, 0.8%, 0.9%, 2.4%. On WIKIEVENTS, the F1 score decreases 0.6%, 1%, 1.4%, 2.6%. Note that the decrease in RAMS is slightly compared with ACE2005 and WIKIEVENTS. An important reason is that RAMS does not have golden entity annotation. What is more, a potential reason is that the size of the RAMS dataset is larger than ACE2005 and WIKIEVENTS, which could better unleash the performance of PLM by prompt learning. 2) The improvement of IPT and ERE is similar across corpora, indicating the effectiveness of explicitly modeling entities. According to the ACE2005 annotation guideline, all arguments are entities. By special tokens to prompt entities or enhancing entity representation, PLM could better capture the interaction between argument roles and entities. 3) After removing all modules, the performance degradation value is not equal to the sum of removing “PM”, “IPT”, and “ERE” which indicates that there is overlap between different components and could support each other. For example, “IPT” and “ERE” are aiming to model entities.

4.4

Different Approaches to Model Entities

To illustrate Q3, we study different approaches to model entities: 1) Replace the special tokens denoting the label of entities with “e” and “/e” in the input prompt template, called DPIERE-I. 2) Only infuse enhanced entity representation into BART-Encoder, called DPIERE-E. 3) Only infuse enhanced entity representation into BART-Decoder, called DPIERE-D.

Event Argument Extraction with DPIERE

171

4) Infuse enhanced entity representation into BART-Encoder and Decoder but without the adaptive selection, called DPIERE-S. From Table 3, we can observe that: 1) Using special tokens indicating the label of entities rather than “e” and “/e” could model entities in a more refined way. Entities of the same type will have the same tokens. And in the training process, the model can perceive the difference and connection between entities. ACE2005 and WIKIEVENTS have 7 and 17 entity types, respectively. Correspondingly, the decrease in ACE 2005 (0.5%) is lower than WIKIEVENT (0.7%). As for RAMS, although has 18 entity types, while the decrease is 0.5%, the proper reason is that the entity annotation is not golden, and consists of bias. 2) The result of only infusing enhanced entity representation into BART-Encoder is better than BART-Decoder. The average improvement on the three datasets is 0.4%. That is because when infusing enhanced entity representation into BART-Encoder, the input and output template could better interact in BART-Decoder. In this way, DPIERE could fully infuse entity knowledge into PLM. 3) Adaptive selection mechanism is also necessary for DPIERE. An entity may contain one or more words. Through the mechanism, we can adaptively enhance each word in the entity. In addition, compared with DPIERE without ERE, the F1 score has also improved indicating the effectiveness of DPIERE-S. 4.5 Golden vs Non-golden Entity Annotation In the real world, the golden entity annotation is time-consuming and laborious. So in this section, we test the performance difference between golden entity labeling and automated tool labeling scenarios. For Q4, from Tab. 4, we can observe that: the quality of entity annotation has some influence on the results, but not much. Compared with PAIE [15], DPIERE-G and DPIERE-N have certain improvements. 1) The accuracy of named entity recognition is relatively high by automated tools (e.g., spaCy); 2) The number of arguments is small relative to the number of entities. According to the statistic on ACE2005, we find that 87.8% of the arguments are correctly marked as entities by spaCy.

5 Conclusions This paper proposes a novel dual-prompting interaction approach with entity representation enhancement (DPIERE) for sentence-level and document-level EAE by explicitly modeling entities in the input. Unlike other prompt-based methods that focus on designing suitable event prompt template, DPIERE designs appropriate input and event prompt templates to model entities and argument roles, respectively, so that pre-trained language model (PLM) could better capture the interaction between entities and argument roles. We also introduce a simple but effective entity representation enhancement approach with adaptive selection to enhance the words of entities in the input. Position marks are also introduced to distinguish arguments with the same role. Comprehensive experimental results on ACE2005, RAMS, and WIKIEVENTS demonstrate the effectiveness of the proposed method.

172

R. He et al.

Acknowledgments. Our work is supported by the National Natural Science Foundation of China (61976154).

References 1. Du, X., Cardie, C.: Event extraction by answering (almost) natural questions. In: Proceedings of EMNLP, pp. 671–683 (2020) 2. Ebner, S., Xia, P., Culkin, R., Rawlins, K., Van Durme, B.: Multi-sentence argument linking. In: Proceedings of ACL, pp. 8057–8077 (2020) 3. Gao, J., He, D., Tan, X., Qin, T., Wang, L., Liu, T.: Representation degeneration problem in training natural language generation models. In: Proceedings of ICLR (2018) 4. Huang, Y., Jia, W.: Exploring sentence community for document-level event extraction. In: Findings of EMNLP, pp. 340–351 (2021) 5. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of ACL, pp. 7871–7880 (2020) 6. Li, F., et al.: Event extraction as multi-turn question answering. In: Findings of EMNLP, pp. 829–838 (2020) 7. Li, S., Ji, H., Han, J.: Document-level event argument extraction by conditional generation. In: Proceedings of NAACL, pp. 894–908 (2021) 8. Lin, Y., Ji, H., Huang, F., Wu, L.: A joint neural model for information extraction with global features. In: Proceedings of ACL, pp. 7999–8009 (2020) 9. Liu, J., Chen, Y., Liu, K., Bi, W., Liu, X.: Event extraction as machine reading comprehension. In: Proceedings of EMNLP, pp. 1641–1651 (2020) 10. Liu, J., Chen, Y., Xu, J.: Machine reading comprehension as data augmentation: a case study on implicit event argument extraction. In: Proceedings of EMNLP, pp. 2716–2725 (2021) 11. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55(9), 1–35 (2023) 12. Liu, X., Huang, H., Shi, G., Wang, B.: Dynamic prefix-tuning for generative template-based event extraction. In: Proceedings of ACL, pp. 5216–5228 (2022) 13. Liu, X., Luo, Z., Huang, H.: Jointly multiple events extraction via attention-based graph information aggregation. In: Proceedings of EMNLP, pp. 1247–1256 (2018) 14. Lu, Y., et al.: Text2Event: controllable sequence-to-structure generation for end-to-end event extraction. In: Proceedings of ACL, pp. 2795–2806 (2021) 15. Ma, Y., et al.: Prompt for extraction? PAIE: prompting argument interaction for event argument extraction. In: Proceedings of ACL, pp. 6759–6774 (2022) 16. Paolini, G., et al.: Structured prediction as translation between augmented natural languages. In: Proceedings of ICLR (2021) 17. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: Proceedings of EMNLP, pp. 5784–5789 (2019) 18. Wang, C., Liu, P., Zhang, Y.: Can generative pre-trained language models serve as knowledge bases for closed-book QA? In: Proceedings of ACL, pp. 3241–3251 (2021) 19. Wei, K., Sun, X., Zhang, Z., Zhang, J., Zhi, G., Jin, L.: Trigger is not sufficient: exploiting frame-aware knowledge for implicit event argument extraction. In: Proceedings of ACL, pp. 4672–4682 (2021) 20. Yang, B., Mitchell, T.M.: Joint extraction of events and entities within a document context. In: Proceedings of NAACL, pp. 289–299 (2016) 21. Ye, D., Lin, Y., Li, P., Sun, M., Liu, Z.: A simple but effective pluggable entity lookup table for pre-trained language models. In: Proceedings of ACL, pp. 523–529 (2022)

Collective Entity Linking with Joint Subgraphs Kedong Wang1,2 , Yu Xia1,2 , and Fang Kong1,2(B) 1

2

Institute of Artificial Intelligence, Soochow University, Suzhou, China {kdwang1011,20205227072}@stu.suda.edu.cn School of Computer Science and Technology, Soochow University, Suzhou, China [email protected]

Abstract. The problem of collective methods based on graph presents challenges: given mentions and their candidate entities, methods need to build graphs to model correlation of linking decisions between different mentions. In this paper, we propose three ideas: (i) build subgraphs made up of partial mentions instead of those in the entire document to improve computation efficiency, (ii) perform joint disambiguation over context and knowledge base (KB), and (iii) identify closely related knowledge from KB. With regard to above innovations, we propose EL-Graph, which addresses the challenges of collective methods: (i) attention mechanism, where we select low attention scores of partial mentions of a document to form a subgraph to improve the computation efficiency, (ii) joint disambiguation, where we connect mention context and KB to form a joint graph, gather and spread the message to update their node representations simultaneously through graph neural networks, and (iii) relevance scoring, where we compute similarity to estimate importance of KB nodes relative to the given context. We evaluate our model on publicly available dataset and show the effectiveness of our model. Keywords: Entity linking

1

· Subgraph · Joint disambiguation

Introduction

Entity Linking (EL) is the task of disambiguating mentions in texts to a knowledge base (KB), often regarded as crucial for natural language understanding. However, one mention can be represented with multiple entities from KB. As shown in Fig. 1, the mention Greens refers to Australian Greens, Alliance ’90/The Greens, and so on. The challenge for EL is to link each mention to the most suitable entity in the knowledge base. The key of solving above challenge is feature extraction for ranking candidate entities. Local features like prior distribution, local contexts or in the form of hand-crafted features (Yamada et al. [14,15]) are used for local models to resolve mentions independently (Francis-Landau et al. [2], Ganea et al. [3]). However, K. Wang and Y. Xia—Equal contribution. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 173–184, 2023. https://doi.org/10.1007/978-3-031-44696-2_14

174

K. Wang et al.

Fig. 1. Structure of our joint subgraph. Existing global models optimize the linking configuration over all mentions, which is lead to consumption of computation and memory. In our model, we extract subgraphs and aim to derive the true entity (blue box) by performing joint reasoning over the semantic features (orange box) and external KB. (Color figure online)

in many cases, local features can be too sparse to provide sufficient information for disambiguation. To alleviate this problem, collective models have been proposed which assume that documents mainly refer to coherent entities with related topics, and entity assignments of mentions in a document are interdependent on each other, thus providing global features like structure or topic information. (Globerson et al. [4], Nguyen et al. [10], Minh et al. [12]).

2

Motivation

Global models based on graph (Huang et al. [7]) usually utilize structure feature to capture coherent entities for all identified mentions in a document.Cao et al. [1] applied Graph Convolutional Network (GCN) [8] to integrate global coherence information for EL. Guo et al. [5] solved the global training problem via recurrent random walk. However, global models based on graph lead to high complexity, especially when the graph may contain hundreds of nodes in case of long documents. Additionally, existing methods treat context and KB as two separate modalities. They individually apply local models to context and graph neural networks to KB, thus leads to that disambiguation over the graph only uses the external knowledge without context to provide semantic feature. This separation might limit their capability to perform structured reasoning and ignore the relevance of two kinds of information. Also, an graph based on KB contains not

Collective Entity Linking with Joint Subgraphs

175

Fig. 2. Overview of our model. Given contexts z (concatenation of context of partial mentions), we connect it with the retrieved KB to form a joint subgraph, compute the relevance of each KB node conditioned on z and apply GAT to integrate global coherence information.

only candidate entities, but also their closely related entities (neighbour entities), which empower the model with important associative abilities. But this may introduces many neighbour entities that are semantically irrelevant to the context, especially when the number of candidate entities or hops increases, thus increasing complexity. We propose EL-Graph which not only utilizes semantic features like context, but also leverages external knowledge to form a joint graph for collective disambiguation. Figure 2 shows the architecture of our model. To begin with, to alleviate the global computations, we introduce attention mechanism to select partial mentions to make up subgraphs for expressing the same topic efficiently, therefore, the entire document can be made up with multiple subgraphs that facilitates computation acceleration through GPUs and batch techniques. Additionally, we add local features like context to collective method based on graph, capturing the nuance and strengthening the relevance of local and global disambiguation. Learning from question answering Yasunaga et al. [17], we design a joint graph including context, mention, candidate entities and neighbour entities, where we view concatenation of contexts of partial mentions z as an additional node, mention as a question node, and candidate entities as answer choice nodes. This joint graph unifies two modalities into one graph, which performs joint disambiguation and updates representations of z and KB simultaneously. At last, we compute relevance score between context and other nodes as weight information of graph to filter out nodes that are not relevant to the context, thus decreasing the number of nodes in graph for improving computation efficiency.

3

Background

In this section, we introduce definition of entity linking, also, EL-Graph is based on local models which usually utilizes local features to model mention-to-entity compatibility and resolve mentions independently. 3.1

Problem Definition

Given a document D with a list of mentions M (D) = m1 , ..., mn and their |E | candidate entities E(mi) = e1i , e2i , ..., ei i , entity linking aims to link each mi

176

K. Wang et al.

to gold entity e∗i . Such a process focuses on candidate ranking which means that ranking E(mi) to select the one from E(mi) with the highest score as the predicted entity ei . We aim to propose collective model based on joint graph. We denote G = (V, E) as the graph for a document D, where V represents the set of nodes, E ⊆ V × R × V is the set of edges that connect nodes in V , R represents a set of relation types of E. Nodes in joint graph includes context, mention, candidate entities and neighbour entities. We denote Vm ⊆ V , Ve ⊆ V as the set of mentions (red box in Fig. 1) and their correspond KB candidate entities (blue box in Fig. 1). Then we extract subgraphs from G for partial mentions-entities sub sub with attention mechanism, Gsub m,e = (Vm,e , Em,e ).

4

Collective Method

Collective EL focuses on the topical coherence throughout the entire document. The consistency assumption behind it is that: all mentions in a document shall be on the same topic. In this section, we give a detailed description of our collective (global) method. The key idea of this model is forming subgraphs with attention mechanism and combining semantic (context) feature with external knowledge (KB) for simultaneous disambiguation. 4.1

Subgraph

However, existing methods based on graph leads to exhaustive computations in case of long documents with a low of mentions. Therefore, we propose attention mechanism to select partial mentions to form a subgraph, thus, the topical coherence at document level shall be achieved by multiple subgraphs. One mention can be viewed as a query and other mentions in the same document are viewed values. The output is the weight assigned to each value which is computed by a compatibility function. Learning from contrastive learning, we select partial mentions with q lowest weights, which means that through partial mentions with negative correlation to express the same topic, thus improving performance. For example, as shown in Fig. 1, mentions Greens is the political party, provide adequate disambiguation clues to induce the underlying topic, which impacts positively on identifying the mention SPD as another political party via the common neighbor entity Party. We define N (mi) as selected results of mention mi with attention and the size of it is q. However, if q is large enough to cover the entire document, this design is the same as the previous work like Pershina et al. [11] that directly modeling the whole graph. In this respect, we choose to adjust the parameter q which is much smaller that total number of mentions of dataset, which not only improving computation efficiency, but also trimming the graph into multiple a fixed size of subgraphs that facilitate computation acceleration through GPUs and batch techniques.

Collective Entity Linking with Joint Subgraphs

4.2

177

Joint Graph Representation

To combine two sources of knowledge for simultaneous disambiguation, we form a joint subgraph. Nodes in subgraph includes Vm = [mi, N (mi)], their correspond candidate entities Ve = E(Vm ) and neighbour entities N (Ve ). In addition to the above nodes, We introduce an semantic node z which is concatenation of partial mentions’ contexts as a virtual, extra node and connect z to Vm and Ve on the subgraph Gsub m,e using two new kinds of relation types rz,m and rz,e and computing relevance scores as relation weights on them. Each node associated with one of the four types [Z, M, E, O], each indicating the semantic node z, nodes in Vm , nodes in Ve , and other nodes in N (Ve ), respectively (corresponding to the node color, orange, red, blue, gray in Fig. 1). We term this subgraph joint graph. 4.3

Relevance Scores

Neighbour entities N (eji ) consist of KB entities that have inlinks pointing to eji . However, these KB nodes may be irrelevant under the mention context. The retrieved KG subgraph Gsub m,e with few-hop neighbors of the Ve may include nodes that are uninformative for the disambiguation, thus leading to overfitting or introducing noise, especially when the number of N (eji ) is large. As shown in Fig. 1, Party and Weimar are thematic. With regard to filtering nodes, we propose relevance scoring between other nodes and semantic node as Fig. 3. Inspired by Ganea et al. [3], for mention mi , ci as context of mi , neighbour entities (nodes in N (ei )) that are more coherent with mention context are preferred. More specifically, we calculate the relevance score for each eˆj ∈ N (ei ) as s(eˆj ) = max cT i · A · eˆj ci

(1)

where A is a parameterized diagonal matrix. Top k neigbour entities are left to add to graph while the others are pruned. The relevance scores are transformed to attention weights with: exp(s(eˆi )) eˆj ∈N (ei ) exp(s(eˆj ))

a(eˆi ) = 

(2)

Then, we can define a final relevance score rs between context and other nodes as  rs(ci , N (ei )) = a(eˆi ) · cT (3) i · R · eˆj eˆj ∈N (ei )

where R is a learnable diagonal matrix. The probability rs(ci , N (ei )) denotes the importance of each neighbour entity related to context node, which trims the joint graph Gsub m,e .

178

K. Wang et al.

Fig. 3. Computation of relevance score

4.4

GNN Architecture

To perform disambiguation on the joint graph Gsub m,e , our GNN module builds on the graph attention framework (GAT) [13], which induces node representations via iterative message passing between neighbors on the graph. Specifically, in a L-layer EL-Graph, for each layer, we update the representation hlt of each node t as  hl+1 = fn ( αst mst ) + hlt (4) t s∈N (t)∪t

where N (t) represents the neighborhood of node t, mst ∈ R denotes the message from each neighbor node s to t, and αst is an attention weight that scales each message mst from s to t. In terms of attention weight αst , this operation aggregates the features of neighborhood onto the center nodes and captures connection of two nodes, which is ideally informed by their node types, relations and node relevance scores. In terms of relations, except rz,m and rz,e , we mainly make use of whether having inlinks. Relevance score between mentions and their candidate entities can be local scores computed by local models, thus strengthening correlation of two task. Additionally, other kinds of relevance scores can be computed by Fig. 3. To compute αst , we obtain query and key vectors q, k by: qs = fq (hls , fu (us ), fp (ps ))

(5)

fk (hlt , fu (ut ), fp (pt ))

(6)

kt =

where us , ut are one-hot vectors indicating the node types of s and t, ps , pt are the relevance score of node s and t, fq , fk , fu , fp are are linear transformations.

Collective Entity Linking with Joint Subgraphs

179

Then αst can be computed by: qsT k γst = √ d exp(γst ) t ∈N (s)∪s exp(γst )

αst = 

4.5

(7) (8)

Learning

Crucially, as our GNN message passing operates on the joint graph, it will jointly leverage and update the representation of the context and KB. In this graph, context node serves as a virtual, extra node and connect nodes in Vm and Ve with dummy edges, so that other nodes reads and writes through the master node during message passing. Finally, the master (context) node contains global information or topic information, which is spliced with each mention for classification. We optimize the model using the cross entropy loss.

5 5.1

Experiment Experimental Settings

In this section, we evaluate our model on five datasets to verify validity and investigate the effectiveness. – Datasets: Following our predecessors, we evaluate our model on the public and widely used CoNLL-YAGO dataset [6]. The target knowledge base is Wikipedia. The corpus consists of AIDA-train which has 946 documents for training, AIDA-A which has 216 documents with 4791 mentions for development and AIDA-B which has 231 documents with 4485 mentions for testing. To evaluate the generalization ability of each model, we apply cross-domain experiments following the same setting in ganea et al. [3]. The statistics are shown in Table 1. – Parameter Setting: We set the dimensions of word embedding and entity embedding to 300, where the word embedding and entity embedding are publicly released by Ganea et al. [3]. We use the following parameter values of the best validated model: In each epoch of 50, with regrad to local models, we set context window = 100 and choose layers of LSTM is 3. In collective model, the number of mentions selected to form a subgraph q = 3. We trim the graph with hop size = 2 and keep the top k neighbour nodes accroding to relevance score. In terms of GAT, we set the d = 200 and gat layers = 4 with dropout rate 0.2 applied to each layer. We use Adam with a learning rate of 2e-4. The parameters of our model are shown in Table 2. – Baseline: We compare our model with the following methods: Pershina et al. [11] presents a graph-based collective method based on PageRank and utilizes random walk on the whole graph. Francis-Landau et al. [2] models semantic similarity between a mention’s surrounding text and its candidate entities using convolutional neural networks. Ganea et al. [3] utilizes attention

180

K. Wang et al. Table 1. The Statistics of datasets. Dataset

Mention Doc Recall

AIDA-train 18448 4791 AIDA-A 4485 AIDA-B-A

946 – 216 97.1 231 98.5

MSNBC 656 727 AQUAINT 257 ACE2004 11154 CWEB 6821 WIKI

20 50 36 320 320

98.5 94.2 90.6 91.1 92.4

Table 2. The Parameters of Our model. Parameter

Value Parameter Value

epoch

50

lr

2e−4 dr

d

200 0.2

LSTM layer 3

q

3

GAT layer

optimizer

Adam

4

mechanism over local context windows. Cao et al. [1] performs approximately graph convolution on a subgraph of adjacent entity mentions instead of those in the entire text for the sake of saving memory and computation. The results are shown in Table 3, our model achieves the best performance of compared methods on in-domain dataset.

5.2

Collective Method

Compared with methods Francis-Landau et al. [2] and Ganea et al. [3] that focus on local features and do not make use of correlation of linking decisions, our model considers global features, thus increasing performance. Also, we extract subgraphs to reduce complexity. As seen in Table 5, if q is small enough like the q = 2, this method is similar to the local model. Not only does the global model make little sense at this point, but it also increases memory of graph. 5.3

GNN Architecture

Through results in Table 3, compared with global methods based on graph Pershina et al. [11] and Cao et al. [1], random walk algorithm used by Pershina et al. [11] is not differentiable and thus difficult to be integrated into neural network models, thus lacking of generalization ability and effectiveness of feature extraction. Additionally, Cao et al. [1] proposes GCN and our model proposes

Collective Entity Linking with Joint Subgraphs

181

Table 3. In-domain Performance Comparison on the AIDA-B Dataset. Our model achieves the best reported scores on this benchmark. Baseline

F1 (%)

Pershina et al. [11] 91.77 Francis-Landau et al. [2] 90.6 92.22 Ganea et al. [3] 87.2 Cao et al. [1] local model our model

90.84 93.52

Table 4. Performance Comparison on Cross-domain Datasets using F1 score (%) Baseline

MSBNC AQUAINT ACE2004 CWEB WIKI

Ganea et al. [3] Phong et al. [1] Ganea et al. [9] Yang et al. [16]

75.77% – 93.9 94.57

88.5 87 88.3 87.38

88.5 88 89.9 89.44

77.9 – 77.5 73.47

77.5 – 78.0 78.16

our model

93.82

90.71

91.18

77.29

78.35

GAT, we can find that in essence GCN and GAT both aggregate the features of neighboring nodes to the central node and utilize the local stationary on the graph to update representation of nodes. The difference is that GCN uses the Laplacian matrix, and GAT uses the attention coefficient. To a certain extent, the GAT will be stronger because the correlation between nodes features is better incorporated into the model. Meanwhile, the scale of our datasets is large. GCN is operating on the whole graph, and the node features of the whole graph are updated in one calculation. The learned parameters are largely related to the graph structure, which makes GCN difficult for inductive tasks (Table 4). 5.4

Graph Connection

Cao et al. [1] obtains the graph by taking neighbor mentions and their correspond candidate entities like the first row in the Table 6, which means that ignore the relevance of semantic features (local method) and external knowledge (global method). On this side, we add context node to form a joint graph for simultaneous disambiguation, thus taking advantage of correlation of two tasks. As seen in Table 5, without these edges, performance is in the decrease. 5.5

Relevance Score

As seen in Fig. 4 about relevance scoring of the retrieved KB: we calculate the relevance score of each neighbour node conditioned on the mention context with

182

K. Wang et al. Table 5. Performance of different kinds of subgraphs Structure of subgraph F1 (%) q=2

91.14

q=3

93.52

q=7

92.16

Table 6. Performance of different kinds of graph connection Graph connection

F1 (%)

No edge between z and KG nodes 91.99 Connect z to Ve

92.64

Connect z to Vm and Ve

93.52

method in Fig. 3. The darker color means that the score is higher. Through Fig. 4, we can see that Party implies topical information.

6

Related Work

Local models focuses on resolving mentions independently relying on local features and their surrounding context. Francis-Landau et al. [2] models semantic similarity between a mention’s surrounding text and its candidate entities using convolutional neural networks. Ganea et al. [3] utilizes attention mechanism over local context windows. However, associations of linking decisions of different mentions in a document are underutilized and local features are too sparse. To alleviate above problem, global models jointly optimize the entire linking configuration. Pershina et al. [11] presents a graph-based collective method based on PageRank and utilizes random walk on the graph. However, this method models on the whole graph, which leads to high computation and memory consumption, especially in case of a long document with hundreds of nodes. Cao et al. [1] performs approximately graph convolution on a subgraph of adjacent entity mentions instead of those in the entire text to alleviate the global computations. However, this method can not consider dependencies and correlation between local models and global models. In contrast, EL-Graph not only extracts subgraph with attention mechanism, but also add context node to the graph, thus performing joint disambiguation.

7

Conclusion

In this paper, we propose EL-Graph as a plug-and-enhance module for local Entity Linking models. In contrast to existing global models based on graph, EL-Graph build subgraphs made up of partial mentions instead of those in the entire document to improve computation efficiency with attention mechanism.

Collective Entity Linking with Joint Subgraphs

183

Fig. 4. Computation of relevance score

Additionally, this model combines semantic features (context) with external knowledge (KB) to form a joint graph, thus updating their representations with GAT simultaneously. Finally, in order to trim the joint subgraphs, we utilize relevance scores between context and other nodes to select KB nodes which is highly related to context to reduce complexity. Acknowledgments. This work was supported by Projects 62276178 under the National Natural Science Foundation of China, the National Key RD Program of China under Grant No. 2020AAA0108600 and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

References 1. Cao, Y., Hou, L., Li, J., Liu, Z.: Neural collective entity linking. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 675–686 (2018) 2. Francis-Landau, M., Durrett, G., Klein, D.: Capturing semantic similarity for entity linking with convolutional neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1256–1261 (2016) 3. Ganea, O.E., Hofmann, T.: Deep joint entity disambiguation with local neural attention. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2619–2629 (2017) 4. Globerson, A., Lazic, N., Chakrabarti, S., Subramanya, A., Ringgaard, M., Pereira, F.: Collective entity resolution with multi-focal attention. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 621–631 (2016) 5. Guo, Z., Barbosa, D.: Robust named entity disambiguation with random walks. Semant. Web 9, 459–479 (2018) 6. Hoffart, J., et al.: Robust disambiguation of named entities in text. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 782–792 (2011) 7. Huang, H., Cao, Y., Huang, X., Ji, H., Lin, C.Y.: Collective tweet wikification based on semi-supervised graph regularization. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 380–390 (2014)

184

K. Wang et al.

8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR arXiv:1609.02907 (2016) 9. Le, P., Titov, I.: Improving entity linking by modeling latent relations between mentions arXiv:1804.10637 (2018) 10. Nguyen, T.H., Fauceglia, N., Rodriguez Muro, M., Hassanzadeh, O., Massimiliano Gliozzo, A., Sadoghi, M.: Joint learning of local and global features for entity linking via neural networks. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2310–2320 (2016) 11. Pershina, M., He, Y., Grishman, R.: Personalized page rank for named entity disambiguation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 238–243 (2015) 12. Phan, M.C., Sun, A., Tay, Y., Han, J., Li, C.: Pair-linking for collective entity disambiguation: two could be better than all. CoRR arXiv:1802.01074 (2018) 13. Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Li` o, P., Bengio, Y.: Graph attention networks. arXiv:1710.10903 (2017) 14. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Joint learning of the embedding of words and entities for named entity disambiguation. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 250–259 (2016) 15. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Learning distributed representations of texts and entities from knowledge base. CoRR abs/1705.02494 (2017). http://arxiv.org/abs/1705.02494 16. Yang, X., et al.: Learning dynamic context augmentation for global entity linking. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 271–281 (2019) 17. Yasunaga, M., Ren, H., Bosselut, A., Liang, P., Leskovec, J.: QA-GNN: reasoning with language models and knowledge graphs for question answering. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 535–546 (2021)

Coarse-to-Fine Entity Representations for Document-Level Relation Extraction Damai Dai, Jing Ren, Shuang Zeng, Baobao Chang, and Zhifang Sui(B) MOE Key Lab of Computational Linguistics, Peking University, Beijing, China {daidamai,rjj,zengs,chbb,szf}@pku.edu.cn

Abstract. Document-level Relation Extraction (RE) requires extracting relations expressed within and across sentences. Recent works show that graph-based methods, usually constructing a document-level graph that captures document-aware interactions, can obtain useful entity representations thus helping tackle document-level RE. These methods either focus more on the entire graph, or pay more attention to a part of the graph, e.g., paths between the target entity pair. However, we find that document-level RE may benefit from focusing on both of them simultaneously. Therefore, to obtain more comprehensive entity representations, we propose the Coarse-to-Fine Entity Representation model (CFER) that adopts a coarse-to-fine strategy involving two phases. First, CFER uses graph neural networks to integrate global information in the entire graph at a coarse level. Next, CFER utilizes the global information as a guidance to selectively aggregate path information between the target entity pair at a fine level. In classification, we combine the entity representations from both two levels into more comprehensive representations for relation extraction. Experimental results on two document-level RE datasets, DocRED and CDR, show that CFER outperforms baseline models and is robust to the uneven label distribution. Our code is available at https://github.com/Hunter-DDM/cferdocument-level-RE.

Keywords: Document-level Relation Extraction Network

1

· Graph Neural

Introduction

Relation Extraction (RE) aims to extract semantic relations between named entities from plain text. It is an efficient way to acquire structured knowledge automatically [12]. Traditional RE works usually focus on the sentence level [27, 30]. However, in real-world scenarios, sentence-level RE models may omit some inter-sentence relations while a considerable number of relations are expressed across multiple sentences in a long document [25]. Therefore, document-level RE has attracted much attention in recent years. D. Dai and J. Ren—Equal Contribution. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 185–197, 2023. https://doi.org/10.1007/978-3-031-44696-2_15

186

D. Dai et al.

Figure 1 shows an example document and corresponding relational facts for document-level RE. In the example, to extract the relation between Benjamin Bossi and Columbia Records, two entities separated by several sentences, we need the following inference steps. First, we need to know that Benjamin Bossi is a member of Romeo Void. Next, we need to infer that Never Say Never is performed by Romeo Void, and released by Columbia Records. Based on these facts, we can draw a conclusion that Benjamin Bossi is signed by Columbia Records. The example indicates that, to tackle document-level RE, a model needs the ability to capture interactions between long-distance entities. In addition, since a document may have an extremely long text, a model also needs the ability to integrate global contextual information for words. Recent works show that graph-based methods can help document-level RE. Some of them construct a documentlevel graph and use Graph Neural Networks (GNN) [4, 13] to integrate neighborhood Fig. 1. An example for document-level RE. The information for each node [14, solid lines denote intra-sentence relations. The dotted line denotes an inter-sentence relation. 18]. Although they consider the entire graph structure, they may fail to model long-distance interactions due to the over-smoothing problem [9]. Other works attempt to encode path information between the target entity pair [1,17]. They are adept at modeling long-distance interactions, but may fail to capture global contextual information. Therefore, to obtain more comprehensive representations, it is necessary to integrate global information and model long-distance interactions simultaneously. In this paper, we propose the Coarse-to-Fine Entity Representation model (CFER) to obtain comprehensive entity representations for document-level RE. More specifically, we first construct a document-level graph that captures rich document-aware interactions. Based on the graph, we design a coarse-to-fine strategy with two phases. First, we use Densely Connected Graph Convolutional Networks (DCGCN) [4] to integrate global contextual information in the entire graph at a coarse level. Next, we adopt an attention-based path encoding mechanism to selectively aggregate path information between the target entity pair at a fine level. Given entity representations from both levels that capture global contextual information and long-distance interactions, we can obtain more comprehensive entity representations by combining them. Our contributions are summarized as follows: (1) We propose a novel document-level RE model called CFER, which uses a coarse-to-fine strategy to obtain comprehensive entity representations. (2) Experimental results on two popular document-level RE datasets show that CFER achieves better performance than baseline models. (3) Elaborate analysis validates the effectiveness of our method, and further, highlights the robustness of CFER to the uneven label distribution and the ability of CFER to model long-distance interactions.

Coarse-to-Fine Entity Representations

187

Fig. 2. Illustration of a document-level graph corresponding to a two-sentence document. We design five categories of edges to connect words in the graph.

2

Methodology

2.1

Task Formulation N

Let D denote a document consisting of N sentences D = {si }i=1 , where si = M {wj }j=1 denotes the i-th sentence containing M words denoted by wj . Let E P

Q

denote an entity set containing P entities E = {ei }i=1 , where ei = {mj }j=1 denotes the coreferential mention set of the i-th entity, containing Q word spans of corresponding mentions denoted by mj . Given D and E, document-level RE requires extracting all the relational facts in the form of triplets, i.e., extracting {(ei , r, ej )|ei , ej ∈ E, r ∈ R}, where R is a pre-defined relation category set. Since an entity pair may have multiple semantic relations, we formulate document-level RE as a multi-label classification task. 2.2

Document-Level Graph Construction

Given a document, we first construct a document-level graph that captures rich document-aware interactions, reflected by syntactic dependencies, adjacent word connections, cross-sentence connections, and coreferential mention interactions. Figure 2 shows an example document-level graph corresponding to a two-sentence document. The graph regards words in the document as nodes and captures document-aware interactions by five categories of edges. These undirected edges are described as follows. Syntactic Dependency Edge. Syntactic dependency information is proved effective for document-level or cross-sentence RE in previous works [18]. Therefore, we use the dependency parser in spaCy1 to parse the syntactic dependency tree for each sentence. After that, we add edges between all node pairs that have dependency relations. Adjacent Word Edge. [17] point out that adding edges between adjacent words can mitigate the dependency parser errors. Therefore, we add edges between all node pairs that are adjacent in the document. 1

https://spacy.io/.

188

D. Dai et al.

Fig. 3. Illustration of CFER. It is composed of a text encoding module, a coarse-level representation module, a fine-level representation module, and a classification module.

Self-loop Edge. For a node, in addition to its neighborhood information, the historical information of the node itself is also essential in information integration. Therefore, we add a self-loop edge for each node. Adjacent Sentence Edge. To ensure that information can be integrated across sentences, for each adjacent sentence pair, we add an edge between their dependency tree roots. Coreferential Mention Edge. Coreferential mentions may share information captured from their respective contexts with each other. This could be regarded as global cross-sentence interactions. Therefore, we add edges between the first words of all mention pairs that refer to the same entity. 2.3

Coarse-to-Fine Entity Representations

In this subsection, we describe our proposed Coarse-to-Fine Entity Representation model (CFER) in detail. As shown in Fig. 3, our model is composed of a text encoding module, a coarse-level representation module, a fine-level representation module, and a classification module. Text Encoding Module. This module aims to encode each word in the document as a vector with text contextual information. By default, CFER uses GloVe [15] embeddings and a Bi-GRU model as the encoder. To improve the ability of contextual text encoding, we can also replace this module by PreTrained Models (PTM) such as BERT [2] or RoBERTa [10]. This module finally outputs contextual word representations hi of each word in the document: hi = Encoding(D), where hi ∈ Rdh , and dh denotes the hidden dimension. Coarse-Level Representation Module. This module aims to integrate local and global contextual information in the entire document-level graph. As indicated by [4], Densely Connected Graph Convolutional Networks (DCGCN) have the ability to capture rich local and global contextual information. Therefore, we

Coarse-to-Fine Entity Representations

189

adopt DCGCN layers as the coarse-level representation module. DCGCN layers are organized into n blocks and the k-th block has mk sub-layers. At the l-th sub-layer in block k, the calculation for node i is defined as ⎞ ⎛  (k,l) (k,l) ˆ (1) = ReLU ⎝ W(k,l) h + b(k,l) ⎠ , hi j j∈N (i)

ˆ (k,l) h j

=

(k) (k,1) (k,l−1) [xj ; hj ; ...; hj ],

(k)

(2)

(k,l)

∈ Rdh is the block input of node j. hi ∈ Rdh /mk is the output of (k,l) ˆ node i at the l-th sub-layer in block k. h ∈ Rdh +(l−1)dh /mk is the neighborj

where xj

(k)

hood input, obtained by concatenating xj and the outputs of node j from all previous sub-layers in the same block. W(k,l) and b(k,l) are trainable parameters. N (i) denotes the neighbor set of node i in the document-level graph. Finally, block k adds up the block input and the concatenation of all sub-layer outputs, (k) and then adopts a fully connected layer to compute the block output oi ∈ Rdh :   (k) (k) (k,1) (k,m ) oi = FC xi + [hi ; ...; hi k ] . (3) (n)

Finally, we take the output of the final block oi representation module.

as the output of the coarse-level

Fine-Level Representation Module. The coarse-level representations can capture rich contextual information, but may fail to model long-distance entity interactions. Taking the global contextual information as a guidance, the finelevel representation module aims to utilize path information between the target entity pair to alleviate this problem. This module adopts an attention-based path encoding mechanism based on a path encoder and an attention aggregator. For a target entity pair (e1 , e2 ), we denote the numbers of their corresponding mentions as |e1 | and |e2 |, respectively. We first extract |e1 | × |e2 | shortest paths between all mention pairs in a subgraph that contains only syntactic dependency and adjacent sentence edges. Then, for the i-th path [w1 , ..., wleni ], we use a BiGRU model to compute the path-aware mention representations: −−−→ −−−→ (n)  ←−− ←−−− ←−−− (n)  − , pi,j = GRU pi,j+1 , owj , (4) p−→ i,j = GRU pi,j−1 , owj −→ mi = − p− i,leni , mi (t)

(h)

− =← p− i,1 ,

(5)

←−− dh where − p−→ are the forward and backward GRU hidden states of the i,j , pi,j ∈ R (h) (t) j-th node in the i-th path, respectively. mi , mi ∈ Rdh are the path-aware representations of the head and tail mentions in the i-th path, respectively. Since not all the paths contain useful information, we design an attention aggregator, which takes the global contextual information as a guidance, to selectively aggregate the path-aware mention representations:

190

D. Dai et al.

 (n) 1  (n) = 1 oj , t= o , h |e1 | j∈e |e2 | j∈e j 1 2   (h) (t) t; mi ; mi ] + ba , αi = Softmax Wa [h; i  (h)  (t) h= mi · αi , t = mi · αi , i

(6) (7) (8)

i

where h, t ∈ Rdh are the coarse-level head and tail entity representations, respectively, which are computed by averaging their corresponding coarse-level mention representations. Wa and ba are trainable parameters. h, t ∈ Rdh are the path-aware fine-level representations of the head and tail entities. Classification Module. In this module, we combine the entity representations from both two levels to obtain comprehensive representations that capture both global contextual information and long-distance entity interactions. Next, we predict the probability of each relation by a bilinear scorer:   h]T Wc [ t; t] + bc , (9) P (r|e1 , e2 ) = Sigmoid [h; r

where Wc ∈ R2dh ×nr ×2dh and bc ∈ Rnr are trainable parameters with nr denoting the number of relation categories. Optimization Objective. Considering that an entity pair may have multiple relations, we choose the binary cross entropy loss between the ground truth label yr and P (r|e1 , e2 ) as the optimization objective: 



 yr · log P (r|e1 , e2 ) + (1 − yr ) · log 1 − P (r|e1 , e2 ) . L=− (10) r

3 3.1

Experiments and Analysis Dataset

We evaluate our model on two document-level RE datasets. DocRED [25] is a large-scale human-annotated dataset constructed from Wikipedia and Wikidata [22], and is currently the largest human-annotated dataset for general domain document-level RE. It contains 5, 053 documents, 132, 375 entities and 56, 354 relational facts divided into 96 relation categories. Among the annotated documents, 3, 053 are for training, 1, 000 are for development, and 1, 000 are for testing. Chemical-Disease Reactions (CDR) [8] is a popular dataset in the biomedical domain containing 500 training examples, 500 development examples, and 500 testing examples. 3.2

Experimental Settings

We tune hyper-parameters on the development set. Generally, we use AdamW [11] as the optimizer, and use the DCGCN consisting of two blocks

Coarse-to-Fine Entity Representations

191

Table 1. Results on DocRED. Bold denotes the best. Underline denotes the second.

Model

Dev Ign F1 F1

Test Ign F1 F1

Bi-LSTM [25] GCNN [18] EoG [1] HIN [19] GCGCN [29] LSR [14] GAIN [28] HeterGSAN-Recon [24] CFER-GloVe (Ours)

48.87 46.22 45.94 51.06 51.14 48.82 53.05 54.27 54.29

50.94 51.52 52.15 52.95 53.05 55.17 55.29 56.22 56.40

48.78 49.59 49.48 51.15 50.87 52.15 52.66 53.27 53.43

51.06 51.62 51.82 53.30 53.13 54.18 55.08 55.23 55.75

BERT-Two-Step [23] – 54.29 HIN-BERTBase [19] 55.43 GCGCN-BERTBase [29] 55.32 CorefBERTBase [26] 56.70 ERICA-BERTBase [16] 52.43 LSR-BERTBase [14] 58.27 MIUK [6] HeterGSAN-Recon-BERTBase [24] 58.13 59.14 GAIN-BERTBase [28] CFER-BERTBase (Ours) 59.23

54.42 56.31 57.35 57.51 58.80 59.00 60.11 60.18 61.22 61.41

– 53.70 54.53 54.54 55.90 56.97 58.05 57.12 59.00 59.16

53.92 55.60 56.67 56.96 58.20 59.05 59.99 59.45 61.24 61.28

ERICA-RoBERTaBase [16] CorefRoBERTaLarge [26] CFER-RoBERTaLarge (Ours)

56.30 58.60 56.60 59.00 57.84 59.93 57.68 59.91 60.91 62.80 60.70 62.76

with 4 sub-layers in each block. On CDR, we use BioBERT [5] as the text encoding module. Details of hyper-parameters such as learning rate, batch size, dropout rate, hidden dimension, and others for each version of CFER are provided as an appendix at https://github.com/Hunter-DDM/cfer-document-levelRE/blob/main/Appendix.pdf. Following previous works, for DocRED, we choose micro F1 and Ign F1 as evaluation metrics. For CDR, we choose micro F1, Intra-F1, and Inter-F1 as metrics. Ign F1 denotes F1 excluding facts that appear in both the training set and the development or test set. Intra- and Inter-F1 denote F1 of relations expressed within and across sentences, respectively. We determine relationspecific thresholds δr based on the micro F1 on the development set. Then, we classify a triplet (e1 , r, e2 ) as positive if P (r|e1 , e2 ) > δr or negative otherwise.

192

3.3

D. Dai et al.

Results

We run each version of CFER three times, take the median F1 on the development set to report, and test the corresponding model on the test set. Table 1 shows the main evaluation results on DocRED. We compare CFER with 21 baseline models on three tracks: (1) non-PTM track: models on this track do not use Pre-Trained Models (PTM). GCNN [18], EoG [1], GCGCN [29], LSR [14], GAIN [28], and HeterGSAN-Recon are graph-based models that leverage GNN to encode nodes. Bi-LSTM [25] and HIN [19] are text-based models without constructing a graph. (2) BERT track: models on this track are based on BERTBase [2]. HIN-BERTBase , GCGCN-BERTBase , LSR-BERTBase , HeterGSAN-Recon-BERTBase , and GAIN-BERTBase are BERT-versions of corresponding non-PTM models. BERT-Two-Step [23] and MIUK [6] are two text-based models. CorefBERTBase [26] and ERICA-BERTBase [16] are two pre-trained models. (3) RoBERTa track: models on this track are based on RoBERTa [10]: ERICA-RoBERTaBase [16] and CorefRoBERTaLarge [26]. From Table 1, we observe that on all three tracks, CFER achieves the best performance and significantly outperforms most of the baseline models. Further, we find that graph-based models generally have significant advantages over textbased models in document-level RE. Table 3. Ablation results on DocRED. Table 2. Evaluation results on CDR. Model

F1

Intra-F1 Inter-F1

GCNN [18]

58.6 –

LSR [14]

61.2 66.2

50.3

BRAN [21]

62.1 –



EoG [1]



63.6 68.2

50.9

LSR w/o MDP 64.8 68.9

53.1

CFER (Ours)

57.8

65.9 70.7

Model

F1

CFER-GloVe (Full)

56.40 (–0.00)

- Attention Aggregator 54.92 (–1.48) 53.68 (–2.72) - Multiple Paths 52.28 (–4.12) - Fine-level Repr - Coarse-level Repr - DCGCN Blocks

55.26 (–1.14) 54.32 (–2.08)

- Both-level Modules

51.67 (–4.73)

Table 2 shows the main results on CDR. Besides GCNN, LSR, EoG mentioned above, we compare two more models: BRAN [21], a text-based method, and LSR w/o MDP, a modified version of LSR. From Table 2 we find that CFER outperforms all baselines, especially on Inter-F1. This suggests that CFER has stronger advantages in modeling inter-sentence interactions. Note that the modified version of LSR performs better than full LSR. This indicates that LSR does not always work for all datasets, while CFER does not have this problem. 3.4

Analysis

Ablation Study. We show ablation experiment results of CFER-GloVe on the development set of DocRED in Table 3. Firstly, for fine-level representations, we

Coarse-to-Fine Entity Representations

193

Fig. 4. Micro F1 on different categories of relations.

modify our model in three ways: (1) Replace the attention aggregator by a simple mean aggregator (denoted by - Attention Aggregator ). (2) Replace all used shortest paths by a random shortest path (denoted by - Multiple Paths). (3) Remove the whole fine-level representations from the final classification (denoted by Fine-level Repr.). F1 scores of these three modified models decrease by 1.48, 2.72, and 4.12, respectively. This verifies that our attention aggregator for multiple paths has the ability to selectively aggregate useful information thus producing effective fine-level representations. Secondly, for coarse-level representations, we modify our model in two ways: (1) Remove the whole coarse-level representations from the final classification (denoted by - Coarse-level Repr.). (2) Remove DCGCN blocks (denoted by - DCGCN Blocks). F1 scores of these two modified models decrease by 1.14 and 2.08, respectively. This verifies that DCGCN has the ability to capture rich local and global contextual information, thus producing high-quality coarse-level representations that benefit relation extraction. Finally, we remove both coarse- and fine-level representation modules (denoted by - Both-level Modules) and our model degenerates into a simple version similar to Bi-LSTM. As a result, this simple version achieves a similar performance to Bi-LSTM as expected. This suggests that our text encoding module is not much different from the Bi-LSTM baseline, and the performance improvement is introduced by our coarse-to-fine strategy. Robustness to the Uneven Label Distribution. We divide 96 relations in DocRED into 4 categories according to their ground-truth positive label numbers in the development set and show micro F1 on each category in Fig. 4. We find that compared with Bi-LSTM, BERT-Two-Step makes more improvement on relations with more than 20 positive labels (denoted by major relations), but less improvement (10.15%) on long-tail relations with less than or equal to 20 positive labels. By contrast, keeping the improvement on major relations, our model makes a much more significant improvement (86.18%) on long-tail relations. With high performance on long-tail relations, our model narrows the F1 gap between major and long-tail relations from 46.05 (Bi-LSTM) to 38.12. This suggests that our model is more robust to the uneven label distribution.

194

D. Dai et al.

Ability to Model Long-Distance Interactions. We divide all entity pairs in CDR into 3 categories according to their sentence distance, i.e., the number of sentences that separate them. Specifically, the distance of an intra-sentence entity pair is 0. The micro F1 for distances in [0, 4), [4, 8), and [8, ∞) are 66.3, 59.8, and 61.5, respectively. That is to say, as the distance increases, the performance of CFER does not decrease much. This validates again the ability of CFER to model long-distance interactions between entity pairs. 3.5

Case Study

Figure 5 shows an extraction case in the development set. We analyze the relational fact (0, P140, 5), which is extracted by only CFER (Full). For CFER (Full) and CFER (- DCGCN Blocks), we show several high-weight paths and their attention weights used for producing fine-level representations. From this case, we find that CFER (Full) gives smooth weights to its high-weight paths, which help aggregate richer path information from multiple useful paths. By contrast, without the guidance of global contextual information, CFER (- DCGCN Blocks) learns extremely unbalanced weights and pays almost all its attention to a sub-optimal path. As a result, it fails to extract the P140 relation. As for CFER-GloVe (- Fine-level Repr.) and CFER-GloVe (- Both-level Module), they do not consider any path information. Therefore, it is hard for them to achieve as good performance as CFER (Full).

Fig. 5. A case from the development set. We show relational facts extracted by four versions of CFER-GloVe. For CFER-GloVe (Full) and CFER-GloVe (- DCGCN Blocks), we show the paths and their attention weights for producing fine-level representations of entity 0 (Daniel Ajayi Adeniran) and entity 5 (Redeemed Christian Church).

4

Related Work

Most document-level RE models are graph-based, which usually construct a document-level graph. [17] design feature templates to extract features from mul-

Coarse-to-Fine Entity Representations

195

tiple paths for classification. [18] use a labeled edge GCN [13] to integrate information. [3] and [14] apply graph convolutional networks to a complete graph with iteratively refined edges weights. [1] propose an edge-oriented model that represents paths through a walk-based iterative inference mechanism. [24] propose a reconstructor to model path dependency between an entity pair. [7] propose to improve inter-sentence reasoning by characterizing interactions between sentences and relation instances. [6] leverage knowledge graphs in document-level RE. [29] propose a context-enhanced model to capture global context information. [28] design a double graph to cope with document-level RE. There are also text-based methods without constructing a graph. [21] adopt a transformer [20] to encode the document. [23] predict the existence of relations before predicting the specific relations. [19] design a hierarchical architecture to make full use of information from several levels. [26] explicitly capture relations between coreferential noun phrases. [16] use contrastive learning to obtain a deeper understanding of entities and relations in text.

5

Conclusion

In this paper, we propose CFER with a coarse-to-fine strategy to learn comprehensive representations for document-level RE. Our model integrates global contextual information and models long-distance interactions to address the disadvantages that previous graph-based models suffer from. Experimental results on two document-level RE datasets show that CFER outperforms baseline models. Further, elaborate analysis verifies the effectiveness of our coarse-to-fine strategy, and highlights the robustness of CFER to the uneven label distribution and the ability of CFER to model long-distance interactions. Note that our coarseto-fine strategy is not limited to only the task of document-level RE. It has the potential to be applied to a variety of other NLP tasks. Acknowledgement. This paper is supported by the National Key Research and Development Program of China 2020AAA0106700 and NSFC project U19A2065.

References 1. Christopoulou, F., Miwa, M., Ananiadou, S.: Connecting the dots: document-level neural relation extraction with edge-oriented graphs. In: EMNLP-IJCNLP 2019, pp. 4924–4935 (2019) 2. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT 2019, pp. 4171– 4186 (2019) 3. Guo, Z., Zhang, Y., Lu, W.: Attention guided graph convolutional networks for relation extraction. In: ACL 2019, pp. 241–251 (2019) 4. Guo, Z., Zhang, Y., Teng, Z., Lu, W.: Densely connected graph convolutional networks for graph-to-sequence learning. TACL 7, 297–312 (2019) 5. Lee, J., et al.: Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinform. 36(4), 1234–1240 (2020)

196

D. Dai et al.

6. Li, B., Ye, W., Huang, C., Zhang, S.: Multi-view inference for relation extraction with uncertain knowledge. In: AAAI 2021 (2021) 7. Li, B., Ye, W., Sheng, Z., Xie, R., Xi, X., Zhang, S.: Graph enhanced dual attention network for document-level relation extraction. In: COLING 2020, pp. 1551–1560 (2020) 8. Li, J., et al.: Biocreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016) 9. Li, Q., Han, Z., Wu, X.: Deeper insights into graph convolutional networks for semi-supervised learning. In: AAAI 2018, pp. 3538–3545 (2018) 10. Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach. CoRR abs/1907.11692 (2019) 11. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. CoRR abs/1711.05101 (2017) 12. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP 2018, pp. 3219–3232 (2018) 13. Marcheggiani, D., Titov, I.: Encoding sentences with graph convolutional networks for semantic role labeling. In: EMNLP 2017 (2017) 14. Nan, G., Guo, Z., Sekulic, I., Lu, W.: Reasoning with latent structure refinement for document-level relation extraction. In: ACL 2020 (2020) 15. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP 2014, pp. 1532–1543 (2014) 16. Qin, Y., et al.: ERICA: improving entity and relation understanding for pre-trained language models via contrastive learning. CoRR abs/2012.15022 (2020) 17. Quirk, C., Poon, H.: Distant supervision for relation extraction beyond the sentence boundary. In: EACL 2017, pp. 1171–1182 (2017) 18. Sahu, S.K., Christopoulou, F., Miwa, M., Ananiadou, S.: Inter-sentence relation extraction with document-level graph convolutional neural network. In: ACL 2019, pp. 4309–4316 (2019) 19. Tang, H., et al.: HIN: hierarchical inference network for document-level relation extraction. In: Lauw, H.W., Wong, R.C.-W., Ntoulas, A., Lim, E.-P., Ng, S.-K., Pan, S.J. (eds.) PAKDD 2020. LNCS (LNAI), vol. 12084, pp. 197–209. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-47426-3 16 20. Vaswani, A., et al.: Attention is all you need. In: NeurIPS 2017, pp. 5998–6008 (2017) 21. Verga, P., Strubell, E., McCallum, A.: Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In: NAACL-HLT 2018, pp. 872–884 (2018) 22. Vrandecic, D., Kr¨ otzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014) 23. Wang, H., Focke, C., Sylvester, R., Mishra, N., Wang, W.: Fine-tune bert for docred with two-step process. CoRR abs/1909.11898 (2019) 24. Wang, X., Chen, K., Zhao, T.: Document-level relation extraction with reconstruction. CoRR abs/2012.11384 (2020) 25. Yao, Y., et al.: DocRED: a large-scale document-level relation extraction dataset. In: ACL 2019, pp. 764–777 (2019) 26. Ye, D., Lin, Y., Du, J., Liu, Z., Sun, M., Liu, Z.: Coreferential reasoning learning for language representation. CoRR abs/2004.06870 (2020) 27. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: COLING 2014, pp. 2335–2344 (2014)

Coarse-to-Fine Entity Representations

197

28. Zeng, S., Xu, R., Chang, B., Li, L.: Double graph based reasoning for documentlevel relation extraction. In: EMNLP 2020, pp. 1630–1640 (2020) 29. Zhou, H., Xu, Y., Yao, W., Liu, Z., Lang, C., Jiang, H.: Global context-enhanced graph convolutional networks for document-level relation extraction. In: COLING 2020, pp. 5259–5270 (2020) 30. Zhou, P., et al.: Attention-based bidirectional long short-term memory networks for relation classification. In: ACL 2016, pp. 207–212 (2016)

Research on Named Entity Recognition Based on Bidirectional Pointer Network and Label Knowledge Enhancement Zhengyun Wang, Yong Zhang(B) , Xinyi Sun, and Xin Li Computer School, Central China Normal University, Wuhan, Hubei, China [email protected]

Abstract. Named Entity Recognition (NER) is one of the fundamental tasks in natural language processing (NLP) and serves as a foundation for many downstream tasks. Currently, most span-based methods utilize sequence labeling to detect entity boundaries. This method may cause too many invalid entities and the number of entity fragments explosion when combining entity boundaries to generate entity fragments. To solve these problems, we propose a model based on bidirectional pointer network and label knowledge enhancement. We used pointer network mechanism to detect entity boundary, pointing entity beginning boundary to entity ending boundary, and non-entity to sentinel word. This processing method can effectively alleviate the problem of excessive invalid entity and explosion of entity fragment quantity. In addition, we also fully integrate the label knowledge into the model to improve the effect of the model. We conducted extensive experiments on the ACE2004 dataset and achieved better results than most existing named entity recognition models. Keywords: named entity recognition · bidirectional pointer network · label knowledge

1 Introduction Named Entity Recognition (NER) is one of the fundamental tasks in natural language processing (NLP) that intends to find and classify the type of a named entity in text such as person (PER), location (LOC) or organization (ORG). It has been widely used for many downstream applications such as machine translation [1], entity linking [2], question generation [3] and co-reference resolution [4]. Named entity recognition consists of two sub-tasks: named entity boundary detection and named entity category prediction. The main steps are to first detect the boundaries of the entities in the text, then combine them to generate entity fragments, and finally classify the entity fragments into their corresponding entity categories. Currently, most of the existing methods for detecting entity boundaries use sequence labeling-based [5– 8] methods. For example, Zheng et al. [9] used the BIO labeling scheme to detect entity boundaries in text, where B indicates the beginning of an entity, I indicates the inside of © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 198–209, 2023. https://doi.org/10.1007/978-3-031-44696-2_16

Research on Named Entity Recognition

199

an entity, and O indicates the end of an entity. Tan et al. [10] used two classifiers to detect entity boundaries: one classifier predicts whether each word in the text is the beginning of an entity, and another classifier predicts whether each word in the text is the end of an entity. After detecting entity boundaries using this method, all the beginning and ending positions are usually matched to generate entity fragments. If the length of the text is too long, the number of generated entity fragments will grow exponentially, which will increase the time and space complexity of the model and seriously reduce the efficiency of model training and inference. In addition, this method generates many invalid entities in the entity fragments, which will result in the problem of too many negative labels and an imbalanced distribution of positive and negative labels during the entity fragment classification stage. Some studies filter the generated entity fragments, such as limiting the length of the entity fragments and using the nearest matching principle. However, these methods need to be adjusted for different datasets and are not universal. In 2020, Li et al. [11] proposed a neural model for detecting entity boundaries, which is a recurrent neural network encoder-decoder framework with a pointer network [12, 13]. The model processes the starting boundary word in an entity to point to the corresponding ending boundary word. The other entity words in the entity are skipped. The non-entity words are pointed to a specific position. This method has achieved promising results in the boundary detection task. And this method does not generate invalid entities in the entity fragments and does not result in an explosion of the number of entity fragments. However, as the segment information of each word in an entity is the same as the starting boundary word, the segment information for all the words within a segment will be incorrect if the starting boundary word is detected wrongly. In addition, this method may have low recall when dealing with nested named entity recognition tasks. This is because in nested named entities, there may be a situation where a word is the starting (or ending) boundary of multiple entities, while in the method proposed by Li et al. [11], a word is only matched with one corresponding entity boundary, and the words in the middle of the entity and the words at the end of the entity are skipped. This processing method may miss many entities, resulting in the predicted number of entities being less than the actual number of entities. We propose a new model based on bidirectional pointer network and label knowledge enhancement. We utilize a pointer network mechanism to detect entity boundaries, whereby the starting boundary of an entity point to its corresponding ending boundary, thus avoiding the generation of invalid entities or the problem of an explosion in the number of entity fragments. In addition, we made some improvements to the model proposed by Li et al. [11] and applied it to named entity recognition tasks. We will input each word in the text into the decoder in turn to decode, and then use the pointer network mechanism to find the corresponding entity boundary, thus avoiding error propagation. In addition, a new decoder is added in this paper, which treats the word currently input into the decoder as the end boundary of an entity to find the corresponding starting boundary of the entity, thereby compensating for the problem of a decoder missing entities when dealing with nested named entity recognition tasks. Furthermore, we integrate label knowledge into the model to guide the model in recognizing entities more effectively.

200

Z. Wang et al.

2 Proposed Model We propose a new model based on bidirectional pointer networks and label knowledge enhancement. The overall architecture of the model is illustrated in Fig. 1. The model consists of three components: the semantic encoding module, the semantic fusion module and the entity boundary detection module. First, label knowledge for each label is constructed. Then, input the text and label knowledge are separately input into the encoder for encoding. Next, the text embedding and label embedding are fused by the semantic fusion module to derive the label-knowledge-enhanced embeddings for the text. Finally, in the entity boundary detection module, use label-knowledge-enhanced embeddings to predict entity boundaries for each label. The entity boundary detection module consists of two decoders: the right decoder and the left decoder. The right decoder treats the current input word as the starting boundary of an entity and uses a pointer network mechanism to locate the corresponding ending boundary of the entity. Conversely, the left decoder treats the current input word as the ending boundary of an entity and uses a pointer network mechanism to locate the corresponding starting boundary of the entity.

Fig. 1. Illustration of our Model

2.1 Data Construction We take annotation guideline notes as label knowledge for each label. Annotation guidelines are provided by the data creators and provide accurate and general descriptions for the labels. As shown in Table 1, the label annotations for the ACE2004 [14] dataset are provided, consisting of a total of 7 labels.

Research on Named Entity Recognition

201

Table 1. Label annotation of ACE2004 dataset Label Label Annotation FAC

facility entities are limited to buildings and other permanent man-made structures such as buildings, airports, highways, bridges

GPE

geographical political entities are geographical regions defined by political and or social groups such as countries, nations, regions, cities, states, government and its people

LOC location entities are limited to geographical entities such as geographical areas and landmasses, mountains, bodies of water, and geological formations ORG organization entities are limited to companies, corporations, agencies, institutions and other groups of people PER

a person entity is limited to human including a single individual or a group

VEH vehicle entities are physical devices primarily designed to move, carry, pull or push the transported object such as helicopters, trains, ship and motorcycles WEA weapon entities are limited to physical devices such as instruments for physically harming such as guns, arms and gunpowder

2.2 Semantics Encoding Module We utilize BERT [15] as the semantic encoder. First, the text is tokenized using a tokenizer and the [CLS] and [SEP] tokens are added separately at the beginning of the sequence and at the end of the sequence. Then, the sequence is encoded using BERT. Given input text X = [x0 , x1 , . . . , xn−1  ], where n is the length of X, BERT extracts the embedding HX = h0 , h1 , . . . , hn−1 , HX ∈ Rn∗d , where d is the vector dimension of the encoder. Since the number of label annotations is small compared with the whole sample set, it is challenging to build an encoder from scratch for the label annotations. Thus we introduce a shared encoder, where the encoder for label knowledge and the encoder for text utilize weight sharing. Given label annotations Y = [y1 , y2 , . . . , y|C| ], where |C| is the size of label set C. Each label knowledge yc = [yc0 , yc1 , . . . , ycl−1 ] is encoded using BERT, where l is the length of yc , and the encoded representation HY ∈ R|C|∗l∗d of the label knowledge can be obtained by following formulas.     (1) hyc = hyc0 , hyc1 , . . . , hyl−1 = BERT yc0 , yc1 , . . . , ycl−1 c

HY = [hy1 , hy2 , . . . , hy|C| ]

(2)

2.3 Semantic Fusion Module The semantic fusion module utilizes an attention mechanism to inject the label knowledge into the text, obtaining a label-enhanced representation of the text. Specifically, the encoded vectors HX and HY are input into a fully connected layer, respectively, to map their representations into the same feature space. 

HX = U1 ∗ HX

(3)

202

Z. Wang et al. 

HY = U2 ∗ HY

(4)

where U1 ∈ Rd ∗d , U2 ∈ Rd ∗d be the learnable parameters of the fully connected layers. Then, we apply the attention mechanism over the label annotations for each token in the text. For any 0 ≤ i ≤ n − 1, let xi be the i th token of X, and hi be the i th row of HX . j Likewise, for any c ∈ C, 0 ≤ j ≤ l − 1, let yc be the j th token of the annotation of c,   and h j be its embedding from HY . We compute the dot product of hi and h j , and apply yc

yc

a softmax function to obtain the attention scores.  exp hi ∗h j yc  ax ,yj = i c

 ∗h exp h j j i

(5)

yc

Finally, the weighted sum of the words in the label knowledge is calculated based on their corresponding attention scores. To preserve the original information of the sequence, we inject the label knowledge by adding the embedding representation of the words in the sequence by following formulas. ax ,yj ∗h j (6) hci = hi + j

i

c

yc

c



hi = tanh(W ∗ hci + b)

(7)

where tanh is the activation function, W ∈ Rd ∗d , b ∈ Rd are learnable parameters, and c hi is the encoded information related to category c. The above steps are repeated for all labels, where the label knowledge of each label is sequentially integrated into the input text, resulting in the enhanced embedding

1







2



|C|

representation hi = [hi , hi , . . . , hi ] of each word xi with label knowledge. 2.4 Entity Boundary Detection In the entity boundary detection module, the word embedding representations injected with label knowledge are utilized to guide the model in identifying entities corresponding to the respective labels. We employ two decoders, the right decoder and the left decoder, to detect entity boundaries in the sequence. The right decoder treats the current input word as the entity starting boundary and searches for the corresponding entity ending boundary. If the corresponding entity ending boundary is found, it points to the entity ending boundary. If the corresponding entity ending boundary is not found, it points to a sentinel word. In this paper, the sentinel word is represented by the [CLS] token. The left decoder treats the current input word as the entity ending boundary and searches for the entity starting boundary. If the corresponding entity starting boundary is found, it points to the entity starting boundary. If the corresponding entity starting boundary is not found, it points to a sentinel word ([CLS] token). The input sequence with the embedded representation of label knowledge is denoted as H X = [h0 , h1 , . . . , hn−1 ], where each word has an embedding representation







Research on Named Entity Recognition

1





2



203

|C|

hi = [hi , hi , . . . , hi ](0 ≤ i ≤ n − 1), and it is fused with |C| label knowledge c c c c representations. Sequentially, the fused sequence H X = [h0 , h1 , . . . , hn−1 ] (where each word is only fused with the label knowledge of c labels) is inputted into the decoders to recognize the entity boundaries corresponding to the label in the sequence. For the specific implementation details of the right decoder: Firstly, a unidirectional LSTM is employed as the right decoder. The sequence with injected label knowledge c c c c of word embedding representations H X = [h0 , h1 , . . . , hn−1 ] is sequentially fed into the decoder as the starting boundaries of the entities. This process generates decoding hidden vectors corresponding to each word by following formula.

















c

c ric = LSTM (hi , ri−1 )

(8)

Next, the dot product attention mechanism is used to calculate the scores, jc ui

uic

for all

possible entity end boundaries at time step i. Here, represents the score for an entity of class c with the starting boundary at the i th word and the ending boundary at the j th word. Subsequently, a softmax function is applied to normalize the scores, obtaining a probability distribution over all possible entity end boundaries. The related formulas are as follows. jc

ui = hˆ cj ric , j ∈ (0, . . . , n − 1) T

c

(9)

c

c

uic = [ui0 , ui1 , . . . , uin−1 ]

(10)

right cj = softmax(uic )

(11)

Similarly, the left decoder is implemented as follows. Firstly, a unidirectional LSTM c c c c is used as the left decoder. The sequence H X = [h0 , h1 , . . . , hn−1 ], which represents the sequence with the injected label-knowledge word embeddings, is sequentially inputted into the decoder as the entity’s end boundaries. This yields the decoding hidden vectors for each word.









c

c lic = LSTM (hi , li−1 )

(12)

Next, the dot product attention mechanism is used to calculate the scores uic for all jc

possible entity start boundaries at time step i. Here, ui represents the score for a c-class entity with the j th word as the start boundary and the i th word as the end boundary. Subsequently, the softmax function is applied to normalize the scores, resulting in the probability distribution of all possible entity start boundaries by following formulas. jc

ui = hˆ cj lic , j ∈ (0, . . . , n − 1) T

c

c

c

(13)

uic = [ui0 , ui1 , . . . , uin−1 ]

(14)

left cj = softmax(uic )

(15)

The above steps are repeated |C| times, where each time the sequence that has been fused with a specific label knowledge is inputted into the decoder for decoding. Then, the pointer network mechanism is utilized to identify the corresponding entity boundaries.

204

Z. Wang et al.

2.5 Loss Function Given input text X = [x0 , x1 , . . . , xn−1 ], where n is the length of the text, C is the category set, for any c ∈ C, gr ci represents the true ending position of the entity belonging to class c when the i th word is the starting position. If the i th word is the starting boundary of an entity of class c, gr ci represents the corresponding ending boundary of the entity; If the i th word is not the starting boundary of any entity of class c, gr ci is equal to 0, where 0 represents the position of the sentinel word [CLS] token. gl ci represents the true starting position of the entity of class c when the i th word is the ending boundary. If the i th word is the ending boundary of an entity of class c, gl ci represents the corresponding starting boundary of the entity. If the i th word is not the ending boundary of any entity of class c, gl ci is equal to 0, where 0 represents the position of the sentinel word [CLS] token. The right loss function Lright , the left loss function Lleft and the final loss function L are defined as follows. 

1 (16) CE rightic , gric Lright = c∈C 0≤i≤n−1 n 

1 (17) CE leftic , glic Lleft = c∈C 0≤i≤n−1 n L = Lright + Lleft

(18)

where CE is the cross entropy function.

3 Experiments 3.1 Datasets We evaluate the proposed model on ACE2004 [14] dataset, where there are seven types of entities, including FAC, GPE, LOC, ORG, PER, VEH and WEA. 3.2 Baselines We use the following models as baselines, (1) Seg-Graph [16] which proposes a segmental hypergraph representation to model overlapping entity mentions; (2) DYGIE [17] which introduces a general framework that share span representations using dynamically constructed span graphs; (3) Path-BERT [18] which treats the tag sequence as the second best path within in the span of their parent entity based on BERT; (4) BERT-BENSC [10] which proposes a boundary enhanced neural span classification model; and (5) BERTMRC [19] which treats NER as a MRC/QA task and is the state-of-the-art method on both flat and nested NER. 3.3 Experimental Setups We use BERT as the backbone to learning the contextualized representation of the texts. More specifically, we implement our model based on the BERT-base model. We adopt the Adam optimizer with a linear decaying schedule to train our model. The detail of hyper-parameters settings is listed in Table 2. We use micro-average precision, recall, and F1 as evaluation metrics. A prediction is considered correct only if both its boundary and category are predicted correctly.

Research on Named Entity Recognition

205

Table 2. Hyper-parameter settings Hyper-parameter

Value

embedding size

768

batch size

32

epoch

45

lr

7e-5

optimizer

Adam

right decoder input size

768

right decoder hidden size

768

left decoder input size

768

left decoder hidden size

768

3.4 Main Results Table 3 shows the experimental results of our proposed model and the baseline models. We can see that our model outperforms most other models on ACE2004 dataset. Our model achieves 10.76% improvements on F1 scores compare to Seg-Graph. Our model achieves 1.16% improvements on F1 scores compare to DYGIE. Our model achieves 1.46% improvements on F1 scores compare to Seq2seq-BERT. Our model achieves 3.05% improvements on F1 scores compare to Path-BERT. Our model achieves 0.56% improvements on F1 scores compare to BERT-BENSC. BERT-MRC achieves 85.98% on F1 scores. Our model achieves 85.86% on F1 scores, it is a competitive result compared to BERT-MRC. The F1 scores of our model is slightly 0.12% lower than BERT-MRC, because boundary tags are exceedingly rare in a given text, with starting and ending boundary tags being even rarer, the entity boundary detection is afflicted by issues of label sparsity, leading to inaccurate label predictions by the model. To summarize, this verifies the effectiveness of our model using attention mechanism to fuse label knowledge and bidirectional pointer network to detect entity boundaries. Our model has the following advantages. (1) Our model takes full advantage of label knowledge. Specifically, we employ a shared encoder to encode both the text and label knowledge, and then use an attention mechanism to integrate the label knowledge. This processing method does not increase the sample set and the length of the sequence input into BERT, which can significantly improve the speed of model training and inference. Moreover, the use of attention mechanisms enables the text to fully incorporate label knowledge. (2) We adopt a bidirectional pointer network mechanism to detect entity boundaries, which effectively mitigates the generation of excessive invalid entities in the resulting entity fragments. Additionally, we employ two decoders to detect entity boundaries in different ways, which better facilitates the handling of nested named entity recognition tasks.

206

Z. Wang et al. Table 3. Experimental results on ACE2004 dataset ACE2004 (Nested NER) Model

P (%)

R (%)

F1 (%)

Seg-Graph

78.0

72.4

75.1

DYGIE





84.7

Seq2Seq-Bert



Path-Bert

83.73

81.91

82.81

BERT-BENSC

85.8

84.8

85.3

BERT-MRC

85.05

86.32

85.98

Our Model

85.44

86.29

85.86



84.40

3.5 Ablation Studies To investigate the effects of label knowledge, bidirectional decoders, and different information fusion methods on the model, ablation experiments will be conducted on the ACE2004 dataset. (1) Investigating the impact of label knowledge on the model. To remove the semantic fusion module while keeping other conditions of the original model unchanged. The experimental results are shown in Table 4. After removing the semantic fusion module, the model achieved an F1 score of 84.99%, which is a decrease of 0.87% compared to the complete model. The main reason is that incorporating label knowledge into the text guides the model to recognize entities more effectively, thereby improving the accuracy of entity boundary detection. In addition, the proposed model in this paper incorporates one label knowledge at a time during implementation, repeating the process multiple times. Then, it sequentially detects the corresponding entity boundaries on the text that has been fused with one label knowledge. This approach allows the model to recognize entities multiple times on the text, further avoiding the problem of missing entities. In conclusion, the validation confirms that incorporating label knowledge into the model is effective in improving its performance. Table 4. Effectiveness of leveraging label knowledge Event Ablation

ACE2004 (Nested NER) P (%)

R (%)

F1 (%)

Full Model

85.44

86.29

85.86

- semantic fusion

84.50

85.50

84.99

Research on Named Entity Recognition

207

(2) Investigating the impact of bidirectional decoders on the model. Keeping the other conditions of the original model unchanged, we remove one of the decoders in the entity boundary detection module. While keeping the other conditions of the original model unchanged, we remove the left decoder and only use the right decoder to detect entity boundaries. While keeping the other conditions of the original model unchanged, we remove the right decoder and only use the left decoder to detect entity boundaries. Table 5. Experimental results with different decoders Event Ablation

ACE2004 (Nested NER) P (%)

R (%)

F1 (%)

Full Model

85.44

86.29

85.86

-Left Decoder

84.67

79.89

82.21

-Right Decoder

85.59

81.81

83.65

As shown in Table 5, when using only the right decoder in the entity boundary detection module, the model achieved an F1 score of 82.21%, which is a decrease of 3.65% compared to the complete model. When using only the left decoder in the entity boundary detection module, the model achieved an F1 score of 83.65%, which is a decrease of 2.21% compared to the complete model. This indicates that the bidirectional decoders can avoid missing entities when dealing with nested named entity recognition tasks. (3) Investigating the impact of different information fusion methods We have designed two information fusion methods. While keeping the other conditions of the original model unchanged, the first method involves using average pooling instead of dot-product attention mechanism to incorporate label knowledge at the token level. This method takes the linear average of the word embeddings of each word in the label knowledge and then incorporates it through addition operation. The second method involves using cosine similarity to fuse label knowledge at the sentence level. It first calculates the cosine similarity between each word in the text and the sentence-level representation of the label knowledge. Then, it incorporates the label knowledge through addition operation. As shown in Table 6, when using the average pooling method to fuse label knowledge at the token level, the model achieves an F1 score of 84.63%, which is a decrease of 1.23% compared to the complete model. When using cosine similarity to fuse label knowledge at the sentence level, the model achieves an F1 score of 84.80%, which is a decrease of 1.06% compared to the complete model. These results validate the effectiveness of using the dot-product attention mechanism to fuse label knowledge at the token level as proposed in this paper.

208

Z. Wang et al. Table 6. Model performance with different information fusion approaches Event Ablation

ACE2004 (Nested NER) P (%)

R (%)

F1 (%)

Full Model

85.44

86.29

85.86

average pooling

84.14

85.14

84.63

cosine similarity

84.39

85.20

84.80

4 Conclusion We propose a named entity recognition model based on a bidirectional pointer network with label-enhanced representations. First, we employ a shared encoder to independently encode the text and label knowledge. Then, we utilize an attention mechanism to integrate the label knowledge into the text, resulting in label-enhanced text representations. Finally, these representations are fed into the entity boundary detection module, which employs a pointer network mechanism to recognize entities corresponding to the respective labels. In our specific implementation, after each label knowledge is fused with the text, we utilize a pointer network mechanism to detect entities corresponding to the specific label. This approach further helps to avoid overlooking entities, especially in the nested named entity recognition tasks. We conducted extensive experiments on the ACE2004 dataset, and the results demonstrated that the proposed model outperforms the majority of other named entity recognition models. Acknowledgement. We thank the anonymous reviewers for their work. This research was supported by National Natural Science Foundation of China (No. 61977032). Yong Zhang is the corresponding author.

References 1. Hartley, B.: Improving machine translation quality with automatic named entity recognition. In: EAMT ‘03 Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools: Resources and Tools for Building MT. Association for Computational Linguistics (2003) 2. Gupta, N., Singh, S., Roth, D.: Entity linking via joint encoding of types, descriptions, and context. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2681–2690 (2017) 3. Zhou, Q., Yang, N., Wei, F., Tan, C., Bao, H., Zhou, M.: Neural question generation from text: a preliminary study. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2017. LNCS, vol. 10619, pp. 662– 671. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73618-1_56 4. Barhom, S., Shwartz, V., Eirew, A., et al.: Revisiting joint modeling of cross-document entity and event coreference resolution. arXiv preprint: arXiv:1906.01753 (2019) 5. Lample, G., Ballesteros, M., Subramanian, S., et al.: Neural architectures for named entity recognition. arXiv preprint: arXiv:1603.01360 (2016)

Research on Named Entity Recognition

209

6. Chiu, J.P.C., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016) 7. Luo, Y., Xiao, F., Zhao, H.: Hierarchical contextualized representation for named entity recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8441–8448 (2020) 8. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint: arXiv:1508.01991 (2015) 9. Zheng, C., Cai, Y., Xu, J., et al.: A boundary-aware neural model for nested named entity recognition. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics (2019) 10. Tan, C., Qiu, W., Chen, M., et al.: Boundary enhanced neural span classification for nested named entity recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 9016–9023 (2020) 11. Li, J., Sun, A., Ma, Y.: Neural named entity boundary detection. IEEE Trans. Knowl. Data Eng. 33(4), 1790–1795 (2020) 12. Zhai, F., Potdar, S., Xiang, B., et al.: Neural models for sequence chunking. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1 (2017) 13. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. Adv. Neural Inf. Process. Syst. 28 (2015) 14. Doddington, G.R., Mitchell, A., Przybocki, M.A., et al.: The automatic content extraction (ace) program-tasks, data, and evaluation. In: LREC, vol. 2, no. 1, pp. 837–840 (2004) 15. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019) 16. Wang, B., Lu, W.: Neural segmental hypergraphs for overlapping mention recognition. arXiv preprint: arXiv:1810.01817 (2018) 17. Luan, Y., Wadden, D., He, L., et al.: A general framework for information extraction using dynamic span graphs. arXiv preprint: arXiv:1904.03296 (2019) 18. Straková, J., Straka, M., Hajiˇc, J.: Neural architectures for nested NER through linearization. arXiv preprint: arXiv:1908.06926 (2019) 19. Li, X., Feng, J., Meng, Y., et al.: A unified MRC framework for named entity recognition. arXiv preprint: arXiv:1910.11476 (2019)

UKT: A Unified Knowledgeable Tuning Framework for Chinese Information Extraction Jiyong Zhou1,2 , Chengyu Wang2(B) , Junbing Yan2,3 , Jianing Wang3 , Yukang Xie1,2 , Jun Huang2 , and Ying Gao1(B) 1

South China University of Technology, Guangzhou, Guangdong, China {csjyzhou,aukangyuxie}@mail.scut.edu.cn, [email protected] 2 Alibaba Group, Hangzhou, Zhejiang, China {chengyu.wcy,huangjun.hj}@alibaba-inc.com 3 East China Normal University, Shanghai, China [email protected]

Abstract. Large Language Models (LLMs) have significantly improved the performance of various NLP tasks. Yet, for Chinese Information Extraction (IE), LLMs can perform poorly due to the lack of finegrained linguistic and semantic knowledge. In this paper, we propose Unified Knowledgeable Tuning (UKT), a lightweight yet effective framework that is applicable to several recently proposed Chinese IE models based on Transformer. In UKT, both linguistic and semantic knowledge is incorporated into word representations. We further propose the relational knowledge validation technique in UKT to force model to learn the injected knowledge to increase its generalization ability. We evaluate our UKT on five public datasets related to two major Chinese IE tasks. Experiments confirm the effectiveness and universality of our approach, which achieves consistent improvement over state-of-the-art models.

Keywords: Chinese information extraction knowledge validation

1

· knowledge injection ·

Introduction

Recently, Large Language Models (LLMs) have significantly improved the performance of various NLP tasks [15,26]. For example, ChatGPT1 has shown remarkable capabilities of understanding user intention and providing complete and well-organized responses. Yet, it is infeasible to apply LLMs to all NLP tasks and scenarios. Consider Information Extraction (IE), which aims to extract key information from raw texts, including tasks such as Relation Extraction (RE) and Named Entity Recognition (NER) [10,11,31]. Some examples of ChatGPT for Chinese IE are presented in Fig. 1. We can see that ChatGPT can perform 1

https://openai.com/blog/chatgpt.

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 210–222, 2023. https://doi.org/10.1007/978-3-031-44696-2_17

Unified Knowledgeable Tuning for Chinese IE

211

Fig. 1. Some examples with predictions generated by ChatGPT for Chinese IE. The red indicates errors. (Color figure online)

poorly, with reasons stated below: i) LLMs with decoder-only structures are more suitable for generative tasks such as machine translation and text generation, rather than text understanding and extraction [13]. ii) The training process of LLMs primarily is based on statistical characteristics of texts and pays little attention to fine-grained linguistic and semantic knowledge, which plays a vital role to improve the performance of IE [18,21]. In addition, LLMs suffer from high training and deployment costs, leading a low Return on Investment (ROI) ratio for real-world applications. Hence, it is more desirable to design task-specific moderate-size models that digest fine-grained linguistic and semantic knowledge to address Chinese IE with constrained computational resources. In the literature, some works focus on moderate-size models for several IE tasks [7,9,10,17], which still suffer from three issues: i) Most existing approaches do not simultaneously incorporate a variety of linguistic and semantic knowledge [10,18]. ii) The majority of these approaches which incorporate linguistic or semantic knowledge, specifically structural knowledge, suffer from the lack of the distinction between different types of relationships or the lack of meaningful relationship representations, resulting in the loss of key information when the model digests the input knowledge [9,17]. iii) Previous approaches based on knowledge injection mostly focus on “exploiting” the knowledge only, without validating whether the model truly “captures” the knowledge and uses it as the guidance for IE [28,29]. In response to the aforementioned drawbacks, we propose a lightweight yet effective Chinese IE framework utilizing the Unified Knowledgeable Tuning (UKT) methodology. This framework operates in a plug-and-play fashion and can be utilized in various Chinese IE models built upon the Transformer architecture. Moreover, its coherence lies in its ability to address multiple Chinese IE tasks, including NER and RE, in a highly analogous manner. In UKT, we propose a Multi-relational, Multi-head Knowledge Fusion (MMKF) module to address the problems of missing and misrepresentation of knowledge mentioned

212

J. Zhou et al.

above, which incorporates both linguistic and semantic knowledge into representations. We further learn one-to-one meaningful relationship representations based on BERT [2] to represent different relationships between text fragments. In addition, UKT integrates a novel Relational Knowledge Validation (RKV) module to address the issue of knowledge usability. It explicitly forces the model to learn the injected relational knowledge through back propagation, making the model more knowledgeable and easier to generalize to previously unseen cases. In experiments, we apply UKT to popular Chinese IE models (i.e., FLAT [10] and ATSSA [7]) on five public datasets related to two major Chinese IE tasks (i.e., NER and RE). Experiments confirm the effectiveness and universality of our approach, achieving consistent improvement over state-of-the-art models. In addition, we show that UKT is more capable of tackling Chinese IE, surpassing prevalent LLMs. The major contributions of this paper can be summarized as: – We propose a lightweight yet effective framework for Chinese IE based on UKT which works in a plug-and-play fashion and is applicable to popular Chinese IE models based on the Transformer architecture. It can address several Chinese IE tasks in a highly analogous manner.2 – We propose MMKF in UKT to incorporate both linguistic and semantic knowledge into Chinese word representations rightly. We further propose RKV to force model to learn the injected relational knowledge. – We apply UKT to two popular Chinese IE models on five public datasets related to two major Chinese IE tasks. Experimental results confirm the effectiveness and universality of our approach.

2

Related Work

Chinese Information Extraction. Lattice structure has received significant attention from the Chinese IE community by incorporating word information and boundary information. There have been several attempts at designing a model to optimize performance with lattice input, such as Lattice LSTM [30], MG Lattice [11] and LR-CNN [4]. However, both RNN-based and CNN-based models struggle with modeling long-distance dependencies and limited computational efficiency. Another common approach is encoding lattice graph by graph neural network (GNN), such as LGN [5] and CGN [19]. These approaches still rely on LSTMs as the bottom encoder due to the lack of sequential structure, thus increasing the complexity of the model. Recently, Transformer [20] has shown promising results in many downstream NLP tasks, owing to its good parallelism and ability to capture long-distance dependencies. Many scholars have explored the combination of Transformer and lattice structure. LAN [31] achieves it through lattice-aligned attention. FLAT [10] converts lattice structure into flat structure via ingenious span relative position encoding. Based on FLAT, ATSSA [7] has been improved by activating keys selectively. In this work, 2

Source codes will be released in the EasyNLP framework [22].

Unified Knowledgeable Tuning for Chinese IE

213

we further propose a novel UKT framework for Chinese IE and apply it to the Transformer architecture. Incorporating Linguistic and Semantic Knowledge for IE. Linguistic and semantic knowledge plays a vital role for various NLP tasks [18,21]. Specifically, structural knowledge, such as dependency parse tree (DEPT) and semantic dependency parse tree (SDPT), serves as a key component for IE. There is long-standing history digesting DEPT or SDPT for IE [3,25]. More recently, several studies try to encode structural knowledge graph by GNN [9,17]. Hence, these approaches usually need to combine with sequential structure models, thus increasing the complexity of the model. With the development of transformerbased approaches, researchers pay more attention on exploiting structural knowledge on the Transformer structure for IE [1,18]. For example, Sachan et al. [18] investigate strategy of applying GNN on the output of transformer and strategy of infusing syntax structure into the transformer attention layers for incorporating structural knowledge. Chen et al. [1] propose a type-aware map memory module based on the Transformer architecture to encode dependency information. However, most approaches may suffer from the knowledge usability issue and the missing or misrepresentation of knowledge issue. To address these limitations, we propose a novel UKT framework to correctly incorporate both DEPT and SDPT for Chinese IE.

Fig. 2. Overall architecture. ➀➁➂➃ represent “subject-verb”, “head”, “verb-object” and “punctuation” dependency relationships, respectively. A,B,C,D represent “head”, “experience”, “location” and “punctuation” semantic relationships, respectively.

214

3

J. Zhou et al.

UKT: The Proposed Method

3.1

A Brief Overview of UKT

The overall architecture of Unified Knowledgeable Tuning (UKT) is shown in Fig. 2. Consider the example token sentence “The West Lake is located in Hangzhou.”. We can obtain the structural knowledge (i.e. dependency parse tree (DEPT) and semantic dependency parse tree (SDPT)) by an off-the-shelf tool3 , as shown in Fig. 2. Here, each node in the tree is a text fragment, such as “the West Lake”. Each edge in the tree represents the relationship between text fragments. For example, there is a “verb-object” dependency relationship and is a “location” semantic dependency relationship between “is located in” and “Hangzhou”. We represent relationship knowledge as shown by Mdep and Msdp separately in Fig. 2 and inject it into the model through a Multi-relational, Multihead Knowledge Fusion (MMKF) module, detailed described in Subsect. 3.2. In order to validate whether the model truly “captures” the knowledge, we propose a Relational Knowledge Validation (RKV) module to calculate and back propagate the gap between the knowledge embedded in the model output sequence and the ground-truth knowledge, detailed in Subsect. 3.3. Take DEPT as an example. Specifically, we calculate and back propagate the gap between the dependency relationship between “is located in” and “Hangzhou” learned by the model and the ground-truth “verb-object” relationship. We need to further clarify that UKT can be integrated into several Chinese IE models based on the Transformer architecture non-intrusively, i.e., FLAT [10] and ATSSA [7]. It addresses several Chinese IE tasks in a highly analogous manner. We will explain the details in Subsect. 3.4. 3.2

Multi-relational Multi-head Knowledge Fusion

Structural knowledge such as DEPT and SDPT indicates the existence and types of relationship between text fragments in the sentence, and provides crucial prior knowledge for IE [3,21]. As shown in Fig. 2, noun-based elements such as objects, subjects and complements are more likely to be named entities (such as “the West Lake” and “Hangzhou”). According to DEPT, there may be a relationship between “the West Lake” and “Hangzhou” through the intermediary jump of “is located in”. In addition, different types of relationships reflect different key information for IE. For example, the semantic dependency relationship “location” between “is located in” and “Hangzhou” in the figure helps the model to infer that “Hangzhou” is a named entity of the location type. However, existing approaches pay little attention to DEPT and SDPT simultaneously, or lack the distinction of different types of relationships or lack one-to-one meaningful relationship representations. In UKT, We propose MMKF to learn one-toone meaningful relationship knowledge representations based on BERT [2] and exploit them in a variant of self-attention to address the above limitations. 3

http://ltp.ai/.

Unified Knowledgeable Tuning for Chinese IE

215

Relationship Knowledge Representation. Take DEPT as an example. After obtaining DEPT of a token sentence C = [c1 , c2 , · · · , cn ] (n is length of C), as shown in Fig. 2, Tdep = {Rootdep , N odedep , Edgedep } is used to define the tree abstractly, where Rootdep is the root of the tree, N odedep and Edgedep are the node set and the edge set respectively. Each node in the tree is a text fragment ent of C. Note that a text fragment may consist of more than one token. Each edge can be represented as {entp , entq , rdep }, which means that there is an rdep relationship between entp and entq . rdep ∈ Rdep , where Rdep indicates the set of all possible dependency relationships in total. Consider the edge {entp , entq , rdep }, we can feed the prompt sequence Cprompt , i.e., Cprompt = [CLS] + C + [SEP] + entp + [SEP] + entq into BERT [2] to obtain the corresponding sentence-level relationship knowledge representation r dep (Cprompt ) of the relationship rdep . The one-to-one meaningful relationship knowledge representation for rdep , denoted as r dep , is the averaged result of all such sentence-level representations (i.e., all r dep (Cprompt )). We represent Tdep as dependency relationship knowledge representation (i,j) matrix Mdep . Let Mdep be the (i, j)-th element of Mdep , initialized as 0. For (i,j)

each edge {entp , entq , rdep } of Tdep , we have Mdep = r dep when entlp ≤ i ≤ entrp and entlq ≤ j ≤ entrq . entlp and entrp respectively represent the starting and ending token position index of the text fragment entp in the token sentence. Relationship knowledge extends from the text fragment level to the token level. Thus, the knowledge of {entp , entq , rdep } that there is a dependency relationship rdep between entp and entq can be encoded in Mdep . Similarly, we also obtain semantic dependency relationship knowledge representation matrix Msdp . Knowledge Injection. In UKT, we propose a variant of the self-attention mechanism which implement multiple attention heads following common practice to inject structural knowledge into the model. Let Q, K and V as the query, key and value in the vanilla self-attention mechanism. The vanilla self-attentive representation is calculated by: Attn = softmax(Q · K T ) · V . Next, we consider the process of knowledge injection. Take DEPT as an example. We replace the key K with the dependency relationship knowledge representation matrix Mdep to obtain the weighted self-attentive representation Attndep related to DEPT, shown as follows: T Adep = (Q + vdep ) · Mdep ,

Attndep = softmax(Adep ) · V,

(1)

where vdep is a learnable parameter for DEPT. Similarly, for SDPT, we have: T , Asdp = (Q + vsdp ) · Msdp

Attnsdp = softmax(Asdp ) · V.

(2)

In addition, different self-attentive representations, i.e., Attn, Attndep and Attnsdp , hold different levels of importance in different tasks and datasets. We choose to use a learnable weight W to solve this problem to compute the final self-attentive representation Attn∗ . The formula can be expressed as follows: Attn∗ = W · [Attn, Attnsdp , Attndep ]T .

(3)

216

3.3

J. Zhou et al.

Relational Knowledge Validation

In order to force the model to learn injected relational knowledge, RKV is proposed to examine whether the model truly captures the knowledge. Denote H as the model’s output representation, which may store the injected relational knowledge. We obtain Hhead and Htail by implementing linear transformations on H. The former represents the semantic and syntax information of the first text fragment in the relationship, while the latter represents those of the ˜ be a reproduced relationship knowledge matrix second text fragment. Let M learned by the model, which can be obtained according to the following formula: ˜ = Hhead · H T . We again take DEPT as an example. We obtain the depenM tail ˜ dep by implementing linear dency reproduced relationship knowledge matrix M (i,j) ˜ ˜ ˜ dep , which represents transformations on M . Mdep is the (i, j)-th element of M the dependency relationship knowledge learned by the model between the i-th and the j-th token. We assume that there is a real dependency relationship rdep . ˜ (i,j) is fed into a softmax classifier to obtain Pr(i,j) (rdep |C), which After that, M dep is the estimated probability for i-th and j-th token having the relation rdep . During model training, as a regularizer, we back propagate the gap between the predicted relationship probabilistic distribution and the ground-truth to force the model to capture the true dependency relationship type. The samplewise loss function related to DEPT (denoted as Ldep (Θ)) is defined as: Ldep (Θ) = −

n  n  i

logP r(i,j) (rdep |C),

(4)

j

where Θ indicates all parameters of ourmodel. The loss function w.r.t. SDPT n n is similar to Ldep (Θ), i.e., Lsdp (Θ) = − i j logP r(i,j) (rsdp |C). The final loss function of model is L(Θ), defined as follows: L(Θ) = Ltask (Θ) + λsdp · Lsdp (Θ) + λdep · Ldep (Θ),

(5)

where λsdp and λdep are trade-off between multiple terms. Ltask (Θ) is the original loss function of the IE task (either NER or RE). 3.4

UKT for Chinese IE

UKT works in a plug-and-play mode and is applicable to several popular Chinese IE models based on the Transformer architecture. Its universality also lies in its ability to address multiple Chinese IE tasks, including NER and RE, in a highly similar manner. In the following, we describe how to apply UKT to FLAT [10] for Chinese IE. For the NER task, we obtain the span embeddings EN ER directly from FLAT’s span representations of the input. For the RE task, because many studies [11,27] have proven that positional embeddings are highly important, we combine them with FLAT’s span representations to obtain the span embeddings for RE, denoted as ERE . After that, EN ER or ERE are passed to MMKF for

Unified Knowledgeable Tuning for Chinese IE

217

self-attention computation, with structural knowledge injected. The formula for obtaining the final representations of UKT-enhanced FLAT is: Attn∗ = W · [AttnF LAT , Attnsdp , Attndep ]T ,

(6)

where AttnF LAT is the self-attentive representation of FLAT. For NER, a standard Conditional Random Field (CRF) module is built upon the modified transformer block. For RE, a simple relation classifier is added to the last transformer layer. The total loss for UKT-enhanced FLAT is as follows: L(Θ) = LF LAT (Θ) + λsdp · Lsdp (Θ) + λdep · Ldep (Θ),

(7)

where LF LAT (Θ) is the original loss function of FLAT related to different tasks. Similarly, we can integrate UKT into ATSSA [7] in a highly similar manner, with two aspects modified: i) knowledge-injected self-attention by MMKF, and ii) additional loss functions for RKV. Due to space limitation, we omit the details.

4

Experiments

In this section, we conduct extensive experiments on five public Chinese IE datasets. We evaluate UKT using F1-score, with a comparison of state-of-theart approaches. All experiments are conducted on NVIDIA V100 GPU (32GB). 4.1

Experiments Setup

Datasets. Three mainstream Chinese NER benchmark datasets (Weibo [6,16], Resume [30], MSRA [8]) and two mainstream Chinese RE benchmark datasets (SanWen [24] and FinRE [11]) are used for evaluating the performance of UKT. We show detailed statistics in Table 1, following the settings of LAN [31]. Table 1. Statistics of five Chinese IE datasets. Dataset NER Weibo Resume MSRA

RE SanWen FinRE

# Train 73.8K 124.1K

2,169.9K 515K

# Dev

14.5K 139K



55K

727K 81K

# Test

14.8K 15.1K

172.6K

68K

203K

Baselines. ATSSA [7] and FLAT [10] are two recent popular Chinese NER models which can be adapted for Chinese RE with simple modifications as mentioned above. We treat FLAT and ATSSA as state-of-the-art baselines. In addition, several popular methods for Chinese NER and RE are also compared.

218

J. Zhou et al.

Implementation Details. We use the pre-trained character embeddings and bigram embeddings trained with word2vec [14]. BERT in the experiments is “BERT-wwm” released in [2]. The above three embeddings respectively have size of 50, 50, 768. Different baseline models use different word embedding dictionaries including YJ4 , LS5 , TX6 and we conduct experiments using corresponding dictionaries along different baseline models suggested by their original papers. All of above four embeddings are fine-tuned during training. We use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 on all datasets for optimization. To avoid overfitting, the dropout technique is applied. For RKV, λsdp and λdep in the loss function are set to 1e-5 in default, which are also tuned over the development sets. 4.2

Overall Performance

Table 2 shows the overall performance on five public datasets. Generally, our approach UKT consistently outperforms all baseline models on both tasks, which demonstrates the effectiveness and universality of the proposed approach. Table 2. Overall results in terms of F1 (%). * denotes models for RE only. ** denotes models for NER only. # denotes our re-production results based on open-source codes. Model

Lexicon NER RE MSRA Resume Weibo FinRE SanWen

MG Lattice* [11] Lattice LSTM** [30] CGN** [19] SoftLexicon** [12] MECT** [23] LAN [31]

– YJ LS YJ YJ TX

– 93.18 93.47 95.42 96.24 96.41

– 94.46 – 96.11 95.98 96.67

– 58.79 63.09 70.50 70.43 71.27

FLAT [10] FLAT w/. UKT (ours)

YJ YJ

96.09 96.36

95.86 96.66

68.55 50.07# 72.84# 69.55 51.00 74.48

ATSSA [7] ATSSA w/. UKT (ours)

LS LS

96.45 96.49

96.73 96.81

72.53 52.28# 73.86# 73.18 52.81 74.84

50.46

54.60

23.10

ChatGPT (for reference only) –

49.26 – – – – 51.35

28.47

65.61 – – – – 69.85

42.81

In detail, on three NER datasets, we obtain large performance improvement over Weibo (+1.00%), Resume (+0.8%) and MSRA (+0.27%) compared with vanilla FLAT. For ATSSA, our UKT-based enhancement also achieves good performance. In addition, injecting linguistic and semantic knowledge into vanilla 4 5 6

https://github.com/jiesutd/RichWordSegmentor. https://github.com/Embedding/Chinese-Word-Vectors. https://ai.tencent.com/ailab/nlp/en/embedding.html(v0.1.0).

Unified Knowledgeable Tuning for Chinese IE

219

FLAT improves the RE performance on FinRE and SanWen datasets by 0.93% and 1.64% in F1, respectively. For ATSSA, the corresponding improvement scores are 0.53% and 0.98%. We also present the F1 scores of ChatGPT (gpt3.5-turbo) over zero-shot Chinese IE for reference. We can see that ChatGPT performs poorly. By analyzing examples, we can intuitively find that relationship or entity types generated by ChatGPT may not be in the optional list provided in input prompts. For Chinese NER task, ChatGPT sometimes does not generate the correct numbers of named entities corresponding to inputs. In a few cases, ChatGPT does not even follow instructions and generates answers irrelevant to the questions. Hence, our approach has practical values in the LLM era. 4.3

Model Analysis

Ablation Study. To further investigate the effectiveness of all the major components of our proposed approach, using FLAT as our base model, we conduct ablation experiments, with results reported in Table 3. We have the following findings. i) The degradation in performance with the removal of two types of knowledge proves that injecting it into models brings substantial improvement for Chinese IE. ii) We replace all relationship knowledge representation matrices M (i.e. Mdep and Msdp ) with all-zero matrices. The decline in performance indicates the validity of one-to-one meaningful relationship representations. iii) Removing RKV leads to significantly worse results on all datasets, which suggests that RKV can truly force the model to learn injected relational knowledge, thus improving the effectiveness of knowledge injection for Chinese IE. Influence of Hyperparameters. To analyze the influence of λ (i.e. λdep and λsdp ), we conduct experiments on UKT-enhanced FLAT over one NRE dataset (i.e. Weibo) and one RE dataset (i.e. SanWen), in which one hyperparameter is kept as 1e-5 while the other is varied. The F1-score results are summarized in Fig. 3. As the value of λ increases, the scores generally exhibit a trend of initially increasing followed by decreasing. This is because when λ is small, the RKV technique fails to exert its effect. When λ is large, the model tends to allocate more attention to the learning of the injected knowledge, deviating from the required IE task and leading to a decrease in F1 score. Table 3. Ablation results in terms of F1 (%). Model

Resume Weibo MSRA SanWen FinRE

Full implementation 96.66

69.55 96.36

74.48

51.00

w/o. w/o. w/o. w/o.

69.18 69.13 69.46 68.71

72.99 74.06 73.40 73.57

50.77 50.68 50.53 50.28

DEPT SDPT Mdep and Msdp RKV

96.56 96.52 96.44 96.49

96.16 96.29 96.13 96.26

220

J. Zhou et al.

Fig. 3. Parameter analysis of λdep and λsdp .

Visualizations. In order to vividly demonstrate the impact of knowledge for model learning, we visualize the attention distributions related to DEPT and SDPT of the example “The West Lake is located in Hangzhou.”, as shown in Fig. 4. In the figure, the vertical axis represents queries and the horizontal axis indicates keys. We observe that there is differential attention given to tokens where structural knowledge exists compared to other tokens. Take DEPT as an example, the attention of queries in “The West Lake” is more likely to be assigned to keys in “is located in”, which exactly corresponds to the “subjectverb” dependency relationship between “The West Lake” and “is located in”.

Fig. 4. Attention distributions related to knowledge derived from DEPT and SDPT.

5

Conclusion

In this paper, we propose UKT to inject linguistic and semantic knowledge for Chinese IE. It plays in a plug-and-play fashion and is applicable to popular Chinese IE models based on the Transformer architecture. To achieve our goal, we propose MMKF in UKT to incorporate knowledge and further introduce RKV to force model to learn the injected relational knowledge. We evaluate the proposed approach on five public datasets related to two Chinese IE tasks. Experimental results demonstrates the effectiveness and universality of the proposed

Unified Knowledgeable Tuning for Chinese IE

221

approach, which achieves consistent improvement over state-of-the-art models. In the future, we will extend our study to a broader range of Chinese IE tasks. Acknowledgments. This work is supported by the Guangzhou Science and Technology Program key projects (202103010005), the National Natural Science Foundation of China (61876066) and Alibaba Cloud Group through the Research Talent Program with South China University of Technology.

References 1. Chen, G., Tian, Y., Song, Y., Wan, X.: Relation extraction with type-aware map memories of word dependencies. In: ACL-IJCNLP, pp. 2501–2512 (2021) 2. Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z.: Pre-training with whole word masking for Chinese bert. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3504–3514 (2021) 3. Fundel, K., K¨ uffner, R., Zimmer, R.: Relex—relation extraction using dependency parse trees. Bioinformatics 23(3), 365–371 (2007) 4. Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y.G., Huang, X.: CNN-based Chinese NER with lexicon rethinking. In: IJCAI, pp. 4982–4988 (2019) 5. Gui, T., et al.: A lexicon-based graph neural network for Chinese NER. In: EMNLP-IJCNLP, pp. 1040–1050 (2019) 6. He, H., Sun, X.: F-score driven max margin neural network for named entity recognition in Chinese social media. In: EACL, pp. 713–718 (2017) 7. Hu, B., Huang, Z., Hu, M., Zhang, Z., Dou, Y.: Adaptive threshold selective selfattention for Chinese NER. In: COLING, pp. 1823–1833 (2022) 8. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: SIGHAN, pp. 108–117 (2006) 9. Li, F., Lin, Z., Zhang, M., Ji, D.: A span-based model for joint overlapped and discontinuous named entity recognition. In: ACL/IJCNLP, pp. 4814–4828 (2021) 10. Li, X., Yan, H., Qiu, X., Huang, X.J.: Flat: Chinese NER using flat-lattice transformer. In: ACL, pp. 6836–6842 (2020) 11. Li, Z., Ding, N., Liu, Z., Zheng, H., Shen, Y.: Chinese relation extraction with multi-grained information and external linguistic knowledge. In: ACL, pp. 4377– 4386 (2019) 12. Ma, R., Peng, M., Zhang, Q., Wei, Z., Huang, X.J.: Simplify the usage of lexicon in Chinese NER. In: ACL, pp. 5951–5960 (2020) 13. Ma, Y., Cao, Y., Hong, Y., Sun, A.: Large language model is not a good few-shot information extractor, but a good reranker for hard samples! CoRR abs/2303.08559 (2023) 14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013) 15. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: NIPS, pp. 27730–27744 (2022) 16. Peng, N., Dredze, M.: Named entity recognition for Chinese social media with jointly trained embeddings. In: EMNLP, pp. 548–554 (2015) 17. Qin, H., Tian, Y., Song, Y.: Relation extraction with word graphs from N-grams. In: EMNLP, pp. 2860–2868 (2021)

222

J. Zhou et al.

18. Sachan, D., Zhang, Y., Qi, P., Hamilton, W.L.: Do syntax trees help pre-trained transformers extract information? In: EACL, pp. 2647–2661 (2021) 19. Sui, D., Chen, Y., Liu, K., Zhao, J., Liu, S.: Leverage lexical knowledge for Chinese named entity recognition via collaborative graph network. In: EMNLP-IJCNLP, pp. 3830–3840 (2019) 20. Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017) 21. Wan, Q., Wan, C., Hu, R., Liu, D.: Chinese financial event extraction based on syntactic and semantic dependency parsing. Chin. J. Comput. 44(3), 508–530 (2021) 22. Wang, C., et al.: EasyNLP: a comprehensive and easy-to-use toolkit for natural language processing. In: EMNLP, pp. 22–29 (2022) 23. Wu, S., Song, X., FENG, Z.: Mect: multi-metadata embedding based crosstransformer for Chinese named entity recognition. In: ACL-IJCNLP, pp. 1529–1539 (2021) 24. Xu, J., Wen, J., Sun, X., Su, Q.: A discourse-level named entity recognition and relation extraction dataset for Chinese literature text. CoRR abs/1711.07010 (2017) 25. Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., Jin, Z.: Classifying relations via long short term memory networks along shortest dependency paths. In: EMNLP, pp. 1785–1794 (2015) 26. Zeng, A., et al.: GLM-130B: an open bilingual pre-trained model. CoRR abs/2210.02414 (2022) 27. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: COLING, pp. 2335–2344 (2014) 28. Zhang, T., et al.: HORNET: enriching pre-trained language representations with heterogeneous knowledge sources. In: CIKM, pp. 2608–2617 (2021) 29. Zhang, T., et al.: DKPLM: decomposable knowledge-enhanced pre-trained language model for natural language understanding. In: AAAI, pp. 11703–11711 (2022) 30. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: ACL, pp. 1554–1564 (2018) 31. Zhao, S., Hu, M., Cai, Z., Zhang, Z., Zhou, T., Liu, F.: Enhancing Chinese character representation with lattice-aligned attention. IEEE Trans. Neural Netw. Learn. Syst. 34(7), 3727–3736 (2023). https://doi.org/10.1109/TNNLS.2021.3114378

SSUIE 1.0: A Dataset for Chinese Space Science and Utilization Information Extraction Yunfei Liu1,2 , Shengyang Li1,2(B) , Chen Wang1 , Xiong Xiong1,2 , Yifeng Zheng1,2 , Linjie Wang1,2 , and Shiyi Hao1,2 1 Key Laboratory of Space Utilization, Technology and Engineering Center for Space

Utilization, Chinese Academy of Sciences, Beijing, China {liuyunfei,shyli,wangchen21,xiongxiong20,zhengyifeng21, wanglinjie22,haoshiyi22}@csu.ac.cn 2 University of Chinese Academy of Sciences, Beijing, China

Abstract. In this paper, we present the Chinese Space Science and Utilization Information Extraction dataset, including named entity recognition, relationship extraction, event extraction tasks, consisting of 6926 sentences, 58771 entities, 30338 triplets, 19 entities, 36 relations and 3039 events categorized into 20 event types, along with 8977 event arguments mapped to 26 argument roles, which, named SSUIE 1.0. We design an efficient coarse-to-fine procedure including candidate generation and crowdsourcing annotation, to achieve high data quality. Additionally, we evaluate performances of the most recent state-of-the-art frameworks and pre-trained language models for the named entity recognition, joint extraction of entities and relations and event extraction tasks on the SSUIE 1.0 dataset. Experiment results show that even these most advanced models still have a large space to improve on our dataset. Our analysis points out several challenges and multiple potential future research directions for the task specialized in the space science and utilization information extraction domain. Keywords: Information Extraction · Dataset · Chinese Space Science and Utilization

1 Introduction In the field of manned space engineering missions, the knowledge system is very complex. In each subject field and system engineering, massive scientific data of high professionalism, high complexity, various forms and wide coverage are continuously produced. All kinds of information are complex, lack of effective knowledge mining, organization management, analysis, and utilization methods. However, the general knowledge extraction algorithm cannot be directly applied to knowledge extraction in this field, resulting in poor field adaptability, and knowledge extraction is not effective. Knowledge graph construction mainly includes entity recognition, relationship extraction, event extraction and other information extraction technologies, carrying important semantic information conveyed by the text. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 223–235, 2023. https://doi.org/10.1007/978-3-031-44696-2_18

224

Y. Liu et al.

Entity and relation extraction is to extract structural information from unstructured domain-related texts, consisted of two steps, the entity extraction and the relation extraction. In the Chinese Space Science and Utilization field, entities usually refer to terminology in the text, e.g., space missions, scientific experiment loads, experimental systems, experimental units, scientific data, and sensors. In the view of Natural Language Processing (NLP), entity extraction is usually regarded as a Named Entity Recognition (NER) task, which extracts fore-mentioned domain-related entities from raw texts. Relation extraction is to judge the semantic relations (such as load composition, data production, scientific experiments) of entity pairs. Event extraction (EE) is an important yet challenging task in natural language understanding. Given an event mention, an event extraction system ought to identify event triggers with specific event types, as well as their corresponding arguments with specific argument roles. Until now, to the best of our knowledge, there is no dataset specialized in the field of Chinese Space Science and Utilization. Therefore, we present a Chinese Space Science and Utilization Information Extraction (SSUIE 1.0) dataset in this paper. Our contributions are as follows: – We collect Chinese Space Science and Utilization texts from multiple sources and refer to annotation schema for named entities, relations, events on domain-related texts, and annotate a Chinese Space Science and Utilization information extraction corpus. – We employ the most recent state-of-the-art frameworks and pre-trained language models for the extraction of entities, relations and events to perform detailed evaluations on our dataset. We suggest that our work provides both a dataset and baselines for future research for Chinese Space Science and Utilization information extraction.

2 Related Work 2.1 Named Entity Recognition Named Entity Recognition (NER) is usually regarded as a sequence labeling task, identifying entity types by labeling characters or words. [1] first applied the Bi-LSTM-CRF model to NER. [2] designed a BiLSTM-CRF model for Chinese NER, introducing features that include characters and radicals. TENER [3] improved the Transformer Encoder structure to capture the distance and direction information between tokens. BSNER [4] proposed the Boundary Smoothing method to enhance NER performance. Lattice LSTM [5] is the first to propose using lexical enhancement for Chinese NER. WC-LSTM [6] changes the encoding strategy of Lattice LSTM to solve the problem of non-parallel computing. Similarly, LRCNN [7] uses CNN to address the non-parallelization issue of LSTM. CGN [8] uses a graph neural network based on lexical enhancement to solve the NER problem. Softlexicon [9] designs a lexicon input module that can be easily transferred to other models. FLAT [10] converts the lattice structure to adapt to Transformer input, and NFLAT [11] further improves recognition performance by designing an Interformer module based on FLAT. LEBERT [12] designs a lexicon adapter module to fuse

SSUIE 1.0: A Dataset for Chinese Space Science

225

word and character features. W2NER [13] introduces the design of word-word grid and models the neighboring relations between entity words, achieving SOTA performance on both Chinese and English datasets. 2.2 Entity-Relation Joint Extraction Entity-relation joint extraction can be divided into pipeline methods and joint extraction methods. The earliest entity-relation extraction method based on RNN was proposed by [14]proposed a novel neural network model SDP-LSTM for relation classification. The PURE [15] separately trains entity and relation models obtains state-of-the-art results on multiple standard benchmark datasets. Entity-relation joint extraction model can avoid the problem of error accumulation and propagation brought by pipeline methods. Casrel [16] method uses a cascaded binary tagging framework to handle various types of relations and reduce errors in entity boundary prediction. The TPLinker [17]method uses a Bert-based encoder and a Bi-LSTM decoder to jointly optimize the extraction of entities and relations through multi-task learning. The PFN [18] method uses a partition filter mechanism to improve the accuracy of entity and relation extraction. 2.3 Event Extraction Event extraction (EE) is one of the important challenges in the field of information extraction. By identifying specific event types and extracting their specific arguments, EE can be divided into two subtasks: Event Detection (ED) and Event Argument Extraction (EAE). Traditional ED work focused on extracting trigger words from event text [19]. Considering the difficulty of the trigger word extraction task, Devlin et al. (2018) regarded ED as a multi-classification task [20]. In recent years, with the widespread use of techniques such as pre-training models, prompt-based learning, and contrastive learning, ED has achieved great success [21]. EAE related work can be divided into two categories, one type of researchers model it as a semantic role labeling problem [22]. It is mainly divided into two stages: first identifying candidate spans and then classifying their roles [23]. Another type of researchers is working on end-to-end extraction of event arguments with the help of pre-trained language models. Such as QA-based models [22], Generation-based methods [24] and Prompt-based models [25].

3 SSUIE Dataset The construction process of the SSUIE 1.0 dataset is shown in Fig. 1, which mainly includes three parts: 1) Schema construction. 2) Data collection and filtering. 3) Data annotation.

226

Y. Liu et al.

Fig. 1. Process of dataset construction

3.1 Schema Construction According to domain knowledge management requirements, all information would be extracted according to predefined schema. Therefore, schema construction is critical to the quality of the event extraction corpus, Define domain-related entities, relations, and event types before the dataset is built. By pre-annotating and analyzing a part of the corpus, we collaborate with domain experts to settle standards for the annotation of the corpus. We defined 19 entity types, 36 relation types, and 20 event types. 3.2 Data Collection and Filtering To obtain high-quality initial corpora, we first identified keywords closely related to the field of space science based on domain entity, relation, and event types, such as “Tiangong-1(天宫一号)”, “Space Life Sciences and Biotechnology (空间生命科学与 生物技术)” and “hyperspectral imaging spectrometer experimental payload (高光谱成 像光谱仪试验载荷)”. Subsequently, based on these keywords, we conducted keyword searches on Baidu, Google, Wikipedia, and other domain platforms, and crawled relevant web pages from the internet using web crawlers. Then, we extracted content from the corresponding web pages using specific expressions to obtain the initial corpus. After obtaining the initial corpus, data preprocessing was performed. Data preprocessing included text segmentation, removal of HTML tags, illegal characters, and irrelevant information such as images to obtain plain text, followed by sentence segmentation. We ensured that each sentence after segmentation did not exceed 512 characters to accommodate the input of the algorithmic model. 3.3 Data Annotation To ensure the quality of the dataset annotation, the following quality control measures are adopted: (1) Before annotation, all annotators are uniformly trained to comb through the hierarchical structure of knowledge in the field of space science and utilization. Before

SSUIE 1.0: A Dataset for Chinese Space Science

227

formal annotation, a pre-annotation is performed to ensure that all annotators have a unified and accurate understanding of the entity types, relation types and event types defined in the schema. (2) During the annotation process, the annotation standards are synchronized through online documents. Issues encountered during annotation are discussed by the annotation team and finally a consensus is reached to ensure that all annotators execute the same annotation standards. (3) After annotation is completed, the annotators cross-check and correct the annotated results to ensure that each annotated data has been checked by at least two annotators and to further reduce errors in annotation. The initial size of the dataset constructed through manual annotation is limited, To improve the efficiency of expanding the dataset, we adopt a semi-automatic approach to dynamically expand the dataset. The semi-automatic dataset expansion process involves the following steps: (1) Manually annotate an initial dataset of a certain size. (2) Construct information extraction model based on the initial dataset. (3) Retrieve new text data by calling a search engine API based on a set of keywords, and preprocess the text data. (4) Automatically annotate the new data using the information extraction model trained in step (2). (5) Perform manual corrections on the annotations generated by the model. (6) Add the corrected annotations to the training set, and repeat steps (2) through (6). This semi-automated approach is used to dynamically expand the SSUIE dataset due to the ongoing addition of large amounts of text data in the field of space science and utilization. By using this approach, the efficiency of dataset expansion can be greatly improved compared to manually annotating every new text corpus. 3.4 Data Statistics The statistics of the dataset are show in Table 1. The SSUIE 1.0 dataset contains 6,926 sentences, 58,771 entities, 30,338 relation triplets. For the event extraction task, the dataset contains 3039 events categorized into 20 predefined event types, along with 8977 event arguments mapped to 26 argument roles. Table 1. Statistics of SSUIE datasets Entity Types

Entities

Relation Types

Triplets

Sentences

19

58771

36

30338

6926

Event types

Events

Argument roles

Arguments

Instances

20

3039

26

8977

2848

228

Y. Liu et al.

4 Chinese Space Science and Utilization Information Extraction To verify the performances of approaches and establish baselines for future research based on the SSUIE 1.0. We conducted Named Entity Recognition, Entity-Relation joint extraction and Event extraction experiments on our SSUIE 1.0 dataset. 4.1 Named Entity Recognition Models We compare multiple Chinese NER models introduced in recent years on SSUIE. 1) Lattice LSTM [5] was the first to propose a vocabulary-enhanced approach for Chinese NER, using a lattice structure, which is a hybrid structure of characters and words. 2) FLAT [10] converts the lattice structure into a flat structure composed of spans to fit the input of Transformer, and designs position encodings to effectively solve the problem of word boundary ambiguity. 3) NFLAT [11] proposed the InterFormer structure to build a non-flat-lattice, reducing unnecessary computations. 4)LEBERT [12] proposed a lexicon adapter module that injects lexical features into the middle layers of BERT, leveraging BERT’s strong sequence modeling ability to incorporate both wordlevel and character-level features. 5) W2NER [13] introduced the design of word-pairs grid and utilized multi-granularity 2D convolutions to model the entity boundaries and entity word relationships. Results and Analysis Results of different models for NER on the SSUIE 1.0 dataset are shown in Table 2, Table 2 shows that W2NER frameworks outperform other approaches. W2NER introduces a novel word-word pairs grid structure. Leveraging a multi-granularity 2D convolutional neural network, it redefines the representation of word pairs, enables modeling of intricate relationships between them, and effectively identifies the boundaries of entities. By effectively capturing the interactions between word pairs at both close and distant distances, this approach demonstrates superior performance in SSUIE 1.0. Table 2. Results of different NER methods on SSUIE 1.0 Models

Precision

Recall

F1

Lattice-LSTM

73.81%

73.72%

73.77%

FLAT

72.97%

76.88%

74.88%

LEBERT

74.22%

79.05%

76.56%

NFLAT

74.27%

74.59%

74.43%

W2NER

76.37%

79.59%

77.95%

Figure 2 show two sentences taken from the SSUIE 1.0 da-taset and annotated by the model. The label is organized in the form of BIO. B-S & I-S denotes “Space Mission ( 空间任务)”, B-E & I-E denotes “Experimental System(实验系统)”, B-F & I-F denotes “Space Science Field(空间科学领域)”, and B-R & I-R denotes “ Research/Experimental Project(研究/实验项目)”.

SSUIE 1.0: A Dataset for Chinese Space Science

229

The baseline model had a limited perspective in the first sentence, recognizing only the entity type of “Space Mission(空间任务)” and disregarding the longer entity type of “Experimental System(实验系统)” that may also be present. Similarly, in the second sentence, the baseline model identified two separate entities as “Space Science Domain (空间科学领域)” but was unable to link them together and combine them with the following description to form the “Research/Experimental Project(研究/实验项目)” type.

Fig. 2. The NER examples from the SSUIE 1.0 dataset.

The SSUIE 1.0 dataset focuses on the domain of space science and utilization, which consists of specialized technical terms. These terms can be lengthy and complex, resulting in a high number of long entities within the dataset. Figure 3. Compares the length of entities in the SSUIE 1.0 dataset with other four benchmark datasets.

Fig. 3. The length distribution of entities.

230

Y. Liu et al.

4.2 Entity-Relation Joint Extraction The SSUIE 1.0 is different from the relation extraction dataset of simple sentences. The sentence structure is more complex and the expression content is richer. Each sentence contains more triplets. There are a large number of overlapping triplets in the sentence, that is, the situation where multiple triplets share entities, which can be divided into three situations according to the number of shared entities: normal triplets (Normal), single entity overlap (SPO) and entity pair overlap (EPO). As shown in Table 3. Table 3. Example of EPO and SPO Triplet Type Sentence

Subject

EPO

中科院空间应用工 程与技术中心应用 发展中心主任张伟 表示, 中国空间站 建成后将在空间生 命科学与人体研究 、微重力物理科学 、空间天文与地球 科学、空间新技术 与应用等4个重要 领域开展1000多项 研究项目

SPO

空间站的问天实验 S: 空间站 舱与梦天实验舱分 别对接天和核心舱 前向对接口, 再由 机械臂转为至侧向 停泊口对接

Predicate

Object

S1: 中科院空间应 P1: 负责人是 O1: 张伟 用工程与技术中心 P2: 所属组织机构 O2: 中科院空间应 应用发展中心 是 用工程与技术中心 S2: 张伟 应用发展中心

P: 包含舱段

O1: 问天实验舱 O1: 问天实验舱 O1: 问天实验舱

Model To perform joint extraction of entity and relation task, we evaluate different frameworks: 1) Casrel [16]: For the triple overlap problem, the discrete entity labels are converted into corresponding functional relation, which are used to identify overlapping triplets. 2) TpLinker [17]: use the form of matrix to express the relation between the head and tail of the entity and the entity pair. 3) PURE [15]: use the pipeline way, employing two separate encoders to learn entity and relation contexts respectively. 4) PFN [18]: A partition filter network is proposed to solve the problem of insufficient interaction between the two subtasks of named entity recognition and relation extraction. Results and Analysis The results of entity and relation joint extraction of different models on the SSUIE 1.0 dataset are shown in Table 4. The experimental results show that the PFN method works best and outperforms other models. The partition filter network proposed by this method is aimed at the information interaction problem between subtasks, and the overall experimental results are better than the Casrel method for the

SSUIE 1.0: A Dataset for Chinese Space Science

231

Table 4. Results of different Entity-Relation joint extraction methods on SSUIE 1.0 Model

NER

RE

P

R

F1

P

R

F1

Casrel

48.3

47.5

47.9

45.8

44.6

45.2

TpLinker

49.5

48.1

48.8

48.3

47.3

47.8

PURE

52.3

50.9

51.6

50.0

48.6

49.3

PFN

54.6

53.2

53.9

51.4

50.0

50.7

relation overlapping problem. At the same time, the NER experimental results of all models are higher than the RE experimental results, indicating that the model’s entity recognition effect on the SSUIE 1.0 dataset is better than that of relation extraction. To further study the ability of the model to extract different types of overlapping relations, comparative experiments were carried out on different types of sentences.

Fig. 4. F1-scores of extracting from different types of sentences

Figure 4(a) shows the experimental results for three different types of overlapping relations. All models have increased performance on Normal, EPO and SEO, reflecting the ability of the models to cope with different types of overlapping relations. At the same time, the PFN method shows the best performance, indicating the important role of partition filter operation on entity information and relation information. To compare the ability of the model to jointly extract entities and relations from sentences containing different numbers of triples on the SSUIE 1.0 dataset, five sets of comparative experiments were conducted. The experimental results are shown in Fig. 4(b). The performance of all models is comparable to that of Fig. 4(a) is similar.

232

Y. Liu et al.

1

SSUIE_EE ACE05

Proporon

0.8

RAMS 0.6

ChFinAnn DuEE

0.4 0.2 0 0

10

20

30

Arguments Length

40

50

Fig. 5. Event Arguments Length Statistics

4.3 Event Extraction Figure 5 shows the event arguments length distribution of the SSUIE 1.0 and the current mainstream public event extraction dataset, where the x-axis represents the length of the argument, and the y-axis represents the proportion of the argument exceeding this length in the dataset. It can be found that the large-length event arguments in the SSUIE 1.0 dataset account for a larger proportion than other datasets. Model We split event extraction into two subtasks, event detection task and event arguments extraction task, and implement experiments on them respectively. The following models are used as baselines: 1)BERT semantic classification: We implement a finetuned semantic classification model for event detection task based on the single sentence classification task in BERT [20]. 2)ERNIE-3.0 [23]: This model performs both event detection and event arguments extraction tasks as sequence labeling tasks. 3)PAIE [25]: use the Prompt structure for event arguments extraction. Results and Analysis Table 5 and 6 show the results of event detection and event arguments extraction on SSUIE 1.0 dataset, respectively. Table 5. Event Detection Results on SSUIE 1.0 Model

Precision

Recall

F1

ERNIE-3.0

63.03

67.27

65.08

BERT

92.31

90.49

91.39

As shown in Table 5, it can be found that the semantic classification model is significantly better than trigger extraction (sequence labeling) model on SSUIE 1.0 on event detection task. We imply that this is due to the large discrepancy in triggers annotations in SSUIE 1.0. For example, for the “Launch” events, the triggers annotated in different sentences may include “发射 (launch)”, “发射升空 (launch)”, “起飞 (fly)” and so on,

SSUIE 1.0: A Dataset for Chinese Space Science

233

Table 6. Event Arguments Extraction Results on SSUIE 1.0 Model

Precision

Recall

F1

ERNIE-3.0

61.88

69.12

65.3

PAIE

60.41

58.67

59.53

which all denote “launch”. Comparing with the sequence labeling model, which needs to classify each token, the semantic classification model only needs to discriminate and determine the sentence as a whole, leading to a better performance. As shown in Table 6, it can be found that the performance of ERNIE-3.0 has a significant advantage compared with PAIE, we believe that this is due to the good adaptability of ERNIE-3.0 on long entities. As shown in Fig. 5, SSUIE 1.0 comprises a high proportion of large length arguments, models with good adaptability to large length entities\arguments is more likely to obtain good performance on SSUIE 1.0.

5 Conclusion In this paper, we construct a dataset SSUIE 1.0 of high quality, for the information extraction task in the Chinese space science and utilization field. We adopt the most recent state-of-the-art frameworks and pre-trained language models for the named entity, relation, and event extraction in the domain. Experiment results show that even the most advanced models still have large space to improve on the SSUIE 1.0 dataset. Based on our analysis, we suggest that future research is needed for long length entities and overlapping triplets. Acknowledgements. This work was supported by the National Defense Science and Technology Key Laboratory Fund Project of the Chinese Academy of Sciences: Space Science and Application of Big Data Knowledge Graph Construction and Intelligent Application Research and Manned Space Engineering Project: Research on Technology and Method of Engineering Big Data Knowledge Mining.

References 1. Huang, Z., Xu, W., Yu, K.: Bidirectional lstm-crf models for sequence tagging. arxiv:1508. 01991 (2015) 2. Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based lstm-crf with radical-level features for chinese named entity recognition. In: Lin, C.Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) Natural Language Understanding and Intelligent Applications, pp. 239–250. Springer International Publishing, Cham (2016) 3. Yan, H., Deng, B., Li, X., Qiu, X.: Tener: adapting transformer encoder for named entity recognition. arxiv:1911.04474 (2019) 4. Zhu, E., Li, J.: Boundary smoothing for named entity recognition. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 7096–7108. https://aclanthology.org/2022.acl-long.490

234

Y. Liu et al.

5. Yue, Z., Yang, J.: Chinese NER Using Lattice LSTM. arxiv:1805.02023 (2018) 6. Liu, W., Xu, T., Xu, Q., Song, J., Zu, Y.: An encoding strategy based word-character LSTM for Chinese NER. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019) 7. Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y.G., Huang, X.: Cnn-based chinese ner with lexicon rethinking. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. pp. 4982–4988. https://doi.org/10.24963/ijcai.2019/692 8. Sui, D., Chen, Y., Liu, K., Zhao, J., Liu, S.: Leverage lexical knowledge for Chinese named entity recognition via collaborative graph network. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3830–3840, Hong Kong, China (2019) 9. Peng, M., Ma, R., Zhang, Q., Huang, X.: Simplify the usage of lexicon in Chinese NER. Annual Meeting of the Association for Computational Linguistics (2019) 10. Li, X., Yan, H., Qiu, X., Huang, X.: FLAT: Chinese NER using flat-lattice transformer. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6836–6842. https://doi.org/10.18653/v1/2020.acl-main.611 11. Wu, S., Song, X., Feng, Z., Wu, X.: Nflat: Non-flat-lattice transformer for chinese named entity recognition. arXiv:2205.05832 (2022) 12. Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter. arxiv:2105.07148 (2021) 13. Li, J., et al.: Unified named entity recognition as word-word relation classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no 10, pp. 10965–10973 (2022) 14. Yan, X., Lili, M., Ge, L., et al.: Classifying relations via long short term memory networks along shortest dependency paths. In: Proceedings of the. Conference on Empirical Methods in Natural Language Processing, vol. 2015, pp. 1785–1794 (2015) 15. Zexuan, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. arxiv: 2010.12812 (2020) 16. Wei, Z., Su, J., Wang, Y., Tian, Y., Chang, Y.: A novel cascade binary tagging framework for relational triple extraction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1476–1488, Online. Association for Computational Linguistics (2020) 17. Wang, Y., Yu, B., Zhang, Y., Liu, T., Zhu, H., Sun, L.: TPLinker: single-stage joint extraction of entities and relations through token pair linking. In: International Conference on Computational Linguistics (2020) 18. Yan, Z., Zhang, C., Fu, J., Zhang, Q., Wei, Z.: A partition filter network for joint entity and relation extraction. In: Conference on Empirical Methods in Natural Language Processing (2021) 19. Ji, H., Grishman, R.: Refining event extraction through cross-document inference. In: Proceedings of ACL-08: HLT, pp. 254–262, Columbus, Ohio. Association for Computational Linguistics (2008) 20. Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 21. Wang, Z., et al.: CLEVE: contrastive pre-training for event extraction. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6283–6297, Online. Association for Computational Linguistics (2021) 22. Wei, K., Sun, X., Zhang, Z., Zhang, J., Zhi, G., Jin, L.: Trigger is not sufficient: Exploiting frame-aware knowledge for implicit event argument extraction. In: Proceedings of the 59th

SSUIE 1.0: A Dataset for Chinese Space Science

235

Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1, pp. 4672–4682, Online. Association for Computational Linguistics (2021) 23. Sun Y, Wang S, Feng S, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137 (2021) 24. Li, S., Ji, H., Han, J.: Document-level event argument extraction by conditional generation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 894–908, Online. Association for Computational Linguistics (2021) 25. Ma, Y., Wang, Z., Caom Y., et al.: Prompt for extraction? paie: Prompting argument interaction for event argument extraction. arXiv preprint arXiv:2202.12109 (2022)

Extract Then Adjust: A Two-Stage Approach for Automatic Term Extraction Jiangyu Wang1,2 , Chong Feng1,2(B) , Fang Liu3 , Xinyan Li4 , and Xiaomei Wang5 1

5

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China {jywang,fengchong}@bit.edu.cn 2 Southeast Academy of Information Technology, Beijing Institute of Technology, Beijing, China 3 School of Foreign Languages, Beijing Institute of Technology, Beijing, China [email protected] 4 China North Vehicle Research Institute, Beijing, China Institute of Science and Development, Chinese Academy of Sciences, Beijing, China [email protected] Abstract. Automatic Term Extraction (ATE) is a fundamental natural language processing task that extracts relevant terms from domainspecific texts. Existing transformer-based approaches have indeed achieved impressive improvement. However, we observe that even stateof-the-art (SOTA) extractors suffer from boundary errors, which are distinguished by incorrect start or end positions of a candidate term. The minor differences between candidate terms and ground-truth leads to a noticeable performance decline. To alleviate the boundary errors, we propose a two-stage extraction approach. First, we design a spanbased extractor to provide high-quality candidate terms. Subsequently, we adjust the boundaries of these candidate terms to enhance performance. Experiment results show that our approach effectively identifies and corrects boundary errors in candidate terms, thereby exceeding the performance of previous state-of-the-art models. Keywords: automatic term extraction extraction

1

· boundary adjust · span

Introduction

Automatic Term Extraction (ATE) aims at extracting terminological units from domain-specific corpus [2]. ATE plays an important role in gaining a better understanding of specialized domains knowledge and can also enhance several downstream natural language processing (NLP) tasks, including information retrieval [32], machine translation [7], text summarization [15], etc. Overall, ATE can enhance the efficiency and accuracy of NLP tasks. ATE has been extensively researched over the years, leading to the development of a wide variety of innovative techniques and methods. An ATE system usually follows a two-step procedure, which involves (1) extracting a list c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 236–247, 2023. https://doi.org/10.1007/978-3-031-44696-2_19

Extract Then Adjust: A Two-Stage Approach for ATE

237

Fig. 1. Boundary errors from current SOTA model [17]. The words in green indicate the ground-truth terms and the words in orange indicate extracted terms. (Color figure online)

of candidate terms, and (2) determining which candidate terms are correct using supervised or unsupervised techniques [27]. Early approaches relied on linguistic knowledge and extensive feature engineering, which required considerable human effort to identify relevant features and create suitable rules. Later, several statistical-based approaches were proposed, among which C-value measure [8] is most widely used. With the progress of deep learning techniques, Transformer-based [30] approaches have become popular and achieved state-ofthe-art (SOTA) performances. Recent researches mostly formulate ATE as a sequence labeling task, which assigning a label to each token in a sequence of inputs. This approach does not extract lists of candidate terms, but rather classifies each token in a sentence as (part of) a term or not. There is no doubt that existing sequence taggers have achieved significant improvement on ATE. However, we observe that even SOTA sequence taggers suffer from boundary errors. Boundary refers to the start or end position of a term in a sentence. The occurrence of boundary errors is due to the semantic continuity between boundary words and their adjacent words, which confuses the model’s judgment. Figure 1 displays three types of boundary error observed in extraction results of current SOTA model, respectively start boundary error, end boundary error and both boundary error. It should be noted that the boundaries of extracted term may not only be within the boundaries of ground-truth term (E2, E3), but also outside the boundaries of the ground-truth (E1). Inspired by traditional two-step extraction approach and the need to alleviate boundary errors, we propose a two-stage approach to boost the ATE systems performance. In the first stage, we design a span-based extractor which aims at extracting high quality candidate terms. In the second stage, we adjust the boundaries of candidate terms to gain a better performance. Instead of training a new module from scratch, we initialize boundary adjust module with parameters of span extractor. We collect the training data for boundary adjusting during the first training process of the extractor. In summary, our contributions in this paper are as follows: – We propose a novel two-stage approach for ATE which extracting terms first and adjusting their boundaries later. – To the best of our knowledge, we are the first to apply span-based method to ATE and achieve very competitive performance. – We reuse the training data to conduct a dataset for boundary adjusting.

238

J. Wang et al.

– Our approach achieves state-of-the-art performance consistently on three languages and outperforms the current SOTA models on F1-score by +2.0% on English, +2.8% on French and +0.4% on Dutch.

2

Related Works

The research of ATE dates back the 1990s, following a two-step procedure: (1) extracting a list of candidate terms, and (2) classifying which of the candidate terms are correct. [9] categorized ATE methodologies into five types: rule-based, statistical, hybrid, machine-learning based and deep-learning based. Briefly, rule-based approaches [26] heavily rely on lexical information from linguistic preprocessing, i.e. chunking and POS tagging, and handcrafted rules from linguistics experts. Statistical approaches rank term-like units according to various of statistical measures, including unithood [5], termhood [31] and Cvalue [8]. The hybrid approaches utilize both lexical and statistical information. Recent advances in machine learning and deep learning make the above three approaches fall behind. Thus, we focus on approaches based on machine learning and deep learning. 2.1

Machine Learning Approaches

Most machine learning approaches to ATE comply with the following three steps: preprocessing input texts, feature engineering and training a classifier. In the preprocessing step, several NLP tools (e.g. sentence segmentation, tokenization, lemmatization, chunking, and POS tagging) are employed for further feature engineering. Candidate terms are then extracted based on predefined POS patterns. In the feature engineering step, we select, transform, and combine the available features (e.g. shape features, linguistic features, frequency features and statistical features), as well as, create new features to make them more informative and relevant to the ATE task. In the final steps, the features are fed into different machine learning classifiers to classify these candidate terms as either terms or not terms. [33] selected ten common statistical and linguistic features and fed them into several machine learning classifiers to classify the 1–5 grams candidates, specifically Random Forest, Linear Support Vector Machine, Multinomial Naive Bayes classifier, Logistic Regression, and SGD classifiers. HAMLET [24] is a novel hybrid machine learning approach. It tested a a relatively wide range of classifiers and calculated up to 177 features from six feature groups. The best average f1-scores were obtained by Random Forest classifier. NMF [22] combined the probabilistic topic modelling and non-negative matrix factorization. They run a series of experiments with five different NMF algorithms and four different NMF initializations, and found the optimal combinations which outperformed 4 from 6 baseline methods and second only to deep learning methods.

Extract Then Adjust: A Two-Stage Approach for ATE

2.2

239

Deep Learning Approaches

There are two kinds of deep learning approaches to ATE: leveraging word embeddings or employing a deep neural network as an end-to-end classifier. [1] was the first successful attempt to apply word embeddings for unigram term extraction. They leveraged the local-global embeddings to represent a term. Specifically, they used the pre-built GloVe embeddings as gloal vectors and CBOW/Skip-Gram embeddings trained on the domian corpus as local vectors. The concatenated vectors of candidate terms extracted by other tools were fed into a Multi-Layer Perceptron to make the final decision. SemRe-Rank [34] was a generic method to enhance existing ATE systems. Thye incorporated Word2Vec embeddings into a personalised PageRank process to build semantic relatedness graphs of words, then computing the semantic importance scores, which were used to revise the scores of term candidates computed by the base ATE system. TALN-LS2N [11] was the first BERT-based approach which outperformed all previous approaches based on feature engineering. The input consisted of a sentence and a text span generated from all possible n-grams of the sentence. They utilized different versions of BERT (RoBERTa for English texts and CamemBERT for French texts) as binary classifiers to determine whether the n-garm was a term or not. Recent neural approaches consider ATE as a sequence labeling task, which assigning a label to each token in a sequence of inputs. It was first introduced by [13], who used a non-neural conditional random field (CRF) classifier. [16] applied recurrent neural networks (RNN) to label sequences with BILOU (begin, inside, last, unit, outside) scheme. [10] proposed the CNN-BiLSTM-CRF model to minimize the influence of different word segmentation results for Chinese term extraction. [17] took advantage of the multilingual pretrained language model XLM-RoBERTa (XLM-R) [4], which is the SOTA model for ATE in several languages. They also explored sequence generation method by using a multilingual sequence-to-sequence model named mBART [20].

3

Method

In this section, we introduce our proposed methods in detail. Figure 2 gives a brief illustration, which operates in two stages as follows: (1) we first train a span-based extractor to extract candidate terms, (2) then adjust boundaries of candidate terms. 3.1

Span Extractor

Given an input sentence X = {x1 , x2 , ..., xn } where xi denotes the i-th word, Span Extractor aims at predicting start position and end position of each term span in X. Our Span Extractor use the powerful multilingual pretrained language model XLM-R as encoder. We simply feed the sentence X to XLM-R and produce

240

J. Wang et al.

Fig. 2. The overall architecture of our approach

contextualized representation matrix H ∈ Rn×d , where n is the length of input sentence and d is the dimension of the hidden layer of XLM-R. To extract term spans, the representation of each word is fed to two linear layers and followed by softmax function to predict the probability of each word being the start or end position of a term. Pstart (xi ) = sof tmax(Ws hi + bs )

(1)

Pend (xi ) = sof tmax(We hi + be )

(2)

where Ws , We ∈ R and bs , be ∈ R are trainable parameters. Thereafter, we obtain the set of predicted start indexes and end indexes by applying argmax to each Pstart and Pend . 2×d

2

I start = {i|argmax(Pstart (xi )) = 1, i = 1, ..., n}

(3)

I end = {i|argmax(Pend (xi )) = 1, i = 1, ..., n}

(4)

There could be multiple terms in the input sentence, which means multiple start indexes and end indexes will be predicted by the model. We adopt a simple heuristic start-end matching strategy by matching the start index with its nearest end index, since there are no nested terms. At training time, each input X corresponds to two label sequences Ystart and Yend , where yis (yie ) ∈ {0, 1} is the ground-truth label of each word xi being the start(end) index of a term or not. We calculate cross entropy loss for start and end index predictions: Lstart = −

1 s y log(Pstart (yis |xi )) n i=1 i

(5)

Lend = −

1 e y log(Pend (yie |xi )) n i=1 i

(6)

n

n

And the overall training objective to be minimized is as follows: L = Lstart + Lend

(7)

Extract Then Adjust: A Two-Stage Approach for ATE

3.2

241

Boundary Adjust Module

This module aims at adjusting boundaries of candidate terms. Assume a candidate term is T = {t1 , t2 , ..., tm } and its corresponding source sentence is S = {s1 , s2 , ..., sn } , we concatenate T and S with special token [/s] to form the input sequence C = {[/s], t1 , t2 , ..., tm , [/s], s1 , s2 , ..., sn , [/s]}. For simplicity and consistency, we use the same neural architecture as span extractor. Instead of training this module from scratch, we initialize it with the span extractor’s parameters for a good starting point. The boundary adjusting procedure is similar to span extraction, the only difference is that the input contains candidate term, which can be regarded as a question-answering task [19] to some extent. The multi-head self-attention mechanism in XLM-R enables the model to perceive and assign higher weights to the candidate term and its surrounding context in the source sentence, according to the informative prior knowledge in the candidate term. This property, we believe, allows the model to keep the boundaries of T unchanged if T consistent with the ground-truth during decoding, otherwise adjust the boundaries around T . By feeding the combined sequence C to XLM-R, we obtain the contextualized representation matrix E ∈ R(m+n+4)×d . The following prediction procedure is the same as that mentioned in Sect. 3.1. Given the ground-truth label Ystart and Yend of length n, which indicate whether each word si in the source sentence is the beginning/end of a term, we calculate cross entropy loss for start and end index predictions separately (see Eq. 5, 6). There is a lack of training data for boundary adjust module, because we do not have pre-existing boundary-misspecified examples. An intuitive idea is to construct a synthetic dataset by artificially modifying boundaries of groundtruth terms. However, the man-made errors may be inconsistent with errors generated by the extractor. We adopt a simple yet effective approach to construct the dataset by letting the extractor make predictions on its training data. We define positive examples as text spans which are aligned with groundtruth terms. We define negative examples as text spans which have overlaps with the ground-truth terms, as shown in Table 1. Text spans that do not overlap with ground-truth terms are discarded. After every training epoch, we iterate through outputs predicted by the extractor and only collect the negative examples which meet one of the three boundary errors. The positive examples correspond to all ground-truth terms. We carry out the collection process after the first 3 epochs.

4 4.1

Experiment Setup Datasets

We evaluate our proposed methods on the ACTER dataset [23], which was used in the TermEval 2020 shared task [25]. It was the first systematically annotated corpus that covered multiple languages and domains. The ACTER dataset consists of four domains (corruption, wind energy, equitation and heart failure) in three languages (English (en), French (fr) and Dutch (nl)). As of yet, ACTER

242

J. Wang et al. Table 1. Number of sentences and unique terms in the ACTER dataset Lang.

Corruption Wind energy Equitation Heart failure Sent. Terms Sent. Terms Sent. Terms Sent. Terms

English 2022 1173

6638 1531

3090 1575

2432 2581

French

1977 1207

4770

968

2809 1181

2177 2374

Dutch

1988 1295

3356 1245

3669 1544

2880 2254

is the most widely used benchmark for evaluating and comparing ATE systems in both monolingual and multilingual settings. 4.2

Baselines

We choose three kinds of baselines: non-neural models, monolingual models and multilingual models. • Non-neural models: HAMLET [24] is a novel hybrid machine learning approach relied on a total of 177 features. NMF [22] combined the probabilistic topic modelling and non-negative matrix factorization. • Monolingual models: TALN-LS2N [11] is the winner on the English and French corpus in TermEval 2022 shared task. NLPLab UQAM [18] is the winning model on the Dutch corpus. [12] used BERT [14] as a binary classifier for English and CamemBERT [21] for French. The input consisted of a sentence and a candidate term generated from all possible n-grams of the sentence. [29] used BERTje [6], a Dutch pre-trained BERT model, as a token classifier for Dutch. The token classifier assigns a label to each token. • Multilingual models: mBERT is the multilingual version of BERT, which pretrained on the 104 languages with the largest Wikipedias. mBART [20] is the multilingual version of BART, a sequence-to-sequence denoising auto-encoder. The model generates terms directly from the input sentence. InfoXLM [3] is an information-theoretic multilingual language model. XLM-R is the current state-of-the-art model. [17] used XLM-R as a token classifier. They noticed that the model had remarkable cross-lingual transfer learning capabilities, i.e., training on one language and showing strong test scores on another. Therefore, we also report its cross-lingual results.

Extract Then Adjust: A Two-Stage Approach for ATE

243

Table 2. Results on the ACTER dataset. BA denotes the boundary adjust module. The best F1-scores are in bold and the best baselines are underlined.

4.3

Models

English P R

HAMLET NMF TALN-LS2N NLPLab UQAM BERTbinary CamemBERTbinary BERTje mBERT mBART InfoXLM XLM-R XLM-Rcross−lingual

55.6 30.5 34.8 21.5 36.3 – – 62.1 45.7 61.2 54.9 55.3

Ours(w/o BA) Ours(w/ BA)

58.0 58.9 58.4 67.5 51.7 58.6 65.0 73.3 68.9 61.0 59.6 60.3 70.2 53.6 60.8 66.8 74.0 70.2

37.0 37.7 70.9 15.6 72.2 – – 49.4 63.5 54.5 62.2 61.8

F1

French P R

44.4 33.7 46.7 18.1 48.2 – – 55.0 53.2 57.6 58.3 58.3

60.0 27.5 45.2 16.1 – 40.1 – 69.4 52.7 71.1 68.7 65.4

50.2 34.7 51.6 11.2 – 70.5 – 49.0 59.6 48.9 43.0 51.4

F1

Dutch P R

F1

54.7 30.7 48.2 13.2 – 51.1 – 57.4 55.9 58.0 52.9 57.6

62.9 23.4 – 18.9 – – 67.6 72.3 60.6 73.5 71.4 67.9

49.9 30.3 – 18.6 – – 66.8 67.7 65.2 68.6 69.6 69.8

41.4 42.6 – 19.3 – – 66.0 63.7 70.7 64.2 67.8 71.7

Evaluation Metrics

We evaluate all ATE systems by comparing the list of candidate terms predicted on the whole test set with the manually annotated ground-truth term list of test set. We report the Precision (P), Recall (R) and F1-score (F1) using strict matching. TermEval 2020 and previous researches [12,17,28,29] have employed these evaluation metrics as well. 4.4

Training Details

Follow the setting of TermEval 2020, we use two domains (corruption and wind energy) for training, equitation domain for validating and heart failure domain for testing. Our models are trained on the latest 1.5 version of ACTER1 . Despite previous researches employed version 1.4 or earlier of ACTER, the actual annotations remain largely unchanged, but major update to how the annotations are presented. We train all modules using AdamW with the learning rate of 2e−5 and the batch size of 16. The span extractor is trained for 12 epochs and the boundary adjust module is trained for 5 epochs. We choose the span extractor with best validation score to produce candidate terms. We use the trained boundary adjust module, after 5 epochs, to adjust boundaries of candidate terms. All experiments are run on a single NVIDIA A100 GPU. 1

https://github.com/AylaRT/ACTER.

244

J. Wang et al.

Table 3. Ablation study. Extr. denotes extracted terms and Corr. denotes correct terms in extracted terms. Δ denotes the increase/decrease after enhancing span extractor with boundary adjust module L

Models

French Extr. Corr. F1

Dutch Extr. Corr. F1

1

SE 1003 711 SE+BA 1025 720 Δ +22 +9

65.4 65.6

1012 777 1024 783

67.0 67.1

1838 1307 1859 1313

74.2 74.1

+0.2

+12

+6

+0.1

+21

+6

−0.1

SE 860 SE+BA 870 Δ +10

475 484

57.2 57.9

505 525

326 345

58.0 60.3

485 491

269 274

61.4 62.1

+9

+0.7

+20

+19

+2.3

+6

+5

+0.7

SE 492 SE+BA 450 Δ −42

247 248

56.8 60.0

169 171

98 111

45.1 50.8

112 97

53 57

48.2 55.6

+1

+3.2

+2

+13

+5.7

−15

+4

+7.4

4

SE 139 SE+BA 125 Δ −14

67 68

47.7 50.9

60 55

19 23

25.2 31.5

49 31

19 19

45.2 57.6

+1

+3.2

−5

+4

+6.3

-18

0

+12.4

5

SE 46 SE+BA 27 Δ −19

12 11

27.9 32.8

24 17

4 6

11.6 19.4

15 3

1 1

6.9 11.8

−1

+4.9

−7

+2

+7.8

−12

0

+4.9

6

SE 18 SE+BA 12 Δ −6

6 7

24.5 32.6

9 6

1 1

6.5 7.1

12 3

2 2

22.2 44.4

2

3

English Extr. Corr. F1

≥ 7 SE 63 SE+BA 12 Δ −51

5 5.1

+1

+8.1

−3

0

+0.6

−9

0

+22.2

1 0

2.4 0

41 14

3 3

9.7 17.1

29 11

1 1

4.8 8.3

−1

−2.4

−27

0

+7.4

−18

0

+3.5

Results and Analysis Overall Results

Table 2 illustrates the comparison results of our proposed models and the baselines on ACTER. We can observe that: 1) Our model outperforms the current SOTA models consistently on three languages. Specifically, our full model with boundary adjust module achieves up to 2.0%, 2.8% and 0.4% improvement on English, French and Dutch respectively. 2) Our span extractor also achieves very competitive performance, which advance the XLM-R token classifier by 0.1% and 1% on English and French. 3) The boundary adjust module can further achieve 1.9%, 2.2% and 1.3% gains for the span extraction on three languages. It demonstrates that our extract-then-adjust approach can effectively alleviate boundary errors and bring remarkable improvements on ATE task.

Extract Then Adjust: A Two-Stage Approach for ATE

245

Table 4. Case study. The words in bold with brackets indicate ground-truth terms. indicates the extracted term is wrong. None indicates there is no output from BA module Examples

Span Extractor

+ Boundary Adjust

E1. [Hemodynamic] phenotype of the Hemodynamic failing [Fontan] in an adult population Fontan

Hemodynamic Fontan

E2. [ST2] did not add significantly to reclassification of risk as assessed by changes in the [C statistic]

ST2 reclassification

ST2 None

E3. The exercise with [testosterone] group showed improvements from baseline in [peak oxygen uptake]

testosterone peak

testosterone peak oxygen uptake

E4. Within-day test-retest reliability of [Timed Up & Go] test in patients with advanced [chronic organ failure].

Timed Up & Go Timed Up & Go test in patients chronic organ failure chronic organ failure

5.2

Ablation Study

We conduct the ablation study to investigate the impact of boundary adjust module on terms of different lengths, hence we divide the test set into seven groups according to terms’ length. We report the number of extracted terms, the number of correct terms in extracted terms and F1-scores before and after boundary adjusting. The results are shown in Table 3. We can find that the boundary adjust module improves F1-scores and Correct on all length of terms. It is observed that Extracted terms increases for short terms (L ≤ 3) but decreases for long terms (L ≥ 4). We believe it is because some long candidate terms contain shorter correct terms, the boundary adjust module correct the wrong candidate terms and output the shorter correct terms. The case study in Sect. 5.3 also confirms our viewpoint. 5.3

Case Study

To make clear the procedure of boundary adjusting, we select several examples from the English test set. Table 4 shows the candidate terms from span extractor and outputs after boundary adjusting. E1 illustrates that boundary adjust module has the ability to keep boundaries unchanged if given a correct term. E2 is the case that span extractor extracts redundant and wrong term. BA module will not adjust boundaries, but give an empty output, which deemed to discard the candidate term. Moreover, BA module can do nothing if the extractor missed the correct term [C statistic]. E3 and E4 represent that BA module corrects candidate terms by extending or narrowing the boundaries.

246

6

J. Wang et al.

Conclusion

This paper proposes a novel two-stage approach for automatic term extraction, which treats the extracted text spans as candidate terms and designs a novel boundary adjust module to correct boundary errors. Experiments illustrate that our span-based extractor is a strong baseline and our boundary adjust module can effectively mitigate boundary errors to improve the performance of span extractor. In the future, we will dig deeper into boundary errors and try to design a unified framework to address extraction and adjusting jointly. Acknowledgement.. This work is supported by the Institute of Science and Development, Chinese Academy of Sciences (GHJ-ZLZX-2023-04). We would like to thank the anonymous reviewers for their thoughtful and constructive comments.

References 1. Amjadian, E., Inkpen, D., Paribakht, T., Faez, F.: Local-global vectors to improve unigram terminology extraction. In COMPUTERM, pp. 2–11 (2016) 2. Castellv´ı, M.T.C., Bagot, R.E., Palatresi, J.V.: Automatic term detection. In: Recent Advances in Computational Terminology, vol. 2, p. 53 (2001) 3. Chi, Z., et al.: Infoxlm: an information-theoretic framework for cross-lingual language model pre-training. In: NAACL, pp. 3576–3588 (2021) 4. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL, pp. 8440–8451 (2020) ´ Lang´e, J.M.: Towards automatic extraction of monolingual 5. Daille, B., Gaussier, E., and bilingual terminology. In: COLING (1994) 6. De Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., Nissim, M.: Bertje: a Dutch BERT model. arXiv preprint arXiv:1912.09582 (2019) 7. Dinu, G., Mathur, P., Federico, M., Al-Onaizan, Y.: Training neural machine translation to apply terminology constraints. In: ACL, pp. 3063–3068 (2019) 8. Frantzi, K.T., Ananiadou, S., Tsujii, J.: The c-value/nc-value method of automatic recognition for multi-word terms. In: ECDL, pp. 585–604 (1998) 9. Gao, Y., Yuan, Y.: Feature-less end-to-end nested term extraction. In: NLPCC, pp. 607–616 (2019) 10. Han, X., Xu, L., Qiao, F.: CNN-BiLSTM-CRF model for term extraction in Chinese corpus. In: WISA, pp. 267–274 (2018) 11. Hazem, A., Bouhandi, M., Boudin, F., Daille, B.: Termeval 2020: Taln-ls2n system for automatic term extraction. In: COMPUTERM, pp. 95–100 (2020) 12. Hazem, A., Bouhandi, M., Boudin, F., Daille, B.: Cross-lingual and cross-domain transfer learning for automatic term extraction from low resource data. In: LREC, pp. 648–662 (2022) 13. Judea, A., Sch¨ utze, H., Br¨ ugmann, S.: Unsupervised training set generation for automatic acquisition of technical terminology in patents. In: COLING, pp. 290– 300 (2014) 14. Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019) 15. Koay, J.J., Roustai, A., Dai, X., Burns, D., Kerrigan, A., Liu, F.: How domain terminology affects meeting summarization performance. In: COLING (2020)

Extract Then Adjust: A Two-Stage Approach for ATE

247

16. Kucza, M., Niehues, J., Zenkel, T., Waibel, A., St¨ uker, S.: Term extraction via neural sequence labeling a comparative evaluation of strategies using recurrent neural networks. In: Interspeech, pp. 2072–2076 (2018) 17. Lang, C., Wachowiak, L., Heinisch, B., Gromann, D.: Transforming term extraction: transformer-based approaches to multilingual term extraction across domains. In: ACL-IJCNLP, pp. 3607–3620 (2021) 18. Le, N.T., Sadat, F.: Multilingual automatic term extraction in low-resource domains. In: FLAIRS (2021) 19. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. In: ACL, pp. 5849–5859 (2020) 20. Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. In: TACL, vol. 8, pp. 726–742 (2020) 21. Martin, L., et al.: CamemBERT: a tasty French language model. In: ACL, pp. 7203–7219 (2020) 22. Nugumanova, A., Akhmed-Zaki, D., Mansurova, M., Baiburin, Y., Maulit, A.: NMF-based approach to automatic term extraction. Expert Syst. Appl. 199, 117179 (2022) 23. Rigouts, T.A., Hoste, V., Lefever, E.: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora. Lang. Resourc. Eval. 54(2), 385–418 (2020) 24. Rigouts, T.A., Hoste, V., Lefever, E.: Hamlet: hybrid adaptable machine learning approach to extract terminology. Terminology 27(2), 254–293 (2021) 25. Rigouts, T.A., Hoste, V., Drouin, P., Lefever, E.: TermEval 2020 : shared task on automatic term extraction using the annotated corpora for term extraction research dataset. In: COMPUTERM, pp. 85–94 (2020) 26. Stankovic, R., Krstev, C., Obradovic, I., Lazic, B., Trtovac, A.: Rule-based automatic multi-word term extraction and lemmatization. In: LREC (2016) 27. Tran, H.T.H., Martinc, M., Caporusso, J., Doucet, A., Pollak, S.: The recent advances in automatic term extraction: a survey. arXiv preprint arXiv:2301.06767 (2023) 28. Tran, H.T.H., Martinc, M., Doucet, A., Pollak, S.: Can cross-domain term extraction benefit from cross-lingual transfer? In: DS, pp. 363–378 (2022) 29. Tran, H.T.H., Martinc, M., Pelicon, A., Doucet, A., Pollak, S.: Ensembling transformers for cross-domain automatic term extraction. In: ICADL, pp. 90–100 (2022) 30. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017) 31. Vintar, S.: Bilingual term recognition revisited: The bag-of-equivalents term alignment approach and its evaluation. Terminology 16(2), 141–158 (2010) 32. Yang, L., Ji, D.H., Zhou, G., Yu, N.: Improving retrieval effectiveness by using key terms in top retrieved documents. In: ECIR (2005) 33. Yuan, Y., Gao, J., Zhang, Y.: Supervised learning for robust term extraction. In: IALP pp. 302–305 (2017) 34. Zhang, Z., Gao, J., Ciravegna, F.: Semre-rank: improving automatic term extraction by incorporating semantic relatedness with personalised PageRank. In: TKDD, vol. 12, no. 5, pp. 1–41 (2018)

UniER: A Unified and Efficient Entity-Relation Extraction Method with Single-Table Modeling Sha Liu, Dongsheng Wang(B) , Yue Feng, Miaomiao Zhou, and Xuewen Zhu Jiangsu University of Science and Technology, Zhenjiang 212000, Jiangsu, China jsjxy [email protected]

Abstract. Joint entity and relation extraction are crucial tasks in natural language processing and knowledge graph construction. However, existing methods face several challenges. Firstly, they lack effective interaction modeling between entities and entity relations. Secondly, they often overlook the correlation between entities and relations. Lastly, the utilization of complex models results in increased memory usage. To tackle these challenges, we propose a unified approach that utilizes a homogeneous representation to model the interaction between subject and object entities, as well as entity relations, through a novel method of utilizing a single two-dimensional table for joint extraction tasks. This approach effectively handles multiple token entities, leading to faster and more efficient extraction. Our proposed approach achieves performance comparable to state-of-the-art models on two widely-used datasets and outperforms existing methods on the SEO and HTO problems, which are particularly challenging in relation extraction. It exhibits significant advantages in terms of both speed and space consumption. Our code and models are available at https://anonymous.4open.science/r/UniER. Keywords: Entity recognition Graph

1

· Relational extraction · Knowledge

Introduction

Extracting information from structured or unstructured data is crucial in natural language processing and related fields, such as knowledge graph construction, information retrieval, sentiment analysis, and text mining. Entity and relation joint extraction is an important process, aiming to identify entities and their corresponding relationships from unstructured text to obtain knowledge triplets, represented as . In previous works on entity relation extraction [1–4], the relation extraction problem can be classified into four types, as illustrated in Fig. 1: Normal, Single Entity Overlap (SEO), Entity Pair Overlap (EPO), and Head Tail Overlap (HTO) problems, each with distinct nesting, overlapping, or multiple relationships between entities. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 248–259, 2023. https://doi.org/10.1007/978-3-031-44696-2_20

UniER

249

Early joint extraction methods used pipeline approaches, treating entity extraction and relation classification as independent modules [5]. However, this approach neglected the interaction between entities and relationships, and errors in entity extraction could cascade into the relationship extraction module, leading to performance degradation [6]. Many researchers have proposed joint models for entity extraction and relation classification to address this through parameter sharing or joint decoding. For example, TPLinker [3] models the extraction as a linking problem between token pairs to extract subject and object entities and avoid exposure bias. However, these methods did not consider entity-entity interaction or entity-relation interaction and incurred high time and space costs.

Fig. 1. Classification of extraction problems, entities are shown in bold, and significant parts are shown in the same color. (Color figure online)

This paper presents UniER, a unified and efficient entity-relation extraction method with single-table modeling. Our approach involves encoding entities and relationships into meaningful sequence embeddings to build unified semantic representations. Candidate relationships are converted into one semantic word and formed into a continuous sequence with input sentences. We use pretrained language models [7] to encode the sequences and capture their information relevance intrinsically. By designing a single two-dimensional table, we simultaneously model the interaction between the subject and object entities and the inter-action between the subject and object entities and relations. This approach eliminates the need to limit the number of token elements in entities, improving computational efficiency and reducing the time and space costs of the model. We evaluated our approach on two widely used public datasets, NYT [8] and WebNLG [9]. We performed better than the current state-of-the-art extraction model UniRel [2] in some datasets. Our contributions are summarized as follows: – We propose a novel homogenized interaction modeling method that extracts entities and their relationships using a single two-dimensional table, reducing the need for complex models and saving time and space costs.

250

S. Liu et al.

– A new two-dimensional table modeling method for joint extraction tasks can handle multiple token elements in entities, resulting in faster and more efficient extraction. Our experiments show that our approach achieves comparable overall extraction performance to the current state-of-the-art model while demonstrating significant improvements in challenge tasks such as the SEO and HTO problems.

2

Related Work

Previous joint extraction methods involved separating the process into different steps, each with its objective function. For example, Miwa and Bansal [10] extracted entity pairs for entity recognition and used a dependency tree to extract the miniature tree that covers the entity pairs, generating vectors, and then completed relationship classification based on the vectors. Katiyar and Cardie [11] used a pointer network decoding method to obtain entity pairs of the current relationship. Zheng et al. [12] labeled entities with relationship tags but did not consider the problem of relationship overlap, resulting in high labeling complexity and errors. Recent joint extraction models have been developed to address spreading errors and ignoring the continuity between entities and relationships. CasRel [13] views relationships as mappings between entities in different sentences and uses layered pointer labeling to solve the problem of overlapping triples. TPLinker [3] models joint extraction as a token pair linking problem, extracting the subject entity and object entity simultaneously, and aligning the boundary labels of entity pairs under each relationship type. OneRel [1] treats joint extraction as a fine-grained triple classification problem, using a score-based classifier and a relation-specific role-tagging strategy. SPN [14] uses a non-autoregressive encoder to transform the triple extraction into a set prediction problem and designs a loss function that ignores the order. UniRel [2] unifies entity and relationship representation to achieve homogeneous modeling and uses interaction graphs to optimize efficiency and eliminate redundant relationship tables. The model extracts all triples with one binary interaction graph for token entities and three graphs for multi-token entities.

3 3.1

Methodology Task Definition

The goal of joint extraction is to identify all possible triplets T = L [(subl , rl , obj l )]l=1 from a given sentence s = {w1 , w2 , . . . , wn }, where L represents the number of triplets, each consisting of a subject entity subl , a relation rl and an object entity obj l . The subject and object entities are made up of one or more contiguous tokens, and the relation rl is drawn from a set of M kinds of relations, R = {r1 , r2 , . . . , rM }.

UniER

3.2

251

Unified Representation

This paper draws inspiration from UniRel [2] to represent relationships in natural language text to match the input sentence. Relationships are introduced using their actual semantic tokens, such as “founders” to represent the relationship “business/company/founders”. This transformation is performed using a tokenizer with manually selected words. The sentence and relationship identifiers are then concatenated and input as a sequence to the pretrained language model encoder based on a self-attention mechanism called Transformer [15]. We choose BERT [7] as the pretrained language model, which generates a corresponding token id for each token element. Finally, the concatenated token id sequence is input to the BERT model as token embeddings. T = Contact(Ts , Tr )

(1)

H = E[T ]

(2)

In this context, H ∈ R(N +M )×Dh represents the input vector to BERT, Dh represents the hidden dimension, Ts and Tr represent the token id sequences corresponding to the sentence and relationship words, respectively, and E is the embedding table in BERT. The self-attention mechanism of the Transformer captures the relationships between each input word and other words. The selfattention mechanism is composed of multiple layers of attention heads, where each attention head performs three linear transformations on the input embedding to obtain vectors Q, K, V . The Softmax function is then used to calculate the attention weights between each word and other words. These weights are used to fuse the Value vectors, as shown in Eq. (3). This process helps the encoder to better understand the relationships in the input text, thereby improving the performance of the model.   QK T V (3) Attention (Q, K, V ) = softmax √ dh The Q, K, V vectors represent the input, reference, and output vectors, respectively. Among the vectors present in H, the Q, K, V correspond to Query, Key, and V alue, respectively. With this model design, the relationships and entities are not heterogeneously split and merged, but rather homogeneously input into the model. This way allows the BERT model to effectively model the three types of interactions in the text and capture the relationships between the input sequence using a self-attention mechanism. This model design has high expressiveness and generalization ability. 3.3

Interactive Table Modeling

This paper introduces an interactive modeling table inspired by UniRel’s Interaction Map, which was originally designed for single-token entities. In contrast to UniRel, our interactive table can accommodate entities with any number of

252

S. Liu et al.

multipletoken elements. Figure 2 shows concatenated sentences and their corresponding relationships in bold, encoded, and inputted into the model. The interactive table is learned from the attention map in the 12th layer of the BERT en-coder and consists of three different colored parts. The green rectangle represents the subject-object entity interaction, the yellow rectangle represents the subject-relation interaction, and the blue rectangle represents the object entityrelation interaction. The interactive table allows direct extraction of relationship triples. Entity Boundary. Previous studies [1] have employed separate tables for each relationship to distinguish entity boundaries. However, in natural language texts, entities that belong to the same relationship are often separated by intervening text and not located adjacently. To confirm this, we conducted a study on the NYT [8] and WebNLG [9] datasets and found that only one validation sample out of 5000 in the NYT dataset violated this assumption. In contrast, the WebNLG [9] dataset had no such instances. Thus, we relied on the natural language characteristics of the text and did not explicitly address entity boundaries. Instead, we used the interaction definition described below to label the Interaction Table, as shown in Fig. 2. The “0” in the unlabeled areas served as implicit entity boundaries, indicating the end of an entity when searching for entities. Adjacent “1”s were considered as part of the same entity. Further details on the decoding process will be provided in the subsequent section.

Fig. 2. We concatenated the input sentence and relation texts (bold) and learned an Interaction Table from the attention table in the 12th layer of the BERT Encoder. The table has three sections (green for Subject-Object, yellow for Subject-Relation, and blue for Object-Relation Interaction) from which we extracted relational triples. (Color figure online)

UniER

253

Subject-Object Interaction. The interaction between the subject and object entities involves pairs that can create valid relationship triples. In a sentence, subject-object pairs exist only when they form a valid relationship triple, denoted by a green rectangle in the interactive table. These pairs comprise single or multiple consecutive word elements in the sentence. This paper defines the subjectobject interaction flag function in formula (4).  T rue, (subi, r, obj) ∈ T, ∃r ∈ R (4) Is−o (subi , obj i ) = F alse, otherwise For example, in Fig. 2, the subject-object pair in the triple is (Lebron James, James), with an interaction value of 1 in the second column representing James and the first and second rows representing Lebron James. Similarly, the subject-object pair in the triple is (Lebron James, United States), and the value at the intersection of the “United States” column and the “Lebron James” row is 1. Notably, inter-sections with a value of 1 create an L shape to handle possible overlapping situations. Subject-Relation Interaction. The subject-relation interaction identifies the subject entities corresponding to each relationship. Given a relationship r, when a triplet exists with the sub as the subject and r as the relationship, the subject entity sub is considered to interact with r. The interaction table represents this by a yellow rectangle in the upper right corner. The relationships are displayed in bold as columns, and each relationship is represented as a column with a value of 1 in the row corresponding to the interacting subject entity sub. The function for the Subject-Relation Interaction is defined in Eq. (5).  T rue, (sub, r, obj) ∈ T, ∃sub ∈ E (5) Ir (sub, r) = F alse, otherwise For example, in the triplet , “Lebron James” is the subject interacting with the relationship “first name” and the value in the intersection of the row and column corresponding to “first name” and “Lebron James” is 1. Similarly, in the triplet , “Lebron James” is interacting with the relationship “live in” and the value in the inter-section of the row and column corresponding to “live in” and “Lebron James” is also 1. Object-Relation Interaction. The object-relation interaction links object entities with corresponding relationships. When a relationship r has a triplet with obj as the object, it is considered to interact with r. The interaction table represents this with a blue rectangle in the lower left corner, where relationships are rows, and each relationship is represented by a row with a value of 1 in the column corresponding to the interacting object entity obj. Equation (6) defines the function of the Object-Relation Interaction.  T rue, (sub, r, obj) ∈ T, ∃obj ∈ E (6) Ir (obj, r) = F alse, otherwise

254

S. Liu et al.

For instance, in the triplet , “James” inter-acts with the relationship “first name” and the value in the intersection of the row and column for “first name” and “James” is 1 in Fig. 2. Similarly, in the triplet , the United States interacts with the relationship “live in” and the value in the intersection of the row and column for “live in” and “United States” is also 1. Interaction Recognition. This paper introduces a novel approach to interaction recognition that leverages the Transformer self-attention layer to capture deep correlations. The model combines the three types of interactions into a single two-dimensional interaction table, which is then used in conjunction with the self-attention layer to capture the semantic associations between entities and relationships, and the interactive relationships between entities. The interaction table is arranged in a matrix format, with rows and columns representing entities and relationships. Each element in the interaction table represents the presence or absence of an interaction between the corresponding entities and relationships, indicated by 1 or 0. The interaction table is also used to obtain the embeddings of Q and K, which are generated by the final layer of BERT. T 1  Qt KtT √ (7) I= T t dh The interaction matrix I is obtained by averaging the point-wise products of Q and K for all attention heads, where the number of attention heads is denoted by T . The trainable weights WtQ and WtK are also incorporated into the model. Using this approach, the model can comprehensively capture the interactive information and invisible connections between the subject and object, subject and relationship, and object and relationship as a single two-dimensional interactive table. An advantage of this approach is that it does not require the subject and object entities to be lexical entities. This way is an improvement over the UniRel approach, which requires the expansion into three two-dimensional tables to deal with multi-lexical entities. Furthermore, this modeling approach helps reduce the pre2 diction space to O((N + M) ), which is a significant improvement over classical TPLinker and recent OneRel methods that predict triplets in O((N × M × N)). Decoding. The model synchronously extracts the subject entity, the associated entity, and the relationship. The decoding process of the model involves two steps. First, the model obtains all subject-relationship interaction pairs(subi , rk ) and associated entity-relationship pairs (obj i , rk ) based on the objectrelationship interaction table and subject-relationship interaction table, where rk ∈ R. According to the entity boundary mentioned earlier, adjacents 1’s are considered the same entity, and 0 indicates the end of the entity search. For example, in Fig. 2, the subject-relationship interaction pairs can be obtained as (Lebron James, first name) and (Lebron James, live in), and the objectrelationship pairs can be obtained as (James, first name) and (United States, live

UniER

255

in). However, (Lebron, first name) and (James, first name) are not valid subjectrelationship pairs and should be treated as a whole, namely Lebron James. Secondly, the model combines the subject and object entities obtained in step one to generate all possible subject-object entity pairs (subi , obj i ) exhaustively. The subject-object entity interaction table is queried to check whether a subject-entity combination meets the conditions. If so, the current triple can be obtained. This process is repeated until the joint decoding obtains the final set of triples. Taking the results of step one as an example, all possible subject-object entity pairs can be obtained as (Lebron James, James) and (Lebron James, United States). Then, by searching for the corresponding subject-object entity pairs in the subject-object entity interaction table, the final set of triples can be obtained through joint decoding, which is and . 3.4

Loss Function

The interaction table has label values of 0 and 1, and the extraction results can be converted into predictions for label values [16]. To address the problem of class imbalance, this paper uses a cross-entropy loss function for multi-label classification, where the number of target labels is much smaller than that of non-target labels. This is consistent with the relationship extraction method of extracting certain or several relationship categories from hundreds of relationship categories. Assuming that there are n candidate categories and k target categories need to be selected from them, and their scores are s1 , s2 , . . . , sn , the loss function is shown in the formula (8).   esi ) + log(1 + esj ) (8) Loss = log (1 + i∈Ωneg

j∈Ωpos

In formula (8), Ωpos and Ωneg represent the sets of positive and negative classes in the given sample, respectively. This equation ensures that the score sj of each target class is not less than the score si of each non-target class. By utilizing such a loss function, the model effectively mitigates the impact of imbalanced data and enhances its performance.

4 4.1

Experiments Datasets and Evaluation Metrics

To evaluate our proposed method, we utilized commonly used public datasets, NYT [8] and WebNLG [9], as in previous studies [1–4,13,14]. Two versions of each dataset were used, one with only the last word of annotated entities (NYT* and WebNLG*) and the other with the entire entity span (NYT and WebNLG). We evaluated the performance of the proposed model using Precision, Recall, and F1 score metrics, using partial matching for NYT* and WebNLG*, and exact matching for NYT and WebNLG. Using Precision, Recall, and F1 score metrics, using partial matching for NYT* and WebNLG*, and exact matching for NYT and

256

S. Liu et al.

WebNLG. To test the model’s performance on different question categories, we split the test sets of NYT* and WebNLG*. During experiments, we optimized the parameters of Adam [17] by using a learning rate of 2e−5/6e−5 and applying weight decay [18] at a rate of 0.01 on the NYT/WebNLG datasets. We used a batch size of 12/10 for the NYT/WebNLG datasets. The pretrained language model employed was bert-base-cased, and we conducted all experiments on a 1/4 NVIDIA A100 GPU and a 16g 8-core CPU, with the number of training epochs set to 100. For comparison purposes, we evaluated our model against six state-of-the-art baseline models, including CasRel [13], TPLinker [3], SPN [14], PRGC [4], OneRel [1], and UniRel [1]. The statistics of the main results are shown in Table 1. Table 1. Main results. The highest scores are in bold.

Table 2. F1 scores for sentences with different overlapping patterns and triple numbers on WebNLG*. L denotes the number of triples in a sentence. The highest scores are in bold.

4.2

Results and Analysis

Results. Table 1 presents a comparison of our model and the six baseline models based on partial match and exact match metrics for the NYT and WEBNLG datasets. Table 2 and Table 3 present a comprehensive performance comparison between our proposed model and other baseline models on various problem types using the WebNLG* and NYT* datasets.

UniER

257

Our experimental results show that our proposed method outperforms most of the baseline models and is comparable, and even superior in some cases, to the UniRel. We attribute this success to the synchronous extraction of entities and relationships using interactive tables, which reduces the accumulation of errors caused by step-by-step processing, and the homogeneous modeling approach, which enables the encoder to better comprehend the relationships in the input text, thereby enhancing the extraction performance. Table 3. F1 scores for sentences with different overlapping patterns and triple numbers on NYT*. L denotes the number of triples in a sentence. The highest scores are in bold.

Table 4. Computational Efficiency. Shows training and inference time, measured in seconds per epoch and milliseconds per sample, respectively.

Model Efficiency. In terms of predicting triplets, common extraction models such as TPLinker [3] and OneRel [1] have a complexity of O((N × M × N)), 2 whereas our method and UniRel can reduce the prediction space to O((N + M) ). OneRel performs parallel tensor operations to process all relationships simultaneously, thereby reducing the processing complexity to O((N × 1 × N)). Since UniRel [2] only publishes a portion of the matching code, we compared our model with OneRel on the NYT and WEBNLG datasets in terms of training and inference time for exact matching and with UniRel on partial matching, as presented in Table 4. The batch sizes for training and prediction were set to 6 and 1 for model efficiency comparison. The training time (s) refers to the time

258

S. Liu et al.

required to train one round, while the inference time (ms) represents the time taken to predict the triplets of a sentence. Table 4 shows that our proposed model is faster than the state-of-the-art UniRel [2] and OneRel [1] models in both training and inference time. Specifically, for exact matching, our model achieves significant speedups for exact matching in both training and inference phases when compared to OneRel. Regarding partial matching, our model is slightly superior to UniRel. We attribute this out-come to the interactive table design, which offers more tremendous improvement potential than OneRel’s parallel tensor operation, and UniRel also performs using a two-dimensional map on partial matching, resulting in similar efficiency. Ablation Study. The ablation experiments in this study are divided into two parts: interaction table modeling and loss function. To investigate the impact of interaction table modeling, we adopt the UniRel settings. In this setup, a shared BERT encodes the relationship’s input sentence and semantic token separately. Two transformer layers are used to obtain the corresponding Q and K, concatenated after splicing, and to predict the dot product. Therefore, the deep self-attention model cannot capture the interdependence of the three types of interactions, as represented by UniERseprate in Table 5. Table 5. Ablation Experiment Results.

In terms of the loss function, we use binary cross-entropy loss function for comparison, and UniERBCE is presented in Table 5. The results in Table 5 indicate that compared with UniER, the performance of UniERseprate and UniERBCE on both datasets has significantly decreased, demonstrating the effectiveness of interaction modeling and the improvement of the cross-entropy loss function for multi-label classification.

5

Conclusion

This paper proposes a joint entity and relation extraction method that addresses the problem of ignoring correlations between entities and relations in existing methods and reduces memory usage. Experimental results show that our method achieves comparable performance to state-of-the-art models on two datasets and outperforms them on some. Our method also significantly saves time and space costs. It provides a new approach for efficient natural language processing and knowledge graph construction. Future work includes optimizing and expanding the method’s scope of application.

UniER

259

References 1. Shang, Y.-M., Huang, H., et al.: OneRel: joint entity and relation extraction with one module in one step. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11285–11293 (2022) 2. Tang, W., Xu, B., et al.: UniRel: unified representation and interaction for joint relational triple extraction. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7087–7099 (2022) 3. Wang, Y., Yu, B., et al.: TPLinker: single-stage joint extraction of entities and relations through token pair linking. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 1572–1582 (2020) 4. Zheng, H., Wen, R., et al.: PRGC: potential relation and global correspondence based joint relational triple extraction. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pp. 6225–6235 (2021) 5. Zelenko, D., Aone, C., et al.: Kernel methods for relation extraction. In: 7th Conference on Empirical Methods in Natural Language Processing (2002) 6. Li, Q., Ji, H.: Incremental joint extraction of entity mentions and relations. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 402–412 (2014) 7. Kenton, J., Devlin M.-W., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171– 4186 (2019) 8. Riedel, S., Yao, L., et al.: Machine learning and knowledge discovery in databases. In: EMCL PKDD, pp. 148–163 (2010) 9. Gardent, C., Shimorinav, A.: Creating training corpora for NLG micro-planning. In: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pp. 179–188 (2017) 10. Miwa, M., Bansal, M., et al.: End-to-end relation extraction using LSTMs on sequences and tree structures. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016) 11. Katiyar, A., Cardie, C., et al.: Going out on a limb: joint extraction of entity mentions and relations without dependency trees. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 917–928 (2017) 12. Zheng, S., Wang, F., et al.: Joint extraction of entities and relations based on a novel tagging scheme. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1227–1236 (2018) 13. Wei, Z., Su, J., et al.: A novel cascade binary tagging framework for relational triple extraction. In: A Novel Cascade Binary Tagging Framework for Relational Triple Extraction, pp. 1476–1488 (2020) 14. Sui, D., Zeng, X., et al.: Joint entity and relation extraction with set prediction networks. In: IEEE Trans. Neural Netw. Learn. Syst. (2023) 15. Vaswani, A., Shazee, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017) 16. Su, J., Zhu, M., et al.: ZLPR: a novel loss for multi-label classification. arXiv preprint arXiv:2208.02955 (2022) 17. Kinga, D., Adam, J.B., et al.: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 18. Loshchilov, I., Hutter, F., et al.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)

Multi-task Biomedical Overlapping and Nested Information Extraction Model Based on Unified Framework Xinyu He1,2,3(B) , Shixin Li1 , Guangda Zhao1 , Xue Han1 , and Qiangjian Zhuang1 1 School of Computer and Artificial Intelligence, Liaoning Normal University, Dalian, China

[email protected], [email protected]

2 Information and Communication Engineering Postdoctoral Research Station, Dalian

University of Technology, Dalian, China 3 Postdoctoral Workstation of Dalian Yongjia Electronic Technology Co., Ltd., Dalian, China

Abstract. Biomedical information extraction technology can mine crucial biomedical knowledge from biomedical literature automatically, which plays an important role in constructing knowledge graphs and databases. Biomedical named entity recognition and event detection are crucial tasks in biomedical information extraction. Previous research mainly focused on building models for single tasks, with low generalization ability. Moreover, most prior work neglected the detection of overlapping events and the recognition of nested entities. Although some models can solve these problems, they suffer from low efficiency and poor performance. Therefore, we propose a unified framework multi-task MBONIEUF model for biomedical entity recognition and event detection. Our model converts sequence labeling problems into machine reading comprehension problems, first using ChatGPT to generate semantically rich questions based on the biomedical corpus labels, and then encoding the concatenated generated questions and original sentences using PubMedBERT. Furthermore, BiGRU, Biaffine attention module, and IDCNN module are designed as the feature extraction layer, used to capture the complex interactions between token pairs and the features between events and entities. In the decoding layer, a stacked classification pointer matrix and a multi-head nested query matrix are learned, and a corresponding detection algorithm is designed to decode the entities and events in the two matrices. We propose a model that can handle both biomedical named entity recognition and event detection tasks. Compared with traditional methods, our model not only improves the model’s task generalization ability but also performs better in solving the overlapping and nested problems in biomedical information extraction. On the MLEE, BioNLP’09, and GENIA datasets, the proposed model achieves good performances with F1 scores of 81.62%, 75.76%, and 77.75% respectively. Keywords: Biomedical information extraction · MRC · Overlapping events · Nested entities · ChatGPT

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 260–272, 2023. https://doi.org/10.1007/978-3-031-44696-2_21

Multi-task Biomedical Overlapping and Nested Information

261

1 Introduction Biomedical named entity recognition and event detection are crucial tasks in biomedical information extraction. Biomedical information extraction can mine crucial biomedical knowledge from massive biomedical literature automatically, which plays an important role in constructing biomedical knowledge graphs and databases. Named entity recognition aims to identify and extract phrases with special meanings, typically referred to as named entities, from unstructured text. Event detection aims to detect trigger words in the text and classify their types, with the type of trigger word determining the type of event. The goals of the biomedical named entity and event detection tasks are to extract specific named entities and events from biomedical texts to help researchers better understand relevant knowledge in the field and promote research and development in the biomedical domain. There exist a large number of nested entities and overlapping events in the corpus from various fields such as medicine and news. For instance, the GENIA dataset [1], which is commonly used in the biomedical domain, contains approximately 10% nested entities. However, most existing models cannot identify nested named entities and overlapping events.

Fig. 1. Examples of Flat and Nested Entities

Figure 1 illustrates an example of a nested biomedical named entity in biomedical text sentence fragment S1. The goal of biomedical named entity recognition is to identify “IFN-gamma cytoplasmic mRNA”, “IFN-gamma”, and “protein kinase C” as named entities and classify them accordingly. “IFN-gamma cytoplasmic mRNA” is an RNAnamed entity that nests a Protein-named entity “IFN-gamma”, which is called a nestednamed entity. “protein kinase C” at the end is a normal flat-named entity.

Fig. 2. Example of Flat and Overlapping Events

Figure 2 shows an example of overlapping events in sentence S2 from a biomedical text. There are two trigger words in S2: “overexpressed” and “up-regulated”. The word “overexpressed” corresponds to the types of Gene expression and Positive regulation, while the word “up-regulated” corresponds to the type of Positive regulation. The word “overexpressed” represents an overlapping event, where the same trigger word corresponds to multiple types. The word “up-regulated” represents a flat event, where one trigger word corresponds to one type.

262

X. He et al.

Currently, existing methods for biomedical information extraction tasks typically treat them as sequence labeling problems, such as the BiLSTM-CRF, LSTM-CNNs, and LSTM-CNNs-CRF models proposed in [2–4], and [5]. However, these methods require each word to correspond to a label during annotation, which makes it difficult to handle nested entities and overlapping events. Some researchers have proposed machine reading comprehension-based named entity recognition models, such as Named Entity Recognition based on Machine Reading Comprehension Framework (NER-MRC), which is referenced in [6, 7], and [8]. However, the NER-MRC model has the drawbacks of low efficiency and difficulty extracting text features. Additionally, most existing research focuses on single tasks, where each task requires a separate model to identify them, leading to poor generalization performance. To address these problems, We propose a Multi-task Biomedical Overlapping and Nested Information Extraction Model based on a Unified Framework (MBONIEUF), which uses the Chat Generative Pre-trained Transformer (ChatGPT) [9] to construct queries and transforms the sequence labeling problem into an MRC problem, while building two types of matrices and algorithm to decode the content that needs to be identified. The proposed model has good performance in biomedical named entity recognition and event detection tasks, solving the problems of nested entity recognition in biomedical named entity recognition and overlapping events in biomedical event detection effectively. The main contributions of our works are as follows: • We propose a unified framework multi-task model to address the problem of biomedical information extraction models being limited to single tasks. • To solve the problem of overlapping events in biomedical event detection and nested entities in biomedical named entity recognition, we propose a stacked classification pointer matrix, a multi-head nested query matrix, and corresponding detection algorithms. • To address the low efficiency and cumbersome prior information construction issues of machine reading comprehension-based dataset construction, we propose the integration of the ChatGPT semantic construction method. • We propose a feature extraction layer of bi-directional GRU, bi-affine attention mechanism, and iterative dilated convolution (BiGRU-Biaffine-IDCNN) to address the issue of poor feature extraction from text. Experimental results on the GENIA dataset, MLEE corpus [10], and Bionlp’09 corpus [11] show that the proposed model achieves comparable or better performance in evaluation criteria of Precision, Recall, and F1 scores for biomedical named entity recognition and event detection tasks compared to existing works.

2 Related Work Recently, biomedical information extraction methods can be broadly divided into deep learning based, Pre-training based methods, and large language model methods. Deep learning based methods are a hot topic in the field of biomedical information extraction. Reference [12] used hypergraph methods to solve nestedNER problems, representing all possible nested structures through hypergraphs, which have a complex

Multi-task Biomedical Overlapping and Nested Information

263

design, and high event complexity for training and inference. Reference [13] proposed inserting special tags between overlapping entities to identify entities. During the Pre-trained language period, BERT [14] model has become an important tool for information extraction tasks. Lin and Ji [15] first used the BERT model to obtain word vector representations and then employed conditional random fields (CRF) and fully connected networks to identify and classify trigger words. The BERT-CRF sequence labeling model [2] is a sequence labeling model that combines the BERT pretraining model with the CRF model. It is essentially a word-level multi-classification problem. The advantage of this model is that it is particularly good for flat event recognition, but it performs poorly when dealing with nested and overlapping events, and there is room for improvement. Li et al. [6] used BERT’s semantic understanding ability to propose a machine reading comprehension model based on question answering, transforming the NER task into a NER-MRC task. The disadvantage of this traditional NER-MRC model is that the efficiency of constructing queries is not high, and if there are many categories, the efficiency of training and inference will be low. Bekoulis G et al. [16] proposed a multi-head selection scheme to solve the nested problem. The idea of the multi-head selection model is that each token has a relationship, and the relationship score is finally mapped to a probability value output by calculating the relationship score between the tokens. The disadvantage of this method is that when constructing a multi-dimensional tensor, it will cause an imbalance of positive and negative samples, and if the number of categories is particularly large, the recognition effect will be greatly reduced. The increasing use of large language models in biomedical information extraction is driven by their powerful generative capabilities. However, significant progress in this field faces challenges, such as the need for substantial computing resources and lacking high-quality annotated biomedical datasets. In previous studies, most models have achieved significant performance in the field of biomedical information extraction. However, existing methods still have problems such as unable to identify nested entities and overlapping trigger words, complex dataset construction, low training efficiency and effectiveness, and can only be applied to a single model, which limits the effectiveness of the model in practical applications. Therefore, we propose a new unified and multitasking biomedical information extraction model: MBONIEUF, which addresses biomedical overlapping and nested information extraction. The model focuses on addressing the problems faced by previous models and proposes feasible solutions to these problems.

3 Method We propose a unified framework for multi-task biomedical information extraction models, the MBONIEUF model, which includes a semantic construction layer, an encoding layer, a feature extraction layer, a decoding layer, and a loss layer. The model first concatenates sentences and queries generated by ChatGPT and then encodes them using PubMedBERT [17]. The resulting hidden layers are used to obtain a matrix through a BiGRU [18] and a biaffine attention module. On the other hand, the hidden layers are also used to obtain another matrix through an IDCNN [19] module. The model finally

264

X. He et al.

learns the two matrices at the decoding layer. In addition, we cleverly design a detection algorithm to decode nested entities and overlapping events in the two matrices. The overall architecture of the model is shown in Fig. 3.

Fig. 3. MBONIEUF Model.

3.1 Semantic Construction Layer To incorporate rich biomedical domain knowledge and semantics queries into the MBONIEUF model for more accurate extraction of useful information from biomedical texts, we design a semantic construction module based on ChatGPT. The formula is as follows: Q1 , Q2 , . . . , Qn = ChatGPT (α)

(1)

where ChatGPT can take a prompt α text as input and generate suitable text as output. After being constructed by ChatGPT, the model obtains a query = Q1 , Q2 , . . . Qn , where n is the length of the question. Combining the query, [CLS] and [SEP] special tokens with the sentence after being tokenized, the constructed text segment is obtained, forming the combined string {[CLS], T1 , T2 , . . . , Tn , [SEP], Q1 , Q2 , . . . , Qn [SEP]}. Compared with constructing queries through solutions such as Wikipedia, synonyms, keywords, rule templates, etc.[6], we use a generative approach to construct queries using ChatGPT, which has been found that recognition efficiency and effectiveness can be improved through experiments.

Multi-task Biomedical Overlapping and Nested Information

265

3.2 Encoding Layer BERT is trained in general language corpora but lacks knowledge in the biomedical field. PubMedBERT, a domain-specific pre-trained model trained on PubMed abstracts and PMC full-text, has achieved good results in biomedical natural language processing tasks. Therefore, we use PubMedBERT as the encoding layer. The text segment obtained from the semantic construction layer is sent to PubMedBERT, and then the query representation is deleted after PubMedBERT calculation, we finally obtain the text representation H = {h1 , h2 , ..., hN } ∈ RN ×dh , where d h is the vector dimension of the last layer of PubmedBERT. 3.3 Feature Extraction Layer After a word is split into tokens, its internal information can be easily fragmented, resulting in the loss of some semantic information. The BiGRU can capture both sequence information and associative information. Biaffine Attention Mechanism can calculate the relevance between each pair of tokens by performing bidirectional affine transformations on all tokens, then use it to weigh the calculation of contextual vector and increase the interaction information between head and tail tokens. Therefore, we use BiGRU to obtain forward and backward information of tokens and employ a Biaffine Attention Mechanism to enhance the interaction information between tokens, formulas are given as follows:   (2) fm = GRU hm1 , hm2 , . . . , hmj   bm = GRU hmt , hmt−1 , . . . , hmj

(3)

Si = fmT Um bm + Wm (fm ⊕ bm ) + bm

(4)

where f m and bm represent the forward and backward information obtained by the GRU for the m-th token, respectively. U m and W m are parameter weights, and S i is the binary classification result score. Moreover, to capture long-range dependencies in text and improve computational efficiency, We adopt IDCNN as a component of the feature extraction layer to process the representation H. IDCNN can improve the model’s effectiveness by iterating multiple layers of convolution and introducing residual connections. 3.4 Decoding Layer After the feature extraction, the representation E is processed by BiGRU-Biaffine attention and IDCNN respectively. Inspired by the CasRel [20] model, we propose to use the representation E to construct a stacked classification pointer matrix(SCM) SCM = {e1 , e2 , ..., e2C } ∈ R2C×T , where C represents the number of categories (i.e., Num label), and the multi-head nested query matrix(MQM) MQM = {e1 , e2 , ..., eT } ∈ RT ×T , where T represents tokens. We have designed a corresponding decoding algorithm. The idea of the decoding algorithm is to first traverse the SCM and obtain the corresponding

266

X. He et al.

label based on the index, and then traverse the MQM to find the head and tail positions of the entity or trigger word, which is then put into a list. Then, all possible permutations of the head-tail pairs in the list are generated and each pair is queried in the MQM. If the result from the MQM is greater than the threshold, it is considered a target to recognize; otherwise, it is not. The specific formulas are shown as follows: Rstart , Rend

T 2c     = argmax( (SCM i, j > ξ1 )

(5)

j=0 i=0

Ghead , Gtail = argmax(

T T   (MQM[Rstart , Rend ] > ξ2 )

(6)

j=0 i=j

where ξ1 andξ2 are scalar thresholds (ξ1 , ξ2 ∈ [0, 1]), Rstart , Rend denotes the indexes in the SCM that are greater than the thresholds, C and T represent the number of categories and tokens, and Ghead , Gtail refers to the predicted start and end text to be recognized. Based on experimental results, it is found that our model is prone to predict values as zero when processing the MQM, leading to misjudgment. To solve this problem, we adopt the method of marking all elements in the lower triangular area of the MQM as one. During decoding, we also do not process the lower triangular area of the matrix. By using these two binary matrices, the model can determine whether an input sample contains overlapping or nested events. The process is given as follows: Pstart,end = sigmoid (W · concat(Rstart , Rend ) + b)

(7)

Phead ,tail = sigmoid (W · concat(Ghead , Gtail ) + b)

(8)

where Pstart,end , and Phead ,tail reflect the probability of model prediction. W,b are learnable parameters. The decoding module of this model can recognize nested named entities and overlapping events, demonstrating its superiority. 3.5 Loss Layer Different loss functions have been set due to severe sample imbalance, with α-balanced BCE Loss [21] being used in the SCM and Focal Loss [22] being used in the MQM for calculating the loss. The loss function can be formulated as: LCM start,end =

 1  g g ( −α log(Pstart,end ) + −(1 − α) log(1 − Pstart,end )) N m

n

yi =1

yi =0

Lhead ,tail = −1 − Phead,tail )γ log(Phead,tail ) QM

g

g

QM

L = μLCM start,end + λLhead ,tail

(9) (10) (11)

where α, μ, λ ∈ [0, 1] are hyperparameters that control the balance between the losses, g γ > 0 is the modulation factor. P∗,∗ reflects the degree of closeness to the golden label y ∈ Y.

Multi-task Biomedical Overlapping and Nested Information

267

4 Experiment and Result Analysis 4.1 Dataset The experiments are conducted on three datasets. For biomedical named entity recognition, the nested biomedical named entity recognition dataset GENIA is used. For the biomedical event detection task, we employ MLEE and BioNLP’09 corpora. The data statistics as shown in Table 1. Table 1. Dataset statistics Dataset GENIA

MLEE

BioNLP’09

Data Type

Train

Dev

Test

All

Doc

1599

189

212

2000

Sent

15023

1669

1854

18546

Percentage

81%

9%

10%

100%

Doc

131

44

87

262

Sent

1271

457

880

2608

Event

3296

1175

2206

6677

Doc

800

150

260

1210

Sent

7449

1450

2447

11346

Event

8597

1809

3182

13588

4.2 Experimental Setup The experimental model is implemented based on Pytorch1 , and the optimizer uses the AdamW algorithm. The hyperparameters used in the experiment are shown in Table 2. PubMedBERT’s learning rate is chosen as 1e-5 from the set {0.1, 0.01, 1e-3, 1e-4, 1e-5}, while other modules’ learning rate is set to 1e-4 from the same set, the warm proportion is selected as 0.1 from the set {0.1, 0.2, 0.3}, the BiGRU hidden size is selected as 768 from the set {512, 768}. Three evaluation metrics are used in the experiment: Precision (P), Recall (R), and F1 score.

1 https://www.pytorch.org/

268

X. He et al. Table 2. Experimental setup

parameter

GENIA

MLEE

BioNLP’09

threshold

0.5

0.5

0.5

PubMedBERT lr

1e-5

1e-5

1e-5

other modules lr

1e-4

1e-4

1e-4

dropout

0.3

0.5

0.2

modulating factor γ

2

2

2

warm proportion

0.1

0.1

0.1

AdamW epsilon

1e-8

1e-8

1e-8

BiGRU hidden size

768

768

768

4.3 Results and Analysis of Experiments We compare the proposed model with representative models in recent years in the field of biomedical research. The experimental results are shown in Table 3. Table 3. Experimental results Dataset GENIA

MLEE

BioNLP’09

Models

P(%)

R(%)

F(%)

Lu et al. [12]

72.5

65.2

68.7

Muis et al. [13]

75.4

66.8

70.8

Zheng et al. [23]

75.2

73.3

74.2

Tan et al. [24]

78.9

72.7

75.7

Ours

74.56

81.23

77.75

Pyysalo et al. [10]

70.79

81.69

75.84

Nie et al. [25]

71.04

84.60

77.23

Shen et al. [26]

80.06

81.25

80.57

Wei et al. [27]

79.89

81.61

80.74

Ours

76.45

87.53

81.62

Majumder et.al [28]

64.28

69.96

67.00

Martinez et al. [29]

52.60

75.30

60.10

Wang et al. [30]

64.00

75.30

68.80

Li et al. [31]

71.84

75.94

71.70

Ours

72.46

79.38

75.76

As shown in Table 3, we compare our proposed model with other models using the GENIA dataset. Lu et al. [12] used a hypergraph-based approach to solve the problem

Multi-task Biomedical Overlapping and Nested Information

269

of named entity recognition. Muis et al. [13] improved upon the hypergraph method. Zheng et al. [23] used a sequence labeling model to detect nested entity boundaries and merge corresponding boundary label sequences. Tan et al. [24] proposed a model that combines part-of-speech tagging with joint training of entity boundary detection and classification. We selected model [12] as the baseline method for our model on the GENIA dataset, and our model achieved an F1 score that was 9.05% higher than the baseline model, and 6.95%, 3.55%, 2.05% higher than models [13, 23, 24], respectively. We compare our proposed model with other models using the MLEE corpus. Pyysalo et al. [10] used a support vector machine approach with manually designed salient features. Nie et al. [25] used a dependency word embedding approach to learn important features from the raw input and classify them. Shen et al. [26] proposed the CHNNELM framework for biomedical event trigger word recognition, using a CNN end-to-end supervision method to automatically learn semantics and perform feature selection. Wei et al. [27] used a multi-layer BiLSTM neural network approach. We select model [10] as the baseline method for our model on the MLEE corpus, and our model achieved an F1 score that was 5.78% higher than the baseline model, and 4.39%, 1.05%, 0.88% higher than models [25–27], respectively. We compare our proposed model with other models using the BioNLP’09 corpus. Majumder et.al [28] used a CRF-based method for identifying biological molecule event triggers. Martinez et al. [29] used a word sense disambiguation (WSD) approach for detecting biomedical event triggers. Wang et al. [30] employed a deep parsing-based method for extracting event triggers. Li et al. [31] proposed a combined dual decomposition and word embedding approach. We selected model [29] as the baseline method for our model on the BioNLP’09 corpus, and our model achieves an F1 score that is 15.66% higher than the baseline model, and 8.76%, 6.96%, 4.06% higher than models [28, 30, 31] respectively. Our model achieves higher F1 scores than other models on the nested named entity recognition GENIA dataset and the event detection MLEE and BioNLP’09 corpus. While other models have their merits that can be borrowed, they also come with various shortcomings. Our model has stronger text representation, better contextual dependence, and higher data utilization efficiency, leading to higher accuracy. 4.4 Ablation Experiment To further verify the effectiveness of our model, we conduct ablation experiments on the MLEE corpus. Under the same experimental conditions, we compare the experimental results with different modules removed, as shown in Table 4. In this study, we use PubMedBERT as our baseline model (line 1), which achieved an F1 score of 70.81%. We then replace the query construction using synonyms with chatGPT, which improve the efficiency and effectiveness of query construction and raised the F1 score by 3.57% (line 2). Compared with line 2, line 3 adds a BiGRU to enhance the model’s ability to capture sequence information and association information, which increases the F1 score by 1.29%. Line 4 adds a Biaffine module to improve the interaction of information between the head and tail tokens, raising the F1 score by 2.71%. Finally, line 5 adds an IDCNN module to enhance the ability to extract text features, increasing

270

X. He et al. Table 4. Ablation experiment results

Models

F1 score (%)

Baseline(PubMedBERT + decoder layer + synonyms + Focal Loss + α-balanced BCE Loss)

70.81 ± 0.61

Baseline + ChatGPT

74.38 ± 0.47

Baseline + ChatGPT + BiGRU

75.67 ± 0.36

Baseline + ChatGPT + BiGRU + Biaffine

78.38 ± 0.27

Baseline + ChatGPT + BiGRU + Biaffine + IDCNN

79.27 ± 0.69

All Modules

81.62

the F1 score by 0.89%. Overall, our proposed model achieves a good result of 81.62% on the MLEE corpus.

5 Conclusion In this paper, we propose a model that can handle named entity recognition and event detection in biomedical information extraction tasks based on a unified framework. The significant contribution of our works is that our models can solve the problem of nested entity recognition in biomedical named entity recognition tasks and the difficult-tohandle overlapping event detection problem in event detection tasks with our unified framework. Experimental results on the GENIA dataset, MLEE corpus, and BioNLP’09 corpus validate the effectiveness of our proposed model. Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 62006108), Postdoctoral Research Foundation of China (No. 2022M710593), Liaoning Provincial Science and Technology Fund project (No. 2021-BS-201), Liaoning Province General Higher Education Undergraduate Teaching Reform Research Project (Liaoning Education Office [2022] No. 160), Liaoning Normal University Undergraduate Teaching Reform Research and Practice Project (No. LSJG202210).

References 1. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus—a semantically annotated corpus for bio-text mining. In: ISMB (Supplement of Bioinformatics), pp.180–182 (2003) 2. Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv:1909.10649 (2019) 3. Ma, X., Eduard, H.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv:1603.01354 (2016) 4. Tang, D., Qin, B., Feng, X., Liu, T.: Effective LSTMs for target-dependent sentiment classification. arXiv:1512.01100 (2015) 5. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv:1603.01360 (2016)

Multi-task Biomedical Overlapping and Nested Information

271

6. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. arXiv:1910.11476 (2019) 7. Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., Hu, G.: Revisiting pre-trained models for Chinese natural language processing. arXiv:2004.13922 (2020) 8. Asai, A., Hashimoto, K., Hajishirzi, H., Socher, R., Xiong, C.: Learning to retrieve reasoning paths over Wikipedia graph for question answering. arXiv:1911.10470 (2019) 9. OpenAI. “ChatGPT: Language Model by OpenAI.“ OpenAI Blog. https://openai.com/blog/ chatgpt/. Accessed 15 May 2023 10. Pyysalo, S., Ohta, T., Miwa, M., Cho, H.C., Tsujii, J.I., Ananiadou, S.: Event extraction across multiple levels of biological organization. Bioinformatics 28(18), i575–i581 (2012) 11. Kim, J.D., Ohta, T., Tsujii, J.I.: Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 9, 1–25 (2008) 12. Lu, W., Roth, D.: Joint mention extraction and classification with mention hypergraphs. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 857–867. ACL, Lisbon, Portugal (2015) 13. Muis, A.O., Lu, W.: Labeling gaps between words: Recognizing overlapping mentions with mention separators. arXiv:1810.09073 (2018) 14. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018) 15. Lin, Y., Ji, H., Huang, F., Wu, L.: A joint neural model for information extraction with global features. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7999–8009. ACL, Online (2020) 16. Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 114, 34–45 (2018) 17. Gu, Y., et al.: Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare (HEALTH), 3(1), 1–23 (2021) 18. Dey, R., Salem, F.M.: Gate-variants of gated recurrent unit (GRU) neural networks. In: Proceedings of the IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1597–1600 (2017) 19. Fang, Y., Gao, J., Liu, Z., Huang, C.: Detecting cyber threat events from Twitter using IDCNN and BiLSTM. Appl. Sci. 10(17), 5922 (2020) 20. Wei, Z., Su, J., Wang, Y., Tian, Y., Chang, Y.: A novel cascade binary tagging framework for relational triple extraction. arXiv:1909.03227 (2019) 21. Zhang, Z., Sabuncu, M.: Generalized cross-entropy loss for training deep neural networks with noisy labels. Adv. Neural Inform. Process. Syst. 31 (2018) 22. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. (2017) 23. Zheng, C., Cai, Y., Xu, J., Leung, H. F., Xu, G.: A boundary-aware neural model for nested named entity recognition. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics (2019) 24. Tan, C., Qiu, W., Chen, M., Wang, R., Huang, F.: Boundary-enhanced neural span classification for nested named entity recognition. Proc. AAAI Conf. Artifi. Intell. 34(05), 9016–9023 (2020) 25. Nie, Y., Rong, W., Zhang, Y., Ouyang, Y., Xiong, Z.: Embedding-assisted prediction architecture for event trigger identification. J. Bioinform. Comput. Biol. 13(03), 1541001 (2015) 26. Shen, C., et al.: Biomedical event trigger detection with convolutional highway neural network and extreme learning machine. Appl. Soft Comput. 84, 105661 (2019)

272

X. He et al.

27. Wei, H., Zhou, Ai., Zhang, Y., Chen, F., Wen, Qu., Mingyu, Lu.: Biomedical event trigger extraction based on multi-layer residual BiLSTM and contextualized word representations. Int. J. Mach. Learn. Cybern. 13(3), 721–733 (2021). https://doi.org/10.1007/s13042-021-013 15-7 28. Majumder, A.: Multiple features-based approach to extract bio-molecular event triggers using conditional random field. Int. J. Intell. Syst. Appl. 4(12), 41 (2012) 29. Martinez, D., Baldwin, T.: Word sense disambiguation for event trigger word detection in biomedicine. BMC Bioinformatics, 12(2), 1–8, BioMed Central (2011) 30. Wang, J., Wu, Y., Lin, H., Yang, Z.: Biological event trigger extraction based on deep parsing. Comput. Eng. 39, 25–30 (2013) 31. Li, L., Liu, S., Qin, M., Wang, Y., Huang, D.: Extracting biomedical events with dual decomposition integrating word embeddings. IEEE/ACM Trans. Comput. Biol. Bioinf. 13(4), 669–677 (2015)

NTAM: A New Transition-Based Attention Model for Nested Named Entity Recognition Nan Gao(B) , Bowei Yang , Yongjian Wang , and Peng Chen School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China [email protected]

Abstract. Traditional Named Entity Recognition (NER) research only deals with flat entities and ignores nested entities. The transition-based method maps a sentence to a designated forest to recognize nested entities by predicting an action sequence through a state transition system which includes transition actions and a state of structures. However, the subsequent transition actions are affected by the previous transition actions resulting in error propagation, and the method ignores the correlation between the structures. To tackle these issues, we propose a new transition-based attention model (NTAM) to recognize nested entities. First, the structures and transition actions of the state transition system are redefined to eliminate error propagation. The prediction of an action sequence is converted to the prediction of a series of states, which predict whether the words between the structures can form entities. Second, we introduce an attention mechanism that strengthens the association between the structures. Experiments on two public nested NER datasets outperform previous state-of-the-art models. Keywords: Nested Named Entity Recognition · Transition-based Model · State Transition System · Attention Mechanism

1

Introduction

Named Entity Recognition (NER) is a fundamental natural language processing task, and entity overlap is a common phenomenon in natural language. Depending on whether an entity is contained in another entity, NER is divided into nested NER and flat NER. Traditional NER is usually regarded as a sequence labeling task, such as LSTM-CRF [1], which cannot extract nested entities. Recently, various methods have been proposed to handle nested NER, such as sequence labeling-based methods [2,3], hypergraph-based methods [4–6], spanbased methods [7–10], MRC-based methods [11,12], and neural transition-based methods [13]. The neural transition-based method [13] maps a sentence with nested entities to a designated forest where each entity corresponds to a constituent of the forest. The forest structure is generated by predicting an action c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 273–286, 2023. https://doi.org/10.1007/978-3-031-44696-2_22

274

N. Gao et al.

sequence through the model. One advantage of this approach is that the parsing can efficiently and effectively capture word-level patterns by representing the state of the system in continuous space. However, this method suffers from some weaknesses. First, specific transition actions require specific constraints, and the subsequent transition actions are affected by the previous transition actions, resulting in error propagation. Some nested entities can not be recognized when a transition action prediction is wrong. Second, this method only performs feature extraction on individual structures and ignores the correlation between the structures which are the essential elements of predicting actions. To solve these problems, we propose a new transition-based attention model (NTAM) to recognize nested entities. First, structures and transition actions are redefined to eliminate special constraints, and the influence of previous actions on the subsequent actions is discarded to eliminate error propagation. Specifically, we integrate the information on transition actions into states and redefine four structures, three transitional actions, and two recognition processes as a state transition system. The prediction of an action is converted to the prediction of a state, which predicts whether words between the structures can form an entity. By learning the relationships between words, our model can identify more potential entities, including nested entities. Second, an attention mechanism is introduced to strengthen the association between the structures, which helps to recognize nested entities. Finally, the F1 scores of our model on the GENIA and People’s Daily datasets are improved by 0.41% and 1.02% compared to the previous state-of-the-art model, respectively. Comprehensively, our main contributions are summarized as follows: – We propose a novel transition-based attention model named NTAM to recognize nested entities. Our model has a new state transition system to eliminate error propagation and introduces an attention mechanism to identify more potential entities by learning the relationships between words. – The state transition system includes four structures, three transitional actions, and two recognition processes. The structures and transition actions are to eliminate special constraints, and the influence of previous actions on the subsequent actions is discarded to eliminate error propagation. – The attention mechanism strengthens the association between the structures and makes the model pay attention to the relationships between the words in the structures, which helps to recognize nested entities. – Experiments on two public nested NER datasets prove the effectiveness of our proposed method and demonstrate that the model outperforms previous state-of-the-art models.

2 2.1

Related Work Nested Named Entity Recognition

NER is divided into nested and flat NER according to whether the entities overlap. Traditionally, the sequence labeling task cannot recognize nested entities

NTAM: A New Transition-Based Attention Model for Nested NER

275

unless the number or meaning of output labels is modified to recognize nested entities. For example, Wang et al. [3] extend the prior second-best path recognition method by explicitly excluding the influence of the best path. Hypergraphbased methods use a decoder to label each word in a sentence with all possible labels, and the combination of these labels can identify all possible nested entities. The main idea of span-based methods is to treat entity recognition as a span classification task. Shen et al. [8] treat nested NER as a joint task of boundary regression and span classification. 2.2

Transition-Based Method

The neural transition-based method is initially applied to component analysis [14] and named entity recognition [15]. Transition-based approaches have shown competitive performance on constituent and dependency parsing. Meanwhile, Wang et al. [13] successfully applied this method to nested named entity recognition. First, the state of the system is determined by the stack, buffer, and historical actions. The system predicts one action at a time, and each action can change the system’s state by pushes or pop the stack or buffer. Second, each sentence with nested entities was mapped to a forest structure by a series of output actions. Finally, each outermost entity formed a tree in a forest structure that resolved nested entities. In addition, Ji et al. [15] proposed a new neural transition-based joint model for disease entity recognition and normalization.

3 3.1

Methodology Nested NER Task Definition

the task of Nested NER For a given sequence X = {x0 , x 1 , · · · , xn−1 } of N words,  is to identify all entities M = m1 , m2 , · · · , m|M | mentioned in the sequence X and to link each of the identified entity mi with its type ci , mi → ci . The set of entity type is C. M may contain nested entities, such as an entity m1 = {xi , · · · , xj } and a nested entity m2 = {xt , · · · , xh }, where i ≤ t < h ≤ j and m1 = m2 . For example, an entity “CD28 surface receptor” contains a nested entity “CD28”. At this time, m1 is “CD28 surface receptor”, and m2 is “CD28”. 3.2

State Transition System

To focus on the relationships between words and identify more potential nested entities, we redefine four structures, three transition actions, and two recognition processes as a state transition system.

276

N. Gao et al.

A state is defined as a tuple (β1 , S1 , S2 , β2 ), where S and β represent the stack and buffer structures respectively. The tuple consists of the following four structures: – buffer 1(β1 ) : the buffer 1 is used to store processed words. – stack 1(S1 ) : the stack 1 is used to store words that may constitute an entity with the word in stack 2. – stack 2(S2 ) : the stack 2 is used to store a single word being processed. – buffer 2(β2 ) : the buffer 2 is used to store words to be processed. State transitions from four structure changes step by step can recognize nested entities. During the whole state transition process, four important states are defined, which are a start state, an end state, a state of Single-WordRecognition (SWR), and a state of Multiple-Word-Recognition (MWR). The start state (β1 , S1 , S2 , β2 )s is defined with the buffer 2 containing all the words of a given sentence X and others empty. For example, the state composed of the four structures of step-0 in Table 1 is a start state. The end state (β1 , S1 , S2 , β2 )e with the buffer 1 containing all the words and others empty. For example, the state composed of the four structures of step-9 in Table 1 is an end state. Single-Word-Recognition state (β1 , S1 , S2 , β2 )swr and Multiple-WordRecognition state (β1 , S1 , S2 , β2 )mwr are to recognize entities. SWR is to identify whether a single word in stack 2 can constitute an entity. For example, the state composed of the four structures of step-1 in Table 1 is a state of SWR. The recognition results are as follows: 1) the word can constitute an entity of type ci , 2) the word cannot constitute an entity. At this time, the result is recorded as “not”, Csingle = C ∪ {not}. MWR is to identify whether the words in stack 1 and stack 2 can constitute an entity. For example, the state composed of the four structures of step-2 in Table 1 is a state of MWR. The recognition results are as follows: 1) the words can constitute an entity of type ci , 2) the word cannot constitute an entity, and 3) the words in stack 1 and stack 2 are correlated but do not constitute a complete entity. At this time, the result is recorded as “correlation”, Cmultiple = C ∪ {not, correlation}. If ci = not, the state is called a negative example; otherwise, it is a positive example. The transition system begins with a start state and ends with an end state. State transitions are accomplished by a set of transition actions. Transition actions have three types, and their logic is summarized as follows: – shift-a moves the first word from the buffer 2 to the stack 2. – shift-b moves the word in the stack 2 to the stack 1 and the first word from the buffer 2 to the stack 2. – shift-c moves all the words from the stack 2 to the buffer 1.

NTAM: A New Transition-Based Attention Model for Nested NER

277

The state transition is shown in Algorithm 1.

Algorithm 1: state transition

1 2 3 4 5 6 7 8 9 10 11

Data: A given sequence of N words X = {x0 , x1 , · · · , xn−1 } Input: (β1 , S1 , S2 , β2 )s  Output: Entity set M and its type for xi  X do Do a shift-a action to convert the state to (β1 , S1 , S2 , β2 )iswr and do Single-Word-Recognition; for xj  S2 do Do a shift-b action to convert the state to (β1 , S1 , S2 , β2 )jmwr and do Multiple-Word-Recognition; if (Pmultiple (cj | Pj ) == not) | (β2 == φ) then break; end end The state returns to (β1 , S1 , S2 , β2 )iswr and performs a shift-c action; end  return M ;

Table 1 shows the example of each transition used to recognize a nested entity “CD28 surface receptor”. The entity “CD28 surface receptor” contains a nested entity “CD28”. ‘φ’ denotes that the structure is empty. ‘-’ denotes that the current state does not do the recognition operation. SWR V indicates the value of Single-Word-Recognition result in the SWR state, and the state is (β1 , S1 , S2 , β2 )swr at this time. MWR V indicates the value of Multiple–WordRecognition result. NER indicates the recognized entity in the current state. Table 1. An example sequence of transition actions. Step Buffer 1

Stack 1

Stack 2

Buffer 2

Action SWR V MWR V

0

φ

φ

φ

CD28 surface receptor

shift-a –



1

φ

φ

CD28

surface receptor

shift-b protein



2

φ

CD28

surface

receptor

shift-b –

correlation

3

φ

CD28 surface receptor φ

shift-c –

protein

4

CD28

φ

5

CD28

6

CD28

7

φ

surface receptor

shift-a –



φ

surface

receptor

shift-b not



surface

receptor φ

shift-c –

not

CD28 surface

φ

φ

shift-a –



8

CD28 surface

φ

receptor φ

shift-c not



9

CD28 surface receptor φ

φ





receptor φ



NER

CD28

CD28 surface receptor

278

N. Gao et al.

When step-0 to step-1, the state transition system performs the shift-a action to move the first word “CD28” from the buffer 2 to the stack 2. At this time, the state transition system changes from the start state to the SWR state. Because the single word “CD28” is an entity, the recognition result of SWR is “protein”. Then the state transition system performs the shift-b action to move the word “CD28” from the stack 2 to the stack 1 and the first word “surface” from the buffer 2 to the stack 2 at step-1 to step-2. The words “CD28” and “surface” are correlated, but they do not constitute a complete entity; thus, the state transition system continues to perform the shift-b action and do MWR until the recognition result of MWR is “not” or the buffer 2 is empty. When the words “CD28” and “surface” in the stack 1 and “receptor” in the stack 2, the result of MWR is “protein” to recognize the entity “CD28 surface receptor”. Because the buffer 2 is empty at step-3, the state returns to the SWR state and performs the shift-c action after exiting the loop. At this time, step-4 is transformed from step-1. 3.3

Representation

We now introduce neural networks to learn the representations of an input sentence X and four structures. Words are represented by concatenating two vectors:   (1) ; vxwi χi = vxlm i where the context representation of vxlm is obtained through a pre-trained model, i and non-contextual representation of vxwi is obtained via pre-trained word embeddings. The structural representations β1 and β2 of the buffer 1 and buffer 2, respectively, are obtained by extracting the features of the word vectors in the structure through the forward and backward LSTM models. The semantic information is expected to be near to the word to be assessed without losing context information. Specifically:  −−−−→  (2) β1 = LSTM b10 , · · · b1n   ←−−−− (3) β2 = LSTM b20 , · · · b2n where bki ∈ R1×d denotes the d-dimensional vector representation of the i-th word in the buffer k. β1 and β2 are the outputs of the last hidden layer of the LSTM models. For SWR, because the stack 1 is empty, its structural representation S1 is not considered, and the structural representation S2 of the stack 2 is a vector of the word in the stack 2. For MWR, the model must focus on the correlation between the words in the stack 1 and stack 2. The attention mechanism is introduced to strengthen the association between words, which is helpful for the recognition of nested and long entities:   (4) Attention (Q, K) = softmax QK T ∗ Q + Q

NTAM: A New Transition-Based Attention Model for Nested NER

279

where Q ∈ Rn×d ,K ∈ R1×d , Attention (Q, K) ∈ Rn×d and the operator ∗ denotes the element-wise product. We use python’s “∗” as the element-wise product, and its usage is the same as numpy.multiply() in python. If their shapes differ, the smaller matrix will be extended to the same shape as the other matrix, and the filled values are the values of the smaller matrix. The structural representation of S1 is the result of paying attention to the word vectors in the stack 1 and stack 2 and passing the LSTM. The structural representation of S2 is the result of paying attention to the vectors in the stack 1 and stack 2 after the vectors of the stack 2 have been passed through the LSTM:    −−−−→  (5) S1 = LSTM Attention S1 , S2   −−−−→   S2 = Attention S2 , LSTM S1 (6) 

where S1 ∈ Rn×d denotes the matrix formed by the vectors of the words in the  stack 1 and S2 ∈ R1×d denotes the matrix formed by the vectors of the word in the stack 2. As shown in Fig. 1, words in the stacks are predicted whether they constitute an entity through the structural representations. 3.4

State Prediction

Compared with the original transition-based model, the nested NER task transforms from the prediction of an action to the prediction of a state, which predicts whether words between the structures can form an entity. The influence of previous actions on the subsequent actions is discarded to eliminate the error propagation caused by weak labels. For each transition, the state of the entire model is denoted by Pk : (7) Pk = [β1 ; S1 ; S2 ; β2 ] where β1 , S1 , S2 , β2 denote the vector representations of each structure of buffer 1,stack 1,stack 2,buffer 2. [; ] denotes a concatenation operation. For SWR, a single word in the stack 2 is predicted whether can form an entity according to the current state Pi . The prediction formula is as follows: Psingle (ci | Pi ) = softmax (MLPsingle (Pi ))

(8)

where MLP consists of two linear layers and an activation function, and ci ∈ Csingle . For MWR, the relationships between the words in the stack 1 and stack 2 are predicted according to the current state Pj . The prediction formula is as follows: (9) Pmultiple (cj | Pj ) = softmax (MLPmultiple (Pj )) where cj ∈ Cmultiple .

280

N. Gao et al.

Fig. 1. The structure of the NTAM. ‘+’ in the figure indicates doing ‘∗’ operation first and then doing ‘+’ operation. Stack 1 and stack 2 are used to store words that need to be recognized whether can form entities. Buffer 1 and buffer 2 are used to store other words in the sentence besides the words in stack 1 and stack 2. The attention mechanism strengthens the association between stack 1 and stack 2.

Because the nested NER problem and candidate entity confirmation problem are both treated as several classification problems, cross-entropy is used as our loss function: N

y (i) × P (i) (10) Loss = − i

where P is the prediction value of the model for the i-th category, y (i) is the onehot expression of the real category and N is the number of prediction types. (i)

4

Experiments

4.1

Datasets

To evaluate the effectiveness of our model and method, we conduct experiments on two public datasets: GENIA1 [16] and People’s Daily2 [17]. The datasets are divided into a training set and a test set at a ratio of 9:1. The statistics of datasets are shown in Table 2 and 3.

1 2

http://www.geniaproject.org/genia-corpus. https://doi.org/10.18170/DVN/SEYRX5.

NTAM: A New Transition-Based Attention Model for Nested NER

281

Table 2. Statistics of the GENIA dataset. Type

Train Entities Flat

Test Nested Entities Flat Nested

DNA

8904

6797

2107

1077

852

225

RNA

845

455

390

92

50

42

Protein

30771

25314 5457

3465

2866 599

Cell line

3449

2975

474

395

342

53

Cell type 6404

5560

844

613

541

72

Total

41101 9272

5642

4651 991

50373

Table 3. Statistics of the People’s Daily dataset. Type

4.2

Train Entities Flat

Test Nested Entities Flat Nested

Person

17909

17797 112

1942

1918 24

Location

25970

19368 6602

2727

2114 613

Organization 10227

4919

1089

579

Total

42084 12022

5758

4611 1147

54106

5308

510

Settings

For two different datasets, we use the pre-trained models of BioBERT [18] and BERT [19], and the pre-trained word vectors of BioWord2Vec3 [20] and Tencent4 [21], respectively. The dimensions of vxlm and vxwi are 768, 200. For all models, i the learning rate is set to 2e−5 and the batch size to 5. We train the models for five epochs and use the Adam optimizer. All experiments are performed on the NVIDIA GTX2080TI GPU platform with 11 GB of memory. We use strict evaluation metrics that an entity is confirmed correct when the entity boundary and the entity type are correct simultaneously. We employ precision, recall and F1 score to evaluate the performance. Hyperparameter α denotes the proportion of the training data from the corpus. For example, α is set to 0.1 and the corpus contains 1000 sentences, which means 100 sentences are used for training. The positive samples are constructed according to the entities in the sentences used for training. All positive examples participate in training. The negative samples are constructed according to the non-entity phrases, in which the words cannot constitute an entity. Hyperparameter γ denotes the proportion of the negative examples from the training data. γ is set to 0.3. Only 30% of negative examples participate in training.

3 4

https://github.com/cambridgeltl/BioNLP-2016. AILab https://ai.tencent.com/ailab/nlp/en/download.html.

282

4.3

N. Gao et al.

Baseline for Nested NER

The baselines of the GENIA dataset are as follows: zheng et al. [2] and Wang et al. [3] ues the sequence labeling-based method. Lu and Roth [4] are the first application of hypergraph to solve the problem of nested NER, and Katiyar and Cardie [5], Wang et al. [6] design the pyramid-structured through different networks. The span-based methods have been widely applied, such as Tan et al. [7], Shen et al. [8], Lou et al. [9], Wan et al. [10]. shen et al. [12] ues the MRC-based method. Wang and Lu [13] use the transition-based method. The baselines of the People’s Daily dataset are as follows: Li et al. [22] label nested entities from the inside to the outside using words as a unit and identify nested entities through multi-level feature extraction. Jin et al. [23] introduce a position-aware self-attention mechanism based on the BiLSTM model. Yuan et al. [24] propose a simpler and more effective method to introduce the lexical information into a character-based NER system. 4.4

Main Results

Tables 4 and 5 illustrate the performance of the NTAM and baselines on the GENIA and People’s Daily datasets. Specifically, compared with the state-of-theart models, the F1 scores of our model on the GENIA and People’s Daily datasets Table 4. Results (%) on the GENIA dataset. Method

Model

P

R

F1

75.90 79.98

73.60 78.51

74.70 79.24

hypergraph-based

Lu and Roth (2015) 72.50 Katiyar and Cardie (2018) 76.70 Wang et al. (2020) 79.45

65.20 71.10 78.94

68.70 73.80 79.19

span-based

Tan et al. (2020) Lou et al. (2022) Wan et al. (2022) shen et al. (2021)

77.40 78.50 80.74 80.89

78.30 78.44 79.30 80.54 81.77

sequence labeling-based Zheng et al. (2019) Wang et al. (2021)

79.20 78.39 77.92 80.19

MRC-based

shen et al. (2022)

83.24 80.35

transition-based

Wang and Lu (2018) Ours

78.00 77.73

70.20 73.90 87.18 82.18

Table 5. Results (%) on the People’s Daily dataset. Method

Model

P

R

sequence labeling-based Li et al. (2018) 94.22 84.58 Jin et al. (2022) 92.26 90.58 Yuan (2021) 95.37 93.92 transition-based

NTAM

93.21

F1 89.13 91.41 94.64

98.24 95.66

NTAM: A New Transition-Based Attention Model for Nested NER

283

are improved by 0.41% and 1.02%, respectively. The highest values of recall and F1 indicate that our model can discover more words that may constitute entities and recognize nested entities well. Meanwhile, non-entity words result in lower precision. Compared with Wang and Lu’s approach [13], our model significantly outperforms the original transition-based method with an increase of 8.28% in the F1 score. This shows that our model has a better ability to recognize nested entities by modifying the original model, and indicates that the state transition system and the attention mechanism are effective. For more analysis, see Sect. 4.5. Table 6 and Table 7 show the recognition effect of different types. Our method has good recognition ability for nested entities of various types. The recognition performance of’Person’ is poor, possibly because there are too few nested entities of this type in the test set, and more not-entities are recognized. Table 6. Results (%) of different types on the GENIA dataset. Type

Entities P R

F1

Flat P

R

F1

Nested P R

F1

DNA

73.73 86.26 79.50 73.81 87.68 80.15 73.39 80.89 76.96

RNA

83.00 90.21 86.45 81.82 90.00 85.71 84.44 90.47 87.35

Protein

80.95 89.37 84.95 80.78 90.61 85.97 81.83 83.47 82.64

Cell line

71.13 77.97 74.39 71.09 78.36 74.55 71.43 75.47 73.39

Cell type 70.80 81.89 75.94 70.03 82.07 75.57 77.33 80.56 78.91 Total

77.73 87.18 82.18 77.47 88.17 82.47 79.03 82.54 80.75

Table 7. Results (%) of different types on the People’s Daily dataset. Type

Entities P R

F1

Flat P

R

F1

Nested P R

F1

Person

96.66 98.35 97.50 98.39 98.59 98.49 55.88 79.17 65.52

Location

94.69 98.83 96.72 81.82 90.00 85.71 77.84 98.53 86.97

Organization 84.36 96.60 91.70 99.57 98.91 99.24 81.22 96.67 88.27 Total

93.21 98.24 95.66 97.40 98.48 97.94 79.31 97.30 87.39

Simultaneously, we analyze the performance of our model on the GENIA dataset using different amounts of training data, as shown in Table 8. It is noteworthy that with 10% of training data, the F1 score also reaches 76.33%, and the recall reaches 82.25%. This indicates that our model can identify words that are more likely to be entities under a small amount of labeled data.

284

N. Gao et al. Table 8. Results (%) on the GENIA dataset with different α. α

0.1

P

71.21 72.60 73.63 74.80 76.17 76.63 76.49 76.65 77.52 77.73

R

82.25 83.88 83.18 84.71 84.06 85.23 85.02 86.04 85.10 87.18

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

F1 76.33 77.83 78.12 79.45 79.92 80.70 80.53 81.07 81.13 82.18

4.5

Ablation Analysis

We study the effectiveness of each component of NTAM. As shown in Table 9, the attention mechanism strengthens the correlation between the structures with an increase of 1.81% in the recall and 1.34% in the F1 score on the GENIA dataset. This shows that through the attention mechanism, our model learned more about the relationships between words, thereby greatly improving the F1 score. Without the attention mechanism, our model is also better than the original model by a 6.94% improvement in the F1 score. This indicates that the new state transition system is the basis for the NTAM to achieve superior results. The observation from the results demonstrates that both of our contributions are experimentally validated. In addition, without the pre-trained word vectors of BioWord2Vec, the model’s performance drops. This indicates that rich word information is beneficial for the model to discover the relationship between words. We also use the pre-trained models of BERT rather than BioBERT. The performance gap indicates better pre-training can make the vector representation of words more suitable for specific domains. Table 9. Results (%) of ablation study. Model

GENIA P R

F1

People’s Daily P R F1

NTAM

77.73 87.18 82.18 93.21 98.25 95.66

NTAM w/o attention

76.78 85.37 80.84 91.43 97.79 94.50

NTAM w/o Word2Vec

77.21 86.35 81.52 92.31 98.48 95.30

NTAM w/ BERT

77.49 86.53 81.76 –





NTAM w/ BERT w/o attention 72.99 86.26 79.07 –





Table 10. Results (%) on the GENIA dataset with different γ. γ

P

R

F1

0.1 66.66 86.23 75.19 0.2 72.32 82.70 77.16 0.3 74.00 81.2

77.43

0.4 75.54 79.65 77.54

NTAM: A New Transition-Based Attention Model for Nested NER

4.6

285

Error Analysis

We sampled a set of experiments with poor F1 score to analyze the impact of the value of γ on the model, as shown in Table 10. With the increase of negative samples, the precision increases, but the recall and the growth in the F1 score decrease. This phenomenon may be caused by the positive samples and negative examples of the GENIA dataset having conflicts. For example, in “When the homeodomain from HB24 was compared...”, the “HB24” in this sentence is a DNA type entity. However, the “HB24” is not labeled as an entity in “The HB24 mRNA was absent....”, and the state at this time is a negative example, ignoring the case in which the single word “HB24” constitutes an entity. As γ continues to increase, conflicts between positive and negative samples increase, and the model cannot identify conflicting entities well. In addition, with the increase of negative samples, the training time of the model increases greatly. To reduce this conflict and have a reasonable training time, we only set the value of γ to 0.3.

5

Conclusion

This paper has proposed the NTAM to recognize nested entities. We integrate the information on transition actions into states and define four structures, three transitional actions, and two recognition processes to predict the relationships between words. The attention mechanism is introduced to strengthen the association between the structures. Our model outperforms previous state-of-the-art models on two public nested NER datasets and has an increase of 8.28% in the F1 score compared with the original method. Acknowledgments. The work described in this paper is supported by Zhejiang Provincial Natural Science Foundation of China (LGF22F020014), National Key Research and Development Program of China (2020YFB1707700), National Natural Science Foundation of China (62036009, U1909203).

References 1. Yan, H., Deng, B., Li, X., Qiu, X.: TENER: adapting transformer encoder for named entity recognition. arXiv:1911.04474 (2019) 2. Zheng, C., Cai, Y., Xu, J., Leung, H.F., Xu, G.: A boundary-aware neural model for nested named entity recognition. In: EMNLP-IJCNLP (2019) 3. Wang, Y., Shindo, H., Matsumoto, Y., Watanabe, T.: Nested named entity recognition via explicitly excluding the influence of the best path. In: ACL (2021) 4. Lu, W., Roth, D.: Joint mention extraction and classification with mention hypergraphs. In: EMNLP (2015) 5. Katiyar, A., Cardie, C.: Nested named entity recognition revisited. In: NAACL (2018) 6. Wang, J., Shou, L., Chen, K., Chen, G.: Pyramid: a layered model for nested named entity recognition. In: ACL (2020)

286

N. Gao et al.

7. Tan, C., Qiu, W., Chen, M., Wang, R., Huang, F.: Boundary enhanced neural span classification for nested named entity recognition. In: AAAI (2020) 8. Shen, Y., Ma, X., Tan, Z., Zhang, S., Wang, W., Lu, W.: Locate and label: a two-stage identifier for nested named entity recognition. In: ACL-IJCNLP (2021) 9. Lou, C., Yang, S., Tu, K.: Nested named entity recognition as latent lexicalized constituency parsing. In: ACL (2022) 10. Wan, J., Ru, D., Zhang, W., Yu, Y.: Nested named entity recognition with spanlevel graphs. In: ACL (2022) 11. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. In: ACL (2020) 12. Shen, Y., et al.: Parallel instance query network for named entity recognition. In: ACL (2022) 13. Wang, B., Lu, W., Wang, Y., Jin, H.: A neural transition-based model for nested mention recognition. In: EMNLP (2018) 14. Zhang, Y., Clark, S.: Transition-based parsing of the Chinese treebank using a global discriminative model. In: IWPT (2009) 15. Ji, Z., Xia, T., Han, M., Xiao, J.: A neural transition-based joint model for disease named entity recognition and normalization. In: ACL-IJCNLP (2021) 16. Ohta, T., Tateisi, Y., Kim, J.D.: The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In: HLT (2002) 17. Guo, G., Hua, L., Xuemin, X., Pu, Z.: The initial statistic analysis on tagged corpus of people’s daily. JSCL (2005) 18. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2019) 19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: ACL (2019) 20. Chiu, B., Crichton, G., Korhonen, A., Pyysalo, S.: How to train good word embeddings for biomedical NLP. In: BioNLP (2016) 21. Song, Y., Shi, S., Li, J., Zhang, H.: Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In: NAACL (2018) 22. Li, Y., He, Y., Qian, L., Zhou, G.: Chinese nested named entity recognition corpus construction. J. Chin. Inf. Process. 32, 19–26 (2018) 23. Jin, Y., Xie, J., Wu, D.: Chinese nested named entity recognition based on hierarchical tagging. J. Shanghai Univ. (Nat. Sci. Edn.) 27, 1–9 (2022) 24. Yuan, Z., Zhang, H.: Improving named entity recognition of Chinese legal documents by lexical enhancement. In: ICAICA (2021)

Learning Well-Separated and Representative Prototypes for Few-Shot Event Detection Xintong Zhang1 , Shasha Li1(B) , Bin Ji2 , and Ting Wang1(B) 1

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China {xintong z,shashali,tingwang}@nudt.edu.cn 2 National University of Singapore, Singapore, Singapore [email protected] Abstract. Event detection can be solved with two subtasks: identification and classification of trigger words. Depending on whether these two subtasks are handled simultaneously, event detection models are divided into the pipeline-based paradigm and the joint-based paradigm. For both paradigms, prototypical networks are an extremely important method to solve a few-shot event detection task. Prototypical networks classify through the distances between instances and prototypes. Therefore, both the distance between prototypes and prototype representations are the key to prototypical networks. Compared with the pipeline-based paradigm, the joint-based paradigm handles two subtasks at the same time to effectively avoid the error cascade. To make better use of prototypical networks in the joint-based paradigm, we propose PN-SR, a Prototypical Network with well-Separated and Representative prototypes. Specifically, we use representation learning methods to generate prototypes. At the same time, we add a loss factor to further enlarge the distances between prototypes of different classes and reduce semantic representation distances of instances with the same class. Experiments show our method outperforms many baselines by at least 26.15% F1 points under the same few-shot settings and achieves SOTA. Keywords: Prototypical network

1

· Few-shot event detection

Introduction

Event detection(ED) is a subtask of event extraction that mainly uses trigger words to verdict the occurrence of events and classify events to correct event types [2]. We use Example 1 as a case. An ED model should identify the trigger word president and classify it as Organization.Leadership. Example 1. Heidi Hadsell is president of Hartford Seminary. The traditional ED model follows the supervised learning method [2–4,11–13, 16], which uses a large number of predefined event-type data for training. These models can only recognize event types that have been learned before [7]. As a c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 287–298, 2023. https://doi.org/10.1007/978-3-031-44696-2_23

288

X. Zhang et al.

result, Few-Shot Learning(FSL) [21] is introduced into ED. By formulating ED as FSL, we can learn new event types through a few event examples, i.e.Few-Shot Event Detection(FSED) [14,15,17,19,23–27,29]. FSED can be further decomposed into trigger identification and event type classification [31]. Currently, FSED can be divided into two paradigms: pipelinebased paradigm and joint-based paradigm depending on how these two tasks are handled [17]. For both paradigms, prototypical networks [8] are an important method for solving few-shot classification [22,32] and are often used in FSED. Due to the pipeline-based paradigm having a cascading error [17,31], we focus on a joint-based paradigm with prototypical networks. At present, the joint-based paradigm models still have some shortcomings in using prototypical networks. On the one hand, prototypical networks exist in the prototype representation problem. Vanilla prototypical networks treat each instance equally. Therefore, the presence of an outlier will affect the representation of the prototype. On the other hand, it has a prototype distribution problem. Prototypes are often closely distributed in embedded space which leads to misclassification. In this paper, we propose PN-SR, a Prototypical Network with wellSeparated and Representative prototypes in a joint-based paradigm can overcome the weaknesses in prototypical networks. Specifically, we first use representation learning to achieve the prototypes directly which can solve the prototype representation problem. In this way, PN-SR refines the information in a class and makes the prototype contain more information while keeping the prototype dimension unchanged. Then, PN-SR adds the loss factor to make the prototype discretely distributed which works out the prototype distribution problem straightforward. Experimental results show that PN-SR achieves the best results on the dataset FewEvent. In summary, the main contributions of our work are as follows: (1) We add a loss factor to produce well-separated prototypes. (2) We utilize representation learning to directly learn the representation of the prototype. As far as we know, we are the first to introduce representation learning into FSED. (3) The experiments show that prototypical networks with well-separated and representative prototypes work better than a range of baselines.

2

Related Work

The purpose of ED is to identify the occurrence of events from text and determine the type of event. ED models typically use neural networks to extract features and perform well on benchmarks such as ACE2005. Many types of neural networks have been explored and applied to ED tasks such as CNN [1,2,5], RNN [4], GCN [12,18], and so on. In addition, the powerful ability of pre-trained language models [10] makes it often used to capture the semantic representation of words. However, due to the poor generalization of traditional ED models, we managed to solve the ED task in few-shot settings.

Learning Well-Separated and Representative Prototypes for FSED

289

ED can be divided into two subtasks: trigger identification and event type classification. Depending on how the two subtasks are managed, FSED models are divided into two categories, pipeline-based paradigm, and joint-based paradigm. In both paradigms, prototypical networks are a common method. The distance between prototypes and their representations is important to prototypical networks. Pipeline-based models have made improvements in both aspects. LoLoss [14] assigns a weight to each instance to generate a more accurate prototype. MatchLoss [15] proposes two loss factors to make the semantic representation of the same label close and to distance prototypes between different classes. DMB-PN [19] utilizes a multi-hop mechanism in memory networks to learn more distinguishable event prototypes. Compared to the pipeline-based paradigm, the joint-based paradigm has fewer models. PA-CRF [17] uses the vanilla prototypical network and uses prototypes to implement sequence labeling. OntoED [23] treats each word as an event trigger candidate which utilizes the connections between events to enhance prototypes. The joint-based paradigm differs from the pipeline-based paradigm, which handles two subtasks simultaneously to avoid cascading errors. However, although these joint-based paradigm models utilize prototypical networks to achieve good performance, the distance between prototypes is not enough and the method of generating prototypes is not concise. In contrast, we leverage the advantages of the joint-based paradigm to better utilize prototypical networks to solve FSED concisely.

3

Method

We first define the FSED task in Sect. 3.1. Then we introduce the overall model in Sect. 3.2 and each part of PN-SR in Sects. 3.3–3.5. Finally, we explain some details of model training in Sect. 3.6. 3.1

Problem Formulation

We define the event detection task as a sequence labeling task [17,28]. Given an event instances dataset D with an event-type set E = {e1 , e2 , · · · , eNe }. Each event type ei in D have mi instances. We define each sentence in D as a token sequence x = (w1 , w2 , · · · , wn ) with maximum n tokens and annotate label sequence of x as y. As a result, a dataset D = {(xji , yij ) | i ∈ [1, Ne ] , j ∈ [1, mi ]} consists of a series of example and label pairs (x, y). We follow the N-way-Kshot setting. The model extracts N (1 ≤ N ≤ Ne ) event types from D each time, with k + q (1 ≤ k + q ≤ mi ) instances of each event type. Specifically, the model is given two sets of data: a support set S of labeled data, and a query set Q of unlabeled data which predicts the label sequence y = {l1 , l2 , · · · , ln } of the instance in Q based on S. Formally, a FSL task T is defined as follow: S = {(xji , yij ) | i ∈ [1, N ] ; j ∈ [1, k]}

(1)

Q = {(xti , yit ) | i ∈ [1, N ] ; t ∈ [1, q]}

(2)

T = (S, Q)

(3)

290

3.2

X. Zhang et al.

Model

In this paper, we propose PN-SR based on prototypical networks with three modules, prototype generation module, prototype instantiation module, and sequence labeling module. Figure 1 shows the overall architecture of PN-SR. The prototype generation module aims to provide better-embedded representation for prototypes and words. The prototype instantiation module makes the prototype more representative and well-separated. The sequence labeling module uses the above prototypes to model the tag sequence of instance.

Fig. 1. The overall architecture of PN-SR. It consists of three modules, prototype generation module, prototype instantiation module and sequence labeling module.

3.3

Prototype Generation Module

Base Encoder. Base encoder aims to get semantic representations that include contextual information. Pre-trained language models can effectively capture knowledge from large amounts of data and facilitate various downstream tasks.

Learning Well-Separated and Representative Prototypes for FSED

291

As a result, we first take advantage of the power of BERT [10] to get contextual representation X of a token sequence x. X = {h1 , h2 , · · · , hn } = BERT (x)

(4)

where hi ∈ IRl1 , and l1 represents the word dimension encoded by BERT. Projection Layer. The projection layer aims to project the semantic representation of words into a lower dimensional space. The essence of prototypical networks is to learn an embedded space. In this space, we can classify words by calculating the distance between words and prototypes. Therefore, we use a Feed Forward Network (FFN) to project the representation for better embedding space. Xˆ = F F N (X ) = {hˆ1 , hˆ2 , · · · , hˆn } (5) where hˆi ∈ IRl2 (l2 ≤ l1 ), l2 represents the compressed dimension. Prototype Encoder. Prototype encoder aims to provide a prototype for each class. In this paper, we use the BIO sequence annotation approach. P = {P1 , P2 , · · · , PM }

(6)

where Pi ∈ IRl2 , and l2 is the dimension of prototypes. M represents the maximum number of prototypes. 3.4

Prototype Instantiation Module

Prototype Instantiation Module can be divided into Prototype Cohesion and Prototype Dispersion. Prototype Cohesion tries to bring representations that belong to the same class closer together to get a better prototype. Prototype Dispersion makes prototypes discretely distributed. Prototype Cohesion. A prototype is the best representation of a class. As a result, we want semantic representations belonging to the same class to be as close to the prototype as possible. Specifically, given a sentence x, we get the class label li based on its tag sequence y and embedded representations hˆi of trigger words. Then, We use squared Euclidean distance to calculate the distance between hˆi and all prototypes P. sim(Pj , hˆi ) = Pj − hˆi 2

(7)

According to the prototype definition, we choose the class label li represented by the prototype Pk closest to hˆi as its label. To build the loss function of Prototype Cohesion Lpc , we first calculate the possibility of hˆi belongs to li , denoted by: exp(−sim(Pk , hˆi )) P (li | hˆi , P) = M ˆ j=1 exp(−sim(Pj , hi ))

(8)

292

X. Zhang et al.

Then, we utilize a cross-entropy function to calculate Lpc . Lpc = −

1 Nsum

 

vi log P (li | hˆi , P)

(9)

hˆi ∈S Pk ∈P

where Nsum indicates the sum of trigger words in the support set, vi is a one-hot vector that represents the ground truth of li . Prototype Dispersion. To reduce misclassification, the Prototype Dispersion Module is extended. To be specific, we set a lower limit θ for distance. We use squared Euclidean distance to define the distance between two prototypes, denoted by: dis(Pi , Pj ) = Pi − Pj 2 (10) Since we use squared Euclidean distance, we also square the number when finding the average distance of the prototype avg(P). We construct avg(P) as follow: avg(P) =

M M 1  dis(Pi , Pj ) M 2 i=1 j=1

(11)

Generally, the loss function of Prototype Dispersion is denoted by: Lpd = log (| avg(P) − θ | +1) 3.5

(12)

Sequence Labeling Module

Conditional Random Fields(CRF) is a common method in sequence annotation. The emission score and transition score are two important components of CRF, which are achieved through prototypes. The emission score indicates the similarity between a word and all labels. The transition score indicates the likelihood that the labels are connected. For sentence x, its emission, transition scores, and total score are calculated in the same way as PA-CRF [17]. Finally, we use a softmax function to get Sequence Labeling Module’s loss function Lsl . As a whole, the PN-SR loss function is the sum of the three parts. L = Lpc + Lpd + Lsl 3.6

(13)

Model Training

The training and testing flow of the model is slightly different. We first get trained prototypes based on the source domain. Then, we match based on the association between the source domain and target domain prototypes. Specifically, we use a prototype allocation algorithm to realize this. A detailed discussion of this algorithm is in Sect. 4.4. During the testing phase, we fine-tuned the assigned prototype to accommodate different kinds of class labels. Besides, at this stage, we froze BERT [32] to avoid overfitting. During the matching

Learning Well-Separated and Representative Prototypes for FSED

293

process, we may encounter the situation that the well-trained prototype of the source domain is not enough to support the matching. In this case, we can randomly initialize a portion of the prototype during the test phase. According to our experimental results, the performance of the model can still be guaranteed when the ratio of random initialization is less than 0.75.

4

Experiments

The experiment mainly aims to illustrate the following issues: (1)to compare our model with other few-shot event detection models in different N-way-Kshot settings; (2)to analyze the domain adaption effect of PN-SR using different prototype matching algorithms; and (3)to prove the effectiveness of various parts of PN-SR. 4.1

Datasets

During the experiment, we use a total of three datasets. Due to the low number of instances of certain event types unsupported training, we removed these subtypes. ACE2005 is a benchmark dataset that has 8 supertypes and 33 subtypes. After deletion, we only use 30 subtypes but still maintained 8 supertypes. MAVEN [20] is a large-scale universal domain dataset with 168 subtypes which effectively alleviated the problem of data scarcity and low coverage in the field of event detection. In MAVEN, we also performed removal operations. FewEvent [19] is a benchmark dataset in FSED which has 19 supertypes and 100 subtypes. The event types in the training set of FewEvent will not appear in the test set. The training set contains 80 event types. Both the validation and testing sets contain 10 event types. It is worth noting that supertypes in the test set have appeared in the training set. This will be beneficial for our subsequent prototype matching work. 4.2

Baselines and Settings

To verify the validity of PN-SR, we compare the other models with different paradigms. For the pipeline-based paradigm, we select LoLoss [14], MatchLoss [15], and DMB-PN [19] as baselines. The joint-based paradigm models can be further divided into two types. One thinks each word in the sentence is a possible trigger word. It divides the sentence into words. For this kind of model, we use Proto [8], Match [6] and Relation [9] as baselines. The other treats the sentence as a whole, considering the sequential connections between words. For this type, we use PA-CRF [17] as a baseline. We use the uncased BERTBase to encode words in the sentence into the 768 dimensions. The compressed dimension l2 is 256. We set the average distance of the prototype θ to be greater than 350. The maximum sentence length is set as 128. We use Adamw as the optimizer in training with a learning rate of 1 × 10−5 . We set the number of train iterations to 10000 and the number of test iterations to 3000. For every 500 train iterations, we use the validation set to evaluate the performance of the model. We evaluate the performance of models with F1 points.

294

X. Zhang et al.

4.3

Main Results

Table 1 shows the comparison results between PN-SR and other models. We can draw the following conclusions from the table. Firstly, many joint-based models outperform pipeline-based models which proves the validity of joint-based models. Secondly, Proto exceeds Match and Relation in three few-shot settings which proves the advantage of prototypical networks in few-shot learning. Thirdly, compared to Proto, PA-CRF achieves higher F1 points, indicating that considering the sentence as a whole is more beneficial. Finally, our model PN-SR achieves improvements of 26.15%, 26.78%, and 33.36% on three settings respectively, which confirms the effect of well-separated and representative prototypes. Table 1. The overall result of FSED task on the FewEvent test set. We use F1 points (10−2 ) as an evaluation index. Bold shows the best results in all models. Paradigm

4.4

Model

5-way-5-shot 5-way-10-shot 10-way-5-shot

Pipeline-based LoLoss 31.51 MatchLoss 30.44 DMB-PN 37.51

31.70 30.68 38.14

30.46 28.97 34.21

Joint-based

52.85 46.02 29.83 64.45 91.23

40.08 30.88 18.46 58.48 91.84

Proto Match Relation PA-CRF PN-SR

45.94 39.93 28.91 62.25 88.40

Domain Adaption Analysis

We want to find a better initialization for the prototype of the test set which will facilitate convergence of the model. Therefore, we address the matching of prototypes between the train and test set by exploring the connections between them. For event types that exist in both the source and target domains, we will directly utilize them. For event types that do not appear in the source domain, we will use prototype allocation algorithms to allocate prototypes. We have experimented with the following prototype matching algorithms: (1) Random matching algorithm (RM): Randomly select a prototype from the training set for each prototype in the test set. The simplest matching method provides a benchmark value for evaluating matching algorithms; (2) Supertype matching algorithm (SM): Prototypes of event types belonging to the same supertype have similarity [29]. Based on this idea, we assign trained prototypes of event types that belong to the same supertype to the event types waiting for allocation; (3) Dependency matching algorithm (DM): According to the role of trigger words in event detection, event types can be divided into two categories: triggerdependent and context-dependent [30]. Trigger words in trigger-dependent can express overall the event semantics. On the contrary, context-dependent does not.

Learning Well-Separated and Representative Prototypes for FSED

295

We believe that the trigger-dependent event type is not suitable for prototype allocation. Because their semantics are too concentrated. Therefore, we prioritize selecting the context-dependent event type for allocation; (4) B label matching algorithm (BM): Compared to label I, label B has more training data. We assume that label B will have a better effect and prioritize their allocation. In addition, to maximize the utilization of well-separated prototypes, we will not allocate the same prototype repeatedly. We use ACE2005, MAVEN, and FewEvent train sets respectively as the source domain and use the Fewevent test set as the target domain. We present the domain adaption performance of PN-SR in Table 2. We can find the following conclusions from the table: (1) ACE2005 has the greatest semantic overlap with the target domain compared to the FewEvent train set and MAVEN. Using ACE2005 as the source domain achieves the best results among all matching algorithms which indicates the approximation of the source and target domains is proportional to the result. (2) BM is better than RM in some indexes, which confirms the number of samples affects the representation of the prototype. RM outperforms other algorithms in most indexes. We use RM as the final prototype matching algorithm. Table 2. The domain adaption result of PN-SR. We use Precision (P), Recall (R) and F1 points(10−2 ) as an evaluation index. Bold shows the best results in all algorithms. - indicates that the algorithm has no result.

4.5

Algorithm ACE2005 P R

F1

MAVEN P R

RM

88.67 93.01

90.78 81.29 85.55

83.36 87.21 89.63 88.40

SM

88.08

93.00

90.47







DM

86.71

92.44

89.48

80.18

85.51

82.76

84.06

88.97

86.44

BM

87.09

93.19 90.03

80.03

85.56 82.70

84.35

89.21

86.71

F1

FewEvent P R 83.28

89.47

F1 86.27

Ablation Study

To analyze the effects of various components in PN-SR, we conduct an ablation study on FewEvent and report the result in Table 3. From the table, we find that: (1) -Discretely Distributed Loss Factor: To investigate whether the loss factor affects the model. Compared with the original model, the F1 point decreased by 3.81% which proves the effectiveness of the training signal. (2) -Prototype Representation: To prove prototypes obtained through representation learning will be beneficial to few-shot classification. We removed this section and replaced it with the mean method. F1 decreased by 17.44%, which proves that representation learning is beneficial to prototypical networks. (3) -Prototype Representation and -Discretely Distributed Loss Factor decrease by 3.81% and 17.44% respectively, which proves that the representation learning is more effective for PN-SR.

296

X. Zhang et al.

Table 3. The ablation results of PN-SR. We use Precision (P), Recall (R) and F1 points(10−2 ) as an evaluation index. m denotes the margin between current model and PN-SR.

5

Model

P(±m)

R(±m)

F1(±m)

PN-SR

87.21

89.63

88.40

- Discretely Distributed Loss Factor

80.39 (−6.82)

89.65 (+0.02)

84.59 (−3.81)

- Prototype Representation

75.86 (−11.35) 66.64 (−22.99) 70.96 (−17.44)

Conclusion

In this paper, we propose a model named PN-SR, which can generate more well-separated and representative prototypes. Specifically, we add a loss factor to guide the model in generating well-separated prototypes. In addition, we acquire more representative prototypes through representation learning. Finally, experimental results show PN-SR creating a new state-of-the-art performance on FewEvent. Extensive analyses further discuss the ability in domain adaption and prove the validity of each component in PN-SR. Acknowledgements. This work was supported by Hunan Provincial Natural Science Foundation (Grant Nos. 2022JJ30668).

References 1. Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 365–371 (2015) 2. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multipooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 167–176 (2015) 3. Nguyen, T.H., Grishman, R.: Modeling skip-grams for event detection with convolutional neural networks. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 886–891 (2016) 4. Nguyen, T.H., Cho, K., Grishman, R.: Joint event extraction via recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 300–309 (2016) 5. Nguyen, T.H., Fu, L., Cho, K., Grishman, R.: A two-stage approach for extending event detection to new types via neural networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 158–165 (2016)

Learning Well-Separated and Representative Prototypes for FSED

297

6. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 7. Huang, L., Ji, H., Cho, K., Voss, C.R.: Zero-shot transfer learning for event extraction. arXiv preprint arXiv:1707.01066 (2017) 8. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 9. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) 10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 11. Feng, X., Qin, B., Liu, T.: A language-independent neural network for event detection. Sci. China Inf. Sci. 61(9), 1–12 (2018). https://doi.org/10.1007/s11432-0179359-x 12. Nguyen, T., Grishman, R.: Graph convolutional networks with argument-aware pooling for event detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 13. Liu, S., Li, Y., Zhang, F., Yang, T., Zhou, X.: Event detection without triggers. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 735–744 (2019) 14. Lai, V.D., Dernoncourt, F., Nguyen, T.H.: Exploiting the matching information in the support set for few shot event classification. In: Lauw, H.W., Wong, R.C.-W., Ntoulas, A., Lim, E.-P., Ng, S.-K., Pan, S.J. (eds.) PAKDD 2020. LNCS (LNAI), vol. 12085, pp. 233–245. Springer, Cham (2020). https://doi.org/10.1007/978-3030-47436-2 18 15. Lai, V.D., Dernoncourt, F., Nguyen, T.H.: Extensively matching for few-shot learning event detection. arXiv preprint arXiv:2006.10093 (2020) 16. Liu, J., Chen, Y., Liu, K., Bi, W., Liu, X.: Event extraction as machine reading comprehension. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1641–1651 (2020) 17. Cong, X., Cui, S., Yu, B., Liu, T., Wang, Y., Wang, B.: Few-shot event detection with prototypical amortized conditional random field. arXiv preprint arXiv:2012.02353 (2020) 18. Cui, S., Yu, B., Liu, T., Zhang, Z., Wang, X., Shi, J.: Edge-enhanced graph convolution networks for event detection with syntactic relation. arXiv preprint arXiv:2002.10757 (2020) 19. Deng, S., Zhang, N., Kang, J., Zhang, Y., Zhang, W., Chen, H.: Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 151–159 (2020) 20. Wang, X., et al.: Maven: a massive general domain event detection dataset. arXiv preprint arXiv:2004.13590 (2020) 21. Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: A survey on few-shot learning. ACM comput. Surv. (CSUR) 53(3), 1–34 (2020) 22. Xue, W., Wang, W.: One-shot image classification by learning to restore prototypes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 6558–6565 (2020)

298

X. Zhang et al.

23. Deng, S., et la.: OntoED: low-resource event detection with ontology embedding. arXiv preprint arXiv:2105.10922 (2021) 24. Chen, J., Lin, H., Han, X., Sun, L.: Honey or poison? solving the trigger curse in few-shot event detection via causal intervention. arXiv preprint arXiv:2109.05747 (2021) 25. Lai, V., Dernoncourt, F., Nguyen, T.H.: Learning prototype representations across few-shot tasks for event detection. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) (2021) 26. Lai, V.D., Nguyen, M.V., Nguyen, T.H., Dernoncourt, F.: Graph learning regularization and transfer learning for few-shot event detection. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2172–2176 (2021) 27. Shen, S., Wu, T., Qi, G., Li, Y.F., Haffari, G., Bi, S.: Adaptive knowledgeenhanced Bayesian meta-learning for few-shot event detection. arXiv preprint arXiv:2105.09509 (2021) 28. Veyseh, A.P.B., Van Nguyen, M., Trung, N.N., Min, B., Nguyen, T.H.: Modeling document-level context for event detection via important context selection. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5403–5413 (2021) 29. Zheng, J., Cai, F., Chen, W., Lei, W., Chen, H.: Taxonomy-aware learning for fewshot event detection. In: Proceedings of the Web Conference 2021, pp. 3546–3557 (2021) 30. Liu, J., Chen, Y., Xu, J.: Saliency as evidence: event detection with trigger saliency attribution. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4573–4585 (2022) 31. Li, Q., et al.: A survey on deep learning event extraction: approaches and applications. IEEE Trans. Neural Netw. Learn. Syst. (2022) 32. Ji, B., Li, S., Gan, S., Yu, J., Ma, J., Liu, H.: Few-shot named entity recognition with entity-level prototypical network enhanced by dispersedly distributed prototypes. arXiv preprint arXiv:2208.08023 (2022)

Poster: Machine Learning for NLP

A Frustratingly Easy Improvement for Position Embeddings via Random Padding Mingxu Tao , Yansong Feng(B) , and Dongyan Zhao Wangxuan Institute of Computer Technology, Peking University, Beijing, China {thomastao,fengyansong,zhaodongyan}@pku.edu.cn

Abstract. Position embeddings, encoding the positional relationships among tokens in text sequences, make great contributions to modeling local context features in Transformer-based pre-trained language models. However, in Extractive Question Answering, position embeddings trained with instances of varied context lengths may not perform well as we expect. Since the embeddings of rear positions are updated fewer times than the front position embeddings, the rear ones may not be properly trained. In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to architectures of existing pre-trained language models. We adjust the token order of input sequences when fine-tuning, to balance the number of updating times of every position embedding. Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions, especially when models are trained on short contexts but evaluated on long contexts. Our code and data will be released for future research.

Keywords: Pre-trained Language Model Embedding · Contextual Representation

1

· Absolute Position

Introduction

Pre-trained language models [1,2,4,7] have achieved great success in various natural language processing tasks, including text classification, relation extraction, and extractive question answering (QA). These models with Transformer [17] have powerful ability to model local context, which plays a vital role in question answering [8,12]. Compared to models of text classification whose predictions are mainly based on sentence-level representations, many extractive QA models have to determine the start and end boundaries of answer spans from all tokens in context. Extractive QA models tend to pick out answer from the neighbour words of overlapping words between context and questions [3,14], thus should be sensitive to the relative positions of words in context. Thus, how to represent the positional relationships among words is important to a QA model. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 301–313, 2023. https://doi.org/10.1007/978-3-031-44696-2_24

302

M. Tao et al.

Transformer-based models merely employ position embeddings to identify the order of tokens, thus encode the positional relationships among tokens. Many popular Transformer-based models, like BERT [2] employs absolute position embedding, which can be considered as a set of learnable vectors. For each token, its corresponding position embedding will be appended to its token embedding to form the final input representation. Therefore, when fine-tuning, if the token sequence is shorter than the maximum input length (e.g., 512 for BERT), the rear part of position embedding vectors will not be equally updated as the front ones. In practice, it is impossible that all instances have a length exactly equal to the maximum input length of pre-trained language models, thus, the rear part of position embeddings might be updated fewer times than the front ones. This may prevent QA models from producing more accurate representations for tokens of answers that are located at the end of the context. Many recent studies also concentrate on the topic of Train Short, Test Long [9,13,15]. They adopt relative position embeddings to prevent the situation that part of absolute position embeddings cannot be updated. However, these methods need to modify the attention mechanism of Transformer layers and pre-train the language models again. Unlike these studies, we focus on to enhance existing models with absolute position embeddings. In this paper, we first conduct a pilot experiment in extractive QA to show how insufficient training will affect the quality of position embeddings. We propose a simple but effective method, Random Padding. We reduce the updating times of the front part of position embeddings, and make the rear ones to be updated more times, via randomly moving padding tokens. To examine whether Random Padding can be effective to improve QA models, we first experiment in a severe scenario, where models are trained on instances with shorter context but have to perform inference over longer contexts. We also experiment on datasets whose training and test sets have similar distributions of context lengths. Experimental results show Random Padding can improve QA models in both scenarios. We further provide empirical analyses, which reveal that Random Padding plays an important role to improve the representations of rear tokens in an input sequence. Our main contributions are as follows: (1) We propose a simple but effective method, named Random Padding to improve the embeddings of rear positions, especially when the model is trained on short contexts. (2) We inject Random Padding into popular Transformer-based QA models without any modification to their architectures, and help the models to predict more accurately when answers are located at rear positions. (3) We further reveal that Random Padding can improve models on public benchmark datasets of extractive QA and documentlevel relation extraction.

2 2.1

Background Task Definition

In extractive question answering, a model should extract a text span from a given passage or document to answer the question. The given passage or document is

An Easy Improvement for Position Embeddings via Random Padding

303

also called as context. Formally, the extractive QA task can be defined as: Given a question Q and a context C = c0 , · · · , cNc −1 , the model should select a text span cs , · · · , ce  (0 ≤ s ≤ e < Nc ) from C to answer Q, where Nc is the number of words in C, s and e are the start and end boundaries of the answer, respectively. 2.2

Investigated Model

Pre-trained language models (PLMs) usually receive a sequence of tokens as input. Here, we take BERT [2] as an example to introduce how to compose the input of PLM. Following BERT’s original pre-processing, we utilize special token [CLS] and [SEP] to separate question and context. Since BERT has to receive input sequences with a fixed length, the input sequences shorter than maximum length will be appended by padding tokens ([PAD]). Therefore, in vanilla BERT, the input sequence containing the given question and context can be composed as: [CLS] question [SEP] context [SEP] paddings Formally, assume the question and the context consist of Mq and Mc tokens after tokenization, respectively. Then, we can regard [CLS] located at the 0-th position, and correspondingly the context includes tokens located at from Mq +2 to Mq + Mc + 1. For ease of exposition, we denote the number of non-padding tokens as m = Mq + Mc + 3 and the maximum input length as n. Thus, there should be (n − m) padding tokens. In a general extractive QA framework using PLM [1,2,7], for an input sequence with m non-padding tokens, we denote their representation vectors m−1 as {Ti }i=0 , Ti ∈ RH . We then employ two trainable vectors S, E ∈ RH to predict the start and the end boundaries of an answer span. The probability of token s being the start boundary of an answer span can be written as: exp(S · Ts ) . Prob(start = s) = Mq +Mc +1 exp(S · Ti ) i=Mq +2 Similarly, the probability of token j being the end boundary will be: exp(E · Te ) Prob(end = e) = Mq +Mc +1 . exp(E · Ti ) i=Mq +2 Finally, the score of an answer span from the s-th token to the e-th token can be simplified as S · Ts + E · Te . The ultimate goal of an extractive QA model is to select the maximal S · Ts + E · Te from any s, e pair with a constraint of Mq + 2 ≤ s ≤ e < Mq + Mc + 2. Figure 1 illustrates the process of an extractive QA framework based on PLM with absolute position embeddings. As previously mentioned, we add padding tokens behind short input sequences and mask these tokens during fine-tuning.

304

M. Tao et al.

The gradients corresponding to masked tokens will not be back-propagated (the grey parts in Fig. 11 ). 2.3

Pilot Experiment

We observe that the instances in a QA dataset have various question and context lengths, for example, SQuAD [11]. The combined token sequence of question and context after pre-processing are usually shorter than the maximum input length of popular PLM like BERT. Since gradients of masked tokens cannot be back-propagated, the position embeddings corresponding to padding tokens will not be updated. Thus, embedding vectors of front positions can be updated more times than the vectors of rear positions. We wonder whether the insufficient fine- Fig. 1. The process of QA model with absolute position embeddings. tuning can make PLM have an inferior ability to represent rear tokens. As a pilot experiment, we train a BERT-base model on the Wikipedia domain of TriviaQA [5]. We record how many times a position embedding has been updated during the fine-tuning process. We find that the first position embedding (i.e., for position No. 0) is updated in every step, however, the last one (i.e., for position No. 511) are updated in only 62.58% of all steps. We further examine the differences in model performance when predicting answers at different positions. To be concise, we mainly focus on the predicted start boundary Ps and regard it as the position of whole answer. As shown in Table 1, the model can achieve a 65.36% F1 to find answers Table 1. Average F1 scores of the subappearing in the first quarter of positions, sets divided by answer position. while it gets only a 57.03% F1 in the last quarter. It shows the model tends to make Answer Position Average F1 more mistakes when predicting answer at 0 ≤ Ps < 128 65.36 ± 0.49 the rear positions of the token sequence. 128 ≤ Ps < 256 58.45 ± 0.74 256 ≤ Ps < 384 57.42 ± 0.63 From the first quarter to the last, we find 384 ≤ Ps < 512 57.03 ± 0.51 the average F1 score over five runs keeps decreasing. Although this is not a strictly fair comparison, it still shows that QA models tend to provide more incorrect predictions on the tokens whose position embeddings have been fine-tuned fewer times. 1

The grey parts represent the weights corresponding to masked tokens, whose gradients cannot be back-propagated. The representation vectors, embedding vectors and attention scores of non-padding tokens are shown by coloured areas.

An Easy Improvement for Position Embeddings via Random Padding

3

305

Our Method: Random Padding

As shown in our pilot study, when we train a model on instances of short contexts, embeddings at the front positions can be updated much more times than those at rear positions. Therefore, it is intuitive to balance the updating times over the whole range of positions, i.e., to reduce updating times of front position embeddings and to reallocate more updating to rear ones. Recall that when fine-tuning a PLM for extractive QA, we only update the position embeddings of non-padding tokens. Since padding tokens are always at the rear positions of whole input sequence, these rear position embeddings are often ignored in the scheme of absolute position embedding. If padding tokens can be randomly placed in the whole sequence during fine-tuning, we can expect that every position embedding has almost equal chance to be updated or ignored. However, if we insert padding tokens into question or context, it will change the positional relationships of non-padding tokens, which might hurt model performance. Therefore, we should pre- Fig. 2. The process of QA model enhanced serve the question tokens and the con- by Random Padding. text tokens as contiguous sequences. Specially, during fine-tuning, we propose to move a random number of padding tokens to the front of the input sequence, as shown in Fig. 2. Then non-padding tokens will be pushed towards the end of input sequence, so that the rear position embeddings can be updated. At the Fine-tuning Stage. Similar to Sect. 2.2, we assume the upper limit of input length is n tokens. For an input sequence with m non-padding tokens (including [CLS], [SEP], question and context), there should exist (n − m) padding tokens at the end. Then, we randomly select an integer k ∈ [0, m − n], and move k padding tokens from the rear to the front of the input sequence. The resulting input sequence will be: [CLS] {k[PAD]} question [SEP] context [SEP] {(m − n − k)[PAD]} Note that we keep [CLS] at the 0-th position, as the first token of input sequence. The reason is, under the widely adopted setting of SQuAD 2.0 [10], QA models use the scores of [CLS] to decide whether it can find answers for given questions. Therefore, we believe [CLS] should be arranged at a fixed position during fine-tuning, to reduce interference and to accelerate convergence.

306

M. Tao et al.

At the Inference Stage. We do not move padding tokens during inference, and just feed the original input sequence to the trained model: [CLS] question [SEP] context [SEP] {(m − n)[PAD]} Note that the input sequence during inference will be the same as baseline method. To summarize, we simply adjust the token order of input sequence when finetuning, by moving a random number of padding tokens to the position between [CLS] and question tokens. Our method does NOT make any modifications to the architectures of existing PLMs, thus is convenient to be employed by PLMs with absolute position embeddings.

4

Experiments

We experiment on three different PLMs to answer the following questions: (1) When models are trained on short contexts but evaluated on long contexts, can Random Padding improve the performance of QA models? (2) If models are trained and evaluated on datasets with similar distributions of context lengths, can Random Padding still work? (3) Does Random Padding actually improve model performance more when answers are located at the end of contexts? (4) How does Random Padding play a role to improve QA models? 4.1

Dataset Preparation

Similar to Sect. 2.3, we experiment on the Wikipedia domain of TriviaQA [5]. Since we cannot access the official test set, we randomly split the official development set into two equal sub-sets for validation and test. To better understand the behaviours of our method, we investigating the models with or without Random Padding on various training/testing conditions, e.g., training set and test set may have different distributions of context lengths. Specifically, we prepare two different kinds of datasets. The first one is to truncate all long context to a fixed length, e.g., 100 or 800 words. We denote the datasets truncated to a fixed length L as DL , and its train val test , DL , DL , respectively. corresponding training/validation/test sets as DL The second type is to truncate long contexts to a range of lengths. We pre-define the minimum and maximum length limitations as L1 and L2 . For each instance in original TriviaQAWiki dataset, we randomly select an integer k ∈ [L1 , L2 ] and truncate its context to k words. Similarly, its traintrain val test , DL , DL , respectively. ing/validation/test sets are DL 1 ∼L2 1 ∼L2 1 ∼L2 These two types of datasets correspond to two practical scenarios. For the automatically annotated datasets, like TriviaQA [5], their contexts are gathered from multiple documents relevant to the questions, and truncated to a fixed length when creating the datasets. For the manually annotated datasets, like SQuAD v1 & v2 [10,11], human annotators write questions and answers for every single passage, where the contexts are not truncated and have various lengths.

An Easy Improvement for Position Embeddings via Random Padding

4.2

307

Implementation Details

We investigate three different extractive QA models, with the base version of BERT [2], RoBERTa [7], and ELECTRA [1], respectively. We fine-tune all models under baseline settings, which takes the original pre-processed sequences as input (with the same form in Sect. 2.2). For each model, we employ Adam as optimizer and run experiments five times with different seeds. For every run to train baseline models, we also record the shuffled orders of training instances in every epoch. When fine-tuning models enhanced by Random Padding, we utilize the same hyper-parameters as baseline models. In every epoch over five runs, we also keep the orders of training instances identical to baseline settings, to eliminate the stochastic influence of training orders on model performance.

5 5.1

Main Results Train Short, Test Long

We first look at whether Ran- Table 2. Performance of models which are trained dom Padding can help models on short contexts but evaluated on long conval test to obtain better embeddings texts (D100∼800 and D100∼800 ). ELECTRA and of rear positions, when there RoBERTa are abbreviated as ELEC and RoBE. do not exist long contexts Training Validation Test Model in the training set. SpecifiSet F1 EM F1 EM cally, all models are trained BERT 58.780.25 52.170.51 58.750.45 51.870.47 +RP 59.860.47 54.120.40 59.820.40 53.600.48 on instances with short conELEC 61.670.21 55.540.08 61.310.15 54.800.16 train train D 100 text only (e.g., D100 defined +RP 63.320.35 57.160.62 62.760.44 56.450.63 RoBE 61.740.17 55.390.35 61.070.43 54.440.62 in Sect. 4.1), but have to +RP 62.980.16 56.730.32 62.440.35 55.940.58 be evaluated on instances BERT 62.610.13 57.020.32 62.910.15 56.940.27 +RP 63.460.32 57.840.51 63.970.40 58.010.49 with longer contexts (e.g., ELEC 66.160.28 60.660.57 66.140.28 60.300.55 train val test D D100∼800 and D100∼800 ). It is 250 +RP 66.880.14 61.460.25 66.940.27 61.410.34 RoBE 65.110.20 59.440.39 65.010.40 59.000.37 a practical and hard scenario, +RP 65.600.20 60.010.12 65.470.33 59.630.40 where training data and test BERT 64.030.14 58.480.46 64.790.29 58.960.37 +RP 64.900.27 59.550.35 65.210.33 59.550.34 data are sampled from differELEC 67.760.35 62.610.40 68.070.34 62.280.44 train D ent distributions. 400 +RP 68.090.14 62.770.25 67.940.21 62.370.23 RoBE 66.670.16 61.250.19 66.340.26 60.750.29 As shown in Table 2, when +RP 66.820.09 61.380.16 66.780.30 61.260.27 models are trained on the contexts with around 100 train ), Random Padding can bring an improvement more than +1% words (D100 F1 on the test set, specifically +1.07% for BERT, +1.45% for ELECTRA, train as the training set, Ranand +1.37% for RoBERTa. When we employ D250 dom Padding can only bring an improvement of +1.06%/+0.80%/+0.46% F1 for BERT/ELECTRA/RoBERTa, respectively. It is not surprising that models train train train v.s. D250 or D400 ) can gain more improvetrained on shorter contexts (D100 ment from Random Padding, since more position embeddings at rear positions cannot be updated when models are trained on shorter contexts.

308

M. Tao et al.

Table 3. Model performance on the datasets whose training set and test set have similar distributions of context lengths. (a) Model performance on D800 Model

Validation F1

EM

(b) Model performance on D100∼800

Test F1

Model EM

Validation F1

EM

Test F1

EM

BERT 60.740.28 54.480.33 61.450.23 55.540.26 +RP 61.180.09 54.920.30 61.830.33 55.980.35

BERT 63.650.16 58.110.34 64.330.41 58.360.39 +RP 64.400.08 58.970.16 65.060.29 59.280.24

ELEC 64.560.29 58.920.28 65.480.23 59.590.14 +RP 64.770.17 59.070.25 65.700.26 59.810.38

ELEC 67.140.18 61.750.23 67.210.20 61.660.24 +RP 67.350.13 61.810.25 67.500.14 61.840.26

RoBE +RP

RoBE +RP

5.2

63.110.26 56.870.36 63.240.14 57.230.22 63.240.17 57.080.14 63.500.30 57.440.27

65.770.21 60.220.38 66.050.15 60.210.16 66.110.28 60.510.40 66.370.19 60.500.35

Train/Test with Similar Context Lengths

We also wonder whether Random Padding can help all three PLMs to gain improvement when training set and test set have similar distributions of context lengths. Here, we will use the second type of datasets created in Sect. 4.1. For train val , validated and tested on D800 and example, all models will be trained on D800 test D800 , respectively. Table 3 shows the performance of three PLMs with and without our Random Padding strategy. We can see that Random Padding brings +0.73% F1 improvetest , +0.29% for ELECTRA and +0.32% for RoBERTa. ment for BERT on D100∼800 When trained and tested on D800 , models can also gain improvement from Random Padding, specifically +0.38% F1 for BERT, +0.22% F1 for ELECTRA, and +0.26% for RoBERTa. We find all three PLMs gain more improvement from Random Padding on test test than D800 . It is similar to the results in Sect. 5.1, showing that if the D100∼800 training set consists of more instances with short contexts, Random Padding will be more effective to improve model performance. Comparing the results of three PLMs on the same dataset, we also find Random Padding boost the performance of BERT more than RoBERTa and ELECTRA. It might because BERT are pre-trained with sequence length of 128 for 90% of the steps, while ELECTRA and RoBERTa are pre-trained on sequences with 512 tokens for all steps. Thus, ELECTRA and RoBERTa obtain better position embeddings from pre-training than BERT. Moreover, ELECRTA and RoBERTa are pretrained with larger batch size and on larger corpora, which enhances their ability to generate contextual representations, which, we guess, leaves less space for Random Padding to improve the performance of ELECTRA and RoBERTa.

6

Analysis and Discussions

Experimental results reveal that Random Padding can effectively improve extractive QA models, especially when the models are trained on short contexts but evaluated on long contexts (Table 2). Considering Random Padding can make rear position embeddings updated more times, we wonder whether

An Easy Improvement for Position Embeddings via Random Padding

309

it brings more improvement when models predict answers at the rear part of context. We also wonder how Random Padding improves QA models. val Table 4. Improvement (ΔF1) from Random Padding on every Segment of D100∼800 test and D100∼800 .

(a) Results on the valid set of D100∼800 Training

Seg.

Set

F1 Scores BERT

(1)

train D100

V100 72.030.55 (2) V100 38.971.00

train D250

V250 68.760.24 (2) V250 36.650.65

train D400

V400 67.030.19 (2) V400 37.190.99

6.1

(1)

(1)

ΔF1

BERT+RP

(b) Results on the test set of D100∼800 Training

Seg.

Set

F1 Scores BERT

(1)

71.880.47 41.880.72

–0.15 +2.91

train D100

T100 71.920.85 (2) T100 39.670.57

69.320.47 38.631.06

+0.56 +1.98

train D250

T250 68.530.15 (2) T250 39.601.00

67.860.35 38.390.91

+0.83 +1.21

train D400

T400 67.280.50 (2) T400 41.921.32

(1)

(1)

ΔF1

BERT+RP 71.140.54 43.430.71

–0.78 +3.76

69.480.35 41.241.35

+0.95 +1.64

67.710.32 42.202.02

+0.43 +0.28

Analysis on Answer Positions

To study the relationship between improvement from Random Padding and (1) (2) answer positions, we divide test set S into two segments, SX and SX . The first (1) segment, SX , contains the instances, at least one of whose answers is located in the front X tokens of the input sequence, while all other instances are gathered (2) into the second segment, SX . train , whose contexts are truncated We train a QA model with BERT on DX to around X words (100, 250, or 400). We then split the validation set and test set into two segments according to X, to examine whether models with or without Random Padding perform differently regarding answer positions. Here, we val test and D100∼800 as our validation and test sets. To be simplified, employ D100∼800 we denote them as V and T in Table 4, which shows the improvement brought by Random Padding on each segment. We find Random Padding brings signifi- Table 5. Performance on fixed cant improvement on the Second Segment, for division of segments. (2) example, +3.76% F1 improvement on T100 Seg- Training train when trained on D100 . However, we sur- ment Set Model F1 Score ΔF1 BERT 38.971.00 train prisingly find that models can also obtain 2.91 D100 +RP 41.880.72 no more than +1% F1 improvement for the BERT 45.780.46 (2) train 1.75 V100 D250 +RP 47.530.77 train , modFirst Segment. When trained on D100 BERT 49.520.46 train 1.68 D400 els even perform a little worse on the First +RP 51.200.40 Segment after enhanced by Random Padding. BERT 39.670.57 train 3.76 D100 +RP 43.430.71 It reveals the main contribution of Random BERT 47.380.86 (2) train 1.72 Padding is to improve model performance on T100 D250 +RP 49.100.70 BERT 51.51 1.13 train the instances whose answers are all located at 0.93 D400 +RP 52.440.66 rear positions.

310

M. Tao et al.

We also notice when X is 400, Random Padding brings less improvement (1) (2) test on the Second Segment of D100∼800 than the First Segment. In T400 and T400 , (1) there exists 4,247 and 462 instances respectively, while the numbers for T100 and (2) T100 are 2,786 and 1,923. It is not surprising that the volatility of scores might (2) rise when evaluated on such small dataset as T400 , which causes the results of X = 400 to become an outlier. Random Padding v.s. Longer Context. The results in Table 2 and 4 indicate that models trained on longer contexts will gain less improvement from Random Padding. Considering the main contribution of Random Padding mentioned above, we are interested to see whether the improvement on the Second Segment will decrease if models are trained on longer contexts. Here, we train BERT train train train , D250 , or D400 ), models on instances with different context lengths (D100 (2) (2) and then evaluate the models on a fixed division of segments (V100 and T100 ). train , Random The results are shown in Table 5. For the model trained on D100 (2) (2) Padding can bring +2.91% F1 and +3.76% F1 improvement on V100 and T100 . When we use training set with longer contexts, the improvement decreases to +0.93%∼+1.75% F1. It indicates the models trained on longer contexts will gain less improvement when predicting answers at rear positions. One reason may be that longer contexts can also make rear position embeddings to be updated more times, which plays similar role as Random Padding. 6.2

How Random Padding Improves QA Performance?

To study how Random Padding works, we consider the improvement on a specific instance in two categories: (1) Baseline model fails to find the right answer, while our improved model can make corrections to wrong predictions. (2) Baseline model has almost found the right answer, but it selects redundancy tokens around the ground-truth, which are then filtered out by our improved model. We provide example cases in Table 62 . Table 6. Different types of examples corrected by our method. Type

Example Case

Question: In the TV series Thunderbirds, Parker was chauffeur to whom? Context: Parker ... appears in the film sequels Thunderbirds Are Go (1966) ... Parker is employed at CW Mansion by Lady Penelope, serving as her butler and chauffeur... Find Precise Boundaries Question: Which metal is produced by the Bessemer Process? Context: The Bessemer process was the first inexpensive industrial process for the mass-production of steel from molten pig iron prior to the open hearth furnace ... Make Corrections

2

The underlined texts are wrong predictions given by baseline models. Red texts are correct predictions given by the improved model. Texts with yellow background are golden answer.

An Easy Improvement for Position Embeddings via Random Padding

311

For every instance, we compare the predictions provided by baseline and improved models. If the prediction of baseline model obtains a lower F1 score than improved model, we will compare the two predictions. If there exists no common sub-sequence between the two sequences, we will consider improved model selects the correct answer but baseline model not. Otherwise, the improved model will be regarded to find a more precise answer boundary. train as baseline. We then enhance the We take the BERT model trained on D100 train model by Random Padding, or longer contexts (D400 ). All models are trained test with five different random seeds and tested on D100∼800 . Among the instances improved by Random Padding, 36.11% ± 2.75% of them are improved by finding more precise boundaries, while the rest are because of making corrections to wrong answers. However, for the instances improved by training with longer contexts, 31.73% ± 2.59% of them can be attributed to more precise boundaries. From the results, we think that Random Padding can enhance the ability of QA models to better represent the positional information between neighbour tokens, which leads more improvement on answer boundaries.

7

Results on More Benchmark Datasets

Now we proceed to examine the effect of Ran- Table 7. Performance on public dom Padding on more benchmark datasets. extractive QA benchmark data We train and evaluate QA models with var- sets. ious PLMs on NaturalQuestions [6], HotModel NaturalQ HotpotQA SQuAD v2 potQA [19], and SQuAD 2.0 [10]. Since these BERT 78.180.24 73.410.29 77.710.55 benchmarks provide validation set merely, we +RP 78.670.33 74.230.15 78.000.47 ELEC 80.71 76.550.13 83.850.34 randomly split the original validation sets to +RP 80.880.18 0.13 76.670.28 83.950.34 RoBE 80.430.18 76.870.28 82.850.26 two parts as our validation and test sets. As shown in Table 7, we can observe our +RP 80.740.27 77.420.26 82.990.18 simple Random Padding helps BERT obtain +0.82% F1 improvement on HotpotQA, +0.49% F1 on NaturalQuestions, and +0.29% F1 on SQuAD v2, while ELECRTA and RoBERTa can only gain improvement of around +0.3% F1 or even no significant improvement. It is similar to the results in Sect. 5. Since ELECTRA and RoBERTa have been trained with larger batch size and larger corpora, there is less space left to Random Padding to improve. Besides extractive question answering, we Table 8. Model performance on also study the document-level relation extrac- Re-DocRED. tion (RE) task, where models should provide Model Test F1 Test Ign F1 high-quality representations to encode local BERT (B) 73.570.20 72.740.19 73.920.09 73.100.14 context [18,20]. As discussed in Sect. 6, Random +RP (L) 76.060.11 75.400.09 Padding can enhance model’s ability to encode BERT RP 76.180.10 75.530.11 relative positions of neighbour tokens, which RoBERTa (B) 75.290.19 74.570.20 75.420.23 74.710.31 provides important information in representing +RP RoBERTa (L) 77.780.24 77.100.25 local context. Here, we employ the state-of-the- +RP 78.000.25 77.340.27 art RE framework, ATLOP [21], and examine both BERT and RoBERTa as the PLM component in ATLOP. Table 8 shows the model performance on the

312

M. Tao et al.

Re-DocRED [16] benchmark. We find BERT can gain improvement of +0.35% F1 and +0.36% Ign F1 from Random Padding. The results also show Random Padding improves BERT more than RoBERTa, and improves base models more than large models.

8

Conclusion

In this work, we propose a simple strategy, Random Padding, to improve the performance of extractive QA models, especially when they are trained on short contexts but evaluated on longer contexts. Our method only re-organizes the input token sequences when fine-tuning, without any modifications to the architectures of PLMs. Experiments reveal that our simple method can effectively enhance QA models when predicting answers at the rear positions, where the position embeddings may not be sufficiently updated without Random Padding. We also show that our simple strategy can improve the performance of PLM components in more benchmarks where accurate local context representations over longer context are necessary.

References 1. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: pre-training text encoders as discriminators rather than generators. In: Proceedings of ICLR (2020) 2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL (2019) 3. Jia, R., Liang, P.: Adversarial examples for evaluating reading comprehension systems. In: Proceedings of EMNLP (2017) 4. Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. TACL 8, 64–77 (2020) 5. Joshi, M., Choi, E., Weld, D., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: Proceedings of ACL (2017) 6. Kwiatkowski, T., Palomaki, J., Redfield, O., et al.: Natural questions: a benchmark for question answering research. TACL 7, 452–466 (2019) 7. Liu, Y., Ott, M., Goyal, N., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019) 8. Peters, M.E., Neumann, M., Iyyer, M., et al.: Deep contextualized word representations. In: Proceedings of NAACL (2018) 9. Press, O., Smith, N., Lewis, M.: Train short, test long: attention with linear biases enables input length extrapolation. In: Proceedings of ICLR (2022) 10. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: Proceedings of ACL (2018) 11. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of EMNLP (2016) 12. Salant, S., Berant, J.: Contextualized word representations for reading comprehension. In: Proceedings of NAACL (2018)

An Easy Improvement for Position Embeddings via Random Padding

313

13. Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: RoFormer: enhanced transformer with rotary position embedding. CoRR abs/2104.09864 (2021) 14. Sugawara, S., Inui, K., Sekine, S., Aizawa, A.: What makes reading comprehension questions easier? In: Proceedings of EMNLP (2018) 15. Sun, Y., et al.: A length-extrapolatable transformer (2022) 16. Tan, Q., Xu, L., Bing, L., Ng, H.T., Aljunied, S.M.: Revisiting docred - addressing the false negative problem in relation extraction. In: Proceedings of EMNLP (2022) 17. Vaswani, A., et al.: Attention is all you need. In: Proceedings of NIPS (2017) 18. Wang, D., Hu, W., Cao, E., Sun, W.: Global-to-local neural networks for documentlevel relation extraction. In: Proceedings of EMNLP (2020) 19. Yang, Z., Qi, P., Zhang, S., et al.: HotpotQA: a dataset for diverse, explainable multi-hop question answering. In: Proceedings of EMNLP (2018) 20. Zhang, N., Chen, X., Xie, X., et al.: Document-level relation extraction as semantic segmentation. In: Proceedings of IJCAI (2021) 21. Zhou, W., Huang, K., Ma, T., Huang, J.: Document-level relation extraction with adaptive thresholding and localized context pooling. In: Proceedings of AAAI (2021)

IDOS: A Unified Debiasing Method via Word Shuffling Yuanhang Tang1 , Yawen Ouyang1 , Zhen Wu1(B) , Baohua Zhang2 , Jiaying Zhang2 , and Xinyu Dai1 1

2

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China {tangyuanhang,ouyangyw}@smail.nju.edu.cn, {wuz,daixinyu}@nju.edu.cn Big Data and Artificial Intelligence Laboratory, Industrial and Commercial Bank of China, Shanghai, China {zhangbh,zhangjy1}@sdc.icbc.com.cn

Abstract. Recent studies show that advanced natural language understanding (NLU) models may exploit dataset biases to achieve superior performance on in-distribution datasets but fail to generalize to out-ofdistribution datasets that do not contain such biases. Previous works have made promising progress in mitigating dataset biases with an extra model to estimate them. However, these methods rely on prior bias knowledge or tedious model-tuning tricks which may be hard to apply widely. To tackle the above problem, we propose to model biases by shuffling the words of the input sample, as word shuffling can break the semantics that relies on correct word order while keeping the biases that are unaffected. Thanks to word shuffling, we further propose IDOS, a unified debiasing method that enables bias estimation and debiased prediction by one single NLU model. Experimental results on three NLU benchmarks show that despite its simplicity, our method improves the generalization ability of NLU models and achieves a comparable performance to previous debiasing methods.

Keywords: debias

1

· generalization · NLU

Introduction

Although pre-trained language models have largely promoted the state-of-the-art performance on down-stream NLU tasks, recent studies show that their success may be achieved by excessively exploiting dataset biases (also known as spurious correlations) rather than forming a better understanding of the semantics [5,17]. These models may achieve superior performance on in-distribution(ID) test sets but fail to generalize to out-of-distribution(OOD) challenging sets which do not contain these biases [11,16,25]. For example, the natural language inference(NLI) task is designed to infer whether the premise sentence entails the hypothesis sentence by comparing the semantics of the two sentences. However, models c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 314–325, 2023. https://doi.org/10.1007/978-3-031-44696-2_25

IDOS: A Unified Debiasing Method via Word Shuffling

315

trained on the MNLI dataset [21] rely on statistical biases to blindly predict entailment when the words in the premise highly overlap with the words in the hypothesis instead of examining the semantic differences [11]. Such models may show promising results on the in-distribution validation sets, but suffer from large performance drop when tested on OOD challenging sets [5,11], raising concerns about the generalization ability of such models. In order to improve the generalization ability, various methods are proposed to mitigate the negative effect of dataset biases [8,12,13,22,23]. Within these studies, the ensemble-based debiasing (EBD) method [8,20]has shown promising results. The EBD method estimates the biases of the training samples and downweights the loss of the biased ones so that the model focuses on learning the less biased samples. For the estimation of dataset biases, various methods such as manual bias feature extraction [8], model overfitting [20] and weak learner [15] are proposed and show promising results. Though promising, the EBD method is faced with limitations. On the one hand, the EBD method introduces an extra bias-only model to estimate biases. Training such bias-only models requires either expert knowledge on specific dataset biases [8] (such as word overlap [11]) or laborious training techniques such as tedious hyperparameter tuning [20], incurring substantial extra parameters and human labor. On the other hand, this method is an intrinsically twostage method in which the bias-only model is trained ahead of the debiasing training of the NLU model [15,20], which is time-consuming. To address these limitations, in this paper, we propose a unIfied Debiasing method via wOrd Shuffling (IDOS) that enables a fully autonomous estimation of biases without the bias-only model. To achieve this, we propose to shuffle the words in the input text to break the semantic features while keeping the bias features, considering that the semantics of a text relies on correct word order while a wide range of biases such as word overlap is unaffected. Hence, when faced with shuffled text, the NLU model can only rely on bias features to perform prediction and the output confidence scores reveal the biases of the input sample. In this way, we enable the NLU model to estimate the biases of the input text and an extra bias-only model is unnecessary in our method. To verify the effectiveness of our method, we conduct experiments on three real-world NLU generalization tasks including NLI, fact verification [18] and paraphrase identification1 . The results demonstrate that despite its simplicity, IDOS achieves competitive performance compared with other debiasing methods with more complex designs, and even achieves new state-of-the-art performance on some benchmarks. To sum up, our contributions are as follows: • We propose to model biases by shuffling the words of the input text. In this way, bias estimation is achievable without the aid of an extra bias-only model. • We propose a novel unified debiasing method IDOS which incurs no extra human labor and no extra parameters. 1

https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs.

316

Y. Tang et al.

• We conduct experiments to demonstrate that our method achieves competitive performance on several classic NLU benchmarks despite its simplicity.

2

Approach

We begin this section with a formal statement of the problem. Then, we introduce the EBD method and analyze its weakness. Finally, we propose our method as an improvement to the EBD method. 2.1

Problem Setup

Given training set D = {xi , yi }N i=1 consisting of input texts xi ∈ X and corresponding labels yi ∈ Y = {1, 2, . . . , C} with C pre-defined classes, the goal of the NLU generalization task is to train a classifier f : X → Y which is able to transfer to other data distributions different from the training distribution. 2.2

Ensemble-Based Debiasing Method

Fig. 1. An illustration of the EBD method(the left part) and IDOS(the right part). In the EBD method, an extra bias-only model is introduced to take only bias features as input. In our method, the bias-only model and the NLU model are unified as a single model via word shuffling.

The key idea of the EBD method, as depicted in Fig. 1 (left), is to identify potentially biased samples with a bias-only model fb , and design debiased loss functions to incorporate fb into the training of the NLU model f . fb should rely on only the biases to make predictions, and hence samples that are correctly classified with high confidence by fb are more likely to be biased [8]. By downweighting the importance of the highly biased samples on the training objective of f , f are forced to focus on less biased samples and therefore less likely to be affected by the biases. Different methods are proposed to obtain fb . When the prior knowledge of the biases observed in the training set is available, it is possible to extract the bias

IDOS: A Unified Debiasing Method via Word Shuffling

317

features (such as word overlapping rate) and train a feed-forward neural network as fb using the extracted features as input [8]. Since the network takes only the bias features as input, it relies on only the biases to compute the confidence. Other methods such as model overfitting [20] and weak learner [15] are also proposed and show promising results. With fb at hand, debiased loss functions incorporate the predictions of fb and f to adjust the weights of different samples. Typical debiased loss functions include product-of-experts (PoE) loss [6] and debiased focal (DFL) loss [20]. Let fb (xi ) and f (xi ) be the softmax output vector of fb and f on sample xi respectively. Given training set D = {xi , yi }N i=1 , the PoE loss is computed as LP oE = −

N 1  log σ yi (fb (xi )  f (xi )) N i=1

(1)

where σ yi is the softmax probability on label yi . Intuitively, PoE loss is equivalent to the cross-entropy loss of σ (fb (xi )  f (xi )) and the golden label yi . When updating the parameters of f , larger confidence of fb on yi leads to a smaller gradient of xi [8]. Therefore, the parameters of f are more likely to be decided by less biased samples. The DFL loss is inspired by Focal Loss [9], computed as LDF L = −

N 1  γ (1 − fbyi (xi )) log f yi (xi ) N i=1

(2)

where γ is the focusing parameter that controls the down-weighting rate. This loss assigns different weights to different samples based on fbyi (xi ), the predicted probability of the bias-only model fb on sample xi . As the sample is more biased, fbyi (xi ) is closer to 1 and the weight of the corresponding sample is closer to 0. In this way, the biased samples are down-weighted when computing the training objective and thus contribute less to the parameters After the training of f , the bias-only model fb can be discarded. In the inference stage, f predicts the label of x as argmaxyi f (x). By combining the characteristics of two different models, the EBD method achieves significant improvement compared to baseline models. However, there are still limitations. On the one hand, to train a bias-only model, users need to either have access to dataset-specific prior knowledge (which is often infeasible) to extract bias features [8], or to carefully tune the hyperparameters [20] or choose a suitable limited-capacity model [15] to ensure that the trained model relies mostly on biases to make predictions. This procedure requires tedious human assistance. On the other hand, the proposed method is intrinsically a two-stage method (first train the bias-only model, and then the NLU model), which is time-consuming. 2.3

Unified Debiasing Method via Word Shuffling

We note that the bias-only model is the main factor that limits the application of the EBD method. In order to build a one-stage unified debiasing method, we need to seek another way to model the biases.

318

Y. Tang et al.

Before discussing our model, we take a closer look at the features exhibited by the text sample. The features contained in a text sample x can be decomposed into two parts, the high-level semantic features s and the low-level surface features z [7]. s is the part of features that determines the label of x. For example, in NLI task s is the relation of the semantics of the premise and the hypothesis. z refers to the part of features that can be extracted from x but does not determine the label. For example, in NLI task z may be the number of negation words [5] or the structure of the sentence [11]. z may be spuriously correlated to the labels due to potential confounders and become biases [4]. Notably, many known biases in NLU datasets are low-level features such as word overlap [11] and the occurrence of negation words [5]. Models that rely (partially) on z to make predictions are biased. In the EBD method, different techniques such as bias feature extraction [8] and overfitting [20] are proposed to ensure that the bias-only model relies mostly on z, i.e. fb (x) ≈ fb (z). In order to model the biases of an input sample without the aid of a bias-only model, we propose to adopt suitable data augmentation techniques φ to model z such that z = φ(x) and thus f (φ(x)) = f (z). Specifically, we propose to shuffle the words of the input text. The key idea is that, the high-level semantics of the input text is probably eliminated by word shuffling since it relies on correct word order, while many bias features are low-level surface features z such as word overlap and the occurrence of negation words that are unaffected by word shuffling. This is more notable in tasks requiring reasoning and inference such as NLI, in which the model is required to understand the details and logic. Since the semantics are mostly eliminated by word shuffling and only bias features are presented to the model, f (φ(x)) actually estimates the biased probability over the labels similar to fb (x) in EBD method. Based on the above observation, we propose IDOS, a unified debiasing method via word shuffling, which is depicted in Fig. 1. In the training stage, the input text x and the shuffled text φ(x) are both fed into the model and the outputs f (x) and f (φ(x)) are combined to compute the debiased loss. In our method, the PoE loss is computed as LP oE = −

N 1  log σ yi (f (φ(xi ))  f (xi )) N i=1

(3)

and the DFL loss is computed as LDF L = −

N 1  γ (1 − f yi (φ(xi ))) log f yi (xi ) N i=1

(4)

where f (φ(x)) plays the same role as fb (x) in the EBD method. In the inference stage, f predicts the label of x as argmaxyi f (x). Since the debiased loss functions down-weight the biased samples, the bias features have less influence on the optimization of f . Therefore, f could focus on learning the well-generalized semantic features.

IDOS: A Unified Debiasing Method via Word Shuffling

319

Table 1. Model performance (Acc.%) on in-distribution validation sets and out-ofdistribution challenging sets. The best result is marked in bold, and the second best result is marked with an underline. IDOS-PoE refers to our model implemented with PoE loss and IDOS-DFL refers to the version implemented with DFL loss. RISK* denotes the results reported in the original paper [22] and RISK denotes the results we reproduce using the code the authors released. Models based the EBD framework are marked with †. Model

MNLI FEVER QQP IND HANS IND Symm. IND PAWS

BERT

84.6 64.3

85.7 58.1

91.3 35.5

PoE†

84.1 66.3

82.3 62.0

86.9 56.5

Conf-reg†

83.4 69.1

86.4 60.5

89.1 40.0

84.3 67.1

87.6 60.2

89.0 43.0

83.3 67.9

85.3 58.5



Masked Debiasing 82.2 67.9

85.0 63.4

89.6 44.3

RISK*

84.5 71.3

88.3 63.9

90.1 56.5

RISK

83.1 68.2

88.0 62.2

88.9 45.3

Depro

83.2 70.3

84.5 65.2



IDOS-PoE (ours)

83.5 69.8

89.5 65.5

89.9 42.2

IDOS-DFL (ours) 82.6 69.8

90.0 65.0

88.9 43.0

self-debiasing



weak-learner†

3





Experiment

Following previous works [12,22], we evaluate our method along with other debiasing methods on three different NLU tasks: natural language inference, fact verification, and paraphrase identification. 3.1

Datasets

Natural Language Inference (NLI). NLI task aims to determine whether a premise entails a hypothesis We use the MNLI dataset [21] for training and adopt the HANS dataset [11] as the OOD challenging set. Fact Verification. The goal of fact verification task is to validate a claim based on given evidence. We adopt the FEVER dataset [18] as the training set and its symmetric test set [16] as the challenging set for evaluation. Paraphrase Identification. This task aims to determine whether the two texts have the same meaning. We use the QQP dataset as the training set and adopt PAWS [26] as the test set, which is a challenging set designed for QQP. For all three datasets, we train the models on the training sets, and report the performance on the validation sets as the in-distribution (IND) performance and the challenging test sets as the out-of-distribution (OOD) performance.

320

3.2

Y. Tang et al.

Baselines

We compare our method with several existing debiasing methods. • PoE [8]: train a biased model based on prior bias knowledge and introduce Product-of-Expert loss to down-weight biased samples which are identified by the biased model. • Conf-reg [19]: introduce a confidence regularization loss to prevent overconfident prediction on biased samples without degrading the ID performance. • Self-Debiasing [20]: learn the biased model by taking an early checkpoint of the main model when it overfits the biases. We adopt the confidence regularization loss with an annealing mechanism since this version achieves the best performance in the original paper. • Weak Learner [15]: train a shallow model with limited capacity to model the biases and adopt PoE as the debiased loss. • RISK [22]: consider biases as redundant features, and project the features into the geometric median subspace via an autoencoder so as to remove biases. • Masked Debiasing [12]: utilize a score-based pruning technique to remove the biased weights in a finetuned model. We adopt the under-parameterized version in the original paper. • Depro [3]: decorrelate the dependencies between features and purify the bias features to force the model to learn semantic features. 3.3

Implementation Details

Following previous works [3,8], We adopt BERT-base [2] as the backbone for all debiasing methods. We finetune our model with default learning rate 2e−5, and adopt the linear warm-up learning rate scheduler. When computing the biased probability f (φ(x)), we shuffle the input text 5 times and average the output probability to ensure a more stable and accurate bias estimation. As for the debiased loss, we evaluate our method with PoE and DFL (we use the default value γ = 2) loss. The experiment is repeated 5 times with different random seeds, and the average accuracy is reported. Our code is available at Github2 . 3.4

Results

Experimental results are shown in Table 1. We can observe that: • For all three tasks, IDOS improves vanilla BERT by a large margin on the challenging test sets (e.g. over 5% on HANS, over 6% on FEVER Symmetric and QQP), while keeping a competitive IND performance on the validation sets.

2

https://github.com/yuanhangtangle/shuffle-debias.git.

IDOS: A Unified Debiasing Method via Word Shuffling

321

• Despite adopting a simpler architecture, IDOS achieves superior performance to baselines based on the EBD method on all three challenging sets except for PoE on PAWS. However, the superior performance of PoE on PAWS comes at the cost of a large IND performance drop (91.3% to 86.9%) while IDOS keeps a competitive IND performance (91.3% to 89.9% for IDOS-PoE). • IDOS is competitive with fine-grained debiasing methods such as RISK and Depro. Actually, in fact verification task, IDOS performs better than them and achieves new state-of-the-art performance on both the validation set and the challenging test set. To sum up, the results demonstrate that IDOS is a simple yet effective debiasing method.

4 4.1

Analysis HANS Heuristics

Biased models tend to predict the samples in HANS as entailment blindly. Hence, the improvement of performance in the non-entailment category in comparison to vanilla BERT reflects the generalization ability of our model. We evaluate our model on three different subsets of HANS built with different heuristics (i.e. lexical overlap, subsequence, and Fig. 2. Performance (Accuracy) on HANS constituent [11]). As shown in Fig. 2, subsets. Lex., Sub., and Con. indicate lexour model is able to mitigate all three ical overlap, subsequence, and constituent, respectively. kinds of heuristic biases and improve the performance by a large margin in the non-entailment category, demonstrating superior generalization ability. 4.2

Adversarial Robustness

Since adversarial attacks also exploit biases Table 2. Performance (Acc.%) on to mislead NLU models, debiasing methods ANLI dataset. R1 R2 R3 hopefully contribute to improving adversarial Model robustness. Following [22], we test the adverBERT-base 23.1 27.5 30.1 sarial robustness of our model on the Adver25.2 27.5 31.3 sarial NLI dataset (ANLI) [14]. ANLI dataset PoE contains human-crafted adversarial samples self-debias 21.8 27.4 31.0 that successfully fool large pre-trained mod- IDOS-PoE 25.5 28.9 30.9 els. It consists of three subsets. The R1 subset contains samples that successfully fool BERT finetuned on the combination of the SNLI [1] and MNLI datasets. The R2 and R3

322

Y. Tang et al.

subsets are similarly constructed to fool RoBERTa [10]. The results are shown in Table 2. Our model surpasses BERT-base on all three subsets by 2.4%, 1.4%, and 0.8% respectively, showing superior resistance to adversarial attacks. 4.3

Bias Rate

To evaluate the sensitivity of our model to different bias rates, we conduct experiments on a synthetic dataset created by injecting biases into the MNLI dataset to create a synthetic dataset. Following [7], we sample one-third of the examples from the MNLI training set and inject synthetic bias into the samples by adding a symbol (‘!!’) at the end of the Fig. 3. Performance (Accuracy) on the synhypothesis. The number of samples thetic dataset. The left subfigure shows the varies for different classes to control performance drop of BERT-base compared the bias rate P (entailment|!!), i.e. the with our model. The right one shows the percentage of samples associated with difference in accuracy (ours - baseline). ‘!!’ and the entailment category. To create a synthetic test set, we add ‘!!’ at the end of the hypothesis of all the samples in the MNLI validation set. Since the injected bias is correlated with the entailment category only in the synthetic training sets, models trained on the synthetic training sets tend to have lower performance on the test set. We finetune BERT-base and IDOS-PoE on the synthetic training sets and compare their performance on the synthetic test set. Results are shown in Fig. 3. With the increase of the bias rate, both models face a performance drop until reaching the same level as random guessing. However, the performance of our model degrades more slowly and surpasses that of BERT by up to 9%, which indicates that our model is more adapted to highly biased datasets than vanilla BERT. In addition, It can be observed that the difference in the accuracy declines when the dataset is about to be fully biased (when P (entailment|!!) is close to 1). This change reveals a limitation of our method. That is, when the number of available unbiased examples is fairly small, simply up-weighting those examples is not sufficient to balance biased and unbiased information. We leave the solution to this limitation as future work.

5 5.1

Related Work Biases in NLI Datasets

A large number of research works are devoted to analyzing bias patterns in different NLU benchmarks. [5] conduct statistical analysis on both the SNLI and MNLI datasets and show simple features such as the occurrence of negation

IDOS: A Unified Debiasing Method via Word Shuffling

323

words are highly correlated with certain categories. Similarly, [11] observe that LMs fine-tuned on the MNLI dataset leverage syntactic patterns such as lexical overlap to blindly predict entailment. Besides NLI, similar observation are reported in fact verification [16] and paraphrase identification tasks [26], where the authors show that LMs rely on indicative words in the input sentences to make predictions. While biases are effective features to achieve high performance in IND settings, they fail to generalize to OOD challenge sets. [11] design HANS to serve as an OOD challenging set for MNLI [21], where the lexical overlap is less reliable for prediction. Classifiers using lexical overlap to blindly predict entailment fail to generalize well to HANS. [16] construct a symmetric dataset of the FEVER dataset on which fine-tuned BERT suffers from a substantial drop. A similar performance drop is observed in paraphrase identification task [26]. 5.2

Debiasing Methods

Considering the negative effect of biases, debiasing methods have been proposed. [8] propose the ensemble-based debiasing (EBD) method together with two debiased loss functions. In a concurrent work, [19] introduce a confidence regularization loss, which is capable of mitigating biases and maintaining the indistribution performance and shows promising OOD performance. This method inspires several later works. [15,20] propose two similar biases identification methods based on the EBD method and Besides the EBD method, other methods are also studied. [24] identify the set of minority examples and finetune the model twice, first on the full dataset and then the minority examples. [22] propose to debias NLU models by mapping the representation from pre-trained language models to a less biased subspace via autoencoders. [12] consider that biased prediction is caused by some biased weights and introduce debiasing mask pruning to eliminate biased weights. [23] propose to generate more training samples with language models and filter out biased samples to balance different subgroups.

6

Conclusion

In this work, we propose to model biases by shuffling the words of the input text. We further propose IDOS, a novel unified debiasing method via word shuffling. This method is end-to-end, incurs no extra parameters and requires only slightly more training computation than the baseline model. Experiments demonstrate IDOS achieves competitive performance to state-of-the-art debiasing methods on several NLU benchmarks. Acknowledgements. The authors would like to thank the anonymous reviewers for their helpful comments. Zhen Wu is the corresponding author. This research is supported by the National Natural Science Foundation of China (No. 61936012, 62206126 and 61976114).

324

Y. Tang et al.

References 1. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of EMNLP (2015) 2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL (2019) 3. Dou, S., et al.: Decorrelate irrelevant, purify relevant: overcome textual spurious correlations from a feature perspective. In: Proceedings of COLING (2022) 4. Feder, A., et al.: Causal inference in natural language processing: estimation, prediction, interpretation and beyond. In: Transactions of the Association for Computational Linguistics (2022) 5. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., Smith, N.A.: Annotation artifacts in natural language inference data. In: Proceedings of NAACL (2018) 6. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. (2002) 7. Joshi, N., Pan, X., He, H.: Are all spurious features in natural language alike? An analysis through a causal lens. In: Proceedings of EMNLP (2022) 8. Karimi Mahabadi, R., Belinkov, Y., Henderson, J.: End-to-end bias mitigation by modelling biases in corpora. In: Proceedings of ACL (2020) 9. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of ICCV (2017) 10. Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach (2019) 11. McCoy, T., Pavlick, E., Linzen, T.: Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In: Proceedings of ACL (2019) 12. Meissner, J.M., Sugawara, S., Aizawa, A.: Debiasing masks: a new framework for shortcut mitigation in NLU. In: Proceedings of EMNLP (2022) 13. Min, J., McCoy, R.T., Das, D., Pitler, E., Linzen, T.: Syntactic data augmentation increases robustness to inference heuristics. In: Proceedings of ACL (2020) 14. Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., Kiela, D.: Adversarial NLI: a new benchmark for natural language understanding. In: Proceedings of ACL (2020) 15. Sanh, V., Wolf, T., Belinkov, Y., Rush, A.M.: Learning from others’ mistakes: avoiding dataset biases without modeling them. In: Proceedings of ICLR (2021) 16. Schuster, T., et al.: Towards debiasing fact verification models. In: Proceedings of EMNLP (2019) 17. Shah, D.S., Schwartz, H.A., Hovy, D.: Predictive biases in natural language processing models: a conceptual framework and overview. In: Proceedings of ACL (2020) 18. Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: Fever: a large-scale dataset for fact extraction and verification. In: Proceedings of NAACL (2018) 19. Utama, P.A., Moosavi, N.S., Gurevych, I.: Mind the trade-off: debiasing NLU models without degrading the in-distribution performance. In: Proceedings of ACL (2020) 20. Utama, P.A., Moosavi, N.S., Gurevych, I.: Towards debiasing NLU models from unknown biases. In: Proceedings of EMNLP (2020) 21. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of NAACL (2018) 22. Wu, T., Gui, T.: Less is better: recovering intended-feature subspace to robustify NLU models. In: Proceedings of COLING (2022)

IDOS: A Unified Debiasing Method via Word Shuffling

325

23. Wu, Y., Gardner, M., Stenetorp, P., Dasigi, P.: Generating data to mitigate spurious correlations in natural language inference datasets. In: Proceedings of ACL (2022) 24. Yaghoobzadeh, Y., Mehri, S., Tachet des Combes, R., Hazen, T.J., Sordoni, A.: Increasing robustness to spurious correlations using forgettable examples. In: Proceedings of EACL (2021) 25. Ye, J., Ouyang, Y., Wu, Z., Dai, X.: Out-of-distribution generalization challenge in dialog state tracking. In: NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (2022) 26. Zhang, Y., Baldridge, J., He, L.: Paws: paraphrase adversaries from word scrambling. In: Proceedings of NAACL (2019)

FedEAE: Federated Learning Based Privacy-Preserving Event Argument Extraction Fei Hu, Shenpo Dong, Tao Chang, Jie Zhou, Haili Li, Jingnan Wang, Rui Chen, Haijiao Liu, and Xiaodong Wang(B) National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China {hufei,dsp,changtao15,jiezhou,lihaili20, wangjingnan17a,chenrui,liuhj22,xdwang}@nudt.edu.cn

Abstract. Benefiting from Pre-trained Language Model (PLM), Event Argument Extraction (EAE) methods have achieved SOTA performance in general scenarios of Event Extraction (EE). However, with increasing concerns and regulations on data privacy, aggregating distributed data among different institutions in some privacy-sensitive territories (e.g., medical record analysis, financial statement analysis, etc.) becomes very difficult, and it’s hard to train an accurate EAE model with limited local data. Federated Learning (FL) provides promising methods for a large number of clients to collaboratively learn a shared global model without the need to exchange privacy-sensitive data. Therefore, we propose a privacy-preserving EAE method named FedEAE based on FL to solve the current difficulties. To better adapt to federated scenarios, we design a dataset named FedACE generated from the ACE2005 dataset under IID and Non-IID for our experiments. Extensive experiments show that FedEAE achieves promising performance compared to existing baselines, thus validates the effectiveness of our method. To the best of our knowledge, FedEAE is the first to apply FL in the EAE task. Keywords: Event Argument Extraction

1

· Federated Learning · NLP

Introduction

With the application of Artificial Intelligence (AI) in various industries, people’s attention to user privacy and data security is also constantly increasing. Users are becoming more and more concerned about whether their private information is exploited or even abused by others for commercial or political purposes without their own permission. New data regulatory frameworks (e.g., GDPR [20], CCPA [1], PDPA [2]) have also been recently enacted to protect personal and proprietary data from illegal access and malicious use [14]. How to solve the c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 326–337, 2023. https://doi.org/10.1007/978-3-031-44696-2_26

FedEAE

327

problem of data fragmentation and data isolation under the premise of complying with stricter and new privacy protection regulations is the primary challenge faced by AI researchers and practitioners. As a topic of growing interest, Federated Learning (FL) [9,15] provides joint training for multi-party models without the need to exchange the local data, ensuring data security and privacy while also implementing distributed training. Traditional machine learning requires data aggregation to the center before model training can proceed, leading to data privacy leakage and posing data security risks. Different from traditional paradigms, FL can learn without sharing raw data, protecting the privacy of users, which is a feasible approach to solve the development dilemma of AI. FL was initially proposed to address the privacy issues of next word prediction for Google Gboard [16,19]. With the highlighting of its value, FL gradually expands to more NLP tasks such as Named Entity Recognition (NER), Event Extraction (EE), etc. EAE is an critical and challenging sub-task of EE, which aims to discover specific role types for each argument in an event [23]. Benefiting from PLM, EAE methods have achieved SOTA performance in general scenarios. However, in privacy-sensitive territories like medical record analysis, financial statement analysis, etc., the flow of data among institutions is strictly restricted, thus making it hard to train an EAE model in traditional way, resulting in poor performance of the model trained with limited local data. Therefore, we consider combining FL with EAE to improve model performance while protecting privacy. The main contributions of this paper are summarized as follows: 1. We propose a FedEAE method based on FL to learn an EAE model with promising performance while also having the advantage of privacy protection compared to traditional methods. 2. We design a FedACE dataset generated from the ACE2005 dataset to adapt to the heterogeneous characteristics of data in federated scenarios. 3. We conduct extensive experiments to verify the effectiveness of our proposed FedEAE method and present an analysis of the performance of FL in the EAE task.

2 2.1

Related Work Event Argument Extraction

The main challenge of EAE is how to accurately match event arguments with their roles. Earlier deep learning methods use various neural networks to capture the dependencies in between event triggers and event arguments to extract event arguments, these methods can be divided into fine-tuning and prompt-tuning ones [3]. As to the latter, [4] uses automatically generated prompts to eliminate the demand of the time-consuming and labor-intensive question design for the target extraction. [11] presents a curriculum learning based prompt tuning approach which resolves implicit EAE by four learning stages. The new paradigm of prompt-tuning is becoming popular in NLP tasks and has achieved promising performance.

328

2.2

F. Hu et al.

Federated Learning

Federated Learning (FL) is an emerging foundational technology in AI, first proposed by Google in 2016 [15], originally used to solve the problem of updating the models of Android mobiles locally. The goal of FL is to carry out efficient machine learning among multiple participants or computing nodes while ensuring information security and legal compliance [9]. During federated training only the intermediate parameters of the interactive models are used for model joint training, and a better global model is trained by manipulating the feedback parameters of the local models through the server [22]. In this way, each participant can obtain a shared global model without disclosing the raw data to each other.

3 3.1

Method Problem Formulation

To better illustrate the definition of FedEAE, we follow the problem formulation of Federated Learning in [6,10,12,18,22]. We consider N clients of privacysensitive territories {C1 , . . . CN } who wish to train an EAE model together with their respective data {D1 , . . . DN }. The conventional method is to put all data together and use D = D1 ∪ . . . ∪ DN to train a model MSU M . However, due to strict regulations and increasing concerns on data privacy, traditional methods are difficult to implement in these areas. FedEAE provides a FL-based system enabling different clients to collaboratively train a model MF ED , in which process any institution Ci does not expose its data Di to others. We denote the accuracy of MF ED as VF ED , which in our anticipation should be close to the accuracy of MSU M denoted as VSU M . 3.2

Structure of FedEAE

FedEAE follows the typical Client-Server training schema of FL, the structure is summarized below. a. Server: The server sends the initial model parameters to each participating client, then aggregates the collected intermediate parameters from the clients after local training in every client is done. A better global model is trained by aggregating the feedback parameters of the local models through the server. b. Client: During each global training round, all participating clients train the initial model with local data, then uploads the intermediate parameters to the Server after local training. In this way, there’s no need for the clients to exchange raw data with the server, which to some extent preserves the privacy compared to traditional machine learning methods.

FedEAE

329

Fig. 1. This diagram illustrates the workflow of FedEAE. It demonstrates how to train an EAE model in a federated approach.

In particular, when aggregating parameters on the server, we choose FedAvg [17], which is one of the most prevalent methods for model aggregation in FL. Our goal is typically to minimize the following objective function: min F (w), w

where

F (w) :=

N 

pi Fi (w)

i=1

 Here, pi ≥ 0 and i pi = 1, and Fi (w) is the local objective function for the i th client. The workflow above is also presented in Fig. 1. 3.3

FedACE Dataset

The Automatic Content Extraction (ACE) 2005 dataset is widely used for the EAE task, which contains 599 documents that have been annotated with 33 event subtypes and 22 argument roles. For centralized training, We follow data splitting and preprocessing steps in [5,21] to split it into train set, validation set and test set. The train set has 529 documents, the testing set has 40 newswire documents, and the validation set has 30 randomly selected documents. We count the number of various arguments in the datasets and present it in Fig. 2. For federated training, we design a FedACE dataset based on the train set. Specifically, we further perform data partitioning in the train set under Independent Identically Distribution (IID) and Non-Independent and Identically Distribution (Non-IID) to better adapt to the heterogeneous characteristics of data in federated scenarios. FedACE Under IID. As is known to all, in traditional machine learning, we usually adopt the method of simple random sampling without return to reflect

330

F. Hu et al.

Fig. 2. The argument distribution in our processed ACE2005 dataset.

Fig. 3. FedACE under IID. The entire train set is divided into parts with approximately equally sized arguments and roughly the same argument role distribution.

the overall distribution of the population. This sampling method obtains data that is independent and identically distributed. Therefore, to test the performance of FedEAE under the scenario of IID, we refer to the data partitioning strategies in [8,14]. Due to the fact that in our initial train set, each sample does not always contain one argument, some samples do not contain any arguments, while some samples contain multiple identical or different arguments, we cannot

FedEAE

331

Fig. 4. FedACE under Non-IID. The size of samples and label distributions vary greatly in different parts.

simply label a sample with a single label. In this case, to generate FedACE under IID, firstly, we shuffle the whole dataset. Then, we count the types and numbers of arguments in each sample. On this basis, we perform simple random sampling without replacement according to the number of arguments rather than according to the number of samples as in the previous methods. In this way, the entire train set is divided into parts with approximately equally sized arguments and roughly the same argument role distribution. FedACE Under Non-IID. In real-world applications of FL, the datasets located on different clients often have different features or label distributions, and the features are not mutually independent, this scenario in FL is called Non-Independent and Identically Distribution (Non-IID). Therefore, in order to test the performance of FedEAE under the scenario of Non-IID, we refer to the strategies in [7,15]. To generate our FedACE under Non-IID, firstly, we generate a label list for all the samples. In the case of multiple arguments in one sample, we extract the argument with the highest number of occurrences as the representative argument. In the case of no arguments in one sample, we label the sample with 0. After we get the label list of samples, we use the Dirichlet distribution to divide the train set into non-identical parts. The size of samples and label distributions vary greatly in different parts, which better simulates the distribution of data from different clients in the real world.

332

4 4.1

F. Hu et al.

Experimental Evaluations Learning Model

Among numerous EAE models, FedEAE extends an existing state-of-the-art EAE model using prompt-tuning known as AMP [5], which uses automatically generated prompts to activate multi-category information for EAE. To improve the generated prompts quality, AMP adopted multi-category information intentionally. The model can better discover the interaction across event types through these prompts. In addition, unlike other QA methods, AMP used a decoupled structure to augment the guidance of the prompt information to the language model. Comparision between AMP and other models is presents in Table 1. 4.2

Datasets

For centralized training, we use the ACE2005 dataset. For federated and local training (training with local data only), we use our proposed FedACE dataset. For the latter, we split the dataset into partitions for 4, 8, 16 and 32 clients under IID and Non-IID, which is shown in Fig. 3 and Firgure 4. 4.3

Experiment Setup

As mentioned in Sect. 3.2, in our experiments, the federated environments follow a topology [13,22] where the server is responsible to orchestrate the execution of the participating clients. During each training round, a client first trains on its local dataset for 2 epochs, and the server then merges the parameters received from all the participating clients. The global training epoch is set to 200. We test the performance of FedEAE in federated and local environments consisting of 4, 8, 16 and 32 clients, where the fraction of clients is uniformly set to 0.5. We also study the influence of different fractions ranging from 0.1 to 1.0 under the setting of 32 clients (IID). The evaluation metrics are P (Precision), R (Recall) and F1 (F1-Measure), all models are evaluated on the same test dataset. 4.4

Experiment Results and Analysis

This section presents the experiment results evaluating FedEAE against the centralized and local methods. Table 1 shows the final learning performance of FedEAE under different settings of client numbers and data distribution. As we can see, the federated methods outperforms most of the centralized methods as to F1. Especially when the number of clients is set to 8 (IID), FedEAE achieves the highest F1 of 60.03, which is comparable to the SOTA result of 61.8. We also observe that FedEAE outperforms the local methods under all settings by 5.5%–38.6%, thus highlights the advantages of FL over local training. Above all, FedEAE is able to train an EAE model with promising performance while also having the advantage of privacy protection compared to existing methods.

FedEAE

333

Effects of Number of Clients. We can observe from Table 1 that in the federated environment, when the number of clients increases from 4 (IID) to 8 (IID), the F1 increases by 0.4%, when more clients participate in the training, the F1 continues to decrease. The results indicate that there exists an balance point between model performance and the number of clients (IID). When we train the model under Non-IID or with only local data, the more clients there are, the worse the performance of the model. Figure 5 shows the convergence rate of the centralized and federated models and F1-score over communication rounds. We can clearly observe that under both settings of IID and Non-IID, as the number of clients increases the performance of the model significantly decreases, which results from reduced data and increased statistical heterogeneity on each client. On one hand, when the training rounds are the same (before the model converges), the more clients there are, the lower the F1 score. On the other hand, the more clients there are, the more training rounds are required for the model to converge. From a deeper level of analysis, these results actually illustrate that the harder the federated environment is, the more communication rounds are needed for the federated model to reach an acceptable performance, which is consistent with our previous understanding. Effects of Data Distribution. As introduced in Sect. 4.1, we split the dataset under IID and Non-IID to test the performance of FedEAE, the results are shown in Table 1 and Fig. 5. On the one hand, We observe that in the case of different number of participating clients, FedEAE always performs better under IID than Non-IID, which is 1.61% higher on average in F1. Meanwhile, the performance gap between the two gradually widens (from 0.36% to 2.27%) as the number of clients increases (from 4 to 32). The results indicate that the heterogeneity of data can lead to a decrease in model performance. Nonetheless, in both cases, FedEAE remains effective and achieves promising results compared to other baselines. On the other hand, we observe that FedEAE always achieves faster convergence under Non-IID than IID though with lower final F1. This shows that within an acceptable range of performance losses, our proposed model can converge even faster when dealing with heterogeneous data, thus achieving a balance between model performance and privacy protection. Effects of Client Fraction. Under the federated environments, the network connection between the server and clients is limited, and only some nodes may participate in training applications of FL at any time. Thus, it is necessary to evaluate the effect of the client fraction on the performance of our proposed model. We explore the performance of FedEAE under the settings of different fraction of 32 clients with local data partitioned under IID in 30 communication rounds. The results is illustrated in Fig. 6 where the fractions vary from 0.1 to 1.0. Comparing the two graphs in Fig. 6(a) and 6(b) comprehensively, we observe that when the fraction is set to 0.1 and 0.2, the model has the slowest

334

F. Hu et al.

Table 1. Final learning performance of Centralized vs Federated vs Local methods. Methods

P

R

F1

56.9 62.2 59.1 58.1 48.0 62.1

57.4 56.6 61.5 56.2 61.6 61.5

57.2 59.3 60.3 57.1 54.0 61.8

Federated (FedEAE ) 4-Clients

IID Non-IID 8-Clients IID Non-IID 16-Clients IID Non-IID 32-Clients IID Non-IID

60.7 59.4 60.4 57.6 60.4 59.2 61.0 58.0

58.5 59.0 59.7 59.0 59.4 56.4 57.1 55.6

59.6 59.2 60.0 58.3 59.9 57.8 59.0 56.7

Local

4-Clients

54.1 57.6 53.5 57.8 40.5 55.4 33.0 36.9

54.2 43.9 44.6 30.7 36.6 35.6 14.8 23.2

54.1 49.9 48.7 40.1 38.5 43.3 20.4 28.5

Centralized

DMBERT HMEAE BERD EEQA GDAP AMP

IID Non-IID 8-Clients IID Non-IID 16-Clients IID Non-IID 32-Clients IID Non-IID

convergence rate and the largest fluctuation due to the small amount of training data, while achieves the highest F1 score. When the fraction is set to 0.4 and 0.5, the model has the fastest convergence speed and achieves a good balance between performance and efficiency. When the fraction is set to 0.9 and 1.0, the convergence curve is the smoothest with the largest number of clients, while the performance is not so good as expected which is contrary to our usual cognition. Therefore, the results above indicate that when the total number of clients that can participate in training is fixed, the fractions of clients participating in model aggregation have an impact on the performance, while more fractions of clients cannot guarantee better training results. Overall, our proposed FedEAE can achieve promising results for all the settings of fraction. We also observe that the time consumption increases linearly with the increase of the number of clients.

FedEAE

335

Fig. 5. Final learning performance of FedEAE on FedACE under IID and Non-IID.

Fig. 6. The performance of FedEAE in 30 communication rounds when different fractions of clients participate in federated training. We draw the results into four graphs for better presentation.

5

Conclusion and Future Work

In this paper, we propose a FedEAE method for privacy-preserving EAE tasks, which enables multiple parties to collaboratively train an EAE model without exchanging their raw data. We design a FedACE dataset to adapt to the het-

336

F. Hu et al.

erogeneous characteristics of data in federated scenarios. Extensive experiments demonstrate that our method can make full use of distributed data and achieve promising performance compared to existing baselines. In conclusion, preserving data privacy without compromising on model performance, FedEAE is an effective method for EAE which has practical application value in privacy-sensitive territories. We consider this work as a first step towards privacy-preserving EAE based on FL. In the future, we plan to test our method on more benchmark datasets to verify its effectiveness. We will also apply more federated aggregation algorithms in our method. In addition, we consider to enhance data security and privacy of FedEAE by adopting encryption strategies such as Differential Privacy (DF) and Homomorphic Encryption (HE).

References 1. California Consumer Privacy Act (CCPA). Social Science Electronic Publishing 2. Chik, W.B.: The Singapore personal data protection act and an assessment of future trends in data privacy reform. Comput. Law Secur. Rep. 29(5), 554–575 (2013) 3. Dai, L., Wang, B., Xiang, W., Mo, Y.: Bi-directional iterative prompt-tuning for event argument extraction. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6251–6263. Association for Computational Linguistics, Abu Dhabi (2022). https://aclanthology.org/2022.emnlpmain.419 4. Dong, S., et al.: Argumentprompt: activating multi-category of information for event argument extraction with automatically generated prompts. Springer, Cham (2022) 5. Dong, S., et al.: Argumentprompt: activating multi-category of information for event argument extraction with automatically generated prompts. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2022. LNCS, vol. 13551, pp. 311–323. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-17120-8 25 6. He, C., Annavaram, M., Avestimehr, S.: Fednas: federated deep learning via neural architecture search. arXiv preprint arXiv:2004.08546 (2020) 7. Hsu, T., Qi, H., Brown, M.: Measuring the effects of non-identical data distribution for federated visual classification (2019) 8. Ji, S., Pan, S., Long, G., Li, X., Jiang, J., Huang, Z.: Learning private neural language modeling with attentive aggregation (2018) 9. Kairouz, E., Mcmahan, H.B.: Advances and open problems in federated learning. R Mach. Learn. vol. 14, no. 1 (2021) Found. Trends 10. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020) 11. Lin, J., Chen, Q., Zhou, J., Jin, J., He, L.: Cup: Curriculum learning based prompt tuning for implicit event argument extraction. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI22), pp. 4245–4251. International Joint Conferences on Artificial Intelligence Organization (2022). https://doi.org/10.24963/ijcai.2022/589

FedEAE

337

12. Lin, Y., Ji, H., Huang, F., Wu, L.: A joint neural model for information extraction with global features. In: Meeting of the Association for Computational Linguistics (2020) 13. Liu, M., Ho, S., Wang, M., Gao, L., Zhang, H.: Federated learning meets natural language processing: a survey (2021) 14. Mathew, J., Stripelis, D., Ambite, J.L.: Federated named entity recognition. arXiv preprint arXiv:2203.15101 (2022) 15. Mcmahan, H.B., Moore, E., Ramage, D., Hampson, S., Arcas, B.: Communicationefficient learning of deep networks from decentralized data (2016) 16. Mcmahan, H.B., Ramage, D., Talwar, K., Zhang, L.: Learning differentially private language models without losing accuracy (2017) 17. McMahan, H.B., Moore, E., Ramage, D., Arcas, B.A.: Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629 (2016) 18. Rodio, A., Faticanti, F., Marfoq, O., Neglia, G., Leonardi, E.: Federated learning under heterogeneous and correlated client availability. arXiv preprint arXiv:2301.04632 (2023). 19. Stremmel, J., Singh, A.: Pretraining federated text models for next word prediction (2020) 20. Voigt, P., Bussche, A.V.D.: The EU General Data Protection Regulation (GDPR): A Practical Guide (2017) 21. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations (2019) 22. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning:concept and applications. ACM Trans. Intell. Syst. Technol. 10(2), 1–19 (2019) 23. Ye, H., et al.: Learning to ask for data-efficient event argument extraction (2021)

Event Contrastive Representation Learning Enhanced with Image Situational Information Wei Liu1,2(B) , Qi Wu1 , Shaorong Xie1 , and Weimin Li1 1

School of Computer Engineering and Science, Shanghai University, Shanghai, China [email protected] 2 Shanghai Artificial Intelligence Laboratory, Shanghai, China Abstract. Event representation learning is a crucial prerequisite for many domains in event understanding, which aims to project events into dense vectors, where similar events are embedded close to each other while distinct events are separated. Most existing methods enhanced representation with external knowledge, which only based on unimodal and margin loss, ignoring the diversity of data and the impact of multiple samples. Therefore, we propose a novel Multimodal Event Contrastive Representation Learning (ECRL) framework. ECRL enhances event representations by making better use of image situational information. Specifically, we introduce multimodal contrastive learing and prototypebased clustering methods that allows us to avoid semantic-related samples in same batch being represent as negative, while considering multiple negatives. At the same time, we proposed action-guided-attention mechanism to capture core event participants in image. Extensive experiments on three benchmark datasets of event similarity task show that the ECRL outperforms other baseline. Keywords: Multimodal event representation · Contrastive learning Prototype-based clustering · Attention mechanism

1

·

Introduction

Event is the basic unit of human cognition, understanding and memory, which is a dynamic and structured information. Structuralizing and representing such information as machine-readable knowledge are the key work for downstream application such as event prediction. The mainstream approach that represent prototype events as low dimensional dense vectors. Early work on event representation learning focused on distributed representations, which included additive-based and tensor-based method. The additivebased models project event into embedding space with concatenating or adding event elements embedding. And the tensor-based models capture the multiplicative interactions of event elements by tensor operation. Even the external commonsense has been introduced in tensor-based model to enhance representation. However, most existing methods share two common limitations: (1) model can c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 338–350, 2023. https://doi.org/10.1007/978-3-031-44696-2_27

Event Contrastive Representation Learning Enhanced

339

not make full use of diverse data. (2) margin loss struggle to capture the subtle semantic differences between events, because anchor events only consider one positive and one negative. We are motivated to address the above issues with the aim of obtaining more semantic representation of events. For first issue, we introduce the concept of image situational information, which includes the objects, actions, and scenes of events in the image, as show in the upper part of Fig. 1, both event (2) and (3) share the same subject and predicate in text, but they are two completely different scene in image. On the contrary, event (1) and (2) have different description, while they have the same scene. In such situation, we found that event semantic can be captured by image situational information, which also can intuitively reflect the differences between distinct events. Thus, image situational information can enhance the text event representation. For second issue, we utilize InfoNCE objective [14] to train our model. InfoNCE is a self-supervised contrastive learning method that uses one positive and multiple negatives, while it still have problem that model may treat semantic-related samples in the same batch regarded as negative samples. As show in the bottom part of Fig. 1, text event may have multiple semantic-related image. In order to avoid semanticrelated image events being pulled apart, we introduce a prototype-based clustering module, each prototype serve as a representative embedding of a set of semantic-related events.

Fig. 1. Upper part: Event (2) and (3) share the same subject and predicate and yet hava different semantic. Event (1) and (2) have different description, but semantically more similar. Bottom part: an event triple can be associated with multiple images.

The main contributions of this paper are as follows: (1) The image situational information is introduced as the external feature to rich semantic of event representation (2) We propose an action-guided attention mechanism to capture the core event participants of image. (3) We propose a multimodal event representation learning method based on contrastive training and multimodal prototype

340

W. Liu et al.

clustering. (4) A series of comparative experiments show the effectiveness of proposed method.

2

Related Work

Event representation learning aims to project events represented as triples or sentence into dense vectors using neural networks, where similar events are embedded close to each other while distinct events are separated. Granroth-Wilding [7] proposes event pair-based EventComp method, which concatenates the word embedding of event triples (subject, predicate, object) at first, and the sequential relations of event pairs is used as the training task, the final event embedding is obtained by using neural networks. Ding [3–5] successively proposed a tensor-based neural network representation that uses a bilinear tensor to capture the interactions of event, and then they incorporate a knowledge graph into tensor-based event embedding model, and also incorporates external commonsense knowledge such as the intention and sentiment of events into the event representation to help model learn event semantics better. Weber [18] propose the Role Factored Tensor and Predicate Tensor models with different combinations of tensor-based approaches among event elements, Li [12] concatenates the word embeddings of event elements as the initial event embedding,and used Gated Graph Neural Network to learn the relationship between events on the event graph to obtain the event representation. Lee [10] used discourse relationships to model script knowledge. Hong [8] train models using multi-task learning that can both predict missing event participants and perform semantic role classification. Vijayaraghavan [17] introduce continual learning strategies that allow for incremental consolidation of new knowledge while retaining and promoting efficient usage of prior knowledge. Gao [6] proposes a weakly supervised contrastive learning method that considers more positive and negative sample pairs and a prototype-based clustering method that avoids semantic related events being separated. Zhang [20] proposed multimodal event representation learning framework in heterogeneous space to learn the event representation based on text and image, which maps the textual events to the Gaussian density space, while the image of events is embedded in vector space, and coordinate the two heterogeneous embedded spaces. Existing work focused on unimodal knowledge augmentation, with less research on how to rich semantic with multimodal. Our work focus on exploiting image situational information of events to enhance event representation.

3

Methodology

In this section, we will introduce details of our proposed approach. The structure of the model is shown in Fig. 2, which contains two parts: the multimodal contrastive learning (left) and the multimodal prototype-based clustering (right). In the following sections, we will introduce both methods separately.

Event Contrastive Representation Learning Enhanced

341

Fig. 2. Architecture of the proposed framework, where the left part is the Contrastive Learning method and the right part is the Multimodal Prototype-based Clustering method.

3.1

Text Event Encoder

Neural Tensor Network Model. The architecture of neural tensor network(NTN) use bilinear tensors to explicitly model the relationship between the participants and the action. The word embeddings of A, P and O as the inputs of NTN, A represent actor or subject, P is predicate, O is object. Taking E1 = (A, P ) as examples, the formulas are shown below. E2 = (P, O) and E3 = (E1 , E2 ) are the same. The output(E3 ) are event embeddings.   A [1 :k ] + b) E1 = f (A M1 P + W P (1) g(E1 ) = g(A, P ) = U  E1 [1:k]

where M1 ∈ R(d×d×k) is a tensor containing the set of k matrices, each with d × d dimensions, and a vector r ∈ Rk is calculated by the bilinear tensor [1:k] product A M1 P , where each entry is calculated by one slice of the tensor(ri = [i] A M1 P, i = 1, . . . , k). The other parameters are a standard feed-forward neural network, W ∈ R(k×2d) where is a weight matrix, b ∈ Rk is a bias vector, U ∈ Rk is a hyper-parameter and f = tanh(·) is a standard nonlinear activation function. Predicate Tensor Model. The predicate tensor model dynamically calculates the predicate tensor P from the word embedding of the predicate, and then uses predicate tensor to semantically combine with the subject and object to obtain the textual event representation, which is calculated as follows: given word embedding of predicate p, subject a and object o as the input, the predicate tensor P are derived from a shared base tensor W ∈ R(d×d×d) (where d is the input embedding dimension), and the textual representation of event can be represented by the tensor as et = P (a, o), et ∈ Rd , each element of this event embedding is obtained by capturing multiplicative interactions across all three:

342

W. Liu et al.

subject, verb, and object. Each element in the resulting event embedding of eti is calculated as:  Pijk = Wijk z pz Uzjk (2) eti = z,i,j,k pz aj ok Wijk Uzjk Here U ∈ Rd×d×d is a tensor that defines linear functions for each one dimensional row of W, and determines how predicates should scale that dimension. 3.2

Image Event Encoder

The image event representation also needs to capture the interaction information between event elements, so we firstly use object detection model and the image situation recognition model to capture the image event elements (actions, participants), then we proposes an action-guided attention mechanism to capture the core participants of the actions, finally, we use a multilayer perceptron model to learn the interaction information of the image event element and obtain the image event embedding. The architecture of image encoder has show in Fig. 3.

Fig. 3. Visual Event Components-aware Image Encoder.

Image Event Extractor. We extracts image object features using pre-trained Faster-RCNN with Resnet-50 as the backbone network to obtains the object detection bounding boxes (bbox) and passes them through the ROI Pooling layer to obtain vectors as the object feature for each participant. The FasterRCNN is used to obtain the bbox of object and embedded into the ROI Pooling layer to obtain multiple vectors fobject ∈ R2048 as the object feature of each bbox. For the extraction of image action features, we use pre-trained VGG-16 model and a classification layer as the base model of the image action classifier. To empower the model in recognizing actions within image events, a pre-training task for action prediction is first performed on the Image Situational recognition dataset ImSitu [19], in which each image is labeled with a corresponding action tag. After the pre-training, the last classification layer of the trained model is replaced with a fully connected layer as an extractor of action features.

Event Contrastive Representation Learning Enhanced

343

Action-Guided Attention Mechanism. The difference between image and text data is that text events have very clear event participants (eg., “man assault thief”, man and thief are the participants of event), while images may contain participants unrelated to the event (eg., traffic accidents scene in image, the core objects are car or injured, not passerby), so it is a big challenge to extract the core participant of the events accurately. Therefore, we inspired by Li [11] and propose an action-guided attention mechanism to make event related elements receive more attention in the process of training. The action to be detected, for instance, is transformed into action embedding by image action extractor. Firstly, the output v of image action extractor is linearly transformed to get qa as the query vector of the attention mechanism. qa = tanh(Wq v + bq )

(3)

We get the feature matrix k ∈ Rr×2048 of participants (where r is the maximum number of object detection bbox) from object extractor, and multiply it with qa to get the correlation between each participant and the event action. Then the attention score for the ith participant in the image can be obtained: exp(qaT ki ) αi = softmax (qa ki ) = n T i=1 exp(qa ki )

(4)

Secondly, the vector of the participants o is obtained by the weighted summation of the output for each participant. In the process of training, more attention will be paid to participant who plays a important role in image event action:  αi ki (5) os = i

Finally, the representation of image event is obtained by feeding participants and predicate embeddings into MLP layer, the definition of the image event representation as follows: ev = MLP (concat(v , os ))

(6)

where v ∈ R2048 is the action embedding, Wq is the query weight matrix, bq is the bias matrix, and ki is the ith object in an image. 3.3

Multimodal Contrastive Learning

Since the data of two different modalities belong to different data distributions, in order to make more effective comparison, the event embedding of the two modalities are mapped into multimodal vector space through linear mapping, and then our training goal is to maximize the cosine similarity of N positives and minimize the cosine similarity of N 2 − N negatives by training the parameters of the text event encoder and the image event encoder. The architecture of contrastive learning has been show in Fig. 2(a).The contrastive learning loss is expressed as follows:

344

W. Liu et al.

  1 E(I,T )∼D H(y i2t (I), pi2t (I)) + [H(y t2i (T ), pt2i (T )) (7) 2 b )/τ ) b )/τ ) where pbi2t (I) = Bexp(s(I,T , pbt2i (T ) = Bexp(s(T,I , τ is temperaexp(s(I,T )/τ ) exp(s(T,I )/τ ) Lnce =

b=1

b

b=1

b

ture, y i2t (I) and y t2i (T) represent the one-hot similarity, H(·) is the cross entropy loss function, s(·) represents the cosine similarity. 3.4

Multimodal Prototype-Based Clustering

To avoid semantically related image events being pulled apart, we inspired by Caron [1] in the computer vision domain and introduce a multimodal prototypebased clustering. We consider a set of M prototypes to cluster multimodal features, each prototype serve as a representative embedding of a set of semantically related events, which is associated with a learnable vector C. The model calculates the probability that samples in the batch belong to a certain class, and then compares them through event representations of different modalities, increase the probability of positive samples belonging to the same category, it emphasizes the consistency mapping between different modal features of the same event. Cluster Prediction. Our method works by comparing different modalities of the same event using their intermediate cluster assignments. We believe that if two modalities capture the same semantic, it should be possible to predict the cluster assignment of one modality from another modality. We compute their cluster assignments q1 and q2 by matching the representations of two different modalities with M prototypes. The cluster assignments are swapped between two representations of different modalities: the cluster assignment q1 of the textual event representation zt should be predicted from the image event representation zi , and vice-versa. The cluster prediction loss is expressed as follows: Lcp = (zi , q2 ) + (zt , q1 )

(8)

where function (z, q) measures the fit between  the representation z and the cluster assignment q, as defined by: (z, q) = − k q (k) logp(k) . Here p is a probability vector over the M prototypes whose components are: exp(z T ck /τ ) p(k) = M T i=1 exp(z ci /τ )

(9)

Computer Cluster Assignments. The purpose of clustering is to make different modes of the same event belong to the same category as far as possible, but according to the loss function designed above, the model may divide all samples into the same category in order to minimize the loss, which will lead to the failure of the clustering module. Therefore, we compute the cluster assignments using an optimal transport solver—Sinkhorn-Knopp algorithm [2]. It first begins with a matrix Γ ∈ RM ×N with each element initialized to z  b cm , where b ∈ N  is the index of each column, and then iteratively produces a doubly normalized matrix, the columns of which comprise q for the minibatch.

Event Contrastive Representation Learning Enhanced

3.5

345

Model Training

Our approach takes the sum of symmetric contrastive learning losses for optimizing multimodal features and prototype based multimodal cluster prediction losses as the training objective. The joint losses can be expressed as: Ltotal = αLnce + βLcl

(10)

The first term allows us to incorporate image situational information into event embedding, the second term is to cluster the semantic-related events.

4

Experiment

Following common practice in event representation learning, we analyze the event representations learned by our method on two event similarity tasks. 4.1

Dataset and Implementation Details

For multimodal event training, we constructs a dataset by modifying the M 2 E 2 multimodal event dataset. The original dataset contains 1,014 pairs of image text description pairs. We extracts event triples from image descriptions using the Open Information Extraction system Ollie [16], and manually filters to retain the event triple that are most relevant to the image as positive sample pair. Each event triple is also paired with negative sample pair. The negative sample images are event images from different scenes. Our final multimodal training dataset consist of 1,000 pairs of positive sample event triple-images and 1,000 pairs of negative sample. We use the New York Times Gigaword Corpus (LDC2008T19) for pretraining text event embeddings. We initialize input word embeddings with 100 dimensional pre-trained GloVe [15] vectors, and fine-tune during our model training. We train our model with a batch size of 64 using an adagrad optimizer. We adopt the temperature τ = 0.2 and the numbers of prototypes in our experiment is 5. 4.2

Event Similarity Tasks

Similarity task is a common way to measure the quality of vector representations. Weber [18] introduce two event related similarity tasks: (1) Hard Similarity Task and (2) Transitive Sentence Similarity. Hard Similarity. Dataset from Weber contains 230 pairs events (115 pairs of similar events and 115 pairs of dissimilar events) and Ding [3] expanded it to 2,000 pairs. The positive pairs in these two datasets have high semantic similarity and low repetition rate of words, while the negative pairs are opposite. For each method, we obtain the cosine similarity score of the event pairs, and report the fraction of instances where the similar event pair receives a higher cosine score than the dissimilar event pair (referring to Accuracy ∈ [0, 1]).

346

W. Liu et al.

Transitive Sentence Similarity. This dataset contains 108 pairs of transitive sentences [9] (e.g., short sentences containing a single subject, verb and object). Each pair is labeled with a similarity score ranging from 1 to 7. For each method, we use the Spearman’s correlation ρ between the similarity scores and labeled scores as evaluation metrics. Baselines. We evaluate our methods with several methods based on unimodal and multimodal: – Comp.NN [13]: The distributed event representations are learned by concatenating subject, predicate, and object embeddings into neural network. – Role Factor Tensor [18]: Two tensor contractions using subject & predicate and predicate & object are combined by a single-layer neural network to obtain the final event embedding. – KGEB [5]: it incorporates knowledge graph information in NTN. – NTN+Int+Senti [3]: Neural Tensor Network and its variants augmented with commonsense knowledge based embeddings (Intent, Sentiment). – MERL [20]: This method mapped textual event modality to Gaussian density space and event images projected as point embedding,and use a statistical hypothesis scoring function to coordinate two embedding spaces, MERL (uni) is trained with unimodal data, and MERL (multi) is trained with multimodal. Table 1. Overall performance of different methods on the similarity tasks Method

Uni

Comp.NN NTN Predicate RoleFactor KGEB NTN+Int+Senti MERL triple (uni)

Hard Similarity (Accuracy%)

Transitive Sentence Small dataset Big dataset AvgAcc Similarity (ρ) 33.0 18.9 20.36 0.63 40.0 37.0 37.31 0.60 41.0 25.6 27.18 0.63 43.5 20.7 23.05 0.64 52.6 49.8 50.09 0.61 77.4 62.8 64.31 0.74 44.3 * * 0.61

Multi MERL triple (multi) 52.2 ECRL ntn (our) 58.3 ECRL pre (our) 75.7

* 52.0 68.2

* 52.65 68.97

0.68 0.71 0.67

Result. As Table 1 shows, our method achieves significant improvements over all baselines in AvgACC. Especially, compared with models that incorporates common sense, it also performs well. Because the input of triple can learn limited semantic information without external knowledge, and it is hard to distinguish the semantic of the same verb between events, while its have significant differences in image features, our method enhanced representation with multimodal have better performance in capturing subtle semantic differences.

Event Contrastive Representation Learning Enhanced

347

Table 2. Ablation study for two modules evaluated on the similarity tasks Method

1 2 3 4

AGA PBC Hard Similarity (Accuracy%) Transitive Sentence Small dataset Big dataset Similarity (ρ) ECRL ntn (InfoNCE)   53.0 52.0 0.71   53.0 45.9 0.69   52.2 51.1 0.70   48.7 44.5 0.68

5 6 7 8

ECRL pre (InfoNCE)    

9

ECRL ntn (Margin)





10 ECRL pre (Margin)





   

75.7 70.4 68.7 67.8

68.2 65.1 67.2 65.0

0.67 0.66 0.67 0.65

58.3

46.2

0.68

65.2

62.7

0.67

Ablation Study. This study aims to validate the impacts of two modules including action guided attention, prototype based clustering and two loss function including margin, infonce loss. For this purpose, we design several groups of comparative experiments, and the experimental results are shown in Table 2. Firstly, we arrange and combine the two modules of ECRL (InfoNCE), we find that both two modules have a significant impact on the tasks, suggesting that the AGA module can effectively increase core event participants’ weight, reducing the impact of irrelevant or redundant participant features on the representation. And the PBC module can effectively alleviate the problem of InfoNCE distancing representations of different modalities with the same semantics in the minibatch. Next, we compare the InfoNCE against the margin loss. Obviously, compared with the ECRL (Margin) trained using the margin loss, the ECRL (InfoNCE) trained using the InfoNCE achieves much competitive results on more tasks, indicating that using InfoNCE to weight multiple different negatives, and setting appropriate temperature can help the model learn from hard negative, while the margin loss uses only one negative and can not do that. Analysis and Visualization. In addition, we study the influence of prototype based clustering in the training process. We randomly sample 600 events and embed the event representations learned by ECRL pre (InfoNCE) in 2D using t-SNE method. The clustering labels of each event were determined by the prototype types in the M prototype sets. The resulting visualizations are give in Fig. 4 and Fig. 5. Its show that the PBC can not only narrow the semantic distance between positive pairs, but it can also make semantic related samples

348

W. Liu et al.

Fig. 4. 2D visualizations of the event representation spaces learned by ECRL pre, ECRL (w/o Prototype based Clustering) (left) and ECRL (right).

Fig. 5. Hotmap of event similarity score obtained by ECRL pre, ECRL (w/o Prototype based Clustering) (left) and ECRL (right), the brighter color, the higher similarity.

in minibatch closer. For a instance, events 0, 1, and 6, 7, 8 in Fig. 5 belong to semantic related events, and we can find that ECRL improve similarity score for related events after introducing the PBC.

5

Conclusions

This paper proposed a novel multimodal event representation learning framework, which can enhance the semanticity of text event representation by using image situational information. And then, the multimodal prototype clustering can alleviate the problem of semantic related events within the same batch being treated as negative samples, resulting in poor model performance. Experiments showed that our method is effective in event similarity task. Our future work will focus on enhancing text event representation with dynamic video modal information.

Event Contrastive Representation Learning Enhanced

349

Acknowledgements. This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410) and the Program of the Pujiang National Laboratory (No. P22KN00391).

References 1. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments (2020) 2. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transportation distances. Adv. Neural. Inf. Process. Syst. 26, 2292–2300 (2013) 3. Ding, X., Liao, K., Liu, T., Li, Z., Duan, J.: Event representation learning enhanced with external commonsense knowledge. arXiv preprint arXiv:1909.05190 (2019) 4. Ding, X., Zhang, Y., Liu, T., Duan, J.: Deep learning for event-driven stock prediction. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015) 5. Ding, X., Zhang, Y., Liu, T., Duan, J.: Knowledge-driven event embedding for stock prediction. In: Proceedings of Coling 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2133–2142 (2016) 6. Gao, J., Wang, W., Yu, C., Zhao, H., Ng, W., Xu, R.: Improving event representation via simultaneous weakly supervised contrastive learning and clustering. arXiv preprint arXiv:2203.07633 (2022) 7. Granroth-Wilding, M., Clark, S.: What happens next? Event prediction using a compositional neural network model. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016) 8. Hong, X., Sayeed, A., Demberg, V.: Learning distributed event representations with a multi-task approach. In: Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (2018) 9. Kartsaklis, D., Sadrzadeh, M.: A study of entanglement in a categorical framework of natural language. eprint arxiv (2014) 10. Lee, H., Hudson, D.A., Lee, K., Manning, C.D.: SLM: learning a discourse language representation with sentence unshuffling. arXiv preprint arXiv:2010.16249 (2020) 11. Li, M., et al.: Cross-media structured common space for multimedia event extraction. arXiv preprint arXiv:2005.02472 (2020) 12. Li, Z., Ding, X., Liu, T.: Constructing narrative event evolutionary graph for script event prediction. arXiv preprint arXiv:1805.05081 (2018) 13. Modi, A., Titov, I.: Learning semantic script knowledge with event embeddings. arXiv preprint arXiv:1312.5198 (2013) 14. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018) 15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Conference on Empirical Methods in Natural Language Processing (2014) 16. Schmitz, M., Soderland, S., Bart, R., Etzioni, O., et al.: Open language learning for information extraction. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2012) 17. Vijayaraghavan, P., Roy, D.: Lifelong knowledge-enriched social event representation learning. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)

350

W. Liu et al.

18. Weber, N., Balasubramanian, N., Chambers, N.: Event representations with tensorbased compositions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 19. Yatskar, M., Zettlemoyer, L., Farhadi, A.: Situation recognition: Visual semantic role labeling for image understanding. In: Computer Vision & Pattern Recognition, pp. 5534–5542 (2016) 20. Zhang, L., Zhou, D., He, Y., Yang, Z.: MERL: multimodal event representation learning in heterogeneous embedding spaces. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 14420–14427 (2021)

Promoting Open-Domain Dialogue Generation Through Learning Pattern Information Between Contexts and Responses Mengjuan Liu(B) , Chenyang Liu, Yunfan Yang, Jiang Liu, and Mohan Jing Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, China [email protected], [email protected]

Abstract. Recently, utilizing deep neural networks to build the opendomain dialogue models has become a hot topic. However, the responses generated by these models suffer from many problems such as responses not being contextualized and tend to generate generic responses that lack information content, damaging the user’s experience seriously. Therefore, many studies try introducing more information into the dialogue models to make the generated responses more vivid and informative. Unlike them, this paper improves the quality of generated responses by learning the implicit pattern information between contexts and responses in the training samples. In this paper, we first build an open-domain dialogue model based on the pre-trained language model (i.e., GPT-2). And then, an improved scheduled sampling method is proposed for pre-trained models, by which the responses can be used to guide the response generation in the training phase while avoiding the exposure bias problem. More importantly, we design a response-aware mechanism for mining the implicit pattern information between contexts and responses so that the generated replies are more diverse and approximate to human replies. Finally, we evaluate the proposed model (RAD) on the Persona-Chat and DailyDialog datasets; and the experimental results show that our model outperforms the baselines on most automatic and manual metrics. Keywords: open-domain dialogue · pre-trained language model exposure bias · response-aware mechanism

1

·

Introduction

Natural language dialogue has long been one of the most challenging AI problems, and it is usually divided into two categories: task-oriented dialogue and open-domain dialogue [3]. Compared with the task-oriented dialogue system, it is more challenging for an open-domain dialogue system to ensure the quality of responses since their dialogue contents are not restricted, making it not easy c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 351–362, 2023. https://doi.org/10.1007/978-3-031-44696-2_28

352

M. Liu et al.

to design response templates in advance. As a result, current open-domain dialogue systems are mainly built based on deep neural networks which support generating responses for various dialogue contexts [4]. However, there is a general problem with existing models that tend to generate more generic responses that lack informativeness and are not contextually relevant [5], such as “I don’t know.” and “I am not sure”. Such replies are so tedious that they easily discourage users from continuing the conversations, although they are grammatically correct. To address this problem, some studies have attempted to make better use of the information already available, such as the ground-truth responses labeled in the training samples. In these studies, teacher-forcing [12] is a representative technology to improve model performance using ground-truth responses. When using teacher-forcing, the input to the dialogue model at each decoding step is not the word generated at the previous decoding step but the corresponding word in the labeled response. Teacher-forcing can effectively prevent inappropriate words generated at the previous decoding steps from misleading the word generation at the subsequent decoding steps, thus improving the performance and learning efficiency of the dialogue model. On the other hand, the existing dialogue models mainly focus on the semantic information of the context and ignore the learning of the hidden pattern information between the context and the response, which makes the responses generated by the models not fit well with the real responses. In this paper, we try to make the generated responses more interesting and vivid, similar to human responses, by introducing the implicit pattern information of the context-response pairs into the pre-trained dialogue model. This new response-aware dialogue model is named RAD. Unlike common dialogue models, RAD improves the quality of generated responses by learning the ground-truth response information. When training, it first uses an improved scheduled sampling method to reconstruct the response vector. To the best of our knowledge, it is the first study to improve the quality of generated responses by mining the implicit pattern information between contexts and responses. The contributions of this paper can be summarized as follows: – We present a basic scheme for performing dialogue generation using the pretrained model, based on which we propose a scheduled sampling method oriented to pre-trained models. – We design a response-aware mechanism that includes a response-aware network and a response-aware prediction network. In the training phase, we use the response-aware network to learn a representation vector containing the implicit pattern information of the context-response and feed this vector into the pre-trained model. In the generation phase, we replace the original response-aware vector with the response-aware vector predicted by the response-aware prediction network. This design enriches the pre-trained model’s available information while avoiding model performance degradation in the generation phase due to exposure bias.

Promoting Open-Domain Dialogue Generation

353

– We have evaluated the proposed model (RAD) on the Persona-Chat [15] and DailyDialog [6] datasets. The experimental results show that RAD performs better than the baseline models on most automatic and human evaluation metrics. Also, we upload the datasets, codes, and results to GitHub1 to help researchers in related fields reproduce our experiments quickly.

2

Background

2.1

Open-Domain Dialogue Generation

Open-domain chatbots that do not restrict conversation contents have been a hot topic in NLP research. However, as mentioned in Sect. 1, whether the seq2seq model or the pre-trained model is used to generate responses, the quality of the generated responses is still not good enough compared with the real responses in the training sample, and the utterances generated by the model can hardly reach the richness and interestingness of human language. Some studies have tried to improve the quality of responses by introducing various external information, such as interlocutor personality [7], common-sense knowledge [14], emotional information [17] and other additional supplementary information to improve the quality of the response. These practices have indeed made some progress, but they also have many limitations, such as obtaining additional information. Therefore, we want to make better use of existing information, such as real responses in the training sample, to improve the performance of the model by mining the implicit pattern information between the context and the responses. 2.2

Scheduled Sampling

A widely used technique for resolving exposure bias is scheduled sampling, which means that at each decoding step, the model probabilistically selects the generated word to replace the corresponding word in the ground-truth response as the decoder input during training [1], thus reducing the reliance of the model on ground-truth responses. At the beginning of training, the model selects more words from the response as inputs, and as the training epochs increase, the model selects more generated words as inputs. Many studies have verified that scheduled sampling can effectively address exposure bias. Furthermore, the authors in [16] improved the original scheduled sampling method from the perspective of sampling granularity. They proposed using both sentence and word sampling granularity to obtain candidate texts for model training. In [13], Xu et al. proposed an adaptive switching mechanism that can learn to transition between ground-truth learning and generated learning depending on the wordlevel matching score, such as the cosine similarity. Unfortunately, these scheduled sampling methods are all for seq2seq models and cannot be applied straightforwardly to pre-trained models. In the seq2seq model, the decoder’s input at each decoding step is determined step-by-step when 1

https://github.com/RussellLiu0/RAD.

354

M. Liu et al.

Fig. 1. The structure of the RAD model (the left part is the structure for training and the right part is the structure for generating)

training, however, in the pre-trained model, all words in the response part are required to feed into the model together, making it impossible to perform partial replacements with the generated words because no words have been generated before the response vector is fed into the model. Unlike existing solutions, we propose an improved scheduled sampling method by which teaching-forcing can be used in a pre-trained dialogue model to alleviate the exposure bias problem.

3

Response-Aware Dialogue Model

In this paper, we focus on the single-turn open-domain dialogue. The task of the dialogue model is to generate a syntax-correct, contextually relevant response according to the given context. Unlike the existing pre-trained dialogue models, the response-aware dialogue model (RAD) proposed in this paper adopts two differentiated structures in the training and generation phases. As shown in Fig. 1, the left part is used for training, and the right part is used for generation. 3.1

Pre-trained Language Model

This subsection describes how to use GPT-2, which only contains the Transformer decoder, to generate dialog responses. For a training sample, given the context x = {x1 , x2 , · · · , xm } and the ground-truth response y = {y1 , y2 , · · · , yn } , we first map them into two sequences of word embeddings, recorded as Ex = {ex1 , ex2 , · · · , exm } and Ey = {ey1 , ey2 , · · · , eyn }. Here, exi (exi ∈ R1×L ) denotes the word embedding of the i-th word in the context, eyj (eyj ∈ R1×L ) denotes the word embedding of the j-th word in the response, L is the dimension of the word embedding, and m and n are the lengths of the context and response, respectively. Then, the concatenation of the two sequences is input into GPT-2 for

Promoting Open-Domain Dialogue Generation

355

computing the output vector corresponding to each response position, as shown in Eq. (1), where hi is the output vector of the i-th position. H = Pretrained (Ex , Ey ) , H = {h1 , h2 , · · · , hm , hm+1 , · · · , hm+n }

(1)

Next, we transform each response position’s output vector into a probability distribution of vocabulary dimensions by a softmax function, as shown in Eq. (2). Here, Pt is the generation probability distribution of the word corresponding to the t-th response position, V is the vocabulary size, and W and b are two trainable matrices for the linear transformation. Pt = softmax (W hm+t + b) , Pt = {pt1 , pt2 , · · · , ptV }, t = (1, 2, · · · , n)

(2)

Finally, we calculate the model’s loss function according to Eq. (3), where Pt (yt ) denotes that, for the t-th response position, probability of generating the corresponding ground-truth response word yt according to Pt . Like other pretrained models, we fine-tune the parameters of GPT-2 by minimizing its loss. We should note that the response is produced word-by-word in the generation phase, so each response word in the training phase should be generated without reference to words in subsequent positions. Therefore, we use the attention mask mechanism to mask the attention scores of all words after the current response position in the training phase. LossM = −

n 

log Pt (yt )

(3)

t=1

3.2

Scheduled Sampling for Pre-trained Model

In this subsection, we improve the scheduled sampling method to make it available for the pre-trained model. Its primary task is reconstructing the response vector where some words in the real response are replaced by generated words. Specifically, the method consists of two stages. In the first stage, we input the context and response embeddings into the pre-trained model and get the generation probability distribution of the word corresponding to each response position, working as described in Sect. 3.1. Then, we average the embeddings of K words with the highest generation probabilities at each response position t to obtain its candidate embedding, as shown in Eq. (4). Here, et is the candidate embedding vector of the t-th word in the response, and etk is the word embedding with the k-th highest generation probability at the t-th response position. The standard scheduled sampling directly selects the word with the highest generation probability as the candidate word; however, the proposed scheduled sampling chooses the averaged word embedding as the candidate embedding vector. This approach smooths out the negative impacts of incorrectly generated words (caused by the greedy algorithm) on the results. et =

K 1  t ek , t = 1, 2, · · · , n K k=1

(4)

356

M. Liu et al.

In the second stage, we determine whether to replace the word embedding at each response position with the candidate embedding according to the probability p. Referring to the literature [16], the probability p increases monotonically with the increase of training epochs and is calculated as shown in Eq. (5). Here, μ is the hyper-parameter, and l denotes the number of training epochs. Based on Eq. (5), the probability increases more smoothly. In practice, we first generate a random number between 0 and 1 at each response position. If the random number is less than the probability p, we perform the replacement; otherwise, we still use the word embedding in the ground-truth response. Finally, we obtain the recon  structed response embedding sequence, Er = (er1 , er2 , · · · , ern ) Er ∈ Rn×L . It will be fed into the response-aware network to calculate the response-aware vector first and then fed into the pre-trained model for fine-tuning together with the response-aware vector. 1 p= (5) 1 + (μ/el/µ ) 3.3

Response-Aware Mechanism

In order to make better use of the ground-truth responses in the training samples, we design a response-aware mechanism for mining the implicit pattern information between contexts and responses. First, we design a response-aware network based on a multi-head attention mechanism [11] to compute the response-aware vector, as shown in Eq. (6). Here, Ex , Er and Era are the context vector, the reconstructed response vector, and the response-aware vector. Era is a matrix that contains m rows and L columns with the same form as Ex . Each row corresponds to a word embedding, and L is the dimension of the word embedding. Era = MultiHead (Er , Ex , Ex )

(6)

However, the responses are not available in the generation phase, so it is impossible to compute the response-aware vector. To this end, we design a response-aware prediction network to replace the response-aware network to estimate the response-aware vector, as shown in Fig. 1. The response-aware prediction network is implemented by a feedforward neural network that only takes the dialogue context as input and outputs the predicted response-aware vector, defined in Eq. (7).  = FeedForward (Ex ) (7) Era     m×L approxiTo make the estimated response-aware vector Era Era ∈ R  mate the response-aware vector Era Era ∈ Rm×L , we merge the output vectors of the response-aware network and the response-aware prediction network in the training phase, as shown in Eq. (8), where λ is a decreasing factor to balance the weights of thetwo components. We input the merged response-aware vector  Em Em ∈ Rm×L instead of the context’s embedding vector to the pre-trained model together with the reconstructed response vector Er .  Em = λEra + (1 − λ)Era

(8)

Promoting Open-Domain Dialogue Generation

357

Here, we illustrate the differences between the inputs of the pre-trained model in the training and generation phases by defining two equations. In Eq. (9), the inputs of the pre-trained model are the merged response-aware vector Em which contains the implicit pattern information of context-response, and the reconstructed response vector Er obtained by the scheduled sampling module. In (10), the input of the pre-trained model is the predicted response-aware vector  which is output by the response-aware prediction network only depending Era on the dialogue context. In the training phase: H = Pretrained (Em , Er )

(9)

In the generation phase:  ) H = Pretrained (Era

(10)

In order to make the predicted response-aware vector approximate the response-aware vector, we additionally define a response-aware prediction loss, as shown in (11), which is used to measure the deviation between the predicted response-aware vector and the response-aware vector. m

LossRA =

L

2   1    eij − eij , eij ∈ Era , eij ∈ Era mL i=1 j=1

(11)

Finally, the total loss function of RAD is a weighted sum of the pre-trained model’s loss and the response-aware prediction loss, defined in Eq. (12), where γ is a hyper-parameter that indicates the weights of the two losses. Losstotal = γ LossM +(1 − γ) LossRA

4

(12)

Experiment Settings

Datasets. We evaluate the proposed model (RAD) on two public datasets: Persona-Chat [15] and DailyDialog [6]. Persona-Chat is a multi-turn dialogue dataset collected from a crowdsourcing platform, where volunteers conduct conversations following multiple simple personality descriptions. Each conversation has 5–8 turns with 4–6 persona descriptions. Another dataset, DailyDialog, is collected from an English learning website, in which the utterances come from the conversations between learners. DailyDialog is also a multi-turn dialogue dataset like Persona-Chat; the difference is that DailyDialog does not have explicit textual persona descriptions. As for the statistics of the dataset, Persona-Chat has 131431/7793 utterances pairs in the training/testing dataset, and DailyDialog has 76052/6740 utterances pairs in the training/testing dataset. Baselines. As far as we know, our work is the first study to exploit labeled responses in the training samples to improve the quality of generated responses. Therefore we only compare our model with two basic models based on seq2seq and GPT-2. All the baseline models are described as follows:

358

M. Liu et al.

– Seq2Seq [10]: a basic seq2seq dialogue model with an attention mechanism [11], where the encoder extracts contextual information and the decoder generates replies. Both encoder and decoder are built on single-layer LSTM networks. – Seq2Seq+SS+RA: a seq2seq dialogue model with the addition of scheduled sampling (SS) and response-aware (RA) mechanisms. – GPT-2 [8]: a dialogue model based on GPT-2, where we concatenate the context vector with the response vector and input them into the pre-trained model for generating the response. – RAD (ours): a dialogue model based on GPT-2 with our scheduled sampling (SS) and response-awareness (RA) mechanisms. It is our model proposed in this paper. Implement Details. We build our model with pytorch. The pre-trained dialogue models are built based on GPT-2 [8], which contains 117M parameters, using 12 stacked Transformer layers. Each layer of the Transformer uses a multihead attention mechanism containing 12 heads, and the dimension of the hidden layer is 768. The vocabulary contains 52370 words with BPE word encoding [9], and the word embedding matrix comes from GPT-2. In the improved scheduled sampling method, the hyper-parameters K and μ in Eqs. (4) and (5) are set to 5 and 4. The response-aware network adopts a multi-head attention mechanism with eight heads, and the response-aware prediction network has one hidden layer with 768 neurons. The balance factor λ in Eq. (8) decreases quickly from 1 to 0.2 in the first training epoch and stays constant. The weight γ of the pre-trained model’s loss in Eq. (12) is set to 0.5. We adopt the Adam optimizer to train all models. The learning rate of the seq2seq dialogue models is set to 1e−4, and the corresponding batch size is 32. The learning rate of the pre-trained dialogue models is set to 1e−5, and the batch size used in training is 16 since the pre-trained model has more parameters than the seq2seq model. This setting difference mainly takes into account the GPU performance. Furthermore, the optimizer does not have other special settings.

5 5.1

Experimental Results Automatic Evaluation

In this subsection, we automatically evaluate the proposed and baseline models using metrics of F1, BLEU-1/2, and DISTINCT-1/2. BLEU and F1 are usually used to measure the similarity between generated and labeled responses. A model gets a higher BLEU/F1 score, indicating that it produces better responses than other models. DISTINCT [5] is an evaluation metric that measures the diversity of the generated results. A higher score on this item indicates better diversity in the responses generated by the model. Table 1 shows the results of the automatic evaluation. First, two pre-trained dialogue models (GPT-2 and RAD) have substantial improvements in all metrics over two seq2seq dialogue models, which proves that the pre-trained language models can indeed increase the quality of the responses generated by the

Promoting Open-Domain Dialogue Generation

359

Table 1. Automatic evaluation results (best results are marked in bold). Model

Persona-Chat F1 BLEU-1/2

DailyDialog DISTINCT-1/2 F1 BLEU-1/2

DISTINCT-1/2

Seq2Seq

16.551

0.156/0.052

0.007/0.017

16.589

0.108/0.041

0.027/0.117

Seq2Seq+SS+RA 16.554

0.163/0.056

0.005/0.010

16.689

0.121/0.047

0.031/0.144

GPT-2

18.790

0.177/0.083

0.022/0.092

19.095

0.146/0.067

0.042/0.195

RAD (ours)

19.385 0.190/0.089 0.022/0.093

24.656 0.244/0.140 0.043/0.249

models. Second, RAD achieves the best performance among all models. On the Persona-Chat dataset, compared with the basic GPT-2, RAD improves 3.17% on the F1 score and 7.34%/7.23% on the BLEU-1/2, demonstrating that learning the context-response pattern information can help to increase the model performance. Finally, the basic seq2seq model obtains the lowest F1 and BLEU scores. However, adding the scheduled sampling method and the response-aware mechanism to the seq2seq model, the two metrics improve, especially the BLEU scores, which increased by 4.49% for BLEU-1 and 7.69% for BLEU-2. It shows that the two proposed improvements also effectively increase the response quality generated by the traditional dialogue model. On the DailyDialog dataset, RAD has also achieved good results in all metrics. Compared to GPT-2, RAD has improved by 29.12% on the F1 score and 67.1%/108.9% on the BLEU-1/2 scores, respectively. To our surprise, RAD also shows good improvements in the DISTINCT-2 scores. The results mean that the proposed response-aware mechanism helps enhance the similarity between generated and labeled responses and positively improves the diversity of generated responses. Overall, the two improvements proposed in this paper achieve consistent enhancements on both datasets with seq2seq and pre-trained models. 5.2

Human Evaluation

Generally, automatic evaluation is deficient in measuring the responses’ fluency and semantic consistency, so we tried to conduct human evaluation to assess the responses generated by the model in terms of fluency, contextual consistency, and similarity. First, we randomly sampled 100 testing samples from the PersonaChat and DailyDialog datasets. Then, the responses generated by the models and reference responses are provided to five reviewers. Finally, reviewers score each response according to the follow rules without knowing the response is generated by which model. A score of 2 means that the model performs well and generates responses that meet the requirements, and a score of 0 means that it performs poorly. The results are summarized in Table 2. In this table, fluency is defined as whether the response is fluent and natural, coherence is defined as whether the response is relevant and coherent to the context and similarity is defined as whether the response is close to the real response. First, we can see that on the Context Coherence metric, the scores of the two pre-trained dialogue models have been significantly improved over the two

360

M. Liu et al. Table 2. Human evaluation results (best results are marked in bold).

Model

Persona-Chat DailyDialog Fluency Coherence Similarity Fluency Coherence Similarity

Seq2Seq

1.823

0.825

0.455

1.946

0.771

0.386

Seq2Seq+SS+RA 1.828

0.423

0.431

1.914

0.852

0.454

GPT-2

1.865

1.172

0.815

1.933

1.293

0.680

RAD (ours)

1.858

1.475

0.940

1.916

1.362

0.896

seq2seq-based dialogue models, indicating that the pre-trained models outperform the earlier seq2seq models in terms of context understanding. Further, RAD obtains higher Context Coherence scores than GPT-2, which indicates that the proposed response-aware mechanism does help the model to generate responses more consistent with the dialogue context. Finally, on the similarity metric, RAD improves the similarity with human responses by mining the pattern information compared with the original GPT-2. It should be noted that the results’ Fleiss’ kappa score [2] is 0.494 on the Persona-Chat dataset and 0.413 on the DailyDialog dataset, indicating reviewers agree consistently on the scores. 5.3

Ablation Study

In addition, we perform ablation experiments to validate the effectiveness of the proposed mechanisms. The experimental results are shown in Table 3. First, we observe the impacts of two improvements on the performance of the pre-trained dialogue models. Compared to the basic GPT-2, the scores obtained by GPT2+SS are decreased in all metrics, indicates that replacing the real response vector with the reconstructed response vector input to the pre-trained model during training have negative impacts on the model performance. Similar to seq2seq models, the performance of GPT-2+RA that adds the response-aware mechanism to the basic GPT-2 has been significantly improved, which proves the effectiveness of RA. Further, the F1 and BLEU scores of GPT-2+RA+SS have been increased on the Persona-Chat dataset but decreased on the DailyDialog dataset. Through ablation experiments, we verify the effectiveness of our improvements on the model performance. Clearly, the performance improvements by the RA mechanism are significant, while the performance improvements of SS are not as well as expected. Therefore, we will further investigate whether it needs to introduce the SS mechanism in the dialogue model in the future study.

Promoting Open-Domain Dialogue Generation

361

Table 3. Results of ablation experiments. Model

Persona-Chat F1 BLEU-1/2

DailyDialog DISTINCT-1/2 F1 BLEU-1/2

DISTINCT-1/2

Seq2Seq Seq2Seq+SS Seq2Seq+RA Seq2Seq+SS+RA)

16.551 16.653 16.527 16.554

0.156/0.052 0.143/0.056 0.156/0.062 0.163/0.057

0.007/0.017 0.007/0.021 0.009/0.032 0.005/0.010

16.589 16.338 17.510 16.689

0.108/0.041 0.096/0.039 0.154/0.057 0.121/0.047

0.027/0.117 0.026/0.093 0.029/0.171 0.031/0.144

GPT-2 GPT-2+SS GPT-2+RA GPT-2+SS+RA(RAD)

18.790 18.264 19.233 19.385

0.177/0.083 0.168/0.078 0.177/0.083 0.190/0.089

0.022/0.092 0.022/0.090 0.023/0.103 0.022/0.093

19.095 17.911 25.860 24.656

0.146/0.067 0.122/0.053 0.235/0.156 0.244/0.140

0.042/0.195 0.032/0.135 0.052/0.304 0.043/0.249

6

Conclusion

This paper proposes a new method utilizing ground-truth responses to enhance the performance of open-domain dialogue models. This method provides information to the pre-trained model by learning the implicit pattern between contexts and responses through a response-aware mechanism. Moreover, an improved scheduled sampling technique is used to eliminate exposure bias. Finally, the automatic and human evaluation results validate that our work can improve the quality of generated responses. Acknoledgements. This work was supported in part by the Fundamental Research Funds for the Central Universities (No. ZYGX2016J096) and the Science and Technology Programs of Sichuan Province of China (No. 23GJHZ0016).

References 1. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 2. Fleiss, J.L., Cohen, J.: The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. In: Educational and Psychological Measurement, vol. 33, pp. 613–619 (1973) 3. Gao, J., Galley, M., Li, L.: Neural approaches to conversational AI. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 2–7 (2018) 4. Huang, M., Zhu, X., Gao, J.: Challenges in building intelligent open-domain dialog systems. ACM Trans. Inf. Syst. (TOIS). 38, 1–32 (2020) 5. Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119 (2016) 6. Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: a manually labelled multi-turn dialogue dataset. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 986–995 (2017)

362

M. Liu et al.

7. Liu, Q., Chen, Y., Chen, B., Lou, J.G., Chen, Z., Zhou, B., Zhang, D.: You impress me: dialogue generation via mutual persona perception. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1417–1427 (2020) 8. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019) 9. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016) 10. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014) 11. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 12. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1, 270–280 (1989) 13. Xu, H., Zhang, H., Zou, Y., Chen, H., Ding, Z., Lan, Y.: Adaptive bridge between training and inference for dialogue generation. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2541–2550 (2021) 14. Zhang, H., Liu, Z., Xiong, C., Liu, Z.: Grounded conversation generation as guided traverses in commonsense knowledge graphs. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2031–2043 (2020) 15. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: i have a dog, do you have pets too? In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213 (2018) 16. Zhang, W., Feng, Y., Meng, F., You, D., Liu, Q.: Bridging the gap between training and inference for neural machine translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4334–4343 (2019) 17. Zhou, H., Huang, M., Zhang, T., Zhu, X., Liu, B.: Emotional chatting machine: emotional conversation generation with internal and external memory. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

Neural News Recommendation with Interactive News Encoding and Mixed User Encoding Xintao Jiao(B) , Yongjie Que, Qinghao Zhong, and Jiale Liu South China Normal University, Guangzhou, Guangdong, China [email protected]

Abstract. Personalized news recommendation is of great significance in helping users find news of interest. In the task of news recommendation, the main point is the representation learning of news and users. Current news encoding methods tend to overlook the interactive information mining that occurs among the title, content, and entities, ultimately resulting in a weakened ability to accurately model news. In former user modeling methods, the approach often focused on linear modeling of user features, which lacked structural and serialization feature mining for users, ultimately resulting in insufficient user encoding capabilities. To improve the performance of news recommendation, we propose a neural news recommendation model with Interactive News Encoding (INE) and Mixed User Encoding (MUE). In the INE, the cross-encoding module is utilized to encode the interactions among the title, content and entity, which enables a deeper semantic analysis of their interactive information. In the MUE, graph convolutional networks (GCN) is used to learn the structural features of user preferences, and the multi-head self-attention networks are used to capture serialization features of user browsing records. We carried out comparative experiments and compatibility experiments on the MIND dataset. According to the results, the INE and MUE outperform baseline models, indicating that the INE and MUE are effective and universal. Keywords: Cross-Encoding · GCN · Multi-Head Self-Attention · News Recommendation

1 Introduction With the rapid development of the Internet, the news websites have become an important way for people to get the information [1]. Due to the rapid increase in the number of news, it is difficult for users to obtain the news of interest [2]. The news recommendation system was proposed to filter out news that users are interested in for personalized news recommendations [3]. In order to solve the problem of insufficient mining ability of traditional news recommendation methods, many news recommendation methods based on deep learning have been proposed [4]. The core task of these methods is to learn news and users representations [5]. In the news encoding, information such as title, content, and entities © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 363–375, 2023. https://doi.org/10.1007/978-3-031-44696-2_29

364

X. Jiao et al.

are often encoded independently, which cut off the connection between titles, news, and entities, and it is not conducive to news semantic mining. In the user encoding, existing methods mainly focused on learning user features through linear sequences, they ignored mining the structural and serialization features of users, which caused the incomplete description of user features. To address the problems in news encoding and user encoding, this paper proposed the interactive news encoding to learn news feature representation and the mixed user encoding to learn user structural and serialization interest feature representation.

2 Related Work Personalized news recommendation is an important task in the fields of natural language processing and data mining [6]. In recent years, many methods based on deep learning have been proposed [7]. [8] proposed a news recommendation method that involves learning news representation from news content through denoising auto-encoders and using gated recurrent neural networks to learn user representations from user browsed records. [9] used a deep knowledge-aware Convolutional Neural Network (CNN) to learn news representation from news title, and to measure the similarity between the candidate news and browsing history to get the user representation. [10] proposed a method based on the CNN to learn the contextual representation of news titles and content, and incorporates word-level attention networks to capture important information from different words for news representation. [11] proposed encoding news representation from titles, themes, and subtopics, and combining user short-term and long-term interest representations to extract the overall feature representation of users. [12] proposed a news recommendation method that combines multi-head attention mechanism and self-attention mechanism to encode news titles and content independently. Furthermore, the method introduces personalized reading features based on content interests and title preferences to construct user representations. [13] proposed jointly encoding news titles and content, and employing hierarchical clustering to model user representations based on the user browsing records. In this paper, we propose a news recommendation method that utilizes interactive news encoding to mine deep semantic interactions among news title, content, and entity, and adopts structural and serialization encoding of user browsing records to construct more representative user representations.

3 Our Approach As shown in Fig. 1, the INE-MUE model is composed of the interactive news encoder, the mixed user encoder, and the click predictor. 3.1 Interactive News Encoder The current methods usually encode title and content independently, which cut off the connection between them. On the contrary INE fully integrates the title, content, and

Neural News Recommendation with Interactive News

365

Fig. 1. The architecture of INE-MUE.

entities through the interactive encoding module to learn their deep semantic interactions and obtain the characteristics of different types of news information. The INE consists of the following three components. Title-Content Encoding Component. We employ a pseudo twin network to learn the feature representation of news titles and content. It has three layers. The first layer, the word embedding layer, is used to convert news title and content from word sequences into low-dimensional semantic vector sequences. The news t ] with a length of M is converted to a word vector title word sequence [w1t , w2t , ..., wM t t t sequence [v1 , v2 , ..., vM ], and the news content word sequence [w1c , w2c , ..., wNc ] with a length of N is converted into a word vector sequence [v1c , v2c , ..., vNc ]. The second layer, the Bidirectional Long Short-Term Memory (Bi-LSTM) [14] layer, is employed to extract the semantic feature information from the news title and content, and mine the contextual information of news words, and learn the feature representations of the news title and content. The third layer, the word-level self-attention [15] layer, is utilized to learn the weights of words in the news text based on the various amount of information. By adjusting the degree of attention given to different words, the self-attention representation of the news title and content is obtained. t = softmax(υ T tanh(Whti + b)) αs,i

(1)

c = softmax(υ T tanh(Whci + d )) αs,i

(2)

rst =

N 

t t αs,i hi

(3)

c c αs,i hi

(4)

i=1

rsc =

N  i=1

366

X. Jiao et al.

where υ T is a trainable weight vector, Wht and Whc are parameter matrices, b and d are t and α c of each word in the title and content text bias terms. The attention weights αs,i s,i sequence for the query vector are obtained through Eq. (1) and (2). By using Eq. (3) and (4), the self attention representation of the title rst and the self attention representation of the content rsc are obtained. Entity Relationship Encoding Component. It is mainly utilized to extract the relationship information of entities in news. First, entity linking technology is applied to associate news entities with predefined entities in the knowledge graph to eliminate the ambiguities. Based on the identified entities, their corresponding relationship links are extracted from the original knowledge graph to construct sub-graphs. Secondly, the distance-based translation model TransE [16] is used to obtain each entity embedding, as shown in Eq. (5) and (6). et = {eit |(et , φ, eit ) ∈ ϑor(eit , φ, et ) ∈ ϑ}

(5)

ec = {eic |(ec , φ, eic ) ∈ ϑor(eic , φ, ec ) ∈ ϑ}

(6)

where ϕ is the relationship between eit and eic , ϑ is a knowledge graph, eit and eic are entity embedding on the knowledge graph. Though the information of entities in the knowledge graph may be sparse, the certain semantic and logical associations between adjacent entities could be used as a supplement to enhance entities information. The average values z t and z c of adjacent entities are calculated according to Eq. (7) and (8). zt =

1  t e et t t i

(7)

1  c e ec c c i

(8)

ei ∈e

zc =

ei ∈e

Interactive Encoding Component. It is mainly used for the interactive coding of news title, content and entity to learn their semantic interactive information. is used as a query, and The self-attention representation rst (rsc ) of the title  (content)  the content (title) is embedded as the key value {hc }( ht ) to construct a cross attention network. Then, the cross attention representation rct (rcc ) of the title (content) are obtained by Eq. (9) - (12) [15]. αct = Attention(rsc , {ht })

(9)

αcc = Attention(rst , {hc })

(10)

rct =

N  i=1

t t αc,i hi

(11)

Neural News Recommendation with Interactive News

rcc =

N 

c c αc,i hi

367

(12)

i=1

The cross-attention representation of the news title(content) is multiplied elementwise with the corresponding entity embedding of the title(content) to obtain the news title(content) representation r t (r c ), as shown in Eq. (13) and (14). The news title representation and the news content representation are coupled to obtain the final news representation r n , as shown in Eq. (15). t r t = rcross ⊗ zt

(13)

c r c = rcross ⊗ zc

(14)

t c r n = [r t ⊕ rself ; r c ⊕ rself ]

(15)

3.2 Mixed User Encoder The Mixed User Encoder (MUE) is utilized to extract the structural and serialization features from user historic records to obtain more accurate user representations. The MUE includes three different modules. User Structural Encoding Module. There are correlation features among different news and user interest features are mined from the news they browsed. So users have structural features, which is beneficial for building more accurate user representations and improving the accuracy of news recommendations. In user structural encoding, cluster graph are used to characterize the structural features of user preferences. Construct user interest graph G1. The news browsed by the user are used as the node {En } of interest graph G1, and the nodes are connected through bidirectional edges {Vn } to construct a cluster graph G1 = (Vn , En ). The nodes share category labels, and each interest graph reflects a user’s preference. Construct user interest cluster graph G2. Users often show interests in different topics, that is, they may have multiple interest graphs. Due to the dispersed characteristic of the user interest graph, the proxy node Vpi can be used to connect the proxy nodes through bidirectional edge {E 1p }. The proxy node and news node can be connected through bidirectional edge {E 2p } to construct the user interest cluster collection graph G2 = ({Vn , Vp }, {Ep1 , Ep2 }). The user structural feature graph G consists of G1 and G2: G = {G1, G2}. The characteristic matrix H = [r n , r p ] ∈ (|Vn |+|Vp |)×d is represented by the d -dimensional node embedding vector. The adjacency matrix is denoted as A, the degree matrix is marked as D, and W is an adjustable matrix. 1

1

H l+1 = ReLU (D− 2 AD− 2 H l W l ) + H l

(16)

368

X. Jiao et al.

GCN is employed to extract structural features from the graph. This enables the extraction of user feature representations within the interest graph, and structural user representations are mined from news node embedding. |V |

n r str = {HiL }i=1 ∈ (|Vn | × d )

(17)

User Serialization Encoding Module. There are serialization features in news browsed by users, which can be mined to reinforce user representation. The process of user serialization encoding involves utilizing a news-level multi-head attention network to capture the sequence dependency information between news. exp(riT Qkn rj ) k βi,j = M T n m=1 exp(ri Qk rm ) hni,k = Vkn (

M 

k βi,j rj )

(18)

(19)

j=1 k where Qkn and Vkn are the parameters of the k-th news self-attention head, and βi,j represents the relative importance of the interaction between the j-th and k-th news.

As different news contain various amounts of information, the self-attention mechanism is employed to select more important news. The degree of attention assigned to the i-th news article viewed by the user is denoted as αin . cin = qnT tanh(Vn × hni + Vn )

(20)

exp(cin ) αin = N n j=1 exp(cj )

(21)

where Vn and qn are the value and the query in attention network, and n is the number of news. The serialization user representation is computed as the weighted sum of the representations of the news viewed by the user, as shown in Eq. (22). r ser =

N 

αin hni

(22)

i=1

Mixed User Encoding Module. After obtaining the structural representation r str and serialization representation r ser of the user, the overall interest and preference features of the user are mixed and encoded to obtain the final user representation r u , as shown in Eq. (23). r u = r str ⊗ r ser

(23)

Neural News Recommendation with Interactive News

369

3.3 Click Predictor and Model Training The click predictor performs correlation calculations between candidate news representations and user representations to obtain matching scores between them. It sorts the candidate news based on these scores to recommend news to users. We calculate the dot product yn,u =< r n , r u > of user representation r u and news representation r n as the matching score between user u and news n. Then, we normalize the probability of user u clicking on the i-th news to obtain a matching score yi .We defined news that users have browsed as positive samples, and randomly selected K news from the dataset that users have not browsed as negative samples. The problem of news click prediction is redefined as a pseudo (K + 1) direction classification problem by combining the normalized click prediction scores yi,j of positive sample news and K negative sample news. The normalized matching probabilities are derived through the softmax function, and the model training loss L is obtained by summing up the negative log-likelihood of the positive samples over the training dataset D, as shown in Eq. (24). L=−

|D|  i=1

log

exp(yi )  exp(yi ) + K j=1 exp(yi,j )

(24)

4 Experiment 4.1 Dataset and Experimental Settings The INE-MUE model is evaluated on the Microsoft open-source news dataset MIND. Due to its large size, we randomly extracted 10,000 and 200,000 user click logs from the usage data of one million users to construct MIND-10K and MIND-200K. The detailed statistical information is provided in Table 1. Table 1. Statistics of the MIND-10K dataset and the MIND-200K dataset. Dataset MIND10K

#users

#news

10,000 30,486

#samples 1,130,680

#positive #negative samples samples

avg.#words avg.#content per title per body

45,662

1,085,018 11.86

41.78

MIND-200K 200,000 76,124 22,268,366 902,484

21,365,882 11.58

42.60

In the experiment, we initialized word embedding using pre-trained 300-dimensional Glove embedding. We used multiple ranking metrics to evaluate the performance of the model, including AUC, MRR and NDCG. The test results are the average of 5 runs.

370

X. Jiao et al.

4.2 Baseline To evaluate the effectiveness of the INE-MUE model, we compared it with the following representative baseline models: DAE-GRU [8]: The model employs denoising auto-encoder to encode news title and Gated Recurrent Unit (GRU) network to encode news browsed by the user. DFM [17]: A deep fusion model uses combinations of dense layers at different depths. DKN [9]: A method based on knowledge aware CNN. LSTUR [11]: The model encodes the news title through CNN and encodes the user through incorporates the long-term and short-term interest. NAML [18]: The model utilizes CNN to encode news titles and content, fuses the news topics and subtopics, and employs multi-head self-attention networks to encode news browsed by the user. NPA [19]: A method based on user personalized attention network. NRMS [20]: The model employs multi-head attention networks to extract finegrained features from both news titles and news browsed by the user. CNE-SUE [13]: The model collaboratively encodes news titles and content, as well as structurally encodes user browsing records. 4.3 Main Results The results based on MIND-10K are shown in Table 2. It shows that INE-MUE outperformed baseline models in all evaluation metrics (compared with the optimal baseline model CNE-SUE, + 0.85% AUC, + 1.29% MRR, + 1.19% nDCG@5, + 1.18% nDCG@10). The superior performance of INE-MUE can be attributed to the use of interactive news encoding and mixed user encoding. Table 2. Performance comparison results using MIND-10K dataset. (bold indicates optimal results, underlined indicates suboptimal results). Methods

AUC

MRR

nDCG@5

nDCG@10

DAE-GRU

59.88

26.93

29.87

36.06

DFM

63.78

29.20

32.22

38.42

DKN

62.27

27.53

29.65

36.70

LSTUR

64.69

29.43

32.47

38.69

NAML

63.03

27.97

31.12

37.05

NPA

63.98

28.91

32.14

37.99

NRMS

64.89

29.38

32.49

38.82

CNE-SUE

65.15

29.45

33.04

39.22

INE-MUE

66.00

30.74

34.23

40.40

Neural News Recommendation with Interactive News

371

As for news encoding, compared to methods that rely solely on news titles or content (such as NPA, DKN, NRMS), INE utilizes news titles, content, and entities. It results in a more comprehensive semantic representation of the news. Compared to methods of encoding titles and content independently (such as NAML), INE is able to further mine the semantic interaction information from news titles, content and entities. Compared to methods of collaborative encoding titles and content (such as CNE-SUE), INE uses news entity relationship to supplement knowledge-level information for news encoding, thus improving the capability of news representation. As for user encoding, compared to methods that rely on recurrent neural networks (such as LSTUR, DAE-GRU) and attention networks (such as NPA, NAML, NRMS), MUE learns user structural features better. When compared to methods with GCN (such as SUE), MUE employs a multi-head attention network to learn serialization features from user browsing records, and MUE improves user encoding capabilities. To further verify the effectiveness of INE-MUE, we conducted experiments on MIND-200K dataset. The results are shown in Table 3. It shows that INE-MUE can still achieve the best results in various evaluation metrics (compared with the optimal baseline method CNE-SUE, +0.40% AUC, +0.46% MRR, +0.27% nDCG@5, +0.27% nDCG@10). This proves that the INE-MUE method is effective in news encoding and user encoding. Table 3. Performance comparison results using MIND-200K dataset. (bold indicates optimal results, underlined indicates suboptimal results). Methods

AUC

MRR

nDCG@5

nDCG@10

DAE-GRU

65.98

31.48

34.93

41.12

DFM

64.63

29.80

32.82

39.29

DKN

66.20

31.25

34.23

40.92

LSTUR

68.10

32.87

36.46

42.69

NAML

68.63

33.16

36.79

43.04

NPA

67.34

32.59

35.98

42.28

NRMS

68.61

33.46

37.02

43.30

CNE-SUE

69.55

33.70

37.54

43.79

INE-MUE

69.95

33.96

37.88

44.06

4.4 Compatibility Experiment To fully prove the validity of INE and MUE methods, compatibility experiments were designed. In the INE compatibility experiment, the user encoding method was uniformly replaced with GRU [9]. In the MUE compatibility experiment, news encoding method was replaced with mentioned NAML [18]. The compatibility experiments were conducted on the MIND-10K dataset.

372

X. Jiao et al.

Fig. 2. Performance comparison results of compatibility of news encoding methods and user encoding methods (+G indicates that the user encoding method is GRU, N + indicates the user encoding method is NAML).

Figure 2(a) shows replacing the user encoding method with GRU still achieves the best results in all metrics compared to baseline models, indicating that INE has effectively improved news encoding ability and has strong compatibility. As shown in Fig. 2(b), on the basis of replacing the news encoding method with NAML, MUE still achieves the best results compared to baseline models, demonstrating that MUE enhances user representation ability and has strong compatibility. 4.5 Ablation Experiment To investigate the impact of each component of the news encoder and the user encoder, we conducted ablation experiments on the MIND-10K dataset. For the news encoder: the title, the content, the entity, the title and content, and interactive component are deleted independently, which are denoted as Ours-T, Ours-C, Ours-E, Ours-T-C and Ours-I respectively. For user encoder, the user structural encoding module and user serialization encoding module are deleted independently, which are denoted as Ours-S and Ours-O respectively. Figure 3(a) shows that in the news encoder, when a certain component is deleted, all evaluation metrics decrease, indicating that the component enhances news encoding capability. In Fig. 3(b), as for the user encoder, removing the structural encoding module or serialization encoding module will result in a decrease in model recommendation accuracy, which indicates the two modules can make contribution to the performance of news recommendation.

Neural News Recommendation with Interactive News

373

Fig. 3. Performance results of ablation experiment of INE and MUE.

5 Conclusion This paper introduces a news recommendation method based on interactive news encoding and mixed user encoding. The core of this method lies in the improvement of news encoder and user encoder. For news encoding, the news title, content, and entity are interactively encoded to mine semantic interactive information, leading to more accurate news encoding. For user encoding, mixed encoding is performed on user features to learn structural and serialization information, ultimately enhancing their representation capabilities. Experiments were conducted on the MIND dataset of MSN News, which confirmed that INE-MUE effectively improves news recommendation accuracy. Acknowledgement. This work was supported by Foshan Higher Education High Level Talents Project (81031936).

References 1. Xu, J., He, X., Li, H.: deep learning for matching in search and recommendation. In: 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1365–1368. Assoc Computing Machinary 1515 Broadway, New York, NY, pp. 10036–9998 USA (2018) 2. Zhaok, X., Liu, H., Fan, W., et al.: AutoEmb: automated embedding dimensionality search in streaming recommendations. In: 21st IEEE International Conference on Data Mining, pp. 896–905. IEEE Computer SOC10662 Los Vaqueros Circle, Los Alamitos, CA 90720– 1264 USA (2021) 3. Kazai, G., Yusof, I., Clarke, D.: Personalised news and blog recommendations based on user location, Facebook and Twitter user profiling. In: 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 1129–1132. Assoc Computing Machinary 1515 Broadway, New York, NY 10036–9998 USA (2016) 4. Bansal, T., Das, M., Bhattacharyya, V.: Content driven user profiling for comment-worthy recommendations of news and blog articles. In: Proceedings of the 9th ACM Conference on Recommender Systems, pp. 195–202 (2015)

374

X. Jiao et al.

5. Moreira, P., De Souza, G.: CHAMELEON: a deep learning meta-architecture for news recommender systems. In: 12th ACM Conference on Recommender Systems, pp. 578–583. Assoc Computing Machinery1601 Broadway, 10th Floor, New York, USA (2018) 6. Gulla. J.A., Bei, Y., Özgöbek, Ö., et al.: 3rd International Workshop on News Recommendation and Analytics. In: Proceedings of the 9th ACM Conference on Recommender Systems, pp. 345–346. Assoc Computing Machinery1515 Broadway, New York, USA (2015) 7. Wang, S., Li, X., Sun, F., et al.: Survey of research on personalized news recommendation techniques. J. Front. Comput. Sci. Technol. 14(1), 18–29 (2020) 8. Okura, S., Tagami, Y., Ono, S., et al.: Embedding-based news recommendation for millions of users. In: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1933–1942. Assoc Computing Machinery1601 Broadway, 10th Floor, New York, USA (2017) 9. Wang, H., Zhang, F., Xie, X., et al.: DKN: deep knowledge-aware network for news recommendation. In: 27th World Wide Web Conference, pp. 1835–1844. Assoc Computing Machinery1515 Broadway, New York, USA (2018) 10. Wu, C., Wu, F., An, M., et al.: Neural news recommendation with attentive multi-view learning. In: 28th International Joint Conference on Artificial Intelligence, pp. 3863–3869. IJCAI-INT Joint Conf Artif Intellalbert-Ludwigs Univ Freiburg Geores-Kohler-Allee, Inst Informatik, GEB 052, Freiburg, D-79110, Germany (2019) 11. An, M., Wu, F., Wu, C., et al.: Neural news recommendation with long- and short-term user representations. In: 57th Annual Meeting of the Association-for-Computational-Linguistics, pp. 336–345. Assoc Computational Linguistics-ACL209 N Eighth Street, Stroudsburg, PA 18360 USA (2019) 12. Wu, C., Wu, F., Qi, T., et al.: User modeling with click preference and reading satisfaction for news recommendation. In: 29th International Joint Conference on Artificial Intelligence, pp. 3023–3029. IJCAI-INT Joint Conf Artif Intellalbert-Ludwigs Univ Freiburg GeoresKohler-Allee, Inst Informatik, GEB 052, Freiburg, D-79110, Germany (2020) 13. Mao, Z., Zeng, X., Wong, K.-F.: Neural news recommendation with collaborative news encoding and structural user encoding. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 46–55 (2021) 14. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (2002) 15. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: 31st Annual Conference on Neural Information Processing Systems, pp. 5998–6008. Neural Information Processing Systems, NIPS 10010 North Torrey Pines RD, La Jolla, California 92037 USA (2017) 16. Bordes, A., Usunier, N., Garcia-Duran, A., et al.: Translating embeddings for modeling multirelational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795 (2013) 17. Lian, J., Zhang, F., Xie, X., et al.: Towards better representation learning for personalized news recommendations: a multi-channel deep fusion approach. In: 27th International Joint Conference on Artificial Intelligence, pp. 3805–3811. IJCAI-INT Joint Conf Artif IntellalbertLudwigs Univ Freiburg Geores-Kohler-Allee, Inst Informatik, GEB 052, Freiburg, D-79110, Germany (2018) 18. Wu, C., Wu, F., An, M., et al.: Neural news recommendation with attentive multi-view learning. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3863–3869 (2019)

Neural News Recommendation with Interactive News

375

19. Wu, C., Wu, F., An, M., et al.: NPA: neural news recommendation with personalized attention. In: 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 6389–6394. Assoc Computing Machinery1515 Broadway, New York, NY 10036–9998 USA (2019) 20. Wu, C., Wu, F., Ge, S., et al.: Neural news recommendation with multi-head self-attention. In: 9th International Joint Conference on Natural Language Processing, pp. 6389–6394. Assoc Computational Linguistics-ACL, 209 N Eighth Street, Stroudsburg, USA (2019)

Poster: Machine Translation and Multilinguality

CD-BLI: Confidence-Based Dual Refinement for Unsupervised Bilingual Lexicon Induction Shenglong Yu, Wenya Guo, Ying Zhang(B) , and Xiaojie Yuan College of Computer Science, Nankai University, Tianjin 300350, China {yushenglong,guowenya}@dbis.nankai.edu.cn, {yingzhang,yuanxj}@nankai.edu.cn Abstract. Unsupervised bilingual lexicon induction is a crucial and challenging task in multilingual NLP, which aims to induce word translation by aligning monolingual word embeddings. Existing works treat word pairs equally and lack consideration of word pair credibility, which further leads to the fact that these methods are limited to global operations on static word embeddings and fail when using pre-trained language models. To address this problem, we propose confidence-based dual refinement for unsupervised bilingual lexicon induction, where embeddings are refined by dual aspects (static word embeddings and pre-trained models) based on the confidence of word pairs, i.e., the credibility of word pairs in correct alignment. For static word embeddings, instead of global operations, we calculate personalized mappings for different words based on confidence. For pre-trained language models, we fine-tune the model with positive and negative samples generated according to confidence. Finally, we combine the output of both aspects as the final result. Extensive experimental results on public datasets, including both rich-resource and low-resource languages, demonstrate the superiority of our proposal. Keywords: Bilingual Lexicon Induction alignment · Pre-trained language model

1

· Embedding space

Introduction

Bilingual lexicon translation (BLI) or word translation is one of the seminal and long-standing tasks in multilingual NLP [8,11,17,24,26], which aims at inducing word translations across two languages based on the monolingual corpus and spawns a myriad of NLP tasks such as machine translation [7,23], semantic parsing [29], and document classification [14]. Unsupervised bilingual lexicon induction is a special case of the BLI task, which aims to achieve word translation by relying only on a monolingual corpus We thank anonymous reviewers for their valuable comments. This work was supported by the Natural Science Foundation of Tianjin, China (No. 22JCQNJC01580, 22JCJQJC00150), the Fundamental Research Funds for the Central Universities (No. 63231149), and the National Natural Science Foundation of China (No. 62272250). c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 379–391, 2023. https://doi.org/10.1007/978-3-031-44696-2_30

380

S. Yu et al.

without giving reference dictionaries [5,19]. Existing approaches [1,2,13,21,25] mainly focus on learning projection matrices to globally align word embeddings based on an initialized seed dictionary. And the seed dictionary is generated based on the isomorphism assumption, which implies that the embedding space is similar across languages. However, they treat word pairs in the seed dictionary equally and can’t generate highly credible aligned/unaligned word pairs. However, in the unsupervised setting, more frequent words and closer word pairs may have greater credibility in correct alignment, which can be measured by confidence scores. And the inadequate consideration of confidence leads to the following problems: 1) Existing approaches [1,5,25] strictly adhere to the isomorphism assumption and operate different words globally. And the isomorphism assumption has been shown not to hold strictly in the general case [22,27], so globally shared mappings are not the best solution and there are methods [9,28] to generate personalized mappings for words in the supervised setting. However, these personalized methods presuppose highly confident aligned word pairs, while existing unsupervised methods are unable to generate correctly aligned word pairs with high confidence and to take advantage of personalized mappings. 2) There are approaches [16,17,31] for supervised BLI tasks that use large pre-trained multilingual language models to improve performance through contrastive learning. And in the supervised case, aligned word pairs are used as positive samples, and unaligned word pairs are used as negative samples. However, without considering the confidence of word pairs, methods in the unsupervised setting cannot generate highly credible positive and negative samples and then fail to utilize the pre-trained models.

Fig. 1. An overview of the proposed CD-BLI method (see Sect. 3). It refines the embeddings through static word embeddings (WEs) and pre-trained language models (LMs) based on confidence.

To address these problems, we propose a Confidence-based Dual refinement for unsupervised Bilingual Lexicon Induction (CD-BLI), which refines the

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

381

embeddings generated by global mapping methods through dual aspects. Unlike previous works which treat all word pairs equally, we calculate confidence scores for word pairs based on word frequency and pairs distance, representing the credibility of word pairs. And then we use the confidence to refine the embeddings in dual aspects, called C1 and C2 stages, respectively: For static word embeddings (WEs), we first generate the seed dictionary based on the confidence score. Then different from the global operation, we conduct personalized mappings for different words, using the confidence score as the basis for calculating the direction and distance of the mapping. For large pre-trained multilingual language models (LMs), we generate highly credible positive and negative samples based on confidence scores and then fine-tune the pre-trained LM based on the samples. Finally, the outputs of LM are combined with the WEs to obtain the final results. We conduct extensive BLI experiments on the standard BLI benchmark [5], including rich- and low-resource languages. Experimental results show that CDBLI performs better in both types of languages, which demonstrates the effectiveness of CD-BLI. Our contributions are summarized as follows: • To the best of our knowledge, we are the first to propose a confidence-based method for unsupervised bilingual lexicon induction tasks, which is able to generate highly credible aligned/unaligned word pairs based on confidence. • We refine the embeddings from WEs and LMs aspects based on confidence. For WEs, we calculate a personalized mapping for each word. For LMs, we fine-tune a multilingual LM based on samples extracted via confidence. • We conduct extensive experiments over the standard BLI benchmark and the results show the superiority of CD-BLI.

2

Related Works

BLI Method. Recently, there has been a lot of progress in BLI research. Unsupervised BLI [3,5,21,25] mainly involves establishing a mapping between different language spaces without any parallel training data. Supervised BLI [2,12,13,20,22,27,32], on the other hand, utilizes parallel corpus or dictionaries to establish a mapping between the languages. However, these BLI methods mainly focus on learning global mapping functions that map source word embeddings to the target word embedding space. Our work computes personalized mappings for different words, which are more detailed and accurate. Exposing Lexical Knowledge from Pre-trained LMs. Recently, off-theshelf multilingual language models have been used to improve the results of lexical tasks. [16,17,31] use pre-trained models to enhance the performance of BLI in supervised conditions. But to the best of our knowledge, there is no research on applying pre-trained models in unsupervised conditions. And we propose a confidence-based approach to leverage the lexical Knowledge available in pre-trained models.

382

3 3.1

S. Yu et al.

Methodology BLI Task Definition

Let X ∈ Rd×nx and Y ∈ Rd×ny represent monolingual embeddings matrices consisting of nx and ny words for the source and target languages respectively, and d means the size of embeddings. Besides, the vocabulary for source and target languages are represented as Wx = {w1x , ..., wnxx } and Wy = {w1y , ..., wny y }. We also denote all possible cross-lingual word pairs as Π = Wx ×Wy , and a word pair from Π as π = (wx , wy ). Existing methods [2,3,17,19,25] most focus on mapping-based BLI method and the aim is to learn a mapping function f : X → Y . At inference, the methods utilize the mapping function f to transform the embedding matrix X to the embedding matrix Y . Then retrieve the most similar word wy for each word wx in Wx , i.e., wy = argmaxwy ∈Wy Sim(wx , wy ), where Sim is the function used to measure the similarity, and the most common function is cosine similarity. For y y x x , wd1 ), ..., (wdn , wdn )} is given. evaluation, a test lexicon DT = {(wd1 3.2

Method in a Nutshell

We propose a novel confidence-based dual refinement method for unsupervised bilingual lexicon induction. For bilingual languages, an initial and global alignment of the embedding space is first performed using a mapping-based approach. Based on the initial alignment, we calculate confidence scores for all word pairs Π, described in Sect. 3.3, and refine the embedding spaces through two aspects, called C1 and C2 stages. For WEs (C1 stage: Sect. 3.4), a seed dictionary is extracted through confidence scores, then personalized mappings are conducted for different words based on the seed dictionary. For LMs (C2 stage: Sect. 3.5), positive and negative samples are extracted through confidence scores, and we fine-tune the LM. Finally, we combine the word embeddings obtained by LM with static word embeddings to get the final BLI result. 3.3

Confidence Score Calculation

We first use a mapping-based method to globally operate the embedding space of the source language X to align with Y through a mapping matrix W . The aligned embedding space XD is calculated as: XD = XW

(1)

After aligning word embedding spaces, the confidence scores for all word pairs Π are calculated from two parts. First, we note that the more frequent words have more accurate word embeddings. Therefore, more frequent words have a greater likelihood of being correctly aligned. So we use word frequency as part of the confidence calculation: F (wx ) = 1 +

1 f (wx )

(2)

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

F (wy ) = 1 +

1 f (wy )

383

(3)

where f means the order of words w∗ when monolingual words are ordered from highest to lowest word frequency. And then the frequency score F (w∗ ) is calculated. In addition, we also notice that word-to-word similarity is an important indicator to measure confidence. And the higher the similarity, the higher the confidence that word pairs are in correct alignment. Thus, similarity is another part of confidence. We calculate the similarity between words by the CSLS method [13] for the embeddings x ∈ XD and y ∈ Y : SimC (π) = cos(x, y) − ΓX (y) − ΓY (x)

(4)

cos denotes the cosine similarity, ΓX (y) is the average cosine similarity between y and its k nearest neighbours (typically k = 10) in XD ; ΓY (x) is defined similarly. CSLS is a standard similarity function in BLI which outperforms the cosine similarity, as it solves the hubness problem. The confidence score for a word pair π = (wx , wy ) is calculated as: C(wx , wy ) = SimC (wx , wy ) · F (wx )F (wy )

(5)

Then we refine the word embeddings based on confidence scores through dual aspects. 3.4

C1: Refinement Based on WEs

For WEs, different from existing methods, we calculate a personalized mapping for each word through two steps. Seed Dictionary Extraction. The target word with the highest confidence is calculated for each source word according to Eq. 5, and a candidate seed dictionary is formed from these word pairs. In order to filter out the word pairs with higher accuracy to form the seed dictionary, we use the K-means algorithm to classify the word pairs in the candidate seed dictionary into N classes by confidence score. We generate the seed dictionary with the highest confidence and relatively concentrated confidence. And we take the class with the highest confidence as the seed dictionary D. Personalized Mappings for Embeddings. The word pairs in D are used to calculate the personalized mappings. As shown in Fig. 1 WEs, first, the weights between the source/target words and the source/target words in the word pair of the seed dictionary are calculated based on the confidence score. Second, the weights are used to translate the source/target word embeddings. For source word wx ∈ Wx with embedding x ∈ XD , and source word wdx in the word pairs (wdx , wdy ) in seed dictionary D with embedding xd ∈ XD , the weight are calculated as: H(wx , wdx ) = C(wdx , wdy ) · cos(x, xd )

(6)

384

S. Yu et al.

which considers the confidence of word pairs and cos similarity between wx and wdx . For word wy with the word in seed dictionary wdy , the weight is calculated in a similar way. Then for each source word wx ∈ Wx with embedding x, the weights with dx wi ∈ D, i = 1 . . . N are calculated according to Eq. 6. And the K words in the seed dictionary with the highest weights are selected (using w1dx , . . . , wkdx for these words and xd1 , . . . , xdk for their word embeddings in XD , and y1d , . . . ykd as the words embeddings for corresponding target word w1dy , . . . , wkdy in the seed dictionary). Then the source word embeddings are translated according to:   x dx   d K · yk − xdk k=1 H w , wk  x =x+ (7)   K dx x k=1 H w , wk Similarly, for each word wy ∈ Wy , we conduct the same operation:     K dy y · ykd − xdk H w , w k=1 k   y = y + K y , w dy H w k=1 k

(8)

After personalized mapping, we get the final output of the C1 stage, namely the source word embedding space XC1 , and target embedding space YC1 . 3.5

C2: Refinement Based on LMs

Unlike previous works that focus mainly on static word embeddings, we also use pre-trained multilingual language models to improve performance in an unsupervised setting based on confidence. We fine-tune an off-the-shelf multilingual LM based on the generated samples with contrastive learning method. Samples Generation. We first generate a seed dictionary D as positive samples in the same way as in stage C1, which has Npos samples. Then we extract Nneg negative samples based on the following rules. For a word pair (wdx , wdy ) ∈ D, we retrieve the word pair with the T highest confidence that can be formed by dy dy . Then we generate (wdx , wneg ) as wdx and then note the target word as wneg dy dx dy a negative sample. We also do the same for w to generate (wneg , w ) as a negative sample. Contrastive Fine-Tuning. In our work, we leverage a pre-trained mBERT model [6] with 12 transformer layers and 768-dim embeddings. And word inputs are tokenized as [CLS][sw1 ]...[swm ][SEP ], where [sw1 ]...[swm ] means the sequence of m sub-words of the input w. [CLS] and [SEP ] are special tokens. Then we use the mBERT model to get the word represents of the sequence, and extract the embedding of [CLS] token as the embedding of word w. And following [16], we fine-tune the model relying on the following loss: spos = exp(cos(fθ (wdx ), fθ (wdy ))) dx dy sneg = exp(cos(fθ (wneg ), fθ (wdy ))) + exp(cos(fθ (wdx ), fθ (wneg )))

(9) (10)

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

p=

spos sneg + spos

385

(11)

And our goal is to min −Ewdx ,wdy ∈D log(p), where θ refers to the parameters of θ

mBERT model. 3.6

Combining the Outputs of C1 and C2

We combine the outputs of C1 and C2 by the similarity calculation method that considers both static word embeddings XC1 and YC1 , and word representations obtained from LMs: SimF (π) = (1 − λ)cos(x, y) + λcos(fθ (wx ), fθ (wy ))

(12)

where λ is a tunable hyper-parameter. And we replace the cos similarly used in the CSLS method with Eq. 12 to induce a lexicon as the final result.

4 4.1

Experiment Experimental Settings

Dataset. CD-BLI is evaluated on the widely used MUSE dataset [5]. Following [28], five rich-resource language pairs and five low-resource pairs are selected to completely evaluate the model. Precision@1 is selected as the measurement and the original data split of the MUSE dataset is kept. Baselines. We compare CD-BLI with popular SOTA BLI baselines, including unsupervised methods (Conneau et al., 2018 [5]; Chen and Cardie, 2018 [4]; Grave et al., 2019 [10]; Mohiuddin and Joty, 2019 [21]; Ren et al., 2020 [25]) and supervised/semi-supervised methods (Artetxe et al., 2018a [2]; Conneau et al., 2018 [5]; Joulin et al., 2018 [13]; Jawanpuria et al., 2019 [12]; Patra et al., 2019 [22]; Mohiuddin et al., 2020 [20]; Zhao et al., 2020 [32]). And for each baseline, we directly report the results in the original papers and conduct experiments with the publicly available code if necessary. Implementation Details. We largely follow the standard setup from prior work [17,28]. The vocabulary of each language is trimmed to the 200k most frequent words. And we choose vecmap [3] to operate the initial global alignment. For WEs in stage C1, we use the same fastText WEs following prior works [28]. The number of classes N used in the k-means method and selection range of seed dictionaries K in Stage C1 is tuned by the random search for each BLI dataset. And mBERT in stage C2 operates over the same vocabulary, spanning 200k word types in each language. We use AdamW [18] with a learning rate of 2e−5 and weight decay of 0.01 to fine-tune mBERT. The batch size is set to 64 and the dropout rate is 0.1. The T in stage C2 is set to 5. And the λ used in Eq. 12 is set to 0.2 for all language pairs. Finally, we use the CSLS as the induction metric, and the number of nearest neighbors in the CSLS is set to 10.

386

4.2

S. Yu et al.

Main Results

The proposed CD-BLI is thoroughly evaluated on five rich-resource languages (French (fr), Spanish (es), Italian (it), Russian (ru), Chinese (zh) from and to English (en)) and five low-resource languages (Faroese (fa), Turkish (tr), Hebrew (he), Arabic (ar), Estonian (er) from and to English (en)). CD-BLI is trained from scratch five times and we report the average accuracy. Table 1 and 2 show the performance on rich-resource and low-resource languages. Table 1. Precision@1 for the BLI task on five rich-resource language pairs. The best results are in bold and the best supervised results are underlined. The improvements are statistically significant (sign test, p-value < 0.01). Method

en-es → ←

en-fr → ←

en-it → ←

en-ru → ←

en-zh → ←

avg

81.9 84.1 81.9 84.3 80.5 83.7 82.4

83.4 86.3 85.5 86.2 82.2 86.5 86.3

82.1 83.3 82.1 83.9 – 84.4 84.5

82.4 84.1 84.2 84.7 – 85.5 84.9

77.4 79.0 77.8 79.3 76.7 80.4 80.2

77.9 80.7 80.9 82.4 78.3 82.8 81.9

51.7 57.9 52.8 57.1 53.5 56.8 57.0

63.7 67.2 67.6 67.7 67.1 67.4 67.1

32.3 45.9 49.1 47.3 – 48.4 47.9

43.4 46.4 45.3 46.7 – 47.5 47.2

67.6 71.6 70.7 72.0 – 72.3 71.9

(Conneau et al., 2018) (Grave et al., 2018) (Chen and Cardie, 2018) (Mohiuddin and Joty, 2019) (Ren et al., 2020)

81.7 82.8 82.5 82.7 82.9

83.3 84.1 83.7 84.7 85.3

82.3 82.6 82.4 – 82.9

81.1 82.9 81.8 – 83.9

77.4 75.4 – 79.0 79.1

76.1 73.3 – 79.6 79.9

44.0 43.7 – 46.9 49.7

59.1 59.1 – 64.7 64.7

32.5 – – – 38.9

31.4 – – – 35.9

64.9 – – – 68.3

CD-BLI

84.5 87.0 84.5 85.8 82.3 82.6 48.5 66.0 36.7 54.9 71.3

Supervised/Semi-Supervised (Artetxe et al., 2018a) (Joulin et al., 2018) (Jawanpuria et al., 2019) (Patra et al., 2019) (Mohiuddin et al., 2020) (Zhao et al., 2020) (Glavaˇs and Vulic, 2020) Unsupervised

The results demonstrate the effectiveness of CD-BLI. For rich-resource language pairs shown in Table 1, CD-BLI outperforms the best baseline in unsupervised conditions by 3.0% in the average precision, and the highest accuracy is obtained in almost all language pairs. And in half of the language pairs, including en-es, es-en, fr-en, en-it, and zh-en, CD-BLI even outperforms all supervised or semi-supervised models. In addition, for low-resource language pairs, CD-BLI outperforms the best baseline in unsupervised conditions by 5.7%. And for 6/10 language pairs, CDBLI outperforms the best baseline in supervised/semi-supervised methods. And the results demonstrate the advanced performance of CD-BLI on low-resource language pairs.

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

387

Table 2. Precision@1 for the BLI task on five low-resource language pairs. The best results are in bold and the best supervised results are underlined. The improvements are statistically significant (sign test, p-value < 0.01). Method

en-fa → ←

en-tr → ←

en-he → ←

en-ar → ←

en-et → ←

avg

39.0 40.5 38.0 38.4 36.8 40.3

52.2 53.8 48.6 51.8 52.5 54.3

47.6 52.2 43.1 51.6 52.5 47.5

41.2 42.2 38.1 41.1 42.2 41.6

37.4 40.0 33.7 36.3 41.2 40.1

49.1 49.6 46.3 47.6 50.8 50.1

Supervised/Semi-Supervised (Artetxe et al., 2018a) (Joulin et al., 2018) (Jawanpuria et al., 2019) (Patra et al., 2019) (Mohiuddin et al., 2020) (Glavaˇs and Vuli´c, 2020)

42.6 42.4 40.9 39.3 43.7 45.2

63.7 61.7 61.9 59.6 65.3 64.5

58.0 57.9 56.7 55.2 59.1 56.6

55.5 55.5 53.3 53.9 57.1 56.4

54.0 50.2 48.7 48.3 57.5 52.7

Unsupervised (Conneau et al., 2018) 33.4 40.7 52.7 63.5 43.8 57.5 33.2 52.8 33.7 51.2 46.2 (Mohiuddin and Joty, 2019) 36.7 44.5 51.3 61.7 44.0 57.1 36.3 52.6 31.8 48.8 46.5 CD-BLI

4.3

42.8 48.9 57.4 68.2 49.2 62.6 42.3 55.9 39.1 55.6 52.2

Ablation Study

To investigate the effectiveness of components in CD-BLI, we also conduct extensive ablation studies. All experiments are operated five times and the average precision is reported. Module Ablation. We study the impact of modules in CD-BLI. Table 3 module ablation part reports the experimental results. vecmap is the accuracy after initial alignment. C1 is the accuracy of only the C1 stage, and the final induction method considers only the WEs. C2 is the accuracy of only the C2 stage, and the final induction method considers both the initialized static WEs and the word represents obtained by LM. The results show that both C1 and C2 stages can improve the performance of vecmap, and by combining the outputs of C1 and C2 (CD-BLI), we can obtain the best performance. Method Ablation. We also investigate the impact of methods in CD-BLI. confidence is the accuracy without considering confidence scores, and the seed dictionary is randomly generated. +similarity is the model that considers only similarity as confidence, and +frequency is similar. In -k-means, we generate the seed dictionary as fixed 4000 samples without the k-means algorithm. The results show that the proposed confidence are important for utilizing the C1 and C2 stages in unsupervised bilingual induction, and the performance is even worse than vecmap if confidence is not considered. LMs Ablation. Finally, we investigate the impact of pre-trained language models. We changed the mBERT to BERT-base model, XLM [15], and multilingual T5 [30], respectively, and show the results in Table 3 LMs ablation part. The results show that mBERT performs better than other LMs, and all multilingual LMs (mBERT, mT5, XLM) can be exploited by the C2 stage and actually have similar performance.

388

S. Yu et al. Table 3. Ablation study on the modules, methods and language models Method

en-es → ←

en-fr → ←

en-it → ←

en-ru → ←

en-zh → ←

avg

CD-BLI

84.5 87.0 84.5 85.8 82.3 82.6 48.5 66.0 36.7 54.9 71.3

Module ablation vecmap +C1 +C2

81.3 82.7 82.1 79.3 80.5 75.3 44.2 59.0 24.7 43.4 65.3 84.3 83.5 83.5 82.1 80.9 76.7 46.1 61.1 29.9 49.8 67.7 83.3 83.7 83.2 81.2 79.2 80.3 46.2 62.3 31.2 48.5 67.9

Method ablation -confidence +similarity +frequency -k-means

79.9 84.2 84.1 83.3

81.3 85.5 83.3 86.4

82.1 83.9 82.4 82.8

78.5 85.4 81.2 85.6

81.3 81.4 80.8 81.1

75.9 81.3 76.5 80.9

43.7 46.9 45.6 46.3

56.8 62.1 59.4 62.1

16.3 33.5 26.5 35.7

44.9 51.7 46.9 54.5

64.1 69.6 66.7 69.9

LMs ablation BERT-base 84.4 86.6 84.5 85.3 82.1 82.4 45.3 58.2 33.1 51.5 69.3 84.9 86.5 84.3 85.4 82.5 81.7 47.6 66.2 35.8 52.8 70.8 XLM 84.7 86.9 84.5 85.6 82.1 82.3 48.3 66.5 35.5 55.6 71.2 mT5

4.4

Parameter Sensitivity Analysis

We conduct parameter sensitivity analysis on two core hyper-parameters: Number of classes N used in the k-means method, and number of K word pairs in the seed dictionary used in the C1 stage. Number of Classes N . Figure 2(a) shows the performance curves of the number of classes N used in the k-means method. When N is small, the k − means method cannot get rid of those untrustworthy word pairs. And when N is too large, there will be not enough word pairs left in the seed dictionary. Both of these cases affect the performance. Number of Word Pairs K. Figure 2(b) shows the performance curves of the number of word pairs K. The accuracy will fluctuate within a certain range when the K value is below 500, but the accuracy will drop rapidly when the K value is too large. This is because when the value of K is too large, the personalized mappings of words will be influenced by more untrustworthy word pairs.

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

389

Fig. 2. Parameter sensitivity analysis.

5

Conclusion

In this paper, we propose a confidence-based dual refinement method for unsupervised bilingual lexicon induction. Different from the previous method, we leverage the confidence score to represent the credibility of word pairs. Then we refine the embedding spaces through dual aspects based on confidence: static word embeddings and word representations obtained from pre-trained language models. The results of extensive experiments on 20 language pairs show the validity of our proposal.

References 1. Artetxe, M., Labaka, G., Agirre, E.: Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294 (2016) 2. Artetxe, M., Labaka, G., Agirre, E.: Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 3. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. arXiv preprint arXiv:1805.06297 (2018) 4. Chen, X., Cardie, C.: Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933 (2018) 5. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., J´egou, H.: Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017) 6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 7. Duan, X., et al.: Bilingual dictionary based neural machine translation without using parallel sentences. arXiv preprint arXiv:2007.02671 (2020) 8. Gaussier, E., Renders, J.M., Matveeva, I., Goutte, C., D´ejean, H.: A geometric view on bilingual lexicon extraction from comparable corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL04), pp. 526–533 (2004)

390

S. Yu et al.

9. Glavaˇs, G., Vuli´c, I.: Non-linear instance-based cross-lingual mapping for nonisomorphic embedding spaces. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7548–7555 (2020) 10. Grave, E., Joulin, A., Berthet, Q.: Unsupervised alignment of embeddings with Wasserstein procrustes. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1880–1890. PMLR (2019) 11. Heyman, G., Vuli´c, I., Moens, M.F.: Bilingual lexicon induction by learning to combine word-level and character-level representations. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1085–1095 (2017) 12. Jawanpuria, P., Meghwanshi, M., Mishra, B.: Geometry-aware domain adaptation for unsupervised alignment of word embeddings. arXiv preprint arXiv:2004.08243 (2020) 13. Joulin, A., Bojanowski, P., Mikolov, T., J´egou, H., Grave, E.: Loss in translation: learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018) 14. Klementiev, A., Titov, I., Bhattarai, B.: Inducing crosslingual distributed representations of words. In: Proceedings of OLING 2012, pp. 1459–1474 (2012) 15. Lample, G., Conneau, A.: Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019) 16. Li, Y., Liu, F., Collier, N., Korhonen, A., Vuli´c, I.: Improving word translation via two-stage contrastive learning. arXiv preprint arXiv:2203.08307 (2022) 17. Li, Y., Liu, F., Vuli´c, I., Korhonen, A.: Improving bilingual lexicon induction with cross-encoder reranking. arXiv preprint arXiv:2210.16953 (2022) 18. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 19. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013) 20. Mohiuddin, T., Bari, M.S., Joty, S.: LNMap: departures from isomorphic assumption in bilingual lexicon induction through non-linear mapping in latent space. arXiv preprint arXiv:2004.13889 (2020) 21. Mohiuddin, T., Joty, S.: Revisiting adversarial autoencoder for unsupervised word translation with cycle consistency and improved training. arXiv preprint arXiv:1904.04116 (2019) 22. Patra, B., Moniz, J.R.A., Garg, S., Gormley, M.R., Neubig, G.: Bilingual lexicon induction with semi-supervision in non-isometric embedding spaces. arXiv preprint arXiv:1908.06625 (2019) 23. Qi, Y., Sachan, D.S., Felix, M., Padmanabhan, S.J., Neubig, G.: When and why are pre-trained word embeddings useful for neural machine translation? arXiv preprint arXiv:1804.06323 (2018) 24. Rapp, R.: Identifying word translations in non-parallel texts. arXiv preprint cmplg/9505037 (1995) 25. Ren, S., Liu, S., Zhou, M., Ma, S.: A graph-based coarse-to-fine method for unsupervised bilingual lexicon induction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3476–3485 (2020) 26. Shi, H., Zettlemoyer, L., Wang, S.I.: Bilingual lexicon induction via unsupervised bitext construction and word alignment. arXiv preprint arXiv:2101.00148 (2021) 27. Søgaard, A., Ruder, S., Vuli´c, I.: On the limitations of unsupervised bilingual dictionary induction. arXiv preprint arXiv:1805.03620 (2018) 28. Tian, Z., et al.: RAPO: an adaptive ranking paradigm for bilingual lexicon induction. arXiv preprint arXiv:2210.09926 (2022)

Confidence-Based Dual Refinement for Bilingual Lexicon Induction

391

29. Xiao, M., Guo, Y.: Distributed word representation learning for cross-lingual dependency parsing. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 119–129 (2014) 30. Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020) 31. Zhang, J., et al.: Combining static word embeddings and contextual representations for bilingual lexicon induction. arXiv preprint arXiv:2106.03084 (2021) 32. Zhao, X., Wang, Z., Wu, H., Zhang, Y.: Semi-supervised bilingual lexicon induction with two-way interaction. arXiv preprint arXiv:2010.07101 (2020)

A Novel POS-Guided Data Augmentation Method for Sign Language Gloss Translation Shan Liu1,2 , Yafang Zheng1,2 , Lei Lin1,2 , Yidong Chen1,2 , and Xiaodong Shi1,2(B) 1

Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China {warrior,zhengyafang,linlei}@stu.xmu.edu.cn 2 Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Beijing, China {ydchen,mandel}@xmu.edu.cn

Abstract. Due to its profound significance for the hearing-impaired community, there has been an abundance of recent research focused on Sign Language Translation (SLT), which is often decomposed into video-to-gloss recognition (S2G) and gloss-to-text (G2T) translation. Here, a gloss represents a sequence of transcribed spoken language words arranged in the order they are signed. In this paper, our emphasis lies in G2T, a crucial aspect of sign language translation. However, G2T encounters a scarcity of data, leading us to approach it as a low-resource neural machine translation (NMT) problem. Nevertheless, in contrast to traditional low-resource NMT, gloss-text pairs exhibit a greater lexical overlap but lower syntactic overlap. Hence, leveraging this characteristic, we utilize part-of-speech (POS) distributions obtained from numerous monolingual spoken language text to generate high-quality pseudo glosses. By simultaneously training numerous pseudo gloss-text sentence pairs alongside authentic data, we have significantly improved the translation performance of Chinese Sign Language (+3.10 BLEU), German Sign Language (+1.10 BLEU), and American Sign Language (+5.06 BLEU) in G2T translation, respectively. Keywords: Sign Language Gloss Augmentation

1

· Machine Translation · Data

Introduction

Sign language is a visual language used as a means of communication for individuals who are deaf or hard of hearing. Since most hearing people cannot understand sign language, Sign Language Translation, which translates sign language videos into spoken language text, has gained increasing attention in recent years [1–4,6,11,12,18]. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 392–403, 2023. https://doi.org/10.1007/978-3-031-44696-2_31

A Novel POS-Guided Data Augmentation Method for G2T

393

The current research mainly includes two lines. One is an end-to-end method to directly translate sign language videos into spoken language text (S2T) [2,6, 7,10]. The other is a cascading approach, which can be divided into two steps. The first stage is continuous sign language recognition (CSLR) [2,3,5,8] with the purpose of tokenizing the input video into glosses, which then serve as the input for the subsequent stage. Then, the model translates glosses into spoken language text at second stage, which called Gloss-to-Text Translation (G2T) [12,15,17]. G2T is a crucial component of sign language translation [12,15]. As it not only constitutes the major portion of the Sign language video-to-Glossto-Text (S2G2T) task, but current S2T research has incorporated model which finetuned on pre-trained language model with gloss-text sentence pairs, yielding remarkable results. However, G2T suffers from a major data scarcity problem, as shown in Table 1. Regarding the data scarcity issue of G2T, we can treat G2T as a low-resource neural machine translation problem and the potential application of previous data augmentation methods. It is important to note the finding presented by Moryossef et al. [12] regarded that gloss and text have lower syntactic similarity and higher lexical similarity from their spoken counterparts. This discovery implies that the direct utilization of current data augmentation methods might prove insufficient. Instead, it becomes indispensable to formulate more appropriate data augmentation methods that duly consider the distinctive characteristics inherent in sign language gloss and text. This leads us to a query: Can we design new data augmentation method based on syntactic and lexical similarity between gloss and text? Table 1. Some publicly available sign language corpora with gloss annotations and spoken language translations. Dataset

Language Gloss-Text Pairs

Phoenix2014T German

7086/519/642

CSL-Daily

Chinese

6598/797/798

ASLG-PC12

English

82709/4000/1000

Hence, based on the syntactic and lexical similarity between the gloss and text, we attempt to summarize the token-level differences between them. However, token-level differences were not only distributed sparsely, but also ignored the differences in semantics and grammatical structure. We found that POS can ignore surface-level differences, such as spelling errors and word order, and focus on differences in semantics and grammatical structure. Therefore, we leverage more abstract POS tags to guide the data augmentation method. Furthermore, our methodology incorporating monolingual spoken language corpora to generate numerous pseudo gloss-text sentence pairs. Then we combine these pseudo pairs with real data for training. Our experiment reveals that our approach only requires 1 million pseudo gloss-text sentence pairs to outperform existing pre-trained language models on benchmarks such as mBART (Chinese +3.1 BLEU, German +1.1 BLEU).

394

S. Liu et al.

Notably, the mBART training dataset is 10,000 times larger than ours, emphasizing the remarkable performance of our model despite its relatively smaller dataset. Additionally, our model outperforms the strong baseline in the S2G2T task (Chinese +6.41 BLEU, German +1.81 BLEU), even surpassing the S2T model in Chinese (+3.68 BLEU). The experimental findings indicate the efficacy of our methodology in G2T translation, and also highlight the significant impact of incorporating POS analysis in sign language gloss and text.

2

Related Work

Sign Language Translation. In recent years, with the development of NMT, SLT has attracted increasing attention. For example, Camgoz et al. [2] released the PHOENIX14T dataset. Zhou et al. [18] put forward the CSL-daily dataset and used back translation to solve the problem of data scarcity. Recent studies have attempted to improve both S2G and G2T in order to enhance SLT performance. Yin and Read [16] proposed the STMC-Transformer network, using a Transformer for gloss-to-text translation to improve the performance of SLT. Cihan et al. [4] developed S2G and G2T in the form of multitasking. Data Augmentation. Recent studies have proposed various approaches for data augmentation in machine translation. Kambhatla et al. [9] used a ROT-k cipher generated from the source sentence to improve translation performance, outperforming other methods on several datasets, particularly in low-resource settings. Additionally, S ¸ ahin et al. [14] applied dependency parsing to “crop” and “rotate” sentences, achieving data augmentation effects that are particularly useful for low-resource languages. These studies highlight the importance of data augmentation in machine translation and provide new insights into developing effective data augmentation techniques to improve translation quality.

3

Methodology

We will introduce the method section in two parts. Firstly, we elaborate on the motivation why we summarize the differences between gloss and text based on POS. Secondly, we introduce how we use a monolingual corpus to generate pseudo gloss. 3.1

Deep Analysis Based POS Distribution Between Gloss and Text

For a given gloss-text parallel sentence of German sign language, text: “tja windst¨ arke zehn bis elf.” [English: Well, wind force ten to eleven.], and gloss: “zehn bis elf ” [English: ten to eleven.]. Compare text with gloss, it only adds two tokens “tja” and “windst¨ arke”. Therefore, we assume that sign language gloss is a textual simplification, preserving sufficient keywords to convey the meaning of the statement while removing unneeded words.

A Novel POS-Guided Data Augmentation Method for G2T

395

Formally, given a text x = [x1 , . . . , xn ] in the gloss-text sentence pairs, we use y = [y1 , . . . , yn ] to denote the gloss corresponding to x. Let a = [a1 , . . . , an ] be a sequence of binary variables that indicating whether xi should be discarded in the gloss text. Therefore, we have   if ai = 1, (1) yi = xi if ai = 0, where  denotes an empty string. As we assume that y is a simplification of x, a is deterministic given x and y. As a result, the probability of y given x can be written as P (y|x) = P (y, a|x) = P (y|a, x)P (a|x),

(2)

where P (y|a, x) always equals to 1 by our simplification assumption. P (a|x) is a model that summarize the difference between gloss and text. To summarize the differences between gloss and text, we initially counting the tokens that were missing between the text and gloss. However, due to the sparsity of the token, we found that token-based differences were not only distributed sparsely, but also not applicable to other fields. Therefore, it is not sufficient to simply describe the differences between gloss and text based on token. Moreover, Moryossef et al. [12] shows that there is a higher lexical coverage between gloss and text, but the syntactic similarity is lower. The grammatical structures between them are difficult to describe and there is no publicly available and authoritative standard, Therefore, the feasibility of constructing a syntax tree is relatively low. In contrast, abstract POS can better describe this type of difference. The main advantage of this approach is that it can ignore surface-level differences, such as spelling errors and word order, and focus on differences in semantics and grammatical structure. Therefore, we have adopted a method based on abstract POS to summarize the differences between gloss and text. Formally, we model P (a|x) as n  P (ai |POS(xi )), (3) P (a|x) = i=1

where POS(xi ) denotes the POS tag of xi . For estimating P (ai |POS(xi )), we first get the POS distribution from textgloss parallel pairs to analyze which type of words would be dropped or changed after translating to gloss. To be specific, for a pair of text-gloss sentences, we use a word segmentation tool to preprocess the text and gloss sentences into word sets, denoted Ux and Uy respectively. And their difference set is denoted as U = Ux − (Ux ∩ Uy ) = {u1 , . . . , un } x

y

(4)

where u ∈ U , u ∈ /U . The POS set is denoted as Upos = {pos1 , . . . , posn }. The POS of the token in the U is denoted as Upos , and Uxpos denotes the POS of tokens in the Ux .

396

S. Liu et al.

Count the frequency of POS appearing in sets Upos and Uxpos respectively, so Lpos = {(pos1 : l1 ), . . . , (posn : ln )}

(5)

Lxpos = {(pos1 : lx1 ), . . . , (posn : lxn )}

(6)

can be obtained, where li is the frequency of a given POS in Upos , lxi is the frequency of a given POS in Uxpos . Calculate the ratio by dividing the frequency of each posi in set Lpos by its corresponding frequency in set Lxpos , and denoted it as set P. P = {(pos1 : p1 ), . . . , (posi : pi ), . . . , (posn : pn )}, where pi =

li lti

(7)

To put it simply, for each posi appears 100 times in the text, which only appears 50 times in the corresponding gloss, as a result, the word for the posi have 50 percent of probability would be deleted. Instead of directly using a POS tagging tool to annotate them and obtain the POS difference, we summarize the token-based difference between gloss and text before analyzing their POS. It’s based on the fact that although some tokens are omitted in gloss compared to text, their parts of speech are generally consistent. But the problem that remains is the syntactic similarity of gloss and text is low, which leads to poor POS tagging results as the significant bias appears. Therefore, we adopt the method of summarizing the token-based difference first, and then analyzing the POS. After obtaining the distribution of POS difference set, pseudo gloss generation can commence. Text is subjected to word segmentation and POS tagging, and each token is eliminated according to its matching POS probability in P. 3.2

Pseudo Sign Language Gloss-Text Pair Generation

We used an external monolingual corpus as our new text. For a given new text, we denote it as x = [x 1 , ..., x N ] and create a new empty set of gloss denoted as y =. We utilized a POS tagging tool to annotate the text, and denoted the annotated results as x pos = {pos1 , . . . , posN }. This enables us to obtain a set of tokens, each with its corresponding POS tag. Next, we iterate through the entire x and x pos sets. At each step, we obtain x i and posi . Then we find the probability p i in the distribution of the POS difference set P that corresponds to posi . Afterwards, we randomly generate a number u between 0 and 1, following an unbiased uniform distribution ∼ U (0, 1). If u is greater than pi , we add x i to y ; otherwise, we ignore it. Finally, we join the tokens in y with spaces, creating a pseudo-gloss. By following this method, we can obtain a large number of pseudo gloss-text sentence pairs. Moreover, the distribution of POS differences within the pseudo gloss-text sentence pairs matches the distribution mentioned above.

A Novel POS-Guided Data Augmentation Method for G2T

4

397

Experiments and Results

4.1

Datasets

Our dataset used sign language translation data sets in three most widely used benchmarks: Phoenix2014T [2], CSL-Daily1 [18], and ASLG-PC122 [13], as show in Table 1. For external data, WMT213 is used for Chinese and English, among which 6M is screened as additional data. And monolingual data of WMT174 is used for German, also filter 6M as additional data. However, due to the different rules and standards for sign language between different countries, our method needs to be refined according to the characteristics of sign language in each country.

Fig. 1. This is POS difference distribution of CSL-Daily. We use LTP as our POS tagging tool. And “Part-of-Speech” is part-of-speech classification for ltp. We can find that different POS has different probability. The higher the probability, the more attention the model needs. And vice versa.

Chinese Sign Language. For instance, in the text “阑 尾 炎 是 很 经 常 看 病” [English: Appendicitis is a very common illness.] and in gloss “阑尾 炎 是 经 常 看 病” [English: Appendicitis is a very common illness.]. “很经常” changed into “经常” in gloss. Both “很经常” and “经常” have the same meaning in phrases, which is always. However, they differ at the token level, resulting in disparities between the final P and the actual distribution. Therefore, to mitigate this part of the difference, we multiply the probability distribution by the hyperparameter α and the new POS difference distribution defined as: P = {(pos1 : p 1 ), ..., (posn : p n )}, where p i = pi ∗ α 1 2 3 4

http://home.ustc.edu.cn/∼zhouh156/dataset/csl-daily. https://github.com/kayoyin/transformer-slt. https://data.statmt.org/wmt21/. https://data.statmt.org/wmt17/.

(8)

398

S. Liu et al.

We found when α = 1.0 the result is best. It demonstrates that the initial probability distribution closely resembles the original distribution. And the distribution is shown on Fig. 1. German Sign Language. It is noteworthy that in German, nouns have three genders: masculine, feminine, and neuter and complex verb conjugations. However, in the German sign language gloss, all tokens are in their base form. Therefore, in order to summarize the difference between gloss and text, we must first lemmatize the tokens in the text by reducing them to their base form and then summarize the token-based difference between the gloss and the lemmatized text. And we multiply the derived probability distribution by the hyperparameter α to reduce the remainder. When α = 0.7, it resembled the genuine distribution the most. And the distribution is shown on Fig. 2.

Fig. 2. This is POS difference distribution of Phoenix2014T. We use Spacy as our POS tagging tool. And “Part-of-Speech” is part-of-speech classification for spacy. The distribution is much different from the distribution of CSL-Daily. This suggests that it is necessary to study the two separately

America Sign Language. As with American Sign Language, all verbs and nouns in gloss are in their original form, regardless of how they appear in the text. Therefore, we must lemmatize the sentences prior to determining the real distribution of POS differences. Certainly, we use the hyperparameter α, and it performs optimally when α = 1.0. And the distribution is shown on Fig. 3. 4.2

Architecture

Our model employs the Transformer architecture, which relies purely on attention mechanisms and dominates the majority of sequence-to-sequence activities. Encoder and decoder each consist of six layers, and source and target word

A Novel POS-Guided Data Augmentation Method for G2T

399

Fig. 3. This is POS difference distribution of ASLG-PC12. We use Spacy as our POS tagging tool.

embeddings are shared for the language pair. The embedding and feed-forward sublayers have dimensions of 512 and 2048, respectively. The number of attention heads is set to eight. The default dropout rate is 0.3. 4.3

Baselines

We train models with randomized initializations directly on the benchmark datasets. In order to showcase the supremacy of our pre-training model, we employ mBART-cc25, a pre-trained neural machine translation model, as an alternative baseline for comparison. 4.4

Main Results

G2T Model. We generated pseudo gloss-text sentence pairs for each language using their POS distribution, and gradually increase the volume of data for training. The data volume is 1M, 2M, 4M, and 6M. Experimental results shows in the Table 2. When the volume of pseudo pairs increased to 1M, it achieves(Chinese +3.1 BLEU, German +0.81 BLEU, English +2.53 BLEU). As data volume increases, gradually enhance the effect. When the data amount reached 6M, it outperform all the baseline (Chinese +4.9 BLEU, German +1.1 BLEU, and English +5.06 BLEU).

400

S. Liu et al.

Table 2. “Direct” is the outcome of benchmark directly train on a randomly initialized transformer. “Ours” is the product of our method on external corpora and outperform all the baseline. Method

Chinese BLEU Vocab

Direct 18.72 30.10 mBART-CC25 – Camgoz et al. [2] Cihan Camgoz et al. [4] – – Yin et al. [16] – Moryossef et al. [12] 29.75 Ye et al. [15]

– 250027 – – – – –

1M 2M 4M 6M

7996 7592 8213 8693

German BLEU Vocab

English BLEU Vocab

21.73 26.70 19.26 24.54 23.32 19.09 25.04

5000 250027 – – – – –

75.9 – – – 82.87 – 83.35

5000 – – – – – –

27.51 27.29 27.50 27.8

5000 6000 8000 10000

85.88 86.64 87.97 88.41

5000 6000 8000 10000

Ours 33.07 33.69 34.27 35.02

S2G2T Model. We employ the CSLR model [8] to get gloss-text pairs, then finetuned them on our pretrained model to obtain the S2G2T model. Our S2G2T models outperform not only the existing S2G2T model (Chinese +6.14 BLEU, German +1.92 BLEU), but also surpass the S2T model in Chinese sign language translation(+3.68 BLEU). As shown in the Table 3. Table 3. “With 6M” refers to the result of the gloss-text pairs from CSLR model [3] that was then fine-tuned on our pre-trained model, which was trained with 6M pseudo gloss-text sentence pairs. Our S2G2T model is the most effective and even outperforms the Chinese S2T model. Method

Chinese German BLEU BLEU

S2G2T [3]

21.5

S2T [3]

23.9

S2G2T (Ours, 6M) 27.60

4.5

24.6 28.4 26.5

Analysis

Distribution of POS Difference. To prove the validity of our method, we reproduce the code of Moryossef et al. [12] and trained with 1M pseudo gloss-text sentence pairs. Secondly, we randomly delete tokens with a probability of 0.2 in

A Novel POS-Guided Data Augmentation Method for G2T

401

Table 4. We reproduce the code of Moryossef et al. [12] and trained with 1M pseudo gloss-text sentence pairs. The process of “Random delete in 0.2” involves randomly deleting tokens in 1M new text with a probability of 0.2. “Ours with 1M pseudo sentences” refers to our model with 1M pseudo gloss-text sentence pairs. Method

Chinese German English BLEU BLEU BLEU

Random delete in 0.2

28.25

Moryossef et al. [12]

30.01

26.70



Ours with 1M pseudo sentences 33.07

27.50

85.88

26.80

84.80

Table 5. We compare the effect of vocab size on the model. For Chinese sign language, we adopt character segmentation and jieba segmentation. The results show that for Chinese sign language, character segmentation has the best effect, while for German and English, when the vocabulary size is controlled at 5000, the best effect. Vocab size BLEU Chinese Character Segmentation 7996 6000 Jieba Segmentation

33.07 27.30

German 1M 1M 1M

5000 6000 7000

27.51 27.29 27.12

5000 6000 7000

85.88 84.96 84.21

English 1M 1M 1M

1M new text and train it on a randomly initialized transformer. Obviously, our method performs the best, illustrating the efficacy of our method. Details can be seen in Table 4. Effect of Vocabulary Size. Different segmentation methods have an effect on the experimental outcomes. We compared character and Jieba segmentation on Chinese sign language G2T, and the results showed that character segmentation performed better. Additionally, we compared the effects of vocabulary size variation on German and English Sign Language G2T. The experiments showed that a vocabulary size of 5000 performed the best. The result is shown on Table 5. Effect of α. We evaluated the parameter α mentioned on method Sect. 3. For Chinese and American G2T model, they performs best when α = 1.0, It indicates that the initial distribution of probabilities roughly reflects the initial

402

S. Liu et al.

Table 6. For Chinese and German, we respectively compared the influence of parameters on translation effect under 1M data volume. For Chinese and American G2T model, the best effect is when α = 1.0. For German G2T model, it works best when α = 0.7. Hyperparameter α BLEU Chinese 1M 0.8 1M 0.9 1M 1.0

32.40 32.61 33.07

German 1M 1M 1M 1M

0.5 0.6 0.7 0.8

26.84 27.29 27.51 27.07

English 1M 0.8 1M 0.9 1M 1.0

83.92 84.63 85.88

distribution. Due to the considerable compound in German sign language, when α = 0.7 German G2T model performs best. The result is shown on Table 6.

5

Conclusion

Our study proposes a novel POS-guided data augmentation method that models the difference between the distribution of sign language text and gloss. Through this approach, we have demonstrated its superiority over the current SOTA, resulting in significant improvements in S2G2T task. Furthermore, we have successfully applied it to sign language translation in three different countries. Acknowledgement. This work was supported by the National Natural Science Foundation of China (No. 62006138), the Key Support Project of NSFC-Liaoning Joint Foundation (No. U1908216), and the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (No. WT135-38). We thank all anonymous reviewers for their valuable suggestions on this work.

References 1. Bungeroth, J., Ney, H.: Statistical sign language translation, p. 4 2. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, pp. 7784–7793. IEEE (2018). https://doi. org/10.1109/CVPR.2018.00812

A Novel POS-Guided Data Augmentation Method for G2T

403

3. Chen, Y., Wei, F., Sun, X., Wu, Z., Lin, S.: A simple multi-modality transfer learning baseline for sign language translation (2022) 4. Cihan Camgoz, N., Koller, O., Hadfield, S., Bowden, R.: Sign language transformers: joint end-to-end sign language recognition and translation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 10020–10030. IEEE (2020). https://doi.org/10.1109/CVPR42600.2020. 01004 5. Cui, R., Hu, L., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: IEEE Conference on Computer Vision & Pattern Recognition (2017) 6. Duarte, A., et al.: How2Sign: a large-scale multimodal dataset for continuous American sign language (2021). https://doi.org/10.48550/arXiv.2008.08143 7. Gan, S., Yin, Y., Jiang, Z., Xie, L., Lu, S.: Skeleton-aware neural sign language translation. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 4353–4361. ACM, Virtual Event China (2021). https://doi.org/10. 1145/3474085.3475577 8. Hao, A., Min, Y., Chen, X.: Self-mutual distillation learning for continuous sign language recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 11283–11292. IEEE (2021). https:// doi.org/10.1109/ICCV48922.2021.01111 9. Kambhatla, N., Born, L., Sarkar, A.: CipherDAug: ciphertext based data augmentation for neural machine translation (2022) 10. Kan, J., Hu, K., Hagenbuchner, M., Tsoi, A.C., Bennamoun, M., Wang, Z.: Sign language translation with hierarchical spatio-temporal graph neural network. In: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2131–2140 (2022). https://doi.org/10.1109/WACV51458.2022.00219 11. Luqman, H., Mahmoud, S.A.: A machine translation system from Arabic sign language to Arabic. Univ. Access Inf. Soc. 19(4), 891–904 (2019). https://doi.org/ 10.1007/s10209-019-00695-6 12. Moryossef, A., Yin, K., Neubig, G., Goldberg, Y.: Data augmentation for sign language gloss translation, p. 11 (2021) 13. Othman, A., Jemni, M.: English-ASL gloss parallel corpus 2012: ASLG-PC12. In: 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon LREC (2012) 14. S ¸ ahin, G.G., Steedman, M.: Data augmentation via dependency tree morphing for low-resource languages (2019). https://doi.org/10.48550/arXiv.1903.09460 15. Ye, J., Jiao, W., Wang, X., Tu, Z.: Scaling back-translation with domain text generation for sign language gloss translation (2022) 16. Yin, K., Read, J.: Better sign language translation with STMC-transformer (2020) 17. Zhang, X., Duh, K.: Approaching sign language gloss translation as a low-resource machine translation task. In: Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL), pp. 60–70. Association for Machine Translation in the Americas, Virtual (2021) 18. Zhou, H., Zhou, W., Qi, W., Pu, J., Li, H.: Improving sign language translation with monolingual data by sign back-translation (2021)

Faster and More Robust Low-Resource Nearest Neighbor Machine Translation Shuo Sun1,2 , Hongxu Hou1,2(B) , Zongheng Yang1,2 , and Yisong Wang1,2 1

College of Computer Science, Inner Mongolia University National and Local Joint Engineering Research Center of Intelligent Information, Hohhot, China [email protected] 2 Processing Technology for Mongolian, Inner Mongolia Key Laboratory of Mongolian Information Processing Technology,Hohhot, China

Abstract. Transformer-based neural machine translation (NMT) models have achieved performance close to human-level on some languages, but still suffer from poor interpretability and scalability of the models. Many advanced studies enhance the model’s translation ability by building external memory modules and utilizing retrieval operations, however, it suffers from poor robustness and low decoding efficiency while improving the model performance, especially for low-resource translation tasks. In this paper, we propose a confidence-based gating mechanism to optimize the decoding efficiency by building a sub-network to determine the confidence of the model’s own translation capability and then decide whether the current translation needs to be retrieved from the memory module. By reducing the number of retrievals to improve the model’s translation speed without degrading the translation quality as much as possible. In addition, we use a nonparametric dynamic Monte Carlobased algorithm to fuse retrieval probabilities and model predictions to improve the generalization and robustness of the model. Extensive experiments on different datasets demonstrate the effectiveness of our method.

Keywords: Memory Module Low-Resource Translation

1

· Gate Mechanism · Monte Carlo ·

Introduction

Neural machine translation (NMT) [5,14] has achieved levels comparable to human translation on multiple large-scale datasets. However, the neural network’s neuron parameters have an upper limit on the “memory” of the corpus, and it has poor interpretability of the machine translation model for the learned knowledge. Moreover, when encountering new “knowledge”, the model requires large-scale parameter updates and the scalability of the model is limited, especially obvious for low-resource tasks. The recently proposed k NN-MT and its variants [7,15,19,20] combine the traditional NMT model with a token-level memory retrieval module. These methods decouple the memory ability of the c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 404–415, 2023. https://doi.org/10.1007/978-3-031-44696-2_32

Faster and More Robust Machine Translation

405

model from the model parameters by storing the training data in the memory module, realizing it to directly access the domain-specific datastore to improve translation accuracy without fine-tuning the entire model, gentle to cope with the discrepancy across domain distributions and improve the generality of the trained models. Previous works usually use simple linear interpolation to fuse external knowledge guidance and NMT prediction, and use a hyperparameter to control the fusion ratio to obtain the final probability distribution. However, using the same fusion ratio for all sentences may bring some problems, while it is proved through our experiments that the model translation results are quite sensitive to the selection of hyperparameter, which affects the robustness and stability of the model. Furthermore, although k NN-MT and its related models greatly improve the model performance, there is a huge drawback in the practical application, that is, the slow decoding efficiency of the model. The main reason for this phenomenon is that the memory module capacity is quite large, the similarity calculation of high-dimensional vectors is required in finding similar sentences, and the whole memory module must be searched for each retrieval probability during decoding. This paper aims to improve the performance of low-resource machine translation model by solving the above problems. For the former, in the process of retrieval and fusion of external memory module, we abandon the traditional linear interpolation and adopt non-parametric dynamic fusion method based on Monte Carlo, which improves the robustness and generalization of the model. For the latter, we optimize the translation speed by reducing the retrieval frequency. Specifically, a sub-network is used to judge the confidence of the model’s prediction results, and retrieval is performed only with the low confidence of the model’s prediction results, and the decoding efficiency is improved by filtering some unnecessary retrieval operations. Extensive experiments on low-resource translation task CCMT2019 and medium-high resource task CCMT2022 Mongolian-Chinese demonstrate the effectiveness of our method.

2 2.1

Background and Related Work Memory-Augmented NMT

Mark [2] first applies memory-augmented neural network to machine translation. He combines word correspondences from statistical machine translation in the form of “memory” to the decoder to increase the probability of occurrence of rare words, which is particularly effective on small data sets. Akiko [4] enhances the model’s translation capability by constructing a sentence-level memory bank. Zhang [18] constructs a fragment-level memory bank that allows the model to obtain more information from it and collect n-gram translation fragments from the target side with higher similarity and alignment scores. Khandelwal proposes k NN-MT [7] builds a token-level memory module on the basis of the traditional NMT, which stores the contextual representation of

406

S. Sun et al.

the training data and the key-value pairs of target words, so that the matching degree of memory library retrieval is higher. Figure 1 illustrates of how to employ the k NN algorithm to retrieve from the memory module. The key idea is to query the corresponding word of neighboring sentences similar to the current sentence in the external memory module when translating the current word to obtain reference and guidance from the module, and then use a simple linear interpolation to probabilistically fuse with the translation results of NMT to obtain the final translation results: p (yt | x, yˆ1:i−1 ) = λpN M T (yt | y0.95. This validates the caliber of our supervised information and, consequently, our resources. Results and Discussion. Not surprisingly, the results of this experiment (cf. Table 4) clearly highlight that the few-example categories seem to be more difficult to capture

624

Y. Liu et al. Table 4. Precision, Recall and F1 for variety identification experiment on validation set. Variety

Precision Recall F1

#

CHE HKE MCE TWE MYE SGE

80.46 62.71 77.92 70.53 70.86 74.33

80.02 65.96 63.82 66.08 69.76 75.41

80.24 64.29 70.17 68.23 70.31 74.86

1,472 905 94 460 516 1,033

macro avg 72.80 weighted avg 73.28

70.18 73.16

71.35 4,480 73.19 4,480

the unique characteristics of its class. To advance significantly, it suggests that compiling larger datasets and collecting more examples for minority classes (especially for MCE), or employing more advanced models to solve this problem are considerable optimizatized directions.

6 Conclusion and Future Work We develop CCAE, a novel multi-variety web corpora for enhancing Asian Englishes study. CCAE contains six English varieties in Asian Chinese-speaking areas. The corpus provides affluent metadata and annotations for usages. We host two versions of the data for download, to spur further research on building language variety-adaptive language models on our corpus. In future work, we suggest an in-depth investigation of variety-specific downstream tasks like multi-variety language modeling and automatic variety identification. Conduct experiments using CCAE to analyze the effectiveness of domain adaptation between various varieties. Through our work, we hope to encourage the community to further study World Englishes, to boost non-biased and culture-diversified language modeling development. Acknowledgement. Many thanks to Mark Davis, for his useful suggestions on data collection. We also thank the Internet Archive for providing service on the website time archive. This work was supported in part by the National Natural Science Foundation of China under Grant 62002016 and in part by the Fundamental Research Funds for the Central Universities under Grant 06500103.

Data Availability. CCAE has been released under the CC-BY-NC-ND 4.08 (https:// creativecommons.org/licenses/by-nc-nd/4.0) license on Hugging Face’s website: https:// huggingface.co/datasets/CCAE/CCAE-Corpus, enabling reusers to copy and distribute the material in its unadapted form, only for noncommercial purposes, while giving attribution to the creator.

CCAE: A Corpus of Chinese-Based Asian Englishes

625

References 1. Qiu, X.P., Sun, T.X., Xu, Y.G., Shao, Y.F., Dai, N., Huang, X.J.: Pre-trained models for natural language processing: a survey. SCIENCE CHINA Technol. Sci. 63(10), 1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3 2. Devlin, J., Chang, M., Lee, K.,Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv. abs/1810.04805 (2019) 3. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551 (2020) 4. Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inform. Process. Systems. 33, 1877–1901 (2020) 5. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inform. Process. Systems. 30 (2017) 6. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.: Deep learning-based text classification: a comprehensive review. ACM Comput. Surv. (CSUR). 54, 1–40 (2021) 7. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 2: Short Papers), pp. 784–789 (2018). https://aclanthology.org/P18-2124 8. Edmiston, D.: A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages. ArXiv. abs/2004.03032 (2020) 9. Espinosa Anke, L., Codina-Filba, J., Wanner, L.: Evaluating language models for the retrieval and categorization of lexical collocations. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1406–1417 (2021). https://aclanthology.org/2021.eacl-main.120 10. Zhou, W., Ge, T., Xu, K., Wei, F., Zhou, M.: BERT-based Lexical Substitution. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3368–3373 (2019). https://aclanthology.org/P19-1328 11. Tran, K., Bisazza, A.: Zero-shot dependency parsing with pre-trained multilingual sentence representations. In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 281–288 (2019). https://aclanthology.org/D19-6132 12. Zaharia, G., Avram, A., Cercel, D., Rebedea, T.: Exploring the power of Rsomanian BERT for dialect identification. In: Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties And Dialects, pp. 232–241 (2020). https://aclanthology.org/2020.vardial1.22 13. Laicher, S., Kurtyigit, S., Schlechtweg, D., Kuhn, J., Walde, S.: Explaining and improving BERT performance on lexical semantic change detection. In: Proceedings of the 16th Conference Of The European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 192–202 (2021). https://aclanthology.org/2021.eacl-srw.25 14. Nuske, K.: “I Mean I’m Kind of Discriminating My Own People:” A Chinese TESOL Graduate Student’s Shifting Perceptions of China English. TESOL Q. 52, 360–390 (2018) 15. Kirk, J., Nelson, G.: The international corpus of english project: a progress report. World Englishes. 37, 697–716 (2018) 16. Davies, M., Fuchs, R.: Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide. 36, 1–28 (2015) 17. Berns, M.: Expanding on the expanding circle: where do we go from here? World Englishes. 24, 85–93 (2005) 18. Xu, Z.: Chinese English: A future power?. The Routledge Handbook Of World Englishes, pp. 265–280 (2020)

626

Y. Liu et al.

19. Leimgruber, J.: Singapore English. language and linguistics. Compass. 5, 47–62 (2011) 20. Voigt, P., Bussche, A.: The EU general data protection regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing. 10(3152676), 10–5555 (2017) 21. Kirkpatrick, A.: The Asian corpus of English: motivation and aims. Perspective 28, 256–269 (2013) 22. Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. ArXiv Preprint ArXiv:1609.07843 (2016) 23. Davies, M.: The corpus of contemporary American English as the first reliable monitor corpus of English. Literary Lingu. Comput. 25, 447–464 (2010) 24. Pomik´alek, J.: Removing boilerplate and duplicate content from web corpora. Masarykova Univerzita, Fakulta Informatiky, Disertacnı Pr´ace (2011) 25. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog. 1, 9 (2019) 26. Wolf, T., et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020) 27. Paszke, A., et al: Pytorch: an imperative style, high-performance deep learning library. Adv. Neural Inform. Process. Syst. 32 (2019) 28. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. ArXiv Preprint ArXiv:1412.6980 (2014) 29. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014) 30. Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural Inform. Process. Syst. 35, 27730–27744 (2022) 31. Yang, L., Xiang, Y.: Naive Bayes and BiLSTM ensemble for discriminating between mainland and Taiwan Variation of Mandarin Chinese. In: Proceedings of the Sixth Workshop on NLP For Similar Languages, Varieties and Dialects, pp. 120–127 (2019). https:// aclanthology.org/W19-1412 32. Popa, C., NullStef˘anescu, V.: Applying multilingual and monolingual transformer-based models for dialect identification. In: Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 193–201 (2020). https://aclanthology.org/2020. vardial-1.18 33. Ceolin, A.: Comparing the performance of CNNs and shallow models for language identification. In: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 102–112 (2021). https://aclanthology.org/2021.vardial-1.12 34. Beltagy, I., Peters, M., Cohan, A.: Longformer: The long-document transformer. ArXiv Preprint ArXiv:2004.05150 (2020)

Emotionally-Bridged Cross-Lingual Meta-Learning for Chinese Sexism Detection Guanlin Li(B) , Praboda Rajapaksha, Reza Farahbakhsh, and Noel Crespi Samovar, Telecom SudParis, Institut Polytechnique de Paris, 91120 Palaiseau, France {guanlin.li,praboda.rajapaksha,reza.farahbakhsh, noel.crespi}@telecom-sudparis.eu

Abstract. Sexism detection remains as an extremely low-resource task for most of the languages including Chinese. To address this issue, we propose a zero-shot cross-lingual method to detect sexist speech in Chinese and perform qualitative and quantitative analyses on the data we employed. The proposed method aims to explicitly model the knowledge transfer process from rich-resource language to low-resource language using metric-based meta-learning. To overcome the semantic disparity between various languages caused by language-specific biases, a common label space of emotions expressed across languages is used to integrate universal emotion features into the meta-learning framework. Experiment results show that the proposed method improves over the state-ofthe-art zero-shot cross-lingual classification methods.

Keywords: Sexist Speech Detection

1

· Cross-lingual · Meta-learning

Introduction

The rise of the internet and new media calls for more attention to gender awareness and solidarity, for which automatic methods are needed to identify sexism on social media. Sexist speech is usually defined as prejudice, stereotyping or discrimination, typically against women, on the basis of sex, which can cause measurable negative impact [1]. As the volume of social media content continues to increase, it is important to detect sexist speech automatically so as to prevent the circulation of such speech on social media platforms and also to better study the related phenomenon. Previous works typically viewed sexism detection as a supervised classification problem. Waseem and Hovy [2] studied sexist speech as a category of hate speech and constructed a hate speech dataset comprised of sexism and racism classes. Two AMI (Automatic Misogyny Identification) shared tasks, IberEval2018 [3] and EVALITA2018 [4] provided datasets of misogyny-related speech in social media with multiple languages (English, Spanish and Italian) and extensive studies of automatic misogyny detection have been done based on the AMI datasets [5]. Sexist speech is not always hateful c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 627–639, 2023. https://doi.org/10.1007/978-3-031-44696-2_49

628

G. Li et al.

and has the forms of hostile and benevolent seixsm [6]. Based on the ambivalent sexism theory, Jha and Mamidi [7] constructed a dataset containing three categories of sexism, namely benevolent, hostile and others and developed classification models to identify benevolent sexism. Samory et al. [1] further proposed to measure sexism in psychological scales and defined four categories of sexist content: behavioral expectations, stereotypes and comparisons, endorsements of inequality, denying inequality and rejection of feminism. Despite the increased interest in sexist speech detection on social media, the number of studies is fewer compared to that of hate speech detection in general, and most of the sexism detection-related research works focused on Indo-European languages [8]. Thus, in this paper, we study the problem of Chinese sexism detection with the help of cross-lingual knowledge transfer. Instead of only using binary sentiment features, we propose to integrate external emotion knowledge about sexism datasets within the framework of meta-learning to explicitly model the transfer process between languages, while the heterogeneity of emotion labels in different training sources is bridged by a unified taxonomy. Multilingual language models are used as the backbone model so that the method can be generalized to other low-resource languages. To eliminate the need for auxiliary tasks and languages in the meta-learning process, machine translation is used to generate samples which are used to provide gradient during the meta-training stage. Experiments on cross-lingual datasets composed of English (resource-rich language) and Chinese sexist speech show that the proposed method improves upon previous state-of-the-art models.

2

Related Works

In this part, we briefly survey the more general topic of multilingual hate speech detection. In cross-lingual hate speech detection, resource-rich languages are used as source language to provide sexism related knowledge, and zero-shot or fewshot prediction is done on low-resource target languages. Pamungkas et al. [5] studied the features of misogynistic content and investigated misogyny detection on cross-domain and multilingual content. Furthermore, they experimented with several different methods and joint learning approach to perform multilingual misogyny detection in English and Italian [9]. Jiang and Zubiaga [10] proposed a capsule network for cross-lingual hate speech detection. The network relies on the source language and its translated counterpart in the target language. Aluru et al. [11] compared the performance of LASER embedding and mBERT on cross-lingual hate speech detection using datasets in 9 different languages and found that in low resource setting LASER performs better. To sum up, the models employed in cross-lingual hate speech detection can be categorized as follows: (1) Monolingual embedding and machine translation of target languages; (2) Multilingual embeddings and supervised classification model; (3) Multilingual pretrained language models (mPLM) and the combination of above models. However, the performance of such models could be strongly affected by the negative effect of non-hateful, language-specific taboo interjections [12] and data overfitting [13].

Emotionally-Bridged Cross-Lingual Meta-Learning

629

Fig. 1. Framework of the proposed method (better viewed in color). In the metatrain stage, the query sets are generated from the source data; in the meta-test stage, the query sets are composed of target data. Blue arrows are used to indicate the meta-train flow and red arrows are used to indicate the inference. (Color figure online)

3

Methodology

In this section, we present the proposed zero-shot cross-lingual meta-learning method for sexism detection which takes advantage of external emotion knowledge. The framework of the proposed method is shown in Fig. 1. We elaborate on the essential parts of the method, namely cross-lingual meta-learning, emotion analysis and emotion knowledge injection. 3.1

Cross-Lingual Meta Learning

For the task sets T under the zero-shot cross-lingual setting, the support set is denoted as Dss = {Xss , Yss }, and the query set is denoted as Dqt = {Xqt , Yqt }, where data points in support sets are sampled from the source language and data points in query sets are sampled from a different language. To enable zeroshot learning in meta-learning, previous methods require auxiliary languages to provide samples for query sets [14,15]. In our setting, we eliminate the need for auxiliary language by using machine translation to generate data in the target language from the source language. Given a translation model Mst which translates from source language s to target language t , for a sample Dsi from dataset in source language s , the corresponding sample Dsi in target language t is generated. Then, support sets are sampled from Ds and query sets are sampled from Ds , such that, during the meta-training stage, only labels from the source language are used. To get universal features for support and query sets, we denote the multilingual model as fm , the features of an example xi in the support set Sk of class k as fm (xi ), the features of a query sample xj

630

G. Li et al.

in the  query set Q as fm (xj ). The prototype features ck of Sk is denoted as 1/|Sk | (xi ,yi )∈Sk fm (xi ). To predict the probability p given xj belonging to class k, we use distance function d to measure the distance between prototype ck and xj . We adopted the Euclidean distance function as suggested in ProtoNet [16]. For the target text xj to be classified, we can get the probability of xj using a softmax function over all the classes: exp(−d(fm (xj ), ck )) p(y = k|xj ) = K  k exp(−d(fm (xj ), ck ))

(1)

To train the model, during the meta-train stage, for a meta-train task T with K classes in support sets S and Nq query sample size in the query set Q, the loss is calculated using Formula (2): N

q q , {ck }K LossT (Q) = L(d({fm (xqj )}j=1 k=1 ), y )

(2)

d(x1, x2) = ||x1 − x2||2

(3)

where: Cross entropy loss is used for L. We can observe that the classification of the target low-resource language is based on the distance between its features and those of different classes of the rich-resource source language. Thus, the knowledge transfer process between the source and the target language is explicitly modeled in the meta-training process where the multilingual model is trained to output representations that measure similarities between languages in terms of the degree of sexism. 3.2

Emotion Analysis

Cross-lingual hate speech detection methods suffer from unintended bias introduced by language-specific expressions and overfitting issues. We seek languageagnostic features that benefit the task while at the same time not affected by language-related bias, and emotion features serve as a good candidate to this end. Previous studies have shown the effectiveness of sentiment and emotion features in monolingual hate speech detection [17]. However, to our knowledge, no previous study has explored the effect of universal emotion features across different languages on the detection of hate speech. Although it has been reported that emotions can vary systematically in their meaning and experience across culture [18], a previous study showed that emotion semantics has a common underlying structure across languages [19], and empirical results showed that there are commonalities underlying the expression of emotion in different languages [20]. Thus, We develop a model to provide emotion classification under a common label space for multilingual sexist speeches. For each language , the emotion  model fe is trained on the emotion dataset D = {X , y} = {(xi , y i )}N i , where i x is the feature of sample i in the dataset, y = {y1 , y2 , . . . , yc } is a common

Emotionally-Bridged Cross-Lingual Meta-Learning

631

label space across the languages decided by the emotion taxonomy adopted. The model fe is learned to minimize the binary cross entropy loss. For a given sentence si in language , we first obtain its universal feature xi , and the model fe provides its emotion vector with vi = sigmoid(fe (xi )). After training, we get a multi-label multi-class classifier which is later used to provide emotion knowledge for the sexist speeches in the corresponding languages. 3.3

Integration of Emotion Knowledge

We design an aggregation module to merge semantic and emotion features into multimodal features. Specifically, the aggregation module is composed of two parts: a feature concatenation module and a modality fusion module. The feature concatenation module works by concatenating together text embedding and emotion features indicated as in Formula (4). z = Concat(v1, v2) = {v1v2 ∈ Rm+n : v1 ⊂ Rm , v2 ⊂ Rn }

(4)

where m is the dimension of text features and n is the dimension of emotion features. The concatenated feature is then passed into the modality fusion module which is trained in an end-to-end way to translate the simply-concatenated vector into a joint representation. We use a convolutional layer for the fusion module as proposed in [21]. Given the multilingual model fm , the emotion model fe , and a sample xi from either support set or query set, the aggregated features are produced as follows: fagg (xi ) = Conv(Concat(fm (xi ), fe (xi )))

(5)

The joint representation is optimized with regard to the same loss given by Formula (2) : N

q q , {ck }K LossT (Q) = L(d({fagg (xqj )}j=1 k=1 ), y )

(6)

The Algorithm 1 illustrates the entire learning process.

4 4.1

Experiment Datasets

We use publicly available datasets in English and Chinese. The broadly used Waseem dataset [2] is used to provide training data, and the English data of AMI EVALITA [4] is used to test the model’s robustness. A recently published Chinese sexism dataset SWSR [8] is used for testing on Chinese data. For the training of emotion models, we use the GoEmotions dataset [22] which provided fine-grained emotion annotations for a large number of English texts collected from Reddit comments. We use the dataset provided by NLPCC-2013 emotion classification task [23] for Chinese emotion data. There exist other emotion datasets for Chinese, but they are either for other domains [24] or publicly not available [25]. For both Chinese datasets, the data are collected from Sina Weibo (microblog). We map emotion labels between the NLPCC-2013 dataset and GoEmotions dataset to a common label space based on emotion lexicon ontology [26].

632

G. Li et al.

Algorithm 1 Zero-Shot Cross-lingual Meta-learning with Emotion Features Require: Multilingual Model fm , Emotion Model fe , Translation Model Mst , Training Set Ds with K classes in Resource-rich Language s , Test Set Dt in Target Language t , Aggregation Module AGG, Training Episodes Number N  generate source for query sets 1: Ds ← Mst (Ds ) 2: for i in {1, ..., N } do 3: for k in {1, ..., K} do  Iterate over training classes 4: Sik = Dsi = {(x1 , y1 ), ...(xj , yj )} ← RandomSample(Ds , j) 5: Qki = Dsi = {(x1 , y1 ), ...(xq , yq )} ← RandomSample(Ds , q) 1  6: ck = k k fagg (xi ) |Si | (xi ,yi )∈Si 7: end for 8: J ←0  Initiate Loss 9: for k in {1, ..., K} do  Update Loss using Formula (6) 10: J ← J + Lossi (Qki ) 11: end for 12: Update all parameters θfm , θfagg , θd w.r.t. J using gradient descent 13: end for 14: Do predictions on test set Dt using models with updated parameters.

Bias Analysis of Datasets. Following the definition of unintended bias given by [27], we view expressions that affect the multilingual model’s performance, such as language-specific taboo interjections, as false positive bias demonstrated by a disproportionate amount of samples containing certain terms. These terms appear in data labeled both as sexism and non-sexism, but the likelihood of the terms in sexism class is significantly higher than in the non-sexism class. Some of these terms may express the bias that the model should learn to distinguish between the two classes, while some may cause unintended behavior of the model, resulting in the model tending to classify some comments containing particular terms as sexism even if these terms do not convey such meaning. Besides, in the context of cross-domain dataset evaluation, the marginal distribution shift between datasets could lead to performance drop [28], which is also the case in the context of cross-lingual learning. Thus, we identify terms that distribute disproportionately in the sexism and non-sexism category and compare between datasets in English (EN) and Chinese (CN). We calculate the likelihood of a term wi given the label as p(wi |label). To compute the degree of bias r, we use Formula (7): r=

p(wi |sexism) p(wi |nonsexism)

(7)

Then a threshold is set to identify a set of terms that are disproportionately distributed. From terms above the threshold, we manually pick meaningful terms and analyze them qualitatively. Term analysis results are shown in Table 1. We can observe that there are overlaps between the two datasets, mainly on terms expressing meanings related to feminism and gender. There exist nonsexist terms strongly linked with sexism in datasets of both languages, which

Emotionally-Bridged Cross-Lingual Meta-Learning

633

Table 1. Term bias analysis of EN-CN sexism datasets in terms of r (Formula 7). The term “gay”, annotated with *, appears in English in the original Chinese text. Term (EN) r

Term (CN)

r

sexist sport female bitch equal feminism blond woman dumb drive man

婚驴 (marriage donkey) gay* 男权 (patriarchy) 女拳 (negative feminism) 伪女权 (fake feminism) 男人 (man) 奴隶 (slave) 洗脑 (brain washing) 彩礼 (bride price) 女人 (woman) 职场 (work place)

7.30 7.19 6.49 5.98 5.70 4.76 4.33 4.25 4.23 4.18 4.03

28.27 27.91 17.63 12.67 12.01 9.78 7.30 6.61 6.48 5.74 4.57

could lead to potential unintended bias either in mono-lingual setting or crosslingual setting. For example, the term “sport” is shown to have a significantly higher likelihood to appear in English tweets labeled as sexism, but the term is neutral and should not convey any bias. There are also language-specific terms which are more likely due to the cultural difference intrinsic to the language that can also harm the cross-lingual transfer learning performance of the model. Emotion Analysis of the Dataset. We analyze emotion features in the Chinese and English sexism datasets. For each sample in the datasets, we employ the prediction of the emotion model to generate an emotion feature. The emotion feature has eight dimensions as shown in Fig. 2 which we use as a real-valued vector for later analysis. We set a threshold to the emotion vector to decide if an emotion appears in the sample and count the frequencies of emotions. We normalized these frequencies to be the probability distribution, which is on a scale of 0 to 1, considering the fact that the sizes of datasets are different. The result is shown in Fig. 2. To gain a better perspective, we set frequency values to be negative for negative emotions (disgust, fear, sadness, anger), and negative emotions are shown in the left part of the figure. We observe that in both languages, non-sexist speech tends to have more positive emotions than sexist speech. In Chinese datasets, a large part of the nonsexist speeches still conveys negative emotions. The observation is consistent with the dataset’s keyword-based construction method, where controversial contents are more likely to be selected. In addition, many speeches could be hateful, thus conveying more negative emotions, but they may not be sexism related. As a result, using emotion features independently for sexism or non-sexism may not

634

G. Li et al.

be a good method to conduct cross-lingual transfer for sexist speech detection. We also observe a notable difference in emotions between sexism and non-sexism classes, which is in line with our previous assumption.

Fig. 2. Emotion Analysis of EN-CN sexism datasets. Frequency values are set to be negative for negative emotions (disgust, fear, sadness, anger) and are shown in the left part of the figure.

4.2

Experiment Settings

Baseline Models. For zero-shot baselines, we re-implement previously reported best-performing models on cross-lingual hate speech detection [9,11]: logistic regression with LASER embedding (LASER+LR); monolingual RoBERTabase, with Chinese target data translated to English by machine translation API1 (RoBERTa-translation); XLM-RoBERTa-base (XLM-R). For comparison with the fully supervised method, we implement a strong baseline for hate speech detection, BERTCNN [29] and also use the best model reported in the Chinese sexism dataset paper [8], a RoBERTa model trained with sexism lexicon knowledge (RoBERTa Lexicon). Implementation Details. We implement the meta-learning model using the PyTorch library. XLM-R is used as the backbone model to provide universal encodings. The support set and query set sample sizes Ns and Nq are set to be 32 and 10, respectively. During the meta-train phase, the number of episodes is 1000 and the learning rate is 1e-5. For model evaluation, in the fully-supervised setting, 10% of randomly sampled Chinese data is used as a test set; in the zero-shot setting, all of the Chinese data is used as the test set. The results are reported on an average of over 5 runs. 1

https://cloud.tencent.com/document/api/551.

Emotionally-Bridged Cross-Lingual Meta-Learning

635

Table 2. Model performances with all metrics tested on the SWSR Chinese sexism dataset. For fully supervised baselines, the models are trained using the SWSR dataset; for zero-shot models, the English Waseem dataset is used to provide source data for training, and metrics are tested on the SWSR dataset. For the robustness test, we change the source data to AMI dataset for zero-shot models. F1-sex and F1-not indicate F1 score for sexism and non-sexism class Model

F1-sex F1-not Macro F1 Accuracy

Fully supervised baselines, monolingual

Source data: SWSR dataset (CN)

BERTCNN

0.721

0.834

0.778

RoBERTa Lexicon

0.707

0.853

0.780

Zero-shot cross-lingual models

0.792 0.804 Source data: Waseem dataset (EN)

LASER+LR

0.409

0.804

0.607

0.706

XLM-R

0.612

0.743

0.671

0.692

RoBERTa translation 0.398

0.782

0.591

0.681

ProtoNet

0.592

0.763

0.681

0.701

ProtoNet Emotion 0.602

0.788

0.691

Zero-shot cross-lingual models

0.713 Source data: AMI dataset (EN)

LASER+LR

0.491

0.803

0.647

0.716

XLM-R

0.538

0.785

0.662

0.707

RoBERTa translation 0.548

0.773

0.663

0.702

0.637

0.737

0.687

0.695

ProtoNet Emotion 0.659

0.749

0.706

0.703

ProtoNet

4.3

Experiment Results

Model Performance. The overall performance of the baselines and the proposed models (in bold) are reported in Table 2. The results indicate that the proposed meta-learning method using ProtoNet and ProtoNet with emotion improves over the previous zero-shot methods and shows a more stable performance, although there is still a drop in performance compared with the bestperforming supervised models. The performance of the baseline models indicates that for the sexism detection task, even for languages strongly different from each other (Chinese-English), the zero-shot cross-lingual methods still yield comparable performance compared to previous cross-lingual settings where those focused languages are from the same or closer language family (e.g., English-Spanish, English-Italian). We observe that XLM-R demonstrates a good multilingual ability and outperforms the method using monolingual model with translation data. The proposed meta-learning method using ProtoNet alone achieves good performance, and the addition of emotion knowledge gains a marginal improvement over the ProtoNet model’s overall improvement on F1 and accuracy. Specifically, we observe a notable improvement of the F1 score over non-sexism class with ProtoNet Emotion, which may explain the effect of emotion features in mitigating the unintended bias introduced by non-sexism terms strongly linked with the sexism class.

636

G. Li et al.

Robustness Analysis. We analyze the robustness and generalization of the proposed method by changing the domain of the training dataset. Specifically, we use the English misogyny dataset provided by AMI EVALITA 2018 shared task [4] and test if the model’s performance remains stable on the test data. Compared to the Waseem dataset, the AMI dataset contains fewer data points for both sexism and non-sexism classes but is more comprehensive regarding the types of sexism. The result is shown in Table 2. For the zero-shot baseline models, changing the training dataset has a large impact on the performance of the models, which suggests that these methods tend to be more easily affected by the domain and the content of the datasets. With the meta-learning method, the model’s overall performance is more stable, and the F1 score and accuracy of the model are close between the two datasets.

Table 3. Examples of correctly and wrongly classified samples. The texts are translated from Chinese. ID

Text

True Label Predicted Label XLM-R ProtoNet + Emotion

4.4

#1 There’s nothing wrong about feminism itself, what’s wrong with pursuing gender equality?

non

sexism

sexism

non

#2 ...You can’t sexually harass people just because they are gay

non

sexism

sexism

non

#3 Feminism is not against men! It’s against non the patriarchal society!

sexism

sexism

sexism

#4 The person seems to be a recidivist... and should be prosecuted!:rage: :rage:

non

sexism

non

sexism

#5 Enoki mushroom like you will only rot on the shelf even if clearly priced

sexism

non

non

non

#6 The unmarried donkeys sounds happy hahaha

sexism

non

non

non

#7 Boys want to work with you only because sexism they want to use you to their advantage

non

non

non

#8 ...when girls are older, it becomes more difficult to find a partner.

non

non

non

sexism

Case Studies

To better understand the results of the cross-lingual sexism detection, we conducted a qualitative study of cases where the proposed method leads to the correct classification of previously misclassified examples and cases where the proposed method has failed to classify sexist speeches correctly. Specifically, we search with non-hateful terms that are strongly linked with sexism class as shown in Table 1. The results are shown in Table 3, the texts are translated from Chinese.

Emotionally-Bridged Cross-Lingual Meta-Learning

637

We observe that the common reasons that cause errors are as follows: (i) biased terms; (ii) specific expression in the target language; (iii) implicit or benevolent sexism. In a few cases, the integration of emotion knowledge helps out to correct examples wrongly classified as sexism due to biased terms, such as #1 and #2. However, we also observe that the bias mitigation effect brought about by emotion knowledge injection is limited and there are still a considerable number of examples misclassified as sexism class due to the presence of biased terms, such as #3, and in some cases, the integration of emotion knowledge harms the prediction such as #4. This may be because the emotion models are not accurate enough and the emotion label space is not expressive enough. Some specific expressions are also hard to detect and require culturally related a priori knowledge to identify, such as Internet slang terms in #5 and #6 which are sexist and appear mostly in social media content. Sexist speeches that do not contain explicit sexist terms or convey benevolent sexism are also hard to detect.

5

Conclusion

In this paper, we explore the cross-lingual method to detect Chinese sexism. We propose to use meta-learning method for zero-shot cross-lingual sexist speech detection and to integrate emotion knowledge about sexism datasets in the metalearning framework. Our proposed method with ProtoNet and ProtoNet Emotion improves over previous cross-lingual zero-shot methods and achieve new stateof-the-art. We also observed that the proposed method is still limited in dealing with issues related to cultural factors which could be reflected by languagespecific expressions such as Internet slang terms. Our proposed method can be easily extended to other low-resource languages, and in the future works we wish to experiment the method with more languages and seek to better deal with problems caused by cultural factors in cross-lingual hate speech detection.

References 1. Samory, M., Sen, I., Kohne, J., Fl¨ ock, F., Wagner, C.: Call me sexist, but. . .: Revisiting sexism detection using psychological scales and adversarial samples. In: Intl AAAI Conference Web and Social Media, pp. 573–584 (2021) 2. Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of the NAACL Student Research Workshop, pp. 88–93 (2016) 3. Fersini, E., Rosso, P., Anzovino, M.: Overview of the task on automatic misogyny identification at ibereval 2018. Ibereval@ sepln 2150, 214–228 (2018) 4. Fersini, E., Nozza, D., Rosso, P.: Overview of the Evalita 2018 task on automatic misogyny identification (AMI). EVALITA Eval. NLP Speech Tools Italian 12, 59 (2018) 5. Pamungkas, E.W., Basile, V., Patti, V.: Misogyny detection in twitter: a multilingual and cross-domain study. Inform. Process. Manage. 57(6), 102360 (2020)

638

G. Li et al.

6. Glick, P., Fiske, S.T.: Ambivalent sexism. In: Advances in experimental social psychology, vol. 33, pp. 115–188. Elsevier (2001) 7. Jha, A., Mamidi, R.: When does a compliment become sexist? analysis and classification of ambivalent sexism using twitter data. In: Proceedings of the Second Workshop on NLP and Computational Social Science, pp. 7–16 (2017) 8. Jiang, A., Yang, X., Liu, Y., Zubiaga, A.: Swsr: a Chinese dataset and lexicon for online sexism detection. Online Social Netw. Media 27, 100182 (2022) 9. Pamungkas, E.W., Basile, V., Patti, V.: A joint learning approach with knowledge injection for zero-shot cross-lingual hate speech detection. Inform. Process. Manage. 58(4), 102544 (2021) 10. Jiang, A., Zubiaga, A.: Cross-lingual capsule network for hate speech detection in social media. In: Proceedings of the 32nd ACM Conference on Hypertext and Social Media, pp. 217–223 (2021) 11. Aluru, S.S., Mathew, B., Saha, P., Mukherjee, A.: A deep dive into multilingual hate speech classification. In: Dong, Y., Ifrim, G., Mladeni´c, D., Saunders, C., Van Hoecke, S. (eds.) ECML PKDD 2020. LNCS (LNAI), vol. 12461, pp. 423–439. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67670-4 26 12. Nozza, D.: Exposing the limits of zero-shot cross-lingual hate speech detection. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 907–914 (2021) 13. Arango, A., P´erez, J., Poblete, B.: Hate speech detection is not as easy as you may think: A closer look at model validation. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 45–54 (2019) 14. Nooralahzadeh, F., Bekoulis, G., Bjerva, J., Augenstein, I.: Zero-shot cross-lingual transfer with meta learning. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4547–4562 (2020) 15. Xu, W., Haider, B., Krone, J., Mansour, S.: Soft layer selection with meta-learning for zero-shot cross-lingual transfer. In: Proceedings of the 1st Workshop on Meta Learning and its Applications to Natural Language Processing, pp. 11–18 (2021) 16. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Adv. Neural Inform. Process. Syst. 30 (2017) 17. Chiril, P., Pamungkas, E.W., Benamara, F., Moriceau, V., Patti, V.: Emotionally informed hate speech detection: a multi-target perspective. Cogn. Comput. 14(1), 322–352 (2022) 18. Mesquita, B., Boiger, M., De Leersnyder, J.: The cultural construction of emotions. Curr. Opin. Psychol. 8, 31–36 (2016) 19. Jackson, J.C., et al.: Emotion semantics show both cultural variation and universal structure. Science 366(6472), 1517–1522 (2019) 20. Lamprinidis, S., Bianchi, F., Hardt, D., Hovy, D.: Universal joy: A data set and results for classifying emotions across languages. In: The 16th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2021) 21. Ma, L., Lu, Z., Shang, L., Li, H.: Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2623–2631 (2015) 22. Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., Ravi, S.: Goemotions: a dataset of fine-grained emotions. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4040–4054 (2020)

Emotionally-Bridged Cross-Lingual Meta-Learning

639

23. Yao, Y., et al.: The construction of an emotion annotated corpus on microblog text. J. Chinese Inform. Process. 28(5), 83–91 (2014) 24. Quan, C., Ren, F.: A blog emotion corpus for emotional expression analysis in Chinese. Comput. Speech Lang. 24(4), 726–749 (2010) 25. Li, M., Long, Y., Qin, L., Li, W.: Emotion corpus construction based on selection from hashtags. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 1845–1849 (2016) 26. Xu, L., Lin, H., Pan, Y., Ren, H., Chen, J.: Constructing the affective lexicon ontology. J. China Society Sci. Tech. Inform. 27(2), 180–185 (2008) 27. Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L.: Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 67–73 (2018) 28. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn. 79(1), 151–175 (2010) 29. Mozafari, M., Farahbakhsh, R., Crespi, N.: Hate speech detection and racial bias mitigation in social media based on BERT model. PLoS ONE 15(8), e0237861 (2020)

CCPC: A Hierarchical Chinese Corpus for Patronizing and Condescending Language Detection Hongbo Wang, Mingda Li, Junyu Lu, Liang Yang, Hebin Xia, and Hongfei Lin(B) School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China {dutlaowang,22209217lmd,dutljy,2672054553}@mail.dlut.edu.cn liang,[email protected]

Abstract. Patronizing and Condescending Language (PCL) is a form of implicitly toxic speech aimed at vulnerable groups with the potential to cause them long-term harm. As an emerging field of toxicity detection, it still lacks high-quality annotated corpora (especially in the Chinese field). Existing PCL datasets lack fine-grained annotation of toxicity level, resulting in a loss of edge information. In this paper, we make the first attempt at fine-grained condescending detection in Chinese. First, we propose CondescendCN Frame, a hierarchical framework for finegrained condescending detection. On this basis, we introduce CCPC, a hierarchical Chinese corpus for PCL, with 11k structured annotations of social media comments from Sina Weibo and Zhihu. We find that adding toxicity strength (TS) can effectively improve the detection ability of PCL and demonstrate that the trained model still retains decent detection capabilities after being migrated to a larger variety of media data (over 120k).Due to the subjective ambiguity of PCL, more contextual information and subject knowledge expansion are critically required for this field.

Keywords: Patronizing and Condescending Language Chinese Corpus · Toxic Speech Detection

1

· Hierarchical

Introduction

When an entity’s language use shows a superior attitude toward others or depicts them in a sympathetic way, we call it Patronizing and Condescending Language(PCL). As an essential subfield of toxic speech, PCL is an open, challenging, and underexploited research area in natural language processing [10,16]. It is a kind of toxic speech aimed primarily at vulnerable communities. Condescending language is more subtle than traditional hate speech or offensive language, which are clearly offensive and easy to detect on the Internet. PCL is often unconscious, motivated by good intentions, and expressed in flowery language c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 640–652, 2023. https://doi.org/10.1007/978-3-031-44696-2_50

CCPC: A Hierarchical Chinese Corpus for PCL

641

[6,16]. It is based on one group’s superiority over another and displays an unbalanced power relationship. These superiorities and compassionate discourses can normalize discrimination and make it less visible [8]. This unfair treatment of vulnerable groups contributes to further exclusion and inequality in society [17], forcing users to leave the community or reduce online participation [6,16], and increasing the risk of depression and suicidal behavior. Thus, PCL detection is a potentially high-impact detection task that can provide theoretical guidance for supporting interventions in online communities [12], assisting linguists in understanding implicit toxicity, and effectively caring for vulnerable groups [15]. Although there has been substantial work on hate speech and offensive language, for example, many researchers use large-scale pre-trained models in deep learning for detection tasks [7,18,19,24,25], and some toxicity models were retrained using the professional corpus for some highly toxic tasks [3], PCL modeling remains very limited. The detection of this language requires special knowledge or familiarity with specific cultural tropes due to their recessive elements [9,14], so one of the critical factors for progress in this area is the need for highquality datasets annotated by experts to address these subtle detrimental properties. [15] introduce the Talk Down dataset, which is focused on condescending language in social media, [10] introduce the Don’t Patronize Me! (DPM) dataset, which is focused on the way in which vulnerable communities are described in news stories. In terms of models, [23] introduced adversarial training to improve PCL detecting capability. However, relevant research in the Chinese field is still lagging behind, which is a bottleneck preventing further research. Identifying PCL often seems to require a deep commonsense understanding of human values [22], it requires refinement of the category and intensity of toxicity. However, in relevant PCL datasets, the division of levels and granularity is limited to categories and target groups, and the strength of toxicity is not clearly defined (Table 1 compares current datasets with our work). Table 1. Comparison of current PCL datasets, including PCL categories, Toxicity strength, Target groups, and Context. Corpus

Language Source

TalkDown [15] English Dont Patronize Me! [10] English CCPC(ours) Chinese

Size

PCL cate. Toxicity strength

Reddit 6510(bal.) 68355(unbal.) News on Web 10,469 ✓ Weibo,Zhihu 11397 ✓

Target groups

Context ✓



✓ ✓

To fill these gaps, we introduce CondescendCN Frame, the Chinese Internet’s first condescending framework. Compared with traditional single classification, it is a more fine-grained hierarchical framework with three levels: (1) Whether toxic, (2) Toxic type, (3) PCL Toxicity strength (Level), PCL category, and Target groups. Based on the frame (See Fig. 1), we present a fine-grained corpusCCPC with over 11K comments from two Chinese mainstream social media, Sina Weibo and Zhihu. To verify our dataset, We migrate the data to Weibo for event

642

H. Wang et al.

Fig. 1. How to identify condescending language and its type? Here is an example of our CondescendCN framework.

detection and group identification. Moreover, we present a statistical analysis of CCPC and demonstrate PCL is more likely to target women and children.

2

CondescendCN Frame

Due to the implicit features of condescending texts, it is often difficult to capture the true meaning of condescending semantics from a single point of view with coarse-grained features. We first propose our hierarchical framework for multidimensional condescending detection. Then, in multiple pilots, we refine our framework and guidelines by testing small-scale tagging and marginal cases. A methodical interpretation of the free text will increase confidence in machine predictions, and it is of great significance to mitigate deviations in the design and construction of annotations [2,21]. The specific structure of the framework is as follows: 2.1

Toxicity Identification

First, we determine whether the text is toxic. Toxic language is generally defined as rude and uncomfortable remarks [4]. Toxic language detection serves as the foundation for subsequent labeling and is thought of as a binary classification task (toxic or non-toxic).

CCPC: A Hierarchical Chinese Corpus for PCL

2.2

643

Toxicity Type

The second step is to further determine the type of toxicity (hate speech, offensive language, PCL, etc.). Unlike other toxic languages with prominent offensive characteristics, PCL is frequently used for positive contributions due to its subtle toxic expression [11]. Thus, the attack is weak. In this paper, we uniformly classify non-toxic language, hate speech, offensive language, etc. as non-PCL speech, with the remaining classified as PCL. 2.3

PCL Toxicity Strength

In this paper, we creatively use toxicity strength as a category of fine-grained PCL, which will help us better comprehend the toxicity information and reduce our subjective error for edge cases at a fine-grained level [21]. We rank PCL toxicity strength into three levels in increasing order of semantic strength: 1) Weak. Often in the form of “false positive” praising vulnerable groups, such as the qualities worth learning from those living in vulnerable communities. The tone is the most gentle. 2) Middle. Although the author keeps a class distance between himself and the vulnerable groups, the tone will express more sympathy and hypocrisy and will offer shallow opinions to the disadvantaged group from an objective perspective. 3) Strong. The author and the vulnerable group are in completely different classes, and the tone is sharper because of the apparent discriminatory language, superior attitude, and sarcastic semantics toward them. 2.4

PCL Categories

Finally, using the research findings of [10] as a guide, along with the unique characteristics of the Chinese condescending language itself, we propose a detailed linguistic classification of Chinese PCL: - Unbalanced power relations (unb). Maintain a class and power distance from vulnerable groups, and assist them more as saviors to help them out of difficulty. [1,13]. - Spectators (spe). Without careful consideration, on a whim, spectators offer simple opinions or solutions that cannot solve the problem at its core. - Prejudice impression (pre). A discriminatory impression with preconceived views on vulnerable groups when offering assistance or advice, but the author’s superiority is concealed by friendliness or sympathy. The characteristics of the stereotype are not apparent on the surface. - Appeal (appe). Calling out on behalf of a group of experts or advocates in the hope that vulnerable groups can change their situation.

644

H. Wang et al.

- Elicit Sympathy (es). The author may explicitly express compassion and concern for the disadvantaged, or he may use rhetoric, metaphor, etc. to describe the disadvantaged as people in need in order to elicit readers’ sympathy. 2.5

PCL Group Detection

PCL mainly targets vulnerable groups. During the corpus collection phase, we categorize the vulnerable groups targeted by the gathered comments for further research. The vulnerable groups include disabled people, women, the elderly, children, commons, single-parent families, and disadvantaged groups(working class, peasant class, etc.).

3 3.1

The Corpus Data Collection

We collected comments on popular posts from two mainstream Chinese media platforms, Zhihu and Sina Weibo, as our data sources. We limited the scope of our data collection to seven main types of vulnerable groups related to condescending hot topics and events. Then, for each group, we manually compiled a list of keywords and conducted a search within the list’s scope. To ensure the quality of the corpus, we eliminated posts with fewer than 20 comments and performed additional manual screening. As a result, we collected more than 14K comments on 1082 popular blog posts containing these keywords. We removed duplicate and irrelevant samples (including common fixed tags on Weibo, such as “回复” and “转发微博”), as well as samples with a length of fewer than five characters or without any Chinese character annotations. We retained the emoji in the samples and converted them into the corresponding Chinese text pairs specified on the platform (such as “抱抱] ”-“hug] ”, “允悲] ”-“allow sadly] ”)to preserve as much of the emotional semantic information mapped in the emoji as possible [7]. Finally, 11397 comments are retained. A sample of the dataset is shown in Table 2. We carefully considered the ethical and moral implications of data collection. To protect user privacy, all usernames are replaced with a unique ”username” token. We ensure that the CCPC corpus is used solely for academic research in this field and make all collected data publicly accessible.

CCPC: A Hierarchical Chinese Corpus for PCL

645

Table 2. A sample description for the CCPC corpus. The text is divided into levels according to toxicity, toxicity type, toxicity strength, and PCL categories. In our hypothesis, the offensiveness of PCL language is between non-toxic and hate speech. Exp. Comment 1

2

3

4

5

3.2

现在离婚率本来就很高很高 了。 The divorce rate is already very high now. 我接触过的单亲家庭出来的孩 子,性格多少都不太好,孩子 也是受害者啊。 Children from singleparent families I’ve met have challenging personalities, and they are also victims. 残疾人就业是个严重的问题,应 该给他们更多的岗位。 The employment of the disabled is a critical issue; more jobs should be created for them. 与其担心别的,农民工朋友倒 不如想想如果自己被欠薪了, 应该如何合法讨薪才对。 Instead of worrying about other things, migrant worker friends should consider how to legally obtain their wages if they are not paid. 笑拉了,我就是瞧不起你们小 仙女怎么了。 LOL, what’s wrong with me looking down on those fairy girls?

Toxic? Toxic Type Toxicity strength PCL Categories no

-

-

-

yes

pcl

1

pre,es

yes

pcl

2

unb,appe

yes

pcl

3

unb

yes

hate

-

-

Data Annotation

Main Annotation. Using our proposed framework as a foundation, we annotated our dataset as follows: We labeled non-toxic speech, hate speech, and offensive language as N-PCL (label 0). For PCL, We introduced Toxic Strength (TS), further labeling based on its condescending toxicity intensity(label 1 to 3). The detailed design idea refers to Sect. 2.3. Each comment was labeled by two annotators, and a third annotator proofread it for accuracy. To maximize the amount of information captured by the annotations, particularly concerning the more granular evaluation of edge cases [10], we proposed Toxic Strength Fusion, a method that compiled the results of two main annotators and reclassified the labels based on their PCL toxicity intensity. The final label will be classified into four groups: N-PCL,PCLWeak,PCL-Middle and PCL-Strong. Fig. 2 illustrates an example of judgment. We conducted two sets of such labeling tasks and finally obtained the dataset

646

H. Wang et al.

Fig. 2. Toxic Strength Fusion of PCL to produce final label.

by voting. It is important to note that hateful and offensive texts can interfere with PCL toxicity classification, so we did manual proofreading when summing these samples. Annotator Demographics. To ensure the diversity of the annotators, we recruited six annotators (four main markers, two proofreaders) with differing genders, ages, and educational backgrounds (50% women, 50% men; 25 ± 5 ages; three masters, two doctors, one undergraduate). Meanwhile, due to the subtle toxicity unique to the PCL, which is easy to cause differences, we invited two experts in the field of language to guide our team’s annotations. Annotator Agreement. We calculated the Kappa Inter-Annotator Agreement (IAA) for binary classification and multi-category labeling. The IAA improved if we omitted all comments labeled as borderline(e.g. Label1/Label2) by at least one annotator [20]. More information is shown in Table 3. Table 3. Inter-annotator agreement table. The left column is the IAA test for PCL detection, and the marginal information is gradually deleted from top to bottom. The table on the right shows the consistency labeling test for the PCL category. PCL Detection

IAA PCL Category IAA

All labels

0.62 Unb.power rel 0.65

Remove label1

0.64 Pre. impre.

Spectators Appeal Remove label1,2 0.69 Hyp.or meta Others

0.42 0.59 0.48 0.71 0.66

CCPC: A Hierarchical Chinese Corpus for PCL

3.3

647

Data Description

After data collection and labeling, our CCPC corpus contains 11397 comments, including 9251 negative samples (N-PCL) and 2146positive samples (PCL labels are classified as P-Week, P-Middle, and P-Strong as the toxic strength increases), which covers the majority of vulnerable groups in significant Chinese forums (Table 4) . Table 5 depicts the proportion of three toxicity strengths among vulnerable groups.Table 6 provides additional statistics on vulnerable groups. We notice that the condescension rate on the Weibo platform is substantially higher than on Zhihu. In addition, the condescension rate for women and children is considerable on both platforms, with Zhihu having the highest rate for children and Weibo having the highest rate for women. This provides more details on focusing on these groups.Based on these statistics, we will provide an expanded CCPC dataset1 with a higher data scale (14k) and a broader spectrum of vulnerable populations. Table 5. PCL toxicity strength statistics for different vulnerable groups.

Table 4. Basic statistics of CCPC.

P-Weak P-Middle P-Strong

Toxic-Category Num N-PCL P-Weak P-Middle P-Strong

9251 1167 439 540

Total

11397

Disabled 61.8

16.4

21.8

Women

29.7

29.7

40.6

Elderly

54.6

16.9

28.5

Children 63.7

17.7

18.6

Commons 59.8

17.1

23.1

Single

62.2

14.7

23.1

Disadv

57.6

19.7

22.7

Table 6. Statistical Results of CCPC from different Platforms. Platformp represents samples marked as PCL, whereas prop.(%) represents a percentage. Disabled Women Elderly Children Commons Single-parents Disadv. groups Total

1

zhihu

838

735

656

858

922

628

815

5452

zhihup

66

110

72

177

72

87

123

700

prop.(%) 7.9

15.0

11.0

20.6

7.8

13.8

15.1

12.8

weibo

920

760

747

950

864

754

950

5945

weibop

167

263

142

323

78

247

226

1446

prop.(%) 18.2

34.6

19.0

34.1

9.0

32.8

23.8

24.3

Total

1495

1403

1808

1786

1382

1765

11397

1758

Our dataset and code are available on https://github.com/dut-laowang/CCPC.

648

4

H. Wang et al.

Experiments

4.1

Baselines

Here we present the primary baseline used in our experiments. The dataset is divided into a ratio of 8:1:1. We limited epoch = 15, batch size = 32, and set the same random seed. The results are depicted in Table 7 and Table 8. BERT. We conducted experiments with PLMs based on BERT and related variants. CCPC was examined using bert-base-uncased2 (BERT ), bert-basemultilingual-cased3 (BERTM ), and bert-base-Chinese4 (BERTC ). These PLMs were used as encoders, and we use fully connected layers as classifiers for PCL tasks. We separately evaluated the original label and the label with toxic strength fusion(TS). BiLSTM. We used a bidirectional LSTM to represent individual words using glove embedding. Our dropout rate is 0.5% for both the LSTM layer and the classification layer. We used precision, recall, and F1 score for these tasks. Table 7. Our work combines toxic strength fusion(TS) to determine whether it is PCL, which is regarded as a binary classification task. The corpus containing TS achieves better F1 results. BiLSTM also received high marks, indicating that LSTM is still effective in the field of short text. Input P BERTC

∧ TS

BERTC BERTM

∧ TS

BERTM BERT

∧ TS

R

F1

0.709 0.719 0.714 0.682

0.693

0.687

0.653

0.646

0.649

0.637

0.659

0.643 0.584

0.579

0.590

BERT

0.586

0.600

0.589

BiLSTM

0.677

0.645

0.656

Next, PCL category detection is viewed as a sentence-level multi-label classification problem, where each paragraph is assigned a subset of PCL category labels.(Table 8).

2 3 4

https://huggingface.co/bert-base-uncased. https://huggingface.co/bert-base-multilingual-cased. https://huggingface.co/bert-base-Chinese.

CCPC: A Hierarchical Chinese Corpus for PCL

4.2

649

PCL Detection for Migration Tasks

We hope that the model we trained can play an active role in detecting PCL on various Chinese mainstream public opinion platforms, which depends on whether our model can obtain accurate condescension rate recognition in a large quantity of external data [15]. We noticed that Weibo is an excellent Chinese platform for evaluating the transferability of models, with the majority of users engaging in community activities with #Keywords. Table 8. Results of categorizing PCL, which is regarded as a multi-label classification task. The categories of PCL include Unbalanced power relations(unb), Spectators(spe), Prejudice impression(pre), Appeal(appe), and Elicit sympathy(es). (%)

BERT P R

F1

BERTM P R

F1

BERTC P R

F1

unb

86.81 98.75 92.40 95.12 92.34 93.71

97.21 94.43 95.80

spe

33.33 22.58 26.92 55.56 64.52 59.70

75.01 67.74 71.19

pre

63.49 78.43 70.18 72.73 62.75 67.37

73.08 74.51 73.79

appe 0.00 es

0.00

0.00

22.12 24.35 23.18 22.22 21.98 22.10

17.14 31.58 22.22 34.48 52.63 41.67

38.46 54.63 45.11

Fig. 3. Condescension rates for different disadvantaged communities. Community A contains Women(Blue) and Elder (Green); B contains Trending topics(Red); C contains Lifestyle(Orange). Data were selected weekly from October 2022 to March 2023.

650

H. Wang et al.

We confirmed it by comparing condescension rates in different types of vulnerable groups and judging whether the model is accurate for common sensebased recognition. We chose three distinct categories of disadvantaged-targeted communities for Group recognition: Group A. The community of vulnerable populations. The community is believed to have a high rate of condescension, e.g. #妇女(Women), #老人(Elder). Group B. Comprehensive community. The condescension rate is at a normal level, e.g. #微博热搜(Trending topics). Group C. Non-vulnerable communities. These communities have a low incidence of condescending languages, e.g. #生活方式(Lifestyle), #娱乐(Entertainment). Between November 2022 and March 2023, we acquired over 120k comments from these communities organized by week. The comparison and validation of the three categories of data are depicted in Fig. 3. We observed higher condescension rates in #Women and #Elder, and the lowest in #Lifestyle. Obviously, models are more inclined to identify PCL in disadvantaged communities. We can also infer that the rate of condescension towards women, a vulnerable group, is much higher than that of males.

5

The Ambiguity of PCL

Due to the broad definition of Condescending Language, the judgment of PCL is highly ambiguous. First, it should be determined whether there is a clear class division in condescending discourse, which necessitates more contextual knowledge. It is difficult to discern the identity and class of the speaker based merely on short text sentences. E.g. I think having a minimal living allowance alone is not enough, this comment cannot ascertain the class to which ’I’ belong or whether the speaker’s objectives are directed at vulnerable groups. Second, there is a clear category of ’sympathy’ in condescending remarks, but these should be distinguished from those expressing genuine concern, because hypocrisy and sympathy are linguistically similar, but the stated thoughts are not.E.g. I empathize with you, we are all going through a lot ,this sentence is not a PCL statement since it does not reflect the superiority complex of different classes through hypocritical caring, but rather out of concern for the common community.To properly separate these hazy judgments, the model requires more world knowledge and precise definitions.

6

Conclusion and Future Work

PCL language detection is a potentially high-impact detection task for vulnerable communities, which can aid linguists in comprehending implicit toxicity and provide theoretical direction for caring for vulnerable groups. However, current research in the Chinese field lags behind. In this paper, we propose the CondescendCN Frame, the first condescending framework on the Chinese Internet, which divides a comment into a more fine-grained level. On this basis,

CCPC: A Hierarchical Chinese Corpus for PCL

651

we construct a Chinese PCL dataset, CCPC, and demonstrate that the addition of toxicity strength(TS) improves PCL detection. The trained model can detect on a larger platform, proving the CCPC’s reliability. We conduct sentiment strength analysis, which reveals that condescending language is primarily directed at women and children, who are in critical need of more humane care. Our experiment confirms that PCL detection is still very subjective, and its scientific definition is rather ambiguous. More contextual information and subject knowledge expansion are critically required for these features.This will be the focus of our future research.

References 1. Bell, K.M.: Raising Africa?: celebrity and the rhetoric of the white saviour. PORTAL: J. Multi. Int. Stud. 10(1), 1–24 (2013) 2. Bussone, A., Stumpf, S., O’Sullivan, D.: The role of explanations on trust and reliance in clinical decision support systems. In: 2015 International Conference on Healthcare Informatics, pp. 160–169. IEEE (2015) 3. Caselli, T., Basile, V., Mitrovi´c, J., Granitzer, M.: Hatebert: Retraining BERT for abusive language detection in english. arXiv preprint arXiv:2010.12472 (2020) 4. Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L.: Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 67–73 (2018) 5. Fortuna, P., da Silva, J.R., Wanner, L., Nunes, S., et al.: A hierarchically-labeled Portuguese hate speech dataset. In: Proceedings of the Third Workshop on Abusive Language Online, pp. 94–104 (2019) 6. Huckin, T.: Textual silence and the discourse of homelessness. Discourse Society 13(3), 347–372 (2002) 7. Mathew, B., Saha, P., Yimam, S.M., Biemann, C., Goyal, P., Mukherjee, A.: Hatexplain: a benchmark dataset for explainable hate speech detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 14867–14875 (2021) 8. Ng, S.H.: Language-based discrimination: blatant and subtle forms. J. Lang. Soc. Psychol. 26(2), 106–122 (2007) 9. Parekh, P., Patel, H.: Toxic comment tools: a case study. Int. J. Adv. Res. Comput. Sci. 8(5) (2017) 10. P´erez-Almendros, C., Espinosa-Anke, L., Schockaert, S.: Don’t patronize me! an annotated dataset with patronizing and condescending language towards vulnerable communities. arXiv preprint arXiv:2011.08320 (2020) 11. Price, I., et al.: Six attributes of unhealthy conversation. arXiv preprint arXiv:2010.07410 (2020) 12. Spertus, E.: Smokey: Automatic recognition of hostile messages. In: AAAAI/IAAI, pp. 1058–1065 (1997) 13. Straubhaar, R.: The stark reality of the ‘white saviour’complex and the need for critical consciousness: a document analysis of the early journals of a freirean educator. Compare: J. Comparative Int. Educ. 45(3), 381–400 (2015) 14. Van Aken, B., Risch, J., Krestel, R., L¨ oser, A.: Challenges for toxic comment classification: An in-depth error analysis. arXiv preprint arXiv:1809.07572 (2018) 15. Wang, Z., Potts, C.: Talkdown: A corpus for condescension detection in context. arXiv preprint arXiv:1909.11272 (2019)

652

H. Wang et al.

16. Wong, G., Derthick, A.O., David, E., Saw, A., Okazaki, S.: The what, the why, and the how: a review of racial microaggressions research in psychology. Race Soc. Probl. 6, 181–200 (2014) 17. Xu, J.: Xu at semeval-2022 task 4: Pre-BERT neural network methods vs postBERT Roberta approach for patronizing and condescending language detection. arXiv preprint arXiv:2211.06874 (2022) 18. Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R.: Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). arXiv preprint arXiv:1903.08983 (2019) 19. Zhou, J., et al.: Towards identifying social bias in dialog systems: Frame, datasets, and benchmarks. arXiv preprint arXiv:2202.08011 (2022) 20. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics, pp. 159–174 (1977) 21. Lu, J., Xu, B., Zhang, X., Min, C., Yang, L., Lin, H.: Facilitating fine-grained detection of Chinese toxic language: Hierarchical taxonomy, resources, and benchmarks (2023) 22. P´erez-Almendros, C., Anke, L.E., Schockaert, S.: Pre-training language models for identifying patronizing and condescending language: an analysis. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 3902–3911 (2022) 23. Lu, J., et al.: Guts at semeval-2022 task 4: Adversarial training and balancing methods for patronizing and condescending language detection. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pp. 432–437 (2022) 24. Min, C., et al.: Finding hate speech with auxiliary emotion detection from selftraining multi-label learning perspective. Inform. Fusion 96, 214–223 (2023) 25. Lu, J., et al.: Hate speech detection via dual contrastive learning. Speech, and Language Processing. IEEE/ACM Trans. Audio (2023)

FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain Hao Wang1 , Jing-Jing Zhu1 , Wei Wei2 , Heyan Huang1 , and Xian-Ling Mao1(B) 1

2

Beijing Institute of Technology, Beijing, China {wangh199802,zhujingjing32768}@gmail.com, {hhy63,maoxl}@bit.edu.cn Huazhong University of Science and Technology, Wuhan, China [email protected]

Abstract. As scientific communities grow and evolve, more and more papers are published, especially in computer science field (CS). It is important to organize scientific information into structured knowledge bases extracted from a large corpus of CS papers, which usually requires Information Extraction (IE) about scientific entities and their relationships. In order to construct high-quality structured scientific knowledge bases by supervised learning way, as far as we know, in computer science field, there have been several handcrafted annotated entity-relation datasets like SciERC and SciREX, which are used to train supervised extracted algorithms. However, almost all these datasets ignore the annotation of following fine-grained named entities: nested entities, discontinuous entities and minimal independent semantics entities. To solve this problem, this paper will present a novel Fine-Grained entity-relation Extraction dataset in Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences of 6 entity types with up to 7 layers of nesting and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model’s ability of recognizing fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/ FGCS. Keywords: Datasets

1

· Information Extraction · Fine-grained Entities

Introduction

As scientific communities grow and evolve, more and more papers are published, especially in computer science domain (CS). Just in 2022, the number of publications reached 441,442 according to the statistical data of dblp1 . It is important H. Wang and J.-J Zhu—Equal contribution. 1 https://dblp.uni-trier.de/db/. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 653–665, 2023. https://doi.org/10.1007/978-3-031-44696-2_51

654

H. Wang et al.

Fig. 1. Annotation examples of three types in our dataset.

to organize scientific information into structured knowledge base extracted from a large corpus of CS papers, which usually requires information extraction (IE) about scientific entities and their relations. Named entity recognition (NER) is the task of identifying and classifying entities in text, and relation extraction (RE) is the task of extracting relations between entities. They are the crucial steps in knowledge base construction, and the structured knowledge base can be used for many other downstream tasks in natural language processing (NLP), e.g. question answering, coreference resolution, summarization and so on. What’s more, constructing a structured knowledge base helps researchers acquire domain knowledge quickly. Therefore, it is an important task to construct a structured scientific knowledge base with entities and relations in CS. In order to construct high-quality structured scientific knowledge bases by supervised learning way, as far as we know, in computer science domain, there have been several handcrafted annotated entity-relation datasets. Semeval 2017 task 10 [1] involves both named entity recognition and relation extraction, while Semeval 2018 task 7 [4] focuses on relation extraction. SciERC [7] annotates coreferences across sentences in abstracts. SciREX [5] integrates NER and RE as an N-ary extractions task and extends annotation from paragraphs to the entire paper. SciClaim [8] includes fine-grained attribution information of entities. However, almost all these datasets ignore the annotation of following finegrained named entities: nested entities, discontinuous entities and minimal independent semantics (MIS) entities. Firstly, nested entities and their relations not only increase the richness of entity-relation items used to build the CS knowledge base, but also help to understand deeper semantic information of text. Only two datasets include a very small amount of nested entities in CS, i.e. SciERC [7] and SciClaim [8]. Secondly, discontinuous entities and their relations are not annotated in existing CS datasets. What’s more, to the best of our knowledge, there is no dataset containing relations on discontinuous entities in the task of information extraction. Thirdly, most existing datasets follow the principle of maximal length span annotation, which leads to the problem that the

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

655

text information cannot be fully utilized. For example, the span “CNN models” is annotated as an entity based on the maximal length span annotation principle, but the information of “CNN” is lost. It’s because “CNN” in the span “CNN models” has independent and complete semantics. The suffix “models” in this case is redundant information that affects the understanding of “CNN” itself. The entities with the minimal length that can express semantics independently and completely are called minimal independent semantics (MIS) entities. To solve the problem mentioned above, we create FGCS, a new high-quality fine-grained dataset of entity-relation extraction from computer scientific articles. FGCS includes annotations for named entity and relation with nested, discontinuous and MIS entities as well as their relations, as shown in Fig. 1, which are used to develop fine-grained information extraction systems. The detailed process of dataset construction will be introduced in Sect. 3. Finally, the dataset contains 1,948 sentences from computer scientific articles. The statistics show that FGCS not only covers richer fine-grained entities and relations, but also beats other datasets in label densities. Experiments on recent state-of-the-art approaches demonstrate that the annotated fine-grained entities and relations are challenging for existing information extraction methods. The contributions of our paper are as follows: – We develop a fine-grained annotation scheme for entity and relation extraction in computer science, which considers nested entities, discontinuous entities, and MIS entities. – We create a new dataset FGCS of 1,948 sentences from 1,875 computer science publications based on the proposed annotation scheme, which supplements the research on fine-grained knowledge base constructions. – To the best of our knowledge, we are the first one to annotate relations on discontinuous entities. Experiment results show the effectiveness of our dataset.

2

Related Work

Several datasets for scientific Information Extraction (IE) in Computer Science (CS) domain have been constructed. At the early stage, researchers focused on terms, key phrases, or entities extraction from the computer science papers [9]. In recent years, datasets for more complex tasks are introduced, such as SemEval 2017 task 10 [1], SemEval 2018 task 7 [4], and SciClaim [8]. SemEval 2017 task 10 includes 500 double-annotated articles from computer science, material science, and physics domains. It includes two entity types and two relation types. SemEval 2018 task 8 focuses on relation extraction. It provides entity span without entity type and extends relation to 6 types. SciClaim pays more attention to fine-grained associations between factors in scientific experiments, such as causation and comparison. Unlike other datasets, SciClaim also involves social and behavioral science.

656

H. Wang et al.

As much semantic information can only be mined from a long context, researchers pay more attention to the extraction of inter-sentence information from long documents. IE for CS also follows this trend. SciERC [7] fully annotated entity boundaries, entity types, relations, and cross-sentence coreference on 500 abstracts from AI conferences. SciREX [5] includes 438 fully annotated papers and converts entity recognition and relation extraction into an N-ary extraction problem. It targets extracting (Dataset, Metric, Method, Task) tuples, where each entity may come from a different part of the paper. Recognition for fine-grained entities has been widely studied in other domains [7,10,14,17]. However, most IE datasets in computer science ignore fine-grained entities. For nested entities, only SciERC [7] and SciClaim [8] annotate few nested entities. For discontinuous entities, nearly all existing datasets do not annotate discontinuous entities. For minimal independent semantics (MIS) entities, only SciREX [5] break long spans into simple entities, but it remains suffix word like “model”. Usually, nested and discontinuous entities are more common in the scientific literature because of the frequent occurrence of complex term phrases. MIS entities also can be better utilized for downstream tasks because redundant information is removed. So, it’s crucial to recognize fine-grained entities for further understanding of CS academic papers. For this purpose, the FGCS, a fine-grained entity-relation extraction dataset in computer science domain is proposed. In the next section, the process of construction will be introduced in detail.

3 3.1

The Proposed Dataset Annotation Scheme

Inspired by the idea from SciERC [7], our annotation scheme is designed to better meet fine-grained characteristics. Specifically, we define six entity types (Task, Method, Metric, Material, Problem, Other Scientific Term) and five relation types (Used-for, Part-of, Hyponym-of, Evaluate-for, Synonymy-of) based on the characteristics of CS papers. The annotation scheme of fine-grained entities and relations is described below, and full annotation guideline is provided in Appendix A. Nested Entities and Relations. Our dataset contains up to 7 levels of nesting in a sentence. A span will be annotated as long as it can express complete semantics and belongs to one of the entity types we defined. Therefore, when a subspan meets the above conditions, it will also be annotated in the form of nesting. The relations on nested entities include two types: the relation within nested entities, and the relation between a nested entity and another entity outside. As shown in Fig. 1(a), there are 3 entities in total, of which “Recognition” and “Named Entity Recognition” are nested entities. The relation between entity “Named Entity Recognition” and “Recognition” is Hyponym-of, which is

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

657

a relation within nested entities. The relation between “Named Entity Recognition” and “NER” is Synonymy-of. Thus, the entity “NER” also has a relation Hyponym-of with the nested entity “Recognition”, which is a relation between the nested entity and another outside. Discontinuous Entities and Relations. In some situations, the entities are divided into multiple discontinuous spans and connected by some conjunctions like “and”, “or” and so on. As shown in Fig. 1(b), there are 3 entities and 2 relations. The word sequence “information retrieval and extraction” contains two Task entities “information retrieval” and “information extraction”. In our dataset, discontinuous entities are annotated by multi-segment discontinuous spans. The relation on discontinuous entities is annotated by discontinuous entity representation. In the above example, the entity “Task-based evaluation” has a relation Used-for both with the normal entity “information retrieval” and the discontinuous entity “information extraction”. MIS Entities and Relations. Our dataset is annotated based on the principle of minimal independent semantics (MIS), which means a span with the minimal length will be annotated if it can express independent and complete semantics. Most of the previous work annotates based on the maximal length span annotation principle, which leads to the loss of semantic information. For example, the word sequence “CNN models” is annotated as a Method entity based on the maximal length span annotation principle, resulting in the semantic information of “CNN” being ignored. Therefore, in some situations, suffix such as “model” and “task” is redundant, and should not be annotated. Three different situations of MIS entities are shown in Fig. 2, for sentence (1), the sequence “keyphrase extraction models” and “keyphrase extraction” have independent and different meanings, so they are annotated separately with different entity type Method and Task; for sentence (2), “CNN” has the independent semantics, and it expresses the same meaning as “CNN models”, so the suffix “models” here is unnecessary, and only “CNN” is annotated as a Method entity; for sentence (3), the sequence “probabilistic model” is a coherent whole, “probabilistic” cannot exist independently, thus “probabilistic model” is annotated as a Method entity as a whole. 3.2

Annotation Process

Our dataset is sentence-level, and the sentences are taken from 14 AI conference/workshop proceedings in four AI communities in computer science domain. The annotation process is described in detail below. First, the candidate data was selected from a semantic scholar open research corpus S2ORC [6]. We selected papers from the “ComputerScience” category in S2ORC, divided them into sentences, and selected those with more than 7 tokens from specific sections. The selected sentences were pre-annotated using a model trained on SciERC before manually annotating. After that, two domain experts

658

H. Wang et al.

annotated each sentence independently using an annotation tool Prodigy2 . Entities were annotated first, and the relations were annotated after the entities are determined. For sentences with incompletely consistent annotations, annotators reached an agreement after discussion. After human annotation, the sentences that have no relations between entities, and are unrelated to the computer science domain were removed.

Fig. 2. Three different situations of minimal independent semantics.

Table 1. The statistics of FGCS and SciERC. Dataset Entities Relations Nested Entities Discontinuous Entities MIS Entities Count Per Sentence Count Per Sentence Count Proportion Count Proportion Count Proportion SciERC 8,094 FGCS 8,559

3.17 4.39

4,648 5,903

2.38 3.03

165 5,490

2.0% 64.1%

0 120

0% 1.4%

0 697

0% 8.1%

Finally, a fine-grained entity-relation extraction dataset is created, which contains 1,948 fully annotated sentences containing at least one relation from 1,875 papers in different computer science research directions. 3.3

Comparison with Previous Datasets

Among all previous datasets, SciERC [7] is the most representative and widely used information extraction (IE) dataset in computer science domain, and it is similar to ours in terms of task and domain. So, we mainly compare our dataset with SciERC. Comparison with SciERC. Our dataset FGCS is sentence-level, and the sentences are from 1,875 different papers, while SciERC contains 500 scientific abstracts only. What’s more, our dataset focuses on fine-grained named entities and their relations. First, we annotate nested entities only if the spans express independent semantics, the relations within nested entities, and the relations 2

https://prodi.gy/.

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

659

between a nested entity and another external one. Therefore, FGCS has more nested entities and their relations than SciERC which allows nested spans only when a subspan has a relation/coreference link with another term outside the span. Second, our dataset contains discontinuous entities and relations on them while SciERC does not. Third, based on the principle of minimal independent semantics (MIS), a span will be annotated if it has independent and complete semantics in FGCS. In addition to the problem of the suffix mentioned in annotation scheme section, an entity in our dataset never contains multiple parts with independent semantics. For example, the word sequence “Named Entity Recognition (NER)” is annotated as two different Task entities “Named Entity Recognition” and “NER” with the relation Synonymy-of between them in our dataset, while it is annotated as a whole entity in SciERC. Table 1 shows the detailed statistics of our dataset FGCS and comparison with SciERC. Although the size of SciERC is similar to FGCS, our dataset contains a large number of fine-grained entities and relations on them. 3.4

Inter-Annotator Agreements

Inter-Annotator Agreement (IAA) is evaluated with F1 and kappa score. For F1 , one annotator’s results are set as ground truth and micro F1 score is calculated between two annotators. For kappa score, a special type None is added to solve boundary disagreements. Specifically, if a span is annotated as an entity by one annotator but not by the other one, we take the span annotated by the other annotator as None type. As shown in Table 2, the kappa for annotations of named entity recognition and relation extraction is 52.1 and 61.7, which is relatively low. it is because fine-grained annotation scheme is adopted, which leads to severe disagreements on boundaries. As a supplement, kappa on entities without boundary disagreements is also counted. The kappa score of 88.7 for entity and 96.4 for relation prove that annotators reach a high agreement on entity types. Table 2. IAA score. Kappa* represents results on entities without boundary disagreements. Task Metric F1 kappa kappa* NER 75.2 52.1 RE 83.9 61.7

4 4.1

88.7 96.4

Experiment Baselines

We evaluate the following entity-relation extraction baselines on FGCS and SciERC to compare the differences between two datasets:

660

H. Wang et al.

– DygIE++ [12]: a graph based method that constructs span graph and utilizes graph propagation to model interaction of spans. – PURE [17]: a pipeline method for entity and relation extraction. it recognizes entities first and uses the results as input to relation model. – SpERT [3]: a span-based attention model that uses strong negative sampling strategy and localized context representations. – PFN [16]: a table filling based method that models two-way interaction between NER and RE by decomposing features into entity part, relation part, and shared part. – UNIRE [15]: a table filling based method that uses a unified entity-relation label space to capture interaction between entity and relation. – TBO [13]: a table filling based method that designs two distinct encoders to capture entity and relation information separately. We divide all baselines into two parts according to whether they can handle nested cases. All baselines can recognize nested entities and relations on nested entities except UNIRE and TBO.3 4.2

Evaluation Settings

We follow the evaluation settings in [11]. For named entity recognition, a predicted entity is correct if both the entity boundaries and type are correct. For relation extraction, we adopt boundaries setting and strict setting [2]: – Boundaries Setting (denoted as RE): a predicted triplet is correct if the relation type is correct and the boundaries of two entities are correct. – Strict Setting (denoted as RE+): besides the requirements in the boundaries setting, the predicted entity type also must be correct. 4.3

Experiment Settings

Since SciERC [7] is similar to FGCS in terms of task and domain, all baselines are evaluated on both datasets. Following data split settings in SciERC, our dataset is divided into training set (70%), validation set (10%), test set (20%). For baseline TBO, we reduce the stacked layer numbers to 2 due to GPU memory limitation. For the non-nested baselines, nested entities are only used for testing. 4.4

Baseline Results

Results of baselines on FGCS and SciERC are presented in Table 3. It can be observed that:

3

Actually, PFN has trouble handling relations on nested entities, but such cases are rare so we still attribute it to the nested NER&RE baseline.

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

661

– Most baselines for nested NER&RE perform better on FGCS. The average F1 increase is 7.8% on NER. PURE gains 10.4%/13%/20.4% on NER/RE/RE+, which is the highest among all baselines. – There is a significant drop for non-nested NER&RE baselines on FGCS while they perform comparably to the baselines for nested NER&RE on SciERC. Performance on non-nested baselines drops about 20% on NER and more than 25% on RE and RE+. Such a huge gap between nested and non-nested NER&RE baselines is caused by the different proportions of nested entities in the two datasets. The detailed results on FGCS are shown in Table 4. Although the NER precision of TBO and UNIRE change little compared to nested NER&RE baselines, the recall drops about 30%. This is because TBO and UNIRE have a competitive ability to recognize non-nested entities but are unable to handle nested entities. As a result, NER F1 drops a lot, which also harms performance on RE and RE+. The similar results of the baselines for nested and non-nested NER&RE on FGCS indicate that SciERC is not capable of distinguishing whether a model has a good ability to recognize nested entities and the relations on them. Besides, the variance of nested NER F1 on FGCS is 1.1 while SciERC is 7.9, which indicates the results on FGCS gain higher stability than SciERC. From the observations and analysis above, we think existing scientific NER&RE datasets in computer science domain don’t fully consider the nested condition and the proposed FGCS dataset makes up for this shortcoming. Table 3. Results on FGCS and SciERC. * means we adopt the results reported by original papers on SciERC. Model Type Nested NER&RE

Model DygIE++* PURE* SpERT* PFN*

Non-Nested NER&RE TBO UNIRE

4.5

FGCS NER RE

SciERC RE+ NER RE

RE+

75.2 77.0 74.6 75.4

58.9 61.2 51.2 57.5

56.0 45.8 53.8

67.5 66.6 70.3 66.8

48.4 48.2 50.8 -

– 35.6 – 38.4

56.4 55.7

24.6 21.0 24.2 16.1

67.1 65.6

37.1 34.8 43.5 34.4

Performance on Fine-Grained Entities and Their Relations

The performance of several state-of-the-art baselines on fine-grained entities is presented in Table 5. It can be seen that all three baselines perform worse on both nested entities and MIS entities. The NER results for three baselines decrease by an average of 7.0%/5.5% on nested/MIS entities. PFN drops 10.5%/7.3%

662

H. Wang et al. Table 4. Precision, recall and F1 on FGCS. Model

NER P R

F1

RE P

R

F1

RE+ P R

F1

DygIE++ PURE SpERT PFN

72.8 76.2 73.1 71.5

75.2 77.0 74.6 75.4

64.0 61.7 50.6 61.1

54.6 60.7 51.8 54.3

58.9 61.2 51.2 57.5

– 56.4 45.3 57.0

– 56.0 45.8 53.8

TBO UNIRE

71.9 46.5 56.4 48.6 16.5 24.6 41.4 14.1 21.0 70.3 46.1 55.7 46.1 16.4 24.2 38.2 10.2 16.1

77.8 78.0 76.2 79.8

– 55.6 46.4 50.9

on these two types of fine-grained entities, which is far more than PURE and SpERT. This gap could be attributed to that PFN only utilizes token pairs information while PURE and SpERT consider span information. Span contains more boundary information, which has been proven to be important for NER. These observations reflect the difficulty to recognize fine-grained entities. Table 5. F1 on fine-grained entities and their relations. Model

Total NER RE

PURE

Nested Entities MIS Entities RE+ NER RE RE+ NER RE RE+

77.0

61.2 56.0

70.9

65.7 62.4

70.6

52.8 48.4

SpERT 74.6

51.2 45.8

70.3

55.0 51.4

71.9

35.7 32.3

PFN

57.5 53.8

64.9

61.1 57.2

68.1

49.6 43.8

75.4

Table 6. Relation distributions for nested and MIS entities. Relation

Nested Entities (%) MIS Entities (%)

Used-for 18.2 Hyponym-of 70.7 0.9 Part-of Synonymy-of 9.9 0.3 Evaluate-for

66.5 24.7 3.8 3.1 1.9

Performance of relations on MIS entities is also shown in Table 5. It can be found that: – Performance of relations on MIS entities decreases by an average of 10.7%/10.4%.

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

663

– Models perform better on relations on nested entities than on total datasets. There is a 4.1%/5.1% improvement on average for results on relations. The opposite results between nested entities and MIS entities are caused by the different distributions of relations for these two entity groups. As shown in Table 6, Used-for accounts for 66.5% in relations on MIS entities while Hyponymof accounts for more than 70% in relations on nested entities. According to Table 7, F1 of relation Hyponym-of outperforms 35.8% over F1 of Used-for, which indicates Hyponym-of is much easier to recognize than Used-for. Relation extraction for discontinuous entities remains to be solved. In general, relation extraction on fine-grained entities still is a challenge. Table 7. Relation F1 for different relations.

5

Relation

Model PURE SpERT PFN

Used-for Hyponym-of Part-of Synonymy-of Evaluate-for

45.5 71.1 17.3 75.4 47.5

35.8 62.5 2.6 52.7 38.5

42.8 68.1 17.4 57.9 46.2

Conclusion and Future Work

We proposed a fine-grained entity and relation extraction dataset in the computer science domain, including nested entities, discontinuous entities, minimal independent semantics entities, and relations on them. Experiments show FGCS is more suitable for evaluating the model’s ability of handling fine-grained entities and their relations. Fine-grained entities are valuable to construct high-quality knowledge base. However, Results on our fine-grained entities dataset reflect that fine-grained entities are more difficult to recognize than general entities. We also notice that nearly all existing methods do not cover relations of discontinuous entities, which is an important issue in information extraction. Thus, more attention should be paid to this task in the future.

A A.1

Annotation Guideline Entity Category

– Task: Problems to solve, systems to construct, in order to achieve the goal of a specific domain.

664

H. Wang et al.

– Method: A method used or proposed to solve a problem or complete a task, including methods, systems, tools and models. – Metric: A concrete or abstract method for evaluating a method or task. – Material: Concrete dataset or knowledge base. – Problem: Inherent problems or defects of a method. – Other Scientific Term: Scientific terms related to CS but do not fall into any of the above classes. A.2

Relation Category

Relation can be annotated between any two entities (even two nested entities). All relations (Used-for, Part-of, Hyponym-of, Evaluate-for, Synonymy-of) are asymmetric except Synonymy-of. To reduce redundancy, the head entity A always appears before tail entity B in the sentence for Synonymy-of. For example, in phrase ”Recurrent Neural Network (RNN)”, only (”Recurrent Neural Network”, Synonymy-of, ”RNN”) is annotated while the reversed triplet is ignored. – Used-for: A is used for B, B is based on A, etc. – Part-for: A is a part of B, etc. – Hyponym-of : A is a hyponym of B. Note the difference between Hyponym-of and Part-of. Usually, Hyponym-of refer to entities at different level of abstraction, while Part-of refer to entities at the same level. – Evaluate-for: B is evaluated by A, etc. – Synonymy-of : A has same meaning with B, A and B refer to same entity, etc.

References 1. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: Semeval 2017 task 10: scienceie-extracting keyphrases and relations from scientific publications. In: SemEval-2017, pp. 546–555 (2017) 2. Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Adversarial training for multicontext joint entity and relation extraction. In: EMNLP, pp. 2830–2836 (2018) 3. Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755 (2019) 4. G´ abor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: Semeval-2018 task 7: semantic relation extraction and classification in scientific papers. In: SemEval-2018, pp. 679–688 (2018) 5. Jain, S., van Zuylen, M., Hajishirzi, H., Beltagy, I.: Scirex: a challenge dataset for document-level information extraction. In: ACL, pp. 7506–7516 (2020) 6. Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.S.: S2orc: the semantic scholar open research corpus. In: ACL, pp. 4969–4983 (2020) 7. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP, pp. 3219–3232 (2018)

FGCS: A Fine-Grained Scientific IE Dataset in Computer Science Domain

665

8. Magnusson, I., Friedman, S.: Extracting fine-grained knowledge graphs of scientific claims: dataset and transformer-based results. In: EMNLP, pp. 4651–4658 (2021) 9. QasemiZadeh, B., Schumann, A.K.: The acl rd-tec 2.0: A language resource for evaluating term extraction and entity recognition methods. In: LREC, pp. 1862– 1868 (2016) 10. Shen, Y., Ma, X., Tan, Z., Zhang, S., Wang, W., Lu, W.: Locate and label: a two-stage identifier for nested named entity recognition. In: ACL-IJCNLP, pp. 2782–2794 (2021) 11. Taill´e, B., Guigue, V., Scoutheeten, G., Gallinari, P.: Let’s stop incorrect comparisons in end-to-end relation extraction! arXiv preprint arXiv:2009.10684 (2020) 12. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: EMNLP-IJCNLP, pp. 5784–5789 (2019) 13. Wang, J., Lu, W.: Two are better than one: joint entity and relation extraction with table-sequence encoders. In: EMNLP, pp. 1706–1721 (2020) 14. Wang, J., Shou, L., Chen, K., Chen, G.: Pyramid: a layered model for nested named entity recognition. In: ACL, pp. 5918–5928 (2020) 15. Wang, Y., Sun, C., Wu, Y., Zhou, H., Li, L., Yan, J.: Unire: a unified label space for entity relation extraction. In: ACK-IJCNLP, pp. 220–231 (2021) 16. Yan, Z., Zhang, C., Fu, J., Zhang, Q., Wei, Z.: A partition filter network for joint entity and relation extraction. In: EMNLP, pp. 185–197 (2021) 17. Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: NAACL-HLT, pp. 50–61 (2021)

Improving Event Representation for Script Event Prediction via Data Augmentation and Integration Yuting Liu1,2,3 , Kun Ding1,3(B) , Fengxiao Guo2 , Ming Liu1,3 , Liu Liu1,3 , Baowei Wang2 , and Yi Sun1,3 1 The Sixty-Third Research Institute of National University of Defense Technology,

Nanjing 210007, China [email protected], [email protected] 2 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China {20211249407,wang}@nuist.edu.cn 3 Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China Abstract. Script event prediction aims to predict the most likely following events, given the historical events in the script. This task requires the capability to learn more information between events. Most previous methods mainly focused on the current local information, while ignoring more inner semantic features of events. In this work, we propose a novel framework, called ECer, which can obtain more comprehensive event information by utilizing data augmentation and information integration. We first employ rectified linear attention to connect the initial event representation at the argument level. Then, to learn richer semantic information, data augmentation is further applied to expand data and introduce external knowledge. Furthermore, the initial representation and the features of augmented data were mixed by Mixup. Finally, an attention module is applied to the context event chain to integrate context events concerning the current candidate event dynamically. The experimental results on the widely-used New York Times corpus demonstrate the effectiveness and superiority of the proposed model. Keywords: Script Event Prediction · Event Representation · Rectified Linear Attention · Data Augmentation

1 Introduction Understanding events described in the text is crucial for many AI tasks, and script event prediction is one of these challenging tasks. The concept of script was introduced by [1] to understand natural language texts. Figure 1 gives a simple illustration of a typical script about “Tom”, which describes the scenario of a person going to a restaurant. In this script, “Tom entered restaurant”, “Tom sat down”, “Tom ordered food,” and “waiter ordered food” happen successively. This structured representation is called script, and script event prediction aims at predicting the most likely following events from a series of contextual events already occurring in a given scene. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 666–677, 2023. https://doi.org/10.1007/978-3-031-44696-2_52

Improving Event Representation for Script Event Prediction

667

Understanding as much semantic information as possible about the events that have occurred is vital to this task. Therefore, learning more semantic information by valid event representation helps predict the subsequent event. Currently, existing studies on event representation learning usually adopt the idea of encoding so that as much information as possible about the event elements is retained at the event level. Some works adopted Word2Vec [2, 3] to learn phrase vector representations, but its inter-element interaction needs to be stronger. Latter neural network-based methods can better express interactions between events, [4] proposed a combined tensor-based neural network model, and [5] proposed a matrix-vector recursive neural network, which learned combinatorial vector representations of phrases or sentences of arbitrary grammatical type and length. To use external commonsense knowledge, [6] aimed to integrate knowledge like sentiment and intention into event representations. However, the above methods mainly used the current local information. Specifically, the existing methods have two challenges: i) the problem of inadequate event representation, which can only learn shallow information about events; ii) script event prediction has rarely been studied to integrate external knowledge to enrich event information, and it also lacks a large amount of data for event representation learning. Context event chain

Candidate event

?



enter(Tom, restaurant) sit down(Tom) order(Tom, food) offer(Waiter, food)

pay(Tom, ticket) leave(Tom, restaurant)

do(waiter, homework)

Fig. 1. Script event prediction example.

In this paper, we propose a novel framework of event representation learning that integrates semantical connections and external knowledge, termed as ECer. Specifically, to address the first challenge, we use rectified linear attention (ReLA) to connect the event representation at the argument-level. To address the second challenge, we use data augmentation to provide more external knowledge. On the one hand, data augmentation introduces external knowledge to learn richer semantic information, which improves performance in prediction; on the other hand, random noise is added to the data to prevent overfitting, which not only expands the amount of data but also improves the robustness of the model. In general, this paper has the following main contributions: 1. A model called ECer is proposed. The proposed model uses a data augmentation strategy first to add noise to the training text and then uses BERT to learn deep semantic features, which are mixed by Mixup. By transforming the input data, the model’s generalization ability is increased, and the robustness is improved. 2. In our framework, the ReLA connects the event representations at the argument level. ReLA uses ReLU as the activation function for attention scores, which can automatically learn sparse attention in a highly flexible and efficient way. The integration

668

Y. Liu et al.

of ReLA and the augmented event representation is used to acquire rich semantic associations between events, which aims to learn more detailed features between events. 3. Experimental results on the New York Times corpus show that our approach can achieve superior performance over previous methods.

2 Related Work 2.1 Event Representation Predicate-GR was suggested by [7] to express an event for a single-protagonist scenario as (predicate; dependence), where dependence referred to the grammatical relationship between the verb and the protagonist, and predicate referred to the predicate verb which was used to describe the event. However, Predicate-GR assumed that if certain event texts lacked a unifying theme, events from several roles would be wrongly combined into a single role. To address these flaws, [23] Put forward the Rel-grams event representation pattern, expressed as (Arg1; Relation; Arg2). The above work mainly focused on a single protagonist event. For multi-protagonist events, [8] proposed an event representation with multiple argument elements in the form of v(es , eo , ep ), where v represented the predicate verb, es , eo , ep denoted the subject, object, and prepositional object to the verb, respectively. 2.2 Data Augmentation Data Augmentation (DA) is a technique to expand the training data automatically. Data augmentation was first applied in the field of CV, such as image style migration [9]. By increasing the number of training data, DA may successfully enhance the performance and resilience of models. The six categories of DA methods in NLP can be further divided into two main groups, including label-independent general methods (word replacement [10], back translation [11], introduction of noise [12] and label-related specific methods (same-label text semantic noise [13], generative models [14], and Mixup methods [15]). 2.3 Script Event Prediction Approaches for script event prediction may be categorized into two primary groups: statistical learning-based methods and deep learning-based methods. PMI [7] and Bigram [16], using event co-occurrence and joint probability, are representative models among statistical learning-based approaches. However, the methods failed to capture the innate relationship between events. The statistical learning methods are gradually replaced by deep learning-based methods, which can be divided into three categories: Pair-based, Chain-based, and Graph-based. The primary idea of pair-based models, such as Word2Vec [3] and EventComp [17], is to compute correlation strength between event pairs. However, they disregarded the temporal data between distinct occurrences. To emphasize the temporal information in the narrative event chain, [18] pioneered using LSTM for script event prediction tasks. Following their approach, [19] proposed SAM-Net, which uses dynamic memory

Improving Event Representation for Script Event Prediction

669

network to extract features of event chains. Nonetheless, the event chain-based methods underutilized the knowledge of the connections between events. To represent the predicate of an event, [20] exploited a narrative event evolutionary graph to model the event. Additionally, [21] presented PairLSTM, which combines the benefits of event pair-based consistency learning and chain-based temporal order learning, [22] proposed MCer, which combines argument-level, event-level, and chain-level representations to improve the representation learning of events. These methods have achieved better results so far.

3 Model Formally, given a set of context events E = (e1 , e2 , ..., en−1 , en ) and a subsequent event in a candidate set C = (ec1 , ec2 , ..., ecm ), our work is to predict the most likely following events from the series of events that have occurred in a given scene. Following the event pattern in [8], an event e is defined as e = v(es , eo , ep ), where v denotes the verb in the event, and es , eo and ep represent subject, object and indirect object, respectively. For example, e = offer(waiter, Tom, menu) means that “A waiter offers Tom a menu”. In this paper, MCNC is used as the evaluation method, which is to choose the next most likely subsequent events eci from a given set of candidate events C. . There is only one event in the candidate set, and the events share a subject in the candidate set. Finally, Accuracy (%) is used as the evaluation metric. As indicated in Fig. 2, the following elements make up our framework of ECer: 1) The initial event representation is obtained via the Event Representation Component; 2) The Data Augmentation Component, which uses BERT and noise to get deep semantic features in the form of word vectors; 3) In the Global Evolution Component, the event representation and the augmented data are mixed to create the final event representation, and the narrative event chain from the training set is used to build the matter mapping by event graph; 4) The Candidate Event Prediction Component is used to choose the appropriate next events by calculating the correlation score between contextual events and candidate events. 3.1 Event Representation Component The event representation aims to encode each input as a low-dimensional dense vector, whose input is a d-dimensional word vector obtained by pre-training with the Glove, and representation will be fine-tuned during the subsequent training of the model. To learn the rich semantic associations between events, this paper innovatively uses ReLA to capture fine-grained connections between events. The attention model is often implemented by querying context-dependent scores to apply distribution limitations, which are activated by softmax. However, the model generates a dense distribution and pays little attention to even unrelated features, which might make it difficult to analyze the model’s information flow. ReLA employs RELU as the activation function for the attention score, which allows for additional flexibility since ReLU is sparse and can take on any non-negative value.

670

Y. Liu et al. Data Augmentation Component

Candidate Event Prediction Component

Features C-Score BERT

Attention

Raw Text Event Representation Component



DGCN

Rectified Linear Attention Mix up





Global evolution Component

Fig. 2. Overall architecture of the ECer model.

Specifically, given the chain E and a candidate event eci , for each argument, we create a query vector, a key vector and a value vector: Q = XWq ; K, V = YWk , YWv

(1)

where X ∈ Rn×d is query inputs; Y ∈ Rm×d is a sequence of context items; W∗ ∈ Rd ×dh denotes trainable model parameters; n and m are the query and context length, respectively; d and dh are the query and context length, respectively. Then we compute ReLA: ReLA(X , Y ) = LN (αV )

(2)

α = ReLU (f (Q, K T ))

(3)

LN (z) = σ (w  z)  RMSNorm(z) = σ (w  z) 

z g RMS(z)

(4)

where α ∈ Rn×m is the attention weight; f (·) is a scoring function; LN (·) denotes variants of layer normalization, designed to avoid gradient instability and unsuccessful optimization, and ReLU (·) = max(0, ·) is the rectified linear unit;  denotes the element-wise multiplication, and σ (·) denotes the sigmoid function; w ∈ Rdh and g ∈ Rdh are a trainable parameter and a gain parameter, respectively. p Then connect the embedded representation {xiv , xis , xio , xi } that were obtained during ReLA modification, tanh is employed to signify each occurrence: p

ei = tanh(Concat(xiv , xis , xio , xi )W E ), i = 1, 2, ..., n − 1, n

(5)

Here, W E ∈ R4d ×4d is a trainable transformation matrix; xiv , xis , xio , xi denotes argument H . representations for each event, drawn from the slice of X[4(i−1);4i−1] p

Improving Event Representation for Script Event Prediction

671

3.2 Data Augmentation Component The need for neural network models on a large scale is gradually growing along with the development of deep learning. However, in the task of script event prediction, limited data prevents the model from properly learning the deeper semantic relationship between events, which has a significant impact on prediction accuracy. As a result, the method of data augmentation is used to gather more training data and train a model with a more excellent generalization capability. The noise-based approach adds weak noise that does not seriously affect the semantics, making it appropriately deviate from the original data. This method not only expands the amount of training data, but also improves the robustness of the model. Specifically, when the training sample is fed into the model, random noise is added to the input variable, causing the data of the input model to be different each time, then normalization is achieved by the generative model of BERT. A random noise, which is called Gaussian noise, with a normal distribution, and its probability density function (PDF) is: PDF(x) = (1/(σ ∗ sqrt(2 ∗ π ))) ∗ e(−(x−μ)

2 /(2∗σ 2 ))

(6)

where x is random variable; μ is mean value, and σ is standard deviation. 3.3 Global Evolution Component The global event evolution component aims to further learn the event evolution pattern through the event logic graph to guide the prediction of subsequent events. The two steps in this component are as follows: obtain the final event representation by Mixup and discover the event evolution pattern buried in the event graph using DGCN [20]. The original sample and augmented data were combined. Mixup, an interpolation technique known as linear, is used to create the final data. Interpolating and mixing the input data through fine-grained word vectors can obtain smoother augmented data: ek = λei + (1 − λ)ej

(7)

Here, ei and ej represent the initial event representation and the augmented event representation respectively; λ depends on the beta distribution, ranging from 0 to 1. In order to maintain the order between events and the structure connection of graphs, DGCN is used to update the event representation on the graph-level. Specifically, a directed event evolution graph based on events is constructed, with nodes denoting event sets and edges indicating time or causality taken from contextual event chains A. To save computational resources and improve efficiency, only subgraphs GA of each event chain and its candidate events are supplied into the model in each training instance. DGCN considers a multi-layer graph convolutional network with the following layerwise propagation rule: ˆ − 2 E (l) W (l) ) ˆ − 2 Aˆ D E (l+1) = σ (D 1

1

(8)

where E (l+1) is the final state; E (l) ∈ RN ×4d denotes the convolution of l th -order neighborhood for each event node; σ (·) denotes the relu function; Aˆ = A + IN ∈ RN ×N is the

672

Y. Liu et al.

adjacency matrix with added self-connections; IN is an identity matrix with dimension  ˆ − 21 is ˆ − 21 Aˆ D ˆ ij = j Aˆ ij and W (l) ∈ RN ×4d denotes trainable weight matrix; D of N; D the spatial aggregation of neighboring event nodes. 3.4 Candidate Event Prediction Component With the final state representation of each event, an attention module is used to model the correlation between the chain of contextual events and the candidate events.  ai eiL (9) cA = i

exp(uij ) aij =  k exp(uij )

(10)

uij = tanh(Wh eiL + Wc ecLj + b)

(11)

where eiL and ejL correspond to the ith context event (i = 1, 2, ..., n − 1, n) and the jth candidate event (j = 1, 2, ..., m − 1, m), respectively. Both are obtained by Eq. (7). Wh , Wc ∈ R4d ×4d are parameter matrices. aij represents the relevant importance of context event, and cA denotes the final event representation. Finally, to measure the relatedness score, which can be expressed as the function g. There are numerous options in the real circumstance. Here, we introduce cosine function to compute the score between cA and ecLj in the following, then the greatest score as the correct candidate event will be chosen: scoreij = uij g(cA , ecLj ) g(cA , ecLj ) = Cosine(cA , ecLj ) = sj =

1 n scoreij i=1 n

(12) cA · ecLj

cA ecLj 

(13) (14)

3.5 Training Details The training objective is to minimize the cross-entropy loss: L() =

N λ 1  2 max(0, η − si − sm ) + N 2N

(15)

m=0 m=i

where N is the number of training samples; si is the score for the ith candidate event, and m is the index of the correct subsequent event;  denotes all the model parameters; η is the margin factor, and λ is the L2 regularization factor, which is set to 1e-8. We use DeepWalk to train the verb embedding, and Word2Vec for other event argument vector representations. The embedding dimension is 128, and the batch size is 500 in the model. RMSprop algorithm is used to optimize the model training process. We use Early stopping technology to judge when to stop the training loop, and the maximum number of iterations is 100.

Improving Event Representation for Script Event Prediction

673

4 Experiments 4.1 Dataset The dataset is derived from the New York Times (NYT) portion of the Gigaword corpus, using the C&C tool for POS markup and dependency resolution, followed by OpenNLP for coreference resolution. The dataset had a total of 160331 samples, separated into a training set, verification set, and test set in a ratio of 140331: 10000: 10000. 4.2 Baselines In order to adequately assess the impact of ECer, we compare the performance of our approach against the following baseline models. The classical models based on statistical learning (PMI and Bigram), event pairs (Word2vec and EventComp), event chain (PairLSTM and SAMNet), event graph (SGNN), and integration (SGNN + Int + Senti and Mcer) are selected, respectively. All experiments are conducted under the same configuration and parameters; accuracy (%) of predicting the correct subsequent events is used as the evaluation metric. • PMI [Chambers and Jurafsky, 2008] uses PMI to learn narrative relationships and measure the relationship between events and narrative chains. • Bigram [Jans et al., 2007] uses binary conditional probability to represent the correlation strength of two events, and event chain is modeled according to the order of observing event chain. • Word2Vec [Mikolov et al., 2013] uses the cosine similarity of two events as the correlation strength of the event pair. • EventComp [Granroth-Wilding and Clark, 2016] uses multi-layer perceptron to encode events and calculates their pairwise similarity. • PairLSTM [Wang et al., 2017] uses LSTM to integrate chronological information into event embedding. • SAM-net [Lv et al., 2019] uses dynamic memory network to extract features of event chains. • SGNN [Li et al., 2017] constructs narrative event evolution to map event representation. • SGNN + Int + Senti [Ding et al., 2019] further extends SGNN by integrating external commonsense knowledge such as the intent and sentiment information into the event graph. • Mcer [Wang et al., 2021] integrates events at multiple levels of granularity to enhance event presentation learning. 4.3 Experimental Results Table 1 shows the experimental results of our and various types of baseline models on the test set. Based on the results in the table, the following conclusions can be drawn: (1) Neural network-based methods significantly outperform statistical learning-based methods (PMI and Bigram). The main reason is that the statistical learning-based methods ignore the intrinsic connections between events and cannot cope with the problems of sparsity and generalization.

674

Y. Liu et al. Table 1. Results of baseline models in Script Event Prediction task. Baseline Model

Acc(%)

PMI [7]

30.52

Bigram [16]

29.67

Word2vec [3]

42.23

EventComp [17]

49.57

PairLSTM [21]

50.91

SAM-Net [19]

54.48

SGNN [20]

52.45

SGNN + Int + Senti [6]

53.88

Mcer [22]

56.64

ECer (ours)

59.40

(2) By comparing the event chain-based and event graph-based models, it can be found that they perform better than event pair-based models. The main reason is the importance of temporal information and event evolution patterns for subsequent event prediction. (3) Comparison between the representation integration (SGNN + Int + Senti et al.) and other single methods (PMI et al.) demonstrates that the complementary effects between different models can bring gaining impacts to each other. ECer can capture the relations from event-level, chain-level and graph-level representations to provide more accurate semantic relations. (4) The ECer model proposed in this paper achieves the best experimental results, which is more than 2% higher than the best performance, fully proving the superiority of the proposed method. 4.4 Comparative Studies on Variants

Table 2. Results of ECer variants. Modules

Methods

Acc (%)



Event Representation Component

ReLA

59.40



Maksed self-attention

57.92

−1.48

Data Augmentation Component

additive attention

57.55

−1.85

Multiplicative attention

57.32

−2.08

BERT + Noise

59.40



Label independent

58.19

−1.21

Label related

56.99

−2.41

Improving Event Representation for Script Event Prediction

675

Table 2 shows the experimental results of switching different attention functions and fusion methods on the development set. (1) In the Event Representation Component, we choose three attention mechanisms to compare. We can see that these attentional mechanisms perform better based on data augmentation than when not used. However, ReLA can bring better impacts. The main reason is that other methods assign small amounts of attention to irrelevant features, which may harm the performance of these operations. (2) In the Data Augmentation Module, all methods listed can achieve impressive predicting results. This is probably because data augmentation can bring more practical knowledge, which plays an essential role in predicting the answers. 4.5 Ablation experiments

Table 3. Results of Ablation experiments. Methods

Acc (%)



ECer

59.40



– ReLA

57.26

−2.14

– DA

56.63

−2.77

–Global Module

56.01

−3.39

Ablation experiments were conducted under different settings to verify the effectiveness of each module of the ECer model, and the results are shown in Table 3. (1) The effect is reduced by 2.14% after removing ReLA, which proves that Rectified Linear Attention helps improve the efficiency of event prediction. (2) The effect of removing the data augmentation module is reduced by 2.77%, which proves that the data augmentation module is valid and can learn richer semantic information. (3) The effect of removing the global evolution module is reduced by 3.39%, proving that the global evolution module is essential for script event prediction. Learning the integration pattern of events can effectively improve event prediction performance.

5 Conclusion In this paper, we proposed the ECer model to handle the script event prediction task, which integrates event representation and features. ECer develops a rectified linear attention to learn fine-grained event representation, and utilizes data augmentation so as to get deep semantic features. Then, the global evolution component is used to obtain the final representation and discover the event evolution pattern. Experimental results on the NYT dataset demonstrate its merits and superiority. However, currently, existing models fail to mine more implicit interactions between events. In future work, we will introduce external background information, such as

676

Y. Liu et al.

knowledge graphs, and propose new effective methods to fuse the extracted information to help infer the correct answers. Acknowledgements. The authors would like to thank the anonymous reviewers for the help comments. This work was supported by the National University of Defense Technology Research Project ZK20-46, the Young Elite Scientists Sponsorship Program (2021-JCJQ-QT-050), the National Natural Science Foundation of China under Grants 61972207, U1836208, U1836110, and 61672290, and the China Postdoctoral Science Foundation funded project (2021MD703983).

References 1. Schank, R.C., Abelson, R.P.: Scripts, plans, and knowledge. IJCAI 75, 151–157. Morgan Kaufmann Publishers Inc, San Francisco (1975) 2. Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems. Journal 26(2), 3111–3119 (2013) 3. Mikolov, T., Chen, K., Corrado, G., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv (2013) 4. Weber, N., Balasubramanian, N., Chambers, N.: Event representations with tensor-based compositions. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4946– 4953. AAAI Press, Palo Alto (2018) 5. Socher, R., Huval, B., Manning, C.D., et al.: Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1201–1211. ACL Press, Jeju Island (2012) 6. Ding, X., Liao, K., Liu, T., et al.: Event representation learning enhanced with external commonsense knowledge. arXiv preprint arXiv (2019) 7. Chambers, N., Jurafsky, D.: Unsupervised learning of narrative event chains. In: Proceedings of ACL-08, pp.789–797. ACL Press, Columbus (2008) 8. Pichotta, K., Mooney, R.: Statistical script learning with multi-argument events. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 220–229. ACL Press, Sweden (2014) 9. Luan, F., Paris, S., Shechtman, E., et al.: Deep photo style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4990–4998 (2017) 10. Mueller, J., Thyagarajan, A.: Siamese recurrent architectures for learning sentence similarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016) 11. Xie, Q., Dai, Z., Hovy, E., et al.: Unsupervised data augmentation for consistency training. arXiv preprint arXiv (2019) 12. Coulombe, C.: Text data augmentation made simple by leveraging NLP cloud APIs. arXiv preprint arXiv (2018) 13. Yan, G., Li, Y., Zhang, S., et al.: Data augmentation for deep learning of judgment documents. In: International Conference on Intelligent Science and Big Data Engineering, pp. 232–242. Springer, Cham (2019) 14. Li, K., Chen, C., Quan, X., et al.: Conditional Augmentation for Aspect Term Extraction via Masked Sequence-to-Sequence Generation. arXiv preprint arXiv (2020) 15. Guo, H., Mao, Y., Zhang, R.: Augmenting data with mixup for sentence classification: an empirical study. arXiv preprint arXiv (2019)

Improving Event Representation for Script Event Prediction

677

16. Jans, B., Bethard, S., Vulic, I., et al.: Skip n-grams and ranking functions for predicting script events. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 336–344. ACL Press, United States (2012) 17. Granroth-Wilding, M., Clark, S.: What happens next? Event prediction using a compositional neural network model. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2727–2733. AAAI Press, Palo Alto (2016) 18. Pichotta, K., Mooney, R.: Learning statistical scripts with LSTM recurrent neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2800–2806. AAAI Press, Palo Alto (2016) 19. Lv, S., Qian, W., Huang, L., et al.: Sam-net: integrating event-level and chain-level attentions to predict what happens next. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pp. 6802–6809. AAAI Press, Palo Alto (2019) 20. Li, Z., Ding, X., Liu, T.: Constructing narrative event evolutionary graph for script event prediction. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 4201–4207. AAAI Press, Palo Alto (2018) 21. Wang, Z., Zhang, Y., Chang, C.: Integrating order information and event relation for script event prediction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 57–67. ACL Press, Copenhagen (2017) 22. Wang, L., Yue, J., Guo, S., et al.: Multi-level connection enhanced representation learning for script event prediction. In: Proceedings of the Web Conference 2021, pp. 3524–3533. ACM Press, New York (2021) 23. Balasubramanian, N., Soderland, S., Etzioni, O.: Generating Coherent Event Schemas at Scale. ACL Press, Seattle (2013)

Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting Kaiqi Feng, Zecheng Tang, Juntao Li(B) , and Min Zhang Soochow University, Suzhou, China {fengkq,zctang}@stu.suda.edu.cn {ljt,minzhang}@suda.edu.cn

Abstract. Recent studies have revealed that grammatical error correction methods in the sequence-to-sequence paradigm are vulnerable to adversarial attacks. Large Language Models (LLMs) also inevitably experience decreased performance when confronted with adversarial examples, even for the GPT-3.5 model. In this paper, we propose a simple yet very effective Cycle Self-Augmenting (CSA) method and conduct a thorough robustness evaluation of cutting-edge GEC methods with three different types of adversarial attacks. By leveraging the augmenting data generated from the GEC models themselves in the posttraining stage and introducing regularization data for cycle training, our proposed method can effectively improve the model robustness of welltrained GEC models with only a few more training epochs at an extra cost. Experiments indicate that our proposed training method can significantly enhance the robustness and performance of the state-of-the-art GEC model. Keywords: Grammatical Error Correction Self-Augmenting

1

· Robustness ·

Introduction

Grammatical error correction (GEC) is one of the essential application tasks in the NLP community for its crucial values in many scenarios including, but not limited to, writing assistant, automatic speech recognition, and information retrieval, which mainly aims to detect and correct various textual errors, such as spelling, punctuation, grammatical, word choice, and other article mistakes [1]. Recently, more attention has been paid to improving the robustness of NLP models [2,3] since the brittleness of out-of-domain data [4,5] or adversarial attacks [6,7] could hinder the safe deployment of these models in the real world. GPT-3.5 is renowned for its exceptional performance in artificial intelligence and has demonstrated remarkable proficiency in the task of GEC. Nevertheless, our experiments have revealed that it is susceptible to adversarial attacks, which can significantly compromise its accuracy and reliability in certain scenarios. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 678–689, 2023. https://doi.org/10.1007/978-3-031-44696-2_53

Robust and Effective GEC with Cycle Self-Augmenting

679

Existing solutions to tackle GEC tasks can be roughly divided into two categories, i.e., sequence-to-sequence generation (Seq2Seq) [8] and sequence-toediting (Seq2Edits) [9]. The former group performs the translation from ungrammatical sentences to the corresponding error-free sentences, while the latter introduces tagging or sequence labeling to merely edit a small proportion of the input sentences, remaining the rest part unchanged. Intuitively, the dramatic performance decline can be mitigated by training with plenty of adversarial examples for a certain type of attack [10]. However, such methods require considerable data for each attack type, which is infeasible for real-world scenarios, and the significant improvement in robustness is possibly accompanied by a performance decrease in the original testing data since the model gradually forgets the previously learned knowledge [11]. To avoid these problems, we introduce the concept of regularization data, a kind of hard sample the model can’t easily coverage on, and propose a very effective Cycle Self-Augmenting (CSA) method to improve the model robustness with only a few more training epochs as an extra cost. More concretely, our proposed CSA method is only introduced in the post-training stage of a converged GEC model and merely needs the original training data. Experiments with seven strong models on four prevalent benchmark datasets demonstrate the effectiveness of our CSA method. To test the robustness of current cutting-edge GEC models, we first introduce three perturbation methods to obtain adversarial examples, including two substitution-based methods, i.e., Mapping & Rules and Vector-Based, and one LM-based method Back-Translation (BT). In short, our contributions are: (1) we introduce the concept of regularization data in GEC task and propose one Cycle Self-Augmenting (CSA) method to improve the robustness of GEC models as well as mitigate the forgetting problems, (2) we first comprehensively analyze the robustness of models in the GEC task by utilizing three different adversarial attack sets and (3) we first assess the robustness of GPT-3.5 in the task of GEC.

2

Attack Data Construction

In this section, we demonstrate the details of attack data construction, which is the cornerstone of testing the defense ability of GEC models. The construction process consists of two processes: locating the vulnerable tokens where the model may make mistakes and perturbing those tokens with mistakes that most likely occur in the context. 2.1

Adversarial Attack

The aim of textual adversarial examples is to confuse the NLP models. Give one input X , the model F can predict the result O. The adversarial example X  can be constructed with the following paradigm: F(X  ) = O,

sim(X , X  ) ≥ δ,

(1)

680

K. Feng et al.

where sim is a score function that calculates the similarity between the input X and the adversarial example X  , and δ is the threshold. In other words, the original text X and the adversarial text X  share a great similarity, but the model tends to generate different results for X and X  respectively. 2.2

Vulnerable Tokens Location

For either Seq2Seq architecture or Seq2Edits architecture, we can locate the vulnerable tokens according to the probabilities of generated tokens or edit tags, i.e., the lower probability is, the lower confidence the model gives. 2.3

Vulnerable Tokens Perturbation

We apply three textual adversarial attack methods for the vulnerable tokens, including two substitution-based methods: Mapping & Rules, Vector-Based, and one LM-based method Back-Translation. Some examples are listed in Table 1. Mapping and Rules. For the parallel training set of GEC, we can build correctionto-error mapping, i.e., confusion set, for vulnerable tokens perturbation. Then, we utilize the method in previous work [10] to apply word substitution-based perturbations to each corpus. Vector-Based. We use plug-and-play embedding models, e.g., word2vec [12], which are trained with massive data. Specifically, given one token xi , which requires to be perturbed, we can get the word embedding of it by applying the embedding model and then obtain its adjacent word in the representation space by calculating the distance of its nearest work representations. Back-Translation. We reverse the GEC parallel corpus to train a BT model, which can generate the ungrammatical text given a text and maintain the original semantic meaning. We implement a variant decoding algorithm, noise beam search [13], which adds a small noise rβrandom to perturb each decoding step, where r ∼ N (0, 1) and βrandom is a hyper-parameter. Table 1. Examples of Mapping & Rules, Vector-Based, and Back-Translation attack samples. Examples of Attacked Data Source: Mapping & Rules: Vector-Based: Back-Translation: Golden:

Above all, life is more important than secret On all, life is from important than secret Fitz Bits all, life makes more important than secret Above all, life are more important than secret Above all, life is more important than secrecy

Robust and Effective GEC with Cycle Self-Augmenting

3

681

Method

In this section, we introduce the Cycle Self-Augmenting method (CSA) by illustrating the Self-Augmenting mechanism in Sect. 3.1 and the Cycle Training process in Sect. 3.2. In the cycle training process, we also present the concept of regularization data for GEC, which is the key factor for improving the model robustness. It is worth noting that we neither modify the model architecture nor introduce the extra modules. Such a plug-and-play method is significant and can be easily adapted to different GEC models.

Fig. 1. Overview of our proposed Cycle Self-Augmenting (CSA) method. The SelfAugmenting mechanism corresponds with step ①–④, and we launch Regularization Training in step ⑤.

3.1

Self-augmenting

The crux of Self-Augmenting is to obtain augmenting data pairs DAug , which are constructed by the model itself. As shown in Fig. 1, given a converged GEC model F and the original training dataset D={(Xi , Yi ) | i = 1 · · · |D|}, we feed each source text Xi into F to obtain the corresponded output Oi (step ①-②), and compare Oi with the golden Yi (step ③). If there are differences between Oi and Yi , new parallel data (Oi , Yi ) would be regarded as Daug (step ④), which is the set of sentence pairs that model did not modify correctly, and we call them difficult data. 3.2

Cycle Training

Noting that Cycle Self-Augmenting (CSA) method is composed of two stages, Self-Augmenting mechanism and Regularization Training. There are λ cycles in Self-Augmenting mechanism, namely Stage I. Each cycle includes the processes from step ①∼④ of Fig. 1. As mentioned above, we construct augmentk in cycle k, where 0 ≤k≤ λ, which will be used as training data ing dataset DAug for the next cycle k + 1.

682

K. Feng et al.

More concretely, the augmenting datasets obtained contain many difficult data in the original training dataset, which can be simply used to improve both model performance and robustness. Accordingly, we adopt the following training process for each cycle during Stage I, i.e., when 0 ≤k≤ λ: k – Perform training on DAug until convergence (fine-tuning I). – Conduct further tuning on a small high-quality GEC dataset Dtune to prevent over-fitting on the augmenting dataset (fine-tuning II).

Along with the model training, there are fewer and fewer difficult data pairs 1 k λ in the augmenting datasets (DAug ,. . . ,DAug ,. . . ,DAug ). Simply utilizing the augmenting dataset in each cycle for training might yield over-fitting on these datasets. Thus, we further propose a Regularization Training strategy to focus on these hard-to-learn data (Stage II), which is sketched out in step ⑤. Inspired by previous work [14] that names some specific samples that are negatively associated with the performance of knowledge distillation as regularization examples, which can improve the performance of the model. Here, we treat these hard-to-learn data as Regularization Data for the GEC task, which has not been learned after λ cycles. When 0 ≤k≤ λ, the regularization data of the k-th k 0 k =DAug ∩ · · · ∩ DAug . During the Stage II, the trained cycle is obtained as DReg GEC model from Stage I is further trained only once below: k – Perform training on DReg until convergence (fine-tuning I). – Further tuning on a small high-quality GEC dataset Dtune (fine-tuning II).

The benefits of launching further training on regularization data are fourfold: 1) it prevents over-fitting on the easy-to-learn data pairs in the augmenting datasets; 2) it can reduce model capacity to improve its generalization ability and robustness; 3) it gives more opportunities for the model to address hard-to-learn pairs; 4) it can accelerate each training cycle by using fewer data pairs.

4

Experiments

In this section, we present the details of dataset processing and evaluation metrics, followed by the description of baselines and the implementation details of GEC models. 4.1

Dataset

For fine-tuning I, we combine three corpora, including the Lang-8 Corpus of Learner English (Lang-8) [15], the National University of Singapore Corpus of Learner English (NUCLE) [16] and the First Certificate in English (FCE) corpus [17], and exclude the correct sentence pair (X , Y) if X and Y are same. For fine-tuning II, we directly utilize the Cambridge English Write & Improve + LOCNESS Corpus (W&I+LOCNESS) [18]. We select the official testing datasets of BEA-2019 shared task (W&I-Dev) [19], CoNLL-2014 testing set [20], FCE testing set [21], and JFLEG testing set [22]. For attack sets, we utilize three aforementioned perturbation methods in Sect. 2.3 to generate attack sets for each testing set. The statistic of datasets are presented in Table 2.

Robust and Effective GEC with Cycle Self-Augmenting

683

Table 2. Statistics of datasets, where the first group shows the training data, the second group shows the testing/attack data and the last group shows the data for training BT model. Besides, “†” denotes the data used in fine-tuning stage I, and “‡” denotes the data used in fine-tuning stage II. Dataset

Language #Sentences Attack Error (%) Usage

Lang-8 NUCLE FCE W&I+LOCNESS

English English English English

1,102,868 57,151 34,490 34,308

✗ ✗ ✗ ✗

51.1 38.2 62.6 66.3

W&I-Dev CoNLL-2014 FCE JFLEG

English English English English

4,384 1,312 2,695 1,951

   

-

Testing/Attack Testing/Attack Testing/Attack Testing/Attack

PIE-synthetic

English

9,000,000



100.0 %

Back-Translation

4.2

% % % %

Fine-tuning Fine-tuning Fine-tuning Fine-tuning

† † † ‡

Baseline and Setting

Baseline Selection. It is worth noting that our proposed CSA method is a post-training strategy that can be applied to any off-the-shelf GEC model. We first leverage seven cutting-edge models as our baselines to conduct experiments under the supervised setting, including three Seq2Seq models: Transformer [23], BART [24] and SynGEC [25], and four Seq2Edits models: RoBERTa, XLNet [26] and TMTC [27] method. Setting. For the CSA method, we set max cycle times λ as five and patience P as one to control the training process, i.e., stopping the cycle training process once the model performance decreases in P adjacent cycles. Besides, we set the max training epochs as 10 for all the cycles to avoid under-fitting and select the best model according to the performance on the validation set. For each original testing set and their corresponding attack sets, we report the average score of five random seeds in the results. 4.3

Main Results

Original Testing Set Results. We first present the experimental results on four original testing sets in Table 3, where we can observe that our CSA method yields improvement on six out of seven baselines and achieve competitive results on the SynGEC baseline, which utilizes the extra syntactic information to enhance the model. Our CSA method can achieve better or competitive results on the original testing set, which shows the effectiveness of cycle training method.

684

K. Feng et al.

Table 3. Evaluation results on the original testing data. We report the performance for each model. The bold fonts denote the best performance of each comparison. Model

BEA (ERRANT) CoNLL-2014 (M 2 ) FCE (M 2 ) JELEG Prec. Rec. F 0.5 Prec. Rec. F 0.5 Prec. Rec. F 0.5 GLEU

Transformer CSA (O , Y  )

65.5 69.6

59.4 64.2 68.9 64.7 68.6 69.5

43.9 61.8 49.5 64.3

59.4 63.0

39.5 54.0 59.7 43.9 58.0 62.7

BART CSA (O , Y  )

68.3 70.9

57.1 65.6 69.3 61.9 68.9 70.4

45.0 62.6 46.7 63.9

59.6 65.2

40.3 54.4 57.3 34.4 55.3 59.4

SynGEC CSA (O , Y  )

74.9 74.4

65.2 72.7 74.6 65.3 72.4 72.0

49.0 67.6 52.3 67.0

68.9 67.8

49.6 64.0 61.9 49.1 63.0 61.8

RoBERTa (GECToR) 77.2 76.8 CSA (O , Y  )

55.1 71.5 73.9 56.4 71.6 72.6

41.5 64.0 53.7 64.1

61.6 62.7

45.3 57.5 59.1 44.8 58.0 59.2

XLNet (GECToR) CSA (O , Y  )

79.2 79.1

53.9 72.4 77.5 54.9 72.7 76.9

40.1 65.3 41.6 65.7

71.9 71.5

41.3 62.7 56.0 42.7 63.1 60.1

RoBERTa (TMTC) CSA (O , Y  )

79.9 79.2

51.5 71.9 75.4 52.3 71.9 74.8

41.6 64.8 42.3 64.8

58.9 61.3

36.5 52.4 58.7 35.5 53.5 59.0

XLNet (TMTC) CSA (O , Y  )

81.4 80.3

50.8 72.5 77.1 52.9 72.8 76.3

42.0 66.1 43.7 66.4

58.4 61.2

33.8 51.0 58.7 34.1 52.8 58.7

Table 4. Average scores on three attack sets, including Mapping & Rules, VectorBased and Back-Translation methods. We report the performance for each model. The bold fonts indicate the best performance of each comparison. Model

BEA (ERRANT) CoNLL-2014 (M 2 ) FCE (M 2 ) JFLEG Prec. Rec. F 0.5 Prec. Rec. F 0.5 Prec. Rec. F 0.5 GLEU

Transformer CSA (O , Y  )

14.7 18.1

44.4 16.8 51.0 20.7

32.0 38.1

40.9 33.3 44.5 39.0

22.6 26.4

31.1 27.3 39.9 36.2 27.9 42.5

BART CSA (O , Y  )

17.0 19.9

45.0 19.3 52.6 22.7

36.4 39.9

41.5 37.1 47.1 40.9

24.1 27.3

30.8 25.2 40.9 36.6 28.8 42.7

SynGEC CSA (O , Y  )

19.4 20.0

51.7 22.1 52.7 22.8

41.2 42.0

47.3 42.1 49.4 43.2

31.3 30.7

41.6 32.9 44.9 40.9 32.3 45.3

GECToR (RoBERTa) 17.0 17.4 CSA (O , Y  )

50.3 19.6 50.8 20.0

33.1 33.6

45.1 34.8 45.3 35.1

25.9 26.4

37.2 27.5 40.7 37.5 28.0 41.2

GECToR (XLNet) CSA (O , Y  )

17.4 17.8

51.0 20.0 51.6 20.4

33.1 33.5

45.9 34.9 46.7 35.3

26.8 27.1

39.1 28.6 41.0 38.9 28.9 40.4

TMTC (RoBERTa) CSA (O , Y  )

16.0 16.8

47.1 18.4 49.0 19.3

33.5 34.1

44.6 35.1 51.6 36.1

24.4 24.7

34.0 25.8 40.4 35.6 26.3 40.8

TMTC (XLNet) CSA (O , Y  )

15.7 16.0

46.7 18.1 47.2 18.5

32.7 32.6

43.4 34.2 49.7 34.6

23.4 23.3

32.4 24.7 40.1 34.6 24.9 40.2

Attack Set Results. We report the average results of three different attack sets in Table 4. We can observe that our CSA method yields great improvement

Robust and Effective GEC with Cycle Self-Augmenting

685

on all the attack sets, e.g., 5.7 points improvement of the Transformer model on the CoNLL-2014 attack sets.

5

Study

In this section, we conduct ablation studies to help better understand our CSA method and the regularization data. We first conduct thorough experiments to study the robustness of Large Language Models in Sect. 5.1. Then we further explore the effect of hyper-parameter in Sect. 5.2 and the insights of performance improvement brought by the Cycle-Training method in Sect. 5.3, as well as the regularization data in Sect. 5.4. We conduct experiments on the vanilla Transformer model and use the CoNLL-2014 testing sets. 5.1

Large Langeage Models Robustness

We evaluate the robustness of GPT-3.5 (text-davinci-003) using the original CoNLL-2014 testing data and its three types of attack sets. Given that the format and the exact wording of GPT-3.5’s prompts can significantly impact the task performance, we conduct zero-shot and few-shot prompt setting proposed by previous work [28], which performs best in prompt engineering experiments. The results are listed in Table 5, indicate that GPT-3.5 suffers from a significant decrease in performance on the attack sets, e.g., from 34.8 to 28.4 on the Vectorbased set in few-shot setting. Despite being trained on a vast amount of data, GPT-3.5 is still susceptible to a reduction in its robustness. 5.2

Influence of Max Cycle Times λ

k for Recall that we utilize two subsets for the fine-tuning stage, including DAug the first stage and Dtune for the second stage, we find that the model performance improves significantly during the second stage. Thus, we want to figure out whether the improvement comes from the small high-quality dataset Dtune . Besides, the influence of max cycle times λ remains to be explored. We post-train the GEC model on the original train dataset and then compare its performance with our CSA method on the original testing set and the attack set with the increasing max cycle times λ. The results are shown in Table 6, where we can find that the CSA method surpasses the post-trained GEC model on both the original testing set and the attack set, and the performance of the post-trained GEC model on the testing set declines along with the increased cycle times λ, which indicates that simply utilizing the small high-quality dataset Dtune for Cycle Training cannot improve the model performance but leads to over-fitting. Besides, the performance of the post-trained GEC model generally surpasses the converged GEC model in cycle 0, which means more training steps can make the GEC model adapt to those hard samples, thus improving the model robustness.

686

K. Feng et al.

Table 5. Experimental results on CoNLL-2014 testing data and its correlated attack set, where † represents the zero-shot setting and ‡ represents the few-shot setting. Data

Transformer Davinci-003 † Davinci-003 ‡ Prec. Rec. F 0.5 Prec. Rec. F 0.5 Prec. Rec. F 0.5

Origin

68.9

43.9 61.8

31.5

52.3 34.2

32.1

52.3 34.8

Mapping & Rules 36.5

37.8 36.7

27.6

48.2 28.1

28.0

48.0 30.6

Vector-Based

26.5

36.0 28.0

27.6

48.2 30.2

25.9

46.6 28.4

Back-Translation 33.0

48.9 35.3

26.4

50.0 29.1

26.5

49.7 29.2

Table 6. Comparison between the direct post-training strategy in original train dataset and our CSA method, where “Ori Set” represents the original testing set and “Att Set” represents the attack testing set. #Cycles

0

1

2

3

4

5

Ori Set Baseline 61.8 56.2 57.0 56.5 57.1 55.2 + Ours 61.8 63.1 63.9 64.2 64.3 63.2 Att Set Baseline 36.7 37.4 37.0 36.6 37.0 37.1 + Ours 36.7 41.3 41.5 41.7 42.0 42.2 Table 7. Comparison between different cycle training strategies. For each group, (CSA) represents our method, and (AUG) means utilizing augmenting data directly for cycle training. The bold fonts denote the best performance of each comparison. Model

5.3

Original Testing Set Attack Set Prec. Rec. F 0.5 Prec. Rec. F 0.5

Transformer 68.2 CSA (O , Y  ) 69.5 AUG (O , Y  ) 69.4

42.7 60.9 49.5 64.3 48.8 63.9

36.5 41.2 38.6

37.8 36.7 45.8 42.0 46.1 39.7

BART 69.3   CSA (O , Y ) 70.4 AUG (O , Y  ) 65.1

45.0 62.6 46.7 63.9 47.4 60.6

36.7 42.3 37.6

37.4 36.8 45.5 42.9 45.1 37.5

XLNet 77.5   CSA (O , Y ) 76.9 AUG (O , Y  ) 67.8

40.1 65.3 41.6 65.7 49.1 63.0

42.1 42.1 38.6

45.3 42.7 45.9 42.8 45.7 39.6

Effect of Cycle Training Strategy

The core of the Cycle Training method is to filter the regularization data for posttraining and make the model gradually adapt to new data distribution without forgetting the learned knowledge. To show the significance of the Cycle Training strategy, we skip Stage II which is designed for filtering the regularization data.

Robust and Effective GEC with Cycle Self-Augmenting

687

The results are shown in Table 7, where we can observe that applying the cycle training strategy can achieve better performance compared with directly the augmenting data for post-training. Besides, we can also find that when directly post-training the model with augmenting data, the models suffer from serious over-fitting problems since the performance on the original testing set declines. However, the models can still achieve better robustness than their corresponding baselines when applying both training strategies. In short, Cycle Training benefits from two advantages: (1) filtering the regularization data for post-training and (2) iteratively feeding the model with hard-to-learn data, which contributes to making the models adapt to new data distributions. Table 8. The influence of regularization data amount. Reserving Rates (%) Original Testing Set Attack Set P R F 0.5 P R F 0.5

5.4

100 %

69.5 49.5 64.3

41.2 45.8 42.0

75 %

68.3 48.8 63.2

40.9 44.5 41.6

50 %

67.5 49.4 62.9

40.3 43.7 40.9

25 %

67.2 48.4 62.3

40.1 42.8 40.6

Effect of Regularization Data

As mentioned above, the regularization data plays a significant role in improving the robustness. To explore the impact of regularization data amount on the model performance, we gradually reduce the amount of regularization data in the last cycle of the Cycle Training process. Table 8 presents the experimental results, where we can observe that the performance on the original testing set and the attack set continually decreases as the amount of regularization data decreases. However, the model can still surpass the baseline with only 25% regularization data, which shows the effectiveness of the regularization data.

6

Conclusion

In this paper, we explore the robustness of the traditional GEC models and Large Language Models, including GPT-3.5, in the task of GEC by thoroughly evaluating various types of adversarial attacks and proposing an effective cycle self-augmenting method to improve model robustness. Specifically, with our plugand-play method, the GEC model can achieve significant improvement on both

688

K. Feng et al.

the original testing set and the attack set in the highest GEC system. Experimental results on seven strong baselines, four benchmark test sets, and three types of adversarial attacks confirm the effectiveness of our proposed method, which can generalize well to various GEC models with very limited extra cost. Acknowledgement. We would like to thank the efforts of anonymous reviewers for improving this paper. This work is supported by the National Science Foundation of China (NSFC No. 62206194), the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20220488), and JSSCBS20210661.

References 1. Wang, Y., et al.: A comprehensive survey of grammar error correction. In: arXiv preprint arXiv:2005.06600 (2020) 2. Wang, X., Wang, H., Yang, D.: Measure and improve robustness in nlp models: a survey. arXiv preprint arXiv:2112.08313 (2021) 3. Zhang, Y., et al.: Interpreting the robustness of neural NLP models to textual perturbations. In: Findings of the Association for Computational Linguistics: ACL 2022 (2022) 4. Wang, H., et al.: Adversarial domain adaptation for machine reading comprehension. arXiv preprint arXiv:1908.09209 (2019) 5. Hendrycks, D., et al.: Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100 (2020) 6. McCoy, R.T,. Pavlick, E., Linzen, T.: Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007 (2019) 7. Jin, Di, et al.: Is bert really robust? natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932 2 (2019) 8. Chollampatt, S., Ng, H.T.: A multilayer convolutional encoder-decoder neural network for grammatical error correction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32(1) (2018) 9. Stahlberg, F., Kumar, S.: Seq2Edits: sequence transduction using span-level edit operations. arXiv preprint arXiv:2009.11136 (2020) 10. Wang, L., Zheng, X.: Improving grammatical error correction models with purposebuilt adversarial examples. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020) 11. Zhao, K., et al.: Consistent representation learning for continual relation extraction. arXiv preprint arXiv:2203.02721 (2022) 12. Mikolov, T., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 13. Xie, Z., et al.: Noising and denoising natural language: diverse backtranslation for grammar correction. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers) (2018) 14. Zhou, H., et al.: Rethinking soft labels for knowledge distillation: a bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650 (2021) 15. Tajiri, T., Komachi, M., Matsumoto, Y.: Tense and aspect error correction for ESL learners using global context. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2012)

Robust and Effective GEC with Cycle Self-Augmenting

689

16. Dahlmeier, D., Ng, H.T., Wu, S.M.: Building a large annotated corpus of learner English: the NUS corpus of learner English. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (2013) 17. Yannakoudakis, H., Briscoe, T., Medlock, B.: A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (2011) 18. Yannakoudakis, H., et al.: Developing an automated writing placement system for ESL learners. Appli. Measurem. Educ. 31(3), 251–267 (2018) 19. Bryant, C., et al.: The BEA-2019 shared task on grammatical error correction. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (2019) 20. Ng, H.T., et al.: The CoNLL-2014 shared task on grammatical error correction. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task (2014) 21. Dahlmeier, D., Ng, H.T.: Better evaluation for grammatical error correction. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2012) 22. Napoles, C., Sakaguchi, K., Tetreault, J.: JFLEG: a fluency corpus and benchmark for grammatical error correction. arXiv preprint arXiv:1702.04066 (2017) 23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017) 24. Katsumata, S., Komachi, M.: Stronger baselines for grammatical error correction using pretrained encoder-decoder model. arXiv preprint arXiv:2005.11849 (2020) 25. Zhang, Y., et al.: SynGEC: syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser. arXiv preprint arXiv:2210.12484 (2022) 26. Omelianchuk, K,, et al.: GECToR-grammatical error correction: tag, not rewrite. arXiv preprint arXiv:2005.12592 (2020) 27. Lai, S., et al.: Type-Driven Multi-Turn Corrections for Grammatical Error Correction. arXiv preprint arXiv:2203.09136 (2022) 28. Coyne, S., Sakaguchi, K.: An Analysis of GPT-3’s Performance in Grammatical Error Correction. arXiv preprint arXiv:2303.14342 (2023)

DAAL: Domain Adversarial Active Learning Based on Dual Features for Rumor Detection Chaoran Zhang1 , Min Gao1(B) , Yinqiu Huang1 , Feng Jiang2 , Jia Wang1 , and Junhao Wen1 1

School of Big Data and Software Engineering, Chongqing University, Chongqing, China [email protected] 2 Chongqing Business Vocational College, Chongqing, China

Abstract. The widespread of rumors has a negative impact on society, and rumor detection has attracted significant attention. When a new event appears, the scarcity of corresponding labeled data causes a severe challenge to rumor detection. It is necessary to query high-quality unlabeled data and annotate it for detection. Previous studies on active learning for rumor detection can query the unlabeled samples on the decision boundary based on the textual features of posts. These studies, when selecting the optimal samples, have not sufficiently considered that new event posts are usually far from old events and differ in terms of sentiment, which often plays a key role in rumor detection. Therefore, overlooking these characteristics could potentially lead to sub-optimal performance in rumor detection based on these active learning methods. To this end, we propose domain adversarial active learning based on dual features (DAAL) for rumor detection, considering these characteristics. Specifically, we first extract dual features, including affective and textual features, to obtain representations of the posts. We then propose a new active learning method that selects the samples furthest from all labeled samples based on their dual features. This method helps our rumor detection model gain more labels from new, distant event posts for training. Finally, we introduce adversarial domain training, a method designed to extract transferable features across different events (or domains), to enhance the adaptability of our rumor detection model to new events. Experimental results demonstrate DAAL can select high-quality candidates and achieve superior performance compared to existing methods. Keywords: Rumor detection · Active learning Textual feature · Affective feature

1

· Adversarial training ·

Introduction

With the development of the Internet, people’s online social activities have become more frequent. While the Internet provides a convenient way to access c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 690–703, 2023. https://doi.org/10.1007/978-3-031-44696-2_54

DAAL: Domain Adversarial Active Learning Real label pool

Labeled source

Fake label pool

Labeled target

691

Unlabeled target

Selected on target Selected on decision boundary

Labeled pool

Fig. 1. Comparison of active learning conceptual diagrams. The left part outlines that previous active learning methods select samples closest to the decision hyperplane, as they are more difficult to distinguish. The right one describes our approach, whereby the samplers query the unlabeled samples that are furthest from the labeled pool because samples from different events are often far apart.

information, it has also facilitated the rapid spread of rumors in social networks [1]. The phenomenon can easily influence public opinion and have negative impacts on social and political security [2]. For instance, a fake video named Somalis ‘pushed into shallow grave’ Ethiopia caused the conflict between the two races in Ethiopia1 . Therefore, it is crucial to quickly and accurately detect rumors. There are currently many methods for rumor detection, and those based on deep learning have attracted much attention due to their excellent performance. Nonetheless, these methods demand a significant volume of labeled data to reach peak performance. Regrettably, in emergency situations, the availability of labeled data within the domain (or new event) is often inadequate for conventional deep learning models. Furthermore, the uncertainty and complexity of a new event also significantly increase the difficulty and cost of data labeling. To tackle the problem, researchers have proposed methods such as domain adaptation [3] or meta-learning [4] to enhance the generalization ability in new events. Nevertheless, it still exists that the expensive labeling cost and time-consuming annotation process. In the absence of labeled data, active learning attempts to annotate fewer data and make models perform better. Given our goal of improving the detection in the target domain, it is necessary to combine active learning with domain adaptation for rumor detection. However, traditional active learning strategies primarily select the samples near the decision boundary based on the textual features. They overlook the distribution differences among different events and the significant role of sentiment in rumor detection. To this end, we propose Domain Adversarial Active Learning based on dual features (DAAL) for rumor detection. The work is non-trivial due to the challenges: Firstly, the identification of valuable samples within rumor detection significantly diverges from typical scenarios, as rumor detection is frequently characterized by the emergence of posts related to new events. Secondly, the 1

https://www.bbc.com/news/world-africa-46127868.

692

C. Zhang et al.

design of a measurement method for posts presents a significant challenge due to the consideration of the domain shift between new event and other events. To tackle the aforementioned challenges, we present an active learning strategy that aims to query the valuable samples containing the most significant textual and affective information based on the domain shift situation. We design samplers based on these textual and affective features to select valuable unlabeled samples. This is because the propagation of affective information is a vital characteristic of rumors [5–7] and rumors are reported by Science magazine to typically evoke surprise and fear [2], and the textual content of a post is a key determinant in rumor detection [8]. These samplers target the most valuable unlabeled samples furthest from the labeled samples (as illustrated in Fig. 1). These samples embody the most crucial textual and affective information for our model. To address domain shift, we learn shared domain features through a process inspired by adversarial training [9]. Our method incorporates five main components: the encoder, rumor detector, domain discriminator, textual sampler, and affective sampler. The training procedure consists of two stages: pretraining and fine-tuning. During pre-training, we combine source domain data with a small subset of labeled target domain data to facilitate shared feature learning among events. In the fine-tuning stage, we utilize newly labeled data, queried through the textual and affective samplers, to further refine the rumor detector, thereby bolstering its detection capability within the target domain. Our main contributions are summarized as follows: • We propose a novel active learning strategy that selects unlabeled samples that are far away from the labeled samples and helps our rumor detection model gain more labels from new, distant event posts for training. • As far as we know, this work is the first exploration of active learning for rumor detection with domain adaptation. We leverage adversarial learning to align the data distribution of the source and target domains, enabling our model to achieve excellent results in the target domain. • We conducted extensive experiments to show the better performance of DAAL and analyze the role of adversarial training and active learning method.

2

Related Work

Rumor Detection Based on Deep Learning. Gao et al. [10] proposed to combine a task-specific character-based bidirectional language model and stacked LSTM networks to represent textual contents. Recent works on rumor detection also have focused on multimodal features. Wang et al. [11] proposed an instanceguided multi-modal graph fusion method that leverages intra- and inter-modality relationships. Mayank et al. [12] proposed DEAP-FAKED, a knowledge graph fake news detection framework for identifying Fake News. However, when dealing with emergencies, the number of labeled posts is often extremely limited, and labeling them is a time-consuming process. To address the problem, we explore the application of active learning in rumor detection.

DAAL: Domain Adversarial Active Learning

693

Rumor Detection with Active Learning. Active learning aims to improve the performance of models using less unlabeled data. Recently, there has been some research on rumor detection combined with active learning. Ren et al. [13] first tried applying active learning on GNN for fake news detection. Farinneya et al. [14] combined the popular active learning strategies and rumor detection models. However, the approaches do not consider the deep-level features of posts and only focus on applying existing active learning frameworks. Active Domain Adaptation. Active domain adaptation aims to improve domain adaptation performance by selecting the most informative subset of data to annotate in the target domain. Su et al. [15] proposed adversarial training to align features between source and target domains and calculates sample weights based on discriminator prediction. Xie et al. [16] proposed selecting target data based on domain features and instance uncertainty and using the free energy of target data compression to the source domain to reduce the domain gap.

3

Methodology

In this section, we will introduce the components of our proposed DAAL approach and our sampling strategy which is based on the dual features of posts. The problem is set up as follows. We define DL = (XL , YL ) as the pool of labeled sample pairs (xL , yL ). An unlabeled sample xU is from DU = (XU ), the pool of unlabeled samples. We set a fixed labeling budget b, representing the maximum number of samples from the unlabeled pool that an oracle can annotate. Our goal is to develop an active learning algorithm that leverages the unique features of posts to select the most informative unlabeled samples. 3.1

Framework Overview

DAAL’s framework is shown in Fig. 2. DAAL has five main components: encoder, rumor detector, domain discriminator, textual feature sampler, and affective feature sampler. The training procedure consists of two stages: pre-training and fine-tuning. During the pre-training stage, we use the encoder to obtain the latent representation of the posts. We then employ a domain discriminator to learn shared domain features through adversarial training. Textual and affective feature samplers query once the most informative unlabeled samples that are unlikely to be found in the labeled pool. In the fine-tuning stage, we utilize the encoder to obtain representations of the selected data. Then, we use the rumor detector to predict whether a post is true or fake. Additionally, the textual and affective feature samplers continue to query which samples are more valuable to label and select subsequent samples at each epoch of fine-tuning.

694

C. Zhang et al.

[CLS]

Unlabeled Target( )

Post Content

token1

token2

...

tokenn

[SEP]

BERT w[CLS]

1

w1

w2

2 RD-fc

...

wn

3 reversal

text-fc1

adv-fc1

text-fc2

TextCNN

adv-fc2

Fake or Real

Which Domain

Pre-Training

ve

w[CLS]

4

BiGRU

Labeled Targe [CLS]

Unlabeled Target( )

Post Content

w[CLS]

w1

w2

1

3

tokenn

[SEP]

...

wn

w[CLS]

text-fc1 text-fc2

TextCNN

ve

4

BiGRU

affect-fc

or

Fake or Real

Budget

...

BERT

Oracle

or Initial Query

token2

token1

RD-fc affect-fc

or

Active learning process

Labeled Source

Subsequent Queries

Selected from

Selected from

or

Fine-Tuning

Fig. 2. The overall architecture of DAAL. ①,②,③, and ④ represent the rumor detector, domain discriminator, textual feature sampler, and affective feature sampler. The yellow line represents the active learning process in DAAL. The initialization parameters of the right part are from the pre-training process. After the pre-training stage, the oracle annotates the initial query results. Then the oracle annotates the results of the iterated query during fine-tuning.

3.2

Rumor Detector

The rumor detector is a basic component of the rumor detection task and aims to distinguish whether a post is a rumor or not. The detector is an MLP network, which predicts the label Pθ (xL ) ∈ (0, 1). Pθ (xL ) = Gr (Ge (xL )).

(1)

The objective function of the rumor detector Gr is minimizing detection loss based on cross entropy as below, where Ge , Gr are parameterized by θe , θr . Our goal is to optimize the two parameters to minimize the objective function: Lf (θr , θe ) = −E(xL ,yL ) [yL log(Pθ (xL )) + (1 − yL ) log(1 − Pθ (xL ))].

3.3

(2)

Textual and Affective Features

In this subsection, we will introduce our approach for extracting textual and affective features from the posts. Textual Features Extraction. Sentence-level features, such as expression and writing style, are more effective in quantifying the textual information of the posts [6]. Therefore, we employ BERT [17] as our encoder to convert the posts

DAAL: Domain Adversarial Active Learning

695

into vectors and extract textual features. Specifically, we represent x from (XL ∪ XU ) as a sequence. h = BERT(x), (3) where h = {hs }S+1 s=0 as the output of BERT is the coding representation of post text. h0 denotes the pooler output of BERT and {hs }S1 denotes the word embedding for the input token xs . We use h0 to represent the textual features at the sentence-level. Affective Features Extraction. Some researchers have found that rumors tend to express sentiments to attract people’s attention, such as fear, incitement, and more. Therefore, it is crucial to extract the flow of affective information [5]. To extract affective features, we focus on five kinds of affective words (Emotion, Sentiment, Morality, Imageability, and Hyperbolic) in the posts and summarize them as one-hot vectors [5]. We then concatenate the above five kinds of features as the representative affective vector (ve ). 3.4

Domain Adversarial Training

Domain adversarial training occurs during the pretraining stage of DAAL, aiming to enable the encoder to extract domain-invariant features. We introduce a domain-adversarial training process based on the domain discriminator. Domain Discriminator. The domain discriminator has two fully connected layers and a softmax layer that predicts the domain auxiliary label Yk among K domains. Gd denotes the domain discriminator, and θd represents its parameters. The objective function of the discriminator Gd is given by: Ld (θd , θe ) = −E(x,yk ) [

K 

1[k=y] log(Gd (Ge (x)))],

(4)

k=1

where (x, yk ) is from the set of data of the labeled pool and unlabeled pool. Adversarial Training. To achieve domain alignment, we introduce a gradient reversal layer (GRL) inspired by the idea of adversarial training between the domain discriminator and encoder [18]. Figure 3 describes during backpropagation the GRL multiplies the gradient by a certain negative constant to reverse ∂Ld d the gradient: λ ∂L ∂θd is replaced with −λ ∂θd . Therefore, minimizing the domain classification loss is equivalent to making the domain discriminator able to distinguish domains, and the encoder can extract the shared domain features. This process can be viewed as a min-max game between the encoder and domain discriminator: the domain discriminator wants to extract domain-specific information to classify correctly, and the encoder aims to extract the shared domain features to fool the discriminator.

696

C. Zhang et al.

domain label

Domain discriminator

p in ut x

Word Embedding

features of posts

...

Word-level Affective Vector

loss Ld

Representative Affective Vector

nt die gra ersal rev yer la

Encoder

Fig. 3. Illustration of gradient reversal layer.

3.5

Fig. 4. The generation of word-level affective vector

Sampling Strategies Based on Rumor Features

In this subsection, we will introduce the textual and affective feature samplers. Then we propose an active learning strategy based on the samplers. Textual Feature Sampler. Textual feature sampler aims to query the samples with the most informative textual features that are furthest from the labeled pool. And they contain the textual features which the model has not learned yet. These samples are expected to improve the model’s performance in the target domain by rich textual information.. To extract textual feature differences between XL and XU and select samples with textual features that the model has not learned from, we use h0 as the input to the textual feature sampler. The textual feature sampler is an MLP network that maps the output to (0, 1), indicating whether the sample is from XL or XU . The textual feature sampler is denoted as Gt , and its objective function is shown below: Lt (θt , θe ) = −E(x,ym ) [ym log(Gt (Ge (x))) + (1 − ym ) log(1 − Gt (Ge (x))],

(5)

where ym is an auxiliary label representing whether the sample x is from XL or XU , i.e., whether it is labeled or unlabeled. Gt is parameterized by θt . Our goal is to optimize the parameters to minimize the objective function to make the sampler query samples most likely from the unlabeled pool by the most informative samples. Affective Feature Sampler. Inspired by the FakeFlow model [5], Fig. 4 shows the affective feature sampler utilizes two components to capture affective information. One component uses CNN to extract the word-level latent representation (vword ). The other uses Bi-GRU to capture the flow of affective information (vaf f ect ). Specifically, CNN uses {h}S1 as input and Bi-GRU uses the above affective vector ve as input to get vaf f ect . We then apply a dot product to combine vword with vaf f ect to obtain the word-level affective vector of a post (vconcat ). vconcat = vword ⊕ vaf f ect .

(6)

DAAL: Domain Adversarial Active Learning

697

Then the sampler processes vconcat with a fully connected layer to get the final word-level affective representation (vf ). We detect x is furthest from XL by the MLP denoted as Ga . The input of the Ga is the above vf , and we need to optimize the parameters θa corresponded with Ga to minimize the objective function of Ga . La (θa ) = −E[ym log(Ga (vf )) + (1 − ym ) log(1 − Ga (vf )],

(7)

where ym is the same as auxiliary labels of the textual feature sampler. Sampling Strategy. Using a combination of textual and affective feature samplers, we query samples that are furthest from the labeled pool in terms of the dual features, containing that the model has not learned the information. We determine the score of a sample by calculating the probability that it belongs to XU . The score of a selected sample is determined by combining the probabilities predicted by both Ga and Gt . score = 1 − (Ga (x) + λs Gt (x)),

(8)

λs is used to adjust the balance between textual and affective features during the querying candidates. The sampler Ga (x) and Gt (x) are respectively used to measure how similar an unlabeled sample is to the labeled pool in terms of affective and textual information. We use the score index to measure the degree of dissimilarity between xu and XL . If the outputs are close to 0, it suggests that the model has not learned some important dual features and needs to learn the mapping relation between features and labels. After receiving scores of all unlabeled samples, we get them sorted and select the top n samples as the most informative and the samples that the model needs to learn the most. As discussed, the top n samples are labeled by experts and added to the training data for the next fine-tuning stage. 3.6

Algorithm Optimization

During pre-training, the encoder extracts post representations, and the rumor detector aims to minimize detection loss Lf (θr , θe ) to improve detection performance. Additionally, there is an adversarial game between the encoder and domain discriminator, where the encoder tries to learn shared-domain features by maximizing the loss Ld (θd , θe ) while the domain discriminator tries to distinguish different domains. Furthermore, the textual and affective feature samplers minimize the loss Lt (θt , θe ) and La (θa , θe ) to select the most valuable posts from textual and affective perspectives. Our goal is to minimize the total loss Lpre (θe , θr , θd , θt , θa ) to achieve the above specific function. Lpre (θe , θr , θd , θt , θa ) = λf Lf (θr , θe ) − λd Ld (θd , θe ) + λc (Lt (θt , θe ) + La (θa , θe )). (9) During the fine-tuning stage, we adopt the sampling strategy introduced in Sect. 3.5 to select the most valuable samples from the unlabeled data pool based

698

C. Zhang et al.

on their textual and affective features. We obtain the representations of the selected samples and fine-tune both the rumor detector and the samplers to enhance the detection performance in the target domain and consistently select the most informative samples. Lf ine (θe , θr , θt ) = Lf (θr , θe ) + λt Lt (θt , θe ).

(10)

The goal of fine-tuning is to minimize the loss Lf ine (θe , θr , θt ), which enhances the performance of the rumor detector in a new domain and optimizes the samplers to query the valuable samples based on the current distribution.

4

Experiment

We compare DAAL with classical and advanced active learning methods and rumor detection models. We then conduct an ablation experiment and examine hyperparameter sensitivity to assess DAAL’s robustness and mechanism. Table 1. Overall performance comparison between DAAL and the baselines in Group A. The target event represents corresponding constructed datasets. Dataset

Stra

15%

20%

25%

30%

35%

40%

45%

50%

Germanwing-crash

DAAL

0.831

0.875

0.863

0.875

0.869

0.856

0.863

0.863

UCN

0.781

0.806

0.850

0.831

0.806

0.825

0.831

0.850

RAN

0.800

0.816

0.788

0.831

0.788

0.825

0.844

0.819

CoreSet

0.806

0.844

0.831

0.869

0.831

0.831

0.836

0.819

Sydneysiege

Ottawashooting

Charliehebdo

Ferguson

TQS

0.738

0.825

0.831

0.800

0.800

0.800

0.825

0.850

DAAL

0.850

0.850

0.854

0.854

0.856

0.867

0.877

0.871

UCN

0.789

0.833

0.835

0.858

0.844

0.848

0.867

0.854

RAN

0.817

0.838

0.846

0.842

0.846

0.848

0.858

0.844

CoreSet

0.838

0.831

0.858

0.858

0.848

0.842

0.854

0.863

TQS

0.758

0.790

0.777

0.792

0.790

0.804

0.800

0.790

DAAL

0.838

0.886

0.861

0.895

0.903

0.895

0.895

0.903

UCN

0.824

0.835

0.835

0.872

0.878

0.895

0.884

0.887

RAN

0.818

0.792

0.830

0.858

0.861

0.852

0.869

0.889

CoreSet

0.807

0.835

0.849

0.852

0.855

0.875

0.878

0.875

TQS

0.778

0.804

0.753

0.810

0.793

0.807

0.815

0.827

DAAL

0.844

0.850

0.861

0.858

0.858

0.861

0.858

0.866

UCN

0.830

0.848

0.835

0.845

0.851

0.846

0.840

0.828

RAN

0.826

0.833

0.846

0.836

0.835

0.823

0.835

0.846

CoreSet

0.833

0.828

0.849

0.831

0.838

0.839

0.831

0.829

TQS

0.785

0.839

0.825

0.850

0.841

0.826

0.869

0.854

DAAL

0.790

0.847

0.826

0.877

0.897

0.884

0.895

0.907

UCN

0.792

0.790

0.824

0.864

0.874

0.865

0.871

0.882

RAN

0.792

0.790

0.814

0.857

0.881

0.875

0.880

0.874

CoreSet

0.790

0.844

0.826

0.866

0.883

0.880

0.875

0.880

TQS

0.799

0.813

0.817

0.826

0.839

0.817

0.817

0.828

DAAL: Domain Adversarial Active Learning

4.1

699

Experiments Setup

Datasets. We evaluate the proposed model on five datasets: Germanwing-crash, Sydneysiege, Ottawashooting, Charliehebdo, and Ferguson, which are all from the PHEME dataset [19]. The PHEME dataset consists of Twitter data containing users’ social information, posted content, and associated labels. We respectively choose one dataset as the target domain, while the remaining four datasets are used as the source domain. Implementation Details. The source domain is as a labeled pool, and the target domain data is split into three distinct groups without overlap: 10% for initial training, 20% for testing, and the remaining 70% as an unlabeled pool. The unlabeled pool is iteratively sampled, with 5% of the pool selected per epoch for annotation by an oracle. 4.2

Baselines

To validate the effectiveness of DAAL, we compare it with the following two groups of baselines. Group A comprises a set of active learning strategies, while Group B comprises a set of cross-domain rumor detection models. (A) Random: Random sampling is a basic sampling method. It selects samples with equal probability. Uncertainty [20]: Uncertainty sampling is achieved by calculating the entropy of the classification and selects samples with high entropy for annotation. Core-set [21]: Core-set selects samples that are farthest from the labeled pool by the L2-norm distance. TQS [22]: TQS aims to tackle domain shift and integrates three evaluation criteria: transferable committee, uncertainty and domainness. (B) EANN: EANN leverages adversarial training to improve its cross-domain performance. EDDFN [23]: EDDFN preserves domain-specific and crossdomain knowledge to detect rumors from different domains. MDFEND [24]: MDFEND employs a multi-expert framework for detecting rumors across domains. 4.3

Result and Discussion

Comparison with Existing Models. Table 1 presents the accuracy of DAAL trained with different percentages of unlabeled data as the target training data. The accuracy score is used as the evaluation metric for rumor detection. It is evident that DAAL has almost the best performance compared to all the baselines. Firstly, we compare the performance of DAAL with TQS to explore the differences in the cross-domain case. DAAL consistently outperforms TQS in accuracy score, demonstrating its efficiency in selecting high-quality candidates in the cross-domain scenario. Then we compare DAAL with other sampling strategies based on the same adversarial training. The results demonstrate DAAL outperforms these methods, highlighting its ability to detect high-quality candidates based on textual and affective features.

700

C. Zhang et al.

To compare the performance of DAAL with other cross-domain rumor detection models, we use the source domain and 20% of the target domain data as the training set for the other rumor detection models, while DAAL is gradually augmented to 20% of the target domain data as described above. Table 2 demonstrates the effectiveness of DAAL on the rumor detection task (partial letters represent the complete event here). Table 2. Performance comparison between DAAL and the baselines in Group B. Charlie.

Sydney.

Ottawash.

Ferguson.

Germanw.

Acc

F1

Acc

F1

Acc

F1

Acc

F1

Acc

F1

EANN

0.843

0.777

0.771

0.762

0.835

0.835

0.848

0.767

0.813

0.812

EDDFN

0.846

0.761

0.805

0.802

0.864

0.863

0.851 0.772 0.819

0.818

MDFEND 0.845

0.768

0.729

0.729

0.864

0.863

0.842

0.742

0.830

0.828

0.850 0.781 0.850 0.818 0.886 0.865 0.847

0.754

0.875 0.875

DAAL

Ablation Analysis. To explore the significance of each part, we compare DAAL to its three variants. DAAL\T: DAAL without textual feature sampler, which means sampling strategy depends on the selection of affective feature sampler. DAAL\A: DAAL without affective feature sampler, which uses the textual feature sampler to select samples. DAAL\D: DAAL without domain discriminator. It removes domain adaptation during pre-training and selects only the target domain as the training data. Table 3 shows domain discriminator, textual and affective feature samplers have a significant contribution to the DAAL framework. At first, DAAL\D surpasses other models by learning the specifics of the target domain. During the cycle query stage, DAAL\D performs poorly due to the lack of shared domainness, highlighting the necessity of adversarial training. Similarly, without the textual or affective feature sampler, DAAL\A and DAAL\T exhibit a reduction of 3.1% and 3.3%, respectively, which is expected as they select the samples from a single view. Table 3. Performance comparison between DAAL and its variants. 10%

15%

20%

25%

DAAL\T 0.681

0.810

0.820

DAAL\A 0.681

0.823

0.842

DAAL\D 0.795 0.810

0.820

DAAL

0.681

30%

35%

40%

45%

50%

0.836

0.838

0.859

0.868

0.874

0.846

0.834

0.841

0.873

0.838

0.861

0.864

0.811

0.810

0.810

0.774

0.800

0.774

0.861 0.868 0.870 0.882 0.885 0.888 0.883 0.885

DAAL: Domain Adversarial Active Learning

Fig. 5. Model performance w.r.t λs in Eq. 8.

701

Fig. 6. Model performance w.r.t the weight in Eq. 9.

Hyperparameter Sensitivity. To explore the influence of λs on the selected sample quality, we set λs to 0.5, 1, and 2 to represent three proportional relations. Figure 5 shows that the model has the best performance when λs is set to 0.5, and the performance of the model is basically the same when λs is set to 1 or 2. We believe that taking affective information as auxiliary information and textual information as main information is conducive to capturing the latent value of samples at a deeper level. Moreover, by tuning the weights in Eq. 9, we investigate the impact of the different losses in the pre-training stage. Figure 6 shows increasing the weight of Lf (θr , θe ) leads to the best performance in the pre-training stage but the worst performance then. In contrast, increasing the weight of Lf (θd , θe ) leads to better performance in the fine-tuning stage as it efficiently helps to extract shared domain features. Furthermore, increasing the weight of Lt (θt , θe ) and La (θa , θe ) leads to improved performance. We consider it is due to the model can better select samples with abundant information by learning the gap between labeled and unlabeled pool.

5

Conclusions and Future Work

In this work, we propose a novel active learning approach for cross-domain rumor detection, which leverages adversarial training to extract shared domain features and queries high-value candidates based on textual and affective features. Experimental results on real-world rumor data demonstrate the superiority of the proposed approach. In the future, we will enrich the diversity of the samples by considering the differences between the distributions of true and fake posts and explore other prominent features of posts to enhance further our model’s detection ability. Acknowledgements. This work was supported by the National Natural Science Foundation of China (62176028) and the Overseas Returnees Innovation and Entrepreneurship Support Program of Chongqing (cx2020097).

702

C. Zhang et al.

References 1. Wang, J., et al.: Robustness analysis of triangle relations attack in social recommender systems. In: 2020 IEEE CLOUD, pp. 557–565. IEEE (2020) 2. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018) 3. Huang, Y., Gao, M., Wang, J., Shu, K.: DAFD: domain adaptation framework for fake news detection. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) ICONIP 2021. LNCS, vol. 13108, pp. 305–316. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92185-9_25 4. Lee, H.-Y., Li, S.-W., et al.: Meta learning for natural language processing: a survey. arXiv preprint arXiv:2205.01500 (2022) 5. Ghanem, B., et al.: Fakeflow: fake news detection by modeling the flow of affective information. arXiv preprint arXiv:2101.09810 (2021) 6. Miao, X., Rao, D., Jiang, Z.: Syntax and sentiment enhanced BERT for earliest rumor detection. In: Wang, L., Feng, Y., Hong, Yu., He, R. (eds.) NLPCC 2021. LNCS (LNAI), vol. 13028, pp. 570–582. Springer, Cham (2021). https://doi.org/ 10.1007/978-3-030-88480-2_45 7. Dong, S., Qian, Z., Li, P., Zhu, X., Zhu, Q.: Rumor detection on hierarchical attention network with user and sentiment information. In: Zhu, X., Zhang, M., Hong, Yu., He, R. (eds.) NLPCC 2020. LNCS (LNAI), vol. 12431, pp. 366–377. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60457-8_30 8. Kwon, S., et al.: Prominent features of rumor propagation in online social media. In: 2013 IEEE 13th International Conference on Data Mining, pp. 1103–1108. IEEE (2013) 9. Wang, Y., et al.: EANN: event adversarial neural networks for multi-modal fake news detection. In: KDD, pp. 849–857 (2018) 10. Gao, J., et al.: RP-DNN: a tweet level propagation context based deep neural networks for early rumor detection in social media. arXiv preprint arXiv:2002.12683 (2020) 11. Wang, J., et al.: Instance-guided multi-modal fake news detection with dynamic intra-and inter-modality fusion. In: PAKDD 2022, Chengdu, 16–19 May 2022, Proceedings, Part I, pp. 510–521. Springer, Cham (2022). https://doi.org/10.1007/ 978-3-031-05933-9_40 12. Mayank, M., et al.: Deap-faked: knowledge graph based approach for fake news detection. In: ASONAM, pp. 47–51. IEEE (2022) 13. Ren, Y., et al.: Adversarial active learning based heterogeneous graph neural network for fake news detection. In: ICDM, pp. 452–461. IEEE (2020) 14. Farinneya, P., et al.: Active learning for rumor identification on social media. EMNLP 2021, 4556–4565 (2021) 15. Su, J.-C., et al.: Active adversarial domain adaptation. In: WACV, pp. 739–748 (2020) 16. Xie, B., et al.: Active learning for domain adaptation: an energy-based approach. AAAI 36, 8708–8716 (2022) 17. Devlin, J., et al.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 18. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML, pp. 1180–1189. PMLR (2015) 19. Zubiaga, A., Liakata, M., Procter, R.: Exploiting context for rumour detection in social media. In: Ciampaglia, G.L., Mashhadi, A., Yasseri, T. (eds.) SocInfo 2017.

DAAL: Domain Adversarial Active Learning

20. 21. 22. 23.

24.

703

LNCS, vol. 10539, pp. 109–123. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-67217-5_8 Sharma, M., Bilgic, M.: Evidence-based uncertainty sampling for active learning. Data Min. Knowl. Disc. 31, 164–202 (2017) Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489 (2017) Fu, B., et al.: Transferable query selection for active domain adaptation. In: Proceedings of the CVPR, pp. 7272–7281 (2021) Silva, A., Luo, L., Karunasekera, S., Leckie, C.: Embracing domain differences in fake news: cross-domain fake news detection using multi-modal data. AAAI 35, 557–565 (2021) Nan, Q., Cao, J., Zhu, Y., Wang, Y., Li, J.: Mdfend: multi-domain fake news detection. In: CIKM, pp. 3343–3347 (2021)

CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding Shu Liu1 , Yongnan Jin1,5 , Harry Lu3 , Shangqing Zhao1 , Man Lan1,2(B) , Yuefeng Chen4 , and Hao Yuan4 1

School of Computer Science and Technology, East China Normal University, Shanghai, China {shuliu,yongnanjin,sqzhao}@stu.ecnu.edu.cn, [email protected] 2 Shanghai Institute of AI for Education, East China Normal University, Shanghai, China 3 Shanghai Qibao Dwight High School, Shanghai, China zhlu [email protected] 4 Shanghai Transsion Co., Ltd., Shanghai, China {yuefeng.chen,hao.yuan}@transsion.com 5 YiJin Tech Co., Ltd., Shanghai, China

Abstract. In recent years, Visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts, and patents. However, there is no high-quality domain-specific dataset available except for English. And for other languages like Chinese, it is hard to utilize current English datasets due to the significant differences in writing norms and layout formats. To mitigate this issue, we introduce the Chinese Commercial Contracts (CCC) dataset to explore better visual document layout understanding modeling for Chinese commercial contract in the paper. This dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) Model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-train task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https:// github.com/yysirs/CLUPT. Keywords: Visual Document Understanding · Chinese commercial contract · Chinese Layout Understanding Pre-train Transformer

1

Introduction

Visual Document Understanding (VDU) is a rapidly growing area of research that aims to automatically perceive and extract useful information from document images, such as business documents, graduate theses, research reports, etc. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 704–716, 2023. https://doi.org/10.1007/978-3-031-44696-2_55

CCC Dataset for DLU

705

With the increasing availability of document images and the growing demand for automated document processing. Currently, VDU is mainly to the following tasks: key information extraction [20,21], document classification [1,7], visual question answering [3,9,15], and layout analysis [18,22], However, it’s worth noting that current dataset language for these tasks are primarily English. Other languages, such as Chinese, differ significantly from English documents due to different language conventions, like writing standards and layout formats. This paper introduces a new Chinese dataset for layout analysis, which helps bridge the gap between English and Chinese layout analysis. The dataset includes diverse texts, including scans with complex layouts, and addresses real-life business problems. Examples of contracts show in Fig. 1.

Fig. 1. Comparing the Chinese document image (the left two) and the English document image (right), we can see significant differences in writing norms and layout format.

Chinese image documents of the contract contain many layout structure information. Since the Optical Character Recognition(OCR) models cannot preserve the layout information of documents, we propose an end-to-end image-to-text model inspired by recent works in table analysis [23]. The proposed model’s pretrain task is to identify image types and parse the image into text content with layout information (similar to HTML format text). This approach enables the unified processing of two different kinds of image elements (text and tables). It also preserves the text layout information and the tables’ semi-structured information. In practical work [5,16] settings, text and tables are typically processed separately in two stages: 1) a layout parse model was used to distinguish between text and tables, and 2) OCR models were used to recognize text content while table recognition models were used to parse table content-an example shown in Fig. 2. This paper describes a new Chinese commercial contract (CCC) dataset for layout analysis tasks. Documents in the CCC dataset have complex layouts, and multiple format types, and relate to real-world business scenarios. Moreover, we evaluated the end-to-end CLUPT model and several state-of-the-art systems on our datasets. The experiments show that CLUPT achieved considerable performance and doubled its speed compared to pipeline models.

706

S. Liu et al.

Fig. 2. The overall workflow. Previous work separated text and tables for parsing and processing. We propose a model that can process both text and table together, generating a sequence with layout information.

The contributions of this paper are summarized as follows: 1. We propose a Chinese commercial contracts dataset with complex layouts, consisting of 10,000 images that contain various information types including text, tables, and handwriting. 2. By pre-training to generate image types and document layout descriptions in a multi-modal and multi-task learning paradigm, we have enhanced the model’s ability to understand textual and tabular information. 3. To our knowledge, the proposed CLUPT is the first end-to-end model that handles both text and tables in Chinese layout analysis.

2

Related Work

Table 1. Scann-base refers to whether scanned document exists,Complex refers to whether the document contains handwritten text or seals. Dataset

Source

Text-based Scann-based Complex Language Domain

PubTabNet [23] scientific articles







TableBank [12]

web/arXiv







DocBank [13]

web/arXiv







Pubtables [19]

scientific articles







CCC

business activities ✓





English

science

Chinese

contract

CCC Dataset for DLU

707

Dataset. The document layout analysis task aims to identify and parse paragraph and table information within a document, with the analysis of tables being currently the most challenging. Currently, the datasets for this task are primarily sourced from publicly available electronic PDF documents and research papers found on the internet with English being the primary language. However, it cannot be applied to the Chinese domain. Table 1 compares datasets related to document layout analysis. Compared to existing datasets, our dataset has the following advantages: 1) The data is sourced from actual commercial activities, which are close to practical applications and helps to narrow the gap between academia and industry in the Chinese domain. 2) The data contains not only tabular information but also textual content, and there are complex data formats such as handwriting and seal elements. Model. The document layout analysis task can be divided into two sub-tasks: text analysis and table analysis. Generally, tables are more diverse and complex compared to text. Table analysis can be divided into three subtasks: table detection, structure recognition, and table recognition. We focus on the table recognition task, which involves parsing the table structure and recognizing the table’s contents. The common approach is to treat table recognition as an object detection task [17]. Text analysis utilizes OCR models for text recognition, but OCR models are unable to retain layout information. Recently some researchers have considered using the Image-to-text approach for table recognition, directly converting the table into HTML with structure information [23]. We have continued the Image-to-text methodology and introduced an end-toend layout analysis model for Chinese domains. We have adjusted our training objectives and adopted the multi-modal and multi-task learning paradigms. The training objectives are as follows: 1) recognize image types, 2) parse textual and tabular information into unified HTML format.

3

Chinese Commercial Contract

In this section, we will describe the process for developing CCC dataset, which comprises two steps: Document Collection and Layout Annotation. Finally, we analyze the statistics of our dataset. Document Collection. The contract data mainly originates from commercial activities. To ensure the quality of the contracts, we randomly selected 5,000 contracts from the historical contract data accumulated over the past year. Through manual screening, poor-quality contract data such as duplication, truncation, and blurriness were removed. Ultimately, we retained 2,000 original contracts. Layout Annotation. The annotation process is divided into two parts: image classification annotation and Image-to-text annotation. We first preprocess the data by splitting the PDFs by page and using a Chinese layout parse tool1 for 1

https://github.com/PaddlePaddle/PaddleOCR.

708

S. Liu et al.

image detection and segmentation to distinguish text and tables. Due to the high cost of manual labor, we conduct image classification annotation before proceeding with Image-to-Text annotation. Image Classification Annotation. We use a layout parse tool that is originally developed for academic documents. However, since it did not entirely correspond to the image types defined in the CCC dataset(fine-grained image types as: text, chapter, title, text-seal, text-hand, text-hand-seal, table, footer, header, figure), an annotator is required to manually correct it. Additionally, another person who is not involved in the annotation process conducts oversight checks to ensure an accuracy rate is over 95%. This stage takes 80 man-hours to complete.

Fig. 3. Examples of text and table annotations. In contract documents, special symbols represent different meanings. For example, a checked checkbox is represented using , ‰ is represented using . Table identifiers mean, and represent the start of rows and cells, while and represent the end,
represent text wrapping in a cell. .

Image-to-Text Annotation. This stage contains text annotation and table annotation. In previous works on table analysis, tables are usually represented as HTML text. To unify the processing of textual and tabular data while retaining the layout information of the text, we introduce a text layout identifier that allows the text data to be represented in pseudo-HTML format, as shown in Fig. 3. The identifiers and their meanings are as follows: — : text wrapping; — : text starts a new paragraph; — : the current paragraph and the previous paragraph belong to the same region; — : the current paragraph has not ended and continues on the next page. — //: different font sizes; We found that some special symbols in the text cannot be recognized by current OCR methods. Therefore, we represent commonly used special symbols using their 16-bit hexadecimal codes. please refer to Fig. 3 for details.

CCC Dataset for DLU

709

Text Annotation involves manual labeling of text parsed using PaddleOCR. Two annotators manually add identifiers and hexadecimal code to the parsed text. The annotators cross-validate the annotation results to ensure accuracy. This stage takes 70 man-hours to complete. Table Annotation. We use PPOCRLabel2 to help with table annotation. The PPOCRLabel algorithm initially identifies the spatial coordinates of tables and subsequently performs optical character recognition to extract textual information. Subsequently, an annotator rectifies any inaccuracies in the results and transforms the table data into HTML format. This stage takes 90 man-hours to complete. Table 2. Statistics of the CCC dataset. Each number in the table indicates the number of images. text-seal means the image contains a seal. text-hand means the image contains handwriting, and text-hand-seal means the image contains both handwriting and seal. Split

text chapter title text-seal text-hand text-hand-seal table footer header figure

Train

3896 969

1287 298

20

7

1111 382

24

6

Validation 546

113

181

47

7

1

29

70

5

1

Test

86

189

14

13

1

100

88

24

3

482

Data Statistics. Finally, the CCC dataset contains 10,000 annotated images; the annotation sample is shown in Fig. 3(b). We split the dataset into train set, validation set and test set with a ratio of 8:1:1. There are 10 fine-grained image types and detailed distribution information is shown in Table 2.

4

Chinese Layout Understanding Pre-train Model

In this section, we describe the details of our proposed CLUPT model based on multimodal and multitask learning. We adopt the donut [8] architecture and transfer the model to Chinese document datasets. Model Architecture. As shown in Fig. 4, our model follows the VisionEncoder-LanguageDecoder [2] Seq2seq architecture. CLUPT does not rely on external OCR modules. The VisionEncoder first maps the input images into embedded vectors. LanguageDecoder then generates HTML sequences given these embedded feature vectors.

2

https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/PPOCRLabel.

710

S. Liu et al.

Fig. 4. The end-to-end image-to-HTML generation process of CLUPT.

VisionEncoder. We input images into VisionEncoder and use Swin Transformer as the encoder for encoding. Swin Transformer segments the input image into patches and converts each patch, represented as RH×W ×C into an embedded vector {ei |1 ≤ i ≤ s, ei ∈ Rh }, where s is the number of image patches and h is the dimension of the feature embeddings of the VisionEncoder, H, W , and C respectively represent the height, width, and number of channels of an image. The embedded vector goes through four stages, where each stage reduces the resolution of the input feature map, similar to how CNN [6] gradually increases its receptive field across layers. The representation from the final layer of Swin Transformers will be fed into the language decoder. LanguageDecoder. Given the encoded feature vectors, the LanguageDecoder generates a token sequence {ti |1 ≤ i ≤ l, ti ∈ Rv } in an auto-regressive manner, where ti is the i -th token, v is the size of token vocabulary and l is output max length hyperparameter. We use a BART [10] decoder as the decoder module. Pre-training. CLUPT is initialized from the donut’s [8] checkpoint for further pre-training. Compared to the donut’s pre-training task, the CLUPT’s pre-training task is divided into two parts: 1) Image classification, categorizing images into ten types as listed in Table 2. 2) Image-to-HTML generation, processing tabular and textual images to generate HTML text sequences. Both parts are integrated into the auto-regressive generation framework, which is defined as follows: P (x) =

n 

p(sn |I, s1 , ..., sn−1 )

(1)

i=1

where x represents the generated target: type_name text_HTML_context . The special identifiers and signify the beginning of image type labels and target HTML labels, while and

CCC Dataset for DLU

711

signify the ending. is an input task instruction. I represents the encoded image embedding from the vision encoder. During the inference phase, CLUPT outputs layout analysis results based on given images and layout parsing instructions. There is only one instruction during CLUPT’s pre-training, we will design more instructions for different fields and tasks in the future.

5

Experiments

We use the Swin Transformer as the VisionEncoder with a layer configuration of {2, 2, 14, 2} and a window size of 10. We use a four-layer BART decoder as the LanguageDecoder. After pre-training on 8,000 samples, we evaluate the CLUPT model’s ability on the test set of CCC. Evaluation Metrics. We use accuracy as the metric to evaluate text recognition performance. We check if the recognized text matches the golden labels at three levels: character, line, and full-text. Accuracy is simple and intuitive but cannot measure content with structural information. Therefore, we use Tree Edit Distance (TED) [23] based accuracy to evaluate the table analysis performance. This metric can cover content with tree-like structures. The formula is expressed as follows: T ED = 1 −

ED(Tp , Tg ) max(len(Tp ), len(Tg ))

(2)

where ED represents the edit distance between two strings. len(t) is the string length of t. Tp represents the model’s predict result, while Tg represents the golden label. Baseline Models. 1) Text layout analysis, we compare CLUPT with three off-the-shelf OCR tools (Baidu OCR, Tencent OCR, and iFLYTEK OCR). For non-tabular data in image format, we call their API to obtain the parsed texts and then calculate text accuracy at the character, line, and full-text levels. For the layout analysis, we focus on the recognition of layout identifiers and define layout accuracy as the accuracy of generated layout identifiers. 2) Table layout analysis, we compare our model with SLANet [11] and TableRec-RARE [4]. We parse the table images into HTML sequences and then calculate the TED scores. Experiment Setup. The experiment is conducted on an NVIDIA RTX3090 GPU. We initialize the model’s weight using the donut [8] checkpoint and pretrain it for 30 epochs. We optimize the model using the AdamW [14] optimizer with a learning rate of 2e-5, a batch size of 4, and a dropout rate of 0.5.

712

S. Liu et al.

Experiment Result Text Parsing Result. From Table 3, We can see that on the test set, our model’s text parsing ability is comparable to off-the-shelf OCR, while our model has superior text layout analysis ability and achieves state-of-the-art performance. In terms of inference speed, our model is 0.5× faster than OCR. Figure 5 shows the fine-grained accuracy results predicted by CLUPT. Table Analysis Result. Table 4 shows that CLUPT performs comparably to state-of-the-art table parsing models on the test set. SLANet and TableRecTable 3. Test performance of the text analysis models on CCC fine-grained acc metrics. The off-the-shelf OCR tools generate no layout identifiers thus their layout accuracy is omitted. w/i t represents that during auto-regression generation, only the Imageto-text is being executed. r/i t represents that during auto-regression generation, the image type is generated after the HTML labels.

Method

Words Line

Full-text Layout Time (ms)

Baidu OCR

0.9715 0.9688 0.9556



1360

Tencent OCR

0.9361 0.9278 0.9268



1232

iFLYTEK OCR 0.9796 0.9682 0.9543



1461

CLUPT

0.952

832

0.9632 0.9598 0.9498

CLUPT w/i t 0.9346 0.9238 0.9121 CLUPT r/i t 0.9492 0.9347 0.9267

0.9346 0.9492

Table 4. Test performance of the table analysis models on CCC on TED metrics. Simple tables: tables without merged cells or seal covering; all other tables are considered to be complex tables.

Method

TED scores Time (ms) simple complex

SLANet

0.9589 0.8723

2376

TableRec-RARE 0.9388 0.8258

2349

CLUPT

1346

0.9432 0.8332

CLUPT w/i t 0.9258 0.8047 CLUPT r/i t 0.9346 0.8158

CCC Dataset for DLU

713

RARE are pipeline models, while CLUPT is end-to-end. Therefore, CLUPT has improved the inference speed by 1× compared to table parsing models. Ablation Analysis. As shown in Table 3 and Table 4, the CLUPT’s performance surpasses that of CLUPT w/i t, indicating that multi-modal and multitask training improves the model’s ability. We also explore the generation order of the tasks and find that generating image type before image-to-text generation yields the best results. A possible reason may be that the model first perceives the overall image type and has learned to distinguish differences in features among different types of images. These differences in features could have an impact on the subsequent generation process.

Fig. 5. Fine-grained accuracy distribution of text image types the “figure” type has no text content, so only whether it is identified as an image is considered. The three types of “text-seal”, “text-hand” and “text-hand-seal” have lower recognition accuracy, which may be due to the influence of factors such as seal covering and handwritten text recognition.

(a) Input real table

(b) Ground truth table

(c) CLUPT predict label

(d) SLANet predict label

Fig. 6. Table analysis results of CLUPT and SLANet on a table image input which has a complex structure(2 multi-row multi-column spans (4,5) cells), twenty normal cells and seal covering, additionally, the text content is dense and arranged tightly.

714

S. Liu et al.

Case Study. To illustrate the differences in the performance of the compared methods, Fig. 6(b) shows the rendering of the predicted HTML given an example input table image. There are six rows and ten columns in this table. The table has a complex structure consisting of 2 multi-row multi-column spans (4,5) cells, twenty normal cells, and a seal covering text; additionally, the text content is dense and arranged tightly. The CLUPT is structurally closer to the ground truth, but there is some error in the recognization of dense text content. Although SLANet did not analyze the structure ultimately, its recognization of dense text content is very close to the ground truth. In the future, we will further improve CLUPT’s ability to recognize dense text in tables.

6

Conclusion

This paper studies the image-based Chinese layout analysis problem, A dataset CCC is developed to train and evaluate CLUPT. We expect this dataset to encourage the development of Chinese Visual Document Understanding. We propose an end-to-end CLUPT model that combines text layout analysis and table structure analysis. The model does not rely on any OCR modules and has improved the inference speed by 1× compared to pipeline models. Moreover, experiments show that the proposed model surpasses or is comparable to recent SOTA models. For future work, we will extend the proposed method to other domains and try to improve its ability in Chinese complex table parsing. Moreover, with the flourishing of large language models, we believe that incorporating LLMs into the CLUPT model is a promising research direction. Acknowledgements. We appreciate the support from National Natural Science Foundation of China with the Main Research Project on Machine Behavior and HumanMachine Collaborated Decision Making Methodology (72192820 & 72192824), Pudong New Area Science & Technology Development Fund (PKX2021-R05), Science and Technology Commission of Shanghai Municipality (22DZ2229004), and Shanghai Trusted Industry Internet Software Collaborative Innovation Center.

References 1. Afzal, M.Z., et al.: Deepdocclassifier: document classification with deep convolutional neural network. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1111–1115. IEEE (2015) 2. Davis, B., Morse, B., Price, B., Tensmeyer, C., Wigington, C., Morariu, V.: End-toend document recognition and understanding with dessurt. In: Computer VisionECCV 2022 Workshops: Tel Aviv, 23–27 October 2022, Proceedings, Part IV, pp. 280–296. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-25069-9 19 3. Dhouib, M., Bettaieb, G., Shabou, A.: Docparser: end-to-end OCR-free information extraction from visually rich documents. arXiv preprint arXiv:2304.12484 (2023)

CCC Dataset for DLU

715

4. Du, Y., et al.: Pp-ocr: a practical ultra lightweight OCR system. arXiv preprint arXiv:2009.09941 (2020) 5. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: a pre-trained language model focusing on text and layout for better key information extraction from documents. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 10767–10775 (2022) 6. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) 7. Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172. IEEE (2014) 8. Kim, G., et al.: OCR-free document understanding transformer. In: Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, 23–27 October 2022, Proceedings, Part XXVIII, pp. 498–517. Springer, Cham (2022). https://doi.org/ 10.1007/978-3-031-19815-1 29 9. Lee, K., et al.: Pix2struct: screenshot parsing as pretraining for visual language understanding. In: International Conference on Machine Learning, pp. 18893– 18912. PMLR (2023) 10. Lewis, M., et al.: Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019) 11. Li, C., et al.: Pp-ocrv3: more attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001 (2022) 12. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: Tablebank: table benchmark for image-based table detection and recognition. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 1918–1925 (2020) 13. Li, M., et al.: Docbank: a benchmark dataset for document layout analysis. arXiv preprint arXiv:2006.01038 (2020) 14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 15. Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: a dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2200–2209 (2021) 16. Peng, Q., et al.: Ernie-layout: layout knowledge enhanced pre-training for visuallyrich document understanding. arXiv preprint arXiv:2210.06155 (2022) 17. Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: Cascadetabnet: an approach for end to end table detection and structure recognition from imagebased documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 572–573 (2020) 18. Shen, Z., Zhang, R., Dell, M., Lee, B.C.G., Carlson, J., Li, W.: Layoutparser: a unified toolkit for deep learning based document image analysis. arXiv preprint arXiv:2103.15348 (2021) 19. Smock, B., Pesala, R., Abraham, R.: Pubtables-1m: towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4634–4642 (2022) 20. Stanislawek, T., et al.: Kleister: key information extraction datasets involving long documents with complex layouts. In: Llad´ os, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 564–579. Springer, Cham (2021). https://doi. org/10.1007/978-3-030-86549-8 36

716

S. Liu et al.

21. Wang, J., et al.: Towards robust visual information extraction in real world: new dataset and novel solution. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2738–2745 (2021) 22. Wang, Z., Xu, Y., Cui, L., Shang, J., Wei, F.: Layoutreader: pre-training of text and layout for reading order detection (2021) 23. Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 564–580. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1 34

Poster: Question Answering

OnMKD: An Online Mutual Knowledge Distillation Framework for Passage Retrieval Jiali Deng1 , Dongyang Li1 , Taolin Zhang2 , and Xiaofeng He1,3(B) 1 2

School of Computer Science and Technology, East China Normal University, Shanghai, China School of Software Engineering, East China Normal University, Shanghai, China 3 NPPA Key Laboratory of Publishing Integration Development, ECNUP, Shanghai, China [email protected]

Abstract. Dense passage retriever recalls a set of relevant passages from a large corpus according to a natural language question. The dual-encoder architecture is prevalent in dense passage retrievers, which is based on large-scale pre-trained language models (PLMs). However, existing PLMs usually have thick structures and bulky parameters, resulting in large memory and time consumption. To overcome the limitation of PLMs, in this paper we apply online distillation to passage retrieval and propose an Online Mutual Knowledge Distillation framework (OnMKD). Specifically, we obtain a lightweight retriever by simultaneously updating two peer networks with the same dual-encoder structure and different initial parameters, named Online Mutual Knowledge Refinement. To further interact with the latent knowledge of intermediate layers, we utilize a novel cross-wise contrastive loss to alternate the representation of questions and passages. Experimental results indicate that our framework outperforms other small baselines with the same number of layers on multiple QA benchmarks. Compared to the heavy PLMs, OnMKD significantly accelerates the inference process and reduces storage requirements with only a slight sacrifice in performance. Keywords: Passage retrieval · Online knowledge distillation Cross-wise contrastive learning · Mutual learning

1

·

Introduction

Passage retriever acquires a set of related passages from an unstructured corpus by computing the similarity of given questions and passages. It is widely applied to various natural language processing tasks, such as open-domain question and answering [9,15], fact verification [26,28], and document ranking [18,25]. The dual-encoder architecture initialized with large-scale PLMs is a new paradigm for dense passage retrievers, which includes both a question encoder and a passage c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 719–731, 2023. https://doi.org/10.1007/978-3-031-44696-2_56

720

J. Deng et al.

encoder. As shown on the left side of Fig. 1, traditional retriever based on dualencoder architecture has thick structures and bulky parameters, slowing the inference process and limiting the deployment to low-resource devices.

Fig. 1. The figure shows the traditional dense passage retriever and an expected framework to solve existing problems. The left side is a retriever initialed with N transformer layers, encoding questions and passages individually. To alleviate the storage pressure and further enhance knowledge fusion, a new framework with M layers peer networks (N >> M ) is on the right side.

Therefore, some researchers have applied knowledge distillation [11] to compress PLMs and obtain light and effective models. For instance, [12,24,29] settle the parameters of BERT [16] during training and generate lighter BERT-like models in different ways. As mentioned above, traditional knowledge distillation can compress dense retrievers, but it needs to store large teacher models in advance, which are challenging to obtain in practice. Moreover, online distillation has emerged as a powerful technique to eliminate the restrictions of traditional distillation in computer vision tasks [3,8,35]. For instance, KDCL [8] treats all DNNs as student networks to generate supervised labels, and OKDDip [3] performs two-level distillation during training with diverse peers. As far as we know, we first apply it to passage retrieval, balancing the trade-off of retrieval accuracy and efficiency. Recent works have demonstrated that exploiting intermediate layers of deep neural networks can improve the performance of knowledge distillation [4,22,30]. BERT-PKD [27] achieves incremental knowledge extraction by patiently matching the feature from multiple intermediate layers. Therefore, we also achieve a deeper fusion of intermediate layers between student and teacher retrievers by adding a loss for intermediate-layer supervision labels. Since two peer networks

OnMKD: An Online Mutual Knowledge Distillation Framework

721

will update simultaneously in online distillation scheme, the supervision labels of the intermediate layer cannot be fixed, so we apply contrastive learning to construct positive and negative examples. Moreover, contrastive learning is widely used in representation learning, both in computer vision [5,10] and natural language process [7,32]. In this paper, we propose an effective cross-wise contrastive loss to enhance the fusion of intermediate layers and obtain informative representations. The embeddings used for the contrastive loss are from different peer networks that encourages the positive embeddings to be closer and the negative embeddings to be farther away. As shown in the right side of Fig. 1, an expected framework is designed to compress the traditional retriever and integrate the rich latent knowledge of the intermediate layers. To tackle the problems mentioned above, we propose an online mutual knowledge distillation framework (OnMKD), including the following two insights: (1) Online Mutual Knowledge Refinement: To compress the heavy PLMs, we apply online distillation to update two peer networks1 simultaneously. The two peer networks have identical structures and different initial parameters to learn extra knowledge from each other. The retrievers extracted by OnMKD have fewer parameters and lighter structures, which can speed up subsequent index and inference. (2) Cross-wise Contrastive Knowledge Fusion: To enrich the fusion of latent knowledge in intermediate layers, we propose a cross-wise contrastive loss to intersect the intermediate representations from peer networks. Specifically, questions from one peer network learn contrastive relations with passages (both positive and negative) from the other. To verify the effectiveness of OnMKD, we conduct experiments on three Question and Answering (QA) datasets, WebQuestions [2], TriviaQA [14], and Natural Questions [17]. Experimental results show that the proposed framework improves retrieval performance compared to baselines of the same size. Additionally, the framework can speed up the inference process while achieving approximate retrieval accuracy with heavy PLMs. The ablation study shows the validity of the individual contributions in our framework. Correspondingly, the main contributions are summarized as follows: – We propose an online mutual knowledge distillation framework (OnMKD) for passage retrieval, which reduces the demand for computation resources and gains more accurate representation. – Online Mutual Knowledge Refinement generates a lightweight model from mutual learning in peer networks to speed up the inference process. Meanwhile, Cross-wise Contrastive Knowledge Fusion is proposed to obtain more relation knowledge from the intermediate layers. – Experimental results show that our framework outperforms the baselines of the same size and improves retrieval efficiency compared to the heavy PLMs.

1

Four encoders in two peer networks learn from each other, where each peer network includes a question encoder and a passage encoder.

722

2 2.1

J. Deng et al.

Related Work Dense Passage Retrieval

Recently, dense passage retrieval (DPR) [15] has achieved impressive performance in acquiring relevant passages, and different works have continued with DPR. (1) As for data-driven, previous works [20,21,33] focus on improving sample strategies to obtain more quality samples. ANCE [33] obtains negative samples globally by multi-step iterative ANN index. (2) As for task-driven, diverse works add extra pre-training tasks to improve the retrieval accuracy [19,23]. [23] combines the unsupervised Inverse Cloze Task with supervised fine-tuning to study retriever pre-training systematically. (3) As for model-driven, recent researchers design the special architectures to obtain more diverse representations [6,31,34]. [34] generates multi-view document representations to better align with different semantic views. However, existing works focus less on the inference efficiency of DPRs, whose complex structure and parameters limit their deployment in low-resource devices. 2.2

Knowledge Distillation

According to whether the teacher and student networks are updated simultaneously, knowledge distillation can be divided into offline scheme and online scheme. (1) Offline distillation is a two-stage distillation [11], which needs to pre-train a large teacher model to guide the training of the student. Taking BERT [16] as an example, [12,24,29] can be obtained by following different offline distillation strategies. Although current offline methods improve distillation performance via designing different loss functions, the training and storage of complex teacher models are not avoided. (2) Online distillation is proposed as an end-to-end trainable method without pre-training the teacher model, reducing the stress of pre-storing a high-capacity network. [3,8,35] have proven that an ensemble of students can learn from each other without an extensive teacher network. While online distillation methods have made much progress in computer vision, they have few applications in passage retrieval. Therefore, this paper proposes a lightweight and efficient framework based on online distillation. 2.3

Contrastive Learning

Contrastive learning pulls together positive instances and pushes away negative samples, facilitating data augmentation and representation learning. In computer vision, [5,10] treat transformed images as positive samples to improve visual representation learning. As for natural language tasks, CLEAR [32] and SimCSE [7] design diverse and influential positive data and obtain better sentence-level representations. However, the embeddings used in previous works are derived from the same space, lacking the diversity of knowledge. Therefore, we propose a novel cross-wise contrastive loss that leverages the two peer networks’ embeddings to enrich the knowledge fusion.

OnMKD: An Online Mutual Knowledge Distillation Framework

3 3.1

723

Methodology Framework Overview and Notations

The main architecture is shown in Fig. 2, there are two insights in our framework: 1) Online knowledge distillation simultaneously optimizes two peer networks of identical structure to obtain lightweight networks for subsequent inference. 2) A novel cross-wise contrastive loss is proposed to learn latent knowledge in intermediate layers via fusing embeddings from different networks. In passage retrieval, we output embedding of questions  the  denote−q and p− as m } , p , ..., p and passages. Let D = { qi , p+ i i,1 i,n i=1 be the set of m train samples, where each question qi corresponds to one positive passage and n negative passages. A large corpus contains c passages denoted as C = {p1 , p2 , ..., pc }, where pi is from diverse fields.

Fig. 2. An overview of OnMKD for training retrievers. Two peer networks are based on the dual-encoder architecture, consisting of six transformer layers. The yellow box shows the weighted sum of dual task loss, distillation loss, and cross-wise contrastive loss. hzqi and hzpi represent the intermediate embeddings of question qi and passage pi , and the superscript z denotes the index of networks it belongs to Network 1 or Network 2. (Color figure online)

724

3.2

J. Deng et al.

Passage Retriever of Dual-Encoder Architecture

We denote EQ (·) and EP (·) as the question encoder and the passage encoder, respectively. We obtain the dense embeddings of question and passage as qi , pi , and apply the first token [CLS] as the embedding representation. In this paper, we take the dot product to calculate the similarity scores of questions and passages. And other representation and similarity calculation strategies are also available, which are not the focus of our paper. sim(qi , pi ) = EQ (qi ) · EP (pi ) = qi · pi

(1)

All passages in the training set are labeled as positive or negative depending on whether they contain the correct answers. By optimizing the negative loglikelihood, the passage retriever can distinguish the positive one among negative samples. +

Ldual =

− − L(qi , p+ i , pi,1 , ..., pi,n )

esim(qi ,pi ) = −log n + − esim(qi ,pi ) + j=1 esim(qi ,pi )

(2)

During inference, we apply the stored passage encoder EP (·) at the training stage to encode all passages in C and generate index vectors by FAISS [13]. In practice, the resource corpus contains millions of passages, and the encoders are all based on large PLMs with redundant layers and large number of parameters, which challenge retrieval efficiency and memory storage. For such cases, it is necessary to compress the model architecture and obtain a lightweight dense retriever. 3.3

Online Mutual Knowledge Refinement

We apply online distillation to obtain a lightweight retriever, which can speed up inference and reduce storage demand. Concretely, there are two peer net1 (·), EP1 (·)}, N et2 = works with identical structures, denoted as N et1 = {EQ 2 (·), EP2 (·)}. The encoder size affects the inference speed and storage require{EQ ments, so we compress the 12-layer BERT [16] encoders to the model with fewer layers. As proved in [1], a 6-layer transformer structure can achieve a similar result to the original BERT model. Therefore, we initialize each encoder in N et1 and N et2 with six transformer layers of different parameters, detailed in Sect. 4.1. Note that both N et1 and N et2 are based on the dual-encoder architecture described in Sect. 3.2. Hence, the optimization loss L1dual and L2dual for the original retrieval task also follow Eq. 2. N et1 and N et2 are two peer networks, which are updated simultaneously and learn from each other in training. We introduce KL divergence to make the output embeddings between N et1 and N et2 closer so that they can dynamically acquire knowledge from each other. The knowledge distillation loss from N et1 to N et2 is computed as: L1kd = KL(qi1 ||qi2 ) + KL(p1i ||p2i )

(3)

OnMKD: An Online Mutual Knowledge Distillation Framework

KL(x1i ||x2i ) =

m 

softmax(x1i /τ )log(

i=1

softmax(x1i /τ ) ), x = {q, p} softmax(x2i /τ )

725

(4)

In Eq. 4, x denotes the output embeddings of questions q and passages p in a set of m train samples, and the superscript indicates whether the representation comes from N et1 or N et2 . The hyper-parameter τ is the temperature to control the degree of distillation. Meanwhile, the distillation loss L2kd of N et2 shares the above operations. 3.4

Cross-Wise Contrastive Knowledge Fusion

In order to enrich the latent knowledge fusion, we propose a novel cross-wise contrastive loss to alternate intermediate embeddings of questions and passages. Taking the question qi as an example, we extract  all five inter and concatenate (1) (2) (5) mediate embeddings of qi , denoted as hqi = hqi ||hqi ||...||hqi ∈ R5d . Where (j)

hqi ∈ Rd is the dense representation for j-th layer of the question qi . We follow the same operation as in the predictive layer, with the first [CLS] token representing each intermediate layer. To align the dimensions of vectors, a linear map function φ is formalized as hqi ∈ R5d → hqi ∈ Rd . The summarized intermediate representations of questions and passages in N et1 denote h1qi and h1pi . Furthermore, h2qi and h2pi are the intermediate representation for questions and passages in N et2 . We optimize the cross-wise contrastive loss as follows: +

+

exp(sim(h2qi , h1pi )) exp(sim(h1qi , h2pi ) Lcrd = −log m − log m 1 2 2 1 i exp(sim(hqi , hpi )) i exp(sim(hqi , hpi ))

(5)

where sim(·, ·) represents the similarity between vectors, the same as Sect. 3.2. The superscript +/− denotes the positive and negative passages. Equation 5 alternates the representations of questions and passages in N et1 and N et2 in the set of m training samples. For the first item in the equation, we pull questions from the N et1 and positive passages from N et2 closer and push negative passages from N et2 away. Furthermore, we do the same operation with the questions from the N et2 in the second item. Training Loss. In the training process, N et1 and N et2 are updated simultaneously, and their training objectives are formulated as follows: Lθz = (1 − α1 − α2 ) · Lzdual + α1 Lzkd + α2 Lcrd , z = {1, 2}

(6)

θz denotes all learnable parameters of networks, where z indicates the model is from N et1 or N et2 . α1 ,α2 are the hyper-parameters to balance the weights of different parts.

726

4

J. Deng et al.

Experiment

4.1

Datasets and Baselines

Datasets. We apply the cleaned English Wikipedia2 by DPR [15] as the source passages for answering questions, which includes 2.1 million diversified passages. We choose three QA datasets, and the statistics of the train/dev/test set are detailed in Table 1. Natural Questions (NQ) [17] comprises actual Google search queries and answers manually annotated from Wikipedia. In WebQuestions (WebQA) [2], questions are generated by Google Suggest API, and answers are entities in Freebase. TriviaQA [14] contains real questions and answers from Wikipedia and web pages. Table 1. The statistics of three QA datasets Dataset

#Train Set #Dev Set #Test Set

NQ

58,880

8,757

3,610

WebQA

2,474

361

2,032

8,837

11,313

TriviaQA 60,413

Baselines. In this paper, we compare OnMKD with general PLMs and compressed PLMs. BERT12 represents the 12 transformer layer original BERT [16]. In addition, we extract six layers of BERT from the bottom to the top following [1], denoted as BERT6 , where the model takes one layer out of two. To verify the effectiveness of distillation, we take several distilled 6-layer models into account. DistilBERT [24] exploits knowledge distillation to BERT [16] during the pre-training stage. TinyBERT [12] presents a novel two-stage learning framework consisting of general and task-specific distillation. MiniLMv2 [29] proposes multi-head self-attention relational distillation to compress pre-trained transformers. 4.2

Experimental Settings

The training batch size is set to 16, so one question corresponds to 32 (16×2) negative passages in each batch. We simultaneously train two networks up to 40 epochs for large datasets (NQ, TriviaQA) and 100 epochs for WebQA. We use Adam optimizer with the learning rate 2e-5, and the dropout rate is set to 0.1. Additionally, the hyper-parameters of distillation temperature τ is 3, and the hyper-parameters of training loss α1 , α2 are all set to 0.1. We evaluate the performance of passage retrievers among the retrieved top k candidates with Recall and Mean Reciprocal Rank (MRR) metrics. Specifically, R@k computes the proportion of passages containing the correct answer in retrieved candidates. MRR@k represents where the answer occurs in the candidate set via computing the averaged reciprocal of the first positive example. 2

https://dumps.wikimedia.org/enwiki/latest/.

OnMKD: An Online Mutual Knowledge Distillation Framework

4.3

727

General Results

Table 2. Experimental results of baselines and OnMKD for passage retrieval on three QA test sets, where the 6-layer baselines are distilled from BERT12 . The evaluation metrics include Recall@20, Recall@100, and MRR@20. Model

#Layers NQ WebQA TriviaQA R@20 R@100 MRR@20 R@20 R@100 MRR@20 R@20 R@100 MRR@20

BERT12 BERT6 MiniLM DistilBERT TinyBERT OnMKD

12 6 6 6 6 6

81.75 77.64 75.83 78.95 70.19 80.80

87.37 85.57 84.56 86.41 80.11 87.01

60.98 57.28 56.99 59.08 46.28 59.23

72.93 69.49 67.18 71.64 70.69 73.47

81.10 78.10 77.46 80.40 79.55 81.55

49.33 45.18 43.27 47.58 46.86 48.66

79.57 76.37 75.33 77.83 71.57 78.71

85.34 82.92 82.81 83.80 80.02 84.63

61.66 58.69 58.42 59.13 49.91 59.88

As depicted in Table 2, we have the following observations: (1) OnMKD has improved the retrieval performance compared with the other 6-layer models in three metrics. Compared with the strongest 6-layer baseline DistilBERT, we obtain an improvement of R@20 consistently on three datasets(78.95% → 80.80% on NQ, 72.64% → 73.47% on WebQA and 77.83% → 78.71% on TriviaQA). (2) Compared with BERT12 , our model keeps retrieval performance stabilized while reducing parameters in half. OnMKD has improved within R@20 (+0.54%) and R@100 (+0.45%) in WebQA and a slight performance sacrifice in NQ and TriviaQA. The above results show that OnMKD can improve performance over other 6-layer models and get closer to the large model BERT12 . (3) There is a tradeoff between model size and model performance. On the three datasets, BERT12 outperforms these 6-layer baselines, indicating that the number of parameters positively affects the model performance. Table 3. Comparison of inference time and storage memory between baselines and OnMKD during Wikipedia inference process. Model

#Layers #Params Memory

Inference Time Speed-up

BERT12

12

109.5 MB 2.44 GB

822 ms

1.00×

BERT6

6

67.0 MB

421 ms

1.95×

MiniLM

1.49 GB

6

67.0 MB

1.50 GB

458 ms

1.79×

TinyBERT 6

67.0 MB

1.49 GB

428 ms

1.92×

OnMKD

67.0 MB 1.48 GB 411 ms

6

2.00×

As depicted in Table 3, the inference time represents the times consumed to generate indexes for each passage sequence. The statistics of DistilBERT are omitted since OnMKD shares the same parameters with it. Compared to heavy BERT12 , OnMKD halves the number of parameters (109.5 MB → 67.0 MB) and

728

J. Deng et al.

doubles the inference speed (822 ms →411 ms), which shows OnMKD can relieve the memory and times costs. Although all 6-layer networks show improvements in inference speed and memory size, OnMKD has the best results in storage (1.48 GB) and inference time (2.00×). BERT12 has enormous limitations on memory storage and inference time (memory: 2.44 GB; inference time: 822 ms), which makes it challenging to deploy on low-resource devices. 4.4

Ablations and Analysis

Table 4. Ablation study of different combinations of training loss and distillation schemes. These models are initialized with the same parameters (DistilBERT) and evaluated on WebQA. Distillation Schemes Training Loss

R@20 R@100 MRR@20

-

Ldual

71.64

80.40

47.58

offline

Ldual + Lkd 72.72 Ldual + Lcrd 72.89 Ldual + Lkd + Lcrd 73.24

81.36 81.42 81.48

48.50 48.55 48.66

online

Ldual + Lkd 72.76 81.45 Ldual + Lcrd 73.08 81.48 Ldual + Lkd + Lcrd 73.47 81.55

48.54 48.64 48.66

In this section, we conduct extensive experiments to verify the effectiveness of the online distillation scheme and the cross-wise contrastive loss. For the offline distillation scheme, we utilize a pre-trained 6-layer model as the teacher to distill a 6-layer student model, which are initialized with BERT6 and DistilBERT, respectively. We combine three losses on WebQA: the dual loss Ldual , the distillation loss Lkd between the peer networks, and the contrastive loss of the intermediate layers Lcrd . We compare the performance of networks initialized with DistilBERT parameters under the different loss combination modes on the test set of WebQA, as shown in Table 4. Moreover, Fig. 3 depicts the trend of retrieval accuracy on the training set and the development set with different loss combinations. Online Knowledge Distillation. As seen from left side of Fig. 3, the purple line with only added online distillation loss Lkd outperforms the base network (green) throughout training. In Table 4, the results obtained in online distillation are higher than those in offline distillation, especially the best results obtained in the sum of the three losses, R@20 (73.47%), R@100 (81.55%) and MRR@20 (48.66%). Furthermore, the models under both offline and online distillation perform better than the network without distillation, indicating distillation can obtain additional knowledge. Cross-wise Contrastive Loss. In Fig. 3, we observe that the orange line with Lcrd has higher accuracy than the green with Lkd , which demonstrates the

OnMKD: An Online Mutual Knowledge Distillation Framework

729

Fig. 3. The accuracy trend of different loss combinations in WebQA. The left side is in the train set, and the right is in the dev set. We conduct experiments combining three losses (dual-encoder loss, knowledge distillation loss, and cross-wise contrastive loss).

effectiveness of cross-wise contrastive loss. Moreover, offline distillation (blue) performs better early in the dev set, but online distillation (red and orange) gradually catches up and surpasses it as training progresses. Offline distillation initially provides professional knowledge for the student, while online distillation can continuously learn from each other and improve together during training. From Table 4, the network with the combination of Ldual +Lkd +Lcrd achieves the best performance on all three metrics, which represents the online distillation and cross-wise contrastive learning can provide each other with additional knowledge to help improve retrieval performance.

5

Conclusion

This paper proposes an online mutual knowledge distillation framework (OnMKD) to generate a lightweight passage retriever by updating two peer networks simultaneously. To further gain latent knowledge from intermediate layers, we utilize cross-wise contrastive learning to alternately fuse the representation of questions and passages in the networks. Experimental results show that OnMKD performs better than other small baselines with the same magnitude. Compared to the large-scale PLMs, OnMKD achieves approximate performance and further reduces memory and time costs.

References 1. Aguilar, G., Ling, Y., Zhang, Y., Yao, B., Fan, X., Guo, C.: Knowledge distillation from internal representations. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7350–7357 (2020)

730

J. Deng et al.

2. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on freebase from question-answer pairs. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544 (2013) 3. Chen, D., Mei, J.P., Wang, C., Feng, Y., Chen, C.: Online knowledge distillation with diverse peers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3430–3437 (2020) 4. Chen, D., et al.: Cross-layer distillation with semantic calibration. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7028–7036 (2021) 5. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020) 6. Gao, L., Callan, J.: Unsupervised corpus aware language model pre-training for dense passage retrieval. arXiv preprint arXiv:2108.05540 (2021) 7. Gao, T., Yao, X., Chen, D.: Simcse: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021) 8. Guo, Q., et al.: Online knowledge distillation via collaborative learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11020–11029 (2020) 9. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval augmented language model pre-training. In: International Conference on Machine Learning, pp. 3929– 3938. PMLR (2020) 10. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020) 11. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. Statistic 1050, 9 (2015) 12. Jiao, X., et al.: Tinybert: distilling bert for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174 (2020) 13. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans. Big Data 7(3), 535–547 (2019) 14. Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611 (2017) 15. Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781 (2020) 16. Kenton, J.D.M.W.C., Toutanova, L.K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019) 17. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 453–466 (2019) 18. Nogueira, R., Yang, W., Cho, K., Lin, J.: Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424 (2019) 19. Oğuz, B., et al.: Domain-matched pre-training tasks for dense retrieval. arXiv preprint arXiv:2107.13602 (2021) 20. Qu, Y., et al.: Rocketqa: an optimized training approach to dense passage retrieval for open-domain question answering. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847 (2021)

OnMKD: An Online Mutual Knowledge Distillation Framework

731

21. Ren, R., et al.: Pair: leveraging passage-centric similarity relation for improving dense passage retrieval. arXiv preprint arXiv:2108.06027 (2021) 22. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014) 23. Sachan, D.S., et al.: End-to-end training of neural retrievers for open-domain question answering. arXiv preprint arXiv:2101.00408 (2021) 24. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019) 25. Sekulić, I., Soleimani, A., Aliannejadi, M., Crestani, F.: Longformer for ms marco document re-ranking task. arXiv preprint arXiv:2009.09392 (2020) 26. Soleimani, A., Monz, C., Worring, M.: BERT for evidence retrieval and claim verification. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 359– 366. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_45 27. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for bert model compression. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4323–4332 (2019) 28. Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C., Mittal, A.: The fact extraction and verification (fever) shared task. EMNLP 2018, 1 (2018) 29. Wang, W., Bao, H., Huang, S., Dong, L., Wei, F.: Minilmv2: multi-head selfattention relation distillation for compressing pretrained transformers. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2140–2151 (2021) 30. Wang, X., Fu, T., Liao, S., Wang, S., Lei, Z., Mei, T.: Exclusivity-consistency regularized knowledge distillation for face recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 325–342. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0_20 31. Wu, B., Zhang, Z., Wang, J., Zhao, H.: Sentence-aware contrastive learning for open-domain passage retrieval. arXiv preprint arXiv:2110.07524 (2021) 32. Wu, Z., Wang, S., Gu, J., Khabsa, M., Sun, F., Ma, H.: Clear: Contrastive learning for sentence representation. arXiv preprint arXiv:2012.15466 (2020) 33. Xiong, L., et al.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020) 34. Zhang, S., Liang, Y., Gong, M., Jiang, D., Duan, N.: Multi-view document representation learning for open-domain dense retrieval. arXiv preprint arXiv:2203.08372 (2022) 35. Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328 (2018)

Unsupervised Clustering for Negative Sampling to Optimize Open-Domain Question Answering Retrieval Feiqing Zhuang, Conghui Zhu(B) , and Tiejun Zhao Harbin Institute of Technology, Harbin, China [email protected]

Abstract. Open-domain question answering(ODQA), as a rising question answering task, has attracted attention of many researchers due to its large number of information sources from various fields and can be applied in search engines and intelligent robots. ODQA relies heavily on the information retrieval task. Previous research has mostly focused on the accuracy of open-domain retrieval. However, in practical applications, the convergence speed of ODQA retrieval task training is also important because it affects the generalization ability of ODQA retrieval tasks on new datasets. This paper proposes an unsupervised clustering negative sampling method to improve the convergence speed and retrieval performance of the model by changing the distribution of negative samples in contrastive learning. Experiments show that, the method improves the convergence speed of the model and achieves 5.3% and 2.2% higher performance on two classic open-domain question answering datasets compared to the random negative sampling baseline model. At the same time, the gap statistics method is introduced to find the most suitable number of clusters for open-domain question answering retrieval tasks, reducing the difficulty of using the method. Keywords: ODQA · Contrastive Learning · Negative sampling · Convergence Speed

1 Instruction The open-domain question answering (ODQA) task is taking an important part of search engines and intelligent robots [1]. It differs from other question answering tasks in that its corpus contains tens of millions of knowledge from various fields, making it highly practicable in filed like search engines and intelligent robots. In addition, due to the large scale of the open-domain question answering system’s corpus, it relies heavily on text retrieval capabilities to find the information most relevant to the question, if reducing ODQA to machine reading, a huge performance degradation is often observed in practice1 , which indicates the need of improving retrieval [10]. Many excellent papers have emerged in improving the performance of retrieval tasks in open-domain question answering, such as DrQA [9] and R3 [24] based on traditional term-based method 1 In Yang et al.[22] experiment, the exact match score drop from 80% to 40%.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 732–743, 2023. https://doi.org/10.1007/978-3-031-44696-2_57

Unsupervised Clustering for Negative Sampling

733

like BM25 [3] and TF-IDF [2]. After the emergence of pre-trained models (BERT [4], Roberta [5], Ernie [6], etc.), more effective dense vector retrieval models such as DPR [10], DAR [23], and DenSPI [14] appeared one after another to further improve the performance of ODQA retrieval and they were trained using simple random negative sampling contrastive learning. However, these studies have not taken into account that during negative sampling, if the similarity between the negative samples and the question is low, it can hinder the model’s convergence speed, which we will carry out theoretical derivation in Sect. 0. And study in [16] shows convergence speed will influence the performance of contrastive learning. In summary, the low similarity between questions and negative samples will not only extend the training time but also influence the performance of ODQA retrieval task. Therefore, this paper proposes an unsupervised clustering negative sampling method, KNPR, K-means Negative Passages Retriever, to improve the convergence speed and performance of the model by improving the similarity between negative sampling and questions, see Fig. 1. Through changing the negative distribution this method can effectively improve the convergence speed and performance of the model for ODQA datasets, and by introducing the gap statistics method to find the most suitable number of clusters for open-domain question answering retrieval task, it can reduce the difficulty of using the method.

Fig. 1. KNPR changed the similarity distribution of questions and negative samples within the batch (batch size = 256). (a) Negative contrastive learning was not performed, and (b) The KNPR method was adopted.

2 Related Work Open-domain question and answering (ODQA) takes a question, retrieves evidence from a large corpus, and finds an answer based on that evidence [8]. Early question answering systems built a complex pipeline model with many components, including query understanding, candidate recall, and answer extraction. In recent years, motivated by the reading comprehension task, Chen et al. [9] proposed a two-stage approach to open-domain question answering systems, dividing open-domain question answering into Retriever and Reader, the first stage using an information retrieval model (Retriever) to select the N (N  |C|) most informative paragraphs from the Corpus C that can answer

734

F. Zhuang et al.

the user’s question and then using a reading comprehension model (Reader) to obtain the final result. The reader task can be divided into extractive readers (such as DPR [10] and BPR [11]) and generative readers (such as FiD [12] and KG-FiD [13]). Since the research content of this paper is more focused on the retriever, the related work section mainly analyzes the development history of the retriever. As a prerequisite task, the accuracy of the retriever will greatly affect the final effect of the ODQA system. Although traditional inverted index-based retrievers (such as BM25 [3] and TF-IDF [2]) have already built a strong baseline model for retrieval, due to the disadvantages of term-based sparse representations that cannot accurately represent the semantic information of text, such as polysemy and synonyms, its ability to match questions and information paragraphs is somewhat limited. With the emergence of pre-trained language models such as BERT and Roberta, the retriever has shifted to a dual-encoder structure based on dense vector representations. Due to the strong semantic understanding ability of pre-trained language models, dual-encoder models have become a recent research focus (such as DPR [12], which is the current strong baseline model for ODQA tasks and DenSPI [14], which enhances the retrieval ability of the retriever, and IRRR [15] for retrieving complex questions). The dual-encoder uses an end-to-end method to encode questions and passages into dense vector representations separately and then uses the inner product or cosine distance between the question representation and passages representation as a similarity for information retrieval. Like other dual-encoder models, in-batch negative sampling is used for training dense retrievers. However, the similarity between questions and random negative examples within each batch is low, and small similarity will affect the convergence speed of model training. Research by Xiong et al. [16] shows that using simple random negative contrastive learning methods will affect the effect of contrastive learning, thereby affecting the effect of the retriever. Similar to our work, ANCE [16] using an asynchronous method to rebuild the index from the previous checkpoint and take passages that are similar to the question but not correct evidence as negative samples, which optimizes inbatch negatives. However, this method doesn’t pay attention to the relationship between the distribution of negative samples and convergence speed and is very expensive to rebuild the index and requires a lot of time and memory space.

3 Problem Analysis and Method 3.1 Relationship Between the Negative Sampling and Convergence Speed KNPR is a dense retriever based on a dual-encoder. During the training process of the dual-encoder, questions and passages are densely encoded separately, specifically by inputting questions and passages into two different encoders. After obtaining the dense embedding of the question and passages, the inner product is used as the similarity between questions and passages, as shown in Eq. 1, where EQ is the encoder for the question and EP is the encoder for the passage. Then the negative log-likelihood loss function is used as shown in Eq. 2, where qi represents the question, pi+ represents − represents the negative example for that the positive example for that question, and pi,j question. In prediction, all passage dense vectors are indexed offline using Faiss2 , an 2 https://github.com/facebookresearch/faiss.

Unsupervised Clustering for Negative Sampling

735

Fig. 2. The comparison between KNPR and simple rondom sampling in-batch negative, where (a) (b) (d) indicate that KNPR first performs K-means clustering on the questions, and then selects the batch within the same cluster for in-batch negative sampling; (c)(d) indicates the traditional in-batch negative process. The graph in (d) shows that after using the KNPR method, the distance between negative samples and questions is closer.

open-source tool based on approximate nearest neighbor algorithms. When the user inputs a question, its dense representation is calculated using a trained question encoder, and then the index is used to find the N most relevant passages to this question. sim(q, p) = EQ (q)T EP (p)   − − = − log , . . . , pi,n Lneg qi , pi+ , pi,1

(1) esim(qi ,pi )   − ) +nj=1 esim qi ,pi,j +

(

+ esim qi ,pi

(2)

In-Batch Negative. Since in-batch negative can efficiently calculate similarity, this method will also be used when training KNPR. See in Fig. 2, assuming there are B Query-Passage pairs, Q and P are the representations of the dual-encoder for questions and passages, respectively, Q ∈ RB∗d , P ∈ RB∗d , and d is the dimension of the dense representations encoded by the encoder. S = QP T ∈ RB∗B is a scoring matrix, where each row represents the score of a question for B passages. In this matrix, when i = j, it is the score of the question for the positive sample, and the rest are scores for negative samples. Thus, through an efficient vector multiplication operation, we have calculated the scores of B2 samples. This method uses the positive samples of other questions in the batch as negative samples for this question, so each question is trained against 1 positive sample and B-1 negative sample. This method quickly and effectively calculates the similarity of samples and is suitable for training tasks of ODQA retrievers. Analysis of Convergence Speed. As in Fig. 1, when performing simple random inbatch negative sampling, the distance between negative samples and questions is far.

736

F. Zhuang et al.

Assuming the optimizer is stochastic gradient decent, a stochastic gradient decent (SGD) step with importance sampling [17] is Eq. 3 and Eq. 4, where pd − the sampling probability of negative instance d − , θt the parameter at t-th step, θt+1 the one after, and N the total number of negatives, the scaling factor Np1 − is to make sure Eq. 3 is an unbiased d estimator of the full gradient, ∇φL Lneg 2 is the gradient w.r.t. the last layer. The farther the distance between negative samples and questions, the smaller the loss will be. When performing backpropagation, the update will be slower. Intuitively, negative samples with near zero loss have near zero gradients and contribute little to model convergence. Therefore, the current similarity distribution within the batch will affect the convergence speed of the model. Research in [16] shows that the current negative sample sampling method will result in low information in negative samples and thus affect the effect of contrastive learning. To prove that the negative contrastive learning method based on K-means clustering can effectively improve the effect of contrastive learning, this paper introduces metrics for evaluating the effect of contrastive learning, alignment and uniformity. θt+1 = θt − η Np1 − ∇θt Lneg

(3)

    sim q, p− → 0 ⇒ Lneg → 0 ⇒ ∇φL Lneg 2 → 0 ⇒ ∇θt Lneg → 0 ⇒ θt+1 − θt → ε

(4)

d

3.2 Unsupervised Clustering for Negative Sampling In response to the above problem, this article proposes a negative contrast learning method based on unsupervised K-means clustering (KNPR) to train the retriever. The functions are shown in Fig. 2, that is, first perform K-means clustering on the questions in the training dataset, and then select the questions in the same cluster after clustering when constructing the batch, performing in-batch negative sampling to reduce the distance between the question and the negative samples. From the perspective of convergence speed, as the distance between the questions and the negative sample decreases and their similarity increases, the loss will also increase, thereby increasing the update speed and making it more likely for the model to reach convergence. From the perspective of contrast learning quality, increasing the similarity between negative samples and question will increase the difficulty of contrastive learning. Contrastive learning will pay more attention to those negative samples that have not been learned well. Therefore, the model needs to truly learn the distribution of data to increase the distance between negative samples and questions. As a result, we believe that this method can effectively improve convergence speed and enhance contrastive learning effects. Since K-means clustering is an unsupervised model, the value of K will greatly affect the clustering performance. We explore the effect of different K on KNPR and introduce the Gap Statistics method to obtain the optimal K for clustering questions. Its calculation Eq. 5 is shown, where Dk is the sum of the distances from m samples used for K-means clustering to their respective cluster centers, and E(logDk ) is the expected value of logDk after n times of K-means clustering with m samples satisfying a uniform

Unsupervised Clustering for Negative Sampling

737

distribution. The best K clustering represents that the variance between each cluster is small when K is as small as possible so that Gap(K) can be as small as possible. Gap(K) = E(logDk ) − logDk Dk = Sum(Dk1 , Dk2 , . . . , Dkk )

(5)

4 Experiment Setting 4.1 Encoder Referring to the work of DPR, we use two BERT-base-uncased and Roberta-base models as encoders for questions and passages respectively. After inputting the questions or passages into the pre-train model, we use the output of [CLS] as their dense vector embeddings. 4.2 Reader We use an extractive model to complete the subsequent reading comprehension task. In this model, we calculate the scores of passages and spans respectively, as shown in Eq. 6, Pstart,s (s) = softmax(Pi wstart )s Pend ,i (s) = softmax(P  Ti wend )s  Pselected (i) = softmax P wselected 

(6)

i

RL∗h

is the vector encoding of the i-th passage by BERT, L is the length of where Pi ∈ the input, h is the dimension of the vector, P = [P1cls , . . . , Pkcls ] ∈ Rh∗k represents the matrix formed by the encoding of the [CLS] position of the top-k passages obtained, wstart , wend , wselected ∈ Rh . Scoring the start and end positions of the answer span and adding the score of the passage at the same time determines the specific final answer span. 

4.3 Datasets and Metrics Corpus. Referring to the work of Chen et al. [9], the English Wikipedia on December 20, 2018 was used as the source of the corpus. After data washing and removing semistructured data and other content, the information source was divided into equal-length. paragraphs with a length of 100 words as the smallest unit for retrieval. In the end, we obtained 21,015,324 passages. Datasets. We used two ODQA datasets for experiments. Natural Questions(NQ) [18] is composed of real questions raised by Google search users and answers manually annotated from Wikipedia articles. The questions in TriviaQA [19] were written by people interested in knowledge question answering and covered multiple domains and topics. The scale of the dataset is shown in Table 1, where #classification is the number of categories recommended in the paper proposing the dataset.

738

F. Zhuang et al. Table 1. Statistical information of the experimental dataset. #Classification

#Train

#Dev

#Test

#Total

Avg. Len

NQ

5

58,880

8,757

3,610

71,247

9.3

TriviaQA

6

60,413

8,837

11,313

80,563

16.9

Alignment and Uniformity. Contrastive learning is a method that uses the similarity and differences between data to learn feature representation. Specifically, it brings the distance closer to positive samples and further away from negative samples. Wang et al. [20] proposed an evaluation index for the effect of contrastive learning, alignment and uniformity. Alignment refers to the distance between the question and the positive example. The smaller the alignment, the closer the question and positive example are, and the better the model can distinguish between relevant and irrelevant data. Alignment can be calculated by Eq. 7, where ppos refers to the question-positive sample pair. The alignment value is better when it is smaller. Uniformity refers to the evenness of feature vector distribution, that is, whether they can make full use of feature space. The smaller the uniformity, the more dispersed the embedding vectors are, and the better the model can extract rich and diverse information. Uniformity can be calculated by Eq. 8, where pdata refers to all question-sample pairs. This uniformity value is also better when it is smaller.   align  f (x) − f x+ 2 E (7) (x,x+ )∼ppos uniform 

e−2f (x)−f (y) E i, i, d pdata (x,y) ∼

2

(8)

4.4 Settings During the training process, we used Adam as the optimizer with a learning rate of 1e-5, linear scheduling with warm-up, and a dropout rate of 0.1. The epoch was set to 40 and the batch size was set to 64. It took 23 h to train dual-retriever on 4 GeForce RTX™ 3090. Finally, we built an index using Faiss, which occupied 66.5GB of memory and took 18 h to build.

5 Results DPR is a dense retriever model that use simple random in-batch negative sampling for contrastive learning, while BM25 is a sparse vector-based retriever. ANCE uses asynchronous index updating for training, that is, it updates the Faiss index with the parameters of the stored checkpoint, and then indexes passage that is similar to the question but does not contain the true answer as the negative sample for that question.

Unsupervised Clustering for Negative Sampling

739

Table 2. The accuracy of the model in retrieving the top-k paragraphs on the NQ and TriviaQA datasets. BM25 is a sparse vector-based retriever, DPR is a baseline model for dense retrieval, ANCE and RocketQA are both dense retrieval models that use negative contrastive learning, and KNPR is the method proposed in this paper. Top-20 NQ

Top-100 TriviaQA

NQ

TriviaQA

BM25





59.1

66.9

73.7

76.7

DPR

64

BERT

78.4

79.4

85.4

85.0

ANCE

64

Roberta

81.9

80.3

87.5

85.3

RocketQA

512*2

Ernie

82.7



88.5



KNPR_BERT

64

BERT

81.7

80.9

86.6

86.2

KNPR_Roberta

64

Roberta

82.0

81.2

87.9

86.8

KNPR*

64*4

Roberta

83.7

82.5

89.3

87.3

RocketQA [21] changes the in-batch negative sampling method to cross-batch negative sampling. This method is to communicate with other cards under the condition of multiple cards so that the samples on other cards become negative samples for questions on this card. The disadvantage of this method is that it can only work when training on multiple cards, and it does not screen negative samples. As in Table 2, KNPR achieved good results on both NQ and TriviaQA datasets. The difference between our method and the DPR method is that the distribution of samples within our batch is changed, see Fig. 1. Therefore, it can be proved that our method of modifying the similarity distribution of negative sampling is effective. We believe that the reason why KNPR_Roberta is slightly lower than RocketQA on the NQ dataset is that RocketQA not only uses the negative contrast enhancement method but also introduces a re-ranking task. In addition, the batch used by RocketQA is 512*2, which is much larger than our 64. Retrievers based on Roberta are better than those based on BERT. We analyze that this is because Roberta used more training data during pre-training, and there is some overlap between these training data and the corpus of English Wikipedia. KNPR* uses two methods: K-means negative contrastive learning and cross-batch negative. We can see that it is better than all current methods that use contrastive learning enhancement. We also found that the model’s improvement effect on TriviaQA is not as good as NQ. This is because the average length of TriviaQA questions is higher than NQ, so we think TriviaQA is more difficult, so our method’s improvement on TriviaQA is less than on NQ. The final results of the QA task are shown in Table 3. We control all methods to use the same reader, which encodes passages with BERT. The retriever outputs the same K = 100 passages, and KNPR can achieve good results in the QA task. It proves that a better-performing retriever can effectively improve the overall effect of the QA task. Sampling Distribution. After calculating the similarity, we normalize the similarity to the 0–1 interval through the sigmoid function. As shown in Fig. 1, we set the batch size

740

F. Zhuang et al.

Table 3. Performance of different models on NQ and TriviaQA for end-to-end ODQA tasks. EM accuracy NQ

TriviaQA

BM 25

32.6

52.4

DPR

41.5

56.8

ANCE

46.0

57.5

RocketQA

46.5



KNPR*

46.9

57.8

Table 4. Using contrastive learning metrics to measure different models. alignment

uniformity

Top-20

DPR

0.313

−2.02

78.4

ANCE

0.275

−2.20

81.9

RocketQA

0.325

−2.37

82.7

KNPR*

0.250

−2.41

83.7

to 256, so the batch contains 1 positive sample and 255 negative samples. We found that after negative contrast learning based on K-means clustering, the similarity distribution between negative samples and questions changed from being concentrated in the 0–0.5 interval to being concentrated in the 0.4–0.7 interval, indicating that after using the method in this paper, the distribution between negative samples and questions changed to reduce their distance. Research on Convergence Speed. As shown in Fig. 3, we found that the KNPR method can effectively improve the convergence speed of the model. That is, the KNPR model will converge to the validation set earlier than the DPR model, which to some extent. proves that by increasing the difficulty of negative samples which is reducing the similarity between negative samples and questions, the convergence speed of the model can be accelerated. How Different Negative Sampling Methods Impact Contrastive Learning Performance. We use the evaluation metric of contrastive learning to measure the effect of different negative sampling for ODQA retrieval contrastive learning. Table 4 shows that methods such as KNPR* and ANCE that perform negative sampling for contrast learning improve the effect of contrastive learning by reducing the alignment, that is, making the model as close as possible to the positive sample. Both RocketQA and KNPR* improve the effect of contrastive learning by using cross-batch to allow more data to be trained and thus make the obtained dense embedding more uniformity. The results of Table 4 also support the view that the better the contrastive learning effect, the better the performance of the retriever.

Unsupervised Clustering for Negative Sampling

741

Fig. 3. The relationship between the number of training steps and the Top-100 accuracy of the model on the training set. KNPR reaches convergence earlier than DPR.

Fig. 4. Experiments on NQ and TriviaQA found that the value of Gap Statistics is positively correlated with the accuracy of the retriever. The left axis represents precision, and the right axis represents the value of Gap (K). Therefore, when Gap (K) reaches its maximum value, it also achieves the best K for the retrieval task.

The Best Value of K for ODQA Retrieval. Figure 4 shows that the value of K that can perform the best K-means clustering for the dataset distribution is the optimal value of K for the retrieval task. Therefore, we can use the Gap Statistics method to find the best value of K for the retrieval task and reduce the difficulty of implementing the model. We analyze that the reason why a larger K does not necessarily mean that the larger the similarity within the batch, the better for the performance of the retriever is because there is a certain degree of overlap in the positive examples of questions in the ODQA dataset. Therefore, taking data with too high similarity can lead to the phenomenon of false negatives.

6 Conclusion To solve the problem that simple random sampling for in-batch negative contrastive learning may slow down the convergence speed of training and affect the performance of the ODQA retrieval due to the low similarity between negative samples and questions,

742

F. Zhuang et al.

we propose a method of dense vector retrieval based on negative contrastive learning with unsupervised K-means clustering. To find the most suitable value of K for data distribution, we introduce the Gap statistics method reducing the difficulty of implementing the model. Our method greatly improves the convergence speed of training and the effect of dense retriever on two common ODQA datasets which is superior to other negative sampling methods for ODQA retrieval.

References 1. Chen, D., Yih, S.: ACL2020 tutorial: Open-domain Question and Answering. https://github. com/danqi/acl2020-openqa-tutorial. Accessed 5 Jul 2020 2. Juan, R.: Using tf-idf to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning. vol. 242. no. 1 (2003) 3. Stephen, R., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. In: Foundations and Trends® in Information Retrieval. vol. 3. no. 4, pp. 333–389 (2009) 4. Jacob, D., et al.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv preprint arXiv:1810.04805 5. Yinhan, L., et al.: RoBERTa: A robustly optimized BERT pretraining approach (2019). arXiv preprint arXiv:1907.11692 6. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451, Florence, Italy. Association for Computational Linguistics (2019) 7. Robert, T., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Statistical Methodology) 63(2), 411–423 (2001) 8. Adam, R., Raffel, C., Shazeer, N.: How much knowledge can you pack into the parameters of a language model? (2020). arXiv preprint arXiv:2002.08910 9. Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading wikipedia to answer open-domain questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870–1879, Vancouver, Canada. Association for Computational Linguistics (2017) 10. Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, Online. Association for Computational Linguistics (2020) 11. Yamada, I., Asai, A., Hajishirzi, H.: Efficient passage retrieval with hashing for open-domain question answering. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 979–986, Online. Association for Computational Linguistics (2021) 12. Izacard, G., Grave, E.: Leveraging passage retrieval with generative models for open domain question answering. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880, Online. Association for Computational Linguistics (2021) 13. Yu, D., et al.: KG-FiD: infusing knowledge graph in fusion-in-decoder for open-domain question answering. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4961–4974, Dublin, Ireland. Association for Computational Linguistics (2022)

Unsupervised Clustering for Negative Sampling

743

14. Seo, M., Lee, J., Kwiatkowski, T., Parikh, A., Farhadi, A., Hajishirzi, H.: Real-time opendomain question answering with dense-sparse phrase index. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4430–4441, Florence, Italy. Association for Computational Linguistics (2019) 15. Qi, P., Lee, H., Sido, T., Manning, C.: Answering open-domain questions of varying reasoning steps from text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3599–3614, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics (2021) 16. Lee, X., et al.: Approximate nearest neighbor negative contrastive learning for dense text retrieval (2020). arXiv preprint arXiv:2007.00808 17. Guillaume, A., et al.: Variance reduction in sgd by distributed importance sampling (2015). arXiv preprint arXiv:1511.06481 18. Tom, K., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 453–466 (2019) 19. Mandar, J., et al.: TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension (2017). arXiv preprint arXiv:1705.03551 20. Tongzhou, W., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning. PMLR (2020) 21. Yingqi, Q., et al.: RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering (2020). arXiv preprint arXiv:2010.08191 22. Wei, Y., et al.: End-to-end open-domain question answering with bertserini (2019). arXiv preprint arXiv:1902.01718 23. Soyeong, J., et al.: Augmenting document representations for dense retrieval with interpolation and perturbation (2022). arXiv preprint arXiv:2203.07735 24. Wang, S., Yu, M., Guo, X., et al.: R3: reinforced ranker-reader for open-domain question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

Enhancing In-Context Learning with Answer Feedback for Multi-span Question Answering Zixian Huang, Jiaying Zhou, Gengyang Xiao, and Gong Cheng(B) State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China {zixianhuang,jyzhou,gyxiao}@smail.nju.edu.cn, [email protected]

Abstract. Whereas the emergence of large language models (LLMs) like ChatGPT has exhibited impressive general performance, it still has a large gap with fully-supervised models on specific tasks such as multispan question answering. Previous researches found that in-context learning is an effective approach to exploiting LLM, by using a few task-related labeled data as demonstration examples to construct a few-shot prompt for answering new questions. A popular implementation is to concatenate a few questions and their correct answers through simple templates, informing LLM of the desired output. In this paper, we propose a novel way of employing labeled data such that it also informs LLM of some undesired output, by extending demonstration examples with feedback about answers predicted by an off-the-shelf model, e.g., correct, incorrect, or incomplete. Experiments on three multi-span question answering datasets as well as a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM’s in-context learning performance.

Keywords: ChatGPT answering

1

· In-context learning · Multi-span question

Introduction

Recently, the rise of large language models (LLMs) [5,21,22] represented by ChatGPT1 provides a new paradigm for NLP research, which can perform well using only natural language instructions rather than being trained on the target dataset. Based on LLMs, many tasks are expected to be more convenient and accessible to users with different needs, including multi-span question answering (MSQA). MSQA aims to automatically find one-to-many answers at the span level for a given question, which has attracted many in-depth research works [15,26] based on pre-trained language models (PLMs), and has broad application scenarios such as medical question answering [11,34]. 1

https://openai.com/blog/chatgpt.

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 744–756, 2023. https://doi.org/10.1007/978-3-031-44696-2_58

Enhancing In-Context Learning with Answer Feedback for MSQA

745

However, compared with PLMs fine-tuned on the complete training data, LLMs still have a large gap on difficult MSQA datsets [13] such as DROP [8,21]. To address it, in-context learning [7] is a promising approach to enhancing the capability of LLMs. The idea of in-context learning is to concatenate the test question with an analogous demonstration context to prompt LLMs to generate answers. As shown in the left half of Fig. 1, the demonstration context consists of a few task-related demonstration examples with labeled answers, which can be retrieved from the training set of the target dataset.

Fig. 1. An example of our new prompting strategy (right) compared with the conventional prompting strategy (left). Our strategy first answers the demonstration question using an off-the-shelf model (e.g., based on conventional PLMs) and records the corrected answers as feedback, and then combines demonstration examples with corrected answers to construct a prompt for LLM.

Motivation: Although existing works have designed a range of approaches for retrieving and exploiting demonstration examples [1,18,20], the common practice of constructing a demonstration context is still concatenating questions and labeled answers through simple templates. We argue that only showing demonstration questions with correct answers may not guide LLMs to think deeply about demonstration examples, e.g., lack of reflection on mistakes in problem solving, which may lead to under-utilization of the labeled answers. Our Work: In this paper, we propose to enhance in-context learning with diverse information derived from labeled answers to improve their utilization. Inspired by supervised learning which receives feedback from training loss to update model, we design a novel prompting strategy for LLM to obtain feedback information in the form of corrected answers. Specifically, as shown in the right part of Fig. 1, this strategy first answers the demonstration question using an off-the-shelf model (e.g., based on conventional

746

Z. Huang et al.

PLMs), compares its results with labeled answers, and records the corrected answers as feedback (e.g., correct, incorrect, or missing answers). Then we use both demonstration examples and corrected answers to construct an enhanced prompt for LLM. With this idea, we conducted experiments on three MSQA datasets as well as one keyphrase extraction dataset. The results show that our feedback-based prompting strategy significantly improves the capability of ChatGPT to answer multi-span questions.

2 2.1

Related Work Large Language Models

From GPT-3 [5] to the latest GPT-4 [21], the emergence of powerful LLMs in recent years has triggered new thinkings and paradigms in NLP research. LLMs perform various downstream tasks using only text instructions, have matched state-of-the-art results in many tasks including machine translation [10] and relation extraction [30], and have influenced a range of domain applications such as education [28] and medical writing [4]. Despite the great success of LLMs, studies have also reported that it still has shortcomings in specific tasks [2, 24] and has a large gap in handling difficult tasks compared with PLM-based methods [13]. In particular, question answering (QA) is a task with long-term research and is faced with various challenges. The performance of LLMs on QA has received extensive attention. Some analytical works reported that LLMs have many limitations in QA tasks, including insufficient stability [29], poor performance on newly released datasets [17], and suffering from hallucinations [2]. Based on empirical observations, some works designed methods to improve the performance of LLMs on specific QA tasks such as commonsense QA [3], open-domain QA [16], and multi-document QA [23]. However, as an important and realistic QA task, Multi-Span QA (MSQA) currently lacks dedicated research based on LLMs, whose performance on this task remains unclear. In this paper, we propose and evaluate a novel strategy for effectively adapting LLMs to the MSQA task. 2.2

In-Context Learning

With the development of LLMs, in-context learning [7] has also received extensive attention in recent years. Some research works studied it from the perspective of demonstration formatting, proposing template engineering to construct better human-written or automatically generated prompts [19,32]. Some other methods enhanced in-context learning by selecting better demonstration examples, searching for the best ordering of demonstration examples [20], or using the KNN algorithm with lexical [1] or semantic [18] features to dynamically retrieve demonstration examples for each question. The usage of labeled answers in the above methods is to append them to the question using some simple templates, which leads to potential under-utilization

Enhancing In-Context Learning with Answer Feedback for MSQA

747

of labeled answers. The work most similar to ours is [30], which feeds demonstration examples to LLM to obtain a clue about the gold labels in a given document in a relation extraction task. However, the clue generated by LLM often contains mistakes, which also causes some loss of label information, and it is very expensive to interact every demonstration example with LLM. By contrast, in this paper, we obtain answer feedback by comparing the prediction results on the demonstration example with the labeled answers, and use it to enrich in-context learning with more insightful information obtained from the corrected answers.

Reasoning Stage

Retrieval Stage Document Question

Document Question

Answer

Answer

Feedback

Document Question

Index

Indexing

Document

Question

Answer

TaskTemp

FeedbackTemp

Top K

Exercise Stage

Feedback

Exercise

Predict

ConcatTemp QA Model

Document Question

Large Language Model

Answers

Fig. 2. An overview of our prompting strategy, which includes a retrieval stage searching for relevant demonstration examples, an exercise stage for producing feedback, and a reasoning stage for in-context learning with feedback.

3

Approach

Given a question Q and a reference document D, the goal of MSQA is to generate a set of n answers A = {A1 , . . . , An }, where Ai is a span-level text that may be either present in D or absent in D. Let T = {[D1T , QT1 , AT1 ], . . .} be a set of labeled examples, i.e., the set of all the available question-document-answers triples from which demonstration examples can be selected for in-context learning, e.g., the training set of a MSQA dataset. Figure 2 gives an overview of our strategy, which includes a retrieval stage searching for relevant demonstration examples, an exercise stage for producing feedback, and a reasoning stage for in-context learning with feedback. 3.1

Retrieval Stage

We first search for a few relevant demonstration examples for test question Q from the labeled examples set T . To this end, a question index I is built for

748

Z. Huang et al.

each question QTi in T , and a retrieval module is executed to obtain the set E of top-k relevant labeled examples: I = Index(T ) E = Retriever(Q, I), where

E ⊂T ,

(1)

where Index(·) and Retriever(·, ·) are indexing and retrieval functions, respectively, and we realize them using an inverted index and BM25 in our experiments. E E E E E = {[D1E , QE 1 , A1 ], . . . , [Dk , Qk , Ak ]} is the selected demonstration examples set with size k. 3.2

Exercise Stage

Then we regard the selected demonstration examples E as exercises to predict their answers and extend them with corrected answers as feedback. The set of E predicted answers AP i for each demonstration question Qi is obtained as follows: E E AP i = QAModel(Di , Qi ),

(2)

where QAModel(·, ·) is an off-the-shelf MSQA model (e.g., a conventional MSQA P P method based on PLMs), and AP i = {A1 , . . . , Am } is the predicted answers set with size m. Next, the predicted answers set AP i is compared with the labeled answers to obtain feedback about the predicted answers. The feedback consists set AE i I of three parts: the correctly predicted set AC i , the incorrectly predicted set Ai , M C I and the unpredicted (i.e., missing) set Ai , satisfying that |Ai | + |Ai | = m and M |AC i | + |Ai | = n. 3.3

Reasoning Stage

After obtaining the answer feedback, an extended demonstration context is constructed from E and the feedback. For each demonstration example, we use , a task description template to construct demonstration context PromptDEMO i use a feedback template to construct feedback context PromptFB i , and the is constructed by concatenatexpended demonstration context PromptDEMO+ i and PromptFB ing PromptDEMO i i : E PromptDEMO = TaskTemp(DiE , QE i , Ai ) i I M = FeedbackTemp(AC PromptFB i , Ai , Ai ) i

PromptDEMO+ i

=

(3)

[PromptDEMO ; PromptFB i i ],

where TaskTemp(·, ·, ·) and FeedbackTemp(·, ·, ·) are two template filling functions. The details of the templates can be found in Table 1. For the test question Q, we construct test context using the same task description template but set the answers to an empty set: PromptTEST = TaskTemp(D, Q, ∅) .

(4)

Enhancing In-Context Learning with Answer Feedback for MSQA

749

Table 1. Prompting templates used for MSQA and Keyphrase Extraction (KE) Task

Function

Templates

MSQA & KE

FeedbackTemp (·, ·, ·)

Here are some correct answers (or present/absent keyphrases) responded by other AI model: 1. [CORRECT1 ]; 2.[CORRECT2 ]; ... Here are some incorrect answers (or present/absent keyphrases) responded by other AI model: 1. [INCORRECT1 ]; 2. [INCORRECT2 ]; ... Here are some answers (or present/absent keyphrases) missed by other AI model: 1. [MISS1 ]; 2. [MISS2 ]; ...

MSQA

TaskTemp (·, ·, ·)

Reading the passage: [DOCUMENT ] Extract spans from the above passage to answer the question: [QUESTION ] Answer as a list e.g. 1. answer1; 2. answer2 Answer: 1. [ANS1 ]; 2. [ANS2 ]; ...

ConcatTemp (·, ·)

Example 1: [DEMO CONTEXT1 ] Example 2: [DEMO CONTEXT2 ] ... Then, answer me a question like the above examples: [TEST QUESTION ]

KE

TaskTemp (·, ·, ·)

Reading the passage: [DOCUMENT ] Extract present (or Generate absent) keyphrases from the above passage: Response as a list e.g. 1. keyphrase1; 2. keyphrase2 Keyphrases: 1. [KEYPHRASE1 ]; 2. [KEYPHRASE2 ]; ...

ConcatTemp (·, ·)

Example 1:[DEMO CONTEXT1 ] Example 2: [DEMO CONTEXT2 ] ... Then, extract present (or generate absent) keyphrases like the above cases: [TEST QUESTION ]

Finally, we use a concatenation template to construct the complete prompt and feed it into LLM: , . . . }, PromptTEST ) Prompt = ConcatTemp({PromptDEMO+ i ALLM = LLM(Prompt) ,

(5)

where ConcatTemp(·, ·) is a template filling function detailed in Table 1, and ALLM is a text answer returned by LLM. Since the instruction in the prompt requires LLM to answer in the form of a list, we can easily parse the text into multiple span-level answers to the test question.

4

Experimental Setup

We refer to our approach as FBPrompt, which is available from GitHub.2 2

https://github.com/nju-websoft/FBPrompt.

750

Z. Huang et al.

4.1

Datasets

We evaluated on three MSQA datasets: MultispanQA [15], QUOREF [6], and DROP [8]. Since the test set of DROP was not public, we used its official development set as our test set. In addition, we used a keyphase extraction dataset Inspec [9], which has a similar format to MSQA, with one document input and multiple span-level outputs, but without question. Considering the experimental cost, we only randomly sampled 500 samples for evaluation from QUOREF and DROP. Table 2 shows some statistics about these datasets. Table 2. Dataset statistics. Present Labels (%) indicates the percentage of answers in MSQA datasets or percentage of keyphrases in keyphrase extraction datasets that explicitly appear in the document. Dataset

Type

# Test # Used Present Labels (%) Avg. # Answers

MultiSpanQA MSQA 653

653

100

2.89

QUOREF

MSQA 2537

500

100

1.14

DROP

MSQA 9,536

500

66.16

1.10

INSPEC

KE

500

73.58

2.48

4.2

500

Baselines

We compared FBPrompt with five popular usages of LLM as follows: Zero-Shot. Prompts LLM only using handle-written instructions without demonstration examples. Random. Sampling randomly selects k demonstration examples from the training set for each test question to construct prompt as done in [18]. BM25. Calculates lexical similarity between questions to obtain top-k relevant demonstration examples for each test question. It can be viewed as a simplified version of our FBPrompt—without using answer feedback. KATE. [18] uses KNN algorithm selecting k demonstration examples with highest semantic similarity score for each test question. We implemented it based on dense passage retrieval [12]. Label-Induced. Reasoning [30] feeds labeled answers, the question, and the document to LLM to obtain a clue about the relation between question and answers. We implemented it using the same BM25 results as our FBPrompt. 4.3

Evaluation Metrics

We evaluated on each dataset using their official metrics [6,8,15,31]. For MultiSpanQA, we used Exact Match F1 (EM) and Partial Match F1 (PM). For QUOREF and DROP, we used Exact Match Global (EMG ) and F1 score (F1). For INSPEC, we used macro-averaged F1@5 and F1@M.

Enhancing In-Context Learning with Answer Feedback for MSQA

751

Table 3. Main results on MSQA. The best results are in bold. ‡ indicates the results reported in [27]. ∗ indicates that the results are not completely comparable due to the difference in test data. MultiSpanQA QUOREF

DROP

EM

EMG

PM

EMG

F1

INSPEC F1

Present

Absent

F1@5 F1@M F1@5 F1@M

4.4

SOTA

73.13∗ 83.36∗ 80.61∗ 86.70∗ 84.86∗ 87.54∗ 0.401‡ 0.476‡ 0.030‡ 0.041‡

Zero-shot

39.47

68.14

33.60

51.07

5.81

17.25

Random

58.62

80.62

71.40

80.25

47.70

60.53

0.401 0.472 0.033 0.051

BM25

61.33

81.63

70.80

79.00

58.40

65.93

0.405 0.470 0.029 0.051

KATE

0.399 0.468 0.026 0.038

0.298‡ 0.417‡ 0.016‡ 0.030‡

60.78

81.51

73.00

79.76

50.90

60.69

Label-induced 54.56

76.99

64.40

71.96

12.63

16.47

0.115 0.135 0.009 0.013

FBPrompt

83.11

73.60

80.55

62.00

69.11

0.425 0.499 0.034 0.055

64.60

Implementation Details

We used OpenAI’s official API3 with the model gpt-3.5-turbo-0301 for all our experiments. We used the T5-base [25] model as our off-the-shelf model in FBPrompt, including an implementation based on MindSpore. For the keyphrase extraction task, we performed extraction of present keyphrases and generation of absent keyphrases in two independent steps with two slightly different instructions as show in Table 1. Unless otherwise specified, we set k = 3, i.e., FBPrompt and all the few-shot baselines used three demonstration examples.

5

Experimental Results

5.1

Comparison with Baselines

In Table 3, FBPrompt significantly outperforms previous LLM-based methods on all metrics in the four datasets. In particular, compared with BM25 which uses the same demonstration examples as ours, FBPrompt exceeds it by a large lead, thus exhibiting the performance brought by our proposed answer feedback. We also show the state-of-the-art (SOTA) results reported by other papers using fully-supervised fine-tuned models: they are [14] for MultiSpanQA, [26] for QUOREF, [33] for DROP, and [27] for INSPEC. Although the experimental results on the three MSQA datasets are not directly comparable due to inconsistent test data, it can be found that LLM-based models are still weaker than the fully-supervised models, but performs relatively well on the keyphrase extraction dataset INSPEC. FBPrompt closes the gap to SOTA on MSQA and achieves new SOTA results on the INSPEC dataset.

3

https://platform.openai.com/.

752

5.2

Z. Huang et al.

The Effectiveness of Different Feedback

We compare FBPrompt with the method using only one type of feedback to analyze whether all the three types of feedback bring benefits. The results reported in Table 4 reveal that each part of the feedback has a effect in improving the performance of LLM. In particular, using only correct answers leads to the largest loss compared with using only incorrect or missing answers, which shows that negative feedback has the largest benefit to LLM. Table 4. Effectiveness of different feedback. The best results are in bold. MultiSpanQA QUOREF

DROP

INSPEC

EM

EMG F1

Present

PM

EMG F1

Absent

F1@5 F1@M F1@5 F1@M FBPrompt

64.60 83.11

73.60 80.55 62.00 69.11 0.425 0.499 0.034 0.055

- only correct

62.70 82.75

71.40 79.69 58.40 65.60 0.401 0.463 0.027 0.046

- only incorrect 62.93 82.97

72.40 80.23 60.20 67.92 0.417 0.490 0.030 0.048

- only missing

72.80 79.75 61.20 68.80 0.416 0.480 0.027 0.046

63.48 82.90

Table 5. Comparsion with random feedback. MultiSpanQA QUOREF

DROP

INSPEC

EM

EMG F1

Present

PM

EMG F1

Absent

F1@5 F1@M F1@5 F1@M FBPrompt

64.60 83.11

Random feedback 60.95 81.40

73.60 80.55 62.00 69.11 0.425 0.499 0.034 0.055 64.40 72.11 60.72 68.03 0.415 0.482 0.029 0.047

Fig. 3. Results of FBPrompt and BM25 with different numbers of examples on four datasets.

Enhancing In-Context Learning with Answer Feedback for MSQA

5.3

753

Comparison with Random Feedback

Then, we simulate feedback by randomly generating predicted answers to observe whether the improvement of FBPrompt is really brought about by our carefully designed feedback. For the labeled answers set AE i from demonstration example E E [DiE , QE , A ], we randomly selected a number n ˆ 1 in the range [0, |Ai |}], and i i E randomly sampled nˆ1 positive answers from the labeled answers set Ai as pseudo positive predicted answers APos . Similarly, we randomly selected a number n ˆ2 E in the range [0, |AE |}], and randomly sampled n ˆ spans from the document D 2 i i Neg Pos Neg as pseudo negative predicted answers A . Then, we merged A and A as the pseudo predicted answers and executed FBPrompt to generate answers. As shown in Table 5, the performance of FBPrompt drops significantly when random feedback is used, which shows that our constructed feedback is useful. Table 6. Case Studys Demonstration Examples

Test Question

Document: Glycogen functions as one of two forms of long - term energy reserves, with the other form being triglyceride stores in adipose tissue (i.e. , body fat) . In humans, glycogen is made and stored primarily in the cells of the liver and skeletal muscle ...

Document: Aflatoxins are poisonous carcinogens that are produced by certain molds Aspergillus flavus and Aspergillus parasiticus) which grow in soil, decaying vegetation, hay, and grains ...

Question: Which two forms of energy do muscles produce ?

Question: Where are the organisms that produce aflatoxins found ?

Gold Answers: Glycogen, triglyceride

Gold Answers: soil, decaying vegetation, hay, grains

Feedback:

Baseline Predictions: Aspergillus parasiticus, Aspergillus flavus

Incorrect Answers: liver, skeletal muscle Answers Missed: Glycogen, triglyceride

FBPrompt Predictions: soil, decaying vegetation, hay, grains

5.4

Number of Demonstration Examples

We study whether FBPrompt exhibits consistent effectiveness when the number of demonstration examples varies. In Fig. 3, we report the changing trend of FBPrompt and BM25 when the number of examples changes from 1 to 4. We observe that with a varying number of examples in the four datasets, the performance of FBPrompt is consistently higher than that of BM25. Especially in the case of one-shot, FBPrompt largely outperforms BM25. 5.5

Case Study

A real case from MultiSpanQA is presented in Table 6. The left part shows an demonstration example for the test question in the right part. We can observe that the prediction of the baseline method (BM25) makes a mistake, since LLM

754

Z. Huang et al.

observes ‘produce’ in the question and directly finds answers around ‘produced’, instead of analyzing the meaning of the question thoroughly. As for FBPrompt, our off-the-shelf model also observes ‘produce’ in the question, and mistakenly finds the answers ‘liver’, ‘skeletal musc’ in the original document near ‘made’, which is semantically close to ‘produce’. But after given a feedback, LLM learns not to be confused by such specific word, and tries to understand the entire question. Therefore, FBPrompt finally generates correct answers.

6

Conclusion

In this paper, we explore the performance of LLM in multi-span question answering, finding that existing in-context learning methods under-utilize labeled answers. To alleviate this problem, we propose a novel prompting strategy called FBPrompt, which constructs and employs answer feedback from an off-the-shelf model to enhance in-context learning. Experiments on multiple datasets show that FBPrompt using answer feedback significantly improves the performance of LLM on MSQA tasks. In the future, we will deeply analyze the working principle of answer feedback, and try to integrate more useful feedback information into LLM for various tasks. Acknowledgements. This work was supported in part by the NSFC (62072224) and in part by the CAAI-Huawei MindSpore Open Fund.

References 1. Agrawal, S., Zhou, C., Lewis, M., Zettlemoyer, L., Ghazvininejad, M.: In-context examples selection for machine translation. CoRR abs/2212.02437 (2022) 2. Bang, Y., S., et al.: A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. CoRR abs/2302.04023 (2023) 3. Bian, N., Han, X., Sun, L., Lin, H., Lu, Y., He, B.: ChatGPT is a knowledgeable but inexperienced solver: an investigation of commonsense problem in large language models. CoRR abs/2303.16421 (2023) 4. Biswas, S.: ChatGPT and the future of medical writing. Radiology 307, e223312 (2023) 5. Brown, T.B., et al.: Language models are few-shot learners. In: NeurIPS 2020 (2020) 6. Dasigi, P., Liu, N.F., Marasovic, A., Smith, N.A., Gardner, M.: QUOREF: a reading comprehension dataset with questions requiring coreferential reasoning. In: EMNLP-IJCNLP 2019, pp. 5924–5931 (2019) 7. Dong, Q., et al.: A survey for in-context learning. CoRR abs/2301.00234 (2023) 8. Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., Gardner, M.: DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In: NAACL-HLT 2019, pp. 2368–2378 (2019) 9. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: EMNLP 2003, EMNLP ’03, pp. 216–223 (2003) 10. Jiao, W., Wang, W., Huang, J., Wang, X., Tu, Z.: Is ChatGPT a good translator? Yes with GPT-4 as the engine. arXiv preprint: arXiv:2301.08745 (2023)

Enhancing In-Context Learning with Answer Feedback for MSQA

755

11. Ju, Y., Wang, W., Zhang, Y., Zheng, S., Liu, K., Zhao, J.: CMQA: a dataset of conditional question answering with multiple-span answers. In: COLING 2022, pp. 1697–1707 (2022) 12. Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: EMNLP 2020, pp. 6769–6781 (2020) 13. Kocon, J., et al.: ChatGPT: jack of all trades, master of none. CoRR abs/2302.10724 (2023) 14. Lee, S., Kim, H., Kang, J.: LIQUID: a framework for list question answering dataset generation. CoRR abs/2302.01691 (2023) 15. Li, H., Tomko, M., Vasardani, M., Baldwin, T.: MultiSpanQA: a dataset for multispan question answering. In: NAACL 2022, pp. 1250–1260 (2022) 16. Li, J., Zhang, Z., Zhao, H.: Self-prompting large language models for open-domain QA. CoRR abs/2212.08635 (2022) 17. Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., Zhang, Y.: Evaluating the logical reasoning ability of ChatGPT and GPT-4. CoRR abs/2304.03439 (2023) 18. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., Chen, W.: What makes good in-context examples for GPT-3? In: DeeLIO@ACL 2022, pp. 100–114 (2022) 19. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55 (2023) 20. Lu, Y., Bartolo, M., Moore, A., Riedel, S., Stenetorp, P.: Fantastically ordered prompts and where to find them: overcoming few-shot prompt order sensitivity. In: ACL 2022, pp. 8086–8098 (2022) 21. OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023) 22. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022) 23. Pereira, J.A., do Nascimento Fidalgo, R., de Alencar Lotufo, R., Nogueira, R.F.: Visconde: multi-document QA with GPT-3 and neural reranking. In: ECIR 2023, pp. 534–543 (2023) 24. Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., Yang, D.: Is ChatGPT a general-purpose natural language processing task solver? CoRR abs/2302.06476 (2023) 25. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-totext transformer. J. Mach. Learn. Res. 21, 1–67 (2020). https://jmlr.org/papers/ v21/20-074.html 26. Segal, E., Efrat, A., Shoham, M., Globerson, A., Berant, J.: A simple and effective model for answering multi-span questions. In: EMNLP 2020, pp. 3074–3080 (2020) 27. Song, M., et al.: Is ChatGPT a good keyphrase generator? A preliminary study. CoRR abs/2303.13001 (2023) 28. Susnjak, T.: ChatGPT: the end of online exam integrity? CoRR abs/2212.09292 (2022) 29. Tan, Y., et al.: Evaluation of ChatGPT as a question answering system for answering complex questions. CoRR abs/2303.07992 (2023) 30. Wan, Z., et al.: GPT-RE: in-context learning for relation extraction using large language models. CoRR abs/2305.02105 (2023) 31. Yuan, X., et al.: One size does not fit all: generating and evaluating variable number of keyphrases. In: ACL 2020, pp. 7961–7975 (2020) 32. Zhou, Y., et al.: Large language models are human-level prompt engineers. CoRR abs/2211.01910 (2022)

756

Z. Huang et al.

33. Zhou, Y., et al.: OPERA: operation-pivoted discrete reasoning over text. In: NAACL 2022, pp. 1655–1666 (2022) 34. Zhu, M., Ahuja, A., Juan, D., Wei, W., Reddy, C.K.: Question answering with long multiple-span answers. In: EMNLP 2020, pp. 3840–3849 (2020)

Poster: Large Language Models

RAC-BERT: Character Radical Enhanced BERT for Ancient Chinese Lifan Han1 , Xin Wang1(B) , Meng Wang2 , Zhao Li1 , Heyi Zhang1 , Zirui Chen1 , and Xiaowang Zhang1 1

College of Intelligence and Computing, Tianjin University, Tianjin, China {hanlf,wangx,lizh,zhy111,zrchen,xiaowangzhang}@tju.edu.cn 2 College of Design and Innovation, Tongji University, Shanghai, China

Abstract. In recent years, Chinese pre-training language models have achieved significant improvements in the fields, such as natural language understanding (NLU) and text generation. However, most of these existing pre-trained language models focus on modern Chinese but ignore the rich semantic information embedded for Chinese characters, especially the radical information. To this end, we present RAC-BERT, a language-specific BERT model for ancient Chinese. Specifically, we propose two new radical-based pre-training tasks, which are: (1) replacing the masked tokens with random words of the same radical, that can mitigate the gap between the pre-training and fine-tuning stages; (2) predicting the radical of the masked token, not the original word, that reduces the computational effort. Extensive experiments were conducted on two ancient Chinese NLP datasets. The results show that our model significantly outperforms the state-of-the-art models on most tasks. And we conducted ablation experiments to demonstrate the effectiveness of our approach. The pre-trained model are publicly available at https:// github.com/CubeHan/RAC-BERT

Keywords: Pre-trained language model

1

· Ancient Chinese · Radical

Introduction

Pre-trained language models (PLMs), such as BERT (Bidirectional Encoder Representations from Transformers) [4], are widely used in natural language processing (NLP). Their basic structure is based on a bidirectional Transformer [15] encoder to accomplish model pre-training through two unsupervised tasks: masked language model (MLM) and next sentence prediction (NSP). After finetuning, models can capture general language patterns and handle more specific downstream language tasks. Since the release of BERT, many pre-training language models have been developed. These models focus on optimizing the structure and improving the representation of context. However, most pre-training tasks are designed for English or modern Chinese languages, and none have been dedicated to explicitly c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 759–771, 2023. https://doi.org/10.1007/978-3-031-44696-2_59

760

L. Han et al.

ancient Chinese. Since ancient Chinese are quite different from modern Chinese in terms of grammar and vocabulary, modern Chinese PLMs cannot effectively learn the tacit knowledge of ancient languages and perform poorly in handling ancient Chinese NLP tasks. Radicals, which are part of Chinese characters and originate from ancient China, carry part of the semantic information of Chinese and usually have similar semantic and grammatical usage in Chinese characters with the same radical. Most Chinese characters have radicals, which we call combined characters. Usually, the radical is a simple graphic expressing some abstract meaning. For example, the radical of “河 (river)”, “湖 (lake)” and “海 (sea)” is “氵 (water)”, which represents the meaning related to water; The Chinese characters “想 (think)”, “忘 (forget)” and “忠 (loyal)” all have radical “心 (heart)”, which expresses mental activity-related meanings. Radicals exist in both modern and ancient Chinese. The difference lies in that in ancient Chinese, radicals more directly reflect specific aspects of word meanings. However, in modern Chinese, with the expansion and evolution of semantics, the meanings of many words have become more extensive, leading to a relatively diminished directness of radicals in expressing semantic aspects. Intuitively speaking, in addition to the Chinese characters themselves, the additional semantic information carried by their radicals can improve the ability of PLMs to understand ancient Chinese. Moreover, unlike other existing work, we use the design of the pre-training task to introduce radical. In this paper, we propose a simple but effective model called RAC-BERT, which means Radical enhanced Ancient Chinese BERT. Because Chinese characters with the same radical usually exhibit similar semantic and syntactic features, we propose two radical-based pre-training tasks. The first task is radical replacement, where we replace the selected tokens with random words of the same radical, which reduces the differences between the pre-training and finetuning phases while introducing radical semantics. The second task is the radical prediction task, where we only predict the radicals of the masked tokens, significantly reducing the computational effort. We conducted extensive experiments on two ancient Chinese NLP datasets, including text classification, sentiment analysis, and named entity recognition (NER). The results show that our proposed RAC-BERT can achieve significant improvement on most tasks. Finally, we conducted detailed ablation experiments on RAC-BERT to determine the source of its effectiveness. In summary, the contributions of this paper include: (1) we propose a simple and effective pre-trained language model for ancient Chinese, RAC-BERT, which considers the information of Chinese radicals; (2) we design two radical-based pre-training tasks, which reduce the computational effort while mitigating the gap between the pre-training and fine-tuning stages; (3) experiments on two benchmark datasets for eight ancient Chinese NLP tasks show that our proposed model RAC-BERT outperforms most existing models.

RAC-BERT

2

761

Related Work

2.1

Pre-trained Language Models

In recent years, BERT [4] has been demonstrated to be very effective in many NLP tasks. BERT comprises two pre-training tasks: MLM and NSP. Following this, a series of follow-up works to improve BERT were proposed. RoBERTa [11] replaces the static mask of the vocabulary with the dynamic mask, thus improving the utilization of the training data. In addition, the model removes the NSP task in the pre-training phase. BERT-wwm [2] proposes a more appropriate whole word masking for Chinese text. ALBERT [9] significantly compresses the number of parameters by factorized embedding parameterization and cross-layer parameters. It replaces the NSP in BERT with sentence order prediction (SOP) task for learning coherence between contiguous sentences. Standing on the advance of BERT, multiple languages have also been released, such as mBERT, which has been trained on a corpus of various languages. In addition to multilingual models, specific language BERT models have also been trained for different languages, including RobBERT [3] for Dutch, CamemBERT [12] for French. [13] has compared multilingual BERT and language-specific BERT models. The primary conclusion drawn from the results is that although the existing multilingual BERT models can bring improvements to specific language baselines, using language-specific BERT models can further significantly improve the performance of various NLP tasks. 2.2

Ancient Chinese PLMs

Because of the syntactic and semantic differences between ancient and modern Chinese, several works on ancient Chinese PLMs have been proposed. CANCNBERT [7] is a ancient and modern Chinese pre-trained language model which proposes a particular optimization method through the switch mechanism for ancient Chinese and modern Chinese. SikuBERT and SikuRoBERTa [16] are PLMs constructed based on the BERT deep language model framework using “Siku Quanshu (Complete library in the Four Branches of Literature)” as the training corpus. GuwenBERT1 is based on the Continue Training technology. It combines modern Chinese RoBERTa weights and a large amount of ancient Chinese corpus to transfer some of the language features of modern Chinese to ancient Chinese. [5] implemented an automatic sentence-breaking model for ancient Chinese poems via conditional random fields and convolutional neural networks. The model proposed in the above work is only trained on the model of BERT using the ancient Chinese corpus, and there is no pre-training task designed for ancient Chinese.

1

https://github.com/ethan-yt/guwenbert.

762

2.3

L. Han et al.

Learning Structure Information

Some work on learning structural information from Chinese characters has also been proposed. BioHanBERT [17] uses component-level internal semantic information of Chinese characters to enhance the semantics of Chinese biomedical concepts and terms. There are also many early works on learning the radical structure information of Chinese characters in word embeddings [10,14]. The above works all introduce structural information, such as radicals in Chinese characters, to improve semantic representation. However, no large-scale training in the BERT-style has been carried out.

Fig. 1. An overview of our model. In radical replacement, we can see the 海 (sea) consist of 氵 (water) and 每 (every), where 氵 (water) is its radical. Then, 海 (sea) was replaced with other characters of the same radical, for example, we can combine 氵 (water) and 胡 (Hu) to form a new Chinese character 湖 (lake). In the radical prediction task, the radical of the 蜉 (mayfly) is masked, and then the model needs to predict the radical of this masked character, which is the 虫 (worm).

3 3.1

RAC-BERT Pre-training Overview

In this section, we present our proposed two radical-based self-supervised pretraining tasks to improve the feature representation of the model. The overall model is shown in Fig. 1. Firstly, on the token-level task, we propose a method named radical replacement, which replaces the [MASK] tokens with randomly Chinese characters that

RAC-BERT

763

Table 1. Examples of different masking strategies. For instance, where the character 蜉 (mayfly) has a 虫 (worm) radical, and the model replaces it with an 蚁 (ant), which is also a 虫 (worm) radical.

Original Input

Chinese

Translation in English

寄蜉蝣于天地,渺沧海之一粟。

Like a mayfly in the vastness of the universe, I am but a tiny grain in the vastness of the ocean

Radicals in the Input 宀虫虫二大土,水水水丶一米。

treasure worm worm two big soil, water water water dot one rice

Original Masking

寄[M]蝣于天地,渺沧[M]之一[M]。 Like a [M] ##fly in the vast ##ness of the universe, I am but a tiny [M] in the vast ##ness of the [M]

Radical Replacement

寄蚁蝣于天地,渺沧湖之一糙。

Like a ant in the vast ##ness of the universe, I am but a tiny lake in the vast ##ness of the rough

share the same radical component as the original token. We also propose the radical prediction task to replace the MLM task in the original BERT. For the sentence-level task, we abandon the NSP task and perform the SOP task proposed in ALBERT [9], where its positive example is consistent with the NSP task. Both are consecutive sentences, while the negative example switches the original order of two consecutive sentences. In Sect. 5, we eliminate these modifications to show the contribution of each component more clearly. 3.2

Radical Replacement

In Chinese PLMs, whole word masking are usually adopted, using word-level information for masking. The grammar of ancient Chinese and modern Chinese differs, modern Chinese words are mostly two syllables, while ancient Chinese words are mostly one syllable. In ancient Chinese, one word can usually convey the whole meaning, so we select 15% of the tokens randomly for each input sequence in character units. Our method involves randomly selecting a specified proportion of input tokens and replacing them with random Chinese characters that share the same radicals. The adoption of radical replacement over character prediction in PLMs aims to leverage structural information like radicals, enabling the capture of finer linguistic features and enhancing the model’s comprehension of Chinese text. Additionally, this method resolves the issue of [MASK] token absence during fine-tuning by avoiding the use of [MASK] tokens during the masking process, ensuring the model encounters real [MASK] tokens during training and improving its performance in real-world scenarios. In accordance with the training approach inspired by BERT, our method utilizes various masking strategies, as depicted in Table 1. Specifically, for selected tokens, 80% of them will be replaced with random words of the same radical, 10% will be replaced with a random word, and kept with original words for the rest of 10%. Note that every Chinese character has at least one character with

764

L. Han et al.

the same prefix (including itself), and if the masked token is punctuation or other, then it is masked by the original MLM. 3.3

Radical Prediction Task

Instead of performing a softmax over the full vocabulary, we propose the radical predict task, in which the model simply predicts the radicals of each masked token. We define a 275 classification task that includes 274 types of radicals and other characters that do not contain radicals (punctuation marks, spaces, etc.). In each sample, we mask 15% of the tokens, where 80% of the tokens are replaced with [MASK], 10% of the tokens are replaced with a random word and the rest of the tokens remain unchanged. Additionally, we only compute the cross-entropy loss for the masked tokens. Formally, we denote that the transformer encoder maps a sequence on input tokens x = [x1 , x2 , ..., xn ] into a sequence of contextualized vector representations h(x) = [h1 , h2 , ..., hn ]. Given a position t, where xt is masked, the probability for generating a particular token xt with a softmax layer:   T G(xt , x) = exp e (xt ) h(x)t (1) P (xt | x) = G(xt , x)/



G(xs , x)

(2)

xs

where e denotes embeddings. In radical replacement, the model first chooses a set of random positions (integers between 1 and sentence length n) to mask out m = [m1 , m2 , ..., mj ]. We define the Radical(x, m) function to mean replacing the value at position m in x with a random word of the same radical, and the function returns a vector of x after the replacement. Formally, the loss function of our model is shown as follows:  LRP (x, θ) = − log P (xi | Radical(x, m)) (3) i∈m

The final loss function of RAC-BERT are: L = LRP + LSOP

(4)

where LSOP is the standard objective of SOP task. 3.4

Pretraining Data

The data for our training model is first collected from the ancient literature of Daizhige2 , which contains a large-scale uncleaned ancient Chinese literature. We preprocess the text (removing special symbols, variant characters, etc.) and end up with a total of 1.2 B Chinese characters. 2

https://github.com/garychowcmu/daizhigev20.

RAC-BERT

3.5

765

Implementation

We trained our model using MindSpore [1] and used four HUAWEI Ascend 910 NPU to train our model. We added some new words to SikuBERT’s vocabulary, finally reaching 29,810 words as our training vocabulary. The batch size was set to 128 and the model was trained with a maximum sequence length of 512. We used the Adam [8] optimizer used in BERT to optimize the model. Automatic mixing precision and distributed training were used during pretraining, and 200 k steps were trained using radical replacement and the predict radical task, respectively, for a total of 400 k steps. Table 2. Data statistics and evaluation metrics for different fine-tuning tasks. Dataset

4

Task

Train# Dev# Test# Metrics

C-CLUE Named Entity Recognition Relation Extraction

15.9 K 3.1 K

2.2 K 0.9 K

1.1 K 0.4 K

CCLUE

26.9 K

4.1 K

39.9 K F1

26.9 K

4.1 K

39.9 K F1

160 K 16 K

20 K 2K

20 K 2K

ACC ACC

2.6 K –

0.3 K –

0.3 K 10 K

F1 ACC

Ancient Chinese Sentence Segmentation Ancient Chinese Sentence Punctuation Ancient Text Classification Ancient Text Sentiment Analysis Named Entity Recognition Retrieval

F1 F1

Experiments

4.1

Fine-Tuning Tasks

We conducted extensive experiments on eight natural language processing tasks to test the pre-trained language models thoroughly. The tasks details are shown in Table 2. Specifically, we chose two mainstream ancient Chinese datasets, i.e., C-CLUE [6] and CCLUE3 , which include a total of seven tasks. 4.2

Baselines

Several ancient and modern Chinese MLM pre-training models were selected for comparison to evaluate effectiveness of the proposed model, including BERTbase [4], RoBERTa [11], SikuBERT, SikuRoBERTa [16], CANCN-BERT [7], and GuwenBERT. Note that among the baselines mentioned above, only BERT-base 3

https://github.com/Ethan-yt/CCLUE.

766

L. Han et al.

and RoBERTa were trained using modern Chinese data, while all other training data were based on ancient Chinese. Additionally, our model was pre-trained on a smaller dataset compared to theirs. To validate the effectiveness of our approach, we also trained an MLM pretraining model using the same ancient Chinese corpus data as our model. We followed the BERT method of masking 15% of the tokens in each training instance, where 80% of the tokens were replaced with [MASK], 10% of the tokens were replaced with random words, and the remaining tokens were unchanged. We calculated the average cross-entropy loss only for the masked tokens. 4.3

Implementation

To fairly measure the performance of downstream tasks, we let all models keep the same hyperparameters for each task, and we tested learning rates from 1e-5 to 5e-5, epochs from 1 to 5, linearly warming up the first 10% of steps followed by a linear decay to 0. We set mean layer id in Retrieval from 5 to 10. We set the maximum sequence length to 10 on Ancient Text Classification, 256 on NER, 128 on RE, and 512 on all other tasks. We set the batch size of Retrieval to 128, Ancient Text Classification, Ancient Chinese Sentence Segmentation, and Ancient Chinese Sentence Punctuation to 64, and all other tasks to 32. Note that We run the same experiment ten times to ensure the reliability of results. Table 3. The performance of the different models on the NER and RE tasks in CCLUE test set. The result of precision (P), recall (R), and F1 results are reported for comparison for NER, and accuracy (ACC) results are reported for RE. NER P R

a

4.4

F1

RE Avg. ACC

BERT-base [4]

44.3 53.6 48.1 42.3

45.2

RoBERTa [11]

44.8 53.7 48.8 42.2

45.5

SikuBERT [16]

45.1 54.2 48.7 42.3

45.5

SikuRoBERTa [16]

44.8 54.1 49.0 35.5

42.3

GuwenBERT2

31.7 19.3 22.2 37.4

29.8

CANCN-BERT [7]

47.3 56.1 51.3 43.7

47.5

BERT (our reimplementation) 45.6 53.7 49.6 43.1

46.4

RAC-BERT

48.5

48.6 57.2 52.6 44.3

https://github.com/ethan-yt/guwenbert

Results

Table 3 presents the performance of our proposed RAC-BERT model on Named Entity Recognition (NER) and Relation Extraction (RE) tasks in the C-CLUE

RAC-BERT

767

dataset. For NER, RAC-BERT achieves 48.6%, 57.2%, and 52.6% in precision, recall, and F1, respectively, outperforming the previous best-performing model, CANCN-BERT, by 1.3%, 1.1%, and 3.0%, respectively. Compared with our reimplemented BERT with MLM, RAC-BERT achieves at least a 3% performance improvement in each metric. For RE, our BERT re-implementation improves the accuracy of the BERT-base model by 0.8%, while RAC-BERT further improves the accuracy to 48.5%, surpassing the previous best result of 47.5%. Table 4. Results on CCLUE test sets. ATC

ATSA Retrieval NER ACSS ACSP Avg.

ACC ACC

ACC

F1

F1

BERT-base

81.6

57.7

60.9

80.3

85.6

F1 70.8

72.8

RoBERTa

82.0

57.9

60.9

80.5

86.1

71.0

73.1

SikuBERT

82.7

58.7

72.1

83.8

87.1

76.4

76.8

SikuRoBERTa

82.5

59.9

71.0

84.4

88.3

77.2

77.2

GuwenBERT

83.8

60.4

70.5

82.2

83.5

73.6

75.7

CANCN-BERT

82.3

59.4

73.2

83.6

87.5

72.3

76.4

BERT (our reimplementation)

82.9

59.5

71.8

84.0

87.3

76.5

77.0

RAC-BERT

84.2

61.4

73.5

85.1

87.4

75.6

77.9

Table 4 summarizes the performance of our proposed RAC-BERT model on various tasks in the CCLUE dataset. RAC-BERT achieves superior performance in Ancient Text Sentiment Analysis (ATSA), Retrieval, Named Entity Recognition (NER), and Ancient Text Classification (ATC) tasks, outperforming previous state-of-the-art models. In contrast, RAC-BERT performs similarly to other models in Ancient Chinese Sentence Segmentation (ACSS) and Punctuation (ACSP) tasks, with slightly lower performance due to the model’s difficulty in understanding unpunctuated and unsegmented text. 4.5

Overall

We compared our model with seven baselines in two benchmark datasets for eight tasks and found that RAC-BERT outperformed the baselines on almost all tasks. In six of the tasks, RAC-BERT outperforms the other baselines. In the remaining two tasks, it has a slightly lower F1 than SOTA. These findings highlight the effectiveness of our proposed method, which leverages the structural information of Chinese characters to enhance the model’s ability to capture the semantic and syntactic features of the Chinese language.

5

Discussion

Although our model significantly improve on various ancient Chinese tasks, we are curious as to where the critical components of this improvement came from.

768

L. Han et al. Table 5. Ablation of RAC-BERT on different fine-tuning tasks. NER (C-CLUE) F1

RE

ACSS ACSP ATC ATSA Retrieval NER (CCLUE) ACC F1 F1 ACC ACC ACC F1

Avg.

RAC-BERT

52.6

44.3

87.4

75.6

84.2

61.4

73.5

85.1

70.5

SOP→ NSP

52.5

44.3

87.3

75.3

84.0

61.4

73.4

84.7

70.4

Radical Replacement + MLM

52.1

44.0

86.9

75.5

83.5

60.2

72.3

83.9

70.0

w/o SOP

52.3

44.1

87.0

75.4

83.4

60.9

73.0

84.6

70.1

w/o Radical Replacement

51.2

43.5

86.7

75.2

82.8

59.3

71.9

84.2

69.4

w/o Predict Radical

52.0

43.9

86.8

75.3

83.6

60.4

72.6

84.6

69.9

BERT (our reimplementation)

49.6

43.1

87.3

76.5

82.9

59.5

71.8

84.0

69.3

As a result, we conducted detailed ablation experiments on RAC-BERT to determine the source of its effectiveness. Finally, we also studied the effect of different data sizes. We would like to explore the effect of introducing radical and the SOP task. For a fair comparison, we used the same number of training steps and the same model parameters and pre-trained different models on the same dataset. The settings include “SOP→ NSP”, in which the NSP task replaces the SOP task; “Radical Replacement + MLM”, in which we use Radical Replacement and the MLM task to train BERT; “w/o SOP”, in which the SOP task is eliminated and the rest remains the same as the original model; “w/o Radical Replacement”, in which the radical replacement is eliminated and only the radical prediction task is considered; “w/o Predict Radical”, in which the predict radical task is eliminated. “BERT(our reimplementation)”, which is our reimplementation of BERT on the ancient Chinese dataset. We fine-tune the different models on the eight datasets for comparison. The results are shown in Table 5. It can be seen that by removing the SOP task or replacing the SOP task with the NSP task, the performance of the model slightly decreases or remains the same. This indicates that the SOP task has a slightly enhanced performance over the original NSP task in BERT but is far less important than the MLM-style task. Furthermore, we can see that removing either the radical replacement or the predict radical task leads to performance degradation, and the negative impact of removing the radical replacement is more significant. These experiments validate the importance of introducing radical for modeling Chinese semantics. We next explored the effect of different training sizes on the model. We selected various proportions of the original data, shuffled them, and used them to train BERT and RAC-BERT models separately. The performance of the models was evaluated on the CCLUE and C-CLUE datasets. Figure 2 illustrates the results, which demonstrate that RAC-BERT consistently outperforms BERT on

RAC-BERT

769

all training sizes for both datasets. Specifically, RAC-BERT converges faster than BERT, with a more significant performance improvement at data sizes of less than 50% and a minor improvement after that. This is because the radical-based pre-training tasks in RAC-BERT leverages the radical information to enhance its ability to capture fine-grained language features. These findings suggest that RAC-BERT can effectively learn from small-scale datasets, making it a practical solution for scenarios where large amounts of training data are not available.

Fig. 2. Performance of RAC-BERT and BERT at different training sizes on CCLUE and C-CLUE.

6

Conclusion

In this paper, we propose RAC-BERT, an ancient Chinese pre-training model improved by radical. In RAC-BERT, we replace the selected tokens with random words of the same radical, which reduces the differences between the pre-training and fine-tuning phases. Moreover, RAC-BERT proposes a method of predicting radicals of masked tokens, this approach replaces the original MLM classification method used on the entire vocabulary with the task of classifying only the number of radicals, resulting in a significant reduction in computational effort. Experimental results on two benchmark datasets for eight tasks show that our approach is effective and practical. Chinese characters with the same radical usually exhibit similar semantic and syntactic features. Our model leverages the characteristics of radicals to enhance its ability to capture fine-grained language features in Chinese, enabling it to outperform state-of-the-art results in a highly efficient manner on most tasks. In the future, we aim to investigate more effective ways to leverage radicals to further enhance the performance of ancient Chinese PLMs. Acknowledgement. This work is supported by the CAAI-Huawei MindSpore Open Fund (2022037A).

770

L. Han et al.

References 1. Chen, L.: Deep Learning and Practice with MindSpore. Springer, Cham (2021) 2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio, Speech Lang. Process. 29, 3504–3514 (2019) 3. Delobelle, P., Winters, T., Berendt, B.: RobBERT: a Dutch RoBERTa-based language model. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3255–3265 (2020) 4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology. org/N19-1423 5. Hu, R., Li, S., Zhu, Y.: Knowledge representation and sentence segmentation of ancient Chinese based on deep language models. J. Chin. Inf. Process. 35(4), 8–15 (2021) 6. Ji, Z., Shen, Y., Sun, Y., Yu, T., Wang, X.: C-CLUE: a benchmark of classical Chinese based on a crowdsourcing system for knowledge graph construction. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds.) CCKS 2021. CCIS, vol. 1466, pp. 295–301. Springer, Singapore (2021). https://doi.org/10.1007/978-98116-6471-7 24 7. Ji, Z., Wang, X., Shen, Y., Rao, G.: CANCN-BERT: a joint pre-trained language model for classical and modern Chinese. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3112–3116 (2021) 8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint: arXiv:1412.6980 (2014) 9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint: arXiv:1909.11942 (2019) 10. Li, Y., Li, W., Sun, F., Li, S.: Component-enhanced Chinese character embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 829–834 (2015) 11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint: arXiv:1907.11692 (2019) 12. Martin, L., et al.: Camembert: a tasty French language model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7203–7219 (2020) 13. Nozza, D., Bianchi, F., Hovy, D.: What the [mask]? Making sense of languagespecific BERT models. arXiv preprint: arXiv:2003.02912 (2020) 14. Sun, Y., Lin, L., Yang, N., Ji, Z., Wang, X.: Radical-enhanced Chinese character embedding. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds.) ICONIP 2014. LNCS, vol. 8835, pp. 279–286. Springer, Cham (2014). https://doi. org/10.1007/978-3-319-12640-1 34 15. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

RAC-BERT

771

16. Wang, D., et al.: Construction and application of pre-training model of “siku quanshu” oriented to digital humanities. Libr. Tribune 42(6), 31–43 (2022) 17. Wang, X., Xiong, Y., Niu, H., Yue, J., Zhu, Y., Philip, S.Y.: BioHanBERT: a Hanziaware pre-trained language model for Chinese biomedical text mining. In: 2021 IEEE International Conference on Data Mining (ICDM), pp. 1415–1420. IEEE (2021)

Neural Knowledge Bank for Pretrained Transformers Damai Dai1,2 , Wenbin Jiang2 , Qingxiu Dong1 , Yajuan Lyu2 , and Zhifang Sui1(B) 1

MOE Key Lab of Computational Linguistics, Peking University, Beijing, China {daidamai,szf}@pku.edu.cn, [email protected] 2 Baidu Inc., Beijing, China {jiangwenbin,lvyajuan}@baidu.com

Abstract. The ability of pretrained Transformers to remember factual knowledge is essential but still limited for existing models. Inspired by existing work that regards Feed-Forward Networks (FFNs) in Transformers as key-value memories, we design a Neural Knowledge Bank (NKB) and a knowledge injection strategy to introduce extra factual knowledge for pretrained Transformers. The NKB is in the form of additional knowledgeable memory slots to the FFN and the memory-like architecture makes it highly interpretable and flexible. When injecting extra knowledge with the Salient Span Masking (SSM) pretraining objective, we fix the original pretrained model and train only the NKB. This training strategy makes sure the general language modeling ability of the original pretrained model is not influenced. By mounting the NKB onto T5, we verify its strong ability to store extra factual knowledge based on three closed-book question answering datasets. Also, we prove that mounting the NKB will not degrade the general language modeling ability of T5 through two representative tasks, summarization and machine translation. Further, we analyze the interpretability of the NKB and reveal the meaning of its keys and values in a human-readable way. Finally, we show the flexibility of the NKB by directly modifying its value vectors to update the factual knowledge stored in it. The code is available at https://github.com/Hunter-DDM/nkb.

Keywords: Transformer

1

· Knowledge Enhancement · Interpretability

Introduction

Large-scale pretrained Transformers [4,6,7,17,24] are usually trained on largescale corpora, which contain oceans of factual knowledge. When facing knowledgeintensive downstream tasks such as closed-book question answering, the ability to remember factual knowledge will be essential. [21] show that pretrained models can recall factual knowledge in the training corpus in a zero-shot manner. [25] also prove D. Dai—Joint work of Peking University and Baidu Inc. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 772–783, 2023. https://doi.org/10.1007/978-3-031-44696-2_60

Neural Knowledge Bank for Pretrained Transformers

773

that after finetuning, T5 [24] can answer open-domain questions without external knowledgeable contexts. Even so, the ability of pretrained models to store factual knowledge is still limited [2,22]. In this paper, we aim to design an interpretable method to introduce extra knowledge for pretrained models. [10] point out that Feed-Forward Networks (FFNs) in Transformers work in a similar way to key-value memories. As shown in Fig. 1, in an FFN, we regard the first linear layer as a series of keys and the second linear layer as the corresponding values. The input hidden state of the FFN is fed into the first linear layer and activates a set of intermediate neurons. Then, taking these activated neurons as weights, the second linear layer integrates the corresponding value vectors through weighted sum. On top of this view, [5] further find that FFNs in pretrained Transformers store factual knowledge in a memory-like manner. Inspired by the above view, Output Hidden State …… we design a Neural Knowledge Bank (NKB) Weighted Sum nd Linear Layer 2 and a knowledge injection strategy (Values) to introduce extra factual knowl…… Intermediate edge for pretrained Transformers. The Hidden State 0.1 0.2 0 0 0.6 0.1 0 0.2 0 0 0.4 0.3 Memory …… NKB is an FFN-like module concate(Weights) Slot Activation nated after the original FFN as addi…… tional knowledgeable memory slots. In st Layer order to inject factual knowledge, we 1 Linear(Keys) first acquire a knowledgeable corpus Inner…Product … Input Hidden State from Wikipedia and then pretrain the (Query) NKB with the Salient Span Masking (SSM) [11] pretraining objective. Fig. 1. A view of the FFN as key-value Note that during knowledge injec- memory. The first linear layer serves as keys tion, we fix the original pretrained and the second as values. Each key-value model to avoid influencing its general pair forms a memory slot. The intermedilanguage modeling ability. For down- ate hidden state contains the weights used stream tasks, we can directly finetune to integrate the values. the whole model. The advantages of the NKB are reflected in three aspects: (1) The knowledge injection process for the NKB is independent of the original pretrained model, so introducing extra knowledge will not degrade the general language modeling ability of the original model. (2) The memory-like architecture of the NKB makes it highly interpretable and we can explain the meaning of its keys and values in a human-readable way. (3) The key-value architecture of the NKB has high flexibility and we can easily perform knowledge updating on the NKB by modifying its value vectors. On three closed-book question answering datasets spanning different domains, we find that mounting the NKB can boost the performance of T5, especially in the biomedical domain, which the T5 pretraining corpus does not cover much. Also, through two representative tasks, summarization and machine translation, we prove that mounting the NKB will not degrade the general language modeling ability of the original T5 model. Further, we analyze the NKB to reveal its working mechanism and present the meaning of its keys and values in a human-readable way. Finally, we show the flexibility of the NKB by directly modifying its value vectors to update stored factual knowledge.

774

D. Dai et al. Output Hidden State

Feed-Forward Network

Neural Knowledge Bank

……

Transformer Decoder Layer

Weighted Sum

2nd

Neural Knowledge Bank

……

Linear Layer (Values)

NKB Values ……

Intermediate Hidden State (Weights)

……

NKB Weights ……

……

Activation

Cross-Attention Self-Attention

……

……

1st Linear Layer (Keys)

NKB Keys Inner Product

……

……

Input Hidden State (Query)

Fig. 2. Illustration of the Neural Knowledge Bank (NKB) for a Transformer decoder layer. The NKB shares the same architecture as the FFN and is composed of d additional knowledgeable memory slots.

2

Background: Transformer

Transformer [27] is the most popular and effective NLP architecture. Taking a Transformer encoder as an example, it is stacked with L identical Transformer layers, where each layer consists of a self-attention (SelfAtt) module and a feedforward network (FFN). For an input sequence with len tokens, let X ∈ Rlen×d denote the input hidden states, we formulate these two modules as follows: Qh = XWhQ , Kh = XWhK , Vh = XWhV ,   SelfAtth (X) = Softmax Qh KhT Vh ,   FFN(H) = ActFunc HW1T W2 ,

(1) (2) (3)

where SelfAtth (·) computes the h-th attention heads; H ∈ Rlen×d is the output hidden states of self-attention, which is computed by projecting the cond catenation of all attention heads; WhQ , WhK , WhV ∈ Rd× n , W1 , W2 ∈ R4d×d are parameter matrices; and ActFunc(·) denotes the activation function. We omit the scaling factor in the self-attention module and the bias terms for simplicity. Comparing Eq. (2) and (3), we can find that the formula of FFN(·) is almost the same as that of SelfAtth (·), except that they have different nonlinear functions. Therefore, it is reasonable to view the FFN as a module with the querykey-value mechanism [5,9,10]. Specifically, the input H serves as queries, the parameters of two linear layers W1 and W2 are keys and values, respectively.

3

Method

Following [5,9,10], we also view FFNs as key-value memories. On top of this view, we design a Neural Knowledge Bank (NKB) and a knowledge injection strategy to introduce extra factual knowledge for pretrained Transformers. 3.1

Key-Value Memory View of FFN

We formulate the FFN as key-value memory like [5,10]. As illustrated in Fig. 1, we regard the input hidden state h ∈ Rd as a query, and the parameter matrices

Neural Knowledge Bank for Pretrained Transformers

775

W1 , W2 ∈ R4d×d of two linear layers as 4d keys and 4d values, respectively, where each key or value is a d-dimension vector. First, h is fed into the first linear layer. For each key vector in W1 , we compute a scalar score si through inner product: si = hT W1 [i, :],

(4)

where [·, ·] denotes the slice operation for a matrix. Then, after activation, these scores compose the intermediate hidden state h(inter) ∈ R4d of the FFN: wi = ActFunc(si ),

h(inter) = [w1 ; w2 ; ...; w4d ],

(5)

where [·; ·] denotes concatenation. Finally, taking these activated neurons as weights, we integrate the value vectors in W2 to get the output hidden state h(output) ∈ Rd of the FFN through weighted sum: h(output) =

4d 

wi W2 [i, :].

(6)

i=1

In the above formulations, we omit the bias terms. 3.2

Neural Knowledge Bank

Inspired by [5] who find that FFNs in Transformers can store factual knowledge, we design our NKB with the same architecture as an FFN. Specifically, for a  Transformer layer, we allocate two new matrices W1 , W2 ∈ Rd ×d as additional  keys and values, where d is a hyper-parameter that control the capacity of the NKB. As illustrated in Fig. 2, we mount the NKB onto a Transformer layer by concatenating W1 and W2 after W1 and W2 in the original FFN, respectively. With the NKB, the intermediate hidden state h(inter) of the FFN will be extended to (4d + d )-dimensions: si = hT W1 [i, :], (inter)

h

=

wi = ActFunc(si ),

[w1 ; w2 ; ...; w4d ; w1 ; w2 ; ...; wd  ],

(7) (8)

where wi is the weight of the i-th additional memory slot. Finally, taking the new h(inter) as weights, the value vectors in W2 and W2 are integrated as h(output) =

4d  i=1



wi W2 [i, :] +

d 

wi W2 [i, :].

(9)

i=1

As a simple extension of the FFN, the NKB is easy to implement and use. More importantly, the memory-like architecture of the NKB makes it highly interpretable and we can explain the meaning of its keys and values in a humanreadable way. Also, this architecture has high flexibility and we can easily perform knowledge updating on the NKB by directly modifying its value vectors.

776

3.3

D. Dai et al.

Knowledge Injection

In order to introduce extra knowledge, we pretrain the NKB with the Salient Span Masking (SSM) [11,25] objective. To be specific, we first acquire a knowledgeable corpus from Wikipedia using the DrQA [3] document retriever. Then, we recognize the salient spans (i.e., named entities and dates) in each text segment using the entity recognizer in spaCy. Finally, we randomly mask one of the salient spans and train the NKB parameters to reconstruct the masked span. During knowledge injection, we freeze the original parameters and update only the NKB parameters. Compared with previous work [25] that updates all parameters for knowledge injection, our training strategy can avoid the general language modeling ability of a pretrained model being degraded. In addition, since new factual knowledge is precisely injected into the NKB, locating and analyzing newly introduced knowledge will also be easier. After knowledge injection, we can directly finetune the whole model for downstream tasks.

4

Experiments

4.1

Tasks and Datasets

Following [25], we use the closed-book question answering task to evaluate the factual knowledge stored in a model. We use three datasets in this paper: two general-domain datasets, Natural Questions [13] and WebQuestions [1], and a biomedical-domain dataset, HEAD-QA [28]. We ignore related contexts and only retain the questions as input and their corresponding answers as output. Following [14,25], we filter out the examples with answers longer than five tokens. We split the filtered datasets into training and validation sets as in Table 1. In addition, in order to evaluate the general language modeling ability, we also consider two representative tasks, summarization and machine translation. They are not heavily dependent on external factual knowledge and can evaluate whether a model can model and generate languages. For summarization, we use the Xsum [19] dataset. For machine translation, we use the English-German translation data in WMT141 (WMT-En-De) and English-Romanian translation data in WMT16 (WMT-En-Ro). For these three datasets, we use the training and validation sets following their official data splits, which are shown in Table 2. 4.2

Experimental Settings

All experiments are run on NVIDIA V100 GPUs with 32 GB memory. For knowledge injection, we adopt the DrQA [3] document retriever to retrieve the top-3 Wikipedia documents related to each training question to construct the knowledgeable corpus for a dataset. After retrieval, we use the entity recognizer in spaCy2 to extract named entities and dates for salient span masking. We 1 2

https://www.statmt.org. https://spacy.io.

Neural Knowledge Bank for Pretrained Transformers

Table 1. Data splits of three filtered closed-book question answering datasets. Dataset Training Valid

Table 2. Data splits of summarization and machine translation datasets. Dataset Training Valid

Natural Questions (NQ) 74,773

3,003

Xsum

WebQuestions (WQ)

3,190

1,710

WMT-En-De 4,548,885 2,169

436

436

HEAD-QA (HQ)

777

204,045 11,332

WMT-En-Ro 610,320

1,999

mount the NKB onto the last FFN layer in the decoder of T5base [24]. The number of additional memory slots d is set to 3072, the same as the intermediate hidden dimension in T5base . When injecting knowledge, we freeze all the parameters in T5base and only update the parameters in the NKB. We use AdaFactor [26] as the optimizer and do not apply dropout. For downstream tasks, we finetune all the parameters in T5base +NKB. We tune the hyper-parameters on the validation set and report the best validation performance for each dataset. For closed-book question answering, following [25], we use AdaFactor [26] as the optimizer. For summarization and machine translation, we use AdamW [18] as the optimizer. Due to the space limit, we provide the complete hyper-parameters for knowledge injection and finetuning downstream tasks as appendices at https:// github.com/Hunter-DDM/nkb/blob/main/Appendix.pdf. 4.3

Baselines

We compare T5base +NKB with four baselines: (1) Transformer denotes a vanilla Transformer that shares the same architecture with T5base but its parameters are not pretrained. (2) T5base and T5large denote the standard T5 pretrained models with 220 M and 770 M parameters, respectively. (3) T5base +NKB-a denotes a model that shares the same architecture with T5base +NKB but all its parameters are updated during knowledge injection. 4.4

Results

Closed-Book Question Answering. The results on closed-book question answering are shown in Table 3. We use the Exact Match (EM) score as the metric, which evaluates whether the generated answer is totally the same as one of the ground-truth answers. From the table, we have the following observations: (1) The vanilla Transformer without pretraining performs extremely poorly on closed-book question answering since it is not knowledgeable at all. (2) With pretraining, T5base achieves a good EM score on the general-domain datasets (i.e., Natural Questions and WebQuestions), but on the biomedical-domain dataset HEAD-QA, it also performs poorly since biomedical texts account for only a small proportion in the T5 pretraining corpus. (3) T5large achieves better performance than T5base , but it still cannot address the biomedical-domain closedbook question answering well. (4) With only 5M more parameters, the NKB significantly boosts the performance of T5base , which approaches the performance

778

D. Dai et al.

Table 3. Exact Match (EM) scores on closed-book question answering. # Knowledgeable Params denote the parameters that are trained during knowledge injection. Model

# Params # Knowledgeable Params NQ

WQ HQ

Ave.

Transformer

220 M

N/A

0.4

2.0

0.9

T5base

220 M

N/A

26.3 29.9 3.7

T5base +NKB-a 225 M

225 M

26.9 31.9 11.0 23.3

T5base +NKB

225 M

5M

27.4 31.1 11.0 23.2

T5large

770 M

N/A

28.5 31.5 5.0

0.2

20.0

21.7

Table 4. Results on summarization and machine translation. Model

WMT-Xsum (Rouge-L) WMT-En-De (BLEU) En-Ro (BLEU) Ave.

Transformer

20.7

21.5

21.9

21.4

T5base

30.1

30.5

28.2

29.6

T5base +NKB-a 24.9

25.9

25.1

25.3

30.1

30.5

28.2

29.6

T5base +NKB

of T5large on the general-domain datasets and largely outperforms T5large on the biomedical-domain dataset. (5) Knowledge injection for only 5M NKB parameters (T5base +NKB) can achieve comparable performance with T5base +NKB-a that updates all parameters during knowledge injection. Summarization and Machine Translation. We use Rouge-L [15] as the metric for summarization and SacreBLEU [23] for machine translation. We demonstrate the results in Table 4, where the NKB is trained on the WebQuestionsrelated knowledgeable corpus. We have the following findings: (1) With pretraining, T5base achieves much better performance than the vanilla Transformer on all the datasets. (2) Although performing well on closed-book question answering, T5base +NKB-a has a poor performance on summarization and machine translation since it changes the pretrained parameters of T5base during knowledge injection. (3) For T5base +NKB, we train only the newly introduced parameters, so the general language modeling ability of T5base will not be negatively influenced. As a result, T5base +NKB can not only address knowledge-intensive tasks well, but also keep a good performance on other tasks that do not heavily rely on external factual knowledge.

5 5.1

Interpretability of NKB Value Vectors Store Entities

[9] state that value vectors in FFNs are often corresponding to human-readable concepts. Inspired by them, we analyze the NKB value vectors in the output vocabulary space and find that most of the value vectors store specific entities. Method. We randomly sample NKB value vectors from the best T5base +NKB checkpoint finetuned on WebQuestions for analysis. For the i-th value vector

Neural Knowledge Bank for Pretrained Transformers Table 5. Example NKB value vectors and their top-scoring tokens.

Date

779

Other 4% 11%

Person 11%

Value Top-Scoring Token Entity Category v71

Norway

15%

Place

v2878 Constantine

Organization

v2170 Columbus

Person

v221 1974

Date

v1581 Portuguese

Other

v1046 desert

Non-entity

Place

Non-enty

42% 17% Organizaon

Fig. 3. Distribution of top-scoring tokens.

Table 6. Common semantic patterns for example NKB key vectors. Key

Top-Triggering Questions

Common Semantic Pattern

k726

What kind of language do Egyptian speak?

Ask about the language of Egypt

What was the ancient Egyptians’ spoken language? What language is mainly spoken in Egypt? k1452

What school did Mark Zuckerberg attend?

Ask about the schools of celebrities

What college did Albert Einstein go to? What university did Obama graduate from? k2750

What style of art did Henri Matisse do?

Ask about the art genre

What type of artist is Henri Matisse? What genre of art is the Mona Lisa?

vi ∈ Rd , we first project it into the output vocabulary space and get a probability distribution pi ∈ RNvocab : pi = Softmax(Evi ), where E ∈ RNvocab ×d is the output embedding matrix. Then, for each vector vi , we manually check the token with the highest probability and annotate it as one out of six entity categories: Person, Place, Organization, Date, Other, or Non-entity. We ignore incomplete words and non-English words, and finally, we annotate 100 valid value vectors. Results. For each entity category, we show an example value vector and its topscoring token in Table 5. We also present the percentage of all top-scoring tokens in Fig. 3. We find that 85% of the sampled value vectors are corresponding to specific entities. Entity information in value vectors will be integrated into the FFN output hidden state and contribute to the prediction of the final answer. 5.2

Key Vectors Capture Input Patterns

NKB key vectors determine which memory slots to activate according to input patterns. By analyzing the memory activation, we reveal the triggering patterns. Method. We analyze the same checkpoint as in Sect. 5.1. For each question in the WebQuestions validation set, when the model is generating the first answer token, we record its NKB weights (i.e., the intermediate hidden state in the NKB). After the NKB weights for all the questions are recorded, we randomly sample 100 key vectors for analysis. For the i-th key vector ki ∈ Rd , we retain

780

D. Dai et al.

Table 7. Knowledge updating results. ↑ means higher is better. ↓ means lower is better. λ

Success Rate↑ Destruction Rate↓

0.01 19.4%

0.9%

0.03 49.3%

1.5%

0.05 86.6%

1.8%

0.07 98.5%

2.7%

0.09 100.0%

4.8%

Table 8. Examples of knowledge updating. ex denotes the output embedding of token x. By modifying one value vector, we update the model prediction to the expected one. Question

Ori. Pred.

Knowledge Surgery

New Pred.

What places make up New England?

Maine

v1112 +=λ(eEurope -eMaine )

Europe

What are the colors of the NY Giants?

Purple

v2419 +=λ(eBlue -ePurple )

Blue

Where do KFC get their chicken from?

Tennessee

v1716 +=λ(eKentucky -eTennessee )

Kentucky

5 questions that have the highest NKB weights. Finally, we manually identify human-readable semantic patterns that appear in at least 2 questions. Results. We find that 78% of the sampled keys have common semantic patterns in their top-triggering questions. We show some examples in Table 6 for an intuitive understanding of the patterns. Putting the above findings together, we summarize the working mechanism of the NKB: key vectors determine whether to activate memory slots according to input patterns, and then the activated value vectors are integrated to contribute to the answer generation.

6

Knowledge Updating for NKB

Knowledge updating is meaningful to avoid a model producing erroneous or outdated results. Based on the key-value architecture of the NKB, we can use a small knowledge surgery to directly update an answer to another one. Data. In the WebQuestions validation set, we select the questions that the model cannot solve, and aim to update the wrong answers to correct ones. We only consider the questions that have single-token ground-truth answers and predicted answers. Finally, we keep 67 examples for the following experiments. Method. For a question, we first select the memory slot with the highest NKB weight. Then, we eliminate the information of the original answer from the value vector and introduce new information about the target answer: vt += λ(etgt - eori ),

(10)

Neural Knowledge Bank for Pretrained Transformers

781

where t is the index of the highest-weight slot, λ is a hyper-parameter, eori and etgt denote the output word embeddings of the original and target answers, respectively. By modifying only one value vector (accounting for only 0.00034% of model parameters), this knowledge surgery suppresses the predicted probability of the original answer and encourages the model to generate the target answer. Metric. The success rate denotes the proportion of examples whose answers are successfully updated to expected ones. We also use a destruction rate to measure the influence of the knowledge surgery on other examples. For each knowledge surgery, we randomly sample another 5 questions and the destruction rate is the proportion of questions whose answers are changed. Results. As shown in Table 7, with a tolerable destruction rate, the success rate steadily increases as λ becomes larger. For an intuitive understanding, we also present some examples of knowledge updating in Table 8. By modifying only one value vector, we update the model prediction to the expected one.

7

Related Work

Knowledge Probing for Pretrained Models. Through the fill-in-the-blank cloze task, [12,21] find that pretrained models can recall some factual knowledge without finetuning. [8] measure the knowledge consistency of pretrained models using different prompt templates. [25] finetune T5 on the closed-book question answering task to evaluate how much knowledge is stored in parameters. [2,22] also draw some negative conclusions that factual knowledge in pretrained models is overestimated. In summary, pretrained models can store factual knowledge but to a limited extent, so enhancing them with extra knowledge is meaningful. Knowledge Enhancement for Pretrained Models. ERNIE [32] and KnowBERT [20] enhance word representations with external knowledge graphs. KEPLER [30] optimizes the objectives of masked language modeling and knowledge embedding jointly on the same model. K-adapter [29] injects knowledge into specific adapters in series while keeping the original pretrained model fixed. K-BERT [16] appends relevant triplets to the input sentence to enhance sentence encoding during finetuning. Kformer [31] retrieves knowledgeable texts during finetuning and encodes them to extend the FFN. Our NKB combines good features of the existing methods: it injects knowledge before finetuning, can keep the language modeling ability of the original model, and is highly interpretable. Understanding of FFNs. [10] build connections between FFNs and key-value memories. [9] further regard the value vectors in FFNs as sub-updates that encode human-readable concepts. [5] point out that FFNs store factual knowledge in a memory-like manner and the knowledge neurons in FFNs are positively correlated to corresponding knowledge expression. Inspired by them, we propose the FFN-like NKB to introduce factual knowledge for pretrained Transformers in an interpretable way.

782

8

D. Dai et al.

Conclusion

We propose the NKB to introduce extra knowledge for pretrained Transformers in a neural way. On closed-book question answering, we verify that the NKB can store extra factual knowledge. Also, we prove that mounting the NKB will not degrade the general language modeling ability of the original pretrained model. Further, we show the interpretability of the NKB by revealing its working mechanism and presenting the meaning of its keys and values in a human-readable way. Finally, we show the flexibility of the NKB by directly modifying its value vectors to update the factual knowledge stored in it. Acknowledgement. Damai Dai, Qingxiu Dong, and Zhifang Sui are supported by the National Key Research and Development Program of China 2020AAA0106700 and NSFC project U19A2065.

References 1. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on freebase from question-answer pairs. In: EMNLP 2013, pp. 1533–1544 (2013) 2. Cao, B., et al.: Knowledgeable or educated guess? Revisiting language models as knowledge bases. In: ACL/IJCNLP 2021, pp. 1860–1874 (2021) 3. Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to answer opendomain questions. In: ACL 2017, pp. 1870–1879 (2017) 4. Clark, K., Luong, M., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR 2020. OpenReview.net (2020) 5. Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., Wei, F.: Knowledge neurons in pretrained transformers. In: ACL 2022, pp. 8493–8502 (2022) 6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT 2019, pp. 4171– 4186. Association for Computational Linguistics (2019) 7. Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. In: NeurIPS 2019, pp. 13042–13054 (2019) 8. Elazar, Y., et al.: Measuring and improving consistency in pretrained language models. CoRR abs/2102.01017 (2021) 9. Geva, M., Caciularu, A., Wang, K.R., Goldberg, Y.: Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. CoRR abs/2203.14680 (2022) 10. Geva, M., Schuster, R., Berant, J., Levy, O.: Transformer feed-forward layers are key-value memories. CoRR abs/2012.14913 (2020) 11. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: REALM: retrieval-augmented language model pre-training. CoRR abs/2002.08909 (2020) 12. Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know? Trans. Assoc. Comput. Linguist. 8, 423–438 (2020) 13. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 452–466 (2019) 14. Lee, K., Chang, M., Toutanova, K.: Latent retrieval for weakly supervised open domain question answering. In: ACL 2019, pp. 6086–6096 (2019)

Neural Knowledge Bank for Pretrained Transformers

783

15. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches out, pp. 74–81 (2004) 16. Liu, W., et al.: K-BERT: enabling language representation with knowledge graph. In: AAAI 2020, pp. 2901–2908. AAAI Press (2020) 17. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019) 18. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. CoRR abs/1711.05101 (2017) 19. Narayan, S., Cohen, S.B., Lapata, M.: Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In: EMNLP 2018, pp. 1797–1807. Association for Computational Linguistics (2018) 20. Peters, M.E., et al.: Knowledge enhanced contextual word representations. In: EMNLP-IJCNLP 2019, pp. 43–54. Association for Computational Linguistics (2019) 21. Petroni, F., et al.: Language models as knowledge bases? In: EMNLP-IJCNLP 2019, pp. 2463–2473. Association for Computational Linguistics (2019) 22. P¨ orner, N., Waltinger, U., Sch¨ utze, H.: BERT is not a knowledge base (yet): factual knowledge vs. name-based reasoning in unsupervised QA. CoRR abs/1911.03681 (2019) 23. Post, M.: A call for clarity in reporting BLEU scores. In: Bojar, O., et al. (eds.) WMT 2018, pp. 186–191. Association for Computational Linguistics (2018) 24. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020) 25. Roberts, A., Raffel, C., Shazeer, N.: How much knowledge can you pack into the parameters of a language model? In: EMNLP 2020, pp. 5418–5426. Association for Computational Linguistics (2020) 26. Shazeer, N., Stern, M.: Adafactor: adaptive learning rates with sublinear memory cost. In: ICML 2018, pp. 4603–4611 (2018) 27. Vaswani, A., et al.: In: NeurIPS 2017, pp. 5998–6008 (2017) 28. Vilares, D., G´ omez-Rodr´ıguez, C.: HEAD-QA: a healthcare dataset for complex reasoning. In: ACL 2019, pp. 960–966 (2019) 29. Wang, R., et al.: K-adapter: infusing knowledge into pre-trained models with adapters. In: ACL-IJCNLP 2021. Findings of ACL, vol. ACL/IJCNLP 2021, pp. 1405–1418. Association for Computational Linguistics (2021) 30. Wang, X., et al.: KEPLER: a unified model for knowledge embedding and pretrained language representation. Trans. Assoc. Comput. Linguist. 9, 176–194 (2021) 31. Yao, Y., Huang, S., Zhang, N., Dong, L., Wei, F., Chen, H.: Kformer: knowledge injection in transformer feed-forward layers. CoRR abs/2201.05742 (2022) 32. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: enhanced language representation with informative entities. In: ACL 2019, pp. 1441–1451. Association for Computational Linguistics (2019)

Poster: Summarization and Generation

A Hybrid Summarization Method for Legal Judgment Documents Based on Lawformer Jingpei Dan1 , Weixuan Hu1 , Lanlin Xu1 , Yuming Wang2(B) , and Yingfei Wang3 1 College of Computer Science, Chongqing University, Chongqing, China 2 School of Electronic Information and Communications, Huazhong University of Science and

Technology, Wuhan, China [email protected] 3 Information Technology Service Center of People’s Court, Beijing, China

Abstract. Legal Judgment Summarization (LJS) is a crucial task in the field of Legal Artificial Intelligence (LegalAI) since it can improve the efficiency of case retrieval for judicial work. However, most existing LJS methods are confronted with the challenges of long text and complex structural characteristics of legal judgment documents. To address these issues, we propose a hybrid method of extractive and abstractive summarization with encoding by Lawformer to enhance the quality of LJS. In this method, by segmentation, long legal judgment documents can be shortened into three relatively short parts according to their specific structure. Furthermore, Lawformer, a new pre-trained language model for long legal documents, is applied as an encoder to deal with the long text problem. Additionally, different summarization models are applied to summarize the corresponding part in terms of its structural characteristics, and the obtained summaries of each part are integrated into a high-quality summary involving both semantic and structural information. Extensive experiments are conducted to verify the performance of our method, and the comparative results show that the summary obtained by our method outperforms all other baselines in matching with the reference summary. It is indicated that our method is effective for LJS and has prospects for LegalAI applications. Keywords: Legal Judgment Summarization · Extractive Summarization · Abstractive Summarization · Lawformer

1 Introduction Legal judgment summarization (LJS) generates summaries from legal judgment documents, reflecting facts, reasons for adjudication, and the basis of judgment in the trial process. It has practical significance and necessity for constructing the rule of law [1]. However, in the face of massive and rapidly increasing judgment documents, it consumes many human resources and is inefficient to manually compile legal judgment summaries only by legal professionals. [2] With the development of Legal Artificial Intelligence (LegalAI) [3], academia and industry expect to use automatic text summarization technology to generate summaries of judgment documents, which can assist judicial practice decision-making and legal research. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 787–798, 2023. https://doi.org/10.1007/978-3-031-44696-2_61

788

J. Dan et al.

Extractive summarization and abstractive summarization are the commonly used methods for LJS. Extractive summarization mainly uses neural network models or pretraining models to extract text units and form a summary [4–6]. The key sentences in the original text are extracted directly using extractive summarization models, which can ensure the integrity of the sentences. However, it may have the disadvantages of poor coherence and semantic loss. Abstractive summarization belongs to the field of text generation in NLP. The summary is generated by the semantic understanding of the original text using generative technology [7]. For example, Yoon et al. [8] propose a dataset of Korean legal judgment documents, and a combination of two pre-trained models is used to generate summaries of legal judgment documents. The abstractive summarization model is effective in restoring semantics. However, it is often inefficient in processing long text and occasionally suffers from duplication and factual errors. Unlike general texts such as news articles, legal judgment documents are usually lengthy, far exceeding the usual text processing tasks. Moreover, they are highly professional and contain many unstructured and semi-structured expressions. To address these issues, the researchers have also proposed some solutions. Obtaining summaries after segmenting legal judgment documents can effectively shorten the length of model input. In order to make the document meet the input length of the pre-trained model, Liu et al. [9] split a document into five critical components via phrase extraction. Then the generative pre-trained model GPT-2 [10] is adopted for text compression and integration. However, due to the diversity of cases, not every document has these five components, for example, the defendant’s defense, which may cause data loss and partition errors. In addition, combining two types of summarization models can also solve the problem of long text to a certain extent. This method extracts the key sentences of a document before generating its summary, which can also improve the quality of the final generated summary. Gao et al. [11] propose a 2-stage summarization method combined with content extraction and summary generation. The method extracts corresponding key sentences using a self-attention network first to shorten the length of input sequences. Then an abstractive summarization model based on RoBERTa-wwm and UniLM [12] is used to compress the key sentences and generate the final summary. However, the 2-stage method takes the extract summaries obtained in the first stage as the input in the second stage, resulting in cumulative errors. Besides, the final summary generated by abstractive summarization may still have problems with missing or incorrect structural information. In order to break through the limitations of the solutions mentioned above, in this paper, we propose a hybrid method based on Lawformer [13] by integrating extractive and abstractive summarization. Specifically, we first segment the legal judgment document into three relatively short parts according to its specific structure. Different from the segmentation method in [9], in order to ensure the accuracy and completeness of each part after segmentation, we segment the legal judgment document into three relatively short parts, which are case information, case description, and court judgment, according to the special structure of documents. This segmentation can ensure that each legal judgment document can be accurately segmented since they all have these three

A Hybrid Summarization Method for Legal Judgment Documents

789

parts. Meanwhile, the three parts after segmentation are suitable for subsequent input of different summarization models. For summaries extraction and generation, we adopt Lawformer, a new pre-trained language model based on large-scale legal long documents, as the encoder of our summarization models. Lawformer can better understand the semantic information of legal documents and significantly improve the processing efficiency of the model. Excepting applicability to LegalAI, Lawformer can increase the maximum input length of 512 bytes restricted by the general pre-trained model to 4196 bytes, making it possible for the summarization model to capture the long-distance dependency. Specifically, we combine a BERTSUM-based Lawformer as the encoder with a summarization-specific layer for extractive summarization. BERTSUM [14] modifies the input sequence and embedding of Lawformer to make it possible to extract multiple sentences. The summarization-specific layer, like Transformer [15] or Bi-GRU, is stacked on top of Lawformer outputs to capture document-level features for extracting summaries. Furthermore, based on encoder-decoder architecture, we propose abstractive summarization based on different decoders, for example, GRU or PGN-decoder, with Lawformer as the encoder. The fusion of Lawformer makes it more effective in capturing long-distance dependency and modeling legal semantic features than the standard Seq2Seq attentional model [16]. Moreover, the summarization models are adopted to deal with different parts after segmentation according to their structural characteristics. Unlike the 2-stage model, which is a vertical combination, our integration of extractive and abstractive summarization models is horizontal. We take different summarizations respectively for each part of the legal judgment document. The obtained summaries are integrated to form the final summary, which can ensure the integrity of semantic and structural information. To summarize, the main contributions of this paper are as follows: • A hybrid LJS method is proposed based on extractive and abstractive summarization models. Different suitable summarization model is applied to different part segmented according to specific legal judgment document structure. Thus, this hybrid method can improve the whole quality of summaries by integrating semantic and structural information. • To solve the problem of long text, firstly, a legal judgment document is shortened into relatively short parts according to its specific structure as the inputs suitable for different summarization models. Then, based on the encoder, Lawformer, both summarization models can better capture the semantic features of legal documents and process long text. • Our proposal is verified by comparative experiments on a real-world dataset CAIL20201 of Chinese legal judgment documents. The experimental results show that our model significantly outperforms the baseline model LEAD-3 by 15.79% on the mean ROUGE score, which indicates that our proposal is effective for generating high-quality legal judgment summaries.

1 http://cail.cipsc.org.cn/

790

J. Dan et al.

2 Method In this section, we describe the components of our model and each module in detail. We segment the legal judgment document into three distinct parts according to its specific structure. To generate the corresponding summaries for each part, we use a hybrid summarization method based on extractive and abstractive summarization models. The summaries are then integrated to generate the final legal judgment summary. An overview of our model is shown in Fig. 1.

Fig. 1. The framework of our model.

2.1 Structural Segmentation To address the challenge of summarizing legal judgment documents with varying structural characteristics, we use the regular matching method to segment the document according to its structure. Although the description and judgment of cases in legal judgment documents are not the same, they all share a similar structure. The first part of the document is the case information, from which we can extract the case type. The corresponding summary of this part can be obtained by a static summary template, which looks like: summary 1 = ”The plaintiff and defendant are in the relation of ” + case type

(1)

The second part is the case description, which contains the appeals of the plaintiff and the defendant, the fact description, and the court’s opinions. Due to the complexity of cases, case description cannot be extracted directly for summarization. Therefore, the abstractive summarization model is adopted to model the semantic feature of the relevant text and generate a high-quality summary. The third part is the court judgment, from which we can extract the legal basis and other structural information. To avoid generating wrong information, the extractive summarization model is adopted for this part. 2.2 Hybrid Summarization As previously stated, Lawformer is a pre-trained language model suitable for processing Chinese legal long documents, which demonstrates exceptional performance in legal

A Hybrid Summarization Method for Legal Judgment Documents

791

documents understanding and significantly enhances the overall effectiveness of the model. Moreover, it is capable of processing up to 4196 bytes of text without requiring text truncation due to text length exceeding model limits. Based on the encoder, Lawformer, we build our abstractive and extractive summarization model. An overview of the summarization models is shown in Fig. 2.

Fig. 2. The framework of the hybrid summarization model.

Abstractive Summarization. We propose an abstractive summarization model based on the encoder-decoder architecture, utilizing Lawformer as the encoder to capture semantic features from the case description part. Different decoders are employed to decode the features into summary output. We mainly use GRU, a commonly used recurrent neural network language model, and the decoder constructed by PGN [17]. PGN is an extension of the standard encoder-decoder attentional model, which introduces copy and coverage mechanisms to aid the accurate reproduction of information and keep track of what has been summarized to discourage repetition. We denote an input sequence A = [word0 , word1 , . . . , wordn ] for Lawformer Encoder Layer, where word0 is the [CLS] token, producing a sequence of encoder output hi : hi = Lawformer(A, i)

(2)

For each step t, the decoder receives the word embedding of the previous word and the hidden state from the previous decoder. It is worth noting that the previous word comes from the reference summary while training but is predicted by the decoder while testing. The encoder output of [CLS] h0 is used as the first hidden state in the decoder. The decoder obtains the decoder hidden state st at step t. We use hi , st to get the attention distribution of the i-th word in the original text of the time step: eit = vT tanh(Wh hi + Ws st + battn )

(3)

  at = softmax et

(4)

792

J. Dan et al.

where v, Wh , Ws and battn are learnable parameters. The attention distribution tells the decoder which part of the source text words to focus on when generating the next word. The resulting attention weights and hi are weighed sum to obtain the significant context vector h∗t , which can be seen as a fixed-size representation of the information summary of the entire input sequence for this step:  ait hi (5) h∗t = i

Then st and h∗t are fed through a generator layer, which contains two linear layers, to obtain the vocabulary distribution Pvocab :       Pvocab = softmax V  V st , h∗t + b + b (6)   where st , h∗t indicates the concatenation of st and h∗t , and V , V  , b, b are learnable parameters. The final distribution of the predicted word w: P(w) = Pvocab (w)

(7)

We use the cross-entropy loss function to calculate the primary loss. The primary loss in time step t is:    ˆ lossta = − (PlogP + 1 − Pˆ log(1 − P)) (8) i

where Pˆ is the ground-truth label of the predicted word. Notably, the final loss with the PGN-decoder in step t consists of coverage loss and primary loss:

  t−1 t  (9) min ait , a lossta = lossta + λ i  i

t =0

The overall loss of the abstractive summarization model for the summary is: lossa =

1 T lossta t=1 T

(10)

Extractive Summarization. We build an extractive summarization model which combines a BERTSUM-based Lawformer with a summarization-specific layer. Before summarizing, the part of court judgment is processed by a word segmentation tool to meet extractive summarization (each word obtained by word segmentation is regarded as a sentence). However, like BERT [18], Lawformer can only indicate two sentences, which cannot meet the requirement of extracting multiple sentences for a summary simultaneously. To overcome this limitation, we adopt BERTSUM, which modifies the input sequence and embedding of BERT to make it possible for extractive summarization. We combine BERTSUM with Lawformer to encode and model the semantic feature of this part. The summarization-specific layer is stacked on top of Lawformer outputs to capture document-level features for extracting summaries. Different neural networks can be used as the summarization-specific layer. In this paper, we adopt Transformer-encoder and

A Hybrid Summarization Method for Legal Judgment Documents

793

Bi-GRU. The Transformer-encoder utilizes a multi-headed self-attention mechanism, which allows for the capturing of global dependencies between the input and output, and thus the long-distance dependent information. Bi-GRU is a type of recurrent neural network that simplifies the internal structure of LSTM and enhances its performance. Let D denote a document containing sentences [sent1 , sent2 , . . . , sentn ], where senti is the i-th sentence in the court judgment. We insert a [CLS] token before each sentence and  embedding is represented as TE =  a [SEP] token after each sentence. The token E[CLS] , Esent1 , E[SEP] , . . . , E[CLS] , Esentn , E[SEP] . We assign a segment embedding EA or EB to distinguish between input sentences. The segment embedding is represented as SE = [EA , EA , . . . , EB , EB ], and the position embedding is represented as PE = [E0 , E1 , E2 , . . .]. Lawformer learns the representation of the document as follows:   W = w[CLS]i , wsenti , w[SEP]i = Lawformer(TE, SE, PE) (11)   vectors w[CLS]i from Lawformer, we take W =  After obtaining the [CLS] w[CLS]1 , w[CLS]2 , . . . , w[CLS]n as input to the summarization layer. The output of the summarization layer is represented as (12), and hi is the vector for senti :   H = (h1 , h2 , . . . , hn ) = SummazationLayer W  (12)

The final layer is a linear layer as the classifier and uses a softmax function to get the probability of whether the sentence is selected to form the summary: yi = pi0 , pi1 = softmax(W0 hi + b0 ) (13) where pi1 is the probability of selecting the sentence to form the summary, and pi0 is the opposite of it. The loss for senti is:    (ˆyi logyi + 1 − yˆ i log(1 − yi )) (14) lossie = − i

where yˆ i is the ground-truth label. The overall loss of the extractive summarization model for the summary is: 1 n losse = lossie (15) i=1 n Summary Integration. Following the hybrid summarization, we concatenate the obtained summary of each segmented part in the original document, producing a final summary that incorporates both semantic and structural information.

3 Experiments To verify the effectiveness of the proposed method, we conduct a series of experiments on a real-world dataset. In this section, we will first introduce the dataset and evaluation metrics in experiments. Then two comparative experiments are designed to evaluate the effectiveness of the Lawformer-based model for LJS and the hybrid summarization method, respectively. We compare the performance of our model with the baselines and perform an analysis of the contribution of each module in our model.

794

J. Dan et al.

3.1 Dataset and Evaluation Metric We use the LJS dataset of CAIL-2020 to evaluate the performance of our method. There are 9,484 civil judgment documents of first instance and their corresponding summaries in the dataset. We divide the dataset into 80%, 10%, and 10%, which are used for training, validation, and testing. The statistics for CAIL-2020 are shown in Table 1. Table 1. Statistics about the LJS dataset of CAIL-2020. Dataset

Amount

Len-Document Max

Train

7588

Valid Test

Min

Len-Summary Average

Max

Min

Average

13064

866

2569.38

474

66

283.14

948

8930

1014

2561.10

441

130

284.65

948

11115

1008

1567.15

381

122

283.88

To evaluate the quality of the final obtained summaries, we choose the standard ROUGE [19] metric and use the F1 value of ROUGE-1, ROUGE-2 and ROUGE-L computed by the Rouge package in Python for experimental comparison. 3.2 Baselines To verify the effectiveness of our method, we select some representative baseline models for comparison: • LEAD-32 : LEAD-3 extracts three sentences of a document as the summary. • TextRank [20]: TextRank uses sentence similarity to construct undirected weighted edges. N nodes with the highest scores are selected as summaries. • BERTSUM + Classifier [14]: A simple linear layer is added to BERTSUM outputs for extracting summaries. • UniLM [12]: Based on Transformers and pre-trained by three unsupervised language models, UniLM is suitable for both NLU and NLG. • Seq2Seq + Attention [16]: Encode by Bi-GRU and decode by GRU. The Attention mechanism improves the efficiency when the input sequence is too long. • PGN [17]: PGN introduces the copy and coverage mechanism to the Seq2Seq + attention baseline model. • BERT [18] + PGN: A 2-stage model. BERT extracts key sentences, and PGN generates the final summary. 3.3 Experiment Setting For the TextRank, we set N to 6. For BERTSUM + Classifier, we adopt the bert-basechinese checkpoint. For UniLM, we finetune on the unilm-base-chinese for 20 epochs with a batch size of 8. Due to the max length of BERT and UniLM, we directly truncate 2 https://github.com/china-ai-law-challenge/CAIL2020/tree/master/sfzy/baseline.

A Hybrid Summarization Method for Legal Judgment Documents

795

the portion of documents beyond 512 bytes. For Seq2Seq + Attention and PGN, we use a vocabulary of 5.5k words, and the word embedding size is 128, while the hidden state size is 256. For the 2-stage model BERT + PGN, we adopt the bert-base-chinese with frozen parameters to obtain each sentence’s [CLS] token. Lawformer, as the basic encoder in our model, of which the max length of the input is set as 2500. The hidden state size of Lawformer and the summarization layer is 768. The batch size during the training process is 4. We use Adam as the optimizer with an initial learning rate of 1e − 3, and the weight decay is 1e − 3. For the abstractive summarization model, we use a vocabulary of 21128 words, and the max length of generated summary is 400. 3.4 Comparative Experiment of Lawformer-Based Model for LJS In order to illustrate the effectiveness of Lawformer in our model, we run a set of experiments to compare non-Lawformer-based models with Lawformer-based models in extractive and abstractive summarization respectively. Specifically, we select the non-Lawformer-based model and replace the encoder with Lawformer to realize its corresponding Lawformer-based model. For example, BERTSUM + Classifier and Lawformer + Classifier, Seq2Seq + Attention and Lawformer-GRU + Attention, PGN and Lawformer-PGN. As illustrated in Table 2, each Lawformer-based model outperforms its corresponding non-Lawformer-based model by a large margin. Furthermore, Lawformer + Transformer for extractive summarization and Lawformer-PGN for abstractive summarization achieve the best performance on all three metrics. The Lawformer-based model has significantly improved the quality of obtained summaries, demonstrating that Lawformer specializes in capturing long-distance dependency and modeling legal semantic features. It can achieve substantial performance on LJS, which requires the model to process long legal judgment documents. 3.5 Comparative Experiment of Hybrid Summarization To verify the effectiveness of the hybrid summarization, we select the extractive summarization model and the abstractive summarization model with the highest Rouge-L in non-hybrid models for integration from Table 2, which are BERTSUM + Classifier and PGN from baseline models, Lawformer + Transformer and Lawformer-PGN from Lawformer-based models. According to the results in Table 3, compared with the single extractive or abstractive summarization model, their corresponding hybrid model achieves much better performance. More specifically, BERTSUM + Classifier and PGN outperforms the single summarization model by more than 4.70% in Rouge-L, Lawformer + Transformer and Lawformer-PGN outperforms the single summarization model by more than 1.37% in Rouge-L. Furthermore, compared with the 2-stage model, which is a vertical hybrid summarization and state-of-the-art model in baselines, our model achieves 4.78% improvements in ROUGE-L. The experimental results confirm the importance and rationality of using our hybrid summarization in LJS. Compared with models that are limited by input length, such as BERTSUM, the structural segmentation before the hybrid summarization enables our model to process the original document completely, avoiding information omission. The hybrid summarization method, when compared to either single extractive or abstractive summarization,

796

J. Dan et al. Table 2. Comparative experimental results of Lawformer-based model. Model

Rouge-1

Rouge-2

Rouge-L

LEAD-3

46.30

22.84

35.31

TextRank

54.40

28.55

40.06

Extractive Summarization Non-Lawformer-based

Lawformer-based

BERTSUM + Classifier

52.36

29.11

45.95

Lawformer + Classifier

57.59

33.26

49.93

Lawformer + GRU

57.62

34.08

50.13

Lawformer + Transformer

59.89

35.60

51.87

UniLM

49.70

24.91

43.79

Seq2Seq + Attention

48.19

24.83

44.57

PGN

54.74

29.87

46.20

Lawformer-GRU + Attention

50.44

26.72

46.11

Lawformer-PGN

55.85

31.20

48.42

Abstractive Summarization Non-Lawformer-based

Lawformer-based

Table 3. Comparative experimental results of hybrid summarization for LJS. The abbreviations Ex. And Ab. Represent the extractive summarization model and the abstractive summarization model, respectively. Model

Rouge-1

Rouge-2

Rouge-L

56.98

31.51

48.46

BERTSUM + Classifier (Ex.) PGN (Ab.)

56.87

33.28

50.90

(Our Model) Lawformer + Transformer (Ex.) Lawformer-PGN (Ab.)

61.16

37.43

53.24

2-stage Style BERT + PGN Hybrid Style

results in a summary with more accurate structural and semantic information. Abstractive summarization alone may generate errors in accurately extracting structural information such as legal items and names, whereas extractive summarization alone may result in semantic loss due to the inability to generate a general case description. Consequently, hybrid summarization plays an irreplaceable role in our method.

A Hybrid Summarization Method for Legal Judgment Documents

797

3.6 Case Study In this section, we perform a case study to evaluate the qualities of the summaries obtained by different models. In Fig. 3, we select an example of summaries generated by our model compared to BERTSUM + Classifier from extractive summarization models, Seq2Seq + Attention from abstractive summarization models, the 2-stage model BERT + PGN and the reference from the test dataset. We can observe that the summary generated by our model coincides with the reference summary more closely than other summaries. BERTSUM + Classifier accurately extracts the case’s relevant judgment laws and items while it also extracts much redundant information and lacks a general description. The summary generated by Seq2Seq + Attention improves in generality but lacks accuracy and has repeated generation problems. BERT + PGN has improved sentence coherence and information accuracy but lacks a semantic understanding of legal texts. The case study verifies the superiority of our method on LJS.

Fig. 3. An example of summaries generated by different models. The highlighted words in the summaries are the parts that coincide with the reference summary.

4 Conclusion In this paper, we explore the task of LJS and propose a hybrid method based on Lawformer by integrating extractive and abstractive summarization. First, to shorten the document’s length, we segment the legal judgment document into three relatively short parts according to its specific structure. We adopt Lawformer as the encoder to form our extractive and abstractive summarization model, improving the effectiveness of understanding legal semantic information and processing long documents. The final summary is integrated with summaries obtained from three parts to ensure semantic and structural information integrity. The experiment shows that our proposed method outperforms all previous baselines in ROUGE evaluation metrics, indicating our model’s superiority for LJS. In future work, we would like to explore how to combine an external knowledge graph with our model for LJS.

798

J. Dan et al.

References 1. Jain, D., Borah, M.D., Biswas, A.: Summarization of legal documents: where are we now and the way forward. Comput. Sci. Rev. 40, 100388 (2021) 2. Anand, D., Wagh, R.: Effective deep learning approaches for summarization of legal texts. J. King Saud Univ. – Comput. Inform. Sci. 34(5), 2141–2150 (2022) 3. Zhong, H., Xiao, C., et al.: How does NLP benefit legal system: a summary of legal artificial intelligence. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5218–5230 (2020) 4. Zhong, L., Zhong, Z., et al.: Automatic summarization of legal decisions using iterative masking of predictive sentences. In: Proceedings of the 17th International Conference on Artificial Intelligence and Law, pp. 163–172 (2019) 5. Nguyen, D., Nguyen, B., et al.: Robust deep reinforcement learning for extractive legal summarization. In: Neural Information Processing, pp. 597–604 (2021) 6. Gao, Y., Liu, Z., et al.: Extractive summarization of Chinese judgment documents via sentence embedding and memory network. In: CCF International Conference on Natural Language Processing and Chinese Computing, pp. 413–424 (2021) 7. Elaraby, M., Litman, D.: ArgLegalSumm: Improving abstractive summarization of legal documents with argument mining. arXiv:2209.01650 (2022) 8. Yoon, J., Muhammad, J., et al.: Abstractive summarization of Korean legal cases using pre-trained language models. In: 16th International Conference on Ubiquitous Information Management and Communication, pp. 1–7 (2022) 9. Liu, J., Wu, J., Luo, X.: Chinese judicial summarising based on short sentence extraction and GPT-2. In: Qiu, H., Zhang, C., et al. (eds.) KSEM 2021, LNCS, vol. 12816, pp. 376–393. Springer, Cham (2021) 10. Radford, A., Wu, J., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019) 11. Gao, Y., Liu, Z., et al.: Extractive-abstractive summarization of judgment documents using multiple attention networks. In: Baroni, P., et al. (eds.) CLAR 2021, LNCS, vol. 13040, pp. 486–494. Springer, Cham (2021) 12. Dong, L., Yang, N., et al.: Unified language model pre-training for natural language understanding and generation. In: 33rd Conference on Neural Information Processing Systems, pp. 13042–13054 (2019) 13. Xiao, C., Hu, X., et al.: Lawformer: a pre-trained language model for chinese legal long documents. AI Open 2, 79–84 (2021) 14. Liu, Y.: Fine-tune BERT for Extractive Summarization. arXiv:1903.10318 (2019) 15. Vaswani, A., Shazeer, N., et al.: Attention is all you need. In: Annual Conference on Neural Information Processing Systems, pp. 5998–6008 (2017) 16. Nallapati, R., Zhou, B., et al.: Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290 (2016) 17. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1073–1083 (2017) 18. Devlin, J., Chang, M., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019) 19. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004) 20. Mihalcea, R., Tarau, P.: TextRank: Bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp.404–411 (2004)

Enhancing Semantic Consistency in Linguistic Steganography via Denosing Auto-Encoder and Semantic-Constrained Huffman Coding Shuoxin Wang1,2 , Fanxiao Li1,2 , Jiong Yu1,2 , Haosen Lai1,2 , Sixing Wu1,2(B) , and Wei Zhou1,2 1

Engineering Research Center of Cyberspace, Yunnan University, Kunming, China [email protected] 2 National Pilot School of Software, Yunnan University, Kunming, China {wusixing,zwei}@ynu.edu.cn

Abstract. Linguistic steganography is a useful technique to hide secret messages within a normal cover text, playing a crucial role in the field of data protection. Compared to the data encryption techniques, steganography can make the security data transmission process more imperceptible because the outputted stego (steganography) texts are not garbled codes but look like normal texts. Consequently, the essential of linguistic steganography is to improve the imperceptibility of the outputted stego texts. Although prior works can already generate fluent stego texts, how to ensure the semantics of stego text be natural and reasonable in human cognition is still a challenging problem. To alleviate this issue, this work proposes a novel Semantic-Preserved Linguistic Steganography Auto-Encoder (SPLS-AutoEncoder) to improve imperceptibility by enhancing semantic consistency. SPLS-AutoEncoder first minimizes the possible distortion when embedding the secret message into the cover text by using the denoising auto-encoder BART as the backbone model. Then, we propose a novel Dynamic Semantic-Constrained Huffman Coding, which uses a dynamic context information embedding and a global topic embedding to ensure semantic consistency between the cover text and stego text. Experimental results on two Chinese datasets show that our method has excellent performance compared with the previous methods. The datasets and code are released at https://github.com/Y-NLP/ LinguisticSteganography/tree/main/NLPCC2023 SPLS-AutoEncoder. Keywords: Linguistic Steganography Semantic-Aware Huffman Coding

1

· Semantic Consistency ·

Introduction

Steganography hides secret information within a normal cover medium, such as images [15], audios [3], videos [16], or texts [4], while preserving the original perceptual quality and avoiding raising the vigilance. Its history can be dated back c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 799–812, 2023. https://doi.org/10.1007/978-3-031-44696-2_62

800

S. Wang et al.

to ancient Greece, where messages were hidden within wax tablets by scraping away the top layer of wax and writing beneath it. Nowadays, with the rise of digital communication and the widespread use of the Internet, steganography has become a popular data protection technique besides traditional data encryption [10] for maintaining the confidentiality of information, such as sensitive data, private communication, and military information exchange. Text is the most commonly used information carrier in our daily life, such as emails, chat messages, or social media posts. Thus, this work focuses on Linguistic Steganography, a specific form of steganography that involves hiding secret information within natural language text. Roughly, previous linguistic steganography studies can be categorized as either edit-based or generation-based. The edit-based method hides the secret message by modifying the selected tokens in the cover text via pre-defined strategies. Typical edit-based strategies include synonym substitution [22], altering the color of the token [13], and replacing invisible characters [8]. The most notable advantage of edit-based methods is the generated stego text is always fluent and natural; nonetheless, its stego text only has a limited capacity to hide the secret messages. Recently, with the development of language models, there has been a growing interest in generation-based methods to hide more secret messages within a limited length [4,12,23,27,30]. Given a normal cover text beside the secret message, generation-based methods generate a new stego text based on the probability distributions outputted by their backbone language models. Thus, most studies begin to follow this paradigm. As an information-hiding technique, one essential of linguistic steganography is to escape the interception and detection from unauthorized people. This ability is generally denoted as Imperceptibility, including three sub-aspects: 1) Perceptual imperceptibility asks that the original cover text and the modified stego text are indistinguishable and similarly fluent to human eyes; 2) Statistical imperceptibility aims to narrow the differences between the cover text and the stego text in statistical analysis such as frequency analysis; 3) Cognitive Imperceptibility is to ensure that the (semantic meaning of) stego text appears to be natural and reasonable in human cognition. Currently, massive prior works have been devoted to improving imperceptibility. Bin coding [4] and Huffman coding [23] concentrated on perceptual imperceptibility by using bins or Huffman trees to construct codes for the selected tokens. Patient-Huffman coding [2] quantifies statistical imperceptibility with KL divergence between cover text and stego text. Arithmetic coding [30], Saac coding [12], and ADG [27] further improve the statistical imperceptibility by improving the involvement of KL divergence. Compared to the other two sub-aspects, cognitive imperceptibility is more challenging, as human cognition is exceptionally complex. Most generation-based works use the backbone language model to generate a continuation text conditioned on the given cover text, which requires high associative ability and cognitive ability toward the real world. Unfortunately, most language models are struggling because of the inefficient knowledge of the real world. Large-scale black-box language models such as ChatGPT and GPT-4 are competent, but

Enhancing Semantic Consistency in Linguistic Steganography

801

generative linguistic steganography methods must be built on white-box language models instead. Consequently, in previous works, with the increase of the embedding rate (i.e., hiding more secret information in a fixed-length stego text), the semantics of the stego text is always out of control and easier to invoke abnormalities in human cognition. Although recent researchers have tried to alleviate this issue by seeking semantic information from knowledge graph [7] or following the idea of neural machine translation [21], they are still hard to accurately control the semantics of the stego texts while keeping using the efficient coding. With such challenges in mind, we propose SPLS-AutoEncoder, a novel S emantic-Preserved Linguistic S teganography Auto-Encoder with the help of denoising auto-encoder BART and Dynamic Semantic-Constrained Huffman Coding. Specifically, to further improve cognitive imperceptibility, we regard generative linguistic steganography as a denoising auto-encoder task. The nature of the denoising auto-encoder keeps the semantics of the outputted text unchanged while changing partial tokens for denoising. Thus, the semantics of generated stego text in SPLS-AutoEncoder can be preserved as possible while some tokens must be changed to hide secret messages. Meanwhile, at each generation step, we propose a Dynamic Semantic-Constrained Huffman Coding to step-wisely select the best stego token from the semantic-aware Huffman tree conditioned on both the contextually semantic information and the topic-aware semantic. This improved Huffman coding technique can enhance cognitive imperceptibility without the loss of perceptual imperceptibility and statistical imperceptibility compared to the precious Huffman coding techniques. We have conducted extensive experiments on two Chinese datasets, OnlineShopping and Dianping, to verify the effectiveness of SPLS-AutoEncoder. In the experiments, we consider not only all the three sub-aspects of imperceptibility but also the Anti-Steganalysis ability. Experimental results have demonstrated that our method can outperform other methods in most metrics at the various embedding rates, especially in the aspect of cognitive imperceptibility.

2

Related Work

Linguistic Steganography. Previous studies can be roughly categorized as either the Edit-based or the Generation-based. Edit-based methods hide the secret message by modifying the selected tokens in the cover text via pre-defined strategies. The most common strategy is synonym substitution. For a selected token, we can substitute it with its sixth closest synonym, which embeds the secret message “110”. Recently, to liberate humans from constructing complex strategies, Ueoka et al. [14] use a pre-trained masked language model BERT to dynamically obtain synonyms by making the token and then predicting the most relevant tokens. Although edit-based methods can generate very natural stego texts, it is thorny to improve the information-hiding capacity, i.e., secret bits per word (BPW). The next Generation-based methods use a language model to generate a new text as the stego text based on the given cover text and secret message. In [4,23], researchers exploit perceptual imperceptibility, i.e., how

802

S. Wang et al.

to generate a natural and smooth stego text. Fang et al. [4] uniformly divide the vocabulary into 2k bins, and each bin has a unique k-bit code; thus, at each generation step, the token with the highest probability from the specified bin that matches the corresponding k-bit secret message will be selected. To improve the fluency and information capacity, Yang et al. [23] propose a Huffman coding method, which selects the top-k tokens with the highest sorted probability and then constructs a Huffman Tree to assign codes to the token candidates at each step. In [2,12,27,30] researchers exploit statistical imperceptibility, i.e., how to minimize the distance between the actual distribution of natural text and the distorted distribution of stego text. Dai et al. [2] propose a patient-Huffman coding, which uses KL-divergence to ensure statistical imperceptibility. If the KL-divergence is smaller than the threshold, it selects a token from the Huffman distribution; else, it samples from the language modelling distribution. Ziegler et al. [30] propose Arithmetic coding. It regards secret messages as a binary fractional number and uses an arithmetic coding algorithm to encode them. Shen et al. [12] improve the arithmetic coding by dynamically selecting the most appropriate k tokens at each time step with a threshold. Zhang et al. [27] employ an adaptive dynamic grouping method to construct u groups, and the probability of the group is equal. Yi et al. [26] propose an ALiSa coding algorithm, which uses BERT and Gibbs Sampling to generate fluency stego text. The highlight of the method is directly embedding the secret message without a complicated decoding process. In [7], researchers exploited Cognitive Imperceptibility, i.e., how to generate controllable semantics stego text. It proposes a topic-aware steganography method, which uses entities and relationships data from Knowledge Graphs to generate topic-controllable stego text. Compared to such works, our work focuses on improving cognitive imperceptibility by regarding generative linguistic steganography as a denoising auto-encoder task and proposing a Dynamic Semantic-Constrained Huffman Coding to control the semantics of the target stego text. Unlike our work, previous works often can not keep the semantics of stego text unchanged compared to the original cover text, increasing the risk of being detected. Linguistic Steganalysis. Linguistic Steganalysis stands on the opposite side of linguistic steganography, which aims to identify whether a text hides secret information or not. Yang et al. [25] employ CNN to check the high-level semantic differences between the stego texts and the normal cover texts. Yang et al. [24] use RNN to extract the long-term semantic features to decide whether the texts under test contain a secret message. Niu et al. [9] improve the detection accuracy by combining the benefit of bi-LSTM and CNN. Yang et al. [18] leverage bi-LSTM and employ dense connections and feature pyramids to capture different levels of semantic features. Yang et al. [19] employs a transformerarchitecture language model and graph attention network to extract semantics and syntactic features. Yang et al. [20] combine linguistic features contained in texts and context features to improve the detection ability in real social network scenarios.

Enhancing Semantic Consistency in Linguistic Steganography

3 3.1

803

Methodology Preliminary

The steganography task is a Prisoners’ Problem, where the sender Alice tries to stealthily transmit secret messages to the receiver Bob on the public channel monitored by an eavesdropper Eve. The challenge is to avoid the vigilance and the detection from Eve. To this end, Alice encrypts (encodes) a secret message M into another safe cover text Y using predefined linguistic steganography strategies. The outputted stego text Y¯ can be subsequently sent to the Bob in front of Eve‘s eyes because the meaning of Y¯ on the surface is totally different from the original secret message M but related to the cover text Y . Unlike the eavesdropper Eve, Bob knows the predefined strategies; thus, Bob can extract the original secret message M using the inversed process. Problem Definition. Formally, given a linguistic steganography function femb , the inputs include a l-bits secret message M = (m1 , · · · , ml ), mi ∈ {0, 1}, a cover text Y = (y1 , · · · , yn ). The task goal is to output a stego text Y¯ = femb (M ; Y ). Moreover, the sender and receiver must share the same extraction algorithm fext in advance to ensure that the receiver can decrypt the stego text.

Fig. 1. An illustration of the proposed SPLS-AutoEncoder.

3.2

Semantic-Preserved Linguistic Steganography Auto-Encoder

Previous generative works [2,4,6,23,27,30] can achieve acceptable encoding capacity, but their methods will significantly distort the distribution and the semantics of the original cover text, bringing a higher risk of being detected. Considering this fact, we propose SPLS-AutoEncoder, a novel S emantic-Preserved Linguistic S teganography Auto-Encoder . SPLS-AutoEncoder regards the generative linguistic steganography as a denoising auto-encoder task and uses the denoising auto-encoder BART [5] as the backbone. The objective of BART is to  learn a conditional probability distribution Pθ (X|X ), where X is the original

804

S. Wang et al. 

text and X is the corrupted noisy text. Thus, BART can map a corrupted text back to its original text and keeps the semantics unchanged as possible, which helps to reduce the differences between the cover text and stego text. As illustrated in Fig. 1, the proposed SPLS-AutoEncoder has three stages: – Semantic Embedding Encoding: We use the BART encoder to encode the cover text Y into contextual vectors H to capture the corresponding context information. Then, we use the vocabulary embedding matrix of BART to construct a topic embedding s to represent the global topic information. – Semantic-Constrained Huffman Decoding: Subsequently, we use the BART decoder to generate the stego text Y¯ conditioned on the contextual vectors H and the secret message M with the proposed Dynamic Semantic-Constrained Huffman Coding, which not only keeps the advantages of previous Huffman Coding but also improves the global semantic controllability. – Extraction: Finally, an inverse process is used to extract the secret M from Y¯ . 3.3

Semantic Embedding Encoding

In the first stage, we encode two semantic information from the given inputs: 1) Context Information Encoding and 2) Topic Embedding. Context Information Encoding. We use the BART encoder (denoted as BARTEnc ), a multi-layer stacked Transformer network, to encode the context information of the given cover text Y into context vectors H ∈ Rn×d = BARTEnc (Y ), where n is the length of cover text, d is the vector dimension. Topic Embedding. Besides the context vectors H, we also construct a topic embedding s to control the semantics in the next generation stage. We first obtain the vocabulary embedding of the BART built-in vocabulary V ∈ Rv×d , where v is the size of the vocabulary. Then, for each cover text word yi ∈ Y , we retrieve its corresponding word embedding from the V and get cover text embedding W ∈ Rn×d . Finally, we take the average of W and output the corresponding topic embedding s ∈ R1×d . s = AveragePooling(W, axis = 0) 3.4

(1)

Stego Text Generation with Semantic-Constrained Huffman Coding

At each generation time step t, we generate a stego text token y¯t to hide the t-th piece of bits in the message M . To this end, we follow the Huffman encoding paradigm [23] and propose an improved Dynamic Semantic-Constrained Huffman Coding. Compared to the naive Huffman coding [23] or the Patient Huffman coding [2], our method considers the context information Encoding and the topic

Enhancing Semantic Consistency in Linguistic Steganography

805

embedding to construct the dual-way semantic-constrained distribution, which does not only remain the ability to generate stego text with high quality and a high embedding rate but also brings a higher ability to control the semantics of the stego text. Contextually Semantic Scores. We first use the BART decoder (denoted as BARTDec ) to calculate the contextually semantic scores over the vocabulary 1×v based on the context vector H and the previously generated stego uV t ∈ R ¯ text Y1:t−1 : ¯ (2) uV t = Wdec BARTDec (H, Y1:t−1 ) + bdec where the matrix W and the vector bdec are two learnable parameters. Topic-Aware Semantic Scores. Based on the topic embedding s of the cover text Y , we compute a relevance for each vocabulary token ∈ V using the Cosine similarly: (3) uS = CosineSimilarity(s, V) where uS ∈ R1×V ocab is the final topic-aware scores over the vocabulary, where uS [i] denotes the relevance between the i-th vocabulary token and the topic embedding s. Dual-Way Semantic-Constrained Distribution. Then, we aggregate the above two scores and compute the final prediction distribution pt ∈ R1×v via the topic-aware hyper-parameter λ = 6.0 and then normalize it to probability distribution with Softmax: S pt = Softmax(uV t +λ×u )

(4)

Dynamic Semantic-Constrained Huffman Coding. Given a hyper-parameter k, we select top k tokens with the highest probability ∈ pt to construct a Huffman tree (see Fig. 1). Then, according to the current t-th piece of bits ∈ message M , we select the token by following the secret bit stream from the root of the tree and continuing until reaching the corresponding leaf node distribution. Finally, we repeat the above process until the entire secret message M is fully accessed. 3.5

Extraction

In linguistic steganography, we can extract the secret message M from the stego text Y¯ by doing inverse processes of data embedding. Similar to other generative linguistic steganography methods, the authorized receiver should have the same cover text, the same language model, and share coding arithmetic as the sender. Then, the receiver should feed the cover text into the language model and obtain the probability distribution at each time step; then, the receiver can use the

806

S. Wang et al.

same coding method to rerank the probability distribution and construct the Huffman tree. The token in the Huffman tree index reveals the embedded secret information.

4 4.1

Experiment Settings

Dataset. We test models on two Chinese datasets: OnlineShopping and Dianping. OnlineShopping comprises 62,774 OnlineShopping reviews collected from several e-commerce platforms. The Dianping is a restaurant review dataset. There are 2 M data in the train set and 500 K data in the test set. For each dataset, we randomly select 10,000 sentences with lengths ranging from 50 to 200 as cover texts. Models and Implementation. We select the edit-based MASK [14], and the generation-based Bins [4], Huffman [23], Arithmetic [30], SAAC [12], ADG [27], ALisa [26] as our baselines. We use the pre-trained model gpt2-chinesecluecorpussmall [11,29] to rebuild the GPT2-based Bins, Huffman, Arithmetic, SAAC and use the pre-trained bert-base-chinese to build the BERT-based MASK and ALisa. For the Bins method, we set the block number to 2, 4, 8. For the Huffman method, we choose the Huffman tree depth k in 1, 2, 3. For the Arithmetic method, we set the conditional distribution top-k to 100. For the SAAC method, we truncated the imperceptibility gap δ at 0.3. For the MASK method, we set the masking interval parameter to 3 and vary the γ by 0.0001. For the ALisa method, we randomly select two words in a sentence. For SPLS-AutoEncoder, we use a Chinese BART bart-base-chinese-cluecorpussmall [5,29] as the backbone. In the inference, we set the topic-aware hyper-parameter λ at 6 and k in {1/3, 1, 3, 5}, where k = 1/3 denotes we embed 1 bit every 3 generation time steps. Metrics. We evaluate the generated stego texts via four aspects: i) BPW. The average number of embedded bits per word (token) in the stego text. ii) Perceptual Imperceptibility. We here use P erplexity (PPL) to measure the fluency of a sentence from the perspective of probability. The PPLB is calculated by BART and the PPLG is calculated by GPT-2. iii) Statistical Imperceptibility. We measure this by using Kullback-Leibler divergence (KLD) [1], which calculates statistical distributions between the sentence vectors of cover text and stego text. We employ T ext2V ec [17] tool to convert each sentence into vectors and presume the sentence vector representation follows isotropic Gaussian distribution:DKL (Pvc ||Pvs ) ≈  δ 2 +(μ −μ )2 (log δδsc + c 2μc 2 s ) − 12 where vc is the vector representation of cover s text, vs is the vector representation of stego text, μ is the mean of vector and δ is the standard deviation of the vector.

Enhancing Semantic Consistency in Linguistic Steganography

807

iv) Cognitive Imperceptibility. We evaluate semantic relevance between the stego text and the cover text by using Emb-G/X(Embedding Greedy Matching, Vector Extrema), BLEU-3/4, ROUGE-L and BertScore [28] v) Anti-Steganalysis Ability. We use two recent RBiLSTMC [9] (AS1 ) and BiLSTM Dense [18] (AS2 ) to evaluate the anti-steganalysis ability. We report the Accuracy that stego texts can be successfully detected; thus, lower is better. 4.2

Results and Analysis

Overall Performance. We conduct extensive experiments with various settings of the target embedding rate (BPW). According to the experimental results reported in Table 1, our SPLS-AutoEncoder has the best overall performance in all BPW settings. Besides, we can also find that: 1) The selection of backbone models can significantly affect the results of PPL. Thus, we use two different backbones, GPT-2 and BART, to estimate the PPL in the aspect of perceptual imperceptibility. It can be seen that GPT-2-based baselines have better Table 1. The evaluation results on the OnlineShopping dataset and the Dianping Dataset Method

BPW PPLB PPLG KLD

BERT Emb-G/X

ROUGE BLEU-3/4

CharF1 AS1 ↓ AS2 ↓

OnlineShopping MASK(γ=0.0001) 0.35 SPLS-AutoEncoder (k=1/3) 0.33

1.07 1.42

86.78 13.06 42.65 2.25

0.904 0.907 0.883 78.01 0.899 0.968 0.916 77.11

61.00 53.60 80.72 65.95 58.65 84.14

0.348 0.513 0.701 0.508

Bin(bins=2) Huffman(k=1) SPLS-AutoEncoder (k=1)

1.00 1.00 1.00

1.39 1.44 1.12

14.2 36.07 11.06 41.51 71.85 8.72

0.608 0.614 0.798

0.766 0.693 16.96 0.771 0.696 17.63 0.910 0.792 52.67

7.20 3.84 30.30 7.88 4.39 30.71 35.68 25.53 66.91

0.789 0.782 0.819 0.901 0.460 0.530

Bin(bins=4) Huffman(k=2) SPLS-AutoEncoder (k=3)

2.00 1.79 2.17

1.88 1.60 1.26

54.48 21.14 99.50

46.27 0.586 47.12 0.606 34.27 0.702

0.731 0.693 10.77 0.755 0.691 13.97 0.840 0.747 26.14

1.90 0.75 4.23 2.19 11.94 8.41

17.68 28.59 46.89

0.927 0.929 0.892 0.942 0.750 0.880

Bin(bins=8) Huffman(k=4) Arithmetic(topk = 100) SAAC(δ = 0.2) ADG Alisa SPLS-AutoEncoder (k=5)

3.00 2.92 3.02 3.08 3.14 2.63 2.86

2.52 2.59 1.62 2.02 2.88 6.28 1.28

142.63 43.63 26.67 28.46 36.9 48.53 116.48

82.28 73.10 60.29 62.39 78.78 432.62 52.84

0.560 0.588 0.593 0.591 0.583 0.480 0.674

0.706 0.674 0.734 0.683 0.732 0.686 0.729 0.687 0.713 0.676 0.673 0.564 0.813 0.729

0.43 0.13 1.45 0.72 3.27 2.18 3.13 2.08 1.83 1.13 0.07 0.03 7.14 5.00

16.96 22.14 22.42 22.00 19.84 9.61 40.04

0.959 0.94 0.903 0.892 0.819 0.998 0.813

MASK(γ=0.0001) 0.38 SPLS-AutoEncoder (k=1/3) 0.33

1.42 1.28

73.96 14.67 31.22 2.27

0.903 0.899

0.908 0.879 77.88 0.969 0.922 76.85

59.30 51.36 80.00 66.89 59.51 84.59

0.569 0.456 0.734 0.512

Bin(bins=2) Huffman(k=1) SPLS-AutoEncoder (k=1)

1.00 1.00 1.00

1.28 1.31 1.21

12.43 67.68 55.85

43.92 0.620 38.18 0.626 12.61 0.790

0.775 0.685 17.22 0.781 0.686 18.16 0.991 0.841 52.92

7.21 3.59 31.82 7.94 4.09 33.09 36.10 25.80 67.47

0.950 0.921 0.963 0.945 0.750 0.532

Bin(bins=4) Huffman(k=2) SPLS-AutoEncoder (k=3)

2.00 2.00 2.15

1.61 1.62 1.50

46.71 52.38 0.601 115.2 58.55 0.618 76.45 48.04 0.704

0.742 0.674 11.34 0.763 0.677 14.42 0.977 0.760 26.67

1.94 0.72 4.14 2.00 11.98 8.40

24.42 29.86 47.28

0.983 0.963 0.977 0.964 0.942 0.890

Bin(bins=8) Huffman(k=4) Arithmetic(topk = 100) SAAC(delta=0.2) ADG Alisa SPLS-AutoEncoder (k=5)

3.00 2.87 2.89 2.85 2.68 2.64 2.69

2.01 2.16 1.63 1.61 2.88 8.96 1.55

129.06 34.50 20.39 19.81 70.65 47.99 89.34

0.718 0.669 0.743 0.673 0.746 0.682 0.745 0.682 0.736 0.677 0.676 0.561 0.840 0.741

0.44 0.12 1.43 0.66 3.05 1.67 3.21 1.78 2.46 1.31 0.09 0.04 7.33 5.11

17.57 22.97 23.28 23.13 23.17 9.91 40.59

0.986 0.980 0.937 0.953 0.950 0.992 0.958

6.93 9.39 10.61 10.37 8.83 5.80 21.05

0.956 0.929 0.877 0.916 0.814 0.998 0.933

Dianping

113.93 88.74 81.24 86.27 78.49 626.56 69.59

0.574 0.602 0.607 0.607 0.610 0.496 0.680

7.32 9.88 11.25 11.31 10.66 5.73 21.20

0.963 0.974 0.916 0.924 0.912 0.999 0.944

808

S. Wang et al.

PPL when using GPT-2 to compute PPL (i.e., PPLG) but poor PPL when using BART to compute PPL (i.e., PPLB). Thus, we only use PPLG/PPLB to show the impact of the BPW but do not make the direct comparison because SPLS-AutoEncoder is the only BART-based model; 2) SPLS-AutoEncoder has an excellent performance in terms of KLD, a statistical imperceptibility metric, indicating the denoising ability of BART can effectively narrow the distribution between the cover text and stego text. 3) SPLS-AutoEncoder can significantly outperform baselines in most cognitive imperceptibility metrics. The reason is SPLS-AutoEncoder regards this task as a denoising auto-encoder and uses an efficient Dynamic Semantic-Constrained Huffman Coding, bringing extraordinary ability in semantic controlling; 4) Our approach can easily beat other baselines in anti-steganalysis when BPW < 3 and achieve the tier-1 level performance when BPW is around 3. Table 2. The sensitive analysis of the topic-aware hyper-parameter λ on Dianping. k λ

PPLG KLD BERT ROUGE BLEU-1 BLEU-2 BLEU-3 BLEU-4

CharF1

5

0 102.07 80.86 0.667

18.62

12.40

8.39

5.82

4.08

34.46

5

1 96.37

77.21 0.669

18.98

12.92

8.72

6.03

4.23

35.47

5

3 92.61

73.72 0.675

20.05

14.16

9.57

6.62

4.63

37.67

5

5 89.59

70.90 0.678

20.74

15.25

10.25

7.05

4.92

39.60

5

6 89.34

69.59 0.680

21.20

15.81

10.65

7.33

5.11

40.59

5 10 85.76

71.01 0.683

22.47

17.85

11.94

8.16

5.66

43.73

The Sensitive Analysis of the Topic-Aware Hyper-parameter. The topic-aware hyper-parameter λ (see Eq. 4) controls the participation of the topicaware embedding when generating the stego text. As reported in Table 2, we tested the performance under different values. It can be observed that the results in perceptual and cognitive imperceptibility metrics gradually increase with the increase of λ, demonstrating that topic-aware embedding can indeed control the semantics of the target stego text. Meanwhile, a high λ may slightly impact the statistical imperceptibility at the same time. Thus, to balance all sub-aspects of imperceptibility, we finally select λ = 6. In addition, when λ = 0, the proposed Dynamic Semantic-Constrained Huffman Coding will degenerate to the original Huffman coding, and we can observe a significant decline in all metrics. It also shows the effectiveness of Dynamic Semantic-Constrained Huffman Coding. The Impact of Embedding Rate. We also analyse the impacts of different embedding rates. We first randomly sampled 3,000 stego texts generated by our SPLS-AutoEncoder and the corresponding original cover texts. Then, we used T ext2V ec [17] to extract semantic embedding and used the t-SNE algorithm to

Enhancing Semantic Consistency in Linguistic Steganography

809

Fig. 2. Visualization for stego texts and cover texts by applying t-SNE. The blue dots represent stego texts, and the orange dots represent cover texts. (Color figure online) Table 3. Case Study on the Dianping datasets. ♣/ denotes the BPW is about 0.35/3. We omitted the padding text by replacing them with ‘· · · ’. Methods

Generated Stego Texts

Cover Text

离市区比较近,价格相对其它的温泉酒店还算便宜 周 末可以去泡泡温泉休息下, 但不要奢望太多,这里不会 给你太多的惊喜,重在人少,清净

MASK♣

离市场比较近,价钱相对其它的温泉酒店还算便宜 周 末可以去泡泡温泉休闲下, 但不要指望太多,这里不会 给你太多的惊喜,重在钱少,清洁

SPLS-AutoEncoder♣ 离市区比较近 价格相对其他的温温泉酒店还算便宜 平时 可 以去泡温泉休休下, 但不要奢望太好, 这里不 会给你太多的惊 喜, 重 在 人 少 清 净, Bin 

如若没去到比如温酒 不会给出什锦 一般有三样

Huffman 

而最重点的是温水的好坏 温水温泉的水是冷水

Arithmetic 

小小的泡池比别处的更受欢迎,可以欣赏大海最美的风 光

ADG

大海有很多特色 有很多的不同的天象 还

Alisa

。一人之所以,一人相知其。

SPLS-AutoEncoder 离市郊较较近,价格比其它很低 不过温泉也还可以 如

reduce the dimension to 2. As shown in Fig. 2, we can find that the semantic distortion between the cover text and the stego text spreads with the increase in embedding rate. For instance, it can be seen that the stego texts and cover texts are clustered closely when k = 1, but the stego texts spread from the cover texts when k = 5. Case Study. As shown in Table 3, we sample several stego texts generated by baselines and our SPLS-AutoEncoder. We can find that no matter if BPW is set to about 3 or 0.35, our stego texts can keep the semantics almost unchanged, especially compared with other generative methods, indicating the great potential of our approach SPLS-AutoEncoder in semantics-preserving. Meanwhile, our

810

S. Wang et al.

stego texts are more fluent, rational and natural compared to baselines. We think it can be attributed to the utilization of the denoising auto-encoder backbone model, namely, BART.

5

Conclusion

In this work, we propose a novel Semantic-Preserved Linguistic Steganography Auto-Encoder (SPLS-AutoEncoder) to generate the semantic-preserved stego text based on the given cover text. To simultaneously improve the perceptual, statistical, and cognitive imperceptibility of the generated stego texts, we first try to minimize the distortion involved when embedding the secret message into the cover text by using the denoising auto-encoder BART as our backbone model. Then, we propose a novel Dynamic Semantic-Constrained Huffman Coding, which uses a dynamic context information embedding and a global topic embedding to ensure semantic consistency between the cover text and stego text. We conduct extensive experiments on two different Chinese datasets, and the corresponding experimental results have shown our SPLS-AutoEncoder can outperform other baseline methods in most metrics with various target BPW settings. Future work includes further reducing the risk of stego text detection at high embedding rates while maintaining the balance between perceptual imperceptibility, statistical imperceptibility, and cognitive imperceptibility. Acknowledgements. This work is supported in part by Yunnan Province Education Department Foundation under Grant No.2022j0008, in part by the National Natural Science Foundation of China under Grant 62162067 and 62101480, Research and Application of Object detection based on Artificial Intelligence, in part by the Yunnan Province expert workstations under Grant 202205AF150145.

References 1. Cachin, C.: An information-theoretic model for steganography. Inf. Comput. 192, 41–56 (2004) 2. Dai, F., Cai, Z.: Towards near-imperceptible steganographic text. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019) 3. Djebbar, F., Ayad, B., Meraim, K.A., Hamam, H.: Comparative study of digital audio steganography techniques. EURASIP J AUDIO SPEE 2012, 1–16 (2012) 4. Fang, T., Jaggi, M., Argyraki, K.: Generating steganographic text with LSTMs. In: ACL (2017) 5. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL (2020) 6. Li, F., et al.: Rewriting-Stego: generating natural and controllable steganographic text with pre-trained language model. In: Wang, X., et al. (eds.) DASFAA 2023. Lecture Notes in Computer Science, vol. 13943. Springer, Cham (2023). https:// doi.org/10.1007/978-3-031-30637-2 41 7. Li, Y., Zhang, J., Yang, Z., Zhang, R.: Topic-aware neural linguistic steganography based on knowledge graphs. ACM/IMS Trans. Data Sci. 2, 1–13 (2021)

Enhancing Semantic Consistency in Linguistic Steganography

811

8. Minzhi, Z., Xingming, S., Huazheng, X.: Research on the Chinese text steganography based on the modification of the empty word. Comput. Eng. Appl. 42, 158–160 (2006) 9. Niu, Y., Wen, J., Zhong, P., Xue, Y.: A hybrid R-BILSTM-C neural network based text steganalysis. IEEE Signal Process. Lett. 26, 1907–1911 (2019) 10. Provos, N., Honeyman, P.: Hide and seek: an introduction to steganography. IEEE Secur. Priv. 1, 32–44 (2003) 11. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019) 12. Shen, J., Ji, H., Han, J.: Near-imperceptible neural linguistic steganography via self-adjusting arithmetic coding. In: EMNLP (2020) 13. Tang, X., Chen, M.: Design and implementation of information hiding system based on RGB. In: CECNet (2013) 14. Ueoka, H., Murawaki, Y., Kurohashi, S.: Frustratingly easy edit-based linguistic steganography with a masked language model (2021) 15. Volkhonskiy, D., Borisenko, B., Burnaev, E.: Generative adversarial networks for image steganography (2016) 16. Wang, K., Zhao, H., Wang, H.: Video steganalysis against motion vector-based steganography by adding or subtracting one motion vector value. TIFS 9, 741–751 (2014) 17. Xu, M.: Text2vec: text to vector toolkit. https://github.com/shibing624/text2vec (2022) 18. Yang, H., Bao, Y., Yang, Z., Liu, S., Huang, Y., Jiao, S.: Linguistic steganalysis via densely connected LSTM with feature pyramid. In: ACM Workshop on Information Hiding and Multimedia Security (2020) 19. Yang, J., Yang, Z., Zhang, S., Tu, H., Huang, Y.: SeSy: linguistic steganalysis framework integrating semantic and syntactic features. IEEE Signal Process. Lett. 29, 31–35 (2021) 20. Yang, J., Yang, Z., Zou, J., Tu, H., Huang, Y.: Linguistic steganalysis toward social network. TDSC 18, 859–871 (2023) 21. Yang, T., Wu, H., Yi, B., Feng, G., Zhang, X.: Semantic-preserving linguistic steganography by pivot translation and semantic-aware bins coding. TDSC (2023) 22. Yang, X., Li, F., Xiang, L.: Synonym substitution-based steganographic algorithm with matrix coding. Chin. Comput. Syst. 36, 1296–1300 (2015) 23. Yang, Z.L., Guo, X.Q., Chen, Z.M., Huang, Y.F., Zhang, Y.J.: RNN-stega: linguistic steganography based on recurrent neural networks. TIFS 14, 1280–1295 (2018) 24. Yang, Z., Wang, K., Li, J., Huang, Y., Zhang, Y.J.: TS-RNN: text steganalysis based on recurrent neural networks. IEEE Signal Process. Lett. 26, 1743–1747 (2019) 25. Yang, Z., Wei, N., Sheng, J., Huang, Y., Zhang, Y.J.: TS-CNN: text steganalysis from semantic space based on convolutional neural network. arXiv preprint: arXiv:1810.08136 (2018) 26. Yi, B., Wu, H., Feng, G., Zhang, X.: ALiSa: acrostic linguistic steganography based on BERT and Gibbs sampling. IEEE Signal Process. Lett. 29, 687–691 (2022) 27. Zhang, S., Yang, Z., Yang, J., Huang, Y.: Provably secure generative linguistic steganography. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP (2021) 28. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT. arXiv preprint: arXiv:1904.09675 (2019)

812

S. Wang et al.

29. Zhao, Z., et al.: UER: an open-source toolkit for pre-training models. EMNLPIJCNLP (2019) 30. Ziegler, Z., Deng, Y., Rush, A.: Neural linguistic steganography. In: EMNLPIJCNLP (2019)

Review Generation Combined with Feature and Instance-Based Domain Adaptation for Cross-Domain Aspect-Based Sentiment Analysis Xiuwei Lv, Zhiqiang Wang, and Lei Ju(B) Beijing Electronic Science and Technology Institute, Beijing, China [email protected], {wangzq,jl}@besti.edu.cn

Abstract. The supervised learning methods have proven effective for Aspect-Based Sentiment Analysis (ABSA). However, sufficient data with fine-grained labels is essential for supervised learning, which hinders their effectiveness in many domains lacking fine-grained labeled data. Unsupervised domain adaptation methods are proposed to address this issue, but these methods exit some limitations and are difficult to satisfy the requirements of Cross-Domain ABSA. This paper proposes a joint approach named Review Generation Combined with Feature and InstanceBased Domain Adaptation (RGFI) for Cross-Domain ABSA. Based on Bert, RGFI not only uses the approach of review generation but also unifies the feature and instance-based domain adaptation methods for crossdomain ABSA tasks. Compared with other state-of-the-art domain adaptation methods, experiment results on four benchmarks demonstrate the significant effect of RGFI-based approaches in both cross-domain Endto-End ABSA and cross-domain Aspect Extraction tasks.

Keywords: Cross Domain Domain Adaptation

1

· Aspect-Based Sentiment Analysis ·

Introduction

In recent years, Aspect-Based Sentiment Analysis (ABSA) has received considerable attention [1]. End-to-End ABSA and Aspect Extraction (AE) are two tasks of ABSA. The former aims to extract the aspect terms in the sentences and predict their sentiment polarities {POS, NEG, NEU} [2]. As a sub-task of End2End ABSA, AE only aims to extract the aspect terms from the given sentences. For example, given the sentence “The weather is fine today. Let us walk! ”, the extracted pair of the aspect term and its sentiment polarity in this example is {“weather”: Positive}, respectively. X. Lv and Z. Wang—Equal contribution c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 813–825, 2023. https://doi.org/10.1007/978-3-031-44696-2_63

814

X. Lv et al.

With the development of deep learning, many supervised learning methods are applied to ABSA and achieved good results, e.g., aspect extraction [3], and End-to-End ABSA [4]. However, the supervised methods generally need finegrained labeled data, which is highly scarce for some new domains. Many unsupervised domain adaptation methods are proposed to transfer knowledge from the fine-grained labeled source domain to the unlabeled target domain to address this problem. Traditional unsupervised domain adaptation methods mainly focus on coarse-grained sentiment classification and can be divided into two categories as follows: – Feature-based domain adaptation, which aims to learn a new feature representation that can reduce the discrepancy between the source domain and the target domain as far as possible [5]. – Instance-based domain adaptation, which aims to reduce the discrepancy by re-weighting the labeled instances in the source domain so as to make the instances similar to the target domain higher weight and the instances different from the target domain lower weights at the same time [6]. However, few studies explore the domain adaptation problem for ABSA [7–9] because of the challenges in fine-grained adaptation. Moreover, these studies are still based on the two traditional categories above. Recently, a work unifies the two traditional methods (UDA) [10] and makes excellent progress. However, the two traditional categories mainly suffer from two common limitations: one is that the supervision knowledge for the main task solely comes from the fine-grained labeled source domain, although they can reduce the discrepancy by learning domain-invariant representations or re-weighting the training instances, and the other is that the effectiveness of them is lack of interpretability such as we could not know how much discrepancy between two domains they can reduce or how many shared representations they can learn. In our other work, a domain adaptation method called multi-step review generation (MSRG) was proposed [11], which aims to achieve domain adaptation by generating the target-domain reviews with fine-grained annotations based on masked language model (MLM) [12] from the fine-grained labeled source domain, and MSRG makes better effectiveness compared with other SOTA domain adaptation methods for cross-domain ABSA tasks. MSRG solves the problem that fine-grained labeled data is scarce in some domains to a certain extent. However, MSRG still exits limitations on domain adaptation, e.g., MSRG only re-trains the pre-trained model BERT on relative corpus by the cross entropy of MLM, and it is not using the traditional feature-based and instance-based domain adaptation methods. Although there are some inevitable limitations in feature-based and instancebased methods for domain adaptation, the two domain adaptation methods still have specific effects, more or less according to relevant experimental results. Therefore, in this paper, we propose a joint domain adaptation paradigm named Review Generation combined Feature and Instance-based Domain Adaptation (RGFI) for cross-domain Aspect-based Sentiment Analysis.

RGFI for Cross-Domain Aspect-based Sentiment Analysis

815

Specifically, based on BERT, RGFI unifies the feature-based and the instance-based domain adaptation to train an initial model M0 firstly, which aims to make M0 learn the domain-invariant knowledge and shared representations and re-weight the weight of training instances between two domains. Then, RGFI applies MSRG to generate the target-domain reviews with finegrained labeled data from the source domain. After that, RGFI trains M0 on the corpus that contains those generated target-domain reviews as the downstream training for ABSA tasks. Finally, we can obtain the final model MF used for ABSA tasks. To identify the effectiveness of RGFI, we design and conduct a series of relevant experiments compared with other state-of-the-art domain adaptation methods on four benchmark datasets. Experimental results demonstrate the significant effectiveness of our RGFI-based approaches and show that RGFI can significantly improve the performance both in cross-domain End2End-ABSA task and cross-domain AE task. Our main contributions in this paper can be summarized as follows: – We propose a joint domain adaptation method named Review Generation combined with Feature and Instance-based Domain Adaptation (RGFI) for cross-domain ABSA tasks. Besides, we propose a brief yet practical three-part realization of RGFI. – Experiment results on four benchmarks demonstrate the significant effect of RGFI. Compared with other SOTA methods, RGFI obtains an average improvement of 0.69% for cross-domain E2E-ABSA task and 0.76% for crossdomain AE task on Micro-F1 score. – In addition, we further design and conduct relative ablation study to quantitatively measure the effectiveness of each component of our joint approach.

2

Related Work

Aspect-level sentiment classification [13] and aspect extraction [3] have been widely studied in the literature as two crucial tasks in ABSA. In practical applications, a quantity of recent research handles them together in an end-to-end manner, in which many supervised learning methods [4,14,15] have been proposed. Although promising results have been obtained, their main limitation lies in lacking labeled data in new fields. In order to solve this problem, unsupervised domain adaptation methods are desired. Most existing methods for domain adaptation aim to learn domain-shared representations by focusing on coarse-grained sentiment classification, such as pivot-based approaches [5], auto-encoders [16], and domain adversarial networks [17]. Besides, another line of studies focuses on re-weighting labeled instances to find useful labeled instances for the target domain automatically [6]. Moreover, only a few studies focus on cross-domain ABSA [7–10] because of the challenges in fine-grained domain adaptation. However, these approaches still follow the two traditional domain adaptation paradigms. Recently, a work named UDA attempted to unify the two traditional domain adaptation methods [10]

816

X. Lv et al.

and made better progress than previous work. Nevertheless, this work still exits the limitations between two traditional domain adaptation paradigms. Unlike these methods, MSRG aims to achieve domain adaptation by generating target-domain reviews with fine-grained labels and making promising results. However, MSRG has not applied feature- and instance-based domain adaptation methods. From the results of UDA, we can learn that the improvement of UDA is considerable, although UDA exits the limitations of the traditional methods. Therefore, this paper proposes a joint method named Review Generation Combined with Feature and Instance-based Domain Adaptation (RGFI) for cross-domain ABSA tasks.

3

Problem Statement

In this paper, we mainly focus on two ABSA tasks above, i.e., End-to-End ABSA task and Aspect Extraction task. We can formulate both tasks as sequence labeling problems by following [9]. Formally, given a sequence of tokens α = {a1 , a2 , . . . , an } as input, and the output is a label sequence β = {b1 , b2 , . . . , bn }. For E2E-ABSA, bi ∈ {B-NEG, I-NEG, B-NEU, I-NEU, B-POS, I-POS, O}, and for AE, bi ∈ {B, I, O}. Cross-Domain ABSA. Based on the unsupervised domain adaptation settings, fine-grained labeled data from the source domain and unlabeled data from the target domain are only available to us. Specifically, given a set of s , sentences with fine-grained labels from the source domain DS = {(αis , βis )}N i=1 u and another unlabeled set of sentences from the target domain DU = {αiu }N i=1 . The goal is to predict the sequences of labels for test data in the target domain: t βit = ft (αit ), DT = {αit }N i=1 .

4

Methodology

Overview: This paper combines the MSRG-based approaches and two traditional domain adaptation paradigms. We propose a joint domain adaptation method called Review Generation combined with Feature and Instance-based domain adaptation (RGFI) for cross-domain ABSA tasks. To achieve this goal, we propose a brief yet effective approach divided into three parts, as shown in Fig. 1. 4.1

The First Part: Training the Initial Model

Feature-Based Domain Adaptation. The core idea of feature-based domain adaptation is structural correspondence learning [5]. Following [10], we use partof-speech information and dependency relations as self-supervised signals to learn the structural correspondence between the source and target domain. For

RGFI for Cross-Domain Aspect-based Sentiment Analysis

817

Fig. 1. The overview of the RGFI-based approaches for Cross-Domain ABSA tasks.

part-of-speech information, we use the Masked POS Tag Prediction method to learn, and for dependency relations, we use the Dependency Relation Prediction method [10]. Masked POS Tag Prediction. Specifically, we firstly convert the token sequences w = {w1 , w2 , . . . , wn } into continuous embedding e = {e1 , e2 , . . . , en }, among them the ei = [ti , ss , pi , tagi ], and the first three kinds of embedding are defined in [12] initialized with BERT model. tagi ∈ Rd is the POS tag embedding, and it is randomly initialized and trained with unlabeled data from both the source and target domains. In this part, we assume that sub-words share the same POS tags due to BERT. Through a multi-layer transformer, the token embedding sequences e were converted into a context-aware representation as follows: H = transformer(E)

(1)

After that, we randomly replace 25% of original tokens and their POS tags with the masked tokens [MASK] to prepare the input for the masked POS tag prediction task. Then the masked feature in H is fed into the softmax layer and as follows: converted into the probability over POS tag types ppos i ppos = softmax(Wp hi + bp ) i

(2)

where ppos ∈ Rn kinds , n kinds is the number of POS tag types, Wp and bp is the i weight matrix and bias vector belonging to the softmax layer. The cross-entropy loss is that: T  pos K(i)l(ppos (3) Lpos = i , yi ) DU

i

where K(i) is equal to 1 if masked and equal to 0 if unmasked, and yipos means the real POS tag type of the i-th token.

818

X. Lv et al.

Dependency Relation Prediction. To reconstruct the dependency relation in H, we feed H to two non-linear transformation functions to obtain H head = , hhead , . . . , hhead } and H child = {hchild , hchild , . . . , hchild } as follows: {hhead n n 1 2 1 2 hhead = tanh(W1 hi + b1 ) i

(4)

hchild = tanh(W2 hi + b2 ) i

(5)

d

where hhead , hchild ∈ R 4 ; W1 and W2 are learnable parameters. hhead and hchild i i i i are the representations of the head token and child token in the dependency tree. Now, given the i-th token and the j-th token in the input sequence, they are connected in the dependency tree, and the i-th token is the head node, and the other is the child node. The prediction of their dependency relation oij is that: oij = [hhead ; hchild ; hhead − hchild ; hhead  hchild ] i j i j i j

(6)

where − and  mean element-wise subtraction and multiplication, [; ] means concatenation operation. the pdep ij was converted from oij by a softmax layer: pdep ij = softmax(Wd oij + bd )

(7)

where Wd ∈ Rd×narc is the weight matrix for relation classification, and narc is the number of relation types. We use token pairs to construct training examples. I(ij) means weather token pairs (i, j) have a direct edge in dependency tree or not. Then we predict their dependency relation, and the optimization objective is as follows: N N   dep I(ij)l(pdep (8) Ldep = ij , yij ) DU

i

j

Then, we perform feature-based domain adaptation through the two tasks above, and the objective of the optimization is as follows: Lf eature = Lpos + γLdep

(9)

where γ is a trade-off hyper-parameter, and its function is to control the contributions of the two tasks above. Instance-Based Domain Adaptation. Following [10], we need to re-weight instances in the source domain for instance-based domain adaptation by using pt ps at the word level due to fine-grained requirements. Specifically, when we train the main task, we also train a domain classifier D based on unlabeled data. D is used to identify whether each word is from the source domain or the target domain. The input is the output of H and will be sent to a softmax layer; then we can obtain the domain distribution probability pD i of the i-th word wi as follows: (10) PiD = softmax(Wd hi + bd )

RGFI for Cross-Domain Aspect-based Sentiment Analysis

819

|y | where pD and y n d = {source, target}. And the cross entropy loss i ∈ R and the ground-truth yiD in training D is as follows: between pD i n d

Ld =

T 

D l(pD i , yi )

(11)

DU i=1

Then, we can obtain the domain distribution of each word by D and compute the ratio of its target-domain probability to its source-domain probability, i.e., pD i,t , pD i,s

as the weight of each word and the information of re-weighting instances in

the main task. The optimization of the main task is as follows: pm i = softmax(Wm hi + bm ) Lm =

N  DS

m ωi ∗ l(pm i , yi )

(12) (13)

i

where ωi means the weight of each word and is computed by the re-normalization over

pD i,t pD i,s

of all the N tokens based on D, therefore, the optimization objective

of instance-based domain adaptation is as follows: Linstance = Lm + Ld

(14)

Through the approaches above, we can get an initial model M0 , which has learned the domain-shared representations and re-weighted the distribution of instances for cross-domain ABSA. 4.2

The Second Part: Generating the Target-Domain Sentences

In this part, we apply the approach of MSRG to generate the target-domain sentences with fine-grained labels. Specifically, the generation can be divided into two steps. In the first step, we need to extract all the aspect and opinion terms for each domain by Double Propagation [18] firstly, then we remove all the terms that occur in both target and source domains, and we can get the domainspecific term lists which only appear in one domain. Then, for each token of each sentence in a domain, we substitute the tokens which exist in the domain-specific term lists of this domain with a unique token [MASK], and we can obtain the domain-independent sentence respectively. After that, the multi-step prediction (MSP) can be applied to generate the target-domain sentences in the second step. Precisely, given a masked domainindependent sentence xm 0 = {token0 , [MASK]1 , token2 , [MASK]3 , . . . , tokenn }, MSP will predict the masked tokens based on masked language model which is re-trained on the unlabeled target-domain data and select the most probability tokens which only selected in the domain-specific term lists of the target domain. Then we substitute one of the masked tokens with the selected token and get the masked domain-independent sentence xm 1 . The steps above will be

820

X. Lv et al.

iterated until all the masked tokens in the domain-independent sentence are predicted and substituted. As for labeled data, we directly use the original labels of the sentences from the source domain as the labeled data of generated sentences by the opinion specific introduced in [11]. Then, we can obtain the new s target-domain sentences with fine-grained labels set DG = {(αig , βis )}N i=1 . 4.3

The Third Part: Downstream Training for ABSA

We can get the initial model M0 and the generated target-domain sentences set DG through the two parts above. Then, we only need to train M0 on the corpus set DG for the End2End-ABSA task or Aspect Extraction task by supervised learning. Following [11], we continue to use the Indep training, Merge training, BiIndep training and BiMerge training introduced in [11] as the training strategies to verify the effectiveness of RGFI conveniently and reasonably.

5 5.1

Experiment Datasets

Table 1. Statistics of the datasets. Dataset Domain

Sentences Training Testing

L

Laptop

3845

3045

800

R

Restaurant 6035

3877

2158

D

Device

3836

2557

1279

S

Service

2239

1492

747

We conduct experiments on four benchmark datasets: Laptop (L), Restaurant (R), Device (D), and Service (S). L contains user reviews from the laptop domain from SemEval ABSA challenge 2014 [19], and R is a combination of the restaurant datasets from SemEval 2014, 2015, and 2016 ABSA challenge [2,19,20]. D is a union set of device reviews from [21]. S is the review set of web services introduced by [22]. Detailed statistics are shown in Table 1. 5.2

Experiment Settings & Implementation Details

We conduct experiments on 10 transfer pairs using the four domains above, and D→R and R→D are removed because the two domains are very similar [23]. For each pair, the training data contains both the training data from the source domain and the training data from the target domain, and the evaluation experiments are conducted on the test data from the target domain. We use the Micro-F1 score as the evaluation metric to compare with the previous work

RGFI for Cross-Domain Aspect-based Sentiment Analysis

821

fairly. In the first part, for the BERT language model fine-tuning, the learning rate is set to 3e-5, and the batch size is set to 32. In the second part, we set learning rate 3e-5 and batch size 32 for re-training the MLM task. Moreover, in the third part, the learning rate, the batch size, and the dropout rate are set to 5e-5, 32, and 0.1. In all the training processes, we used the Adam optimizer [24]. Each experiment is repeated 5 times, and we report the average results of the 5 runs. 5.3

Baselines

To verify the effectiveness of RGFI, we select some famous SOTA domain adaptation methods as our baselines. The selected methods are as follows: – DP [18]: The unsupervised Double Propagation approach. – Hier-Joint [7]: A syntactic rule-based auxiliary tasks based on RNN. – RNSCN [8]: A domain adaptation method incorporates syntactic structures based on recursive neural structural correspondence network. – AD-SAL [9]: A selective adversarial learning method for cross-domain ABSA. – BERTB and BERTE : directly fine-tuning BERTB and BERTE on the labeled source domain data. BERTB is the uncased base model from [12]. BERTE is another uncased base model introduced in [25] and is pre-trained on Ecommerce reviews. – BERTB -UDA and BERTE -UDA [10]: the method UDA unifies feature-based and instance-based domain adaptation based on BERTB and BERTE . – BERTB -MSRG-X and BERTE -MSRG-X [11]: our other work aims to generate labeled target-domain reviews. X refers the training strategies. – BERTB -RGFI-X and BERTE -RGFI-X: the RGFI-based approaches proposed in this paper. X also refers the training strategies. 5.4

Experiment Results & Analysis

The comparison results on cross-domain End2End ABSA are shown in Table 2, from which we can observe that our RGFI-based approaches generally perform better than the SOTA DA method (i.e., MSRG) on most transfer pairs for crossdomain E2E-ABSA. As expected, RGFI significantly improves under all the training strategies compared to MSRG or UDA. Moreover, based on BERTE , RGFI outperforms MSRG by 0.69% on average using the BiMerge training strategy. All these observations significantly demonstrate the effectiveness of our RGFI approach and the usefulness of combining the feature and instance-based domain adaptation on the basis of MSRG. Besides, we also report the experiment results on cross-domain AE in Table 3. It is easy to find that the overall trend of the performance of each method is similar to their performance in cross-domain E2E-ABSA. RGFI still achieves significantly better performance weather compared to the UDA or the SOTA methods MSRG and obtains the best performance and outperforms MSRG by an absolute improvement of 0.76% based on BERTE in the BiMerge Training

822

X. Lv et al.

Strategy respectively. However, from the average results, we can find that the improvements of RGFI in End2End ABSA are lower than those in AE, supposing the reason is that AE is one of the sub-tasks of End2End ABSA. No matter what, all these experiment results significantly demonstrate the effectiveness of our RGFI-based approaches for domain adaptation in ABSA tasks. 5.5

Ablation Study

To investigate the effectiveness of the component of RGFI, we further design and conduct the ablation study over different variants of the proposed model in Table 4 to show the effect of each component. Only Feature represents only using the feature-based domain adaptation, and Only Instance only uses the instancebased domain adaptation. Besides, MSRG means only using the MSRG-based approach. All the methods shown in Table 4 are based on BERTE , respectively. The experiment results in Table 4 show the effect of each component clearly. Compared to UDA and MSRG, RGFI has achieved much better F1 scores on most transfer pairs, which indicates the significant effectiveness of RGFI in reducing domain discrepancy. Table 2. Experiment results of different methods for Cross-Domain End2End ABSA based on Micro-F1. Methods

S→R

L→R D→R R→S

L→S

D→S

R→L S→L

R→D S→D

Average

DP Hier-Joint RNSCN AD-SAL

34.47 31.10 33.21 41.03

34.47 33.54 35.65 43.04

34.47 32.87 34.60 41.01

18.31 15.56 20.04 28.01

18.31 13.90 16.59 27.20

18.31 19.04 20.03 26.62

16.63 20.72 26.63 34.13

16.63 22.65 18.87 27.04

19.03 24.53 33.26 35.44

19.03 23.24 22.00 33.56

22.97 23.72 26.09 33.71

BERTB BERTE BERTE -UDA BERTE -MSRG-Indep BERTE -MSRG-Merge BERTE -MSRG-BiIndep BERTE -MSRG-BiMerge

44.66 51.34 53.97 53.54 54.41 53.59 54.72

40.38 45.40 49.52 56.62 58.09 56.64 58.43

40.32 42.62 51.84 50.86 55.41 50.94 55.75

19.48 24.44 30.67 36.72 39.13 36.75 39.33

25.78 23.28 27.78 41.30 41.88 41.34 42.09

30.31 28.18 34.41 38.58 42.94 38.60 43.21

31.44 39.72 43.95 41.50 42.92 41.56 43.20

30.47 35.04 35.76 34.49 36.35 34.51 36.52

27.55 33.22 40.35 28.39 30.49 28.42 30.69

33.96 33.22 38.05 34.31 37.96 34.35 38.14

32.44 35.65 40.63 41.63 43.96 41.67 44.21

BERTE -RGFI-Indep BERTE -RGFI-Merge BERTE -RGFI-BiIndep BERTE -RGFI-BiMerge

54.83 55.62 55.02 55.86

57.04 59.37 57.72 59.45

55.16 57.93 56.87 59.04

43.32 42.19 43.75 42.28

42.53 42.52 42.25 42.11

44.24 44.67 44.59 44.82

37.19 41.12 37.42 41.19

33.40 36.03 34.64 36.13

27.40 29.75 27.95 30.08

34.55 37.55 34.73 38.06

42.97 44.68 43.49 44.90

RGFI for Cross-Domain Aspect-based Sentiment Analysis

823

Table 3. Experiment results of different methods for Cross-Domain Aspect Extraction (AE) based on Micro-F1. Methods

S→R

L→R D→R R→S

L→S

D→S

R→L S→L

R→D S→D

Average

DP Hier-Joint RNSCN AD-SAL

37.63 46.39 48.89 52.05

37.63 48.61 52.19 56.12

37.63 42.96 50.39 51.55

19.74 27.18 30.41 39.02

19.74 25.22 31.21 38.26

19.74 29.28 35.50 36.11

19.79 34.11 47.23 45.01

19.79 33.02 34.03 35.99

21.82 34.81 46.16 43.76

21.82 35.00 32.41 41.21

25.53 35.66 40.84 43.91

BERTB BERTE BERTE -UDA BERTE -MSRG-Indep BERTE -MSRG-Merge BERTE -MSRG-BiIndep BERTE -MSRG-BiMerge

54.29 57.56 59.07 61.49 63.48 61.57 63.75

46.74 50.42 55.24 67.06 68.98 67.13 69.31

44.63 45.71 56.40 56.57 59.79 56.64 60.06

22.31 26.50 34.21 44.66 45.74 44.70 45.90

30.66 25.96 30.68 53.48 52.53 53.52 52.71

33.33 30.40 38.25 41.28 45.76 41.32 45.92

37.02 44.18 54.00 50.20 53.69 50.26 53.91

36.88 41.78 44.25 43.72 43.32 43.76 43.46

32.03 35.98 42.40 32.68 34.08 32.72 34.09

38.06 35.13 40.83 37.04 40.23 37.06 40.49

37.60 39.36 45.53 48.82 50.76 48.87 50.96

BERTE -RGFI-Indep BERTE -RGFI-Merge BERTE -RGFI-BiIndep BERTE -RGFI-BiMerge

61.91 63.84 61.97 64.01

67.45 69.79 67.72 70.02

60.63 62.79 61.35 62.81

51.57 50.11 51.71 50.26

54.29 53.01 54.43 53.22

48.66 49.09 48.54 49.25

46.03 52.32 46.21 52.35

41.38 42.45 41.60 42.52

30.91 33.42 31.01 33.58

36.45 38.88 37.02 39.14

49.93 51.53 50.16 51.72

Table 4. Ablation study of our RGFI approach for cross-domain End2End ABSA. ABSA

BERTE Only Feature Only Instance UDA

MSRG RGFI

S→R L→R D→R R→S L→S D→S R→L S→L R→D S→D

51.34 45.40 42.62 24.44 23.28 28.18 39.72 35.04 33.22 33.22

Average 35.65

6

53.09 49.79 50.67 27.09 24.51 35.89 41.93 35.17 37.79 38.45

51.05 48.08 46.22 25.01 25.92 34.21 40.52 34.33 39.27 36.70

53.97 49.52 51.84 30.67 27.78 34.41 43.95 35.76 40.35 38.05

54.72 58.43 55.75 39.33 42.09 43.21 43.20 36.52 30.69 38.14

55.86 59.45 59.04 42.28 42.11 44.82 41.19 36.13 30.08 38.06

39.44

38.18

40.63

44.21

44.90

Conclusion and Future Work

In this paper, we preliminarily explore the approach of applying the two traditional domain adaptation methods to the review generation and propose a new domain adaptation method named Review Generation Combined with Feature and Instance-based Domain Adaptation (RGFI) for cross-domain ABSA. To implement RGFI, we propose a three-parts method and state its principle and implementation details. In addition, to verify the effect of RGFI, we further design and conduct a series of experiments including the ablation study to research its effectiveness and performance in domain adaptation. Compared with other existing methods, experiments on four benchmarks demonstrate the sig-

824

X. Lv et al.

nificant and stable effectiveness of our RGFI-based approaches for cross-domain End2End ABSA and cross-domain AE tasks. However, although RGFI achieves some progress, the progress is not very obvious. Exploring a method that can better combine review generation and the two traditional domain adaptation methods is a promising work in the future. Acknowledgement. This paper’s work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 328202203, 20230045Z0114), and China Postdoctoral Science Foundation funded project (Grant No. 2019M650606).

References 1. Liu, B.: Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press (2020) 2. Pontiki, M., et al.: Semeval-2016 task 5: Aspect based sentiment analysis. In: SemEval-2016 (2016) 3. Xu, H., Liu, B., Shu, L., Yu, P.S.: Double embeddings and CNN-based sequence labeling for aspect extraction. In: ACL (2018) 4. Li, X., Bing, L., Li, P., Lam, W.: A unified model for opinion target extraction and target sentiment prediction. In: AAAI (2019a) 5. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: ACL (2007) 6. Dredze, M., Kulesza, A., Crammer, K.: Multi-domain learning by confidenceweighted parameter combination. Mach. Learn. 79(1–2), 123–149 (2010) 7. Ding, Y., Yu, J., Jiang, J.: Recurrent neural networks with auxiliary labels for crossdomain opinion target extraction. In: AAAI (2017) 8. Wang, W., Pan, S.J.: Recursive neural structural correspondence network for crossdomain aspect and opinion co-extraction. In: ACL (2018) 9. Li, Z., Li, X., Wei, Y., Bing, L., Zhang, Y., Yang, Q.: Transferable end-to-end aspect-based sentiment analysis with selective adversarial learning. In: EMNLP (2019b) 10. Gong, C., Yu, J., Xia, R.: Unified feature and instance based domain adaptation for end-to-end aspect-based sentiment analysis. In: EMNLP (2020) 11. Lei, J., Lv, X., Wang, Z., Miao, Z.: Multi-step review generation for cross-domain aspect-based sentiment analysis. In: NLPCC (2023) 12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019) 13. Wang, S., Mazumder, S., Liu, B., Zhou, M., Chang, Y.: Target-sensitive memory networks for aspect sentiment classification. In: ACL (2018) 14. Mitchell, M., Aguilar, J., Wilson, T., Van Durme, B.: Open domain targeted sentiment. In: EMNLP (2013) 15. Zhang, K., Jiao, M., Chen, X., Wang, Z., Liu, B., Liu, L.: Sc-bicapsnet: a sentiment classification model based on bi-channel capsule network. IEEE Access 7, 171801– 171813 (2019) 16. Chen, M., Xu, Z., Weinberger, K., Sha, F.: Marginalized denoising autoencoders for domain adaptation. In: ICML (2012) 17. Li, Z., Wei, Y., Zhang, Y., Yang, Q.: Hierarchical attention transfer network for cross-domain sentiment classification. In: AAAI (2018b)

RGFI for Cross-Domain Aspect-based Sentiment Analysis

825

18. Qiu, G., Liu, B., Bu, J., Chen, C.: Opinion word expansion and target extraction through double propagation. Comput. Linguist. 37(1), 9–27 (2011) 19. Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., Manandhar, A.: Semeval-2014 task 4: Aspect based sentiment analysis. In: SemEval-2014 (2014) 20. Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., Androutsopoulos, I.: Semeval-2015 task 12: Aspect based sentiment analysis. In: SemEval-2015 (2015) 21. Toprak, C., Jakob, N., Gurevych, I.: Sentence and expression level annotation of opinions in user-generated discourse. In: ACL (2010) 22. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: SIGKDD (2004) 23. Yu, J., Gong, C., Xia, R.: Cross-domain review generation for aspect-based sentiment analysis. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4767–4777 (2021) 24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 25. Xu, H., Liu, B., Shu, L., Yu, P.S.: Bert post-training for review reading comprehension and aspect-based sentiment analysis. In: NAACL (2019)

WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural Language Instruction Xiang Chen1,2,3 , Zheng Li1,3,4 , and Xiaojun Wan1,2,3(B) 1

Wangxuan Institute of Computer Technology, Peking University, Beijing, China {caspar,lizheng2001,wanxiaojun}@pku.edu.cn 2 Center for Data Science, Peking University, Beijing, China 3 The MOE Key Laboratory of Computational Linguistics, Peking University, Beijing, China 4 Schools of Electronic Engineering and Computer Science, Peking University, Beijing, China

Abstract. Text editing, i.e., the process of modifying or manipulating text, is a crucial step in human writing process. In this paper, we study the problem of controlled text editing by natural language instruction. According to a given instruction that conveys the edit intention and necessary information, an original draft text is required to be revised into a target text. Existing automatically constructed datasets for this task are limited because they do not have informative natural language instruction. The informativeness requires the information contained in the instruction to be enough to produce the revised text. To address this limitation, we build and release WikiIns, a high-quality controlled text editing dataset with improved informativeness. We first preprocess the Wikipedia edit history database to extract the raw data (WikiInsRaw). Then we crowdsource high-quality validation and test sets, as well as a small-scale training set (WikiIns-Gold). With the high-quality annotated dataset, we further propose automatic approaches to generate a large-scale “silver” training set (WikiIns-Silver). Finally, we provide some insightful analysis on our WikiIns dataset, including the evaluation results and the edit intention analysis. Our analysis and the experiment results on WikiIns may assist the ongoing research on text editing. The dataset, source code and annotation guideline are available at https:// github.com/CasparSwift/WikiIns. Keywords: Controlled Text Editing

1

· Informativeness

Introduction

Text editing has recently become a new scenario for text generation. Previous works [7,12,13,20] have proposed several text editing architectures for downstream text generation tasks such as text simplification, sentence fusion, and grammatical error correction. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  F. Liu et al. (Eds.): NLPCC 2023, LNAI 14303, pp. 826–837, 2023. https://doi.org/10.1007/978-3-031-44696-2_64

WikiIns

827

Table 1. A simple illustration of the difference between WikiDocEdits and our proposed WikiIns. The words in red are the drastically changed content. The words in orange are the content to be deleted. The words in blue are the content to be inserted. Our instructions are more informative. WikiDocEdits [4] [Instruction] Reword [Source] ByteDance responded by adding a kids-only mode to TikTok which allows music videos to be recorded, but not posted and by removing some accounts and content from those determined to be underage [Target] ByteDance responded by adding a kids-only mode to TikTok which blocks the upload of videos, the building of user profiles, direct messaging, and commenting on other’s videos, while still allowing the viewing and recording of content. WikiIns (this work) [Instruction] Apple USB modem no longer a current model [Source] Apple also sells a variety of computer accessories for Mac computers including the AirPort wireless networking products, Time Capsule, Cinema Display, Magic Mouse, Magic Trackpad, Wireless Keyboard, the Apple Battery Charger and the Apple USB Modem [Target] Apple also sells a variety of computer accessories for Mac computers including the AirPort wireless networking products, Time Capsule, Cinema Display, Magic Mouse, Magic Trackpad, Wireless Keyboard, and the Apple Battery Charger

In this study, we aim to construct a high-quality text editing benchmark to better evaluate the text editing models. Previous works typically evaluate their methods for a particular type of edit, such as text simplification or grammar correction. However, it is inadequate to evaluate methods only on specific edits because there are various categories for human editing [24]. Considering that natural language is the simplest way to express all kinds of edit operations, in this work, we focus on the task of controlled text editing by natural language instruction. The given instructions typically express the intention of edits and also convey the necessary information for editing to the users. Wikipedia edit history [14] is a natural resource to construct the benchmark for controlled text editing by natural language instruction, which contains the old (draft text) and new (revised text) versions of Wikipedia documents and the comments written by the editors. Existing datasets [4,26] follow the automatic data mining process, which directly views the editors’ comments as a proxy for the natural language instruction. However, this automatic way will inherently introduce too much noise. For instance, as illustrated in Table 1, the instruction “reword” fails to give enough information for the word insertions (e.g., “user profiles”, “messaging”, “commenting”). It is infeasible to reproduce the target only given the source and instruction. This case belongs to open-ended text editing tasks rather than controlled text editing tasks, and evaluation is very hard for open-ended text editing tasks.

828

X. Chen et al.

Table 2. Comparisons with existing related datasets based on Wikipedia edit history.

Fig. 1. An overview of dataset construction procedure for proposed WikiIns.

Dataset

Instruction Type

Annotation Informative

Yang et al. [24] Faruqui et al. [5] Zhang et al. [26] Iso et al. [7] Anthonio et al. [1] Faltings et al. [4] Du et al. [3] Spangher et al. [19] Jiang et al. [8] WikiIns (this work)

Taxonomic Atomic Natural Language Fact Triples Natural Language Taxonomic Taxonomic Taxonomic Natural Language

         

         

To address this problem, we propose WikiIns, a high-quality controlled text editing dataset with informative natural language instruction. Table 2 demonstrates the differences between WikiIns and existing datasets. In this paper, we design a reproducible methodology to improve informativeness. First, based on a small subset of the corpus, we employ human annotators to manually filter the noisy samples and rewrite the instructions that are not informative. In this way, we crowdsource high-quality validation and test datasets, as well as a relatively small training dataset for controlled text editing. Then we use a model trained on this small dataset to automatically filter the noisy samples on the entire corpus, which produces a large-scale silver training set. Figure 1 presents an overview of the data process methodology. Furthermore, we conduct extensive experiments and analyses on WikiIns, proving the quality improvement over previous datasets. We also implement some prevailing text editing models as well as some conventional seq2seq models as strong baselines. We believe that WikiIns is valuable for researchers to explore text editing models. Our contributions can be summarized as follows: – Propose and formulate the “Controlled Text Editing” as a standalone task. – Build and release WikiIns, a dataset for controlled text editing with informative natural language instructions. – Conduct extensive experiments and empirical analysis on WikiIns to provide an in-depth understanding of the dataset, which may aid future research for better text editing model design.

2

Related Work

Datasets for Text Editing. Wikipedia edit history is an important resource to build text editing datasets. Yang et al. [24] constructed a dataset to classify the edits of English Wikipedia into 13 categories of edit intentions. Faruqui et al. [5] focused on the atomic edits where the instruction is a single span of text. Zhang et al. [26] focused on predicting the locations of edits through the comments and identify the most relevant comment for an edit. Anthonio et al. [1] extracted data from wikiHow and further investigated the edits by manual annotation.

WikiIns

829

Faltings et al. [4] proposed a new text editing task that aims to edit text by the user’s command and the grounding. However, as shown in Table 2, none of these works has human-annotated natural language instruction and follows the informativeness rule. To address these limitations, we propose WikiIns for controlled text editing with informative instructions. Models for Text Editing. Text editing models have recently attracted widespread attention. Gu et al. [6] proposed Levenshtein Transformer (LevT) with nonautoregressive decoding by insertion and deletion. Malmi et al. [13] proposed LaserTagger to predict text editing operations over a fixed set. Mallinson et al. [12] proposed FELIX, which contains two components for tagging and insertion. The tagging model is based on a BERT [2] encoder and pointer network [22]. The insertion model uses another BERT to predict masked tokens. Other general text editing models include Seq2Edits [20], CopySpan [15], and EDIT5 [11].

3

Problem Formulation

For the task of controlled text editing investigated in this study, each data sample consists of: (1) Source: the source draft text X, (2) Instruction: an informative instruction for text editing Z, (3) Target: the revised text after editing operations Y . We formulate Controlled Text as a task to generate the target NEditing Y p(yt |y