Computational Linguistics and Intelligent Text Processing: 20th International Conference, CICLing 2019, La Rochelle, France, April 7–13, 2019, Revised Selected Papers, Part II
ISBN 3031243390, 9783031243394

The two-volume set LNCS 13451 and 13452 constitutes revised selected papers from the CICLing 2019 conference, which took place in La Rochelle, France, on April 7–13, 2019.


English, 682 [683] pages, 2023


Table of contents :
Preface
Organization
Contents – Part II
Contents – Part I
Named Entity Recognition
Neural Named Entity Recognition for Kazakh
1 Introduction
2 Related Work
3 Named Entity Features
4 The Neural Networks
4.1 Mapping Words and Tags into Feature Vectors
4.2 Tensor Layer
4.3 Tag Inference
5 Experiments
5.1 Data-Set
5.2 Model Setup
5.3 Results
6 Conclusions
References
An Empirical Data Selection Schema in Annotation Projection Approach
1 Introduction
2 Related Work
3 Method
3.1 Problems of Previous Method
3.2 Our Method
4 Experiments
4.1 Data Sets and Evaluating Methods
4.2 Results
5 Conclusion
References
Toponym Identification in Epidemiology Articles – A Deep Learning Approach
1 Introduction
2 Previous Work
3 Our Proposed Model
3.1 Embedding Layer
3.2 Deep Feed Forward Neural Network
4 Experiments and Results
4.1 Effect of Domain Specific Embeddings
4.2 Effect of Linguistic Features
4.3 Effect of Window Size
4.4 Effect of the Loss Function
4.5 Use of Lemmas
5 Discussion
6 Conclusion and Future Work
References
Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary
1 Introduction
2 Related Work
3 Baseline Method
4 Proposed Method
5 Experiments
5.1 Datasets
5.2 Methods
5.3 Pre-trained Word Embeddings
5.4 Experimental Results and Discussion
6 Conclusion
References
Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain
1 Introduction
2 Related Work
3 Dataset
4 Approach
4.1 Ontology Creation
4.2 Algorithm: Identify Entity with Its Biological Type
5 Experimental Setup and Results
6 Conclusion
References
A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition
1 Introduction
2 Related Work
2.1 General and Domain-Specific NER
2.2 Types of Supervision in NER
2.3 Unsupervised Word Segmentation and Part-of-Speech Induction
3 Proposed Method
3.1 Task Setting
3.2 Model Overview
3.3 Semi-Markov CRF with a Partially Labeled Corpus
3.4 PYHSMM
3.5 PYHSCRF
4 Experimentals
4.1 Data
4.2 Training Settings
4.3 Baselines
4.4 Results and Discussion
5 Conclusion
References
Semantics and Text Similarity
Spectral Text Similarity Measures
1 Introduction
2 Related Work
3 Similarity Measure/Matrix Norms
4 Document Similarity Measure Based on the Spectral Radius
5 Spectral Norm
6 Application Scenarios
6.1 Market Segmentation
6.2 Translation Matching
7 Evaluation
8 Discussion
9 Supervised Learning
10 Conclusion
A Example Contest Answer
References
A Computational Approach to Measuring the Semantic Divergence of Cognates
1 Introduction
1.1 Related Work
1.2 Contributions
2 The Method
2.1 Cross-Lingual Word Embeddings
2.2 Cross-Language Semantic Divergence
2.3 Detection and Correction of False Friends
3 Conclusions
References
Triangulation as a Research Method in Experimental Linguistics
1 Introduction
2 Methodology
2.1 Semantic Research and Experiment
2.2 Expert Evaluation Method in Linguistic Experiment
3 Conclusions
References
Understanding Interpersonal Variations in Word Meanings via Review Target Identification
1 Introduction
2 Related Work
3 Personalized Word Embeddings
3.1 Reviewer-Specific Layers for Personalization
3.2 Reviewer-Universal Layers
3.3 Multi-task Learning of Target Attribute Predictions for Stable Training
3.4 Training
4 Experiments
4.1 Settings
4.2 Overall Results
4.3 Analysis
5 Conclusions
References
Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation
1 Introduction
2 VerbNet and FrameNet as Linguistic Resources for Analysis
2.1 VerbNet
2.2 FrameNet
2.3 VerbNet and FrameNet in Comparison
3 Basic Statistical Analysis
4 Advanced Statistical Analysis
4.1 Distribution of Verbs per Class in VN and FN
4.2 Distribution of Roles per Class in VN and FN
4.3 General Analysis and Evaluation
5 Hybrid Role-Scalar Approach
5.1 Hypothesis: Roles Are Not Sufficient for Verb Representation
5.2 Scale Representation
6 Conclusion
References
Sentiment Analysis
Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis
1 Introduction
2 Related Work
2.1 General Embedding
2.2 Chinese Embedding
3 Model
3.1 Textual Embedding
3.2 Training Visual Features
3.3 Learning Phonetic Features
3.4 Sentence Modeling
3.5 Fusion of Modalities
4 Experiments and Results
4.1 Experimental Setup
4.2 Experiments on Unimodality
4.3 Experiments on Fusion of Modalities
4.4 Validating Phonetic Feature
4.5 Visualization of the Representation
4.6 Who Contributes to the Improvement?
5 Conclusion
References
Sentiment-Aware Recommendation System for Healthcare Using Social Media
1 Introduction
1.1 Problem Definition
1.2 Motivation
1.3 Contributions
2 Related Works
3 Proposed Framework
3.1 Sentiment Classification
3.2 Top-N Similar Posts Retrieval
3.3 Treatment Suggestion
4 Dataset and Experimental Setup
4.1 Forum Dataset
4.2 Word Embeddings
4.3 Tools Used and Preprocessing
4.4 UMLS Concept Retrieval
4.5 Relevance Judgement for Similar Post Retrieval
5 Experimental Results and Analysis
5.1 Sentiment Classification
5.2 Top-N Similar Post Retrieval
5.3 Treatment Suggestion
6 Conclusion and Future Work
References
Sentiment Analysis Through Finite State Automata
1 Introduction
2 State of the Art
3 Methodology
3.1 Local Grammars and Finite-State Automata
3.2 Sentita and Its Manually-Built Resources
4 Morphology
5 Syntax
5.1 Opinionated Idioms
5.2 Negation
5.3 Intensification
5.4 Modality
5.5 Comparison
5.6 Other Sentiment Expressions
6 Conclusion
References
Using Cognitive Learning Method to Analyze Aggression in Social Media Text
1 Introduction
2 Related Work
3 Methodology
3.1 Dataset
3.2 Pre-processing
3.3 Feature Extraction
4 Experiments and Results
4.1 Experimental Setup
4.2 Result
4.3 Discussion and Analysis
5 Conclusion and Future Work
References
Opinion Spam Detection with Attention-Based LSTM Networks
1 Introduction
2 Related Work
2.1 Opinion Spam Detection
2.2 Deep Learning for Sentiment Analysis
2.3 Attention Mechanisms
3 Methodology
3.1 Attention-Based LSTM Model
4 Experiments
5 Results and Analysis
5.1 All Three-Domain Results
5.2 In-domain Results
5.3 Cross-domain Results
5.4 Comparison with Previous Work
6 Conclusion and Future Work
References
Multi-task Learning for Detecting Stance in Tweets
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Task Formulation
3.2 Multi-task Learning
3.3 Model Details
4 Experiments
4.1 Dataset
4.2 Training Details
4.3 Baselines
4.4 Evaluation Metrics
4.5 Results
4.6 Ablation Study
4.7 Importance of Regularization
4.8 Effect on Regularization Strength ()
4.9 Case-Study and Error Analyses
5 Conclusion
References
Related Tasks Can Share! A Multi-task Framework for Affective Language
1 Introduction
2 Related Work
3 Proposed Methodology
3.1 Hand-Crafted Features
3.2 Word Embeddings
4 Experiments and Results
4.1 Dataset
4.2 Preprocessing
4.3 Experiments
4.4 Error Analysis
5 Conclusion
References
Sentiment Analysis and Sentence Classification in Long Book-Search Queries
1 Introduction
2 Related Work
3 User Queries
4 Sentiment Intensity
5 Reviews Language Model
6 Analysing Scores
6.1 Sentiment Intensity, Perplexity and Usefulness Correlation
6.2 Sentiment Intensity, Perplexity and Information Type Correlation
6.3 Graphs Interpretation
7 Conclusion and Future Work
References
Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media
1 Introduction
1.1 Tasks Description
1.2 Systems Overview
2 Related Work
3 Datasets
3.1 Twitter Datasets
3.2 Targeted Entity Sentiment Datasets
3.3 News Tonality Datasets
4 Evaluation and Results
4.1 Baselines
4.2 Twitter Sentiment Analysis
4.3 Tonality in News
4.4 Targeted Sentiment Analysis
4.5 Error Analysis
5 Conclusion
References
Sentiment Analysis of Influential Messages for Political Election Forecasting
1 Introduction
2 Related Works
2.1 Sentiment Analysis
2.2 Election Forcasting Approaches
3 Proposed Method
3.1 Data Collection
3.2 Feature Generation
3.3 Influential Classifier Construction
3.4 Election Outcome Prediction Model
4 Results and Findings
4.1 Learning Quality
4.2 Features Quality
4.3 Predicting Election Outcome Quality
5 Conclusion
References
Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments
1 Introduction
1.1 Emotion Modeling
1.2 Multi-label Emotion Mining Approaches
1.3 Problem Transformation Methods
1.4 Algorithmic Adaptation Methods
2 Baseline Models
3 Experiment Models
3.1 A Cost Sensitive RankSVM Model
3.2 A Deep Learning Model
3.3 Loss Function Choices
4 Experiments
4.1 Data Set Preparation
4.2 Feature Sets
4.3 Evaluation Metrics
4.4 Quantifying Imbalance in Labelsets
4.5 Mic/Macro F-Measures
5 Results Analysis
5.1 Performance with Regard to F-Measures
5.2 Performance with Regard to Data Imbalance
5.3 Confusion Matrices
6 Conclusion and Future Work
References
Generating Word and Document Embeddings for Sentiment Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Corpus-Based Approach
3.2 Dictionary-Based Approach
3.3 Supervised Contextual 4-Scores
3.4 Combination of the Word Embeddings
3.5 Generating Document Vectors
4 Datasets
5 Experiments
5.1 Preprocessing
5.2 Hyperparameters
5.3 Results
6 Conclusion
References
Speech Processing
Speech Emotion Recognition Using Spontaneous Children's Corpus
1 Introduction
2 Methods
2.1 Data
2.2 Feature Selection
2.3 The i-Vector Paradigm
2.4 Classification Approaches
3 Results
4 Conclusion
References
Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances
1 Introduction
1.1 Background
2 Methodology
2.1 Data Collection and Annotation
2.2 Detecting Utterance-Level Intent Types
3 Experiments and Results
3.1 Utterance-Level Intent Detection Experiments
3.2 Slot Filling and Intent Keyword Extraction Experiments
3.3 Speech-to-Text Experiments for AMIE: Training and Testing Models on ASR Outputs
4 Discussion and Conclusion
References
Audio Summarization with Audio Features and Probability Distribution Divergence
1 Introduction
2 Audio Summarization
3 Probability Distribution Divergence for Audio Summarization
3.1 Audio Signal Pre-processing
3.2 Informativeness Model
3.3 Audio Summary Creation
4 Experimental Evaluation
4.1 Results
5 Conclusions
References
Multilingual Speech Emotion Recognition on Japanese, English, and German
1 Introduction
2 Methods
2.1 Emotional Speech Data
2.2 Classification Approaches
2.3 Shifted Delta Cepstral (SDC) Coefficients
2.4 Feature Extraction
2.5 Evaluation Measures
3 Results
3.1 Spoken Language Identification Using Emotional Speech Data
3.2 Emotion Recognition Based on a Two-Level Classification Scheme
3.3 Emotion Recognition Using Multilingual Emotion Models
4 Discussion
5 Conclusions
References
Text Categorization
On the Use of Dependencies in Relation Classification of Text with Deep Learning
1 Introduction
2 A Syntactical Word Embedding Taking into Account Dependencies
3 Two Models for Relation Classification Using Syntactical Dependencies
3.1 A CNN Based Relation Classification Model (CNN)
3.2 A Compositional Word Embedding Based Relation Classification Model (FCM)
4 Experiments
4.1 SemEVAL 2010 Corpus
4.2 Employed Word Embeddings
4.3 Experiments with the CNN Model
4.4 Experiments with the FCM Model
4.5 Discussion
5 Conclusion
References
Multilingual Fake News Detection with Satire
1 Introduction
2 Experimental Framework and Results
2.1 Text Resemblance
2.2 Domain Type Detection
2.3 Classification Results
2.4 Result Analysis
3 Conclusion
References
Active Learning to Select Unlabeled Examples with Effective Features for Document Classification
1 Introduction
2 Related Works
2.1 Active Learning
2.2 Uncertainty Sampling
3 Proposed Method
4 Experiments
4.1 Data Set
4.2 Experiments on Active Learning
4.3 Experimental Results
5 Conclusion
References
Effectiveness of Self Normalizing Neural Networks for Text Classification
1 Introduction
2 Related Work
3 Self-Normalizing Neural Networks
3.1 Input Normalization
3.2 Initialization
3.3 SELU Activations
3.4 Alpha Dropout
4 Model
4.1 Word Embeddings are Not Normalized
4.2 ELU Activation as an Alternative to SELU
4.3 Model Architecture
5 Experiments and Datasets
5.1 Datasets
5.2 Baseline Models
5.3 Model Parameters
5.4 Training
6 Results and Discussion
6.1 Results
6.2 Discussion
7 Conclusion
References
A Study of Text Representations for Hate Speech Detection
1 Introduction
2 Problem Definition
3 Related Work
3.1 Text Representations for Hate Speech
3.2 Classification Approaches
4 Study and Proposed Method
4.1 Text Representations
4.2 Classification Methods
5 Experiments and Results
5.1 Datasets and Experimental Setup
5.2 Results
5.3 Significance Testing
5.4 Discussion
6 Conclusion and Future Work
References
Comparison of Text Classification Methods Using Deep Learning Neural Networks
1 Introduction
2 Related Work
3 Experimental Evaluation and Analysis
3.1 Non-neural Network Approach
3.2 Analysis of the Experiments
3.3 Comparison Tables
4 Conclusion
References
Acquisition of Domain-Specific Senses and Its Extrinsic Evaluation Through Text Categorization
1 Introduction
2 Acquisition of Domain-Specific Senses
3 Application to Text Categorization
4 Experiments
4.1 Acquisition of Senses
4.2 Text Categorization
5 Related Work
6 Conclusion
References
``News Title Can Be Deceptive'' Title Body Consistency Detection for News Articles Using Text Entailment
1 Introduction
2 Related Work
3 Methodology
3.1 Multilayer Perceptron Model (MLP)
3.2 Convolutional Neural Networks Model (CNN)
3.3 Long Short-Term Memory Model (LSTM)
3.4 Combined CNN and LSTM Model
3.5 Modeling
4 Experiments
4.1 Data
4.2 Experimental Setup
4.3 Results and Discussion
4.4 Error Analysis
5 Conclusion and Future Work
References
Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog
1 Introduction
2 Related Work
3 Conversation Dataset
4 Message Content
5 Groups over Time
6 Conversation Interaction
7 Model
8 Features
9 Experiments
10 Results
11 Conclusion
References
Computing Classifier-Based Embeddings with the Help of Text2ddc
1 Introduction
2 Related Work
3 Model
3.1 Step 1 and 2: Word Sense Disambiguation
3.2 Step 3: Classifier
3.3 Step 4: Classification Scheme
4 Experiment
4.1 Evaluating text2ddc
4.2 Evaluating CaSe
5 Discussion
5.1 Error Analysis
6 Conclusion
References
Text Generation
HanaNLG: A Flexible Hybrid Approach for Natural Language Generation
1 Introduction
2 Related Work
3 HanaNLG: Our Proposed Approach
3.1 Preprocessing
3.2 Vocabulary Selection
3.3 Sentence Generation
3.4 Sentence Ranking
3.5 Sentence Inflection
4 Experiments
4.1 NLG for Assistive Technologies
4.2 NLG for Opinionated Sentences
5 Evaluation and Results
6 Conclusions
References
MorphoGen: Full Inflection Generation Using Recurrent Neural Networks
1 Introduction
2 Datasets
3 MorphoGen Architecture
4 Generation Experiments
5 Results
6 Conclusions
References
EASY: Evaluation System for Summarization
1 Introduction
2 EASY System Design
2.1 Summarization Quality Metrics
2.2 Baselines
3 Implementation Details
3.1 Input Selection
3.2 Metrics
3.3 Baselines
3.4 Correlation of Results
4 Availability and Reproducibility
5 Conclusions
References
Performance of Evaluation Methods Without Human References for Multi-document Text Summarization
1 Introduction
2 Related Work
2.1 ROUGE-N
2.2 ROUGE-L
2.3 ROUGE-S y ROUGE-SU
3 Evaluation Methods
3.1 Manual Methods
3.2 Automatic Methods
4 Proposed Methodology
5 Obtained Results
5.1 Comparison of the State-of-the-Art Evaluation Methods
6 Conclusions and Future Works
References
EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image
1 Introduction
2 Methodology
2.1 Problem Formulation
2.2 EAGLE: An Enhanced Attention-Based Strategy
2.3 Overall Framework
3 Remote Sensing Question Answering Corpus
3.1 Creation Procedure
3.2 Corpus Statistics
4 Experimental Evaluation
4.1 Models Including Ablative Ones
4.2 Dataset
4.3 Evaluation Metrics
4.4 Implementation Details
4.5 Results and Analysis
5 Related Work
5.1 Visual Question Answering (VQA) with Attention
5.2 Associated Datasets
6 Conclusion
References
Text Mining
Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis
1 Introduction
2 Methodology
2.1 Hierarchy of Word Clusters
2.2 Taxonomy-Augmented Features Given a Set of Predefined Words
2.3 Taxonomy-Augmented Features Given the Hierarchy of Word Clusters
3 Experiments
3.1 Datasets
3.2 Experimental Set-Up
3.3 Experimental Results on Document Classification
3.4 Experimental Results on Document Clustering
3.5 Semantic Analysis
4 Conclusion
References
Adversarial Training Based Cross-Lingual Emotion Cause Extraction
1 Introduction
2 Related Work
2.1 Emotion Cause Extraction
2.2 Cross-Lingual Emotion Analysis
3 Model
3.1 Task Definition
3.2 Adversarial Training Based Cross-Lingual ECA Model
4 Experiments
4.1 Data Sets
4.2 Experimental Settings and Evaluation Metrics
4.3 Comparisons of Different Methods
4.4 Comparisons of Different Architectures
4.5 Effects of Sampling Methods
4.6 Effects of Different Attention Hops
5 Conclusion and Future Work
References
Techniques for Jointly Extracting Entities and Relations: A Survey
1 Introduction
2 Problem Definition
3 Motivating Example
4 Overview of Techniques
5 Joint Inference Techniques
6 Joint Models
7 Experimental Evaluation
7.1 Datasets
7.2 Evaluation of End-to-End Relation Extraction
7.3 Domain-Specific Entities and Relations
8 Conclusion
References
Simple Unsupervised Similarity-Based Aspect Extraction
1 Introduction
2 Background and Definitions
3 Related Work
4 Simple Unsupervised Aspect Extraction
5 Experimental Design
6 Results and Discussion
7 Conclusion
References
Streaming State Validation Technique for Textual Big Data Using Apache Flink
1 Introduction
2 Preliminaries
2.1 Stateful Stream Processing
2.2 Why Using Apache Flink?
2.3 Apache Flink System
2.4 Core Concepts
3 Design Framework
4 Implementation and Evaluation
4.1 Implementation Setup
4.2 Design of the Implementation
4.3 Experimental Setup
4.4 Results
4.5 Evaluation
4.6 Evaluation Matrices
4.7 Visualization of Results
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
References
Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition
1 Introduction
2 Related Work
3 Keyphrases Extraction
3.1 Candidate Identification
3.2 Candidate Scoring
3.3 Top n-rank Candidates
4 Experiments
4.1 Evaluation Metric
4.2 Datasets
4.3 Results
5 Key-Phrase Extraction Using Portuguese Parliamentary Debates
5.1 Candidates Selection
5.2 Visualisation
6 Conclusion
A Appendix
References
Author Index

LNCS 13452

Alexander Gelbukh (Ed.)

Computational Linguistics and Intelligent Text Processing 20th International Conference, CICLing 2019 La Rochelle, France, April 7–13, 2019 Revised Selected Papers, Part II

Lecture Notes in Computer Science

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA

13452

More information about this series at https://link.springer.com/bookseries/558

Alexander Gelbukh (Ed.)

Computational Linguistics and Intelligent Text Processing 20th International Conference, CICLing 2019 La Rochelle, France, April 7–13, 2019 Revised Selected Papers, Part II

Editor
Alexander Gelbukh, Instituto Politécnico Nacional, Mexico City, Mexico

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-031-24339-4 ISBN 978-3-031-24340-0 (eBook) https://doi.org/10.1007/978-3-031-24340-0 © Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

CICLing 2019 was the 20th International Conference on Computational Linguistics and Intelligent Text Processing. The CICLing conferences provide a wide-scope forum for discussion of the art and craft of natural language processing research, as well as the best practices in its applications.

This set of two books contains three invited papers and a selection of regular papers accepted for presentation at the conference. Since 2001, the proceedings of the CICLing conferences have been published in Springer's Lecture Notes in Computer Science series as volumes 2004, 2276, 2588, 2945, 3406, 3878, 4394, 4919, 5449, 6008, 6608, 6609, 7181, 7182, 7816, 7817, 8403, 8404, 9041, 9042, 9623, 9624, 10761, 10762, 13396, and 13397.

The set has been structured into 14 sections representative of the current trends in research and applications of natural language processing: General; Information Extraction; Information Retrieval; Language Modeling; Lexical Resources; Machine Translation; Morphology, Syntax, Parsing; Named Entity Recognition; Semantics and Text Similarity; Sentiment Analysis; Speech Processing; Text Categorization; Text Generation; and Text Mining.

In 2019 our invited speakers were Preslav Nakov (Qatar Computing Research Institute, Qatar), Paolo Rosso (Universidad Politécnica de Valencia, Spain), Lucia Specia (University of Sheffield, UK), and Carlo Strapparava (Fondazione Bruno Kessler, Italy). They delivered excellent extended lectures and organized lively discussions. Full contributions of these invited talks are included in this book set.

After a double-blind peer review process, the Program Committee selected 95 papers for presentation, out of 335 submissions from 60 countries. To encourage authors to provide algorithms and data along with the published papers, we selected three winners of our Verifiability, Reproducibility, and Working Description Award. The main factors in choosing the awarded submissions were technical correctness and completeness, readability of the code and documentation, simplicity of installation and use, and exact correspondence to the claims of the paper. Unnecessary sophistication of the user interface was discouraged; novelty and usefulness of the results were not evaluated for this award, since they were already evaluated for the paper itself and not for the accompanying data.

The following submissions received the Best Paper Awards, the Best Student Paper Award, the Verifiability, Reproducibility, and Working Description Award, and the other conference awards:

Best Verifiability, Reproducibility, and Working Description Award: "Text Analysis of Resumes and Lexical Choice as an Indicator of Creativity", Alexander Rybalov.

Best Student Paper Award: "Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog", Charles Welch, Veronica Perez-Rosas, Jonathan Kummerfeld, Rada Mihalcea.

Best Presentation Award: "A Framework to Build Quality into Non-expert Translations", Christopher G. Harris.

Best Poster Award, Winner (Shared): "Sentiment Analysis Through Finite State Automata", Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia; and "Toponym Identification in Epidemiology Articles: A Deep Learning Approach", Mohammad Reza Davari, Leila Kosseim, Tien D. Bui.

Best Inquisitive Mind Award: given to the attendee who asked the most (good) questions to the presenters during the conference, Natwar Modani.

Best Paper Award, First Place: "Contrastive Reasons Detection and Clustering from Online Polarized Debates", Amine Trabelsi, Osmar Zaiane.

Best Paper Award, Second Place: "Adversarial Training based Cross-lingual Emotion Cause Extraction", Hongyu Yan, Qinghong Gao, Jiachen Du, Binyang Li, Ruifeng Xu.

Best Paper Award, Third Place (Shared): "EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image", Yeyang Zhou, Yixin Chen, Yimin Chen, Shunlong Ye, Mingxin Guo, Ziqi Sha, Heyu Wei, Yanhui Gu, Junsheng Zhou, Weiguang Qu.

Best Paper Award, Third Place (Shared): "dpUGC: Learn Differentially Private Representation for User Generated Contents", Xuan-Son Vu, Son Tran, Lili Jiang.

A conference is the result of the work of many people. First of all, I would like to thank the members of the Program Committee for the time and effort they devoted to the reviewing of the submitted articles and to the selection process. I also thank the authors for their patience in the preparation of the papers, not to mention the development of the scientific results that form this book. Finally, I express my most cordial thanks to the members of the local Organizing Committee for their considerable contribution to making this conference become a reality.

November 2022

Alexander Gelbukh

Organization

CICLing 2019 (20th International Conference on Computational Linguistics and Intelligent Text Processing) was hosted by the University of La Rochelle (ULR), France, and organized by the L3i laboratory of the University of La Rochelle (ULR), France, in collaboration with the Natural Language and Text Processing Laboratory of the CIC, IPN, the Mexican Society of Artificial Intelligence (SMIA), and the NewsEye project. The NewsEye project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 770299. The conference aims to encourage the exchange of opinions between the scientists working in different areas of the growing field of computational linguistics and intelligent text and speech processing.

Program Chair

Alexander Gelbukh – Instituto Politécnico Nacional, Mexico

Organizing Committee

Antoine Doucet (Chair) – University of La Rochelle, France
Nicolas Sidère (Co-chair) – University of La Rochelle, France
Cyrille Suire (Co-chair) – University of La Rochelle, France

Members

Karell Bertet – L3i Laboratory, University of La Rochelle, France
Mickaël Coustaty – L3i Laboratory, University of La Rochelle, France
Salah Eddine – L3i Laboratory, University of La Rochelle, France
Christophe Rigaud – L3i Laboratory, University of La Rochelle, France

Additional Support

Viviana Beltran – L3i Laboratory, University of La Rochelle, France
Jean-Loup Guillaume – L3i Laboratory, University of La Rochelle, France
Marwa Hamdi – L3i Laboratory, University of La Rochelle, France
Ahmed Hamdi – L3i Laboratory, University of La Rochelle, France
Nam Le – L3i Laboratory, University of La Rochelle, France
Elvys Linhares Pontes – L3i Laboratory, University of La Rochelle, France
Muzzamil Luqman – L3i Laboratory, University of La Rochelle, France
Zuheng Ming – L3i Laboratory, University of La Rochelle, France
Hai Nguyen – L3i Laboratory, University of La Rochelle, France
Armelle Prigent – L3i Laboratory, University of La Rochelle, France
Mourad Rabah – L3i Laboratory, University of La Rochelle, France

Program Committee

Alexander Gelbukh – Instituto Politécnico Nacional, Mexico
Leslie Barrett – Bloomberg, USA
Leila Kosseim – Concordia University, Canada
Aladdin Ayesh – De Montfort University, UK
Srinivas Bangalore – Interactions, USA
Ivandre Paraboni – University of São Paulo, Brazil
Hermann Moisl – Newcastle University, UK
Kais Haddar – MIRACL Laboratory, Faculté des Sciences de Sfax, Tunisia
Cerstin Mahlow – ZHAW Zurich University of Applied Sciences, Switzerland
Alma Kharrat – Microsoft, USA
Dafydd Gibbon – Bielefeld University, Germany
Evangelos Milios – Dalhousie University, Canada
Kjetil Nørvåg – Norwegian University of Science and Technology, Norway
Grigori Sidorov – CIC-IPN, Mexico
Hiram Calvo – Nara Institute of Science and Technology, Japan
Piotr W. Fuglewicz – TiP, Poland
Aminul Islam – University of Louisiana at Lafayette, USA
Michael Carl – Kent State University, USA
Guillaume Jacquet – Joint Research Centre, EU
Suresh Manandhar – University of York, UK
Bente Maegaard – University of Copenhagen, Denmark
Tarık Kişla – Ege University, Turkey
Nick Campbell – Trinity College Dublin, Ireland
Yasunari Harada – Waseda University, Japan
Samhaa El-Beltagy – Newgiza University, Egypt
Anselmo Peñas – NLP & IR Group, UNED, Spain
Paolo Rosso – Universitat Politècnica de València, Spain
Horacio Rodriguez – Universitat Politècnica de Catalunya, Spain
Yannis Haralambous – IMT Atlantique & UMR CNRS 6285 Lab-STICC, France
Niladri Chatterjee – IIT Delhi, India
Manuel Vilares Ferro – University of Vigo, Spain
Eva Hajicova – Charles University, Prague, Czech Republic
Preslav Nakov – Qatar Computing Research Institute, HBKU, Qatar
Bayan Abushawar – Arab Open University, Jordan
Kemal Oflazer – Carnegie Mellon University in Qatar, Qatar
Hatem Haddad – iCompass, Tunisia
Constantin Orasan – University of Wolverhampton, UK
Masaki Murata – Tottori University, Japan
Efstathios Stamatatos – University of the Aegean, Greece
Mike Thelwall – University of Wolverhampton, UK
Stan Szpakowicz – University of Ottawa, Canada
Tunga Gungor – Bogazici University, Turkey
Dunja Mladenic – Jozef Stefan Institute, Slovenia
German Rigau – IXA Group, UPV/EHU, Spain
Roberto Basili – University of Roma Tor Vergata, Italy
Karin Harbusch – University Koblenz-Landau, Germany
Elena Lloret – University of Alicante, Spain
Ruslan Mitkov – University of Wolverhampton, UK
Viktor Pekar – University of Birmingham, UK
Attila Novák – Pázmány Péter Catholic University, Hungary
Horacio Saggion – Universitat Pompeu Fabra, Spain
Soujanya Poria – Nanyang Technological University, Singapore
Rada Mihalcea – University of North Texas, USA
Partha Pakray – National Institute of Technology Silchar, India
Alexander Mehler – Goethe-University Frankfurt am Main, Germany
Octavian Popescu – IBM, USA
Hitoshi Isahara – Toyohashi University of Technology, Japan
Galia Angelova – Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria
Pushpak Bhattacharyya – IIT Bombay, India
Farid Meziane – University of Derby, UK
Ales Horak – Masaryk University, Czech Republic
Nicoletta Calzolari – Istituto di Linguistica Computazionale – CNR, Italy
Milos Jakubicek – Lexical Computing, UK
Ron Kaplan – Nuance Communications, USA
Hassan Sawaf – Amazon, USA
Marta R. Costa-Jussà – Institute for Infocomm Research, Singapore
Sivaji Bandyopadhyay – Jadavpur University, India
Yorick Wilks – University of Sheffield, UK
Vasile Rus – University of Memphis, USA
Christian Boitet – Université Grenoble Alpes, France
Khaled Shaalan – The British University in Dubai, UAE
Philipp Koehn – Johns Hopkins University, USA

Software Reviewing Committee

Ted Pedersen
Florian Holz
Miloš Jakubíček
Sergio Jiménez Vargas
Miikka Silfverberg
Ronald Winnemöller

Best Paper Award Selection Committee

Alexander Gelbukh
Eduard Hovy
Rada Mihalcea
Ted Pedersen
Yorick Wilks

Contents – Part II

Named Entity Recognition

Neural Named Entity Recognition for Kazakh . . . 3
Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev, and Rustam Mussabayev

An Empirical Data Selection Schema in Annotation Projection Approach . . . 16
Yun Hu, Mingxue Liao, Pin Lv, and Changwen Zheng

Toponym Identification in Epidemiology Articles – A Deep Learning Approach . . . 26
MohammadReza Davari, Leila Kosseim, and Tien D. Bui

Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary . . . 38
Makoto Hiramatsu, Kei Wakabayashi, and Jun Harashima

Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain . . . 49
Suyash Sangwan, Raksha Sharma, Girish Palshikar, and Asif Ekbal

A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition . . . 61
Suzushi Tomori, Yugo Murawaki, and Shinsuke Mori

Semantics and Text Similarity

Spectral Text Similarity Measures . . . 81
Tim vor der Brück and Marc Pouly

A Computational Approach to Measuring the Semantic Divergence of Cognates . . . 96
Ana-Sabina Uban, Alina Cristea (Ciobanu), and Liviu P. Dinu

Triangulation as a Research Method in Experimental Linguistics . . . . . . . . . . . . . 109 Olga Suleimanova and Marina Fomina Understanding Interpersonal Variations in Word Meanings via Review Target Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Daisuke Oba, Shoetsu Sato, Naoki Yoshinaga, Satoshi Akasaki, and Masashi Toyoda


Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Aliaksandr Huminski, Fiona Liausvia, and Arushi Goel Sentiment Analysis Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Haiyun Peng, Soujanya Poria, Yang Li, and Erik Cambria Sentiment-Aware Recommendation System for Healthcare Using Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 Alan Aipe, N. S. Mukuntha, and Asif Ekbal Sentiment Analysis Through Finite State Automata . . . . . . . . . . . . . . . . . . . . . . . . . 182 Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia Using Cognitive Learning Method to Analyze Aggression in Social Media Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Sayef Iqbal and Fazel Keshtkar Opinion Spam Detection with Attention-Based LSTM Networks . . . . . . . . . . . . . 212 Zeinab Sedighi, Hossein Ebrahimpour-Komleh, Ayoub Bagheri, and Leila Kosseim Multi-task Learning for Detecting Stance in Tweets . . . . . . . . . . . . . . . . . . . . . . . . 222 Devamanyu Hazarika, Gangeshwar Krishnamurthy, Soujanya Poria, and Roger Zimmermann Related Tasks Can Share! A Multi-task Framework for Affective Language . . . . 236 Kumar Shikhar Deep, Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya Sentiment Analysis and Sentence Classification in Long Book-Search Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 Amal Htait, Sébastien Fournier, and Patrice Bellot Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Pavel Pˇribáˇn and Alexandra Balahur Sentiment Analysis of Influential Messages for Political Election Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 Oumayma Oueslati, Moez Ben Hajhmida, Habib Ounelli, and Erik Cambria


Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Nawshad Farruque, Chenyang Huang, Osmar Zaïane, and Randy Goebel Generating Word and Document Embeddings for Sentiment Analysis . . . . . . . . . 307 Cem Rıfkı Aydın, Tunga Güngör, and Ali Erkan Speech Processing Speech Emotion Recognition Using Spontaneous Children’s Corpus . . . . . . . . . . 321 Panikos Heracleous, Yasser Mohammad, Keiji Yasuda, and Akio Yoneyama Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 Eda Okur, Shachi H. Kumar, Saurav Sahay, Asli Arslan Esme, and Lama Nachman Audio Summarization with Audio Features and Probability Distribution Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Carlos-Emiliano González-Gallardo, Romain Deveaud, Eric SanJuan, and Juan-Manuel Torres-Moreno Multilingual Speech Emotion Recognition on Japanese, English, and German . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 Panikos Heracleous, Keiji Yasuda, and Akio Yoneyama Text Categorization On the Use of Dependencies in Relation Classification of Text with Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 Bernard Espinasse, Sébastien Fournier, Adrian Chifu, Gaël Guibon, René Azcurra, and Valentin Mace Multilingual Fake News Detection with Satire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 Gaël Guibon, Liana Ermakova, Hosni Seffih, Anton Firsov, and Guillaume Le Noé-Bienvenu Active Learning to Select Unlabeled Examples with Effective Features for Document Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 Minoru Sasaki Effectiveness of Self Normalizing Neural Networks for Text Classification . . . . . 412 Avinash Madasu and Vijjini Anvesh Rao


A Study of Text Representations for Hate Speech Detection . . . . . . . . . . . . . . . . . 424 Chrysoula Themeli, George Giannakopoulos, and Nikiforos Pittaras Comparison of Text Classification Methods Using Deep Learning Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 Maaz Amjad, Alexander Gelbukh, Ilia Voronkov, and Anna Saenko Acquisition of Domain-Specific Senses and Its Extrinsic Evaluation Through Text Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 Attaporn Wangpoonsarp, Kazuya Shimura, and Fumiyo Fukumoto “News Title Can Be Deceptive” Title Body Consistency Detection for News Articles Using Text Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 Tanik Saikh, Kingshuk Basak, Asif Ekbal, and Pushpak Bhattacharyya Look Who’s Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 Charles Welch, Verónica Pérez-Rosas, Jonathan K. Kummerfeld, and Rada Mihalcea Computing Classifier-Based Embeddings with the Help of Text2ddc . . . . . . . . . . 491 Tolga Uslu, Alexander Mehler, and Daniel Baumartz Text Generation HanaNLG: A Flexible Hybrid Approach for Natural Language Generation . . . . . 507 Cristina Barros and Elena Lloret MorphoGen: Full Inflection Generation Using Recurrent Neural Networks . . . . . 520 Octavia-Maria S¸ ulea, Steve Young, and Liviu P. Dinu EASY: Evaluation System for Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529 Marina Litvak, Natalia Vanetik, and Yael Veksler Performance of Evaluation Methods Without Human References for Multi-document Text Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546 Alexis Carriola Careaga, Yulia Ledeneva, and Jonathan Rojas Simón EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image . . . . . . . . . . . . . . . . . . . . . . . 558 Yeyang Zhou, Yixin Chen, Yimin Chen, Shunlong Ye, Mingxin Guo, Ziqi Sha, Heyu Wei, Yanhui Gu, Junsheng Zhou, and Weiguang Qu


Text Mining Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Sattar Seifollahi and Massimo Piccardi Adversarial Training Based Cross-Lingual Emotion Cause Extraction . . . . . . . . . 587 Hongyu Yan, Qinghong Gao, Jiachen Du, Binyang Li, and Ruifeng Xu Techniques for Jointly Extracting Entities and Relations: A Survey . . . . . . . . . . . 602 Sachin Pawar, Pushpak Bhattacharyya, and Girish K. Palshikar Simple Unsupervised Similarity-Based Aspect Extraction . . . . . . . . . . . . . . . . . . . 619 Danny Suarez Vargas, Lucas R. C. Pessutto, and Viviane Pereira Moreira Streaming State Validation Technique for Textual Big Data Using Apache Flink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632 Raheela Younas and Amna Qasim Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648 Miguel Won, Bruno Martins, and Filipa Raimundo Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671

Contents – Part I

General

Visual Aids to the Rescue: Predicting Creativity in Multimodal Artwork . . . 3
Carlo Strapparava, Serra Sinem Tekiroglu, and Gözde Özbal

Knowledge-Based Techniques for Document Fraud Detection: A Comprehensive Study . . . 17
Beatriz Martínez Tornés, Emanuela Boros, Antoine Doucet, Petra Gomez-Krämer, Jean-Marc Ogier, and Vincent Poulain d'Andecy

Exploiting Metonymy from Available Knowledge Resources . . . 34
Itziar Gonzalez-Dios, Javier Álvez, and German Rigau

Robust Evaluation of Language–Brain Encoding Experiments . . . 44
Lisa Beinborn, Samira Abnar, and Rochelle Choenni

Connectives with Both Arguments External: A Survey on Czech . . . 62
Lucie Poláková and Jiří Mírovský

Recognizing Weak Signals in News Corpora . . . 73
Daniela Gifu

Low-Rank Approximation of Matrices for PMI-Based Word Embeddings . . . 86
Alena Sorokina, Aidana Karipbayeva, and Zhenisbek Assylbekov

Text Preprocessing for Shrinkage Regression and Topic Modeling to Analyse EU Public Consultation Data . . . 95
Nada Mimouni and Timothy Yu-Cheong Yeung

Intelligibility of Highly Predictable Polish Target Words in Sentences Presented to Czech Readers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Klára Jágrová and Tania Avgustinova Information Extraction Multi-lingual Event Identification in Disaster Domain . . . . . . . . . . . . . . . . . . . . . . 129 Zishan Ahmad, Deeksha Varshney, Asif Ekbal, and Pushpak Bhattacharyya


Detection and Analysis of Drug Non-compliance in Internet Fora Using Information Retrieval Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Sam Bigeard, Frantz Thiessard, and Natalia Grabar Char-RNN and Active Learning for Hashtag Segmentation . . . . . . . . . . . . . . . . . . 155 Taisiya Glushkova and Ekaterina Artemova Extracting Food-Drug Interactions from Scientific Literature: Relation Clustering to Address Lack of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Tsanta Randriatsitohaina and Thierry Hamon Contrastive Reasons Detection and Clustering from Online Polarized Debates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Amine Trabelsi and Osmar R. Zaïane Visualizing and Analyzing Networks of Named Entities in Biographical Dictionaries for Digital Humanities Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Minna Tamper, Petri Leskinen, and Eero Hyvönen Unsupervised Keyphrase Extraction from Scientific Publications . . . . . . . . . . . . . 215 Eirini Papagiannopoulou and Grigorios Tsoumakas Information Retrieval Retrieving the Evidence of a Free Text Annotation in a Scientific Article: A Data Free Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Julien Gobeill, Emilie Pasche, and Patrick Ruch Salience-Induced Term-Driven Serendipitous Web Exploration . . . . . . . . . . . . . . . 247 Yannis Haralambous and Ehoussou Emmanuel N’zi Language Modeling Two-Phased Dynamic Language Model: Improved LM for Automated Language Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 Debajyoty Banik, Asif Ekbal, and Pushpak Bhattacharyya Composing Word Vectors for Japanese Compound Words Using Dependency Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 Kanako Komiya, Takumi Seitou, Minoru Sasaki, and Hiroyuki Shinnou Microtext Normalization for Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Ranjan Satapathy, Erik Cambria, and Nadia Magnenat Thalmann


Building Personalized Language Models Through Language Model Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Milton King and Paul Cook dpUGC: Learn Differentially Private Representation for User Generated Contents (Best Paper Award, Third Place, Shared) . . . . . . . . . . . . . . . . . . . . . . . . . . 316 Xuan-Son Vu, Son N. Tran, and Lili Jiang Multiplicative Models for Recurrent Language Modeling . . . . . . . . . . . . . . . . . . . . 332 Diego Maupomé and Marie-Jean Meurs Impact of Gender Debiased Word Embeddings in Language Modeling . . . . . . . . 342 Christine Basta and Marta R. Costa-jussà Initial Explorations on Chaotic Behaviors of Recurrent Neural Networks . . . . . . 351 Bagdat Myrzakhmetov, Rustem Takhanov, and Zhenisbek Assylbekov Lexical Resources LingFN: A Framenet for the Linguistic Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Shafqat Mumtaz Virk, Per Klang, Lars Borin, and Anju Saxena SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation . . . . . . . . . . . . . . . . . . . . . . 380 Albina Khusainova, Adil Khan, and Adín Ramírez Rivera Cross-Lingual Transfer for Distantly Supervised and Low-Resources Indonesian NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 Fariz Ikhwantri Phrase-Level Simplification for Non-native Speakers . . . . . . . . . . . . . . . . . . . . . . . 406 Gustavo H. Paetzold and Lucia Specia Automatic Creation of a Pharmaceutical Corpus Based on Open-Data . . . . . . . . . 432 Cristian Bravo, Sebastian Otálora, and Sonia Ordoñez-Salinas Fool’s Errand: Looking at April Fools Hoaxes as Disinformation Through the Lens of Deception and Humour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 Edward Dearden and Alistair Baron Russian Language Datasets in the Digital Humanities Domain and Their Evaluation with Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468 Gerhard Wohlgenannt, Artemii Babushkin, Denis Romashov, Igor Ukrainets, Anton Maskaykin, and Ilya Shutov


Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French . . . . . . . . . . . . . . . . . . . 480 Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, Delphine Battistelli, and Nicolas Béchet Machine Translation Evaluating Terminology Translation in MT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 Rejwanul Haque, Mohammed Hasanuzzaman, and Andy Way Detecting Machine-Translated Paragraphs by Matching Similar Words . . . . . . . . 521 Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, and Shinsaku Kiyomoto Improving Low-Resource NMT with Parser Generated Syntactic Phrases . . . . . . 533 Kamal Kumar Gupta, Sukanta Sen, Asif Ekbal, and Pushpak Bhattacharyya How Much Does Tokenization Affect Neural Machine Translation? . . . . . . . . . . . 545 Miguel Domingo, Mercedes García-Martínez, Alexandre Helle, Francisco Casacuberta, and Manuel Herranz Take Help from Elder Brother: Old to Modern English NMT with Phrase Pair Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 Sukanta Sen, Mohammed Hasanuzzaman, Asif Ekbal, Pushpak Bhattacharyya, and Andy Way Adaptation of Machine Translation Models with Back-Translated Data Using Transductive Data Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way Morphology, Syntax, Parsing Automatic Detection of Parallel Sentences from Comparable Biomedical Texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583 Rémi Cardon and Natalia Grabar MorphBen: A Neural Morphological Analyzer for Bengali Language . . . . . . . . . 595 Ayan Das and Sudeshna Sarkar CCG Supertagging Using Morphological and Dependency Syntax Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 Luyê.n Ngo.c Lê and Yannis Haralambous


Representing Overlaps in Sequence Labeling Tasks with a Novel Tagging Scheme: Bigappy-Unicrossy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 Gözde Berk, Berna Erden, and Tunga Güngör *Paris is Rain. or It is raining in Paris?: Detecting Overgeneralization of Be-verb in Learner English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636 Ryo Nagata, Koki Washio, and Hokuto Ototake Speeding up Natural Language Parsing by Reusing Partial Results . . . . . . . . . . . . 648 Michalina Strzyz and Carlos Gómez-Rodríguez Unmasking Bias in News . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658 Javier Sánchez-Junquera, Paolo Rosso, Manuel Montes-y-Gómez, and Simone Paolo Ponzetto Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669

Named Entity Recognition

Neural Named Entity Recognition for Kazakh

Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev, and Rustam Mussabayev

Institute of Information and Computational Technologies, Almaty, Kazakhstan
[email protected]

Abstract. We present several neural networks to address the task of named entity recognition for morphologically complex languages (MCL). Kazakh is a morphologically complex language in which each root/stem can produce hundreds or thousands of variant word forms. This nature of the language can lead to a serious data sparsity problem, which may prevent deep learning models from being well trained for under-resourced MCLs. In order to model MCL words effectively, we introduce root and entity tag embeddings plus a tensor layer to the neural networks, both of which significantly improve NER model performance for MCLs. The proposed models outperform the state of the art, including character-based approaches, and can potentially be applied to other morphologically complex languages.

Keywords: Named entity recognition · Morphologically complex language · Kazakh language · Deep learning · Neural network

1 Introduction

Named Entity Recognition (NER) is a vital part of information extraction. It aims to locate and classify named entities in unstructured text, typically into categories such as person, location and organization names. Kazakh is an agglutinative language with complex morphological word structure: each root/stem can produce hundreds or thousands of word forms, which leads to a severe data sparsity problem when identifying entities automatically. To tackle this problem, Tolegen et al. (2016) [24] presented a systematic study of Kazakh NER using conditional random fields (CRF). More specifically, the authors assembled and annotated the Kazakh NER corpus (KNC) and proposed a set of named entity features, exploring their effects. To achieve results for Kazakh NER comparable to the state of the art for other languages, they manually designed feature templates, which in practice is a labor-intensive process that requires considerable expertise.

With the intention of alleviating task-specific feature engineering, there has been increasing interest in using deep learning to solve the NER task for many languages. However, the effectiveness of deep learning for Kazakh NER is still unexplored. One of the aims of this work is to apply deep learning to Kazakh NER in order to avoid task-specific feature engineering and to achieve a new state-of-the-art result. As in similar studies [5], neural networks (NNs) produce high results for English and other languages by using distributed word representations. However, using only surface word representations in deep learning may not be enough to reach state-of-the-art results for under-resourced MCLs. The main reason is that deep learning approaches are data hungry: their performance is strongly correlated with the amount of available training data.

In this paper, we introduce three types of representation for MCL: word, root and entity tag embeddings. To discover how each of these embeddings contributes to model performance independently, we use a simple NN as the baseline for the investigation. We also improve this basic model in two ways. One is to apply a tensor transformation layer to extract multi-dimensional interactions among those representations. The other is to map each entity tag into a vector representation. The results show that the use of root embeddings leads to a significant improvement on the test sets. Our NNs reach good results by transferring intermediate representations learned on large unlabeled data. We compare our NNs with the existing CRF-based NER system for Kazakh [24] and with the bidirectional LSTM-CRF [12] that is considered the state of the art in NER. Our NNs outperform the state of the art, and the results indicate that the proposed NNs can potentially be applied to other morphologically complex languages.

The rest of the paper is organized as follows: Sect. 2 reviews existing work, Sect. 3 describes the named entity features used in this work, Sect. 4 details the neural networks, Sect. 5 reports the experimental results, and Sect. 6 concludes the paper with future work.
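To make these components concrete before they are formally defined in Sect. 4, the following is a minimal PyTorch sketch (not the authors' implementation) of a window-based scorer that concatenates word, root and previous-tag embeddings and adds a bilinear "tensor" interaction term; the dimensions, window size and exact tensor formulation are illustrative assumptions.

import torch
import torch.nn as nn

class WindowTensorNER(nn.Module):
    # Sketch: score entity tags for the centre word of a context window by
    # combining word, root and previous-tag embeddings; a bilinear "tensor"
    # term captures multiplicative interactions between the inputs.
    def __init__(self, n_words, n_roots, n_tags, dim=50, window=2, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)   # surface-form vectors
        self.root_emb = nn.Embedding(n_roots, dim)   # root/stem vectors (reduce sparsity)
        self.tag_emb = nn.Embedding(n_tags, dim)     # embedding of the previous entity tag
        ctx = 2 * window + 1                         # centre word plus `window` words each side
        in_dim = ctx * 2 * dim + dim                 # word + root per position, plus previous tag
        self.linear = nn.Linear(in_dim, hidden)
        self.tensor = nn.Bilinear(in_dim, in_dim, hidden)  # h = tanh(W x + x^T T x)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, word_ids, root_ids, prev_tag_id):
        # word_ids, root_ids: (batch, ctx) index tensors; prev_tag_id: (batch,)
        w = self.word_emb(word_ids).flatten(1)
        r = self.root_emb(root_ids).flatten(1)
        t = self.tag_emb(prev_tag_id)
        x = torch.cat([w, r, t], dim=1)
        h = torch.tanh(self.linear(x) + self.tensor(x, x))
        return self.out(h)                           # unnormalised scores for each entity tag

model = WindowTensorNER(n_words=20000, n_roots=8000, n_tags=9)
scores = model(torch.zeros(1, 5, dtype=torch.long),
               torch.zeros(1, 5, dtype=torch.long),
               torch.zeros(1, dtype=torch.long))
print(scores.shape)  # torch.Size([1, 9])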

2 Related Work

Named Entity Recognition has been studied for several decades, not only for English [4,9,23] but also for other MCLs, including Kazakh [24] and Turkish [20,29]. For instance, Chieu and Ng (2003) [4] presented maximum entropy-based NER systems for English and German, where the authors used both local and global features to enhance their models and achieved good performance. In order to explore the flexibility of four diverse classifiers (hidden Markov model, maximum entropy, transformation-based learning, and a robust linear classifier) for NER, the work in [6] showed that a combined system of these models under different conditions could reduce the F1-score error by a factor of 15 to 21% on English data. As the maximum entropy approach suffers from the label bias problem [11], researchers turned to the CRF model [17] and presented CRF-based NER systems with a number of external features. Such supervised NER systems were extremely sensitive to the selection of an appropriate feature set: in [23], the authors explored various combinations of a set of features (local and


non-local knowledge features) and compared their impact on recognition performance for English. Using a CRF with an optimized feature template, they obtained a 91.02% F1-score on the CoNLL 2003 [22] data set. For Turkish, Yeniterzi (2011) [29] analyzed the effect of morphological features, utilizing a CRF enhanced with several syntactic and contextual features; the model achieved an 88.94% F1-score on Turkish test data. In the same direction, Seker and Eryigit (2012) [20] presented a CRF-based NER system with their own feature set; their final model achieved the highest F1-score (92%). For Kazakh, Tolegen et al. (2016) [24] annotated a Kazakh NER corpus (KNC) and carefully analyzed the effect of morphological (6 features) and word type (4 features) features using CRF. Their results showed that the model could be improved significantly by using morphological features; the final CRF-based NER system achieved an 89.81% F1 on Kazakh test data. In this work, we use this CRF-based NER system as one baseline and compare it with our deep learning models. Recently, deep learning models including biLSTMs have achieved significant success on various natural language processing tasks, such as POS tagging [13,25,26,28], NER [4,10], machine translation [2,8] and word segmentation [10], and in other fields such as speech recognition [1,7,15,16]. As the state of the art in NER, the study in [12] explored various neural architectures, including language-independent character-based biLSTM-CRF models; these models achieved 81.74%, 85.75% and 90.94% on German, Dutch and English, respectively. Our models have several differences compared to other state-of-the-art approaches. One difference is that we introduce root embeddings to tackle the data sparsity problem caused by MCL. The decoding part (referred to as the CRF layer in the literature [12,14,30]) is combined into the NNs using tag embeddings. The word, root and tag embeddings are then efficiently incorporated and computed by the NNs in the same vector space, which allows us to extract higher-level vector features.

Table 1. The entity features; for more details see Tolegen et al. [24].

Morphological features    Word type features
Root                      Case feature
Part of speech            Start of the sentence
Inflectional suffixes     Latin spelling words
Derivational suffixes     Acronym
Proper noun
Kazakh Name suffixes

3 Named Entity Features

NER models are often enhanced with named entity features. In this work, with the purpose of making a fair comparison, we utilize the same entity features proposed by Tolegen et al. (2016) [24]. The entity features are given in Table 1


with two categories: morphological and word type information. Morphological features are extracted using a morphological tagger of our own implementation. We use a single value (1 or 0) to represent each feature, according to whether the word has the feature or not. Each word in the corpus is thus associated with an entity feature vector, which is fed into the NNs together with the word, root and tag embeddings.
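As an illustration of this feature representation, the sketch below builds such a 0/1 feature vector for a single word. The feature names follow Table 1, but the helper names and the simple surface checks (capitalization, acronym, Latin spelling) are illustrative assumptions rather than the authors' actual extraction rules, which rely on their morphological tagger.

```python
# Illustrative sketch: building a binary named-entity feature vector for one word.
# The feature set follows Table 1; the actual checks in the paper come from a
# morphological tagger, so the simple heuristics below are placeholders.

FEATURES = [
    "root", "part_of_speech", "inflectional_suffixes", "derivational_suffixes",
    "proper_noun", "kazakh_name_suffix",                            # morphological features
    "case_feature", "sentence_start", "latin_spelling", "acronym",  # word type features
]

def entity_feature_vector(word, is_sentence_start, morph_analysis):
    """Return a 0/1 vector with one slot per feature in FEATURES.

    `morph_analysis` is assumed to be a dict produced by a morphological
    tagger, e.g. {"proper_noun": True, "inflectional_suffixes": False, ...}.
    """
    values = {
        "sentence_start": is_sentence_start,
        "case_feature": word[:1].isupper(),
        "acronym": word.isupper() and len(word) > 1,
        "latin_spelling": word.isascii(),
    }
    values.update(morph_analysis)
    return [1 if values.get(name, False) else 0 for name in FEATURES]
```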

4 The Neural Networks

In this section, we describe our NNs for MCL NER. Unlike other NNs for English or similar languages, we introduce three types of representations: word, root and tag embeddings. In order to explore the effect of root and tag embeddings separately and clearly, our first model is a general deep neural network (DNN), first proposed by Bengio et al. (2003) [3] for probabilistic language modeling and re-introduced by Collobert et al. (2011) [5] for multiple NLP tasks. The DNN is also a standard model for sequence labeling and serves as a strong baseline. The second model extends the DNN by adding a tensor layer, which can be viewed as a non-linear transformation that extracts higher-dimensional interactions from the input. The architecture of our NN is shown in Fig. 1. The first layer is a lookup table layer which extracts features for each word. Here, the features are a window of words, the root (S_i) and the tag embedding (t_{i−1}). The concatenation of these feature vectors is fed into the next several layers for feature extraction. The next layer is a tensor layer and the remaining layers are standard NN layers. The NN layers are trained by backpropagation; the details are given in the following sections.

4.1 Mapping Words and Tags into Feature Vectors

The NNs have two dictionaries¹: one for roots and another for words. For simplicity, we use one notation for both dictionaries in the following description. Let D be the finite dictionary; each word x_i ∈ D is represented as a d-dimensional vector M_{x_i} ∈ R^{1×d}, where d is the word vector size (a hyper-parameter). All word representations of D are stored in an embedding matrix M ∈ R^{d×|D|}, where |D| is the size of the dictionary. Each word x_i ∈ D corresponds to an index k_i, which is a column index of the embedding matrix, and the corresponding word embedding is retrieved by the lookup table layer LT_M(·):

LT_M(k_i) = M_{x_i}    (1)

Similar to the word embeddings, we introduce a tag embedding matrix L ∈ R^{d×|T|}, where d is the vector size and T is the tag set. The lookup table layer can be seen as a simple projection layer in which the word embeddings of each context word and the tag embedding of the previous word are retrieved by the lookup table operation.

¹ The dictionary is extracted from the training data after some pre-processing, namely lowercasing and word stemming. Words outside this dictionary are replaced by a single special symbol.


Fig. 1. The architecture of the neural network.

To use these features effectively, we use a sliding window approach². More precisely, for each word x_i ∈ X, the embeddings of a window of words are given by the lookup table layer:

f_θ^1(x_i) = [M_{x_{i−w/2}}, ..., M_{x_i}, ..., M_{x_{i+w/2}}, S_i, t_{i−1}]    (2)

where f_θ^1(x_i) ∈ R^{1×wd} denotes the w word feature vectors, w is the window size (a hyper-parameter), t_{i−1} ∈ R^{1×d} is the previous tag embedding, and S_i is the embedding of the current root. These embedding matrices are initialized with small random numbers and trained by back-propagation.
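A minimal sketch of the lookup-table layer and the windowed features of Eq. (2) is given below, using NumPy and toy dimensions; the matrix and function names, as well as the padding indices, are chosen here for illustration and are not part of the original implementation.

```python
# Minimal sketch of the lookup-table layer and window features of Eq. (2).
import numpy as np

d, V, T = 50, 10000, 9               # embedding size, vocabulary size, tag set size
M = np.random.randn(d, V) * 0.01     # word embedding matrix
S = np.random.randn(d, V) * 0.01     # root embedding matrix (separate dictionary)
L = np.random.randn(d, T) * 0.01     # tag embedding matrix

def window_features(word_ids, root_ids, prev_tag_id, i, w=3):
    """Concatenate a window of word embeddings with the current root
    embedding and the previous tag embedding, as in Eq. (2)."""
    half = w // 2
    cols = []
    for j in range(i - half, i + half + 1):
        # out-of-sentence positions map to special "start"/"end" symbols (indices 0/1 here)
        k = word_ids[j] if 0 <= j < len(word_ids) else (0 if j < 0 else 1)
        cols.append(M[:, k])
    cols.append(S[:, root_ids[i]])   # root embedding of the current word
    cols.append(L[:, prev_tag_id])   # embedding of the previous predicted tag
    return np.concatenate(cols)      # shape: (w*d + 2*d,)
```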

4.2 Tensor Layer

In order to capture more interactions between roots, surface words, tags and entity features, we extend the DNN to a tensor neural network. We use a 3-way tensor T ∈ R^{h2×h1×h1}, where h1 is the size of the previous layer and h2 is the size of the tensor layer. We define the output of the tensor product h via the following vectorized notation:

h = g(e^T T e + W^3 e + b^3)    (3)

² The words exceeding the sentence boundaries are mapped to one of two special symbols, namely the "start" and "end" symbols.


where e ∈ R^{h1} is the output of the previous layer, W^3 ∈ R^{h2×h1}, and h ∈ R^{h2}. Maintaining the full tensor directly leads to a parameter explosion. Here, we use a tensor factorization approach [19] that factorizes each tensor slice as the product of two low-rank matrices, and obtain the factorized tensor function:

h = g(e^T P[i] Q[i] e + W^3 e + b^3)    (4)

where P[i] ∈ R^{h1×r} and Q[i] ∈ R^{r×h1} are two low-rank matrices, and r is the number of factors (a hyper-parameter).
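The factorized tensor transformation of Eq. (4) can be sketched as follows; the sizes and the choice of tanh for the non-linearity g are assumptions made here for illustration.

```python
# Sketch of the factorized tensor layer of Eq. (4) in NumPy.
import numpy as np

h1, h2, r = 300, 50, 3                    # previous layer, tensor layer, factor size
P = np.random.randn(h2, h1, r) * 0.01     # low-rank factors P[i] (h1 x r per slice)
Q = np.random.randn(h2, r, h1) * 0.01     # low-rank factors Q[i] (r x h1 per slice)
W3 = np.random.randn(h2, h1) * 0.01
b3 = np.zeros(h2)

def tensor_layer(e):
    """h[i] = g(e^T P[i] Q[i] e + W3 e + b3), with g = tanh here."""
    bilinear = np.array([e @ P[i] @ Q[i] @ e for i in range(h2)])
    return np.tanh(bilinear + W3 @ e + b3)
```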

4.3 Tag Inference

There are strong dependencies between the named entity tags in a sentence. In order to capture the tag transitions, we use a transition score A_{ij} [5,31] for jumping from tag i ∈ T to tag j ∈ T and an initial score A_{0i} for starting from the i-th tag. For an input sentence X with a tag sequence Y, a sentence-level score can be calculated as the sum of the transition scores and the network outputs:

s(X, Y, θ) = Σ_{i=1}^{N} (A_{t_{i−1}, t_i} + f_θ(t_i | i))    (5)

where f_θ(t_i | i) indicates the score output by the network for tag t_i at the i-th word. It should be noted that this model calculates the tag transition scores independently from the NNs. One possible way of combining the tag transitions and the neural network outputs is to feed the previous tag embedding into the NNs. The output of the NNs then yields a transition score given the previous tag embedding, and the sentence-level score can be written as:

s(X, Y, θ) = Σ_{i=1}^{N} f_θ(t_i | i, t_{i−1})    (6)

At inference time, for a sentence X, we can find the best tag path Y ∗ by maximizing the sentence score. The Viterbi algorithm can be used for this inference.
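A sketch of this decoding step is shown below. It assumes the network has already produced, for each word, a score for every (previous tag, current tag) pair as in Eq. (6); the array layout and the use of index 0 for the start tag are assumptions made here for illustration.

```python
# Sketch of Viterbi decoding for the sentence-level score of Eq. (6),
# where score[i][p][t] = f_theta(t | i, previous tag p).
import numpy as np

def viterbi_decode(score):
    """score: array of shape (N, T, T); returns the best tag path of length N."""
    N, T, _ = score.shape
    best = np.zeros((N, T))               # best path score ending in tag t at word i
    back = np.zeros((N, T), dtype=int)    # backpointers
    best[0] = score[0, 0]                 # index 0 is assumed to encode the "start" tag
    for i in range(1, N):
        cand = best[i - 1][:, None] + score[i]   # candidates over (previous, current) tags
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    path = [int(best[N - 1].argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return list(reversed(path))
```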

5 Experiments

We conducted several experiments to evaluate our NNs. The first explores the effects of the word, root and tag embeddings plus the tensor layer on the MCL NER task, independently. The second shows the results of our models after using pre-trained root and word embeddings. The last compares our models to the state of the art, including the character embedding-based biLSTM-CRF [12].

5.1 Data-Set

In our experiments we used the data from [27] for Turkish and the Kazakh NER corpus (KNC) from [24]. Both corpora were divided into training (80%), development (10%) and test (10%) sets. The development set is used for choosing the hyper-parameters and for model selection. We adopted the IOB tagging scheme [21] for all experiments and used the standard conlleval evaluation script³ to report the F-score, precision and recall values.

Table 2. Corpus statistics.

       Kazakh                                   Turkish
       #sent.  #token  #LOC  #ORG  #PER         #sent.  #token  #LOC  #ORG  #PER
Train  14457   215448  5870  2065  3424         22050   397062  9387  7389  13080
Dev.   1807    27277   785   247   413          2756    48990   1171  869   1690
Test   1807    27145   731   247   452          2756    46785   1157  925   1521

5.2 Model Setup

A set of experiments was conducted to choose the hyper-parameters, which were tuned on the development set. The initial learning rate of AdaGrad is set to 0.01 and the regularization is fixed to 10^{-4}. Generally, the number of hidden units has a limited impact on performance as long as it is large enough. The window size w was set to 3, the word, root and tag embedding sizes were set to 50, and the number of hidden units was 300 for the NNs; for the NNs with a tensor layer, it was set to 50 and the factor size was set to 3. After finding the best hyper-parameters, we trained the final models for all NNs. After each epoch over the training set, we measured the accuracy of the model on the development set and chose the final model that obtained the highest performance on the development set; the test set was then used to evaluate the selected model. We applied several preprocessing steps to the corpora, namely token and sentence segmentation and lowercasing of surface words, while roots were kept in their original forms.
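For convenience, the hyper-parameters reported above can be collected in a small configuration object; this is only a summary of the stated values, and the key names are chosen here for illustration.

```python
# Summary of the reported hyper-parameters as a configuration dictionary
# (names are illustrative; values are those stated in the text).
CONFIG = {
    "learning_rate": 0.01,        # initial AdaGrad learning rate
    "l2_regularization": 1e-4,
    "window_size": 3,
    "embedding_size": 50,         # word, root and tag embeddings
    "hidden_units": 300,          # plain NNs
    "tensor_hidden_units": 50,    # NNs with a tensor layer
    "tensor_factor_size": 3,
}
```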

5.3 Results

We evaluate the following model variations in the experiments: i) a baseline neural network, NN, which contains a discrete tag transition; ii) NN+root, a model that uses root embeddings and the discrete tag transition; iii) NN+root+tag, a model in which the discrete tag transition of NN is replaced by named entity tag embeddings; iv) NN+root+tensor, a tensor layer-based model with the discrete tag transition; v) models with +feat, which additionally use the named entity features.

³ www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.


Table 3. Results of the NNs for Kazakh and Turkish (F1-score, %). Here root and tag indicate root and tag embeddings; tensor means tensor layer; feat denotes entity feature vector; Kaz - Kazakh and Tur - Turkish; Ov - Overall.

L.   #   Models                     Development set                 Test set
                                    LOC    ORG    PER    Ov         LOC    ORG    PER    Ov
Kaz  1   NN                         86.69  68.95  68.57  78.66      86.32  69.51  64.78  76.89
     2   NN+root                    87.48  70.23  75.66  81.20      87.74  72.53  75.25  81.36
     3   NN+root+tag                88.85  67.69  79.68  82.81      87.65  73.75  76.13  81.86
     4   NN+root+tensor             89.56  72.54  81.07  84.22      88.51  75.79  77.32  82.83
     5   NN+root+feat               93.48  78.35  91.59  90.40      92.48  78.90  90.75  89.54
     6   NN+root+tensor+feat        93.78  81.48  90.91  90.87      92.22  81.57  91.27  90.11
     7   NN+root+tag+tensor+feat    93.65  81.28  92.42  91.27      92.96  78.89  91.70  90.28
Tur  8   NN                         85.06  74.70  81.11  80.86      83.17  76.26  80.55  80.29
     9   NN+root                    87.38  77.13  84.78  83.78      85.78  78.66  84.03  83.17
     10  NN+root+tag                90.70  84.93  86.67  87.53      90.02  86.14  85.95  87.31
     11  NN+root+tensor             92.43  86.45  89.63  89.78      90.50  87.14  90.00  89.42
     12  NN+root+feat               91.54  89.04  91.62  91.01      90.27  89.50  91.95  90.78
     13  NN+root+tensor+feat        93.60  88.88  92.23  91.88      92.05  89.35  92.01  91.34
     14  NN+root+tag+tensor+feat    91.77  89.72  92.23  91.44      92.80  88.45  91.91  91.39

Table 3 summarizes the results for Kazakh and Turkish. Rows (1–4, 8–11) compare the root and tag embeddings and the tensor layer independently, and rows (5–7, 12–14) show the effect of the entity features. As shown, when only surface word forms are used, the NN gives a 76.89% overall F1-score for Kazakh, with low F1-scores of 64.78% and 69.51% for PER and ORG, respectively. There are two main reasons for this: i) the numbers of person and organization names are smaller than that of locations (Table 2), and ii) compared to other entities, organization names are much longer and contain words that are ambiguous with person names⁴. For Turkish, the NN yields an 80.29% overall F1. It is evident from rows (2, 9) that NN+root improves significantly in all respects after using root embeddings: there are 4.47% and 2.88% improvements in overall F1 for Kazakh and Turkish compared to NN. More precisely, using root embeddings, NN+root gives 10.47%, 3.02% and 1.42% improvements for the Kazakh PER, ORG and LOC entities, respectively. The results for Turkish follow the same pattern. Rows (3, 10) show the effect of replacing the discrete tag transition with named entity tag embeddings: NN+root+tag yields overall F1-scores of 81.86% and 87.31% for Kazakh and Turkish. Compared to NN+root, the model with entity tag embeddings shows a significant improvement for Turkish, with 4.14% in overall F1. For both languages, model performance is boosted by the tensor transformation, showing that the tensor layer can capture more interactions between root and word vectors. Using the entity features, NN+root+feat gives a significant improvement for Kazakh (from 81.36% to 89.54%) and Turkish (from 83.17% to 90.78%).

⁴ This often occurs when the organization name is given after someone's name.


The best result for Kazakh is a 90.28% F1-score, obtained by using the tensor transformation with tag embeddings and entity features. We compare our NNs with the existing CRF-based NER system [24] and other state-of-the-art models. According to recent studies on NER [12,14,30], the current cutting-edge deep learning model for sequence labeling is the bidirectional LSTM with a CRF layer. On the one hand, we trained such a state-of-the-art NER model for Kazakh for comparison. On the other hand, it is also worth examining how well a character-based model performs for agglutinative languages, since character-based approaches seem well suited to the agglutinative nature of these languages and can serve as a stronger baseline than CRF. For the biLSTM-based models, we set the hyper-parameters to be comparable with those of the models that yield state-of-the-art results for English [12,14]. The word and character embedding sizes are set to 300 and 100, respectively. The LSTM hidden units for both characters and words are set to 300. The dropout is set to 0.5 and the "Adam" updating strategy is used for learning the model parameters. It should be noted that entities in Kazakh always start with a capital letter, and the data sets used for the biLSTM-based models are not converted to lowercase, which could have a positive effect on recognition. For a fair comparison, all of the following NER models are trained on the same training, development and test sets. Table 4 shows the comparison of our NNs with the state of the art for Kazakh.

Table 4. Comparison of our NNs and the state of the art.

Models                           LOC    ORG    PER    Overall
CRF [24]                         91.71  83.40  90.06  89.81
biLSTM+dropout                   85.84  68.91  72.75  78.76
biLSTM-CRF+dropout               86.52  69.57  75.79  80.28
biLSTM-CRF+Characters+dropout    90.43  76.10  85.88  86.45
NN+root+feat                     92.48  78.90  90.75  89.54
NN+root+tensor+feat              92.22  81.57  91.27  90.11
NN+root+tag+tensor+feat          92.96  78.89  91.70  90.28
NN+root+feat*                    91.74  81.00  90.99  89.70
NN+root+tensor+feat*             92.91  81.76  91.09  90.40
NN+root+tag+tensor+feat*         91.33  81.88  92.00  90.49

The CRF-based system [24] achieved an F1-score of 89.81% using all features with a well-designed feature template. The biLSTM-CRF with character embeddings yields an 86.45% F1-score, which is better than the result of the model without characters. A significant improvement of about 6% in overall F1-score was gained after using character embeddings, indicating that the character-based model fits the nature of MCLs. We also initialized the root and word embeddings with pre-trained embeddings. The skip-gram model of


word2vec⁵ [18] is used to train root and word vectors on large collections of Kazakh news articles and Wikipedia texts⁶. Table 4 also shows the results after pre-training the root and word embeddings, marked with the symbol *. As shown, the pre-trained root and word representations have a minor effect on the overall F1-score of the NN models. For organization names in particular, the pre-trained embeddings have a positive effect: the NN+root+feat* and NN+root+tag+tensor+feat* models achieve around a 2% improvement in organization F1-score compared to the models without pre-trained embeddings (the former from 78.90% to 81.00% and the latter from 78.89% to 81.88%). Overall, our NNs outperform the CRF-based system and the other state-of-the-art model (biLSTM-CRF+Characters+dropout), and the best NN yields an F1 of 90.49%, a new state of the art for Kazakh NER. To show the effect of the word embeddings after model training, we calculated the ten nearest neighbors of a few randomly chosen query words (first row of Table 5), with distances measured by cosine similarity. As given in Table 5, the nearest neighbors in the three columns are related to their named entity labels: location, person and organization names are listed in the first, second and third columns, respectively. Compared to CRF, instead of using discrete features, the NNs project roots and words into a vector space, which can group similar words by their meaning, and the NNs use non-linear transformations to extract higher-level features. In this way, the NNs may reduce the effects of the data sparsity problems of MCLs.

Table 5. Example words in Kazakh and their 10 closest neighbors. Here, we used the Latin alphabet to write Kazakh words for convenience.

Kazakhstan (Location)   Meirambek (Person)   KazMunayGas (Organization)
Kiev                    Oteshev              Nurmukasan
Sheshenstandagy         Klinton              TsesnaBank
Kyzylorda               Shokievtin           Euroodaktyn
Angliada                Dagradorzh           Atletikony
Burabai                 Tarantinonyn         Bayern
Iran                    Nikliochenko         Euroodakka
Singapore               Luis                 CenterCredittin
Neva                    Monhes               Juventus
London                  Fernades             Aldaraspan
Romania                 Fog                  Liverpool

⁵ https://code.google.com/p/word2vec/.
⁶ In order to reduce the dictionary sizes of roots and surface words, we applied some preprocessing, namely lowercasing and word stemming using a morphological analyzer and disambiguator.

6 Conclusions

We presented several neural networks for NER of MCLs. The key aspects of our models for MCLs are the use of different embeddings and an additional layer, namely: i) root embeddings, ii) entity tag embeddings and iii) the tensor layer. The effects of these aspects were investigated individually. The use of root embeddings leads to significant improvements in MCL NER, and the other two also give positive effects. For Kazakh, the proposed NNs outperform the CRF-based NER system and other state-of-the-art models, including the character-based biLSTM-CRF. The comparisons also showed that character embeddings are valuable for MCL NER. The experimental results indicate that the proposed NNs can potentially be applied to other morphologically complex languages.

Acknowledgments. The work was funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under grant AP09259324.

References 1. Baba Ali, B., W´ ojcik, W., Orken, M., Turdalyuly, M., Mekebayev, N.: Speech recognizer-based non-uniform spectral compression for robust MFCC feature extraction. Przegl. Elektrotechniczny 94, 90–93 (2018). https://doi.org/10.15199/ 48.2018.06.17 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014) 3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003) 4. Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy approach. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 160–163. CONLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003) 5. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493– 2537 (2011) 6. Florian, R., Ittycheriah, A., Jing, H., Zhang, T.: Named entity recognition through classifier combination. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 168–171. CONLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003) 7. Graves, A., Fern´ andez, S., Gomez, F.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the International Conference on Machine Learning, ICML 2006, pp. 369–376 (2006) 8. He, D., et al.: Dual learning for machine translation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems vol. 29, pp. 820–828. Curran Associates, Inc. (2016) 9. Klein, D., Smarr, J., Nguyen, H., Manning, C.D.: Named entity recognition with character-level models. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 180–183. CONLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)


10. Kuru, O., Can, O.A., Yuret, D.: CharNER: character-level named entity recognition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 911–921. The COLING 2016 Organizing Committee (December 2016) 11. Lafferty, J.D., Mccallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, San Francisco, CA, USA, pp. 282–289. Morgan Kaufmann Publishers Inc., (2001) 12. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016) 13. Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1520– 1530. Association for Computational Linguistics (September 2015) 14. Ma, X., Hovy, E.: End-to-end sequence labeling via Bi-directional LSTM-CNNsCRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 1064–1074. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/P16-1101, https://aclweb. org/anthology/P16-1101 15. Mamyrbayev, O., Toleu, A., Tolegen, G., Mekebayev, N.: Neural architectures for gender detection and speaker identification. Cogent Eng. 7(1), 1727168 (2020). https://doi.org/10.1080/23311916.2020.1727168 16. Mamyrbayev, O., et al.: Continuous speech recognition of kazakh language. ITM Web Conf. 24, 01012 (2019). https://doi.org/10.1051/itmconf/20192401012 17. Mccallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 188–191. CONLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003) 18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013) 19. Pei, W., Ge, T., Chang, B.: Max-margin tensor neural network for chinese word segmentation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland (Vol. 1: Long Papers), pp. 293– 303. Association for Computational Linguistics (June 2014) 20. Seker, G.A., Eryigit, G.: Initial explorations on using CRFs for turkish named entity recognition. In: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8–15 December 2012, Mumbai, India, pp. 2459–2474 (2012) 21. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: languageindependent named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning - Vol. 20, pp. 1–4. COLING 2002, Association for Computational Linguistics, Stroudsburg, PA, USA (2002) 22. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 142–147. 
CONLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)


23. Tkachenko, M., Simanovsky, A.: Named entity recognition: exploring features. In: ¨ Jancsary, J. (ed.) Proceedings of KONVENS 2012. pp. 118–127. OGAI (September 2012). main track: oral presentations 24. Tolegen, G., Toleu, A., Zheng, X.: Named entity recognition for kazakh using conditional random fields. In: Proceedings of the 4-th International Conference on Computer Processing of Turkic Languages TurkLang 2016, pp. 118–127. Izvestija KGTU im.I.Razzakova (2016) 25. Toleu, A., Tolegen, G., Mussabayev, R.: Comparison of various approaches for dependency parsing. In: 2019 15th International Asian School-Seminar Optimization Problems of Complex Systems (OPCS), pp. 192–195 (2019) 26. Toleu, A., Tolegen, G., Makazhanov, A.: Character-aware neural morphological disambiguation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada (Vol. 2: Short Papers), pp. 666– 671. Association for Computational Linguistics (July 2017) 27. T¨ ur, G., Hakkani-t¨ ur, D., Oflazer, K.: A statistical information extraction system for turkish. Nat. Lang. Eng. 9(2), 181–210 (2003) 28. Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1504–1515. Association for Computational Linguistics (November 2016) 29. Yeniterzi, R.: Exploiting morphology in turkish named entity recognition system. In: Proceedings of the ACL 2011 Student Session, pp. 105–110. HLT-SS 2011, Association for Computational Linguistics, Stroudsburg, PA, USA (2011) 30. Zhai, Z., Nguyen, D.Q., Verspoor, K.: Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pp. 38–43. Association for Computational Linguistics (2018) 31. Zheng, X., Chen, H., Xu, T.: Deep learning for Chinese word segmentation and POS tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 647–657. Association for Computational Linguistics (October 2013)

An Empirical Data Selection Schema in Annotation Projection Approach

Yun Hu1,2(B), Mingxue Liao2, Pin Lv2, and Changwen Zheng2

1 University of Chinese Academy of Sciences, Beijing, China
[email protected]
2 Institute of Software, Chinese Academy of Sciences, Beijing, China
{mingxue,lvpin,changwen}@iscas.ac.cn

Abstract. Named entity recognition (NER) systems are often realized using supervised methods such as CRF and LSTM-CRF. However, supervised methods require large amounts of training data, and in low-resource languages annotated data is often hard to obtain. The annotation projection method obtains annotated data automatically from high-resource languages, but the data obtained this way contains a lot of noise. In this paper, we propose a new data selection schema to select the high-quality sentences in the automatically annotated data. The schema computes a sentence score that takes into account the occurrence number of entity-tags and the minimum score of the entity-tags in the sentence. The selected sentences can be used as auxiliary annotated data in low-resource languages. Experiments show that our data selection schema outperforms previous methods.

Keywords: Named entity recognition · Annotation projection · Data selection schema

1 Introduction

Named entity recognition is a fundamental Natural Language Processing task that labels each word in a sentence with predefined types, such as Person (PER), Location (LOC), Organization (ORG) and so on. The results of NER can be used in many downstream NLP tasks, such as relation extraction [1] and question answering [16]. Supervised methods like CRF [7] and neural network methods [2,8] are often used to build NER systems. However, supervised methods require a large amount of data to train an appropriate model, which means they can only be used in high-resource languages such as English. In low-resource languages, annotation projection is one method that can be used to improve the performance of an NER system. Annotation projection obtains annotated data in low-resource languages through a parallel corpus. The method can be formalized as a pipeline. An example

The work is supported by both National scientific and Technological Innovation Zero (No. 17-H863-01-ZT-005-005-01) and State's Key Project of Research and Development Plan (No. 2016QY03D0505).


is shown in Fig. 1. We use the English-Chinese language pair as an example.¹ First, we use an English NER system to obtain the NER tags of the English sentences. For example, 'Committee' is labeled as 'ORG' (Organization). Then, an alignment system is used to find word alignment pairs; for example, the GIZA++ [11] model can find that 'Committee' is translated to '委员会'. Next, we take the tags of the English words as the tags of the low-resource-language words in each word alignment pair. For example, the tag of 'Committee' is mapped to '委员会', so '委员会' is labeled as an organization. Finally, the data obtained automatically can be used as training data together with the manually annotated data. Directly using the annotation projection data leads to low performance in some languages such as Chinese. In this paper, we use this data together with manual data as the training data of the NER system; the increase in training data results in increased performance.

Fig. 1. A case of annotation projection.
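A minimal sketch of the projection step of this pipeline is given below; the data structures and function name are assumptions made here for illustration, with the alignments assumed to come from GIZA++ and the source-side tags from the English NER system.

```python
# Minimal sketch of the projection step: English NER tags are copied to the
# aligned Chinese tokens.
def project_tags(en_tags, alignment, n_zh_tokens):
    """en_tags: list of tags for the English tokens (e.g. ["O", "ORG", ...]).
    alignment: list of (en_index, zh_index) word-alignment pairs.
    Returns a tag list for the Chinese tokens, defaulting to "O"."""
    zh_tags = ["O"] * n_zh_tokens
    for en_i, zh_i in alignment:
        if en_tags[en_i] != "O":
            zh_tags[zh_i] = en_tags[en_i]
    return zh_tags

# Example: "Committee" (ORG) aligned to the Chinese token for 委员会.
en_tags = ["O", "ORG"]
alignment = [(0, 0), (1, 1)]
print(project_tags(en_tags, alignment, 2))   # ['O', 'ORG']
```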

However, the data obtained from annotation projection contains a lot of noise, and higher-quality training data may lead to higher performance in the NER system. Previous work uses a data selection schema to obtain a high-quality annotation projection corpus. For example, Ni et al. consider the entity-tags and average the entity-tag scores in a sentence as the sentence score [10]; an entity-tag means an entity labeled with a predefined entity type. However, this method selects low-quality sentences when an 'entity' appears only once in the corpus or when one of the entities has a wrong prediction. In this paper, we propose a new data selection schema that considers the occurrence number of entities and uses the minimum value of the entity-tag scores in a sentence. Experimental results show that our data selection method outperforms previous work and improves over the baseline by 1.93 points.

2 Related Work

Recently, most NER systems have been based on Conditional Random Fields (CRF) or neural network models. The CRF model is generally used in sequence labeling tasks like part-of-speech tagging and NER [7]; the correlations between labels in neighborhoods are considered in the CRF model. The shortcoming of the CRF model is that it relies heavily on hand-crafted features. Neural network models partly alleviate the need for hand-crafted features, as the features of the sequences are extracted automatically by the model. Collobert et al. used word embeddings and a CNN to capture sentence information [2]. Lample et al. incorporated character embeddings with word embeddings as the

¹ We consider English as a high-resource language. Although Chinese is not truly a low-resource language, we simulate the low-resource environment by limiting the size of the training data, similar to [14].


bi-LSTM input and achieved the best results [8]. However, all of these models require large amounts of annotated data to train, and annotated data is hard to obtain in low-resource languages. In this paper, we focus on realizing an NER system in low-resource languages with limited annotated data. To build NER systems in low-resource languages, annotation projection and model transfer are widely used. Model transfer focuses on finding features that are independent of the language. Täckström et al. generated cross-lingual word cluster features which are useful for model transfer [13]. Yang et al. used hierarchical recurrent networks to model the character similarity between different languages [15]. However, model transfer is hard to apply to languages that have little similarity. In this paper, we focus on the annotation projection method. Annotation projection is another approach to realizing named entity recognition in low-resource languages, and it is not limited by the similarity between the languages. The annotation projection method was first proposed by [17], who used it for POS tagging, base NP chunking, NER and morphological analysis. Kim et al. introduced GIZA++ to perform the entity alignment in annotation projection and applied an entity dictionary to correct alignment errors [5]. A phrase-based statistical machine translation system and an external multilingual named entity database were used by [4]. Wang and Manning used soft projection to avoid propagating errors of the CRF prediction to the alignment step [14]. Recently, a data selection schema was proposed by [10]. The data selection schema measures sentence quality in the data obtained from annotation projection and does not have to consider which step caused the errors; the low-quality sentences are discarded when the model is trained. The details are discussed in Sect. 3.1.

3 Method

3.1 Problems of Previous Method

We first introduce the data selection schema proposed by [10]. The method first computes an entity-tag score and then computes a sentence score. The relative frequency is used as the entity-tag score, and the score of a sentence is calculated by averaging the entity-tag scores in the sentence:

feq-avg = (Σ_{i=1}^{n} s_i) / n    (1)

where n is the number of entities in the sentence and s_i is the entity-tag score. We refer to the data selection schema of [10] as the feq-avg method. An example of computing the entity-tag score is shown in Table 1. The word '货币基金组织' (International Monetary Fund, IMF) appears 49 times in the annotation projection data and is labeled as 'ORG' 40 times, so the entity-tag score of '货币基金组织' labeled as 'ORG' is 0.816. The method does not use a word-tag score, because many entities are composed of common words; for example, '货币' (money) is a common word. In Fig. 2, the first sentence '秘鲁恢复了在货币基金组织的积极成员地位' (Peru returned to active-member status at the IMF) has 2 entities, so its score is 0.878 (the average of 0.94 and 0.816). The sentence score can serve as an index of sentence quality: when the sentence score is low, we have high confidence that some errors occur.
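The sketch below illustrates the feq-avg schema of Eq. (1), computing entity-tag scores as relative frequencies and sentence scores as their average; the data structures and function names are assumptions made here for illustration.

```python
# Sketch of the feq-avg schema of Eq. (1): entity-tag scores are relative
# frequencies over the projected corpus, and a sentence score is their average.
from collections import Counter

def entity_tag_scores(projected_corpus):
    """projected_corpus: iterable of sentences, each a list of (entity, tag) pairs."""
    counts = Counter()
    totals = Counter()
    for sentence in projected_corpus:
        for entity, tag in sentence:
            counts[(entity, tag)] += 1
            totals[entity] += 1
    return {(e, t): c / totals[e] for (e, t), c in counts.items()}

def feq_avg(sentence, scores):
    """Average entity-tag score of a sentence (Eq. 1); 0.0 if it has no entities."""
    if not sentence:
        return 0.0
    return sum(scores[(e, t)] for e, t in sentence) / len(sentence)
```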


We checked the annotation projection data selected by the feq-avg method and found that it cannot handle two situations appropriately. First, in the entity-tag scoring step, the feq-avg method cannot handle the situation in which an 'entity' occurs only a few times in the corpus: the 'entity' is labeled as an entity by the annotation projection method but is not truly an entity. For example, in Table 1, '推选赛义德·阿姆贾德·阿里' (elected Syed Amjad Ali) is considered a person name in the annotation projection data. The error is that the entity contains the word '推选' (elected), which is a common word. The word '推选' may appear many times in the whole corpus; however, the 'entity' '推选赛义德·阿姆贾德·阿里' appears only once, so in the feq-avg method its entity-tag score is 1. The same situation also happens for '新闻部有关' (related to the Department of Public Information), which contains '有关' (related to) in the entity. Second, in the sentence scoring step, errors may occur when a sentence has many entities, most of which have high scores, while one entity has a low score. For example, the second sentence in Fig. 2 contains four entities, and its score is computed as

sentence-score = (0.99 + 0.816 + 0.99 + 0.03) / 4    (2)

The feq-avg method thus obtains a sentence score of 0.7065, which is a high score. However, the sentence contains an error: '汇率' (exchange rate) is labeled as 'ORG'. Because the other entities in the sentence obtain high scores, the entity-tag score of '汇率' is ignored in the sentence score computation.

Table 1. The case of computing entity-tag scores.

Words                        Label type   Count   Relative frequency
货币基金组织                  ORG          40      0.816
货币基金组织                  MISC         2       0.041
货币基金组织                  O            7       0.143
推选赛义德·阿姆贾德·阿里       PER          1       1.0
新闻部有关                    ORG          1       1.0

Fig. 2. The case of computing sentence score.

3.2 Our Method

In our method, we use the minimum value instead of the average, and a weighted entity-tag score instead of the raw relative frequency. The sentence score is computed as:

wfeq-min = min_i α s_i    (3)

where

α = 1 − 1 / (2e^{x−1})    (4)

In Eq. 3, s_i is the entity-tag score, which is the relative frequency. In Eq. 4, x is the occurrence number of the entity in the corpus. We refer to our data selection schema as the wfeq-min method. For the first problem of feq-avg, the entity-tag score is multiplied by a small α when the entity-tag occurs only a few times; when x is large, α is close to 1 and the weighted score is close to the relative frequency. For example, the α and sentence scores of '推选赛义德·阿姆贾德·阿里' labeled as 'PER' and '新闻部相关' labeled as 'ORG' are 0.5, while the α of '货币基金组织' labeled as 'ORG' is 0.99 and its score is 0.816. For the second problem, we use the minimum entity-tag score in the sentence as the sentence score. For example, the score of the second sentence in Fig. 2 becomes 0.03, while the score of the first sentence is 0.816. After computing the sentence scores, we can set a threshold q: the sentences whose scores are over q are regarded as high-quality sentences.
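The wfeq-min schema of Eqs. (3) and (4) can be sketched as follows; the function names, data structures and the threshold value are illustrative assumptions.

```python
# Sketch of the wfeq-min schema of Eqs. (3)-(4): each entity-tag score is
# weighted by alpha = 1 - 1/(2 e^(x-1)), where x is the entity's occurrence
# count, and the sentence score is the minimum weighted score.
import math

def wfeq_min(sentence, scores, occurrence):
    """sentence: list of (entity, tag) pairs; scores: relative frequencies;
    occurrence: dict mapping entity -> number of occurrences in the corpus."""
    if not sentence:
        return 0.0
    weighted = []
    for entity, tag in sentence:
        alpha = 1.0 - 1.0 / (2.0 * math.exp(occurrence[entity] - 1))
        weighted.append(alpha * scores[(entity, tag)])
    return min(weighted)

# Sentences whose score exceeds a threshold q are kept as high-quality data.
def select(sentences, scores, occurrence, q=0.7):
    return [s for s in sentences if wfeq_min(s, scores, occurrence) > q]
```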

4 Experiments

4.1 Data Sets and Evaluating Methods

Following the steps of the annotation projection method, we first require an English dataset and an English NER system. The English NER system we use is the LSTM-CRF model [8] trained on the CoNLL 2003 data [12]; we replace the 'MISC' tag with the 'O' tag in the English dataset. Then, an alignment system and parallel data are required. The English-Chinese parallel data comes from the datum2017 corpus provided by Datum Data Co., Ltd.² The corpus contains 20 files covering different genres such as news, conversations, law documents and novels; each file has 50,000 sentences, so the whole corpus contains 1 million sentences. The alignment system we use is GIZA++. Next, the data selection schemas are applied to the annotation projection data. We consider four kinds of data selection schema:

– feq-avg. The sentence score is computed by averaging the entity-tag scores, which are relative frequencies. This is the method described in Sect. 3.1.
– feq-min. The sentence score is computed as the minimum of the entity-tag scores, which are relative frequencies.
– wfeq-avg. The sentence score is computed by averaging the weighted entity-tag scores.

² http://nlp.nju.edu.cn/cwmt-wmt/.


– wfeq-min. The sentence score is computed as the minimum of the weighted entity-tag scores. This is the method described in Sect. 3.2.

Finally, we require evaluation data and an evaluation model to show that our data selection schema outperforms the other methods. We use different training data to train the model and the same test data to test it. In this paper, the training data contains two parts: an annotation projection part and an original news part. The annotation projection part is the data obtained from the annotation projection method. The original news part is data from the training portion of the third SIGHAN Chinese language processing bakeoff [9]; to simulate a low-resource environment, we do not use all of the training data. The test data is the test portion of [9]. The third SIGHAN bakeoff is one of the most widely used Chinese NER datasets; it contains 6.1M of training data, 0.68M of development data and 1.3M of test data, and considers three types of entities: PER (Person), ORG (Organization) and LOC (Location). The domain of the NER data is news. The evaluation model is the same for all evaluation data. We use an LSTM-CRF model as our evaluation model; the model is similar to the model in [3] except that we do not use the radical part to extract Chinese character information. The model is shown in Fig. 3. Sentences are fed as input to the character embedding layer.³ We use pre-trained character embeddings to initialize our lookup table, and each character maps to a low-dimensional dense vector; the dimension of the character embeddings is 100. A bi-LSTM layer is then used to model sentence-level information; the LSTM dimension is 100. The projection layer dimension is the same as the number of NER tags. We use a CRF layer to model the relations between neighboring labels. The optimization method is Adam [6], and dropout is used to avoid overfitting.

Fig. 3. The evaluation model.

³ Before using the data from annotation projection in Chinese, we convert the word tag schema to a character tag schema. For example, the word '委员会' (Committee) is labeled as 'ORG', so each of the three characters in '委员会' is labeled as 'ORG'.

4.2 Results

We first conducted experiments in which only annotation projection data is used as training data. The results show that such models are hard to train and to make converge, and the final F1 score is low (less than 30%). The reason may be that the automatically annotated data still contains a lot of noise even after data selection. In this paper, we therefore use annotation projection data together with manual data, to show that the data selection schema can be helpful when the original manual data is limited. We use 0.5M of data from the original news training data and 0.5M of data from annotation projection. The results of using the different data selection schemas are shown in Table 2. The baseline system only uses the 0.5M of original news data. The baseline-D system uses 0.5M of data from the original news training data and 0.5M of data from annotation projection without any data selection schema. The results show that directly using annotation projection data may lead to a decline in performance; the reason may be that the increase in training data cannot compensate for the tremendous noise. All four data selection schemas improve the performance of the baseline system, which means that data selection is important for improving sentence quality. The wfeq-min schema described in Sect. 3.2 achieves the best results and improves over the baseline system by 1.93 points in F1 score. Both the wfeq-avg and feq-min schemas outperform the feq-avg schema, which shows that both the occurrence number of entity-tags and the minimum score of entity-tags in a sentence are important.

Table 2. The overall results.

             P      R      F
baseline     68.58  67.98  68.28
baseline-D   71.09  64.01  67.36
feq-avg      71.38  66.95  69.09
wfeq-avg     73.29  65.99  69.45
feq-min      73.87  66.06  69.75
wfeq-min     73.43  67.26  70.21

When we utilize the data selected by a data selection schema, three hyperparameters are considered: the size of the original news training data, the size of the data from annotation projection, and the selection threshold q in the data selection schema. When one hyperparameter is varied, the other two are fixed. The first hyperparameter is the size of the original news training data. The F1 results are shown in Table 3; the original news data size ranges from 0.25M to 1M, the annotation projection data size is 0.5M and the threshold q is 0.7. The baseline system does not use the annotation projection data. The baseline results show that the larger the training data set, the better the model works. For all sizes, directly using the annotation projection data harms performance, while the models using data from a selection schema all outperform the baseline-D system; the wfeq-min schema obtains the best results. We also observe that the effect of the annotation projection data is reduced as the original data size increases: when the original data size is 0.25M, the wfeq-min method improves over the baseline by 3.04 points, whereas when the original size is 1.0M the improvement is 0.84 points.


The experiments indicate that the annotation projection method is more helpful in the low-resource situation.

Table 3. The results for different data sizes from the original news data.

size/M       0.25   0.5    0.75   1.0
baseline     56.41  68.28  74.43  77.82
baseline-D   56.87  67.36  72.97  75.80
feq-avg      58.55  69.09  74.52  77.25
wfeq-avg     58.82  69.45  74.64  77.43
feq-min      59.01  69.75  74.78  77.82
wfeq-min     59.45  70.21  75.16  78.12

The second hyperparameter is the size of the data from annotation projection. The size of the original news data is 0.5M and q is set to 0.7; we test annotation projection data sizes from 0M to 1.0M. The F1 results are shown in Table 4. Directly using the annotation projection data harms performance for all data sizes (baseline-D). For the data selection schemas, we observe that performance first increases and then declines. The reason may be that the disadvantage from the noisy data grows as the amount of annotation projection data increases. The experiments indicate that the annotation projection data should have a size similar to that of the original news data.

Table 4. The results for different data sizes from annotation projection.

size/M       0      0.25   0.5    0.75   1.0
baseline-D   68.28  68.19  67.36  66.48  64.93
feq-avg      68.28  70.50  69.09  68.74  67.42
wfeq-avg     68.28  70.93  69.45  68.84  67.52
feq-min      68.28  70.98  69.75  68.97  67.63
wfeq-min     68.28  71.08  70.21  70.04  69.49

The final hyperparameter is the selection threshold q in the data selection schema. The F1 results are presented in Fig. 4. The training data set contains 0.5M of original news data and 0.5M of annotation projection data, and q is used to select the annotation projection data; a large q leads to higher-quality sentences. To avoid the situation where q is small and the amount of data above q is large, we select only 0.5M of data from the annotation projection data for every q. The figure shows that performance increases as q increases, which means that the quality of the training data is increasing. For all values of q, the wfeq-min method obtains the best results and the feq-avg method obtains the worst results. Comparing the feq-min method with the wfeq-avg method, feq-min obtains better results when q is small and worse results when q is large. By analyzing the sentences selected by the feq-min method, we find that the first problem described in Sect. 3.1 becomes more obvious when q is large.


Fig. 4. The results of using different q values.

5 Conclusion

In this paper, we propose a new data selection schema to select high-quality sentences in annotation projection data. The schema considers the occurrence number of entity-tags and uses the minimum value of the entity-tag scores in a sentence. The selected sentences can be used as auxiliary data to obtain higher performance in an NER system. Experiments show that our method obtains higher-quality sentences than previous methods. In the future, more sophisticated models will be used to exploit the high-quality sentences instead of using them directly as training data.

References 1. Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005). https://aclweb.org/anthology/H05-1091 2. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011) 3. Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based LSTM-CRF with radicallevel features for Chinese named entity recognition. In: Lin, C.-Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) ICCPOL/NLPCC -2016. LNCS (LNAI), vol. 10102, pp. 239– 250. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4 20 4. Ehrmann, M., Turchi, M.: Building multilingual named entity annotated corpora exploiting parallel corpora. In: Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) (2010) 5. Kim, S., Jeong, M., Lee, J., Lee, G.G.: A cross-lingual annotation projection approach for relation detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 564–571. Coling 2010 Organizing Committee (2010). https:// www.aclweb.org/anthology/C10-1064 6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Computer Science (2014)


7. Lafferty, J.D., Mccallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Eighteenth International Conference on Machine Learning, pp. 282–289 (2001) 8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016). https://doi.org/10.18653/ v1/N16-1030, https://www.aclweb.org/anthology/N16-1030 9. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117. Association for Computational Linguistics (2006). https://www.aclweb.org/anthology/W06-0115 10. Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1470–1480. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1135, https://www.aclweb.org/anthology/P17-1135 11. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1) (2003). https://www.aclweb.org/anthology/J03-1002 12. Sang, E.F.T.K., Meulder, F.D.: Introduction to the conll-2003 shared task: languageindependent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003). https://www.aclweb.org/anthology/W030419 13. T¨ackstr¨om, O., McDonald, R., Uszkoreit, J.: Cross-lingual word clusters for direct transfer of linguistic structure. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 477– 487. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/ N12-1052 14. Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55–66 (2014). https://www.aclweb. org/anthology/Q14-1005 15. Yang, Z., Salakhutdinov, R., Cohen, W.W.: Transfer learning for sequence tagging with hierarchical recurrent networks (2016) 16. Yao, X., Van Durme, B.: Information extraction over structured data: question answering with freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 956–966. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/P14-1090,https://aclweb.org/anthology/P14-1090 17. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the First International Conference on Human Language Technology Research (2001). https://www.aclweb.org/anthology/H011035

Toponym Identification in Epidemiology Articles – A Deep Learning Approach

MohammadReza Davari(B), Leila Kosseim, and Tien D. Bui

Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
[email protected], {leila.kosseim,tien.bui}@concordia.ca

Abstract. When analyzing the spread of viruses, epidemiologists often need to identify the location of infected hosts. This information can be found in public databases such as GenBank [3]; however, the information provided in these databases is usually limited to the country or state level. More fine-grained localization information requires phylogeographers to manually read the relevant scientific articles. In this work we propose an approach to automate the process of place name identification in medical (epidemiology) articles. The focus of this paper is to propose a deep learning based model for toponym detection and to experiment with the use of external linguistic features and domain specific information. The model was evaluated using a collection of 105 epidemiology articles from PubMed Central [33] provided by the recent SemEval task 12 [28]. Our best detection model achieves an F1 score of 80.13%, a significant improvement compared to the state of the art of 69.84%. These results underline the importance of domain specific embeddings as well as specific linguistic features for toponym detection in medical journals.

Keywords: Named entity recognition · Toponym identification · Deep neural network · Epidemiology articles

1 Introduction

With the increase of global tourism and international trade of goods, phylogeographers, who study the geographic distribution of viruses, have observed an increase in the geographical spread of viruses [9,12]. In order to study and model the global impact of the spread of viruses, epidemiologists typically use information on the DNA sequence and structure of viruses, but also rely on meta data. Accurate geographical data is essential in this process. However, most publicly available data sets, such as GenBank [3], lack specific geographical details, providing information only at the country or state level. Hence, localized geographical information has to be extracted through a manual inspection of medical journals.

The task of toponym resolution is a sub-problem of named entity recognition (NER), a well studied topic in Natural Language Processing (NLP). Toponym resolution consists of two sub-tasks: toponym identification and toponym disambiguation. Toponym identification consists of identifying the word boundaries of expressions that denote geographic locations, while toponym disambiguation focuses on labeling these expressions with their corresponding geographical locations. Toponym resolution has been the focus of much work in recent years (e.g. [2,7,30]) and studies have shown that the task is highly dependent on the textual domain [1,8,14,25,26]. The focus of this paper is to propose a deep learning based model for toponym detection and to experiment with the use of external linguistic features and domain specific information. The model was evaluated using the recent SemEval task 12 dataset [28] and shows that domain specific embeddings as well as some linguistic features do help in toponym detection in medical journals.

2 Previous Work

The task of toponym detection consists in labeling each word of a text as a toponym or non-toponym. For example, given the sentence:

(1) WNV entered Mexico through at least 2 independent introductions. (Example from the [33] dataset.)

The expected output is shown in Fig. 1.

Fig. 1. An example of input and expected output of the toponym detection task. Example from the [33] dataset.

Toponym detection has been addressed using a variety of methods: rule based approaches (e.g. [29]), dictionary or gazetteer-driven approaches (e.g. [18]), as well as machine learning approaches (e.g. [27]). Rule based techniques try to manually capture textual clues or structures which could indicate the presence of a toponym. However, these handwritten rules are often not able to cover all possible cases, hence leading to a relatively large number of false negatives. Gazetteer driven approaches (e.g. [18]) suffer from a large number of false positive identifications, since they cannot disambiguate entities that refer to geographical locations from other categories of named entities. For example, in the sentence

(2) Washington was unanimously elected President by the Electoral College in the first two national elections.

the word Washington will be recognized as a toponym since it is present in geographic gazetteers, but in this context the expression refers to a person name. Finally, standard machine learning approaches (e.g. [27]) require large datasets of labeled text and carefully engineered features. Collecting such large datasets is costly and feature engineering is a time consuming task, with no guarantee that all relevant features have been modeled. This motivated us to experiment with automatic feature learning to address the problem of toponym detection. Deep learning approaches to NER (e.g. [5,6,16,17,31]) have shown how a system can infer relevant features and lead to competitive performance in that domain.

The task of toponym resolution for the epidemiology domain is currently the object of the SemEval 2019 shared task 12 [28]. Previous approaches to toponym detection in this domain include rule based approaches [33], Conditional Random Fields [32], and a mixture of deep learning and rule based approaches [19]. The baseline model used at the SemEval 2019 task 12 [28] is modeled after the deep feed forward neural network (DFFNN) architecture presented in [19]. The network consists of 2 hidden layers with 150 rectified linear unit (ReLU) activations per layer. The baseline F1 performance is reported to be 69.84%. Building upon the work of [19,28], we propose a DFFNN that uses domain-specific information as well as linguistic features to improve on the state of the art performance.

3 Our Proposed Model

Figure 2 shows the architecture of our toponym recognition model. The model is comprised of 2 main layers: an embedding layer and a deep feed-forward network.

3.1 Embedding Layer

As shown in Fig. 2, the model takes as input a word (e.g. derived) and its context (i.e. n words around it). Given a document, each word is converted into an embedding along with its context. Specifically, two types of embeddings are used: word embeddings and feature embeddings.

For word embeddings, our basic model uses the pretrained Wikipedia-PubMed embeddings (http://bio.nlplab.org/). This embedding model was trained on a vocabulary of 201,380 words and each word is represented by a 200 dimensional feature vector. This embedding model was used as opposed to the more generic Word2vec [20] or GloVe [23] in order to capture more domain specific information (see Sect. 4). Indeed, the corpus used for training the Wikipedia-PubMed embeddings consists of Wikipedia pages and PubMed articles [21]. This entails that the embeddings should be more appropriate when processing medical journals and domain specific words. Moreover, the embedding model can better represent the closeness and relation of words in medical articles.

Fig. 2. Toponym recognition model. Input: words are extracted with a fixed context window. (a) Embeddings: for each window, an embedding is constructed. (b) Deep Neural Network: a feed-forward neural network with 3 layers and 500 neurons per layer outputs a prediction label indicating whether the word in the center of the context window is a toponym or not.

The word embeddings for the target word and its context are concatenated to form a single word embedding vector of size 200 × (2c + 1), where c is the context size.

Specific linguistic features have been shown to be very useful in toponym detection [19]. In order to leverage this information, our model is augmented with embeddings for these features. These include the use of a capital letter for the first character of the word or for all characters of the word. These features are encoded as a binary vector representation: if a word starts with a capitalized letter, its feature embedding is [1, 0]; otherwise it is [0, 1]; and if all of its letters are capitalized, its feature embedding is [1, 1]. Other linguistic features we observed to be useful (see Sect. 4) include part of speech tags and the word embedding of the lemma of the word. The feature embeddings of the input word and its context are concatenated with the word embeddings to form a single vector, which is passed to the next layer.
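The construction of this input vector can be made concrete with a short sketch. This is not the authors' code: it is a minimal Python/NumPy illustration, assuming the Wikipedia-PubMed vectors are exposed through a dict-like `lookup` object, that out-of-vocabulary words map to zero vectors, and that word and capitalization features are interleaved per token (one possible layout of the concatenation described above).

```python
import numpy as np

EMB_DIM = 200  # Wikipedia-PubMed vectors are 200-dimensional

def cap_feature(word):
    # [1, 0] = initial capital, [0, 1] = no capital, [1, 1] = all letters capitalized
    if word.isupper():
        return np.array([1.0, 1.0])
    if word[:1].isupper():
        return np.array([1.0, 0.0])
    return np.array([0.0, 1.0])

def window_vector(tokens, i, c, lookup):
    """Concatenate word and capitalization embeddings for tokens[i-c .. i+c]."""
    parts = []
    for j in range(i - c, i + c + 1):
        if 0 <= j < len(tokens):
            w = tokens[j]
            vec = lookup.get(w.lower(), np.zeros(EMB_DIM))  # OOV word -> zero vector (assumption)
            parts.append(np.concatenate([vec, cap_feature(w)]))
        else:
            parts.append(np.zeros(EMB_DIM + 2))  # positions outside the sentence
    return np.concatenate(parts)  # length (2c + 1) * (EMB_DIM + 2)
```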

3.2 Deep Feed Forward Neural Network

The concatenated embeddings formed in the embedding layer (Sect. 3.1) are fed to a deep feed forward network (DFFNN) (see Fig. 2) whose task is to perform binary classification. This component is comprised of 3 hidden layers and one output layer. Each hidden layer is comprised of 500 ReLU activation nodes. Once an input vector x enters a hidden layer h, the output h(x) is computed as:

  h(x) = ReLU(Wx + b)    (1)

The model is defined by applying the above equation recursively for all 3 hidden layers. The output layer contains a 2-dimensional softmax activation. Upon receiving the input x, this layer outputs O(x) as follows:

  O(x) = Softmax(Wx + b)    (2)

The Softmax function was chosen for the output layer since it provides a categorical probability distribution over the labels for an input x, i.e.:

  p(x = toponym) = 1 − p(x = non-toponym)    (3)

We employed 2 mechanisms to prevent overfitting: drop-out and early stopping. In each hidden layer the probability of drop-out was set to 0.5. Early stopping halted training if the loss on the development set (see Sect. 4) started to rise, preventing over-fitting and poor generalization. Norm clipping [22] scales the gradient when its norm exceeds a certain threshold and prevents exploding gradients; we experimentally found the best performing threshold to be 1 for our model. We experimented with variations of the model architecture, both in depth and in number of hidden units per layer, as well as with the other hyper-parameters listed in Table 1. However, deepening the model led to immediate over-fitting due to the small size of the dataset used [13] (see Sect. 4), even with a dropout function present to prevent it. The optimal hyper-parameter configuration, tuned on the development set, can be found in Table 1.

Table 1. Optimal hyper-parameters of the neural network.

  Learning rate:  0.01
  Batch size:     32
  Optimizer:      SGD
  Momentum:       0.1
  Loss:           Weighted categorical cross-entropy
  Loss weights:   (2, 1) for toponym vs. non-toponym
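As an illustration of the classifier and training configuration just described (Eqs. (1)–(3) and Table 1), the following PyTorch sketch shows one possible implementation; it is not the authors' code. The class-weight ordering (non-toponym, toponym) and the input dimension, which here counts only the word embeddings for a context window of c = 5, are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ToponymDFFNN(nn.Module):
    """Three hidden layers of 500 ReLU units with dropout 0.5 and a 2-way output."""
    def __init__(self, input_dim):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(3):
            layers += [nn.Linear(dim, 500), nn.ReLU(), nn.Dropout(0.5)]
            dim = 500
        layers.append(nn.Linear(dim, 2))  # logits; the softmax is applied inside the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = ToponymDFFNN(input_dim=200 * (2 * 5 + 1))  # word embeddings only, c = 5 (assumption)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))  # (non-toponym, toponym) = (1, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.1)

def train_step(x, y):
    """One mini-batch update with the weighted loss and norm clipping at 1."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```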

4 Experiments and Results

Our model has been evaluated as part of the recent SemEval 2019 task 12 shared task [28]. As such, we used the dataset and the scorer provided by the organisers (https://competitions.codalab.org/competitions/19948). The dataset consists of 105 articles from PubMed annotated with toponym mentions and their corresponding geographical locations. The dataset was split into 3 sections: training, development, and test sets containing 60%, 10%, and 30% of the dataset respectively. Table 2 shows statistics of the dataset.

Table 2. Statistics of the dataset.

                                            Training   Development   Test
  Size                                      2.8 MB     0.5 MB        1.5 MB
  Number of articles                        63         10            32
  Average size of each article (in words)   6422       5191          6146
  Average number of toponyms per article    43         44            50

A baseline model for toponym detection was also provided by the organizers for comparative purposes. The baseline, inspired by [19], also uses a DFFNN but with only 2 hidden layers and 150 ReLU activation functions per layer. Table 3 shows the results of our basic model presented in Sect. 3.2 (row #4) compared to the baseline (row #3). (At the time of writing this paper, the results of the other teams were not available; hence only a comparison with the baseline can be made at this point.) We carried out a series of experiments to evaluate a variety of parameters. These are described in the next sections.

Table 3. Performance scores of the baseline, our proposed model and its variations. The suffixes represent the presence of a feature: P: punctuation marks, S: stop words, C: capitalization features, POS: part of speech tags, W: weighted loss, L: lemmatization feature. For example, DFFNN Basic+P+S+C+POS refers to the model that takes advantage of only the capitalization features and part of speech tags and does not ignore stop words or punctuation marks.

  #  Model                       Context  Precision  Recall   F1
  8  DFFNN Basic+P+S+C+POS+W+L   5        80.69%     79.57%   80.13%
  7  DFFNN Basic+P+S+C+POS+W     5        76.84%     77.36%   77.10%
  6  DFFNN Basic+P+S+C+POS       5        77.55%     70.37%   73.79%
  5  DFFNN Basic+P+S+C           2        78.82%     66.69%   72.24%
  4  DFFNN Basic+P+S             2        79.01%     63.25%   70.26%
  3  Baseline                    2        73.86%     66.24%   69.84%
  2  DFFNN Basic+P−S             2        74.70%     63.57%   68.67%
  1  DFFNN Basic+S−P             2        64.58%     64.47%   64.53%

4.1 Effect of Domain Specific Embeddings

As [1,8,14,25,26] showed, the task of toponym detection is dependent on the discourse domain; this is why our basic model used the Wikipedia-PubMed embeddings. In order to measure the effect of such domain specific information, we experimented with 2 other pretrained word embedding models: Google News Word2vec [11] and a GloVe model trained on Common Crawl [24]. Table 4 shows the characteristics of these pretrained embeddings. Although Wikipedia-PubMed has a smaller vocabulary than the other embedding models, it suffers from the smallest percentage of out of vocabulary (OOV) words within our dataset since it was trained on a closer domain.

Table 4. Specifications of the word embedding models.

  Model                  Vocabulary size  Embedding dimension  OOV words
  Wikipedia-PubMed       201,380          200                  28.61%
  Common Crawl GloVe     2.2M             300                  29.84%
  Google News Word2vec   3M               300                  44.36%

We experimented with our DFFNN model with each of these embeddings and optimized the context window size to achieve the highest F-measure on the development set. The performance of these models on the test set is shown in Table 5. As predicted, we observe that Wikipedia-PubMed performs better than the other embedding models. This is likely due to its small number of OOV words and its domain-specific knowledge. As Table 5 shows, the performance of the GloVe model is quite close to that of Wikipedia-PubMed. To investigate this further, we combined the two embeddings, trained another model and evaluated its performance. As shown in Table 5, the performance of this model (Wikipedia-PubMed + GloVe) is higher than the GloVe model alone but lower than Wikipedia-PubMed alone. This decrease in performance suggests that, because the GloVe embeddings are more general, when the network is presented with a combination of GloVe and Wikipedia-PubMed they dilute the domain specific information captured by the Wikipedia-PubMed embeddings, and hence the performance suffers. From here on, our experiments were carried out using the Wikipedia-PubMed word embeddings alone.

Table 5. Effect of word embeddings on the performance of our proposed model architecture.

  Model                      Context window  Precision  Recall   F1
  Wikipedia-PubMed           2               79.01%     63.25%   70.26%
  Wikipedia-PubMed + GloVe   2               73.09%     67.22%   70.03%
  Common Crawl GloVe         1               75.40%     64.05%   69.25%
  Google News Word2vec       3               75.14%     58.96%   66.07%

4.2 Effect of Linguistic Features

Although deep learning approaches have led to significant improvements in many NLP tasks, simple linguistic features are often very useful. In the case of NER, punctuation marks constitute strong signals. To evaluate this in our task, we ran the DFFNN Basic without punctuation information. As Table 3 shows, the removal of punctuation decreased the F-measure from 70.26% to 64.53% (see Table 3 #1). A manual error analysis showed that many toponyms appear inside parentheses, near a period at the end of a sentence, or after a comma. Hence, as shown in [10], punctuation is a good indicator of toponyms and should not be ignored.

As Table 3 (#2) shows, the removal of stop words did not help the model either and led to a decrease in F-measure (from 70.26% to 68.67%). We hypothesize that some stop words such as in do help the system detect toponyms, as they provide a learnable structure for the detection of toponyms; this is why the model accuracy suffered once the stop words were removed.

As seen in Table 3, our basic model suffers from low recall. A manual inspection of the toponyms in the dataset revealed that either their first letter is capitalized (e.g. Mexico) or all their letters are capitalized (e.g. UK). As mentioned in Sect. 3.1, we used this information in an attempt to help the DFFNN learn more structure from the small dataset. As a result, the F1 performance of the model increased from 70.26% to 72.24% (see Table 3 #5).

In order to help the neural network better understand and model the structure of the sentences, we experimented with part of speech (POS) tags as part of our feature embeddings. We used the NLTK POS tagger [4], which uses the Penn Treebank tagset. As shown in Table 3 (#6), the POS tags significantly improve the recall of the network (from 66.69% to 70.37%), hence leading to a higher F1 performance (from 72.24% to 73.79%). The POS tags help the DFFNN to better learn the structure of the sentences and take advantage of more contextual information (see Sect. 4.3).
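The POS feature can be added with very little code. The sketch below is an illustration rather than the paper's implementation: it uses NLTK's default Penn Treebank tagger and encodes each tag as a one-hot vector over a small illustrative tag subset (the paper does not state exactly how the POS tags were embedded).

```python
import nltk
import numpy as np

# Illustrative subset of Penn Treebank tags; nltk.pos_tag requires the
# 'averaged_perceptron_tagger' data package to be downloaded.
PTB_TAGS = ["NN", "NNS", "NNP", "NNPS", "JJ", "IN", "DT", "CD",
            "VB", "VBD", "VBG", "VBN", "VBZ", "RB", "CC", "."]

def pos_features(tokens):
    """Return one one-hot POS vector per token; unlisted tags map to all zeros."""
    feats = np.zeros((len(tokens), len(PTB_TAGS)))
    for i, (_, tag) in enumerate(nltk.pos_tag(tokens)):
        if tag in PTB_TAGS:
            feats[i, PTB_TAGS.index(tag)] = 1.0
    return feats

print(pos_features("WNV entered Mexico".split()).shape)  # (3, 16)
```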

4.3 Effect of Window Size

In order to measure the effect of the size of the context window, we varied this value using the basic DFFNN. As seen in Fig. 3, the best performance is achieved at c = 2. With values over this threshold, the DFFNN overfits as it cannot extract any meaningful structure. Due to the small size of the data set, the DFFNN is not able to learn the structure of the sentences, hence increasing the context window alone does not help the performance. In order to help the neural network better understand and use the contextual structure of the sentences in its predictions, we experimented with part of speech (POS) tags as part of our feature embeddings. As shown in Fig. 3, the POS tags help the DFFNN take advantage of more contextual information; as a result, the DFFNN with POS embeddings achieves a higher performance on larger window sizes. The context window for which the DFFNN achieved its highest performance on the development set was c = 5, and on the test set the performance increased from 72.24% to 77.10% (see Table 3 #6).

Fig. 3. Effect of the context window on the performance of the model with and without POS features (DFFNN Basic+P+S and DFFNN Basic+P+S+C+POS).

4.4 Effect of the Loss Function

As shown in Table 3, most models suffer from a lower recall than precision. The dataset is quite imbalanced, that is, the number of non-toponyms is much higher than the number of toponyms (99% vs 1%). Hence, the neural network prefers to optimize its performance by concentrating its efforts on correctly predicting the labels of the dominant class (non-toponym). In order to minimize the gap between recall and precision, we experimented with a weighted loss function. We adjusted the importance of predicting the correct labels experimentally and found that by weighing the toponyms 2 times more than the non-toponyms, the system reaches an equilibrium between precision and recall, leading to a higher F1 performance (this is indicated by "W" in Table 3, row #7).

4.5 Use of Lemmas

Neural networks require large datasets to learn structure, and they learn better if the dataset contains similar examples that the system can cluster during learning. Since our dataset is small and the Wikipedia-PubMed embeddings suffer from 28.61% OOV words (see Table 4), we tried to help the network better cluster the data by adding the word embeddings of the lemmatized words to the feature embeddings, and observed how our best model reacted. As shown in Table 3 (#8), this improved the F1 measure significantly (from 77.10% to 80.13%). Furthermore, we picked 2 random toponyms and 2 random non-toponyms to visualize the confidence of our best model and of the baseline model in their predictions, as given by the softmax function (see Eq. 2). Figure 4 shows that our model produces much sharper confidence than the baseline model.
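A hedged sketch of the lemma feature: assuming NLTK's WordNetLemmatizer (which needs the 'wordnet' data package) and the same dict-like embedding lookup as before, the lemma's embedding is simply appended to the token's feature vector. This is an illustration, not the authors' code.

```python
import numpy as np
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma_embedding(word, lookup, dim=200):
    """Embedding of the word's lemma; OOV lemmas fall back to a zero vector."""
    lemma = lemmatizer.lemmatize(word.lower())   # e.g. 'introductions' -> 'introduction'
    return lookup.get(lemma, np.zeros(dim))      # appended to the token's feature vector
```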

5 Discussion

Overall, our best model (DFFNN #8 in Table 3) is made out of the basic DFFNN plus the capitalization features, POS embeddings, the weighted loss function, and the lemmatization feature. The experiments and results described in Sect. 4 underline the importance of linguistic insights in the task of toponym detection. Ideally, the system should learn all these insights and features by itself given access to enough data. However, when data is scarce, as in our case, we should take advantage of the linguistic structure of the data for better performance. Our experiments also underline the importance of domain specific word embedding models. These models reduce OOV words and also provide embeddings that capture the relations between words in the specific domain of study.

Fig. 4. (a) Confidence of our proposed model in its categorical predictions. (b) Confidence of the baseline in its categorical predictions.

6 Conclusion and Future Work

This paper presented the approach we used to participate in the recent SemEval task 12 shared task on toponym resolution [28]. Our best DFFNN approach took advantage of domain specific embeddings as well as linguistic features. It achieves a significant increase in F-measure compared to the baseline system (from 69.84% to 80.13%). However, as the official results were not available at the time of writing, a comparison with other approaches cannot be made at this time. The focus of this paper was to propose a deep learning based model for toponym detection and to experiment with the use of external linguistic features and domain specific information. The model was evaluated using the recent SemEval task 12 dataset [28] and shows that domain specific embeddings as well as some linguistic features do help in toponym detection in medical journals.

One of the main factors preventing us from exploring deeper models was the small size of the data set. With more human annotated data, the models could be extended for better performance. However, since human annotated data is expensive to produce, we suggest that distant supervision [15] be explored to further increase performance. As our experiments pointed out, the model could heavily benefit from linguistic insights; hence, equipping the model with more linguistically driven features could potentially lead to a higher performing model. We did not have the time or computational resources to explore recurrent neural architectures, but future work could focus on these models.

Acknowledgments. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References 1. Amitay, E., Har’El, N., Sivan, R., Soffer, A.: Web-a-where: geotagging web content. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–280. ACM (2004) 2. Ardanuy, M.C., Sporleder, C.: Toponym disambiguation in historical documents using semantic and geographic features. In: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, pp. 175–180. ACM (2017) 3. Benson, D.A., et al.: Genbank. Nucleic Acids Res. 41(D1), D36–D42 (2012) 4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. ”O’Reilly Media, Inc.”, Sebastopol (2009) 5. Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016) 6. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008) 7. DeLozier, G., Baldridge, J., London, L.: Gazetteer-independent toponym resolution using geographic word profiles. In: AAAI, pp. 2382–2388 (2015) 8. Garbin, E., Mani, I.: Disambiguating toponyms in news. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 363–370. Association for Computational Linguistics (2005) 9. Gautret, P., Botelho-Nevers, E., Brouqui, P., Parola, P.: The spread of vaccinepreventable diseases by international travellers: a public-health concern. Clin. Microbiol. Infect. 18, 77–84 (2012) 10. Gelernter, J., Balaji, S.: An algorithm for local geoparsing of microtext. GeoInformatica 17(4), 635–667 (2013) 11. Google: Pretrained word and phrase vectors. https://code.google.com/archive/p/ word2vec/ (2019). Accessed 10 Jan 2019 12. Green, A.D., Roberts, K.I.: Recent trends in infectious diseases for travellers. Occup, Med. 50(8), 560–565 (2000). https://dx.doi.org/10.1093/occmed/50.8.560 13. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012) 14. Kienreich, W., Granitzer, M., Lux, M.: Geospatial anchoring of encyclopedia articles. In: Tenth International Conference on Information Visualisation (IV 2006), pp. 211–215 (July 2006). https://doi.org/10.1109/IV.2006.57 15. Krause, S., Li, H., Uszkoreit, H., Xu, F.: Large-scale learning of relation-extraction rules with distant supervision from the web. In: Cudr´e-Mauroux, P., et al. (eds.) ISWC 2012. LNCS, vol. 7649, pp. 263–278. Springer, Heidelberg (2012). https:// doi.org/10.1007/978-3-642-35176-1 17


16. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016) 17. Li, L., Jin, L., Jiang, Z., Song, D., Huang, D.: Biomedical named entity recognition based on extended recurrent neural networks. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 649–652 (Nov 2015). https://doi.org/10.1109/BIBM.2015.7359761 18. Lieberman, M.D., Samet, H.: Multifaceted toponym recognition for streaming news. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 843–852. ACM (2011) 19. Magge, A., Weissenbacher, D., Sarker, A., Scotch, M., Gonzalez-Hernandez, G.: Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics 34(13), i565–i573 (2018) 20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 21. Moen, S., Ananiadou, T.S.S.: Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pp. 39–43 (2013) 22. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp. 1310–1318 (2013) 23. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 24. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. https://nlp.stanford.edu/projects/glove/ (2014). Accessed 10 Jan 2019 25. Purves, R.S., et al.: The design and implementation of spirit: a spatially aware search engine for information retrieval on the internet. Int. J. Geogr. Inf. Sci. 21(7), 717–745 (2007) 26. Qin, T., Xiao, R., Fang, L., Xie, X., Zhang, L.: An efficient location extraction algorithm by leveraging web contextual information. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 53–60. ACM (2010) 27. Santos, J., Anast´ acio, I., Martins, B.: Using machine learning methods for disambiguating place references in textual documents. GeoJournal 80(3), 375–392 (2015) 28. SemEval: Toponym resolution in scientific papers. https://competitions.codalab. org/competitions/19948#learn the details-overview (2018). Accessed 20 Jan 2019 29. Tamames, J., de Lorenzo, V.: EnvMine: a text-mining system for the automatic extraction of contextual information. BMC Bioinformatics 11(1), 294 (2010) 30. Taylor, M.: Reduced Geographic Scope as a Strategy for Toponym Resolution. Ph.D. thesis, Northern Arizona University (2017) 31. Wang, P., Qian, Y., Soong, F.K., He, L., Zhao, H.: A unified tagging solution: bidirectional LSTM recurrent neural network with word embedding. arXiv preprint arXiv:1511.00215 (2015) 32. Weissenbacher, D., Sarker, A., Tahsin, T., Scotch, M., Gonzalez, G.: Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods. AMIA Summits Transl. Sci. Proc. 2017, 114 (2017) 33. Weissenbacher, D., et al.: Knowledge-driven geospatial location resolution for phylogeographic models of virus migration. Bioinformatics 31(12), i348–i356 (2015). https://dx.doi.org/10.1093/bioinformatics/btv259

Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary

Makoto Hiramatsu (University of Tsukuba, Tsukuba, Japan), Kei Wakabayashi (University of Tsukuba, Tsukuba, Japan), and Jun Harashima (Cookpad Inc., Yokohama, Japan)
[email protected]

Abstract. Named entity recognition is a fundamental task in natural language processing and has been widely studied. The construction of a recognizer requires training data that contains annotated named entities. However, it is expensive to construct such training data for low-resource domains. In this paper, we propose a recognizer that uses not only training data but also a domain specific dictionary that is available and easy to use. Our recognizer first uses character-based distributed representations to classify words into categories in the dictionary. The recognizer then uses the output of the classification as an additional feature. We conducted experiments to recognize named entities in recipe text and report the results to demonstrate the performance of our method.

Keywords: Named entity recognition · Recipe text · Neural network

1 Introduction

Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP) [20]. The task is typically formulated as a sequence labeling problem, for example, estimating the most likely tag sequence Y = (y_1, y_2, ..., y_N) for a given word sequence X = (x_1, x_2, ..., x_N). We can train a recognizer using annotated data that consists of (X, Y) pairs.

However, the construction of such annotated data is labor-intensive and time-consuming. Although the beginning, inside, and outside (BIO) format is often used in NER, it is challenging to annotate sentences with tags, particularly for people who are not familiar with NLP. Furthermore, there are low-resource domains that do not have a sufficient amount of data. We focus on the recipe domain as an example of such a domain. Even in such domains, we can find a variety of available dictionaries. For example, Nanba et al. [13] constructed a cooking ontology, Harashima et al. [3] constructed a dictionary for ingredients, and Yamagami et al. [21] built a knowledge base for basic cuisine. These resources can be utilized for NER (Fig. 1).

Fig. 1. LSTM-CRF based neural network.

In this paper, we propose a method to integrate a domain-specific dictionary into a neural NER model using a character-based word classifier. We demonstrate the effectiveness of the proposed method using experimental results with the Cooking Ontology dataset [13] as a dictionary. We report our experimental results on the recipe domain NE corpus [12].

2 Related Work

In recent years, NER methods that use long short-term memory (LSTM) [4] and conditional random fields (CRF) [7] have been extensively studied [8–10,15,19]. This type of neural network is based on Huang et al. [5]. Note that these models use a bidirectional LSTM (Bi-LSTM), which concatenates the outputs of two LSTMs: a forward LSTM and a backward LSTM. In these studies, the researchers assumed that training data with sequence label annotation was provided in advance.

In our experiments, we use recipe text as a low-resource domain to evaluate our proposed method. Although Mori et al. [12] constructed an r-NE corpus, it consists of only 266 Japanese recipes. To overcome this problem, Sasada et al. [18] proposed an NE recognizer that is trainable from partially annotated data. However, as seen in Sect. 5.4, the method does not perform better than recent neural network-based methods.

Preparing training data for NER is time-consuming and difficult. In addition to the strategy that uses partial annotation, there have been attempts to make use of other available resources. Peters et al. [15,16] acquired informative features using language modeling. However, these approaches require a large amount of unlabeled text for training, which is difficult to obtain in a low-resource scenario. To avoid this difficulty, making use of a task that does not require a large amount of data could be useful.

Whereas it is time-consuming to prepare training data for NER, it is relatively easy to construct a domain-specific dictionary [1,13,21]. Some researchers have used a dictionary as an additional feature [10,19]. Pham et al. [10] incorporated dictionary matching information as additional dimensions of a token's feature vector. In their method, the representations are zero vectors for words that are not in the dictionary. Our proposed method overcomes this limitation by extracting character-based features from a classifier trained on a dictionary.

3 Baseline Method

Fig. 2. Word-level feature extractor proposed by Lample et al. [8]

As described in Sect. 2, the most popular methods combine a bidirectional LSTM (Bi-LSTM) and a CRF, the so-called LSTM-CRF. Lample et al. [8] take into account not only word-level but also character-level information to extract features. Figure 2 illustrates the word-level feature extractor proposed by Lample et al.

Let X = (x_1, x_2, ..., x_N) be an input word sequence and C_t = (c_{t,1}, c_{t,2}, ..., c_{t,M}) be the character sequence of the t-th word. The word distributed representation corresponding to x_t is denoted v_{x_t}, and the character distributed representation corresponding to c_{t,k} is denoted v_{c_{t,k}}. Let V_{C_t} = (v_{c_{t,1}}, v_{c_{t,2}}, ..., v_{c_{t,M}}). Their model can then be written as

  w_t^(char) = Bi-LSTM^(char)(V_{C_t}),    (1)
  x_t = [w_t ; w_t^(char)],                (2)

where w_t is the word representation corresponding to x_t. Then, letting V_X = (x_1, x_2, ..., x_N),

  h_t = Bi-LSTM(V_X)_t.                    (3)

After extracting the feature vectors h_t of the sequence, a CRF predicts the tag sequence while taking tag transitions into account. Let y = (y_1, y_2, ..., y_N) be a tag sequence. Using H = (h_1, h_2, ..., h_N), the probability of a tag sequence is

  P(y | H; W, b) = ∏_{i=1}^{N} ψ_i(y_{i−1}, y_i, H) / Σ_{y′ ∈ Y(H)} ∏_{i=1}^{N} ψ_i(y′_{i−1}, y′_i, H),    (4)

where ψ_i(y_{i−1}, y_i, H) = exp(W_{y_i}^T h_i + b_{y_{i−1}, y_i}), W_{y_i} is a weight vector, and b_{y_{i−1}, y_i} is a bias (transition) term. What we want is the optimal tag sequence ŷ, defined by

  ŷ = argmax_{y ∈ Y(H)} P(y | H; W, b).    (5)

The optimal tag sequence ŷ can be obtained by maximizing P with the Viterbi algorithm (Fig. 3).
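A compact PyTorch sketch of the feature extractor in Eqs. (1)–(3) may help; it is not the authors' code, the dimensions are only illustrative (Sect. 5.3 reports 100-dimensional word vectors and 50-dimensional character vectors), and the CRF of Eqs. (4)–(5) would be stacked on top of the returned features.

```python
import torch
import torch.nn as nn

class WordCharEncoder(nn.Module):
    """Bi-LSTM feature extractor in the spirit of Eqs. (1)-(3): a character
    Bi-LSTM per word, concatenated with the word embedding, then a word-level
    Bi-LSTM over the sentence."""
    def __init__(self, n_words, n_chars, word_dim=100, char_dim=50, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True, bidirectional=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden,
                                 batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids_per_word):
        # char_ids_per_word: list of 1-D LongTensors, one per word in the sentence
        char_feats = []
        for chars in char_ids_per_word:
            _, (h_n, _) = self.char_lstm(self.char_emb(chars).unsqueeze(0))  # Eq. (1)
            char_feats.append(torch.cat([h_n[0, 0], h_n[1, 0]]))             # fwd + bwd final states
        x = torch.cat([self.word_emb(word_ids), torch.stack(char_feats)], dim=-1)  # Eq. (2)
        h, _ = self.word_lstm(x.unsqueeze(0))                                 # Eq. (3)
        return h.squeeze(0)   # (sentence_length, 2 * hidden) features for the CRF layer
```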

Fig. 3. Overview of the character-based word classifier. We use a 3 stacked Bi-LSTM.

4 Proposed Method

In this paper, we propose a recognizer that uses not only training data but also a domain-specific dictionary. As described in Sect. 1, it is expensive to construct training data for a recognizer. We thus make use of a domain-specific dictionary that contains pairs that consist of a word and category.

Fig. 4. Overview of the proposed method. We concatenate the classifier output to a feature vector from the Bi-LSTM.

Figure 4 shows the architecture of our proposed recognizer. Our recognizer can be considered an extension of Lample et al. [8]. We incorporate a character-based word classifier which computes a_t as follows:

  h_t^(classifier) = Stacked-Bi-LSTM(C_t),    (6)
  a′_t = W h_t^(classifier) + b,              (7)
  a_t = Softmax(a′_t).                        (8)

This classifier is a neural network that consists of an embedding layer, a stacked Bi-LSTM layer, and a fully connected layer. A stacked Bi-LSTM is a network that applies a Bi-LSTM k times, where k > 1. The classifier takes the character sequence of a word as input and predicts its category as defined in the dictionary. After passing through the word classifier, our method concatenates the hidden state computed in Sect. 3 with the output of the classifier, defined by h′_t = h_t ⊕ a_t. Finally, as in Sect. 3, our method transforms h′_t by z_t = W h′_t + b′, and the CRF predicts the most likely tag sequence.
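The classifier of Eqs. (6)–(8) can be sketched as follows; this is a minimal PyTorch illustration (not the authors' implementation) with dimensions assumed from Sect. 5.3 and the 3-layer stacking mentioned in Fig. 3.

```python
import torch
import torch.nn as nn

class CharWordClassifier(nn.Module):
    """Character-based word classifier of Eqs. (6)-(8): character embeddings,
    a stacked Bi-LSTM, and a softmax over the dictionary categories."""
    def __init__(self, n_chars, n_categories, char_dim=50, hidden=50, layers=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_categories)

    def forward(self, char_ids):                  # char_ids: (1, word_length)
        _, (h_n, _) = self.lstm(self.emb(char_ids))
        h = torch.cat([h_n[-2, 0], h_n[-1, 0]])   # final fwd/bwd states of the top layer, Eq. (6)
        return torch.softmax(self.fc(h), dim=-1)  # Eqs. (7)-(8): category distribution a_t

# In the tagger, the classifier output a_t is concatenated with the Bi-LSTM
# feature h_t of Sect. 3 before the CRF: h_prime_t = torch.cat([h_t, a_t])
```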

Table 1. Statistics of the corpora used in our experiments. Note that the r-NE corpus is annotated for NEs with the BIO format. We show character-level information only for r-NE because it is used to train the recognizer.

  Attribute    Cookpad       Wikipedia     r-NE
  Doc          1,715,589     1,114,896     436
  Sent         12,659,170    18,375,840    3,317
  Token        216,248,517   600,890,895   60,542
  Type         221,161       2,306,396     3,390
  Char token   –             –             91,560
  Char type    –             –             1,130

Table 2. r-NEs and their frequencies.

  NE   Description          # of examples
  F    Food                 6,282
  T    Tool                 1,956
  D    Duration             409
  Q    Quantity             404
  Ac   Action by the chef   6,963
  Af   Action by foods      1,251
  Sf   State of foods       1,758
  St   State of tools       216

Although our method is simple, it has two advantages. First, our method is based on character-level distributed representations, which avoid the mismatch problem between words in the training data and words in the dictionary. Second, the method can use a dictionary with arbitrary categories that are not necessarily equal to the NE categories in the sequence labels. Consequently, our method can be applied in all scenarios in which there is a small amount of training data containing NEs and a domain dictionary constructed arbitrarily.

5 Experiments

5.1 Datasets

We used the following four datasets:

r-NE [12]: used to train and test the methods. We used 2,558 sentences for training, 372 for validation, and 387 for testing.

Table 3. Word categories, frequencies, and results on classification.

  Category                                  # of examples  Prec   Recall  F-score
  Ingredient-seafood (example: salmon)      452            0.60   0.62    0.61
  Ingredient-meat (example: pork)           350            0.88   0.83    0.85
  Ingredient-vegetable (example: lettuce)   935            0.75   0.79    0.77
  Ingredient-other (example: bread)         725            0.75   0.71    0.73
  Condiment (example: salt)                 907            0.81   0.84    0.83
  Kitchen tool (example: knife)             633            0.79   0.74    0.76
  Movement (example: cut)                   928            0.94   0.99    0.96
  Other                                     896            0.70   0.66    0.68

Cooking Ontology [13]: used to train the word classifier. We used 3,825 words for training, 1,000 for validation, and 1,000 for testing.

Cookpad [2]: used to train word embeddings. The Cookpad corpus contains 1.7M recipe texts.

Wikipedia: used to train word embeddings. This corpus covers a wide variety of topics. We downloaded the raw data from the Wikipedia dump (https://dumps.wikimedia.org/jawiki/). The Wikipedia corpus contains 1.1M articles.

As shown in Table 2 and Table 3, the categories in the cooking ontology are different from the tags in the r-NE corpus. However, as described in Sect. 4, our method flexibly incorporates such information into its network.

5.2 Methods

We compared the following methods in our experiments:

Sasada et al. [18] is a pointwise tagger that uses logistic regression (LR).

Sasada et al. [18]+DP is an extension of LR which optimizes LR's predictions using dynamic programming. This method achieved state-of-the-art performance on the r-NE task.

Lample et al. [8] is the LSTM-CRF tagger described in Sect. 3.

Dictionary is an LSTM-CRF based naive baseline that uses a dictionary. A dictionary feature is added to Lample's features in the form of a one-hot vector.

Proposed is the proposed method that uses the character-level word classifier described in Sect. 4.

Table 4. Results on NER (averaged over five runs, except for Sasada et al. [18], because KyTea [14], the text analysis toolkit used in their experiments, does not have an option to specify a random seed).

  Method               Decoder  Embedding  Prec.           Recall          F-score
  Sasada et al. [18]   –        –          82.34           80.18           81.20
  Sasada et al. [18]   DP       –          82.94           82.82           82.80
  Lample et al. [8]    CRF      Uniform    82.59 (± 0.94)  88.19 (± 0.25)  85.24 (± 0.46)
  Lample et al. [8]    CRF      Cookpad    84.54 (± 1.22)  88.47 (± 0.69)  86.40 (± 0.89)
  Lample et al. [8]    CRF      Wikipedia  85.31 (± 0.67)  88.22 (± 0.65)  86.68 (± 0.47)
  Dictionary           CRF      Uniform    82.36 (± 1.25)  88.28 (± 0.25)  85.18 (± 0.71)
  Dictionary           CRF      Cookpad    83.91 (± 1.21)  88.60 (± 0.41)  86.16 (± 0.72)
  Dictionary           CRF      Wikipedia  85.44 (± 1.04)  87.67 (± 0.25)  86.50 (± 0.56)
  Proposed             CRF      Uniform    82.81 (± 0.88)  88.40 (± 0.41)  85.46 (± 0.58)
  Proposed             CRF      Cookpad    85.08 (± 1.30)  88.46 (± 0.18)  86.68 (± 0.71)
  Proposed             CRF      Wikipedia  85.63 (± 0.52)  88.87 (± 0.37)  87.18 (± 0.34)

Pre-trained Word Embeddings

In NLP, a popular approach is to make use of pre-trained word embeddings to initialize parameters in neural networks. In this paper, three strategies are used to initialize word vectors: Uniform initializes word vectors by sampling from −3 3 , dim ]. the uniform distribution over [ dim Wikipedia initializes word vectors using those trained on the Wikipedia corpus. Word vectors not in pre-trained word vectors are initialized by Uniform. Cookpad initializes word vectors using those trained on the Cookpad corpus. Word vectors not in pre-trained word vectors are initialized by Uniform. We use train word embeddings with skip-gram with negative sampling (SGNS) [11]. As the hyperparameter of SGNS, we set 100 as the dimension of the word vector, 5 for the size of the context window, and 5 for the size of negative examples, and use default parameters defined in Gensim [17] for other parameters. In our proposed network, we set 50 dimensions for character-level distributed representations and 2 × 50 for character-level Bi-LSTM as a word classifier. The word feature extracted by the word classifier is concatenated with the wordlevel representation and fed into the word-level Bi-LSTM to obtain the entire

46

M. Hiramatsu et al.

word features. To train neural networks, we use the Adam optimizer [6] with mini-batch size 10 and clip gradient with threshold 5.0. 5.4

Experimental Results and Discussion

Table 3 shows the performance of our word classifier. Our classifier successfully classified words with a certain degree of accuracy. We show the results of comparing each recognizer in Table 4 In our experiments, (i) pre-trained word vectors played an essential role in improving the performance of NER and (ii) our classifier enhanced the performance of the Lample’s method. Interestingly, we obtained the best result when pre-trained word vectors were trained on the Wikipedia corpus, which is not a domain-specific corpus. This suggests that our method to have successfully combined universal knowledge from pre-trained word vectors and domain-specific knowledge from the classifier trained on a domain-specific dictionary. Table 5. Results on named entity recognition (for each NE, averaged over five times). NE Precision

Recall

Fscore

Ac 91.77 (± 1.02) 95.23 (± 0.42) 93.46 (± 0.33) Af

78.87 (± 3.68) 78.12 (± 1.19) 78.46 (± 2.22)

D

96.63 (± 1.71) 93.88 (± 2.88) 95.23 (± 2.16)

F

85.84 (± 0.94) 89.01 (± 0.65) 87.39 (± 0.59)

Q

58.70 (± 3.81) 70.00 (± 3.19) 63.69 (± 1.82)

Sf

75.12 (± 4.40) 78.17 (± 1.95) 76.52 (± 2.04)

St

66.03 (± 5.64) 52.63 (± 4.70) 58.46 (± 4.52)

T

82.53 (± 2.30) 89.09 (± 1.26) 85.66 (± 1.21)

We show the label-wise results of prediction in Table 5. In this result, we can see that the proposed model successfully predicted tags of Ac, D, F, and T. However, prediction performances for Af, Q, Sf, and St were limited because there is no entry for these categories in our dictionary.

Example Translation Ground Truth Baseline Proposed method

Yo ji Cocktail stick B-T B-Sf B-T

de DAT O O O

Fig. 5. Prediction results for an example

to

me clip B-Ac O B-Ac O B-Ac O

te O O O

Named Entity Recognition by Character-Based Word Classification Example Translation Ground Truth Baseline Peoposed method

Denshi renji Microwave B-T I-T B-T I-T B-T I-T

( ( O O O

500 500 B-St B-T B-St

W W I-St I-T I-St

) ) O O O

47

de DAT O O O

Fig. 6. Prediction results for another example

Figure 5 and Fig. 6 show prediction results for the baseline and our methods. Note that the abbreviation DAT means dative. In the first example, the word classifier taught the model that cocktail stick was a kitchen tool, which made the proposed method successfully recognize it as a tool. In the second example, the word classifier taught the model that 500W is not a kitchen tool. Then, the proposed method avoided the baseline’s failure and estimate the correct NE tag sequence.

6

Conclusion

We proposed a recognizer that is trainable from not only annotated NEs but also a list of examples for some categories related to NE tags. The proposed method uses the output of a character-based word classifier. Thanks to this characterbased modeling, the proposed method considers sub-word information to extract dictionary features for words not in the dictionary. Our experiment demonstrates that our method achieves state-of-the-art performance on the r-NE task. This implies that the proposed method successfully extracts an informative feature to improve the performance of NER.

References 1. Chung, Y.J.: Finding food entity relationships using user-generated data in recipe service. In: Proceedings of International Conference on Information and Knowledge Management, pp. 2611–2614 (2012) 2. Harashima, J., Michiaki, A., Kenta, M., Masayuki, I.: A large-scale recipe and meal data collection as infrastructure for food research. In: Proceedings of International Conference on Language Resources and Evaluation, pp. 2455–2459 (2016) 3. Harashima, J., Yamada, Y.: Two-step validation in character-based ingredient normalization. In: Proceedings of Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, pp. 29–32 (2018) 4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1–32 (1997) 5. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF Models for Sequence Tagging (2015). https://arxiv.org/abs/1508.01991 6. Kingma, D.P., Ba, J.L.: Adam: a Method for Stochastic Optimization. In: Proceedings of International Conference on Learning Representations (2015)

48

M. Hiramatsu et al.

7. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of International Conference on Machine Learning, pp. 282–289 (2001) 8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270 (2016) 9. Ma, X., Hovy, E.: End-to-end sequence labeling via Bi-directional LSTM-CNNsCRF. In: Proceedings of Annual Meeting of the Association for Computational Linguistics (2016) 10. Mai, K., Pham, et al.: An empirical study on fine-grained named entity recognition. In: Proceedings of International Conference on Computational Linguistics, pp. 711– 722 (2018) 11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of International Conference on Learning Representations (2013) 12. Mori, S., Maeta, H., Yamakata, Y., Sasada, T.: Flow graph corpus from recipe texts. In: Proceedings of International Conference on Language Resources and Evaluation, pp. 2370–2377 (2014) 13. Nanba, H., Takezawa, T., Doi, Y., Sumiya, K., Tsujita, M.: Construction of a cooking ontology from cooking recipes and patents. In: Proceedings of ACM International Joint Conference on Pervasive and Ubiquitous Computing Adjunct Publication, pp. 507–516 (2014) 14. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable japanese morphological analysis. In: Proceedings of Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 529–533 (2011) 15. Peters, M.E., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. In: Proceedings of Annual Meeting of the Association for Computational Linguistics, pp. 1756–1765 (2017) 16. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237 (2018) 17. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of LREC Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010) 18. Sasada, T., Mori, S., Kawahara, T., Yamakata, Y.: Named entity recognizer trainable from partially annotated data. In: Proceedings of International Conference of the Pacific Association for Computational Linguistics. vol. 593, pp. 148–160 (2015) 19. Sato, M., Shindo, H., Yamada, I., Matsumoto, Y.: Segment-level neural conditional random fields for named entity recognition. In: Proceedings of International Joint Conference on Natural Language Proceedings of Sing, pp. 97–102. No. 1 (2017) 20. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of International Conference on Computational Linguistics, pp. 2145–2158 (2018) 21. Yamagami, K., Kiyomaru, H., Kurohashi, S.: Knowledge-based dialog approach for exploring user’s intention. In: Procceedings of FAIM/ISCA Workshop on Artificial Intelligence for Multimodal Human Robot Interaction, pp. 53–56 (2018)

Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain Suyash Sangwan1(B) , Raksha Sharma2 , Girish Palshikar3 , and Asif Ekbal1 1

2

Indian Institute of Technology Patna, Bihta, India [email protected], [email protected] Indian Institute of Technology, Roorkee, Roorkee, India [email protected] 3 TCS Innovation Labs, Chennai, India [email protected]

Abstract. Automatically extracting different types of knowledge from authoritative biomedical texts, e.g., scientific medical literature, electronic health records etc., and representing it in a computer analyzable as well as human-readable form is an important but challenging task. One such knowledge is identifying entities with their biological types in the biomedical domain. In this paper, we propose a system which extracts end-to-end entity mentions with their biological types from a sentence. We consider 7 interrelated tags for biological types viz., gene, biological-process, molecularfunction, cellular-component, protein, disease, drug. Our system employs an automatically created biological ontology and implements an efficient matching algorithm for end-to-end entity extraction. We compare our approach with a Noun-based entity extraction system (baseline) as well as we show a significant improvement over standard entity extraction tools, viz., Stanford-NER, Stanford-OpenIE.

Keywords: Biomedical entity tagging extraction · POS tagging

1

· Ontology creation · Entity

Introduction

An enormous amount of biomedical data have been generated and collected at an unprecedented speed and scale. For example, the application of electronic health records (EHRs) is documenting large amounts of patient data. However, retrieving and processing this information is very difficult due to the lack of formal structure in the natural language used in these documents. Therefore we need to build systems which can automatically extract the information from the biomedical text which holds the promise of easily consolidating large amounts of biological knowledge in computer or human accessible form. Ability to query and use such extracted knowledge-bases can help scientists, doctors and other c Springer Nature Switzerland AG 2023  A. Gelbukh (Ed.): CICLing 2019, LNCS 13452, pp. 49–60, 2023. https://doi.org/10.1007/978-3-031-24340-0_5

50

S. Sangwan et al.

users in performing tasks such as question-answering, diagnosis and identifying opportunities for the new research. Automatic identification of entities with their biological types is a complex task due to the domain-specific occurrences of entities. Consider the following example to understand the problem well. – Input Sentence: Twenty courses of 5-azacytidine (5-Aza) were administrated as maintenance therapy after induction therapy with daunorubicin and cytarabine. – Entities Found = {5-azacytidine, daunorubicin, cytarabine} – Biological types for the Entities = {drug, drug, drug} All the three extracted entities in the example are specific to the biomedical domain having biological type drug. Hence, an entity extraction tool trained on generic data will not be able to capture these entities. Figure 1 shows the entities tagged by Stanford-NER and Stanford-OpenIE tools. Stanford-NER fails to tag any of the entity, while Stanford-OpenIE is able to tag daunorubicin. In this paper, we propose an approach which uses the biomedical ontology to extract end-to-end entity mentions with their biological types from a sentence in the biomedical domain. By end-to-end we mean that correctly identify the boundary of each entity mention. We consider 7 interrelated tags for biological types viz., gene, biological-process, molecular-function, cellular-component, protein, disease, drug. They together form a complete biological system, where one biological type is the cause or effect of another biological type. The major contribution of this research is as follows. 1. Ontology in the biomedical domain: Automatic creation of ontology having biological entities with their biological types. 2. Identifying end-to-end entities with their biological types: We have implemented an efficient matching algorithm, named it All Subsequences Entity Match (ASEM). It is able to extract entities with their biological types from a sentence using ontology. ASEM is also able to tag entities which are the subsequence of another entity. For example, for mammalian target of rapamycin, our system detects two entities {mammalian target of rapamycin, rapamycin} with biological types {protein, drug} respectively. Since nouns are the visible candidates for being entities, we consider a Nounbased entity extraction system as a baseline. This system uses NLTK POS tagger for tagging the words with POS tags. In addition, we compare performance of our ASEM-based approach with Stanford-NER1 and Stanford-OpenIE2 . The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 gives a description of the dataset used. Section 4 presents the ontology creation details and the ASEM algorithm. Section 5 provides experimental setup and results and Sect. 6 concludes the paper. 1 2

Available at: https://nlp.stanford.edu/software/CRF-NER.shtml. Available at: https://nlp.stanford.edu/software/openie.html.

Cold Is a Disease and D-cold Is a Drug

51

Fig. 1. Entity tagging by Stanford-NER and Stanford-OpenIE

2

Related Work

Entity Extraction has been a widely studied area of research in NLP. There have been attempts for both supervised as well as unsupervised techniques for entity extraction task [5]. Etzioni et al. (2005) [8] proposed an unsupervised approach to extract named entities from the Web. They built a system KNOWITALL, which is a domain-independent system that extracts information from the Web in an unsupervised and open-ended manner. KNOWITALL introduces a novel, generate-and-test architecture that extracts information in two stages. KNOWITALL utilizes a set of eight domain-independent extraction patterns to generate candidate facts. Baluja et al. (2000) [2] presented a machine learning approach for building an efficient and accurate name spotting system. They described a system that automatically combines weak evidence from different, easily available sources: parts-of-speech tags, dictionaries, and surface-level syntactic information such as capitalization and punctuation. They showed that the combination of evidence through standard machine learning techniques yields a system that achieves performance equivalent to the best existing hand-crafted approach. Carreras et al. (2002) [3] presented a Named Entity Extraction (NEE) problem as two tasks, recognition (NER) and classification (NEC), both the tasks were performed sequentially and independently with separate modules. Both modules are machine learning based systems, which make use of binary AdaBoost classifiers. Cross-lingual techniques are also developed to build an entity recognition system in a language with the help of another resource-rich language [1,6,7,11,14,17,19]. There are a few instances of use of already existing ontology or creation of a new ontology for entity extraction task. Cohen and Sarawagi, (2004) [4] considered the problem of improving named entity recognition (NER) systems by using external dictionaries. More specifically, they extended state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. Textpresso’s which is a tool by Muller et al. (2004) [12] has two major elements, it has a collection of the full text of scientific articles split into individual sentences and the implementation of categories of terms for which a database of articles and individual sentences can be searched.


The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype) and classes that relate two objects (e.g., association, regulation) or describe one (e.g., biological process). Together they form a catalog of types of objects and concepts called an ontology. Wang et al. (2009) [16] used approximate dictionary matching with edit distance constraints. Their solution was based on an improved neighborhood generation method employing partitioning and prefix pruning techniques. They showed that their entity recognition system was able to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints.

There are a few instances of entity extraction in the biomedical domain [10,18,20]. Takeuchi and Collier (2005) [15] applied Support Vector Machines to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. More recently, Joseph et al. (2012) [9] built a search engine dedicated to the biomedical domain and also populated a dictionary of domain-specific entities.

In this paper, we propose an unsupervised approach, which first generates a domain-specific ontology and then performs all-subsequence matches against the ontology entries for entity extraction in the biomedical domain.

3 Dataset

To evaluate the performance of our algorithm and the other approaches, we asked an expert to manually annotate a dataset of 50 abstracts (350 sentences) of Leukemia-related papers from PubMed [13] having cause-effect relations. We obtained 231 biological entities with their biological types. Below is an example from the manually tagged output. Entities are enclosed in curly brackets and their types are attached using the '_' symbol.

{6-Mercaptopurine}_drug (6-MP_drug) is one of the main components for the treatment of childhood {acute lymphoblastic leukemia}_disease (ALL_disease).

To observe the performance of our system on a large corpus, we used an untagged corpus of 10,000 documents provided by Sharma et al. (2018) [13]. They downloaded 10,000 abstracts of Leukemia-related papers from PubMed using the Biopython library with the Entrez package. They used this dataset (89,947 sentences with 1,935,467 tokens) to identify causative verbs in the biomedical domain.
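Purely as an illustration of the corpus collection step mentioned above, the following is a minimal Python sketch of how abstracts might be downloaded with Biopython's Entrez interface. The query term, contact e-mail, and retmax value are assumptions for the sketch, not the settings used by Sharma et al.

from Bio import Entrez

# NCBI's E-utilities require a contact address (placeholder below).
Entrez.email = "researcher@example.org"

# Search PubMed for Leukemia-related abstracts (query term is illustrative).
handle = Entrez.esearch(db="pubmed", term="leukemia[Title/Abstract]", retmax=100)
record = Entrez.read(handle)
handle.close()

# Fetch the abstracts for the returned PubMed IDs as plain text.
handle = Entrez.efetch(db="pubmed", id=record["IdList"], rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()

print(abstracts[:500])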

4 Approach

In this paper, we present an approach to identify end-to-end biological entities with their biological types in a sentence using an automatically created ontology. The following Sects. 4.1 and 4.2 elaborate on the ontology creation process and the ASEM algorithm.

4.1 Ontology Creation

We automatically built an ontology for 7 biological types, viz., gene, biological-process, molecular-function, cellular-component, protein, disease, and drug. We referred to various authentic websites that list biological entity names with their types.³ Since direct download links are not available to obtain the complete dataset, we built a customized HTML parser to extract biological entities with their types. The selection of the websites for this task was done manually. We obtained an ontology of size 90,567 with our customized HTML parser.

Joseph et al. (2012) [9] also created a dictionary of biological entities with information about their biological types. They used this dictionary to equip TPX, a Web-based PubMed search enhancement tool that enables faster article searching using analysis and exploration features. Their process of creating the dictionary from the various sources has been granted a Japanese patent (JP2013178757). In order to enrich our ontology further, we included the entities available in TPX.

Table 1. Ontology: entity name and biological type

Entities                Biological type
Chitobiase              Gene
Reproduction            Biological-process
Acyl binding            Molecular-function
Obsolete repairosome    Cellular-component
Delphilin               Protein
Acanthocytoses          Disease
Calcimycin              Drug

We found approximately 150,000 new biological entities with their types from the work of Joseph et al. [9]. The entire ontology is stored as a hash table, where the entity name is the unique key and the biological type is the value. Table 1 shows a few entries from the ontology used in the paper. Table 2 depicts the total number of entities extracted with respect to each biological type.
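To make the storage scheme concrete, here is a minimal Python sketch of the ontology as a hash table (dictionary) keyed by entity name. The entries are taken from Table 1; the lower-cased keys and the lookup helper are illustrative choices, not part of the authors' implementation.

# Minimal sketch: the ontology as a hash table keyed by entity name.
ontology = {
    "chitobiase": "Gene",
    "reproduction": "Biological-process",
    "acyl binding": "Molecular-function",
    "obsolete repairosome": "Cellular-component",
    "delphilin": "Protein",
    "acanthocytoses": "Disease",
    "calcimycin": "Drug",
}

def biological_type(entity):
    """Return the biological type of an entity, or None if it is not in the ontology."""
    return ontology.get(entity.lower())

print(biological_type("Calcimycin"))  # -> Drug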

4.2 Algorithm: Identify Entity with Its Biological Type

Since nouns are explicitly visible candidates for being biological entities, we designed a Noun-based entity extraction system. The performance of this system depends entirely on the POS tagger, which assigns the NOUN tag. The Noun-based system is not able to identify end-to-end entities, i.e., the correct entity boundary. Our ASEM-based system is able to find the boundary of an entity in a sentence without POS tag information.

³ 1. http://www.geneontology.org/, 2. https://bioportal.bioontology.org/ontologies/DOID, 3. http://browser.planteome.org/amigo/search/ontology?q=%20regimen.

Table 2. Ontology statistics

Biological type       No. of entities
Gene                  179,591
Biological-process    30,695
Molecular-function    11,936
Cellular-component    4,376
Protein               116,125
Disease               74,470
Drug                  52,923

Noun-Based System. The system comprises 4 modules. Figure 2 depicts the workflow of the Noun-based system. The modules are described below.

Module-1: POS Tagging. We used the NLTK POS tagger to tokenize words and assign POS tags to them. Words tagged as nouns are considered candidates for being biological entities. The tagger is trained on a general corpus.⁴ We observed that NLTK failed to assign correct tags to many words specific to the biomedical domain. For example, 6-Mercaptopurine is tagged as an adjective by NLTK; however, it is the name of a medicine used for Leukemia treatment and should therefore be tagged as a noun.

Module-2: Preprocessing. Since NLTK tokenizes and tags many words erroneously, we apply preprocessing to the output produced by Module-1. In the preprocessing step, we removed all single-letter words tagged as nouns, as well as words starting and ending with a symbol. In order to reduce the percentage of wrongly tagged words, we also removed stop words. For this purpose, we used a standard list of stop words (very high-frequency words) in the biomedical domain.⁵ Below are a few examples of stop words from the list: {Blood, analysis, acid, binding, brain, complex}.

Module-3: Get Abbreviations (Abv). We observed that there were entries in the ontology for the abbreviation of an entity, but not for the actual entity. To capture such instances, we defined rules to form abbreviations from the words of a sentence (a minimal sketch is given below). For example, acute lymphoblastic leukemia is also represented as ALL. In such a scenario, if acute lymphoblastic leukemia is missing from the ontology but ALL is present, we assign the biological type of ALL to acute lymphoblastic leukemia.

⁴ We also experimented with the Stanford POS tagger, but its performance was worse than that of the NLTK tagger for the biological entities.
⁵ Available at: https://www2.informatik.hu-berlin.de/~hakenber/corpora/medline/wordFrequencies.txt.
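The paper does not spell out its abbreviation rules, so the following Python sketch illustrates one plausible rule only: concatenating the initial letters of a multi-word candidate. The rule is an assumption made for illustration.

# Minimal sketch of Module-3 (Get Abbreviations). The acronym-style rule below
# is an illustrative assumption, not the authors' documented rule set.
def get_abbreviation(phrase):
    """Return an acronym-style abbreviation for a multi-word phrase, else None."""
    words = phrase.split()
    if len(words) < 2:
        return None
    return "".join(w[0].upper() for w in words)

print(get_abbreviation("acute lymphoblastic leukemia"))  # -> ALL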


Fig. 2. Work flow of the Noun-based system

Module-4: Extract Biological Type. This module searches for the Entity Candidate (EC) in the ontology (O). If there is an exact match for the candidate word, the Biological Type (BT) of the entity is extracted. The final outcome of this module is the biological type attached to the entity name.

ASEM-Based System. The All Subsequences Entity Match algorithm finds all subsequences, up to the sentence length (n), of the sentence whose entities have to be recognized with biological types.⁶ In order to get all possible subsequences, we used the n-gram package of NLTK.⁷ This system does not require a preprocessing step because it does not rely on POS-tagged words; hence there is no error due to the tagger. Module-3 (Get Abbreviations) of the Noun-based approach is also part of the ASEM algorithm, as it helps to get the biological type of an entity whose abbreviation is an entry in the ontology when the entity itself is not. If we find an entry in the ontology for any subsequence, we consider the subsequence a valid biological entity and retrieve its biological type from the ontology. Algorithm 1 gives the pseudocode of the proposed approach, and Table 3 defines the functions and symbols used in Algorithm 1.

⁶ Though we obtained subsequences up to length n, we observed that no entity was more than 4 words long.
⁷ Available at: http://www.nltk.org/_modules/nltk/model/ngram.html.

Input: WP = {w_B^1, w_B^2, ..., w_B^k}, TPXDictionary, S = {s_B^1, s_B^2, ..., s_B^m}
Output: Entity names with their entity types for each s ∈ S

Ontology := ∅
for each Web-page wp ∈ WP do
    Ontology[EntityName, EntityType] := HTMLParser(wp)
end
Ontology[EntityName, EntityType] := Ontology[EntityName, EntityType] ∪ TPXDictionary
for each sentence s ∈ S do
    E_BT := ∅
    NG := n-grams(s)    // all subsequences of s, where n ∈ {1, 2, ..., length(s)}
    for each ng ∈ NG do
        abv_ng := Get_Abbreviation(ng)
        if ng in Ontology then E_BT := E_BT ∪ (ng, Ontology[ng])
        if abv_ng in Ontology then E_BT := E_BT ∪ (ng, Ontology[abv_ng])
    end
    Entities in s with their biological types: E_BT
end

Algorithm 1: Identifying Biological Types of Biological Entities

Table 3. Symbols used in Algorithm 1

Symbol                Description
WP                    Set of relevant Web-pages
TPXDictionary         Dictionary by Joseph et al. [9]
S                     Set of sentences in the Biomedical (B) domain
Ontology              Hash table with entity name as key and its biological type as value
HTMLParser()          Extracts an entity name and its value from an HTML page
E_BT                  Set of entities tagged with Biological Types (BT)
n-grams()             A function to obtain all subsequences (ng) of s ∈ S
Get_Abbreviation()    A function to generate the abbreviation of ng
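For concreteness, here is a minimal Python sketch of the ASEM matching step under the assumptions above. It uses NLTK's ngrams utility (the n-gram package referenced earlier) to enumerate all contiguous word subsequences of a sentence and looks each one up in the ontology hash table; the word_tokenize call, the lower-cased keys, and the ontology/get_abbreviation helpers are the illustrative ones sketched earlier, not the authors' actual resources.

from nltk import ngrams, word_tokenize

def asem(sentence, ontology, get_abbreviation):
    """Return {(chunk, biological_type)} for every subsequence found in the ontology."""
    words = word_tokenize(sentence)
    found = set()
    # Enumerate every contiguous word subsequence (n-gram) of the sentence.
    for n in range(1, len(words) + 1):
        for gram in ngrams(words, n):
            chunk = " ".join(gram)
            key = chunk.lower()
            if key in ontology:
                found.add((chunk, ontology[key]))
            else:
                # Fall back to the abbreviation of the chunk, if it has one.
                abv = get_abbreviation(chunk)
                if abv is not None and abv.lower() in ontology:
                    found.add((chunk, ontology[abv.lower()]))
    return found

# Example (with the toy ontology sketched in Sect. 4.1):
# asem("Calcimycin inhibits chitobiase", ontology, get_abbreviation)
# -> {("Calcimycin", "Drug"), ("chitobiase", "Gene")}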

5 Experimental Setup and Results

In this paper, we hypothesize that matching all subsequences of a sentence against an automatically created ontology in the biomedical domain can efficiently extract end-to-end entities with their biological types. We compare our ASEM-based system with a Noun-based system (Sect. 4.2), a NER-based system, and an OpenIE-based system.

Named entities are good candidates for being biological entities, so we used the Stanford Named Entity Recognizer (NER) to obtain entities. Open information extraction (OpenIE), on the other hand, refers to the extraction of binary relations from plain text, such as (Mark Zuckerberg; founded; Facebook). It assigns subject and object tags to the related arguments, and we considered these two arguments as candidates for being entities. In this paper, we used the Stanford-OpenIE tool. NER and OpenIE are able to extract end-to-end entities; in other words, they can tag entities consisting of multiple words. However, they both fail to tag many of the entities which are specific to the biomedical domain (see the example in Fig. 1). Algorithm 1 remains the same with NER or OpenIE, except that the all-subsequences set NG is replaced with the set of entities extracted by NER or OpenIE.

Table 4. Precision (P), Recall (R) and F-score (F) using different approaches, in %.

System        P      R      F
Noun-based    74.54  35.49  48.09
NER           94.44  7.35   13.65
OpenIE        93.75  12.98  22.81
ASEM          95.86  80.08  87.26

Table 4 shows the results obtained with the 4 different systems, viz., Noun-based, NER-based, OpenIE-based, and ASEM-based (our approach), on test data of 350 sentences with 231 manually annotated entities and their biological types. A True Positive (TP) is counted when both the entity and its type exactly match the manually tagged entry, else a False Positive (FP) is counted. A False Negative (FN) is counted when an entity is manually tagged but not produced by the system. We used the same ontology to obtain biological types with all 4 systems of Table 4. The results validate our hypothesis that the ASEM-based system obtains a satisfactory level of Precision (P), Recall (R), and F-score (F) for this domain-specific task. Though Precision is good in all cases, the first three systems fail to achieve good Recall because they rely on external NLP tools to extract entities from text. Table 5 shows the results obtained with the ASEM-based system for each biological type. We obtained a positive Pearson correlation of 0.67 between the Recall ('R' column of Table 5) obtained for the biological types and the size ('No. of entities' column


Table 5. Precision, Recall and F-score in % with respect to biological type, with the ASEM-based system.

B-Type                P    R    F
Gene                  92   95   92
Biological-process    100  75   86
Molecular-function    86   50   63
Cellular-component    100  10   18
Protein               90   91   93
Disease               100  100  100
Drug                  100  100  100

of Table 2) of the ontology. The positive correlation suggests that enriching the ontology further would enhance the performance of our approach.

Error Analysis: In the Noun-based system, where we consider nouns as candidates for entities, precision is the lowest among the compared approaches. The NLTK (or Stanford) POS tagger is not able to correctly tag domain-specific entities like PI3K/AKT, JNK/STAT, etc. (words having a special character in between); it treats PI3K and AKT as two separate words and assigns tags accordingly. Below is an example from the biomedical domain which shows the use of these entities.

"Targeted therapies in pediatric leukemia are targeting BCR/ABL, TARA and FLT3 proteins, which activation results in the downstream activation of multiple signaling pathways, including the PI3K/AKT, JNK/STAT, Ras/ERK pathways"

These random breaks in entities introduced by the POS taggers cause a drop in the overall precision of the system. In addition, the Noun-based system is not able to detect the boundary of an entity, yet the biomedical domain is full of entities consisting of multiple words. Hence, the Noun-based system produces a poor F-score of 48.09%.

The NER-based system, which uses the Stanford-NER tagger, does not break words like BCR/ABL, PI3K/AKT, JNK/STAT, and Ras/ERK into separate entities, unlike the Noun-based system. Its Precision is therefore considerably higher than that of the Noun-based system. However, due to the generic behavior of Stanford-NER, it extracts very few entities, so false negatives increase sharply and the Recall drops to 7.35%. The OpenIE-based system, on the other hand, considers all subjects and objects as candidates for entities, so there is a relatively higher chance of extracting the exact entity.

Our ASEM-based approach considers all subsequences of the input sentence as candidates for entities and matches these subsequences against the entries in the ontology. Therefore we are able to get all one-word entities, abbreviations, and entities consisting of more than one word. Consequently, we obtain a high Recall with our approach. However, the ontology is automatically created from


the Web. There are a few entities in our gold-standard dataset that are not found in the ontology. On the other hand, the ontology also contains a few words, such as led, has, and next, which are not biological entities according to our annotator. These lacunae in the ontology cause a drop in the P and F scores of our system. A positive correlation of 0.67 between Recall and ontology size again suggests that enriching the ontology further would enhance the performance of our approach.

6 Conclusion

The biomedical domain is full of domain-specific entities, which can be distinguished based on their biological types. In this paper, we presented a system to identify biological entities with their types in a sentence. We showed that All Subsequences Entity Match against an automatically created, domain-specific ontology provides a more efficient solution than Noun-based entity extraction. In addition, due to the generic behavior of standard entity extraction tools such as Stanford-NER and Stanford-OpenIE, they fail to match the level of performance achieved by the ASEM-based system. Furthermore, the high positive correlation between the Recall obtained with the ASEM-based system and the ontology size emphasizes that expanding the ontology can lead to a better system for this domain-specific knowledge (entity with its type) extraction task. Though we have shown the efficacy of our approach in the biomedical domain, we believe that it can be extended to any other domain where entities are domain specific and can be distinguished based on their types, for example, the financial and legal domains.

References

1. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 8–15. Association for Computational Linguistics (2003)
2. Baluja, S., Mittal, V.O., Sukthankar, R.: Applying machine learning for high-performance named-entity extraction. Comput. Intell. 16(4), 586–595 (2000)
3. Carreras, X., Marquez, L., Padró, L.: Named entity extraction using AdaBoost. In: Proceedings of the 6th Conference on Natural Language Learning, vol. 20, pp. 1–4. Association for Computational Linguistics (2002)
4. Cohen, W.W., Sarawagi, S.: Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 89–98. ACM (2004)
5. Collins, M.: Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 489–496. Association for Computational Linguistics (2002)
6. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th International Conference on Semantic Systems, pp. 121–124. ACM (2013)


7. Darwish, K.: Named entity recognition using cross-lingual resources: Arabic as an example. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1558–1567 (2013)
8. Etzioni, O., et al.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165(1), 91–134 (2005)
9. Joseph, T., et al.: TPX: biomedical literature search made easy. Bioinformation 8(12), 578 (2012)
10. Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., Valencia, A.: CHEMDNER: the drugs and chemical names extraction challenge. J. Cheminform. 7(1), S1 (2015)
11. Laurent, D., Séguéla, P., Nègre, S.: Cross lingual question answering using QRISTAL for CLEF 2006. In: Peters, C., et al. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 339–350. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74999-8_41
12. Müller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2(11), e309 (2004)
13. Sharma, R., Palshikar, G., Pawar, S.: An unsupervised approach for cause-effect relation extraction from biomedical text. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds.) NLDB 2018. LNCS, vol. 10859, pp. 419–427. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91947-8_43
14. Sudo, K., Sekine, S., Grishman, R.: Cross-lingual information extraction system evaluation. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 882. Association for Computational Linguistics (2004)
15. Takeuchi, K., Collier, N.: Bio-medical entity extraction using support vector machines. Artif. Intell. Med. 33(2), 125–137 (2005)
16. Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 759–770. ACM (2009)
17. Yang, Z., Salakhutdinov, R., Cohen, W.: Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270 (2016)
18. Yimam, S.M., Biemann, C., Majnaric, L., Šabanović, Š., Holzinger, A.: An adaptive annotation approach for biomedical entity and relation recognition. Brain Inform. 3(3), 157–168 (2016). https://doi.org/10.1007/s40708-016-0036-4
19. Zhang, B., et al.: ELISA-EDL: a cross-lingual entity extraction, linking and localization system. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 41–45 (2018)
20. Zheng, J.G., et al.: Entity linking for biomedical literature. BMC Med. Inform. Decis. Mak. 15(1), S4 (2015)

A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition

Suzushi Tomori1(B), Yugo Murawaki1, and Shinsuke Mori2

1 Graduate School of Informatics, Kyoto University, Kyoto, Japan
[email protected], [email protected]
2 Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
[email protected]

Abstract. We propose PYHSCRF, a novel tagger for domain-specific named entity recognition that only requires a few seed terms, in addition to unannotated corpora, and thus permits the iterative and incremental design of named entity (NE) classes for new domains. The proposed model is a hybrid of a generative model named PYHSMM and a semi-Markov CRF-based discriminative model, which play complementary roles in generalizing seed terms and in distinguishing between NE chunks and non-NE words. It also allows a smooth transition to full-scale annotation because the discriminative model makes effective use of annotated data when available. Experiments involving two languages and three domains demonstrate that the proposed method outperforms baselines.

Keywords: Named entity recognition · Generative/Discriminative model · Natural Language Processing

1 Introduction

Named entity recognition (NER) is the task of extracting named entity (NE) chunks from texts and classifying them into predefined classes. It has a wide range of NLP applications such as information retrieval [1], relation extraction [2], and coreference resolution [3]. While the standard classes of NEs are PERSON, LOCATION, and ORGANIZATION, among others, domain-specific NER with specialized classes has proven to be useful in downstream tasks [4].

A major challenge in developing a domain-specific NER system lies in the fact that a large amount of annotated data is needed to train high-performance systems, and even larger amounts are needed for neural models [5]. In many domains, however, domain-specific NE corpora are small in size or even nonexistent because manual corpus annotation is costly and time-consuming. What is worse, domain-specific NE classes cannot be designed without specialized knowledge of the target domain, and even with expert knowledge, a trial-and-error process is inevitable, especially in the early stage of development.


In this paper, we propose PYHSCRF, a novel NE tagger that facilitates rapid prototyping of domain-specific NER. All we need to run the tagger is a few seed terms per NE class, in addition to an unannotated target domain corpus and a general domain corpus. Even with minimal supervision, it yields reasonable performance, allowing us to go back and forth between different NE definitions. It also enables a smooth transition to full-scale annotation because it can straightforwardly incorporate labeled instances.

Regarding the technical aspects, the proposed tagger is a hybrid of a generative model and a discriminative model. The generative model, called the Pitman-Yor hidden semi-Markov model (PYHSMM) [6], recognizes high-frequency word sequences as NE chunks and identifies their classes. The discriminative model, a semi-Markov CRF (semiCRF) [7], initializes the learning process using the seed terms and generalizes to other NEs of the same classes. It also exploits labeled instances more powerfully when they are available. The two models are combined into one using a framework known as JESS-CM [8].

Generative and discriminative models have mutually complementary strengths. PYHSMM exploits frequency while semiCRF does not, at least explicitly. SemiCRF exploits contextual information more efficiently, but its high expressiveness is sometimes harmful, and it has difficulty balancing positive and negative examples. We treat the seed terms as positive examples and the general corpus as proxy data for negative examples. While semiCRF is too sensitive to use the general corpus as negative examples, PYHSMM utilizes it in a softer manner.

We conducted extensive experiments on three domains in two languages and demonstrated that the proposed method outperforms the baselines.

2 Related Work

2.1 General and Domain-Specific NER

NER is one of the fundamental tasks in NLP and has been applied not only to English but also to a variety of other languages such as Spanish, Dutch [9], and Japanese [10,11]. NER can be classified into general NER and domain-specific NER. Typical NE classes in general NER are PERSON, LOCATION, and ORGANIZATION. In domain-specific NER, special NE classes are defined to facilitate the development of downstream applications. For example, the GENIA corpus for the biomedical domain has five NE classes, such as DNA and PROTEIN, to organize research papers [12] and to extract semantic relations [13]. Disease corpora [14–16], which are annotated with the disease class and the treatment class, are used for disease-treatment relation extraction. However, domain-specific NER is not limited to the biomedical domain; it also covers recipes [17] and game commentaries [18], to name a few examples. In addition, recognition of brand names and product names [19] and recognition of the names of tasks, materials, and processes in science texts [20] can be seen as domain-specific NER.

2.2 Types of Supervision in NER

The standard approach to NER is supervised learning. Early studies used the hidden Markov model [21], the maximum entropy model [22], and support vector machines [23] before conditional random fields (CRFs) [24,25] came to dominate. A CRF can be built on top of neural network components such as a bidirectional LSTM and convolutional neural networks [26]. Although modern high-performance NER systems require a large amount of annotated data in the form of labeled training examples, annotated corpora for domain-specific NER are usually of limited size because building NE corpora is costly and time-consuming. Tang et al. [5] proposed a transfer learning model for domain-specific NER with a medium-sized annotated corpus (about 6,000 sentences).

Several methods have been proposed to get around costly annotation, and they can be classified into rule-based, heuristic feature-based, and weakly supervised methods. Rau [27] proposed a system to extract company names, while Sekine and Nobata [28] proposed a rule-based NE tagger. Settles [29] proposed a CRF model with hand-crafted features for biomedical NER. These methods are time-consuming to develop and need specialized knowledge. Collins and Singer [30] proposed bootstrap methods for NE classification that exploited a small amount of seed data to classify NE chunks into typical NE classes. Nadeau et al. [31] proposed a two-step NER system in which NE extraction followed NE classification. Since their seed-based NE list generation from Web pages exploited HTML tree structures, it cannot be applied to plain text. Zhang and Elhadad [32] proposed another two-step NER method for the biomedical domain which first uses a noun phrase chunker to extract NE chunks and then classifies them using TF-IDF and biomedical terminology.

Shang et al. [33] and Yang et al. [34] proposed weakly supervised methods that use domain-specific terminologies and an unannotated target domain corpus. Shang et al. [33] automatically build a partially labeled corpus and then train a model on it. Yang et al. [34] also use an automatically labeled corpus and then select sentences to eliminate incompletely and noisily labeled sentences; the selector is trained on a human-labeled corpus. We also use an automatically labeled corpus, but there is a major difference: we focus on rapid prototyping of domain-specific NER that only requires a few seed terms, because domain-specific terminologies are not necessarily available in other domains.

2.3 Unsupervised Word Segmentation and Part-of-Speech Induction

The model proposed in this paper has a close connection to unsupervised word segmentation and part-of-speech (POS) induction [6]. A key difference is that, while they use characters as the unit for the input sequence, we utilize word sequences. Uchiumi et al. [6] can be seen as an extension of Mochihashi et al. [35], who focused on unsupervised word segmentation and proposed a nonparametric Bayesian n-gram language model based on Pitman-Yor processes. Given an unsegmented corpus, the model infers word segmentation using Gibbs sampling.


Uchiumi et al. [6] worked on the joint task of unsupervised word segmentation and POS induction. We employ their model, PYHSMM, for our task. However, instead of combining character sequences into words and assigning POS tags to them, we group word sequences into NE chunks and assign NE classes to them.

To efficiently exploit annotated data when available, Fujii et al. [36] extended Mochihashi et al. [35] by integrating the generative word segmentation model into a CRF-based discriminative model. Our model, PYHSCRF, is also a hybrid generative/discriminative model, but there are two major differences. First, to extend the approach to NER, we combine PYHSMM with a semiCRF, not an n-gram model with a plain CRF. Second, since our goal is to facilitate rapid prototyping of domain-specific NER, we consider a much weaker type of supervision than fully annotated sentences: a few seed terms per NE class. This is challenging partly because seed terms can only be seen as implicit positive examples, although most text fragments are outside of NE chunks (i.e., the O class). Our solution is to use a general domain corpus as implicit negative examples.

Fig. 1. The overall architecture of PYHSCRF for domain-specific NER. Here, the maximum length of NE chunks L = 2. F and Ac stand for FOOD and ACTION, respectively, while O indicates a word outside of any NE chunks.

Fig. 2. Partially labeled sentences. F stands for FOOD.

3 Proposed Method

3.1 Task Setting

NER is often formalized as a sequence labeling task. Given a word sequence x = (x_1, x_2, ..., x_N) ∈ X_l, our system outputs a label sequence y = (y_1, y_2, ..., y_N') ∈ Y_l, where y_i = (z_i, b_i, e_i) means that a chunk starting at the b_i-th word and ending at the e_i-th word belongs to class z_i. The special O class is assigned to any word that is not part of an NE (if z_i = O, then b_i = e_i). In the recipe domain, for example, the word sequence "Sprinkle cheese on the hot dog" contains an NE in the F (FOOD) class, "hot dog," which corresponds to y_5 = (F, 5, 6). Likewise, the third word "on" is mapped to y_3 = (O, 3, 3).

We assume that we are given a few typical NEs per class (e.g., "olive oil" for the F class). Since choosing seed terms is far less laborious than corpus annotation, our task setting allows us to design domain-specific NER in an exploratory manner. In addition to the seed terms, an unannotated target domain corpus X_u and an unannotated general domain corpus X_g are provided. The underlying assumption is that domain-specific NEs are observed characteristically in X_u. Contrasting X_u with X_g helps distinguish NEs from the O class.
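As a small illustration of the chunk-level label representation y_i = (z_i, b_i, e_i) described above, here is a minimal Python sketch using the running example; the tuple encoding and 1-based indices follow the text, while the label for "cheese" and the data structures themselves are illustrative assumptions.

sentence = ["Sprinkle", "cheese", "on", "the", "hot", "dog"]

labels = [
    ("O", 1, 1),   # Sprinkle
    ("F", 2, 2),   # cheese -- plausibly FOOD as well (assumption; not stated in the text)
    ("O", 3, 3),   # on (given in the text as y_3 = (O, 3, 3))
    ("O", 4, 4),   # the
    ("F", 5, 6),   # "hot dog" (given in the text as y_5 = (F, 5, 6))
]

for z, b, e in labels:
    chunk = " ".join(sentence[b - 1:e])
    print(f"{chunk!r} -> {z}")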

3.2 Model Overview

Figure 1 illustrates our approach. We use seed terms as implicit positive examples. We first automatically build a partially labeled corpus ⟨X_l, Y_l⟩ using the seed terms. For example, if "olive oil" is selected as a seed term of class F, sentences in the target domain corpus X_u that contain the term are marked with its NE chunks and the class, as in Fig. 2. We train semiCRF using the partially labeled corpus (Sect. 3.3); a minimal sketch of this corpus construction step is given below.

To recognize high-frequency word sequences as NE chunks, we apply PYHSMM to the unannotated corpus X_u (Sect. 3.4). The general domain corpus X_g is also provided to the generative model as proxy data for the O class, with the assumption that domain-specific NE chunks should appear more frequently in the target domain corpus than in the general domain corpus. PYHSMM is expected to extract high-frequency word sequences in the target domain as NE chunks. Note that we do not train semiCRF with the implicit negative examples because the discriminative model is too sensitive to the noise inherent in them. We combine the discriminative and generative models using JESS-CM [8] (Sect. 3.5).
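The following Python sketch shows, under our own simplifying assumptions, how seed terms could be projected onto the target domain corpus to produce partially labeled sentences like those in Fig. 2. Exact matching over whitespace tokens and the example seed terms are illustrative choices, not the authors' implementation.

# Minimal sketch: mark seed-term occurrences in unannotated sentences to build
# a partially labeled corpus.
seeds = {"olive oil": "F", "frying pan": "T"}  # hypothetical seed terms per class

def partially_label(sentence):
    """Return (words, chunks) where chunks = [(class, begin, end)] for matched seeds (1-based)."""
    words = sentence.split()
    chunks = []
    for term, cls in seeds.items():
        term_words = term.split()
        for i in range(len(words) - len(term_words) + 1):
            if words[i:i + len(term_words)] == term_words:
                chunks.append((cls, i + 1, i + len(term_words)))
    return words, chunks

print(partially_label("heat olive oil in a frying pan"))
# -> (['heat', 'olive', 'oil', 'in', 'a', 'frying', 'pan'], [('F', 2, 3), ('T', 6, 7)])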

3.3 Semi-Markov CRF with a Partially Labeled Corpus

We use semiCRF as the discriminative model, although Markov CRF is more often used as an NE tagger. Markov CRF employs the BIO tagging scheme or variants of it to identify NE chunks. Since each NE class is divided into multiple tags (e.g., B-PERSON and I-PERSON), it is unsuitable for our task, which is characterized by the scarcity of supervision. For this reason, we chose semiCRF.


SemiCRF is a log-linear model that directly infers NE chunks and classes. The probability of y given x is defined as:

  p(y|x, Λ) = exp(Λ · F(x, y)) / Z(x),
  Z(x) = Σ_{y′ ∈ Y} exp(Λ · F(x, y′)),

where F(x, y) = (f_1, f_2, ..., f_M) are features, Λ = (λ_1, λ_2, ..., λ_M) are the corresponding weights, and Y is the set of all possible label sequences. The feature function can be expressed as the combination of F(b_i, e_i, z_i, z_{i−1}) in relation to x_i, y_i, and y_{i−1}.

The training process is different from standard supervised learning because we use the partially labeled corpus ⟨X_l, Y_l⟩. Following Tsuboi et al. [37], we marginalize the probabilities of words that are not labeled. Instead of using the full log likelihood

  LL = F(x, y) − Σ_{y′ ∈ Y} p(y′|x) F(x, y′)

as the objective function, we use the following marginalized log likelihood

  MLL = Σ_{y ∈ Y_p} p(y|Y_p, x) F(x, y) − Σ_{y′ ∈ Y} p(y′|x) F(x, y′),

where Y_p is the set of all possible label sequences in which the labeled chunks are fixed.

3.4 PYHSMM

The generative model, PYHSMM, was originally proposed for joint unsupervised word segmentation and POS induction. While it was used to group character sequences into words and assign POS tags to them, here we extend it to word-level modeling. In our case, PYHSMM consists of 1) transitions between NE classes and 2) the emission of each NE chunk x_i = x_{b_i}, ..., x_{e_i} from its class z_i. As a semi-Markov model, it employs n-grams not only for calculating transition probabilities but also for computing emission probabilities. The building blocks of PYHSMM are hierarchical Pitman-Yor processes, which can be seen as a back-off n-gram model. To calculate the transition and emission probabilities, we need to keep track of latent table assignments [38]. For notational brevity, let Θ be the set of the model's parameters. The joint probability of the i-th chunk x_i and its class z_i conditioned on history h_xz is given by

  p(x_i, z_i | h_xz; Θ) = p(x_i | h_x^n, z_i; Θ) p(z_i | h_z^n; Θ),

where h_x^n = x_{i−1}, x_{i−2}, ..., x_{i−(n−1)} and h_z^n = z_{i−1}, z_{i−2}, ..., z_{i−(n−1)}. p(x_i | h_x^n, z_i) is the chunk n-gram probability given its class z_i, and p(z_i | h_z^n) is the class n-gram probability. The posterior predictive probability of the i-th chunk is

  p(x_i | h_x^n, z_i) = (freq(x_i | h_x^n) − d · t_{h_x^n, x_i}) / (θ + freq(h_x^n))
                      + ((θ + d · t_{h_x^n}) / (θ + freq(h_x^n))) · p(x_i | h_x^{n−1}, z_i),   (1)

where h_x^{n−1} is the shorter (n−1)-gram history, θ and d are hyperparameters, freq(x_i | h_x^n) is the n-gram frequency, t_{h_x^n, x_i} is a count related to table assignments, freq(h_x^n) = Σ_{x_i} freq(x_i | h_x^n), and t_{h_x^n} = Σ_{x_i} t_{h_x^n, x_i}. The class n-gram probability is computed in a similar manner.

Gibbs sampling is used to infer PYHSMM's parameters [35]. During training, we randomly select a sentence and remove it from the parameters (e.g., we subtract its n-gram counts from freq(x_i | h_x^n)). We sample a new label sequence using forward filtering-backward sampling. We then update the model parameters by adding the corresponding n-gram counts. We repeat the process until convergence.

Now we explain the sampling procedure in detail. We consider the bigram case for simplicity. The forward score α[t][k][z] is the probability that a subsequence (x_1, x_2, ..., x_t) of a word sequence x = (x_1, x_2, ..., x_N) is generated with its last k words being a chunk (x_{t−k+1}^{t} = x_{t−k+1}, ..., x_t) generated from class z. Let L be the maximum length of a chunk and Z be the number of classes. α[t][k][z] is computed recursively as follows:

  α[t][k][z] = Σ_{j=1}^{L} Σ_{r=1}^{Z} p(x_{t−k+1}^{t} | x_{t−k−j+1}^{t−k}, z) p(z|r) α[t−k][j][r].   (2)

The forward scores are calculated from the beginning to the end of the sentence. Chunks and classes are then sampled in the reverse direction using the forward scores. There is always a special token EOS with class z_EOS at the end of the sequence. The final chunk and its class are sampled with a score proportional to

  p(EOS | w_{N−k}^{N}, z_EOS) · p(z_EOS | z) · α[N][k][z].

The second-to-last chunk is sampled similarly using the score of the last chunk, and we continue this process until we reach the beginning of the sequence. To update the parameters in Eq. (1), we add n-gram counts to freq(x_i | h_x^n) and freq(h_x^n), and also update the table assignment count t_{h_x^n, x_i}. Parameters related to the class n-gram model are updated in the same manner.

Recall that we use the general domain corpus X_g to learn the O class. We assume that X_g consists entirely of single-word chunks in the O class. Although the general domain corpus might contain some domain-specific NE chunks, most words indeed belong to the O class. During training, we add and remove sentences in X_g without performing sampling. Thus these sentences can be seen as implicit negative samples.
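To make the forward filtering-backward sampling procedure concrete, here is a minimal Python sketch for the bigram case under simplified assumptions: the chunk and class probabilities are supplied by placeholder functions rather than the hierarchical Pitman-Yor model, EOS handling is modelled through the same placeholders, and numerical underflow handling (log-space computation) is omitted.

import random

def forward_backward_sample(words, classes, chunk_prob, trans_prob, L):
    """Sample a chunking of `words` into (chunk, class) pairs.

    chunk_prob(chunk, prev_chunk, z) and trans_prob(z, r) stand in for the
    PYHSMM emission and class-transition probabilities of Eq. (1); EOS is
    modelled by chunk_prob(("<EOS>",), prev_chunk, "<EOS>") and
    trans_prob("<EOS>", z). All of these are assumed to be provided.
    """
    N = len(words)
    # alpha[t][k][z]: probability that the prefix words[:t] ends with a chunk
    # of length k generated from class z (Eq. (2), bigram case).
    alpha = [[dict.fromkeys(classes, 0.0) for _ in range(L + 1)] for _ in range(N + 1)]
    for t in range(1, N + 1):
        for k in range(1, min(L, t) + 1):
            chunk = tuple(words[t - k:t])
            for z in classes:
                if t == k:  # chunk starts the sentence
                    alpha[t][k][z] = chunk_prob(chunk, None, z) * trans_prob(z, None)
                else:
                    alpha[t][k][z] = sum(
                        chunk_prob(chunk, tuple(words[t - k - j:t - k]), z)
                        * trans_prob(z, r) * alpha[t - k][j][r]
                        for j in range(1, min(L, t - k) + 1) for r in classes)

    def draw(cands):  # cands: list of (item, weight); weights assumed positive
        items, weights = zip(*cands)
        return random.choices(items, weights=weights)[0]

    # Backward sampling: first draw the final chunk using the EOS scores ...
    k, z = draw([((k, z),
                  chunk_prob(("<EOS>",), tuple(words[N - k:N]), "<EOS>")
                  * trans_prob("<EOS>", z) * alpha[N][k][z])
                 for k in range(1, min(L, N) + 1) for z in classes])
    chunks, t = [(tuple(words[N - k:N]), z)], N - k
    # ... then walk toward the beginning, conditioning on the chunk just sampled.
    while t > 0:
        nxt_chunk, nxt_z = chunks[-1]
        k, r = draw([((j, r),
                      chunk_prob(nxt_chunk, tuple(words[t - j:t]), nxt_z)
                      * trans_prob(nxt_z, r) * alpha[t][j][r])
                     for j in range(1, min(L, t) + 1) for r in classes])
        chunks.append((tuple(words[t - k:t]), r))
        t -= k
    return list(reversed(chunks))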

3.5 PYHSCRF

PYHSCRF combines the discriminative semiCRF with the generative PYHSMM in a manner similar to the model presented in Fujii et al. [36]. The probability of label sequence y given word sequence x is written as follows:

  p(y|x) ∝ p_DISC(y|x; Λ) p_GEN(y, x; Θ)^{λ_0},

where p_DISC and p_GEN are the discriminative and generative models, respectively, and Λ and Θ are their corresponding parameters. When p_DISC is a log-linear model like semiCRF,

  p_DISC(y|x) ∝ exp( Σ_{m=1}^{M} λ_m f_m(y, x) ),

and p(y|x) can itself be expressed as a log-linear model:

  p(y|x) ∝ exp( λ_0 log(p_GEN(y, x)) + Σ_{m=1}^{M} λ_m f_m(y, x) )
         = exp(Λ* · F*(y, x)),   (3)

where Λ* = (λ_0, λ_1, λ_2, ..., λ_M) and F*(y, x) = (log(p_GEN), f_1, f_2, ..., f_M).

In other words, PYHSCRF is another semiCRF in which PYHSMM is added to the original semiCRF as a feature.

Algorithm 1. Learning algorithm for PYHSCRF. ⟨X_l, Y_l⟩ is a partially labeled corpus and X_u is an unannotated corpus in the target domain. X_g is the general domain corpus used as implicit negative examples.

for epoch = 1, 2, ..., E do
    for x in randperm(X_u, X_g) do
        if epoch > 1 then
            Remove parameters of y from Θ
        end if
        if x ∈ X_u then
            Sample y according to p(y|x; Λ*, Θ)
        else
            Determine y according to X_g
        end if
        Add parameters of y to Θ
    end for
    Optimize Λ* on ⟨X_l, Y_l⟩
end for

The objective function is

  p(Y_l | X_l; Λ*) · p(X_u, X_g; Θ).


Algorithm 1 shows our training algorithm. During training, PYHSCRF repeats the following two steps until convergence: 1) fixing Θ and optimizing Λ* of semiCRF on ⟨X_l, Y_l⟩, and 2) fixing Λ* and optimizing Θ of PYHSMM on X_u and X_g. When updating Λ*, we use the marginalized log likelihood of the partially labeled data. When updating Θ, we sample chunks and their classes from unlabeled sentences in the same manner as in PYHSMM. In PYHSCRF, a modification of Eq. (2) is needed because the forward score α[t][k][z] incorporates the semiCRF score:

  α[t][k][z] = Σ_{j=1}^{L} Σ_{r=1}^{Z} exp( λ_0 log(p(x_{t−k+1}^{t} | x_{t−k−j+1}^{t−k}, z) p(z|r)) + Λ · F(t−k+1, t, z, r) ) α[t−k][j][r],

where F(t−k+1, t, z, r) is a feature function in relation to the chunk candidate x_{t−k+1}^{t}, its class z, and the class r of the preceding chunk candidate x_{t−k−j+1}^{t−k}.
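As a small illustration of how this modification could slot into the sampler sketched in Sect. 3.4, the generative chunk score can simply be wrapped with the semiCRF contribution. The function below is a hedged sketch: semicrf_feature_score stands in for Λ · F(t−k+1, t, z, r), and all three probability/score functions are assumed to be supplied by the caller.

import math

def combined_score(chunk, prev_chunk, z, r, chunk_prob, trans_prob,
                   semicrf_feature_score, lambda0):
    """PYHSCRF weight used inside the forward recursion (cf. the modified Eq. (2))."""
    generative = chunk_prob(chunk, prev_chunk, z) * trans_prob(z, r)
    return math.exp(lambda0 * math.log(generative) + semicrf_feature_score(chunk, z, r))

# Inside the forward recursion of the earlier sketch, the term
#   chunk_prob(chunk, prev, z) * trans_prob(z, r)
# would be replaced by
#   combined_score(chunk, prev, z, r, chunk_prob, trans_prob, semicrf_feature_score, lambda0)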

4 Experiments

4.1 Data

Table 1 summarizes the specifications of the three domain-specific NER datasets used in our experiments: the GENIA corpus, the recipe corpus, and the game commentary corpus.

Table 1. Statistics of the datasets for the experiments.

Language  Corpus (#NE classes)                          #Sentences  #Words     #NE instances
English   Target: GENIA corpus (5)             Train    10,000      264,743    -
                                               Test     3,856       101,039    90,309
          General: Brown (-)                            50,000      1,039,886  -
Japanese  Target: Recipe corpus (8)            Train    10,000      244,648    -
                                               Test     148         2,667      869
          Target: Game commentary corpus (21)  Train    10,000      398,947    -
                                               Test     491         7,161      2,365
          General: BCCWJ (-)                            40,000      936,498    -
          General: Oral communication corpus (-)        10,000      124,031    -


We used the GENIA corpus, together with its test script from the BioNLP/NLPBA 2004 shared task [39], as an English corpus for the biomedical domain. It contains five biological NE classes, such as DNA and PROTEIN, in addition to the O class. The corresponding general domain corpus was the Brown corpus [40], which consists of one million words and ranges over 15 domains.

The recipe corpus [17] and the game commentary corpus [18] are both in Japanese. The recipe corpus consists of procedural texts from recipes for cooking. The game commentary corpus consists of commentaries on professional matches of Japanese chess (shogi) given by professional players and writers. We used gold-standard word segmentation for both corpora. As NEs, eight classes such as FOOD, TOOL, and ACTION were defined for the recipe corpus, while the game commentary corpus was annotated with 21 classes such as PERSON, STRATEGY, and ACTION. Note that NE chunks were not necessarily noun phrases; for example, most NE chunks labeled ACTION in the two corpora were verbal phrases. The combination of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) [41] and the oral communication corpus [42] was used as the general domain corpus. We automatically segmented the sentences in these corpora using KyTea¹ [43] (the segmentation accuracy was higher than 98%).

4.2 Training Settings

Although PYHSMM can theoretically handle arbitrarily long n-grams, we limited our scope to bigrams to reduce computational costs. To initialize PYHSMM's parameters Θ, we treated each word in a given sentence as an O-class chunk. Just as Uchiumi et al. [6] modeled expected word length with negative binomial distributions for the tasks of Japanese word segmentation and POS induction, chunk length was drawn from a negative binomial distribution. Uchiumi et al. [6] set different parameters for character types such as hiragana and kanji, but we used a single parameter. We constrained the maximum chunk length L to 6 for computational efficiency.

We used truncated normal priors N(μ, σ²) to initialize PYHSMM's weight λ_0 and semiCRF's weights λ_1, λ_2, ..., λ_M, with μ = 1.0 and σ = 1.0. We fixed the L2 regularization parameter C of semiCRF to 1.0.

Table 2. Feature templates for semiCRF. chunk_i consists of the word n-gram w_{b_i}^{e_i} = w_{b_i} w_{b_i+1} ... w_{e_i}. w_{i−1} and w_{i+1} are the preceding word and the following word, respectively. BoW is the set of words (bag-of-words) in chunk_i.

Semi-Markov CRF features
chunk_i (w_{b_i} w_{b_i+1} ... w_{e_i})
w_{i−2}, w_{i−1}, w_{i+1}, w_{i+2}
BoW(w_{b_i}, w_{b_i+1}, ..., w_{e_i})

¹ http://www.phontron.com/kytea/ (accessed on March 15, 2017).


We used stochastic gradient descent to optimize semiCRF, and the number of iterations J was set to 300. Table 2 shows the feature templates for semiCRF.

Each target domain corpus was divided into a training set and a test set. For each NE class, the 2 most frequent chunks according to the training set were selected as seed terms. In the GENIA corpus, for example, we automatically chose "IL-2" and "LTR" as seed terms for the DNA class. A minimal sketch of the Table 2 feature templates is given below.
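The following Python sketch illustrates one way the feature templates of Table 2 could be instantiated for a chunk candidate; the string-keyed feature naming, the out-of-range padding token, and the interpretation of the window as the words surrounding the chunk are illustrative choices, not the authors' implementation.

# Minimal sketch of the semiCRF feature templates in Table 2: the chunk itself,
# a window of surrounding words, and a bag-of-words over the chunk.
def chunk_features(words, b, e):
    """Features for the chunk words[b:e+1] (0-based, inclusive indices)."""
    def w(i):
        return words[i] if 0 <= i < len(words) else "<PAD>"

    chunk = words[b:e + 1]
    feats = {"chunk=" + " ".join(chunk)}
    feats.update({f"w[-2]={w(b - 2)}", f"w[-1]={w(b - 1)}",
                  f"w[+1]={w(e + 1)}", f"w[+2]={w(e + 2)}"})
    feats.update({"bow=" + token for token in chunk})
    return feats

print(chunk_features(["Sprinkle", "cheese", "on", "the", "hot", "dog"], 4, 5))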

4.3 Baselines

In biomedical NER, the proposed model was compared with two baselines. MetaMap is based on a dictionary matching approach with biomedical terminology [44]. The other baseline is the weakly supervised biomedical NER system proposed by Zhang and Elhadad [32]. To our knowledge, there was no weakly supervised domain-specific NER tool for the recipe and game commentary domains. For these domains, we created a baseline model as follows: we first used a Japanese term extractor² to extract NE chunks and then classified them with seed terms using a Bayesian HMM originally proposed for unsupervised POS induction [45]. Note that only noun phrases were extracted by the term extractor.

4.4 Results and Discussion

Table 3 compares the proposed method with the baselines in terms of precision, recall, and F-measure. We can see that PYHSCRF consistently outperformed the baselines. Taking a closer look at the results, we found that the model successfully inferred NE classes from their contexts. For example, the NE chunk "水" (water) can be both FOOD and TOOL in the recipe domain. It was correctly identified as TOOL when it was part of the phrase "水で洗い流す" (wash with water), while the phrase "水を鍋に加える" (add water to the pot) was identified as the FOOD class.

Table 3. Precision, recall, and F-measure of various systems.

Target  Method                                  Precision  Recall  F-measure
GENIA   MetaMap [44]                            N/A        N/A     7.70
        Weakly supervised biomedical NER [32]   15.40      15.00   15.20
        PYHSCRF                                 19.20      23.50   21.13
Recipe  Baseline                                49.78      25.89   34.07
        PYHSCRF                                 38.45      42.58   40.41
Game    Baseline                                52.75      29.18   37.57
        PYHSCRF                                 75.57      35.05   47.89

² http://gensen.dl.itc.u-tokyo.ac.jp/termextract.html (accessed on March 15, 2017).


Fig. 3. Learning curve for recipe NER. The horizontal axis shows the number of seed terms in each NE class.

Fig. 4. Learning curve for recipe NER. The horizontal axis shows the number of general domain sentences.

We conducted a series of additional experiments. First, we changed the number of seed terms to examine their effect. Figure 3 shows the F-measure as a function of the number of seed terms per NE class in the recipe domain. The F-measure increased almost monotonically as more seed terms became available.

A major advantage of PYHSCRF over other seed-based weakly supervised methods for NER [31,32] is that it can straightforwardly exploit labeled instances. To see this, we trained PYHSCRF with fully annotated data (about 2,000 sentences) in the recipe domain and compared it with vanilla semiCRF.


We found that they achieve competitive performance (the F-measure was 90.01 for PYHSCRF and 89.98 for vanilla semiCRF). In this setting, PYHSCRF ended up simply ignoring PYHSMM (−0.1 < λ_0 < 0.0).

Next, we reduced the size of the general domain corpus. Figure 4 shows how the F-measure changes with the size of the general domain corpus in recipe NER. We can confirm that PYHSCRF cannot be trained without the general domain corpus because it is a vital source for distinguishing NE chunks from the O class.

Finally, we evaluated NE classification performance. Collins and Singer [30] focused on weakly supervised NE classification, in which given NE chunks were classified into three classes (PERSON, LOCATION, and ORGANIZATION) by bootstrapping with seven seed terms and hand-crafted features. We tested PYHSCRF on the CoNLL 2003 dataset [46] in the same setting. We did not use a general corpus because NE chunks are given a priori. PYHSCRF achieved competitive performance (over 93% accuracy, compared to over 91% accuracy for Collins and Singer [30]), although the use of different datasets makes direct comparison difficult.

The semiCRF feature templates in our experiments are simple. Though not explored here, the accuracies can probably be improved by a wider window size or richer feature sets such as character type and POS. Word embeddings [47,48], character embeddings [49], and n-gram embeddings [50] are other possible improvements because domain-specific NE chunks exhibit spelling variants. For example, in the Japanese recipe corpus, the NE chunk "玉ねぎ" (onion, kanji followed by hiragana) can also be written as "たまねぎ" (hiragana), "タマネギ" (katakana), and "玉葱" (kanji).

5 Conclusion

We proposed PYHSCRF, a nonparametric Bayesian method for distantly supervised NER in specialized domains. PYHSCRF is useful for rapid prototyping of domain-specific NER because it does not need texts annotated with NE tags and boundaries. We only need a few seed terms as typical NEs in each NE class, an unannotated corpus in the target domain, and a general domain corpus. PYHSCRF incorporates a word-level PYHSMM and a semiCRF. In addition, we use implicit negative examples from the general domain corpus to train the O class.

In our experiments, we used a biomedical corpus in English, and a recipe corpus and a game commentary corpus in Japanese. We conducted domain-specific NER experiments and showed that PYHSCRF achieved higher accuracy than the baselines; we can therefore build a domain-specific NE recognizer at much lower cost. Additionally, PYHSCRF can easily be applied to other domains for domain-specific NER and is useful for low-resource languages and domains. In the future, we would like to investigate the effectiveness of the proposed method for downstream tasks of domain-specific NER such as relation extraction and knowledge base population.


Acknowledgement. In this paper, we used recipe data provided by Cookpad and the National Institute of Informatics.

References

1. Thompson, P., Dozier, C.C.: Name searching and information retrieval. CoRR cmp-lg/9706017 (1997)
2. Feldman, R., Rosenfeld, B.: Boosting unsupervised relation extraction by using NER. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 473–481 (2006)
3. Lee, H., Recasens, M., Chang, A., Surdeanu, M., Jurafsky, D.: Joint entity and event coreference resolution across documents. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 489–500 (2012)
4. Shahab, E.: A short survey of biomedical relation extraction techniques. CoRR abs/1707.05850 (2017)
5. Tang, S., Zhang, N., Zhang, J., Wu, F., Zhuang, Y.: NITE: a neural inductive teaching framework for domain specific NER. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2642–2647 (2017)
6. Uchiumi, K., Tsukahara, H., Mochihashi, D.: Inducing word and part-of-speech with Pitman-Yor hidden semi-Markov models. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1774–1782 (2015)
7. Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. Adv. Neural. Inf. Process. Syst. 17, 1185–1192 (2005)
8. Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: Proceedings of ACL 2008: HLT, pp. 665–673. Association for Computational Linguistics (2008)
9. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning, vol. 31, pp. 1–4 (2002)
10. Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, vol. 1 (1996)
11. Sekine, S., Isahara, H.: IREX: IR and IE evaluation project in Japanese. In: Proceedings of the International Conference on Language Resources and Evaluation (2000)
12. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus: a semantically annotated corpus for bio-textmining. Bioinformatics 19(Suppl. 1), i180–i182 (2003)
13. Ciaramita, M., Gangemi, A., Ratsch, E., Šarić, J., Rojas, I.: Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 659–664 (2005)
14. Uzuner, Ö., South, B.R., Shen, S., DuVall, S.L.: 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc. 18(5), 552–556 (2011)
15. Doğan, R.I., Lu, Z.: An improved corpus of disease mentions in PubMed citations. In: BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 91–99 (2012)


16. Doğan, R.I., Leaman, R., Lu, Z.: NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 47, 1–10 (2014)
17. Mori, S., Maeta, H., Yamakata, Y., Sasada, T.: Flow graph corpus from recipe texts. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2370–2377 (2014)
18. Mori, S., Richardson, J., Ushiku, A., Sasada, T., Kameko, H., Tsuruoka, Y.: A Japanese chess commentary corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 1415–1420 (2016)
19. Bick, E.: A named entity recognizer for Danish. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004) (2004)
20. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 546–555 (2017)
21. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance learning name-finder. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194–201 (1997)
22. Borthwick, A.E.: A maximum entropy approach to named entity recognition. Ph.D. thesis, AAI9945252 (1999)
23. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 8–15 (2003)
24. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, pp. 282–289 (2001)
25. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, pp. 188–191 (2003)
26. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074 (2016)
27. Rau, L.F.: Extracting company names from text. In: Proceedings of the Seventh Conference on Artificial Intelligence Applications CAIA-91 (Volume II: Visuals), pp. 189–194 (1991)
28. Sekine, S., Nobata, C.: Definition, dictionaries and tagger for extended named entity hierarchy. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004) (2004)
29. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 33–38 (2004)
30. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora (1999)
31. Nadeau, D., Turney, P.D., Matwin, S.: Unsupervised named-entity recognition: generating gazetteers and resolving ambiguity. In: Conference of the Canadian Society for Computational Studies of Intelligence, pp. 266–277 (2006)

76

S. Tomori et al.

32. Zhang, S., Elhadad, N.: Unsupervised biomedical named entity recognition: experiments with clinical and biological texts. J. Biomed. Inform. 46, 1088–1098 (2013) 33. Shang, J., Liu, L., Gu, X., Ren, X., Ren, T., Han, J.: Learning named entity tagger using domain-specific dictionary. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2054–2064. Association for Computational Linguistics (2018) 34. Yang, Y., Chen, W., Li, Z., He, Z., Zhang, M.: Distantly supervised NER with partial annotation learning and reinforcement learning. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2159–2169. Association for Computational Linguistics (2018) 35. Mochihashi, D., Yamada, T., Ueda, N.: Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 100–108 (2009) 36. Fujii, R., Domoto, R., Mochihashi, D.: Nonparametric Bayesian semi-supervised word segmentation. Trans. Assoc. Comput. Linguist. 5, 179–189 (2017) 37. Tsuboi, Y., Kashima, H., Mori, S., Oda, H., Matsumoto, Y.: Training conditional random fields using incomplete annotations. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 897–904 (2008) 38. Teh, Y.W.: A hierarchical Bayesian language model based on Pitman-Yor processes. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 985–992 (2006) 39. Kim, J.D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bioentity recognition task at JNLPBA. In: Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA 2004), pp. 70–75 (2004) 40. Francis, W.N., Kucera, H.: Brown corpus manual. Brown University, vol. 2 (1979) 41. Maekawa, K., et al.: Balanced corpus of contemporary written Japanese. Lang. Resour. Eval. 48, 345–371 (2014) 42. Keene, D., Hatori, H., Yamada, H., Irabu, S.: Japanese-English Sentence Equivalents. Electronic book edn. Asahi Press (1992) 43. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 529–533 (2011) 44. Aronson, A.R.: Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17 (2001) 45. Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-ofspeech tagging. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 744–751 (2007) 46. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. CoNLL 2003, pp. 142–147 (2003) 47. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 48. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Adv. Neural. Inf. Process. Syst. 26, 3111–3119 (2013)

A Hybrid Generative/Discriminative Model for Rapid Prototyping

77

49. Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1504–1515 (2016) 50. Zhao, Z., Liu, T., Li, S., Li, B., Du, X.: Ngram2vec: learning improved word representations from ngram co-occurrence statistics. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 244–253 (2017)

Semantics and Text Similarity

Spectral Text Similarity Measures

Tim vor der Brück and Marc Pouly

School of Computer Science and Information Technology, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland
{tim.vorderbrueck,marc.pouly}@hslu.ch

Abstract. Estimating semantic similarity between texts is of vital importance in many areas of natural language processing, like information retrieval, question answering, text reuse, or plagiarism detection. Prevalent semantic similarity estimates based on word embeddings are noise-sensitive: small individual term similarities can, in aggregate, have a considerable influence on the total estimation value. In contrast, the methods proposed here exploit the spectrum of the product of embedding matrices, which leads to increased robustness compared with conventional methods. We apply these estimates to two tasks: assigning people to the best matching marketing target group and finding the correct match between sentences belonging to two independent translations of the same novel. The evaluation revealed that our proposed method based on the spectral norm could increase the accuracy compared to several baseline methods in both scenarios.

Keywords: Text similarity · Similarity measures · Spectral radius

1 Introduction

Estimating semantic document similarity is of vital importance in many different areas, like plagiarism detection, information retrieval, or text summarization. One drawback of current state-of-the-art similarity estimates based on word embeddings is that small term similarities can sum up to a considerable amount, which makes these estimates vulnerable to noise in the data. Therefore, we propose two estimates that are based on the spectrum of the product F of the embedding matrices belonging to the two documents to compare. In particular, we propose the spectral radius and the spectral norm of F, where the first denotes F's largest absolute eigenvalue and the second its largest singular value. Eigenvalue- and singular-value-oriented methods for dimensionality reduction aiming to reduce noise in the data have a long tradition in natural language processing. For instance, principal component analysis is based on eigenvalues and can be used to increase the quality of word embeddings [8]. In contrast, Latent Semantic Analysis [11], a technique known from information retrieval to improve search results in term-document matrices, focuses on the largest singular values. Furthermore, we investigate several properties of our proposed measures that are crucial for qualifying as proper similarity estimates, while considering both unsupervised and supervised learning.


Finally, we applied both estimates to two natural language processing scenarios. In the first scenario, we distribute participants of an online contest into several target groups by exploiting short text snippets they were asked to provide. In the second scenario, we aim to find the correct matching between sentences originating from two independent translations of a novel by Edgar Allan Poe. The evaluation revealed that our novel estimators outperformed several baseline methods in both scenarios. The remainder of the paper is organized as follows. In the next section, we look into several state-of-the-art methods for estimating semantic similarity. Section 3 reviews several concepts that are vital for the remainder of the paper and for building the foundation of our theoretical results. In Sect. 4, we describe in detail how the spectral radius can be employed for estimating semantic similarity. Some drawbacks and shortcomings of such an approach, as well as an alternative method that very elegantly solves all of these issues by exploiting the spectral norm, are discussed in Sect. 5. The two application scenarios for our proposed semantic similarity estimates are given in Sect. 6. Section 7 describes the conducted evaluation, in which we compare our approach with several baseline methods. The results of the evaluation are discussed in Sect. 8. So far, we covered only unsupervised learning; in Sect. 9, we investigate how our proposed estimates can be employed in a supervised setting. Finally, this paper concludes with Sect. 10, which summarizes the obtained results.

2 Related Work

Until recently, similarity estimates were predominantly based either on ontologies [4] or on typical information retrieval techniques like Latent Semantic Analysis. In the last couple of years, however, so-called word and sentence embeddings became state-of-the-art. The prevalent approach to document similarity estimation based on word embeddings consists of measuring similarity between vector representations of the two documents derived as follows:

1. The word embeddings (often weighted by the tf-idf coefficients of the associated words [3]) are looked up in a hashtable for all the words in the two documents to compare. These embeddings are determined beforehand on a very large corpus, typically using either the skip-gram or the continuous-bag-of-words variant of the Word2Vec model [15]. The skip-gram method aims to predict the textual surroundings of a given word by means of an artificial neural network; the weights connecting the one-hot-encoded input word to the nodes of the hidden layer constitute the embedding vector. For the so-called continuous-bag-of-words method, it is just the opposite, i.e., the center word is predicted from the words in its surrounding.
2. The centroid over all word embeddings belonging to the same document is calculated to obtain its vector representation.

Alternatives to Word2Vec are GloVe [17], which is based on aggregated global word co-occurrence statistics, and Explicit Semantic Analysis (or shortly
ESA) [6], in which each word is represented by the column vector in the tf-idf matrix over Wikipedia. The idea of Word2Vec can be transferred to the level of sentences as well. In particular, the so-called Skip Thought Vector (STV) model [10] derives a vector representation of the current sentence by predicting the surrounding sentences. If vector representations of the two documents to compare were successfully established, a similarity estimate can be obtained by applying the cosine measure to the two vectors. [18] propose an alternative approach for ESA word embeddings that establishes a bipartite graph consisting of the best matching vector components by solving a linear optimization problem. The similarity estimate for the documents is then given by the global optimum of the objective function. However, this method is only useful for sparse vector representations. In case of dense vectors, [14] suggested applying the Frobenius kernel to the embedding matrices, which contain the embedding vectors for all document components (usually either sentences or words, cf. also [9]). However, crucial limitations are that the Frobenius kernel is only applicable if the numbers of words (sentences, respectively) in the compared documents coincide and that a word from the first document is only compared with its counterpart from the second document. Thus, an optimal matching has to be established beforehand. In contrast, the approach presented here applies to arbitrary embedding matrices. Since it compares all words of the two documents with each other, there is also no need for any matching method. Before going more into detail, we want to review some concepts that are crucial for the remainder of this paper.
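For concreteness, the centroid baseline (W2VC) described above can be sketched as follows. This is a minimal illustration, not the authors' code: the embedding lookup table and the optional tf-idf weights are assumed to be available as plain Python dictionaries of NumPy vectors, and all names are placeholders.

```python
import numpy as np

def centroid_similarity(doc1, doc2, embeddings, idf=None):
    """Cosine similarity between the (optionally tf-idf weighted) centroids
    of the word embeddings of two tokenized documents."""
    def centroid(tokens):
        vecs = [embeddings[w] * (idf.get(w, 1.0) if idf else 1.0)
                for w in tokens if w in embeddings]
        return np.mean(vecs, axis=0)

    c1, c2 = centroid(doc1), centroid(doc2)
    return float(np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2)))
```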

3 Similarity Measure/Matrix Norms

According to [2], a similarity measure on some set X is an upper bounded, exhaustive and total function s : X × X → I ⊂ R with |I| > 1 (therefore I is upper bounded and sup I exists). Additionally, a similarity measure should fulfill the properties of reflexivity (the supremum is reached if an item is compared to itself) and symmetry. We call such a measure normalized if the supremum equals 1 [1]. Note that an asymmetric similarity measure can easily be converted into a symmetric one by taking the geometric or arithmetic mean of the asymmetric measure applied twice to the same arguments in switched order. A norm is a function f : V → R over some vector space V that is absolutely homogeneous, positive definite and fulfills the triangle inequality. It is called a matrix norm if its domain is a set of matrices and if it is sub-multiplicative, i.e., $\|AB\| \le \|A\| \cdot \|B\|$. An example of a matrix norm is the spectral norm, which denotes the largest singular value of a matrix. Alternatively, one can define this norm as $\|A\|_2 := \sqrt{\rho(A^\top A)}$, where the function ρ returns the largest absolute eigenvalue of the argument matrix.
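As a small numerical illustration of these definitions (not part of the original paper), the equivalence between the largest singular value and the square root of the largest absolute eigenvalue of $A^\top A$ can be checked directly with NumPy:

```python
import numpy as np

A = np.random.randn(5, 3)

# Largest singular value of A ...
sigma_max = np.linalg.svd(A, compute_uv=False)[0]
# ... equals the square root of the largest absolute eigenvalue of A^T A.
rho = np.max(np.abs(np.linalg.eigvals(A.T @ A)))

assert np.isclose(sigma_max, np.sqrt(rho))
# np.linalg.norm(A, 2) computes the same spectral norm directly.
assert np.isclose(sigma_max, np.linalg.norm(A, 2))
```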

4 Document Similarity Measure Based on the Spectral Radius

For an arbitrary document t we define the embedding matrix E(t) as follows: $E(t)_{ij}$ is the i-th component of the normalized embedding vector belonging to the j-th word of the document t. Let t, u be two arbitrary documents; then the entry (i, j) of the product $F := E(t)^\top E(u)$ specifies the result of the cosine measure estimating the semantic similarity between word i of document t and word j of document u. The larger the matrix entries of F are, the higher is usually the semantic similarity of the associated texts. A straightforward way to measure the magnitude of the matrix is to just sum up all absolute matrix elements, which is called the $L_{1,1}$ norm. However, this approach has the disadvantage that also small cosine measure values are included in the sum, which can in aggregate have a considerable impact on the total similarity estimate, making such an approach vulnerable to noise in the data. Therefore, we propose instead to apply an operator that is more robust than the $L_{1,1}$ norm and which is called the spectral radius. This radius denotes the largest absolute eigenvalue of the input matrix and constitutes a lower bound of all matrix norms. It also indicates the convergence of the matrix power series $\lim_{n\to\infty} F^n$: the series converges if and only if the spectral radius does not exceed the value of one. Since the vector components obtained by Word2Vec can be negative, the cosine measure between two word vectors can also assume negative values (rather rarely in practice though). Akin to zeros, negative cosine values indicate unrelated words as well. Because the spectral radius usually treats negative and positive matrix entries alike (the spectral radius of a matrix A and of its negation coincide), we replace all negative values in the matrix by zero. Finally, since our measure should be restricted to values from zero to one, we have to normalize it. Formally, we define our similarity measure as follows:

$$sn(t,u) := \frac{\rho(R(E(t)^\top E(u)))}{\sqrt{\rho(R(E(t)^\top E(t))) \cdot \rho(R(E(u)^\top E(u)))}}$$

where E(t) is the embedding matrix belonging to document t, in which all embedding column vectors are normalized, and R(M) is the matrix where all negative entries are replaced by zero, i.e., $R(M)_{ij} = \max\{0, M_{ij}\}$. In contrast to matrix norms, which can be applied to arbitrary matrices, eigenvalues only exist for square matrices. However, the matrix $F^* := R(E(t)^\top E(u))$ that we use as basis for our similarity measures is usually non-square. In particular, this matrix would be square if and only if the numbers of terms in the two documents t and u coincide. Thus, we have to fill up the embedding matrix of the smaller one of the two texts with additional embedding vectors. A quite straightforward choice, which we followed here, is to just use the centroid vector for this. An alternative approach would be to sample the missing vectors. A further issue is that eigenvalues are not invariant with respect to row and column permutations. The columns of the embedding matrices just represent the words appearing in the texts. However, the word order can be arbitrary for the texts representing the marketing target groups (see Sect. 6.1 for details). Since a similarity measure should not depend on some random ordering, we need to bring the similarity matrix $F^*$ into some normalized format. A quite natural choice would be to enforce the ordering that maximizes the absolute value of the largest eigenvalue (which is actually our target value). Let us formalize this. We denote by $F^*_{P,Q}$ the matrix obtained from $F^*$ by applying the permutation P to the rows and the permutation Q to the columns. Thus, we can define our similarity measure as follows:

$$sn_{sr}(t,u) := \max_{P,Q} \rho(F^*_{P,Q}) \tag{1}$$

However, solving this optimization problem is quite time-consuming. Let us assume the matrix $F^*$ has m rows and columns. Then we would have to iterate over $m! \cdot m!$ different possibilities. Hence, such an approach would be infeasible already for medium-sized texts. Therefore, we instead select the permutations that optimize the absolute value of the arithmetic mean over all eigenvalues, which is a lower bound of the maximum absolute eigenvalue. Let $\lambda_i(M)$ be the i-th eigenvalue of a matrix M. With this, we can formalize our optimization problem as follows:

$$sn_{\widetilde{sr}}(t,u) := \rho(F^*_{\tilde P,\tilde Q}), \qquad (\tilde P,\tilde Q) = \arg\max_{P,Q} \Big|\sum_{i=1}^{m} \lambda_i(F^*_{P,Q})\Big| \tag{2}$$

The sum over all eigenvalues is just the trace of the matrix. Thus,

$$(\tilde P,\tilde Q) = \arg\max_{P,Q} |\mathrm{tr}(F^*_{P,Q})| \tag{3}$$

which is just the sum over all diagonal elements. Since we constructed our matrix $F^*$ in such a way that it contains no negative entries, we can get rid of the absolute value operator:

$$(\tilde P,\tilde Q) = \arg\max_{P,Q} \mathrm{tr}(F^*_{P,Q}) \tag{4}$$

Because the sum is commutative, the sequence of the individual summands is irrelevant. Therefore, we can leave either the row or the column ordering constant and only permute the other one:

$$sn_{\widetilde{sr}}(t,u) = \rho(F^*_{\tilde P,\mathrm{id}}), \qquad \tilde P = \arg\max_{P} \mathrm{tr}(F^*_{P,\mathrm{id}}) \tag{5}$$

$\tilde P$ can be found by solving a binary linear programming problem in the following way. Let X be the set of decision variables, and let further $X_{ij} \in X$ be one if and only if row i is moved to row j in the reordered matrix, and zero otherwise.

Then the objective function is given by $\max_X \sum_{i=1}^{m}\sum_{j=1}^{m} X_{ji} F^*_{ji}$. A permutation denotes a 1:1 mapping, i.e.,

$$\sum_{i=1}^{m} X_{ij} = 1 \;\;\forall j = 1,\dots,m, \qquad \sum_{j=1}^{m} X_{ij} = 1 \;\;\forall i = 1,\dots,m, \qquad X_{ij} \in \{0,1\} \;\;\forall i,j = 1,\dots,m \tag{6}$$
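A possible implementation sketch of the spectral-radius estimate of this section is given below. It assumes the embedding matrices already have L2-normalized columns, pads the smaller matrix with its centroid, clips negative cosine values, and uses a linear assignment solver (an equivalent way of solving the binary program in Eq. (6)) to find the trace-maximizing ordering; function and variable names are illustrative only, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _pad_with_centroid(E, m):
    """Append centroid columns until E has m columns (E: dim x n_words)."""
    c = E.mean(axis=1, keepdims=True)
    return np.hstack([E] + [c] * (m - E.shape[1]))

def _rho(M):
    """Spectral radius: largest absolute eigenvalue of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def spectral_radius_similarity(Et, Eu):
    m = max(Et.shape[1], Eu.shape[1])
    Et, Eu = _pad_with_centroid(Et, m), _pad_with_centroid(Eu, m)

    def sim(A, B):
        F = np.maximum(A.T @ B, 0.0)          # R(E(t)^T E(u)): clip negatives
        _, cols = linear_sum_assignment(-F)   # trace-maximizing column order
        return _rho(F[:, cols])

    return sim(Et, Eu) / np.sqrt(sim(Et, Et) * sim(Eu, Eu))
```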

5 Spectral Norm

The similarity estimate as described above has several drawbacks:

– The boundedness condition is violated in some cases. Therefore, this similarity does not qualify as a normalized similarity estimate according to the definition in Sect. 3.
– The largest eigenvalue of a matrix depends on the row and column ordering. However, this ordering is arbitrary for our proposed description of target groups by keywords (cf. Sect. 6.1 for the details). To ensure a unique eigenvalue, we apply linear optimization, which is an expensive approach in terms of runtime.
– Eigenvalues are only defined for square matrices. Therefore, we need to fill up the smaller of the embedding matrices to meet this requirement.

An alternative to the spectral radius is the spectral norm, which is defined by the largest singular value of a matrix. Formally, the spectral norm-based estimate is given as

$$sn_2(t,u) := \frac{\|R(E(t)^\top E(u))\|_2}{\sqrt{\|R(E(t)^\top E(t))\|_2 \cdot \|R(E(u)^\top E(u))\|_2}}$$

where $\|A\|_2 = \sqrt{\rho(A^\top A)}$. By using the spectral norm instead of the spectral radius, all of the issues mentioned above are solved. The spectral norm is not only invariant to column or row permutations, it can also be applied to arbitrary rectangular matrices. Furthermore, boundedness is guaranteed as long as no negative cosine values occur, as stated in the following proposition.

Proposition 1. If the cosine similarity values between all embedding vectors of words occurring in any of the documents are non-negative, i.e., if $R(E(t)^\top E(u)) = E(t)^\top E(u)$ for all document pairs (t, u), then $sn_2$ is a normalized similarity measure.


Symmetry Proof. At first, we focus on the symmetry condition. Let A := E(t), B := E(u), where t and u are arbitrary documents. Symmetry directly follows if we can show that $\|Z\|_2 = \|Z^\top\|_2$ for arbitrary matrices Z, since with this property we have

$$sn_2(t,u) = \frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} = \frac{\|(B^\top A)^\top\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}} = \frac{\|B^\top A\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}} = sn_2(u,t) \tag{7}$$

Let M and N be arbitrary matrices such that MN and NM are both defined and square; then (see [5])

$$\rho(MN) = \rho(NM) \tag{8}$$

where $\rho(X)$ denotes the largest absolute eigenvalue of a square matrix X. Using identity (8) one can easily infer that

$$\|Z\|_2 = \sqrt{\rho(Z^\top Z)} = \sqrt{\rho(ZZ^\top)} = \|Z^\top\|_2 \tag{9}$$

Boundedness Proof. The following property needs to be shown:

$$\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \le 1 \tag{10}$$

In the proof, we exploit the fact that for every positive-semidefinite matrix X the following equation holds:

$$\rho(X^2) = \rho(X)^2 \tag{11}$$

We observe that for the denominator

$$\begin{aligned}
\|A^\top A\|_2 \cdot \|B^\top B\|_2 &= \sqrt{\rho((A^\top A)^\top A^\top A)}\,\sqrt{\rho((B^\top B)^\top B^\top B)} \\
&= \sqrt{\rho((A^\top A)^\top (A^\top A)^\top)}\,\sqrt{\rho((B^\top B)^\top (B^\top B)^\top)} \\
&= \sqrt{\rho([(A^\top A)^\top]^2)}\,\sqrt{\rho([(B^\top B)^\top]^2)} \\
&\overset{(11)}{=} \sqrt{\rho((A^\top A)^\top)^2}\,\sqrt{\rho((B^\top B)^\top)^2} \\
&= \rho((A^\top A)^\top)\,\rho((B^\top B)^\top) \overset{(9)}{=} \|A\|_2^2 \cdot \|B\|_2^2
\end{aligned} \tag{12}$$

Putting things together, we finally obtain

$$\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2\,\|B^\top B\|_2}} \overset{\text{sub-mult.}}{\le} \frac{\|A^\top\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2\,\|B^\top B\|_2}} \overset{(9)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2\,\|B^\top B\|_2}} \overset{(12)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A\|_2^2 \cdot \|B\|_2^2}} = 1 \tag{13}$$

The question remains how the similarity measure values induced by matrix norms perform in comparison with the usual centroid method. General statements about the spectral-norm-based similarity measure are difficult, but we can draw some conclusions if we restrict ourselves to the case where $A^\top B$ is a square diagonal matrix. Hereby, one word of the first text is very similar to exactly one word of the second text and very dissimilar to all remaining words. The similarity estimate is then given by the largest eigenvalue (the spectral radius) of $A^\top B$, which equals the largest cosine measure value. Noise in the form of small matrix entries is completely ignored.
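A minimal sketch of the spectral-norm estimate $sn_2$ defined above, again assuming embedding matrices with L2-normalized columns; NumPy's matrix 2-norm returns exactly the largest singular value used here. Names are illustrative only.

```python
import numpy as np

def spectral_norm_similarity(Et, Eu):
    """sn_2(t, u): spectral norm of the clipped cosine-similarity matrix,
    normalized by the corresponding self-similarities.
    Et, Eu: embedding matrices (dim x n_words) with L2-normalized columns."""
    def s(A, B):
        # largest singular value of R(A^T B)
        return np.linalg.norm(np.maximum(A.T @ B, 0.0), 2)
    return s(Et, Eu) / np.sqrt(s(Et, Et) * s(Eu, Eu))
```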

6 Application Scenarios

We applied our semantic similarity estimates to the following two scenarios.

6.1 Market Segmentation

Market segmentation is one of the key tasks of a marketer. Usually, it is accomplished by clustering over behaviors as well as demographic, geographic and psychographic variables [12]. In this paper, we will describe an alternative approach based on unsupervised natural language processing. In particular, our business
partner operates a commercial youth platform for the Swiss market, where registered members get access to third-party offers such as discounts and special events like concerts or castings. Several hundred online contests per year are launched over this platform, sponsored by other firms; an increasing number of them require the members to write short free-text snippets, e.g. to elaborate on a perfect holiday at a destination of their choice in case of a contest sponsored by a travel agency. Based on the results of a broad survey, the platform provider's marketers assume five different target groups (called milieus) to be present among the platform members: Progressive postmodern youth (people primarily interested in culture and arts), Young performers (people striving for a high salary with a strong affinity to luxury goods), Freestyle action sportsmen, Hedonists (rather poorly educated people who enjoy partying and disco music), and Conservative youth (traditional people with a strong concern for security). A sixth milieu called Special groups comprises all those who cannot be assigned to one of the five milieus above. For each milieu (with the exception of Special groups) a keyword list was manually created describing its main characteristics. For triggering marketing campaigns, an algorithm shall be developed that automatically assigns each contest answer to the most likely target group: we propose the youth milieu as best match for a contest answer, for which the estimated semantic similarity between the associated keyword list and user answer is maximal. In case the highest similarity estimate falls below the 10 percent quantile for the distribution of highest estimates, the Special groups milieu is selected. Since the keyword list typically consists of nouns (in the German language capitalized) and the user contest answers might contain a lot of adjectives and verbs as well, which do not match very well to nouns in the Word2Vec vector representation, we actually conduct two comparisons for our Word2Vec based measures, one with the unchanged user contest answers and one by capitalizing every word beforehand. The final similarity estimate is then given as the maximum value of both individual estimates.
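The assignment rule described above can be sketched as follows. This is an illustration only: `similarity` stands for any of the document similarity estimates discussed in this paper and is assumed to accept raw text (i.e., to encapsulate tokenization and embedding lookup), the milieu keyword lists are assumed to be given as strings, and the German capitalization trick is approximated with `str.title`.

```python
import numpy as np

def assign_milieus(answers, milieu_keywords, similarity, quantile=0.10):
    """Assign each contest answer to the best-matching milieu; answers whose
    best score falls below the given quantile go to 'Special groups'."""
    best_names, best_scores = [], []
    for answer in answers:
        # compare both the raw and the capitalized variant of the answer
        scores = {m: max(similarity(answer, kw), similarity(answer.title(), kw))
                  for m, kw in milieu_keywords.items()}
        name = max(scores, key=scores.get)
        best_names.append(name)
        best_scores.append(scores[name])

    threshold = np.quantile(best_scores, quantile)
    return [n if s >= threshold else "Special groups"
            for n, s in zip(best_names, best_scores)]
```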

6.2 Translation Matching

The novel The Purloined Letter authored by Edgar Allan Poe was independently translated by two translators into German1. We aim to match a sentence from the first translation to the associated sentence of the second by looking for the assignment with the highest semantic relatedness, disregarding the sentence order. To guarantee a 1:1 sentence mapping, periods were partly replaced by semicolons.

1 This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718 alignmentPurloinedLettertar.

7 Evaluation

For evaluation we selected three online contests (language: German), where people elaborated on their favorite travel destination (contest 1, see Appendix A for an example), speculated about potential experiences with a pair of fancy sneakers (contest 2), and explained why they emotionally prefer a certain product out of four available candidates. In order to provide a gold standard, three professional marketers from different youth marketing companies independently annotated the best matching youth milieus for every contest answer. We determined for each annotator individually his/her average inter-annotator agreement with the others (Cohen's kappa). The minimum and maximum of these average agreement values are given in Table 2. Since for contest 2 and contest 3 some of the annotators annotated only the first 50 entries (respectively the last 50 entries), we specify min/max average kappa values for both parts. We further compared the youth milieus proposed by our unsupervised matching algorithm with the majority votes over the human experts' answers (see Table 3) and computed its average inter-annotator agreement with the human annotators (see again Table 2). The obtained accuracy values for the second scenario (matching translated sentences) are given in Table 4.

Fig. 1. Scatter Plots of Cosine between Centroids of Word2Vec Embeddings (W2VC) vs similarity estimates induced by different spectral measures.

Table 1. Corpus sizes measured by number of words.

Corpus                  # Words
German Wikipedia        651 880 623
Frankfurter Rundschau   34 325 073
News journal 20 min     8 629 955

The Word2Vec word embeddings were trained on the German Wikipedia (dump originating from 20 February 2017) merged with a Frankfurter Rundschau newspaper corpus and 34 249 articles of the news journal 20 min2, where the latter is targeted to the Swiss market and freely available at various Swiss train stations (see Table 1 for a comparison of corpus sizes). By employing articles from

2 http://www.20min.ch.


Table 2. Minimum and maximum average inter-annotator agreements (Cohen's kappa)/average inter-annotator agreement values for our automated matching method.

Method                  Contest 1   Contest 2     Contest 3
Min. kappa              0.123       0.295/0.030   0.110/0.101
Max. kappa              0.178       0.345/0.149   0.114/0.209
Kappa (spectral norm)   0.128       0.049/0.065   0.060/0.064
# Entries               1544        100           100

Table 3. Obtained accuracy values for similarity measures induced by different matrix norms and for five baseline methods. (W)W2VC = Cosine between (weighed by tf-idf) Word2Vec Embeddings Centroids.

Method                  Contest 1   Contest 2   Contest 3   All
Random                  0.167       0.167       0.167       0.167
ESA                     0.357       0.288       0.335       0.254
ESA2                    0.355       0.227       0.330       0.284
W2VC                    0.347       0.227       0.330       0.328
WW2VC                   0.347       0.197       0.322       0.299
Skip-Thought-Vectors    0.162       0.273       0.189       0.284
Spectral Norm           0.370       0.299       0.353       0.313
Spectral Radius         0.357       0.288       0.182       0.212
Spectral Radius+W2VC    0.299       0.350       0.326       0.334

20 min, we want to ensure the reliability of word vectors for certain Switzerland-specific expressions like Velo or Glace, which are underrepresented in the German Wikipedia and the Frankfurter Rundschau corpus. ESA is usually trained on Wikipedia, since the authors of the original ESA paper suggest that the articles of the training corpus should represent disjoint concepts, which is only guaranteed for encyclopedias. However, Stein and Anderka [7] challenged this hypothesis and demonstrated that promising results can be obtained by applying ESA to other types of corpora like the popular Reuters newspaper corpus as well. Unfortunately, the implementation we use (Wikiprep-ESA3) expects its training data to be a Wikipedia dump. Furthermore, Wikiprep-ESA only indexes words that are connected by hyperlinks, which are usually lacking in ordinary newspaper articles. So we could train ESA on Wikipedia only, but we have meanwhile developed a version of ESA that can be applied to arbitrary corpora and which was trained on the full corpus

3 https://github.com/faraday/wikiprep-esa.

(Wikipedia+Frankfurter Rundschau+20 min). In the following, we refer to this implementation as ESA2. The STVs (Skip-Thought Vectors) were trained on the same corpus as our estimates and the Word2Vec embedding centroids (W2VC). The actual document similarity estimation is accomplished by the usual centroid approach. An issue we are faced with for the first evaluation scenario of market segmentation (see Sect. 6.1) is that STVs are not bag-of-words models but actually take the sequence of the words into account, and therefore the obtained similarity estimate between milieu keyword list and contest answer would depend on the keyword ordering. However, this order could have been chosen arbitrarily by the marketers and might be completely random. A possible solution is to compare the contest answers with all possible permutations of keywords and determine the maximum value over all those comparisons. However, such an approach would be infeasible already for medium keyword list sizes. Therefore, we apply for this scenario a beam search that extends the keyword list iteratively while keeping only the n best-performing permutations.

Table 4. Accuracy values obtained for matching a sentence of the first translation to the associated sentence of the second translation (based on the first 200 sentences of both translations).

Method            Accuracy
ESA               0.672
STV               0.716
Spectral Radius   0.721
W2VC              0.726
Spectral Norm     0.731
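The beam search over keyword orderings mentioned above can be sketched as follows; this is an illustration under stated assumptions, not the authors' implementation: `stv_similarity` is a placeholder for the skip-thought-based comparison between two texts, and `beam_width` corresponds to the n best-performing permutations that are kept.

```python
def best_keyword_order(keywords, answer, stv_similarity, beam_width=5):
    """Beam search over keyword orderings: grow partial sequences one keyword
    at a time, keeping only the beam_width best-scoring prefixes."""
    beam = [[]]  # partial keyword sequences
    for _ in keywords:
        candidates = []
        for prefix in beam:
            for kw in keywords:
                if kw not in prefix:
                    candidates.append(prefix + [kw])
        # score a prefix by comparing its concatenation to the contest answer
        candidates.sort(key=lambda p: stv_similarity(" ".join(p), answer),
                        reverse=True)
        beam = candidates[:beam_width]
    best = beam[0]
    return best, stv_similarity(" ".join(best), answer)
```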

8 Discussion

The evaluation showed that the inter-annotator agreement values vary strongly for contest 2, part 2 (minimum average annotator agreement according to Cohen's kappa of 0.03, while the maximum is 0.149, see Table 2). On this contest part, our spectral-norm-based matching obtains a considerably higher average agreement than one of the annotators. Regarding baseline systems, the most relevant comparison is naturally the one with W2VC, since it employs the same type of data. The similarity estimate induced by the spectral norm performs quite stably over both scenarios and clearly outperforms the W2VC approach. In contrast, however, the performance of the spectral radius based estimate is rather mixed. While it performs well on the first contest, the performance on the third contest is quite poor and lags behind the Word2Vec centroids. Only the average of both measures (W2VC+Spectral Radius) performs reasonably well on all three
contests. One major issue of this measure is its unboundedness. The typical normalization with the geometric mean of comparing the documents with themselves results in values exceeding the desired upper limit of one in 1.8% of the cases (determined on the largest contest 1). So still some research is needed to come up with a better normalization. Finally, we produced a scatter plot (see Fig. 1), plotting the values of the spectral similarity estimates against W2VC. While the spectral norm is quite strongly correlated with W2VC, the spectral radius behaves much more irregularly and non-linearly. In addition, its values exceed several times the desired upper limit of 1, which is a result of its non-boundedness. Furthermore, both of the spectral similarity estimates tend to assume larger values than W2VC, which is a result of their higher robustness against noise in the data. Note that a downside of both approaches in relation to the usual Word2Vec centroid method is the increased runtime, since they require the pairwise comparison of all words contained in the input documents. In our scenario with rather short text snippets and keyword lists, this was not much of an issue. However, for large documents, such a comprehensive comparison could soon become infeasible. This issue can be mitigated, for example, by constructing the embedding matrices not on the basis of individual words but of entire sentences, for instance by employing the skip-thought vector representation.

9 Supervised Learning

So far, our two proposed similarity measures were only applied in an unsupervised setting. However, supervised learning methods usually obtain superior accuracy. For that, we could use our two similarity estimates as kernels for a support vector machine [19] (SVM for short), potentially combined with an RBF kernel applied to an ordinary feature representation consisting of tf-idf weights of word forms or lemmas (not yet evaluated, however). One issue here is to investigate whether our proposed similarity estimates are positive-semidefinite and qualify as regular kernels. In case of non-positive-semidefiniteness, the SVM training process can get stuck in a local minimum, resulting in a failure to reach the global minimum of the hinge loss. The estimate induced by the spectral radius, and also the spectral norm in case of negative cosine measure values between word embedding vectors, can possibly violate the boundedness constraint and therefore cannot constitute a positive-semidefinite kernel. To see this, let us consider the kernel matrix K. According to Mercer's theorem [13,16], an SVM kernel is positive-semidefinite exactly when, for any possible set of inputs, the associated kernel matrices are positive-semidefinite. So we must show that there is at least one kernel matrix that is not positive-semidefinite. Let us select one kernel matrix K with at least one violation of boundedness. We can assume that K is symmetric, since symmetry is a prerequisite for positive-semidefiniteness. Since our normalization procedure guarantees reflexivity, a text compared with itself always yields the estimated similarity of one. Therefore, the value
of one can only be exceeded for off-diagonal elements. Let us assume the entry $K_{ij} = K_{ji}$ with $i < j$ of the kernel matrix equals $1+\epsilon$ for some $\epsilon > 0$. Consider a vector v with $v_i = 1$, $v_j = -1$ and all other components equal to zero. Let $w := v^\top K$ and $q := v^\top K v = w v$; then $w_i = 1 - (1+\epsilon) = -\epsilon$ and $w_j = 1 + \epsilon - 1 = \epsilon$. With this, it follows that $q = -\epsilon - \epsilon = -2\epsilon$, and therefore K cannot be positive-semidefinite. Note that $sn_2$ can be a proper kernel in certain situations. Consider the case that all of the investigated texts are so dissimilar that the kernel matrices are diagonally dominant for all possible sets of inputs. Since diagonally dominant matrices with non-negative diagonal elements are positive-semidefinite, the kernel is positive-semidefinite as well. It is still an open question whether this kernel can also be positive-semidefinite if not all of the kernel matrices are diagonally dominant.
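The counterexample in this argument can be verified numerically on a toy kernel matrix (an illustration added here, not taken from the paper):

```python
import numpy as np

eps = 0.1
K = np.array([[1.0,     1 + eps, 0.2],
              [1 + eps, 1.0,     0.3],
              [0.2,     0.3,     1.0]])   # reflexive, symmetric, but K[0,1] > 1

v = np.array([1.0, -1.0, 0.0])
print(v @ K @ v)                      # -2 * eps < 0, so K is not PSD
print(np.min(np.linalg.eigvalsh(K)))  # the smallest eigenvalue is negative
```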

10 Conclusion

We proposed two novel similarity estimates based on the spectrum of the product of embedding matrices. These estimates were evaluated on two tasks, i.e., assigning users to the best matching marketing target groups and matching sentences of a novel translation with their counterparts from a different translation. Hereby, we obtained superior results compared to the usual centroid of Word2Vec vectors (W2VC) method. Furthermore, we investigated several properties of our estimates concerning boundedness and positive-semidefiniteness.

Acknowledgement. Hereby we thank the Jaywalker GmbH as well as the Jaywalker Digital AG for their support regarding this publication and especially for annotating the contest data with the best-fitting youth milieus.

A Example Contest Answer

The following snippet is an example user answer for the travel contest (contest 1):

1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen bevor die Touristenbusse kommen
2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
3. USA: Eine abgespaceste Woche am Burning Man Festival erleben

English translation:

1. Jordan: Ride through the desert and marveling Petra during sunrise before the arrival of tourist buses
2. Cook Island: Snorkeling with whale sharks and relaxing
3. USA: Experience an awesome week at the Burning Man Festival


References

1. Attig, A., Perner, P.: The problem of normalization and a normalized similarity measure by online data. Trans. Case-Based Reason. 4(1), 3–17 (2011)
2. Belanche, L., Orozco, J.: Things to know about a (dis)similarity measure. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds.) KES 2011. LNCS (LNAI), vol. 6881, pp. 100–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23851-2_11
3. Brokos, G.I., Malakasiotis, P., Androutsopoulos, I.: Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Berlin, Germany, pp. 114–118 (2016)
4. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
5. Chatelin, F.: Eigenvalues of Matrices - Revised Edition. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania (1993)
6. Gabrilovic, E., Markovitch, S.: Wikipedia-based semantic interpretation for natural language processing. J. Artif. Intell. Res. 34, 443–498 (2009)
7. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, UK, pp. 1961–1964 (2011)
8. Gupta, V.: Improving word embeddings using kernel principal component analysis. Master's thesis, Bonn-Aachen International Center for Information Technology (BIT) (2018)
9. Hong, K.J., Lee, G.H., Kom, H.J.: Enhanced document clustering using Wikipedia-based document representation. In: Proceedings of the 2015 International Conference on Applied System Innovation (ICASI), Osaka, Japan (2015)
10. Kiros, R., et al.: Skip-thought vectors. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS), Montréal, Canada (2015)
11. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25, 259–284 (1998)
12. Lynn, M.: Segmenting and targeting your market: strategies and limitations. Technical report, Cornell University (2011). http://scholorship.sha.cornell.edu/articles/243
13. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Phil. Trans. R. Soc. A 209, 441–458 (1909)
14. Mijangos, V., Sierra, G., Montes, A.: Sentence level matrix representation for document spectral clustering. Pattern Recognit. Lett. 85, 29–34 (2017)
15. Mikolov, T., Sutskever, I., Ilya, C., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, pp. 3111–3119 (2013)
16. Murphy, K.P.: Machine Learning - A Probabilistic Perspective. MIT Press, Cambridge (2012)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (2014)
18. Song, Y., Roth, D.: Unsupervised sparse vector densification for short text similarity. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Denver, Colorado (2015)
19. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

A Computational Approach to Measuring the Semantic Divergence of Cognates

Ana-Sabina Uban1,3, Alina Cristea (Ciobanu)1,2, and Liviu P. Dinu1,2

1 Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
{auban,alina.cristea,ldinu}@fmi.unibuc.ro
2 Human Language Technologies Research Center, University of Bucharest, Bucharest, Romania
3 Data Science Center, University of Bucharest, Bucharest, Romania

Abstract. Meaning is the foundation stone of intercultural communication. Languages are continuously changing, and words shift their meanings for various reasons. Semantic divergence in related languages is a key concern of historical linguistics. In this paper we investigate semantic divergence across languages by measuring the semantic similarity of cognate sets in multiple languages. The method that we propose is based on cross-lingual word embeddings. In this paper we implement and evaluate our method on English and five Romance languages, but it can be extended easily to any language pair, requiring only large monolingual corpora for the involved languages and a small bilingual dictionary for the pair. This language-agnostic method facilitates a quantitative analysis of cognates divergence – by computing degrees of semantic similarity between cognate pairs – and provides insights for identifying false friends. As a second contribution, we formulate a straightforward method for detecting false friends, and introduce the notion of “soft false friend” and “hard false friend”, as well as a measure of the degree of “falseness” of a false friends pair. Additionally, we propose an algorithm that can output suggestions for correcting false friends, which could result in a very helpful tool for language learning or translation. Keywords: Cognates · Semantic divergence · Semantic similarity

1 Introduction

Semantic change – that is, change in the meaning of individual words [3] – is a continuous, inevitable process stemming from numerous reasons and influenced by various factors. Words are continuously changing, with new senses emerging all the time. [3] presents no less than 11 types of semantic change, that are generally classified in two wide categories: narrowing and widening. Most linguists found structural and psychological factors to be the main cause of semantic change, but the evolution of technology and cultural and social changes are not to be omitted. Measuring semantic divergence across languages can be useful in theoretical and historical linguistics – being central to models of language and cultural evolution – but also in downstream applications relying on cognates, such as machine translation.


Cognates are words in sister languages (languages descending from a common ancestor) with a common proto-word. For example, the Romanian word victorie and the Italian word vittoria are cognates, as they both descend from the Latin word victoria (meaning victory) – see Fig. 1. In most cases, cognates have preserved similar meanings across languages, but there are also exceptions. These are called deceptive cognates or, more commonly, false friends. Here we use the definition of cognates that refers to words with similar appearance and some common etymology, and use “true cognates” to refer to cognates which also have a common meaning, and “deceptive cognates” or “false friends” to refer to cognate pairs which do not have the same meaning (anymore). The most common way cognates have diverged is by changing their meaning. For many cognate pairs, however, the changes can be more subtle, relating to the feeling attached to a word, or its connotations. This can make false friends even more delicate to distinguish from true cognates.

Fig. 1. Example of cognates and their common ancestor

Cognate word pairs can help students when learning a second language and contribute to the expansion of their vocabularies. False friends, however, from the more obvious differences in meaning to the more subtle, have the opposite effect, and can be confusing for language learners and make the correct use of language more difficult. Cognate sets have also been used in a number of applications in natural language processing, including for example machine translation [10]. These applications rely on properly distinguishing between true cognates and false friends.

1.1 Related Work

Cross-lingual semantic word similarity consists in identifying words that refer to similar semantic concepts and convey similar meanings across languages [16]. Some of the most popular approaches rely on probabilistic models [17] and cross-lingual word embeddings [13]. A comprehensive list of cognates and false friends for every language pair is difficult to find or manually build – this is why applications have to rely on automatically identifying them. There have been a number of previous studies attempting to automatically extract pairs of true cognates and false friends from corpora or from dictionaries. Most methods are based either on orthographic and phonetic similarity, or require large parallel corpora or dictionaries [5, 9, 11, 14]. We propose a corpus-based approach that is capable of covering the vast majority of the vocabulary for a large number of languages, while at the same time requiring minimal human effort in terms of manually evaluating word pair similarity or building lexicons, requiring only large monolingual corpora.


In this paper, we make use of cross-lingual word embeddings in order to distinguish between true cognates and false friends. There have been few previous studies using word embeddings for the detection of false friends or cognate words, usually using simple methods on only one or two pairs of languages [4, 15].

1.2 Contributions

The contributions of our paper are twofold: firstly, we propose a method for quantifying the semantic divergence of languages; secondly, we provide a framework for detecting and correcting false friends, based on the observation that these are usually deceptive cognate pairs: pairs of words that once had a common meaning, but whose meaning has since diverged. We propose a method for measuring the semantic divergence of sister languages based on cross-lingual word embeddings. We report empirical results on five Romance languages: Romanian, French, Italian, Spanish and Portuguese. For a deeper insight into the matter, we also compute and investigate the semantic similarity between modern Romance languages and Latin. We finally introduce English into the mix, to analyze the behavior of a more remote language, where words deriving from Latin are mostly borrowings. Further, we make use of cross-lingual word embeddings in order to distinguish between true cognates and false friends. There have been few previous studies using word embeddings for the detection of false friends or cognate words, usually using simple methods on only one or two pairs of languages [4, 15]. Our chosen method of leveraging word embeddings extends naturally to another application related to this task which, to our knowledge, has not been explored so far in research: false friend correction. We propose a straightforward method for solving this task of automatically suggesting a replacement when a false friend is incorrectly used in a translation. Especially for language learners, solving this problem could result in a very useful tool to help them use language correctly.

2 The Method

2.1 Cross-Lingual Word Embeddings

Word embeddings are vectorial representations of words in a continuous space, built by training a model to predict the occurrence of a given word in a text corpus given its context. Based on the distributional hypothesis stating that similar words occur in similar contexts, these vectorial representations can be seen as semantic representations of words and can be used to compute semantic similarity between word pairs (representations of words with similar meanings are expected to be close together in the embeddings space). To compute the semantic divergence of cognates across sister languages, as well as identify pairs of false cognates (pairs of cognates with high semantic distance), which by definition are pairs of words in two different languages, we need to obtain a multilingual semantic space, which is shared between the cognates. Having the representations

of both cognates in the same semantic space, we can then compute the semantic distance between them using their vectorial representations in this space. We use word embeddings computed using the FastText algorithm, pre-trained on Wikipedia for the six languages in question. The vectors have dimension 300, and were obtained using the skip-gram model described in [2] with default parameters.

Fig. 2. Distributions of cross-language similarity scores between cognates (one panel per language pair: Es-Fr, Es-It, Es-Ro, Es-Pt, Fr-It, Fr-Ro, Fr-Pt, It-Ro, It-Pt, Ro-Pt, En-Fr, En-It, En-Es, En-Ro, En-Pt, En-La, Es-La, Fr-La, It-La, Pt-La, Ro-La).


The algorithm for measuring the semantic distance between cognates in a pair of languages (lang1, lang2) consists of the following steps:

1. Obtain word embeddings for each of the two languages.
2. Obtain a shared embedding space, common to the two languages. This is accomplished using an alignment algorithm, which consists of finding a linear transformation between the two spaces that, on average, optimally transforms each vector in one embedding space into a vector in the second embedding space, minimizing the distance between a few seed word pairs (for which it is known that they have the same meaning), based on a small bilingual dictionary. For our purposes, we use the publicly available multilingual alignment matrices that were published in [12].
3. Compute semantic distances for each pair of cognate words in the two languages, using a vectorial distance (we chose cosine distance) on their corresponding vectors in the shared embedding space.
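A sketch of steps 2 and 3, under the assumption that the monolingual FastText vectors are available as dictionaries of NumPy arrays and the alignment matrices of [12] as NumPy matrices; all names below are illustrative, not taken from the paper.

```python
import numpy as np

def to_shared_space(vec, W):
    """Map a monolingual embedding into the shared space via the alignment
    matrix W (a rotation/reflection, so cosine values stay comparable)."""
    v = W @ vec
    return v / np.linalg.norm(v)

def cognate_similarity(w1, w2, emb1, emb2, W1, W2):
    """Cosine similarity of a cognate pair (w1 in language 1, w2 in language 2)
    computed in the shared embedding space; 1 minus this is the distance."""
    v1 = to_shared_space(emb1[w1], W1)
    v2 = to_shared_space(emb2[w2], W2)
    return float(np.dot(v1, v2))
```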

2.2 Cross-Language Semantic Divergence

We propose a definition of semantic divergence between two languages based on the semantic distances of their cognate word pairs in these embedding spaces. The semantic distance between two languages can then be computed as the average of the semantic distances of the cognate pairs in that language pair. We use the list of cognate sets in Romance languages proposed by [6]. It contains 3,218 complete cognate sets in Romanian, French, Italian, Spanish and Portuguese, along with their Latin common ancestors. The cognate sets are obtained from electronic dictionaries which provide information about the etymology of the words. Two words are considered cognates if they have the same etymon (i.e., if they descend from the same word). The algorithm described above for computing semantic distance for cognate pairs stands on the assumption that the (shared) embedding spaces are comparable, so that the averaged cosine similarities, as well as the overall distributions of scores that we obtain for each pair of languages, can be compared in a meaningful way. For this to be true, at least two conditions need to hold:

1. The embedding spaces for each language need to be similarly representative of language, or trained on similar texts. This assumption holds sufficiently in our case, since all embeddings (for all languages) are trained on Wikipedia, which at least contains a similar selection of texts for each language, and at most can be considered comparable corpora.
2. The similarity scores in a certain (shared) embedding space need to be sampled from a similar distribution. To confirm this assumption, we did a brief experiment looking at the distributions of a random sample of similarity scores across all embedding spaces, and did find that the distributions for each language pair are similar (in mean and variance). This result was not obvious but also not surprising, since:
   – The way we create shared embedding spaces is by aligning the embedding space of any language to the English embedding space (which is a common reference to all shared embedding spaces).
   – The nature of the alignment operation (consisting only of rotations and reflections) guarantees monolingual invariance, as described in [1, 12].
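Aggregating the pairwise scores into a language-level similarity (the quantity reported in Table 1) is then a simple average. The sketch below reuses the `cognate_similarity` helper sketched above and assumes `cognate_pairs` is the list of cognate word pairs for the language pair.

```python
import numpy as np

def language_similarity(cognate_pairs, emb1, emb2, W1, W2):
    """Average shared-space cosine similarity over all cognate pairs of a
    language pair; 1 minus this value can be read as semantic divergence."""
    scores = [cognate_similarity(w1, w2, emb1, emb2, W1, W2)
              for w1, w2 in cognate_pairs
              if w1 in emb1 and w2 in emb2]
    return float(np.mean(scores))
```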


The Romance Languages. We compute the cosine similarity between cognates for each pair of modern languages, and between modern languages and Latin as well. We compute an overall score of similarity for a pair of languages as the average similarity for the entire dataset of cognates. The results are reported in Table 1.

Table 1. Average cross-language similarity between cognates (Romance languages).

     Fr     It     Pt     Ro     La
Es   0.67   0.69   0.70   0.58   0.41
Fr          0.66   0.64   0.56   0.40
It                 0.66   0.57   0.41
Pt                        0.57   0.41
Ro                               0.40

We observe that the highest similarity is obtained between Spanish and Portuguese (0.70), while the lowest are obtained for Latin. From the modern languages, Romanian has, overall, the lowest degrees of similarity to the other Romance languages. A possible explanation for this result is the fact that Romanian developed far from the Romance kernel, being surrounded by Slavic languages. In Table 2 we report, for each pair of languages, the most similar (above the main diagonal) and the most dissimilar (below the main diagonal) cognate pair for Romance languages.

Table 2. Most similar and most dissimilar cognates.

     Es                    Fr                    It                           Ro                           Pt
Es   –                     ocho/huit (0.89)      diez/dieci (0.86)            ocho/opt (0.82)              ocho/oito (0.89)
Fr   caisse/casar (0.05)   –                     dix/dieci (0.86)             décembre/decembrie (0.83)    huit/oito (0.88)
It   prezzo/prez (0.06)    punto/ponte (0.09)    –                            convincere/convinge (0.75)   convincere/convencer (0.88)
Ro   miere/mel (0.09)      face/facteur (0.10)   as/asso (0.11)               –                            opt/oito (0.83)
Pt   prez/preço (0.05)     pena/paner (0.09)     preda/prea (0.08)            linho/in (0.05)              –

The problem that we address in this experiment involves a certain vagueness of the reported values (also noted by [8] in the problem of semantic language classification), as there is no gold standard that we can compare our results to. To overcome this drawback, we use the degrees of similarity that we obtained to produce a language clustering (using the UPGMA hierarchical clustering algorithm), and observe that it is similar to the generally accepted tree of languages, and to the clustering tree built on intelligibility degrees by [7]. The obtained dendrogram is rendered in Fig. 3.


Fig. 3. Dendrogram of the language clusters
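The clustering step can be reproduced with SciPy's average-linkage (UPGMA) routine; the sketch below hard-codes the similarities of Table 1 and converts them to distances. It is an illustration of the procedure, not the authors' script.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

langs = ["Es", "Fr", "It", "Pt", "Ro"]
sim = np.array([[1.00, 0.67, 0.69, 0.70, 0.58],
                [0.67, 1.00, 0.66, 0.64, 0.56],
                [0.69, 0.66, 1.00, 0.66, 0.57],
                [0.70, 0.64, 0.66, 1.00, 0.57],
                [0.58, 0.56, 0.57, 0.57, 1.00]])

dist = squareform(1.0 - sim, checks=False)   # condensed distance matrix
tree = linkage(dist, method="average")       # UPGMA = average linkage
dendrogram(tree, labels=langs)               # average-linkage tree over the five languages
```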

The Romance Languages vs English. Further, we introduce English into the mix as well. We run this experiment on a subset of the used dataset, comprising the words that have a cognate in English as well1 . The subset has 305 complete cognate sets. The results are reported in Table 3, and the distribution of similarity scores for each pair of languages is rendered in Fig. 2. We notice that English has 0.40 similarity with Latin, the lowest value (along with French and Romanian), but close to the other languages. Out of the modern Romance languages, Romanian is the most distant from English, with 0.53 similarity. Another interesting observation relates to the distributions of scores for each language pair, shown in the histograms in Fig. 2. While similarity scores between cognates among romance languages usually follow a normal distribution (or another unimodal, more skewed distribution), the distributions of scores for romance languages with English seem to follow a bimodal distribution, pointing to a different semantic evolution for words in English that share a common etymology with a word in a romance language. One possible explanation is that the set of cognates between English and romance languages (which are pairs of languages that are more distantly related) consist of two distinct groups: for example one group of words that were borrowed directly from the romance language to English (which should have more meaning in common), and words that had a more complicated etymological trail between languages (and for which meaning might have diverged more, leading to lower similarity scores).

¹ Here we stretch the definition of cognates, as the term generally refers to words in sister languages. In this case English is not a sister of the Romance languages, and the words with Latin ancestors that entered English are mostly borrowings.


Table 3. Average cross-language similarity between cognates.

     Fr    It    Pt    Ro    En    La
Es   0.64  0.67  0.68  0.57  0.61  0.42
Fr         0.64  0.61  0.55  0.60  0.40
It               0.65  0.57  0.60  0.41
Pt                     0.56  0.59  0.42
Ro                           0.53  0.40
En                                 0.40

2.3 Detection and Correction of False Friends

In a second series of experiments, we propose a method for identifying and correcting false friends. Using the same principles as in the previous experiment, we can use embedding spaces and semantic distances between cognates in order to detect pairs of false friends, which are simply defined as pairs of cognates which do not share the same meaning, or which are not semantically similar enough. This definition is of course ambiguous: there are different degrees of similarity and, as a consequence, different potential degrees of falseness in a false friend. Based on this observation, we define the notions of hard false friend and soft false friend.

A hard false friend is a pair of cognates whose meanings have diverged enough that they no longer have the same meaning and should not be used interchangeably (as translations of one another). Most well-known examples of false friends fall in this category, such as the French-English cognate pair attendre/attend: in French, attendre has a completely different meaning, namely 'to wait'.

A different and more subtle type of false friend results from more minor semantic shifts between the cognates. In such pairs, the meaning of the cognate words may remain roughly the same, but with a difference in nuance or connotation. Such an example is the Romanian-Italian cognate pair amic/amico. Here, both cognates mean 'friend', but in Italian the connotation is that of a closer friend, whereas the Romanian amic denotes a more distant friend, or even an acquaintance. A more suitable Romanian translation for amico would be prieten, while a better Italian translation for amic could be conoscente. Though their meaning is roughly the same, translating one word by the other would be an inaccurate use of the language. These cases are especially difficult for beginner language learners (especially since the cognate pair may appear as a valid translation in multilingual dictionaries), and using them in the wrong context is an easy trap to fall into. Given these considerations, an automatic method for finding the appropriate term to translate a cognate, instead of using the false friend, would be a useful tool to aid translation or language learning.

As a potential solution to this problem, we propose a method that can be used to identify pairs of false friends, to distinguish between the two categories of false friends


defined above (hard false friends and soft false friends), and to provide suggestions for correcting the erroneous usage of a false friend in translation. False friends can be identified as pairs of cognates with high semantic distance. More specifically, we consider a pair of cognates to be a false friend pair if in the shared semantic space, there exists a word in the second language which is semantically closer to the original word than its cognate in that language (in other words, the cognate is not the optimal translation). The arithmetic difference between the semantic distance between these words and the semantic distance between the cognates will be used as a measure of the falseness of the false friend. The word that is found to be closest to the first cognate will be the suggested “correction”. The algorithm can be described as follows:

Algorithm 1. Detection and correction of false friends
1: Given the cognate pair (c1, c2), where c1 is a word in lang1 and c2 is a word in lang2:
2: Find the word w2 in lang2 such that, for any wi in lang2, similarity(c1, w2) ≥ similarity(c1, wi)
3: if w2 ≠ c2 then
4:    (c1, c2) is a pair of false friends
5:    Degree of falseness = similarity(c1, w2) − similarity(c1, c2)
6:    return w2 as potential correction
7: end if
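A minimal sketch of this procedure, following the description above (the cross-lingual nearest neighbour of c1 by cosine similarity is taken as the best translation); the argument names are illustrative, not the authors' implementation:

```python
import numpy as np

def detect_false_friend(c1_vec, c2_word, lang2_vocab, lang2_vectors):
    """c1_vec: embedding of the lang1 cognate in the shared space;
    c2_word: its cognate in lang2; lang2_vocab: list of lang2 words;
    lang2_vectors: matrix of their L2-normalised embeddings, row-aligned with lang2_vocab."""
    c1 = c1_vec / np.linalg.norm(c1_vec)
    sims = lang2_vectors @ c1                     # cosine similarity to every lang2 word
    w2 = lang2_vocab[int(np.argmax(sims))]        # cross-lingual nearest neighbour of c1
    sim_c2 = float(sims[lang2_vocab.index(c2_word)])
    if w2 != c2_word:                             # the cognate is not the best translation
        falseness = float(np.max(sims)) - sim_c2  # degree of falseness
        return True, w2, falseness                # w2 is the suggested correction
    return False, c2_word, 0.0
```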

Table 4 shows a selection of the algorithm's output: examples of extracted false friends for the language pair French-Spanish, along with the suggested correction and the computed degree of falseness. Depending on the application, the measure of falseness could be used by choosing a threshold to single out pairs of false friends that are harder or softer, with a customizable degree of sensitivity to the difference in meaning.

Table 4. Extracted false friends for French-Spanish.

FR cognate   ES cognate   Correction   Falseness
prix         prez         premio       0.67
long         luengo       largo        0.57
face         faz          cara         0.41
change       caer         cambia       0.41
concevoir    concebir     diseñar      0.18
majeur       mayor        importante   0.14


Evaluation. In this section we describe our overall results on identifying false friends for every language pair between English and five Romance languages: French, Italian, Spanish, Portuguese and Romanian.

Table 5. Performance for Spanish-Portuguese using the curated false friends test set.

                 Accuracy  Precision  Recall
Our method       81.12     86.68      75.59
(Castro et al.)  77.28     –          –
WN Baseline      69.57     85.82      54.50

We evaluate our method in two separate stages. First, we measure the accuracy of false friend detection on a manually curated list of false friends and true cognates in Spanish and Portuguese, used in a previous study [4] and introduced in [15]. This resource is composed of 710 Spanish-Portuguese word pairs: 338 true cognates and 372 false friends. We also compare our results to the ones reported in that study, which uses a method similar to ours (a simple classifier that takes embedding similarities as features to identify false friends) and shows improvements over previous research. The results are shown in Table 5.

For the second part of the experiment, we use the list of cognate sets in English and the Romance languages proposed by [6] (the same one used in our semantic divergence experiments), and try to automatically decide which of these are false friends. Since manually built false friend lists are not available for every language pair that we experiment on, for the language pairs in this second experiment we build our gold standard by using a multilingual dictionary (WordNet) in order to infer false friend and true cognate relationships. We assume two cognates in different languages are true cognates if they occur together in any WordNet synset, and false friends otherwise.

Table 6. Performance for all language pairs using WordNet as gold standard.

        Accuracy  Precision  Recall
EN-ES   76.58     63.88      88.46
ES-IT   75.80     41.66      54.05
ES-PT   82.10     40.0       42.85
EN-FR   77.09     57.89      94.28
FR-IT   74.16     32.81      65.62
FR-ES   73.03     33.89      69.96
EN-IT   73.07     33.76      83.87
IT-PT   76.14     29.16      43.75
EN-PT   77.25     59.81      86.48


We measure accuracy, precision, and recall, where:

– a true positive is a cognate pair that are not synonyms in WordNet and are identified as false friends by the algorithm,
– a true negative is a pair which is identified as true cognates and is found in the same WordNet synset,
– a false positive is a word pair which is identified as a false friend pair by the algorithm but also appears as a synonym pair in WordNet,
– and a false negative is a pair of cognate words that are not synonyms in WordNet, but are not identified as false friends by the algorithm.

We should also note that with the WordNet-based method we can evaluate results for only slightly over half of the cognate pairs, since not all of them are found in WordNet. This also makes our corpus-based method more useful than a dictionary-based one, since it is able to cover most of the vocabulary of a language (given a large monolingual corpus to train embeddings on). To be able to compare results to the ones evaluated on the manually built test set, we use the WordNet-based method as a baseline in the first experiment. Results for the second evaluation experiment are reported in Table 6. In this evaluation experiment we were able to measure performance for language pairs among all languages in our cognate set except for Romanian (which is not available in WordNet).
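A minimal sketch of this evaluation, with the WordNet-derived gold standard abstracted into a synonymy predicate (all names below are illustrative; `are_synonyms` stands for "the two words share a synset in a multilingual WordNet"):

```python
def evaluate(cognate_pairs, predicted_false_friend, are_synonyms):
    """cognate_pairs: iterable of (w1, w2); predicted_false_friend(w1, w2) -> bool is the
    algorithm's decision; are_synonyms(w1, w2) -> bool encodes the gold standard."""
    tp = tn = fp = fn = 0
    for w1, w2 in cognate_pairs:
        gold_false_friend = not are_synonyms(w1, w2)
        pred = predicted_false_friend(w1, w2)
        if pred and gold_false_friend:
            tp += 1          # correctly flagged false friend
        elif pred and not gold_false_friend:
            fp += 1          # flagged, but the pair is synonymous in WordNet
        elif not pred and gold_false_friend:
            fn += 1          # missed false friend
        else:
            tn += 1          # correctly kept as true cognates
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```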

3 Conclusions

In this paper we proposed a method for computing the semantic divergence of cognates across languages. We relied on word embeddings and extended the pairwise metric to compute the semantic divergence across languages. Our results showed that Spanish and Portuguese are the closest languages, while Romanian is the most dissimilar from Latin, possibly because it developed far from the Romance kernel. Furthermore, clustering the Romance languages based on the introduced semantic divergence measure results in a hierarchy that is consistent with the generally accepted tree of languages. When further including English in our experiments, we noticed that, even though most Latin words that entered English are probably borrowings (as opposed to inherited words), its similarity to Latin is close to that of the modern Romance languages. Our results shed some light on a new aspect of language similarity, from the point of view of cross-lingual semantic change.

We also proposed a method for detecting and possibly correcting false friends, and introduced a measure for quantifying the falseness of a false friend, distinguishing between two categories: hard false friends and soft false friends. These analyses and algorithms for dealing with false friends can provide useful tools for language learning or for (human or machine) translation.

In this paper we provided a simple method for detecting and suggesting corrections for false friends independently of context. There are, however, false friend pairs that are context-dependent: the cognates can be used interchangeably in some contexts, but not in others. In the future, the embedding-based method could be extended to provide false friend correction suggestions in a given context (possibly by using the word embedding model to predict the appropriate word in that context).


Acknowledgements. Research supported by BRD—Groupe Societe Generale Data Science Research Fellowships.

References 1. Artetxe, M., Labaka, G., Agirre, E.: Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294 (2016) 2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016) 3. Campbell, L.: Historical Linguistics. An Introduction. MIT Press, Cambridge (1998) 4. Castro, S., Bonanata, J., Ros´a, A.: A high coverage method for automatic false friends detection for Spanish and Portuguese. In: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 29–36 (2018) 5. Chen, Y., Skiena, S.: False-friend detection and entity matching via unsupervised transliteration. arXiv preprint arXiv:1611.06722 (2016) 6. Ciobanu, A.M., Dinu, L.P.: Building a dataset of multilingual cognates for the Romanian lexicon. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp. 1038–1043 (2014) 7. Dinu, L.P., Ciobanu, A.M.: On the Romance languages mutual intelligibility. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp. 3313–3318 (2014) 8. Eger, S., Hoenen, A., Mehler, A.: Language classification from bilingual word embedding graphs. In: Proceedings of COLING 2016, Technical Papers, pp. 3507–3518 (2016) 9. Inkpen, D., Frunza, O., Kondrak, G.: Automatic identification of cognates and false friends in French and English. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, vol. 9, pp. 251–257 (2005) 10. Kondrak, G., Marcu, D., Knight, K.: Cognates can improve statistical translation models. In: Companion Volume of the Proceedings of HLT-NAACL 2003-Short Papers (2003) 11. Nakov, S., Nakov, P., Paskaleva, E.: Unsupervised extraction of false friends from parallel bi-texts using the web as a corpus. In: Proceedings of the International Conference RANLP2009, pp. 292–298 (2009) 12. Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017) 13. Søgaard, A., Goldberg, Y., Levy, O.: A strong baseline for learning cross-lingual word embeddings from sentence alignments. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, pp. 765–774 (2017) 14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2519–2528 (2017) 15. Torres, L.S., Alu´ısio, S.M.: Using machine learning methods to avoid the pitfall of cognates and false friends in Spanish-Portuguese word pairs. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology (2011)


16. Vulic, I., Moens, M.: Cross-lingual semantic similarity of words as the similarity of their semantic word responses. In: Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pp. 106–116 (2013) 17. Vulic, I., Moens, M.: Probabilistic models of cross-lingual semantic similarity in context based on latent cross-lingual concepts induced from comparable data. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 349–362 (2014)

Triangulation as a Research Method in Experimental Linguistics

Olga Suleimanova and Marina Fomina(B)

Moscow City University, Moscow 129226, Russia
[email protected]

Abstract. The paper focuses on a complex research procedure based on the hypothesis-deduction method (with the semantic experiment as its integral part), corpus-based experiment, and the analysis of search engine results. The process of verification that increases the validity of research findings by incorporating several methods in the study of the same phenomenon is often referred to as triangulation. Triangulation, a well-established practice in the social sciences, is relatively recent in linguistics. The authors describe a step-by-step semantic research technique employed while studying semantic features of a group of English synonymous adjectives: empty, free, blank, unoccupied, spare, vacant and void. The preliminary stage of the research into the meaning of the adjectives consists in gathering information on their distribution, valence characteristics and all possible contexts they may occur in. The results of this preliminary analysis enable the authors to frame a hypothesis on the meaning of the linguistic units. The authors then proceed to the experimental verification of the proposed hypotheses, supported by corpus-based experiment, the analysis of search engine results, and mathematical-statistical methods and procedures that can help separate the random factor from the component of the informants' grades determined by the language system. The research findings result in stricter semantic descriptions of the adjectives.

Keywords: Triangulation · Linguistic experiment · Corpus-based experiment · Expert evaluation method · Mathematical statistics · Informant · Semantics

1 Introduction

Triangulation is regarded as a process of verification that increases the validity of research findings by incorporating several methods in the study of the same phenomenon in interdisciplinary research. The proponents of this method claim that "by combining multiple observers, theories, methods, and empirical materials, researchers can hope to overcome the weakness or intrinsic biases and the problems that come from single-method, single-observer, single-theory studies" [1]. In 1959, D. Campbell and D. Fiske advocated an approach to assessing the construct validity of a set of measures in a study [2]. This method, which relied on a matrix (the 'multitrait-multimethod matrix') of intercorrelations among tests representing at least two traits, each measured by at least two methods, can be viewed as a prototype of the triangulation technique.


In the social sciences, N. Denzin distinguishes between the following triangulation techniques:

– data triangulation (the researcher collects data from a number of different sources to form one body of data);
– investigator triangulation (several independent researchers collect and then interpret the data);
– theoretical triangulation (the researcher interprets the data relying on more than one theory as a starting point);
– methodological triangulation (the researcher relies on more than one research method or data collection technique), which is the most commonly used technique [3].

Triangulation, a well-established practice in the social sciences (see, e.g., [1–5] and many others), is relatively recent in linguistics. The 1972 study by W. Labov states the 'complementary principle' and the 'principle of convergence' among the key principles of linguistic methodology that govern the gathering of empirical data [6]. W. Labov stresses the importance of triangulation principles in linguistics, arguing that "the most effective way in which convergence can be achieved is to approach a single problem with different methods, with complementary sources of error" [6].

Modern verification procedures and experimental practices are steadily narrowing the gap between linguistics (as an originally descriptive science relying mostly on qualitative methods in studying linguistic phenomena) and the exact sciences. The results of linguistic research acquire the status of tested and proved theories and established laws. In addition to well-known research procedures, the linguistic experiment, based entirely on interviews with native speakers (often referred to as 'informants'), is rapidly gaining ground (see [7] for a detailed account of the verification capacity of the semantic experiment). Recent years have witnessed a significant rise in the number of corpus-based experimental studies. Many linguists support their research procedure with the analysis of search engine results (e.g. Google results). In this paper, we focus on verification procedures that rely on methodological triangulation, where experimental practices are supported by corpus-based experiment, the analysis of search engine results, and mathematical-statistical methods.

2 Methodology

2.1 Semantic Research and Experiment

The semantic experiment is an integral, indispensable part of the complex research procedure often referred to as the hypothesis-deduction method. J.S. Stepanov distinguishes four basic steps of the hypothesis-deduction method: (1) collect practical data and provide a preliminary analysis; (2) put forward a hypothesis to account for the practical data and relate the hypothesis to other existing theories; (3) deduce rules from the suggested theories; (4) verify the theory by relating the deduced rules to the linguistic facts [8].


Following these steps, O.S. Belaichuk worked out a step-by-step procedure for the semantic experiment [9]. Let us demonstrate how it works on the semantic analysis of the meanings of the English adjectives empty, free, blank, unoccupied, spare, vacant and void [10]. The preliminary stage of semantic research into the meaning of a language unit consists in gathering information on its distribution, valence characteristics and all possible contexts it may occur in. The results of this preliminary analysis enable the researcher to frame a hypothesis on the meaning of the linguistic unit in question (see [7] for a detailed description of the step-by-step procedure).

At the next stage, we arrange a representative sampling by reducing the practically infinite sampling to a workable set. Then an original word in the representative sampling is substituted by its synonym. For example, in the original sentence The waiter conducted two unsteady businessmen to the empty table beside them [11] the word empty is replaced by the adjective vacant: The waiter conducted two unsteady businessmen to the vacant table beside them. Then the other synonyms – free, blank, spare, unoccupied and void – are also put in the same context. At this stage, we may not yet have any hypothesis explaining the difference in the meanings of the given adjectives.

At the next stage of the linguistic experiment, informants grade the acceptability of the offered utterances in the experimental sample according to a given scale suggested by A. Timberlake [12]; consider a fragment of the questionnaire (see Fig. 1) used in the interview of native speakers of English [13]. The linguist then processes and analyses the informants' grades to put forward a linguistic hypothesis, and proceeds to the experimental verification of the proposed hypotheses. There is a variety of tests for verifying hypotheses, e.g. the researcher varies only one parameter of the situation described while the others are kept fixed and invariable (see [7, 14–16] for a detailed account of verification procedures).

In addition to the well-established verification procedures employed in the linguistic experiment, corpus-based experiment and the analysis of search engine results are rapidly gaining ground. Researchers claim that these new IT tools give the linguist added value: text corpora as well as search engines such as Google provide invaluable data, though they remain underestimated and have not been explored to their full potential [17]. While in the linguistic experiment we obtain so-called 'negative linguistic material' (the term used by L.V. Scherba), i.e. sentences graded as unacceptable, text corpora do not provide the researcher with marked sentences. The most frequently occurring search results are likely to be acceptable and preferred, while marginally acceptable and not preferred sentences should be rare. To verify a hypothesis with corpora and Google big data, the researcher determines whether, and to what extent, the corpus and Google experimental data comply with his/her predictions and expectations. So, in accordance with the expectations, we get frequent search results with the word empty describing a physical object (a bottle, a box, a table, a room, etc.) construed as three-dimensional physical space, and rare or no results with the word blank in these adjective-noun combinations (see Table 1).

QUESTIONNAIRE
Name / Nationality / Age / Qualifications

DIRECTIONS. Grade each of the sentences below according to the following scale:

Rating  Meaning                    Comment
1       Unacceptable               Not occurring
2       Marginally acceptable      Rare
3       Not preferred              Infrequent
4       Acceptable, not preferred  Frequent
5       Acceptable, preferred      Most frequent

NOTE: Grade sentences with reference to the norm of standard English (slang, vernacular, argot or stylistically marked words are not in the focus of investigation).

Useful hints to prevent possible misapprehension:
! Do not try to assess the degree of synonymy of the words analysed.
! Do not develop possible contexts that may seem to be implied by the words used in the statements; assess the acceptability of the utterances judging by the way the information is presented.
! Still, if you feel that the context is insufficient to assess the acceptability of a sentence, suggest your own context in the "Comments" column corresponding to the sentence (A–G), then grade the utterance according to the context offered by you.
Any of your comments will be highly appreciated!

Sentences to be graded (each with a Rating and a Comments column):
A. The room is empty. All the furniture has been removed.
B. The room is free. All the furniture has been removed.
C. The room is blank. All the furniture has been removed.
D. The room is spare. All the furniture has been removed.
E. The room is unoccupied. All the furniture has been removed.
F. The room is vacant. All the furniture has been removed.
G. The room is void. All the furniture has been removed.

THANK YOU.

Fig. 1. Questionnaire (a fragment).


Table 1. BNC and Google search results (each cell: BNC/Google, as printed).

        Bottle      Box         Table       Wall        Screen      Sheet of paper
Empty   4.97 m/32   3.98 m/19   1.33 m/15   2.4 m/2     0.381 m/2   0.665/1
Blank   0.654 m/0   0.305 m/0   4.92 m/38   4.13 m/12   1.68 m/20   0.156 m/0

2.2 Expert Evaluation Method in Linguistic Experiment

While grading the sentences the informant is governed by the language rules and regulations as well as by some random factors. Thus, each grade, being the result of deterministic and random processes, can be treated as a variate (not to be confused with a 'variable'). In the linguistic experiment, this variate (X) can take on only integer values on the closed interval [1; 5] (five-point system). Therefore, it should be referred to as a discrete variate. Discrete variates can be processed by mathematical-statistical methods. We chose several statistics that best describe such random distributions.

The first one is the expectation for each sentence or, in other words, the mean value of the grades. The expectation corresponds to the centre of a distribution. Thus, it can be interpreted as a numerical expression of the influence of deterministic factors. This characteristic is defined as

\mu_i = \frac{1}{m} \sum_{j=1}^{m} \chi_{ij} \quad (i = 1, 2, \ldots, n)    (1)

where μi is the mean value (the expectation) of grades for the ith sentence; i is a sentence number (i = 1, 2, …, n); j is an informant's number (j = 1, 2, …, m); n is the total number of sentences; m is the total number of informants; χij is the ith sentence's grade given by the jth informant (χij = 1 ÷ 5).

The second characteristic is the dispersion. It defines the extent to which the grades are spread around their mean value. This means that the dispersion is a numerical expression of the influence of random factors. The lower the dispersion of the grade, the more reliable the grade is (the influence of random factors is lower), and vice versa. If the dispersion is high, the researcher should try to find possible reasons which might have led to this value. This statistic can be calculated with (2):

D_i = \frac{1}{m} \sum_{j=1}^{m} (\chi_{ij} - \mu_i)^2 \quad (i = 1, 2, \ldots, n)    (2)

where Di is the dispersion of grades for the ith sentence; μi is the mean value (the expectation) of grades for the ith sentence; i is a sentence number (i = 1, 2, …, n); j is an informant's number (j = 1, 2, …, m); n is the total number of sentences;


m is the total number of informants; χij is the ith sentence's grade given by the jth informant (χij = 1 ÷ 5).

The next step of the algorithm is calculating the mean value for each sentence taking into account the competence of informants (3). The measure of competence of an informant can be expressed via the coefficient of competence, which is a standardized value and can take on any value on the interval (0; 1). The sum of the coefficients of the whole group of informants is to amount to 1 (4). These coefficients can be calculated a posteriori, after the interview. We proceed from the assumption that informants' competence should be estimated in terms of the extent to which each informant's grade agrees with the mean value [13].

\bar{\chi}_i = \sum_{j=1}^{m} \chi_{ij}\,\kappa_j \quad (i = 1, 2, \ldots, n)    (3)

where χ̄i is the mean value of grades for the ith sentence; i is a sentence number (i = 1, 2, …, n); j is an informant's number (j = 1, 2, …, m); n is the total number of sentences; m is the total number of informants; χij is the ith sentence's grade given by the jth informant (χij = 1 ÷ 5); κj is the coefficient of competence for the jth informant, the coefficient of competence being a standardized value, i.e.

\sum_{j=1}^{m} \kappa_j = 1    (4)

The coefficients of competence can be calculated with recurrence formulas (5), (6) and (7):

\chi_i^t = \sum_{j=1}^{m} \chi_{ij}\,\kappa_j^{t-1} \quad (i = 1, 2, \ldots, n)    (5)

\lambda^t = \sum_{i=1}^{n} \sum_{j=1}^{m} \chi_{ij}\,\chi_i^t \quad (t = 1, 2, \ldots)    (6)

\kappa_j^t = \frac{1}{\lambda^t} \sum_{i=1}^{n} \chi_{ij}\,\chi_i^t, \qquad \sum_{j=1}^{m} \kappa_j^t = 1 \quad (j = 1, 2, \ldots, m)    (7)
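A minimal sketch of this recurrence, assuming the grades are stored in a matrix X of shape (n sentences × m informants); the function name is illustrative. With the grade matrix from Table 2 below, the first iteration reproduces the worked value λ¹ ≈ 574.18.

```python
import numpy as np

def competence_weights(X, iterations=2):
    """X: (n_sentences, m_informants) matrix of grades chi_ij.
    Returns the weighted sentence estimates chi_i and the competence
    coefficients kappa_j after the given number of iterations of (5)-(7)."""
    n, m = X.shape
    kappa = np.full(m, 1.0 / m)                   # kappa_j^0 = 1/m
    for _ in range(iterations):
        chi = X @ kappa                           # (5): weighted mean grade per sentence
        lam = float(np.sum(X * chi[:, None]))     # (6): normalising factor lambda^t
        kappa = (X.T @ chi) / lam                 # (7): updated coefficients, summing to 1
    return chi, kappa
```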

We start our calculations with t = 1. In (5) the initial values of the competence coefficients are assumed to be equal and take the value κ_j^0 = 1/m. Then the cluster estimate for the ith sentence in the first approximation (expressed in terms of (5)) is therefore:

\chi_i^1 = \frac{1}{m} \sum_{j=1}^{m} \chi_{ij} \quad (i = 1, 2, \ldots, n)    (8)


λ¹ can be obtained using (6):

\lambda^1 = \sum_{i=1}^{n} \sum_{j=1}^{m} \chi_{ij}\,\chi_i^1    (9)

The coefficients of competence in the first approximation are calculated according to (7):

\kappa_j^1 = \frac{1}{\lambda^1} \sum_{i=1}^{n} \chi_{ij}\,\chi_i^1    (10)

With the coefficients of competence in the first approximation, we may repeat the calculations using (5), (6), (7) to obtain χi², λ², κj² in the second approximation, etc.

Now consider the results of the interview (a fragment) to illustrate how the algorithm works. Eleven informants were asked to grade seven examples (A. The room is empty. All the furniture has been removed; B. The room is free. All the furniture has been removed; C. The room is blank. All the furniture has been removed; D. The room is spare. All the furniture has been removed; E. The room is unoccupied. All the furniture has been removed; F. The room is vacant. All the furniture has been removed; G. The room is void. All the furniture has been removed) according to the above five-point system (see Fig. 1). Table 2 features the results of the interview in the form of grades.

Table 2. Matrix of grades χij (a fragment): rows = sentences i = 1–7, columns = informants j = 1–11.

i\j   1  2  3  4  5  6  7  8  9  10  11
1     5  5  5  5  5  5  5  5  5  5   5
2     3  3  2  1  1  4  1  4  4  1   4
3     1  1  2  1  1  1  1  1  1  2   1
4     1  1  2  1  1  1  1  1  1  1   1
5     4  4  2  2  1  3  4  3  3  2   3
6     3  4  4  4  2  1  4  1  4  5   1
7     1  1  1  1  1  1  1  1  1  2   1

We start our calculations with t = 1. In (5) the initial values of the competence coefficients are assumed to be equal and take the value κ_j^0 = 1/m = 1/11. Then the cluster estimates for the sentences in the first approximation (expressed in terms of (5)) are as shown in Table 3. λ¹ can be obtained using (6):

\lambda^1 = \sum_{i=1}^{n} \sum_{j=1}^{m} \chi_{ij}\,\chi_i^1 = \sum_{i=1}^{7} \sum_{j=1}^{11} \chi_{ij}\,\chi_i^1 = 574.18

Table 3. Matrix of cluster estimates (t = 1).

χ1^1   χ2^1   χ3^1   χ4^1   χ5^1   χ6^1   χ7^1
5      2.55   1.18   1.09   2.82   3      1.09

Table 4. Matrix of the coefficients of competence (t = 1).

κ1^1   κ2^1    κ3^1   κ4^1   κ5^1   κ6^1   κ7^1   κ8^1   κ9^1   κ10^1   κ11^1
0.098  0.1032  0.09   0.08   0.07   0.09   0.09   0.09   0.1    0.09    0.09

Table 4 features the coefficients of competence in the first approximation. With these coefficients, we may repeat the calculations using (5), (6), (7) to obtain χi², λ², κj² in the second approximation (see Tables 5 and 6), etc.

Table 5. Matrix of cluster estimates (t = 2).

χ1^2   χ2^2   χ3^2   χ4^2   χ5^2   χ6^2   χ7^2
5      2.59   1.19   1.09   2.89   3.07   1.09

Now consider the statistic used to assess agreement among informants: the coefficient of concordance. It can be calculated with the following formula:

W = \delta_{act}^2 / \delta_{max}^2    (11)

where δ²_act is the actual dispersion of the pooled informants' grades, and δ²_max is the dispersion of the pooled grades if there is complete agreement among the informants. The coefficient of concordance may assume a value on the closed interval [0; 1]. If the statistic W is 0, there is no overall trend of agreement among the informants, and their responses may be regarded as essentially random. If W is 1, all the informants have been unanimous, and each informant has given the same grade to each of the sentences. Intermediate values of W indicate a greater or lesser degree of unanimity among the informants. To treat the grades as concurring enough, it is necessary that W be higher than a set normative point Wn (W > Wn). Let us take Wn = 0.5. Thus, in case W > 0.5, the informants' opinions are concurring rather than different. We then accept the results of the expertise as valid and the group of informants as reliable. What is more significant is that the experiment has succeeded, and the expertise procedures were accurately arranged to meet all the requirements of the linguistic experiment.
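A minimal sketch of this statistic, following the worked example below (both dispersions are taken around the mean of the actual pooled grades, and the hypothetical "complete agreement" grades are passed in explicitly); with the grade matrix from Table 2 and unanimous grades [5, 4, 1, 1, 3, 4, 1] it yields W ≈ 0.67:

```python
import numpy as np

def concordance(X, unanimous_grades):
    """X: (n_sentences, m_informants) grade matrix; unanimous_grades: the grade each
    informant would give per sentence under complete agreement."""
    m = X.shape[1]
    pooled = X.sum(axis=1)                          # actual pooled grades per sentence
    pooled_max = np.asarray(unanimous_grades) * m   # pooled grades if W were 1
    centre = pooled.mean()                          # mean of the actual pooled grades
    d_act = np.sum((pooled - centre) ** 2)
    d_max = np.sum((pooled_max - centre) ** 2)
    return d_act / d_max                            # W, Eq. (11)
```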


Table 6. Matrix of the coefficients of competence (t = 1; 2).

κj^t      t = 1    t = 2
κ1        0.0980   0.0981
κ2        0.1032   0.1034
κ3        0.0929   0.0929
κ4        0.0845   0.0845
κ5        0.0692   0.0690
κ6        0.0871   0.0870
κ7        0.0944   0.0947
κ8        0.0871   0.0872
κ9        0.1028   0.1031
κ10       0.0937   0.0920
κ11       0.0871   0.0892
Σj κj^t   1        1

Now consider the results of the interview (see Table 7) to illustrate the calculation procedure. If the informants' opinions had coincided absolutely, each informant would have graded the first sentence as 5, the second one as 4, the third and the fourth as 1, the fifth as 3, the sixth as 4, and the seventh sentence as 1. Then the total (pooled) grades given to the sentences would have amounted to 55, 44, 11, 11, 33, 44 and 11, respectively. The mean value of the actual pooled grades is (55 + 28 + 13 + 12 + 31 + 33 + 12) / 7 = 26.3. Then

\delta_{act}^2 = (55 − 26.3)^2 + (28 − 26.3)^2 + (13 − 26.3)^2 + (12 − 26.3)^2 + (31 − 26.3)^2 + (33 − 26.3)^2 + (12 − 26.3)^2 = 1479.43

\delta_{max}^2 = (55 − 26.3)^2 + (44 − 26.3)^2 + (11 − 26.3)^2 + (11 − 26.3)^2 + (33 − 26.3)^2 + (44 − 26.3)^2 + (11 − 26.3)^2 = 2198.14

W = \delta_{act}^2 / \delta_{max}^2 = 1479.43 / 2198.14 = 0.67

The coefficient of concordance equals 0.67, which is higher than the normative point 0.5. Thus, the informants' opinions are rather concurring. Still, the coefficient could have been higher if the grades for the second and sixth examples (see sentences B and F in Fig. 1) had revealed a greater degree of unanimity among the informants: the dispersion of grades for these sentences (D2 and D6) is the highest (see Table 8).

We analysed possible reasons which might have led to some of the scatter in the grades. Here we shall consider the use of the adjective free in the following statement: The room is free. All the furniture has been removed. The research into the semantics of free revealed that native English speakers more readily and more frequently associate the word free with 'costing nothing', 'without payment' rather than with 'available, unoccupied, not in use'.

Table 7. Results of the interview (a fragment).

Informant \ Sentence      1   2   3   4   5   6   7
1                         5   3   1   1   4   3   1
2                         5   3   1   1   4   4   1
3                         5   2   2   2   2   4   1
4                         5   1   1   1   2   4   1
5                         5   1   1   1   1   2   1
6                         5   4   1   1   3   1   1
7                         5   1   1   1   4   4   1
8                         5   4   1   1   3   1   1
9                         5   4   1   1   3   4   1
10                        5   1   2   1   2   5   2
11                        5   4   1   1   3   1   1
Actual pooled grade       55  28  13  12  31  33  12
Pooled grade (if W = 1)   55  44  11  11  33  44  11

Table 8. Dispersion of grades.

D1   D2    D3     D4     D5     D6     D7
0    1.7   0.15   0.08   0.88   2.00   0.08

When the word free used in the latter meaning may cause ambiguity, native speakers opt for synonymous adjectives such as empty, blank, unoccupied, vacant or available, to differentiate it from the meaning of 'without cost'. Consider the following utterances with free: The teacher handed a free test booklet to each student; Jane parked her car in a free lot; Mary entered the free bathroom and locked the door. Informants assess the statements as acceptable provided the adjective free conveys the information that one can have or use the objects (a test booklet, a lot, a bathroom) without paying for them. When we asked the informants to evaluate the same statements with the word free meaning 'available for some particular use or activity', the above sentences were graded as unacceptable: *The teacher handed a free test booklet to each student; *Jane parked her car in a free lot; *Mary entered the free bathroom and locked the door.

The study revealed that many statements with free can be conceived of in two different ways depending on the speaker's frame of reference. This ambiguity leads to a high dispersion of informants' grades, i.e. the grades appear to be spread around their mean value to a great extent and thus cannot be treated as valid.


Thus, the use of the word free is often situational. If a cost issue is assumed by the speaker, it can lead to ambiguities that may explain some of the scatter in the grades. In the statement The room is free. All the furniture has been removed, the speaker may have in mind the possibility of a room being available for use without charge unless it is furnished. The removal of the furniture then has the effect of making the room free of cost, which makes this reading seem more readily available than it might otherwise be. When we asked the informants to assess the statement assuming the word free conveyed the information 'available for some activity', the statement was graded as acceptable, whereas the use of free meaning 'without payment', 'without charge' was found to be not occurring (see [18]).

3 Conclusions

Summing up the results of the research into verification procedures that rely on methodological triangulation, where experimental practices are supported by corpus-based experiment, the analysis of search engine results, and mathematical-statistical methods, we may conclude that:

(1) New IT tools give the linguist added value: text corpora as well as search engines such as Google provide invaluable data, though they remain underestimated; they are yet to be explored to their full explanatory potential;
(2) The results of expert evaluation, represented in digital form, can be treated as discrete variates and then processed with mathematical-statistical methods; these methods and procedures can help separate the random factor from the component of the grade determined by the system of language; as a result the researcher obtains a mathematical estimate of the influence of deterministic as well as random factors, of the consistency of the informants' data and, consequently, of the reliability of their grades; high consistency, in its turn, testifies to the 'quality' of the group of informants and means that interviewing this group will yield reliable data;
(3) Of prime importance is the elaboration of a comprehensive verification system that relies on more than one research method or data collection technique;
(4) The use of triangulation as a research method in experimental linguistics is steadily bridging the gap between linguistics, as an originally purely descriptive field, and other sciences, where the mathematical apparatus has long been applied.

References 1. Jakob, A.: On the triangulation of quantitative and qualitative data in typological social research: reflections on a typology of conceptualizing “uncertainty” in the context of employment biographies. Forum Qual. Soc. Res. 2(1), 1–29 (2001) 2. Campbell, D., Fiske, D.: Convergent and discriminant validation by the multitraitmultimethod matrix. Psychol. Bull. 56(2), 81–105 (1959) 3. Denzin, N.: The Research Act: A Theoretical Introduction to Sociological Methods. Aldine, Chicago (1970)


4. Yeasmin, S., Rahman, K.F.: “Triangulation” research method as the tool of social science research. BUP J. 1(1), 154–163 (2012) 5. Bryman, A.: Social Research Methods, 2nd edn. Oxford University Press, Oxford (2004) 6. Labov, W.: Some principles of linguistic methodology. Lang. Soc. 1, 97–120 (1972) 7. Souleimanova, O.A., Fomina, M.A.: The potential of the semantic experiment for testing hypotheses. Sci. J. “Modern Linguistic and Metodical-and-Didactic Researches” 2(17), 8−19 (2017) 8. Stepanov, J.S.: Problema obshhego metoda sovremennoj lingvistiki. In: Vsesojuznaja nauchnaja konferencija po teoreticheskim voprosam jazykoznanija (11−16 nojabrja 1974 g.): Tez. dokladov sekcionnyh zasedanij, pp. 118−126. The Institute of Linguistics, Academy of Sciences of the USSR, Moscow (1974) 9. Belaichuk, O.S.: Gipotetiko-deduktivnyj metod dlja opisanija semantiki glagolov otricanija (poshagovoe opisanie metodiki, primenjaemoj dlja reshenija konkretnoj issledovatel’skoj zadachi). In: Lingvistika na rubezhe jepoh: dominanty i marginalii 2, pp. 158−176. MGPU, Moscow (2004) 10. Fomina, M.: Universal concepts in a cognitive perspective. In: Schöpe, K., Belentschikow, R., Bergien, A. et al. (eds.) Pragmantax II: Zum aktuellen Stand der Linguistic und ihrer Teildisziplinen: Akten des 43. Linguistischen Kolloquiums in Magdeburg 2008, pp. 353–360. Peter Lang, Frankfurt a.M. et al. (2014) 11. British National Corpus (BYU-BNC). https://corpus.byu.edu/bnc/. Accessed 17 Jan 2019 12. Timberlake, A.: Invariantnost’ i sintaksicheskie svojstva vida v russkom jazyke. Novoe v zarubezhnoj lingvistike 15, 261–285 (1985) 13. Fomina, M.A.: Expert appraisal technique in the linguistic experiment and mathematical processing of experimental data. In: Souleimanova, O. (ed.) Sprache und Kognition: Traditionelle und neue Ansätze: Akten des 40. Linguistischen Kolloquiums in Moskau 2005, pp. 409−416. Peter Lang, Frankfurt a.M. et al. (2010) 14. Fomina, M.A.: Konceptualizacija “pustogo” v jazykovoj kartine mira. (Ph.D. thesis). Moscow City University, Moscow (2009) 15. Seliverstova, O.N., Souleimanova, O.A.: Jeksperiment v semantike. Izvestija AN SSSR. Ser. literatury i jazyka 47(5), 431−443 (1988) 16. Sulejmanova, O.A.: Puti verifikacii lingvisticheskih gipotez: pro et contra. Vestnik MGPU. Zhurnal Moskovskogo gorodskogo pedagogicheskogo universiteta. Ser. Filologija. Teorija jazyka. Jazykovoe obrazovanie 2(12), 60−68 (2013) 17. Suleimanova, O.: Technologically savvy take it all, or how we benefit from IT resources. In: Abstracts. 53. Linguistics Colloquium, 24–27 September 2018, pp. 51−52. University of Southern Denmark, Odense (2018) 18. Fomina, M.: Configurative components of word meaning. In: Küper, Ch., Kürschner, W., Schulz, V. (eds.) Littera: Studien zur Sprache und Literatur: Neue Linguistische Perspektiven: Festschrift für Abraham P. ten Cate, pp. 121−126. Peter Lang, Frankfurt am Main (2011)

Understanding Interpersonal Variations in Word Meanings via Review Target Identification

Daisuke Oba1(B), Shoetsu Sato1,2, Naoki Yoshinaga2, Satoshi Akasaki1, and Masashi Toyoda2

1 The University of Tokyo, Tokyo, Japan
{oba,shoetsu,akasaki}@tkl.iis.u-tokyo.ac.jp
2 Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
{ynaga,toyoda}@iis.u-tokyo.ac.jp

Abstract. When people verbalize what they felt with various sensory functions, they can represent different meanings with the same words or the same meaning with different words; we might mean a different degree of coldness when we say 'this beer is icy cold,' while we could use different words such as "yellow" and "golden" to describe the appearance of the same beer. These interpersonal variations in word meanings not only prevent us from smoothly communicating with each other, but also cause trouble when we perform natural language processing tasks with computers. This study proposes a method of capturing interpersonal variations of word meanings by using personalized word embeddings acquired through the task of estimating the target (item) of given reviews. Specifically, we adopt three methods for effective training of the item classifier: (1) modeling reviewer-specific parameters in a residual network, (2) fine-tuning of reviewer-specific parameters, and (3) multi-task learning that estimates various metadata of the target item described in reviews written by various reviewers. Experimental results with review datasets obtained from ratebeer.com and yelp.com confirmed that the proposed method is effective for estimating the target items. Looking into the acquired personalized word embeddings, we analyzed in detail which words have a strong semantic variation and revealed some trends in the semantic variation of word meanings.

Keywords: Semantic variation · Personalized word embeddings

1 Introduction

We express what we have sensed with various sensory units as language in different ways, and there exist inevitable semantic variations in the meaning of words because the senses and linguistic abilities of individuals are different. For example, even if we use the word "greasy" or "sour," how greasy or how sour can differ greatly between individuals. Furthermore, we may describe the appearance of the same beer with different expressions such as "yellow," "golden" and "orange." These semantic variations not only cause problems in communicating


with each other in the real world but also delude potential natural language processing (nlp) systems. In the context of personalization, several studies have attempted to improve the accuracy of nlp models for user-oriented tasks such as sentiment analysis [5], dialogue systems [12] and machine translation [21], while taking into account the user preferences in the task inputs and outputs. However, all of these studies are carried out based on the settings of estimating subjective output from subjective input (e.g., estimating a sentiment polarity of the target item from an input review or predicting responses from input utterances in a dialogue system). As a result, the model not only captures the semantic variation in the user-generated text (input), but also handles annotation bias of the output labels (the deviation of output labels assigned by each annotator) and selection bias (the deviation of output labels inherited from the targets chosen by users in sentiment analysis) [5]. The contamination caused by these biases hinders us from understanding the solo impact of semantic variation, which is the target in this study. The goal of this study is to understand which words have large (or small) interpersonal variations in their meanings (hereafter referred to as semantic variation in this study), and to reveal how such semantic variation affects the classification accuracy in tasks with user-generated inputs (e.g., reviews). We thus propose a method for analyzing the degree of personal variations in word meanings by using personalized word embeddings acquired through a review target identification task in which the classifier estimates the target item (objective output) from given reviews (subjective input) written by various reviewers. This task is free from annotation bias because outputs are automatically determined without annotation. Also, selection bias can be suppressed by using a dataset in which the same reviewer evaluates the same target (object) only once, so as not to learn the deviation of output labels caused by the choice of inputs. The resulting model allows us to observe only the impact of semantic variations from acquired personalized word embeddings. A major challenge in inducing personalized word embeddings is the number of parameters (reviewers), since it is impractical to simultaneously learn personalized word embeddings for thousands of reviewers. We therefore exploit a residual network to effectively obtain personalized word embeddings using reviewer-specific transformation matrices from a small amount of reviews, and apply a fine-tuning to make the training scalable to the number of reviewers. Also, the number of output labels (review targets) causes an issue when building a reliable model due to the difficulty of extreme multi-class classification. We therefore perform multi-task learning with metadata estimation of the target, to stabilize the learning of the model. In the experiments, we hypothesize that words related to the five senses have inherent semantic variation, and validate this hypothesis. We utilized two largescale datasets retrieved from ratebeer.com and yelp.com that include a variety of expressions related to the five senses. Using those datasets, we employ the task of identifying the target item and its various metadata from a given review with the reviewer’s ID. As a result, our personalized model successfully captured semantic variations and achieved better performance than a reviewer-universal model in


both datasets. We then analyzed the acquired personalized word embeddings from three perspectives (frequency, dissemination and polysemy) to reveal which words have large (small) semantic variation. The contributions of this paper are three-fold: – We established an effective and scalable method for obtaining personal word meanings. The method induces personalized word embeddings acquired through tasks with objective outputs via effective reviewer-wise fine-tuning on a personalized residual network and multi-task learning. – We confirmed the usefulness of the obtained personalized word embeddings in the review target identification task. – We found different trends in the obtained personal semantic variations from diachronic and geographical semantic variations observed in previous studies in terms of three perspectives (frequency, dissemination and polysemous).

2 Related Work

In this section, we introduce existing studies on personalization in natural language processing (nlp) tasks and analysis of semantic variation1 of words. As discussed in Sect. 1, personalization in nlp attempts to capture three types of user preferences: (1) semantic variation in task inputs (biases in how people use words; our target) (2) annotation bias of output labels (biases in how annotators label) and (3) selection bias of output labels (biases in how people choose perspectives (e.g., review targets) that directly affects outputs (e.g., polarity labels)). In the history of data-driven approaches for various nlp tasks, existing studies have focused more on (2) or (3), particularly in text generation tasks such as machine translation [14,17,21] and dialogue systems [12,22]. This is because data-driven approaches without personalization tend to suffer from the diversity of probable outputs depending on writers. Meanwhile, since it is difficult to properly separate these facets, as far as we know, there is no study aiming to analyze only the semantic variations of words depending on individuals. To quantify the semantic variation of common words among communities, Tredici et al. [20] obtained community-specific word embeddings by using the Skip-gram [15], and analyzed obtained word embeddings on multiple metrics such as frequency. Their approach suffers from annotation biases since Skipgram (or language models in general) attempts to predict words in a sentence given the other words in the sentence and therefore both inputs and outputs are defined by the same writer. As a result, the same word can have dissimilar embeddings not only because they have different meanings, but also because 1

Apart from semantic variations, some studies try to find, analyze, or remove biases related to socially unfavorable prejudices (e.g., the association between the words receptionist and female) from word embeddings [2–4, 19]. They analyze word “biases” in the sense of political correctness, which are different from biases in personalized word embeddings we targeted.


they just appear with words in different topics.2 In addition, their approach is not scalable to the number of communities (reviewers in our case) since it simultaneously learns all the community-specific parameters. There also exist several attempts in computational linguistics to capture semantic variations of word meanings caused by diachronic [7,10,18], geographic [1,6], or domain [20] variations. In this study, we analyze the semantic variations of meanings of words at the individual level by inducing personalized word embedding, focusing on how semantic variations are correlated with word frequency, dissemination, and polysemy as discussed in [7,20].

3 Personalized Word Embeddings

In this section, we describe our neural network-based model for inducing personalized word embeddings via review target identification (Fig. 1). Our model is designed to identify the target item from a given review with the reviewer’s ID. A major challenge in inducing personalized word embeddings is the number of parameters. We therefore exploit a residual network to effectively obtain personalized word embeddings using reviewer-specific transformation matrices and apply a fine-tuning for the scalability to the number of reviewers. Also, the number of output labels makes building a reliable model challenging due to the

Fig. 1. Overview of our model.

2

Let us consider the two user communities of Toyota and Honda cars. Although the meaning of the word “car” used in these two communities is likely to be the same, its embedding obtained by Skip-gram model from two user communities will be different since “car” appears with different sets of words depending on each community.


difficulty of extreme multi-class classification. We therefore perform multi-task learning to stabilize the learning of the model.

3.1 Reviewer-Specific Layers for Personalization

First, our model computes the personalized word embedding e_{w_i}^{u_j} of each word w_i in the input text via a reviewer-specific matrix W_{u_j} ∈ R^{d×d} and bias vector b_{u_j} ∈ R^d. Concretely, an input word embedding e_{w_i} is transformed to e_{w_i}^{u_j} as below:

e_{w_i}^{u_j} = \mathrm{ReLU}(W_{u_j} e_{w_i} + b_{u_j}) + e_{w_i}    (1)

where ReLU is a rectified linear unit function. As shown in Eq. (1), we employ a Residual Network (ResNet) [8], since the semantic variation is precisely a deviation from the reviewer-universal word embedding. By sharing the reviewer-specific transformation parameters across words and employing ResNet, we aim for the model to stably learn personalized word embeddings even for infrequent words.
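A minimal PyTorch sketch of Eq. (1) (class and variable names are illustrative, not the authors' code); one such module would be instantiated per reviewer:

```python
import torch
import torch.nn as nn

class ReviewerSpecificLayer(nn.Module):
    """Reviewer-specific affine map with a ReLU, added back to the
    reviewer-universal embedding (residual connection), as in Eq. (1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_{u_j} and b_{u_j}

    def forward(self, e_w: torch.Tensor) -> torch.Tensor:
        # e_w: (..., dim) reviewer-universal word embeddings
        return torch.relu(self.transform(e_w)) + e_w
```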

3.2 Reviewer-Universal Layers

Given the personalized word embedding e_{w_i}^{u_j} of each word w_i in an input text, our model encodes them with a Long Short-Term Memory (LSTM) [9]. The LSTM updates the current memory cell c_t and the hidden state h_t following the equations below:

\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W_{\mathrm{LSTM}} \cdot [\,h_{t-1}; e_{w_i}^{u_j}\,] \right)    (2)

c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t    (3)

h_t = o_t \odot \tanh(c_t)    (4)

where i_t, f_t, and o_t are the input, forget, and output gates at time step t, respectively, e_{w_i} is the input word embedding at time step t, and W_LSTM is a weight matrix. ĉ_t is the current cell state. The operation ⊙ denotes element-wise multiplication and σ is the logistic sigmoid function. We adopt a single-layer bi-directional LSTM (Bi-LSTM) to utilize both the past and the future context. As the representation h of the input text, the Bi-LSTM concatenates the outputs from the forward and the backward LSTM:

h = [\,\overrightarrow{h}_{L-1}; \overleftarrow{h}_{0}\,]    (5)

Here, L denotes the length of the input text, and \overrightarrow{h}_{L-1} and \overleftarrow{h}_{0} denote the outputs of the forward and backward LSTM at their last time steps, respectively. Lastly, a feed-forward layer computes an output probability distribution ŷ from the representation h with a weight matrix W_o and bias vector b_o:

\hat{y} = \mathrm{softmax}(W_o h + b_o)    (6)
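A sketch of the reviewer-universal encoder (Eqs. (2)–(6)), using PyTorch's built-in LSTM rather than the explicit gate equations; class names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ReviewerUniversalEncoder(nn.Module):
    """Single-layer Bi-LSTM over (personalized) word embeddings; the forward and
    backward final states are concatenated and fed to a softmax output layer."""
    def __init__(self, dim: int, hidden: int, n_items: int):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_items)   # W_o, b_o

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, dim)
        states, _ = self.bilstm(embeddings)
        hidden = states.size(2) // 2
        h = torch.cat([states[:, -1, :hidden],      # forward LSTM at the last step
                       states[:, 0, hidden:]],      # backward LSTM at step 0
                      dim=-1)
        return torch.softmax(self.out(h), dim=-1)   # output distribution over items
```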

3.3 Multi-task Learning of Target Attribute Predictions for Stable Training

We consider that training our model for the target identification task can be unstable because its output space (review targets) is extremely large (more than 50,000 candidates). To mitigate this problem, we set up auxiliary tasks that estimate metadata of the target item and solve them simultaneously with the target identification task (the target task) by multi-task learning. This idea is motivated by the hypothesis that understanding related metadata of the target item contributes to the accuracy of target identification. Specifically, we add independent feed-forward layers that compute outputs from the shared sentence representation h defined by Eq. (5) for each auxiliary task (Fig. 1). We assume three types of auxiliary tasks: (1) multi-class classification (the same as the target task), (2) multi-label classification, and (3) regression. We perform the multi-task learning under a loss that sums up the individual losses for the target and auxiliary tasks, as sketched below. We adopt cross-entropy loss for multi-class classification, a summation of per-class cross-entropy losses for multi-label classification, and mean-squared-error loss for regression.
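A minimal sketch of the summed multi-task loss described above (all argument names are illustrative stand-ins for model outputs and gold labels):

```python
import torch.nn.functional as F

def multitask_loss(target_logits, target_gold,
                   aux_class_logits, aux_class_gold,
                   aux_multilabel_logits, aux_multilabel_gold,
                   aux_regression_pred, aux_regression_gold):
    """Cross-entropy for the target item and the multi-class auxiliary task,
    per-class binary cross-entropy for the multi-label task, MSE for regression."""
    loss = F.cross_entropy(target_logits, target_gold)
    loss = loss + F.cross_entropy(aux_class_logits, aux_class_gold)
    loss = loss + F.binary_cross_entropy_with_logits(aux_multilabel_logits, aux_multilabel_gold)
    loss = loss + F.mse_loss(aux_regression_pred, aux_regression_gold)
    return loss
```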

3.4 Training

Considering the case where the number of reviewers is enormous, it is impractical to train the reviewer-specific parameters of all reviewers simultaneously due to memory limitations. Therefore, we first pre-train the model on all the training data without personalization, and then fine-tune only the reviewer-specific parameters by training independent models on the reviews written by each reviewer. In the pre-training, the model uses reviewer-universal parameters W and b (instead of W_{u_j} and b_{u_j}) in Eq. (1), which are then used to initialize the reviewer-specific parameters W_{u_j} and b_{u_j}. This makes our model scalable even to a large number of reviewers. We fix all the reviewer-universal parameters during fine-tuning. Furthermore, we perform multi-task learning only during the pre-training without personalization; we then fine-tune the reviewer-specific parameters W_{u_j}, b_{u_j} of the pre-trained model while optimizing only the target task. This prevents the personalized embeddings from absorbing selection bias; otherwise, the prior output distributions of the auxiliary tasks for individual reviewers could be implicitly learned. The sketch below illustrates this two-stage procedure.
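The two-stage procedure can be sketched as follows, assuming the pre-trained model keeps one ReviewerSpecificLayer per reviewer; the attribute and argument names are hypothetical.

import torch
import torch.nn.functional as F

def finetune_reviewer(model, reviewer_id, reviews, lr=5e-4):
    # Freeze all reviewer-universal parameters learned during pre-training.
    for p in model.parameters():
        p.requires_grad = False
    personal = model.personal[reviewer_id]          # ReviewerSpecificLayer initialized from W, b
    for p in personal.parameters():
        p.requires_grad = True                      # only W_u and b_u are updated
    optimizer = torch.optim.Adam(personal.parameters(), lr=lr)
    for text, target in reviews:                    # optimize the target task only (no MTL here)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(text, reviewer_id), target)
        loss.backward()
        optimizer.step()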

4 Experiments

We first evaluate the target identification task using the two review datasets to confirm the effectiveness of the personalized word embeddings induced by our method. If our model solves this objective task better than the reviewer-universal model obtained by the pre-training of our reviewer-specific model, those personalized word embeddings can be considered to capture personal semantic variation. We then analyze the degree and tendencies of the semantic variation in the obtained word embeddings.

4.1 Settings

Dataset. We adopt review datasets of beer and of food-related services for evaluation, since these domains contain a variety of expressions describing sensory impressions. The RateBeer dataset is extracted from ratebeer.com [13] and includes a wide variety of beers. We selected 2,695,615 reviews about 109,912 types of beers written by reviewers who posted at least 100 reviews. The Yelp dataset is derived from yelp.com and includes a diverse range of services. We selected reviews that (1) have location metadata, (2) fall under either the “food” or “restaurant” categories, and (3) are written by a reviewer who posted at least 100 reviews. As a result, we extracted 426,816 reviews of 56,574 services written by 2,414 reviewers in total. We divided these datasets into training, development, and testing sets with a ratio of 8:1:1. In the rest of this paper, we refer to them as the RateBeer dataset and the Yelp dataset, respectively.

Auxiliary Tasks. Regarding the metadata for multi-task learning (MTL), we chose style and brewery for multi-class classification and alcohol by volume (ABV) for regression in the experiments with the RateBeer dataset. As for the Yelp dataset, we used location for multi-class classification and category for multi-label classification.

Models and Hyperparameters. We compare our model described in Sect. 3 under four different settings. Their differences are (1) whether the fine-tuning for personalization is applied and (2) whether the model is trained through MTL before the fine-tuning.

Table 1. Hyperparameters of our model.
Dimensions of word embeddings: 200
Dimensions of hidden layer: 200
Dropout rate: 0.2
Vocabulary size (RateBeer dataset): 100,288
Vocabulary size (Yelp dataset): 98,465
Optimization algorithm: Adam
Learning rate: 0.0005
Batch size: 200

(Footnotes: the datasets are available at https://www.ratebeer.com and https://www.yelp.com/dataset; we implemented all the models using PyTorch (https://pytorch.org/) version 0.4.0.)


Table 1 shows the major hyperparameters. We initialize the embedding layer with Skip-gram embeddings [15] pre-trained on each of the original datasets, i.e., on all the reviews in the RateBeer and Yelp datasets, respectively. The vocabulary for each dataset includes all the words that appear 10 times or more in that dataset. For optimization, we trained the models for up to 100 epochs with Adam [11] and selected as the test model the model at the epoch with the best result on the target task on the development set.

4.2 Overall Results

Table 2 and Table 3 show the results on the two datasets. We gain two insights from the results: (1) in the target task, the model with both MTL and personalization outperformed the others, and (2) personalization also improves the auxiliary tasks. The model without personalization assumes that the same words written by different reviewers have the same meanings, while the model with personalization distinguishes them. The improvement by personalization on the target task with objective outputs partly supports the claim that the same words written by different reviewers have different meanings, even though they are used in the same domain (beer or restaurant).

Table 2. Results on the product identification task on the RateBeer dataset. Accuracy and RMSE marked with ∗∗ or ∗ are significantly better than the other models (p < 0.01 or 0.01 < p ≤ 0.05, assessed by a paired t-test for accuracy and a z-test for RMSE). Rows: the four model settings (combinations of multi-task learning and personalization) followed by the baseline.
Product [Acc.(%)]: 15.74, 16.69, 16.16, 17.56∗∗; baseline 0.08
Brewery [Acc.(%)]: n/a, n/a, (19.98), (20.81∗∗); baseline 1.51
Style [Acc.(%)]: n/a, n/a, (49.00), (49.78∗∗); baseline 6.19
ABV [RMSE]: n/a, n/a, (1.428), (1.406∗); baseline 2.321

Table 3. Results on the service identification task on the Yelp dataset. Accuracy marked with ∗∗ is significantly better than the others (p < 0.01, assessed by a paired t-test). Rows: the four model settings (combinations of multi-task learning and personalization) followed by the baseline.
Service [Acc.(%)]: 6.75, 7.15, 9.71, 10.72∗∗; baseline 0.05
Location [Acc.(%)]: n/a, n/a, (70.33), (83.14∗∗); baseline 27.00
Category [Micro F1]: n/a, n/a, (0.578), (0.577); baseline 0.315


Simultaneously solving the auxiliary tasks that estimate metadata of the target item guided the model to understand the target item from various perspectives, like part-of-speech tags of words. We should mention that only the reviewer-specific parameters are updated for the target task in fine-tuning. This means that the improvements on the auxiliary tasks were obtained purely from the semantic variations captured by the reviewer-specific parameters.

Impact of the Number of Reviews for Personalization. We investigated the impact of the number of reviews available for personalization on review target identification. We first grouped the reviewers into several bins according to their number of reviews, and then evaluated the classification accuracies for reviews written by the reviewers in each bin. Figure 2 shows the classification accuracy of the target task plotted against the number of reviews per reviewer; for example, the plots (and error bars) at 10^{2.3} represent the accuracy (and its variation) of target identification for reviews written by reviewers with n reviews, where 10^{2.1} ≤ n < 10^{2.3}. Contrary to our expectation, on (a) the RateBeer dataset, all of the models obtained lower accuracies as the number of reviews increased. On the other hand, on (b) the Yelp dataset, only the model employing both MTL and personalization obtained higher accuracies as the number of reviews increased. We consider that this difference comes from the bias of frequencies in the review targets: the RateBeer dataset is heavily skewed, with the top-10% most frequent beers accounting for 74.3% of all reviews, while the top-10% most frequent restaurants in the Yelp dataset account for 48.0% of the reviews. It is therefore more difficult to estimate infrequent targets in the RateBeer dataset, and such reviews tend to be written by experienced reviewers. Although the model without MTL and personalization also obtained slightly lower accuracies even on the Yelp dataset, the model with both MTL and personalization successfully exploited the increased number of reviews and obtained higher accuracies.

Fig. 2. Accuracies in the target identification task against the number of reviews per reviewer for (a) the RateBeer dataset and (b) the Yelp dataset. In the legend, MTL and PRS stand for multi-task learning and personalization.


Fig. 3. Personal semantic variations of the words on the two datasets, plotted against (a) log-frequency, (b) dissemination, and (c) polysemy for the RateBeer dataset, and (d) log-frequency, (e) dissemination, and (f) polysemy for the Yelp dataset. The Pearson correlation coefficients are (a) 0.43, (b) 0.29, (c) −0.07, (d) 0.27, (e) 0.16, and (f) −0.19, respectively. The trendlines show 95% confidence intervals from kernel regressions.

4.3 Analysis

In this section, we analyze the obtained personalized word embeddings to see what kind of personal biases exist in each word. Here, we target only the words used by 30% or more of the reviewers (excluding stopwords) to remove the influence of low-frequency words. We first define the personal semantic variation of a word w_i, which quantifies how much the representations of the word differ across individuals, as:

    \frac{1}{|U(w_i)|} \sum_{u_j \in U(w_i)} (1 - \cos(e^{u_j}_{w_i}, \bar{e}_{w_i}))    (7)

where e^{u_j}_{w_i} is the personalized word embedding of w_i for a reviewer u_j, \bar{e}_{w_i} is the average of e^{u_j}_{w_i} over U(w_i), and U(w_i) is the set of reviewers who used the word w_i at least once in the training data. Here, we focus on three perspectives, frequency, dissemination, and polysemy, which have been discussed in studies of semantic variations caused by diachronic or geographical differences of text [6,7,20] (Sect. 2).

(Footnote: unlike the definition of semantic variation in existing studies [20], which measures the degree of change of a word meaning from one point to another, personal semantic variation measures how much the meanings of a word defined by individuals diverge.)
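A small NumPy sketch of the computation in Eq. (7); the input format (a mapping from reviewers to their personalized vectors of w_i) is an assumption.

import numpy as np

def personal_semantic_variation(personalized_vectors):
    # personalized_vectors: dict reviewer -> personalized embedding of w_i for that reviewer.
    vectors = np.stack(list(personalized_vectors.values()))      # |U(w_i)| x d
    mean_vec = vectors.mean(axis=0)                              # \bar{e}_{w_i}
    cosines = vectors @ mean_vec / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean_vec))
    return float(np.mean(1.0 - cosines))                         # average cosine distance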


Table 4. The list of top-50 words with the largest (and the smallest) semantic variation on the RateBeer dataset and the Yelp dataset. Adjectives are boldfaced.
RateBeer dataset, top-50: ery bready ark slight floral toasty tangy updated citrusy soft deep mainly grassy aroma doughy dissipating grass ot great earthy smell toasted somewhat roasty soapy perfume flowery lingering musty citrus malty background malt present hue minimal earth foamy faint dark medium clean nice copper hay bread herbs chewy complexity toast reddish
RateBeer dataset, bottom-50: reminds cask batch oil reminded beyond canned conditioned double abv hope horse oats rye brewery blueberry blueberries maple bells old cork shame dogfish become dog hand plastic course remind christmas cross rogue extreme organic fat lost words islands etc. growler hot heat stout alcohol unibroue pass nitro longer scotch rare
Yelp dataset, top-50: tasty fantastic great awesome delish excellent yummy delicious good amazing phenomenal superb asparagus risotto flavorful calamari salmon creamy chicken got veggies incredible ordered scallops sides outstanding sausage flatbread shrimp eggplant patio ambiance sandwich wonderful desserts salty gnocchi fabulous quesadilla atmosphere bacon mussels sauce vegetables restaurant broth grilled mushrooms ravioli decor food
Yelp dataset, bottom-50: easily note possibly almost nearly warning aside opposite alone even needless saving yet mark thus wish apart thankfully straight possible iron short eye period thumbs old deciding major zero meaning exact replaced fully somehow single de key personal desired hence pressed rock exactly ups keeping hoping whole meant seeing test hardly

Figure 3 shows the semantic variations against the three metrics. The x-axes correspond to the log frequency of the word ((a) and (d)), the ratio of the reviewers who used the word ((b) and (e)), and the number of synsets found in WordNet [16] ((c) and (f)), respectively. Interestingly, in contrast to the reports by [7] and [20], the semantic variations correlate strongly with frequency and dissemination, and poorly with polysemy in our results. This tendency of interpersonal semantic variations can be explained as follows. In the datasets used in our experiments, words related to the five senses such as “soft” and “creamy” appear frequently, and their usage depends on the feelings and experiences of individuals; therefore, they show high semantic variations. As for polysemy, although the semantic variations might change the degree or nuance of a word sense, they do not change its synset, because those words are still used only in skewed contexts related to food and drink, where word senses do not fluctuate significantly. Table 4 shows the top-50 words with the largest (and smallest) semantic variations. As can be seen from the table, the list of top-50 words contains many more adjectives than the list of bottom-50 words; such adjectives are likely to be used to express individual feelings that depend on the five senses. To see in detail what kinds of words have large semantic variation, we classified the adjectives in the top-50 (and bottom-50) lists by the five senses: sight (vision), hearing (audition), taste (gustation), smell (olfaction), and touch (somatosensation). On the RateBeer dataset, there were more words representing each sense except hearing in the top-50 words than in the bottom-50. On the other hand, the lists for the Yelp dataset include fewer words related to the five senses than those for the RateBeer dataset, but many adjectives that could be applicable to various domains (e.g., “tasty” and


Fig. 4. Two-dimensional representations of the words “bready” (a: RateBeer dataset) and “tasty” (b: Yelp dataset), together with the words closest to them in the universal embedding space.

“excellent”). This may be due to the domain size of the Yelp dataset and the lack of reviews detailing specific products in the restaurant reviews. We also analyze whether there are words that get confused with each other. We use the words “bready” and “tasty”, which have the highest semantic variation in each dataset. We visualized their personalized word embeddings using Principal Component Analysis (PCA), together with the six words closest to each target word in the universal embedding space, in Fig. 4; a sketch of this visualization is given below. As can be seen, the clusters of “cracky,” “doughy,” and “biscuity” are mixed with each other, suggesting that the words used to express the same meaning may differ by individual.
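The visualization can be reproduced roughly as below with scikit-learn and matplotlib; the input format is our assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_personalized(words_to_vectors):
    # words_to_vectors: dict word -> array (n_reviewers, d) of personalized embeddings.
    labels, mats = zip(*words_to_vectors.items())
    coords = PCA(n_components=2).fit_transform(np.concatenate(mats))
    start = 0
    for word, mat in zip(labels, mats):
        xy = coords[start:start + len(mat)]
        plt.scatter(xy[:, 0], xy[:, 1], s=5, label=word)
        start += len(mat)
    plt.legend()
    plt.show()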

5 Conclusions

In this study, we focused on interpersonal variations in word meanings and explored the hypothesis that words related to the five senses have inevitable personal semantic variations. To verify this, we proposed a novel method for obtaining semantic variation using personalized word embeddings induced through a task with objective outputs. Experiments using large-scale review datasets from ratebeer.com and yelp.com showed that the combination of multi-task learning and personalization improved the performance of review target identification, which indicates that our method can capture interpersonal variations of word meanings. Our analysis showed that words related to the five senses have large interpersonal semantic variations. For future work, besides the factors examined in this study such as frequency, we plan to analyze the relationships between semantic variations and demographic factors of the reviewers, such as gender and age, which shape how individuals express themselves.


Acknowledgements. We thank Satoshi Tohda for proofreading the draft of our paper. This work was partially supported by Commissioned Research (201) of the National Institute of Information and Communications Technology of Japan.

References 1. Bamman, D., Dyer, C., Smith, N.A.: Distributed representations of geographically situated language. In: 56th ACL, pp. 828–834 (2014) 2. Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: NIPS 2016, pp. 4349–4357 (2016) 3. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186 (2017) 4. D´ıaz, M., Johnson, I., Lazar, A., Piper, A.M., Gergle, D.: Addressing age-related bias in sentiment analysis. In: CHI Conference 2018, p. 412. ACM (2018) 5. Gao, W., Yoshinaga, N., Kaji, N., Kitsuregawa, M.: Modeling user leniency and product popularity for sentiment classification. In: 6th IJCNLP, pp. 1107–1111 (2013) 6. Garimella, A., Mihalcea, R., Pennebaker, J.: Identifying cross-cultural differences in word usage. In: 26th COLING, pp. 674–683 (2016) 7. Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: 54th ACL, pp. 1489–1501 (2016) 8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, pp. 770–778 (2016) 9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 10. Jaidka, K., Chhaya, N., Ungar, L.: Diachronic degradation of language models: insights from social media. In: 56th ACL, pp. 195–200 (2018) 11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR 2015 (2015) 12. Li, J., Galley, M., Brockett, C., Spithourakis, G., Gao, J., Dolan, B.: A personabased neural conversation model. In: 54th ACL, pp. 994–1003 (2016) 13. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: 7th ACM on Recommender Systems, pp. 165–172 (2013) 14. Michel, P., Neubig, G.: Extreme adaptation for personalized neural machine translation. In: 56th ACL, pp. 312–318 (2018) 15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS 2013, pp. 3111–3119 (2013) 16. Miller, G.A.: Wordnet: a lexical database for English. ACM Commun. 38(11), 39–41 (1995) 17. Mirkin, S., Meunier, J.L.: Personalized machine translation: predicting translational preferences. In: EMNLP 2015, pp. 2019–2025 (2015) 18. Rosenfeld, A., Erk, K.: Deep neural models of semantic shift. In: NAACL-HLT 2018, pp. 474–484 (2018) 19. Swinger, N., De-Arteaga, M., Heffernan, I., Thomas, N., Leiserson, M.D., Kalai, A.T.: What are the biases in my word embedding? arXiv:1812.08769 (2018) 20. Tredici, M.D., Fern´ andez, R.: Semantic variation in online communities of practice. In: 12th IWCS (2017)


21. Wuebker, J., Simianer, P., DeNero, J.: Compact personalized models for neural machine translation. In: EMNLP 2018, pp. 881–886 (2018) 22. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: i have a dog, do you have pets too? In: 56th ACL, pp. 2204–2213 (2018)

Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation

Aliaksandr Huminski, Fiona Liausvia, and Arushi Goel
Institute of High Performance Computing, Singapore 138632, Singapore
{huminskia,liausviaf,arushi goel}@ihpc.a-star.edu.sg

Abstract. Semantic role theory is a widely used approach for verb representation. Yet, there are multiple indications that the semantic role paradigm is necessary but not sufficient to cover all elements of verb structure. We conducted a statistical analysis of semantic role representation in VerbNet and FrameNet to provide empirical evidence of this insufficiency. The consequence is a hybrid role-scalar approach.

Keywords: Verb representation · Semantic role · VerbNet · FrameNet

1 Introduction

The semantic representation of verbs has a long history in linguistics. Fifty years ago, the article “The case for case” [1] gave a start to semantic role theory, which is widely used for verb representation. Since semantic role theory is one of the oldest constructs in linguistics, a variety of resources with different sets of semantic roles has been proposed. There are three types of resources, depending on the granularity of the role set. The first level is very specific, with roles like “eater” for the verb eat or “hitter” for the verb hit. The third level is very general, ranging from only two proto-roles [2] up to nine roles. The second level lies between them and contains, to the best of our knowledge, approximately 10 to 50 roles. This rough classification corresponds to the largest linguistic resources: FrameNet [3], VerbNet [4], and PropBank [5], which belong to the first, second, and third type of resource, respectively. All of them use semantic role representation for verbs and are combined in the Unified Verb Index system. They are widely used in advanced NLP and NLU tasks, such as semantic parsing and semantic role labeling, question answering, recognizing textual entailment, and information extraction. Knowledge of semantic representation and verb-argument structure is a key point for NLU systems and applications. The paper is structured as follows. Section 2 briefly introduces VerbNet and FrameNet, the ideas underlying their construction and the main differences

http://verbs.colorado.edu/verb-index/.



between them. Section 3 focuses on the basic statistical analysis of VerbNet and FrameNet. Section 4 describes advanced statistical analysis that shows that the role paradigm itself is necessary but not sufficient for proper representation of all verbs. A hybrid role-scalar approach is presented in Sect. 5. The final Sect. 6 reports our concluding observations.

2 VerbNet and FrameNet as Linguistic Resources for Analysis

VerbNet and FrameNet are the two most well-known resources in which semantic roles are used. PropBank, which is considered the third resource in the Unified Verb Index, provides a semantic role representation for every verb in the Penn TreeBank [6]. However, we do not analyse it in this article, since PropBank defines semantic roles on a verb-by-verb basis without making any higher generalizations. Nor do we use WordNet [7] for the analysis, since this resource does not provide semantic role representations for verbs.

2.1 VerbNet

VerbNet (VN) is the largest domain-independent computational verb lexicon currently available for English. In this paper we use version 3.3 (http://verbs.colorado.edu/verb-index/vn3.3/), released in June 2018. It contains 6791 verbs and provides a semantic role representation for all of them. With 39 roles, VN 3.3 belongs to the second level of role-set resources: its roles are not as fine-grained as in FrameNet and not as coarse-grained as in PropBank. VN was considered together with the LIRICS role set for the ISO standard 24617-4 for Semantic Role Annotation [8–10].

Idea of Construction. VN is built on Levin’s classification of verbs [11]. The classification is based on the idea that the syntactic behavior of a verb (its syntactic alternations) is to a large degree determined by its meaning. Similar syntactic behavior is taken as a criterion for grouping verbs into classes that are considered semantic classes, so verbs that fall into the same class according to shared behavior are expected to share meaning components. As a result, each verb (more accurately, each verb sense, because of verb polysemy) belongs to a specific class in VN. In turn, each class has a role set that equally characterizes all members (verbs) of the class.

(Footnote on PropBank: each verb in PropBank has verb-specific numbered roles, Arg0, Arg1, Arg2, etc., with several more general roles that can be applied to any verb. That makes semantic role labeling too coarse-grained. Most verbs have two to four numbered roles. And although the tagging guidelines include a “descriptor” field for each role, such as “kicker” for Arg0 or “instrument” for Arg2 in the frameset of the verb kick, it does not have any theoretical standing [5].)

2.2 FrameNet

FrameNet (FN) is a lexicographic project built on the theory of Frame Semantics developed by Fillmore [12]. We consider FrameNet releases 1.5 and 1.7. Roles in FN are extremely fine-grained in comparison with VN: according to the FN approach, situations and events should be represented through highly detailed roles.

Idea of Construction. FN is based on the idea that a word’s meaning can be understood only with reference to a structured background [13]. In contrast to VN, FN is first and foremost semantically driven; the same syntactic behavior is not needed to group verbs together. FN takes semantic criteria as primary, and roles (called frame elements in FN) are assigned not to a verb class but to a frame that describes an event. Frames are empirically derived from the British National Corpus, and each frame is considered a conceptual structure that describes an event and its participants. As a result, a frame can include not only verbs but also nouns, multi-word expressions, adjectives, and adverbs, all grouped together according to the frames. As in VN, each frame has a role set that equally characterizes all members of the frame. The role set is essential for understanding the event (situation) represented by a frame.

2.3 VerbNet and FrameNet in Comparison

Table 1 summarizes the differences between VN and FN.

Table 1. Basic differences of VN and FN.
             FrameNet            VerbNet
Basis        Lexical semantics   Argument syntax
Data source  Corpora             Linguistic literature
Roles        Fine-grained        Coarse-grained
Results      Frames              Verb classes

3 Basic Statistical Analysis

Basic statistical analysis is a necessary step before the advanced analysis. Prior to analyzing the relations across verbs, classes/frames, and roles, we need to extract the classes/frames, those among them in which at least one verb occurs, all unique roles, all verbs in classes/frames, etc.

(Footnotes 5–7: FrameNet releases are available at https://framenet.icsi.berkeley.edu/fndrupal/. We modified the original comparison presented in [14] for our own purposes. The expression classes/frames is used hereinafter to emphasize that verbs are grouped into classes in VN and into frames in FN.)

A. Huminski et al.

Table 2 summarizes the basic statistics related to VN and FN. It is necessary to provide some comments: 1. Number of classes/frames with and without verbs are different since there are classes in VN and frames in FN with no verbs. Also there are non-lexical frames in FN with no lexical units inside. 2. Calculating the number of classes in VN, we consider the main class and its subclass as 2 different classes even if they have the same role set. 3. Number of roles in FN reflects the number of unique roles that occur only in frames with verbs. We distinguish here the number of uniques roles from the number of the roles with duplicates in all frames (total 10542 for FN 1.7). 4. Number of verbs in reality is a number of verb senses that are assigned to different classes/frames. Because of polysemy the number of verb senses is larger than the number of unique verbs.

Table 2. Basic statistics of VN and FN. Resource

4

Number of Number of Number of Number of Av. number roles classes/ classes/ verbs of verbs per frames frames class/frame with verbs

VerbNet 3.2

30

484

454

6338

14

VerbNet 3.3

39

601

574

6791

11.8

FrameNet 1.5

656

1019

605

4683

7.7

FrameNet 1.7

727

1221

699

5210

7.45

Advanced Statistical Analysis

In what follows we investigate only the latest versions of VN (3.3) and FN (1.7). The advanced statistical analysis includes the following two types:
– the distribution of verbs per class;
– the distribution of roles per class.
The distribution of verbs per class concerns how many verbs similar in meaning are located in one class. The distribution of roles per class concerns how many verbs similar in meaning are located in different classes.

4.1 Distribution of Verbs per Class in VN and FN

The distribution of verbs can be presented in two mutually dependent modes. The first mode consists of three steps:
1. calculation of the number of verbs per class;
2. sorting of the classes according to the number of verbs;


Fig. 1. Distribution of verbs per class in VN 3.3.

Fig. 2. Distribution of verbs per class in FN 1.7.

3. distribution of the verb number per class, starting from the top class.
Figure 1 for VN and Fig. 2 for FN illustrate the final (third) step. Based on them, one can conclude that:
– verbs are not distributed evenly across the classes/frames; there is a sharp deviation from the average value of 11.8 verbs per class in VN 3.3 and 7.45 verbs per frame in FN 1.7;
– regardless of the resource type (coarse-grained or fine-grained role set), the sharp deviation remains surprisingly the same.


Fig. 3. Verb coverage starting from the top classes in VN 3.3.

Fig. 4. Verb coverage starting from the top classes in FN 1.7.

The second mode shares steps 1 and 2, but the third step is different: it is the distribution of the verb coverage (from 0 to 1) starting from the top class. Figure 3 for VN and Fig. 4 for FN illustrate this final step. Based on them, one can conclude that:
– verb coverage is a non-linear function;
– regardless of the resource type (coarse-grained or fine-grained role set), verb coverage remains surprisingly the same non-linear function. For example, of the 574 classes in VN 3.3, 123 classes cover 50% of all verbs and 319 classes cover 90%; of the 699 frames in FN 1.7, 95 frames cover 50% of all verbs and 416 frames cover 90%. A sketch of this coverage computation is given below.
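A sketch of the coverage computation, assuming a simple mapping from verb senses to their class or frame (the data format is ours).

from collections import Counter

def verb_coverage(verb_to_class):
    # Sort classes/frames by size (largest first) and accumulate the share of verbs covered.
    sizes = sorted(Counter(verb_to_class.values()).values(), reverse=True)
    total = sum(sizes)
    coverage, covered = [], 0
    for size in sizes:
        covered += size
        coverage.append(covered / total)   # e.g. find the smallest k with coverage[k-1] >= 0.5
    return coverage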

4.2 Distribution of Roles per Class in VN and FN

If the distribution of verbs reflects the similarity between verbs within one class/frame, the distribution of roles shows the similarity between verbs located in different classes/frames. This similarity is revealed through identical role sets shared by different classes. We extracted all classes that have the same role set and merged them together (a sketch of this grouping is given after Table 3). Table 3 shows the difference between the total number of classes/frames with verbs and the number of merged classes/frames with distinct role sets.

Table 3. Statistics of classes/frames with different role sets.
Resource      Classes/frames with verbs  Classes/frames with distinct role sets
VerbNet 3.3   574                        138
FrameNet 1.7  699                        619
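The merging of classes with identical role sets can be sketched as follows; the input mapping is an assumption about the data format.

from collections import defaultdict

def merge_by_role_set(class_to_roles):
    # class_to_roles: dict class/frame name -> iterable of role names.
    groups = defaultdict(list)
    for cls, roles in class_to_roles.items():
        groups[frozenset(roles)].append(cls)   # classes sharing a role set fall into one bucket
    return groups                               # len(groups) = number of distinct role sets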

Table 4 provides some examples of role sets shared by different classes/frames.

Table 4. Examples of the same role sets used in different classes/frames.
Resource  Role set for representation of the class/frame  Number of classes/frames  Number of verbs in all classes
VN 3.3    Agent:Destination:Initial Location:Theme        15                        532
          Location:Theme                                  14                        269
          Agent:Recipient:Topic                           11                        251
          Agent:Instrument:Patient                        7                         312
          Agent:Instrument:Patient:Result                 6                         506
FN 1.7    Entity                                          17                        64
          Agent:Theme:Source                              3                         83
          Experiencer:Stimulus                            2                         138
          Self mover:Source:Path:Goal:Direction           2                         137
          Cause:Theme:Goal or Agent:Theme:Goal            2                         125

The second type of role distribution is the number of role occurrences over all classes/frames (see Fig. 5 for VN and Fig. 6 for FN). Based on this distribution, one can conclude that:
– the distribution of roles is a non-linear function; the top 2–3 roles occur in almost all classes;


Fig. 5. Role distribution in VN 3.3.

Fig. 6. Role distribution in FN 1.7.


– regardless of the resource type (coarse-grained or fine-grained role set), the distribution of roles remains surprisingly the same non-linear function;
– the distribution of roles correlates with the distribution of verbs (compare Fig. 5 and Fig. 6 with Fig. 1 and Fig. 2, respectively).

4.3 General Analysis and Evaluation

Both the verb and role distributions in VN and FN show a sharp deviation from the average value. Despite the fact that VN and FN differ in the principles of their construction (Table 1) and differ significantly in the number of roles (Table 2), we obtain an identical picture in both VN and FN for all types of distributions. This similarity is surprising, since the obvious expectation is the opposite: the larger the number of roles, the more even the role/verb distribution should be and the less disproportion is expected.

The Reason Why It Happens. Assigning a role representation to all verbs of a language assumes by default that the set of all verbs is homogeneous and, because of this homogeneity, can be described through one unique approach: semantic roles. We consider the statistical results an indication that the role paradigm itself is necessary but not sufficient for a proper representation of all verbs. We argue that the set of verbs in a language is not homogeneous; instead, it is heterogeneous and requires at least two different approaches.

5 Hybrid Role-Scalar Approach

For the sake of a universal semantic representation, we offer a hybrid role-scalar approach.

5.1 Hypothesis: Roles Are Not Sufficient for Verb Representation

By definition, a semantic role is the function of a participant, represented by an NP, towards an event represented by a verb. Nevertheless, to cover all verbs, semantic role theory was extended beyond this traditional definition so as to represent, for example, a change of state. In VN there are roles like Attribute, Value, Extent, Asset, etc. that match abstract participants, attributes, and their changes. For example, in the sentence “Oil soared in price by 10%”, “price” is in the role of Attribute and “10%” is in the role of Extent, which, according to the definition, specifies the range or degree of change. If we are going to represent a change of state through roles, we need to assign a role to the state of a participant, not to the participant itself. Second, a change of state means a change in the value of a state in a particular direction. For example,


the event “heat the water” involves values of the state “temperature” for water as a participant. So, to reflect a change of state we would need to introduce two new roles: the initial value of the state and its final value on a scale of increasing values along the dimension of temperature. These two new “roles” look like numbers on a scale, not roles, and it is unclear what a role of a value really means. We argue that such attempts to extend semantic role theory contradict the nature of a semantic role. Roles are just one part of event representation and do not cover an event completely. While a role is a suitable means for action verbs like “hit” or “stab”, a scalar is necessary for the representation of verbs like “kill” or “heat”. For instance, in semantic role theory the verb kill has the role set [Agent, Patient], while the meaning of kill contains no information about what the Agent actually did to the Patient. With only Agent and Patient, the verb kill is represented through an unknown action. Meanwhile, what is important for kill is not the action itself but the resulting change of state: the Patient died. This part of the meaning, hidden by the roles, can be represented via a scalar change “alive-dead”. Roles give us a necessary but not sufficient representation, since change-of-state verbs do not indicate how something was done but what was done. The dichotomy between role and scale can also be expressed as the dichotomy between semantic field and semantic scale: a frame is considered a semantic field whose members are closely related to each other by their meanings, while a semantic scale includes a set of values that are scattered along the scale and opposed to each other. A minimal sketch of such a hybrid representation follows.
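Purely as an illustration, the hybrid representation argued for here could be encoded along the following lines; the field names and values are ours, not those of an existing resource.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RoleScalarEntry:
    verb: str
    roles: Tuple[str, ...]            # e.g. ("Agent", "Patient")
    scale: Optional[str] = None       # attribute that changes, e.g. "aliveness", "temperature"
    direction: Optional[str] = None   # direction of the change along the scale

# A change-of-state verb carries a scalar change; a pure action verb need not.
kill = RoleScalarEntry("kill", ("Agent", "Patient"), scale="aliveness", direction="decrease")
hit = RoleScalarEntry("hit", ("Agent", "Patient", "Instrument"))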

5.2 Scale Representation

A scalar change in an entity involves a change in the value of one of its attributes in a particular direction.

Related Work. The idea of using scales for verb representation has been elaborated by many authors. Dixon [15,16] extracted seven classes of property concepts that are consistently lexicalized across languages: dimension, age, value, color, physical, speed, and human propensity. Rappaport Hovav [17,18] pointed out that change-of-state verbs and so-called result verbs lexicalize a change in a particular direction in the value of a scalar attribute, frequently from the domain of Dixon’s property concepts. A similar approach comes from the cognitive science framework [19–21], which considers verb representation to be based on a two-vector structure model: a force vector representing the cause of a change and a result vector representing a change in object properties. It is argued that this framework provides a unified account of a multiplicity of linguistic phenomena related to verbs. Jackendoff [22–24] stated that the representation of result verbs can be derived from physical space; accordingly, a change in a value can be represented in the same way as a movement in physical space. For example, a change of possession can be represented as a movement in the space of possession.


Fleischhauer [25] discussed in detail the idea of verbal degree gradation and elaborated the notion of scalar change. Change-of-state verbs are considered one of the prototypical examples of scalar verbs, for two reasons: first, some of these verbs are derived from gradable adjectives, and second, the verbs express a change along a scale. It was stated that a change-of-state verb lexicalizes a scale even if one or more of the scale parameters remain unspecified in the meaning of the verb.

Scale Representation for VerbNet and FrameNet. We outline how the verbs in the three largest frames in FN (and the corresponding classes in VN) can be additionally represented via scales; a more detailed analysis of the scale representation goes beyond the limits of this article. The benefit of such an approach is that large frames can be split by identifying within-frame semantic distinctions. The largest frame in FN is the frame “Self motion”. According to its definition, it “most prototypically involves individuals moving under their own power by means of their bodies” (https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Self_motion). The frame contains 134 verbs and corresponds to the run-class (51.3.2) with 96 verbs in VN. The necessity of a scale representation for the run-class was directly indicated by Pustejovsky [26]. To make the representation of change of state explicit, he introduced the concept of opposition structure in the Generative Lexicon (GL) as an enrichment of event structure [27]. He then applied GL-inspired componential analysis to the run-class and extracted six distinct semantic dimensions, which provide clear differentiations in meaning within this class. They are: SPEED: amble, bolt, sprint, streak, tear, chunter, flit, zoom; PATH SHAPE: cavort, hopscotch, meander, seesaw, slither, swerve, zigzag; PURPOSE: creep, pounce; BODILY MANNER: amble, ambulate, backpack, clump, clamber, shuffle; ATTITUDE: frolic, lumber, lurch, gallivant; ORIENTATION: slither, crawl, walk, backpack. The second largest frame in FN, with 132 verbs, is the frame “Stimulate emotion”. This frame is about some phenomenon (the Stimulus) that provokes a particular emotion in an Experiencer (https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Stimulate_emotion); in other words, the emotion is a direct reaction to the stimulus. It corresponds to the second largest class in VN, the amuse-class (31.1), with 251 verbs. Fellbaum and Mathieu [28] examined experiencer-subject verbs like surprise, fear, hate, love, etc., where gradation is richly lexicalized by verbs that denote different degrees of intensity of the same emotion (e.g., surprise, strike, dumbfound, flabbergast). The results of their analysis show, first, that the chosen verbs indeed possess scalar qualities; second, they confirm the prior assignment of the verbs into broad classes based on a common underlying emotion; finally, the


web data allow the construction of consistent scales with verbs ordered according to the intensity of the emotion. The third largest frame in FN is the frame “Make noise” (105 verbs), which corresponds to the sound emission-class (129 verbs) in VN. The frame is defined as “a physical entity, construed as a source, that emits a sound”. Snell-Hornby [29] suggested the following scales to characterize verbs of sound: VOLUME (whirr vs. rumble); PITCH (squeak vs. rumble); RESONANCE (rattle vs. thud); DURATION (gurgle vs. beep).

6 Conclusion

Based on a statistical analysis of VerbNet and FrameNet as verb resources, we provided empirical evidence of the insufficiency of roles as the sole approach to verb representation. This supports the hypothesis that roles as a tool for meaning representation do not cover the variety of all verbs. As a consequence, another paradigm, the scalar approach, is needed to fill the gap. The hybrid role-scalar approach looks promising for verb meaning representation and will be elaborated in future work.

References 1. Fillmore, Ch.: The case for case. In: Universals in Linguistic Theory, pp. 1–88 (1968) 2. Dowty, D.: Thematic proto-roles and argument selection. Language 67, 547–619 (1991) 3. Baker, C., Fillmore, Ch., Lowe, J.: The Berkeley FrameNet project. In: Proceedings of the International Conference on Computational Linguistics, Montreal, Quebec, Canada, pp. 86–90 (1998) 4. Kipper-Schuler, K.: VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. thesis. Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA (2005) 5. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–106 (2005) 6. Marcus, M., et al.: The penn treebank: annotating predicate argument structure. In: ARPA Human Language Technology Workshop (1994) 7. Fellbaum, Ch., Miller, G.: Folk psychology or semantic entailment? A reply to Rips and Conrad. Psychol. Rev. 97, 565–570 (1990) 8. Petukhova, V., Bunt, H.: LIRICS semantic role annotation: design and evaluation of a set of data categories. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco (2008) 9. Claire, B., Corvey, W., Palmer, M., Bunt, H.: A hierarchical unification of LIRICS and VerbNet semantic roles. In: Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC 2011), Palo Alto, CA, USA (2011) 10

https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Make noise.


10. Bunt, H., Palmer, M.: Conceptual and representational choices in defining an ISO standard for semantic role annotation. In: Proceedings of the Ninth Joint ACLISO Workshop on Interoperable Semantic Annotation (ISA-9), Potsdam, Germany (2013) 11. Levin, B.: English verb classes and alternations: a preliminary investigation. University of Chicago Press, Chicago, IL (1993) 12. Fillmore, Ch.: Frame semantics. In: Linguistics in the Morning Calm, pp. 111–137. Hanshin, Seoul (1982) 13. Fillmore, Ch., Atkins, T.: Toward a frame-based lexicon: the semantics of RISK and its neighbors. In: Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp. 75–102. Erlbaum, Hillsdale (1992) 14. Baker, C., Ruppenhofer, J.: FrameNet’s frames vs. Levin’s verb classes. In: Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society, Berkeley, California, USA, pp. 27–38 (2002) 15. Dixon, R.: Where Have All the Adjectives Gone? and Other Essays in Semantics and Syntax. Mouton, Berlin (1982) 16. Dixon, R.: A Semantic Approach to English Grammar. Oxford University Press, Oxford (2005) 17. Rappaport Hovav, M.: Lexicalized meaning and the internal temporal structure of events. In: Crosslinguistic and Theoretical Approaches to the Semantics of Aspect, pp. 13–42. John Benjamins, Amsterdam (2008) 18. Rappaport Hovav, M.: Scalar roots and their results. In: Workshop on Roots: Word Formation from the Perspective of “Core Lexical Elements”, Universitat Stuttgart, Stuttgart (2009) 19. Gardenfors, P.: The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press, Cambridge (2017) 20. Gardenfors, P., Warglien, M.: Using conceptual spaces to model actions and events. J. Semant. 29(4), 487–519 (2012) 21. Warglien, M., Gardenfors, P., Westera, M.: Event structure, conceptual spaces and the semantics of verbs. Theor. Linguist. 38(3–4), 159–193 (2012) 22. Jackendoff, R.: Semantics and Cognition. MIT Press, Cambridge (1983) 23. Jackendoff, R.: Semantic Structures. MIT Press, Cambridge (1990) 24. Jackendoff, R.: Foundations of Language. Oxford University Press, Oxford (2002) 25. Fleischhauer, J.: Degree Gradation of Verbs. Dusseldorf University Press, Dusseldorf (2016) 26. Pustejovsky, J., Palmer, M., Zaenen, A., Brown, S.: Verb meaning in context: integrating VerbNet and GL predicative structures. In: Proceedings of the LREC 2016 Workshop: ISA-12, Potoroz, Slovenia (2016) 27. Pustejovsky, J.: Events and the semantics of opposition. In: Events as Grammatical Objects, pp. 445–482. Center for the Study of Language and Information (CSLI), Stanford (2010) 28. Fellbaum, Ch., Mathieu, Y.: A corpus-based construction of emotion verb scales. In: Linguistic Approaches to Emotions in Context. John Benjamins, Amsterdam (2014) 29. Snell-Hornby, M.: Verb-descriptivity in German and English. A Contrastive Study in Semantic Fields. Winter, Heidelberg (1983)

Sentiment Analysis

Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis

Haiyun Peng, Soujanya Poria, Yang Li, and Erik Cambria
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
{penghy,sporia,liyang,cambria}@ntu.edu.sg

Abstract. The Chinese pronunciation system offers two characteristics that distinguish it from other languages: deep phonemic orthography and intonation variations. We are the first to argue that these two important properties can play a major role in Chinese sentiment analysis. Hence, we learn phonetic features of Chinese characters and fuse them with their textual and visual features in order to mimic the way humans read and understand Chinese text. Experimental results on five different Chinese sentiment analysis datasets show that the inclusion of phonetic features significantly and consistently improves the performance of textual and visual representations.

Keywords: Phonetic features · Character representation · Sentiment analysis

1 Introduction

In recent years, sentiment analysis has become increasingly popular for processing social media data on online communities, blogs, wikis, microblogging platforms, and other online collaborative media [3]. Sentiment analysis is a branch of affective computing research that aims to mine opinions from text (but sometimes also from videos [4]) based on user intent [9] in different domains [2]. Most of the literature is on the English language, but recently an increasing number of works have been tackling the multilinguality issue, especially for booming online languages such as Chinese [17]. Chinese is one of the most popular languages on the Web and it has two relevant advantages over other languages. Nevertheless, research in Chinese sentiment analysis is still very limited due to the lack of experimental resources.

The Chinese language has two fundamental characteristics which make language processing challenging yet interesting. Firstly, it is a pictogram language [8], which means that its symbols intrinsically carry meanings. Through geometric composition, various symbols integrate together to form a new symbol. This is different from Romance or Germanic languages, whose characters do not encode internal meanings. To utilize this characteristic, two branches of research exist in the literature. One studies the sub-word components (such as Chinese characters and Chinese radicals) via a textual approach [6, 16, 20, 24, 25]. The other branch explores the compositionality using the visual presence of the characters [14, 22] by means of extracting


visual features from bitmaps of Chinese characters to further improve Chinese textual word embeddings.

The second characteristic of the Chinese language is its deep phonemic orthography. In other words, it is difficult to infer the pronunciation of characters or words from their written form. The modern pronunciation system of the Chinese language is called pinyin, which is a romanization of Chinese text. It is so different from the written system that the two can be seen as unrelated languages if their mapping relationship is unknown. However, to the best of our knowledge, no work has utilized this characteristic for Chinese NLP, nor for sentiment analysis. Furthermore, we found one feature of the pinyin system which is beneficial to sentiment analysis: for each Chinese character, the pinyin system has one fundamental pronunciation and four variations due to four different tones (tone and intonation are interchangeable within this paper). These tones have an immediate impact on semantics and sentiment, as shown in Table 1. Although phonograms (or phonosemantic compounds, xingsheng zi) are quite common in the Chinese language, fewer than 5% of them have exactly the same pronunciation and intonation.

Table 1. Examples of intonations that alter meaning and sentiment.
Text  Pinyin    Meaning     Sentiment polarity
空闲  kòngxián  Free        Neutral
空气  kōngqì    Air         Neutral
好吃  hǎochi    Delicious   Positive
好吃  hàochi    Gluttonous  Negative

We argue that these two factors of Chinese language can play a vital role in Chinese natural language processing especially sentiment analysis. In this work, we take advantage of the two above-mentioned characteristics of the Chinese language by means of a multimodal approach for Chinese sentiment analysis. For the first factor, we extract visual features from Chinese character pictograms and textual features from Chinese characters. To consider the deep phonemic orthography and intonation variety of the Chinese language, we propose to use Chinese phonetic information and learn various phonetic features. For the final sentiment classification, we fuse visual, textual and phonetic information in both early fusion and late fusion mechanism. Unlike the previous approaches which either consider the visual information or textual information for Chinese sentiment analysis, our method is multimodal where we fuse visual, textual and phonetic information. Most importantly, although the importance of character level visual features has been explored before in the literature, to the best of our knowledge none of the previous works in Chinese sentiment analysis have considered the deep phonetic orthographic characteristic of Chinese language. In this work, we leverage this by learning phonetic features and use that for sentiment analysis. The experimental results show that the use of deep phonetic orthographic information is useful for Chinese sentiment analysis. Due to the above, the proposed multimodal framework outperforms the state 1

Neutral tone, in addition to the four variations, is neglected for the moment, due to its lack of connection with sentiment.


of the art Chinese sentiment analysis method by a statistically significant margin. In summary, the two main contributions of this paper are as follows:
– We use Chinese phonetic information to improve sentiment analysis.
– We experimentally show that a multimodal approach that leverages phonetic, visual, and textual features of Chinese characters at the same time is useful for Chinese sentiment analysis.
The remainder of this paper is organized as follows: we first present a brief review of general embeddings and Chinese embeddings; we then introduce our model and provide technical details; next, we describe the experimental results and present some discussion; finally, we conclude the paper and suggest future work.

2 Related Work

2.1 General Embedding

One-hot representation is the earliest numeric word representation method in NLP. However, it usually leads to problems of high dimensionality and sparsity. To solve this problem, distributed representation (or word embedding) [1] was proposed. Word embedding is a representation which maps words into low-dimensional vectors of real numbers by using neural networks. The key idea is based on the distributional hypothesis, which models how to represent context words and the relation between context words and a target word. In 2013, Mikolov et al. [15] introduced both the continuous bag-of-words (CBOW) model and the skip-gram model. The former places context words in the input layer and the target word in the output layer, whereas the latter swaps the input and output of CBOW. In 2014, Pennington et al. [18] created the ‘GloVe’ embeddings. Unlike the previous models, which learned the embeddings by minimizing a prediction loss, GloVe learns the embeddings with dimension reduction techniques on a co-occurrence count matrix.

2.2 Chinese Embedding

Chinese text differs from English text in two key aspects: it does not have word segmentation, and it has a characteristic of compositionality due to its pictogram nature. Because of the former aspect, word segmentation tools such as ICTCLAS [26], THULAC [23], and Jieba are usually employed before text representation. Because of the latter aspect, several works have focused on the use of sub-word components (such as characters and radicals) to improve word embeddings. [6] proposed the decomposition of Chinese words into characters and presented a character-enhanced word embedding (CWE) model. [24] decomposed Chinese characters into radicals and developed a radical-enhanced Chinese character embedding.

https://github.com/fxsjy/jieba.


In [20], pure radical based embeddings were trained for short-text categorization, Chinese word segmentation and web search ranking. [25] extend the pure radical embedding by introducing multi-granularity Chinese word embeddings. Multilingual sentiment analysis in the past few years has become a growing area of research, especially in Chinese language due to the economic growth of China in the last decade. [14, 22] explored integrating visual features to textual word embeddings. The extracted visual features proved to be effective in modeling the compostionality of Chinese characters. To the best of our knowledge, no previous work has integrated pronunciation information to Chinese embeddings. Due to its deep phonemic orthography, we believe the Chinese pronunciation information could elevate the embeddings to a higher level. Thus, we propose to learn phonetic features and present two fusions of multimodal features.

3 Model

In this section, we present how the features of the three modalities are extracted. Next, we introduce the methods used to fuse those features.

3.1 Textual Embedding

As in most recent literature, textual word embedding vectors are treated as the fundamental representation of text [1, 15, 18]. Pennington et al. [18] developed ‘GloVe’ in 2014, which employs a count-based mechanism to embed word vectors. Following this convention, we used 128-dimensional ‘GloVe’ character embeddings [18] to represent text. It is worth noting that we set the fundamental token of Chinese text to the character instead of the word for two reasons. Firstly, this aligns the text with the audio features, since Chinese pronunciation is defined on each character. Secondly, character-level processing avoids the errors induced by Chinese word segmentation. Although we used character GloVe embeddings for our final model, experimental comparisons were also conducted with both CBOW [15] and Skip-gram embeddings; a minimal loading sketch is given below.
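A minimal sketch of loading such character vectors and embedding a sentence character by character; the file format (one character followed by 128 floats per line) is an assumption.

import numpy as np

def load_char_vectors(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

def embed_text(text, table, dim=128):
    # Character-level tokens: no word segmentation is needed.
    unk = np.zeros(dim, dtype=np.float32)
    return np.stack([table.get(ch, unk) for ch in text])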

3.2 Training Visual Features

Unlike Romance or Germanic languages, the Chinese written language originated from pictograms. Later, simple symbols were combined to form complex symbols in order to express abstract meanings. For example, a geometric combination of three ‘木 (wood)’ creates a new character ‘森 (forest)’. This phenomenon gives rise to the compositional characteristic of Chinese text. Instead of directly modeling text compositionality using sub-word [6, 25] or sub-character [16, 24] elements, we opt for a visual model. In particular, we constructed a convolutional auto-encoder (convAE) to extract visual features. Details of the convAE are listed in Table 2. The input to the model is a 60-by-60 bitmap of each Chinese character, and the output of the model is a dense vector with a dimension of 512.


Table 2. Configuration of the convAE for visual feature extraction.

Layer#    Layer configuration
1         Convolution 1: kernel 5, stride 1
2         Convolution 2: kernel 4, stride 2
3         Convolution 3: kernel 5, stride 2
4         Convolution 4: kernel 4, stride 2
5         Convolution 5: kernel 5, stride 1
Feature   Extracted visual feature: (1,1,512)
6         Dense ReLU: (1,1,1024)
7         Dense ReLU: (1,1,2500)
8         Dense ReLU: (1,1,3600)
9         Reshape: (60,60,1)
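The following is a minimal Keras sketch of a convolutional auto-encoder consistent with the configuration in Table 2. The paper does not report the number of filters in the intermediate convolutional layers, so those values (and the compile settings) are illustrative assumptions; only the last layer must have 512 filters to yield the (1,1,512) feature.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_convae():
    """ConvAE roughly matching Table 2; intermediate filter counts are guesses."""
    inp = layers.Input(shape=(60, 60, 1))                          # character bitmap
    x = layers.Conv2D(32, 5, strides=1, activation='relu')(inp)    # Convolution 1
    x = layers.Conv2D(64, 4, strides=2, activation='relu')(x)      # Convolution 2
    x = layers.Conv2D(128, 5, strides=2, activation='relu')(x)     # Convolution 3
    x = layers.Conv2D(256, 4, strides=2, activation='relu')(x)     # Convolution 4
    x = layers.Conv2D(512, 5, strides=1, activation='relu')(x)     # Convolution 5 -> (1,1,512)
    feat = layers.Flatten(name='visual_feature')(x)                # 512-d visual feature
    x = layers.Dense(1024, activation='relu')(feat)
    x = layers.Dense(2500, activation='relu')(x)
    x = layers.Dense(3600, activation='relu')(x)
    out = layers.Reshape((60, 60, 1))(x)                           # reconstructed bitmap
    model = models.Model(inp, out)
    # Combined L1 + L2 reconstruction loss (see Eq. (1) below); Keras averages over the batch.
    model.compile(optimizer='adagrad',
                  loss=lambda x_t, x_r: tf.abs(x_t - x_r) + tf.square(x_t - x_r))
    return model
```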

The model was trained using the Adagrad optimizer on the reconstruction error between the original bitmap and the reconstructed bitmap. The loss is given as:

loss = \sum_{j=1}^{L} ( |x_t - x_r| + (x_t - x_r)^2 )    (1)

where L is the number of samples, x_t is the original input bitmap and x_r is the reconstructed output bitmap. After training, we obtained a lookup table in which each Chinese character corresponds to a 512-dimensional visual feature vector.

3.3 Learning Phonetic Features

Written Chinese and spoken Chinese differ in several fundamental ways. To the best of our knowledge, all previous work on Chinese NLP has ignored the significance of the audio channel. As cognitive science suggests, human communication depends not only on visual recognition but also on audio activation. This motivated us to explore the mutual influence between the audio channel (pronunciation) and textual representation. Popular Romance and Germanic languages such as Spanish, Portuguese and English share two remarkable characteristics. Firstly, they have shallow phonemic orthography3: the pronunciation of a word largely follows from its written composition, so one can almost infer the pronunciation of a word from its spelling. From this perspective, textual information is largely interchangeable with phonetic information. For instance, if the pronunciations of the English words 'subject' and 'marineland' are known, it is not hard to infer the pronunciation of 'submarine' by combining 'sub' from 'subject' and 'marine' from 'marineland'. This implies that the phonetic information of these languages may carry little information beyond what the textual information already provides.

3 https://en.wikipedia.org/wiki/Phonemic_orthography.


Secondly, intonation information in these languages is limited and implicit. Generally speaking, emphasis, rising intonation and falling intonation are the major variations. Although they strongly influence sentiment polarity in spoken communication, there is no way to infer such information from the text alone.

Chinese, in comparison, has the opposite characteristics. Firstly, it has deep phonemic orthography: one can hardly infer the pronunciation of a Chinese word or character from its written form. For example, the characters '日' and '月' are pronounced 'rì' and 'yuè', respectively, yet their combination forms another character, '明', pronounced 'míng'. This characteristic motivates us to investigate how Chinese pronunciation can affect natural language understanding. Secondly, intonation information in Chinese is rich and explicit. In addition to emphasis, each Chinese character carries one tone (out of four different tones), marked explicitly by diacritics. These intonations (tones) strongly affect the semantics and sentiment of Chinese characters and words; examples were shown in Table 1. We therefore consider it worthwhile to explore how Chinese pronunciation can influence natural language understanding, especially sentiment analysis. In particular, we designed three ways to learn phonetic information: extraction from the audio signal, pinyin with intonation, and pinyin without intonation. An illustration is shown in Table 3.

Extracted Feature from Audio Clips (Ex). The symbol system of modern spoken Chinese is named 'Hanyu Pinyin', abbreviated to 'pinyin'. It is the official romanization system for Mandarin in mainland China and includes four diacritics denoting tones. Each Chinese character has one corresponding pinyin, and each pinyin has four tonal variations. To extract phonetic features, for each tone of each pinyin we collected an audio clip of a female speaker pronouncing that pinyin (with tone) from a language learning resource4. Each clip lasts around one second and contains a standard pronunciation of one pinyin with tone. The quality of these clips was validated by two native speakers. Next, we used openSMILE [7] to extract phonetic features from each of the obtained pinyin-tone audio clips. Audio features are extracted at a 30 Hz frame rate with a sliding window of 20 ms. They consist of a total of 39 low-level descriptors (LLDs) and their statistics, e.g., MFCCs and root quadratic mean.

Table 3. Illustration of the 3 types of phonetic features (a(x) stands for the audio clip for pinyin 'x').

Text    假设学校明天放假 (Suppose the school is on vacation tomorrow.)
Pinyin  Jiǎ     Shè     Xué     Xiào     Míng     Tiān     Fàng     Jià
Ex      a(Jiǎ)  a(Shè)  a(Xué)  a(Xiào)  a(Míng)  a(Tiān)  a(Fàng)  a(Jià)
PW      Jia3    She4    Xue2    Xiao4    Ming2    Tian1    Fang4    Jia4
PO      Jia     She     Xue     Xiao     Ming     Tian     Fang     Jia

4 https://chinese.yabla.com/.


After obtaining features for each pinyin-tone clip, we have an m × 39 dimensional matrix per clip, where m depends on the length of the clip and 39 is the number of features. To obtain a fixed-size representation for each clip, we applied SVD to each matrix and reduced it to a 39-dimensional vector of singular values. In the end, the high-dimensional feature matrix of each pinyin clip is transformed into a dense 39-dimensional feature vector, and a lookup table between pinyin and audio feature vector is constructed accordingly.

Pinyin with Intonation (PW). Instead of collecting audio clips for each pinyin, we directly represent Chinese characters with pinyin tokens, as shown in Table 3, where the number denotes one of the four tones. Specifically, we take the textual corpus used to train the Chinese character embeddings and employ a parser5 to convert each character in the corpus to its corresponding pinyin. For example, the sentence '今天心情好' is converted to the pinyin (PW) sentence 'jin1 tian1 xin1 qing2 hao3'. Although 3.49% of the 3500 most common Chinese characters are heteronyms, the parser claims to select the most probable pinyin of a heteronym based on context; moreover, we do not aim to address heteronyms in this paper. For the remaining roughly 96% of common characters, the parser is accurate because there is no ambiguity. The sequence of Chinese characters in the textual corpus is thus converted to a sequence of pinyin tokens, so the context of the textual corpus is maintained in the pinyin corpus. We then treat each pinyin token in the converted corpus as our new token and train 128-dimensional GloVe [18] embedding vectors for it. A lookup table between pinyin with intonation (PW) and embedding vector is constructed accordingly.

Pinyin Without Intonation (PO). PO differs from PW in that intonations are removed: Chinese characters are represented by pinyin without tones. In the previous example, the textual sentence is converted to the pinyin (PO) sentence 'jin tian xin qing hao'. Accordingly, 128-dimensional GloVe pinyin embeddings are trained. Pinyin tokens that share the same pronunciation but differ only in tone, such as Jiǎ and Jià in Table 3, share the same GloVe embedding vector.

3.4 Sentence Modeling

Various deep learning models have been proposed to learn sentence representations, such as convolutional neural networks [12] and recurrent neural networks [10]. Among these, bidirectional recurrent neural networks [27] have proved effective, so we use them in our experiments to model sentence representations. In particular, we use a bidirectional long short-term memory (LSTM) network. A sentence s of n Chinese characters is written as s = {x_1, x_2, ..., x_{n-1}, x_n}, where x_i is the embedding representation of the i-th character. The sequence is fed to a forward LSTM, whose hidden output is denoted h_forw; the bidirectional LSTM also applies an LSTM in the backward direction to obtain h_back. The representation of the sentence, S, is the concatenation of the two.

5 https://github.com/mozillazg/python-pinyin.
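As a concrete illustration of the Ex pipeline described above (frame-level openSMILE features reduced to a fixed 39-dimensional vector via SVD), here is a minimal NumPy sketch; the openSMILE extraction itself is not shown, and the function name and padding behaviour are our assumptions.

```python
import numpy as np

def clip_to_phonetic_vector(frame_features: np.ndarray) -> np.ndarray:
    """Reduce an (m, 39) openSMILE feature matrix for one pinyin-tone clip
    to a fixed 39-dimensional vector of singular values."""
    singular_values = np.linalg.svd(frame_features, compute_uv=False)
    vec = np.zeros(39)
    vec[:singular_values.shape[0]] = singular_values   # pad if the clip is very short
    return vec

# Usage (assumed dict of openSMILE matrices keyed by pinyin-tone):
# ex_lookup = {key: clip_to_phonetic_vector(mat) for key, mat in opensmile_matrices.items()}
```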


3.5 Fusion of Modalities

In the context of the Chinese language, textual embeddings have been applied to various tasks and have proved effective at encoding semantics and sentiment [6, 20, 24, 25]. Recently, visual features have pushed the performance of textual embeddings further via multimodal fusion [14, 22], thanks to their effective modeling of the compositionality of Chinese characters. In this work, we hypothesize that using phonetic features along with textual and visual ones can improve performance further. We therefore introduce the following fusion methods.

1. Early Fusion: each Chinese character is represented by a concatenation of three segments, one per modality:

char = [emb_T ⊕ emb_P ⊕ emb_V]    (2)

where char is the character representation and emb_T, emb_P, emb_V are the embeddings from the textual, phonetic and visual modalities, respectively.

2. Late Fusion: fusion takes place at the sentence classification level. Sentence representations from each modality are fused just before the softmax layer:

sentence = [S_T ⊕ S_P ⊕ S_V]    (3)

where S_T, S_P, S_V are the bidirectional LSTM outputs from the textual, phonetic and visual modalities, respectively. The concatenated representation sentence is fed to a softmax classifier whose output class is the sentiment polarity.

More complex fusion methods exist in the literature [19], but we did not use them for two reasons: (1) fusion through concatenation is a proven, effective method [11, 14, 21], and (2) its simplicity keeps the emphasis (and the contribution) of the system on the features themselves.
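A minimal Keras sketch of the early-fusion classifier follows; the BiLSTM width, the binary output and the ExPW dimension (39-d Ex concatenated with 128-d PW) are our assumptions, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def build_early_fusion_model(text_dim=128, phon_dim=167, vis_dim=512, n_classes=2):
    """Illustrative early-fusion classifier: per-character textual, phonetic and
    visual vectors are concatenated (Eq. 2), encoded by a BiLSTM, and classified."""
    text_in = layers.Input(shape=(None, text_dim))
    phon_in = layers.Input(shape=(None, phon_dim))
    vis_in = layers.Input(shape=(None, vis_dim))
    chars = layers.Concatenate(axis=-1)([text_in, phon_in, vis_in])   # Eq. (2), per character
    sent = layers.Bidirectional(layers.LSTM(128))(chars)              # sentence representation
    out = layers.Dense(n_classes, activation='softmax')(sent)         # sentiment polarity
    return models.Model([text_in, phon_in, vis_in], out)

# Late fusion (Eq. 3) would instead run one BiLSTM per modality and concatenate
# the three sentence vectors just before the softmax layer.
```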

4 Experiments and Results

In this section, we first introduce the experimental setup. Experiments were then conducted in five steps. Firstly, we compare unimodal features. Secondly, we experiment with various fusion methods. We then analyze and validate the role of phonetic features. Next, we visualize the role of the different features used in the experiments. Finally, we locate the cause of the improvement.

4.1 Experimental Setup

Datasets. We evaluate our method on five datasets: Weibo, It168, Chn2000, Review-4 and Review-5. The first three consist of reviews extracted from micro-blog and review websites. The last two contain reviews from [5], where Review-4 covers the computer and camera domains and Review-5 covers the car


and cellphone domains. The experimental datasets6 are summarized in Table 4. For the phonetic experiments, we apply an online converter7 to the datasets to convert the text to pinyin with intonation (since this step also performs disambiguation, we collected three online resources and selected the most reliable one). For the visual features, we use the lookup table to convert characters to visual feature vectors.

Table 4. Number of reviews in the experimental datasets.

          Weibo  It168  Chn2000  Review-4  Review-5
Positive   1900    560      600      1975      2599
Negative   1900    458      739       879      1129
Sum        3800   1018     1339      2854      3728

Setup and Baselines. We used TensorFlow and Keras to implement our models. All models used the Adam optimizer with a learning rate of 0.001, an L2-norm regularizer of 0.01, a dropout rate of 0.5 and mini-batches of 50 samples. We report the average test results of each model over 50 epochs in 5-fold cross-validation. These parameters were set using a grid search on the validation data.

Related work on Chinese textual embeddings, such as CWE [6] and MGE [25], aims at improving Chinese word embeddings, and the work that utilizes visual features [14, 22] also operates at the word level. These methods are therefore not fair baselines for our proposed model, which studies Chinese character embeddings. There are two main reasons for working at the character level. Firstly, the pinyin pronunciation system is defined at the character level; there are no corresponding pinyin pronunciations for Chinese words. Secondly, the character level bypasses Chinese word segmentation, which may introduce errors. Conversely, using character-level pronunciation to model word-level pronunciation raises sequence modeling issues. For instance, the Chinese word '你好' comprises the two characters '你' and '好'. For textual embeddings, the word can be treated as a single unit by training a word embedding vector. For phonetic embeddings, however, the word cannot be treated as a single unit, because its correct pronunciation is a temporal sequence: first '你', then '好'. Working at the word level would require some representation of the word's pronunciation, such as an average of the character phonetic features. To make a fair comparison, we used three embedding methods to train Chinese textual character embeddings, namely skip-gram, CBOW [15] and GloVe [18]. We also implemented the radical-enhanced character embeddings charCBOW and charSkipGram from [13]. To model Chinese sentences, we also tested three deep learning models, namely a convolutional neural network, an LSTM and a bidirectional LSTM (see footnote 6).

6 Both the datasets and codes in this paper are available for public download upon acceptance.
7 https://github.com/mozillazg/python-pinyin.
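A hedged sketch of the training protocol described above (Adam with learning rate 0.001, batch size 50, 50 epochs, 5-fold cross-validation); the model-building function, the loss and the placement of L2 regularization and dropout are assumptions rather than the exact published setup.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras import optimizers

def cross_validate(build_model, x, y, n_splits=5, epochs=50, batch_size=50):
    """5-fold cross-validation with the reported hyper-parameters.
    `build_model` is a placeholder returning a fresh Keras model; the L2
    penalty (0.01) and dropout (0.5) are assumed to be applied inside it."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(x):
        model = build_model()
        model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        model.fit(x[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        accs.append(model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])
    return float(np.mean(accs))
```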


Table 5. Classification accuracy of unimodality in a bidirectional LSTM.

                      Weibo  It168  Chn2000  Review-4  Review-5
GloVe                 75.97  83.29    84.76     88.33     87.53
CBOW                  73.94  83.57    82.90     86.83     85.70
Skip-gram             75.28  81.81    82.30     87.38     86.91
Visual                63.20  69.72    70.43     79.99     79.83
Phonetic feature Ex   67.20  76.19    80.36     80.98     81.38
Phonetic feature PO   68.12  81.13    80.28     83.36     83.15
Phonetic feature PW   73.31  83.19    84.24     86.58     86.56

4.2 Experiments on Unimodality

For the textual embeddings, we tested GloVe, skip-gram and CBOW. For the phonetic representation, the three types of features were tested. For the visual representation, each character was represented by the 512-dimensional feature extracted from its bitmap. As shown in Table 5, the textual embeddings (GloVe) achieved the best performance among the three modalities on four datasets. This is because they successfully encode the semantics and dependencies between characters. We also noticed that the visual feature achieved the worst performance among the three modalities, which was within our expectation: as demonstrated in [22], pure visual features are not representative enough to match textual embeddings. Finally, the phonetic features performed better than the visual feature. Although visual features capture the compositional information of Chinese characters, they fail to distinguish between meanings of characters that share the same written form but differ in tone. These tones can largely alter the sentiment of Chinese words and, in turn, the sentiment of the sentence, as supported by the observation that PW consistently outperformed PO. In order to exploit the complementary information available across the modalities, we propose the two fusion techniques evaluated below.

4.3 Experiments on Fusion of Modalities

In this set of experiments, we evaluated both early and late fusion with every possible combination of modalities. After extensive experimental trials, we found that the concatenation of the Ex and PW embeddings (denoted ExPW) performed best among all phonetic feature combinations, so we used it as the phonetic feature in the fusion experiments. The results in Table 6 show that the best performance was achieved by fusing either textual, phonetic and visual features or textual and phonetic features. We found that the charCBOW and charSkipGram methods perform very close to the original CBOW and skip-gram methods; they are slightly, but not consistently, better than their baselines. We conjecture this is caused by the relatively small size of our training corpus compared to the original Chinese Wikipedia dump training corpus. With a larger corpus, all embedding methods would be expected to improve. The corpus we used nevertheless presents a fair platform on which to compare all methods.


Table 6. Classification accuracy of multimodality. (T and V represent textual and visual, respectively; + denotes the fusion operation; ExPW is the concatenation of Ex and PW.)

                             Weibo  It168  Chn2000  Review-4  Review-5
Unimodal  T                  75.97  83.29    84.76     88.33     87.53
          ExPW               73.89  82.69    84.31     86.41     86.51
          V                  63.20  69.72    70.43     79.99     79.83
          charCBOW [13]      73.31  83.19    83.94     87.07     85.68
          charSkipGram [13]  72.00  83.08    82.52     85.39     84.52
Early     T+ExPW             76.52  86.53    87.60     88.96     88.73
          T+V                76.07  84.17    85.66     88.26     88.30
          ExPW+V             73.39  84.27    84.99     87.14     88.30
          T+ExPW+V           75.73  86.43    87.98     88.72     89.48
Late      T+ExPW             75.50  85.29    84.63     89.67     86.03
          T+V                76.05  82.60    83.87     87.53     87.47
          ExPW+V             72.18  82.84    85.22     89.77     85.90
          T+ExPW+V           75.37  85.88    85.45     90.01     86.19

We also noticed that phonetic features, when fused with textual or visual features, improved the performance over both the textual and visual unimodal classifiers. This validates our hypothesis that phonetic features are an important factor in improving Chinese sentiment analysis. Integrating multiple modalities takes advantage of each modality and pushes the overall performance higher. A p-value of 0.008 in a paired t-test between the settings with and without phonetic features indicates that the improvement from integrating phonetic features is statistically significant. We also note that early fusion generally outperformed late fusion by a notable gap. We conjecture that late fusion cuts off the complementary information offered by each modality at the character level, and that fusion at the sentence level may not preserve the multimodal information of individual characters, whereas early fusion merges the modalities at the character level and thus encapsulates that information. Initially, we expected the fusion of all three modalities to perform best. However, the results on the Weibo and It168 datasets contradicted this expectation, which we attribute to the poor performance of the visual modality on these two datasets, as shown in the unimodal rows of Table 6.

Table 7. Performance of learned and randomly generated phonetic features.

                              Weibo  It168  Chn2000  Review-4  Review-5
Learned phonetic feature  Ex  67.20  76.19    80.36     80.98     81.38
                          PO  68.12  81.13    80.28     83.36     83.15
                          PW  73.31  83.19    84.24     86.58     86.56
Random phonetic feature       57.63  58.71    69.31     69.82     53.30


GloVe was chosen as the textual embedding in our model owing to its performance in Table 5. Although we do not show the corresponding results with CBOW or skip-gram, the general trend remains the same: either the fusion of all modalities or the fusion of the phonetic and textual embeddings achieved the best performance in most cases. This indicates the positive contribution of the phonetic feature when used together with the other modalities.

4.4 Validating Phonetic Feature

In the previous section, we showed that phonetic features help improve the overall sentiment classification performance when fused with other modalities. However, the improvement could in principle also come from the training of the classifier itself; that is, it might occur even if the phonetic features were completely random and encoded no phonetic information. To rule out this concern, we designed a set of controlled experiments to validate the contribution of the phonetic features. In particular, we generated random real-valued vectors as random phonetic features for each character, with each dimension a float between -1 and 1 sampled from a Gaussian distribution. We then used these random vectors to represent each Chinese character, which yielded the results in Table 7. Comparing the learned phonetic features with the random phonetic features, we observe that the learned features outperform the random ones by at least 10% on all datasets. This indicates that the performance improvement is due to the learned phonetic features themselves and not merely to the training of the classifiers; random features do not provide similar performance.
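A minimal sketch of this control condition; the paper specifies Gaussian-sampled values in [-1, 1] but not how the bound is enforced or the vector dimension, so the clipping and the 39-dimensional size are our assumptions.

```python
import numpy as np

def random_phonetic_lookup(characters, dim=39, seed=0):
    """Assign each character a random 'phonetic' vector with Gaussian-sampled
    entries restricted to [-1, 1] (clipping is our assumption)."""
    rng = np.random.default_rng(seed)
    return {ch: np.clip(rng.normal(0.0, 0.5, size=dim), -1.0, 1.0) for ch in characters}
```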

4.5 Visualization of the Representation

We visualize the extracted phonetic features (Ex) to see what information has been captured. As shown on the left of Fig. 1, pinyins that share similar vowels are clustered together. We also find that different pinyins with the same intonation stay close to each other. This suggests that our phonetic features have captured genuine phonetic properties of the pinyins. We also visualize, on the right of Fig. 1, the fused embedding obtained by concatenating the textual, phonetic and visual features. We note that the fused embedding clusters characters that share not only similar pronunciations but also the same components (radicals).

Fig. 1. Selected t-SNE visualization of phonetic embeddings and fused embeddings (Left is the Ex feature, where number denotes intonation. Right is fused embedding of T+Ex+V. Green/red circles cluster phonetic/compositional (or semantic) similarity.) (Color figure online)
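A hedged sketch of how such a visualization can be produced with scikit-learn's t-SNE, assuming a lookup table from characters (or pinyin-tone keys) to their embedding vectors; the plotting details are illustrative and not the authors' exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(lookup, perplexity=30):
    """Project embedding vectors to 2-D with t-SNE and label each point."""
    keys = list(lookup)
    vectors = np.stack([lookup[k] for k in keys])
    points = TSNE(n_components=2, perplexity=perplexity, init='pca').fit_transform(vectors)
    plt.scatter(points[:, 0], points[:, 1], s=8)
    for (x, y), key in zip(points, keys):
        plt.annotate(key, (x, y), fontsize=7)
    plt.show()
```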


It can be concluded that the fused embeddings combine the phonetic information from the phonetic features, the compositional information from the visual features and the semantic information from the textual embeddings. Since we highlighted two characteristics of the pinyin system at the outset, it is reasonable to ask whether the deep phonemic orthography or the variety of intonations contributes to the improvement. This leads to another group of controlled experiments in the following section.

4.6 Who Contributes to the Improvement?

As shown on the left in Fig. 2, the red lines consistently outperform the blue lines in all three fusions. The red lines differ from the blue lines only in having the extra Ex features, which were designed to encode the distinctive property of Chinese pronunciation, namely its deep phonemic orthography. This confirms that the deep phonemic orthography of Chinese helps in the sentiment analysis task. Similarly, on the right in Fig. 2, the red lines also outperform the green lines in all fusion cases. The difference between the green and red lines is the absence of intonations: ExPO removes intonations, compared to ExPW. This difference causes the performance gap between the green and red lines, which further demonstrates the importance of intonations.

[Figure 2: grouped accuracy comparisons across the five datasets (Weibo, It168, Chn2000, Review-4, Review-5). Left panel: T+PW, V+PW, T+PW+V versus T+ExPW, ExPW+V, T+ExPW+V. Right panel: T+ExPO, V+ExPO, T+ExPO+V versus T+ExPW, V+ExPW, T+ExPW+V.]

Fig. 2. Performance comparison between various phonetic features in early fusion. (Color figure online)

5 Conclusion

The modern Chinese pronunciation system (pinyin) provides a new perspective, in addition to the written system, on representing the Chinese language. Owing to its deep phonemic orthography and tonal variation, it can bring new information to statistical representations of Chinese, especially for the task of sentiment analysis. To the best of our knowledge, we are the first to present an approach that learns phonetic information from pinyin (both from pinyin tokens and from the audio signal). We then integrate the extracted information with textual and visual features to create new Chinese representations. Experiments on five datasets demonstrate the positive contribution of phonetic information to Chinese sentiment analysis, as well as the effectiveness of the fusion mechanism. Even though our method is straightforward, it suggests


greater potential in taking advantage of the phonetic information of languages with deep phonemic orthography, such as Arabic and Hebrew. In the future, we plan to extend this work in the following directions. Firstly, we will explore the mutual influence among modalities: direct concatenation, as used in this work, ignores the dependencies between them. Secondly, we will explore how phonetic features encode semantics.

References

1. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3(Feb), 1137–1155 (2003)
2. Cambria, E., Song, Y., Wang, H., Howard, N.: Semantic multi-dimensional scaling for open-domain sentiment analysis. IEEE Intell. Syst. 29(2), 44–51 (2014)
3. Cambria, E., Wang, H., White, B.: Guest editorial: big social data analysis. Knowl.-Based Syst. 69, 1–2 (2014)
4. Chaturvedi, I., Satapathy, R., Cavallari, S., Cambria, E.: Fuzzy commonsense reasoning for multimodal sentiment analysis. Pattern Recogn. Lett. 125, 264–270 (2019)
5. Che, W., Zhao, Y., Guo, H., Su, Z., Liu, T.: Sentence compression for aspect-based sentiment analysis. IEEE Trans. Audio Speech Lang. Process. 23(12), 2111–2124 (2015)
6. Chen, X., Xu, L., Liu, Z., Sun, M., Luan, H.: Joint learning of character and word embeddings. In: IJCAI, pp. 1236–1242 (2015)
7. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM (2010)
8. Hansen, C.: Chinese ideographs and western ideas. J. Asian Stud. 52(2), 373–399 (1993)
9. Howard, N., Cambria, E.: Intention awareness: improving upon situation awareness in human-centric environments. Hum.-Centric Comput. Inf. Sci. 3(9), 1–17 (2013)
10. Irsoy, O., Cardie, C.: Opinion mining with deep recurrent neural networks. In: EMNLP, pp. 720–728 (2014)
11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
12. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
13. Li, Y., Li, W., Sun, F., Li, S.: Component-enhanced Chinese character embeddings. arXiv preprint arXiv:1508.06669 (2015)
14. Liu, F., Lu, H., Lo, C., Neubig, G.: Learning character-level compositionality with visual features. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 2059–2068 (2017)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Peng, H., Cambria, E., Zou, X.: Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level. In: FLAIRS, pp. 347–352 (2017)
17. Peng, H., Ma, Y., Li, Y., Cambria, E.: Learning multi-grained aspect target sequence for Chinese sentiment analysis. Knowl.-Based Syst. 148, 167–176 (2018)
18. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
19. Poria, S., Cambria, E., Hazarika, D., Mazumder, N., Zadeh, A., Morency, L.P.: Multi-level multiple attentions for contextual multimodal sentiment analysis. In: ICDM, pp. 1033–1038 (2017)


20. Shi, X., Zhai, J., Yang, X., Xie, Z., Liu, C.: Radical embedding: delving deeper to Chinese radicals. Vol. 2: Short Papers, p. 594 (2015)
21. Snoek, C.G., Worring, M., Smeulders, A.W.: Early versus late fusion in semantic video analysis. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 399–402. ACM (2005)
22. Su, T.R., Lee, H.Y.: Learning Chinese word representations from glyphs of characters. In: EMNLP, pp. 264–273 (2017)
23. Sun, M., Chen, X., Zhang, K., Guo, Z., Liu, Z.: THULAC: an efficient lexical analyzer for Chinese. Technical Report (2016)
24. Sun, Y., Lin, L., Yang, N., Ji, Z., Wang, X.: Radical-enhanced Chinese character embedding. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds.) ICONIP 2014. LNCS, vol. 8835, pp. 279–286. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12640-1_34
25. Yin, R., Wang, Q., Li, P., Li, R., Wang, B.: Multi-granularity Chinese word embedding. In: EMNLP, pp. 981–986 (2016)
26. Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HHMM-based Chinese lexical analyzer ICTCLAS. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187. Association for Computational Linguistics (2003)
27. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, pp. 649–657 (2015)

Sentiment-Aware Recommendation System for Healthcare Using Social Media

Alan Aipe, N. S. Mukuntha(B), and Asif Ekbal

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
{alan.me14,mukuntha.cs16,asif}@iitp.ac.in

Abstract. Over the last decade, health communities (known as forums) have evolved into platforms where more and more users share their medical experiences, seek guidance and interact with other members of the community. The shared content, though informal and unstructured in nature, contains valuable medical and/or health-related information and can be leveraged to produce structured suggestions for the general public. In this paper, we first propose a stacked deep learning model for sentiment analysis of medical forum data. The stacked model comprises a Convolutional Neural Network (CNN) followed by a Long Short Term Memory (LSTM) network and then another CNN. For a blog classified with positive sentiment, we retrieve the top-n similar posts. Thereafter, we develop a probabilistic model for suggesting suitable treatments or procedures for a particular disease or health condition. We believe that integrating medical sentiment and suggestion would help users find relevant content regarding medications and medical conditions, without having to manually scroll through a large amount of unstructured content.

Keywords: Health social media · Deep learning · Suggestion mining · Medical sentiment

1 Introduction

With the increasing popularity of electronic bulletin boards, there has been a phenomenal growth in the amount of social media information available online. Users post about their experiences on social media such as medical forums and message boards, seeking guidance and emotional support from the online community. As discussed in [2], medical social media is an increasingly viable source of useful information. These users, who are often patients themselves or the friends and/or relatives of patients, write about their personal views and/or experiences. Their posts are rich in information such as their experiences with a disease and their satisfaction with treatment methods and diagnoses. As discussed in [3], medical sentiment refers to a patient's health status, medical conditions and treatment. Extracting this information, as well as analyzing it, has several potential applications. The difficulty of extracting information such as sentiment and suggestions from a forum post can be attributed


to a variety of reasons. Forum posts contain informal language combined with mentions of medical conditions and terms. The medical domain itself is sensitive to misinformation. Thus, any system built on this data also has to incorporate relevant domain knowledge.

1.1 Problem Definition

Our main objective is to develop a sentiment-aware recommendation system to help build a patient-assisted health-care system. We propose a novel framework for mining medical sentiments and suggestions from medical forum data. This broad objective can be modularized into the following set of research questions:

RQ1: Can an efficient multi-class classifier be developed to help us understand the overall medical sentiment expressed in a medical forum post?

RQ2: How can we model the similarity between two medical forum posts?

RQ3: Can we propose an effective algorithm for treatment suggestion by leveraging the medical sentiment obtained from forum posts?

By addressing these research questions, we aim to create a patient-assisted health-care system that is able to determine the sentiments a user expresses in a forum post, point the user to similar forum posts for more information, and suggest possible treatments or procedures for the user's symptoms and possible disorders.

1.2 Motivation

The amount of health-related information sought on the Internet is on the rise. As discussed in [4], an estimated 6.75 million health-related searches are made on Google every day. The Pew Internet Survey [5] claims that 35% of U.S. adults have used the Internet to diagnose a medical condition that they or someone else might have, and that 41% of these online diagnosers have had their suspicions confirmed by a clinician. There has also been an increase in the number of health-related forums and discussion boards on the Internet, which contain useful information that is yet to be properly harnessed. Sentiment analysis has various applications, and we believe it can also provide important information in health-care. In addition to a doctor's advice, connecting with other people who have been in similar situations can help with several practical difficulties. According to the Pew Internet Survey, 24% of all adults have obtained information or support from others who have the same health conditions.


A person posting on such a forum is often looking for emotional support from similar people. Consider the following two blog posts: Post 1: Hi. I have been on sertaking 50 mgs for about 2 months now and previously was at 25 mg for two weeks. Overall my mood is alot more stable and I dont worry as much as I did before however I thought I would have a bath and when I dried my hair etc I started to feel anxious, lightheaded and all the lovely feeling you get with panic. Jus feel so yuck at the moment but my day was actually fine. This one just came out of the blue.. I wanted to know if anyone else still gets some bad moments on these. I don’t know if they feel more intense as I have been feeling good for a while now. Would love to hear others stories.

Post 2: Just wanna let you all know who are suffering with head aches/pressure that I finally went to the doctor. Told him how mines been lasting close to 6 weeks and he did a routine check up and says he's pretty I have chronic tension headaches. He prescribed me muscle relaxers, 6 visits to neck massages at a physical therapist and told me some neck exercises to do. I went in on Tuesday and since yesterday morning things have gotten better. I'm so happy I'm finally getting my life back. Just wanted you all to know so maybe you can feel better soon

In the first post, the author discusses an experience with a drug and is looking to hear from people with similar issues. In the second post, the author discusses a positive experience and seeks to help people with similar problems. One of our aims is to develop a system that automatically retrieves and provides such a user with the posts most similar to theirs. Also, in order to make an informed decision, knowing patients' satisfaction with a given course of treatment might be useful. We also seek to provide treatment suggestions for a particular patient's problems. The suggestions can subsequently be verified by a qualified professional and then be prescribed to the patients, or in more innocuous cases (such as with 'more sleep' or 'meditation') can be directly taken as advice.

1.3 Contributions

In this paper, we propose a sentiment-aware patient-assisted health-care system using information extracted from medical forums. We propose a deep learning model with a stacked architecture that makes use of Convolutional Neural Network (CNN) layers and a Long Short-Term Memory (LSTM) network for classifying a blog post into its medical sentiment class. To the best of our knowledge, using medical sentiment to retrieve similar posts (from medical blogs) and treatment options has not yet been attempted. We summarize the contributions of our proposed work as follows:


– We propose an effective deep learning based stacked model utilizing CNN and LSTM for medical sentiment classification.
– We develop a method for retrieving relevant medical forum posts similar to a given post.
– We propose an effective algorithm for treatment suggestion, which could lead towards building a patient care system.

2 Related Works

Social media is a source of vast amounts of information that can be leveraged to build many socially intelligent systems. Sentiment analysis has been explored quite extensively in various domains; however, it has not been addressed in the medical/health domain to the required extent. In [3], the authors analyzed the peculiarities of sentiment and word usage in medical forums and performed quantitative analysis on clinical narratives and medical social media resources. In [2], multiple notions of sentiment analysis with reference to medical text are discussed in detail. In [1], the authors built a system that identifies drugs causing serious adverse reactions, using messages discussing them on online health forums. They use an ensemble of Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers to successfully identify drugs previously withdrawn from the market. Similarly, in [13], users' written content from social media was used to mine associations between drugs for predicting Adverse Drug Reactions (ADRs); FDA alerts were used as the gold standard, and the Proportional Reporting Ratio (PRR) statistic was shown to be highly important in solving the problem. In [11], one of the shared tasks involved the retrieval of medical forum posts related to provided search queries. The queries involved were short, detailed and to the point, typically fewer than 10 words. Our work, however, focuses more on the medical sentiment of an entire forum post and helps retrieve similar posts. Recently, [12] presented a benchmark setup for analyzing the medical sentiment of users on social media. They identified and analyzed multiple forms of medical sentiment in forum-post text, developed a corresponding annotation scheme, and annotated and released a benchmark dataset for the problem. In the current work we propose a novel stacked deep learning based ensemble model for sentiment analysis in the medical domain, which differs significantly from the prior works mentioned above. To the best of our knowledge, no prior attempt has been made to exploit medical sentiment from social media posts to suggest treatment options and thereby build a patient-assisted recommendation system.

3 Proposed Framework

In this section, we describe our proposed framework, which comprises three phases, each of which addresses one of the research questions enumerated in Sect. 1.1.


3.1 Sentiment Classification

Medical sentiment refers to the health status reflected in a given social media post. We approach this task as a multi-class classification problem using the sentiment taxonomy described in Sect. 4.1. Convolutional Neural Network (CNN) architectures have been extensively applied to sentiment analysis and classification tasks [8, 12]. Long Short Term Memory (LSTM) networks are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies by handling the vanishing and exploding gradient problem [6]. We propose an architecture consisting of two deep Convolutional Neural Network (CNN) layers and a Long Short-Term Memory (LSTM) layer, followed by a fully connected layer and a three-neuron output layer with a softmax activation function. A diagrammatic representation of the classifier is shown in Fig. 1. The social media posts are first vectorized (as discussed in Sect. 4.3) and then fed as input to the classifier. The convolutional layers in the classifier generate 200-dimensional feature maps with unigram and bigram filter sizes. The feature maps from the final CNN layer are max-pooled, flattened and fed into a fully connected layer with a rectified linear unit (ReLU) activation function. The output of this layer is fed into another fully connected layer with a softmax activation to obtain class probabilities. The sentiment denoted by the class with the highest softmax value is taken as the medical sentiment of the input message. The intuition behind adopting a CNN-LSTM-CNN architecture is as follows. During close scrutiny of the dataset, we observed that users often share experiences following their own time frame. For example, "I was suffering from anxiety. My doctor asked me to take cit 20 mg per day. Now I feel better". In this post, the user describes their initial condition, explains the treatment that was used and then the effect of the treatment, all in temporal order. Moreover, health status changes along the same sequence. This trend was observed throughout the dataset; temporal features are therefore key to medical sentiment classification. Hence, in our stacked ensemble model, the first CNN layer extracts top-level features, the LSTM finds the temporal relationships between the extracted features, and the final CNN layer filters out the most salient temporal relationships, which are subsequently fed into a fully connected layer.

Fig. 1. Stacked CNN-LSTM-CNN Ensemble Architecture for medical sentiment classification
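A minimal Keras sketch of such a CNN-LSTM-CNN stack follows. The exact wiring (how the unigram and bigram feature maps are combined, the LSTM width and the dense layer size) is not fully specified in the paper, so those choices are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm_cnn(max_len=150, emb_dim=200, n_classes=3):
    """Illustrative CNN-LSTM-CNN classifier; layer sizes and the way the
    unigram/bigram feature maps are merged are our assumptions."""
    inp = layers.Input(shape=(max_len, emb_dim))
    # First CNN layer: 200-d feature maps for unigram and bigram windows
    uni = layers.Conv1D(200, 1, padding='same', activation='relu')(inp)
    bi = layers.Conv1D(200, 2, padding='same', activation='relu')(inp)
    x = layers.Concatenate()([uni, bi])
    # LSTM captures temporal relationships between the extracted features
    x = layers.LSTM(200, return_sequences=True)(x)
    # Second CNN layer filters the salient temporal relationships
    x = layers.Conv1D(200, 2, padding='same', activation='relu')(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)
```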


3.2 Top-N Similar Posts Retrieval

Users often share content in forums to seek guidance and to connect with other people who have experienced similar medical scenarios. Retrieving the top-N similar posts therefore helps users focus on content relevant to their medical condition, without having to manually scan through all the forum posts. We could have posed this task as a regression problem in which a machine learning (ML) model learns to predict a similarity score for a given pair of forum posts, but to the best of our knowledge no suitable dataset is available for this. We instead define a similarity metric (Eq. 5) and evaluate it by manually annotating a small test set (as discussed in Sect. 4.5). The similarity metric comprises three terms:

– Disease similarity: the Jaccard similarity computed between two forum posts with respect to the diseases mentioned in them. Section 4.4 describes how diseases are extracted from a given post. Let J(A, B) denote the Jaccard similarity between sets A and B, DS(P, Q) the disease similarity between two forum posts P and Q, and D(P) and D(Q) the sets of diseases mentioned in P and Q, respectively. Then:

DS(P, Q) = J(D(P), D(Q))    (1)

where J(A, B) = |A ∩ B| / |A ∪ B|.

– Symptom similarity: the Jaccard similarity between two forum posts with respect to the symptoms mentioned in them. Section 4.4 describes how symptoms are extracted from a given post. Let SS(P, Q) denote the symptom similarity between two forum posts P and Q, and S(P) and S(Q) the sets of symptoms mentioned in P and Q, respectively. Then:

SS(P, Q) = J(S(P), S(Q))    (2)

– Text similarity: the cosine similarity between the document vectors of the two forum posts. The document vector of a post is the sum of the vectors of all the words (Sect. 4.3) in it. Let D_P and D_Q denote the document vectors of posts P and Q, and TS(P, Q) the cosine similarity between them. Then:

TS(P, Q) = (D_P · D_Q) / (|D_P| × |D_Q|)    (3)

We compute the above similarities between a pair of posts, and use Eq. 5 to obtain the overall similarity score Sim(P,Q) between two given forum posts P and Q. For a given test instance, training posts are ranked according to the


similarity score (with respect to the test post) and the top-N posts are retrieved.

MISim(P, Q) = (2 × DS(P, Q) + SS(P, Q)) / 3    (4)

Sim(P, Q) = (2 × MISim(P, Q) + TS(P, Q)) / 3    (5)

where MISim(P, Q) denotes the similarity between P and Q with respect to the relevant medical information. The main objective of similar post retrieval is to find posts depicting a similar medical experience. The medical information shared in a forum post can be considered an aggregate of the disease conditions and the symptoms encountered, and the medical experience shared in a post can be considered an aggregate of the shared medical information and the semantic meaning of the text, in that order of relevance. This is the intuition behind the similarity metric in Eq. 5.
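A minimal sketch of the similarity metric in Eqs. 1-5, assuming the disease/symptom sets and document vectors have already been extracted (the function and field names are ours):

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity(post_p: dict, post_q: dict) -> float:
    """Posts are dicts with 'diseases' (set), 'symptoms' (set) and 'docvec' (np.ndarray)."""
    ds = jaccard(post_p['diseases'], post_q['diseases'])             # Eq. (1)
    ss = jaccard(post_p['symptoms'], post_q['symptoms'])             # Eq. (2)
    dp, dq = post_p['docvec'], post_q['docvec']
    ts = float(dp @ dq / (np.linalg.norm(dp) * np.linalg.norm(dq)))  # Eq. (3)
    mi_sim = (2 * ds + ss) / 3                                       # Eq. (4)
    return (2 * mi_sim + ts) / 3                                     # Eq. (5)
```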

3.3 Treatment Suggestion

A treatment T mentioned in a forum post P can be considered suitable for a disease D mentioned in post Q if P and Q depict a similar medical experience and T is likely to produce a positive medical sentiment given D. The suggestion score G(T, D) is thus given by

G(T, D) = Sim(P, Q) × Pr(+ve sentiment | T, D)    (6)

Treatment T is suggested if G(T, D) ≥ τ and is not suggested if G(T, D) < τ, where τ is a hyper-parameter of the framework and Pr(A) denotes the probability of event A.
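A hedged sketch of the suggestion rule in Eq. 6, assuming the conditional probability of a positive sentiment is estimated from counts over the classified forum posts; the estimation details and the threshold value are not specified in the paper and are our assumptions.

```python
def positive_probability(treatment, disease, posts):
    """Estimate Pr(+ve sentiment | treatment, disease) from labelled posts
    (dicts with 'treatments', 'diseases' and 'sentiment'). Our assumption."""
    relevant = [p for p in posts
                if treatment in p['treatments'] and disease in p['diseases']]
    if not relevant:
        return 0.0
    return sum(p['sentiment'] == 'positive' for p in relevant) / len(relevant)

def suggest(treatment, disease, sim_pq, posts, tau=0.5):
    """Eq. (6): suggest the treatment when Sim(P, Q) * Pr(+|T, D) >= tau (tau illustrative)."""
    score = sim_pq * positive_probability(treatment, disease, posts)
    return score >= tau
```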

4 Dataset and Experimental Setup

In this section, we discuss the details of the datasets used for our experiments and the evaluation setup.

4.1 Forum Dataset

We perform experiments using a recently released dataset for sentiment analysis [12]. This dataset consists of social media posts collected from the medical forum 'patient.info'. In total, 5,189 posts were segregated into three classes (Exist, Recover, Deteriorate) based on the medical condition the post described, and 3,675 posts were classified into three classes, namely Effective, Ineffective and Serious Adverse


Effect, based on the effect of medication. As our framework operates at a generic level, we combine both segments into a single dataset, mapping the labels from each segment to the sentiment taxonomy discussed below. The classes with respect to medical condition are redefined as follows:

– Exist: the user shares the symptoms of a medical problem. This is mapped to the neutral sentiment.
– Recover: the user shares their recovery status from a previous problem. This is mapped to the positive sentiment.
– Deteriorate: the user shares information about their worsening health condition. We map this to the negative sentiment.

The classes with respect to the effect of medication are:

– Effective: the user shares information about the usefulness of a treatment. This is mapped to the positive sentiment.
– Ineffective: the user shares information that the treatment undergone has had no effect. This is mapped to the neutral sentiment.
– Serious adverse effect: the user shares negative opinions towards the treatment, mainly due to adverse drug effects. This is mapped to the negative sentiment.

Sentiment Taxonomy. A common sentiment taxonomy is conceptualized keeping in mind the generic behavior of our proposed system: it does not distinguish between forum posts related to medical conditions and those related to medication. Thus, a one-to-one mapping from the sentiment classes used in each segment of the dataset to a more generic taxonomy is essential. We show the class distribution in Table 1.

Table 1. Class distribution in the dataset with respect to the sentiment taxonomy.

Sentiment  Distribution (%)
Positive              37.49
Neutral               32.34
Negative              30.17

– Positive sentiment: forum posts depicting an improvement in overall health status or positive results of a treatment. For example: "I have been suffering from anxiety for a couple of years. Yesterday, my doc prescribed me Xanax. I am feeling better now." This post is considered positive as it depicts positive results of Xanax.
– Negative sentiment: forum posts describing a deteriorating health status or negative results of a treatment. For example: "Can citalopram make it really hard for you to sleep? i cant sleep i feel wide awake every night for the last week and im on it for 7 weeks."


– Neutral sentiment: forum posts in which neither positive nor negative sentiment is expressed and there is no change in the overall health status of the person. For example: "I was wondering if anyone has used Xanax for anxiety and stress. I have a doctors appointment tomorrow and not sure what will be decided to use."

4.2 Word Embeddings

Capturing the semantic similarity between target texts is an important step towards accurate classification, and word embeddings play a pivotal role here. We use the pre-trained word2vec [10] model1, induced from PubMed and PMC texts along with text extracted from a Wikipedia dump.

4.3 Tools Used and Preprocessing

The codebase is written in Python (version 3.6) with external libraries, namely keras2 for neural network design, sklearn3 for the evaluation of the baseline and the proposed model, pandas4 for easier access to data in the form of tables (data frames), nltk5 for textual analysis and pickle6 for saving and retrieving the input-output of different modules from secondary storage. The preprocessing phase comprises the removal of non-ASCII characters and stop words and the handling of non-alphanumeric characters, followed by tokenization. Tokens of fewer than 3 characters are also removed, owing to the very low probability of their becoming indicative features for the classification model. The labels corresponding to the sentiment classes of each segment in the dataset are mapped to the generic taxonomy classes (as discussed in Sect. 4.1), and the corresponding one-hot encodings are generated.

Text Vectorization. Using the pre-trained word2vec model (discussed in Sect. 4.2), each token is converted to a 200-dimensional vector. The vectors are stacked together and padded to form a 2-D matrix of fixed size (150 × 200), where 150 is the maximum number of tokens in any preprocessed forum post in the training set.
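A minimal sketch of this vectorization step, assuming a gensim KeyedVectors model holding the 200-dimensional embeddings; the file path and the handling of out-of-vocabulary tokens are our assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to the 200-d PubMed/PMC/Wikipedia word2vec binary.
w2v = KeyedVectors.load_word2vec_format('biomedical-w2v.bin', binary=True)

def vectorize(tokens, max_len=150, dim=200):
    """Stack per-token word2vec vectors and zero-pad/truncate to (max_len, dim)."""
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_len]):
        if tok in w2v:                 # unknown tokens stay as zero rows (assumption)
            mat[i] = w2v[tok]
    return mat
```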

4.4 UMLS Concept Retrieval

Identification of medical information such as diseases, symptoms and treatments mentioned in a forum post is essential for the top-N similar post retrieval

1 http://bio.nlplab.org/.
2 https://keras.io/.
3 http://scikit-learn.org/.
4 http://pandas.pydata.org/.
5 http://www.nltk.org/.
6 https://docs.python.org/3/library/pickle.html.


(Sect. 3.2) and treatment suggestion (Sect. 3.3) phases of the proposed framework. The Unified Medical Language System7 (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences (created in 1986). UMLS concept identifiers related to the above-mentioned medical information were therefore retrieved using Apache cTAKES8. Concepts with the semantic type 'Disorder or Disease' were added to the set of diseases, those with the semantic type 'Sign or Symptom' to the set of symptoms, and those with the semantic types 'Medication' and 'Procedures' to the set of treatments.

4.5 Relevance Judgement for Similar Post Retrieval

Annotating pairs of forum posts with similarity scores according to human judgment is necessary to evaluate the relevance of the retrieved posts; this corresponds to evaluating the proposed similarity metric (Eq. 5). Since annotating every pair of posts is a cumbersome task, 20% of the posts in the dataset were randomly selected, maintaining an equal class distribution, for annotation. For each such post, the top 5 similar posts were retrieved using the similarity metric. Annotators were asked to judge the similarity between each retrieved post and the original post on a Likert-type scale from 1 to 5 (1 represents high dissimilarity, 5 high similarity). Annotators were provided with guidelines for relevance judgments based on two questions: 'Is this post relevant to the original post in terms of medical information?' and 'Are the experiences and situations depicted in the pair of posts similar?'. A pair of posts is given a high similarity rating if both conditions hold, and a low rating if neither does. Three annotators with post-graduate educational levels performed the annotations. We measured inter-annotator agreement using Krippendorff's alpha [9], which was observed to be 0.78. Disagreements between the annotators can be explained by ambiguities encountered during the labeling task. We provide a few examples below:

1. There are cases where the original writer of the blog assigned a higher (relevant) rating, but the annotator disagreed on what constituted a 'relevant' post. This often corresponds to posts giving general advice for an illness, for example, 'You can take xanax in case of high stress. It worked for me.' Such advice may not be applicable to a specific situation.
2. Ambiguities are also observed in cases where the authors of the posts are of similar age, sex and socio-economic background but have different health issues (for example, one post depicted a male teenager with severe health anxiety, while the other described a male teenager with social anxiety). For such cases, the similarity ratings varied.
3. Ratings also vary in cases where the symptoms match but the cause and disorder differ. Annotators have trouble judging posts which do not

https://www.nlm.nih.gov/research/umls/. https://ctakes.apache.org/.

176

A. Aipe et al.

contain enough medical information. For example, headache can be a symptom for different diseases.

5

Experimental Results and Analysis

In this section, we report the evaluation results and present necessary analysis. 5.1

Sentiment Classification

The classification model (described in Sect. 3.1) is trained on a dataset of 8,864 unique instances obtained after preprocessing. We define a baseline model by implementing the CNN based system as proposed in [12] under the identical experimental conditions as that of our proposed architecture. We also develop a model based on an LSTM. To see the real impact of the third layer, we also show the performance of a CNN-LSTM based model. Batch size for training was set to 32. Results of 5-fold cross validation are shown in Table 2. Table 2. Evaluation results of 5-fold cross-validation for sentiment classification. Model

Accuracy Cohen-Kappa Macro

Baseline [12] LSTM CNN-LSTM Proposed model

0.63 0.609 0.6516 0.6919

0.443 0.411 0.4559 0.4966

Precision 0.661 0.632 0.6846 0.7179

Recall 0.643 0.628 0.6604 0.7002

F1-Score 0.652 0.63 0.6708 0.7089

Evaluation shows that the proposed model performs better than the baseline system, and efficiently captures medical sentiment from the social media posts. Table 2 shows the accuracy, cohen-kappa, precision, recall and F1 score of the proposed model as 0.6919, 0.4966, 0.7179, 0.7002 and 0.7089, respectively. In comparison to the baseline model this is approximately a 9.13% improvement in terms of all the metrics. Posts usually consist of medical events and experiences. Therefore, capturing temporally related spatially close features is required for inferring the overall health status. The proposed CNN-LSTM-CNN network has been shown to be better at this task compared to the other models. The high value of the Cohen-Kappa metric suggests that the proposed model indeed learns to classify posts into 3 sentiment classes rather than making any random guess. A closer look at the classification errors revealed that there are instances where CNN and LSTM predict incorrectly, but the proposed model correctly classifies. With the following example where baseline and LSTM both failed to correctly classify, but the proposed model succeeded: ‘I had a doctors appointment today. He told I was recovering and should be more optimistic. I am still anxious and stressed most of the time’. Baseline model and LSTM classified

Sentiment-Aware Recommendation System for Healthcare

177

it as positive (might be because of terms like ’recovering’ and ’optimistic’) while the proposed model classified it as negative. This shows that the proposed model can satisfactorily capture the contextual information, and leverage it effectively for the classification task. To understand where our system fails we perform detailed error analysisboth quantitatively and qualitatively. We show quantitative analysis in terms of confusion matrix as shown in Fig. 2.

Fig. 2. Confusion matrix of sentiment classification
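A confusion matrix and the agreement and macro-averaged metrics reported in Table 2 can be obtained with standard scikit-learn calls; the sketch below uses invented gold labels and predictions purely as a usage illustration, not the paper's results.

```python
from sklearn.metrics import (confusion_matrix, cohen_kappa_score,
                             precision_recall_fscore_support)

# Placeholder gold labels and predictions (0=negative, 1=neutral, 2=positive);
# these are not the paper's data, only an illustration of the evaluation calls.
y_true = [0, 2, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1, 2, 1, 1, 2]

print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
print("Cohen's kappa:", round(cohen_kappa_score(y_true, y_pred), 4))
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Macro P/R/F1: {p:.4f} / {r:.4f} / {f1:.4f}")
```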

Close scrutiny of the predicted and actual values of the test instances reveals that the majority of misclassifications occur in cases where the sentiment remains positive or negative throughout the post and suddenly changes at the end. For example: "I have been suffering from anxiety for a couple of years now. Doctors kept prescribing new and new medicines but I felt no change. Stress was unbearable. I lost my parents last year. The grief made me even worse. But I am currently feeling good after taking coQ10". We observe that the proposed model was confused in such cases. Moreover, users often share personal content which does not carry significant medical information. Such noise also contributes to misclassification.

Comparison with Existing Models: One of the recent works on medical sentiment analysis is reported in [12]. They trained and evaluated a CNN based architecture separately for medical condition and medication segments. As discussed in the dataset and experiment section, we merged both datasets related to medical conditions and medications into one for training and evaluation. Our definition of medical sentiment is thus more generic in nature, and a direct comparison to the existing system is not entirely fair. None of the works mentioned in the related work section addressed sentiment analysis for medical suggestion mining, and the experimental setups, datasets and sentiment classes used in those works are also very different.

5.2 Top-N Similar Post Retrieval

Evaluation of the retrieval task is done by comparing the similarity scores assigned to a pair of forum posts by the system and by the human annotators (as discussed in Sect. 4.5). Our focus is on the correlation between the similarity scores assigned to pairs of posts by human and system judgments, rather than on the actual similarity values: if a human feels that a post P is more relevant to post Q than to post R, then the system should behave in the same way. Therefore, we use the Pearson correlation coefficient for the evaluation. The statistical significance of the correlation (2-tailed p-value from a t-test with the null hypothesis that the correlation occurred by chance) was found to be 0.00139, 0.0344 and 0.0186, respectively, for each sentiment class. Precision@5 is also calculated to evaluate the relevance of the retrieved forum posts. As annotation was done using the top-5 retrieved posts (as discussed in Sect. 4.5), Recall@5 could not be calculated. We design a baseline model using the K-nearest neighbour algorithm with a cosine similarity metric for capturing textual similarity. We show the results in Table 3. From the evaluation results, it is evident that the similarity scores assigned by the proposed system are more positively correlated with the human judgments than those of the baseline. The correlation can be considered statistically significant, as the p-values corresponding to all the sentiment classes are less than 0.05. The higher Precision@5 indicates that a greater number of relevant posts are retrieved by the proposed approach in comparison to the baseline model.

Table 3. Evaluation corresponding to the top-n similar posts retrieval. 'A' and 'B' denote the results corresponding to the proposed metric and the K-nearest neighbor algorithm using a text similarity metric, respectively.

Sentiment  Pearson correlation (A)  (B)      Precision@5 (A)  (B)      DCG5 (A)  (B)
Positive   0.3586                   0.2104   0.6638           0.5467   6.0106    2.1826
Neutral    0.3297                   0.2748   0.5932           0.5734   5.3541    2.4353
Negative   0.3345                   0.2334   0.623            0.5321   4.7361    2.4825

We also calculate the Discounted Cumulative Gain (DCG) [7] of the similarity scores for both models from the human judgments. The idea behind DCG is that highly relevant documents appearing lower in the ranking should be penalized. A logarithmic reduction factor was applied to the human relevance judgments, which were scaled from 0 to 4, and the DCG accumulated at rank position 5 was calculated with the following formula:

DCG_5 = \sum_{i=1}^{5} \frac{rel_i}{\log_2(i + 1)}    (7)

where rel_i is the relevance judgment of the post at position i. The NDCG could not be calculated, as annotation was done only using the top-5 retrieved posts (as discussed in Sect. 4.5).
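As a worked illustration of the two evaluation quantities used here, the sketch below computes the Pearson correlation (with its two-tailed p-value) via SciPy and DCG at rank 5 exactly as in Eq. (7); the score lists are invented placeholders, not the paper's data.

```python
import math
from scipy.stats import pearsonr

# Placeholder scores for one sentiment class (not the paper's actual data).
system_scores = [0.81, 0.64, 0.55, 0.42, 0.39, 0.72, 0.18]
human_scores  = [5,    4,    3,    3,    2,    4,    1]

r, p_value = pearsonr(system_scores, human_scores)
print(f"Pearson r = {r:.4f}, two-tailed p = {p_value:.5f}")

def dcg_at_5(relevances):
    """DCG@5 as in Eq. (7): sum of rel_i / log2(i + 1) over the top 5 positions."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:5], start=1))

# Human relevance judgments (0-4) for the top-5 posts retrieved for one query post.
print(f"DCG@5 = {dcg_at_5([4, 3, 3, 1, 0]):.4f}")
```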


During error analysis, we observe a few forum posts where users share their personal feelings but, due to the presence of little medically relevant content, are labeled as irrelevant by the system. However, they contain some relevant information that could be useful to end users. For example: 'Hello everyone, Good morning to all. I know I had been away for a couple of days. I went outing with my family to get away from the stress I had been feeling lately. Strolled thorugh the park, played tennis with kids and visited cool places nearby. U know Family is the best therapy for every problem. Still feeling a little bit anxious lately. Suggest me something'. This example post contains proportionally more personal information than medically relevant information; however, 'Feeling a little bit anxious lately' is the medically relevant part. Filtering out such content is thus required for better performance and would help the system focus on the relevant content. There are two possible ways to tackle this problem, which we would like to explore in future work:

1. Increasing the weight of the medical information similarity (represented as MISim in Eq. 5) while computing the overall similarity score.
2. Identifying and removing personal, medically irrelevant content, possibly by designing a sequence labeling model (classifying relevant vs. irrelevant segments), by manually verifying the data, or by detecting certain indicative phrases or snippets in the post.

5.3 Treatment Suggestion

Evaluation of treatment suggestion is particularly challenging because it requires annotators with a high level of medical expertise. Moreover, to the best of our knowledge there is no existing benchmark dataset for this evaluation. Hence, we are not able to provide a quantitative evaluation of the suggestion module. However, the suggestion module builds on the soundness of the sentiment classification module, and our evaluation in the previous section shows that the sentiment classifier has acceptable output quality. The task of a good treatment suggestion system is to mine the best and most relevant treatment suggestions for a candidate disease. As the function for computing the suggestion score (Eq. 6) involves the probability of positive sentiment given a treatment T and a disorder/disease D, it is always ensured that T is a candidate treatment for D, i.e. the treatment T produced positive results in the context of D in at least one case. In other words, the probability term ensures that irrelevant treatments that did not give a positive result in the context of D never appear as treatment suggestions for D. The effectiveness of the suggestion module depends on the following three factors:

1. Apache cTAKES retrieved correct concepts in the majority of cases, with only a few exceptions, which are mostly ambiguous in nature. For example, the word 'basis' can represent a clinically-proven NAD+ supplement or can be used as a synonym of the word 'premise'.
2. If an irrelevant post is labeled as relevant by the system, the suggestions should not contain treatments mentioned in that post. Thus, the similarity metric plays an important role in picking the right treatment for a given candidate disease.
3. Value of the hyper-parameter τ (Eq. 6): as its value decreases, more candidate treatments are suggested by the system.

The performance of the module can be augmented and tailored by tweaking the above parameters depending on the practical application at hand.
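The effect of the threshold τ can be illustrated with a short sketch. The actual suggestion score (Eq. 6) is not reproduced here; the scores and treatment names below are hypothetical, and the snippet only shows how lowering τ admits more candidates.

```python
# Hypothetical suggestion scores for candidate treatments of one disorder;
# the scoring function itself (Eq. 6) is not reproduced here.
suggestion_scores = {
    "CBT": 0.82,
    "coQ10": 0.41,
    "xanax": 0.33,
    "herbal tea": 0.12,
}

def suggest(scores: dict, tau: float) -> list:
    """Return candidate treatments whose suggestion score reaches the threshold
    tau, best first. Lowering tau lets more candidates through."""
    return sorted((t for t, s in scores.items() if s >= tau),
                  key=lambda t: scores[t], reverse=True)

print(suggest(suggestion_scores, tau=0.5))  # ['CBT']
print(suggest(suggestion_scores, tau=0.3))  # ['CBT', 'coQ10', 'xanax']
```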

6 Conclusion and Future Work

In this paper, we have established the usefulness of medical sentiment analysis for building a recommendation system that assists in building a patient-assisted healthcare system. A deep learning model has been presented for classifying the medical sentiment expressed in a forum post into conventional polarity-based classes. We have empirically shown that the proposed architecture can satisfactorily capture sentiment from social media posts. We have also proposed a novel similarity metric for the retrieval of forum posts with similar medical experiences and sentiments. A novel treatment suggestion algorithm has also been proposed, which utilizes our similarity metric along with patient-treatment satisfaction ratings. We have performed a detailed analysis of our model. In our work, we use the UMLS database due to its wide usage and acceptability as a standard database. We also point to future work, such as annotating a dataset for treatment suggestions, which would widen the scope for machine learning, and developing a sequence labeling model to remove personal, medically irrelevant content. Our work serves as an initial study in harnessing the huge amount of open, useful information available on medical forums.

Acknowledgements. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).

References

1. Chee, B.W., Berlin, R., Schatz, B.: Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings, vol. 2011, pp. 217–226 (2011)
2. Denecke, K.: Sentiment analysis from medical texts. In: Health Web Science. HIS, pp. 83–98. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20582-3_10
3. Denecke, K., Deng, Y.: Sentiment analysis in medical settings: new opportunities and challenges. Artif. Intell. Med. 64(1), 17–27 (2015). https://doi.org/10.1016/j.artmed.2015.03.006
4. Eysenbach, G., Kohler, C.: What is the prevalence of health-related searches on the World Wide Web? Qualitative and quantitative analysis of search engine queries on the internet. In: AMIA Annual Symposium Proceedings, pp. 225–229 (2003)
5. Fox, S., Duggan, M.: Health Online 2013. Pew Internet & American Life Project, Washington, DC (2013). http://www.pewinternet.org/Reports/2013/Health-online/Summary-of-Findings.aspx. Accessed 20 Nov 2013
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
7. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002). https://doi.org/10.1145/582415.582418
8. Kim, Y.: Convolutional neural networks for sentence classification. CoRR abs/1408.5882 (2014)
9. Krippendorff, K.: Computing Krippendorff's alpha-reliability (2011). https://repository.upenn.edu/asc_papers/43
10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
11. Palotti, J.R.M., et al.: CLEF 2017 task overview: the IR task at the eHealth evaluation lab - evaluating retrieval methods for consumer health search. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, 11–14 September 2017 (2017). http://ceur-ws.org/Vol-1866/invited_paper_16.pdf
12. Yadav, S., Ekbal, A., Saha, S., Bhattacharyya, P.: Medical sentiment analysis using social media: towards building a patient assisted system. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 7–12 May 2018 (2018)
13. Yang, C.C., Yang, H., Jiang, L., Zhang, M.: Social media mining for drug safety signal detection. In: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, SHB 2012, pp. 33–40. ACM, New York (2012). https://doi.org/10.1145/2389707.2389714

Sentiment Analysis Through Finite State Automata

Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia

Department of Political and Communication Science, University of Salerno, Salerno, Italy
{spelosi,amaisto,amelillo,elia}@unisa.it

Abstract. The present research aims to demonstrate how powerful Finite State Automata (FSA) can be in a domain in which the vagueness of human opinions and the subjectivity of user generated contents make the automatic "understanding" of texts extremely hard. Assuming that the semantic orientation of sentences is based on the manipulation of sentiment words, we built from scratch, for the Italian language, a network of local grammars for the annotation of sentiment expressions and electronic dictionaries for the classification of more than 15,000 opinionated words. In the paper we explain in detail how we made use of FSA for both the automatic population of sentiment lexicons and the sentiment classification of real sentences.

Keywords: Finite State Automata · Sentiment Analysis · Contextual Valence Shifters · Sentiment lexicon · Electronic dictionary

1 Introduction

The Web 2.0, as an interactive medium, offers Internet users the opportunity to freely share thoughts and feelings with web communities. This kind of information is extremely important in consumers' decision-making processes; we make particular reference to experience and search goods, or to e-commerce in general, if one considers to what extent the evaluation of product qualities is influenced by the past experiences of customers who have already tried the same goods and posted their opinions online. The automatic treatment of User Generated Content becomes a relevant research problem when the huge volume of raw text online makes its semantic content impossible to manage for human operators. As a matter of fact, the largest amount of on-line data is semistructured or unstructured and, as a result, its monitoring requires sophisticated Natural Language Processing (NLP) tools, able to pre-process it from the linguistic point of view and, then, automatically access its semantic content. The Sentiment Analysis research field can have a large impact on many commercial, government and business intelligence applications; examples are ad-placement applications, Flame Detection Systems, Social Media Monitoring Systems, Recommendation Systems, Political Analysis, etc. It would be difficult for humans to read and summarize such a huge volume of data but, in other respects, introducing machines to the semantic dimension of human language remains an open problem as well.

In the present work we present a method which exploits Finite State Automata (FSA) with the purpose of building high performance tools for Sentiment Analysis (footnote 1). We computed the polarity of more than 15,000 Italian sentiment words, which have been semi-automatically listed into machine-readable electronic dictionaries, through a network of FSA consisting of 125 syntactic graphs. We tested a strategy based on the systematic substitution of semantically oriented classes of (simple or compound) words into the same sentence patterns. The combined use of dictionaries and automata made it possible to apply our method to real text occurrences (footnote 2). In Sect. 2 we mention the most used techniques for the automatic propagation of sentiment lexicons and for sentence annotation. Section 3 delineates our method, carried out through finite state technologies. Then, in Sects. 4 and 5 we go through our morphological and syntactic solutions to the mentioned challenges.

Footnote 1: Annibale Elia, Lorenza Melillo and Alessandro Maisto worked on the Conclusion of the paper, while Serena Pelosi on the Introduction and Sects. 1, 2, 3 and 4.
Footnote 2: We chose a rule-based method, among others, in order to verify the hypothesis that words can be classified together according to both semantic and syntactic criteria.

2 State of the Art

The core of this research consists of two distinct Sentiment Analysis tasks: at the word level, the dictionary population and, at the sentence level, the annotation of complex expressions. In this section we summarize other methods used in the literature to face those tasks. Many techniques have been discussed in the literature to perform Sentiment Analysis; they can be classified into lexicon based methods, learning methods and hybrid methods. In Sentiment Analysis tasks the most effective indicators used to discover subjective expressions are adjectives or adjective phrases [67], but recently the use of adverbs [6], nouns [72] and verbs [55] has become common as well.


Among the state of the art methods used to build and test dictionaries we mention Latent Semantic Analysis (LSA) [41]; bootstrapping algorithms [65]; graph propagation algorithms [33,71]; conjunctions and morphological relations between adjectives [29]; Context Coherency [35]; and distributional similarity (footnote 3) [79]. Pointwise Mutual Information (PMI) using seed words (footnote 4) has been applied to sentiment lexicon propagation by [22,63,69–71]. It has been observed, indeed, that positive words often occur close to positive seed words, whereas negative words are likely to appear around negative seed words [69,70]. Learning and statistical methods for Sentiment Analysis usually make use of Support Vector Machines [52,57,80] or Naïve Bayes classifiers [37,68]. Finally, as regards hybrid methods, we must cite the works of [1,9,10,24,42,60,64,76].

The largest part of the state-of-the-art work on polarity lexicons for Sentiment Analysis has been carried out on the English language; Italian lexical databases are mostly created by translating and adapting the English ones, SentiWordNet and WordNet-Affect among others. Among the works on the Italian language that deserve to be mentioned, [5] merged the semantic information belonging to existing lexical resources in order to obtain Sentix (Sentiment Italian Lexicon) (footnote 5), an annotated lexicon of senses for Italian. Basically, MultiWordNet [58], the Italian counterpart of WordNet [20,47], has been used to transfer the polarity information associated to English synsets in SentiWordNet [19] to Italian synsets, thanks to the multilingual ontology BabelNet [53]. Every Sentix entry is described by information concerning its part of speech, its WordNet synset ID, a positive and a negative score from SentiWordNet, a polarity score (from -1 to 1) and an intensity score (from 0 to 1). [8] presented a lexical sentiment resource that contains polarized simple words, multiwords and idioms, annotated with polarity, intensity, emotion and domain labels (footnote 6). [12] built a lexicon for the EVALITA 2014 task by collecting adjectives and adverbs (extracted from the De Mauro-Paravia Italian dictionary [11]) as well as nouns and verbs (from Sentix), and by classifying their polarity through the online Sentiment Analysis API provided by Ai Applied (footnote 7). Another Italian sentiment lexicon is the one semi-automatically developed from ItalWordNet v.2 starting from a list of manually classified seed key-words [66]; it includes 24,293 neutral and polarized items distributed in XML-LMF format (footnote 8).

[30] achieved good results in the SentiPolC 2014 task by semi-automatically translating different lexicons into Italian, namely SentiWordNet, the Hu-Liu Lexicon, the AFINN Lexicon and the Whissel Dictionary, among others.

Footnote 3: Word similarity is a very frequently used method in dictionary propagation among the thesaurus-based approaches. Examples are the Maryland dictionary, created thanks to a Roget-like thesaurus and a handful of affixes [48], and other lexicons based on WordNet, like SentiWordNet, built on the basis of a quantitative analysis of the glosses associated to synsets [17,18], or other lexicons based on the computation of distance measures on WordNet [17,34].
Footnote 4: Seed words are words which are strongly associated with a positive/negative meaning, such as eccellente ("excellent") or orrendo ("horrible"), by which it is possible to build a bigger lexicon, detecting other words that frequently occur alongside them.
Footnote 5: http://valeriobasile.github.io/twita/downloads.html
Footnote 6: https://www.celi.it/
Footnote 7: http://ai-applied.nl/sentiment-analysis-api
Footnote 8: http://hdl.handle.net/20.500.11752/ILC-73

As regards the works on lexicon propagation, we mention three main research lines. The first one is grounded on the richness of already existing thesauri, WordNet (footnote 9) [47] among others. The second approach is based on the hypothesis that words that convey the same polarity appear close together in the same corpus, so that the propagation can be performed on the basis of co-occurrence algorithms [4,36,61,69,78]. Finally, the morphological approach employs morphological structures and relations for the assignment of prior sentiment polarities to unknown words, on the basis of the manipulation of the morphological structures of known lemmas (footnote 10) [40,50,77].

However, it does not seem to be enough to just dispose of sentiment dictionaries: the syntactic structures in which the opinionated lemmas occur have a strong impact on the resulting polarity of the sentences. That is the case of negation, intensification, irrealis markers and conditional tenses. Rule-based approaches that take into account the syntactic dimension of Sentiment Analysis are [49,51,81], and FSA have been used for the linguistic analysis of sentiment expressions by [3,27,43].

3 Methodology

The present research is grounded on the richness, in terms of lexical and grammatical resources, of the linguistic databases built in the Department of Political and Communication Science (DSPC) of the University of Salerno by the Computational Linguistics lab "Maurice Gross", which started its studies on language formalization in 1981 [16,73]. These resources take the shape of lexicon-grammar tables, that cross-check the lexicon and the syntax of any given language, in this case Italian; domain independent machine-readable dictionaries; and inflectional and derivational local grammars in the form of finite state automata.

Footnote 9: Although WordNet does not include semantic orientation information for its lemmas, semantic relations such as synonymy or antonymy are commonly used in order to automatically propagate the polarity, starting from a manually annotated set of seed words [2,13,18,28,31,34,39,45]. This approach presents some drawbacks, such as the lack of scalability, the unavailability of enough resources for many languages and the difficulty of handling newly coined words, which are not already contained in the thesauri.
Footnote 10: Morphemes allow not only the propagation of a given word polarity (e.g. en-, -ous, -fy), but also its switching (e.g. dis-, -less), its intensification (e.g. super-, over-) and its weakening (e.g. semi-) [54].


Differently from other lexicon-based Sentiment Analysis methods, our approach is grounded on the solidity of the Lexicon-Grammar resources and classifications [16,73], which provide fine-grained semantic as well as syntactic descriptions of the lexical entries. Such lexically exhaustive grammars distance themselves from the tendency of other sentiment resources to classify together words that have nothing in common from the syntactic point of view. In the present work, we started from the annotation of a small-sized dictionary of opinionated words (footnote 11). FSA are used both in the morphological expansion of the lexicon and in the syntactic modeling of the words in context. In this research we assume that words can be classified together not only on the basis of their semantic content, but also according to syntactic criteria. Thanks to finite state technologies, we computed the polarity of individual words by systematically replacing them with other items (endowed with the same and/or a different individual polarity) in many sentence (or phrase) patterns. The hypothesis is that classes of words characterized by the same individual annotation can be generalized when considered in different syntactic contexts, because they undergo the same semantic shifting when occurring in similar patterns. Dictionaries and FSA used in tandem made it possible to verify these assumptions on real corpora.

3.1 Local Grammars and Finite-State Automata

Sentiment words, multiwords and idioms used in this work are listed in NooJ electronic dictionaries, while the local grammars (footnote 12) used to manipulate their polarities are formalized by means of Finite State Automata. Electronic dictionaries have been exploited in order to list and to semantically and syntactically classify, in a machine readable format, the sentiment lexical resources. The computational power of NooJ graphs has, instead, been used to represent the acceptance/rejection of the semantic and syntactic properties through the use of constraints and restrictions. Finite-State Automata (FSA) are abstract devices characterized by a finite set of nodes or "states" connected to one another by transitions that allow us to determine sequences of symbols related to a particular path. These graphs are read from left to right, or rather, from the initial state to the final state [26].
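As a toy illustration of the formalism (and not of the authors' actual NooJ grammars), the following Python sketch encodes a small finite-state automaton that accepts sequences of the form optional negation + optional intensifier + polar adjective, the kind of local pattern discussed later in the paper; the word lists and state names are invented for the example.

```python
# A tiny hand-built FSA over word categories, read left to right from the
# initial state to a final state. This is an illustrative sketch only.
NEG = {"non"}
INT = {"molto", "davvero"}
POS_ADJ = {"bello", "ottimo"}

# Transition table: state -> {symbol category: next state}
TRANSITIONS = {
    "q0": {"NEG": "q1", "INT": "q2", "ADJ": "q3"},
    "q1": {"INT": "q2", "ADJ": "q3"},
    "q2": {"ADJ": "q3"},
}
FINAL_STATES = {"q3"}

def categorize(token: str) -> str:
    if token in NEG:
        return "NEG"
    if token in INT:
        return "INT"
    if token in POS_ADJ:
        return "ADJ"
    return "OTHER"

def accepts(tokens) -> bool:
    state = "q0"
    for tok in tokens:
        state = TRANSITIONS.get(state, {}).get(categorize(tok))
        if state is None:
            return False
    return state in FINAL_STATES

print(accepts("non molto bello".split()))  # True
print(accepts("molto non bello".split()))  # False: no NEG transition from q2
```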

Footnote 11: While compiling the dictionary, the judgment on the words' prior polarity is given without considering any textual context. The entries of the sentiment dictionary receive the same annotation and are then grouped together if they possess the same semantic orientation. The Prior Polarity [56] refers to the individual word's Semantic Orientation (SO) and differs from the SO because it is always independent from the context.
Footnote 12: Local grammars are algorithms that, through grammatical, morphological and lexical instructions, are used to formalize linguistic phenomena and to parse texts. They are defined "local" because, despite any generalization, they can be used only in the description and analysis of limited linguistic phenomena.

3.2 SentIta and Its Manually-Built Resources

In this section we briefly describe the sentiment lexicon for the Italian language which has been semi-automatically created on the basis of the resources of the DSPC. The tagset used for the prior polarity annotation of the resources is composed of four tags: POS (positive), NEG (negative), FORTE (intense) and DEB (weak). Such labels, combined together, generate an evaluation scale that goes from –3 to +3 and a strength scale that ranges from –1 to +1. Neutral words (e.g. nuovo "new", with score 0 on the evaluation scale) have been excluded from the lexicon (footnote 13). In our resources, adjectives and bad words have been manually extracted and evaluated starting from the NooJ Italian electronic dictionary of simple words, preserving their inflectional (FLX) and derivational (DRV) properties. Moreover, compound adverbs [15], idioms [74,75] and verbs [14,16] have been weighted starting from the Italian Lexicon-Grammar tables (footnote 14), in order to maintain the syntactic, semantic and transformational properties connected to each of them.

4 Morphology

In this section we describe how FSA have been exploited to enrich the sentiment lexical resources. The adjectives have been used as the starting point for the expansion of the sentiment lexicon: thanks to a morphological FSA it has been possible to enlarge the size of SentIta on the basis of the morpho-phonological relations that connect words and their meanings (footnote 15). More than 5,000 labeled adjectives have been used to predict the orientation of the adverbs with which they are morphologically related. All the adverbs contained in the Italian dictionary of simple words have been used as input, and a morphological FSA has been used to quickly populate the new dictionary by extracting the words ending with the suffix -mente, "-ly", and by making such words inherit the adjectives' polarity.

Footnote 13: The main difference between the words listed in the two scales is the possibility to use them as indicators for subjectivity detection: basically, the words belonging to the evaluation scale are "anchors" that trigger the identification of polarized phrases or sentences, while the ones belonging to the strength scale are just used as intensity modifiers (see Sect. 5.3).
Footnote 14: Available for consultation at http://dsc.unisa.it/composti/tavole/combo/tavole.asp.
Footnote 15: The morphological method could also be applied to Italian verbs, but we chose to avoid this solution because of the complexity of their argument structures. We decided, instead, to manually evaluate all the verbs described in the Italian Lexicon-Grammar binary tables, so we could preserve the different lexical, syntactic and transformational rules connected to each of them [16].


The NooJ annotations consisted of a list of 3,200+ adverbs that, at a later stage, have been manually checked in order to correct the grammar's mistakes (footnote 16). In detail, the Precision achieved in this task is 99% and the Recall is 88%. The derivation of quality nouns from qualifier adjectives is another derivational phenomenon of which we took advantage for the automatic enlargement of SentIta; these kinds of nouns allow the qualities expressed by the base adjectives to be treated as entities. A morphological FSA, following the same idea as the adverb grammar, matches in a list of abstract nouns the stems that are in morpho-phonological relation with our list of hand-tagged adjectives. Because nouns, differently from adverbs, need to have their inflection information specified, we associated to each suffix entry, in an electronic dictionary dedicated to the suffixes of quality nouns, the inflectional paradigm that they give to the words with which they occur. In order to effortlessly build a noun dictionary of sentiment words we also exploited the hand-made list of nominalizations of the psychological verbs [25,27,46].

Table 1. Analytical description of the most productive quality noun suffixes.

Suffixes   Inflection  Correct  Precision
-ità       N602        666      98%
-mento     N5          514      90%
-(z)ione   N46         359      86%
-ezza      N41         305      99%
-enza      N41         148      94%
-ia        N41         145      98%
-ura       N41         142      88%
-aggine    N46         72       97%
-eria      N41         71       95%
-anza      N41         57       86%
TOT                    2579     93%

Footnote 16: The meaning of the deadjectival adverbs in -mente is not always predictable starting from the base adjectives from which they are derived. The syntactic structures in which they occur also influence their interpretation. Depending on their position in the sentence, deadjectival adverbs can be described as adjective modifiers (e.g. altamente "highly"), predicate modifiers (e.g. perfettamente "perfectly") or sentence modifiers (e.g. ultimamente "lately").


As regards the suffixes used to form the quality nouns (Table 1) [62], it must be said that they generally make the new words simply inherit the orientation of the base adjectives. Exceptions are -edine and -eria, which almost always shift the polarity of the quality nouns to the weakly negative value (–1), e.g. faciloneria "slapdash attitude". The suffix -mento also differs from the others, insofar as it belongs to the derivational phenomenon of deverbal nouns of action [21]. It has been possible to use it in our grammar for the deadjectival noun derivation by using the past participles of the verbs listed in the adjective dictionary of sentiment (e.g. V: sfinire "to wear out", A: sfinito "worn out", N: sfinimento "weariness"). The Precision achieved in this task is 93%. In this work we also drew up an FSA which can interact, at the morphological level, with a list of prefixes able to negate (e.g. anti-, contra-, non-, among others) or to intensify/downtone (e.g. arci-, semi-, among others) the orientation of the words in which they appear [32].
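As a rough illustration of this kind of suffix-driven propagation (a simplified sketch, not the authors' NooJ morphological grammar), the following Python fragment derives candidate polarity entries for -mente adverbs and -eria quality nouns from a small hand-tagged adjective dictionary; the word lists, the naive stemming and the suffix rules are assumptions for the example.

```python
# Hand-tagged adjectives (illustrative entries; scores on the -3..+3 scale).
adjectives = {"perfetto": +3, "bello": +2, "orrendo": -3, "facilone": -1}

# Candidate words found in a plain word list (illustrative).
candidates = ["perfettamente", "bellamente", "orrendamente", "faciloneria"]

def stem_of(adj: str) -> str:
    """Very naive stem: drop the final vowel (perfetto -> perfett)."""
    return adj[:-1]

stems = {stem_of(a): score for a, score in adjectives.items()}

derived = {}
for word in candidates:
    if word.endswith("amente"):      # -mente adverbs inherit the adjective polarity
        base = word[: -len("amente")]
        if base in stems:
            derived[word] = stems[base]
    elif word.endswith("eria"):      # -eria tends to shift toward weakly negative (-1)
        base = word[: -len("eria")]
        if base in stems:
            derived[word] = -1

print(derived)
# e.g. {'perfettamente': 3, 'bellamente': 2, 'orrendamente': -3, 'faciloneria': -1}
```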

5 Syntax

Contextual Valence Shifters are linguistic devices able to change the prior polarity of words when co-occurring with them in the same context [38,59]. In this work we handle contextual shifting by generalizing over all the polar words that possess the same prior polarity. A network of local grammars has been designed on a set of rules that compute the words' individual polarity scores according to the contexts in which they occur. In general, the sentence annotation is performed through an Enhanced Recursive Transition Network, by using six different metanodes (footnote 17) that, working as containers for the sentiment expressions, assign equal labels to the patterns embedded in the same graphs. Among the most used Contextual Valence Shifters we took into account linguistic phenomena like intensification, negation, modality and comparison. Moreover, we formalized some classes of frozen sentences that modify the polarity of the sentiment words that occur in them. Our network of 15,000 opinionated lemmas and 125 embedded FSA has been tested on a multi-domain corpus of customer reviews (footnote 18), achieving in the sentence-level sentiment classification task an average Precision of 75% and a Recall of 73%.
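To make the rule-based idea concrete, here is a highly simplified Python sketch of how prior polarities could be combined with negation and intensification cues at the phrase level; the lexicon entries, the shifting amounts and the clipping to the [-3, +3] evaluation scale are illustrative assumptions, not the paper's actual 125-graph grammar.

```python
# Illustrative prior polarities on the -3..+3 evaluation scale.
PRIOR = {"spettacolare": +3, "valida": +2, "pessimo": -3}
NEGATIONS = {"non", "mica"}          # valence shifting, not simple switching
INTENSIFIERS = {"molto": +1, "davvero": +1, "poco": -1}

def clip(score: int) -> int:
    return max(-3, min(3, score))

def phrase_score(tokens) -> int:
    """Score a short phrase around one polar word (toy contextual shifting)."""
    score, negated, shift = 0, False, 0
    for tok in tokens:
        if tok in NEGATIONS:
            negated = True
        elif tok in INTENSIFIERS:
            shift += INTENSIFIERS[tok]
        elif tok in PRIOR:
            score = PRIOR[tok]
    if score == 0:
        return 0
    score = clip(score + shift if score > 0 else score - shift)
    if negated:
        # Shift rather than switch: a negated +3 becomes mildly negative here.
        score = clip(-score + (1 if score > 0 else -1))
    return score

print(phrase_score("grafica davvero spettacolare".split()))      # 3
print(phrase_score("grafica non proprio spettacolare".split()))  # -2, shifted down
```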

Footnote 17: Metanodes are labeled through the six corresponding values of the evaluation scale, which goes from –3 to +3.
Footnote 18: The dataset contains Italian opinionated texts in the form of user reviews and comments from e-commerce and opinion websites; it comprises 600 text units (50 positive and 50 negative for each product class) and refers to six different domains, for which different websites (such as www.ciao.it, www.amazon.it, www.mymovies.it and www.tripadvisor.it) have been exploited [44].

5.1 Opinionated Idioms

More than 500 Italian frozen sentences containing adjectives [74,75] have been evaluated and then formalised with a dictionary-grammar pair. Among the idioms considered there are the comparative frozen sentences of the type N0 Agg come C1, described by [74], which usually intensify the polarity of the sentiment adjective they contain, as happens in (1).

(1) Mary è bella [+2] come il sole [+3] "Mary is as beautiful as the sun"

Otherwise, it is also possible for an idiom of this sort to be polarised when the adjective contained in it (e.g. bianco, "white") is neutral (2), or even to reverse its polarity, as happens in (3) (e.g. agile, "agile", is positive). In this regard, it is interesting to notice that 84% of the idioms have a clear SO, while just 36% of the adjectives they contain are polarised (footnote 19).

(2) Mary è bianca [0] come un cadavere [–2] "Mary is as white as a dead body" (Mary is pale)
(3) Mary è agile [+2] come una gatta di piombo [–2] "Mary is as agile as a lead cat" (Mary is not agile)

5.2 Negation

As regards negation, we included in our grammar negation operators (e.g. non "not"; mica, per niente, affatto "not at all"), negative quantifiers (e.g. nessuno "nobody"; niente, nulla "nothing") and lexical negation (e.g. senza "without"; mancanza di, assenza di, carenza di "lack of") [7]. As exemplified in the following sentences, negation indicators do not always turn a sentence's polarity into its positive or negative counterpart (4); they often have the effect of increasing or decreasing the sentence score (5). That is why we prefer to talk about valence "shifting" rather than "switching".

(4) Citroen non [neg] produce auto valide [+2] [–2] "Citroen does not produce efficient cars"
(5) Grafica non proprio [neg] spettacolare [+3] [–2] "The graphics are not quite spectacular"

Footnote 19: Other idioms included in our resources are of the kind N0 essere (Agg + Ppass) Prep C1 (e.g. Max è matto da legare, "Max is so crazy he should be locked up"); N0 essere Agg e Agg (e.g. Max è bello e fritto, "Max is cooked"); C0 essere Agg (come C1 + E) (e.g. Mary ha la coscienza sporca ↔ La coscienza è sporca, "Mary has a guilty conscience" ↔ "The conscience is guilty"); and N0 essere C1 Agg (e.g. Mary è una gatta morta, "Mary is a cock tease").

5.3 Intensification

We included the intensification rules in our grammar network, firstly, by combining the words belonging to the strength scale (tags FORTE/DEB) with the sentiment words listed in the evaluation scale (tags POS/NEG) (footnote 20). Besides, the repetition of more than one negative or positive word, or the use of absolute superlative affixes, also has the effect of increasing a word's prior polarity. In general, adverbs intensify or attenuate adjectives, verbs and other adverbs, while adjectives modify the intensity of nouns. Intensification and negation can also appear together in the same sentence.

5.4 Modality

According to [7], modality can be used to express possibility, necessity, permission, obligation or desire, through grammatical cues such as adverbial phrases (e.g. "maybe", "certainly"), conditional verbal moods, some verbs (e.g. "must", "can", "may") and some adjectives and nouns (e.g. "a probable cause"). When computing the prior polarities of the SentIta items in their textual context, we considered that modality can also have a significant impact on the SO of sentiment expressions. Following the literature, but without specifically focusing on the modality categories of [7], we included in the FSA dedicated to modality the following linguistic cues and made them interact with the SentIta expressions: sharpening and softening adverbs; modal verbs; and conditional and imperfect tenses. Examples of modality in our work are the following:

– "Potere" + Indicative Imperfect + Oriented Item:
(6) Poteva [Modal+IM] essere una trama interessante [+2] [–1] "It could be an interesting plot"
– "Potere" + Indicative Imperfect + Comparative + Oriented Items:
(7) Poteva [Modal+IM] andare peggio [I-OpW +2] [–1] "It might have gone worse"
– "Dovere" + Indicative Imperfect:
(8) Questo doveva [Modal+IM] essere un film di sfumature [+1] [–2] "This one was supposed to be a nuanced movie"
– "Dovere" + "Potere" + Past Conditional:
(9) Non [Negation] avrei [Aux+C] dovuto [Modal+PP] buttare via i miei soldi [–2] "I should not have burnt my money"

Footnote 20: Words that, at first glance, seem to be intensifiers but on deeper analysis reveal a more complex behavior are abbastanza "enough", troppo "too much" and poco "not much". In this research we also noticed that the co-occurrence of troppo, poco and abbastanza with polar lexical items can provoke, in their semantic orientation, effects that can be associated with other contextual valence shifters. The ad hoc rules dedicated to these words are not actually new, but refer to other contextual valence shifting rules discussed in this section.

5.5 Comparison

Sentences that express a comparison generally carry opinions about two or more entities, with regard to their shared features or attributes [23]. As far as comparative sentences are concerned, we considered in this work the already mentioned comparative frozen sentences of the type N0 Agg come C1; some simple comparative sentences that involve the expressions meglio di, migliore di "better than", peggio di, peggiore di "worse than", superiore a "superior to", inferiore a "less than"; and the comparative superlative. The comparison with other products has been evaluated with the same measures as the other sentiment expressions, so the polarity can range from –3 to +3.

5.6 Other Sentiment Expressions

In order to reach high levels of Recall, the lexicon-based patterns also require the support of lexicon-independent expressions. In our work, we listed and computed many cases in which expressions that do not involve the words contained in our dictionaries are sentiment indicators as well. This is a case in which one can see the importance of finite-state automata: without them it would be really difficult and uneconomical for a programmer to provide the machine with concise instructions to correctly recognise and evaluate these kinds of opinionated sentences, which can reach high levels of variability. Examples of patterns of this kind are valerne la pena [+2] "to be worthwhile"; essere (dotato + fornito + provvisto) di [+2] "to be equipped with"; grazie a [+2] "thanks to"; essere un (aspetto + nota + cosa + lato) negativo [–2] "to be a negative side"; non essere niente di che [–1] "to be nothing special"; tradire le (aspettative + attese + promesse) [–2] "not to live up to one's expectations"; etc. For simplicity, in the present work we placed in this node of the grammar the sentences that involve frozen or semi-frozen expressions and words that, for the moment, are not part of the dictionaries.

6 Conclusion

In this paper we gave our contribution to two of the most challenging tasks of the Sentiment Analysis field: lexicon propagation and sentence-level semantic annotation. The necessity to quickly monitor huge quantities of semistructured and unstructured data from the web poses several challenges to Natural Language Processing, which must provide strategies and tools to analyze their structures from the lexical, syntactic and semantic points of view. Unlike many other Italian and English sentiment lexicons, SentIta, built on the interaction of electronic dictionaries and lexicon-dependent local grammars, is able to manage simple and multiword structures that can take the shape of distributionally free structures, distributionally restricted structures and frozen structures.


In line with the major contributions in the Sentiment Analysis literature, we did not consider polar words in isolation. We computed their elementary sentence contexts, with the allowed transformations, and, then, their interaction with contextual valence shifters, the linguistic devices that are able to modify the prior polarity of the words from SentIta when occurring with them in the same sentences. In order to do so, we took advantage of the computational power of finite-state technology. We formalized a set of rules that cover intensification, downtoning and negation modeling, modality detection and the analysis of comparative forms. Here, the difference with other state-of-the-art strategies consists in the elimination of complex mathematical calculation in favor of the easier use of embedded graphs as containers for the expressions designed to receive the same annotations in a compositional framework.

References

1. Andreevskaia, A., Bergler, S.: When specialists and generalists work together: overcoming domain dependence in sentiment tagging. In: ACL, pp. 290–298 (2008)
2. Argamon, S., Bloom, K., Esuli, A., Sebastiani, F.: Automatically determining attitude type and force for sentiment analysis, pp. 218–231 (2009)
3. Balibar-Mrabti, A.: Une étude de la combinatoire des noms de sentiment dans une grammaire locale. Langue française, pp. 88–97 (1995)
4. Baroni, M., Vegnaduzzo, S.: Identifying subjective adjectives through web-based mutual information, vol. 4, pp. 17–24 (2004)
5. Basile, V., Nissim, M.: Sentiment analysis on Italian tweets. In: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 100–107 (2013)
6. Benamara, F., Cesarano, C., Picariello, A., Recupero, D.R., Subrahmanian, V.S.: Sentiment analysis: adjectives and adverbs are better than adjectives alone. In: ICWSM (2007)
7. Benamara, F., Chardon, B., Mathieu, Y., Popescu, V., Asher, N.: How do negation and modality impact on opinions? pp. 10–18 (2012)
8. Bolioli, A., Salamino, F., Porzionato, V.: Social media monitoring in real life with Blogmeter platform. In: ESSEM@AI*IA 1096, pp. 156–163 (2013)
9. Dang, Y., Zhang, Y., Chen, H.: A lexicon-enhanced method for sentiment classification: an experiment on online product reviews. IEEE Intell. Syst. 25, 46–53 (2010)
10. Dasgupta, S., Ng, V.: Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 701–709. Association for Computational Linguistics (2009)
11. De Mauro, T.: Dizionario italiano. Paravia, Torino (2000)
12. Di Gennaro, P., Rossi, A., Tamburini, F.: The FICLIT+CS@UniBO system at the EVALITA 2014 sentiment polarity classification task. In: Proceedings of the Fourth International Workshop EVALITA 2014 (2014)
13. Dragut, E.C., Yu, C., Sistla, P., Meng, W.: Construction of a sentimental word dictionary, pp. 1761–1764 (2010)
14. Elia, A.: Le verbe italien. Les complétives dans les phrases à un complément (1984)


15. Elia, A.: Chiaro e tondo: Lessico-Grammatica degli avverbi composti in italiano. Segno Associati (1990)
16. Elia, A., Martinelli, M., D'Agostino, E.: Lessico e Strutture sintattiche. Introduzione alla sintassi del verbo italiano. Liguori, Napoli (1981)
17. Esuli, A., Sebastiani, F.: Determining the semantic orientation of terms through gloss classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 617–624. ACM (2005)
18. Esuli, A., Sebastiani, F.: Determining term subjectivity and term orientation for opinion mining, vol. 6, p. 2006 (2006)
19. Esuli, A., Sebastiani, F.: SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of LREC, vol. 6, pp. 417–422 (2006)
20. Fellbaum, C.: WordNet. Wiley Online Library (1998)
21. Gaeta, L.: Nomi d'azione. In: La formazione delle parole in italiano, pp. 314–351. Max Niemeyer Verlag, Tübingen (2004)
22. Gamon, M., Aue, A.: Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms, pp. 57–64 (2005)
23. Ganapathibhotla, M., Liu, B.: Mining opinions in comparative sentences. In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, pp. 241–248. Association for Computational Linguistics (2008)
24. Goldberg, A.B., Zhu, X.: Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 45–52. Association for Computational Linguistics (2006)
25. Gross, M.: Les bases empiriques de la notion de prédicat sémantique. Langages, pp. 7–52 (1981)
26. Gross, M.: Les phrases figées en français. L'information grammaticale, vol. 59, pp. 36–41. Peeters (1993)
27. Gross, M.: Une grammaire locale de l'expression des sentiments. Langue française, pp. 70–87 (1995)
28. Hassan, A., Radev, D.: Identifying text polarity using random walks, pp. 395–403 (2010)
29. Hatzivassiloglou, V., McKeown, K.R.: Predicting the semantic orientation of adjectives. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 174–181. Association for Computational Linguistics (1997)
30. Hernandez-Farias, I., Buscaldi, D., Priego-Sánchez, B.: IRADABE: adapting English lexicons to the Italian sentiment polarity classification task. In: First Italian Conference on Computational Linguistics (CLiC-it 2014) and the Fourth International Workshop EVALITA 2014, pp. 75–81 (2014)
31. Hu, M., Liu, B.: Mining and summarizing customer reviews, pp. 168–177 (2004)
32. Iacobini, C.: Prefissazione. In: La formazione delle parole in italiano, pp. 97–161. Max Niemeyer Verlag, Tübingen (2004)
33. Kaji, N., Kitsuregawa, M.: Building lexicon for sentiment analysis from massive collection of HTML documents. In: EMNLP-CoNLL, pp. 1075–1083 (2007)
34. Kamps, J., Marx, M., Mokken, R.J., De Rijke, M.: Using WordNet to measure semantic orientations of adjectives (2004)
35. Kanayama, H., Nasukawa, T.: Fully automatic lexicon expansion for domain-oriented sentiment analysis. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 355–363. Association for Computational Linguistics (2006)


36. Kanayama, H., Nasukawa, T.: Fully automatic lexicon expansion for domain-oriented sentiment analysis, p. 355 (2006)
37. Kang, H., Yoo, S.J., Han, D.: Senti-lexicon and improved naïve Bayes algorithms for sentiment analysis of restaurant reviews. Expert Syst. Appl. 39, 6000–6010 (2012)
38. Kennedy, A., Inkpen, D.: Sentiment classification of movie reviews using contextual valence shifters. Comput. Intell. 22(2), 110–125 (2006)
39. Kim, S.M., Hovy, E.: Determining the sentiment of opinions, p. 1367 (2004)
40. Ku, L.W., Huang, T.H., Chen, H.H.: Using morphological and syntactic structures for Chinese opinion analysis, pp. 1260–1269 (2009)
41. Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104, 211 (1997)
42. Li, F., Huang, M., Zhu, X.: Sentiment analysis with global topics and local dependency. In: AAAI (2010)
43. Maisto, A., Pelosi, S.: Feature-based customer review summarization. In: Meersman, R., et al. (eds.) OTM 2014. LNCS, vol. 8842, pp. 299–308. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45550-0_30
44. Maisto, A., Pelosi, S.: A lexicon-based approach to sentiment analysis. The Italian module for NooJ. In: Proceedings of the International NooJ 2014 Conference, University of Sassari, Italy. Cambridge Scholars Publishing (2014)
45. Maks, I., Vossen, P.: Different approaches to automatic polarity annotation at synset level, pp. 62–69 (2011)
46. Mathieu, Y.Y.: Les prédicats de sentiment. Langages, pp. 41–52 (1999)
47. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
48. Mohammad, S., Dunne, C., Dorr, B.: Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 2, pp. 599–608. Association for Computational Linguistics (2009)
49. Moilanen, K., Pulman, S.: Sentiment composition, pp. 378–382 (2007)
50. Moilanen, K., Pulman, S.: The good, the bad, and the unknown: morphosyllabic sentiment tagging of unseen words, pp. 109–112 (2008)
51. Mulder, M., Nijholt, A., Den Uyl, M., Terpstra, P.: A lexical grammatical implementation of affect, pp. 171–177 (2004)
52. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with diverse information sources. In: EMNLP, vol. 4, pp. 412–418 (2004)
53. Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network, vol. 193, pp. 217–250 (2012)
54. Neviarouskaya, A.: Compositional approach for automatic recognition of fine-grained affect, judgment, and appreciation in text (2010)
55. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: Compositionality principle in recognition of fine-grained emotions from text. In: ICWSM (2009)
56. Osgood, C.E.: The nature and measurement of meaning. Psychol. Bull. 49(3), 197 (1952)
57. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86. Association for Computational Linguistics (2002)


58. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual database. In: Proceedings of the First International Conference on Global WordNet, vol. 152, pp. 55–63 (2002)
59. Polanyi, L., Zaenen, A.: Contextual valence shifters, pp. 1–10 (2006)
60. Prabowo, R., Thelwall, M.: Sentiment analysis: a combined approach. J. Inf. 3, 143–157 (2009)
61. Qiu, G., Liu, B., Bu, J., Chen, C.: Expanding domain sentiment lexicon through double propagation, vol. 9, pp. 1199–1204 (2009)
62. Rainer, F.: Derivazione nominale deaggettivale. In: La formazione delle parole in italiano, pp. 293–314 (2004)
63. Rao, D., Ravichandran, D.: Semi-supervised polarity lexicon induction. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 675–682. Association for Computational Linguistics (2009)
64. Read, J., Carroll, J.: Weakly supervised techniques for domain-independent sentiment classification. In: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, pp. 45–52. ACM (2009)
65. Riloff, E., Wiebe, J., Wilson, T.: Learning subjective nouns using extraction pattern bootstrapping. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, pp. 25–32. Association for Computational Linguistics (2003)
66. Russo, I., Frontini, F., Quochi, V.: OpeNER sentiment lexicon Italian - LMF (2016). http://hdl.handle.net/20.500.11752/ILC-73. Digital repository for the CLARIN Research Infrastructure provided by ILC-CNR
67. Taboada, M., Anthony, C., Voll, K.: Methods for creating semantic orientation dictionaries. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genova, Italy, pp. 427–432 (2006)
68. Tan, S., Cheng, X., Wang, Y., Xu, H.: Adapting naive Bayes to domain adaptation for sentiment analysis. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 337–349. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00958-7_31
69. Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, pp. 417–424 (2002)
70. Turney, P.D., Littman, M.L.: Measuring praise and criticism: inference of semantic orientation from association. ACM Trans. Inf. Syst. (TOIS) 21, 315–346 (2003)
71. Velikovich, L., Blair-Goldensohn, S., Hannan, K., McDonald, R.: The viability of web-derived polarity lexicons. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 777–785. Association for Computational Linguistics (2010)
72. Vermeij, M.: The orientation of user opinions through adverbs, verbs and nouns. In: 3rd Twente Student Conference on IT, Enschede, June (2005)
73. Vietri, S.: The Italian module for NooJ. In: Proceedings of the First Italian Conference on Computational Linguistics, CLiC-it 2014. Pisa University Press (2014)
74. Vietri, S.: On some comparative frozen sentences in Italian. Lingvisticæ Investigationes 14(1), 149–174 (1990)
75. Vietri, S.: On a class of Italian frozen sentences. Lingvisticæ Investigationes 34(2), 228–267 (2011)
76. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 1, pp. 235–243. Association for Computational Linguistics (2009)

Sentiment Analysis Through Finite State Automata

197

77. Wang, X., Zhao, Y., Fu, G.: A morpheme-based method to Chinese sentence-level sentiment classification. Int. J. Asian Lang. Proc. 21(3), 95–106 (2011) 78. Wawer, A.: Extracting emotive patterns for languages with rich morphology. Int. J. Comput. Linguist. Appl. 3(1), 11–24 (2012) 79. Wiebe, J.: Learning subjective adjectives from corpora. In: AAAI/IAAI, pp. 735– 740 (2000) 80. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. In: Expert Systems with Applications, vol. 36, pp. 6527–6535. Elsevier (2009) 81. Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques, pp. 427–434 (2003)

Using Cognitive Learning Method to Analyze Aggression in Social Media Text

Sayef Iqbal and Fazel Keshtkar(B)

Science and Mathematics Department, St. John's University College of Professional Studies Computer Science, 8000 Utopia Parkway, Jamaica, NY 11439, USA
{sayef.iqbal16,keshtkaf}@stjohns.edu

Abstract. Aggression and hate speech are a rising concern on social media platforms and are drawing significant attention in the research community, which is investigating different methods to detect such content. Aggression, which can be expressed in many forms, is able to leave victims devastated and often scars them for life. Families and social media users prefer a safer platform to interact with each other, which is why the detection and prevention of aggression and hatred over the internet is a must. In this paper we extract different features from our social media data and apply supervised learning methods to understand which model produces the best results. We also analyze the features to understand whether there is any pattern in the features that is associated with aggression in social media data. We used state-of-the-art cognitive features to gain better insight into our dataset. We also employed n-gram, sentiment and part-of-speech features as a standard model to identify hate speech and aggression in text. Our model was able to identify texts that contain aggression with an f-score of 0.67.

Keywords: Hate speech · Aggression · Sentiment · Social media · Classification

1 Introduction

According to Wikipedia1, aggression is defined as the action or response of an individual who expresses something unpleasant to another person [4]. Needless to say, aggression on social media platforms has become a major factor in polarizing the community with hatred. Aggression can take the form of harassment, cyberbullying, hate speech and even taking jabs at one another. It is growing as more and more users join social networks. Around 80% of teenagers use social media nowadays, and one in three young people have been found to be victims of cyberbullying [25]. The rise of smartphones and smart devices and the ease of use of social media platforms have led to the spread of aggression over the internet [8].

1 https://en.wikipedia.org/wiki/Aggression, accessed 11/22/2018.


Recently, social media giants like Facebook and Twitter took some action and have been investigating this issue (i.e. deleting suspicious accounts). However, there is still a lack of sophisticated algorithms which can automatically detect these problems. Hence, more investigation needs to be done in order to address this issue at a larger scale. On the other hand, due to the subjectivity of aggression and the hate associated with it, this problem has been challenging as well. Therefore, an automatic detection system serving as a front line of defense against such aggressive texts will be useful to minimize the spread of hatred across social media platforms, and it can help to maintain a healthy online environment.

This paper focuses on generating a binary classification model for analyzing any pattern from the 2018 shared task TRAC (Trolling, Aggression and Cyberbullying) dataset [12]. The data was initially annotated into three categories as follows:

– Non Aggression (NAG): there is no aggression in the text
– Overtly Aggressive (OAG): the text contains openly aggressive lexical features
– Covertly Aggressive (CAG): the text contains aggression without open acknowledgement of aggressive lexical features

Examples of each category are shown in Table 1.

Table 1. Some examples with original labels and our modified labels.

Text example                                                       Original label  New label
Cows are definitely gonna vote for Modi ji in 2019 ;)              CAG             AG
Don't u think u r going too far u Son of a B****........#Nigam     OAG             AG
Happy Diwali.!! let's wish the next one year health, wealth n      NAG             NAG
growth to our Indian economy.

To analyze the aggression patterns, in this paper, we focus on building a classification model using the Non Aggression (NAG) and Aggression (AG) classes. We combine the overlapping OAG and CAG categories into the AG category from the initial dataset. In this research, we investigate a combination of features such as word n-grams, LIWC, part of speech and sentiment polarity. We also applied different supervised learning algorithms to evaluate our model. While most of the supervised learning methods produced promising results, the Random Forest classifier produced the best accuracy (68.3%) and f-score (0.67) while also producing a state-of-the-art true-positive rate of 83%. Moreover, all the classifiers produced results with greater accuracy and precision for our proposed binary classes (AG and NAG) than for the initial three classes (NAG, CAG and OAG). We also analyzed the n-gram and LIWC features that were used for model building and found that they mostly affirm the presence of non-aggressive content in texts. This paper serves to lay the ground for our future work, which is to identify what differentiates OAG from CAG.


The rest of the paper is organized as follows: the Related Work section gives a brief overview of the research already done in this area. The Methodology section describes our methodology and the details of the dataset, pre-processing steps, feature extraction and the algorithms that were used. The Experiments and Results section presents the experiments and results of the proposed model, and finally, the conclusion and future work are discussed in the Conclusion and Future Work section.

2 Related Work

Several studies have been done in order to detect the aggression level in social media texts [5]. Some research focuses on labelling texts as expressing either a positive or a negative opinion about a certain topic. Raja and Swamynathan, in their research, analyzed sentiment from tweet posts using a sentiment corpus to score sentimental words in the tweets [15]. They proposed a system which tags any sentimental word in a tweet and then scores the word using SentiWordNet's word list and sentimental relevance scoring using an estimator method. The system produced promising sentiment values for words. However, the research did not focus on the analysis of the lexical features or sentimental words, such as how often they appear in a text and what kind of part of speech the word(s) belong to. Samghabadi et al. analyzed data for both the Hindi and English languages by using a combination of lexical and semantic features [19]. They used a Twitter dataset for training and testing purposes and applied supervised learning algorithms to identify texts as being Non Aggressive, Covertly Aggressive or Overtly Aggressive. They used lexical features such as word n-grams, character n-grams, k-skip n-grams and tf-idf word transformation. For word embeddings, they employed Word2vec [24] and also used Stanford's sentiment tool [21] to measure the sentiment scores of words. They also used LIWC to analyze the texts from tweets and Facebook comments [22]. Finally, they used binary calculation to identify the gender probability to produce an effective linguistic model. They were able to achieve an f-score of 0.5875 after applying classifiers on their linguistic feature model. In contrast to this research, our system produced results with a higher f-score (0.67), even though we used a different feature set and employed a different approach for supervised learning and analysis. On the other hand, Roy et al. [17] used Convolutional Neural Networks (CNN) [11] and Support Vector Machine (SVM) classifiers on a pre-processed dataset of tweets and Facebook comments to classify the data. They employed the pre-processing technique of removing URLs and usernames from the text using regular expressions. They used an ensemble approach combining CNN and SVM for classifying their data. In contrast to our research, which produced better results when using the Random Forest classifier, the performance of their system improved when SVM was used on unigram and tf-idf features along with a CNN with a kernel size of 2 x embedding size. The system was able to classify the social media posts with an f-score of 0.5099.


On a different note, Sharma et al. proposed a degree-based classification of harmful speech that is often manifested in posts and comments on social media platforms [20]. They extracted bag-of-words and tf-idf features from pre-processed Facebook posts and comments that were annotated subjectively by three different annotators. They applied Naive Bayes, Support Vector Machine and Random Forest classifiers to their model. Random Forest worked best on their model and gave results with an accuracy of 76.42%. Van Hee et al. explored the territory of cyberbullying, a branch of aggression, in social media content [23]. Cyberbullying can severely affect the confidence, self-esteem and emotional well-being of a victim, especially among the youth. They propose a linear SVM supervised machine learning method for detecting cyberbullying content in social media by exploring a wide range of features in English and Dutch corpora. The detection system provides a quantitative analysis of texts as a way to signal cyberbullying events on social media platforms. They performed a binary classification experiment for the automatic detection of cyberbullying texts, with an f-score of 64.32% for the English corpus. Unlike our research, [23] did not employ sentiment or use any psycholinguistic features in their supervised learning methods. However, our system was able to produce a slightly better f-score result, even though we use a different dataset. Sahay et al. in their research address the negative aspects of online social interaction. Their work is based on the detection of cyberbullying and harassment on social media platforms [18]. They perform classification analysis of labelled textual content from posts and comments that could possibly contain cyberbullying, trolling, sarcastic and harassment content. They build their classification model based on n-gram features, as opposed to our system, in which we consider other features like part of speech alongside n-gram features. They apply different machine learning algorithms to evaluate their feature-engineering process, generating scores between 70–90% for the training dataset. Similarly, Reynolds et al. perform lexical feature extraction on a labelled dataset that was collected by web crawling and contained posts mainly from teenagers and college students [16]. They proposed a binary classification of 'yes' or 'no' for posts from 18,554 users of the Formspring.me website that may or may not contain cyberbullying content. They apply different supervised learning methods to their extracted features and found that J48 produced the best true-positive accuracy of 61.6% and an average accuracy of 81.7%. While the researchers were able to obtain results with better accuracy, they do not analyze the texts themselves, which is a strong focal point of our research. Dinakar et al. propose a topic-sensitive classifier to detect cyberbullying content, using 4,500 comments from YouTube to train and test their sub-topic classification models [7]. The sub-topics included sexuality, race and culture, and intelligence. They used tf-idf, Ortony lexicons, a list of profane words, part-of-speech tagging, and topic-specific unigrams and bigrams as their features. Although they applied multiple classifiers on their feature model, SVM produced


the most reliable results, with a kappa value above 0.7 for all topic-sensitive classes, and JRip produced the most accurate results for all the classes. They found that building label-specific classifiers was more effective than multiclass classifiers at detecting cyberbullying-sensitive messages. Chen et al. also propose a Lexical Syntactic Feature (LSF) architecture to detect the use of offensive language on social media platforms [6]. They included a user's writing style, structure, and lexical and semantic features of the content of the texts, among many others, to identify the likelihood of a user posting offensive content online. They achieved a precision of 98.24% and a recall of 94.34% in sentence offensiveness detection using LSF as a feature in their modelling. They applied both Naive Bayes and SVM classification algorithms, with SVM producing the best accuracy in classifying the offensive content. However, in this paper, we propose a new model which, to the best of our knowledge, has not been used in previous research. We build a model with a combination of psycholinguistic, semantic, word n-gram, part of speech and other lexical features to analyze aggression patterns in the dataset. The Methodology section explains the details of our model.

3 Methodology

In this section we discuss the details of our methodology: the dataset, pre-processing, feature extraction, and the algorithms that have been used in this model. The data was collected from the TRAC shared task [12].

3.1 Dataset

The dataset was collected from the TRAC (Trolling, Aggression and Cyberbullying) 2018 workshop [12], held in August 2018 in NM, USA. TRAC focuses on investigating online aggression, trolling, cyberbullying and other related phenomena. The workshop aimed to create a platform for academic discussions on this problem, based on previous joint work that the organizers had done as part of a project funded by the British Council. Our dataset was part of the workshop's English data, which comprised 11,999 Facebook posts and comments, with 6,941 comments labelled as Aggressive and 5,030 as Non-aggressive. The comments were annotated subjectively into the three categories NAG, CAG and OAG by the research scientists and reviewers who organized the workshop. We decided to use the binary classes AG and NAG for these texts. Figure 1 illustrates the distribution of the categories of aggression in the texts. We considered the complete dataset for analysis and model building. The corpus is code-mixed, i.e., it contains texts in English and Hindi (written in both Roman and Devanagari script). However, for our research, we only considered English text written in Roman script. Our final dataset, excluding Devanagari script, contained 11,999 Facebook comments.


Fig. 1. Distribution of the dataset: OAG (22.6%), CAG (35.3%) and NAG (42.1%)

3.2 Pre-processing

Pre-processing is the technique of cleaning and normalizing data, which may consist of removing less important tokens, words, or characters in a text, such as 'a', 'and', '@', and lowercasing capitalized words like 'APPLE'. The texts contained several unimportant tokens, for instance, URLs, numbers, HTML tags, and special characters, which caused noise in the text for analysis. We first cleaned the data by employing NLTK's (Natural Language Toolkit) [2] stemmer and stopwords packages. Table 2 illustrates the transformation of a text before and after pre-processing.

Table 2. Text before and after pre-processing

Before  Respect all religion sir, after all we all have to die, and after death there will be no disturbance and will be complete silence
After   Respect religion sir die death disturbance complete silence
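As a rough illustration of this cleaning step, the sketch below combines NLTK's stopword list and a stemmer with simple regular-expression filters. The exact stemmer, stopword list, and cleaning rules used by the authors are not specified in detail, so the particulars are assumptions for illustration only.

```python
# Illustrative pre-processing sketch (requires the NLTK 'punkt' and 'stopwords' data packages).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # drop numbers and special characters
    tokens = [t for t in word_tokenize(text) if t not in STOPWORDS]
    return " ".join(STEMMER.stem(t) for t in tokens)

print(preprocess("Respect all religion sir, after all we all have to die."))
```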

3.3 Feature Extraction

In this section we describe the features that we extracted from the dataset. We extracted various features; however, for the sake of this specific research, we only consider the following features due to their better performance in our final model. We adopted the following features: part of speech, n-grams (unigrams, bigrams, trigrams), tf-idf, sentiment polarity, and LIWC's psycholinguistic features. Figure 2 illustrates the procedure that was adopted in the process of feature extraction to build a model for supervised learning.

Part-of-Speech Features. Part-of-speech (PoS) tags are classes or lexical categories assigned to words based on their similar grammatical and syntactic properties. Typical examples of parts of speech include adjectives, nouns, adverbs, etc. PoS tags help us to identify or tag words into certain categories and find any patterns they create with regard to aggression and non-aggression texts.


Fig. 2. Feature extraction architecture of the system.

For the purposes of this research, we applied NLTK's [2] part-of-speech tagging package on our dataset to count the occurrences of PoS tags in each text. This led to the extraction of 24 categories of words. For instance, extracting PoS tags from the text 'respect religion sir die death disturbance complete silence' leaves us with 'respect': NN, 'religion': NN, 'sir': NN, 'die': VBP, 'death': NN, 'disturbance': NN, 'complete': JJ, 'silence': NN, where NN is the tag for a noun and JJ and VBP are the tags for an adjective and a verb in non-3rd person singular present form, respectively.

N-gram Features. A language model or n-grams in natural language processing (NLP) refers to a sequence of n items (words, tokens) from texts. Typically, n refers to the number of words in a sequence gathered from a text after applying text processing techniques. N-grams are commonly used in NLP for developing supervised machine learning models [3]. They help us identify which words tend to appear together more often than not. We used the Weka [9] tool to extract unigram, bigram and trigram word features from these texts. We utilized Weka's Snowball stemmer to stem the words for standard cases and the Rainbow stopword list to further remove any potential stop words. We considered the tf-idf score as the value of each word n-gram instead of its frequency. Over 270,000 tokens were extracted after n-gram feature extraction. We also employed Weka's built-in ranker algorithm to identify which features contribute most towards the correct classification of the texts. This helped us understand which words were most useful and related to our annotated classes. We considered only the top 437 items for further analysis; features ranked below 437 were dropped as they were barely of any relevance as per the ranker algorithm. Table 3 illustrates some examples of unigrams,


bigrams and trigrams after applying n-gram feature extraction on the text 'respect religion sir die death disturbance complete silence'.

Table 3. Examples of n-gram features

n-gram   Example of n-gram tokens
unigram  Respect, religion, disturbance
bigram   Respect religion, disturbance complete
trigram  Respect religion sir, die death disturbance
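The authors build this n-gram pipeline in Weka (Snowball stemmer, Rainbow stopwords, tf-idf values, ranker-based selection of the top 437 terms). The scikit-learn sketch below is only a stand-in that approximates the same idea of tf-idf-weighted word 1-3-grams followed by selection of the top-ranked features; it is not the authors' implementation, and the toy texts and labels are invented for illustration.

```python
# Approximate stand-in for the Weka-based n-gram feature extraction described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = [
    "respect religion sir die death disturbance complete silence",
    "don t u think u r going too far",
]
labels = ["NAG", "AG"]  # binary classes used in the paper

# tf-idf weighted unigrams, bigrams and trigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(texts)

# Keep only the highest-ranked n-grams (the paper keeps the top 437 Weka-ranked items).
selector = SelectKBest(chi2, k=min(437, X.shape[1]))
X_top = selector.fit_transform(X, labels)
```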

Sentiment Features. Sentiment features are used to analyze any opinion expressed in texts as having a positive, negative or neutral emotion [10]. Sentiment analysis, especially of social media texts, is an important technique to monitor public opinion on a topic. It helps to understand the opinion expressed in a text by evaluating the sentiment value of each word and of the overall text. We used TextBlob [1] to evaluate the sentiment polarity score of each pre-processed word and of the text as a whole. TextBlob provides easy access to common text-processing operations. The package converts sentences to a list of words and performs word-level sentiment analysis to give a sentiment polarity score for each text. Sentiment polarity is a floating-point number ranging from -1.0 to 1.0. A number closer to -1.0 is an expression of negative opinion and a number closer to 1.0 is an expression of positive opinion. We keep track of the document id and the corresponding sentiment polarity score as a feature. For instance, the text 'respect religion sir die death disturbance complete silence' produced a sentiment polarity score of -0.10 with a subjectivity of -0.40. However, we only consider the polarity score as a feature.

Linguistic Inquiry and Word Count Features. LIWC (Linguistic Inquiry and Word Count) performs computerized text analysis to extract psycholinguistic features from texts [14]. We utilized the LIWC 2015 psychometric text analyzer [13] in order to gain insight into our data. The features provide an understanding of textual content by scoring and labelling text segments according to its many categories. In this research we applied Weka's ranking algorithm to the LIWC features to rank the most significant and useful LIWC features that contributed most towards classifying the texts as AG or NAG. We found 12 such LIWC features which were crucial in our analysis and which produced the best accuracy and f-score for our classifiers. Figure 3 illustrates the distribution of these psycholinguistic features among the 11,999 Facebook comments. Each document may contain one or more of these cognitive features.
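A minimal sketch of the TextBlob-based sentiment polarity feature described above is shown below; only the polarity score is retained, as in the paper.

```python
# Sentiment polarity feature via TextBlob (polarity lies in [-1.0, 1.0]).
from textblob import TextBlob

def sentiment_polarity(text: str) -> float:
    return TextBlob(text).sentiment.polarity

print(sentiment_polarity("respect religion sir die death disturbance complete silence"))
```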


Fig. 3. Psycholinguistic category distribution using LIWC

4 Experiments and Results

4.1 Experimental Setup

In this section we evaluate the performance of our model using supervised learning algorithms. We report the accuracy and f-score of different supervised learning methods on the models that were created using the features explained in the previous sections. We also evaluate the validity of our models and identify vital features and patterns that caused high and low performance in our system.

4.2 Results

We considered different combinations of features to build the best possible model that could eventually lead to higher performance. We ran various algorithms, such as Support Vector Machine and Random Forest, on different combinations of features. Some combinations of features performed better than others, and we picked the one that produced the best result. We noted from our results that Random Forest produced better results. Table 4 shows the results obtained by applying these classifiers to different combinations of features. We kept n-gram words as our gold-standard feature in model building and then applied different combinations of the other features. The features that were used in model building were Unigram (U), Bigram (B), Trigram (T), Sentiment Polarity (SP), Part of Speech (PoS) and LIWC (LIWC). We applied the different classifiers using both 10-fold cross-validation and 66% of the data for training, with multiple combinations of these features.


Table 4. F-score of classifiers using 66% data for training

Feature                SVM     Random Forest
U+B+T                  0.6100  0.6340
U+B+T+SP               0.6360  0.6500
U+B+T+SP+LIWC          0.6410  0.6450
U+B+T+SP+LIWC+PoS      0.6450  0.6700

The best results were obtained when considering U, B, T, SP, PoS and LIWC as features and using 66% of the data for training and 33% for testing. Figure 4 shows the confusion matrix of our model for both classes, AG and NAG. The confusion matrix was generated by applying the Random Forest classifier trained on 66% of the data and evaluated on the testing set, using the U+B+T+SP+PoS+LIWC features in the model. Interestingly, according to the confusion matrix, upon applying the Random Forest classifier to the model, 1,930 out of 2,351 of the AG class texts in the test set were identified correctly. This leads us to understand that the true-positive rate for aggression in texts was 83%, which is extremely promising.

Fig. 4. Confusion matrix of random forest classifier.

We also found that the sentiment polarity scores for texts were evenly distributed between both classes, even though sentiment polarity was evaluated as a vital feature


by the ranking algorithm. It was among the top features ranked by the ranker algorithm, which contributed towards higher accuracy and f-score. We also found that some of the words happened to exist in both unigrams and bigrams, for instance, 'loud' in 'loudspeaker' and 'loud noise'. This leads us to understand that those words are key in classifying the texts. When considering the word n-gram feature, there were very few bigrams in the model, as it mostly comprised unigrams and contained words related to religion and politics. Also, most words in texts that were annotated as aggressive were adjectives and nouns.

4.3 Discussion and Analysis

A common issue with the dataset was that it often contained either abbreviated or unknown words and phrases which could not be extracted by using any of the lexicons. Hence, these words and phrases were left out of our analysis. Also, some texts contained either stop words or a mixture of stop words and emoticons, which led to the removal of all of the content upon pre-processing. Performing pre-processing on the text 'hare pm she q ni such or the hare panic mr the h unto ab such ran chalice' led to the removal of the whole text. This also prevented us from further analyzing the text, even though it may potentially have had some aggressive lexical features or emoticons. However, because emoticons can be placed sarcastically in texts, we did not consider them as a feature in our model. There were some texts which comprised non-English words. The words in these texts switched between English and other languages, which made our analysis difficult, as it was solely intended for an English corpus. Some words like 'dadagiri', which means 'bossy' in a Hindi context, were not transliterated, which is why the semantics of the text could not be captured. The sentiment polarity score for the text 'chutiya rio hittel best mobile network india' was 0.0, where clearly it should have been scored below 0.0, as it contained a strong negative word in another language (Hindi in Roman script).

Analyzing Result Data. Adjectives (JJ), verbs in non-3rd person singular present form (VBP) and nouns (NN) were among the prominent parts of speech of the n-gram words that we extracted. Figure 5 illustrates the distribution of parts of speech in our n-gram word features. Sentiment Polarity (SP) was among the top features ranked by Weka; as a feature, SP identified 6,094 texts correctly and 5,905 texts incorrectly. Out of the 6,094 that were correctly identified, only 1,861 texts were labelled as AG and 4,233 as NAG. Figure 6 illustrates the distribution of AG and NAG labels after performing sentiment polarity analysis on the texts. Also, of the top 434 n-gram words ranked by Weka, SP identified 396 n-gram words as NAG and only 37 as AG. This clearly indicates that sentiment polarity is a good feature for identifying NAG texts.


Fig. 5. Part of speech tagging of n-gram features

Fig. 6. Comparison of sentimental polarity feature

5 Conclusion and Future Work

In this paper, we propose an approach to detect aggression in texts. We tried to understand patterns in both AG and NAG class texts based on part of speech and sentiment. The model produced promising results, as it helps us to make a clear distinction between texts that contain aggression and those that do not. Our system architecture also adapted well to the feature extraction process for aggression detection. For future work, we plan to use more lexicon features for sentiment analysis in order to further improve the accuracy and f-score of our model's classification. We also plan to use hashtags and emoticons, which we think will be promising features. These features will help us to identify more important words and content from texts that were not detected. We would also like to investigate the subdomains of aggression, Covertly Aggressive and Overtly Aggressive content, and identify distinguishing factors between them.


References

1. Bagheri, H., Islam, M.J.: Sentiment analysis of Twitter data. arXiv preprint arXiv:1711.10377 (2017)
2. Bird, S., Loper, E.: NLTK: the Natural Language Toolkit. In: Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, p. 31. Association for Computational Linguistics (2004)
3. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
4. Buss, A.H.: The psychology of aggression (1961)
5. Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., Vakali, A.: Mean birds: detecting aggression and bullying on Twitter. In: Proceedings of the 2017 ACM on Web Science Conference, pp. 13–22. ACM (2017)
6. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pp. 71–80. IEEE (2012)
7. Dinakar, K., Reichart, R., Lieberman, H.: Modeling the detection of textual cyberbullying. Soc. Mob. Web 11(02), 11–17 (2011)
8. Görzig, A., Frumkin, L.A.: Cyberbullying experiences on-the-go: when social media can become distressing. Cyberpsychology 7(1), 4 (2013)
9. Holmes, G., Donkin, A., Witten, I.H.: Weka: a machine learning workbench. In: Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, pp. 357–361. IEEE (1994)
10. Keshtkar, F., Inkpen, D.: Using sentiment orientation features for mood classification in blogs. In: 2009 International Conference on Natural Language Processing and Knowledge Engineering, pp. 1–6 (2009). https://doi.org/10.1109/NLPKE.2009.5313734
11. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
12. Kumar, R., Reganti, A.N., Bhatia, A., Maheshwari, T.: Aggression-annotated corpus of Hindi-English code-mixed data. arXiv preprint arXiv:1803.09402 (2018)
13. Pennebaker, J.W., Boyd, R.L., Jordan, K., Blackburn, K.: The development and psychometric properties of LIWC2015. Technical Report (2015)
14. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Assoc. 71(2001), 2001 (2001)
15. Raja, M., Swamynathan, S.: Tweet sentiment analyzer: sentiment score estimation method for assessing the value of opinions in tweets. In: Proceedings of the International Conference on Advances in Information Communication Technology & Computing, p. 83. ACM (2016)
16. Reynolds, K., Kontostathis, A., Edwards, L.: Using machine learning to detect cyberbullying. In: Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, vol. 2, pp. 241–244. IEEE (2011)
17. Roy, A., Kapil, P., Basak, K., Ekbal, A.: An ensemble approach for aggression identification in English and Hindi text. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pp. 66–73 (2018)
18. Sahay, K., Khaira, H.S., Kukreja, P., Shukla, N.: Detecting cyberbullying and aggression in social commentary using NLP and machine learning. People (2018)
19. Samghabadi, N.S., Mave, D., Kar, S., Solorio, T.: RiTUAL-UH at TRAC 2018 shared task: aggression identification. arXiv preprint arXiv:1807.11712 (2018)


20. Sharma, S., Agrawal, S., Shrivastava, M.: Degree based classification of harmful speech using Twitter data. arXiv preprint arXiv:1806.04197 (2018)
21. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
22. Tausczik, Y., Pennebaker, J.: The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29, 24–54 (2010)
23. Van Hee, C., et al.: Automatic detection of cyberbullying in social media text. arXiv preprint arXiv:1801.05617 (2018)
24. Wang, H.: Introduction to word2vec and its application to find predominant word senses. http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf (2014)
25. Zainol, Z., Wani, S., Nohuddin, P., Noormanshah, W., Marzukhi, S.: Association analysis of cyberbullying on social media using apriori algorithm. Int. J. Eng. Technol. 7, 72–75 (2018). https://doi.org/10.14419/ijet.v7i4.29.21847

Opinion Spam Detection with Attention-Based LSTM Networks

Zeinab Sedighi1, Hossein Ebrahimpour-Komleh2, Ayoub Bagheri3, and Leila Kosseim4(B)

1 Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
[email protected]
2 Department of Computer Engineering, University of Kashan, Kashan, Islamic Republic of Iran
[email protected]
3 Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands
[email protected]
4 Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
[email protected]

Abstract. Today, online reviews have a great influence on consumers' purchasing decisions. As a result, spam attacks, consisting of the malicious inclusion of fake online reviews, can be detrimental to both customers and organizations. Several methods have been proposed to automatically detect fake opinions; however, the majority of these methods focus on feature learning techniques based on a large number of handcrafted features. Deep learning and attention mechanisms have recently been shown to improve the performance of many classification tasks, as they enable the model to focus on the most important features. This paper describes our approach to applying LSTM and attention-based mechanisms for detecting deceptive reviews. Experiments with the Three-Domain data set [15] show that a BiLSTM model coupled with Multi-Headed Self Attention improves the F-measure from 81.49% to 87.59% in detecting fake reviews.

Keywords: Deep learning · Attention mechanisms · Natural language processing · Opinion spam detection · Machine learning

1 Introduction

Due to the increasing public reliance on social media for decision making, companies and organizations regularly monitor online comments from their users in order to improve their business. To assist in this task, much research has addressed the problems of opinion mining [2,22]. However, the ease of sharing comments and experiences on specific topics has also led to an increase in fake review attacks (or spam) by individuals or groups. In fact, it is estimated that as much as one-third


of opinion reviews on the web are spam [21]. These, in turn, decrease the trustworthiness of all online reviews for both users and organizations. Manually discerning spam reviews from non-spam ones has been shown to be both difficult and inaccurate [17]; therefore, developing automatic approaches to detect review spam has become a necessity. Although automatic opinion spam detection has been addressed by the research community, it still remains an open problem. Most previous work on opinion spam detection has used classic supervised machine learning methods to distinguish spam from non-spam reviews. Consequently, much attention has been paid to learning appropriate features to increase the performance of the classification. In this paper we explore the use of an LSTM-based model that uses an attention mechanism to learn representations and features automatically to detect spam reviews. All the deep learning models tested obtained significantly better performance than traditional supervised approaches, and the BiLSTM + Multi-Headed Self Attention yielded the best F-measure of 87.59%, a significant improvement over the current state of the art. This article is organized as follows. Section 2 surveys related work in opinion spam review detection. Our attention-based model is then described in Sect. 3. The experiments are described in Sect. 4 and the results are presented and discussed in Sect. 5. Finally, Sect. 6 proposes future work to improve our model.

2 Related Work

According to [7], spam reviews can be divided into three categories: 1) untruthful reviews which deliberately affect user decisions, 2) reviews whose purpose is to advertise specific brands, and 3) non-reviews which are irrelevant. Types 2 and 3 are easier to detect, as the topic of the spam differs significantly from truthful reviews; however, type 1 spam is more difficult to identify. This article focuses on type 1 reviews, which try to mislead users using topic-related deceptive comments.

2.1 Opinion Spam Detection

Much research has been done on the automatic detection of spam reviews. Techniques employed range from unsupervised (e.g. [1]) and semi-supervised (e.g. [10,13]) to supervised methods (e.g. [9,14,27]), with a predominance of supervised methods. Most methods rely on human feature engineering and experiment with different classifiers and hyperparameters to enhance the classification quality. To train from more data, [17] generated an artificial data set and applied supervised learning techniques for text classification. [10] uses spam detection for text summarization, and [10] applies Naive Bayes, logistic regression and Support Vector Machine methods after feature extraction using Part-Of-Speech


tags and LIWC. To investigate cross-domain spam detection, [12] uses a data set consisting of three review domains to avoid dependency on a specific domain. They examine SVM and SAGE as classification models.

2.2 Deep Learning for Sentiment Analysis

Deep learning models have been widely used in natural language processing and text mining [16], for tasks such as sentiment analysis [20], co-reference resolution [5], POS tagging [19] and parsing [6], as they are capable of learning relevant syntactic and semantic features automatically, as opposed to hand-engineered features. Because long-term dependencies are prominent in natural language, Recurrent Neural Networks (RNNs), and in particular Long Short-Term Memory networks (LSTMs), have been very successful in many applications (e.g. [18,20,25]). [11] employed an RNN in parallel with a Convolutional Neural Network (CNN) to improve the analysis of sentiment phrases. [20] used an RNN to create sentence representations. [25] presented a context representation for relation classification using a ranking recurrent neural network.

2.3 Attention Mechanisms

Attention mechanisms have also shown much success in the last few years [3]. Using such mechanisms, neural networks are able to better model dependencies in sequences of information in texts, voices, videos, etc. [4,26], regardless of their distance. Because they learn which information from the sequence is most useful to predict the current output, attention mechanisms have increased the performance of many Natural Language Processing tasks [3,23]. An attention function maps an input sequence and a set of key-value pairs to an output. The output is calculated as a weighted sum of the values, where the weight assigned to each value is obtained using a compatibility function of the sequence and the corresponding key. In a vanilla RNN without attention, the model embodies all the information of the input sequence by means of the last hidden state. However, when applying an attention mechanism, the model is able to glance back at the entire input: not only by accessing the last hidden state but also by accessing a weighted combination of all input states. Several types of attention mechanisms have been proposed [23]. Self Attention, also known as intra-attention, aims to learn the dependencies between the words in a sentence and uses this information to model the internal structure of the sentence. Scaled Dot-Product Attention [24], on the other hand, calculates the similarity using a scaled dot product, dividing the dot product by the square root of the key dimension to keep the inner product from becoming too large. If this calculation is performed several times instead of once, the model can learn relevant information concurrently in different subspaces; this last model is called Multi-Headed Self Attention. In light of the success of these attention mechanisms, this paper evaluates the use of these techniques for the task of opinion spam detection.
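For reference, the scaled dot-product attention of [24] over queries Q, keys K and values V can be written as below, where d_k is the dimensionality of the keys; Multi-Headed Self Attention applies this computation several times in parallel with different learned projections and concatenates the results.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```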

3 Methodology

3.1 Attention-Based LSTM Model

Figure 1 shows the general model used in our experiments. The look-up layer maps words to a look-up table by applying word embeddings. In order to better capture the relations between distant words, the next layer uses LSTM units. In our experiments (see Sect. 4), we investigated both unidirectional and bidirectional LSTMs (BiLSTMs) [8]. We considered the use of one LSTM layer, one BiLSTM layer and two BiLSTM layers. In all cases we used 150 LSTM units in each layer, and training was performed with Back Propagation Through Time (BPTT) every 32 time steps, with a learning rate of 1e-3 and a dropout rate of 30%.

Fig. 1. General architecture of the attention-based models

The output of the LSTM layer is fed into the attention layer. Here, we experimented with two mechanisms: the Self Attention and Multi-Headed Self Attention mechanisms [24]. Finally, a Softmax layer is applied for the final binary classification.
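A minimal PyTorch sketch of this architecture is given below. The values stated in the text (150 LSTM units per layer, 30% dropout) are used; the embedding dimension, vocabulary size, number of attention heads and the mean-pooling step are illustrative assumptions, since they are not fixed here in the paper.

```python
# Sketch of the look-up -> BiLSTM -> multi-headed self-attention -> softmax pipeline.
import torch
import torch.nn as nn

class BiLSTMSelfAttention(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden=150,
                 num_heads=4, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)             # look-up layer
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attention = nn.MultiheadAttention(embed_dim=2 * hidden,     # requires PyTorch >= 1.9
                                               num_heads=num_heads,
                                               batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)              # (batch, seq, embed_dim)
        h, _ = self.bilstm(x)                      # (batch, seq, 2 * hidden)
        attended, _ = self.attention(h, h, h)      # self-attention over the LSTM states
        pooled = attended.mean(dim=1)              # aggregate over time steps
        return torch.softmax(self.classifier(self.dropout(pooled)), dim=-1)

model = BiLSTMSelfAttention()
scores = model(torch.randint(0, 20000, (8, 40)))   # a batch of 8 reviews, 40 tokens each
```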

4 Experiments

To evaluate our models, we use the Three-Domain data set [15] which constitutes the standard benchmark in this field. [15] introduced the Three-domain


review spam data set: a collection of 3032 reviews in three different domains (Hotel, Restaurant and Doctor) annotated by three types of annotators (Turker, Expert and Customer). Each review is annotated with a binary label: truthful or deceptive. Table 1 shows statistics of the data set.

Table 1. Statistics of the three-domain dataset

Data set     Turker  Expert  Customer  Total
Hotel        800     280     800       1880
Restaurant   200     120     400       720
Doctor       200     32      200       432
Total        1200    432     1400      3032

To compare our proposed models with traditional supervised machine learning approaches, we also experimented with SVM, Naive Bayes and Logistic Regression (Log Regression) methods. For these traditional feature-engineered models, we pre-processed the text to remove stop words and stemmed the remaining words. Then, to distinguish the role of words, POS tagging was applied. To extract helpful features for classifying reviews, feature engineering techniques are required: bigrams and tf-idf are applied to extract the most frequent words in the document. For the other deep learning models, we used both a CNN and an RNN. The CNN and the RNN use the same word embeddings as our model (see Sect. 3). The CNN uses two convolutional and pooling layers connected to one fully connected hidden layer.
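As an illustration of these feature-engineered baselines, one possible scikit-learn formulation is sketched below; the authors' exact tooling, pre-processing and hyperparameters are not reported, so everything here is an assumption, and the variable names are placeholders.

```python
# Illustrative bigram + tf-idf baseline; swap LinearSVC for MultinomialNB or
# LogisticRegression to obtain the other classical baselines.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf", LinearSVC()),
])
# baseline.fit(train_reviews, train_labels)        # labels: truthful / deceptive (placeholders)
# predictions = baseline.predict(test_reviews)
```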

5 Results and Analysis

Our various models were compared using all three domains combined, in-domain, and across domains.

5.1 All Three-Domain Results

In our first set of experiments we used the combination of all three domains (Hotel, Restaurant and Doctor), for a total of 3032 reviews. Table 2 shows the results of our deep learning models compared to classic machine learning methods, using cross-validation. As Table 2 shows, in general all deep learning models yield a higher F-measure than SVM, Naive Bayes and Log Regression. It is interesting to note that both precision and recall benefit from the deep models and that attention mechanisms yield a significant improvement in performance. The Multi-Headed Self Attention performs better than the Self Attention, and the bidirectional LSTM does provide a significant improvement compared to the unidirectional LSTM.


Table 2. Results in all three-domain classification

Methods                                 Precision  Recall  F-measure
SVM                                     72.33      68.50   70.36
Naive Bayes                             61.69      63.32   62.49
Log Regression                          55.70      57.34   56.50
CNN                                     79.23      69.34   73.95
RNN                                     75.33      73.41   74.35
LSTM                                    80.85      68.74   74.30
BiLSTM                                  82.65      80.36   81.49
BiLSTM + Self Attention                 85.12      83.96   84.53
BiLSTM + Multi-Headed Self Attention    90.68      84.72   87.59
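Assuming the standard F1 definition, the F-measure columns in these tables follow directly from precision P and recall R; for example, for the best row of Table 2:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 90.68 \times 84.72}{90.68 + 84.72} \approx 87.6
```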

Table 3. Results in in-domain classification

Data set    Methods                                 Precision  Recall  F-measure
Hotel       SVM                                     69.97      67.36   68.64
            Naive Bayes                             58.71      62.13   60.37
            Log Regression                          55.32      56.45   55.87
            RNN                                     72.25      70.31   71.27
            CNN                                     78.76      74.30   76.47
            LSTM                                    84.65      81.11   82.84
            BiLSTM                                  86.43      83.05   84.70
            BiLSTM + Self Attention                 90.21      85.73   87.91
            BiLSTM + Multi-Headed Self Attention    89.33      92.59   90.93
Restaurant  SVM                                     73.76      69.43   71.52
            Naive Bayes                             63.18      66.32   64.71
            Log Regression                          59.96      63.87   61.85
            RNN                                     77.12      74.96   76.02
            CNN                                     78.11      76.92   77.51
            LSTM                                    80.23      78.74   79.47
            BiLSTM                                  86.85      87.35   87.10
            BiLSTM + Self Attention                 88.68      86.47   87.56
            BiLSTM + Multi-Headed Self Attention    89.66      91.00   90.32
Doctor      SVM                                     72.17      74.39   73.26
            Naive Bayes                             63.18      67.98   65.49
            Log Regression                          65.83      69.27   67.51
            RNN                                     75.28      67.98   71.44
            CNN                                     77.63      70.03   73.63
            LSTM                                    79.92      74.21   77.49
            BiLSTM                                  79.85      78.74   79.29
            BiLSTM + Self Attention                 82.54      80.61   81.56
            BiLSTM + Multi-Headed Self Attention    84.76      81.10   82.88

5.2 In-domain Results

Table 3 shows the results of the same models for each domain. As Table 3 shows, the same general conclusion can be drawn for each specific domain: deep learning


methods show significant improvements compared to classical ML methods, and attention mechanisms increase the performance even more. These results seem to show that neural models are more suitable for deceptive opinion spam detection. The results on the Restaurant data are similar to those on the Hotel domain. However, the models yield lower results on the Doctor domain. A possible reason is that the number of reviews in this domain is relatively lower, which leads to relatively lower performance.

5.3 Cross-domain Results

Finally, to evaluate the generality of our model, we performed an experiment across domains, where we trained models on one domain and evaluated them on another. Specifically, we trained the classifiers on Hotel reviews (for which we had more data) and evaluated their performance on the other domains. Table 4 shows the results of these experiments. Again, the same general trend appears. One notices, however, that the performance of the models does drop compared to Table 3, where the training was done on the same domain.

5.4 Comparison with Previous Work

In order to compare the proposed model with the state of the art, we performed a last experiment in line with the experimental set-up of [15]. As indicated in Sect. 2, [15] used an SVM and SAGE based on unigrams + LIWC + POS tags. To our knowledge, their approach constitutes the state-of-the-art approach on the Three-Domain dataset.

Table 4. Results in cross domain classification

Dataset               Methods                                 Precision  Recall  F-measure
Hotel vs. Doctor      SVM                                     67.28      64.91   66.07
                      Naive Bayes                             59.94      63.13   61.49
                      Log Regression                          55.47      51.93   53.64
                      RNN                                     72.33      68.50   70.36
                      CNN                                     61.69      63.32   62.49
                      LSTM                                    74.65      70.89   72.72
                      BiLSTM                                  76.82      71.63   74.13
                      BiLSTM + Self Attention                 78.10      73.79   75.88
                      BiLSTM + Multi-Headed Self Attention    81.90      77.34   79.55
Hotel vs. Restaurant  SVM                                     70.75      68.94   69.83
                      Naive Bayes                             64.87      67.11   65.97
                      Log Regression                          60.72      57.90   59.27
                      RNN                                     79.23      69.34   73.95
                      CNN                                     75.33      73.41   74.35
                      LSTM                                    80.15      73.94   76.91
                      BiLSTM                                  80.11      79.92   80.01
                      BiLSTM + Self Attention                 87.73      82.27   84.91
                      BiLSTM + Multi-Headed Self Attention    90.68      84.72   87.59


Table 5. Comparison of the proposed model with [15] on the Customer data

            Precision           Recall              F-Measure
Data        Our model   [15]    Our model   [15]    Our model   [15]
Hotel       85          67      93          66      89          66
Restaurant  90          69      92          72      91          70

Table 6. Comparison of the proposed model with [15] on the Expert data

            Precision           Recall              F-Measure
Data        Our model   [15]    Our model   [15]    Our model   [15]
Hotel       80          58      85          61      82          59
Restaurant  79          62      84          64      81          70

Table 7. Comparison of the proposed model with [15] on the Turker data

            Precision           Recall              F-Measure
Data        Our model   [15]    Our model   [15]    Our model   [15]
Hotel       87          64      92          58      89          61
Restaurant  88          68      89          70      88          68

Although [15] performed a variety of experiments, the set-up used applied the classifiers to the Turker and Customer sections of the dataset only (see Table 1). The model was trained on the entire data set (Customer + Expert + Turker) and tested individually on the Turker, Customer and Expert sections using cross-validation. To compare our approach, we reproduced their experimental set-up and, as shown in Tables 5, 6 and 7, our BiLSTM + Multi-Headed Self Attention model outperforms this state of the art.

6 Conclusion and Future Work

In this paper we showed that an attention mechanism can learn document representations automatically for opinion spam detection and clearly outperforms non-attention-based models as well as classic models. Experimental results show that Multi-Headed Self Attention performs better than Self Attention, and that the bidirectional LSTM does provide a significant improvement compared to the unidirectional LSTM. Using a high-performing model with no need for manual feature extraction from documents is effective for improving the detection of spam comments. Utilizing an attention mechanism and an LSTM model enables us to have a comprehensive model for distinguishing different reviews in different domains; this shows the generality of our model. One challenge left for the future is to improve the performance of cross-domain spam detection. This would enable the model to be used widely and to reach the performance of in-domain results in all domains.


bidirectional LSTM does provide a significant improvement compared to the unidirectional LSTM. Utilizing a model with no need for manual feature extraction from documents with high performance is effective to improve the detection of spam comments. Utilizing an attention mechanism and an LSTM model enable us to have a comprehensive model for distinguishing different reviews in different domains. This shows the generality power of our model. One challenge left for the future is to improve the performance of cross domains spam detection. This would enable the model to be used widely to reach the performance of in domain results in all domains. Acknowledgments. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

1. Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., Nunamaker Jr., J.F.: Detecting fake websites: the contribution of statistical learning theory. MIS Quarterly, pp. 435–461 (2010)
2. Bagheri, A., Saraee, M., De Jong, F.: Care more about customers: unsupervised domain-independent aspect detection for sentiment analysis of customer reviews. Knowl.-Based Syst. 52, 201–213 (2013)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
4. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), pp. 4960–4964 (2016)
5. Clark, K., Manning, C.D.: Improving coreference resolution by learning entity-level distributed representations, vol. 1, pp. 643–653 (2016)
6. Collobert, R.: Deep learning for efficient discriminative parsing. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 224–232. FL, USA (2011)
7. Dixit, S., Agrawal, A.: Survey on review spam detection. Int. J. Comput. Commun. Technol. 4, 0975–7449 (2013)
8. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 115–143 (2002)
9. Jindal, N., Liu, B.: Opinion spam and analysis. In: Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM 2008), pp. 219–230. ACM, Palo Alto, California, USA (2008)
10. Jindal, N., Liu, B., Lim, E.P.: Finding unusual review patterns using unexpected rules. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), pp. 1549–1552. Toronto, Canada, October 2010
11. Kuefler, A.R.: Merging recurrence and inception-like convolution for sentiment analysis. https://cs224d.stanford.edu/reports/akuefler.pdf (2016)
12. Lau, R.Y., Liao, S., Kwok, R.C.W., Xu, K., Xia, Y., Li, Y.: Text mining and probabilistic language modeling for online review spam detection. ACM Trans. Manage. Inf. Syst. (TMIS) 2(4), 25 (2011)


13. Li, F., Huang, M., Yang, Y., Zhu, X.: Learning to identify review spam. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2011), pp. 2488–2493 (2011). https://dl.acm.org/citation.cfm?id=2283811
14. Li, H.: Detecting Opinion Spam in Commercial Review Websites. Ph.D. thesis, University of Illinois at Chicago (2016)
15. Li, J., Ott, M., Cardie, C., Hovy, E.: Towards a general rule for identifying deceptive opinion spam. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL 2014), vol. 1, pp. 1566–1576 (2014)
16. Manning, C.D.: Computational linguistics and deep learning. Comput. Linguist. 41(4), 701–707 (2015)
17. Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), pp. 497–501 (2013)
18. Pascanu, R., Gulcehre, C., Cho, K., Bengio, Y.: How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026 (2013)
19. Santos, C.D., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1818–1826 (2014)
20. Socher, R.: Deep learning for sentiment analysis - invited talk. In: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, p. 36. San Diego, California (2016)
21. Streitfeld, D.: The Best Book Reviews Money Can Buy. The New York Times, New York, 25 (2012)
22. Sun, S., Luo, C., Chen, J.: A review of natural language processing techniques for opinion mining systems. Inf. Fusion 36, 10–25 (2017)
23. Vaswani, A., et al.: Attention is all you need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
25. Vu, N.T., Adel, H., Gupta, P., Schütze, H.: Combining recurrent and convolutional neural networks for relation classification. arXiv preprint arXiv:1605.07333 (2016)
26. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML 2015), pp. 2048–2057. Lille, France (2015)
27. Zhang, D., Zhou, L., Kehoe, J.L., Kilic, I.Y.: What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. J. Manage. Inf. Syst. 33(2), 456–481 (2016)

Multi-task Learning for Detecting Stance in Tweets

Devamanyu Hazarika1(B), Gangeshwar Krishnamurthy2, Soujanya Poria3, and Roger Zimmermann1

1 National University of Singapore, Singapore, Singapore
{hazarika,rogerz}@comp.nus.edu.sg
2 Artificial Intelligence Initiative, Agency for Science Technology and Research (A*STAR), Singapore, Singapore
[email protected]
3 Nanyang Technological University, Singapore, Singapore

Abstract. Detecting the stance of online posts is a crucial task for understanding online content and trends. Existing approaches augment models with complex linguistic features or target-dependent properties, or increase complexity with attention-based modules or pipeline-based architectures. In this work, we propose a simpler multi-task learning framework with the auxiliary tasks of subjectivity and sentiment classification. We also analyze the effect of regularization against inconsistent outputs. Our simple model achieves competitive performance with the state of the art in the micro-F1 metric and surpasses existing approaches in the macro-F1 metric across targets. We are able to show that multi-tasking with a simple architecture is indeed useful for the task of stance classification.

Keywords: Detecting stance · Sentiment analysis · Text classification

1 Introduction

Automatic detection of stance in text is an emerging task of opinion mining. In recent times, its importance has increased due to its role in practical applications. It is used in information retrieval systems to filter content based on the authors' stance, to analyze trends in politics and policies [14], and in summarization systems to understand online controversies [13]. It also finds its use in modern-day problems that plague the Internet, such as the identification of rumors or hate speech [21]. The task involves determining whether a piece of text, such as a tweet or debate post, is FOR, AGAINST, or NONE towards an entity, which can be a person, organization, product, policy, etc. (see Table 1). This task is challenging due to the use of informal language and literary devices such as sarcasm. For example, in the second sample in Table 1, the phrase 'Thank you God!' can mislead a trained model into considering it a favoring stance.

D. Hazarika and G. Krishnamurthy: Equal contribution.


Table 1. Sample tweets representing stances against target topics.

Target | Tweet | Stance
1. Climate change is a real concern | Incredibly moving as a scientist weeps on @BBCRadio4 for the #ocean and for our grandchildren's future | For
2. Atheism | I still remember the days when I prayed God for strength.. then suddenly God gave me difficulties to make me strong. Thank you God! | Against
3. Feminist | When I go up the steps of my house I feel like the @ussoccer wnt .. I too have won the Women's World Cup. #brokenlegprobs #USA | Against

For instance, in the third sample in Table 1, the tweet does not talk about feminism in particular but rather mocks it indirectly via the Women's World Cup.

Current state-of-the-art systems for this task largely follow the neural approach. These models increase their complexity either by adding extra features as input – such as linguistic features [26] – or through networks with attention modules or pipeline-based architectures [9,10]. In this paper, we refrain from increasing complexity and search for simple solutions to this task. To this end, we propose a simple convolutional neural network, named MTL-Stance, that adopts multi-task learning (MTL) for stance classification (Task A) by jointly training it with the related tasks of subjectivity (Task B) and sentiment prediction (Task C).

For subjectivity analysis, we categorize both the For and Against stances as subjective and the None stance as objective. It is important to note that, unlike in traditional opinion mining, subjectivity here refers to the presence of a stance towards a target in a tweet. Conversely, the objective class contains tweets that either carry no stance or whose stance cannot be determined. To tackle inconsistent predictions (e.g., Task A predicts a For stance while Task B predicts objective), we explore a regularization term that penalizes the network for inconsistent outputs. Overall, subjectivity represents a coarse-grained version of stance classification and is thus expected to aid the task at hand.

We also consider sentiment prediction (Task C) in the MTL framework to allow the model to learn common relations (if any). [18] notes that sentiment and stance detection are related tasks. However, the two tasks are not identical, since a person might express the same stance towards a target with either a positive or a negative opinion. A clear relationship is also often missing, as the opinion expressed in a text might not be directed towards the target. Nevertheless, the two tasks do tend to rely on related signals, which motivates us to investigate their joint training.
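Concretely, the auxiliary subjectivity label (Task B) can be derived mechanically from the stance label; the tiny sketch below illustrates this mapping (function and label names are ours, not from the paper).

def subjectivity_label(stance):
    # For / Against carry a stance -> Subjective; None -> Objective
    return "Objective" if stance == "None" else "Subjective"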


The contributions of this work can be summarized as follows:
– We propose a multi-task learning algorithm for stance classification by associating the related tasks of subjectivity and sentiment detection.
– We demonstrate that a simple CNN-based classifier trained in an end-to-end fashion can supersede models that rely on extra linguistic information or pipeline-based architectures.
– Our proposed model achieves results competitive with the state-of-the-art performance on the SemEval-2016 benchmark dataset [17], whilst having a simpler architecture with a single-phase, end-to-end mechanism.

The paper is organized as follows: Sect. 2 presents related work and compares it to our proposed approach; Sect. 3 presents the proposed model and explains the MTL framework used for training; Sect. 4 details the experimental setup and the results on the dataset. Finally, Sect. 5 provides concluding remarks.

2 Related Work

The analysis of stance detection has been performed on various forms of text such as debates in congress or online platforms [12,24,27,30], student essays [20], company discussions [1], etc. With the popularity of social media, there is also a surge of opinionated text in microblogs [17]. Traditional approaches involve linguistic features into their models such as sentiment [24], lexicon-based and dependency-based features [2], argument features [12]. Many works also use structural information from online user graphs or retweet links [19,22]. With the proliferation of deep-learning, several neural approaches have been attempted on this task with state-of-the-art performance. Most of the works utilize either recurrent or convolutional networks to model the text and the target. In [3], the authors use a bi-directional recurrent network to jointly model the text along with the target by initializing the text network with the output of the target network. On the other hand, convolutional neural networks also have been used for encoding text in this task [29]. Apart from basic networks, existing works also incorporate extra information as auxiliary input into their neural models. These features include user background information such as user tastes, comment history, etc. [5]. We focus on some recent works that have attained state-of-the-art performance. Specifically, we look at Target-specific Attentional Network (TAN) [10], Two-phase Attention-embedded LSTM (T-PAN) [9] and Hierarchical Attention Network (HAN) [26]. Similar to [3], TAN uses a bi-directional LSTM scheme to model the task. It includes the target embeddings into the network by using an attention mechanism [4]. We follow similar motivations to use target-specific information. However, aiming to minimize network complexity, we opt for a simple concatenation-based fusion scheme. T-PAN stands closest to our proposed model as it too incorporates information from classifying subjectivity of tweets. It is achieved by following a two-phase


Fig. 1. Stance-MTL: Multi-task framework for stance detection.

model in which the first phase decides subjectivity and the second phase classifies only the tweets deemed subjective in the first phase as favoring or non-favoring stances. In contrast to this approach, we do not use a pipeline-based design, as it bears a higher possibility of error propagation. Instead, in our MTL framework, both classifications are performed simultaneously. The Hierarchical Attention Network (HAN), proposed by [26], consists of a hierarchical framework of attention-based layers which incorporates extra information from sentiment, dependency, and argument representations. Unlike HAN, our model does not depend on complex linguistic features. Rather, we use a simple CNN-based model that trains end-to-end under the multi-task regime.

3 Proposed Approach

3.1 Task Formulation

The task of stance classification can be defined as follows: given a tweet and its associated target entity, the aim of the model is to determine the author's stance towards the target. The possible stances are favoring (FOR), against (AGAINST), or inconclusive (NONE). The NONE class consists of tweets that either carry a neutral stance or whose stance is not easy to determine.

3.2 Multi-task Learning

Multi-task learning (MTL) is a framework that requires optimizing a network towards multiple tasks [23]. The motivation arises from the belief that features learnt for a particular task can be useful for related tasks [7]. In particular, MTL exploits the synergies between related tasks through joint learning, where supervision from the related/auxiliary tasks provides an inductive bias into the network that allows it to generalize better. Effectiveness of MTL framework is evident in the literature of various fields such as speech recognition [8], computer vision [11], natural language processing [6], and others.


MTL algorithms can be realized in two primary ways. The first is to train individual models for each task with a common regularization that encourages the models to be similar. The second is to enforce a stronger association by sharing common weights across tasks. In this work, we draw on both approaches by using a shared model along with explicit regularization against inconsistent output combinations. Below we provide the details of our model, MTL-Stance.

3.3 Model Details

The overall architecture of MTL-Stance is shown in Fig. 1. It takes the input tweet and its target; the inputs are processed by shared convolutional layers whose outputs are concatenated, and the subsequent layers are separated into the three mentioned tasks. Concrete network details are given below.

Input Representation. A training example consists of a tweet text $\{Tw_i\}_{i=0}^{n}$, a target entity $\{Tr_i\}_{i=0}^{m}$, a stance label $y_1 \in \{For, Against, None\}$, the derived subjectivity label $y_2 \in \{Subjective, Objective\}$, and a sentiment label $y_3 \in \{Positive, Negative, Neither\}$. Both $Tw \in \mathbb{R}^{n \times k}$ and $Tr \in \mathbb{R}^{m \times k}$ are sequences of words represented in matrix form, with each word corresponding to its $k$-dimensional word vector [15].

Shared Parameters. To both the tweet and target representations, we apply a shared convolutional layer to extract higher-level features. We use multiple filters of different sizes: the width of each filter is fixed to $k$, but the height $h$ is varied (as a hyper-parameter). For example, let $w \in \mathbb{R}^{h \times k}$ be a filter that extracts a feature vector $z \in \mathbb{R}^{L-h+1}$, where $L$ is the length of the input. Each entry of $z$ is given by

$$z_i = g(w \ast T_{i:i+h-1} + b)$$

where $\ast$ is the convolution operation, $b \in \mathbb{R}$ is a bias term, and $g$ is a non-linear function. We then apply max-over-time pooling over the feature vector $z$ to obtain its maximum value $\hat{z} = \max(z)$. The above convolutional layer with $F_l$ filters is applied $M$ times to both the tweet and target representations, yielding $M \cdot F_l$ features for each. These feature representations of the tweet text and the target text are denoted $F_{Tw}$ and $F_{Tr}$. Next, we obtain the joint representation by concatenating them, i.e., $F_T = [F_{Tw}; F_{Tr}] \in \mathbb{R}^{2 \cdot M \cdot F_l}$. This representation is fed to a non-linear fully-connected layer $fc_{inter}$ coupled with Dropout [25]:

$$h_{inter} = fc_{inter}(F_T)$$

This layer is the last shared layer before the task-specific layers are applied.
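To make the shared encoder concrete, the sketch below is a minimal PyTorch rendering (the paper does not name a framework; filter counts and widths follow Sect. 4.2, and everything else is our assumption rather than the authors' code).

import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    # Shared multi-width CNN with max-over-time pooling, applied to both tweet and target.
    def __init__(self, emb_dim=300, n_filters=128, widths=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, n_filters, w) for w in widths])

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)        # (batch, M * F_l)

encoder = SharedEncoder()
tweet = torch.randn(8, 30, 300)                # padded tweet embeddings
target = torch.randn(8, 6, 300)                # padded target embeddings
F_T = torch.cat([encoder(tweet), encoder(target)], dim=1)   # joint representation [F_Tw; F_Tr]

The joint representation F_T would then be passed through the shared fully-connected layer fc_inter before the task-specific layers.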


Task-Specific Layers. For each of the three tasks, i.e., stance, subjectivity, and sentiment classification, we use three different non-linear fully-connected layers. The layer weights are not shared among them, so that each can individually learn task-specific features:

$$h_i = fc_i(h_{inter}), \quad \forall i \in \{1, 2, 3\}$$

Finally, we pass these features through another projection layer with softmax normalization to get the probability distribution over the labels for each task:

$$\hat{y}_i = \mathrm{softmax}(W_i \cdot h_i + b_i), \quad \forall i \in \{1, 2, 3\}$$

Loss Function. We use the categorical cross-entropy on each of the outputs to calculate the loss. We also add a joint regularization loss ($Reg_{loss}$) to the total loss function:

$$Loss = \frac{-1}{N} \sum_{i=1}^{N} \left\{ \sum_{k=1}^{3} \left( \sum_{j=1}^{C_k} y_k^{i,j} \log_2 \hat{y}_k^{i,j} \right) \right\} + Reg_{loss} \quad (1)$$

where $N$ is the number of mini-batch samples, $C_k$ is the number of categories of the $k$-th task (in the order: stance, subjectivity and sentiment), $y_k^{i,j}$ is the probability of the $i$-th sample of the $k$-th task for the $j$-th label, and $\hat{y}_k^{i,j}$ is its predicted counterpart. In our setup, $C_1 = 3$, $C_2 = 2$ and $C_3 = 3$. The regularization term $Reg_{loss}$ depends on the outputs of the first two tasks and is defined as:

$$Reg_{loss} = \alpha \cdot \big( \mathrm{sgn}\lvert \arg\max(\hat{y}_1^{i}) \rvert \oplus \mathrm{sgn}\lvert \arg\max(\hat{y}_2^{i}) \rvert \big) \quad (2)$$

where $\alpha$ is a weighting term (hyper-parameter), $\mathrm{sgn}\lvert \cdot \rvert$ is the sign function, and $\oplus$ is a logical XOR operation used to penalize the instances where the subjectivity and stance predictions contradict each other.
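The total objective in Eqs. (1)-(2) can be computed as in the rough NumPy sketch below. It reflects our own assumptions (label index 0 encodes None/Objective, and the penalty is averaged over the batch) rather than the authors' code; note that the XOR term is non-differentiable, so it acts as an added penalty value rather than a gradient signal.

import numpy as np

def joint_loss(y_true, y_pred, alpha=5.0):
    # y_true, y_pred: lists of three (N, C_k) arrays for stance, subjectivity and sentiment;
    # targets are one-hot vectors, predictions are probabilities.
    N = y_true[0].shape[0]
    ce = sum(-np.sum(t * np.log2(np.clip(p, 1e-12, 1.0))) for t, p in zip(y_true, y_pred)) / N
    stance_on = np.sign(np.argmax(y_pred[0], axis=1))   # 0 for None, 1 otherwise
    subj_on = np.sign(np.argmax(y_pred[1], axis=1))     # 0 for Objective, 1 otherwise
    reg = alpha * np.mean(stance_on ^ subj_on)          # XOR fires on contradictory outputs
    return ce + reg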

4 Experiments

4.1 Dataset

We utilize the benchmark Twitter Stance Detection corpus for stance classification, originally proposed by [16] and later used in SemEval-2016 Task 6 [17]. The dataset presents the task of identifying the stance of a tweet's author towards a target: determining whether the author favors (For) or is against (Against) the target, or whether neither inference is likely (None). The dataset comprises English tweets spanning five primary targets (Atheism, Climate Change is a Real Concern, Feminist Movement, Hillary Clinton, and Legalization of Abortion), with pre-defined training and testing splits. The distribution statistics are provided in Table 2. Along with stance labels, sentiment labels are also provided, which we use for the joint training (see Table 2).

http://alt.qcri.org/semeval2016/task6/.


Table 2. Percentage distribution of stance and sentiment labels for instances (tweets) in the dataset across targets and splits.

Stance (% of instances):
Target   | Train # | For  | Against | Neither | Test # | For  | Against | Neither
Atheism  | 513     | 17.9 | 59.3    | 22.8    | 220    | 14.5 | 72.7    | 12.7
Climate  | 395     | 53.7 | 3.8     | 42.5    | 169    | 72.8 | 6.5     | 20.7
Feminism | 664     | 31.6 | 49.4    | 19.0    | 285    | 20.4 | 64.2    | 15.4
Hillary  | 689     | 17.1 | 57.0    | 25.8    | 295    | 15.3 | 58.3    | 26.4
Abortion | 653     | 18.5 | 54.4    | 27.1    | 280    | 16.4 | 67.5    | 16.1
All      | 2914    | 25.8 | 47.9    | 26.3    | 1249   | 24.3 | 57.3    | 18.4

Sentiment (% of instances):
Target   | Train Pos | Train Neg | Train None | Test Pos | Test Neg | Test None
Atheism  | 60.4      | 35.0      | 4.4        | 59.0     | 35.4     | –
Climate  | 60.4      | 35.0      | 4.4        | 59.0     | 35.4     | –
Feminism | 17.9      | 77.2      | 4.8        | 19.3     | 76.1     | –
Hillary  | 32.0      | 64.0      | 3.9        | 25.7     | 70.1     | –
Abortion | 28.7      | 66.1      | 5.0        | 20.3     | 72.1     | –
All      | 33.0      | 60.4      | 6.4        | 29.4     | 63.3     | 7.2

4.2 Training Details

We use the standard training and testing sets provided with the dataset. Hyper-parameters are tuned on held-out validation data (10% of the training data). To optimize the parameters, we use the RMSProp optimizer [28] with an initial learning rate of 1e-4. The hyper-parameters are $F_l = 128$ and $M = 3$, and the window sizes $h$ of the three filters are 2, 3 and 4. We fix the tweet length $n$ to 30 and the target length $m$ to 6. The number of hidden units in the task-specific layers $fc_{1/2/3}$ is 300. We initialize the word vectors with the 300-dimensional pre-trained word2vec embeddings [15], which are fine-tuned during training. Following previous work, we train a separate model for each target but with the same hyper-parameters; the final result is the concatenation of the predictions of these models.
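For reference, these settings can be collected into a single configuration; the dictionary below is only a convenience summary of this subsection, with hypothetical key names.

CONFIG = {
    "optimizer": "RMSProp",
    "learning_rate": 1e-4,
    "n_filters": 128,            # F_l
    "filter_widths": [2, 3, 4],  # M = 3 filter sizes
    "tweet_len": 30,             # n
    "target_len": 6,             # m
    "task_hidden_units": 300,    # fc_1 / fc_2 / fc_3
    "word_embeddings": "word2vec, 300-d, fine-tuned",
}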

4.3 Baselines

We compare MTL-Stance with the following baseline methods:
– SVM: A non-neural baseline that has been widely used in previous work [17]. The model uses simple bag-of-words features for stance classification.
– LSTM: A simple LSTM model without target features.
– TAN: An RNN-based architecture that uses a target-specific attention module to focus on the parts of the tweet that are related to the target topic [10].
– T-PAN: A two-phase model for classifying the stance [9]. The first phase classifies subjectivity and the second phase classifies the stance based on the first. Concretely, utterances classified as objective in the first phase are dropped from the second phase and assigned the None label.
– HAN: A hierarchical attention model that uses linguistic features, including sentiment, dependency and argument features [26].


Table 3. Comparison of MTL-Stance with state-of-the-art models on the Twitter Stance Detection corpus. MTL-Stance results are the average of 5 runs with different initializations.

Model      | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
SVM        | 62.16   | 42.91   | 56.43    | 55.68   | 60.38    | 55.51    | 67.01
LSTM       | 58.18   | 40.05   | 49.06    | 61.84   | 51.03    | 52.03    | 63.21
TAN        | 59.33   | 53.59   | 55.77    | 65.38   | 63.72    | 59.56    | 68.79
T-PAN      | 61.19   | 66.27   | 58.45    | 57.48   | 60.21    | 60.72    | 68.84
HAN        | 70.53   | 49.56   | 57.50    | 61.23   | 66.16    | 61.00    | 69.79
MTL-Stance | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88

4.4 Evaluation Metrics

We use both the micro-average and the macro-average of the F1-score across targets as our evaluation metrics, as defined by [26]. The F1-score for the Favor and Against categories over all instances is calculated as:

$$F_{[favor/against]} = \frac{2 \times P_{[favor/against]} \times R_{[favor/against]}}{P_{[favor/against]} + R_{[favor/against]}} \quad (3)$$

where $P$ and $R$ are precision and recall. The final metric, $MicF_{avg}$, is the average of $F_{favor}$ and $F_{against}$:

$$MicF_{avg} = \frac{F_{favor} + F_{against}}{2} \quad (4)$$
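A direct implementation of Eqs. (3)-(4) takes only a few lines; the helpers below are a sketch with names of our choosing.

def f_class(precision, recall):
    # Eq. (3): per-class F1 for the Favor or Against category
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def mic_f_avg(p_favor, r_favor, p_against, r_against):
    # Eq. (4): average of the Favor and Against F1-scores
    return (f_class(p_favor, r_favor) + f_class(p_against, r_against)) / 2.0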

4.5 Results

Table 3 shows the performance results on the Twitter Stance Detection corpus. Our model, MTL-Stance, performs significantly better than the state-of-the-art models across most targets. The SVM model does not perform well since it only uses bag-of-words features of the tweet text. The LSTM model also does not exploit information from the target text; hence its performance is significantly lower even though it uses a neural architecture. On the other hand, neural models such as TAN, T-PAN, and HAN use both the tweet and the target text and outperform both SVM and LSTM. This indicates that target information is a useful feature for stance classification.

4.6 Ablation Study

We further experiment with different variants of the MTL-Stance model to analyze the extent to which its various components contribute to the performance. The variants are as follows:
– Single: This model does not use the multi-task learning framework and is trained with only the stance labels.


– Single + subj.: This model uses the multi-task framework with the subjectivity labels in addition to the stance labels.
– Single + subj. + regloss: This model further adds the regularization loss (see Sect. 3.3) to penalize mismatched outputs.
– Single + sent.: This model uses the multi-task framework with the sentiment labels in addition to the stance labels.
– MTL-Stance: Our final model, which uses multi-task learning with the regularization loss and all three labels: subjectivity, sentiment and stance.

Table 4 provides the performance results of these models. As seen, understanding the subjectivity of a tweet towards the target helps the model make better judgments about its stance. Intuitively, a tweet that has no stance towards the target tends to be objective, while one with an opinion tends to be subjective. Adding the regularization penalty further improves the overall performance. Analyzing the confusion matrix between the sentiment and stance labels reveals that stance and sentiment are not correlated [18]. Yet, adding the sentiment classification task in MTL improves the performance of the model. This indicates the presence of common underlying relationships that the model is able to learn and exploit.

Also note that our single model consists of a very simple architecture and does not beat the state-of-the-art models described in Table 3, but the same architecture outperforms them with a multi-task learning objective and regularization loss. This indicates that performance could be improved further if complex neural architectures were combined with multi-task learning for stance classification.

Table 4. Ablation across auxiliary tasks. Note: subj. = subjectivity (Task B), sent. = sentiment (Task C).

Model             | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
Single (stance)   | 63.71   | 43.89   | 58.75    | 63.12   | 63.05    | 58.50    | 67.40
+ subj.           | 66.27   | 51.70   | 56.70    | 62.57   | 65.03    | 60.45    | 67.41
+ subj. + regloss | 64.87   | 51.53   | 60.09    | 64.25   | 65.49    | 61.25    | 68.40
+ sent.           | 66.30   | 62.06   | 56.19    | 63.11   | 64.44    | 62.42    | 67.76
MTL-Stance        | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88

4.7 Importance of Regularization

Table 5 compares the effect of regularization loss that we have introduced in this paper. The regularization loss allows the model to learn the correlation between subjectivity and stance more effectively by penalizing when the model predicts a tweet as subjective but a neutral stance (or vice-versa). The performance improvement shows the effectiveness of the regularization in our model.


Table 5. MTL-Stance with and without the regularization loss.

RegLoss | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
No      | 64.03   | 63.75   | 58.46    | 64.21   | 68.58    | 63.81    | 68.13
Yes     | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88

4.8 Effect of Regularization Strength (α)

Figure 2 shows the performance trend of the model as α in the regularization loss is varied. At α = 5, the model reaches its highest performance, with MicF_avg = 69.88 and MacF_avg = 64.69. As the value of α is increased further, we observe that the performance of the model starts dropping. This is expected, as the model starts under-fitting once α exceeds a value of 10, and performance continues to drop as α is increased.

Fig. 2. Performance plot of the MTL-Stance model when α in regularization loss is varied.

4.9 Case Study and Error Analyses

We present an analysis of a few instances where our model, MTL-Stance, succeeds and where it fails in predicting the correct stance. We also include the subjectivity the model predicts for each tweet, for more insight into the model's behavior.

Tweet: @violencehurts @WomenCanSee The most fundamental right of them all, the right to life, is also a right of the unborn. #SemST


Target: Legalization of Abortion
Actual Stance: Against; Predicted Stance: Against
Actual Sentiment: Positive; Predicted Sentiment: Positive
Predicted Subjectivity: Subjective

This is an example of a tweet having positive sentiment while holding an opposing opinion towards the target. Although the stance of the tweet towards the target is Against, the overall sentiment of the tweet, without considering the target, is Positive. We observe that MTL-Stance is able to capture such complex relationships across many instances in the test set.

Tweet: @rhhhhh380 What we need to do is support all Republicans and criticize the opposition. #SemST
Target: Hillary Clinton
Actual Stance: Against; Predicted Stance: None
Predicted Subjectivity: Subjective

For the tweet above, MTL-Stance predicts None whereas the true stance is Against. The tweet is targeted towards 'Hillary Clinton', but the author refers to Republicans and not to the target directly. This is a challenging example, since it requires knowledge about the relation between the two entities (Hillary and the Republicans) to predict the stance label correctly. MTL-Stance, however, correctly predicts the subjective label, which demonstrates that it is able to capture some of these patterns in the coarse-grained classification.

Tweet: Please vote against the anti-choice amendment to the Scotland Bill on Monday @KevinBrennanMP - Thanks! #abortionrights #SemST
Target: Legalization of Abortion
Actual Stance: Against; Predicted Stance: Favor
Predicted Subjectivity: Subjective

This instance illustrates further challenges that models face in predicting the stance correctly. The tweet contains multiple negations and requires multi-hop inference to reach the right conclusion about the stance. Handling such cases demands rigorous design and fundamental reasoning capabilities.

5 Conclusion

In this paper, we introduced MTL-Stance, a novel model that leverages sentiment and subjectivity information for stance classification through a multi-task learning setting. We also propose a regularization loss that helps the model to learn the correlation between subjectivity and stance more effectively. MTL-Stance


uses a simple end-to-end model with a CNN architecture for stance classification. In addition, it does not use any extra linguistic features or pipeline methods. The experimental results show that MTL-Stance outperforms state-of-the-art models on the Twitter Stance Detection benchmark dataset.

Acknowledgment. This research is supported by Singapore Ministry of Education Academic Research Fund Tier 1 under MOE's official grant number T1 251RES1820.

References 1. Agrawal, R., Rajagopalan, S., Srikant, R., Xu, Y.: Mining newsgroups using networks arising from social behavior. In: Proceedings of the 12th International Conference on World Wide Web, pp. 529–535. ACM (2003) 2. Anand, P., Walker, M., Abbott, R., Tree, J.E.F., Bowmani, R., Minor, M.: Cats rule and dogs drool!: classifying stance in online debate. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pp. 1–9. Association for Computational Linguistics (2011) 3. Augenstein, I., Rockt¨ aschel, T., Vlachos, A., Bontcheva, K.: Stance detection with bidirectional conditional encoding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 876–885 (2016) 4. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949. IEEE (2016) 5. Chen, W.F., Ku, L.W.: Utcnn: a deep learning model of stance classification on social media text. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1635–1645 (2016) 6. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008) 7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493– 2537 (2011) 8. Deng, L., Hinton, G., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8599–8603. IEEE (2013) 9. Dey, K., Shrivastava, R., Kaushik, S.: Topical stance detection for twitter: a twophase LSTM model using attention. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 529–536. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7 40 10. Du, J., Xu, R., He, Y., Gui, L.: Stance classification with target-specific neural attention networks. In: International Joint Conferences on Artificial Intelligence (2017) 11. Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015) 12. Hasan, K.S., Ng, V.: Why are you taking this stance? identifying and classifying reasons in ideological debates. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 751–762 (2014)


13. Jang, M., Allan, J.: Explaining controversy on social media via stance summarization. arXiv preprint arXiv:1806.07942 (2018) 14. Lai, M., Hern´ andez Far´ıas, D.I., Patti, V., Rosso, P.: Friends and enemies of clinton and trump: using context for detecting stance in political tweets. In: Sidorov, G., Herrera-Alc´ antara, O. (eds.) MICAI 2016. LNCS (LNAI), vol. 10061, pp. 155–168. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62434-1 13 15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) 16. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: A dataset for detecting stance in tweets. In: LREC (2016) 17. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: Semeval-2016 task 6: detecting stance in tweets. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 31–41 (2016) 18. Mohammad, S.M., Sobhani, P., Kiritchenko, S.: Stance and sentiment in tweets. ACM Trans. Internet Technol. (TOIT) 17(3), 26 (2017) 19. Murakami, A., Raymond, R.: Support or oppose?: classifying positions in online debates from reply activities and opinion expressions. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 869–875. Association for Computational Linguistics (2010) 20. Persing, I., Ng, V.: Modeling stance in student essays. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2174–2184 (2016) 21. Poddar, L., Hsu, W., Lee, M.L., Subramaniyam, S.: Predicting stances in twitter conversations for detecting veracity of rumors: a neural approach. In: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 65–72. IEEE (2018) 22. Rajadesingan, A., Liu, H.: Identifying users with opposing opinions in Twitter debates. In: Kennedy, W.G., Agarwal, N., Yang, S.J. (eds.) SBP 2014. LNCS, vol. 8393, pp. 153–160. Springer, Cham (2014). https://doi.org/10.1007/978-3-31905579-4 19 23. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017) 24. Somasundaran, S., Wiebe, J.: Recognizing stances in ideological on-line debates. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 116–124. Association for Computational Linguistics (2010) 25. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 26. Sun, Q., Wang, Z., Zhu, Q., Zhou, G.: Stance detection with hierarchical attention network. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2399–2409 (2018) 27. Thomas, M., Pang, B., Lee, L.: Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 327–335. Association for Computational Linguistics (2006) 28. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)


29. Vijayaraghavan, P., Sysoev, I., Vosoughi, S., Roy, D.: Deepstance at semeval-2016 task 6: detecting stance in tweets using character and word-level CNNs. In: Proceedings of SemEval, pp. 413–419 (2016) 30. Walker, M., Anand, P., Abbott, R., Grant, R.: Stance classification using dialogic properties of persuasion. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 592–596. Association for Computational Linguistics (2012)

Related Tasks Can Share! A Multi-task Framework for Affective Language

Kumar Shikhar Deep, Md Shad Akhtar(B), Asif Ekbal, and Pushpak Bhattacharyya

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, India
{shikhar.mtcs17,shad.pcs15,asif,pb}@iitp.ac.in

Abstract. Expressing the polarity of sentiment as 'positive' or 'negative' usually has limited scope compared with the intensity/degree of that polarity. The two tasks (i.e., sentiment classification and sentiment intensity prediction) are closely related and may offer assistance to each other during the learning process. In this paper, we propose to leverage the relatedness of multiple tasks in a multi-task learning framework. Our multi-task model is based on a convolutional-Gated Recurrent Unit (GRU) framework, which is further assisted by a diverse hand-crafted feature set. Evaluation and analysis suggest that joint learning of the related tasks in a multi-task framework can outperform each of the individual tasks in single-task frameworks.

Keywords: Multi-task learning · Single-task learning · Sentiment classification · Sentiment intensity prediction

1 Introduction

In general, people are always interested in what other people are thinking and what opinions they hold on topics such as products, politics, news, sports, etc. The number of people expressing their opinions on social media platforms such as Twitter, Facebook, LinkedIn, etc. is growing continuously. These platforms have made it possible for researchers to gauge public opinion on their topics of interest, and to do so on demand. With the increase of content on social media, automation of Sentiment Analysis [20] is very much required and in huge demand. Users' opinions extracted from these platforms are used as inputs to assist decision making in a number of applications such as business analysis, market research, stock market prediction, etc.

Coarse-grained sentiment classification (i.e., classifying a text as carrying either positive or negative sentiment) is a well-established and well-studied task [10]. However, such binary/ternary classification studies do not always reveal the exact state of the human mind. We use language to communicate not only our sentiments but also the intensity of those sentiments; e.g., one could judge from our utterances that we are very angry, slightly sad, very much elated, etc. Intensity refers to the degree of sentiment a person may express through text.


Table 1. Example sentences with their sentiment classes and intensity scores from the SemEval-2018 dataset on Affect in Tweets [15].

Tweet | Valence | Intensity
@ LoveMyFFAJacket FaceTime - we can still annoy you | Pos-S | 0.677
and i shouldve cut them off the moment i started hurting myself over them | Neg-M | 0.283
@ VescioDiana You forgot #laughter as well | Pos-S | 0.700

It also enables us to analyze sentiment at a much finer level, rather than only expressing the polarity of the sentiment as positive or negative. In recent times, studies on the amount of positiveness or negativeness of a sentence (i.e., how positive/negative a sentence is, or the degree of positiveness/negativeness) have gained attention due to their potential applications in various fields. A few example sentences are shown in Table 1. In this work, we focus on fine-grained sentiment analysis [29]. Specifically, we address fine-grained analysis through two different lenses, i.e., fine-grained sentiment classification and sentiment intensity prediction.

– Sentiment or Valence1 Classification: In this task, we classify each tweet into one of seven possible fine-grained classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter: very positive (Pos-V), moderately positive (Pos-M), slightly positive (Pos-S), neutral (Neu), slightly negative (Neg-S), moderately negative (Neg-M), and very negative (Neg-V).
– Sentiment or Valence Intensity Prediction: Unlike the discrete labels in the classification task, in intensity prediction we determine the degree or arousal of sentiment that best represents the sentimental state of the user. The score is a real-valued number in the range 0 to 1, with 1 representing the highest intensity or arousal.

The two tasks, i.e., sentiment classification and intensity prediction, are related and inter-dependent. Building a separate system for each task is often less economical and more complex than a single multi-task system that handles both tasks together. Further, joint learning of two (or more) related tasks provides great assistance to each task and also offers generalization across tasks.

In this paper, we propose a hybrid neural network based multi-task learning framework for sentiment classification and intensity prediction in tweets. Our network utilizes a bidirectional gated recurrent unit (Bi-GRU) [28] in cascade with a convolutional neural network (CNN) [13]. The max-pooled features and a diverse set of hand-crafted features are then concatenated and subsequently fed to task-specific output layers for the final prediction. We evaluate

Valence signifies the pleasant/unpleasant scenarios.


our approach on the benchmark dataset of the SemEval-2018 shared task on Affect in Tweets [15]. We observe that our proposed multi-task framework attains better performance when both tasks are learned jointly.

The rest of the paper is organized as follows. In Sect. 2, we discuss related work. We present our proposed approach in Sect. 3. In Sect. 4, we describe our experimental results and analysis. Finally, we conclude in Sect. 5.

2 Related Work

The motivation for applying a multi-task model to sentiment analysis comes from [27], which gives a general overview of multi-task learning with deep learning techniques. Multi-task learning (MTL) has not only been applied to Natural Language Processing tasks [4], but has also shown success in areas such as computer vision [9] and drug discovery [24], among others.

The authors in [5] used a stacking ensemble technique to merge the results of classifiers/regressors, each fed with the hand-crafted features individually, and finally passed those results to a meta classifier/regressor to produce the final prediction; this was reported to achieve state-of-the-art performance. The authors in [8] used a bidirectional Long Short Term Memory (biLSTM) and an LSTM with an attention mechanism, and performed transfer learning by first pre-training the LSTM networks on sentiment data; the penultimate layers of these networks were then concatenated into a single vector and fed as input to dense layers. A gated recurrent unit (GRU) based model with a convolutional neural network (CNN) attention mechanism and stacking-based ensembles was proposed by [26]. In [14], three different kinds of features generated by deep learning models and traditional methods were combined in support vector machines (SVMs) to create a unified ensemble system. In [21], a neural network model was used to extract features by transferring emotional knowledge into it, and these features were passed through machine learning models such as support vector regression (SVR) and logistic regression. The authors of [2] used a Bi-LSTM with a multi-layer self-attention mechanism, which is capable of identifying salient words in tweets and provides insight into the model, making it more interpretable.

Our proposed model differs from previous models in that we propose an end-to-end neural network based approach that performs both sentiment classification and sentiment intensity prediction simultaneously. We use gated recurrent units (GRU) along with a convolutional neural network (CNN), inspired by [26]. We feed the hidden states of the GRU to a CNN layer in order to get a fixed-size vector representation of each sentence. We also use various features extracted from pre-trained resources such as DeepMoji [7], Skip-Thought Vectors [12], the Unsupervised Sentiment Neuron [23], and EmoInt [5].


Fig. 1. Proposed architecture

3 Proposed Methodology

In this section, we describe our proposed multi-task framework in detail. Our model consists of a recurrent layer (Bi-GRU) followed by a CNN module. Given a tweet, the GRU learns a contextual representation of each word, i.e., the representation of each word is learnt based on the sequence of words in the sentence. This representation is then used as input to the CNN module to build the sentence representation: we apply max-pooling over the convoluted features of each filter and concatenate them. The hidden representation obtained from the CNN module is shared across the tasks (here, two tasks, i.e., sentiment classification and intensity prediction). Further, the hidden representation is assisted by a diverse set of hand-crafted features (c.f. Sect. 3.1) for the final prediction.

In our work, we experiment with two different paradigms of predictors: a) the first model is the traditional deep learning framework that uses a softmax (or sigmoid) function in the output layer, and b) the second model is obtained by replacing the softmax classifier with a support vector machine (SVM) [31] (or a support vector regressor (SVR)). In the first model, we feed the concatenated representation to two separate fully-connected layers with softmax (classification) and sigmoid (intensity) activations for the two tasks. In the second model, we feed the hidden representations as feature vectors to the SVM and SVR, respectively, for prediction. A high-level block diagram of the proposed methodology is depicted in Fig. 1.
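As an illustration, a minimal Keras rendering of this shared BiGRU-CNN trunk with the two heads could look as follows; layer sizes follow Sect. 4.3, while the hand-crafted feature dimension and all remaining details are assumptions rather than the authors' code.

from tensorflow.keras import Input, Model, layers

MAX_LEN, EMB_DIM, FEAT_DIM = 50, 300, 133       # FEAT_DIM: placeholder size for hand-crafted features

words = Input(shape=(MAX_LEN, EMB_DIM))         # pre-computed word embeddings (Sect. 3.2)
feats = Input(shape=(FEAT_DIM,))                # hand-crafted feature vector (Sect. 3.1)

h = layers.Bidirectional(layers.GRU(256, return_sequences=True))(words)
pooled = [layers.GlobalMaxPooling1D()(layers.Conv1D(100, k, activation="relu")(h))
          for k in (2, 3, 4, 5, 6)]
shared = layers.Dropout(0.5)(layers.Concatenate()(pooled + [feats]))

valence = layers.Dense(7, activation="softmax", name="valence")(shared)
intensity = layers.Dense(1, activation="sigmoid", name="intensity")(shared)

model = Model(inputs=[words, feats], outputs=[valence, intensity])
model.compile(optimizer="adam",
              loss={"valence": "categorical_crossentropy", "intensity": "mse"})

In the second paradigm described above, the shared representation would instead be extracted and fed to an SVM/SVR.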

3.1 Hand-Crafted Features

We perform transfer learning from various state-of-the-art deep learning techniques. The following sub-sections explain these resources in detail:


– DeepMoji [7]: DeepMoji performs distant supervision on a very large dataset [19,32] (1.2 billion tweets) comprising of noisy labels (emojis). By incorporating transfer learning on various downstream tasks, they were able to outperform the state-of-the-art results of 8 benchmark datasets on 3 NLP tasks across 5 domains. Since our target task is closely related to this, we adapt this for our domain. We extract 2 different feature sets: • the embeddings from the softmax layer which is of 64 dimensions. • the embeddings from the attention layer which is of 2304 dimensions. – Skip-Thought Vectors [12]: Skip-thought is a kind of model that is trained to reconstruct the surrounding sentences to map sentences that share semantic and syntactic properties into similar vectors. It has the capability to produce highly generic semantic representation of sentence. The skip-thought model has two parts: • Encoder: It is generally a Gated Recurrent Unit (GRU) whose final hidden state is passed to the dense layers to get the fixed length vector representation of each sentence. • Decoder: It takes this vector representation as input and tries to generate the previous and next sentence. For this two different GRUs are needed. Due to its fixed length vector representation, skip-thought could be helpful to us. The feature extracted from skip-thought model is of dimension 4800. – Unsupervised Sentiment Neuron: [23] developed an unsupervised system which learned an excellent representation of sentiment. Actually the model was designed to produce Amazon product reviews, but the data scientists discovered that one single unit of network was able to give high predictions for sentiments of texts. It was able to classify the reviews as positive or negative, and its performance was found to be better than some popular models. They even got encouraging results on applying their model on the dataset of Yelp reviews and binary subset of the Stanford Sentiment Treebank. Thus the sentiment neuron model could be used to extract features by transfer learning. The features extracted from Sentiment Neuron model are of dimension 4096. – EmoInt [5]: We intended to use various lexical features apart from using some pre-trained embeddings. EmoInt [5] is a package which provides a high level wrapper to combine various word embeddings. The lexical features includes the following: – AFINN [17] contains list of words which are manually rated for valence between −5 to +5 where −5 indicates very negative sentiment and +5 indicates very positive sentiment. – SentiWordNet [1] is lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. – SentiStrength [32] gives estimation of strength of positivity and negativity of sentiment. – NRC Hashtag Emotion Lexicon [17] consists of emotion word associations computed via Hashtags on twitter texts labelled by emotions. – NRC Word-Emotion Association Lexicon [17] consists of 8 sense level associations (anger, fear, joy, sadness, anticipation, trust, disgust and surprise) and 2 sentiment level associations(positive and negative)


– The NRC Affect Intensity lexicon [16] provides real-valued affect intensity scores.

The final lexicon feature vector is the concatenation of all the individual lexicon features and has size (133, 1).
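Assuming the blocks above are simply concatenated per tweet (our assumption about how the hand-crafted set is assembled; the dimensions are the ones reported in this section), the combined vector looks like:

import numpy as np

deepmoji_soft = np.zeros(64)     # DeepMoji softmax-layer embedding
deepmoji_attn = np.zeros(2304)   # DeepMoji attention-layer embedding
skip_thought  = np.zeros(4800)   # Skip-Thought sentence vector
senti_neuron  = np.zeros(4096)   # Sentiment Neuron features
lexicons      = np.zeros(133)    # EmoInt lexical features

handcrafted = np.concatenate([deepmoji_soft, deepmoji_attn,
                              skip_thought, senti_neuron, lexicons])   # 11397-d per tweet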

3.2 Word Embeddings

The embedding matrix is generated from the pre-processed text using a combination of three pre-trained embeddings:
1. Pre-trained GloVe embeddings for tweets [22]: We use 200-dimensional pre-trained GloVe word embeddings, trained on the Twitter corpus, for the experiments. To make them compatible with the other embeddings, we pad each vector with a 100-dimensional zero vector.
2. Emoji2Vec [6]: Emoji2Vec provides 300-dimensional vectors for the most commonly used emojis on the Twitter platform (used in case any emoji is not replaced with its corresponding meaning).
3. Character-level embeddings2: Character-level embeddings trained over the Common Crawl GloVe corpus, providing 300-dimensional vectors for each character (used in case a word is not present in the other two embeddings).

The procedure to generate representations for a tweet using these embeddings is described in Algorithm 1.

Algorithm 1. Procedure to generate word representations
for word in tweet do
    if word in GloVe then
        word_vector = get_vector(GloVe, word)
    else if word in Emoji2Vec then
        word_vector = get_vector(Emoji2Vec, word)
    else
        /* n = number of characters in word */
        word_vector = (1/n) * sum_{i=1..n} get_vector(CharEmbed, chars[i])
    end if
end for
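A plain-Python rendering of Algorithm 1 is shown below; the dictionaries and the 300-dimensional zero fallback are assumptions made for illustration.

import numpy as np

def word_representation(word, glove, emoji2vec, char_emb, dim=300):
    if word in glove:
        return glove[word]
    if word in emoji2vec:
        return emoji2vec[word]
    chars = [char_emb[c] for c in word if c in char_emb]       # character-level fallback
    return np.mean(chars, axis=0) if chars else np.zeros(dim)

def tweet_matrix(tokens, glove, emoji2vec, char_emb):
    return np.stack([word_representation(t, glove, emoji2vec, char_emb) for t in tokens])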

4 Experiments and Results

4.1 Dataset

We evaluate our proposed model on the dataset of the SemEval-2018 shared task on Affect in Tweets [15]. There are approximately 1181, 449, and 937 tweets for training, development, and testing, respectively.

https://github.com/minimaxir/char-embeddings.


Fig. 2. Sentiment distribution for the SemEval-2018 task on Affect in Tweets [15]: (a) sentiment class distribution over the seven valence classes; (b) sentiment intensity distribution (Train: 1181, Development: 449, Test: 937 tweets).

For each tweet, two labels are given: a) a sentiment class (one of the seven classes on the sentiment scale, i.e., very positive (Pos-V), moderately positive (Pos-M), slightly positive (Pos-S), neutral (Neu), slightly negative (Neg-S), moderately negative (Neg-M), and very negative (Neg-V)); and b) an intensity score in the range 0 to 1. We treat the prediction of these two labels as two separate tasks, and in our multi-task learning framework we intend to solve them together. Brief statistics of the dataset are depicted in Fig. 2.

4.2 Preprocessing

Tweets in raw form are noisy because of the use of irregular, shortened text (e.g., hlo, whtsgoin, etc.), emojis, and slang, and are prone to many distortions of semantic and syntactic structure. The preprocessing step modifies the raw tweets to prepare them for feature extraction. We use the Ekphrasis tool [3] for tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction. Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It uses word statistics from two big corpora, the English Wikipedia and Twitter (330 million English tweets), and was developed as part of the text processing pipeline for the SemEval-2017 shared task on Sentiment Analysis in Twitter [25]. The preprocessing steps that we carry out are listed below; an illustrative sketch follows the list.
– Convert all characters to lower case.
– Remove punctuation except ! and ?, since these may contribute to better valence detection.
– Remove extra spaces and newline characters.
– Group similar emojis and replace them with their meaning in words using Emojipedia.
– Recognize named entities and replace them with a keyword or token (@shikhar → username, https://www.iitp.ac.in → url).
– Split hashtags (#iamcool → i am cool).
– Correct misspelled words (facbok → facebook).
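For illustration only, the toy regex pass below mimics a few of these steps; it is not the Ekphrasis pipeline used in the paper, and hashtag segmentation and spell correction are omitted.

import re

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " url ", t)     # URLs -> token
    t = re.sub(r"@\w+", " username ", t)        # mentions -> token
    t = re.sub(r"#(\w+)", r" \1 ", t)           # keep hashtag text (true segmentation needs a word splitter)
    t = re.sub(r"[^\w\s!?]", " ", t)            # drop punctuation except ! and ?
    return re.sub(r"\s+", " ", t).strip()

print(preprocess("Check https://www.iitp.ac.in @shikhar #iamcool!"))   # -> "check url username iamcool !"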

4.3 Experiments

We pad each tweet to a maximum length of 50 words and employ 300-dimensional word embeddings for the experiments (c.f. Sect. 3.2). The GRU dimension is set to 256. We use 100 filters of each of five sizes (i.e., 2-gram, 3-gram, 4-gram, 5-gram and 6-gram filters) with a max-pool layer in the CNN module. We use the ReLU [18] activation and set Dropout [30] to 0.5. We optimize our model using Adam [11] with cross-entropy and mean-squared-error (MSE) loss functions for sentiment classification and intensity prediction, respectively. For the experiments, we employ the Python-based deep learning library Keras with TensorFlow as the backend. We adopt the official evaluation metric of the SemEval-2018 shared task on Affect in Tweets [15], i.e., the Pearson correlation coefficient, for measuring the performance on both tasks. We train our model for a maximum of 100 epochs with early stopping (patience = 20).

Table 2. Pearson correlation for the STL and MTL frameworks for sentiment classification and intensity prediction. + Reported in [5]; * reproduced by us.

Framework                  | Sentiment classification: DL (softmax) | ML (SVM) | Intensity prediction: DL (sigmoid) | ML (SVR)
Single-task learning (STL) | 0.361           | 0.745 | 0.821           | 0.818
Multi-task learning (MTL)  | 0.408           | 0.772 | 0.825           | 0.830
State-of-the-art [5]       | 0.836+ (0.776*) | –     | 0.873+ (0.829*) | –
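The official metric can be computed directly with SciPy; the snippet below is a minimal check, where the gold values are taken from Table 1 (Sect. 1) and the predictions are dummies.

from scipy.stats import pearsonr

gold = [0.677, 0.283, 0.700]
pred = [0.650, 0.300, 0.720]
r, _ = pearsonr(gold, pred)
print(round(r, 3))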

In single-task learning (STL) framework, we build separate systems for both sentiment classification and intensity prediction. We pass the normalized tweet to our Convolutional-GRU framework for learning. Since the number of training samples are considerably few to effectively learn a deep learning model, we assist the model with various hand-crafted features. The concatenated representations are fed to the softmax layer (or sigmoid) for the sentiment (intensity) prediction. We obtain 0.361 Pearson coefficient for sentiment classification and 0.821 for intensity prediction. Further, we also try to exploit the traditional machine learning algorithms for prediction. We extract the concatenated representations and feed them as an input to SVM for sentiment classification and SVR for intensity prediction. Consequently, SVM reports increased Pearson score of 0.745 for sentiment classification, whereas we observe comparable results (i.e. 0.818 Pearson score) for intensity prediction. The MTL framework yields an improved performance for both the tasks in both the scenarios. In the first model, MTL reports 0.408 and 0.825 Pearson scores as compared with the Pearson scores of 0.361 & 0.821 in STL framework for the sentiment classification and intensity prediction, respectively. Similarly, the MTL framework reports 3 and 2 points improved Pearson scores in the second model for the two tasks, respectively. These improvements clearly suggest that the MTL framework, indeed, exploit the inter-relatedness of multiple tasks in


order to enhance the individual performance through a joint model. Further, we observe the improvements of the MTL models to be statistically significant at the 95% confidence level, i.e., p-value < 0.05 for a paired T-test.

On the same dataset, Duppada et al. [5] (the winning system of the SemEval-2018 task on Affect in Tweets [15]) report Pearson scores of 0.836 and 0.873 for sentiment classification and intensity prediction, respectively. The authors of [5] passed the same hand-crafted features individually through XGBoost and Random Forest classifiers/regressors, combined the results of all the classifiers/regressors using a stacking ensemble technique, and passed the outputs to a meta classifier/regressor (an Ordinal Logistic Classifier and a Ridge Regressor, respectively). In comparison, our proposed system (i.e., the MTL for ML framework) obtains Pearson scores of 0.772 and 0.830 for sentiment classification and intensity prediction, respectively. It should be noted that when we tried to reproduce the work of Duppada et al. [5], we obtained Pearson scores of only 0.776 and 0.829, respectively. Further, our proposed MTL model has lower complexity than the state-of-the-art systems: unlike them, we do not require a separate system for each task; rather, a single end-to-end model addresses both tasks simultaneously.

4.4 Error Analysis

In Fig. 3, we present the confusion matrices for both models (the first and the second, based on the DL and ML paradigms, respectively). It is evident from the confusion matrices that most of the mis-classifications lie in close proximity to the actual labels, and our systems only occasionally confuse the 'positive' and 'negative' polarities (i.e., only 43 and 22 such mis-classifications for the first, DL-based, and the second, ML-based, model, respectively).

Fig. 3. Confusion matrices for sentiment classification: (a) Multi-task: DL; (b) Multi-task: ML.

We also perform an error analysis on the obtained results. A few frequently occurring error cases are presented below:
– Metaphoric expressions: The presence of metaphoric/ironic/sarcastic expressions in tweets makes correct prediction challenging for the systems.


Table 3. MTL vs. STL for sentiment classification and intensity prediction (differences with respect to the actual value in parentheses).

a) Sentiment Classification
Sentence | Actual | DL MTL | DL STL | ML MTL | ML STL
Maybe he was partly right. THESE emails might lead to impeachment and 'lock him up' #ironic #ImpeachTrump | Neg-M | Neg-M | Neg-V | Neg-M | Neg-S
#Laughter strengthens #relationships. #Women are more attracted to someone with the ability to make them #laugh. | Pos-M | Pos-M | Pos-S | Pos-M | Pos-S

b) Intensity Prediction
Sentence | Actual | DL MTL | DL STL | ML MTL | ML STL
I graduated yesterday and already had 8 family members asking what job I've got now #nightmare | 0.55 | 0.57 (+0.02) | 0.51 (-0.04) | 0.59 (+0.04) | 0.64 (+0.09)
@rohandes Lets see how this goes. We falter in SL and this goes downhill. | 0.49 | 0.48 (-0.01) | 0.35 (-0.14) | 0.49 (+0.00) | 0.29 (-0.20)
It's kind of shocking how amazing your rodeo family is when the time comes that you need someone | 0.52 | 0.53 (+0.01) | 0.55 (+0.03) | 0.52 (+0.00) | 0.51 (-0.01)

• "@user But you have a lot of time for tweeting #ironic". Actual: Neg-M; Prediction: Neu
– Neutralizing effect of opposing words: The presence of opposing phrases in a sentence neutralizes the effect of the actual sentiment.
• "@user Macron slips up and has a moment of clarity & common sense... now he is a raging racist. Sounds right. Liberal logic". Actual: Neg-M; Prediction: Neu

We further analyze the predictions of our MTL models against the STL models. The analysis suggests that our MTL model indeed improves the predictions of many examples that are mis-classified (or have larger error margins) under the STL models. In Table 3, we list a few examples showing the actual labels, the MTL predictions, and the STL predictions for both sentiment classification and intensity prediction.

5 Conclusion

In this paper, we have presented a hybrid multi-task learning framework for affective language. We propose a convolutional-GRU network, assisted by a diverse hand-crafted feature set, for learning shared hidden representations for multiple tasks. The learned representation is fed to an SVM/SVR for the predictions. We have evaluated our model on the benchmark dataset of the SemEval-2018 shared task on Affect in Tweets for the two tasks (i.e., sentiment classification and intensity prediction). The evaluation suggests that a single multi-task model obtains improved results compared with separate single-task systems.


Acknowledgement. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).


Sentiment Analysis and Sentence Classification in Long Book-Search Queries

Amal Htait, Sébastien Fournier, and Patrice Bellot

1 Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
{amal.htait,sebastien.fournier,patrice.bellot}@lis-lab.fr
2 Aix Marseille Univ, Avignon Université, CNRS, EHESS, OpenEdition Center, Marseille, France
{amal.htait,sebastien.fournier,patrice.bellot}@openedition.org

Abstract. Handling long queries can involve either reducing their size by retaining only useful sentences, or decomposing the long query into several short queries based on their content. A proper sentence classification improves the usefulness of both procedures. Can sentiment analysis play a role in sentence classification? This paper analyses the correlation between sentiment analysis and sentence classification in long book-search queries. It also studies the similarity in writing style between book reviews and sentences in book-search queries. To accomplish this study, a semi-supervised method for sentiment intensity prediction and a language model based on book reviews are presented, together with graphical illustrations of the findings, followed by interpretations and conclusions.

Keywords: Sentiment intensity · Language model · Search queries · Books · Word embedding · Seed-words · Book reviews

1 Introduction

The book search field is a subsection of the data search domain, with a recommendation aspect. Users seek book suggestions and recommendations through a request in natural language text form, called a user query. One of the main characteristics of queries in book search is their length. The user query is often long, descriptive, and even narrative. Users express their needs for a book, their opinion toward certain books, describe content or events in a book, and sometimes even share personal information (e.g., I am a teacher). Being able to differentiate types of sentences in a query can help in many tasks. Detecting non-useful sentences in the query (e.g., Thanks for any and all help.) can help in query reduction, and classifying sentences by the type of information they carry can be used for adapted search. For example, sentences describing a good reading experience, with a book title, can be oriented to a book similarity search, but sentences expressing preferences for a certain topic should lead to a topic search. Likewise, sentences including personal information can be used for personalised search.

In this work, sentence classification is studied on two levels: the usefulness of the sentence towards the search, and the type of information provided by the useful sentence. Three types of information are highlighted: book titles and author names (e.g., I read "Peter the Great His Life and World" by Robert K. Massie.), personal information (e.g., I live in a very conservative area), and narration of book content or story (e.g., The story opens on Elyse overseeing the wedding preparation of her female cousin). "Different types of sentences express sentiment in very different ways" [4]; therefore, the correlation between the sentiment of a sentence and its type is studied. For that task, sentiment intensity is predicted using a semi-supervised method, explained in Sect. 4.

In addition, sentences in a query can share a similar writing style and subject with book reviews. Below is a part of a long book-search query:

I just got engaged about a week and a half ago and I'm looking for recommendations on books about marriage. I've already read a couple of books on marriage that were interesting. Marriage A History talks about how marriage went from being all about property and obedience to being about love and how the divorce rate reflects this. The Other Woman: Twenty-one Wives, Lovers, and Others Talk Openly About Sex, Deception, Love, and Betrayal not the most positive book to read but definitely interesting. Dupont Circle A Novel I came across at Kramerbooks in DC and picked it up. The book focuses on three different couples including one gay couple and the laws issues regarding gay marriage ...

In the example query, the part in bold presents a description of specific books' content together with book titles, e.g. "Marriage A History", and interpretations or a personal point of view with expressions like "not the most positive book ... but definitely interesting". These sentences read like book-review sentences. Therefore, finding similarities between certain sentences in a query and book reviews can be an important feature for sentence classification. To calculate that similarity in a general form, a statistical language model of reviews is used to find, for each sentence in the query, its probability of being generated from that model (and therefore its similarity to the model's training dataset of reviews).

This work covers an analysis of the correlation of a sentence's type with its sentiment intensity and its similarity to reviews, and the paper proceeds as follows:
– Presenting the user queries used for this work.
– Extracting the sentiment intensity of each sentence of the queries.
– Creating a statistical language model based on reviews, and calculating the probability of each sentence being generated from the model.
– Analysing the relation between language model scores, sentiment intensity scores and the type of sentences.

2 Related Work

For the purpose of query classification, many machine learning techniques have been applied, including supervised [9], unsupervised [6] and semi-supervised learning [2]. In the book search field, fewer studies have covered query classification. Ollagnier et al. [10] worked on a supervised machine learning method (Support Vector Machine) for classifying queries into the following classes: oriented (a search on a certain subject with orienting terms), non-oriented (a search on a theme in general), specific (a search for a specific book with an unknown title), and non-comparable (when the search does not belong to any of the previous classes). Their work was based on 300 annotated queries from INEX SBS 2014 (https://inex.mmci.uni-saarland.de/data/documentcollection.html). However, that work, like many others, addressed the classification of whole queries and not of the sentences within the query. The length of book-search queries creates new obstacles to overcome, and the most difficult one is the variety of information in their long content, which requires classification at the sentence level. Sentences in general, depending on their type, reveal sentiment in different ways; therefore, Chen et al. [4] focused on using classified sentences to improve sentiment analysis with deep machine learning. In this work, the opposite perspective is studied: the improvement of sentence classification using sentiment analysis. In addition, this work studies the improvement of sentence classification using a language model technique. Language models (LM) have been successfully applied to text classification. In [1], models were created using annotated training datasets and then used to compute the likelihood of generating the test sentences. In this work, a model is created based on book reviews and used to compute the likelihood of generating query sentences, as a measurement of the similarity between the style of book reviews and the sentences of book-search queries.

3 User Queries

The dataset of user queries used in this work is provided by the CLEF Social Book Search Lab – Suggestion Track (http://social-book-search.humanities.uva.nl/#/suggestion). The track provides realistic search requests (also known as user queries), collected from LibraryThing (https://www.librarything.com/). Out of the 680 user queries in the 2014 dataset of the Social Book Search Lab, 43 queries are randomly selected based on their length; these 43 queries have more than 55 words, stop-words excluded. Each query is then segmented into sentences, which results in a total of 528 sentences. These sentences are annotated based on their usefulness towards the search and on the information provided: book titles and author names, personal information, and narration of book content. An example is shown in the XML extract in Fig. 1.


Fig. 1. An example of annotated sentences from user queries.

4 Sentiment Intensity

As part of this work, sentiment intensity is calculated for each sentence of the queries. The method is inspired by a semi-supervised method for sentiment intensity prediction in tweets, which was established on the concepts of adapted seed-words and word embeddings [8]. Note that seed-words are words with strong semantic orientation, chosen for their lack of sensitivity to the context; they are used as paradigms of positive and negative semantic orientation. Adapted seed-words are seed-words with the characteristic of being used in a certain context or subject. Word embedding is a method to represent words as high-quality learned vectors, from large amounts of unstructured and unlabelled text data, to predict neighbouring words. In the work of Htait et al. [8], the extracted seed-words were adapted to micro-blogs. For example, the word cool is an adjective that refers to a moderately low temperature and has no strong sentiment orientation, but it is often used in micro-blogs as an expression of admiration or approval; therefore, cool is considered a positive seed-word in micro-blogs. In this paper, book search is the targeted domain for sentiment intensity prediction; therefore, the extracted seed-words are adapted to the book search domain and, more specifically, extracted from book reviews, since reviews have the richest vocabulary in the book search domain. Using the book reviews annotated as positive and negative by Blitzer et al. [3] (from the Multi-Domain Sentiment Dataset, http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html), the list of the most common words in each annotation class is collected. Then, after removing the stop words, the 70 words most relevant to the book domain with strong sentiment are selected manually from each list, as positive and negative seed-words. Examples of positive seed-words adapted to book search are insightful, inspirational and masterpiece; examples of negative seed-words are endless, waste and silly.


Word embeddings, or distributed representations of words in a vector space, are capable of capturing lexical, semantic, syntactic, and contextual similarity between words. To determine the similarity between two words, the cosine distance between the vectors of these two words in the word embedding model is used. In this paper, a word embedding model is created based on more than 22 million Amazon book reviews [7] as training dataset, after applying pre-processing to the corpus to improve its usefulness (e.g. tokenization, replacing hyperlinks and emoticons, removing some characters and punctuation). To learn word embeddings from the prepared corpus (which is raw text), Word2Vec is used with the Skip-Gram training strategy (in which the model is given a word and attempts to predict its neighbouring words). To train the word embeddings and create the models, the Gensim framework for Python is used (https://radimrehurek.com/gensim/index.html). As for the parameters, the models are trained with word representations of dimensionality 400, a context window of one and negative sampling for five iterations (k = 5). As a result, a model is created with a vocabulary of more than 2.5 million words. Then, for each word in a sentence, the difference between its average cosine similarity with positive seed-words and with negative seed-words represents its sentiment intensity score, using the previously created model. For example, the word confusing has an average cosine similarity with positive seed-words of 0.203 and an average cosine similarity with negative seed-words of 0.322, which gives a sentiment intensity score of −0.119 (a negative score represents a negative feeling); for the word young the sentiment intensity score equals 0.012. To predict the sentiment intensity of an entire sentence, first the adjectives, nouns and verbs are selected from the sentence using the Stanford POS tagger [12], then the scores of those with high sentiment intensity are added up to obtain a total score for the sentence. Note that the tool created to calculate the sentiment intensity of words, Adapted Sentiment Intensity Detector (ASID), is shared by the authors as open source (https://github.com/amalhtait/ASID).
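Purely as an illustration of the approach just described (and not the authors' released ASID code), the following sketch trains a skip-gram Word2Vec model with Gensim and scores a word by the difference between its average cosine similarity to positive and to negative seed-words; the toy corpus and the shortened seed lists are placeholders.

```python
# A minimal sketch (Gensim 4 API), assuming a tokenized corpus of book reviews;
# the toy corpus and shortened seed lists below are placeholders, not the
# authors' actual data or their ASID implementation.
from gensim.models import Word2Vec

reviews = [["this", "novel", "is", "a", "masterpiece"],
           ["an", "endless", "silly", "waste", "of", "time"]]

# Parameters follow the description above: skip-gram, 400 dimensions,
# context window of 1, negative sampling (5 negative samples).
model = Word2Vec(sentences=reviews, sg=1, vector_size=400,
                 window=1, negative=5, min_count=1)

positive_seeds = ["insightful", "inspirational", "masterpiece"]
negative_seeds = ["endless", "waste", "silly"]

def word_sentiment_intensity(word, wv, pos_seeds, neg_seeds):
    """Average cosine similarity to positive seeds minus average to negative seeds."""
    if word not in wv:
        return 0.0
    pos = [wv.similarity(word, s) for s in pos_seeds if s in wv]
    neg = [wv.similarity(word, s) for s in neg_seeds if s in wv]
    if not pos or not neg:
        return 0.0
    return sum(pos) / len(pos) - sum(neg) / len(neg)

print(word_sentiment_intensity("masterpiece", model.wv, positive_seeds, negative_seeds))
```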

5 Reviews Language Model

The book reviews are taken as a reference for detecting sentence characteristics, since a similarity in style is noticeable between certain sentences of the user queries and the reviews. To calculate this similarity in writing style, a statistical language modelling approach is used to compute the likelihood of generating a query sentence from a book-reviews language model. Statistical language modelling was originally introduced by Collins in [5], and it is the science of building models that estimate the prior probabilities of word strings [11]. The model can be presented as θ_R = P(w_i | R) with i ∈ [1, |V|], where P(w_i | R) is the probability of word w_i in the reviews corpus R, and |V| is the size of the vocabulary. This model is used to denote the probability of a word according to the distribution as P(w_i | θ_R) [13]. The probability of a sentence W being generated from the book-reviews language model θ_R is defined as the conditional probability P(W | θ_R) [13], calculated as follows:

P(W \mid \theta_R) = \prod_{i=1}^{m} P(w_i \mid \theta_R)    (1)

where W is a sentence, w_i is a word of the sentence W, and θ_R represents the book-reviews model. The tool SRILM [11] (http://www.speech.sri.com/projects/srilm/) is used to create the model from the book-reviews dataset (as training data) and to compute the probability of the query sentences being generated from the model (as test data). The language model is a standard trigram model with Good-Turing (Katz) discounting for smoothing, trained on 22 million Amazon book reviews [7]. SRILM reports diagnostic details such as the number of words in the sentence, the likelihood of the sentence under the model (or its logarithm, log P(W | θ_R)), and the perplexity, which is the inverse probability of the sentence normalized by the number of words. In this paper, the length of the sentences varies from one word to almost 100 words, therefore the perplexity score seems more reliable for a comparison between sentences. Note that minimising perplexity is the same as maximising likelihood, and a low perplexity indicates that the probability distribution is good at predicting the sample.
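To make Eq. (1) and the perplexity comparison concrete, here is a small, purely illustrative sketch; it uses a toy unigram model with made-up probabilities rather than the SRILM trigram model actually used in this work.

```python
# Illustrative only: Eq. (1) and perplexity with a toy unigram "reviews" model.
# The probabilities and the OOV floor are invented for the example; the real
# setup relies on SRILM trigram models with Good-Turing (Katz) smoothing.
import math

review_lm = {"the": 0.10, "book": 0.05, "story": 0.03, "was": 0.04, "great": 0.02}
UNK_PROB = 1e-6  # assumed probability floor for out-of-vocabulary words

def sentence_logprob(words, lm):
    """log P(W | theta_R) = sum_i log P(w_i | theta_R), cf. Eq. (1)."""
    return sum(math.log(lm.get(w, UNK_PROB)) for w in words)

def perplexity(words, lm):
    """Inverse probability of the sentence, normalized by its number of words."""
    return math.exp(-sentence_logprob(words, lm) / len(words))

print(perplexity(["the", "story", "was", "great"], review_lm))   # low = review-like
print(perplexity(["thanks", "for", "any", "help"], review_lm))   # high = less review-like
```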

6 Analysing Scores

As previously explained in Sect. 3, a corpus of 528 sentences from user queries is created and annotated as in the examples of Fig. 1. Then, for each sentence, the sentiment intensity score and the perplexity score are calculated following the methods explained in Sects. 4 and 5. To present the scores, violin plots are used for their ability to show the probability density of the data at different values. They also include a marker (white dot) for the median of the data and a box (black rectangle) indicating the interquartile range.
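As an aside, a plot of this kind can be produced, for instance, with seaborn; the DataFrame and column names below are hypothetical and only illustrate the type of figure used in the following subsections.

```python
# Hypothetical example data: per-sentence usefulness labels and sentiment scores.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "useful": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "sentiment_intensity": [0.4, 0.1, 0.0, -0.05, 0.9, 0.02, 0.3, -0.01],
})

# inner="box" draws the interquartile-range box and the median marker inside each violin.
sns.violinplot(data=df, x="useful", y="sentiment_intensity", inner="box")
plt.show()
```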

6.1 Sentiment Intensity, Perplexity and Usefulness Correlation

The graph in Fig. 2 shows the distribution (or probability density) of sentiment intensity between two categories of sentences: on the right the sentences which are useful to the search, and on the left the sentences which are not useful to the search. The shape on the left is horizontally stretched compared to the right one, and mostly dilated over the area of neutral sentiment intensity (sentiment score = 0), where the median of the data also lies. On the other hand, the shape on the right is vertically stretched, showing the diversity of sentiment intensity in the sentences useful to the search, but concentrated mostly in the positive area, at sentiment scores higher than zero but lower than 0.5.

Fig. 2. The distribution of sentiment intensity between two categories of sentences: on the right the sentences which are useful to the search and on the left the sentences which are not useful to the search.

The graph in Fig. 3 represents the distribution of perplexity between two categories of sentences: on the right the sentences which are useful to the search, and on the left the sentences which are not useful to the search. Both shapes are vertically compressed and dilated over the area of low perplexity, but the graph on the right, of the useful sentences, has the median of the data at a lower perplexity level than the left graph. This is explained by the slight horizontal dilation of the left graph above the median level.

Fig. 3. The distribution of perplexity between two categories of sentences: on the right the sentences which are useful to the search and on the left the sentences which are not useful to the search.

6.2 Sentiment Intensity, Perplexity and Information Type Correlation

The graphs in Fig. 4 show the distribution of sentiment for the informational categories of sentences, from top to bottom:
– Book titles and author names: on the right, the sentences with book titles or author names, and on the left, the sentences without them. The graph on the right shows a high distribution of positive sentiment, while the left graph shows a high concentration of neutral sentiment with a small distribution of positive and negative sentiment. The lack of negative sentiment in sentences with book titles or author names is also noticeable.
– Personal information: on the right, the sentences containing personal information about the user, and on the left, the sentences without personal information. The graph on the right shows a high concentration of neutral sentiment, where the median of the data also lies, and then a smaller distribution of positive sentiment. On the left, the graph shows a lower concentration of neutral sentiment, but the existence of sentences with extremely high positivity is noticeable.
– Narration of book content: on the right, the sentences containing book content or events, and on the left, the sentences without book content. Both graphs are vertically stretched but have different shapes. The graph on the right shows a higher distribution of negative sentiment for sentences with book content, and the graph on the left shows higher positive values.

The graphs in Fig. 5 show the distribution of perplexity for the informational categories of sentences, from top to bottom: book titles and author names, personal information and narration of book content. Comparing the first pair of graphs, of book titles and author names, the left graph has its median at a lower perplexity level than the right graph, with a higher concentration of data in a tighter interval of perplexity. For the second pair, of personal information, the right graph shows a lower interquartile range than the left graph. As for the third pair, of book content, only a slight difference can be detected between the two graphs, where the left graph is more stretched vertically.

6.3 Graphs Interpretation

Observing the distribution of the data in the graphs of the previous sections, several conclusions can be drawn:
– In Fig. 2, it is clear that useful sentences tend to carry a high level of emotion (positive or negative), while non-useful sentences are more likely to be neutral.


Fig. 4. The distribution of sentiment between the informational categories of sentences: book titles or author names, personal information and narration of book content.


Fig. 5. The distribution of perplexity between the informational categories of sentences: book titles or author names, personal information and narration of book content.


– Fig. 3 shows that sentences with high perplexity, i.e. sentences that are not similar to book-review sentences, have a higher probability of being non-useful than useful.
– Fig. 4 gives an idea of the correlation of sentiment with the information in a sentence: sentences with book titles or author names show a high level of positive emotion, while sentences with personal information tend to be neutral, and sentences with book content narration are distributed over the area of moderate emotion, with a higher probability of positive than negative.
– Fig. 5 gives an idea of the correlation of similarity to the reviews' style with the information in a sentence: sentences with no book titles are more similar to reviews than those with book titles; sentences with personal information tend to be similar to reviews; and sentences with book content narration show slightly more similarity to the style of review sentences than sentences without book content narration.

7 Conclusion and Future Work

This paper analyses the relation of sentiment intensity and similarity to reviews with the types of sentences in long book-search queries. First, the user queries and book collections are presented, and the sentiment intensity of each sentence of the queries is extracted (using the Adapted Sentiment Intensity Detector, ASID). Then, a statistical language model is created from reviews, and the probability of each sentence being generated from that model is calculated. Finally, the relation between sentiment intensity scores, language model scores and the types of sentences is presented in graphs. The graphs show that sentiment intensity can be an important feature for classifying sentences based on their usefulness to the search, since non-useful sentences are more likely to be neutral in sentiment than useful sentences. The graphs also show that sentiment intensity can be an important feature for classifying sentences based on the information they contain: sentences containing book titles are richer in sentiment, and mostly positive, compared to sentences without book titles, and sentences with personal information tend to be neutral with a higher probability than those without personal information. On the other hand, the graphs show that the similarity of sentences to the reviews' style can also be a feature for classifying sentences by usefulness and by information content, although at a slightly lower level of importance than sentiment analysis. Similarity between sentences and the style of book reviews is higher for useful sentences, for sentences with personal information and for sentences with narration of book content, but not for sentences containing book titles. This analysis gives a preview of the effect of sentiment analysis and similarity to reviews on sentence classification in long book-search queries. The next step is to test these conclusions by using sentiment analysis and similarity to reviews as new features in a supervised machine learning classification of sentences in long book-search queries.


Acknowledgement. This work has been supported by the French State, managed by the National Research Agency under the "Investissements d'avenir" program, under the EquipEx DILOH projects (ANR-11-EQPX-0013).

References

1. Bai, J., Nie, J.Y., Paradis, F.: Using language models for text classification. In: Proceedings of the Asia Information Retrieval Symposium, Beijing, China (2004)
2. Beitzel, S.M., Jensen, E.C., Frieder, O., Lewis, D.D., Chowdhury, A., Kolcz, A.: Improving automatic query classification via semi-supervised learning. In: Fifth IEEE International Conference on Data Mining (ICDM 2005). IEEE (2005)
3. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447 (2007)
4. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst. Appl. 72, 221–230 (2017)
5. Collins, M.: Three generative, lexicalised models for statistical parsing. arXiv preprint cmp-lg/9706022 (1997)
6. Diemert, E., Vandelle, G.: Unsupervised query categorization using automatically-built concept graphs. In: Proceedings of the 18th International Conference on World Wide Web, pp. 461–470 (2009)
7. He, R., McAuley, J.: Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th International Conference on World Wide Web, pp. 507–517 (2016)
8. Htait, A., Fournier, S., Bellot, P.: LSIS at SemEval-2017 task 4: using adapted sentiment similarity seed words for English and Arabic tweet polarity classification. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 718–722 (2017)
9. Kang, I.H., Kim, G.: Query type classification for web document retrieval. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 64–71 (2003)
10. Ollagnier, A., Fournier, S., Bellot, P.: Analyse en dépendance et classification de requêtes en langue naturelle, application à la recommandation de livres. Traitement Automatique des Langues 56(3) (2015)
11. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing (2002)
12. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 252–259 (2003)
13. Zhai, C.: Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies 1(1), 1–141 (2008)

Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media

Pavel Přibáň and Alexandra Balahur

1 European Commission Joint Research Centre, Via E. Fermi 2749, 21027 Ispra, VA, Italy
[email protected], [email protected]
2 Faculty of Applied Sciences, Department of Computer Science and Engineering, University of West Bohemia, Univerzitni 8, 301 00 Pilsen, Czech Republic

Abstract. In this paper, we present an evaluation of three in-house sentiment analysis (SA) systems, originally designed for three distinct SA tasks, in a highly multilingual setting. For the evaluation, we collected a large number of available gold-standard datasets, in different languages and of varied text types. The aim of using datasets from different domains was to obtain a clear snapshot of the overall performance of the systems and thus a better-quality evaluation. We compare the results obtained with the best performing systems evaluated on these datasets and perform an in-depth error analysis. Based on the results, we can see that some systems perform better for datasets and tasks other than the ones they were designed for, showing that we could replace one system with another and gain an improvement in performance. Our results are hardly comparable with the original dataset results, because the datasets often contain a different number of polarity classes than we used, and for some datasets there are not even basic results. For the cases in which a comparison was possible, our results show that our systems perform very well in view of multilinguality.

Keywords: Sentiment analysis · Multilinguality · Evaluation

1 Introduction

Recent years have seen a growing interest in the task of Sentiment Analysis (SA). In spite of these efforts, however, real applications of sentiment analysis are still challenged by a series of aspects, such as multilinguality and domain dependence. Sentiment analysis can be divided into different sub-tasks, like aspect-based SA, polarity or fine-grained SA, and entity-centered SA. SA can also be applied at many different levels of scope – document level, sentence level or phrase level. Performing sentiment analysis in a multilingual setting is even more challenging, as most available datasets are annotated for English texts, and low-resourced languages suffer from a lack of annotated datasets on which machine learning models can be trained.

In this paper, we describe an evaluation of our three in-house SA systems, designed for three distinct SA tasks, in a highly multilingual setting. These systems process a tremendous amount of text every day, and therefore it is essential to know their quality and to be able to evaluate them correctly. At present, these systems cannot be sufficiently evaluated. Due to this lack of proper evaluation, we decided to prepare appropriate resources and tools for the evaluation, assess these applications and summarize the obtained results. We collect and describe a rich collection of publicly available datasets for sentiment analysis, and we present the performance of the individual systems on the collected datasets. We also carry out additional experiments with the datasets, and we show that for news articles classification performance increases when the title of the news article is added to the body text.

1.1 Tasks Description

The evaluated systems are intended for solving three sentiment-related tasks – the Twitter Sentiment Analysis (TSA) task, the Tonality in News (TON) task and the Targeted Sentiment Analysis (ESA) task, which can also be called Entity-Centered Sentiment Analysis. In the Twitter Sentiment Analysis and Tonality tasks, the systems have to assign a polarity which determines the overall sentiment of a given tweet or news article (generally speaking, a text). The Targeted Sentiment Analysis (ESA) task is the task of classifying the sentiment polarity expressed towards an entity mention in a given text. For all of the mentioned tasks, the sentiment polarity can be one of the positive, negative or neutral labels, or a number from −100 to 100, where a negative value indicates negative sentiment, a positive value indicates positive sentiment and zero (or values close to zero) means neutral sentiment. In our evaluation experiments, we used the 3-point scale (positive, negative, neutral).
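As a minimal illustration of how a numeric polarity on the −100 to 100 scale can be mapped to the 3-point scale used in the evaluation, consider the sketch below; the neutrality threshold is an assumption, since the description above only says that values close to zero are treated as neutral.

```python
# Hedged sketch: mapping a [-100, 100] polarity score to a 3-point label.
# The threshold of 5 is an assumed value, not taken from the paper.
def to_three_point(score: float, neutral_threshold: float = 5.0) -> str:
    if score > neutral_threshold:
        return "positive"
    if score < -neutral_threshold:
        return "negative"
    return "neutral"

print(to_three_point(42.0))   # positive
print(to_three_point(-3.0))   # neutral
```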

1.2 Systems Overview

The TwitOMedia system [4] for the TSA task uses a hybrid approach, which employs supervised learning with Support Vector Machines trained by Sequential Minimal Optimization [32] on unigram and bigram features.

The EMMTonality system for the TON task counts occurrences of language-specific sentiment terms from our in-house language-specific dictionaries. Each sentiment term has an assigned sentiment value. The system sums up the values of all words (present in the dictionary) in a given text. The resulting number is normalized and scaled to a range from −100 to 100, where a negative value indicates negative tonality, a positive value indicates positive tonality and neutral tonality is expressed by zero. The EMMTonality system also contains a module for the ESA task, which computes the sentiment towards an entity in a given text. This approach is the same as for the tonality in news articles, with the difference that only a certain number of words surrounding the entity are used to compute the sentiment value towards the entity.

The EMMSenti system is intended to solve only the ESA task. It uses a similar approach to the EMMTonality system; see [38] for a detailed description.
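To make the dictionary-based scoring described above more concrete, here is a small, purely illustrative sketch; it is not the EMM systems' actual code, and the dictionary, the normalization scheme and the entity window size are all assumptions.

```python
# Hedged sketch of lexicon-based tonality scoring scaled to [-100, 100],
# with an entity-centred variant restricted to a window around the entity.
SENTIMENT_DICT = {"good": 2, "excellent": 3, "bad": -2, "terrible": -3}  # toy lexicon

def tonality(tokens, lexicon=SENTIMENT_DICT):
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return 0
    raw = sum(hits) / len(tokens)                    # assumed length normalization
    max_abs = max(abs(v) for v in lexicon.values())  # assumed scaling factor
    return max(-100, min(100, round(100 * raw / max_abs)))

def entity_tonality(tokens, entity_index, window=6, lexicon=SENTIMENT_DICT):
    """Tonality computed only over the words surrounding the entity mention."""
    lo, hi = max(0, entity_index - window), entity_index + window + 1
    return tonality(tokens[lo:hi], lexicon)

text = "the new policy is excellent but the debate was terrible".split()
print(tonality(text))                                  # document-level tonality
print(entity_tonality(text, text.index("policy")))     # entity-centred tonality
```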

2 Related Work

In [35], the authors summarize eight publicly available datasets for Twitter sentiment analysis and give an overview of the existing evaluation datasets and their characteristics. Another comparison of available methods for sentiment analysis can be found in [15]. The authors describe four different approaches (machine learning, lexicon-based, statistical and rule-based) and distinguish between three different levels of scope for sentiment analysis, i.e. document level, sentence level and word/phrase/sub-sentence level.

In recent years, most state-of-the-art systems and approaches for sentiment analysis have used neural networks and deep learning techniques. Convolutional Neural Networks (CNN) [24] and Recurrent Neural Networks (RNN), such as Long Short-Term Memory (LSTM) [21] or Gated Recurrent Units (GRU) [12], became very popular. In [22], a CNN architecture is used for sentiment analysis and question answering. One indication of the success of neural networks is that most of the top teams [8,14,18] in sentiment analysis (or tasks related to sentiment analysis) in the recent SemEval [28,34] and WASSA [23,27] competitions used deep learning techniques. In [41], a comprehensive survey of current applications of sentiment analysis is presented. [5] compare several models on six different benchmark datasets, which belong to different domains and additionally have different levels of granularity; they show that LSTM-based neural networks are particularly good at fine-grained sentiment tasks. In [39], the authors introduce sentiment-specific word embeddings (SSWE) for Twitter sentiment classification, which encode sentiment information in the continuous representation of words.

The majority of sentiment analysis research focuses mainly on monolingual methods, especially in English, but some effort is being made on multilingual approaches as well. [2] propose an approach to obtain training data for French, German and Spanish using three distinct machine translation (MT) systems. They translated English data into the three languages, and then evaluated sentiment analysis performance after using the three MT systems. They showed that the gap in classification performance between systems trained on English and on translated data is minimal, and they claim that MT systems are mature enough to be reliably employed to obtain training data for languages other than English, and that sentiment analysis systems can obtain performance comparable to that obtained for English. In [3], they extended the work from [2] and showed that tf-idf weighting with unigram features has a positive impact on the results. In [11], the authors study the possibility of using an English model for sentiment analysis in Russian, Spanish, Turkish and Dutch, where the annotated data are more limited. They propose a multilingual approach in which a single RNN model is built in the language with the largest sentiment analysis resources available; MT is then used to translate the test data to English, and finally the model is used to classify the translated data. The paper [16] provides a review of multilingual sentiment analysis. The authors compare their implementation of existing approaches on common data; the precision observed in their experiments is typically lower than the one reported by the original authors, which could be caused by the lack of detail in the original presentation of those approaches. In [42], bilingual sentiment word embeddings were created, based on the idea of encoding sentiment information into semantic word vectors. A related multilingual approach to sentiment analysis for low-resource languages is presented in [6]: the authors introduce Bilingual Sentiment Embeddings (BLSE), which are jointly optimized to represent (a) semantic information in the source and target languages, which are bound to each other through a small bilingual dictionary, and (b) sentiment information, which is annotated on the source language only. In [7], the authors extend the approach from [6] to domain adaptation for sentiment analysis: their model takes as input two mono-domain embedding spaces and learns to project them into a bi-domain space, which is jointly optimized to project across domains and to predict sentiment.

From this review, we can deduce that the current state-of-the-art approaches for sentiment analysis in English are based almost solely on neural networks and deep learning techniques. Deep learning techniques usually require more data than the "traditional" machine learning approaches (Support Vector Machines, Logistic Regression), and it is evident that they will be used for rich-resource languages (English). On the other hand, much less effort has been invested in multilingual approaches and low-resource languages compared to English. The first studies on multilingual approaches mostly relied on machine translation systems, but in recent years neural networks and deep learning techniques have been employed as well. Another common idea for multilingual approaches in SA is to find a way to build a model on data from a rich-resource language and to transfer the knowledge in such a way that the model can be used for other languages.

3 Datasets

In this section, we describe the datasets we collected for the evaluation. The assessed applications require different types of datasets, or at least different domains, to carry out a proper evaluation. We collected mostly publicly available datasets, but we also used our in-house non-public datasets. The polarity labels for all collected Twitter and news datasets are positive, neutral or negative. If an original dataset contained polarity labels other than the three mentioned, we either discarded them or mapped them to the positive, neutral or negative labels.


Sentiment analysis of tweets is a prevalent problem, and much effort has been put into solving it and related problems in recent years [19,20,23,27,29,30,34]. Therefore, datasets for this task are easier to find. On the other hand, finding datasets for the ESA task is much more challenging, because less research effort is being put into this task and thus fewer resources exist. For sentiment analysis in news articles, we were not able to find a proper public dataset for the English language, and therefore we used our in-house datasets. For some languages, publicly available corpora exist, such as Slovenian [10], German [25], Brazilian Portuguese [1], Ukrainian and Russian [9].

3.1 Twitter Datasets

In this subsection, we present the sentiment datasets for the Twitter domain. We collected 2.8M labelled tweets in total from several datasets; see Table 1 for detailed statistics. Next, we shortly describe each of these datasets.

Table 1. Twitter datasets statistics.

Dataset              | Total     | Positive  | Negative | Neutral
Sentiment140 Test    | 498       | 182       | 177      | 139
Sentiment140 Train   | 1 600 000 | 800 000   | 800 000  | –
Health Care Reform   | 2 394     | 543       | 1 381    | 470
Obama-McCain Debate  | 1 904     | 709       | 1 195    | –
Sanders              | 3 424     | 519       | 572      | 2 333
T4SA                 | 1 179 957 | 371 341   | 179 050  | 629 566
SemEval 2017 Train   | 52 806    | 20 555    | 8 430    | 23 821
SemEval 2017 Test    | 12 284    | 2 375     | 3 972    | 5 937
InHouse Tweets Test  | 3 813     | 1 572     | 601      | 1 640
InHouse Tweets Train | 4 569     | 2 446     | 955      | 1 168
Total                | 2 861 649 | 1 200 242 | 996 333  | 665 074

The Sentiment140 [19] dataset consists of two parts – training and testing. The training part includes 800K positive and 800K negative automatically labelled tweets. The authors of this dataset collected tweets containing certain emoticons and assigned a label to every tweet based on the emoticon; for example, :) and :-) both express positive emotion, and thus tweets containing these emoticons were labelled as positive. The testing part of the dataset is composed of 498 manually annotated tweets (177 negative, 139 neutral and 182 positive). A detailed description of this approach is given in [19]. The authors of [37] created the Health Care Reform dataset from tweets about the health care reform in the USA: they extracted tweets containing the health care reform hashtag "#hcr" posted in early 2010. This dataset contains 543 positive, 1381 negative and 470 neutral examples.
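The emoticon-based distant supervision just described can be illustrated with a short sketch; the emoticon sets below are examples rather than the exact lists used to build Sentiment140.

```python
# Hedged sketch of distant supervision: label a tweet by its emoticons,
# then strip the emoticons from the text (ambiguous tweets are discarded).
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}

def distant_label(tweet: str):
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:          # neither or both -> ambiguous, discard
        return None
    text = " ".join(t for t in tokens
                    if t not in POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS)
    return text, ("positive" if has_pos else "negative")

print(distant_label("loved this book :)"))   # ('loved this book', 'positive')
```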


The Obama-McCain Debate [36] dataset was manually annotated via Amazon Mechanical Turk by one or more annotators with the categories positive, negative, mixed or other. A total of 3269 tweets posted during the presidential debate between Barack Obama and John McCain on September 26th, 2008 were annotated. We filtered this dataset to keep only tweets with a positive or negative label (no neutral class is present). After the filtering process, we obtained 709 positive and 1195 negative examples.

The T4SA [40] dataset was collected from July to December 2016. The authors discarded retweets, tweets not containing any static image and tweets whose text was less than five words long, and were able to gather 3.4M tweets in English. They then classified the sentiment polarity of the texts and selected the tweets with the most confident textual sentiment predictions, which resulted in approximately a million labelled tweets. For the sentiment polarity classification, the authors used an adapted version of the ItaliaNLP Sentiment Polarity Classifier [13], which uses a tandem LSTM-SVM architecture. Along with the tweets, the authors also crawled the images contained in the tweets; the aim was to automatically build a training set for learning a visual classifier able to discover the sentiment polarity of a given image [40].

The SemEval-2017 dataset was created for the Sentiment Analysis in Twitter task [34] at SemEval 2017. The authors made available all the data from the previous years of the Sentiment Analysis in Twitter tasks [30], and they also collected some new tweets. They chose English topics based on popular events that were trending on Twitter; the topics included a range of named entities (e.g., Donald Trump, iPhone), geopolitical entities (e.g., Aleppo, Palestine), and other entities. The dataset is divided into two parts – SemEval 2017 Train and SemEval 2017 Test. CrowdFlower was used to annotate the new tweets. We removed all duplicated tweets from the SemEval 2017 Train part, which resulted in approximately 20K positive, 8K negative and 23K neutral examples, and 2K positive, 4K negative and 6K neutral examples for the SemEval 2017 Test part (see Table 1).

The InHouse Tweets dataset consists of two parts, InHouse Tweets Train and InHouse Tweets Test, used in [4]. These datasets come from SemEval 2013 task 2, Sentiment Analysis in Twitter [20].

The Sanders Twitter dataset, created by Sanders Analytics, consists of 5512 tweets manually labelled by one annotator (it can be obtained from https://github.com/pmbaumgartner/text-feat-lib). Each tweet is related to one of four topics (Apple, Google, Microsoft, Twitter) and is labelled as either positive, negative, neutral or irrelevant; we discarded tweets labelled as irrelevant. In [35], the authors also described and used the Sanders Twitter dataset.

3.2 Targeted Entity Sentiment Datasets

For the ESA task, we were able to collect three labelled datasets. The datasets from [17,26] are created from tweets, and our InHouse Entity dataset [38] contains sentences from news articles. Detailed statistics are shown in Table 2.

Table 2. Targeted Entity Sentiment Analysis datasets statistics.

Dataset        | Total  | Positive | Negative | Neutral
Dong           | 6 940  | 1 734    | 1 733    | 3 473
Mitchel        | 3 288  | 707      | 275      | 2 306
InHouse Entity | 1 281  | 169      | 189      | 923
Total          | 11 509 | 2 610    | 2 197    | 6 702

For the TON 3 task, we used our two non-public multilingual datasets. Firstly, our InHouse News dataset consists of 1830 manually labelled texts from news articles about Macedonian Referendum in 23 languages, but the majority is formed by Macedonian, Bulgarian, English, Italian and Russian, see Table 3. Each example contains a title and description of a given article. For the evaluation of our systems we used only Bulgarian, English, Italian and Russian because other languages are either not supported by the evaluated systems or the number of examples is less than 60 samples. EP News dataset contains more than 50K manually labelled news articles about the European Parliament and European Union in 25 European languages. Each news article in this dataset consists from a title and full text of the article and also from their English translation, we selected five main European languages (English, German, French, Italian and Spanish) for the evaluation, see Table 4 for details. 2 3

http://www.statmt.org/wmt10/translation-task.html. For this task we also used tweets described in Subsect. 3.1.

Comparative Analyses of Multilingual Sentiment Analysis Systems

267

Table 3. InHouse News dataset statistics.

InHouse News    | Total | Positive | Negative | Neutral
Macedonian      | 974   | 516      | 234      | 224
Bulgarian       | 215   | 118      | 26       | 71
English         | 339   | 198      | 35       | 106
Italian         | 62    | 41       | 3        | 18
Russian         | 65    | 17       | 34       | 14
Other Languages | 175   | 60       | 44       | 71
Total           | 1830  | 950      | 376      | 504

Table 4. EP Tonality News dataset statistics.

EP News | Total  | Positive | Negative | Neutral
English | 2 193  | 263      | 172      | 1 758
German  | 5 122  | 389      | 179      | 4 554
French  | 2 964  | 574      | 308      | 2 082
Italian | 1 544  | 291      | 152      | 1 101
Spanish | 3 594  | 324      | 135      | 3 135
Total   | 15 417 | 1 841    | 946      | 12 630

4 Evaluation and Results

In this section, we present a summary of the evaluation results for all three systems. For each system, we select an appropriate collection of datasets and we classify the examples of each selected dataset separately; then, we merge all selected datasets and classify them together. Except for the InHouse News dataset and the EP News dataset, all experiments are performed on English texts. We carry out experiments on the EMMTonality system with the InHouse News dataset on Bulgarian, English, Italian and Russian. Experiments with the EP News dataset are performed on the TwitOMedia and EMMTonality systems with English, German, French, Italian and Spanish (on the EMMTonality system we perform experiments with all available languages, but we report results only for these five). Each sample is classified as positive, negative or neutral, and, except for the baseline systems, we did not apply any additional preprocessing steps. As evaluation metrics, we used Accuracy and the Macro F1 score, which is defined as:

F_1^M = \frac{2 \times P^M \times R^M}{P^M + R^M}    (1)

where P^M denotes macro precision and R^M denotes macro recall. Precision P_i and recall R_i are first computed separately for each class (n is the number of classes) and then averaged as follows:

P^M = \frac{\sum_{i=1}^{n} P_i}{n}    (2)

R^M = \frac{\sum_{i=1}^{n} R_i}{n}    (3)

P. Pˇrib´ an ˇ and A. Balahur

where P M denotes Macro Precision an RM denotes Macro Recall. Precision Pi and recall Ri are firstly computed separately for each class (n is the number of classes) and then averaged as follows: n Pi PM = i (2) n n Ri M (3) R = i n 4.1

Baselines

For a basic comparison, we created baseline models for the TSA and TON tasks. These baseline models are based on unigram or unigram-bigram features; results are shown in Tables 5, 6, 7, 8 and 9. For the baseline models, we apply minimal preprocessing steps such as lowercasing and word normalization, which includes converting URLs, emails, money, phone numbers, usernames, dates and number expressions to a common token (for example, the URL "www.google.com" is replaced by a common URL placeholder token). These steps lead to a reduction of the feature space, as shown in [19]. We use the ekphrasis library from [8] for word normalization.

Table 5. Results of baseline models for the InHouse Tweets Test dataset with unigram features (models were trained on the InHouse Tweets Train dataset).

Baseline        | Macro F1 | Accuracy
Log. regression | 0.5525   | 0.5843
SVM             | 0.5308   | 0.5641
Naive Bayes     | 0.4233   | 0.4993
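A minimal sketch of a comparable baseline pipeline is given below; it is not the authors' exact setup (the ekphrasis normalization step is omitted), the texts and labels are placeholders, and scikit-learn's f1_macro averages per-class F1 scores, which can differ marginally from Eq. (1).

```python
# Hedged sketch: unigram+bigram count features, the three classifier types
# described in this subsection (default parameters otherwise), and 10-fold
# cross-validation scored with macro F1. `texts`/`labels` stand in for real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["great insightful book", "terrible boring sequel", "it was fine",
         "awful plot", "loved every page", "nothing special here"] * 10
labels = ["positive", "negative", "neutral",
          "negative", "positive", "neutral"] * 10

for clf in (LogisticRegression(solver="lbfgs", max_iter=1000),
            SVC(kernel="linear"),
            MultinomialNB()):
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2), lowercase=True), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1_macro")
    print(type(clf).__name__, round(scores.mean(), 3))
```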

To train the baseline models, we use implementations from the scikit-learn library [31]: Support Vector Machines (SVM) – concretely, Support Vector Classification (SVC) with a linear kernel – Logistic Regression with the lbfgs solver, and Naive Bayes; default values are used for the other parameters of these classifiers. Our InHouse News dataset does not contain a large number of examples, and therefore we perform the experiments with 10-fold cross-validation; the same approach is applied for the EP News dataset. For the news datasets (InHouse News and EP News) we train baseline models with different combinations of data. Table 6 shows results for models trained on a concatenation of examples in different languages: for each dataset, we select all untranslated examples (texts in their original languages) and train a model regardless of the language. The model is then able to classify texts in all languages which were used to train it. This approach should lead to a performance improvement, as shown in [4]. The same approach is used to acquire the results in Table 7, but only specific languages are used: for the InHouse News dataset English, Bulgarian, Italian and Russian, and for the EP News dataset English, French, Italian, German and Spanish. Table 8 contains results for models trained only on original English texts. In Tables 6, 7, 8 and 9, the column Config denotes whether only the text of an example is used or whether the title of the example is concatenated with the text.

Table 6. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset and the EP News dataset with all examples (all languages) were used. We used 10-fold cross-validation (results in the table are averages over the individual folds).

Baseline        | Config     | InHouse News (Macro F1 / Accuracy) | EP News (Macro F1 / Accuracy)
Log. regression | text       | 0.663 / 0.705                      | 0.551 / 0.864
Log. regression | text+title | 0.704 / 0.738                      | 0.578 / 0.870
SVM             | text       | 0.657 / 0.697                      | 0.564 / 0.856
SVM             | text+title | 0.717 / 0.747                      | 0.591 / 0.866
Naive Bayes     | text       | 0.612 / 0.676                      | 0.513 / 0.845
Naive Bayes     | text+title | 0.646 / 0.702                      | 0.552 / 0.852

Table 7. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset with Bulgarian, English, Italian and Russian examples and the EP News dataset with English, French, Italian, German and Spanish examples were used. We used 10-fold cross-validation (results in the table are averages over the individual folds).

Baseline        | Config     | InHouse News (Macro F1 / Accuracy) | EP News (Macro F1 / Accuracy)
Log. regression | text       | 0.629 / 0.682                      | 0.497 / 0.833
Log. regression | text+title | 0.692 / 0.729                      | 0.529 / 0.841
SVM             | text       | 0.630 / 0.677                      | 0.513 / 0.819
SVM             | text+title | 0.684 / 0.718                      | 0.540 / 0.833
Naive Bayes     | text       | 0.585 / 0.657                      | 0.432 / 0.816
Naive Bayes     | text+title | 0.612 / 0.678                      | 0.457 / 0.817

If we compare the baseline results from Table 8 with the results from Table 10 (last five lines of that table), we can see that the baselines perform much better than our current systems (see the Macro F1 scores in the tables). The TwitOMedia system was trained on tweet messages, so it is evident that its performance on news articles will be lower, but the EMMTonality system should achieve better results.

Table 8. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset and EP News dataset only with original English examples were used. We used 10-fold cross-validation (results in the table are averages over the individual folds).

Baseline        | Config     | InHouse News (Macro F1 / Accuracy) | EP News (Macro F1 / Accuracy)
Log. regression | text       | 0.612 / 0.730                      | 0.510 / 0.820
Log. regression | text+title | 0.685 / 0.769                      | 0.534 / 0.826
SVM             | text       | 0.608 / 0.719                      | 0.530 / 0.815
SVM             | text+title | 0.674 / 0.760                      | 0.546 / 0.827
Naive Bayes     | text       | 0.502 / 0.695                      | 0.441 / 0.818
Naive Bayes     | text+title | 0.547 / 0.713                      | 0.446 / 0.819

Our results from Tables 6, 7 and 8 confirm the claim from [4] that joining data in different languages leads to a performance improvement: models trained on all examples (regardless of language), see Table 6, achieve the best results.

Table 9. Macro F1 score and Accuracy results of baseline models trained on the SemEval 2017 Train and Test datasets with unigram features. Evaluation was performed on the original English examples from our InHouse News and EP News datasets.

Baseline        | Config     | InHouse News (Macro F1 / Accuracy) | EP News (Macro F1 / Accuracy)
Log. regression | text       | 0.395 / 0.432                      | 0.312 / 0.518
Log. regression | text+title | 0.408 / 0.462                      | 0.310 / 0.495
SVM             | text       | 0.380 / 0.429                      | 0.283 / 0.408
SVM             | text+title | 0.389 / 0.456                      | 0.287 / 0.397
Naive Bayes     | text       | 0.237 / 0.296                      | 0.313 / 0.639
Naive Bayes     | text+title | 0.239 / 0.293                      | 0.314 / 0.620

We collected a large manually labelled dataset of tweets, and we wanted to study the possibility of using this dataset to train a model that would then be used to classify news articles, i.e., texts from a domain different from that of the training data. After comparing the results from Table 9 (last five lines of the table) with the results from Table 10, we can see that our simple baseline is not outperformed on the InHouse News dataset by the other two systems. These results show that it is possible to use data from a different domain for training and still obtain good results. We also observe that incorporating the title (concatenating the title and the text) of a news article leads to an increase in performance across all datasets and



combinations of data used for training the models. These results show that the title is an essential part of a news article and, despite its short length, carries significant sentiment and semantic information.

4.2 Twitter Sentiment Analysis

To evaluate a system on the TSA task, we used a domain-rich collection of tweet datasets with almost 3M labelled tweets in total; detailed statistics of the used datasets can be seen in Table 1. Table 10 shows the obtained Accuracy and Macro F1 results. From Table 10 it is evident that the TwitOMedia system [4] performs best on the InHouse Tweets Test dataset (bold values in the table). This dataset is based on data from [20] and was used to develop (train and test) this system. The reason why the TwitOMedia system performs better on the InHouse Tweets Test dataset than on the InHouse Tweets Train dataset (HTTr) is that the system was trained on translations of the HTTr dataset: the original training dataset (HTTr) was translated into several languages, and the translations were then merged into one training dataset which was used to train the model. This approach leads to a performance improvement, as shown in [4]. For the other datasets, the performance is lower, especially for the domain-specific ones and for datasets which do not contain instances of the neutral class, for example, the Health Care Reform dataset or the Sentiment140 Train dataset. The first reason is most likely that the system was trained on texts from another domain that is too different, and thus the system is not able to successfully classify (generalize to) texts from these domain-specific datasets. Secondly, the Sentiment140 Train dataset and the Obama-McCain Debate dataset do not contain examples with a neutral class.

4.3 Tonality in News

The EMMTonality system for the TON task was evaluated on the same set of datasets as the TwitOMedia system; the obtained results are shown in Table 10. If we compare the results of the two systems, we can see that the EMMTonality system achieves better results on the following datasets: Sentiment140 Test, Health Care Reform, Obama-McCain Debate, Sanders, SemEval 2017 Train, and SemEval 2017 Test. The overall results, however, are better for the TwitOMedia system. Results for the InHouse News and EP News datasets are comparable for both evaluated systems. Regarding multilinguality, the EMMTonality system slightly outperforms the TwitOMedia system in Macro F1 score, see Table 11, which contains results for the EP News dataset for five European languages (English, German, French, Italian and Spanish).



Table 10. Macro F1 score and Accuracy results of the evaluated TwitOMedia and EMMTonality systems. Bold values denote the best results in a specific dataset category (individual Twitter datasets, joined Twitter datasets and News datasets), and underlined values denote the best results for a specific dataset category and for each system separately.

Dataset                               TwitOMedia             EMMTonality
                                      Macro F1   Accuracy    Macro F1   Accuracy
Sentiment140 Test                     0.566      0.530       0.666      0.639
Health Care Reform                    0.410      0.326       0.456      0.403
Obama-McCain Debate (OMD)             0.270      0.290       0.331      0.357
Sanders                               0.468      0.591       0.526      0.618
Sentiment140 Train (S140T)            0.312      0.358       0.250      0.375
SemEval 2017 Train                    0.501      0.529       0.538      0.561
SemEval 2017 Test                     0.460      0.500       0.552      0.564
T4SA                                  0.603      0.669       0.410      0.392
InHouse Tweets Test (HTT)             0.710      0.708       0.583      0.610
InHouse Tweets Train (HTTr)           0.629      0.599       0.580      0.574
All Tweets w/o S140T, OMD, T4SA       0.597      0.660       0.545      0.563
All Tweets w/o S140T, T4SA            0.507      0.528       0.542      0.558
InHouse News en                       0.397      0.425       0.398      0.425
EP News en, text                      0.368      0.698       0.422      0.678
EP News en, title + text              0.372      0.690       0.425      0.675
EP News translated, text              0.368      0.432       0.390      0.278
EP News translated, title + text      0.369      0.388       0.393      0.238

Table 11. Macro F1 score and Accuracy results for the EP News dataset for English, German, French, Italian and Spanish examples.

Lang.  Config        TwitOMedia             EMMTonality
                     Macro F1   Accuracy    Macro F1   Accuracy
EN     Text          0.368      0.698       0.422      0.678
       Text+Title    0.372      0.690       0.425      0.675
DE     Text          0.333      0.711       0.348      0.846
       Text+Title    0.344      0.687       0.360      0.730
FR     Text          0.354      0.614       0.389      0.549
       Text+Title    0.356      0.602       0.383      0.472
IT     Text          0.314      0.692       0.397      0.347
       Text+Title    0.351      0.690       0.405      0.330
ES     Text          0.337      0.828       0.392      0.386
       Text+Title    0.332      0.823       0.392      0.333

4.4 Targeted Sentiment Analysis

We evaluated the EMMSenti and EMMTonality systems for the ESA task on the Dong, Mitchel and InHouse Entity datasets; see Table 12 for the results. We obtained the best results for the InHouse Entity dataset, both in terms of Accuracy and in terms of Macro F1 score. Across all datasets and systems the best results are obtained for the neutral class (not reported in the table), and our systems perform more poorly on the other classes. The classification algorithm (for both systems) is based on counting subjective terms (words) around entity mentions; no machine learning algorithm is involved (a minimal sketch of this counting scheme is given after Table 12). It is therefore obvious that the quality of the dictionaries used, as well as their adaptation to the domain, is crucial. If no subjective term from the text is found in the dictionary, the neutral label is assigned to the example. The best performance of our systems for the neutral class can be explained by the fact that most of the neutral instances do not contain any subjective term. We also have to note that we were not able to reproduce the results obtained in [38], and our performance on this dataset is worse; it is possible that the authors of [38] used slightly different lexicons than we did.

Table 12. Macro F1 score and Accuracy results for the EMMSenti and EMMTonality systems evaluation. Bold values denote best results for each dataset.

Dataset          EMMSenti               EMMTonality
                 Macro F1   Accuracy    Macro F1   Accuracy
Dong             0.491      0.512       0.496      0.501
Mitchel          0.483      0.660       0.490      0.640
InHouse Entity   0.517      0.663       0.507      0.659
All              0.571      0.512       0.557      0.505
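The counting scheme described before Table 12 can be sketched as follows; the window size, the toy lexicons and the tie-handling rule are assumptions, since the exact dictionaries used by the EMMSenti and EMMTonality systems are not reproduced here.

POSITIVE = {"good", "great", "excellent"}   # stand-ins for the real subjectivity lexicons
NEGATIVE = {"bad", "poor", "terrible"}

def classify_entity(tokens, entity_index, window=5):
    # count subjective terms in a fixed window around the entity mention
    start = max(0, entity_index - window)
    context = tokens[start:entity_index + window + 1]
    pos = sum(1 for t in context if t.lower() in POSITIVE)
    neg = sum(1 for t in context if t.lower() in NEGATIVE)
    if pos == neg:                 # also covers the "no subjective term found" case
        return "neutral"
    return "positive" if pos > neg else "negative"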

4.5 Error Analysis

In order to understand the causes leading to erroneous classification, we analyzed the misclassified examples from the Twitter and News datasets for the EMMTonality and TwitOMedia systems. We categorize the errors into four groups (see below).6 We randomly selected 40 incorrectly classified examples for each class and for each system across all datasets which were used for the evaluation of these systems, resulting in 240 manually evaluated examples in total. We found the following major groups of errors:

1. Implicit sentiment/external knowledge: Sentiment is often expressed implicitly, or external knowledge is needed for a correct classification. The evaluated text does not contain any explicit attributes (words, phrases,

6 Each incorrectly classified example may be contained in more than one error group. Some examples were also (in our view) annotated incorrectly. For some cases, we were not able to discover the reason for misclassification.



emoji/emoticons) which would clearly indicate the sentiment, and because our systems are based on surface-level features (unigrams/bigrams or counting occurrences of sentiment words), they fail on these examples. For example, a text like "We went to Stanford University today. Got a tour. Made me want to go back to college." indicates positive sentiment, but to reach this decision we have to know that Stanford University is a prestigious university (which is positive) and that, according to the sentence "Made me want to go back to college.", the author probably has a positive relation to universities or to his previous studies. This group of errors is the most common in our set of error analysis examples; we observed it in 94 cases and only for positive or negative examples.

2. Slang expression: Misclassified examples in this group contain domain-specific words, slang expressions, emojis, unconventional linguistic means, and misspelled or uppercased words like "4life", "YEAH BOII", "yessss", "grrrl", "yummmmmy". We observed this type of error in 29 examples, and most of them were caused by the EMMTonality system (which is reasonable because this system is intended for news). An appropriate solution for part of this problem is the application of preprocessing steps like spell correction, lowercasing, text normalization ("yesssss" => "yes"), or extending the dictionaries (see the normalization sketch below). In the case of extending dictionaries, we have to deal with the Twitter vocabulary, because the vocabulary of tweets changes quite fast (new expressions and hashtags are introduced often) and thus the dictionaries have to be updated regularly. The TwitOMedia system, on the other hand, would have to be retrained with new examples every time its feature set is extended, or a more advanced normalization system should be used in the pre-processing stage.

3. Negation: Negation of terms is an essential aspect of sentiment classification [33]. Negations can easily change or reverse the sentiment orientation. This error appeared in 35 cases in our set of error analysis examples.

4. Opposite sentiment words: The last type of error is caused by sentiment words which express an opposite or different sentiment than the entire text. This type of error was typical for examples annotated with a neutral label. For example, the tweet "#Yezidi #Peshmerga forces playing volleyball and crushing #ISIS in the frontline." is annotated as neutral but contains words like "crushing", "#ISIS" or "frontline" which can indicate negative sentiment. We observed this error in 20 examples.

The first group of errors (Implicit sentiment/external knowledge) was the most common among the evaluated examples and is also the hardest one, because the system would have to have access to world knowledge or be able to detect implicit sentiment in order to classify these examples correctly. This error was observed only for examples annotated with positive or negative labels; there, the explicit sentiment markers are missing. The majority of these examples were misclassified as the neutral class. In this case, the sentiment analysis system should be complemented with an emotion detection system similar to one of the top systems from [23] to improve classification performance. For examples classified as neutral, we would then change the neutral class according to the detected emotion. The examples with negative



emotions like sadness, fear or anger would be changed to the negative class, and examples with positive emotions like joy or surprise would be changed to the positive class. Figure 1 shows the confusion matrices for the EMMTonality and TwitOMedia systems. We can see that a noticeable number of misclassified examples were predicted as the neutral class, so according to our statistics from the error analysis, the improvement outlined above should positively affect a significant number of examples.
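A minimal sketch of the normalization step proposed for the slang errors (error group 2) is given here; the lowercasing and character-collapsing rules are assumptions, not the actual pre-processing of the evaluated systems.

import re

def normalize(token):
    token = token.lower()                        # "YEAH" -> "yeah"
    token = re.sub(r"(.)\1{2,}", r"\1", token)   # collapse runs of 3+ identical characters
    return token

# normalize("yesssss") == "yes", normalize("grrrl") == "grl"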

Fig. 1. Confusion matrices for the TwitOMedia system (a) and the EMMTonality system (b) on all tweets without the S140T and T4SA datasets.

Lastly, we have to note that in 35 cases we were not able to determine the reason for the misclassification, and in our view the annotated label itself was incorrect in seven cases.

5 Conclusion

In this paper, we thoroughly evaluated three systems for sentiment analysis and compared their performance. We collected and described a rich collection of publicly available datasets, performed experiments with them, and reported the performance of the individual systems. We carried out additional experiments with the collected datasets and showed that, for news articles, it is beneficial to include the title of the article along with its text. We also performed a thorough error analysis and proposed potential solutions for each category of misclassified examples. In future work, we will explore current state-of-the-art methods and develop new approaches (including deep learning methods, multilingual embeddings and other recent machine learning techniques) for multilingual sentiment analysis in order to implement them in our highly multilingual environment.



Acknowledgments. This work was partially supported by the ERDF project "Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)" (no.: CZ.02.1.01/0.0/0.0/17 048/0007267) and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

References 1. de Arruda, G.D., Roman, N.T., Monteiro, A.M.: An annotated corpus for sentiment analysis in political news. In: Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology, pp. 101–110 (2015) 2. Balahur, A., Turchi, M.: Multilingual sentiment analysis using machine translation? In: Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA 2012, Stroudsburg, PA, USA, pp. 52–60. Association for Computational Linguistics (2012). dl.acm.org/citation.cfm?id=2392963.2392976 3. Balahur, A., Turchi, M.: Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis. Comput. Speech Lang. 28(1), 56–75 (2014) 4. Balahur, A., et al.: Resource creation and evaluation for multilingual sentiment analysis in social media texts. In: LREC, pp. 4265–4269. Citeseer (2014) 5. Barnes, J., Klinger, R., Schulte im Walde, S.: Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In: Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 2–12. Association for Computational Linguistics (2017). https://doi. org/10.18653/v1/W17-5202. aclweb.org/anthology/W17-5202 6. Barnes, J., Klinger, R., Schulte im Walde, S.: Bilingual sentiment embeddings: joint projection of sentiment across languages. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2483–2493. Association for Computational Linguistics (2018). aclweb.org/anthology/P18-1231 7. Barnes, J., Klinger, R., Schulte im Walde, S.: Projecting embeddings for domain adaption: joint modeling of sentiment analysis in diverse domains. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 818–830. Association for Computational Linguistics (2018). aclweb.org/anthology/C18-1070 8. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 747–754. Association for Computational Linguistics, August 2017 9. Bobichev, V., Kanishcheva, O., Cherednichenko, O.: Sentiment analysis in the Ukrainian and Russian news. In: 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pp. 1050–1055. IEEE (2017) ˇ 10. Buˇcar, J., Znidarˇ siˇc, M., Povh, J.: Annotated news corpora and a lexicon for sentiment analysis in Slovene. Lang. Resour. Eval. 52(3), 895–919 (2018). https://doi. org/10.1007/s10579-018-9413-3



11. Can, E.F., Ezen-Can, A., Can, F.: Multilingual sentiment analysis: an RNN-based framework for limited data. CoRR abs/1806.04511 (2018) 12. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1179. aclweb.org/anthology/D14-1179 13. Cimino, A., Dell’Orletta, F.: Tandem LSTM-SVM approach for sentiment analysis. In: CLiC-it/EVALITA (2016) 14. Cliche, M.: BB twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 573–580. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/S17-2094. aclweb.org/anthology/S17-2094 15. Collomb, A., Costea, C., Joyeux, D., Hasan, O., Brunie, L.: A study and comparison of sentiment analysis methods for reputation evaluation. Rapport de recherche RRLIRIS-2014-002 (2014) 16. Dashtipour, K., et al.: Multilingual sentiment analysis: state of the art and independent comparison of techniques. Cogn. Comput. 8(4), 757–771 (2016). https:// doi.org/10.1007/s12559-016-9415-7 17. Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., Xu, K.: Adaptive recursive neural network for target-dependent Twitter sentiment classification. In: The 52nd Annual Meeting of the Association for Computational Linguistics (ACL). ACL (2014) 18. Duppada, V., Jain, R., Hiray, S.: SeerNet at SemEval-2018 task 1: domain adaptation for affect in tweets. In: Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 18–23. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/S18-1002. aclweb.org/anthology/S18-1002 19. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1(12) (2009) 20. Hltcoe, J.: SemEval-2013 task 2: sentiment analysis in Twitter, Atlanta, Georgia, USA 312 (2013) 21. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 22. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1181. aclweb.org/anthology/D14-1181 23. Klinger, R., De Clercq, O., Mohammad, S., Balahur, A.: IEST: WASSA-2018 implicit emotions shared task. In: Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, pp. 31–42. Association for Computational Linguistics, October 2018. aclweb.org/anthology/W18-6206 24. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 25. Lommatzsch, A., B¨ utow, F., Ploch, D., Albayrak, S.: Towards the automatic sentiment analysis of German news and forum documents. In: Eichler, G., Erfurth, C., Fahrnberger, G. (eds.) I4CS 2017. CCIS, vol. 717, pp. 18–33. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60447-3 2 26. Mitchell, M., Aguilar, J., Wilson, T., Van Durme, B.: Open domain targeted sentiment. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1643–1654 (2013)



27. Mohammad, S.M., Bravo-Marquez, F.: WASSA-2017 shared task on emotion intensity. In: Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA), Copenhagen, Denmark (2017) 28. Mohammad, S.M., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval2018 task 1: affect in tweets. In: Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA (2018) 29. Mohammad, S.M., Kiritchenko, S., Zhu, X.: NRC-Canada: building the state-ofthe-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242 (2013) 30. Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V., Sebastiani, F.: SemEval-2016 task 4: sentiment analysis in Twitter. In: Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval 2016, San Diego, California. Association for Computational Linguistics, June 2016 31. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 32. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods, pp. 185–208 (1999) 33. Reitan, J., Faret, J., Gamb¨ ack, B., Bungum, L.: Negation scope detection for Twitter sentiment analysis. In: Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 99– 108 (2015) 34. Rosenthal, S., Farra, N., Nakov, P.: SemEval-2017 task 4: sentiment analysis in Twitter. In: Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval 2017, Vancouver, Canada. Association for Computational Linguistics, August 2017 35. Saif, H., Fernandez, M., He, Y., Alani, H.: Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold (2013) 36. Shamma, D.A., Kennedy, L., Churchill, E.F.: Tweet the debates: understanding community annotation of uncollected sources. In: Proceedings of the First SIGMM Workshop on Social Media, pp. 3–10. ACM (2009) 37. Speriosu, M., Sudan, N., Upadhyay, S., Baldridge, J.: Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the First Workshop on Unsupervised Learning in NLP, pp. 53–63. Association for Computational Linguistics (2011) 38. Steinberger, J., Lenkova, P., Kabadjov, M., Steinberger, R., Van der Goot, E.: Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 770–775 (2011) 39. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1555–1565. Association for Computational Linguistics (2014). https:// doi.org/10.3115/v1/P14-1146. aclweb.org/anthology/P14-1146 40. Vadicamo, L., et al.: Cross-media learning for image sentiment analysis in the wild. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 308–317, October 2017. https://doi.org/10.1109/ICCVW.2017.45



41. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: a survey. CoRR abs/1801.07883 (2018). arxiv.org/abs/1801.07883 42. Zhou, H., Chen, L., Shi, F., Huang, D.: Learning bilingual sentiment word embeddings for cross-language sentiment classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 430–440. Association for Computational Linguistics (2015). https:// doi.org/10.3115/v1/P15-1042. aclweb.org/anthology/P15-1042

Sentiment Analysis of Influential Messages for Political Election Forecasting

Oumayma Oueslati1(B), Moez Ben Hajhmida1, Habib Ounelli1, and Erik Cambria2

1 University of Tunis El Manar, Tunis, Tunisia
[email protected]
2 Nanyang Technological University, Singapore, Singapore

Abstract. In this paper, we explore the use of sentiment analysis of influential messages on social media to improve political election forecasting. Since social media users are not necessarily representative of the overall electorate, bias correction of user messages is critical for producing a reliable forecast. The observation that motivates our work is that people on social media consult each other's messages before taking a decision, which means that social media users influence each other. We first built a classifier to detect politically influential messages based on different aspects (message content, time, sentiment, and emotion). Then, we predicted the electoral candidates' votes using the sentiment degree of influential messages. We applied our proposed model to the 2016 United States presidential election and conducted experiments at different time intervals. Results show that our approach achieves better performance than both off-line polling and classical approaches.

Keywords: Sentiment analysis · Political election forecasting · Presidential election

1 Introduction

Nowadays, writing and messaging on social media is part of our daily routine. Facebook, for example, enjoys more than one billion daily active users. The exponential growth of social media has driven the growth of user-generated content (UGC) available on the web. The availability of UGC raised the possibility of monitoring electoral campaigns by tracking and exploring citizens' preferences [1]. Jin et al. [2] stated that analyzing social media during an electoral campaign may be more useful and accurate than traditional off-line polls and surveys. This approach is not only a more economical way to predict the election outcome but also a faster way to analyze such a massive amount of data. Accordingly, many studies have shown that analyzing social media based on several indicators leads to a reliable forecast of the final result. Some works [3,4] have relied on



simple techniques such as the volume of data related to candidates. More recent works have tried to provide a better alternative to the traditional off-line poll using sentiment analysis of UGC [5,6]. Whatever the technique used, addressing the data bias is a crucial phase which impacts the quality of the outcome. Since not all social media content is necessarily relevant for the prediction, an appropriate technique to de-bias the UGC is needed. In this paper, we propose a sentiment analysis based approach to predict political elections by relying only on influential messages shared on social media. Social influence has been observed not only in political participation but also in many other domains such as health behaviors and idea generation [7]. To the best of our knowledge, our work is the first to investigate politically influential messages to forecast an election outcome. According to Cialdini and Trost [8], social influence occurs when an individual's views, emotions, or actions are impacted by the views, emotions or actions of another individual. By analogy, political influence is achieved through direct interaction with voters on social media platforms: politicians tweet on Twitter and post on Facebook to receive voters' feedback and understand their expectations. Hence, we built a classifier to select the influential messages based on content, time, sentiment, and emotion features. To compute sentiment features, we adopted SenticNet [10], a concept-level sentiment analysis framework widely recommended in the literature [9]. To extract emotion features, we built an emotion lexicon based on Facebook reactions. For each electoral candidate, the number of votes was predicted using the sentiment polarity and degree of the influential messages appearing on the candidate's official Facebook page. We applied the proposed approach to the 2016 United States presidential election. To evaluate the prediction quality, we mainly considered two kinds of ground truth for comparison: the election outcome itself and polls released by traditional polling institutions. We also compared our approach with classical approaches based merely on data volume. Experiments were conducted at different time intervals, and the results showed that using influential messages led to a more accurate prediction. The rest of the paper is organized as follows: Section 2 explores the current literature; Sect. 3 addresses the research methods; Sect. 4 presents the results, discussion and implications; lastly, Sect. 5 gives a synopsis of the main concluding remarks.

2 Related Works

2.1 Sentiment Analysis

Social media platforms have changed the way people use information to make decisions: they tend to consult each other's reviews before making their choices. Sentiment analysis in social media is a challenging problem that has attracted a large body of research [11,12]. In [13], the authors investigated the impact of sentiment analysis tools for extracting useful information from unstructured data, ranging from evaluating consumer products, services, healthcare, and financial services to analyzing social events and political elections.



Cambria et al. [10] introduced SenticNet, a concept-level sentiment analysis framework consisting of 100,000 concept entries. SenticNet acts as a semantic link between concept-level emotion and natural word-level language data. Each concept is listed with five affiliated semantic nodes, which are connected by semantic relations, four sentics, and a sentiment polarity value. The four sentics, namely Introspection, Temper, Attitude, and Sensitivity, present a detailed emotional description of the concept they belong to, and the sentiment polarity value is an integrated evaluation of the concept's sentiment based on these four parameters. The sentiment polarity provided by SenticNet is a float number in the range from −1 to 1. Many applications have been developed using SenticNet, in fields such as the analysis of large amounts of social data and human-computer interaction. In [14], Bravo-Marquez et al. used SenticNet to build a sentiment analysis system for Twitter. In [15], the authors used SenticNet to build an e-health system called iFeel which analyzes patients' opinions about the provided healthcare. Another study, by Qazi et al. [9], recommended SenticNet for extracting sentiment features. Encouraged by these works, we also used the SenticNet framework to extract sentiment features from the extracted messages.

2.2 Election Forecasting Approaches

Forecasting elections with social media has become the latest buzzword. Politicians have adopted social media, predominantly Facebook and Twitter, as a campaigning tool, while the general public has widely adopted social media to conduct political discussions [16]. Accordingly, Bond et al. [17] affirm that social media content may influence citizens' political behavior. Sang and Bos [18] stated that many studies have shown that analyzing social media using several techniques and based on different indicators leads to a reliable forecast of electoral campaigns and their results. Tumasjan et al. [4] were the first to use Twitter to predict the outcome of a German federal election; they used a simple technique based on counting the number of tweets that a party gets. Despite their success in predicting the winner of the 2009 German federal elections, their simple technique received much criticism. Jungherr et al. [19] highlighted the lack of methodological justification, and Gayo-Avello [5,20] stressed the use of sentiment analysis to produce more accurate results. In [5], Gayo-Avello reported a better error rate when using sentiment analysis (17.1% using volume, 7.6% using sentiment). Consequently, many works have taken this advice, such as [6,21–23]. Addressing the data bias is an essential phase in predicting an electoral outcome [24,25], because social media users are not necessarily representative of the overall population. However, many works such as [4,19] did not de-bias their data. Some other works, such as [5,24], attempted to reduce the bias according to user age and geolocation, in order to obtain a better overall view of the electorate; however, the authors reported that the success was minimal and the improvement somewhat marginal. A very recent work by Arroba et



al. [25] explores geographic weighting factors to improve political prediction results; the authors stated that geographic weighting, along with sentiment polarity and relevance, led to a better outcome.

3 Proposed Method

In this section, we introduce the approach used to build our model, shown in Fig. 1. Our methodology consists of a series of steps ranging from the extraction of Facebook user messages (FUMs) to the election prediction process. Our work is influenced by the advice of [5]: instead of merely relying on volume (the number of messages a candidate receives), we use sentiment analysis together with an attempt to de-bias the data by selecting only influential messages. We applied this methodology to the last U.S. presidential election, which took place on November 8, 2016 with two front-running candidates: the Republican Donald Trump and the Democrat Hillary Clinton. Republican Donald Trump lost the popular vote to Democrat Hillary Clinton by more than 2.8 million votes.

Fig. 1. Workflow of the proposed model.

3.1 Data Collection

Twitter is the platform most commonly used to predict election outcomes, thanks to the ease with which it allows data to be extracted. To choose our data source, we compared Facebook and Twitter in terms of data quality and platform popularity. Many previous studies [20,26,27] found that Twitter data was unreliable for predicting electoral outcomes, mainly because of the selection of tweets unrelated to the



candidates. Selecting tweets based on a manually constructed list of keywords inevitably leads to a loss of relevant information: even if a tweet does not contain any keyword from the pre-defined list, it is not necessarily irrelevant. In contrast, Facebook provides official candidate pages which allow a large sample of relevant data to be collected independently of keywords. It also provides more information about the text message and does not limit the user to a specific number of characters, whereas Twitter limits its users to 240 characters, forcing them to express their opinions briefly and sometimes only partially. Furthermore, favorable statistics on U.S. Facebook users encouraged us to rely on Facebook for an accurate electoral prediction: the total Facebook audience in the United States amounted to 214 million users, of which more than 208 million are older than 17 years.1 We extracted data from the candidates' official Facebook pages. Namely, we extracted FUMs along with the users' responses to each FUM, the users' reactions (Like, Love, Haha, Wow, Sad, and Angry), and timestamps (FUM publication time, first FUM reply time, and last FUM reply time). The collection was done directly from public verified Facebook pages with a self-made application, using the Facebook Graph API.2 Data collection was conducted over one year before the presidential election in November 2016 so that we could experiment with our model over several periods of time (one year before election day, six months before, one week before, etc.). In the first pre-processing step, we deleted URLs, empty messages, non-English messages, and duplicated raw data. However, if a message is duplicated but has different metrics, we kept it; for example, the message "WE NEED TRUMP NOW!!!" appeared three times in our raw data but each time with different numbers of likes and replies, so we kept all three occurrences. After the data cleaning step, we kept 10k messages from Hillary Clinton's official Facebook page and 12k messages from Donald Trump's official Facebook page.
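A minimal sketch of this cleaning step is given below. The record field names and the URL pattern are assumptions, and a language-identification component (not shown) would additionally be needed to drop non-English messages.

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_messages(messages):
    # messages: list of dicts with 'text', 'likes' and 'replies' (illustrative fields)
    seen, cleaned = set(), []
    for m in messages:
        text = URL_RE.sub("", m["text"]).strip()
        if not text:                              # drop messages that become empty
            continue
        key = (text, m["likes"], m["replies"])    # duplicates with different metrics are kept
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**m, "text": text})
    return cleaned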

3.2 Feature Generation

This subsection describes the features we use to characterize influential messages. Based on the definition of social influence stated by Cialdini and Trost [8], "Social influence occurs when an individual's views, emotions, or actions are impacted by the views, emotions or actions of another individual", we designed four kinds of features (sentiment, emotion, time, and content). In total, we designed 20 features to characterize whether a message is influential or not. Sentiment Feature: We conducted the sentiment analysis task using SenticNet, attributing to each FUM a sentiment score between −1 and 1. SenticNet is inspired by the Hourglass of Emotions model [28]: to calculate the sentiment score, each term is represented on the basis of the intensity of four basic emotional dimensions, namely sensitivity, aptitude, attention, and pleasantness. In our work, we computed the sentiment features using the Sentic

1 www.statista.com/statistics/398136/us-facebook-user-age-groups/.
2 www.developers.facebook.com/tools/explorer/.



API.3 The sentiment features are as follows: (1) OSS, the Overall Sentiment Score of the FUM; (2) SSPMax, the Sentiment Score of the most Positive term in the FUM; (3) SSNMax, the Sentiment Score of the most Negative term in the FUM; (4) SSPMin, the Sentiment Score of the least Positive term in the FUM; and (5) SSNMin, the Sentiment Score of the least Negative term in the FUM. Emotion Feature: To extract the emotion features, we first built an emotion lexicon based on Facebook reactions. Kumar and Vadlamani [29] stated that in social networks, if someone reacts to a public post or review (message), it means that the person has positive or negative emotions towards the entity in question. So emotions may be expressed explicitly through reviews and messages or implicitly through reactions. In our work, we explore Facebook reactions to construct an emotion lexicon; this lexicon allows emotion extraction from any FUM, even if the FUM has not received any reaction yet. There are six reactions that Facebook users use to express their emotions toward a message: Like, Love, Haha (laughing), Wow (surprised), Sad, and Angry. From the collected data we selected all the messages which had received any reaction. After cleaning the messages and deleting stop words, we selected terms reflecting emotions based on the reactions. We thus obtained a list of emotional terms and, based on the reaction counts, attributed a score to each term. For example, the term 'waste' appears in two messages with (5, 10) Like, (0, 1) Love, (12, 0) Haha, (3, 18) Wow, (0, 40) Sad, and (30, 7) Angry reactions, so the term 'waste' has in total 15 Like, 1 Love, 12 Haha, 18 Wow, 40 Sad, and 37 Angry. We normalized these scores by the sum of all reactions (123 in this example). Lastly, the emotion features were extracted based on the constructed emotion lexicon. The first emotion feature, EMT, evaluates the presence of EMotional Terms in the FUM (the number of emotional terms divided by the number of all terms). The six other features are LKR, LVR, LGR, SPR, AGR, and SDR, which represent the Like ratio, the Love ratio, the Laugh ratio, the Surprise ratio, the Anger ratio, and the Sadness ratio, respectively.
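The construction of this reaction-based emotion lexicon can be sketched as follows; the whitespace tokenisation and the record field names are simplifying assumptions, not the authors' implementation.

from collections import Counter, defaultdict

REACTIONS = ["like", "love", "haha", "wow", "sad", "angry"]

def build_emotion_lexicon(messages):
    # messages: list of dicts with a 'text' field and one count per reaction type
    totals = defaultdict(Counter)
    for m in messages:
        for term in set(m["text"].lower().split()):
            for r in REACTIONS:
                totals[term][r] += m[r]
    lexicon = {}
    for term, counts in totals.items():
        total = sum(counts.values())
        if total:                                  # normalise by the sum of all reactions
            lexicon[term] = {r: counts[r] / total for r in REACTIONS}
    return lexicon

# for 'waste' in the example above, the totals (15, 1, 12, 18, 40, 37) are divided by 123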

Time Feature: The time aspect is important to analyze. We therefore generated two time-based features to evaluate user engagement towards FUMs:

LCF = (LastPostedReplyTime − MessagePublicationTime) / (ElectionPredictionTime − MessagePublicationTime)

RCT = 1 − (FirstPostedReplyTime − MessagePublicationTime) / (ElectionPredictionTime − MessagePublicationTime)

The Life Cycle feature (LCF) measures how long the message persists and remains popular, i.e., how long the content can drive user attention and engagement; its value lies between zero and one. The Reaction Time feature (RCT) evaluates the time a FUM takes to start receiving

3 http://sentic.net/api.



responses. This feature indicates whether a message rapidly engaged the users and drew their attention. Content Feature: The generated content features attempt to evaluate the quality of the message content, since a message which is not clear and readable cannot be influential. The content features include (1) NBC, the Number of Characters in the FUM; (2) NBW, the Number of Words in the FUM; (3) NBS, the Number of Sentences in the FUM; (4) NBWS, the Number of Words per Sentence; (5) NBSE, the Number of Spelling Errors in the message; and (6) ARI, the Automated Readability Index, calculated as ARI = 4.71 * (CharCount / WordCount) + 0.5 * (WordCount / SentCount) − 21. This score indicates the US educational level required to comprehend a given text: the higher the score, the less readable the text [30].
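A minimal sketch of the ARI content feature, following the formula above; the naive sentence splitter is a simplifying assumption.

import re

def automated_readability_index(text):
    words = text.split()
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    chars = sum(len(w) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21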

3.3 Influential Classifier Construction

In our work, we propose to reduce the data bias based on a message's influence rather than on who wrote the message. As social influence has been observed in political participation [17], we built a classifier to select only politically influential FUMs, i.e., those whose writer's actions and emotions impact the actions and emotions of others. To build our classifier we need a labeled dataset. Since it is too expensive to label influential messages manually, we selected messages which received many responses from other users. If the message and its responses have approximately the same sentiment polarity (positive or negative), the message is marked as influential; on the other hand, if the message and its responses have different sentiment polarities, the message is marked as non-influential. We manually revised the messages having the same sentiment polarity as their responses but with a score margin exceeding 0.5. Through this semi-automatic labeling technique, we obtained a labeled dataset of 1561 messages: 709 labeled influential and 852 labeled non-influential.
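The semi-automatic labelling rule can be sketched as follows; aggregating the reply sentiments by their mean is an assumption, and the SenticNet-based sentiment scorer is left abstract.

def polarity(score):
    return "positive" if score >= 0 else "negative"

def label_message(fum_score, reply_scores):
    # fum_score: sentiment of the message; reply_scores: sentiments of its replies
    if not reply_scores:
        return None                                # too few responses to decide
    avg_reply = sum(reply_scores) / len(reply_scores)
    if polarity(fum_score) != polarity(avg_reply):
        return "non-influential"
    if abs(fum_score - avg_reply) > 0.5:           # same polarity but large margin: revise manually
        return "manual-review"
    return "influential"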

3.4 Election Outcome Prediction Model

We used the method of [5] with some changes: while Gayo-Avello et al. counted every positive and every negative message, we include only the influential ones. The predicted vote share for a candidate C1 is then computed as

(infPosSent(C1) + infNegSent(C2)) / (infPosSent(C1) + infNegSent(C1) + infPosSent(C2) + infNegSent(C2))

where C1 is the candidate for whom support is being computed and C2 is the opposing candidate. infPosSent(C) and infNegSent(C) are, respectively, the number of positive influential and the number of negative influential messages multiplied by their sentiment scores.
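A minimal sketch of this vote-share computation; the numbers in the usage comment are purely illustrative.

def vote_share(pos_c1, neg_c1, pos_c2, neg_c2):
    # infPosSent/infNegSent sums for the two candidates (influential messages only)
    return (pos_c1 + neg_c2) / (pos_c1 + neg_c1 + pos_c2 + neg_c2)

# e.g. vote_share(120.0, 40.0, 90.0, 60.0) -> about 0.58 for candidate C1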


4 Results and Findings

First of all, we compared the performance of several supervised classification algorithms to select the best one. Subsequently, relying on the best algorithm, we derived our prediction model and evaluated its performance. In our experiments, we used machine learning algorithms from the scikit-learn package and ten-fold cross-validation to improve generalization and avoid overfitting.
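A hypothetical sketch of this comparison step is shown below. The feature matrix X and the binary labels y are assumed to be built from the 20 features of Sect. 3.2, and leaving all classifiers at their scikit-learn defaults is an assumption, not the authors' exact configuration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "NN": KNeighborsClassifier(),
    "RBF SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "ANN": MLPClassifier(),
    "NB": GaussianNB(),
    "LR": LogisticRegression(),
}

def compare_classifiers(X, y):
    # X: feature matrix (20 features per FUM), y: 0/1 influential labels
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=10, scoring=["accuracy", "f1", "roc_auc"])
        print(name,
              round(scores["test_accuracy"].mean(), 4),
              round(scores["test_f1"].mean(), 4),
              round(scores["test_roc_auc"].mean(), 4))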

4.1 Learning Quality

To obtain a model that reasonably fits our objective, we performed the learning phase with several supervised classification algorithms and then selected the best algorithm with respect to accuracy (ACC), F-measure (F1) and AUC. To better understand classifier performance, we also examine how the classifiers label the test data, focusing on the True Positive (TP) and True Negative (TN) rates generated by each classifier. The classifiers' performance is reported in Table 1.

Table 1. Performance comparison of various classification algorithms.

Classifier  ACC    F1     AUC    %TP    %TN
NN          73.48  72.76  74.34  83.78  64.91
RBF SVM     55.54  69.35  51.85  11.57  92.14
DT          84.10  82.29  79.83  84.74  82.51
RF          89.11  90.02  89.02  88.01  90.02
ANN         52.72  64.86  49.98  20.03  79.93
NB          62.91  70.38  61.11  41.47  80.75
LR          75.53  75.79  76.07  81.95  70.19

In terms of ACC, F1, and AUC, Random Forest (RF) achieved the best performance, followed by Decision Tree (DT), Logistic Regression (LR), Nearest Neighbors (NN), and RBF SVM, while Naive Bayes (NB) and the Artificial Neural Network (ANN) performed poorly. Regarding the TP and TN rates, Random Forest also achieved the best values and was the classifier with the best balance between the two classes (88.01% TP and 90.02% TN). In Fig. 2, we plot the ROC curves of all classifiers in the same graph. Upon visual inspection, we observe that the curve of the Random Forest classifier is closer than the other curves to the upper-left corner of the ROC space, which shows that the Random Forest classifier has the best trade-off between sensitivity (TP rate) and specificity (1 − FP rate): it correctly predicts the Influential class with minimal false positives. We therefore used the Random Forest classifier for all further classifications. Among the sentiment features, the overall sentiment score (OSS) of the FUM is the most important, followed by the sentiment score of the most negative term and the sentiment score of the least positive term (SSNMax and SSPMin).



Fig. 2. ROC curve of the different classifiers.

4.2 Features Quality

In this subsection, we assess the relevance of the generated features through their prediction strength. We draw the feature importance plot of the Random Forest classification, as shown in Fig. 3. We notice that the features related to FUM sentiment are the most important, followed by the features related to content, and then the features related to time.

Fig. 3. Feature importance in Random Forest classification.

We find that strongly negative FUMs tend to be more attractive and influential than strongly positive FUMs. This finding is in line with the observations for



the features related to emotions: the Like ratio (LKR) is the most important, followed by the rate of emotional term presence (EMT) and the sadness ratio (SDR). We note that the Like button is ambiguous. Before October 2015 the other reactions did not exist; only the Like reaction was available, which made it overused to express both positive and negative emotions, and even after the introduction of the other emotional reactions, Like is still overused. Therefore, LKR reflects user engagement toward the FUM more than the emotion that users give off. In contrast, we note that the sadness ratio (SDR) is more decisive than the love ratio (LVR). This observation also confirms that strongly negative FUMs implying negative emotions tend to be more influential than FUMs implying positive emotions like Love. For the content features, those related to FUM length (NBC and NBS) and readability (ARI) are the most important. We find that a brief FUM cannot be as influential as a long FUM; however, to be influential a FUM must be readable and comprehensible by a wide range of people. The ARI measure performs well in the context of social media because it was designed on the basis of length indicators. The spelling error rate (NBSE), on the other hand, is not critical in social media because people tend to use colloquial and invented words and to make frequent mistakes. Lastly, for the time features, we find that RCT is more important than LCF: a FUM that takes less time to engage users tends to have a longer life cycle, and the first replies to a FUM reflect whether the FUM will be influential or not.

4.3 Predicting Election Outcome Quality

In order to quantify the difference between the prediction and the ground truth, we relied on the Mean Absolute Error (MAE), like the vast majority of previous works. The MAE is defined as

MAE = (1/n) * Σ_{i=1}^{n} |P_i − R_i|

where n is the number of candidates, P_i is the predicted vote percentage of candidate i and R_i is the true election result percentage of candidate i. We applied our approach to different time intervals. To better evaluate the contribution of influential message selection and of sentiment analysis, we also tried previous approaches: Message Count (MC), Message Sentiment (MS), and our Influential Message Sentiment (IMS). The results are presented in Table 2.

Table 2. MAE at different time intervals.

       1 year  6 Months  3 Months  1 month  2 Weeks  1 week
IMS    01.29   01.52     03.02     00.88    01.10    01.42
MS     03.00   03.83     06.33     02.71    02.00    02.50
MC     06.92   08.72     15.53     07.16    06.07    06.80
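For completeness, a minimal sketch of the MAE computation defined above, with purely illustrative vote shares:

def mean_absolute_error(predicted, actual):
    # predicted/actual: vote percentages per candidate
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)

# mean_absolute_error([49.5, 50.5], [51.1, 48.9]) -> about 1.6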



The best MAE achieved by the most well-known polling institutes4 is 2.3, by Reuters/Ipsos, while the worst is 4.5, by LA Times/U.S.C. Tracking. Our approach was able to achieve 0.88 by choosing influential messages posted one month before election day, 1.10 with influential messages published two weeks before, and 1.52 with influential messages published six months before. Compared to the MC and MS approaches, our approach was also more accurate, achieving an error rate below one, whereas the best error rate achieved by MC was 6.07 and the best error rate achieved by MS was 2.00. Relying only on data volume led to the highest error; when sentiment is added to the volume, the MAE slightly decreases, and after removing non-influential messages the MAE improves considerably. Furthermore, scoring each vote by the strength of the expressed sentiment helps the prediction model to ignore weak messages. To better visualize the differences between the approaches, we illustrate the MAE in Fig. 4. We note that, independently of the time interval, relying only on data volume always leads to the highest error. We also note that predicting the election outcome one year before election day achieves a good performance compared to other time intervals. Exploring the candidates' online presence strategy before the election is relevant to conclude how well each candidate worked on his or her public image, and based on this we can accurately predict the election result. There are mainly two kinds of online presence strategy: the long-term (one year) and the short-term (less than one month). However, the more the historical record is reduced, the worse the forecast performance becomes.

Fig. 4. MAE overview on different time intervals by different approaches.

The error rate is lower when forecasting one year and one month before election day. In contrast, the error is enormous six months and one week before election day. That is to say, exploring the historical record only partially

4 www.realclearpolitics.com.



is like analyzing an online political strategy only by half. Moreover, a few days before election day there is more noise than in any other period.

5 Conclusion

In this paper, we proposed a novel model for election forecasting using sentiment analysis of influential messages. We collected data through the Facebook Graph API and constructed a classifier to select only the influential messages based on message content, time, sentiment, and emotion; the Random Forest algorithm showed the best classification performance. We applied our model to the 2016 United States presidential election and demonstrated that it is reliable to predict election results based on sentiment analysis of influential messages, and that data bias is appropriately addressed by influential message selection. We found that our approach was capable of achieving a better MAE than both off-line polls and classical approaches. In the future, we plan to continue our work on sentiment analysis of influential messages using other refinements such as the definition of an influence degree.

References 1. Woodly, D.: New competencies in democratic communication? Blogs, agenda setting and political participation. Public Choice 134, 109–123 (2008) 2. Jin, X., Gallagher, A., Cao, L., Luo, J., Han, J.: The wisdom of social multimedia: using flickr for prediction and forecast. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1235–1244. ACM (2010) 3. Williams, C., Gulati, G.: What is a social network worth? Facebook and vote share in the 2008 presidential primaries. In: American Political Science Association (2008) 4. Tumasjan, A., Sprenger, T.O., Sandner, P.G., Welpe, I.M.: Predicting elections with twitter: what 140 characters reveal about political sentiment. In: ICWSM, vol. 10, pp. 178–185 (2010) 5. Gayo Avello, D., Metaxas, P.T., Mustafaraj, E.: Limits of electoral predictions using twitter. In: AAAI Conference on Weblogs and Social Media (2011) 6. Burnap, P., Gibson, R., Sloan, L., Southern, R., Williams, M.: 140 characters to victory?: Using twitter to predict the UK: general election. Electoral Stud. 41(2016), 230–233 (2015) 7. Romero, D.M., Reinecke, K., Robert Jr., L.P.: The influence of early respondents: information cascade effects in online event scheduling. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 101–110. ACM (2017) 8. Cialdini, R.B., Trost, M.R.: Social influence: social norms, conformity and compliance (1998) 9. Qazi, A., Raj, R.G., Tahir, M., Cambria, E., Syed, K.B.S.: Enhancing business intelligence by means of suggestive reviews. Sci. World J. 2014 (2014) 10. Cambria, E., Poria, S., Hazarika, D., Kwok, K.: SenticNet 5: discovering conceptual primitives for sentiment analysis by means of context embeddings. In: AAA, no. 1, pp. 1795–1802 (2018)



11. Cambria, E., Hussain, A.: Sentic album: content-, concept-, and context-based online personal photo management system. Cogn. Comput. 4, 477–496 (2012) 12. Grassi, M., Cambria, E., Hussain, A., Piazza, F.: Sentic web: a new paradigm for managing social media affective information. Cogn. Comput. 3, 480–489 (2011) 13. Cambria, E., Song, Y., Wang, H., Howard, N.: Semantic multidimensional scaling for open-domain sentiment analysis. IEEE Intell. Syst. 29, 44–51 (2014) 14. Bravo-Marquez, F., Mendoza, M., Poblete, B.: Meta-level sentiment models for big social data analysis. Knowl.-Based Syst. 69, 86–99 (2014) 15. Ara´ ujo, M., Gon¸calves, P., Cha, M., Benevenuto, F.: iFeel: a system that compares and combines sentiment analysis methods. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 75–78. ACM (2014) 16. Strandberg, K.: A social media revolution or just a case of history repeating itself? The use of social media in the, finish parliamentary elections. New Media Soc. 15(2013), 1329–1347 (2011) 17. Bond, R.M., et al.: A 61-million-person experiment in social influence and political mobilization. Nature 489, 295–298 (2012) 18. Sang, E.T.K., Bos, J.: Predicting the 2011 Dutch senate election results with twitter. In: Proceedings of the Workshop on Semantic Analysis in Social Media, pp. 53–60. Association for Computational Linguistics (2012) 19. Jungherr, A.: Tweets and votes, a special relationship: the 2009 federal election in Germany. In: Proceedings of the 2nd Workshop on Politics, Elections and Data, pp. 5–14. ACM (2013) 20. Gayo-Avello, D.: “I wanted to predict elections with twitter and all i got was this lousy paper”-a balanced survey on election prediction using twitter data. arXiv preprint arXiv:1204.6441 (2012) 21. Franch, F.: (wisdom of the crowds) 2: UK election prediction with social media. J. Inf. Technol. Polit. 10(2013), 57–71 (2010) 22. Ceron, A., Curini, L., Iacus, S.M., Porro, G.: Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens’ political preferences with an application to Italy and France. New Media Soc. 16, 340–358 (2014) 23. Caldarelli, G., et al.: A multi-level geographical study of Italian political elections from twitter data. PLoS ONE 9, e95809 (2014) 24. Choy, M., Cheong, M.L., Laik, M.N., Shung, K.P.: A sentiment analysis of Singapore presidential election 2011 using twitter data with census correction. arXiv preprint arXiv:1108.5520 (2011) 25. Arroba Rimassa, J., Llopis, F., Mu˜ noz, R., Guti´errez, Y., et al.: Using the twitter social network as a predictor in the political decision. In: 19th CICLing Conference (2018) 26. Bermingham, A., Smeaton, A.: On using twitter to monitor political sentiment and predict election results. In: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), pp. 2–10 (2011) 27. Metaxas, P.T., Mustafaraj, E., Gayo-Avello, D.: How (not) to predict elections. In: Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), pp. 165–171 (2011) 28. Susanto, Y., Livingstone, A., Ng, B.C., Cambria, E.: The hourglass model revisited. IEEE Intell. Syst. 35, 96–102 (2020) 29. Kumar, R., Vadlamani, R.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl.-Based Syst. 89, 14–46 (2015) 30. Smith, E.A., Kincaid, J.P.: Derivation and validation of the automated readability index for use with technical materials. Hum. Factors 12, 457–564 (1970)

Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments

Nawshad Farruque(B), Chenyang Huang, Osmar Zaïane, and Randy Goebel

Alberta Machine Intelligence Institute (AMII), Department of Computing Science, University of Alberta, Edmonton, AB T6G 2R3, Canada
{nawshad,chuang8,zaiane,rgoebel}@ualberta.ca

Abstract. We present an empirical analysis of basic and depression specific multi-emotion mining in Tweets, using state of the art multi-label classifiers. We choose our basic emotions from a hybrid emotion model consisting of the commonly identified emotions from four highly regarded psychological models. Moreover, we augment that emotion model with new emotion categories arising from their importance in the analysis of depression. Most of these additional emotions have not been used in previous emotion mining research. Our experimental analyses show that a cost sensitive RankSVM algorithm and a Deep Learning model are both robust, measured by both Micro F-Measures and Macro F-Measures. This suggests that these algorithms are superior in addressing the widely known data imbalance problem in multi-label learning. Moreover, our application of Deep Learning performs the best, giving it an edge in modeling deep semantic features of our extended emotional categories.

Keywords: Emotion identification · Sentiment analysis

1 Introduction

Mining multiple human emotions can be a challenging area of research since human emotions tend to co-occur [12]. For example, most often human emotions such as joy and surprise tend to occur together rather than separately as just joy or just surprise (see Table 1 for some examples from our dataset). In addition, identifying these cooccurrences of emotions and their compositionality can provide insight for fine grained analysis of emotions in various mental health problems. But there is little literature that has explored multi-label emotion mining from text [2, 11, 17]. However, with the increasing use of social media, where people share their day to day thoughts and ideas, it is easier than ever to capture the presence of different emotions in their posts. So our main research focus is to provide insights on identifying multi-emotions from social media posts such as Tweets. To compile a list of emotions we want to identify, we have used a mixed emotion model [17] which is based on four distinct and widely used emotion models used in psychology. Furthermore, we augment this emotion model with further emotions that are deemed useful for a depression identification task we


intend to pursue later. Here we separate our experiments into two: one for a smaller emotion model (using nine basic human emotions), and another for an augmented emotion model (using both basic and depression related human emotions). We present a detailed analysis of the performance of several state of the art algorithms used in multi-label text mining tasks on both sets of data, which have varying degrees of data imbalance.

Table 1. Example Tweets with multi-label emotions

Tweets | Labels
“Feel very annoyed by everything and I hope to leave soon because I can’t stand this anymore” | Angry, sad
“God has blessed me so much in the last few weeks I can’t help but smile” | Joy, love

1.1 Emotion Modeling

In affective computing research, the most widely accepted model for emotion is the one suggested by [5]. Our emotion repertoire draws on the commonly identified emotions of this and other previous models, augmented with a small number of additional emotions: love, thankfulness and guilt, all of which are relevant to our study. We further seek to confirm emotions such as betrayal, frustration, hopelessness, loneliness, rejection, schadenfreude and self loathing; any of these could contribute to the identification of a depressive disorder [1, 16]. Our mining of these emotions, with the help of RankSVM and an attention based deep learning model, is a new contribution not previously made [7, 13].

1.2 Multi-label Emotion Mining Approaches

Earlier research in this area has employed learning algorithms from two broad categories: (1) Problem Transformation and (2) Algorithmic Adaptation. A brief description of each of these follows in the next subsections.

1.3 Problem Transformation Methods

In the problem transformation methods approach (PTM), multi-label data is transformed into single label data, and a series of single label (or binary) classifiers are trained for each label. Together, they predict multiple labels (cf. details provided in Sect. 2). This method is often called a “one-vs-all” model. The problem is that these methods do not consider any correlation amongst the labels. This problem was addressed by a model proposed by [2], which uses label powersets (LP) to learn an ensemble of k-labelset classifiers (RAKEL) [18]. Although this classifier method respects the correlation among labels, it is not robust against the data imbalance problem, which is an inherent problem in multi-label classification, simply because of the typical uneven class label distribution.


1.4 Algorithmic Adaptation Methods

The alternative category to PTMs are the so-called algorithmic adaptation methods (AAMs), where a single label classifier is modified to do multi-label classification. Currently popular AAMs are based on trees, such as the classic C4.5 algorithm adapted for multi-label tasks [4], probabilistic models such as [6], and neural network based methods such as BP-MLL [19]. However, as with PTMs, these AAM methods are also not tailored for imbalanced data learning, and fail to achieve good accuracy with huge multi-label datasets where imbalance is a common problem. In our approach, we explore two state of the art methods for multi-label classification. One is a cost sensitive RankSVM and the other is a deep learning model based on Long Short Term Memory (LSTM) and Attention. The former is an amalgamation of PTMs and AAMs, with the added advantage of large margin classifiers. This choice provides an edge on learning from huge imbalanced multi-label data, while still considering the label correlations. The latter is a purely AAM approach, which is able to more accurately capture the latent semantic structure of Tweets. In Sects. 2 and 3 we provide the technical details of our baseline and experimental models.

2 Baseline Models

According to the results in [17], good accuracy can be achieved with a series of Naïve Bayes (NB) classifiers (also known as a one-vs-all classifier), where each classifier is trained on balanced positive and negative samples (i.e., Tweets, represented by bag-of-words features) for each class, especially with respect to other binary classifiers (e.g., Support Vector Machines (SVMs)). To recreate this baseline, we have implemented our own “one-vs-all” method. To do so, we transform our multi-label data into sets of single label data, then train separate binary NB classifiers for each of the labels. An NB classifier NB_i in this model uses emotion E_i as positive samples and all other emotion samples as negative samples, where i is a representative index of our n emotion repertoire. We next concatenate the binary outputs of these individual classifiers to get the final multi-label output. Note that previous closely related research [2, 11] used simple SVM and RAKEL, which are not robust against data imbalance, and did not look at short texts, such as Tweets, for multi-label emotion mining. On the other hand, [17] had a focus on emotion mining from Tweets, but their methods were multi-class, unlike ours, where we are interested in multi-label emotion mining.
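To make this baseline concrete, the following is a minimal sketch of such a one-vs-all (binary relevance) setup using scikit-learn; the toy Tweets and emotion labels are illustrative only and are not taken from the actual dataset.

```python
# One-vs-all Naive Bayes baseline sketch: one binary NB per emotion, whose
# outputs are concatenated into the multi-label prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

tweets = ["feel very annoyed and sad today", "so blessed i cannot help but smile"]
labels = [["angry", "sad"], ["joy", "love"]]

vectorizer = CountVectorizer()                 # bag-of-words features
X = vectorizer.fit_transform(tweets)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                  # one binary column per emotion

model = OneVsRestClassifier(MultinomialNB())   # binary relevance ("one-vs-all")
model.fit(X, Y)
print(mlb.inverse_transform(model.predict(X)))
```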

3 Experiment Models

3.1 A Cost Sensitive RankSVM Model

Conventional multi-label classifiers learn a mapping function h : X → 2^q from a D-dimensional feature space X ∈ R^D to the label space Y ⊆ {0, 1}^q, where q is the number of labels. A simple label powerset algorithm (LP) considers each distinct combination of labels (also called labelsets) that exist in the training data as a single label, thus retaining the correlation of labels. In multi-label learning, some labelsets occur


more frequently than others, and traditional SVM algorithms perform poorly in these scenarios. In many information retrieval tasks, the RankSVM algorithm is widely used to learn rankings for documents, given a query. This idea can be generalized to multi-label classification, where the relative ranks of labels are of interest. [3] have proposed two optimized versions of the RankSVM [8] algorithm: one is called RankSVM(LP), which not only incorporates the LP algorithm but also associates a misclassification cost λ_i with each training instance. This misclassification cost is higher for the label powersets that have smaller numbers of instances, and is automatically calculated based on the distribution of label powersets in the training data. To further reduce the number of generated label powersets (and to speed up processing), they proposed another version of their algorithm called RankSVM(PPT), where labelsets are pruned by an a priori threshold based on properties of the data set.
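To illustrate the labelset idea independently of the specific RankSVM implementation of [3], the following toy snippet shows the label powerset transformation and a simple frequency-based misclassification cost; the exact cost formula used by RankSVM(LP) may differ.

```python
# Label powerset (LP) transformation sketch with illustrative instance costs.
from collections import Counter

def label_powerset(label_sets):
    """Map each distinct combination of labels to a single class id."""
    keys = [tuple(sorted(ls)) for ls in label_sets]
    class_ids = {k: i for i, k in enumerate(sorted(set(keys)))}
    return [class_ids[k] for k in keys], class_ids

y_multi = [["joy", "love"], ["sad"], ["joy", "love"], ["angry", "sad"]]
y_lp, mapping = label_powerset(y_multi)

# Rarer labelsets get a larger cost, so misclassifying under-represented
# label combinations is penalised more heavily.
counts = Counter(y_lp)
costs = [max(counts.values()) / counts[c] for c in y_lp]
print(y_lp, costs)
```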

3.2 A Deep Learning Model

Results by [20] showed that a combination of Long Short Term Memory (LSTM) and an alternative to LSTM, Gated Recurrent Unit (GRU) layers, can be very useful in learning phrase level features, and have very good accuracy in text classification. [9] achieved state-of-the-art results in sentence classification with the help of a bidirectional LSTM (bi-LSTM) combined with a self-attention mechanism. Here we adopt [9]'s model and further enable this model for multi-label classification by using a suitable loss function and a thresholded softmax layer to generate multi-label output. We call this model LSTM-Attention (LSTM-Att), as shown in Fig. 1; w_i is the word embedding (which can be either one-hot bag-of-words or dense word vectors), h_i is the hidden state of the LSTM at time step i, and the output of this layer is fed to the Self Attention (SA) layer. The SA layer's output is then sent to a linear layer, which translates the final output to a probability of different labels (in this case emotions) with the help of softmax activation. Finally, a threshold is applied on the softmax output to get the final multi-label predictions.

3.3 Loss Function Choices

The choice of a loss function is important in this context. For the multi-label classification task, it has been shown [10] that Binary Cross Entropy (BCE) Loss over sigmoid activation is very useful. Our use of a BCE objective can be formulated as follows:

minimize  −(1/n) ∑_{i=1}^{n} ∑_{l=1}^{L} [ y_{il} log(σ(ŷ_{il})) + (1 − y_{il}) log(1 − σ(ŷ_{il})) ]    (1)

where L is the number of labels, n is the number of samples, y_i is the target label, ŷ_i is the predicted label from the last linear layer (see Fig. 1), and σ is the sigmoid function, σ(x) = 1 / (1 + e^{−x}).
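As a concrete illustration, the following minimal PyTorch sketch computes the objective of Eq. (1) with BCEWithLogitsLoss, which applies the sigmoid internally and averages the per-label binary cross-entropy terms; the tensor shapes and random values are placeholders.

```python
# Multi-label BCE objective sketch (Eq. (1)).
import torch
import torch.nn as nn

n, L = 4, 16                                            # n samples, L emotion labels
logits = torch.randn(n, L, requires_grad=True)          # \hat{y}: outputs of the last linear layer
targets = torch.randint(0, 2, (n, L)).float()           # y: multi-hot emotion labels

criterion = nn.BCEWithLogitsLoss()                      # sigmoid + binary cross-entropy
loss = criterion(logits, targets)
loss.backward()                                         # gradients for training
```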

Fig. 1. LSTM-Attention model for multi-label classification (the words of an input Tweet, e.g. “I am happy”, are passed through LSTM layers, a self-attention layer, and a linear layer followed by softmax and thresholding to produce the multi-label predictions).

Thresholding. The output of the last linear layer is a k-dimensional vector, where k is the number of different labels of the classification problem. For our task we use the softmax function to normalize each ŷ_i within the range of (0, 1) as follows:

ŷ′_i = e^{ŷ_i} / ∑_{j=1}^{k} e^{ŷ_j}    (2)

Since we are interested in more than one label, we use a threshold and consider only those labels with predicted probability beyond that threshold. Let the threshold be t; hence the final prediction for each label p_i is

p_i = 1 if ŷ′_i > t, and 0 otherwise.    (3)

To adjust the threshold for the LSTM-Att model, we use a portion of our training data as an evaluation set. Based on our choice of evaluation set, we have found that the threshold t = 0.3 provides the best results.
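The thresholding step of Eqs. (2)–(3) can be sketched as follows; the score vector is a placeholder and t = 0.3 is the value reported above.

```python
# Thresholded softmax sketch: normalise the k scores, then emit every label
# whose probability exceeds the threshold t.
import torch

scores = torch.tensor([2.1, 0.3, 1.8, -0.5])     # \hat{y} from the linear layer
probs = torch.softmax(scores, dim=0)             # Eq. (2)
t = 0.3
predictions = (probs > t).int()                  # Eq. (3): multi-hot output
print(probs, predictions)
```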

4 Experiments

We use the well-known bag-of-words (BOW) and pre-trained word embedding vectors (WE) as our feature sets, for two RankSVM algorithms (RankSVM-LP and RankSVM-PPT), a one-vs-all Naïve Bayes (NB) classifier and a deep learning model (LSTM-Att).


We name our experiments with algorithm names suffixed by feature names; for example, RankSVM-LP-BOW names the experiment with the RankSVM Label Powerset function on a bag-of-words feature set. We run these experiments on our two sets of data, and we use the RankSVM implementation provided by [3]. For multi-label classification, we implement our own one-vs-all model using the Python library scikit-learn and its implementation of multinomial NB. We implement our deep learning model in PyTorch1. In the next section, we present a detailed description of the datasets, data collection, data pre-processing, feature set extraction and evaluation metrics.

4.1 Data Set Preparation

We use two sets of multi-label data2 . Set 1 (we call it 9 emotion data) consists of Tweets only from the “Clean and Balanced Emotion Tweets” (CBET) dataset provided by [17]. It contains 3000 Tweets from each of nine emotions, plus 4,303 double labeled Tweets (i.e., Tweets which have two emotion labels), for a total of 31,303 Tweets. To create Set 2 (we call it 16 emotion data), we add extended emotions Tweets (having single and double labels) with Set 1 data adding up to total 50,000 Tweets. We used Tweets having only one and two labels because this is the natural distribution of labels in our collected data. Since previous research showed that hashtag labeled Tweets are consistent with the labels given by human judges [14], these Tweets were collected based on relevant hashtags and key-phrases3 . Table 2 lists the additional emotions we are interested in. Our data collection process is identical to [17], except that we use the Twitter API and key-phrases along with hashtags. In this case, we gather Tweets with these extra emotions between June, 2017 to October, 2017. The statistics of gathered Tweets are presented in Table 3. Both of our data sets have the following characteristics after preprocessing: – – – – – – – – – – –

All Tweets are in English. All the letters in Tweets are converted to lowercase. White space characters, punctuation and stop words are removed. Duplicate Tweets are removed. Incomplete Tweets are removed. Tweets shorter than 3 words are removed. Tweets having more than 50% of its content as name mentions are removed. URLs are replaced by ‘url’. All name mentions are replaced with ‘@user’. Multi-word hashtags are decomposed in their constituent words. Hashtags and key-phrases corresponding to emotion labels are removed to avoid data overfitting.
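The following is a simplified sketch of a few of the listed rules (lowercasing, URL and mention replacement, punctuation stripping, and the minimum-length filter); stop-word removal, deduplication, and hashtag decomposition are omitted for brevity.

```python
# Simplified Tweet preprocessing sketch for a subset of the rules above.
import re

def preprocess(tweet: str):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "url", tweet)      # URLs -> 'url'
    tweet = re.sub(r"@\w+", "@user", tweet)            # name mentions -> '@user'
    tweet = re.sub(r"[^\w\s@]", " ", tweet)            # strip punctuation
    tokens = tweet.split()
    return tokens if len(tokens) >= 3 else None        # drop very short Tweets

print(preprocess("Feeling so alone tonight :( https://t.co/x @friend"))
```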

Finally, we use 5-fold cross validation (CV) (train 80%–test 20% split). In each fold we further create a small validation set (10% of the training set) and use it for parameter tuning in RankSVM, and for threshold finding for LSTM-Att. Our baseline model does not have any parameters to tune. Finally, our results are averaged over the test sets of the 5-fold CV based on the best parameter combination found on each fold's validation set, and we report that. Using this approach, we find that a threshold of 0.3 is generally better. We do not do heavy parameter tuning in LSTM-Att. Also, we use the Adam optimizer with a 0.001 learning rate.

1 http://pytorch.org/.
2 We intend to release our dataset online upon publication of this paper.
3 We use specific key-phrases to gather Tweets for specific emotions, e.g. for loneliness, we use “I am alone” to gather more data for our extra emotion data collection process.

Table 2. New emotion labels and corresponding hashtags and key phrases

Emotion | List of Hashtags
Betrayed | #betrayed
Frustrated | #frustrated, #frustration
Hopeless | #hopeless, #hopelessness, no hope, end of everything
Loneliness | #lonely, #loner, i am alone
Rejected | #rejected, #rejection, nobody wants me, everyone rejects me
Schadenfreude | #schadenfreude
Self loath | #selfhate, #ihatemyself, #ifuckmyself, i hate myself, i fuck myself

Table 3. Sample size for each new emotion (after cleaning)

Emotion | Number of Tweets
Betrayed | 1,724
Frustrated | 4,424
Hopeless | 3,105
Loneliness | 4,545
Rejected | 3,131
Schadenfreude | 2,236
Self loath | 4,181
Total | 23,346

4.2 Feature Sets

We create a vocabulary of the 5000 most frequent words which occur in at least three training samples. If we imagine this vocabulary as a vector where each of its indices represents a unique word in that vocabulary, then a bag-of-words (BOW) feature can be represented by marking as 1 those indices whose corresponding word matches a Tweet word. To create word embedding features we use 200-dimensional GloVe [15] word embeddings trained on a corpus of 2B Tweets with 27B tokens and a 1.2M vocabulary4. We represent each Tweet with the average word embedding of its constituent words that are also present in the pre-trained word embeddings. We further normalize the word embedding features using min-max normalization.

4 https://nlp.stanford.edu/projects/glove/.
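A minimal sketch of the WE feature construction, assuming `glove` is a lookup from word to its pre-trained 200-dimensional vector; the min-max normalisation is applied per dimension here, which is one plausible reading of the description above.

```python
# Averaged GloVe features per Tweet, followed by min-max normalisation.
import numpy as np

def tweet_embedding(tokens, glove, dim=200):
    vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def min_max_normalise(features):
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / np.where(hi - lo == 0, 1, hi - lo)

# X = min_max_normalise(np.vstack([tweet_embedding(t, glove) for t in tweets]))
```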


4.3 Evaluation Metrics

We report some metrics for our data imbalance measurements to highlight the effect of the analysed approaches in multi-label learning.

4.4 Quantifying Imbalance in Labelsets

We use the same idea as mentioned in [3] for the data imbalance calculation in labelsets, inspired by the idea of kurtosis. The following equation is used to calculate the imbalance in labelsets:

ImbalLabelSet = ∑_{i=1}^{l} (L_i − L_max)^4 / ((l − 1) s^4)    (4)

where

s = sqrt( (1/l) ∑_{i=1}^{l} (L_i − L_max)^2 )    (5)

Here, l is the number of all labelsets, L_i is the number of samples in the i-th labelset, and L_max is the number of samples in the labelset having the maximum samples. The value of ImbalLabelSet is a measure of the distribution shape of the histograms depicting the labelset distribution. A higher value of ImbalLabelSet indicates a larger imbalance, and determines the “peakedness” level of the histogram (see Fig. 2), where the x axis depicts the labelsets and the y axis denotes the counts for them. The numeric imbalance level in labelsets is presented in Table 4; there we notice that the 16 emotion dataset is more imbalanced than the 9 emotion dataset.

Table 4. Degree of imbalance in labelsets and labels

Dataset | Labelset Imbal. | Label Imbal.
9 Emotion data | 27.95 | 37.65
16 Emotion data | 52.16 | 69.93
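A small sketch of this measure, assuming s in Eq. (5) is the root-mean-square deviation of the labelset counts from the largest count; the example counts are hypothetical.

```python
# Labelset imbalance sketch for Eqs. (4)-(5).
import numpy as np

def labelset_imbalance(counts):
    counts = np.asarray(counts, dtype=float)
    l, l_max = len(counts), counts.max()
    s = np.sqrt(np.mean((counts - l_max) ** 2))                  # Eq. (5)
    return np.sum((counts - l_max) ** 4) / ((l - 1) * s ** 4)    # Eq. (4)

# As noted in the text, a higher value indicates a larger labelset imbalance.
print(labelset_imbalance([900, 400, 120, 60, 20]))
```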

Fig. 2. Labelset imbalance: histogram for 9 emotion dataset on the left and histogram for 16 emotion dataset on the right


4.5 Micro/Macro F-Measures

We use label based Micro/Macro F-Measures to report the performance of our classifiers in classifying each label. See Eqs. 6 and 7.

Micro-FM = F-Measure( ∑_{j=1}^{l} TP_j, ∑_{j=1}^{l} FP_j, ∑_{j=1}^{l} FN_j )    (6)

Macro-FM = (1/l) ∑_{j=1}^{l} F-Measure(TP_j, FP_j, FN_j)    (7)

where l is the total number of labels, F-Measure is the standard F1-Score5, and TP, FP, FN are the True Positive, False Positive and False Negative labels, respectively. Unlike Macro-FM, Micro-FM reports the performance of our classifiers while taking into account the imbalance in the dataset.
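These label-based measures correspond to micro- and macro-averaged F1 on multi-hot label matrices, as the following toy sketch with scikit-learn shows; the example matrices are placeholders.

```python
# Micro/Macro F-measure sketch for Eqs. (6)-(7).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

micro = f1_score(y_true, y_pred, average="micro")   # pools TP/FP/FN over labels
macro = f1_score(y_true, y_pred, average="macro")   # averages per-label F1
print(micro, macro)
```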

5 Results Analysis

Overall, LSTM-Attention with word embedding features (LSTM-Att-WE) performs the best in terms of the Micro-FM and Macro-FM measures, compared with the baseline multi-label NB and the best performing RankSVM models, averaged across the 9 emotion and 16 emotion datasets. These results are calculated according to Eq. 8, where BestFM_i refers to the model with the best F-Measure value (either Micro or Macro), the subscript index i refers to our 9 or 16 emotion dataset, AVG denotes the function that calculates the average, and the other variables are self explanatory. To compare with RankSVM, we use the best RankSVM F-Measures instead of BaselineFM in the equation below. In the Micro-FM measure, LSTM-Att achieves a 44% increase with regard to baseline NB and a 23% increase with regard to the best RankSVM models. In the Macro-FM measure, LSTM-Att-WE shows a 37% increase with regard to baseline NB-BOW and an 18% increase with regard to the best RankSVM models. The percentage values were rounded to the nearest integers. It is worth noting that a random assignment to classes would result in an accuracy of 11% in the case of the 9 emotions (1/9) and 6% in the case of the 16 emotion dataset (1/16). The following sections show analyses based on F-measures, data imbalance and confusion matrices (see Tables 5, 6 and Fig. 3).

FM-Inc = AVG( (BestFM_16 − BaselineFM_16) / BaselineFM_16, (BestFM_9 − BaselineFM_9) / BaselineFM_9 ) × 100    (8)

5 https://en.wikipedia.org/wiki/F-score.

Table 5. Results on 9 emotion dataset

Models | Macro-FM | Micro-FM
NB-BOW (baseline) | 0.3915 | 0.3920
NB-WE | 0.3715 | 0.3617
RankSVM-LP-BOW | 0.3882 | 0.3940
RankSVM-LP-WE | 0.4234 | 0.4236
RankSVM-PPT-BOW | 0.4275 | 0.4249
RankSVM-PPT-WE | 0.3930 | 0.3920
LSTM-Att-BOW | 0.4297 | 0.4492
LSTM-Att-WE | 0.4685 | 0.4832

Table 6. Results on 16 emotion dataset

Models | Macro-FM | Micro-FM
NB-BOW (baseline) | 0.2602 | 0.2608
NB-WE | 0.2512 | 0.2356
RankSVM-LP-BOW | 0.3523 | 0.3568
RankSVM-LP-WE | 0.3342 | 0.3391
RankSVM-PPT-BOW | 0.3406 | 0.3449
RankSVM-PPT-WE | 0.3432 | 0.3469
LSTM-Att-BOW | 0.3577 | 0.3945
LSTM-Att-WE | 0.4020 | 0.4314

Fig. 3. Comparative results on 9 and 16 emotion datasets

5.1 Performance with Regard to F-Measures

Overall, in all models, the Micro-FM and Macro-FM values are very close to each other (see Tables 5 and 6), indicating that all of the models have similar performance in terms of both the most populated and the least populated classes.


5.2 Performance with Regard to Data Imbalance

The performance of all the models generally drops as the imbalance increases. For this performance measure, we take the average F-Measures (Micro or Macro) across the WE and BOW features for NB and the deep learning model; for RankSVM the values are averaged over the two types of RankSVM models as well (i.e., LP and PPT). See Eq. 9, where the performance drop function PD(model) takes any model and calculates the decrease or drop in F-Measure (either Micro or Macro) averaged over the WE and BOW features. For RankSVM, we input RankSVM-PPT and RankSVM-LP and take the average, based on Eq. 10, to determine the performance drop of the overall RankSVM algorithm. We observe that LSTM-Att's performance drop is 11% for Micro-FM and 15% for Macro-FM when moving from the 9 emotion data (less imbalanced) to the 16 emotion data (more imbalanced). In comparison, RankSVM has a higher drop (17% for Micro-FM and 18% for Macro-FM) and multi-label NB has the highest drop (41% for Micro-FM and 38% for Macro-FM) for the same. These results indicate that overall the LSTM-Att and RankSVM models are more robust against data imbalance.

PD(model) = (1 / AVG_9(model.BOW.FM, model.WE.FM)) × {AVG_9(model.BOW.FM, model.WE.FM) − AVG_16(model.BOW.FM, model.WE.FM)} × 100    (9)

RankSVM-PD = AVG(PD(RankSVM-PPT), PD(RankSVM-LP))    (10)

5.3 Confusion Matrices

Fig. 4. Confusion matrix for 9 emotions


An ideal confusion matrix should be strictly diagonal, with all other values set to zero. In our multi-label case, we see that our confusion matrices have the highest values along the diagonal, implying that most of the emotions are correctly classified. On the other hand, non-diagonal values imply incorrect classification; “love” and “joy”, and “frustrated” and “hopeless”, are the most commonly confused label pairs because these emotion labels tend to occur together (Figs. 4 and 5).

Fig. 5. Confusion matrix for 16 emotions

6 Conclusion and Future Work

We have experimented with two state-of-the-art models for a multi-label emotion mining task. We have provided details of data collection and processing for our two multi-label datasets, one containing Tweets with nine basic emotions and another having those Tweets augmented with additional Tweets from seven new emotions (related to depression). We also use two widely used features for this task, namely bag-of-words and word embeddings. Moreover, we provide a detailed analysis of these algorithms' performance based on the Micro-FM and Macro-FM measures. Our experiments indicate that a deep learning model exhibits superior results compared to the others; we speculate that this is because of improved capture of subtle differences in the language, but we lack an explanatory mechanism to confirm this. In future, we would like to explore several self-explainable and post-hoc explainable deep learning models to shed some light on what these deep learning models look at for the multi-label emotion classification task compared


to their non-deep learning counterparts. Moreover, deep learning and RankSVM models are both better at handling data imbalance. It is also to be noted that a word embedding feature-based deep learning model is better than a bag-of-words feature-based deep learning model, unlike the Naïve Bayes and RankSVM models. As expected, this confirms that deep learning models work better with dense word vectors than with very sparse bag-of-words features. In the future, we would like to do a finer grained analysis of Tweets from depressed people, based on these extended emotions, and identify subtle language features from the attention layer outputs, which we believe will help us to detect early signs of depression, and to monitor depressive condition, its progression and treatment outcome.

Acknowledgements. We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Alberta Machine Intelligence Institute (AMII) for their generous support to pursue this research.

References

1. Abramson, L.Y., Metalsky, G.I., Alloy, L.B.: Hopelessness depression: a theory-based subtype of depression. Psychol. Rev. 96(2), 358 (1989)
2. Bhowmick, P.K.: Reader perspective emotion analysis in text through ensemble based multilabel classification framework. Comput. Inf. Sci. 2(4), 64 (2009)
3. Cao, P., Liu, X., Zhao, D., Zaiane, O.: Cost sensitive ranking support vector machine for multi-label data learning. In: Abraham, A., Haqiq, A., Alimi, A.M., Mezzour, G., Rokbani, N., Muda, A.K. (eds.) HIS 2016. AISC, vol. 552, pp. 244–255. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52941-7_25
4. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: De Raedt, L., Siebes, A. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168, pp. 42–53. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44794-6_4
5. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992)
6. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 195–200. ACM (2005)
7. Hasan, M., Agu, E., Rundensteiner, E.: Using hashtags as labels for supervised learning of emotions in Twitter messages. In: Proceedings of the Health Informatics Workshop (HIKDD) (2014)
8. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. ACM (2002)
9. Lin, Z., et al.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
10. Liu, J., Chang, W.C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM (2017)
11. Luyckx, K., Vaassen, F., Peersman, C., Daelemans, W.: Fine-grained emotion detection in suicide notes: a thresholding approach to multi-label classification. Biomed. Inf. Insights 5(Suppl 1), 61 (2012)
12. Mill, A., Kööts-Ausmees, L., Allik, J., Realo, A.: The role of co-occurring emotions and personality traits in anger expression. Front. Psychol. 9, 123 (2018)


13. Mohammad, S.M.: #Emotional tweets. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 246–255. Association for Computational Linguistics (2012)
14. Mohammad, S.M., Kiritchenko, S.: Using hashtags to capture fine emotion categories from tweets. Comput. Intell. 31(2), 301–326 (2015)
15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
16. Pietraszkiewicz, A., Chambliss, C.: The link between depression and schadenfreude: further evidence. Psychol. Rep. 117(1), 181–187 (2015)
17. Shahraki, A.G., Zaïane, O.R.: Lexical and learning-based emotion mining from text. In: International Conference on Computational Linguistics and Intelligent Text Processing (CICLing) (2017)
18. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23(7), 1079–1089 (2011)
19. Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006)
20. Zhou, C., Sun, C., Liu, Z., Lau, F.: A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015)

Generating Word and Document Embeddings for Sentiment Analysis

Cem Rıfkı Aydın(B), Tunga Güngör, and Ali Erkan

Computer Engineering Department, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
[email protected], [email protected], [email protected]

Abstract. Sentiments of words can differ from one corpus to another. Inducing general sentiment lexicons for languages and using them cannot, in general, produce meaningful results for different domains. In this paper, we combine contextual and supervised information with the general semantic representations of words occurring in the dictionary. Contexts of words help us capture the domain-specific information and supervised scores of words are indicative of the polarities of those words. When we combine supervised features of words with the features extracted from their dictionary definitions, we observe an increase in the success rates. We try out the combinations of contextual, supervised, and dictionary-based approaches, and generate original vectors. We also combine the word2vec approach with hand-crafted features. We induce domain-specific sentimental vectors for two corpora, which are the movie domain and the Twitter datasets in Turkish. When we thereafter generate document vectors and employ the support vector machines method utilising those vectors, our approaches perform better than the baseline studies for Turkish with a significant margin. We evaluated our models on two English corpora as well and these also outperformed the word2vec approach. This shows that our approaches are cross-domain and portable to other languages.

Keywords: Sentiment analysis · Opinion mining · Word embeddings · Machine learning

1 Introduction

Sentiment analysis has recently been one of the hottest topics in natural language processing (NLP). It is used to identify and categorise opinions expressed by reviewers on a topic or an entity. Sentiment analysis can be leveraged in marketing, social media analysis, and customer service. Although many studies have been conducted for sentiment analysis in widely spoken languages, this topic is still immature for Turkish and many other languages. Neural networks outperform the conventional machine learning algorithms in most classification tasks, including sentiment analysis [9]. In these networks, word embedding vectors are fed as input to overcome the data sparsity problem and to make the representations of words more “meaningful” and robust. Those embeddings indicate how close the words are to each other in the vector space model (VSM).


Most of the studies utilise embeddings, such as word2vec [14], which take into account the syntactic and semantic representations of the words only. Discarding the sentimental aspects of words may lead to words of different polarities being close to each other in the VSM, if they share similar semantic and syntactic features. For Turkish, there are only a few studies which leverage sentimental information in generating the word and document embeddings. Unlike the studies conducted for English and other widely-spoken languages, in this paper, we use the official dictionaries for this language and combine the unsupervised and supervised scores to generate a unified score for each dimension of the word embeddings in this task. Our main contribution is to create original and effective word vectors that capture syntactic, semantic and sentimental characteristics of words, and use all of this knowledge in generating embeddings. We also utilise the word2vec embeddings trained on a large corpus. Besides using these word embeddings, we also generate hand-crafted features on a review-basis and create document vectors. We evaluate those embeddings on two datasets. The results show that we outperform the approaches which do not take into account the sentimental information. We also had better performances than other studies carried out on sentiment analysis in Turkish media. We also evaluated our novel embedding approaches on two English corpora of different genres. We outperformed the baseline approaches for this language as well. The source code and datasets are publicly available1 . The paper is organised as follows. In Sect. 2, we present the existing works on sentiment classification. In Sect. 3, we describe the methods proposed in this work. The experimental results are shown and the main contributions of our proposed approach are discussed in Sect. 4. In Sect. 5, we conclude the paper.

2 Related Work

In the literature, the main consensus is that the use of dense word embeddings outperforms the sparse embeddings in many tasks. Latent semantic analysis (LSA) used to be the most popular method in generating word embeddings before the invention of the word2vec and other word vector algorithms, which are mostly created by shallow neural network models. Although many studies have been carried out on generating word vectors including both semantic and sentimental components, generating and analysing the effects of different types of embeddings on different tasks is an emerging field for Turkish. Latent Dirichlet allocation (LDA) is used in [3] to extract a mixture of latent topics. However, it focusses on finding the latent topics of a document, not the word meanings themselves. In [19], LSA is utilised to generate word vectors, leveraging indirect cooccurrence statistics. These outperform the use of sparse vectors [5]. Some of the prior studies have also taken into account the sentimental characteristics of a word when creating word vectors [4, 11, 12]. A model with semantic and sentiment components is built in [13], making use of star-ratings of reviews. In [10], a sentiment lexicon is induced preferring the use of domain-specific cooccurrence statistics over the word2vec method and they outperform the latter.

1 https://github.com/cemrifki/sentiment-embeddings.


In a recent work on sentiment analysis in Turkish [6], they learn embeddings using Turkish social media. They use the word2vec algorithm, create several unsupervised hand-crafted features, generate document vectors and feed them as input into the support vector machines (SVM) approach. We outperform this baseline approach using more effective word embeddings and supervised hand-crafted features. In English, much of the recent work on learning sentiment-specific embeddings relies only on distant supervision. In [7], emojis are used as features and a bi-directional long short-term memory (bi-LSTM) neural network model is built to learn sentimentaware word embeddings. In [18], a neural network that learns word embeddings is built by using contextual information about the data and supervised scores of the words. This work captures the supervised information by utilising emoticons as features. Most of our approaches do not rely on a neural network model in learning embeddings. However, they produce state-of-the-art results.

3 Methodology

We generate several word vectors, which capture the sentimental, lexical, and contextual characteristics of words. In addition to these mostly original vectors, we also create word2vec embeddings to represent the corpus words by training the embedding model on these datasets. After generating these, we combine them with hand-crafted features to create document vectors and perform classification, as will be explained in Sect. 3.5.

3.1 Corpus-Based Approach

Contextual information is informative in the sense that, in general, similar words tend to appear in the same contexts. For example, the word smart is more likely to cooccur with the word hardworking than with the word lazy. This similarity can be defined semantically and sentimentally. In the corpus-based approach, we capture both of these characteristics and generate word embeddings specific to a domain. Firstly, we construct a matrix whose entries correspond to the number of cooccurrences of the row and column words in sliding windows. Diagonal entries are assigned the number of sliding windows that the corresponding row word appears in across the whole corpus. We then normalise each row by dividing the entries in the row by the maximum score in it. Secondly, we perform the principal component analysis (PCA) method to reduce the dimensionality. It captures latent meanings and takes into account high-order cooccurrence while removing noise. The attribute (column) number of the matrix is reduced to 200. We then compute the cosine similarity between each row pair w_i and w_j as in (1) to find out how similar two word vectors (rows) are.

cos(w_i, w_j) = (w_i · w_j) / (||w_i|| ||w_j||)    (1)


Thirdly, all the values in the matrix are subtracted from 1 to create a dissimilarity matrix. Then, we feed the matrix as input into the fuzzy c-means clustering algorithm. We chose the number of clusters as 200, as it is considered a standard for word embeddings in the literature. After clustering, dimension i for a corresponding word indicates the degree to which this word belongs to cluster i. The intuition behind this idea is that if two words are similar in the VSM, they are more likely to belong to the same clusters with analogous probabilities. In the end, each word in the corpus is represented by a 200-dimensional vector. In addition to this method, we also perform singular value decomposition (SVD) on the cooccurrence matrices, where we compute the matrix M^PPMI = UΣV^T. Positive pointwise mutual information (PPMI) scores between words are calculated and the truncated singular value decomposition is computed. We take into account the U matrix only for each word. We have chosen the number of singular values as 200. That is, each word in the corpus is represented by a 200-dimensional vector as follows:

w_i = (U)_i    (2)
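A compact sketch of this SVD variant, assuming `cooc` is the word-by-word cooccurrence count matrix described above; the PPMI weighting shown here is the standard formulation and may differ in detail from the authors' implementation.

```python
# PPMI weighting followed by truncated SVD; the first 200 columns of U
# are kept as word vectors, as in Eq. (2).
import numpy as np

def ppmi(cooc):
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total
    p_c = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_w * p_c))
    pmi = np.nan_to_num(pmi, nan=0.0, neginf=0.0)
    return np.maximum(pmi, 0.0)                     # keep positive PMI only

def svd_word_vectors(cooc, dim=200):
    u, s, vt = np.linalg.svd(ppmi(cooc), full_matrices=False)
    return u[:, :dim]                               # w_i = (U)_i
```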

3.2 Dictionary-Based Approach

In Turkish, there do not exist well-established sentiment lexicons as in English. In this approach, we made use of the TDK (Türk Dil Kurumu - “Turkish Language Institution”) dictionary to obtain word polarities. Although it is not a sentiment lexicon, combining it with domain-specific polarity scores obtained from the corpus led us to have state-of-the-art results. We first construct a matrix whose row entries are corpus words and column entries are the words in their dictionary definitions. We followed the Boolean approach. For instance, for the word cat, the column words occurring in its dictionary definition are given a score of 1. Those column words not appearing in the definition of cat are assigned a score of 0 for that corresponding row entry. When we performed clustering on this matrix, we observed that those words having similar meanings are, in general, assigned to the same clusters. However, this similarity fails in capturing the sentimental characteristics. For instance, the words happy and unhappy are assigned to the same cluster, since they have the same words, such as feeling, in their dictionary definitions. However, they are of opposite polarities and should be discerned from each other. Therefore, we utilise a metric to move such words away from each other in the VSM, even though they have common words in their dictionary definitions. We multiply each value in a row with the corresponding row word's raw supervised score, thereby obtaining more meaningful clusters. Using the training data only, the supervised polarity score per word is calculated as in (3):

w_t = log( (N_t / N + 0.01) / (N′_t / N′ + 0.01) )    (3)

Here, w_t denotes the sentiment score of word t, N_t is the number of documents (reviews or tweets) in which t occurs in the dataset of positive polarity, and N is the number of all the words in the corpus of positive polarity. N′ denotes the corpus of negative


polarity. N′_t and N′ denote the similar values for the negative polarity corpus. We perform normalisation to prevent the imbalance problem and add a small number to both the numerator and the denominator for smoothing. As an alternative to multiplying with the raw supervised polarity scores, we also separately multiplied all the row scores with only +1 if the row word is a positive word, and with −1 if it is a negative word. We have observed that this boosts the performance more compared to using the raw scores.
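A small sketch of Eq. (3), using toy tokenised positive and negative training documents; the counting conventions follow the description above.

```python
# Supervised word polarity score per Eq. (3).
import math

def polarity_score(word, pos_docs, neg_docs):
    n_t = sum(1 for d in pos_docs if word in d)      # positive docs containing t
    n_t_neg = sum(1 for d in neg_docs if word in d)  # negative docs containing t
    n = sum(len(d) for d in pos_docs)                # words in the positive corpus
    n_neg = sum(len(d) for d in neg_docs)            # words in the negative corpus
    return math.log((n_t / n + 0.01) / (n_t_neg / n_neg + 0.01))

pos = [["güzel", "film"], ["harika", "güzel"]]
neg = [["berbat", "film"], ["kötü"]]
print(polarity_score("güzel", pos, neg) > 0)         # positive words get positive scores
```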

Fig. 1. The effect of using the supervised scores of words in the dictionary algorithm. It shows how sentimentally similar word vectors get closer to each other in the VSM.

The effect of this multiplication is exemplified in Fig. 1, showing the positions of word vectors in the VSM. Those “x” words are sentimentally negative words, those “o” words are sentimentally positive ones. On the top coordinate plane, the words of opposite polarities are found to be close to each other, since they have common words in their dictionary definitions. Only the information concerned with the dictionary definitions

312

C. R. Aydın et al.

are used there, discarding the polarity scores. However, when we utilise the supervised score (+1 or −1), words of opposite polarities (e.g. “happy” and “unhappy”) get far away from each other as they are translated across coordinate regions. Positive words now appear in quadrant 1, whereas negative words appear in quadrant 3. Thus, in the VSM, words that are sentimentally similar to each other could be clustered more accurately. Besides clustering, we also employed the SVD method to perform dimensionality reduction on the unsupervised dictionary algorithm and used the newly generated matrix by combining it with other subapproaches. The number of dimensions is chosen as 200 again according to the U matrix. The details are given in Sect. 3.4. When using and evaluating this subapproach on the English corpora, we used the SentiWordNet lexicon [2]. We have achieved better results for the dictionary-based algorithm when we employed the SVD reduction method compared to the use of clustering.

3.3 Supervised Contextual 4-Scores

Our last component is a simple metric that uses four supervised scores for each word in the corpus. We extract these scores as follows. For a target word in the corpus, we scan through all of its contexts. In addition to the target word's polarity score (the self-score), out of all the polarity scores of words occurring in the same contexts as the target word, the minimum, maximum, and average scores are taken into consideration. The word polarity scores are computed using (3). Here, we obtain those scores from the training data. The intuition behind this method is that those four scores are more indicative of a word's polarity than only one (the self-score). This approach is fully supervised, unlike the previous two approaches.

3.4 Combination of the Word Embeddings

In addition to using the three approaches independently, we also combined all the matrices generated in the previous approaches. That is, we concatenate the reduced forms (SVD - U) of the corpus-based and dictionary-based vectors and the whole of the 4-score vectors of each word, horizontally. Accordingly, each corpus word is represented by a 404-dimensional vector, since the corpus-based and dictionary-based vector components are each composed of 200 dimensions, whereas the 4-score vector component is formed by four values. The main intuition behind the ensemble method is that some approaches compensate for what the others may lack. For example, the corpus-based approach captures the domain-specific, semantic, and syntactic characteristics. On the other hand, the 4-scores method captures supervised features, and the dictionary-based approach is helpful in capturing the general semantic characteristics. That is, combining those three approaches makes word vectors more representative.

3.5 Generating Document Vectors

After creating several embeddings as mentioned above, we create document (review or tweet) vectors. For each document, we sum all the vectors of words occurring in


that document and take their average. In addition to it, we extract three hand-crafted polarity scores, which are the minimum, mean, and maximum polarity scores, from each review. These polarity scores of words are computed as in (3). For example, if a review consists of five words, it would have five polarity scores and we utilise only three of these sentiment scores as mentioned. Lastly, we concatenate these three scores to the averaged word vector per review. That is, each review is represented by the average word vector of its constituent word embeddings and three supervised scores. We then feed these inputs into the SVM approach. The flowchart of our framework is given in Fig. 2. When combining the unsupervised features, which are word vectors created on a word-basis, with the three supervised scores extracted on a review-basis, we obtain better, state-of-the-art results.

Fig. 2. The flowchart of our system.
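A minimal sketch of this document-vector construction, assuming `word_vecs` (word embeddings) and `polarity` (supervised word scores as in Eq. (3)) have been built as described; scikit-learn's SVC with a linear kernel is used here as a stand-in for the LibSVM/WEKA setup mentioned later.

```python
# Document vector = average word embedding + [min, mean, max] polarity scores.
import numpy as np
from sklearn.svm import SVC

def document_vector(tokens, word_vecs, polarity, dim=200):
    vecs = [word_vecs[w] for w in tokens if w in word_vecs]
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    scores = [polarity.get(w, 0.0) for w in tokens]
    extra = [min(scores), float(np.mean(scores)), max(scores)] if scores else [0.0] * 3
    return np.concatenate([avg, extra])

# X = np.vstack([document_vector(doc, word_vecs, polarity) for doc in docs])
# clf = SVC(kernel="linear").fit(X, y)    # linear-kernel SVM classifier
```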

4 Datasets

We utilised two datasets for both Turkish and English to evaluate our methods. For Turkish, as the first dataset, we utilised the movie reviews which are collected from a popular website2. The number of reviews in this movie corpus is 20,244 and the average number of words in reviews is 39. Each of these reviews has a star-rating score which is indicative of sentiment. These polarity scores are between the values 0.5 and 5, at intervals of 0.5. We consider a review to be negative if the score is equal to or lower than 2.5. On the other hand, if it is equal to or higher than 4, it is assumed to

2 https://www.beyazperde.com.


be positive. We have randomly selected 7,020 negative and 7,020 positive reviews and processed only them. The second Turkish dataset is the Twitter corpus which is formed of tweets about Turkish mobile network operators. Those tweets are mostly much noisier and shorter compared to the reviews in the movie corpus. In total, there are 1,716 tweets. 973 of them are negative and 743 of them are positive. These tweets are manually annotated by two humans, where the labels are either positive or negative. We measured the Cohen’s Kappa inter-annotator agreement score to be 0.82. If there was a disagreement on the polarity of a tweet, we removed it. We also utilised two other datasets in English to test the portability of our approaches to other languages. One of them is a movie corpus collected from the web3 . There are 5,331 positive reviews and 5,331 negative reviews in this corpus. The other is a Twitter dataset, which has nearly 1.6 million tweets annotated through a distant supervised method [8]. These tweets have positive, neutral, and negative labels. We have selected 7,020 positive tweets and 7,020 negative tweets randomly to generate a balanced dataset.

5 Experiments

5.1 Preprocessing

In Turkish, people sometimes prefer to spell English characters for the corresponding Turkish characters (e.g. i for ı, c for ç) when writing in electronic format. To normalise such words, we used the Zemberek tool [1]. All punctuation marks except “!” and “?” are removed, since they do not contribute much to the polarity of a document. We took into account emoticons, such as “:))”, and idioms, such as “kafayı yemek” (lose one's mind), since two or more words can express a sentiment together, irrespective of the individual words thereof. Since Turkish is an agglutinative language, we used the morphological parser and disambiguation tools [16, 17]. We also performed negation handling and stop-word elimination. In negation handling, we append an underscore to the end of a word if it is negated. For example, “güzel değil” (not beautiful) is redefined as “güzel_” (beautiful_) in the feature selection stage when supervised scores are being computed.

5.2 Hyperparameters

We used the LibSVM utility of the WEKA tool. We chose the linear kernel option to classify the reviews. We trained word2vec embeddings on all four corpora using the Gensim library [15] with the skip-gram method. The dimension size of these embeddings is set at 200. As mentioned, the other embeddings, which are generated utilising the clustering and the SVD approach, are also of size 200. For c-means clustering, we set the maximum number of iterations at 25, unless it converges earlier.

3 https://github.com/dennybritz/cnn-text-classification-tf.
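A minimal Gensim sketch of this word2vec setup; the toy corpus stands in for the real datasets, and note that the `vector_size` argument was called `size` before Gensim 4.0.

```python
# Skip-gram word2vec with 200-dimensional vectors, as used for the baseline embeddings.
from gensim.models import Word2Vec

corpus = [["güzel", "bir", "film"], ["berbat", "senaryo"]]   # tokenised documents
model = Word2Vec(sentences=corpus, vector_size=200, sg=1, window=5, min_count=1)
vector = model.wv["film"]                                    # 200-dimensional embedding
```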


Table 1. Accuracies for different feature sets fed as input into the SVM classifier in predicting the labels of reviews. The word2vec algorithm is the baseline method.

Word embedding type | Turkish Movie (%) | Turkish Twitter (%) | English Movie (%) | English Twitter (%)
Corpus-based + SVD (U) | 76.19 | 64.38 | 66.54 | 87.17
Dictionary-based + SVD (U) | 60.64 | 51.36 | 55.29 | 60.00
Supervised 4-scores | 89.38 | 76.00 | 75.65 | 72.62
Concatenation of the above three | 88.12 | 73.23 | 73.40 | 73.12
Corpus-based + Clustering | 52.27 | 52.73 | 51.02 | 54.40
word2vec | 76.47 | 46.57 | 57.73 | 62.60
Corpus-based + SVD (U) + 3-feats | 88.45 | 72.60 | 76.85 | 85.88
Dictionary-based + SVD (U) + 3-feats | 88.64 | 71.91 | 76.66 | 80.40
Supervised 4-scores + 3-feats | 90.38 | 78.00 | 77.05 | 72.83
Concatenation of the above three + 3-feats | 89.77 | 72.60 | 77.03 | 80.20
Corpus-based + Clustering + 3-feats | 87.89 | 71.91 | 75.02 | 74.40
word2vec + 3-feats | 88.88 | 71.23 | 77.03 | 75.64

5.3 Results

We evaluated our models on four corpora, which are the movie and the Twitter datasets in Turkish and English. All of the embeddings are learnt on the four corpora separately. We have used the accuracy metric since all the datasets are completely or nearly completely balanced. We performed 10-fold cross-validation for both of the datasets. We used the approximate randomisation technique to test whether our results are statistically significant. Here, we tried to predict the labels of reviews and assess the performance. We obtained varying accuracies, as shown in Table 1. The “3-feats” features are those hand-crafted features we extracted, which are the minimum, mean, and maximum polarity scores of the reviews, as explained in Sect. 3.5. As can be seen, at least one of our methods outperforms the baseline word2vec approach for all the Turkish and English corpora, and all categories. All of our approaches performed better when we used the supervised scores, which are extracted on a review-basis, and concatenated them to the word vectors. Mostly, the supervised 4-scores feature leads to the highest accuracies, since it employs the annotation information concerned with polarities on a word-basis. As can be seen in Table 1, the clustering method, in general, yields the lowest scores. We found out that the corpus-based SVD metric always performs better than the clustering method. We attribute this to the fact that in SVD the most important singular values are taken into account. The corpus-based SVD technique outperforms the word2vec algorithm for some corpora. When we do not take into account the 3-feats technique, the corpus-based SVD method yields the highest accuracies for the English Twitter dataset. We show that simple models can outperform more complex models, such as the concatenation of the three subapproaches or the word2vec algorithm. Another interesting finding is that for some cases the accuracy decreases when we utilise the polarity labels, as in the case of the English Twitter dataset.
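The approximate randomisation test is not detailed in the paper; the following is a generic sketch of the paired variant, where each system's per-review correctness indicators are randomly swapped to estimate how likely the observed accuracy difference is under the null hypothesis.

```python
# Paired approximate randomisation test sketch.
import numpy as np

def approx_randomisation(correct_a, correct_b, rounds=10000, seed=0):
    rng = np.random.default_rng(seed)
    a, b = np.asarray(correct_a, float), np.asarray(correct_b, float)
    observed = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(rounds):
        swap = rng.random(len(a)) < 0.5             # swap each pair with prob. 0.5
        a_s = np.where(swap, b, a)
        b_s = np.where(swap, a, b)
        if abs(a_s.mean() - b_s.mean()) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)                # estimated p-value

# p = approx_randomisation(our_correct, baseline_correct); significant if p < 0.05
```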


Since the TDK dictionary covers most of the domain-specific vocabulary used in the movie reviews, the dictionary method performs well. However, the dictionary lacks many of the words occurring in the tweets; therefore, its performance is not the best of all. When the TDK method is combined with the 3-feats technique, we observed a great improvement, as can be expected. Success rates obtained for the movie corpus are much better than those for the Twitter dataset for most of our approaches, since tweets are, in general, much shorter and noisier. We also found out that, when choosing the p value as 0.05, our results are statistically significant compared to the baseline approach in Turkish [6]. Some of our subapproaches also produce better success rates than those sentiment analysis models employed in English [7, 18]. We have achieved state-of-the-art results for the sentiment classification task for both Turkish and English. As mentioned, our approaches, in general, perform best in predicting the labels of reviews when the three supervised scores are additionally utilised. We also employed the convolutional neural network model (CNN). However, the SVM classifier, which is a conventional machine learning algorithm, performed better. We did not include the performances of CNN for the embedding types here due to the page limit of the paper. As a qualitative assessment of the word representations, given some query words, we visualised the most similar words to those words using the cosine similarity metric. By assessing the similarities between a word and all the other corpus words, we can find the most akin words according to the different approaches. Table 2 shows the most similar words to the given query words. Those words which are indicative of sentiment are, in general, found to be most similar to words of the same polarity. For example, the most similar word to muhteşem (gorgeous) is 10/10, both of which have positive polarity. As can be seen in Table 2, our corpus-based approach is more adept at capturing domain-specific features as compared to word2vec, which generally captures general semantic and syntactic characteristics, but not the sentimental ones.

Table 2. Most similar words to given queries according to our corpus-based approach and the baseline word2vec algorithm.

Query Word | Corpus-based | word2vec
Muhteşem (Gorgeous) | 10/10 | Harika (Wonderful)
Berbat (Terrible) | Vasat (Mediocre) | Kötü (Bad)
İlginç (Interesting) | Fark (Difference) | Tespit (Finding)
Kötü (Bad) | Sıkıcı (Boring) | İyi (Good)
İyi (Good) | Güzel (Beautiful) | Kötü (Bad)
Senaryo (Script) | Kurgu (Plot) | Kurgu (Plot)


6 Conclusion

We have demonstrated that word vectors that capture only semantic and syntactic characteristics may be improved by taking into account their sentimental aspects as well. Our approaches are portable to other languages and cross-domain. They can be applied to other domains and to languages other than Turkish and English with minor changes. Our study is one of the few that perform sentiment analysis in Turkish; it leverages the sentimental characteristics of words in generating word vectors and outperforms all the others. Any of the approaches we propose can be used independently of the others. Our approaches without using sentiment labels can be applied to other classification tasks, such as topic classification and concept mining. The experiments show that even unsupervised approaches, as in the corpus-based approach, can outperform supervised approaches in classification tasks. Combining some approaches, which can compensate for what others lack, can help us build better vectors. Our word vectors are created by conventional machine learning algorithms; however, they, as in the corpus-based model, produce state-of-the-art results. Although we preferred to use a classical machine learning algorithm, which is SVM, over a neural network classifier to predict the labels of reviews, we achieved accuracies of over 90% for the Turkish movie corpus and about 88% for the English Twitter dataset. We performed only binary sentiment classification in this study, as most of the studies in the literature do. We will extend our system in future by using neutral reviews as well. We also plan to employ the Turkish WordNet to enhance the generalisability of our embeddings as another future work.

Acknowledgments. This work was supported by Boğaziçi University Research Fund Grant Number 6980D, and by the Turkish Ministry of Development under the TAM Project number DPT2007K12-0610. Cem Rifki Aydin has been supported by TÜBİTAK BIDEB 2211E.

References

1. Akın, A.A., Akın, M.D.: Zemberek, an open source NLP framework for Turkic languages. Structure 10, 1–5 (2007)
2. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: Calzolari, N., et al. (eds.) LREC. European Language Resources Association (2010)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Boyd-Graber, J., Resnik, P.: Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 45–55. Association for Computational Linguistics (2010)
5. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411


6. Ertuğrul, A.M., Önal, I., Acartürk, C.: Does the strength of sentiment matter? A regression based approach on Turkish social media. In: Natural Language Processing and Information Systems - 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, Liège, Belgium, 21–23 June 2017, Proceedings, pp. 149–155 (2017). https://doi.org/10.1007/978-3-319-59569-6_16
7. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., Lehmann, S.: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In: EMNLP, pp. 1615–1625. Association for Computational Linguistics (2017)
8. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. In: Processing, pp. 1–6 (2009)
9. Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif. Intell. Res. 57, 345–420 (2016)
10. Hamilton, W.L., Clark, K., Leskovec, J., Jurafsky, D.: Inducing domain-specific sentiment lexicons from unlabeled corpora. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 595–605. Association for Computational Linguistics (2016)
11. Li, F., Huang, M., Zhu, X.: Sentiment analysis with global topics and local dependency. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pp. 1371–1376 (2010)
12. Lin, C., He, Y.: Joint sentiment/topic model for sentiment analysis. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 375–384. ACM (2009). https://doi.org/10.1145/1645953.1646003
13. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142–150. Association for Computational Linguistics (2011)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
15. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta, May 2010
16. Sak, H., Güngör, T., Saraçlar, M.: Morphological disambiguation of Turkish text with perceptron algorithm. In: Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2007), pp. 107–118. CICLing Press (2007). https://doi.org/10.1007/978-3-540-70939-8_10
17. Sak, H., Güngör, T., Saraçlar, M.: Turkish language resources: morphological parser, morphological disambiguator and web corpus. In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS (LNAI), vol. 5221, pp. 417–427. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85287-2_40
18. Tang, D., Wei, F., Qin, B., Liu, T., Zhou, M.: Coooolll: a deep learning system for Twitter sentiment classification. In: Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, 23–24 August 2014, pp. 208–212 (2014)
19. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)

Speech Processing

Speech Emotion Recognition Using Spontaneous Children's Corpus

Panikos Heracleous1, Yasser Mohammad2,4, Keiji Yasuda1,3, and Akio Yoneyama1

1 KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama 356-8502, Japan
  {pa-heracleous,yoneyama}@kddi-research.jp
2 National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
3 Nara Institute of Science and Technology, Ikoma, Japan
4 Assiut University, Asyut, Egypt

Abstract. Automatic recognition of human emotions is a relatively new field and is attracting significant attention in research and development because of the major contribution it could make to real applications. Previously, several studies reported speech emotion recognition using acted emotional corpora. For real-world applications, however, spontaneous corpora should be used in recognizing human emotions from speech. This study focuses on speech emotion recognition using the FAU Aibo spontaneous children's corpus. A method based on the integration of feed-forward deep neural networks (DNN) and the i-vector paradigm is proposed, and another method based on deep convolutional neural networks (DCNN) for feature extraction and extremely randomized trees as the classifier is presented. For the classification of five emotions using balanced data, the proposed methods showed unweighted average recalls (UAR) of 61.1% and 59.2%, respectively. These results are very promising, showing the effectiveness of the proposed methods in speech emotion recognition. The two proposed methods based on deep learning (DL) were compared to a support vector machine (SVM) based method and demonstrated superior performance. Keywords: Speech emotion recognition · Spontaneous corpus · Deep neural networks · Feature extraction · Extremely randomized trees

1 Introduction

Emotion recognition plays an important role in human-machine communication [4]. It can be used in human-robot communication, where robots adapt their behavior to the detected human emotions, and in call centers to detect a caller's emotional state in cases of emergency (e.g., hospitals, police stations) or to identify the level of customer satisfaction (i.e., providing feedback). In the current study, emotion recognition based on speech is experimentally investigated.

Previous studies reported automatic speech emotion recognition using Gaussian mixture models (GMMs) [28,29], hidden Markov models (HMM) [24], support vector machines (SVM) [21], neural networks (NN) [20], and DNN [9,26]. In [17], a study based on concatenated i-vectors is reported. Audiovisual emotion recognition is presented in [18]. Although i-vectors have been used in speech emotion recognition [17], only very few studies reported speech emotion recognition using i-vectors integrated with DNN [30]. Furthermore, to our knowledge, the integration of i-vectors and DL for speech emotion recognition when limited data are available has not been investigated exhaustively so far; the research area therefore remains open.

In the current study, the FAU Aibo [25] state-of-the-art spontaneous children's emotional corpus is used for the classification of five emotions based on DNN and i-vectors. Another method is proposed that uses DCNN [1,14] to extract informative features, which are then used by extremely randomized trees [8] for emotion recognition. The extremely randomized trees classifier is similar to the random forest classifier [11], but with randomized tree splitting. The motivation for using extremely randomized trees lies in previous observations showing their effectiveness in the case of a small number of features, and in their computational efficiency. The proposed methods based on DL are compared with a baseline classification approach in which i-vectors and SVM are used. To further increase the temporal information in the feature vectors, shifted delta cepstral (SDC) coefficients [3,27] are also used along with the well-known mel-frequency cepstral coefficients (MFCC) [23].

2 Methods

2.1 Data

The FAU Aibo corpus consists of 9 h of German speech from 51 children between the ages of 10 and 13 interacting with Sony's pet robot Aibo. Spontaneous, emotionally colored children's speech was recorded using a close-talking microphone. The data were annotated with 11 emotion categories by five human labelers at the word level. In the current study, the FAU Aibo data are used for classification of the emotional states angry, emphatic, joyful, neutral, and rest. To use balanced training and test data, 590 training utterances and 299 test utterances randomly selected from each emotion were used.

2.2 Feature Selection

MFCC features are used in the experiments. MFCCs are very commonly used features in speech recognition, speaker recognition, emotion recognition, and language identification. Specifically, in the current study, 12 MFCCs plus energy are extracted every 10 ms using a window length of 20 ms.
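A minimal sketch of this frame-level feature extraction, assuming the librosa library and a hypothetical 16 kHz mono file utterance.wav; treating the 0th cepstral coefficient as the energy term is our simplification, not a statement of the authors' exact front end.

```python
import librosa

# 20 ms analysis window, 10 ms frame shift, 13 coefficients
# (12 MFCCs plus one energy-like 0th coefficient).
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.020 * sr),       # 20 ms window
    hop_length=int(0.010 * sr),  # 10 ms hop
)
print(mfcc.shape)  # (13, number_of_frames)
```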


Fig. 1. Computation of SDC coefficients using MFCC and delta MFCC features.

SDC coefficients have been successfully used in language recognition. In the current study, the use of SDC features in speech emotion recognition is also experimentally investigated. The motivation for using SDC is to increase the temporal information in the feature vectors, which consist of frame-level features with limited temporal information. The SDC features are obtained by concatenating delta cepstral features across multiple frames. They are described by four parameters, N, d, P, and k, where N is the number of cepstral coefficients computed at each frame, d represents the time advance and delay for the delta computation, k is the number of blocks whose delta coefficients are concatenated to form the final feature vector, and P is the time shift between consecutive blocks. Accordingly, kN parameters are used for each SDC feature vector, as compared with 2N for conventional cepstral and delta-cepstral feature vectors. The SDC is calculated as follows:

Δc(t + iP) = c(t + iP + d) − c(t + iP − d)    (1)

The final vector at time t is given by the concatenation of Δc(t + iP) for all 0 ≤ i ≤ k − 1, where c(t) is the original feature value at time t. In the current study, the feature vectors with static MFCC features and SDC coefficients are of length 112. The concatenated MFCC and SDC features are used as input for the DCNN with extremely randomized trees and for the conventional CNN classifier. When DNN and SVM are used, the MFCC and SDC features are used to construct the i-vectors used in classification. Figure 1 illustrates the extraction of SDC features.
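A compact sketch of Eq. (1), assuming frame-level cepstra stored as a (T, N) array; the default values of d, P, and k below are illustrative and not taken from the paper.

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t, stack the k delta blocks
    delta_c(t + i*P) = c(t + i*P + d) - c(t + i*P - d), i = 0..k-1 (Eq. 1)."""
    T, N = cep.shape
    # Pad with edge frames so every index t + i*P +/- d stays in range.
    padded = np.pad(cep, ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    out = np.empty((T, k * N))
    for i in range(k):
        plus = padded[2 * d + i * P : 2 * d + i * P + T]   # c(t + i*P + d)
        minus = padded[i * P : i * P + T]                  # c(t + i*P - d)
        out[:, i * N:(i + 1) * N] = plus - minus
    return out

frames = np.random.default_rng(0).standard_normal((100, 13))   # e.g. 13 MFCCs
features = np.hstack([frames, sdc(frames)])                    # static + SDC
print(features.shape)
```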

2.3 The i-Vector Paradigm

A widely used classification approach in speaker recognition is based on GMMs with universal background models (UBM). In this approach, each speaker model is created by adapting the UBM using maximum a posteriori (MAP) adaptation. A GMM supervector is constructed by concatenating the means of the adapted models. As in speaker recognition, GMM supervectors can also be used for emotion classification.


The main disadvantage of GMM supervectors, however, is their high dimensionality, which imposes high computational and memory costs. In the i-vector paradigm, the limitations of high-dimensional supervectors (i.e., concatenations of the means of GMMs) are overcome by modeling the variability contained in the supervectors with a small set of factors. Considering speech emotion classification, an input utterance can be modeled as:

M = m + Tw    (2)

where M is the emotion-dependent supervector, m is the emotion-independent supervector, T is the total variability matrix, and w is the i-vector. Both the total variability matrix and the emotion-independent supervector are estimated from the complete set of training data.
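The sketch below illustrates the supervector side of this model with scikit-learn: a diagonal-covariance UBM is fitted, its means are relevance-MAP adapted towards one utterance, and the adapted means are stacked into M. The total variability matrix is a random placeholder here (in practice it is learned with EM over all training data), so the resulting w is only a toy i-vector; the relevance factor and all sizes are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_frames = rng.standard_normal((5000, 112))   # pooled MFCC+SDC training frames
utt_frames = rng.standard_normal((300, 112))    # frames of one utterance

ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(ubm_frames)

# Relevance-MAP adaptation of the UBM means towards this utterance.
post = ubm.predict_proba(utt_frames)            # responsibilities, (T, 8)
n_k = post.sum(axis=0)                          # zeroth-order statistics
f_k = post.T @ utt_frames                       # first-order statistics
r = 16.0                                        # relevance factor (assumed)
alpha = (n_k / (n_k + r))[:, None]
adapted_means = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) \
                + (1 - alpha) * ubm.means_

M = adapted_means.ravel()                       # utterance supervector
m = ubm.means_.ravel()                          # UBM (emotion-independent) supervector

# i-vector model M = m + T w, with a random stand-in for T.
T_mat = 0.01 * rng.standard_normal((M.size, 50))
w = np.linalg.lstsq(T_mat, M - m, rcond=None)[0]
print(w.shape)                                  # (50,) low-dimensional toy i-vector
```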

2.4 Classification Approaches

Deep Neural Networks. DL [10] is behind several of the most recent breakthroughs in computer vision, speech recognition, and agents that achieve human-level performance in games such as Go and poker. A DNN is a feed-forward neural network with more than one hidden layer. The units (i.e., neurons) of each hidden layer take all outputs of the lower layer and pass them through an activation function. In the current study, three hidden layers with 64 units and the ReLU activation function are used. On top, a softmax layer with five classes is added. The batch size was set to 512, and 500 epochs were used.

Convolutional Neural Networks. A CNN is a special variant of the conventional network, which introduces a special network structure consisting of alternating convolution and pooling layers. CNNs have been successfully applied to sentence classification [13], image classification [22], facial expression recognition [12], and speech emotion recognition [16]. Furthermore, in [7], bottleneck features extracted from a CNN are used for robust language identification.

In this paper, a DCNN that learns informative features from the signal, which are then used for emotion classification, is investigated. The MFCC and SDC features are calculated using overlapping windows with a length of 20 ms. This generates a multidimensional time series that represents the data for each session. The proposed method is a simplified version of the method recently proposed in [19] for activity recognition using mobile sensors. The proposed classifier consists of a DCNN followed by extremely randomized trees instead of the standard fully connected classifier, motivated by the effectiveness of extremely randomized trees with small numbers of features. The network architecture is shown in Fig. 2 and consists of a series of five blocks, each of which contains two convolutional layers (64 filters of size 5×5) followed by a max-pooling layer (2×2). Outputs from the last three blocks are combined and flattened to represent the learned features. Training of the classifier proceeds in three stages, as shown in Fig. 3: network training, feature selection, and tree training.
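A Keras sketch of the feed-forward DNN classifier described above (three 64-unit ReLU layers and a five-class softmax); the i-vector dimensionality and the commented-out training call are placeholders, not values taken from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

ivector_dim, n_emotions = 100, 5   # i-vector size is an assumption

model = keras.Sequential([
    layers.Input(shape=(ivector_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_emotions, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ivectors, train_labels, batch_size=512, epochs=500)
model.summary()
```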


Fig. 2. The architecture of the deep feature extractor along with the classifier used during feature learning.

During network training, the DCNN is trained with predefined windows of 21 MFCC/SDC feature blocks (21 × 112 features). Network training consists of two sub-stages. First, the network is concatenated with its inverse to form an auto-encoder that is trained in unsupervised mode using all data in the training set without the labels (i.e., the pre-training stage). Second, three fully connected layers are attached to the output of the network, and the whole combined architecture is trained as a classifier using the labeled training set. These fully connected layers are then removed, and the output of the neural network (i.e., the deep feature extractor) represents the learned features. Every hidden layer of the optimized classifier is a useful feature extractor, because its output is discriminative.

The second training stage (i.e., feature selection) involves selecting a few of the outputs of the deep feature extractor to be used in the final classification. Each feature (i.e., neuronal output i) is assigned a total quality Q(i) according to Eq. (3), where Ī_j(i) is the z-score normalized feature importance I_j(i) according to base feature selection method j:

Q(i) = Σ_{j=0}^{n_f} w_j Ī_j(i)    (3)

In the current study, three base selectors are utilized: randomized logistic regression [6], linear SVMs with an L1 penalty, and extremely randomized trees. Randomized logistic regression (RLR) estimates feature importance by repeatedly fitting randomly selected subsets of training samples with an L1 sparsity-inducing penalty that is scaled for a random set of coefficients; features that appear repeatedly in such selections (i.e., with high coefficients) are assumed to be more important and are given higher scores. The second base selector uses a linear SVM with an L1 penalty to fit the data and then selects the features that have nonzero (or above-threshold) coefficients in the fitted model. The third feature selector employs extremely randomized trees: during the fitting of decision trees, features that appear at lower depths are generally more important, so by fitting several such trees, feature importance can be estimated from the average depth of each feature in the trees.
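A sketch of this scoring step with scikit-learn stand-ins for the three base selectors (an L1 logistic regression approximates the randomized logistic regression, which is a simplification on our part); the deep features, labels, and equal selector weights are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
deep_feats = rng.standard_normal((1000, 256))   # DCNN extractor outputs per block
labels = rng.integers(0, 5, size=1000)          # emotion label of each block

def zscore(v):
    return (v - v.mean()) / (v.std() + 1e-8)

# Per-feature importances from the three base selectors.
imp_lr = np.abs(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
                .fit(deep_feats, labels).coef_).sum(axis=0)
imp_svm = np.abs(LinearSVC(penalty="l1", dual=False, C=0.1)
                 .fit(deep_feats, labels).coef_).sum(axis=0)
imp_trees = ExtraTreesClassifier(n_estimators=100, random_state=0)\
                .fit(deep_feats, labels).feature_importances_

# Eq. (3): total quality as a weighted sum of z-scored importances,
# then keep only the features above the median quality.
w = np.ones(3)
Q = w[0] * zscore(imp_lr) + w[1] * zscore(imp_svm) + w[2] * zscore(imp_trees)
selected = Q > np.median(Q)
print(f"{selected.sum()} of {selected.size} features retained")
```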

326

P. Heracleous et al.

Fig. 3. The proposed training process showing the three stages of training and the output of each stage.

Table 1. Equal error rates (EER, %) for individual emotions when using three different classifiers.

Classifier                 Angry  Emphatic  Joyful  Neutral  Rest  Average
DNN                        20.1   19.8      16.4    21.1     29.8  21.4
DCNN + Randomized trees    24.1   24.7      23.7    30.4     29.4  26.5
SVM                        23.7   27.3      20.4    30.4     41.5  28.7

Feature selection uses n-fold cross-validation to select an appropriate number of neurons to retain in the final (fast) feature extractor (Fig. 3). For this study, the features (outputs) whose quality Q(i) exceeds the median value of the qualities are retained. Given the selected features from the previous step, an extremely randomized trees classifier is then trained using the labeled data set (i.e., the tree training stage). Note that the approach described above generates a classification decision for each window of 21 MFCC/SDC blocks. To generate a single emotion prediction for each test sample, the outputs of the classifier need to be combined. One possibility is to use a recurrent neural network (RNN), an LSTM, or an HMM to perform this aggregation. Nevertheless, in this study, the simplest voting aggregator is used, in which the label of a test file is the mode of the labels of all its windows.
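A sketch of this last stage under the same assumptions: an extremely randomized trees classifier is trained on the selected block-level features, and per-block predictions are merged into one utterance label by majority vote.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
block_feats = rng.standard_normal((500, 128))   # selected deep features per block
block_labels = rng.integers(0, 5, size=500)     # emotion label per training block

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(block_feats, block_labels)

def predict_utterance(blocks):
    """Label every 21-frame block, then return the most frequent label."""
    preds = clf.predict(blocks)
    return Counter(preds).most_common(1)[0][0]

test_blocks = rng.standard_normal((12, 128))    # one test utterance = 12 blocks
print(predict_utterance(test_blocks))
```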

3 Results

In the current study, the equal error rate (EER) and the UAR are used as evaluation measures. The UAR is defined as the mean value of the recall for each class. In addition, detection error tradeoff (DET) graphs are also shown. Table 1 shows the EERs obtained with the three classifiers. As shown, the lowest EER is obtained by using DNN along with i-vectors; specifically, the EER was 21.4%. The second lowest EER was obtained using DCNN with extremely randomized trees, with an EER of 26.5%.


Table 2. Confusion matrix (%) of five-emotion recognition when using DNN with i-vectors.

            Angry  Emphatic  Joyful  Neutral  Rest
Angry       63.5   14.4      6.7     5.0      10.4
Emphatic    15.1   63.9      0.3     14.3     6.4
Joyful      3.7    2.3       68.9    4.4      20.7
Neutral     3.3    14.4      6.4     60.2     15.7
Rest        12.3   9.0       17.1    12.4     49.2

Table 3. Confusion matrix (%) of five-emotion recognition when using DCNN and extremely randomized trees.

            Angry  Emphatic  Joyful  Neutral  Rest
Angry       65.2   1.7       2.7     0.3      30.1
Emphatic    13.7   61.2      2.0     0        23.1
Joyful      4.2    2.2       61.3    0        32.3
Neutral     8.7    9.4       1.0     43.8     37.1
Rest        16.8   10.5      8.7     5.3      58.7

Table 4. Confusion matrix (%) of five-emotion recognition when using the conventional CNN.

            Angry  Emphatic  Joyful  Neutral  Rest
Angry       51.5   15.1      11.4    10.0     12.0
Emphatic    10.7   53.8      14.7    12.7     8.1
Joyful      13.4   12.0      51.8    12.0     10.8
Neutral     11.7   13.4      12.4    52.5     10.0
Rest        13.7   8.0       16.4    9.4      52.5

Using SVM, the EER was 28.7%. The results also show that the joyful, emphatic, and angry emotions have the lowest EERs; a possible reason may be the higher emotional information included in these three emotions. On the other hand, the highest EERs were obtained for the neutral and rest classes (i.e., less emotional states).

The UAR when using DNN with i-vectors was 61.1%. This is a very promising result and is superior to other similar studies [2,5,15] that used different classifiers and features with unbalanced data. The results also show that DNN and i-vectors can be effectively integrated for speech emotion recognition even in the case of limited training data. The second highest UAR was obtained with DCNN and extremely randomized trees, at 59.2%. When a fully connected layer was used on top of the convolutional layers (i.e., a conventional CNN classifier), the UAR was 52.4%, lower than that of the extremely randomized trees classifier with the deep feature extractor.


Table 5. Confusion matrix (%) of five-emotion recognition when using SVM.

            Angry  Emphatic  Joyful  Neutral  Rest
Angry       55.2   15.7      6.0     7.0      16.1
Emphatic    16.7   44.5      3.3     17.1     18.4
Joyful      3.3    2.7       62.2    4.7      27.1
Neutral     7.7    12.4      13.6    35.5     30.8
Rest        11.4   9.4       18.7    14.0     46.5

Fig. 4. DET curves of speech emotion recognition using DNN.

Finally, when using SVM and i-vectors, a 48.8% UAR was achieved. The results show that the two proposed methods based on DL achieve higher UARs than the baseline approach.

Tables 2, 3, 4, and 5 show the confusion matrices. As shown, in the case of DNN, the classification rates are comparable across classes (with the exception of rest). The joyful, emphatic, and angry classes are recognized with the highest rates, and rest is recognized with the lowest rate. When using DCNN with extremely randomized trees, the classes angry and joyful show the highest rates. When using the conventional CNN, similar rates were obtained for all emotions. In the case of SVM, joyful and angry are recognized with the highest accuracy. It can therefore be concluded that the emotions angry and joyful are recognized with the highest rates in most cases.

Fig. 5. DET curves of speech emotion recognition using DCNN and extremely randomized trees.

Figures 4, 5, and 6 show the DET curves for the five individual emotions. As shown, in all cases, the best performance was achieved for the emotion joyful. Figure 7 shows the overall DET curves for the three classifiers. The figure clearly demonstrates that the two proposed methods based on DL achieve the highest performance; more specifically, the highest performance is obtained when using DNN and i-vectors. Note that above 30% FPR, SVM shows superior performance compared to DCNN with extremely randomized trees. The overall EER, however, is lower for DCNN with extremely randomized trees than for SVM.


Fig. 6. DET curves of speech emotion recognition using SVM.

Fig. 7. DET curves of speech emotion recognition using three different classifiers.


4 Conclusion

The current paper focused on speech emotion recognition based on deep learning, using the state-of-the-art FAU Aibo emotion corpus of children's speech. The proposed method based on DNN and i-vectors achieved a 61.1% UAR. This result is very promising and superior to previously reported results on the same data. The results also show that i-vectors and DNN can be efficiently used in speech emotion recognition, even in the case of very limited training data. The UAR when using DCNN with extremely randomized trees was 59.2%. The two proposed methods were compared to a baseline SVM-based classification scheme and showed superior performance. Currently, speech emotion recognition using the proposed methods and the FAU Aibo data in noisy and reverberant environments is being investigated.

References

1. Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1533–1545 (2014)
2. Attabi, Y., Alam, J., Dumouchel, P., Kenny, P., Shaughnessy, D.O.: Multiple windowed spectral features for emotion recognition. In: Proceedings of ICASSP, pp. 7527–7531 (2013)
3. Bielefeld, B.: Language identification using shifted delta cepstrum. In: Fourteenth Annual Speech Research Symposium (1994)
4. Busso, C., Bulut, M., Narayanan, S.: Toward effective automatic recognition systems of emotion in speech. In: Gratch, J., Marsella, S. (eds.) Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction, pp. 110–127. Oxford University Press, New York (2013)
5. Cao, H., Verma, R., Nenkova, A.: Combining ranking and classification to improve emotion recognition in spontaneous speech. In: Proceedings of INTERSPEECH (2012)
6. Friedman, J., Hastie, T., et al.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 28(2), 337–407 (2000)
7. Ganapathy, S., Han, K., Thomas, S., Omar, M., Segbroeck, M.V., Narayanan, S.S.: Robust language identification using convolutional neural network features. In: Proceedings of Interspeech (2014)
8. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
9. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Proceedings of Interspeech, pp. 2023–2027 (2014)
10. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97 (2012)
11. Ho, T.K.: Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, pp. 278–282 (1995)


12. Huynh, X.-P., Tran, T.-D., Kim, Y.-G.: Convolutional neural network models for facial expression recognition using BU-3DFE database. In: Information Science and Applications (ICISA) 2016. LNEE, vol. 376, pp. 441–450. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0557-2_44
13. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc. (2012)
15. Le, D., Provost, E.M.: Emotion recognition from spontaneous speech using hidden Markov models with deep belief networks. In: Proceedings of IEEE ASRU, pp. 216–221 (2013)
16. Lim, W., Jang, D., Lee, T.: Speech emotion recognition using convolutional and recurrent neural networks. In: Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA) (2016)
17. Liu, R.X.Y.: Using i-vector space model for emotion recognition. In: Proceedings of Interspeech, pp. 2227–2230 (2012)
18. Metallinou, A., Lee, S., Narayanan, S.: Decision level combination of multiple modalities for recognition and analysis of emotional expression. In: Proceedings of ICASSP, pp. 2462–2465 (2010)
19. Mohammad, Y., Matsumoto, K., Hoashi, K.: Deep feature learning and selection for activity recognition. In: Proceedings of the 33rd ACM/SIGAPP Symposium on Applied Computing, pp. 926–935. ACM SAC (2018)
20. Nicholson, J., Takahashi, K., Nakatsu, R.: Emotion recognition in speech using neural networks. Neural Comput. Appl. 9(4), 290–296 (2000)
21. Pan, Y., Shen, P., Shen, L.: Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012)
22. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449 (2017)
23. Sahidullah, M., Saha, G.: Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Commun. 54(4), 543–565 (2012)
24. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model-based speech emotion recognition. In: Proceedings of the IEEE ICASSP, vol. 1, pp. 401–404 (2003)
25. Steidl, S.: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech. Logos Verlag, Berlin (2009)
26. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., Schuller, B.: Deep neural networks for acoustic emotion recognition: raising the benchmarks. In: Proceedings of ICASSP, pp. 5688–5691 (2011)
27. Torres-Carrasquillo, P., Singer, E., Kohler, M.A., Greene, R.J., Reynolds, D.A., Deller, J.R.: Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In: Proceedings of ICSLP 2002 - INTERSPEECH 2002, pp. 16–20 (2002)
28. Tang, H., Chu, S., Johnson, M.H.: Emotion recognition from speech via boosted Gaussian mixture models. In: Proceedings of ICME, pp. 294–297 (2009)


29. Xu, S., Liu, Y., Liu, X.: Speaker recognition and speech emotion recognition based on GMM. In: 3rd International Conference on Electric and Electronics (EEIC 2013), pp. 434–436 (2013)
30. Zhang, T., Wu, J.: Speech emotion recognition with i-vector feature and RNN model. In: 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), pp. 524–528 (2015)

Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances

Eda Okur1, Shachi H. Kumar2, Saurav Sahay2, Asli Arslan Esme1, and Lama Nachman2

1 Intel Labs, Hillsboro, USA
  {eda.okur,asli.arslan.esme}@intel.com
2 Intel Labs, Santa Clara, USA
  {shachi.h.kumar,saurav.sahay,lama.nachman}@intel.com

Abstract. Understanding passenger intents and extracting relevant slots are crucial building blocks towards developing contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our current explorations, we focused on AMIE scenarios describing usages around setting or changing the destination and route, updating driving behavior or speed, finishing the trip, and other use-cases to support various natural commands. We collected a multi-modal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game activity. After exploring various recent Recurrent Neural Networks (RNN) based techniques, we introduced our hierarchical joint models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1-scores of 0.91 for utterance-level intent detection and 0.96 for slot filling tasks. In addition, we conducted initial speech-to-text explorations by comparing intent/slot models trained and tested on human transcriptions versus noisy Automatic Speech Recognition (ASR) outputs. Finally, we evaluated the results with single-passenger rides versus rides with multiple passengers. Keywords: Intent recognition · Slot filling · Hierarchical joint learning · Spoken language understanding (SLU) · In-cabin dialogue agent

1 Introduction

One of the exciting yet challenging areas of research in Intelligent Transportation Systems is developing context-awareness technologies that can enable autonomous vehicles to interact with their passengers, understand passenger context and situations, and take appropriate actions accordingly. To this end, building multi-modal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort and gain user confidence in AV interaction systems. Among the many components of such systems, intent recognition and slot filling modules are core building blocks towards carrying out successful dialogue with passengers. As an initial attempt to tackle some of those challenges, this study introduces in-cabin intent detection and slot filling models to identify passengers' intent and extract semantic frames from natural language utterances in AV. The proposed models are developed by leveraging the User Experience (UX) grounded realistic (ecologically valid) in-cabin dataset. This dataset is generated with naturalistic passenger behaviors and multiple passenger interactions, in the presence of a Wizard-of-Oz (WoZ) agent in moving vehicles with noisy road conditions.

1.1 Background

Long Short-Term Memory (LSTM) networks [7] are widely used for temporal sequence learning or time-series modeling in Natural Language Processing (NLP). These neural networks are commonly employed for sequence-to-sequence (seq2seq) and sequence-to-one (seq2one) modeling problems, including slot filling tasks [11] and utterance-level intent classification [5,17], which are well-studied for various application domains. Bidirectional LSTMs (Bi-LSTMs) [18] are extensions of traditional LSTMs, proposed to further improve model performance on sequence classification problems. Jointly modeling slot extraction and intent recognition [5,25] has also been explored in several architectures for task-specific applications in NLP. Using an Attention mechanism [16,24] on top of RNNs is yet another recent breakthrough, elevating model performance by attending to inherently crucial sub-modules of a given input. There exist various architectures for building hierarchical learning models [10,22,27] for document-to-sentence level and sentence-to-word level classification tasks, which are highly domain-dependent and task-specific.

Automatic Speech Recognition (ASR) technology has recently achieved human-level accuracy in many fields [20,23]. For spoken language understanding (SLU), it has been shown that training SLU models on true text input (i.e., human transcriptions) versus noisy speech input (i.e., ASR outputs) can achieve varying results [9]. Even greater performance degradations are expected in more challenging and realistic setups with noisy environments, such as moving vehicles in actual traffic conditions. As an example, a recent work [26] attempts to classify sentences as navigation-related or not using the DARPA-supported CU-Move in-vehicle speech corpus [6], a relatively old and large corpus focusing on route navigation. For this binary intent classification task, the authors observed that detection performance was largely affected by high ASR error rates due to background noise and multiple speakers in the CU-Move dataset (not publicly available).


For in-cabin dialogue between car assistants and drivers/passengers, recent studies explored creating a public dataset using a WoZ approach [3] and improving ASR for passenger speech recognition [4]. A preliminary report on research designed to collect data for human-agent interactions in a moving vehicle is presented in a previous study [19], with qualitative analysis of initial observations and user interviews. Our current study is focused on the quantitative analysis of natural language interactions found in this in-vehicle dataset [14], where we address intent detection and slot extraction tasks for passengers interacting with the AMIE in-cabin agent.

Contributions. In this study, we propose intent recognition and slot filling models with UX grounded naturalistic passenger-vehicle interactions. We defined in-vehicle intent types and refined their relevant slots through a data-driven process based on observed interactions. After exploring existing approaches for jointly training intents and slots, we applied certain variations of these models that perform best on our dataset to support various natural commands for interacting with the car agent. The main differences in our proposed models can be summarized as follows: (1) using the extracted intent keywords in addition to the slots to jointly model them with utterance-level intents (where most of the previous work [10,27] only joins slots and utterance-level intents, ignoring the intent keywords); (2) the 2-level hierarchy we defined by word-level detection/extraction of slots and intent keywords first, then filtering out predicted non-slot and non-intent keywords instead of feeding them into the upper levels of the network (i.e., instead of using stacked RNNs with multiple recurrent hidden layers for the full utterance [10,22], which are computationally costly for long utterances with many non-slot and non-intent-related words), and finally using only the predicted valid slots and intent-related keywords as input to the second level of the hierarchy; (3) extending joint models [5,25] to include both beginning-of-utterance and end-of-utterance tokens to leverage Bi-LSTMs (after observing that we achieved better results by doing so).

We compared our intent detection and slot filling results with the results obtained from Dialogflow (https://dialogflow.com), a commercially available intent-based dialogue system by Google, and we show that our proposed models perform better for both tasks on the same dataset. We also conducted initial speech-to-text explorations by comparing models trained and tested (10-fold CV) on human transcriptions versus noisy ASR outputs (via Cloud Speech-to-Text, https://cloud.google.com/speech-to-text/). Finally, we evaluated the results with single-passenger rides versus rides with multiple passengers.

2 Methodology

2.1 Data Collection and Annotation

Our AV in-cabin dataset includes around 30 h of multi-modal data collected from 30 passengers (15 female, 15 male) in a total of 20 rides/sessions.


Fig. 1. AMIE In-cabin data collection setup

In 10 of the sessions a single passenger was present (i.e., singletons), whereas the remaining 10 sessions included two passengers (i.e., dyads) interacting with the vehicle. The data was collected "in the wild" on the streets of Richmond, British Columbia, Canada. Each ride lasted about 1 h or more. The vehicle was modified to hide the operator and the human acting as the in-cabin agent from the passengers, using a variation of the WoZ approach [21]. Participants sat in the back of the car, separated by a semi-sound-proof and translucent screen from the human driver and the WoZ AMIE agent at the front. In each session, the participants played a scavenger hunt game, receiving instructions over the phone from the Game Master. Passengers treat the car as an AV and communicate with the WoZ AMIE agent via speech commands. Game objectives require passengers to interact naturally with the agent to go to certain destinations, update routes, stop the vehicle, give specific directions regarding where to pull over or park (sometimes with a gesture), find landmarks, change speed, get in and out of the vehicle, etc. Further details of the data collection design and scavenger hunt protocol can be found in the preliminary study [19]. See Fig. 1 for the vehicle instrumentation used to enhance the multi-modal data collection setup.

Our study is the initial work on this multi-modal dataset to develop intent detection and slot filling models. We leveraged data from the back-driver video/audio stream recorded by an RGB camera (facing the passengers) for manual transcription and annotation of the in-cabin utterances. In addition, we used the audio data recorded by Lapel 1 Audio and Lapel 2 Audio (Fig. 1) as our input resources for the ASR. For in-cabin intent understanding, we described four groups of usages to support various natural commands for interacting with the vehicle: (1) Set/Change Destination/Route (including turn-by-turn instructions), (2) Set/Change Driving Behavior/Speed, (3) Finishing the Trip Use-cases, and (4) Others (open/close door/window/trunk, turn music/radio on/off, change AC/temperature, show map, etc.).


Table 1. AMIE dataset statistics: utterance-level intent types.

AMIE scenario                        Intent type      Utterance count
Finishing the Trip Use-cases         Stop             317
                                     Park             450
                                     PullOver         295
                                     DropOff          281
Set/Change Destination/Route         SetDestination   552
                                     SetRoute         676
Set/Change Driving Behavior/Speed    GoFaster         265
                                     GoSlower         238
Others (Door, Music, etc.)           OpenDoor         142
                                     Other            202
Total                                                 3418

According to those scenarios, ten types of passenger intents are identified and annotated as follows: SetDestination, SetRoute, GoFaster, GoSlower, Stop, Park, PullOver, DropOff, OpenDoor, and Other. For the slot filling task, the relevant slots are identified and annotated as: Location, Position/Direction, Object, Time Guidance, Person, Gesture/Gaze (e.g., 'this', 'that', 'over there', etc.), and None/O. In addition to utterance-level intents and slots, word-level intent-related keywords are annotated as Intent. We obtained 1331 utterances containing commands to the AMIE agent from our in-cabin dataset. We expanded this dataset via the creation of similar tasks on Amazon Mechanical Turk [2] and reached 3418 utterances with intents in total. Intent and slot annotations were obtained on the transcribed utterances by majority voting of 3 annotators. The annotation results for utterance-level intent types, slots, and intent keywords can be found in Table 1 and Table 2 as a summary of dataset statistics.

Table 2. AMIE dataset statistics: slots and intent keywords.

Slot type            Slot count    Keyword type             Keyword count
Location             4460          Intent                   5921
Position/Direction   3187          Non-Intent               25000
Person               1360          Valid-Slot               10954
Object               632           Non-Slot                 19967
Time Guidance        792           Intent ∪ Valid-Slot      16875
Gesture/Gaze         523           Non-Intent ∩ Non-Slot    14046
None                 19967
Total                30921         Total                    30921


2.2 Detecting Utterance-Level Intent Types

As a baseline system, we implemented term-frequency and rule-based mapping mechanisms from word-level intent keyword extraction to utterance-level intent recognition. To further improve the utterance-level performance, we explored various RNN architectures and developed hierarchical (2-level) models to recognize passenger intents along with relevant entities/slots in utterances. Our hierarchical model has the following two levels:

– Level-1: Word-level extraction (to automatically detect/predict and eliminate non-slot and non-intent keywords first, as they would not carry much information for understanding the utterance-level intent type).
– Level-2: Utterance-level recognition (to detect final intent types for given utterances, using only the valid slots and intent keywords detected at Level-1 as inputs).

RNN with LSTM Cells for Sequence Modeling. In this study, we employed an RNN architecture with LSTM cells, which are designed to exploit long-range dependencies in sequential data. An LSTM has a memory cell state to store relevant information and various gates, which can mitigate the vanishing gradient problem [7]. Given the input x_t at time t and the hidden state from the previous time step h_{t−1}, the hidden and output layers for the current time step are computed. The LSTM architecture is specified by the following equations:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)    (1)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)    (2)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)    (3)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t−1} + b_g)    (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

where W and b denote the weight matrices and bias terms, respectively. The sigmoid (σ) and tanh activation functions are applied element-wise, and ⊙ denotes the element-wise vector product. The LSTM has a memory vector c_t that can be read, written, or reset via the gating mechanism and activation functions. Here, the input gate i_t scales down the input, the forget gate f_t scales down the memory vector c_t, and the output gate o_t scales down the output to achieve the final h_t, which is used to predict y_t (through a softmax activation). Similar to LSTMs, GRUs [1] have been proposed as simpler and faster alternatives, having reset and update gates only. For Bi-LSTMs [5,18], two LSTM architectures are traversed in forward and backward directions, and their hidden layers are concatenated to compute the output.
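A numpy sketch of one LSTM step following Eqs. (1)–(6); the separate input and recurrent weight matrices are folded here into one matrix per gate acting on [x_t; h_{t−1}], which is algebraically equivalent, and all sizes are toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([x_t, h_prev])      # [x_t; h_{t-1}]
    i_t = sigmoid(W["i"] @ z + b["i"])     # input gate,    Eq. (1)
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget gate,   Eq. (2)
    o_t = sigmoid(W["o"] @ z + b["o"])     # output gate,   Eq. (3)
    g_t = np.tanh(W["g"] @ z + b["g"])     # candidate,     Eq. (4)
    c_t = f_t * c_prev + i_t * g_t         # memory update, Eq. (5)
    h_t = o_t * np.tanh(c_t)               # hidden output, Eq. (6)
    return h_t, c_t

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((d_h, d_in + d_h)) for k in "ifog"}
b = {k: np.zeros(d_h) for k in "ifog"}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):  # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```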


Fig. 2. Seq2seq Bi-LSTM network for slot filling and intent keyword extraction

Extracting Slots and Intent Keywords. For slot filling and intent keyword extraction, we experimented with various configurations of seq2seq LSTMs [17] and GRUs [1], as well as Bi-LSTMs [18]. A sample network architecture can be seen in Fig. 2, where we jointly trained slots and intent keywords. The passenger utterance is fed into the LSTM/GRU network through an embedding layer, and this sequence of words is transformed into word vectors. We also experimented with GloVe [15], word2vec [12,13], and fastText [8] as pre-trained word embeddings. To prevent overfitting, we used a dropout layer with a rate of 0.5 for regularization. The best-performing results are obtained with Bi-LSTMs and GloVe embeddings (6B tokens, 400K vocabulary size, vector dimension 100).

Utterance-Level Recognition. For utterance-level intent detection, we mainly experimented with 5 groups of models: (1) Hybrid: RNN + Rule-based, (2) Separate: Seq2one Bi-LSTM with Attention, (3) Joint: Seq2seq Bi-LSTM for slots/intent keywords & utterance-level intents, (4) Hierarchical & Separate, (5) Hierarchical & Joint. For (1), we detect/extract intent keywords and slots (via RNN) and map them to utterance-level intent types (rule-based). For (2), we feed the whole utterance as an input sequence and the intent type as a single target into a Bi-LSTM network with an Attention mechanism. For (3), we jointly train word-level intent keywords/slots and utterance-level intents (by adding <BOU>/<EOU> terms to the beginning/end of utterances with intent types as their labels). For (4) and (5), we detect/extract intent keywords/slots first and then feed only the predicted keywords/slots as a sequence into (2) and (3), respectively.
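A Keras sketch of the seq2seq Bi-LSTM tagger used for joint slot and intent keyword extraction (Fig. 2). Vocabulary size, sequence length, tag inventory, and the 64-unit LSTM width are illustrative assumptions; in the actual setup the embedding layer would be initialised with the 100-dimensional GloVe vectors.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, emb_dim, max_len, n_tags = 5000, 100, 40, 10   # illustrative sizes

tagger = keras.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),  # GloVe init in practice
    layers.Dropout(0.5),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.5),
    # One slot / intent-keyword tag per token (seq2seq tagging).
    layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
])
tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
tagger.summary()
```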

3 Experiments and Results

3.1 Utterance-Level Intent Detection Experiments

The details of the five groups of models and the variations we experimented with for utterance-level intent recognition are summarized in this section.

Hybrid Models. Instead of purely relying on machine learning (ML) or deep learning (DL) systems, hybrid models leverage both ML/DL and rule-based systems. Here, our hybrid approach uses RNNs first for detecting/extracting intent keywords and slots, and then applies rule-based mapping mechanisms to identify utterance-level intents (using the predicted intent keywords and slots).


Fig. 3. Hybrid models network architecture

A sample network architecture can be seen in Fig. 3, where we leveraged seq2seq Bi-LSTM networks for word-level extraction before the rule-based mapping to utterance-level intent classes. The model variations are defined based on varying mapping mechanisms and networks as follows:

– Hybrid-0: RNN (Seq2seq LSTM for intent keywords extraction) + Rule-based (mapping extracted intent keywords to utterance-level intents)
– Hybrid-1: RNN (Seq2seq Bi-LSTM for intent keywords extraction) + Rule-based (mapping extracted intent keywords to utterance-level intents)
– Hybrid-2: RNN (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Rule-based (mapping extracted intent keywords & 'Position/Direction' slots to utterance-level intents)
– Hybrid-3: RNN (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Rule-based (mapping extracted intent keywords & all slots to utterance-level intents)
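A toy version of the rule-based half of these hybrid models: word-level predictions are mapped to one of the ten utterance-level intents. The trigger lists and the fallback rule are illustrative stand-ins, not the actual rules used in the study.

```python
KEYWORD_TO_INTENT = {            # illustrative triggers only
    "stop": "Stop", "park": "Park", "pull": "PullOver", "drop": "DropOff",
    "faster": "GoFaster", "slower": "GoSlower", "door": "OpenDoor",
}

def map_to_intent(intent_keywords, slots):
    """intent_keywords: words tagged as Intent; slots: (word, slot-type) pairs."""
    for word in intent_keywords:
        for trigger, intent in KEYWORD_TO_INTENT.items():
            if trigger in word.lower():
                return intent
    # Hybrid-2/3 style fallback: let extracted slots disambiguate, e.g. a
    # Location slot without another trigger suggests a destination command.
    if any(slot == "Location" for _, slot in slots):
        return "SetDestination"
    return "Other"

print(map_to_intent(["pull"], [("right", "Position/Direction")]))  # PullOver
```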

Separate Seq2one Models. This approach is based on separately training sequence-to-one RNNs for utterance-level intents only. These are called separate models as we do not leverage any information from the slot or intent keyword tags (i.e., utterance-level intents are not jointly trained with slots/intent keywords). Note that in seq2one models, we feed the utterance as an input sequence and the LSTM layer returns only the hidden state output at the last time step. This single output (or the concatenated output of the last hidden states from the forward and backward LSTMs in the Bi-LSTM case) is used to classify the intent type of the given utterance. The idea behind this is that the last hidden state of the sequence will contain a latent semantic representation of the whole input utterance, which can be utilized for utterance-level intent prediction.


(a) Separate Seq2one Network

(b) Separate Seq2one with Attention

Fig. 4. Separate models network architecture

See Fig. 4 (a) for a sample network architecture of the seq2one Bi-LSTM network. Note that in the Bi-LSTM implementation for seq2one learning (i.e., when not returning sequences), the outputs of the backward/reverse LSTM are actually ordered in reverse time steps (t_last ... t_first). Thus, as illustrated in Fig. 4 (a), we concatenate the hidden state output of the forward LSTM at the last time step and that of the backward LSTM at the first time step (i.e., the first word in a given utterance), and then feed this merged result to the dense layer. Figure 4 (b) depicts the seq2one Bi-LSTM network with an Attention mechanism applied on top of the Bi-LSTM layers. For the Attention case, the hidden state outputs of all time steps are fed into the Attention mechanism, which allows pointing at specific words in a sequence when computing a single output [16]. Another variation of the Attention mechanism we examined is AttentionWithContext, which incorporates a context/query vector, jointly learned during the training process, to assist the attention [24]. All seq2one model variations we experimented with can be summarized as follows:

– Separate-0: Seq2one LSTM for utterance-level intents
– Separate-1: Seq2one Bi-LSTM for utterance-level intents
– Separate-2: Seq2one Bi-LSTM with Attention [16] for utterance-level intents
– Separate-3: Seq2one Bi-LSTM with AttentionWithContext [24] for utterance-level intents
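A Keras sketch of the plain seq2one Bi-LSTM variant (Separate-1). With return_sequences left at its default of False, the Bidirectional wrapper concatenates the forward LSTM's state at the last time step with the backward LSTM's state at the first word, which matches the merge described for Fig. 4 (a); layer sizes are again assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, emb_dim, max_len, n_intents = 5000, 100, 40, 10   # illustrative

intent_clf = keras.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64)),          # seq2one: single merged state
    layers.Dense(n_intents, activation="softmax"),  # utterance-level intent
])
intent_clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
intent_clf.summary()
```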

Joint Seq2seq Models. Using sequence-to-sequence networks, the approach here is to jointly train the annotated utterance-level intents and slots/intent keywords by adding <BOU>/<EOU> tokens to the beginning/end of each utterance, with the utterance-level intent type as the label of these tokens. Our approach is an extension of [5], in which only an <EOS> term is added, with intent-type tags associated with this end-of-sentence token, both for the LSTM and Bi-LSTM cases.


Fig. 5. Joint models network architecture

However, we experimented with adding both <BOU> and <EOU> terms, as Bi-LSTMs are used for seq2seq learning, and we observed that slightly better results are achieved by doing so. The idea behind this is that, since this is a seq2seq learning problem, at the last time step (i.e., the prediction at <EOU>) the reverse pass in the Bi-LSTM would be incomplete (refer to Fig. 4 (a) to observe the last Bi-LSTM cell). Therefore, adding the <BOU> token and leveraging the backward LSTM output at the first time step (i.e., the prediction at <BOU>) would potentially help joint seq2seq learning. The overall network architecture of our joint models can be found in Fig. 5. We report the experimental results on two variations (with and without intent keywords) as follows:

– Joint-1: Seq2seq Bi-LSTM for utterance-level intent detection (jointly trained with slots)
– Joint-2: Seq2seq Bi-LSTM for utterance-level intent detection (jointly trained with slots & intent keywords)

Hierarchical and Separate Models. The proposed hierarchical models first detect/extract intent keywords and slots using sequence-to-sequence networks (i.e., level-1), and then feed only the words predicted as intent keywords and valid slots (i.e., not those predicted as 'None/O') as an input sequence to the various separate sequence-to-one models described above, to recognize the final utterance-level intents (i.e., level-2). A sample network architecture is given in Fig. 6 (a). The idea behind filtering out non-slot and non-intent keywords here resembles providing a summary of the input sequence to the upper levels of the network hierarchy; note that we learn this summarized sequence of keywords using another RNN layer. This potentially focuses the utterance-level classification problem on the most salient words of the input sequences (i.e., intent keywords and slots) and also effectively reduces the length of the input sequences (i.e., alleviating the long-term dependency issues observed in longer sequences).


(a) Hierarchical & Separate Model

(b) Hierarchical & Joint Model

Fig. 6. Hierarchical models network architecture

Note that, according to the dataset statistics in Table 2, 45% of the words found in transcribed utterances with passenger intents are annotated as non-slot and non-intent keywords. For example, 'please', 'okay', 'can', 'could', incomplete/interrupted words, filler sounds like 'uh'/'um', certain stop words, punctuation, and many other tokens are not related to intents/slots. Therefore, the proposed approach reduces the sequence length nearly by half at the input layer of level-2 for utterance-level recognition. For the hierarchical & separate models, we experimented with four variations based on which separate model is used at the second level of the hierarchy, summarized as follows:

– Hierarchical & Separate-0: Level-1 (Seq2seq LSTM for intent keywords & slots extraction) + Level-2 (Separate-0: Seq2one LSTM for utterance-level intent detection)
– Hierarchical & Separate-1: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-1: Seq2one Bi-LSTM for utterance-level intent detection)
– Hierarchical & Separate-2: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-2: Seq2one Bi-LSTM + Attention for utterance-level intent detection)
– Hierarchical & Separate-3: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-3: Seq2one Bi-LSTM + AttentionWithContext for utterance-level intent detection)

Hierarchical and Joint Models. The proposed hierarchical models detect/extract intent keywords and slots using sequence-to-sequence networks first, and then only the words predicted as intent keywords and valid slots (i.e., not those predicted as 'None/O') are fed as input to the joint sequence-to-sequence models described above. See Fig. 6 (b) for a sample network architecture.


After the filtering or summarization of the sequence at level-1, <BOU> and <EOU> tokens are appended to the shorter input sequence before level-2 for joint learning. Note that in this case, using the Joint-1 model (jointly training annotated slots & utterance-level intents) for the second level of the hierarchy would not make much sense (without intent keywords). Hence, the Joint-2 model is used for the second level, as described below:

– Hierarchical & Joint-2: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Joint-2 Seq2seq model with slots & intent keywords & utterance-level intents)

Table 3 summarizes the results of the various approaches we investigated for utterance-level intent understanding. We achieved a 0.91 overall F1-score with our best-performing model, namely Hierarchical & Joint-2. All model results are obtained via 10-fold cross-validation (10-fold CV) on the same dataset. For our AMIE scenarios, Table 4 shows the intent-wise detection results with the initial (Hybrid-0) and currently best performing (H-Joint-2) intent recognizers. With our best model (H-Joint-2), the detection performances of the relatively problematic SetDestination and SetRoute intents jumped from 0.78 to 0.89 and from 0.75 to 0.88, respectively, compared to the baseline model (Hybrid-0).

We compared our intent detection results with Dialogflow's Detect Intent API. The same AMIE dataset is used to train and test (10-fold CV) Dialogflow's intent detection and slot filling modules, using the recommended hybrid mode (rule-based and ML). As shown in Table 4, an overall F1-score of 0.89 is achieved with Dialogflow for the same task. As can be seen, our Hierarchical & Joint models obtained higher results than Dialogflow for 8 out of 10 intent types.

3.2 Slot Filling and Intent Keyword Extraction Experiments

Slot filling and intent keyword extraction results are given in Table 5 and Table 6, respectively. For slot extraction, we reached a 0.96 overall F1-score using the seq2seq Bi-LSTM model, which is slightly better than using the LSTM model. Although the overall performance is only slightly improved with the Bi-LSTM model, the F1-scores of the relatively problematic Object, Time Guidance, and Gesture/Gaze slots increased from 0.80 to 0.89, from 0.80 to 0.85, and from 0.87 to 0.92, respectively. Note that with Dialogflow, we reached a 0.92 overall F1-score for the entity/slot filling task on the same dataset. As shown in Table 5, our models reached notably higher F1-scores than Dialogflow for 6 out of 7 slot types (all except Time Guidance).

3.3 Speech-to-Text Experiments for AMIE: Training and Testing Models on ASR Outputs

For transcriptions, utterance-level audio clips were extracted from the passenger-facing video stream, which was the single source used for human transcriptions of all utterances from passengers, the AMIE agent, and the game master. Since


our transcription-based intent/slot models assumed perfect (at least close to human-level) ASR in the previous sections, we experimented with the more realistic scenario of using ASR outputs for intent/slot modeling. We employed the Cloud Speech-to-Text API to obtain ASR outputs on audio clips with passenger utterances, which were segmented using transcription time-stamps. We observed an overall word error rate (WER) of 13.6% in the ASR outputs for all 20 sessions of AMIE. Considering that a generic ASR is used, with no domain-specific acoustic models for this moving vehicle environment with in-cabin noise, the initial results were promising enough to move on with model training on ASR outputs. For initial explorations, we created a new dataset of command utterances from the ASR outputs of the in-cabin data (20 sessions with 1331 spoken utterances). A human-transcribed version of this set was also created. Although the dataset size is limited, the slot/intent keyword extraction and utterance-level intent recognition models are not severely affected when trained and tested (10-fold CV) on ASR outputs instead of manual transcriptions. See Table 7 for the overall F1-scores of the compared models.

Singleton versus Dyad Sessions. After the ASR pipeline described above was completed for all 20 sessions of the AMIE in-cabin dataset (ALL with 1331 utterances), we repeated all our experiments on the following two subsets: (i) 10 sessions having a single passenger (Singletons with 600 utterances), and (ii) the remaining 10 sessions having two passengers (Dyads with 731 utterances). We observed overall WERs of 13.5% and 13.7% for Singletons and Dyads, respectively.

Table 3. Utterance-level intent detection performance results (10-fold CV)

Model type                                                                Prec  Rec   F1
Hybrid-0: RNN (LSTM) + Rule-based (intent keywords)                       0.86  0.85  0.85
Hybrid-1: RNN (Bi-LSTM) + Rule-based (intent keywords)                    0.87  0.86  0.86
Hybrid-2: RNN (Bi-LSTM) + Rule-based (intent keywords & Pos slots)        0.89  0.88  0.88
Hybrid-3: RNN (Bi-LSTM) + Rule-based (intent keywords & all slots)        0.90  0.90  0.90
Separate-0: Seq2one LSTM                                                  0.87  0.86  0.86
Separate-1: Seq2one Bi-LSTM                                               0.88  0.88  0.88
Separate-2: Seq2one Bi-LSTM + Attention                                   0.88  0.88  0.88
Separate-3: Seq2one Bi-LSTM + AttentionWithContext                        0.89  0.89  0.89
Joint-1: Seq2seq Bi-LSTM (uttr-level intents & slots)                     0.88  0.87  0.87
Joint-2: Seq2seq Bi-LSTM (uttr-level intents & slots & intent keywords)   0.89  0.88  0.88
Hierarchical & Separate-0 (LSTM)                                          0.88  0.87  0.87
Hierarchical & Separate-1 (Bi-LSTM)                                       0.90  0.90  0.90
Hierarchical & Separate-2 (Bi-LSTM + Attention)                           0.90  0.90  0.90
Hierarchical & Separate-3 (Bi-LSTM + AttentionWithContext)                0.90  0.90  0.90
Hierarchical & Joint-2 (uttr-level intents & slots & intent keywords)     0.91  0.90  0.91


Table 4. Intent-wise performance results of utterance-level intent detection

                                Our intent detection models                        Dialogflow
AMIE Scenario       Intent Type  Baseline (Hybrid-0)    Best (H-Joint-2)           Intent Detection
                                 Prec  Rec   F1         Prec  Rec   F1             Prec  Rec   F1
Finishing The Trip  Stop         0.88  0.91  0.90       0.93  0.91  0.92           0.89  0.90  0.90
                    Park         0.96  0.87  0.91       0.94  0.94  0.94           0.95  0.88  0.91
                    PullOver     0.95  0.96  0.95       0.97  0.94  0.96           0.95  0.97  0.96
                    DropOff      0.90  0.95  0.92       0.95  0.95  0.95           0.96  0.91  0.93
Dest/Route          SetDest      0.70  0.88  0.78       0.89  0.90  0.89           0.84  0.91  0.87
                    SetRoute     0.80  0.71  0.75       0.86  0.89  0.88           0.83  0.86  0.84
Speed               GoFaster     0.86  0.89  0.88       0.89  0.90  0.90           0.94  0.92  0.93
                    GoSlower     0.92  0.84  0.88       0.89  0.86  0.88           0.93  0.87  0.90
Others              OpenDoor     0.95  0.95  0.95       0.95  0.95  0.95           0.94  0.93  0.93
                    Other        0.92  0.72  0.80       0.83  0.81  0.82           0.88  0.73  0.80
Overall                          0.86  0.85  0.85       0.91  0.90  0.91           0.90  0.89  0.89

The overlapping speech cases, with slightly more conversation going on (longer transcriptions) in Dyad sessions compared to the Singleton sessions, may affect the ASR performance, which may in turn affect the intent/slot model performance. As shown in Table 7, although we have more samples with Dyads, the performance drops between the models trained on transcriptions vs. ASR outputs are slightly higher for the Dyads compared to the Singletons, as expected.

Table 5. Slot filling results (10-fold CV)

                     Our slot filling models                    Dialogflow
Slot type            Seq2seq LSTM         Seq2seq Bi-LSTM       Slot Filling
                     Prec  Rec   F1       Prec  Rec   F1        Prec  Rec   F1
Location             0.94  0.92  0.93     0.96  0.94  0.95      0.94  0.81  0.87
Position/Direction   0.92  0.93  0.93     0.95  0.95  0.95      0.91  0.92  0.91
Person               0.97  0.96  0.97     0.98  0.97  0.97      0.96  0.76  0.85
Object               0.82  0.79  0.80     0.93  0.85  0.89      0.96  0.70  0.81
Time Guidance        0.88  0.73  0.80     0.90  0.80  0.85      0.93  0.82  0.87
Gesture/Gaze         0.86  0.88  0.87     0.92  0.92  0.92      0.86  0.65  0.74
None                 0.97  0.98  0.97     0.97  0.98  0.98      0.92  0.98  0.95
Overall              0.95  0.95  0.95     0.96  0.96  0.96      0.92  0.92  0.92

Table 6. Intent keyword extraction results (10-fold CV)

Keyword type  Prec  Rec   F1
Intent        0.95  0.93  0.94
Non-Intent    0.98  0.99  0.99
Overall       0.98  0.98  0.98

Table 7. F1-scores of models trained/tested on transcriptions vs. ASR outputs

                                          Train/Test on Transcriptions   Train/Test on ASR Outputs
Slot Filling & Intent Keywords            ALL   Singleton  Dyad          ALL   Singleton  Dyad
Slot Filling                              0.97  0.96       0.96          0.95  0.94       0.93
Intent Keyword Extraction                 0.98  0.98       0.97          0.97  0.96       0.96
Slot Filling & Intent Keyword Extraction  0.95  0.95       0.94          0.94  0.92       0.91

Utterance-level Intent Detection          ALL   Singleton  Dyad          ALL   Singleton  Dyad
Hierarchical & Separate                   0.87  0.85       0.86          0.85  0.84       0.83
Hierarchical & Separate + Attention       0.89  0.86       0.87          0.86  0.84       0.84
Hierarchical & Joint                      0.89  0.87       0.88          0.87  0.85       0.85

4 Discussion and Conclusion

We introduced AMIE, the intelligent in-cabin car agent responsible for handling certain AV-passenger interactions. We developed hierarchical and joint models to extract various passenger intents along with relevant slots for actions to be performed in the AV, achieving F1-scores of 0.91 for intent recognition and 0.96 for slot extraction. Even when using a generic ASR with noisy outputs, we showed that our models are still capable of achieving results comparable to models trained on human transcriptions. We believe that the ASR performance can be improved by collecting more in-domain data to obtain domain-specific acoustic models. These initial models will allow us to collect more speech data via bootstrapping with the intent-based dialogue application we have built. In addition, the hierarchy we defined can eliminate costly annotation efforts in the future, especially for the word-level slots and intent keywords. Once enough domain-specific multi-modal data are collected, our future work is to explore training end-to-end dialogue agents for our in-cabin use-cases. We also plan to exploit other modalities for an improved understanding of the in-cabin dialogue.

Acknowledgments. We want to express our gratitude to our colleagues from Intel Labs, especially Cagri Tanriover, for his tremendous efforts in coordinating and implementing the vehicle instrumentation to enhance the multi-modal data collection setup (as he illustrated in Fig. 1), and John Sherry and Richard Beckwith for their insights and expertise that guided the collection of this UX-grounded and ecologically valid dataset (via the scavenger hunt protocol and WoZ research design). The authors are also immensely


grateful to the GlobalMe, Inc. members, especially Rick Lin and Sophie Salonga, for their extensive efforts in organizing and executing the data collection, transcription, and some annotation tasks for this research in collaboration with our team at Intel Labs.


Audio Summarization with Audio Features and Probability Distribution Divergence

Carlos-Emiliano González-Gallardo1(B), Romain Deveaud1, Eric SanJuan1, and Juan-Manuel Torres-Moreno1,2

1 LIA - Avignon Université, 339 chemin des Meinajaries, 84140 Avignon, France
{carlos-emiliano.gonzalez-gallardo,eric.sanjuan,juan-manuel.torres}@univ-avignon.fr
2 Département de GIGL, Polytechnique Montréal, C.P. 6079, succ. Centre-ville, Montréal, Québec H3C 3A7, Canada

Abstract. The automatic summarization of multimedia sources is an important task that facilitates understanding by condensing the source while maintaining the relevant information. In this paper we focus on audio summarization based on audio features and probability distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account each segment's length, position, and informativeness value. The informativeness of each segment is obtained by mapping a set of audio features, derived from its Mel-frequency cepstral coefficients, to its corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme show that our approach provides understandable and informative summaries.

Keywords: Audio summarization · JS divergence · Informativeness · Human language understanding

1 Introduction

Multimedia summarization has become a major need since Internet platforms like Youtube1 provide easy access to massive online resources. In general, automatic summarization intends to produce an abridged and informative version of its source [17]. The type of automatic summarization we focus on in this article is audio summarization, whose source is an audio signal. Audio summarization can be performed with the following three approaches: directing the summary using only audio features [2,9,10,21], extracting the text inside the audio signal and directing the summarization process using textual methods [1,13,16], and a hybrid approach which consists of a mixture of the first two [15,19,20]. Each approach has advantages and disadvantages with regard to the others.

https://www.youtube.com/.

© Springer Nature Switzerland AG 2023. A. Gelbukh (Ed.): CICLing 2019, LNCS 13452, pp. 351–361, 2023. https://doi.org/10.1007/978-3-031-24340-0_26


Using only audio features for creating a summary has the advantage of being totally transcript independent; however, this may also be a problem given that the summary is based only on how things are said. By contrast, directing the summary with textual methods benefits from the information contained within the text, leading to more informative summaries; nevertheless, in some cases transcripts are not available. Finally, using both audio features and textual methods can boost summary quality; yet, the disadvantages of both approaches are present. The method we propose in this paper follows a hybrid approach during the training phase while being text independent during summary creation. It relies on using textual information to learn an informativeness representation, based on probability distribution divergences, that standard audio summarization with audio features does not consider. During the summarization process this representation is used to obtain an informativeness score without a textual representation of the audio signal to summarize. To our knowledge, probability distribution divergences have not previously been used for audio summarization. The rest of this article is organized as follows. In Sect. 2 we give an overview of audio summarization, including its advantages and disadvantages compared with other summarization techniques. In Sect. 3 we explain how probability distribution divergence may be used within an audio summarization framework, and we describe our summarization proposal in detail. In Sect. 4 we describe the dataset used during the training and summary generation phases, the evaluation metric adopted to measure the quality of the produced summaries, and the results of the experimental evaluation of the proposed method. Finally, Sect. 5 concludes the article.

2 Audio Summarization

Audio summarization without any textual representation aims to produce an abridged and informative version of an audio source using only the information contained in the audio signal. This kind of summarization is challenging because the only available information corresponds to how things are said; nevertheless, it is advantageous in terms of transcript availability. Hybrid audio summarization methods and text-based audio summarization algorithms need automatic or manual speech transcripts to select the pertinent segments and produce an informative summary [19,20]. However, speech transcripts may be expensive, unavailable, or of low quality, which harms summarization performance. Duxans et al. [2] managed to generate audio-based summaries of soccer match re-transmissions by detecting highlighted events. They based their detection algorithm on two acoustic features: the block energy and the acoustic repetition indexes. The performance was measured in terms of goal recall and summary precision, showing high rates for both categories. Maskey et al. [10] presented an audio-based summarization method using a Hidden Markov Model (HMM) framework. They used a set of different acoustic/prosodic features to represent the HMM observation vectors: speaking rate; F0 min, max, mean, range and slope; min, max and mean RMS energy; RMS


slope and sentence duration. The hidden variables represented the inclusion or exclusion of a segment within the summary. They performed experiments over 20 CNN shows and 216 stories previously used in [9]. Evaluation was carried out with the standard Precision, Recall, and F-measure information retrieval measures. The results show that the HMM framework had very good coverage (Recall = 0.95) but very poor precision (Precision = 0.26) when selecting pertinent segments. Zlatintsi et al. [21] addressed the audio summarization task by exploring the potential of a modulation model for the detection of perceptually important audio events. They performed a saliency computation over audio streams based on a set of saliency models and various linear, adaptive, and nonlinear fusion schemes. Experiments were performed over audio data extracted from six 30-minute movie clips. Results were reported in terms of frame-level precision scores, showing that nonlinear fusion schemes perform best. Audio summarization based only on acoustic features like fundamental frequency, energy, volume change, and speaker turns has the big advantage that no textual information is needed. This approach is especially useful when human transcripts are not available for the spoken documents and Automatic Speech Recognition (ASR) transcripts have a high word error rate. However, for highly informative content like broadcast news, bulletins, or reports, the most relevant information resides in what is said, while audio features are limited to how things are said.

3 Probability Distribution Divergence for Audio Summarization

All methods presented in the previous section ignore the informative content of the audio streams. To overcome this lack of information, we propose an extractive audio summarization method capable of representing the informativeness of a segment in terms of its audio features during the training phase; informativeness is mapped by a probability distribution divergence model. Then, when creating a summary, textual independence is achieved by using only audio-based features. Divergence is defined by Manning [8] as a function which estimates the difference between two probability distributions. In the framework of automatic text summarization evaluation, [6,14,18] have used divergence-based measures such as the Kullback-Leibler and Jensen-Shannon (JS) divergences to compare the probability distribution of words between automatically produced summaries and their sources. Extractive summarization based on the divergence of probability distributions has been discussed in [6], and a method (DIVTEX) has been proposed in [17]. Our proposal, based on an extractive summarization approach, aims to select the most pertinent audio segments until a time threshold is reached. A training phase is in charge of learning a model that maps a set of 277 audio features to an informativeness value. A big dataset is used to compute the informativeness by obtaining the divergence between the dataset documents and their corresponding


segments. During the summarization phase, the method takes into account each segment's length, position, and the informativeness mapped from its audio features to rank the pertinence of each audio segment.

3.1 Audio Signal Pre-processing

During the pre-processing step, the audio signal is split into background and foreground channels. This process is normally used on music recordings for separating vocals and other sporadic signals from the accompanying instrumentation. Rafii et al. [12] achieved this separation by identifying recurrent elements, looking for similarities instead of periodicities. Their approach is useful for song recordings where repetitions happen intermittently or without a fixed period; however, we found that applying the same method to newscast and report audio files made it much easier to segment them using only the background signal. We assume this phenomenon is due to the fact that newscasts and reports are heavily edited, with low-volume background music playing while the journalists speak and louder music/noises for transitions (foreground). Following [12], to suppress non-repetitive deviations from the average spectrum and discard vocal elements, audio frames are compared using the cosine similarity. Similar frames separated by at least two seconds are aggregated by taking their per-frequency median value, to avoid being biased by local continuity. Next, assuming that both signals are additive, a pointwise minimum between the obtained frames and the original signal is applied to obtain a raw background filter. Then, foreground and background time-frequency masks are derived from the raw background filter and the input signal with a soft-mask operation. Finally, the foreground and background components are obtained by multiplying the time-frequency masks with the input signal.
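As an illustration only, the following sketch reproduces this kind of similarity-based background/foreground separation with the Librosa library that the authors use for audio processing; the margin and power values are illustrative choices, not taken from the paper.

import numpy as np
import librosa

def split_background_foreground(path):
    """Separate a recording into repeating (background) and sporadic (foreground)
    components, in the spirit of the similarity-based separation described above."""
    y, sr = librosa.load(path, sr=16000)
    S, phase = librosa.magphase(librosa.stft(y))

    # Aggregate similar frames (cosine similarity, per-frequency median),
    # ignoring frames closer than ~2 seconds to avoid local continuity.
    S_filter = librosa.decompose.nn_filter(
        S, aggregate=np.median, metric="cosine",
        width=int(librosa.time_to_frames(2, sr=sr)))
    # The repeating part cannot exceed the mixture (signals assumed additive).
    S_filter = np.minimum(S, S_filter)

    # Soft time-frequency masks for background and foreground.
    margin = 2  # illustrative value
    mask_bg = librosa.util.softmask(S_filter, margin * (S - S_filter), power=2)
    mask_fg = librosa.util.softmask(S - S_filter, margin * S_filter, power=2)

    background = librosa.istft(mask_bg * S * phase)
    foreground = librosa.istft(mask_fg * S * phase)
    return background, foreground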

3.2 Informativeness Model

Informativeness is learned from the transcripts of a big audio dataset of newscasts and reports. A mapping between a set of 277 audio features and an informativeness value is learned during the training phase. The informativeness value corresponds to the Jensen-Shannon divergence (D_JS) between the segmented transcripts and their source. The D_JS is based on the Kullback-Leibler divergence [4], with the main difference that it is symmetric. The D_JS between a segment Q and its source P is defined by [7,18] as:

D_{JS}(P\|Q) = \frac{1}{2} \sum_{w \in P} \left[ P_w \log_2 \frac{2 P_w}{P_w + Q_w} + Q_w \log_2 \frac{2 Q_w}{P_w + Q_w} \right]    (1)

P_w = \frac{C_w^P + \delta}{|P| + \delta \times \beta}    (2)

Q_w = \frac{C_w^Q + \delta}{|Q| + \delta \times \beta}    (3)

where C_w^{(P|Q)} is the frequency of word w over P or Q. To avoid shifting the probability mass to unseen events, the scaling parameter δ is set to 0.0005. |P| and |Q| correspond to the number of tokens in P and Q. Finally, β = 1.5 × |V|, where |V| is the vocabulary size of P.

Each segment Q has a length of 10 s and is represented by 277 audio features, where 275 correspond to 11 statistical values of 25 Mel-frequency cepstral coefficients (MFCC) and the other two correspond to the number of frames in the segment and its starting time. The 11 statistical values can be seen in Table 1, where φ and φ′ correspond to the first and second MFCC derivatives.

Table 1. MFCC based statistical values

Feature: min, max, median, mean, variance, skewness, kurtosis
All seven statistics are computed over the MFCC; two of them are additionally computed over each of the derivatives φ and φ′, giving 11 values per coefficient.
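A minimal sketch of the smoothed Jensen-Shannon divergence of Eqs. (1)-(3), shown here only to illustrate how the informativeness targets could be computed from a source transcript P and a segment transcript Q; the helper names are hypothetical.

import math
from collections import Counter

def smoothed_dist(counts, vocab_size, total, delta=0.0005):
    """Word distribution smoothed as in Eqs. (2)-(3): (C_w + delta) / (N + delta*beta)."""
    beta = 1.5 * vocab_size
    return {w: (c + delta) / (total + delta * beta) for w, c in counts.items()}

def js_divergence(source_tokens, segment_tokens, delta=0.0005):
    """Jensen-Shannon divergence between a segment Q and its source P (Eq. 1)."""
    P_counts, Q_counts = Counter(source_tokens), Counter(segment_tokens)
    vocab = len(P_counts)                        # |V|: vocabulary size of P
    P = smoothed_dist(P_counts, vocab, len(source_tokens), delta)
    Q = smoothed_dist(Q_counts, vocab, len(segment_tokens), delta)
    djs = 0.0
    for w, pw in P.items():                      # the sum runs over the words of P
        # words of P unseen in Q still receive the smoothed mass of Eq. (3)
        qw = Q.get(w, delta / (len(segment_tokens) + delta * 1.5 * vocab))
        djs += pw * math.log2(2 * pw / (pw + qw)) + qw * math.log2(2 * qw / (pw + qw))
    return 0.5 * djs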

A linear least-squares regression model (LR(X, Y)) is trained to map the 277 audio features (X) to an informativeness score (Y). Figure 1 shows the whole training phase (informativeness model). All audio processing and feature extraction is performed with the Librosa library2 [11].
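The sketch below illustrates the feature/regression step. Since the paper does not spell out the exact split of the 11 statistics here, the code assumes min, max, median, mean, variance, skewness, and kurtosis over the MFCCs plus mean and variance over each derivative; the function names are hypothetical.

import numpy as np
import librosa
from scipy.stats import skew, kurtosis
from sklearn.linear_model import LinearRegression

def segment_features(y, sr, start_time):
    """277-dim representation of a 10-second segment: 11 statistics over 25 MFCCs
    (assumed split, see above), plus the number of frames and the starting time."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    stats = [f(mfcc, axis=1) for f in (np.min, np.max, np.median, np.mean, np.var)]
    stats += [skew(mfcc, axis=1), kurtosis(mfcc, axis=1)]
    stats += [np.mean(d1, axis=1), np.var(d1, axis=1),
              np.mean(d2, axis=1), np.var(d2, axis=1)]
    return np.concatenate([np.concatenate(stats), [mfcc.shape[1], start_time]])

def train_informativeness_model(X, Y):
    """Linear least-squares regression mapping the 277 features to the JS targets."""
    return LinearRegression().fit(X, Y)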

3.3 Audio Summary Creation

The summary creation for a document P follows the same audio signal pre-processing steps described in Sect. 3.1. During this phase, only the audio signal is needed, and the informativeness of each candidate segment Q_i ∈ P is predicted with the LR(Q_i, Y_{Q_i}) model. Figure 2 shows the full summarization pipeline used to obtain a summary of an audio document P with length threshold θ. After the background signal is isolated from the main signal, a temporally-constrained agglomerative clustering routine is used to partition the audio stream into k contiguous segments:

k = \frac{P_{length} \times 20}{60}    (4)

where P_{length} is the length in seconds of P.

2 https://librosa.github.io/librosa/index.html.


Fig. 1. Informativeness model scheme

To rank the pertinence of each segment Q_1, ..., Q_k, a score S_{Q_i} is computed. Audio summarization is performed by choosing the segments with the highest S_{Q_i} scores, in order of appearance, until θ is reached. S_{Q_i} is defined as:

S_{Q_i} = \frac{1}{1 + e^{-(\Delta t_i - 5)}} \times \frac{|Q_i|}{|P|} \times e^{-\frac{t_{Q_i}}{\Delta t_i}} \times e^{1 - LR_{Q_i}}    (5)

Here \Delta t_i = t_{Q_{i+1}} - t_{Q_i}, where t_{Q_i} is the starting time of segment Q_i and t_{Q_{i+1}} the starting time of Q_{i+1}; |Q_i| and |P| correspond to the lengths in seconds of segment Q_i and of P, respectively.
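A sketch of the ranking and selection step of Eqs. (4)-(5). The greedy filling of the time budget θ is one straightforward reading of "in order of appearance until θ is reached", not necessarily the authors' exact procedure; the dictionary keys are hypothetical.

import numpy as np

def segment_score(start, next_start, seg_len, total_len, predicted_lr):
    """Score S_Qi of Eq. (5): favors longer segments, early positions,
    and low predicted divergence (LR_Qi) from the informativeness model."""
    delta_t = next_start - start
    length_term = 1.0 / (1.0 + np.exp(-(delta_t - 5)))
    position_term = np.exp(-start / delta_t)
    return length_term * (seg_len / total_len) * position_term * np.exp(1 - predicted_lr)

def build_summary(segments, total_len, theta):
    """Keep the highest-scoring segments within the time budget theta, then
    emit them in order of appearance. `segments` is a list of dicts with keys:
    start, next_start, length, lr."""
    ranked = sorted(segments, reverse=True, key=lambda s: segment_score(
        s["start"], s["next_start"], s["length"], total_len, s["lr"]))
    chosen, used = [], 0.0
    for seg in ranked:
        if used + seg["length"] <= theta:
            chosen.append(seg)
            used += seg["length"]
    return sorted(chosen, key=lambda s: s["start"])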

4 Experimental Evaluation

We trained the informativeness model explained in Sect. 3.2 with a set of 5,989 audio broadcasts, which correspond to more than 310 h of audio in French, English, and Arabic [5]. Transcripts were obtained with the ASR system described in [3]. During audio summary creation we focused on a small dataset of 10 English audio samples. In this phase no ASR system was used, given the text independence our system achieves once the informativeness model has been obtained. The selected sample lengths vary between 102 s (1 m 42 s) and 584 s (9 m 44 s), with an average length of 318 s (5 m 18 s). Similar to Rott et al. [13], we implemented a 1–5 subjective opinion scale to evaluate the quality of the generated summaries and their parts. During evaluation, we provided a set of five evaluators with the original audio, the generated summary, their corresponding segments, and the scale shown in Table 2.


Fig. 2. Summary creation scheme

4.1 Results

Summary length was set to 35% of the original audio length during experimentation. Evaluation was performed over the complete audio summaries as well as over each summary segment: we are interested in measuring the informativeness of the generated summaries, but also the informativeness of each of their segments.

Table 2. Evaluation scale

Score  Explanation
5      Full informative
4      Mostly informative
3      Half informative
2      Quite informative
1      Not informative

Table 3 shows the length of each video and the number of segments that were selected during the summarization process. "Full Score" corresponds to the evaluation of the complete audio summaries, while "Average Score" corresponds to the scores of their summary segments. The two metrics represent different things, yet they seem to be quite correlated: "Full Score" quantifies the informativeness of the summary as a whole, while "Average Score" represents the summary quality in terms of the information of each of its segments. To validate this observation,


we computed the linear correlation between these two metrics, obtaining a PCC value equal to 0.53. The average scores of all evaluators can be seen in Table 3. The lowest "Full Score" average value obtained during evaluation was 2.75 and the highest 4.67, meaning that the summarization algorithm generated at least half-informative summaries. "Average Score" values oscillate between 2.49 and 3.76. An interesting case is sample #6, which according to its "Full Score" is "mostly informative" (Table 2) but has the lowest "Average Score" of all samples. This difference arises because 67% of its summary segments have an informativeness score < 3, yet in general the summary manages to communicate almost all of the relevant information. Figure 3 plots the average score of each of the 30 segments for sample #6.

Table 3. Audio summarization performance over complete summaries and summary segments

Sample  Length   Segments  Full score  Average score
1       3m 19s   8         4.20        2.90
2       5m 21s   13        3.50        2.78
3       2m 47s   5         3.80        3.76
4       1m 42s   5         3.60        2.95
5       8m 47s   22        4.67        3.68
6       9m 45s   30        4.00        2.49
7       5m 23s   8         3.20        3.75
8       6m 24s   20        3.75        2.84
9       7m 35s   18        3.75        3.19
10      2m 01s   4         2.75        2.63

Fig. 3. Audio summarization performance for sample #6


A graphical representation of the audio summaries and their performance can be seen in Fig. 4. Full audio streams are represented by white bars, while summary segments are represented by the gray zones. The height of each summary segment corresponds to its informativeness score.

Fig. 4. Graphical representation of audio summarization performance

From Fig. 4 it can be seen that samples #2, #3, #7, #8 and #10 have all their summary segments clustered to the left. This is due to the preference that the summarization technique gives to the first part of the audio stream, the region where, in a standard newscast, most of the information is gathered. The problem is that in cases where different topics are covered over the newscast (multi-topic newscasts, interviews, round tables, reports, etc.), relevant information is distributed over the whole recording. If a large number of relevant segments are grouped in this region, the summarization algorithm uses up the space available for the summary very quickly, discarding a large portion of the audio stream. This is the case for samples #7 and #10, whose "Full Scores" are below 3.50. Concerning sample #5, its summary segments are well distributed. Of its 22 segments, only 4 had an informativeness score ≤ 3, and it achieved the highest "Full Score" of all samples along with a good "Average Score".

5 Conclusions

In this paper we presented an audio summarization method based on audio features and on the hypothesis that mapping the informativeness from a


pre-trained model using only audio features may help to select those segments which are more pertinent for the summary. The informativeness of each segment was obtained by mapping a set of audio features derived from its Mel-frequency cepstral coefficients to its corresponding Jensen-Shannon divergence score. Summarization was performed over a sample of English newscasts, demonstrating that the proposed method is able to generate at least half-informative extractive summaries. We can deduce that there is no clear correlation between the quality of a summary and the quality of its parts; however, this behavior could be modeled as a recall-based relation between both measures. As future work we will validate this hypothesis, as well as expand the evaluation dataset from a multilingual perspective to consider French and Arabic summarization.

Acknowledgments. We would like to acknowledge the support of CHIST-ERA for funding this work through the Access Multilingual Information opinionS (AMIS) (France - Europe) project.

References
1. Christensen, H., Gotoh, Y., Renals, S.: A cascaded broadcast news highlighter. IEEE Trans. Audio Speech Lang. Process. 16(1), 151–161 (2008)
2. Duxans, H., Anguera, X., Conejero, D.: Audio based soccer game summarization. In: 2009 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, BMSB'09, pp. 1–6. IEEE (2009)
3. Jouvet, D., Langlois, D., Menacer, M., Fohr, D., Mella, O., Smaïli, K.: Adaptation of speech recognition vocabularies for improved transcription of youtube videos. J. Int. Sci. Gen. Appl. 1(1), 1–9 (2018)
4. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
5. Leszczuk, M., Grega, M., Koźbial, A., Gliwski, J., Wasieczko, K., Smaïli, K.: Video summarization framework for newscasts and reports – work in progress. In: Dziech, A., Czyżewski, A. (eds.) MCSS 2017. CCIS, vol. 785, pp. 86–97. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69911-0_7
6. Louis, A., Nenkova, A.: Automatic summary evaluation without human models. In: TAC (2008)
7. Louis, A., Nenkova, A.: Automatically evaluating content selection in summarization without human models. In: 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 306–314. ACL (2009)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
9. Maskey, S., Hirschberg, J.: Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In: 9th European Conference on Speech Communication and Technology (2005)
10. Maskey, S., Hirschberg, J.: Summarizing speech without text using hidden markov models. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 89–92. Association for Computational Linguistics (2006)
11. McFee, B., et al.: librosa: audio and music signal analysis in python. In: 14th Python in Science Conference, pp. 18–25 (2015)
12. Rafii, Z., Pardo, B.: Music/voice separation using the similarity matrix. In: ISMIR, pp. 583–588 (2012)
13. Rott, M., Červa, P.: Speech-to-text summarization using automatic phrase extraction from recognized text. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 101–108. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_12
14. Saggion, H., Torres-Moreno, J.M., Cunha, I.d., SanJuan, E.: Multilingual summarization evaluation without human models. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING'10, pp. 1059–1067. Association for Computational Linguistics, Stroudsburg, PA, USA (2010). https://dl.acm.org/doi/10.5555/1944566.1944688
15. Szaszák, G., Tündik, M.Á., Beke, A.: Summarization of spontaneous speech using automatic speech recognition and a speech prosody based tokenizer. In: KDIR, pp. 221–227 (2016)
16. Taskiran, C.M., Pizlo, Z., Amir, A., Ponceleon, D., Delp, E.J.: Automated video program summarization using speech transcripts. IEEE Trans. Multimedia 8(4), 775–791 (2006). https://doi.org/10.1109/TMM.2006.876282
17. Torres-Moreno, J.M.: Automatic Text Summarization. John Wiley & Sons, Hoboken (2014)
18. Torres-Moreno, J., Saggion, H., da Cunha, I., SanJuan, E., Velázquez-Morales, P.: Summary evaluation with and without references. Polibits 42, 13–19 (2010). https://polibits.cidetec.ipn.mx/ojs/index.php/polibits/article/view/42-2/1781
19. Zechner, K.: Spoken language condensation in the 21st century. In: 8th European Conference on Speech Communication and Technology (2003)
20. Zlatintsi, A., Iosif, E., Marago, P., Potamianos, A.: Audio salient event detection and summarization using audio and text modalities. In: 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 2311–2315. IEEE (2015)
21. Zlatintsi, A., Maragos, P., Potamianos, A., Evangelopoulos, G.: A saliency-based approach to audio event detection and summarization. In: 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 1294–1298. IEEE (2012)

Multilingual Speech Emotion Recognition on Japanese, English, and German

Panikos Heracleous1(B), Keiji Yasuda1,2, and Akio Yoneyama1

1 Education and Medical ICT Laboratory, KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama 356-8502, Japan
{pa-heracleous,yoneyama}@kddi-research.jp
2 Nara Institute of Science and Technology, Ikoma, Japan
[email protected]

Abstract. The current study focuses on human emotion recognition based on speech, and particularly on multilingual speech emotion recognition using Japanese, English, and German emotional corpora. The proposed method exploits conditional random fields (CRF) classifiers in a two-level classification scheme. Specifically, in the first level, the language spoken is identified, and in the second level, speech emotion recognition is carried out using emotion models specific to the identified language. In both the first and second levels, CRF classifiers fed with acoustic features are applied. The CRF classifier is a popular probabilistic method for structured prediction, and is widely applied in natural language processing, computer vision, and bioinformatics. In the current study, the use of CRF in speech emotion recognition when limited training data are available is experimentally investigated. The results obtained show the effectiveness of using CRF when only a small amount of training data is available and methods based on deep neural networks (DNN) are less effective. Furthermore, the proposed method is also compared with two popular classifiers, namely support vector machines (SVM) and probabilistic linear discriminant analysis (PLDA), and higher accuracy was obtained using the proposed method. For the classification of four emotions (i.e., neutral, happy, angry, sad) the proposed method based on CRF achieved classification rates of 93.8% for English, 95.0% for German, and 88.8% for Japanese. These results are very promising, and superior to the results obtained in other similar studies on multilingual or even monolingual speech emotion recognition.

Keywords: Speech emotion recognition · Multilingual · Conditional random fields · Two-level classification · i-vector paradigm · Deep learning

1 Introduction

© Springer Nature Switzerland AG 2023. A. Gelbukh (Ed.): CICLing 2019, LNCS 13452, pp. 362–375, 2023. https://doi.org/10.1007/978-3-031-24340-0_27

Automatic recognition of human emotions [1] is a relatively new field, and is attracting considerable attention in research and development areas because of


its high importance in real applications. Emotion recognition can be used in human-robot communication, when robots communicate with humans according to the detected human emotions, and also has an important role at call centers to detect the caller’s emotional state in cases of emergency (e.g., hospitals, police stations), or to identify the level of a customer’s satisfaction (i.e., providing feedback). In the current study, multilingual emotion recognition based on speech is experimentally investigated. Specifically, using English, German, and Japanese emotional speech data, multilingual emotion recognition experiments are conducted based on several classification approaches and the i-vector paradigm framework. Previous studies reported automatic speech emotion recognition using Gaussian mixture models (GMMs) [2], support vector machines [3], neural networks (NN) [4], and deep neural networks (DNN) [5]. Most studies in speech emotion recognition have focused solely on a single language, and cross-corpus speech emotion recognition has been addressed in only a few studies. In [6], experiments on emotion recognition are described using comparable speech corpora collected from American English and German interactive voice response systems, and the optimal set of acoustic and prosodic features for mono-, cross-, and multilingual anger recognition are computed. Cross-language speech emotion recognition based on HMMs and GMMs is reported in [7]. Four speech databases for cross-corpus classification, with realistic, non-prompted emotions and a large acoustic feature vector are reported in [8]. In the current study, however, multilingual speech emotion recognition using Japanese, English, and German corpora based on a two-level classification scheme is demonstrated. Specifically, spoken language identification and emotion recognition are integrated in a complete system capable of recognizing four emotions from English, German, and Japanese databases. In the first level, spoken language identification using emotional speech is performed, and in the second level the emotions are classified using acoustic models of the language identified in the first level. For classification in both the first and second levels, CRF classifiers are applied and compared to SVM and PLDA classifiers. A similar study –but with different objectives– is presented in [9]. In a more recent study [10], a three-layer perception model is used for multilingual speech emotion recognition using Japanese, Chinese, and German emotional corpora. In that specific study, the volume of training and test data used in classification is closely comparable with the data used in the current study, and, therefore, comparisons are, to some extent, possible. Although very limited training data were available, DNN and convolutional neural networks (CNN) were also considered for comparison purposes. Automatic language identification (LID) is a process whereby a spoken language is identified automatically. Applications of language identification include, but are not limited to, speech-to-speech translation systems, re-routing incoming calls to native speaker operators at call centers, and speaker diarization. Because of the importance of spoken language identification in real applications, many studies have addressed this issue. The approaches reported are categorized into


the acoustic-phonetic approach, the phonotactic approach, the prosodic approach, and the lexical approach [11]. In phonotactic systems [11,12], sequences of recognized phonemes obtained from phone recognizers are modeled. In [13], a typical phonotactic language identification system is used, where a language-dependent phone recognizer is followed by parallel language models (PRLM). In [14], a universal acoustic characterization approach to spoken language recognition is proposed. Another method based on vector-space modeling is reported in [11,15], and presented in [16]. In acoustic modeling-based systems, different features are used to model each language. Earlier language identification studies reported methods based on neural networks [17,18]. Later, the first attempt at using deep learning was also reported [19]. Deep neural networks for language identification were used in [20]. The method was compared with i-vector-based classification, linear logistic regression, linear discriminant analysis (LDA)-based, and Gaussian modeling-based classifiers. With a large amount of training data, the method demonstrated superior performance. When limited training data were used, the i-vector approach yielded the best identification rate. In [21] a comparative study on spoken language identification using deep neural networks was presented. Other methods based on DNN and recurrent neural networks (RNN) were presented in [22,23]. In [24], experiments on language identification using i-vectors and CRF were reported. The i-vector paradigm for language identification with SVM [25] was also applied in [26]. SVM with local Fisher discriminant analysis is used in [27]. Although significant improvements in LID have been achieved using phonotactic approaches, most state-of-the-art systems still rely on acoustic modeling.

2 Methods

2.1 Emotional Speech Data

Four professional female actors simulated Japanese emotional speech. These comprised neutral, happy, angry, sad, and mixed emotional states. Fifty-one utterances for each emotion were produced by each speaker. The sentences were selected from a Japanese book for children. The data were recorded at 48 kHz and down-sampled to 16 kHz, and they also contained short and longer utterances varying from 1.5 s to 9 s. Twenty-eight utterances were used for training and 20 for testing. The remaining utterances were excluded due to poor speech quality. For the English emotional speech data, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) set [28] was used. RAVDESS uses a set of 24 actors (12 male, 12 female) speaking and singing with various emotions, in a North American English accent, and contains 7,356 high-quality video recordings of emotionally-neutral statements, spoken and sung with a range of emotions. The speech set consists of the 8 emotional expressions: neutral, calm, happy, sad, angry, fearful, surprised, and disgusted. The song set consists of the 6 emotional expressions: neutral, calm, happy, sad, angry, and fearful. All emotions except


neutral are expressed at two levels of emotional intensity: normal and strong. There are 2,452 unique vocalizations, all of which are available in three formats: full audio-video (720p, H.264), video only, and audio only (wav). The database has been validated in a perceptual experiment involving 297 participants. The data are encoded as 16-bit, 48-kHz wav files, and down-sampled to 16 kHz. In the current study, 96 utterances for the neutral, happy, angry, and sad emotional states were used. For training, 64 utterances were used for each emotion, and 32 for testing. The German database used was the Berlin database [29], which includes seven emotional states: anger, boredom, disgust, anxiety, happiness, sadness, and neutral speech. The utterances were produced by ten professional German actors (5 female, 5 male) speaking ten sentences with an emotionally neutral content but expressed with the seven different emotions. The actors produced 69 frightened, 46 disgusted, 71 happy, 81 bored, 79 neutral, 62 sad, and 127 angry emotional sentences. For training, 42 utterances were used in the study, and for testing, 20 utterances, in the neutral, happy, angry, and sad modes.

2.2 Classification Approaches

Conditional Random Fields (CRF). CRF is a modern approach similar to HMMs, but of a different nature. CRFs are undirected graphical models, a special case of conditionally trained finite state machines. They are discriminative models, which maximize the conditional probability of observation and state sequences. CRFs assume frame dependence, and as a result context is also considered. The main advantage of CRFs is their flexibility to include a wide variety of non-independent features. CRFs have been successfully used for meeting segmentation [30], for phone classification [31], and for event recognition and classification [32]. A language identification method based on deep-structured CRFs has been reported in [33]. The current study is based on the popular and very simple linear-chain CRF, along with a low-dimensional feature representation using i-vectors. Similarly to [34], which used CRFs for object recognition, each input sentence is represented by a single vector (i.e., an i-vector); this scenario differs from conventional classification approaches in machine learning, where the input space is represented as a set of feature vectors. In CRF, the probability of a class label k given the observation sequence o = (o_1, o_2, ..., o_T) is given by the following equation:

p(k|o, \lambda) = \frac{1}{z(o, \lambda)} \sum_{s \in k} e^{\lambda \cdot f(k, s, o)}    (1)

where λ is the parameter vector, f is the sufficient statistics vector, and s = (s_1, s_2, ..., s_T) is a hidden state sequence. The function z(o, λ) ensures that the model forms a properly normalized probability and is defined as:

z(o, \lambda) = \sum_{k} \sum_{s \in k} e^{\lambda \cdot f(k, s, o)}    (2)

Figure 1 demonstrates the structure of HMM and CRF models.


Fig. 1. Structures of hidden Markov models (HMM) and conditional random fields (CRF).
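The paper does not name a CRF toolkit; as one possible realization of the "one i-vector per sentence" setting described above, the sketch below uses the sklearn-crfsuite package and treats every utterance as a length-one sequence. The hyper-parameters and feature naming are illustrative assumptions.

import sklearn_crfsuite

def ivector_to_features(ivec):
    """Represent a single utterance-level i-vector as one CRF observation:
    each dimension becomes a real-valued feature."""
    return {f"dim_{j}": float(v) for j, v in enumerate(ivec)}

def train_crf(ivectors, labels):
    """Linear-chain CRF over length-1 sequences (one i-vector per utterance),
    usable both for language identification and for emotion classification."""
    X = [[ivector_to_features(iv)] for iv in ivectors]   # list of 1-step sequences
    y = [[lab] for lab in labels]                         # e.g. "angry", "ja", ...
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100, all_possible_transitions=True)
    crf.fit(X, y)
    return crf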

Support Vector Machines (SVM). A support vector machine (SVM) is a two-class classifier constructed from sums of a kernel function K(·,·):

f(x) = \sum_{i=1}^{L} \alpha_i t_i K(x, x_i) + d    (3)

where the t_i are the ideal outputs, \sum_{i=1}^{L} \alpha_i t_i = 0, and \alpha_i > 0. An SVM is a discriminative classifier, which is widely used in regression and classification. Given a set of labeled training samples, the algorithm finds the optimal hyperplane, which categorizes new samples. SVM is among the most popular machine learning methods. The advantages of SVM include support for high-dimensional data, memory efficiency, and versatility. However, when the number of features exceeds the number of samples, the SVM performs poorly. Another disadvantage is that the SVM is not probabilistic, because it works by categorizing objects based on the optimal hyperplane. Originally, SVMs were used for binary classification. Currently, the multi-class SVM, a variant of the conventional SVM, is widely used in solving multi-class classification problems. The most common way to build a multi-class SVM is to use K one-versus-rest binary classifiers (commonly referred to as "one-versus-all" or OVA classification). Another strategy is to build one-versus-one classifiers and to choose the class that is selected by the most classifiers. In this case, K(K-1)/2 classifiers are required and the training time decreases because less training data are used for each classifier.

Probabilistic Linear Discriminant Analysis (PLDA). PLDA is a popular technique for dimension reduction using the Fisher criterion. Using PLDA, new axes are found which maximize the discrimination between the different classes. PLDA was originally applied to face recognition, and can be used to specify


Fig. 2. Classification scheme based on the i-vector paradigm.

a generative model of the i-vector representation. A study using UBM-based LDA for speaker recognition was also presented in [35]. Adapting this to language identification and emotion classification, for the i-th language or emotion, the i-vector w_{i,j} representing the j-th recording can be formulated as:

w_{i,j} = m + S x_i + e_{i,j}    (4)

where S represents the between-language or between-emotion variability, and the latent variable x is assumed to have a standard normal distribution and to represent a particular language or emotion and channel. The residual term e_{i,j} represents the within-language or within-emotion variability, and it is assumed to have a normal distribution. Figure 2 shows the two-level classification scheme used in the current study.
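The decision flow of the two-level scheme in Fig. 2 can be sketched as follows; the classifier objects are placeholders with a scikit-learn-style predict() interface, standing in for the CRF, SVM, or PLDA models trained on i-vectors.

def two_level_predict(ivector, language_clf, emotion_clfs):
    """Two-level scheme: identify the spoken language first, then apply the
    emotion model of the identified language to the same i-vector."""
    lang = language_clf.predict([ivector])[0]            # level 1: e.g. "ja", "en", "de"
    emotion = emotion_clfs[lang].predict([ivector])[0]   # level 2: language-specific model
    return lang, emotion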

2.3 Shifted Delta Cepstral (SDC) Coefficients

Previous studies showed that language identification performance is improved by using SDC feature vectors, which are obtained by concatenating delta cepstra across multiple frames. The SDC features are described by the number N of cepstral coefficients, the time advance and delay d, the number k of blocks concatenated for the feature vector, and the time shift P between consecutive blocks. Each final SDC feature vector has kN parameters. In contrast, conventional cepstra plus delta cepstra feature vectors have 2N parameters. The SDC is calculated as follows:

\Delta c(t + iP) = c(t + iP + d) - c(t + iP - d)    (5)


The final vector at time t is given by the concatenation of all Δc(t + iP) for 0 ≤ i < k, where c(t) is the original feature value at time t. In the current study, SDC coefficients were used not only in spoken language identification, but also in emotion classification.
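A sketch of the SDC computation of Eq. (5). The 7-1-3-7 configuration used as a default below is the common choice in language identification and is only an assumption here; the paper does not fix N-d-P-k at this point.

import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra for an utterance. `cepstra` has shape (frames, N).
    Each output frame stacks k delta blocks spaced P frames apart, each computed
    with a +/- d frame shift, giving kN values per frame."""
    T = cepstra.shape[0]
    padded = np.pad(cepstra, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    out = np.zeros((T, k * N))
    for t in range(T):
        blocks = []
        for i in range(k):
            idx = t + d + i * P          # position of c(t + iP) in the padded array
            blocks.append(padded[idx + d] - padded[idx - d])
        out[t] = np.concatenate(blocks)
    return out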

2.4 Feature Extraction

In automatic speech recognition, speaker recognition, and language identification, Mel-frequency cepstral coefficients (MFCC) are among the most popular and most widely used acoustic features. Therefore, in modeling the languages being identified and the emotions being recognized, this study similarly used 12 MFCC, concatenated with SDC coefficients to form feature vectors of length 112. The MFCC features were extracted every 10 ms using a window length of 20 ms. The extracted acoustic features were used to construct the i-vectors used in emotion and spoken language identification modeling and classification. A widely used approach for speaker recognition is based on Gaussian mixture models (GMM) with universal background models (UBM). The individual speaker models are created using maximum a posteriori (MAP) adaptation of the UBM. In many studies, GMM supervectors are used as features. The GMM supervectors are extracted by concatenating the means of the adapted model. The problem with using GMM supervectors is their high dimensionality. To address this issue, the i-vector paradigm, which overcomes the limitations of high dimensionality, was introduced. With i-vectors, the variability contained in the GMM supervectors is modeled with a small number of factors, and the whole utterance is represented by a low-dimensional i-vector of 100–400 dimensions. Considering language identification, an input utterance can be modeled as:

M = m + Tw    (6)

where M is the language-dependent supervector, m is the language-independent supervector, T is the total variability matrix, and w is the i-vector. Both the total variability matrix and the language-independent supervector are estimated from the complete set of training data. The same procedure is used to extract the i-vectors used in speech emotion recognition.

2.5 Evaluation Measures

In the current study, the equal error rate (EER) (i.e., equal false alarm and miss probabilities) and the classification rate are used as evaluation measures. The classification rate is defined as:

acc = \frac{1}{n} \sum_{k=1}^{n} \frac{\text{No. of corrects for class } k}{\text{No. of trials for class } k} \cdot 100    (7)

where n is the number of emotions. In addition, the detection error trade-off (DET) curves, which plot miss probability as a function of the false alarm rate, are also given.
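A minimal sketch of Eq. (7), assuming integer class labels:

```python
import numpy as np

def classification_rate(y_true, y_pred):
    """Average per-class accuracy, as in Eq. (7) (sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rates = []
    for k in np.unique(y_true):
        mask = (y_true == k)
        rates.append((y_pred[mask] == k).mean())  # corrects / trials for class k
    return 100.0 * float(np.mean(rates))
```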

3 Results

This section presents the results for multilingual emotion classification based on a two-level classification scheme using Japanese, English, and German corpora.

3.1 Spoken Language Identification Using Emotional Speech Data

The i-vectors used in modeling and classification are constructed using MFCC features and SDC coefficients. For training, 160 utterances from each language are used, and 80 utterances are used for testing. The dimension of the i-vectors is set to 100, and 256 Gaussian components are used in the UBM-GMM. Because only three target languages are used, identification was almost perfect in all cases (the only exception being PLDA, with 98.8%). It should be noted that language identification is conducted here on emotional speech data, so this result indicates that spoken language classification using emotional speech does not present any particular difficulty compared to normal speech.

3.2 Emotion Recognition Based on a Two-Level Classification Scheme

Table 1 shows the average emotion classification rates when using MFCC features only. As shown, high classification rates are obtained. The two classifiers based on DNN and CNN show lower rates (except for Japanese); a possible reason is the small volume of training data in the case of English and German. Table 2 shows the average classification rates when using MFCC features along with SDC coefficients. As shown, the CRF classifier shows superior performance in most cases, followed by SVM. The results show that using SDC coefficients along with MFCC features improves the classification rates, indicating that SDC coefficients are effective not only in spoken language identification, but also in speech emotion recognition. Note, however, that for DNN and CNN, small or no improvements are obtained.

Table 1. Average emotion classification rates when using MFCC features for the i-vector construction.

Classifier   Japanese   English   German
PLDA         85.2       77.3      91.7
CRF          79.4       87.5      90.0
SVM          82.8       80.5      91.3
DNN          90.6       68.3      85.2
CNN          90.2       71.0      88.7


The results indicate that, due to the limited training data, DNN and CNN are less effective for this task. Table 3 shows the classification rates for the four emotions when using the CRF classifier and MFCC features along with SDC coefficients. For Japanese the average accuracy was 88.8%, for English 93.8%, and for German 95.0%. Concerning the German corpus, the results obtained are significantly higher than those reported in [36] on the same corpus. Table 4 shows the individual classification rates when SVM was used: the average accuracy was 82.8% for Japanese, 91.4% for English, and 95.0% for German. Table 5 shows the recognition rates when using the PLDA classifier. The average accuracy for Japanese was 85.2%, the accuracy for English was 90.2%, and for the German corpus an accuracy of 89.3% was achieved.

Table 2. Average emotion classification rates when using MFCC features and SDC coefficients for the i-vector construction.

Classifier   Japanese   English   German
PLDA         87.6       90.9      91.7
CRF          88.8       93.8      95.0
SVM          90.9       93.0      95.0
DNN          83.7       76.2      82.7
CNN          88.8       77.1      84.5

Table 3. Emotion classification rates when using the CRF classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus     Neutral   Happy   Anger   Sad     Average
Japanese   85.0      83.8    88.8    97.5    88.8
English    87.5      100.0   96.9    90.6    93.8
German     100.0     95.0    100.0   85.0    95.0

Table 4. Emotion classification rates when using the SVM classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus     Neutral   Happy   Anger   Sad     Average
Japanese   92.5      70.0    81.3    87.5    82.8
English    84.4      100.0   93.8    87.5    91.4
German     95.0      95.0    100.0   90.0    95.0


Table 5. Emotion classification rates when using the PLDA classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus     Neutral   Happy   Anger   Sad     Average
Japanese   72.8      86.4    90.1    91.4    85.2
English    93.9      90.9    97.0    78.8    90.2
German     95.2      81.0    85.7    95.2    89.3

The results show that when using CRF, superior performance was obtained, followed by SVM. The lowest rates were obtained when the PLDA classifier was used. The results also show that the emotion sad is recognized with the highest rates in most cases.

3.3 Emotion Recognition Using Multilingual Emotion Models

In this baseline approach, a single-level classification scheme is used. Using emotional speech data from the Japanese, English, and German languages, common emotion models are trained. For training, 112 Japanese, 64 English, and 40 German i-vectors are used for each emotion. For testing, 80 Japanese, 32 English, and 20 German i-vectors are used for each emotion. Since SDC coefficients also improve the performance of the two-level approach, in this method the i-vectors are constructed using MFCC features in conjunction with SDC coefficients. Table 6 shows the classification rates. As shown, using a universal multilingual model, the average emotion classification accuracies over the three languages are 75.2%, 77.7%, and 75.0% when using the PLDA, CRF, and SVM classifiers, respectively. This is a promising result and superior to the results obtained in other similar studies. While the rates achieved are lower than with the two-level approach, this approach uses a single level with reduced system complexity (i.e., language identification is not applied). Furthermore, the classification rates may be improved with a larger amount of training data. These results show that i-vectors can be efficiently applied in multilingual emotion recognition when universal, multilingual emotion models are used. The results also show that in most cases the performance of the CRF classifier is superior.

Table 6. Average emotion classification rates when using a universal emotion model with MFCC features and SDC coefficients for the i-vector construction.

Classifier   Japanese   English   German   Average
PLDA         75.3       68.2      82.1     75.2
CRF          80.9       73.4      78.8     77.7
SVM          76.9       71.9      76.3     75.0


Table 7. Equal error rates (EER) when using a universal emotion model with MFCC features and SDC coefficients for the i-vector construction.

Classifier   Japanese   English   German   Average
PLDA         22.6       22.6      17.5     20.9
CRF          17.6       22.9      17.5     19.3
SVM          22.5       26.8      20.4     23.2

Fig. 3. DET curves for the three languages used in emotion classification when using common multilingual models. (Panels: (a) Japanese, (b) English, (c) German; each panel plots the false negative rate (FNR) [%] against the false positive rate (FPR) [%] for the SVM, PLDA, and CRF classifiers.)

Table 7 shows the EER when a universal, multilingual emotion model is used. As shown, the EER for German is the lowest among the three, followed by the EER for Japanese. The average EERs for the three languages are 20.9%, 19.3%, and 23.2% when using PLDA, CRF, and SVM classifiers, respectively. Also in this case, the lowest EERs were obtained using the CRF classifier. Figure 3 shows the DET curves for multilingual emotion recognition using a universal emotion model.

4 Discussion

Although using real-world emotional speech data would represent a more realistic situation, acted emotional speech data are widely used in speech emotion classification. Furthermore, the current study mainly investigated classification schemes and feature extraction methods, so using acted speech is a reasonable and acceptable approach. Because of the limited emotional data, deep learning approaches to multilingual emotion recognition were not investigated. Instead, a method is proposed that integrates spoken language identification and emotion classification. In addition to the SVM and PLDA classifiers, the CRF classifier is also used in combination with the i-vector paradigm. The results obtained show the advantage of using the CRF classifier, especially when limited data are available. For comparison purposes, deep neural networks were also considered; because of the limited training data, however, the classification rates when using DNN and CNN were significantly lower. In order to address the problems associated with using acted speech, an initiative to obtain a large quantity of spontaneous emotional speech is currently being undertaken. With such data, it will also be possible to analyze the behavior of additional classifiers, such as deep neural networks, and to investigate the problem of multilingual speech emotion recognition in realistic situations (e.g., noisy or reverberant environments).

5 Conclusions

The current study experimentally investigated multilingual speech emotion classification. A two-level classification approach was used, integrating spoken language identification and emotion recognition. The proposed method was based on the CRF classifier and the i-vector paradigm. When classifying four emotions, the proposed method achieved a 93.8% classification rate for English, a 95.0% rate for German, and an 88.8% rate for Japanese. These results are very promising and demonstrate the effectiveness of the proposed methods in multilingual speech emotion recognition. An initiative to obtain realistic, spontaneous emotional speech data for a large number of languages is currently being undertaken. As future work, the effect of noise and reverberation will also be investigated.

References 1. Busso, C., Bulut, M., Narayanan, S.: Toward effective automatic recognition systems of emotion in speech. In: Gratch, J., Marsella, S. (eds.) Social emotions in nature and artifact: emotions in human and human-computer interaction, pp. 110– 127. Oxford University Press, New York (2013) 2. Tang, H., Chu, S., Johnson, M.H.: Emotion recognition from speech via boosted gaussian mixture models. In: Proceedings of ICME, pp. 294–297 (2009) 3. Pan, Y., Shen, P., Shen, L.: Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012) 4. Nicholson, J., Takahashi, K., Nakatsu, R.: Emotion recognition in speech using neural networks. Neural Comput. Appli. 9(4), 290–296 (2000)


5. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Proceedings of Interspeech, pp. 223–227 (2014) 6. Polzehl, T., Schmitt, A., Metze, F.: Approaching multi-lingual emotion recognition from speech-on language dependency of acoustic prosodic features for anger detection. In: Proceedings of Speech Prosody (2010) 7. Bhaykar, M., Yadav, J., Rao, K.S.: Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM. In: 2013 National Conference on Communications (NCC), pp. 1–5. IEEE (2013) 8. Eyben, F., Batliner, A., Schuller, B., Seppi, D., Steidl, S.: Crosscorpus classification of realistic emotions - some pilot experiments. In: Proceedings of the Third International Workshop on EMOTION (satellite of LREC) (2010) 9. Sagha, H., Matejka, P., Gavryukova, M., Povolny, F., Marchi, E., Schuller, B.: Enhancing multilingual recognition of emotion in speech by language identification. In: Proceedings of Interspeech (2016) 10. Li, X., Akagi, M.: A three-layer emotion perception model for valence and arousalbased detection from multilingual speech. In: Proceedings of Interspeech, pp. 3643– 3647 (2018) 11. Li, H., Ma, B., Lee, K.A.: Spoken language recognition: From fundamentals to practice. In: Proceedings of the IEEE, vol. 101(5), pp. 1136–1159 (2013) 12. Zissman, M.A.: Comparison of four approaches to automatic language identification of telephone speech. lEEE Trans. Speech Audio Process. 4(1), 31–44 (1996) 13. Caseiro, D., Trancoso, I.: Spoken language identification using the speechdat corpus. In: Proceedings of ICSLP 1998 (1998) 14. Siniscalchi, S.M., Reed, J., Svendsen, T., Lee, C.-H.: Universal attribute characterization of spoken languages for automatic spoken language recognition. Comput. Speech Lang. 27, 209–227 (2013) 15. Lee, C.-H.: principles of spoken language recognition. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds.) Springer Handbook of Speech Processing. SH, pp. 785–796. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-49127-9 39 16. Reynolds, D.A., Campbell, W.M., Shen, W., Singer, E.: Automatic language recognition via spectral and token based approaches. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds.) Springer Handbook of Speech Processing. SH, pp. 811–824. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-49127-9 41 17. Cole, R., Inouye, J., Muthusamy, Y., Gopalakrishnan, M.: Language identification with neural networks: a feasibility study. In: Proceedings of IEEE Pacific Rim Conference, pp. 525–529 (1989) 18. Leena, M., Rao, K.S., Yegnanarayana, B.: Neural network classifiers for language identification using phonotactic and prosodic features. In: Proceedings of Intelligent Sensing and Information Processing, pp. 404–408 (2005) 19. Montavon, G.: Deep learning for spoken language identification. In: NIPS workshop on Deep Learning for Speech Recognition and Related Applications (2009) 20. Moreno, I.L., Dominguez, J.G., Plchot, O., Martinez, D., Rodriguez, J.G., Moreno, P.: Automatic language identification using deep neural networks. In: Proceedings of ICASSP, pp. 5337–5341 (2014) 21. Heracleous, P., Takai, K., Yasuda, K., Mohammad, Y., Yoneyama, A.: Comparative Study on Spoken Language Identification Based on Deep Learning. In: Proceedings of EUSIPCO (2018) 22. Jiang, B., Song, Y., Wei, S., Liu, J.-H., McLoughlin, I.V., Dai, L.-R.: Deep bottleneck features for spoken language identification. PLoS ONE 9(7), 1–11 (2010)


23. Zazo, R., Diez, A.L., Dominguez, J.G., Toledano, D.T., Rodriguez, J.G.: Language identification in short utterances using long short-term memory (lstm) recurrent neural networks. PLoS ONE 11(1), e0146917 (2016) 24. Heracleous, P., Mohammad, Y., Takai, K., Yasuda, K., Yoneyama, A.: Spoken Language Identification Based on I-vectors and Conditional Random Fields. In: Proceedings of IWCMC, pp. 1443–1447 (2018) 25. Cristianini, N., Taylor, J.S.: Support vector machines. Cambridge University Press, Cambridge (2000) 26. Dehak, N., Carrasquillo, P.A.T., Reynolds, D., Dehak, R.: Language recognition via ivectors and dimensionality reduction. In: Proceedings of Interspeech, pp. 857– 860 (2011) 27. Shen, P., Lu, X., Liu, L., Kawai, H.: Local fisher discriminant analysis for spoken language identification. In: Proceedings of ICASSP, pp. 5825–5829 (2016) 28. Livingstone, S.R., Peck, K., F.A., Russo: RAVDESS: The ryerson audio-visual database of emotional speech and song. In: 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), Kingston, ON (2012) 29. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of Interspeech (2005) 30. Reiter, S., Schuller, B., Rigoll, G.: Hidden conditional random fields for meeting segmentation. In: Proceedings of ICME, pp. 639–642 (2007) 31. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random fields for phone classification. In: Proceedings of Interspeech, pp. 1117–1120 (2005) 32. Llorens, H., Saquete, E., Colorado, B.N.: TimeML events recognition and classification: learning crf models with semantic roles. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 725– 733 (2010) 33. Yu, D., Wang, S., Karam, Z., Deng, L.: Language recognition using deep-structured conditional random fields. In: Proceedings of ICASSP, pp. 5030–5033 (2010) 34. Quattoni, A., Collins, M., Darrell, T.: Conditional random fields for object recognition. In: Saul, L.K., Weiss, Y., Bottou, L., (eds.) Advances in Neural Information Processing Systems 17, MIT Press, pp. 1097–1104 (2005) 35. Yu, C., Liu, G., Hansen, J.H.L.: Acoustic feature transformation using ubm-based lda for speaker recognition. In: Proceedings of Interspeech, pp. 1851–1854 (2014) 36. Li, X., Akagi, M.: Multilingual speech emotion recognition system based on a three-layer model. In: Proceedimgs of Interspeech, pp. 3606–3612 (2016)

Text Categorization

On the Use of Dependencies in Relation Classification of Text with Deep Learning

Bernard Espinasse, Sébastien Fournier, Adrian Chifu(B), Gaël Guibon, René Azcurra, and Valentin Mace

Aix-Marseille Université, Université de Toulon, CNRS, LIS, Marseille, France
{bernard.espinasse,sebastien.fournier,adrian.chifu,gael.guibon,rene.azcurra,valentin.mace}@lis-lab.fr

Abstract. Deep Learning is more and more used in NLP tasks, such as relation classification of texts. This paper assesses the impact of syntactic dependencies in this task at two levels. The first level concerns the generic Word Embedding (WE) used as input of the classification model; the second level concerns the corpus whose relations have to be classified. Two classification models are studied: the first one is based on a CNN using a generic WE and does not take into account the dependencies of the corpus to be treated, while the second one is based on a compositional WE combining a generic WE with syntactical annotations of the corpus to classify. The impact of dependencies in relation classification is estimated using two different WEs. The first one is essentially lexical and trained on the Wikipedia corpus in English, while the second one is also syntactical, trained on the same corpus previously annotated with syntactical dependencies. The two classification models are evaluated on the SemEval 2010 reference corpus using these two generic WEs. The experiments show the importance of taking dependencies into account at different levels in relation classification.

Keywords: Dependencies · Relation classification · Deep learning · Word embedding · Compositional word embedding

1 Introduction

Deep Learning is more and more used for various tasks of Natural Language Processing (NLP), such as relation classification from text. It should be recalled that Deep Learning emerged mainly with convolutional neural networks (CNNs), originally proposed in computer vision [5]. These CNNs were later used in language processing to solve problems such as sequence labelling [1], semantic analysis (semantic parsing) [11], relation extraction, etc. CNNs are the most commonly used deep neural network models in the relation classification task. One of the first contributions is certainly the basic CNN model proposed by Liu et al. (2013) [7]. We can then mention the model proposed by Zeng et al. (2014) [13] with max-pooling, and the model proposed by Nguyen and Grishman (2015) [10] with multi-size windows. The performance of


these CNN-based relation classification models is low in terms of precision and recall. These low performances can be explained by two reasons. First, despite their success, CNNs have a major limitation in language processing: they were invented to manipulate pixel arrays in image processing, and therefore only take into account consecutive sequential n-grams on the surface chain. Thus, in relation classification, CNNs do not consider long-distance syntactic dependencies, although these dependencies play a very important role in linguistics, particularly in the treatment of negation and subordination, which are fundamental in sentiment analysis, etc. [8]. Second, Deep Learning-based relation classification models generally use as input a representation of words obtained by lexical immersion, or Word Embedding (WE), trained on a large corpus. Skip-Gram or Continuous Bag-of-Words WE models generally only consider the local context of a word, in a window of a few words before and a few words after, without considering syntactic linguistic characteristics. Consequently, syntactic dependencies are not taken into account in relation classification using Deep Learning-based models.

As syntactic dependencies play a very important role in linguistics, it makes sense to take them into account for relation classification or extraction. These dependencies can be considered in Deep Learning models at different levels. At the first level (syntactical word embedding), dependencies are taken into account upstream, at the basic word representation level, in a generic WE trained on a large syntactically annotated corpus and generated with a specific tool. At the second level, related to the relation classification corpus (compositional word embedding), a generic WE trained on a large corpus is combined with specific features, such as dependencies, extracted from the words in the sentences of the corpus to classify. This paper assesses the impact of syntactic dependencies in relation classification at these two different levels.

The paper is organized as follows. In Sect. 2, we first present a generic syntactical WE trained on a large corpus previously annotated with syntactical dependencies, which considers for each word the dependencies in which it is involved. In Sect. 3, two Deep Learning models for relation classification are presented. The first model, which we have developed, is based on a CNN using as input a generic WE trained on a large corpus, completed by a positional embedding of the corpus to classify. The second model, the FCM model, implemented with a perceptron-type neural network, is based on a compositional WE strategy, using as input a combination of a generic WE with specific syntactical features from the corpus whose relations are to be classified. In Sect. 4 we present the results of experiments obtained with these two relation classification models on the SemEval 2010 reference corpus using different WEs. Finally, we conclude by reviewing our work and presenting some perspectives for future research.

2 A Syntactical Word Embedding Taking into Account Dependencies

In the Deep Learning approach, relation classification models generally use as input a representation of the words of a specific natural language obtained by lexical


immersion, or Word Embedding (WE). We can distinguish two main WE models: Skip-Gram and Continuous Bag-of-Words. These WEs only consider the local context of a word, in a window of a few words before and a few words after. Syntactic dependencies are not taken into account in these WE models, whereas they play a very important role in NLP tasks.

Given a classic Bag-of-Words WE taking into account the neighbours upstream and downstream of a word within a defined window, consider the following sentence: "Australian scientist discovers star with telescope". With a 2-word window, the contexts of the word discovers are Australian, scientist, star and with; this misses the important context telescope. By setting the window to 5, one can capture more topical content, for example the word telescope, but this also weakens the importance of information targeted on the focus word. A more relevant word contextualization consists in integrating the different syntactic dependencies in which this word participates, dependencies that can involve words that are very far apart in the text. The syntactic dependencies that we consider in this paper are the Stanford Dependencies defined by [2]. Note that syntactic dependencies are both more inclusive and more focused than Bag-of-Words contexts. They capture relations with distant words that are out of reach of a small-window Bag-of-Words (for example, the discovery instrument is the telescope, via the preposition with), and they also filter out incidental contexts that are in the window but not directly related to the target word (for example, Australian is not used as a context for discovers).

Levy and Goldberg [6] proposed a generalization of the Skip-Gram WE model in which the linear Bag-of-Words contexts are replaced by arbitrary contexts, in particular contexts based on syntactic dependencies, which produces similarities of very different kinds. They also demonstrated how the resulting WE model can be queried for the discriminative contexts of a given word, and observed that the learning procedure seems to favour relatively local syntactic contexts, as well as conjunctions and objects of prepositions. Levy and Goldberg [6] developed a variant of the Word2Vec tool [9], named Word2Vec-f1, based on such a syntactic dependency-based contextualization.
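The sketch below illustrates how such dependency-based (word, context) pairs can be produced. spaCy and its small English model are assumed here purely for illustration (the original Word2Vec-f pipeline is not tied to this parser), and the exact contexts obtained depend on the parse.

```python
# Sketch: extracting dependency-based contexts in the spirit of Levy and Goldberg.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def dependency_contexts(sentence):
    pairs = []
    for tok in nlp(sentence):
        for child in tok.children:
            # Label contexts with the dependency relation, in both directions.
            pairs.append((tok.text, f"{child.text}/{child.dep_}"))
            pairs.append((child.text, f"{tok.text}/{child.dep_}-1"))
    return pairs

# 'discovers' gets contexts such as its subject and (depending on the parse)
# the distant instrument 'telescope', instead of a fixed linear window.
print(dependency_contexts("Australian scientist discovers star with telescope"))
```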

3 Two Models for Relation Classification Using Syntactical Dependencies

In this section, two models for relation classification by supervised Deep Learning are presented and used for our experiments developed in Sect. 4. Firstly, we developed a model based on a CNN using WEs trained on a large corpus. The second, proposed by [4], is based on a compositional WE combining a generic WE and a syntactic annotation of the corpus whose relations are to be classified.

1 Word2Vec-f: https://bitbucket.org/yoavgo/Word2Vecf.

3.1 A CNN Based Relation Classification Model (CNN)

The first model, which we have developed, is based on a CNN using a generic WE trained on a large corpus. It is inspired by the one used by Nguyen and Grishman (2015) [10] and takes as input a WE produced with either the Word2Vec or the Word2Vec-f tool, together with a positional embedding relative to the corpus from which we want to extract relations. The architecture of our CNN network (Fig. 1) consists of five main layers:

Fig. 1. CNN architecture.

– Two convolutional layers, each using a defined number and size of convolutional filters to capture the characteristics of the pre-processed input; the filter size is different for each layer. Each convolutional layer is followed by a max-pooling layer, with an aggregation function (max), that identifies the most important characteristics in the output vector of the convolutional layer.
– A fully connected layer that uses the ReLU (Rectified Linear Unit) activation function.
– A fully connected layer using the Softmax activation function to classify the relations to be found.
– A logistic regression layer that optimizes the network weight values, with a function that updates these values iteratively on the training data.

This architecture is implemented on the TensorFlow platform2 (version 1.8), using the Tflearn API3, which facilitates its implementation and experimentation.
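As an illustration, a minimal Keras sketch of this kind of architecture is given below. It is not the authors' Tflearn implementation: the vocabulary size and dense-layer width are assumptions, while the filter number (384), window sizes (2 and 3), embedding size (300) and dropout (0.5) follow the hyperparameters reported in Sect. 4.3.

```python
# Minimal Keras sketch of a CNN for relation classification (illustrative only).
from tensorflow.keras import layers, models

def build_relation_cnn(vocab_size=20000, n_classes=19):
    model = models.Sequential([
        layers.Embedding(vocab_size, 300),           # generic WE input
        layers.Conv1D(384, 2, activation="relu"),    # window size 2
        layers.MaxPooling1D(2),
        layers.Conv1D(384, 3, activation="relu"),    # window size 3
        layers.GlobalMaxPooling1D(),
        layers.Dense(128, activation="relu"),        # fully connected (ReLU)
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),  # relation classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```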

3.2 A Compositional Word Embedding Based Relation Classification Model (FCM)

2 TensorFlow: https://www.tensorflow.org/.
3 Tflearn: http://tflearn.org/.

This model, named FCM (for Factor-based Compositional embedding Model), is proposed by Gormley, Yu and Dredze (2014–2015) [3,4]. The FCM model is


based on a compositional WE, which combines a generic WE trained on a large corpus with specific syntactic-level features from the corpus whose relations are to be classified. More precisely, this compositional WE is built by combining a classic Skip-Gram WE obtained with Word2Vec with features extracted from the words of the sentences of the corpus from which we want to extract relations. This expressive model is implemented through a perceptron-type neuromimetic network [12]. The key idea is to combine/compose a generic lexical WE with non-lexical, arbitrary and manually defined features, especially syntactic ones. This non-lexical linguistic context is based on arbitrary, hand-defined linguistic structures called HCF (Hand-Crafted Features). These features can be seen as simple questions addressed to the word and its context, for example whether the word is an adjective, whether it is preceded by a verb, or whether it is an entity in the sentence. In fact, there is a large number of more or less complex features allowing different information to be captured. Thus, the model capitalizes on arbitrary types of linguistic annotations by making better use of the features associated with the substructures of these annotations, including global information. In relation classification, the FCM model uses a feature vector f_wi over the word wi, the two target entities M1 and M2, and their dependency path. The main HCF sets used are HeadEmb, Context, In-between and On-path. For example, the In-between features indicate whether a word wi is in between the two target entities, and the On-path features indicate whether the word is on the dependency path (a set of words P) between the two entities [4].

The FCM model constructs a representation of text structures using both lexical and non-lexical characteristics, thus creating an abstraction of text sentences based on both the generic WE and the HCF, capturing relevant information about the text. To construct this representation, the FCM takes as input a sentence and its annotations for each word (generated for example by morphosyntactic labelling) and proceeds in three steps:

– Step 1: The FCM first breaks down the annotated sentence into substructures; each substructure is actually a word with its annotations. Figure 2 illustrates the sentence "The movie I watched depicted hope" with its annotations, especially those related to its syntactic dependencies. Its entities are marked with

Fig. 2. A sentence and its FCM annotations [4].


M1 and M2, and the dependency paths are indicated by arcs, shown in blue when they lie between the entities M1 and M2.

– Step 2: The FCM extracts the HCFs for each substructure (word), obtaining a binary vector for each word in the sentence. The first on-path HCF indicates for each word whether it is on the dependency path between the entities M1 and M2. As illustrated in Fig. 3, each word in the previous sentence corresponds to a vector (column), with f5 representing the HCF vector of the word depicted.

Fig. 3. HCF vector in the FCM model [4].

Fig. 4. Embedding substructure matrix of FCM model for the word “depicted” [4].

– Step 3: For each word, the FCM computes the outer product of its HCF vector with its WE vector obtained with Word2Vec. This product gives, for each word, a matrix called the Substructure Embedding (see Fig. 4). The FCM model then sums all the Substructure Embeddings (matrices) of the sentence to obtain a final matrix called the Annotated Sentence Embedding, denoted:

e_x = Σ_{i=1}^{n} f_wi ⊗ e_wi

Thus, the FCM builds an abstraction of the given sentence: it first splits it into words with their annotations (substructures), then creates for each substructure a matrix of numbers obtained from the generic WE and the word features (Substructure Embedding), and finally sums these matrices to obtain a final matrix that constitutes its representation of the input sentence (Annotated Sentence Embedding).


Fig. 5. Neural network implementing FCM model (Gormley et al., 2015) [4].

The FCM is implemented by a multilayer perceptron neural network, with the architecture presented in Fig. 5. In this network, once a sentence has been provided as input and transformed into an Annotated Sentence Embedding (a matrix of numbers), the network makes a prediction on the relation y between the entities in the sentence. It seeks to determine y knowing x = (M1, M2, S, A), with M the entities, S the sentence and A its annotations. A tensor T, made up of matrices of the same size as the Annotated Sentence Embedding, is used; T acts as a parameter (equivalent to the network connections) and serves to establish a score for each possible relation. In relation classification there is a finite number of possible relations, constituting the set L. There are thus as many matrices in T as there are relations in L. Ty refers to the score matrix for the relation y.
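A minimal numerical sketch of this composition and scoring step is given below; the array shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def annotated_sentence_embedding(hcf, emb):
    """hcf: (n_words, n_features) binary HCF vectors; emb: (n_words, dim) word embeddings."""
    e_x = np.zeros((hcf.shape[1], emb.shape[1]))
    for f_w, e_w in zip(hcf, emb):
        e_x += np.outer(f_w, e_w)      # substructure embedding of one word
    return e_x                          # annotated sentence embedding e_x

def relation_scores(e_x, T):
    """T: (n_relations, n_features, dim); returns one score per relation y (via T_y)."""
    return np.tensordot(T, e_x, axes=([1, 2], [0, 1]))
```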

4 Experiments

This section presents the experiments in relation classification, using the two relation classification models presented above (CNN and FCM) and different generic WEs (Word2Vec and Word2Vec-f), taking or not taking into account syntactical dependencies. First, we present the SemEval 2010 corpus and then the results obtained with the two models. Finally, we compare these results and try to interpret them.

4.1 SemEval 2010 Corpus

The SemEval 2010 corpus4 is a manually annotated reference corpus for extracting nominal relations. Given a sentence and two annotated nominals, the task is to choose the most appropriate relation among the following 9 relation classes: Cause-Effect; Instrument-Agency; Product-Producer; Content-Container; Entity-Origin; Entity-Destination; Component-Whole; Member-Collection; Message-Topic.

4 Corpus SemEval: http://www.aclweb.org/anthology/W09-2415.

Table 1. Characterization of the different relations used.

#Rel   Relation                     Support   #Rel   Relation                     Support
R1     Cause-Effect(e1,e2)          150       R11    Instrument-Agency(e1,e2)     108
R2     Cause-Effect(e2,e1)          123       R12    Instrument-Agency(e2,e1)     134
R3     Component-Whole(e1,e2)       194       R13    Member-Collection(e1,e2)     211
R4     Component-Whole(e2,e1)       134       R14    Member-Collection(e2,e1)     22
R5     Content-Container(e1,e2)     32        R15    Message-Topic(e1,e2)         47
R6     Content-Container(e2,e1)     51        R16    Message-Topic(e2,e1)         153
R7     Entity-Destination(e1,e2)    201       R17    Other                        162
R8     Entity-Destination(e2,e1)    39        R18    Product-Producer(e1,e2)      291
R9     Entity-Origin(e1,e2)         1         R19    Product-Producer(e2,e1)      454
R10    Entity-Origin(e2,e1)         210              Total:                       2717

It is also possible to choose the Other class if none of the 9 relation classes seems suitable. For instance, for the sentence "The macadamia nuts in the cake also make it necessary to have a very sharp knife to cut through the cake neatly.", the best choice would be Component-Whole(e1,e2). Note that Component-Whole(e1,e2) holds in this sentence, while Component-Whole(e2,e1) does not, i.e. we have Other(e2,e1). Thus, the task consists in determining both the relation and the order of e1 and e2 as arguments. The corpus is composed of 8,000 sentences for training the 9 relations and the additional relation Other. The test data set consists of 2,717 examples of the 9 relations and the additional Other relation. More specifically, the test data set contains data for the first 5 relations mentioned above; however, there are some references to the other 4 relations, which can be considered as Other when experimenting with the test data set. Table 1 presents the 19 relations of the SemEval 2010 corpus.

4.2 Employed Word Embeddings

We used both the Word2Vec and Word2Vec-f tools to create our own WEs, integrating syntactic dependencies for the latter, trained on the whole Wikipedia corpus in English. This corpus was previously annotated with syntactic dependencies using the SpaCy5 analyzer. We also used a generic WE already trained with Word2Vec on the GoogleNews corpus; since we did not have the GoogleNews corpus itself, we were unable to process it with the Word2Vec-f tool. Table 2 characterizes the different WEs used for the experiments. Note that the Wikipedia/Word2Vec-f WE is significantly smaller in terms of vocabulary size than the other WEs, by a factor of 7 compared with Wikipedia/Word2Vec and by a factor of 10 compared with GoogleNews/Word2Vec.

5 SpaCy: https://spacy.io/.


Table 2. Characterization of the different word embeddings used.

Word embedding           Vocabulary size   Window   Dimension
GoogleNews/Word2Vec      3,000,000         5        300
Wikipedia/Word2Vec       2,037,291         5        300
Wikipedia/Word2Vec-f     285,378           5        300

Table 3. Results for the CNN model with Word2Vec (W2V) and Word2Vec-f (W2V-f) WE.

        Precision                   Recall                      F1-score
Rel.    W2V    W2V-f   Delta %     W2V    W2V-f   Delta %      W2V    W2V-f   Delta %
R1      0.88   0.93    5.68        0.84   0.84    0.00         0.86   0.88    2.33
R2      0.83   0.84    1.20        0.90   0.90    0.00         0.87   0.87    0.00
R3      0.84   0.76    -9.52       0.64   0.75    17.19        0.73   0.76    4.11
R4      0.79   0.73    -7.59       0.57   0.64    12.28        0.66   0.68    3.03
R5      0.76   0.75    -1.32       0.84   0.82    -2.38        0.80   0.78    -2.50
R6      0.85   0.83    -2.35       0.72   0.77    6.94         0.78   0.80    2.56
R7      0.87   0.79    -9.20       0.86   0.90    4.65         0.86   0.84    -2.33
R8      0.00   0.00    0.00        0.00   0.00    0.00         0.00   0.00    0.00
R9      0.79   0.73    -7.59       0.81   0.89    9.88         0.80   0.80    0.00
R10     0.90   0.89    -1.11       0.81   0.85    4.94         0.85   0.87    2.35
R11     0.67   0.59    -11.94      0.27   0.45    66.67        0.39   0.51    30.77
R12     0.69   0.63    -8.70       0.51   0.61    19.61        0.59   0.62    5.08
R13     0.70   0.47    -32.86      0.44   0.59    34.09        0.54   0.53    -1.85
R14     0.74   0.82    10.81       0.93   0.88    -5.38        0.82   0.85    3.66
R15     0.75   0.79    5.33        0.79   0.80    1.27         0.77   0.79    2.60
R16     0.76   0.78    2.63        0.49   0.69    40.82        0.60   0.73    21.67
R17     0.39   0.48    23.08       0.51   0.40    -21.57       0.45   0.44    -2.22
R18     0.75   0.77    2.67        0.62   0.69    11.29        0.68   0.73    7.35
R19     0.67   0.70    4.48        0.50   0.67    34.00        0.57   0.69    21.05
Mean    0.72   0.72    0.00        0.71   0.72    1.41         0.71   0.73    2.82

4.3 Experiments with the CNN Model

The optimal hyperparameters for training our CNN, quite close to those adopted by Nguyen and Grishman (2015) [10] and by Zeng et al. (2014) [13], are: window sizes 2 and 3; number of filters 384; WE size 300; position size 5; mini-batch size 35; and dropout 0.5. Table 3 gives the results obtained by our CNN model for each relation class of the SemEval 2010 corpus with the Word2Vec and Word2Vec-f WEs, trained on Wikipedia.


Globally, over all relations, the following F-measure values are obtained for the Word2Vec and Word2Vec-f WEs. Word2Vec: F-measure macro 0.663, micro 0.705, weighted 0.707. Word2Vec-f: F-measure macro 0.692, micro 0.728, weighted 0.722. Thus, for the SemEval 2010 corpus, the CNN model obtains its best results with the Wikipedia/Word2Vec-f WE, which takes syntactic dependencies into account; there is an improvement of about 10% compared to the use of the Wikipedia/Word2Vec WE, which does not take dependencies into account.

4.4 Experiments with the FCM Model

There are several implementations of the FCM model: the one developed in Java by M. Gormley in his thesis6 and the one by M. Yu, developed in C++7. It is the latter, more efficient one, that has been used in our experiments. For the experiments with the FCM model, the best results were obtained with a learning rate of 0.005 and 30 epochs, without early stopping.

6 https://github.com/mgormley/pacaya-nlp.
7 https://github.com/Gorov/FCM_nips_workshop.

Table 4. Results for the FCM model with Word2Vec (W2V) and Word2Vec-f (W2V-f) WE.

        Precision                   Recall                      F1-score
#       W2V    W2V-f   Delta %     W2V    W2V-f   Delta %      W2V    W2V-f   Delta %
R1      0.77   0.79    1.85        0.78   0.77    -0.86        0.78   0.78    0.48
R2      0.80   0.80    0.65        0.74   0.76    3.30         0.77   0.78    2.01
R3      0.91   0.91    -0.04       0.91   0.90    -0.56        0.91   0.91    -0.32
R4      0.76   0.78    2.65        0.81   0.75    -7.41        0.78   0.76    -2.47
R5      0.83   0.81    -1.55       0.75   0.69    -8.33        0.79   0.75    -5.22
R6      0.78   0.76    -1.39       0.75   0.76    2.63         0.76   0.76    0.62
R7      0.81   0.83    2.43        0.92   0.91    -1.62        0.86   0.87    0.49
R8      0.88   0.86    -2.47       0.74   0.77    3.44         0.81   0.81    0.65
R9      0.00   0.00    0.00        0.00   0.00    0.00         0.00   0.00    0.00
R10     0.83   0.84    0.98        0.86   0.89    3.32         0.85   0.87    2.11
R11     0.81   0.82    1.33        0.75   0.81    7.41         0.78   0.81    4.40
R12     0.93   0.90    -2.99       0.93   0.92    -0.81        0.93   0.91    -1.91
R13     0.82   0.83    1.51        0.84   0.87    3.37         0.83   0.85    2.42
R14     0.60   0.67    11.12       0.41   0.45    11.10        0.49   0.54    11.10
R15     0.86   0.86    0.00        0.79   0.79    0.00         0.82   0.82    0.00
R16     0.82   0.82    -0.34       0.87   0.88    1.51         0.84   0.85    0.56
R17     0.83   0.85    1.73        0.80   0.79    -1.55        0.82   0.82    0.04
R18     0.84   0.88    4.26        0.91   0.91    -0.38        0.87   0.89    1.98
R19     0.55   0.56    2.30        0.50   0.54    7.43         0.53   0.55    4.91
Mean    0.78   0.79    1.36        0.79   0.79    1.12         0.78   0.79    1.29


We tested the FCM model on the SemEval 2010 corpus using several WEs obtained with the Word2Vec and Word2Vec-f tools and trained on Wikipedia in English (pre-processed with SpaCy for Word2Vec-f), in sizes 200 and 300 with a window of 5. Table 4 shows the results obtained, and Table 5 gives the F1 measures of the FCM model for various WEs.

Table 5. Measures obtained by the FCM model with different word embeddings.

Word embedding                 Macro F1   Micro F1   Weighted F1
GoogleNews vect. neg-dim300    0.746      0.789      0.781
Wikipedia/Word2Vec             0.747      0.7858     0.782
Wikipedia/Word2Vec-f           0.744      0.794      0.792

If we compare the macro F1-scores obtained for the Wikipedia/Word2Vec-dim300 and Wikipedia/Word2Vec-f-dim300 WEs, we obtain a gain of 0.88%. It should be noted that the results of the FCM model obtained on the SemEval 2010 corpus are slightly different from those reported by the authors of the FCM in their article (Gormley et al., 2015) [4], because the latter do not take into account the Other relation class of the corpus.

4.5 Discussion

Table 6 compares the results of the two relation classification models, the CNN model (generic WE + CNN) and the FCM model (compositional WE), on the SemEval 2010 corpus (all relation classes are considered).

Table 6. All-classes F1-measures for SemEval 2010 for the two classification models with different word embeddings.

Model/WE           Macro F1   Micro F1   Weighted F1
CNN/Word2Vec       0.663      0.705      0.707
CNN/Word2Vec-f     0.692      0.728      0.722
FCM/Word2Vec       0.747      0.785      0.782
FCM/Word2Vec-f     0.754      0.794      0.792

The CNN model (WE + CNN) shows very slightly improved performance when the WE takes into account the syntactic dependencies (Word2Vec-f) of the large training corpus (Wikipedia). The FCM model (compositional WE + NN), which takes syntactic dependencies into account at the level of the corpus whose relations have to be classified, through its compositional WE, performs much better than the CNN model (WE + CNN), with performance more than 30% higher. These performances are further slightly improved if the FCM model uses a WE that takes into account the dependencies of the training corpus (with Word2Vec-f).

5 Conclusion

Classification of relations between entities remains a complex task. This article focused on relation classification from a corpus of texts by Deep Learning, considering or not syntactic dependencies at different levels. Two levels have been distinguished: the generic WE at the input of the classification model, and the level of the corpus whose relations are to be classified. Two classification models were studied: the first one is based on a CNN and does not take into account the dependencies of the corpus to be treated, and the second one is based on a compositional WE combining a generic WE and a syntactic annotation of this corpus. The impact has been estimated with two generic WEs: the first one trained on the Wikipedia corpus in English with the Word2Vec tool, and the second one trained on the same corpus, previously annotated with syntactic dependencies, and generated by the Word2Vec-f tool. The results of our experiments on the SemEval 2010 corpus show, first of all, that taking dependencies into account is beneficial for the relation classification task, whatever the classification model used. Then, taking them into account at the level of the corpus to be processed, through a compositional Word Embedding, is more efficient, and the results are further slightly improved by using the Word2Vec-f WE as input. Finally, let us recall that a major interest of WEs that contextualize words according to the syntactic dependencies in which they are involved is their concision in terms of vocabulary: they can be 7 to 10 times more compact and therefore more efficient, regardless of the classification model.

References 1. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493– 2537 (2011) 2. De Marneffe, M.C., Manning, C.D.: The stanford typed dependencies representation. In: Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pp. 1–8. Association for Computational Linguistics (2008) 3. Gormley, M.R.: Graphical models with structured factors, neural factors, and approximation-aware training. Ph.D. thesis, Johns Hopkins University (2015) 4. Gormley, M.R., Yu, M., Dredze, M.: Improved relation extraction with featurerich compositional embedding models. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1774–1784. Association for Computational Linguistics (2015) 5. LeCun, Y., et al.: Comparison of learning algorithms for handwritten digit recognition. In: International Conference on Artificial Neural Networks, vol. 60, pp. 53–60. Perth, Australia (1995) 6. Levy, O., Goldberg, Y.: Dependency-based word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 302–308. Association for Computational Linguistics (2014)


7. Liu, C.Y., Sun, W.B., Chao, W.H., Che, W.X.: Convolution neural network for relation extraction. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds.) ADMA 2013. LNCS (LNAI), vol. 8347, pp. 231–242. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-53917-6_21 8. Ma, M., Huang, L., Zhou, B., Xiang, B.: Dependency-based convolutional neural networks for sentence embedding. In: ACL, no. 2, pp. 174–179. The Association for Computer Linguistics (2015) 9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) 10. Nguyen, T.H., Grishman, R.: Relation extraction: perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 39–48 (2015) 11. Yih, W.T., He, X., Meek, C.: Semantic parsing for single-relation question answering. In: ACL, no. 2, pp. 643–648. The Association for Computer Linguistics (2014) 12. Yu, M., Gormley, M.R., Dredze, M.: Factor-based compositional embedding models. In: In NIPS Workshop on Learning Semantics (2014) 13. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: Hajic, J., Tsujii, J. (eds.) COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 23–29 August 2014, Dublin, Ireland, pp. 2335–2344. ACL (2014)

Multilingual Fake News Detection with Satire

Gaël Guibon1(B), Liana Ermakova2, Hosni Seffih3,4, Anton Firsov5, and Guillaume Le Noé-Bienvenu6

1 Aix-Marseille Université, LIS-CNRS UMR 7020, Marseille, France
[email protected]
2 HCTI - EA 4249, Université de Bretagne Occidentale, Brest, France
3 IUT de Montreuil, Université Paris 8, LIASD-EA4383, Saint-Denis, France
4 GeolSemantics, Saint-Denis, France
5 Knoema Corporation Perm, Perm, Russia
https://knoema.com/
6 PluriTAL, Saint-Denis, France

Abstract. The information spread through the Web influences politics, stock markets, public health, people's reputation and brands. For these reasons, it is crucial to filter out false information. In this paper, we compare different automatic approaches for fake news detection based on statistical text analysis on the vaccination fake news dataset provided by the Storyzy company. Our CNN works better for discrimination of the larger classes (fake vs trusted) while the gradient boosting decision tree with feature stacking approach obtained better results for satire detection. We contribute by showing that efficient satire detection can be achieved using merged embeddings and a specific model, at the cost of larger classes. We also contribute by merging redundant information on purpose in order to better predict satire news from fake news and trusted news.

Keywords: Fake news · Deception · Artificial intelligence · Machine learning · Natural language processing · Satire

1 Introduction

According to a survey performed by the French journal Le Monde on March 14–15, 2017 over 12,000 Frenchmen, 66% of the French population were going to vote, and 41% among them had not yet made their choice [21]. According to the data of Médiamétrie, a French audience measurement company, more than 26 million French people connected to social networks to read and share articles and posts in February 2017. Therefore, Web information influences campaigns. Fake news and post-truth are becoming more and more destabilizing and widespread (e.g., the 2016 United States presidential election [12], Brexit [11]). The information spread on the Web can influence not only politics, but also stock markets. In 2013, $130 billion in stock value was lost in a few minutes after an AP tweet claimed that Barack Obama had been injured by an "explosion" [24].


Fake news detection, or rumor detection, is a concept that emerged in the early 2010s, as social media started to have a huge impact on people's views. Different approaches have been used throughout the years to detect fake news. They can be divided into two categories: manual and automatic ones. Facebook decided to manually analyze content after a certain number of users have flagged doubtful information [26]. Merrimack College published a blacklist of web sites providing fake information [16]; this list was integrated into a Google Chrome extension [13]. Numerous web sites and blogs (e.g. Acrimed, HoaxBuster, CrossCheck, the « démonte rumeur » of Rue89) are designed for fact verification. For example, the web site FactCheck.org proposes that a reader verify the sources, author, date and title of a publication [2]. Pariser started a crowdsourcing initiative, "Design Solutions for Fake News", aimed at classifying mass media [23]. However, methods based on manual analysis are often criticized for insufficient control, expertise requirements, and cost in terms of time and money [27]. Such systems need human involvement: the source is flagged unreliable by the community, as in BS Detector, or by specialists, as in PolitiFact [4].

On the other side, automatic methods are not widely used. The preference of large innovation companies like Facebook for manual approaches over automatic methods is indirect evidence of the lower quality of the existing automatic approaches. According to [9], automatic methods of fake news detection are based on linguistic analysis (lexical, syntactical and semantical analysis, discourse analysis by means of Rhetorical Structure Theory, opinion mining) or network study. Various Natural Language Processing (NLP) and classification techniques help achieve maximum accuracy [14]. Criteria-Based Content Analysis, Reality Monitoring, Scientific Content Analysis and Interpersonal Deception Theory provide some keys for the detection of textual fake information [30]. In [30], the authors treated textual features of deception in dialogue; although their work is not directly applicable to fake news detection, since news are monologues, it gives some interesting perspectives. In [5], the authors analyzed nonverbal visual features. Castillo [8] used a feature-based method to assess tweets' credibility. Ma [20] extracted useful features to detect fake news. The last two methods provided satisfying results. Ma [19] used Long Short-Term Memory networks (LSTM) to predict whether a stream of tweets modeled as sequential data were rumors; this approach was more successful than the feature-based one. Rashkin [25] took a linguistic approach to detect fake news, examining lexicon distribution and discovering differences between the language of fake news and trusted news, but the result showed only 22% accuracy. Volkova [28] tried to classify tweets into suspicious news, satire, hoaxes, clickbait and propaganda using neural networks with linguistic features; it turned out that syntax and grammar features have little effect. Pomerleau and Rao created the Fake News Challenge, aimed at detecting incoherence between the title and the content of an article [3], while the task proposed by DiscoverText targets the identification of fake tweets [1].

In this paper, we compare several classification methods based on text analysis on the dataset provided by the start-up specializing in fake news detection


Storyzy1. This dataset was used for a fake news detection task during the French hackathon on Natural Language Processing2. We compare our work to other teams' work, especially on satire detection. During the hackathon, different teams tried to obtain good results in distinguishing fake news from trusted news, in which we ranked second. However, we obtained by far the best scores in satire detection, which is the main contribution of this paper. The second contribution is the usage of merged embeddings and lexical features containing redundancy for improved satire detection.

2 Experimental Framework and Results

We performed our evaluation on the dataset provided by the company Storyzy for a hackathon, in .tsv format. Storyzy specializes in brand image protection by detecting suspicious content on websites that could potentially show the advertisements of the brands. The corpus contains texts from various websites in English and French, as well as automatic transcripts of French YouTube videos about vaccination, which is a widely disseminated topic in false news. The details are given in Table 1. The task is to classify the textual content into 3 possible classes: Fake News, Trusted, or Satire. We compared our results with those of the participants of the hackathon on fake news detection. The main hackathon task was to obtain the best score on the first two classes. However, we consider the satire class as the most interesting and challenging; this is why we have tried to effectively classify all texts, including satire. To do so, we used an experimental approach, focusing on the impact of data representation to find the best way to classify text in English, transcribed French, and French (Fig. 1).

1 https://storyzy.com/?lang=en.
2 HackaTAL: https://hackatal.github.io/2018/.

Table 1. Corpus statistics

Language   Train   Test   Train format                                                                                Test format
English    3828    1277   id, domain, type, uri, author, language, title, text, date, external_uris                   id, title, text
French     705     236    id, domain, type, uri, author, language, title, text, date, external_uris                   id, title, text
YouTube    234     78     video-id, channel-id, video-title, video-view-count, lang, type, channel-title, text, id    id, video-title, text
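For illustration, the .tsv files can be loaded as follows; the file name is an assumption, and only the column names come from Table 1.

```python
import pandas as pd

# Minimal sketch: load a training split and keep the text and its class label.
train_en = pd.read_csv("train_en.tsv", sep="\t")   # hypothetical file name
X_text, y = train_en["text"], train_en["type"]     # type: fake / trusted / satire
```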


Fig. 1. Text resemblance scores. The closer they are, the more difficult they are to discriminate.

Fig. 2. Word clouds for each dataset.

2.1 Text Resemblance

We searched the Web snapshot using the chatNoir API [6] for news titles and compared the first ten results with the news texts after tokenization and lemmatization, using NLTK3 for the English texts. French texts were preprocessed using a tokenizer that splits texts on white spaces and apostrophes and deletes stop words; after that, texts are lemmatized with the NLTK FrenchStemmer. Then, for each query/article pair, we computed a "resemblance ratio" using difflib's SequenceMatcher4.

3 https://www.nltk.org/.


The difflib module contains tools for computing the differences between sequences and working with them. It is especially useful for comparing texts and includes functions that produce difference reports in several common formats. In this way, we obtained a ratio ranging from 0 to 1 that describes how similar the retrieved text is to the original, 1 meaning the exact same text. By applying this method to the train corpus and then averaging the ratios over all trusted, fake, and satire news results, we obtained Fig. 1; Fig. 2 presents the word clouds for the different classes of the train set. Figure 1 shows that trusted news have approximately the same ratio between each of the query texts and the news text (0.021–0.024). On the other hand, even though fake news are relatively similar to the news text, their ratios are irregular (0.018–0.026). The ratios of the satire queries go from 0.029 to 0.037: between query sites and news texts there is little resemblance but more irregularity. We can also observe in Fig. 2 that some subjects are more frequent in satire (e.g. Facebook) or fake news (e.g. autism, aluminum, or mercury within the vaccination topic) than in trusted sources.
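As an illustration, the following is a minimal sketch of the resemblance-ratio computation described above, using difflib's SequenceMatcher; the example texts and the averaging over the top results are invented for the example.

```python
from difflib import SequenceMatcher

def resemblance_ratio(query_text: str, article_text: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the exact same text."""
    return SequenceMatcher(None, query_text, article_text).ratio()

# Average the ratio of an article against its top search results (placeholder texts).
article = "measles vaccine causes autism according to a new study"
results = ["study finds no link between measles vaccine and autism",
           "measles outbreak linked to low vaccination rates"]
avg_ratio = sum(resemblance_ratio(r, article) for r in results) / len(results)
print(round(avg_ratio, 3))
```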

2.2 Domain Type Detection

Using the ChatNoir API to search for the news titles in the web snapshot, we obtained the first ten domain names and tested whether they belong to satire, fake, or trusted websites, using a predefined list of well-known websites tagged as trusted, fake, or satire. To establish this list, we used the domains found in the train corpus and added lists found on the Internet. Our feature is composed of three parameters (Fake Site, Satire Site, Trusted Site); each parameter is initialized to 0 and turns to 1 if a returned domain is found on the corresponding list. The results on the train corpus are given in Table 2. The first observation is that 74% of the fake news title searches lead to fake news websites, 4% to satire websites, and 22% to trusted websites. This 22% rate is due to well-written titles resembling trusted news titles; only once we go into the text do we notice the "fakeness" of the news. Satire news lead to fake news websites 83% of the time, probably because satire is written in the same way as trusted news: sarcasm has to be detected to understand that they are not factual, hence satire gets reported by fake news websites as if it were trusted news.

Table 2. Returned domain type for each news class

           | Fake site              | Satire site            | Trusted site
           | Elements | Percentage  | Elements | Percentage  | Elements | Percentage
Fake news  | 484      | 74%         | 25       | 4%          | 146      | 22%
Satire     | 20       | 83%         | 0        | 0%          | 4        | 17%
Trusted    | 651      | 50%         | 116      | 9%          | 528      | 41%

4 https://docs.python.org/2/library/difflib.html.
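A minimal sketch of the domain-type feature described above; the domain lists are hypothetical placeholders standing in for the lists built by the authors.

```python
# Hypothetical domain lists; the paper builds them from the train corpus and public lists.
FAKE_DOMAINS = {"fakenewssite.example"}
SATIRE_DOMAINS = {"satiresite.example"}
TRUSTED_DOMAINS = {"trustednews.example"}

def domain_type_feature(result_domains):
    """Return the (Fake Site, Satire Site, Trusted Site) indicator triple
    for the domains returned by the title search."""
    feature = [0, 0, 0]
    for domain in result_domains:
        if domain in FAKE_DOMAINS:
            feature[0] = 1
        if domain in SATIRE_DOMAINS:
            feature[1] = 1
        if domain in TRUSTED_DOMAINS:
            feature[2] = 1
    return feature

print(domain_type_feature(["trustednews.example", "fakenewssite.example"]))  # [1, 0, 1]
```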


The biggest problem is that trusted news lead to fake news websites 50% of the time, and to satire websites 9% of the time, since fake news websites usually start from trusted news and then change some facts, using a sort of "p-hacking" to make them match their fake message.

2.3 Classification Results

In order to obtain a classifier capable of better generalizing its predictions, we tried different data representations:
TF-IDF (tf). First, we used a common vectorization by term frequency and inverse document frequency (TF-IDF). We arbitrarily limited the dimensions to 300 and to 12,000 in order to keep the most frequent terms of the vocabulary.
FastText (ft). We trained the FastText [7] model on the training set in order to avoid using external data such as pre-trained embeddings. Using up to bi-grams, we obtained vectors of dimension 100. The training was configured with 5 iterations and ignored words with fewer than 5 occurrences in the corpus.
Word2Vec (wv). We also trained a skip-gram Word2Vec model [22] with a batch of 32 words, 20 iterations, and 300-dimensional vectors. Words with fewer than 5 occurrences were also ignored.
Hashing Trick (hv). Additionally, we used the hashing trick [29] to obtain another data representation, normalized with L2 regularization [15]. The vector dimension was limited to 300.
All these data representations were fed to several classifiers in order to discriminate fake news from trusted and satire news: a Support Vector Machine classifier [10] (SVM) and Light Gradient Boosting Machine5 [17] (LGBM), a state-of-the-art variant of Gradient Boosting Decision Trees. SVM was chosen because it was the method that obtained the best micro F1-score during the hackathon; tree boosting was chosen in order to obtain better overall results with more insight into their explanation. We also tried several neural network architectures (LSTM, CNN) applied to (1) characters represented by a number; (2) character embeddings; (3) word embeddings. The first representation prevents the neural network from training, since characters with similar numbers are interpreted as almost the same by the network. The French+YouTube datasets are not large enough to train an LSTM on character embeddings: we obtained both very high bias and variance (almost random predictions). Among the deep learning approaches, the best results were obtained with the following architecture: Embedding(64) → Conv1D(512, 2, ReLU, dropout = 0.4) → GlobalMaxPooling1D → Dense(3, SoftMax) (CNN). As previously said, this architecture was selected after trying deeper ones; we think the size of the dataset played an important role in the number of layers that could be used in the network.

5 https://github.com/Microsoft/LightGBM.
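The following is a minimal Keras sketch of the CNN architecture quoted above; the vocabulary size and sequence length are placeholders we introduce (they are not given in the paper), and we read "dropout = 0.4" as a dropout layer applied after the convolution.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # placeholder, not specified in the paper
MAX_LEN = 400        # placeholder sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    layers.Conv1D(filters=512, kernel_size=2, activation="relu"),
    layers.Dropout(0.4),                    # our reading of "dropout = 0.4" above
    layers.GlobalMaxPooling1D(),
    layers.Dense(3, activation="softmax"),  # Fake News / Trusted / Satire
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```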


Table 3. Classification schemes summary. FastText (ft), Word2Vec (wv), HashingTrick (hv), Domain Type (dt), Text Resemblance (tr).

Method                     | Data representation   | F1 micro | F1 macro | Cross validation (f1 micro)
Decision Tree (J48)        | Text Resemblance (tr) | 59.79    | 58.63    | 58.33
Decision Tree (J48)        | Domain Type (dt)      | 61.34    | 60.15    | 57.20
Hackathon Winner (SVM)     | tf12K                 | 94.28    | 82.25    | /
Our LGBM during hackathon  | tf300+ft+wv           | 88.56    | 88.56    | 84.76
Optimized LGBM             | tf12K+ft+wv+hv        | 91.39    | 91.87    | 87.02
Optimized LGBM             | tf12K+ft+wv+dt+tr     | 92.01    | 91.36    | 87.52
Optimized Linear SVM       | tf12K+ft+wv+hv        | 93.02    | 91.19    | 88.34
Optimized Linear SVM       | tf12K+ft+wv+dt+tr     | 93.09    | 90.13    | 88.37
CNN                        | /                     | 94.59    | 86.02    | /

Table 3 shows the different approaches we used to better detect fake news, along with the one from the winning team of the hackathon organized by Storyzy. All these scores are based on micro and macro f1-scores and cross validation (CV). In this table, the first two rows show the performance of fake news detection when applying the text resemblance and domain type detection methods described in Sect. 2. The other rows present the classification scores obtained with SVM and gradient boosting trees. Our systems were first tuned based on the cross validation micro f1-score on the training set. A grid search strategy was then used in order to find the best parameters for the optimized versions of our LGBM and SVM classifiers. The grid search parameters were not exhaustive: for the LGBM classifier we tested boosting types such as GOSS and DART, the number of leaves, different learning rates and numbers of estimators, along with Lasso and Ridge regularization values; SVM, on the other hand, was tested with different kernels. The cross validation scores are micro f1-scores from a 5-fold stratified cross validation, so that the results of each method can be compared directly. The lower CV values can be explained by this stratification strategy, in which classes are presented in the same proportion; errors in satire detection therefore strongly influence the macro scores. Table 3 also presents the different data representations. In order to enhance the overall quality of the model's predictions (macro f1-score) without losing too much micro f1-score, we used a stacking approach for data representation. Each acronym in the table refers to a representation strategy and, where multiple sizes were used, to the size of its vectors.
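The following is a minimal sketch of such a grid search for the LGBM classifier with stratified 5-fold cross validation; the feature matrix, labels, and grid values are illustrative placeholders, not the authors' exact settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from lightgbm import LGBMClassifier

# X: stacked document vectors, y: class labels (fake / trusted / satire) -- placeholders here.
X = np.random.rand(200, 50)
y = np.random.randint(0, 3, size=200)

param_grid = {                         # illustrative grid, not the paper's exact values
    "boosting_type": ["gbdt", "goss", "dart"],
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
    "reg_alpha": [0.0, 0.1],           # Lasso (L1) regularization
    "reg_lambda": [0.0, 0.1],          # Ridge (L2) regularization
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LGBMClassifier(), param_grid, scoring="f1_micro", cv=cv)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```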


Thus, the final data representation, named 'tf12K+ft+wv+hv+dt+tr', refers to a tf-idf vectorization (12,000) concatenated along the horizontal axis with FastText vectors (100), Word2Vec skip-gram vectors (300), vectors from the hashing trick (300), the domain type (3), and the text resemblance (9). In the end, every document is represented by a global vector of dimension 12,712 in which some information is obviously duplicated.
Merging Embeddings. Embeddings do not always capture the same kind of information: for instance, FastText uses hierarchical softmax and the hashing trick for bi-grams, allowing the model to capture local order in the word representations, while a skip-gram embedding model captures the surrounding context. This explains why combining them can prove worthwhile, especially for information that is difficult to capture, such as satire.
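As an illustration of this stacking, the sketch below concatenates a tf-idf matrix with dense embedding and feature blocks along the horizontal axis; the embedding matrices are random placeholders standing in for the FastText, Word2Vec, and hashing-trick document vectors.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["vaccines cause autism claims blog", "health agency publishes vaccination schedule"]

tfidf = TfidfVectorizer(max_features=12000).fit_transform(docs)   # tf12K
ft_vectors = np.random.rand(len(docs), 100)     # FastText document vectors (placeholder)
wv_vectors = np.random.rand(len(docs), 300)     # Word2Vec skip-gram vectors (placeholder)
hv_vectors = np.random.rand(len(docs), 300)     # hashing-trick vectors (placeholder)
dt_features = np.array([[1, 0, 0], [0, 0, 1]])  # domain type (3)
tr_features = np.random.rand(len(docs), 9)      # text resemblance (9)

X = hstack([tfidf,
            csr_matrix(ft_vectors), csr_matrix(wv_vectors), csr_matrix(hv_vectors),
            csr_matrix(dt_features), csr_matrix(tr_features)]).tocsr()
print(X.shape)  # up to 12,712 columns, depending on the fitted tf-idf vocabulary
```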

2.4 Result Analysis

The best micro f1-score was achieved by the CNN. However, the optimized LGBM system performs better in macro f1-score. Detailed results for each class are shown in Table 4. It shows the performance differences across classes and how the LGBM classifier better predicts the sparser satire cases, hence leading to a better overall macro f1-score. The CNN thus seems more suitable for discriminating between the larger classes (fake news vs. trusted news), while boosted decision trees can better handle a fine-grained class such as satire.

Table 4. Per class f1-scores

                 | Trusted | Fake news | Satire
Hackathon winner | 95.72   | 92.71     | 58.33
Optimized LGBM   | 92.96   | 88.89     | 93.75
Optimized SVM    | 94.46   | 90.86     | 88.24
CNN              | 95.78   | 93.33     | 68.96

Feature Selection vs Macro F1-Score. Our stacking approach for data representation was initially meant to allow feature selection before training the classifier. To do so, we applied different kinds of feature selection to find the k best features, such as chi-squared selection, mutual information [18], and the ANOVA F-value. However, we found that applying feature selection significantly decreased the macro f1-score, even though it allowed a faster training process and yielded almost the same micro f1-score. Indeed, all the stacked-up minor information was really important for identifying satire documents. This is why we did not apply any feature selection for our optimized classifiers. We believe that, given the nature of decision trees, which associate an importance score with each feature, an inherent feature selection is already performed during the boosting process. This could explain why a brute-force feature-stacking approach works better than feature selection performed beforehand.
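A minimal sklearn sketch of the feature-selection variants mentioned above (chi-squared, mutual information, ANOVA F-value); the data are random placeholders and k is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, f_classif

X = np.abs(np.random.rand(100, 500))   # chi2 requires non-negative features
y = np.random.randint(0, 3, size=100)

for name, score_fn in [("chi2", chi2),
                       ("mutual information", mutual_info_classif),
                       ("ANOVA F-value", f_classif)]:
    X_k = SelectKBest(score_fn, k=100).fit_transform(X, y)  # keep the k best features
    print(name, X_k.shape)
```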


Scaling vs Stacking. Usually, vectors have to be scaled around 0 for SVM, or set to a minimum–maximum range between 0 and 1 for decision trees, in order to gain performance. However, we observed the same behavior as for feature selection. Moreover, scaling stacked-up vectors could be quite inappropriate, as they represent different types of information obtained by different methods. Thus, the data was not scaled at all, to better preserve the information gain.
Bi-lingual Classification. The corpus is composed of texts in English and French (YouTube transcripts included), and the sources also vary from websites to transcribed video text. This particularity encouraged participants to create a system easily adaptable to any language given a training set, or even to obtain a language-agnostic system. We chose the first option in order to take lexical features into account and see their impact on the final classification.
Comparison with Related Work. As stated in the second part of the introduction (Sect. 1), several works on automatic fake news detection exist. The work described in this paper differs from credibility analysis [8] because of the satire detection, which is more subtle than a credibility score. Indeed, satire is credible in its own way; only the nature of the information differs, as the purpose of the writer is not to be taken seriously. Rashkin [25] used LSTM to detect intents and degrees of truth over a range of 6 classes. Although their corpus contained some satire news, they did not try to detect them precisely, which is why we cannot fully compare with their work. Moreover, we did not use scaled scores but only a news type classification over 3 classes: fake, trusted, and satire. With our models, we show that LGBM can be really efficient at detecting satire when combined with a good representation of the data, and can surpass CNN models on this particular task, but not on the classical Trusted/Fake classification.

3 Conclusion

The false information traveling through the Web impacts politics, stock markets, and the reputation of people and brands, and can even damage public health, which makes filtering it out a burning issue. In this paper, we compared different automatic approaches for fake news detection based on statistical text analysis, using the dataset about vaccination provided by the startup Storyzy. Text resemblance and domain type detection alone cannot handle all the ambiguity. To better separate Fake/Trusted from Satire, we needed to combine several classification methods, including text mining ones. The feature-based methods can be improved by adding more sites to the fake, satire, and trusted lists, and by replacing the resemblance ratio method with a more powerful one. The LGBM classifier better predicts the sparser satire cases, while neural networks perform better on bigger corpora, but only for the fake and trusted classification. Mapping a text into a vector of hashes of characters or words is not appropriate for deep learning methods; one-hot representations and embeddings provide better results.


On the small datasets, LSTMs are not effective, and smaller CNNs provide better results than more complicated networks on this corpus. Finally, we showed that good satire detection can be achieved automatically by combining Gradient Boosting Decision Trees with merged embeddings of different types. We also showed that the redundancy introduced by these different embeddings helps the classifier to better separate satire news from fake and trusted news.

References 1. Fake News Student Twitter Data Challenge | DiscoverText, December 2016. http:// discovertext.com/2016/12/28/fake-news-detection-a-twitter-data-challenge-for-st udents/ 2. How to spot fake news, November 2016. http://www.factcheck.org/2016/11/howto-spot-fake-news/ 3. Fake news challenge (2017). http://www.fakenewschallenge.org/ 4. Adair, B.: Principles of politifact and the truth-o-meter. PolitiFact.com. February 21, 2011 (2011) 5. Atanasova, M., Comita, P., Melina, S., Stoyanova, M.: Automatic Detection of Deception. Non-verbal communication (2014). https://nvc.uvt.nl/pdf/7.pdf 6. Bevendorff, J., Stein, B., Hagen, M., Potthast, M.: Elastic ChatNoir: search engine for the ClueWeb and the common crawl. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 820–824. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_83 7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017) 8. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on twitter. In: Proceedings of the 20th International Conference on World Wide Web, pp. 675–684. ACM (2011) 9. Conroy, N., Rubin, V., Chen, Y.: Automatic deception detection: methods for finding fake news (2015) 10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 11. Deacon, M.: In a world of post-truth politics, andrea leadsom will make the perfect PM, September 2016, http://www.telegraph.co.uk/news/2016/07/09/in-a-worldof-post-truth-politics-andrea-leadsom-will-make-the-p/ 12. Egan, T.: The post-truth presidency, April 2016. http://www.nytimes.com/2016/ 11/04/opinion/campaign-stops/the-post-truth-presidency.html 13. Feldman, B.: Here’s a chrome extension that will flag fake-news sites for you (2016), http://nymag.com/selectall/2016/11/heres-a-browser-extension-that-willflag-fake-news-sites.html 14. Gahirwal, M., Moghe, S., Kulkarni, T., Khakhar, D., Bhatia, J.: Fake news detection. Int. J. Adv. Res. Ideas Innov. Technol. 4(1), 817–819 (2018) 15. Hoerl, A., Kennard, R.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970) 16. Hunt, E.: What is fake news? How to spot it and what you can do to stop it. The Guardian, December 2016. https://www.theguardian.com/media/2016/dec/ 18/what-is-fake-news-pizzagate


17. Ke, G., et al.: Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 3149–3157 (2017) 18. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E 69(6), 066138 (2004) 19. Ma, J., et al.: Detecting rumors from microblogs with recurrent neural networks. In: IJCAI, pp. 3818–3824 (2016) 20. Ma, J., Gao, W., Wei, Z., Lu, Y., Wong, K.F.: Detect rumors using time series of social context information on microblogging websites. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1751–1754. ACM (2015) 21. Mandonnet, E., Paquette, E.: Les candidats face aux intox de Web. L’Express 3429, 28–31 (2017) 22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 23. Morris, D.: Eli Pariser’s crowdsourced brain trust is tackling fake news | Fortune.com (2016). http://fortune.com/2016/11/27/eli-pariser-fake-news-braintrust/ 24. Rapoza, K.: Can ’Fake News’ impact the stock market? February 2017. https:// www.forbes.com/sites/kenrapoza/2017/02/26/can-fake-news-impact-the-stock-m arket/ 25. Rashkin, H., Choi, E., Jang, J., Volkova, S., Choi, Y.: Truth of varying shades: analyzing language in fake news and political fact-checking. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2931–2937 (2017) 26. Solon, O., Wong, J.: Facebook’s plan to tackle fake news raises questions over limitations. The Guardian, December 2016. https://www.theguardian.com/ technology/2016/dec/16/facebook-fake-news-system-problems-fact-checking 27. Twyman, N., Proudfoot, J., Schuetzler, R., Elkins, A., Derrick, D.: Robustness of multiple indicators in automated screening systems for deception detection. J. Manage. Inf. Syst. 32(4), 215–245 (2015). http://www.jmis-web.org/articles/1273 28. Volkova, S., Shaffer, K., Jang, J., Hodas, N.: Separating facts from fiction: linguistic models to classify suspicious and trusted news posts on twitter. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 647–653 (2017) 29. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J.: Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1113–1120. ACM (2009) 30. Zhou, L., Burgoon, J., Nunamaker, J., Twitchell, D.: Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication. Group Decis. Negot. 13, 81–106 (2004)

Active Learning to Select Unlabeled Examples with Effective Features for Document Classification Minoru Sasaki(B) Department of Computer and Information Science, Ibaraki University, 4-12-1, Nakanarusawa, Hitachi, Ibaraki 316-8511, Japan [email protected]

Abstract. In this paper, for the document classification task and text mining based on machine learning, I propose a new pool-based active learning method to select unlabeled data that have effective features not found in the training data. Given a small set of training data and a large set of unlabeled data, the active learner selects from the unlabeled data the most uncertain data that have effective features not found in the training data and asks for their labels. After repeatedly capturing such uncertain data from the unlabeled data, I apply the existing pool-based active learning to select training data from the unlabeled data efficiently. Therefore, by adding data with effective features from the unlabeled data to the training data, I consider that the performance of pool-based active learning can be improved. To evaluate the efficiency of the proposed method, I conduct experiments and show that my active learning method achieves consistently higher accuracy than the existing algorithm. Keywords: Document classification · Active learning · Text mining

1 Introduction

In recent years, research has addressed various automatic classification problems such as spam mail filtering and document classification. To solve these problems, machine learning methods are applied. In machine learning, many training data are needed to construct a classifier that can correctly predict the classes of new objects. For document classification and text mining based on supervised learning, learning algorithms require enough labeled training data to construct a classifier. However, it is hard to obtain a large amount of labeled data, and labeling a large number of data is time-consuming and costly. In addition to the quantity of the training data, the quality of the training data used to obtain the classifier is critical for accurate and repeatable results. Even when a large number of labeled data are available, sometimes a good classifier cannot be obtained. Therefore, I need to improve the quality of the training data and reduce the amount of noise to achieve better classifier performance. To overcome the above labeling problems, active learning techniques have been successfully applied to classification problems to reduce the labeling effort.


Fig. 1. Proposed active learning method to select unlabeled data that have effective features not found in the training data

Active learning aims to automatically select the next document to label for training accurate classifiers. Though there are many variants of active learning in the literature, the focus of this article is pool-based active learning, the model which is most widely used [5, 6, 9]. Given a small set of training data L and a large set of unlabeled data U, a classifier is trained with L, and the active learner selects from U the data that are most uncertain for the classifier and asks for their labels. L is augmented with these data, and the process is repeated until a stopping criterion is met. However, in this method, the uncertainty of unlabeled data is calculated using only the features present in the training data. In a previous paper [9], it was reported that active learning based on insufficient-evidence uncertainty performs worse than the existing uncertainty sampling. Insufficient-evidence uncertainty represents the uncertainty of a model due to insufficient evidence for each class; insufficient-evidence uncertain instances do not have features in the training data that are effective for classification. In this case, it is difficult to select informative samples from an unlabeled data pool. Therefore, in the early iterations of active learning, the existing methods tend not to improve the predictive performance of the classifier (Fig. 1, left). By solving this problem, I consider that an optimal classifier can be learned efficiently from an initially small training data set. In this paper, I propose a new pool-based active learning method to select unlabeled data that have effective features not found in the training data, for the document classification task and text mining based on machine learning. Given a small set of training data L and a large set of unlabeled data U, the active learner selects from U the most uncertain data that have effective features not found in L and asks for their labels (Fig. 1, right). After repeatedly capturing such uncertain data from U, I apply the existing pool-based active learning to select training data from U efficiently. Therefore, by adding data with effective features from the unlabeled data to the training data, I consider that the performance of pool-based active learning can be improved. The rest of the paper is organized as follows. In Sect. 2, I introduce the active learning framework and uncertainty sampling used in this paper. In Sect. 3, I describe the proposed method. In Sect. 4, an outline of the experiments and the experimental results are presented. Finally, Sect. 5 concludes this paper.


2 Related Works

In this section, I provide a brief description of active learning in the context of classification and review the relevant literature.

2.1 Active Learning

The main task of active learning is to automatically select informative instances in order to reduce the sample complexity efficiently. By using active learning techniques, the number of labeled examples required by machine learning algorithms can be reduced. Active learning can be divided into two main settings: stream-based active learning and pool-based active learning [7]. In stream-based active learning, each instance is sampled from some distribution in a streaming manner, and the learner has to decide whether to label this instance or discard it immediately [2]. In the pool-based setting, at each selection step the active learner chooses one or more instances from a large pool of unlabeled instances, adds them to the training set, and retrains the model. In this work, I focus on uncertainty sampling for pool-based active learning, which is one of the most common setups in active learning [5].

2.2 Uncertainty Sampling

One intuitive approach in pool-based active learning is uncertainty sampling, which selects the instance that the learner is most uncertain about [5]. Among previous uncertainty sampling methods, a popular strategy employs entropy to evaluate the uncertainty [1, 10, 12]. This entropy measure can easily be applied to probabilistic multi-label classifiers for complex structured instances [3, 8]. In a recent paper, the entropy measure is used to evaluate the uncertainty of random variables via random walks on a graph [11]. Within uncertainty sampling, Sharma et al. [9] distinguish between two types of uncertainty, conflicting-evidence uncertainty and insufficient-evidence uncertainty, to improve the performance of uncertainty sampling. Their experiments showed that conflicting-evidence uncertain instances are effective for classification. However, the method that selects uncertain instances due to insufficient evidence performs worse than the baseline method that selects the t-th most uncertain instance for almost all datasets. In this paper, to solve this problem of insufficient-evidence uncertainty, I propose a novel active learning framework that selects unlabeled data that have effective features not found in the training data.

3 Proposed Method

In this section, I describe the algorithm of my proposed method to select unlabeled data that have effective features not found in the training data.


Figure 3 shows the proposed active learning framework. First, I randomly select five positive examples D_P and five negative examples D_N as the initial training set D_T. D_UL is a pool of unlabeled data samples, F_UL is the set of features that appear in the unlabeled data D_UL but do not appear in the training data D_T, and D_FUL is the set of examples in D_UL that contain features from F_UL. The proposed method first constructs a classification model M from the training data D_T using a Naïve Bayes classifier. If the set F_UL is not empty, for each unlabeled data point x in D_FUL the conditional entropy H(x) is calculated as an uncertainty measure:

H(x) = - \sum_{i=1}^{Y} p(y_i | x) \log p(y_i | x).

The method selects the candidate unlabeled instance x_max that has the largest conditional entropy. This candidate instance x_max is labeled with the correct answer and added to D_T, and the process is repeated until D_FUL becomes empty. If D_FUL becomes empty before the size of the training set n_{D_T} exceeds the maximum size n_max, the proposed method chooses a representative instance for which the classifier M is uncertain due to conflicting evidence. For each unlabeled data point y in D_UL, the conditional entropy H(y) is calculated and the method extracts the top t most uncertain unlabeled examples. Among the top t uncertain instances T, the proposed method chooses the representative instance z_max for which the model is uncertain due to conflicting evidence:

z_max = \arg\max_{z \in T} \left( \log E_{+1}(z) + \log E_{-1}(z) \right),

where the scores for an example x to belong to the positive and the negative class are

\log E_{+1}(x) = \sum_{x_j \in Pos_x} \log \frac{p(x_j | +1)}{p(x_j | -1)},

\log E_{-1}(x) = \sum_{x_j \in Neg_x} \log \frac{p(x_j | -1)}{p(x_j | +1)}.

Pos_x and Neg_x contain the attribute values x_j of the example x that provide evidence for the positive class and the negative class, respectively. Then, the example z_max is labeled with the correct answer and added to D_T.
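The following is a minimal sketch of the first selection step described above, assuming bag-of-words count matrices for D_T and the unlabeled pool; it covers the unseen-feature filter and the entropy criterion but omits the conflicting-evidence step and the labeling loop, and the function names are ours.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def entropy(probs):
    """Conditional entropy H(x) = -sum_i p(y_i|x) log p(y_i|x), row-wise."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def select_candidate(X_train, y_train, X_pool):
    """Pick the pool example with the largest entropy among those containing
    at least one feature that never appears in the current training data."""
    model = MultinomialNB().fit(X_train, y_train)
    unseen_features = (X_train.sum(axis=0) == 0)          # features absent from D_T
    has_unseen = (X_pool[:, unseen_features].sum(axis=1) > 0)
    candidates = np.where(has_unseen)[0]
    if candidates.size == 0:
        candidates = np.arange(X_pool.shape[0])           # fall back to the whole pool
    H = entropy(model.predict_proba(X_pool[candidates]))
    return candidates[np.argmax(H)]

# Toy term-count matrices (rows: documents, columns: term frequencies).
X_train = np.array([[2, 0, 1, 0], [0, 3, 0, 0]])
y_train = np.array([1, 0])
X_pool = np.array([[1, 1, 0, 2], [0, 2, 1, 0]])
print(select_candidate(X_train, y_train, X_pool))
```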

4 Experiments

To evaluate the effectiveness of the proposed active learning method, I perform several experiments and compare the results with those of the previous method.


4.1 Data Set

In this paper, I experimented with two publicly available datasets from the UCI machine learning repository that can be used for text classification and text mining: Spambase1 and Internet Advertisements2. The Spambase dataset consists of 4601 email messages, of which 1813 are spam and 2788 are non-spam emails. Each email has 57 numeric features that indicate the frequency of spam-related terms and the lengths of uninterrupted sequences of capital letters.

Fig. 2. Experimental result on the Spambase dataset

1 https://archive.ics.uci.edu/ml/datasets/spambase.
2 https://archive.ics.uci.edu/ml/datasets/internet+advertisements.



Fig. 3. Proposed uncertainty sampling

The Internet Advertisements dataset is a popular dataset for predicting whether a given image is an advertisement or not [4]. It contains 3279 examples and 1558 features, which include phrases occurring in the URL, the anchor text, words occurring near the anchor text, the geometry of the image, and so on. Since about 28% of the examples contain missing values, I conduct experiments using the 2359 examples without missing values.

4.2 Experiments on Active Learning

I experimented with four uncertainty sampling methods: the proposed method and three existing methods (conflicting-evidence uncertainty, insufficient-evidence uncertainty, and uncertainty sampling) [9]. I evaluated each method using a multinomial Naïve Bayes classifier for class probability estimation. In the experiments, I use five-fold cross validation to evaluate the proposed method, dividing the whole dataset into five equal-size subsets. From 80% of the data, five positive examples D_P and five negative examples D_N are selected as the initial training set D_T, and the rest of the data is used as the pool of unlabeled data D_UL. The remaining 20% of the data is used as test data. The uncertainty sampling methods for conflicting-evidence and insufficient-evidence uncertainty operate within the top 10 uncertain instances given the initial training set. The maximum training data size n_max in Algorithm 2 was set to 500 instances. For each fold, the experiment was repeated five times, and the final score is the average precision over the five results.



Fig. 4. Experimental result on the internet advertisements

4.3 Experimental Results

In this section, I present the experimental results for the proposed method and the three existing methods. Figure 2 shows the average learning curves of all methods on the Spambase dataset. The results show that the proposed method achieves consistently higher accuracy than the existing methods in the early stage of the iteration process, leading to higher final accuracy overall. Since the Spambase dataset has only a small number of features, all the features appear in the training data in the early iterations of active learning (about 10 examples). By extracting features that are not included in the training data, the proposed method can perform uncertainty sampling more effectively than the existing methods. However, as the number of iterations increases, the average precision tends to vary between 80% and 90% for any method. In the future, I would like to find the optimal number of iterations to improve the performance of the proposed method. Figure 4 shows the average learning curves of all methods on the Internet Advertisements dataset. The learning curve of the proposed method fluctuates while it selects the most uncertain examples that have features not found in the training data. However, the results show that the proposed method achieves consistently higher accuracy from the middle of the learning iterations. Even when a model is uncertain because it does not have sufficient evidence for either class, uncertainty sampling for insufficient evidence performs significantly worse than the other methods. The proposed method, however, can solve this sampling problem of insufficient-evidence uncertainty and improve the accuracy of active learning in the early iterations.


5 Conclusion

In this paper, I proposed a pool-based active learning method to select unlabeled data that have effective features not found in the training data. In traditional uncertainty sampling, the uncertainty of unlabeled data is calculated in the feature space generated by the training data. By adding data with effective features using the proposed method, I consider that the performance of pool-based active learning can be improved. To evaluate the efficiency of the proposed method, I conducted experiments comparing it with the baseline method. The results on the Spambase dataset show that my active learning algorithm achieves consistently higher accuracy than the existing algorithm in the early stage of the iteration process, leading to higher final accuracy overall. Therefore, the proposed method is shown to be effective for active learning. On the other dataset, Internet Advertisements, the classification accuracy stabilizes near its highest point after switching to the conventional method. These results indicate that selecting unlabeled data that have features not found in the training data before switching to the conventional method is effective.

References 1. Chen, J., Schein, A., Ungar, L., Palmer, M.: An empirical study of the behavior of active learning for word sense disambiguation. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL 2006, pp. 120–127. Association for Computational Linguistics, Stroudsburg (2006) 2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Mach. Learn. 15(2), 201–221 (1994) 3. Hwa, R.: Sample selection for statistical parsing. Comput. Linguist. 30(3), 253–276 (2004) 4. Kushmerick, N.: Learning to remove internet advertisements. In: Proceedings of the Third Annual Conference on Autonomous Agents, AGENTS 1999, pp. 175–181. ACM, New York (1999). https://doi.org/10.1145/301136.301186 5. Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1994, pp. 3–12. Springer-Verlag, New York (1994). https://doi. org/10.1007/978-1-4471-2099-5 1 6. McCallum, A., Nigam, K.: Employing em and pool-based active learning for text classification. In: Proceedings of the Fifteenth International Conference on Machine Learning, ICML 1998, pp. 350–358. Morgan Kaufmann Publishers Inc., San Francisco (1998) 7. Settles, B.: Active learning literature survey. Technical report (2010) 8. Settles, B., Craven, M.: An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, pp. 1070–1079. Association for Computational Linguistics, Stroudsburg (2008) 9. Sharma, M., Bilgic, M.: Evidence-based uncertainty sampling for active learning. Data Mining Knowl. Disc. 31(1), 164–202 (2017)


10. Tang, M., Luo, X., Roukos, S.: Active learning for statistical natural language parsing. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 120–127. Association for Computational Linguistics, Stroudsburg (2002) 11. Yang, Y., Ma, Z., Nie, F., Chang, X., Hauptmann, A.G.: Multi-class active learning by uncertainty sampling with diversity maximization. Int. J. Comput. Vision 113(2), 113–127 (2015) 12. Zhu, J., Hovy, E.: Active learning for word sense disambiguation with methods for addressing the class imbalance problem. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)

Effectiveness of Self Normalizing Neural Networks for Text Classification Avinash Madasu and Vijjini Anvesh Rao(B) Samsung R&D Institute, Bangalore, India {m.avinash,a.vijjini}@samsung.com

Abstract. Self Normalizing Neural Networks (SNN), proposed on Feed Forward Neural Networks (FNN), outperform regular FNN architectures in various machine learning tasks. Particularly in the domain of Computer Vision, the activation function Scaled Exponential Linear Units (SELU) proposed for SNNs performs better than other non-linear activations such as ReLU. The goal of SNN is to produce a normalized output for a normalized input. Established neural network architectures like feed forward networks and Convolutional Neural Networks (CNN) lack the intrinsic ability to normalize outputs, hence requiring additional layers such as Batch Normalization. Despite the success of SNNs, their characteristics on other network architectures like CNN have not been explored, especially in the domain of Natural Language Processing. In this paper we aim to show the effectiveness of the proposed Self Normalizing Convolutional Neural Networks (SCNN) on text classification. We compare their performance with the standard CNN architecture used on several text classification datasets. Our experiments demonstrate that SCNN achieves results comparable to the standard CNN model with significantly fewer parameters. Furthermore, it also outperforms CNN with an equal number of parameters. Keywords: Self normalizing neural networks · Convolutional neural networks · Text classification

1 Introduction

The aim of Natural Language Processing (NLP) is to analyze and extract information from textual data in order to make computers understand language the way humans do. Unlike images, which lack sequential patterns, texts contain an abundance of such information, which makes their processing very distinctive. The level of processing varies from the paragraph level to the sentence, word, and character levels. Deep neural network architectures have achieved state-of-the-art results in many areas like Speech Recognition [1] and Computer Vision [2]. The use of neural networks in Natural Language Processing can be traced back to [3], where the backpropagation algorithm was used to make networks learn familial relations. The major advancement was when [4] applied neural networks to represent words in a distributed compositional manner.


[5] proposed two neural network models, CBoW and Skip-gram, for an efficient distributed representation of words. This was a major breakthrough in the field of NLP. Since then, neural network architectures have achieved state-of-the-art results in many NLP applications like Machine Translation [6], Text Summarization [7] and Conversation Models [8]. Convolutional Neural Networks [9] were devised primarily for dealing with images and have shown remarkable results in the field of computer vision [10,11]. In addition to their contribution to image processing, their effectiveness in natural language processing has also been explored, and they have shown strong performance in sentence [12] and text classification [13]. The intuition behind Self Normalizing Neural Networks (SNN) is to drive the neuron activations across all layers to emit a zero-mean and unit-variance output. This is done with the help of the activation proposed in SNNs, SELU, or scaled exponential linear units. With the help of SELUs, an effect akin to batch normalization is replicated, hence slashing the number of parameters while providing robust learning. Special dropouts and initialization also help in this learning, which makes SNNs remarkable compared to traditional neural networks. As image-based and text-based inputs differ from each other in form and characteristics, in this paper we propose certain revisions to the SNN architecture to make it work efficiently on texts. To explore the effectiveness of self normalizing neural networks in text classification, we propose an architecture, the Self Normalizing Convolutional Neural Network (SCNN), built upon convolutional neural networks. A thorough study of SCNNs on various benchmark text datasets is paramount to ascertain the importance of SNNs in Natural Language Processing.

2 Related Work

Prior to the success of deep learning, text classification heavily relied on good feature engineering and various machine learning algorithms. Convolutional Neural Networks [9] were devised primarily for dealing with images and have shown remarkable results in the field of computer vision [10,11]. In addition to their contribution to image processing, their effectiveness in natural language processing has also been explored and shown to yield strong performance. Kim [12] represented an input sentence using word embeddings stacked into a two-dimensional structure where the length corresponds to the embedding size and the height to the average sentence length. Processing this structure using kernel filters of fixed window sizes, with a max pooling layer on top to capture the most important information, has shown promising results in text classification. Additionally, very deep CNN architectures [13] have shown state-of-the-art results in text classification, significantly reducing the error percentage. As CNNs are limited to fixed window sizes, [14] proposed a recurrent convolutional architecture to exploit the advantages of recurrent structures, which capture distant contextual information in ways fixed windows may not be able to.


Klambauer [15] proposed Self Normalizing Neural Networks (SNN) upon feed forward neural networks (FNN). SNN significantly outperformed FNN architectures on various machine learning tasks. Since then, the activation proposed in SNNs, SELU, has been widely studied in image processing [16–18], where it has been applied to CNNs to achieve better results. SELU's effectiveness has also been explored in text processing tasks [19–21]. However, these applications are limited to utilizing SELUs as the activation in their respective architectures.

3 Self-Normalizing Neural Networks

Self-Normalizing Neural Networks (SNN) were introduced by Günter Klambauer [15] to learn higher-level abstractions. Regular neural network architectures like Feed Forward Neural Networks (FNN) and Convolutional Neural Networks (CNN) lack the property of normalizing outputs and require additional layers like Batch Normalization [22] to normalize hidden layer outputs. SNN are specialized neural networks in which the neuron activations automatically converge to a fixed mean and variance. Training of deep CNNs can be efficiently stabilized by using batch normalization and Dropout [23]. However, FNN suffer from high variance when trained with these normalization techniques. In contrast, SNN are very robust to high variance, thereby inducing variance stabilization and overcoming problems like exploding gradients [15]. SNN differ from naive FNN in the following respects:

3.1 Input Normalization

To get a normalized output in SNN without requiring layers like batch normalization, the inputs are normalized.

3.2 Initialization

Weight initialization is an important step in training neural networks. Several initialization methods like glorot uniform [24] and lecun normal [25] have been proposed. FNN and CNN are generally initialized using glorot uniform, whereas SNN are initialized using lecun normal. Glorot uniform initialization draws samples centered around 0 with standard deviation

stddev = \sqrt{2 / (in + out)},   (1)

while lecun normal initialization draws samples centered around 0 with standard deviation

stddev = \sqrt{1 / in},   (2)

where in and out represent the dimensions of the weight matrix corresponding to the number of nodes in the previous and current layer respectively.
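As a small numeric illustration of Eqs. (1) and (2); the layer sizes below are arbitrary.

```python
import math

def glorot_uniform_stddev(fan_in: int, fan_out: int) -> float:
    return math.sqrt(2.0 / (fan_in + fan_out))   # Eq. (1)

def lecun_normal_stddev(fan_in: int) -> float:
    return math.sqrt(1.0 / fan_in)               # Eq. (2)

# e.g. a dense layer mapping 300 inputs to 100 outputs
print(round(glorot_uniform_stddev(300, 100), 4))
print(round(lecun_normal_stddev(300), 4))
```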

3.3 SELU Activations

Scaled Exponential Linear Units (SELU) is the activation function proposed in SNNs. In general, FNN and CNN use rectified linear units (ReLU) as activation. ReLU clips negative values to 0 and hence suffers from the dying ReLU problem1. As explained in [15], an activation function should contain both positive and negative values for controlling the mean, saturation regions for reducing high variance, and a slope greater than one to increase the variance when it is too small. Hence, the SELU activation was introduced to preserve the aforementioned properties. The SELU activation function is defined as

selu(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}   (3)

where x denotes the input, α (α = 1.6733) and λ (λ = 1.0507) are hyperparameters, and e stands for the exponential.

3.4 Alpha Dropout

Standard dropout [23] drops neurons randomly by setting their weights to 0 with probability 1 − p. This prevents the network from setting the mean and variance to a desired value. Standard dropout works very well with ReLUs because, for them, zero lies in the low-variance region and is the default value. For SELU, standard dropout does not fit well because the default low-variance value is lim_{x→−∞} selu(x) = −λα = α′ [15]. Hence, alpha dropout was proposed, which randomly sets input values to α′. Alpha dropout restores the original values of the mean and variance, thereby preserving the self-normalizing property [15]. In this way, alpha dropout suits SELU by randomly pushing activations to the negative saturation value.

4 Model

We propose the Self-Normalizing Convolutional Neural Network (SCNN) for text classification, as shown in Fig. 1. To show the effectiveness of our proposed model, we adapted the standard CNN architecture used for text classification to SCNN with the following changes:

4.1 Word Embeddings are Not Normalized

Self-Normalizing Neural Networks require inputs to be normalized for the outputs to be normalized [15]. Normalization of inputs works very well in computer vision because images are represented as pixel values which are independent of the neighbourhood pixels.

1 http://cs231n.github.io/neural-networks-1/.


In contrast, the word embedding of a particular word is created based on its co-occurrence with its context. Words exhibit strong dependencies with their neighbouring words. Similar words share similar contexts and are hence close to each other in the embedding space. If word embeddings are normalized, these dependencies are disturbed and the normalized values will not correctly represent the semantics of the words.

4.2 ELU Activation as an Alternative to SELU

The SELU activation originally proposed for SNN [15] preserves the properties of SNN if the inputs are normalized. When inputs are normalized, applying SELU to the activations does not shift the mean. However, if the inputs are not normalized, the parameter λ in the SELU activation scales the neuron outputs by a factor λ, thereby shifting the mean and variance away from the desired values. These values are propagated to further layers, shifting the mean and variance more and more. Since input word embeddings cannot be normalized, as explained in Sect. 4.1, we use the ELU activation [26] in the proposed SCNN model instead. The ELU activation function is defined as

elu(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}   (4)

where α is a hyperparameter and e stands for the exponential. The absence of the parameter λ in ELU prevents further scaling of the neuron outputs. The ELU activation pushes the mean of the activations closer to zero even if the inputs are not normalized, which enables faster learning [26]. We compare the performance of SCNN with both SELU and ELU activations; the results are presented in Table 3 and Fig. 2.
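A small NumPy sketch of Eqs. (3) and (4); α = 1.0 is used as the common ELU default, since the paper leaves it as a hyperparameter.

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507   # SELU hyperparameters quoted in Eq. (3)

def elu(x, alpha=1.0):
    # Eq. (4); alpha = 1.0 is the common default, the paper leaves it unspecified
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=ALPHA, lam=LAMBDA):
    # Eq. (3); identical shape to ELU but scaled by lambda
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(elu(x))
print(selu(x))   # the lambda scaling discussed above
```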

4.3 Model Architecture

The SCNN architecture is shown in Fig. 1. Let V be the vocabulary size considered for each dataset and X ∈ R^{V×d} the word embedding matrix, where each word X_i is a d-dimensional word vector. Words present in the pre-trained word embeddings2 are assigned their corresponding word vectors; words that are not present are initialized to 0s. Based on our experiments, SCNN showed better performance when absent words are initialized to 0s rather than randomly. A maximum length of N is considered per sentence or paragraph; if the sentence or paragraph is shorter than N, zero padding is applied. Therefore, an I ∈ R^{N×d} dimensional vector per sentence or paragraph is provided as input to the SCNN model. A convolution operation with kernels K ∈ R^{h×d} (h ∈ {3, 4, 5}) is applied to the input vectors with a window size of h. The weights of these kernels are initialized using lecun normal [25] and the biases are initialized to 0.

2 https://code.google.com/archive/p/word2vec/.


Fig. 1. Architecture of proposed SCNN model

A new feature vector C ∈ R^{(N−h+1)×1} is obtained after the convolution operation for each filter:

C = f(I ∗ K),   (5)

where f represents the activation function (ELU). The number of convolution filters varies depending on the dataset; Table 2 summarizes the number of parameters for all of our experiments. A max-pooling operation is applied across each filter C to get the maximum value, and the outputs of the max-pooling layer across all filters are concatenated. Alpha dropout [15] with a dropout value of 0.5 is applied to the concatenated layer. The concatenated layer is densely connected to the output layer, with a sigmoid activation if the task is binary classification and softmax otherwise.
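A minimal Keras sketch of the SCNN architecture described above, under assumptions: the vocabulary size and sequence length are placeholders, and the embedding layer is randomly initialized here rather than loaded from the pre-trained word2vec vectors used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, EMB_DIM, VOCAB = 100, 300, 20000      # placeholders
N_FILTERS, N_CLASSES = 70, 2                   # 70 filters per kernel size (MR/SO/IMDB/TREC)

inputs = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB, EMB_DIM)(inputs)  # the paper initializes this from word2vec
pooled = []
for h in (3, 4, 5):                             # kernel window sizes
    c = layers.Conv1D(N_FILTERS, h, activation="elu",
                      kernel_initializer="lecun_normal",
                      bias_initializer="zeros")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(c))
x = layers.Concatenate()(pooled)
x = layers.AlphaDropout(0.5)(x)                 # alpha dropout on the concatenated layer
out = layers.Dense(1 if N_CLASSES == 2 else N_CLASSES,
                   activation="sigmoid" if N_CLASSES == 2 else "softmax")(x)

scnn = tf.keras.Model(inputs, out)
scnn.compile(optimizer="adam",
             loss="binary_crossentropy" if N_CLASSES == 2 else "sparse_categorical_crossentropy",
             metrics=["accuracy"])
scnn.summary()
```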

5 Experiments and Datasets

Table 1. Summary statistics of all datasets

Datasets | No. of classes | Dataset size | Test
MR       | 2              | 10662        | 10-fold cross validation
SO       | 2              | 10000        | 10-fold cross validation
IMDB     | 2              | 50000        | 25000
TREC     | 6              | 5952         | 500
CR       | 2              | 3773         | 10-fold cross validation
MPQA     | 2              | 10604        | 10-fold cross validation

Table 2. Parameters of examined models on all datasets

Datasets | No of conv. filters                  | No of parameters
         | SCNN and Short-CNN | Static CNN [12] | SCNN and Short-CNN | Static CNN [12]
MR       | 210                | 300             | ≈ 254k             | ≈ 362k
SO       | 210                | 300             | ≈ 254k             | ≈ 362k
IMDB     | 210                | 300             | ≈ 254k             | ≈ 362k
TREC     | 210                | 300             | ≈ 254k             | ≈ 362k
CR       | 90                 | 300             | ≈ 108k             | ≈ 362k
MPQA     | 90                 | 300             | ≈ 108k             | ≈ 362k

5.1 Datasets

We performed experiments on various benchmark datasets for text classification. The summary statistics of the datasets are shown in Table 1.
Movie Reviews (MR). It consists of 10662 movie reviews, with 5331 positive and 5331 negative reviews [27]. The task involves classifying reviews into positive or negative sentiment.
Subjectivity Objectivity (SO). The dataset consists of 10000 sentences, with 5000 subjective and 5000 objective sentences [28]. It is a binary classification task of classifying sentences as subjective or objective.
IMDB Movie Reviews (IMDB). The dataset consists of 50000 movie reviews, of which 25000 are positive and 25000 are negative [29].
TREC. The TREC dataset contains questions of 6 categories based on the type of question: 95 questions for Abbreviation, 1344 for Entity, 1300 for Description, 1288 for Human, 916 for Location, and 1009 for Numeric Value [30].
Customer Reviews (CR). The dataset consists of 2406 positive and 1367 negative reviews. It is a binary classification task of predicting positive or negative sentiment [31].
MPQA. The dataset consists of 3311 positive and 7293 negative reviews. It is a binary classification task of predicting positive or negative opinion [32].

5.2 Baseline Models

We compare our proposed model SCNN with the following models:


Static CNN Model. We compare SCNN with the static model, one of the standard CNN models proposed for text classification [12].
SCNN with SELU Activation. SNN was originally proposed with the SELU activation. We performed experiments on SCNN using SELU as the activation function in place of ELU.
Short CNN. The proposed SCNN model has fewer parameters than the Static CNN model [12]. To show the effectiveness of SCNN, we also run the Static CNN model with the same number of parameters as SCNN; we refer to this model as Short CNN.

Fig. 2. Performance of all models: (a) accuracy comparison on the MR, SO, IMDB, TREC datasets; (b) F1-score comparison on the CR, MPQA datasets; (c) SELU vs. ELU on the MR, SO, IMDB, TREC datasets; (d) SELU vs. ELU on the CR, MPQA datasets.

5.3 Model Parameters

Table 2 shows the parameter statistics for all the models. The SCNN and Short CNN models are experimented with using the same number of parameters. We used 70 convolution filters for each kernel size for the MR, SO, IMDB and TREC datasets. For the CR and MPQA datasets we considered 30 filters for each kernel size; in the MPQA dataset the average sentence length is 3, hence we reduced the number of convolution filters from 70 to 30.

5.4 Training

We process the dataset as follows: each sentence or paragraph is converted to lower case, and stop words are not removed from the sentences. We consider a vocabulary size V for each dataset based on the word counts. The IMDB and TREC datasets have predefined test data; for the other datasets we used 10-fold cross validation. The chosen parameters vary depending on the dataset size; Table 2 shows the parameters of the SCNN model for all datasets. We used Adam [33] as the optimizer for training SCNN.
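A minimal preprocessing sketch consistent with the steps above (lower-casing, keeping stop words, capping the vocabulary, padding); V and the maximum length are placeholders, and the example texts are invented.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB, MAX_LEN = 20000, 100   # placeholders; the paper chooses V per dataset

texts = ["An amazing movie!", "Dull and predictable plot."]
texts = [t.lower() for t in texts]             # lower-casing, stop words kept

tokenizer = Tokenizer(num_words=VOCAB)
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
print(sequences.shape)                          # (n_texts, MAX_LEN)
```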

6 Results and Discussion

6.1 Results

The performance of the models on all datasets is shown in Table 3. For the balanced datasets MR, SO, IMDB and TREC, accuracy is used as the comparison metric. The performance comparison on imbalanced datasets like CR and MPQA cannot be justified using accuracy, because the imbalance can bias the models' predictions; hence we use the F1-score as the metric for CR and MPQA. The IMDB and TREC datasets have pre-existing train and test sets, so we report our results on the provided test sets. For the remaining datasets, we report results using 10-fold cross validation (CV).

Discussion

SCNN Models Against Short-CNN When we compare SCNN with SELU and SCNN to Short CNN, both the models of SCNN outperform Short CNN for all the datasets. This shows that SCNN models perform better than CNN models (Short CNN) with same number of parameters indicating a better generalization of training. There is a significant improvement in accuracy and F1-Score when SCNN models are used in place of CNN. We believe that the use of activation functions ELU and SELU in the SCNN models as opposed to ReLU is the leading factor behind this performance difference between SCNN and CNN. In particular, ReLU activation suffers from dying ReLU problem3 . In ReLU, the negative values are cancelled to 0. Therefore negative values in the pretrained word vectors are ignored thereby loosing information about negative values. This problem is solved in ELU and SELU by having activation even for the negative values. In comparison to ReLU, ELU and SELU have faster and accurate training convergence leading to better generalization performance.

3

http://cs231n.github.io/neural-networks-1/.
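The sketch below illustrates the property this discussion relies on: ReLU zeroes out negative inputs, while ELU and SELU keep a smooth negative response. The α and λ constants are the standard published defaults, not values from this paper.

```python
# Minimal numpy illustration of the activations discussed above; alpha and
# lambda are the standard constants from the ELU and SELU papers.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # negative components of pretrained vectors collapse to 0
print(elu(x))   # negative components keep a bounded, informative activation
print(selu(x))  # like ELU, but scaled by lambda > 1
```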


Table 3. Performance of the models on different datasets (MR, SO, IMDB, TREC: accuracy; CR, MPQA: F1-score)

Model            MR      SO      IMDB    TREC   CR      MPQA
Short CNN        77.762  89.63   78.84   85.2   76.246  80.906
SCNN w/ SELU     80.266  91.99   80.664  89.6   77.166  84.062
SCNN             80.308  91.759  82.708  90.4   77.666  84.068
CNN-static [12]  81      93      78.692  92.8   76.852  82.584

ELU Against SELU in SCNN. We proposed SCNN with ELU as the activation function, as opposed to SELU, the activation function originally introduced for SNN. We found that when ELU is used as the activation, the performance of SCNN is better on a majority of the datasets. Our results in Table 3 substantiate the claim regarding the effectiveness of ELU activation. The characteristic difference between the SELU and ELU activations is the parameter λ (λ > 1) in SELU, which scales the neuron outputs. SELU is effective for maintaining normalized mean and variance when the inputs are normalized. Since pretrained word vectors are not normalized, the parameter λ adversely scales the outputs, resulting in a mean and variance shifted away from the desired values; on propagation through subsequent layers the difference is only magnified further. ELU, on the other hand, pushes mean activations closer to zero even when the inputs are not normalized. Hence, ELU achieved better results than SELU activation in SCNN.

SCNN Against Static CNN. Our results in Table 3 indicate that SCNN achieves results comparable to Static CNN. As shown in Table 2, the difference in parameter counts between SCNN and Static CNN is more than a million. On the IMDB, CR and MPQA datasets, SCNN outperforms Static CNN, and on the MR dataset the performance difference between SCNN and Static CNN is minimal.

7 Conclusion

We propose SCNN for text classification. Our observations indicate that SCNN achieves performance comparable to the CNN (Static-CNN [12]) model with substantially fewer parameters. Moreover, SCNN performs significantly better than a CNN with an equal number of parameters. The experimental results demonstrate the effectiveness of self normalizing neural networks in text classification. Currently, SCNN is proposed with relatively simple architectures; the proposed work can be further extended by experimenting with SCNN on deeper architectures. In addition, SNN can also be applied to recurrent neural networks (RNNs) and its performance analyzed.


References 1. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Maga. 29, 82–97 (2012) 2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 3. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature 323, 533 (1986) 4. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003) 5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 6. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) 7. Rush, A.M., Chopra, S., Weston, J.: A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 (2015) 8. Vinyals, O., Le, Q.: A neural conversational model. arXiv preprint arXiv:1506.05869 (2015) 9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998) 10. Krizhevsky, A., Sutskever, I., E. Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems, vol. 25 (2012) 11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 12. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014) 13. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781 (2016) 14. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: AAAI, vol. 333, pp. 2267–2273 (2015) 15. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 971–980. Curran Associates, Inc., Red Hook (2017) 16. Lguensat, R., Sun, M., Fablet, R., Tandeo, P., Mason, E., Chen, G.: Eddynet: a deep neural network for pixel-wise classification of oceanic eddies. In: IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 1764–1767. IEEE (2018) 17. Zhang, J., Shi, Z.: Deformable deep convolutional generative adversarial network in microwave based hand gesture recognition system. In: 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), pp. 1–6. IEEE (2017) 18. Goh, G.B., Hodas, N.O., Siegel, C., Vishnu, A.: Smiles2vec: an interpretable general-purpose deep neural network for predicting chemical properties. arXiv preprint arXiv:1712.02034 (2017) 19. Kumar, S.S., Kumar, M.A., Soman, K.P.: Sentiment analysis of tweets in Malayalam using long short-term memory units and convolutional neural nets. In: Ghosh, A., Pal, R., Prasath, R. (eds.) MIKE 2017. LNCS (LNAI), vol. 10682, pp. 320–334. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71928-3 31


20. Ros´ a, A., Chiruzzo, L., Etcheverry, M., Castro, S.: Retuyt in tass 2017: sentiment analysis for Spanish tweets using svm and cnn. arXiv preprint arXiv:1710.06393 (2017) 21. Meisheri, H., Ranjan, K., Dey, L.: Sentiment extraction from consumer-generated noisy short texts. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 399–406. IEEE (2017) 22. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift, pp. 448–456 (2015) 23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014) 24. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y.W., Titterington, M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, vol. 9 of Proceedings of Machine Learning Research, pp. 249–256. PMLR, Chia Laguna Resort (2010) 25. LeCun, Y.A., Bottou, L., Orr, G.B., M¨ uller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G.B., M¨ uller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 9–48. Springer, Heidelberg (2012). https://doi.org/10.1007/ 978-3-642-35289-8 3 26. Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). CoRR abs/1511.07289 (2015) 27. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the ACL (2005) 28. Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the ACL (2004) 29. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland (2011) 142–150 30. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, COLING 2002, vol. 1, pp. 1–7. Association for Computational Linguistics, Stroudsburg (2002) 31. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM (2004) 32. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Lang. Res. Eval. 39, 165–210 (2005) 33. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)

A Study of Text Representations for Hate Speech Detection

Chrysoula Themeli1,2(B), George Giannakopoulos2,3(B), and Nikiforos Pittaras1,2(B)

1

Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece {cthemeli,npittaras}@di.uoa.gr 2 NCSR Demokritos, Athens, Greece {ggianna,pittarasnikif}@iit.demokritos.gr [email protected] 3 SciFY PNPC, Athens, Greece [email protected] Abstract. The pervasiveness of the Internet and social media have enabled the rapid and anonymous spread of Hate Speech content on microblogging platforms such as Twitter. Current EU and US legislation against hateful language, in conjunction with the large amount of data produced in these platforms has led to automatic tools being a necessary component of the Hate Speech detection task and pipeline. In this study, we examine the performance of several, diverse text representation techniques paired with multiple classification algorithms, on the automatic Hate Speech detection and abusive language discrimination task. We perform an experimental evaluation on binary and multiclass datasets, paired with significance testing. Our results show that simple hate-keyword frequency features (BoW) work best, followed by pre-trained word embeddings (GLoVe) as well as N-gram graphs (NGGs): a graph-based representation which proved to produce efficient, very low-dimensional but rich features for this task. A combination of these representations paired with Logistic Regression or 3-layer neural network classifiers achieved the best detection performance, in terms of micro and macro F-measure. Keywords: Hate speech · Natural language processing Classification · Social media

1 Introduction

Hate Speech is a common affliction in modern society. Nowadays, people can come across Hate Speech content even more easily through social media platforms, websites and forums containing user-created content. The increase of the use of social media gives individuals the opportunity to easily spread hateful content and reach a number of people larger than ever before.

(Supported by NCSR Demokritos, and the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens.)

On the other hand,


social media platforms like Facebook or Twitter want to both comply with legislation against Hate Speech and improve user experience. Therefore, they need to track and remove Hate Speech content from their websites efficiently. Due to the large amount of data transmitted through these platforms, delegating such a task to humans is extremely inefficient. A usual compromise is to rely on user reports in order to review only the reported posts and comments. This is also ineffective, since it relies on the users’ subjectivity and trustworthiness, as well as their ability to thoroughly track and flag such content. Due to all of the above, the development of automated tools to detect Hate Speech content is deemed necessary. The goal of this work is: (i) to study different text representations and classification algorithms in the task of Hate Speech detection; (ii) evaluate whether the n-gram graphs representation [10] can constitute a rich/deep feature set (as e.g. in [20]) for the given task. The structure of the paper is as follows. In Sect. 2 we define the hate speech detection problem, while in Sect. 3 we discuss related work. We overview our study approach and elaborate on the proposed method in Sect. 4. We then experimentally evaluate the performance of different approaches in Sect. 5, concluding the paper in Sect. 6, by summarizing the findings and proposing future work.

2 Problem Definition

The first step to Hate Speech detection is to provide a clear and concise definition of Hate Speech. This is important especially during the manual compilation of Hate Speech detection datasets, where human annotators are involved. In their work, the authors of [13] have asked three students of different race and same age and gender to annotate whether a tweet contained Hate Speech or not, as well as the degree of its offensiveness. The agreement was only 33%, showing that Hate Speech detection can be highly subjective and dependent on the educational and/or cultural background of the annotator. Thus, an unambiguous definition is necessary to eliminate any such personal bias in the annotation process. Usually, Hate Speech is associated with insults or threats. Following the definition provided by [3], “it covers all forms of expressions that spread, incite, promote or justify racial hatred, xenophobia, antisemitism or other forms of hatred based on intolerance”. Moreover, it can be “insulting, degrading, defaming, negatively stereotyping or inciting hatred, discrimination or violence against people in virtue of their race, ethnicity, nationality, religion, sexual orientation, disability, gender identity”. However, we cannot disregard that Hate Speech can be also expressed by statements promoting superiority of one group of people against another, or by expressing stereotypes against a group of people. The goal of a Hate Speech Detection model is, given an input text T , to output True, if T contains Hate Speech and False otherwise. Modeling the task as a binary classification problem, the detector is built by learning from a training set and is subsequently evaluated on unseen data. Specifically, the input is transformed to a machine-readable format via a text representation method, which ideally captures and retains informative characteristics in the input text. The representation data is fed to a machine learning algorithm that assigns the


input to one of the two classes, with a certain confidence. During the training phase, this discrimination information is used to construct the classifier. The classifier is then applied on data not encountered during training, in order to measure its generalization ability. In this study, we focus on user-generated texts from social media platforms, specifically Twitter posts. We evaluate the performance of several established text representations (e.g. Bag of words, word embeddings) and classification algorithms. We also investigate the contribution of the, graph-based, n-gram graph features to the Hate Speech classification process. Moreover, we examine whether a combination of deep features (such as n-gram graphs) and shallow features (such as Bag of Words) can provide top performance in the Hate Speech detection task.

3 Related Work

In this section, we provide a short review of the related work, not only for Hate Speech detection, but for similar tasks as well. Examples of such tasks can be found in [18], where the authors aim to identify which users express Hate Speech more often, while the authors of [32] detect and delete hateful content in a comment, making sure that what is left has correct syntax. The latter is a demanding task which requires the precise identification of grammatical relations and typed dependencies among the words of a sentence. The results of their proposed method have 90.94% agreement with the manual filtering results. Automatic Hate Speech detection is usually modeled as a binary classification problem. However, multi-class classification can be applied to identify the specific kind of Hate Speech (e.g. racism, sexism, etc.) [1,21]. One other useful task is the detection of the specific words or phrases that are offensive or promote hatred, investigated in [28].

3.1 Text Representations for Hate Speech

In this work we focus on representations, i.e. the mapping of written human language into a collection of useful features in a form that is understandable by a computer and, by extension, a Hate Speech Detection model. Below we overview a number of different representations used within this domain. A very popular representation approach is the Bag of Words (BOW) [1,2,13] model, a Vector Space Model extensively used in Natural Language Processing and document classification. In BOW, the text is segmented to words, followed by the construction of a histogram of (possible weighted) word frequencies. Since BOW discards word order, syntactic, semantic and grammatical information, it is commonly used as a baseline in NLP tasks. An extension of the BOW is the Bag of N-grams [5,13,18,19,29], which replaces the unit of interest in BOW from words to n contiguous tokens. A token is usually a word or a character in the text, giving rise to word n-gram and character n-gram models. Due to the contiguity consideration, n-gram bags retain local spacial and order information.
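For illustration, the sketch below builds the two kinds of bag features just described with scikit-learn's CountVectorizer: a word-level bag of words and a character 3-gram bag. The toy corpus and the n-gram range are assumptions for the example, not the settings used in the surveyed works.

```python
# Illustrative sketch of bag-of-words and character n-gram bag features;
# the toy corpus and n-gram range are assumptions.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you are awful", "have a nice day"]

bow = CountVectorizer(analyzer="word")                            # word histogram (BOW)
char_bag = CountVectorizer(analyzer="char", ngram_range=(3, 3))   # character 3-grams

print(bow.fit_transform(docs).toarray())
print(char_bag.fit_transform(docs).toarray().shape)
```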


The authors in [5] claim that lexicon detection methods alone are inadequate in distinguishing between Hate Speech and Offensive Language, counterproposing n-gram bags with TF-IDF weighting along with a sentiment lexicon, classified with L2 regularized Logistic Regression [16]. On the other hand, [1] use character n-grams, BOW and TF-IDF features as a baseline, proposing word embeddings from GloVe1 . In [21] the authors use character and word CNNs as well a hybrid CNN model to classify sexist and racist Twitter content. They compare multi-class detection with a coarse-to-fine two-step classification process, achieving similar results with both approaches. There is also a variety of other features used such as word or paragraph embeddings ([1,7,28]), LDA and Brown Clustering ([24,28,29,31]), sentiment analysis ([5,12]), lexicons and dictionaries ([6,12,26] etc.) and POS tags ([19,24,32] etc.). 3.2

Classification Approaches

Regarding classification algorithms, SVM [4], Logistic Regression (LR) and Naive Bayes (NB) are the most widely used (e.g. [5,7,24,28] etc.). In [30] and [31], the authors use a bootstrapping approach to aid the training process via data generation. This approach was used as a semi-supervised learning process to generate additional data automatically or create hatred lexical resources. The authors of [31] use the Map-Reduce framework in Hadoop to collect tweets automatically from users that are known to use offensive language, and a bootstrapping method to extract topics from tweets. Other algorithms used are Decision Trees and Random Forests (RF) ([2,5, 31]), while [1] and [6] have used Deep Learning approaches via LSTM networks. Specifically, [1] use CNN, LSTM and FastText, i.e. a model that is represented by average word vectors similar to BOW, which are updated through backpropagation. The LSTM model achieved the best performance with 0.93 F-Measure, used to train a GBDT (Gradient Boosted Decision Trees) classifier. In [5], the authors use several classification algorithms such as regularized LR, NB, Decision Trees, RF and Linear SVM, with L2-regularized LR outperforming other approaches in terms of F-score. For more information, the survey of [25] provides a detailed analysis of detector components used for Hate Speech detection and similar tasks.

4 Study and Proposed Method

In this section we will describe the text representations and classification components used in our implementations of a Hate Speech Detection pipeline. We have used a variety of different text representations, i.e. bag of words, embeddings, n-grams and n-gram graphs and tested these representations with multiple classification algorithms. We have implemented the feature extraction in Java

1

https://nlp.stanford.edu/projects/glove/.


and used both Weka and scikit-learn (sklearn) to implement the classification algorithms. For artificial neural networks (ANNs), we have used the sklearn and Keras frameworks. Our model can be found in our GitHub repository2.

4.1 Text Representations

In order to discard noise and useless artifacts we apply standard preprocessing to each tweet. First, we remove all URLs, mentions (e.g. @username), RT (Retweets) and hashtags (e.g. words starting with #), as well as punctuation, focusing on the text portion of the tweet. Second, we convert tweets to lowercase and remove common English stopwords using a predefined collection3 . After preprocessing, we apply a variety of representations, starting with the Bag of Words (BOW) model. This representation results in a high dimensional vector, containing all encountered words, requiring a significant amount of time in order to process each text. In order to reduce time and space complexity, we limit the number of words of interest to keywords from HateBase4 [5]. Moreover, we have used additional bag models, with respect to word and character n-grams. In order to guarantee a common bag feature vector dimension across texts, we pre-compute all n-grams that appear in the dataset, resulting in a sparse and high-dimensional vector. Similarly to the BOW features, in order to reduce time and space complexity, it is necessary to reduce the vector space. Therefore, we keep only the 100 most frequent n-grams features, discarding the rest. Unfortunately, as we will illustrate in the experiments, this decision resulted in highly sparse vectors and, thus, reduced the efficiency of those features. Furthermore, we have used GloVe word embeddings [22] to represent the words of each tweet, mapping each word to a 50-dimensional real vector and arriving at a single tweet vector representation via mean averaging. Words missing from the GloVe mapping were discarded. Expanding the use of n-grams, we examine wether n-gram graphs (NGGs) [9,11] can have a significant contribution in detecting Hate Speech. NGGs are a graph-based text representation method that captures both frequency and local context information from text n-grams (as opposed to frequency-only statistics that bag models aggregate). This enables NGGs to differentiate between morphologically similar but semantically different words, since the information kept is not only the specific n-gram but also its context (neighboring n-grams). The graph is constructed with n-grams as nodes and local co-occurence information embedded in the edge weights, with comparisons defined via graph-based similarity measures [9]. NGGs can operate with word or character n-grams – in this work we employ the latter version, which has been known to be resilient to social media text noise [11,20]. During training, we construct a representative category graph (RCG) for each category in the problem (e.g. “Hate Speech” or “Clean”), aggregating all 2 3 4

https://github.com/cthem/hate-speech-detection. https://github.com/igorbrigadir/stopwords. https://github.com/t-davidson/hate-speech-and-offensive-language.


training instances per category to a single NGG. We then compare the NGG of each instance to each RCG, extracting a score expressing the degree to which the instance belongs to that class – for this, we use the NVS measure [9], which produces a similarity score between the instance and category NGGs. After this process completes, we end up with similarity-based, n-dimensional model vector features for each instance – where n is the number of possible classes. We note that we use 90% of the training instances to build the RCGs, in order to avoid overfitting of our model: in short, using all training instances would result in very high instance-RCG similarities during training. Since we use the resulting model vectors as inputs to a classification phase in the next step, the above approach would introduce extreme overfit to the classifier, biasing it towards expecting perfect similarity scores whenever an instance belongs to a class, a scenario which of course rarely – if ever – happens with real world data.

In addition, we produce sentiment, syntax and spelling features. Sentiment analysis could be a meaningful feature, since hatred is related to negative polarity. For sentiment and syntax feature extraction we use the Stanford NLP Parser5. This tool performs sentiment extraction on the longest phrase tracked in the input and can additionally provide a syntactic score from syntax trees, corresponding to the best attained score for the entire tweet. Finally, a spelling feature was constructed to examine whether Hate Speech is correlated with the user's proficiency in writing. We used an English dictionary to collect all correctly spelled English words and then, for each word in a tweet, calculated its edit distance from each word in the dictionary, keeping the smallest value (i.e. the distance from the best match). The final feature is the average edit distance over the entire post, with values close to 0 for tweets in which the majority of words are correctly spelled. At the end of this process, we obtain a 3-dimensional vector, each coordinate corresponding to the sentiment, syntax and spelling scores of the text. A sketch of the spelling feature follows.
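A minimal sketch of the spelling feature described above, assuming a tiny stand-in dictionary and a plain Levenshtein edit distance (the actual dictionary and implementation used in the study may differ):

```python
# Spelling feature sketch: average, over the tweet's words, of the minimum
# edit distance to any dictionary word; dictionary and routine are stand-ins.
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def spelling_score(tweet, dictionary):
    words = tweet.lower().split()
    if not words:
        return 0.0
    return sum(min(edit_distance(w, d) for d in dictionary) for w in words) / len(words)

print(spelling_score("thiss is an exmple", {"this", "is", "an", "example"}))  # 0.5
```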

4.2 Classification Methods

Generated features are fed to a classifier that decides the presence of Hate Speech content. We use a variety of classification models, as outlined below. Naive Bayes (NB) [23] is a simple probabilistic classifier, based on Bayesian statistics. NB makes the strong assumption that instance features are independent from one another, but yields performance comparable to far more complicated classifiers – this is why it commonly serves as baseline for various machine learning tasks [14]. Additionally, the independence assumption simplifies the learning process, reducing it to the model learning the attributes separately, vastly reducing time complexity on large datasets. Logistic Regression (LR) [17] is another statistical model commonly applied as a baseline in binary classification tasks. It produces a prediction via a linear combination of the input with a set of weights, passed through a logistic function which squeezes scores in the range between 0 and 1, i.e. thus producing 5

https://nlp.stanford.edu/software/lex-parser.html.


binary classification labels. Training the model involves discovering optimal values for the weights, usually acquired through a maximum likelihood estimation optimization process. The K-Nearest Neighbor (KNN) classifier [8] is another popular technique applied to classification. It is a lazy and non-parametric method; no explicit training and generalization is performed prior to a query to the classification system, and no assumption is made pertaining to the probability distribution that the data follows. Inference requires a defined distance measure for comparing two instances, via which closest neighbors are extracted. The labels of these neighbors determine, through voting, the predicted label of a given instance. The Random Forest (RF) [15] is an ensemble learning technique used for both classification and regression tasks. It combines multiple decision trees during the training phase by bootstrap-aggregated ensemble learning, aiming to alleviate noise and overfitting by incorporating multiple weak learners. Compared to decision trees, RF produce a split when a subset of the best predictors is randomly selected from the ensemble. Artificial Neural Networks (ANNs) are computational graphs inspired by the biological nervous systems. They are composed of a large number of highly interconnected neurons, usually organized in layers in a feed-forward directed acyclic graph. Similarly to a LR unit, neurons compute the linear combination of their input (including a bias term) and pass the result through a non-linear activation function. Aggregated into an ANN, each neuron computes a specific feature from its input, as dictated by the values of the weights and bias. ANNs are trained with respect to a loss function, which defines an error gradient by which all parameters of the ANN are shifted. With each optimization step, the model moves towards an optimum parameter configuration. The gradient with respect to all network parameters is computed by the back-propagation method. In our case, we have used an ANN composed of 3 hidden layers with dropout regularization.
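As an illustration of this last classifier, the sketch below builds a 3-hidden-layer feed-forward network with dropout in Keras. The layer widths, dropout rate and input dimensionality (e.g. a 1000-dimensional BOW vector plus the 2-dimensional NGG model vector) are assumptions, not the study's exact configuration.

```python
# Sketch of a 3-hidden-layer ANN with dropout (assumed sizes, not the paper's setup).
import tensorflow as tf

def build_ann(input_dim, hidden=128, dropout=0.5):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # Hate Speech vs. Clean
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

clf = build_ann(input_dim=1002)  # e.g. 1000-d BOW features + 2-d NGG model vector
```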

5 Experiments and Results

In this section, we present the experimental setting used to answer the following questions:

– Which features have the best performance?
– Does feature combination improve performance?
– Do NGGs have significant/comparable performance to BOW or word embeddings despite being represented by low dimensional vectors?
– Are there classifiers performing statistically significantly better than others? Is the selection of features or classifiers more significant when determining the pipeline for Hate Speech detection?

In the following paragraphs, we elaborate on the datasets utilized, present experimental and statistical significance results, and discuss our findings.

5.1 Datasets and Experimental Setup

We use the datasets provided by [30]6 and [5]7 . We will refer to the first dataset as RS (racism and sexism detection) and to the second as HSOL (distinguish Hate Speech from Offensive Language). In both works, the authors perform a multiclass classification task against the corpora. In [30], their goal is to distinguish different kinds of Hate Speech, i.e. racism and sexism, and therefore the possible classes in RS are Racist, Sexist or None. In [5], the annotated classes are Hate Speech, Offensive Language or Clean. Given the multi-class nature of these datasets, we combine them into a single dataset, keeping only instances labeled Hate Speech and Clean in the original. We use the combined (RS + HSOL) dataset to evaluate our model implementations on the binary classification task. Furthermore, we run multi-class experiments on the original datasets for completeness, the results of which are omitted due to space limitations, but are available upon request. We perform three stages of experiments. First, we run a preliminary evaluation on each feature separately, to assess its performance. Secondly, we evaluate the performance of concatenated feature vectors, in three different combinations: 1) the top individually performing features by a significant margin (best), 2) all features all and 3) vector-based features (vector), i.e. excluding NGGs. Via the latter two scenarios, we investigate whether NGGs can achieve comparable performance to vector-based features of much higher dimensionality. Given the imbalanced dataset used (24463 Hate Speech and 14548 clean samples), we report performance in both macro and micro F-measure. Finally, we evaluate (with statistical significance testing) the performance difference between run components, through a series of ANOVA and Tukey HSD test evaluations. 5.2

Results

Here we provide the main experimental results of the experiments described in the previous section, presented as micro/macro F-measure scores. More detailed results, including multi-class classification, are omitted due to space limitations but are available upon request. First, to answer the question on the value of different feature types, we perform individual runs, which designate BOW, GloVe embeddings and NGG as the top performers, with the remaining features (namely sentiment, spelling/syntax analysis and n-grams) performing significantly worse. All approaches, however, surpass the baseline performance of a naive majority-class classifier (scoring 0.382/0.473 in terms of macro and micro F-measure, respectively); they are described below. Sentiment, spelling and syntax features proved to be insufficient information sources for the Hate Speech detection classifiers when used separately – not surprisingly, since they produce one-dimensional features. The best performers are syntax with NNs in terms of micro F-measure (0.633) and

https://github.com/ZeerakW/hatespeech. https://github.com/t-davidson/hate-speech-and-offensive-language.


spelling with NNs in terms of macro F-measure (0.566). In contrast n-gram graph similarity-based features perform close to the best performing BOW configuration (cf. Table 1), having just one additional dimension. This implies that appropriate, deep/rich features can still offer significant information, despite the low dimensionality. NGG-based features appear to have this quality, as illustrated by the results. Finally, N-grams were severely affected by the top-100 token truncation. The best character n-gram model achieves macro/micro FMeasure scores of 0.507/0.603 with NN classification and the best word n-gram model 0.493/0.627 with KNN and NN classifiers. The results of the top individually performing features, in terms of micro/macro average F-Measure, are presented in the left half of Table 1. Bold values represent column-wise maxima, while underlined ones depict maxima in the left column category (e.g. feature type, in this case). “NN ke” and “NN sk” represent the keras and sklearn neural network implementations, repsectively. We can observe that the best performer is BOW with either LR or NNs, followed by word embeddings with NN classification. NGGs have a slightly worse performance, which can be attributed to the severely shorter (2D) feature vector it utilizes. On the other hand, BOW features are 1000-dimensional vectors. Compared to NGGs, this corresponds to a 500-fold dimension increase, with a 9.0% micro F-measure performance gain. Subsequently, we test the question on whether the combination of features achieve a better performance than individual features. The results are illustrated in the right half of Table 1. First, the best combination that involves NGG, BOW and GloVe features is, not surprisingly, the top performer, with LR and NNsklearn obtaining the best performance. The all configuration follows with NB achieving macro/micro F-scores of 0.795 and 0.792 respectively. This shows that the additional features introduced significant amounts of noise, enough to reduce performance by canceling out any potential information the extra features might have provided. Finally, the vector combination achieves the worst performance: 0.787 and 0.783 in macro/micro F-measure. This is testament to the added value NGGs contribute to the feature pool, reinforced by the individual scores of the other vector-based approaches. Apart from experiments in the binary Hate Speech classification on the combined dataset, we have tested our classification models in multi-class classification, using the original RS and HSOL datasets. In RS, our best score was achieved with the all combination and the RF classifier with a micro F-Measure of 0.696. For the HSOL dataset, we achieved a micro F-Measure of 0.855, using the best feature combination and the LR classifier. 5.3

Significance Testing

In Table 2 we present ANOVA results with respect to feature extractors and classifiers, under macro and micro F-measure scores. For both metrics, the selection of both features and classifiers is statistically significant with a confidence level greater than 99.9%. We continue by performing a set of Tukey’s Honest Significance Difference test experiments in Table 3, depicting each statistically


Table 1. Average micro & macro F-Measure for NGG, BOW and GloVe features (left) and the “best”, “vector” and “all” feature combinations (right).

Feature  Classifier  Macro-F  Micro-F      Combo   Classifier  Macro-F  Micro-F
NGG      KNN         0.712    0.736        Best    KNN         0.810    0.820
         LR          0.712    0.739                LR          0.819    0.831
         NB          0.678    0.713                NB          0.632    0.667
         NN ke       0.718    0.727                NN ke       0.807    0.819
         NN sk       0.716    0.740                NN sk       0.819    0.831
         RF          0.699    0.726                RF          0.734    0.759
BOW      KNN         0.787    0.763        All     KNN         0.497    0.569
         LR          0.808    0.776                LR          0.760    0.772
         NB          0.629    0.665                NB          0.795    0.792
         NN ke       0.808    0.776                NN ke       0.537    0.629
         NN sk       0.808    0.776                NN sk       0.664    0.678
         RF          0.807    0.776                RF          0.700    0.731
GloVe    KNN         0.741    0.765        Vector  KNN         0.497    0.569
         LR          0.749    0.769                LR          0.745    0.756
         NB          0.715    0.726                NB          0.787    0.783
         NN ke       0.774    0.788                NN ke       0.592    0.640
         NN sk       0.786    0.800                NN sk       0.669    0.675
         RF          0.731    0.755                RF          0.727    –

different group as a letter. In the upper part we present results for the feature combination groups (“a” to “d”), where the best combination differs from the similar all and vector combinations by a large margin, as expected. The middle part compares individual features (grouped from “a” to “g”), where GloVe, BoW and NGGs are assigned to neighbouring groups and emerge as the most significant features, with the other approaches separated from them by a large significance margin. Spelling and syntax features are grouped together, as are the n-gram approaches. Finally, the lower part of the table examines classifier groups (“a” to “c”). Here LR leads the ranking, followed by groups containing the ANN approaches, the NB and RF, and the KNN method.

Table 2. ANOVA results with respect to feature and classifier selection, in terms of macro F-measure (left) and micro F-measure (right). Parameter | Pr(>F) (macro-F) | Pr(>F) (micro-F)

5.4

Features

. This edge is undirected, meaning that subtexts ‘bc’ and ‘cb’ of the original text are represented by the same edge.

Fig. 2. An n-gram graph.

The next step is to represent all reference summaries by a single n-gram graph [3]. We begin by initializing the graph to be an n-gram graph of any of the reference summaries. The initial graph is then updated using every one of the remaining n-gram reference summary graphs as follows. Let G1 be the current merged n-gram graph, and let G2 be the n-gram graph of the next reference summary. The merge function U(G1, G2, l) defines edge weights as

w(e) = w1(e) + (w2(e) − w1(e)) · l

where l ∈ [0, 1] is the learning factor, w1(e) is the weight of e in G1, and w2(e) is the weight of e in G2. In our system we chose l = 1/i, where i > 1 is the number of the reference graph being processed.

Example 2. Fig. 3 shows how an edge weight is calculated when merging graph G1 and reference graph G2 for the learning factor l = 1/3, which gives w(e) = w1(e) + (w2(e) − w1(e)) · 1/3 = 2.5 + (1 − 2.5) · 1/3 = 2.

In the MeMoG metric, the score of a summary is one similarity measurement, denoted by VS, between the system summary graph Gj and the merged reference graph Gi. The similarity score between edges is defined as

VR(e) = min{wi(e), wj(e)} / max{wi(e), wj(e)}

where wi and wj are the weights of the same edge e (identified by its end-node labels) in graphs Gi and Gj respectively. The final score is computed as

VS(Gi, Gj) = ( Σe VR(e) ) / max{|Gi|, |Gj|}
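To make the two formulas concrete, here is a small illustrative sketch that assumes graphs are plain dicts mapping an undirected edge (a frozenset of two n-grams) to a weight. It only restates the merge rule and VS; it is not the actual AutoSummENG/MeMoG implementation.

```python
# Illustrative restatement of the merge rule and the VS similarity above.
def merge(g1, g2, l):
    """Update graph g1 with reference graph g2 using learning factor l.
    Simplification: edges that appear only in g1 are left unchanged."""
    merged = dict(g1)
    for edge, w2 in g2.items():
        w1 = merged.get(edge, 0.0)
        merged[edge] = w1 + (w2 - w1) * l
    return merged

def value_similarity(gi, gj):
    """VS(Gi, Gj) = (sum over common edges of min/max weight ratio) / max(|Gi|, |Gj|)."""
    common = set(gi) & set(gj)
    vr = sum(min(gi[e], gj[e]) / max(gi[e], gj[e]) for e in common)
    return vr / max(len(gi), len(gj))

# toy example mirroring Example 2: an edge with weights 2.5 and 1, l = 1/3
g1 = {frozenset(("ab", "bc")): 2.5}
g2 = {frozenset(("ab", "bc")): 1.0}
print(merge(g1, g2, 1 / 3))        # weight becomes 2.0
print(value_similarity(g1, g2))    # 1 / 2.5 = 0.4
```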


Fig. 3. The merge function of the MeMoG metric.

Example 3. Let two reference summaries be “abca”, “bcab”, and let the system summary be “abcab”. Figure 4 shows VS scores for this system summary with Dwin set to 1 (1-grams) and 2 (bigrams). Note that setting a larger Dwin does not necessarily increase the score.

Fig. 4. MeMoG scores for different sizes of n-grams.

2.2 Baselines

2.2.1 TopK Baseline

For this baseline, we simply select the first K sentences of the source document so that the number of words in the candidate summary is at least the predefined word limit W, making K minimal. If the K − 1 top sentences contain fewer than W words, and K sentences contain more than W words, we add the K-th sentence to the summary.

2.2.2 OCCAMS Baseline

OCCAMS, introduced in [5], is an algorithm for selecting sentences from a source document when the reference summaries are known. The algorithm finds the best possible sentence subset covering the reference summaries because the reference summaries are visible to it. While no extractive summary can fully match human-generated abstractive reference summaries, OCCAMS achieves the best possible result (or a good approximation of it) for the extractive summarization task. Comparing system summaries to the result of OCCAMS shows exactly how far the tested system is from a realistic best possible extractive summarization result. The OCCAMS parameters are the weights of the terms W, the number of words in sentences C, and the size of the candidate summary L. Let D be the


source document consisting of sentences S1, ..., Sn, and let T = {t1, ..., tm} be the set of the document's terms (tokenized, stemmed words). Initially OCCAMS computes the document matrix A = (aij), i = 1..n, j = 1..m, using LSA [6] as follows:

aij = Lij × Gi,   where Lij = 1 if tj ∈ Si, and 0 otherwise

Gi is the entropy weight of Si, defined as

Gi = 1 + Σj (pij log pij) / log n

where pij is the weight of term tj in sentence Si (normalized by the total number of appearances of tj in the document). Then, OCCAMS computes the singular value decomposition of matrix A as A = U S V^T, following the approach of [40]. The singular value decomposition produces term weights w(ti) as follows:

w(ti) = (|Uk| · Sk)i^T 1 = Σ_{j=1}^{n} |uij| · Sjj

For these weights, the final solution is computed by using the Budgeted Maximum Coverage (BMC [18]) and Fully Polynomial Time Approximation Scheme (FPTAS [16]) greedy algorithms to select sentences that provide maximum coverage of the important terms, while ensuring that their total length does not exceed the intended summary size. The full flow of the OCCAMS algorithm is depicted in Algorithm 1.

Algorithm 1. OCCAMS algorithm
Input: Document D, terms T, sentences C, term weights W
Output: 'Ideal' extractive summary K
 1: K1 = Greedy_BMC(T, D, W, C, L)
 2: Smax = argmax_{Si ∈ C} Σ_{tj ∈ Si} w(tj)
 3: Let D' = D \ Smax
 4: Let C' = C \ Smax
 5: Let L' = L − |Smax|
 6: Let T' = T \ {ti ∈ Smax}
 7: K2 = Smax ∪ Greedy_BMC(T', D', W, C', L')
 8: K3 = Knapsack(Greedy_BMC(T, D, W, C, 5L), L)
 9: Compute the sets of terms T(Ki) covered by solutions Ki, i = 1, 2, 3
10: K = argmax_{i=1,2,3} Σ_{tj ∈ T(Ki)} w(tj)

This algorithm closes the gap that is created when an automated assessment system awards a high score to an automated summarization system.
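As a rough illustration of the core selection step, the sketch below implements a simple greedy budgeted-maximum-coverage heuristic over weighted terms with a length budget. The data layout and the gain-per-length heuristic are assumptions, not the exact Greedy_BMC routine used by OCCAMS.

```python
# Greedy budgeted-maximum-coverage sketch: pick sentences that cover as much
# term weight as possible without exceeding the summary length budget.
def greedy_bmc(sentences, term_weights, budget):
    """sentences: list of (length_in_words, set_of_terms); returns chosen indices."""
    chosen, covered, used = [], set(), 0
    while True:
        best, best_gain = None, 0.0
        for i, (length, terms) in enumerate(sentences):
            if i in chosen or used + length > budget:
                continue
            gain = sum(term_weights.get(t, 0.0) for t in terms - covered) / length
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        chosen.append(best)
        covered |= sentences[best][1]
        used += sentences[best][0]
    return chosen

# toy usage with made-up term weights
sents = [(5, {"cat", "mat"}), (4, {"dog"}), (6, {"cat", "dog", "bird"})]
w = {"cat": 2.0, "dog": 1.5, "bird": 1.0, "mat": 0.5}
print(greedy_bmc(sents, w, budget=10))
```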

3 Implementation Details

In this section we describe and give examples of the EASY system interface (screen images are taken from the standalone implementation1).

3.1 Input Selection

In EASY, a user can make a choice between analyzing a single file with its system and reference summaries, or analyzing an entire corpus. In the former case, the user needs to supply file names for the document, reference summary (or summaries), and the system summary that is to be evaluated. In the latter case, a user needs to select folders that contain a corpus, system summaries and reference summaries. Matching between the document and its corresponding summaries is done by comparing the file name parts that precede file extensions. File names are treated as case-sensitive. Figure 5 shows the input selection interface for the case of a corpus.

Fig. 5. EASY welcome screen for choosing directories for corpus, system summaries, and reference summaries.

3.2 Metrics

Figures 6 and 7 show how to compute ROUGE and MeMoG summarization metrics for the selected input (corpus, reference summaries, and system summaries). The top part of the interface in both cases enables the user to select parameters for every metric, while the bottom part gives the user an opportunity to compute baseline summaries and to compute the chosen metric for baselines with the same parameters as above. In both cases, the user can choose to work with a single file and not with a corpus, as is shown in the examples. 1

https://youtu.be/5AhZB5OfxN8.


Fig. 6. Computing ROUGE metric for system summaries.

Fig. 7. Computing MeMoG metric for system summaries.

3.3 Baselines

Figures 8 and 9 show how baseline summaries can be generated with the EASY system. The user needs to select one or more files from the loaded corpus and specify the desired summary length (in both examples it is set to 150 words).


Fig. 8. Summary generation for TopK baseline.

Fig. 9. Summary generation for OCCAMS baseline.

Fig. 10. Spearman’s correlation of two metrics’ scores.

3.4 Correlation of Results

The EASY system gives the user the option of computing and viewing Spearman’s rank correlation [42] between scores obtained for different metrics. The user can select two metrics each time and view visualization of the Spearman correlation as depicted in Fig. 10.
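A minimal sketch of this computation using SciPy follows; the metric scores are made-up toy values, and EASY itself computes the statistic internally rather than through this snippet.

```python
# Spearman's rank correlation between the scores two metrics assign to the
# same set of system summaries (toy values).
from scipy.stats import spearmanr

rouge_scores = [0.42, 0.37, 0.51, 0.45, 0.30]
memog_scores = [0.40, 0.33, 0.55, 0.47, 0.29]

rho, pvalue = spearmanr(rouge_scores, memog_scores)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3f})")
```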

4 Availability and Reproducibility

The EASY system standalone version is implemented in C#, and its Web version is implemented in Angular 7 on the client side and ASP.NET WebAPI 2 on the server side. The EASY system is freely available to everyone via its Web interface at https://summaryevaluation.azurewebsites.net/home. A video of the standalone interface in operation is available at https://youtu.be/5AhZB5OfxN8. We encourage members of the NLP community to use it for evaluation of extractive summarization results. Currently, the system supports English and French text evaluation only, but in the future we plan to extend it by adding more languages and by implementing additional metrics.

5 Conclusions

In this paper we present a framework that we call EASY, which is intended for the evaluation of automatic summarization systems. Currently, EASY supports the English and French languages. The EASY system enables users to compute several summarization metrics for the same set of summaries and to observe how they correlate using Spearman's correlation. The system can also compute baseline summaries using the TopK approach, which takes the first sentences of the document, and the OCCAMS approach, which computes optimal extractive summaries by taking into account the reference summaries (the gold standard). In our future work we plan to employ semantic representations based on LSA, topic modeling, and word embeddings, which can be used for implementing both similarity and coverage metrics. Also, we plan to add readability metrics [23]. According to our observations, there are two well-known extractive summarization methods that are widely compared to new approaches, namely TextRank [28] and integer linear programming optimization [26]. We intend to implement these methods as baseline summarizers in EASY. Based on our experience, an extensive statistical analysis is usually required for a correct interpretation of results. We intend to provide EASY users with a built-in ability to perform such analysis. We also plan to provide an API so that members of the NLP community will be able to contribute their own implementations of different metrics and baselines.


References 1. Abdi, A., Idris, N.: Automated summarization assessment system: quality assessment without a reference summary. In: The International Conference on Advances in Applied Science and Environmental Engineering-ASEE (2014) 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) 3. Cohan, A., Goharian, N.: Revisiting summarization evaluation for scientific articles. arXiv preprint arXiv:1604.00400 (2016) 4. Das, D., Martins, A.F.: A survey on automatic text summarization. Lit. Surv. Lang. Stat. II Course CMU 4, 192–195 (2007) 5. Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS-an optimal combinatorial covering algorithm for multi-document summarization. In: 2012 IEEE 12th International Conference on Data Mining Workshops (ICDMW), pp. 454–463. IEEE (2012) 6. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990) 7. Donaway, R.L., Drummey, K.W., Mather, L.A.: A comparison of rankings produced by summarization evaluation measures. In: Proceedings of the 2000 NAACL-ANLP Workshop on Automatic summarization, pp. 69–78. Association for Computational Linguistics (2000) 8. Gambhir, M., Gupta, V.: Recent automatic text summarization techniques: a survey. Artif. Intell. Rev. 47(1), 1–66 (2017) 9. Giannakopoulos, G., Karkaletsis, V.: Autosummeng and memog in evaluating guided summaries. In: Proceedings of Text Analysis Conference (2011) 10. Giannakopoulos, G., et al.: Multiling 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 270–274 (2015) 11. Gupta, V., Lehal, G.S.: A survey of text summarization extractive techniques. J. Emerg. Technol. Web Intell. 2(3), 258–268 (2010) 12. Hovy, E., Lin, C.Y., Zhou, L., Fukumoto, J.: Automated summarization evaluation with basic elements. In: Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC 2006), pp. 604–611. Citeseer (2006) 13. Jing, H., Barzilay, R., McKeown, K., Elhadad, M.: Summarization evaluation methods: experiments and analysis. In: AAAI Symposium on Intelligent Summarization, Palo Alto, CA, pp. 51–59 (1998) 14. Jing, H., McKeown, K.R.: The decomposition of human-written summary sentences. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136. ACM (1999) 15. Jones, K.S., Galliers, J.R.: Evaluating Natural Language Processing Systems: An Analysis and Review, vol. 1083. Springer, Heidelberg (1995). https://doi.org/10. 1007/BFb0027470 16. Karger, D.R.: A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM Rev. 43(3), 499–522 (2001) 17. Kasture, N., Yargal, N., Singh, N.N., Kulkarni, N., Mathur, V.: A survey on methods of abstractive text summarization. Int. J. Res. Merg. Sci. Technol 1(6), 53–57 (2014) 18. Khuller, S., Moss, A., Naor, J.S.: The budgeted maximum coverage problem. Inf. Process. Lett. 70(1), 39–45 (1999)


19. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68–73. ACM (1995) 20. Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: International Conference on Machine Learning, pp. 957–966 (2015) 21. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014) 22. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pp. 25–26 (2004) 23. Lloret, E., Vodolazova, T., Moreda, P., Mu˜ noz, R., Palomar, M.: Are better summaries also easier to understand? analyzing text complexity in automatic summarization. In: Litvak, M., Vanetik, N. (eds.) Multilingual Text Analysis: Challenges, Models, and Approaches, chap. 10. World Scientific (2019) 24. Mani, I.: Summarization evaluation: an overview (2001) 25. Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., Sundheim, B.: Summac: a text summarization evaluation. Nat. Lang. Eng. 8(1), 43–68 (2002) 26. McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Advances in Information Retrieval, pp. 557–564 (2007) 27. Merlino, A., Maybury, M.: An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News. MIT Press, Cambridge (1999) 28. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004) 29. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) 30. Nanba, H., Okumura, M.: Producing more readable extracts by revising them. In: Proceedings of the 18th Conference on Computational Linguistics, vol. 2, pp. 1071–1075. Association for Computational Linguistics (2000) 31. Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 43–76. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4 3 R 32. Nenkova, A., McKeown, K., et al.: Automatic summarization. Found. Trends Inf. Retrieval 5(2–3), 103–233 (2011) 33. Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004 (2004) 34. Ng, J.P., Abrecht, V.: Better summarization evaluation with word embeddings for rouge. arXiv preprint arXiv:1508.06034 (2015) 35. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002) 36. Pastra, K., Saggion, H.: Colouring summaries Bleu. In: Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are Evaluation Methods, Metrics and Resources Reusable?, pp. 35–42. Association for Computational Linguistics (2003)


37. Pittaras, N., Montanelliy, S., Giannakopoulos, G., Ferraray, A., Karkaletsis, V.: Crowdsourcing in single-document summary evaluation: the argo way. In: Litvak, M., Vanetik, N. (eds.) Multilingual Text Analysis: Challenges, Models, and Approaches, chap. 8. World Scientific (2019) 38. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGrawHill, Inc., New York (1986) 39. Sasaki, Y., et al.: The truth of the f-measure. Teach Tutor Mater 1(5), 1–5 (2007) 40. Steinberger, J., Jeˇzek, K.: Text summarization and singular value decomposition. In: Yakhno, T. (ed.) ADVIS 2004. LNCS, vol. 3261, pp. 245–254. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30198-1 25 41. Steinberger, J., Jeˇzek, K.: Evaluation measures for text summarization. Comput. Inform. 28(2), 251–275 (2012) 42. Well, A.D., Myers, J.L.: Research Design & Statistical Analysis. Psychology Press, London (2003)

Performance of Evaluation Methods Without Human References for Multi-document Text Summarization

Alexis Carriola Careaga(B), Yulia Ledeneva(B), and Jonathan Rojas Simón(B)

Autonomous University of the State of Mexico, 50000 Toluca, Mexico
[email protected], {ynledeneva,jrojass}@uaemex.mx, [email protected], [email protected]

Abstract. Evaluation is a significant task in Natural Language Processing (NLP) because it allows comparing the performance of NLP methods. In this sense, the evaluation of summaries has been addressed over the last two decades. Originally, evaluation methods were performed manually to determine the performance of summaries, requiring human judgments. Afterward, automatic evaluation methods were developed to reduce the time that a manual method involves. Currently, Jensen-Shannon (JS) Divergence and ROUGE-C are well-known methods that evaluate summaries without human references. However, the two methods have not been compared in the context of multi-document text summarization. In this paper, we propose a five-stage methodology that seeks to compare the performance of the state-of-the-art metrics, using Pyramids and Responsiveness as manual metrics. The obtained results show a descriptive comparison of evaluation methods with and without human references.

Keywords: Multi-document text summarization · Evaluation methods without human references · Evaluation metrics · Jensen-Shannon (JS) divergence · ROUGE-C

1 Introduction

The summary has become one of the most indispensable tools in daily life due to the exponential increase in electronic information. Therefore, it is essential to reduce all this documentation in an understandable and simple way, extracting the most important information. This emphasizes the importance of developing automatic methods that allow identifying the most relevant elements of a document, with the aim of creating a shorter text. Automatic Generation of Text Summaries (AGTS) is one of the tasks of the Natural Language Processing (NLP) area, whose purpose is to extract the relevant information from one or more documents to develop an abbreviated version [6]. However, the Evaluation of Text Summaries (ETS) is also indispensable to determine the performance of AGTS methods [9]. In accordance with [4], evaluation methods are classified under two main criteria: extrinsic and intrinsic. Extrinsic evaluation determines the usefulness of the summary


in other tasks [14, 23], while intrinsic evaluation focuses on analyzing the content and coherence of the summary [9]. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an intrinsic evaluation package that compares the content of an automatic summary with summaries generated by humans (human references) [8]. Therefore, it requires human-made summaries, whose generation is time-consuming. For this reason, evaluation methods without human references have been proposed. ETS methods without human references, such as the Jensen-Shannon (JS) Divergence and ROUGE-C, do not require human references; that is, they compare the content of the automatic summary with its original documents [10, 12]. In the context of ETS, the Text Analysis Conference (TAC) is a series of evaluation workshops focused on generating and evaluating multi-document summaries. Since 2008, TAC has organized annual conferences to develop state-of-the-art AGTS and ETS methods. Specifically, the second TAC conference (TAC-2009) is distinguished by addressing ETS through the Automatically Evaluating Summaries Of Peers (AESOP) task, providing the TAC09 corpus. To evaluate summaries of the TAC09 corpus, we have used the JS Divergence and ROUGE-C methods through an experimental methodology of five stages that determines the performance of these methods. The structure of the paper is as follows: Sect. 2 describes the ROUGE method as related work. Section 3 presents the evaluation methods that do not use human references; moreover, it describes Pyramids and Responsiveness as manual evaluation methods. Section 4 describes the stages of the proposed methodology. Section 5 shows the results obtained with the proposed methodology and a comparison of evaluation methods. Finally, conclusions and future work are presented in Sect. 6.

2 Related Work

Currently, ROUGE is considered the most used evaluation package by the scientific community due to the high performance of its evaluation methods [8]. In general, its methods calculate the quality of automatically generated summaries by comparing their content to a set of human references [7, 8]. In this section, the most used ROUGE evaluation methods are described.

2.1 ROUGE-N

ROUGE-N measures the statistical co-occurrence of n-grams between the candidate summary and a set of human references [8]. Formally, ROUGE-N measures the recall of n-grams between a candidate summary and a set of reference summaries. ROUGE-N is calculated as shown in Eq. (1), where n represents the length of the n-gram (gram_n) and Count_match(gram_n) represents the maximum number of n-grams that co-occur in a candidate summary and the human references [6, 8].

\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{HumanReferences}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{HumanReferences}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}   (1)
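To make Eq. (1) concrete, the following is a minimal sketch of clipped n-gram recall over pre-tokenized texts; the tokenization, function names, and clipping strategy are illustrative assumptions, not the official ROUGE implementation.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate_tokens, reference_token_lists, n=1):
    """Clipped n-gram recall of a candidate against a set of references (Eq. 1)."""
    cand = ngram_counts(candidate_tokens, n)
    matched, total = 0, 0
    for ref_tokens in reference_token_lists:
        ref = ngram_counts(ref_tokens, n)
        total += sum(ref.values())
        # An n-gram is matched at most as many times as it occurs in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0

# Toy usage
print(rouge_n_recall("the cat sat on the mat".split(),
                     ["the cat was on the mat".split()], n=2))
```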


2.2 ROUGE-L

ROUGE-L considers the candidate summary and the human references as two sequences of words. These sequences are used to find the Longest Common Subsequence (LCS) [8]. The resulting LCS is used to calculate the recall (R_lcs), precision (P_lcs), and F-measure (F_lcs), as shown in Eqs. (2), (3), and (4):

R_{lcs} = \frac{LCS(X, Y)}{m}   (2)

P_{lcs} = \frac{LCS(X, Y)}{n}   (3)

F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}   (4)

where LCS(X, Y) is the length of the LCS between the candidate summary (X) and the human references (Y); m and n represent the lengths of the sequences of the candidate summary and the human references, respectively; moreover, β = P_lcs / R_lcs [8].

2.3 ROUGE-S and ROUGE-SU

ROUGE-S (Skip-Bigram Co-occurrence Statistics) measures the statistical co-occurrence of skip-bigrams between the candidate summary and the human references. Unlike ROUGE-N, ROUGE-S uses skip-bigrams to determine the quality of the candidate summaries in terms of recall (Eq. (5)), precision (Eq. (6)), and F-measure (Eq. (7)) [8]:

R_{skip2} = \frac{SKIP2(X, Y)}{m}   (5)

P_{skip2} = \frac{SKIP2(X, Y)}{C(n, 2)}   (6)

F_{skip2} = \frac{(1 + \beta^2) R_{skip2} P_{skip2}}{R_{skip2} + \beta^2 P_{skip2}}   (7)

SKIP2(X, Y) is the number of skip-bigrams that co-occur between the candidate summary (X) and the human references (Y); β gives the relative importance between P_skip2 and R_skip2; C is the combination function. On the other hand, m and n represent the number of skip-bigrams of X and Y, respectively [7].
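As a rough illustration of Eqs. (2)–(4), here is a minimal LCS-based sketch that follows the definitions above (m: candidate length, n: reference length); the dynamic-programming helper and the default β = 1 are assumptions for the example, not the official ROUGE code.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate_tokens, reference_tokens, beta=1.0):
    """Recall, precision, and F-measure of Eqs. (2)-(4) for one candidate/reference pair."""
    lcs = lcs_length(candidate_tokens, reference_tokens)
    m, n = len(candidate_tokens), len(reference_tokens)
    r = lcs / m if m else 0.0
    p = lcs / n if n else 0.0
    f = (1 + beta ** 2) * r * p / (r + beta ** 2 * p) if (r + p) else 0.0
    return r, p, f

print(rouge_l("the cat sat on the mat".split(), "the cat was on the mat".split()))
```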

3 Evaluation Methods

Previously, evaluation methods were performed manually to determine the quality of summaries, where human judgments were required. In consequence, automatic evaluation methods were developed to decrease the time involved in performing manual evaluation [7, 13]. In this section, a brief definition of these methods is given.


3.1 Manual Methods

The manual evaluation methods are described below.

1. Pyramids: It identifies Summary Content Units (SCUs) to compare the information in summaries [16]. The pyramid method models and quantifies the variation in human content selection [17].
2. Responsiveness: It is a measure of overall quality that combines content selection and linguistic quality [12].

3.2 Automatic Methods

The definitions of the automatic evaluation methods are provided below.

1. ROUGE-C: It is an evaluation method that does not require human references. It represents a modification of the traditional ROUGE package in which the human summaries are replaced with the source document [3]. The metrics that underlie ROUGE-C are described as follows:
   • ROUGE-C-N: It measures the proportion of n-grams between the candidate summary and the source document. Three evaluation metrics derived from this method have been used (ROUGE-C-1, 2, and 3) [3].
   • ROUGE-C-L and SU4: These metrics are based on extracting and evaluating the LCS and skip-bigrams with four-word gaps between the candidate summary and the source document [3].
2. SIMetrix: It evaluates summaries without human references based on ten similarity metrics [11]. The best SIMetrix metric is the JS Divergence, which measures the degree of dissimilarity between the automatic summary and its original document [11].
   • JS Divergence: It incorporates the idea of the distance between two distributions with respect to their mean distribution, and it measures the loss of information from the candidate summary [11] (a minimal sketch follows this list).
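The following is a minimal sketch of the Jensen-Shannon divergence between the unigram distributions of a summary and its source document, assuming whitespace tokenization and a simple add-delta smoothing; it illustrates the idea and is not the SIMetrix implementation.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, delta=0.0005):
    """Smoothed unigram probability distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def js_divergence(summary_tokens, source_tokens):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M the mean distribution."""
    vocab = set(summary_tokens) | set(source_tokens)
    p = unigram_dist(summary_tokens, vocab)
    q = unigram_dist(source_tokens, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js_divergence("storms hit the coast".split(),
                    "severe storms hit the gulf coast on monday".split()))
```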

4 Proposed Methodology

In this section, we describe the proposed methodology used to determine the performance of the evaluation methods without human references. The methodology consists of five steps:

1. Selection and organization of the corpus: We selected the TAC09 corpus to evaluate summaries with the JS Divergence and ROUGE-C methods. TAC09 is a corpus based on the update summarization task, which consists of 880 source documents grouped into 44 collections. For each collection, two types of documents can be found (initial: A, and updated: B) [24]. In this paper, we selected the source documents of type A (440 in total). Moreover, we selected their corresponding initial summaries (2,420 documents distributed over 55 AGTS methods) to evaluate them. Within TAC09, the summaries were manually evaluated using the Pyramids and Responsiveness methods [24].


2. Preprocessing: First, all summaries and source documents were cleaned by eliminating the HTML tags. Subsequently, stop words were removed. Finally, word reduction was carried out through the Snowball stemming algorithm (an updated version of the Porter stemmer [18]).
3. Evaluation of summaries: To determine the performance of the evaluation methods, we used the unsmoothed (JS) and smoothed (JS-Smoothed) versions of the JS Divergence. In addition, we used the ROUGE-C metrics ROUGE-C-1, 2, 3, L, and SU4. Table 1 describes each metric.
4. Evaluation organization: From the evaluation results, the arithmetic mean is calculated for each metric.
5. Evaluation procedure using correlation indices: In this step, we calculated the correlation between the automatic and manual evaluation scores (given by Pyramids and Responsiveness) through the following indices (a minimal sketch of this step follows the list):
   • Pearson's correlation [15]: Pearson's correlation coefficient (also known as the product-moment correlation coefficient) determines whether there is a linear association between automatic and manual evaluation.
   • Spearman's rank correlation [22]: Spearman's rank correlation coefficient (Spearman's rho) measures the degree of monotonic association between two variables.
   • Kendall's rank correlation [5]: Kendall's rank correlation coefficient (Kendall's tau, τ) is a nonparametric measure that identifies the concordant and discordant pairs of two variables. For each variable, a ranking is assigned, and the dependence relationship between the variables is calculated.
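A minimal sketch of step 5 using SciPy is shown below; the per-system score lists are toy placeholders, not the TAC09 values.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Toy per-system scores: one automatic score and one manual (e.g., Pyramids) score per AGTS system.
automatic = [0.62, 0.58, 0.41, 0.39, 0.35]
manual    = [0.55, 0.57, 0.36, 0.40, 0.30]

pearson_r, _  = pearsonr(automatic, manual)
spearman_r, _ = spearmanr(automatic, manual)
kendall_t, _  = kendalltau(automatic, manual)

print(f"Pearson={pearson_r:.3f}  Spearman={spearman_r:.3f}  Kendall={kendall_t:.3f}")
```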

Table 1. Evaluation metrics derived from the ROUGE-C method and the JS Divergence.

Metrics         Description
ROUGE-C-1       It measures the proportion of unigrams (words) that co-occur between the candidate summary and the source document.
ROUGE-C-2       It measures the proportion of bigrams that co-occur between the candidate summary and the source document.
ROUGE-C-3       It measures the proportion of trigrams that co-occur between the candidate summary and the source document.
ROUGE-C-L       This metric is based on the extraction and evaluation of LCSs between the candidate summary and its source document.
ROUGE-C-SU4     It is similar to ROUGE-C-S, except that it involves skip-bigrams with a maximum jump distance of four words.
JS              This metric measures the similarity between two probability distributions representing the candidate summary and its source document.
JS-Smoothed     Like JS, this metric measures the degree of divergence between the probability distributions of the candidate summary and its source document, but it also considers words in the candidate summary that do not appear in the source document.


5 Obtained Results

This section shows the results obtained by the evaluation methods. To assess the performance of the summaries in TAC09, we used the evaluation metrics derived from ROUGE-C and the JS Divergence (see Table 1). In Fig. 1, it can be observed that the best three metrics were JS, JS-Smoothed, and ROUGE-C-1. These results show that the JS Divergence obtained the highest score (0.68198), while ROUGE-C-SU4 received the lowest score (0.01426). In general, lower performance is obtained with the ROUGE-C metrics, whereas the higher performances were obtained with the JS Divergence.

Fig. 1. Comparison of the average evaluation scores obtained by the JS and ROUGE-C metrics without human references (JS: 0.68198, JS-Smoothed: 0.58767, ROUGE-C-1: 0.01622, ROUGE-C-L: 0.01621, ROUGE-C-2: 0.01534, ROUGE-C-3: 0.01435, ROUGE-C-SU4: 0.01426).

Table 2 shows the correlation results obtained by the evaluation metrics against Pyramids. The JS-Smoothed metric obtained the highest correlations for the Pearson (0.68636) and Kendall (0.53005) coefficients. Additionally, the JS metric obtained the highest correlation for the Spearman coefficient (0.68962). On the other hand, ROUGE-C-3 obtained the lowest correlation for the Pearson coefficient (0.02897). Similarly, ROUGE-C-SU4 obtained the lowest Spearman (0.24363) and Kendall (0.17195) correlations.

Table 2. Correlation results between Pyramids and the evaluation metrics using the Pearson, Spearman, and Kendall coefficients.

Metrics         Pearson    Spearman   Kendall
ROUGE-C-1       0.56546    0.57621    0.41934
ROUGE-C-2       0.16945    0.55335    0.41202
ROUGE-C-3       0.02897    0.49533    0.36124
ROUGE-C-L       0.55746    0.58874    0.42500
ROUGE-C-SU4     0.21663    0.24363    0.17195
JS              0.65228    0.68962    0.52195
JS-Smoothed     0.68636    0.68681    0.53005


Table 3 shows the correlation results of the evaluation metrics derived from the JS Divergence and ROUGE-C methods against Responsiveness. The JS-Smoothed metric obtained the highest Pearson correlation (0.68884), while the JS metric obtained the highest Spearman and Kendall correlations (0.64384 and 0.48325). On the other hand, ROUGE-C-3 obtained the lowest Pearson (0.17232) and Spearman (0.44501) correlations, while ROUGE-C-1 obtained the lowest Kendall correlation (0.31254).

Table 3. Correlation results between Responsiveness and the evaluation metrics using the Pearson, Spearman, and Kendall coefficients.

Metrics         Pearson    Spearman   Kendall
ROUGE-C-1       0.54594    0.45018    0.31254
ROUGE-C-2       0.27289    0.48000    0.35681
ROUGE-C-3       0.17232    0.44501    0.32894
ROUGE-C-L       0.54174    0.45583    0.31561
ROUGE-C-SU4     0.35223    0.48708    0.36709
JS              0.67733    0.64384    0.48325
JS-Smoothed     0.68884    0.62186    0.46701

5.1 Comparison of the State-of-the-Art Evaluation Methods

This section compares the performance of the state-of-the-art evaluation methods that participated in the evaluation task of TAC-2009 (AESOP) [24].

Clustering, Linguistics, and Statistics for Summarization Yield (CLASSY): The IDA Center for Computing Sciences developed this evaluation system. In 2009, it participated in the update summarization task and submitted four runs, numbered from one to four, for summary evaluation in AESOP [1]. It requires human references.

NCSR Demokritos (DemokritosGR): In 2009, it participated in the AESOP task and submitted two runs, represented by the numbers one and two at the end of the organization's name, for the evaluation of summaries. Human references are required to perform the assessment.

University of Pennsylvania (UPenn): It is an evaluation system developed by the Department of Computer and Information Science. In 2009, it participated in the AESOP task and submitted three runs. It can evaluate summaries with and without human references.


Language Technologies Research Center (LTRC) PRaSA: In 2009, it participated in the AESOP task and submitted four runs, represented by the numbers one to four at the end of the organization's name. It can evaluate with and without human references.

University of Ottawa (uOttawa): It is an evaluation system developed by the School of Information Technology and Engineering. In 2009, it participated in the AESOP task; however, it provided only one run, represented by the number one at the end of the organization's name. This system requires human references.

University of West Bohemia (UWB.JRC.UT): In 2009, it participated in the AESOP task and submitted three runs, represented by the numbers at the end of the organization's name. It evaluates with and without human references.

To provide a fair comparison of all evaluation metrics, we calculated the harmonic mean of their Pearson, Spearman, and Kendall correlation results. For several NLP tasks, the harmonic mean has been used to provide an overall comparison of proposed methods (the F-measure being the most common example); a minimal sketch of this aggregation is shown below.

Figure 2 shows a comparison of all state-of-the-art metrics, using Pyramids as the manual reference method. For this comparison, we included the metrics that require and do not require human references. In general, CLASSY4 (0.90376) and CLASSY2 (0.89032) obtained the highest correlations, while JS-Smoothed (0.62226) and JS (0.61236) obtained lower correlation results. This situation was expected because the best metrics employ human references. However, JS-Smoothed and JS show the best performance when human references are not available. Similarly, Fig. 3 shows a comparison of all state-of-the-art metrics, using Responsiveness as a reference. In this comparison, CLASSY4 (0.78908) and CLASSY3 (0.77889) show the best correlation results. On the other hand, the JS-Smoothed (0.58836) and JS (0.57680) metrics remain the best metrics that do not require human references.

Unlike Figs. 2 and 3, Figs. 4 and 5 present a comparison of the metrics that do not require human references. As we can observe in Fig. 4, both JS-Smoothed (0.62226) and JS (0.61236) show the highest correlation with Pyramids, while the other state-of-the-art metrics obtained lower correlations with Pyramids. On the other hand, Fig. 5 shows a comparison of the evaluation metrics that do not require human references, using Responsiveness as the manual evaluation method of reference. As shown, JS-Smoothed and JS remain the best metrics, obtaining correlation results of 0.58836 and 0.57680, respectively. In contrast, UPenn1, UPenn2, PRaSa3, and PRaSa4 show lower correlations with Responsiveness than the JS Divergence.
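Below is a minimal sketch of the harmonic-mean aggregation of the three correlation coefficients, using as an example the JS correlations with Pyramids from Table 2; everything else in the snippet is only a usage illustration.

```python
from statistics import harmonic_mean

# JS correlations with Pyramids (Table 2): Pearson, Spearman, Kendall.
js_correlations = [0.65228, 0.68962, 0.52195]

overall = harmonic_mean(js_correlations)
print(f"Harmonic mean for JS: {overall:.5f}")  # ~0.61236, the JS value reported in Fig. 2
```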

Fig. 2. Comparison of the state-of-the-art evaluation metrics with and without human references using the metric Pyramids as a manual evaluation metric (horizontal axis: harmonic mean of the three correlation coefficients).

Fig. 3. Comparison of the state-of-the-art evaluation metrics with and without human references using the metric Responsiveness as a manual evaluation metric (horizontal axis: harmonic mean).

Fig. 4. Comparison of the state-of-the-art evaluation metrics without human references using the metric Pyramids as a manual evaluation metric (horizontal axis: harmonic mean).

Fig. 5. Comparison of the state-of-the-art evaluation metrics without human references using the metric Responsiveness as a manual evaluation metric (horizontal axis: harmonic mean).

6 Conclusions and Future Works

According to the average scores obtained for the JS and ROUGE-C metrics, the two best metrics for evaluating a summary without human references are JS and JS-Smoothed: in our experiments, JS obtained a score of 0.68198 and JS-Smoothed a score of 0.58767, which were the two highest scores. According to the correlation results between the evaluation metrics and Pyramids and Responsiveness, obtained with the Pearson, Spearman, and Kendall coefficients, JS and JS-Smoothed are the best metrics because they obtained the highest scores. To achieve these results, we proposed an experimental methodology whose steps are corpus selection and organization, preprocessing, summary evaluation, evaluation organization, and, finally, the evaluation procedure using correlation indices. The proposed methodology presents the steps that an evaluation method needs to follow in order to determine how good it is in the evaluation process. It is important to compare the correlation coefficients to see whether the methods used produce similar scores.


As future work, we propose using other text representation and language models to evaluate summaries, such as syntactic n-grams (sn-grams) [20, 21], Word2vec [15], Doc2vec [16], and BERT [2, 19, 25]. Moreover, we suggest that the combination/ensemble of these representations and models will improve the performance of evaluation methods with or without human references.

References

1. Conroy, J.M., et al.: CLASSY 2009: summarization and metrics. In: Proceedings of the Text Analysis Conference (TAC 2009), pp. 1–12. NIST, Maryland, USA (2009)
2. Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, vol. 1, pp. 4171–4186 (2019)
3. He, T., et al.: ROUGE-C: a fully automated evaluation method for multi-document summarization. In: 2008 IEEE International Conference on Granular Computing (GRC 2008), pp. 269–274 (2008). https://doi.org/10.1109/GRC.2008.4664680
4. Jones, K.S., Galliers, J.R.: Evaluating Natural Language Processing Systems. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/BFb0027470
5. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1–2), 81–93 (1938)
6. Ledeneva, Y., García-Hernández, R.A.: Automatic generation of text summaries: challenges, proposals and experiments. Autonomous University of the State of Mexico, Toluca (2017)
7. Lin, C.-Y., Hovy, E.: Manual and automatic evaluation of summaries. In: Proceedings of the ACL-02 Workshop on Automatic Summarization, pp. 45–51. Association for Computational Linguistics, Morristown, NJ, USA (2002). https://doi.org/10.3115/1118162.1118168
8. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pp. 25–26 (2004)
9. Lloret, E., Plaza, L., Aker, A.: The challenging task of summary evaluation: an overview. Lang. Resour. Eval. 52(1), 101–148 (2017). https://doi.org/10.1007/s10579-017-9399-2
10. Louis, A., Nenkova, A.: Automatic summary evaluation without human models (2008)
11. Louis, A., Nenkova, A.: Automatically assessing machine summary content without a gold standard. Diss. Abstr. Int. B Sci. Eng. 70(8), 4943 (2013). https://doi.org/10.1162/COLI
12. Louis, A., Nenkova, A.: Automatically evaluating content selection in summarization without human models. In: Proceedings of EMNLP 2009, pp. 306–314 (2009). https://doi.org/10.3115/1699510.1699550
13. Mendoza, G.A.M., et al.: Detection of main ideas and production of summaries in English, Spanish, Portuguese and Russian. 60 years of research. Alfaomega Grupo Editor, S.A. de C.V. and Universidad Autónoma del Estado de México, State of Mexico, Mexico (2021)
14. Matias Mendoza, G.A., et al.: Evaluación de las herramientas comerciales y métodos del estado del arte para la generación de resúmenes extractivos individuales. Res. Comput. Sci. 70(1), 265–274 (2013). https://doi.org/10.13053/rcs-70-1-20
15. Pearson, K.: VII. Note on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 58(347–352), 240–242 (1895). https://doi.org/10.1098/rspl.1895.0041
16. Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of HLT-NAACL 2004, pp. 145–152 (2004)
17. Nenkova, A., et al.: The Pyramid Method: incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process. 4(2) (2007). https://doi.org/10.1145/1233912.1233913


18. Porter, M.F.: An algorithm for suffix stripping. Program 40(3), 211–218 (2006). https://doi.org/10.1108/00330330610681286
19. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP-IJCNLP 2019, pp. 3982–3992 (2020). https://doi.org/10.18653/v1/d19-1410
20. Sidorov, G., et al.: Syntactic n-grams as machine learning features for natural language processing. Expert Syst. Appl. 41(3), 853–860 (2014). https://doi.org/10.1016/j.eswa.2013.08.015
21. Sidorov, G.: Syntactic n-grams in Computational Linguistics. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-14771-6
22. Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904)
23. Steinberger, J., Ježek, K.: Evaluation measures for text summarization. Comput. Inform. 28(2), 251–275 (2009)
24. Dang, H.T., Owczarzak, K.: Overview of TAC 2009 summarization track. In: Proceedings of the Text Analysis Conference, pp. 1–25. Gaithersburg, USA (2009)
25. Zhang, T., et al.: BERTScore: evaluating text generation with BERT. In: Proceedings of the International Conference on Learning Representations (ICLR 2020), pp. 1–43. Ethiopia (2020)

EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image

Yeyang Zhou, Yixin Chen, Yimin Chen, Shunlong Ye, Mingxin Guo, Ziqi Sha, Heyu Wei, Yanhui Gu, Junsheng Zhou, and Weiguang Qu

School of Computer and Electronic Information, Nanjing Normal University, Nanjing, China
[email protected], [email protected]

Abstract. Image understanding is an essential research issue for many applications, such as text-to-image retrieval, Visual Question Answering, visual dialog, and so forth. In all these applications, how to comprehend images through queries has been a challenge. Most studies concentrate on general images and have obtained the desired results, but not for some specific real applications, e.g., remote sensing. To tackle this issue, in this paper we propose an enhanced attention-based approach, entitled EAGLE, which seamlessly integrates the properties of aerial images and NLP. In addition, we contribute the first large-scale remote sensing question answering corpus (https://github.com/rsqa2018/corpus). Extensive experiments conducted on real data demonstrate that EAGLE outperforms the state-of-the-art approaches.

Keywords: Visual question answering · Attention · Remote sensing

1 Introduction

Image understanding is expected to investigate semantic information in images and further helps to make decisions. Great progress has been achieved in its downstream applications, including image captioning [8], visual question generation [19], and text-to-image retrieval [31]. The Visual Question Answering (VQA) [11,24,28] task has recently emerged as a more elusive task. It requires answering a textual question according to an image. Since questions are open-ended, VQA includes many challenges in language representation, object recognition, reasoning, and specialized tasks like counting. Typically, VQA systems apply an LSTM and a CNN individually to extract features from input questions and images, then project the two modalities into a joint semantic space for answer prediction. Existing research mainly focuses on common images. In contrast, few studies focus on specific application scenarios, e.g., remote sensing.


Remote sensing is the science of acquiring information from a location that is distant from the data source [13]. In this paper, we only take into consideration visible photographs of the Earth's surface. Investigating concrete information from remote sensing images is an arduous but fundamental task. On the one hand, remote sensing images are demanding to comprehend. To begin with, remote sensing images are typically captured from aeroplanes or satellites, and variable factors should therefore be considered, e.g., position, depression angle, rotation, and illumination. Additionally, given equal image size, the smaller the scale is, the wider the range the image will cover and the coarser its content will be. As a result, remote sensing images may contain an enormous number of local objects of any size, including tiny ones with low resolution and vague contours. Such uncertainty makes the content of remote sensing images even more intricate. Furthermore, many characteristics in remote sensing images may carry uncertain semantics from distinct perspectives: it is likely that there is significant variation among queries even about the same image due to their different purposes. On the other hand, it is vital to understand remote sensing images automatically from the desired angles, since the semantic understanding of aerial images has many useful applications in civilian and even military fields, such as disaster monitoring [4], scene classification [5], and military intelligence generation [26].

Inspired by human attention mechanisms, current VQA systems [6,15,16] attempt to focus on parts of interest and have achieved promising performance on general images. However, the resulting algorithms still fail to know where to look when images are highly informative and tend to struggle in understanding questions for which specialized knowledge is required. To better understand the question and image contents and their relations, a good VQA algorithm needs to identify global scene attributes, locate objects, and identify object attributes, quantities, and categories to make appropriate inferences. Therefore, advanced attentive approaches ought to be explored in remote sensing settings.

In this paper, we propose an Enhanced Attention-based strategy by Generating answers from Learning questions to a remote sensing imagE, entitled EAGLE. It can be thought of as a metaphor for eagle hunting. Features of the image are extracted first; just imagine an eagle hovering at high altitude, locating its potential prey. Then the question is encoded word by word. After every word embedding is encoded, the co-attention unit helps to dynamically focus on the parts of interest of both the visual and textual information; once the eagle begins swooping down, the prey becomes aware of the danger and flees in all directions, yet the sharp-eyed eagle adjusts its diving angle. After the last word of the question is processed, we combine the final attended question and image features, modeling answer prediction as a classification task; eventually, the eagle's claws grab the prey and pull it up. We argue that the bimodal attentions in EAGLE provide complementary information and are effectively integrated in a unified framework. In order to demonstrate the efficacy of EAGLE, we construct the first large-scale remote sensing question answering corpus. Extensive experiments on real data demonstrate that our proposal outperforms the state-of-the-art approaches.

Fig. 1. The framework of EAGLE with fused question attention and image attention. EAGLE comprises (1) image feature extraction, (2) question encoding, and (3) answer prediction. We incorporate attentive question features q̂ and attentive image features v̂ to jointly reason about visual attention and question attention in a series of steps iteratively (e.g., "built" in "What is built over the river?" is being encoded).

2 Methodology

2.1 Problem Formulation

The remote sensing question answering task considered in this paper is defined as follows. Before the task, we predefine an answer candidate set with M classes, A = {a_1, a_2, ..., a_M}. Given an aerial image, with its features for N spatial regions represented by V = {v_1, v_2, ..., v_N}, and the question about it with K word embeddings R = {r_1, r_2, ..., r_K}, output the answer class with the highest prediction score among the answer set A. The weights in different modules/layers are denoted by W, with appropriate sub/superscripts as necessary. In the exposition that follows, we omit the bias term b to avoid notational clutter.

2.2 EAGLE: An Enhanced Attention-Based Strategy

Traditional attention methods have obtained impressive achievements on general images. Yet, this is not the case when there is an enormous amount of rich detail in the image, e.g., remote sensing photographs: traditional attention may assign false weights and is thereupon likely to hinder the overall performance. This shows that more sophisticated attention methods should be explored to attack this issue. Let us shift to a canyon. An eagle takes off from a cliff to forage. It hovers at high altitude. The panoramic view comes into its eyes, and it starts locating its potential prey. All of a sudden, the eagle specifies its objective and begins swooping down on it. Soon, the prey is aware of the danger and tries to flee, but the


eagle discerns the situation clearly and adjusts its diving angle. Eventually, the eagle seizes its prey and feasts. Inspired by such observations, we propose an enhanced attention-based strategy by generating answers from learning questions to a remote sensing image, entitled EAGLE (see Fig. 1 for its framework). We incorporate attentive question features q̂ and attentive image features v̂ to jointly reason about visual attention and question attention in a series of steps in an alternating manner, in the sense that the image representations are applied to guide the question attention and the question representations are applied to guide the image attention.

In order to focus on the regions of the input image that are relevant to the question, we exploit the attended question features q̂ as guidance. To put it more specifically, the attentive image feature v̂_k is a weighted arithmetic mean of all convolutional features V = {v_1, v_2, ..., v_N}. Dynamic weights in attention matrices indicate the significance of different regions in an image according to q̂_k. Formally, at time step k, we first compute the attention matrix H_k^v associated with the region features V corresponding to the feature vector q̂_k:

H_k^v = \tanh(W_v V + (W_q \hat{q}_k)\mathbf{1}^T)   (1)

where 1 is a vector with all elements equal to "1". Then, we rescale each weight using a softmax:

a_k^v = \mathrm{softmax}(w_v^T H_k^v)   (2)

The image feature v_n at the n'th location is then weighted by the attention value a_{k,n}^v, and we obtain the attended feature vector for the image:

\hat{v}_k = \sum_{n=1}^{N} a_{k,n}^v v_n   (3)

Attentive question features q̂ are a natural symmetry against attentive image features v̂. Unfortunately, there is a major difference between image processing and question encoding: image features are extracted in a parallel manner, yet questions are encoded word by word. In consequence, at different time steps, the number of updated hidden states of the LSTM varies, and we fail to obtain a fixed-size matrix to represent the information in the question. This would have a bad effect on the training of the parameters. To circumvent this issue, we define a fixed-size matrix Q by:

Q = (q_1, q_2, \cdots, q_K)^T, \quad \text{where } q_p = \begin{cases} h_p, & \text{if } p \in [1, k] \text{ and } p \in \mathbb{Z} \\ 0, & \text{otherwise} \end{cases}   (4)

It is worth noting that at time step k, we compute the softmax only over the first k question features. The rest of the computation is similar to that of the image attention. Therefore we only enumerate the formulas without further discussion:

H_k^q = \tanh(W_q Q + (W_v \hat{v}_{k-1})\mathbf{1}^T), \quad a_k^q = \underset{1,2,\cdots,k}{\mathrm{softmax}}(w_q^T H_k^q), \quad \hat{q}_k = \sum_{p=1}^{k} a_{k,p}^q q_p   (5)
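The following is a minimal NumPy sketch of the alternating co-attention of Eqs. (1)–(5); the dimensions, random weights, and function names are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, N, K = 64, 36, 8                      # feature size, image regions, question length
V = rng.normal(size=(d, N))              # region features v_1..v_N (columns)
H = rng.normal(size=(d, K))              # LSTM hidden states h_1..h_K (columns)
W_v, W_q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_v, w_q = rng.normal(size=d), rng.normal(size=d)

def image_attention(q_hat):
    """Eqs. (1)-(3): attend over image regions guided by the attended question feature."""
    Hv = np.tanh(W_v @ V + (W_q @ q_hat)[:, None])   # (d, N)
    a_v = softmax(w_v @ Hv)                          # (N,)
    return V @ a_v                                   # weighted mean of region features

def question_attention(v_hat, k):
    """Eqs. (4)-(5): attend over the first k encoded words guided by the attended image feature."""
    Q = H[:, :k]                                     # only words seen so far
    Hq = np.tanh(W_q @ Q + (W_v @ v_hat)[:, None])   # (d, k)
    a_q = softmax(w_q @ Hq)                          # (k,)
    return Q @ a_q

# Alternating co-attention over the question, word by word.
v_hat = image_attention(np.zeros(d))                 # q_hat_0 = 0
for k in range(1, K + 1):
    q_hat = question_attention(v_hat, k)
    v_hat = image_attention(q_hat)
print(q_hat.shape, v_hat.shape)                      # (64,) (64,)
```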

We argue that the aforementioned enhanced attention-based strategy of EAGLE provides complementary information and is effectively integrated in a unified framework. Therefore, EAGLE is beneficial for thoroughly understanding aerial images, which contain numerous characteristics.

2.3 Overall Framework

EAGLE comprises three major components: (1) image feature extraction, (2) question encoding, and (3) answer prediction. These are the standard building modules for most, if not all, existing visual question answering models. Algorithm 1 describes the full details of the forward procedure to answer a given question according to a remote sensing image.

Faster R-CNN [25] holds state-of-the-art results in object detection. We apply it to extract convolutional features from the input aerial images. Bottom-up attention within the region proposal network aims to focus on specific elements in the given image. The Faster R-CNN is pretrained and its parameters are fixed during the training of our model. After the image is passed through Faster R-CNN, we obtain a matrix of its features V = {v_1, v_2, ..., v_N}, where N denotes the number of image regions and the vector v_n represents the feature extracted from the n'th region.

Then, the input question R with K words is processed. Each word is turned into a dense vector representation according to a look-up dictionary. These vectors are initialized with pretrained word embeddings. The resulting embedding vectors {r_1, r_2, ..., r_K} are fed into our enhanced LSTM model one by one. Our LSTM encompasses a memory cell vector c that can store information for a long period of time, as well as three types of gates that control the flow of information into and out of these cells: input gates i, forget gates f, and output gates o. Given the word vector r_k at time step k, the previous hidden state h_{k-1}, the attentive question feature q̂_{k-1}, and the cell state c_{k-1}, the update rules of our LSTM model can be defined as follows:

i_k = \mathrm{sigmoid}(W_{ri} r_k + W_{hi} h_{k-1} + W_{qi} \hat{q}_{k-1})
f_k = \mathrm{sigmoid}(W_{rf} r_k + W_{hf} h_{k-1} + W_{qf} \hat{q}_{k-1})
o_k = \mathrm{sigmoid}(W_{ro} r_k + W_{ho} h_{k-1} + W_{qo} \hat{q}_{k-1})
g_k = \tanh(W_{rg} r_k + W_{hg} h_{k-1} + W_{qg} \hat{q}_{k-1})
c_k = f_k \odot c_{k-1} + i_k \odot g_k
h_k = o_k \odot \tanh(c_k)   (6)

where g_k is the activation term at time step k, and ⊙ is the element-wise multiplication of two vectors.
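As a rough sketch of Eq. (6), the cell below adds the attentive question feature q̂_{k-1} as an extra input to every gate of an otherwise standard LSTM cell; the NumPy weights and names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eagle_lstm_step(r_k, h_prev, q_hat_prev, c_prev, W):
    """One update of Eq. (6). W maps gate names to (W_r, W_h, W_q) weight triples."""
    def gate(name, act):
        W_r, W_h, W_q = W[name]
        return act(W_r @ r_k + W_h @ h_prev + W_q @ q_hat_prev)

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", np.tanh)
    c = f * c_prev + i * g          # element-wise products, as in Eq. (6)
    h = o * np.tanh(c)
    return h, c

# Toy usage with random weights.
rng = np.random.default_rng(1)
d_in, d_h = 300, 512
W = {name: (rng.normal(size=(d_h, d_in)),
            rng.normal(size=(d_h, d_h)),
            rng.normal(size=(d_h, d_h))) for name in "ifog"}
h, c = eagle_lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), np.zeros(d_h), W)
print(h.shape, c.shape)  # (512,) (512,)
```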


Next, we fuse the final attentive question feature q̂_K and attentive image feature v̂_K using:

h = \tanh(W_Q \hat{q}_K + W_V \hat{v}_K)   (7)

This can be interpreted as projecting the representation of the question and that of the image into a joint semantic space. To generate an answer, we treat the VQA output as a multi-label classification task. Before we train the model, we predetermine a candidate answer set A = {a_1, a_2, ..., a_M}, which is derived from the training answer set. The joint representation h is used to compute a prediction score for each class, and we output the answer class with the highest prediction score among the candidate answer set A. The exact formulation is as follows:

p_A = \mathrm{softmax}(W_h h)   (8)
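A minimal sketch of Eqs. (7)–(8), reusing attended features of the kind produced by the previous sketch; the answer set and weights are toy assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, M = 64, 5                                                    # feature size, number of candidate answers
answers = ["river", "bridge", "highway", "forest", "airport"]   # toy candidate set A

W_Q, W_V = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_h = rng.normal(size=(M, d))

q_hat_K, v_hat_K = rng.normal(size=d), rng.normal(size=d)       # final attended features
h = np.tanh(W_Q @ q_hat_K + W_V @ v_hat_K)                      # Eq. (7): joint representation
p = softmax(W_h @ h)                                            # Eq. (8): class scores
print(answers[int(np.argmax(p))])
```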

Algorithm 1: Forward Pass for One Question
Input: an aerial image and a natural language question about it
Output: an answer
  Extract image features with Faster R-CNN;
  q̂_0 = 0;
  Compute the initial attentive image feature v̂_0 according to Eqs. (1) to (3);
  for k = 1 to K do
    Encode r_k with the LSTM according to Eq. (6);
    Compute the attentive question feature q̂_k according to Eqs. (4) and (5);
    Compute the attentive image feature v̂_k according to Eqs. (1) to (3);
  end
  Fuse the ultimate question and image features to predict an answer according to Eqs. (7) and (8);

3 Remote Sensing Question Answering Corpus

To evaluate the performance of our proposed method, we construct a large-scale remote sensing question answering corpus. To the best of our knowledge, our corpus is the first one for remote sensing question answering. The current version of the corpus contains 6,921 images with 8,763 open-ended question-answer pairs. Most of the questions are about object, color, shape, number, and location.

3.1 Creation Procedure

We invite experts in the remote sensing field to collect pictures of desirable quality from Google Earth. After that, we obtain 6,921 aerial images fixed to 224 × 224 pixels with various resolutions: 921 of them are typical urban shots, and the rest are divided into 30 object classes according to land use. The object


classes include airport, bareland, baseball field, beach, and so on. Each class has 200 images. We choose images of different land use to simulate complicated geographical scenes. We claim that this practice effectively reduces the bias that remote sensing question answering models can exploit, and thereupon enhances our corpus's ability to benchmark VQA algorithms.

As for the image annotations, we resort to crowdsourcing. Remote sensing images are complicated for ordinary people to describe. In consequence, we screen annotators only from volunteers with related knowledge of the remote sensing field and annotation experience. The 921 pictures of typical urban shots are annotated with three questions per picture, and the remaining 6,000 pictures with one question each. Each question is tagged with 10 answers. To sum up, we collect 8,763 QA pairs.

To get freestyle, interesting, and diversified question-answer pairs, the annotators are free to raise any questions. The only limitation is that the questions should be answerable from the visual content and commonsense. Therefore, our corpus contains a wide range of AI-related questions, such as object recognition (e.g., "What is there in green?"), positions and interactions among objects in the image (e.g., "Where is the centre?"), and reasoning based on commonsense and visual content (e.g., "Why does the car park here?"). We give preference to annotators who provide interesting questions that require high-level reasoning to answer. Yet, the freedom we give to the annotators makes it harder to control the quality of the annotation, compared to a more detailed instruction.

To monitor the annotation quality, we conduct a pilot quality filtering task to select promising annotators before the formal annotation task. Specifically, we first randomly sample ten images from our image set as a quality monitoring corpus. Then we ask participants to view these images and write a QA pair for each one. After each annotator finishes labeling the quality monitoring corpus, we examine the results and eliminate participants whose annotations are far from satisfactory (i.e., the questions should be related to the content of the image and the answers should be correct). Finally, we select a number of annotators with CV and NLP backgrounds (50 individuals) to participate in the formal annotation task. We pick a set of good and bad examples of the annotated question-answer pairs from the pilot quality monitoring task and show them to the selected annotators as references. We also provide reasons for selecting these examples. We post tasks in small batches over the course of two months to prevent participants from working long hours. Despite these measures, there are still annotations with general, uninformative QA pairs. After the annotation of all the images is completed, we further refine the corpus and remove a small portion of the images with badly labeled questions and answers.

3.2 Corpus Statistics

In our remote sensing corpus, the total number of question-answer pairs is 8,763. Each question contains around 4.98 words on average, and the number of questions longer than 5 words is 3,801, making up 43.37% of the total. Meanwhile,


the average length of each answer is about 2.00 words, and about 12.01% of the answers are longer than 3 words. We split the corpus into a training set with 7,170 question-answer pairs and a test set with 1,593. We can also see from the statistical data that the questions are mostly about object, color, shape, number, and location. Below, we provide an analysis of our remote sensing corpus in terms of both questions and answers.

Statistics of Questions
– Types of Questions. From our statistical data we can see that the questions are mostly about object, color, shape, number, and location, which is in accordance with the attributes of remote sensing images. Additionally, there is a variety of question types including "What is", "What color", "How many", and "Is there?", and questions like "What is around the playground?", "What color of the tree?", and "What is on the bridge?" are very common in our remote sensing corpus.
– Lengths. Each question contains around 4.98 words, and the number of questions longer than 5 words is 3,801, making up 43.37% of the total questions.

Statistics of Answers
– The Most Frequent Answers. For the many questions beginning with "What is", "What color", and "How many", the answers are mostly about object, color, and number.
– Lengths. Most answers consist of a single word. The average length of each answer is about 2 words, and about 12.01% of the answers are longer than 3 words. Though it may be tempting to believe that brief answers make question answering much easier, these question-answer pairs are open-ended and require complicated reasoning.

4 Experimental Evaluation

4.1 Models Including Ablative Ones

We choose the LSTM+CNN model without attention as the Baseline. We refer to our Eagle Attention as EAGLE for conciseness. We also perform ablation studies to quantify the role of each component in our model. The goal of the comparison is to verify that our improvements derive from the synergistic effect of the two intertwining attention approaches. Specifically, we re-train our approach while ablating certain components:
– Question Attention alone (QuesAtt for short), where no image attention is performed.
– Image Attention alone (ImgAtt for short), where we do not apply any question attention.

4.2 Dataset

All experiments are conducted on our remote sensing corpus. We split the QA pairs into a training set with 7,170 QA pairs and a test set with 1,593 QA pairs. We ensure that all images are exclusive to either the training set or the test set.

4.3 Evaluation Metrics

We evaluate the predicted answers with accuracy:

\mathrm{Accuracy} = \frac{\text{Num. of correctly classified questions}}{\text{Num. of questions}} \times 100\%

4.4 Implementation Details

For the question encoding module, we select a two-layer LSTM model with a hidden size of 512. Moreover, the length of the questions is fixed to 26, and each word embedding is a vector of size 300 with 2,400-dimensional hidden states. For the image feature extraction module, the pretrained VGG-16 model [30] is applied as initialization and only the fully-connected layers are fine-tuned. Then we take the activation from the last pooling layer of VGGNet as its feature. To prevent overfitting, dropout is used after the fully connected layers with a dropout ratio of 0.5.
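A minimal PyTorch sketch of the question-encoder configuration described above is shown below; the class name, vocabulary size, and the way the answer classifier is attached are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Two-layer LSTM question encoder with the hyperparameters reported in Sect. 4.4."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.dropout = nn.Dropout(0.5)               # dropout ratio 0.5 after FC layers
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_ids):                 # (batch, 26) padded token ids
        embedded = self.embedding(question_ids)      # (batch, 26, 300)
        _, (h_n, _) = self.lstm(embedded)            # h_n: (2, batch, 512)
        return self.classifier(self.dropout(h_n[-1]))

logits = QuestionEncoder()(torch.zeros(4, 26, dtype=torch.long))
print(logits.shape)  # torch.Size([4, 1000])
```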

4.5 Results and Analysis

We perform experimental evaluations to demonstrate the effectiveness of the proposed model. The experimental results in Table 1 show that the EAGLE model performs best, and that the ImgAtt model and the QuesAtt model both perform better than the standard LSTM-CNN VQA model. We argue that both image and question attention are beneficial for feature extraction, and that the fusion of the two modal attentions by means of EAGLE is more suitable for remote sensing settings.

Table 1. Results on the Remote Sensing Question Answering Corpus, in percentage.

Models     Color   How    Inference   Number   Object   Location   All
Baseline   63.9    18.8   82.0        62.3     38.3     8.8        25.5
QuesAtt    54.6    20.0   82.0        52.3     38.2     9.0        27.7
ImgAtt     64.7    19.4   82.8        62.7     38.8     9.3        27.1
EAGLE      66.2    20.1   83.7        50.9     51.0     9.9        31.6

Table 1 also presents the models' performance in terms of question type. The "color" and "number" categories correspond to the question types "what color"


Table 2. EAGLE's accuracy per scene, in percentage.

Category             Accuracy   Category             Accuracy
Airport              3.0        Industrial           41.5
Bareland             37.0       Meadow               50.5
Baseball field       57.0       Medium residential   55.0
Beach                25.5       Mountain             36.5
Bridge               9.0        Park                 31.0
Center               33.5       Pond                 38.5
Church               5.5        Port                 43.5
Commercial           51.5       Sparse residential   48.0
Dense residential    50.0       Square               41.5
Desert               31.5       Stadium              38.5
Farmland             40.0       Station              25.0
Forest               59.5       Storage tank         12.5

and "how many", while the "object" category contains the question types starting with "What is/type/kind/field/building/bridge...". The "inference" category covers questions whose corresponding answers are "Yes/No". The "How" questions customarily require general knowledge and commonsense about people's motivations, and the "Where" questions also require plenty of external knowledge, particularly knowledge of the relevant location. We observe that VQA systems excel in answering questions requiring inference, where accuracies exceed 80.0%. However, they always generate inapt responses about commonsense and location. As we can see, EAGLE performs best among the four models in all question types except "number" questions. Surprisingly, EAGLE improves by nearly 13 percentage points on questions about objects. Indeed, the other models show some counting ability in images with a single object type, but their ability to focus is weak since remote sensing images contain a large amount of characteristics. For questions like "What is next to the bare land?", EAGLE can locate the target object and recognize it. The superior performance of EAGLE demonstrates the effectiveness of using hybrid attention mechanisms.

In order to validate EAGLE's achievement in real scenes, we report EAGLE's performance on large quantities of real data in Table 2. Although the average accuracy is 31.6%, there is a huge difference among the scenes. According to the results, EAGLE behaves best in the forest scene. It achieves 59.5% accuracy, which is comparable to the results on general images. We argue that such good performance is due to the fact that green is so prevalent in these images that only a little insightful information can be investigated. In light of the large percentage of "green" and "tree" in the corpus, VQA systems can perform well even if they only yield these two answers without reference to the picture. By contrast, EAGLE only achieves 3.0% in airport settings. We assume that it is likely for people to ask questions about numbers in this scene. Unfortunately, EAGLE does not explicitly incorporate modules to handle the task of counting and is inclined to answer these questions arbitrarily. This shows that there is still a long way to go for a general solution to visual questions in specific application scenarios. In the future, we will build a bigger remote sensing question answering corpus in order to train better models for the semantic understanding of aerial images.


Fig. 2. Visualization of attention to the question and/or image, if any. From left to right: Baseline, QuesAtt, ImgAtt, and EAGLE. Specifically, with regard to the rearmost attended question vector q̂_K, question attentions are scaled (from scarlet: high to transparent: low); as for the terminal attended image vector v̂_K, the lighter part denotes emphasized locations while other regions become blurred.

The main purpose of the proposed EAGLE method is to obtain an enhanced co-attention that performs better than a naive combination of question attention and image attention. Hence we visualize the attention regions of both modal attentions in Fig. 2. In the attention visualization, we overlay attention probability distribution matrices, denoting the most prominent part of an image/question based on the question/image. Notice that in the first example, while the QuesAtt model identifies the target region river according to the query, it attends to many other irrelevant parts as well. By contrast, our EAGLE model concentrates solely on the river. The same holds for the question attention. As can be seen, the EAGLE model is capable of more accurately paying attention to the area of interest in both questions and images, and the predicted answers yield better results. The comparison vividly illustrates that our improvements derive from the synergistic effect of the two intertwining attention approaches.

5 Related Work

5.1 Visual Question Answering (VQA) with Attention

Given an image and a question about it, VQA systems are supposed to generate an appropriate answer. The task requires visual and linguistic comprehension, language grounding, as well as common-sense knowledge. Previous studies exploited the holistic content of images without differentiation. However, questions dealing with local regions of images account for a large percentage. To deal with this, many studies [12,22,35] adopt attention mechanisms to associate the input question with the corresponding image regions for VQA. [27]


presents an approach for learning to answer visual questions by selecting image regions relevant to a text-based query. The approach maps textual queries and visual features from various regions into a shared space in which they are compared for relevance by applying an inner product. To answer questions that require reasoning with a series of sub-steps, [33] introduces a multiple-layer attention framework, which stacks attention modules to infer the answer iteratively. Recently, a combined bottom-up and top-down attention mechanism [1] enables attention to be calculated at the level of objects and other salient image regions. The bottom-up mechanism proposes image regions, each with an associated feature vector, while the top-down mechanism determines the feature weightings. Meanwhile, bilinear pooling methods [7,9,20] have been studied for better fusion of image and question pairs. To answer questions that require a series of sub-steps, the Neural Module Network (NMN) [2] framework uses external question parsers to find the sub-tasks in the question, whereas Recurrent Answering Units (RAU) [21] are trained end-to-end so that the sub-tasks can be implicitly learned. Previous research mainly aims at the common image setting, and the lack of studies on specific application scenarios, for example remote sensing, is striking.

5.2 Associated Datasets

A typical VQA dataset comprises large quantities of images and question-answer (QA) pairs for each image. This can be achieved either by manual annotation or with a semi-automatic algorithm that uses captions or question templates and detailed image annotations. There are around two dozen VQA datasets, which have successfully driven progress in the task. The DAQUAR [18] dataset is built on top of the NYU-Depth V2 dataset [29]. The answers are mostly single-word. The complete dataset has 6,795 training QA pairs and 5,674 test pairs. The VQA dataset [3] contains 614,163 questions and 7,984,119 answers for 204,721 images from Microsoft's COCO dataset [14], along with 150,000 questions and 1,950,000 answers for 50,000 abstract scenes. On average, an image has nine QA pairs and the answers encompass 894 categories. The Visual Genome [10] dataset contains 1,773,258 QA pairs. On average, an image has 17 QA pairs. This dataset has two types of QA pairs (freeform QAs, which are based on the entire image, and region-based QAs, which are based on selected regions of the image) and six types of questions (what, where, when, who, why, and how). A subset of the freeform QA portion of Visual Genome is released as Visual 7W [36]. It is worth noting that all pictures from existing VQA datasets are about daily life scenes. In consequence, it is urgent to have a VQA dataset directed at aerial images. Another line of datasets related to ours is created for remote sensing image captioning. Two datasets are provided in [23], namely the UCM-captions dataset and the Sydney-captions dataset. The UCM-captions dataset is based on the UC Merced Land Use Dataset [32]. It contains 21 land use classes, including agricultural, airplane, baseball diamond, and so forth, with 100 images for each class. The Sydney-captions dataset is based on the Sydney Data Set [34]. It contains seven classes after cropping and selection, including residential, airport,


meadow, rivers, ocean, industrial, and runway. RSICD [17] is a recent large-scale aerial image dataset, where the total number of remote sensing images is 10,921, with five sentence descriptions per image. In all three aforementioned datasets, five different sentences are given to describe every image. Since remote sensing images are captured from airplanes or satellites, there are complicated semantics in every aerial image. As a result, remote sensing image captioning might be insufficient to exhaustively excavate the attributes of the objects and the relations between objects. By contrast, remote sensing question answering can mine fine-grained details, thus comprehending aerial images in a more thorough way.

6 Conclusion

We propose the task of mining details from remote sensing images, which necessitates a thorough comprehension of both language and visual data. While much research on general images has obtained the desired results, traditional approaches are not directly applicable in the remote sensing setting. In this paper, we propose an enhanced attention-based approach which seamlessly integrates the properties of remote sensing images with NLP. In addition, we present a large-scale remote sensing question answering corpus; to the best of our knowledge, it is the first such corpus. Extensive experiments conducted on real data demonstrate that our proposal outperforms state-of-the-art approaches. Last but not least, we perform ablation studies to quantify the roles of the different components of our model.

Acknowledgements. We would like to thank the anonymous annotators for their professional annotations. This work is supported by the Natural Science Research of Jiangsu Higher Education Institutions of China under Grant 15KJA420001, National Natural Science Foundation of China under Grant 61772278, and Open Foundation for Key Laboratory of Information Processing and Intelligent Control in Fujian Province under Grant MJUKF201705.

References

1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
2. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Deep compositional question answering with neural module networks. CoRR abs/1511.02799 (2015)
3. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
4. Chen, K., Crawford, M.M., Gamba, P., Smith, J.S.: Introduction for the special issue on remote sensing for major disaster prevention, monitoring, and assessment. IEEE Trans. Geosci. Remote Sens. 45(6–1), 1515–1518 (2007)
5. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105(10), 1865–1883 (2017)
6. Das, A., Agrawal, H., Zitnick, L., Parikh, D., Batra, D.: Human attention in visual question answering: do humans and deep networks look at the same regions? vol. 163, pp. 90–100. Elsevier (2017)


7. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP (2016)
8. Karpathy, A., Li, F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR, pp. 3128–3137 (2015)
9. Kim, J., On, K.W., Lim, W., Kim, J., Ha, J., Zhang, B.: Hadamard product for low-rank bilinear pooling. CoRR abs/1610.04325 (2016)
10. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)
11. Li, Q., Fu, J., Yu, D., Mei, T., Luo, J.: Tell-and-answer: towards explainable visual question answering using attributes and captions. In: EMNLP (2018)
12. Liang, J., Jiang, L., Cao, L., Li, L., Hauptmann, A.G.: Focal visual-text attention for visual question answering. In: CVPR, pp. 6135–6143 (2018)
13. Lillesand, T., Kiefer, R.W., Chipman, J.: Remote Sensing and Image Interpretation. Wiley, Hoboken (2015)
14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
15. Lin, Y., Pang, Z., Wang, D., Zhuang, Y.: Feature enhancement in attention for visual question answering. In: IJCAI, pp. 4216–4222 (2018)
16. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. 29, 289–297 (2016)
17. Lu, X., Wang, B., Zheng, X., Li, X.: Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 56(4), 2183–2195 (2017)
18. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: NIPS, pp. 1682–1690 (2014)
19. Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., Vanderwende, L.: Generating natural questions about an image. In: ACL (2016)
20. Nguyen, D., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR, pp. 6087–6096 (2018)
21. Noh, H., Han, B.: Training recurrent answering units with joint loss minimization for VQA. CoRR abs/1606.03647 (2016)
22. Qiao, T., Dong, J., Xu, D.: Exploring human-like attention supervision in visual question answering. In: AAAI (2018)
23. Qu, B., Li, X., Tao, D., Lu, X.: Deep semantic understanding of high resolution remote sensing image. In: CITS, pp. 1–5 (2016)
24. Rajani, N.F., Mooney, R.J.: Stacking with auxiliary features for visual question answering. In: NAACL-HLT, pp. 2217–2226 (2018)
25. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)
26. Shi, Z., Zou, Z.: Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Trans. Geosci. Remote Sens. 55(6), 3623–3634 (2017)
27. Shih, K.J., Singh, S., Hoiem, D.: Where to look: focus regions for visual question answering. In: CVPR, pp. 4613–4621 (2016)
28. Shimizu, N., Rong, N., Miyazaki, T.: Visual question answering dataset for bilingual image understanding: a study of cross-lingual transfer using attention maps. In: ICCL, pp. 1918–1928 (2018)


29. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
31. Xie, L., Shen, J., Zhu, L.: Online cross-modal hashing for web image retrieval. In: AAAI, vol. 30 (2016)
32. Yang, Y., Newsam, S.D.: Bag-of-visual-words and spatial extensions for land-use classification. In: GIS (2010)
33. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)
34. Zhang, F., Du, B., Zhang, L.: Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 53(4), 2175–2184 (2014)
35. Zhu, C., Zhao, Y., Huang, S., Tu, K., Ma, Y.: Structured attentions for visual question answering. In: ICCV, pp. 1291–1300 (2017)
36. Zhu, Y., Groth, O., Bernstein, M.S., Fei-Fei, L.: Visual7w: grounded question answering in images. In: CVPR, pp. 4995–5004 (2016)

Text Mining

Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Sattar Seifollahi1,2(B) and Massimo Piccardi1

1 University of Technology Sydney, Ultimo, NSW, Australia
2 Capital Markets Cooperative Research Centre, The Rocks, NSW, Australia
[email protected]

Abstract. Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics that are very high-dimensional and do not capture the semantics of the documents well. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps: first, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by using k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach achieves comparable or higher accuracy than conventional feature extraction approaches, with a much more compact representation.

Keywords: Feature extraction · Classification · Clustering

1 Introduction

Recent years have witnessed an incessant growth in the creation of digital text, from the increasing number of organisational documents and workflows to the large amounts of messages continuously generated on social media. As an example, the number of tweets generated on the popular Twitter platform is estimated to have reached over 200 billion per year. The immediate challenge stemming from such a huge growth in textual data is how to understand their contents in effective and efficient ways.

S. Seifollahi—Currently working at Resolution Life (Australia). This work was performed while at the University of Technology Sydney.

Document classification (or categorisation) is likely the most widespread automated task on textual data, with a vast array of applications such as sentiment analysis, ad targeting, spam detection, client relationships, risk assessment and medical diagnosis. A class is a subset of documents which are, in some sense, similar to one another and different from those of other classes, and the goal is to assign text documents to a predefined set of classes. Document classification has been studied extensively, especially since the emergence of the Internet, where documents are typically created by an unverified variety of authors and with little metadata.

A major issue for effective document classification is the extraction of appropriate features and document representations. Techniques using the bag-of-words (BoW) model are the most widespread [2], with an early reference in a linguistic context dating back to 1954 [13]. In this model, a text (such as a sentence or a document) is represented as the bag, or multiset, of its words, disregarding the word order but retaining the word counts. The entire vocabulary used in the document corpus is considered as the feature space, and each document is represented by the vector of its word frequencies. Given that most documents only use a small subset of the available words, such feature vectors tend to be very sparse and unnecessarily high-dimensional. Such a high dimensionality can be regarded as an instance of the curse of dimensionality [12], making it difficult for clustering and classification algorithms to perform effectively. In addition, the BoW model ignores the linguistic interaction between words and does not account for word ordering [7], while the meaning of natural languages deeply depends on them. It is also very common to reweight these vectors to reflect the "discriminative power" of each word. The term frequency-inverse document frequency (tf-idf) approach is the most popular weighting scheme for the BoW model [23]. The tf-idf weighting increases the importance of words that appear rarely in the corpus, assuming that they would be more discriminative in any ensuing clustering and classification tasks [4,10,22].

A promising solution for mitigating the BoW flaws is to exploit an ontology (an organised set of concepts in the domain area) [9,11,14]. This can both lead to a dramatic decrease in the number of features and help incorporate semantic knowledge [7,9,11]. We refer the reader to [3] for further details on ontology learning and applications. However, how to best exploit ontologies to represent documents is still an open problem. Most of the proposed techniques only employ existing semantic lexical databases [9,11] such as WordNet to organise the documents' words and determine similarities. However, if the ontology does not suit the collection of documents, the extracted features may result in poor performance, not only in classification, but also in clustering, semantic analysis and other tasks.

An example of challenging text data are the phone call transcripts used for the experiments in this paper (Sect. 3.1). The transcripts originate from phone conversations between claim managers and clients of an accident support agency, and reflect a very specialised terminology and a diversity of transcription styles. These documents represent a challenge for existing, general-purpose ontologies and suggest exploring dedicated approaches for effectively incorporating semantic aspects in the document features.

Recently, text analytics has been extensively benefiting from the adoption of word "embeddings" that map words from their intrinsic categorical values to vector spaces [19,21]. These models take as input a large corpus of documents and embed each distinct word in a vector space of low-to-moderate dimensionality (typically, a few hundred dimensions, compared to approximately 10^5-10^6 for BoW). Word embeddings overcome the typical drawbacks of the BoW model, including the issue of high dimensionality and the dismissal of the word order in the text [18]. These models have been widely used in the literature for document clustering and classification [1,15-18,25,27-30].

For these reasons, in this paper we propose leveraging a dedicated taxonomy (an instance of an ontology) built upon a word embedding space. This approach was originally proposed for document clustering in [24], where it demonstrated its effectiveness at grouping documents into consistent clusters. In this paper, we extend it to classification and semantic analysis to demonstrate its effectiveness also for these tasks. We also leverage sets of predefined "words of interest" for the specific organisation to greatly decrease the number of features to a very manageable size, equal to the size of the set of predefined words. The feature extraction algorithm uses a three-step process: first, a word embedding technique is used to convert each distinct word to a vector of |W| dimensions. Second, the word vectors are partitioned into a hierarchy of clusters, simply referred to as "word clusters", via a hierarchical clustering algorithm. Third, the individual documents are projected onto a space of predefined words whose size is equal to the size of this set. Our ultimate goal is to develop a versatile feature extraction technique with a strong focus on dimensionality reduction that can assist with a variety of tasks, from phone call classification and clustering to the monitoring of the progress of individual clients.
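To relate the compact taxonomy-based features to the BoW/tf-idf baseline discussed above, the following is a minimal sketch of that baseline. The toy corpus and the exact weighting variant (raw counts with smoothed idf) are illustrative assumptions, not the paper's experimental pipeline.

```python
import math
from collections import Counter

docs = ["claim manager called client about recovery",
        "client reported pain and stress after accident",
        "payment for the claim was approved"]          # toy corpus (assumed)

tokenised = [d.split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})   # feature space = whole vocabulary

# document frequency and a smoothed inverse document frequency
df = Counter(w for doc in tokenised for w in set(doc))
idf = {w: math.log(len(docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(doc_tokens):
    tf = Counter(doc_tokens)
    # one dimension per vocabulary word: sparse and very high-dimensional in practice
    return [tf[w] * idf[w] for w in vocab]

vectors = [tfidf_vector(doc) for doc in tokenised]
print(len(vocab), "features per document")
```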

2 Methodology

This section describes the main components of our methodology, namely i) the approach for generating the hierarchy of word clusters, ii) the approach for extracting taxonomy-based features based on a set of predefined words, and iii) a comparison approach that generates the features directly from the clusters in the hierarchy.

2.1 Hierarchy of Word Clusters

In this subsection, we describe a method for partitioning words into a hierarchy of clusters. In the literature, there exist two main lines of methods for clustering, namely partitional and hierarchical algorithms (please refer to [20] for further discussion on clustering techniques and their convergence properties). The k-means algorithm and its variants are the most broadly used partitional algorithms [2,8,23,24,26]. In turn, hierarchical algorithms can be divided into two main categories, namely agglomerative and divisive. Agglomerative algorithms generally perform better than divisive algorithms, and often "better" than single-layer algorithms such as k-means [20]. These algorithms exploit a bottom-up approach, i.e. the clustering produced at each layer of the hierarchy merges similar clusters from the previous layer. However, [26] showed that bisecting k-means can produce clusters that are both better than those of standard k-means and as good as (or better than) those produced by agglomerative hierarchical clustering. For these reasons, in this paper we use a method similar to bisecting k-means with two modifications: 1) we use spherical k-means instead of standard k-means, and 2) each cluster is split into two sub-clusters only if the number of its elements exceeds a predefined threshold. Figure 1 illustrates the hierarchical word clusters. For the details and steps of the algorithm, we refer readers to [24].
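A minimal sketch of this thresholded bisecting procedure is given below. The tree representation, the min_size and max_depth parameters, and the approximation of spherical k-means by running scikit-learn's KMeans on L2-normalised vectors are assumptions for illustration, not the authors' implementation (which follows [24]).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hierarchy(word_vecs, min_size=50, max_depth=3):
    """Recursively bisect a set of word vectors into a cluster tree."""
    def split(indices, depth):
        node = {"indices": indices, "children": []}
        if depth >= max_depth or len(indices) <= min_size:
            return node                                   # stop: too small or deep enough
        X = word_vecs[indices]
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit length -> cosine geometry
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        node["children"] = [split(indices[labels == c], depth + 1) for c in (0, 1)]
        return node
    return split(np.arange(len(word_vecs)), 0)

# toy usage: 500 random 100-dimensional "word embeddings"
rng = np.random.default_rng(0)
tree = build_hierarchy(rng.normal(size=(500, 100)), min_size=100, max_depth=3)
```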

Fig. 1. Three layers of the hierarchy of word clusters.

2.2 Taxonomy-Augmented Features Given a Set of Predefined Words

A taxonomy can play a key role in document clustering by reducing the number of features from typically thousands to a few tens only. In addition, the feature reduction process benefits from the taxonomy's semantic relations between words. In [24], the authors presented an approach for taxonomy-based feature extraction and proved its usefulness for document clustering. In this approach, the individual documents are projected onto a space of predefined words, with the feature dimensionality equal to the number of such words. We believe that the approach can also prove useful for other tasks such as document classification and semantic analysis, where features are extracted from the documents (e.g., phone call transcripts) that individuals produce over time in order to monitor their progress and tailor interventions. The approach is briefly recapped here for ease of reference; for further details please refer to [24]. To describe the process precisely, let us assume that D = [d_1, ..., d_M] is the document matrix, where d_i ∈ R^N and N is the number of predefined words. With this notation, the steps for generating the document features are described by Algorithm 1.

Algorithm 1. Taxonomy-augmented features given a set of predefined words
Input: hierarchy of word clusters C_w; set of predefined words S = {w_1, ..., w_N}
Output: document matrix D ∈ R^{M×N}
1. Set D = [d_1, ..., d_M] = 0, where d_i ∈ R^N and N is the size of the set of predefined words. Set Ws = [ws^1, ..., ws^L] = [[ws^{1,1}, ws^{1,2}], ..., [ws^{L,1}, ..., ws^{L,2^L}]], with every ws^{l,j} = ∅.
2. For each word w in S:
   2.1 For level l = 1, ..., L:
      2.1.1 Find the index j of the cluster at level l such that w ∈ C_w^{l,j}.
      2.1.2 Set ws^{l,j} = ws^{l,j} ∪ {w}.
3. For each document d_i, i = 1, ..., M:
   3.1 For each word w in document d_i:
      3.1.1 For level l = 1, ..., L:
         3.1.1.1 Find the cluster index j such that w ∈ C_w^{l,j}.
         3.1.1.2 Retrieve all words of ws^{l,j} with their corresponding indices in S, I(l, j).
         3.1.1.3 Set d_i^{I(l,j)} = d_i^{I(l,j)} + x_{i,w}.
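A compact sketch of this projection is shown below. The data structures (a dict mapping each word to its cluster path and per-document word weights x_{i,w}) are assumptions made for illustration of Algorithm 1's accumulation step.

```python
from collections import defaultdict

def taxonomy_features(docs_weights, cluster_path, S):
    """docs_weights: list of dicts {word: weight x_{i,w}}, one per document.
    cluster_path: dict word -> tuple of cluster ids, one per level l = 1..L.
    S: list of N predefined words. Returns an M x N matrix D (list of lists)."""
    # step 2: group the predefined words by the cluster they fall into at each level
    ws = defaultdict(list)                      # (level, cluster id) -> indices into S
    for n, w in enumerate(S):
        for l, j in enumerate(cluster_path[w]):
            ws[(l, j)].append(n)
    # step 3: add each document word's weight onto the co-clustered predefined words
    D = [[0.0] * len(S) for _ in docs_weights]
    for i, doc in enumerate(docs_weights):
        for w, x_iw in doc.items():
            if w not in cluster_path:
                continue                        # out-of-vocabulary words are skipped
            for l, j in enumerate(cluster_path[w]):
                for n in ws.get((l, j), []):
                    D[i][n] += x_iw
    return D
```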

2.3 Taxonomy-Augmented Features Given the Hierarchy of Word Clusters

In this section, we review another method from [24] solely for the sake of comparison. The approach is largely similar to that introduced in the previous subsection; however, it differs in its final stage, where the documents are projected directly onto the word clusters rather than the set of predefined words. We note that this approach is not suitable for what we call semantic analysis in this paper, since the features are not representative of any "words of interest" for the domain experts. We briefly describe the algorithm's steps here and refer the interested readers to [24] for further details. Let us assume that D = [d_1, ..., d_M] are M documents in word space. Each document can be further written as d_i = [d_i^1, d_i^2, ..., d_i^L], where L is the number of levels in the hierarchy of word clusters and d_i^l = [d_i^{l,1}, d_i^{l,2}, ..., d_i^{l,2^l}]. For simplicity, we store D in a two-dimensional matrix of size M × N, where N is the overall number of clusters over all levels, i.e. N = Σ_{l=1}^{L} 2^l.

Algorithm 2. Taxonomy-augmented features
Input: hierarchy of word clusters C_w
Output: document matrix D ∈ R^{M×N}
1. Set D = [d_1, ..., d_M] = 0, where d_i ∈ R^N, N = Σ_{l=1}^{L} 2^l, and L is the number of layers in the hierarchy of word clusters.
2. For each document d_i, i = 1, ..., M:
   2.1 For each word w in document d_i:
      2.1.1 For level l = 1, ..., L:
         2.1.1.1 Find the cluster index j such that w ∈ C_w^{l,j}.
         2.1.1.2 Set d_i^{l,j} = d_i^{l,j} + x_{i,w}.
   2.2 For l = 1, ..., L:
      2.2.1 Normalise d_i^l to unit norm, i.e. ||d_i^l|| = 1.
3. Remove any features from D that have zero value across all documents.

3 Experiments

3.1 Datasets

To test and compare the proposed approach, we have carried out experiments with textual data from an accident support agency, the Transport Accident Commission (TAC) of the Victorian Government in Australia. The data consist of phone calls between the agency's clients and its claim managers, annotated by the managers into transcripts. Phone calls for single clients take place over time, so the data are suitable for monitoring the clients' progress, such as return to work and physical and emotional recovery. The phone calls typically cover a wide range of topics: for example, some are related to the client's health and recovery, while others are related to payments and compensation.

To conduct experiments on classification, we have merged all the phone call transcripts of each individual client into single documents and created four datasets (D1-4) based on the number of words per document and the number of documents. Datasets D1 and D2 consist, respectively, of 2,000 and 5,000 short documents, ranging from 20 to 100 words per document. The TAC experts have indicated that such short documents are likely to come from clients with fewer phone calls overall and more rapid recovery. Datasets D3 and D4 are similar, but with larger documents ranging from 100 to 5,000 words each. The four datasets have been manually annotated by the TAC experts into a binary classification problem using the labels "MH" and "NO MH" (the label "MH" is used by the TAC to identify clients with a variety of mental health issues). All datasets are balanced in terms of the number of samples per class.

The following preprocessing steps have been applied to each phone call before its use in the experiments: 1) removal of numbers, punctuation, symbols and "stopwords"; 2) replacement of synonyms and misspelled words with the base and actual words; 3) removal of sparse terms (keeping 95% sparsity or less) and infrequently occurring words; 4) removal of uninformative words such as people names and addresses.

3.2 Experimental Set-Up

To learn the word embeddings, we have used the following settings: the dimensionality was set to 100, the context window size to 12, and the number of training epochs to 1,000. In Algorithm 1, we have set N = 100. For the experiments on document classification we have employed two document feature baselines for comparison: 1) the well-known tf-idf, and 2) doc2vec, which is based on the average of the word embeddings. We have also used Algorithm 2 as a further method for comparison. As word embeddings, we have used GloVe due to its reported strong performance in a variety of tasks [21]. For classification, we have compared three popular classifiers: 1) eXtreme Gradient Boosting (XGBoost) [6], 2) a support vector machine (SVM) with a radial basis function kernel, and 3) a random forest. We have used 10-fold cross-validation to report the accuracy: first, the dataset is randomly shuffled and split into 10 folds; then, in turn, each fold is used as the test set and the remaining nine as the training set. All code, including the packages for the classifiers, has been implemented in the R language in a Windows environment.
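The cross-validation protocol can be sketched as follows. The authors used R, so this Python approximation is only illustrative: X and y stand for the precomputed document features and MH/NO-MH labels, and scikit-learn's gradient boosting is used as a stand-in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate(X, y):
    """Return the mean 10-fold cross-validation accuracy per classifier."""
    classifiers = {
        "boosting (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
        "svm (RBF kernel)": SVC(kernel="rbf"),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # shuffle then split
    return {name: cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in classifiers.items()}

# toy usage with random features in place of the extracted document vectors
rng = np.random.default_rng(0)
scores = evaluate(rng.normal(size=(200, 100)), rng.integers(0, 2, size=200))
```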

3.3 Experimental Results on Document Classification

Figure 2 shows the average accuracies from the 10-fold cross-validation for datasets D1-4. Notations “xgb”, “svm” and “rf” stand for classifiers XGBoost, SVM with RBF kernel and random forest, respectively. Notations “w2v”, “tfidf”, “tax” and “tax.w” stand for feature extraction models doc2vec, BoW with tf-idf weighting, taxonomy-augmented Algorithm 2 and taxonomy-augmented Algorithm 1, respectively. Based on Fig. 2, the model based on predefined words, “tax.w”, performs better than the others in most cases, followed by “tax”. The “tfidf” model performs very similarly to “tax.w” using “xgb”; however, it fails to produce accurate predictions with the other two classifiers, in particular with “svm”. Amongst the classification methods, “xgb” performs the best, followed by “rf” and “svm”, respectively. In Fig. 3 we also use the average of the absolute deviations (AVDEV) to explore the variability of the 10-fold cross-validation accuracies around their means. Figure 3 shows that the AVDEV for all methods are mostly in a very similar range, with the exception of “tfidf” using “rf” in D1 and D4 where the AVDEV are, undesirably, much higher. We have not conducted a formal complexity analysis for the methods. However, we have noted that classification takes a very similar time with the different feature vectors, with the exception of “tfidf” that is considerably slower. If we include the time for feature extraction, “w2v” becomes the fastest, followed by “tax”, “tax.w” and “tfidf”, respectively. Model “tfidf” is much slower due to its much larger dimensionality, in particular in conjunction with “svm” classification.


Fig. 2. Average accuracy from 10-fold cross-validation. The horizontal axis maps the classifier and the coloured bars represent the feature vectors.

Fig. 3. Average of the absolute deviations (AVDEV) of accuracies from their mean. Horizontal axis shows classification methods and vertical axis is the AVDEV value.

3.4 Experimental Results on Document Clustering

The performance of a document clustering approach can be well described using two complementary measures: connectivity and silhouette [5]. The connectivity captures the "degree of connectedness" of the clusters and is measured in terms of how many nearest neighbours of any given sample belong to other clusters; its value ranges between 0 and infinity and should be minimised. The silhouette measures the compactness and separation of the clusters and ranges between -1 (poorly clustered observations) and 1 (well clustered observations). To compute these measures, clValid, a popular R package for cluster validation [5], has been used. Figure 4 shows the connectivity and silhouette measures for a dataset called PCalls, similar to D4 but with 8,000 samples. In this figure, "tfidf" stands for the BoW model, "w2v" for doc2vec, and "tax.1" and "tax.2" for the models from Algorithms 2 and 1, respectively. In turn, "km" and "pam" stand for two clustering algorithms, k-means++ and PAM, respectively. These results show that the taxonomy-augmented models, "tax.1" and "tax.2", have performed better than the other two models for a large majority of cluster numbers. Among the conventional models, the performance of BoW is generally better than that of doc2vec. It is noted that BoW is the most time-consuming due to its large number of features; based on the experiments, it is approximately 10 times slower than the other models.
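The two validation measures can be sketched as follows. The authors used the clValid R package, so this Python version, with the silhouette from scikit-learn and a simplified nearest-neighbour connectivity in the spirit of clValid's definition, is only an approximation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def connectivity(X, labels, n_neighbors=10):
    """Sum 1/rank for every nearest neighbour assigned to a different cluster
    (lower is better); a simplified form of the clValid connectivity measure."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)                  # idx[:, 0] is the point itself
    penalty = 0.0
    for i, neighbours in enumerate(idx):
        for rank, j in enumerate(neighbours[1:], start=1):
            if labels[j] != labels[i]:
                penalty += 1.0 / rank
    return penalty

# toy usage: cluster random document vectors and validate the partition
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(connectivity(X, labels), silhouette_score(X, labels))
```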

Fig. 4. Connectivity and silhouette measures of all models for the PCalls dataset.

3.5 Semantic Analysis

In this section, we illustrate the use of the taxonomy-augmented features of Algorithm 1 for the "semantic analysis" of TAC clients. The data for each client consist of several phone calls, where each phone call has been represented by a set of predefined words using Algorithm 1. In this section, we use S = {family, pain, provider, recovery, stress, upset} as the predefined words. In this way, each phone call is represented as a distribution over the chosen words, which in turn are represented as distributions over the entire vocabulary. The concept looks very similar to that of topic modelling, where each document is represented as a distribution over topics and each topic is a distribution over the vocabulary. However, the two models differ notably in that topics are latent variables without an a-priori semantic, whereas our predefined words are observed and can be chosen by experts. Figure 5 shows the semantic analysis of two clients as plots of the six chosen features along successive phone calls. The first client recorded a total of 48 phone calls over 38 months and the second client 14 over 29 months (a few phone calls have been removed because the number of words after preprocessing had fallen below a minimum set threshold). This figure should be regarded as an illustrative example, as any set of words of reasonable size can be used in this analysis.
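A small sketch of how such per-client trajectories can be assembled from the per-call feature vectors is given below. The per-call vectors and the normalisation to a distribution over the six keywords are assumptions for illustration.

```python
import numpy as np

KEYWORDS = ["family", "pain", "provider", "recovery", "stress", "upset"]

def semantic_trajectory(call_vectors):
    """call_vectors: array of shape (n_calls, 6) holding the Algorithm 1 scores of
    one client's successive calls. Returns one normalised time series per keyword."""
    V = np.asarray(call_vectors, dtype=float)
    V = V / np.clip(V.sum(axis=1, keepdims=True), 1e-9, None)  # distribution per call
    return {kw: V[:, k] for k, kw in enumerate(KEYWORDS)}

# toy usage: 10 calls for one client
rng = np.random.default_rng(0)
series = semantic_trajectory(rng.random((10, 6)))
print(series["stress"])
```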

Fig. 5. Evolution of the chosen features in the phone calls of two randomly-selected clients. The semantic scores are computed by Algorithm 1.

4 Conclusion

In this paper, we have presented the application of a taxonomy-augmented feature extraction approach to a variety of important document tasks such as classification, clustering, and semantic analysis. The approach addresses two urgent challenges of conventional document representations, namely 1) their large number of features, and 2) the dismissal of the word ordering in the formation of the features. By amending these two shortcomings, the proposed model has proved able to provide a more compact and semantically-meaningful document representation and improve the tasks’ performance.


In an original set of experiments on document classification (Sect. 3.3), we have compared the proposed approach with two well-known methods, BoW/tf-idf and doc2vec, over four phone call datasets from an accident support agency. The results have shown that the proposed models have achieved better average accuracy in the large majority of cases, while their absolute deviations from the averages have remained comparable. These improvements confirm the results obtained by the proposed models in the experiments on document clustering (Sect. 3.4). In addition (Sect. 3.5), we have illustrated the usefulness of the proposed feature vector for the monitoring of individual progress over time.

Acknowledgement. This project has been funded by the Capital Markets Cooperative Research Centre and the Transport Accident Commission of Victoria. Acknowledgements and thanks to our industry partners David Attwood (Lead Operational Management and Data Research) and Bernie Kruger (Business Intelligence and Data Science Lead). This research has received ethics approval from the University of Technology Sydney (UTS HREC REF NO. ETH16-0968).

References

1. Alshari, E.M., Azman, A., Doraisamy, S., Mustapha, N., Alkeshr, M.: Improvement of sentiment analysis based on clustering of Word2Vec features. In: Proceedings - International Workshop on Database and Expert Systems Applications, DEXA (2017)
2. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Gabow, H. (ed.) Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 1027-1035. Society for Industrial and Applied Mathematics (2007)
3. Asim, M.N., Wasim, M., Khan, M.U.G., Mahmood, W., Abbasi, H.M.: A survey of ontology learning techniques and applications. Database (2018)
4. Bagirov, A., Seifollahi, S., Piccardi, M., Zare, E., Kruger, B.: SMGKM: an efficient incremental algorithm for clustering document collections. In: CICLing 2018 (2018)
5. Brock, G., Pihur, V., Datta, S., Datta, S.: clValid: an R package for cluster validation. J. Stat. Softw. 25, 1-22 (2008)
6. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794 (2016)
7. Cheng, Y.: Ontology-based fuzzy semantic clustering. In: Proceedings - 3rd International Conference on Convergence and Hybrid Information Technology, ICCIT 2008, vol. 2, pp. 128-133 (2008)
8. Dhillon, S., Fan, J., Guan, Y.: Efficient clustering of very large document collections. In: Kamath, C., Kumar, V., Grossman, R., Namburu, R. (eds.) Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, Oxford (2001)
9. Elsayed, A., Mokhtar, H.M.O., Ismail, O.: Ontology based document clustering using MapReduce. Int. J. Database Manage. Syst. 7(2), 1-12 (2015)
10. Erra, U., Senatore, S., Minnella, F., Caggianese, G.: Approximate TF-IDF based on topic extraction from massive message stream using the GPU. Inf. Sci. 292, 143-161 (2015)


11. Fodeh, S., Punch, B., Tan, P.-N.: On ontology-driven document clustering using core semantic features. Knowl. Inf. Syst. 28(2), 395-421 (2011)
12. Friedman, J.H.: On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Min. Knowl. Disc. 1(1), 55-77 (1997)
13. Harris, Z.S.: Distributional structure. Word 10(2-3), 146-162 (1954)
14. Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Third IEEE International Conference on Data Mining, pp. 541-544 (2003)
15. Kim, J., Rousseau, F., Vazirgiannis, M.: Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of EMNLP 2015, pp. 775-780 (2015)
16. Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings to document distances. In: Proceedings of ICML, vol. 37, pp. 957-966 (2015)
17. Lenc, L., Král, P.: Word embeddings for multi-label document classification. In: Proceedings of Recent Advances in Natural Language Processing, pp. 431-437 (2017)
18. Lilleberg, J., Zhu, Y., Zhang, Y.: Support vector machines and word2vec for text classification with semantic features. In: Proceedings of IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 136-140 (2015)
19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv, pp. 1-12 (2013)
20. Moseley, B., Wang, J.R.: Approximation bounds for hierarchical clustering: average linkage, bisecting k-means, and local search. In: NIPS, pp. 3097-3106 (2017)
21. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of EMNLP 2014, pp. 1532-1543 (2014)
22. Qimin, C., Qiao, G., Yongliang, W., Xianghua, W.: Text clustering using VSM with feature clusters. Neural Comput. Appl. 26(4), 995-1003 (2015)
23. Seifollahi, S., Bagirov, A., Layton, R., Gondal, I.: Optimization based clustering algorithms for authorship analysis of phishing emails. Neural Process. Lett. 46(2), 411-425 (2017)
24. Seifollahi, S., Piccardi, M., Borzeshi, E.Z., Kruger, B.: Taxonomy-augmented features for document clustering. In: Islam, R., et al. (eds.) AusDM 2018. CCIS, vol. 996, pp. 241-252. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-6661-1_19
25. Stein, R.A., Jaques, P.A., Valiati, J.F.: An analysis of hierarchical text classification using word embeddings. Inf. Sci. 471, 216-232 (2019)
26. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining, vol. 400, pp. 1-2 (2000)
27. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: Proceedings of ACL, pp. 1555-1565 (2014)
28. Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.L., Hao, H.: Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 174, 806-814 (2016)
29. Zhang, D., Xu, H., Su, Z., Xu, Y.: Chinese comments sentiment classification based on word2vec and SVMperf. Expert Syst. Appl. 42(4), 1857-1863 (2015)
30. Zhu, L., Wang, G., Zou, X.: A study of Chinese document representation and classification with Word2vec. In: Proceedings - 2016 9th International Symposium on Computational Intelligence and Design, ISCID 2016, pp. 1:298-302 (2017)

Adversarial Training Based Cross-Lingual Emotion Cause Extraction

Hongyu Yan1, Qinghong Gao1, Jiachen Du1(B), Binyang Li2, and Ruifeng Xu1(B)

1 Harbin Institute of Technology (Shenzhen), Shenzhen, China
[email protected], [email protected]
2 University of International Relations, Beijing, China
[email protected]

Abstract. Emotion cause extraction (ECA) aims to identify the reasons behind a certain emotion expression in a text. It is a key topic in natural language processing. Existing methods rely on high-quality emotion resources and focus on only one language, yet publicly annotated corpora are fairly rare. Therefore, we propose an adversarial training based cross-lingual emotion cause extraction approach that leverages the semantic and emotion knowledge of a resource-abundant language (source language) for ECA in a resource-scarce language (target language). Instead of large-scale parallel corpora, we capture task-related but language-irrelevant features using only a small-scale Chinese corpus and an English corpus. In addition, an attention mechanism based on position and emotion expression information is designed to capture the key parts of a clause contributing to ECA. Our proposed approach captures rich semantic and emotion information in the ECA learning process. It is demonstrated that our method can achieve better performance than the state-of-the-art results.

Keywords: Emotion cause extraction · Information extraction · Text mining

1 Introduction

With the flourishing development of the Internet, emotion analysis has attracted much attention in the field of Natural Language Processing (NLP). Most previous research focuses on emotion classification or emotion detection. However, underlying information such as the cause of an emotion needs to be extracted and analysed in many real-world applications, as it provides crucial information for applications ranging from economic forecasting and public opinion mining to product design.

Example 1: Because he was just attacked with a torrent of abuse online (c1). He felt deeply angry (c2).


Emotion cause extraction (ECA) aims to identify the reason behind a certain emotion expression automatically. As shown in Example 1, "angry" is an emotion expression and the cause of "angry" is c1. ECA is more challenging than emotion classification because it requires a deeper understanding of text semantics. Existing approaches to emotion cause identification mainly concentrate on rule-based methods [1,2] and machine learning algorithms [3], which ignore the trigger relations between the emotion expression and the emotion cause. Recently, Gui [4] treated emotion cause extraction as a question answering task in order to capture the semantic relations between emotion expression and emotion cause. However, the above studies focus on only one language, and publicly annotated corpora are rare and imbalanced across languages, which prevents exploiting the abundant information available in other languages [5]. In this paper, we present a cross-lingual approach to emotion cause detection that leverages resources in a resource-rich language (such as English) to improve performance in a resource-scarce language (such as Chinese) by making the most of cross-lingual transfer knowledge. Traditional cross-lingual approaches are based on translated resources such as bilingual dictionaries and parallel corpora, or employ machine translation (MT) systems to translate corpora from the source language into the target language [6]. These methods are limited by the gap between the source and target languages as well as by the accuracy of MT systems. Thus, to overcome these issues and inspired by [7,8], we propose an adversarial training based cross-lingual architecture (ATCL-ECA) to model cross-lingual semantic and emotion information between the two languages without relying on extra parallel corpora. The major contributions of our work can be summarized as follows:
– We propose a cross-lingual approach (ATCL-ECA) to learn language-irrelevant but task-related (LI-TR) information for emotion cause extraction.
– Instead of large-scale parallel corpora, the adversarial training based method is conducted on only the small-scale labelled Chinese corpus from [3] and the English corpus from ECA [9]. In addition, an attention mechanism based on position and emotion expression information is designed to capture the key parts of a clause contributing to ECA.
– It is demonstrated that our proposed model can capture cross-lingual semantic information to effectively bridge the gap between the two languages for the ECA task, and that it outperforms the state-of-the-art approach on a public benchmark dataset.

2 Related Work

In this section, we review the literature related to this paper from two perspectives: emotion cause extraction and cross-lingual methods.

2.1 Emotion Cause Extraction

As research on emotion analysis has deepened, the emotion cause corresponding to an emotion expression has become noteworthy, since emotion cause extraction can reveal the cause that triggers the emotion expression. Lee [10] first gave a formal definition of this task and manually constructed a Chinese emotion cause corpus from the Academia Sinica Balanced Chinese Corpus. Building on this, Chen [2] cast the task as multi-label classification, which can capture long-distance information based on rule-based and semantic features. Beyond these rule-based methods, Ghazi [11] employed Conditional Random Fields (CRFs) to identify emotion causes; however, this study requires the emotion cause and the emotion expression to be in the same clause. Recently, Gui [3] proposed a multi-kernel based method to detect the emotion cause on a public Chinese emotion cause corpus, although it depended heavily on feature engineering. Inspired by neural networks, Gui [4] treated the task as a question answering problem by constructing a convolutional memory network. However, since publicly annotated corpora for this task are scarce, existing research focuses almost exclusively on a single language and ignores the cross-lingual knowledge that abundant resources in other languages could provide.

2.2 Cross-Lingual Emotion Analysis

The goal of cross-lingual emotion analysis is to bridge the gap between the source language and the target language. Machine translation (MT) or parallel-corpora based approaches are usually employed to solve this problem. Machine translation based methods use MT systems to project training data into the target language or test data into the source language. Wan [12] proposed a co-training method that depends on machine translation: Chinese test data are first translated into English and English training data into Chinese, and training and testing are then performed in two independent views, i.e., an English view and a Chinese view. Li [13] selected the samples in the source language that were similar to those in the target language to bridge the gap between the two languages. With the development of deep learning, many researchers learn cross-lingual representations with parallel corpora. Zhou [14] proposed a cross-lingual representation learning model that simultaneously learns word and document representations in both languages. Besides, Zhou [15] used a hierarchical attention model jointly trained with a bidirectional LSTM to learn representations. However, such approaches require large-scale task-related parallel corpora. To address these issues, we propose a cross-lingual architecture based on adversarial training to detect the emotion cause; it can capture the cross-lingual semantic information between the two languages while being trained on only small-scale corpora.


Fig. 1. Adversarial training based cross-lingual ECA architecture (ATCL-ECA).

3 Model

3.1 Task Definition

The formal definition of emotion cause extraction was proposed in [3]. Given a document Doc = {c_1, c_2, ..., c_n}, a passage about an emotion event consisting of an emotion expression e and n clauses, each clause c = {w_1, w_2, ..., w_k} consists of k words. The goal of this task is to identify the emotion cause clause corresponding to the emotion expression. When dealing with each document, we map each word into a low-dimensional, continuous vector, known as a word embedding [16]. All the word vectors are stacked in a word embedding matrix L ∈ R^{dim×|V|}, where dim is the dimension of the word vectors and |V| is the vocabulary size.

3.2 Adversarial Training Based Cross-Lingual ECA Model

The overall architecture of the Adversarial Training based Cross-lingual ECA model is illustrated in Fig. 1. It contains three components: feature extractors, ECA classifiers and a discriminator. We first use feature extractors to obtain contextual information from the different corpora: a Chinese feature extractor F_cn, an English feature extractor F_en and a shared feature extractor F, which respectively acquire information from the Chinese corpus, from the English corpus, and from the semantic knowledge shared between the two. Then, to identify the emotion cause corresponding to the emotion expression, we construct two classifiers, C_cn and C_en, for the Chinese and English corpora respectively. It is well known that common knowledge exists even though the languages are diverse. To make the most of this information, we concatenate the outputs of F_cn with the Chinese outputs from F and feed them into C_cn; analogously, the concatenation of the outputs of F_en and the corresponding shared outputs from F is fed into C_en. Besides, the discriminator D is used to identify the language of the samples coming from F. The details of these components are described in the following subsections.

Feature Extractor. Normally, emotion expressions and emotion causes are expressed via phrases or sentences rather than a single word. Meanwhile, the same word can convey different meanings in different contexts. To incorporate rich contextual semantic information, we leverage a Recurrent Neural Network with Gated Recurrent Units (GRU) [17] to extract word sequence features and long-term dependencies. GRUs are used instead of LSTMs as the former have fewer parameters to tune, containing only an update gate z and a reset gate r to control the flow of information. For each time step t, the GRU first calculates the update gate z_t and the reset gate r_t: the gate z_t regulates the degree to which the units are updated, ensuring that dependencies exist at every moment, while the gate r_t determines how much of the previous state is retained. The candidate hidden state \tilde{h}_t and the final hidden state h_t are obtained as follows:

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)    (1)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)    (2)
\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)    (3)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t    (4)

where x_t is the word embedding of word w at time step t, \sigma(\cdot) and \tanh(\cdot) are the sigmoid and hyperbolic tangent functions respectively, W_z, W_r, U_z, U_r, W_h and U_h are weight matrices, and \odot denotes element-wise multiplication.

To capture the semantic features of the Chinese corpus, the English corpus and the knowledge shared between the two languages, we use three feature extractors: F_cn, F_en and F. For each feature extractor, we adopt a bidirectional GRU (Bi-GRU) to incorporate both past and future contextual information. The Bi-GRU comprises a forward GRU, which reads the sentence from left to right to learn the historical information, and a backward GRU, which reads from right to left to obtain the future information:

\overrightarrow{h_{it}} = \overrightarrow{\mathrm{GRU}_w}(x_{it}), \quad t \in [1, k]    (5)
\overleftarrow{h_{it}} = \overleftarrow{\mathrm{GRU}_w}(x_{it}), \quad t \in [k, 1]    (6)

We then concatenate the forward hidden state \overrightarrow{h_{it}} and the backward hidden state \overleftarrow{h_{it}}, i.e., h_{it} = [\overrightarrow{h_{it}}, \overleftarrow{h_{it}}], which summarizes the information of the whole sentence around the word w.
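A minimal numerical sketch of Eqs. (1)-(4) and of the bidirectional encoding in Eqs. (5)-(6) is given below; the random parameters and dimensions are placeholders, not trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU step implementing Eqs. (1)-(4); P holds the W_*, U_*, b_* parameters."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])              # update gate, Eq. (1)
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])              # reset gate, Eq. (2)
    h_tilde = np.tanh(P["Wh"] @ x_t + r * (P["Uh"] @ h_prev) + P["bh"])  # candidate, Eq. (3)
    return (1.0 - z) * h_prev + z * h_tilde                              # new state, Eq. (4)

def bi_gru(xs, P_fwd, P_bwd, hidden=50):
    """Forward pass (Eq. 5), backward pass (Eq. 6), then per-word concatenation."""
    fwd, bwd = [], []
    h = np.zeros(hidden)
    for x in xs:                       # left to right
        h = gru_step(x, h, P_fwd); fwd.append(h)
    h = np.zeros(hidden)
    for x in reversed(xs):             # right to left
        h = gru_step(x, h, P_bwd); bwd.append(h)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]

def init_params(dim_in, hidden, rng):
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(scale=0.1, size=(hidden, dim_in))
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    return p

rng = np.random.default_rng(0)
clause = [rng.normal(size=100) for _ in range(8)]          # 8 word embeddings, d = 100
states = bi_gru(clause, init_params(100, 50, rng), init_params(100, 50, rng))
```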


Fig. 2. GRU-CNN attention based emotion cause identification model.

ECA Classifier. For the emotion cause extraction task, not only is the semantic information of the context important, but the relations between emotion expressions and emotion causes are also salient features. Inspired by [18], to capture the semantic relations between emotion expressions and contexts, we utilize a GRU-CNN attention based model (GRU-CNN-A). As shown in Fig. 2, we first use the Bi-GRU to obtain the contextual information so that each word representation contains global semantic information, i.e., c = f(c), where f denotes the Bi-GRU. It is well acknowledged that not all words contribute equally to a context. Hence, in order to capture the crucial components of the context, we employ an attention mechanism, which prompts the model to pay more or less attention to individual words or sentences. Obviously, emotion expressions are triggered by specific emotion cause events, and these events are normally close to the emotion expressions. Table 1 illustrates the distribution of cause positions, and we can see that most emotion cause clauses abut the emotion expressions. Therefore, apart from the emotion expressions themselves, the distance between the current clause and the emotion words plays a core role in emotion cause identification. In this paper, we combine position information p and the emotion expression e to compute the attention of a word. Firstly, we obtain the word attention from the relative distance p between the current clause and the emotion expression e:

m_p = c \cdot W_p \cdot p    (7)
\alpha_p^j = \frac{\exp(m_p^j)}{\sum_{k=1}^{M} \exp(m_p^k)}    (8)

where W_p is a weight matrix and \alpha_p^j represents the importance of the word in position j with respect to the position vector p. That is, we perform a matrix multiplication between the output of the Bi-GRU c and the position vector p, and obtain a normalized importance weight \alpha_p^j through a softmax function. Similarly, the importance of the words in a context also depends on the emotion expression e:

m_e = c \cdot W_e \cdot \mathrm{CNN}(e)    (9)
\alpha_e^j = \frac{\exp(m_e^j)}{\sum_{k=1}^{M} \exp(m_e^k)}    (10)

where W_e is a weight matrix and \alpha_e^j represents the importance of the word in position j with respect to the emotion expression. The emotion expression is usually a single word or a short multi-word phrase, which is not suited to a long-term encoder; therefore, we first feed the emotion expression e into a CNN to obtain its representation \mathrm{CNN}(e), and then measure the importance of each word through a matrix multiplication between c and \mathrm{CNN}(e). After that, the clause vectors c_p and c_e are computed as weighted sums of the word annotations based on the weights \alpha_p and \alpha_e, and for each clause the final representation o is their concatenation:

c_p = \sum (\alpha_p \cdot c)    (11)
c_e = \sum (\alpha_e \cdot c)    (12)
o = c_p \oplus c_e    (13)
y = \mathrm{softmax}(W \cdot o)    (14)
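The following numpy sketch mirrors the attention and fusion of Eqs. (7)-(14). The CNN over the emotion expression is reduced to a simple mean of its word vectors purely for illustration, and all parameters are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gru_cnn_attention(c, p, e_words, W_p, W_e, W_out):
    """c: (M, d) Bi-GRU outputs for a clause; p: (d_p,) position embedding;
    e_words: (n, d_e) embeddings of the emotion expression words."""
    cnn_e = e_words.mean(axis=0)              # stand-in for CNN(e) (assumption)
    alpha_p = softmax(c @ W_p @ p)            # Eqs. (7)-(8): position-based weights
    alpha_e = softmax(c @ W_e @ cnn_e)        # Eqs. (9)-(10): expression-based weights
    c_p = alpha_p @ c                         # Eq. (11)
    c_e = alpha_e @ c                         # Eq. (12)
    o = np.concatenate([c_p, c_e])            # Eq. (13)
    return softmax(W_out @ o)                 # Eq. (14): cause / non-cause scores

rng = np.random.default_rng(0)
d, d_p, d_e, M = 100, 50, 100, 12
y = gru_cnn_attention(rng.normal(size=(M, d)), rng.normal(size=d_p),
                      rng.normal(size=(3, d_e)),
                      rng.normal(size=(d, d_p)), rng.normal(size=(d, d_e)),
                      rng.normal(size=(2, 2 * d)))
```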

The ECA classifier is trained by minimizing the cross-entropy:

L = -\sum_{(x, y) \in D} \sum_{q \in Q} y^q \log f^q(x; \theta)    (15)

where D is the collection of training data, Q is the set of target categories, y^q is the target distribution, f(x; \theta) is the predicted distribution of the model, and \theta is the parameter set.

Discriminator. It is well acknowledged that the deeper a fully connected network is, the better its expressive ability. With activation and dropout layers added, a fully connected network not only has a good capacity for nonlinear mapping but can also prevent overfitting effectively. Therefore, considering the high prediction accuracy and the simple structure of fully connected networks, we adopt one as the discriminator.

Output and Model Training. For the emotion cause classifiers, we use the log-likelihood of the correct labels as the objective function:

j(\Theta_m, \Theta_s) = \sum_{m=1}^{M} \sum_{i=1}^{N_m} \log p(Y_i^{(m)} \mid c_i^{(m)}; \Theta_m, \Theta_s)    (16)


where \Theta_m and \Theta_s represent all the parameters in the private layers and the shared layers, respectively. To guarantee that F captures LI-TR features, the discriminator D and the shared feature extractor F play an adversarial game during training. For a clause c, we construct the discriminator D as follows:

p(\cdot \mid \Theta_d, \Theta_s) = \mathrm{softmax}(W_d^{T} h_c^{(s)} + b_d)    (17)

where h_c^{(s)} is the output of the shared feature extractor F, \Theta_d comprises the parameters W_d^{T} and b_d, and \Theta_s denotes the parameters of the shared layer. For the discriminator, training is an adversarial game consisting of two parts: optimizing the parameters of the discriminator to identify the language of the shared features, and optimizing the parameters of the shared feature extractor so that the discriminator D cannot identify the language of the shared features. The two objective functions are as follows:

\max_{\Theta_d} j_1^{adv}(\Theta_s) = \sum_{m=1}^{M} \sum_{i=1}^{N_m} \log p(m \mid c_i^{(m)}; \Theta_d, \Theta_s)    (18)

\max_{\Theta_s} j_2^{adv}(\Theta_d) = \sum_{m=1}^{M} \sum_{i=1}^{N_m} \log H\big(p(m \mid c_i^{(m)}; \Theta_d, \Theta_s)\big)    (19)

where H(p) = -\sum_i p_i \log p_i is the entropy of the distribution p. Finally, the ECA objective and the adversarial training objectives are combined to obtain the ultimate objective function:

j(\Theta; D) = j_{seg}(\Theta_m, \Theta_s) + j_1^{adv}(\Theta_d) + \lambda j_2^{adv}(\Theta_s)    (20)

Here, \lambda is a weight hyper-parameter that controls the contribution of each part.

Table 1. Cause position of each emotion.

Distance   ECA-16   ECA-13-En-Train   ECA-13-En-Test
−3         1.60%    6.42%             6.23%
−2         5.53%    8.35%             7.63%
−1         32.90%   10.55%            9.38%
0          50.41%   12.84%            13.83%
1          5.22%    10.11%            8.18%
2          1.56%    7.76%             5.94%
3          0.48%    6.02%             4.68%
Others     2.3%     37.95%            44.13%


Table 2. Distribution of emotion cause clauses and non-emotion cause clauses.

Clause type   ECA-16   ECA-13-En-Train   ECA-13-En-Test
EC Clause     6.96%    14.17%            15.09%
NEC Clause    93.04%   85.83%            84.90%

4 Experiments

4.1 Data Sets

The proposed approach is evaluated on a Chinese emotion cause corpus¹ [3] (ECA-16) and the NTCIR 2017 ECA-13 dataset² (ECA-13). The former, ECA-16, comprises 2,105 documents from SinaNews³. The latter, ECA-13, has two parts: 3,000 Chinese documents from SinaNews and 3,000 English documents from novels. For each sub-corpus of ECA-13, there are 2,500 documents for training and 500 for testing. To observe the similarities and differences between ECA-13 and ECA-16, we first report statistics of these datasets in Table 1 and Table 2. Table 1 shows that 88.53% of the emotion causes in ECA-16 adjoin the emotion expressions (i.e., the distance is at most one), but only 33.5% do in ECA-13-En-Train. Still, there are clear commonalities between the datasets: emotion causes tend to abut the emotion expressions, so position evidently plays a very important role in ECA. In this paper, we use 90% of the data from ECA-16 for training the Chinese feature extractor F_cn and 10% for final testing. Besides, we employ the ECA-13 English training set for training the English feature extractor F_en. Specifically, we feed into the shared feature extractor F the Chinese clauses paired with English clauses of the same emotion category, the English clauses paired with Chinese clauses of the same emotion category, the Chinese clauses, the English clauses, and all their position information, as in the example shown in Fig. 3.

Experimental Settings and Evaluation Metrics

In the experiments, we set the hidden units h = 50, the dimension of word embeddings d = 100 and the learning rate lr = 0.002. Specifically, the word embeddings are pretrained by word2vec from [19] and the 50-dimension position embeddings are randomly initialized with the uniform distribution U(-0.1,0.1). Dropout is set to 0.25 to overcome the overfitting in training process. And the batch size is set to 32 according to the best F value on the testing set. We evaluate the performance of emotion cause identification by the metrics used in [3], i.e. precision (P ), recall (R), and F-measure (F ), which is commonly accepted. If a proposed emotion clause covers an annotated answer, it is considered correct. 1 2 3

http://hlt.hitsz.edu.cn/?page%20id=694. http://hlt.hitsz.edu/?page id=74. https://news.sina.com.cn/society/.
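As a small illustration, the correctness criterion and metrics described above could be computed as in the following sketch; the (start, end) span representation of clauses is an assumption made only for this example.

```python
# Precision/recall/F-measure under the "covers an annotated answer" criterion.
def covers(proposed, gold):
    """True if the proposed clause span fully contains the gold answer span."""
    return proposed[0] <= gold[0] and proposed[1] >= gold[1]

def prf(proposed_clauses, gold_clauses):
    tp = sum(1 for p in proposed_clauses if any(covers(p, g) for g in gold_clauses))
    precision = tp / len(proposed_clauses) if proposed_clauses else 0.0
    covered = sum(1 for g in gold_clauses if any(covers(p, g) for p in proposed_clauses))
    recall = covered / len(gold_clauses) if gold_clauses else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a proposed clause (10, 25) covering the annotated answer (12, 20)
# yields prf([(10, 25)], [(12, 20)]) == (1.0, 1.0, 1.0).
```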


Fig. 3. An example of emotion cause extraction.

4.3 Comparisons of Different Methods

We compare ATCL-ECA with several traditional and more recent baselines, as shown in Table 3.

– RB (Rule-based method): RB is proposed by [10]; it is a rule-based approach to extracting emotion causes.
– CB (Commonsense-based method): This method is proposed by Russo [20] and uses the Chinese Emotion Cognition Lexicon as a commonsense knowledge base [21].
– RB+CB+ML (Machine Learning): Gui [3] employed the rule-based and commonsense-based methods to extract features and classified them with a machine learning algorithm.
– MKSVM: Gui [3] used a multi-kernel method trained with SVM to identify the emotion cause.
– ConvMS-Memnet: ConvMS-Memnet is proposed by Gui [4]. They regarded the task as a question answering problem; it is the current state-of-the-art method for emotion cause extraction.
– ATCL-ECA: ATCL-ECA is our proposed adversarial training based approach.

Table 3 shows the evaluation results, from which we make the following observations. Since rule-based methods suffer from low coverage and poor generality, RB gives low recall. In contrast, the commonsense-based method achieves quite high recall but low precision. This is because the commonsense knowledge base contains almost all collocations between emotion expressions and emotion cause events, while ignoring the fact that the semantic information of an emotion expression depends on the context around it. RB+CB+ML verifies that RB and CB are complementary and together improve model performance. In addition, the machine learning based method MKSVM and the deep learning based method ConvMS-Memnet outperform the above approaches; both consider contextual information. Concretely, MKSVM captures structured information and lexical features, while ConvMS-Memnet can model the relations between emotion expressions and emotion cause clauses.


Table 3. Comparisons with existing methods.

Method          Precision  Recall   F1
RB              0.6747     0.4287   0.5243
CB              0.2672     0.7130   0.3887
RB+CB+ML        0.5921     0.5307   0.5597
MKSVM           0.6673     0.6841   0.6756
ConvMS-Memnet   0.7067     0.6838   0.6955
ATCL-ECA        0.7629     0.6825   0.7205

The best F value is achieved by ATCL-ECA, which outperforms the state-of-the-art method ConvMS-Memnet by 2.5%. The results illustrate the benefit of leveraging cross-lingual semantic information and verify that cross-lingual information can bridge the gap between the two languages effectively.

4.4 Comparisons of Different Architectures

Fig. 4. a): The architecture of model-1; b): The architecture of model-2.

We further study the performance of different architectures for cross-lingual emotion cause extraction. As shown in Fig. 4 and Fig. 1, the three models differ in the input features fed into the emotion cause extraction classifiers. Model-1: the concatenation of the features from the shared feature extractor F and the Chinese feature extractor Fcn is fed into the Chinese ECA classifier Ccn. Model-2: the concatenation of the features from the shared feature extractor F and the embeddings of the Chinese documents is fed into the Chinese feature extractor Fcn. Model-3: Model-3 is the combination of Model-1 and Model-2.

Table 4. The results of different architectures.

Method   Precision  Recall   F1
Model-1  0.6872     0.6733   0.6802
Model-2  0.6631     0.6530   0.6580
Model-3  0.7629     0.6825   0.7205

In Model-3, the concatenation of the features from F and the embeddings of the Chinese documents is fed into Fcn; meanwhile, the concatenation of the features from F and Fcn is fed into the Chinese ECA classifier Ccn. The results listed in Table 4 compare the above three models and show that Model-3 outperforms the others. Specifically, in Model-3, the concatenations of features from the shared feature extractor F and the private feature extractors Fcn and Fen play important roles in the ECA classifiers. Moreover, for the private feature extractors, the inputs come from the pretrained embeddings of the data and the outputs of the shared feature extractor. Model-3 can therefore capture more cross-lingual semantic information for emotion cause extraction.

Table 5. The results of different sampling methods.

Sampling method  Precision  Recall   F1
N-S              0.4749     0.7765   0.5894
U-S              0.6872     0.6654   0.6761
O-S              0.6945     0.6893   0.6919
O-S (batch 1:1)  0.7629     0.6825   0.7205

4.5 Effects of Sampling Methods

The choice of sampling method affects how well the model can learn. As shown in Table 2, the distribution of positive and negative samples in the datasets is fairly imbalanced, so the data requires sampling before training. The results are listed in Table 5. N-S denotes no sampling, which gives low precision and consequently a low F value. Oversampling with a 1:1 batch ratio (O-S batch 1:1) achieves the best performance compared with undersampling (U-S) and plain oversampling (O-S). Batch 1:1 not only sets the overall ratio of positive to negative samples to 1:1, but also enforces a 1:1 ratio within each batch. This demonstrates that both the sampling method and the ratio of positive to negative samples in each batch play important roles in the model optimization process.
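The following is a minimal sketch of how the "O-S (batch 1:1)" strategy described above could be realized; the function name, batch size, and sampling-with-replacement choice are illustrative assumptions.

```python
# Oversample the minority (emotion-cause) clauses so that every batch holds an
# equal number of positive and negative examples.
import random

def balanced_batches(positives, negatives, batch_size=32, seed=0):
    rng = random.Random(seed)
    half = batch_size // 2
    # Assumes the negative set is at least half a batch in size.
    num_batches = max(len(negatives) // half, 1)
    for _ in range(num_batches):
        # Sampling with replacement from the smaller positive set = oversampling.
        pos = [rng.choice(positives) for _ in range(half)]
        neg = rng.sample(negatives, half)
        batch = pos + neg
        rng.shuffle(batch)
        yield batch

# Usage: for batch in balanced_batches(cause_clauses, non_cause_clauses): ...
```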

4.6 Effects of Different Attention Hops

It is well known that computational models with deep architectures of multiple layers have a better ability to learn data representations. In this section, we evaluate the influence of multiple attention hops on this task.


Table 6. The results of different attention hops.

Attention hops  Precision  Recall   F1
Hop 1           0.6749     0.6625   0.6686
Hop 2           0.7049     0.6865   0.6856
Hop 3           0.7122     0.7031   0.7076
Hop 4           0.6739     0.6955   0.6845
Hop 5           0.6824     0.7024   0.6923
Hop 6           0.6705     0.6876   0.6789

We set the number of hops from 1 to 6. As shown in Table 6, the performance improves as the number of hops increases from 1 to 3. However, the performance decreases when the number of hops is larger than 3 because of overfitting on this small dataset. Thus, we opted for 3 hops in our final model, since this setting gives the best performance.

5 Conclusion and Future Work

In this paper, we propose a new approach to identify the emotion cause corresponding to an emotion expression. The key property of this approach is the use of cross-lingual shared knowledge: the proposed model captures language-independent information relevant to emotion cause extraction through adversarial training. Without requiring large-scale parallel corpora, our model achieves significantly better performance than a number of competitive baselines using only small-scale Chinese and English corpora. Meanwhile, an attention mechanism based on position information and the emotion expression is designed to capture the key parts of each clause. Experimental results verify that our proposed approach outperforms a number of competitive baselines. In the future, we will construct English corpora from social news to reduce the disparity between the cross-lingual corpora.

Acknowledgement. This work was supported by the National Natural Science Foundation of China (61632011, 61876053) and Shenzhen Foundational Research Funding JCYJ20180507-183527919.

References 1. Lee, S.Y.M., Chen, Y., Huang, C., Li, S.: Detecting emotion causes with a linguistic rule-based approach. Comput. Intell. 29, 390–416 (2013) 2. Chen, Y., Lee, S.Y.M., Li, S., Huang, C.: Emotion cause detection with linguistic constructions. In: COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23–27 August 2010, Beijing, China, pp. 179–187 (2010)


3. Gui, L., Wu, D., Xu, R., Lu, Q., Zhou, Y.: Event-driven emotion cause extraction with corpus construction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, 1–4 November 2016, pp. 1639–1649 (2016) 4. Gui, L., Hu, J., He, Y., Xu, R., Lu, Q., Du, J.: A question answering approach to emotion cause extraction. CoRR abs/1708.05482 (2017) 5. Wang, Y., Wu, X., Wu, L.: Differential privacy preserving spectral graph analysis. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 329–340. Springer, Heidelberg (2013). https://doi.org/10. 1007/978-3-642-37456-2 28 6. Banea, C., Mihalcea, R., Wiebe, J., Hassan, S.: Multilingual subjectivity analysis using machine translation. In: 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25–27 October 2008, Honolulu, Hawaii, USA, A Meeting of SIGDAT, a Special Interest Group of the ACL, pp. 127–135 (2008) 7. Goodfellow, I.J., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, 8–13 December 2014, Montreal, Quebec, Canada, pp. 2672–2680 (2014) 8. Chen, X., Athiwaratkun, B., Sun, Y., Weinberger, K.Q., Cardie, C.: Adversarial deep averaging networks for cross-lingual sentiment classification. CoRR abs/1606.01614 (2016) 9. Gao, Q., et al.: Overview of NTCIR-13 ECA task. In: Proceedings of the 13th NII Testbeds and Community for Information Access Research Conference on Evaluation of Information Access Technologies, NTCIR-13 (2017) 10. Lee, S.Y.M., Chen, Y., Huang, C.R.: A text-driven rule-based system for emotion cause detection. In: NAACL HLT Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (2010) 11. Ghazi, D., Inkpen, D., Szpakowicz, S.: Detecting emotion stimuli in emotionbearing sentences. In: Gelbukh, A. (ed.) CICLing 2015. LNCS, vol. 9042, pp. 152– 165. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18117-2 12 12. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 235–243 (2009) 13. Li, S., Wang, R., Liu, H., Huang, C.R.: Active learning for cross-lingual sentiment classification. Commun. Comput. Inf. Sci. 400, 236–246 (2013) 14. Zhou, X., Wan, X., Xiao, J.: Cross-language opinion target extraction in review texts. In: 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, 10–13 December 2012, pp. 1200–1205 (2012) 15. Zhou, H., Chen, L., Shi, F., Huang, D.: Learning bilingual sentiment word embeddings for cross-language sentiment classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015: Long Papers, 26–31 July 2015, Beijing, China, vol. 1, pp. 430–440 (2015) 16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting held 5–8 December 2013, Lake Tahoe, Nevada, USA, pp. 3111–3119 (2013)


17. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014) 18. Fan, C., Gao, Q., Du, J., Gui, L., Xu, R., Wong, K.: Convolution-based memory network for aspect-based sentiment analysis. In: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, 08–12 July 2018, pp. 1161–1164 (2018) 19. Zou, W.Y., Socher, R., Cer, D.M., Manning, C.D.: Bilingual word embeddings for phrase-based machine translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18–21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A Meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1393–1398 (2013) 20. Russo, I., Caselli, T., Rubino, F., Boldrini, E., Mart´ınez-Barco, P.: Emocause: an easy-adaptable approach to extract emotion cause contexts. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA@ACL 2011, Portland, OR, USA, 24 June 2011, pp. 153–160 (2011) 21. Xu, R., et al.: A new emotion dictionary based on the distinguish of emotion expression and emotion cognition. J. Chin. Inf. Process. (2013)

Techniques for Jointly Extracting Entities and Relations: A Survey

Sachin Pawar1,2(B), Pushpak Bhattacharyya2, and Girish K. Palshikar1

1 TCS Research and Innovation, Pune 411013, India
{sachin7.p,gk.palshikar}@tcs.com
2 Indian Institute of Technology Bombay, Mumbai 400076, India
[email protected]

Abstract. Relation Extraction is an important task in Information Extraction which deals with identifying semantic relations between entity mentions. Traditionally, relation extraction is carried out after entity extraction in a “pipeline” fashion, so that relation extraction only focuses on determining whether any semantic relation exists between a pair of extracted entity mentions. This leads to propagation of errors from entity extraction stage to relation extraction stage. Also, entity extraction is carried out without any knowledge about the relations. Hence, it was observed that jointly performing entity and relation extraction is beneficial for both the tasks. In this paper, we survey various techniques for jointly extracting entities and relations. We categorize techniques based on the approach they adopt for joint extraction, i.e. whether they employ joint inference or joint modelling or both. We further describe some representative techniques for joint inference and joint modelling. We also describe two standard datasets, evaluation techniques and performance of the joint extraction approaches on these datasets. We present a brief analysis of application of a general domain joint extraction approach to a Biomedical dataset. This survey is useful for researchers as well as practitioners in the field of Information Extraction, by covering a broad landscape of joint extraction techniques.

Keywords: Relation extraction · Entity extraction · Joint modelling · End-to-end relation extraction

1 Introduction

Entities such as PERSON or LOCATION are the most basic units of information in any natural language text. Mentions of such entities in a sentence are often linked through well-defined semantic relations (e.g., EMPLOYEE OF relation between a PERSON and an ORGANIZATION). The task of Relation Extraction (RE) deals with identifying such relations automatically. Apart from the general domain entities of types such as PERSON or ORGANIZATION, there can be domain-specific entities and relations. For example, in Biomedical domain, an example relation type of interest can be SIDE EFFECT between entities of types DRUG and ADVERSE EVENT. c Springer Nature Switzerland AG 2023  A. Gelbukh (Ed.): CICLing 2019, LNCS 13452, pp. 602–618, 2023. https://doi.org/10.1007/978-3-031-24340-0_45


A lot of approaches [2,7,16,19,27] have been proposed to address the relation extraction task. Most of these traditional Relation Extraction approaches assume the information about entity mentions is available. Here, information about entity mentions consists of their boundaries (words in a sentence constitute a mention) as well as their entity types. Hence, in practice, any end-to-end relation extraction system needs to address 3 sub-tasks: i) identifying boundaries of entity mentions, ii) identifying entity types of these mentions and iii) identifying appropriate semantic relation for each pair of mentions. The first two sub-tasks of end-to-end relation extraction correspond to the Entity Detection and Tracking (EDT) task defined by the the Automatic Content Extraction (ACE) program [4] and the third sub-task corresponds to the Relation Detection and Characterization (RDC) task. Traditionally, the three sub-tasks of end-to-end relation extraction are performed serially in a “pipeline” fashion. Hence, the errors in any sub-task are propagated to the subsequent sub-tasks. Moreover, this “pipeline” approach only allows unidirectional flow of information, i.e. the knowledge about entities is used for extracting relations but not vice versa. To overcome these problems, it is necessary to perform some or all of these sub-tasks jointly. In this paper, we survey various end-to-end relation extraction approaches which jointly address entity extraction and relation extraction.

2 Problem Definition

The problem of end-to-end relation extraction is defined as follows:

Input: A natural language sentence S.
Output: i) Entity Extraction: the list of entity mentions occurring in S, where each entity mention is identified in terms of its boundaries and entity type; ii) Relation Extraction: the list of pairs of entity mentions for which any pre-defined semantic relation holds.

E.g., S = Paris, John's sister, is staying in New York. Here, the expected output of an end-to-end relation extraction system is shown in Table 1.

Table 1. Expected output of end-to-end relation extraction system (for definitions of entity and relation types, see Sect. 7.1)

Entity extraction   Relation extraction
Paris : PER         ⟨Paris, John⟩ : PER-SOC
John : PER          ⟨John, sister⟩ : PER-SOC
sister : PER        ⟨Paris, New York⟩ : PHYS
New York : GPE      ⟨sister, New York⟩ : PHYS

3 Motivating Example

Any particular semantic relation generally holds between entity mentions of some specific entity types. E.g., social (PER-SOC) relation holds between two


persons (PER); an employee-employer (EMP-ORG) relation holds between a person (PER) and an organization (ORG) or a geo-political entity (GPE). Hence, information about entity types certainly helps relation extraction. Traditional “pipeline” approaches for relation extraction use features based on entity types. However, in these “pipeline” approaches there is no bidirectional flow of information, i.e., the entity extraction sub-task does not utilize any knowledge or features based on relation information. When entity and relation extraction are jointly addressed, such bidirectional flow is possible, thus improving the performance of both entity extraction and relation extraction. Consider an example sentence: Paris, John's sister, is staying in New York. Most of the state-of-the-art Named Entity Recognition (NER) tools incorrectly identify Paris as a mention of type LOC¹ and not as PER. Here, if an entity extraction algorithm has some evidence that Paris is involved in a social (PER-SOC) relation, then it would prefer to label Paris as a PER rather than as a LOC. This is because a social relation is only possible between two persons. Thus, the information about relations helps in determining entity types of entity mentions. This is the motivation behind designing algorithms which jointly extract entities and relations.

4 Overview of Techniques

Various techniques have been proposed for jointly extracting entities and relations since 2002. Table 2 summarizes most of the techniques from the literature of joint extraction. We visualize each of these techniques from two aspects of joint extraction: joint model and joint inference.

Table 2. Overview of various techniques for joint extraction of entities and relations (evaluated on ACE'04, ACE'05, and/or CoNLL'04)

Approach                  Model type / inference technique
Roth and Yih [21]         Belief Network
Roth and Yih [22]         ILP
Kate and Mooney [8]       Parsing
Chan and Roth [3]         Rules
Li and Ji [11]            Structured Prediction; Beam search
Miwa and Sasaki [13]      Table + Structured Prediction; Beam search
Gupta et al. [5]          Neural (RNN)
Pawar et al. [14]         MLN
Miwa and Bansal [12]      Neural (Bi/tree-LSTM)
Pawar et al. [15]         Table + Neural (NN, LSTM); MLN
Katiyar and Cardie [9]    Neural (Bi-LSTM)
Ren et al. [20]           Embedded Representations
Zheng et al. [26]         Neural (Bi-LSTM); Joint Label
Zhang et al. [25]         Table + Neural (Bi-LSTM); Global optimization
Bekoulis et al. [1]       CRF, Neural (Bi-LSTM); Parsing
Wang et al. [24]          Neural (Bi-LSTM); Parsing
Li et al. [10]            Neural (Bi-LSTM)

¹ E.g., Stanford CoreNLP 3.9.1 NER identifies Paris as a city name.

Most of the techniques exploit


either one of these aspects. But some recent techniques have exploited both of these aspects. Here, by joint model, we mean that a single model is learned for both the tasks of entity and relation extraction. For example, a single joint neural network model can be learned and both the tasks of entity and relation extraction share the same parameters. Overall, joint models can be of various types as shown in Table 2. Moreover, by joint inference, we mean that the decision about entity and relation labels is taken jointly at a global level (usually a sentence). Here, there may be separate underlying local models for entity and relation extraction. Overall, there are several joint inference/decoding techniques which are shown in Table 2.

5 Joint Inference Techniques

Here, we describe a few techniques used for joint inference: Integer Linear Programming (ILP): Here, a global decision is taken by using Integer Linear Programming which is consistent with some domain constraints. This approach was proposed by Roth and Yih [22]. They first learn independent local classifiers for entity and relation extraction. During inference, given a sentence, a global decision is produced such that the domain-specific or task-specific constraints are satisfied. Often these constraints capture mutual compatibility of entity and relation types. A simple example of such constraints is: both the arguments of the PER-SOC relation should be PER. Consider our example sentence – Paris, John’s sister, is staying in New York. Here, the entity extractor identifies two mentions John and Paris and also predicts entity types for these mentions. For John, let the predicted probabilities be: Pr(PER) = 0.99 and Pr(ORG) = 0.01. For Paris, let the predicted probabilities be: Pr(GPE) = 0.75 and Pr(PER) = 0.25. Also, the relation extractor identifies the relation PER-SOC between the two mentions. If we accept the best suggestions given by the local classifiers, then the global prediction is that the relation PER-SOC exists between the PER mention John and the GPE mention Paris. But this violates the domain-constraint mentioned earlier. Hence the global decision which satisfies all the specified constraints would be to label both the mentions as PER and mark the PER-SOC relation between them. Markov Logic Networks (MLN): Similar to ILP, MLN provides another framework for taking a global decision consistent with the domain constraints. MLN combines first order logic with probability. The domain rules or domain knowledge is represented in an MLN using weighted first order logic rules. Stronger the belief about any rule, higher is its associated weight. Inference in such an MLN gives the most probable true groundings of certain (query) predicates, while ensuring maximum weighted satisfiability of the rules. Pawar et al. [14,15], use MLN for joint inference for extracting entities and relations. As compared to ILP, MLN provides better representability in the form of first order logic rules. For example, the above-mentioned rule “both the arguments of the PER-SOC relation should be PER” can be written as: PER-SOC(x, y) ⇒ PER(x) ∧ PER(y)
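As a toy illustration of the constraint-based global decision described above (a brute-force stand-in for the ILP or MLN formulations, not the authors' implementations), the following sketch enumerates entity-type assignments and keeps only those consistent with a type-compatibility constraint; the probability values mirror the running example in the text.

```python
# Brute-force joint inference under a type-compatibility constraint.
from itertools import product

ENTITY_TYPES = ["PER", "ORG", "GPE"]

def joint_inference(entity_probs, relation, compatible):
    """entity_probs: {mention: {type: prob}}; relation: (m1, m2, rel_type);
    compatible: {rel_type: set of allowed (type1, type2) pairs}."""
    mentions = list(entity_probs)
    best, best_score = None, float("-inf")
    for assignment in product(ENTITY_TYPES, repeat=len(mentions)):
        types = dict(zip(mentions, assignment))
        m1, m2, rel = relation
        if (types[m1], types[m2]) not in compatible[rel]:
            continue  # violates the domain constraint
        score = sum(entity_probs[m].get(t, 0.0) for m, t in types.items())
        if score > best_score:
            best, best_score = types, score
    return best

# The local classifier prefers GPE for "Paris", but the PER-SOC relation with
# "John" forces the globally consistent PER label.
probs = {"John": {"PER": 0.99, "ORG": 0.01}, "Paris": {"GPE": 0.75, "PER": 0.25}}
constraints = {"PER-SOC": {("PER", "PER")}}
print(joint_inference(probs, ("Paris", "John", "PER-SOC"), constraints))
# -> both mentions labelled PER
```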


Fig. 1. The proposed new tagging scheme for a relation instance PHYS (Paris, New York) in our example sentence.

In addition to the rules for ensuring compatibility of entity and relation types, MLN can easily represent other complex domain knowledge. For example, the rule “a person can be employed at only one organization at a time” can be written as: EMP-ORG(x, y) ∧ PER(x) ∧ ORG(y) ∧ ORG(z) ∧ (y ≠ z) ⇒ ¬EMP-ORG(x, z) Joint Label: Zheng et al. [26] proposed a novel tagging scheme for joint extraction of entities and relations. This tagging scheme reduces the joint extraction task to a tagging problem. Intuitively, a single tag is assigned to a word which encodes entity as well as relation label information, automatically leading to joint inference. Figure 1 depicts an example sentence and its annotations as per the proposed new tagging scheme. The tag “O” represents the “Other” tag, which means that the corresponding word is not part of the expected relation tuples. The other tags consist of three parts: i) the word position in the entity, ii) the relation type, and iii) the relation role (argument number). The BIES (Begin, Inside, End, Single) encoding scheme is used for marking entity boundaries. The relation type information is obtained from a predefined set of relations and the relation role information is represented by the numbers 1 and 2. Let Entity1 and Entity2 be the first and second entity arguments of a relation type RT, respectively. Words in Entity1 are marked with the relation role 1 for RT. Similarly, words in Entity2 are marked with the relation role 2. Hence, the total number of tags is 2 × 4 × NumRelationTypes + 1. Here, the multiplier 4 represents the entity boundary tags BIES and the other multiplier 2 represents the two entity arguments for each relation type. For the example shown in Fig. 1, the words New and York are part of the entity mention which is the second argument of the PHYS relation and hence are marked with the tags B-PHYS-2 and E-PHYS-2, respectively. One limitation of this approach is that currently it cannot model the scenario where a single entity mention is involved in multiple relations with multiple other entity mentions. Hence, the other relation PER-SOC (Paris, John) in which Paris is involved cannot be handled. Beam Search: Li and Ji [11] proposed an approach for incremental joint extraction of entities and relations. They formulated the joint extraction task as a structured prediction problem to reveal the linguistic and logical properties of the hidden structures. Here, the output structure of each sentence was interpreted as a graph in which entity mentions are nodes and relations are directed arcs labelled with relation types. They designed several local as well as global features to characterize and score these structures. Hence, the joint extraction problem was reduced to predicting a structure with the highest score for any given sentence. They proposed a joint decoding/inference approach for this structured


Fig. 2. Card-pyramid graph for the sentence Paris, John’s sister, is staying in New York.

prediction task using beam-search. Intuitively, at the ith token, k best partial assignments/structures are maintained and extended further. Similarly, beamsearch based inference was also employed by Miwa and Sasaki [13] where the output structure for a sentence was a table representation. Parsing: Kate and Mooney [8] proposed a parsing based approach which uses a graph called as card-pyramid. The graph is so called because it encodes mutual dependencies among the entities and relations in a graph structure which resembles pyramid constructed using playing cards. This is a tree-like graph which has one root at the highest level, internal nodes at intermediate levels and leaves at the lowest level. Each entity in the sentence correspond to one leaf and if there are n such leaves then the graph has n levels. Each level l contains one less node than the number of nodes in the (l − 1) level. The node at position i in level l is parent of nodes at positions i and (i + 1) in the level (l − 1). Each node in the higher layers (i.e. layers except the lowest layer), corresponds to a possible relation between the leftmost and rightmost nodes under it in the lowest layer. Figure 2 shows this card-pyramid graph for an example sentence. To jointly label the nodes in the card-pyramid graph, the authors propose a parsing algorithm analogous to the bottom-up CYK parsing algorithm for Context Free Grammar (CFG) parsing. The grammar required for this new parsing algorithm is called Card-pyramid grammar and its consists of following production types: – Entity Productions: These are of the form EntityT ype → Entity, e.g. PER→John. Similar to the ILP based approach, a local entity classifier is trained to compute the probability that entity in the RHS being of the type given in the LHS of the production. – Relation Productions: These are of the form RelationT ype → EntityT ype1 EntityT ype2, e.g. PHYS→PER GPE. A local relation classifier is trained to obtain the probability that the two entities in the RHS are related by the type given in the LHS of the production.
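The following is a tiny, assumption-laden sketch of how the two Card-pyramid production types and their local classifier scores could be represented; the probability values and the small relation inventory are invented for illustration and are not from the survey.

```python
# Representing Card-pyramid grammar productions with local classifier scores.
entity_productions = {
    # leaf entity -> {entity type: probability from the local entity classifier}
    "Paris":    {"PER": 0.55, "GPE": 0.45},
    "John":     {"PER": 0.99},
    "sister":   {"PER": 0.97},
    "New York": {"GPE": 0.90},
}

relation_productions = {
    # (type of leftmost leaf, type of rightmost leaf) -> possible relation types
    ("PER", "PER"): ["PER-SOC"],
    ("PER", "GPE"): ["PHYS", "EMP-ORG", "GPE-AFF"],
}

def candidate_labels(left_leaf_type, right_leaf_type):
    """Relation types a higher-level node may take, given the entity types of the
    leftmost and rightmost leaves under it (to be scored by the relation classifier)."""
    return relation_productions.get((left_leaf_type, right_leaf_type), [])

print(candidate_labels("PER", "GPE"))   # ['PHYS', 'EMP-ORG', 'GPE-AFF']
```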


Given the entities in a sentence, the Card-pyramid grammar, and the local entity and relation classifiers, the card-pyramid parsing algorithm attempts to find the most probable labelling of all of its nodes which corresponds the entity and relation types. One limitation of this approach is that only entity type identification happens jointly with relation classification, i.e. boundary detection of entity mentions should be done as a pre-processing step and does not happen jointly. Recently, Bekoulis et al. [1] and Wang et al. [24] proposed joint extraction techniques which use dependency parsing like approaches for joint inference. Also, they allow multiple heads for a node (word) to represent participation in multiple relations simultaneously with other nodes.
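Before moving on to joint models, here is a small sketch of the joint-label tag set from the tagging scheme of Zheng et al. described earlier in this section; the two-relation inventory is an illustrative assumption.

```python
# Build the 2 x 4 x NumRelationTypes + 1 tag set of the joint tagging scheme.
def build_tag_set(relation_types):
    tags = ["O"]
    for rel in relation_types:
        for boundary in ("B", "I", "E", "S"):   # BIES boundary markers
            for role in (1, 2):                 # first or second relation argument
                tags.append(f"{boundary}-{rel}-{role}")
    return tags

relation_types = ["PHYS", "PER-SOC"]
tags = build_tag_set(relation_types)
assert len(tags) == 2 * 4 * len(relation_types) + 1   # 17 tags for 2 relation types

# Tagging the PHYS(Paris, New York) instance of Fig. 1: "Paris" is a single-word
# first argument; "New York" is a two-word second argument.
example = {"Paris": "S-PHYS-1", "New": "B-PHYS-2", "York": "E-PHYS-2"}
```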

6 Joint Models

Here, we describe a few joint models which have been employed for joint extraction of entities and relations:

Structured Prediction: In most of the earlier approaches for joint extraction of entities and relations, it was assumed that the boundaries of the entity mentions are known. Li and Ji [11] presented an incremental joint framework for simultaneous extraction of entity mentions and relations, which also incorporates the problem of boundary detection for entity mentions. The authors proposed to formulate the problem of joint extraction of entities and relations as a structured prediction problem. They aimed to predict the output structure (y ∈ Y) for a given sentence (x ∈ X), where this structure can be viewed as a graph modelling entity mentions as nodes and relations as directed arcs with relation types as labels. The following linear model is used to predict the most probable structure y′ for x, where f(x, y) is the feature vector that characterizes the entire structure:

$$y' = \arg\max_{y \in Y(x)} f(x, y) \cdot W$$

The score of each candidate assignment is defined as the inner product of the feature vector f (x, y) and feature weights W . The number of all possible structures for any given sentence can be very large and there does not exist a polynomialtime algorithm to find the best structure. Hence, they apply beam-search to expand partial configurations for the input sentence incrementally to find the structure with the highest score. Neural Models: Here, predictions for both the tasks of entity and relation extraction are carried out using a single joint neural model, where at least some of the model parameters are shared across both the tasks. Joint modelling is realized through such parameter sharing where training for any task updates the parameters involved in both the tasks. Miwa and Bansal [12] presented a neural model for capturing both word sequence and dependency tree substructure information by stacking bidirectional tree-structured LSTMs (tree-LSTM) on bidirectional sequential LSTMs (Bi-LSTM). Their model jointly represents both entities and relations with shared parameters in a single model. The overview of the model


Fig. 3. End-to-end relation extraction model, with bidirectional sequential and bidirectional tree-structured LSTM-RNNs.

is illustrated in the Fig. 3. It consists of three representation layers: i) a word embeddings layer, ii) a word sequence based LSTM-RNN layer (sequence layer), and iii) a dependency subtree based LSTM-RNN layer (dependency layer). While decoding, entities are detected in greedy, left-to-right manner on the sequence layer. And relation classification is carried out on the dependency layers, where each subtree based LSTM-RNN corresponds to a relation candidate between two detected entities. After decoding the entire model structure, the parameters are updated simultaneously via backpropagation through time (BPTT). The dependency layers are stacked on the sequence layer, so the embedding and sequence layers are shared by both entity detection and relation classification, and the shared parameters are affected by both entity and relation labels. This is the first joint neural model which motivated several other joint models [9,26]. This model was adopted for Biomedical domain by Li et al. [10]. In addition, they use Convolutional Neural Networks (CNN) for extracting morphological information (like prefix or suffix) from characters of words. Then each word in a sentence is represented by a concatenated vector of its word embeddings, POS embeddings and character-level representation by CNN. Character-level information is more useful in Biomedical domain because several biological entities share morphological or orthographic features, e.g., bacteria names helicobacter and campylobacter share the suffix bacter. Table Representation: Another idea for jointly modelling entity and relation extraction tasks is Table Representation or Table Filling. It was first proposed by Miwa and Sasaki [13]. Here, a table is associated with each sentence where every table cell is labelled with an appropriate label so that the whole entities and relations structure in a sentence is represented in a single table. Table 3 depicts this table representation for an example sentence. The diagonal cells of the table represent the entity labels which capture both boundary and type information with the help of BILOU (Begin, Inside, Last, Outside, Unit) or BIO encoding. E.g., in Table 3, the word New gets the label B-GPE as it is the first word of


the complete entity mention New York. The off-diagonal cells represent relation labels. Here, relations between entity mentions are mapped to relations between the last words of the mentions. E.g., the PHYS relation between sister and New York is assigned to the cell corresponding to sister and York. ⊥ represents no pre-defined relation between the corresponding words. As the table is symmetric, only the upper or lower triangular part of the table needs to be labelled. Miwa and Sasaki [13] approach this table filling problem using a structure learning approach similar to Li and Ji [11]. They define a scoring function to evaluate a possible label assignment to a table and build a model which predicts the most probable label assignment for a table, i.e. the one which maximizes the scoring function. During inference, beam search is used, which assigns labels to cells one by one and keeps the top K best assignments when moving from a cell to the next cell. Finally, it returns the best assignment when labels are assigned to all the cells. The authors propose various strategies to arrange the cells in two dimensions into a linear order. They also integrate various label dependencies into the scoring function to avoid illegal label assignments. E.g., the cell corresponding to the i-th and j-th words should never be assigned any valid relation label if either of the words is labelled with the entity label O.

Table 3. Table representation for the example sentence Paris , John 's sister , is staying in New York . (diagonal cells carry entity labels, off-diagonal cells carry relation labels, and ⊥ denotes no relation)

Diagonal (entity) labels: Paris = U-PER, John = U-PER, sister = U-PER, New = B-GPE, York = L-GPE; all remaining words (punctuation, 's, is, staying, in) = O.
Non-⊥ off-diagonal cells: (Paris, John) = PER-SOC, (John, sister) = PER-SOC, (Paris, York) = PHYS, (sister, York) = PHYS.
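The following is a sketch (assuming tokenised input with gold entities and relations given) of how such a word-by-word table could be constructed for the example sentence; it is an illustration of the representation, not the authors' training code.

```python
# Fill the table representation: entity labels on the diagonal, relation labels
# between the last words of related mentions, and "no relation" elsewhere.
NO_REL = "⊥"

def fill_table(tokens, entity_labels, relations):
    """entity_labels: one BILOU-style label per token; relations: list of
    (last_word_index_1, last_word_index_2, relation_type)."""
    n = len(tokens)
    table = [[NO_REL] * n for _ in range(n)]
    for i, label in enumerate(entity_labels):
        table[i][i] = label
    for i, j, rel in relations:
        table[i][j] = table[j][i] = rel   # the table is symmetric
    return table

tokens = ["Paris", ",", "John", "'s", "sister", ",", "is",
          "staying", "in", "New", "York", "."]
labels = ["U-PER", "O", "U-PER", "O", "U-PER", "O", "O",
          "O", "O", "B-GPE", "L-GPE", "O"]
relations = [(0, 2, "PER-SOC"), (2, 4, "PER-SOC"),
             (0, 10, "PHYS"), (4, 10, "PHYS")]
table = fill_table(tokens, labels, relations)
```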

The table representation idea has further motivated several other joint extraction approaches. Pawar et al. [15] use a similar table representation but instead of using BILOU encoding to represent entity boundaries, they introduced a new relation WEM (Within Entity Mention) between head word2 of an entity mention 2

Head word is generally the last word of noun phrase entities but not always. E.g., for Bank of America, the head word is Bank. Head word is that word of an entity mention through which the mention is linked to the rest of the sentence in its dependency tree.


and other words in the same entity mention. E.g., they would assign entity labels O and GPE to the words New and York, respectively and assign relation label WEM to the cell corresponding to New and York. Further, they train a neural network based model to predict an appropriate label for each cell in the table. They also employ Markov Logic Networks (MLN) based inference at a sentence level to incorporate various dependencies among entity and relation labels. Other recent approaches proposed by Zhang et al. [25] and Gupta et al. [5] build upon the same table representation idea and use Recursive Neural Networks (RNN) and Long Short-Term Memory (LSTM) based models.

7 Experimental Evaluation

In this section, we describe some of the most widely used datasets for end-to-end relation extraction and summarize the reported results on those datasets. We also describe the evaluation methodology and other experimental analyses.

7.1 Datasets

ACE 2004: It is the most widely used dataset in the relation extraction literature and is available from Linguistic Data Consortium (LDC) as catalogue LDC2005T09. It annotates both entity and relation types information in an XML like format. It identifies 7 entity types3 : (i) PER (person), (ii) ORG (organization), (iii) LOC (location), (iv) GPE (geo-political entity), (v) FAC (facility), (vi) VEH (vehicle), and (vii) WEA (weapon). Additionally, it identifies 22 fine-grained relation types which are grouped into 6 coarse-grained relation types4 : (i) EMP-ORG (employee-organization or subsidiary relationships), (ii) GPE-AFF (affiliations of PER/ORG to an GPE entity), (iii) PER-SOC (social relationships between two PER entities), (iV) ART (agent-artifact relationship), (v) PHYS (physical/located at), (vi) OTHER-AFF (other PER/ORG affiliations). Chan and Roth [3] used this dataset for the first time for evaluating end-to-end relation extraction. They ignored the original DISC (discourse) relation as it was only for the purpose of the discourse. They used only news wire and broadcast news subsections of this dataset which consists of 345 documents and 4011 positive relation instances. All the later approaches followed the same methodology for producing comparable results. ACE 2005: This dataset [23] is also available from LDC as catalogue LDC2006T06. It annotates the same entity types a that of ACE 2004. ACE 2005 also kept the relation types PER-SOC, ART and GPE-AFF of ACE 2004, but it split PHYS into two relation types PHYS and a new relation type PARTWHOLE. The DISC relation type was removed, and the relation type OTHER-AFF was merged into EMP-ORG. It was observed that ACE 2005 improved on both 3 4

www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-edt-v4.2.6.pdf. www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-rdc-v4.3.2.PDF.


annotation quality and relation type definition, as compared to ACE 2004. Li and Ji [11] used this dataset for the first time for evaluating end-to-end relation extraction. Ignoring two small subsets (cts and un) from informal genres, they selected the remaining 511 documents. These were randomly split into three parts: (i) training (351), (ii) development (80), and (iii) a blind test set (80). All the later approaches followed the same methodology for producing comparable results.

7.2 Evaluation of End-to-End Relation Extraction

As discussed earlier, an end-to-end relation extraction system is expected to identify: (i) boundaries of entity mentions, (ii) entity types of these entity mentions, and (iii) the relation type (if any) for each pair of entity mentions. Hence, evaluation of end-to-end relation extraction is often done at 2 levels:

1. Entity extraction: Here, only entity extraction performance is evaluated. Two entity mentions are said to be matching if both have the same boundaries (i.e. contain exactly the same sequence of words) and the same entity type. Any predicted entity mention is counted as a true positive (TP) if it matches any of the gold-standard entity mentions in the same sentence, otherwise it is counted as a false positive (FP). Both TP and FP are counted under the predicted entity type. Similarly, for each gold-standard entity mention, if there is no matching predicted entity mention in the same sentence, then a false negative (FN) is counted for the gold-standard entity type. For each entity type, precision, recall and F1 are computed using its TP, FP and FN counts. F1-scores across all entity types are micro-averaged for computing overall entity extraction performance.

Table 4. Performance of various approaches on the ACE 2004 dataset. The numbers are micro-averaged and obtained after 5-fold cross-validation. Actual folds used by each approach may differ. A “–” indicates a number not reported in the original paper.

Approach                 Entity extraction        Entity+Relation extraction
                         P     R     F            P     R     F
Pipeline [11]            81.5  74.1  77.6         58.4  33.9  42.9
Li and Ji [11]           83.5  76.2  79.7         60.8  36.1  45.3
Pawar et al. [14]        79.0  80.1  79.5         52.4  41.3  46.2
Miwa and Bansal [12]     80.8  82.9  81.8         48.7  48.1  48.4
Chan and Roth [3]        –     –     –            42.9  38.9  40.8
Pawar et al. [15]        81.2  79.7  80.5         56.7  44.5  49.9
Katiyar and Cardie [9]   81.2  78.1  79.6         46.4  45.3  45.7
Bekoulis et al. [1]      81.0  81.3  81.2         50.1  44.5  47.1


2. Entity+Relation extraction: Here, end-to-end relation extraction performance is evaluated. Any predicted or gold-standard relation mention consists of a pair of entity mentions along with their entity types, and an associated relation type. Hence, two relation mentions are said to be matching only if both the entity mentions match and the associated relation types are the same. Each gold-standard relation mention is counted as a TP if there is a matching predicted relation mention, otherwise it is counted as an FN. Similarly, each predicted relation mention is counted as an FP unless there is a matching gold-standard relation mention. For each relation type, precision, recall and F1 are computed using its TP, FP and FN counts. F1-scores across all relation types are micro-averaged for computing overall relation extraction performance.

Analysis of Results: Tables 4 and 5 show the results of various approaches on the ACE 2004 and ACE 2005 datasets, respectively. F1-scores still below 60% indicate how challenging the task of end-to-end relation extraction is. Li and Ji [11] carried out an interesting experiment where two human annotators were asked to perform end-to-end relation extraction manually on the ACE 2005 test dataset. The human F1-score for this task was observed to be around 70%. Moreover, the F1-score of the inter-annotator agreement (the entity/relation extractions where both the annotators agreed) was only about 51.9%. This analysis clearly establishes the high difficulty level of the task.

Table 5. Performance of various approaches on the ACE 2005 dataset. The numbers are micro-averaged and obtained on a test split of 80 documents. A “–” indicates a number not reported in the original paper.

Approach                 Entity extraction        Entity+Relation extraction
                         P     R     F            P     R     F
Pipeline [11]            83.2  73.6  78.1         65.1  38.1  48.0
Li and Ji [11]           85.2  76.9  80.8         65.4  39.8  49.5
Miwa and Bansal [12]     82.9  83.9  83.4         57.2  54.0  55.6
Katiyar and Cardie [9]   84.0  81.3  82.6         55.5  51.8  53.6
Zhang et al. [25]        –     –     83.6         –     –     57.5
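For concreteness, the two-level evaluation described above can be sketched as follows; the tuple representation of mentions is an assumption, and micro-averaging falls out of pooling all items into single sets.

```python
# Micro-averaged precision/recall/F1 over exactly-matched (relation) mentions.
def micro_prf(gold, predicted):
    """gold, predicted: sets of hashable items, e.g. (sentence_id, boundaries,
    entity_type) for entity evaluation, or (sentence_id, arg1, arg2, relation_type)
    for entity+relation evaluation."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Every TP/FP/FN is counted once regardless of its type, which is exactly the
# micro-averaging across entity (or relation) types described in the text.
```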


Table 6. Example sentences from the ACE 2004 dataset illustrating how the joint extraction of entities and relations helps in determining the entity type of its. Entity mentions of interest are highlighted in bold.

S1: U.S. District Court Judge Murray Schwartz in Wilmington, Del., ruled that Camelot Music could not deduct interest on loans it took out against life insurance on its 1,430 employees in 1990 through 1993.
    EntityType(its) = ORG, EntityType(employees) = PER, RelType(its, employees) = EMP-ORG

S2: our choice is the choice of permanent, comprehensive and just peace, and our aim is to liberate our land and to create our independence state in palestinian blast land with jerusalem as its capital and the return of our refugees to their homes.
    EntityType(its) = GPE, EntityType(capital) = GPE, RelType(its, capital) = PHYS

Another important aspect of the ACE datasets to note is the nature of their entity mentions. Overall, three types of entity mentions are annotated in the ACE datasets: (i) name mentions (generally proper nouns, e.g. John, United States), (ii) nominal mentions (generally common nouns, e.g. guy, employee), and (iii) pronoun mentions (e.g. he, they, it). Unlike the traditional Named Entity Recognition (NER) task, which extracts only the name mentions, the ACE entity extraction task focusses on extracting all three types of mentions. This makes it a more challenging task, yielding lower accuracies. Especially for pronoun mentions like its in the example sentences in Table 6, determining the entity type is more challenging. This is because the mention its is observed both as ORG and as GPE in the training data, depending on the context. In sentence S1 in Table 6, the knowledge that its is related to employees through the EMP-ORG relation helps in labelling its as ORG. Similarly, in sentence S2, the knowledge that its is related to capital through the PHYS relation helps in labelling its as GPE. Hence, these examples illustrate that, unlike in pipeline methods, in joint extraction methods the two tasks of entity extraction and relation extraction help each other.

7.3 Domain-Specific Entities and Relations

Except Li et al. [10], all other joint extraction approaches in Table 2 are evaluated on general domain datasets like ACE 2004 or ACE 2005. There is no previous study on how well the approaches designed for general domain work for domain-specific entities and relations. In this section, we present the results of our experiments where we apply a general domain technique on a Biomedical dataset. As a representative general domain approach, we choose Pawar et al. [15] which is the best performing approach on the ACE 2004 dataset.


Table 7. Performance of various approaches on the ADE dataset. The numbers are micro-averaged and obtained using 10-fold cross-validation. Actual folds used by each approach may differ.

Approach                                     Entity extraction        Entity+Relation extraction
                                             P     R     F            P     R     F
Li et al. [10]                               82.7  86.7  84.6         67.5  75.8  71.4
Pawar et al. [15] (GloVe vectors)            80.0  82.4  81.2         65.8  66.6  66.2
Pawar et al. [15] (PubMed vectors)           82.1  84.0  83.0         68.5  68.0  68.2
Pawar et al. [15] (GloVe vectors, Lenient)   82.8  85.2  84.0         70.6  71.3  70.9
Pawar et al. [15] (PubMed vectors, Lenient)  85.0  86.8  85.9         73.0  73.7  73.3

Li et al. [10] evaluate their end-to-end relation extraction approach on the Adverse Drug Event (ADE) dataset [6]. This dataset contains sentences from PubMed abstracts annotated with entity types DRUG, ADVERSE EVENT and DOSAGE. It also contains annotations for two relation types: (i) DRUG-AE between a DRUG and an ADVERSE EVENT it causes, and (ii) DRUG-DOSAGE between a DRUG and its DOSAGE. Li et al. [10] evaluated their model only on a subset of the ADE dataset containing sentences with at least one instance of the DRUG-AE relation. They also ignored 120 relation instances containing nested gold annotations, e.g., lithium intoxication, where lithium causes lithium intoxication. We also followed the same methodology for creating a dataset for our experiments. We ended up with a dataset of 4228 distinct sentences5 containing 6714 relation instances. Following is an example sentence and annotations from this dataset: After infliximab treatment, additional sleep studies revealed an increase in the number of apneic events and SaO2 dips suggesting that TNFalpha plays an important role in the pathophysiology of sleep apnea. There are two annotated relation instances of DRUG-AE for this sentence: (i) infliximab, increase in the number of apneic events , and (ii) infliximab, SaO2 dips .

Analysis of Results: Table 7 shows the results of both methods (Li et al. [10] and Pawar et al. [15]) on the ADE dataset for end-to-end extraction of the DRUG-AE relation. Li et al. used 300-dimensional word embeddings pre-trained on the PubMed corpus [18]. For Pawar et al., we experimented with two types of word embeddings: 100-dimensional GloVe embeddings trained on the Wikipedia corpus [17] (as reported in the original paper) as well as 300-dimensional embeddings trained on the PubMed corpus [18]. As the ADE dataset is also derived from PubMed abstracts, the PubMed word embeddings perform better than the GloVe embeddings. Even though it is designed for the general domain, Pawar et al. [15] produces results comparable to Li et al. [10]. Upon detailed analysis of errors, we found that the major source of errors was incorrect boundary detection for entities of type ADVERSE EVENT. As compared to the ACE datasets, the entities in the ADE dataset can have more complex syntactic structures. E.g., it is very rare for ACE entities to be noun phrases (NP) subsuming prepositional phrases (PP), but in the ADE dataset we frequently encounter entities like increase in the number of apneic events. We also observed that the boundary annotations for the ADVERSE EVENT entities are inconsistent. E.g., the complete phrase severe mucositis is annotated as an ADVERSE EVENT, but in the case of Severe rhabdomyolysis, only rhabdomyolysis is annotated as an ADVERSE EVENT. Hence, we carried out a lenient version of the evaluation where a predicted ADVERSE EVENT AEpredicted is considered to match a gold-standard ADVERSE EVENT AEgold if AEpredicted contains AEgold as a prefix or suffix and AEpredicted has at most one extra word as compared to AEgold. E.g., even if AEgold = rhabdomyolysis and AEpredicted = Severe rhabdomyolysis, we consider them to be matching. But if AEgold = severe mucositis and AEpredicted = mucositis, we do not consider them to be matching, because the predicted mention is missing a word which is expected as per the gold mention. This lenient evaluation leads to much better performance, as shown in Table 7.

5 Li et al. [10] mention the number of sentences in their dataset to be 6821, which seems to be a typo because the original paper [6] for the ADE dataset mentions that there are only 4272 sentences containing at least one drug-related adverse effect mention. After ignoring the 120 relation instances with nested annotations, this number comes down to 4228 in our dataset.
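The lenient matching rule described above can be sketched as follows; the token-list representation of spans is an assumption for illustration.

```python
# Lenient match: the prediction must contain the gold span as a prefix or suffix
# and add at most one extra word.
def lenient_match(predicted_tokens, gold_tokens):
    if len(predicted_tokens) < len(gold_tokens):
        return False                      # prediction is missing expected words
    if len(predicted_tokens) - len(gold_tokens) > 1:
        return False                      # more than one extra word
    is_prefix = predicted_tokens[:len(gold_tokens)] == gold_tokens
    is_suffix = predicted_tokens[-len(gold_tokens):] == gold_tokens
    return is_prefix or is_suffix

# Examples from the text:
assert lenient_match(["Severe", "rhabdomyolysis"], ["rhabdomyolysis"])   # match
assert not lenient_match(["mucositis"], ["severe", "mucositis"])         # miss
```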

8 Conclusion

In this paper, we surveyed various techniques for jointly extracting entities and relations. We first motivated the need for developing joint extraction techniques as opposed to traditional “pipeline” approaches. We then summarized more than a decade’s work in joint extraction of entities and relations in the form of a table. In that table, we categorized techniques based on the approach they adopt for joint extraction, i.e. whether they employ joint inference or joint modelling or both. We further described some of the representative techniques for joint inference and joint modelling. We also described standard datasets and evaluation techniques; and summarized performance of the joint extraction approaches on these datasets. We presented a brief analysis of application of a general domain joint extraction approach on the ADE dataset from Biomedical domain. We believe that this survey would be useful for researchers as well as practitioners in the field of Information Extraction. Also, these joint extraction techniques would motivate new techniques even for other NLP tasks such as Semantic Role Labelling (SRL) where predicates and arguments can be extracted jointly.

References 1. Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Joint entity recognition and relation extraction as a multi-head selection problem. arXiv preprint arXiv:1804.07847 (2018) 2. Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 724–731. Association


3.

4.

5.

6.

7.

8.

9.

10. 11.

12.

13.


for Computational Linguistics, Vancouver, British Columbia, Canada, October 2005. http://www.aclweb.org/anthology/H/H05/H05-1091 Chan, Y.S., Roth, D.: Exploiting syntactico-semantic structures for relation extraction. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 551–560. Association for Computational Linguistics, Portland, Oregon, USA, June 2011. http://www. aclweb.org/anthology/P11-1056 Doddington, G.R., Mitchell, A., Przybocki, M.A., Ramshaw, L.A., Strassel, S., Weischedel, R.M.: The automatic content extraction (ACE) program-tasks, data, and evaluation. In: LREC, vol. 2, p. 1 (2004) Gupta, P., Sch¨ utze, H., Andrassy, B.: Table filling multi-task recurrent neural network for joint entity and relation extraction. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2537–2547 (2016) Gurulingappa, H., Rajput, A.M., Roberts, A., Fluck, J., Hofmann-Apitius, M., Toldo, L.: Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 45(5), 885–892 (2012) Jiang, J., Zhai, C.: A systematic exploration of the feature space for relation extraction. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 113–120. Association for Computational Linguistics, Rochester, New York, April 2007. http://www.aclweb.org/anthology/N/N07/N071015 Kate, R.J., Mooney, R.: Joint entity and relation extraction using card-pyramid parsing. In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 203–212. Association for Computational Linguistics, Uppsala, Sweden, July 2010. http://www.aclweb.org/anthology/W10-2924 Katiyar, A., Cardie, C.: Going out on a limb: joint extraction of entity mentions and relations without dependency trees. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 917–928 (2017) Li, F., Zhang, M., Fu, G., Ji, D.: A neural joint model for entity and relation extraction from biomedical text. BMC Bioinform. 18(1), 198 (2017) Li, Q., Ji, H.: Incremental joint extraction of entity mentions and relations. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 402–412. Association for Computational Linguistics, Baltimore, Maryland, June 2014. http://www.aclweb.org/anthology/ P14-1038 Miwa, M., Bansal, M.: End-to-end relation extraction using lstms on sequences and tree structures. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1105–1116. Association for Computational Linguistics, Berlin, Germany (August 2016), http://www. aclweb.org/anthology/P16-1105 Miwa, M., Sasaki, Y.: Modeling joint entity and relation extraction with table representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1858–1869. Association for Computational Linguistics, Doha, Qatar (October 2014), http://www.aclweb.org/anthology/D141200


14. Pawar, S., Bhattacharya, P., Palshikar, G.K.: End-to-end relation extraction using Markov logic networks. In: Gelbukh, A. (ed.) CICLing 2016. LNCS, vol. 9624, pp. 535–551. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75487-1 41 15. Pawar, S., Bhattacharyya, P., Palshikar, G.: End-to-end relation extraction using neural networks and Markov logic networks. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, pp. 818–827 (2017) 16. Pawar, S., Palshikar, G.K., Bhattacharyya, P.: Relation extraction: a survey. arXiv preprint arXiv:1712.05191 (2017) 17. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 18. Pyysalo, S., Ginter, F., Moen, H., Ananiadou, S.: Distributional semantics resources for biomedical text processing. In: LBM 2013 (2013) 19. Qian, L., Zhou, G., Kong, F., Zhu, Q., Qian, P.: Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Coling 2008 Organizing Committee, Manchester, UK, pp. 697–704, August 2008. http://www. aclweb.org/anthology/C08-1088 20. Ren, X., et al.: CoType: joint extraction of typed entities and relations with knowledge bases. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1015–1024. International World Wide Web Conferences Steering Committee (2017) 21. Roth, D., Yih, W.t.: Probabilistic reasoning for entity & relation recognition. In: Proceedings of the 19th International Conference on Computational Linguistics, vol.1, pp. 1–7. ACL (2002) 22. Roth, D., Yih, W.t.: A linear programming formulation for global inference in natural language tasks. In: Ng, H.T., Riloff, E. (eds.) HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pp. 1–8. Association for Computational Linguistics, Boston, Massachusetts, USA, 6–7 May 2004 23. Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, p. 57 (2006) 24. Wang, S., Zhang, Y., Che, W., Liu, T.: Joint extraction of entities and relations based on a novel graph scheme. In: IJCAI, pp. 4461–4467 (2018) 25. Zhang, M., Zhang, Y., Fu, G.: End-to-end neural relation extraction with global optimization. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1730–1740 (2017) 26. Zheng, S., Wang, F., Bao, H., Hao, Y., Zhou, P., Xu, B.: Joint extraction of entities and relations based on a novel tagging scheme. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1227–1236 (2017) 27. Zhou, G., Su, J., Zhang, J., Zhang, M.: Exploring various knowledge in relation extraction. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 427–434. Association for Computational Linguistics, Ann Arbor, Michigan, June 2005. https://doi.org/10.3115/1219840. 1219893

Simple Unsupervised Similarity-Based Aspect Extraction Danny Suarez Vargas, Lucas R. C. Pessutto, and Viviane Pereira Moreira(B) Federal University of Rio Grande do Sul – UFRGS, Porto Alegre, RS, Brazil {dsvargas,lrcpessutto,viviane}@inf.ufrgs.br http://www.inf.ufrgs.br/

Abstract. In the context of sentiment analysis, there has been growing interest in performing a finer granularity analysis focusing on the specific aspects of the entities being evaluated. This is the goal of Aspect-Based Sentiment Analysis (ABSA) which basically involves two tasks: aspect extraction and polarity detection. The first task is responsible for discovering the aspects mentioned in the review text and the second task assigns a sentiment orientation (positive, negative, or neutral) to that aspect. Currently, the state-of-the-art in ABSA consists of the application of deep learning methods such as recurrent, convolutional and attention neural networks. The limitation of these techniques is that they require a lot of training data and are computationally expensive. In this paper, we propose a simple approach called SUAEx for aspect extraction. SUAEx is unsupervised and relies solely on the similarity of word embeddings. Experimental results on datasets from three different domains have shown that SUAEx achieves results that can outperform the state-of-the-art attention-based approach at a fraction of the time.

Keywords: Aspect-Based Sentiment Analysis · Aspect term extraction · Opinion target extraction

1 Introduction



Opinionated texts are abundant on the Web and their study has drawn a lot of attention from both companies and academics, originating the research field known as opinion mining or sentiment analysis. The last few years have been very prolific in this field, which combines Natural Language Processing and Data Mining. In a study which dates back to 2013, Feldman [3] mentioned that over 7,000 articles had already been written about the topic. Several facets of the problem have been explored and numerous solutions have been proposed. While most of the work in the area has been devoted to assigning a polarity score (positive, negative, or neutral) to the overall sentiment conveyed by the text of an entire review, in the last few years, there has been increasing interest in performing a finer-grained analysis. Such analysis, known as Aspect-Based Sentiment Analysis (ABSA) [27], deals basically with the tasks of extracting and


scoring the opinion expressed towards an entity. For example, in the sentence "The decor of the restaurant is amazing and the food was incredible", the words decor and food are the aspects of the entity (or category) restaurant. ABSA is a challenging task because it needs to accurately extract and rate fine-grained information from reviews. Review texts can be ambiguous and contain acronyms, slang, and misspellings. Furthermore, aspects vary from one domain to another – a word that represents a valid aspect in one domain may not do so in another domain. For example, consider the input sentence "The decor of the place is really beautiful, my sister loved it." in the Restaurant domain. The ABSA solution should focus its attention on the words "decor" and "restaurant". However, if we do not explicitly set Restaurant as the domain, the word "sister" can also gain attention as an aspect term. This poses a potential problem to extraction approaches. For example, an approach that is only based on rules [17] can assume that aspects are always nouns and come exactly after the adjective. However, rigid rules can create false positives because a noun does not always represent an aspect. Another approach could consider the distribution of words in texts and hypothesize that the number of occurrences of a given word is determinant to decide whether it is an aspect. Again, this could lead to false positives since high-frequency words tend to be stopwords. A disadvantage of current state-of-the-art approaches [4,23,24,26] is that they rely on techniques that require significant computational power, such as deep neural networks. In particular, neural attention mechanisms [1] are typically expensive. In this paper, we propose SUAEx, a simple unsupervised method for aspect extraction. SUAEx relies on vector similarity to emulate the attention mechanism, which allows us to focus on the relevant information. Our main contribution is to show that a simple and inexpensive solution can perform as well as neural attention mechanisms. We tested SUAEx on datasets from different domains, and it was able to outperform the state-of-the-art in ABSA in many cases in terms of quality and in all cases in terms of time.

2 Background and Definitions


Word-Embeddings. Representing words in a vector space is widely used as a means to map the semantic similarity between them. The underlying concept is the hypothesis that words with similar meanings are used in similar contexts. Word embeddings are a low dimensional vector representation of words which is able to keep the distributional similarity between them. Furthermore, word embeddings are able to map some linguistic regularities present in documents. Since the original proposal by Mikolov et al. [11], other techniques have been presented [2,14] adding to the popularity of word embeddings. Category and Aspect. The terms category and aspect are defined as follows. For a given sentence S = w1 , w2 , ..., wn taken from the text of a review, the category C is the broad, general topic of S, while the aspects are the attributes or characteristics of C [9]. In other words, a category (i.e., laptop, restaurant,


cell phone) can be treated as a cluster of related aspects (i.e., memory, battery, and processor are aspects that characterize the category laptop). Reference Words are important in the context of our proposal because they aid in the correct extraction of aspects (i.e., distinguishing aspects from nonaspects), and help determine whether an aspect belongs to a category. A reference word can be an aspect, or the name of the category itself (a synonym, meronym, or hyponym). For example, if we want to discover aspects from the category “laptop”, the words “computer”, “pc” and the word “laptop” itself would be reference words. Attention Mechanism. The attention mechanism was introduced in a neural machine translation solution [1]. The main idea was to modify the encoderdecoder structure in order to improve the performance for long sentences. For a given input sentence S = {w1 , w2 , .., wn }, the encoded value es of S and a set of hidden layers H = {h1 , h2 , ..., hm }, the decoder for each output yi considers not only the value of the previous hidden layer hi−1 and a general context c, but it also considers the relative importance of each output word yi with respect to the input sentence S. For example, for a given output word yj , it can be more important to see the words w2 , w3 in S, while for another output word yk , it can be more important to see only the word w4 in S. The attribution of the relative importance is performed by an attention mechanism.
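To make the similarity-as-attention idea concrete, the following minimal Java sketch (ours, for illustration only) computes the cosine similarity between toy word vectors and turns the resulting scores into attention-like weights with a softmax; the vectors and words are placeholders rather than embeddings taken from the paper.

```java
import java.util.Arrays;

// Illustration: cosine similarity between word vectors and a softmax over the scores.
public class SimilarityAttention {

    // Cosine similarity between two embedding vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Softmax turns raw similarity scores into attention-like weights that sum to 1.
    static double[] softmax(double[] scores) {
        double max = Arrays.stream(scores).max().orElse(0.0);
        double[] exp = new double[scores.length];
        double sum = 0;
        for (int i = 0; i < scores.length; i++) {
            exp[i] = Math.exp(scores[i] - max);  // subtract max for numerical stability
            sum += exp[i];
        }
        for (int i = 0; i < exp.length; i++) exp[i] /= sum;
        return exp;
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings"; real vectors would come from word2vec or similar.
        double[] food  = {0.9, 0.1, 0.0};
        double[] pizza = {0.8, 0.2, 0.1};
        double[] table = {0.1, 0.9, 0.2};
        double[] sims = {cosine(pizza, food), cosine(table, food)};
        System.out.println("attention-like weights: " + Arrays.toString(softmax(sims)));
    }
}
```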

3 Related Work


Sentiment Analysis can be performed at different levels of granularity. One could be interested in a coarse-grained analysis which assigns a sentiment polarity to an entire review document (i.e., document-level analysis); or to each sentence of the review; or, at a finer granularity, to the individual aspects of an entity. The aspect level is quickly gaining importance, mainly due to the relevant information that it conveys [9]. In this level of analysis, the aspects and entities are identified in natural language texts. The aspect extraction task can be classified into three main groups according to the underlying approach [27]: (i) based on language rules [6,16,17], (ii) based on sequence labeling models [7,8], and (iii) based on topic models [12]. However, other works do not fit in only one of these groups as they combine resources from more than one approach [20]. Furthermore, state-of-the-art approaches rely on more sophisticated architectures such as recurrent neural networks (LSTM, Bi-LSTM), neural attention models, and convolutional neural networks [4,5,15,23,24]. The work proposed by He et al. [5], known as Attention-based Aspect Extraction (ABAE), represents the state-of-the-art in ABSA and was used as the baseline of our work. ABAE relies on an attention neural network to highlight the most important words in a given text by de-emphasizing the irrelevant words. ABAE is a three-layer neural network. The input layer receives a given sentence S = {w_1, w_2, ..., w_n}; S is represented as a set of fixed-length vectors e = {e_{w_1}, e_{w_2}, ..., e_{w_n}}. These vectors are processed by the hidden layer, setting


the attention values a = {a_1, a_2, ..., a_n} related to a given context y_s. The context y_s is obtained from the average of the word vectors in e. After performing the attention mechanism, the input sentence is encoded as z_s and a dimensionality reduction is performed from the word-embedding space to the aspect-embedding space. In other words, the input sentence is represented only by the most relevant words in r_s. In addition, the process of training a neural network needs an optimization function. The output layer is the sentence reconstruction r_s and the function to optimize aims to maximize the similarity between z_s and r_s. Finally, the mathematical definition of ABAE is the following:

p_t = \mathrm{softmax}(W z_s + b)
r_s = T^{T} p_t
z_s = \sum_{i=1}^{n} a_i e_{w_i}
a_i = \frac{\exp(d_i)}{\sum_{j=1}^{n} \exp(d_j)}
d_i = e_{w_i}^{T} M y_s
y_s = \frac{1}{n} \sum_{i=1}^{n} e_{w_i}
J(\theta) = \sum_{s \in D} \sum_{i=1}^{m} \max(0, 1 - r_s z_s + r_s n_i)

where z_s encodes the input sentence S by considering the relevance of its words, a_i is the relevance of the i-th word in S, and d_i is the value that expresses the importance of the i-th word related to the context y_s. Finally, J(θ) is the objective function which is optimized in the training process. In summary, each group of solutions for ABSA has advantages and disadvantages. The methods based on language rules are simple but require manual annotation to construct the initial set of rules. Furthermore, these rules are domain-specific – a new set of rules is needed for each domain. The methods based on sequence models, topic models, and even some based on neural networks are supervised machine learning solutions, so their quality is directly proportional to the amount of annotated data. Finally, methods based on unsupervised neural networks, such as our baseline [5], achieve good results but at a high computational cost.
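As a reading aid for the equations above, here is a hedged Java sketch that computes the ABAE attention weights and the encoded sentence z_s for toy inputs; the embeddings and the matrix M are placeholders, and the reconstruction and training terms (p_t, r_s, J(θ)) are omitted, so this is not the authors' implementation.

```java
// Sketch of the ABAE attention equations (inference only, no training):
// y_s = average of word vectors, d_i = e_i^T M y_s, a_i = softmax(d), z_s = sum_i a_i e_i.
public class AbaeAttentionSketch {

    static double[] average(double[][] e) {
        double[] y = new double[e[0].length];
        for (double[] v : e)
            for (int j = 0; j < v.length; j++) y[j] += v[j] / e.length;
        return y;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] softmax(double[] d) {
        double[] out = new double[d.length];
        double sum = 0;
        for (int i = 0; i < d.length; i++) { out[i] = Math.exp(d[i]); sum += out[i]; }
        for (int i = 0; i < d.length; i++) out[i] /= sum;
        return out;
    }

    /** Returns z_s, the attention-weighted sentence encoding. */
    static double[] encode(double[][] e, double[][] M) {
        double[] ys = average(e);                               // context vector y_s
        double[] My = matVec(M, ys);
        double[] d = new double[e.length];
        for (int i = 0; i < e.length; i++) d[i] = dot(e[i], My); // d_i = e_i^T M y_s
        double[] a = softmax(d);                                 // attention weights a_i
        double[] zs = new double[e[0].length];
        for (int i = 0; i < e.length; i++)
            for (int j = 0; j < zs.length; j++) zs[j] += a[i] * e[i][j];
        return zs;
    }

    public static void main(String[] args) {
        double[][] e = {{1, 0}, {0, 1}, {0.7, 0.7}};  // toy word embeddings of one sentence
        double[][] M = {{1, 0}, {0, 1}};              // toy (normally learned) attention matrix
        System.out.println(java.util.Arrays.toString(encode(e, M)));
    }
}
```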

4 Simple Unsupervised Aspect Extraction


This section introduces SUAEx, a simple unsupervised similarity-based solution for ABSA. SUAEx relies on the similarity of vector representations of words to emulate the attention mechanism used in the state-of-the-art. Since our proposed solution does not need to train a neural network, SUAEx is computationally cheaper than state-of-the-art solutions and, as demonstrated in Sect. 6, it achieves results that can surpass the baseline at a fraction of the time. SUAEx consists of six modules depicted in Fig. 1: (i) Filtering, (ii) Selection of Reference Words, (iii) Preprocessing, (iv) Word-Embeddings Representation, (v) Similarity, and (vi) Category Attribution. SUAEx requires three inputs and generates two outputs. The inputs are: Input1 – the raw data expressed as

The code for SUAEx is available at https://github.com/dannysvof/SUAEx.git.


free text from a given domain (which is used to build the domain-specific word embeddings); Input2 – the test data with the reviews for which the aspects and categories will be extracted; and Input3 – the reference words, which are used to determine the categories as well as extract the aspects related to each category. Next, we describe the components of SUAEx.

Fig. 1. SUAEx framework. The continuous arrows represent the path taken by Input 1, while the dashed arrows represent the path followed by Inputs 2 and 3. (Diagram: the domain raw data (Input 1) passes through the Filtering and Word-Embeddings Representation modules; the test data (Input 2) is preprocessed and, together with the reference words selected for each category (Input 3), feeds the Similarity module, which yields the aspects by category (Output 1); the Category Attribution module then produces the test data with categories (Output 2).)

The Selection of Reference Words module is responsible for choosing the representative words for each category. In other words, if we want k categories as output, we need to select k groups of reference words. The selection of the words for each group can be performed in three different ways: manual, semiautomatic, and automatic. The manual selection can be done by simply selecting the category words themselves as reference words. The semi-automatic selection can be performed by expanding the initial manually constructed groups of reference words. The expansion can be done through the search for synonyms or meronyms of the words that represent the category name. Finally, an automatic selection mechanism can be performed by considering a taxonomy of objects [19]. The Filtering module aims to select the domain related part from raw data. This module is optional but it is particularly useful when we want to delve into a certain category and we only have raw data for the general topic. For example, if we just performed the aspect extraction for the category “Electronics”, we have the raw data for it. If now, we want to perform the aspect extraction of the category “Laptop”, we need only raw data about this new more specific category. This module is in charge of selecting the right raw data from a large corpus, thus manual filtering is unfeasible. Filtering can be implemented as a binary text classifier for the domain of interest, or simply by choosing reviews that mention the category name.
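A minimal Java sketch of the simplest filtering strategy just mentioned, keeping only raw sentences that mention the category name or one of its reference words; the category and reference words below are hypothetical examples, not the lists used in the paper.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Keep only the raw-data sentences that mention a reference word of the target category.
public class DomainFilter {

    static List<String> filter(List<String> rawSentences, Set<String> referenceWords) {
        return rawSentences.stream()
                .filter(s -> {
                    String lower = s.toLowerCase();
                    return referenceWords.stream().anyMatch(lower::contains);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
                "This laptop has a great battery",
                "The movie was too long",
                "My new computer boots fast");
        // Example reference words for a hypothetical "laptop" category.
        Set<String> refs = Set.of("laptop", "computer", "pc");
        System.out.println(filter(raw, refs));  // keeps the first and third sentence
    }
}
```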


The Preprocessing module normalizes the input data and reduces the amount of raw data needed to construct the vector representation of words (word-embeddings). The amount of data needed to train word-embeddings is directly proportional to the size of the vocabulary of the raw data. Since preprocessing reduces the size of the vocabulary, it has the effect of reducing the amount of raw data needed. This module encompasses typical preprocessing tasks such as tokenization, sentence splitting, lemmatization, and stemming. The Word-Embedding Representation module is responsible for creating the vector representation of words which will be used to measure word similarity in the Similarity module. It receives the preprocessed raw data, transforms it into a vector representation and returns a domain-specific model. The model can be generated using well-known tools such as Word2vec [11], Glove [14], or Fastext [2]. This module is particularly important because SUAEx relies solely on the similarity of domain-specific word-embeddings. The Similarity as an Attention Mechanism module receives two types of inputs, the preprocessed reference words and the test data. The goal of this module is to emulate the behavior of the attention mechanism in a neural network by assigning attention values to each word in an input sentence in relation to a given set of reference words. For each group of reference words, (which are in the same number as the categories desired as output), it returns an attentionvalued version of the test data. This output can be used in two ways: to identify the aspects for each category or as an input to the Category Attribution module. A vector similarity measure, like the cosine similarity, is used to attribute the relevance of a given word x in relation to another word y or related to a group of words c. In this module, we can test with two types of similarity values. The similarity obtained from the direct comparison of two words (direct similarity) and the similarity obtained from the comparison of two words in relation to some contextual words (contextual similarity). Finally, the attention values are obtained by applying the sof tmax function to the similarity values. Output1 is the test data together with the values for attention and similarity assigned by SUAEx. For example, if we consider three groups of reference words in the input, “food”, “staff”, and “ambience”, Output1 consists in three attention-valued and similarity-valued versions of the test data (one set of values for each category). The Category Attribution module uses the output of the Similarity module to assign one of the desired categories to each sentence in the test data. In this module, we can test different ways to aggregate the similarity values assigned to the words. For example, one could use the average for each sentence or only consider the maximum value [21]. If the average is used, it means that there are more than one relevant word to receive attention in the sentence. However, if only the maximum value is used the word with the highest score will get all the attention. Output2 is the main output which contains all the sentences of the test data with the categories assigned by the Category Attribution module.
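The following Java sketch illustrates, under simplified assumptions, how the Similarity and Category Attribution modules can work together: every word is scored against each group of reference words via cosine similarity, and the sentence is assigned the category with the highest aggregated score, using either the maximum or the average as discussed above. The toy embedding table and reference groups are ours, not the paper's resources.

```java
import java.util.HashMap;
import java.util.Map;

// Similarity-as-attention plus category attribution on toy data.
public class CategoryAttribution {

    static final Map<String, double[]> EMB = Map.of(
            "pizza",  new double[]{0.9, 0.1},
            "waiter", new double[]{0.1, 0.9},
            "great",  new double[]{0.4, 0.4},
            "food",   new double[]{1.0, 0.0},
            "staff",  new double[]{0.0, 1.0});

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Score of one word against a group of reference words: take the best match in the group.
    static double wordScore(String word, String[] referenceGroup) {
        double best = 0;
        double[] w = EMB.getOrDefault(word, new double[]{0, 0});
        for (String ref : referenceGroup) best = Math.max(best, cosine(w, EMB.get(ref)));
        return best;
    }

    // Aggregate word scores per category; 'useMax' switches between the two
    // aggregation strategies discussed in the text (maximum vs. average).
    static String assignCategory(String[] sentence, Map<String, String[]> refGroups, boolean useMax) {
        String bestCat = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, String[]> g : refGroups.entrySet()) {
            double agg = useMax ? Double.NEGATIVE_INFINITY : 0;
            for (String w : sentence) {
                double s = wordScore(w, g.getValue());
                agg = useMax ? Math.max(agg, s) : agg + s / sentence.length;
            }
            if (agg > bestScore) { bestScore = agg; bestCat = g.getKey(); }
        }
        return bestCat;
    }

    public static void main(String[] args) {
        Map<String, String[]> groups = new HashMap<>();
        groups.put("food", new String[]{"food"});
        groups.put("staff", new String[]{"staff"});
        String[] sentence = {"pizza", "great"};
        System.out.println(assignCategory(sentence, groups, true));  // expected: food
    }
}
```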


5 Experimental Design



The experiments in this section aim at evaluating SUAEx for aspect and category extraction both in terms of quality and runtime. Our tests are done over datasets coming from three different domains. The experiments are organized in two parts. In the first experiment, the goal is to compare SUAEx to our baseline ABAE [5]. We hope to answer the following questions: Can a simple approach like SUAEx achieve results that are close to the state-of-the-art in aspect extraction? and How does SUAEx behave in different domains? The second experiment performs a runtime analysis of the two approaches. Below, we describe the datasets, tools, and resources used. Datasets. The datasets are summarized in Table 1 and come from different sources and different domains to enable a broad evaluation. They are all freely available to allow comparison with existing approaches. ABAE Datasets. One of our sources of data is our baseline [5], which made two datasets available – one in the domain of Restaurant, known as CitySearch, and another in the domain of Beer (originally presented in [10]), known as BeerAdvocate. Each dataset consists of two files: one for training the vector representation of words and one with test data. We used these datasets to test and compare our method to the baseline under the same conditions. For each sentence in the test file, the datasets have annotations that indicate the expected category. SemEval Datasets. Our second source of data is the SemEval evaluation campaigns, specifically an ABSA task. The reviews are on the domains of Restaurant and Laptop. We used the train and test files, which contain the text of the reviews, the aspect category, the aspect words, the aspect word positions in the text, and their polarity. We considered the category entities as the categories for each review text. For example, if the review text is "The pizza was great", the category for the aspect cluster is the word "food". The SemEval datasets were used with the goal of testing the adaptability of our solution across domains. In order to run SUAEx with the SemEval datasets, some modifications to the test data had to be made: (i) we removed the instances which could be classified into more than one category; (ii) we considered only the entities as category labels; and (iii) we discarded the categories with very few instances. The same modified version was submitted to both competing systems. For the datasets on the Restaurant domain, we selected three groups of reference words, {"food"}, {"staff"}, and {"ambience"}. For BeerAdvocate, we selected the groups {"feel"}, {"taste"}, {"look"}, and {"overall"}. And for the dataset Sem2015-Laptop, we selected the groups {"price"}, {"hardware"}, {"software"}, and {"support"}. Tools and Parametrization. NLTK was used in the preprocessing module to remove stopwords, perform tokenization, sentence splitting, and lemmatization. The domain-specific word-embeddings were created with Word2Vec

http://alt.qcri.org/semeval2015/task12/. https://www.nltk.org/. https://code.google.com/archive/p/word2vec/.

Table 1. Dataset statistics

Dataset name         | File  | # sentences
CitySearch           | Train | 281,989
CitySearch           | Test  | 3,328
BeerAdvocate         | Train | 16,882,700
BeerAdvocate         | Test  | 9,236
Sem2015-Restaurant   | Train | 281,989
Sem2015-Restaurant   | Test  | 453
Sem2015-Laptop       | Train | 1,083,831
Sem2015-Laptop       | Test  | 241

using the following configuration: CBOW mode, window size of 5 words, and 200 dimensions for the resulting vectors. The remaining parameters were used with the default values (negative sampling = 5, number of iterations = 15). The similarity between word vectors was measured with the cosine similarity in Gensim [18], which reads the model created by Word2vec. Scikit-learn provided the metrics for evaluating the quality of the aspects and categories extracted with the traditional metrics (precision, recall, and F1). Amazon raw data was taken from a public repository to be used as an external source of data. This data was necessary for the laptop domain, since the training file provided by SemEval is too small and insufficient to create the domain word-embeddings. Baseline. Our baseline, ABAE [5], was summarized in Sect. 3. Its code is available on a GitHub repository. We downloaded it and ran the experiments using the default configurations according to the authors' instructions (i.e., word-embeddings dimension = 200, batch size = 50, vocabulary size = 9000, aspect groups = 14, training epochs = 15). Despite the authors having released a pretrained model for the restaurant domain, we ran the provided code from scratch and step-by-step in order to measure the execution time.
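For reference, here is a small Java sketch of how the reported per-category precision, recall, and F1 can be computed from gold and predicted labels; the paper used scikit-learn for this, so the sketch merely restates the standard definitions on toy data.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Per-category precision, recall, and F1 from gold and predicted category labels.
public class PrfMetrics {

    static void report(List<String> gold, List<String> predicted) {
        Set<String> categories = new TreeSet<>(gold);
        for (String c : categories) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < gold.size(); i++) {
                boolean g = gold.get(i).equals(c);
                boolean p = predicted.get(i).equals(c);
                if (g && p) tp++;
                else if (!g && p) fp++;
                else if (g && !p) fn++;
            }
            double precision = tp + fp == 0 ? 0 : (double) tp / (tp + fp);
            double recall    = tp + fn == 0 ? 0 : (double) tp / (tp + fn);
            double f1 = precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
            System.out.printf("%s  P=%.3f R=%.3f F1=%.3f%n", c, precision, recall, f1);
        }
    }

    public static void main(String[] args) {
        report(List.of("food", "staff", "food", "ambience"),
               List.of("food", "food", "food", "ambience"));
    }
}
```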

6 Results and Discussion


Results for the Quality of the Aspects and Categories Extracted. The evaluation metrics precision, recall, and F1 were calculated by comparing the outputs generated by both methods against the gold-standard annotations. Figure 2 shows the results of the evaluation metrics for both approaches averaged across categories. As an overall tendency, SUAEx achieves better recall than the baseline in all datasets. ABAE tends to have better precision in most cases (three out of four datasets). This is expected and can be attributed to the contrast in the way the two solutions use the attention mechanism. While

https://scikit-learn.org/stable/. http://jmcauley.ucsd.edu/data/amazon/. https://github.com/ruidan/Unsupervised-Aspect-Extraction.


Fig. 2. Overall results averaged across categories: precision (P), recall (R), and F1 for SUAEx and ABAE on the CitySearch, BeerAdvocate, Sem2015-Restaurant, and Sem2015-Laptop datasets.

ABAE only considers the highest attention-valued word in the sentence, SUAEx uses all the attention values in the sentences. This difference can be seen in the example from Fig. 3. SUAEx considers the reference words as a type of context to guide category attribution. Basically, for a given sentence, ABAE tries to be precise by focusing on a single word, while SUAEx tries to be more comprehensive by considering more words. Our recall improvement was superior to our decrease in precision, so our F1 results were better in all datasets. The results demonstrate the adaptability of SUAEx to different domains. On the SemEval datasets, SUAEx outperformed the baseline in nearly all cases. We can attribute ABAE’s poor results in the SemEval datasets to the dependence on the training (i.e., raw) data. While SUAEx only uses the training data to generate the word-embeddings representation, ABAE also uses it in the evaluation module because it clusters the training data. The results of the evaluation metrics per aspect category are shown in Table 2. SUAEx scored lower in recall for Ambience, Smell, and Look because the reference words (i.e., the names of the categories themselves) are not as expressive as the reference words for the other categories. We can find more similar words for general terms like food or staff than for specific terms like smell. Other works have also used the CitySearch and BeerAdvocate datasets. Thus we are also able to compare SUAEx to them. These works have applied techniques such as LDA [28], biterm topic-models [25], a statistical model over seed words [13], or restricted Boltzmann machines [22]. The same tendency found in the comparison with the baseline remains, i.e., SUAEx achieves better recall in all cases except for the categories Ambience and Smell. In terms of F1, SUAEx is the winner for Food, Staff, Taste, Smell, and Taste+Smell. Figure 3 presents an example in which SUAEx assigns more accurate attention-values to an input sentence. This happens because the word pizza is more similar to our desired category f ood than the word recommend. However, our method is dependent on the reference words, and in some cases it can assign high attention values to adjectives (which typically are not aspects). This happens with the word higher in our example, which is an adjective and received


the second highest score. This could be mitigated with a post-filter based on part-of-speech tagging. In Table 3, we show the aspect words extracted for the CitySearch dataset. The extraction was performed by selecting the highest attention-valued words of each sentence and by considering the category classification results. This extraction can be used as an additional module in our framework (Fig. 1). Table 2. Results for the quality of aspect category extraction. Category

SUAEx

ABAE

P

R

F1

P

R

F1

Food

0.917

0.900 0.908 0.953 0.741

0.828

Staff

0.660

0.872 0.752

0.757

Ambience

0.884 0.546

CitySearch

0.675

0.802 0.728 0.815

0.698 0.740

BeerAdvocate Feel

0.687

Taste

0.656 0.794 0.718 0.637

0.832 0.753

0.815 0.824 0.358

Smell

0.689 0.614

0.744 0.575

0.649 0.483

Taste+Smell 0.844

0.922 0.881 0.897 0.853

Look

0.849

0.876

0.862

0.816 0.456 0.866

0.969 0.882 0.905

Sem2015-Restaurant Food

0.953 0.674 0.789 0.573

0.213

0.311

Staff

0.882 0.714 0.789 0.421

0.159

0.213

Ambience

0.627 0.967 0.760 0.107

0.206

0.141

Sem2015-Laptop Price

0.750

0.915 0.824 0.895 0.576

0.701

Hardware

0.785

0.797 0.791 0.914 0.481

0.631

Software

0.714 0.455 0.556 0.714 0.114

0.196

Support

0.667

0.118

0.800 0.727 0.083

0.200

Runtime Results. In order to obtain the runtime results, we ran both methods on the same configuration (Intel Core i7-4790 CPU, GeForce GTX 745 GPU, and 16 GB of main memory). Since both methods went through the same preprocessing steps, our comparison focused on the attention mechanisms (Attention Neural Network training in ABAE and the Similarity as Attention Mechanism module in SUAEx). Table 4 shows the runtime results. The differences are remarkable – ranging from one thousand to almost ten thousand times. For the BeerAdvocate dataset, we were unable to obtain the runtimes for the baseline because the number of training sentences was too large. One could argue that, in practice, these differences are not so significant because they concern the training phase, which can be performed once only. However, training has to be repeated for each domain and, from time to time, to cope with how the vocabulary changes.



Table 3. Aspects extracted by SUAEx for the CitySearch dataset

Category | Aspect terms
Staff | replied, atmosphere, answering, whoever, welcomed, child, murray, manager, existence, cold, staff, busy, forward, employee, smile, friendly, gave, woman, dessert, early, kid, lady, minute, bar, helpful, wooden, always, greeting, server, notified, busier, nose, night, guy, tray, seating, everyone, hour, crowd, people, seat, lassi, proper, divine, event, folk, even, waitstaff, borderline, ice
Ambience | antique, atmosphere, outdoor, feel, proximity, scene, sleek, bright, weather, terrace, dining, surroundings, music, calm, peaceful, ambience, cool, sauce, location, ceiling, garden, painted, relaxed, dark, warm, artsy, excellent, tank, furniture, dim, bar, romantic, level, inside, parisian, air, architecture, aesthetic, adorn, beautiful, brightly, neighborhood, ambiance, alley, elegant, decour, leafy, casual, decor, room
Food | seasonal, selection, soggy, chinese, cheese, penang, dosa, doughy, corned, sichuan, mojito, executed, innovative, dish, chicken, calamari, thai, butternut, bagel, northern, vietnamese, paris, menu, technique, dumpling, dhal, better, location, congee, moules, rice, sauce, ingredient, good, straightforward, mein, food, dessert, overdone, appetizer, creatively, fusion, know, unique, burnt, minute, panang, risotto, shabu, roti

Fig. 3. An example of the attention values assigned to the input sentence "I couldn't recommend their Godmother pizza any higher" by the two solutions.

Table 4. Runtime for both methods in all datasets

        | CitySearch | BeerAdvocate | Sem2015-Restaurant | Sem2015-Laptop
SUAEx   | 42 s       | 7 min        | 13 s               | 36 s
ABAE    | 12 h       | Undefined    | 12 h               | 98 h


7 Conclusion



This paper proposes SUAEx, an alternative approach to deep neural network solutions for unsupervised aspect extraction. SUAEx relies on the similarity of word-embeddings and on reference words to emulate the attention mechanism used by attention neural networks. With our experimental results, we concluded that SUAEx achieves results that outperform the state-of-the-art in ABSA in a number of cases at remarkably lower runtimes. In addition, SUAEx is able to adapt to different domains. Currently, SUAEx is limited to dealing with aspects represented as single words. As future work, we will extend it to treat compound aspects such as "wine list" and "battery life". Also, we will improve the selection of reference words by using hierarchical data such as subject taxonomies. Acknowledgments. This work was partially supported by CNPq/Brazil and by CAPES Finance Code 001.

References 1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014), http://arxiv.org/abs/1409. 0473 2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. CoRR abs/1607.04606 (2016), http://arxiv.org/abs/1607. 04606 3. Feldman, R.: Techniques and applications for sentiment analysis. Commun. ACM 56(4), 82–89 (2013). https://doi.org/10.1145/2436256.2436274 4. Giannakopoulos, A., Musat, C., Hossmann, A., Baeriswyl, M.: Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. In: WASSA, pp. 180–188 (2017) 5. He, R., Lee, W.S., Ng, H.T., Dahlmeier, D.: An unsupervised neural attention model for aspect extraction. In: ACL, vol. 1, pp. 388–397 (2017) 6. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: SIGKDD, pp. 168–177 (2004) 7. Jakob, N., Gurevych, I.: Extracting opinion targets in a single-and cross-domain setting with conditional random fields. In: EMNLP, pp. 1035–1045 (2010) 8. Jin, W., Ho, H.H., Srihari, R.K.: A novel lexicalized HMM-based learning framework for web opinion mining. In: ICML, pp. 465–472 (2009) 9. Liu, B.: Sentiment analysis and opinion mining. Synth. ect. Hum. Lang. Technol. 5(1), 1–167 (2012) 10. McAuley, J., Leskovec, J., Jurafsky, D.: Learning attitudes and attributes from multi-aspect reviews. In: ICDM,pp. 1020–1025 (2012). https://doi.org/10.1109/ ICDM.2012.110 11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119 (2013) 12. Moghaddam, S., Ester, M.: ILDA: interdependent LDA model for learning latent aspects and their ratings from online product reviews. In: SIGIR, pp. 665–674 (2011)


13. Mukherjee, A., Liu, B.: Aspect extraction through semi-supervised modeling. In: ACL, pp. 339–348 (2012). http://dl.acm.org/citation.cfm?id=2390524.2390572 14. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014). http://aclweb.org/anthology/D/ D14/D14-1162.pdf 15. Poria, S., Cambria, E., Gelbukh, A.: Aspect extraction for opinion mining with a deep convolutional neural network. Knowl.-Based Syst. 108, 42–49 (2016) 16. Poria, S., Cambria, E., Ku, L.W., Gui, C., Gelbukh, A.: A rule-based approach to aspect extraction from product reviews. In: SocialNLP, pp. 28–37 (2014) 17. Qiu, G., Liu, B., Bu, J., Chen, C.: Opinion word expansion and target extraction through double propagation. Comput. Linguist. 37(1), 9–27 (2011) ˇ uˇrek, R., Sojka, P.: Software framework for topic modelling with large corpora. 18. Reh˚ In: LREC Workshop on New Challenges for NLP Frameworks, pp. 45–50, May 2010 19. Salazar, C.: Una taxonom´ıa para los objetos. Revista Universidad de Antioquia, pp. 78–82, October 2014 20. Toh, Z., Su, J.: NLANGP at SemEval-2016 task 5: improving aspect based sentiment analysis using neural network features. In: SemEval-2016, pp. 282–288 (2016) 21. Vargas, D.S., Moreira, V.P.: Identifying sentiment-based contradictions. JIDM 8(3), 242–254 (2017). https://seer.lcc.ufmg.br/index.php/jidm/article/view/4624 22. Wang, L., Liu, K., Cao, Z., Zhao, J., de Melo, G.: Sentiment-aspect extraction based on restricted Boltzmann machines. In: ACL, pp. 616–625 (2015). https:// doi.org/10.3115/v1/P15-1060 23. Wang, W., Pan, S.J., Dahlmeier, D., Xiao, X.: Recursive neural conditional random fields for aspect-based sentiment analysis. In: EMNLP, pp. 616–626 (2016) 24. Wang, W., Pan, S.J., Dahlmeier, D., Xiao, X.: Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In: AAAI, pp. 3316–3322 (2017) 25. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: WWW, pp. 1445–1456 (2013). https://doi.org/10.1145/2488388.2488514 26. Yin, Y., Wei, F., Dong, L., Xu, K., Zhang, M., Zhou, M.: Unsupervised word and dependency path embeddings for aspect term extraction. In: IJCAI, pp. 2979–2985 (2016). http://www.ijcai.org/Abstract/16/423 27. Zhang, L., Liu, B.: Aspect and entity extraction for opinion mining. In: Chu, W.W. (ed.) Data Mining and Knowledge Discovery for Big Data. SBD, vol. 1, pp. 1–40. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-40837-3 1 28. Zhao, W.X., Jiang, J., Yan, H., Li, X.: Jointly modeling aspects and opinions with a maxent-LDA hybrid. In: EMNLP, pp. 56–65 (2010). http://dl.acm.org/citation. cfm?id=1870658.1870664

Streaming State Validation Technique for Textual Big Data Using Apache Flink Raheela Younas1 and Amna Qasim2(B) 1 Lahore College for Women University, Lahore, Pakistan

[email protected]

2 Minhaj University Lahore, Lahore, Pakistan

[email protected]

Abstract. Data processing is at the top of the list when handling large amounts of data. Batch processing and stream processing are the two main frameworks for handling big data. Stateful stream processing has been attracting attention due to its applicability to an extensive range of scenarios. Stream processors are emerging as frameworks that are not only analytics-focused but also handle the core of persistent application logic. Thus, growing systems need support for application state with robust consistency guarantees, partial-failure handling, and adaptive cluster reconfigurations. Apache Flink, an open-source stream processing engine, serves such needs as a main design principle by providing state management features. However, there is no strong consistency or accuracy guarantee for the returned state values. The proposed research provides an approach to access state via the State Processor API recently introduced by Apache Flink, together with a validation process for the calculated state.

Keywords: Apache Flink · API · Stream processor

1 Introduction

Large amounts of versatile data can be processed efficiently and with quick response times by big data systems rather than by traditional systems [1]. Batch processing and stream processing are the two basic processing approaches that classify big data systems [2]. Batch-oriented systems process data files of significant size, whereas stream-oriented systems process data continuously as it enters the system. Stream processing operates on real-time data: data is processed as it enters the streaming pipeline for decision making and analytics, so that results are instant and continuously updated as data becomes available. For example, on an e-commerce website, a user gets a list of suggested products based on his search. Apache Spark [3], Apache Flink [4], Apache Samza [5], or Apache Storm [6] can be used for stream processing. Stateful stream processing changes the rules of the game by integrating state to handle real-time unbounded data streams. State is a core unit of stream processing frameworks in the big data world. It is persisted in reliable storage and updated occasionally. Stateful



stream processing framework is a design pattern for processing unbounded streams of event. State management in stream processing frameworks has received considerable attentions in current years because of the importance of state [7] for several aspects such as fault-tolerance and efficient failure recovery [8]. Apache Flink is an open-source system for processing streaming and batch data [9]. Apache Flink is a link between traditional database systems and big data analysis frameworks. With several levels of complexity and persistence, Flink assists an extensive range of configurable state backends. Flink has the feature to directly query the state known as queryable state [10]. Recently Flink introduced the API to access the state from outside world of Flink cluster known as State processor API, now state can be read and write from outside which enables real-time queries and also allows state debugging [11]. Before introducing the State Processor API [12], Flink provides the feature “Queryable State” to access and query the state values as state represents the current condition of process. Queryable State is not much efficient way because it does not provide the access of state outside the process, and it was also not possible to check state consistency. Now State Processor API provides the feature to expose the state outside but still there is no method that grantee the consistency of state. Consistency of state refers that the exposed state provides the correct information in different scenarios. The question arises how to check that calculated state is correct. The proposed research work is introducing the method for validation of calculated state using State Processor API without having impact on the consistency, scalability, and latency. Querying state from outside will be very helpful for state debugging and bootstrapping purposes. The rest of the paper is ordered as follows. Section 2 gives an overview of Stateful stream processing and Apache Flink system. This Section also summarize the core state management concepts of Flink, namely as types of state, checkpointing, savepoints, state backend, queryable state and state processor API. Section 3 summarizes the research methodology that includes proposed design framework. Finally, Sect. 4 describes the implementation and evaluation matrices of proposed process. The conclusion and future work are discussed in Sect. 5.

2 Preliminaries 2.1 Stateful Stream Processing State in stream processing applications stores information about previously processed data, which will be used in processing the new data. For example, an application, which calculates the average, needs to store the average so that it can be used in the future. Usually, stream processing applications are complex and need to store a large amount of state data during the processing [13]. Stream processing usually involves more than one operation on the input stream, which are performed by the logical units called operators [14]. These operators are divided into two major subclasses, called stateless and stateful operators as shown in Fig. 1. Black bars are representing as input records. Stateless operators process the input record and produce output based on that input solely. Stateless operators perform simple operations e.g. selection or filtering and thus, do not need any

634

R. Younas and A. Qasim

Fig. 1. Stateless and Stateful stream processing [4].

additional information. They take the input from a pipeline, process it, and output the result. While stateful operators need previous information to calculate new results. So, it updates the state with every input and gray bar shown as output that based on current input record and information of priorly processed records. For example, operations like aggregation need to keep the previous results to calculate results on new data. 2.2 Why Using Apache Flink? To explicate why Apache Flink is good choice for state management, the comparison of Storm, Samza, Spark, and Flink that are four currently trendy open-source big data processing frameworks are discussed. Table 1 encapsulates some of the main characteristics related to each of these stream processing frameworks. 2.3 Apache Flink System Apache Flink [4] is a real-time processing framework for high-performance and scalable real-time applications. Apache Flink is an open-source platform under Apache software foundation. Apache Flink is a streaming data flow engine. For distributed computations over data streams, Flink endows low latency, fault-tolerance, high throughput, and datadistribution for distributed computations over data streams. Apache Flink provides event time processing and state management [15]. Flink provides true streaming rather than mini batches to operate streaming data. Flink can perform impulsive DataStream programs in a data-parallel pipelined and data-parallel manner. Flink provides the capacities to execute batch, streams and bulk programs empowered by pipelined runtime system. The Flink processing engine is written in Scala and Java. Apache Flink is used to overcome and reduce the complexity that has been tackled by other distributed data driven engines. The major privileged of Flink is that it is not only provides real time streaming but also an engine for batch processing with high throughput. Apache Flink reads data streams from input source and perform different operation and transformation to process input streams of data [8]. Flink offers nonstop stream processing for event-driven applications. It is also providing batch and stream analytics [16]. The steams of the unbounded data are processed by the engine of Flink as true streams. The true streaming is indications that each record can processed instantly and independently from other records as soon as it enters in the system. Furthermore, the applications can be run 24/7 using Apache Flink framework. The upper-layered APIs and significant performance guarantees agree applications to

Streaming State Validation Technique

635

Table 1. Comparison of stream processing engines [28, 29]. Engine

Flink

Spark

Storm

Samza

Processing Mode

Hybrid Processing

Hybrid Processing

Stream Processing

Stream Processing

State Management

Stateful operators

State DStream

Stateful operators

Stateful operators

Latency

Stateful operators

Few seconds

Sub-second

Sub-second

Fault Tolerance

Checkpointing

Check-pointing

Check-pointing

Check-pointing

Checkpoint Storage

HDFS or RocksDB

HDFS

In-memory or Redis

Filesystem or Kafka

Processing Guarantee

Exactly-once

Exactly-once

Exactly-once with Trident

At least-once

processes and execute data at whirlwind speed therefore this is also known as “4G of Big Data”. Apache Flink has admirable performance as it provides in-Memory computing with low latency and high throughput. The benchmark in Table 1 has demonstrated that Flink can compete with other acknowledged distributed Big Data frameworks. It can also cope with hundreds and millions of records per second. Apache Flink provides the impressive performance to its users as it is a distributed framework that can execute applications on multiple nodes and can process trillions of events per day [17]. Usually, the applications with unbounded Stream handling demands some convention state to sustain intermediate outcomes of their computations [18]. In Apache Spark, the state must explicitly update after each batch processing whereas Flink provides the autonomous feature to update the state. Flink architecture includes significant features such as an asynchronous lightweight incremental checkpoint mechanism, save points and state management, which are discussed in detail in the coming sections of this chapter. These characteristics provides exactly one guarantee, state consistency, fault tolerance. 2.4 Core Concepts 2.4.1 State Management in Apache Flink In Flink, a State [19] in stateful operator can be manifested as memory in each operator that remembers information across multiple events. The operations having stateful operators are also known as stateful operations. For example, stateful operation is used when significant data needs to be managed as the state permits efficient access to the pervious occurred events. The state of operator is empty at the beginning of operation and can be occupied and modified by the execution of operator or function. Flink supports stateful operators to assure the fault tolerance [20].

636

R. Younas and A. Qasim

Fig. 2. Barriers in data streams [25].

In Flink, state management is achieved by using checkpointing and savepoints mechanisms. Flink provides a verity of state backends for storage of state. States can be redistributing across parallel instances in Apache Flink that allows rescaling of Flink applications [21]. State Processor API is the recent feature of Flink that makes possible to access the state outside. There are two main types of states in Apache Flink that are operator state and keyed state. Operator State One particular operator corresponds to an “operator state” to store its current state. All records managed by the same parallel operator have access to the same state. But this does not mean that all parallel operators have access to the same state [22]. A good example of operator state use is Kafka connector [23]. Each parallel consumer instance of Kafka maintains information about topic partitions and offsets. This is a special type of state which is also known as non-keyed state and used in scenarios when there is having no key to partition the state. Keyed State The keyed state is like a key-map value [24]. To use keyed state, a key should be specified on DataStream that used to partition the records and the state. Keyed state takes advantage of partitioned stream. A key selector is used to define the key. Keyed state is combination of either operator or function state and key. Keyed state also known as partitioned state. 2.4.2 Checkpointing Apache Flink implements fault tolerance by using the combination of checkpointing mechanism and stream replay. Checkpointing mechanism [25] is used to recover the state to ensure fault tolerance in Apache Flink. In streaming applications, a specific point is marked in each of the input stream for each operator. This procedure keeps on taking lightweight snapshots for the data flows to ensure the consistency (exactly once or at-least-once semantics) by replaying the stream from specific point. The snapshots are usually very light weighted for streaming applications having small states. Corresponding snapshots for running states and distributed data streams are taken frequently without any considerable effecting overall performance. The snapshots are stored at a configurable place like in distributed file system. Stream barriers are the core element of checkpointing in Apache Flink. Checkpoint barriers are interposed into the data stream and move with the records as part of the

Streaming State Validation Technique

637

Fig. 3. State backend [30].

data stream. Barriers moves firmly in streamline and never pass or overtake the records. Checkpoint barriers are used to separate the records of data stream and set records in the group that goes for the recent snapshot and for next snapshot. Each checkpoint barrier brings the ID of the snapshot whose records it thrust in front of it. Barriers do not intrude the flow of the stream and are hence very lightweight. [20] Fig. 2 showing the working of barriers in data streams. The barriers are injected into the data flow of stream at the stream source level. When the operator has received checkpoint barriers from all input streams for snapshot n, it transfers the barrier for snapshot n to next operator. When a sink operator receives n barriers from all input data streams, it emits that snapshot n to the checkpoint coordinator. After all sinks have acknowledged a snapshot, A snapshot is considered as completed when all sinks have acknowledged a snapshot. In Flink, Checkpoints used to recover from failure. Checkpoints triggered by job manager as shown in Fig. 3. Once the checkpoint is triggered, the notification received to all task managers, and they start writing checkpoints into state backend. The task managers save their states in selected state backends and inform the job manger about checkpoint states [26]. 2.4.3 Synchronous and Asynchronous State Snapshots The state also must be a part of snapshots of stateful operators. Stateful operators snapshot their state after receiving the checkpoint barriers from all their input streams. State snapshots store in job manager’s memory by default but it can be stored in some external state backend such as HDFS usually when the snapshots may be large [27]. Once the state snapshot has been stored, the stateful operator acknowledges the checkpoint, releases the checkpoint barrier into the output streams, and continues. The use of Lightweight Asynchronous Snapshots for Distributed Dataflows in Apache Flink inspired by Chandy-Lamport algorithm [28].

638

R. Younas and A. Qasim

Fig. 4. State validation process architecture diagram.

In the process of state writing, when operators hang the current execution of the records processing and create state snapshot is known as synchronous snapshot. This interruption of execution reduces the performance. Therefore, asynchronous state snapshots let the operators continue the processing of records while the state is being stored. With asynchronous state snapshotting the operators save state in the objects that cannot be modified once they are written. Operators keep processing records continuously and do not pause for state storing in this case. 2.4.4 Savepoints Checkpoints are automatically triggered by Flink runtime to provide a procedure to recover records while revising state in case of a failure. But Flink also have a method to purposely manage versions of state through a feature called savepoints [29]. Savepoints are like ordinary checkpoints, yet they are not triggered by Flink runtime certainly, rather by the user manually. The user can generate savepoints using REST APIs or command line tool. When the user triggered the manual checkpoints (savepoints), the current state saved by the Flink like regular checkpoints triggered automatically during runtime. Once the save point triggered, it generates a snapshot and store it to the state backend using regular checkpointing mechanism. Regular checkpoints are expired automatically but the savepoints are not expired or cleaned automatically when there is newer completed checkpoint [25]. The programs that use checkpointing can continue execution from a savepoint. Without losing any state, savepoints permit to update program and Flink cluster. 2.4.5 Apache Flink’s State Backends Flink gives distinctive state backends that determine how can store a state and from where a stored state can be accessed. There are two main modules of state backends: – local state, – external state.

Streaming State Validation Technique

639

Table 2. Setup environment. Language

Tools

Frameworks

– Java

– IntelliJ IDEA Community Edition 2017.3 x64 – Tableau 2020.3

– Apache Flink 1.11.1 – Apache Kafka 2.1–1.0.0 – Apache Maven 3.6.3

Fig. 5. Experimental design approach.

The state relays on checkpoints to protect against data loss and recover failure consistently when checkpointing is triggered in stateful operator. It depends on the selected State Backend that how the state is represented internally, and how and where it is preserved and retrieved [30]. Figure 3 is showing the state backend mechanism in Flink. 2.4.6 Queryable State Queryable state is the feature in Apache Flink that can expose the keyed or partitioned state to the external domain [31]. This feature allows to query the state of a job from outside the Apache Flink. For some circumstances, queryable state reduces the need for distributed operations/transactions with outer systems such as key-value stores which are often the blockage in practice. 2.4.7 State Processor API State Processor API [12] provides prevailing functionality to read, write, and modify savepoints and checkpoints of states. State Processor API grounded on batch DataSet API of Apache Flink. Relational Table API or SQL queries can also be used to analyze and process state data because DataSet and Table API are interoperable. State Processor API can be used to take a savepoint of a running stream processing application and evaluate it with a DataSet batch program to validate that the application behaves appropriately. It can also be used to bootstrap the state of a streaming application (Table 3).

640

R. Younas and A. Qasim Table 3. Experimental setup. JOB1

Flink Streaming Tweet Count

JOB2

Flink State Reader

JOB3

Flink batch Count

DATASET

https://www.kaggle.com/kazanova/sen timent140

Fig. 6. Operators of Flink streaming tweet count.

For example, read a batch of data from any supply, preprocess data, and write the results to a savepoint for bootstrapping the state. It is also probable to fix unreliable state entries. Conclusively, the State Processor API opens many ways to progress a stateful application that was formerly blocked. It was not possible to apply any change on parameters and design choice without losing all the state after application was started. For instance, users can indiscriminately modify the data types of states, merge, or spilt operator state, determine the parallelism of operators, reallocate operators UIDs.

3 Design Framework As Apache Flink providing the features to directly query and access the keyed state but still not guaranteeing the accuracy of state. State accuracy or consistency of state refers that the intermediate output that is stored in state and will access as calculated state is correct and can be used for other operations such as debugging. By accessing the state from outside and to use that calculated state for further operations it is necessary to validate the accuracy of accessible state. Validation is major requirement to enhance the stateful stream processing. State validation is required to accomplish the need of state debugging and bootstrapping. As if the consistency of state is guaranteed even after accessing state from outside of Flink environment, it can be used as input for different operations to enhance the processing. Figure 4 represents the design framework of proposed process. The architecture diagram of state validation process depicts the

Streaming State Validation Technique

641

main structure to process input data and to validate the state of processed data in Flink cluster. Kafka is used to process input data into streams of data. The same input data is processed by streaming job and batch job. The state of streaming is fetched and compare the calculated state with the results of batch job for evaluation purpose.

4 Implementation and Evaluation 4.1 Implementation Setup 4.2 Design of the Implementation The approach used for the validation of state for streaming Job in Apache Flink is illustrates in the Fig. 5. The method used to validate state includes text data and three jobs. The operations of these jobs are explained in experimental setup section. The text file considers as a “source file”. Job1 is the streaming job. As streaming job process input that comes in the form of streams. So, Kafka is used as a source for streaming job. Kafka is used to transform dataset in streams form for streaming job. Data reads from the text file and write to the Kafka. For writing the text file, Kafka reads the content of text file and publish the records to Kafka topic. Job1 consume messages in form of streams from Kafka and process streams. Job1 contains stateful operator. The savepoint is triggered at the end of processing and stores the state in durable storage. Job2 is Flink batch job that run to access the state file that is saved by trigging savepoint while running job1.In Job2, state processor API is used to access the stored state. Job2 read the files of savepoint from durable storage and process them in the form of key-value. Job3 reads the data from source file and process the same data as given to Job1 but in form of batch. The output of the job3 is also come in the form of key-value. After the running of Job3, the results of Job2 and Job3 are compared to evaluate the validation of state value.


Fig. 7. (a). Number of records being processed by Flink streaming tweet count job after start. (b). Flink streaming tweet count job flow after start.



Fig. 8. (a). Number of records being processed by Flink streaming tweet count job in middle. (b). Flink streaming tweet count job flow in middle

4.3 Experimental Setup

Job1 is the streaming job, named "Flink streaming tweet count" in Table 2, which processes the dataset and counts the occurrences of words in tweets. The job receives data from the Kafka broker, which creates a stream from the dataset and sends the records to the Flink streaming job one by one for processing. The job reads the data as a stream and calculates the key-valued state. A savepoint is triggered manually to store the state; RocksDB is used as the durable state backend for this job. The Flink State Reader job reads the state from the savepoint file of Job1 as input, using the State Processor API, and stores its output in a csv/txt file as key-value pairs (word counts). Job3 is a simple batch job named "Flink batch count". It processes the input data, calculates the word counts (as key-value pairs) used in the tweets, and saves this information in a file.

4.4 Results

Figure 6 gives an overview of the Flink streaming tweet count job. The job has two operators, a Source operator and a Flat Map operator; the number of records processed by each operator is also shown in Fig. 6, together with how many records are mapped after being retrieved from the source. The source held 240 MB of data, comprising about 17M records, and the Flat Map operator processed 239 MB of data. The workflow and results of this streaming job are shown in the following figures. Figure 7 shows the job just after it starts: Fig. 7(a) shows the number of records being processed by the job, and Fig. 7(b) shows how many records are processed within a specific time stamp (1 s). Figures 8(a) and 8(b) show the situation in the middle of the job: Fig. 8(a) shows the number of records being processed, and Fig. 8(b) shows that at this point the processing rate is 90458 records per second. Figure 9 shows the job flow at the end, when the job has processed all records of the dataset; the job processes all records, waits for some seconds as shown in the graph, and then terminates.


4.5 Evaluation

4.6 Evaluation Metrics

The following quality metrics are used in the experiment for the evaluation. Accuracy: correctness of the data for the key-value pairs in the stored state. Latency: the delay between the time when an event is created and the time when results based on this event become visible. Throughput: the number of records that can be processed in a specific time. For this experiment, latency and throughput are measured from the start and end times of the job and from the number of records processed by the job. Figure 9 shows the time at which the savepoint is triggered for the job and the time taken to acknowledge the checkpoints. Figure 10 shows the status of the triggered savepoint: the savepoint is triggered at 21:25:02 and acknowledged at 21:25:03. The time to complete the savepoint is 566 ms and the size of the savepoint data is 14.5 MB (Fig. 11).

Fig. 9. Overview of time duration for checkpointing in Flink streaming tweet count.

Fig. 10. Savepoint History.


Fig. 11. Accuracy measurements.

The values of the top 3000 tweet-word keys extracted by the state reader job are 100% accurate: for these top 3000 words, the data in the state reader job's file is exactly the same as the data in the Flink batch job's file. Figure 12 shows the comparison of the values of the calculated state data and the batch job data.

Fig. 12. Comparison of key-values.

4.7 Visualization of Results

The results are now presented visually for better understanding. Figures 13(a) and 13(b) show different parts of the data values of the file stored by the Flink batch job and of the file stored by the State Reader job. Figure 14 shows the data values together with the difference between them; since the method achieves 100% accuracy, the difference shown for all values is 0. Figure 15(a) is a visualization of the top word counts for the top 3000 tweets, and Fig. 15(b) shows the difference between both files. The proposed approach thus demonstrates 100% accuracy with low latency and high throughput. The evaluation of the output files shows that the approach used to validate the state establishes the accuracy of the state while having no impact on the latency, consistency, and throughput that are the main attractive features of Flink.


Fig. 13. (a). Output of state reader job. (b). Output of Flink batch job.

Fig. 14. Values of keys with difference in both files.


Fig. 15. (a). Visualization of top words. (b). Visualization of difference.


5 Conclusions and Future Work

5.1 Conclusion

This research focused on state management in Apache Flink. Apache Flink provides efficient ways to handle stateful operators. Checkpointing is a significant mechanism for providing the exactly-once guarantee in case of failures. Savepoints are a feature of Apache Flink that permits users to take a "point-in-time" snapshot of the entire job state of a streaming application. Queryable State is the feature that overcomes the bottleneck of accessing state directly during processing. The State Processor API is the game-changing feature used to read and write state in this implementation; to put it briefly, the State Processor API opens the black box that savepoints used to be. The approach presented here validates the state of a streaming job that processes data "in motion". It combines stream and batch jobs to achieve the validation: the Flink streaming job counts the words used in tweets and its state is stored by triggering a savepoint, a batch job processes the same dataset, and the key-value output of the batch job is compared with the data of the stored state. This validation method gives 100% accuracy and has no effect on the latency of Flink processing.

5.2 Future Work

Stream processing frameworks still lack well-developed state management features. Such a feature can be implemented for other stream processing frameworks such as Apache Samza, Apache Storm, and Apache Spark. This research can be used in future scenarios that access the application state of a streaming job for debugging and bootstrapping purposes. In scenarios where a program needs to reprocess already processed data, the state of one application can be accessed and used by another application. Currently, Flink only provides the State Processor API on top of the DataSet API; for true stream processing, following the Flink roadmap, the State Processor API can be migrated to other APIs such as the DataStream API, Table API, etc.


Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition

Miguel Won1(B), Bruno Martins1, and Filipa Raimundo2

1 INESC-ID, University of Lisbon, Lisbon, Portugal
[email protected]
2 ICS, University of Lisbon, Lisbon, Portugal

Abstract. The use of keyphrases is a common oratory technique in political discourse, and politicians often guide their statements by recurrently making use of keyphrases. We propose a statistical method for extracting keyphrases at document level, combining simple heuristic rules. We show that our approach can compete with state-of-the-art systems. The method is particularly useful for the study of policy preferences and issue competition, which relies primarily on the analysis of political statements contained in party manifestos and speeches. As a case study, we show an analysis of Portuguese parliamentary debates. We extract the most used keyphrases from each parliamentary group speech collection to detect political issue emphasis. We additionally show how keyphrase clouds can be used as visualization aids to summarize the main addressed political issues.

Keywords: Keyphrase extraction · Political discourse · Information retrieval

1 Introduction

In recent years, the study of policy preferences and issue competition has increased considerably [16,17,22,26,27,38,42,43]. To identify what issues political actors emphasize or de-emphasize and what their policy priorities are, scholars have engaged in collecting and coding large quantities of text, from speeches to manifestos [17,42,43]. The assumption is that political actors' statements are a more accurate account of where they stand on a particular issue than actual behavior. With such growing interest in issue competition and policy preferences, several methods have been developed to analyze large amounts of qualitative data systematically [16,17,42]. Traditionally, these works rely upon human-annotated text data, in particular words and sentences (or quasi-sentences) coded into issue categories. One example vastly used by scholars is the Comparative Manifesto Project (CMP)1, which makes available party manifestos annotated into a restricted code system that informs about the political issues addressed within the texts.

1 https://manifesto-project.wzb.eu.



One problem with this type of data is that party manifestos are limited in time and written only once per election. Studies of the internal dynamics of each parliamentary group during the legislative term, of party reactions to relevant political events, or of the constant action/reaction between the party agenda and the media's own agenda cannot be carried out using party manifestos alone. In [17] the authors overcome this problem using parliamentary activities, such as questions to the government and parliamentary debates. However, the authors had to implement a coding system and perform human coding. For larger parliaments, with a high volume of daily activity, this manual annotation can be highly costly. Politicians often guide their statements by recurrently making use of keyphrases. The wise use of keyphrases is a common oratory technique in political discourse that allows political actors to transmit the most important ideas associated with their political position. The identification of relevant keyphrases used in political texts can help to synthetically frame political positions and to identify the relevant topics addressed by a politician's or political group's agenda. With today's massive amounts of public electronic data, such as parliamentary debates, press releases, electoral manifestos or news media articles, several studies have focused on the development of statistical algorithms that extract relevant information from large-scale political text collections [18]. In the present work, we propose to use the keyphrase framework to identify the attention given by politicians to different issues. We offer a simple method that automatically extracts the most relevant keyphrases associated with a political text, such as a given speech, a public statement or an opinion article. We show that, using a combination of simple text statistical features, it is possible to achieve results that compete with alternative state-of-the-art methodologies. We evaluate the method's performance on annotated corpora typically used to assess the Natural Language Processing task of keyphrase extraction. We additionally show a case study where we apply the proposed methodology to a corpus composed of speeches given during the plenary sessions of the Portuguese Parliament, and propose a visualization scheme using a word cloud.

2 Related Work

Automatic extraction of keyphrases is usually divided into two main branches: supervised and unsupervised. Within the unsupervised approaches, two main frameworks are commonly used: graph-based methods and topic clustering. The first approach was originally proposed with the TextRank algorithm [33]. TextRank generates a graph from a text document, where each node is a word and each edge a co-occurrence. Centrality measures, such as PageRank [36], are used to score each word and build a ranking system of keyphrases. More recent graph methods are SingleRank [44], which weights the graph edges with the number of co-occurrences, and SGRank [12], which combines several statistical features to weight the edges between keyphrase candidates. Regarding topic clusters, methods such as KeyCluster [29] and CommunityCluster [19] cluster semantically similar keyphrase candidates, resulting in topic clusters. TopicRank [8] is also a relevant work, joining these two approaches and generating a graph over topic clusters. More recent works use the word embedding framework [34]. One example is [45], where the authors use a graph-based approach and weight the edges with semantic word embedding distances. A more recent work [3] proposes to use the semantic distance between the full document and the keyphrase representations, using Sent2Vec [37]. This latter method, named EmbedRank by its authors, represents the current state-of-the-art performance for extraction systems. Supervised methodologies use annotated data to train statistical classifiers using syntactic features, such as part-of-speech tags, tf-idf and first position [15,23,35]. Works such as [30,31] additionally incorporate external semantic information extracted from Wikipedia. All these methods extract keyphrases present in the texts. Nevertheless, datasets commonly used to evaluate keyphrase extraction systems contain human-annotated keyphrases that are not present in the documents. Recently, some works have dedicated special attention to generative models, where the extracted keyphrases are not necessarily present in the texts but are generated on the fly. Examples are [10,32], where the authors use a neural network deep learning framework to generate relevant keyphrases. For some annotated datasets, the performance of these systems improved the state-of-the-art results for extraction systems. We note that the present work proposes a method that extracts keyphrases present only in the input text. With respect to the analysis of political issues, recent works have used the CMP framework intensively to signal the political issues in the analyzed corpora. In [16] the authors use CMP to analyze several Western European party manifestos since 1950 to study how issue priorities evolved with time. In [17] the authors propose a model to frame the issue competition between parliamentary groups, using Danish parliament activity data; and in [42] questions from Belgian and Danish MPs are analyzed to study the issue emphasis dynamics between the MPs.

3 Keyphrases Extraction

In the same line as previous works on automatic keyphrase extraction, we follow a three-step process [21]: first, we identify a set of potential keyphrase candidates; second, we calculate a score for each candidate; and third, we select the top-ranked candidates.

3.1 Candidate Identification

The first step to select the best potential candidates for relevant keyphrases is to use morphosyntactic patterns. The use of part-of-speech (POS) filters is a common practice in this type of task, where traditionally only nouns and adjectives are filtered in [28,33,44]. Works such as [1,35] additionally impose morphosyntactic rules on candidates, e.g., the candidate must be a noun phrase or end with a noun. The idea behind these rules comes from the fact that we are searching for keyphrases that represent entities, which are very likely to be expressed by at least one noun. The further addition of adjectives allows the inclusion of a noun qualifier, enabling the identification of candidates such as "National Health System" (adjective+noun+noun). Following this example, we propose the use of a pattern of at least one noun possibly preceded by adjectives2. Additionally, to avoid candidate overlap, as well as to limit the generation of a high number of candidates, we propose the use of a chunking rule based on the morphosyntactic pattern3. In the present study, we also work with Portuguese texts. For this reason, when working with this corpus, we applied an equivalent pattern with the appropriate noun/adjective order for Portuguese, as well as the possibility to include a preposition4. With respect to preprocessing, we apply standard text cleaning procedures by removing candidates containing stopwords, punctuation or numerical digits. For datasets with longer documents, we required a minimum of two occurrences, considering the stemmed versions of each candidate. We used the NLTK Porter stemmer5 for English and the NLTK RSLP stemmer for Portuguese6.

2 We use the POS pattern: * +.
3 In our experiments we used the NLTK chunker: https://www.nltk.org/api/nltk.chunk.html.
4 We use the chunker POS pattern: (+ * *)? + *.
5 https://www.nltk.org/_modules/nltk/stem/porter.html.
6 https://www.nltk.org/_modules/nltk/stem/rslp.html.
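To make the candidate identification step concrete, the sketch below uses NLTK's RegexpParser, which the paper references for chunking. The tag pattern is an assumption (the literal patterns in footnotes 2 and 4 did not survive text extraction); it simply encodes "zero or more adjectives followed by at least one noun" over Penn Treebank tags, as described in the text.

import nltk

# Assumed grammar: optional adjectives (JJ) followed by one or more nouns (NN, NNS,
# NNP, NNPS). Requires the NLTK "punkt" and "averaged_perceptron_tagger" resources.
GRAMMAR = "KP: {<JJ>*<NN.*>+}"
CHUNKER = nltk.RegexpParser(GRAMMAR)

def candidate_phrases(text):
    # Return the noun-phrase-like keyphrase candidates found in the text.
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = CHUNKER.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "KP"):
            candidates.append(" ".join(word for word, _ in subtree.leaves()))
    return candidates

# candidate_phrases("The national health system needs sustained funding.") may yield
# ["national health system", "sustained funding"], depending on the tagger.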

3.2 Candidate Scoring

The next step in the pipeline is to estimate the score of each candidate. We show in this work that a scoring system based on a selection of simple heuristic rules can result in state-of-the-art performance. These features were selected because of their good performance in previous works. For each candidate we calculate the following features:

– Term Frequency (tf): Keyphrases are usually associated with frequent usage [25,46]. Contrary to the common practice of measuring the document frequency of each candidate, we propose to use instead the sum of the frequencies of the words that constitute the candidate. We have observed in our experiments that this feature results in better performance.
– Inverse Document Frequency (idf): If the keyphrase extraction task has access to a context corpus, it is possible to use additional information through the idf metric [40]. Since this feature uses a context corpus, we evaluated two systems, H1 and H2, where only the latter considers idf.
– Relative First Occurrence (rfo): Keyphrase extraction systems frequently use the position of the first occurrence as a statistical feature, since relevant keyphrases are often used at the beginning of the text [12,14,23,35]. In this work, we use the likelihood that a candidate appears earlier than a randomly sampled phrase of the same frequency. We calculate a cumulative probability of the form (1 − a)^k, where a ∈ [0, 1[ measures the position of the first occurrence and k is the candidate frequency [11].
– Length (len): The candidate size, i.e., the number of words that compose it, can also hint at the candidate's likelihood of being a keyphrase. Human readers tend to identify keyphrases with sizes beyond unigrams, especially bigrams [11,12]. However, a linear score based on the candidate length would result in overweighting lengthy candidates, such as 3- and 4-grams. Therefore, and based on our trials, we propose a simple rule that scores 1 for unigrams and 2 for the remaining sizes.

The final score of each candidate is the product of these four features. From the results shown in the next section, we will see that with these four simple heuristic rules it is possible to obtain results that compete with the state-of-the-art. We show the formal description of each feature in Eqs. (1)–(4), and in Eqs. (5)–(6) the two final scoring systems for a candidate w = w_1...w_n and a set of documents {d ∈ D}, with size N = |D|. Model H_1 does not take idf into consideration and is therefore calculated at document level only. System H_2 includes the idf weight, which in our experiments is measured over the respective corpus.

tf(w_1 \ldots w_n) = \sum_{i=1}^{n} fr(w_i, d)  (1)

tf\text{-}idf(w_1 \ldots w_n) = \sum_{i=1}^{n} fr(w_i, d) \times \log\frac{N}{1 + |d \in D : w_i \in d|}  (2)

rfo(w_1 \ldots w_n) = (1 - a)^{fr(w_1 \ldots w_n, d)}  (3)

len(w_1 \ldots w_n) = \begin{cases} 1 & : n = 1 \\ 2 & : n \geq 2 \end{cases}  (4)

H_1 = tf \times rfo \times len  (5)

H_2 = tf\text{-}idf \times rfo \times len  (6)
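A minimal Python sketch of this scoring scheme is given below, assuming the document has already been tokenized and the candidates come from the identification step; the helper names and the handling of missing idf values are illustrative assumptions, not part of the paper.

from collections import Counter

def score_candidate(candidate, doc_tokens, idf=None):
    # Score one candidate phrase (a whitespace-separated string) inside one document.
    # When an idf dictionary (word -> weight) is supplied, the H2 score of Eq. (6) is
    # returned; otherwise the document-level H1 score of Eq. (5).
    doc_counts = Counter(doc_tokens)
    words = candidate.split()
    n = len(words)

    # Eq. (1)/(2): sum of the (optionally idf-weighted) frequencies of the words
    if idf is None:
        tf = sum(doc_counts[w] for w in words)
    else:
        tf = sum(doc_counts[w] * idf.get(w, 0.0) for w in words)

    # Eq. (3): (1 - a)^k, with a the relative position of the first occurrence
    # and k the number of occurrences of the candidate in the document
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == words]
    if not positions:
        return 0.0
    a = positions[0] / len(doc_tokens)
    rfo = (1.0 - a) ** len(positions)

    length = 1 if n == 1 else 2          # Eq. (4)
    return tf * rfo * length             # Eq. (5) or Eq. (6)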

3.3 Top n-rank Candidates

After the scoring step, we select the best-ranking candidates. It is common to evaluate keyphrase extraction systems in three top-ranking scenarios: the first 5, 10 and 15 candidates. However, the best ranking can be influenced by the document size7: smaller documents such as abstracts are likely to contain fewer keyphrases than full-length articles, and Table 1 shows that datasets with larger documents are associated with a higher number of keyphrases per document. Therefore, we propose to extract the n top-ranked candidates dynamically by considering the respective document size. We use the identity shown in Eq. (7) to calculate the n top candidates for each document. We propose a logarithmic growth to prevent significant and fast discrepancies between a small text, such as an abstract, and a full paper. We found the 2.5 parameter by experiment, and therefore one should note that it can be a result of overfitting. However, as we will see below, this dynamic ranking system returns consistently good results for all datasets, which is an indicator that it is document size independent.

n_{keys} = 2.5 \times \log_{10}(\text{doc size})  (7)

7 To measure the document size we use the total number of tokens.
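Equation (7) translates directly into a couple of lines of Python; the rounding and the lower bound below are assumptions, since the paper does not state how the value is discretized.

import math

def n_top_candidates(doc_size):
    # Number of keyphrases to keep for a document of doc_size tokens (Eq. 7).
    return max(1, round(2.5 * math.log10(doc_size)))

# Example: a 140-token abstract gives 5 keyphrases; a 2000-token paper about 8.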

4 Experiments

4.1 Evaluation Metric

We follow the standard procedure to evaluate keyphrase extraction systems: we measure the macro-averaged F-score (averaging the F-score over the documents), using the harmonic mean of Precision and Recall. For Precision we use P = #correct keyphrases / #extracted keyphrases, and for Recall R = #correct keyphrases / #gold keyphrases. Both Precision and Recall are calculated using the exact match of the stemmed versions of the extracted keyphrases.
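A sketch of this evaluation protocol, assuming stemmed exact matching with the NLTK Porter stemmer (the function names and input format are illustrative):

from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()

def stem_phrase(phrase):
    return " ".join(STEMMER.stem(word) for word in phrase.lower().split())

def precision_recall_f(extracted, gold):
    # Per-document scores based on exact matches of stemmed keyphrases.
    extracted = {stem_phrase(p) for p in extracted}
    gold = {stem_phrase(p) for p in gold}
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_f_score(document_pairs):
    # Average the per-document F-scores over (extracted, gold) pairs.
    scores = [precision_recall_f(e, g)[2] for e, g in document_pairs]
    return sum(scores) / len(scores) if scores else 0.0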

4.2 Datasets

We use five human-annotated datasets, four with English texts and a fifth with Portuguese (European) texts. The English datasets are popular sets used to evaluate keyphrase extraction systems. The dataset of Portuguese texts enables us to assess performance with a language rarely tested in keyphrase extraction, as well as to validate the case study shown in the present work. A brief description of each dataset follows:

– Inspec [23] is a dataset with 2000 scientific journal abstracts. Equivalently to previous works [33,44], we evaluate the performance on the test dataset, which contains a total of 500 documents. The dataset contains three files per abstract, one containing the text and the remaining two the controlled and uncontrolled keyphrases, separately. We have used the uncontrolled set of keyphrases for evaluation.
– DUC2001 [44] is a corpus of 308 newspaper articles. We have used the set of human-annotated keyphrases for each document available with the dataset. In our experiments, this dataset is the only English corpus made of non-academic texts.
– Semeval 2010 [24] is a common dataset used in automatic keyphrase extraction research. It contains scientific articles with author- and reader-annotated keyphrases. We have used the test dataset only, made of 100 papers, and the combined set of keyphrases.


– Nguyen [35] contains 211 scientific conference papers with author- and reader-annotated keyphrases. We considered all articles and the combined set of keyphrases.
– Geringonça is a new dataset of news articles written in Portuguese and extracted from the political news web portal http://geringonca.com/. The dataset contains a total of 800 pieces, and for each article a set of keyphrases was assigned by the respective authors.

Table 1. Datasets descriptive statistics. From the second column on, we show: the total number of documents used for evaluation; the average number of tokens per document; the average number of keyphrases per document; the maximum possible recall for each dataset.

Dataset        No. docs   Avg tok   Keys/Doc   Max recall
Inspec         500        139.49    9.81       78.20
DUC2001        308        896.78    8.06       94.95
Semeval 2010   100        2147.26   14.43      77.36
Nguyen         209        8777.70   10.86      84.58
Geringonça     894        284.10    5.13       84.70

We show in Table 1 some descriptive statistics of the datasets. The SemEval and Nguyen datasets contain large documents with a high number of gold keyphrases. In Fig. 1 we show the distribution of keyphrase sizes for each dataset. For the English corpora, we confirm that keyphrase sizes go beyond unigrams, especially towards bigrams [11]. We do not see the same type of distribution for the Portuguese dataset. However, we note that the annotation process of this dataset was neither supervised nor intended to serve as a gold standard for keyphrase extraction systems: it was created during the writing process of the news pieces and likely annotated as a contribution to the framework of keywords/tags used by the web portal. A known problem when working with the datasets summarized in Table 1 is the non-presence of some gold keyphrases in the respective documents. For this reason, we show in Table 1 the maximum recall that any system can achieve. Equivalently to other works [20], we include these missing keyphrases in our gold sets. With respect to POS preprocessing, all English datasets, except SemEval, were processed with the Stanford POS tagger [41]. For SemEval we used the preprocessed dataset available at https://github.com/boudinfl/semeval-2010-pre [7]. For the Portuguese Geringonça dataset, all documents were processed with the POS tagger for Portuguese, LX-Tagger [9]. All other text processing tasks, such as word and sentence tokenization, were performed using NLTK8.

8 https://www.nltk.org/.


Fig. 1. Distributions of keyphrase sizes (number of words that compose the keyphrase) for each dataset. The dark (continuous) line histograms show the distribution obtained when keyphrases were annotated by the authors, and the grey (dashed) line histograms when the keyphrases were annotated by readers.

4.3 Results

We show in Table 2 the resulting F-scores for the English corpora. We name our method KCRank, where H1 and H2 refer to the use of the respective context dataset corpus (Eq. (6)). Detailed results, in particular Precision and Recall, are shown in Table 4 in the appendix. For comparison we also show the F-score when considering tf-idf as a baseline, as well as four alternative state-of-the-art methods: KPMiner [13], TopicRank [8], MultipartiteRank [6] and EmbedRank [3]. We used the pke tool [5] to extract keyphrases with the KPMiner, TopicRank and MultipartiteRank methods. For EmbedRank (s2v version) we used our own implementation code9. Table 2 shows that, except for the NUS dataset, the KCRank H1 and KCRank H2 models return consistently better F-score results. We also show that there is no significant difference between the KCRank H1 and KCRank H2 performance. This result indicates that the use of a context corpus does not significantly increase the overall performance. With respect to the alternative methods, only EmbedRank returns a competitive F-score. For the NUS dataset, KPMiner shows the best results. We note that KPMiner relies on principles similar to those we used to build the KCRank method, namely tf-idf. Table 2 shows that the KPMiner results are very similar to KCRank for the SemEval and NUS datasets, but not for Inspec and DUC. This is an indication that the KPMiner method may be suited to long scientific papers only. In Table 3 we show the F-scores for the Geringonça dataset; detailed results are shown in Table 5 in the appendix. For this experiment, the KPMiner model gets the best results, surprisingly followed by the tf-idf baseline. One possible reason for these results is the different distribution of keyphrase sizes that we observe for the Geringonça dataset when compared to the English datasets. From Fig. 1 we see that most keyphrases have size one, which is not the case with the English datasets. This statistical difference results in a negative contribution by the keyphrase length feature used by KCRank. To test the impact of this feature, we conducted an experiment where we turned it off. We show the results in Table 3, where we identify this modified version of KCRank as KCRank_length. The higher performance when this feature is turned off confirms that the keyphrase size feature contributes negatively. As previously pointed out, the Geringonça dataset was created as part of a web portal's tag system. We claim that this skewed the annotation process and resulted in the predominance of unigrams.

9 For the English datasets, the keyphrase extraction was performed with pre-trained embeddings for unigrams available at https://github.com/epfml/sent2vec. For Portuguese, we generated fastText [4] embeddings using a compilation of all sentences present in the Geringonça dataset and the parliamentary speeches used in the case study shown in Sect. 5.
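The TopicRank, MultipartiteRank and KP-Miner baselines reported here were produced with the pke toolkit; the sketch below shows the usual pke extraction pattern for one of them (the input file name and the choice of n are assumptions).

import pke

# TopicRank baseline over a single document; the KP-Miner and MultipartiteRank
# baselines follow the same load -> select -> weight -> get_n_best pattern in pke.
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input="document.txt", language="en")
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=10)   # list of (phrase, score) pairs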


Table 2. F-scores of KCRank and comparison with state-of-the-art systems, when using the English language datasets. The results for Semeval and NUS datasets were obtained with an additional filter where candidates with less than two occurrences were excluded.

N        Method            Inspec  DUC    Semeval (min = 2)  NUS (min = 2)
5        tfidf             29.04   17.28  14.39              15.74
         KPMiner           16.50   11.69  18.06              21.13
         TopicRank         23.30   17.59  11.69              12.97
         MultipartiteRank  23.73   18.54  12.68              13.50
         EmbedRank         29.38   25.77  12.17              12.10
         KCRank:H1         33.70   27.39  16.22              12.41
         KCRank:H2         34.56   27.10  16.57              13.30
10       tfidf             36.53   20.75  18.40              19.93
         KPMiner           19.65   14.49  21.53              24.92
         TopicRank         26.33   19.43  14.07              15.10
         MultipartiteRank  27.57   21.64  14.54              16.93
         EmbedRank         37.10   29.56  17.39              15.44
         KCRank:H1         39.22   31.00  20.60              16.16
         KCRank:H2         39.62   30.57  22.05              17.11
15       tfidf             37.81   20.81  20.07              20.17
         KPMiner           20.55   14.75  22.44              24.44
         TopicRank         26.93   19.36  14.43              14.20
         MultipartiteRank  28.63   22.09  15.70              16.31
         EmbedRank         38.55   29.30  20.27              17.38
         KCRank:H1         38.53   30.55  22.32              17.13
         KCRank:H2         39.14   30.77  22.66              17.63
Dynamic  tfidf             37.29   20.92  20.03              18.88
         KPMiner           20.18   14.80  21.39              22.51
         TopicRank         26.69   19.16  13.76              12.97
         MultipartiteRank  28.46   22.08  16.20              15.77
         EmbedRank         37.83   29.45  20.14              16.24
         KCRank:H1         39.40   30.40  21.80              16.71
         KCRank:H2         39.62   30.58  22.28              17.84


Table 3. F-scores of KCRank and comparison with state-of-the-art systems, when using the Geringonça dataset. The F-score results of an experiment using the KCRank system with the keyphrase size feature turned off (KCRank_length) are also shown.

N        Method             Geringonça
5        tfidf              26.29
         KPMiner            25.40
         TopicRank          13.00
         MultipartiteRank   13.58
         EmbedRank          18.57
         KCRank:H1          10.27
         KCRank:H2          14.49
         KCRank_length:H1   13.21
         KCRank_length:H2   21.01
10       tfidf              21.58
         KPMiner            25.08
         TopicRank          14.72
         MultipartiteRank   15.34
         EmbedRank          18.87
         KCRank:H1          14.44
         KCRank:H2          18.50
         KCRank_length:H1   18.37
         KCRank_length:H2   22.79
15       tfidf              18.65
         KPMiner            22.51
         TopicRank          15.05
         MultipartiteRank   15.43
         EmbedRank          17.89
         KCRank:H1          17.68
         KCRank:H2          19.72
         KCRank_length:H1   19.98
         KCRank_length:H2   21.06
Dynamic  tfidf              20.23
         KPMiner            24.00
         TopicRank          14.87
         MultipartiteRank   15.35
         EmbedRank          18.53
         KCRank:H1          16.19
         KCRank:H2          19.21
         KCRank_length:H1   19.09
         KCRank_length:H2   22.49

5 Key-Phrase Extraction Using Portuguese Parliamentary Debates

Like other national parliaments, the Portuguese Parliament produces faithful transcripts of the speeches given in the plenary sessions and makes them publicly available in electronic format. For our study, we collected transcriptions from the Portuguese Parliament website10, referring to the last complete legislative term (i.e., from June 2011 to October 2015), together with information on each member of parliament (MP). During this period, the chamber was composed of MPs from six different parties: The Greens (PEV), the Portuguese Communist Party (PCP), the Left Block (BE), the Socialist Party (PS), the Social Democratic Party (PSD) and the Social Democratic Centre (CDS-PP). In total, we collected 16,993 speeches. Using the keyphrase extraction method described in this work, we can identify the most central and recurrent political issues addressed during the plenary debates. The keyphrase identification of each speech can reveal the political priorities of each parliamentary group and hint at their expressed agenda. Therefore, we extracted the keyphrases from each speech, using model KCRank H2 with the respective parliamentary group collection as context corpus.

5.1 Candidates Selection

We observed that the extraction of clean and intelligible keyphrases from the speech dataset is a difficult task in this particular corpus, due to the many repetitive words and expressions used by the MPs. Typical examples are expressions such as "Mr. President", "party" and "draft bill", which due to their frequent use in this type of text are scored as relevant keyphrases. Consequently, for the candidate selection step, we processed all speech transcriptions with the pipeline used for the Geringonça dataset, with two additional filters: a minimum-of-five-occurrences criterion and an extension of the stopword list with these common words and expressions11.

5.2 Visualisation

As a visualization scheme, we propose the use of word clouds, following general guidelines introduced in previous studies [2,39] regarding the choice of visual variables and spatial layout (e.g., we preferred a circular layout that tends to place the most relevant key-phrases in the center). Each keyphrase cloud summarizes the number of occurrences of each keyphrase in the dataset analyzed, i.e., the number of speeches in which the respective candidate was selected as a keyphrase. We use this metric to encode the keyphrase font size and color, where the darker color represents the keyphrase with the highest number of occurrences. With such a visualization aid, the reader can obtain, in a dense image, a high number of relevant keyphrases from the text collection, and therefore better capture the main issues addressed.

Figure 2 shows a keyphrase cloud for the top 40 key-phrases extracted from the collection of PSD speeches12. The two most relevant issues addressed by this text collection are "european union" and "memorandum of understanding". The use of these terms is possibly related to the fact that PSD was a government support party (in coalition with CDS-PP) during the considered legislature and was responsible for implementing the Troika austerity measures. In Fig. 7 we show the equivalent keyphrase cloud for the CDS-PP speeches, which shows a very similar pattern. Figure 2 also shows a mix of topics related to economic affairs, such as "work", "economic growth" or "job creation", and to the welfare state, such as "social security" and "national health service". Figure 3 shows the equivalent keyphrase cloud for the collection of speeches from PS, the main opposition party during the same legislature. The cloud shows a different scenario, with PS speeches emphasizing issues related to the public sector, namely with the keyphrases "national health service", "social security" and "public services". It also shows how PS emphasized "constitutional court" (during the legislative term several austerity measures were requested to be audited by the Portuguese Constitutional Court) and "tax increase". Both the PSD and PS clouds show relevant false-positive keyphrases such as "theme", "situation" and "day". This unwanted effect is stronger in the PSD cloud. One possible explanation for the stronger effect of false positives in PSD could be the distribution of PSD speeches over a wider range of political issues: being the main government support party, it needs to be more responsive to the agendas of the different opposition parties [17]. Such an effect means that a high number of speeches are used not for agenda-setting but as responses to the opposition parties' questions during the plenary debates. We show in the appendix, in Figs. 4, 5, 6 and 7, the equivalent results when considering the text collections of the remaining parties. The results show that keyphrases such as "european union", "national health service" and "social security" are present in all word clouds, indicating their relevance as top issues. Finally, we note that the use of keyphrases beyond unigrams allows us to identify the relevant political issues with more precision. Had the analysis been based on unigrams only, the identification of lengthy and more specific keyphrases as relevant issues would be lost. The use of n-grams is an essential advantage of the present method, since it allows the political scientist to keep track of the many political-issue-related entities of n-gram size.

10 http://debates.parlamento.pt.
11 We manually annotated a list of approximately 100 terms.
12 All keyphrases were translated from Portuguese.
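Keyphrase clouds of this kind can be generated, for instance, with the Python wordcloud package once the per-speech occurrences have been counted; the sketch below is only illustrative (the paper does not state which tool was used), and it assumes the input is a list of keyphrase lists, one per speech.

from collections import Counter
from wordcloud import WordCloud

def keyphrase_cloud(keyphrases_per_speech, out_path="cloud.png", top_n=40):
    # Font size reflects in how many speeches each keyphrase was selected.
    occurrences = Counter()
    for speech_keyphrases in keyphrases_per_speech:
        occurrences.update(set(speech_keyphrases))   # count each speech at most once
    top = dict(occurrences.most_common(top_n))
    cloud = WordCloud(width=800, height=600, background_color="white",
                      prefer_horizontal=1.0)
    cloud.generate_from_frequencies(top)
    cloud.to_file(out_path)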


Fig. 2. The 40 most relevant keyphrases extracted from PSD speeches collection.

Fig. 3. The 40 most relevant keyphrases extracted from PS speeches collection.

6 Conclusion

We present a method that automatically extracts keyphrases from text documents. We combine simple statistical features to generate a ranked list of keyphrase candidates. From this list, we propose a dynamic selection of the top candidates, based on the document length. We test our methodology on different datasets commonly used to evaluate keyphrase extraction systems and show that the proposed method competes with state-of-the-art alternatives. Due to its simplicity, the proposed method can be implemented in any lightweight software or web application to extract keyphrases from text documents, at document level and on the fly. We show a small case study using plenary speeches given in the Portuguese parliament. With the proposed methodology we extract the most relevant keyphrases from each parliamentary group's set of speeches and construct keyphrase clouds. We show how such clouds can efficiently summarize issue agenda-setting by the respective parties. Furthermore, we show that using keyphrases that go beyond simple unigrams allows a more precise identification of entities of interest.

Acknowledgement. This work was supported by Fundação para a Ciência e Tecnologia (FCT), SFRH/BPD/104176/2014 to M. W., SFRH/BPD/86702/2012 to F.R., and FCT funding POCI/01/0145/FEDER/031460 and UID/CEC/50021/2019. We also want to thank the http://geringonca.com/ team for having made their data available.

A Appendix

Table 4. Precision, Recall and F-scores for the English datasets of KCRank and comparison with state-of-the-art systems.


Table 5. Precision, Recall and F-scores for the Geringon¸ca datasets of KCRank and comparison with state-of-the-art systems.

N        Method            P      R      F
5        tfidf             26.24  26.34  26.29
         KPMiner           25.48  25.33  25.40
         TopicRank         13.09  12.90  13.00
         MultipartiteRank  13.64  13.51  13.58
         EmbedRank         18.37  18.77  18.57
         KCRank:H1         10.00  10.56  10.27
         KCRank:H2         14.30  14.70  14.49
10       tfidf             16.20  32.32  21.58
         KPMiner           18.92  37.19  25.08
         TopicRank         11.35  20.94  14.72
         MultipartiteRank  11.75  22.11  15.34
         EmbedRank         14.07  28.62  18.87
         KCRank:H1         10.68  22.25  14.44
         KCRank:H2         13.75  28.26  18.50
15       tfidf             12.45  37.16  18.65
         KPMiner           15.10  44.26  22.51
         TopicRank         10.81  24.73  15.05
         MultipartiteRank  10.94  26.16  15.43
         EmbedRank         11.89  36.06  17.89
         KCRank:H1         11.71  36.00  17.68
         KCRank:H2         13.09  39.92  19.72
Dynamic  tfidf             14.34  34.35  20.23
         KPMiner           17.10  40.25  24.00
         TopicRank         11.02  22.86  14.87
         MultipartiteRank  11.28  24.02  15.35
         EmbedRank         13.06  31.88  18.53
         KCRank:H1         11.35  28.26  16.19
         KCRank:H2         13.51  33.22  19.21


Fig. 4. The 40 most relevant key-phrases extracted from PEV speeches collection.

Fig. 5. The 40 most relevant keyphrases extracted from BE speeches collection.


Fig. 6. The 40 most relevant keyphrases extracted from PCP speeches collection.

Fig. 7. The 40 most relevant keyphrases extracted from CDS-PP speeches collection.


References 1. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Hamilton, H.J. (ed.) AI 2000. LNCS (LNAI), vol. 1822, pp. 40–52. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45486-1 4 2. Bateman, S., Gutwin, C., Nacenta, M.: Seeing things in the clouds: the effect of visual features on tag cloud selections. In: Proceedings of the ACM Conference on Hypertext and Hypermedia (2008) 3. Bennani-Smires, K., Musat, C., Jaggi, M., Hossmann, A., Baeriswyl, M.: EmbedRank: unsupervised keyphrase extraction using sentence embeddings. arXiv preprint arXiv:1801.04470 (2018) 4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017) 5. Boudin, F.: pke: an open source python-based keyphrase extraction toolkit. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 69–73. The COLING 2016 Organizing Committee, Osaka, Japan, December 2016. http://aclweb.org/anthology/ C16-2015 6. Boudin, F.: Unsupervised keyphrase extraction with multipartite graphs. arXiv preprint arXiv:1803.08721 (2018) 7. Boudin, F., Mougard, H., Cram, D.: How document pre-processing affects keyphrase extraction performance. arXiv preprint arXiv:1610.07809 (2016) 8. Bougouin, A., Boudin, F., Daille, B.: TopicRank: graph-based topic ranking for keyphrase extraction. In: International Joint Conference on Natural Language Processing (IJCNLP), pp. 543–551 (2013) 9. Branco, A., Silva, J.R.: Evaluating solutions for the rapid development of state-ofthe-art PoS taggers for Portuguese. In: LREC (2004) 10. Chen, J., Zhang, X., Wu, Y., Yan, Z., Li, Z.: Keyphrase generation with correlation constraints. arXiv preprint arXiv:1808.07185 (2018) 11. Chuang, J., Manning, C.D., Heer, J.: “Without the clutter of unimportant words”: descriptive keyphrases for text visualization. ACM Trans. Comput.-Hum. Interact. 19 (2012) 12. Danesh, S., Sumner, T., Martin, J.H.: SGRank: combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In: Lexical and Computational Semantics (* SEM 2015), p. 117 (2015) 13. El-Beltagy, S.R., Rafea, A.: Kp-miner: a keyphrase extraction system for English and Arabic documents. Inf. Syst. 34(1), 132–144 (2009) 14. Florescu, C., Caragea, C.: PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1105–1115 (2017) 15. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: 16th International Joint Conference on Artificial Intelligence (IJCAI 1999), vol. 2, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999) 16. Green-Pedersen, C.: The growing importance of issue competition: the changing nature of party competition in western Europe. Polit. stud. 55(3), 607–628 (2007) 17. Green-Pedersen, C., Mortensen, P.B.: Who sets the agenda and who responds to it in the Danish parliament? A new model of issue competition and agenda-setting. Eur. J. Polit. Res. 49(2), 257–281 (2010)


18. Grimmer, J., Stewart, B.M.: Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit. Anal. 267–297 (2013) 19. Grineva, M., Grinev, M., Lizorkin, D.: Extracting key terms from noisy and multitheme documents. In: Proceedings of the 18th International Conference on World Wide Web, pp. 661–670. ACM (2009) 20. Hasan, K.S., Ng, V.: Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 365–373. Association for Computational Linguistics (2010) 21. Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the state of the art. In: ACL, vol. 1, pp. 1262–1273 (2014) 22. Hobolt, S.B., De Vries, C.E.: Issue entrepreneurship and multiparty competition. Comp. Pol. Stud. 48(9), 1159–1185 (2015) 23. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223. Association for Computational Linguistics (2003) 24. Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T.: SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 21–26. Association for Computational Linguistics (2010) 25. Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T.: Automatic keyphrase extraction from scientific articles. Lang. Resour. Eval. 47(3), 723–742 (2013) 26. Klingemann, H.D.: Mapping Policy Preferences. Oxford University Press, Oxford (2001) 27. Klingemann, H.D.: Mapping policy preferences II: estimates for parties, electors, and governments in Eastern Europe, European Union, and OECD 1990–2003, vol. 2. Oxford University Press on Demand (2006) 28. Liu, F., Pennell, D., Liu, F., Liu, Y.: Unsupervised approaches for automatic keyword extraction using meeting transcripts. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 620–628. Association for Computational Linguistics (2009) 29. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 257–266. Association for Computational Linguistics (2009) 30. Lopez, P., Romary, L.: Humb: automatic key term extraction from scientific articles in grobid. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 248–251. Association for Computational Linguistics (2010) 31. Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 3. pp. 1318–1327. Association for Computational Linguistics (2009) 32. Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., Chi, Y.: Deep keyphrase generation. arXiv preprint arXiv:1704.06879 (2017) 33. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of Conference on Empirical Methods on Natural Language Processing (2004) 34. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)


35. Nguyen, T.D., Kan, M.-Y.: Keyphrase extraction in scientific publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007). https://doi.org/10.1007/9783-540-77094-7 41 36. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999) 37. Pagliardini, M., Gupta, P., Jaggi, M.: Unsupervised learning of sentence embeddings using compositional n-gram features. In: NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics (2018) 38. Petrocik, J.R.: Issue ownership in presidential elections, with a 1980 case study. Am. J. Polit. Sci. 825–850 (1996) 39. Rivadeneira, A.W., Gruen, D.M., Muller, M.J., Millen, D.R.: Getting our head in the clouds: Toward evaluation studies of tagclouds. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2007) 40. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988) 41. Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the 2000 Joint SIGDAT conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, vol. 13, pp. 63–70. Association for Computational Linguistics (2000) 42. Vliegenthart, R., Walgrave, S.: Content matters: The dynamics of parliamentary questioning in Belgium and Denmark. Comp. Pol. Stud. 44(8), 1031–1059 (2011) 43. Wagner, M., Meyer, T.M.: Which issues do parties emphasise? Salience strategies and party organisation in multiparty systems. West Eur. Polit. 37(5), 1019–1045 (2014) 44. Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the AAAI Conference on Artificial Intelligence (2008) 45. Wang, R., Liu, W., McDonald, C.: Corpus-independent generic keyphrase extraction using word embedding vectors. In: Software Engineering Research Conference, vol. 39 (2014) 46. Yih, W.t., Goodman, J., Carvalho, V.R.: Finding advertising keywords on web pages. In: Proceedings of the 15th International Conference on World Wide Web, pp. 213–222. ACM (2006)

Author Index

Abnar, Samira I-44
Ahmad, Zishan I-129
Aipe, Alan II-166
Akasaki, Satoshi II-121
Akhtar, Md Shad II-236
Álvez, Javier I-34
Amjad, Maaz II-438
Arslan Esme, Asli II-334
Artemova, Ekaterina I-155
Assylbekov, Zhenisbek I-86, I-351
Avgustinova, Tania I-110
Ayats, Hugo I-480
Aydın, Cem Rıfkı II-307
Azcurra, René II-379
Babushkin, Artemii I-468
Bagheri, Ayoub II-212
Balahur, Alexandra II-260
Banik, Debajyoty I-265
Baron, Alistair I-451
Barros, Cristina II-507
Basak, Kingshuk II-462
Basta, Christine I-342
Battistelli, Delphine I-480
Baumartz, Daniel II-491
Béchet, Nicolas I-480
Beinborn, Lisa I-44
Bellot, Patrice II-248
Berk, Gözde I-622
Bhattacharyya, Pushpak I-129, I-265, I-533, I-555, II-236, II-462, II-602
Bigeard, Sam I-143
Borin, Lars I-367
Boros, Emanuela I-17
Bravo, Cristian I-432
Bui, Tien D. II-26
Cambria, Erik I-293, II-151, II-280
Cardon, Rémi I-583
Careaga, Alexis Carriola II-546
Casacuberta, Francisco I-545
Chen, Yimin II-558
Chen, Yixin II-558
Chevelu, Jonathan I-480
Chifu, Adrian II-379
Choenni, Rochelle I-44
Cook, Paul I-304
Costa-jussà, Marta R. I-342
Cristea (Ciobanu), Alina II-96
d’Andecy, Vincent Poulain I-17
Das, Ayan I-595
Davari, MohammadReza II-26
Dearden, Edward I-451
Deep, Kumar Shikhar II-236
Deveaud, Romain II-351
Dinu, Liviu P. II-96, II-520
Domingo, Miguel I-545
Doucet, Antoine I-17
Du, Jiachen II-587
Ebrahimpour-Komleh, Hossein II-212
Ekbal, Asif I-129, I-265, I-533, I-555, II-49, II-166, II-236, II-462
Elia, Annibale II-182
Erden, Berna I-622
Erkan, Ali II-307
Ermakova, Liana II-392
Espinasse, Bernard II-379
Farruque, Nawshad II-293
Firsov, Anton II-392
Fomina, Marina II-109
Fournier, Benoît I-480
Fournier, Sébastien II-248, II-379
Fukumoto, Fumiyo II-451
Gao, Qinghong II-587
García-Martínez, Mercedes I-545
Gelbukh, Alexander II-438
Giannakopoulos, George II-424
Gifu, Daniela I-73
Glushkova, Taisiya I-155
Gobeill, Julien I-233
Goebel, Randy II-293

Goel, Arushi II-135
Gomez-Krämer, Petra I-17
Gómez-Rodríguez, Carlos I-648
Gonzalez-Dios, Itziar I-34
González-Gallardo, Carlos-Emiliano II-351
Grabar, Natalia I-143, I-583
Gu, Yanhui II-558
Guibon, Gaël II-379, II-392
Güngör, Tunga I-622, II-307
Guo, Mingxin II-558
Gupta, Kamal Kumar I-533
Hajhmida, Moez Ben II-280
Hamon, Thierry I-169
Haque, Rejwanul I-495
Haralambous, Yannis I-247, I-608
Harashima, Jun II-38
Hasanuzzaman, Mohammed I-495, I-555
Hazarika, Devamanyu II-222
Helle, Alexandre I-545
Heracleous, Panikos II-321, II-362
Herranz, Manuel I-545
Hidano, Seira I-521
Hiramatsu, Makoto II-38
Htait, Amal II-248
Hu, Yun II-16
Huang, Chenyang II-293
Huminski, Aliaksandr II-135
Hyvönen, Eero I-199
Ikhwantri, Fariz I-391
Iqbal, Sayef II-198
Jágrová, Klára I-110
Jiang, Lili I-316
Karipbayeva, Aidana I-86
Keshtkar, Fazel II-198
Khan, Adil I-380
Khusainova, Albina I-380
King, Milton I-304
Kiyomoto, Shinsaku I-521
Klang, Per I-367
Komiya, Kanako I-280
Kosseim, Leila II-26, II-212
Krishnamurthy, Gangeshwar II-222
Kumar, Shachi H. II-334
Kummerfeld, Jonathan K. II-476
Lê, Luyện Ngọc I-608
Lecorvé, Gwénolé I-480
Ledeneva, Yulia II-546
Leskinen, Petri I-199
Li, Binyang II-587
Li, Yang II-151
Liao, Mingxue II-16
Liausvia, Fiona II-135
Litvak, Marina II-529
Lloret, Elena II-507
Lv, Pin II-16
Mace, Valentin II-379
Madasu, Avinash II-412
Maisto, Alessandro II-182
Mamyrbayev, Orken II-3
Martins, Bruno II-648
Maskaykin, Anton I-468
Maupomé, Diego I-332
Mehler, Alexander II-491
Mekki, Jade I-480
Melillo, Lorenza II-182
Meurs, Marie-Jean I-332
Mihalcea, Rada II-476
Mimouni, Nada I-95
Mírovský, Jiří I-62
Mohammad, Yasser II-321
Montes-y-Gómez, Manuel I-658
Moreira, Viviane Pereira II-619
Mori, Shinsuke II-61
Mukuntha, N. S. II-166
Murawaki, Yugo II-61
Mussabayev, Rustam II-3
Myrzakhmetov, Bagdat I-351
N’zi, Ehoussou Emmanuel I-247
Nachman, Lama II-334
Nagata, Ryo I-636
Nguyen-Son, Hoang-Quoc I-521
Noé-Bienvenu, Guillaume Le II-392
Oba, Daisuke II-121
Ogier, Jean-Marc I-17
Okur, Eda II-334
Ordoñez-Salinas, Sonia I-432
Otálora, Sebastian I-432
Ototake, Hokuto I-636
Oueslati, Oumayma II-280

Ounelli, Habib II-280
Özbal, Gözde I-3
Paetzold, Gustavo H. I-406
Palshikar, Girish K. II-602
Palshikar, Girish II-49
Papagiannopoulou, Eirini I-215
Pasche, Emilie I-233
Pawar, Sachin II-602
Pelosi, Serena II-182
Peng, Haiyun II-151
Pérez-Rosas, Verónica II-476
Pessutto, Lucas R. C. II-619
Piccardi, Massimo II-575
Pittaras, Nikiforos II-424
Poláková, Lucie I-62
Poncelas, Alberto I-567
Ponzetto, Simone Paolo I-658
Poria, Soujanya II-151, II-222
Pouly, Marc II-81
Přibáň, Pavel II-260
Qasim, Amna II-632
Qu, Weiguang II-558
Raimundo, Filipa II-648
Randriatsitohaina, Tsanta I-169
Rao, Vijjini Anvesh II-412
Rigau, German I-34
Rivera, Adín Ramírez I-380
Romashov, Denis I-468
Rosso, Paolo I-658
Ruch, Patrick I-233
Saenko, Anna II-438
Sahay, Saurav II-334
Saikh, Tanik II-462
Sánchez-Junquera, Javier I-658
Sangwan, Suyash II-49
SanJuan, Eric II-351
Sarkar, Sudeshna I-595
Sasaki, Minoru I-280, II-403
Satapathy, Ranjan I-293
Sato, Shoetsu II-121
Saxena, Anju I-367
Sedighi, Zeinab II-212
Seffih, Hosni II-392
Seifollahi, Sattar II-575
Seitou, Takumi I-280
Sen, Sukanta I-533, I-555
Sha, Ziqi II-558
Sharma, Raksha II-49
Shimura, Kazuya II-451
Shinnou, Hiroyuki I-280
Shutov, Ilya I-468
Simón, Jonathan Rojas II-546
Sorokina, Alena I-86
Specia, Lucia I-406
Strapparava, Carlo I-3
Strzyz, Michalina I-648
Şulea, Octavia-Maria II-520
Suleimanova, Olga II-109
Takhanov, Rustem I-351
Tamper, Minna I-199
Tekiroglu, Serra Sinem I-3
Thalmann, Nadia Magnenat I-293
Thao, Tran Phuong I-521
Themeli, Chrysoula II-424
Thiessard, Frantz I-143
Tolegen, Gulmira II-3
Toleu, Alymzhan II-3
Tomori, Suzushi II-61
Tornés, Beatriz Martínez I-17
Torres-Moreno, Juan-Manuel II-351
Toyoda, Masashi II-121
Trabelsi, Amine I-181
Tran, Son N. I-316
Tsoumakas, Grigorios I-215
Uban, Ana-Sabina II-96
Ukrainets, Igor I-468
Uslu, Tolga II-491
Vanetik, Natalia II-529
Vargas, Danny Suarez II-619
Varshney, Deeksha I-129
Veksler, Yael II-529
Virk, Shafqat Mumtaz I-367
vor der Brück, Tim II-81
Voronkov, Ilia II-438
Vu, Xuan-Son I-316
Wakabayashi, Kei II-38
Wangpoonsarp, Attaporn II-451
Washio, Koki I-636
Way, Andy I-495, I-555, I-567
Wei, Heyu II-558
Welch, Charles II-476
Wenniger, Gideon Maillette de Buy I-567

Wohlgenannt, Gerhard I-468
Won, Miguel II-648
Xu, Ruifeng II-587
Yan, Hongyu II-587
Yasuda, Keiji II-321, II-362
Ye, Shunlong II-558
Yeung, Timothy Yu-Cheong I-95
Yoneyama, Akio II-321, II-362
Yoshinaga, Naoki II-121
Younas, Raheela II-632
Young, Steve II-520
Zaïane, Osmar R. I-181
Zaïane, Osmar II-293
Zheng, Changwen II-16
Zhou, Junsheng II-558
Zhou, Yeyang II-558
Zimmermann, Roger II-222