Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part I 3031263863, 9783031263866

The multi-volume set LNAI 13713 until 13718 constitutes the refereed proceedings of the European Conference on Machine L

English Pages 767 [768] Year 2023

Table of contents :
Preface
Organization
Contents – Part I
Clustering and Dimensionality Reduction
Pass-Efficient Randomized SVD with Boosted Accuracy
1 Introduction
2 Preliminaries
2.1 Basics of Truncated SVD
2.2 Randomized SVD Algorithm with Power Iteration
2.3 Tropp's Single-Pass SVD Algorithm
3 Pass-Efficient SVD with Shifted Power Iteration
3.1 Randomized SVD with Fewer Passes
3.2 The Idea of Shifted Power Iteration
3.3 Update Shift in Each Power Iteration
3.4 Analysis of Computational Cost
4 Experimental Results
4.1 Error Metrics
4.2 Comparison with Basic Randomized SVD Algorithm
4.3 Comparison with Single-Pass SVD Algorithm
5 Conclusion
References
CDPS: Constrained DTW-Preserving Shapelets
1 Introduction
2 Related Work
2.1 Shapelets
2.2 Learning Shapelets
2.3 Unsupervised Shapelets
2.4 Constrained Clustering
3 Constrained DTW-Preserving Shapelets
3.1 Definitions and Notations
3.2 Objective Function
3.3 CDPS Algorithm
4 Evaluation
4.1 Experimental Setup
4.2 Results
5 Discussion
5.1 Model Selection
6 Conclusions
References
Structured Nonlinear Discriminant Analysis
1 Introduction
2 Prerequisites
2.1 Circulant Matrices
2.2 (Circulant) Principal Component Analysis
2.3 Linear Discriminant Analysis
3 Structured Discriminant Analysis
3.1 Circulant Discriminant Analysis
3.2 Computational Aspects for Circulant Structures
3.3 Harmonic Solutions
3.4 Truncated -Circulants
3.5 Non-cyclic Structures
4 Examples and Interpretation
4.1 (Quasi-)Stationary Data
4.2 Non-stationary Data
5 Conclusion
References
LSCALE: Latent Space Clustering-Based Active Learning for Node Classification
1 Introduction
2 Problem Definition
3 Methodology
3.1 Active Learning Latent Space
3.2 Clustering Module
4 Experiments
4.1 Experimental Setting
4.2 Performance Comparison (RQ1)
4.3 Efficiency Comparison (RQ2)
4.4 Ablation Study (RQ3)
5 Related Work
6 Conclusion
References
Powershap: A Power-Full Shapley Feature Selection Method
1 Introduction
2 Related Work
3 Powershap
3.1 Powershap Algorithm
3.2 Automatic Mode
4 Experiments
4.1 Feature Selection Methods
4.2 Simulation Dataset
4.3 Benchmark Datasets
5 Results
5.1 Simulation Dataset
5.2 Benchmark Datasets
6 Discussion
7 Conclusion
References
Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization
1 Introduction
2 Related Work
3 Method
3.1 Problem Setting
3.2 Proposed Model
4 Experiments
4.1 Ground Truth Comparison
4.2 Controversial Label Comparison
4.3 Ablation Study
5 Discussion and Conclusion
References
Wasserstein t-SNE
1 Introduction
2 Methods
2.1 t-SNE
2.2 Wasserstein Metric
2.3 Linear Programming
2.4 Data
3 Results
3.1 Wasserstein t-SNE on Simulated Data
3.2 German Parliamentary Election 2017
4 Discussion
References
Nonparametric Bayesian Deep Visualization
1 Introduction
2 Infinite Warped Mixture Model
3 Proposed Methods
3.1 Neural Network Gaussian Processes
3.2 NN-iWMM
3.3 Nonparametric Bayesian Deep Visualization
4 Bayesian Training
5 Simulation Study
6 Experiments on Real-World Data
7 Conclusion
References
FastDEC: Clustering by Fast Dominance Estimation
1 Introduction
2 Related Work
3 Preliminaries
4 Proposed Framework
4.1 Direct k-NN Dominator (DkD) Detection
4.2 DC Dominance Estimation
4.3 Complexity Analysis and Implementation
5 Evaluation
5.1 Comparison on Artificial and Real-World Datasets
5.2 Robustness Testing
6 Conclusion
References
SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting
1 Introduction
2 Preliminaries
3 SECLEDS: Sequence Clustering in Evolving Streams
3.1 Stable Cluster Definition via Multiple Medoids
3.2 Center of Mass Estimation
4 The SECLEDS Algorithm
5 Experimental Setup
6 Empirical Results
6.1 Use Case: Intelligent Network Traffic Sampling via SECLEDS
7 Conclusions
References
Knowledge Integration in Deep Clustering
1 Introduction
2 Related Work
3 Expert Loss for Knowledge Integration
3.1 Expert Knowledge Representation
3.2 Constraint-Satisfaction Score
3.3 Constraint-Satisfaction Score Computed by a WMC Problem
3.4 Decomposition of the Problem
3.5 Expert Loss
4 Integrating Knowledge in Deep Clustering Frameworks
4.1 IDEC-LK
4.2 SCAN-LK
5 Experiments
5.1 Experiment Settings
5.2 Experiments and Analysis for Clustering Quality
5.3 Experiments and Analysis for Constraint Satisfaction
6 Conclusion
References
Anomaly Detection
ARES: Locally Adaptive Reconstruction-Based Anomaly Scoring
1 Introduction
2 Background
2.1 Autoencoders
2.2 Local Outlier Factor
3 Related Work
4 Methodology
4.1 Problem Definition
4.2 Statistical Interpretation of Reconstruction-Based Anomaly Detection
4.3 Motivation and Empirical Results
4.4 Adaptive Reconstruction Error-Based Scoring
4.5 Local Reconstruction Score
4.6 Local Density Score
5 Experiments
5.1 Datasets and Experimental Setup
5.2 Baselines
5.3 RQ1 (Accuracy)
5.4 RQ2 (Ablation Study)
6 Conclusion
References
R2-AD2: Detecting Anomalies by Analysing the Raw Gradient
1 Introduction
1.1 Related Work
2 Prerequisites
2.1 Activation Anomaly Analysis
3 R2-AD2
4 Experimental Setup
5 Evaluation
5.1 Known Anomalies
5.2 Noise Resistance
5.3 Number of Known Anomalies
5.4 Unknown Anomalies
5.5 Ablation Study
6 Discussion and Future Work
7 Summary
References
Hop-Count Based Self-supervised Anomaly Detection on Attributed Networks
1 Introduction
1.1 Our HCM Approach in a Nutshell
1.2 Summary of the Contributions
2 Related Work
3 Problem Definition
4 Hop-Count Based Model
4.1 Model Framework
4.2 Model Training
4.3 Model Inference for Anomaly Detection
5 Experiments
5.1 Datasets
5.2 Experimental Settings
5.3 Experimental Results
5.4 Parameter Analysis
5.5 Ablation Study
6 Conclusion
References
Deep Learning Based Urban Anomaly Prediction from Spatiotemporal Data
1 Introduction
2 Related Work
2.1 Deep Learning Based Methods
2.2 Hybrid Learning (Graph + Deep Learning) Based Methods
3 Preliminaries
3.1 Notation
3.2 Problem Statement
4 Framework: UrbanAnom
4.1 Semantic Spatial (SS) Module
4.2 Context Aware Temporal (CAT) Module
4.3 Global Attention Module
4.4 Multi-layer Perceptron Based Prediction Module
5 Evaluation
5.1 Dataset
5.2 Parameter Settings
5.3 Performance Validation
5.4 Parameter Sensitivity
5.5 Evaluation of Variants
6 Conclusion
References
Detecting Anomalies with Autoencoders on Data Streams
1 Introduction
2 Problem Statement
3 Related Work
3.1 Offline Anomaly Detection
3.2 Online Anomaly Detection
4 Streaming Anomaly Detection with Autoencoders
5 Experiments
5.1 Data Streams
5.2 Setup
6 Results
7 Conclusion and Future Work
References
Anomaly Detection via Few-Shot Learning on Normality
1 Introduction
2 Related Work
2.1 Deep Anomaly Detection
2.2 Information Bottleneck
3 Motivating Example
4 Prototype Data Description
5 Empirical Results
5.1 Setup
5.2 Comparative Analysis
5.3 Ablation Study
5.4 Graphical Analysis
6 Conclusion
References
Interpretability and Explainability
Interpretations of Predictive Models for Lifestyle-related Diseases at Multiple Time Intervals
1 Introduction
2 Related Work
2.1 Prediction of Diabetes Stages Using Medical Records
2.2 Prediction of Chronic Kidney Diseases Stages Using Medical Records
2.3 Interpretable Prediction of Diseases
3 Dataset
3.1 Structure and Attributes
3.2 Ethical Considerations
4 Method
4.1 Target Attributes
4.2 Prediction Tasks
4.3 Preprocessing
4.4 Training and Interpretation
5 Evaluation
5.1 Prediction Accuracy
5.2 HbA1c Included as a Feature
5.3 HbA1c Not Included as a Feature
5.4 Creatinine Included as a Feature
5.5 Creatinine Not Included as a Feature
6 Conclusion
References
Fair and Efficient Alternatives to Shapley-based Attribution Methods
1 Introduction
2 State of the Art
2.1 The Attribution Problem
2.2 Attribution Using Feature Coalisation Analysis
2.3 Attribution Based on Gradient Analysis
3 Fair-Efficient-Symmetric Perturbations-based AMs
3.1 The Equal Surplus Value
3.2 FESP
4 Experiments
4.1 Image Classification: Protocols and Results
4.2 Text Classification: Protocols and Results
4.3 Discussions
5 Conclusion
References
SMACE: A New Method for the Interpretability of Composite Decision Systems
1 Introduction
2 Related Work
3 Challenges
4 SMACE
4.1 Setting
4.2 Assumptions
4.3 Overview
4.4 Explaining the Results of the Models
4.5 Explaining the Rule-Based Decision
4.6 Overall Explanations
5 Evaluation
5.1 Qualitative Analysis
5.2 Sanity Check
6 Conclusion and Future Work
References
Calibrate to Interpret
1 Introduction
2 Related Works
3 Problem Statement and Other Related Works
3.1 Calibration
3.2 Interpretation Methods
4 Evaluation of Calibration's Impact on Interpretation
4.1 Objectives
4.2 Experimental Setup
4.3 Does Calibration Impact Interpretations?
4.4 Does Calibration Improve the Faithfulness of Interpretation Methods?
4.5 Are Saliency Maps with Calibration More Human-Friendly?
4.6 In Depth Analysis of Meaningful Perturbation
4.7 Discussions
5 Conclusions and Future Works
References
Knowledge-Driven Interpretation of Convolutional Neural Networks
1 Introduction
2 Related Works
3 Ontology-Driven Semantic Alignment
3.1 High-Level Concept Masks
3.2 Alignment Measure
3.3 Direction Learning
3.4 Neural Circuits
4 Results
4.1 Unit Semantic Alignment
4.2 Direction Learning
5 Conclusion
References
Neural Networks with Feature Attribution and Contrastive Explanations
1 Introduction
1.1 Contrastive vs. Counterfactual
2 Related Work
2.1 Contrastive Explanations
2.2 Counterfactual Explanations
2.3 Post-hoc Non-contrastive Explanations
3 Contrastive Explanation Generation
3.1 Neural Nets with Feature Attributions and Contrastive Explanations
3.2 Joint Objective
3.3 Explanations
4 Experiments
4.1 Setup
4.2 Explainability Metrics
4.3 Evaluating Why p?
4.4 Evaluating Why p and Not q?
4.5 Deep Learned Features
4.6 Discussion
5 Conclusion
References
Explaining Predictions by Characteristic Rules
1 Introduction
2 Related Work
3 From Local Explanations to General Characteristic Rules
3.1 Explanation Mining and Rules Selection
3.2 Discriminative vs. Characteristic Rules
4 Empirical Evaluation
4.1 Experimental Setup
4.2 Baseline Experiments
4.3 Comparing Discriminative and Characteristic Rules
4.4 Comparing Local Explanation Techniques
5 Concluding Remarks
References
Session-Based Recommendation Along with the Session Style of Explanation
1 Introduction
2 Related Work
3 Our Proposed Method
3.1 Meta Path-Based Similarity
3.2 Recommendation List Creation by Considering One Meta Path
3.3 Recommendation List Creation by Using Multiple Meta Paths
4 Recommendation Strategies and Single Explanations
5 Hybrid Meta Path-Based Explanation
6 Experimental Evaluation
6.1 Real-Life Datasets
6.2 Evaluation Protocol and Metrics
6.3 xPathSim Sensitivity Analysis
6.4 Comparison with Other Methods
6.5 User Study
7 Conclusion
References
ProtoMIL: Multiple Instance Learning with Prototypical Parts for Whole-Slide Image Classification
1 Introduction
2 Related Works
2.1 Multiple Instance Learning
2.2 Explainable Artificial Intelligence
3 ProtoMIL
4 Experiments
4.1 Bisque Breast Cancer and Colon Cancer Datasets
4.2 Camelyon16 Dataset
4.3 TCGA-NSCLC Dataset
4.4 TCGA-RCC Dataset
4.5 Pruning
4.6 Interpretability of MIL Methods
5 Discussion and Conclusions
5.1 Limitations
5.2 Negative Impact
References
VCNet: A Self-explaining Model for Realistic Counterfactual Generation
1 Introduction
2 Related Work
3 Backgrounds
3.1 Variational Autoencoder (VAE)
3.2 Conditional Variational Autoencoder (cVAE)
4 A Join Training Model
4.1 VCNet Architecture
4.2 Loss Function and Training Procedure
4.3 Counterfactual Generation
5 Experiments and Results
5.1 cVAE for Counterfactual Generation
5.2 Comparison Between VCNet and CounterNet
5.3 Impact of Join-Training on Counterfactual Quality
5.4 Qualitative Results on MNIST Dataset
6 Conclusion
References
Ranking and Recommender Systems
A Recommendation System for CAD Assembly Modeling Based on Graph Neural Networks
1 Recommending Components in Assembly Modeling
2 Graph Neural Networks
3 Graph-Based Recommendations for Assemblies Using Pretrained Embeddings
3.1 Pretraining of Component Embeddings (comp2vec)
3.2 Generating Data Instances for Component Recommendation
3.3 Frequency-Based Baseline Model
4 Experiments
4.1 Experimental Setup
4.2 How Well Can GNNs Learn the Task at Hand?
4.3 Are Component Embeddings Better Than One-Hot Node Features?
4.4 Comparing GAT and GCN
5 Conclusion and Future Work
References
AD-AUG: Adversarial Data Augmentation for Counterfactual Recommendation
1 Introduction
2 Related Work
2.1 Autoencoder-Based CF
2.2 Counterfactual Data Augmentation
2.3 Adversarial Training
3 Preliminaries
3.1 Problem Definition
3.2 Autoencoder CF Framework
4 The Proposed Model
4.1 Model Overview
4.2 Data-Oriented Counterfactual Learning
4.3 Model-Oriented Counterfactual Learning
4.4 Implementation of Augmenter Model
4.5 Curriculum Adversarial Learning
5 Experiment
5.1 Experimental Settings
5.2 Experimental Result
6 Conclusion
References
Bi-directional Contrastive Distillation for Multi-behavior Recommendation
1 Introduction
2 Data Analysis
3 Our Proposed Model
3.1 Multi-behavior GCN
3.2 Bi-directional Contrastive Distillation
3.3 Prediction and Learning
4 Experiments
4.1 Datasets
4.2 Comparison Methods
4.3 Parameter Settings
4.4 Evaluation Metrics
4.5 Performance Comparison
4.6 Ablation Study
4.7 Parameter Analysis
5 Related Work
5.1 Multi-behavior Recommendation
5.2 Contrastive Distillation in Recommendations
6 Conclusions and Future Work
References
Improving Micro-video Recommendation by Controlling Position Bias
1 Introduction
2 Related Work
3 Our Model
3.1 Overview
3.2 Sequence Encoder
3.3 Contrastive Encoder
3.4 Prediction and Loss Function
3.5 Complexity Analysis
3.6 Discussion
4 Experimental Evaluation
4.1 Experimental Setup
4.2 Performance Comparison
4.3 Ablation Study
4.4 Impact of Contrastive Learning Strategies
5 Conclusion
References
Mitigating Confounding Bias for Recommendation via Counterfactual Inference
1 Introduction
2 Methodology
2.1 Preliminary
2.2 Causal Look in Recommendation
2.3 Deconfounded Analysis
2.4 Deconfounded Recommendation Model
3 Experiments
3.1 Experimental Settings
3.2 Performance Comparison (RQ1)
3.3 Case Study (RQ2)
3.4 Deconfounding Capability (RQ3)
4 Related Work
4.1 Debiasing in Recommender Systems
4.2 Deconfounded in Recommender Systems
4.3 Causal Recommendation
5 Conclusion and Future Work
References
Recommending Related Products Using Graph Neural Networks in Directed Graphs
1 Introduction
2 Related Work
3 Related Product Recommendation Problem
4 Proposed Framework
4.1 Product Graph Construction
4.2 DAEMON: Proposed GNN Model
5 Experiments
5.1 Experimental Setting
5.2 EQ1. Node Recommendation Task on Co-purchase Data
5.3 [EQ2, EQ3.] Link Prediction Tasks on Co-purchase Data
5.4 [EQ4, EQ5, EQ6.] Ablation Study on G1 Graph
5.5 Online Platform Performance
6 Conclusion and Future Work
References
A U-Shaped Hierarchical Recommender by Multi-resolution Collaborative Signal Modeling
1 Introduction
2 Related Work
3 Methodology
3.1 The Basic Collaborative Filtering Model
3.2 Modeling Implicit Hierarchies
3.3 UGCN Recommender
3.4 Theoretical Analysis for UGCN Recommender
4 Experiments
4.1 Experimental Settings
4.2 Prediction Accuracy Comparison (QR1)
4.3 Personalization Comparison (QR2)
4.4 Hyper Parameter Analysis (QR3)
5 Conclusion
References
Basket Booster for Prototype-based Contrastive Learning in Next Basket Recommendation
1 Introduction
2 Related Work
2.1 Next Basket Recommendation
2.2 Contrastive Learning
3 The Proposed Method
3.1 Problem Statement
3.2 BPCL
4 Experiments
4.1 Experiments Settings
4.2 Performance Comparison
4.3 Ablation Study
4.4 Hyper-Parameter Study
5 Conclusion
References
Graph Contrastive Learning with Adaptive Augmentation for Recommendation
1 Introduction
2 Preliminaries
3 Methodology
3.1 The Contrastive Learning Framework
3.2 Adaptive Augmentation
3.3 Contrastive Learning
3.4 Multi-task Training
4 Experiments
4.1 Experimental Setup
4.2 Performance Comparison (RQ1)
4.3 Further Study of GCARec
5 Related Work
5.1 Graph-based Recommendation
5.2 Self-supervised Learning in Recommender Systems
6 Conclusion and Future Work
References
Multi-interest Extraction Joint with Contrastive Learning for News Recommendation
1 Introduction
2 Related Work
2.1 Personalized News Recommendation
2.2 Contrastive Learning
3 Methodology
3.1 News Encoder
3.2 Multi-interest User Encoder
3.3 Multi-interest Graph-Enhanced Module
3.4 Multi-interest Contrastive Learning Module
3.5 Adaptive User Aggregator and Click Predictor
3.6 Model Training
4 Experiment
4.1 Dataset and Experimental Settings
4.2 Performance Evaluation
4.3 Ablation Study
4.4 Hyper-Parameters Analysis
4.5 Statistic Analysis
5 Conclusion
References
Transfer and Multitask Learning
On the Relationship Between Disentanglement and Multi-task Learning
1 Introduction
2 Related Work
2.1 Disentanglement
2.2 Multi-task Learning
3 Methods
3.1 Dataset Creation
3.2 Models
3.3 Disentanglement Metrics
4 Results and Discussion
4.1 Does Hard Parameter Sharing Encourage Disentanglement?
4.2 What Are the Properties of the Learned Representations?
4.3 Does Disentanglement Help in Training Multi-task Models?
5 Conclusions
References
InCo: Intermediate Prototype Contrast for Unsupervised Domain Adaptation
1 Introduction
2 Related Work
2.1 Unsupervised Domain Adaptation
2.2 Contrastive Learning
2.3 Contrastive Learning for Domain Adaptation
3 Method
3.1 Problem Definition and Overall Idea
3.2 Revisit of Contrastive Learning
3.3 Intra-Domain Contrastive Learning
3.4 Inter-Domain Contrastive Learning
3.5 Other Losses
3.6 Overall
4 Experiments
4.1 Datasets
4.2 Setup
4.3 Baselines
4.4 Results
4.5 Insight Analysis
5 Conclusion
References
Fast and Accurate Importance Weighting for Correcting Sample Bias
1 Introduction
2 Problem Setting and Proposed Approach
2.1 Learning Scenario
2.2 MMD
2.3 Importance Weighting Network
3 Related Work
4 Experiments
4.1 IWN Settings
4.2 Competitors Settings
4.3 Synthetic Dataset
4.4 UCI Datasets
4.5 Impact of Network Architecture and Batch Size
5 Conclusion
References
Overcoming Catastrophic Forgetting via Direction-Constrained Optimization
1 Introduction
2 Related Work
3 Loss Landscape Properties
4 Algorithm
4.1 Loss Function
4.2 Reduced Linear Autoencoders
4.3 Compression of Autoencoders
4.4 Resulting Algorithm
5 Experiments
5.1 Data Sets and Architectures
5.2 Training Details
5.3 Hyperparameters
5.4 Metric and Results
6 Conclusion
References
Newer is Not Always Better: Rethinking Transferability Metrics, Their Peculiarities, Stability and Performance
1 Introduction
2 Transferability Setup
3 Related Work
4 Improved Estimation of H-score
4.1 Proposed Transferability Measure
4.2 Challenges of Comparing H(f) Across Source Models/Layers
4.3 Efficient Computation for Small Target Data
5 A Closer Look at NCE, LEEP and NLEEP Measures
6 Experiments
6.1 Case Study: Vision Models
6.2 Case Study: Graph Neural Networks
6.3 Timing Comparison Between LogME and H(f)
7 Conclusion
References
Learning to Teach Fairness-Aware Deep Multi-task Learning
1 Introduction
2 Related Work
3 Problem Setting and Basic Concepts
3.1 Fairness Definition and Metric
3.2 Vanilla Multi-task Learning (MTL)
3.3 Fairness-Aware Multi-task Learning (FMTL)
3.4 Deep Q-learning (DQN) and Multi-tasking DQN (MT-DQN)
4 Learning to Teach Fairness-Aware Multi-tasking
4.1 Dynamic Loss Selection Formulation
4.2 L2T-FMT Algorithm
4.3 Student Network
4.4 Teacher Network
5 Experiments
5.1 Experimental Setup
5.2 Overall Fairness-Accuracy Evaluation
5.3 Performance Distribution over the Tasks
5.4 Dynamic Loss Selection
6 Conclusion
References
Author Index

LNAI 13713

Massih-Reza Amini · Stéphane Canu · Asja Fischer · Tias Guns · Petra Kralj Novak · Grigorios Tsoumakas (Eds.)

Machine Learning and Knowledge Discovery in Databases European Conference, ECML PKDD 2022 Grenoble, France, September 19–23, 2022 Proceedings, Part I

Lecture Notes in Computer Science

Lecture Notes in Artificial Intelligence

Founding Editor
Jörg Siekmann

Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Wolfgang Wahlster, DFKI, Berlin, Germany
Zhi-Hua Zhou, Nanjing University, Nanjing, China

The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence. The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R & D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings.

Massih-Reza Amini · Stéphane Canu · Asja Fischer · Tias Guns · Petra Kralj Novak · Grigorios Tsoumakas Editors

Machine Learning and Knowledge Discovery in Databases European Conference, ECML PKDD 2022 Grenoble, France, September 19–23, 2022 Proceedings, Part I

Editors
Massih-Reza Amini, Grenoble Alpes University, Saint Martin d’Hères, France
Stéphane Canu, INSA Rouen Normandy, Saint Etienne du Rouvray, France
Asja Fischer, Ruhr-Universität Bochum, Bochum, Germany
Tias Guns, KU Leuven, Leuven, Belgium
Petra Kralj Novak, Central European University, Vienna, Austria
Grigorios Tsoumakas, Aristotle University of Thessaloniki, Thessaloniki, Greece

ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-3-031-26386-6
ISBN 978-3-031-26387-3 (eBook)
https://doi.org/10.1007/978-3-031-26387-3
LNCS Sublibrary: SL7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

Chapters 5, 7 and 26 are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML–PKDD 2022) in Grenoble, France, was once again a place for in-person gathering and the exchange of ideas after two years of completely virtual conferences due to the SARS-CoV-2 pandemic. This year the conference was hosted for the first time in hybrid format, and we are honored and delighted to offer you these proceedings as a result.

The annual ECML–PKDD conference serves as a global venue for the most recent research in all fields of machine learning and knowledge discovery in databases, including cutting-edge applications. It builds on a highly successful run of ECML–PKDD conferences which has made it the premier European machine learning and data mining conference.

This year, the conference drew over 1080 participants (762 in-person and 318 online) from 37 countries, including 23 European nations. This wealth of interest considerably exceeded our expectations, and we were both excited and under pressure to plan a special event. Overall, the conference attracted a lot of interest from industry thanks to sponsorship, participation, and the conference’s industrial day.

The main conference program consisted of presentations of 242 accepted papers and four keynote talks (in order of appearance):

– Francis Bach (Inria), Information Theory with Kernel Methods
– Danai Koutra (University of Michigan), Mining & Learning [Compact] Representations for Structured Data
– Fosca Giannotti (Scuola Normale Superiore di Pisa), Explainable Machine Learning for Trustworthy AI
– Yann Le Cun (Facebook AI Research), From Machine Learning to Autonomous Intelligence

In addition, there were respectively twenty-three in-person and three online workshops; five in-person and three online tutorials; two combined in-person and one combined online workshop-tutorials, together with a PhD Forum, a discovery challenge and demonstrations.

Papers presented during the three main conference days were organized in four tracks, within 54 sessions:

– Research Track: articles on research or methodology from all branches of machine learning, data mining, and knowledge discovery;
– Applied Data Science Track: articles on cutting-edge uses of machine learning, data mining, and knowledge discovery to resolve practical use cases and close the gap between current theory and practice;
– Journal Track: articles that were published in special issues of the journals Machine Learning and Data Mining and Knowledge Discovery;
– Demo Track: short articles that propose a novel system that advances the state of the art and include a demonstration video.

We received a record number of 1238 abstract submissions, and for the Research and Applied Data Science Tracks, 932 papers made it through the review process (the remaining papers were withdrawn, with the bulk being desk rejected). We accepted 189 (27.3%) Research papers and 53 (22.2%) Applied Data Science articles. 47 papers from the Journal Track and 17 demo papers were also included in the program. We were able to put together an extraordinarily rich and engaging program because of the high-quality submissions.

Research articles that were judged to be of exceptional quality and deserving of special distinction were chosen by the awards committee:

– Machine Learning Best Paper Award: “Bounding the Family-Wise Error Rate in Local Causal Discovery Using Rademacher Averages”, by Dario Simionato (University of Padova) and Fabio Vandin (University of Padova)
– Data-Mining Best Paper Award: “Transforming PageRank into an Infinite-Depth Graph Neural Network”, by Andreas Roth (TU Dortmund) and Thomas Liebig (TU Dortmund)
– Test of Time Award for highest impact paper from ECML–PKDD 2012: “Fairness-Aware Classifier with Prejudice Remover Regularizer”, by Toshihiro Kamishima (National Institute of Advanced Industrial Science and Technology AIST), Shotaro Akashi (National Institute of Advanced Industrial Science and Technology AIST), Hideki Asoh (National Institute of Advanced Industrial Science and Technology AIST), and Jun Sakuma (University of Tsukuba)

We sincerely thank all participants, authors, PC members, area chairs, session chairs, volunteers, and co-organizers for their contributions, which made ECML–PKDD 2022 a huge success. We would especially like to thank Julie from the Grenoble World Trade Center for all her help and Titouan from Insight-outside, who worked so hard to make the online event possible. We would also like to express our gratitude to Thierry for the design of the conference logo representing the three mountain chains surrounding the city of Grenoble, as well as the sponsors and the ECML–PKDD Steering Committee.

October 2022

Massih-Reza Amini Stéphane Canu Asja Fischer Petra Kralj Novak Tias Guns Grigorios Tsoumakas Georgios Balikas Fragkiskos Malliaros

Organization

General Chairs
Massih-Reza Amini, University Grenoble Alpes, France
Stéphane Canu, INSA Rouen, France

Program Chairs
Asja Fischer, Ruhr University Bochum, Germany
Tias Guns, KU Leuven, Belgium
Petra Kralj Novak, Central European University, Austria
Grigorios Tsoumakas, Aristotle University of Thessaloniki, Greece

Journal Track Chairs
Peggy Cellier, INSA Rennes, IRISA, France
Krzysztof Dembczyński, Yahoo Research, USA
Emilie Devijver, CNRS, France
Albrecht Zimmermann, University of Caen Normandie, France

Workshop and Tutorial Chairs
Bruno Crémilleux, University of Caen Normandie, France
Charlotte Laclau, Telecom Paris, France

Local Chairs
Latifa Boudiba, University Grenoble Alpes, France
Franck Iutzeler, University Grenoble Alpes, France

Proceedings Chairs
Wouter Duivesteijn, Technische Universiteit Eindhoven, the Netherlands
Sibylle Hess, Technische Universiteit Eindhoven, the Netherlands

Industry Track Chairs
Rohit Babbar, Aalto University, Finland
Françoise Fogelmann, Hub France IA, France

Discovery Challenge Chairs
Ioannis Katakis, University of Nicosia, Cyprus
Ioannis Partalas, Expedia, Switzerland

Demonstration Chairs
Georgios Balikas, Salesforce, France
Fragkiskos Malliaros, CentraleSupélec, France

PhD Forum Chairs
Esther Galbrun, University of Eastern Finland, Finland
Justine Reynaud, University of Caen Normandie, France

Awards Chairs
Francesca Lisi, Università degli Studi di Bari, Italy
Michalis Vlachos, University of Lausanne, Switzerland

Sponsorship Chairs
Patrice Aknin, IRT SystemX, France
Gilles Gasso, INSA Rouen, France

Web Chairs
Martine Harshé, Laboratoire d’Informatique de Grenoble, France
Marta Soare, University Grenoble Alpes, France

Publicity Chair
Emilie Morvant, Université Jean Monnet, France

ECML PKDD Steering Committee
Annalisa Appice, University of Bari Aldo Moro, Italy
Ira Assent, Aarhus University, Denmark
Albert Bifet, Télécom ParisTech, France
Francesco Bonchi, ISI Foundation, Italy
Tania Cerquitelli, Politecnico di Torino, Italy
Sašo Džeroski, Jožef Stefan Institute, Slovenia
Elisa Fromont, Université de Rennes, France
Andreas Hotho, Julius-Maximilians-Universität Würzburg, Germany
Alípio Jorge, University of Porto, Portugal
Kristian Kersting, TU Darmstadt, Germany
Jefrey Lijffijt, Ghent University, Belgium
Luís Moreira-Matias, University of Porto, Portugal
Katharina Morik, TU Dortmund, Germany
Siegfried Nijssen, Université catholique de Louvain, Belgium
Andrea Passerini, University of Trento, Italy
Fernando Perez-Cruz, ETH Zurich, Switzerland
Alessandra Sala, Shutterstock Ireland Limited, Ireland
Arno Siebes, Utrecht University, the Netherlands
Isabel Valera, Universität des Saarlandes, Germany

Program Committees Guest Editorial Board, Journal Track Richard Allmendinger Marie Anastacio Ira Assent Martin Atzmueller Rohit Babbar

University of Manchester, UK Universiteit Leiden, the Netherlands Aarhus University, Denmark Universität Osnabrück, Germany Aalto University, Finland

x

Organization

Jaume Bacardit Anthony Bagnall Mitra Baratchi Francesco Bariatti German Barquero Alessio Benavoli Viktor Bengs Massimo Bilancia Ilaria Bordino Jakob Bossek Ulf Brefeld Ricardo Campello Michelangelo Ceci Loic Cerf Vitor Cerqueira Laetitia Chapel Jinghui Chen Silvia Chiusano Roberto Corizzo Bruno Cremilleux Marco de Gemmis Sebastien Destercke Shridhar Devamane Benjamin Doerr Wouter Duivesteijn Thomas Dyhre Nielsen Tapio Elomaa Remi Emonet Nicola Fanizzi Pedro Ferreira Cesar Ferri Julia Flores Ionut Florescu Germain Forestier Joel Frank Marco Frasca Jose A. Gomez Stephan Günnemann Luis Galarraga

Newcastle University, UK University of East Anglia, UK Universiteit Leiden, the Netherlands IRISA, France Universitat de Barcelona, Spain Trinity College Dublin, Ireland Ludwig-Maximilians-Universität München, Germany Università degli Studi di Bari Aldo Moro, Italy Unicredit R&D, Italy University of Münster, Germany Leuphana University of Lüneburg, Germany University of Newcastle, UK University of Bari, Italy Universidade Federal de Minas Gerais, Brazil Universidade do Porto, Portugal IRISA, France Pennsylvania State University, USA Politecnico di Torino, Italy Università degli Studi di Bari Aldo Moro, Italy Université de Caen Normandie, France University of Bari Aldo Moro, Italy Centre National de la Recherche Scientifique, France Global Academy of Technology, India Ecole Polytechnique, France Technische Universiteit Eindhoven, the Netherlands Aalborg University, Denmark Tampere University, Finland Université Jean Monnet Saint-Etienne, France Università degli Studi di Bari Aldo Moro, Italy University of Lisbon, Portugal Universitat Politècnica de València, Spain University of Castilla-La Mancha, Spain Stevens Institute of Technology, USA Université de Haute-Alsace, France Ruhr-Universität Bochum, Germany Università degli Studi di Milano, Italy Universidad de Castilla-La Mancha, Spain Institute for Advanced Study, Germany Inria, France

Esther Galbrun Joao Gama Paolo Garza Pascal Germain Fabian Gieseke Riccardo Guidotti Francesco Gullo Antonella Guzzo Isabel Haasler Alexander Hagg Daniel Hernandez-Lobato Jose Hernandez-Orallo Martin Holena Jaakko Hollmen Dino Ienco Georgiana Ifrim Felix Iglesias Angelo Impedovo Frank Iutzeler Mahdi Jalili Szymon Jaroszewicz Mehdi Kaytoue Raouf Kerkouche Pascal Kerschke Dragi Kocev Wojciech Kotlowski Lars Kotthoff Peer Kroger Tipaluck Krityakierne Peer Kroger Meelis Kull Charlotte Laclau Mark Last Matthijs van Leeuwen Thomas Liebig Hsuan-Tien Lin Marco Lippi Daniel Lobato

University of Eastern Finland, Finland University of Porto, Portugal Politecnico di Torino, Italy Université Laval, Canada Westfälische Wilhelms-Universität Münster, Germany Università degli Studi di Pisa, Italy UniCredit, Italy University of Calabria, Italy KTH Royal Institute of Technology, Sweden Bonn-Rhein-Sieg University, Germany Universidad Autónoma de Madrid, Spain Universidad Politecnica de Valencia, Spain Unknown organization, Czechia Stockholm University, Sweden IRSTEA, France University College Dublin, Ireland Technische Universität Wien, Austria Università degli Studi di Bari Aldo Moro, Italy Université Grenoble Alpes, France RMIT University, Australia Polish Academy of Sciences, Poland INSA Lyon, France Helmholtz Center for Information Security, Germany Westfälische Wilhelms-Universität Münster, Germany Jožef Stefan Institute, Slovenia Poznan University of Technology, Poland University of Wyoming, USA Ludwig-Maximilians-Universität München, Germany Mahidol University, Thailand Christian-Albrechts-University Kiel, Germany Tartu Ulikool, Estonia Laboratoire Hubert Curien, France Ben-Gurion University of the Negev, Israel Universiteit Leiden, the Netherlands TU Dortmund, Germany National Taiwan University, Taiwan University of Modena and Reggio Emilia, Italy Universidad Autonoma de Madrid, Spain

Corrado Loglisci Nuno Lourenço Claudio Lucchese Brian MacNamee Davide Maiorca Giuseppe Manco Elio Masciari Andres Masegosa Ernestina Menasalvas Lien Michiels Jan Mielniczuk Paolo Mignone Anna Monreale Giovanni Montana Gregoire Montavon Amedeo Napoli Frank Neumann Thomas Nielsen Bruno Ordozgoiti Panagiotis Papapetrou Andrea Passerini Mykola Pechenizkiy Charlotte Pelletier Ruggero Pensa Nico Piatkowski Gianvito Pio Marc Plantevit Jose M. Puerta Kai Puolamaki Michael Rabbat Jan Ramon Rita Ribeiro Kaspar Riesen Matteo Riondato Celine Robardet Pieter Robberechts Antonio Salmeron Jorg Sander Roberto Santana Michael Schaub

Università degli Studi di Bari Aldo Moro, Italy University of Coimbra, Portugal Ca’Foscari University of Venice, Italy University College Dublin, Ireland University of Cagliari, Italy National Research Council, Italy University of Naples Federico II, Italy University of Aalborg, Denmark Universidad Politecnica de Madrid, Spain Universiteit Antwerpen, Belgium Polish Academy of Sciences, Poland Università degli Studi di Bari Aldo Moro, Italy University of Pisa, Italy University of Warwick, UK Technische Universität Berlin, Germany LORIA, France University of Adelaide, Australia Aalborg Universitet, Denmark Aalto-yliopisto, Finland Stockholms Universitet, Sweden University of Trento, Italy Technische Universiteit Eindhoven, the Netherlands IRISA, France University of Turin, Italy Technische Universität Dortmund, Germany Università degli Studi di Bari Aldo Moro, Italy Université Claude Bernard Lyon 1, France Universidad de Castilla-La Mancha, Spain Helsingin Yliopisto, Finland Meta Platforms Inc, USA Inria Lille Nord Europe, France Universidade do Porto, Portugal University of Bern, Switzerland Amherst College, USA INSA Lyon, France KU Leuven, Belgium University of Almería, Spain University of Alberta, Canada University of the Basque Country, Spain Rheinisch-Westfälische Technische Hochschule, Germany

Erik Schultheis Thomas Seidl Moritz Seiler Kijung Shin Shinichi Shirakawa Marek Smieja James Edward Smith Carlos Soares Arnaud Soulet Gerasimos Spanakis Giancarlo Sperli Myra Spiliopoulou Jerzy Stefanowski Giovanni Stilo Catalin Stoean Mahito Sugiyama Nikolaj Tatti Alexandre Termier Luis Torgo Leonardo Trujillo Wei-Wei Tu Steffen Udluft Arnaud Vandaele Celine Vens Herna Viktor Marco Virgolin Jordi Vitria Jilles Vreeken Willem Waegeman Markus Wagner Elizabeth Wanner Marcel Wever Ngai Wong Man Leung Wong Marek Wydmuch Guoxian Yu Xiang Zhang

Aalto-yliopisto, Finland Ludwig-Maximilians-Universität München, Germany University of Münster, Germany KAIST, South Korea Yokohama National University, Japan Jagiellonian University, Poland University of the West of England, UK Universidade do Porto, Portugal Université de Tours, France Maastricht University, the Netherlands University of Campania Luigi Vanvitelli, Italy Otto von Guericke Universität Magdeburg, Germany Poznan University of Technology, Poland Università degli Studi dell’Aquila, Italy University of Craiova, Romania National Institute of Informatics, Japan Helsingin Yliopisto, Finland Université de Rennes 1, France Dalhousie University, Canada Tecnologico Nacional de Mexico, Mexico 4Paradigm Inc., China Siemens AG Corporate Technology, Germany Université de Mons, Belgium KU Leuven, Belgium University of Ottawa, Canada Centrum Wiskunde en Informatica, the Netherlands Universitat de Barcelona, Spain CISPA Helmholtz Center for Information Security, Germany Universiteit Gent, Belgium University of Adelaide, Australia Centro Federal de Educacao Tecnologica de Minas, Brazil Universität Paderborn, Germany University of Hong Kong, Hong Kong, China Lingnan University, Hong Kong, China Poznan University of Technology, Poland Shandong University, China University of Hong Kong, Hong Kong, China

Ye Zhu Arthur Zimek Albrecht Zimmermann

Deakin University, Australia Syddansk Universitet, Denmark Université de Caen Normandie, France

Area Chairs

Fabrizio Angiulli Annalisa Appice Ira Assent Martin Atzmueller Michael Berthold Albert Bifet Hendrik Blockeel Christian Böhm Francesco Bonchi Ulf Brefeld Francesco Calabrese Toon Calders Michelangelo Ceci Peggy Cellier Duen Horng Chau Nicolas Courty Bruno Cremilleux Jesse Davis Gianmarco De Francisci Morales Tom Diethe Carlotta Domeniconi Yuxiao Dong Kurt Driessens Tapio Elomaa Sergio Escalera Faisal Farooq Asja Fischer Peter Flach Eibe Frank Paolo Frasconi Elisa Fromont Johannes Fürnkranz Patrick Gallinari Joao Gama Jose Gamez Roman Garnett Thomas Gärtner

DIMES, University of Calabria, Italy University of Bari, Italy Aarhus University, Denmark Osnabrück University, Germany Universität Konstanz, Germany Université Paris-Saclay, France KU Leuven, Belgium LMU Munich, Germany ISI Foundation, Turin, Italy Leuphana, Germany Richemont, USA Universiteit Antwerpen, Belgium University of Bari, Italy IRISA, France Georgia Institute of Technology, USA IRISA, Université Bretagne-Sud, France Université de Caen Normandie, France KU Leuven, Belgium CentAI, Italy Amazon, UK George Mason University, USA Tsinghua University, China Maastricht University, the Netherlands Tampere University, Finland CVC and University of Barcelona, Spain Qatar Computing Research Institute, Qatar Ruhr University Bochum, Germany University of Bristol, UK University of Waikato, New Zealand Università degli Studi di Firenze, Italy Université Rennes 1, IRISA/Inria, France JKU Linz, Austria Sorbonne Université, Criteo AI Lab, France INESC TEC - LIAAD, Portugal Universidad de Castilla-La Mancha, Spain Washington University in St. Louis, USA TU Wien, Austria

Aristides Gionis Francesco Gullo Stephan Günnemann Xiangnan He Daniel Hernandez-Lobato José Hernández-Orallo Jaakko Hollmén Andreas Hotho Eyke Hüllermeier Neil Hurley Georgiana Ifrim Alipio Jorge Ross King Arno Knobbe Yun Sing Koh Parisa Kordjamshidi Lars Kotthoff Nicolas Kourtellis Danai Koutra Danica Kragic Stefan Kramer Niklas Lavesson Sébastien Lefèvre Jefrey Lijffijt Marius Lindauer Patrick Loiseau Jose Lozano Jörg Lücke Donato Malerba Fragkiskos Malliaros Giuseppe Manco Wannes Meert Pauli Miettinen Dunja Mladenic Anna Monreale Luis Moreira-Matias Emilie Morvant Sriraam Natarajan Nuria Oliver Panagiotis Papapetrou Laurence Park

KTH Royal Institute of Technology, Sweden UniCredit, Italy Technical University of Munich, Germany University of Science and Technology of China, China Universidad Autonoma de Madrid, Spain Universitat Politècnica de València, Spain Aalto University, Finland Universität Würzburg, Germany University of Munich, Germany University College Dublin, Ireland University College Dublin, Ireland INESC TEC/University of Porto, Portugal Chalmers University of Technology, Sweden Leiden University, the Netherlands University of Auckland, New Zealand Michigan State University, USA University of Wyoming, USA Telefonica Research, Spain University of Michigan, USA KTH Royal Institute of Technology, Sweden Johannes Gutenberg University Mainz, Germany Blekinge Institute of Technology, Sweden Université de Bretagne Sud/IRISA, France Ghent University, Belgium Leibniz University Hannover, Germany Inria, France UPV/EHU, Spain Universität Oldenburg, Germany Università degli Studi di Bari Aldo Moro, Italy CentraleSupelec, France ICAR-CNR, Italy KU Leuven, Belgium University of Eastern Finland, Finland Jožef Stefan Institute, Slovenia Università di Pisa, Italy Finiata, Germany University Jean Monnet, St-Etienne, France UT Dallas, USA Vodafone Research, USA Stockholm University, Sweden WSU, Australia

Andrea Passerini Mykola Pechenizkiy Dino Pedreschi Robert Peharz Julien Perez Franz Pernkopf Bernhard Pfahringer Fabio Pinelli Visvanathan Ramesh Jesse Read Zhaochun Ren Marian-Andrei Rizoiu Celine Robardet Sriparna Saha Ute Schmid Lars Schmidt-Thieme Michele Sebag Thomas Seidl Arno Siebes Fabrizio Silvestri Myra Spiliopoulou Yizhou Sun Jie Tang Nikolaj Tatti Evimaria Terzi Marc Tommasi Antti Ukkonen Herke van Hoof Matthijs van Leeuwen Celine Vens Christel Vrain Jilles Vreeken Willem Waegeman Stefan Wrobel Xing Xie Min-Ling Zhang Albrecht Zimmermann Indre Zliobaite

University of Trento, Italy TU Eindhoven, the Netherlands University of Pisa, Italy Graz University of Technology, Austria Naver Labs Europe, France Graz University of Technology, Austria University of Waikato, New Zealand IMT Lucca, Italy Goethe University Frankfurt, Germany Ecole Polytechnique, France Shandong University, China University of Technology Sydney, Australia INSA Lyon, France IIT Patna, India University of Bamberg, Germany University of Hildesheim, Germany LISN CNRS, France LMU Munich, Germany Universiteit Utrecht, the Netherlands Sapienza, University of Rome, Italy Otto-von-Guericke-University Magdeburg, Germany UCLA, USA Tsinghua University, China Helsinki University, Finland Boston University, USA Lille University, France University of Helsinki, Finland University of Amsterdam, the Netherlands Leiden University, the Netherlands KU Leuven, Belgium University of Orleans, France CISPA Helmholtz Center for Information Security, Germany Universiteit Gent, Belgium Fraunhofer IAIS, Germany Microsoft Research Asia, China Southeast University, China Université de Caen Normandie, France University of Helsinki, Finland

Program Committee Members

Amos Abbott Pedro Abreu Maribel Acosta Timilehin Aderinola Linara Adilova Florian Adriaens Azim Ahmadzadeh Nourhan Ahmed Deepak Ajwani Amir Hossein Akhavan Rahnama Aymen Al Marjani Mehwish Alam Francesco Alesiani Omar Alfarisi Pegah Alizadeh Reem Alotaibi Jumanah Alshehri Bakhtiar Amen Evelin Amorim Shin Ando Thiago Andrade Jean-Marc Andreoli Giuseppina Andresini Alessandro Antonucci Xiang Ao Siddharth Aravindan Héber H. Arcolezi Adrián Arnaiz-Rodríguez Yusuf Arslan André Artelt Sunil Aryal Charles Assaad Matthias Aßenmacher Zeyar Aung Serge Autexier Rohit Babbar Housam Babiker

Virginia Tech, USA CISUC, Portugal Ruhr University Bochum, Germany Insight Centre, University College Dublin, Ireland Ruhr University Bochum, Fraunhofer IAIS, Germany KTH, Sweden Georgia State University, USA University of Hildesheim, Germany University College Dublin, Ireland KTH Royal Institute of Technology, Sweden ENS Lyon, France Leibniz Institute for Information Infrastructure, Germany NEC Laboratories Europe, Germany ADNOC, Canada Ericsson Research, France King Abdulaziz University, Saudi Arabia Temple University, USA University of Huddersfield, UK INESC TEC, Portugal Tokyo University of Science, Japan INESC TEC - LIAAD, Portugal Naverlabs Europe, France University of Bari Aldo Moro, Italy IDSIA, Switzerland Institute of Computing Technology, CAS, China National University of Singapore, Singapore Inria and École Polytechnique, France ELLIS Unit Alicante, Spain University of Luxembourg, Luxembourg Bielefeld University, Germany Deakin University, Australia Easyvista, France Ludwig-Maximilians-Universität München, Germany Masdar Institute, UAE DFKI Bremen, Germany Aalto University, Finland University of Alberta, Canada

Antonio Bahamonde Maroua Bahri Georgios Balikas Maria Bampa Hubert Baniecki Elena Baralis Mitra Baratchi Kalliopi Basioti Martin Becker Diana Benavides Prado Anes Bendimerad Idir Benouaret Isacco Beretta Victor Berger Christoph Bergmeir Cuissart Bertrand Antonio Bevilacqua Yaxin Bi Ranran Bian Adrien Bibal Subhodip Biswas Patrick Blöbaum Carlos Bobed Paul Bogdan Chiara Boldrini Clément Bonet Andrea Bontempelli Ludovico Boratto Stefano Bortoli Diana-Laura Borza Ahcene Boubekki Sabri Boughorbel Paula Branco Jure Brence Martin Breskvar Marco Bressan Dariusz Brzezinski Florian Buettner Julian Busch Sebastian Buschjäger Ali Butt

University of Oviedo, Spain Inria Paris, France Salesforce, France Stockholm University, Sweden Warsaw University of Technology, Poland Politecnico di Torino, Italy LIACS - University of Leiden, the Netherlands Rutgers University, USA Stanford University, USA University of Auckland, New Zealand LIRIS, France Université Grenoble Alpes, France Università di Pisa, Italy CEA, France Monash University, Australia University of Caen, France University College Dublin, Ireland Ulster University, UK University of Auckland, New Zealand University of Louvain, Belgium Virginia Tech, USA Amazon AWS, USA University of Zaragoza, Spain USC, USA CNR, Italy Université Bretagne Sud, France University of Trento, Italy University of Cagliari, Italy Huawei Research Center, Germany Babes Bolyai University, Romania UiT, Norway QCRI, Qatar University of Ottawa, Canada Jožef Stefan Institute, Slovenia Jožef Stefan Institute, Slovenia University of Milan, Italy Poznan University of Technology, Poland German Cancer Research Center, Germany Siemens Technology, Germany TU Dortmund Artificial Intelligence Unit, Germany Virginia Tech, USA

Narayanan C. Krishnan Xiangrui Cai Xiongcai Cai Zekun Cai Andrea Campagner Seyit Camtepe Jiangxia Cao Pengfei Cao Yongcan Cao Cécile Capponi Axel Carlier Paula Carroll John Cartlidge Simon Caton Bogdan Cautis Mustafa Cavus Remy Cazabet Josu Ceberio David Cechák Abdulkadir Celikkanat Dumitru-Clementin Cercel Christophe Cerisara Vítor Cerqueira Mattia Cerrato Ricardo Cerri Hubert Chan Vaggos Chatziafratis Siu Lun Chau Chaochao Chen Chuan Chen Hechang Chen Jia Chen Jiaoyan Chen Jiawei Chen Jin Chen Kuan-Hsun Chen Lingwei Chen Tianyi Chen Wang Chen Xinyuan Chen

IIT Palakkad, India Nankai University, China UNSW Sydney, Australia University of Tokyo, Japan Università degli Studi di Milano-Bicocca, Italy CSIRO Data61, Australia Chinese Academy of Sciences, China Chinese Academy of Sciences, China University of Texas at San Antonio, USA Aix-Marseille University, France Institut National Polytechnique de Toulouse, France University College Dublin, Ireland University of Bristol, UK University College Dublin, Ireland University of Paris-Saclay, France Warsaw University of Technology, Poland Université Lyon 1, France University of the Basque Country, Spain CEITEC Masaryk University, Czechia Technical University of Denmark, Denmark University Politehnica of Bucharest, Romania CNRS, France Dalhousie University, Canada JGU Mainz, Germany Federal University of São Carlos, Brazil University of Hong Kong, Hong Kong, China Stanford University, USA University of Oxford, UK Zhejiang University, China Sun Yat-sen University, China Jilin University, China Beihang University, China University of Oxford, UK Zhejiang University, China University of Electronic Science and Technology, China University of Twente, the Netherlands Wright State University, USA Boston University, USA Google, USA Universiti Kuala Lumpur, Malaysia

Yuqiao Chen Yuzhou Chen Zhennan Chen Zhiyu Chen Zhqian Chen Ziheng Chen Zhiyong Cheng Noëlie Cherrier Anshuman Chhabra Zhixuan Chu Guillaume Cleuziou Ciaran Cooney Robson Cordeiro Roberto Corizzo Antoine Cornuéjols Fabrizio Costa Gustavo Costa Luís Cruz Tianyu Cui Wang-Zhou Dai Tanmoy Dam Thi-Bich-Hanh Dao Adrian Sergiu Darabant Mrinal Das Sina Däubener Padraig Davidson Paul Davidsson Andre de Carvalho Antoine de Mathelin Tom De Schepper Marcilio de Souto Gaetan De Waele Pieter Delobelle Alper Demir Ambra Demontis Difan Deng Guillaume Derval Maunendra Sankar Desarkar Chris Develder Arnout Devos

UT Dallas, USA Princeton University, USA Xiamen University, China UCSB, USA Mississippi State University, USA Stony Brook University, USA Shandong Academy of Sciences, China CITiO, France UC Davis, USA Ant Group, China LIFO, France AflacNI, UK University of São Paulo, Brazil American University, USA AgroParisTech, France Exeter University, UK Instituto Federal de Goiás - Campus Jataí, Brazil Delft University of Technology, the Netherlands Institute of Information Engineering, China Imperial College London, UK University of New South Wales Canberra, Australia University of Orleans, France Babes Bolyai University, Romania IIT Palakkad, India Ruhr University, Bochum, Germany University of Würzburg, Germany Malmö University, Sweden USP, Brazil ENS Paris-Saclay, France University of Antwerp, Belgium LIFO/Univ. Orleans, France Ghent University, Belgium KU Leuven, Belgium Izmir University of Economics, Turkey University of Cagliari, Italy Leibniz Universität Hannover, Germany UCLouvain - ICTEAM, Belgium IIT Hyderabad, India University of Ghent - iMec, Belgium Swiss Federal Institute of Technology Lausanne, Switzerland

Laurens Devos Bhaskar Dhariyal Nicola Di Mauro Aissatou Diallo Christos Dimitrakakis Jiahao Ding Kaize Ding Yao-Xiang Ding Guilherme Dinis Junior Nikolaos Dionelis Christos Diou Sonia Djebali Nicolas Dobigeon Carola Doerr Ruihai Dong Shuyu Dong Yixiang Dong Xin Du Yuntao Du Stefan Duffner Rahul Duggal Wouter Duivesteijn Sebastijan Dumancic Inês Dutra Thomas Dyhre Nielsen Saso Dzeroski Tome Eftimov Hamid Eghbal-zadeh Theresa Eimer Radwa El Shawi Dominik Endres Roberto Esposito Georgios Evangelidis Samuel Fadel Stephan Fahrenkrog-Petersen Xiaomao Fan Zipei Fan Hadi Fanaee Meng Fang Elaine Faria Ad Feelders Sophie Fellenz

KU Leuven, Belgium University College Dublin, Ireland University of Bari, Italy University College London, UK University of Neuchatel, Switzerland University of Houston, USA Arizona State University, USA Nanjing University, China Stockholm University, Sweden University of Edinburgh, UK Harokopio University of Athens, Greece Léonard de Vinci Pôle Universitaire, France University of Toulouse, France Sorbonne University, France University College Dublin, Ireland Inria, Université Paris-Saclay, France Xi’an Jiaotong University, China University of Edinburgh, UK Nanjing University, China University of Lyon, France Georgia Tech, USA TU Eindhoven, the Netherlands TU Delft, the Netherlands University of Porto, Portugal AAU, Denmark Jožef Stefan Institute, Ljubljana, Slovenia Jožef Stefan Institute, Ljubljana, Slovenia LIT AI Lab, Johannes Kepler University, Austria Leibniz University Hannover, Germany Tartu University, Estonia Philipps-Universität Marburg, Germany Università di Torino, Italy University of Macedonia, Greece Leuphana University, Germany Humboldt-Universität zu Berlin, Germany Shenzhen Technology University, China University of Tokyo, Japan Halmstad University, Sweden TU/e, the Netherlands UFU, Brazil Universiteit Utrecht, the Netherlands TU Kaiserslautern, Germany

Stefano Ferilli Daniel Fernández-Sánchez Pedro Ferreira Cèsar Ferri Flavio Figueiredo Soukaina Filali Boubrahimi Raphael Fischer Germain Forestier Edouard Fouché Philippe Fournier-Viger Kary Framling Jérôme François Fabio Fumarola Pratik Gajane Esther Galbrun Laura Galindez Olascoaga Sunanda Gamage Chen Gao Wei Gao Xiaofeng Gao Yuan Gao Jochen Garcke Clement Gautrais Benoit Gauzere Dominique Gay Xiou Ge Bernhard Geiger Jiahui Geng Yangliao Geng Konstantin Genin Firas Gerges Pierre Geurts Gizem Gezici Amirata Ghorbani Biraja Ghoshal Anna Giabelli George Giannakopoulos Tobias Glasmachers Heitor Murilo Gomes Anastasios Gounaris

University of Bari, Italy Universidad Autónoma de Madrid, Spain Faculty of Sciences University of Porto, Portugal Universitat Politècnica de València, Spain UFMG, Brazil Utah State University, USA TU Dortmund, Germany University of Haute Alsace, France Karlsruhe Institute of Technology, Germany Shenzhen University, China Umeå University, Sweden Inria Nancy Grand-Est, France Prometeia, Italy Eindhoven University of Technology, the Netherlands University of Eastern Finland, Finland KU Leuven, Belgium University of Western Ontario, Canada Tsinghua University, China Nanjing University, China Shanghai Jiaotong University, China University of Science and Technology of China, China University of Bonn, Germany Brightclue, France INSA Rouen, France Université de La Réunion, France University of Southern California, USA Know-Center GmbH, Germany University of Stavanger, Norway Tsinghua University, China University of Tübingen, Germany New Jersey Institute of Technology, USA University of Liège, Belgium Sabanci University, Turkey Stanford, USA TCS, UK Università degli studi di Milano Bicocca, Italy IIT Demokritos, Greece Ruhr-University Bochum, Germany University of Waikato, New Zealand Aristotle University of Thessaloniki, Greece

Antoine Gourru Michael Granitzer Magda Gregorova Moritz Grosse-Wentrup Divya Grover Bochen Guan Xinyu Guan Guillaume Guerard Daniel Guerreiro e Silva Riccardo Guidotti Ekta Gujral Aditya Gulati Guibing Guo Jianxiong Guo Yuhui Guo Karthik Gurumoorthy Thomas Guyet Guillaume Habault Amaury Habrard Shahrzad Haddadan Shah Muhammad Hamdi Massinissa Hamidi Peng Han Tom Hanika Sébastien Harispe Marwan Hassani Kohei Hayashi Conor Hayes Lingna He Ramya Hebbalaguppe Jukka Heikkonen Fredrik Heintz Patrick Hemmer Romain Hérault Jeronimo Hernandez-Gonzalez Sibylle Hess Fabian Hinder Lars Holdijk Martin Holena Mike Holenderski Shenda Hong

University of Lyon, France University of Passau, Germany Hochschule Würzburg-Schweinfurt, Germany University of Vienna, Austria Chalmers University, Sweden OPPO US Research Center, USA Xian Jiaotong University, China ESILV, France University of Brasilia, Brazil University of Pisa, Italy University of California, Riverside, USA ELLIS Unit Alicante, Spain Northeastern University, China Beijing Normal University, China Renmin University of China, China Amazon, India Inria, Centre de Lyon, France KDDI Research, Inc., Japan University of St-Etienne, France Brown University, USA New Mexico State University, USA PRES Sorbonne Paris Cité, France KAUST, Saudi Arabia University of Kassel, Germany IMT Mines Alès, France TU Eindhoven, the Netherlands Preferred Networks, Inc., Japan National University of Ireland Galway, Ireland Zhejiang University of Technology, China Indian Institute of Technology, Delhi, India University of Turku, Finland Linköping University, Sweden Karlsruhe Institute of Technology, Germany INSA de Rouen, France University of Barcelona, Spain TU Eindhoven, the Netherlands Bielefeld University, Germany University of Amsterdam, the Netherlands Institute of Computer Science, Czechia Eindhoven University of Technology, the Netherlands Peking University, China

Yupeng Hou Binbin Hu Jian Hu Liang Hu Wen Hu Wenbin Hu Wenbo Hu Yaowei Hu Chao Huang Gang Huang Guanjie Huang Hong Huang Jin Huang Junjie Huang Qiang Huang Shangrong Huang Weitian Huang Yan Huang Yiran Huang Angelo Impedovo Roberto Interdonato Iñaki Inza Stratis Ioannidis Rakib Islam Tobias Jacobs Priyank Jaini Johannes Jakubik Nathalie Japkowicz Szymon Jaroszewicz Shayan Jawed Rathinaraja Jeyaraj Shaoxiong Ji Taoran Ji Bin-Bin Jia Yuheng Jia Ziyu Jia Nan Jiang Renhe Jiang Siyang Jiang Song Jiang Wenyu Jiang

Renmin University of China, China Ant Financial Services Group, China Queen Mary University of London, UK Tongji University, China Ant Group, China Wuhan University, China Tsinghua University, China University of Arkansas, USA University of Hong Kong, China Zhejiang Lab, China Penn State University, USA HUST, China University of Amsterdam, the Netherlands Chinese Academy of Sciences, China Jilin University, China Hunan University, China South China University of Technology, China Huazhong University of Science and Technology, China Karlsruhe Institute of Technology, Germany University of Bari, Italy CIRAD, France University of the Basque Country, Spain Northeastern University, USA Facebook, USA NEC Laboratories Europe GmbH, Germany Google, Canada Karlsruhe Institute of Technology, Germany American University, USA Polish Academy of Sciences, Poland University of Hildesheim, Germany Kyungpook National University, South Korea Aalto University, Finland Virginia Tech, USA Southeast University, China Southeast University, China Beijing Jiaotong University, China Purdue University, USA University of Tokyo, Japan National Taiwan University, Taiwan University of California, Los Angeles, USA Nanjing University, China

Zhen Jiang Yuncheng Jiang François-Xavier Jollois Adan Jose-Garcia Ferdian Jovan Steffen Jung Thorsten Jungeblut Hachem Kadri Vana Kalogeraki Vinayaka Kamath Toshihiro Kamishima Bo Kang Alexandros Karakasidis Mansooreh Karami Panagiotis Karras Ioannis Katakis Koki Kawabata Klemen Kenda Patrik Joslin Kenfack Mahsa Keramati Hamidreza Keshavarz Adil Khan Jihed Khiari Mi-Young Kim Arto Klami Jiri Klema Tomas Kliegr Christian Knoll Dmitry Kobak Vladimer Kobayashi Dragi Kocev Adrian Kochsiek Masahiro Kohjima Georgia Koloniari Nikos Konofaos Irena Koprinska Lars Kotthoff Daniel Kottke

Jiangsu University, China South China Normal University, China Université de Paris Cité, France Université de Lille, France University of Bristol, UK MPII, Germany Bielefeld University of Applied Sciences, Germany Aix-Marseille University, France Athens University of Economics and Business, Greece Microsoft Research India, India National Institute of Advanced Industrial Science, Japan Ghent University, Belgium University of Macedonia, Greece Arizona State University, USA Aarhus University, Denmark University of Nicosia, Cyprus Osaka University, Japan Jožef Stefan Institute, Slovenia Innopolis University, Russia Simon Fraser University, Canada Tarbiat Modares University, Iran Innopolis University, Russia Johannes Kepler University, Austria University of Alberta, Canada University of Helsinki, Finland Czech Technical University, Czechia University of Economics Prague, Czechia Graz University of Technology, Austria University of Tübingen, Germany University of the Philippines Mindanao, Philippines Jožef Stefan Institute, Slovenia University of Mannheim, Germany NTT Corporation, Japan University of Macedonia, Greece Aristotle University of Thessaloniki, Greece University of Sydney, Australia University of Wyoming, USA University of Kassel, Germany

Anna Krause Alexander Kravberg Anastasia Krithara Meelis Kull Pawan Kumar Suresh Kirthi Kumaraswamy Gautam Kunapuli Marcin Kurdziel Vladimir Kuzmanovski Ariel Kwiatkowski Firas Laakom Harri Lähdesmäki Stefanos Laskaridis Alberto Lavelli Aonghus Lawlor Thai Le Hoàng-Ân Lê Hoel Le Capitaine Thach Le Nguyen Tai Le Quy Mustapha Lebbah Dongman Lee John Lee Minwoo Lee Zed Lee Yunwen Lei Douglas Leith Florian Lemmerich Carson Leung Chaozhuo Li Jian Li Lei Li Li Li Rui Li Shiyang Li Shuokai Li Tianyu Li Wenye Li Wenzhong Li

University of Würzburg, Germany KTH Royal Institute of Technology, Sweden NCSR Demokritos, Greece University of Tartu, Estonia IIIT, Hyderabad, India InterDigital, France Verisk Inc, USA AGH University of Science and Technology, Poland Aalto University, Finland École Polytechnique, France Tampere University, Finland Aalto University, Finland Samsung AI, UK FBK-ict, Italy University College Dublin, Ireland University of Mississippi, USA IRISA, University of South Brittany, France University of Nantes, France Insight Centre, Ireland L3S Research Center - Leibniz University Hannover, Germany Sorbonne Paris Nord University, France KAIST, South Korea Université catholique de Louvain, Belgium University of North Carolina at Charlotte, USA Stockholm University, Sweden University of Birmingham, UK Trinity College Dublin, Ireland RWTH Aachen, Germany University of Manitoba, Canada Microsoft Research Asia, China Institute of Information Engineering, China Peking University, China Southwest University, China Inspur Group, China UCSB, USA Chinese Academy of Sciences, China Alibaba Group, China The Chinese University of Hong Kong, Shenzhen, China Nanjing University, China

Xiaoting Li Yang Li Zejian Li Zhidong Li Zhixin Li Defu Lian Bin Liang Yuchen Liang Yiwen Liao Pieter Libin Thomas Liebig Seng Pei Liew Beiyu Lin Chen Lin Tony Lindgren Chen Ling Jiajing Ling Marco Lippi Bin Liu Bowen Liu Chang Liu Chien-Liang Liu Feng Liu Jiacheng Liu Li Liu Shengcai Liu Shenghua Liu Tingwen Liu Xiangyu Liu Yong Liu Yuansan Liu Zhiwei Liu Tuwe Löfström Corrado Loglisci Ting Long Beatriz López Yin Lou Samir Loudni Yang Lu Yuxun Lu

Pennsylvania State University, USA University of North Carolina at Chapel Hill, USA Zhejiang University, China UTS, Australia Guangxi Normal University, China University of Science and Technology of China, China UTS, Australia RPI, USA University of Stuttgart, Germany VUB, Belgium TU Dortmund, Germany LINE Corporation, Japan University of Nevada - Las Vegas, USA Xiamen University, China Stockholm University, Sweden Emory University, USA Singapore Management University, Singapore University of Modena and Reggio Emilia, Italy Chongqing University, China Stanford University, USA Institute of Information Engineering, CAS, China National Chiao Tung University, Taiwan East China Normal University, China Chinese University of Hong Kong, China Chongqing University, China Southern University of Science and Technology, China Institute of Computing Technology, CAS, China Institute of Information Engineering, CAS, China Tencent, China Renmin University of China, China University of Melbourne, Australia Salesforce, USA Jönköping University, Sweden Università degli Studi di Bari Aldo Moro, Italy Shanghai Jiao Tong University, China University of Girona, Spain Ant Group, USA TASC (LS2N-CNRS), IMT Atlantique, France Xiamen University, China National Institute of Informatics, Japan


Massimiliano Luca Stefan Lüdtke Jovita Lukasik Denis Lukovnikov Pedro Henrique Luz de Araujo Fenglong Ma Jing Ma Meng Ma Muyang Ma Ruizhe Ma Xingkong Ma Xueqi Ma Zichen Ma Luis Macedo Harshitha Machiraju Manchit Madan Seiji Maekawa Sindri Magnusson Pathum Chamikara Mahawaga Saket Maheshwary Ajay Mahimkar Pierre Maillot Lorenzo Malandri Rammohan Mallipeddi Sahil Manchanda Domenico Mandaglio Panagiotis Mandros Robin Manhaeve Silviu Maniu Cinmayii Manliguez Naresh Manwani Jiali Mao Alexandru Mara Radu Marculescu Roger Mark Fernando Martínez-Plume Koji Maruhashi Simone Marullo

Bruno Kessler Foundation, Italy University of Mannheim, Germany University of Mannheim, Germany University of Bonn, Germany University of Brasília, Brazil Pennsylvania State University, USA University of Virginia, USA Peking University, China Shandong University, China University of Massachusetts Lowell, USA National University of Defense Technology, China Tsinghua University, China The Chinese University of Hong Kong, Shenzhen, China University of Coimbra, Portugal EPFL, Switzerland Delivery Hero, Germany Osaka University, Japan Stockholm University, Sweden CSIRO Data61, Australia Amazon, India AT&T, USA Inria, France Unimib, Italy Kyungpook National University, South Korea IIT Delhi, India DIMES-UNICAL, Italy Harvard University, USA KU Leuven, Belgium Université Paris-Saclay, France National Sun Yat-Sen University, Taiwan International Institute of Information Technology, India East China Normal University, China Ghent University, Belgium University of Texas at Austin, USA Massachusetts Institute of Technology, USA Joint Research Centre - European Commission, Belgium Fujitsu Research, Fujitsu Limited, Japan University of Siena, Italy


Elio Masciari Florent Masseglia Michael Mathioudakis Takashi Matsubara Tetsu Matsukawa Santiago Mazuelas Ryan McConville Hardik Meisheri Panagiotis Meletis Gabor Melli Joao Mendes-Moreira Chuan Meng Cristina Menghini Engelbert Mephu Nguifo Fabio Mercorio Guillaume Metzler Hao Miao Alessio Micheli Paolo Mignone Matej Mihelcic Ioanna Miliou Bamdev Mishra Rishabh Misra Dixant Mittal Zhaobin Mo Daichi Mochihashi Armin Moharrer Ioannis Mollas Carlos Monserrat-Aranda Konda Reddy Mopuri Raha Moraffah Pawel Morawiecki Ahmadreza Mosallanezhad Davide Mottin Koyel Mukherjee Maximilian Münch Fabricio Murai Taichi Murayama


University of Naples, Italy Inria, France University of Helsinki, Finland Osaka University, Japan Kyushu University, Japan BCAM-Basque Center for Applied Mathematics, Spain University of Bristol, UK TCS Research, India Eindhoven University of Technology, the Netherlands Medable, USA INESC TEC, Portugal University of Amsterdam, the Netherlands Brown University, USA Université Clermont Auvergne, CNRS, LIMOS, France University of Milan-Bicocca, Italy Laboratoire ERIC, France Aalborg University, Denmark Università di Pisa, Italy University of Bari Aldo Moro, Italy University of Zagreb, Croatia Stockholm University, Sweden Microsoft, India Twitter, Inc, USA National University of Singapore, Singapore Columbia University, USA Institute of Statistical Mathematics, Japan Northeastern University, USA Aristotle University of Thessaloniki, Greece Universität Politècnica de València, Spain Indian Institute of Technology Guwahati, India Arizona State University, USA Polish Academy of Sciences, Poland Arizona State University, USA Aarhus University, Denmark Adobe Research, India University of Applied Sciences Würzburg, Germany Universidade Federal de Minas Gerais, Brazil NAIST, Japan


Stéphane Mussard Mohamed Nadif Cian Naik Felipe Kenji Nakano Mirco Nanni Apurva Narayan Usman Naseem Gergely Nemeth Stefan Neumann Anna Nguyen Quan Nguyen Thi Phuong Quyen Nguyen Thu Nguyen Thu Trang Nguyen Prajakta Nimbhorkar Xuefei Ning Ikuko Nishikawa Hao Niu Paraskevi Nousi Erik Novak Slawomir Nowaczyk Aleksandra Nowak Eirini Ntoutsi Andreas Nürnberger James O’Neill Lutz Oettershagen Tsuyoshi Okita Makoto Onizuka Subba Reddy Oota María Óskarsdóttir Aomar Osmani Aljaz Osojnik Shuichi Otake Greger Ottosson Zijing Ou Abdelkader Ouali Latifa Oukhellou Kai Ouyang Andrei Paleyes Pankaj Pandey Guansong Pang Pance Panov

CHROME, France Centre Borelli - Université Paris Cité, France University of Oxford, UK KU Leuven, Belgium ISTI-CNR Pisa, Italy University of Waterloo, Canada University of Sydney, Australia ELLIS Unit Alicante, Spain KTH Royal Institute of Technology, Sweden Karlsruhe Institute of Technology, Germany Washington University in St. Louis, USA University of Da Nang, Vietnam SimulaMet, Norway University College Dublin, Ireland Chennai Mathematical Institute, Chennai, India Tsinghua University, China Ritsumeikan University, Japan KDDI Research, Inc., Japan Aristotle University of Thessaloniki, Greece Jožef Stefan Institute, Slovenia Halmstad University, Sweden Jagiellonian University, Poland Freie Universität Berlin, Germany Magdeburg University, Germany University of Liverpool, UK University of Bonn, Germany Kyushu Institute of Technology, Japan Osaka University, Japan IIIT Hyderabad, India University of Reykjavík, Iceland PRES Sorbonne Paris Cité, France JSI, Slovenia National Institute of Informatics, Japan IBM, France Sun Yat-sen University, China University of Caen Normandy, France IFSTTAR, France Tsinghua University, France University of Cambridge, UK Indian Institute of Technology Gandhinagar, India Singapore Management University, Singapore Jožef Stefan Institute, Slovenia


Apostolos Papadopoulos Evangelos Papalexakis Anna Pappa Chanyoung Park Haekyu Park Sanghyun Park Luca Pasa Kevin Pasini Vincenzo Pasquadibisceglie Nikolaos Passalis Javier Pastorino Kitsuchart Pasupa Andrea Paudice Anand Paul Yulong Pei Charlotte Pelletier Jaakko Peltonen Ruggero Pensa Fabiola Pereira Lucas Pereira Aritz Pérez Lorenzo Perini Alan Perotti Michaël Perrot Matej Petkovic Lukas Pfahler Nico Piatkowski Francesco Piccialli Gianvito Pio Giuseppe Pirrò Marc Plantevit Konstantinos Pliakos Matthias Pohl Nicolas Posocco Cedric Pradalier Paul Prasse Mahardhika Pratama Francesca Pratesi Steven Prestwich Giulia Preti Philippe Preux Shalini Priya


Aristotle University of Thessaloniki, Greece UC Riverside, USA Université Paris 8, France UIUC, USA Georgia Institute of Technology, USA Yonsei University, South Korea University of Padova, Italy IRT SystemX, France University of Bari Aldo Moro, Italy Aristotle University of Thessaloniki, Greece University of Colorado, Denver, USA King Mongkut’s Institute of Technology, Thailand University of Milan, Italy Kyungpook National University, South Korea TU Eindhoven, the Netherlands Université de Bretagne du Sud, France Tampere University, Finland University of Torino, Italy Federal University of Uberlandia, Brazil ITI, LARSyS, Técnico Lisboa, Portugal Basque Center for Applied Mathematics, Spain KU Leuven, Belgium CENTAI Institute, Italy Inria Lille, France Institute Jožef Stefan, Slovenia TU Dortmund University, Germany Fraunhofer IAIS, Germany University of Naples Federico II, Italy University of Bari, Italy Sapienza University of Rome, Italy EPITA, France KU Leuven, Belgium Otto von Guericke University, Germany EURA NOVA, Belgium GeorgiaTech Lorraine, France University of Potsdam, Germany University of South Australia, Australia ISTI - CNR, Italy University College Cork, Ireland CentAI, Italy Inria, France Oak Ridge National Laboratory, USA


Ricardo Prudencio Luca Putelli Peter van der Putten Chuan Qin Jixiang Qing Jolin Qu Nicolas Quesada Teeradaj Racharak Krystian Radlak Sandro Radovanovic Md Masudur Rahman Ankita Raj Herilalaina Rakotoarison Alexander Rakowski Jan Ramon Sascha Ranftl Aleksandra Rashkovska Koceva S. Ravi Jesse Read David Reich Marina Reyboz Pedro Ribeiro Rita P. Ribeiro Piera Riccio Christophe Rigotti Matteo Riondato Mateus Riva Kit Rodolfa Christophe Rodrigues Simon Rodríguez-Santana Gaetano Rossiello Mohammad Rostami Franz Rothlauf Celine Rouveirol Arjun Roy Joze Rozanec Salvatore Ruggieri Marko Ruman Ellen Rushe

Universidade Federal de Pernambuco, Brazil Università degli Studi di Brescia, Italy Leiden University, the Netherlands Baidu, China Ghent University, Belgium Western Sydney University, Australia Polytechnique Montreal, Canada Japan Advanced Institute of Science and Technology, Japan Warsaw University of Technology, Poland University of Belgrade, Serbia Purdue University, USA Indian Institute of Technology Delhi, India Inria, France Hasso Plattner Institute, Germany Inria, France Graz University of Technology, Austria Jožef Stefan Institute, Slovenia Biocomplexity Institute, USA Ecole Polytechnique, France Universität Potsdam, Germany CEA, LIST, France University of Porto, Portugal University of Porto, Portugal ELLIS Unit Alicante Foundation, Spain INSA Lyon, France Amherst College, USA Telecom ParisTech, France CMU, USA DVRC Pôle Universitaire Léonard de Vinci, France ICMAT, Spain IBM Research, USA University of Southern California, USA Mainz Universität, Germany Université Paris-Nord, France Freie Universität Berlin, Germany Josef Stefan International Postgraduate School, Slovenia University of Pisa, Italy UTIA, AV CR, Czechia University College Dublin, Ireland


Dawid Rymarczyk Amal Saadallah Khaled Mohammed Saifuddin Hajer Salem Francesco Salvetti Roberto Santana KC Santosh Somdeb Sarkhel Yuya Sasaki Yücel Saygın Patrick Schäfer Alexander Schiendorfer Peter Schlicht Daniel Schmidt Johannes Schneider Steven Schockaert Jens Schreiber Matthias Schubert Alexander Schulz Jan-Philipp Schulze Andreas Schwung Vasile-Marian Scuturici Raquel Sebastião Stanislav Selitskiy Edoardo Serra Lorenzo Severini Tapan Shah Ammar Shaker Shiv Shankar Junming Shao Kartik Sharma Manali Sharma Ariona Shashaj Betty Shea Chengchao Shen Hailan Shen Jiawei Sheng Yongpan Sheng Chongyang Shi


Jagiellonian University, Poland TU Dortmund, Germany Georgia State University, USA AUDENSIEL, France Politecnico di Torino, Italy University of the Basque Country (UPV/EHU), Spain University of South Dakota, USA Adobe, USA Osaka University, Japan Sabancı Universitesi, Turkey Humboldt-Universität zu Berlin, Germany Technische Hochschule Ingolstadt, Germany Volkswagen Group Research, Germany Monash University, Australia University of Liechtenstein, Liechtenstein Cardiff University, UK University of Kassel, Germany Ludwig-Maximilians-Universität München, Germany CITEC, Bielefeld University, Germany Fraunhofer AISEC, Germany Fachhochschule Südwestfalen, Germany LIRIS, France IEETA/DETI-UA, Portugal University of Bedfordshire, UK Boise State University, USA UniCredit, R&D Dept., Italy GE, USA NEC Laboratories Europe, Germany University of Massachusetts, USA University of Electronic Science and Technology, China Georgia Institute of Technology, USA Samsung, USA Network Contacts, Italy University of British Columbia, Canada Central South University, China Central South University, China Chinese Academy of Sciences, China Southwest University, China Beijing Institute of Technology, China


Zhengxiang Shi Naman Shukla Pablo Silva Simeon Simoff Maneesh Singh Nikhil Singh Sarath Sivaprasad Elena Sizikova Andrzej Skowron Blaz Skrlj Oliver Snow Jonas Soenen Nataliya Sokolovska K. M. A. Solaiman Shuangyong Song Zixing Song Tiberiu Sosea Arnaud Soulet Lucas Souza Jens Sparsø Vivek Srivastava Marija Stanojevic Jerzy Stefanowski Simon Stieber Jinyan Su Yongduo Sui Huiyan Sun Yuwei Sun Gokul Swamy Maryam Tabar Anika Tabassum Shazia Tabassum Koji Tabata Andrea Tagarelli Etienne Tajeuna Acar Tamersoy Chang Wei Tan Cheng Tan Feilong Tang Feng Tao

University College London, UK Deepair LLC, USA Dell Technologies, Brazil Western Sydney University, Australia Motive Technologies, USA MIT Media Lab, USA IIIT Hyderabad, India NYU, USA University of Warsaw, Poland Institute Jožef Stefan, Slovenia Simon Fraser University, Canada KU Leuven, Belgium Sorbonne University, France Purdue University, USA Jing Dong, China The Chinese University of Hong Kong, China University of Illinois at Chicago, USA University of Tours, France UFRJ, Brazil Technical University of Denmark, Denmark TCS Research, USA Temple University, USA Poznan University of Technology, Poland University of Augsburg, Germany University of Electronic Science and Technology, China University of Science and Technology of China, China Jilin University, China University of Tokyo/RIKEN AIP, Japan Amazon, USA Pennsylvania State University, USA Virginia Tech, USA INESCTEC, Portugal Hokkaido University, Japan DIMES, University of Calabria, Italy Université de Laval, Canada NortonLifeLock Research Group, USA Monash University, Australia Westlake University, China Shanghai Jiao Tong University, China Volvo Cars, USA


Youming Tao Martin Tappler Garth Tarr Mohammad Tayebi Anastasios Tefas Maguelonne Teisseire Stefano Teso Olivier Teste Maximilian Thiessen Eleftherios Tiakas Hongda Tian Alessandro Tibo Aditya Srinivas Timmaraju Christos Tjortjis Ljupco Todorovski Laszlo Toka Ancy Tom Panagiotis Traganitis Cuong Tran Minh-Tuan Tran Giovanni Trappolini Volker Tresp Yu-Chee Tseng Maria Tzelepi Willy Ugarte Antti Ukkonen Abhishek Kumar Umrawal Athena Vakal Matias Valdenegro Toro Maaike Van Roy Dinh Van Tran Fabio Vandin Valerie Vaquet Iraklis Varlamis Santiago Velasco-Forero Bruno Veloso Dmytro Velychko Sreekanth Vempati Sebastián Ventura Soto Rosana Veroneze

Shandong University, China Graz University of Technology, Austria University of Sydney, Australia Simon Fraser University, Canada Aristotle University of Thessaloniki, Greece INRAE - UMR Tetis, France University of Trento, Italy IRIT, University of Toulouse, France TU Wien, Austria Aristotle University of Thessaloniki, Greece University of Technology Sydney, Australia Aalborg University, Denmark Facebook, USA International Hellenic University, Greece University of Ljubljana, Slovenia BME, Hungary University of Minnesota, Twin Cities, USA Michigan State University, USA Syracuse University, USA KAIST, South Korea Sapienza University of Rome, Italy LMU, Germany National Yang Ming Chiao Tung University, Taiwan Aristotle University of Thessaloniki, Greece University of Applied Sciences (UPC), Peru University of Helsinki, Finland Purdue University, USA Aristotle University, Greece University of Groningen, the Netherlands KU Leuven, Belgium University of Freiburg, Germany University of Padova, Italy CITEC, Bielefeld University, Germany Harokopio University of Athens, Greece MINES ParisTech, France Porto, Portugal Carl von Ossietzky Universität Oldenburg, Germany Myntra, India University of Cordoba, Portugal LBiC, Brazil


Jan Verwaeren Vassilios Verykios Herna Viktor João Vinagre Fabio Vitale Vasiliki Voukelatou Dong Quan Vu Maxime Wabartha Tomasz Walkowiak Vijay Walunj Michael Wand Beilun Wang Chang-Dong Wang Daheng Wang Deng-Bao Wang Di Wang Di Wang Fu Wang Hao Wang Hao Wang Hao Wang Hongwei Wang Hui Wang Hui (Wendy) Wang Jia Wang Jing Wang Junxiang Wang Qing Wang Rongguang Wang Ruoyu Wang Ruxin Wang Senzhang Wang Shoujin Wang Xi Wang Yanchen Wang Ye Wang Ye Wang Yifei Wang Yongqing Wang

Ghent University, Belgium Hellenic Open University, Greece University of Ottawa, Canada LIAAD - INESC TEC, Portugal Centai Institute, Italy ISTI - CNR, Italy Safran Tech, France McGill University, Canada Wroclaw University of Science and Technology, Poland University of Missouri-Kansas City, USA University of Mainz, Germany Southeast University, China Sun Yat-sen University, China Amazon, USA Southeast University, China Nanyang Technological University, Singapore KAUST, Saudi Arabia University of Exeter, UK Nanyang Technological University, Singapore Louisiana State University, USA University of Science and Technology of China, China University of Illinois Urbana-Champaign, USA SKLSDE, China Stevens Institute of Technology, USA Xi’an Jiaotong-Liverpool University, China Beijing Jiaotong University, China Emory University, USA IBM Research, USA University of Pennsylvania, USA Shanghai Jiao Tong University, China Shenzhen Institutes of Advanced Technology, China Central South University, China Macquarie University, Australia Chinese Academy of Sciences, China Georgetown University, USA Chongqing University, China National University of Singapore, Singapore Peking University, China Chinese Academy of Sciences, China


Yuandong Wang Yue Wang Yun Cheng Wang Zhaonan Wang Zhaoxia Wang Zhiwei Wang Zihan Wang Zijie J. Wang Dilusha Weeraddana Pascal Welke Tobias Weller Jörg Wicker Lena Wiese Michael Wilbur Moritz Wolter Bin Wu Bo Wu Jiancan Wu Jiantao Wu Ou Wu Yang Wu Yiqing Wu Yuejia Wu Bin Xiao Zhiwen Xiao Ruobing Xie Zikang Xiong Depeng Xu Jian Xu Jiarong Xu Kunpeng Xu Ning Xu Xianghong Xu Sangeeta Yadav Mehrdad Yaghoobi Makoto Yamada Akihiro Yamaguchi Anil Yaman


Tsinghua University, China Microsoft Research, USA University of Southern California, USA University of Tokyo, Japan SMU, Singapore University of Chinese Academy of Sciences, China Shandong University, China Georgia Tech, USA CSIRO, Australia University of Bonn, Germany University of Mannheim, Germany University of Auckland, New Zealand Goethe University Frankfurt, Germany Vanderbilt University, USA Bonn University, Germany Beijing University of Posts and Telecommunications, China Renmin University of China, China University of Science and Technology of China, China University of Jinan, China Tianjin University, China Chinese Academy of Sciences, China University of Chinese Academic of Science, China Inner Mongolia University, China University of Ottawa, Canada Southwest Jiaotong University, China WeChat, Tencent, China Purdue University, USA University of North Carolina at Charlotte, USA Citadel, USA Fudan University, China University of Sherbrooke, Canada Southeast University, China Tsinghua University, China Indian Institute of Science, India University of Edinburgh, UK RIKEN AIP/Kyoto University, Japan Toshiba Corporation, Japan Vrije Universiteit Amsterdam, the Netherlands


Hao Yan Qiao Yan Chuang Yang Deqing Yang Haitian Yang Renchi Yang Shaofu Yang Yang Yang Yang Yang Yiyang Yang Yu Yang Peng Yao Vithya Yogarajan Tetsuya Yoshida Hong Yu Wenjian Yu Yanwei Yu Ziqiang Yu Sha Yuan Shuhan Yuan Mingxuan Yue Aras Yurtman Nayyar Zaidi Zelin Zang Masoumeh Zareapoor Hanqing Zeng Tieyong Zeng Bin Zhang Bob Zhang Hang Zhang Huaizheng Zhang Jiangwei Zhang Jinwei Zhang Jun Zhang Lei Zhang Luxin Zhang Mimi Zhang Qi Zhang

Washington University in St Louis, USA Shenzhen University, China University of Tokyo, Japan Fudan University, China Chinese Academy of Sciences, China National University of Singapore, Singapore Southeast University, China Nanjing University of Science and Technology, China Northwestern University, USA Guangdong University of Technology, China The Hong Kong Polytechnic University, China University of Science and Technology of China, China University of Auckland, New Zealand Nara Women’s University, Japan Chongqing Laboratory of Comput. Intelligence, China Tsinghua University, China Ocean University of China, China Yantai University, China Beijing Academy of Artificial Intelligence, China Utah State University, USA Google, USA KU Leuven, Belgium Deakin University, Australia Zhejiang University & Westlake University, China Shanghai Jiao Tong University, China USC, USA The Chinese University of Hong Kong, China South China University of Technology, China University of Macau, Macao, China National University of Defense Technology, China Nanyang Technological University, Singapore Tencent, China Cornell University, USA Tsinghua University, China Virginia Tech, USA Worldline/Inria, France Trinity College Dublin, Ireland University of Technology Sydney, Australia


Qiyiwen Zhang Teng Zhang Tianle Zhang Xuan Zhang Yang Zhang Yaqian Zhang Yu Zhang Zhengbo Zhang Zhiyuan Zhang Heng Zhao Mia Zhao Tong Zhao Qinkai Zheng Xiangping Zheng Bingxin Zhou Bo Zhou Min Zhou Zhipeng Zhou Hui Zhu Kenny Zhu Lingwei Zhu Mengying Zhu Renbo Zhu Yanmin Zhu Yifan Zhu Bartosz Zieliński Sebastian Ziesche Indre Zliobaite Gianlucca Zuin

University of Pennsylvania, USA Huazhong University of Science and Technology, China University of Exeter, UK Renmin University of China, China University of Science and Technology of China, China University of Waikato, New Zealand University of Illinois at Urbana-Champaign, USA Beihang University, China Peking University, China Shenzhen Technology University, China Airbnb, USA Snap Inc., USA Tsinghua University, China Renmin University of China, China University of Sydney, Australia Baidu, Inc., China Huawei Technologies, China University of Science and Technology of China, China Chinese Academy of Sciences, China SJTU, China Nara Institute of Science and Technology, Japan Zhejiang University, China Peking University, China Shanghai Jiao Tong University, China Tsinghua University, China Jagiellonian University, Poland Bosch Center for Artificial Intelligence, Germany University of Helsinki, Finland UFM, Brazil

Program Committee Members, Demo Track

Hesam Amoualian Georgios Balikas Giannis Bekoulis Ludovico Boratto Michelangelo Ceci Abdulkadir Celikkanat


WholeSoft Market, France Salesforce, France Vrije Universiteit Brussel, Belgium University of Cagliari, Italy University of Bari, Italy Technical University of Denmark, Denmark


Tania Cerquitelli Mel Chekol Charalampos Chelmis Yagmur Gizem Cinar Eustache Diemert Sophie Fellenz James Foulds Jhony H. Giraldo Parantapa Goswami Derek Greene Lili Jiang Bikash Joshi Alexander Jung Zekarias Kefato Ilkcan Keles Sammy Khalife Tuan Le Ye Liu Fragkiskos Malliaros Hamid Mirisaee Robert Moro Iosif Mporas Giannis Nikolentzos Eirini Ntoutsi Frans Oliehoek Nora Ouzir Özlem Özgöbek Manos Papagelis Shichao Pei Botao Peng Antonia Saravanou Rik Sarkar Vera Shalaeva Kostas Stefanidis Nikolaos Tziortziotis Davide Vega Sagar Verma Yanhao Wang

Informatica Politecnico di Torino, Italy Utrecht University, the Netherlands University at Albany, USA Amazon, France Criteo AI Lab, France TU Kaiserslautern, Germany University of Maryland, Baltimore County, USA Télécom Paris, France Rakuten Institute of Technology, Rakuten Group, Japan University College Dublin, Ireland Umeå University, Sweden Elsevier, the Netherlands Aalto University, Finland KTH Royal Institute of Technology, Sweden Aalborg University, Denmark Johns Hopkins University, USA New Mexico State University, USA Salesforce, USA CentraleSupelec, France AMLRightSource, France Kempelen Institute of Intelligent Technologies, Slovakia University of Hertfordshire, UK Ecole Polytechnique, France Freie Universität Berlin, Germany Delft University of Technology, the Netherlands CentraleSupélec, France Norwegian University of Science and Technology, Norway York University, UK University of Notre Dame, USA Chinese Academy of Sciences, China National and Kapodistrian University of Athens, Greece University of Edinburgh, UK Inria Lille-Nord, France Tampere University, Finland Jellyfish, France Uppsala University, Sweden CentraleSupelec, France East China Normal University, China


Zhirong Yang Xiangyu Zhao

Norwegian University of Science and Technology, Norway City University of Hong Kong, Hong Kong, China

Sponsors

Contents – Part I

Clustering and Dimensionality Reduction

Pass-Efficient Randomized SVD with Boosted Accuracy . . . 3
Xu Feng, Wenjian Yu, and Yuyang Xie

CDPS: Constrained DTW-Preserving Shapelets . . . 21
Hussein El Amouri, Thomas Lampert, Pierre Gançarski, and Clément Mallet

Structured Nonlinear Discriminant Analysis . . . 38
Christopher Bonenberger, Wolfgang Ertel, Markus Schneider, and Friedhelm Schwenker

LSCALE: Latent Space Clustering-Based Active Learning for Node Classification . . . 55
Juncheng Liu, Yiwei Wang, Bryan Hooi, Renchi Yang, and Xiaokui Xiao

Powershap: A Power-Full Shapley Feature Selection Method . . . 71
Jarne Verhaeghe, Jeroen Van Der Donckt, Femke Ongenae, and Sofie Van Hoecke

Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization . . . 88
Zheng Chen, Lingwei Zhu, Ziwei Yang, and Takashi Matsubara

Wasserstein t-SNE . . . 104
Fynn Bachmann, Philipp Hennig, and Dmitry Kobak

Nonparametric Bayesian Deep Visualization . . . 121
Haruya Ishizuka and Daichi Mochihashi

FastDEC: Clustering by Fast Dominance Estimation . . . 138
Geping Yang, Hongzhang Lv, Yiyang Yang, Zhiguo Gong, Xiang Chen, and Zhifeng Hao

SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting . . . 157
Azqa Nadeem and Sicco Verwer


Knowledge Integration in Deep Clustering . . . 174
Nguyen-Viet-Dung Nghiem, Christel Vrain, and Thi-Bich-Hanh Dao

Anomaly Detection

ARES: Locally Adaptive Reconstruction-Based Anomaly Scoring . . . 193
Adam Goodge, Bryan Hooi, See Kiong Ng, and Wee Siong Ng

R2-AD2: Detecting Anomalies by Analysing the Raw Gradient . . . 209
Jan-Philipp Schulze, Philip Sperl, Ana Răduțoiu, Carla Sagebiel, and Konstantin Böttinger

Hop-Count Based Self-supervised Anomaly Detection on Attributed Networks . . . 225
Tianjin Huang, Yulong Pei, Vlado Menkovski, and Mykola Pechenizkiy

Deep Learning Based Urban Anomaly Prediction from Spatiotemporal Data . . . 242
Bhumika and Debasis Das

Detecting Anomalies with Autoencoders on Data Streams . . . 258
Lucas Cazzonelli and Cedric Kulbach

Anomaly Detection via Few-Shot Learning on Normality . . . 275
Shin Ando and Ayaka Yamamoto

Interpretability and Explainability

Interpretations of Predictive Models for Lifestyle-related Diseases at Multiple Time Intervals . . . 293
Yuki Oba, Taro Tezuka, Masaru Sanuki, and Yukiko Wagatsuma

Fair and Efficient Alternatives to Shapley-based Attribution Methods . . . 309
Charles Condevaux, Sébastien Harispe, and Stéphane Mussard

SMACE: A New Method for the Interpretability of Composite Decision Systems . . . 325
Gianluigi Lopardo, Damien Garreau, Frédéric Precioso, and Greger Ottosson

Calibrate to Interpret . . . 340
Gregory Scafarto, Nicolas Posocco, and Antoine Bonnefoy

Knowledge-Driven Interpretation of Convolutional Neural Networks . . . 356
Riccardo Massidda and Davide Bacciu


Neural Networks with Feature Attribution and Contrastive Explanations . . . 372
Housam K. B. Babiker, Mi-Young Kim, and Randy Goebel

Explaining Predictions by Characteristic Rules . . . 389
Amr Alkhatib, Henrik Boström, and Michalis Vazirgiannis

Session-Based Recommendation Along with the Session Style of Explanation . . . 404
Panagiotis Symeonidis, Lidija Kirjackaja, and Markus Zanker

ProtoMIL: Multiple Instance Learning with Prototypical Parts for Whole-Slide Image Classification . . . 421
Dawid Rymarczyk, Adam Pardyl, Jarosław Kraus, Aneta Kaczyńska, Marek Skomorowski, and Bartosz Zieliński

VCNet: A Self-explaining Model for Realistic Counterfactual Generation . . . 437
Victor Guyomard, Françoise Fessant, Thomas Guyet, Tassadit Bouadi, and Alexandre Termier

Ranking and Recommender Systems

A Recommendation System for CAD Assembly Modeling Based on Graph Neural Networks . . . 457
Carola Gajek, Alexander Schiendorfer, and Wolfgang Reif

AD-AUG: Adversarial Data Augmentation for Counterfactual Recommendation . . . 474
Yifan Wang, Yifang Qin, Yu Han, Mingyang Yin, Jingren Zhou, Hongxia Yang, and Ming Zhang

Bi-directional Contrastive Distillation for Multi-behavior Recommendation . . . 491
Yabo Chu, Enneng Yang, Qiang Liu, Yuting Liu, Linying Jiang, and Guibing Guo

Improving Micro-video Recommendation by Controlling Position Bias . . . 508
Yisong Yu, Beihong Jin, Jiageng Song, Beibei Li, Yiyuan Zheng, and Wei Zhuo

Mitigating Confounding Bias for Recommendation via Counterfactual Inference . . . 524
Ming He, Xinlei Hu, Changshu Li, Xin Chen, and Jiwen Wang

Recommending Related Products Using Graph Neural Networks in Directed Graphs . . . 541
Srinivas Virinchi, Anoop Saladi, and Abhirup Mondal


A U-Shaped Hierarchical Recommender by Multi-resolution Collaborative Signal Modeling . . . 558
Peng Yi, Xiongcai Cai, and Ziteng Li

Basket Booster for Prototype-based Contrastive Learning in Next Basket Recommendation . . . 574
Ting-Ting Su, Zhen-Yu He, Man-Sheng Chen, and Chang-Dong Wang

Graph Contrastive Learning with Adaptive Augmentation for Recommendation . . . 590
Mengyuan Jing, Yanmin Zhu, Tianzi Zang, Jiadi Yu, and Feilong Tang

Multi-interest Extraction Joint with Contrastive Learning for News Recommendation . . . 606
Shicheng Wang, Shu Guo, Lihong Wang, Tingwen Liu, and Hongbo Xu

Transfer and Multitask Learning

On the Relationship Between Disentanglement and Multi-task Learning . . . 625
Łukasz Maziarka, Aleksandra Nowak, Maciej Wołczyk, and Andrzej Bedychaj

InCo: Intermediate Prototype Contrast for Unsupervised Domain Adaptation . . . 642
Yuntao Du, Hongtao Luo, Haiyang Yang, Juan Jiang, and Chongjun Wang

Fast and Accurate Importance Weighting for Correcting Sample Bias . . . 659
Antoine de Mathelin, Francois Deheeger, Mathilde Mougeot, and Nicolas Vayatis

Overcoming Catastrophic Forgetting via Direction-Constrained Optimization . . . 675
Yunfei Teng, Anna Choromanska, Murray Campbell, Songtao Lu, Parikshit Ram, and Lior Horesh

Newer is Not Always Better: Rethinking Transferability Metrics, Their Peculiarities, Stability and Performance . . . 693
Shibal Ibrahim, Natalia Ponomareva, and Rahul Mazumder

Learning to Teach Fairness-Aware Deep Multi-task Learning . . . 710
Arjun Roy and Eirini Ntoutsi

Author Index . . . 727

Clustering and Dimensionality Reduction

Pass-Efficient Randomized SVD with Boosted Accuracy

Xu Feng, Wenjian Yu(✉), and Yuyang Xie

Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, China
{fx17,xyy18}@mails.tsinghua.edu.cn

Abstract. Singular value decomposition (SVD) is a widely used tool in data analysis and numerical linear algebra. Computing the truncated SVD of a very large matrix is difficult because of excessive time and memory cost. In this work, we aim to tackle this difficulty and enable accurate SVD computation for large data that cannot be loaded into memory. We first propose a randomized SVD algorithm with fewer passes over the matrix. It halves the number of passes of the basic randomized SVD with almost no loss of accuracy. Then, a shifted power iteration technique is proposed to improve the accuracy of the result, including a dynamic scheme for updating the shift value in each power iteration. Finally, combining the proposed techniques with several acceleration techniques, we develop a Pass-efficient randomized SVD (PerSVD) algorithm for efficient and accurate treatment of large data stored on hard disk. Experiments on synthetic and real-world data validate that the proposed techniques largely improve the accuracy of randomized SVD with the same number of passes over the matrix. With 3 or 4 passes over the data, PerSVD is able to reduce the error of the SVD result by three or four orders of magnitude compared with the basic randomized SVD and single-pass SVD algorithms, with similar or less runtime and less memory usage.

Keywords: Singular value decomposition · Shifted power iteration · Random embedding · Pass-efficient algorithm

1 Introduction

Truncated singular value decomposition (SVD) has broad applications in data analysis and machine learning, such as dimension reduction, matrix completion, and information retrieval. However, for the large and high-dimensional input data from social network analysis, natural language processing, recommender systems, etc., computing the truncated SVD often consumes tremendous computational resources.

This work is supported by NSFC under grant No. 61872206.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26387-3_1.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 3–20, 2023. https://doi.org/10.1007/978-3-031-26387-3_1


A conventional method of computing the truncated SVD, i.e. the $k$ largest singular values and corresponding singular vectors, is to use svds in Matlab [3]. In svds, the Lanczos bidiagonalization process is used to compute the truncated SVD [3]. Although there are variants of svds, like lansvd in PROPACK [11], svds is still the most robust and runs fastest in most scenarios. However, svds needs a number of passes over the matrix that is several times $k$ to produce a result, which is inefficient for large data matrices that cannot be stored in RAM.

To tackle the difficulty of handling large matrices, approximate algorithms for truncated SVD have been proposed, which consume less time, less memory and fewer passes over the input matrix while sacrificing little accuracy [2,6,9,13,15,20–22]. Among them, a class of randomized methods based on random embedding through multiplication with a random matrix has gained a lot of attention [14]. The randomized method obtains a near-optimal low-rank decomposition of the matrix, and has performance advantages over classical methods in terms of computational time, pass efficiency and parallelizability. A comprehensive presentation of relevant techniques and theories can be found in [9,14].

When data is too large to be stored in RAM, traditional truncated SVD algorithms are inefficient, if not unusable, for data stored in slow memory (hard disk). The single-pass SVD algorithms [4,8,18,19,21] can tackle this difficulty, and they are also suitable for streaming data. Among these algorithms, Tropp's single-pass SVD algorithm in [19] is the state of the art. Although Tropp's single-pass SVD algorithm performs well on matrices whose singular values decay very fast, it results in large error when handling matrices with slowly decaying singular values. Therefore, how to accurately compute the truncated SVD of a large matrix stored on hard disk, on a computer with limited memory, remains a problem.
In this paper, we aim to tackle the difficulty of handling a large matrix stored on hard disk or slow memory. We propose a pass-efficient randomized SVD (PerSVD) algorithm which accurately computes the SVD of a large matrix stored on hard disk with less memory and affordable time. Major contributions and results are summarized as follows.

– We propose a technique to reduce the number of passes over the matrix in the basic randomized SVD algorithm. It takes advantage of the row-major format of the matrix and reads it row by row to build $A\Phi$ and $A^{T}A\Phi$ with just one pass over the matrix. With this algorithm, the number of passes over the matrix in the basic randomized SVD algorithm is reduced by half, with negligible loss of accuracy.
– Inspired by the shift technique in the power method [7], we propose to apply a shift in the power iteration, called shifted power iteration, to improve the accuracy of results. A dynamic scheme of updating the shift value in each power iteration is proposed to optimize the performance of the shifted power iteration. This facilitates a pass-efficient randomized SVD algorithm, i.e. PerSVD, which accurately computes the truncated SVD of a large matrix on a limited-memory computer.
– Experiments on synthetic and real large data demonstrate that the proposed techniques all improve the accuracy of the result for the same number of passes over the matrix. With the same 4 passes over the matrix, the result computed with PerSVD is up to 20,318X more accurate than that obtained with the basic randomized SVD. Moreover, PerSVD with 3 passes over the matrix consumes just 16%–26% of the memory of Tropp's algorithm [19] while producing more accurate results (with up to 7,497X reduction of error) in less runtime. For the FERET data stored as a 150 GB file, PerSVD needs just 12 min and 1.9 GB of memory to compute the truncated SVD ($k = 100$) with 3 passes over the data.

2 Preliminaries

Below we follow the Matlab conventions to specify indices of matrices and functions.

2.1 Basics of Truncated SVD

The economic SVD of a matrix $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) can be stated as:

$$A = U\Sigma V^{T}, \tag{1}$$

where $U = [u_1, u_2, \cdots, u_n]$ and $V = [v_1, v_2, \cdots, v_n]$ are orthogonal or orthonormal matrices, representing the left and right singular vectors of $A$, respectively. The $n \times n$ diagonal matrix $\Sigma$ contains the singular values $(\sigma_1, \sigma_2, \cdots, \sigma_n)$ of $A$ in descending order. The truncated SVD $A_k$ can be derived, which is an approximation of $A$:

$$A \approx A_k = U_k\Sigma_k V_k^{T}, \quad k < \min(m, n), \tag{2}$$

where $U_k$ and $V_k$ are matrices with the first $k$ columns of $U$ and $V$ respectively, and the diagonal matrix $\Sigma_k$ is the $k \times k$ upper-left submatrix of $\Sigma$. Notice that $A_k$ is the best rank-$k$ approximation of $A$ in both spectral and Frobenius norm [5].

2.2 Randomized SVD Algorithm with Power Iteration

The basic randomized SVD algorithm [9] can be described as Algorithm 1, where the power iteration in Steps 3 through 6 improves the accuracy of the result. In Algorithm 1, $\Omega$ is a Gaussian random matrix. Although other kinds of random matrices can replace $\Omega$ to reduce the computational cost of $A\Omega$, they sacrifice some accuracy. The orthonormalization operation "orth()" can be implemented with a call to a packaged QR factorization. If there is no power iteration, the $m \times l$ orthonormal matrix $Q = \mathrm{orth}(A\Omega)$ approximates the basis of the dominant subspace of $\mathrm{range}(A)$, i.e., $\mathrm{span}\{u_1, u_2, \cdots, u_l\}$. So, $A \approx QQ^{T}A$, i.e. $A \approx QB$ according to Step 8. By performing the economic SVD, i.e. (1), on the "short-and-fat" $l \times n$ matrix $B$, one finally obtains the approximate truncated SVD of $A$. Using the power iteration in Steps 3-6, one obtains $Q = \mathrm{orth}((AA^{T})^{p}A\Omega)$. This makes $Q$ better approximate the basis of the dominant subspace of $\mathrm{range}((AA^{T})^{p}A)$, which is the same as that of $\mathrm{range}(A)$, because the singular values of $(AA^{T})^{p}A$ decay more quickly than those of $A$ [9]. Therefore, the resulting singular values and vectors have better accuracy, and a larger $p$ yields more accurate results at a higher computational cost.

Orthonormalization is used in practice within the power iteration steps to alleviate the round-off error in floating-point computation. It can be performed after every other matrix-matrix multiplication to save computational cost with little sacrifice of accuracy [6,20]. This turns Steps 2-7 of Algorithm 1 into

2: $Q \leftarrow \mathrm{orth}(A\Omega)$
3: for $j \leftarrow 1, 2, \cdots, p$ do
4:   $Q \leftarrow \mathrm{orth}(AA^{T}Q)$
5: end for

Notice that the original Step 7 in Algorithm 1 is dropped.

We use the number of floating-point operations (flops) to specify the time cost of algorithms. Suppose the flop count of the multiplication of $M \in \mathbb{R}^{m\times l}$ and $N \in \mathbb{R}^{l\times l}$ is $C_{mul}ml^{2}$. Here, $C_{mul}$ reflects one addition and one multiplication. Thus, the flop count of Algorithm 1 is

$$FC_1 = (2p+2)C_{mul}mnl + (p+1)C_{qr}ml^{2} + C_{svd}nl^{2} + C_{mul}ml^{2}, \tag{3}$$

where $(2p+2)C_{mul}mnl$ reflects the $2p+2$ matrix-matrix multiplications with $A$, $(p+1)C_{qr}ml^{2}$ reflects the $p+1$ QR factorizations of an $m \times l$ matrix, and $C_{svd}nl^{2} + C_{mul}ml^{2}$ reflects the SVD and the matrix-matrix multiplication in Steps 9 and 10.

2.3 Tropp's Single-Pass SVD Algorithm

On a machine with limited memory, single-pass SVD algorithms can be used to handle very large data stored on hard disk [4,8,18,19,21]. Although the single-pass algorithm in [18] achieves lower computational complexity, it is more suitable for matrices with fast decay of singular values and $m \gg n$. Tropp's single-pass SVD algorithm [19] is the state-of-the-art single-pass SVD algorithm, with lower approximation error than its predecessors given the same sketch sizes.

Algorithm 1. Basic randomized SVD with power iteration
Input: $A \in \mathbb{R}^{m\times n}$, rank parameter $k$, oversampling parameter $l$ ($l \ge k$), power parameter $p$
Output: $U \in \mathbb{R}^{m\times k}$, $S \in \mathbb{R}^{k\times k}$, $V \in \mathbb{R}^{n\times k}$
1: $\Omega \leftarrow \mathrm{randn}(n, l)$
2: $Q \leftarrow A\Omega$
3: for $j \leftarrow 1, 2, \cdots, p$ do
4:   $G \leftarrow A^{T}Q$
5:   $Q \leftarrow AG$
6: end for
7: $Q \leftarrow \mathrm{orth}(Q)$
8: $B \leftarrow Q^{T}A$
9: $[U, S, V] \leftarrow \mathrm{svd}(B, \text{'econ'})$
10: $U \leftarrow QU$
11: $U \leftarrow U(:, 1{:}k)$, $S \leftarrow S(1{:}k, 1{:}k)$, $V \leftarrow V(:, 1{:}k)$

Tropp's single-pass SVD algorithm (Algorithm 2) first constructs several sketches of the input matrix $A$ in Steps 4-7. Then, QR factorization is performed in Steps 8 and 9 to compute orthonormal bases of the column and row space of $A$, respectively. Then, matrix $C$ is computed to approximate the core of $A$ in Step 10. Notice that '\' and '/' denote the left and right division in Matlab, respectively. This is followed by the SVD in Step 11 for computing the singular values and vectors of $C$. Finally, the singular vectors are computed by projecting the orthonormal matrices to the row and column spaces of the original matrix in Step 12.

Algorithm 2. Tropp's single-pass SVD algorithm
Input: $A \in \mathbb{R}^{m\times n}$, rank parameter $k$
Output: $U \in \mathbb{R}^{m\times k}$, $S \in \mathbb{R}^{k\times k}$, $V \in \mathbb{R}^{n\times k}$
1: $r \leftarrow 4k + 1$, $s \leftarrow 2r + 1$
2: $\Upsilon \leftarrow \mathrm{randn}(r, m)$, $\Omega \leftarrow \mathrm{randn}(r, n)$, $\Phi \leftarrow \mathrm{randn}(s, m)$, $\Psi \leftarrow \mathrm{randn}(s, n)$
3: $X \leftarrow \mathrm{zeros}(r, n)$, $Y \leftarrow \mathrm{zeros}(m, r)$, $Z \leftarrow \mathrm{zeros}(s, s)$
4: for $i \leftarrow 1, 2, \cdots, m$ do
5:   $a_i$ is the $i$-th row of $A$
6:   $X \leftarrow X + \Upsilon(:, i)a_i$, $Y(i, :) \leftarrow a_i\Omega^{T}$, $Z \leftarrow Z + \Phi(:, i)a_i\Psi^{T}$
7: end for
8: $[Q, \sim] \leftarrow \mathrm{qr}(Y, 0)$
9: $[P, \sim] \leftarrow \mathrm{qr}(X^{T}, 0)$
10: $C \leftarrow ((\Phi Q)\backslash Z)/(\Psi P)^{T}$
11: $[U, S, V] \leftarrow \mathrm{svd}(C, \text{'econ'})$
12: $U \leftarrow QU(:, 1{:}k)$, $S \leftarrow S(1{:}k, 1{:}k)$, $V \leftarrow PV(:, 1{:}k)$

The peak memory usage of Algorithm 2 is $(m+n)(2r+s) \times 8 \approx 16(m+n)k \times 8$ bytes, which is caused by all matrices computed in Algorithm 2.
And, the flop count of Algorithm 2 is

$$\begin{aligned} FC_2 &= C_{mul}m(2nr + ns + s^{2}) + C_{qr}(m+n)r^{2} + C_{mul}(m+n)rs + 2C_{solve}s^{2}k + C_{svd}r^{3} + C_{mul}(m+n)rk \\ &\approx C_{mul}m(2nr + ns + s^{2}) + C_{qr}(m+n)r^{2} + C_{mul}(m+n)r(s+k) \\ &\approx 16C_{mul}mnk + 100C_{mul}mk^{2} + 36C_{mul}nk^{2} + 16C_{qr}(m+n)k^{2}, \end{aligned} \tag{4}$$

where $C_{mul}m(2nr + ns + s^{2})$ reflects the matrix-matrix multiplications in Steps 4-7, $C_{qr}(m+n)r^{2}$ reflects the QR factorizations in Steps 8 and 9, $C_{mul}(m+n)rs + 2C_{solve}s^{2}k$ reflects the matrix-matrix multiplication and the solve operation in Step 10, and $C_{svd}r^{3} + C_{mul}(m+n)rk$ reflects the SVD and the matrix-matrix multiplication in Steps 11 and 12. When $k \ll \min(m, n)$, we can drop the $2C_{solve}s^{2}k$ and $C_{svd}r^{3}$ terms in $FC_2$.
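The constants in the last line of (4), and the factor 16 in the peak-memory estimate of Algorithm 2, follow from substituting the sketch sizes of Step 1. The short derivation below spells this out; it assumes $k \gg 1$, so that the $+1$ terms in $r$ and $s$ can be neglected.

```latex
\begin{align*}
r &= 4k+1 \approx 4k, \qquad s = 2r+1 \approx 8k, \\
(m+n)(2r+s) &\approx 16(m+n)k, \\
C_{mul}\,m(2nr+ns+s^{2}) &\approx C_{mul}\,m(8nk+8nk+64k^{2})
  = 16C_{mul}mnk + 64C_{mul}mk^{2}, \\
C_{qr}(m+n)r^{2} &\approx 16C_{qr}(m+n)k^{2}, \\
C_{mul}(m+n)r(s+k) &\approx C_{mul}(m+n)\cdot 4k\cdot 9k = 36C_{mul}(m+n)k^{2}.
\end{align*}
```

Collecting the $C_{mul}$ terms gives $16C_{mul}mnk + (64+36)C_{mul}mk^{2} + 36C_{mul}nk^{2}$, which agrees with (4).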


It should be pointed out that Tropp's algorithm does not perform well on matrices with slowly decaying singular values, exhibiting large error in the computed singular values and vectors. However, matrices with slowly decaying singular values are common in real applications.
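As a reference point for the sections that follow, the basic randomized SVD of Section 2.2 (Algorithm 1 with orthonormalization after each power step) can be sketched in NumPy. This is a minimal in-memory illustration, not the authors' C implementation; the function name `basic_rsvd`, the small test matrix with singular values $1/i$, and the parameter values are assumptions made for the example.

```python
import numpy as np

def basic_rsvd(A, k, l, p, rng):
    """Algorithm 1 with Q re-orthonormalized after every power step."""
    m, n = A.shape
    Omega = rng.standard_normal((n, l))        # Gaussian random matrix
    Q, _ = np.linalg.qr(A @ Omega)             # Q <- orth(A Omega)
    for _ in range(p):
        Q, _ = np.linalg.qr(A @ (A.T @ Q))     # Q <- orth(A A^T Q)
    B = Q.T @ A                                # l x n "short-and-fat" matrix
    U, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U)[:, :k], S[:k], Vt[:k].T     # truncate to rank k

rng = np.random.default_rng(0)
m, n, k = 300, 200, 10
# Hypothetical test matrix with prescribed singular values sigma_i = 1/i.
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 1.0 / np.arange(1, n + 1)
A = (U0 * sigma) @ V0.T

U, S, V = basic_rsvd(A, k, l=15, p=2, rng=rng)
err = np.linalg.norm(A - (U * S) @ V.T, 2)
print(err, sigma[k])   # spectral error vs. the optimal value sigma_{k+1}
```

Each product with `A` or `A.T` corresponds to one pass over the data when the matrix lives on disk, which is why this variant needs $2p+2$ passes.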

3 Pass-Efficient SVD with Shifted Power Iteration

In this section, we develop a pass-efficient randomized SVD algorithm named PerSVD. Firstly, we develop a pass-efficient scheme to reduce the passes over $A$ within the basic randomized SVD algorithm. Secondly, inspired by the shift technique in the power method [7], we propose a technique of shifted power iteration to improve the accuracy of the result. Finally, combining these with the proposed scheme for updating the shift in each power iteration, we describe the pass-efficient PerSVD algorithm, which is able to accurately compute the SVD of a large matrix on hard disk with less memory and affordable time.

3.1 Randomized SVD with Fewer Passes

Suppose $a_i$ is the $i$-th ($i \le m$) row of matrix $A \in \mathbb{R}^{m\times n}$, and $\Phi \in \mathbb{R}^{n\times l}$. To compute $Y = A\Phi$ and $W = A^{T}A\Phi$ with $a_i$, we have

$$Y(i, :) = a_i\Phi, \qquad W = A^{T}Y = \sum_{i=1}^{m} a_i^{T}\,Y(i, :), \tag{5}$$

which shows that data stored in the row-major format can be read once to compute both $Y = A\Phi$ and $W = A^{T}A\Phi$. This idea is similar to that employed in [21,22]. Below, we derive that it can be used repeatedly to compute the $(A^{T}A)^{p}\Omega$ in the basic randomized SVD algorithm with power iteration.

With $Y$ and $W$, we can develop the formulation of $Q$ and $B$ in Steps 7 and 8 of Algorithm 1. Suppose $\Phi = (A^{T}A)^{p}\Omega$, $Y = A\Phi$, $W = A^{T}A\Phi$ and $[Q, \tilde{S}, \tilde{V}] \leftarrow \mathrm{svd}(Y, \text{'econ'})$. Combining this with the fact $W = A^{T}Y$, we can derive

$$W = A^{T}Y = A^{T}Q\tilde{S}\tilde{V}^{T} \;\Rightarrow\; W\tilde{V}\tilde{S}^{-1} = A^{T}Q, \tag{6}$$

which implies $Q(W\tilde{V}\tilde{S}^{-1})^{T} = QQ^{T}A \approx A$. Because $Q$ is the orthonormalization of $A(A^{T}A)^{p}\Omega$, this approximation has the same accuracy as $A \approx QB$ in Algorithm 1. With (5) and (6) combined, the randomized SVD with fewer passes is proposed and described as Algorithm 3. Now, the randomized SVD with $p$ power iterations needs just $p+1$ passes over $A$, which is half of the $2p+2$ passes in Algorithm 1.

The above deduction reveals that Algorithm 3 is mathematically equivalent to the basic randomized SVD (the modified Algorithm 1) in exact arithmetic. In practice, with numerical error, the computational results of both algorithms are very close, which means Algorithm 3 largely reduces the passes with just negligible loss of accuracy. Besides, it is easy to see that we can read multiple rows at once to compute $Y$ and $W$ by (5).
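The one-pass accumulation in (5), including the remark about reading several rows at once, can be illustrated with a short NumPy sketch. The helper name `one_pass_products`, the block size, and the in-memory array standing in for a file on disk are assumptions made for the example.

```python
import numpy as np

def one_pass_products(A, Phi, block=64):
    """Return Y = A @ Phi and W = A.T @ A @ Phi, touching each row of A once."""
    m, n = A.shape
    l = Phi.shape[1]
    Y = np.empty((m, l))
    W = np.zeros((n, l))
    for start in range(0, m, block):
        rows = A[start:start + block]            # the only read of these rows
        Y[start:start + block] = rows @ Phi      # Y(i,:) = a_i Phi
        W += rows.T @ Y[start:start + block]     # W += a_i^T Y(i,:)
    return Y, W

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 80))
Phi = rng.standard_normal((80, 12))
Y, W = one_pass_products(A, Phi)
```

With data on disk, `rows` would be filled by reading the next block of the row-major file, so only one block of `A` is held in memory at a time in addition to `Y`, `W` and `Phi`.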

Algorithm 3. Randomized SVD with fewer passes over the matrix
Input: $A \in \mathbb{R}^{m\times n}$, $k$, $l$ ($l \ge k$), $p$
Output: $U \in \mathbb{R}^{m\times k}$, $S \in \mathbb{R}^{k\times k}$, $V \in \mathbb{R}^{n\times k}$
1: $\Omega \leftarrow \mathrm{randn}(n, l)$
2: $Q \leftarrow \mathrm{orth}(\Omega)$
3: for $j = 1, 2, \cdots, p+1$ do
4:   $W \leftarrow \mathrm{zeros}(n, l)$
5:   for $i = 1, 2, \cdots, m$ do
6:     $a_i$ is the $i$-th row of matrix $A$
7:     $Y(i, :) \leftarrow a_iQ$, $W \leftarrow W + a_i^{T}Y(i, :)$
8:   end for
9:   if $j == p+1$ break
10:   $Q \leftarrow \mathrm{orth}(W)$
11: end for
12: $[Q, \tilde{S}, \tilde{V}] \leftarrow \mathrm{svd}(Y, \text{'econ'})$
13: $B \leftarrow \tilde{S}^{-1}\tilde{V}^{T}W^{T}$
14: $[U, S, V] \leftarrow \mathrm{svd}(B, \text{'econ'})$
15: $U \leftarrow QU(:, 1{:}k)$, $S \leftarrow S(1{:}k, 1{:}k)$, $V \leftarrow V(:, 1{:}k)$
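The steps of Algorithm 3 can be sketched in NumPy as follows. This is an illustrative in-memory version under stated assumptions: `rsvd_fewer_passes` is a hypothetical name, "orth" is realized with a QR factorization, rows are read in blocks, and the test matrix with singular values $1/i$ is made up for the example.

```python
import numpy as np

def rsvd_fewer_passes(A, k, l, p, rng, block=64):
    """Algorithm 3: p power iterations with p + 1 passes over A."""
    m, n = A.shape
    Q, _ = np.linalg.qr(rng.standard_normal((n, l)))      # Steps 1-2
    Y = np.empty((m, l))
    for j in range(p + 1):                                # Step 3
        W = np.zeros((n, l))                              # Step 4
        for s in range(0, m, block):                      # Steps 5-8: one pass
            rows = A[s:s + block]
            Y[s:s + block] = rows @ Q
            W += rows.T @ Y[s:s + block]
        if j == p:                                        # Step 9
            break
        Q, _ = np.linalg.qr(W)                            # Step 10
    Qy, S_t, Vt_t = np.linalg.svd(Y, full_matrices=False) # Step 12
    B = (Vt_t / S_t[:, None]) @ W.T                       # Step 13: S~^-1 V~^T W^T
    U, S, Vt = np.linalg.svd(B, full_matrices=False)      # Step 14
    return (Qy @ U)[:, :k], S[:k], Vt[:k].T               # Step 15

rng = np.random.default_rng(0)
m, n, k = 300, 200, 10
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 1.0 / np.arange(1, n + 1)          # hypothetical spectrum
A = (U0 * sigma) @ V0.T

U, S, V = rsvd_fewer_passes(A, k, l=15, p=2, rng=rng)
err = np.linalg.norm(A - (U * S) @ V.T, 2)
print(err, sigma[k])
```

Note that `B` is formed from `W` and the small factors of `Y` only, so no extra pass over `A` is needed to obtain $Q^{T}A$.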

3.2 The Idea of Shifted Power Iteration

In Algorithm 3, the computation $Q \leftarrow A^{T}AQ$ in the power iteration is the same as that in the power method for computing the largest eigenvalue and corresponding eigenvector of $A^{T}A$. For the power method, a shift can be used to accelerate the convergence of the iteration by reducing the ratio between the second largest eigenvalue and the largest one [7]. This inspires our idea of shifted power iteration. To derive our method, we first give two Lemmas [7].

Lemma 1. For a symmetric matrix $A$, its singular values are the absolute values of its eigenvalues. And, for any eigenvalue $\lambda$ of $A$, the left singular vector corresponding to its singular value $|\lambda|$ is the normalized eigenvector for $\lambda$.

Lemma 2. Suppose matrix $A \in \mathbb{R}^{n\times n}$, and a shift $\alpha \in \mathbb{R}$. For any eigenvalue $\lambda$ of $A$, $\lambda - \alpha$ is an eigenvalue of $A - \alpha I$, where $I$ is the identity matrix. And, the eigenspace of $A$ for $\lambda$ is the same as the eigenspace of $A - \alpha I$ for $\lambda - \alpha$.

Because $A^{T}A$ is a symmetric positive semi-definite matrix, its singular values are its eigenvalues according to Lemma 1. We use $\sigma_i(\cdot)$ to denote the $i$-th largest singular value. Along with Lemma 2, we see that $\sigma_i(A^{T}A) - \alpha$ is an eigenvalue of $A^{T}A - \alpha I$. Then, $|\sigma_i(A^{T}A) - \alpha|$ is a singular value of $A^{T}A - \alpha I$ according to Lemma 1. This can be illustrated by Fig. 1. Notice that $|\sigma_i(A^{T}A) - \alpha|$ is not necessarily the $i$-th largest singular value.

For Algorithm 1, the decay trend of the first $l$ singular values of the handled matrix affects the accuracy of the result. If $\sigma_i(A^{T}A) - \alpha > 0$ ($i \le l$), and they are the $l$ largest singular values of $A^{T}A - \alpha I$, these shifted singular values obviously exhibit faster decay (see Fig. 1). The following Theorem states when these conditions are satisfied.

Fig. 1. The illustration of the singular value curves of $A^{T}A$ and $A^{T}A - \alpha I$.

Theorem 1. Suppose positive number $\alpha \le \sigma_l(A^{T}A)/2$ and $i \le l$. Then, $\sigma_i(A^{T}A - \alpha I) = \sigma_i(A^{T}A) - \alpha$, where $\sigma_i(\cdot)$ denotes the $i$-th largest singular value. And, the left singular vector corresponding to the $i$-th singular value of $A^{T}A - \alpha I$ is the same as the left singular vector corresponding to the $i$-th singular value of $A^{T}A$.

The complete proof is in Appendix A.1. The first statement of Theorem 1 is straightforward from Fig. 1. The second statement can be derived from the statements on relationships between eigenvectors and singular vectors in Lemmas 1 and 2.

According to Theorem 1, if we choose a shift $\alpha \le \sigma_l(A^{T}A)/2$ we can change the computation $Q = A^{T}AQ$ to $Q = (A^{T}A - \alpha I)Q$ in the power iteration, with the approximated dominant subspace unchanged. We call this shifted power iteration. For each step of shifted power iteration, this makes $AQ$ approximate the basis of the dominant subspace of $\mathrm{range}(A)$ to a larger extent than executing an original power iteration step, because the singular values of $A^{T}A - \alpha I$ decay faster than those of $A^{T}A$. Therefore, the shifted power iteration would improve the accuracy of the randomized SVD algorithm with the same power parameter $p$. The remaining problem is how to set the shift $\alpha$.

Consider the change of the ratio of singular values from $\frac{\sigma_i(A^{T}A)}{\sigma_1(A^{T}A)}$ to $\frac{\sigma_i(A^{T}A - \alpha I)}{\sigma_1(A^{T}A - \alpha I)}$, for $i \le l$. It is easy to see

$$\frac{\sigma_i(A^{T}A - \alpha I)}{\sigma_1(A^{T}A - \alpha I)} = \frac{\sigma_i(A^{T}A) - \alpha}{\sigma_1(A^{T}A) - \alpha} < \frac{\sigma_i(A^{T}A)}{\sigma_1(A^{T}A)}, \tag{7}$$

if the assumption on $\alpha$ in Theorem 1 holds. And, the larger the value of $\alpha$, the smaller the ratio $\frac{\sigma_i(A^{T}A - \alpha I)}{\sigma_1(A^{T}A - \alpha I)}$, meaning faster decay of singular values. Therefore, to maximize the effect of shifted power iteration on improving the accuracy, we should choose the shift $\alpha$ as large as possible while satisfying $\alpha \le \sigma_l(A^{T}A)/2$.
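The effect described by Theorem 1 and (7) can be checked numerically on a small example. The sketch below uses a hypothetical $60 \times 40$ Gaussian matrix and forms $A^{T}A$ explicitly, which is feasible only at this toy size; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((60, 40))
l = 10
G = A.T @ A
sig = np.linalg.svd(G, compute_uv=False)         # sigma_i(A^T A), descending
alpha = sig[l - 1] / 2                           # largest shift allowed by Theorem 1
sig_shift = np.linalg.svd(G - alpha * np.eye(40), compute_uv=False)

ratio = sig[:l] / sig[0]                         # sigma_i / sigma_1 without shift
ratio_shift = sig_shift[:l] / sig_shift[0]       # the same ratio after shifting
print(np.round(ratio, 3))
print(np.round(ratio_shift, 3))
```

For $i \le l$ the shifted singular values equal $\sigma_i(A^{T}A) - \alpha$, and every ratio with $\sigma_i < \sigma_1$ becomes strictly smaller, which is the faster decay that the shifted power iteration exploits.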


Notice that calculating $\sigma_l(A^{T}A)$ is very difficult. Our idea is to use the singular value of $A^{T}AQ$ at the first step of the power iteration to approximate $\sigma_l(A^{T}A)$ and set the shift $\alpha$.

Lemma 3. [10] Let $A, C \in \mathbb{R}^{m\times n}$ be given. The following inequalities hold for the decreasingly ordered singular values of $A$, $C$ and $AC^{T}$ ($1 \le i, j \le \min(m, n)$ and $i + j - 1 \le \min(m, n)$):

$$\sigma_{i+j-1}(AC^{T}) \le \sigma_i(A)\,\sigma_j(C), \tag{8}$$

and

$$\sigma_{i+j-1}(A + C) \le \sigma_i(A) + \sigma_j(C). \tag{9}$$

Based on Lemma 3, i.e. (3.3.17) and (3.3.18) in [10], we prove Theorem 2.

Theorem 2. Suppose $Q \in \mathbb{R}^{m\times l}$ ($l \le m$) is an orthonormal matrix and $A \in \mathbb{R}^{m\times m}$. Then, for any $i \le l$,

$$\sigma_i(AQ) \le \sigma_i(A). \tag{10}$$

Proof. We append zero columns to $Q$ to get an $m \times m$ matrix $C^{T} = [Q, 0] \in \mathbb{R}^{m\times m}$. Since $Q$ is an orthonormal matrix, $\sigma_1(C) = 1$. According to (8) in Lemma 3,

$$\sigma_i(AC^{T}) \le \sigma_i(A)\,\sigma_1(C) = \sigma_i(A). \tag{11}$$

Because $AC^{T} = [AQ, 0]$, for any $i \le l$, $\sigma_i(AQ) = \sigma_i(AC^{T})$. Then, combining this with (11) we can prove (10).

Suppose $Q \in \mathbb{R}^{n\times l}$ is the orthonormal matrix in the power iteration of Algorithm 3. According to Theorem 2,

$$\sigma_i(A^{T}AQ) \le \sigma_i(A^{T}A), \quad i \le l, \tag{12}$$

which means we can set $\alpha = \sigma_l(A^{T}AQ)/2$ to guarantee the requirement on $\alpha$ in Theorem 1 for performing the shifted power iteration. In order to do the orthonormalization for alleviating round-off error and to calculate $\sigma_l(A^{T}AQ)$, we implement "orth(·)" with the economic SVD. This has a similar computational cost to using QR factorization, and the resulting matrix of left singular vectors contains an orthonormal basis of the same subspace. So far, we can obtain the value of $\alpha$ at the first step of the power iteration, and then we perform $Q = (A^{T}A - \alpha I)Q$ in the following iteration steps. Combining the shifted power iteration with a fixed shift value, we derive a randomized SVD algorithm with shifted power iteration as Algorithm 4.

3.3 Update Shift in Each Power Iteration

In order to improve the accuracy of Algorithm 4, we try to make the shift α as large as possible in each power iteration. Therefore, we further propose a dynamic scheme to set α, which updates α with larger values and thus increases the decay of singular values.


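Algorithm 4. Randomized SVD with shifted power iteration
Input: $A \in \mathbb{R}^{m\times n}$, $k$, $l$ ($l \ge k$), $p$
Output: $U \in \mathbb{R}^{m\times k}$, $S \in \mathbb{R}^{k\times k}$, $V \in \mathbb{R}^{n\times k}$
1: $\Omega \leftarrow \mathrm{randn}(n, l)$
2: $Q \leftarrow \mathrm{orth}(\Omega)$
3: $\alpha \leftarrow 0$
4: for $j = 1, 2, \cdots, p+1$ do
5:   $W \leftarrow \mathrm{zeros}(n, l)$
6:   for $i = 1, 2, \cdots, m$ do
7:     $a_i$ is the $i$-th row of matrix $A$
8:     $Y(i, :) \leftarrow a_iQ$, $W \leftarrow W + a_i^{T}Y(i, :)$
9:   end for
10:   if $j == p+1$ break
11:   $[Q, \hat{S}, \sim] \leftarrow \mathrm{svd}(W - \alpha Q, \text{'econ'})$
12:   if $\alpha == 0$ then $\alpha \leftarrow (\hat{S}(l, l) + \alpha)/2$
13: end for
14: $[Q, \tilde{S}, \tilde{V}] \leftarrow \mathrm{svd}(Y, \text{'econ'})$
15: $B \leftarrow \tilde{S}^{-1}\tilde{V}^{T}W^{T}$
16: $[U, S, V] \leftarrow \mathrm{svd}(B, \text{'econ'})$
17: $U \leftarrow QU(:, 1{:}k)$, $S \leftarrow S(1{:}k, 1{:}k)$, $V \leftarrow V(:, 1{:}k)$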
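How Steps 11 and 12 obtain a valid fixed shift in the first iteration can be illustrated numerically. The sketch below is a toy check on a hypothetical Gaussian matrix: it forms $W = A^{T}AQ$ for a random orthonormal $Q$ and confirms the bound (12), so that $\alpha = \sigma_l(A^{T}AQ)/2$ satisfies the condition of Theorem 1. The sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((80, 50))
l = 8
Q, _ = np.linalg.qr(rng.standard_normal((50, l)))   # orthonormal n x l matrix
W = A.T @ (A @ Q)                                   # A^T A Q, as accumulated in Steps 6-9
s_W = np.linalg.svd(W, compute_uv=False)            # sigma_i(A^T A Q), i = 1..l
s_G = np.linalg.svd(A.T @ A, compute_uv=False)      # sigma_i(A^T A), for comparison only
alpha = s_W[l - 1] / 2                              # Step 12 with alpha = 0 initially
print(alpha, s_G[l - 1] / 2)
```

In the algorithm itself $\sigma_l(A^{T}A)$ is never computed; it appears here only to show that the chosen $\alpha$ stays below $\sigma_l(A^{T}A)/2$.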

In the iteration steps of shifted power iteration, it is convenient to calculate the singular values of $(A^{T}A - \alpha I)Q$. The following Theorems state how to use them to approximate $\sigma_i(A^{T}A)$. So, in each iteration step we obtain a valid value of the shift and update $\alpha$ with it if it is larger than the current $\alpha$.

Theorem 3. Suppose $A \in \mathbb{R}^{m\times n}$, $Q \in \mathbb{R}^{n\times l}$ ($l < n$) is an orthonormal matrix and $0 < \alpha < \sigma_l(A^{T}A)/2$. Then,

$$\sigma_i((A^{T}A - \alpha I)Q) + \alpha \le \sigma_i(A^{T}A), \quad i \le l. \tag{13}$$

Proof. For any $i \le l$,

$$\sigma_i((A^{T}A - \alpha I)Q) + \alpha \le \sigma_i(A^{T}A - \alpha I) + \alpha = \sigma_i(A^{T}A), \tag{14}$$

due to Theorem 2 and Theorem 1.

Theorem 4. Suppose $A \in \mathbb{R}^{m\times n}$, $Q \in \mathbb{R}^{n\times l}$ ($l < n$) is an orthonormal matrix, $\alpha^{(0)} = 0$ and $\alpha^{(u)} = (\sigma_l(A^{T}AQ - \alpha^{(u-1)}Q) + \alpha^{(u-1)})/2$ for any $u > 0$. Then, $\alpha^{(0)}, \alpha^{(1)}, \alpha^{(2)}, \cdots$ are in ascending order.

Proof. We prove this Theorem using induction. When $u = 1$, $\alpha^{(1)} = \sigma_l(A^{T}AQ)/2 \ge \alpha^{(0)}$. When $u > 1$, suppose $\alpha^{(u-1)} \ge \alpha^{(u-2)}$. Then, according to (9) in Lemma 3,

$$\begin{aligned} \sigma_l(A^{T}AQ - \alpha^{(u-2)}Q) &= \sigma_l(A^{T}AQ - \alpha^{(u-1)}Q + (\alpha^{(u-1)} - \alpha^{(u-2)})Q) \\ &\le \sigma_l(A^{T}AQ - \alpha^{(u-1)}Q) + \alpha^{(u-1)} - \alpha^{(u-2)}. \end{aligned} \tag{15}$$

Therefore,

$$\alpha^{(u-1)} = \frac{\sigma_l(A^{T}AQ - \alpha^{(u-2)}Q) + \alpha^{(u-2)}}{2} \le \frac{\sigma_l(A^{T}AQ - \alpha^{(u-1)}Q) + \alpha^{(u-1)}}{2} = \alpha^{(u)}. \tag{16}$$

According to these equations, this Theorem is proven.

Remark 2. According to the proof of Theorem 4, it can be simply proven that $\alpha^{(0)}, \alpha^{(1)}, \alpha^{(2)}, \cdots$ are in ascending order when $\alpha^{(0)} \le \alpha^{(1)}$, where $\alpha^{(0)} \ge 0$ and $\alpha^{(u)} = (\sigma_l(A^{T}AQ - \alpha^{(u-1)}Q) + \alpha^{(u-1)})/2$ for any $u > 0$.

So, we can increase $\alpha$ in each shifted power iteration with the following steps.

11: while $\alpha$ does not converge do
12:   $[\sim, \hat{S}, \sim] \leftarrow \mathrm{svd}(W - \alpha Q, \text{'econ'})$
13:   if $\alpha > \hat{S}(l, l)$ then break
14:   $\alpha \leftarrow (\hat{S}(l, l) + \alpha)/2$
15: end while

These steps are inserted before Step 11 in Algorithm 4. Among them, performing the SVD consumes much time and can be optimized with the following trick. Suppose $C = A^{T}AQ - \alpha Q$. Then, the singular values of $C$ are the square roots of the eigenvalues of $C^{T}C$. We can derive

$$C^{T}C = Q^{T}A^{T}AA^{T}AQ - 2\alpha Q^{T}A^{T}AQ + \alpha^{2}I = W^{T}W - 2\alpha Y^{T}Y + \alpha^{2}I. \tag{17}$$

Therefore, we can first compute two matrices $D_1 = W^{T}W$ and $D_2 = Y^{T}Y$ to replace the SVD with $[\sim, \hat{S}^{2}] \leftarrow \mathrm{eig}(D_1 - 2\alpha D_2 + \alpha^{2}I)$, which just applies an eigenvalue decomposition to an $l \times l$ matrix to update $\alpha$.

Combining the techniques proposed in the previous subsections with the dynamic scheme to update $\alpha$ in each power iteration, we derive the PerSVD algorithm, which is described as Algorithm 5. In Steps 14 and 18, it checks if $\frac{\sigma_l((A^{T}A - \alpha I)Q) + \alpha}{2}$ is larger than $\alpha$. For the same setting of the power parameter $p$, Algorithm 5 has a computational cost comparable to that of Algorithm 3, but achieves results with better accuracy due to the usage of shifted power iteration.

For the matrix $Q$ computed with Algorithm 5, we have derived a bound on $\|QQ^{T}A - A\|$ which reflects how close the computed truncated SVD is to optimal. And, we prove that the bound is smaller than that derived in [17] for $Q$ computed with the basic randomized SVD algorithm. The complete proof is given in Appendix A.2.

To accelerate Algorithm 5, we can use eigenvalue decomposition to compute the economic SVD or the orthonormal basis of a "tall-and-skinny" matrix in Steps 2, 20 and 22. This is accomplished by using the eigSVD algorithm from [6], described in Appendix A.3.

3.4 Analysis of Computational Cost

Algorithm 5. Pass-efficient randomized SVD with shifted power iteration (PerSVD)
Input: $A \in \mathbb{R}^{m\times n}$, $k$, $l$ ($l \ge k$), $p$
Output: $U \in \mathbb{R}^{m\times k}$, $S \in \mathbb{R}^{k\times k}$, $V \in \mathbb{R}^{n\times k}$
1: $\Omega \leftarrow \mathrm{randn}(n, l)$
2: $Q \leftarrow \mathrm{orth}(\Omega)$
3: $\alpha \leftarrow 0$
4: for $j = 1, 2, \cdots, p+1$ do
5:   $W \leftarrow \mathrm{zeros}(n, l)$
6:   for $i = 1, 2, \cdots, m$ do
7:     $a_i$ is the $i$-th row of matrix $A$
8:     $Y(i, :) \leftarrow a_iQ$, $W \leftarrow W + a_i^{T}Y(i, :)$
9:   end for
10:   if $j == p+1$ break
11:   $D_1 \leftarrow W^{T}W$, $D_2 \leftarrow Y^{T}Y$
12:   while $\alpha$ does not converge do
13:     $[\sim, \hat{S}^{2}] \leftarrow \mathrm{eig}(D_1 - 2\alpha D_2 + \alpha^{2}I)$
14:     if $\alpha > \hat{S}(l, l)$ then break
15:     $\alpha \leftarrow (\hat{S}(l, l) + \alpha)/2$
16:   end while
17:   $[Q, \hat{S}, \sim] \leftarrow \mathrm{svd}(W - \alpha Q, \text{'econ'})$
18:   if $\alpha < \hat{S}(l, l)$ then $\alpha \leftarrow (\hat{S}(l, l) + \alpha)/2$
19: end for
20: $[Q, \tilde{S}, \tilde{V}] \leftarrow \mathrm{svd}(Y, \text{'econ'})$
21: $B \leftarrow \tilde{S}^{-1}\tilde{V}^{T}W^{T}$
22: $[U, S, V] \leftarrow \mathrm{svd}(B, \text{'econ'})$
23: $U \leftarrow QU(:, 1{:}k)$, $S \leftarrow S(1{:}k, 1{:}k)$, $V \leftarrow V(:, 1{:}k)$

Firstly, we analyze the peak memory usage of Algorithm 5. Because the SVD in Step 17 costs an extra $2nl \times 8$ bytes of memory in addition to the space of $W$ and $Q$, the memory usage in Step 17 is $(m + 4n)l \times 8$ bytes, which reflects the space of one $m \times l$ and two $n \times l$ matrices and the space required by the SVD operation. Because the SVD in Step 20 is replaced by eigSVD, the memory usage at Step 20 is $(2m + n)l \times 8$ bytes. Therefore, the peak memory usage of Algorithm 5 is $\max((m + 4n)l, (2m + n)l) \times 8$ bytes. Usually we set $l = 1.5k$, so the memory usage of Algorithm 2 ($16(m + n)k \times 8$ bytes) is several times larger than that of Algorithm 5 ($\max(1.5(m + 4n)k, 1.5(2m + n)k) \times 8$ bytes).

Secondly, we analyze the flop count of Algorithm 5. Because eigSVD is used in Steps 2, 20 and 22, and the flop count of eigSVD on an $m \times l$ matrix is $2C_{mul}ml^{2} + C_{eig}l^{3}$, the flop count of those computations is $C_{mul}(4n + 2m)l^{2} + 3C_{eig}l^{3}$. Because the computations in Steps 12-16 all involve $l \times l$ matrices and $l \ll \min(m, n)$, we drop the flop count of Steps 12-16. Therefore, the flop count of Algorithm 5 is

$$\begin{aligned} FC_5 &= (2p+2)C_{mul}mnl + pC_{mul}(m+n)l^{2} + pC_{svd}nl^{2} + C_{mul}nl^{2} + C_{mul}mlk + C_{mul}(4n + 2m)l^{2} + 3C_{eig}l^{3} \\ &\approx (2p+2)C_{mul}mnl + pC_{mul}(m+n)l^{2} + pC_{svd}nl^{2} + C_{mul}(mlk + 2ml^{2} + 5nl^{2}), \end{aligned} \tag{18}$$

where $(2p+2)C_{mul}mnl$ reflects the $2p+2$ matrix-matrix multiplications in Steps 6-9, $pC_{mul}(m+n)l^{2}$ reflects the matrix-matrix multiplications in Step 11, $pC_{svd}nl^{2}$ reflects the SVD in Step 17, and $C_{mul}nl^{2} + C_{mul}mlk$ reflects the matrix-matrix multiplications in Steps 21 and 23. Because $\min(m, n) \gg l$ and $FC_1$ and $FC_5$ both contain $(2p+2)C_{mul}mnl$, which reflects the main computation, the flop count of Algorithm 5 is similar to the flop count of Algorithm 1 in (3) for the same $p$.

Because the flop count of the main computation in Algorithm 5 is $(2p+2)C_{mul}mnl$ and that of Algorithm 2 is $16C_{mul}mnk$, with $l = 1.5k$ we have $(2p+2)C_{mul}mnl = (3p+3)C_{mul}mnk$, which is less than $16C_{mul}mnk$ when $p \le 4$. If $p = 2$ or $3$, we see that the total runtime of the two algorithms may be comparable, considering that Algorithm 5 reads the data multiple times from the hard disk and Algorithm 2 reads the data just once.
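Putting Section 3 together, Algorithm 5 can be sketched in NumPy as below. This is an in-memory illustration under stated assumptions rather than the authors' C/MKL code: `persvd` is a hypothetical name, rows are read in blocks, the convergence test and iteration cap of the inner loop (`tol`, `max_inner`) are choices made here because the paper does not specify them, and plain QR/SVD calls stand in for eigSVD.

```python
import numpy as np

def persvd(A, k, l, p, rng, block=64, max_inner=50, tol=1e-10):
    """Sketch of Algorithm 5 (PerSVD): p shifted power iterations, p + 1 passes."""
    m, n = A.shape
    Q, _ = np.linalg.qr(rng.standard_normal((n, l)))        # Steps 1-2
    alpha = 0.0                                             # Step 3
    Y = np.empty((m, l))
    for j in range(p + 1):                                  # Step 4
        W = np.zeros((n, l))                                # Step 5
        for s in range(0, m, block):                        # Steps 6-9: one pass
            rows = A[s:s + block]
            Y[s:s + block] = rows @ Q
            W += rows.T @ Y[s:s + block]
        if j == p:                                          # Step 10
            break
        D1, D2 = W.T @ W, Y.T @ Y                           # Step 11
        for _ in range(max_inner):                          # Steps 12-16
            M = D1 - 2 * alpha * D2 + alpha**2 * np.eye(l)  # C^T C from (17)
            s_l = np.sqrt(max(np.linalg.eigvalsh(M)[0], 0.0))  # sigma_l(W - alpha Q)
            if alpha > s_l:
                break
            new_alpha = (s_l + alpha) / 2
            done = new_alpha - alpha <= tol * new_alpha     # assumed convergence test
            alpha = new_alpha
            if done:
                break
        Q_new, S_hat, _ = np.linalg.svd(W - alpha * Q, full_matrices=False)  # Step 17
        if alpha < S_hat[-1]:                               # Step 18
            alpha = (S_hat[-1] + alpha) / 2
        Q = Q_new
    Qy, S_t, Vt_t = np.linalg.svd(Y, full_matrices=False)   # Step 20
    B = (Vt_t / S_t[:, None]) @ W.T                         # Step 21
    U, S, Vt = np.linalg.svd(B, full_matrices=False)        # Step 22
    return (Qy @ U)[:, :k], S[:k], Vt[:k].T                 # Step 23

rng = np.random.default_rng(0)
m, n, k = 300, 200, 10
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 1.0 / np.sqrt(np.arange(1, n + 1))   # hypothetical slowly decaying spectrum
A = (U0 * sigma) @ V0.T

U, S, V = persvd(A, k, l=15, p=2, rng=rng)
err = np.linalg.norm(A - (U * S) @ V.T, 2)
print(err, sigma[k])
```

The inner loop works only with $l \times l$ matrices, so updating the shift adds no pass over `A`; the shift enters the iteration solely through `W - alpha * Q`.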

4 Experimental Results

In this section, numerical experiments are carried out to validate the proposed techniques. We first compare Algorithms 1, 3, 4 and 5 (PerSVD), to validate whether the proposed schemes in Sect. 3 can remarkably reduce the passes with the same accuracy of results. Then, we compare PerSVD and Tropp's algorithm to show the advantage of PerSVD in accurately computing the SVD of a large matrix stored on hard disk (footnote 1).

We consider several matrices stored in the row-major format on hard disk as the test data, which are listed in Table 1. Firstly, two 40,000 × 40,000 matrices (denoted by Dense1 and Dense2) are synthetically generated. Dense1 is randomly generated with the $i$-th singular value following $\sigma_i = 1/i$. Dense2 is randomly generated with the $i$-th singular value following $\sigma_i = 1/\sqrt{i}$, so the singular values of Dense2 decay more slowly than those of Dense1.

Then we construct two matrices from real-world data. We use the training set of MNIST [12], which has 60k images of size 28 × 28, and we reshape each image into a vector of size 784 to obtain the first matrix of size 60,000 × 784 for the experiment. The second matrix is obtained from images of faces from the FERET database [16]. As in [8], we add two duplicates of each image to the data. For each duplicate, the values of a random choice of 10% of the pixels are set to random numbers uniformly chosen from $0, 1, \cdots, 255$. This forms a 102,042 × 393,216 matrix, whose rows consist of images. We normalize the matrix by subtracting from each row its mean, and then dividing it by its Euclidean norm.
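One way to generate matrices with prescribed singular values, in the spirit of Dense1 and Dense2, is sketched below. The paper does not state how the singular vectors are drawn, so the random orthonormal factors obtained from QR factorizations of Gaussian matrices are an assumption, as are the function name `synthetic` and the small size $n = 200$ used in place of 40,000.

```python
import numpy as np

def synthetic(n, decay, rng):
    """Random n x n matrix with singular values 1/i ('fast') or 1/sqrt(i) ('slow')."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal factor
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    i = np.arange(1, n + 1)
    sigma = 1.0 / i if decay == "fast" else 1.0 / np.sqrt(i)
    return (U * sigma) @ V.T, sigma                    # U diag(sigma) V^T

rng = np.random.default_rng(4)
dense1, s1 = synthetic(200, "fast", rng)   # Dense1-like spectrum
dense2, s2 = synthetic(200, "slow", rng)   # Dense2-like spectrum
print(s1[99] / s1[0], s2[99] / s2[0])      # ratio sigma_100 / sigma_1 for each
```

At $i = 100$ the fast spectrum has dropped to $1/100$ of its largest value while the slow one is only at $1/10$, which is the regime where single-pass sketching is reported to lose accuracy.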

1 The code is available at https://github.com/XuFengthucs/PerSVD.

Table 1. Test matrices.

Matrix    # of rows    # of columns    Space usage on hard disk
Dense1    40,000       40,000          6.0 GB
Dense2    40,000       40,000          6.0 GB
MNIST     60,000       784             180 MB
FERET     102,042      393,216         150 GB
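Plugging the FERET dimensions from Table 1 into the peak-memory estimates of Sects. 2.3 and 3.4 gives a feel for the numbers. The sketch below is plain arithmetic on those formulas; the function names are hypothetical, and the results are estimates from the formulas rather than measured values.

```python
def tropp_peak_bytes(m, n, k):
    """Peak memory of Algorithm 2: (m + n)(2r + s) * 8 bytes."""
    r = 4 * k + 1
    s = 2 * r + 1
    return (m + n) * (2 * r + s) * 8

def persvd_peak_bytes(m, n, k, l=None):
    """Peak memory of Algorithm 5: max((m + 4n)l, (2m + n)l) * 8 bytes."""
    l = int(1.5 * k) if l is None else l      # l = 1.5k as in the experiments
    return max((m + 4 * n) * l, (2 * m + n) * l) * 8

m, n, k = 102_042, 393_216, 100               # FERET, k = 100
print(tropp_peak_bytes(m, n, k) / 2**30)      # about 5.9 GiB
print(persvd_peak_bytes(m, n, k) / 2**30)     # about 1.9 GiB
```

The PerSVD estimate of roughly 1.9 GiB is in line with the 1.9 GB of memory reported for FERET in the introduction.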

All experiments are carried out on an Ubuntu server with two 8-core Intel Xeon CPUs (at 2.10 GHz) and 32 GB RAM. The proposed techniques, the basic randomized SVD algorithm and Tropp's algorithm are all implemented in C with MKL [1] and OpenMP directives for multi-thread parallel computing. svds in Matlab 2020b is used to compute the accurate results for the error metrics. We set $l = 1.5k$, and each time we read $k$ rows of the matrix to avoid extra memory cost for all algorithms. All the programs are evaluated with wall-clock runtime and peak memory usage. To simulate a machine with limited computational resources, we use just 8 threads on one CPU to test the algorithms. Because the matrix FERET is too large to be loaded in Matlab, the experiments on FERET come without accurate results for computing the error metrics.

4.1 Error Metrics

Theoretical research has revealed that the randomized SVD with power iteration produces the rank-k approximation close to optimal. Under spectral (or ˆ Σ ˆ and V) ˆ has the following mulFrebenius) norm the computational result (U, tiplicative guarantee: ˆΣ ˆV ˆ T ≤ (1 + ) A − Ak , A − U

(19)

with high probability. Based on (19), we use ˆΣ ˆV ˆ T F − A − Ak F )/ A − Ak F , F = ( A − U ˆΣ ˆV ˆ T 2 − A − Ak 2 )/ A − Ak 2 , s = ( A − U

and

(20) (21)

as the first two error metrics to evaluate the accuracy of randomized SVD algorithms in the Frobenius norm and the spectral norm. Another guarantee proposed in [15], which is stronger and more meaningful in practice, is:

$$\left|u_i^T A A^T u_i - \hat{u}_i^T A A^T \hat{u}_i\right| \le \epsilon\,\sigma_{k+1}(A)^2, \quad \forall i \le k, \tag{22}$$

where $u_i$ is the i-th left singular vector of A, and $\hat{u}_i$ is the computed i-th left singular vector. This is called the per vector error bound for singular vectors. In [15], it is demonstrated that the $(1+\epsilon)$ error bound in (19) may not guarantee any accuracy in the computed singular vectors. In contrast, the per vector guarantee (22) requires each computed singular vector to capture nearly as much variance as the corresponding accurate singular vector. Based on the per vector error bound (22), we use

$$\mathrm{PVE} = \max_{i \le k} \frac{\left|u_i^T A A^T u_i - \hat{u}_i^T A A^T \hat{u}_i\right|}{\sigma_{k+1}(A)^2} \tag{23}$$

to evaluate the accuracy of randomized SVD algorithms. Notice that the metrics (20), (21) and (23) were also used in [2] under the names “Fnorm”, “spectral” and “rayleigh(last)”.

4.2 Comparison with Basic Randomized SVD Algorithm

In order to validate the proposed techniques, we set different values of the power parameter p and run the basic randomized SVD (Algorithm 1), Algorithm 3 with the pass-efficient scheme, Algorithm 4 with a fixed shift value and the proposed PerSVD with shifted power iteration (Algorithm 5) on the test matrices. Orthonormalization is used in the power iteration of Algorithm 1. With the accurate results obtained from svds, the corresponding error metrics (20), (21) and (23) are calculated to evaluate the accuracy. For the largest matrix, FERET in Table 1, the accurate SVD cannot be computed with svds due to an out-of-memory error, so the results for it are not available. The curves of the error metrics vs. the number of passes over the matrix A are drawn in Fig. 2. We set k = 100 in this experiment.

Fig. 2. Error curves of the randomized SVD algorithms with different number of passes (k = 100).

From Fig. 2 we see that the dynamic scheme for setting the shift (i.e., Algorithm 5) consistently produces results with remarkably better accuracy than the schemes with and without the fixed shift (Algorithms 4 and 3, respectively). On Dense1 with 4 passes, the $\epsilon_s$ of Algorithm 5 is 14X and 13X smaller than that of Algorithms 3 and 4, respectively. The results of Algorithm 4 are in turn more accurate than those of Algorithm 3 with the same number of passes. Notice that the error metrics of Algorithm 4 on Dense1 and Dense2 are smaller than those of Algorithm 3, although the curves are indistinguishable in Fig. 2. For example, on Dense1 with 5 passes, the $\epsilon_s$ of Algorithm 4 is $5.8\times10^{-7}$, less than the $6.2\times10^{-7}$ of Algorithm 3, which shows that a fixed shift value yields only a limited improvement in accuracy.

With the same number of passes over A, PerSVD (Algorithm 5) exhibits much higher accuracy than the basic randomized SVD algorithm (Algorithm 1), and the reduction in error increases with the number of passes. With the same 4 passes over the matrix Dense1, the result computed with PerSVD is up to 20,318X more accurate than that obtained with the basic randomized SVD. According to the analysis in Sect. 3.4, when the number of passes is the same, the runtime of Algorithm 5 excluding the reading time is about twice that of Algorithm 1. Since the time for reading data is dominant, the total runtime of Algorithm 5 is only slightly more than that of Algorithm 1 with the same number of passes.

4.3 Comparison with Single-Pass SVD Algorithm

Now, we compare PerSVD with Tropp’s algorithm in terms of computational cost and accuracy of results. The results are listed in Table 2. In this experiment, we set p = 2 for PerSVD, which implies 3 passes over A. Table 2 shows that PerSVD not only uses less memory but also produces more accurate results than Tropp’s algorithm. PerSVD reduces $\epsilon_s$ by up to 7,497X (on Dense1 with k = 50). The peak memory usage of PerSVD is 16%–26% of that of Tropp’s algorithm, which matches the analysis in Sect. 3.4. Although PerSVD spends about 3X as much time as Tropp’s algorithm on reading data from the hard disk, the total runtime of PerSVD is less than that of Tropp’s algorithm, which also matches our analysis. On the largest matrix, FERET, PerSVD needs just 1.9 GB of memory and about 12 min to compute the truncated SVD with k = 100 of all 150 GB of data stored on the hard disk, which reflects the efficiency of PerSVD.
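The memory claim above can be checked with simple arithmetic on the peak-memory figures reported in Table 2. The following Python sketch recomputes the ratio of PerSVD's peak memory to that of Tropp's algorithm; the dictionary layout and variable names are illustrative assumptions, and only the numbers come from the table.

```python
# Peak memory from Table 2 as (Tropp, PerSVD) pairs. Each pair shares one unit
# (MB for Dense1, Dense2 and MNIST; GB for FERET), so the ratio is unit-free.
peak_memory = {
    ("Dense1", 50): (592, 144),
    ("Dense1", 100): (1133, 260),
    ("Dense2", 50): (591, 144),
    ("Dense2", 100): (1133, 260),
    ("MNIST", 50): (498, 81),
    ("MNIST", 100): (966, 156),
    ("FERET", 50): (3.71, 0.96),
    ("FERET", 100): (7.25, 1.90),
}

# Ratio of PerSVD peak memory to Tropp peak memory for every (matrix, k) row.
ratios = {key: persvd / tropp for key, (tropp, persvd) in peak_memory.items()}

for (matrix, k), ratio in ratios.items():
    print(f"{matrix:6s} k={k:3d}  PerSVD/Tropp memory = {ratio:.1%}")
```

The smallest ratios (about 16%) occur on MNIST and the largest (about 26%) on FERET, which is consistent with the quoted range.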


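Table 2. The runtime, memory cost and result errors of Tropp’s algorithm and our PerSVD algorithm (p = 2). The unit of runtime is seconds.

| Matrix | k | Tropp $t_r$ | Tropp $t$ | Tropp Memory | Tropp $\epsilon_F$ | Tropp $\epsilon_s$ | Tropp PVE | PerSVD $t_r$ | PerSVD $t$ | PerSVD Memory | PerSVD $\epsilon_F$ | PerSVD $\epsilon_s$ | PerSVD PVE |
|--------|-----|------|------|---------|------|------|------|------|-----|---------|------|-------|-------|
| Dense1 | 50  | 4.7  | 27   | 592 MB  | 0.38 | 0.46 | 0.87 | 15   | 26  | 144 MB  | 4E-4 | 6E-5  | 0.009 |
| Dense1 | 100 | 4.4  | 41   | 1133 MB | 0.39 | 0.51 | 0.66 | 14   | 30  | 260 MB  | 4E-4 | 0.001 | 0.01  |
| Dense2 | 50  | 4.4  | 27   | 591 MB  | 0.49 | 3.57 | 9.2  | 16   | 27  | 144 MB  | 7E-4 | 0.006 | 0.04  |
| Dense2 | 100 | 4.4  | 50   | 1133 MB | 0.51 | 3.36 | 9.2  | 14   | 31  | 260 MB  | 8E-4 | 0.02  | 0.04  |
| MNIST  | 50  | 0.29 | 3.9  | 498 MB  | 0.32 | 0.42 | 0.67 | 0.90 | 1.4 | 81 MB   | 4E-4 | 0.001 | 0.008 |
| MNIST  | 100 | 0.31 | 8.9  | 966 MB  | 0.14 | 0.10 | 0.24 | 0.79 | 1.5 | 156 MB  | 4E-4 | 3E-4  | 0.006 |
| FERET  | 50  | 140  | 648  | 3.71 GB | –    | –    | –    | 296  | 597 | 0.96 GB | –    | –     | –     |
| FERET  | 100 | 178  | 1366 | 7.25 GB | –    | –    | –    | 293  | 703 | 1.90 GB | –    | –     | –     |

$t_r$ means the time for reading the data, while $t$ means the total runtime (including $t_r$). “–” means the error metrics are not available for the FERET matrix.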
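The error columns of Table 2 follow the definitions (20) and (23) of Sect. 4.1. As an illustration only, the Python sketch below evaluates those two metrics on a tiny hand-made example. It is not the authors' C/MKL implementation: the function names are hypothetical, matrices are plain nested lists, and the exact singular values are assumed to be supplied by a reference solver (the paper uses svds for this). The spectral-norm metric (21) has the same form as (20) with the spectral norm in place of the Frobenius norm and is omitted here because it needs a largest-singular-value routine.

```python
import math


def frobenius_norm(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))


def reconstruct(U, s, V):
    """Return U diag(s) V^T for U (m x k), s (length k) and V (n x k)."""
    k = len(s)
    return [[sum(U[i][r] * s[r] * V[j][r] for r in range(k))
             for j in range(len(V))] for i in range(len(U))]


def eps_frobenius(A, U, s, V, sigma):
    """Metric (20). `sigma` lists the exact singular values of A, descending."""
    k = len(s)
    approx = reconstruct(U, s, V)
    residual = [[a - b for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(A, approx)]
    # Eckart-Young: ||A - A_k||_F^2 is the sum of the squared trailing singular values.
    optimal = math.sqrt(sum(x * x for x in sigma[k:]))
    return (frobenius_norm(residual) - optimal) / optimal


def pve(A, U, sigma):
    """Metric (23), using u_i^T A A^T u_i = sigma_i^2 for the exact vectors."""
    k = len(U[0])
    worst = 0.0
    for i in range(k):
        u_hat = [row[i] for row in U]
        # A^T u_hat, one entry per column of A.
        At_u = [sum(A[r][c] * u_hat[r] for r in range(len(A)))
                for c in range(len(A[0]))]
        captured = sum(x * x for x in At_u)  # u_hat^T A A^T u_hat
        worst = max(worst, abs(sigma[i] ** 2 - captured))
    return worst / sigma[k] ** 2


# Toy example: A = diag(3, 2, 1), k = 1, with a deliberately tilted left vector.
A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [3.0, 2.0, 1.0]
U_hat = [[0.8], [0.6], [0.0]]
B = [2.4, 1.2, 0.0]                          # U_hat^T A
s_hat = [math.sqrt(sum(b * b for b in B))]   # singular value of the projection
V_hat = [[b / s_hat[0]] for b in B]

print(eps_frobenius(A, U_hat, s_hat, V_hat, sigma), pve(A, U_hat, sigma))
```

In this toy case the tilted vector captures a variance of 7.2 instead of the optimal 9, so the per vector error is noticeably larger than the Frobenius error, which illustrates why (22) is described as the stronger guarantee.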

5 Conclusion

We have developed a pass-efficient randomized SVD algorithm named PerSVD to accurately compute the truncated top-k SVD. PerSVD builds on a technique that reduces the number of passes over the matrix and on an innovative shifted power iteration technique. It aims to handle real-world data with slowly decaying singular values and to accurately compute the top-k singular triplets with a couple of passes over the data. Experiments on various matrices have verified the effectiveness of PerSVD in terms of runtime, accuracy and memory cost, compared with existing randomized SVD and single-pass SVD algorithms. PerSVD is expected to become a powerful tool for computing the SVD of very large data.

References

1. Intel oneAPI Math Kernel Library. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html (2021)
2. Allen-Zhu, Z., Li, Y.: LazySVD: even faster SVD decomposition yet without agonizing pain. In: Advances in Neural Information Processing Systems, pp. 974–982 (2016)
3. Baglama, J., Reichel, L.: Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27(1), 19–42 (2005)
4. Boutsidis, C., Woodruff, D.P., Zhong, P.: Optimal principal component analysis in distributed and streaming models. In: Proceedings of the 48th Annual ACM Symposium on Theory of Computing, pp. 236–249 (2016)
5. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)
6. Feng, X., Xie, Y., Song, M., Yu, W., Tang, J.: Fast randomized PCA for sparse data. In: Proceedings of the 10th Asian Conference on Machine Learning (ACML), 14–16 Nov 2018, pp. 710–725 (2018)
7. Golub, G.H., Van Loan, C.F.: Matrix computations. JHU Press (2012)
8. Halko, N., Martinsson, P.G., Shkolnisky, Y., Tygert, M.: An algorithm for the principal component analysis of large data sets. SIAM J. Sci. Comput. 33(5), 2580–2594 (2011)
9. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
10. Horn, R.A., Johnson, C.R.: Topics in matrix analysis. Cambridge University Press (1991). https://doi.org/10.1017/CBO9780511840371
11. Larsen, R.M.: Propack-software for large and sparse SVD calculations. https://sun.stanford.edu/~rmunk/PROPACK (2004)
12. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
13. Li, H., Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., Tygert, M.: Algorithm 971: an implementation of a randomized algorithm for principal component analysis. ACM Trans. Math. Softw. 43(3), 1–14 (2017)
14. Martinsson, P.G., Tropp, J.A.: Randomized numerical linear algebra: foundations and algorithms. Acta Numer. 29, 403–572 (2020)
15. Musco, C., Musco, C.: Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In: Advances in Neural Information Processing Systems, pp. 1396–1404 (2015)
16. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 22(10), 1090–1104 (2000)
17. Rokhlin, V., Szlam, A., Tygert, M.: A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl. 31(3), 1100–1124 (2010)
18. Shishkin, S.L., Shalaginov, A., Bopardikar, S.D.: Fast approximate truncated SVD. Numer. Linear Algebra Appl. 26(4), e2246 (2019)
19. Tropp, J.A., Yurtsever, A., Udell, M., Cevher, V.: Streaming low-rank matrix approximation with an application to scientific simulation. SIAM J. Sci. Comput. 41(4), A2430–A2463 (2019)
20. Voronin, S., Martinsson, P.G.: RSVDPACK: an implementation of randomized algorithms for computing the singular value, interpolative, and CUR decompositions of matrices on multi-core and GPU architectures. arXiv preprint arXiv:1502.05366 (2015)
21. Yu, W., Gu, Y., Li, J., Liu, S., Li, Y.: Single-pass PCA of large high-dimensional data. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3350–3356 (2017)
22. Yu, W., Gu, Y., Li, Y.: Efficient randomized algorithms for the fixed-precision low-rank matrix approximation. SIAM J. Matrix Anal. Appl. 39(3), 1339–1359 (2018)

CDPS: Constrained DTW-Preserving Shapelets

Hussein El Amouri¹(✉), Thomas Lampert¹(✉), Pierre Gançarski¹(✉), and Clément Mallet²(✉)

¹ ICube, University of Strasbourg, Strasbourg, France
{helamouri,lampert,gancarski}@unistra.fr
² Univ Gustave Eiffel, IGN, ENSG, Saint-Mande, France
[email protected]

Abstract. The analysis of time series for clustering and classification is becoming ever more popular because of the increasingly ubiquitous nature of IoT, satellite constellations, and handheld and smart wearable devices, etc. The presence of phase shift, differences in sample duration, and/or compression and dilation of a signal means that Euclidean distance is unsuitable in many cases. As such, several similarity measures specific to time series have been proposed, Dynamic Time Warping (DTW) being the most popular. Nevertheless, DTW does not respect the axioms of a metric, and therefore Learning DTW-Preserving Shapelets (LDPS) has been developed to regain these properties by using the concept of the shapelet transform. LDPS learns an unsupervised representation that models DTW distances using Euclidean distance in shapelet space. This article proposes constrained DTW-preserving shapelets (CDPS), in which a limited amount of user knowledge is available in the form of must-link and cannot-link constraints, to guide the representation such that it better captures the user’s interpretation of the data rather than the algorithm’s bias. Subsequently, any unconstrained algorithm can be applied, e.g. K-means clustering, k-NN classification, etc., to obtain a result that fulfils the constraints (without explicit knowledge of them). Furthermore, this representation is generalisable to out-of-sample data, overcoming the limitations of standard transductive constrained-clustering algorithms. CDPS is shown to outperform state-of-the-art constrained-clustering algorithms on multiple time-series datasets. An open-source implementation based on PyTorch, which takes full advantage of GPU acceleration, is available from: https://git.unistra.fr/helamouri/constraineddtw-preserving-shapelets.

This work was supported by the HIATUS (ANR-18-CE23-0025) and HERELLES (ANR-20-CE23-0022) ANR projects. We thank Nvidia Corporation for donating GPUs and the Centre de Calcul de l’Université de Strasbourg for access to the GPUs used for this research.

Supplementary Information. The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26387-3_2.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 21–37, 2023. https://doi.org/10.1007/978-3-031-26387-3_2

Keywords: Shapelets · Semi-supervised learning · Clustering · Constrained-clustering · Time series · Learning representation

1 Introduction

The availability of time series data is increasing rapidly with the development of sensing technology and the increasing number of fields that use such technology. This increase in data volume means that providing ground truth labels becomes difficult due to the time and cost needed. Labelling difficulty is exacerbated when making exploratory analyses and when working in nascent domains for which classes are not well defined. For that reason, supervised approaches such as classification become unfeasible and unsupervised clustering is often preferred. However, unsupervised approaches may lead to irrelevant or unreliable results since they have no knowledge about the user’s requirements and are instead led by the algorithm’s bias. On the other hand, semi-supervised algorithms try to remove the rigid requirements of supervised approaches but retain the ability of a user to guide the algorithm to produce a meaningful output. This can be achieved by providing a set of constraints to the algorithm that encode some expert knowledge. These can take many forms, but this work is concerned with must-link and cannot-link constraints since they are the easiest to interpret and provide. Must-link and cannot-link constraints do not define what a sample represents (a class); instead they label pairs of samples as being the same (must-link), and thus belonging to the same cluster, or not (cannot-link). In this way the algorithm is guided to converge on a result that is meaningful to the user without explicitly or exhaustively labelling samples.

Generally, time series are characterised by trend, shapes, and distortions either to time or shape [22] and therefore exhibit phase shifts and warping. As such, the Euclidean distance is unsuitable and several similarity measures specific to time series have been proposed [17], for example compression-based measures [7], Levenshtein Distance [10], Longest Common Subsequence [25] and Dynamic Time Warping (DTW) [19,20]. DTW is one of the most popular since it overcomes these problems by aligning two series through the computation of a cost function based on Euclidean distance [8]; it is therefore known as an elastic measure [17]. Moreover, Paparrizos et al. show that DTW is a good basis for calculating embeddings, an approach that employs a similarity to construct a new representation.

Time series also exhibit complex structures, which are often highly correlated [22]. This makes their analysis difficult and time consuming; indeed, several attempts to accelerate DTW’s computation have been proposed [1,22]. Shapelets [30] offer a simpler approach to increase the accuracy of time-series analysis. Shapelets are phase-independent discriminative sub-sequences extracted or learnt to form features that map a time series into a more discriminative representational space, therefore increasing the reliability and interpretability of downstream tasks. Since DTW does not respect the axioms of a metric, LDPS [13] extends shapelets to preserve DTW distances in a Euclidean embedding.
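To make the contrast between the Euclidean distance and DTW discussed above concrete, the following Python sketch implements the textbook dynamic-programming recurrence for DTW. It is an illustration under stated assumptions, not the implementation used in this work: the local cost is taken to be the squared difference, no warping window is applied, and the function names are hypothetical.

```python
import math


def dtw(a, b):
    """DTW distance between two sequences with a squared-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


def euclidean(a, b):
    """Point-wise Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


peak_early = [0.0, 1.0, 0.0, 0.0]
peak_late = [0.0, 0.0, 1.0, 0.0]
print(euclidean(peak_early, peak_late))  # penalises the phase shift
print(dtw(peak_early, peak_late))        # the warping path absorbs the shift
```

In this example the two series differ only by a one-step phase shift, so the warping path aligns the peaks and DTW returns zero even though the series are not identical. This is one way to see that DTW does not satisfy the axioms of a metric.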


The contribution of this article is to introduce constrained DTW-preserving shapelets (CDPS), in which a time series representation is learnt to overcome time series distortions by approximating DTW, and is influenced by a limited amount of user knowledge provided as constraints. Thus CDPS can model a user’s interpretation, rather than being influenced by the algorithm’s bias. Subsequently, any unconstrained algorithm can be applied to the embedding, e.g. K-means clustering, k-NN classification, etc., to obtain a result that fulfils the constraints (without explicit knowledge of them). The proposed embedding process is studied in a constrained clustering setting, on multiple datasets, and is compared to COP-KMeans [27], FeatTS [23], and unsupervised DTW-preserving shapelets [13].

The representational embedding that is learnt by CDPS is generalisable to out-of-sample data, overcoming the limitations of standard constrained-clustering algorithms such as COP-KMeans. It is interpretable, since the learnt shapelets can themselves be visualised as time series. Finally, since CDPS results in a vectorial representation of the data, both the data and the constraints can be analysed using norm-based measures, something that is not possible when using DTW as a similarity measure [8]. This opens up the possibility of measuring constraint informativeness [3] and constraint consistency [26] in time-series clustering, and of explaining and interpreting the constraints, which is a concern for future work. Such measures, and notions of density, are needed to develop novel interactive and active constrained clustering processes for time series.

The rest of this article is organised as follows: in Sect. 2 related work is reviewed, in Sect. 3 the Constrained DTW-Preserving Shapelets (CDPS) algorithm is presented, in Sect. 4 CDPS is compared to constrained/semi-supervised and unconstrained approaches from the literature, and finally Sect. 6 presents the conclusions and future work.

2 Related Work

This section presents work related to shapelets and constrained clustering.

2.1 Shapelets

Shapelets are sub-sequences of time-series that were originally developed to discriminate between time-series using a tree based classifier [30,31]. As such, the shapelets themselves were chosen from a set of all possible sub-sequences of the set of time series being analysed, which is time consuming and exhaustive. Different approaches are proposed to increase the speed of finding shapelets. Rakthanmanon and Keogh [18] propose to first project the time-series into a symbolic representation to increase the speed of discovering the shapelets. Subsequently, Mueen et al. [14] introduce logical shapelets, which combines shapelets with complex rules of discrimination to increase the reliability of the shapelets and their ability to discriminate between the time-series. Sperandio [22] presents a detailed review of early shapelet approaches.


Lines et al. [12] proposed a new way of handling shapelets that separated classification from transformation. This was later extended by Hills et al. [6] to the shapelet transform, which transforms the raw data into a vectorial representation in which the shapelets define the representation space’s bases. It was proved that this separation leads to more accurate classification results even with non-tree based approaches.

2.2 Learning Shapelets

In order to overcome the exhaustive search for optimal shapelets, Grabocka et al. [4] introduce the concept of learning shapelets in a supervised setting. In this approach the optimal shapelets are learnt by minimising a classification objective function. The authors consider shapelets to be features to be learnt instead of searching for a set of possible candidates; they report that this method provides a significant improvement in accuracy compared to previous search-based approaches. Other supervised approaches have been proposed: Shah et al. [21] increase accuracy by learning more relevant and representative shapelets. This is achieved by using DTW similarity instead of Euclidean distance, since it is better adapted to measuring the similarity between the shapelets and the time series. Another approach for learning shapelets is to optimise the partial AUC [29], in which shapelets are learnt in conjunction with a classifier.

2.3 Unsupervised Shapelets

Zakaria et al. [32] introduced the first approach for clustering time series with shapelets, called unsupervised-shapelets or u-shapelets. U-shapelets are those that best partition a subset of the time series from the rest of the data set. The shapelets are chosen from a set of all possible sub-sequences by partitioning the dataset and removing the time series that are similar to the shapelet; this process is repeated until no further improvements (i.e. partitions) can be made. It is therefore an exhaustive search, as were the early supervised approaches. U-shapelets have been used in several works since their initial introduction [24,33].

Since these unsupervised methods take a similar approach to the original supervised shapelets, they have the same drawbacks. To overcome these, Zhang et al. [34] propose to combine learning shapelets with unsupervised feature selection to learn the optimal shapelets. Learning DTW-preserving shapelets (LDPS) expands the learning paradigm for shapelets by integrating additional constraints on the learnt representation. In LDPS these constrain the representation space to model the DTW distances between the time series, instead of focusing on learning shapelets that best discriminate between them.

A multitude of other unsupervised approaches to build an embedding space for time series exist (other than shapelets) and Paparrizos et al. [17] provide an extensive study of them. Generic Representation Learning (GRAIL) [15], Shift Invariant Dictionary Learning (SIDL) [35], Preserving Representation Learning method (SPIRAL) [9], and Random Warping Series (RWS) [28] are different approaches to building such representations. Since these are unsupervised they are not of concern in this article.

2.4 Constrained Clustering

Constrained clustering algorithms are those that add expert knowledge to the process, such as COP-KMeans [27] and Constrained Clustering via Spectral Regularization (CCSR) [11]. Constraints can be given in different forms, such as cluster-level constraints and instance-level constraints. Must-link and cannot-link constraints between samples fall under the latter. Many constrained clustering algorithms have been proposed, some of which have been adapted to time series. For a full review, the reader is referred to [8]. Here, those relevant to this study are mentioned. COP-KMeans is an extension to k-means that often offers state-of-the-art performance without the need to choose parameters [8]. Cluster allocations are validated using the constraint set at each iteration to verify that no constraints are violated. For use with time series, the DTW distance measure is often used along with an appropriate averaging method such as DTW barycenter averaging (DBA) [8] to calculate the cluster centres. Another semi-supervised approach developed specifically for clustering time series is FeatTS [23]. FeatTS uses a percentage of labelled samples to extract relevant features, which are used to calculate a co-occurrence matrix from a graph created by the features. The co-occurrence matrix is then used to cluster the dataset. Other approaches to time-series clustering exist, such as k-shape [16]; however, being unsupervised, these fall outside the scope of this article.
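The validation step that COP-KMeans performs on cluster allocations can be sketched as a small predicate. The Python snippet below follows the commonly described assignment rule (a sample may not join a cluster if a must-link partner is already elsewhere or a cannot-link partner is already there). The function name, the argument layout and the use of a dictionary for the partial assignment are illustrative assumptions rather than the interface of any particular implementation.

```python
def violates_constraints(sample, cluster, assignment, must_link, cannot_link):
    """Return True if putting `sample` in `cluster` breaks a constraint.

    `assignment` maps already-assigned sample indices to cluster ids;
    `must_link` and `cannot_link` are collections of index pairs.
    """
    for i, j in must_link:
        if sample in (i, j):
            other = j if sample == i else i
            # A must-link partner already sits in a different cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for i, j in cannot_link:
        if sample in (i, j):
            other = j if sample == i else i
            # A cannot-link partner already sits in this cluster.
            if other in assignment and assignment[other] == cluster:
                return True
    return False


# Samples 0 and 1 are already assigned; 2 must join 0, and 3 must avoid 1.
assignment = {0: 0, 1: 1}
must_link = [(0, 2)]
cannot_link = [(1, 3)]
print(violates_constraints(2, 1, assignment, must_link, cannot_link))
print(violates_constraints(3, 0, assignment, must_link, cannot_link))
```

If every candidate cluster is rejected for some sample, no feasible assignment exists for that ordering of the data, which is why such an algorithm can fail to return a result for some initialisations.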

3 Constrained DTW-Preserving Shapelets

This section proposes Constrained DTW-Preserving Shapelets (CDPS), which learns shapelets in a semi-supervised manner using ML and CL constraints, thereby allowing expert knowledge to influence the transformation learning process while also preserving DTW similarity and the interpretability of the resulting shapelets. Definitions and notations are presented in Sub-Sect. 3.1, and the algorithm in Sub-Sect. 3.3.

3.1 Definitions and Notations

Time series is an ordered set of real-valued observations. Let $\mathcal{T} = \{T_1, T_2, \ldots, T_N\}$ be a set of N uni-dimensional time series (for simplicity of notation; CDPS is nevertheless also applicable to multi-dimensional time series). $L_i$ is the length of a time series, such that $T_i$ is composed of $L_i$ elements (each time series may have a different length):

$$T_i = T_{i,1}, \ldots, T_{i,L_i}. \tag{1}$$

A segment of a time series $T_i$ at the m-th element with length L is denoted as $T_{i,m:L} = \{T_{i,m}, \ldots, T_{i,L}\}$.


Shapelet is an ordered set of real-valued variables, with a length smaller than, or equal to, that of the shortest time series in the dataset. Let a shapelet be denoted as S, having length $L_k$. Let $\mathcal{S} = \{S_1, \ldots, S_K\}$ be a set of K shapelets, where $S_k = S_{k,1:L_k}$. In our work, the set $\mathcal{S}$ can have shapelets with different lengths, but for simplicity we will use shapelets with the same length in the formulation.

Squared Euclidean Score is the similarity score between a shapelet $S_k$ and a time series sub-sequence $T_{i,m:l}$, with $l$ the length of the shapelet, such that

$$D_{i,k,m} = \frac{1}{l} \sum_{x=1}^{l} \left(T_{i,m+x-1} - S_{k,x}\right)^2. \tag{2}$$

Euclidean Shapelet Match represents the matching score between shapelet $S_k$ and a time series $T_i$, such that

$$\overline{T}_{i,k} = \min_{m \in \{1:L_i - L_k + 1\}} D_{i,k,m}. \tag{3}$$
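The definitions of the squared Euclidean score (2) and the Euclidean shapelet match (3) translate directly into code. The Python sketch below uses plain lists and 0-based indexing (so the start position m runs from 0 to $L_i - L_k$). The function names are illustrative and this is not the GPU implementation referred to in this article.

```python
def squared_euclidean_score(series, shapelet, m):
    """Eq. (2): mean squared difference between the shapelet and the
    sub-sequence of `series` that starts at (0-based) position m."""
    l = len(shapelet)
    return sum((series[m + x] - shapelet[x]) ** 2 for x in range(l)) / l


def shapelet_match(series, shapelet):
    """Eq. (3): best (minimum) score over all admissible start positions."""
    n_positions = len(series) - len(shapelet) + 1
    return min(squared_euclidean_score(series, shapelet, m)
               for m in range(n_positions))


series = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(shapelet_match(series, [1.0, 2.0, 1.0]))  # shapelet occurs exactly in the series
print(shapelet_match(series, [2.0, 2.0]))       # only an approximate match exists
```

Because the minimum is taken over every start position, the match score does not depend on where in the series the pattern occurs, which is the sense in which shapelets are phase-independent.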

Shapelet transform is the mapping of time series $T_i$ using Euclidean shapelet matching with respect to the set of shapelets $\mathcal{S}$, where the new vectorial representation is

$$\overline{T}_i = \{\overline{T}_{i,1}, \ldots, \overline{T}_{i,K}\}. \tag{4}$$

Constraint Sets. Let $C_k$ be the k-th cluster, ML be the set containing time series connected by a must-link and CL the set such that they are connected by a cannot-link. Thus, $\forall\, T_i, T_j$ such that $i, j \in \{1, \ldots, N\}$ and $i \neq j$ we have

$$ML = \{(i, j) \mid \forall k \in \{1, \ldots, K\},\ T_i \in C_k \Leftrightarrow T_j \in C_k\}, \tag{5}$$

$$CL = \{(i, j) \mid \forall k \in \{1, \ldots, K\},\ \neg(T_i \in C_k \wedge T_j \in C_k)\}. \tag{6}$$

3.2 Objective Function

In order to achieve a guided constrained learning approach, a new objective function is introduced based on contrastive learning [5] that extends the loss function used in LDPS [13] to a semi-supervised setting. The loss between two time series takes the form

$$\mathcal{L}(T_i, T_j) = \frac{1}{2}\left(DTW(T_i, T_j) - \beta\, Dist_{i,j}\right)^2 + \phi_{i,j}, \tag{7}$$

where $DTW(T_i, T_j)$ is the dynamic time warping similarity between time series $T_i$ and $T_j$, $Dist_{i,j} = \|\overline{T}_i - \overline{T}_j\|_2$ is the similarity measure between $T_i$ and $T_j$ in the embedded space such that $\|\cdot\|_2$ is the L2 norm, and $\beta$ scales the time-series similarity (distance) in the embedded space to the corresponding DTW similarity. The term $\phi_{i,j}$ is inspired by the contrastive loss and is defined such that

$$\phi_{i,j} = \begin{cases} \alpha\, Dist_{i,j}^2, & \text{if } (i, j) \in ML, \\ \gamma \max(0, w - Dist_{i,j})^2, & \text{if } (i, j) \in CL, \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

where $\alpha$, $\gamma$ are weights that regularise the must-link and cannot-link similarity distances respectively, and $w$ is the minimum distance between samples for them to be considered well separated in the embedded space (after which there is no influence on the loss). It is calculated using the following function:

$$w = \max_{\forall i, \forall j}\left(DTW(T_i, T_j)\right) - \log\left(\frac{DTW(T_i, T_j)}{\max_{\forall i, \forall j}\left(DTW(T_i, T_j)\right)}\right), \quad \text{such that } i \neq j.$$

The overall loss function is therefore defined such that

$$\mathcal{L}(\mathcal{T}) = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \mathcal{L}(T_i, T_j). \tag{9}$$

3.3 CDPS Algorithm

Algorithm 1. CDPS algorithm
Input: T, a set of time series; ML and CL constraint sets; L_min, minimum length of shapelets; S_max, maximum number of shapelet blocks; n_epochs, s_batch, c_batch
Output: Set S of shapelets, embeddings of T.
1: ShapeletBlocks ← Get_Shapelet_Blocks(L_min, S_max, L_i)
2: Shapelets ← Initialize_Shapelets(ShapeletBlocks)
3: for i ← 0 to n_epochs do
4:     for 1 to |T|/s_batch do
5:         minibatch ← Get_Batch(T, ML, CL, s_batch, c_batch)
6:         Compute the DTW between the T_i's and T_j's in minibatch
7:         Update the Shapelets and β by descending the gradient ∇L(T_i, T_j)
8: Embeddings ← Shapelet_Transform(T)
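The quantity whose gradient is descended in line 7 of Algorithm 1 is the pairwise loss of Eqs. (7) and (8), averaged over pairs as in Eq. (9). The Python sketch below evaluates that loss for given DTW and embedded distances. It is a plain scalar illustration with hypothetical function names: it computes no gradients, it treats $w$ as a supplied number rather than deriving it from the DTW matrix, and the default weights of 2.5 are the values reported in the experimental setup.

```python
def pair_loss(dtw_ij, dist_ij, beta, w, constraint=None, alpha=2.5, gamma=2.5):
    """Eq. (7) for one pair, with the constraint term phi of Eq. (8).

    `constraint` is "ML", "CL" or None for an unconstrained pair.
    """
    fit = 0.5 * (dtw_ij - beta * dist_ij) ** 2
    if constraint == "ML":
        phi = alpha * dist_ij ** 2                 # pull must-link pairs together
    elif constraint == "CL":
        phi = gamma * max(0.0, w - dist_ij) ** 2   # push cannot-link pairs apart up to w
    else:
        phi = 0.0
    return fit + phi


def mean_pair_loss(dtw, dist, beta, w, constraints, alpha=2.5, gamma=2.5):
    """Eq. (9): average of pair_loss over all unordered pairs (i < j).

    `dtw` and `dist` are square matrices; `constraints` maps (i, j) to "ML"/"CL".
    """
    n = len(dtw)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += pair_loss(dtw[i][j], dist[i][j], beta, w,
                               constraints.get((i, j)), alpha, gamma)
    return 2.0 * total / (n * (n - 1))


dtw = [[0.0, 2.0, 4.0], [2.0, 0.0, 2.0], [4.0, 2.0, 0.0]]
dist = [[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
print(mean_pair_loss(dtw, dist, beta=1.0, w=3.0,
                     constraints={(0, 1): "ML", (0, 2): "CL"}))
```

Note that the cannot-link term vanishes once the embedded distance reaches $w$, so well-separated cannot-link pairs stop contributing to the loss, as described in the text.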

Algorithm 1 describes CDPS’s approach to learning the representational embedding. Here, ShapeletBlocks is a dictionary containing $S_{max}$ pairs, {shapelet length; shapelet number}, where the shapelet length is $L_{min} \cdot b_{ind}$, $L_{min}$ is the minimum shapelet length and $b_{ind} \in \{1, \ldots, S_{max}\}$ is the index of the shapelet block. The number of shapelets for each block is calculated using the same approach as LDPS [13]: $10 \log(L_i - L_{min} \cdot b_{ind}) \times 10$. The parameter c_batch defines the number of constraints in each training batch; the aim of this parameter is to increase the importance of the constrained time series relative to the large number of unconstrained time series.

Initialize_Shapelets initialises the shapelets either randomly or with a rule-based approach. Here the following rule-based approach is taken: (1) shapelets are initialised by drawing a number of time series samples and then reshaping them into sub-sequences with length equal to that of the shapelets; (2) k-means clustering is then performed on the sub-sequences and the cluster centres are extracted to form the initial shapelets.

Get_Batch generates batches containing both constrained and unconstrained samples. If there are insufficient constraints to fulfil c_batch then they are repeated.

For speed and to take advantage of GPU acceleration, the above algorithm can be implemented as a 1D convolutional neural network in which each layer represents a shapelet block composed of all the shapelets having the same length, followed by max-pooling in order to obtain the embeddings. The derivation of the gradient of $\mathcal{L}(\mathcal{T})$, $\nabla\mathcal{L}(T_i, T_j)$ (Algorithm 1, Line 7), is given in the supplementary material.
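The behaviour described for Get_Batch (mixing samples with a fixed number of constraints, and repeating constraints when too few are available) can be sketched as follows. This is a hypothetical Python illustration: the function signature, the tuple format for constraints and the choice to draw the s_batch samples uniformly at random are assumptions, since the text only specifies that constraints are repeated when there are fewer than c_batch.

```python
import random
from itertools import cycle, islice


def get_batch(n_series, constraints, s_batch, c_batch, rng=random):
    """Draw one mini-batch of sample indices plus c_batch constraints.

    `constraints` is a list of (i, j, kind) tuples with kind "ML" or "CL".
    """
    # Assumption: samples are drawn uniformly without replacement.
    samples = rng.sample(range(n_series), min(s_batch, n_series))
    if not constraints:
        return samples, []
    shuffled = list(constraints)
    rng.shuffle(shuffled)
    # Cycle through the shuffled list so that short constraint sets are repeated.
    chosen = list(islice(cycle(shuffled), c_batch))
    return samples, chosen


rng = random.Random(0)
all_constraints = [(0, 1, "ML"), (2, 3, "CL")]
samples, chosen = get_batch(100, all_constraints, s_batch=64, c_batch=16, rng=rng)
print(len(samples), len(chosen))
```

With only two constraints and c_batch = 16, each constraint appears eight times in the batch, which is one simple way of increasing the weight of the constrained pairs relative to the unconstrained samples.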

4 Evaluation

In this section CDPS is evaluated with respect to different constraint sets under two cases: the first is the classical constrained clustering setting, in which clusters are extracted from a dataset, called transductive learning; the second, which is normally not possible using classical constrained clustering algorithms, is one in which the constraints used to learn a representation are generalised to an unseen test set, called inductive learning.

4.1 Experimental Setup

Algorithm 1 is executed using mini-batch gradient descent with a batch size of s_batch = 64 and c_batch = 16 constraints in each batch for the transductive setting, and s_batch = 32, c_batch = 8 for the inductive setting (since there are fewer samples). The influence of α and γ on accuracy was evaluated and the algorithm was found to be stable to variations in most cases; for that reason the value of both is fixed to 2.5. The minimum shapelet length $L_{min} = 0.15 \cdot L_i$ and the maximum number of shapelet blocks $S_{max} = 3$ are taken to be the same as those used in LDPS [13]. All models are trained for 500 epochs using the Adam optimiser.

K-means and COP-KMeans [27] are used as comparison methods (unconstrained and constrained respectively) since k-means based algorithms are the most widely applied in real-world applications, offering state-of-the-art (or close to state-of-the-art) performance. CDPS is also compared to FeatTS [23], which is a semi-supervised algorithm that extracts features and uses k-Medoids clustering. Thirty-five datasets¹ chosen randomly from the UCR repository [2] are used for evaluation. The number of clusters is set to the number of classes in each dataset. The Normalised Mutual Information (NMI), which measures the coherence between the true and predicted labels, is used to evaluate the resulting clusters, with 0 indicating no mutual information and 1 a perfect correlation.

¹ Details on the datasets used are provided in the supplementary material.

For the first use case, termed Transductive, the training and test sets of the UCR datasets are combined; this reflects the real-world transductive case in which a dataset is to be explored and knowledge extracted. In the second, termed Inductive, the embedding is learnt on the training set and its generalised performance on the test set is evaluated. This inductive use case is something that is not normally possible when evaluating constrained clustering algorithms, since clustering is a transductive operation, and it highlights one of the key contributions of CDPS: the ability to generalise constraints to unseen data. The third use case highlights the importance of CDPS shapelets as features and their general ability to be integrated into any downstream algorithm. As such, FeatTS’s semi-supervised statistical features are replaced with the dataset’s CDPS embedding.

Each algorithm’s performance is evaluated on each dataset with increasing numbers of constraints, expressed as the percentage of samples that are subject to a constraint: 5%, 15% and 25%. These represent a very small fraction of the total number of possible constraints, which is $\frac{1}{2}N(N-1)$. Each clustering experiment is repeated 10 times, each with a different random constraint set, and each clustering algorithm is repeated 10 times for each constraint set (i.e. there are 100 repetitions for each percentage of constraints²). The constraints are generated by taking the ground truth data, randomly selecting two samples, and adding an ML or CL constraint depending on their class labels, until the correct number of constraints has been created.

In the FeatTS comparison, both the train and test sets were used (i.e. transductive). FeatTS and CDPS were evaluated using 25% of the ground truth information: FeatTS takes this information in the form of labels, while CDPS takes it in the form of ML/CL constraints. CDPS embeddings are generated and replace FeatTS’s features (CDPS+FeatTS). The number of features used for FeatTS was 20, as indicated in the authors’ paper. With both feature sets, k-Medoids was applied on the co-occurrence matrix to obtain the final clustering [23].

4.2 Results

In this section the results of each use case (described in Sect. 4.1) are presented.

Transductive: Figure 1 shows the NMI scores for CDPS (Euclidean k-means performed on the CDPS embeddings) compared to k-means (on the raw time-series), COP-KMeans (also on the raw time-series), and LDPS (Euclidean k-means on the LDPS embeddings). Unconstrained k-means and LDPS are presented as a reference for the constrained algorithms (COP-KMeans and CDPS respectively) to give insight into the benefit of constraints for each. Overall, LDPS and k-means offer similar performance. CDPS uses the information gained from constraints more efficiently, outperforming COP-KMeans at almost all constraint fractions for most datasets.
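The constraint-generation procedure of Sect. 4.1 and the NMI score reported in these results can be illustrated with a minimal sketch. It is not the authors' code: the function names, the reading of a constraint fraction as `fraction * N` pairs, and the arithmetic-mean normalisation of NMI are assumptions made for illustration only.

```python
import random
from collections import Counter
from math import log


def generate_constraints(labels, fraction, seed=0):
    """Draw random must-link (ML) / cannot-link (CL) pairs from ground truth.

    Assumption: the number of constraints is round(fraction * N), capped at
    the N(N-1)/2 possible pairs.
    """
    rng = random.Random(seed)
    n = len(labels)
    max_pairs = n * (n - 1) // 2
    target = min(max_pairs, max(1, round(fraction * n)))
    must_link, cannot_link = set(), set()
    while len(must_link) + len(cannot_link) < target:
        i, j = sorted(rng.sample(range(n), 2))   # two distinct samples
        if labels[i] == labels[j]:
            must_link.add((i, j))                # same class -> ML
        else:
            cannot_link.add((i, j))              # different class -> CL
    return must_link, cannot_link


def nmi(a, b):
    """Normalised mutual information of two labelings (arithmetic mean)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * log(c / n) for c in ca.values())
    h_b = -sum(c / n * log(c / n) for c in cb.values())
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    denom = (h_a + h_b) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    truth = [0] * 10 + [1] * 10
    ml, cl = generate_constraints(truth, fraction=0.25)
    print(len(ml), "ML and", len(cl), "CL constraints")
    print("NMI of a perfect clustering:", nmi(truth, [1 - t for t in truth]))
```

Because NMI is invariant to a relabelling of the clusters, the score can be computed directly between the ground truth and the cluster indices returned by any of the compared algorithms.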

Footnote 2: Note that it is not always possible for COP-KMeans to converge on a result due to constraint violations. Although many initialisations were tried to obtain as many results as possible, some of the COP-KMeans results represent fewer repetitions.


H. E. Amouri et al.

Fig. 1. A Transductive comparison between CDPS+kmeans and Raw-TS+CopKmeans with different constraint fractions.

It appears, however, that some datasets lend themselves to (unconstrained) k-means based algorithms, since on these k-means outperforms LDPS. Nevertheless, CDPS exhibits an increase in performance as the number of constraints increases, whereas COP-KMeans tends to stagnate. This can be seen as the cloud of points moves upwards (CDPS score increases) as more constraints are given. We can also observe that for some datasets the constrained algorithms behave similarly with 5% constraints, i.e. the cloud of points in the lower left corner, but again CDPS benefits most from increasing the number of constraints and significantly outperforms COP-KMeans with larger constraint percentages.


Fig. 2. An inductive comparison between CDPS+kmeans and Raw-TS+CopKmeans with different constraint fractions.

Inductive: Figure 2 presents the inductive results, in which the embedding space is learnt on the training set and the generalisation performance evaluated on the unseen test set. It should be noted that when training on the train set, there are significantly fewer constraints than when using the merged datasets for the same constraint percentage. It can therefore be concluded that even in the face of few data and constraints, CDPS is still able to learn a generalisable representation and attain (within a certain margin) the same clustering performance as when trained on the merged dataset. This is probably explained by the fact that having a smaller number of samples with few constraints means that they are repeated in the mini-batches (see Sect. 3.3), and this allows CDPS to focus on learning shapelets that are discriminative and preserve DTW rather than shapelets that model larger numbers of time series. Thus the resulting representation space is more faithful to the constraints, allowing better clustering of unseen time-series.

Fig. 3. A comparison between CDPS+FeatTS and FeatTS with respect to NMI score.

FeatTS Comparison: This study investigates the significance of the shapelets learnt using CDPS as features over the semi-supervised statistical features extracted using FeatTS. Figure 3 shows the NMI scores of CDPS+FeatTS and FeatTS, and we observe that, out of 35 datasets, CDPS+FeatTS outperforms FeatTS in 27. This indicates that the shapelets learnt using CDPS are better for clustering than the statistical features. For the datasets that achieved an NMI of around zero with CDPS+FeatTS but a high NMI with FeatTS (e.g. GunPointAgeSpan, CDPS+FeatTS: 0.001, FeatTS: 0.559), it appears that the shapelets learnt in these cases are not discriminative enough, which is confirmed by CDPS's low scores (CDPS+Kmeans: 0.004).

5 Discussion

Since LDPS only models DTW distance, the comparisons between it and k-means (Figs. 2a & 1a) give approximately equal performance. Nevertheless, CDPS is better able to exploit the information contained in the constraints when they are introduced, giving more accurate clustering results overall. Both LDPS and CDPS result in a metric space, which is beneficial for further analysis and processing. Being a hard constraint algorithm, COP-KMeans offers no guarantee of convergence, which was evident in the presented study where several of the results were missing after multiple tries. This is due to the difficulty of clustering with an elastic distance measure such as DTW. In these experiments, all constraints can be considered as coherent since they are generated from the ground truth data; however, in real-world situations this problem would be exacerbated by inconsistent constraints, particularly considering time-series since these are very hard to label. CDPS does not suffer from such limitations.

Although it was included in this study in order to have a comparison method, using COP-KMeans in an inductive use-case is not usual practice for a classical clustering algorithm. It was simulated by providing COP-KMeans with the combined 'training' dataset, its constraints, and the test data to be clustered. CDPS, on the other hand, offers a truly inductive approach in which new data can be projected into the resulting space, which inherently models user constraints. In this setting the difference between CDPS and COP-KMeans is reduced; however, it should be noted that CDPS does not 'see' the training data during the inductive setup, whereas COP-KMeans used all the data to derive the clusters. This also exposes the infeasibility of using COP-KMeans in this way: the data needs to be stored and accessed each time new data is to be clustered, which will become computationally expensive as its size grows. Finally, CDPS's embedding can be used for tasks other than clustering (classification, generation, etc.). CDPS's inductive complexity (once the space has been determined) is O(N Lk K), plus k-means' complexity O(N k K i), where k is the number of clusters and i the number of iterations until convergence, and the complexity of COP-KMeans is O(N k K i |ML ∪ CL|).

Overall, the CDPS algorithm leads to better clustering results since it is able to exploit the information brought to the learning process by the constraints. In relative terms, the number of datasets in which CDPS outperforms COP-KMeans increases in line with the number of constraints. In absolute terms, COP-KMeans' performance tends to decrease as more constraints are introduced, and the opposite can be said for CDPS. These constraints bias CDPS to find shapelets that define a representation that respects them while retaining the properties of DTW.
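To make the inductive use concrete, the sketch below projects unseen series into a shapelet space that has already been learnt and assigns them to existing k-means centroids, without touching the training data. It assumes an LDPS-style transform (the minimum mean squared distance between a shapelet and any window of the series); the function names are hypothetical, and the learnt shapelets and centroids are simply taken as given.

```python
import numpy as np


def shapelet_transform(series, shapelets):
    """Embed each series as its minimum windowed distance to every shapelet.

    series:    list of 1-D arrays (unseen time series)
    shapelets: list of 1-D arrays (assumed to have been learnt beforehand)
    """
    emb = np.empty((len(series), len(shapelets)))
    for n, x in enumerate(series):
        x = np.asarray(x, dtype=float)
        for k, s in enumerate(shapelets):
            s = np.asarray(s, dtype=float)
            # All windows of x that have the same length as the shapelet.
            windows = np.lib.stride_tricks.sliding_window_view(x, len(s))
            emb[n, k] = np.min(np.mean((windows - s) ** 2, axis=1))
    return emb


def assign_to_centroids(emb, centroids):
    """Nearest-centroid assignment in the embedding space (Euclidean)."""
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


if __name__ == "__main__":
    shapelets = [np.array([1.0, 2.0, 1.0])]           # illustrative shapelet
    centroids = np.array([[0.0], [2.0]])              # illustrative centroids
    new_data = [np.array([0, 0, 1, 2, 1, 0, 0.0]), np.zeros(7)]
    emb = shapelet_transform(new_data, shapelets)
    print(assign_to_centroids(emb, centroids))
```

Only the shapelets and the centroids have to be stored, which is the practical difference from re-running COP-KMeans on the stored training data whenever new series arrive.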
Although the focus of this article is not to evaluate whether clustering on these datasets benefits from constraints, it can be observed that generally better performance is found when constraints are introduced. The studies in the previous section show that the transformed space not only preserves the desirable properties of DTW but also implicitly models the constraints given during training. Although it was not evaluated, it is also possible to use COP-KMeans (constrained) clustering in the inductive CDPS embedding, thus allowing another mechanism to integrate constraints after the embedding has been learnt. Although CDPS has several parameters, it has been shown that these do not need to be fine-tuned for each dataset to achieve state-of-the-art performance (although better performance may be achieved if this is done).

5.1 Model Selection

When performing clustering there is no validation data with which to determine a stopping criterion. It is therefore important to analyse the behaviour of CDPS during training to give some general recommendations.


Fig. 4. Clustering quality (NMI) as a function of the number of epochs for each dataset, using a constraint fraction of 30%.

Fig. 5. Relationship between NMI and CDPS Loss for each dataset. To highlight the relationship between datasets, both loss and NMI have been scaled to between 0 and 1.

Figure 4 presents the CDPS clustering quality (NMI) as a function of the number of epochs for each dataset (using 30% constraints). It demonstrates that most of the models converge within a small number of epochs, with FaceFour taking the most epochs to converge. Moreover, the quality of the learnt representation does not deteriorate as the number of epochs increases, i.e. neither the DTW-preserving aspect nor the constraint influence dominates the loss and diminishes the other as epochs increase. Figure 5 presents scatter-plots of the NMI and CDPS loss (both normalised to between 0 and 1) for several datasets. In addition to the total loss, both the ML and CL losses have been included. The general trend observed in the overall loss is that a lower loss equates to a higher NMI. These plots show that the loss can be used as a model selection criterion without any additional knowledge of the dataset. For practical application, the embedding can be trained for a fixed, large enough number of epochs (as done in this study) or until stability is achieved. This is in line with the typical manner in which clustering algorithms are applied.
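One possible reading of "until stability is achieved" is sketched below: training stops once the relative change of the loss stays below a tolerance for a number of consecutive epochs, and the epoch with the lowest loss is kept. The function names and the `tol` and `patience` values are illustrative assumptions; the paper does not prescribe a specific rule.

```python
def stable_epoch(losses, tol=1e-3, patience=5):
    """First epoch at which the relative loss change has stayed below `tol`
    for `patience` consecutive epochs; None if that never happens."""
    run = 0
    for t in range(1, len(losses)):
        prev, cur = losses[t - 1], losses[t]
        rel_change = abs(prev - cur) / max(abs(prev), 1e-12)
        run = run + 1 if rel_change < tol else 0
        if run >= patience:
            return t
    return None


def best_checkpoint(losses):
    """Index of the epoch with the lowest loss (label-free model selection)."""
    return min(range(len(losses)), key=losses.__getitem__)


if __name__ == "__main__":
    history = [1.0, 0.5, 0.25, 0.2, 0.2, 0.2, 0.2]   # made-up loss curve
    print("stop at epoch:", stable_epoch(history, patience=3))
    print("keep epoch:", best_checkpoint(history))
```

Both functions use only the training loss, which matches the observation that no validation labels are available when clustering.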

6 Conclusions

This article has presented CDPS, an approach for learning shapelet-based time-series representations that respect user constraints while also respecting the DTW similarity of the raw time-series. The constraints take the form of must-link and cannot-link pairs of samples provided by the user. The influence of the constraints on the learning process is ensured through the use of mini-batch gradient descent in which a fraction of each batch contains samples under constraint. The resulting space removes many limitations inherent in using the DTW similarity measure for time-series, particularly regarding interpretability, constraint analysis, and the analysis of sample density. CDPS therefore paves the way for new developments in constraint proposition and incremental (active) learning for time-series clustering. The representations learnt by CDPS are general purpose and can be used with any machine learning task. The presented study focused on its use in constrained clustering. By evaluating the proposed method on thirty-five public datasets, it was found that using unconstrained k-means on CDPS representations outperforms COP-KMeans, unconstrained k-means (on the original time-series), and LDPS with k-means. CDPS is also shown to outperform FeatTS, which uses statistical features. Finally, it was shown that the representation learnt by CDPS is generalisable, something that is not possible with classic constrained clustering algorithms, and that when applied to unseen data CDPS outperforms COP-KMeans.

References

1. Cai, B., Huang, G., Xiang, Y., Angelova, M., Guo, L., Chi, C.H.: Multi-scale shapelets discovery for time-series classification. Int. J. Inf. Technol. Decis. Mak. 19(03), 721–739 (2020)
2. Dau, H.A., et al.: The UCR time series classification archive (October 2018). https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
3. Davidson, I., Ravi, S.: Identifying and generating easy sets of constraints for clustering. In: AAAI Conference on Artificial Intelligence (AAAI), pp. 336–341 (2006)
4. Grabocka, J., Schilling, N., Wistuba, M., Schmidt-Thieme, L.: Learning time-series shapelets. In: International Conference on Knowledge Discovery & Data Mining (SIGKDD), pp. 392–401 (2014)
5. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 1735–1742 (2006)
6. Hills, J., Lines, J., Baranauskas, E., Mapp, J., Bagnall, A.: Classification of time series by shapelet transformation. Data Min. Knowl. Discov. 28(4), 851–881 (2014)
7. Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: International Conference on Knowledge Discovery & Data Mining (SIGKDD), pp. 206–215 (2004)
8. Lampert, T., et al.: Constrained distance based clustering for time-series: a comparative and experimental study. Data Min. Knowl. Discov. 32(6), 1663–1707 (2018). https://doi.org/10.1007/s10618-018-0573-y
9. Lei, Q., Yi, J., Vaculin, R., Wu, L., Dhillon, I.S.: Similarity preserving representation learning for time series clustering. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 2845–2851 (2017)
10. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)
11. Li, Z., Liu, J., Tang, X.: Constrained clustering via spectral regularization. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 421–428. IEEE (2009)
12. Lines, J., Davis, L.M., Hills, J., Bagnall, A.: A shapelet transform for time series classification. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 289–297 (2012)
13. Lods, A., Malinowski, S., Tavenard, R., Amsaleg, L.: Learning DTW-preserving shapelets. In: International Symposium on Intelligent Data Analysis (IDA) (2017)
14. Mueen, A., Keogh, E., Young, N.: Logical-shapelets: an expressive primitive for time series classification. In: Proceedings of ACM SIGKDD: International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1154–1162 (2011)
15. Paparrizos, J., Franklin, M.J.: GRAIL: efficient time-series representation learning. VLDB Endowment 12(11), 1762–1777 (2019)
16. Paparrizos, J., Gravano, L.: k-shape: efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1855–1870 (2015)
17. Paparrizos, J., Liu, C., Elmore, A.J., Franklin, M.J.: Debunking four long-standing misconceptions of time-series distance measures. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (ACM SIGMOD), pp. 1887–1905 (2020)
18. Rakthanmanon, T., Keogh, E.: Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM International Conference on Data Mining (SDM), pp. 668–676 (2013)
19. Sakoe, H., Chiba, S.: Dynamic-programming approach to continuous speech recognition. In: Proceedings of the International Cartographic Association ICA, pp. 65–69 (1971)
20. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43–49 (1978)
21. Shah, M., Grabocka, J., Schilling, N., Wistuba, M., Schmidt-Thieme, L.: Learning DTW-shapelets for time-series classification. In: Proceedings of the 3rd IKDD Conference on Data Science (ACM IKDD CODS), pp. 1–8 (2016)
22. Sperandio, R.C.: Recherche de séries temporelles à l'aide de DTW-preserving shapelets. Ph.D. thesis, Université Rennes 1 (2019)
23. Tiano, D., Bonifati, A., Ng, R.: Feature-driven time series clustering. In: 24th International Conference on Extending Database Technology (EDBT), pp. 349–354 (2021)
24. Ulanova, L., Begum, N., Keogh, E.: Scalable clustering of time series with u-shapelets. In: Proceedings of the 2015 SIAM International Conference on Data Mining (SDM), pp. 900–908 (2015)
25. Vlachos, M., Hadjieleftheriou, M., Gunopulos, D., Keogh, E.: Indexing multidimensional time-series. VLDB J. 15(1), 1–20 (2006)
26. Wagstaff, K., Basu, S., Davidson, I.: When is constrained clustering beneficial, and why? In: AAAI Conference on Artificial Intelligence (IAAI) (2006)
27. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML), vol. 1, pp. 577–584 (2001)
28. Wu, L., Yen, I.E.H., Yi, J., Xu, F., Lei, Q., Witbrock, M.: Random warping series: a random features method for time-series embedding. In: 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 793–802 (2018)
29. Yamaguchi, A., Maya, S., Maruchi, K., Ueno, K.: LTSpAUC: learning time-series shapelets for optimizing partial AUC. In: Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), pp. 1–9 (2020)
30. Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 947–956 (2009)
31. Ye, L., Keogh, E.: Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min. Knowl. Discov. 22(1), 149–182 (2011)
32. Zakaria, J., Mueen, A., Keogh, E.: Clustering time series using unsupervised-shapelets. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 785–794 (2012)
33. Zakaria, J., Mueen, A., Keogh, E., Young, N.: Accelerating the discovery of unsupervised-shapelets. Data Min. Knowl. Discov. 30(1), 243–281 (2016)
34. Zhang, Q., Wu, J., Yang, H., Tian, Y., Zhang, C.: Unsupervised feature learning from time series. In: International Joint Conferences on Artificial Intelligence (IJCAI), pp. 2322–2328 (2016)
35. Zheng, G., Yang, Y., Carbonell, J.: Efficient shift-invariant dictionary learning. In: International Conference on Knowledge Discovery & Data Mining ACM SIGKDD, pp. 2095–2104 (2016)

Structured Nonlinear Discriminant Analysis

Christopher Bonenberger (1,2; corresponding author), Wolfgang Ertel (1), Markus Schneider (1), and Friedhelm Schwenker (2)

1 Ravensburg-Weingarten University of Applied Sciences (Institute for Artificial Intelligence), Weingarten, Germany, [email protected]
2 University of Ulm (Institute of Neural Information Processing), James-Franck-Ring, 89081 Ulm, Germany

Abstract. Many traditional machine learning and pattern recognition algorithms, for example linear discriminant analysis (LDA) or principal component analysis (PCA), optimize data representation with respect to an information theoretic criterion. For time series analysis these traditional techniques are typically insufficient. In this work we propose an extension to linear discriminant analysis that allows learning a data representation based on an algebraic structure that is tailored for time series. Specifically we propose a generalization of LDA towards shift-invariance that is based on cyclic structures. We expand this framework towards more general structures that allow prior knowledge about the data at hand to be incorporated within the representation learning step. The effectiveness of this proposed approach is demonstrated on synthetic and real-world data sets. Finally, we show the interrelation of our approach to common machine learning and signal processing techniques.

Keywords: Linear discriminant analysis · Time series analysis · Circulant matrices · Representation learning · Algebraic structure

1 Introduction

Often, when confronted with temporal data, machine learning practitioners use feature transformations. Yet, mostly these feature transformations are not adaptive but rely on decomposition with respect to a fixed basis (Fourier, Wavelet, etc.). This is because simple data-adaptive methods such as principal component analysis (PCA, [12]) or linear discriminant analysis (LDA, [22]) often lead to undesirable results for stationary time series [20]. Both methods, PCA and LDA, are based on successive projections onto optimal one-dimensional subspaces. For time series analysis via LDA this leads to problems, especially when the data at hand is not locally coherent in the corresponding vector space [10], which is likely to be the case for high-dimensional (long) time series. In this paper we propose an adaptation of LDA that relies on learning with algebraic structures. More precisely, we propose to learn a projection onto a structured multi-dimensional subspace instead of a single vector. The scope of this work is mainly to introduce the theoretical basics of structured discriminant analysis for time series, i.e., we focus on cyclic structures that incorporate shift-invariance and thus regularize supervised representation learning.

From an algebraic point of view the problem at hand is mainly an issue of data representation in terms of bases and frames [23]. In this setting we seek a basis (or a frame [5,15]) for the input space, which is optimal with respect to some information-theoretic criterion. When it comes to time series the vital point is that these algorithms can be tailored to meet the conditions of temporal data. Basis pursuit methods like dictionary learning (DL), which optimize over-complete representations with respect to sparsity and reconstruction error, are often altered in order to yield shift-invariance and to model temporal dependencies [8,17,21]. Convolutional neural networks (CNNs) are also implicitly equipped with a mechanism to involve algebraic structure in the learning process, because this way "the architecture itself realizes a form of regularization" [3]. In fact, both shift-invariant DL [8,21] and CNNs use cyclic structures, i.e., convolutions.

However, so far the idea of learning with cyclic structures has hardly been transferred to basic machine learning methods. The motive of this work is to transfer the idea of implicit shift-invariance to LDA by learning with algebraic structure. We strive for interpretable algorithms that go along with low computational complexity. This way we seek to bridge complex methods like convolutional neural networks and simple, well-understood techniques like LDA. Recently [1] proposed a generalization of PCA that allows unsupervised representation learning with algebraic structure, which is tightly linked to methods like dynamic PCA [13], singular spectrum analysis [9] and spectral density estimation [1]. However, similarly to PCA this method does not allow labeling information to be incorporated. Yet, representation learning can benefit from class information. In this respect our main contribution is a formulation of linear discriminant analysis that involves cyclic structures, thus being optimized for stationary temporal data. We provide a generalization of this framework towards non-stationary time series and even arbitrary correlation structures. Moreover, the proposed technique is linked to classical signal processing methods.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 38–54, 2023. https://doi.org/10.1007/978-3-031-26387-3_3

2 Prerequisites

In the following we will briefly discuss the underlying theory of linear discriminant analysis, circulant matrices and linear filtering. In Sect. 2.2 we revisit principal component analysis and its generalization towards shift-invariance.

2.1 Circulant Matrices

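We define a circulant matrix as a matrix of the form

$$\mathbf{G} = \begin{bmatrix} g_1 & g_D & g_{D-1} & g_{D-2} & \cdots & g_2 \\ g_2 & g_1 & g_D & g_{D-1} & \cdots & g_3 \\ g_3 & g_2 & g_1 & g_D & \cdots & g_4 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ g_D & g_{D-1} & g_{D-2} & g_{D-3} & \cdots & g_1 \end{bmatrix} \in \mathbb{R}^{D \times D}, \tag{1}$$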


C. Bonenberger et al.

i.e., the circulant $\mathbf{G}$ is fully defined by its first column vector $\mathbf{g}$. In short, the $i$-th row of a circulant matrix contains the first row right-shifted by $i-1$. In the following, we write circulant matrices as a matrix polynomial of the form

$$\mathbf{G} = \sum_{l=1}^{L} g_l \mathbf{P}^{l-1} \tag{2}$$

with

$$\mathbf{P} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & & 0 \\ 0 & 1 & 0 & & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \in \mathbb{R}^{D \times D}. \tag{3}$$

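Note that $\mathbf{P}$ itself is also a circulant matrix. Left-multiplying a signal $\mathbf{x}$ with a circulant matrix is equivalent to

$$\mathbf{G}\mathbf{x} = \mathbf{F}^{-1} \boldsymbol{\Lambda} \mathbf{F} \mathbf{x}, \tag{4}$$

where $\mathbf{F} \in \mathbb{C}^{D \times D}$ is the Fourier matrix with coefficients

$$[\mathbf{F}]_{j,k} = \frac{1}{\sqrt{D}} \exp\left(-2\pi i \, \frac{(j-1)(k-1)}{D}\right), \tag{5}$$

$\boldsymbol{\Lambda}$ is a diagonal matrix with the Fourier transform $\hat{\mathbf{g}} = \mathbf{F}\mathbf{g}$ of $\mathbf{g} \in \mathbb{R}^D$ on its diagonal, and $i$ is the imaginary unit, i.e., $i^2 = -1$. Hence, Eq. (4) describes a circular convolution

$$\mathbf{G}\mathbf{x} = \mathbf{F}^{-1} \boldsymbol{\Lambda} \hat{\mathbf{x}} = \mathbf{F}^{-1} (\hat{\mathbf{g}} \odot \hat{\mathbf{x}}) = \mathbf{g} \circledast \mathbf{x}.$$

Here, $\odot$ is the Hadamard product (pointwise multiplication) and $\circledast$ denotes the discrete circular convolution. Moreover, Eq. (4) describes the diagonalization of circulant matrices by means of the Fourier matrix.

2.2 (Circulant) Principal Component Analysis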

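Heading towards linear discriminant analysis, it is interesting to start with the Rayleigh quotient and its role in PCA (cf. [19]). The Rayleigh quotient of some vector $\mathbf{g} \in \mathbb{R}^D$ with respect to a symmetric matrix $\mathbf{S} \in \mathbb{R}^{D \times D}$ is defined as

$$R(\mathbf{g}, \mathbf{S}) = \frac{\mathbf{g}^T \mathbf{S} \mathbf{g}}{\mathbf{g}^T \mathbf{g}}.$$

Maximizing $R(\mathbf{g}, \mathbf{S})$ with respect to $\mathbf{g}$ leads to the eigenvalue problem $\mathbf{S}\mathbf{g} = \lambda \mathbf{g}$. The optimal vector $\mathbf{g}$ is the eigenvector of $\mathbf{S}$ with the largest corresponding eigenvalue.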
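As a small numerical illustration of this statement, the following sketch evaluates the Rayleigh quotient at the leading eigenvector of a random symmetric matrix and at arbitrary vectors. The helper name `rayleigh` and the test matrix are assumptions made for the example.

```python
import numpy as np


def rayleigh(g, S):
    """Rayleigh quotient R(g, S) = g^T S g / g^T g."""
    return float(g @ S @ g) / float(g @ g)


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = A @ A.T                              # symmetric test matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
g_star = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

# The quotient at g_star equals the largest eigenvalue; no other
# direction exceeds it.
print(rayleigh(g_star, S), eigvals[-1])
print(max(rayleigh(rng.standard_normal(5), S) for _ in range(1000)))
```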


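Having a labeled data set $\{(\mathbf{x}_\nu, y_\nu)\}_{\nu=1,\dots,N}$ with observations $\mathbf{x} \in \mathbb{R}^D$ and corresponding labels $y \in \{1, \dots, C\}$, we define the overall data matrix as

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_N \end{bmatrix} \in \mathbb{R}^{D \times N}.$$

Moreover we define class-specific data matrices $\mathbf{X}_c \in \mathbb{R}^{D \times N_c}$, where $N_c$ is the number of observations $\mathbf{x}_\nu$ with corresponding label $y_\nu = c$. The relation to principal component analysis becomes obvious when $\mathbf{S}$ is the empirical covariance matrix estimated from $\mathbf{X}$. Assuming zero-mean data, i.e., the expected value $E\{\mathbf{x}\} = \mathbf{0}$, the matrix $\mathbf{S} = \mathbf{X}\mathbf{X}^T$ is the empirical covariance matrix of $\mathbf{X}$. Thus

$$\max_{\mathbf{g} \in \mathbb{R}^D} \{R(\mathbf{g}, \mathbf{S})\}$$

is equivalent to the linear constrained optimization problem

$$\max_{\mathbf{g} \in \mathbb{R}^D} \|\mathbf{g}^T \mathbf{X}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{g}\|_2^2 = 1, \tag{6}$$

which in turn formulates principal component analysis, where maximizing $\|\mathbf{g}^T \mathbf{X}\|_2^2$ means to maximize variance (respectively power). As is well known, the optimal principal component vector(s) are found from the eigenvalue problem (cf. [12])

$$\mathbf{S}\mathbf{g} = \lambda \mathbf{g}.$$

While classical PCA is based on a projection onto an optimal one-dimensional subspace, [1] proposed a generalization of PCA which projects onto a multi-dimensional subspace that is formed from cyclic permutations. This results in optimizing

$$\max_{\mathbf{g} \in \mathbb{R}^D} \|\mathbf{G}\mathbf{X}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{g}\|_2^2 = 1, \tag{7}$$

with $\mathbf{G}$ being a κ-circulant matrix defined by the elements of $\mathbf{g}$ (see Section 2.1). The Frobenius norm is $\|\mathbf{A}\|_F^2 = \operatorname{tr}\{\mathbf{A}^T \mathbf{A}\}$. Solving Eq. (7) amounts to setting the partial derivatives of the Lagrangian function

$$L(\mathbf{g}, \lambda) = \operatorname{tr}\{\mathbf{X}^T \mathbf{G}^T \mathbf{G} \mathbf{X}\} + \lambda \left(\mathbf{g}^T \mathbf{g} - 1\right)$$

to zero. Analogously to PCA this finally leads to the eigenvalue problem (see [1])

$$\mathbf{Z}\mathbf{g} = \lambda \mathbf{g}$$

with $[\mathbf{Z}]_{k,l} = \sum_\nu \mathbf{x}_\nu^T \mathbf{P}^{l-k} \mathbf{x}_\nu$ using $\mathbf{P}$ as defined in Eq. (3), i.e., we write $\mathbf{G}$ as in Eq. (2).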
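The constructions of Sects. 2.1 and 2.2 can be checked numerically. The sketch below builds the cyclic permutation of Eq. (3), assembles G from L coefficients as in Eq. (2), applies it as a circular convolution via the FFT, and forms the matrix Z of the circulant PCA eigenvalue problem. All function names are hypothetical, and the explicit matrix powers are used for readability rather than efficiency.

```python
import numpy as np


def shift_matrix(D):
    """Cyclic permutation P of Eq. (3): (P x)[j] = x[j - 1] (cyclically)."""
    return np.roll(np.eye(D), 1, axis=0)


def circulant_from_coeffs(g, D):
    """G = sum_l g_l P^(l-1) as in Eq. (2), with len(g) = L <= D."""
    P = shift_matrix(D)
    return sum(gl * np.linalg.matrix_power(P, l) for l, gl in enumerate(g))


def filter_fft(g, x):
    """G x computed as a circular convolution through the FFT (Eq. (4))."""
    g_pad = np.zeros(len(x))
    g_pad[: len(g)] = g
    return np.fft.ifft(np.fft.fft(g_pad) * np.fft.fft(x)).real


def circulant_pca_matrix(X, L):
    """[Z]_{k,l} = sum_nu x_nu^T P^(l-k) x_nu for a D x N data matrix X."""
    D = X.shape[0]
    P = shift_matrix(D)
    Z = np.empty((L, L))
    for k in range(L):
        for l in range(L):
            Pm = np.linalg.matrix_power(P, (l - k) % D)   # P^D = I
            Z[k, l] = np.sum(X * (Pm @ X))
    return Z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 50))
    Z = circulant_pca_matrix(X, L=4)
    vals, vecs = np.linalg.eigh(Z)
    g = vecs[:, -1]                                   # unit-norm maximiser
    G = circulant_from_coeffs(g, 16)
    print(np.linalg.norm(G @ X, "fro") ** 2, vals[-1])
```

For a unit-norm coefficient vector the objective of Eq. (7) equals the quadratic form $\mathbf{g}^T \mathbf{Z} \mathbf{g}$, so the two printed numbers should agree up to rounding.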

2.3 Linear Discriminant Analysis

Relying on a decomposition of the overall empirical covariance matrix without using the class labels can result in a disadvantageous data representation. Linear discriminant analysis, in contrast, exploits labeling information by maximizing the generalized Rayleigh quotient (cf. [22])

$$R(\mathbf{g}, \mathbf{B}, \mathbf{W}) = \frac{\mathbf{g}^T \mathbf{B} \mathbf{g}}{\mathbf{g}^T \mathbf{W} \mathbf{g}} \tag{8}$$

with the between-class scatter matrix

$$\mathbf{B} = \sum_{c=1}^{C} P_c \left(\bar{\mathbf{x}}_c - \bar{\mathbf{x}}_0\right) \left(\bar{\mathbf{x}}_c - \bar{\mathbf{x}}_0\right)^T \tag{9}$$

and the within-class scatter matrix

$$\mathbf{W} = \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \left(\mathbf{x}_\nu - \bar{\mathbf{x}}_c\right) \left(\mathbf{x}_\nu - \bar{\mathbf{x}}_c\right)^T, \tag{10}$$

where $I_c$ is the index set for class $c$, i.e., $I_c = \{\nu \in [1, \dots, N] \mid y_\nu = c\}$. Moreover $\bar{\mathbf{x}}_c$ is the sample mean of observations from the class $c$ and $\bar{\mathbf{x}}_0$ is the overall empirical mean value. The a priori class probabilities $P_c$ have to be estimated as $P_c \approx N_c / N$. Note that the between-class scatter matrix is the empirical covariance matrix of class-specific sample mean values, while the within-class scatter matrix is a sum of the class-specific covariances. While typically $\operatorname{rank}\{\mathbf{W}\} = D$, the rank of the between-class scatter matrix is $\operatorname{rank}\{\mathbf{B}\} \le C - 1$.

The expression in Eq. (8), also known as Fisher's criterion, measures the separability of classes. Maximizing the Rayleigh quotient in Eq. (8) with respect to $\mathbf{g}$ defines LDA. Hence the optimal one-dimensional subspace of $\mathbb{R}^D$, where the optimality criterion is class separability due to the Rayleigh quotient, is found from

$$\max_{\mathbf{g} \in \mathbb{R}^D} \left\{\mathbf{g}^T \mathbf{B} \mathbf{g}\right\} \quad \text{s.t.} \quad \mathbf{g}^T \mathbf{W} \mathbf{g} = 1. \tag{11}$$

Again this constrained linear optimization problem is solved by setting the partial derivatives of the corresponding Lagrangian $L(\mathbf{g}, \lambda)$ to zero. Analogously to PCA we find a (generalized) eigenvalue problem

$$\frac{\partial L(\mathbf{g}, \lambda)}{\partial \mathbf{g}} = 0 \iff \mathbf{B}\mathbf{g} = \lambda \mathbf{W}\mathbf{g}. \tag{12}$$

Assuming that $\mathbf{W}^{-1}$ exists, then

$$\mathbf{W}^{-1} \mathbf{B} \mathbf{g} = \lambda \mathbf{g}. \tag{13}$$

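The projection $\mathbf{X}_\perp \in \mathbb{R}^{(C-1) \times N}$ of $\mathbf{X} \in \mathbb{R}^{D \times N}$ onto the optimal subspace defined by the eigenvectors $\mathbf{g}_1, \dots, \mathbf{g}_{C-1}$ of $\mathbf{W}^{-1}\mathbf{B}$ belonging to the $C-1$ largest eigenvalues is

$$\mathbf{X}_\perp = \begin{bmatrix} \mathbf{g}_1^T \\ \vdots \\ \mathbf{g}_{C-1}^T \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_N \end{bmatrix},$$

i.e., $\mathbf{X}_\perp$ is the mapping of $\mathbf{X}$ into a feature space that is optimal with respect to class discrimination.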
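A compact numerical sketch of classical LDA as summarized in this section is given below. It builds the two scatter matrices from Eqs. (9) and (10), solves Eq. (13) with a general eigenvalue solver, rescales each direction so that the constraint of Eq. (11) holds, and projects the data. The function names are hypothetical, and solving via `np.linalg.solve` followed by `np.linalg.eig` is just one simple choice that assumes W is invertible.

```python
import numpy as np


def scatter_matrices(X, y):
    """Between-class B (Eq. (9)) and within-class W (Eq. (10)); X is D x N."""
    D, N = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    B = np.zeros((D, D))
    W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        Pc = Xc.shape[1] / N                       # prior estimate N_c / N
        mean_c = Xc.mean(axis=1, keepdims=True)
        B += Pc * (mean_c - mean_all) @ (mean_c - mean_all).T
        W += Pc * (Xc - mean_c) @ (Xc - mean_c).T
    return B, W


def lda_directions(B, W, n_components):
    """Leading eigenpairs of W^{-1} B, scaled so that g^T W g = 1."""
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(vals.real)[::-1][:n_components]
    G = vecs.real[:, order]
    G /= np.sqrt(np.einsum("dq,de,eq->q", G, W, G))
    return vals.real[order], G


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 100))
    y = np.array([0] * 50 + [1] * 50)
    X[0, y == 1] += 8.0                            # separate the two classes
    B, W = scatter_matrices(X, y)
    vals, G = lda_directions(B, W, n_components=1)  # C - 1 = 1 direction
    X_perp = G.T @ X                                # (C - 1) x N projection
    print(X_perp.shape, vals)
```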

3 Structured Discriminant Analysis

This section presents the main contribution of the paper, namely we introduce a generalization of linear discriminant analysis that allows for learning with algebraic structure. Instead of the projection onto a one-dimensional subspace we propose to learn the coefficients of a multi-dimensional structured subspace that is optimal with respect to class discrimination.

3.1 Circulant Discriminant Analysis

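As introduced in Sect. 2.2, [1] proposes to modify Eq. (6) using κ-circulant matrices, which generalizes PCA towards shift invariance. In the following, we will adopt this approach and modify linear discriminant analysis as defined by Eq. (11) using circulant structures, i.e., we seek the coefficients of a circulant matrix $\mathbf{G} \in \mathbb{R}^{D \times D}$ of the form

$$\mathbf{G} = \sum_{l=1}^{L} g_l \mathbf{P}^{l-1} \tag{14}$$

instead of $\mathbf{g}$. Again $\mathbf{P}$ performs a cyclic permutation, as defined in Eq. (3). An example is depicted in Fig. 1, panel (1).

In this regard we use $\tilde{\mathbf{x}}_\nu = \mathbf{x}_\nu - \bar{\mathbf{x}}_c$ (the class affiliation of $\mathbf{x}_\nu$ is unambiguous) and $\tilde{\mathbf{x}}_c = \bar{\mathbf{x}}_c - \bar{\mathbf{x}}_0$ as abbreviations, where $\bar{\mathbf{x}}_c$ is the class mean and $\bar{\mathbf{x}}_0$ the overall mean, i.e.,

$$\mathbf{B} = \sum_{c=1}^{C} P_c \, \tilde{\mathbf{x}}_c \tilde{\mathbf{x}}_c^T$$

and

$$\mathbf{W} = \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu \tilde{\mathbf{x}}_\nu^T.$$

The coefficients $\mathbf{g} \in \mathbb{R}^L$ of $\mathbf{G}$ that go along with optimal class separation are found from the linear constrained optimization problem

$$\max_{\mathbf{g} \in \mathbb{R}^L} \sum_{c=1}^{C} P_c \|\mathbf{G}\tilde{\mathbf{x}}_c\|_2^2 \quad \text{s.t.} \quad \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \|\mathbf{G}\tilde{\mathbf{x}}_\nu\|_2^2 = 1, \tag{15}$$

which basically means to perform LDA on the data set $\mathbf{G}\mathbf{X}$.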


Fig. 1. Different examples on possible structures of G with g being a straight line from −1 to 1. These structures illustrate the dependencies associated to different parameter settings, e.g. in panel (1) each coordinate is related to each other while in (4) dependencies are restricted to ±D/4.

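In a geometrical understanding we seek a multi-dimensional cyclically structured subspace of $\mathbb{R}^D$ that is optimal with respect to class separability, while in classical LDA the sought subspace is one-dimensional. This implies that instead of $\|\mathbf{g}^T \mathbf{x}\|_2^2$ (variance) we measure the length $\|\mathbf{G}\mathbf{x}\|_2^2$ of the projection onto the subspace defined by $\mathbf{G}$ (total variation, see footnote 1, with respect to the variable under consideration, $\tilde{\mathbf{x}}_c$ or $\tilde{\mathbf{x}}_\nu$). However, according to Eq. (4) $\|\mathbf{G}\mathbf{x}\|_2^2$ can also be understood as the power of the filtered signal $\mathbf{G}^T \mathbf{x}$, while $\mathbf{G}$ is an optimally matched filter. In a two-class setting $\mathbf{G}$ can even be understood as a Wiener filter [24].

Footnote 1: The total variation is the trace of the covariance matrix (cf. [18]).

The Lagrangian for Eq. (15) is

$$L(\mathbf{g}, \lambda) = \sum_{c=1}^{C} P_c \, \tilde{\mathbf{x}}_c^T \mathbf{G}^T \mathbf{G} \tilde{\mathbf{x}}_c - \lambda \left( \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu^T \mathbf{G}^T \mathbf{G} \tilde{\mathbf{x}}_\nu - 1 \right). \tag{16}$$

Due to $(\mathbf{P}^i)^T \mathbf{P}^j = \mathbf{P}^{j-i} \; \forall i, j \in \mathbb{N}$, with $\mathbf{P}$ as in Eq. (3) and $\mathbf{G}$ as in Eq. (14), we find

$$\mathbf{x}^T \mathbf{G}^T \mathbf{G} \mathbf{x} = \mathbf{x}^T \left( g_1^2 \mathbf{P}^0 + \cdots + g_1 g_L \mathbf{P}^{L-1} + \cdots + g_L g_1 \mathbf{P}^{-L+1} + \cdots + g_L^2 \mathbf{P}^0 \right) \mathbf{x}$$

for any vector $\mathbf{x} \in \mathbb{R}^D$. The derivative with respect to $g_k$ is

$$\frac{d \, \mathbf{x}^T \mathbf{G}^T \mathbf{G} \mathbf{x}}{d g_k} = \mathbf{x}^T \left( \sum_{l=1}^{L} g_l \left( \mathbf{P}^{l-k} + \mathbf{P}^{k-l} \right) \right) \mathbf{x} = 2 \, \mathbf{x}^T \left( \sum_{l=1}^{L} g_l \mathbf{P}^{l-k} \right) \mathbf{x}. \tag{17}$$

The second equality in Eq. (17) uses the symmetry of real inner products and $(\mathbf{P}^i)^T = \mathbf{P}^{-i}$, which leads to $\mathbf{x}^T \mathbf{P}^{-i} \mathbf{x} = (\mathbf{P}^i \mathbf{x})^T \mathbf{x} = \mathbf{x}^T \mathbf{P}^i \mathbf{x}$. Using Eq. (17) the partial derivative of Eq. (16) w.r.t. $g_k$ can be written as

$$\frac{\partial L(\mathbf{g}, \lambda)}{\partial g_k} = 2 \sum_{c=1}^{C} P_c \, \tilde{\mathbf{x}}_c^T \left( \sum_{l} g_l \mathbf{P}^{l-k} \right) \tilde{\mathbf{x}}_c - 2\lambda \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu^T \left( \sum_{l} g_l \mathbf{P}^{l-k} \right) \tilde{\mathbf{x}}_\nu. \tag{18}$$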

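Setting Eq. (18) to zero leads to the generalized eigenvalue problem

$$\mathbf{Z}_B \mathbf{g} = \lambda \mathbf{Z}_W \mathbf{g}, \tag{19}$$

where

$$[\mathbf{Z}_B]_{k,l} = \sum_{c=1}^{C} P_c \, \tilde{\mathbf{x}}_c^T \mathbf{P}^{l-k} \tilde{\mathbf{x}}_c \tag{20}$$

and

$$[\mathbf{Z}_W]_{k,l} = \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu^T \mathbf{P}^{l-k} \tilde{\mathbf{x}}_\nu. \tag{21}$$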

Analogously to classical LDA we find the eigenvalue problem

$$\mathbf{Z}_W^{-1} \mathbf{Z}_B \mathbf{g} = \lambda \mathbf{g}. \tag{22}$$

In contrast to LDA, for non-trivial $\tilde{\mathbf{x}}_c$ rank$\{\mathbf{Z}_B\}$ = rank$\{\mathbf{Z}_W\}$ = $L$, which is due to the permutations in Eqs. (20) and (21). Every eigenvector $\mathbf{g}_q$ defines a circulant matrix $\mathbf{G}_q$ and a corresponding subspace. In accordance with Eq. (15) the length of the projection onto this subspace is

$$x_\perp = \|\mathbf{G}_q \mathbf{x}\|_2^2. \tag{23}$$
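The steps from Eq. (19) to Eq. (23) can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it assumes that $\tilde{\mathbf{x}}_c$ is the class mean minus the global mean, that $\tilde{\mathbf{x}}_\nu$ is a sample minus its class mean, and that $P_c$ is the relative class frequency (these definitions belong to Eqs. (9) and (10), which are not repeated here). All function names are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def autocorr_lags(x, L):
    """r[k] = x^T P^k x for k = 0..L-1, with P^k a cyclic shift by k samples."""
    return np.array([x @ np.roll(x, k) for k in range(L)])

def toeplitz_from_lags(r):
    """Symmetric Toeplitz matrix with entries [Z]_{k,l} = r[|l - k|]."""
    idx = np.abs(np.subtract.outer(np.arange(r.size), np.arange(r.size)))
    return r[idx]

def fit_cda(X, y, L):
    """Solve Z_B g = lambda Z_W g; returns eigenvalues and filters g_q (columns)."""
    mean_all = X.mean(axis=0)
    r_B, r_W = np.zeros(L), np.zeros(L)
    for c in np.unique(y):
        Xc = X[y == c]
        prior = Xc.shape[0] / X.shape[0]      # assumed definition of P_c
        mean_c = Xc.mean(axis=0)
        r_B += prior * autocorr_lags(mean_c - mean_all, L)              # Eq. (20)
        r_W += prior * sum(autocorr_lags(x - mean_c, L) for x in Xc)    # Eq. (21)
    Z_B, Z_W = toeplitz_from_lags(r_B), toeplitz_from_lags(r_W)
    lam, G = eigh(Z_B, Z_W)                   # generalized symmetric eigenproblem
    order = np.argsort(lam)[::-1]             # largest eigenvalue first
    return lam[order], G[:, order]

def project(X, g):
    """Eq. (23): squared length ||G_q x||_2^2 for every row x of X."""
    Gx = sum(g_l * np.roll(X, l, axis=1) for l, g_l in enumerate(g))
    return np.sum(Gx ** 2, axis=1)
```

Calling `project(X, G[:, q])` for the first few columns `q` yields one nonlinear feature per filter. Note that `eigh` requires $\mathbf{Z}_W$ to be positive definite, which holds for generic data when $L$ is small compared to the number of samples.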

Note that using the nonlinear projection in Eq. (23) yields a nonlinear algorithm. Of course this is not a necessity, since it would be viable to proceed with the linear representation $\mathbf{G}_q \mathbf{x}$. However, Eq. (23) fits in with linear discriminant analysis and is easily interpretable in terms of linear filtering.

3.2 Computational Aspects for Circulant Structures

Since both matrices $\mathbf{Z}_B, \mathbf{Z}_W \in \mathbb{R}^{L \times L}$ have a symmetric Toeplitz structure² the computational complexity of Eqs. (20) and (21) can be reduced to $O(L)$ as both matrices are fully determined by their first row vectors $\mathbf{z}_B, \mathbf{z}_W \in \mathbb{R}^L$ respectively.

² $[\mathbf{Z}]_{i,j}$ is constant for constant $i - j$ and $\mathbf{x}^T \mathbf{P}^{j-i} \mathbf{x} = \mathbf{x}^T \mathbf{P}^{i-j} \mathbf{x}$ (cf. [1]).

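Beyond that, the term $\mathbf{x}^T \mathbf{P}^{l-k} \mathbf{y}$ realizes a circular convolution $\mathbf{x} \circledast \mathbf{y}$. A circular convolution in turn can be expressed by means of the (fast) Fourier transform (FFT, cf. [24]), i.e., $\mathbf{x} \circledast \mathbf{y} = \mathbf{F}^{-1}(\mathbf{F}(\mathbf{x}) \odot \mathbf{F}(\mathbf{y}))$, where $\mathbf{F}$ denotes the Fourier transform and $\mathbf{F}^{-1}$ its inverse. This allows computing $\mathbf{z}_B$ and $\mathbf{z}_W$ in $O(D \log D)$ using the fast Fourier transform, i.e.,

$$[\mathbf{z}_B]_l = \left[ \sum_{c=1}^{C} P_c\, \mathbf{F}^{-1}\left( \mathbf{F}\tilde{\mathbf{x}}_c \odot \mathbf{F}\tilde{\mathbf{x}}_c \right) \right]_l, \quad l = 1, \ldots, L \tag{24}$$

and

$$[\mathbf{z}_W]_l = \left[ \sum_{c=1}^{C} P_c\, \mathbf{F}^{-1}\left( \sum_{\nu \in I_c} \mathbf{F}\tilde{\mathbf{x}}_\nu \odot \mathbf{F}\tilde{\mathbf{x}}_\nu \right) \right]_l, \quad l = 1, \ldots, L. \tag{25}$$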
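The FFT shortcut can be illustrated with a small NumPy sketch. It is a sketch under assumptions, not the authors' code: one of the two spectra is conjugated here, because that is what the correlation theorem requires for the quantities $\mathbf{x}^T \mathbf{P}^k \mathbf{x}$ when $\mathbf{P}$ is a cyclic shift; the function names are hypothetical. The same identity, without the conjugate, evaluates the circulant product $\mathbf{G}\mathbf{x}$ needed for the projection of Eq. (23).

```python
import numpy as np

def autocorr_lags_direct(x, L):
    """Reference: r[k] = x^T P^k x via explicit cyclic shifts, O(L * D)."""
    return np.array([x @ np.roll(x, k) for k in range(L)])

def autocorr_lags_fft(x, L):
    """Same quantity in O(D log D): inverse FFT of the power spectrum."""
    spectrum = np.fft.fft(x)
    return np.fft.ifft(np.conj(spectrum) * spectrum).real[:L]

def project_fft(x, g):
    """||G x||_2^2, with G x computed as a circular convolution via the FFT."""
    g_padded = np.zeros(x.size)
    g_padded[:g.size] = g                     # zero-pad the filter to length D
    Gx = np.fft.ifft(np.fft.fft(g_padded) * np.fft.fft(x)).real
    return np.sum(Gx ** 2)
```

For real input the imaginary parts of the inverse transforms vanish up to rounding, which is why only `.real` is kept.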

Using these insights the projection according to Eq. (23) can be accelerated via

$$x_\perp = \left\| \mathbf{F}^{-1}\left( \mathbf{F}\mathbf{g}_q \odot \mathbf{F}\mathbf{x} \right) \right\|_2^2, \tag{26}$$

where $\mathbf{g}_q$ has to be zero-padded such that $\mathbf{g}_q \in \mathbb{R}^D$. Besides the low complexity of estimating $\mathbf{Z}_B$ and $\mathbf{Z}_W$ via the FFT, there is a considerable reduction of computational complexity in solving Eq. (22) because $L$ can be chosen much smaller than $D$. In fact, $L \ll D$ is typically a reasonable choice, because for large $L$ the localization in frequency domain is inappropriately precise (cf. Figs. 4 and 5 and Sect. 3.3).

3.3 Harmonic Solutions

As can be seen from Figs. 4 and 5, for circulant structures with $L = D$ the optimal solution to Eq. (15) is a Fourier mode. Investigating Eqs. (21) and (20) for $L = D$ we can see that both $\mathbf{Z}_W$ and $\mathbf{Z}_B$ are circulant matrices. Generally both matrices have coefficients of the form $[\mathbf{Z}]_{k,l} = \mathbf{x}^T \mathbf{P}^{l-k} \mathbf{x}$, with some $\mathbf{x} \in \mathbb{R}^D$. Hence, when $L = D$ the first row of $\mathbf{Z}$ is palindromic, i.e., $Z_{k,l} = Z_{k,D-l}$ because $\mathbf{P}^{-l} = \mathbf{P}^{D-l}$. Thus $\mathbf{Z}$, respectively $\mathbf{Z}_W$ and $\mathbf{Z}_B$, are symmetric circulant Toeplitz matrices and both admit an eigendecomposition according to Eq. (4), i.e.,

$$\mathbf{Z} = \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3} & \cdots & z_{1,3} & z_{1,2} \\ z_{1,2} & z_{1,1} & z_{1,2} & \cdots & z_{1,4} & z_{1,3} \\ z_{1,3} & z_{1,2} & z_{1,1} & \cdots & z_{1,5} & z_{1,4} \\ \vdots & & & \ddots & & \vdots \\ z_{1,2} & z_{1,3} & z_{1,4} & \cdots & z_{1,2} & z_{1,1} \end{bmatrix} = \mathbf{F}^{-1} \boldsymbol{\Lambda} \mathbf{F} \in \mathbb{R}^{D \times D}.$$

Notably, the inverse of a circulant (Toeplitz) matrix is again a circulant matrix [14]. Thus we can conclude that for $L = D$ we have

$$\mathbf{Z}_W^{-1} \mathbf{Z}_B \mathbf{F} = \mathbf{F} \boldsymbol{\Lambda},$$

with $\mathbf{F}$ being the Fourier matrix (cf. Eq. (5)). We observe that independently of the data at hand the optimal solutions to Eq. (22) respectively Eq. (15) are Fourier modes, i.e., for stationary data Fourier modes maximize the Rayleigh quotient.

3.4 Truncated κ-Circulants

Although circulant structures are beneficial in terms of computational complexity, their use is tied to the assumption of stationarity³. In the following we slightly change the definition of $\mathbf{G}$ to a more general “cyclic” matrix $\boldsymbol{\Gamma}$ in order to gain more flexibility when incorporating dependencies into the structure of $\boldsymbol{\Gamma}$. We refer to a cyclic matrix when Eq. (4) is not fulfilled, i.e., the matrix is based on cyclic permutations, but is not strictly circulant. [17] describes κ-circulants as down-sampled versions of simple circulants. This approach can be generalized to truncated κ-circulant matrices

$$\boldsymbol{\Gamma} = \mathbf{M} \sum_{l=1}^{L} g_l \mathbf{P}^{l-1} \tag{27}$$

with $\mathbf{M}$ performing the down-sampling (with a factor $\kappa$) and truncation of all rows following the $\mu$-th row, i.e.,

$$[\mathbf{M}]_{i,j} = \begin{cases} 1 & \text{if } \mu \geq i = j \in [1, \kappa + 1, 2\kappa + 1, \cdots, D/\kappa + 1\kappa] \\ 0 & \text{else.} \end{cases}$$

This especially allows modeling dependencies for non-stationary data (see Sect. 4.2). The idea of truncation is important, as it allows a simple handling of the boundaries by setting $\mu = D - L + 1$ (as known from singular spectrum analysis [2,9]). On the other hand, using some $\mu > 1$ along with $L = D$ is the LDA-equivalent to dynamic PCA (cf. [1,2,13]). Setting $\kappa > 1$ implements down-sampling and is equivalent to stride in CNNs. Some examples are given in Fig. 1.⁴ Using $\boldsymbol{\Gamma}$ instead of $\mathbf{G}$ leads to

$$[\mathbf{Z}_B]_{k,l} = \sum_{c=1}^{C} P_c\, \tilde{\mathbf{x}}_c^T \mathbf{P}^{1-l} \mathbf{M} \mathbf{P}^{k-1} \tilde{\mathbf{x}}_c \tag{28}$$

and

$$[\mathbf{Z}_W]_{k,l} = \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu^T \mathbf{P}^{1-l} \mathbf{M} \mathbf{P}^{k-1} \tilde{\mathbf{x}}_\nu. \tag{29}$$
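A truncated κ-circulant matrix according to Eq. (27) can be built explicitly for small $D$. The sketch below is only illustrative: it assumes that $\mathbf{M}$ keeps rows $1, \kappa + 1, 2\kappa + 1, \ldots$ up to row $\mu$ (one reading of the definition above) and that $\mathbf{P}$ is the cyclic shift matrix; all function names are hypothetical.

```python
import numpy as np

def selection_matrix(D, kappa=1, mu=None):
    """Binary diagonal M keeping rows 1, kappa+1, 2*kappa+1, ... up to row mu (1-based)."""
    mu = D if mu is None else mu
    keep = np.zeros(D)
    keep[0:mu:kappa] = 1.0
    return np.diag(keep)

def truncated_kappa_circulant(g, D, kappa=1, mu=None):
    """Gamma = M * sum_l g_l P^(l-1), cf. Eq. (27)."""
    G = sum(g_l * np.roll(np.eye(D), l, axis=0) for l, g_l in enumerate(g))
    return selection_matrix(D, kappa, mu) @ G

def z_entry(x, k, l, M):
    """x^T P^(1-l) M P^(k-1) x for 1-based k, l: the summand of Eqs. (28) and (29)."""
    return np.roll(x, l - 1) @ M @ np.roll(x, k - 1)
```

With `kappa=1` and `mu=D` the matrix `M` is the identity and `z_entry` reduces to the circulant case; with `mu=1` only a single row of the circulant matrix survives, so that the projection becomes a plain inner product.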

All derivations are analogous to Sect. 3.1, except for the binary diagonal matrix $\mathbf{M}$ with $\mathbf{M}^T \mathbf{M} = \mathbf{M}$.⁵ Note that with $\mu = D$ and $\kappa = 1$ Eqs. (28) and (29) are equal to Eqs. (20) and (21) (circulant discriminant analysis). Moreover, for $L = D$ and $\mu = 1$ (or $\kappa = D$) Eqs. (28) and (29) coincide with $\mathbf{B}$ and $\mathbf{W}$ (Eqs. (9) and (10)). In that sense, circulant discriminant analysis and classical LDA are special cases of truncated κ-circulant structures.

³ The distribution of stationary signals is invariant with respect to time ($E\{x_t\}$ is constant for all $t$ and the covariance $C(x_t, x_s)$ solely depends on the index/time difference $|t - s|$) [16].
⁴ κ-circulant structures can also be used to model Wavelet-like structures [24].
⁵ Hence we can fully simplify analogously to the step from Eq. (17) to Eq. (18).

3.5 Non-cyclic Structures

Using circulant structures as proposed in Sect. 2.1 is an adequate approach for (weakly) stationary data sets. The generalization to truncated κ-circulant matrices allows embedding more complex dependencies into the structure of $\mathbf{G}$ (respectively $\boldsymbol{\Gamma}$) and hence allows working with non-stationary data. Yet, when choosing the correlation structure, one is of course not limited to cyclic structures. As the truncated κ-circulant structure in Eq. (27) can be given as $\boldsymbol{\Gamma} = \sum_l g_l \mathbf{M} \mathbf{P}^{l-1}$, we can clearly formulate Eq. (15) using an arbitrary structure $\boldsymbol{\Gamma}_A \in \mathbb{R}^{D \times D}$ which is modeled as

$$\boldsymbol{\Gamma}_A = \sum_{l=1}^{L} g_l \boldsymbol{\Pi}_l.$$

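Here, the coefficients of $\boldsymbol{\Pi}_l$ model the dependencies of the $i$-th variable. The corresponding solution is equivalent to the above derivations, i.e., we find the generalized eigenvalue problem of Eq. (19). However, the matrices $\mathbf{Z}_B$ and $\mathbf{Z}_W$ are defined as

$$[\mathbf{Z}_B]_{k,l} = \sum_{c=1}^{C} P_c\, \tilde{\mathbf{x}}_c^T \boldsymbol{\Pi}_l^T \boldsymbol{\Pi}_k \tilde{\mathbf{x}}_c$$

and

$$[\mathbf{Z}_W]_{k,l} = \sum_{c=1}^{C} P_c \sum_{\nu \in I_c} \tilde{\mathbf{x}}_\nu^T \boldsymbol{\Pi}_l^T \boldsymbol{\Pi}_k \tilde{\mathbf{x}}_\nu.$$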
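For arbitrary structure matrices the entries above are inner products of transformed copies of the data, so each summand is a Gram matrix. The following sketch illustrates this; the function name `structured_scatter` and its arguments (a list of vectors with associated weights) are hypothetical, and the weights stand in for the factors $P_c$.

```python
import numpy as np

def structured_scatter(vectors, weights, Pis):
    """[Z]_{k,l} = sum_n w_n x_n^T Pi_l^T Pi_k x_n for a list of D x D matrices Pi_l."""
    L = len(Pis)
    Z = np.zeros((L, L))
    for w, x in zip(weights, vectors):
        S = np.stack([Pi @ x for Pi in Pis])  # row l holds the transformed copy Pi_l x
        Z += w * (S @ S.T)                    # Gram matrix of the transformed copies
    return Z
```

Passing cyclic shift matrices as `Pis` recovers the symmetric Toeplitz matrices of the circulant case, which is a convenient sanity check for other choices of $\boldsymbol{\Pi}_l$.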

This very general formulation also allows for more complex structures that can explicitly model statistical dependencies for non-temporal data. In the field of time series analysis this general approach can be used to build over-complete multi-scale models.

4 Examples and Interpretation

In this section we illustrate the proposed method using different real-world and synthetic data sets.

4.1 (Quasi-)Stationary Data

As a start we use synthetic data generated from different auto-regressive moving average models (ARMA model, cf. [16]) corresponding to the different classes. In the left panel of Fig. 2 realizations from these four different models are depicted.


Fig. 2. A simple example demonstrating the performance of circulant discriminant analysis compared to classical linear discriminant analysis according to Sect. 4.1.

Fig. 3. This figure shows the first three eigenvectors of $\mathbf{Z}_W^{-1} \mathbf{Z}_B$ for the ARMA-process data set (cf. Section 4.1 and Fig. 2) and their spectrum. The right panel shows the spectral density of the four largest classes (colored) along with Fourier transformed (filter) coefficients. The x axis in the right panel is the frequency axis in half cycles per sample. (Color figure online)

For the sake of simplicity all model parameters are chosen depending on the class index, i.e., observations belonging to class $c$ stem from an ARMA($p$, $q$) model with $p = q = c$ and coefficients $\theta_i = \varphi_i = 1/(c + 1)$ for all $i = 1, \ldots, c$, where $\theta_i$ and $\varphi_i$ are AR and MA coefficients respectively. Each class comprises $N_c = 50$ samples, with a 50/50 train-test split. The data dimension is $D = 256$. In the middle and right panel of Fig. 2 a comparison of circulant discriminant analysis (CDA, according to Sect. 3.1) and linear discriminant analysis based on this data is shown. For CDA we used $L = 8$. For both methods the projection onto the first three subspaces is used. More precisely, for CDA the projection is according to Eq. (26). Note that CDA is considerably faster than LDA, due to the computational simplifications proposed in Sect. 3.2. In Fig. 6 the “user identification from walking activity” data set (cf. [4]) from the UCI Machine Learning Repository [7] is used. The data set contains accelerometer data from 22 different individuals, each walking the same predefined path. For each class, there are x, y and z measurements of the accelerometer forming three time series. In the following, we use sub-series of equal length $D$ from a single variable (acceleration in x-direction).
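The synthetic ARMA data described above could be generated along the following lines. This is a sketch under assumptions the text does not fix: Gaussian unit-variance noise, the sign convention $x_t = \sum_i \varphi_i x_{t-i} + e_t + \sum_j \theta_j e_{t-j}$, four classes, and a burn-in phase to reach the stationary regime. The function names and the `burn_in` parameter are hypothetical.

```python
import numpy as np

def simulate_arma(phi, theta, n, rng, burn_in=500):
    """One ARMA realization of length n (assumed convention, Gaussian noise)."""
    p, q = len(phi), len(theta)
    total = n + burn_in
    e = rng.standard_normal(total + q)        # e[q + t] is the innovation at time t
    x = np.zeros(total + p)                   # x[p + t] is the observation at time t
    for t in range(total):
        ar = sum(phi[i] * x[p + t - 1 - i] for i in range(p))
        ma = sum(theta[j] * e[q + t - 1 - j] for j in range(q))
        x[p + t] = ar + e[q + t] + ma
    return x[-n:]                             # discard the burn-in samples

def make_dataset(n_classes=4, n_per_class=50, D=256, seed=0):
    """Class c uses p = q = c and all coefficients equal to 1 / (c + 1)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(1, n_classes + 1):
        coeffs = np.full(c, 1.0 / (c + 1))
        for _ in range(n_per_class):
            X.append(simulate_arma(coeffs, coeffs, D, rng))
            y.append(c)
    return np.asarray(X), np.asarray(y)
```

Since the AR coefficients of class $c$ are positive and sum to $c/(c+1) < 1$, every class model is stationary under this convention.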


Fig. 4. The left panel shows the optimal solution to Eq. (15) within a parameter sweep over the filter kernel width $L$ based on data that stems from the ARMA process described in Sect. 4.1. Here $D = 64$. For the special case $L = D = 64$ the solution is a pure harmonic oscillation, as the discrete Fourier basis is the optimal basis in this configuration, independently of the data set under consideration (see Sect. 3.3).

While the synthetic data used for Fig. 2 is strictly stationary, real-world data, such as the gait pattern data used in the examples of Figs. 5 to 7, can be assumed to be stationary on (small) intervals [11]. For the visualization in Fig. 6 we used a 50/50 train-test split based on observations of length $D = 64$, which corresponds to a window width of approximately 2 seconds. For the sake of simplicity we used only one variable (the x-coordinate). The accuracies in Fig. 6 are based on a feature vector with 3 elements, i.e., classification is performed on the depicted data. The overall 1-nearest neighbor accuracy on the complete data set with 22 classes on a single variable (x-coordinate) is 46% (CDA) and 20% (LDA) respectively.

4.2 Non-stationary Data

In Sect. 3.4 we introduced κ-circulant structures that account for non-stationary data. Here we demonstrate the use of such structures using the “Plane” data set (cf. [6]). Often for time series the assumption of stationarity does not hold. In one example, the data at hand is triggered, i.e., all observations start at a certain point in time (space, ...). This results in non-stationarity, because distinct patterns are likely to be found at a certain index. The “Plane” data set from the UCR Time Series Archive (see [6]) is such a triggered data set. It contains seven different classes that encode the outline of different planes as a function of angle. The triggering stems from the fact that the outline is captured using the identical starting angle. Hence, the “Plane” data set is an example of non-stationary data that nevertheless shows temporal (spatial) correlations.


Fig. 5. Illustration of a parameter sweep over L according to Fig. 4. However, here the underlying dataset is the “user identification from walking activity” data set (cf. Figs. 6 and 7). Again for L = D = 64 we find a Fourier mode as optimal solution (cf. Section 3.3).

Fig. 6. Comparison of LDA and CDA using the “user identification from walking activity” data set according to Sect. 4.1. For this figure the four largest classes of the data set are depicted.

Figure 8 shows a comparison of stationary and non-stationary parameter settings. The difference between these settings is shown in Fig. 1. A structure for stationary data is visualized in panel (2), while the non-stationary setting is shown in panel (5) of Fig. 1. The former is equivalent to a FIR filter with data-adaptive coefficients, while the latter is similar to singular spectrum analysis. A detailed analysis of interrelations between these techniques is provided in [2].


Fig. 7. The left panel shows the first three solutions to Eq. (22) for the “user identification” data set according to Fig. 6 with L = 8 (cf. Section 4.1). The right panel shows the spectral density of the four largest classes (colored) along with Fourier transformed (filter) coefficients. The x axis in the right panel is the frequency axis in half cycles per sample. (Color figure online)

Fig. 8. Non-stationary analysis via truncated κ-circulant structures (using $L = 5$, $\kappa = 1$, $\mu = D - L + 1$) compared to (stationary) circulant discriminant analysis ($L = 5$) based on the “Plane” data set ($D = 144$). Here the projection onto the first two subspaces is shown.

5 Conclusion

Linear discriminant analysis is a core technique in machine learning and statistics. In this work we introduced an adaption of linear discriminant analysis that is optimized for stationary time series. This approach is based on the idea of projecting data onto cyclically structured subspaces, which is related to adaptive linear filtering. We generalize this approach towards non-stationary data and show how arbitrary correlation structures can be modeled. This reconnects to classical LDA, which is a special case of circulant discriminant analysis with truncated κ-circulants. The effectiveness of this approach is demonstrated on synthetic stationary data and temporal data from benchmark data sets. Finally, we discussed the connection between circulant discriminant analysis and linear filtering as well as Fourier analysis.


Acknowledgements. We are grateful for the careful review of the manuscript. We thank the reviewers for corrections and helpful comments. Additionally we would like to thank the maintainers of the UCI Machine Learning Repository and the UCR Time Series Archive for providing benchmark data sets.

References

1. Bonenberger, C., Ertel, W., Schneider, M.: κ-circulant maximum variance bases. In: Edelkamp, S., Möller, R., Rueckert, E. (eds.) KI 2021. LNCS (LNAI), vol. 12873, pp. 17–29. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87626-5_2
2. Bonenberger, C., Ertel, W., Schwenker, F., Schneider, M.: Singular spectrum analysis and circulant maximum variance frames. In: Advances in Data Science and Adaptive Analysis (2022)
3. Bouvrie, J.: Notes on convolutional neural networks (2006)
4. Casale, P., Pujol, O., Radeva, P.: Personalization and user verification in wearable systems using biometric walking patterns. Pers. Ubiquit. Comput. 16(5), 563–580 (2012)
5. Christensen, O.: An introduction to frames and Riesz bases. ANHA, Springer, Cham (2016). https://doi.org/10.1007/978-3-319-25613-9
6. Dau, H.A., et al.: The UCR time series archive. IEEE/CAA J. Autom. Sinica 6(6), 1293–1305 (2019)
7. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
8. Garcia-Cardona, C., Wohlberg, B.: Convolutional dictionary learning: a comparative review and new algorithms. IEEE Trans. Comput. Imaging 4(3), 366–381 (2018)
9. Golyandina, N., Zhigljavsky, A.: Singular spectrum analysis for time series. Springer Science & Business Media (2013)
10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
11. Hoffmann, R., Wolff, M.: Intelligente Signalverarbeitung 1. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45323-0
12. Jolliffe, I.T.: Principal component analysis, vol. 2. Springer (2002). https://doi.org/10.1007/b98835
13. Ku, W., Storer, R.H., Georgakis, C.: Disturbance detection and isolation by dynamic principal component analysis. Chemom. Intell. Lab. Syst. 30(1), 179–196 (1995)
14. Lv, X.G., Huang, T.Z.: A note on inversion of Toeplitz matrices. Appl. Math. Lett. 20(12), 1189–1193 (2007)
15. Morgenshtern, V.I., Bölcskei, H.: A short course on frame theory. arXiv preprint arXiv:1104.4300 (2011)
16. Pollock, D.S.G., Green, R.C., Nguyen, T.: Handbook of time series analysis, signal processing, and dynamics. Elsevier (1999)
17. Rusu, C., Dumitrescu, B., Tsaftaris, S.A.: Explicit shift-invariant dictionary learning. IEEE Signal Process. Lett. 21(1), 6–9 (2013)
18. Seber, G.A.: Multivariate observations. John Wiley & Sons (2009)
19. Serpedin, E., Chen, T., Rajan, D.: Mathematical foundations for signal processing, communications, and networking. CRC Press (2011)
20. Shumway, R.: Discriminant analysis for time series. Handbook Statist. 2, 1–46 (1982)
21. Sulam, J., Papyan, V., Romano, Y., Elad, M.: Multilayer convolutional sparse modeling: pursuit and dictionary learning. IEEE Trans. Signal Process. 66(15), 4090–4104 (2018)
22. Theodoridis, S., Koutroumbas, K.: Pattern recognition. Elsevier (2006)
23. Tosic, I., Frossard, P.: Dictionary learning: what is the right representation for my signal? IEEE Sig. Process. Mag. 28, 27–38 (2011)
24. Vetterli, M., Kovačević, J., Goyal, V.K.: Foundations of signal processing. Cambridge University Press (2014)

LSCALE: Latent Space Clustering-Based Active Learning for Node Classification

Juncheng Liu, Yiwei Wang, Bryan Hooi, Renchi Yang, and Xiaokui Xiao

School of Computing, National University of Singapore, Singapore, Singapore
{juncheng,y-wang,bhooi}@comp.nus.edu.sg, {renchi,xkxiao}@nus.edu.sg

Abstract. Node classification on graphs is an important task in many practical domains. It usually requires labels for training, which can be difficult or expensive to obtain in practice. Given a budget for labelling, active learning aims to improve performance by carefully choosing which nodes to label. Previous graph active learning methods learn representations using labelled nodes and select some unlabelled nodes for label acquisition. However, they do not fully utilize the representation power present in unlabelled nodes. We argue that the representation power in unlabelled nodes can be useful for active learning and for further improving performance of active learning for node classification. In this paper, we propose a latent space clustering-based active learning framework for node classification (LSCALE), where we fully utilize the representation power in both labelled and unlabelled nodes. Specifically, to select nodes for labelling, our framework uses the K-Medoids clustering algorithm on a latent space based on a dynamic combination of both unsupervised features and supervised features. In addition, we design an incremental clustering module to avoid redundancy between nodes selected at different steps. Extensive experiments on five datasets show that our proposed framework LSCALE consistently and significantly outperforms the state-of-the-art approaches by a large margin.

1 Introduction

Node classification on graphs has attracted much attention in the graph representation learning area. Numerous graph learning methods [6,11,15,27] have been proposed for node classification with impressive performance, especially in the semi-supervised setting, where labels are required for the classification task. In reality, labels are often difficult and expensive to collect. To mitigate this issue, active learning aims to select the most informative data points which can lead to better classification performance using the same amount of labelled data. Graph neural networks (GNNs) have been used for some applications such as disease prediction and drug discovery [10,21], in which labels often have to be obtained through costly means such as chemical assays. Thus, these applications motivate research into active learning with GNNs.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26387-3_4.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 55–70, 2023. https://doi.org/10.1007/978-3-031-26387-3_4


In this work, we focus on active learning for node classification on attributed graphs. Recently, a few GNN-based active learning methods [5,9,13,22,28] have been proposed for attributed graphs. However, their performance is still less than satisfactory in terms of node classification. These approaches do not fully utilize the useful representation power in unlabelled nodes and only use unlabelled nodes for label acquisition. For example, AGE [5] and ANRMAB [9] select corresponding informative nodes to label based on the hidden representations of graph convolutional networks (GCNs) and graph structures. These hidden representations can be updated only based on the labelled data. On the other hand, FeatProp [28] is a clustering-based algorithm which uses propagated node attributes to select nodes to label. However, these propagated node attributes are generated in a fixed manner based on the graph structure and node attributes, and are not learnable. In summary, existing approaches do not fully utilize the information present in unlabelled nodes. To utilize the information in unlabelled nodes, a straightforward method is to use features extracted from a trained unsupervised model for choosing which nodes to select. For example, FeatProp can conduct clustering based on unsupervised features for selecting nodes. However, as shown in our experimental results, it still cannot effectively utilize the information in unlabelled nodes for active learning. Motivated by the limitations above, we propose an effective Latent Space Clustering-based Active LEarning framework (hereafter LSCALE). In this framework, we conduct clustering-based active learning on a designed latent space for node classification. 
Our desired active learning latent space should have two key properties: 1) low label requirements: it should utilize the representation power from all nodes, not just labelled nodes, thereby obtaining accurate representations even when very few labelled nodes are available; 2) informative distances: in this latent space, intra-class nodes should be closer together, while inter-class nodes should be further apart. This can facilitate clustering-based active selection approaches, which rely on these distances to output a diverse set of query points. To achieve these, our approach incorporates an unsupervised model (e.g., DGI [26]) on all nodes to generate unsupervised features, which utilizes the information in unlabelled nodes, satisfying the first desired property. In addition, we design a distance-based classifier to classify nodes using the representations from our latent space. This ensures that distances in our latent space are informative for active learning selection, satisfying our second desired property. To select nodes for querying labels, we leverage the K-Medoids clustering algorithm in our latent space to obtain cluster centers, which are the queried nodes. As more labelled data are received, the distances between different nodes in the latent space change based on a dynamic combination of unsupervised learning features and learnable supervised representations. Furthermore, we propose an effective incremental clustering strategy for clustering-based active learning to prevent redundancy during node selection. Existing clustering-based active learning methods like [24,28] only select nodes in multiple rounds with a myopic approach. More specifically, in each round, they apply clustering over all unlabelled nodes and select center nodes for labelling.


Fig. 1. Illustration of the proposed framework LSCALE.

However, the cluster centers tend to be near the ones obtained in the previous rounds. Therefore, the clustering can select redundant nodes and does not provide much new information in the later rounds. In contrast, our incremental clustering is designed to be aware of the selected nodes in the previous rounds and ensure newly selected nodes are more informative. Our contributions are summarized as follows:

– We propose a latent space clustering-based active learning framework (LSCALE) for node classification on attributed graphs. LSCALE contains a latent space with two key properties designed for active learning purposes: 1) low label requirements, 2) informative distances.
– We design an incremental clustering strategy to ensure that newly selected nodes are not redundant with previous nodes, which further improves the performance.
– We conduct comprehensive experiments on three public citation datasets and two co-authorship datasets. The results show that our method provides a consistent and significant performance improvement compared to the state-of-the-art active learning methods for node classification on attributed graphs.

2 Problem Definition

In this section, we present the formal problem definition of active learning for node classification. Let $G = (V, E)$ be a graph with node set $V$ and edge set $E$, where $|V| = n$ and $|E| = m$. $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y}$ represent the input node attribute matrix and label matrix of graph $G$, respectively. In particular, each node $v \in V$ is associated with a length-$d$ attribute vector $\mathbf{x}_v$ and a one-hot label vector $\mathbf{y}_v$. Given a graph $G$ and its associated attribute matrix $\mathbf{X}$, node classification aims to find a model $\mathcal{M}$ which predicts the labels for each node in $G$ such that the loss function $\mathcal{L}(\mathcal{M} \mid G, \mathbf{X}, \mathbf{Y})$ over the inputs $(G, \mathbf{X}, \mathbf{Y})$ is minimized.

Furthermore, the problem of active learning for node classification is formally defined as follows. In each step $t$, given the graph $G$ and the attribute matrix $\mathbf{X}$, an active learning strategy $\mathcal{A}$ selects a node subset $S^t \subseteq U^{t-1}$ for querying the labels $\mathbf{y}_i$ for each node $i \in S^t$. After getting the new set of labelled nodes $S^t$, we obtain a set of all labelled nodes $L^t = S^t \cup L^{t-1}$ and a set of unlabelled nodes $U^t = U^{t-1} \setminus S^t$ prepared for the next iteration. Then $G$ and $\mathbf{X}$ with labels $\mathbf{y}_i$ of $i \in L^t$ are used as training data to train a model $\mathcal{M}$ at the end of each step $t$. We define the labelling budget $b$ as the total maximum number of nodes which are allowed to be labelled. The eventual goal is to maximize the performance of the node classification task under the budget $b$. To achieve this, active learning needs to carefully select nodes for labelling (i.e., choose $S^t$ at each step $t$). The objective is to minimize the loss using all labelled nodes $L^t$ at each step $t$:

$$\min_{L^t} \mathcal{L}(\mathcal{M}, \mathcal{A} \mid G, \mathbf{X}, \mathbf{Y}) \tag{1}$$

3 Methodology

In this section, we introduce our active learning framework LSCALE for node classification in a top-down fashion. First, we describe the overview and the key idea of LSCALE. Then we provide the details of each module used in LSCALE.

The overview of our latent space clustering-based active learning framework is shown in Fig. 1. The most important aspect of our framework is to design a suitable active learning latent space, specifically designed for clustering-based active learning. Motivated by the limitations of previous methods, we design a latent space with two important properties: 1) low label requirements: the latent space representations can be learned effectively even with very few labels, by utilizing the representation power from all nodes (including unlabelled nodes) rather than only labelled nodes; 2) informative distances: in the latent space, distances between intra-class nodes should be smaller than distances between inter-class nodes. With the first property, LSCALE can learn effective node representations throughout the active learning process, even when very few labelled nodes have been acquired. The second property makes distances in our latent space informative with respect to active learning selection, ensuring that clustering-based active selection processes choose a diverse set of query points.

To satisfy the first property, we use an unsupervised learning method to learn unsupervised node representations $\mathbf{H} \in \mathbb{R}^{n \times d'}$ based on graphs and node attributes, where $d'$ is the dimension of representations. After obtaining $\mathbf{H}$, we design a linear distance-based classifier to generate output predictions. In the classifier, we apply a learnable linear transformation on $\mathbf{H}$ to obtain hidden representations $\mathbf{Z}$. The distances between nodes are calculated by a dynamic combination of both $\mathbf{Z}$ and $\mathbf{H}$, which satisfies the second desired property. Clustering is performed on the latent space using the distances to select informative nodes. We additionally propose an incremental clustering method to ensure that the newly selected nodes are not redundant with the previously selected nodes. In summary, the framework contains a few main components:

– An unsupervised graph learning method to generate unsupervised node representations.
– An active learning latent space with two aforementioned properties: 1) low label requirements, 2) informative distances.
– An incremental clustering method to select data points as centroids, which we use as the nodes to be labelled, and prevent redundancy during node selection.

3.1 Active Learning Latent Space

To facilitate clustering-based active learning, we need a latent space with two desired properties. Therefore, we propose a distance-based classifier for generating representations from supervised signals and a distance function to dynamically consider supervised and unsupervised representations simultaneously.

Distance-Based Classifier. In our framework, we design a novel distance-based classifier to ensure that distances between nodes in our latent space can facilitate active learning further. Intuitively, a desired property of the latent space is that nodes from different classes should be more separated and nodes with the same class should be more concentrated in the latent space. Thus, it can help clustering-based active learning methods select representative nodes from different classes. To achieve this, we first map unsupervised features $\mathbf{H}$ to another set of features $\mathbf{Z}$ by a linear transformation:

$$\mathbf{Z} = \mathbf{H} \mathbf{W}_c, \tag{2}$$

where $\mathbf{W}_c \in \mathbb{R}^{d' \times l}$ is the trainable linear transformation matrix and $l$ is the dimension of latent representations. Then we define a set of learnable class representations $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K$, where $K$ is the number of classes. The distance vector of node $i$ is defined as:

$$\mathbf{a}_i = \|\mathbf{z}_i - \mathbf{c}_1\|_2 \oplus \|\mathbf{z}_i - \mathbf{c}_2\|_2 \oplus \cdots \oplus \|\mathbf{z}_i - \mathbf{c}_K\|_2 \in \mathbb{R}^K, \tag{3}$$

where $\oplus$ is the concatenation operation and $\|\cdot\|_2$ is the L2 norm. The $j$-th element $\hat{y}_{ij}$ in the output prediction $\hat{\mathbf{y}}_i$ of node $i$ is obtained by the softmax function:

$$\hat{y}_{ij} = \mathrm{softmax}(\mathbf{a}_i)_j = \frac{\exp(\|\mathbf{z}_i - \mathbf{c}_j\|_2)}{\sum_{k=1}^{K} \exp(\|\mathbf{z}_i - \mathbf{c}_k\|_2)} \tag{4}$$

For training the classifier, suppose the labelled node set at step $t$ is $L^t$. The cross-entropy loss function for node classification over the labelled node set is defined as:

$$\mathcal{L} = -\frac{1}{|L^t|} \sum_{i \in L^t} \sum_{c=1}^{K} y_{ic} \ln \hat{y}_{ic}, \tag{5}$$

where $y_{ic}$ denotes the $c$-th element in the label vector $\mathbf{y}_i$.
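The forward pass of the distance-based classifier and its loss can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: it follows Eqs. (2) to (5) as printed (softmax taken directly over the distance vector), omits the gradient-based training of $\mathbf{W}_c$ and the class representations, and uses hypothetical function names.

```python
import numpy as np

def distance_classifier_forward(H, W_c, C):
    """H: (n, d') unsupervised features, W_c: (d', l), C: (K, l) class representations."""
    Z = H @ W_c                                                  # Eq. (2)
    A = np.linalg.norm(Z[:, None, :] - C[None, :, :], axis=2)    # Eq. (3): rows are a_i
    expA = np.exp(A - A.max(axis=1, keepdims=True))              # shift for numerical stability
    Y_hat = expA / expA.sum(axis=1, keepdims=True)               # Eq. (4)
    return Z, A, Y_hat

def cross_entropy(Y_hat, Y_onehot, labelled_idx, eps=1e-12):
    """Eq. (5): mean cross-entropy over the labelled node set only."""
    picked = Y_hat[labelled_idx]
    return -np.mean(np.sum(Y_onehot[labelled_idx] * np.log(picked + eps), axis=1))
```

Subtracting the row maximum before exponentiating leaves the softmax output unchanged; `eps` is a hypothetical guard against taking the logarithm of zero.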


With the guidance of labelled nodes and their labels via backpropagation, we can update the transformation matrix Wc and new features Z can capture the supervised information from labelled data. In addition, new features Z allow intra-class nodes more close and inter-class nodes more separate in the feature space. Through this distance-based classifier, the generated feature space allows the clustering-based active selection effectively select a diverse set of query nodes. Distance Function. In LSCALE, the distance function determines the distances between nodes in the latent space for further clustering. We define our distance function as: d(vi , vj ) = ||g(X)i − g(X)j ||2 ,

(6)

where $g(X)$ is a mapping from node attributes $X$ to new distance features. As previous graph active learning methods do not effectively utilize the unlabelled nodes, we aim to take advantage of both unsupervised learning features and supervised information from labelled data. To this end, we combine the unsupervised learning features $H$ and the supervised hidden representations $Z$ in the distance function. A straightforward way to combine them is to concatenate $H$ and $Z$: $g(X) = H \oplus Z$. Note, however, that $H$ and $Z$ lie in different spaces and their row vectors may have different magnitudes. We therefore define the distance features as follows:

$$g(X) = \alpha \cdot H' \oplus (1 - \alpha) \cdot Z', \tag{7}$$

where $H'$ and $Z'$ are the $\ell_2$-normalized $H$ and $Z$, respectively, so that their rows have the same Euclidean norm. $\alpha$ can be treated as a parameter for controlling the dynamic combination of unsupervised features and supervised features. Intuitively, $Z$ can be unstable in the early stages, as there are relatively few labelled nodes in the training set. So, in the early stages, we would like to focus more on the unsupervised features $H$, which are much more stable than $Z$. As the number of labelled nodes increases, the focus should shift to the hidden representations $Z$ in order to emphasize supervised information. Inspired by curriculum learning approaches [2], we set an exponentially decaying weight as follows:

$$\alpha = \lambda^{|\mathcal{L}^t|}, \tag{8}$$

where $|\mathcal{L}^t|$ is the number of labelled nodes at step $t$. $\lambda$ can be set to a number close to 1.0, e.g., 0.99. By using this dynamic combination of unsupervised learning features $H$ and supervised hidden representations $Z$, we eventually construct the latent space $g(X)$, which has two important properties: 1) low label requirements: it utilizes the representation power from all nodes, including unlabelled nodes; 2) informative distances: distances between nodes are informative for node selection. Thus, the latent space can facilitate selecting diverse and representative nodes in the clustering module. Note that FeatProp [28] uses propagated node attributes as representations for calculating distances. The propagated node attributes are fixed and not learnable throughout the whole active learning process, which makes the node selection less effective. In contrast, our latent space is learned based on signals from both labelled and unlabelled data. In addition, it gradually shifts its focus to emphasize supervised signals as we acquire more labelled data.
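The dynamic combination of Eqs. (6) to (8) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the helper names (`l2_normalize_rows`, `distance_features`, `pairwise_distances`) are hypothetical, the toy matrices stand in for real DGI features and hidden representations, and the dense pairwise distance computation is only practical for small graphs. With $\lambda = 0.99$, the weight is roughly 0.90 after 10 labelled nodes and roughly 0.55 after 60.

```python
import numpy as np


def l2_normalize_rows(M, eps=1e-12):
    """Scale every row to unit Euclidean norm (rows of zeros are left as zeros)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, eps)


def distance_features(H, Z, num_labelled, lam=0.99):
    """Eq. (7) with the exponentially decaying weight of Eq. (8)."""
    alpha = lam ** num_labelled
    H_norm = l2_normalize_rows(H)
    Z_norm = l2_normalize_rows(Z)
    G = np.concatenate([alpha * H_norm, (1.0 - alpha) * Z_norm], axis=1)
    return G, alpha


def pairwise_distances(G):
    """Eq. (6): Euclidean distance between every pair of rows of G."""
    diff = G[:, None, :] - G[None, :, :]
    return np.linalg.norm(diff, axis=2)


# Toy features with deliberately different magnitudes.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 16)) * 10.0  # stand-in for unsupervised features
Z = rng.normal(size=(5, 8)) * 0.1    # stand-in for supervised hidden representations
G, alpha = distance_features(H, Z, num_labelled=10)
D = pairwise_distances(G)
print(G.shape, round(alpha, 4))
```

Because both blocks are normalized before weighting, the relative influence of $H$ and $Z$ on the distances is governed by $\alpha$ alone rather than by the raw scale of either feature matrix.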

3.2 Clustering Module

At each step, we use K-Medoids clustering on our latent space to obtain cluster representatives. In K-Medoids, medoids are chosen from among the data points themselves, ensuring that they are valid points to select during active learning. So, after clustering, we directly select these medoids for labelling. This ensures that the chosen centers are well spread out and provide good coverage of the remaining data, which matches the intuition of active learning, since we want the chosen centers to help us classify as much as possible of the rest of the data. At each step $t$, the objective of K-Medoids is:

$$\sum_{i=1}^{n} \min_{j \in S^t} d(v_i, v_j) = \sum_{i=1}^{n} \min_{j \in S^t} \|g(X)_i - g(X)_j\|_2 \tag{9}$$

Besides K-Medoids, common clustering methods used in previous work are K-Means [5,9] and K-Centers [24]. K-Means cannot be directly used for selecting nodes in active learning, as it does not return real sample nodes as cluster representatives.

Incremental Clustering. Despite these advantages of K-Medoids for active learning on graphs, a crucial drawback is that it may select similar nodes for querying across multiple iterations. That is, newly selected nodes may be close to previously selected ones, making them redundant and hence worsening the performance of active learning. The reason is that the clustering algorithm only generates representative nodes in the whole representation space, without any awareness of previously selected nodes. To overcome this problem, we design an effective incremental clustering algorithm for K-Medoids to avoid selecting redundant nodes. The key idea of our incremental clustering method is that fixing previously selected nodes as medoids forces the K-Medoids algorithm to select additional medoids that are dissimilar to the previous ones. We illustrate our incremental clustering method in Algorithm 1. After calculating the distances for every node pair (Line 4), incremental K-Medoids is conducted (Line 5 to Line 15). Compared to the original K-Medoids, the most important modification is that only clusters whose medoid is not in the previous labelled node set (i.e., $m \notin \mathcal{L}^{t-1}$) can update their medoid (Lines 10-13). When all the medoids are the same as those in the previous iteration, the K-Medoids algorithm stops and keeps the medoids. The medoids that are not previously selected nodes are put into the selected node set $S^t$, and the labelled node set $\mathcal{L}^t$ and unlabelled node set $\mathcal{U}^t$ are updated using $S^t$ accordingly.
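The incremental clustering procedure described above can be sketched as follows. This is a hedged NumPy sketch rather than the authors' code (see their repository for the reference implementation): the function names, the `seed` argument and the `max_iter` safety cap are assumptions introduced here, and the input is a precomputed dense distance matrix, which only suits small graphs.

```python
import numpy as np


def kmedoids_objective(D, medoids):
    """Eq. (9): sum over all nodes of the distance to the closest medoid."""
    return float(D[:, medoids].min(axis=1).sum())


def incremental_k_medoids(D, labelled, budget, seed=0, max_iter=100):
    """Sketch of incremental K-Medoids on a precomputed (n, n) distance matrix D.

    labelled: indices of previously labelled nodes (kept fixed as medoids)
    budget:   number of new nodes to select in this step
    Returns (labelled_new, unlabelled_new, selected).
    """
    rng = np.random.default_rng(seed)
    fixed = [int(i) for i in labelled]
    pool = np.setdiff1d(np.arange(D.shape[0]), fixed)
    # Initial medoids: the fixed ones plus `budget` random nodes from the pool.
    free = [int(i) for i in rng.choice(pool, size=budget, replace=False)]

    for _ in range(max_iter):  # max_iter is a safety cap added for this sketch
        medoids = np.array(fixed + free)
        # Assign every pool node to the cluster of its closest medoid.
        assign = np.argmin(D[np.ix_(pool, medoids)], axis=1)
        new_free = []
        # Only clusters whose medoid is not already labelled may update it.
        for c in range(len(fixed), len(medoids)):
            members = pool[assign == c]
            if members.size == 0:
                new_free.append(int(medoids[c]))  # keep the medoid of an empty cluster
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_free.append(int(members[np.argmin(within)]))
        if new_free == free:  # stop when no medoid changes
            break
        free = new_free

    selected = sorted(set(free))
    labelled_new = sorted(set(fixed) | set(selected))
    unlabelled_new = sorted(set(int(i) for i in pool) - set(selected))
    return labelled_new, unlabelled_new, selected


# Toy example: 30 random points, nodes 0 and 1 already labelled, budget of 3.
rng = np.random.default_rng(1)
G = rng.normal(size=(30, 4))
D = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2)
L_new, U_new, S = incremental_k_medoids(D, labelled=[0, 1], budget=3)
print(S, kmedoids_objective(D, L_new))
```

Keeping the labelled nodes as immovable medoids means their clusters already "cover" the nearby region, so the free medoids tend to settle in parts of the latent space that are not yet represented.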

4 Experiments

The main goal of our experiments is to verify the effectiveness of our proposed framework LSCALE (the code can be found at https://github.com/liu-jc/LSCALE). We design experiments to answer the following research questions:

Algorithm 1: Incremental K-Medoids clustering
Input: the set of previously labelled nodes $\mathcal{L}^{t-1}$, the set of unlabelled nodes $\mathcal{U}^{t-1}$ as the pool, the budget $b^t$ of the current step.
 1  $k \leftarrow |\mathcal{L}^{t-1}| + b^t$;
 2  Randomly select $b^t$ nodes from $\mathcal{U}^{t-1}$;
 3  Set the selected $b^t$ nodes and the nodes in $\mathcal{L}^{t-1}$ as the $k$ initial medoids;
 4  Compute $d(v_i, v_j)$ for every node pair $(v_i, v_j)$ by Eq. (6);
 5  repeat
 6      foreach node $u \in \mathcal{U}^{t-1}$ do
 7          Assign $u$ to the cluster with the closest medoid;
 8      end
 9      foreach cluster $C$ with medoid $m$ do
10          if $m \notin \mathcal{L}^{t-1}$ then
11              Find the node $m'$ which minimizes the sum of distances to all other nodes within $C$;
12              Update node $m'$ as the medoid of $C$;
13          end
14      end
15  until all the medoids are not changed;
16  Construct the selected node set $S^t$ using the medoids $m \notin \mathcal{L}^{t-1}$;
17  $\mathcal{L}^t \leftarrow S^t \cup \mathcal{L}^{t-1}$; $\mathcal{U}^t \leftarrow \mathcal{U}^{t-1} \setminus S^t$;
18  return $\mathcal{L}^t$, $\mathcal{U}^t$, $S^t$

– RQ1. Overall performance and effectiveness of unsupervised features: How does LSCALE perform compared with state-of-the-art graph active learning methods? Is utilizing unsupervised features also helpful for other clustering-based graph active learning methods?
– RQ2. Efficiency: How efficient is LSCALE compared with other methods?
– RQ3. Ablation study: Are the designed dynamic feature combination and incremental clustering useful for improving the performance? How does our distance-based classifier affect the performance?

Datasets. To evaluate the effectiveness of LSCALE, we conduct the experiments on Cora, Citeseer [23], Pubmed [20], Coauthor-CS (short as Co-CS) and Coauthor-Physics (short as Co-Phy) [25]. The first three are citation networks, while Co-CS and Co-Phy are two co-authorship networks. We describe the datasets in detail and summarize the dataset statistics in Supplement B.1.

Baselines. In the experiments, to show the compatibility with different unsupervised learning methods, we use two variants, LSCALE-DGI and LSCALE-MVGRL, which use DGI [26] and MVGRL [12] as the unsupervised learning method, respectively. To demonstrate the effectiveness of LSCALE, we compare the two variants with the following representative active learning methods on graphs:

– Random: selects the nodes uniformly from the unlabelled node pool.
– Uncertainty: selects the nodes with the maximum information entropy according to the current model.
– AGE [5]: constructs three different criteria based on graph neural networks to choose a query node. It combines these criteria with time-sensitive variables to decide which nodes to select for labelling.
– ANRMAB [9]: proposes a multi-armed bandit mechanism to assign different weights to the different criteria when constructing the score to select a query node.
– FeatProp [28]: performs K-Medoids clustering on the propagated features obtained by simplified GCN [27] and selects the medoids to query their labels.
– GEEM [22]: inspired by error reduction, it uses simplified GCN [27] to select the nodes by minimizing the expected error.

As suggested in [5,9,28], AGE, ANRMAB, FeatProp, Random, and Uncertainty use GCNs as the prediction model, which is trained after receiving labelled nodes at each step. GEEM uses the simplified graph convolution (SGC) [27] as the prediction model, as mentioned in [22].

Table 1. The averaged accuracies (%) and standard deviations at different budgets on citation networks.

Dataset      Cora                                      Citeseer                                  Pubmed
Budget       10            30            60            10            30            60            10            30            60
Random       47.65 ± 7.2   65.19 ± 4.6   73.33 ± 3.1   37.76 ± 9.7   57.73 ± 7.1   66.38 ± 4.4   63.60 ± 6.8   74.17 ± 3.9   77.93 ± 2.4
Uncertainty  45.78 ± 4.6   56.34 ± 8.4   70.22 ± 6.0   27.65 ± 8.8   45.04 ± 8.2   59.41 ± 9.2   60.72 ± 5.7   69.64 ± 4.2   74.95 ± 4.2
AGE          41.22 ± 9.3   65.09 ± 2.7   73.63 ± 1.6   31.76 ± 3.3   60.22 ± 9.3   64.77 ± 9.1   66.96 ± 6.7   75.82 ± 4.0   80.27 ± 1.0
ANRMAB       30.43 ± 8.2   61.11 ± 8.8   71.92 ± 2.3   25.66 ± 6.6   47.56 ± 9.4   58.28 ± 9.2   57.85 ± 8.7   65.33 ± 9.6   75.01 ± 8.4
FeatProp     51.78 ± 6.7   66.49 ± 4.7   74.70 ± 2.7   39.63 ± 9.2   57.92 ± 7.2   66.95 ± 4.2   67.33 ± 5.5   75.08 ± 3.2   77.60 ± 1.9
GEEM         45.73 ± 9.8   67.21 ± 8.7   76.51 ± 1.6   41.10 ± 7.2   62.96 ± 7.8   70.82 ± 1.2   64.38 ± 6.7   76.12 ± 1.9   79.10 ± 2.3
DGI-Rand     62.55 ± 5.8   73.04 ± 3.8   78.36 ± 2.6   54.46 ± 7.6   67.26 ± 4.0   70.24 ± 2.4   73.17 ± 3.8   78.10 ± 2.8   80.28 ± 1.6
FeatProp-D   68.94 ± 5.7   75.47 ± 2.9   77.64 ± 2.0   61.84 ± 5.9   66.99 ± 3.6   68.97 ± 2.0   73.50 ± 4.7   77.36 ± 3.4   78.54 ± 2.3
LSCALE-D     70.83 ± 4.8   77.41 ± 3.5   80.77 ± 1.7   65.60 ± 4.7   69.06 ± 2.6   70.91 ± 2.2   74.28 ± 4.4   78.54 ± 2.8   80.62 ± 1.7
LSCALE-M     72.71 ± 3.9   78.67 ± 2.7   82.03 ± 1.8   64.24 ± 4.8   68.68 ± 3.2   70.34 ± 1.9   73.51 ± 4.9   79.09 ± 2.3   81.32 ± 1.7

4.1 Experimental Setting

We evaluate LSCALE-DGI, LSCALE-MVGRL, and the other baselines on the node classification task with a transductive learning setup, following the experimental setup in [9,28] for a fair comparison.

Dataset Splits. For each citation dataset, we use the same testing set as in [15], which contains 1000 nodes. For the coauthor datasets, we randomly sample 20% of the nodes as the testing sets. From the non-testing set in each dataset, we randomly sample 500 nodes as a validation set and fix it for all the methods to ensure a fair comparison.

Experiment Procedure. In the experiments, we set the budget sizes differently for different datasets and we focus on the “batched” multi-step setting as in [24,28]. Each active learning method is provided with a small set of labelled nodes as an initial pool. As in [28], we randomly select 5 nodes regardless of the class as an initial pool. The whole active learning process is as follows: (1) we first train the prediction model with the initial labelled nodes; (2) we use the active learning strategy to select new nodes for labelling and add them to the labelled node pool; (3) we train the model on the labelled nodes again. We repeat Step (2) and Step (3) until the budget is reached and train the model on the final labelled node pool. For clustering-based methods (i.e., FeatProp and LSCALE), 10 nodes are selected for labelling in each iteration, as these methods depend on selecting medoids to label.

Hyperparameter Settings. For the hyperparameters of the other baselines, we set them as suggested in their papers. We specify the hyperparameters of our methods in Supplement B.2.

Table 2. The averaged accuracies (%) and standard deviations at different budgets on co-authorship networks.

Dataset      Co-Phy                                    Co-CS
Budget       10            30            60            10            30            60
Random       74.80         86.48         90.70 ± 2.6   49.72         69.98         78.15 ± 3.6
Uncertainty  71.42         86.64         91.29 ± 2.0   42.38         57.43         65.66 ± 9.5
AGE          63.96         84.47         91.30 ± 2.0   27.20         70.22         76.52 ± 3.6
ANRMAB       68.47         84.19         89.35 ± 4.2   43.48         69.98         75.51 ± 2.4
FeatProp     80.23         86.83         90.82 ± 2.6   52.45         70.83         76.60 ± 3.9
GEEM         79.24         88.58         91.56 ± 0.5   61.63         75.03         82.57 ± 1.9
DGI-Rand     82.81         90.35         92.44 ± 1.5   64.07         78.63         84.28 ± 2.7
FeatProp-D   87.90         91.23         91.51 ± 1.6   67.37         77.65         80.33 ± 2.5
LSCALE-D     90.38         92.75         93.70 ± 0.6   73.07         82.96         86.70 ± 1.7
LSCALE-M     90.28         92.89         93.05 ± 0.7   71.16         81.82         85.79 ± 1.6

4.2 Performance Comparison (RQ1)

Overall Comparison. We evaluate the performance using the averaged classification accuracy. We report the results over 20 runs with 10 different random data splits. In Tables 1 and 2, we show the accuracy scores of different methods when the number of labelled nodes is less than 60. Analysing Tables 1 and 2, we have the following observations:

– In general, our methods LSCALE-DGI (short as LSCALE-D) and LSCALE-MVGRL (short as LSCALE-M) significantly outperform the baselines on the varying datasets, while they provide relatively lower standard deviations on most datasets. In particular, when the total budget is only 10, LSCALE-M provides remarkable improvements compared with GEEM by absolute values 26.9%, 23.3%, 11.5%, on Cora, Citeseer, and Co-CS respectively.
– With a budget size of less than 30, the Uncertainty baseline always performs worse than the Random baseline on all datasets. Meanwhile, AGE and ANRMAB do not have much higher accuracies on most datasets compared with the Random baseline. Both of these results indicate that GCN representations, which are used in AGE, ANRMAB and the Uncertainty baseline for selecting nodes, are inadequate when only a few labelled nodes are available.
– GEEM generally outperforms the other baselines on all datasets, which might be attributed to its expected error minimization scheme. However, the important drawback of expected error minimization is its inefficiency, which we show later in Sect. 4.3.

Supplement B.3 shows more results on how the accuracy scores of different methods change as the number of labelled nodes increases. Supplement B.4 demonstrates how different hyperparameters affect the performance.

Effectiveness of Utilizing Unsupervised Features. Existing works overlook the information in unlabelled nodes, whereas LSCALE utilizes unsupervised features by using unsupervised learning on all nodes (including unlabelled ones). As we argued earlier, the information in unlabelled nodes is useful for active learning on graphs. To verify this, we design two additional baselines as follows:

– FeatProp-DGI: It replaces the propagated features with unsupervised DGI features in FeatProp to select nodes for labelling.
– DGI-Rand: It uses unsupervised DGI features and randomly selects nodes from the unlabelled node pool to label. For simplicity, it trains a simple logistic regression model with DGI features as the prediction model.

Regarding the effectiveness of unsupervised features, from Tables 1 and 2, we have the following observations:

– On all datasets, FeatProp-DGI (short as FeatProp-D) consistently outperforms FeatProp, which indicates that unsupervised features are useful for other clustering-based graph active learning approaches besides our framework.
– DGI-Rand also achieves better performance compared with AGE, ANRMAB, and GEEM, especially when the labelling budget is small (e.g., 10). This verifies again that existing approaches do not fully utilize the representation power in unlabelled nodes.
– DGI-Rand outperforms FeatProp-D when the labelling budget increases to 60. This observation shows that FeatProp-D cannot effectively select informative nodes in the late stage, which can be caused by redundant nodes being selected in the late stage.
– While DGI-Rand and FeatProp-D use the representation power in unlabelled nodes, they are still consistently outperformed by LSCALE-D and LSCALE-M, which verifies the superiority of our framework.

4.3 Efficiency Comparison (RQ2)

We empirically compare the efficiency of LSCALE-D with that of four state-of-the-art methods (i.e., AGE, ANRMAB, FeatProp, and GEEM). Table 3 shows

Table 3. The total running time of different models.

Method

Cora

Citeseer Pubmed Co-CS

Co-Phy

AGE

208.7 s 244.1 s

2672.8 s

6390.5 s 745.5 s

ANRMAB

201.8 s 231.5 s

2723.3 s

6423.5 s 767.1 s

FeatProp

16.5 s

16.7 s

58.7 s

169.2 s

GEEM

3.1 h

5.2 h

1.8 h

52.5 h

46.2 h

15.6 s

53.4 s

59.8 s

131.3 s

LSCALE-D 13.1 s

336.4 s

the total running time of these models on different datasets. From Table 3, GEEM has the worst efficiency, as it trains the simplified GCN model n × K times (K is the number of classes) to select a single node. FeatProp and LSCALE-D are much faster than the other methods. The reason is that FeatProp and LSCALE-D both select several nodes in a step and train the classifier once for this step, whereas AGE, ANRMAB and GEEM all select a single node per step. Comparing LSCALE-D and FeatProp, LSCALE-D requires less time, as the clustering in LSCALE-D is performed in the latent space, whose dimension is lower than that of the original attribute space used in FeatProp.

4.4 Ablation Study (RQ3)

Effectiveness of Dynamic Feature Combination and Incremental Clustering. We conduct an ablation study to evaluate the contributions of two different components in our framework: dynamic feature combination and incremental clustering. The results are shown in Fig. 3. DGI features is the variant without either dynamic combination or incremental clustering; it only uses the features obtained by DGI as distance features for the K-Medoids clustering algorithm. Dynamic Comb uses the dynamic combination to obtain distance features for clustering. LSCALE is the full version of our method, with both dynamic feature combination and incremental clustering. It is worth noting that DGI features can be considered a simple method utilizing unsupervised features. Analysing Fig. 3, we have the following observations:

– Dynamic Comb generally provides better performance than DGI features, which shows the effectiveness of our dynamic feature combination for distance features.
– LSCALE and Dynamic Comb perform similarly when the number of labelled nodes is relatively low. However, LSCALE gradually outperforms Dynamic Comb as the number of labelled nodes increases. This confirms that incremental clustering can select more informative nodes by avoiding redundancy between nodes selected at different steps.

In summary, the results verify the effectiveness and necessity of dynamic combination and incremental clustering.


Fig. 2. t-SNE visualization of different distance features on Cora dataset. Colors denote the ground-truth class labels. Our distance features have clearer separations between different classes. (Color figure online)

Fig. 3. Ablation study: dynamic feature combination and incremental clustering.

Effectiveness of Our Distance Features g(X). Furthermore, we also qualitatively demonstrate the effectiveness of our distance features g(X). Figure 2 shows t-SNE visualizations [17] of FeatProp features, DGI features, and the distance features of LSCALE-DGI. The distance features are obtained by dynamically combining DGI features and supervised hidden representations on 20 labelled nodes. Recall that FeatProp uses propagated node attributes as distance features and DGI features are learned using an unsupervised method with unlabelled data. Compared with the others, the distance features used in LSCALE-DGI have clearer boundaries between different classes, which satisfies our second desired property (informative distances) and further facilitates selecting informative nodes in the clustering algorithm.

Effectiveness of Distance-Based Classifier. We design a distance-based classifier in LSCALE to ensure that distances are informative in the active learning latent space. To demonstrate the effectiveness of the distance-based classifier, we replace it with a GCN classifier and show the comparison in Table 4. With the distance-based classifier, LSCALE achieves better performance than with a GCN classifier. This comparison shows the effectiveness of the designed distance-based classifier.


Table 4. Accuracy (%) comparison of LSCALE-DGI with different classifiers. The budget size is 20 · c (c is the number of classes).

Classifier           Cora    Citeseer  Pubmed  Co-CS   Co-Phy
GCN Classifier       81.83   71.24     80.03   87.28   93.34
Distance Classifier  83.23   72.30     80.62   89.25   93.97

Table 5. Accuracy (%) comparison of FeatProp with different classifiers.

Classifier           Cora    Citeseer  Pubmed  Co-CS   Co-Phy
GCN Classifier       80.50   72.04     77.65   83.49   93.06
Distance Classifier  80.66   72.14     77.88   84.32   93.28

To further investigate whether the distance-based classifier is also effective for other active learning methods, we replace the GCN classifier in FeatProp with our proposed distance-based classifier and present the comparison in Table 5. From Table 5, we note that FeatProp with our distance-based classifier performs slightly better than FeatProp with a GCN classifier on all the datasets. This observation indicates that our distance-based classifier is also effective for other clustering-based active learning methods.

5 Related Work

Active Learning on Graphs. For active learning on graphs, early works that do not use graph representations are proposed in [3,4,19], where the graph structure is used to train the classifier and calculate the query scores for selecting nodes. More recent works [1,8] study non-parametric classification models with graph regularization for active learning with graph data. Recent works [5,9,22,28] utilize graph convolutional neural networks (GCNs) [15], which consider the graph structure and the learned embeddings simultaneously. AGE [5] designs an active selection strategy based on a weighted sum of three metrics considering the uncertainty, the graph centrality and the information density. Improving upon the weight assignment mechanism, ANRMAB [9] designs a multi-armed bandit method with a reward scheme to adaptively assign weights to the different metrics. Besides the metric-based active selection on graphs, FeatProp [28] uses a clustering-based active learning method, which calculates the distances between nodes based on representations of a simplified GCN model [27] and conducts a clustering algorithm (i.e., K-Medoids) for selecting representative nodes. A recent method, GEEM [22], uses a simplified GCN [27] for prediction and maximizes the expected error reduction to select informative nodes to label. Rather than actively selecting nodes and training/testing on a single graph, [13] learns a selection policy on several labelled graphs via reinforcement learning and actively selects nodes using that policy on unlabelled graphs. [16,18] use adversarial learning and meta learning approaches for active learning on graphs. However, even with relatively complicated learning methods, their performance is similar to that of AGE [5] and ANRMAB [9]. [7] investigates active learning on heterogeneous graphs. [29] considers the noisy oracle setting, where labels obtained from an oracle can be incorrect. In this work, we focus on the homogeneous single-graph setting as in [5,9,22,28]. To tackle the limitations of previous work in this setting, we have presented an effective and efficient framework that can utilize the representation power in unlabelled nodes and achieve better performance under the same labelling budget.

6 Conclusion

In this paper, we focus on active learning for node classification on graphs and argue that existing methods are still less than satisfactory, as they do not fully utilize the information in unlabelled nodes. Motivated by this, we propose LSCALE, a latent space clustering-based active learning framework, which uses a latent space with two desired properties for clustering-based active selection. We also design an incremental clustering module to minimize redundancy between nodes selected at different steps. Extensive experiments demonstrate that our method provides superior performance over the state-of-the-art models. Our work points out a new possibility for active learning on graphs, which is to better utilize the information in unlabelled nodes by designing a feature space more suitable for active learning. Future work could propose new unsupervised methods which are more integrated with the active learning process and enhance our framework further.

Acknowledgements. This paper is supported by the Ministry of Education, Singapore (Grant Number MOE2018-T2-2-091) and A*STAR, Singapore (Number A19E3b0099).

References

1. Aodha, O.M., Campbell, N.D.F., Kautz, J., Brostow, G.J.: Hierarchical subquery evaluation for active learning on a graph. In: CVPR (2014)
2. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML (2009)
3. Berberidis, D., Giannakis, G.B.: Data-adaptive active sampling for efficient graph-cognizant classification. IEEE Trans. Sig. Process. 66, 5167–5179 (2018)
4. Bilgic, M., Mihalkova, L., Getoor, L.: Active learning for networked data. In: ICML (2010)
5. Cai, H., Zheng, V.W., Chang, K.C.: Active learning for graph embedding. arXiv preprint arXiv:1705.05085 (2017)
6. Chen, J., Ma, T., Xiao, C.: FastGCN: fast learning with graph convolutional networks via importance sampling. In: ICLR (2018)
7. Chen, X., Yu, G., Wang, J., Domeniconi, C., Li, Z., Zhang, X.: ActiveHNE: active heterogeneous network embedding. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19 (2019)
8. Dasarathy, G., Nowak, R.D., Zhu, X.: S2: an efficient graph based active learning algorithm with application to nonparametric classification. In: COLT (2015)
9. Gao, L., Yang, H., Zhou, C., Wu, J., Pan, S., Hu, Y.: Active discriminative network representation learning. In: IJCAI (2018)
10. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: ICML, pp. 1263–1272 (2017)
11. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1024–1034 (2017)
12. Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on graphs. In: ICML, pp. 4116–4126 (2020)
13. Hu, S., et al.: In: Advances in Neural Information Processing Systems (2020)
14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2016)
16. Li, Y., Yin, J., Chen, L.: Seal: semisupervised adversarial active learning on attributed graphs. IEEE Trans. Neural Netw. Learn. Syst. 32, 3136–3147 (2020)
17. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008)
18. Madhawa, K., Murata, T.: Metal: active semi-supervised learning on graphs via meta-learning. In: Asian Conference on Machine Learning, pp. 561–576. PMLR (2020)
19. Moore, C., Yan, X., Zhu, Y., Rouquier, J., Lane, T.: Active learning for node classification in assortative and disassortative networks. In: SIGKDD (2011)
20. Namata, G., London, B., Getoor, L., Huang, B.: Query-driven active surveying for collective classification. In: 10th International Workshop on Mining and Learning with Graphs (2012)
21. Parisot, S., et al.: Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer’s disease. Med. Image Anal. 48, 117–130 (2018)
22. Regol, F., Pal, S., Zhang, Y., Coates, M.: Active learning on attributed graphs via graph cognizant logistic regression and preemptive query generation. In: ICML, pp. 8041–8050 (2020)
23. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29, 93 (2008)
24. Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. In: ICLR (2018)
25. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018)
26. Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. In: ICLR (2018)
27. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., Weinberger, K.: Simplifying graph convolutional networks. In: ICML, pp. 6861–6871 (2019)
28. Wu, Y., Xu, Y., Singh, A., Yang, Y., Dubrawski, A.: Active learning for graph neural networks via node feature propagation. In: Proceedings of NeurIPS 2019 Graph Representation Learning Workshop (GRL) (2019)
29. Zhang, W., et al.: Rim: reliable influence-based active learning on graphs. In: Advances in Neural Information Processing Systems, vol. 34 (2021)

Powershap: A Power-Full Shapley Feature Selection Method

Jarne Verhaeghe, Jeroen Van Der Donckt, Femke Ongenae, and Sofie Van Hoecke

IDLab, Ghent University - imec, 9000 Ghent, Belgium
[email protected]
http://predict.idlab.ugent.be/

Abstract. Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performances, they suffer from a large computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Alternatively, filter methods are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) many filter methods not taking into account inter-correlation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms other filter methods with predictive performances on par with wrapper methods while being significantly faster, often even reaching half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, so that the algorithm can be used without any configuration.

Keywords: Feature selection · Shap · Benchmark · Simulation · Toolkit · Python · Open source

1 Introduction

In many data mining and machine learning problems, the goal is to extract and discover knowledge from data. One of the challenges frequently faced in these problems is the high dimensionality and the unknown relevance of features [10]. Ignoring these challenges will more often than not result in modeling obstacles, such as sparse data, overfitting, and the curse of dimensionality. Therefore, feature selection is frequently applied, among other techniques, to effectively reduce the feature dimensionality. The smaller subset of features has the potential to explain the problem better, reduce overfitting, alleviate the curse of dimensionality, and even facilitate interpretation. Furthermore, feature selection is known to increase model performance, increase computational efficiency, and increase the robustness of many models due to the dimensionality reduction [10].

In this work, we present a novel feature selection method, called powershap, that is a faster and easy-to-use wrapper method. The feature selection is realized by using Shapley values, statistical tests, and power calculations. First, in Sect. 2, a short overview of the related work is given to show how powershap improves upon all these methods. Subsequently, in Sect. 3, the method and the design choices are explained, as well as the resulting algorithm. Next, the performance of powershap is compared to other state-of-the-art methods in Sects. 4 and 5 using both simulation and open-source benchmark datasets, and the results are discussed in Sect. 6. Finally, the conclusions are summarized in Sect. 7.

© The Author(s) 2023. M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 71–87, 2023. https://doi.org/10.1007/978-3-031-26387-3_5

2 Related Work

Feature selection approaches can be categorized into filter and wrapper methods. Filter methods select features by measuring the relevance of each feature using model-agnostic measures, such as statistical tests, information gain, distance, similarity, and consistency with respect to the dependent variable (if available). These methods are model-independent, as this category of feature selection does not rely on training machine learning models [6], resulting in a fast evaluation. However, filter methods frequently impose assumptions on the data, are often limited to a single type of prediction, such as classification or regression, do not always take inter-correlation between features into account, and often require a cut-off value or hyperparameter tuning [8]. Examples of filter methods are rank, chi2 test, f-test, correlation-based feature selection, Markov blanket filter, and Pearson correlation [6].

Wrapper methods measure the relevance of features using a specific evaluation procedure through training supervised models. Depending on the wrapper technique, models are trained on either subsets of the features or on the complete feature set. The trained models are then used to select the resulting feature subset, either by comparing performance metrics or by ranking the inferred feature importances. In general, wrapper methods tend to provide smaller and higher-quality feature subsets than filter methods, as they take the interaction between the features, and between the model and the features, into account [4]. A major drawback of wrapper methods is the considerable time complexity associated with the underlying search algorithm or, in the case of feature importance ranking, with the hyperparameter tuning. Examples of wrapper methods are forward, backward, genetic, and rank-based feature importance feature selection.

Powershap: A Power-Full Shapley Feature Selection Method


In the interpretable machine learning field, one of the emerging and proven techniques to explain model predictions is SHAP [13]. This technique aims at quantifying the impact of features on the output. To do so, SHAP uses a game-theory-inspired additive feature-attribution method based on Shapley Regression Values [13]. This method is model-agnostic and implemented for various models, e.g., linear, kernel-based, deep learning, and tree-based models. Although SHAP suffers from shortcomings, such as its TreeExplainer providing non-zero Shapley values to noise features, it is technically strong and very popular [11]. The strength of the SHAP algorithm facilitates the development of new feature selection methods using Shapley values. A simple implementation would be a rank-based feature selection, which ranks the different features based on their Shapley values, after which a rank cut-off value determines the final feature set. However, there are more advanced methods available.

One of these more advanced techniques is borutashap [7]. Borutashap is based on the Boruta algorithm, which makes use of shadow features, i.e. features with randomly shuffled values. Boruta is built on the idea that a feature is only useful if it is doing better than the best-performing shuffled feature. To do so, Boruta compares the feature importance of the best shadow feature to all other features, selecting only features with larger feature importance than the highest shadow feature importance. Statistical interpretation is realized by repeating this algorithm for several iterations, resulting in a binomial distribution which can be used for p-value cut-off selection [9]. Borutashap improves on the underlying Boruta algorithm by using Shapley values and an optimized version of the shap TreeExplainer [7]. As such, implementations of borutashap are limited to tree-based models only.

Another shap-based feature selection method using statistics is shapicant [3]. This feature selection method is inspired by the permutation-importance method, which first trains a model on the true dataset, and afterward shuffles the labels and retrains the model on the shuffled dataset. This process is repeated for a set number of iterations, after which the average feature importances of both models are compared. If, for a specific feature, the feature importance of the true dataset model is consistently larger than the importance of the shuffled dataset model, that feature is considered informative. Using a non-parametric estimation, it is possible to assign a p-value, to which a chosen cut-off value can be applied [1]. Shapicant improves on this underlying algorithm by using Shapley values. Specifically, it uses both the mean of the negative and positive Shapley values instead of Gini importances, which are only positive and frequently used for tree-based model importances. Furthermore, shapicant uses out-of-sample feature importances for more accurate estimations and an improved non-parametric estimation formula [3].

Powershap draws inspiration from the non-parametric estimation of shapicant and the random feature usage in borutashap and improves upon all these state-of-the-art filter and wrapper algorithms, resulting in at least comparable performances while being significantly faster.

3 Powershap

Powershap builds upon the idea that a known random feature should have, on average, a lower impact on the predictions than an informative feature. To realize feature selection, the powershap algorithm consists of two components: the Explain component and the core powershap component. First, in the Explain part, multiple models are trained using different random seeds, on different subsets of the data. Each of these subsets comprises all the original features together with one random feature. Once the models are trained, the average impact of the features (including the random feature) is explained using Shapley values on an out-of-sample dataset. Then, in the core powershap component, the impacts of the original features are statistically compared to the random feature, enabling the selection of all informative features.

3.1 Powershap Algorithm

In the Explain component, a single known random uniform (RandomUniform) feature is added to the feature set for training a machine learning model. Unlike the Boruta algorithm, where all features are duplicated and shuffled, only a single random feature is added. In some models, such as neural networks, duplicating the complete feature set increases the scale and thereby increases the time complexity drastically. Using the Shapley values on an out-of-sample subset of the data allows for quantifying the impact on the output for each feature. The Shapley values are evaluated on unseen data to assess the true unbiased impact [2]. As a final step, the absolute value of all the Shapley values is taken and then averaged (μ) to get the total average impact of each feature. Compared to shapicant, only a single mean value is used here, resulting in easier statistical comparisons. Furthermore, by utilizing the absolute Shapley values, the positive values and the negative values are added to the total impact, which could result in a different distribution compared to the Gini importance. This procedure is then repeated for I iterations, where every iteration retrains the model with a different random feature and uses a different subset of the data to quantify the Shapley values, resulting in an empirical distribution of average impacts that will further be used for the statistical comparison. In the codebase, the procedure explained above is referred to as the Explain function. The pseudocode of the Explain function is shown in Algorithm 1. Given the average impact of each feature for each iteration, it is then possible to compare it to the impact of the random feature in the core powershap component. This comparison is quantified using the percentile formula shown in Eq. 1 where s depicts an array of average Shapley values for a single feature with the same length as the number of iterations, while x represents a single value, and I represents the indicator function. 
This formula calculates the fraction of iterations where x was higher than the average shap-value of that iteration and can therefore be interpreted as the p-value.

$$\mathrm{Percentile}(s, x) = \frac{\sum_{i=1}^{n} I(x > s_i)}{n} \tag{1}$$
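As a minimal sketch, Eq. 1 can be written in a few lines of Python. The function name `percentile` and the example numbers are illustrative only and are not taken from the powershap codebase.

```python
def percentile(s, x):
    """Eq. 1: fraction of entries s_i in s that are strictly smaller than x."""
    return sum(1 for s_i in s if x > s_i) / len(s)


# Hypothetical average impacts of one feature over five iterations.
feature_impacts = [0.40, 0.35, 0.02, 0.50, 0.45]
# Hypothetical mean impact of the random feature.
random_impact = 0.05

p_value = percentile(feature_impacts, random_impact)
print(p_value)  # 0.2: the random feature was larger in 1 of 5 iterations
```

With a threshold of 0.01, this hypothetical feature would not be selected, because the random feature outranked it in one of the five iterations.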


Algorithm 1: Powershap Explain algorithm
Function Explain(I ← Iterations, M ← Model, D^{n×m} ← Data, rs ← Random seed)
    powershap_values ← size [I, m + 1]
    for i ← 1, 2, ..., I do
        RS ← i + rs
        D_random^{n} ← RandomUniform(RS) ∈ [−1, 1] size n
        D^{n×(m+1)} ← D^{n×m} ∪ D_random^{n}
        D_train^{0.8n×(m+1)}, D_val^{0.2n×(m+1)} ← split D
        M ← Fit M(D_train)
        S_values ← SHAP(M, D_val)
        S_values ← |S_values|
        for j ← 1, 2, ..., m + 1 do
            powershap_values[i][j] ← μ(S_values[...][j])
        end
    end
    return powershap_values
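The control flow of Algorithm 1 can be sketched in plain Python as follows. This is not the powershap implementation: the callables `fit` and `attribute` are hypothetical placeholders for a model fit and a SHAP explainer, and the toy univariate model used in the demonstration is a stand-in chosen only so that the example runs without external libraries.

```python
import random


def explain(iterations, fit, attribute, X, y, rs=0):
    """Sketch of Algorithm 1 with pluggable (hypothetical) callables.

    fit(X_train, y_train) -> model
    attribute(model, X_val) -> one row of per-feature attributions per sample
    Returns one row of m + 1 mean absolute impacts per iteration.
    """
    n = len(X)
    powershap_values = []
    for i in range(1, iterations + 1):
        rng = random.Random(i + rs)  # RS <- i + rs
        # Append a single random uniform feature in [-1, 1] as the last column.
        X_aug = [row + [rng.uniform(-1.0, 1.0)] for row in X]
        # 80/20 train/validation split on shuffled indices.
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(0.8 * n)
        train, val = idx[:cut], idx[cut:]
        model = fit([X_aug[k] for k in train], [y[k] for k in train])
        attributions = attribute(model, [X_aug[k] for k in val])
        # Mean absolute attribution per feature on the validation rows.
        powershap_values.append([
            sum(abs(row[j]) for row in attributions) / len(attributions)
            for j in range(len(X_aug[0]))
        ])
    return powershap_values


def fit_univariate(X_train, y_train):
    """Toy stand-in model: one least-squares slope per feature."""
    n, m = len(X_train), len(X_train[0])
    y_mean = sum(y_train) / n
    means, slopes = [], []
    for j in range(m):
        col = [row[j] for row in X_train]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col)
        cov = sum((v - mu) * (t - y_mean) for v, t in zip(col, y_train))
        means.append(mu)
        slopes.append(cov / var if var > 0 else 0.0)
    return means, slopes


def attribute_linear(model, X_val):
    """Toy attribution (not SHAP): slope * (value - training mean)."""
    means, slopes = model
    return [[b * (v - mu) for v, mu, b in zip(row, means, slopes)] for row in X_val]


data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [3.0 * row[0] for row in X]  # only the first feature is informative
values = explain(5, fit_univariate, attribute_linear, X, y)
print(len(values), len(values[0]))  # 5 iterations, 3 columns (2 features + random)
```

In a real setting, `attribute` would wrap a SHAP explainer evaluated on the validation rows, and `fit` would train the chosen machine learning model.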

Note that this formula provides smaller p-values than what should be observed; the correct empirical formula is $(1 + \sum_{i=1}^{n} I(x > s_i))/(n + 1)$, as explained by North et al. [14]. This issue of smaller p-values mainly persists for a lower number of iterations. However, powershap implements Eq. 1 as this anticonservative estimation of the p-value is desired behavior for the automatic mode (see Sect. 3.2). This formula enables setting a static cut-off value for the p-value instead of a varying cut-off value and results in fewer required iterations, while still providing correct results. This will be further explained at the end of Sect. 3.2.

As the hypothesis states that the impact of the random feature should be on average lower than any informative feature, all impacts of the random feature are again averaged, resulting in a single value that can be used in the percentile function. This results in a p-value for every original feature. This p-value represents the fraction of cases where the feature is less important, on average, than a random feature. Given the hypothesis and these p-value calculations, a heuristic implementation of a one-sample, one-tailed ("smaller") Student's t-test can be done, where the null hypothesis states that the random feature (H1-distribution) is not more important than the tested feature (H0-distribution) [12]. Therefore, the positive class in this statistical test represents a true null hypothesis. This heuristic implementation does not assume a distribution on the tested feature impact scores, in contrast to a standard Student's t-test, where a standard Gaussian distribution is assumed. Then, given a threshold p-value α, it is possible to find and output the set of informative features. The pseudocode of Algorithm 2 details how the core powershap feature selection method is realized.
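The difference between the two p-value estimates can be made concrete with a small calculation. The sketch below, with illustrative function names, evaluates both formulas for the best case of a feature, namely when the random feature never has the larger impact.

```python
def anticonservative_p(hits, n):
    """Eq. 1: hits / n, where hits counts iterations with x > s_i."""
    return hits / n


def corrected_p(hits, n):
    """(1 + hits) / (n + 1), the empirical estimate described by North et al."""
    return (1 + hits) / (n + 1)


alpha = 0.01
for n in (10, 50, 99, 100):
    hits = 0  # the random feature is never more important than the tested feature
    print(
        n,
        anticonservative_p(hits, n),
        round(corrected_p(hits, n), 4),
        corrected_p(hits, n) < alpha,
    )
```

Under the corrected formula, the smallest attainable p-value is 1/(n + 1), so a fixed threshold of 0.01 could only be met from 100 iterations onward, whereas Eq. 1 can already reach 0 after a handful of iterations.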


Algorithm 2: Powershap core algorithm
Function Powershap(I ← Iterations, M ← Model, F_set ← F_1, ..., F_m, D ← Data size [n, m], α ← required p-value)
    powershap_values ← Explain(I, M, D)
    S_random ← μ(powershap_values[...][m + 1])
    P_m ← initialize
    for j ← 1, 2, ..., m do
        P[j] ← Percentile(powershap_values[...][j], S_random)
    end
    return {F_i | ∀ i : P[i] < α}
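A minimal sketch of the selection step of Algorithm 2 is given below. It assumes the impact matrix has already been computed and that its last column holds the random feature; the function name, feature names, and numbers are hypothetical.

```python
def powershap_core(powershap_values, feature_names, alpha=0.01):
    """Sketch of Algorithm 2 on a precomputed I x (m + 1) impact matrix.

    Assumption: the last column holds the impacts of the random feature.
    """
    n_iter = len(powershap_values)
    m = len(feature_names)
    # Single reference value: mean random-feature impact over all iterations.
    s_random = sum(row[m] for row in powershap_values) / n_iter
    p_values = {}
    for j, name in enumerate(feature_names):
        # Eq. 1: fraction of iterations where the feature impact is below s_random.
        below = sum(1 for row in powershap_values if s_random > row[j])
        p_values[name] = below / n_iter
    selected = [name for name in feature_names if p_values[name] < alpha]
    return selected, p_values


# Hypothetical impacts: two features plus the random feature (last column).
impacts = [
    [0.52, 0.03, 0.04],
    [0.48, 0.05, 0.03],
    [0.50, 0.02, 0.05],
    [0.55, 0.06, 0.06],
]
selected, p_values = powershap_core(impacts, ["f_signal", "f_noise"])
print(selected)  # ['f_signal']
print(p_values)  # {'f_signal': 0.0, 'f_noise': 0.5}
```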

3.2 Automatic Mode

Running the powershap algorithm, consisting of the Explain and the core components, requires setting two hyperparameters: the p-value threshold α and the number of iterations I. When tuning these hyperparameters, one should make a trade-off between runtime and quality. On the one hand, there should be enough iterations to avoid false negatives for a given α, especially with the anticonservative p-values. On the other hand, adding iterations increases the time complexity. To avoid the need for users to manually optimize these two hyperparameters, powershap also has an automatic mode. This automatic mode determines and optimizes the iteration hyperparameter I using a statistical power calculation for α, hence the name powershap.

The statistical power of a test is 1 − β, where β is the probability of false negatives. In this case, a false negative is a non-informative feature flagged as an informative one. If a statistical test of a tested sample outputs a p-value α, this represents the chance that the tested sample could be flagged as significant by chance given the current data. This is calculated using Eq. 2. If the amount of data in the statistical test is small, it is possible to have a very low α but a large β, resulting in an output that cannot be trusted. Therefore, for a given α, the associated power should be as close to 1 as possible to avoid any false negatives. The power of a statistical test can be calculated using the cumulative distribution function F of the underlying tested distribution H1 using Eq. 3. Figure 1 explains this visually. In the current context, H0 could represent the random feature impact distribution and H1 the tested feature impact distribution.

$$\alpha(x) = F_{H_0}(x) \tag{2}$$

$$\mathrm{Power}(\alpha) = F_{H_1}\left(F_{H_0}^{-1}(\alpha)\right) \tag{3}$$
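Eqs. 2 and 3 can be illustrated numerically when both distributions are known. The sketch below assumes, purely for illustration, two unit-variance Gaussians and a lower-tailed test; the direction of the shift between H0 and H1 is chosen only so that the lower tail carries the rejection region, and the function name `power` is hypothetical.

```python
from statistics import NormalDist


def power(alpha, h0, h1):
    """Eq. 3: F_H1(F_H0^{-1}(alpha)) for a lower-tailed test."""
    critical_value = h0.inv_cdf(alpha)  # inverse of Eq. 2
    return h1.cdf(critical_value)


h0 = NormalDist(mu=0.0, sigma=1.0)
for shift in (0.0, 1.0, 3.0, 5.0):
    h1 = NormalDist(mu=-shift, sigma=1.0)
    # Power grows as the two distributions move further apart.
    print(shift, round(power(0.01, h0, h1), 3))
```

When the two distributions coincide, the power equals α itself; it only approaches 1 once the distributions are well separated, which is why the distance between them (the effect size) enters the calculation of the required number of iterations.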

Fig. 1. Visualization of p-value, effect size, and power for a standard t-test.

The power calculations require the cumulative distribution function F. However, the underlying distributions of the calculated feature impacts are unknown. In addition, calculating F heuristically does not enable calculating the required iteration hyperparameter, which is the goal of the automatic mode. Powershap circumvents this by mapping the underlying distributions to two standard Student's t-distributions, as visualized in Fig. 1. It first calculates the pooled standard deviation, using Eq. 4, by averaging the variances σ² of both distributions and taking the square root. It then calculates the distance d between these two distributions, also called the effect size, in terms of this pooled standard deviation using Cohen's d effect size as detailed in Eq. 5 [12]. Now, it is possible to define two standard Student's t-distributions a distance $\sqrt{I} \cdot d$ apart and with I − 1 degrees of freedom, where I is the number of powershap iterations. The standard central Student's t cumulative distribution function F_CT and the non-central Student's t cumulative distribution function F_NCT are then used to calculate the power of the statistical test according to Eq. 6. This equation can in turn be used in a heuristic algorithm to solve for I. Powershap uses the solve_power implementation of statsmodels to determine the required I from the TTestPower equation using brentq expansion for a provided required power [17]. The powershap pseudocode for the calculation of the effect size, power, and required iterations is shown in Algorithm 3.

$$\mathrm{PooledStd}(s_1, s_2) = \sqrt{\frac{\sigma^2(s_1) + \sigma^2(s_2)}{2}} \tag{4}$$

$$\mathrm{EffectSize}(s_1, s_2) = \frac{\mu(s_1) - \mu(s_2)}{\mathrm{PooledStd}(s_1, s_2)} \tag{5}$$
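Eqs. 4 and 5 translate directly into code. The sketch below uses the population variance for σ², which is an assumption, since the text does not state whether the population or the sample variance is meant; the function names and input values are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def pooled_std(s1, s2):
    """Eq. 4, using the population variance (an assumption)."""
    return sqrt((pvariance(s1) + pvariance(s2)) / 2)


def effect_size(s1, s2):
    """Eq. 5: difference of means in units of the pooled standard deviation."""
    return (fmean(s1) - fmean(s2)) / pooled_std(s1, s2)


# Hypothetical average impacts over five iterations.
feature = [0.50, 0.54, 0.46, 0.52, 0.48]
random_feature = [0.04, 0.06, 0.05, 0.03, 0.07]

d = effect_size(feature, random_feature)
print(round(d, 2))  # roughly 20.1: the two impact distributions are far apart
```

A clearly informative feature yields a very large effect size and therefore needs only few iterations, while a feature whose impacts overlap with those of the random feature yields a small effect size and a large required number of iterations.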

$$\mathrm{TTestPower}(\alpha, I, d_{calc}) = F_{NCT}\left(F_{CT}^{-1}(\alpha, k = I - 1),\; k = I - 1,\; d = \sqrt{I} \cdot d_{calc}\right) \tag{6}$$

Algorithm 3: Powershap analysis function
Function Analysis(α ← required p-value, β ← required power, powershap_values)
    S_random ← powershap_values[...][m + 1]
    P ← size [m]
    N_required ← size [m]
    for j ← 1, 2, ..., m do
        S_j ← powershap_values[...][j]
        P[j] ← Percentile(S_j, μ(S_random))
        effectsize ← EffectSize(S_j, S_random)
        N_required[j] ← SolveTTestPower(effectsize, α, β)
    end
    return P, N_required

With the calculated required number of iterations, the automatic powershap algorithm can be executed. The pseudocode to enable the automatic mode is shown in Algorithm 4. As can be seen, this is an expansion of the core algorithm (see Algorithm 2) and starts with an initial ten iterations to calculate the initial p-value, effect sizes, power, and required iterations for all features. Then, it searches for the largest required number of iterations I_max of all tested features having a p-value below the threshold α. If I_max exceeds the already performed number of iterations I_old, automatic mode continues powershap for the extra required iterations. This process is repeated until the performed iterations reach the required iterations. For optimization, when the extra required iterations (I_max − I_old) exceed ten iterations, the automatic mode first adds ten iterations and then re-evaluates the required iterations, because the required iterations are influenced by the already performed iterations. Furthermore, it is also possible to provide a stopping criterion on the re-execution of powershap to avoid an infinite calculation. As a result, the time complexity of the algorithm is linear in terms of the underlying model and shap explainer and can be formulated as O(p[M_{n+1} + S(M_{n+1})]), with n the number of features, p the number of powershap iterations, S the shap explainer time, and M_x the model fit time for x features.

For the automatic mode, by default, α is set to 0.01 while the required power is set to 0.99. This results in only selecting features that are more important than the random feature for all iterations. Furthermore, this also compensates for the anticonservative p-value and avoids as many false negatives as possible. Realizing the same desired behavior with the more accurate p-value estimation would require a varying α of 1/n, complicating the power calculations and increasing the likelihood of false negatives. The resulting powershap algorithm is implemented in Python as an open-source plug-and-play sklearn compatible component to enable direct usage in conventional machine learning pipelines [15]. The codebase¹ already supports a wide variety of models, such as linear, tree-based, and even deep learning models. To assure the quality and correctness of the implementation, we tested the functionality using unit testing.
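To give a feel for the numbers that a power calculation of this kind produces, the sketch below estimates the required number of iterations with a normal approximation. This deliberately swaps the central and non-central Student's t-distributions of Eq. 6 for a standard normal distribution, so it is not the statsmodels routine mentioned above and will generally give somewhat smaller values for small I; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_required_iterations(effect_size, alpha=0.01, power=0.99):
    """Normal approximation: smallest I with sqrt(I) * |d| >= z_(1-alpha) + z_power."""
    z = NormalDist()
    needed = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    return ceil((needed / abs(effect_size)) ** 2)


for d in (0.5, 1.0, 2.0):
    # Smaller effect sizes require more iterations to reach the same power.
    print(d, approx_required_iterations(d))
```

With the default α of 0.01 and required power of 0.99, an effect size of 1 already calls for roughly twenty iterations under this approximation, and halving the effect size quadruples the requirement.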

¹ The code, documentation, and more benchmarks can be found using the following link: https://github.com/predict-idlab/PowerSHAP.


Algorithm 4: Automatic Powershap algorithm version
Function Powershap(M ← Model, F_set ← F_1, ..., F_m, D^{n×m} ← Data, α ← required p-value, β ← required power)
    powershap_values ← Explain(I ← 10, M, D, rs ← 0)
    P, N_required ← Analysis(α, β, powershap_values)
    I_max ← ceil(N_required[MaxArg(P < α)])
    I_old ← 10
    while I_max > I_old do
        if I_max − I_old > 10 then
            auto_values ← Explain(I ← 10, M, D, rs ← 0)
            I_old ← I_old + 10
        else
            auto_values ← Explain(I ← I_max − I_old, M, D, rs ← 0)
            I_old ← I_max
        end
        powershap_values ← powershap_values ∪ auto_values
        P, N_required ← Analysis(α, β, powershap_values)
        I_max ← ceil(Max(N_required[i, ∀i : P[i] < α]))
    end
    return [F_i, ∀ i : P[i] < α]
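The loop of Algorithm 4 can be sketched with pluggable callables as shown below. The names `explain` and `analysis`, the seed offset passed to `explain`, and the `max_iterations` stopping criterion are assumptions made for this illustration; the fake callables in the demonstration return fixed, hypothetical values.

```python
from math import ceil


def automatic_powershap(explain, analysis, alpha=0.01, batch=10, max_iterations=200):
    """Sketch of Algorithm 4 with pluggable (hypothetical) callables.

    explain(n_iter, seed_offset) -> list of n_iter rows of impacts
    analysis(values) -> (p_values, required_iterations), one entry per feature
    """
    values = explain(batch, 0)
    p_values, required = analysis(values)
    done = batch

    def largest_requirement():
        needed = [required[j] for j, p in enumerate(p_values) if p < alpha]
        return ceil(max(needed)) if needed else 0

    i_max = largest_requirement()
    while i_max > done and done < max_iterations:
        # Add at most one batch, then re-evaluate the required iterations.
        extra = min(batch, i_max - done, max_iterations - done)
        # Assumption: offset the seed so new iterations use new random features.
        values = values + explain(extra, done)
        done += extra
        p_values, required = analysis(values)
        i_max = largest_requirement()
    selected = [j for j, p in enumerate(p_values) if p < alpha]
    return selected, done


def fake_explain(n_iter, seed_offset):
    # Feature 0 is far above the random feature (last column); feature 1 is not.
    return [[0.5, 0.01, 0.05] for _ in range(n_iter)]


def fake_analysis(values):
    n = len(values)
    s_random = sum(row[-1] for row in values) / n
    p = [sum(1 for row in values if s_random > row[j]) / n for j in range(2)]
    return p, [25.0, 400.0]  # hypothetical required iterations per feature


result = automatic_powershap(fake_explain, fake_analysis)
print(result)  # ([0], 25)
```

Only features with a p-value below α contribute to the required number of iterations, so the large requirement of the uninformative second feature does not prolong the run.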

4 Experiments

4.1 Feature Selection Methods

To facilitate a comparison with other feature selection techniques, we benchmark powershap together with other frequently used techniques on both synthetic and real-world datasets. In particular, powershap is compared with both filter and wrapper methods, and with state-of-the-art shap-based wrapper methods. To provide a fair comparison, all methods, including powershap, were used in their default out-of-the-box mode without tuning. For powershap, this default mode is the automatic mode.

Concerning filter methods, two methods were chosen: the chi-squared and F-test feature selection from the sklearn library [15]. The chi-squared test measures the dependence between a feature and the classification outcome and assigns a low p-value to features that are not independent of the outcome. As the chi-squared test only works with positive values, the values are shifted in all chi-squared experiments such that all values are positive. This has no effect on tree-estimators as they are invariant to data scaling [12]. The F-test in sklearn is a univariate test that calculates the F-score and p-values on the predictions of a univariate fitted linear regressor with the target [15]. Both filter methods provide p-values, for which the same threshold as in powershap is used.

As wrapper feature selection method, forward feature selection was chosen. This method is a greedy algorithm that starts with an empty set of features and trains a model with each feature separately. In every iteration, forward feature selection then adds the best feature according to a specified metric, often evaluated in cross-validation, until the metric stops improving. This is generally considered a strong method but has a very large time complexity [6]. Powershap is also compared to shapicant [3] and borutashap [7], two SHAP-based feature selection methods.

The default machine learning model used for all datasets and all feature selection methods, including powershap, is a CatBoost gradient boosting tree-based estimator using 250 estimators with the overfitting detector enabled. For classification, the CatBoost model uses adjusted class weights to compensate for any potential class imbalance. The CatBoost estimator often results in strong predictive performances out-of-the-box, without any hyper-parameter tuning, making it a suitable candidate for benchmarking and comparison [16]. All experiments are performed on a laptop with an Intel(R) Core(TM) i7-9850H CPU at 2.60 GHz and 16 GB RAM running at 2667 MHz, with background processes kept to a minimum.

4.2 Simulation Dataset

The methods are first tested on a simulated dataset to assess their ability to discern noise features from informative features. The simulation dataset is created using the make_classification function of sklearn. This function creates a classification dataset; a regression dataset can be obtained in exactly the same way by using make_regression. The simulations are run using 20, 100, 250, and 500 total features to understand the performance on varying dimensions of feature sets. The ratio of informative features is varied as 10%, 33%, 50%, and 90% of the total feature set, allowing for assessing the quality of the selected features in terms of this ratio. The resulting simulation datasets each contain 5000 samples. Each simulation experiment was repeated five times with different random seeds. The number of redundant features, which are linear combinations of informative features, and the number of duplicate features were set to zero. Redundant features and duplicate features reduce the performance of models, but they cannot be discerned from true informative features as they are inherently informative. Therefore, they are not included in the simulation dataset, as the goal of powershap is to find informative features. The powershap method is compared to shapicant, chi2, borutashap, and the F-test for feature selection on this simulation dataset. Due to time complexity constraints, forward feature selection was not included in the simulation benchmarking.

4.3 Benchmark Datasets

In addition to the simulation benchmark, the different methods are also evaluated on five publicly available datasets, i.e. three classification datasets: the Madelon [19], the Gina priori [20], and the Scene dataset [18], and two regression datasets: CT location [5] and Appliances [5]. The details of these datasets are shown in Table 1.

Table 1. Properties of all datasets

Dataset      Type            Source  # features  Train size  Test size
Madelon      Classification  OpenML  500         1950        650
Gina priori  Classification  OpenML  784         2601        867
Scene        Classification  OpenML  294         1805        867
CT location  Regression      UCI     384         41347       12153
Appliances   Regression      UCI     30          14801       4934

The Scene dataset is a multi-label dataset; however, a multi-label problem can always be reduced to a one-vs-all classification problem. Therefore, only the label “Urban” was chosen here to assess binary classification performance. Almost all of these datasets have a large feature set, ideal for benchmarking feature selection methods. The datasets are split into a training and test set using a 75/25 split. All methods are evaluated using both 10-fold cross-validation on the training set and 1000 bootstraps on the test set to assess the robustness of the performance. The test set is utilized to assess generalization beyond the validation set, as wrapper methods tend to slightly overfit their validation set [6], while the training set is used for feature selection. The forward feature selection method was performed with 5-fold cross-validation and not 10-fold cross-validation due to the high time complexity. A validation set of 20% of the training set is used for shapicant, using the same validation size as powershap in Algorithm 1. The models are evaluated with the AUC metric for classification datasets and with the R2 metric for regression datasets.
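The bootstrap evaluation on the test set can be sketched as follows for the R2 metric. The exact resampling procedure is not specified in the text, so this illustration assumes resampling of test rows with replacement and reports the mean and standard deviation of the metric; all names and data are hypothetical.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination for paired true and predicted values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bootstrap_metric(y_true, y_pred, metric, n_bootstraps=1000, seed=0):
    """Resample test rows with replacement and summarize the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std


# Hypothetical test targets and noisy predictions.
noise = random.Random(1)
y_test = [i / 10 for i in range(100)]
y_hat = [t + noise.gauss(0.0, 0.3) for t in y_test]

mean_r2, std_r2 = bootstrap_metric(y_test, y_hat, r2_score)
print(round(mean_r2, 3), round(std_r2, 4))
```

The standard deviation over the bootstraps gives an indication of how robust the reported test performance is to the particular composition of the test set.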

5 Results

5.1 Simulation Dataset

The results of the simulation benchmarking are shown in Fig. 2. Each row of subfigures shows the duration, the percentage of informative features found, and the number of selected noise features. These measures are shown for each feature selection method for varying feature set dimensions and varying amounts of informative features. As can be seen, the shapicant method is the slowest wrapper method while powershap is clearly the fastest wrapper method. The filter methods are substantially faster than any of the wrapper methods, as they do not train models. Furthermore, powershap finds all informative features with a limited number of outputted noise features up to the case with 250 total features with 50% (125) informative features, outperforming every other method. Beyond that point, powershap no longer finds all informative features, which can be explained by the model underfitting the data. Even with higher dimensional feature sets, powershap finds more informative features than the other methods. Interestingly, most methods do not output many noise features, except for shapicant in the experiment with 20 total and 10% informative features.

Fig. 2. Simulation benchmark results using the make_classification sklearn function for 5000 samples with five different make_classification random seeds.

5.2 Benchmark Datasets

Table 2 shows the duration of the feature selection methods and the size of the selected feature sets for each method on the different open-source datasets. Chi2 does not apply to regression problems and is therefore not included in the results of the CT location and Appliances datasets. The table shows that powershap is again the fastest wrapper method, while the number of selected features is in line with the other methods. The filter methods tend to output more features, while forward feature selection outputs a more conservative set of features.

Table 2. Benchmark results for duration and selected features. “default” indicates no feature selection, i.e. all features.

Duration (s)
Dataset      powershap  borutashap  shapicant  forward  chi2  f test  default
Madelon      132        186         632        10483    < 1   < 1     N/A
Gina priori  184        299         812        68845    < 1   < 1     N/A
Scene        115        220         749        12496    < 1   < 1     N/A
CT location  459        543         1553       56879    N/A   < 1     N/A
Appliances   34         48          134        1913     N/A   < 1     N/A

Selected features
Dataset      powershap  borutashap  shapicant  forward  chi2  f test  default
Madelon      22         10          30         8        43    18      500
Gina priori  105        37          106        26       328   405     784
Scene        36         14          56         15       93    220     294
CT location  123        162         74         75       N/A   350     384
Appliances   24         24          10         13       N/A   20      30

The performance of the selected feature sets for each classification benchmark dataset is shown in Fig. 3a and in Fig. 3b for the regression benchmarks. These figures show that powershap provides a steady performance on all datasets, consistently achieving the best or equal performance on both the cross-validation and test sets. However, even in cases with equal performance, powershap achieves these performances considerably quicker, especially compared to shapicant and forward feature selection. The CT location dataset performances show that forward feature selection tends to overfit on the cross-validation dataset while powershap is more robust.

6 Discussion

For the above test results, we used the default automatic powershap implementation. However, similar to many other feature selection methods, powershap can be further optimized or tuned. One of these optimizations is the use of a convergence mode to extract as many informative features as possible. In this mode, powershap continues recursively in automatic mode, where in every recursive iteration powershap re-executes with all previously found and selected features excluded from the considered feature set. This process continues until no more informative features can be found. The convergence mode is especially useful in use-cases with high dimensional feature sets or datasets with a large risk of underfitting, as it reduces the feature set dimension in each recursive iteration to facilitate finding new informative features. As a basic experiment on the simulation benchmark, using the convergence mode for 500 features and 90% (450) informative features, the percentage of found features increases from around 38% (170) to 73% (330) without adding noise features. However, the duration also increases to the same duration as shapicant.

Other possible optimizations are also applicable to other feature selection methods, such as applying backward feature selection after powershap to eliminate any noise, redundant, or duplicate features. Another possibility is optimizing the used machine learning model to better match the dataset and rerunning powershap, e.g. by using more CatBoost estimators for datasets with large sample sizes and high dimensional feature sets. In the benchmarking results, there are datasets where including all features performs equally well or even better, such as in the case of the Gina prior test set. In these cases, the filter methods perform well but output large feature sets, while the forward feature selection performs the worst. Alternatively, powershap can be used here as a fast wrapper-based dimensionality reduction method to retain approximately the same performance with a much smaller feature set. As such, there will still be a trade-off for each use-case between filter and wrapper methods based on time and performance.
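The convergence mode described above amounts to a simple loop around any feature selector. The sketch below uses a hypothetical `select` callable in place of an automatic powershap run, together with a toy selector that finds at most two informative features per run to mimic an underfitting model.

```python
def convergence_mode(select, features):
    """Repeat selection on the remaining features until nothing new is found.

    select(candidates) -> subset of candidates judged informative
    """
    remaining = list(features)
    found = []
    while remaining:
        newly_found = [f for f in select(remaining) if f in remaining]
        if not newly_found:
            break
        found.extend(newly_found)
        # Exclude everything found so far from the next run.
        remaining = [f for f in remaining if f not in newly_found]
    return found


informative = {"a", "b", "c", "d", "e"}


def limited_select(candidates):
    """Toy selector: finds at most two informative features per run."""
    hits = [f for f in candidates if f in informative]
    return hits[:2]


features = ["a", "n1", "b", "c", "n2", "d", "e"]
converged = convergence_mode(limited_select, features)
print(converged)  # ['a', 'b', 'c', 'd', 'e']
```

Each additional run repeats the full selection procedure on a smaller feature set, which is why the gain in found features comes at the cost of a longer total duration.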
We are aware that the current design of the benchmarks has some limitations. For the simulation benchmark, the make_classification function uses by default a hypercube to create its classification problem, resulting in a linear classification problem, which is inherently easier to classify [15]. The compared filter methods were chosen for their common usage and availability; however, these are fast and simple methods of a much lower complexity than powershap. The same argument could be made for our choice of the forward feature selection method (as wrapper method) compared to other methods such as genetic-algorithm-based solutions. Furthermore, wrapper methods, and thus also powershap, are highly dependent on the used model, as the feature selection quality suffers from modeling issues such as overfitting and underfitting. Therefore, the true potential achievable performances on the benchmark datasets may differ, since every use-case and dataset requires its own tuned model to achieve optimal performance. Additionally, the cut-off values and hyperparameters of the methods were not optimized and are either set to the same value as in powershap or used with their default values. This might impact the performance and could have skewed the benchmark results in both directions. However, choosing the same model and the same values for hyperparameters (if possible) in all experiments reduces potential performance differences and facilitates a sufficiently fair comparison.

Fig. 3. Benchmark performances. The error bars represent the standard deviation.

7 Conclusion

We proposed powershap, a wrapper feature selection method using Shapley values and statistical tests to determine the significance of features. powershap uses power calculations to optimize the number of required iterations in an automatic mode to realize fast, strong, and reliable feature selection. Benchmarks indicate that powershap’s performance is significantly faster and more reliable than comparable state-of-the-art shap-based wrapper methods. Powershap is implemented as an open-source plug-and-play sklearn component, increasing its


J. Verhaeghe et al.

accessibility and ease of use, making it a power-full Shapley feature selection method, ready for your next feature set.

Acknowledgements. Jarne Verhaeghe is funded by the Research Foundation Flanders (FWO, Ref. 1S59522N) and designed powershap. Jeroen Van Der Donckt implemented powershap as an sklearn component. Sofie Van Hoecke and Femke Ongenae supervised the project. A special thanks goes to Gilles Vandewiele for proofreading the manuscript.

Code. The code, documentation, and more benchmarks can be found at https://github.com/predict-idlab/PowerSHAP.

References

1. Altmann, A., Toloşi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340–1347 (2010)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
3. Calzolari, M.: manuel-calzolari/shapicant (2022)
4. Colaco, S., Kumar, S., Tamang, A., Biju, V.G.: A review on feature selection algorithms. In: Shetty, N.R., Patnaik, L.M., Nagaraj, H.C., Hamsavath, P.N., Nalini, N. (eds.) Emerging Research in Computing, Information, Communication and Applications. AISC, vol. 906, pp. 133–153. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-6001-5_11
5. Dua, D., Graff, C.: UCI machine learning repository (2017)
6. Jović, A., Brkić, K., Bogunović, N.: A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1200–1205 (2015)
7. Keany, E.: Borutashap: a wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values (2020)
8. Kumari, B., Swarnkar, T.: Filter versus wrapper feature subset selection in large dimensionality micro array: a review. Int. J. Comput. Sci. Inf. Technol. 2, 6 (2011)
9. Kursa, M.B., Rudnicki, W.R.: Feature selection with the Boruta package. J. Stat. Softw. 36(11), 1–13 (2010)
10. Li, J., Cheng, K., Wang, S., et al.: Feature selection: a data perspective. ACM Comput. Surv. 50(6), 1–45 (2017)
11. Linardatos, P., Papastefanopoulos, V., Kotsiantis, S.: Explainable AI: a review of machine learning interpretability methods. Entropy 23(1), 18 (2020)
12. Lomax, R.G.: An Introduction to Statistical Concepts. Lawrence Erlbaum Associates Publishers, Mahwah, N.J. (2007)
13. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017)
14. North, B.V., Curtis, D., Sham, P.C.: A note on the calculation of empirical p values from Monte Carlo procedures. Am. J. Hum. Genet. 71(2), 439–441 (2002)
15. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
16. Prokhorenkova, L., Gusev, G., et al.: CatBoost: unbiased boosting with categorical features. arXiv:1706.09516 (2019)
17. Seabold, S., Perktold, J.: statsmodels: econometric and statistical modeling with Python. In: 9th Python in Science Conference (2010)
18. Vanschoren, J.: OpenML: gina priori. https://www.openml.org/d/1042
19. Vanschoren, J.: OpenML: madelon. https://www.openml.org/d/1485
20. Vanschoren, J.: OpenML: scene. https://www.openml.org/d/312

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization

Zheng Chen1(B), Lingwei Zhu2, Ziwei Yang3, and Takashi Matsubara1

1 Osaka University, Osaka, Japan
[email protected]
2 University of Alberta, Edmonton, Canada
3 Nara Institute of Science and Technology, Nara, Japan

Abstract. Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial, and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and exhibit complicated dependence, thereby posing a serious challenge to existing subtyping models for outputting sensible clustering. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem-agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels, and, by further medical analysis, this refinement is shown to have a high correlation with cancer survival rates.

Keywords: Cancer subtypes · Information maximization · Clustering

1 Introduction

Cancer is by far one of the deadliest epidemiological diseases known to humans. Consider breast cancer, which is the most prevalent (incidence 47.8% worldwide) and the most well-studied cancer in the world [32]: its 5-year mortality rate can still reach 13.6% [1]. Its heterogeneity is considered the crux limiting the efficacy of targeted therapies and compromising treatment outcomes, since tumors that differ radically at the molecular level might exhibit highly similar morphological appearance [22]. Increasing evidence from modern transcriptomic studies has supported the assumption that each specific cancer is composed of multiple categories (known as cancer subtypes) [4,33]. Reliably identifying cancer subtypes can significantly facilitate prognosis and personalized treatment [21]. However, there is currently a fierce debate in the cancer community: given transcriptomic data of one cancer, authoritative resources hold that there might be different numbers of subtypes from distinct viewpoints. That is, the fiducial definition of the subtypes is constantly undergoing calibration [12], suggesting that for the majority of cancers the ground-truth labeling remains partially unavailable and awaits better definition.

In the data science community, the lack of ground truth for cancer data can be addressed as a clustering problem [11], in which the clusters give a hint of the underlying subtypes. Such clustering methods rely crucially on the quality of the data and on suitable representations. Modern subtyping methods typically leverage molecular transcriptomic expression profiles (expression profiles for short), which consist of genetic and microRNA (miRNA) expressions that characterize the cancer properties [21,26]. However, several dilemmas stand in the way of fully exploiting the power of expression profiles:

– High-dimensionality: the expression profiles typically have > 60,000 dimensions; even after typical preprocessing the dimension can still be > 10,000.
– Scarcity: cancer data are scarce and costly. Even for the most well-studied breast cancer, the largest publicly available dataset consists of expression profiles from only around 1500 subjects [30].
– Dependence: expression profiles have complicated dependence: a specific expression might be under the joint control of several genes, and sometimes such joint regulation can be circular, forming the well-known gene regulation network [10].

Z. Chen and L. Zhu: joint first authors.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M.-R. Amini et al. (Eds.): ECML PKDD 2022, LNAI 13713, pp. 88–103, 2023. https://doi.org/10.1007/978-3-031-26387-3_6
To extract information from the inherently high-dimensional expression profiles for tractable grouping [9], traditional methods preprocess the data via variants of principal components analysis (PCA) or the least absolute shrinkage and selection operator (LASSO) [3] to reduce the dimensionality of the data. However, expression profiles with such complicated dependence have already been shown not to perform well with PCA and LASSO [14], since many seemingly less salient features can play an important role in the gene regulation network. Motivated by the resurgence of deep learning techniques, the community has recently seen promising applications leveraging deep autoencoders (AEs) or variational AEs (VAEs) for compressing the data into a lower-dimensional latent space that models the underlying genetic regulation [33]. However, VAEs with powerful autoregressive decoders often ignore the latent spaces [8,25], which runs the risk of overfitting [28]. Furthermore, the latent representation is assumed to consist of continuous variables (usually Gaussian) [18,31], which is at odds with the inherently categorical cancer subtypes [5]. As a result, those subtyping models might have poor performance as well as poor generalization ability.

Aside from feature extraction, another issue concerns the grouping process itself. Given extracted features from the expression profiles, the above-mentioned methods usually apply similarity-based clustering algorithms such as K-means for subsequent grouping. However, such methods require strong assumptions on the data and are sensitive to representations [27]: one has to define a similarity metric for the data (often Euclidean) and find appropriate transformations (such as the logarithm transform) as informative features. Unsuitable choices of the metric and transformation can greatly degrade the model performance. Recently, mutual information has been gaining considerable popularity in deep representation learning as a replacement for similarity metrics [6,13]: it is the unique measure of relatedness between a pair of variables that is invariant to invertible transformations of the data, hence one does not need to find the right representation [20]. Better yet, if two genes share more than one bit of information, then the underlying mechanism must be more subtle than just on and off. Such subtlety and more general dependence can be captured by the mutual information [27].

In this paper, we propose a novel, generally applicable clustering method that is capable of fully exploiting the expression profiles and outputting sensible cancer subtyping solutions. Besides tackling the above-mentioned problems in a unified and consistent manner, the proposed method has the intriguing property of automatically adjusting the number of groups thanks to its special architecture, in sharp contrast to prior methods that predetermine the number of groups by domain knowledge. Before introducing the proposed architecture in Sect. 3, we summarize our contributions as follows:

– (Algorithmic) Inspired by recent work, we propose a novel clustering method, vector quantization regularized information maximization (VQ-RIM), for cancer subtyping. VQ-RIM maximizes mutual information in the categorical VQ-VAE model, which results in a combination of VAE reconstruction loss and mutual information loss. (Sect. 3.2)
– (Effective) We compare the clustering results of VQ-RIM against existing ground truth labels (together with controversial labels from the entirety of labels) on different cancer datasets and find that VQ-RIM concords well with the ground truth, which verifies the correctness of VQ-RIM. (Sects. 4.1 and 4.2)
– (Medical) Extensive experiments on distinct cancers verify that VQ-RIM produces subtyping that consistently outperforms the controversial labels in terms of enlarged separation of between-group life expectancies. The clearer separation suggests VQ-RIM is capable of better capturing the underlying characteristics of subtypes than controversial labels. We believe such results are far-reaching in providing new insights into the unsettled debate on cancer subtyping. (Sect. 4.2)

2 Related Work

Feature Extraction for Subtyping. Building a model suitable for cancer subtyping is non-trivial as a result of cancer data scarcity. High dimensionality and data scarcity pose a great challenge to automated models for generating reliable clustering results [31]. Conventionally, the problem is tackled by leveraging classic dimension reduction methods such as PCA [3]. However, since the progress of cancers is regulated by a large number of genes in a complicated manner (and these genes are themselves under the control of miRNAs), brute-force dimension reduction runs the risk of removing informative features [15]. On the other hand, recently popular AE-based models [21,33], especially VAEs, construct the feature space by reconstructing the input through a multi-dimensional Gaussian posterior distribution in the latent space [31]. The latent posterior learns to model the underlying causalities, which in the cancer subtyping context corresponds to modeling the relationship among expression profiles such as regulation or co-expression [33]. Unfortunately, recent investigation has revealed that VAEs with powerful autoregressive decoders easily ignore the latent space. As a result, the posterior could be either too simple to capture the causalities, or so complicated that the posterior distribution becomes brittle and at risk of posterior collapse [2,25]. Moreover, the Gaussian posterior is at odds with the inherently categorical cancer subtypes [5].

In this paper, we propose to leverage the categorical VQ-VAE to address the aforementioned issues: (i) VQ-VAE does not train its decoder, preventing the model from ignoring its latent feature space as a result of an over-powerful decoder; (ii) VQ-VAE learns categorical correspondence between input expression profiles, latent representations, and output subtypes, which theoretically suggests better capability of learning more useful features; (iii) the categorical latent allows the proposed model to automatically set a suitable number of groups by plugging in a mutual information maximization classifier, which is not available for VAEs.

Information Maximization for Subtyping. Cancer subtyping is risk-sensitive, since misspecification might incur an unsuitable treatment modality. It is hence desired that the clustering be as certain as possible for individual predictions, while keeping subtypes as separated as possible [7,11]. Further, to allow for subsequent analysis and further investigation by medical experts, it is desired that the method output a probabilistic prediction for each subject. In short, we might summarize the requirements for the subtyping decision boundaries as follows: they (i) should not be overly complicated; (ii) should not be located where subjects are densely populated; (iii) should output probabilistic predictions. These requirements can be formalized via an information-theoretic objective as maximizing the mutual information between the input expression profiles and the output subtypes [19,29]. Such an objective is problem-agnostic, transformation-invariant, and unique for measuring the relationship between pairs of variables. Superior performance over knowledge-based heuristics has been shown by exploiting such an objective [27].

3 Method

3.1 Problem Setting

Let $\mathcal{X}$ be a dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, where $x_i \in \mathbb{R}^d$, $1 \le i \le N$, are $d$-dimensional vectors consisting of cancer expression profiles. For a given $x$, our goal lies in determining a suitable cancer subtype $y \in \{1, 2, \dots, K\}$, where $K$ is not fixed beforehand and needs to be automatically determined. Numeric values such as $y = 1, \dots, K$ do not bear any medical interpretation on their own and simply represent distinct representations due to the underlying data. It is worth noting that while a label set $\mathcal{Y}$ is available, it comprises a small subset of ground-truth labels $\mathcal{Y}_{gt} := \{y_{gt}\}$ that have been medically validated and a larger portion of controversial labels $\mathcal{Y}_c := \{y_c\}$, with $\mathcal{Y}_{gt} = \mathcal{Y} \setminus \mathcal{Y}_c$. Our approach is to compare the clustering result $y$ of the proposed method against the ground truth labels $y_{gt}$ to see if they agree well, as a first step of validation. We then compare $y$ against the controversial labels $y_c$ and conduct extensive experiments to verify that the proposed method improves upon the subtyping given by $y_c$.

Our goal is to learn, in an unsupervised manner, a discriminative classifier $D$ which outputs the conditional probability $P(y|x, D)$. Naturally, it is expected that $\sum_{k=1}^{K} P(y = k|x, D) = 1$, and we would like $D$ to be probabilistic so that the uncertainty associated with assigning data items can be quantified. Following [28], we assume the marginal class distribution $P(y|D)$ is close to the prior $P(y)$ for all $k$. However, unlike prior work [19,28], we do not assume that the number of examples per class in $\mathcal{X}$ is uniformly distributed, due to the imbalance of subtypes in the data.

3.2 Proposed Model

Information Maximization. Given the expression profile of a subject $x$, the discriminator outputs a $K$-dimensional logit vector $D(x) \in \mathbb{R}^K$. The probability of $x$ belonging to any of the $K$ subtypes is given by the softmax parametrization:

$$P(y = k|x, D) = \frac{e^{D_k(x)}}{\sum_{k'=1}^{K} e^{D_{k'}(x)}},$$

where $D_k(x)$ denotes the $k$-th entry of the vector $D(x)$. Let us drop the dependence on $D$ for uncluttered notation. It is naturally desired that the individual prediction be as certain as possible, while the distance between the predicted subtypes be as large as possible. This consideration can be effectively reflected by the mutual information between the input expression profiles and the output prediction label. Essentially, the mutual information can be decomposed into the following two terms:

$$\hat{I}(x, y) := \underbrace{-\sum_{k=1}^{K} P(y = k) \log P(y = k)}_{\hat{H}(P(y))} + \alpha \underbrace{\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} P(y = k|x_i) \log P(y = k|x_i)}_{-\hat{H}(P(y|\mathcal{X}))}, \qquad (1)$$

which are the marginal entropy of labels $\hat{H}(P(y))$ and the conditional entropy $\hat{H}(P(y|\mathcal{X}))$ approximated by $N$ Monte Carlo samples $x_i$, $i \in \{1, \dots, N\}$. $\alpha$ is an adjustable parameter for weighting the contribution; setting $\alpha = 1$ recovers the standard mutual information formulation [19]. This formulation constitutes the regularized information maximization (RIM) part of the proposed method. The regularization effect can be seen from the following:


• Conditional entropy $\hat{H}(P(y|\mathcal{X}))$ encourages confident prediction by minimizing uncertainty. It effectively captures the modeling principle that decision boundaries should not be located in dense regions of the data [11].
• Marginal entropy $\hat{H}(P(y))$ aims to separate the subtypes as far as possible. Intuitively, it attempts to keep the subtypes uniform. Maximizing only $-\hat{H}(P(y|\mathcal{X}))$ tends to produce degenerate solutions by removing subtypes [6,19], hence $\hat{H}(P(y))$ serves as an effective regularization for ensuring non-trivial solutions.

Categorical Latents Generative Feature Extraction. Recent studies have revealed that performing RIM alone is often insufficient for obtaining stable and sensible clustering solutions [6,20,28]. Discriminative methods are prone to overfitting spurious correlations in the data; e.g., some entry A in the expression profiles might appear to have direct control over certain other entries B. The model might naïvely conclude that the appearance of B is positive evidence of A. However, such a relationship is in general not true due to the existence of complicated biological functional pathways: such pathways have complex (sometimes circular) dependence between A and B [24]. Since discriminative methods model $P(y|x)$ but not the data generation mechanism $P(x)$ (and the joint distribution $P(x, y)$) [11], such dependence between genes and miRNAs might not be effectively captured by solely exploiting the discriminator, especially given the issues of data scarcity and high dimensionality.

A generative model that explicitly captures the characteristics of $P(x)$ is often introduced as a remedy when leveraging RIM-based methods [13,23,28]. Such methods highlight the use of VAEs for modeling the latent feature spaces underlying the input $\mathcal{X}$: given input $x$, VAEs attempt to compress it to a lower-dimensional latent $z$, and reconstruct $\tilde{x}$ from $z$. Recently there has been active research on leveraging VAEs for performing cancer subtyping [31,33].
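As an illustration of the RIM objective in Eq. (1), the following is a minimal NumPy sketch, not the authors' implementation. It assumes the marginal $P(y = k)$ is estimated by averaging the per-sample softmax outputs, which is a common choice but is not stated explicitly in the text; the function names `softmax` and `rim_objective` and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def rim_objective(logits, alpha=1.0, eps=1e-12):
    """Empirical estimate of Eq. (1): H(P(y)) - alpha * H(P(y|X)).

    logits: array of shape (N, K), the discriminator outputs D(x_i).
    Assumption: P(y=k) is estimated as the mean of P(y=k|x_i) over samples.
    """
    p_y_given_x = softmax(logits)                       # shape (N, K)
    p_y = p_y_given_x.mean(axis=0)                      # shape (K,)
    marginal_entropy = -np.sum(p_y * np.log(p_y + eps))
    conditional_entropy = -np.mean(
        np.sum(p_y_given_x * np.log(p_y_given_x + eps), axis=1)
    )
    return marginal_entropy - alpha * conditional_entropy

# Toy logits for N = 4 subjects and K = 2 subtypes (hypothetical values).
confident_balanced = np.array([[10.0, 0.0], [0.0, 10.0], [10.0, 0.0], [0.0, 10.0]])
collapsed = np.array([[10.0, 0.0]] * 4)   # confident, but one subtype is removed
uncertain = np.zeros((4, 2))              # balanced, but every prediction is a coin flip

for name, logits in [("confident_balanced", confident_balanced),
                     ("collapsed", collapsed),
                     ("uncertain", uncertain)]:
    print(name, rim_objective(logits))
```

In this toy setting only the confident and balanced assignment scores high, while both the collapsed and the uncertain assignments score near zero, which matches the roles of the two entropy terms described in the bullets above.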
However, existing literature leverages continuous latents (often Gaussian) for tractability, which is at odds with the inherently categorical cancer subtypes. Furthermore, VAEs often ignore the latents, which implies the extracted feature space is essentially dismissed and again runs the risk of overfitting [2]. We exploit the recent vector quantization variational auto-encoder (VQ-VAE) [25] as the generative part of the proposed architecture. The categorical latents of VQ-VAE are not only suitable for modeling inherently categorical cancer subtypes, but also avoid the above-mentioned latent ignoring problem [18]. In VQ-VAE, the latent embedding space is defined as $\{e_i\} \in \mathbb{R}^{M \times l}$, where $M$ denotes the number of embedding vectors and hence defines an $M$-way categorical distribution, and $l < d$ is the dimension of each latent embedding vector $e_i$, $i \in \{1, \dots, M\}$. VQ-VAE maps input $x$ to a latent variable $z$ via its encoder $z_e(x)$ by performing a nearest neighbor search among the embedding vectors $e_i$, and outputs a reconstructed vector $\tilde{x}$ via its decoder $z_q$. VQ-VAE outputs a deterministic posterior distribution $q$ such that

$$q(z = k|x) = \begin{cases} 1, & \text{if } k = \arg\min_j \|z_e(x) - e_j\|_2^2 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
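The nearest neighbor lookup behind Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration of the forward quantization step under assumed toy values; the function name `quantize`, the codebook, and the encoder outputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor quantization as in Eq. (2).

    z_e:      (N, l) encoder outputs z_e(x).
    codebook: (M, l) embedding vectors e_j.
    Returns the selected indices, the one-hot posterior q(z|x), and z_q.
    """
    # Squared Euclidean distance between every encoder output and every embedding.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    indices = dists.argmin(axis=1)
    # Deterministic posterior: probability 1 on the nearest embedding, 0 elsewhere.
    q = np.zeros_like(dists)
    q[np.arange(len(z_e)), indices] = 1.0
    z_q = codebook[indices]
    return indices, q, z_q

# Hypothetical toy codebook with M = 3 embeddings of dimension l = 2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [4.0, 6.0]])

indices, q, z_q = quantize(z_e, codebook)
print(indices)
print(q)
```

Each row of `q` is a one-hot vector, so the posterior carries no uncertainty about which embedding a subject is assigned to.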


Fig. 1. Overview of the proposed system. D denotes the discriminator, G denotes the generator.

The decoder does not possess gradients and is trained by copying the gradients from the encoder. The final output of the decoder is the log-posterior probability $\log P(x|z_q)$, which is part of the reconstruction loss.

Architecture and Optimization. We propose a novel model for clustering expression profiles as shown in Fig. 1. The model consists of a discriminator, denoted as $D$, that maximizes the mutual information, and a generator $G$ that aims to reconstruct the input via modeling a categorical underlying latent feature space spanned by $\{e_i\}$. $D$ and $G$ are deeply coupled via the latent embeddings $z$, which is made possible by the fact that the decoder of VQ-VAE does not possess gradients, and hence the embedding space can be controlled by only the encoder and the discriminator. In prior work, the generator is often architecturally independent from the discriminator and is only weakly related through loss functions [13,20,28]. Intuitively, one can consider that the proposed model attempts to simultaneously minimize the reconstruction loss and maximize the mutual information:

$$L := \underbrace{\hat{H}(P(y)) - \hat{H}(P(y|z)) - R(\lambda)}_{L_D} + \underbrace{\log P(x|z_q) + \|\mathrm{sg}[z_e] - e\|_2^2 + \|z_e - \mathrm{sg}[e]\|_2^2}_{L_G}, \qquad (3)$$

where $L_D$, $L_G$ denote the discriminator loss and the generator loss, respectively. $R(\lambda)$ is a possible regularizer that controls the weight growth, e.g. $R(\lambda) := \frac{\lambda}{2} \|w^\top w\|_2^2$, where $w$ denotes the weight parameters of the model. $\mathrm{sg}[\cdot]$ denotes the stop gradient operator.

Automatically Setting Number of Subtypes. The proposed model can automatically determine a suitable number of subtypes by exploiting hidden information contained in the expression profiles, which is not available to conventional methods such as K-means that rely on prior knowledge. The automatic subtyping is made possible via the deeply coupled latents and the discriminator: the multi-layer perceptron in the discriminator outputs the logarithm of the posterior distribution $\log q(z|x)$. However, by the definition in Eq. (2) the posterior is deterministic, which suggests $\log q(z|x)$ must either be 0 or tend to $-\infty$. The subsequent softmax layer hence outputs:

$$P(y = k|z) = \begin{cases} \dfrac{q(z = k|x)}{\sum_{k'=1}^{K} q(z = k'|x)}, & \text{if } k = \arg\min_j \|z_e(x) - e_j\|_2^2 \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

We can initially set $K$ to a sufficiently large integer $\tilde{K}$ that covers the maximum possible number of subtypes. Since the nearest neighbor lookup of VQ-VAE typically only updates a small number of embeddings $e_j$, by Eq. (4) we see that for any unused $e_i$, $i \neq j$, the clustering probability is zero, which suggests the number of subtypes will finally narrow down to a much smaller number $K \ll \tilde{K}$.
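The narrowing from $\tilde{K}$ to $K$ can be illustrated by counting which embeddings are ever selected by the nearest neighbor lookup. The sketch below is a toy illustration under assumed values, not the authors' training procedure: the codebook, the encoder outputs, and the helper `used_codes` are hypothetical, and the codebook is taken as fixed rather than learned.

```python
import numpy as np

def used_codes(z_e, codebook):
    """Return the indices of embeddings selected at least once, and their counts."""
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = dists.argmin(axis=1)
    return np.unique(assignments, return_counts=True)

# Hypothetical codebook with K_tilde = 8 entries: three lie near the data,
# the remaining five are far away and are never the nearest neighbor.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
far_away = np.array([[100.0 + i, 100.0] for i in range(5)])
codebook = np.vstack([centers, far_away])
K_tilde = len(codebook)

# Hypothetical encoder outputs: two points close to each of the three centers.
offsets = np.array([[0.1, 0.0], [-0.1, 0.2]])
z_e = (centers[:, None, :] + offsets[None, :, :]).reshape(-1, 2)

used, counts = used_codes(z_e, codebook)
K = len(used)  # effective number of subtypes
print(f"K_tilde = {K_tilde}, effective K = {K}, used indices = {used.tolist()}")
```

Embeddings that are never selected receive zero probability under Eq. (4), so only the used indices correspond to subtypes that actually appear in the output.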

4 Experiments

The expression profile data used in this study were collected from the Genomic Data Commons (GDC) portal, the world's largest cancer gene information database. All of the expression data used were generated from cancer samples prior to treatment. We utilized the expression profiles of three representative types of cancer for experiments:

– Breast invasive carcinoma (BRCA): BRCA is the most prevalent cancer in the world. Its expression profiles were collected from the Illumina Hi-Seq platform and the Illumina GA platform.
– Brain lower grade glioma (LGG): the expression profiles were collected from the Illumina Hi-Seq platform.
– Glioblastoma multiforme (GBM): the expression profiles were collected from the Agilent array platform. Results on this dataset are deferred to the appendix.

These datasets consist of continuous-valued expression profiles (feature length: 11327) of 639, 417 and 452 subjects, respectively. Additional experimental results and hyperparameters can be found in Appendix Section A, available at https://arxiv.org/abs/2206.10801.

The experimental section is organized as follows: we first compare the clustering results with the ground truth labels $\mathcal{Y}_{gt}$ in Sect. 4.1 to validate the proposed method. We show in Sect. 4.2 that VQ-RIM consistently re-assigns subjects to different subtypes and produces one more potential subtype with enlarged separation in between-group life expectancies, which in turn suggests VQ-RIM is capable of better capturing the underlying characteristics of subtypes. Extensive ablation studies on both the categorical generator (VQ-VAE) and the information maximizing discriminator (RIM) are performed to validate the proposed architecture in Sect. 4.3. We believe the VQ-RIM subtyping result is far-reaching and can provide important new insights into the unsettled debate on cancer subtyping.

4.1 Ground Truth Comparison

To validate the correctness of VQ-RIM, we show an example in Fig. 2: the Basal-like subtype of BRCA, which has been well studied and extensively validated by human experts and can be confidently subtyped, so that its labels can be used as the ground-truth labels $\mathcal{Y}_{gt}$. Other subtypes, however, lack such well-verified labels and are regarded as the controversial labels $\mathcal{Y}_c$. The left subfigure of Fig. 2 shows the two principal axes of Basal-like expression profiles after PCA. The blue triangles in the right subfigure indicate the differences between $\mathcal{Y}_{gt}$ and the VQ-RIM result. It can be seen that VQ-RIM agrees well with the ground truth.

Fig. 2. Comparison between $\mathcal{Y}_{gt}$ and the VQ-RIM label $y$ on the Basal-like subtype of BRCA.

4.2 Controversial Label Comparison

Subtype Comparison. We compare the existing controversial labels $\mathcal{Y}_c$ with the clustering results of VQ-RIM in Fig. 3. VQ-RIM output sensible decision boundaries that separated the data well and consistently produced one more subtype than $\mathcal{Y}_c$. As confirmed in Sect. 4.1, the Basal-like subtype concorded well with the VQ-RIM Cluster A. On the other hand, other subtypes exhibited significant differences: the controversial labels seem to fit compactly into a fan-like shape in the two-dimensional visualization. This is owing to the human experts' heuristics in subtyping: intuitively, the similarity of tumors in clinical variables such as morphological appearance often leads to them being classified into an identical subtype. However, cancer subtypes are the result of complicated causes on the molecular level. Two main observations can be made from the BRCA VQ-RIM labels: (1) Luminal A was divided into three distinct clusters C, D, E. Cluster E now occupies the left and right wings of the fan, which are separated by Clusters B and C; (2) a new subtype, Cluster F, emerged from Luminal B, which was indistinguishable from Cluster E if naïvely viewed from the visualization. This counter-intuitive clustering result confirms that the complexity of cancer subtypes in expression profiles seldom admits simple representations such as those used in the controversial labels. A similar conclusion holds for other datasets such as LGG: IDH mut-codel was divided into two distinct subtypes (Cluster A, B), among which the new subtype Cluster A found by VQ-RIM occupied the right wing of IDH mut-codel. In later subsections, the additional cluster and the re-assignment of VQ-RIM are justified by analyzing the subtype population and from a medical point of view. Due to the page limit, we provide analysis focusing on BRCA only.

Fig. 3. PCA visualization of the first two principal axes for BRCA and LGG.

Label Flows. The controversial labels run the risk of over-simplified assignment: in regions where several distinct subtypes overlap, the controversial labels put all subjects into one of them without further identifying their sources. Such assignment is illustrated in Fig. 4. Here, Fig. 4a plots the sample distribution, with darker colors indicating denser populations of samples. It is visible that the samples can be assigned to five clusters. However, by injecting subtyping label information, it is clear from Fig. 4b that in the lower left corner there existed strong overlaps of three different subtypes. The controversial labels assigned them to a single subtype, Luminal A. VQ-RIM, on the other hand, was capable of separating those three subtypes. This separation can be seen from Fig. 4c, which compares the two labelings when setting the number of VQ-RIM subtypes to 5 in accordance with the controversial labels, or to 6 by setting $K$ to a sufficiently large value and automatically determining the suitable number of subtypes. In either case, VQ-RIM consistently separated Luminal A into three distinct subtypes: (B, C, E) in the 5-subtype case and (C*, D*, E*) in the 6-subtype case. In the next subsection, we verify the effectiveness of such finer-grained subtyping by performing survival analysis.

Medical Evaluation. To demonstrate the clinical relevance of the identified subtypes, we perform subtype-specific survival analysis using the Kaplan-Meier (KM) estimate. The KM estimate is one of the most frequently used statistical methods for survival analysis, which we use to complementarily validate the VQ-RIM labels from a clinical point of view [16].
KM compares survival probabilities in a given length different sample groups. The KM estimator is  of time between i , where ni is the number of samples under observation given by: S = i:ti