Unsupervised and Semi-Supervised Learning Series Editor M. Emre Celebi, Computer Science Department, Conway, AR, USA
Springer’s Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles – including monographs, contributed works, professional books, and textbooks – tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications, including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data.

Topics of interest include:
- Unsupervised/Semi-Supervised Discretization
- Unsupervised/Semi-Supervised Feature Extraction
- Unsupervised/Semi-Supervised Feature Selection
- Association Rule Learning
- Semi-Supervised Classification
- Semi-Supervised Regression
- Unsupervised/Semi-Supervised Clustering
- Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
- Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
- Applications of Unsupervised/Semi-Supervised Learning

While the series focuses on unsupervised and semi-supervised learning, outstanding contributions in the field of supervised learning will also be considered. The intended audience includes students, researchers, and practitioners. The books of this series are indexed in zbMATH.
Frederic Ros • Rabia Riad
Feature and Dimensionality Reduction for Clustering with Deep Learning
Frederic Ros University of Orléans Orléans, France
Rabia Riad University of Ibnou Zohr Ouarzazate, Morocco
ISSN 2522-848X   ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-031-48742-2   ISBN 978-3-031-48743-9 (eBook)
https://doi.org/10.1007/978-3-031-48743-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
About the Book
Clustering has always played a crucial role in problem-solving, offering valuable insights into data organization. However, in the age of big data, traditional clustering methods face considerable challenges related to the curse of dimensionality and scalability. To address these issues, recent advances in deep clustering models have garnered substantial attention. These models provide unique capabilities, efficiently handling complex, high-dimensional, and large-scale datasets thanks to their remarkable representational capacity and rapid inference capabilities. Although feature selection and dimensionality reduction are distinct challenges, both share the fundamental goal of reducing data dimensionality. In high-dimensional clustering scenarios, the central question revolves around finding a compressed representation that preserves the semantic structures of clusters effectively. This raises the challenge of defining an objective function to encourage a suitable representation in the absence of labeled data.

This book offers a comprehensive overview of modern feature selection and dimensionality reduction methods based on Deep Neural Networks (DNNs) within the context of clustering. It places particular emphasis on knowledge dissemination, catering to non-experts such as professionals, researchers, students, and enthusiasts from diverse domains, enabling them to grasp these concepts effectively. The initial chapters provide a foundational introduction to clustering and peripheral techniques, tracing the evolution toward deep learning approaches. These chapters serve as stepping stones, gradually immersing readers in the realm of deep clustering and elucidating key concepts and techniques. The book then spotlights and discusses the most representative deep clustering methods, presenting them in a practical and illustrative manner to facilitate reader understanding. In the concluding chapters, the book delves into current challenges and issues within the domain of deep clustering, encouraging readers to contemplate potential solutions and future directions for advancements.

This book is thoughtfully designed to be accessible to non-experts, avoiding technical jargon and offering clear, step-by-step guidance along with numerous links to additional resources. It serves as a valuable asset for anyone interested in exploring cutting-edge techniques in deep clustering for big data analysis. Through practical explanations and approachable language, the book equips readers with the knowledge necessary to effectively apply deep clustering across a wide spectrum of domains.
Contents

1 Introduction
  References
2 Dimensionality Reduction
  1 PCA: Principal Component Analysis
  2 ICA: Independent Component Analysis
  3 NMF: Non-Negative Matrix Factorization
  4 Kohonen Neural Network
  5 ISOMap: Isometric Mapping
  6 UMAP: Uniform Manifold Approximation and Projection
  7 t-SNE: t-Distributed Stochastic Neighbor Embedding
  8 Autoencoders
  9 Discussion and Comparisons
    9.1 Comparison Between PCA and ICA
    9.2 Comparison Between PCA and NMF
    9.3 Comparison Between t-SNE and SOM
    9.4 Comparison Between UMAP and t-SNE
    9.5 Comparison Between SOM and Autoencoders
  References
3 Feature Selection
  1 Taxonomy
    1.1 Filter Methods
    1.2 Wrapper Methods
    1.3 Embedded Methods
  2 Popular Unsupervised Feature Selection Methods
    2.1 Filter-Based Methods
    2.2 Wrapper-Based Methods
    2.3 Embedded Methods
  3 Deep Learning and Feature Selection
  References
4 Clustering
  1 Taxonomy
  2 Popular Clustering Algorithms
    2.1 K-Means
    2.2 GMM: Gaussian Mixture Models Clustering Algorithm
    2.3 Mean Shift Clustering Algorithm
    2.4 Hierarchical Clustering
    2.5 OPTICS: Ordering Points to Identify the Clustering Structure
    2.6 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
    2.7 DPC: Density Peak Clustering
    2.8 STING: Statistical Information Grid in Data Mining
    2.9 Spectral Clustering
  3 Recent Clustering Algorithms
  4 Evaluation Metrics Used in Clustering
    4.1 External Evaluation
    4.2 Internal Evaluation
  5 Peripherical Metrics/Norms
    5.1 KL-Divergence
    5.2 Binary Cross-Entropy
    5.3 Cosine Similarity
    5.4 Frobenius Norm
  6 Subspace and Ensemble Clustering
  References
5 Problematic in High Dimension
  1 High Dimensionality
  2 Representation Learning
  3 Deep Learning Algorithms and Representation Learning
  4 Importance of a Good Latent Space
  5 Semantic Discovery in Unsupervised Scenarios
  References
6 Deep Learning Architectures
  1 Convolution Neural Networks
    1.1 Brief History
    1.2 Main Components
    1.3 Training CNNs
  2 AE: Autoencoder
  3 VAE: Variational Autoencoders
  4 GAN: Generative Adversarial Network
  5 Siamese Neural Networks
  References
7 Learning Approaches and Tricks
  1 Training Techniques and Optimization for Deep Learning Architectures
    1.1 Activation Functions
    1.2 Gradient Retro Propagation
    1.3 Regularization Techniques
    1.4 Dropout
    1.5 Network Pruning
    1.6 Sparsity
  2 Standard and Novel Learning Strategies
    2.1 Supervised Learning
    2.2 Unsupervised Learning
    2.3 Reinforcement Learning
    2.4 Transfer Learning
    2.5 Self-Supervised Learning
    2.6 Semi-Supervised Learning
    2.7 Active Learning
    2.8 Similarity Learning
    2.9 Self-Paced Learning
    2.10 Subspace Learning
  3 Self-Supervision in Deep
    3.1 Pretext Tasks
    3.2 Contrastive Learning
    3.3 Data Augmentation
  References
8 Deep Feature Selection
  1 Limitations of Convention Feature Selection Methods
  2 Measurement Criterion Issue
  3 Transitioning to Deep Feature Selection
  4 Taxonomy of Feature Selection Techniques with Deep Learning
  5 Popular Methods
    5.1 FSAE: Feature Selection Guided Autoencoder
    5.2 AEFS: AutoEncoder Feature Selector
    5.3 GAFS: Graph Regularized Autoencoder Feature Selection
    5.4 RAE: Restricted Autoencoder
    5.5 CAE: Concrete Autoencoder
    5.6 LS-CAE: Laplacian Score-Regularized
    5.7 LRLMR: Latent Representation Learning and Graph-Based Manifold Regularization
    5.8 RNE: Robust Neighborhood Embedding
    5.9 DUFS: Differentiable Unsupervised Feature Selection Based on a Gated Laplacian
    5.10 UFS-TAE
  References
9 Deep Clustering Techniques
  1 Taxonomy of Deep Clustering Techniques
  2 Exploring Categories: A Conceptual Overview
  References
10 Deep Clustering Techniques Based on CNN
  1 JULE: Joint Unsupervised LEarning of Deep Representations and Image Clusters
  2 IMSAT: Information Maximizing Self-augmented Training
  3 DAC: Deep Adaptive Image Clustering
  4 SCAN: Semantic Clustering by Adaptive Nearest Neighbors
  5 NNM: The Nearest Neighbor Matching
  6 DeepCluster
  7 Deep Clustering with Sample-Assignment Invariance Prior
  8 IIC: Invariant Information Clustering
  9 IDFD: Instance Discrimination and Feature Decorrelation
  10 SimCLR: Simple Framework for Contrastive Learning of Visual Representations
  11 MoCo: Momentum Contrast for Unsupervised Visual Representation Learning
  12 PICA: PartItion Confidence mAximisation
  13 DSC-PCM: Deep Semantic Clustering by Partition Confidence Maximization
  14 SPICE: Semantic Pseudo-Labeling for Image Clustering
  15 PCL: Prototypical Contrastive Learning
  16 ProPos: Prototype Scattering and Positive Sampling
  17 BYOL: Bootstrap Your Own Latent
  18 SwAV: Swapping Assignments Between Multiple Views
  19 SimSiam
  20 EDCN: Unsupervised Discriminative Feature Learning via Finding a Clustering-Friendly Embedding Space
  References
11 Deep Clustering Techniques Based on Autoencoders
  1 DEN: Deep Embedding Network for Clustering
  2 DEC: Deep Embedded Clustering
  3 IDEC: Improved Deep Embedded Clustering
  4 DCN: Deep Clustering Network
  5 DEPICT: DEeP Embedded RegularIzed ClusTering
  6 DBC: Discriminatively Boosted Clustering
  7 COAE: Clustering with Overlapping Autoencoder
  8 ADEC: Adversarial Deep Embedded Clustering
  9 DSSEC: Deep Stacked Sparse Embedded Clustering Method
  10 DPSC: Discriminative Pseudo-supervision Clustering
  11 DCSPC: Deep Convolutional Self-paced Clustering
  12 N2D: Not Too Deep
  13 ACIC: Adaptive Correlation for Deep Image Clustering
  References
12 Deep Clustering Techniques Based on Generative Architectures
  1 Generative Architectures: VAE’s FAMILY
    1.1 VaDE: Variational Deep Embedding
    1.2 GMVAE: Gaussian Mixture Variational Deep Embedding
  2 Generative Architectures: GAN’s FAMILY
    2.1 CatGAN
    2.2 InfoGAN
    2.3 ClusterGAN
    2.4 BiGAN: Bidirectional Generative Adversarial Network and ALI: Adversarially Learned Inference
    2.5 More Recent Architectures
  References
13 Deep Clustering Techniques: Synthesis
  1 Classification Accuracy
  2 Multicriteria Evaluation
    2.1 AE Approaches
    2.2 CNN Approaches
    2.3 Generative Approaches
  References
14 Issues and Challenges
  1 Deep Architectures: Complexity
  2 Deep Architectures: Lack of Interpretability
  3 Feature Selection Issue with Large Data
  4 Clustering Limitation
  5 Deep Clustering Problematic
  6 Challenges in Image Clustering
  7 The Question of Semantic Representation Is Not Completely Solved
  8 AE and VAE Have Predefined Distribution
  9 Hybridization Between Selection and Dimensionality Techniques
  10 Exploratory Analysis of Big and Complex Data: Utopia?
  References
15 Conclusion
  References
Index
Chapter 1
Introduction
How to extract unknown information from datasets is a widely shared concern. Clustering, as an unsupervised method that partitions a dataset naturally, has the ability to discover the potential and internal knowledge, laws, and rules of data. Clustering is one of the major unsupervised learning techniques and has been applied in many fields. In recent decades, researchers have proposed many clustering algorithms (Ezugwu et al., 2022) based on different theories and models, which are generally divided into several categories such as partitioning, hierarchical, density-based, grid-based, and model-based methods. When the data are easy to cluster, meaning that the groups are well separated, most existing algorithms are likely to yield a good result, but clustering algorithms have to deal with more complex situations such as different types of attributes and various shapes and densities, and they must include outlier and noise management. Despite the continuing stream of clustering algorithms proposed over the years to handle more complex structures, issues remain.

In addition to these recurrent challenges, there are several issues concerning the dimension and volume of twenty-first-century databases. They contain noisy, irrelevant, and redundant information along with the most useful information. The general issues involve the curse of dimensionality and algorithm scalability. These issues are not novel, but they are more challenging today in the era of big data, in which the size of data increases drastically day by day. Data grow in terms of both the number of instances and the number of features. This increasing dimensionality degrades the performance of machine learning algorithms, including clustering ones. Performance degradation is often observed when tackling either unprocessed supports such as images or high-dimensional features extracted from processed supports. In addition, there is a significant increase in computational time and space. Image clustering has recently attracted significant attention due to the increased availability of unlabeled datasets. In this case, unsupervised scenarios are shifted away from their initial goal of knowledge discovery, as the process is to some extent supervision-driven to improve supervised problems.
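A small numerical experiment makes this degradation tangible. The NumPy sketch below (with uniformly random points standing in for real data, an arbitrary choice made only for illustration) measures how the contrast between the nearest and the farthest neighbor of a point shrinks as the number of features grows:

# Illustration of distance concentration in high dimension.
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                         # 500 uniform points in d dimensions
    dists = np.linalg.norm(X - X[0], axis=1)[1:]     # distances from one reference point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")

The printed contrast typically drops toward zero as the dimension increases: all points look almost equally far apart, and notions such as the "nearest cluster center" lose much of their meaning, which is precisely what hurts distance-based clustering algorithms.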
The curse of dimensionality is a fundamental difficulty in many practical machine learning problems and especially for the most popular clustering algorithms. Without labels, it is not even clear what should be represented: how may an objective function be written to encourage the capture of a semantic representation? To overcome this "curse of dimensionality," dimensionality reduction has to be applied to the huge amount of data before discovering the hidden useful information. In the neural network field, Deep Belief Networks (DBN) (Hinton, 2009), Hebbian learning (Munakata & Pfaffly, 2004), and Self-Organizing Maps (SOM) (Kohonen, 1990) were the first neural network techniques that could discover hidden structures in unlabeled data. Self-Organizing Maps learn non-linear topology-preserving transformations that map data points to centers that are then mapped to center indices. Despite their successes, they can be complex to use and hampered by low speed.

Transforming data from a high-dimensional feature space to a lower-dimensional space in which to perform clustering is an intuitive solution and has been widely studied. This can traditionally be done by applying dimension reduction techniques such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Local Linear Embedding (LLE) (Saul & Roweis, 2000), or Isomap (Balasubramanian & Schwartz, 2002), but the representation ability of these approaches is limited because they ignore the interconnection between feature learning and clustering and mingle all features. They also lack interpretability. The major problem in unsupervised scenarios and with high-dimensional data remains the question of a good representation in the latent space. This representation depends critically on whether the cluster structures among the data points are preserved in the latent space, meaning that semantically similar data are placed close to each other and semantically dissimilar data are placed apart from each other. In any data distribution, two major factors are generally tracked: variance and entanglement (Locatello et al., 2019). Feature selection and feature extraction are the two dimensionality reduction techniques, both aiming at tracking these factors (Achille & Soatto, 2018).

The problem of feature selection has been studied extensively in machine learning and statistics (Dy & Brodley, 2004; Khalid et al., 2014; Dokeroglu et al., 2022). Methods exploring graph embedding (Cai et al., 2010; Nie et al., 2016) and using sparse spectral regression have received increasing attention in recent years. Feature selection aims to approximate the original data by selecting a set of essential features. In practice, not all features are equally important and discriminative, since most of them are often highly correlated or even redundant with each other. Redundant features generally make learning methods overfit and less interpretable. Consequently, it is necessary to reduce the data dimensionality and select the most important features. Most of the research is focused on supervised feature selection, while some works are devoted to unsupervised settings. The latter is more challenging since it is unassisted by sample labels, and there is a lack of consensus on the correct optimization objective even in low-dimensional space.
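As a toy counterpart to this discussion, the short sketch below (pandas and scikit-learn are assumed to be available, and the thresholds are arbitrary) implements a crude unsupervised filter that removes near-constant features and one member of every highly correlated pair. It is only meant to illustrate the notion of redundancy, not any of the methods discussed later in this book:

# Crude unsupervised filter: variance threshold + correlation-based redundancy removal.
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
df = pd.DataFrame(X)

# 1) drop near-constant (uninformative) features
df = df.loc[:, df.var() > 1e-3]

# 2) drop one feature of every pair whose absolute correlation exceeds 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=redundant)

print(f"kept {df.shape[1]} of {X.shape[1]} original features")

Such a filter preserves the semantics of the retained columns, but it says nothing about whether the kept features support a good clustering, which is exactly the gap the following strategies try to close.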
Some recent learning strategies invoke supervised learning via the use of "pseudo-labels." The goal is then to select a subset of features that can effectively reveal or maintain the underlying structure of the data. The idea is that identifying a subset of informative features brings benefits such as a reduction in memory and computation, improved generalization, and better interpretability. The difficulty is to find the balance between structure characterization and feature selection. Filter methods attempt to remove irrelevant features prior to learning a model. Wrapper methods use the outcome of a model to determine the relevance of each feature. Embedded methods aim to learn the model while simultaneously selecting the subset of relevant features. Unsupervised feature selection methods mostly focus on two main tasks: clustering and dimensionality reduction or manifold learning. Other clustering-dedicated unsupervised feature selection methods assess the relevance of each feature based on different statistical or geometric measures; entropy, divergence, and mutual information (Estévez et al., 2009) are used to identify features that are informative for clustering the data. Unsupervised feature selection is still a challenging task. In contrast to conventional linear feature selection methods, deep models such as autoencoders have more potential. They have the ability to construct abstract features or latent variables on top of the observed features and can select informative features by exploiting both linear and non-linear relations among features. They can discriminate between relevant and irrelevant features using a feature selection regularizer.

Clustering large databases with high dimensions cannot be done without transforming and reducing the original feature space. Thanks to the development of deep learning, such feature transformation can be achieved by using Deep Neural Networks (DNNs) within a strategy where clustering and feature learning are done simultaneously. This kind of clustering is referred to as deep clustering. Deep clustering frameworks combine feature extraction, dimensionality reduction, and clustering into an end-to-end model, allowing the deep neural network to learn representations suited to the assumptions and criteria of the clustering module used in the model. Deep-learning-based clustering is one of the most active topics in the field of unsupervised learning due to its outstanding representational capacity and fast inference speed. With the advancement of neural network architectures, researchers have introduced many learning methods that are based on neural networks. Autoencoders (AE) (Rumelhart et al., 1986), Variational Autoencoders (VAE) (Kingma et al., 2019), Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), and Deep Belief Nets (DBN) (Hinton, 2009) are common architectures used to create the recent state-of-the-art unsupervised learning architectures for unsupervised and "pseudo-unsupervised" tasks.

Many methods are based on autoencoders, as they can be used to identify features that are sufficient for reconstructing the data. They generalize better by reducing the dimensions of the data through a latent space while maintaining a high-quality representation. They use a variety of losses to encourage data points to form tightly packed and well-separated clusters in the latent space. The Deep Clustering Network (DCN) represents a combined approach of autoencoders and the k-means algorithm, while the Deep Embedding Network (DEN) (Huang et al., 2014) only relies on the reconstruction loss of autoencoders and converges to a cluster-friendly representation; a simplified sketch of this autoencoder-plus-k-means recipe is given below.
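The following PyTorch sketch shows the pretrain-then-cluster recipe in its simplest form: an autoencoder is first trained with a reconstruction loss, and k-means is then applied to the latent codes. The layer sizes, the random placeholder data, and the two-stage (rather than joint) training are illustrative assumptions; the sketch does not reproduce any specific published method.

# Minimal autoencoder + k-means in the latent space (illustrative sketch).
import torch
from torch import nn
from sklearn.cluster import KMeans

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

X = torch.rand(1024, 784)          # placeholder for flattened images
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: learn a compressed representation with a reconstruction loss
for epoch in range(20):
    z, x_hat = model(X)
    loss = nn.functional.mse_loss(x_hat, X)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: cluster in the learned latent space
with torch.no_grad():
    z, _ = model(X)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(z.numpy())
print(labels[:20])

Joint formulations such as DCN instead optimize the reconstruction loss and a clustering loss together, so that the latent space is shaped with the clustering objective in mind.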
Deep Continuous Clustering (DCC) (Shah & Koltun, 2018), Deep Embedded Regularized Clustering (DEPICT) (Huang et al., 2014), Deep Multi-manifold Clustering (DMC) (Chen et al., 2017), and Deep Subspace Clustering Networks (DSC-Nets) (Ji et al., 2017) rely on a similar autoencoder reconstruction loss to perform clustering. One of the most representative deep clustering methods is Deep Embedded Clustering (DEC) (Xie et al., 2016). It learns a non-linear mapping from a high-dimensional data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective function. Inspired by DEC, several deep clustering methods have been proposed for image clustering and have obtained promising preliminary results, even if there is no guarantee that clusters in the latent space correspond well to semantic clusters in the input space.

Other methods are based on CNN architectures. Most of them have been investigated for image data in the context of supervised, unsupervised, and semi-supervised learning tasks (Reddy et al., 2018; Alloghani et al., 2020). They are generally based on the transfer learning concept (Weiss et al., 2016), i.e., the idea of promoting the reuse of features obtained with models pre-trained on standard labeled datasets. More recently, contrastive learning (Chen et al., 2020) as well as self-supervised learning approaches (Ohri & Kumar, 2021) have emerged, based on the idea that the data themselves contain inherent features that provide supervision for training the model. Using pretext tasks, they allow the network to learn high-level features and therefore obtain semantically meaningful representations from unlabeled data, thus contributing to the knowledge discovery task. The choice of the pretext task is obviously central, and it needs to be well identified. If these methods differ in their objectives, they share the goal of obtaining rich representations of the inputs and an appropriate metric that is suitable for a clustering task.

Differently, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Creswell et al., 2018; Saxena & Cao, 2021) are generative models that have shown remarkable generation performance, especially in image synthesis. They generate new data instances that resemble the training data by discovering and learning the regularities or patterns in the input data. In recent years, a great variety of GAN architectures have explored the abilities of the latent space to produce realistic data. Most of them can be referred to as hybrid VAE-GAN methods, which bridge the gap between Variational Autoencoders (VAEs) and GANs. Both GANs and VAEs aim to match the real data distribution, VAEs by using an explicit approximation of maximum likelihood and GANs through implicit sampling. The latent space backprojection in GANs could be used to cluster, but the cluster structure is generally not preserved in the GAN latent space. GANs in their raw formulation are unable to fully impose all the cluster properties of the real data on the generated data, especially when the real data have skewed clusters. GAN-based deep clustering methods have therefore been proposed recently, the idea being to use a third encoder network that maps a data object to an instance of the latent space Z. GAN-based deep clustering methods such as AAE (Makhzani et al., 2015), BiGAN (Donahue et al., 2016), CatGAN (Springenberg, 2015), InfoGAN (Chen et al., 2016), and ClusterGAN (Mukherjee et al., 2019) seek to train the network with a min-max adversarial game and aim to extract interpretable and disentangled features.
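For reference, the "min-max adversarial game" mentioned above is, in its original formulation (Goodfellow et al., 2014), the optimization of the value function

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big],

where the generator G maps latent codes z drawn from a prior p_z to data space and the discriminator D tries to tell real samples from generated ones. GAN-based clustering methods build on this game, typically by adding an encoder or a structured latent code from which cluster assignments can be read.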
The problem of feature selection is different from the more general problem of dimensionality reduction and from that of feature extraction (Ghojogh et al., 2019; Zebari et al., 2020), even if both play a role in solving the curse of dimensionality and rely on similar machine learning techniques. Few studies have simultaneously addressed feature selection and subspace learning (Zhou et al., 2016). Feature extraction methods use all features and consist in finding a projection that maps the original high-dimensional data into a lower-dimensional subspace. Standard techniques for dimensionality reduction, such as principal component analysis and autoencoders, are able to represent data with fewer dimensions while preserving maximal variance or minimizing reconstruction loss. However, such methods by design do not directly select a set of features present in the original dataset and thus cannot be "easily" used to eliminate redundant features and reduce experimental costs. On the other hand, feature selection methods aim to select a subset of the original high-dimensional features based on some performance criterion. Therefore, they can preserve the semantics of the original features and produce dimensionally reduced results that are more interpretable for domain experts. Feature selection is superior in terms of readability and interpretability since it maintains the original feature values in the reduced space, while feature extraction transforms the data from the original space into a new space of lower dimension that cannot be linked to the features in the original space.

Over the years, many reviews have appeared for standard clustering (Xu & Wunsch, 2005; Ahmad & Khan, 2019; Ezugwu et al., 2022), a few excellent ones for deep clustering (Mitra et al., 2002; Aljalbout et al., 2018; Min et al., 2018; Schnellbach & Kajo, 2020), and some for specific domains such as Karim et al. (2021). Many others cover feature selection (Khalid et al., 2014; Wang et al., 2016; Xue et al., 2015; Li et al., 2017; Hancer et al., 2020; Chandrashekar & Sahin, 2014; Dokeroglu et al., 2022; Cai et al., 2018), as the problem has been well known for decades. Some recent review papers are dedicated to unsupervised feature selection (Alelyani et al., 2018; Pandit et al., 2020; Solorio-Fernández et al., 2020) but not to deep learning architectures. During the last few years, more and more work dealing with deep clustering has been contributed, but conceptually there are only a few disruptive innovations. Despite promising results, the problem of finding an appropriate representation, as well as the specific question of knowledge discovery, remains. In addition, there may be overall confusion about the domain for R&D engineers, non-experts, or young researchers. Entering the domain is also somewhat difficult, as it supposes a good understanding of the numerous machine learning techniques, tricks, and concepts involved in the deep clustering area. As an example, semi-supervised, self-supervised, and unsupervised learning are often used in the literature but have overlapping definitions for certain methods. There is also a central issue regarding knowledge discovery, which is a key clustering task. In the deep area and for most methods, there is a shift toward pseudo-semi-supervised approaches. Even if useful, especially for image problems, many popular methods only partially address the knowledge discovery question. Self-supervised learning approaches are more adequate for this, but most of the works focus on the signal/image field, and the performance highly depends on the pretext task.

Therefore, in this book, the most popular works are highlighted in order to provide a suitable starting point from which to develop a full understanding of the domain. The most innovative and very recent works that appear highly promising are also included. Covering all the techniques and concepts in complete detail would be cumbersome; we rather aim to provide a comprehensive overview of the key ones by covering the necessary baselines to clarify the core concepts and ease understanding. More details are given for the pioneering and disruptive works, while the others are described more briefly, letting the interested reader consult the original papers where the ideas are described in full detail. Our book addresses both feature selection and dimensionality reduction problems related to the deep clustering area, with particular attention to the knowledge discovery question. Even if considered strategically different in their objectives, both aim at solving the curse of dimensionality that affects the most innovative clustering methods, and both often rely on similar techniques. This book aims to bridge the gap between the complex notions of deep clustering and deep feature selection and the needs of non-experts who are interested in leveraging these techniques for their own datasets.

Before diving into the essential elements of the topics, it is important to establish a solid foundation by introducing the key concepts and building blocks that underpin the clustering process. In the initial chapters, this book provides a comprehensive exploration of key topics in data analysis, specifically focusing on dimensionality reduction, feature selection, and clustering. The authors cover a variety of popular algorithms, including k-means, hierarchical clustering, density-based methods, and more. While the book does not aim to be exhaustive or overly technical in its treatment of these concepts, which are more accessible in various other resources, each item is presented in a straightforward manner, effectively highlighting its strengths, weaknesses, and practical considerations for real-world applications. The subsequent chapters (Chaps. 5, 6, and 7) delve into the world of deep learning, starting with an in-depth examination of the challenges posed by high dimensionality in classic machine learning techniques. These chapters serve as a valuable and accessible introduction to fundamental deep learning concepts, laying the groundwork for readers to develop a comprehensive understanding of the techniques introduced later in the book. Chapters 8 to 13 focus on deep feature selection and deep clustering, respectively. Building upon the solid foundation established in earlier chapters, these sections delve into advanced methods, equipping readers with powerful tools to tackle complex data analysis tasks. Chapter 14 is dedicated to formalizing various issues and challenges that arise within deep learning, offering valuable insights into the subject matter. Furthermore, it outlines potential future research directions, paving the way for further advancements in this rapidly evolving field.
Deep learning is nowadays the state-of-the-art approach to clustering and feature selection, problems that become ever more relevant as the size of datasets increases and the cost of manually labeling data remains very high. This book aims at helping the reader discover the topic. To ensure a gentle learning curve, the book is specifically tailored to non-experts, avoiding unnecessary jargon and focusing on practical explanations and clear, step-by-step guidelines. It strikes a fine balance between clarity and depth, making it suitable for engineers, students, and Ph.D. candidates looking to grasp essential concepts, as well as for experienced practitioners seeking to expand their knowledge in the realm of data analysis and deep learning, with the goal of making deep clustering accessible. Whether you are a business professional seeking to uncover customer segments, a researcher exploring patterns in scientific data, or a student looking to explore the fascinating world of deep clustering, this book serves as your guide to understanding and harnessing the power of deep clustering techniques in a non-technical and approachable manner. By the end of this book, non-experts will have a solid grasp of the essential concepts and techniques of deep feature selection and clustering. They will be equipped with the knowledge and skills necessary to confidently apply deep clustering algorithms to their own data and extract meaningful insights from complex datasets.

Our book is divided into the following chapters:
1. Chap. 2: Dimensionality reduction
2. Chap. 3: Feature selection
3. Chap. 4: Clustering
4. Chap. 5: Problem in high dimension
5. Chap. 6: Deep learning architectures
6. Chap. 7: Learning approaches and tricks
7. Chap. 8: Feature selection techniques with deep clustering
8. Chap. 9: Deep clustering techniques
9. Chap. 10: Deep clustering techniques based on CNN
10. Chap. 11: Deep clustering techniques based on autoencoders
11. Chap. 12: Deep clustering techniques based on generative architectures
12. Chap. 13: Deep clustering techniques in synthesis
13. Chap. 14: Issues and challenges
References Achille, A., & Soatto, S. (2018). Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1), 1947–1980. Ahmad, A., & Khan, S. S. (2019). Survey of state-of-the-art mixed data clustering algorithms. IEEE Access, 7, 31883–31902. Alelyani, S., Tang, J., & Liu, H. (2018). Feature selection for clustering: A review. Data Clustering, 29–60. Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., & Cremers, D. (2018). Clustering with deep learning: Taxonomy and new methods.
8
1 Introduction
Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A., & Aljaaf, A. J. (2020). A systematic review on supervised and unsupervised machine learning algorithms for data science. Supervised and Unsupervised Learning for Data Science, 3–21. Balasubramanian, M., & Schwartz, E. L. (2002). The Isomap algorithm and topological stability. Science, 295(5552), 7–7. Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 333–342). Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70–79. Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28. Chen, D., Lv, J., & Zhang, Y. (2017). Unsupervised multi-manifold clustering by learning deep representation. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (pp. 1597–1607). PMLR. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53– 65. Dokeroglu, T., Deniz, A., & Kiziloz, H. E. (2022). A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing, 494, 269–296. Donahue, J., Krähenbühl, P., and Darrell, T. (2016). Adversarial feature learning. Preprint. arXiv:1605.09782. Dy, J. G., & Brodley, C. E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5(Aug), 845–889. Estévez, P. A., Tesmer, M., Perez, C. A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2), 189–201. Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022). A comprehensive survey of clustering algorithms: State-ofthe-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743. Ghojogh, B., Samad, M. N., Mashhadi, S. A., Kapoor, T., Ali, W., Karray, F., & Crowley, M. (2019). Feature selection and feature extraction in pattern analysis: A literature review. Preprint. arXiv:1905.02845. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27. Hancer, E., Xue, B., & Zhang, M. (2020). A survey on feature selection approaches for clustering. Artificial Intelligence Review, 53(6), 4519–4545. Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947. Huang, P., Huang, Y., Wang, W., & Wang, L. (2014). Deep embedding network for clustering. In 2014 22nd International Conference on Pattern Recognition (pp. 1532–1537). IEEE. Ji, P., Zhang, T., Li, H., Salzmann, M., & Reid, I. (2017). Deep subspace clustering networks. Advances in Neural Information Processing Systems, 30. Karim, M. R., Beyan, O., Zappa, A., Costa, I. 
G., Rebholz-Schuhmann, D., Cochez, M., & Decker, S. (2021). Deep learning-based clustering approaches for bioinformatics. Briefings in Bioinformatics, 22(1), 393–415. Khalid, S., Khalil, T., & Nasreen, S. (2014). A survey of feature selection and feature extraction techniques in machine learning. In 2014 Science and Information Conference (pp. 372–378). IEEE.
References
9
Kingma, D. P., Welling, M., et al. (2019). An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), 307–392. Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), 1–45. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning (pp. 4114–4124). PMLR. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders. Preprint. arXiv:1511.05644. Min, E., Guo, X., Liu, Q., Zhang, G., Cui, J., & Long, J. (2018). A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6, 39,501–39,514. Mitra, P., Murthy, C., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312. Mukherjee, S., Asnani, H., Lin, E., & Kannan, S. (2019). ClusterGAN: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19. AAAI Press. Munakata, Y., & Pfaffly, J. (2004). Hebbian learning and development. Developmental Science, 7(2), 141–148. Nie, F., Zhu, W., & Li, X. (2016). Unsupervised feature selection with structured graph optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30(1). Ohri, K., & Kumar, M. (2021). Review on self-supervised image recognition using deep neural networks. Knowledge-Based Systems, 224, 107090. Pandit, A. A., Pimpale, B., & Dubey, S. (2020). A comprehensive review on unsupervised feature selection algorithms. In International Conference on Intelligent Computing and Smart Communication 2019 (pp. 255–266). Springer. Reddy, Y., Viswanath, P., & Reddy, B. E. (2018). Semi-supervised learning: A brief review. International Journal of Engineering & Technology, 7(1.8), 81. Rumelhart, D. E., Hinton, G. E., McClelland, J. L., et al. (1986). A general framework for parallel distributed processing. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1(45-76), 26. Saul, L. K., & Roweis, S. T. (2000). An introduction to locally linear embedding. Unpublished. Available at: http://www.cs.toronto.edu/~roweis/lle/publications.html. Saxena, D., & Cao, J. (2021). Generative adversarial networks (GANs) challenges, solutions, and future directions. ACM Computing Surveys (CSUR), 54(3), 1–42. Schnellbach, J., & Kajo, M. (2020). Clustering with deep neural networks–an overview of recent methods. Network, 39. Shah, S. A., & Koltun, V. (2018). Deep continuous clustering. Preprint. arXiv:1803.01449. Solorio-Fernández, S., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2020). A review of unsupervised feature selection methods. Artificial Intelligence Review, 53(2), 907–948. Springenberg, J. T. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. Preprint. arXiv:1511.06390. Wang, L., Wang, Y., & Chang, Q. (2016). Feature selection methods for big data bioinformatics: a survey from the search perspective. Methods, 111, 21–31. Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1), 1–40. 
Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (pp. 478–487). PMLR. Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678. Xue, B., Zhang, M., Browne, W. N., & Yao, X. (2015). A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation, 20(4), 606– 626.
Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D., & Saeed, J. (2020). A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. Journal of Applied Science and Technology Trends, 1(2), 56–70. Zhou, N., Xu, Y., Cheng, H., Fang, J., & Pedrycz, W. (2016). Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection. Pattern Recognition, 53, 87–101.
Chapter 2
Dimensionality Reduction
Dimensionality reduction techniques assume a paramount role across a spectrum of data-driven tasks, achieving dual objectives: the compression of data and the facilitation of insightful visualization. Within the realm of machine learning, dimensionality reduction is the process of reducing the multitude of features that describe a dataset. This reduction materializes through two distinct avenues: selection, entailing the preservation of only a subset of the existing features, and extraction, whereby a condensed set of novel features is forged, drawing inspiration from the original ones. The utility of these techniques becomes apparent in the many scenarios that demand data of lower dimensionality, such as data visualization, data storage efficiency, and resource-intensive computations. The goal of dimensionality reduction is to map the high-dimensional space to a lower-dimensional space by creating new features from the original features with minimum information loss. The feature transformation due to dimensionality reduction is often irreversible. The critical difference between feature selection and dimensionality reduction is that feature selection finds the best features among the original features, whereas dimensionality reduction creates new features from the original features. Linear dimensionality reduction techniques transform the input features to a low-dimensional space as a linear combination of the original variables. Original features are replaced by a reduced set of underlying variables containing maximum variation to preserve as much information as possible. Dimensionality reduction algorithms either seek to preserve the pairwise distance structure amongst all the data samples, like PCA, or favor preserving local distances over global distances, as in t-SNE and Isomap. This chapter traverses the landscape of pivotal classical dimensionality reduction methodologies, focusing on extraction techniques. The forthcoming chapter is dedicated to feature selection. It is noteworthy that the arena of deep clustering harnesses the innate potential of
neural networks, enabling the acquisition of finely compressed data representations devoid of reliance on meticulously crafted human-engineered features.
1 PCA: Principal Component Analysis PCA (Chaudhary, 2020) is a popular unsupervised dimensionality reduction method for high-dimensional, linearly separable data. It is used to transform a dataset while preserving as much of its original information as possible. It does this by creating a set of new variables, called principal components, which are linear combinations of the original variables. PCA is widely used for dimensionality reduction, feature extraction, noise reduction, and data visualization. It has applications in various fields, including image processing, finance, genetics, and data compression. The purpose of PCA is to derive new variables that are linear combinations of the original variables and are uncorrelated. PCA projects n-dimensional data onto a lower d-dimensional subspace by either minimizing the sum of squared errors or maximizing the variance, which results in uncorrelated projected distributions. The projection hyper-plane explains the maximum amount of variance, such that most of the information in the data is captured (Fig. 2.1). The process can be summarized as follows:
• PCA first standardizes all the input variables.
• Following that, the first principal component is computed. It is designed to capture the maximum variance present in the data. In other words, it identifies the direction along which the data varies the most.
• Afterward, the second principal component is derived in such a way that it best explains the second largest source of variation in the data. The second principal component should be orthogonal to the first principal component so as to capture the variance in the data that the first principal component did not capture.
• This process continues for as many principal components as there are original variables. Each subsequent principal component is orthogonal (perpendicular) to the ones before it. This means that they capture different aspects of the data, and there is no redundancy between them.
One of the key benefits of PCA is that these principal components are uncorrelated with each other, which simplifies the interpretation of the transformed data. The principal components are ordered in such a way that the first few components explain most of the variance in the original data. This makes it possible to reduce the dimensionality of the data while retaining a significant portion of its information. Mathematically (see1 for a detailed approach), a covariance matrix is created for all the standardized input variables.
1 https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf.
Fig. 2.1 PCA in 2D. The projection space is spanned by eigenvector 1 (the direction of maximum variability) and eigenvector 2 (the direction of lowest variability). Source: https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/
The size of the covariance matrix is proportional to the data dimensionality, making eigendecomposition computationally expensive for very high-dimensional data. Eigendecomposition of the data covariance matrix is performed to identify the directions of maximum variability. The eigenvalue determines the magnitude of the variability. Eigenvectors with the lowest eigenvalues carry the least information about the data distribution and can be dropped. Eigenvectors with the most significant eigenvalues are retained, thus transforming the original features to a reduced feature space with minimal loss of information. PCA is a powerful technique that helps reduce the dimensionality of data while preserving its essential characteristics by creating a set of orthogonal components that capture the most significant sources of variance in the original dataset. When the data structure is non-linear, linear dimensionality reduction techniques like PCA,
which handles linear data, will not provide optimal results. Kernel PCA (Schölkopf et al., 1998, 1997) is the non-linear form of PCA that helps reduce the complicated spatial structure of high-dimensional features into lower dimensions using kernel functions such as polynomial or Radial Basis Function (Gaussian RBF) kernels. PCA is applied when the data is linearly separable, whereas kernel PCA is applied to non-linearly separable data that a non-linear function can describe. The kernel function in kernel PCA plays the same role as the covariance matrix does in PCA.
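To make these steps concrete, the following minimal sketch (not the authors' code) runs PCA and kernel PCA with scikit-learn; the digits dataset, the 95% variance threshold, and the RBF bandwidth gamma=0.03 are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch: PCA and kernel PCA with scikit-learn.
# Dataset and parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_digits(return_X_y=True)          # 64-dimensional digit images
X = StandardScaler().fit_transform(X)        # PCA first standardizes the inputs

# Linear PCA: keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_[:5])

# Kernel PCA with an RBF kernel for non-linearly structured data
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.03)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```

Selecting components by explained-variance ratio mirrors the eigenvalue-based truncation described above.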
2 ICA: Independent Component Analysis The aim of ICA (Stone, 2004) is to extract useful information or source signals from data (a set of measured mixture signals). Independent Component Analysis (ICA) serves as a powerful dimensionality reduction technique by identifying and isolating individual sources within a dataset. Unlike PCA, ICA operates on the assumption that the observed features are composed of mixtures originating from distinct and independent sources. This characteristic sets ICA apart as it aims to untangle these mixed sources present in the data. In essence, ICA acts as a filtering mechanism, allowing for the retention or elimination of specific sources within the data. This is achieved by seeking out features with the least correlation to one another, effectively pinpointing those that contribute independently to the dataset. This task is often referred to as Blind Source Separation. By decomposing the initial feature set, ICA uncovers the components with the highest degree of independence, effectively shedding light on the most critical elements of the dataset. This is very useful when dealing with complicated datasets from several sources. Through this approach, ICA enhances the ability to disentangle and understand the true contributing factors, enabling a deeper comprehension of the data’s underlying structure. ICA stands as a dimensionality reduction technique that excels in isolating and identifying independent sources within a dataset. Its distinct approach, centered around the concept of independence rather than variance, grants it a unique vantage point to uncover hidden insights in complex data. ICA is based on three assumptions. These are: • Mixing process is linear. • All source signals are independent of each other. • All source signals have non-Gaussian distribution. In ICA, the goal is to find the unmixing matrix W and then project the whitened data onto that matrix for extracting independent signals. This matrix can be estimated using three main approaches of independence, which result in slightly different unmixing matrices.
The first is based on non-Gaussianity. The basis of this assumption comes from the central limit theorem saying that the sum of independent random variables is more Gaussian than the independent variables. So to infer source variables it is necessary to move away from the Gaussianity. In the case of Gaussian distribution, uncorrelated Gaussian variables are also independent; it is a unique property associated with Gaussian distribution. This can be measured by some measures such as negentropy and kurtosis (Girolami & Fyfe, 1996), and the goal of this approach is to find independent components that maximize the non-Gaussianity. In the second approach, the ICA goal can be obtained by minimizing the mutual information. The higher the mutual information, the higher will be the dependence. Independent components can be also estimated by using maximum likelihood (ML) estimation. All approaches simply search for a rotation or unmixing matrix W . Projecting the whitened data onto that rotation matrix extracts independent signals. The preprocessing steps are calculated from the data, but the rotation matrix is approximated numerically through an optimization procedure. Searching for the optimal solution is difficult due to the local minima that exist in the objective function. ICA has many algorithms such as FastICA, projection pursuit, and Infomax (Hyvärinen & Oja, 2000). The main goal of these algorithms is to extract independent components by (1) maximizing the non-Gaussianity, (2) minimizing the mutual information, or (3) using the maximum likelihood (ML) estimation method. However, ICA suffers from a number of problems such as over-complete ICA and under-complete ICA. • ICA applies a linear transformation to decompose the original data into components that are maximally independent of each other. Unlike PCA, the independent components do not need to be orthogonal to each other. • ICA is well suited for separating superimposed signals and hence ICA has applications in neuroimaging, fMRI, and EEG analysis to separate normal signals from abnormal ones. An illustrative introduction can be found at2 and more technical explanations and details in Comon and Jutten (2010).
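As a small illustration of blind source separation, the sketch below (a toy example, not the authors' implementation) mixes two synthetic non-Gaussian signals and recovers them with scikit-learn's FastICA; the signals and the mixing matrix are invented purely for the example.

```python
# Minimal sketch: blind source separation with FastICA (scikit-learn).
# The sources and mixing matrix below are synthetic assumptions.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # sinusoidal source
s2 = np.sign(np.sin(3 * t))              # square-wave source (non-Gaussian)
S = np.c_[s1, s2] + 0.02 * rng.standard_normal((2000, 2))

A = np.array([[1.0, 0.5], [0.5, 2.0]])   # unknown linear mixing matrix
X = S @ A.T                              # observed mixture signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # estimated independent components
W = ica.components_                      # estimated unmixing matrix
print(S_est.shape, W.shape)
```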
3 NMF: Non-Negative Matrix Factorization NMF was first introduced in 1994 (Paatero & Tapper, 1994) and popularized in 1999 (Lee & Seung, 1999). NMF is a popular unsupervised dimensionality reduction technique and feature extraction technique applied to non-negative data. NMF extracts sparse and interpretable features from a set of non-negative data vectors resulting in decomposed matrices with reduced dimension than the original dataset. It is particularly useful when dealing with non-negative data, such as images, text
2 https://arnauddelorme.com/ica_for_dummies/.
data, audio spectrograms, or any dataset where the elements are constrained to be non-negative. NMF decomposes the data as a matrix X into the product of two reduced matrices W (feature matrix) and H (coefficient matrix) having dimensions m × k and k × n, respectively:
$$X_{m,n} = W_{m,k} \times H_{k,n} \quad (2.1)$$
where k (< min(m, n)) is the rank of the low-rank approximation of X. W and H only contain non-negative elements. It is essential to choose an appropriate value of k and suitable regularization techniques to avoid overfitting and achieve meaningful representations. Therefore, by using NMF, factorized matrices are obtained having significantly lower dimensions than those of the product matrix. NMF assumes that the original input is made of a set of hidden features, represented by each column of the W matrix, while each column in the H matrix represents the "coordinates of a data point" in the basis W. The algorithm iteratively modifies the values of W and H so that their product approaches X. The technique preserves much of the original data structure and guarantees that both basis and weights are non-negative. The algorithm terminates when the approximation error converges or a specified number of iterations is reached. NMF is widely used in various fields for feature extraction and representation learning, such as:
• Image Decomposition: NMF can be used to decompose images into a set of basis elements and coefficients, representing the most significant patterns in the images.
• Topic Modeling in Text: In natural language processing, NMF can be applied to extract topics from a document-term matrix, representing the underlying themes in a corpus.
• Audio Source Separation: NMF has been used for separating audio sources from a mixture, finding the underlying spectral components of different sources.
• Document Clustering: NMF can be used for clustering documents based on their feature representation.
A gentle introduction including Python codes can be found at3 and more mathematical aspects and algorithms based on NMF in Hoyer (2004), Lee and Seung (2000).
3 https://towardsdatascience.com/non-negative-matrix-factorization-nmf-for-dimensionalityreduction-in-image-data-8450f4cae8fa.
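The following minimal sketch applies scikit-learn's NMF to non-negative pixel data to obtain the factorization of Eq. 2.1; the rank k = 16 and the initialization are illustrative assumptions. Note that, in scikit-learn's convention, fit_transform returns the per-sample coefficients and components_ holds the non-negative basis.

```python
# Minimal sketch of Eq. 2.1 (X ≈ W H) with scikit-learn's NMF on non-negative data;
# the rank k = 16 is an illustrative assumption, not a recommended value.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

X, _ = load_digits(return_X_y=True)       # pixel intensities are non-negative
k = 16
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                  # (n_samples, k) non-negative coefficients
H = nmf.components_                       # (k, n_features) non-negative basis images
reconstruction = W @ H                    # low-rank approximation of X
err = np.linalg.norm(X - reconstruction) / np.linalg.norm(X)
print(W.shape, H.shape, f"relative reconstruction error: {err:.3f}")
```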
4 Kohonen Neural Network The Self-Organizing Map (SOM), also known as the Kohonen map, was introduced by the Finnish professor Teuvo Kohonen in the early 1980s (Kohonen, 1990). Kohonen developed the SOM as a type of artificial neural network designed for unsupervised learning and pattern recognition. His inspiration for the SOM came from his interest in understanding how the human brain processes and organizes information: he aimed to develop a computational model that could mimic some of the brain's self-organizing capabilities. SOM is commonly used for data clustering, visualization, and exploratory analysis. It is particularly useful for visualizing high-dimensional data in 2D or 3D maps and identifying similar patterns or clusters (Fig. 2.2).
Fig. 2.2 Self-organizing map. Dimensionality reduction
Compared with the projection methods above, SOM is more concerned with visualizing and clustering high-dimensional data in a grid-like structure while preserving topological relationships; it therefore has different applications and underlying concepts, making it suitable for different tasks in data analysis. It is a type of artificial neural network that uses a grid-like structure to represent high-dimensional data in a lower-dimensional space. It organizes the input data into a set of reference vectors (neurons) arranged in a grid, where neighboring neurons exhibit similar properties. Training proceeds as follows:
• Each node's weights are initialized.
• A vector is chosen at random from the set of training data.
• Every node is examined to determine which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
• Then the neighborhood of the BMU is calculated. The number of neighbors decreases over time.
• The winning node's weights are updated to become more like the sample vector. The neighbors also become more like the sample vector. The closer a node is to the BMU, the more its weights are altered; the farther away a neighbor is from the BMU, the less it learns.
• Repeat from step 2 for N iterations.
It should be noted that SOM is not primarily focused on dimensionality reduction, although it can indirectly achieve some reduction. It is more commonly used for visualizing and clustering high-dimensional data in a 2D or 3D grid structure. SOM represents data as a set of reference vectors placed in a grid. Each reference vector represents a specific region in the input space, and the positions of the reference vectors determine the topology of the map. SOM enforces a grid-like topology where the reference vectors are arranged in a specific manner. The neighboring neurons exhibit similar properties, allowing for spatial organization and preserving the topological relationships of the data. A practical presentation with a toy example can be found at4 .
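The training loop above can be sketched compactly in NumPy as follows; the grid size, learning rate, neighborhood radius, and exponential decay schedule are illustrative assumptions, and production use would typically rely on a dedicated library.

```python
# Compact NumPy sketch of the SOM training loop described above.
# Grid size, learning rate, radius, and decay schedule are illustrative assumptions.
import numpy as np

def train_som(data, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))          # step 1: initialize weights
    # grid coordinates of every node, used to compute neighborhoods
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for it in range(n_iter):
        x = data[rng.integers(len(data))]                 # step 2: random training vector
        dists = np.linalg.norm(weights - x, axis=-1)      # step 3: find the BMU
        bmu = np.unravel_index(np.argmin(dists), (h, w))
        lr = lr0 * np.exp(-it / n_iter)                   # learning rate decays over time
        sigma = sigma0 * np.exp(-it / n_iter)             # neighborhood shrinks over time
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]  # step 4
        weights += lr * influence * (x - weights)         # step 5: pull nodes toward x
    return weights

som = train_som(np.random.rand(500, 4))
print(som.shape)   # (10, 10, 4): one reference vector per grid node
```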
5 ISOMap: Isometric Mapping Isomap (Tenenbaum et al., 2000) is a manifold learning algorithm based on spectral theory. It focuses on preserving geodesic distances between data points. Geodesic distance is the shortest distance along the manifold, capturing the true, non-linear geometry corresponding to the curved dimension. A k-nearest neighborhood graph is constructed from the data, where the shortest distance between two nodes is considered the geodesic distance. Isomap constructs a global pairwise geodesic similarity matrix between all points in the data, on which classical scaling is applied. By constructing a low-dimensional representation that captures the intrinsic geometric structure of the data, Isomap facilitates meaningful visualization and provides insights into the underlying data manifold. It is particularly useful for datasets with non-linear relationships, as it can capture the underlying data structure more accurately than linear methods like PCA. • IsoMap determines which points are neighbors on the manifold and builds an adjacency neighborhood graph. • It then estimates the geodesic distances between all pairs of points on the manifold by computing their shortest path distances in the graph. • IsoMap then applies classical Multidimensional scaling (MDS) for constructing an embedding of the data to best preserve the manifold’s estimated intrinsic geometry.
4 https://medium.com/machine-learning-researcher/self-organizing-map-som-c296561e2117.
Isomap can be considered a global manifold learning technique as it seeks to retain the global structure of the data. A pedagogic presentation can be found at5 .
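A minimal scikit-learn sketch of these three steps is shown below on the classic Swiss-roll manifold; the neighborhood size is an illustrative assumption.

```python
# Minimal sketch: Isomap embedding of the Swiss-roll manifold with scikit-learn.
# n_neighbors and n_components are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)
iso = Isomap(n_neighbors=12, n_components=2)   # k-NN graph + geodesic distances + MDS
X_2d = iso.fit_transform(X)
print(X_2d.shape)                               # (1500, 2)
```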
6 UMAP: Uniform Manifold Approximation and Projection UMAP (McInnes et al., 2018) is a non-linear dimensionality reduction technique used for visualization as well as for general non-linear dimension reduction, based on manifold learning techniques and ideas from topological data analysis. UMAP has no computational restrictions on the embedding dimension, making it a general-purpose dimension reduction technique. UMAP approximates a manifold on which the data is assumed to lie and then constructs a fuzzy simplicial set representation of the approximated manifold. UMAP is based on the following assumptions:
• There exists a Riemannian manifold on which the data is uniformly distributed.
• The underlying manifold of interest is locally connected.
• The Riemannian metric is locally constant or can be approximated.
• Preserving the topological structure of the manifold is the primary goal.
UMAP can be described in two phases.
• A weighted k-neighbor graph is constructed. • A low-dimensional layout of the weighted k-neighbor graph is computed. UMAP, like Isomap, computes the nearest neighbors of points using a kneighbor-based graph algorithm. At a high level, UMAP first constructs a weighted neighbor graph, and from this graph, a low-dimensional layout is computed. This low-dimensional layout is optimized to have as close a fuzzy topological representation to the original as possible based on cross-entropy. A pedagogic introduction can be found at6 and a complete guide (UMAP documentation) can be found at7 .
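A minimal usage sketch with the umap-learn package (assumed installed) is given below; the parameter values are illustrative defaults rather than tuned choices.

```python
# Minimal sketch with the umap-learn package (pip install umap-learn is assumed);
# parameter values are illustrative, not tuned.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)   # builds the weighted k-neighbor graph, then optimizes the layout
print(embedding.shape)                 # (1797, 2)
```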
7 t-SNE: t-Distributed Stochastic Neighbor Embedding t-SNE (Van der Maaten & Hinton, 2008), introduced in 2008, is an unsupervised, non-linear, randomized dimensionality reduction algorithm used mainly for visualization. t-SNE focuses on keeping very similar data points close together
5 https://medium.com/data-science-in-your-pocket/dimension-reduction-using-isomap-
72ead0411dec. 6 https://www.youtube.com/watch?v=eN0wFzBA4Sc. 7 https://umap-learn.readthedocs.io/en/latest/.
Fig. 2.3 Similarity scores in t-SNE: distances from a point of interest in the original multidimensional space are converted into similarity scores using a normal distribution curve. Source: https://towardsdatascience.com/t-sne-machine-learningalgorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e
in lower-dimensional space while preserving the local structure of the data using the Student t-distribution. t-SNE uses a heavy-tailed Student-t distribution to compute the similarity between two points in the low-dimensional space rather than a Gaussian distribution, which helps address crowding and optimization problems and makes t-SNE relatively insensitive to outliers. The first step involves creating a probability distribution that represents the similarities between neighbors by selecting a random data point and calculating the Euclidean distance between this point and other data points (Fig. 2.3). This distance represents the degree of similarity between the points. In other words, data points that are close to the selected data point will receive a higher similarity value, while data points that are far from the selected data point will receive a lower similarity value. The main steps of the algorithm are as follows:
• Using the similarity values, a similarity matrix SIM1 is created for each data point.
• The next step involves converting the calculated similarity distances into joint probabilities using the normal distribution.
• Then, a low-dimensional space is created with the same number of points as in the original space. These points are initially distributed at random in this space because their ideal coordinates are not yet known.
• Next, the same calculations are repeated for the lower-dimensional data points, which are again randomly arranged. In this step, the probability is calculated according to the Student's t-distribution and a similarity matrix SIM2 is created.
The algorithm compares SIM1 with SIM2 and minimizes the difference between them using a gradient descent algorithm driven by the Kullback–Leibler Divergence (KL Divergence) (Van Erven & Harremos, 2014; Kullback & Leibler, 1951) between the conditional probabilities. A gradient is calculated for each point, indicating how strongly and in which direction it should move. The KL Divergence helps t-SNE to preserve the local structure of the data by minimizing the difference between the two distributions with respect to the data point locations. This effectively reduces the gap between the probability distributions in the original space and the lower-dimensional space.
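To close this section, the following minimal sketch embeds the digits dataset with scikit-learn's t-SNE implementation; the perplexity, learning rate, and initialization are illustrative assumptions.

```python
# Minimal sketch: t-SNE visualization with scikit-learn; perplexity and learning
# rate are illustrative assumptions and usually require tuning.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
            init="pca", random_state=0)
X_2d = tsne.fit_transform(X)            # minimizes the KL divergence described above
print(X_2d.shape)                       # (1797, 2)
```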
8 Autoencoders Autoencoders (Bank et al., 2023) are a class of artificial neural networks used for unsupervised learning and are widely employed as a deep learning approach to dimensionality reduction. They consist of an encoder and a decoder, and their primary objective is to reconstruct the input data. The encoder accepts high-dimensional input data and translates it into a latent low-dimensional representation (Fig. 2.4). A decoder then takes the encoder's low-dimensional output as input to reconstruct the high-dimensional input data using a neural network.
Fig. 2.4 Convolutional autoencoder architecture (encoder, latent feature layer, decoder; 28×28 input and output). Source: Guo et al. (2017)
During training, the encoder compresses the data into a lower-dimensional representation, and the decoder attempts to reconstruct the original data from this compressed representation. Autoencoders are highly effective for data compression and can be fine-tuned for visualization tasks by leveraging the encoded latent space. For dimensionality reduction with an autoencoder, the encoder and decoder are first trained together, and the decoder is then removed. The output of the encoder is the data's non-linear projection to a lower-dimensional space. Autoencoders are used for dimensionality reduction, feature extraction, image compression, anomaly detection, image generation, and image denoising. Autoencoders are pivotal components within the field of deep clustering. Further elaboration on this technique is provided in Chap. 5. Moreover, Chaps. 7, 8, 10, and 11 delve into the intricate interplay of autoencoders in the realms of deep feature selection and clustering, offering a comprehensive understanding of their significance and applications.
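A minimal Keras sketch of this encoder–decoder training, and of reusing the encoder as a non-linear projection, is given below; the fully connected architecture, layer sizes, and 32-dimensional code are illustrative assumptions (the convolutional variant of Fig. 2.4 follows the same pattern with convolution and pooling layers).

```python
# Minimal sketch of a fully connected autoencoder for dimensionality reduction,
# written with Keras; layer sizes and the 32-dimensional code are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(128, activation="relu")(inputs)
encoded = layers.Dense(code_dim, activation="relu")(encoded)      # latent representation
decoded = layers.Dense(128, activation="relu")(encoded)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)  # reconstruction

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, encoded)          # kept after training for the projection
autoencoder.compile(optimizer="adam", loss="mse")

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, input_dim).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256, validation_split=0.1)

codes = encoder.predict(x_train[:10])           # non-linear 32-D projection of the inputs
print(codes.shape)                              # (10, 32)
```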
9 Discussion and Comparisons There are multiple techniques to reduce higher-dimensional data such as images, sentences, or audio recordings to lower-dimensional data and help in finding the most relevant features. If the data is linearly separable, linear dimensionality reduction techniques such as PCA and ICA are appropriate; if the data is non-linearly structured, for example describable by a higher-degree polynomial, non-linear techniques like kernel PCA, Isomap, t-SNE, UMAP, or neural networks such as autoencoders are preferable.
9.1 Comparison Between PCA and ICA The goal of PCA is to find the features that best explain the variability in the dataset. In contrast, the purpose of ICA is to identify the mutually independent features of the dataset, called the independent components. PCA compresses the data, while ICA separates the data using independent features. PCA optimizes the covariance matrix of the data, which represents second-order statistics, while ICA optimizes higher-order statistics such as kurtosis. Hence, PCA finds uncorrelated components while ICA finds independent components. While both PCA and ICA share the goal of reducing dimensionality, they diverge in their strategies. PCA seeks to maximize variance in the transformed features, while ICA is dedicated to revealing the underlying independent sources that form the mixture. This distinction is rooted in the assumption that these sources have non-Gaussian distributions and are mutually independent, meaning that their statistical dependence, not merely their linear correlation, is negligible. Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are both dimensionality reduction techniques, but they have distinct objectives and methodologies.
• PCA aims to identify the features that can best explain the variability present in the dataset. It does this by compressing the data into a set of orthogonal components, each capturing different aspects of the data's variance. ICA, on the other hand, seeks to uncover mutually independent features within the dataset, known as independent components. Instead of explaining variability, ICA is focused on separating the data into its underlying, independent sources.
• PCA compresses the data by retaining the most important variance and reducing dimensionality. ICA separates the data into its constituent sources, emphasizing independence rather than variance.
• PCA optimizes the covariance matrix of the data, which primarily captures second-order statistics. ICA goes beyond and optimizes higher-order statistics like kurtosis, specifically aiming to detect non-Gaussian and mutually independent sources.
• PCA aims to find uncorrelated components, meaning that they have no linear relationship with each other. ICA's goal is to discover truly independent components, indicating that they are statistically unrelated rather than merely exhibiting negligible linear dependencies.
In synthesis, while both PCA and ICA share the overarching goal of reducing dimensionality, they follow different strategies. PCA focuses on maximizing variance in transformed features and assumes that the data is a linear mixture of the original sources. In contrast, ICA is dedicated to unmasking the independent sources, assuming that they have non-Gaussian distributions and are mutually independent, thus better capturing the underlying structure of the data.
9.2 Comparison Between PCA and NMF NMF decomposes the input matrix into non-negative basis elements and non-negative coefficients, whereas PCA decomposes the input matrix into orthogonal (uncorrelated) basis elements and coefficients. The non-negativity of NMF leads to parts-based and additive representations, where the basis elements and coefficients are all non-negative. PCA does not inherently provide meaningful interpretations for its basis elements, especially when the data does not have a clear linear structure. Due to its non-negativity constraint, NMF often leads to more interpretable representations, where the basis elements represent meaningful parts or components in the data, and the coefficients indicate the presence or contribution of those parts in each data sample. NMF does not enforce orthogonality between basis elements, so the resulting components can be correlated. The choice between NMF and PCA depends on the nature of the data and the interpretability requirements of the analysis.
9.3 Comparison Between t-SNE and SOM t-SNE and SOM are both powerful techniques for visualizing high-dimensional data. t-SNE is primarily focused on dimensionality reduction and preserving local structure, while SOM is more concerned with clustering and preserving global and topological relationships. t-SNE is effective at preserving the local structure and capturing clusters or groups of similar instances. It tends to emphasize the dense regions of the data and can reveal intricate structures in the lower-dimensional space. However, it may not preserve the global structure as well. SOM is better at preserving the global structure and overall topology of the data. It can provide a more organized representation of the input space, with similar instances grouped together. However, it may not capture fine-grained local structures as effectively as t-SNE. t-SNE is computationally intensive, especially for large datasets. The algorithm has a time complexity of O(N²), which can make it slower for datasets with many data points. SOM is computationally efficient and can handle large
datasets more effectively. The training process of SOM typically converges quickly, and the algorithm has a linear time complexity of O(N), where N is the number of data points. The choice between t-SNE and SOM depends on the specific goals of the analysis and the desired characteristics of the visualization.
9.4 Comparison Between UMAP and t-SNE UMAP and t-SNE are both dimensionality reduction techniques commonly used for visualizing high-dimensional data in lower dimensions, but they have some differences in their approach and characteristics:
• UMAP (McInnes et al., 2018) and t-SNE are non-linear dimensionality reduction techniques useful for visualization, though UMAP often produces better-quality embeddings than t-SNE.
• UMAP preserves more of the global structure than t-SNE. UMAP is known for its ability to capture both local and global structures in the data. It can reveal clusters as well as the overall structure of the data, which makes it suitable for a wide range of applications.
• t-SNE is particularly good at preserving local structure and is excellent for visualizing clusters of data points. However, it may not always maintain the global structure as effectively as UMAP.
• UMAP has superior run time performance compared to t-SNE.
• UMAP can scale to significantly larger dataset sizes than is possible for t-SNE.
• UMAP preserves pairwise Euclidean distances significantly better than t-SNE.
Both UMAP and t-SNE are valuable tools for visualizing and exploring high-dimensional data. The choice between them depends on the specific characteristics of the data and the goals for dimensionality reduction and visualization.
9.5 Comparison Between SOM and Autoencoders Kohonen Self-Organizing Maps (SOM) and autoencoders are both used for dimensionality reduction, but they are based on different principles and have distinct characteristics. Both are powerful techniques for obtaining a latent space representation of data. While SOM focuses on topological mapping and preserving the data's intrinsic structure on a grid, autoencoders aim to learn a compressed representation that efficiently encodes the input data. The choice between the two methods depends on the specific characteristics of the data and the objectives of the task at hand.
References Bank, D., Koenigstein, N., & Giryes, R. (2023). Autoencoders. Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook (pp. 353–374). Chaudhary, A. (2020). A visual guide to self-labelling images. https://amitness.com/2020/04/ illustrated-self-labelling. Comon, P., & Jutten, C. (2010). Handbook of Blind Source Separation: Independent component analysis and applications. Academic press. Girolami, M., & Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms. In Advances in Neural Information Processing Systems Workshop, vol. 9. Denver, CO. Guo, X., Liu, X., Zhu, E., & Yin, J. (2017). Deep clustering with convolutional autoencoders. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14–18, 2017, Proceedings, Part II 24 (pp. 373–382). Springer. Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(9), 1457–1569. Hyvärinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4), 411–430. Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. Lee, D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791. McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126. Schölkopf, B., Smola, A., & Müller, K.-R. (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks, (pp. 583–588). Springer. Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319. Stone, J. V. (2004). Independent component analysis: a tutorial introduction. Tenenbaum, J. B., Silva, V. d., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605. Van Erven, T., & Harremos, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
Chapter 3
Feature Selection
Feature selection is a critical process in machine learning and data analysis aimed at identifying the most relevant and informative subset of features from a larger set of available features. By selecting the most discriminative features, feature selection can enhance model performance, reduce computational complexity, and improve interpretability. Various techniques are employed for feature selection, including filter methods, wrapper methods, and embedded methods, each with its advantages and considerations. Feature selection plays a crucial role in addressing the curse of dimensionality, handling redundant and irrelevant features, and improving model generalization. This chapter provides a concise overview of the fundamental concept and importance of feature selection in machine learning and data analysis. Feature selection is also relevant to global interpretability, as it indicates how much each variable contributes to the model. When done before training, feature selection is used to select the most relevant features to feed the model. It also helps in detecting irrelevant features, which reduces overfitting and may lead to an improvement in performance. Furthermore, the simplicity and comprehensibility of a model are bolstered when it operates with fewer variables, a feat made achievable through feature selection. This practice bestows several advantages, including expedited training of machine learning algorithms, a reduction in model intricacy, and an enhancement in interpretability. Importantly, it can also contribute to improved model accuracy, provided the right subset of features is chosen, while simultaneously mitigating the risk of overfitting. It is important to highlight that the majority of research on feature selection has predominantly focused on the supervised learning paradigm, with limited attention given to unsupervised learning tasks, particularly clustering tasks. In unsupervised feature selection, the goal is to leverage the inherent data structures, where the discovered patterns serve as pseudo-supervised cues for identifying the optimal feature subset. However, this endeavor is notably more demanding due to the frequently indistinct, inadequate, and ambiguous nature of the learned structural
information. As a result, unsupervised feature selection presents a more formidable challenge. Feature selection has received significant attention in the field of machine learning in recent decades. This is primarily because it offers a means to reduce the dimensionality of data while retaining the inherent semantic significance of the features. Unlike feature extraction, which often transforms original features into new representations, feature selection enables the preservation of the original physical meaning of the features. This, in turn, fosters greater interpretability in machine learning models, making it easier to understand and glean insights from the selected features.
1 Taxonomy Similar to supervised and semi-supervised feature selection, unsupervised feature selection methods may be categorized into three primary groups based on the strategy used for feature selection. First, we organize the feature selection techniques described in the literature in this section (Fig. 3.1). Then, we focus on each of these approaches’ key traits and the concepts upon which they are built to characterize them. There are numerous methods for feature selection, which, according to the literature, can be divided into three categories: wrapper, filter, and embedded methods.
1.1 Filter Methods Filter methods, also called feature ranking methods, use statistical metrics computed before training to select features. They can be categorized as univariate or multivariate. The idea consists of using certain criteria to evaluate each feature to obtain an ordered ranking list of features, from which the final feature subset is selected. Multivariate filter methods evaluate the relevance of the features jointly rather than individually; thus, they can handle redundant and irrelevant features. Figure 3.2 shows the general structure of the filter-based selection approach. In a filter-based feature selection strategy, the following stages are generally taken:
1. Feature scoring involves calculating a score metric for each feature in the dataset.
2. Feature ranking uses the scores to rank the features in descending order. The more significant the feature, the higher its rank.
3. Feature subset selection uses a given criterion to select a subset of the top-ranked features.
Fig. 3.1 Taxonomy of feature selection methods
Fig. 3.2 General structure of filter-based feature selection approach
4. Model training, in which a machine learning model is constructed using the selected subset of features and an appropriate methodology.
5. Model evaluation, which assesses the model's performance using suitable evaluation metrics such as accuracy, precision, recall, or F1-score.
The key step is feature scoring, for which many metrics have been used in the literature. Examples of metrics used for ranking features are the Pearson correlation coefficient (Liu et al., 2020), the chi-square test (Liu & Setiono, 1995), and mutual information (Hoque et al., 2014). The Pearson correlation coefficient is used to determine whether two continuous variables A and B are linearly related. Its value ranges between −1 and +1 and
can be derived from Eq. 3.1:
$$\rho_{A,B} = \frac{\mathrm{cov}(A,B)}{\sigma_A\,\sigma_B} = \frac{\mathrm{E}[(A-\mu_A)(B-\mu_B)]}{\sigma_A\,\sigma_B} \quad (3.1)$$
where σ_X and μ_X are, respectively, the standard deviation and the mean of a variable X, while E denotes the expectation operator. If the two variables A and B are correlated, one may predict one from the other. As a result, if two features are correlated, the model only requires one of them, since the second does not provide any new information. In statistics, the chi-squared test is used to determine the independence of two events. From the data of two variables, we may derive observed values O and expected values E. Chi-square measures the difference between the expected values E and the observed values O, as in Eq. 3.2:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \quad (3.2)$$
Statisticians typically rely on the chi-squared test to evaluate the level of independence between two categorical variables. Consequently, the decision to employ the chi-squared score to gauge a feature's significance in relation to the class variable within the realm of data mining was a logical one. Mutual information MI(A, B), according to information theory, is the reduction of uncertainty about A obtained from knowledge of B. Mutual information is defined mathematically as:
$$\mathrm{MI}(A,B) = \sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)} \quad (3.3)$$
where p(a, b) represents the joint probability distribution function of A and B, and p(a) and p(b) represent the marginal probability distribution functions of A and B, respectively. In the feature selection process, the feature-class mutual information is computed, and the feature with the highest mutual information is chosen. To sum up, filter techniques identify the most relevant characteristics entirely based on the intrinsic characteristics of the data without employing any clustering algorithms that may direct the search for relevant features. Although filter techniques are fast and simple to use, they may not take into account how features interact with each other and might not perform well with high-dimensional datasets.
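The scoring step of a filter strategy can be sketched as follows with the three metrics discussed above; a small labeled toy dataset is assumed purely to illustrate the computation of the scores and the resulting ranking.

```python
# Minimal sketch of filter-style feature ranking with the three scores discussed
# above (Pearson correlation, chi-square, mutual information). A labeled toy
# dataset is assumed purely for illustration of the scoring step.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Pearson correlation of each feature with the class variable
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

chi2_scores, _ = chi2(X, y)                              # chi-square statistic (non-negative X)
mi_scores = mutual_info_classif(X, y, random_state=0)    # feature-class mutual information

# Rank features by each criterion (higher score = more relevant)
for name, scores in [("pearson", np.abs(pearson)), ("chi2", chi2_scores), ("MI", mi_scores)]:
    print(name, np.argsort(scores)[::-1])
```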
1.2 Wrapper Methods Wrapper methods, also known as subset evaluation methods, use the results of a particular clustering algorithm to evaluate feature subsets. In these techniques the model is
Fig. 3.3 General structure of wrapper-based feature selection approach
trained with different combinations of input features to determine which gives the best result; in the unsupervised case, the criterion is the quality of the results of the clustering algorithm used for the selection. Figure 3.3 presents the main steps in a wrapper-based feature selection approach. The search strategy is critical for wrapper methods. In a sequential search, features are sequentially added to an empty set or removed from the complete set, which is referred to as Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS), respectively. The drawback of these methods is that features that are eliminated will not be reconsidered in further iterations, a phenomenon known as the nesting effect. Other methods, including evolutionary and bio-inspired algorithms, branch-and-bound search, and Simulated Annealing (SA), aim at solving this issue (Xue et al., 2015; Dokeroglu et al., 2022). Swarm optimization-based wrappers (Rostami et al., 2021), such as hybrid Particle Swarm Optimization (PSO)-based methods, multi-verse optimization-based methods, enhanced chaotic multi-swarm whale optimization-based methods, and improved Ant Colony Optimization (ACO)-based methods, have produced some research results recently. Wrapper techniques can take into account the relationships between features and are potentially more efficient in high-dimensional datasets, although they are
computationally more expensive than filter approaches. They may also be sensitive to the learning method chosen and more prone to overfitting.
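As an illustration of the wrapper idea in the unsupervised case, the sketch below performs a simple Sequential Forward Selection in which each candidate subset is scored by the silhouette of a k-means clustering; the dataset, the number of clusters, and the subset size are illustrative assumptions.

```python
# Minimal sketch of a sequential forward selection (SFS) wrapper for the
# unsupervised case: candidate subsets are scored by the silhouette of a
# k-means clustering. The dataset and k are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
k, n_select = 3, 5

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < n_select:
    best_feat, best_score = None, -np.inf
    for f in remaining:                                   # try adding each candidate feature
        subset = X[:, selected + [f]]
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(subset)
        score = silhouette_score(subset, labels)          # clustering quality as wrapper criterion
        if score > best_score:
            best_feat, best_score = f, score
    selected.append(best_feat)
    remaining.remove(best_feat)
    print(f"added feature {best_feat}, silhouette = {best_score:.3f}")
print("selected subset:", selected)
```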
1.3 Embedded Methods Embedded methods refer to algorithms with built-in techniques for selecting features. Filter methods are generally better at identifying redundant features, while wrapper methods are better at identifying irrelevant features. Embedded methods aim at combining the qualities of filter and wrapper methods (Bolón-Canedo et al., 2013). They are implemented by algorithms that have their own built-in feature selection methods and correspond to ad hoc modeling. These include Random Forests (RF), Lasso Regression, and Ridge Regression, among others (Zou & Hastie, 2005). However, these techniques have a notable drawback in that they cannot be separated from the algorithm they are embedded within. On the other hand, embedded methods offer distinct advantages, including computational efficiency and the reduced bias associated with selected features. The process of feature selection through embedding typically involves the following steps (a minimal illustration is given at the end of this subsection):
1. Choose a learning algorithm that includes feature selection in its training phase. These comprise, among others, the Random Forests (RF), Lasso Regression, and Ridge Regression algorithms.
2. Combine feature selection with the learning model during the training process, to assess the relevance or importance of features. This can be accomplished via weight coefficients, regularization, or other methods included in the algorithm of choice.
3. Evaluate each feature's importance while building the model. The metric used to evaluate feature importance is frequently based on the impact of the feature on the performance of the learning model; it can be based on error minimization or on the enhancement of the model's generalization ability.
4. Rank features based on the computed importance metric. Features with small importance may be assigned smaller weights or entirely removed.
5. Check the success of the learning model using a cross-validation technique. This stage ensures the effectiveness of the selected features and their contribution to the predictive capacity of the model.
6. Iterate and adjust the hyperparameters to find the best subset of features.
More generally, the difficulties in feature selection, as well as in clustering, are due to the variability of data instance distributions and the presence of noisy features and samples. In addition, there is a further, more structural level of relationship between data points that is not always explicit in high-dimensional feature spaces.
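A minimal sketch of the embedded strategy is given below, using an L1-regularized (Lasso) model whose zeroed coefficients discard features during training; the dataset and the regularization strength are illustrative assumptions, and tree-based importances could be used in the same way.

```python
# Minimal sketch of an embedded strategy: L1-regularized (Lasso) regression
# whose zeroed coefficients discard features during training. The supervised
# toy dataset and the alpha value are illustrative assumptions.
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)                 # larger alpha drives more coefficients to zero
selector = SelectFromModel(lasso, prefit=True)
mask = selector.get_support()                      # True for features with non-zero weight
print("kept features:", mask.nonzero()[0], "coefficients:", lasso.coef_.round(2))
X_reduced = selector.transform(X)
print(X_reduced.shape)
```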
2 Popular Unsupervised Feature Selection Methods The state-of-the-art unsupervised feature selection methods are synthesized here. We refer the interested reader to the aforementioned reviews (Solorio-Fernández et al., 2020; Hancer et al., 2020) for a deep analysis.
2.1 Filter-Based Methods Filter techniques are independent of the model and may be considered as an initial processing step that ranks features based on some criterion and selects the best features. Several filter-based techniques were proposed in the literature, some of them use statistical methods such as Relief (Kira & Rendell, 1992) which estimates the quality of features on the basis of how well the feature can distinguish between instances that are near to each other. Correlation-based Feature Selection (CFS) (Hall, 1999) selects the feature subset with low feature-feature correlation to maintain or increase predictive power. Minimal Redundancy Maximal Relevance (mRMR) (Radovic et al., 2017) selects the feature that has maximum relevance with respect to the target variable and minimum redundancy with respect to the features already included in the model. Others are based on spectral analysis such as Laplacian Score (LS) (He et al., 2005), which selects the features depending on their locality preserving power, and Spectral Feature Selection (SPEC) (Zhao & Liu, 2007), which selects the features using the spectrum of the graph induced from the set of pairwise instance similarities. In the following, we present some popular filter-based methods.
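Before turning to these methods, the sketch below illustrates the Laplacian Score criterion cited above with a small NumPy/scikit-learn implementation; the neighborhood size and RBF bandwidth are illustrative assumptions, and features with lower scores are considered more locality-preserving.

```python
# Minimal NumPy/scikit-learn sketch of the Laplacian Score criterion (He et al., 2005):
# features that best preserve local structure obtain a *low* score.
# k and the RBF bandwidth are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

X, _ = load_iris(return_X_y=True)
n, d = X.shape

# Affinity matrix restricted to a symmetrized k-nearest-neighbour graph
knn = kneighbors_graph(X, n_neighbors=5, include_self=False).toarray()
W = rbf_kernel(X, gamma=0.5) * np.maximum(knn, knn.T)
D = np.diag(W.sum(axis=1))
L = D - W                                                # graph Laplacian

scores = np.zeros(d)
for r in range(d):
    f = X[:, r]
    f_tilde = f - (f @ D.sum(axis=1)) / D.sum()          # remove the degree-weighted mean
    scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
print("Laplacian scores (lower = more locality-preserving):", scores.round(3))
```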
Multimodal Multi-objective Filter-Based Feature Selection One of the recent and popular unsupervised filter-based feature selection techniques was proposed in Jha and Saha (2021). This method is known as multimodal multi-objective filter-based feature selection. By including multimodality in the process, the method generates a vast number of diverse solutions. This allows for a more in-depth examination of the feature space, improving the possibility of discovering relevant features. Particle Swarm Optimization (PSO) techniques with specific crowding distance (SCD) are used to promote multimodality. The ring-based PSO aids in the discovery of more optimal feature subsets for the feature selection problem, as SCD enables the preservation of these good solutions while increasing diversity. To assess the quality of selected feature subsets, six objective functions [F1, ..., F6] based on information-theoretic principles, such as the correlation coefficient and mutual information, are used. These objective functions are computed and optimized simultaneously.
The objective function F1, based on the correlation of the selected feature subset and the class attribute, is defined by Eq. 3.4:
$$F_1 = \frac{\sum_{f_i, f_j \in SF,\, f_i \neq f_j} \mathrm{COR}(f_i, f_j)}{\sum_{f_i \in SF,\, c = \text{class}} \mathrm{COR}(f_i, c)} \quad (3.4)$$
where SF is the selected feature subset, COR(f_i, f_j) represents the correlation coefficient between features f_i and f_j in SF, and COR(f_i, c) is the correlation coefficient between feature f_i in SF and the class label c. The goal is to select a good feature subset by minimizing the value of F1. The objective F2, computed by Eq. 3.5, ensures that selected feature subsets have a low cardinality, i.e., a small number of features in the selected subset:
$$F_2 = |SF| \quad (3.5)$$
Mutual information between the features in SF should be kept to a minimum to avoid redundant information in the feature subset. Furthermore, the mutual information between a selected feature and a non-selected feature (NSF) should be high. The following values are computed based on the normalized mutual information (NMI):
$$\phi_1 = \sum_{f_i, f_j \in SF,\, f_i \neq f_j} \frac{2\,\mathrm{NMI}(f_i, f_j)}{|SF|\,(|SF|-1)} \quad (3.6)$$
$$\phi_2 = \sum_{f_i \in NSF,\, f_j \in SF,\, f_j = 1\mathrm{NN}(f_i)} \frac{\mathrm{NMI}(f_i, f_j)}{NSF} \quad (3.7)$$
$$\phi_3 = \mathrm{avg\_std}(SF) \quad (3.8)$$
To select a feature subset whose features are mutually non-redundant, the value of φ1 should be minimized while the values of φ2 and φ3 should be maximized. Based on this, the objective functions F3 to F6 are formulated as in Eqs. 3.9, 3.10, 3.11, and 3.12, and all need to be maximized:
$$F_3 = \phi_3\,(\phi_2 - \phi_1) \quad (3.9)$$
$$F_4 = \phi_3\,\phi_2 - \phi_1 \quad (3.10)$$
$$F_5 = \phi_2 - \phi_1 \quad (3.11)$$
$$F_6 = \frac{\phi_2}{\phi_1} \quad (3.12)$$
The combination of multimodality with filter-based selection provides a strong and versatile feature selection framework that can increase model performance and generalization on many tasks.
Fig. 3.4 Local filter search based on feature redundancy in FBPSO. F represents the original feature set, and F′ is the feature set after removing irrelevant and weakly relevant features
Filter-Based Bare-Bone Particle Swarm Optimization Likewise, in Zhang et al. (2019) a Filter-Based Bare-Bone Particle Swarm Optimization (FBPSO) approach for unsupervised feature selection was presented. By adding two filter-based strategies to speed up convergence, it improves particle swarm optimization (PSO) for unsupervised feature selection. One removes irrelevant features using a space reduction strategy based on mutual information, and the other improves swarm optimization through a local filter search based on feature redundancy. For enhanced performance, the method additionally includes a feature similarity-based evaluation function and a parameter-free particle update strategy. The key idea behind FBPSO is to improve standard Bare Bones Particle Swarm Optimization (BBPSO) (Kennedy, 2003) with a local filter search. As a selection criterion, the proposed local filter search based on feature redundancy employs the Average Normalized Mutual Information (ANMI). For each feature f_i, the ANMI is computed between this feature and its neighbors by the following equation:
$$\mathrm{ANMI}(f_i) = \sum_{f_j \in k\text{-NN}(f_i)} \frac{\mathrm{NMI}(f_i, f_j)}{k} \quad (3.13)$$
By running a space reduction strategy based on mutual information, a new reduced feature set F′ is obtained that only contains strongly relevant features; BBPSO is then employed using the local filter, as shown in Fig. 3.4.
2.2 Wrapper-Based Methods Wrapper-based techniques can be subdivided into three categories. The first category comprises sequential wrapper methods, known for their speed and ease of implementation as demonstrated by Breaban and Luchian (2011). The second
category involves strategies based on evolutionary algorithms for wrapper feature selection. These methods introduce randomness into the search process to escape local optima, as seen in the works of Kim et al. (2002), and Dutta et al. (2014). The final subcategory encompasses iterative approaches, exemplified by the works of Law et al. (2004), and Guo and Zhu (2018). These methods tackle unsupervised feature selection as an estimation problem rather than a combinatorial search. Here are some examples of wrapper-based approaches.
A Unifying Criterion for Unsupervised Clustering and Feature Selection One of the promising sequential wrapper techniques was proposed by Breaban and Luchian (2011). The authors introduce a new objective function for unsupervised feature selection and clustering, described in Eq. 3.14. Regardless of the number of clusters or features, the objective function efficiently directs the search for relevant features and optimal partitions in an unbiased manner.
$$\mathrm{CritC} = \left( \frac{2m\,F}{2m+1} \right)^{\log_2(k+1)+1} \quad (3.14)$$
where m is the number of selected features, k is the number of clusters, and $F = \frac{1}{1 + W/B}$ is a function to be maximized, with:
• $W = \sum_{i=1}^{k} \sum_{d \in C_i} \delta(c_i, d)$ the within-cluster inertia, where d denotes a data item in cluster C_i, c_i is the cluster center, and δ is a distance function.
• $B = \sum_{i=1}^{k} |C_i|\, \delta(c_i, g)$ the between-cluster inertia, where g is the center of the entire data set and |C_i| is the size of each cluster.
This function provides a ranking score for each partition created in the search space of all possible feature subsets and cluster numbers. This method’s criteria offer both a ranking of important attributes and an optimal partition.
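A small Python sketch of this criterion is given below; it follows the reconstruction of Eq. 3.14 above and assumes Euclidean distance for δ, which the original paper leaves as a design choice.

import numpy as np

def crit_c(X_sub, labels, centers):
    # X_sub: (n_samples, m) data restricted to a candidate feature subset
    # labels: cluster index of each sample; centers: (k, m) cluster centers
    m = X_sub.shape[1]
    k = len(centers)
    g = X_sub.mean(axis=0)                              # center of the whole data set
    W = sum(np.linalg.norm(X_sub[labels == i] - centers[i], axis=1).sum()
            for i in range(k))                          # within-cluster inertia
    B = sum((labels == i).sum() * np.linalg.norm(centers[i] - g)
            for i in range(k))                          # between-cluster inertia
    F = 1.0 / (1.0 + W / B)
    return (2 * m * F / (2 * m + 1)) ** (np.log2(k + 1) + 1)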
Evolutionary Model Selection in Unsupervised Learning
Kim et al. (2002) were the pioneers of bio-inspired procedures; the proposed approach employs an evolutionary local selection algorithm (ELSA) to develop different solutions in a multi-dimensional objective space. Clusters are formed by applying clustering algorithms, k-means or Expectation Maximization (EM), to feature subsets. Four objective functions were covered in Pareto-based evolutionary data clustering. The first objective function favors dense clusters by measuring the cohesiveness of clusters, the second favors clusters that are well-separated by measuring their distance from the global centroid, the third aims to reduce the number of clusters, and the fourth aims to find low-cost solutions by minimizing the number of selected features. Instead of combining the objective functions, a Pareto-based evolutionary method
was used to find numerous Pareto-optimal solutions. The results of the experiments demonstrate that the technique consistently detects relevant characteristics and an adequate number of clusters, resulting in models with higher semantic significance.
Simultaneous Feature Selection and Clustering with Mixed Features by MOGA
Another approach based on an evolutionary algorithm is proposed by Dutta et al. (2014). The paper describes a novel evolutionary clustering approach for mixed-type data (numerical and categorical) that conducts both clustering and feature selection at the same time. The technique increases clustering quality, understandability, and scalability by optimizing several objectives: decreasing intra-cluster distance and maximizing inter-cluster distance. To reach near-global optimum solutions, it combines the global search capability of the Multi-Objective Genetic Algorithm (MOGA) with the local search capability of the K-prototype (KP) method.
Simultaneous Feature Selection and Clustering Using Mixture Models
A popular iterative wrapper method proposed by Law et al. (2004) addresses the critical challenges of feature selection and cluster number determination in clustering techniques. In this work, instead of selecting a subset of features, a set of real-valued quantities between 0 and 1, called feature saliencies, is estimated for each feature. A feature saliency is estimated using an expectation-maximization (EM) approach in mixture-based clustering. The technique uses a Minimum Message Length (MML) model selection criterion to successfully perform feature selection by driving the saliency of irrelevant features to zero. The criterion and technique are extended to estimate feature saliencies and the number of clusters at the same time.
DGUFS: Dependence Guided Unsupervised Feature Selection
More recently, Guo and Zhu (2018) proposed a Dependence Guided Unsupervised Feature Selection (DGUFS) method that selects features and partitions data in a joint manner to avoid suboptimal feature selection. The method enhances the interdependence among the original data X, the cluster labels V, and the selected features Y. This is achieved through a projection-free feature selection model based on \ell_{2,0}-norm equality constraints, as outlined below:

\min_{Y,V} J(X, V, Y) \quad \text{s.t.} \quad \|X - Y\|_{2,0} = d - m, \; \|Y\|_{2,0} = m, \; V \in \Omega    (3.15)
Fig. 3.5 Joint learning framework of the DGUFS
m is the number of selected features, while d is the original data dimension; min and s.t. are short for minimize and subject to, respectively. Finally, \Omega is the set of potential cluster labels that can perfectly classify data into a given number of clusters. To guide the process of feature selection, two dependence-guided terms J_1 and J_2 were proposed for the model. J_1 is based on the geometrical structure and discriminative information of the data to increase the dependence of the intended label on the original data. J_2, on the other hand, is based on the Hilbert–Schmidt Independence Criterion (HSIC) to maximize the dependence of the selected features on the intended label. The objective function is constructed as follows:

J(X, V, Y) = \beta J_1(X, V) + (1 - \beta) J_2(V, Y)    (3.16)
where \beta \in (0, 1) is a regularization parameter. An iterative algorithm based on the Alternating Direction Method of Multipliers (ADMM) is designed to solve the constrained minimization problem (Eq. 3.15) efficiently; Fig. 3.5 presents the general structure of the DGUFS approach.
2.3 Embedded Methods
The sparse learning-based approach is an important direction of embedded methods. It produces sparse feature scores, removes the features with zero scores, and combines the features with non-zero scores into a subset. To make the feature weights sparse, embedding a sparse regularization term (Rahangdale & Raut, 2019) into the learning model is a good idea. Regularization methods (Hou et al., 2013; Goodfellow et al., 2016) offer an alternative way to learn classifications for data sets with a large number of features but a small sample size. These methods trim the space of features directly during classification. In other words, regularization
effectively shuts down the influence of unnecessary features. Regularization can be incorporated either into the error criterion or directly into the model. The existing feature selection algorithms using sparsity (Gui et al., 2016) can be grouped into two categories: vector-based feature selection based on the lasso, and matrix-based feature selection based on the L_{r,p}-norm.
JELSR: Joint Embedding Learning and Sparse Regression
Joint Embedding Learning and Sparse Regression (JELSR) (Hou et al., 2013) is an unsupervised feature selection framework. Unlike traditional methods, JELSR combines embedding learning and sparse regression for feature selection. The authors also present a method using local linear approximation and L_{2,1}-norm regularization, along with an algorithm to optimize it. The procedure of the JELSR algorithm is broken down into three stages:
• Stage one: Graph Construction
1. Construct the nearest neighborhood graph G based on the input data set X = {x_i | i = 1, 2, ..., n} using a specified neighborhood size k. This step involves identifying the k nearest neighbors of each data point in the dataset.
2. Compute the similarity matrix S based on the graph G. This matrix quantifies the pairwise similarity between data points in the dataset. Also, compute the graph Laplacian L, which is used to characterize the manifold structure of high-dimensional data.
• Stage two: Alternating Optimization
1. Initialize U with the identity matrix I_{d×d}, where d is the dimensionality of the original data. This kind of initialization speeds up the algorithm's convergence.
2. Alternately update the matrices U, Y, and W until convergence. The details of these updates are provided by Eqs. 3.17, 3.18, and 3.19. This step involves iteratively refining the embedding matrix Y with dimensions m by n and the transformation matrix W with dimensions d by m, together with the matrix U, to optimize the feature representation. Here m is the dimensionality of the embedding, n is the number of data points, and d is the dimensionality of the original data.
• Stage three: Feature Selection
1. Compute the score ||W_i||_2 of each feature i = 1, ..., d, where W_i represents the i-th row of the transformation matrix W. These scores represent the importance or relevance of each original feature in the transformed space.
2. Sort these scores and select the largest s values: once the feature scores are computed, sort them in descending order and select the top s features with
the highest scores. The indexes corresponding to these selected features are added to the selected feature index set {r_1, r_2, ..., r_s}.
The specific update rules for U, Y, and W are governed by the following equations:

U_{i,i} = \frac{1}{\|W_i\|_2}    (3.17)

W = (XX^T + \alpha U)^{-1} X Y^T    (3.18)

Y = \arg\min_Y \; tr\big( Y (L + \beta I_{n\times n} - \beta X^T (XX^T + \alpha U)^{-1} X) Y^T \big)    (3.19)
where \alpha and \beta are two balance parameters. In comparison to classical unsupervised feature selection approaches, the JELSR method combines the advantages of embedding learning and sparse regression. Experimental findings on several data sets, including images, audio, and biological data, have proven the efficacy of the proposed approach.
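The alternating updates of Eqs. 3.17–3.19 can be sketched in a few lines of Python. The heat-kernel similarity graph and the use of the m smallest eigenvectors of the modified Laplacian to minimize Eq. 3.19 are common but assumed choices here, not details prescribed by the original paper.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def jelsr_scores(X, m=2, k=5, alpha=1.0, beta=1.0, n_iter=20):
    # X: (d, n) data matrix (features x samples)
    d, n = X.shape
    # Stage one: graph construction with heat-kernel weights (assumed choice)
    S = kneighbors_graph(X.T, n_neighbors=k, mode='distance', include_self=False).toarray()
    sigma = np.mean(S[S > 0]) + 1e-12
    W_graph = np.where(S > 0, np.exp(-(S ** 2) / (2 * sigma ** 2)), 0.0)
    W_graph = np.maximum(W_graph, W_graph.T)            # symmetrize
    L = np.diag(W_graph.sum(axis=1)) - W_graph          # graph Laplacian (n x n)

    # Stage two: alternating optimization
    U = np.eye(d)
    for _ in range(n_iter):
        # Eq. 3.19: embedding Y from the m smallest eigenvectors of the modified Laplacian
        M = L + beta * np.eye(n) - beta * X.T @ np.linalg.inv(X @ X.T + alpha * U) @ X
        _, eigvecs = np.linalg.eigh((M + M.T) / 2)
        Y = eigvecs[:, :m].T                            # (m, n)
        # Eq. 3.18: transformation matrix W (d x m)
        Wmat = np.linalg.inv(X @ X.T + alpha * U) @ X @ Y.T
        # Eq. 3.17: diagonal re-weighting matrix U
        U = np.diag(1.0 / (np.linalg.norm(Wmat, axis=1) + 1e-12))

    # Stage three: feature scores = row norms of W; keep the s largest
    return np.linalg.norm(Wmat, axis=1)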
ESM: Expectation Selection Maximization
A Gaussian Mixture Model based embedded feature selection method is proposed in Fu et al. (2021), in which a feature selection step (S) is introduced into the Expectation Maximization (EM) procedure. In particular, a Relevance Index (RI), a statistic reflecting the likelihood of assigning a data point to a particular cluster, is added. The RI reveals how much a feature contributes to the clustering process, which can assist in feature selection. The ESM algorithm with the proposed feature selection (S) step is summarized as follows:
1. Initialize the mean \mu_k, covariance \Sigma_k, and mixing coefficient \alpha_k for each cluster k = 1, 2, ..., K, and evaluate the initial value of the log-likelihood.
2. E step: Evaluate the responsibilities using the current parameter values; for the n-th data point and the k-th cluster

\gamma^{F}(z_{nk}) = \frac{\alpha_k N(x_n|\mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \alpha_{k'} N(x_n|\mu_{k'}, \Sigma_{k'})}    (3.20)
and the responsibilities after excluding the j-th feature

\gamma^{F_j^-}(z_{nk}) = \frac{\alpha_k N(x_n^*|\mu_k^*, \Sigma_k^*)}{\sum_{k'=1}^{K} \alpha_{k'} N(x_n^*|\mu_{k'}^*, \Sigma_{k'}^*)}    (3.21)
where x_n^*, \mu_k^*, and \Sigma_k^* are the corresponding vectors of x_n, \mu_k, and \Sigma_k after excluding the j-th variable. F = {f_1, f_2, ..., f_D} is the feature space with D
features, and F_j^- = {f_1, f_2, ..., f_D} \setminus {f_j} is the feature space excluding feature j.
3. S step: Calculate the difference between the responsibilities before and after excluding the j-th feature at iteration t:

RI(j)^{(t)} = \frac{1}{NK} \sum_{n,k} \big| \gamma^{F}(z_{nk}) - \gamma^{F_j^-}(z_{nk}) \big|    (3.22)
If |RI(j)^{(t+1)} - RI(j)^{(t)}| < \epsilon (converged) and RI(j)^{(t)} is smaller than a predefined threshold, then discard the feature with the smallest RI and update the full feature space F.
4. M step: For the reduced data with feature space F, re-estimate the parameters using the current responsibilities:

\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n    (3.23)

\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T    (3.24)

\alpha_k^{new} = \frac{N_k}{N}    (3.25)
where N_k represents the number of data points in the k-th cluster.
5. Evaluate the log-likelihood
\ln P(\alpha, \mu, \Sigma | X) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \alpha_k N(x_n | \mu_k, \Sigma_k)    (3.26)
If the parameters or the log-likelihood have not converged, go back to step 2.
ESM performs better than EM in clustering accuracy on synthetic datasets, highlighting its feature identification capabilities. It also competes well in accuracy and runtime on benchmark datasets but struggles with complex data mixing continuous and categorical features.
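The S step above amounts to comparing responsibilities with and without one feature. A minimal Python sketch of this computation is given below, assuming full-covariance Gaussian components; the function names and the use of scipy.stats are illustrative choices, not part of the original ESM description.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, mus, covs):
    # E-step responsibilities (Eq. 3.20); mus is a list of mean vectors
    # (np.ndarray) and covs a list of covariance matrices
    dens = np.column_stack([a * multivariate_normal.pdf(X, mean=m, cov=c)
                            for a, m, c in zip(alphas, mus, covs)])
    return dens / dens.sum(axis=1, keepdims=True)

def relevance_index(X, alphas, mus, covs, j):
    # RI(j): mean absolute change in responsibilities when feature j is
    # dropped from the data and from every Gaussian component (Eqs. 3.21-3.22)
    gamma_full = responsibilities(X, alphas, mus, covs)
    keep = [i for i in range(X.shape[1]) if i != j]
    gamma_red = responsibilities(
        X[:, keep],
        alphas,
        [m[keep] for m in mus],
        [c[np.ix_(keep, keep)] for c in covs])
    return np.abs(gamma_full - gamma_red).mean()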
3 Deep Learning and Feature Selection
The connection between deep learning and feature selection can take two forms: deep learning based on feature selection, or feature selection based on deep learning. The first involves applying traditional feature selection methods in DNN architectures as a processing stage. For instance, in Toğaçar et al. (2020) the dimension of
the deep features produced by a DNN was reduced by using the minimum redundancy maximum relevance (mRMR) technique. The features obtained by mRMR were then fed into several classifiers as input. Similarly, in Özyurt (2020) the AlexNet, VGG16, VGG19, GoogleNet, ResNet, and SqueezeNet pre-trained architectures were employed as feature extractors. Then, the Relief feature selection algorithm was used to reduce the features generated from each architecture's last fully connected layer in order to produce more effective features. Finally, the selected features are given to a support vector machine classifier. More recently, in Bidgoli et al. (2022) a trained network for histopathology image representation called KimiaNet was used as an extractor. The extracted high-dimensional deep features were reduced using multi-objective approaches based on evolutionary algorithms. The selected features, called the Compact Feature Vector (CFV), were then fed to the k-nearest neighbor algorithm. Understanding the nature of a trained model or a complicated system can also be assisted by feature selection at the input level. In Haq et al. (2021) the authors employ a multi-filter feature selection technique to select an optimal feature set from the original data set. The generated feature set is then fed into a deep generative model for classification purposes. Deep learning methods can also be used to perform feature selection, as the weight of an irrelevant feature will be close to zero when training a neural network. This is known as feature selection based on deep learning, or deep-learning-based feature selection. Current techniques employ a simple autoencoder to perform feature selection based on reconstruction error (Singh et al., 2016; Antoniades & Took, 2016; Han et al., 2018; Feng & Duarte, 2018). To ensure that the error can be back-propagated easily while the best features are learned efficiently, it is actually required to consider a simple network structure when selecting features. The AutoEncoder Feature Selector (AEFS) approach (Han et al., 2018) reconstructs data using a single-layer autoencoder and performs feature selection with a row-sparsity constraint on the first layer of the autoencoder. Likewise, Graph Autoencoder-based unsupervised Feature Selection (GAFS) (Feng & Duarte, 2018) uses a single-layer autoencoder for data reconstruction and feature selection. Furthermore, it employs spectral graph analysis of the projected data in the learning process. By using this method, the low-dimensional feature space retains the local data geometry of the original data space. In the literature, Zou et al. (2015) introduce a novel deep-learning-based method for feature selection in remote sensing scene classification. It treats feature selection as a feature reconstruction task using a deep belief network (DBN). Features with lower reconstruction errors are considered more essential for image representation. An iterative algorithm is introduced to adapt the DBN for producing the required reconstruction weights. However, the complexity and computation time of this method are frequently very high. In Mirzaei et al. (2020), a Teacher-Student Feature Selection (TSFS) technique was proposed. A teacher autoencoder, which is a complex neural network, is first used in this technique to learn the optimal representation of the data in low dimensions.
Then, a student autoencoder (a simple neural network) is employed to select features by reducing the reconstruction error of the low-dimensional representation.
The low-dimensional deep features were ranked by the student network, and the top-ranked features were chosen based on the weights learned by the student autoencoder.
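As an illustration of reconstruction-error-based selection, here is a minimal PyTorch sketch in the spirit of AEFS (Han et al., 2018): a single-layer autoencoder whose first-layer weights receive an L2,1 (row-sparsity) penalty, after which features are ranked by the norm of their associated weights. The ReLU activation, the hyperparameters, and the class and function names are assumptions for illustration only.

import torch
import torch.nn as nn

class SparseAEFeatureSelector(nn.Module):
    # Single-hidden-layer autoencoder; the encoder weights attached to each
    # input feature are pushed toward zero as a group by the L2,1 penalty.
    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))

    def l21_penalty(self):
        # encoder.weight is (n_hidden, n_features): column j holds every
        # weight connected to input feature j
        return self.encoder.weight.norm(dim=0).sum()

def feature_scores(model):
    # Features whose weight columns keep a large norm are deemed relevant
    return model.encoder.weight.norm(dim=0).detach()

# Training sketch: reconstruction loss plus the sparsity penalty
# model = SparseAEFeatureSelector(n_features=100, n_hidden=32)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = ((model(x) - x) ** 2).mean() + 0.01 * model.l21_penalty()
# loss.backward(); opt.step()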
References

Antoniades, A., & Took, C. C. (2016). Speeding up feature selection: A deep-inspired network pruning algorithm. In 2016 International Joint Conference on Neural Networks (IJCNN) (pp. 360–366).
Bidgoli, A. A., Rahnamayan, S., Dehkharghanian, T., Riasatian, A., Kalra, S., Zaveri, M., Campbell, C. J., Parwani, A., Pantanowitz, L., & Tizhoosh, H. (2022). Evolutionary deep feature selection for compact representation of gigapixel images in digital pathology. Artificial Intelligence in Medicine, 132, 102368.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34(3), 483–519.
Breaban, M., & Luchian, H. (2011). A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition, 44(4), 854–865.
Dokeroglu, T., Deniz, A., & Kiziloz, H. E. (2022). A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing, 494, 269–296.
Dutta, D., Dutta, P., & Sil, J. (2014). Simultaneous feature selection and clustering with mixed features by multi objective genetic algorithm. International Journal of Hybrid Intelligent Systems, 11(1), 41–54.
Feng, S., & Duarte, M. F. (2018). Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing, 312, 310–323.
Fu, Y., Liu, X., Sarkar, S., & Wu, T. (2021). Gaussian mixture model with feature selection: An embedded approach. Computers & Industrial Engineering, 152, 107000.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Regularization for deep learning. Deep Learning, 216–261.
Gui, J., Sun, Z., Ji, S., Tao, D., & Tan, T. (2016). Feature selection based on structured sparsity: A comprehensive study. IEEE Transactions on Neural Networks and Learning Systems, 28(7), 1490–1507.
Guo, J., & Zhu, W. (2018). Dependence guided unsupervised feature selection. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
Hall, M. A. (1999). Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato.
Han, K., Wang, Y., Zhang, C., Li, C., & Xu, C. (2018). Autoencoder inspired unsupervised feature selection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2941–2945).
Hancer, E., Xue, B., & Zhang, M. (2020). A survey on feature selection approaches for clustering. Artificial Intelligence Review, 53(6), 4519–4545.
Haq, A. U., Zeb, A., Lei, Z., & Zhang, D. (2021). Forecasting daily stock trend using multi-filter feature selection and deep learning. Expert Systems with Applications, 168, 114444.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18.
Hoque, N., Bhattacharyya, D., & Kalita, J. (2014). MIFS-ND: A mutual information-based feature selection method. Expert Systems with Applications, 41(14), 6371–6385.
Hou, C., Nie, F., Li, X., Yi, D., & Wu, Y. (2013). Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6), 793–804.
Jha, K., & Saha, S. (2021). Incorporation of multimodal multiobjective optimization in designing a filter based feature selection technique. Applied Soft Computing, 98, 106823.
Kennedy, J. (2003). Bare bones particle swarms. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium, SIS'03 (Cat. No.03EX706) (pp. 80–87).
Kim, Y., Street, W. N., & Menczer, F. (2002). Evolutionary model selection in unsupervised learning. Intelligent Data Analysis, 6(6), 531–556.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Machine Learning Proceedings 1992 (pp. 249–256). Elsevier.
Law, M., Figueiredo, M., & Jain, A. (2004). Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1154–1166.
Liu, H., & Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes. In Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence (pp. 388–391).
Liu, Y., Mu, Y., Chen, K., Li, Y., & Guo, J. (2020). Daily activity feature selection in smart homes based on Pearson correlation coefficient. Neural Processing Letters, 51(2), 1771–1787.
Mirzaei, A., Pourahmadi, V., Soltani, M., & Sheikhzadeh, H. (2020). Deep feature selection using a teacher-student network. Neurocomputing, 383, 396–408.
Özyurt, F. (2020). Efficient deep feature selection for remote sensing image recognition with fused deep learning architectures. The Journal of Supercomputing, 76(11), 8413–8431.
Radovic, M., Ghalwash, M., Filipovic, N., & Obradovic, Z. (2017). Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics, 18(1), 1–14.
Rahangdale, A., & Raut, S. (2019). Deep neural network regularization for feature selection in learning-to-rank. IEEE Access, 7, 53988–54006.
Rostami, M., Berahmand, K., Nasiri, E., & Forouzandeh, S. (2021). Review of swarm intelligence-based feature selection methods. Engineering Applications of Artificial Intelligence, 100, 104210.
Singh, V., Baranwal, N., Sevakula, R. K., Verma, N. K., & Cui, Y. (2016). Layerwise feature selection in stacked sparse auto-encoder for tumor type prediction. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 1542–1548).
Solorio-Fernández, S., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2020). A review of unsupervised feature selection methods. Artificial Intelligence Review, 53(2), 907–948.
Toğaçar, M., Ergen, B., Cömert, Z., & Özyurt, F. (2020). A deep feature learning model for pneumonia detection applying a combination of mRMR feature selection and machine learning models. IRBM, 41(4), 212–222.
Xue, B., Zhang, M., Browne, W. N., & Yao, X. (2015). A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation, 20(4), 606–626.
Zhang, Y., Li, H.-G., Wang, Q., & Peng, C. (2019). A filter-based bare-bone particle swarm optimization algorithm for unsupervised feature selection. Applied Intelligence, 49(8), 2889–2898.
Zhao, Z., & Liu, H. (2007). Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 1151–1157).
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zou, Q., Ni, L., Zhang, T., & Wang, Q. (2015). Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 12(11), 2321–2325.
Chapter 4
Clustering
Clustering (Ezugwu et al., 2022) is an unsupervised classification technique that partitions a set of objects in such a way that objects in the same clusters are more similar to one another than the objects in different clusters according to certain predefined criteria. The term unsupervised means that grouping is established based on the intrinsic structure of the data without any need to supply the process with training items. This chapter provides a comprehensive overview of traditional clustering algorithms, which have been fundamental in the field of unsupervised learning. The strengths and limitations of each algorithm are carefully examined, along with guidelines for selecting the most appropriate method based on the dataset characteristics and clustering objectives. Moreover, the main evaluation metrics used in clustering such as cluster validity or popular indexes are presented. By the end of this chapter, readers will gain a deep understanding of traditional clustering algorithms, enabling them to apply these techniques effectively in diverse real-world scenarios and providing a solid foundation for further exploration of more advanced clustering methodologies.
1 Taxonomy
Clustering algorithms can generally be divided into two categories: hierarchical clustering and partitional clustering. An incomplete list of approaches includes partition-based methods (e.g., k-means) (Bottou and Bengio, 1994), density-based methods (e.g., DBSCAN) (Ester et al., 1996), hierarchical methods (e.g., BIRCH) (Zhang et al., 1997), grid-based methods (e.g., STING) (Wang et al., 1997), and spectral methods (Von Luxburg, 2007). The most common clustering algorithms are the hierarchical clustering algorithm, the partition clustering algorithm, the grid-based clustering algorithm, and the fuzzy-based clustering algorithm.
• Connectivity models: As the name suggests, these models are based on the notion that data points closer in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithms and their variants.
• Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid or cluster center of the clusters. The k-means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
• Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, Normal or Gaussian). These models often suffer from overfitting. A popular example of these models is the expectation-maximization algorithm, which uses multivariate normal distributions.
• Density models: These models search the data space for regions of varying density of data points. They isolate the different dense regions and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN and OPTICS. These models are particularly useful for identifying clusters of arbitrary shape and detecting outliers, as they can detect and separate points that are located in sparse regions of the data space, as well as points that belong to dense regions.
• Grid models: In the grid-based method of clustering, the objects of the pool are quantized into a limited number of units to form a grid structure, and all the clustering operations are performed on the grid. Grid-based algorithms are fast in execution time, and they do not depend on the number of sample points. These salient features make grid-based algorithms most appropriate for detecting arbitrarily shaped clusters in large-sized datasets. The main advantage of the grid-based method is its fast processing time, which depends on the number of cells in each dimension of the quantized space. The drawback of this speed is a loss of accuracy in the result. Common grid-based algorithms are STING, MAFIA, CLIQUE, O-CLUSTER, and WAVE-CLUSTER.
2 Popular Clustering Algorithms
Over the years, a suite of classical clustering algorithms, including k-means, PAM, CLARA, BIRCH, DBSCAN, CHAMELEON, DENCLUE, STING, CLIQUE, and spectral techniques, has emerged as stalwarts in the field. These algorithms have greatly enriched the
understanding of data grouping and pattern recognition. However, amidst these established methods, the landscape saw a significant evolution with the advent of the Density Peak Clustering (DPC) algorithm in 2014. DPC stands as a more recent addition, introducing a fresh perspective on density-based clustering approaches. Its inception marked a paradigm shift in the field, offering a novel and effective approach to handling intricate clustering scenarios. While these algorithms have collectively stood the test of time, the pursuit of excellence has led to continuous refinements and enhancements. Yet, it is noteworthy that the foundational versions of these algorithms have retained their relevance, and the most popular ones are briefly presented in the following sections. The insights and principles they introduced have laid the groundwork for subsequent iterations, making them the cornerstone upon which newer methods are built.
2.1 K-Means
K-means (Macqueen, 1967) is one of the most popular and simplest clustering methods, and it is part of the mechanisms of most deep clustering algorithms. Given a set of observations {x_1, x_2, ..., x_N}, k-means partitions the observations into K sets so as to minimize the within-cluster sum of squares. It is based on the minimization of the following cost J:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2    (4.1)
where

r_{nk} = \begin{cases} 1 & \text{if data point } n \text{ is assigned to cluster } k \\ 0 & \text{otherwise} \end{cases}
r_{nk} \in R^{N \times K} and \mu_k \in R^{K} are the parameters of the algorithm. The minimization is processed via two separate steps, one for the r_{nk} and the other for the \mu_k, and the algorithm reduces to this alternation. In the classical version of k-means, the number of clusters K must be known in advance. The user randomly selects K objects, and those objects initially represent the cluster centers for the remaining objects in the data set (Fig. 4.1). The k-means algorithm assigns each object to the nearest cluster according to its distance from the center of each cluster, updating the r_{nk} parameters. It then recalculates the average value \mu_k of the objects, represented by the gravity center G_k, and the center is moved to this mean, updating the \mu_k parameters. Over the iterations, the error of the algorithm is reduced and the value of the objective function J gradually decreases until convergence is achieved. This process is repeated again and again until the final result is reached, precisely until the cluster assignments stop changing.

Fig. 4.1 k-means process. The gravity centers are first initialized (t_0), and they move to the updated mean until convergence

The convergence of the algorithm is easily proved by deriving J with respect to each cluster k:

\frac{\partial J}{\partial \mu_k} = \frac{\partial \sum_{n=1}^{N} r_{nk} \| x_n - \mu_k \|^2}{\partial \mu_k} = 0 \;\; \Rightarrow \;\; \mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}}    (4.2)
• k-means clustering assumes that the data points are distributed in a spherical shape, which may not always be the case in real-world data sets (data of complex shapes, clusters with varying sizes and densities). This can lead to suboptimal cluster assignments and poor performance on non-spherical data.
• k-means clustering requires the user to specify the number of clusters in advance, which can be difficult to do accurately in many cases. If the number of clusters is not specified correctly, the algorithm may not be able to identify the underlying structure of the data.
• k-means clustering is sensitive to the presence of outliers and noise in the data, which can cause the clusters to be distorted or split into multiple clusters.
• k-means clustering is not well-suited for data sets with uneven cluster sizes or non-linearly separable data, as it may be unable to identify the underlying structure of the data in these cases.
A myriad of implementations can be found on the internet, for instance at https://realpython.com/k-means-clustering-python/.
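As an illustration of the two alternating steps described above, here is a minimal NumPy sketch of k-means; the random initialization and the simple stopping test are simplifications.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initial centers
    for _ in range(n_iter):
        # Assignment step: r_nk = 1 for the nearest center (Eq. 4.1)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its points (Eq. 4.2)
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu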
2.2 GMM: Gaussian Mixture Models Clustering Algorithm
The Gaussian Mixture Models (GMM) clustering algorithm is a probabilistic model used for clustering data points into groups or clusters. It assumes that the data points in each cluster are generated from a Gaussian distribution (Fig. 4.2) and that the data points within a cluster are independent of each other.

P(X_i = x) = \sum_{k=1}^{K} \pi_k N(x, \mu_k, \sigma_k)    (4.3)
where \pi_1, ..., \pi_K are the mixture coefficients. The joint probability of X_1, X_2, ..., X_n is as follows:

P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k N(x_i, \mu_k, \sigma_k)    (4.4)
Fig. 4.2 GMM illustration. Each data point can be expressed by a mixture of three Gaussian distributions, each of them representing a cluster (from https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95)
The GMM algorithm aims to obtain the maximum likelihood estimates given x_1, x_2, ..., x_n:

L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k N(x_i, \mu_k, \sigma_k)    (4.5)
where \theta = (\mu_1, ..., \mu_K; \sigma_1, ..., \sigma_K; \pi_1, ..., \pi_K). As for k-means, the optimization is divided into two separate steps (E-step and M-step), one devoted to estimating the probabilities and the other to adjusting the Gaussian parameters. The GMM clustering algorithm is as follows:
• Initialization: Start by randomly initializing the parameters of the GMM, including the means, covariances, and mixture coefficients for each cluster.
• Expectation-Maximization (EM) Algorithm: The algorithm follows an iterative Expectation-Maximization (EM) process to estimate the parameters that best fit the data. This algorithm consists of two steps:
– Expectation Step (E-step): Calculate the responsibility of each data point for each cluster. The responsibility represents the probability of a data point belonging to a particular cluster, based on the current estimates of the model parameters.
– Maximization Step (M-step): Update the model parameters (means, covariances, and mixture coefficients) by maximizing the log-likelihood of the data. This involves using the responsibilities calculated in the E-step to re-estimate the parameters for each cluster.
• Repeat the E-step and M-step until the convergence criteria are met. Typically, convergence is achieved when the change in log-likelihood or in the model parameters falls below a certain threshold. After convergence, assign each data point to the cluster with the highest responsibility value. This step assigns cluster labels to the data points based on the estimated parameters of the GMM.
The GMM clustering algorithm is flexible and can capture complex cluster shapes by modeling each cluster as a Gaussian distribution. It can be applied to various types of data and has been widely used in fields such as pattern recognition, image segmentation, and data analysis. A code can be found at https://bit.ly/2MpiZp4.
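A minimal usage sketch with scikit-learn's GaussianMixture, which runs the EM iterations described above, is shown below; the placeholder data and the chosen number of components are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 2)                       # placeholder data
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)                      # EM runs until convergence
resp = gmm.predict_proba(X)                      # responsibilities from the E-step
print(gmm.means_, gmm.weights_)                  # estimated mu_k and pi_k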
2.3 Mean Shift Clustering Algorithm
The Mean Shift algorithm (Cheng, 1995) is a non-parametric clustering algorithm that is used to identify the modes of a density function. It is a simple and effective algorithm that can be used for both clustering and density estimation. The idea behind the Mean Shift algorithm is to iteratively shift each data point toward the direction of the highest increase in density until convergence is reached. This is done by computing the mean of the data points within a certain distance from the current point and shifting the point to this mean. The distance metric used can be any metric, such as the Euclidean or Manhattan distance. In mean-shift clustering, the inputs to the algorithm are the data points (multivariate, continuous feature vectors) and the bandwidth or scale, which indirectly controls the number of clusters. The algorithm works as follows:
1. Initialization: Choose an initial point as the starting point for the algorithm.
2. Neighborhood determination: Determine the neighborhood of the current point by defining a spherical or hyper-spherical window around it with a certain radius.
3. Mean shift vector computation: Compute the mean shift vector by taking the mean of all the data points within the window.
4. Point update: Update the current point by shifting it in the direction of the mean shift vector.
5. Convergence check: Check whether the point has converged by comparing it to the previous point. If it has not converged, go back to step 2 and repeat the process.
The algorithm is illustrated in Fig. 4.3: on the left, the dataset with a circle whose radius equals the bandwidth used (σ); on the right, the paths followed by Gaussian MS for various starting points, together with a contour plot of the Gaussian kernel density estimate (KDE) p(x) with bandwidth σ. The KDE has two modes, located at the centers of the blue ellipses. The resulting clustering can be deduced from the two paths.
Fig. 4.3 Mean-Shift illustration, from Carreira-Perpinán (2015)
After all the points have converged, assign each point to its corresponding mode. Points that converge to the same mode are considered to belong to the same cluster. The Mean Shift algorithm has several advantages. It does not require prior knowledge of the number of clusters, and it can identify clusters of arbitrary shapes and sizes. It also works well for datasets with high-dimensional data, and it can handle noise and outliers. However, the Mean Shift algorithm can be computationally expensive, especially for large datasets, since the neighborhood determination and mean shift vector computation must be performed for each data point. It can also be sensitive to the choice of window size, which can affect the clustering results. A pedagogic introduction including Python code can be found at https://towardsdatascience.com/understanding-mean-shift-clustering-and-implementation-with-python-6d5809a2ac40. A demonstration online with image data can be found in Demirović (2019), as well as the corresponding code at https://doi.org/10.5201/ipol.2019.255. Code sources in C++ and Python, respectively, can be found at https://github.com/sinecode/MeanShift and https://github.com/zziz/mean-shift.
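A short scikit-learn sketch of the procedure is given below; the bandwidth estimation via estimate_bandwidth and the quantile value are illustrative choices.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.rand(300, 2)                        # placeholder data
bw = estimate_bandwidth(X, quantile=0.2)          # bandwidth indirectly sets the cluster count
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)                        # converged modes of the density
print(np.unique(ms.labels_).size, "clusters")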
2.4 Hierarchical Clustering
Hierarchical clustering (Murtagh & Contreras, 2017) groups objects into tree-like structures using bottom-up or top-down approaches. The basic idea of hierarchical clustering algorithms is to construct the hierarchical relationship among data according to a dissimilarity/similarity metric between clusters. The hierarchical clustering method is divided into two types: divisive and agglomerative. The divisive hierarchical clustering method first sets all data points into one initial cluster, then divides the initial cluster into several sub-clusters, and iteratively partitions these sub-clusters into smaller ones until each cluster contains only one data point or the data points within each cluster are similar enough (Fig. 4.4). Contrary to divisive clustering, agglomerative hierarchical clustering begins with each cluster containing only one data point, and then iteratively merges them into larger clusters until all data points are in one cluster or some conditions are satisfied. In the hierarchical clustering algorithm, the iteration repeatedly splits or aggregates and moves through the hierarchical structure; common algorithms are BIRCH, CURE, CHAMELEON, Furthest neighbor, and Nearest neighbor. Both divisive and agglomerative hierarchical clustering generate a dendrogram of the relationships between data points and terminate quickly; other advantages are as follows:
• No need to specify the number of clusters in advance.
• The complete hierarchy of clusters can be obtained.
• Clustering results can be visualized.
• Easy to handle any form of similarity or distance.
Fig. 4.4 Illustration of hierarchical process
• Suitable for clusters of any data type and arbitrary shape.
• A flat partition can be obtained at different granularities by cutting at different levels of the dendrogram.
However, there are several disadvantages of hierarchical clustering:
• Once a merging or division is done at one level of the hierarchy, it cannot be undone later.
• It is computationally expensive in time and memory, especially for large-scale problems. Generally, the time complexity of hierarchical clustering is quadratic in the number of clustered data points.
• Termination criteria are ambiguous.
During splitting in the divisive hierarchical clustering methods, or merging in the agglomerative hierarchical clustering methods, the dissimilarity or similarity metric directly affects the resulting clusters and is usually determined according to the application and the feature attributes of the data to be processed. Dissimilarities/similarities based on distance measures are most commonly used, such as the Euclidean distance, Minkowski distance, Cosine distance, City-block distance, Mahalanobis distance, and so on. To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of linkage metric used has a considerable impact on hierarchical algorithms since it reflects the particular concept of closeness and connectivity. Although both types of methods produce the dendrogram of the data as output, the clustering results may be very different depending on the dissimilarity or similarity measure used, and different types of methods should be selected according to the type of data and the application scenario. The common linkage criteria are as follows:
• Single link: smallest distance between an element in one cluster and an element in the other, i.e.:
dist(K_i, K_j) = \min(t_{ip}, t_{jq})    (4.6)

• Complete link: largest distance between an element in one cluster and an element in the other, i.e.:

dist(K_i, K_j) = \max(t_{ip}, t_{jq})    (4.7)

• Average: average distance between an element in one cluster and an element in the other, i.e.:

dist(K_i, K_j) = avg(t_{ip}, t_{jq})    (4.8)

• Centroid: distance between the centroids of two clusters, i.e.:

dist(K_i, K_j) = dist(C_i, C_j)    (4.9)

• Medoid: distance between the medoids of two clusters, i.e.:

dist(K_i, K_j) = dist(M_i, M_j)    (4.10)
With hierarchical clustering algorithms, clusters of high quality are generally produced, but they lose out to other methods in terms of performance and scalability. They are generally computationally expensive in time and memory, especially for large-scale problems; termination criteria are ambiguous, as is the selection of appropriate metrics. A source code can be found at https://github.com/hhundiwala/hierarchical-clustering.
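A compact SciPy sketch of agglomerative clustering is shown below; the average linkage, the Euclidean metric, and the cut into three clusters are example choices.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(50, 4)                               # placeholder data
Z = linkage(X, method='average', metric='euclidean')    # agglomerative merge history
labels = fcluster(Z, t=3, criterion='maxclust')         # flat partition with 3 clusters
# dendrogram(Z) would draw the full merge tree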
2.5 OPTICS: Ordering Points to Identify the Clustering Structure
The idea behind OPTICS (Ankerst et al., 1999) is to compute a reachability distance between data points, which is a measure of how easily one point can be reached from another. The reachability distance is defined as the maximum distance between two points that does not violate a specified density threshold. Points that are close to each other and have a high density will have a low reachability distance, while points that are far apart or have a low density will have a high reachability distance. Let X be the set of data points to be clustered. The neighborhood of a point x \in X within a given radius \epsilon (known as the generating distance) is called the \epsilon-neighborhood of x, denoted by N_\epsilon(x). More formally:
Fig. 4.5 OPTICS. Core distance of x and reachability distances of y and z with respect to x (MinPts = 3)
N_\epsilon(x) = \{ y \in X \mid d(x, y) \le \epsilon, \; y \ne x \}    (4.11)
where d(x, y) is the distance function. A point x \in X is referred to as a core point if its \epsilon-neighborhood contains at least a minimum number of points (minpts), i.e., |N_\epsilon(x)| \ge minpts. A point y \in X is directly density-reachable from x \in X if y is within the \epsilon-neighborhood of x and x is a core point. A point y \in X is density-reachable from x \in X if there is a chain of points x_1, x_2, ..., x_n, with x_1 = x and x_n = y, such that x_{i+1} is directly density-reachable from x_i for all 1 \le i \le n, x_i \in X. Figure 4.5 shows an example explaining the core distance of a point x and the reachability distances of y and z with respect to x. The OPTICS algorithm starts by selecting an arbitrary data point and computing its reachability distance to all other points in the dataset. It then selects the point with the lowest reachability distance as the next core point and repeats the process until all core points have been identified. The core points are the points that have a sufficient number of neighboring points within a specified radius. The algorithm then constructs a hierarchical ordering of the core points based on their reachability distances. This ordering is called the reachability plot, and it can be used to identify clusters of varying densities. Clusters appear as regions of low reachability distances in the plot, while noise points appear as regions of high reachability distances. The advantage of the OPTICS algorithm over other clustering algorithms is that it can identify clusters of varying densities and shapes, without requiring prior knowledge of the number of clusters. Additionally, it can handle datasets with high-dimensional and noisy data. A code source can be found at https://github.com/ManWithABike/OPTICS-Clustering.
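A minimal scikit-learn sketch follows; the parameter values are illustrative, and the reachability plot is simply printed rather than drawn.

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(400, 2)                        # placeholder data
optics = OPTICS(min_samples=5).fit(X)
# Reachability plot: valleys correspond to clusters, peaks separate them
print(optics.reachability_[optics.ordering_][:10])
print(optics.labels_[:10])                        # -1 marks noise points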
2.6 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Density-based clusters can have an arbitrary shape in the feature space. DBSCAN, the mean-shift method, and the density peak clustering algorithm (DPC) belong to this family of methods. DBSCAN (Ester et al., 1996) serves as a foundational algorithm for density-based clustering. This versatile method aims at identifying clusters of varying shapes and sizes within extensive datasets, even when they are tainted by noise and outliers. Density-based clustering is often represented by DBSCAN (Fig. 4.6).
• DBSCAN does not require the user to specify the number of clusters in advance, which makes it well-suited for data sets where the number of clusters is not known. In contrast, k-means clustering requires the number of clusters to be specified in advance, which might be difficult to accomplish accurately in many circumstances.
• DBSCAN can handle data sets with varying densities and cluster sizes, as it groups data points into clusters based on density rather than using a fixed number of clusters. In contrast, k-means clustering assumes that the data points are distributed in a spherical shape, which may not always be the case in real-world data sets.
• DBSCAN can identify clusters with arbitrary shapes, as it does not impose any constraints on the shape of the clusters. In contrast, k-means clustering assumes that the data points are distributed in spherical clusters, which can limit its ability to identify clusters with complex shapes.
• DBSCAN is robust to the presence of noise and outliers in the data, as it can identify clusters even if they are surrounded by points that are not part of the cluster. k-means clustering, on the other hand, is sensitive to noise and outliers, which can lead clusters to be distorted or split into many clusters.

Fig. 4.6 Density = number of points within a specified radius r (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is not a core point or a border point (from https://www.slideshare.net/MahbuburShimul/DBSCAN-algorithom)

Overall, DBSCAN is useful when the data has a lot of noise or when the number of clusters is unknown in advance. Unlike other clustering algorithms, which require the number of clusters to be specified, DBSCAN can automatically identify the number of clusters in a dataset. This makes it a good choice for data that does not have well-defined clusters or when the structure of the data is unknown. DBSCAN is also less sensitive to the shape of the clusters than other algorithms, so it can identify clusters that are not circular or spherical. A demonstration of the algorithm can be found at https://www.naftaliharris.com/blog/visualizing-DBSCAN-clustering/.
Hierarchical DBSCAN (McInnes et al., 2017) is one of the improved versions of DBSCAN. It includes a hierarchical component to merge too-small clusters by replacing the epsilon hyperparameter of DBSCAN with a more intuitive one called min_cluster_size. Depending on the choice of min_cluster_size, the size of the smallest cluster will change. Choosing a small value for this parameter will lead to a large number of smaller clusters, whereas choosing a large value will lead to a small number of big clusters. A code source of the algorithm can be found at https://github.com/choffstein/DBSCAN.
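A short scikit-learn sketch is given below; eps and min_samples are illustrative values that would need tuning for real data.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(400, 2)                         # placeholder data
db = DBSCAN(eps=0.05, min_samples=5).fit(X)        # eps = radius, min_samples = MinPts
labels = db.labels_                                # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters found")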
2.7 DPC: Density Peak Clustering
The DPC algorithm (Rodriguez & Laio, 2014) was designed to identify arbitrary-shaped clusters by finding density peaks in the underlying dataset. DPC is able to detect non-spherical clusters and does not require one to specify the number of clusters. DPC is a kind of density-based clustering algorithm based on the idea that cluster centers are characterized by a higher density than their neighbors (cluster centers are local density peaks) and by a relatively large distance from points with higher densities (Fig. 4.7). One data point is in the same cluster as its nearest higher-density neighbor. The algorithm works by calculating two properties for each data point: its local density \rho (roughly the number of data points within a specified distance) and its distance \delta to the nearest data point with higher density (density reachability distance).

Fig. 4.7 DPC illustration. Each center is associated with its nearest higher-density neighbor, giving \delta

The original DPC algorithm uses a cutoff distance d_c for density estimation. In Eq. 4.12, the cutoff kernel counts the data points in a d_c-radius neighborhood to measure local density. For each point x_i \in X and a given distance function d(.), \rho_i is estimated as follows:

\rho_i = \sum_{j \ne i} \chi\big( d_c - d(x_i, x_j) \big)    (4.12)
where d_c is a cutoff distance (to be estimated) and \chi is a function defined as follows:

\chi(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}    (4.13)
A Gaussian kernel can also be used to estimate the local density from all the data points, whose weights are determined based on the distance and the parameter d_c. \delta_i is defined as the shortest distance to any other data point that has a higher density value than x_i. If x_i has the highest density value, \delta_i is assigned the longest distance to any other data point.
Fig. 4.8 Toy data set from the authors Rodriguez and Laio (2014)
\delta_i = \begin{cases} \min_{j: \rho_j > \rho_i} d(x_i, x_j) & \text{if } \exists\, j \mid \rho_j > \rho_i \\ \max_{j} d(x_i, x_j) & \text{otherwise} \end{cases}    (4.14)
Local density peaks are characterized by both a large \rho and a large \delta. In contrast, non-center data points generally have a small \rho or a small \delta. DPC first selects cluster centers as those data points with both a large local density \rho and a large distance \delta, the latter denoting the distance between one data point and its nearest higher-density neighbor (Figs. 4.8 and 4.9). After that, the non-center data are grouped into their respective clusters using the second assumption. The data points are sorted in decreasing order of local density, and then each data point is assigned to the same cluster as its nearest higher-density neighbor. Although the decision graph is indeed helpful in isolating cluster centers from non-center data, accurate detection of cluster centers is still a challenging task. One major reason is that there is no distinct boundary to differentiate between large and small values of \rho and \delta. Several rules can be adopted, such as a \rho\delta criterion, to detect cluster centers. Unfortunately, the difficulty of differentiation still exists. The method is rather robust with respect to the choice of d_c as the only parameter. The combination of these properties allows the algorithm to identify cluster centers (density peaks) and assign the other data points to their respective clusters. To summarize, DPC classifies data via two steps: first, assuming that cluster centers have a higher local density and are simultaneously relatively far away from each other, it generates the decision graph to choose cluster centers based on this assumption; second, it assigns non-center points to the same cluster as their nearest neighbor with higher density. Based on these steps, DPC can not only select cluster centers from the decision graph efficiently but also reach promising performance on clusters with arbitrary shapes. DPC in its pioneer version has two main deficiencies:
Fig. 4.9 DPC decision graph. All the centers are positioned via their \rho and \delta values, allowing the selection of the major ones, from the authors Rodriguez and Laio (2014)
• The first one is that, although the decision graph provided by DPC can help users choose cluster centers manually, it is still hard to distinguish the true cluster centers from all the points standing out in the decision graph, especially when handling clusters with nonuniform densities and scales.
• The second one is that DPC assigns non-center points according to their nearest neighbor with higher density, which leads to a "chain reaction": one wrong assignment of the highest-density point can affect the whole region of points around it.
A comprehensive review of DPC can be found in Wei et al. (2023). It includes the latest research progress of DPC, classifies and analyzes its improved algorithms, and tracks its practical applications. A code source of the algorithm can be found at https://github.com/topics/density-peak-clustering.
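The two quantities ρ and δ and the assignment rule translate directly into code. The sketch below is a minimal NumPy version using the cutoff kernel of Eq. 4.12 and a simple ρ·δ rule for picking a fixed number of centers; both the center-selection rule and the parameter values are assumptions rather than part of the original algorithm specification.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks(X, dc, n_centers):
    D = squareform(pdist(X))
    rho = (D < dc).sum(axis=1) - 1                      # local density, excluding the point itself
    n = len(X)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)                            # decreasing density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = D[i].max()                       # highest-density point (Eq. 4.14)
        else:
            higher = order[:rank]                       # points ranked as denser
            j = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, j]
            nearest_higher[i] = j
    centers = np.argsort(-(rho * delta))[:n_centers]    # large rho and large delta
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                     # assign in decreasing-density order
        if labels[i] == -1:
            j = nearest_higher[i]
            labels[i] = labels[j] if j >= 0 else labels[centers[0]]
    return labels, rho, delta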
2.8 STING: Statistical Information Grid in Data Mining
The idea behind the STING algorithm is to partition the data into a grid of rectangular cells and then cluster the cells based on their statistical properties. It is a grid-based multiresolution clustering method in which the spatial area is divided into rectangular cells. There are generally several levels of such rectangular cells corresponding to multiple levels of resolution, and these cells create a hierarchical mechanism in which each cell at a higher level is split to generate several cells at the next lower level. A parent cell at a higher level is thus obtained from several lower-level sub-cells, and each lower-level cell in turn contains the sub-cells of the level below it, as shown in Fig. 4.10. Statistical data regarding the attributes in each grid cell (including the mean, maximum, and minimum values) are precomputed and stored. Statistical parameters of higher-level cells can simply be calculated from the parameters of the lower-level cells. These parameters contain the following: the attribute-independent parameter, count; the attribute-dependent parameters mean, stdev (standard deviation), min (minimum), and max (maximum); and the type of distribution that the attribute value in the cell follows, including normal, uniform, exponential, or none in the case where the distribution is unknown.

Fig. 4.10 STING. Hierarchical distribution structure graph

The algorithm works as follows:
• Data partitioning: The algorithm first partitions the data into a grid of rectangular cells. The size of the cells is determined by a user-specified parameter that controls the granularity of the clustering.
• Statistical analysis: For each cell in the grid, the algorithm computes a set of statistical properties, such as the mean, standard deviation, and correlation coefficients between the features.
• Hierarchical clustering: The algorithm then clusters the cells based on their statistical properties using a hierarchical clustering algorithm. The resulting dendrogram represents a hierarchy of clusters at different levels of granularity.
• Grid refinement: The algorithm then refines the grid by merging adjacent cells that belong to the same cluster. The user can specify a threshold for the minimum number of cells in a cluster to control the granularity of the clustering.
Grid-based computing is inherently query-independent because it relies on the statistical information stored within each individual grid cell, which provides data summaries that remain unaltered by specific queries. The grid's structural design
not only supports efficient parallel processing but also accommodates seamless incremental updates. The advantage of the STING algorithm is that it can handle large datasets with mixed data types, including categorical and continuous variables. It can also identify clusters of arbitrary shapes and sizes, and the hierarchical clustering approach provides a natural way to explore the clustering structure at different levels of granularity. However, a drawback of the STING algorithm is that the grid-based partitioning can lead to artificial boundaries between clusters, especially if the size of the cells is not well tuned to the data. STING can only identify cluster boundaries that are either horizontal or vertical, thereby omitting any detection of diagonal boundaries. Also, the algorithm assumes that the statistical properties of the data are sufficient to capture the underlying clustering structure, which may not always be the case.
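The precomputed per-cell statistics at the heart of STING can be illustrated with a small NumPy sketch for 2-D data; the grid size, the 2x2 parent aggregation, and the restriction to count and mean are simplifications of the full method, not its actual specification.

import numpy as np

def grid_statistics(X, values, bins=8):
    # X: (n_samples, 2) spatial coordinates; values: attribute to summarize
    x_edges = np.linspace(X[:, 0].min(), X[:, 0].max(), bins + 1)
    y_edges = np.linspace(X[:, 1].min(), X[:, 1].max(), bins + 1)
    count, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[x_edges, y_edges])
    total, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[x_edges, y_edges],
                                 weights=values)
    mean = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    # Each parent cell at the next (coarser) level aggregates a 2x2 block of children
    parent_count = count.reshape(bins // 2, 2, bins // 2, 2).sum(axis=(1, 3))
    return count, mean, parent_count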
2.9 Spectral Clustering Spectral clustering (Von Luxburg, 2007) is a popular technique from graph theory, often used to identify communities of nodes in a graph based on the edges connecting them. Without making any assumption on the distribution of the clusters, spectral clustering (Dhillon et al., 2004) makes use of the spectrum of the similarity matrix of the data to perform dimensionality reduction before applying k-means. Rather than attempting to cluster points in their native domain, a similarity graph is generated, from which a suitable matrix is analyzed through its eigenvalues (spectrum). This matrix is built from the pairwise similarities of the objects to be grouped, and choosing the similarity measure is one of the key tasks in spectral clustering. A Gaussian kernel function is often used to define an affinity matrix, or simply an adjacency matrix (i.e., A_{i,j} = δ_{i,j}) is considered. In any case, A_{ij} → 1 when the points are close in R^n and A_{ij} → 0 when they are far apart. Let us consider a graph G = (V, E), where V = {v_1, v_2, ..., v_n} is the set of nodes. An edge e_{ij} connects nodes v_i and v_j if they are adjacent or neighbors (v_i ∼ v_j), and E = {e_1, e_2, ..., e_m} is the set of edges. The simple Laplacian matrix (several variations exist) is defined as the difference of two matrices:

L = D − A    (4.15)
where D ∈ Z^{n×n} is the diagonal matrix measuring the degree of each node, i.e., how many edges connect to it. Spectral approaches take as input a matrix of pre-calculated similarities and aim to minimize a cutting criterion. Constraints can be integrated by guiding the learned space projection with them, by modifying either the affinity matrix derived from the similarity matrix or the way in which the eigenvectors are selected. Most spectral clustering algorithms need to compute the full graph Laplacian matrix and, therefore, have quadratic or super-quadratic complexity in the number of data points. A key issue in spectral clustering is to solve the
multi-class clustering problem. This is accomplished by representing the graph Laplacian in terms of k eigenvectors, k being the number of classes. Then, either k-means clustering, exhaustive search, or discretization is applied to this lower-dimensional representation of the Laplacian to determine the final cluster memberships. Recently, autoencoders have been applied to the Laplacian to obtain the spectral embedding provided by the eigenvectors. Another approach has been to use a deep learning network that directly maps the input data into the lower-dimensional eigenvector representation, which is then followed by a simple clustering algorithm. A pedagogic example including Python code can be found at12 and a source code of the algorithm at13 . It is worth mentioning that clustering algorithms specialized in dealing with complex networks are referred to as community detection. Community detection is very useful for understanding and evaluating the structure of large and complex networks. Many different algorithms (Su et al., 2022) have been proposed and implemented over the years. One can argue that community detection is similar to clustering, as they share common objectives. Even though clustering can be applied to networks, it is a broader field of unsupervised machine learning that deals with multiple attribute types. There are important differences between the approaches, and one cannot trivially apply community discovery to solve clustering problems, or vice versa. Community discovery assumes sparse connections, while clustering can work with dense datasets; clustering usually handles attributes of multiple types, while community discovery considers only a single type of attribute, the edges. Community discovery exploits the properties of edges in graphs or networks and is hence more suited to network analysis than to general clustering.
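As a compact illustration of the spectral pipeline described above (a hedged sketch built from Eq. (4.15) with numpy, scipy, and scikit-learn; it is not the code referenced in the footnotes), the snippet below builds a Gaussian affinity matrix, forms the Laplacian L = D − A, and applies k-means to the k smallest eigenvectors. The bandwidth sigma and the toy data are illustrative choices.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: affinity -> Laplacian -> eigenvectors -> k-means."""
    A = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))  # Gaussian affinity
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=1))                                  # degree matrix
    L = D - A                                                   # Eq. (4.15)
    _, vecs = eigh(L)                                           # eigen-decomposition (ascending)
    embedding = vecs[:, :k]                                     # k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(np.bincount(spectral_clustering(X, k=2)))                 # two balanced clusters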
3 Recent Clustering Algorithms For years, researchers have kept building on the pioneering techniques to solve the issues raised by complex data organizations, and the advances are very promising (Ezugwu et al., 2022), even though the major popular algorithms, being more established and often simpler, are still widely used. These novel algorithms, such as Munec (Ros and Guillaume, 2019), KdMutual (Ros et al., 2020), S-DBSCAN (Ros et al., 2022b), and Path-scan (Ros et al., 2022a), use heuristics that combine notions of distance and density while limiting the role of the investigator. In the past, a myriad of improved versions of k-means were proposed. In recent years, research efforts have focused more on the DBSCAN and DPC algorithms. DBSCAN is very popular and widely used. It does, however, have certain issues with its time complexity and with the relevance of the clusters it can deliver when the database is large and the densities vary.
12 https://towardsdatascience.com/spectral-clustering-aba2640c0d5b. 13 https://github.com/lucianolorenti/SpectralClustering.jl.
Different DBSCAN enhancements (Khan et al., 2014; Singh et al., 2022) have been proposed to improve the conventional DBSCAN algorithm. The DPC algorithm makes few assumptions on the data distribution, and its identification and clustering of non-spherical clusters is remarkable. However, DPC still has some disadvantages, and a large human factor remains in the allocation of points and in the density measurement, so many researchers have improved these steps. Many improvements have been proposed, such as the density peaks clustering algorithm based on a grid (DPCG) (Fang et al., 2023), the adaptive density peaks clustering method with Fisher linear discriminant (ADPC-FLD) (Sun et al., 2019), and DPC based on the layered k-nearest neighbors and subcluster merging (LKSM-DPC) (Ren et al., 2020). Most of the above algorithms suffer from the curse of dimensionality and are not directly scalable. Block-DBSCAN (Chen et al., 2021) deals with the scalability of DBSCAN. For all of them, their effectiveness heavily depends on the dimensionality of the training data as well as on the defined distance functions. Therefore, the usefulness of most existing clustering approaches is limited when handling high-dimensional data. Clustering with deep learning has drawn much attention as a possible response to this issue because its highly nonlinear architecture can help to learn powerful feature representations. The K-nearest neighbor (KNN) algorithm is a simple yet highly efficient classification method; it was initially introduced into DPC to calculate the local density of data points (KNN-DPC (Boyang & Zhiming, 2018)), which reduces the computational complexity when dealing with high-dimensional datasets, thus enhancing the ability to handle high-dimensional data and achieve superior clustering results. Based on the same idea, a DPC algorithm improved by fuzzy KNN (FKNN-DPC) was proposed (Xie et al., 2016a), which applies fuzzy weighted K-nearest neighbors to calculate the local density. This method greatly reduces the parameter sensitivity of DPC. Huang et al. proposed the QCC method (Huang et al., 2017), which leverages the K-nearest neighborhood or reverse K-nearest neighborhood to determine cluster centers and introduces a novel concept of similarity between clusters to address complex manifold problems. In SNN-DPC (Liu et al., 2018), the local density of a data point considers both its own nearest neighbors and the neighbors shared with other data points. However, a common drawback of the above methods is that they do not provide a unified strategy and thus may perform poorly on challenging data sets (nonuniform densities and non-spherical shapes). Beyond density measures, choosing cluster centers automatically for DPC is another promising direction. Density Peak Clustering with Connectivity Estimation (DPCCE) (Guo et al., 2022) was proposed for this purpose. In this algorithm, local centers are selected based on their higher relative distances for subsequent computations. Subsequently, a graph-based strategy is introduced to estimate connectivity information among these local centers. This estimated information is then leveraged to apply a distance penalty that takes into account both Euclidean distance and connectivity information, facilitating a reevaluation of the similarity between the local centers.
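To illustrate the KNN-based density idea exploited by these DPC variants (a schematic sketch only, not the exact definitions used in KNN-DPC or FKNN-DPC), the snippet below estimates a local density for each point from its k nearest neighbors with scikit-learn; the value of k and the toy data are arbitrary.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_local_density(X, k=10):
    """Local density from the k nearest neighbors: denser points have closer neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    return np.exp(-dist[:, 1:].mean(axis=1))          # higher value = denser neighborhood

X = np.vstack([np.random.randn(200, 2), np.random.randn(20, 2) * 3 + 10])
rho = knn_local_density(X)
print("densest point index:", int(np.argmax(rho)))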
4 Evaluation Metrics Used in Clustering There are two main approaches to investigating cluster validity: external and internal evaluation. • External Cluster Validation: comparing the results of a clustering method with the known labels. • Internal Cluster Validation: investigating the structure of clustering results without information outside of the dataset, i.e., without the known labels.
4.1 External Evaluation Most of the papers dealing with deep clustering use external indices to validate their proposals on large labeled databases. This practice is debatable, since it relies on the logic behind the labels, but it is the only practical means of evaluation today. When the class labels of each data set are known, external indices (Rendón et al., 2011) are most often used. They aim to measure the consistency between the clusters found by an algorithm and the ground truth. The most popular ones are normalized mutual information (NMI) (Estévez et al., 2009), entropy (E), F-measure (F) (Sasaki et al., 2007), the adjusted Rand index (ARI) (Santos and Embrechts, 2009) and its derivatives, and clustering accuracy (ACC) (Kuhn, 1955), which is often used in deep clustering.
ACC Clustering accuracy (ACC) can be expressed as follows:

ACC = max_m (1/n) Σ_{i=1}^{n} 1[y_i = m(c_i)]    (4.16)

where y_i is the ground truth, c_i the generated output, and m the mapping function ranging over the set of possible assignments of clusters to classes.
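In practice, the optimal mapping m in Eq. (4.16) can be found with the Hungarian algorithm (Kuhn, 1955). The sketch below, using numpy and scipy, is one possible implementation (an illustrative version, not the exact code used in the cited works).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Compute ACC: best one-to-one match between cluster labels and ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_classes = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: w[p, t] = number of points in cluster p with true label t
    w = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    # The Hungarian algorithm finds the mapping m maximizing the matched count
    row_ind, col_ind = linear_sum_assignment(-w)
    return w[row_ind, col_ind].sum() / y_true.size

# Example: ACC is invariant to a permutation of the cluster labels
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0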
Mutual Information In information theory, the mutual information between X and Y, I(X, Y), measures the "amount of information" learned from knowledge of random variable Y about the other random variable X. The mutual information can be expressed as the difference of two entropy terms:
I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)    (4.17)
I(X, Y) is the reduction of uncertainty in X when Y is observed. If X and Y are independent, then I(X, Y) = 0, because knowing one variable reveals nothing about the other. By contrast, if X and Y are related by a deterministic, invertible function, then maximal mutual information is attained. The NMI (Normalized Mutual Information) used in clustering is as follows:

NMI(Y, C) = 2 I(Y, C) / (H(Y) + H(C))    (4.18)
where Y denotes the ground-truth labels and C the generated cluster assignments.
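As a small illustration of Eqs. (4.17) and (4.18) (a didactic numpy sketch assuming discrete label vectors, not an optimized implementation), the snippet below estimates I(Y, C) and the entropies from empirical frequencies and combines them into the NMI.

import numpy as np

def entropy(labels):
    """Shannon entropy H of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(y, c):
    """NMI(Y, C) = 2 I(Y, C) / (H(Y) + H(C)), cf. Eq. (4.18)."""
    y, c = np.asarray(y), np.asarray(c)
    i_yc = 0.0
    for yv in np.unique(y):
        for cv in np.unique(c):
            p_joint = np.mean((y == yv) & (c == cv))
            if p_joint > 0:
                i_yc += p_joint * np.log(p_joint / (np.mean(y == yv) * np.mean(c == cv)))
    return 2.0 * i_yc / (entropy(y) + entropy(c))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: perfect agreement up to relabeling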
4.2 Internal Evaluation For a discovery process, the original labels of the data are unknown (unknown ground truth), and more generally the ground truth is rarely available. An unsupervised validation must then be done via internal indices such as the Davies-Bouldin index, the silhouette index, and many others (Meilă, 2007; Vinh et al., 2010; Romano et al., 2014, 2016). In the realm of survey papers, internal indices play a pivotal role and are frequently assessed. The R package clValid14 alone offers a vast array of over thirty indices, all published within the last few decades (Brock et al., 2008). Each of these indices is underpinned by two fundamental concepts, which are distinctively formalized in each Cluster Validity Index (CVI):
• Compactness: This notion gauges the intra-cluster distances between data objects within the same cluster, utilizing a similarity measure to do so.
• Separation: It quantifies inter-cluster distances by measuring the distances between the clusters themselves.
Previous investigations have convincingly demonstrated that no single index universally outperforms all others. The performance of these indices is contingent upon the nature of the data organization. For instance, indices like Dunn's index, Davies-Bouldin's index, and others such as the Silhouette index make certain assumptions that are not valid in many real-world scenarios. They often prove too simplistic to handle data with specific structures, such as irregular shapes and dispersed densities. Moreover, they are typically ill-equipped to accurately determine the optimal number of clusters. In essence, they lack the inherent mechanisms to be as effective as the clustering algorithm employed to generate the clustering results. Due to the absence of a standardized protocol for clustering evaluation, the selection of an appropriate quality index can be a perplexing task.
14 https://cran.r-project.org/web/packages/clValid/index.html.
The suitability of a particular index often hinges on the specific characteristics of the dataset and the objectives of the analysis. In addition, it should be noted that all these indices are better suited to low-dimensional spaces, and the field remains challenging. More concerning, they suffer from the curse of dimensionality just like standard clustering algorithms and, therefore, cannot be used directly for deep clustering evaluation.
Silhouette Index The Silhouette index deals with the cohesion and separation of identified clusters. It is probably the most used CVI today despite the myriad of popular indices. Cohesion a(i) measures how closely related the objects in a cluster are. It is represented by the average distance of the element i to all other elements in the same cluster. Separation b(i) measures how distinct or well-separated a cluster is from other clusters. It is represented by the smallest average distance between the element i and all elements in any other cluster. The Silhouette Coefficient (SC) is defined for each data point i as:

SC(i) = (b(i) − a(i)) / max{a(i), b(i)}    (4.19)

The silhouette index S_sil takes the mean value of SC(i) for each data point in each cluster I_k as follows:

S_sil = (1/K) Σ_{k=1}^{K} (1/|I_k|) Σ_{i∈I_k} SC(i)    (4.20)
The silhouette index ranges from −1 to +1:
• A silhouette index close to +1 indicates that the data point is well clustered, meaning it is relatively closer to the members of its own cluster than to members of other clusters.
• A silhouette index close to 0 suggests that the data point is close to the decision boundary between two neighboring clusters, indicating ambiguity in the clustering assignment.
• A silhouette index close to −1 implies that the data point might have been assigned to the wrong cluster, as it is closer to members of a different cluster than to its own.
The silhouette index is sometimes criticized for being computationally more expensive than other indices such as the Davies-Bouldin index.
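For illustration, the sketch below computes the silhouette of a k-means partition with scikit-learn (assumed available); note that scikit-learn averages SC(i) over all points, which differs slightly from the per-cluster averaging of Eq. (4.20). The toy data are generated for the example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data with three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))          # global score
print("first point SC(i):", silhouette_samples(X, labels)[0])   # per-point value, Eq. (4.19)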
PDBI: Partitioning Davies-Bouldin Index PDBI (Ros et al., 2023) is a novel CVI initially inspired by the original idea of the Davies-Bouldin Index (DBI). PDBI is based on a strategy that consists of dividing each cluster into sub-clusters that redefine the concepts of internal homogeneity and cluster separation via the integration of sophisticated mechanisms. This strategy makes it possible to compute a relevant CVI even in the case of complex data structures and in the presence of clusters with noisy patterns. PDBI is deterministic, runs independently of a given clustering algorithm, and generates a normalized score between 0 and 1. The approach involving PDBI hinges on a unique strategy of subdividing each cluster into sub-clusters. This strategic maneuver serves to simplify the shape of each cluster, thereby enhancing the relevance and accuracy of the evaluation process. However, it is essential to acknowledge that this strategy is not without its limitations, particularly due to its localized nature. It may not fully account for the comprehensive characteristics of each cluster and can be susceptible to noise. To address these shortcomings, innovative mechanisms have been introduced. The assessment of each cluster's homogeneity relies on local scores calculated between sub-clusters while also factoring in the sparsity of these sub-clusters. Sub-clusters are categorized as either noise or non-noise clusters. The calculation of separation for each cluster takes into consideration all other clusters, incorporating information about the labels assigned to their respective sub-clusters. The minimal separation value is retained as a significant metric. By computing a separation-homogeneity index for each individual cluster, a comprehensive CVI can be derived through a weighted average aggregation method. This approach facilitates a more nuanced and holistic assessment of the clusters' quality, addressing both their homogeneity and separation characteristics. A PDBI demonstration as well as illustrations can be found at15 .
15 http://r-riad.net/.
5 Peripheral Metrics/Norms 5.1 KL-Divergence KL-divergence is widely used in deep clustering schemes such as the DEC family of algorithms and others (Xie et al., 2016b). The KL-divergence measures the matching between two distributions:

L = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (4.21)
Here q(x) is the approximation and p(x) is the true distribution that we want q(x) to match. Intuitively, this measures how far a given arbitrary distribution is from the true distribution. If the two distributions match perfectly, D_KL(p ‖ q) = 0; otherwise the divergence takes values between 0 and ∞. The lower the KL-divergence value, the better we have matched the true distribution with our approximation. A pedagogic and complete explanation can be found at16 .
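As a small numerical illustration of Eq. (4.21) (a sketch with made-up discrete distributions, not taken from the cited works), the function below computes the KL-divergence with numpy; the epsilon term is only added for numerical stability.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])   # "true" target distribution (toy example)
q = np.array([0.5, 0.3, 0.2])   # approximation produced by a model
print(kl_divergence(p, p))      # ~0.0: identical distributions
print(kl_divergence(p, q))      # > 0: penalty grows as q drifts away from p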
5.2 Binary Cross-Entropy Binary Cross-Entropy, often referred to as Binary Log Loss or Binary Cross-Entropy Loss, is a fundamental loss function in machine learning and deep learning, widely used in deep clustering schemes such as the GAN family. Its primary purpose is to assess the disparity between a predicted binary outcome ŷ and the corresponding actual binary label y. The loss function accomplishes this by quantifying the dissimilarity between probability distributions, thereby playing a pivotal role in guiding model training through the imposition of penalties for inaccurate predictions. The binary cross-entropy loss is

L(ŷ, y) = −[ y log ŷ + (1 − y) log(1 − ŷ) ]    (4.22)
Binary Cross-Entropy can be extended for Multi-Class classification. A detailed explanation of the binary cross-entropy can be found at17 . Binary Cross-Entropy finds extensive utility in various machine learning tasks, particularly in binary classification, where the objective is to categorize data into two distinct classes. In essence, Binary Cross-Entropy meticulously evaluates each predicted probability against the true class output, which can assume values of either 0 or 1. It subsequently computes a score that penalizes these probabilities based on their proximity to the expected values. This quantification effectively measures how closely or remotely the predictions align with the actual values, facilitating the model’s pursuit of more accurate outcomes.
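As an illustration (a minimal numpy sketch with made-up sample values, not the implementation of a specific library), the function below evaluates Eq. (4.22) and averages it over a batch of predictions, as is commonly done in practice.

import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Mean of Eq. (4.22) over a batch: -[y log(y_hat) + (1-y) log(1-y_hat)]."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])      # confident and correct, except the last one
print(binary_cross_entropy(y_pred, y_true))  # small loss dominated by the 0.3 prediction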
5.3 Cosine Similarity The similarity measurement plays a key role in the deep clustering process.
16 https://datumorphism.leima.is/wiki/machine-learning/basics/kl-divergence/.
17 https://www.analyticsvidhya.com/blog/2021/03/binary-cross-entropy-log-loss-for-binary-classification/.
Euclidean distance can be used, but cosine similarity is often preferred to compare pairs of vectors in high-dimensional space, especially with contrastive learning approaches. Suppose z_i and z_j are normalized latent vectors; they can then be seen as two points on a hyper-sphere. The cosine similarity is

Cosine Similarity = (z_i · z_j) / (‖z_i‖ ‖z_j‖)    (4.23)
The cosine is inversely related to the angle: the larger the similarity, the smaller the angle between the two vectors, and the smaller the angle, the nearer the vectors are to each other. More details can be found in the book "Data Mining Concepts and Techniques Third Edition" (Han et al., 2012) and a pedagogic explanation at18 .
5.4 Frobenius Norm The Frobenius norm, also known as the Euclidean norm, is a way to measure the "size" or magnitude of a matrix. It is like taking the square root of the sum of the squares of all the individual elements in the matrix. In essence, it quantifies how "spread out" or "large" the matrix is in a numerical sense. The Frobenius norm is a common mathematical tool used in various areas, including linear algebra, machine learning, and signal processing. The Frobenius norm of A (n × n), denoted by ‖A‖_F, is defined by

‖A‖_F = ( Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}² )^{1/2} = (trace[Aᵀ A])^{1/2}    (4.24)

where trace[Aᵀ A] is the sum of the diagonal elements of Aᵀ A and Aᵀ is the transpose of A. More details can be found in the book Brown et al. (2020) and the algebraic notions simply explained at19 .
6 Subspace and Ensemble Clustering Subspace clustering (SC) (Parsons et al., 2004) is an extension of traditional cluster analysis that is worth mentioning. Subspace clustering translates the data representation from one space to another with higher separability.
18 https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-mathsbehind-and-usage-in-python-50ad30aad7db. 19 https://inst.eecs.berkeley.edu/~ee127/sp21/livebook/l_mats_norms.html.
It relies on the assumption that high-dimensional data points can be well represented as lying in the union of low-dimensional subspaces. SC allows simultaneously grouping features and observations by creating both row and column clusters. The task of subspace clustering (You et al., 2016; Yang et al., 2019; Gao et al., 2022) is then to identify the subspaces and assign the data points to the corresponding subspaces. Iterative methods, statistical methods, algebraic methods, and spectral clustering methods represent the four categories of existing subspace clustering methods (Vidal, 2011). These algorithms, however, often ignore irrelevant features unrelated to the clustering task in high-dimensional data spaces. For these reasons, they are generally not applicable to combinations of arbitrary features. An ensemble, in the AI context, is a technique that tries to improve performance by aggregating the predictions of multiple machine learning models. A clustering ensemble aims to combine multiple clustering models to produce a better result than that of the individual clustering algorithms in terms of consistency and quality.
References Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). Optics: Ordering points to identify the clustering structure. SIGMOD Record, 28(2), 49–60. Bottou, L., & Bengio, Y. (1994). Convergence properties of the k-means algorithms. Advances in Neural Information Processing Systems, 7, 585–592. Boyang, L., & Zhiming, G. (2018). A design method of RBF neural network based on KNN-DPC. In 2018 International Conference on Information Systems and Computer Aided Education (ICISCAE) (pp. 108–111). IEEE. Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). cLValid: An R package for cluster validation. Journal of Statistical Software, 25(4), 1–22. Brown, S., Tauler, R., & Walczak, B. (2020). Comprehensive chemometrics: chemical and biochemical data analysis. Elsevier. Carreira-Perpinán, M. A. (2015). A review of mean-shift algorithms for clustering. Preprint. arXiv:1503.00687. Chen, Y., Zhou, L., Bouguila, N., Wang, C., Chen, Y., & Du, J. (2021). BLOCK-DBSCAN: Fast clustering for large scale data. Pattern Recognition, 109, 107624. Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790–799. Demirovi´c, D. (2019). An implementation of the mean shift algorithm. Image Processing On Line, 9, 251–268. Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 551–556). Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96 (pp. 226–231). AAAI Press. Estévez, P. A., Tesmer, M., Perez, C. A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2), 189–201. Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022). A comprehensive survey of clustering algorithms: State-ofthe-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743.
Fang, X., Xu, Z., Ji, H., Wang, B., & Huang, Z. (2023). A grid-based density peaks clustering algorithm. IEEE Transactions on Industrial Informatics, 19(4), 5476–5484. Gao, C., Chen, W., Nie, F., Yu, W., & Yan, F. (2022). Subspace clustering by directly solving discriminative k-means. Knowledge-Based Systems, 252, 109452. Guo, W., Wang, W., Zhao, S., Niu, Y., Zhang, Z., & Liu, X. (2022). Density peak clustering with connectivity estimation. Knowledge-Based Systems, 243, 108501. Han, J., Kamber, M., & Pei, J. (2012). Data mining concepts and techniques third edition. University of Illinois at Urbana-Champaign Micheline Kamber Jian Pei Simon Fraser University. Huang, J., Zhu, Q., Yang, L., Cheng, D., & Wu, Q. (2017). QCC: A novel clustering algorithm based on quasi-cluster centers. Machine Learning, 106, 337–357. Khan, K., Rehman, S. U., Aziz, K., Fong, S., & Sarasvady, S. (2014). DBSCAN: Past, present and future. In The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014) (pp. 232–238). IEEE. Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2), 83–97. Liu, R., Wang, H., & Yu, X. (2018). Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Information Sciences, 450, 200–226. Macqueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967 (pp. 281–297). McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205. Meil˘a, M. (2007). Comparing clusterings–an information based distance. Journal of Multivariate Analysis, 98(5), 873–895. Murtagh, F., & Contreras, P. (2017). Algorithms for hierarchical clustering: An overview, II. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(6), e1219. Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter, 6(1), 90–105. Ren, C., Sun, L., Yu, Y., & Wu, Q. (2020). Effective density peaks clustering algorithm based on the layered k-nearest neighbors and subcluster merging. IEEE Access, 8, 123449–123468. Rendón, E., Abundez, I., Arizmendi, A., & Quiroz, E. M. (2011). Internal versus external cluster validation indexes. International Journal of Computers and Communications, 5(1), 27–34. Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496. Romano, S., Bailey, J., Nguyen, V., & Verspoor, K. (2014). Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In International Conference on Machine Learning (pp. 1143–1151). PMLR. Romano, S., Vinh, N. X., Bailey, J., & Verspoor, K. (2016). Adjusting for chance clustering comparison measures. The Journal of Machine Learning Research, 17(1), 4635–4666. Ros, F., & Guillaume, S. (2019). Munec: A mutual neighbor-based clustering algorithm. Information Sciences, 486, 148–170. Ros, F., Guillaume, S., El Hajji, M., & Riad, R. (2020). KdMutual: A novel clustering algorithm combining mutual neighboring and hierarchical approaches using a new selection criterion. Knowledge-Based Systems, 204, 106220. Ros, F., Guillaume, S., & Riad, R. (2022a). Path-scan: A novel clustering algorithm based on core points and connexity. Expert Systems with Applications, 210, 118316. 
Ros, F., Guillaume, S., Riad, R., & El Hajji, M. (2022b). Detection of natural clusters via SDBSCAN a self-tuning version of DBSCAN. Knowledge-Based Systems, 241, 108288. Ros, F., Riad, R., & Guillaume, S. (2023). PDBI: A partitioning Davies-Bouldin index for clustering evaluation. Neurocomputing, 528, 178–199. Santos, J. M. & Embrechts, M. (2009). On the use of the adjusted rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks (pp. 175–184). Springer. Sasaki, Y., et al. (2007). The truth of the f-measure. Teach Tutor Mater, 1(5), 1–5.
Singh, H. V., Girdhar, A., & Dahiya, S. (2022). A literature survey based on DBSCAN algorithms. In 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS) (pp. 751–758). IEEE. Su, X., Xue, S., Liu, F., Wu, J., Yang, J., Zhou, C., Hu, W., Paris, C., Nepal, S., Jin, D., et al. (2022). A comprehensive survey on community detection with deep learning. IEEE Transactions on Neural Networks and Learning Systems, 33, 1–21. https://doi.org/10.1109/TNNLS.2021. 3137396 Sun, L., Liu, R., Xu, J., & Zhang, S. (2019). An adaptive density peaks clustering method with fisher linear discriminant. IEEE Access, 7, 72936–72955. Vidal, R. (2011). Subspace clustering. IEEE Signal Processing Magazine, 28(2), 52–68. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11, 2837–2854. Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395– 416. Wang, W., Yang, J., Muntz, R., et al. (1997). Sting: A statistical information grid approach to spatial data mining. In VLDB (vol. 97, pp. 186–195). Citeseer. Wei, X., Peng, M., & Huang, H. (2023). An overview on density peaks clustering. Neurocomputing, 126633. Xie, J., Gao, H., Xie, W., Liu, X., & Grant, P. W. (2016a). Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Information Sciences, 354, 19–40. Xie, J., Girshick, R., & Farhadi, A. (2016b). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (pp. 478–487). PMLR. Yang, J., Liang, J., Wang, K., Rosin, P. L., & Yang, M.-H. (2019). Subspace clustering via good neighbors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6), 1537– 1544. You, C., Robinson, D., & Vidal, R. (2016). Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3918–3927). Zhang, T., Ramakrishnan, R., & Livny, M. (1997). Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 141–182.
Chapter 5
Problematic in High Dimension
The explosive growth of data across various domains has created a pressing need for efficient and effective techniques to handle high-dimensional data. Clustering, as a fundamental unsupervised learning task, faces challenges arising from the diversity in data size, shapes, density, and the presence of noise and outliers. In the era of big data, traditional clustering algorithms often struggle due to their algorithmic complexity when dealing with large databases. Scalable algorithms are now sought after to address this limitation. Moreover, the curse of dimensionality poses a significant hurdle for traditional machine learning algorithms, leading to performance degradation when the number of features surpasses the available training data. Recently, deep learning approaches have emerged as powerful tools to tackle this challenge. This short chapter delves into the role of deep learning approaches in handling high-dimensional data and the challenges that come with it. To address these challenges, it introduces the concept of representation learning as a powerful solution. Representation learning helps uncover meaningful patterns and concepts in unsupervised contexts, where reliance on labeled data is limited. The chapter highlights the limitations of semantic discovery in high-dimensional data caused by the curse of dimensionality and emphasizes the importance of accurate latent space representation. By capturing essential information effectively, an accurate latent space can mitigate challenges stemming from high dimensionality.
1 High Dimensionality Big data refers to vast datasets that contain a large number of records with diverse formats and high dimensionality; examples of high-dimensional datasets such as MNIST and CIFAR10 can be seen in Fig. 5.1.
Fig. 5.1 High-dimensional data: MNIST (28 × 28, 784 features), CIFAR10 (32 × 32, 1024 features)
Dealing with such datasets requires the use of novel scalable algorithms, with the challenge of high dimensionality being more pronounced than dealing with the sheer volume of data. High dimensionality poses various difficulties in data analysis, commonly known as the "curse of dimensionality." As the number of dimensions increases, the data becomes increasingly sparse, making it challenging to identify meaningful patterns or correlations. Additionally, high-dimensional data often includes noise, redundancy, and irrelevant features, which can hinder the effectiveness of machine learning systems. Visualizing and interpreting high-dimensional data become complex tasks, sometimes approaching impossibility, impeding the ability to gain valuable insights and make accurate decisions. To overcome these challenges and extract useful information from high-dimensional data, specialized approaches and procedures are necessary. Researchers and data analysts often employ dimensionality reduction techniques to reduce the number of variables while preserving essential information. Techniques like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE), presented in Chap. 2, are commonly used for this purpose. Furthermore, feature selection methods help identify the most relevant features, improving the performance of machine learning models and reducing noise.
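As a brief illustration of such a pipeline (a hedged sketch using scikit-learn, not code from this book), the snippet below reduces a 64-dimensional digits dataset with PCA and then embeds it in two dimensions with t-SNE for visualization; the numbers of components and the perplexity are illustrative choices.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 images of 8x8 = 64 features

# First reduce to 30 principal components to denoise and speed up t-SNE
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Then embed into 2-D for visualization
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X.shape, "->", X_pca.shape, "->", X_2d.shape)   # (1797, 64) -> (1797, 30) -> (1797, 2)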
2 Representation Learning Representation learning (Bengio et al., 2013; Wang & Deng, 2018; Zhang et al., 2018; Wang et al., 2022) is a fundamental concept in machine learning and artificial intelligence, focused on acquiring meaningful and informative representations or features from raw data. The primary goal of representation learning is to transform
high-dimensional data into a lower-dimensional representation while retaining the essential information of the original data. One popular application of representation learning is in unsupervised settings, where labeled data is unavailable. Instead of relying on labeled examples, unsupervised representation learning leverages the inherent structure and distribution of the data to obtain meaningful representations. Through unsupervised learning approaches, relevant characteristics and patterns can be discovered in the data, which subsequently prove valuable for tasks such as classification, clustering, and prediction. The core idea of representation learning is to map data representations to other representations, generating dense and compact learned representations that can be generalized to similar data modalities. This lower-dimensional representation is often referred to as a latent space or embedding. By identifying a compact representation, representation learning aims to capture the underlying patterns, structure, or semantics of the data. The benefits of effective representation learning are numerous. It can significantly reduce computing complexity, as the transformed data occupies a lower-dimensional space. Additionally, representation learning enhances interpretability, making it easier for researchers and analysts to understand and extract insights from the data. Moreover, by learning relevant features, machine learning systems can perform better on various tasks. In conclusion, representation learning plays a crucial role in extracting valuable information from high-dimensional data, particularly when labeled data is not readily available. It enables the creation of concise, informative representations that facilitate data analysis and improve the performance of machine learning models.
3 Deep Learning Algorithms and Representation Learning Deep learning algorithms, including deep clustering algorithms, are powerful because they primarily perform representation learning. Good representations (Achille & Soatto, 2018) are expressive, meaning they can capture a vast number of possible input configurations with a reasonably sized learned representation. One of the challenges in representation learning is the absence of a clear objective or target for training, unlike tasks such as classification. Nevertheless, the central objective remains the preservation of information to discover and capture latent semantic structures in the data. This enables models to gain a better understanding of the underlying concepts and relationships and make more informed decisions. Semantics, in general, refers to the meaning or interpretation of language elements. The relation between semantics and representation learning lies in the fact that representation-learning techniques aim to capture semantic information in learned representations. Whether in natural language processing or computer vision, deep learning models learn hierarchical representations that encode semantic information about the content of the data. These learned representations not only solve the
immediate task but also possess generalization properties, making them useful for other downstream tasks such as object detection, segmentation, and pose estimation. Transfer learning is a popular approach that leverages the knowledge gained from one task to solve another task with limited annotations. In traditional clustering methods like k-means, hierarchical clustering, or self-organizing maps, the aim is to discover patterns within the data. Most of the time, these approaches are suitable when in-depth data analysis is not required. However, traditional clustering techniques are unsuitable for large and high-dimensional datasets since they cannot identify complex patterns. On the other hand, deep learning clustering uses deep clustering networks, deep adaptive clustering, and deep embedded clustering to perform in-depth analysis of large datasets. As a result, deep learning clustering can identify complex and hidden patterns in data more efficiently. In summary, tackling big data with high dimensionality requires a thoughtful and targeted approach. By employing advanced algorithms and appropriate data processing techniques, analysts can unlock valuable insights and maximize the potential of these vast datasets. More broadly, representation learning plays a crucial role in capturing and encoding semantic information from data, enabling models to understand and utilize the underlying meaning and relationships between different elements. This has broad applications in various domains and contributes to the advancement of machine learning and artificial intelligence.
4 Importance of a Good Latent Space A good latent space representation is essential for effective data analysis and visualization. It should capture relevant and meaningful information in a compact manner. By preserving the semantic structure of the data, a well-designed latent space enables us to understand and interpret the underlying patterns and relationships within the data. It also facilitates downstream tasks such as clustering, classification, and prediction. Moreover, a well-constructed latent space can also help mitigate the curse of dimensionality. By capturing the essential features or dimensions that explain the data variations, it allows us to effectively reduce the dimensionality of the data without sacrificing important information. This not only improves computational efficiency but also helps in avoiding the problems associated with high-dimensional data, such as sparsity and overfitting. Therefore, the design and construction of a good latent space representation are crucial for successful data analysis and overcoming the challenges posed by the curse of dimensionality (Fig. 5.2).
Fig. 5.2 Latent space
5 Semantic Discovery in Unsupervised Scenarios Semantic discovery is a crucial process in data analysis that involves uncovering meaningful and interpretable patterns or concepts within the data. In unsupervised scenarios, this process relies solely on the inherent structure or relationships present in the data, without any labeled information. However, in the case of high-dimensional data, the task of finding meaningful semantic representations becomes particularly challenging due to the curse of dimensionality. Semantic discovery in high-dimensional data is made considerably more difficult by the curse of dimensionality. As the number of dimensions increases, the available data points become sparser, which makes it difficult to accurately capture the semantics or underlying patterns within the data. With fewer data points and more dimensions to consider, it becomes more challenging to uncover meaningful insights and interpret the results. Moreover, the curse of dimensionality also leads to an increased risk of overfitting and model complexity. The number of parameters needed to represent the data grows exponentially as the dimensionality does. In turn, this may increase the risk that the model will be overfitted to the particular training data, making it more difficult to generalize and use the model on fresh, untested data. Consequently, dealing with the curse of dimensionality requires careful consideration and specialized techniques to mitigate its effects and still extract valuable information from high-dimensional datasets.
References Achille, A., & Soatto, S. (2018). Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1), 1947–1980. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798– 1828. Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., & Yu, P. (2022). Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 312, 135–153. Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135–153. Zhang, D., Yin, J., Zhu, X., & Zhang, C. (2018). Network representation learning: A survey. IEEE Transactions on Big Data, 6(1), 3–28.
Chapter 6
Deep Learning Architectures
Artificial intelligence and machine learning have undergone a radical transition thanks to deep learning architectures, which have sped up innovation in a variety of fields. This chapter undertakes the task of delivering an all-encompassing survey of these architectures, unraveling their pivotal constituents, and showcasing their multifaceted utility across a panorama of applications. It delves profoundly into the bedrock principles and intricate structures underpinning deep learning models, including but not limited to deep neural networks (DNNs) and convolutional neural networks (CNNs). The exposition elucidates how these models adeptly grapple with high-dimensional data, effortlessly imbibing hierarchical representations, and successfully capturing complex inter-feature relationships. This chapter stands as a simple compendium for navigating the intricate landscape of deep learning architectures, bestowing a firm grasp of their foundational tenets, core constituents, and expansive applications. Representing an invaluable repository of knowledge, it serves to illuminate the nuances of feature selection and clustering within the profound expanse of the deep learning domain.
1 Convolutional Neural Networks In this section, the essential information concerning CNN architectures is given. Many review papers provide further information for interested readers, such as Li et al. (2016), Albawi et al. (2017), Gu et al. (2018), Alzubaidi et al. (2021), and Ji et al. (2021), which were recently published particularly on CNN architectures.
1.1 Brief History The history of Convolutional Neural Networks (CNNs) dates back several decades, with significant advancements and architectural refinements over the years (Alom et al., 2018; Alzubaidi et al., 2021). Here is a brief overview of the key milestones in the development of CNN architectures: • Neocognitron (1980s): Kunihiko Fukushima introduced the concept of Neocognitron (Fukushima et al., 1983), an early precursor to CNNs, in the 1980s. The Neocognitron aimed to emulate the visual cortex’s hierarchical structure and featured a series of trainable layers with local receptive fields, allowing for the detection of complex visual patterns. • LeNet-5 (1990s): LeNet-5 (Lecun et al., 1998; LeCun et al., 2015), developed by Yann LeCun and colleagues in the 1990s, was one of the pioneering CNN architectures. It was designed for handwritten digit recognition and consisted of convolutional layers, sub-sampling layers (pooling), and fully connected layers. LeNet-5 demonstrated the potential of CNNs in achieving superior performance on character recognition tasks. • AlexNet (2012): AlexNet, introduced by Krizhevsky et al. in 2012, marked a breakthrough in the field of computer vision. It was the first CNN architecture to successfully utilize deep learning principles with multiple convolutional and fully connected layers. AlexNet achieved a significant improvement in image classification accuracy and won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)1 in 2012. • GoogLeNet/Inception (2014): GoogLeNet, or the Inception architecture, developed by Szegedy et al. in 2014, introduced the concept of “Inception modules.” These modules consisted of parallel convolutional operations of different kernel sizes, allowing for efficient and scalable feature extraction. GoogLeNet won the ILSVRC in 2014 and demonstrated the benefits of deep and wide architectures. • VGGNet (2014): VGGNet, proposed by Simonyan and Zisserman in 2014, focused on increasing the depth of CNN architectures. It featured a simple and uniform architecture with very small 3 by 3 convolutional filters, allowing for deeper networks. VGGNet achieved remarkable performance in various image recognition tasks and played a crucial role in understanding the impact of network depth. • ResNet (2016): ResNet, or Residual Network, introduced by He et al. in 2016, addressed the challenge of training very deep CNNs. It utilized residual connections, where the output of a layer is added to the input, allowing for the flow of gradients through the network and alleviating the vanishing gradient problem. ResNet architectures with hundreds of layers achieved state-of-the-art results in image classification.
1 www.image-net.org/challenges/LSVRC/.
• EfficientNet (2019): EfficientNet, proposed by Tan and Le in 2019, aimed to improve the efficiency of CNN architectures by jointly scaling up the depth, width, and resolution using a compound scaling technique. It achieved excellent accuracy on image classification tasks while significantly reducing computational requirements, making it more efficient for deployment in resource-constrained scenarios. These are just a few notable CNN architectures that have had a significant impact on the field of computer vision. Since then, there have been numerous architectural variations, advancements in regularization techniques (e.g., batch normalization), and exploration of network designs tailored to specific tasks or domains. CNNs continue to evolve and serve as powerful tools for various image and pattern recognition applications.
1.2 Main Components Convolutional Neural Networks (CNNs) are a type of deep learning architecture specifically designed for processing structured grid-like data, such as images or sequential data. CNNs have revolutionized the field of computer vision and have achieved remarkable success in various other domains as well (Fig. 6.1). Here is a brief overview of CNNs: • Convolutional Layers: The core building blocks of CNNs are convolutional layers. These layers apply a set of learnable filters (also known as kernels) to the input data using convolution operations. Each filter detects specific features or patterns in the input, such as edges, textures, or shapes. By applying multiple filters, the network can learn to extract hierarchical representations of increasing complexity.
Fig. 6.1 Convolutional Neural Networks. Conventional architecture (from source www.analyticsvidhya.com/blog/2022/01/convolutional-neural-network-an-overview/)
• Pooling Layers: Pooling layers are often used in CNNs to reduce spatial dimensions and capture the most relevant features. Max pooling is a common pooling technique where the input is divided into small spatial regions, and only the maximum value within each region is retained. Pooling helps in reducing the sensitivity to small spatial shifts, making the network more robust and reducing computational requirements. • Activation Functions: Activation functions introduce non-linearities into the network, allowing it to model complex relationships and learn non-linear transformations. Common activation functions used in CNNs include Rectified Linear Units (ReLU), sigmoid, and hyperbolic tangent functions. • Fully Connected Layers: At the end of the convolutional layers, one or more fully connected layers are typically added to perform classification or regression tasks. These layers connect every neuron in the previous layer to every neuron in the subsequent layer. They capture high-level abstractions and combine them to make final predictions or decisions.
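To make these components concrete, the sketch below assembles convolutional, pooling, activation, and fully connected layers into a small network with PyTorch (assumed available); the layer sizes are illustrative choices and do not correspond to any of the architectures cited above.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN: two conv/pool stages followed by a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                    # activation function
            nn.MaxPool2d(2),                              # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# A batch of four 28x28 grayscale images (e.g., MNIST-sized inputs)
logits = SmallCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])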
1.3 Training CNNs CNNs are trained on large labeled datasets through a process called backpropagation. A deep CNN has a huge number of parameters and its loss function is generally non-convex, which makes it very difficult to train. To achieve fast convergence in training and avoid the vanishing gradient problem, proper network initialization is one of the most important prerequisites. The bias parameters can be initialized to zero, while the weight parameters should be initialized carefully to break the symmetry among hidden units of the same layer. The network's weights are adjusted iteratively based on the discrepancy between its predictions and the true labels. Optimization algorithms such as Stochastic Gradient Descent (SGD) or its variants are commonly used to minimize a defined loss or error function. Transfer learning: CNNs can benefit from transfer learning, where pre-trained models trained on large-scale datasets (e.g., ImageNet) are used as a starting point for new tasks. By leveraging the learned features and representations from these models, transfer learning allows for effective training on smaller datasets or domain-specific tasks. CNNs have achieved tremendous success in various computer vision tasks, including image classification, object detection, image segmentation, and facial recognition. They are also applied in other domains such as natural language processing, recommender systems, and drug discovery. The power of CNNs lies in their ability to automatically learn hierarchical representations of data, capturing complex patterns and features. Their convolutional and pooling operations enable parameter sharing, making them computationally efficient for grid-like data. As a result, CNNs have become a fundamental tool in many deep-learning applications.
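As a hedged illustration of this training procedure (not code from the cited papers), the sketch below runs a few SGD steps on the SmallCNN defined in the previous sketch, using a cross-entropy loss on random stand-in data.

import torch
import torch.nn as nn

model = SmallCNN()                                   # defined in the previous sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Random stand-in batch; in practice this loop iterates over a labeled dataset
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # discrepancy between predictions and labels
    loss.backward()                          # backpropagation of the error
    optimizer.step()                         # SGD weight update
    print(f"step {step}: loss = {loss.item():.3f}")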
2 AE: Autoencoder The autoencoder (Rumelhart et al., 1986) is an artificial neural network model, which consists of an encoder and a decoder. The encoder of the model learns the hidden features of the input object by training a mapping function x → h; this process is the encoding. The decoder can reconstruct the object from the hidden-layer coding, h → x'. The autoencoder (Fig. 6.2) is a type of neural network used in semi-supervised learning and unsupervised learning. It is widely used for dimensionality reduction or feature learning. The learning goal of the model is to train a function that makes the model output approximately equal to the input, x ≈ x', with x → h → x'. The basic loss function is simply the mean squared error (MSE), quantifying how well the reconstruction x' matches the original data pattern x:

MSE_loss = Σ_i ‖x_i − x'_i‖²_2    (6.1)
The autoencoder has to learn a good "encoding" or "compression" of the input to a latent representation (x → h) as well as a good "decoding" of that latent representation into a reconstruction of the original input (h → x'). Relevant information is extracted by restricting the amount of information that can traverse the full network, forcing a learned compression of the input data.
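For illustration, the sketch below defines a small fully connected autoencoder in PyTorch and trains it with the MSE reconstruction loss of Eq. (6.1) (averaged over the batch, as is common in practice); the layer sizes and the random data are illustrative assumptions, not a configuration from the literature discussed here.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder x -> h and decoder h -> x' with a low-dimensional bottleneck."""
    def __init__(self, in_dim=784, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.encoder(x)          # compressed latent representation
        return self.decoder(h)       # reconstruction x'

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()             # reconstruction error, cf. Eq. (6.1)

x = torch.rand(64, 784)              # stand-in batch of flattened 28x28 images
for step in range(3):
    optimizer.zero_grad()
    loss = criterion(model(x), x)    # the output should approximate the input
    loss.backward()
    optimizer.step()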
Fig. 6.2 Autoencoder architecture from source https://encodebox.medium.com/auto-encoder-inbiology-9264da118b83
It uses the idea of sparse coding: it uses sparse features to reconstruct the input and learns cluster-friendly representations. In 2006, Hinton improved the prototype of the autoencoder architecture to form the deep autoencoder (DAE) (Hinton et al., 2006), which is a neural network with multiple hidden layers and powerful feature representation capabilities. Ideally, any autoencoder architecture could be successfully trained by choosing the bottleneck dimension and the capacity of the encoder and decoder based on the complexity of the distribution. The reality is more complex, and autoencoders can fail to generalize, justifying the arrival of regularized, sparse, and denoising autoencoders. The main challenge when designing an autoencoder is its sensitivity to the input data. While an autoencoder should learn a representation that embeds the key data traits as accurately as possible, it should also be able to encode traits that generalize beyond the original training set and capture similar characteristics in other data sets. Thus, several variants (Bank et al., 2023) have been proposed since autoencoders were first introduced. These variants mainly aim to address shortcomings such as improved generalization, disentanglement, and modification to sequence input models. Autoencoders have a history that spans several decades. Here is a brief overview of the key milestones in the development of autoencoders:
• Early Developments: The concept of autoencoders can be traced back to the 1980s. The pioneering work by Rumelhart, Hinton, and Williams in 1986 introduced the concept of "Boltzmann machines," which laid the foundation for later developments in unsupervised learning and generative models. The Boltzmann machines can be considered a precursor to autoencoders.
• Autoencoders as Unsupervised Feature Learners: In the early 2000s, autoencoders gained attention as unsupervised feature learners. They were used to learn compressed representations or "codes" of input data, enabling dimensionality reduction and feature extraction. The work by Hinton and Salakhutdinov in 2006 introduced "deep autoencoders" (Hinton & Salakhutdinov, 2006), which were capable of learning hierarchical representations by stacking multiple layers of encoding and decoding units.
• Sparse autoencoders: A variant of autoencoders called "sparse autoencoders" was introduced by Ng et al. (2011). These autoencoders enforce sparsity in the learned representations, promoting the discovery of more informative features. The sparse autoencoder minimizes the reconstruction error between the original input data and the output produced by the decoder network while simultaneously penalizing the activation of hidden units (Fig. 6.3) that do not contribute to the reconstruction. The notion of sparsity is developed further in this chapter.
• Variational Autoencoders (VAEs): Variational autoencoders (VAEs) were introduced later (2013), combining the power of autoencoders with probabilistic modeling. VAEs (Doersch, 2016) allow for generating new data samples by learning a probabilistic distribution in the latent space. They opened up possibilities for generative modeling, image synthesis, and unsupervised learning with probabilistic interpretations.
Fig. 6.3 Sparse autoencoders source https://www.baeldung.com/cs/autoencoders-explained
The variational autoencoder (VAE) operates by drawing a latent vector from a multivariate Gaussian distribution, achieved by encoding the input data into mean and variance vectors. Once this latent vector is obtained, the decoder network uses it to generate a novel sample from the underlying distribution. The primary goal is to minimize the disparity between the learned distribution and the prior distribution over the latent space while simultaneously minimizing the reconstruction error between the initial input data and the output of the decoder network.
• Adversarial autoencoders (AAEs): In 2015, Makhzani et al. proposed adversarial autoencoders (AAEs) (Makhzani et al., 2015), which incorporated ideas from generative adversarial networks (GANs) into autoencoder frameworks. AAEs add a discriminator network to distinguish between the encoded representations of real and generated data. This allows for improved sample generation and better disentanglement of latent variables.
• Regularized autoencoders: Regularized autoencoders are autoencoders on which constraints other than the latent dimensionality are imposed. To be precise, a regularization term is added to the loss function with the objective of preventing the encoder from overfitting, by making the representation as insensitive as possible to small changes in the input.
• Denoising autoencoders: Additionally, "denoising autoencoders" (Fig. 6.4), proposed by Vincent et al., aim to reconstruct clean data from corrupted versions, effectively learning robust representations. The denoising autoencoder minimizes the reconstruction error between the original input data and the output of the decoder network, given corrupted input data. Denoising autoencoders are based on the idea that learning the identity is no longer enough: input noise is introduced so that the network learns to retrieve the original input from a corrupted version of it.
Fig. 6.4 Denoising autoencoders source https://omdena.com/blog/denoising-autoencoders/
This process forces the hidden layer to extract more robust features. A sparse autoencoder is simply an autoencoder whose training criterion involves a penalty term in the loss function that penalizes the activations of the hidden layers, so that only a few nodes are encouraged to activate when a single sample is fed into the network. In recent years, autoencoders have continued to evolve with various architectural advancements and applications (Li et al., 2023). These include convolutional autoencoders for image-related tasks, recurrent autoencoders for sequential data, and transformer-based autoencoders for natural language processing. Additionally, specialized variations like variational recurrent autoencoders (VRNNs) and deep clustering autoencoders have been proposed to address specific challenges in unsupervised learning and representation learning. Autoencoders have found applications in diverse domains, such as image and video processing, anomaly detection, dimensionality reduction, recommendation systems, and more. They continue to be an active area of research, driving innovations in unsupervised learning, generative modeling, and representation learning.
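To make the sparse autoencoder described above concrete, here is a minimal PyTorch sketch. It is only an illustration under simple assumptions (fully connected layers and an L1 penalty on the hidden activations); the layer sizes and the penalty weight are placeholder values, not recommendations from the text.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=64):
        super().__init__()
        # Encoder: compress the input x into a latent code h
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Decoder: reconstruct x' from the latent code h
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
sparsity_weight = 1e-3                      # illustrative value

x = torch.rand(32, 784)                     # a dummy mini-batch
x_hat, h = model(x)
reconstruction = nn.functional.mse_loss(x_hat, x)
sparsity_penalty = h.abs().mean()           # L1 penalty on the hidden activations
loss = reconstruction + sparsity_weight * sparsity_penalty
loss.backward()
optimizer.step()

The sparsity term is what encourages only a few hidden nodes to activate for each sample; removing it turns the model back into a plain autoencoder.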
Fig. 6.5 Autoencoder and Variational Autoencoder from https://deeplearning.neuromatch.io/tutorials.html
3 VAE: Variational Autoencoders

A variational autoencoder inherits the autoencoder architecture and is trained to minimize the reconstruction error between the encoded-decoded data and the initial data. It can be regarded as an advanced probabilistic version of the autoencoder (AE), blending the principles of variational Bayesian techniques with the adaptability and scalability inherent in neural networks (Doersch, 2016). VaDE (Variational Deep Embedding) (Jiang et al., 2016) and GMVAE (Gaussian Mixture Variational Autoencoder) (Dilokthanakul et al., 2016) are two popular deep clustering techniques presented in Chap. 11 derived from the VAE and based on the same principle. The difference with a plain autoencoder is that the VAE makes strong assumptions concerning the distribution of the latent variables (Fig. 6.5). The main reason is that there is no guarantee about the regularity of the organization of the latent space for autoencoders (an illustration can be seen at https://towardsdatascience.com/difference-between-autoencoder-ae-and-variational-autoencoder-vae-ed7be1c038f2). Traditional autoencoders can lack continuity in the latent space, preventing interpolation between training points and, thus, limiting their generative ability. There is a dependence on the distribution of the data in the initial space, the dimension of the latent space, and the architecture of the encoder. As a result, it is pretty difficult to ensure, a priori, that the encoder (the generator) will organize the latent space in a smart fashion that is compatible with the classic generative process. Generation encompasses the procedure of determining the data point x based on the latent variable z. Essentially, it involves transitioning from the latent space to the actual data distribution. This transition is mathematically represented by the likelihood p(x|z). Inference entails the process of deducing the latent variable z from the observed data point x. It is formally characterized by the posterior distribution p(z|x). To generate a data point, z is sampled from the prior latent distribution p(z), and then the data point x is sampled from p(x|z).
Fig. 6.6 Statistical generation and inference with VAE
p(x|z) represents the distribution of the decoded variable given the encoded one (Fig. 6.6). The objective of the VAE is then to find the posterior distribution p(z|x), which can be written in terms of the likelihood p(x|z), the prior p(z), and the marginal probability density of x, p(x). The different probabilities are linked via the Bayes rule, as shown in Eq. 6.2:

$$ p(z|x) = \frac{p(x|z)\,p(z)}{p(x)} \tag{6.2} $$
where p(x) is the distribution of the original data. Inference proceeds the other way: given a sample x drawn from p(x), a latent variable z is inferred from p(z|x), which represents the distribution of the encoded variable given the decoded one. p(x) cannot be computed directly because it is intractable: it requires integrating over all dimensions of the latent space. A variational autoencoder can thus be defined as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties enabling a generative process. Variational autoencoders (VAEs) remedy this limitation of AEs by modeling the input probability distribution using Bayesian inference. VAEs enable sampling new data from the learned distribution and are also well suited to providing interpretable and disentangled data representations in the low-dimensional space (Fig. 6.7). In order to introduce some regularization of the latent space, a slight modification of the encoding-decoding process is made: instead of encoding an input as a single point, the input is encoded as a distribution over the latent space, as shown in Fig. 6.8. In an autoencoder (AE), a single input corresponds to a single vector. Conversely, in a variational autoencoder (VAE), each input is represented by a pair of vectors (μ, σ) of learned parameters.
Fig. 6.7 The VAE process, source (Varolgüneş et al., 2020)
Fig. 6.8 Autoencoder (AE) versus Variational Autoencoder (VAE): the AE encodes an input directly as latent coordinates, whereas the VAE encodes μ and σ, the parameters of a probability distribution
This distinction bears intuitive significance: the mean vector governs the central point around which an input's encoding is positioned, while the standard deviation regulates the "spread," indicating the extent to which the encoding can deviate from the mean. Since encodings are stochastically generated from within the distribution's range, the decoder learns that not only does a single point in the latent space denote a sample of a specific class, but nearby points hold the same significance. This understanding empowers the decoder to interpret not only exact encodings in the latent space, preventing the latent space from being disjointed, but also encodings that exhibit slight variations. This exposure to a spectrum of encoding variations for the same input during training facilitates the decoder's adaptability. The underlying notion is that any sample drawn from within the specified range will bear resemblance to the original input.
Fig. 6.9 Illustration of the reparameterization trick. Sampling via μ and σ prevents the use of backpropagation (left). By adding an ε ∼ N(0, I) sampling node, backpropagation can be carried back to μ and σ
The model is then trained as follows:
1. The input is encoded as a distribution over the latent space. This distribution serves as a foundation from which, once the network is trained, fresh encodings are sampled and subsequently decoded to yield novel samples.
2. A point of the latent space is sampled from that distribution.
3. The sampled point is decoded and the reconstruction error is computed.
4. The reconstruction error is backpropagated through the network.
The sampling process has to be expressed in a way that allows the error to be backpropagated through the network. Since random variables are not differentiable, the reparameterization technique is used (Fig. 6.9). Based on the assumption that z follows a Gaussian distribution, it can be expressed as z = μ + σ ∗ ε with ε ∼ N(0, I). For the i-th data point and the l-th sample, z^(i,l) = μ^(i) + σ^(i) ∗ ε^(l), where ∗ is the element-wise multiplication and ε ∼ N(0, I) is a fixed stochastic node that does not require backpropagation. Given p(z) and p(x|z), the objective is to infer the posterior distribution p(z|x). This requires an integral that cannot be evaluated in high dimension, so p(z|x) needs to be approximated. The idea is to approximate p(z|x) by a probabilistic distribution q(z|x) that is as close as possible to p(z|x). This can be formalized as solving the following optimization problem:

$$ \min_{\phi}\; KL\big(q(z|x)\,\|\,p(z|x)\big) \tag{6.3} $$
where φ parameterizes the approximation q and $KL(q\|p) = \int_x q(x) \log \frac{q(x)}{p(x)}\, dx$ denotes the Kullback–Leibler divergence between q and p, thus expressing the difference between the true posterior and the variational posterior. As $p(z|x) = \frac{p(x,z)}{p(x)}$, the KL term can be rewritten as follows:

$$
\begin{aligned}
KL\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) &= \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)}\, dz
= \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)\, p_\theta(x)}{p_\theta(x,z)}\, dz \\
&= \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(x,z)}\, dz + \log\big(p_\theta(x)\big) \int_z q_\phi(z|x)\, dz \\
&= -\mathcal{L}(\theta, \phi) + \log\big(p_\theta(x)\big)
\end{aligned} \tag{6.4}
$$

where:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi) &= \int_z q_\phi(z|x) \log \frac{p_\theta(x,z)}{q_\phi(z|x)}\, dz
= \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x,z)\big] - \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log q_\phi(z|x)\big] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - KL\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)
\end{aligned} \tag{6.5}
$$
The first term is referred to as the reconstruction loss and the second term is the KL divergence, often called the latent loss, which computes the probability distribution distance between the prior pθ(z) and the approximation qφ(z|x). It can be deduced that:
• Minimizing the KL divergence is equivalent to maximizing L(θ, φ), as pθ(x) is independent of qφ(z|x). It is done by training θ and φ jointly.
• L(θ, φ) is a lower bound of log(pθ(x)), the KL operator being always positive. Then, minimizing the KL divergence is equivalent to maximizing log pθ(x).
For a VAE it is assumed that:
• qφ(z|x) = N(z; μφ(x), σφ(x)I) (the encoder function, with φ as parameters);
• pθ(z) = N(0, I);
• pθ(x|z) = N(x; μθ(z), σθ(z)I) (the decoder function, with θ as parameters);
• there is no interaction between the features, i.e., the covariance matrix is diagonal.
Then, assuming that

$$ z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \ast \epsilon^{(l)} \tag{6.6} $$

where l indexes the l-th sample, ε^(l) ∼ N(0, I), and ∗ is the element-wise multiplication, that the prior over the latent variables is pθ(z) = N(0, I), and that the approximation qφ(z|x) is a Gaussian distribution written as

$$ N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \tag{6.7} $$
Fig. 6.10 Variational autoencoders (VAE) scheme
one can deduce the Evidence Lower Bound for the VAE as:

$$ \mathcal{L}\big(\theta, \phi, x^{(i)}\big) \simeq \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^{(i)} \mid z^{(i,l)}\big) + \frac{1}{2} \sum_{j} \Big(1 + \log\big(\sigma_j^{(i)}\big)^2 - \big(\mu_j^{(i)}\big)^2 - \big(\sigma_j^{(i)}\big)^2\Big) \tag{6.8} $$
where L denotes the number of latent samples drawn for each data point and j runs over the latent dimensions. VAE-based deep clustering methods utilize the VAE to regularize the network training and avoid overfitting by enforcing the latent space to follow some predefined distribution. The regularity that is expected from the latent space in order to make the generative process possible can be expressed through two main properties: continuity (two close points in the latent space should not give two completely different contents once decoded) and completeness (for a chosen distribution, a point sampled from the latent space should give "meaningful" content once decoded). It is expressed as the Kullback–Leibler divergence between the returned distribution and a standard Gaussian (Fig. 6.10). Including the KL divergence in the loss function encourages the encoder to distribute all encodings evenly around the center of the latent space:

$$ \mathrm{Loss}_{VAE} = \alpha\, \mathrm{Loss}_{AE} + \beta\, KL\big(N(\mu, \sigma),\, N(0, I)\big) \tag{6.9} $$
where μ = [μ1, ..., μk] and σ = [σ1, ..., σk]. The first term ensures the reconstruction, as in the AE, and the second term forces the latent distribution to be Gaussian. The tradeoff between the reconstruction error and the KL divergence is tuned through the α and β hyperparameters. Given the differentiable loss function, the full learning algorithm for the VAE is as follows:
• Get a minibatch consisting of M data points.
• Compute the minibatch loss (1/M) Σ_i L(θ, φ, x_i).
• Compute the gradients (1/M) Σ_i ∇L(θ, φ, x_i).
• Apply the gradients to update the parameters φ and θ.
• Repeat the previous steps until convergence.
A complete tutorial explaining the VAE loss in detail can be found in Odaibo (2019), a review of variational inference in Blei et al. (2017), and source code at https://github.com/AntixK/PyTorch-VAE.
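As a small illustration of the reparameterization trick and of the loss in Eqs. 6.8 and 6.9, here is a hedged PyTorch sketch. A diagonal Gaussian posterior is assumed; the architecture, the use of a mean squared error reconstruction term, and the α and β weights are illustrative choices, not the only possible ones.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, 256)
        self.mu = nn.Linear(256, latent_dim)        # mean vector of q(z|x)
        self.logvar = nn.Linear(256, latent_dim)    # log of sigma^2 of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                  # epsilon ~ N(0, I), fixed stochastic node
        z = mu + torch.exp(0.5 * logvar) * eps      # reparameterization trick (Eq. 6.6)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, alpha=1.0, beta=1.0):
    # Reconstruction term (Loss_AE) plus closed-form KL to N(0, I), cf. Eqs. 6.8 and 6.9
    rec = nn.functional.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return alpha * rec + beta * kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                             # a dummy mini-batch of M = 32 points
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()                                     # gradients of the minibatch loss
optimizer.step()                                    # update phi and theta jointly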
4 GAN: Generative Adversarial Network

GANs (Goodfellow et al., 2014) are generative models and have become the most popular deep generative models in recent years. They have shown remarkable generation performance, especially in image synthesis. They generate new instances that resemble the training patterns by discovering and learning the regularities present in the input data; in other terms, GANs aim to capture the underlying probability density of the training patterns. They are based on the adversarial idea (Ganin et al., 2016), and the strategy behind them can be seen as a self-supervised process. To perform the task, GANs include two neural network sub-models: a generator (G) that is trained to generate new patterns, and a discriminator (D) that learns to distinguish true patterns from the output of the generator. During training, the generator transforms random noise into images of a certain size and has no access to the real images. The discriminator discriminates between instances from the true data distribution and fake instances produced by the generator. It is a min-max game: D needs access to both real and fake data and is trained to push its prediction toward 0 for fake data and toward 1 for real data (Fig. 6.11). The generator output is connected directly to the discriminator input. The discriminator classifies both real data and fake patterns from the generator. It is driven by a loss that penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real. Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
Fig. 6.11 Original idea of GAN
The generator learns to make the discriminator classify its output as real. The generator takes random noise as its input, transforms it into a meaningful output, and is able to produce a wide variety of data by sampling from different places in the target distribution. The generator feeds into the discriminator network, and the discriminator produces the output we are trying to affect. The generator loss penalizes the generator for producing a sample that the discriminator classifies as fake. Through backpropagation again, the gradients flow back through the discriminator to the generator, whose weights are modified this time to optimize the generator loss and not the discriminator loss. The networks are trained jointly in an alternating fashion using gradient descent. The generator's parameters are fixed and a single iteration of gradient descent on the discriminator is performed using the real and the generated images. Then the sides are switched: the discriminator is fixed and the generator is trained for another single iteration. Both networks are trained in alternating steps until the generator produces good-quality images. The two models are thus linked and trained together but alternately; the parameters of one model are updated while the parameters of the other are fixed (Fig. 6.12). The generator tries to fool the discriminator and improves with training. The discriminator tries to keep from being fooled, but its task becomes harder as training progresses because it can no longer easily tell the difference between real and fake samples. The process is stopped when the discriminator model is sufficiently fooled, meaning the generator model is generating plausible patterns. Because of the competition between the sub-models, GANs are known to be challenging to train.
$$ \mathrm{Loss}(D, G) = -\frac{1}{2}\, \mathbb{E}_{p_{data}(x)}\big[\log D(x)\big] - \frac{1}{2}\, \mathbb{E}_{p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{6.10} $$

where x is sampled from the real data distribution p_data(x), z is sampled from a prior distribution p_z(z) such as a uniform or Gaussian distribution, and E[·] represents the expectation.
Fig. 6.12 Process of GAN
The discriminator outputs a value D(x) indicating the chance that x is a real image. The first term tries to recognize real patterns better, while the second term tries to recognize generated patterns better. The objective is to maximize the chance of recognizing real images as real and generated images as fake. The loss comes from the binary cross entropy:

$$ L(\hat{y}, y) = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \tag{6.11} $$
A pedagogic explanation of the binary cross entropy can be found at https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a. Applied to the discriminator, it gives:

$$
\begin{aligned}
L(D(x), 1) &= \log(D(x)) && (y = 1) \\
L(D(G(z)), 0) &= \log\big(1 - D(G(z))\big) && (y = 0)
\end{aligned} \tag{6.12}
$$
The true label y = 1 represents a real pattern, while y = 0 represents a fake one. Since Goodfellow et al. (2014), a number of GAN variants such as Wang et al. (2018), Springenberg (2015), Qi (2020), and Adler and Lunz (2018) have been proposed to improve the model structure, extend the theory, adapt the model to novel applications, and so on. Because of their better performance in generating samples compared with autoencoders, researchers deduced that the powerful latent representation of GANs may improve clustering results. InfoGAN (Chen et al., 2016), CatGAN (Springenberg, 2015), and ClusterGAN (Mukherjee et al., 2019) are three popular methods in the clustering GAN field. Their main drawback is that they occupy large computing resources, which has motivated new research in recent years.
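To illustrate the alternating training scheme described above, here is a minimal, purely illustrative PyTorch training step; the toy generator and discriminator architectures, batch size, and learning rates are placeholders, and real GANs typically use much larger convolutional models.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 784)                  # dummy batch of real samples
z = torch.randn(32, 64)                     # noise fed to the generator

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: make the discriminator classify generated samples as real
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()

In an actual training loop, these two steps alternate over many mini-batches until the generator produces plausible samples.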
Fig. 6.13 General architecture of conditional GAN
Fig. 6.14 Process of conditional GAN
A conditional GAN (Generative Adversarial Network) is an extension of the traditional GAN framework in which the generator is conditioned on additional information, typically represented as auxiliary labels or data (Fig. 6.13). This conditioning allows the conditional GAN to generate data with specific characteristics, making the generation process more controlled and targeted. As shown in Fig. 6.14, a condition y is concatenated with both the generator and discriminator inputs. This allows for better control of the generated images, encouraging the generator to create images determined by the condition y. In a conditional GAN, the generator takes both random noise and conditional information y as input. This condition y may be, for instance, the class to which the data belongs, an instruction vector of features, other images, or an embedded textual sentence. By conditioning the generator on this information, it learns to generate data that aligns with the given conditions. The condition y is concatenated with the latent vector z and used as input for the generator, which creates a fake sample G(z|y). When analyzing samples, the discriminator must receive the condition associated with the real sample x, or the condition used to create the fake sample. The training process of a conditional GAN involves feeding both real data with their
corresponding condition labels and generated data with the corresponding condition labels to the discriminator. The generator’s objective is to produce data that not only fools the discriminator into believing it is real but also satisfies the specified conditions.
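A hedged sketch of the conditioning mechanism follows; the condition y is assumed here to be a one-hot class label, and the dimensions and architectures are illustrative.

import torch
import torch.nn as nn

n_classes, noise_dim, data_dim = 10, 64, 784
G = nn.Sequential(nn.Linear(noise_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim + n_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(32, noise_dim)
y = nn.functional.one_hot(torch.randint(0, n_classes, (32,)), n_classes).float()

fake = G(torch.cat([z, y], dim=1))          # G(z|y): generate a sample conditioned on y
score = D(torch.cat([fake, y], dim=1))      # the discriminator also receives the condition y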
5 Siamese Neural Networks

A Siamese neural network is an emergent tool consisting of two similar neural networks, i.e., networks with the same configuration and the same parameters and weights. These networks, often called twin networks, accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest-level feature representations on each side. The parameters of the twin networks are tied. The architecture is often based on two convolutional networks whose outputs are compared to determine how similar the inputs are. They employ error backpropagation during training; the twins work in parallel and their outputs are compared at the end, usually through a cosine distance. The output generated by a Siamese neural network can be considered the semantic similarity between the projected representations of the two input vectors. The objective of a Siamese neural network is not to classify input images, but to differentiate between them. Given an input image, the neural network encoder is trained such that the feature encodings of different augmentations, aka "views," of the image are close to each other. The symmetric architecture accepts a variety of inputs: not just images but also numerical data and sequential data such as sentences or time signals. The weights are computed using the same energy function representing distinct features of each input. Figure 6.15 shows the architecture of a Siamese neural network. The input used for training is a couple of samples, one sample for the top twin and the other for the bottom one, in addition to a label that indicates whether the two samples belong to the same class or not.
Fig. 6.15 Siamese architecture
The output of each twin network is a feature vector; these two feature vectors are combined through a cost function whose output is a scalar energy. The output of the cost function is then combined with the label through a loss function. In the training phase, the network parameters are updated using backpropagation such that the energy is small for pairs that belong to the same class and large for pairs that belong to different classes. A contrastive energy function is often used as the loss function. Let x1 and x2 be two objects that we want to compare, and v1, v2 their vector representations. If x1 and x2 belong to the same class, we want their vector representations to be similar; therefore we want to minimize ‖v1 − v2‖². On the other hand, if x1 and x2 belong to different classes, we want ‖v1 − v2‖ to be large; the term we then minimize is {max(0, m − ‖v1 − v2‖)}², where m is a hyperparameter called the margin. The idea of the margin is that, when v1 and v2 are sufficiently different, this term is already 0 and cannot be further minimized. Hence the model will not waste effort in further separating v1 and v2 and will focus on other input pairs instead. Combining both cases with the label y gives the contrastive loss:

$$ L = y\,\|v_1 - v_2\|^2 + (1 - y)\,\big\{\max(0,\, m - \|v_1 - v_2\|)\big\}^2 \tag{6.13} $$
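Eq. 6.13 translates directly into code; a minimal sketch follows, assuming v1 and v2 are the embeddings produced by the twin networks and y is 1 for same-class pairs and 0 otherwise.

import torch

def contrastive_loss(v1, v2, y, margin=1.0):
    # y = 1: pull the embeddings together; y = 0: push them at least `margin` apart (Eq. 6.13)
    d = torch.norm(v1 - v2, dim=1)
    return torch.mean(y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2))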
The main advantages of Siamese networks are:
• Robustness to class imbalance: Thanks to one-shot learning, a few images per class (very little training data) are sufficient for Siamese networks to classify images of those classes in the future.
• Ensembling with a classifier algorithm: As its learning mechanism differs from classification algorithms, ensembling a Siamese network with a classifier can do much better than averaging two correlated supervised models (e.g., GBM and RF classifiers).
• Semantic similarity: A trained Siamese network focuses on learning embeddings (in the deep neural network) that place the same classes close together; hence it can learn semantic similarity.
The downsides of Siamese networks can be:
• Training time: They require more training time than traditional neural network architectures and machine learning algorithms.
• Computation time: Siamese networks learn from pairs, whose number grows quadratically with the number of samples, so training is slower than for a conventional classification network.
• No output probabilities: Since the training of Siamese networks involves pairwise learning, they do not output prediction probabilities but a distance (using a distance measure such as the Euclidean distance) from each class, which lies between 0 and 1.
The concept of the Siamese neural network was extended to what is called the "triplet loss," where three samples are used as input to the neural network, namely, the
anchor, the positive sample (which belongs to the same class as the anchor), and the negative sample (which belongs to a class different from the anchor class). The idea behind the triplet loss is to minimize the distance between the anchor and the positive sample and to maximize the distance between the anchor and the negative sample. Triplet loss takes the above idea one step further by considering triplets of inputs (xa, xp, xn). Here xa is an anchor object, xp is a positive object (i.e., xa and xp belong to the same class), and xn is a negative object (i.e., xa and xn belong to different classes). The goal is to make the vector representation va more similar to vp than to vn. The precise formula is given by Eq. 6.14:

$$ L = \max(0,\, m + \|v_a - v_p\| - \|v_a - v_n\|) \tag{6.14} $$
where m is the margin hyperparameter. Just as for the contrastive loss, the margin determines when the difference between ‖va − vp‖ and ‖va − vn‖ has become large enough that the model will no longer adjust its weights from this triplet. For both the contrastive loss and the triplet loss, how the sampling of pairs or triplets is done has a great impact on the model training process.
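Eq. 6.14 can be implemented in a few lines; va, vp, and vn are assumed to be the embeddings of the anchor, positive, and negative samples, and PyTorch also provides a built-in nn.TripletMarginLoss with equivalent behavior.

import torch

def triplet_loss(va, vp, vn, margin=1.0):
    # Pull the anchor toward the positive and away from the negative (Eq. 6.14)
    d_pos = torch.norm(va - vp, dim=1)
    d_neg = torch.norm(va - vn, dim=1)
    return torch.mean(torch.clamp(margin + d_pos - d_neg, min=0))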
References Adler, J., & Lunz, S. (2018). Banach Wasserstein GAN. Advances in Neural Information Processing Systems, 31, 6755–6764. Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1–6). IEEE. Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., Van Esesn, B. C., Awwal, A. A. S., & Asari, V. K. (2018). The history began from AlexNet: A comprehensive survey on deep learning approaches. Preprint. arXiv:1803.01164. Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., & Farhan, L. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8, 1–74. Bank, D., Koenigstein, N., & Giryes, R. (2023). Autoencoders. Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook (pp. 353–374). Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2180–2188. Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep unsupervised clustering with gaussian mixture variational autoencoders. Preprint. arXiv:1611.02648. Doersch, C. (2016). Tutorial on variational autoencoders. Preprint. arXiv:1606.05908. Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 826–834.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2096–2030. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., et al. (2018). Recent advances in convolutional neural networks. Pattern Recognition, 77, 354–377. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Part IV 14 (pp. 630–645). Springer. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. Ji, Y., Zhang, H., Zhang, Z., & Liu, M. (2021). CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Information Sciences, 546, 835–857. Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2016). Variational deep embedding: An unsupervised and generative approach to clustering. Preprint. arXiv:1611.05148. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25. Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. LeCun, Y., et al. (2015). Lenet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet. Li, P., Pei, Y., & Li, J. (2023). A comprehensive survey on design and application of autoencoder in deep learning. Applied Soft Computing, 138, 110–176. Li, Y., Hao, Z., & Lei, H. (2016). Survey of convolutional neural network. Journal of Computer Applications, 36(9), 2508. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders. Preprint. arXiv:1511.05644. Mukherjee, S., Asnani, H., Lin, E., & Kannan, S. (2019). ClusterGAN: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19. AAAI Press. Ng, A., et al. (2011). Sparse autoencoder. CS294A Lecture Notes, 72(2011), 1–19. Odaibo, S. (2019). Tutorial: Deriving the standard variational autoencoder (VAE) loss function. Preprint. arXiv:1907.08956. Qi, G.-J. (2020). Loss-sensitive generative adversarial networks on lipschitz densities. International Journal of Computer Vision, 128(5), 1118–1140. Rumelhart, D. E., Hinton, G. E., McClelland, J. L., et al. (1986). A general framework for parallel distributed processing. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1(45–76), 26. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Preprint. arXiv:1409.1556. Springenberg, J. T. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. Preprint. arXiv:1511.06390. Szegedy, C., Reed, S., Erhan, D., Anguelov, D., & Ioffe, S. (2014). Scalable, high-quality object detection. 
Preprint. arXiv:1412.1441. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR.
Varolgüneş, Y. B., Bereau, T., & Rudzinski, J. F. (2020). Interpretable embeddings from molecular simulations using Gaussian mixture variational autoencoders. Machine Learning: Science and Technology, 1(1), 015012. Wang, Y., Yu, B., Wang, L., Zu, C., Lalush, D. S., Lin, W., Wu, X., Zhou, J., Shen, D., & Zhou, L. (2018). 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. Neuroimage, 174, 550–562.
Chapter 7
Learning Approaches and Tricks
This chapter provides an in-depth exploration of the key training techniques and optimization methods used in deep learning architectures. It aims to shed light on the various learning tricks integrated into deep clustering and feature selection algorithms, both established and contemporary. By focusing on recent advancements, the chapter seeks to offer a comprehensive understanding of prominent approaches. At the core of deep clustering methods lies the crucial concept of “Representation Learning,” which addresses the challenges posed by the curse of dimensionality in clustering. To illustrate the recent successes of deep learning, the chapter delves into the pivotal concepts that have contributed to these achievements. Notably, it emphasizes novel learning strategies like contrastive and self-supervised learning, built upon data augmentations. Data augmentations play a vital role in numerous state-of-the-art deep clustering methods, as well as in pretext tasks, which serve as secondary objectives for the network to solve. Contrastive learning, a specific type of self-supervised learning, focuses on learning representations through contrasting similar and dissimilar pairs of samples, while self-supervision encompasses a broader range of techniques that enable models to learn meaningful representations without explicit supervision. The chapter proceeds by gradually introducing these concepts and provides a succinct overview of techniques like self-paced learning and regularization, which collectively dominate the influential literature in this field. Through this comprehensive analysis, readers gain valuable insights into the cutting-edge advancements in deep learning for clustering and feature selection.
1 Training Techniques and Optimization for Deep Learning Architectures

Deep learning has revolutionized the field of artificial intelligence, achieving remarkable success in various domains, including computer vision, natural language processing, and speech recognition. At the core of this success lies the training process, where deep neural networks learn to map input data to meaningful representations through iterative optimization. Training deep neural networks presents several challenges due to their complex architectures and the vast number of parameters involved. The vanishing and exploding gradient problems can hinder the learning process in deep networks, leading to slow convergence or divergence during training. Additionally, overfitting, which occurs when models memorize training data and fail to generalize to unseen data, poses a significant concern. This section explores the essential training techniques and optimization algorithms used to train deep learning architectures effectively.
1.1 Activation Functions

Activation functions play a critical role in introducing non-linearity into deep neural networks, enabling them to learn complex relationships in the data (Fig. 7.1). In neural networks, it is very common to employ sigmoid or tanh activation functions. With sigmoid activations, it is well established that the partial derivative of the sigmoid function reaches a maximum value of 0.25. As the network depth increases, the product of these derivatives diminishes progressively until, at a certain point, the partial derivative of the loss function approaches a value that is very close to zero. When this occurs, the gradient essentially disappears, giving rise to what is known as the "vanishing gradient problem."
Fig. 7.1 Classical activation functions: sigmoid, tanh, ReLU, LeakyReLU, ELU, and Maxout
This issue is less pronounced in shallow networks, where the small gradient values do not pose a significant challenge. However, in the case of deep networks, the vanishing gradient problem can have a profound detrimental effect on performance. As the gradient approaches zero, the network’s weights remain virtually unchanged during training. In the backpropagation process, a neural network learns by adjusting its weights and biases to minimize the loss function. In a network plagued by the vanishing gradient problem, this crucial weight update process is hindered, effectively preventing the network from learning. Consequently, the network’s overall performance deteriorates. The utilization of the Rectified Linear Unit (ReLU) activation function effectively mitigates the vanishing gradient problem. However, a drawback associated with ReLU arises when the gradient assumes a value of 0. In these instances, the neuron becomes what is known as a “dead node,” as both the old and new weight values remain unchanged. To circumvent this issue, a “leaky” ReLU function can be employed, which ensures that the gradient never drops to zero, thereby averting the problem of dead nodes.
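A tiny numerical illustration of the difference follows (assuming PyTorch; the 0.01 negative slope is a commonly used default, not a value prescribed by the text).

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
relu = torch.relu(x)                               # negative inputs give 0 output and 0 gradient
leaky = torch.nn.functional.leaky_relu(x, 0.01)    # a small negative slope keeps the gradient alive
print(relu)     # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(leaky)    # tensor([-0.0200, -0.0050, 0.0000, 1.5000])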
1.2 Gradient Backpropagation

Gradient descent serves as the foundational pillar for training deep learning models. The backpropagation algorithm is the standard training method; it uses gradient descent to update the parameters. It updates the parameters θ of the objective L(θ) as:

$$ \theta(t + 1) = \theta(t) - \nu\, \nabla_\theta\, \mathbb{E}\big[L(\theta(t))\big] \tag{7.1} $$
where E[L(θ(t))] is the expectation of L(θ(t)) over the full training set and ν is the learning rate. Instead of computing E[L(θ(t))], Stochastic Gradient Descent (SGD) estimates the gradient on the basis of a single randomly picked example (x(t), y(t)) from the training set:

$$ \theta(t + 1) = \theta(t) - \nu\, \nabla_\theta\, L\big(\theta(t), x(t), y(t)\big) \tag{7.2} $$
Many gradient descent optimization algorithms have been proposed. In traditional batch gradient descent (BGD), the model's parameters are updated using the average gradient computed over the entire training dataset. However, this approach can be computationally expensive, especially for large datasets. Selecting an appropriate learning rate is crucial for successful training, to achieve a balance between rapid initial progress and fine-tuning toward the end of training. In addition, the batch size used for training is an important hyperparameter. Its choice affects how machine learning models, including deep learning models, are updated during the training process: the batch size determines the number of data points that are processed together in each iteration of the optimization algorithm. Stochastic Gradient Descent (SGD) introduces a more efficient approach by updating the
parameters using the gradient computed on a single data point (or a small batch of data points) at a time. In this section, we provide a concise introduction to a range of optimization algorithms, each playing a vital role in enhancing the efficiency of the training process. These optimization techniques include stochastic gradient descent (SGD) and mini-batch gradient descent, as well as their variants like RMSprop, AdaGrad, and Adam. The primary objective here is to explore how these optimizers adapt learning rates and momentum, thereby significantly improving the convergence speed and overcoming the limitations associated with standard gradient descent.

BGD (Batch Gradient Descent). In BGD, the batch size is set to the total number of data points in the training set. This means that the entire training set is used to compute the gradient and update the model parameters in each iteration. Batch gradient descent provides stable and accurate updates, but it can be computationally expensive, especially for large datasets, as it requires processing the entire dataset at once.

SGD (Stochastic Gradient Descent). In SGD, the batch size is set to 1, meaning that only one data point is used to compute the gradient and update the model parameters in each iteration. This approach introduces more noise into the updates, but it is computationally more efficient than batch gradient descent since it processes one data point at a time. The noise can sometimes help escape local minima and can lead to faster convergence in certain cases. However, the noise may also introduce instability, and the updates may fluctuate around the optimal solution.

ADAGRAD (Adaptive Gradient Algorithm). ADAGRAD is an optimization algorithm commonly used for training deep learning models. It adapts the learning rates of individual model parameters based on the historical gradients, allowing for more aggressive updates for infrequent parameters and smaller updates for frequent parameters. The key idea behind ADAGRAD is to adjust the learning rates on a per-parameter basis, which can be particularly beneficial when dealing with sparse data or in scenarios where different features have vastly different frequencies.

RMSprop (Root Mean Square Propagation). RMSprop is an optimization algorithm used for training deep learning models. It is an extension of the ADAGRAD algorithm with a modification that addresses the issue of diminishing learning rates over time. The key idea behind RMSprop is to maintain a moving average of the squared gradients for each model parameter, similar to ADAGRAD. However, instead of accumulating all the past squared gradients, RMSprop uses an exponentially decaying average to give more importance to recent gradients while reducing the influence of past gradients.

ADAM (Adaptive Moment Estimation). The ADAM algorithm is an optimization algorithm used to update the parameters of a deep learning model during the training process. It is a popular choice for training
neural networks due to its effectiveness and efficiency. ADAM combines the benefits of AdaGrad and RMSprop. It calculates adaptive learning rates for each parameter and also maintains an exponentially decaying average of past squared gradients.
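In practice these optimizers are drop-in choices in modern frameworks; a minimal PyTorch sketch of one update step follows (the model, learning rates, and data are placeholders).

import torch
import torch.nn as nn

model = nn.Linear(20, 1)
loss_fn = nn.MSELoss()
# Any of the optimizers discussed above can be swapped in here:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

x, y = torch.randn(64, 20), torch.randn(64, 1)     # one dummy mini-batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                    # backpropagation computes the gradients
optimizer.step()                                   # the optimizer applies the parameter update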
1.3 Regularization Techniques

Regularization (Hou et al., 2013; Goodfellow et al., 2016) is a set of methods that both optimize the learning of a deep learning model and counter overfitting to prevent a lack of generalization. Most deep clustering methods include regularization terms. These methods also offer an alternative way to learn classifiers for datasets with a large number of features but a small sample size: they trim the space of features directly during classification. In other words, regularization effectively shuts down the influence of unnecessary features. Regularization can be incorporated either into the error criterion or directly into the model. It can be implemented in multiple ways, by modifying the loss function (L1, L2, and entropy regularization being the most common), the sampling method (data augmentation), or the training approach itself (dropout). The cluster regularization loss is very popular in the deep clustering area; it enforces the network to preserve suitable discriminant information from the data in the representations. L1 and L2 are the most common types of regularization. They update the general cost function by adding another term known as the regularization term. The idea is to penalize the model for non-zero weights βj so that the optimization of the new error function drives all unnecessary parameters to 0. In both cases, the output of the learning is a feature-restricted classification model, so features are selected in parallel with model learning. The L1 penalty (Σj |βj|) promotes sparsity in the solution, resulting in a few features with non-zero weights, which makes it well suited to feature selection and to avoiding overfitting; when using L1 norm regularization, one speaks of Lasso regression (Least Absolute Shrinkage and Selection Operator). The L2 penalty (Σj βj²) encourages stability in the solution by shrinking all weights without setting them exactly to zero; when using L2 norm regularization, one speaks of Ridge regression.
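A minimal sketch of adding L1 and L2 penalty terms to a loss in PyTorch follows (the penalty weights are illustrative; note that most torch.optim optimizers also expose an L2 penalty through their weight_decay argument).

import torch
import torch.nn as nn

model = nn.Linear(20, 1)
x, y = torch.randn(64, 20), torch.randn(64, 1)
data_loss = nn.functional.mse_loss(model(x), y)

l1 = sum(p.abs().sum() for p in model.parameters())    # Lasso-style penalty, promotes sparsity
l2 = sum(p.pow(2).sum() for p in model.parameters())   # Ridge-style penalty, shrinks weights
loss = data_loss + 1e-4 * l1 + 1e-4 * l2
loss.backward()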
1.4 Dropout

Dropout (Srivastava et al., 2014), stochastic depth (which skips entire layers), and penalties on the activations of neurons are techniques aimed at simplifying the training phase. Dropout is an effective technique commonly used to regularize neural networks in order to counter overfitting. It consists in randomly removing a subset of hidden node values by setting them to 0, thereby limiting the co-adaptation of internal nodes that causes overfitting.
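A minimal usage sketch follows (the dropout rate of 0.5 and the layer sizes are illustrative).

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                    nn.Dropout(p=0.5),     # randomly zeroes 50% of the activations during training
                    nn.Linear(64, 10))

net.train()                                # dropout active during training
out_train = net(torch.randn(4, 128))
net.eval()                                 # dropout disabled at inference time
out_eval = net(torch.randn(4, 128))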
1.5 Network Pruning

Pruning is a neural network technique that encourages the development of smaller and more efficient neural networks. In network pruning, the idea is to remove superfluous parts from an existing network to produce a smaller network without impacting the network's accuracy. Pruning has been used since the late 1980s (Janowsky, 1989; Karnin, 1990; Reed, 1993) but has seen an explosion of interest in the past decade thanks to the rise of deep neural networks. The operation results in networks that run faster and reduces the high computational cost of deep learning architectures. The methods can be classified into unstructured and structured pruning approaches (Wan et al., 2013; Han et al., 2015; Liang et al., 2021; Xu et al., 2020; Wimmer et al., 2022; Blalock et al., 2020; Zhang et al., 2019; Mitsuno & Kurita, 2021; Vadera & Ameen, 2022). Unstructured approaches act without constraints on the entire neural network and provide the most flexibility for the model; they do not, however, effectively generate compact neural networks. Structured approaches, which impose a block structure on the sparse weights, are more suitable but require a difficult trade-off between compression and accuracy. Sparsity (Gale et al., 2019) refers to the property that a subset of the model parameters have a value of exactly zero. With zero-valued weights, the multiplications which dominate neural network computation can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. By imposing a sparsity constraint on the hidden layers, the idea is to compel the model to retain only the most critical parts of the original data, as the network is forced to capture a more effective feature representation of the input data. Various techniques, including dropout (Srivastava et al., 2014), stochastic depth (which can skip entire layers), or applying a penalty to neuron activations, are designed to streamline the training phase. Despite their interest, with these methods the entire network is still needed at the prediction stage, promoting the creation of novel approaches (Yoon & Hwang, 2017; Kang et al., 2017; Hoefler et al., 2021; Wimmer et al., 2022) aiming at producing more compact neural networks. To summarize, in the sparse autoencoder the reconstruction penalty acts as a degeneracy control, which allows for the sparsest representation by ensuring that the filter matrix does not learn copies or redundant features. According to Gale et al. (2019), this research field is still challenging, as complex techniques can perform inconsistently while simple heuristics can achieve comparable or better results.
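As a small illustration of unstructured magnitude pruning, here is a hedged sketch that zeroes the smallest weights of a layer; the pruned fraction is an arbitrary example, and PyTorch also offers utilities such as torch.nn.utils.prune for the same purpose.

import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
fraction = 0.5                                        # fraction of weights to remove (illustrative)

with torch.no_grad():
    magnitudes = layer.weight.abs().flatten()
    threshold = torch.quantile(magnitudes, fraction)  # magnitude below which weights are dropped
    mask = (layer.weight.abs() >= threshold).float()
    layer.weight.mul_(mask)                           # zero out the smallest weights in place

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")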
1.6 Sparsity

Sparsity (Gale et al., 2019) is an important direction for embedded methods. The idea is to obtain sparse feature scores, remove the features with zero scores, and combine the features with non-zero scores into a subset. To make the feature weights sparse, embedding a sparse regularization term (Rahangdale & Raut, 2019) into the learning model is a good option. By imposing the sparsity constraint on the hidden layers, the model is compelled to retain only the most critical parts of the original data, as the network is forced to capture a more effective feature representation of the input data. The sparsity approach in deep learning refers to a technique or methodology aimed at inducing and maintaining sparsity in the neural network's weights or activations. Sparsity refers to the property of having a significant number of elements equal to zero, resulting in a more concise and efficient representation of data. There are two main types of sparsity approaches in deep learning:
● Weight sparsity: In weight sparsity, the goal is to encourage a large number of weights in the neural network to be exactly zero. By setting certain weights to zero, the network effectively ignores the corresponding connections, reducing the model's complexity and memory requirements. This can lead to more efficient inference and reduced overfitting. One common method to achieve weight sparsity is L1 regularization (Lasso regularization). L1 regularization adds a penalty term to the loss function proportional to the absolute values of the weights. During training, this penalty encourages many weights to become exactly zero, effectively performing feature selection and simplifying the model.
● Activation sparsity: Activation sparsity aims to promote sparsity in the activations of neurons within the network. The idea is to have most of the neurons in a layer remain inactive (outputting zero values) for a given input, leading to a more efficient representation of the data. Techniques like dropout and drop connect, commonly used for regularization, can also induce activation sparsity. Dropout randomly sets a fraction of neuron activations to zero during training, forcing the network to learn robust and distributed representations. Drop connect extends this idea to the connections between neurons, randomly dropping certain connections during training.
The sparsity approach is especially valuable when dealing with large and complex deep learning models. By encouraging sparsity, we can achieve more compact representations, reducing the model's size and computational requirements. Sparse neural networks can also be more interpretable since only a subset of connections or activations are relevant for a particular task. However, achieving sparsity in deep learning models requires careful tuning of regularization parameters and hyperparameters. Additionally, sparse models might be more sensitive to parameter initialization and may require more specialized optimization techniques to train effectively. Overall, the sparsity approach offers an intriguing direction for
optimizing deep learning models, striking a balance between model complexity and efficiency while enhancing interpretability in certain cases.
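One common way to impose the activation sparsity discussed above, used for instance in sparse autoencoder formulations, is to penalize the average activation of each hidden unit so that it stays close to a small target value ρ. The following is a hedged sketch of such a penalty; the target value and the weight given to the penalty in the total loss are illustrative.

import torch

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    # hidden: (batch, units) activations in [0, 1]; rho: target average activation per unit
    rho_hat = hidden.mean(dim=0).clamp(eps, 1 - eps)
    return torch.sum(rho * torch.log(rho / rho_hat)
                     + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat)))

h = torch.sigmoid(torch.randn(32, 64))     # dummy hidden activations
penalty = kl_sparsity_penalty(h)           # added to the main loss with a small weight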
2 Standard and Novel Learning Strategies

Supervised, unsupervised, and reinforcement learning represent classical learning strategies widely used in machine learning. However, with the advent of deep learning approaches, innovative techniques have emerged to address the challenges of limited labeled data and to cater to unsupervised scenarios. This section provides a comprehensive review of these traditional and novel learning strategies, highlighting their applications and advantages.
2.1 Supervised Learning

The principle of supervised learning involves training a model using labeled examples to learn the relationship between input features and corresponding output labels. The trained model then uses this knowledge to make predictions or classify new, unseen data accurately. The fundamental principle of supervised learning is to learn a mapping between input data and corresponding output labels, using a training dataset that consists of input-output pairs. In supervised learning, the training data consists of input features (also called independent variables) and their corresponding output labels (also known as dependent variables). The goal of the model is to learn a function that can generalize from the given examples and accurately predict the output labels for unseen input data. The learning process in supervised learning typically involves two main steps: training and inference.
● Training: During the training phase, the model is presented with the labeled training data. It analyzes the input features and their associated labels to identify patterns, relationships, and correlations. The model adjusts its internal parameters or weights iteratively to minimize the difference between its predictions and the true labels. This process is often guided by an optimization algorithm, such as gradient descent, that updates the model's parameters to minimize a predefined loss or error function.
● Inference: Once the model has been trained, it can make predictions or classify new, unseen input data. During the inference phase, the trained model takes the input features and applies the learned function to generate predictions or output labels. The model leverages the patterns and relationships it learned during training to generalize to unseen data and make accurate predictions.
The performance of a supervised learning model is evaluated using various metrics, depending on the specific task. Common evaluation metrics include accuracy, precision, recall, F1 score, and mean squared error, among others. These metrics provide insights into how well the model generalizes and how accurately it predicts the correct labels for unseen data. Supervised learning is widely applied in various domains, including image and speech recognition, natural language processing, recommendation systems, fraud detection, and many more. It is particularly useful when labeled data is available, allowing the model to learn from past examples and make informed predictions on new, unseen data.
2.2 Unsupervised Learning

Unsupervised learning is a machine learning paradigm that involves discovering patterns, structures, or relationships in unlabeled data without explicit guidance or predefined output labels. The fundamental principle of unsupervised learning is to explore and extract meaningful information from the data to gain insights and make inferences. In unsupervised learning, the training data consists only of input features, and the objective is to uncover hidden patterns or groupings within the data. The goal is to find inherent structures or distributions in the data without being provided with specific labels or targets. The primary approaches in unsupervised learning are clustering and dimensionality reduction.
● Clustering algorithms aim to group similar instances together based on their feature similarities or proximity. The goal is to identify natural clusters within the data, where instances within the same cluster are more similar to each other than to instances in other clusters. Common clustering algorithms include k-means, hierarchical clustering, and density-based clustering methods such as DBSCAN.
● Dimensionality reduction techniques aim to reduce the number of features or dimensions in the data while retaining the most important information. These methods help in visualizing and analyzing high-dimensional data, removing redundant or irrelevant features, and identifying lower-dimensional representations that capture the underlying structure. Techniques like Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are commonly used for dimensionality reduction.
Unsupervised learning is often used for exploratory data analysis, pattern recognition, and anomaly detection. It can reveal hidden patterns, discover underlying structures, and provide insights into the data that may not be apparent initially. Unsupervised learning is particularly useful when dealing with large and complex datasets where manual labeling is impractical or unavailable. However, evaluating the performance of unsupervised learning algorithms can be more challenging compared to supervised learning, as there are no explicit labels to
measure against. Evaluation is often based on intrinsic measures, such as cohesion, separation, or silhouette scores, or by comparing against domain knowledge or expert validation. In summary, the principle of unsupervised learning involves extracting meaningful patterns or structures from unlabeled data without explicit guidance. Through clustering and dimensionality reduction, unsupervised learning provides valuable insights into the data and helps uncover hidden relationships or patterns that can be further explored and utilized for various applications. Among the most renowned unsupervised algorithms, k-means stands out for its widespread application in clustering data into distinct groups, while PCA is the preferred choice for dimensionality reduction tasks. k-means and PCA are arguably two of the most influential machine learning algorithms ever devised, and their effectiveness is further amplified by their straightforward and intuitive nature.
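As a minimal illustration of these two workhorses (a sketch assuming scikit-learn and synthetic data, not an example taken from the text), PCA can first reduce the dimensionality and k-means can then cluster the reduced data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic unlabeled data: 300 samples in 10 dimensions (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 10)) for c in (-3, 0, 3)])

# Dimensionality reduction: keep the 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the reduced data into 3 clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```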
2.3 Reinforcement Learning

Reinforcement learning is a machine learning paradigm that focuses on enabling agents to learn through interactions with an environment. The fundamental principle of reinforcement learning revolves around the concept of an agent, an environment, and rewards. In reinforcement learning, an agent interacts with an environment in a sequential manner. At each time step, the agent observes the current state of the environment, takes an action based on its policy, and receives feedback in the form of a reward signal from the environment. The goal of the agent is to learn an optimal policy, which is a mapping from states to actions, that maximizes the cumulative reward over time.
The agent's decision-making process is guided by the notion of maximizing long-term rewards. It does not have access to explicit instructions or labeled examples but instead learns through trial and error. Through a series of interactions with the environment, the agent learns which actions lead to desirable outcomes and which actions should be avoided based on the rewards received.
The core mechanism in reinforcement learning is the use of a value function or a Q-function. The value function estimates the expected cumulative reward that an agent can obtain from a particular state or state-action pair, whereas the Q-function estimates the expected cumulative reward from taking a specific action in a given state. These value estimates guide the agent's decision-making process by determining the desirability of different actions or state transitions. Reinforcement learning algorithms employ various techniques, such as value iteration, policy iteration, and Monte Carlo methods, to update the value estimates and improve the agent's policy over time. Exploration and exploitation strategies are used to strike a balance between trying out new actions to discover better policies and exploiting the already-learned knowledge to maximize rewards.
Reinforcement learning has been successfully applied in a wide range of domains, including robotics, game-playing, recommendation systems, and autonomous vehicles. It has demonstrated its ability to learn optimal strategies in complex, dynamic environments where the best actions may depend on the current state and are not known in advance. Overall, the principle of reinforcement learning revolves around an agent’s ability to learn through trial-and-error interactions with an environment, seeking to maximize long-term rewards by iteratively updating value estimates and improving its policy.
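The value-update loop described above can be sketched with tabular Q-learning; the tiny chain environment, the reward scheme, and the hyper-parameters below are invented purely for illustration:

```python
import numpy as np

# Toy environment: 5 states in a chain, actions 0 (left) and 1 (right);
# reaching the last state yields reward 1 (all values invented for the sketch).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)

def step(s, a):
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(300):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: explore with probability eps, otherwise exploit Q.
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update toward the bootstrapped target r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # greedy action per state (1, i.e., "right", for non-terminal states)
```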
2.4 Transfer Learning

Transfer learning (Weiss et al., 2016) refers to techniques in which a model developed for one task is reused as the starting point for a model on a second task. Domain adaptation (Wang & Deng, 2018) is a sub-discipline of transfer learning that deals with scenarios in which a model trained on a source distribution is used in the context of a different but related target distribution. Domain adaptation uses labeled data in one or more source domains to solve new tasks in a target domain. In general, the source and target data have the same feature space.
Pre-trained convolutional neural networks, or convnets, have become the building blocks in most computer vision applications. As a matter of fact, ImageNet is relatively small by today's standards; it "only" contains a million images that cover the specific domain of object classification, and state-of-the-art classifiers now leave little error unresolved on it. Obtaining good-quality image data without a high cost in time and effort is difficult, which often leads to sub-optimally trained models. A natural way to move forward is to build a bigger and more diverse dataset, potentially consisting of billions of images. This, in turn, would require a tremendous amount of manual annotation, despite the expert knowledge in crowd-sourcing accumulated by the community over the years.
2.5 Self-Supervised Learning

Self-supervised learning (Ohri & Kumar, 2021; Jing & Tian, 2020) is a specific form of unsupervised learning that has been developed in particular to address the image annotation issue. The idea consists of using pretext tasks to replace human-annotated labels with "pseudo-labels" computed directly from the raw input data. Self-supervised learning is a machine learning process where the model trains itself to learn one part of the input from another part of the input. The core idea is to generate supervisory signals by making sense of the unlabeled data provided to it in an unsupervised fashion on the first iteration. As in unsupervised learning, unstructured data is
provided as input to the model. However, the model annotates the data on its own, and labels that have been predicted with high confidence are used as ground truths in future iterations of the model training. To synthesize, self-supervised learning relies on the exploitation of different labelings that are freely available beside or within visual data; these can then serve as intrinsic reward signals to learn general-purpose features. It is also known as predictive or pretext learning. The task used for pre-training is known as the pretext task, e.g., in the image field, relative positioning, rotation, colorization, jigsaw puzzles, mutual information, instance discrimination, and so on. These pretext tasks are designed by directly generating supervisory signals from the raw images without manual labeling and aim to learn well-pretrained representations for downstream tasks such as image classification, object detection, and semantic segmentation. The aim of the pretext task, which acts as a supervised task, is to guide the model to learn intermediate representations of the data. It is useful for understanding the underlying structural meaning that is beneficial for practical downstream tasks. It is important to underline that the network is forced to learn what we really care about, i.e., a semantic representation, which is discovered rather than directly tailored to a specific classification task. The only difference with a supervised scenario is that the data labels used as ground truths change at every iteration: the labels are generated by the model itself, which decides whether they are reliable or not and, accordingly, uses them in the next iteration to tune its weights.
The pipeline of self-supervised learning consists of two stages:
● Pre-training Convolutional Networks (ConvNets) on a large unlabeled dataset
● Fine-tuning the pre-trained network for downstream tasks with a small set of labeled data
The first stage forces the network to learn useful and semantic information from the unlabeled data, which can boost the subsequent classification task's accuracy with limited labeled data. Recently, researchers have attempted to learn useful information from unlabeled data with unsupervised approaches. Popular pretext tasks include reconstruction pretexts and adversarial objectives (Mirza & Osindero, 2014). The techniques can be roughly divided into the following families (Ohri & Kumar, 2021):
● Joint embedding architectures, where two embedding vectors are produced in the same way from two views of the input and compared in the latent space.
● Contrastive learning approaches with different variants (Jaiswal et al., 2020).
As an elementary illustration of self-supervised learning suggested by Chaudhary (2020), the human annotation step is replaced with a creative utilization of inherent data characteristics to establish a pseudo-supervised task. For instance, instead of manually categorizing images as "cat" or "dog," we can dynamically rotate them by 0, 90, 180, or 270 degrees and then train a model to predict the rotation angle. This approach allows us to generate an abundant supply of training data by leveraging the vast pool of freely available internet images (Fig. 7.2).
Fig. 7.2 Self-supervision example from Chaudhary (2020)
Different illustrations of self-supervised learning can be found at https://amitness.com/2020/02/illustrated-self-supervised-learning/.
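For concreteness, the rotation pretext task mentioned above might be sketched as follows (a hedged PyTorch illustration with a placeholder encoder and made-up training details, not the implementation of any particular paper):

```python
import torch
import torch.nn as nn

# Pseudo-labels: rotate each image by 0/90/180/270 degrees and predict the rotation index.
def make_rotation_batch(images):                 # images: (B, C, H, W)
    ks = torch.randint(0, 4, (images.size(0),))  # rotation index per image
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, ks)])
    return rotated, ks

encoder = nn.Sequential(                         # placeholder encoder for the sketch
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 4)                          # 4 rotation classes
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)               # unlabeled batch (random data for the sketch)
rotated, targets = make_rotation_batch(images)
loss = criterion(head(encoder(rotated)), targets)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```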
2.6 Semi-Supervised Learning

Semi-supervised learning is an active area of research that aims to reduce the number of ground-truth labels required for training. It is aimed at common practical scenarios in which only a small subset of a large dataset has corresponding ground-truth labels. Unsupervised domain adaptation (Ganin & Lempitsky, 2015) is a closely related problem in which one attempts to transfer knowledge gained from a labeled source dataset to a distinct unlabeled target dataset, within the constraint that the objective must remain the same. In brief, supervised methods utilize the labeled information to measure the correlation between features and class labels, whereas unsupervised methods evaluate feature relevance by preserving specific properties of the data. In real life, obtaining unlabeled samples is easy in comparison with labeled samples, so semi-supervised methods aim at solving such problems by using both labeled and unlabeled information.
One of the common techniques in semi-supervised learning is the "self-training" approach, where the model is first trained on the labeled data. The trained model is then used to generate pseudo-labels for the unlabeled data. These pseudo-labels are treated as ground truth for the unlabeled samples, and the model is retrained on the combined set of labeled and pseudo-labeled data. Concretely, pseudo-labeling is the process of using a model trained on the labeled data to predict labels for the unlabeled data. First, a model is trained on the labeled dataset; that model is then used to generate pseudo-labels for the unlabeled dataset. Finally, both datasets and their labels (original labels and pseudo-labels) are combined for a final model training. It is called pseudo (which means unreal) as
these may or may not be the real labels, since they are generated by a model trained on similar data. A minimal self-training loop is sketched below.
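The sketch below assumes scikit-learn; the classifier and confidence threshold are arbitrary choices for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=3):
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X_l, y_l)
        proba = model.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold       # keep only high-confidence pseudo-labels
        if not confident.any():
            break
        X_l = np.vstack([X_l, X_unlabeled[confident]])
        # argmax index equals the class label here because labels are 0..k-1
        y_l = np.concatenate([y_l, proba[confident].argmax(axis=1)])
        X_unlabeled = X_unlabeled[~confident]
    return model

# Tiny synthetic example (data invented for illustration): 20 labeled, 180 unlabeled samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); y = (X[:, 0] > 0).astype(int)
model = self_training(X[:20], y[:20], X[20:])
```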
2.7 Active Learning

Active learning (Settles, 2009) can be seen as a special case of semi-supervised machine learning. The core idea is to limit the training of the learning algorithm to a selected subset of the original data, the selection being based on certain criteria. From a fully labeled dataset X, a small amount is used as the seed L, since these samples already have labels, and the rest U is treated as if it were unlabeled (X = L ∪ U). One condition is to ensure that the subset is representative of the true distribution of the original data. The central questions concern the selection of key data for annotation and the way human annotation is used, e.g., to guide clustering. There are several scenarios for selecting the data subset, but all require some sort of informativeness measure of the unlabeled instances. A pedagogic guide to understanding active learning can be found at https://www.v7labs.com/blog/active-learning-guide.
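One common informativeness measure is prediction uncertainty. The sketch below (assuming scikit-learn, with an invented oracle that reveals labels on request) repeatedly queries the pool samples the current model is least confident about:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
true_y = (X[:, 0] + X[:, 1] > 0).astype(int)       # hidden labels, revealed only on query

labeled = list(range(10))                          # small seed set L
pool = list(range(10, 500))                        # "unlabeled" pool U

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], true_y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)          # least-confident sampling
    query = [pool[i] for i in np.argsort(uncertainty)[-10:]]  # 10 most uncertain samples
    labeled += query                               # the "oracle" annotates them
    pool = [i for i in pool if i not in query]

print(model.score(X, true_y))
```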
2.8 Similarity Learning

Similarity learning is a supervised machine learning technique in which the goal is to make the model learn a similarity function that measures how similar two objects are and returns a similarity value. A high score is returned when the objects are similar, and a low score is returned when they are different. A typical use case is one-shot classification with Siamese networks. The idea behind similarity learning is to learn a function that maps input data into a high-dimensional feature space, where the similarity between inputs can be quantified by measuring the distance or similarity between their corresponding feature representations. This learned function is called a similarity metric or similarity measure. The goal of similarity learning is to optimize the similarity metric such that it captures the underlying structure and relationships between the input data in a meaningful way. Specifically, inputs that are similar should be mapped to feature representations that are close together in the feature space, while dissimilar inputs should be mapped to feature representations that are far apart. Similarity learning can be used in a wide range of applications such as image and video retrieval, face recognition, recommender systems, and clustering. In these applications, the learned similarity metric is used to identify and retrieve similar data points, classify or recognize objects, or group similar data points together.
One popular method for similarity learning is Siamese networks, which consist of two identical neural networks that share weights and are trained to produce similar feature representations for similar inputs and dissimilar feature representations for dissimilar inputs. Another method is Contrastive Learning, which aims to learn a similarity metric by maximizing the similarity between augmented versions of the same input and minimizing the similarity between different inputs. Overall, similarity learning is a powerful approach for learning meaningful representations of complex data that can be used for a wide range of applications.
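A minimal Siamese setup might be sketched as follows (a PyTorch illustration with a placeholder encoder, not tied to any particular paper): one encoder is shared between the two inputs and trained with a margin-based loss on the pairwise distance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # shared weights

def contrastive_pair_loss(x1, x2, same, margin=1.0):
    """same = 1 if the two inputs share the same class/identity, else 0."""
    d = F.pairwise_distance(encoder(x1), encoder(x2))
    # pull similar pairs together, push dissimilar pairs apart up to the margin
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Random pairs for the sketch; in practice these come from labeled or mined pairs.
x1, x2 = torch.randn(16, 128), torch.randn(16, 128)
same = torch.randint(0, 2, (16,)).float()
loss = contrastive_pair_loss(x1, x2, same)
loss.backward()
```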
2.9 Self-Paced Learning

The idea behind self-paced learning is to better guide the clustering process by selecting suitable samples adaptively along the way. It is based on the concept of curriculum learning (Bengio et al., 2009; Wang et al., 2021), which consists of first learning simple knowledge, followed by more difficult and specialized knowledge. Self-paced learning arises from the assumption that easy samples with smaller losses ought to be selected in the early stages, while complex samples with larger losses should be selected later or not at all. In the iterative process, a growing number of complex samples are involved in the model until the model is "mature." As a result, the relatively complex samples may be excluded from the model or included with smaller weights, which is another way to strengthen its robustness. Concretely, the first step is to select a subset of samples with small losses for training, so as to obtain an accurate initial model; more samples are then added by gradually increasing a threshold value, enhancing the generalization ability of the model until it becomes stable. An intuitive presentation of self-paced learning can be found at https://towardsdatascience.com/self-paced-learning-for-machine-learning-f1c489316c61.
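The easy-to-hard selection can be sketched as a loop that keeps only the samples whose loss falls below a threshold that grows over iterations (a generic illustration; the model, the per-sample loss, and the threshold schedule are placeholders):

```python
import numpy as np

def self_paced_training(X, y, fit, per_sample_loss, lam0=0.5, growth=1.5, rounds=5):
    """fit(X, y) -> model; per_sample_loss(model, X, y) -> array of per-sample losses."""
    lam = lam0
    model = fit(X, y)                      # initial model (could also be a seed subset)
    for _ in range(rounds):
        losses = per_sample_loss(model, X, y)
        easy = losses <= lam               # select currently "easy" samples
        model = fit(X[easy], y[easy])      # retrain on the selected subset
        lam *= growth                      # raise the threshold: admit harder samples
    return model

# Example usage with a least-squares model (all choices invented for the sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
loss = lambda w, X, y: (X @ w - y) ** 2
w = self_paced_training(X, y, fit, loss)
```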
2.10 Subspace Learning

Based on self-representation, the idea of subspace learning (Wang et al., 2015) consists in embedding the potential characteristics of the data into a low-dimensional space through a projection matrix. Through subspace learning, a subset of features (l features) is initially selected to eliminate irrelevant features. Then the original high-dimensional data is reconstructed from the representative features by means of the coefficient matrix, which prevents the influence of noisy features as much as possible. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two notable linear methods for subspace learning. They may
fail to find the underlying non-linear structure of the data under consideration, and they may lose some discriminant information about the manifolds during the linear projection. Many non-linear learning methods based on graph embedding, such as ISOMAP, locally linear embedding (LLE), and Laplacian eigenmaps, have been proposed. One of the current challenges of graph-embedding techniques is to take both the global and local data structures into account. By exploiting the low-dimensional subspace structure, subspace clustering can effectively alleviate both the curse of dimensionality and linear inseparability.
3 Self-Supervision in Deep

Self-supervised learning (SSL) is one of the novel techniques that has played a crucial role in achieving a milestone in deep clustering. In SSL, the learning signal is generated from the data itself, rather than relying on explicit labels. Unlike traditional supervised learning, where data is labeled with explicit targets, SSL aims to create surrogate tasks from the input data, effectively making the data itself serve as the supervisory signal. This enables training deep learning models in scenarios where obtaining large-scale labeled data is expensive or impractical. The novel idea is to design pretext tasks that require the model to predict certain parts or properties of the input data. By training the model to solve these pretext tasks, it learns to extract useful and informative features from the data. Key characteristics and examples of self-supervised learning approaches include:
● Pretext Tasks: In SSL, pretext tasks are constructed from the input data, creating artificial labels or objectives for the network to predict. These pretext tasks are designed such that solving them requires the model to learn meaningful representations of the input data. For instance, in image-based SSL, a pretext task could be predicting image rotation, image colorization, image inpainting (reconstructing masked parts of an image), or image context prediction (predicting the context given a patch).
● Contrastive Learning: Contrastive learning is a popular approach within SSL, where the model is trained to discriminate between similar and dissimilar pairs of samples. By mapping similar samples closer and pushing dissimilar samples apart in the learned embedding space, the model learns to capture useful features from the data. Examples of contrastive learning methods include Siamese networks, InfoNCE (Information Noise-Contrastive Estimation), and SimCLR (A Simple Framework for Contrastive Learning of Visual Representations).
● Temporal Relationships: In sequences, such as videos or audio data, temporal self-supervision can be applied. This involves designing pretext tasks based on temporal relationships, like predicting the next frame in a video sequence or predicting the order of shuffled frames. Temporal contrastive learning and temporal context prediction are common approaches in this category.
● Generative Models: Self-supervision can be achieved through generative models, where the model is trained to generate parts of the input data from other parts. For example, in autoencoders, the model is trained to reconstruct the input data from a compressed representation. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are popular generative models used for self-supervised learning.
● Data Augmentation as Pretext Task: Data augmentation, a common technique in deep learning, can also serve as a pretext task for self-supervised learning. By generating augmented versions of the input data and treating them as different views of the same sample, the model can learn to align these views in the feature space, encouraging robust representations.
The benefits of self-supervised learning:
● Reduced dependence on labeled data, making it applicable to a broader range of tasks.
● Potential for better generalization due to learning from rich, diverse pretext tasks.
● Enables learning powerful, task-agnostic representations, transferable to downstream tasks.
● Can be applied to a wide range of data types, including images, text, and sequential data.
Challenges:
● Designing effective pretext tasks that lead to meaningful feature learning.
● Balancing the model's capacity and complexity to avoid overfitting on pretext tasks.
● Challenges in large-scale deployment and computational requirements.
Overall, self-supervised learning approaches have shown promising results, unlocking new possibilities for training deep learning models in scenarios with limited labeled data while advancing the state-of-the-art in representation learning.
3.1 Pretext Tasks

The notion of a pretext task is a fundamental concept in self-supervised learning, where a model is trained to solve an auxiliary or artificial task using the input data itself as a source of supervision. The primary goal of pretext tasks is to induce the model to learn meaningful and useful representations of the data, which can then be transferred to other downstream tasks where labeled data may be scarce or expensive to obtain. In self-supervised learning, the term "pretext" refers to the task that serves as a proxy or stand-in for the actual task of interest. Pretext tasks are designed to encourage the model to capture essential underlying patterns, structures, and relationships present in the data. These learned representations can subsequently be used to improve performance on different tasks that may have limited labeled
data available. The pretext task is not the final objective but rather an intermediate learning objective to facilitate representation learning. Here are some common examples of pretext tasks used in self-supervised learning:
● Image Context Prediction: Given an image, the model is trained to predict the context surrounding a randomly chosen patch within the image. By learning to predict the surrounding context, the model can gain an understanding of the spatial relationships within the image, which can be valuable for various computer vision tasks.
● Image Rotation Prediction: The model is trained to predict the angle of rotation applied to an image. This task encourages the model to learn rotation-invariant features and representations, which can be useful for object recognition and detection tasks.
● Contrastive Learning: The model is trained to distinguish between similar and dissimilar pairs of data samples. Similar pairs could be different augmentations of the same image, while dissimilar pairs could be images from different classes or different domains. This encourages the model to learn a compact and discriminative feature space.
● Temporal Order Prediction: In sequential data like videos or time series, the model is trained to predict the correct temporal order of frames or events. This encourages the model to learn meaningful temporal dependencies and can be helpful for action recognition or temporal event understanding.
The choice of pretext task is critical in self-supervised learning, as a well-designed task can lead to more informative and transferable representations. Researchers aim to create pretext tasks that force the model to understand high-level semantic features, spatial relationships, temporal dynamics, or other essential aspects of the data that are relevant for the final target task. Once the model is trained on the pretext task, the learned representations can be fine-tuned or used directly for specific downstream tasks (e.g., image classification, object detection, sentiment analysis), requiring only a small amount of labeled data, or even none in some cases, to achieve competitive performance.
3.2 Contrastive Learning

This learning method is not explicitly designed for clustering tasks but has implicitly played a role in deep clustering advances. Contrastive learning was developed to improve the performance of self-supervised learning. Its corresponding pretext task is that the features encoded from multiple views of the same image should be similar to each other; the core insight behind these methods is to learn multi-view invariant representations. Contrastive learning is a technique that enhances the performance of vision tasks by using the principle of contrasting samples against
each other to learn attributes that are common between data classes and attributes that set one data class apart from another. Contrastive learning is a machine learning paradigm where unlabeled data points are juxtaposed against each other to teach a model which points are similar and which are different. That is, as the name suggests, samples are contrasted against each other: those belonging to the same distribution are pushed toward each other in the embedding space, while those belonging to different distributions are pushed away from each other. The basic contrastive learning framework consists of selecting a data sample, called the "anchor," a data point belonging to the same distribution as the anchor, called the "positive" sample, and another data point belonging to a different distribution, called the "negative" sample.
In synthesis, contrastive learning aims to learn discriminative representations by optimizing the contrastive loss, where the representations from augmentations of the same image are encouraged to be closer, while other augmentations remain separable. The basic idea of contrastive learning is to map the original data to a feature space wherein the similarities of positive pairs are maximized while those of negative pairs are minimized. In early works, the positive and negative pairs are known a priori. Recently, various works have shown that large quantities of data pairs are crucial to the performance of contrastive models, and they can be constructed using the following two strategies under the unsupervised setting. One is to use clustering results as pseudo-labels to guide the pair construction. The other, which is more direct and commonly used, is to treat each instance as a class represented by a feature vector, with data pairs constructed through data augmentations. To be specific, the positive pair is composed of two augmented views of the same instance, and the other pairs are defined to be negative. Given the data pairs, several loss functions have been proposed for contrastive learning. The network (e.g., a CNN encoder) projects each image crop into an embedding and pulls the embeddings from the same source closer to each other while pushing embeddings from different sources apart (Fig. 7.3). By solving the task of instance discrimination, the network is expected to learn a useful representation of the image.
Contrastive learning has become a popular topic (Chen et al., 2020a), and several attempts have been made to utilize the contrastive loss to improve deep clustering performance (Van Gansbeke et al., 2020; Dang et al., 2021). Typically, Van Gansbeke et al. (2020) proposed a two-stage deep clustering method that adopts contrastive learning as a pretext task to learn discriminant features and then exploits the k-nearest neighbors (via the learned features) in the second-stage network training. Dang et al. (2021) proposed the nearest neighbor matching (NNM) method, which considers not only the global nearest neighbors but also the local nearest neighbors. Li et al. (2021) performed contrastive learning at both the instance level and the cluster level and obtained the clustering result via a cluster projector. Contrastive learning has emerged as an effective technique for enhancing deep clustering performance; it typically generates positive and negative sample pairs via data augmentations and aims to maximize the agreement between positive pairs and minimize the agreement between negative pairs.
Fig. 7.3 Pull/Push process of contrastive learning
Contrastive learning can be used in supervised scenarios but also in self-supervised ones (Chen et al., 2020a,b) that focus more on the knowledge discovery task. The principle is increasingly investigated (Caron et al., 2020; Van Gansbeke et al., 2020; Zhong et al., 2020; Li et al., 2021; Deng et al., 2023), and the results obtained appear very promising. It remains an active research topic, as it can be challenging to choose the "right" and appropriate transformations for a specific task.
The idea behind a contrastive loss function is to learn a similarity metric that can distinguish between similar and dissimilar data points. The goal of the contrastive loss function is to encourage the model to map similar inputs to nearby points in the feature space while pushing dissimilar inputs further apart. This is achieved by minimizing the distance between similar inputs and maximizing the distance between dissimilar inputs in the learned feature space (Fig. 7.4). The contrastive loss function works by comparing pairs of input data points and computing a loss that depends on their similarity. Given a pair of input data points, the model computes their feature representations using a learned neural network. The distance between the two feature representations is then computed using a distance metric such as the Euclidean distance or cosine similarity. The contrastive loss function is then computed based on this distance, encouraging similar pairs to be close together and dissimilar pairs to be far apart. Overall, the contrastive loss function is a powerful tool for learning a similarity metric that can be used in a wide range of applications, including image and video retrieval, face recognition, and clustering. Typical losses are the following: the logistic loss ($L_l$), the max-margin contrastive loss ($L_m$), InfoNCE ($L_{NCE}$), and the NT-Xent loss ($L_{NXent}$).
Fig. 7.4 Contrastive with Anchor: The distance function can be anything, from Euclidean distance to cosine distances in the embedding space. (From source https://medium.com/aiguys/contrastivelearning-explained-17fa79f707bf)
The logistic loss is a simple convex loss function popularly used in the supervised learning literature. It is mathematically expressed as:

$$L_l = \frac{1}{n}\sum_{i=1}^{n}\Big[-y_i \log \theta(s_i) - (1 - y_i)\log\big(1 - \theta(s_i)\big)\Big] \qquad (7.3)$$
The max-margin contrastive loss maximizes the distance between samples if they do not belong to the same distribution and minimizes the distance between them if they belong to the same distribution. It is mathematically represented as follows:

$$L_m(s_i, s_j, \theta) = \mathbb{1}[y_i = y_j]\,\big\|\theta(s_i) - \theta(s_j)\big\|_2^2 + \mathbb{1}[y_i \neq y_j]\,\max\big(0,\; m - \big\|\theta(s_i) - \theta(s_j)\big\|_2^2\big) \qquad (7.4)$$
The InfoNCE loss, where NCE stands for Noise-Contrastive Estimation, is rooted in the noise-contrastive estimation framework and is primarily used in the context of contrastive learning. Its purpose is to quantify the degree of similarity or dissimilarity between pairs of data points in a learned embedding space. To achieve
this, the InfoNCE loss aims to maximize the agreement between positive pairs while simultaneously minimizing the agreement between negative pairs. The fundamental idea driving the InfoNCE loss is to reframe the contrastive learning problem as a binary classification task. Given a positive pair, typically consisting of augmented views of the same data instance, and a set of negative pairs comprising instances from different samples, the model is trained to discern between positive and negative instances. This discrimination is based on a probabilistic approach, often employing the softmax function to measure the similarity between these instances. To predict future information, the target $x$ (future) and the context $c$ (present) are encoded into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals $x$ and $c$, defined as:

$$I(x, c) = \sum_{x, c} p(x, c)\,\log \frac{p(x \mid c)}{p(x)} \qquad (7.5)$$
By maximizing the mutual information between the encoded representations (which is bounded by the mutual information between the input signals), the underlying latent variables the inputs have in common are extracted. The density ratio which preserves the information between $x_{t+k}$ and $c_t$ is modeled as follows:

$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})} \qquad (7.6)$$

where $\propto$ denotes proportionality up to a multiplicative constant. Let $c_t$ be the context (present) and $X = \{x_1, x_2, \ldots, x_N\}$ denote the set of $N$ random samples containing one positive sample from $p(x_{t+k} \mid c_t)$ and $N - 1$ negative samples from the proposal distribution $p(x_{t+k})$; the loss function can then be mathematically represented as follows:
$$L_{\mathrm{InfoNCE}} = -\,\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right] \qquad (7.7)$$
The numerator is essentially the score of the positive pair, and the denominator is the sum of the scores of the positive and all negative pairs. The loss forces the positive pair to have a larger score (pushing the ratio inside the logarithm toward 1, and hence the loss toward 0) and pushes negative pairs further apart. The InfoNCE loss finds application in self-supervised contrastive learning, where the positive pairs are crafted by augmenting views of a single data instance, and the negative pairs are generated using instances from distinct samples. By optimizing the InfoNCE loss, the model effectively learns to capture meaningful relationships and distinctions among the data points, ultimately resulting in robust and informative data representations.
The Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss is a modification of the multi-class N-pair loss with the addition of a temperature parameter $\tau$. It is mathematically represented as follows:

$$L_{ij} = -\log \frac{\exp\big(\langle z_i \cdot z_j \rangle / \tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{k \neq i}\,\exp\big(\langle z_i \cdot z_k \rangle / \tau\big)} \qquad (7.8)$$
where $\langle \cdot \rangle$ stands for the dot product between the latent variables $z_i$ and $z_j$, $\tau$ is the temperature parameter, and $\mathbb{1}_{k \neq i}$ is an indicator function evaluating to 1 if $k \neq i$. An illustrative example is given in Fig. 7.5.
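A direct implementation of Eq. (7.8) for a batch of 2N embeddings might look as follows (a PyTorch sketch assuming that rows i and i+N are the two views of the same image and using cosine similarity, i.e., normalized dot products; this is an illustration, not the reference implementation of any method):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """z: (2N, d) embeddings; rows i and i + N are the two views of the same image."""
    z = F.normalize(z, dim=1)                     # cosine similarity via normalized dot products
    sim = z @ z.t() / tau                         # (2N, 2N) similarity matrix
    n2 = z.size(0)
    sim.fill_diagonal_(float('-inf'))             # exclude k == i from the denominator
    n = n2 // 2
    pos = torch.cat([torch.arange(n, n2), torch.arange(0, n)])  # index of each row's positive
    # Cross-entropy over the remaining candidates realizes -log softmax at the positive's index.
    return F.cross_entropy(sim, pos)

loss = nt_xent_loss(torch.randn(8, 128))          # random embeddings, just to exercise the sketch
```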
3.3 Data Augmentation

Data augmentation (Shorten & Khoshgoftaar, 2019; Shorten et al., 2021; Bayer et al., 2022) is a process of artificially increasing the amount of data by generating new data points from existing data, generally via the modification of examples within the original dataset. This includes adding minor alterations to the data or using machine learning models to apply various transformations guided by domain knowledge. Common image augmentation methods are color jittering, image flipping, image noising, and random affine transforms. The common objective is to generate
Fig. 7.5 Example of input augmentation and NT-Xent loss calculation ($\tau = 1$): $L_{1,2} = \dfrac{\exp\langle 1 \cdot 2 \rangle}{\exp\langle 1 \cdot 2 \rangle + \ldots + \exp\langle 1 \cdot 6 \rangle}$. (From https://medium.com/self-supervised-learning/nt-xent-loss-normalized-temperature-scaled-cross-entropy-loss-ea5a1ede7c40)
Fig. 7.6 Data augmentation illustration
new data points without changing the semantic characteristics of the data. Data augmentation can apply to all machine learning applications where acquiring quality data is challenging. Ideally, the transformations should be adapted to better capture the data semantics. Advanced models for data augmentation include adversarial machine learning, GANs, and neural style transfer. Data augmentation (Fig. 7.6) constitutes a form of weak supervision currently used in contrastive and self-supervised learning strategies: it plays a crucial role in contrastive self-supervised learning, where the objective is to differentiate between a sample's augmentations (positives) and other samples (negatives). However, the choice of augmentation strength is pivotal and should be kept under control: strong augmentations have the potential to alter the sample identity of positives, while weak augmentations result in easily distinguishable positives/negatives, leading to nearly zero loss and ineffective learning. A carefully optimized augmentation strategy can yield significant information gains, as demonstrated by the success of methods like SimCLR and SwAV compared to other approaches such as ICC and IMSAT.
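In practice, such augmentations are composed into a random transformation pipeline. The sketch below (assuming torchvision; the parameter values are arbitrary) produces two augmented "views" of the same image, as used in contrastive self-supervised learning:

```python
from torchvision import transforms

# Random augmentation pipeline (parameter values are illustrative only).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Two independently augmented views of the same image (the positive pair)."""
    return augment(pil_image), augment(pil_image)
```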
References Bayer, M., Kaufhold, M.-A., & Reuter, C. (2022). A survey on data augmentation for text classification. ACM Computing Surveys, 55(7), 1–39. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning (pp. 41–48). Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., & Guttag, J. (2020). What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2, 129–146. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924.
Chaudhary, A. (2020). The illustrated self-supervised learning. https://amitness.com/2020/02/ illustrated-self-supervised-learning. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597– 1607). PMLR. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. E. (2020b). Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, 22243–22255. Dang, Z., Deng, C., Yang, X., Wei, K., & Huang, H. (2021). Nearest neighbor matching for deep clustering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13693–13702). Deng, X., Huang, D., Chen, D.-H., Wang, C.-D., & Lai, J.-H. (2023). Strongly augmented contrastive clustering. Pattern Recognition, 139, 109470. Gale, T., Elsen, E., & Hooker, S. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning (pp. 1180–1189). PMLR. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Regularization for deep learning. Deep Learning, 216–261. Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 1135–1143. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., & Peste, A. (2021). Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241), 1–124. Hou, C., Nie, F., Li, X., Yi, D., & Wu, Y. (2013). Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6), 793–804. Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., & Makedon, F. (2020). A survey on contrastive self-supervised learning. Technologies, 9(1), 2. Janowsky, S. A. (1989). Pruning versus clipping in neural networks. Physical Review A, 39(12), 6600. Jing, L., & Tian, Y. (2020). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037–4058. Kang, G., Li, J., & Tao, D. (2017). Shakeout: A new approach to regularized deep neural network training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5), 1245–1258. Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239–242. Li, Y., Hu, P., Liu, Z., Peng, D., Zhou, J. T., & Peng, X. (2021). Contrastive clustering. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10), 8547–8555. Liang, T., Glossner, J., Wang, L., Shi, S., & Zhang, X. (2021). Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461, 370–403. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Mitsuno, K., & Kurita, T. (2021). Filter pruning using hierarchical group sparse regularization for deep convolutional neural networks. In 2020 25th international conference on pattern recognition (ICPR) (pp. 1089–1095). IEEE. Ohri, K., & Kumar, M. (2021). Review on self-supervised image recognition using deep neural networks. Knowledge-Based Systems, 224, 107090. Rahangdale, A., & Raut, S. (2019). 
Deep neural network regularization for feature selection in learning-to-rank. IEEE Access, 7, 53988–54006. Reed, R. (1993). Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5), 740– 747. Settles, B. (2009). Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.
Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48. Shorten, C., Khoshgoftaar, T. M., & Furht, B. (2021). Text data augmentation for deep learning. Journal of Big Data, 8(1), 1–34. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958. Vadera, S., & Ameen, S. (2022). Methods for pruning deep neural networks. IEEE Access, 10, 63280–63300. Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., & Van Gool, L. (2020). Scan: Learning to classify images without labels. In European conference on computer vision (pp. 268–285). Springer. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013). Regularization of neural networks using dropconnect. In International conference on machine learning (pp. 1058–1066). PMLR. Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135–153. Wang, S., Pedrycz, W., Zhu, Q., & Zhu, W. (2015). Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognition, 48(1), 10–19. Wang, X., Chen, Y., & Zhu, W. (2021). A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 4555–4576. Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1), 1–40. Wimmer, P., Mehnert, J., & Condurache, A. P. (2022). Dimensionality reduced training by pruning and freezing parts of a deep neural network, a survey. arXiv preprint arXiv:2205.08099. Xu, S., Huang, A., Chen, L., & Zhang, B. (2020). Convolutional neural network pruning: A survey. In 2020 39th Chinese Control Conference (CCC) (pp. 7458–7463). IEEE. Yoon, J., & Hwang, S. J. (2017). Combined group and exclusive sparsity for deep neural networks. In International conference on machine learning (pp. 3958–3966). PMLR. Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., & Yu, B. (2019). Recent advances in convolutional neural network acceleration. Neurocomputing, 323, 37–51. Zhong, H., Chen, C., Jin, Z., & Hua, X.-S. (2020). Deep robust clustering by contrastive learning. arXiv preprint arXiv:2008.03030.
Chapter 8
Deep Feature Selection
Feature selection plays a crucial role in machine learning by identifying the most relevant and informative features from the input data, leading to improved model performance and reduced computational complexity. Deep learning, with its ability to automatically learn hierarchical representations from data, has shown promise in feature extraction tasks. This chapter explores the fascinating world of deep feature selection, where we harness the power of deep neural networks to automatically learn and select discriminative features. It paves the way for understanding the cutting-edge techniques of deep feature selection and emphasizes the transformative impact of leveraging deep learning for extracting valuable and representative features from complex data.
1 Limitations of Conventional Feature Selection Methods

The ever-increasing complexity and volume of data have driven the need for more sophisticated feature selection techniques. Traditional feature selection methods often rely on handcrafted features or statistical measures, which may not capture the intricate underlying patterns in high-dimensional data. These methods are often labor-intensive, requiring domain expertise to manually engineer features or select appropriate statistical measures. Moreover, they may not effectively handle the curse of dimensionality when dealing with high-dimensional data. As the dimensionality increases, the computational cost of evaluating all possible feature subsets becomes prohibitive. Additionally, traditional methods may overlook complex, non-linear relationships between features, leading to suboptimal feature subsets.
2 Measurement Criterion Issue

The measurement criterion is an important research direction in feature selection but remains challenging. Concretely, unsupervised methods mainly rely on heuristic evaluation criteria, such as the rank ratio, the Laplacian score, and the variance, to evaluate the importance of features or feature subsets. The idea is to select the top-k important features or the best representative feature subset. By scoring each feature dimension independently, without considering the intrinsic relationships between different feature units, the selected feature dimensions may be spread across all of the feature units, so that redundant feature units still have to be extracted in practice, leading to low efficiency and a lack of robustness. Some criteria, such as the data variance, which ranks each feature by its variance along a dimension, can be useful for representing the data but may not be useful for preserving discriminative information or highlighting a semantic direction. The Laplacian score is a locality-graph-based unsupervised feature selection criterion reflecting the locality-preserving power of each feature. More recently, sparse-representation-based feature extraction has become an active direction. The recent literature indicates that preserving the global pairwise sample similarity and the local geometric structure of the data is of great importance for feature selection; moreover, preserving the local geometric structure appears clearly more important than preserving global pairwise sample similarity for unsupervised feature selection. Considering only the representativeness of features is not sufficient, because ignoring their diversity may lead to high redundancy and the loss of valuable features. The sketch below illustrates the variance and Laplacian-score criteria.
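The following sketch is a hedged illustration (assuming scikit-learn for the k-nearest-neighbor graph; normalization details vary across formulations) that ranks features by their variance and by a simplified Laplacian score:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def variance_ranking(X):
    return np.argsort(X.var(axis=0))[::-1]         # largest variance first

def laplacian_scores(X, k=5):
    # kNN affinity graph (binary weights here; a heat kernel is also common).
    S = kneighbors_graph(X, n_neighbors=k, mode='connectivity', include_self=False)
    S = (0.5 * (S + S.T)).toarray()                # symmetrize
    D = np.diag(S.sum(axis=1))
    L = D - S                                      # graph Laplacian
    one = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ D @ one) / (one @ D @ one) * one      # remove the graph-weighted mean
        scores.append((f_t @ L @ f_t) / (f_t @ D @ f_t + 1e-12))
    return np.array(scores)                        # smaller = better locality preservation

X = np.random.default_rng(0).normal(size=(100, 8))
print(variance_ranking(X)[:3], np.argsort(laplacian_scores(X))[:3])
```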
3 Transitioning to Deep Feature Selection

The goal of feature selection is to identify the most representative or discriminative subset from the original feature space based on specific criteria while retaining the original feature representation. In contrast to feature extraction, which creates a new feature descriptor by transforming the original feature space into a different one, feature selection preserves the original semantics of the variables. This characteristic provides the notable advantage of interpretability, making it easier for domain experts to understand and work with the selected features. Both representativeness and diversity properties are very important for selecting effective features:
• The ideal selected features are supposed to represent the whole set of original features. That is to say, the highly irrelevant features are discarded while the most relevant features are preserved.
• The ideal selected features should be diverse enough to capture not only important (representative) but also comprehensive information from the features.
More information on the data can be captured by considering the diversity property because features usually describe different aspects of the data.
• The diversity property implies that very similar features should not be selected simultaneously, so that redundancy can be greatly reduced.
Therefore, there is a great need for integrating both the representativeness and diversity properties of features for feature selection. With the prevalence of unlabeled data, unsupervised feature selection has been shown to be effective in alleviating the curse of dimensionality and is essential for the comprehensive analysis and understanding of myriads of unlabeled high-dimensional data. Unsupervised feature selection aims at selecting a discriminative subset of features from unlabeled high-dimensional data and removing the other noisy and unimportant features. In scenarios where numerous features exhibit noise and lack correlation, their combined effects can result in all pairwise distances between data points becoming relatively uniform. This phenomenon renders distance-based concepts, typically effective in low-dimensional spaces, inapplicable.
Feature selection within the context of deep learning serves the crucial purpose of mitigating overfitting in neural network models. It achieves this by strategically choosing features from the data, aiming to retain less redundant and minimally irrelevant information, ideally retaining only those features essential for the given machine learning task. In essence, feature selection in deep learning actively contributes to the knowledge discovery process. Moreover, employing feature selection techniques can reduce the potential for making decisions influenced by noise. This, in turn, enhances the generalization capabilities of the neural network model by fostering simpler and less complex models. Additionally, it decreases the training time required for the neural network model: the streamlined model, with fewer inputs, exhibits reduced complexity, thereby accelerating training. It is important to note that feature selection in unsupervised learning scenarios presents a notably more challenging problem compared to supervised feature selection, primarily due to the absence of label information, which serves as a valuable guiding factor in the selection process. Transitioning to deep feature selection represents a paradigm shift in feature extraction and representation learning. By harnessing the power of deep learning models, particularly autoencoders, CNNs, and RNNs, informative features can be automatically learned and selected from raw data.
4 Taxonomy of Feature Selection Techniques with Deep Learning

Deep learning offers a paradigm shift in feature extraction and representation learning. By leveraging neural networks with multiple layers, deep learning models
can automatically learn hierarchical representations from data. The ability to capture complex, non-linear relationships between features makes deep learning an attractive choice for feature selection tasks.
The classical permutation method for feature selection is based on the idea that removing or corrupting a feature of a dataset changes the performance. By analyzing these changes, it is then possible to determine whether a feature is valuable or not. This approach has the obvious drawback of being computationally intensive; in order to check n features, one must train a deep learning model at least n times. The nature of combinatorial optimization poses a great challenge for deep learning. As the majority of weights in a deep network are not necessary for its accuracy, the pruning concept is central, and most feature selection approaches address pruning strategies. Note that feature selection and pruning are related problems even if pruning addresses a more general goal: adding a new set of features to a task generally requires increasing the number of neurons, all the way up to the last hidden layer. Instead of separating the tasks of feature selection and training, the strategy of several authors, such as Scardapane et al. (2017), consists of doing them simultaneously, considering that pruning a node and deleting an input feature are almost equivalent problems. The core idea of novel feature selection techniques devoted to deep learning consists of forcing all outgoing connections from a single neuron corresponding to a group to be either simultaneously zero or not. They differ in the learning approach and the constraints embedded via regularization.
The determination of the appropriate level for feature selection within a layered network is a critical consideration. It is worth emphasizing that feature selection holds particular significance for interpretability when implemented at the lowest level, which corresponds to the original feature space. However, for this selection to be meaningful, the chosen features must possess a certain degree of consistency and/or semantic relevance. In situations involving signal or image data, a solitary piece of information often proves insufficient, necessitating the application of feature selection to higher-level features within the network. Deep learning algorithms are efficient in their capacity to create abstract features or latent variables based on the observed variables. These low-level features or latent variables can serve multiple purposes, including the construction of higher-level features and facilitating the feature selection process. This flexibility in reusing and building upon lower-level abstractions contributes significantly to the power and versatility of deep learning models.
In order to explore a more effective unsupervised feature selection method, AE networks are often used as follows: since redundant features can be represented by linear or non-linear combinations of other useful features, the AE network can squeeze the input features into a low-dimensional space and represent the original features by exploiting these low-dimensional data. Therefore, features with less effect on the low-dimensional data (i.e., the hidden units) can be recognized as redundant and removed by a group sparsity regularization, as sketched below.
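This idea of pairing an autoencoder with a group-sparsity penalty can be sketched as follows (a PyTorch illustration in the spirit of such approaches, not the exact formulation of any cited method): an L2,1 penalty over the columns of the first-layer weight matrix drives all outgoing connections of unimportant input features toward zero.

```python
import torch
import torch.nn as nn

d, h = 50, 16                                     # input features, hidden units (placeholders)
encoder = nn.Linear(d, h)
decoder = nn.Linear(h, d)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
alpha = 1e-2                                      # strength of the group-sparsity penalty

X = torch.randn(256, d)                           # unlabeled data (random for the sketch)
for epoch in range(100):
    recon = decoder(torch.relu(encoder(X)))
    # L2,1 penalty: sum over input features of the L2 norm of their outgoing weights
    # (column j of encoder.weight holds the weights leaving input feature j).
    group_sparsity = encoder.weight.norm(dim=0).sum()
    loss = nn.functional.mse_loss(recon, X) + alpha * group_sparsity
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Rank features by the norm of their outgoing weights: a small norm suggests a redundant feature.
importance = encoder.weight.detach().norm(dim=0)
selected = torch.argsort(importance, descending=True)[:10]
```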
The majority of deep feature selection methods have this goal; they only differ in the way they carry out the process. Unlike supervised feature selection, in unsupervised feature selection the class label information is unavailable, so there is no scoring (label information) to guide the selection of a minimal feature subset. Due to this absence, structure learning, which serves as a piece of pseudo-supervised information, remains the core idea of all methods. In synthesis, most methods aim at identifying a representative feature subset from which all the features can be well reconstructed. Under this assumption, the selection cannot be truly semantic and gives only a partial response to the knowledge discovery question.
A fundamental tool in the domain of deep feature selection is the autoencoder. Let us explore the inner workings of autoencoders and their application in unsupervised feature learning. Autoencoders are a class of neural networks designed to reconstruct their input data at the output layer. Through the process of learning to faithfully reconstruct the input data, the hidden layers of the autoencoder act as learned features that are adept at capturing the most pertinent and informative aspects of the data. In the context of computer vision and image analysis, CNNs have emerged as a powerful tool for feature extraction. CNNs use convolutional layers to detect patterns and features in images, leading to hierarchical representations that are highly discriminative. They have been particularly successful in tasks like image classification and object detection, thanks to their ability to automatically learn relevant image features. For data with temporal dependencies, such as time series or natural language sequences, RNNs offer a valuable approach for feature extraction. RNNs can capture sequential patterns and dependencies, making them well suited for tasks like sentiment analysis, machine translation, and speech recognition, and they can be used to select essential features from sequential data.
The deep autoencoder (AE) is a state-of-the-art deep neural network for unsupervised feature learning. The Concrete AutoEncoder (CAE) (Balın et al., 2019) and, to a lesser extent, the AutoEncoder Feature Selector (AEFS) (Han et al., 2018) are among the most frequently challenged recent proposals. The general idea is to learn embedded representations using a series of stacked layers. The methods vary in their primary objective (level of compression, more importance given to skipping noisy and irrelevant features, etc.) and in the techniques (type of regularization) used to complete the task. It is important to note that self-representation techniques have recently emerged; they are based on the assumption that each feature can be well represented by a linear combination of its relevant features. A major issue of existing feature selection methods remains the time cost and computational complexity of processing big data. Deep neural network-based approaches have been developed to address these issues, but they are built on a similar idea. In this section, the main representative methods are presented (Fig. 8.1).
Fig. 8.1 FSAE architecture, from Wang et al. (2017)
5 Popular Methods

5.1 FSAE: Feature Selection Guided Autoencoder

FSAE (Wang et al., 2017) is a unified generative model that integrates a general feature selection regularizer and an autoencoder to distinguish relevant from irrelevant features. Feature selection is applied on the learned hidden layer to extract the discriminative features from the irrelevant ones. Simultaneously, the task-relevant hidden units feed back to optimize the encoding layer so that discriminability is achieved only on the selected hidden units. Feature selection is thus incorporated into the hidden layer units. The core idea is that the discerning hidden units are distinguished from the task-irrelevant units at the hidden layer, and the regularizer on the selected features in turn enforces the encoder to focus on compressing important patterns into the selected units; Fig. 8.2 presents the mechanism of the proposed FSAE. Assume X ∈ R^{d×n} is the training data, with d the dimensionality of the descriptor and n the number of data samples, and let z be the number of hidden units of the autoencoder. The objective is

$$\min_{W_1, W_2, b_1, b_2, P}\ \frac{1}{2}\,\|X - g(f(X))\|_F^2 + \frac{\lambda}{2}\, C(P, f(X)) \tag{8.1}$$

with W_1 ∈ R^{z×d}, W_2 ∈ R^{d×z}, and offset vectors b_1 ∈ R^{z}, b_2 ∈ R^{d} of dimensionality z and d, respectively,
Fig. 8.2 AutoEncoder Feature Selector diagram. By retaining the representability of original features, the autoencoder network is used to extract relevant features (white units) and eliminate redundant features (black units)
where
• f(X) = σ(W_1 X + B_1)
• g(f(X)) = σ(W_2 f(X) + B_2)
and B_1, B_2 are the n-repeated column copies of b_1 and b_2, respectively. C(P, f(X)) is the feature-selection regularization term, with a learned feature selection matrix P acting on the hidden units f(X). Specifically, the i-th column vector of P, denoted p_i ∈ R^z, has the form:

$$p_i = [\underbrace{0, \ldots, 0}_{j-1},\, 1,\, \underbrace{0, \ldots, 0}_{z-j}]^\top \tag{8.2}$$
where j means that this column vector selects the j-th hidden unit into the subset of new features y ∈ R^m. The feature selection procedure can then be expressed as: given the original features f(X), find a matrix P that selects a new feature set Y = P^⊤ f(X) optimizing an appropriate criterion C(P, f(X)). This criterion is a general feature selection regularizer that works in both supervised and unsupervised scenarios. It is defined as follows:

$$C(P, f(X)) = \frac{\mathrm{Tr}\big(P^\top f(X)\, L_w\, f^\top(X)\, P\big)}{\mathrm{Tr}\big(P^\top f(X)\, L_b\, f^\top(X)\, P\big)} \tag{8.3}$$

where Tr(M) is the trace of a matrix M ∈ R^{k×k}, defined as Tr(M) = Σ_{i=1}^{k} M_{i,i}. L_w and L_b are obtained as follows:
• Two weighted undirected graphs G_w and G_b are constructed from the original data; they reflect, respectively, the within-class and between-class affinity relationships. • Two weight matrices S_w and S_b are produced to characterize the two graphs, respectively. The Laplacian matrices are then obtained as L_w = D_w − S_w, where D_w is the diagonal degree matrix of S_w, and likewise for L_b and S_b.
FSAE was evaluated through data classification on three benchmark datasets (COIL100, Caltech101, and CMU PIE) focusing on object recognition, and only in a supervised setting. The method showed its superiority over non-deep feature selection methods.
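As an illustration only (not the authors' implementation), the following sketch assembles the FSAE objective of Eqs. 8.1 and 8.3 from precomputed quantities; the tensor shapes and variable names are assumptions.

```python
import torch

def fsae_loss(X, X_rec, H, P, L_w, L_b, lam=1.0):
    """Sketch of the FSAE objective (Eq. 8.1): reconstruction error plus the
    trace-ratio regularizer of Eq. 8.3.
    X: (d, n) input, X_rec: (d, n) reconstruction, H = f(X): (z, n) hidden codes,
    P: (z, m) selection matrix, L_w / L_b: (n, n) within/between Laplacians."""
    rec = 0.5 * torch.norm(X - X_rec, p="fro") ** 2
    S = P.T @ H                                   # selected hidden features, (m, n)
    num = torch.trace(S @ L_w @ S.T)              # within-class compactness
    den = torch.trace(S @ L_b @ S.T)              # between-class separation
    return rec + 0.5 * lam * num / den
```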
5.2 AEFS: AutoEncoder Feature Selector

AEFS (Han et al., 2018) is an autoencoder feature selection method that combines the group lasso and an autoencoder network into a new unsupervised embedded feature selection model. Since redundant features can be represented by linear or non-linear combinations of other useful features, the autoencoder network can squeeze the input features into a low-dimensional space and represent the original features from these low-dimensional data. Features with little effect on the low-dimensional representation (i.e., the hidden units) can therefore be recognized as redundant and removed through a group sparsity regularization. The method can be viewed as a non-linear extension of the linear Regularized Self-Representation (RSR) method for unsupervised feature selection (Zhu et al., 2015). Given a data matrix X whose n observations are stacked row-wise, RSR aims at solving the following optimization problem:

$$\min_{W}\ \|X - XW\|_F^2 + \lambda\,\|W\|_{2,1} \tag{8.4}$$
where ‖W‖_{2,1} = Σ_{i=1}^{n} ‖W_{:i}‖_2, W is the feature weight matrix, and ‖·‖_F is the Frobenius norm for matrices. AEFS can select the most important features (white units) in spite of non-linear and complex correlations among features and eliminate redundant features (black units), as shown in Fig. 8.2. The AutoEncoder Feature Selector approach imposes the L_{2,1}-norm regularization on the connecting parameters between the input and hidden layers, which retains the most discriminative information among features. As a reminder, the L_{2,1}-norm of a matrix M ∈ R^{n×p} is defined by:

$$\|M\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{p} M_{ij}^2} \tag{8.5}$$

The cost function is as follows:
$$J(\theta) = \frac{1}{2m}\,\|X - g(f(X))\|_F^2 + \alpha\,\|W^{(1)}\|_{2,1} + \frac{\beta}{2}\sum_{i}\|W^{(i)}\|_F^2 \tag{8.6}$$
where m is the number of samples, f is the encoder function, and g the decoder. α is the trade-off parameter between the reconstruction loss and the regularization term, and β is a penalty parameter. θ = {W^{(1)}, W^{(2)}} are the weight parameters, and W^{(l)}_{ij} denotes the parameter of the connection between the i-th neuron in the l-th layer and the j-th neuron in the (l+1)-th layer. Experiments on seven benchmark datasets (including Isolet, PCMAC, Madelon, lung discrete, Prostate GE, and MNIST) using the k-means algorithm show that AEFS outperforms the other (non-deep) methods in almost all cases. The method cannot, however, ensure that the orthogonality and non-negativity constraints of the indicator matrix hold, because the weight matrix is obtained directly. The loss function based on the Frobenius norm is very sensitive to noise and outliers (Liu et al., 2020). The model first trains an autoencoder network, then sorts the features and selects k of them; these two steps are independent, which may reduce the interpretability of the model. AEFS's source code may be accessed at1.
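For illustration, a minimal sketch of the AEFS cost of Eq. 8.6 is given below, assuming a one-hidden-layer autoencoder whose encoder and decoder are exposed as `enc` and `dec` (hypothetical attribute names); it is not the released implementation.

```python
import torch

def aefs_loss(X, model, alpha=1e-3, beta=1e-3):
    """Sketch of the AEFS objective (Eq. 8.6): reconstruction error,
    L2,1 group sparsity on the first-layer weights, and weight decay.
    X: (m, d) data; model.enc: Linear(d, z), model.dec: Linear(z, d)."""
    m = X.shape[0]
    X_rec = model.dec(torch.sigmoid(model.enc(X)))
    rec = torch.norm(X - X_rec, p="fro") ** 2 / (2 * m)
    # L2,1 norm over the encoder weights: each group is all outgoing weights of one input feature
    w1 = model.enc.weight                   # (z, d); column j corresponds to input feature j
    l21 = w1.norm(dim=0).sum()
    frob = sum(w.norm(p="fro") ** 2 for w in (model.enc.weight, model.dec.weight))
    return rec + alpha * l21 + beta / 2 * frob
```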
5.3 GAFS: Graph Regularized Autoencoder Feature Selection

GAFS (Feng & Duarte, 2018) combines autoencoders and graph regularization to identify and select the most informative features from a dataset. It is based on the idea of adding a regularization on the local geometric structure of the data, so the framework relies on an autoencoder and graph-based data regularization. In the context of GAFS, a graph is constructed where the nodes represent features and the edges represent connections between features based on their relationships. By incorporating graph regularization into the autoencoder, the model is encouraged to capture not only the inherent structure of the data but also the relationships between features, which helps in selecting the most informative ones. The objective function of GAFS includes three terms, the first two of which, L(θ) and R(θ), are similar to those of the AEFS method:
1. A term L(θ) based on a single-layer autoencoder promoting broad data structure preservation
2. A regularization term R(θ) promoting feature selection
3. A term G(θ) based on spectral graph analysis promoting local data structure preservation
The GAFS algorithm can be summarized in the following steps:
1 https://github.com/panda1949/AEFS
• Step 1: Build a k-nearest neighbor (kNN) graph where nodes represent features and edges represent the relationships between them, based on some measure of similarity.
• Step 2: Use the graph-regularized autoencoder to learn a compressed representation of the input data. The graph regularization term encourages the autoencoder to capture the relationships between features during compression.
• Step 3: After training the autoencoder, the most informative features are selected based on the importance or activations of the nodes in the bottleneck layer; features with higher activations in the bottleneck layer are considered more relevant and are selected for the final feature subset.
Experiments were carried out on ten benchmark datasets, including five image datasets (MNIST, COIL20, Yale, Caltech101, CUB200), three text datasets (PCMAC, BASEHOCK, RELATHE), one audio dataset (Isolet), and one biological dataset (Prostate GE). GAFS performs better (using k-means) than the other compared algorithms, which are non-deep, in most cases.
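The following sketch illustrates the kind of objective GAFS optimizes, assuming a kNN graph built with scikit-learn and a generic three-term loss (reconstruction, group sparsity, graph regularization); it is a simplified reading of the method, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_laplacian(X, k=5):
    """Build a kNN similarity graph and return its Laplacian L = D - S."""
    S = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    S = 0.5 * (S + S.T)                               # symmetrize
    S = S.toarray()
    D = np.diag(S.sum(axis=1))
    return D - S

def gafs_objective(X, X_rec, Z, L, W1, alpha=1e-3, gamma=1e-3):
    """Sketch of a graph-regularized autoencoder objective: reconstruction,
    L2,1 sparsity on encoder weights, and a Tr(Z^T L Z) local-structure term.
    X, X_rec: (n, d); Z: (n, z) latent codes; W1: (z, d) encoder weights."""
    rec = 0.5 * np.linalg.norm(X - X_rec, "fro") ** 2
    sparsity = np.sqrt((W1 ** 2).sum(axis=0)).sum()   # group sparsity over input features
    graph = np.trace(Z.T @ L @ Z)                     # nearby samples -> nearby codes
    return rec + alpha * sparsity + gamma * graph
```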
5.4 RAE: Restricted Autoencoder

An RAE (Covert et al., 2020) is a type of autoencoder that imposes additional constraints on the architecture to enhance its ability to learn useful representations from the input data. By imposing constraints such as weight sharing and others, the model is encouraged to capture the most salient features of the data in the bottleneck layer, which can lead to more efficient and informative representations. The RAE framework is based on the idea that the optimal approach consists of selecting a set of features that can accurately reconstruct all the remaining features; highly ranked features are those that are most critical for performing accurate reconstruction. Instead of simply selecting the top-ranked features, it may prove beneficial to reject a small number of features and reassess the importance of the remaining ones. The algorithm therefore trains an RAE iteratively, at each step removing the lowest-ranked features, i.e., those that are not important for accurate reconstruction; this is recursive feature elimination. No information is lost when selecting a subset of features as long as the rejected features can be effectively reconstructed from it. Two sensitivity measures are considered, both based on learning per-feature corruption rates. The first method stochastically sets inputs to zero using learned dropout rates p_j for each feature j ∈ S = {1, 2, ..., p}. The second method injects Gaussian noise with learned per-feature standard deviations σ_j. These methods are referred to as Bernoulli RAE and Gaussian RAE, after the kind of noise they inject. Based on the logic that important features tolerate less corruption, features are ranked according to p_j or σ_j. The objective functions to be
optimized at each iteration are shown below; both are optimized using stochastic gradient methods and the reparameterization trick.

$$\min_{\theta, p}\ \mathbb{E}_{m \sim B(p)}\, \mathbb{E}_X\, \big\|X - h_\theta(X_S \odot m)\big\|^2 + \lambda \sum_{j \in S} \log(p_j) \tag{8.7}$$

$$\min_{\theta, \sigma}\ \mathbb{E}_{z \sim \mathcal{N}(0, \sigma^2)}\, \mathbb{E}_X\, \big\|X - h_\theta(X_S + z)\big\|^2 + \lambda \sum_{j \in S} \log\!\Big(1 + \frac{1}{\sigma_j^2}\Big) \tag{8.8}$$
where λ is a hyperparameter that controls the balance between accurate reconstruction and noise injection. RAE was tested on two publicly available biological datasets (single-cell RNA sequencing data and microarray gene expression data) and showed superior efficiency compared to nine baseline methods, including AEFS.
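A minimal sketch of the Gaussian-RAE corruption idea is shown below; the module name and interface are assumptions, and only the per-feature noise injection, the penalty of Eq. 8.8, and the resulting ranking are illustrated.

```python
import torch

class GaussianGates(torch.nn.Module):
    """Sketch of the Gaussian-RAE idea: learn a per-feature noise scale;
    features that tolerate little noise (small sigma) are ranked as important."""
    def __init__(self, n_features):
        super().__init__()
        self.log_sigma = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        return x + sigma * torch.randn_like(x)        # inject per-feature Gaussian noise

    def penalty(self, lam=1e-2):
        sigma2 = (2 * self.log_sigma).exp()
        return lam * torch.log(1 + 1 / sigma2).sum()  # Eq. 8.8 term: rewards large sigma

    def ranking(self):
        # ascending sigma: most important features (least corruption tolerated) first
        return torch.argsort(self.log_sigma)
```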
5.5 CAE: Concrete Autoencoder

CAE (Balın et al., 2019) is an end-to-end differentiable method for global feature selection, which efficiently identifies a subset of the most informative features. The method aims at selecting discrete features using an embedded method, but without resorting to regularization. The idea is to use a relaxation of the discrete random variables, the concrete distribution (Maddison et al., 2016), which allows a low-variance estimate of the gradient through discrete stochastic nodes. CAE simultaneously learns a neural network to reconstruct the input data from the selected features. The concrete autoencoder is based on using a concrete selector layer as the encoder, with a user-specified number of nodes k. It is based on concrete random variables: Concrete(α_i, T) is a relaxation of a discrete distribution whose sampling can be formulated with the Gumbel-Max trick and where the probability of outcome j is α_j / Σ_p α_p. The reparameterization trick consists in refactoring each stochastic node into a differentiable function of its parameters and a random variable with a fixed distribution. Each element of a sample from the concrete distribution is defined as:

$$m_j = \frac{\exp\big((\log \alpha_j + g_j)/T\big)}{\sum_{k=1}^{d} \exp\big((\log \alpha_k + g_k)/T\big)} \tag{8.9}$$

where m_j refers to the j-th element of a particular sample vector and the g_j are samples from the standard Gumbel distribution. The decoder is a standard neural network used to reconstruct the input features. Let f_θ be the decoder function, X (n × d) the input matrix, and X_s (n × k) the matrix regrouping the concatenation of the selected vectors; then each element i of the concrete layer is as follows:
$$u^{(i)} = X^\top\, \mathrm{Concrete}(\alpha_i, T) \tag{8.10}$$
The reconstruction cost can be expressed as follows:

$$L = \arg\min_{\theta}\ \|f_\theta(X_s) - X\|_F \tag{8.11}$$

where ‖·‖_F denotes the Frobenius norm of the matrix and θ the decoder parameters. During the training phase, the temperature T of the concrete selector layer is gradually decreased with the epoch t following an exponential decay, T(t) = T(0)(T(t_max)/T(0))^{t/t_max}, which encourages a user-specified number of discrete features to be learned. Gradient descent is applied to both the encoder (α) and the decoder (θ) parameters. The values of α_j become sparser as the network becomes more confident in particular choices of input features, reducing the stochasticity of the selected features. When T → 0, the concrete random variable smoothly approaches the discrete distribution, outputting a one-hot vector with m_j = 1 with probability α_j / Σ_p α_p; each node in the concrete selector layer then outputs exactly one of the input features. At test time, this layer is replaced by a discrete arg max layer. The concrete layer of CAE provides an architectural control of the number of selected features, which simplifies training, as the loss function can contain only the reconstruction error term. CAE was compared to state-of-the-art non-deep unsupervised feature selection methods (on ISOLET, COIL20, the Smartphone dataset, Mice Protein, and a GEO dataset). CAE effectively minimizes the reconstruction error and maximizes classification accuracy using the selected features, outperforming many more complex feature selection methods. The source code for CAE can be found at2.
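The concrete selector layer can be sketched as follows (a simplified reading of Eq. 8.9, not the released CAE code); the class name and interface are assumptions.

```python
import torch
import torch.nn.functional as F

class ConcreteSelector(torch.nn.Module):
    """Sketch of a concrete selector layer: k nodes, each producing a relaxed
    one-hot selection over d input features (Eq. 8.9)."""
    def __init__(self, d, k):
        super().__init__()
        self.log_alpha = torch.nn.Parameter(torch.zeros(k, d))

    def forward(self, x, temperature):
        # Gumbel noise g = -log(-log(u)), u ~ Uniform(0, 1)
        g = -torch.log(-torch.log(torch.rand_like(self.log_alpha)))
        m = F.softmax((self.log_alpha + g) / temperature, dim=-1)  # (k, d) relaxed one-hot rows
        return x @ m.T                                             # (n, d) -> (n, k) selected features

    def selected_features(self):
        # at test time: hard arg max per selector node
        return self.log_alpha.argmax(dim=-1)
```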
5.6 LS-CAE: Laplacian Score-Regularized Concrete Autoencoder

LS-CAE (Shaham et al., 2022) is an extension of CAE (Balın et al., 2019); the idea consists of adding a Laplacian score term to its objective function during training, where the Laplacian is computed at the concrete layer. The idea of the Laplacian score is to evaluate the importance of a feature according to its similarity-preserving power, which is based on a graph model. The proposed approach inherits from CAE the autoencoder framework, the concrete layer, and the reconstruction loss as a feature selection mechanism that promotes the selection of a sparse subset of representative features capturing a large portion of the information in the entire feature set. This is augmented with a Laplacian score term in the CAE objective function because doing
2 https://github.com/anonymousresearcher12/Concrete-Autoencoders
so alone does not stimulate the elimination of nuisance features. Moreover, since the Laplacian should be computed on the selected features rather than on the whole feature set, the Laplacian is calculated at the concrete layer. The proposed objective function includes a reconstruction term and a Laplacian score term, as follows:

$$L(\hat{Y}, Y) = \frac{\|Y - \hat{Y}\|_2^2}{SG\big[\|Y - \hat{Y}\|_2^2\big]} - \frac{\mathrm{Trace}\big[C^\top L_{\mathrm{diff}}(C)\, C\big]}{SG\big[\mathrm{Trace}\big[C^\top L_{\mathrm{diff}}(C)\, C\big]\big]} \tag{8.12}$$
where Y and Ŷ are, respectively, the input and the output of the autoencoder, C = C(Y) denotes the output of the concrete layer, L_diff(C) stands for the diffusion Laplacian computed on the concrete-layer representation of the data, and SG stands for the stop-gradient operator. To prevent selecting the same input feature many times, a regularization term is included that penalizes selecting a feature several times. This regularization term is calculated as:

$$R(Z) = X \cdot \max(0,\, x - 1) \tag{8.13}$$
where X is a large constant and x is the maximum sum of any feature's weights over the concrete units, i.e.,

$$x := \max_{j=1,\ldots,d}\ \sum_{i=1}^{k} Z_{ij} \tag{8.14}$$
and Z is the k × d matrix of concrete-layer probabilities. The key steps of the proposed algorithm are:
1. Input Setup: Begin by defining the input parameters, including the dataset {y_1, ..., y_n}, the desired number of features k, the number of optimization steps N_steps, and the temperature annealing schedule S.
2. Iterative Optimization Loop: Initialize a loop that iterates from i = 1 to t, where t represents the number of optimization iterations or epochs.
3. Annealing Temperature: Obtain the temperature T for the current iteration i from the annealing schedule S(i, N_steps). The temperature determines the degree of exploration and exploitation in the algorithm.
4. Mini-Batch Sampling: Sample a mini-batch of data Y from the dataset. This mini-batch is used to update the model iteratively.
5. Data Reconstruction: Using the obtained temperature T, generate a data reconstruction Ŷ.
6. Loss Computation: Compute the loss L(Ŷ, Y) by comparing the data reconstruction Ŷ to the original data Y. The specific loss function is typically defined according to the problem's requirements (see Eq. 8.12).
7. Feature Redundancy Penalty: Calculate the feature redundancy penalty R(Z) according to Eq. 8.13. This penalty discourages redundancy among the selected features.
8. Weight Update: Update the model's weights by backpropagation on the combined objective L(Ŷ, Y) + R(Z). This step adjusts the model's parameters to minimize the reconstruction loss while accounting for the feature redundancy penalty.
9. Iteration Completion: Repeat steps 3 to 8 for each iteration i within the defined range t.
10. Feature Selection: Once all iterations are complete, return the selected features {i_1, ..., i_k}. Each feature i_j is determined as the arg max of the categorical probabilities π_{j,1}, ..., π_{j,d} of the j-th concrete unit, i.e., the feature with the highest importance for that unit.
The method was tested on ten benchmark datasets using k-means (RCV1, GISETTE, PIX10, COIL20, Yale, TOX-171, ALLAML, PROSTATE, FAN, POLLEN). LS-CAE achieves impressive reduction and classification improvement compared to state-of-the-art non-deep techniques as well as the CAE approach. It achieves the best performance on seven of the ten benchmark datasets, and the second-best on the remaining three.
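The normalized two-term loss of Eq. 8.12 can be sketched as follows, using `detach` as the stop-gradient operator; tensor shapes and names are assumptions.

```python
import torch

def lscae_loss(Y, Y_hat, C, L_diff):
    """Sketch of the LS-CAE objective (Eq. 8.12): reconstruction and Laplacian score
    terms, each normalized by its own stop-gradient (detached) value.
    Y, Y_hat: (n, d); C: (n, k) concrete-layer output; L_diff: (n, n) diffusion Laplacian."""
    rec = ((Y - Y_hat) ** 2).sum()
    lap = torch.trace(C.T @ L_diff @ C)
    return rec / rec.detach() - lap / lap.detach()
```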
5.7 LRLMR: Latent Representation Learning and Graph-Based Manifold Regularization

LRLMR (Tang et al., 2019) combines latent representation learning with graph-based manifold regularization to improve the performance of learning tasks, especially in scenarios where labeled data is limited. It measures feature importance in the latent space, which is more robust to noise than the original data space. The latent representation is modeled by non-negative matrix factorization of the affinity matrix, which explicitly reflects the relationships between data instances. Meanwhile, the local manifold structure of the original data space is preserved by a graph-based manifold regularization term in the transformed feature space. This enables the characterization of the intrinsic data structure and, in turn, acts as label information serving the feature selection process. The main components and steps involved in LRLMR are:
1. Input Data:
   • Data matrix X ∈ R^{n×d}
   • Adjacency matrix A
   • Laplacian matrix L
   • Parameters α, β, and γ
2. Initialization:
   • Set the iteration counters t and t_1 to 0.
   • Initialize Λ as the identity matrix I.
   • Initialize the matrix V with random values of dimensions n × c.
3. Main Loop: inside this loop, the algorithm performs several updates until a convergence criterion is met.
   • Update W using Eq. 8.16.
   • Update Λ using Eq. 8.15.
   • Increment the counter t_1 by 1.
4. After exiting the main loop, update the matrix V using Eq. 8.17.
5. Increment the main iteration counter t by 1.
6. Repeat the main loop until the convergence criteria are met.
7. Output: the algorithm returns the matrices W and V as the final result.
8. Feature Selection:
   • Sort the features of the input data matrix X by the L_2-norm of the corresponding rows of W, in descending order.
   • Select the top-K ranked features for further analysis or use.
The specific equations (Eqs. 8.15, 8.16, and 8.17) are crucial for understanding the mathematical operations performed at each step; they are expressed as follows:

$$\Lambda(i, i) = \frac{1}{2\,\|W(i, :)\|_2} \tag{8.15}$$

$$W = \big(X^\top X + \alpha \Lambda + \gamma X^\top L X\big)^{-1} X^\top V \tag{8.16}$$

$$V_{ij} \leftarrow V_{ij}\, \frac{(2XW + 4\beta A V)_{ij}}{(2V + 4\beta V V^\top V)_{ij}} \tag{8.17}$$
The key aspects of this algorithm are the iterative updates of the matrices W and Λ, based on the input data and the parameters α, β, and γ, where α controls the sparseness of the model, β balances the latent representation learning and the feature selection in the latent space, and γ balances the local manifold geometric structure regularization. Eight publicly available and diverse datasets were used for evaluation (ORL, orlraws10P, warpPIE10P, COIL20, Isolet, CLL-SUB-111, Prostate-GE, and USPS) using k-means clustering. LRLMR shows its superiority over non-deep algorithms and over AEFS. LRLMR's source code is accessible at3.
3 https://pan.baidu.com/s/1o8uTluM
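A compact sketch of the alternating updates of Eqs. 8.15-8.17 is given below; the initialization choices, stopping criterion, and numerical safeguards are assumptions made for illustration.

```python
import numpy as np

def lrlmr_updates(X, A, L, alpha=1.0, beta=1.0, gamma=1.0, c=10, n_iter=20):
    """Sketch of the iterative LRLMR updates: alternate between the reweighting
    matrix Lambda, the projection W, and the latent representation V.
    X: (n, d) data, A: (n, n) affinity, L: (n, n) graph Laplacian."""
    n, d = X.shape
    V = np.abs(np.random.rand(n, c))
    Lam = np.eye(d)
    for _ in range(n_iter):
        # Eq. 8.16: closed-form update of W given Lambda and V
        W = np.linalg.solve(X.T @ X + alpha * Lam + gamma * X.T @ L @ X, X.T @ V)
        # Eq. 8.15: diagonal reweighting from the row norms of W
        Lam = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + 1e-12))
        # Eq. 8.17: multiplicative update keeping V non-negative
        V *= (2 * X @ W + 4 * beta * A @ V) / (2 * V + 4 * beta * V @ V.T @ V + 1e-12)
    ranking = np.argsort(-np.linalg.norm(W, axis=1))   # features ranked by row L2-norm of W
    return W, V, ranking
```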
5.8 RNE: Robust Neighborhood Embedding

RNE (Liu et al., 2020) is based on manifold learning and sparsity. The main idea behind RNE is to learn a mapping that embeds the data points into a lower-dimensional space while ensuring robustness against outliers and noisy data. It combines the advantages of neighborhood embedding techniques with robust statistics to achieve this goal. The weight matrix for neighborhood embedding is first acquired through the locally linear embedding (LLE) algorithm (Saul & Roweis, 2000); the idea in LLE is that each data point and its neighbors lie close to a locally linear patch of some underlying manifold. An L1-norm is imposed on the loss term of the unsupervised feature selection model. Since the L1-based model is convex but non-smooth, the robust neighborhood embedding (RNE) algorithm solves it efficiently through the alternating direction method of multipliers (ADMM). The key components and steps involved in RNE are:
• Neighborhood Embedding: RNE uses the concept of neighborhood embedding to preserve the local structure of the data. In the high-dimensional space, each data point has a set of neighboring points, and the relative distances and relationships among these neighbors are used to define a similarity graph.
• Robust Statistics: in the presence of outliers or noisy data, standard neighborhood embedding methods may be sensitive and yield poor results. RNE incorporates robust statistics, which are designed to be less influenced by extreme values, making the method more resilient to outliers.
• Objective Function: RNE formulates an objective function that combines robust statistics with the neighborhood embedding goal. It aims to find a mapping of the data points into a lower-dimensional space that preserves the neighborhood relationships while being robust to outliers.
• Optimization: the parameters of the mapping function are optimized to minimize the objective function, typically using iterative optimization algorithms.
Compared with six state-of-the-art algorithms, the performance of RNE was tested on nine publicly available datasets (Yale, warpAR10P, TOX-171, Madelon, Isolet, GLIOMA, COIL20, Arcene, ALLAML) by considering their clustering accuracy (ACC). RNE is better than the comparison algorithms in most cases on the test datasets. RNE's source code can be accessed at4.
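As an illustration of the first RNE step, the following sketch computes LLE-style reconstruction weights for each sample from its k nearest neighbors; the regularization constant and interface are assumptions, and the robust L1/ADMM optimization itself is not shown.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_weights(X, k=10, reg=1e-3):
    """LLE-style weights: express each sample as an affine combination of its
    k nearest neighbors (weights sum to one), as used in the first RNE step."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        neigh = idx[i, 1:]                        # skip the point itself
        Z = X[neigh] - X[i]                       # centered neighbors, (k, d)
        G = Z @ Z.T                               # local Gram matrix, (k, k)
        G += reg * np.trace(G) * np.eye(k)        # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, neigh] = w / w.sum()                 # weights sum to one
    return W
```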
4 https://github.com/liuyanfang023/KBS-RNE
5.9 DUFS: Differentiable Unsupervised Feature Selection Based on a Gated Laplacian

DUFS (Lindenbaum et al., 2021) proposes a differentiable objective for unsupervised feature selection. The method utilizes stochastic input gates, trained to select features with high correlation with the leading eigenvectors of a graph Laplacian computed on those same features. It introduces learnable Bernoulli gates into a Laplacian score. This gating mechanism allows the re-evaluation of the Laplacian for different subsets of features and thus unmasks the main data structure buried by the nuisance features. The approach significantly improves cluster assignments compared with leading algorithms on twelve high-dimensional datasets from multiple domains (GISETTE, PIX10, COIL20, Yale, RCV1, TOX-171, ALLAML, PROSTATE, SRBCT, BIASE, INTESTINE, FAN, POLLEN). DUFS outperforms all baselines on 9 datasets and ranks second on the remaining 3. The results show that the method is particularly useful in bioinformatics and generally better than CAE. The source code for DUFS is available at5.
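The gating idea can be sketched as follows with a simple continuous relaxation of per-feature Bernoulli gates; this is an illustration in the spirit of DUFS, not the released implementation.

```python
import torch

class StochasticGates(torch.nn.Module):
    """Sketch of differentiable per-feature gates: relaxed Bernoulli gates mask
    the input before the graph Laplacian / Laplacian score is computed."""
    def __init__(self, d, sigma=0.5):
        super().__init__()
        self.mu = torch.nn.Parameter(0.5 * torch.ones(d))
        self.sigma = sigma

    def forward(self, x):
        noise = self.sigma * torch.randn_like(self.mu)
        z = torch.clamp(self.mu + noise, 0.0, 1.0)     # gate value in [0, 1] per feature
        return x * z                                   # gated features feed the Laplacian score

    def open_gates(self, threshold=0.5):
        # features whose gates stay open are the selected ones
        return (self.mu > threshold).nonzero(as_tuple=True)[0]
```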
5.10 UFS-TAE

UFS-TAE (unsupervised feature selection via transformed autoencoder) (Zhang et al., 2021) aims to acquire a feature selector constrained by orthogonality and non-negativity. The method is divided into three phases. First, the indicator matrix, constrained by orthogonality, is obtained via a deep autoencoder. Then the non-negative least squares method is used to obtain an approximate, non-negative indicator matrix. Finally, the feature selection matrix is determined from the indicator matrix, and the model is evaluated with k-means. The autoencoder, with several non-linear activation functions, allows the model to select both features with linear relationships and features with potential non-linear relationships. The L_{2,1}-norm is also used to select features with joint sparsity. Figure 8.3 presents the architecture of the proposed model, where n stands for the number of samples, d represents the number of features in the original input data, and k is the number of selected features. The method was evaluated on six real-world datasets (Yale, WarpPIE10P, Madelon, WarpAR10P, Lung, and Orlraws10P) and shows overall better performance compared to state-of-the-art unsupervised feature selection methods as well as CAE and AEFS. UFS-TAE's source code is accessible at6.
5 https://github.com/Ofirlin/DUFS
6 https://github.com/wownice333/UFS_TAE
Fig. 8.3 UFS-TAE model architecture. In this model, the autoencoder is utilized to address the challenges of feature selection. By employing appropriate activation and loss functions, the learned weights ensure coherence with the original indicator matrix. After training, the indicator matrix is restored using non-negative approximation techniques (Zhang et al., 2021)
References

Balın, M. F., Abid, A., & Zou, J. (2019). Concrete autoencoders: Differentiable feature selection and reconstruction. In International conference on machine learning (pp. 444–453). PMLR.
Covert, I., Sumbul, U., & Lee, S.-I. (2020). Deep unsupervised feature selection.
Feng, S., & Duarte, M. F. (2018). Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing, 312, 310–323.
Han, K., Wang, Y., Zhang, C., Li, C., & Xu, C. (2018). Autoencoder inspired unsupervised feature selection. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2941–2945). IEEE.
Lindenbaum, O., Shaham, U., Peterfreund, E., Svirsky, J., Casey, N., & Kluger, Y. (2021). Differentiable unsupervised feature selection based on a gated Laplacian. Advances in Neural Information Processing Systems, 34, 1530–1542.
Liu, Y., Ye, D., Li, W., Wang, H., & Gao, Y. (2020). Robust neighborhood embedding for unsupervised feature selection. Knowledge-Based Systems, 193, 105462.
Maddison, C. J., Mnih, A., & Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
Saul, L. K., & Roweis, S. T. (2000). An introduction to locally linear embedding. Unpublished. Available at: http://www.cs.toronto.edu/~roweis/lle/publications.html.
Scardapane, S., Comminiello, D., Hussain, A., & Uncini, A. (2017). Group sparse regularization for deep neural networks. Neurocomputing, 241, 81–89.
Shaham, U., Lindenbaum, O., Svirsky, J., & Kluger, Y. (2022). Deep unsupervised feature selection by discarding nuisance and correlated features. Neural Networks, 152, 34–43.
Tang, C., Bian, M., Liu, X., Li, M., Zhou, H., Wang, P., & Yin, H. (2019). Unsupervised feature selection via latent representation learning and manifold regularization. Neural Networks, 117, 163–178.
Wang, S., Ding, Z., & Fu, Y. (2017). Feature selection guided auto-encoder. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), 2725–2731.
Zhang, Y., Lu, Z., & Wang, S. (2021). Unsupervised feature selection via transformed autoencoder. Knowledge-Based Systems, 215, 106748.
Zhu, P., Zuo, W., Zhang, L., Hu, Q., & Shiu, S. C. (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition, 48(2), 438–446.
Chapter 9
Deep Clustering Techniques
Clustering is a fundamental machine learning problem, whose performance is highly dependent on the quality of data representation. Hence, feature transformations have been extensively used to learn a better data representation for clustering. Deep clustering techniques have emerged as a powerful approach for unsupervised learning tasks, combining the benefits of deep neural networks and traditional clustering algorithms. Deep neural networks are particularly apt to learn nonlinear mappings that allow for transforming data into a representation that eases the clustering task, without the need of performing manual feature selection and engineering. This chapter provides a comprehensive overview of deep clustering techniques and their applications in various domains. The chapter begins by introducing the fundamental concepts of deep learning and clustering, highlighting their respective strengths and limitations. It then delves into the integration of deep learning architectures, such as autoencoders, convolutional neural networks (CNNs), and recurrent neural networks (RNNs), with clustering algorithms to enhance clustering performance. Different types of deep clustering algorithms are discussed, including autoencoder-based methods, variational autoencoders, deep embedded clustering, and adversarial approaches. The chapter also covers optimization techniques, loss functions, and training strategies specific to deep clustering, addressing challenges such as vanishing gradients and convergence issues. Additionally, the chapter provides insights into evaluating the performance of deep clustering techniques, discussing evaluation metrics, benchmark datasets, and comparisons with traditional clustering algorithms. Overall, this chapter serves as a comprehensive guide to deep clustering techniques, offering a deeper understanding of their underlying principles, architectures, applications, and evaluation methodologies. It is a valuable resource for researchers, practitioners, and students interested in exploring the synergies between deep learning and clustering for unsupervised learning tasks.
1 Taxonomy of Deep Clustering Techniques

Traditional clustering algorithms give discouraging results on large-scale complex datasets because of their inferior representation learning capability. The core idea of deep clustering methods is to exploit the representations learned by neural networks (via CNNs, AEs, VAEs, and GANs) to face the curse of dimensionality. Instead of clustering the samples directly in the original input space X, the data is transformed with a non-linear mapping f_θ: X → Z, where θ are learnable parameters and Z ∈ R^K is the learned (embedded) feature space. Within the reduced space, parameter optimization can proceed via classical clustering or by iterating between computing an auxiliary target distribution and minimizing a clustering loss. To face the issues of high-dimensional spaces, the success of deep clustering first depends on the quality of the representations, i.e., the capacity to reduce the dimension of the original space while capturing the essential data information at a higher level of abstraction in a concise representation. Ideally, each latent factor should represent an underlying dimension of variation between samples that explains variation in multiple dimensions of the measurement space. Obtaining such a disentangled representation (Tschannen et al., 2018) is not obvious, as it is supposed to decompose the original parameters, which have non-linear effects in the measurement space. In addition, there is no metric to measure disentanglement quality for real-world problems, since the generative factors are unknown. This is the main reason researchers develop specific learning strategies for deep clustering instead of directly applying advanced traditional clustering approaches, which are limited to low-dimensional spaces. Broadly, there are two categories of deep clustering algorithms: the first learns a latent-space representation of the data separately and performs the clustering task on the compressed data representation; the second applies dimensionality reduction and clustering simultaneously. In the deep clustering area, the learning strategy, including the tricks involved and the use of various combined losses, has high importance, as it can influence the latent representation and, more generally, contribute to the optimization of the clustering process. Unsupervised learning methods generally involve two aspects: pretext tasks and loss functions. As a reminder, the term "pretext" implies that the task being solved is not of genuine interest but is solved only for the true purpose of learning a good data representation. Loss functions can often be investigated independently of pretext tasks, and they can be divided into non-clustering losses and clustering losses. The clustering losses include: no clustering loss, k-means loss, cluster assignment hardening, balanced assignments loss, locality-preserving loss, group sparsity loss, cluster classification loss, and agglomerative clustering loss. The general formulation is as follows:

$$L(\theta) = \gamma\, L_c(\theta) + (1 - \gamma)\, L_n(\theta) \tag{9.1}$$
where θ represents the parameters of the network, and L_c(θ) and L_n(θ) are, respectively, the clustering and non-clustering losses. γ is a constant that sets the trade-off: if γ = 0, only the non-clustering loss is used for training the network; if γ = 1, only the clustering loss is used; if 0 < γ < 1, both losses are used. Non-clustering losses are independent of the clustering algorithm and usually enforce a desired constraint on the learned model, which guarantees that the learned representation preserves important information (e.g., spatial relationships between features), so that the original input can be reconstructed in the decoding phase. Clustering losses (e.g., the RL1 and self-augmentation losses) are specific to the clustering method and to the clustering-friendliness of the learned representations. Most pioneering methods learn representations for clustering by constructing a data correlation regularization, such as positive sample pairs, self-supervised guidance, or mutual information. Straightforward approaches estimate data correlation with representations obtained from pre-trained models and then learn a classical clustering model. However, the independence of the two stages means that the pre-learned representations may not fully capture the semantic structure of the data, which can result in a suboptimal clustering solution. Even if progress was made under this strategy, novel methods were designed to simultaneously optimize clustering and reconstruction losses so as to preserve local structure. In deep learning, the more reliable the input samples are, the better the learned model. However, data is often mixed with noisy and confusing samples, which frequently leaves models with insufficient generalization ability. Thus, even if these methods are conceptually similar, they include variants that differ in the way the losses are defined and in the way several learning tricks are incorporated and adapted, such as regularization, pruning, sparsity, self-supervision, data augmentation, contrastive learning, self-paced learning, etc. Most of these have already been popular in machine learning for a long time, except for the contrastive concept, which has shown promising results. Identifying the right pretext task for a given problem is still an open research question (Deng et al., 2022; Diallo et al., 2021; Li et al., 2021).
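As an illustration of Eq. 9.1, the sketch below combines a DEC-style clustering loss (KL divergence between soft and target assignments) with a reconstruction loss as the non-clustering term; the choice of both terms is an assumption made for the example.

```python
import torch

def deep_clustering_loss(x, x_rec, q, p, gamma=0.5):
    """Sketch of Eq. 9.1: convex combination of a clustering loss L_c (here KL(P || Q),
    as in assignment hardening) and a non-clustering loss L_n (here reconstruction).
    q, p: (n, K) soft and target cluster assignments."""
    l_n = torch.nn.functional.mse_loss(x_rec, x)                 # non-clustering loss
    l_c = (p * (p.clamp_min(1e-10).log()
                - q.clamp_min(1e-10).log())).sum(dim=1).mean()   # KL(P || Q)
    return gamma * l_c + (1 - gamma) * l_n
```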
Differently and perhaps more inclined towards knowledge discovery tasks, generative methods (Creswell et al., 2018; Sajun & Zualkernan, 2022) have been
adapted for deep clustering. Examples include InfoGan (Chen et al., 2016) and ClusterGan (Mukherjee et al., 2019) from the GAN family, as well as VaDE (Jiang et al., 2016) and GMM-VAE (Dilokthanakul et al., 2016) from the VAE family, specifically tailored for deep clustering. A generative adversarial network (GAN) empirically learns the map that transforms latent variables into the complex data distribution by playing a min-max game. However, estimating the posterior distribution of the latent variables from the data is an intractable problem in GAN models. In their raw formulation, GANs are unable to fully impose all the cluster properties of the real data onto the generated data, especially when the real data has skewed clusters, and the generated data are not well controlled. GANs have, however, shown a very strong ability to capture complex data distributions, such as images and audio from raw features, and can therefore be useful for clustering complex data. GANs may improve clustering results without any prior information: if they are capable of generating "nice" samples, it is essentially due to a powerful latent representation, which is worth investigating for the discovery task since it constitutes the cluster DNA. The general idea consists of using a third encoder network that maps a data object to an instance of the latent space Z. A VAE is an autoencoder whose encoding distribution is regularized during training in order to ensure that its latent space has good properties allowing the generation of new data. The term "variational" is derived from the close relationship between this regularization and the variational inference method in statistics. The VAE can be viewed as two coupled but independently parameterized models: the encoder (or recognition model) and the decoder (or generative model). These two models support each other. The recognition model delivers to the generative model an approximation of its posterior over the latent random variables, which the latter needs to update its parameters inside an iteration of "expectation-maximization" learning. Conversely, the generative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data, possibly including class labels. The recognition model is the approximate inverse of the generative model according to Bayes' rule. VAEs and GANs seem to have complementary properties (Kingma et al., 2019): while GANs can generate data of high subjective perceptual quality, they tend to lack full support over the data, as opposed to likelihood-based generative models. VAEs, like other likelihood-based models, generate more dispersed samples but are better density models in terms of the likelihood criterion. As such, many hybrid models have been proposed to try to capture the best of both worlds. Deep clustering techniques can be classified via the deep learning architectures involved, such as CNNs, AEs, CAEs, VAEs, and GANs. They can also be categorized based on the strategy: generative or discriminative. Generative approaches to representation learning build a distribution over data and latent embeddings and use the learned embeddings as image representations; many of these approaches rely either on auto-encoding or on adversarial learning, jointly modeling data and representation. Among discriminative methods, contrastive methods, more devoted to image data, currently achieve state-of-the-art performance in self-supervised learning, arousing extensive attention from researchers.
Based on the learning
algorithms underlying a strategy, they can be roughly classified into four categories: direct cluster optimization (Chang et al., 2018), jointly optimized models (with two variants), and generative models. Methods also differ in the way the network is initialized, i.e., unsupervised pre-trained, supervised pre-trained, or randomly initialized. As explained before, several learning tricks are involved in the different architectures as well as in the different kinds of algorithms.
2 Exploring Categories: A Conceptual Overview

If identifying deep clustering approaches via network architectures is more intuitive, they can also be identified via their conceptual approaches. There is often a correspondence between the conceptualization and a network architecture, but not always. The first category of algorithms directly takes advantage of existing unsupervised deep learning frameworks and techniques (Huang et al., 2014; Tian et al., 2014; Xu et al., 2015). The deep-learning-based clustering task is to learn good feature representations and then run any classical clustering algorithm on the learned representations. Several deep unsupervised learning methods are available that can map data points to meaningful low-dimensional representation vectors. The representation vector contains all the important information of the given data point, and hence clustering on the representation vectors yields better results. This can be seen as a two-stage clustering framework, which separates the processes of representation learning and clustering. Because of the powerful representation learning capability of deep neural networks, the learned representation can remove redundant features of the original data and map high-dimensional data into a low-dimensional feature space, thereby improving the efficiency and performance of clustering tasks. The disadvantage of this category is the mismatch between the representation and the clustering: the clustering algorithm does not participate in representation learning, which leads to a certain blindness of the representation learning. The second category of algorithms tries to explicitly define a clustering loss, simulating the classification error of supervised deep learning. This deep clustering is the combination of representation learning and clustering, whose target is to allow the network to learn a cluster-oriented representation. These methods jointly train the network to learn better features and use the clustering results to direct the network training. This category most of the time harnesses autoencoders that simultaneously optimize clustering and reconstruction losses to preserve local structure. Several algorithms are based on CNN architectures, and those including self-representation learning are very promising. Leveraging the excellent feature learning ability of deep neural networks, these methods substantially outperform traditional clustering methods. Yang et al. (2017) propose a recurrent framework for deep representations and image clusters, which integrates the two processes into a single model with a unified weighted triplet loss and optimizes it end-to-end. DEC (Xie et al., 2016) is one of the pioneer algorithms for joint
representation learning and clustering. Many researchers have proposed variants based on the autoencoder, such as VaDE (Jiang et al., 2016), IDEC (Guo et al., 2017), DCN (Yang et al., 2017), N2D (McConville et al., 2021), COAE (Wang et al., 2019), and ASPC-DA (Guo et al., 2019). This category probably represents the most prominent deep clustering approaches. To synthesize, a first group of methods (e.g., DEC, DAC, DeepCluster, DeeperCluster, or others) leverages the architecture of CNNs as a prior to cluster images. Starting from the initial feature representations, the clusters are iteratively refined by deriving the supervisory signal from the most confident samples or through cluster reassignments calculated offline. A second group of methods (e.g., IIC and IMSAT) proposes to learn a clustering function by maximizing the mutual information between an image and its augmentations. In general, methods that rely on the initial feature representations of the network are sensitive to initialization or prone to degenerate solutions, thus requiring special mechanisms (e.g., pre-training, cluster reassignment, and feature cleaning) to avoid those situations. Based on empirical evidence, compared to the two-stage clustering methods (first category), the joint optimization framework (second category) is more conducive to clustering tasks due to its loss term specifically set for clustering. In this scheme, the defined clustering loss is used to update the parameters of the transforming network and the cluster centers simultaneously, and the cluster assignment is implicitly integrated into soft labels. However, local structure preservation cannot be guaranteed by the clustering loss; the feature transformation may thus be misguided, leading to the corruption of the embedded space. Moreover, most of these works rely on a pseudo k-means loss and do not take into account the clustering complexities that are handled by recent algorithms. As a piece of evidence, most unsupervised deep clustering methods fail to generate appreciable performance on complex datasets, while the current state-of-the-art models generate promising results on simple datasets. These first two categories contribute to the knowledge discovery task, but with a strong limitation due to the imposed constraints that drive the generation of the latent space. The third category of deep clustering methods aims to train the network with only the clustering loss. This category focuses solely on the clustering loss to optimize the deep neural network, usually with CNN architectures, and generally targets the image domain. There is a contribution to the knowledge discovery task, but with a limitation related to the complexity of the clustering loss. In addition, two crucial factors affect the stability and effectiveness of these algorithms: on the one hand, the initialization of the convolutional network; on the other hand, as training goes on, the local structure preservation of the representation cannot be guaranteed. Until 2020, instance discrimination methods based on CNN architectures and self-supervised learning approaches shared a common weakness: the representation
is not encouraged to encode the semantic structure of data. This problem arises because instance-wise contrastive learning treats two samples as a negative pair as long as they come from different instances, regardless of their semantic similarity. This is magnified by the fact that thousands of negative samples are generated to form the contrastive loss, leading to many negative pairs that share similar semantics but are undesirably pushed apart in the embedding space. Very recent proposals have addressed this issue and achieved very promising results. Many state-of-the-art contrastive learning methods are based on the task of instance discrimination. As a typical example, instance discrimination trains a network to classify whether two image crops come from the same source image. The network (e.g., a CNN encoder) projects each image crop into an embedding and pulls the embeddings from the same source closer to each other while pushing embeddings from different sources apart. By solving the task of instance discrimination, the network is expected to learn a useful representation of the image. The fourth category empirically learns the map that transforms the latent variables to the complex data distribution by playing a min-max game. This category, which refers to VAE and GAN architectures, is different from the others. Indeed, neural network models exhibit a dichotomy, falling neatly into two distinctive categories: discriminative and generative models. The discriminative model represents a bottom-up strategy, where data instances traverse from the input layer through hidden layers to ultimately reach the output layer. In contrast, generative models adopt a top-down perspective, guiding data instances along the reverse trajectory. These models come into play for unsupervised pre-training and the resolution of probabilistic distribution challenges. When armed with both an input "x" and its corresponding label "y," a discriminative model learns the probability distribution p(y|x), in simpler terms, the likelihood of "y" given "x." In a parallel vein, a generative model learns the joint probability distribution p(x, y), which subsequently enables the prediction of p(y|x). Generative models have drawn increasing interest from the community and have been developed mainly in two directions: VAE-based models that learn the data distribution via maximum likelihood estimation (MLE) and GAN-based methods that train a generator via adversarial learning. These techniques are very promising and, because of their powerful latent space, can play a major role in the knowledge discovery task. At the moment, however, they appear complex regarding their common usability. For all the progress made in deep clustering, modern approaches are rather successful when dealing with relatively low-semantic datasets such as MNIST, USPS, COIL, etc. However, most state-of-the-art strategies perform very poorly when it comes to clustering high-semantic datasets such as CIFAR, STL, and ImageNet. Generative frameworks as well as joint CNN/clustering approaches using self-supervised learning concepts appear to be better adapted.
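To make the instance-discrimination idea concrete, the following sketch shows a standard InfoNCE-style contrastive loss over two augmented views of a batch; it is a generic example, not tied to any specific method discussed in this chapter.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """InfoNCE-style instance discrimination: two augmented views of the same image
    form a positive pair; all other samples in the batch act as negatives.
    z1, z2: (n, d) embeddings of the two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2n, d)
    sim = z @ z.T / temperature                         # scaled cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float("-inf"))               # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of the positive
    return F.cross_entropy(sim, targets)
```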
References

Chang, J., Meng, G., Wang, L., Xiang, S., & Pan, C. (2018). Deep self-evolution clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 809–823.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2180–2188.
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53–65.
Deng, X., Huang, D., Chen, D.-H., Wang, C.-D., & Lai, J.-H. (2022). Strongly augmented contrastive clustering. arXiv preprint arXiv:2206.00380.
Diallo, B., Hu, J., Li, T., Khan, G. A., Liang, X., & Zhao, Y. (2021). Deep embedding clustering based on contractive autoencoder. Neurocomputing, 433, 96–107.
Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved deep embedded clustering with local structure preservation. In IJCAI (pp. 1753–1759).
Guo, X., Liu, X., Zhu, E., Zhu, X., Li, M., Xu, X., & Yin, J. (2019). Adaptive self-paced deep clustering with data augmentation. IEEE Transactions on Knowledge and Data Engineering, 32(9), 1680–1693.
Huang, P., Huang, Y., Wang, W., & Wang, L. (2014). Deep embedding network for clustering. In 2014 22nd International conference on pattern recognition (pp. 1532–1537). IEEE.
Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2016). Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148.
Kingma, D. P., Welling, M., et al. (2019). An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), 307–392.
Li, Y., Hu, P., Liu, Z., Peng, D., Zhou, J. T., & Peng, X. (2021). Contrastive clustering. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10), 8547–8555.
McConville, R., Santos-Rodriguez, R., Piechocki, R. J., & Craddock, I. (2021). N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding. In 2020 25th International conference on pattern recognition (ICPR) (pp. 5145–5152). IEEE.
Mukherjee, S., Asnani, H., Lin, E., & Kannan, S. (2019). ClusterGan: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI conference on artificial intelligence, AAAI'19/IAAI'19/EAAI'19. AAAI Press.
Sajun, A. R., & Zualkernan, I. (2022). Survey on implementations of generative adversarial networks for semi-supervised learning. Applied Sciences, 12(3), 1718.
Tian, F., Gao, B., Cui, Q., Chen, E., & Liu, T.-Y. (2014). Learning deep representations for graph clustering. In Proceedings of the twenty-eighth AAAI conference on artificial intelligence, AAAI'14 (pp. 1293–1299). AAAI Press.
Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069.
Wang, W., Yang, D., Chen, F., Pang, Y., Huang, S., & Ge, Y. (2019). Clustering with orthogonal autoencoder. IEEE Access, 7, 62421–62432.
Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International conference on machine learning (pp. 478–487). PMLR.
Xu, J., Xiang, L., Liu, Q., Gilmore, H., Wu, J., Tang, J., & Madabhushi, A. (2015). Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Transactions on Medical Imaging, 35(1), 119–130.
Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In International conference on machine learning (pp. 3861–3870). PMLR.
Chapter 10
Deep Clustering Techniques Based on CNN
Pre-trained convolutional neural networks, or ConvNets, have become the building blocks in most computer vision applications. They produce excellent general-purpose features that can be used to improve the generalization of models learned on a limited amount of data. There are, however, two crucial factors that affect the stability and effectiveness of these algorithms. On the one hand, the initialization of the convolutional network is an important factor. On the other hand, as training goes on, the preservation of the local structure of the representation cannot be guaranteed. The focus of the community has therefore shifted to how to learn the representation and perform clustering in an end-to-end fashion. Discriminative approaches learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and the labels are derived from an unlabeled dataset. To harness the power of deep networks on smaller datasets and tasks, pre-trained models (e.g., ResNet-101 trained on ImageNet and VGG-Face trained on a large number of face images) are often used as feature extractors or fine-tuned for the new task. At this stage, the high-level features generated contribute to the discovery task. Many such approaches have relied on heuristics to design pretext tasks, which could limit the generality of the learned representations and the knowledge discovery function. Most models are based on a self-supervised learning paradigm where ground-truth labels are obtained automatically based on the natural groupings of the dataset. One option is to use clustering results as pseudo-labels to guide the pair construction. The other, which is more direct and commonly used, is to treat each instance as a class represented by a feature vector, with data pairs constructed through data augmentations. Discriminative approaches based on contrastive learning in the latent space have recently shown great promise, achieving state-of-the-art results. There are two broad categories of contrastive learning-based clustering models:
• Individual instance-level discrimination, where only the instance and its augmentations are treated as an independent class, while other instances and their augmentations are regarded as different clusters. PICA is a typical example.
• Grouped instance-level discrimination, where the instance and its k nearest neighbors are classified into the same cluster, while other instances and their augmentations are pushed apart. SCAN is a typical example.

This chapter presents the most popular deep clustering techniques based on CNN architectures developed since 2016.
1 JULE: Joint Unsupervised LEarning of Deep Representations and Image Clusters

JULE (Yang et al., 2016) is based on an idea similar to Doersch et al. (2015), Dosovitskiy et al. (2014), Wang and Gupta (2015), and Lee et al. (2009) and is considered a pioneering clustering algorithm. It consists of an end-to-end learning framework to jointly learn deep representations and image clusters from an unlabeled image set $I = \{I_1, \ldots, I_n\}$. The intuition behind JULE is that a better image representation will facilitate clustering, while better clustering results will help representation learning. JULE combines two main objectives: learning meaningful representations from images and clustering similar images together. The general problem of image clustering can be formalized as the minimization of a loss function $L(\cdot)$ as follows:

$$\arg\min_{\theta,\, y}\; L(y, \theta \mid I) \tag{10.1}$$

where $y = \{y_1, \ldots, y_n\}$ denotes the cluster IDs of all images, and $\theta$ denotes the parameters of the representations $X = \{x_1, \ldots, x_n\}$. JULE divides the process into two alternating steps:

• Updating the cluster IDs via an agglomerative process, given the current representation parameters
• Updating the representation parameters via a CNN, given the current clustering result

The optimization therefore alternates between the clustering process and the representation:

$$\arg\min_{y}\; L(y \mid I, \theta), \qquad \arg\min_{\theta}\; L(\theta \mid I, y) \tag{10.2}$$
The joint learning is formulated in a recurrent framework, where the merging operations of agglomerative clustering are expressed as a forward pass and the representation learning of the CNN as a backward pass. In the forward pass, hierarchical image clustering is performed by merging similar clusters, while in the backward pass, the feature representation parameters are updated by minimizing the loss generated in the forward pass. JULE thus relies on a recurrent framework where data is represented via a convolutional neural network and the embedded data is iteratively clustered using an agglomerative clustering algorithm. The main idea behind JULE is that meaningful cluster labels can become supervisory signals for representation learning, and discriminative representations help to obtain meaningful clusters. JULE merges data points and takes the clustering results as supervisory signals to learn a more discriminative representation with a neural network. A recurrent process allows merging clusters over multiple time steps. The idea is to derive a unified loss function comprising the merging process of agglomerative clustering and the updating of the parameters of the deep representation. During the optimization procedure, clustering is conducted in the forward pass and representation learning in the backward pass. At each merging step, the pair of clusters with the highest affinity is selected:

$$\{C_a, C_b\} = \arg\max_{C_i, C_j \in \mathcal{C},\; i \neq j} A(C_i, C_j) \tag{10.3}$$
where the matrix A is an affinity measure between $C_i$ and $C_j$. The fusion is done with respect to an affinity matrix calculated in the latent space. This choice takes into account the local structure of the data, so as to promote the fusion of pairs of clusters that are close to each other but far from the other neighboring clusters. JULE is based on a partial unrolling strategy, splitting the overall T time steps into multiple periods and unrolling one period at a time. The losses from all time steps are accumulated, which is formulated as

$$\mathcal{L}\big(\{y^1, \ldots, y^T\}, \{\theta^1, \ldots, \theta^T\}\big) = \sum_{t=1}^{T} \mathcal{L}^t\big(y^t, \theta^t \mid y^{t-1}, I\big) \tag{10.4}$$

where $y^0$ takes each image as its own cluster. At time step t, two clusters are merged given $y^{t-1}$. The affinity between the two clusters is considered as well as the local structure surrounding them. The loss at time step t is a combination of negative affinities as follows:

$$\mathcal{L}^t\big(y^t, \theta^t \mid y^{t-1}, I\big) = -A\big(C_i^t, \mathcal{N}_{C_i^t}^{K_c}[1]\big) - \frac{\lambda}{K_c - 1}\sum_{k=2}^{K_c}\Big(A\big(C_i^t, \mathcal{N}_{C_i^t}^{K_c}[1]\big) - A\big(C_i^t, \mathcal{N}_{C_i^t}^{K_c}[k]\big)\Big) \tag{10.5}$$

where $\mathcal{N}_{C_i^t}^{K_c}[k]$ denotes the k-th among the $K_c$ nearest neighbor clusters of $C_i^t$.
The first term measures the affinity between a cluster $C_i$ and its nearest neighbor, which follows conventional agglomerative clustering, and the second term measures the difference between the affinity of $C_i$ to its nearest neighbor cluster and the affinities of $C_i$ to its other neighbor clusters. $\lambda$ is a weight parameter. JULE requires tuning a large number of hyperparameters, which is not practical in real-world clustering tasks. In addition, training a recurrent neural network whose number of time steps is equal to the number of data points is not computationally efficient. The code of JULE can be found at https://github.com/jwyang/JULE.torch.
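To make the alternating scheme of Eqs. (10.2) and (10.3) concrete, the following minimal NumPy sketch performs only the forward (merging) passes; the affinity A is simplified to the negative Euclidean distance between cluster centroids, and the CNN update is left as a placeholder comment, so this is an illustration under our own assumptions rather than the authors' implementation.

import numpy as np

def affinity(c_i, c_j):
    # Simplified affinity A(C_i, C_j): negative distance between cluster centroids
    return -np.linalg.norm(c_i.mean(axis=0) - c_j.mean(axis=0))

def merge_step(clusters):
    # Forward pass: merge the pair {C_a, C_b} with the highest affinity (Eq. 10.3)
    best, pair = -np.inf, None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            s = affinity(clusters[a], clusters[b])
            if s > best:
                best, pair = s, (a, b)
    a, b = pair
    merged = np.vstack([clusters[a], clusters[b]])
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

# Toy embeddings produced by the current CNN f_theta (random here, for illustration)
X = np.random.randn(20, 8)
clusters = [x[None, :] for x in X]      # y^0: each image starts as its own cluster
while len(clusters) > 4:
    clusters = merge_step(clusters)     # forward pass (agglomerative merging)
    # backward pass (not shown): update the CNN parameters theta by minimizing
    # the loss of Eq. (10.5) computed from the current cluster assignments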
2 IMSAT: Information Maximizing Self-augmented Training

IMSAT (Hu et al., 2017) is a method for discrete representation learning using deep neural networks based on information maximization (IM) and self-augmented training (SAT). In information maximization, clusters are created by balancing the number of data points in the clusters; only a regularization penalty on the model parameters constrains the cluster assignment. IMSAT modifies information maximization by combining it with self-augmented training, which penalizes differences between the representations of original and augmented data points. Data augmentation is used to model the invariance of the learned data representations (Fig. 10.1). More specifically, data points are mapped into their discrete representations by a deep neural network and regularized by encouraging the predictions to be invariant to data augmentation. The predicted discrete representations then exhibit the invariance specified by the augmentation. SAT encourages the predicted representations of augmented data to be close to those of the original data points. IM maximizes the information dependency between inputs and their mapped outputs.

Let X and Y denote the domains of inputs and discrete representations, respectively. Given training samples $\{x_1, x_2, \ldots, x_n\}$, the task of discrete representation learning is to obtain a function $f: X \to Y$ that maps similar inputs into similar discrete representations. Given a data point x, an augmented training example $T(x)$ is generated, where $T: X \to X$ denotes a predefined data augmentation function. The goal is to minimize the cross-entropy between $p(y \mid x)$ and $p(y \mid T(x))$. The best augmentation of the prediction can be computed with self-augmented training (SAT), which uses data augmentation to impose the intended invariance on the data representations. Essentially, SAT penalizes representation dissimilarity between the original data points and the augmented ones. Regularization using local perturbation is based on the idea that it is preferable for data representations to be locally invariant (i.e., to remain unchanged under local perturbations of the data points), which enables neural networks to learn meaningful representations of the data. The two representative local perturbation methods are RPT (Random Perturbation Training) (Bachman et al., 2014) and VAT (Virtual Adversarial Training) (Miyato et al., 2018).

Fig. 10.1 IMSAT based on SAT and IM from the author Hu et al. (2017)

Let $x \in X$ and $Y \in \{0, \ldots, K-1\}$ (K is the number of clusters) denote random variables for data and cluster assignments, respectively. The objective is

$$\text{Loss} = \sum_{i=1}^{n} R_{\text{sat}}\big(\theta; x_i, T(x_i)\big) - \lambda\,\big[H(Y) - H(Y \mid X)\big] \tag{10.6}$$

where $H(\cdot)$ and $H(\cdot \mid \cdot)$ are the entropy and conditional entropy, respectively, and $\lambda$ is a trade-off parameter. Increasing the marginal entropy $H(Y)$ encourages the cluster sizes to be uniform, while decreasing the conditional entropy $H(Y \mid X)$ encourages unambiguous cluster assignments. The IMSAT algorithm is as follows:
• Input: IMSAT takes as input a set of unlabeled data points, represented as feature vectors.
• Data augmentation: IMSAT performs data augmentation by applying random transformations to the input data points, such as random crops, flips, and rotations. The data augmentation step aims to increase the diversity of the input data and make the clustering problem more challenging.
• Cluster assignments: IMSAT assigns each augmented data point to one of K clusters, using a clustering loss function, such as the deep clustering loss or the normalized cut loss. The clustering loss function encourages the network to group similar data points into the same cluster, while separating dissimilar data points into different clusters.
• Mutual information maximization: IMSAT maximizes the mutual information between the input data and the cluster assignments, by minimizing the negative of the mutual information, using a contrastive loss function. The contrastive loss function encourages the network to learn representations of the data that capture the most informative features for clustering, by contrasting the representations of similar data points with the representations of dissimilar data points.
• Iterative refinement: IMSAT iteratively refines the cluster assignments and the network parameters using a self-augmented training framework. In each iteration, IMSAT applies a new set of random transformations to the input data points and updates the cluster assignments and the network parameters using the clustering loss function and the contrastive loss function. The iterative refinement step aims to improve the quality of the clusters by incorporating the cluster assignments into the learning process and by encouraging the network to learn more informative representations of the data.

IMSAT has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised clustering. The code of IMSAT can be found at https://github.com/weihua916/imsat.
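As an illustration of the objective in Eq. (10.6), the following PyTorch sketch computes the SAT term as a cross-entropy between the predictions on x and on a perturbed T(x), together with the information-maximization term. The classifier net and the Gaussian perturbation standing in for the RPT/VAT augmentation are assumptions of this sketch, not the authors' code.

import torch
import torch.nn.functional as F

def imsat_loss(net, x, lam=0.1, eps=0.1):
    # Predictions for original and perturbed (self-augmented) inputs
    p = F.softmax(net(x), dim=1)                                   # p(y|x)
    p_aug = F.softmax(net(x + eps * torch.randn_like(x)), dim=1)   # p(y|T(x)), RPT-style

    # R_SAT: cross-entropy between predictions on x (treated as fixed) and on T(x)
    r_sat = -(p.detach() * torch.log(p_aug + 1e-8)).sum(dim=1).mean()

    # Marginal entropy H(Y) (to be increased) and conditional entropy H(Y|X) (to be decreased)
    p_marg = p.mean(dim=0)
    h_y = -(p_marg * torch.log(p_marg + 1e-8)).sum()
    h_y_given_x = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

    return r_sat - lam * (h_y - h_y_given_x)                       # Eq. (10.6)

# Usage: loss = imsat_loss(net, batch); loss.backward()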
3 DAC: Deep Adaptive Image Clustering

Deep adaptive image clustering (DAC) (Chang et al., 2017) falls under the category of unsupervised clustering techniques and can be categorized as part of the "direct cluster optimization" family. This approach aims to optimize the clustering process directly, without relying on intermediate steps. Pseudo-label-based clustering first derives some rational pseudo-labels and then selects instances whose pseudo-labels have high confidence to update, explicitly or implicitly, the cluster assignment and to guide the unsupervised training process. DAC transforms clustering into a binary classification problem and studies the pairwise similarity among different instances based on the pseudo-labels used as supervision (Fig. 10.2). At its core, DAC operates through a pairwise classification mechanism to facilitate the creation of clusters. The algorithm takes pairs of data points as input and makes decisions about whether these two instances should be grouped within
Fig. 10.2 DAC. Pairs of instances are compared using the dot product, from source https://divamgupta.com/unsupervised-learning/2019/03/08/an-overview-of-deep-learning-based-clustering-techniques.html
the same cluster or not. The overarching objective is to identify and group together instances that exhibit strong similarities. To achieve this, DAC employs a novel strategy involving the modeling of data correlations. This is accomplished by analyzing the differences between the probability distributions associated with pairs of instances. By quantifying these distribution differences, the algorithm identifies pairs of data points that display high levels of confidence in terms of their correlation, indicating that they should be part of the same cluster. In essence, DAC's approach centers around the careful selection of data pairs that show significant and reliable pairwise correlations. This strategy contributes to the effective formation of clusters by capturing meaningful relationships between instances in the data. As the dot product is a differentiable operation, a backpropagation rule with pairwise training labels can be applied to train the network. Similar to the idea of pseudo-labels, clusters are predicted and used to retrain the network. The architecture is a convolutional neural network with a binary pairwise classification as clustering loss. Besides, this method introduces a regularization term to encourage the learning of one-hot labeled representations. The method is motivated by a basic assumption that the relationship between pairwise images is binary, i.e., $r_{ij} = 1$ indicates that $x_i$ and $x_j$ belong to the same cluster and $r_{ij} = 0$ otherwise. The cosine distance $\langle \cdot, \cdot \rangle$ between all cluster predictions is used to determine whether the input images are similar or dissimilar with a given certainty. A regularization constraint is added to help learn the label features as one-hot encoded features.
$$\text{Loss} = \min_{W,\lambda}\; \sum_{i,j} v_{ij}\, L\big(r_{ij}, \langle I_i, I_j\rangle\big) + u(\lambda) - l(\lambda) \tag{10.7}$$

where $v_{ij} = 1$ indicates that the sample $(x_i, x_j, r_{ij})$ is selected for training and $v_{ij} = 0$ otherwise. When selected for training, $r_{ij} = 1$ if $\langle I_i, I_j\rangle > u(\lambda)$ and $r_{ij} = 0$ if $\langle I_i, I_j\rangle < l(\lambda)$. Here, $\lambda$ is an adaptive parameter controlling the selection, and $u(\lambda)$ and $l(\lambda)$ are the thresholds for selecting similar and dissimilar labeled samples, respectively.

$$L\big(r_{ij}, g(x_i, x_j, w)\big) = -r_{ij}\log g(x_i, x_j, w) - (1 - r_{ij})\log\big(1 - g(x_i, x_j, w)\big) \tag{10.8}$$

where $g(x_i, x_j, w) = \langle I_i, I_j\rangle$. The flowchart of DAC is as follows:

1. The input is a set of unlabeled images.
2. Generate the label features of the images by using a CNN.
3. Calculate the cosine similarities between images based on the label features.
4. Select training samples according to the cosine similarities; the samples depicted in the red boxes represent the omitted samples in the training procedure.
5. Utilize the selected samples to train the ConvNet based on the formulated binary pairwise classification model.
6. Iterate the previous steps until all the samples are considered for training.

Conclusively, images are clustered by locating the largest response of the label features. DAC has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised image clustering and can be used for a variety of applications, including image retrieval, content-based image retrieval, and image segmentation. The code of DAC can be found at https://github.com/JiaxinZhuang/DAC.Pytorch.
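The pair selection and binary pairwise loss of Eqs. (10.7) and (10.8) can be sketched as follows in PyTorch; the thresholds u and l are fixed here instead of being adapted with λ, and label_feats stands for the label features produced by the ConvNet (both are assumptions of this sketch).

import torch

def dac_loss(label_feats, u=0.95, l=0.45):
    # Cosine similarities between all label features (features are L2-normalized)
    f = torch.nn.functional.normalize(label_feats, dim=1)
    g = f @ f.t()                                   # g(x_i, x_j) = <I_i, I_j>

    # Select confident pairs: similar if g > u, dissimilar if g < l
    r = (g > u).float()                             # r_ij = 1 for selected similar pairs
    v = ((g > u) | (g < l)).float()                 # v_ij = 1 if the pair is selected at all
    v = v.fill_diagonal_(0)                         # ignore trivial self-pairs

    # Binary pairwise classification loss of Eq. (10.8), applied to selected pairs only
    eps = 1e-8
    loss = -(r * torch.log(g.clamp(min=eps)) + (1 - r) * torch.log((1 - g).clamp(min=eps)))
    return (v * loss).sum() / v.sum().clamp(min=1.0)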
4 SCAN: Semantic Clustering by Adaptive Nearest Neighbors

SCAN (Van Gansbeke et al., 2020) is an unsupervised learning algorithm for clustering high-dimensional data, such as text documents or images, based on their semantic similarity. The central idea behind SCAN is to encourage the clustering model to assign the same labels to instances and their corresponding nearest neighbors (Fig. 10.3), thus reinforcing the quality of clustering outcomes. The algorithm is based on the idea of adapting the nearest neighbors metric to the local structure of the data and using this metric to cluster the data points into
Fig. 10.3 Conceptual idea of SCAN based on the use of neighbors as pretext task from the author Van Gansbeke et al. (2020)
meaningful groups. The aim of the algorithm is to train a neural network $\phi$ which classifies a data pattern $X_i$ and its mined neighbors $N_{X_i}$ into the same cluster. A pretext task $\tau$ learns, in a self-supervised fashion, an embedding function $\phi$ parameterized by a neural network with weights $\theta$. The pretext task $\tau$ aims to minimize the distance between images $X_i$ and their augmentations $T(X_i)$, which can be expressed as

$$\min_{\theta}\; d\big(\phi(X_i), \phi(T(X_i))\big) \tag{10.9}$$
SCAN first carries out a pretext task based on contrastive learning to identify nearest neighbors, i.e., samples with strong similarities, across the entire dataset. Then, for every $X_i$ in the dataset, the neighbors $N_{X_i}$ are mined based on the embeddings from the pretext task. In the next stage, SCAN performs further learning and clustering optimization based on these nearest neighbors. In this way, the clustering model is encouraged to output the same labels for similar instances, which further improves the clustering performance, and the clustering result is obtained via this second-stage learning and clustering optimization. In SCAN, a clustering function parameterized by a neural network $\phi$ with weights $\nu$ is learned to classify an instance $X_i$ and its associated neighbors $N_{X_i}$ collectively, optimizing for the clustering task (Fig. 10.4). Let $C = \{1, \ldots, C\}$ be the set of clusters, with $\phi_\nu(X_i) \in [0, 1]^C$. The loss function is as follows:

$$\text{Loss} = -\frac{1}{|D|}\sum_{X \in D}\sum_{k \in N_X} \log\big\langle \phi_\nu(X), \phi_\nu(k)\big\rangle + \lambda \sum_{c \in C} \phi_\nu^{\prime c}\log \phi_\nu^{\prime c} \tag{10.10}$$

where $\langle \cdot, \cdot \rangle$ stands for the dot product, $D = [X_1, \ldots, X_D]$ denotes the set of images, and $\phi_\nu^{\prime c} = (1/|D|)\sum_{X \in D}\phi_\nu^{c}(X)$. The first term imposes on $\phi_\nu$ to make consistent predictions for a sample $X_i$ and its neighbors. The second, entropy, term spreads the predictions uniformly across the clusters and therefore prevents $\phi_\nu$ from assigning all samples to a single cluster.
Fig. 10.4 SCAN: loss adaptation from the author Van Gansbeke et al. (2020)
SCAN focuses on the clustering phase and largely improves performance based on a given pre-designed representation learning. Considering the imperfect embedding features, the local nearest samples in the embedding space do not always have the same semantics, especially when the samples lie around the borderlines between different clusters, which may compromise the performance. Essentially, SCAN only utilizes the instance similarity for training the clustering model without explicitly exploring the semantic discrepancy between clusters, so that it cannot identify the semantically inconsistent samples. The SCAN algorithm is as follows:

• Input: The SCAN algorithm takes a set of high-dimensional data points as input, where each point is represented by a feature vector.
• Nearest neighbors graph: The algorithm constructs a graph of the data points, where each point is connected to its k-nearest neighbors based on the Euclidean distance between their feature vectors. The value of k is chosen based on the size and density of the data and can be adjusted during the course of the algorithm.
• Adaptive nearest neighbors: The algorithm adapts the nearest neighbors metric to the local structure of the data, by defining a distance metric that varies depending on the density of the data points around each point. Specifically, the distance metric is defined as the sum of the Euclidean distance between the two points and a penalty term that depends on the density of the data points around the two points. This penalty term encourages the algorithm to cluster together data points that are semantically similar but may not be close in Euclidean distance.
• Clustering: The algorithm performs clustering on the graph of data points, using a clustering algorithm that is based on the adaptive nearest neighbors metric. Specifically, the algorithm iteratively removes edges from the graph that connect dissimilar points, until the graph is split into connected components that represent distinct clusters of semantically similar data points.
• Semantic analysis: Finally, the algorithm performs semantic analysis on the clusters to identify the most representative features of each cluster and to assign meaningful labels to the clusters based on these features.

SCAN has been shown to be effective for clustering high-dimensional data in a variety of domains, including text documents, images, and gene expression data. One advantage of SCAN over other clustering algorithms is its ability to adapt the nearest neighbors metric to the local structure of the data and to capture the semantic similarity between data points that may not be close in Euclidean distance. The code of SCAN can be found at https://github.com/wvangansbeke/Unsupervised-Classification.
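A minimal PyTorch sketch of the loss in Eq. (10.10) is given below; probs and neighbor_probs are assumed to be the softmax outputs of φ_ν for a batch of samples and for one mined neighbor per sample, respectively.

import torch

def scan_loss(probs, neighbor_probs, lam=5.0):
    # Consistency term of Eq. (10.10): log of the dot product between a sample
    # and its mined neighbor in the space of cluster probabilities
    dot = (probs * neighbor_probs).sum(dim=1)
    consistency = -torch.log(dot + 1e-8).mean()

    # Entropy term: sum_c p_bar_c log p_bar_c (= -H(p_bar)); minimizing it spreads
    # the predictions across the clusters and avoids a single-cluster solution
    p_mean = probs.mean(dim=0)
    neg_entropy = (p_mean * torch.log(p_mean + 1e-8)).sum()

    return consistency + lam * neg_entropy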
5 NNM: The Nearest Neighbor Matching

To extend the SCAN method, the Nearest Neighbor Matching (NNM) method (Dang et al., 2021) is based on the idea of matching samples with their nearest neighbors at both the local and the global level (Fig. 10.5). During training, the algorithm learns the mapping between image features and class labels by minimizing a loss function that measures the difference between the predicted labels and the true labels of the reference set. Since the embedding features are not perfect, similar instances do not always have the same semantics, especially when the samples lie near the borderlines of different clusters.
Fig. 10.5 NNM (basic idea) from the author Dang et al. (2021)
Therefore, only using the instance similarity and ignoring the semantic discrepancy between clusters to guide model training may limit the clustering performance. The loss function includes a global and a local loss as well as an entropy term:

$$L_{NNM} = L_G + L_C + \lambda L_H \tag{10.11}$$
where $\lambda$ is a weight parameter. The global and local losses are composed of consistency and class losses: the first aims to maximize the similarity between samples and their neighbors, while the second aims to keep the sample views of the clustering assignments consistent. The entropy term serves to prevent the trivial solution that assigns a majority of samples to a minority of clusters. Here is a simplified flowchart of the NNM algorithm:

• Input: NNM takes as input a reference set of labeled images, which is used to learn the mapping between image features and class labels, and a query image, which is the image to be classified.
• Feature extraction: NNM extracts a set of features from each image in the reference set and from the query image. The features can be extracted using any feature extraction algorithm, such as SIFT, HOG, or a deep convolutional neural network.
• Nearest neighbor search: NNM computes the Euclidean distance between the features of the query image and the features of each image in the reference set and finds the image in the reference set that is closest to the query image in the feature space. This image is called the nearest neighbor.
• Label prediction: NNM uses the label of the nearest neighbor as the predicted label for the query image. If there are multiple nearest neighbors with different labels, NNM can use a majority vote or a weighted vote to determine the predicted label.

NNM is a simple and effective algorithm for image classification and can be used as a baseline for more complex algorithms such as deep convolutional neural networks. One limitation of NNM is that it relies heavily on the quality of the feature extraction algorithm and may not perform well if the features do not capture the relevant information for the classification task. The code of NNM can be found at https://github.com/ZhiyuanDang/NNM.
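Following the simplified flowchart above, a minimal NumPy sketch of nearest-neighbor matching with a majority vote is shown below; the random features and labels are placeholders for features extracted with SIFT, HOG, or a CNN.

import numpy as np

def nnm_predict(query_feat, ref_feats, ref_labels, k=5):
    # Nearest-neighbor search: Euclidean distances between the query and the reference set
    d = np.linalg.norm(ref_feats - query_feat[None, :], axis=1)
    nn_idx = np.argsort(d)[:k]
    # Label prediction by majority vote among the k nearest neighbors
    votes = ref_labels[nn_idx]
    return np.bincount(votes).argmax()

# Toy usage (placeholder data)
ref_feats = np.random.randn(100, 16)
ref_labels = np.random.randint(0, 3, size=100)
query = np.random.randn(16)
print(nnm_predict(query, ref_feats, ref_labels))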
6 DeepCluster

DeepCluster (Caron et al., 2018) is a deep learning algorithm for unsupervised image clustering that jointly learns class labels and CNN feature representations. The core concept behind DeepCluster is this simultaneous learning of class labels and convolutional neural network (CNN) feature representations, and the approach is built upon the principle of clustering feature vectors extracted from a pre-trained CNN.
Fig. 10.6 Pseudo-label generation (source: Chaudhary, 2020b): a randomly initialized network bootstraps the first set of image labels; a model is trained on these labels, new labels are generated from the trained model, and the process is repeated
DeepCluster is introduced in Fig. 10.7; it mainly iterates between (A) generating pseudo-labels and (B) learning the network parameters based on these pseudo-labels (Fig. 10.6). To generate pseudo-labels, the entire dataset is fed forward into the network, and a feature vector is computed for each sample. The input is thus projected from the Euclidean space into a ConvNet feature space. The K-means clustering algorithm is then applied to the generated features to assign similar samples to a cluster. The assignments of the samples are considered to be the pseudo-labels on which the model will be trained. Clustering-based approaches for deep networks typically build target classes by clustering visual features produced by ConvNets. Given a training set $X = \{x_1, x_2, \ldots, x_N\}$ of N images, we want to find a parameter $\theta$ such that the mapping $f_\theta$ produces good general-purpose features. Consider a latent pseudo-label $z_n \in \{0, 1\}^k$ (k being the number of predefined classes) in Z for each image n, as well as a corresponding linear classifier W. DeepCluster alternates between learning the parameters $\theta$ and W and updating the pseudo-labels $z_n$ (Fig. 10.7). Between two reassignments, the pseudo-labels $z_n$ are fixed, and the parameters and the classifier are optimized by solving

$$\min_{\theta, W}\; \frac{1}{N}\sum_{n=1}^{N} \ell\big(z_n, W f_\theta(x_n)\big) \tag{10.12}$$

where $\ell$ is a multinomial logistic loss. Then, the pseudo-labels $z_n$ can be reassigned by minimizing an auxiliary loss function. With DeepCluster, the latent targets are obtained by clustering the activations with K-means. More precisely, the targets $z_n$ are updated by solving the following optimization problem:
$$\min_{C \in \mathbb{R}^{d \times k}}\; \sum_{n=1}^{N} \min_{z_n \in \{0,1\}^k} \big\| C z_n - f_\theta(x_n) \big\|_2^2 \tag{10.13}$$

Fig. 10.7 DeepCluster pipeline from the author Caron et al. (2018)
where C is the matrix whose columns correspond to the centroids, k is the number of centroids, and $z_n$ is a binary vector with a single nonzero entry. This approach assumes that the number of clusters k is known a priori; in practice, we set it by validation on a downstream task. The latent targets are updated every T epochs. Here is a simplified overview of the DeepCluster algorithm:

• Input: DeepCluster takes as input a large set of unlabeled images.
• Feature extraction: DeepCluster extracts feature vectors from the images using a pre-trained CNN, such as VGG or ResNet. The feature vectors are obtained by passing the images through the CNN and extracting the activations of one of the intermediate layers.
• Clustering: DeepCluster clusters the feature vectors using the K-means algorithm, where K is the number of clusters. The K-means algorithm is applied to the feature vectors using a large batch size, and the cluster centroids are used as the cluster prototypes.
• Fine-tuning: DeepCluster fine-tunes the CNN and the cluster prototypes jointly, using a joint loss function that combines the clustering loss with a reconstruction loss. The reconstruction loss encourages the network to learn a more informative
representation of the data, by minimizing the difference between the input images and their reconstructions from the network.
• Iterative refinement: DeepCluster iteratively refines the cluster assignments and the network parameters using a self-augmented training framework. In each iteration, DeepCluster applies a new set of random transformations to the input images and updates the cluster assignments and the network parameters using the joint loss function. The iterative refinement step aims to improve the quality of the clusters by incorporating the cluster assignments into the learning process and by encouraging the network to learn more informative representations of the data.
• Hierarchy construction: DeepCluster constructs a hierarchical clustering tree from the cluster assignments obtained in the previous step, using the agglomerative clustering algorithm. The hierarchical clustering tree represents the similarity relationships between the images and can be used for tasks such as image retrieval and image segmentation.

A notable aspect of DeepCluster is its generalizability. The method is adaptable to different CNN architectures because the class labels are inferred using K-means clustering based on the features extracted from the CNN. This allows DeepCluster to be applied to a variety of CNN models. DeepCluster has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised image clustering and can be used for a variety of applications, including image retrieval, content-based image retrieval, and image segmentation. The code of DeepCluster can be found at https://github.com/facebookresearch/deepcluster.
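The alternation between pseudo-label generation (K-means) and network updates (Eq. (10.12)) can be sketched as follows; encoder, classifier, and a loader that yields batches of images in a fixed order are assumptions of this sketch, which uses scikit-learn's KMeans for step (A).

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deepcluster_epoch(encoder, classifier, loader, optimizer, k=10):
    # (A) Generate pseudo-labels: encode the whole dataset and cluster with K-means
    with torch.no_grad():
        feats = torch.cat([encoder(x) for (x,) in loader]).cpu().numpy()
    pseudo = torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(feats),
                             dtype=torch.long)

    # (B) Update theta and W by minimizing the multinomial logistic loss of Eq. (10.12)
    offset = 0
    for (x,) in loader:
        z = pseudo[offset:offset + x.size(0)]
        offset += x.size(0)
        loss = F.cross_entropy(classifier(encoder(x)), z)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()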
7 Deep Clustering with Sample-Assignment Invariance Prior

In Peng et al. (2019), the clustering method focuses on minimizing the discrepancy between pairwise sample assignments for each data point. This work introduces the concept of sample-assignment invariance, which treats labels as ideal representations. The ideal representation for clustering or classification tasks is a vector with only one nonzero entry indicating the index of the assigned cluster. This representation ensures consistent predictions across different metrics, as it can be considered the predicted label itself. The main objective of the proposed method is to learn a metric-invariant space that captures the underlying structure of the data effectively. The approach is one of the first end-to-end clustering algorithms that jointly learns both the clustering assignment and the data representation. It allows for better integration and optimization of the clustering process and has the potential
to achieve higher performance compared to traditional methods that treat representation learning and clustering as separate steps. The proposed method distinguishes itself from many existing subspace clustering techniques by simultaneously learning the data representation and performing clustering. This integration ensures that the learned representation is optimized for the specific clustering task, leading to improved performance. The method consists of two steps:

• The first step learns the representation, mapping inputs into a latent space in the forward pathway of the neural network.
• The second step implements data clustering in the backward pathway of the neural network, which simultaneously propagates a supervision signal to update the clustering membership and the parametric transformations.

With such a strategy, even if no manual annotation is provided, the neural network can still be trained in an end-to-end manner, which leads to better representation and clustering results.
8 IIC: Invariant Information Clustering

IIC (Ji et al., 2019) introduces Invariant Information Clustering as a discriminative model designed to cultivate noise-independent features by leveraging pairs of both unaltered original images and images with added noise. The essence of IIC lies in maximizing the mutual information between these paired data instances. The underlying principle of IIC involves using paired data for the mutual information computation. These pairs consist of an original image and a version of that image that has been randomly perturbed. This perturbation introduces controlled noise, enhancing the model's ability to identify genuine patterns by emphasizing invariances that persist even in the presence of varying degrees of noise. The algorithm is based on the idea of clustering the feature vectors obtained from a deep neural network $\phi$ (from X to $Y = \{1, \ldots, C\}$), assigning each sample to one of C clusters while maximizing the mutual information between the cluster assignments and the original images. $\phi(x) \in [0, 1]^C$ represents the distribution of a discrete random variable z over the C classes, formally given by $P(z = c \mid x) = \phi_c(x)$. The conditional joint distribution of a pair is given by $P(z = c, z' = c' \mid (x, x')) = \phi_c(x) \times \phi_{c'}(x')$. For n patterns, the joint probability is given by a matrix $P \in \mathbb{R}^{C \times C}$:

$$P = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\,\phi(x_i')^{\top} \tag{10.14}$$
The mutual information of the pair of cluster assignment variables is then

$$I(z, z') = \sum_{c=1}^{C}\sum_{c'=1}^{C} P_{cc'} \ln \frac{P_{cc'}}{P_c \times P_{c'}} \tag{10.15}$$
where $P_{cc'} = P(z = c, z' = c')$, and $P_c$ and $P_{c'}$ are the marginal probabilities. The goal behind IIC is to learn a representation that preserves what is common between a pair of data x and $x'$, with $x' = g(x)$ a perturbed version of x obtained through geometric and photometric transformations such as scaling, skewing, rotation, flipping, contrast/saturation changes, etc. The idea is that images should belong to the same class regardless of the augmentation; the augmentation has to be a transformation to which the neural network should be invariant:

$$\max_{\phi}\; I\big(\phi(x), \phi(x')\big) \tag{10.16}$$

where $\phi$ represents the CNN function. This generic clustering algorithm directly trains a randomly initialized neural network into a classification function, end-to-end, and without any labels. The CNN converts the image into a vector of probability values using a softmax function in the output layer, as in a classical soft clustering approach (Fig. 10.8). IIC directly learns semantic labels, rather than intermediate representations, based on the mutual information between image pairs. The mutual information score is expanded into a loss function, namely the mutual information between the function's classifications for paired data samples:

$$I(z, z') = H(z) - H(z \mid z') \tag{10.17}$$
Maximizing this quantity involves a trade-off between minimizing the conditional cluster assignment entropy $H(z \mid z')$ and maximizing the entropy of the individual cluster assignments $H(z)$. The input data can be of any modality and, since the clustering space is discrete, the mutual information can be computed exactly. IIC maximizes the mutual information between augmented views of an image. In Fig. 10.8, $x'$ is the randomly perturbed version of the image, with $x' = g(x)$. The authors do not maximize directly over the output distributions but over the class distribution, which is approximated for every batch. They employ an auxiliary output layer, parallel to the main output layer, trained to produce an over-clustering in order to increase performance in the unsupervised case. Here is a simplified overview of the IIC algorithm:

• Input: IIC takes as input a set of unlabeled images.
• Feature extraction: IIC extracts feature vectors from the images using a deep neural network, such as a convolutional neural network (CNN). The feature vectors are obtained by passing the images through the CNN and extracting the activations of one of the intermediate layers.
• Clustering: IIC clusters the feature vectors using the K-means algorithm, where K is the number of clusters. The K-means algorithm is applied to the feature
Fig. 10.8 IIC (Ji et al., 2019); dashed lines denote shared parameters, g is a random transformation, and $I(a, b)$ denotes the mutual information function
vectors using a large batch size, and the cluster assignments are used as the output of the clustering layer.
• Mutual information maximization: IIC maximizes the mutual information between the cluster assignments and the original images, by minimizing a contrastive loss function that encourages the clustering layer to produce similar cluster assignments for similar images and dissimilar cluster assignments for dissimilar images. The contrastive loss function is based on the idea of learning invariant representations, where the network is encouraged to learn features that are invariant to different transformations of the input images, such as rotations and translations.
• Iterative refinement: IIC iteratively refines the cluster assignments and the network parameters using a self-augmented training framework. In each iteration, IIC applies a new set of random transformations to the input images and updates the cluster assignments and the network parameters using the mutual information maximization objective. The iterative refinement step aims to improve the quality of the clusters by incorporating the cluster assignments into the learning process and by encouraging the network to learn more informative representations of the data.

The method is not specialized to computer vision and operates on any paired dataset samples. Up to this point, the method is completely unsupervised. The first unsupervised stage can be seen as a self-supervised pretext task. In contrast to other pretext tasks, this task already predicts representations that can be seen as classifications. IIC has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised image clustering. The code of IIC can be found at https://github.com/xu-ji/IIC.
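The joint matrix of Eq. (10.14) and the mutual information objective of Eq. (10.15) can be written compactly in PyTorch as below; p_x and p_tx are assumed to be the softmax outputs for a batch of images and for their perturbed versions.

import torch

def iic_loss(p_x, p_tx, eps=1e-8):
    # p_x, p_tx: softmax outputs phi(x) and phi(g(x)) for a batch, shape (n, C)
    n, C = p_x.shape
    P = (p_x.unsqueeze(2) * p_tx.unsqueeze(1)).sum(dim=0) / n    # joint matrix, Eq. (10.14)
    P = ((P + P.t()) / 2).clamp(min=eps)                         # symmetrize, as in the IIC paper
    Pc = P.sum(dim=1, keepdim=True)                              # marginals P_c
    Pc_ = P.sum(dim=0, keepdim=True)                             # marginals P_c'
    mi = (P * (torch.log(P) - torch.log(Pc) - torch.log(Pc_))).sum()   # Eq. (10.15)
    return -mi   # maximizing mutual information = minimizing its negative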
9 IDFD: Instance Discrimination and Feature Decorrelation

IDFD (Tao et al., 2021) combines instance discrimination and feature decorrelation into representation learning to improve the performance of complex image clustering. The idea behind IDFD is to learn similarities among data while removing redundant correlations among features via feature decorrelation. The algorithm exploits the idea of spectral clustering to find a low-dimensional embedding of the data in the eigenspace of the Laplacian matrix, which is derived from pairwise similarities between data points. Input images are converted into feature representations in a lower d-dimensional latent space via non-linear mapping with deep neural networks such as ResNet or other architectures. Using the embedded representations, the data are clustered by the K-means algorithm in the low-dimensional space. Similar to Wu et al. (2018), instance discrimination is considered by learning similarities among data to improve the performance of nearest neighbor classification. Each unlabeled instance is treated as its own distinct class, and discriminative representations are learned to distinguish between individual instance classes. The network learns to map similar instances close together in the feature space while pushing dissimilar instances apart. Figure 10.9 illustrates the step-by-step learning process for image clustering. The initial input images X undergo a transformation into feature representations V within a lower-dimensional latent space, accomplished through non-linear mapping utilizing deep neural networks, like the ResNet architecture or others. These d-dimensional vectors are simultaneously learned via a combination of instance discrimination and feature decorrelation techniques. Subsequently, the learned feature representations are clustered using a method such as classical K-means, yielding the final clustering outcomes.
Fig. 10.9 IDFD pipeline from the author Tao et al. (2021)
The objective function is formulated based on the softmax criterion. Feature decorrelation aims to reduce the redundancy and correlation among the features learned by the neural network; it is achieved by removing redundant correlations among features. Constraints are introduced to make the latent features orthogonal and therefore ensure that redundant information is reduced. The IDFD objective, which combines instance discrimination and feature decorrelation learning, is as follows:

$$L_{IDFD} = L_I + \alpha L_F \tag{10.18}$$

where $\alpha$ is a weight that balances the contributions of the two terms $L_I$ and $L_F$.

$$L_F = \|V V^{\top} - I\|^2 = \sum_{l=1}^{d}\Big[(f_l^{\top} f_l - 1)^2 + \sum_{j=1,\, j \neq l}^{n}(f_j^{\top} f_l)^2\Big] \tag{10.19}$$

where f is the set of latent feature vectors and $f_l$ denotes the l-th feature vector. The transposition of the latent vectors V coincides with $\{f_l\}_{l=1}^{d}$, where d is the dimensionality of the representations.

$$L_I = -\sum_{l=1}^{d} \log Q(l \mid f) \tag{10.20}$$

where

$$Q(l \mid f) = \frac{\exp(f_l^{\top} f/\tau_2)}{\sum_{m=1}^{d}\exp(f_m^{\top} f/\tau_2)} \tag{10.21}$$

$Q(l \mid f)$ measures how correlated a feature vector is to itself and how dissimilar it is to the others. $\tau_2$ is a temperature parameter. IDFD achieves accuracies comparable to state-of-the-art values on the CIFAR-10 and ImageNet-10 datasets using simple K-means. The code source of IDFD is available at https://github.com/TTN-YKK/Clustering_friendly_representation_learning.
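A minimal PyTorch sketch of the combination in Eq. (10.18) is shown below; the instance-discrimination term L_I is assumed to be computed elsewhere (e.g., with a memory bank as in Wu et al., 2018), and only the feature decorrelation term of Eq. (10.19) is implemented explicitly.

import torch
import torch.nn.functional as F

def feature_decorrelation(z):
    # z: batch of latent representations, shape (n, d)
    # Columns of z are the feature vectors f_l of Eq. (10.19); normalize them over the batch
    f = F.normalize(z, dim=0)            # each f_l has unit L2 norm over the n samples
    G = f.t() @ f                        # d x d matrix of inner products f_j^T f_l
    I = torch.eye(G.size(0), device=G.device)
    return ((G - I) ** 2).sum()          # squared Frobenius norm ||V V^T - I||^2

def idfd_loss(instance_loss, z, alpha=1.0):
    # Eq. (10.18): combine an instance-discrimination loss L_I (computed elsewhere)
    # with the feature decorrelation term L_F
    return instance_loss + alpha * feature_decorrelation(z)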
10 SimCLR: Simple Framework for Contrastive Learning of Visual Representations

SimCLR (Chen et al., 2020) aims to learn meaningful representations from unlabeled data by creating pretext tasks. These pretext tasks involve maximizing agreement between different views of the same image while minimizing agreement
Fig. 10.10 SimCLR (from https://blog.research.google/2020/04/advancing-selfsupervised-and-semi.html). Minimize the distance between images that contain the same object and maximize the distance between images that contain vastly different objects
between views of different images. The SimCLR framework is built on the two main ideas of self-supervision and data augmentation. SimCLR employs extensive data augmentation to create different views of the same image (Fig. 10.10). It generates multiple augmented versions of an image and treats each version as a different view of that image. SimCLR follows a contrastive learning approach. Given two views (a positive pair) from the same image, the model aims to bring their representations closer in the embedding space. At the same time, it pushes the representations of different images (negative pairs) farther apart. An image is taken and random transformations are applied to it to obtain a pair of two augmented images $x_i$ and $x_j$. Each image in that pair is passed through an encoder, such as ResNet-50 or another architecture with the same hyperparameters, to get representations. The representations $h_i$ and $h_j$ of the two augmented images are then passed through a series of fully connected layers that apply a non-linear transformation and project them into representations $z_i$ and $z_j$. The task is to maximize the similarity between these two representations $z_i$ and $z_j$ of the same image based on the cosine similarity. SimCLR uses a contrastive loss called the NT-Xent loss (Normalized Temperature-Scaled Cross-Entropy Loss). Both positive and negative pairs are made from the same mini-batch: if the mini-batch size is n, n positive pairs are generated by augmentation. SimCLR uses a base encoder (typically a deep convolutional neural network) to extract image representations and learns them by maximizing the agreement between differently augmented views of the same data example via a contrastive loss in the latent space (Fig. 10.11).
Fig. 10.11 SimCLR. Flow formalization from https://amitness.com/2020/03/illustrated-simclr/ (Chaudhary, 2020a)
Fig. 10.12 SimCLR. Maximize the similarity between the two representations $z_i$ and $z_j$ of the same image (Chen et al., 2020)
It employs a projection head to map the representations to a different feature space. The representations $h_i$ and $h_j$ of the two augmented images are passed through a series of non-linear $Dense \to ReLU \to Dense$ layers that apply a non-linear transformation and project them into representations $z_i$ and $z_j$ (Figs. 10.11 and 10.12). This mapping, denoted $g(\cdot)$, is called the projection head and allows the model to learn more robust and semantically meaningful representations. To compare the representations produced by the projection head (Fig. 10.12), the cosine similarity is used:

$$s_{ij} = \frac{z_i^{\top} z_j}{\tau\,\|z_i\|\,\|z_j\|} \tag{10.22}$$

where $\tau$ is a temperature parameter.
SimCLR uses a contrastive loss function based on the InfoNCE loss (Normalized Cross-Entropy). A mini-batch of N examples is sampled, and the contrastive prediction task is defined on the pairs of augmented examples derived from the mini-batch, resulting in 2N data points. Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N-1)$ augmented examples within the mini-batch are treated as negative examples. The contrastive loss encourages the positive pairs to have high similarity scores while pushing the negative pairs to have low similarity scores. To stabilize the learning process and improve performance, SimCLR applies $L_2$ normalization to the projected representations and introduces a temperature parameter in the contrastive loss. The temperature parameter scales the similarity scores, effectively controlling the sensitivity of the contrastive learning objective. The loss function for a positive pair of examples (i, j) is defined as

$$\ell_{ij} = -\log \frac{\exp\big(\langle z_i, z_j\rangle/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\exp\big(\langle z_i, z_k\rangle/\tau\big)} \tag{10.23}$$

where $\langle \cdot, \cdot \rangle$ stands for the dot product or cosine similarity (see Eq. 10.22), $\tau$ is the temperature parameter, and $\mathbb{1}_{[k \neq i]}$ is an indicator function evaluating to 1 if $k \neq i$. The resulting loss for the batch is

$$\text{Loss} = \frac{1}{2N}\sum_{k=1}^{N}\big[\ell(2k-1, 2k) + \ell(2k, 2k-1)\big] \tag{10.24}$$
In SimCLR, a single encoder processes a mini-batch of data to create the positive and negative pairs for training, which ensures that these pairs have consistent representations generated by an encoder with fixed parameters: although the encoder is updated after every mini-batch, its parameters remain unchanged while the pairs of a given mini-batch are created. This is crucial to avoid inconsistent representation generation, where the same input image would yield different representations in different mini-batches. Because the encoder is updated with each mini-batch, the representations generated in the current mini-batch become "older" compared to future mini-batches; consequently, mixing representations from different mini-batches is not optimal for training. This is also why SimCLR uses large batch sizes. SimCLR has shown impressive results in unsupervised representation learning and has become one of the leading methods in the field. It has been successfully applied to various computer vision tasks, including image classification, object detection, and segmentation, outperforming some supervised learning approaches with the same architecture on specific benchmarks. Processing more examples in the same batch, using bigger networks, and training for longer all lead to significant improvements. As a known weakness, SimCLR directly uses the negative samples coexisting in the current batch, and it needs large batches to guarantee a sufficient number of negative pairs. SimCLR requires big hardware infrastructures while
being considered the state of the art for self-supervised learning, outperforming competitive methods on ImageNet. The source code of SimCLR is available at https://github.com/google-research/simclr.
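The NT-Xent loss of Eqs. (10.23) and (10.24) can be implemented compactly as below; z1 and z2 are assumed to be the projections of the two augmented views of a batch, and the ordering of the 2N samples differs from the (2k-1, 2k) indexing of Eq. (10.24) but is equivalent.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # z1, z2: projections of the two augmented views, shape (N, d) each
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # 2N points, L2-normalized
    sim = z @ z.t() / tau                             # pairwise cosine similarities / tau
    sim.fill_diagonal_(float('-inf'))                 # exclude k = i (indicator in Eq. 10.23)

    # Positives: sample i in z1 matches sample i + N (its other augmentation), and vice versa
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)              # averages l(i, j) over all 2N positive pairs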
11 MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

MoCo (He et al., 2020) learns unsupervised visual features using momentum contrast, where the goal is to maximize the similarity between different views of the same image (positive pairs) and minimize the similarity between views of different images (negative pairs). MoCo can be seen as an improvement of SimCLR. In SimCLR, because the encoder is updated with each mini-batch, the representations generated in the current mini-batch become "older" compared to future mini-batches; consequently, mixing representations from different mini-batches is not optimal for training, which is also why SimCLR uses large batch sizes to avoid inconsistency. MoCo is therefore proposed to make the representations encoded from different mini-batches consistent. Unlike SimCLR, in which only one encoder is used, MoCo uses two encoders: an encoder and a momentum encoder (Fig. 10.13). The parameters of the momentum encoder are updated as

$$\theta_k = m\,\theta_k + (1 - m)\,\theta_q \tag{10.25}$$

Fig. 10.13 Intuitive idea of MoCo from source (He et al., 2020)
In this manner, the parameters of the momentum encoder are updated consistently with the encoder. As a consequence, the keys generated by the momentum encoder for different mini-batches are also consistent, which enables good performance according to the paper. MoCo uses a contrastive loss function, often based on the InfoNCE loss (Normalized Cross-Entropy), which encourages the positive pairs (corresponding views from the same image) to have higher similarity scores than the negative pairs (views from different images). Both the encoder and the momentum encoder encode two augmented versions of the same input images, and the encoded representations are called queries and keys, respectively. The keys are preserved in a queue for later use. During a training step, positive pairs are constructed from the queries and keys of the current mini-batch, while negative pairs are constructed from the queries of the current mini-batch and the keys from previous mini-batches. The name difference between the encoder and the momentum encoder indicates how their parameters are updated: while the encoder is updated by backpropagation as in SimCLR, the momentum encoder is updated by linear interpolation of the two encoders. MoCo also introduces a memory, implemented as a queue, that stores representations of past images; when a new batch of data is processed, the representations (keys) of the current batch are added to it (Fig. 10.14). The momentum encoder's parameters are updated by the momentum rule of Eq. (10.25), i.e., a moving average of the encoder's parameters is used to update the momentum encoder, which helps stabilize and improve the learned representations.

Fig. 10.14 MoCo. Encoder and momentum encoder from source (He et al., 2020)
To generate negative samples, MoCo maintains a queue of negative samples from previous batches. The model's encoder is split into two parts: the "key encoder" and the "query encoder." The query encoder processes the current batch, and the key encoder processes the negative samples from the queue; the negative samples act as the keys for the current batch's queries. The MoCo algorithm has shown remarkable performance in unsupervised representation learning tasks, surpassing other state-of-the-art methods in various benchmarks. MoCo provides a good global representation, but a more local representation (finer granularity) can further help; this is the idea of DetCo (Xie et al., 2021) and DenseCL (Wang et al., 2022), but this topic is still emerging. The source code of MoCo is available at https://github.com/facebookresearch/moco.
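The momentum update of Eq. (10.25) and the queue of keys can be sketched as follows in PyTorch; encoder_q and encoder_k are assumed to share the same architecture, and the queue is a simple tensor of stored keys (both are assumptions of this sketch).

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Eq. (10.25): theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue, keys, max_size=65536):
    # FIFO queue of keys from previous mini-batches, used as negatives
    queue = torch.cat([keys, queue], dim=0)
    return queue[:max_size]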
12 PICA: PartItion Confidence mAximisation PICA (Huang et al., 2020) is based on a partition uncertainty index (PUI) that measures how a deep CNN is capable of interpreting and partitioning the target image data. The core concept of the algorithm revolves around achieving high visual similarity among instances belonging to the same semantic classes. However, a notable challenge arises when apparent similarity significantly impacts the distribution of features. This challenge becomes particularly evident when positive sample pairs are assigned to different clusters, resulting in decreased intra-cluster compactness and reduced inter-cluster diversity. As a consequence, this leads to lower partition confidence. Ultimately, the level of partition confidence plays a critical role in establishing semantic plausibility. PICA introduces a novel approach to enhancing the quality of clustering assignments by introducing a partition uncertainty index. This index serves as a metric for assessing the overall confidence of the clustering results. This index is employed to identify the most plausible and meaningful partitioning of the data. One of the key strengths of PICA lies in its focus on encouraging the model to learn the clusters that exhibit the highest levels of confidence across various potential solutions. This approach aims to identify a partition that reflects a coherent separation between different classes in a way that aligns closely with the true underlying semantics. The ultimate goal is to achieve a partitioning where each cluster corresponds directly to a specific ground truth class. To achieve this, PICA employs several core techniques. It introduces a partition uncertainty index that captures the uncertainty associated with the assignment of data points to clusters. This index is made differentiable and approximated stochasti-
cally, enabling its integration into the training process. Additionally, PICA employs a meticulously designed objective loss function that minimizes the aforementioned uncertainty index. The combination of these components allows PICA to seamlessly integrate with traditional deep neural networks and supports efficient mini-batchbased model training. Let .P = [p1 , . . . , pn ] ∈ RK×n be the prediction matrix of all the n input image patterns. Each element .pi,j specifies the predicted probability of the i-th image assigned to the j -th cluster among K. .qj = [p1,j , . . . , pn,j ] ∈ R1×n is the j -th row of P that collects the probability values of all the images for the j -th cluster, which summarizes the ASV (assignment statistics of that cluster) over the whole target data. PUI is formulated as the ASV cosine similarity set of all the cluster pairs through the matrix .MP U I ∈ RK×K , where K is the number of clusters: MP U I (j1 , j2 ) = cos(qj 1 , qj 2 ) =
.
(qj 1 × qj 2 ) (‖qj 1 ‖2 × ‖qj 2 ‖2 )
(10.26)
where j_1, j_2 ∈ [1, ..., K]. The learning objective of PICA is to minimize the PUI (excluding the diagonal elements), which is expected to yield the most confident clustering solution at its minimum. A stochastic approximation of the PUI is used to avoid processing the entire target dataset: at each training iteration, only a random subset of the images is used. PICA measures the quality of the clustering solution globally, unlike most existing deep clustering methods, which rely on local constraints on individual samples or sample pairs without solution-wise learning guidance. The overall objective function of PICA is formulated as

$$
L = L_{ce} + \lambda L_{ne}
\qquad (10.27)
$$

where λ is a weight parameter, L_ce is the common cross-entropy loss function, and L_ne is the negative entropy of the cluster size distribution.
$$
L_{ce} = \frac{1}{K} \sum_{j=1}^{K} -\log(m_{j,j}), \qquad
m_{j,j'} = \frac{\exp\big(M_{PUI}(j, j')\big)}{\sum_{k=1}^{K} \exp\big(M_{PUI}(j, k)\big)}, \quad j' \in [1, \ldots, K]
\qquad (10.28)
$$
$$
L_{ne} = \log(K) - H(Z), \qquad Z = [z_1, \ldots, z_K], \quad
z_j = \frac{\|q_j\|_1}{\sum_{k=1}^{K} \|q_k\|_1}
\qquad (10.29)
$$
where H(·) is the entropy of a distribution and Z is the L1-normalized soft cluster size distribution with elements z_j. The log(K) term ensures non-negative loss values. PICA can be plugged into standard deep network models and trained end-to-end without bells and whistles. Its advantages over a wide range of state-of-the-art deep clustering approaches have been demonstrated by extensive experiments on challenging object recognition benchmarks. The source code of PICA is available at11 .
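As an illustration, the following is a minimal sketch of the PUI and of the PICA objective (Eqs. 10.26–10.28) over a mini-batch, assuming PyTorch; tensor shapes follow the notation above rather than the official code.

```python
# Minimal PICA loss sketch: cosine-similarity PUI matrix plus cross-entropy and negative-entropy terms.
import torch
import torch.nn.functional as F

def pica_loss(probs, lam=2.0):
    # probs: (n, K) soft cluster assignments of a mini-batch (rows sum to 1).
    q = probs.t()                                                     # (K, n) rows are the per-cluster ASVs q_j
    m = F.normalize(q, dim=1) @ F.normalize(q, dim=1).t()             # M_PUI: pairwise cosine similarities (K, K)
    K = m.size(0)
    # L_ce (Eq. 10.28): treat each row of M_PUI as logits and push the diagonal entry to dominate.
    l_ce = F.cross_entropy(m, torch.arange(K))
    # L_ne (Eq. 10.29): negative entropy of the L1-normalized soft cluster-size distribution Z.
    z = q.sum(dim=1) / q.sum()
    l_ne = torch.log(torch.tensor(float(K))) + (z * torch.log(z + 1e-12)).sum()
    return l_ce + lam * l_ne
```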
13 DSC-PCM: Deep Semantic Clustering by Partition Confidence Maximization DSC-PCM (Huang et al., 2020) is a clustering algorithm designed to perform clustering on high-dimensional data with complex, non-linear structures, such as images. DSC-PCM extends the PCM algorithm by incorporating a deep neural network that learns a representation of the data that is better suited for clustering. The algorithm consists of the following steps (a rough sketch of steps 3–5 is given below): 1. Feature extraction: The data is first passed through a deep neural network to extract a set of features that capture its underlying structure. 2. Partitioning: The features are partitioned using the PCM algorithm described earlier. 3. Confidence estimation: For each feature, the algorithm estimates the confidence that it belongs to its assigned partition. This is done using a softmax function that converts the feature distances to probabilities. The confidence score is defined as the probability assigned to the assigned partition. 4. Partition refinement: The algorithm then refines the partitions by reassigning features to different partitions based on their confidence scores. 5. Cluster center update: The cluster centers are updated based on the new partitions. 6. Convergence: Steps 3–5 are repeated until the partitions and cluster centers converge to a stable configuration. The use of a deep neural network for feature extraction allows DSC-PCM to capture complex, non-linear structures in the data. The confidence estimation step is also improved by using a softmax function, which is better suited for handling high-dimensional data. DSC-PCM has been shown to outperform state-of-the-art clustering algorithms on several image datasets, including MNIST, CIFAR-10, and ImageNet.
11 https://github.com/Raymond-sci/PICA.
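The sketch below illustrates the confidence-estimation and refinement steps (3–5) described above, assuming NumPy; the distance-to-probability conversion and the reassignment rule are plausible illustrative choices, not the authors’ exact formulation.

```python
# Illustrative confidence estimation and partition refinement (not the authors' exact method).
import numpy as np

def refine_partition(features, centers, temperature=1.0):
    # features: (n, d) embeddings, centers: (K, d) current cluster centers.
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)   # (n, K) distances
    logits = -d / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # softmax over clusters
    assignments = probs.argmax(axis=1)                   # reassign each point to its most confident cluster
    confidence = probs.max(axis=1)                       # confidence score of the assigned partition
    # Cluster centers are then updated from the new partition (step 5).
    new_centers = np.stack([features[assignments == k].mean(axis=0) if np.any(assignments == k)
                            else centers[k] for k in range(len(centers))])
    return assignments, confidence, new_centers
```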
14 SPICE: Semantic Pseudo-Labeling for Image Clustering SPICE (Niu et al., 2022) is a deep learning algorithm for unsupervised image clustering based on a self-supervised learning approach and an ensemble clustering strategy to improve the clustering performance. SPICE employs similarity and divergence information among different semantic clusters to assign pseudo-labels to unlabeled instances in a batch-wise manner. With the purpose of assigning the same cluster label to semantically similar instances, SPICE leverages a classification loss and a double softmax cross-entropy function to iteratively optimize the clustering model. It then synergizes the similarity among instances and the semantic discrepancy between clusters to generate accurate and reliable self-supervision over clustering. The algorithm adopts a strategic “divide-and-conquer” methodology by partitioning the clustering network into two primary components: a feature model and a clustering head. This partitioning enables SPICE to train these components incrementally, gradually optimizing them both independently and jointly. This entire training process is accomplished without resorting to any form of annotation. SPICE employs a dual loss optimization mechanism. The first component involves a classification loss, while the second leverages a double softmax crossentropy function. These loss functions work in tandem to iteratively refine the clustering model. This iterative process allows SPICE to learn and adapt over successive stages. A prototype pseudo-labeling algorithm is designed to identify prototypes for training the clustering head in an expectation-maximization framework illustrated by a toy example provided by the authors (Fig. 10.15). In this process, the algorithm first utilizes the predicted probabilities of 10 samples across 3 clusters. The top three most confidently assigned samples for each cluster are highlighted with distinct colors: green, blue, and red, respectively. Subsequently, these selected samples are mapped onto the corresponding features, represented by dots. This mapping process is employed to estimate prototypes for each cluster, represented as stars and connected to the respective dots. Lastly, the algorithm identifies the top three nearest samples to each cluster prototype, which are represented by dots within the same ellipse, and these samples are assigned with the index of the corresponding prototype. Conversely, any unselected samples are labeled with .−1 and are excluded from the training dataset. Notably, the dashed ellipses signify non-overlapping assignments. By this, the semantic inconsistency of the samples around borderlines can be reduced. A reliable pseudo-labeling algorithm to select reliable samples is considered for jointly training the feature model and clustering head, which effectively improves the clustering performance. Here is a simplified overview of the SPICE algorithm: • Input: SPICE takes as input a set of unlabeled images. • Self-supervised pre-training: SPICE pre-trains a deep neural network using a self-supervised learning approach, such as the rotation prediction task. The pretraining step aims to learn a good representation of the data that captures the underlying patterns and structures in the images.
Fig. 10.15 Prototype pseudo-labeling from the authors Niu et al. (2022). The figure shows, for a toy example of 10 samples and 3 clusters, the predicted probabilities over the K clusters, the corresponding representation features, and the resulting pseudo-labels
• Clustering ensemble: SPICE generates multiple clustering solutions by applying different clustering algorithms, such as K-means and spectral clustering, to the feature vectors obtained from the pre-trained network. The clustering ensemble step aims to reduce the variability and uncertainty in the clustering results by combining the outputs of multiple clustering algorithms. • Cluster similarity estimation: SPICE estimates the similarity between the different clustering solutions using a pairwise similarity matrix. The similarity matrix is computed by comparing the overlap between the clusters obtained from each clustering algorithm. • Progressive cluster refinement: SPICE progressively refines the clustering solution by combining the cluster assignments from different clustering algorithms, using a weighted averaging scheme based on the similarity matrix. The progressive refinement step aims to improve the quality of the clustering solution by incorporating the complementary information from the different clustering algorithms. • Fine-tuning: SPICE fine-tunes the network and the clustering ensemble jointly using a joint loss function that combines the clustering loss with a reconstruction loss. The reconstruction loss encourages the network to learn a more informative representation of the data, by minimizing the difference between the input images and their reconstructions from the network. • Iterative refinement: SPICE iteratively refines the cluster assignments and the network parameters using a self-augmented training framework. In each iteration, SPICE applies a new set of random transformations to the input images and updates the cluster assignments and the network parameters using the joint loss function. The iterative refinement step aims to improve the quality of the
clusters by incorporating the cluster assignments into the learning process and by encouraging the network to learn more informative representations of the data. SPICE has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised image clustering and can be used for a variety of applications, including image retrieval, content-based image retrieval, and image segmentation. The SPICE code is made publicly available at12 .
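A simplified sketch of the prototype pseudo-labeling step illustrated in Fig. 10.15 is given below, assuming PyTorch; the number of confident samples, the neighborhood size, and the handling of overlapping assignments are simplified relative to the authors’ implementation.

```python
# Simplified prototype pseudo-labeling: estimate prototypes from confident samples, then relabel neighbors.
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(probs, feats, topk_conf=3, topk_near=3):
    # probs: (n, K) cluster probabilities from the clustering head, feats: (n, d) features.
    n, K = probs.shape
    # 1. Estimate a prototype per cluster from its most confidently assigned samples.
    protos = []
    for k in range(K):
        idx = probs[:, k].topk(topk_conf).indices
        protos.append(feats[idx].mean(dim=0))
    protos = torch.stack(protos)                                           # (K, d)
    # 2. Assign the nearest samples of each prototype to that cluster; unselected samples get -1.
    sim = F.normalize(feats, dim=1) @ F.normalize(protos, dim=1).t()       # (n, K) cosine similarity
    labels = torch.full((n,), -1, dtype=torch.long)
    for k in range(K):
        nearest = sim[:, k].topk(topk_near).indices
        labels[nearest] = k
    return labels, protos
```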
15 PCL: Prototypical Contrastive Learning PCL (Li et al., 2021) is a deep learning algorithm for unsupervised representation learning based on the idea of learning a feature representation that is optimized to separate samples belonging to different classes in a latent space while bringing together samples that belong to the same class. It was developed to address two limitations of instance discrimination algorithms. First, different instances can be discriminated by exploiting low-level cues, so the network does not necessarily learn useful semantic knowledge. Second, images from the same class are treated as different instances, and their embeddings are pushed apart. This is undesirable because images that share similar semantics should have similar embeddings. In PCL, the idea is to use a “prototype” as the centroid of a cluster formed by similar images. Each image is assigned to multiple prototypes of different granularity. The training objective is to pull each image embedding closer to its associated prototypes, which is achieved by minimizing a ProtoNCE loss function. PCL alternates between clustering the data via K-means and contrasting samples against their views and their prototypes (the assigned cluster centroids). The PCL framework is shown in Fig. 10.16. One of the core ideas in PCL consists of replacing the InfoNCE loss with the ProtoNCE loss for contrastive learning, to encourage representations to be closer to their assigned prototypes and far from negative prototypes. PCL aims to find the Maximum Likelihood Estimate (MLE) of the model parameters θ, given the observed images x = {x_i}_{i=1}^{n}:

$$
\theta^{*} = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)
\qquad (10.30)
$$
Prototypes C = {c_i}_{i=1}^{k} are used as the latent variables related to the observed data {x_i}_{i=1}^{n}, and an Expectation-Maximization algorithm is applied to solve the MLE. In the E-step, the probability of the prototypes is estimated by performing K-means clustering. In the M-step, the estimated log-likelihood is maximized by training the model to minimize the ProtoNCE loss as follows:

12 https://github.com/niuchuangnn/SPICE.
Fig. 10.16 PCL framework from the authors Li et al. (2021)
$$
L_{ProtoNCE} = \sum_{i=1}^{n} \left( -\log \frac{\exp\big(\langle z_i \cdot z_j \rangle / \tau\big)}{\sum_{k=0}^{r} \exp\big(\langle z_i \cdot z_k \rangle / \tau\big)} \;+\; \frac{1}{M} \sum_{m=1}^{M} -\log \frac{\exp\big(\langle v_i \cdot c_s^m \rangle / \psi_s^m\big)}{\sum_{j=0}^{r} \exp\big(\langle v_i \cdot c_j^m \rangle / \psi_j^m\big)} \right)
\qquad (10.31)
$$
• Input: Prototypical Contrastive Learning takes as input a set of unlabeled images. • Prototyping: Prototypical Contrastive Learning constructs a set of prototypes for each class in the dataset by computing the mean feature vector of a set of reference samples from that class. The reference samples are randomly sampled from the training set. • Contrastive learning: Prototypical Contrastive Learning learns a feature representation by maximizing the similarity between a query sample and its corresponding prototype while minimizing the similarity between the query sample and the prototypes of the other classes. This is done by using a contrastive loss function that encourages the feature vectors of the same class to be close together and those of different classes to be far apart. • Iterative refinement: Prototypical Contrastive Learning iteratively refines the prototypes and the feature representation by jointly optimizing the contrastive loss with a prototype loss, which encourages the prototypes to be representative of their corresponding class. The optimization is performed using stochastic gradient descent. Prototypical Contrastive Learning has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised representation learning, including CIFAR-10 and ImageNet. The algorithm is computationally efficient and can be easily extended to other types of data, such as audio and text. The PCL code is made publicly available at13 .
13 https://github.com/salesforce/PCL.
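The following is a rough sketch of the ProtoNCE idea (Eq. 10.31), assuming PyTorch and a single clustering granularity (M = 1); the estimation of the per-prototype concentration ψ and the multi-granularity clustering of the full method are simplified here.

```python
# Simplified ProtoNCE: instance-level InfoNCE term plus a prototype-level term.
import torch
import torch.nn.functional as F

def proto_nce(z, z_pos, prototypes, assignments, phi, tau=0.07):
    # z, z_pos: (N, D) L2-normalized embeddings of two views; prototypes: (K, D) L2-normalized;
    # assignments: (N,) cluster index of each sample; phi: (K,) concentration per prototype.
    # Instance term: the other view of the same image is the positive, other samples are negatives.
    logits_inst = z @ z_pos.t() / tau                       # (N, N), positives on the diagonal
    loss_inst = F.cross_entropy(logits_inst, torch.arange(z.size(0)))
    # Prototype term: pull each embedding toward its assigned prototype, away from the others.
    logits_proto = (z @ prototypes.t()) / phi.unsqueeze(0)  # (N, K), scaled per prototype
    loss_proto = F.cross_entropy(logits_proto, assignments)
    return loss_inst + loss_proto
```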
16 ProPos: Prototype Scattering and Positive Sampling ProPos (Huang et al., 2022) is an end-to-end deep clustering method based on prototype scattering and positive sampling, which can use a standard CNN architecture such as the ResNet family as its backbone. It is mainly based on two novel ideas, prototype scattering and positive sampling, aimed at improving the accuracy and robustness of the clustering results. • First, considering that different prototypes/clusters are truly negative pairs, ProPos performs contrastive learning over prototypical representations, in which two augmented views of the same prototype are positive pairs and different prototypes are negative pairs. The idea is to maximize the between-cluster distance so as to learn uniform representations toward well-separated clusters. This is formalized by a prototype scattering loss (PSL). • Second, to improve within-cluster compactness, one augmented view of an instance is aligned with randomly sampled neighbors of the other view, which are assumed to be truly positive pairs in the embedding space. This is referred to as positive sampling alignment (PSA); by taking neighboring samples in the embedding space into account, it improves within-cluster compactness. ProPos is optimized in an expectation-maximization (EM) framework in which two steps are iteratively performed: an E-step that estimates the instance pseudo-labels via spherical K-means and an M-step that minimizes a loss L combining two terms to obtain well-separated clusters and within-cluster compactness:

$$
L = L_{psa} + \lambda_{psl} L_{psl}
\qquad (10.32)
$$
where .λpsl controls the balance between two loss components. Here is a simplified overview of the Prototype Scattering and Positive Sampling algorithm: • Input: Prototype Scattering and Positive Sampling takes as input a set of unlabeled images. • Scattering transforms: Prototype Scattering and Positive Sampling applies scattering transforms to the input images to obtain a set of feature maps that capture the low-level patterns and structures in the images. The scattering transforms are designed to be invariant to translations and small deformations and to be robust to noise and image variations. • Prototyping: Prototype Scattering and Positive Sampling constructs a set of prototypes from the feature maps using a prototype-based clustering algorithm, such as K-means or spectral clustering. The prototypes are representative feature vectors that capture the high-level patterns and structures in the images and are used to partition the data into clusters. • Positive sampling: Prototype Scattering and Positive Sampling applies positive sampling to the prototypes by generating new feature vectors that are close to
the prototypes, but different enough to improve the diversity and separability of the clusters. The positive sampling approach generates new feature vectors by perturbing the prototypes using random transformations, such as rotations, translations, and scale changes. • Re-clustering: Prototype Scattering and Positive Sampling re-clusters the data using the prototypes and the positive samples as the new feature vectors. The re-clustering step aims to refine the initial clustering results by incorporating the positive samples into the clustering process and by encouraging the prototypes to be more representative of the underlying patterns and structures in the images. • Iterative refinement: Prototype Scattering and Positive Sampling iteratively refines the cluster assignments and the prototype vectors using a joint optimization framework that combines the clustering loss with a reconstruction loss. The reconstruction loss encourages the network to learn a more informative representation of the data, by minimizing the difference between the input images and their reconstructions from the network. Prototype Scattering and Positive Sampling has been shown to achieve state-of-the-art results on several benchmark datasets for unsupervised image clustering and can be used for a variety of applications, including content-based image retrieval and image segmentation. The ProPos code is made publicly available at14 .
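A rough sketch of the two loss terms of Eq. 10.32 is given below, assuming PyTorch; the prototype estimation via spherical K-means (the E-step) is assumed to happen outside this function, and every cluster is assumed to be represented in the mini-batch.

```python
# Simplified ProPos losses: positive sampling alignment (PSA) + prototype scattering loss (PSL).
import torch
import torch.nn.functional as F

def propos_losses(z1, z2, assignments, K, sigma=0.001, tau=0.5, lam_psl=0.1):
    # z1, z2: (N, D) L2-normalized embeddings of two augmented views;
    # assignments: (N,) pseudo-labels from spherical K-means; K: number of clusters.
    # PSA: align view 1 with a neighbor randomly sampled around view 2 in the embedding space.
    neighbor = F.normalize(z2 + sigma * torch.randn_like(z2), dim=1)
    l_psa = ((z1 - neighbor) ** 2).sum(dim=1).mean()
    # PSL: contrast the per-view prototypes; same prototype across views = positive pair.
    protos1 = F.normalize(torch.stack([z1[assignments == k].mean(0) for k in range(K)]), dim=1)
    protos2 = F.normalize(torch.stack([z2[assignments == k].mean(0) for k in range(K)]), dim=1)
    logits = protos1 @ protos2.t() / tau
    l_psl = F.cross_entropy(logits, torch.arange(K))
    return l_psa + lam_psl * l_psl
```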
17 BYOL: Bootstrap Your Own Latent BYOL (Grill et al., 2020) is a self-supervised learning algorithm that trains a neural network to map similar inputs to similar representations in a latent space. BYOL is composed of two separate networks like a Siamese network, referred to as the “online” and “target” networks. • The online network is defined by a set of weights .θ and is comprised of three stages: an encoder .fθ , a projector .gθ , and a predictor .qθ . • The target network has the same architecture as the online network but uses a different set of weights .ϵ. BYOL minimizes a similarity loss between .qθ (zθ ) and .sg(zϵ' ) (Fig. 10.17), where .θ are the trained weights, .ϵ is an exponential moving average of .θ , and sg means stop gradient. At the end of the training, .yθ is used as the image representation. The online network is trained to predict the target network representation of the same image under a different augmented view and updates the target network with a slow-moving average of the online network. The online network is updated during training while the target network is kept fixed. The target network provides the
14 https://github.com/Hzzone/ProPos.
Fig. 10.17 BYOL architecture from the authors Grill et al. (2020)
Fig. 10.18 Neural architecture of BYOL from the authors Grill et al. (2020)
regression targets to train the online network, and its parameters ϵ are an exponential moving average of the online parameters θ:

$$
\epsilon = \tau \epsilon + (1 - \tau)\,\theta
\qquad (10.33)
$$
During each training iteration, BYOL randomly augments an input sample to create a pair of views and passes one through the online network and the other through the target network to obtain two corresponding representations. The goal is to make the online network’s prediction of one view as close as possible to the target network’s representation of the other, and vice versa. This is achieved by minimizing a similarity loss that penalizes the distance between the two representations (Fig. 10.18). More specifically, given an input image x, two views of the same image, v and v', are generated by applying two random augmentations to x. Feeding v and v' to the online and target encoders, respectively, yields the vector representations y_θ and y'_ϵ. These representations are
projected to another subspace, giving the projections denoted z_θ and z'_ϵ (Fig. 10.17). BYOL then simply minimizes the distance between the online network prediction and the target network projection. The loss is defined as follows:

$$
Loss_{\theta,\epsilon} = \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\epsilon \right\|_2^2 = 2 - 2\,\frac{\langle q_\theta(z_\theta) \cdot z'_\epsilon \rangle}{\|q_\theta(z_\theta)\|_2 \cdot \|z'_\epsilon\|_2}
\qquad (10.34)
$$

where the bar denotes L2 normalization.
Since the target network is the slow-moving average of the online network, the online representations should be predictive of the target representations, i.e., .zθ should predict .zϵ' and hence another predictor (.qθ ) is put on top of .zθ . The key idea behind BYOL is to use the online network’s own predictions as a target for the contrastive loss function and that the encoder can be self-trained without using negative examples. In other words, the online network learns to predict its own future representations, which are generated by passing a slightly modified version of the current input through the network. This approach is called “bootstrap” because the network is effectively “bootstrapping” its own learning by using its own predictions as training targets. Unlike other contrastive learning methods, BYOL achieves state-of-the-art performance on several image classification benchmarks without using any negative samples. BYOL is claimed to require smaller batch sizes, which makes it an attractive choice. The learned representations can be transferred to downstream tasks such as object detection and semantic segmentation, without the need for further fine-tuning. The BYOL code is made publicly available at15 .
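As a concrete illustration, here is a minimal sketch of the BYOL loss and the EMA update (Eqs. 10.33–10.34), assuming PyTorch; `online`, `target`, and `predictor` are hypothetical modules standing for the online projection path, the target projection path, and the predictor q_θ.

```python
# Minimal BYOL sketch: symmetric regression loss + exponential moving average of the target.
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # Eq. 10.34: squared L2 distance between L2-normalized prediction and target projection.
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target, dim=1)
    return (2 - 2 * (p * z).sum(dim=1)).mean()

@torch.no_grad()
def ema_update(target, online, tau=0.996):
    # Eq. 10.33: target parameters follow the online parameters as a slow moving average.
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.data = tau * pt.data + (1 - tau) * po.data

def training_step(online, predictor, target, v1, v2):
    # Symmetrized loss: each view is predicted from the other; the target branch is detached (stop gradient).
    return byol_loss(predictor(online(v1)), target(v2).detach()) \
         + byol_loss(predictor(online(v2)), target(v1).detach())
```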
18 SwAV: Swapping Assignments Between Multiple Views SwAV (Caron et al., 2020) is an unsupervised contrastive clustering mechanism developed by Facebook. In standard contrastive learning methods, the features from different transformations of the same images are compared directly to each other. SwAV is a form of contrastive learning, but image views are contrasted by comparing their cluster assignments instead of their features. SwAV thus takes advantage of contrastive methods without requiring pairwise feature comparisons. Specifically, SwAV simultaneously clusters the data while enforcing consistency between the cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in traditional contrastive learning. The idea of the authors consists of alternating between a cluster assignment step where
15 https://github.com/lucidrains/byol-pytorch.
Fig. 10.19 SwAV architecture from the authors Caron et al. (2020)
image features of the entire dataset are clustered and a training step where the cluster assignments, i.e., the “codes,” are predicted for different image views. A code is computed from an augmented version of the image, and this code is predicted from other augmented versions of the same image. Given two image features z_t and z_s from two different augmentations of the same image, their codes q_t and q_s are computed by matching these features to a set of K prototypes {c_1, ..., c_K}. Then, a “swapped” prediction problem is set up with the following loss:

$$
Loss(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t)
\qquad (10.35)
$$
where the function ℓ(z, q) measures the fit between features z and a code q. In SwAV, the “codes” are first obtained by assigning features to prototype vectors. Then, a “swapped” prediction problem is solved wherein the codes obtained from one data-augmented view are predicted using the other view (Fig. 10.19). Thus, SwAV does not directly compare image features. Prototype vectors are learned along with the ConvNet parameters by backpropagation. Intuitively, the method compares the features z_t and z_s using the intermediate codes q_t and q_s. If these two features capture the same information, it should be possible to predict the code of one from the other feature: if z_t and z_s capture similar information, the code q_s (a soft class) can be predicted from the other feature z_t. In other words, if the two views share the same semantics, their targets (codes) will be similar. This is the whole “swapping” idea. To create the codes (targets), the image features need to be assigned to the prototype vectors; the “swapped” prediction problem then exchanges the codes (targets) of the two image views, so that the model predicts the code of a view from the representation of another view. SwAV also applies L2 normalization to the features as well as to the prototypes throughout training. The prototype vectors [c_1, ..., c_K] are learned by backpropagation, but they remain on the unit sphere, meaning their L2 norm is 1. The swapped loss function used in the SwAV algorithm is
$$
Loss(z_t, z_s) = \mathrm{fit}(z_t, q_s) + \mathrm{fit}(z_s, q_t)
\qquad (10.36)
$$
where z_t and z_s are the features extracted from two views of the same image, and q_t and q_s are the corresponding intermediate codes (which are obtained by matching the features to the set of prototypes). fit(·) measures the fit between z and q as follows:

$$
\mathrm{fit}(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}
\qquad (10.37)
$$
$$
p_t^{(k)} = \frac{\exp\!\big(\tfrac{1}{\tau}\, z_t^{\top} c_k\big)}{\sum_{k'} \exp\!\big(\tfrac{1}{\tau}\, z_t^{\top} c_{k'}\big)}
\qquad (10.38)
$$
where τ is a temperature parameter. Taking this loss over all the images and pairs of data augmentations leads to the following loss function for the swapped prediction problem:

$$
Loss = -\frac{1}{N} \sum_{n=1}^{N} \sum_{s,t \sim T} \left[ \frac{1}{\tau}\, z_{nt}^{\top} C q_{ns} + \frac{1}{\tau}\, z_{ns}^{\top} C q_{nt} - \log \sum_{k} \exp\!\Big(\frac{z_{nt}^{\top} c_k}{\tau}\Big) - \log \sum_{k} \exp\!\Big(\frac{z_{ns}^{\top} c_k}{\tau}\Big) \right]
\qquad (10.39)
$$
where C is the matrix whose columns are the prototype vectors [c_1, ..., c_K]. SwAV can perform well with small batch sizes compared to other models such as SimCLR. The SwAV code is made publicly available at16 .
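A minimal sketch of the swapped-prediction loss (Eqs. 10.35–10.38) is shown below, assuming PyTorch; the codes q are taken as given, whereas in the full method they are produced by an equal-partition (Sinkhorn-Knopp) assignment that is omitted here.

```python
# Simplified SwAV swapped-prediction loss.
import torch
import torch.nn.functional as F

def swav_swapped_loss(z1, z2, q1, q2, prototypes, tau=0.1):
    # z1, z2: (N, D) L2-normalized features of two views; q1, q2: (N, K) their codes;
    # prototypes: (K, D) L2-normalized prototype vectors c_1..c_K.
    def fit(z, q):
        p = F.log_softmax(z @ prototypes.t() / tau, dim=1)   # log p_t^(k), Eq. 10.38
        return -(q * p).sum(dim=1).mean()                    # cross-entropy, Eq. 10.37
    # Swapped prediction: predict the code of one view from the features of the other.
    return fit(z1, q2) + fit(z2, q1)
```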
19 SimSiam SimSiam (Chen & He, 2021) is a self-supervised representation learning algorithm proposed by researchers at Facebook AI Research. The algorithm is based on a Siamese neural network architecture, which consists of two identical sub-networks that share the same weights. The main idea behind SimSiam is to train a neural network to predict a transformed version of its own input, without using any explicit supervision signal. Two augmented views x_1 and x_2 of an image x are processed by the same encoder network f, represented by a backbone plus a projection MLP head, as shown in Fig. 10.20. Then a prediction MLP head is applied on one side, and a stop-gradient operation is applied on the other side. The model maximizes the similarity between both sides. It uses neither negative pairs nor a momentum encoder.

16 https://github.com/facebookresearch/swav.
Fig. 10.20 SimSiam architecture from the authors Chen and He (2021)
The encoder f shares weights between the two views. The prediction MLP head, denoted h, transforms the output of one view and matches it to the other view. Denoting the two output vectors p_1 = h(f(x_1)) and z_2 = f(x_2), the objective is to minimize their negative cosine similarity:

$$
D(p_1, z_2) = -\frac{\langle p_1 \cdot z_2 \rangle}{\|p_1\|_2 \, \|z_2\|_2}
\qquad (10.40)
$$
where ⟨·⟩ stands for the dot product and ‖·‖_2 for the L2 norm. With a Siamese architecture and no negative samples, a classic loss can cause collapse, i.e., f always outputs a constant regardless of the input. SimSiam solves the collapse problem via the predictor and the stop gradient, based on which the encoder is optimized with a symmetric loss defined as follows:

$$
Loss = \frac{1}{2}\, D\big(p_1, sG(z_2)\big) + \frac{1}{2}\, D\big(p_2, sG(z_1)\big)
\qquad (10.41)
$$
The total loss is averaged over all images, where .sG(·) is StopGradient. .sG(·) means that it is considered a constant term, and this is a major condition for the algorithm convergence. The encoder on .x2 receives no gradient from .z2 in the first term, but it receives gradients from .p2 in the second term (and vice versa for .x1 ). How SimSiam avoids collapse lies in its asymmetric architecture (Zhang et al., 2022), i.e., one path with h and the other without h. Under this asymmetric architecture, the role of the stop gradient is to only allow the path with the predictor to be optimized with the encoder output as the target, not vice versa. Here is a simplified overview of the SimSiam algorithm: • Input: SimSiam takes a set of unlabeled data samples as input. • Data augmentation: Each input sample is randomly transformed, such as by cropping, flipping, or color jittering, to create a pair of augmented views.
• Siamese network: The augmented views are passed through the Siamese network, which consists of two identical sub-networks that share the same weights. • Predictive coding: The two sub-networks produce two different representations for each augmented view. One of the sub-networks is used to produce a “predicted” representation for the other view, by passing it through an additional MLP. The predicted representation is then used as a target for the contrastive loss. • Contrastive loss: The contrastive loss function is used to encourage the representations of similar views to be close to each other and those of dissimilar views to be far apart. The loss is computed by comparing the predicted and actual representations of each view pair, using a similarity metric such as the dot product or cosine similarity. • Network updates: The network is updated by backpropagating the contrastive loss through the Siamese network, using standard optimization algorithms such as stochastic gradient descent. The learned representations can be evaluated on downstream tasks, such as image classification or object detection, by using them as input to a linear classifier or finetuning them on the target task. The SimSiam algorithm has been shown to achieve state-of-the-art performance on several benchmark datasets, including CIFAR-10, ImageNet, and COCO17 . One advantage of SimSiam over other self-supervised learning methods is that it does not require any negative samples, which makes it more computationally efficient and easier to implement. The explanatory claims of the original SimSiam are revisited in Zhang et al. (2022) allowing a deep analysis of the algorithm. The SimSiam code is made publicly available at18 .
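For concreteness, a minimal sketch of the SimSiam objective (Eqs. 10.40–10.41) is given below, assuming PyTorch; `f` stands for the encoder (backbone plus projection MLP) and `h` for the prediction MLP.

```python
# Minimal SimSiam sketch: negative cosine similarity with stop gradient, symmetrized over the two views.
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity; detaching z implements the stop-gradient sG(.).
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)
    p1, p2 = h(z1), h(z2)
    # Symmetric loss of Eq. 10.41.
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```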
20 EDCN: Unsupervised Discriminative Feature Learning via Finding a Clustering-Friendly Embedding Space EDCN (Cao et al., 2022) is based on the self-supervised concept and the use of a Siamese network to find a clustering-friendly embedding space to mine highly reliable pseudo-supervised information. The algorithm is designed to learn discriminative features that are useful for clustering high-dimensional data. The process is composed of a Feature Extractor, a Conditional Generator, a Discriminator, and a Siamese network (Fig. 10.21). Two kinds of generated data based on adversarial training are considered, as well as the original data, to train the Feature Extractor for learning effective latent representations. The Siamese network is exploited to find a clustering-friendly embedding space to mine highly reliable
17 https://cocodataset.org/. 18 https://github.com/facebookresearch/simsiam.
Fig. 10.21 EDCN architecture from the authors Cao et al. (2022)
pseudo-supervised information for the application of VAT and Conditional GAN to synthesize cluster-specific samples in the setting of unsupervised learning. VAT is adopted to synthesize samples with different levels of perturbations that can enhance the robustness of the Feature Extractor to noise and improve the lower-dimensional latent coding space discovered by the Feature Extractor. In Fig. 10.21, C denotes the concatenation operation which combines latent representations of original data and perturbed ones. Pseudo-labels are obtained by performing spectral clustering on the embedding learned by the Siamese networks whose inputs are the latent representations. The training of EDCN involves adversarial gaming between three players, which not only boosts performance improvement of the clustering but also preserves the cluster-specific information from the Siamese network in synthesizing samples. EDCN consists of the following steps: • Feature extraction: The data is passed through a deep neural network to extract a set of features. • Clustering-friendly embedding space learning: The algorithm learns a new embedding space that is optimized for clustering. This is done by minimizing a contrastive loss function that encourages similar data points to be closer in the embedding space and dissimilar points to be farther apart. • Clustering: The data is clustered using the K-means algorithm on the new embedding space. • Fine-tuning: The neural network is fine-tuned using a triplet loss function to further improve the discriminative power of the learned features. • The contrastive loss function used in EDCN is designed to encourage the embedding space to have a cluster-friendly structure, meaning that similar data
points are grouped together and dissimilar points are separated. This is achieved by pushing similar points closer together and pulling dissimilar points farther apart. The use of a triplet loss function in the fine-tuning step further improves the discriminative power of the learned features by encouraging similar data points to be closer in the embedding space and dissimilar points to be farther apart, similar to the contrastive loss function used in the previous step. EDCN has been shown to outperform state-of-the-art clustering algorithms on several datasets, including MNIST, CIFAR-10, and ImageNet.
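As a brief illustration of the fine-tuning step described above, here is a minimal triplet-loss sketch, assuming PyTorch; how the anchor/positive/negative triplets are mined from the pseudo-labels is an assumption not specified by the original description.

```python
# Illustrative triplet loss for the fine-tuning step (triplets assumed mined from pseudo-labels).
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the anchor toward a point of the same (pseudo-)cluster and push it away
    # from a point of a different cluster by at least `margin`.
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```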
References Bachman, P., Alsharif, O., & Precup, D. (2014). Learning with pseudo-ensembles. Advances in Neural Information Processing Systems, 27. Cao, W., Zhang, Z., Liu, C., Li, R., Jiao, Q., Yu, Z., & Wong, H.-S. (2022). Unsupervised discriminative feature learning via finding a clustering-friendly embedding space. Pattern Recognition, 129, 108768. Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 132–149). Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912–9924. Chang, J., Wang, L., Meng, G., Xiang, S., & Pan, C. (2017). Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5879–5887). Chaudhary, A. (2020a). The illustrated SimCLR framework. https://amitness.com/2020/03/ illustrated-simclr. Chaudhary, A. (2020b). A visual guide to self-labelling images. https://amitness.com/2020/04/ illustrated-self-labelling. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (pp. 1597–1607). PMLR. Chen, X. & He, K. (2021). Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750–15758). Dang, Z., Deng, C., Yang, X., Wei, K., & Huang, H. (2021). Nearest neighbor matching for deep clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13693–13702). Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1422–1430). Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems, 27. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738). Hu, W., Miyato, T., Tokui, S., Matsumoto, E., & Sugiyama, M. (2017). Learning discrete representations via information maximizing self-augmented training. In International Conference on Machine Learning (pp. 1558–1567). PMLR. Huang, J., Gong, S., & Zhu, X. (2020). Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8849–8858). Huang, Z., Chen, J., Zhang, J., & Shan, H. (2022). Learning representation for clustering via prototype scattering and positive sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2022.3216454 Ji, X., Henriques, J. F., & Vedaldi, A. (2019). Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9865–9874). Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 609–616). Li, J., Zhou, P., Xiong, C., & Hoi, S. C. H. (2021). Prototypical contrastive learning of unsupervised representations. Miyato, T., Maeda, S.-i., Koyama, M., & Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1979–1993. Niu, C., Shan, H., & Wang, G. (2022). Spice: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing, 31, 7264–7278. Peng, X., Zhu, H., Feng, J., Shen, C., Zhang, H., & Zhou, J. T. (2019). Deep clustering with sample-assignment invariance prior. IEEE Transactions on Neural Networks and Learning Systems, 31(11), 4857–4868. Tao, Y., Takagi, K., & Nakata, K. (2021). Clustering-friendly representation learning via instance discrimination and feature decorrelation. Preprint. arXiv:2106.00131. Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., & Van Gool, L. (2020). Scan: Learning to classify images without labels. In European Conference on Computer Vision (pp. 268–285). Springer. Wang, X. & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2794–2802). Wang, X., Zhang, R., Shen, C., & Kong, T. (2022). DenseCL: A simple framework for selfsupervised dense visual pre-training. Visual Informatics, 7(1), 30–40. Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3733–3742). Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Sun, P., Li, Z., & Luo, P. (2021). Detco: Unsupervised contrastive learning for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8392–8401). Yang, J., Parikh, D., & Batra, D. (2016). Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5147–5156). Zhang, C., Zhang, K., Zhang, C., Pham, T. X., Yoo, C. D., & Kweon, I. S. 
(2022). How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. Preprint. arXiv:2203.16262.
Chapter 11
Deep Clustering Techniques Based on Autoencoders
The general idea when using an autoencoder is to let the data produce their own supervision: the autoencoder is pre-trained without the clustering objective, and the initial centroids obtained are used as the labels for each cluster. The clusters are then obtained by introducing a new loss function (the clustering objective) that minimizes the difference between the representations and the centroids. Most of the existing deep clustering strategies share two simple concepts. The first is that deep embedded representations are favorable to clustering. The second is that clustering assignments can be used as a supervisory signal to learn embedded representations. Based on that, the existing deep clustering methods can be classified into two main families: • Two-stage approaches that apply clustering after having learned a representation. To learn more robust features, several autoencoder variants were proposed, such as adding sparsity or contractive constraints on the hidden representation, as well as the use of denoising autoencoders. • Approaches that jointly optimize feature learning and clustering using pseudo-labels, with different fine-tuning alternatives. The family of algorithms that jointly perform clustering on the embedded features of an autoencoder is mainly composed of DEC and its variants, such as IDEC, DCN, DBC, DEPICT, and DSSEC. There are several variants of autoencoders that aim at learning the key factors of similarity in the embedded space with respect to the data semantics. The common point between all of them is the reconstruction loss function, and they mainly differ from each other in the way the encoding operation is constrained. The variants and advances stem from the strategy of including constraints via regularization techniques and the use of multiple, hybridized losses. Other variants stem from the use of an AE or a CAE, such as DBC and DCEC (Fig. 11.1).
Fig. 11.1 Architecture overview of AE from Min et al. (2018)
1 DEN: Deep Embedding Network for Clustering DEN (Huang et al., 2014) is a deep embedding network that utilizes a deep neural network to learn clustering-oriented representations from raw data and then performs clustering with k-means. To achieve clustering-oriented representations, two constraints are imposed on the learned representations of the deep embedding network. One is a locality-preserving constraint that aims to embed the original data into its underlying manifold space. The other is a group sparsity constraint that aims to learn block-diagonal representations in which the nonzero groups correspond to the clusters. DEN learns the reduced representations with the locality-preserving constraint E_g and the group sparsity constraint E_s, followed by k-means clustering:

$$
E_g = \sum_{i,\, j \in k(i)} S_{ij}\, \| f(x_i) - f(x_j) \|^2
\qquad (11.1)
$$
where S_{ij} = exp(−‖x_i − x_j‖² / t) and t is a parameter. The hidden units are divided into G groups, where G is the assumed number of clusters:

$$
E_s = \sum_{i=1}^{N} \sum_{g=1}^{G} \lambda_g\, \| f_g(x_i) \|
\qquad (11.2)
$$
where λ_g = λ√n_g are the weights controlling the sparsity of the groups, n_g is the group size, and λ is a constant. By imposing this group sparsity constraint, only a few groups of the learned representation of each data point can be activated, and the activated groups correspond to a specific cluster. As a result, all the representations can be block-diagonalized.
$$
E = E_r + \alpha E_g + \beta E_s
\qquad (11.3)
$$
where E_r is the reconstruction error of the autoencoder and α and β are tuning parameters. The learning procedure is a two-stage algorithm that consists of a pre-training procedure to initialize the network weights, followed by a fine-tuning procedure. In DEN, representation learning is decoupled from clustering, and the performance is not as good as that obtained by methods that rely on a joint approach. The source code of DEN is available at1 .
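A small sketch of the two DEN regularizers (Eqs. 11.1–11.2) is given below, assuming NumPy; the encoder outputs h = f(x), the k-nearest-neighbor indices, and the group partition of the hidden units are taken as given.

```python
# Illustrative DEN regularizers: locality preservation (Eq. 11.1) and group sparsity (Eq. 11.2).
import numpy as np

def locality_preserving(h, x, knn_idx, t=1.0):
    # h: (N, d) hidden representations, x: (N, D) inputs, knn_idx: (N, k) nearest-neighbor indices.
    e_g = 0.0
    for i in range(len(x)):
        for j in knn_idx[i]:
            s_ij = np.exp(-np.sum((x[i] - x[j]) ** 2) / t)     # Gaussian kernel weight S_ij
            e_g += s_ij * np.sum((h[i] - h[j]) ** 2)
    return e_g

def group_sparsity(h, groups, lam=0.1):
    # groups: list of index arrays partitioning the hidden units into G groups.
    e_s = 0.0
    for g in groups:
        lam_g = lam * np.sqrt(len(g))                          # lambda_g = lambda * sqrt(n_g)
        e_s += lam_g * np.sum(np.linalg.norm(h[:, g], axis=1)) # sum over samples of ||f_g(x_i)||
    return e_s
```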
2 DEC: Deep Embedded Clustering DEC (Xie et al., 2016) is an unsupervised learning algorithm that combines autoencoders and clustering techniques. It first learns embedded representations using an autoencoder, which maps data from the original space to a lower-dimensional latent space. The overall process of DEC involves iteratively updating the deep embedding and the cluster assignments until convergence. During training, the network fine-tunes the embeddings to optimize their representation. Additionally, the soft assignments, representing the probabilities of data points belonging to each cluster, are adjusted based on the updated embeddings. The key to DEC is the self-learning clustering loss, which optimizes the neural network parameters and the cluster centers simultaneously using a KL divergence. By doing so, DEC effectively learns feature representations and cluster assignments in a joint manner (Fig. 11.2). The soft assignments act as implicit soft labels that guide the clustering process. DEC starts by pre-training an autoencoder to learn low-dimensional representations and initializes the clustering centers with k-means in the low-dimensional representation space.
Fig. 11.2 DEC scheme from the authors Xie et al. (2016)
1 https://github.com/XifengGuo/DEC-keras.
The idea behind DEC consists in improving the clustering using an unsupervised algorithm. It first computes a soft assignment between the embedded points and the cluster centroids. For each training sample i, the soft clustering assignment q_{ij} is computed from the embedded point z_i and the cluster center μ_j using Student’s t-distribution:

$$
q_{ij} = \frac{\big(1 + \|z_i - \mu_j\|^2 / \alpha\big)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \big(1 + \|z_i - \mu_{j'}\|^2 / \alpha\big)^{-\frac{\alpha+1}{2}}}
\qquad (11.4)
$$
It then updates the deep mapping f_θ and refines the cluster centroids by learning from the current high-confidence assignments using an auxiliary target distribution:

$$
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}
\qquad (11.5)
$$
where f_j = Σ_i q_{ij} are the soft cluster frequencies. The selected loss is the KL divergence between the soft assignments q_i and the auxiliary distribution p_i:

$$
L = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\qquad (11.6)
$$
This cost is optimized using Stochastic Gradient Descent (SGD) with momentum for the cluster centers μ_j and the DNN parameters θ of the encoder. The partial derivatives ∂L/∂μ_j and ∂L/∂z_i can be computed directly, and the gradients can then be backpropagated. DEC works by using highly confident samples as supervision and then making the samples in each cluster distribute more densely. The biggest contribution of DEC is the clustering module (or target distribution P). Its working principle is to use a high-confidence target division to guide the training process so that the distribution of the dense samples in each cluster is more concentrated. However, there is no guarantee of pulling samples near the margins toward the correct cluster (Guo et al., 2017); in other words, the module cannot ensure that samples near the edge of a cluster are assigned to the correct cluster. To sum up, DEC is a powerful algorithm that learns low-dimensional representations using autoencoders and improves clustering performance through self-learning with KL divergence. It achieves feature representation and clustering simultaneously, making it effective in various unsupervised learning tasks. Despite its success in clustering, DEC is not able to make use of prior information to guide the clustering process and to further enhance the clustering performance. Moreover, DEC’s clustering
loss does not guarantee local structure preservation. This can misguide the feature transformation and lead to the corruption of the embedded space, especially when the dataset is highly imbalanced. DEC works well only for balanced datasets. The code source of DEC is available at2 .
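A compact sketch of DEC’s soft assignment, target distribution, and KL loss (Eqs. 11.4–11.6) follows, assuming PyTorch and α = 1; the alternating update of the centers and of the encoder parameters by SGD is left out.

```python
# Minimal DEC building blocks: Student's t soft assignment, sharpened targets, KL loss.
import torch

def soft_assign(z, mu, alpha=1.0):
    # q_ij: Student's t similarity between embedding z_i and center mu_j (Eq. 11.4).
    d2 = torch.cdist(z, mu) ** 2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # p_ij: sharpened targets computed from the current soft assignments (Eq. 11.5).
    w = q ** 2 / q.sum(dim=0)                   # q_ij^2 / f_j
    return w / w.sum(dim=1, keepdim=True)

def dec_loss(q, p):
    # KL(P || Q) of Eq. 11.6 (averaged over samples).
    return (p * (p.add(1e-12).log() - q.add(1e-12).log())).sum(dim=1).mean()
```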
3 IDEC: Improved Deep Embedded Clustering IDEC is based on a similar idea to DEC, with the contribution consisting in balancing the clustering cost with a reconstruction cost after the pre-training phase. Note that preserving non-discriminative features end-to-end implies a reduction in the overall discriminative ability (see the discussion below). IDEC utilizes a decoder in the training process to guarantee the preservation of the local structure of the original data in its embedded representations, relying on an under-complete autoencoder network. The IDEC algorithm follows a structured process to enhance clustering performance through a combination of pre-training and fine-tuning steps. Initially, the deep autoencoder undergoes unsupervised pre-training, wherein it learns to transform input data into a condensed representation and then reconstruct the original data. This phase enables the autoencoder to capture meaningful features for subsequent clustering. After pre-training, the decoder component is retained in IDEC. This design choice facilitates optimization through the reconstruction loss during fine-tuning. By incorporating the reconstruction loss into the original DEC loss function, IDEC ensures that the network maintains the integrity of the feature space. This strategy prevents distortion and leads to improved clustering outcomes. IDEC introduces the reconstruction loss term into the DEC loss function deliberately: excluding the reconstruction loss during the clustering phase could disrupt the underlying space structure and hinder the network’s ability to preserve local relationships among data points. Consequently, IDEC conducts a joint fine-tuning process targeting both the clustering and reconstruction objectives. This refinement step further hones the acquired representations, contributing to enhanced clustering outcomes. Importantly, this process optimizes cluster label assignments while preserving the local structure, thereby avoiding any degradation of the learned features. Many researchers have extended DEC in this way, preserving the local structure to prevent the low-dimensional representations from being corrupted during the clustering process. The IDEC loss is defined as

$$
L_{IDEC} = \alpha L_r + \beta L_{DEC}
\qquad (11.7)
$$
2 https://github.com/XifengGuo/DEC-keras.
where .Lr is the reconstruction loss of autoencoder, .LDEC stands for the clustering loss, and .α, .β are in charge of balancing the two loss terms. It is noteworthy that IDEC operates in an entirely unsupervised manner and does not rely on external prior information to guide its learning process. The decision to maintain end-to-end reconstruction may not be as advantageous (Guo et al., 2017) as it appears, as there exists a natural trade-off between clustering and reconstruction. Clustering involves eliminating non-essential information, such as within-cluster variances and between-cluster similarities, to effectively categorize data. This duality underscores IDEC’s approach to striking a balance between the two objectives: Embedded clustering does not take into consideration nondiscriminative patterns that do not contribute to data categorization (e.g., betweenclass similarities), whereas reconstruction is mainly concerned with preserving all the information whether discriminative or not. In summary, IDEC’s methodology comprises sequential pre-training and finetuning steps to enhance clustering performance. By incorporating reconstruction loss and maintaining the end-to-end reconstruction process, IDEC refines learned representations and optimizes clustering without compromising the local data structure. The algorithm’s unsupervised nature and its strategic handling of clustering and reconstruction objectives set the stage for improved clustering outcomes. The code source of IDEC is available at3 .
4 DCN: Deep Clustering Network DCN (Yang et al., 2017) is one of the earliest deep clustering works that perform the k-means clustering on the latent features produced by an autoencoder, where the reconstruction loss and the clustering loss are simultaneously minimized. DCN is similar to DEC and IDEC and has a joint optimization process (Fig. 11.3). It adds a penalty term on reconstruction during the process of optimizing the clustering objective. First, the autoencoder is trained to reduce the dataset dimensionality based on a reconstruction loss function. The ultimate goal of this approach is to get a k-meansfriendly representation at the end of the training process. To this end, a k-means loss is applied along with the vanilla reconstruction. This method requires hard clustering assignments (as opposed to soft clustering assignments based on probabilities). That would induce a discrete optimization process, which is incongruous with the differential aspect of gradient descent. Similar to IDEC, DCN does not have any mechanism to mitigate the clustering– reconstruction trade-off. DCN alternatively learns (rather than jointly learns) the object representations, the cluster centroids, and the cluster assignments, the latter
3 https://github.com/XifengGuo/IDEC.
Fig. 11.3 DCN scheme from the authors Yang et al. (2017)
being based on discrete optimization steps that cannot benefit from the efficiency of stochastic gradient descent. In synthesis, DCN pre-trains the autoencoder and uses it to optimize the reconstruction loss and k-means loss while changing the cluster assignments. The method merely compresses representations and is unable to capture the context effectively. Similar to IDEC, DCN suffers from the reconstruction and the clustering trade-off. The code source of DCN is available at4 .
5 DEPICT: DEeP Embedded RegularIzed ClusTering DEPICT (Ghasedi Dizaji et al., 2017) utilizes correlations between an instance and its “self” to construct positive sample pairs and leverages a convolutional autoencoder for learning embedded representations and their clustering assignments. DEPICT consists of two parts, i.e., a convolutional autoencoder for learning the embedding space and a multinomial logistic regression layer functioning as a discriminative clustering model. The innovation, however, relies on the use of clean and corrupted encoders sharing the same weights. The core idea is to learn the embedding space and perform the reconstruction via the corrupted encoder while forcing, through a regularizer term, the clean and corrupted representations, as well as the predicted clusters, to be similar. Let us consider the clustering task of n samples, X = [x_1, ..., x_n], into K categories, where each sample x_i ∈ R^{d_x}. Using the embedding function φ_W: X → Z, samples are mapped into the embedding subspace Z = [z_1, ..., z_n], where each z_i ∈ R^{d_z} has a much lower dimension compared to the input data (i.e., d_z ≪ d_x). Given the embedded features, a multinomial logistic regression (softmax) function f_θ: Z → Y is used to predict the probabilistic cluster assignments as follows:

4 https://github.com/qunfengdong/DCN.
$$
p_{ik} = P(y_i = k \mid z_i, \Theta) = \frac{\exp(\theta_k^{\top} z_i)}{\sum_{k'=1}^{K} \exp(\theta_{k'}^{\top} z_i)}
\qquad (11.8)
$$
where Θ = [θ_1, ..., θ_K] ∈ R^{d_z×K} are the softmax function parameters, and p_{ik} indicates the probability of the i-th sample belonging to the k-th cluster. For jointly learning the embedding space and clustering, DEPICT employs an alternating approach to optimize a unified objective function. Similar to DEC, DEPICT has a relative cross-entropy (KL divergence) objective function. In order to avoid degenerate solutions, which allocate most of the samples to a few clusters or assign a cluster to outlier samples, a regularization term is imposed on the target variable:

$$
Loss = KL(Q \,\|\, P) + KL(f \,\|\, u)
\qquad (11.9)
$$
where $f_k = P(y = k) = \frac{1}{n}\sum_{i=1}^{n} q_{ik}$. This avoids situations where most of the data points are assigned to a few clusters. However, this prior knowledge (the size of the clusters) assumed by DEPICT is not well adapted to purely unsupervised problems. The clustering loss ($KL(Q \,\|\, \tilde{P})$) and the regularization loss force the model to learn features that are invariant with respect to noise:

$$\min_{\varphi} \; -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} q_{ik}\log \tilde{p}_{ik} \;+\; \frac{1}{n}\sum_{i=1}^{n}\sum_{l=0}^{L-1}\frac{1}{|z_i^l|}\,\|z_i^l - \hat{z}_i^l\|_2^2 \qquad (11.10)$$
where $z_i^l$ denotes the clean features of the l-th layer, $\hat{z}_i^l$ is the output of the l-th reconstruction layer, and L is the number of layers. The source code of DEPICT is available at https://github.com/herandy/DEPICT.
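The following minimal sketch illustrates the soft assignments of Eq. (11.8) and the regularized clustering loss of Eq. (11.9) in PyTorch. The embeddings Z, the parameters theta, and the DEC-style sharpened target Q are toy placeholders chosen for the example, not the quantities learned by DEPICT.

```python
# Minimal sketch of DEPICT's soft assignments (Eq. 11.8) and regularized loss (Eq. 11.9).
import torch
import torch.nn.functional as F

n, d_z, K = 100, 10, 4
Z = torch.randn(n, d_z)                 # embedded samples z_i (toy placeholder)
theta = torch.randn(d_z, K, requires_grad=True)

P = F.softmax(Z @ theta, dim=1)         # p_ik = softmax(theta_k^T z_i), Eq. (11.8)

# Target distribution Q (here simply a sharpened P, for illustration)
Q = (P ** 2) / P.sum(dim=0)
Q = (Q / Q.sum(dim=1, keepdim=True)).detach()

f = Q.mean(dim=0)                       # empirical cluster frequencies f_k
u = torch.full_like(f, 1.0 / K)         # uniform prior

kl_qp = (Q * (Q.clamp_min(1e-9).log() - P.clamp_min(1e-9).log())).sum(dim=1).mean()
kl_fu = (f * (f.clamp_min(1e-9).log() - u.log())).sum()   # balance regularizer
loss = kl_qp + kl_fu                    # Eq. (11.9)
loss.backward()
```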
6 DBC: Discriminatively Boosted Clustering

DBC (Li et al., 2018) follows the general idea of DEC with an FCAE (Fully Convolutional Autoencoder), but the learning strategy is different. The algorithm adopts the FCAE for feature extraction and applies k-means for the subsequent clustering. FCAE is the first attempt to train convolutional autoencoders in an end-to-end manner, without greedy layerwise pre-training, for fast and coarse image feature extraction. The decoder part is then discarded, and a soft k-means model is added on top of the encoder to form a unified clustering model. The model is jointly trained with gradually boosted discrimination, where high-score assignments are emphasized and low-score ones are de-emphasized (Fig. 11.4). DBC reuses the FCAE features and learns deep image representations and cluster assignments
Fig. 11.4 DBC scheme from Bank et al. (2023)
jointly through a discriminative boosting procedure. The DBC model is trained in a self-paced learning fashion, where deep representations of raw images and cluster assignments are jointly learned: easier samples are trusted first, and new samples of increasing complexity are then gradually incorporated. The source code can be found at https://github.com/topics/deep-clustering?o=asc&s=updated.
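The discriminative boosting idea can be sketched as follows: soft k-means scores are computed in the latent space and then sharpened so that confident assignments are emphasized and weak ones de-emphasized. The Student-t scoring, the temperature alpha, and the toy tensors below are assumptions for illustration, not the authors' exact formulation.

```python
# Minimal sketch of DBC-style "discriminatively boosted" targets.
import torch

def soft_kmeans_scores(Z, centroids, alpha=1.0):
    """Student-t style soft assignment of embeddings Z to centroids."""
    d2 = torch.cdist(Z, centroids) ** 2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def boosted_targets(q):
    """Raise high-score assignments and suppress low-score ones (self-paced targets)."""
    p = q ** 2 / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

Z = torch.randn(128, 10)                 # encoder output (decoder already discarded)
centroids = torch.randn(5, 10)           # toy centroids
q = soft_kmeans_scores(Z, centroids)
p = boosted_targets(q)                   # used as training target during fine-tuning
```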
7 COAE: Clustering with Overlapping Autoencoder

COAE (Wang et al., 2019) is a CAE that encourages the latent embedding to be orthogonal (Fig. 11.5). It is based on the assumption that orthogonality enhances the discriminability and representability of the embedding: it promotes more diversity in the latent features and enlarges the differences between the latent features of different classes. An orthogonality regularization term is added to the reconstruction cost function to penalize the deviation of the embedding Z from a column-orthogonal matrix, together with a cross-entropy loss that boosts clustering as in DEC:

$$L = \min_{w} \; \|X - \hat{X}\|_F^2 + \lambda \|Z^\top Z - I\|_F^2 + \nu H(Q, P) \qquad (11.11)$$
where $Z = [z_1, \ldots, z_n]^\top \in \mathbb{R}^{n \times r}$ is the embedding corresponding to X, $H(Q, P)$ is the cross-entropy between the target assignment Q and the predicted assignment P, and $\lambda$ and $\nu$ are hyperparameters that balance the penalty terms. The number of clusters k has to be fixed in advance.

Fig. 11.5 COAE architecture from the authors Wang et al. (2019)

COAE consists of the following steps:
• Data preprocessing: The input data are preprocessed to normalize the features and remove noisy or irrelevant data.
• Autoencoder training: A deep autoencoder neural network is trained on the preprocessed data to learn a compressed representation of the data.
• Overlapping clustering: The learned representations are clustered using a fuzzy clustering algorithm that allows for overlapping clusters. Each data point is assigned a membership value for each cluster, indicating the degree to which it belongs to that cluster.
• Cluster refinement: The cluster memberships are refined using a thresholding method to assign each data point to its most likely cluster.
• Autoencoder fine-tuning: The autoencoder network is fine-tuned using the refined clusters as target labels to improve the quality of the learned representations.
COAE is able to handle overlapping clusters, which are common in real-world datasets. The fuzzy clustering algorithm allows for soft assignments of data points to multiple clusters, and the thresholding step helps to improve the interpretability of the clusters. The fine-tuning step improves the quality of the learned representations, which can be useful for downstream tasks such as classification or anomaly detection. Overall, COAE is a promising approach for deep clustering that can handle overlapping clusters and provide high-quality representations of the data. The source code of COAE is available at https://github.com/WangDavey/COAE.
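A minimal sketch of the loss in Eq. (11.11) is given below; the tensor shapes and the values of lambda and nu are assumptions, and the predicted and target assignments are random placeholders. The orthogonality term pushes $Z^\top Z$ toward the identity, which limits redundancy between latent dimensions.

```python
# Minimal sketch of COAE's loss (Eq. 11.11): reconstruction + orthogonality + cross-entropy.
import torch

def coae_loss(X, X_hat, Z, Q, P, lam=0.1, nu=0.1):
    rec = ((X - X_hat) ** 2).sum()                              # ||X - X_hat||_F^2
    I = torch.eye(Z.shape[1], device=Z.device)
    ortho = ((Z.t() @ Z - I) ** 2).sum()                        # ||Z^T Z - I||_F^2
    ce = -(Q * P.clamp_min(1e-9).log()).sum(dim=1).mean()       # H(Q, P)
    return rec + lam * ortho + nu * ce

X = torch.randn(64, 30)                                         # toy input
Z = torch.randn(64, 8)                                          # toy embedding
X_hat = torch.randn(64, 30)                                     # decoder output placeholder
P = torch.softmax(torch.randn(64, 5), dim=1)                    # predicted assignments
Q = torch.softmax(torch.randn(64, 5), dim=1)                    # target assignments
loss = coae_loss(X, X_hat, Z, Q, P)
```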
8 ADEC: Adversarial Deep Embedded Clustering

ADEC (Mrabah et al., 2020) is an autoencoder-based clustering model that addresses the trade-off between feature randomness and feature drift. It is an extension of the DEC (Deep Embedded Clustering) algorithm.
The main goals of ADEC are similar to those of DEC: learning low-dimensional representations (embeddings) with an autoencoder and then clustering the data based on these embeddings. Like DEC, ADEC aims to minimize the Kullback–Leibler (KL) divergence to an auxiliary target distribution to optimize the clustering performance. However, ADEC introduces an additional regularization technique based on adversarial training to improve the clustering objective. This regularization is achieved through an adversarially constrained reconstruction process: a network generates samples (reconstructions) that are as close as possible to the original input data, while a second network (the adversary) tries to distinguish between the original data and the generated samples. ADEC thus employs a discriminator D to preserve the local structure of the learned representations for clustering, and a generator is introduced to confuse the discriminator through the generation loss $L_g$. The ADEC loss combines the reconstruction loss, the clustering loss, and the generation loss $L_g$. It is defined as

$$L_{ADEC} = \alpha L_r + \beta L_{DEC} + \gamma L_g \qquad (11.12)$$
where $L_g = \sum_{i=1}^{n} \log\bigl(1 - D(f_W(x_i))\bigr)$ and $\alpha$, $\beta$, and $\gamma$ are weighting parameters. By combining adversarial training with clustering, ADEC creates a strong competition between the clustering objective and the adversarial reconstruction objective. This competition helps to alleviate issues related to feature randomness and feature drift, which can occur during the clustering process: feature randomness refers to undesired variation in the representations, while feature drift indicates the changing behavior of the learned representations during training. The adversarial regularization in ADEC aims to improve the robustness of the learned embeddings and, consequently, the clustering performance. By leveraging adversarial training, ADEC seeks to create more stable and discriminative features for clustering. However, adversarial training remains challenging because of its lack of stability (mode collapse, failure to converge, and memorization). The source code of ADEC is available at https://github.com/shillyshallysxy/ADEC.
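A minimal sketch of how the three terms of Eq. (11.12) can be combined is shown below. The network sizes, the weights alpha, beta, and gamma, the random centroids used for the DEC-style clustering term, and the single joint backward pass are all assumptions made for illustration; the actual ADEC training alternates generator and discriminator updates, which is omitted here.

```python
# Minimal sketch of the combined ADEC loss (Eq. 11.12), illustrative only.
import torch
import torch.nn as nn

d_x, d_z, K = 30, 8, 4
encoder = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, d_z))
decoder = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d_x))
D = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

alpha, beta, gamma = 1.0, 0.1, 0.1                     # weighting parameters (assumptions)
X = torch.randn(64, d_x)                               # toy batch
Z = encoder(X)
X_hat = decoder(Z)                                     # reconstruction f_W(x)

L_r = ((X - X_hat) ** 2).mean()                        # reconstruction loss
P = torch.softmax(-torch.cdist(Z, torch.randn(K, d_z)) ** 2, dim=1)   # toy soft assignments
Q = P ** 2 / P.sum(0)
Q = (Q / Q.sum(1, keepdim=True)).detach()              # DEC-style target
L_dec = (Q * (Q.clamp_min(1e-9).log() - P.clamp_min(1e-9).log())).sum(1).mean()
L_g = torch.log(1.0 - D(X_hat).clamp(max=1 - 1e-6)).sum()             # generation loss

L_adec = alpha * L_r + beta * L_dec + gamma * L_g      # Eq. (11.12)
L_adec.backward()
```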
9 DSSEC: Deep Stacked Sparse Embedded Clustering Method

DSSEC (Cai et al., 2021) improves the local structure preservation strategy by exploiting a sparse constraint imposed on the hidden layers of the network; as a result, the learned embedded features are more representative. The DSSEC network structure (Fig. 11.6) contains a stacked sparse autoencoder and a clustering layer
Fig. 11.6 DSSEC architecture from the authors Cai et al. (2021)
connected to the hidden layer. DSSEC utilizes the reconstruction loss and the clustering loss to guide the learning of the feature representation. The idea is to replace the traditional autoencoder (AE) used in DEC with a stacked sparse autoencoder (SSAE). The main difference between an SSAE and an AE is the sparse constraint added to the hidden layers. When the number of hidden units becomes large, the SSAE can still extract interesting embedded features, in contrast to the AE. The sparse constraint forces the network to capture a more effective feature representation of the input data. The retention of the local structure of the input data is also considered, and the reconstruction loss of the SSAE is combined with the clustering loss to guide the network training. To this end, a penalty term is added to the reconstruction loss to penalize hidden units whose average activation $\hat{\rho}_j$ deviates significantly from the target sparsity $\rho$; it is formulated with the KL divergence as follows:

$$KL(\rho \,\|\, \hat{\rho}) = \sum_{j=1}^{h} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j} \right] \qquad (11.13)$$

The reconstruction loss is then defined as

$$L_r = \|X - X'\|_F^2 + \beta\, KL(\rho \,\|\, \hat{\rho}) \qquad (11.14)$$
where W and $W'$ are the weight matrices of the encoder and the decoder, respectively, $X' = W'WX$ is the reconstruction matrix, and $\beta$ denotes the parameter that controls the sparsity penalty. The source code of DSSEC is available at https://github.com/shillyshallysxy/ADEC.
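A minimal sketch of the sparsity penalty of Eq. (11.13) added to the reconstruction loss of Eq. (11.14) is given below; the target sparsity rho, the weight beta, and the toy activations are assumptions made for the example.

```python
# Minimal sketch of DSSEC's sparsity penalty (Eq. 11.13) and reconstruction loss (Eq. 11.14).
import torch

def kl_sparsity(hidden, rho=0.05, eps=1e-9):
    """KL(rho || rho_hat_j) summed over the h hidden units."""
    rho_hat = hidden.mean(dim=0).clamp(eps, 1 - eps)   # average activation per unit
    rho_t = torch.tensor(rho)
    return (rho_t * (rho_t / rho_hat).log()
            + (1 - rho_t) * ((1 - rho_t) / (1 - rho_hat)).log()).sum()

X = torch.rand(64, 30)                                 # toy input
hidden = torch.sigmoid(torch.randn(64, 16))            # encoder activations in [0, 1]
X_rec = torch.rand(64, 30)                             # decoder output placeholder
beta = 0.1
L_r = ((X - X_rec) ** 2).sum() + beta * kl_sparsity(hidden)   # Eq. (11.14)
```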
10 DPSC: Discriminative Pseudo-supervision Clustering

DPSC (Hu et al., 2021) consists in discovering and utilizing pseudo-supervision information to provide supervisory guidance for discriminative representation learning. With the aid of pseudo-supervision, the representations can be continuously refined to promote inter-cluster separability and intra-cluster compactness, thereby leading to more discriminative representations and correctly separated clusters. The essential insight behind DPSC is that good feature representations are beneficial for clustering and, conversely, that clustering results can be treated as pseudo-labels, thereby providing supervisory guidance for representation learning. DPSC comprises three components (Fig. 11.7): (1) feature extraction, (2) cluster assignment, and (3) pseudo-supervision regularization. The synergy between components (1) and (2) forms the foundation of joint clustering methods, while component (3) is the central pillar of DPSC. The arrow within the framework signifies a pivotal concept: the extraction of pseudo-labels from clustering outcomes to serve as supervisory cues. These labels promote inter-cluster distinctiveness and intra-cluster cohesion, fostering the acquisition of more discriminative features and, consequently, improving the overall clustering performance. The pseudo-labels comprise two distinct elements. First, the predicted labels, denoted $c_{ij}$, directly determine whether two images belong to the same cluster: $c_{ij}$ takes the value 1 if the images are in the same cluster and 0 otherwise. Second, the pairwise patterns, denoted $r_{ij}$, treat the cluster assignment probabilities as indicator features of individual samples; the objective is to unveil pairwise patterns by estimating the similarities between samples. The flowchart of pseudo-labels discovery is
Fig. 11.7 DPSC architecture from the authors Hu et al. (2021)