Recent Trends in Learning From Data: Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019). ISBN: 3030438821, 9783030438821

This book offers a timely snapshot and extensive practical and theoretical insights into the topic of learning from data

English Pages 232 [225] Year 2020

Table of contents :
Preface
Contents
Introduction
1 Recent Trends in Learning from Data
1.1 Learned Data Structures
1.2 Deep Randomized Neural Networks
1.3 Tensor Decompositions and Practical Applications
1.4 Deep Learning for Graphs
1.5 Limitations of Shallow Networks
1.6 Fairness in Machine Learning
1.7 Online Continual Learning on Sequences
Learned Data Structures
1 Introduction
2 Learned Data Structures for Range Queries
2.1 The Recursive Model Index
2.2 Variations and Extensions of RMI
2.3 The FITing-tree
2.4 The PGM-index
2.5 Learned Multidimensional and Secondary Indexes
2.6 Comparison Among Learned Indexes
3 Learned Data Structures for Exact Membership
4 Learned Data Structures for Approximate Membership
4.1 Learned Bloom Filters
4.2 Sandwiched Learned Bloom Filters
4.3 Neural Bloom Filters
4.4 Adaptive Learned Bloom Filters
5 Learned Data Structures for Frequency Estimation
5.1 ML-Oracle Classifying Heavy Hitters
5.2 ML-Oracle Estimating Heavy Hitters' Counts
6 Conclusions
References
Deep Randomized Neural Networks
1 Introduction
2 Randomization in Feed-Forward Neural Networks
2.1 Description of the Model
2.2 Training the Network
2.3 Additional Considerations
3 Deep Random-Weights Neural Networks
3.1 Analyzing Randomized Deep Networks Through the Lens of Kernel Methods
3.2 The Relation Between Random Weights and Metric Learning
3.3 Deep Randomized Neural Networks as Priors
3.4 Towards Practical Deep Randomized Networks: Relation with Pruning
3.5 Training of Deep Randomized Networks via Stacked Autoencoders
3.6 Semi-random Neural Networks
3.7 Weight-Agnostic Neural Networks
3.8 Final Considerations
4 Randomization in Dynamical Recurrent Networks
5 Reservoir Computing Neural Networks
5.1 Reservoir Initialization
5.2 Reservoir Richness
6 Deep Reservoir Computing
7 Deep Echo State Networks
7.1 Enriched Deep Representations
7.2 Deep Reservoirs for Structures
8 Conclusions
References
Tensor Decompositions and Practical Applications: A Hands-on Tutorial
1 Introduction
2 Prerequisites
2.1 Terminology
2.2 Required Software
3 Theoretical Background
3.1 Fundamental Transformations
3.2 Efficient Representation of Multi-dimensional Arrays
4 Applications
4.1 Image Compression with HOSVD
4.2 Tensor Ensemble Learning
4.3 Support Tensor Machine with Application in Finance
5 Conclusion
References
Deep Learning for Graphs
1 Introduction
2 Foundational Models for Learning in Structured Domains: From Sequences and Trees to Graphs
2.1 Basics on Data Structures and Transductions
2.2 Recursive Models for Hierarchical Data
2.3 Cycles and General Graphs
3 Deep Learning Models
3.1 Spectral Graph Convolutions
3.2 Contextual Graph Processing
3.3 Graph Pooling
4 Discussion and Conclusions
References
Limitations of Shallow Networks
1 Introduction
2 Preliminaries
3 Approximate Measures of Sparsity
4 Correlation and Concentration of Measure
5 Probabilistic Lower Bounds on Approximate Measures of Sparsity
6 Sizes of Dictionaries of Computational Units
7 Constructions of Functions with Large Variations
8 Examples
9 Discussion
References
Fairness in Machine Learning
1 Introduction
2 CBNs: An Essential Tool for Fairness
2.1 CBNs: A Visual Tool for (Un)fairness
2.2 CBNs: A Quantitative Tool for (Un)fairness
3 Methods for Imposing Fairness in a Model
3.1 Constraints on Distributions with Optimal Transport
3.2 General Fair Empirical Risk Minimization
3.3 Learning Fair Representations from Multiple Tasks
3.4 If the Explicit Use of Sensitive Attributes is Forbidden
4 Conclusions
References
Online Continual Learning on Sequences
1 Introduction
2 Online Continual Learning
2.1 Formalizing CL Algorithms
2.2 OCL on Sequences
2.3 Datasets and Benchmarks
3 Hybrid Approaches for Gradient-Based OCL
3.1 Copy Weight with Reinit (CWR)
3.2 Architect and Regularize (AR1)
4 Growing Networks with Experience Replay
4.1 Growing Self-Organizing Networks
4.2 Replay via Neural Re-activations
5 Conclusions
References

Studies in Computational Intelligence 896

Luca Oneto Nicolò Navarin Alessandro Sperduti Davide Anguita   Editors

Recent Trends in Learning From Data Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019)

Studies in Computational Intelligence Volume 896

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are submitted to indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink.

More information about this series at http://www.springer.com/series/7092

Luca Oneto · Nicolò Navarin · Alessandro Sperduti · Davide Anguita
Editors

Recent Trends in Learning From Data
Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019)

Springer

Editors

Luca Oneto
DIBRIS, University of Genoa, Genova, Italy

Nicolò Navarin
Department of Mathematics “Tullio Levi-Civita”, University of Padua, Padova, Italy

Alessandro Sperduti
Department of Mathematics “Tullio Levi-Civita”, University of Padua, Padova, Italy

Davide Anguita
DIBRIS, Università degli Studi di Genova, Genova, Italy

ISSN 1860-949X  ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-43882-1  ISBN 978-3-030-43883-8 (eBook)
https://doi.org/10.1007/978-3-030-43883-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The 2019 INNS Big Data and Deep Learning conference was held in Sestri Levante, Italy, on April 16–18, 2019. The conference was organized by the International Neural Network Society, with the aim of providing an international meeting for researchers and other professionals in Big Data, Deep Learning, and related areas. This book collects the seven tutorials presented at the conference, which cover most of the recent trends in learning from data: Learned Data Structures, Deep Randomized Neural Networks, Tensor Decompositions, Deep Learning for Graphs, Limitations of Shallow Networks, Fairness in Machine Learning, and Online Continual Learning on Sequences. We would like to thank the University of Genova, the University of Padua, the International Neural Network Society, the Associazione Italiana per l’Intelligenza Artificiale, the Var Group S.p.A., the SoftJam S.p.A., and the aizoOn Technology Consulting for supporting the conference.

Luca Oneto, Genova, Italy
Nicolò Navarin, Padova, Italy
Alessandro Sperduti, Padova, Italy
Davide Anguita, Genova, Italy

Contents

Introduction
Luca Oneto, Nicolò Navarin, Alessandro Sperduti, and Davide Anguita

Learned Data Structures
Paolo Ferragina and Giorgio Vinciguerra

Deep Randomized Neural Networks
Claudio Gallicchio and Simone Scardapane

Tensor Decompositions and Practical Applications: A Hands-on Tutorial
Ilya Kisil, Giuseppe G. Calvi, Bruno Scalzo Dees, and Danilo P. Mandic

Deep Learning for Graphs
Davide Bacciu and Alessio Micheli

Limitations of Shallow Networks
Věra Kůrková

Fairness in Machine Learning
Luca Oneto and Silvia Chiappa

Online Continual Learning on Sequences
German I. Parisi and Vincenzo Lomonaco

Introduction Recent Trends in Learning from Data Luca Oneto, Nicolò Navarin, Alessandro Sperduti, and Davide Anguita

Abstract The 2019 INNS Big Data and Deep Learning conference was held in Sestri Levante, Italy, on April 16–18, 2019. The conference was organized by the International Neural Network Society, with the aim of providing an international meeting for researchers and other professionals in Big Data, Deep Learning and related areas. This book collects the tutorials presented at the conference, which cover most of the recent trends in learning from data.

1 Recent Trends in Learning from Data

The aim of the conference was to provide an international meeting for researchers and other professionals in Big Data, Deep Learning and related areas. This book collects the tutorials presented at the conference, which cover most of the recent trends in learning from data. The content of the book is summarized in the following sections.

L. Oneto (B) · D. Anguita University of Genoa, DIBRIS, Via Opera Pia 11a, 16145 Genova, Italy e-mail: [email protected] D. Anguita e-mail: [email protected] N. Navarin · A. Sperduti, Department of Mathematics “Tullio Levi-Civita”, University of Padua, Via Trieste, 63, 35121 Padova, Italy e-mail: [email protected] A. Sperduti, e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies in Computational Intelligence 896, https://doi.org/10.1007/978-3-030-43883-8_1


1.1 Learned Data Structures

Very recently, the unexpected combination of data structures and machine learning has led to the development of a new area of research, called learned data structures. Their distinguishing trait is the ability to reveal and exploit patterns and trends in the input data for achieving more efficiency in time and space, compared to previously known data structures. The goal of this chapter is to provide the first comprehensive survey of these results and to stimulate further research in this promising area.

1.2 Deep Randomized Neural Networks

Randomized Neural Networks explore the behavior of neural systems where the majority of connections are fixed, either in a stochastic or a deterministic fashion. Typical examples of such systems consist of multi-layered neural network architectures where the connections to the hidden layer(s) are left untrained after initialization. Limiting the training algorithms to operate on a reduced set of weights inherently characterizes the class of Randomized Neural Networks with a number of intriguing features. Among them, the extreme efficiency of the resulting learning processes is undoubtedly a striking advantage with respect to fully trained architectures. Besides, despite the involved simplifications, randomized neural systems possess remarkable properties both in practice, achieving state-of-the-art results in multiple domains, and theoretically, allowing one to analyze intrinsic properties of neural architectures (e.g. before training of the hidden layers’ connections). In recent years, the study of Randomized Neural Networks has been extended towards deep architectures, opening new research directions to the design of effective yet extremely efficient deep learning models in vectorial as well as in more complex data domains. This chapter surveys all the major aspects regarding the design and analysis of Randomized Neural Networks, and some of the key results with respect to their approximation capabilities. In particular, the chapter first introduces the fundamentals of randomized neural models in the context of feed-forward networks (i.e., Random Vector Functional Link and equivalent models) and convolutional filters, before moving to the case of recurrent systems (i.e., Reservoir Computing networks). For both, the chapter focuses specifically on recent results in the domain of deep randomized systems, and (for recurrent models) their application to structured domains.

1.3 Tensor Decompositions and Practical Applications

The exponentially increasing availability of big and streaming data comes as a direct consequence of the rapid development and widespread use of multi-sensor technology. The quest to make sense of such a large volume and variety of data has both


highlighted the limitations of standard flat-view matrix models and the necessity to move toward more versatile data analysis tools. One such model, which is naturally suited for data of large volume, variety and veracity, is the multi-way array, or tensor. The associated tensor decompositions have been recognised as a viable way to break the “Curse of Dimensionality”, an exponential increase in data volume with the tensor order. Owing to the scalable way in which they deal with multi-way data and their ability to exploit inherent deep data structures when performing feature extraction, tensor decompositions have found application in a wide range of disciplines, from very theoretical ones, such as scientific computing and physics, to the more practical aspects of signal processing and machine learning. It is therefore both timely and important for the wider Data Analytics community to become acquainted with the fundamentals of such techniques. Thus, our aim is not only to provide the necessary theoretical background for multi-linear analysis but also to equip researchers and interested readers with easy-to-read and easy-to-understand practical examples in the form of Python code snippets.

1.4 Deep Learning for Graphs

This chapter introduces an overview of methods for learning in structured domains, covering foundational works developed within the last twenty years to deal with a whole range of complex data representations, including hierarchical structures, graphs and networks, and giving special attention to recent deep learning models for graphs. While providing a general introduction to the field, the chapter explicitly focuses on the neural network paradigm, showing how, across the years, these models have been extended to the adaptive processing of incrementally more complex classes of structured data. The ultimate aim is to show how to cope with the fundamental issue of learning adaptive representations for samples with varying size and topology.

1.5 Limitations of Shallow Networks

Although originally biologically inspired neural networks were introduced as multilayer computational models, shallow networks have been dominant in applications till the recent renewal of interest in deep architectures. Experimental evidence and successful applications of deep networks pose theoretical questions asking: When and why are deep networks better than shallow ones? This chapter presents some probabilistic and constructive results showing limitations of shallow networks. It shows implications of geometrical properties of high-dimensional spaces for probabilistic lower bounds on network complexity. The bounds depend on covering numbers of dictionaries of computational units and sizes of domains of functions to be computed. Probabilistic results are complemented by constructive ones built using Hadamard matrices and pseudo-noise sequences.


1.6 Fairness in Machine Learning

Machine learning based systems are reaching society at large and many aspects of everyday life. This phenomenon has been accompanied by concerns about the ethical issues that may arise from the adoption of these technologies. ML fairness is a recently established area of machine learning that studies how to ensure that biases in the data and model inaccuracies do not lead to models that treat individuals unfavorably on the basis of characteristics such as race, gender, disabilities, and sexual or political orientation. In this chapter, the authors discuss some of the limitations present in the current reasoning about fairness and in methods that deal with it, and describe some work done by the authors to address them. More specifically, they show how Causal Bayesian Networks can play an important role to reason about and deal with fairness, especially in complex unfairness scenarios. They describe how optimal transport theory can be leveraged to develop methods that impose constraints on the full shapes of distributions corresponding to different sensitive attributes, overcoming the limitation of most approaches that approximate fairness desiderata by imposing constraints on the lower order moments or other functions of those distributions. They present a unified framework that encompasses methods that can deal with different settings and fairness criteria, and that enjoys strong theoretical guarantees. They introduce an approach to learn fair representations that can generalize to unseen tasks. Finally, they describe a technique that accounts for legal restrictions about the use of sensitive attributes.

1.7 Online Continual Learning on Sequences

Online continual learning (OCL) refers to the ability of a system to learn over time from a continuous stream of data without having to revisit previously encountered training samples. Learning continually in a single data pass is crucial for agents and robots operating in changing environments and required to acquire, fine-tune, and transfer increasingly complex representations from non-i.i.d. input distributions. Machine learning models that address OCL must alleviate catastrophic forgetting, in which hidden representations are disrupted or completely overwritten when learning from streams of novel input. In this chapter, recent deep learning models that address OCL on sequential input are summarized and discussed through the use (and combination) of synaptic regularization, structural plasticity, and experience replay. Empirical evidence shows that architectures endowed with experience replay, i.e., the re-occurrence of (latent representations of) input sequences for protecting consolidated knowledge over time, typically outperform architectures without it in (online) incremental learning tasks.

Acknowledgements We would like to thank the University of Genova, the University of Padua, the International Neural Network Society, the Associazione Italiana per l’Intelligenza Artificiale, the Var Group S.p.A., the SoftJam S.p.A., and the aizoOn Technology Consulting for supporting the conference.

Learned Data Structures Paolo Ferragina and Giorgio Vinciguerra

Abstract Very recently, the unexpected combination of data structures and machine learning has led to the development of a new area of research, called learned data structures. Their distinguishing trait is the ability to reveal and exploit patterns and trends in the input data for achieving more efficiency in time and space, compared to previously known data structures. The goal of this chapter is to provide the first comprehensive survey of these results and to stimulate further research in this promising area.

1 Introduction

The searching problem is among the oldest and most prominent problems in computer science, well-studied and ubiquitous in research and applications. Not surprisingly, it is often used as an introductory topic in basic algorithms courses, paving the way to the study of fundamental data structures such as arrays, lists, search trees, tries and hash tables. Simply stated, the searching problem asks to preprocess a set of items into a data structure in such a way that certain kinds of queries on these items can subsequently be answered quickly. Clearly, there are many different facets and solutions to this problem, depending on how the data are organised in memory and which operations are supported, but the above basic data structures are often sufficient to solve sophisticated versions of the problem, be it a search for neighbouring points of interest in a map or a search for all documents relevant to a query in a search engine. As an example, a data structure for this last application, called inverted list, consists of a hash table or a trie mapping each possible query term to a list or an array of document-IDs containing that term.

P. Ferragina · G. Vinciguerra (B)
Università di Pisa, Pisa, Italy
e-mail: [email protected]
P. Ferragina
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020
L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies in Computational Intelligence 896, https://doi.org/10.1007/978-3-030-43883-8_2


We refer to this first family as traditional data structures, which also includes variations (stacks, queues, heaps, circular lists, …) and combinations (multiway trees, spatial data structures, randomised lists, …) of the most known ones. In these data structures, the main goal pursued by researchers and software engineers was to achieve efficient/optimal query time and space as a function of the number of items [7]. Since the ‘90s, with the flood of big data and the advent of hierarchical memories in computers, researchers aimed at designing compact data structures, sometimes called succinct, compressed or opportunistic data structures. These data structures exploit the repetitiveness present in the input data to occupy a space close to the information-theoretic lower bound and still provide efficient query operations. This means that they do not require the data to be fully decompressed in order to perform searches over it. Today, it is known how to turn almost any traditional data structure into a compact data structure [29]. In general, both families of traditional and compact data structures offer a wide range of trade-offs, and no single solution is satisfactory for every application because of differing hardware and software requirements or constraints imposed by user needs. As a result, software engineers often have to choose (sometimes incautiously) one among a multitude of data structures based on the partial or superficial information available when the choice is made, and then soon discover that their choice is inefficient because it did not take into account some specialities of the input data. This situation is well captured by the following excerpt:

Another simple way to facilitate […] retrieval is to let people do part of the work, by providing them with suitable printed indexes to the information. This method is often the most reasonable and economical way to proceed (provided, of course, that the old paper is recycled whenever a new index is printed), especially because people tend to notice interesting patterns when they have convenient access to masses of data. — Donald Knuth, The Art of Computer Programming (1973).

Recently, researchers were able to define machine learning-based tools that automatically detect such patterns and, unexpectedly, orchestrated them with classic data structures to design a new family of data structures, called learned data structures, that attempts exactly to reveal and exploit patterns and trends in the input data for achieving more efficiency in time and space compared to the previously known solutions. The key design idea consists of augmenting—and sometimes even replacing—classic data-structural building blocks, such as tree nodes or hash tables, with Machine Learning (ML) models which are better suitable to “notice interesting patterns when they have convenient access to masses of data”. This feature, combined with proper data structural design elements and algorithms has led to outstanding improvements in space occupancy and time efficiency over a plethora of searching problems, some of which will be introduced and discussed in the following sections. A first overview of the problems and corresponding results achieved in this new algorithmic field is offered in Table 1, where for each problem we summarise the ML approach taken, its benefits and drawbacks, and we point out the main references to the literature. We strictly limit ourselves to four main searching problems: exact

membership, range queries, approximate membership, and frequency estimation.

Table 1 A summary of machine-learned data structures for four main searching problems. CDF stands for Cumulative Distribution Function of a probability distribution.

Range queries (Sect. 2)
  Approach: learn the CDF of the data and use it to map queries to memory locations.
  Pros: reduced space and faster queries.
  Cons: no clear extensions to variable-length keys.
  Refs: [11, 12, 14, 15, 23–25, 33, 36, 38, 39]

Exact membership (Sect. 3)
  Approach: learn the CDF of the data and use it as a hash function.
  Pros: fewer hash collisions.
  Cons: increased space for the hash function, and increased query latency. Susceptible to adversarial attacks.
  Refs: [23, 40]

Approximate membership (Sect. 4)
  Approach: classify the membership of items.
  Pros: reduced space.
  Cons: no guarantees if the distribution changes. Modest space reductions. Susceptible to adversarial attacks.
  Refs: [10, 23, 26, 28, 34]

Frequency estimation (Sect. 5)
  Approach: classify heavy hitters and/or predict the frequency of items.
  Pros: improved estimations.
  Cons: requires historical data.
  Refs: [18, 41]

The set of problems which can be solved via ML-based approaches is growing—just to mention a few, cardinality estimation of SQL queries via deep learning [21], learned query optimisers [22], learned operating systems [42], learned sorting algorithms [22], learned prefetchers [16]—but they are beyond the scope of this chapter. Furthermore, while not strictly related to the content of this chapter, we ought to mention the work of [19, 20] concerning the semi-automated (and possibly ML-driven) design of data structures from their first design principles.

Notation and basic terminology. We denote by n the input size and use log to denote the logarithm to the base 2. To analyse algorithms, we use both the Random Access Machine (RAM) model and the external (or, two-level) memory model [37]. The RAM model consists of an infinite memory of O(log n)-bit cells, and it supports arithmetic, logical and bitwise operations on individual cells in constant time. The time cost of an algorithm is evaluated by counting the asymptotic number of steps it takes to solve a problem, while its space cost is the maximal number of memory cells (sometimes expressed in bits) it occupies during the computation. On the other hand, the external memory model abstracts the memory hierarchy by modelling just two levels: an internal memory of limited size M, and an external memory of unlimited size divided into blocks of B consecutive items. Data is brought into internal memory and written back to external memory by transferring one block at


a time. The efficiency of an algorithm is then evaluated by counting the asymptotic number of transfers, or I/Os, it makes for solving a given problem.

2 Learned Data Structures for Range Queries

Structuring data to provide fast retrieval by individual keys or ranges of keys is a problem as old as computer science, and it offers arguably the most successful example of the interplay between data structures and machine learning. The indexable dictionary problem asks to store a multiset S of keys drawn from a universe U in order to efficiently support the following query and update operations:
• member(x) = true if x ∈ S, false otherwise;
• lookup(x) returns the satellite data of x ∈ S (if any), nil otherwise;
• predecessor(x) = max{y ∈ S | y < x};
• range(x, y) = S ∩ [x, y];
• insert(x) adds x to S, i.e. S = S ∪ {x};
• delete(x) removes x from S, i.e. S = S \ {x}.
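For concreteness, the interface above can be stated as a Python abstract base class; the class and method names below simply mirror the listed operations and are not part of the chapter.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Optional

class IndexableDictionary(ABC):
    """The query and update operations of the indexable dictionary problem."""

    @abstractmethod
    def member(self, x) -> bool:
        """True if x is in S, False otherwise."""

    @abstractmethod
    def lookup(self, x) -> Optional[Any]:
        """Satellite data of x in S (if any), None otherwise."""

    @abstractmethod
    def predecessor(self, x):
        """max{y in S | y < x}."""

    @abstractmethod
    def range(self, lo, hi) -> Iterable:
        """S ∩ [lo, hi]."""

    @abstractmethod
    def insert(self, x) -> None:
        """S = S ∪ {x}."""

    @abstractmethod
    def delete(self, x) -> None:
        """S = S \\ {x}."""
```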

A data structure implementing the above query and update operations is called an index structure, or simply index. The B-tree and its variations have been, since the ‘70s, the predominant indexes for working on disk memories in commercial database systems [32]. A B-tree is a search tree with fan-out Θ(B). A node r stores r.num = Θ(B) keys r.key_i in ascending order and associated data r.data_i (1 ≤ i ≤ r.num). An internal node r stores also r.num + 1 pointers to its children r.child_j (1 ≤ j ≤ r.num + 1). A query lookup(x) starts from the root being the current node. If this node contains a key equal to x, then the search terminates successfully and returns the data associated with x. Otherwise, if the node is a leaf, the search terminates unsuccessfully and returns nil. In the other cases, the search recurses into the unique child that may contain x. Since a node can be accessed in O(1) I/Os, the cost of a search is proportional to the height of the tree, i.e. O(log_B n) I/Os. Other pointwise queries are slight modifications of the tree traversal we just described. A variation of the B-tree, called B+-tree, stores instead the keys from S and any associated data at the leaves, which are linked left-to-right for fast range queries. Inner nodes contain copies of certain keys from S and act only as routing elements, i.e. an inner node r contains keys r.key_i (1 ≤ i ≤ r.num) and pointers to children r.child_j (1 ≤ j ≤ r.num + 1) but not the data associated with keys, as shown in Fig. 1. A lookup in a B+-tree costs Θ(log_B n) I/Os. The work of Kraska et al. [23], which extended some previous original results of Ao et al. [1], has provided us with a different perspective on this old-fashioned


Fig. 1 A B+ -tree with fan-out 3 stores all the keys x = 1, 2, . . . , 8 and associated data, denoted as x ∗ , at the leaf level. Inner nodes store copies of certain keys and act as routing elements

problem. The key idea introduced by these authors is that indexes are models that can be trained to map keys to their location in the sorted order, and this mapping is enough to efficiently implement any pointwise and range query of the indexable dictionary problem. In fact, let us denote by rank(x) the primitive that returns, for any key x ∈ U, the number of keys in S which are smaller than x, and let A be the array storing the keys of S in sorted order. Then, member(x) can be implemented by checking whether A[rank(x)] = x or not; predecessor(x) consists of returning A[rank(x) − 1]; and range(x, y) consists of scanning the array A from position rank(x) up to the first key larger than y. This parallel between index structures and the rank function is not new, as indeed any B-tree offers an implementation of it. But its novelty becomes clear when we look at the keys x ∈ S as points (x, rank(x)) in the Cartesian plane. As an example, let us consider the case of a set of keys S = {2, 4, 6, . . . , 2n}. Here, as depicted in Fig. 2, the points (x, rank(x)) can be “covered” by the linear model f(x) = 0.5x − 1, so that the function rank(x) can be computed exactly for each key x ∈ S in constant time and space, independently of the number of keys in S. Consequently, one should never build a B-tree on that set of keys! This trivial example sheds light on the potential compression opportunities offered by patterns and trends in the data distribution. However, we cannot argue that all datasets follow exactly a linear trend, nor that the models can learn the data distribution with no errors. Nevertheless, the idea is promising, so that, to deal with the general setting, we have to design techniques that:
• learn the rank function by extracting the patterns in the data through succinct models, ranging from linear to more sophisticated ones;
• admit some “errors” in the output of the model approximating the rank function, errors which, in turn, can be efficiently (in time and space) corrected to return the exact value of rank.
This is a supervised learning task [17], in which the dataset D = {(x, rank(x))}x∈S is given, and we look for a model f : U → [0, n) which maps keys of S to their positions in the sorted order, and minimises the error |f(x) − rank(x)| over all x ∈ S.¹

Fig. 2 A set of keys S = {2, 4, 6, . . . , 2n} mapped to points (x, rank(x)) in the Cartesian plane (keys on the horizontal axis, positions on the vertical axis), and the line passing through (“covering”) all of them

Equivalently, we look for a model F : U → [0, 1] which minimises |nF(x) − rank(x)| over all x ∈ S. Intuitively, F(x) is the fraction of keys that are less than x, i.e. the learned empirical Cumulative Distribution Function (CDF) of D. It is clear that f(x) or F(x) alone are not sufficient to solve the searching problem for a key x in S because of the errors present in the learned approximation provided by those functions. Hence, proper algorithms and data structures have to be designed to orchestrate learned models with classic building blocks that allow us to correct these approximations and provide correct answers to the searches for x. A simple but illustrative example of such algorithmic corrections follows below.

1 We assume that the universe U is a range of reals because of the arithmetic operations required by the models. This works for any kind of keys that can be mapped to reals by preserving their order, such as integers or strings.

Example: implementing rank with a linear regression model. Suppose we are given an array A = [27, 29, 32, 34, 34, 35, 37, 37, 37, 38, 38, 40, 41] of 13 integer (repeated) keys, and consider the corresponding set of points in the Cartesian plane obtained by representing only the first occurrence of every key with its rank (but dropping the other copies): D = {(x, rank(x))}x∈A = {(27, 0), (29, 1), (32, 2), (34, 3), (35, 5), (37, 6), (38, 9), (40, 11), (41, 12)}.

The linear model f computed using ordinary least squares on D has slope 0.88 and intercept −25.23. As shown in Fig. 3, if we use f to approximate the

Fig. 3 Real data rarely have trivial trends like the one in Fig. 2. More often there are missing keys (gaps in the horizontal axis) and repeated keys (points stacked vertically). In such cases, before replacing a traditional index with a learned model we must design a strategy to fix the errors of the model. (The plot shows the keys of A versus their positions, the fitted line f(x) = 0.88x − 25.23, and the error incurred on one of the keys.)

rank of the key x = 34, we get r = ⌈f(x)⌉ = ⌈4.69⌉ = 5, but the true rank of x is 3. We can fix the error incurred by f via a linear search that starts from A[r] and stops when the first occurrence of (a key less than or equal to) x is found. Furthermore, instead of using a linear search, we can keep the maximum error err incurred by f over the keys in S, and use it at query time to perform a binary search in A[r − err, r + err] in O(log err) time. Likewise, an exponential search (also called doubling search or galloping search [4]) starting from A[r] could solve the problem more efficiently in time O(log d), where d = |rank(x) − r| is the actual distance between the estimated position and the correct position of x in A. Finally, we notice that this approach based on linear models and algorithmic correction can also be used to support unsuccessful searches for keys not in A with the same time complexity. As an example, let us assume that we wish to search for x = 33. We compute ⌈f(x)⌉ = ⌈0.88 × 33 − 25.23⌉ = ⌈3.81⌉ = 4 and then start an exponential search towards the beginning of A (since A[4] = 34 > 33), which finds that 33 does not occur in A. In the following sections, we trace the logical development of more sophisticated learned data structures. We look at different techniques to learn the function rank for the keys in S, and present solutions to the challenges that arise when replacing well-established algorithms and data structures with ML models in several searching problems, starting from the classic range search.
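The example above can be turned into a few lines of code. The sketch below (ours, not the chapter’s) fits one linear model by ordinary least squares on the (key, rank) pairs and then corrects its prediction with an exponential (galloping) search; function names are illustrative.

```python
import bisect

def fit_linear_model(A):
    """Ordinary least squares over the points (key, rank of the first occurrence)."""
    xs = sorted(set(A))
    ys = [bisect.bisect_left(A, x) for x in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    return slope, my - slope * mx

def learned_rank(A, slope, intercept, x):
    """rank(x), i.e. the leftmost position of a key >= x in the sorted array A,
    obtained by correcting the model estimate with an exponential search."""
    n = len(A)
    r = min(max(round(slope * x + intercept), 0), n - 1)   # model estimate
    if A[r] < x:                      # estimate too low: gallop to the right
        step = 1
        while r + step < n and A[r + step] < x:
            step *= 2
        lo, hi = r + step // 2 + 1, min(n, r + step + 1)
    else:                             # estimate too high (or exact): gallop to the left
        step = 1
        while r - step >= 0 and A[r - step] >= x:
            step *= 2
        lo, hi = max(0, r - step + 1), r - step // 2 + 1
    return bisect.bisect_left(A, x, lo, hi)   # O(log d) comparisons overall

A = [27, 29, 32, 34, 34, 35, 37, 37, 37, 38, 38, 40, 41]
slope, intercept = fit_linear_model(A)        # roughly 0.88 and -25.23, as in the text
print(learned_rank(A, slope, intercept, 34))  # 3
print(learned_rank(A, slope, intercept, 33))  # 3 (33 is not in A)
```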


2.1 The Recursive Model Index

The Recursive Model Index (RMI) [23] is a fixed hierarchy of regression models organised in stages. At query time each model, starting from the one at the root, takes the searched key as input and picks the model in the lower stage responsible for that key, i.e. the expert of the range of the training data into which the key falls. The output of a model in the last stage is used as an approximate position of the searched key in A, the sorted array of input keys. If we imagine creating an edge from each model to the models it picks in the stage below, then the resulting structure is a Directed Acyclic Graph (DAG), since different models at one stage can pick the same model at the stage below, as pictured in Fig. 4.

The construction of RMI proceeds top-down by training the hierarchy of regression models. First, the root model f_{1,1} is trained on the entire dataset D. Second, this model is used to distribute each pair (x, y) ∈ D to one of the p models in the lower stage according to the map ⌊p · f_{1,1}(x)/n⌋, thus effectively partitioning D into D_1, . . . , D_p. Third, the process is repeated recursively by training the jth model f_{2,j} on the dataset D_j for all j = 1, . . . , p. This way, keys are further partitioned in the models on the lower stage, and the process repeats recursively for all ℓ stages (i.e. the depth of the DAG). The time to build RMI, by assuming that the training time of a model is linear in the size of the training set, is O(n). After the training, the worst over-prediction and under-prediction of every model f_{ℓ,j} in the last stage are evaluated (and stored) as

min_j = | min_{(x,y) ∈ D_{ℓ,j}} (y − f_{ℓ,j}(x)) |   and   max_j = | max_{(x,y) ∈ D_{ℓ,j}} (y − f_{ℓ,j}(x)) |.   (1)

If one of the two errors exceeds a certain user-defined threshold, the corresponding model is replaced by a B-tree. Otherwise, the errors are stored and used at query time to limit the final binary-search step to A[pos − min j , pos + max j ], where pos is the approximate position computed via the jth last-stage model. An example of searching for a key x in RMI is shown in Fig. 4. Note that RMI has to be monotonic to ensure that the errors min j and max j are guaranteed also for the keys not in D. The recursive model index is different from traditional tree-based indexes (such as B-trees) for at least four main reasons: 1. Its size is constant and is given by the overall number of parameters in the hierarchy of models plus the two errors stored for each model in the last stage. Conversely, the size of traditional indexes is linear in n. 2. It can take advantage of patterns and trends in the distribution of the input data. Conversely, traditional indexes are general-purpose and do not make assumptions about the data distribution. 3. In-between stages there are no searches, as the output of a model is directly used to pick the model of the next stage. Conversely, traditional indexes perform a search to identify the next node to visit.


Fig. 4 RMI is a fixed hierarchy of models organised in stages. In this example, the root model f 1,1 routes the input key x to the model f 2,2 in the second stage, which in turn routes x to the last-stage model f 3,3 . The output of f 3,3 along with its errors min j and max j is used to limit the final search step to A[pos − min j , pos + max j ]

4. The error cannot be bounded/evaluated beforehand. Conversely, traditional indexes allow specifying the node size (i.e. the maximum number of keys inspected at each level), thus bounding in advance the worst-case performance.

On datasets of 200 million entries, a two-stage RMI² always dominated an in-memory implementation of a B+-tree, being up to 1.5–3× faster and up to two orders of magnitude smaller.³ Among the shortcomings of RMI, we mention: (i) the lack of support for insertions and deletions; (ii) the time-consuming tuning process, needed to shape the hierarchy of models, and the subsequent training/construction time; (iii) the lack of latency guarantees; (iv) the top-down training algorithm, which blindly distributes keys to the models below, ignoring their power or workload in terms of partition size.

2 Private correspondence with the authors clarified that only linear models were used in the hierarchy due to their superior performance.
3 It has to be noted that a B+-tree supports also insertions and deletions, which require a more complex structure that is both slower and more space-inefficient than other static tree-based indexes.
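To make the two-stage construction and query concrete, here is a small Python sketch with linear models at both stages. It is only an illustration of the scheme described above (class and parameter names are ours), it assumes distinct sorted keys, and it omits the replacement of poor last-stage models with B-trees.

```python
import bisect
import math

def ols(points):
    """Least-squares line through (x, y) points; degenerates to a constant."""
    if not points:
        return 0.0, 0.0
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    var = sum((x - mx) ** 2 for x, _ in points)
    if var == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in points) / var
    return slope, my - slope * mx

class TwoStageRMI:
    """A sketch of a two-stage RMI over a sorted list of distinct keys."""

    def __init__(self, keys, p=16):
        self.keys, self.p = keys, p
        data = [(x, i) for i, x in enumerate(keys)]
        self.root = ols(data)                   # f_{1,1}, trained on all of D
        parts = [[] for _ in range(p)]          # partition D among the p leaf models
        for x, y in data:
            parts[self._route(x)].append((x, y))
        self.leaves = [ols(part) for part in parts]
        # Per-leaf error bounds in the spirit of Eq. (1), rounded up with one
        # extra unit of slack to absorb the truncation of the predicted position.
        self.err = []
        for j, part in enumerate(parts):
            errs = [y - self._predict(j, x) for x, y in part] or [0.0]
            self.err.append((math.ceil(abs(min(errs))) + 1,
                             math.ceil(abs(max(errs))) + 1))

    def _route(self, x):
        slope, intercept = self.root
        pred = slope * x + intercept
        return min(self.p - 1, max(0, int(self.p * pred / len(self.keys))))

    def _predict(self, j, x):
        slope, intercept = self.leaves[j]
        return slope * x + intercept

    def member(self, x):
        j = self._route(x)                      # no search in-between stages
        pos = int(self._predict(j, x))
        min_j, max_j = self.err[j]
        lo = max(0, pos - min_j)
        hi = min(len(self.keys), pos + max_j)
        i = bisect.bisect_left(self.keys, x, lo, hi)
        return i < len(self.keys) and self.keys[i] == x

keys = list(range(0, 70000, 7))                 # distinct sorted keys
rmi = TwoStageRMI(keys)
print(rmi.member(7 * 123), rmi.member(5))       # True False
```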

2.2 Variations and Extensions of RMI

After the introduction of RMI, subsequent research has focused mainly on its fixed structure and the lack of support for insertions and deletions of keys. A common denominator of the subsequent research efforts is “adaptability”, that is, getting rid

of the fixed hierarchical structure of RMI, which without a proper (time-consuming) tuning can lead to redundant or overloaded models. Redundancy occurs when upper stages of RMI distribute too few keys to a single model, thus failing to use its approximation potential at full scale. Overloading occurs when upper stages distribute too many keys to a single model, which may not have the capability of fully capturing the data trend, thus causing high prediction errors and in turn high query times. In the following, we present four learned indexes that address the problems mentioned above.

2.2.1 ASLM

The Adaptive Single Layer Model (ASLM) [25] is a single-stage learned index that addresses two shortcomings of RMI: the potentially poor distribution of keys among models in the lower levels of the DAG, and the lack of support for updates.

For the first problem, ASLM heuristically partitions the dataset D to minimise the chance that a model is trained on data points which are far apart. First, the Euclidean distance between every two consecutive points in the sorted dataset D is computed. Second, the distances are sorted in descending order. Third, the points corresponding to the first p distances in the sorted order are selected as partition boundaries, where p is a fixed parameter. Fourth, the partitions are further refined by splitting or merging the consecutive ones whose size is above or below a certain fixed threshold, so as to guarantee a certain balance among the sizes of the partitions. Finally, a model is created for each partition and trained on it. Overall, the construction cost is bounded above by the sorting of n keys. ASLM requires some sort of search to select the correct model for the queried key, while RMI uses the output of the root model to pick the model in the next stage. Moreover, the adaptability of ASLM to a given dataset is somewhat hindered by the fact that the number p of models and the thresholds for splitting/merging partitions must be tuned. The authors reported that the partitioning strategy of ASLM reduced the average prediction error up to 3.8× with respect to a two-stage RMI that used 3-layer neural networks with 32 hidden units and ReLU activations.

For the second problem (i.e. the support for updates), ASLM maintains with each model both a data array and a buffer. An insertion affects the data array only if the error of the inserted key is less than the average error of the model. Otherwise, the key is inserted into the buffer.⁴ Once the buffer is full, it is merged with the data array and the corresponding model is retrained. A deletion either removes a key from the buffer, or it marks the key as deleted in the data array. In this latter case, no retraining is necessary since the other keys are not moved from their positions, and hence the performance of the current model remains unchanged.

To avoid too large (or too small) data arrays, ASLM performs a split (or merge) strategy that creates a new model (or fuses two consecutive models) once some lower (or upper) threshold on the number of elements is hit. Clearly, these split/merge operations trigger the retraining of the affected model, but this retraining is likely to be quick since the weights of the original model can be used as the starting weights for the new model.

4 The authors suggest organising the buffer “like a hash table”. Even though hashing simplifies buffer modifications, we lose the ability to do predecessor or range queries efficiently.
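The distance-based partitioning can be sketched as follows (our code, with invented names). For consecutive points (x_i, i) and (x_{i+1}, i+1) the Euclidean distance is sqrt((x_{i+1} − x_i)² + 1), so ranking by distance is the same as ranking by the key gap; the balancing/refinement step of ASLM is omitted, and with p selected gaps one obtains p + 1 partitions.

```python
def aslm_partition(keys, p):
    """Split the sorted keys at the p largest gaps between consecutive keys,
    so that no model is trained on points that are far apart."""
    largest = sorted(range(1, len(keys)),
                     key=lambda i: keys[i] - keys[i - 1],
                     reverse=True)[:p]
    boundaries = sorted(largest)            # indices where a new partition starts
    parts, prev = [], 0
    for b in boundaries + [len(keys)]:
        parts.append(keys[prev:b])
        prev = b
    return parts                             # one linear model is trained per part

print(aslm_partition([1, 2, 3, 50, 51, 52, 900, 901, 902], p=2))
# [[1, 2, 3], [50, 51, 52], [900, 901, 902]]
```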

2.2.2 Hybrid-O

The authors of RMI suggested replacing a last-stage model with a B-tree if its error exceeds a user-defined threshold. Other authors [33] suggested instead detecting the keys that cause large errors (called “outliers”), storing them in a B-tree and retraining the model on the non-outlier keys. Among the strategies proposed to detect outliers, the most effective one (in terms of reduction of the prediction error) is to compute the average μ and the standard deviation σ of the errors produced by a trained model and to collect as outliers the keys whose prediction error falls outside [μ − σ, μ + σ].

For the index structure, which we name Hybrid-O, the authors of [33] have proposed a single-stage index of linear models in conjunction with one B-tree for the outliers. The construction algorithm partitions the sorted dataset D into blocks of a user-given size and trains one model for each partition. Then, it collects the outliers with the strategy described above, adds them to the B-tree and finally retrains each model on the corresponding outliers-free partition. Both the training and the outliers-detection steps cost O(n) time, thus making the overall time to construct Hybrid-O equal to O(n), if we assume that the time to bulk-insert the outliers in the B-tree is negligible.

At query time, a first binary search determines the model responsible for estimating the position pos of a query key x, which is the rightmost model whose first covered key is less than or equal to x. Let j be the model index, and let min_j and max_j be its errors precomputed using (1). Another binary search for x is executed within A[pos − min_j, pos + max_j] and, if the search is unsuccessful, x is searched within the B-tree associated with this model. When inserting a new key y, if Hybrid-O detects that y is an outlier (because its prediction error falls outside [μ − σ, μ + σ]), then y is inserted in the B-tree. Otherwise, y is added to the partition that corresponds to the model responsible for y, and the model is retrained. Deletions in Hybrid-O are not discussed in [33].
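A sketch of the outlier rule just described (ours; the B-tree side and the retraining are left out): compute the per-key errors of a trained model, then separate the keys whose error falls outside [μ − σ, μ + σ].

```python
import statistics

def split_outliers(points, predict):
    """points is a list of (key, position) pairs; predict is the trained model.
    Returns (inliers, outliers): outliers go to a B-tree, and the model is then
    retrained on the inliers only."""
    errors = [y - predict(x) for x, y in points]
    mu, sigma = statistics.mean(errors), statistics.pstdev(errors)
    inliers, outliers = [], []
    for (x, y), e in zip(points, errors):
        (inliers if mu - sigma <= e <= mu + sigma else outliers).append((x, y))
    return inliers, outliers
```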

2.2.3 ALEX

The Adaptive LEarned indeX (ALEX) [11] introduces a top-down adaptive way of constructing a tree-shaped linear learned index. Specifically, the linear model in the root is trained on the entire dataset D as in RMI. Then, in order to create the children of the root node, its codomain [0, n) is divided into a number p (to be tuned beforehand) of equal-sized partitions which are scanned in sorted order. If the root model maps more than a user-given number m of keys into one partition, a new inner node for this partition is created, and the construction algorithm is run recursively on this node. Otherwise, the partition is merged with the subsequent partitions (as

Fig. 5 To overcome the static nature of RMI, ALEX instantiates linear models as needed, growing deeper until each leaf model has approximately the same number of keys. A search is performed similarly to RMI, with the difference that here a leaf model j stores a subarray A_j of the original sorted input array A

long as the number of accumulated keys is less than m), and a leaf node is created and trained on it. Note that in no case is a trained model discarded: it either becomes part of a leaf node approximating the position of the query key, or it becomes a routing model that forwards the query key to one of its children, as we will see shortly. If we assume that the partitions are balanced in terms of the number of keys which fall in each of them, then this construction process obeys the recurrence T(n) = p T(n/p) + Θ(n), which has solution O(n log n) for any p > 1.

In ALEX, each inner node stores both its learned linear model and one pointer for each child node, which is responsible for a partition of the space managed by the current node. This allows traversing the hierarchy of models without any search in-between levels by taking the currently visited node prediction, say c(x), and following the ⌊p · c(x)/n⌋-th child pointer. Instead, a leaf node j stores the model f_j and stores the n_j keys the model was trained on in a sorted array A_j, which is a subarray of A. Once a search arrives at a leaf node j, ALEX performs an exponential search in A_j starting from the position ⌊f_j(x)⌋. An example of a search in ALEX is depicted with solid lines in Fig. 5, where for simplicity we did not show the arrays of pointers inside inner nodes.

To accommodate insertions, ALEX can be configured to store the keys at the leaves in either a Gapped Array (GA) [2] or in a Packed Memory Array (PMA) [3]. Both GA and PMA are sorted arrays that evenly intersperse empty spaces among the n_j keys, so that making space for a new key requires pushing to the left or to the right of the insertion position only a small number of elements, i.e. only O(log n_j) moves for random insertion patterns. However, at the expense of a more involved algorithm, PMA guarantees O(log² n_j) worst-case time inserts for any insertion pattern, which is better than the O(n_j) time of a GA. In both choices of the key array, ALEX uses the model to predict the approximate insertion position, which is then corrected by an exponential search.

If the distribution of keys changes after several inserts, then some leaves will become overloaded with data. In this case, the authors suggest transforming an overloaded leaf into an inner node and creating a number of child nodes, similar to what


happens in the construction of the entire ALEX data structure. The net consequence is that ALEX grows deeper and deeper, possibly slowing down queries until a full index reconstruction is performed. Deletions in ALEX are not discussed in [11], but we can observe that, after locating the key to delete with the search algorithm described above, one can use the PMA deletion algorithm, which takes O(log² n_j) amortised time [3].
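The search-free traversal of the inner nodes can be sketched as follows (a toy structure of ours, not ALEX’s actual node layout): every inner node picks a child directly from its model’s prediction, and the returned leaf is then searched from the leaf model’s estimate, e.g. with the exponential search shown earlier in this section.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    model: Tuple[float, float]                 # (slope, intercept) of the linear model
    n: int                                     # number of keys the model was trained on
    children: Optional[List["Node"]] = None    # set for inner nodes
    keys: Optional[List[int]] = None           # set for leaf nodes (a subarray A_j)

def descend(node: Node, x) -> Node:
    """Route a query key to its leaf: each inner node follows the
    floor(p * c(x) / n)-th child pointer, with no search in-between levels."""
    while node.children is not None:
        slope, intercept = node.model
        c = slope * x + intercept              # predicted position of x among n keys
        p = len(node.children)
        j = min(max(int(p * c / node.n), 0), p - 1)
        node = node.children[j]
    return node                                # then search node.keys from its model estimate

# Two leaves covering [0, 100) and [100, 200), and a root that routes between them.
left = Node(model=(1.0, 0.0), n=100, keys=list(range(0, 100)))
right = Node(model=(1.0, -100.0), n=100, keys=list(range(100, 200)))
root = Node(model=(1.0, 0.0), n=200, children=[left, right])
print(descend(root, 42) is left, descend(root, 150) is right)   # True True
```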

2.2.4 AIDEL

Similar to ASLM and ALEX, the learned index based on Adaptive InDEpendent Linear regression models (AIDEL) [24] addresses the poor key distribution strategy of RMI and its lack of support for updates. But, unlike the others, AIDEL provides some latency guarantees too. They hinge upon a user-given integer parameter ε ≥ 1 indicating the maximum tolerable error for a model, and thus AIDEL adopts a construction algorithm that finds the proper number of models to match this error bound.

The construction algorithm trains a linear model on increasingly larger partitions of the key-sorted dataset, starting from the first key. A partition expands by a constant number k of keys set beforehand. At each expansion, the current model errors min_j and max_j are computed using (1). As soon as one of the two errors exceeds ε, the partition is iteratively decreased in size by a user-given fraction of k until the linear model error is not larger than ε. The last trained linear model is then appended to the result, and the process is continued on the rest of the dataset. Once the whole dataset is processed, each trained model f_j (one for each partition) is stored in a directory structure alongside the first key it covers and the values min_j and max_j. The construction of AIDEL takes O(n²) time, because we have Θ(n) expansions, each taking O(n) time to train and evaluate the error incurred by the current model.

At query time, AIDEL picks from the directory structure the model that can best approximate the position of the query key x, i.e. the rightmost f_j in the directory whose first covered key is less than or equal to x.⁵ Then, a final binary search is performed on A[pos − min_j, pos + max_j], where pos = f_j(x). Note that this last step costs O(log ε) time due to the threshold guaranteed on min_j and max_j. It goes without saying that RMI, ASLM, Hybrid-O and ALEX have no such guarantee on the search time, which may thus be bounded only by O(log n).

For the insertions, AIDEL adopts a simple but memory-hungry approach. It allocates a sorted list for each pair of consecutive input keys x_i, x_{i+1} that accommodates the insertions of new keys falling between x_i and x_{i+1}. When a sorted list becomes too long, it is merged with A, and the construction algorithm of the previous paragraph is rerun over the newly merged keys. The directory structure is updated accordingly. The deletion of a key y in AIDEL is not discussed in [24], but we can observe that it can be handled by removing y from the ith sorted list if y falls between two consecutive keys x_i, x_{i+1} in A, or by marking y as deleted (e.g. with an array of flags of the same size as A) if y belongs to A.

5 The authors do not discuss the details of this step, but we assume that it is performed in O(log m̃) time via a binary search on the m̃ sorted key-model pairs.

Intermezzo: Avoiding the Model Retraining. Some techniques have been suggested to avoid the model retraining after updates [15]. The main idea is to keep the trained model (e.g. the whole RMI hierarchy or a single model from ASLM or Hybrid-O) as is, to mark the deletions with a flag, and to correct the drift in position estimates caused by the insertions. This correction can be done in two ways. The first way is to learn the update distribution, say with a CDF model G, so that the position of a key x in the index updated after n_i insertions is approximated by n F(x) + n_i G(x), where F is the CDF model trained on D. The second way is to estimate the drift by considering the (known) drift of some reference points. Specifically, the data array is divided into blocks of fixed size 2^b, and the first key inside each block is chosen as the reference point. For each reference point, we store an integer counting how many keys have been inserted in the block that precedes it. Therefore, computing the number of keys inserted before a reference point amounts to computing a prefix sum of the counters of all the blocks that precede the reference point. After the insertions, the drift of the ith reference point is computed by subtracting its true position (equal to i·2^b plus the number of keys inserted before it) from the position estimated by F for the reference point. At query time, the position for a key x is computed as n F(x) − Δ(x), where Δ(x) is the correction computed by interpolating the two nearest reference point drifts at the left and at the right of x.
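A sketch of the second strategy (reference points) follows, with invented names and a deliberately naive prefix sum; position_model stands for the frozen estimate x ↦ n·F(x), and the mapping of an insertion to its block is assumed to be done by the caller.

```python
import bisect

class ReferencePointCorrector:
    """Blocks of size 2**b; the first key of each block is a reference point and
    a per-block counter records the insertions that landed in that block."""

    def __init__(self, ref_keys, b, position_model):
        self.ref_keys = ref_keys                 # first key of each block, in key order
        self.block_size = 2 ** b
        self.model = position_model              # x -> n * F(x), kept frozen
        self.counters = [0] * len(ref_keys)      # insertions landed in each block

    def record_insertion(self, block_id):
        self.counters[block_id] += 1

    def _drift(self, i):
        # True position of the i-th reference point: i * 2**b plus the number of
        # keys inserted before it (a prefix sum, computed naively here).
        true_pos = i * self.block_size + sum(self.counters[:i])
        return self.model(self.ref_keys[i]) - true_pos

    def position(self, x):
        """n*F(x) - Delta(x), with Delta(x) interpolated between the drifts of
        the nearest reference points at the left and at the right of x."""
        j = min(bisect.bisect_right(self.ref_keys, x), len(self.ref_keys) - 1)
        i = max(0, j - 1)
        ki, kj = self.ref_keys[i], self.ref_keys[j]
        t = 0.0 if kj == ki else min(1.0, max(0.0, (x - ki) / (kj - ki)))
        delta = (1 - t) * self._drift(i) + t * self._drift(j)
        return self.model(x) - delta
```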

2.3 The FITing-tree

The FITing-tree [14] has introduced a more principled way of learning the data distribution D = {(x, rank(x)) : x ∈ A}. This problem is reduced to the problem of computing a Piecewise Linear Approximation (PLA) of D which guarantees a user-given maximum error ε ≥ 1 and consists of the smallest number of linear models. The first condition guarantees an upper bound on the search time, the second condition minimises the space occupancy of the learned index. Formally speaking, the smallest PLA for a subarray A[0, b] which incurs error ε has size (i.e. number of linear models)

    T[b] = 1 + min_{a ∈ C_ε(b)} T[a − 1],   with T[0] = 1,    (2)

where C_ε(b) is the set of starting points of linear models having maximum error ε and ending in A[b]. In other words, the minimum-size PLA for A[0, b] is found by considering the minimum-size PLA for A[0, a − 1] along with the single linear model which covers the remaining subarray A[a, b], where a is varied in such a way that the linear model covering A[a, b] has a maximum error of ε.


Equation 2 leads to a dynamic programming algorithm which, however, has a prohibitive O(n^3) running time. For this reason, the authors of [14] proposed a greedy linear-time algorithm that takes the first key of A as the starting point and then attempts to maximise the length of the linear model whose error is smaller than ε. This maximisation is done by keeping a cone defined by the origin (A[0], 0), initial high slope +∞, and initial low slope 0. Each time the length of the linear model is extended with a new key, the cone either narrows (decreasing one or both slopes) or stays the same. When a key A[i] falls outside the cone, the process is stopped and a linear model is created for the subarray A[0, i − 1], taking as slope any value in the current slope range and using (A[i], i) as the origin of a new cone. Eventually, the whole dataset is processed and the final result is a partition of A into m̃ variable-sized ranges that can be approximated with linear models guaranteeing maximum error ε.

In order to dig into the technical details of the FITing-tree, let us define a segment as a tuple containing the first key covered by a linear model (which we call the key of the segment), the parameters of the linear model (hence, its slope and intercept), and a pointer to the subarray of A containing the range of keys covered by that segment. The FITing-tree is then constructed by indexing via a B+-tree the (first) key of each one of the m̃ segments produced by the greedy algorithm. This way, the B+-tree occupies Θ(m̃/B) disk pages and has depth Θ(log_B m̃).

At query time, the B+-tree is first searched to find the segment s_j that the query key x belongs to. Then, the segment is used to predict the approximate position pos of the query key x inside the pointed data array A_j. Finally, a binary search is performed in A_j[pos − ε, pos + ε] to find the correct position of x in the whole array A, as depicted in Fig. 6. Finding the segment s_j costs O(log_B m̃) I/Os, while the final binary search step costs O(log(ε/B)) I/Os, because s_j is guaranteed to incur an error ε in estimating the position of x in A_j. The total cost of a query is thus O(log_B m̃ + log(ε/B)) I/Os.

For the insertions, a FITing-tree adopts one of the following two strategies: in-place inserts or delta inserts. A FITing-tree configured for in-place inserts avoids invalidating a segment by introducing a positive integer parameter β called "insert budget", by constructing a FITing-tree with maximum error ε + β, and by padding A_j with β empty slots at the beginning and at the end of it.
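The greedy cone-narrowing segmentation described above can be sketched in a few lines of Python. This is only an illustration under our own conventions (0-based positions, names not taken from [14]), and it ignores duplicate keys:

```python
def greedy_pla(keys, eps):
    """Greedily partition the sorted `keys` into linear models with maximum
    vertical error `eps`, following the cone-narrowing idea of the FITing-tree.
    Returns a list of segments (first_key, slope, intercept)."""
    segments = []
    i = 0
    while i < len(keys):
        origin_key, origin_pos = keys[i], i
        lo, hi = 0.0, float("inf")        # current slope range (the "cone")
        j = i + 1
        while j < len(keys):
            dx = keys[j] - origin_key
            if dx > 0:
                # Slopes that keep the prediction within eps of position j.
                lo_j = (j - origin_pos - eps) / dx
                hi_j = (j - origin_pos + eps) / dx
                if max(lo, lo_j) > min(hi, hi_j):
                    break                  # key j falls outside the cone
                lo, hi = max(lo, lo_j), min(hi, hi_j)
            j += 1
        slope = (lo + hi) / 2 if hi != float("inf") else lo
        intercept = origin_pos - slope * origin_key
        segments.append((origin_key, slope, intercept))
        i = j                              # restart the cone from the next key
    return segments
```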

Fig. 6 A FITing-tree partitions A into subarrays A_j, each of which is indexed by a linear model with maximum error ε. The segments s_j (tuples containing the first key covered by the linear model and the parameters of the model) are indexed by a B+-tree which stores in its leaf level the first keys of all those m̃ segments


Inserting a new key x into A_j is implemented by locating the insertion position of x and by making room for x by shifting the existing keys towards the start or the end of A_j, depending on which is closer. However, once all empty spaces in A_j are filled, the greedy algorithm must reprocess the keys in A_j. This step either allocates a larger array or introduces new segments if one segment is not enough to guarantee an error ε + β. In this latter case, the old segment is deleted and the new ones are added to the B+-tree. The disadvantage of in-place inserts is that the size of A_j can grow arbitrarily large if the input dataset shows long linear trends, thus making the cost of shifting keys in A_j high.

A FITing-tree configured for delta inserts allocates for the j-th segment a fixed-size sorted buffer. Once the buffer is full, it is merged with A_j and re-processed by the greedy algorithm. As before, this can either produce a single segment, with a new empty buffer, or multiple segments. The latter case requires the modification of the B+-tree. The time complexity of an insertion is now linear in the size of the fixed-size buffer rather than the size of A_j, except for when the buffer is full and must be merged with A_j. Deletions in the FITing-tree are not discussed in [14].

2.4 The PGM-index

The Piecewise Geometric Model index (PGM-index) [12, 36] improves the FITing-tree in several respects: (i) it employs a linear-time PLA construction algorithm which finds the minimum number of segments covering D with an error of at most ε; (ii) it builds upon a recursive structure that fully exploits the space/time-efficient routing of segments, thus resulting in a structure much more succinct than B+-trees; (iii) it further reduces its space occupancy by means of novel techniques that compress the linear models of the optimal-sized PLA; and, finally, (iv) its structure is flexible enough that it can adapt not only to the distribution of the input keys but also to the distribution of the queried keys. The rest of this section will dig into some more details about these features.

In Sect. 2.3 we have seen how the problem of computing PLAs can be solved either optimally via dynamic programming in O(n^3) time, or heuristically in O(n) time but giving up optimality. Interestingly enough, this problem has been extensively studied in computational geometry, and it admits streaming algorithms which produce the minimal number of segments m* in only O(n) time [31]. The PGM-index deploys one of these optimal algorithms and stores at its lowest level the minimum-size sequence of m* segments as triples, each consisting of the first key covered by the segment, its slope and intercept. The improvement induced by storing m* segments, instead of m̃, is significant and up to 63% [12].

The second feature of the PGM-index is its recursive structure. Its construction starts with the whole input array A that is turned into the set of two-dimensional points D = {(x, rank(x)) : x ∈ A}, as we commented in the previous pages. D is processed by the optimal algorithm of [31] which produces the minimum-size sequence of m* segments covering the keys in A with an error of at most ε.


[Figure 7 depicts a PGM-index with ε = 1 built over the 34-key array A = 2 12 15 18 23 24 29 31 34 36 38 48 55 59 60 71 73 74 76 88 95 102 115 122 123 124 158 159 161 164 165 187 189 190: levels[0] contains the root segment (2, sl00, ic00); levels[1] contains the segments with first keys 2, 31, 102, 187; levels[2] contains the segments with first keys 2, 23, 31, 48, 71, 102, 158, 187.]

Fig. 7 Each segment in a PGM-index routes the queried key to one of the segments of the level below by computing a position that is at most ε away from the correct one. In this picture ε = 1, the cyan nodes are the ones traversed by the search for the key x = 76, and the brackets specify the range where the binary search is performed: actually, the binary search is executed over the first key of each segment in the range, which is stored as the first component of the triple denoting a segment

Then, the first key covered by each segment is extracted and used to form a new set of m* keys, over which the minimum-size PLA construction algorithm is applied again. Recursion proceeds until only one segment is obtained. Each level of the recursion corresponds to a level of the PGM-index, starting from the bottom, in which there is a one-to-one correspondence between segments and nodes of the PGM-index. In a way, this approach constructs a sort of multiway search tree but with three main advantages with respect to the B+-tree constructed by the FITing-tree: (i) the nodes of the PGM-index have variable fan-out driven by the (typically large) number of keys covered by segments, so that the height and space occupancy of the index is very small in practice; (ii) the segment associated with a node plays the role of a constant-space and constant-time ε-approximate routing table for the various queries to be supported; (iii) the search in each node corrects the ε-approximate position returned by that routing table via a binary search (see next), and thus it has a time cost that depends logarithmically on ε.

At query time, the PGM-index uses the current segment (initially the root) to estimate the position of the searched key among the keys of the next lower level. The real position is then found by a binary search in a range of size 2ε centred around the estimated position. Given that every key on the next lower level is the first key covered by a segment on that level, the binary search identifies the next segment to query. The process continues until the last level is reached and the position in A of the queried key is determined.


Example: search in a PGM-index. Consider the PGM-index with ε = 1 in Fig. 7. A query for the key x = 76 starts from the root node, which stores the segment levels[0][0] = (2, sl00, ic00), where 2 is its first covered key, sl00 is the slope of the segment and ic00 is its intercept.

Level 1. The current segment allows to compute the approximate position of the searched key x = 76 as x · sl00 + ic00 = 1. Since the current segment has a maximum error of ε = 1, a binary search over the (first covered) keys in levels[1][1 − ε, 1 + ε] = [2, 31, 102] suffices to determine that the correct position of x is between 31 and 102, so that the next segment responsible for x is levels[1][1] = (31, sl11, ic11).

Level 2. The algorithm above is repeated on the current segment, by computing the approximate position x · sl11 + ic11 = 3, which is then corrected by performing a binary search in levels[2][3 − ε, 3 + ε] = [31, 48, 71]. This finds that the correct position of x is after 71, so that the next segment responsible for x is levels[2][4] = (71, sl24, ic24).

Level 3. The last level consists of the array A, and the current segment allows to compute the approximate position x · sl24 + ic24 = 17, which is then corrected by performing a binary search in A[17 − ε, 17 + ε] = [73, 74, 76]. This search is successful and finds that x occurs in position 18.
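The search procedure just exemplified can be written compactly as follows. This is a sketch under our own conventions: `levels` is a list of lists of (first_key, slope, intercept) triples as in Fig. 7 (with a single root segment in `levels[0]`), `A` is the sorted data array, positions are 0-based, and corner cases of the binary search are ignored.

```python
import bisect

def pgm_query(levels, A, x, eps):
    """Top-down search in a PGM-index (sketch): at each level, the current
    segment predicts a position that is at most eps away from the correct one,
    and a binary search in a window of 2*eps + 1 keys fixes it."""
    pos = 0                                     # index of the root segment
    for depth in range(len(levels)):
        first_key, slope, intercept = levels[depth][pos]
        # Keys searched at the next step: first keys of the level below, or A.
        if depth == len(levels) - 1:
            next_keys = A
        else:
            next_keys = [seg[0] for seg in levels[depth + 1]]
        approx = int(round(slope * x + intercept))
        lo = max(approx - eps, 0)
        hi = min(approx + eps, len(next_keys) - 1)
        # Rightmost key <= x inside the eps-window (the predecessor of x).
        pos = bisect.bisect_right(next_keys, x, lo, hi + 1) - 1
    return pos                                  # position of x's predecessor in A
```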

It can be shown that the construction of the PGM-index reduces the number of segments at each level by a multiplicative factor larger than 2ε, so that the final data structure has O(log_ε m*) levels.

Theorem 1 ([12]) Let A be an ordered array of n keys from a universe U, and ε ≥ 1 be a fixed integer parameter. The PGM-index with parameter ε indexes the array A taking Θ(m*) space and answers rank, membership and predecessor queries in O(log m* + log ε) time and O((log_ε m*) log(ε/B)) I/Os, where m* is the minimum number of segments covering A with a maximum error of ε, and B is the block size of the external memory model. Range queries are answered in an additional O(K) time and O(K/B) I/Os, where K is the number of keys satisfying the range query.

The PGM-index can match the optimal worst-case query complexity of the B+-tree by choosing ε = Θ(B), thus resulting in O(log_B m*) = O(log_B n). In practice, the fan-out of every node of the PGM-index is much larger than 2ε, and the value of m* is orders of magnitude smaller than n, so that its query time and space occupancy turn out to be very small for real-world datasets. Therefore, the PGM-index can be considered the first learned drop-in replacement for traditional indexes to date.

The PGM-index can also be generalised to include nonlinear models. Indeed, there exists an O(n log n) time greedy algorithm, described in [36, §2.2], that still guarantees a maximum error ε when using nonlinear models.


Preliminary experiments with shallow neural networks showed a reduction of the number of models in the lowest level of a PGM-index, but this did not reduce the overall space occupancy, suggesting that trading off model complexity with space occupancy is an open issue that deserves further research.

Concerning inserts, if new keys are appended to the end of the array A while maintaining the sorted order (as it occurs in time series), the update of the PGM-index is easy and efficient. In fact, the last segment can be updated in O(1) amortised time and, if the new key k can be covered by this last segment while preserving the ε guarantee, then the insertion process stops. Otherwise, a new segment with key k is created and the insertion of k is repeated recursively in the last segment of the level above. The recursion stops when a segment at any level covers k within the ε guarantee, or when the root segment is reached, possibly determining its split and, thus, the creation of a new level/segment above. Since the updates at each level take constant amortised I/Os, the overall amortised I/O complexity of this insertion algorithm is O(log_ε m*).

For general inserts, the PGM-index defines b = Θ(log n) static PGM-indexes built over sets S_0, . . . , S_b of keys which are either empty or have size 2^0, 2^1, . . . , 2^b. Each insert of a key k finds the first set S_i which is empty and builds a new PGM-index over the set S_0 ∪ · · · ∪ S_{i−1} ∪ {k}. This union can be computed in linear time because we can assume that the sets S_j are sorted, and thus a simple merging creates the new sorted set, which consists of 2^i keys (the sets S_j preceding S_i are full). The new merged set can be used as S_i, and the previous sets can be emptied. This algorithm can be shown to take O(log n) amortised time, while membership and predecessor queries take O(log n (log m* + log ε)) time because every search must be executed on all the b = Θ(log n) static PGM-indexes. The analysis in the external memory model (which we omit here) completes the proof of the following result.

Theorem 2 ([12]) Under the same assumption of Theorem 1, the Dynamic PGM-index with parameter ε indexes the dynamic array A and answers membership and predecessor queries in O(log n (log m* + log ε)) time, and insertions and deletions in O(log n) amortised time. In the external memory model with block size B, membership and predecessor queries take O((log_B n)(log_ε m*)) I/Os, and insertions and deletions take O(log_B n) amortised I/Os.

For the results about the compression of the parameters of the segments (intercepts and slopes), and the adaptability of the PGM-index to the query distribution, we refer the reader to [12] and mention that an implementation of the PGM-index is publicly available at https://pgm.di.unipi.it.
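The general-insert strategy just described is the classic logarithmic method. The following Python sketch illustrates it; `build_index` stands in for the static PGM-index construction (a hypothetical callable returning an object with a `search` method), and the names are ours.

```python
from heapq import merge

def logarithmic_insert(runs, key, build_index):
    """Insert `key` using the logarithmic method (sketch). `runs[i]` is either
    None or a pair (sorted_keys, static_index) holding exactly 2**i keys."""
    carry = [key]                                    # keys waiting for an empty slot
    for i in range(len(runs)):
        if runs[i] is None:
            runs[i] = (carry, build_index(carry))    # carry has exactly 2**i keys
            return
        # Slot i is full: merge its 2**i keys into the carry and empty it.
        carry = list(merge(carry, runs[i][0]))
        runs[i] = None
    runs.append((carry, build_index(carry)))         # all slots were full

def search_all(runs, key):
    # A query must probe every non-empty static index, hence the extra
    # O(log n) factor in the query bounds of Theorem 2.
    return [run[1].search(key) for run in runs if run is not None]
```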

2.5 Learned Multidimensional and Secondary Indexes

All learned indexes we discussed so far can easily be extended to handle multidimensional data such as geographic coordinates. In fact, multidimensional keys can
be mapped onto one single dimension via space-filling curves that preserve spatial proximity, and then any one-dimensional index structure can be used over the projected points [13]. One such mapping, called Z-order or Morton order, simply concatenates the result of interleaving the binary representations of the coordinate values. For example, the three-dimensional point (4, 7, 1) = (100, 111, 001) is mapped to 011 010 110 = 214. Experiments on the combination of Z-order and RMI showed that, with respect to an R-tree, space is reduced by up to 97% and the query time is improved by up to 2.5× [38].

In databases, it is common to build separate indexes on different columns of a table. In such cases, at most one index can have the same ordering of the records in the table, which is typically the index on the primary key of the table. The other indexes, called secondary and unclustered indexes, cannot. Even though we described learned indexes assuming the data being sorted by key, it is still possible to use them as secondary indexes. In fact, we can create (value, pointer) pairs where each value in the indexed column is associated with a pointer to the original record (or a list of pointers in the case of non-unique values). A pointer can be a record identifier or the primary key of the record. Then, we sort the pairs by value and create a learned secondary index on them. At query time, the learned secondary index locates the (value, pointer) pair associated with the sought value and outputs the record pointed to by the pointer. Note that the space overhead of using this kind of indirection for secondary indexes occurs also if one uses traditional indexes, but learned indexes have the additional benefit of being more succinct in space and possibly faster in query time.

An alternative approach to building secondary indexes consists of capturing the correlation between the target column T and a host column H for which there already exists an index. The Tiered Regression Search Tree (TRS-tree) [39] learns the mapping H = h(T) by recursively dividing T's value range into a number of equal-sized subranges until every pair (t, h) of values from T and H covered by the corresponding subrange can be well estimated using linear regression. Specifically, the construction starts from the root node being the current node and associates with each node a range of T, which for the root is the whole range of T's values. The algorithm retrieves from the table all pairs (t, h) such that t is within the range associated with the current node and trains a linear regression model on such pairs. Then, the pairs on which the linear model makes an error greater than a user-given parameter ε are inserted into an outlier buffer. If the size of the outlier buffer exceeds a fixed amount, the current range is divided into a fixed number p of equal-sized subranges, the current node becomes an internal (routing) node, and the construction process is repeated recursively on the p children of the current node, one for each subrange. Otherwise, the node becomes a leaf node and it stores the linear regression model parameters and the outlier buffer, which is implemented as a hash table mapping from t to the corresponding record identifier (either a primary key or a tuple location). At query time, a predicate P = lb ≤ T ≤ ub is answered via a breadth-first search that starts from the root node and visits all children whose range overlaps with P.
At a leaf node with range r, the endpoints of the intersection between P and r are mapped to a range on H via the linear model stored in the leaf, and the outliers
are collected from the buffer. Once the breadth-first search is completed, the ranges on H are used to query the existing index on the host column, while the outliers are fetched directly by visiting the corresponding record identifiers. Since the TRS-tree may return false positives, the fetched tuples are compared against P and possibly filtered out from the result of the query.
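Going back to the Z-order mapping described at the beginning of this section, the bit interleaving is easy to implement. The sketch below uses our own function name and a bit order chosen to reproduce the (4, 7, 1) → 214 example above:

```python
def z_order(point, bits):
    """Map a multidimensional point to its Z-order (Morton) code by
    interleaving the bits of its coordinates, most significant bit first."""
    code = 0
    for b in reversed(range(bits)):        # from the most significant bit down
        for coord in reversed(point):      # interleave the last coordinate first
            code = (code << 1) | ((coord >> b) & 1)
    return code

assert z_order((4, 7, 1), 3) == 0b011010110 == 214
```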

2.6 Comparison Among Learned Indexes

A comparison among the learned indexes discussed so far is given in Table 2, where we use the following terminology and notation. First of all, the complexity bounds are given in terms of the number n of keys in the input dataset, the number ℓ of levels in the DAG structure of RMI, the (sub-optimal) number m̃ of segments which cover the keys in the input dataset with error ε, and the optimal number m* of segments which cover the keys in the input dataset with error ε, as computed by the algorithm of [31].

The "Structure" column groups learned indexes according to how they arrange the (learned and non-learned) models they are composed of. The most common kind of structure is a tree, which partitions the input array A so that each model specialises on a specific subarray of A that gets smaller and smaller as the depth of the model in the structure grows. In a tree-structured learned index, a query always follows a root-to-leaf path, where the next model is either chosen via a fixed number of key comparisons (as in the FITing-tree or the PGM-index) or via a model prediction (as in RMI and ALEX). Differently, we say that the structure is a Directed Acyclic Graph (DAG) when a model can have more than one parent model: this is the case of an RMI with three or more stages. Lastly, we mention that some learned indexes have a flat structure, in which there is only one level of learned models and a comparison-based search algorithm, such as linear or binary search, that corrects the prediction error incurred by the learned models. A flat structure is convenient only when the models can be kept in the faster levels of the memory hierarchy; otherwise, it is better to keep the models in a B+-tree (as the FITing-tree does) or to recursively build a smaller/faster routing structure composed of levels of learned models (as proposed by the PGM-index).

The "Adaptive structure" column shows to what extent a learned index automatically allocates the proper kind and number of models, thus tuning itself to the best configuration for the input dataset. This feature is especially important in scenarios where (i) the software engineer does not have a deep understanding of the learned index and its possible models, for example, because the index is part of a larger software system; (ii) the tuning process is not feasible, such as in resource/time-constrained devices/applications; or (iii) the data distribution changes over time, such as when new batches of data trigger a rebuild of the index to avoid performance degradation.

The "Bounds the error" column shows which learned indexes allow the user to control the maximum error incurred by the learned models. This, in turn, guarantees an upper bound on the search latency.

Table 2 Comparison among known learned indexes. The integer ε ≥ 1 denotes the bound on the prediction error, m̃ represents the number of linear models computed by the greedy algorithms in [14, 24], while m* ≤ m̃ represents the minimum number of linear models computed by the optimal algorithm in [31]

Name             | Structure | Constr. time | Query time  | Query I/Os    | Adaptive structure | Bounds the error | Optimal space | Insertions/deletions
RMI [23]         | DAG       | O(n)^a       | ?           | ?             | –                  | –                | –             | –
ASLM [25]        | Flat      | O(n)^a       | ?           | ?             | ✓                  | –                | –             | ✓
Hybrid-O [33]    | Flat      | O(n)         | ?           | ?             | ✓                  | –                | –             | ✓
ALEX [11]        | Tree-like | O(n log n)   | ?           | ?             | ✓                  | –                | –             | ✓
AIDEL [24]       | Flat      | O(n^2)       | O(log m̃)^c  | O(log(m̃/B))  | ✓                  | ✓                | –             | ◐
FITing-tree [14] | Tree-like | O(n)         | O(log m̃)^c  | O(log_B m̃)^b | ✓                  | ✓                | –             | ◐
PGM-index [12]   | Tree-like | O(n)         | O(log m*)^c  | O(log_ε m*)^b | ✓                  | ✓                | ✓             | ✓

✓ = provides property, ◐ = partially provides property, – = does not provide property
^a assuming models with linear-time training; ^b assuming ε = Θ(B); ^c plus O(log ε)
A learned index that does not have such a property can be susceptible to unpredictable query times, especially when the data is too large to be stored in the faster levels of the memory hierarchy. The last column shows which learned indexes support insertions and deletions without a full reconstruction of the data structure.

3 Learned Data Structures for Exact Membership

There are applications which do not demand sophisticated operations like rank, predecessor or range queries. Dropping them and focusing only on membership queries can substantially simplify the design of data structures and improve the time complexity of the queries, as we will see shortly.

The dictionary problem asks to store a set S of n keys drawn from a universe U in order to support membership queries, insertions and deletions of keys. A key in S can also be associated with auxiliary data, in which case we also aim to support the lookup operation which, given a query x, returns the data associated with x if x ∈ S, or nil otherwise. The historical solution to the dictionary problem is provided by hash tables. Hash tables are composed of an array T of size m ≥ n, a hash function h : U → {0, . . . , m − 1} that assigns each key in the universe to a location in T, and a strategy to handle the collisions that occur when more than one key maps to the same location. Some examples of these strategies are (see [27] for a complete discussion): chaining, where T[h(x)] is the head of a linked list containing elements having the same hash value h(x), thus allowing O(1 + n/m) expected time membership queries; or Cuckoo hashing, which uses two hash functions h_1, h_2 and places an element x in either T[h_1(x)] or T[h_2(x)], thus allowing O(1) worst-case time membership queries.

In Sect. 2, we have seen how the indexing problem boils down to learning the CDF model F(x) of the input set S of keys. It turns out that the same function F can be used to implement h(x) as ⌊m F(x)⌋. If the model F perfectly learns the empirical CDF of S and m is sufficiently large, then no conflicts would occur and thus the membership query could be solved in constant worst-case time. In general, the final performance depends on which hash table architecture is used (e.g. table size, collision resolution), how well F approximates the true empirical CDF of S, and how difficult F is to compute. For example, using several datasets and a table of size m = n, RMI with 100 K linear models in the second stage reduced the number of conflicts by about 45% with respect to a MurmurHash function [23], but at the cost of increased time complexity. In fact, for 8-byte keys with 12-byte auxiliary data, and a hash table with chaining having size m varied from 75% to 125% of n, the lookup times of the learned hash function with respect to MurmurHash increased on average by 40%.


Very recently, Pavo [40] suggested an alternative approach for string queries. Pavo implements h via a rather complicated graph structure with recurrent neural networks as nodes. Specifically, the input strings are divided into bigrams and fed first to a disperse step, which is an RMI-like hierarchy of experts whose purpose is to split the dataset into subsets that can be learned more easily. Each model in this disperse step is a recurrent neural network trained to "predict the MurmurHash" of a given key (we remark that MurmurHash is a known function which can be implemented in a few dozen lines of C code). Then, a mapping step uses an unsupervised approach to evenly distribute the inputs of the previous step over a range of T. Experimentally, Pavo was shown to reduce the average chain lengths by up to 50%, but it increased the cost of computing the hash value by 3–4× with respect to MurmurHash.

Discussion. At the moment of writing, it seems unlikely that learned hash tables can compete with traditional and well-established hash tables. In fact, in contrast to traditional hash functions, the learned approaches described above increase the construction time due to the training step, the number of memory accesses needed to retrieve the parameters of the model, and the time to compute the hash due to the matrix multiplications required for model inference. As a matter of fact, if we are willing to spend more space (and time) on a learned hash function only to reduce the number of collisions, then we might as well increase the size of the hash table itself and keep using a fast traditional hash function. More importantly, a CDF model F used as a hash function of the form h(x) = ⌊m F(x)⌋ does not distribute keys uniformly in the hash table, and this could be exploited by an adversary to create arbitrarily long collision chains.
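To make the idea concrete, here is a minimal sketch of a hash table with chaining whose hash function is h(x) = ⌊m·F(x)⌋. For simplicity, F is an empirical CDF computed from a sample of the keys rather than the RMI used in [23]; all names are ours.

```python
import bisect

class LearnedChainedHashTable:
    """Hash table with chaining whose hash function is h(x) = floor(m * F(x)),
    where F is a CDF model of the keys (here a simple empirical CDF built from
    a sorted sample; a sketch, not the architecture of [23])."""

    def __init__(self, sample, m):
        self.sample = sorted(sample)
        self.m = m
        self.table = [[] for _ in range(m)]

    def _cdf(self, key):
        return bisect.bisect_right(self.sample, key) / len(self.sample)

    def _hash(self, key):
        return min(int(self.m * self._cdf(key)), self.m - 1)

    def insert(self, key, value):
        self.table[self._hash(key)].append((key, value))

    def lookup(self, key):
        for k, v in self.table[self._hash(key)]:
            if k == key:
                return v
        return None
```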

4 Learned Data Structures for Approximate Membership

There are situations in which the universe U of keys is very large, and thus the keys are long enough to take a lot of space when stored in a dictionary structure like the ones described in Sect. 3. As an example, a web crawler could use a hash table to filter out the web pages it has already visited, but the long URLs would quickly consume all the available memory. In these cases, we can slightly relax the definition of membership query, admit some errors in the returned answers, and in exchange use much more space-efficient structures.


Fig. 8 The membership query q in a Bloom filter with two elements x_1 and x_2 is answered negatively because one of the three hash functions maps q to a cell containing 0


The approximate membership problem asks to store a set S of n keys drawn from a universe U and to support membership queries by admitting some (controlled) errors in the answer. More precisely, for a query on a key x ∈ S, it is reported that x ∈ S (no false negatives). But for a query on a key x ∉ S, it is reported erroneously that x ∈ S (a false positive) with probability at most ε.

The most famous and classic data structure for this problem is the Bloom filter [5]. A Bloom filter consists of an array of m bits, initially all set to 0, and k independent hash functions h_1, . . . , h_k with codomain {1, . . . , m}. For each x ∈ S, the bits at positions h_i(x) are set to 1 for 1 ≤ i ≤ k. A query for y is answered affirmatively if all bits at h_i(y) are 1 and negatively otherwise, as shown in Fig. 8. The false positive probability of a Bloom filter is

    (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.    (3)

Given m and n, the optimal number of hash functions is k = (m/n) ln 2, which gives a false positive probability of about (0.6185)^{m/n}.

Once again, ML has given us a different perspective on this classic algorithmic problem. Indeed, the approximate membership problem can be framed as a supervised binary classification task in which the dataset D = {(x, 1) : x ∈ S} ∪ {(x, 0) : x ∈ U \ S} is given, and we look for a model g : U → {0, 1} mapping keys from the universe to a boolean indicating whether or not a key is in S. Note that the set U \ S can be huge, and in practice one would choose a subset of it.
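For reference, before moving to the learned variants, a classic Bloom filter can be sketched as follows (the salted SHA-256 hashing is just a convenient stand-in for k independent hash functions):

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter (sketch): m bits and k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)     # one byte per bit, for simplicity

    def _hashes(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos] = 1

    def query(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._hashes(key))
```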

4.1 Learned Bloom Filters The first learned approximate membership data structure, called Learned Bloom Filter (LBF) [23], uses a neural network f : U → [0, 1] with sigmoid activations in the output layer and is trained to minimise the log loss function

30

P. Ferragina and G. Vinciguerra

( ) ≥

Learned oracle

Bloom filter

( )
1 and fix p̂_j / p̂_{j+1} = c, k_j − k_{j+1} = 1 for j = 1, 2, . . . , r − 1. This ensures that, as j diminishes, k_j increases linearly and p̂_j grows exponentially fast. This leaves us with only two parameters, k_max and c. Analogously to Theorem 3 for the LBF, it is possible to show that, under the assumption that the test set T and the future queries have the same distribution, the empirical false positive rate of an Ada-LBF is close to the real E(FPR).

Theorem 4 ([10]) Consider an adaptive learned Bloom filter built on a set S, consisting of r regions. Consider a test set T and a query set Q, where T and Q are both determined from samples according to a distribution D. Then, Σ_{j=1}^{r} |p̂_j − p_j| converges to 0 in probability as |T| → ∞.

The following result shows under which conditions the Ada-LBF improves the false positive rate of the corresponding (i.e. same array size and same oracle) LBF.


Theorem 5 ([10]) Consider a learned Bloom filter f : U → [0, 1] with threshold τ and a backup Bloom filter of size m with k hash functions. Consider an adaptive learned Bloom filter with the same oracle f, m bits, r regions, τ_{r−1} = τ, γ as defined in Eq. 5, and p_j / p_{j+1} ≥ c > 1 for j = 1, 2, . . . , r − 1. If there exists λ > 0 such that cγ ≥ 1 + λ holds, n_{j+1} − n_j > 0 for j = 1, 2, . . . , r − 1, and r ≤ 2k is large enough, then the adaptive learned Bloom filter has a smaller false positive rate than the learned Bloom filter.

The authors of [10] also discuss a variant called Disjoint Ada-LBF which hashes the keys falling in a region to a dedicated Bloom filter rather than to the shared Bloom filter. Experimentally, on a task to identify n ≈ 80 K malicious URLs, Ada-LBF with a random forest oracle (using input features like hostname length, path length, etc.) reduced the false positive rate by 81% compared to LBF and Sandwiched LBF using the same oracle and size (500 Kb). In turn, Disjoint Ada-LBF reduced the false positive rate by 84%. To achieve a false positive rate of ≈0.35%, Ada-BF and Disjoint Ada-BF used 300 Kb (−40%) with respect to the 500 Kb of LBF and Sandwiched LBF.

Discussion. The learned approaches to the approximate membership problem seem promising. A model that classifies whether an item belongs to the set by exploiting the item's features can reduce the overall size taken by the filter. However, before replacing Bloom filters with any of the above learned variants, it is important to understand that the guarantees offered by the latter differ significantly from the former. For example, the false positive probability of Bloom filters holds for any possible query set, while the false positive rate of their learned counterparts is highly dependent on the chosen test set. Consequently, unless the queries have the same distribution as the test set, the actual false positives of the LBF may be substantially larger than expected. For example, if the learned filter is used to prevent accesses to a slow cache, then an adversary can exploit its weakness to perform a denial-of-service attack.

5 Learned Data Structures for Frequency Estimation

In the previous sections, we focused on the storage and the retrieval of keys. There are situations in which we only want to keep and retrieve some statistics of the input data, which may be too large to store and to analyse in full efficiently. In these cases, we must content ourselves with synopses of the data, which typically use little space and are efficient in providing very accurate approximate answers [8]. A famous problem in this setting is the following one.


Fig. 11 In a Count-Min (CM) sketch, the hash function associated with a row maps the item i from the update (i, c) to one of the w counters in the row, which is then incremented by c. The count of an item x is estimated by taking the minimum over all the d selected counters

Consider an infinite stream of updates to an array A = [a_1, . . . , a_n] of n counters, initially set to zero, where the t-th update is a pair (i_t, c_t) indicating that the i_t-th element of A increased by the value c_t. The frequency estimation problem asks to build a data structure that summarises the stream and allows to estimate the current value of an element a_i, also called the "count" or "frequency" of the i-th element of A.

The most widely used data structure for this problem is the Count-Min (CM) sketch [8]. The CM sketch with parameters (ε, δ) consists of a two-dimensional array C of counters of width w and depth d, initially set to zero, and d hash functions h_1, . . . , h_d : {1, . . . , n} → {1, . . . , w}. When an update (i, c) arrives in the stream, the CM sketch is updated by changing C[j, h_j(i)] ← C[j, h_j(i)] + c for each row 1 ≤ j ≤ d. At query time, the count for the i-th element of A is estimated as â_i = min_{1≤j≤d} C[j, h_j(i)], as depicted in Fig. 11. Of course, two items hashing to the same bucket affect each other's estimate. But the approximation guarantee is that if w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, the estimate â_i obeys a_i ≤ â_i (one-sided error); and, with probability at least 1 − δ, â_i ≤ a_i + ε‖A‖_1. Here, ‖A‖_1 is the sum of the absolute values of A's elements.

If items with large counts (heavy hitters) collide with other items, then some estimates provided by the CM sketch may have large errors. Even though it has been shown that skewed distributions improve the estimations of the CM sketch [8], treating the items with large counts separately from the CM sketch can increase the overall estimation accuracy. For the rest of the discussion, we assume that only positive updates c > 0 are possible, and we discuss ML-based solutions to the frequency estimation problem.
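A minimal Python sketch of the CM update and estimate operations follows; the salted SHA-256 hashing again stands in for the d independent hash functions:

```python
import hashlib
import math

class CountMinSketch:
    """Count-Min sketch (sketch): d rows of w counters, one hash per row."""

    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.C = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, item):
        digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, item, c=1):
        for j in range(self.d):
            self.C[j][self._h(j, item)] += c

    def estimate(self, item):
        # One-sided error: the estimate never underestimates the true count.
        return min(self.C[j][self._h(j, item)] for j in range(self.d))
```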


Fig. 12 A Learned CM sketch uses an ML-oracle to classify items with large counts, called “heavy hitters”, and reserves for them unique exact counters


5.1 ML-Oracle Classifying Heavy Hitters

The Learned CM sketch [18] trains beforehand an oracle f to determine whether an item is a heavy hitter or not. At query time, all items classified by f as heavy hitters are assigned to unique buckets storing their exact count. All the other items are forwarded to a CM sketch, as shown in the block diagram in Fig. 12. Note that f should not be regarded as a lookup table, that is, it does not learn the identity of heavy hitters but the properties that allow to identify them. As an example, when applied to network traffic, an oracle from [18] based on recurrent neural networks, fed bit-by-bit with the packet source/destination IP and trained with the squared loss to predict the packet count, was eventually able to group flows with similar IP prefixes together, thus reflecting the hierarchical nature of Internet addresses.

The following theorem shows that if f has perfect accuracy and the frequency of the items follows a Zipfian distribution a_i = 1/i, then the error of the Learned CM sketch is up to a logarithmic factor smaller than that of its non-learned counterpart.

Theorem 6 ([18]) Let A = [a_1, . . . , a_n] be an array of n frequencies such that a_i = 1/i, and let [â_1, . . . , â_n] be the estimates of an algorithm C solving the frequency estimation problem. Define the expected error of C as Σ_{i=1}^{n} |â_i − a_i| · a_i. Then,

1. The expected error of a CM sketch of width k and depth b/k is O((k ln n / b) · ln^{(k+2)/(k−1)}(kn/b)).

2. The expected error of a Learned CM sketch with b_u unique buckets reserved for heavy hitters and b − b_u buckets used by a CM sketch of width k and depth (b − b_u)/k is O((1/(b − b_u)) · ln^2(n/b_u)), assuming a heavy hitter classifier with perfect accuracy.


In the network traffic scenario we mentioned earlier, the Learned CM sketch reduced the estimation error by 32% compared to CM sketch with a space of 0.5 MB and by 42% with a space of 1 MB. However, the inference time was 2.8 µs per item on a GPU, while CM sketches on CPU were shown to be four orders of magnitude faster [9].
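The routing logic of Fig. 12 can be sketched on top of the CountMinSketch class above; `heavy_oracle` is a placeholder for the trained classifier and is not part of the original implementation of [18]:

```python
class LearnedCountMin:
    """Learned CM sketch (sketch): items flagged as heavy hitters by the
    oracle get exact counters; the rest share a Count-Min sketch."""

    def __init__(self, heavy_oracle, eps, delta):
        self.is_heavy = heavy_oracle            # placeholder predicate
        self.exact = {}                         # unique buckets for heavy hitters
        self.cm = CountMinSketch(eps, delta)    # sketch from the example above

    def update(self, item, c=1):
        if self.is_heavy(item):
            self.exact[item] = self.exact.get(item, 0) + c
        else:
            self.cm.update(item, c)

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact.get(item, 0)
        return self.cm.estimate(item)
```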

5.2 ML-Oracle Estimating Heavy Hitters' Counts

A different approach, proposed in [41], uses an ML model both to classify heavy hitters and to predict the counts of the heavy hitters, while maintaining the one-sided error guarantee of the CM sketch. The construction algorithm takes a training set (X, Y) of items X and corresponding counts Y, sorted in ascending order by Y. It sets a boundary P = Y[p] so that the first p counts of Y have an overall sum less than a given fraction t ∈ (0, 1) of the total, i.e. Σ_{i=1}^{p} Y[i] ≤ t‖Y‖_1. Then, the algorithm modifies the training set by increasing the distance between each count in Y and the boundary P by an offset proportional to ε‖Y‖_1, which is the upper bound on the error of a CM sketch. Finally, an ML model f is trained on (X, Y) until (i) it does not misclassify heavy hitters, i.e. items with a count greater than P; and (ii) it overestimates the count of heavy hitters within some desired accuracy. Once the training is finished, the training items i satisfying f(i) < P are stored in a traditional CM sketch. At query time, the count for an item x is estimated as f(x) when f(x) ≥ P, and it is estimated by the CM sketch otherwise. The following theorem illustrates that, under certain assumptions, this Learned CM sketch has an upper error bound no worse than a traditional CM sketch.

Theorem 7 ([41]) The Learned Count-Min sketch with a threshold t ∈ (0, 1) and a backup Count-Min sketch with parameters (ε/t, δ) provides an accuracy guarantee no worse than that of a standard Count-Min sketch with parameters (ε, δ), on the assumption that the items for testing the model share the same distribution as the query items, and that the guarantee of the model holds not only on the test set but also on the query items.

Using a 3-layer neural network oracle with 20 hidden units, this Learned CM sketch reduced the mean squared estimation error with respect to a CM sketch by about 92% on normally-distributed counts and by 73% on Zipf-distributed counts. The authors also discuss a variant called Learned Augmented Sketch that stores the counts of the top-k items into exact counters, estimates the (remaining) heavy hitters with f, and uses the CM sketch otherwise.

Discussion. To conclude this section, we want to stress the remarks of Sect. 4, which apply to this problem too. That is, one should be aware that the Learned CM sketch might not be as robust as the traditional version if either the query distribution or the input stream distribution changes over time with respect to the training data. On the other hand, if these conditions hold and more efficient implementations are made available, the practical advantages of a Learned CM sketch should be taken into consideration.

6 Conclusions

This is a field still in its infancy, which needs much more study and experimentation to provide a formal framework and practical support for its preliminary yet promising achievements. We refer the readers to the papers in the following bibliography for a complete list of open problems, and we limit ourselves here to mentioning that it is also crucial to assess the impact of ML-based solutions in real software and to experiment with their combination with compact data structures [29], possibly in automatic engines [19].

Acknowledgements Part of this work has been supported by the Italian MIUR PRIN project "Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond" (Prot. 2017WR7SHH), by Regione Toscana (under POR FSE 2014/2020), and by the European Integrated Infrastructure for Social Mining and Big Data Analytics (SoBigData++, Grant Agreement # 871042).

References 1. Ao, N., Zhang, F., Wu, D., Stones, D.S., Wang, G., Liu, X., Liu, J., Lin, S.: Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proc. VLDB Endow. 4(8), 470–481 (2011) 2. Bender, M.A., Farach-Colton, M., Mosteiro, M.A.: Insertion sort is o(n log n). Theory Comput. Syst. 39(3), 391–397 (2006) 3. Bender, M.A., Hu, H.: An adaptive packed-memory array. ACM Trans. Database Syst. 32(4) (2007) 4. Bentley, J.L., Yao, A.C.C.: An almost optimal algorithm for unbounded searching. Inf. Process. Lett. 5(3), 82–87 (1976) 5. Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: a survey. Internet Math. 1(4), 485–509 (2004) 6. Coraddu, A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean. Eng. 130, 351–370 (2017) 7. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009) 8. Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, vol. 4, No. 1–3, pp. 1–294. Foundations and Trends in Databases (2011) 9. Cormode, G., Muthukrishnan, S.: Summarizing and mining skewed data streams. In: Proceedings of the International Conference on Data Mining, SDM, pp. 44–55. SIAM (2005) 10. Dai, Z., Shrivastava, A.: Adaptive learned Bloom filter (Ada-BF): Efficient utilization of the classifier. arXiv:1910.09131 (2019) 11. Ding, J., Minhas, U.F., Zhang, H., Li, Y., Wang, C., Chandramouli, B., Gehrke, J., Kossmann, D., Lomet, D.: ALEX: An updatable adaptive learned index. arXiv:1905.08898 (2019)


12. Ferragina, P., Vinciguerra, G.: The PGM-index: a fully dynamic compressed learned index with provable worst-case bounds. PVLDB J. 13(8) (2020). ISSN 2150-8097 13. Gaede, V., Günther, O.: Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998) 14. Galakatos, A., Markovitch, M., Binnig, C., Fonseca, R., Kraska, T.: FITing-Tree: a data-aware index structure. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 1189–1206. ACM, New York, NY, USA (2019) 15. Hadian, A., Heinis, T.: Considerations for handling updates in learned index structures. In: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM, pp. 3:1–3:4. ACM, New York, NY, USA (2019) 16. Hashemi, M., Swersky, K., Smith, J.A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C., Ranganathan, P.: Learning memory access patterns. In: ICML, Proceedings of Machine Learning Research, vol. 80, pp. 1924–1933. PMLR (2018) 17. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, 2nd edn., Springer Series in Statistics, Springer (2009) 18. Hsu, C.Y., Indyk, P., Katabi, D., Vakilian, A.: Learning-based frequency estimation algorithms. In: International Conference on Learning Representations, ICLR (2019) 19. Idreos, S., Zoumpatianos, K., Chatterjee, S., Qin, W., Wasay, A., Hentschel, B., Kester, M., Dayan, N., Guo, D., Kang, M., Sun, Y.: Learning data structure alchemy. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 42(2), 46–57 (2019) 20. Idreos, S., Zoumpatianos, K., Hentschel, B., Kester, M.S., Guo, D.: The data calculator: Data structure design and cost synthesis from first principles and learned cost models. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 535–550. ACM, New York, NY, USA (2018) 21. Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P.A., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning. In: CIDR. www.cidrdb.org (2019) 22. Kraska, T., Alizadeh, M., Beutel, A., Chi, E.H., Kristo, A., Leclerc, G., Madden, S., Mao, H., Nathan, V.: SageDB: a learned database system. In: CIDR. www.cidrdb.org (2019) 23. Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 489– 504. ACM, New York, NY, USA (2018) 24. Li, P., Hua, Y., Zuo, P., Jia, J.: A scalable learned index scheme in storage systems. arXiv:1905.06256 (2019) 25. Li, X., Li, J., Wang, X.: ASLM: Adaptive single layer model for learned index. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds.) Database Systems for Advanced Applications, pp. 80–95. Springer International Publishing, Cham (2019) 26. Macke, S., Beutel, A., Kraska, T., Sathiamoorthy, M., Cheng, D.Z., Chi, E.H.: Lifting the curse of multidimensional data with learned existence indexes. In: Workshop on ML for Systems at NeurIPS (2018) 27. Mehta, D.P., Sahni, S. (eds.): Handbook of Data Structures and Applications, 2nd edn. Chapman and Hall/CRC, Boca Raton (2018) 28. Mitzenmacher, M.: A model for learned Bloom filters and optimizing by sandwiching. In: 32nd Conference on Neural Information Processing Systems, NeurIPS (2018) 29. Navarro, G.: Compact data structures: a practical approach. Cambridge University Press, New York (2016) 30. 
Oneto, L., Ridella, S., Anguita, D.: Tikhonov, ivanov and morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015) 31. O’Rourke, J.: An on-line algorithm for fitting straight lines between data ranges. Commun. ACM 24(9), 574–578 (1981) 32. Petrov, A.: Algorithms behind modern storage systems. Commun. ACM 61(8), 38–44 (2018) 33. Qu, W., Wang, X., Li, J., Li, X.: Hybrid indexes by exploring traditional B-tree and linear regression. In: Proceedings of the 16th International Conference on Web Information Systems and Applications, WISA, pp. 601–613. Springer International Publishing, Cham (2019)


34. Rae, J., Bartunov, S., Lillicrap, T.: Meta-learning neural Bloom filters. In: Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 5271–5280. PMLR, Long Beach, California, USA (2019) 35. Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: European Conference on Technology Enhanced Learning (2015) 36. Vinciguerra, G., Ferragina, P., Miccinesi, M.: Superseding traditional indexes by orchestrating learning and geometry. arXiv:1903.00507 (2019) 37. Vitter, J.S.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001) 38. Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: Proceedings of the 20th International Conference on Mobile Data Management, MDM, pp. 569–574. IEEE (2019) 39. Wu, Y., Yu, J., Tian, Y., Sidle, R., Barber, R.: Designing succinct secondary indexing mechanism by exploiting column correlations. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 1223–1240. ACM, New York, NY, USA (2019) 40. Xiang, W., Zhang, H., Cui, R., Chu, X., Li, K., Zhou, W.: Pavo: A RNN-based learned inverted index, supervised or unsupervised? IEEE Access 7, 293–303 (2019) 41. Zhang, M., Wang, H., Li, J., Gao, H.: Learned sketches for frequency estimation. Inf. Sci. 507, 365–385 (2020) 42. Zhang, Y., Huang, Y.: “Learned” operating systems. SIGOPS Oper. Syst. Rev. 53(1), 40–45 (2019)

Deep Randomized Neural Networks

Claudio Gallicchio and Simone Scardapane

Abstract Randomized Neural Networks explore the behavior of neural systems where the majority of connections are fixed, either in a stochastic or a deterministic fashion. Typical examples of such systems consist of multi-layered neural network architectures where the connections to the hidden layer(s) are left untrained after initialization. Limiting the training algorithms to operate on a reduced set of weights inherently characterizes the class of Randomized Neural Networks with a number of intriguing features. Among them, the extreme efficiency of the resulting learning processes is undoubtedly a striking advantage with respect to fully trained architectures. Besides, despite the involved simplifications, randomized neural systems possess remarkable properties both in practice, achieving state-of-the-art results in multiple domains, and theoretically, allowing to analyze intrinsic properties of neural architectures (e.g. before training of the hidden layers’ connections). In recent years, the study of Randomized Neural Networks has been extended towards deep architectures, opening new research directions to the design of effective yet extremely efficient deep learning models in vectorial as well as in more complex data domains. This chapter surveys all the major aspects regarding the design and analysis of Randomized Neural Networks, and some of the key results with respect to their approximation capabilities. In particular, we first introduce the fundamentals of randomized neural models in the context of feed-forward networks (i.e., Random Vector Functional Link and equivalent models) and convolutional filters, before moving to the case of recurrent systems (i.e., Reservoir Computing networks). For both, we focus specifically on recent results in the domain of deep randomized systems, and (for recurrent models) their application to structured domains.

C. Gallicchio (B) Department of Computer Science, University of Pisa, Pisa, Italy e-mail: [email protected] S. Scardapane Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Rome, Italy e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies in Computational Intelligence 896, https://doi.org/10.1007/978-3-030-43883-8_3


1 Introduction

The relentless pace of success in deep learning over the last few years has been nothing short of extraordinary. After the initial breakthroughs in the ImageNet competition [52], a popular viewpoint was that deep learning represented a significant shift away from hand-designing features to learning them from data. However, the majority of researchers today would agree that the shift can be more correctly classified as moving towards hand-designing architectural biases in the networks themselves [95]. This, combined with the flexibility of stochastic gradient descent and automatic differentiation, goes a long way towards explaining many of the recent advances in neural networks.

In this chapter, we consider how far we can go by relying almost exclusively on these architectural biases. In particular, we explore recent classes of deep learning models wherein the majority of connections are randomized or more generally fixed according to some specific heuristic. In the case of shallow networks, the benefits of randomization have been explored numerous times. Among other things, we can mention the original perceptron architecture [77], random vector functional-links [68, 69], stochastic configuration networks [101–103], random features for kernel approximations [42, 48, 51, 72–74], and reservoir computing [46, 59]. In general, these models trade-off a (possibly negligible) part of their accuracy for training processes that can be orders of magnitude faster than fully trainable networks. In addition, randomization makes them particularly attractive from a theoretical point of view, and a vast literature exists on their approximation properties.

Differently from previous reviews [27, 80, 100], in this chapter we focus on recent attempts at extending these ideas to the deep case, where a (possibly very large) number of hidden layers is stacked to obtain multiple intermediate representations. Extending the accuracy/efficiency trade-off also for deep architectures is not trivial, but the benefits of being able to do so are vast. As we show in this chapter, several alternatives exist for obtaining extremely fast and accurate randomized deep learning models in a variety of scenarios, especially whenever the dataset is medium or medium-to-large in size. We also comment on a number of intriguing analytical and theoretical properties arising from the study of deep randomized architectures, from their relation to kernel methods and Gaussian processes [19], to metric learning [38], pruning [75], and so on. Importantly, randomization allows to potentially blend non-differentiable components in the architecture (e.g., Heaviside step functions [49]), further extending the toolkit available to deep learning practitioners.

Because we touch on a number of different fields, we do not aim at a comprehensive survey of the literature. Rather, we highlight general ideas and concepts by a careful selection of papers and results, trying to convey the widest perspective. When possible, we also highlight points that in our opinion have been under-explored in the literature, and possible open areas of research. Finally, we consider a variety of types of data, ranging from vectors to images and graph-based datasets.


Organization. The rest of the chapter is organized in two broad parts, each further subdivided in two. We start with shallow, feedforward networks in Sect. 2. Because our focus is on deep models, we only provide basic concepts, and provide references and pointers to more comprehensive expositions of shallow randomized models when necessary. Building on this, Sect. 3 describes a selection of topics and papers pertaining to the analysis, design, and implementation of deep randomized feedforward models. Sections 4 and 6 replicate this organization for recurrent models: we first introduce the basic reservoir computing architecture in Sect. 4 (with a focus on echo state networks), exploring their extension to multiple hidden layers and structured types of data in Sect. 6. We conclude with several remarks in Sect. 8.

Notation. We use boldface notation for vectors (e.g., v) and matrices (e.g., X). Subscripts are used to denote a specific unit inside a layer, and superscripts are used for denoting a specific layer. An index t in brackets is used for time dependency. For example, x_i^l(t) denotes the i-th unit of the l-th layer at time t.

2 Randomization in Feed-Forward Neural Networks As we stated in the introduction, neural networks with a single hidden layer whose connections are fixed (either randomly or otherwise) have a long history in the field, dating back to some of the original works on perceptrons. Random vector functional-links (RVFLs), originally introduced and analyzed in the nineties [44, 45, 68, 69], represent the most comprehensive formalization of this idea, with further innovations and applications up to today [2]. In this section we provide an overview of their design and approximation capabilities, and refer to [80] for a more thorough overview of their history, and to [101–103] for further developments in the context of these models.

2.1 Description of the Model Consider a generic function approximation task, where we denote by x the input vector, by y the output (e.g., a binary {0, 1} for classification), and by f(x) the model we would like to train. In particular, the basic RVFL model is defined as [44]:

f(x) = \sum_{i=0}^{B} \beta_i h_i(x) ,    (1)


where the functions h_i(x) extract generic (fixed) features, which are linearly combined through the adaptable coefficients β_i. An example is the sigmoidal basis expansion with random coefficients w_i and b_i:

h_i(x) = \frac{1}{1 + \exp\left(-w_i^T x - b_i\right)} .    (2)

In general, we also consider h_0(x) = 1 to add an offset to the model, and we can also include the original input features in the output layer (similar to modern residual connections in deep networks). Assuming that the parameters in (2) are all selected beforehand (e.g., by randomization), the final model in (1) is a linear model f(x) = β^T h(x), where we stack in two column vectors h(x) and β all feature expansions and output coefficients respectively. As a result, all the theory of linear regression and classification can be applied almost straightforwardly [80]. Approximation capabilities for this class of networks have been studied extensively [39, 44, 69, 72, 73, 79]. In general, RVFL networks retain the universal approximation properties of fully-trainable neural networks, with an error that decreases in the order of 1/\sqrt{B}, where B is the number of feature expansions. The practical success of the networks depends strongly on the selection of the random coefficients, with recent works exploring this subject at length [102].
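To make the construction concrete, the following minimal NumPy sketch builds the random feature expansion of (1)–(2); the function name, the choice of weight distributions, and the sizes are only illustrative placeholders, not prescriptions from the original papers.

import numpy as np

def rvfl_features(X, W, b, include_input=True):
    # Random sigmoidal feature expansion of Eq. (2), applied row-wise to X (shape (N, d)).
    # W (shape (d, B)) and b (shape (B,)) are fixed random parameters.
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    parts = [np.ones((X.shape[0], 1)), H]          # constant feature h_0(x) = 1
    if include_input:
        parts.append(X)                            # optional direct input-to-output links
    return np.concatenate(parts, axis=1)

# Example: B = 500 random features for 3-dimensional inputs.
rng = np.random.default_rng(0)
d, B = 3, 500
W = rng.normal(size=(d, B))
b = rng.uniform(-1.0, 1.0, size=B)
X = rng.normal(size=(10, d))
H = rvfl_features(X, W, b)                         # shape (10, 1 + B + d)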

2.2 Training the Network We now dwell briefly on the topic of training and optimizing standard RVFL networks. In fact, speed of training (while maintaining good nonlinear approximation capabilities) is one of the major advantages of randomized neural networks and, conversely, keeping this accuracy/efficiency trade-off is one of the major challenges in the design of deeper architectures. Consider a dataset of desired input/output pairs {(x_i, y_i)}_i. We initialize the input-to-hidden parameters randomly, and collect the corresponding feature expansions h(x_i) row-wise in a matrix H. While many variants of optimization are feasible [80], by far the most common technique to train an RVFL net is to formulate the optimization problem as an ℓ2-regularized least-squares problem:

\beta = \arg\min_{\beta} \, \|H\beta - y\|^2 + \lambda \cdot \|\beta\|^2 ,    (3)

where y is the vector of targets and λ is a free (positive) hyper-parameter. The popularity of (3) relies on (i) its strong convexity (resulting in a single minimizer), and (ii) the linearity of its gradient. The latter is especially important, since for most medium-sized datasets the problem (3) can be solved immediately as:

\beta = \left(H^T H + \lambda I\right)^{-1} H^T y ,    (4)


where I is the identity matrix of appropriate shape, or alternatively (if B is much larger than the number of points in the dataset) as:

\beta = H^T \left(H H^T + \lambda I\right)^{-1} y .    (5)

In general, solving the previous expressions has a cost which is cubic in the number of feature expansions or in the number of data points, depending on the specific formulation being chosen. For large-scale problems, many ad-hoc implementations [21] and algorithmic advances [108] are available to solve the problem in a fraction of the cost of a standard stochastic gradient descent. Note how, in both formulations, the term weighted by λ acts as a numerical stabilizer on the diagonal of the matrix being inverted. Clearly, a wide range of variants on the basic problem in (3) are possible, almost all of them losing the possibility of a closed-form solution. Of these, we mention two that are relevant to the following. First, when considering binary classification tasks (in which the target variable is constrained as y_i ∈ {0, 1}), we can reformulate the problem in a logistic regression fashion:

\beta = \arg\min_{\beta} \, -\sum_i \left[ y \odot \log\left(\sigma(H\beta)\right) + (1 - y) \odot \log\left(1 - \sigma(H\beta)\right) \right]_i ,    (6)

where σ(·) is the sigmoid operation from (2) and ⊙ denotes elementwise multiplication. The use of the sigmoid ensures that the output of the RVFL network can be interpreted as a probability and can be used in later computations as a measure of confidence in the prediction. Second, replacing the squared ℓ2 norm in (3) with the ℓ1 norm ‖β‖_1 results in sparse weight vectors, which can make the network more efficient [7, 14].
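For reference, a compact NumPy sketch of the closed-form solution of (3) is given below, switching between (4) and (5) according to whether the number of features B or the number of samples N is smaller; the function name and the default value of λ are our own illustrative choices.

import numpy as np

def train_readout(H, y, lam=1e-3):
    # Ridge solution of Eq. (3): use the B x B system (4) or the N x N system (5),
    # whichever matrix is smaller to invert.
    N, B = H.shape
    if B <= N:
        return np.linalg.solve(H.T @ H + lam * np.eye(B), H.T @ y)   # Eq. (4)
    return H.T @ np.linalg.solve(H @ H.T + lam * np.eye(N), y)       # Eq. (5)

# Hypothetical usage with the feature map sketched in Sect. 2.1:
# beta = train_readout(rvfl_features(X_train, W, b), y_train)
# y_pred = rvfl_features(X_test, W, b) @ beta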

2.3 Additional Considerations Clearly, this is only intended as a very brief introduction to the topic of (shallow) RVFL networks, and we refer to other reviews for a more comprehensive treatment [27, 55, 80, 111]. There is a plethora of interesting topics that we skip or only briefly touch upon, including ensembling strategies [2] and recent works on selecting the optimal range for the pseudo-random parameters [102]. More generally, although we focus on the RVFL terminology, this class of networks has a rich history in which similar ideas have been reintroduced multiple times under different names (see also [80]), so interesting pointers can be found in the literature on random kernel features [73], the no-prop training algorithm [105], and several others. All of these works play on the delicate trade-off of keeping nonlinear approximation capabilities without sacrificing efficiency or, possibly, analytic solutions.


We now turn to the topic of extending these capabilities to the ‘deep’ case. Differently from the fully-trainable case, where stacking several adaptable layers can be easily justified empirically (and does not change the nature of the optimization problem), in the randomized case this is not trivial. Firstly, it is unclear whether simply stacking several randomized layers can improve accuracy at all, or whether it even distorts the original information content of the inputs. Secondly, designing other strategies going beyond the simple ‘stack’ of layers must remain sufficiently simple and efficient to contend with fully-trainable deep learning solutions (i.e., either provide gains in accuracy or orders-of-magnitude improvements in efficiency). In the next section, we review some significant work dealing with these two questions.

3 Deep Random-Weights Neural Networks In this section we collect and organize a series of selected works dealing with the analysis and design of deep randomized networks. This is not built as a comprehensive survey of the state-of-the-art, but rather as a set of pointers to some of the most important ideas and results coming from the recent literature.

3.1 Analyzing Randomized Deep Networks Through the Lens of Kernel Methods To begin with, consider a generic deep randomized network f = g ∘ f_R, defined as the composition of a representation function f_R (a stack of one or more layers with random weights) and a linear model g trained on top of the representations from f_R (later also called a readout). This is a relatively straightforward extension of the previous section, where we allow the matrix H to be generated by more complex architectures with random weights than a single, fully-connected layer. Irrespective of the accuracy of such a model, an analysis of its theoretical properties is interesting because it corresponds to investigating the behavior of a deep network in a small subspace around its random initialization. In fact, there is a vast literature showing insightful connections of this problem with the study of kernel machines and Gaussian processes [4, 17, 19, 61, 74]. A general conclusion of all these works is that, in the limit of infinite width, deep networks with randomized weights converge to Gaussian processes. Reference [19] generalizes most of the previous results to a vast class of representation functions f_R whose structure can be described by a directed acyclic graph with a bounded activation function associated to each node, comprising most commonly used feedforward and sequential networks. They show that the skeleton of this function (i.e., the topological structure with no knowledge of the weights) is uniquely associated with a kernel function κ. The representations generated by a


single realization of the skeleton, obtained by sampling the weights from a Gaussian distribution with appropriately scaled variance, are in general able to approximate the kernel itself. As a result, with high probability one can find a linear predictor g able to approximate all bounded functions in the hypothesis space H associated with κ. A complementary class of results, based on the novel idea of the neural tangent kernel (NTK), can be found in [5, 106]; these allow the ideas above to be extended more formally to networks with weight tying (e.g., convolutional neural networks), and to neural networks with trained weights.
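As a small numerical illustration of this kernel view (our own toy check, not an experiment from [19]), the sketch below compares the rescaled Gram matrix of a single wide random ReLU layer with the degree-1 arc-cosine kernel of [17], which is its infinite-width limit; widths and sizes are arbitrary.

import numpy as np

def arccos1_kernel(X):
    # Degree-1 arc-cosine kernel: k(x, y) = (1/pi) * ||x|| ||y|| * (sin t + (pi - t) cos t),
    # with t the angle between x and y.
    norms = np.linalg.norm(X, axis=1)
    cos_t = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    t = np.arccos(cos_t)
    return np.outer(norms, norms) * (np.sin(t) + (np.pi - t) * np.cos(t)) / np.pi

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))              # 5 points in 10 dimensions
D = 200_000                               # width of the random layer
F = np.maximum(0.0, X @ rng.normal(size=(10, D)))   # random ReLU representations
gram_empirical = 2.0 * (F @ F.T) / D      # converges to the kernel as D grows
print(np.abs(gram_empirical - arccos1_kernel(X)).max())   # small discrepancy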

3.2 The Relation Between Random Weights and Metric Learning Another interesting class of results was obtained by [38] and later works, which explored the effect of the randomly initialized representation function f_R on the metric space in which the data resides, exploiting tools from compressive sensing and dictionary learning. Roughly speaking, if one assumes that points in the input data corresponding to separate classes have ‘large’ angles (compared to points in the same class), then it is possible to show that f_R performs an embedding of the data in which the latter angles are shrunk more than angles corresponding to points in the same cluster. With the separation among classes increasing, the embedding obtained by a deep network makes the data easier to classify by prioritizing their angle. Differently from works described in the previous section, these results are not viable for any deep randomized network, but only for networks with random Gaussian weights and rectified linear units (ReLU) as activation functions (or similar):

ReLU(s) = \max(0, s) .    (7)

The previous activation is necessary to make the network sensitive to the angles between inputs, shrinking them proportionally to their magnitude. At the same time, the analysis from [38] has several interesting practical implications. On the one hand, if the assumptions on the data are correct, it makes it possible to derive bounds relating the implicit dimension of the data to the corresponding required size of the training set (see [38, Sect. V]). More generally, even if the assumptions are not satisfied, this analysis provides a justification for the good performance of deep networks in practice, by assuming that learning the linear projection is equivalent to ‘choosing’ a suitable angle on which to perform the shrinking across classes, instead of using the angle of their principal axis. These results directly lead to considering this class of networks for practical learning purposes. From [38]: “In fact, for some applications it is possible to use networks with random weights at the first layers for separating the points with distinguishable


angles, followed by trained weights at the deeper layers for separating the remaining points.” Extensions and variations on this core concept are considered more in-depth over the next sections.
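The qualitative effect described above can be checked numerically; the following toy sketch (ours, not a result from [38]) passes two pairs of points through a single random Gaussian layer with ReLU activations and prints the angles before and after the mapping.

import numpy as np

def angle(u, v):
    # Angle (in radians) between two vectors.
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(1)
d, m = 50, 4000                                   # input dimension, layer width
W = rng.normal(size=(d, m))                       # random Gaussian weights

x = rng.normal(size=d)
x_close = x + 0.2 * rng.normal(size=d)            # small angle ("same class")
x_far = rng.normal(size=d)                        # large angle ("different class")

for a, b in [(x, x_close), (x, x_far)]:
    after = angle(np.maximum(0.0, a @ W), np.maximum(0.0, b @ W))
    print(f"angle before: {angle(a, b):.3f}  after the random ReLU layer: {after:.3f}")
# The larger (between-class) angle is shrunk by a larger amount than the small one,
# in line with the behavior discussed in [38].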

3.3 Deep Randomized Neural Networks as Priors In a very broad sense, understanding the performance of deep randomized networks in practice is akin to understanding how much the spectacular results of deep networks in several domains are due to their architectures (i.e., their architectural biases), and how much can be attributed to the specific training algorithm for selecting the weights. One key result in this sense was developed in the work on deep image priors [95]. The paper was one of the first to show that a randomly initialized convolutional neural network (CNN) contains enough structural information to act as an efficient prior in many image processing problems. The algorithm they exploit is very simple and can be summarized in a small number of steps. First, they randomly initialize a CNN x = f(z), mapping from a simple latent vector z to the space of images under consideration. Given a noisy starting image x_0 (e.g., an image with occlusions) and a loss term E(x, x_0) that depends on the specific task, the parameters of f are optimized on the single image:

f^*(z) = \arg\min_{f} E(f(z), x_0) .    (8)

The restored image is then given by f^*(z). This procedure is able to obtain state-of-the-art results on several image restoration tasks [95]. In their words, “Our results go against the common narrative that explain the success of deep learning in image restoration to the ability to learn rather than hand-craft priors; instead, random networks are better hand-crafted priors, and learning builds on this basis.” Along a similar line, [71] showed that randomly initialized CNNs on several audio classification problems performed better than some hand-crafted features, in particular mel frequency cepstral coefficients (MFCCs), although they are still significantly worse than their fully trained equivalents.
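A minimal PyTorch-style sketch of the optimization in (8) for an inpainting-like task is given below; the tiny convolutional network, the latent size, and the hyper-parameters are placeholders of our own, much simpler than the architectures actually used in [95].

import torch
import torch.nn as nn

def deep_image_prior(x0, mask, steps=2000, lr=0.01):
    # x0: corrupted image of shape (1, 3, H, W); mask: 1 on observed pixels, 0 on holes.
    f = nn.Sequential(                             # a deliberately small random CNN
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    )
    z = torch.randn(1, 32, x0.shape[2], x0.shape[3])   # fixed random latent input
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((f(z) - x0) ** 2 * mask).mean()        # E(f(z), x0) for inpainting
        loss.backward()
        opt.step()
    return f(z).detach()                               # restored image f*(z)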

3.4 Towards Practical Deep Randomized Networks: Relation with Pruning Summarizing the discussion up to this point, we saw how deep randomized networks can be helpful for analyzing several interesting properties of deep networks. From a more practical viewpoint, fully randomized networks can be used in some specific scenarios, either as priors (due to their architectural biases) or as generic feature extractors. The question remains open, however, of whether we can also exploit them as generic learning models.


One of the first works to seriously explore this possibility was [78]. The authors investigated the training of a deep network wherein a large percentage of weights was kept fixed to their original values. They showed that, for modern deep CNNs, it is possible to fix up to nine tenths of the parameters and train only the remaining 10%, obtaining a negligible drop in accuracy in several scenarios. Apart from computational savings, this finding is interesting inasmuch as it makes it possible to describe a good portion of the neural network only with the knowledge of the specific pseudo-random number generator and its initial seed [78]. This line of reasoning also connects to one of the fundamental open research questions in deep learning, the pruning of architectures [23]. In particular, even if a posteriori (after training) a large percentage of weights in a deep network is found to be redundant and easily removable, a priori (before training) it is very hard to train small, compact networks. The lottery ticket hypothesis [23] is a recent proposal arguing that the success of most deep networks can be attributed to small subsets of weights (tickets), and that the benefit of very large networks is in having and initializing a very large number of such tickets, increasing the possibility of finding good ones. The hypothesis has generated many follow-ups (e.g., [24, 65]), although at the moment its relation with fully randomized networks remains under-explored (with some exceptions, e.g., [75]). In particular, when moving to more structured types of pruning, it is found that the lottery ticket hypothesis compares unfavorably with training smaller architectures from scratch [56]: “[...] for these pruning methods, what matters more may be the obtained architecture, instead of the preserved weights, despite training the large model is needed to find that target architecture.” In general, this points to the fact that more work on deep randomized networks and their initialization can be beneficial also to the field of model selection and architecture search. We return to this point in one of the next sections. We refer also to [110] for similar layer-wise analyses.
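In the spirit of [78] (though not their exact procedure), the sketch below freezes a random 90% of each weight tensor at its initial value by zeroing the corresponding gradient entries before each update; the model, optimizer, and fraction are illustrative placeholders.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
train_fraction = 0.1
# One fixed random binary mask per parameter tensor: 1 = trainable, 0 = frozen at init.
masks = {name: (torch.rand_like(p) < train_fraction).float()
         for name, p in model.named_parameters()}
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # plain SGD: frozen entries never move

def training_step(x, y):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.grad *= masks[name]                   # keep 90% of the weights random
    opt.step()
    return loss.item()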

3.5 Training of Deep Randomized Networks via Stacked Autoencoders One way to combine the advantage of randomization with a partial form of training is the use of stacked autoencoders, similar to some prior work on deep learning [99]. An autoencoder is a neural network with one or more hidden layers that is trained to map an input x (or a corrupted version thereof) to x, learning a suitable intermediate representation internally. A general recipe to combine autoencoders with RVFL networks is as follows [15] (a minimal sketch is given after the list):
1. Initialize a random mapping h(x) similar to Sect. 2.1.
2. Train the readout to map h(x) to the original input x, obtaining a set of weights β (through least-squares or a sparse version of it).
3. Use β^T as the first weight matrix of a separate deep randomized network.
4. Repeat points (1)–(3) on the embedding generated at point (3).
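A minimal NumPy sketch of the recipe, under our own choices for the regularization and layer widths, is the following.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_layer_weights(Z, width, lam=1e-2, rng=None):
    # Steps (1)-(2): random expansion of Z, then a ridge readout beta trained to
    # reconstruct Z itself; beta has shape (width, input_dim).
    rng = rng if rng is not None else np.random.default_rng()
    H = sigmoid(Z @ rng.normal(size=(Z.shape[1], width)))
    return np.linalg.solve(H.T @ H + lam * np.eye(width), H.T @ Z)

def deep_rvfl_embedding(X, widths, rng=None):
    # Steps (3)-(4): beta^T becomes the fixed weight matrix of the next layer,
    # and the procedure is repeated on the resulting embedding.
    Z = X
    for width in widths:
        beta = autoencoder_layer_weights(Z, width, rng=rng)
        Z = sigmoid(Z @ beta.T)
    return Z   # a readout is then trained on Z as in Sect. 2.2

# Hypothetical usage: Z_train = deep_rvfl_embedding(X_train, widths=[500, 500])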


A more constructive and theoretically grounded approach to the design of deep randomized networks is described in the literature on deep stochastic configuration networks [103]. Because our focus here is on RVFL networks, we refer the interested reader to [103] and papers therein for this separate class of algorithms.

3.6 Semi-random Neural Networks Fully-trained and randomized neural networks (what [5] calls strongly-trained and weakly-trained networks) are only two extremes of a relatively large continuum of models, all possessing separate trade-offs concerning accuracy, speed of training, inference, and so on. As an example of a model in the middle of this range, we briefly describe here the semi-random architecture proposed in [49]. We replace the ith feature expansion in the basic RVFL model (1) by:

h_i(x) = \sigma_s\left(r_i^T x\right) \cdot \left(w_i^T x\right) ,    (9)

where r_i is randomly sampled, w_i is trainable, and the activation function σ_s is defined for a positive hyper-parameter s as:

\sigma_s(z) = z^s \cdot H(z) ,    (10)

with H(z) being the step function. For example, for s = 1 we obtain linear semi-random features, while for s = 2 we obtain squared semi-random features. Mimicking the matrix notation of Sect. 2.1, the feature transformation can be written as:

H = \underbrace{\sigma_s(R x)}_{\text{randomized}} \odot \underbrace{W x}_{\text{trainable}} ,    (11)

where ⊙ is the Hadamard (element-wise) product between matrices. While apparently counter-intuitive, this model shows a number of remarkable theoretical properties, as analyzed by [49]. Among other things, a single-hidden-layer semi-random model admits a single minimum even though the associated optimization problem is non-convex, and its extension to more than a single hidden layer has generalization bounds that are significantly better than those of comparable fully-trainable networks with ReLU activation functions [49]. Irrespective of its theoretical and practical capabilities, this model shows the power of smartly combining the two worlds of fully-trainable deep learning and randomized (or semi-randomized) models, a direction that we believe is heavily under-explored at the moment.
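A minimal NumPy sketch of the forward transformation (9)–(11) is reported below; here W is only randomly initialized, whereas in [49] it would be trained by gradient descent, and all sizes are placeholders.

import numpy as np

def semi_random_features(X, R, W, s=1):
    # Semi-random feature map of Eq. (11): sigma_s(X R) * (X W), elementwise,
    # where sigma_s(z) = z**s * step(z), R is fixed and W is the trainable part.
    Z = X @ R
    gate = (Z > 0).astype(X.dtype) * Z ** s
    return gate * (X @ W)

rng = np.random.default_rng(0)
d, B = 20, 100
R = rng.normal(size=(d, B))      # random, kept fixed
W = rng.normal(size=(d, B))      # trainable (here just initialized)
H = semi_random_features(rng.normal(size=(8, d)), R, W, s=1)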


3.7 Weight-Agnostic Neural Networks Up to now, we considered deep networks wherein a majority of the connections are randomized. However, several of the ideas that we discussed can be extended by considering networks with fixed, albeit not randomized, weights. In fact, as we will discuss later, in the reservoir computing field this has become a fruitful research direction. In the feedforward case, we conclude here by showing a single notable result in the case of neural architecture search (NAS), the weight-agnostic neural network [26]. NAS is the problem of finding an optimal architecture for a specific task. A single NAS run requires training a large number of models and, as such, it is one of the fields that could benefit the most from advancements in this sense (also from an environmental point of view [87]). The idea of weight-agnostic networks is to design architectures in which all weights are initialized to the same shared value, and which are robust to the choice of this value. This makes it possible to evaluate a huge number of architectures extremely quickly, obtaining very interesting results in some scenarios [26]. “Inspired by precocial behaviors evolved in nature, in this work, we develop neural networks with architectures that are naturally capable of performing a given task even when their weight parameters are randomly sampled” [26]. Note that ideas on randomized networks in NAS also have a long history, dating back to works on recurrent neural networks [81].

3.8 Final Considerations We conclude this general overview with a small set of final remarks and considerations. Globally, we saw that deep randomized networks have attracted a large amount of interest lately as tools for the analysis and search of deep networks, going to the heart of a historical dichotomy between the importance of the network’s architecture and the selection of its weights. Practically, several ideas and heuristics have been developed to make these randomized neural networks useful in real-world scenarios. All the ideas considered here have historical antecedents. Just to cite an example, “It has long been known that [randomized] convolutional nets have reasonable performance on MNIST and CIFAR-10. [randomized] nets that are fully-connected instead of convolutional, can also be thought of as “multi-layer random kitchen” sinks, which also have a long history” [5]. At the same time, we acknowledge that the performance of randomized networks has not been comparable to that of fully trained networks on truly complex scenarios such as ImageNet. This can be due to an imperfect understanding of their behavior, or it can be a fundamental limitation of this class of models. One possibility to overcome this limit could be to combine the idea of fixing part of the network, but moving beyond pure randomization. An example of this is the PCANet [16], which we have not mentioned in the main text.


While deep RVFL networks show excellent accuracy/efficiency trade-offs on small and medium problems, this trade-off has yet to be thoroughly analyzed for larger problems. Along this line, it would be interesting to evaluate deep RVFL variations on established benchmarks such as Stanford’s DAWN Deep Learning Benchmark (https://dawn.cs.stanford.edu/benchmark/). Finally, part of this criticism can be attributed to the lack of an established codebase for this class of models. This is also an important line of research for the immediate future. We now turn to the topic of deep randomized recurrent neural networks.

4 Randomization in Dynamical Recurrent Networks Dynamical recurrent neural models, or simply Recurrent Neural Networks (RNNs) [50, 62], are a widely popular paradigm for processing data that comes in the form of time-series, where each new piece of input information is linked to the previous (and following) ones by a temporal relation. Architecturally, the major difference of RNNs with respect to the feed-forward neural processing systems analyzed so far is the presence of feedback among the hidden layer’s recurrent units. This is a crucial modification that makes it possible to elaborate each input in the context of its predecessors, i.e., it gives a memory to the operation of the system. Roughly speaking, apart from this architectural change, the basic description of the model does not change: a hidden layer (made up here of recurrent units) implements a representation function f_R, whose outcome is tapped by a readout layer of linear units that calculate the output function g. The overall operation can be described as the composition g ∘ f_R (as already seen in Sect. 3). A graphical description of this process is given in Fig. 1. Going a step further into the mathematical description of the representation (hidden layer) component, we can see that its operation can be understood as that of an input-driven dynamical system. The state of such a system is given by the activation of the hidden units, i.e., h(t). The evolution of such a state is ruled by a function f_R that can be formulated in several ways. For instance, in continuous-time cases such an evolution function is expressed in terms of a set of differential equations, as used, e.g., in the case of spiking neural network models [37]. Here, instead, we refer to the common case of discrete-time dynamical systems that evolve according to an iterated mapping of the form:

h(t) = f_R\left(x(t), h(t-1)\right) = \sigma\left(W^T x(t) + W_R^T h(t-1)\right) ,    (12)

where W and W_R are weight matrices that parametrize the state update function, respectively modulating the impact of the current external input and that of the previous state of the system. Typically, the activation function σ comes in the form of a squashing non-linearity, as already examined in Sect. 2.


[Fig. 1 Schematic representation of the RNN operation on temporal data: the input x(t) and the previous state h(t−1) drive the dynamical representation component f_R, whose state is fed to a feed-forward readout g, so that the overall output is (g ∘ f_R)(x(t), h(t−1)).]

The readout often comes in the same linear form mentioned in Sect. 2, i.e., as a layer of linear units that apply a linear combination of the components of the state vector: β^T h(t), where the elements in β are the parameters of the readout layer. Training RNN architectures implies gradient propagation across several steps: those corresponding to the length of the time-series on which the hidden layer’s architecture is unrolled. It is then easy to see that training algorithms for RNNs face similar difficulties to those encountered when training deep neural networks. A major related downside is that learning is computationally intensive and requires long training times (an aspect partially mitigated by the availability of GPU-accelerated algorithms). As such, also in this context, the use of partially untrained RNN architectures immediately appears very intriguing. While early works in the neural networks literature already pointed out the possible benefits of having untrained dynamical systems as effective neural processing models (see, e.g., [1]), in the last decade a paradigm called Reservoir Computing entered the literature, becoming very popular as an efficient alternative to the common fully-trained design of RNNs.

5 Reservoir Computing Neural Networks Reservoir Computing (RC) [59, 98] is nowadays a popular approach for the parsimonious design of RNNs. In the same spirit as the randomized neural network approaches described in Sect. 2, the basic idea of RC is to limit training to the readout part of the network, leaving the representation part unaltered after initialization. This means that the parameters (i.e., the weights) of the recurrent hidden layer are randomly


initialized and then left untrained. This peculiar part of the architecture, responsible for implementing the representation function f_R in Fig. 1, is in this context called the reservoir. The reservoir is typically made up of a large number of non-linear neurons, and its role is essentially to provide a high-dimensional non-linear expansion of the input history into a possibly rich feature space, where the original learning problem can be more easily approached by a simple linear readout layer. This basic RC methodology for fast RNN set-up and training has been (almost) contemporaneously and independently proposed in the literature under different names and perspectives, among which we mention Echo State Networks (ESNs) [46, 47], usually with discrete-time tanh dynamics, Liquid State Machines (LSMs) [60], in the context of biologically-inspired spiking neural network models, and Fractal Prediction Machines (FPMs) [91], which originated from the study of contractive iterated function systems and fractals. Here we adopt formalism and terminology close to the most prominently known of these, the ESN model.

5.1 Reservoir Initialization Training of the readout is performed in the same way described in Sect. 2.2, and as such we are not going to discuss it further in this part. The crucial aspect of RC networks is to guarantee a meaningful randomized initialization of the reservoir parameters, i.e., of the weight values in the matrices W and W_R. As we are dealing here with the parameters of a dynamical system, special care needs to be devoted to the stability of the resulting dynamics. Indeed, if not properly instantiated, the reservoir system could exhibit undesired behaviors, such as instability or even chaos. If this occurs, then the resulting learning model would likely respond very differently to very similar input time-series, thereby showing very poor generalization abilities. To account for this potential weakness, reservoirs are commonly initialized under stability properties that (in one way or another) ensure that the system dynamics will not fall into undesired regimes when put into operation. Perhaps the most widely known of such properties is the so-called Echo State Property (ESP) [46, 64, 107]. This is a global asymptotic (Lyapunov) stability condition on the input-driven reservoir, and essentially states that the state of the system will progressively forget its initial conditions and will depend solely on the driving input time-series. In formulas, denoting by \tilde{f}_R(x_0, s_N) the final state of the reservoir starting from initial state x_0 and being fed by the N-long input time-series s_N, the ESP can be formulated as:

\left\| \tilde{f}_R(x_0, s_N) - \tilde{f}_R(z_0, s_N) \right\| \to 0 \ \text{as} \ N \to \infty, \quad \text{for every } x_0, z_0, s_N .    (13)

Assuming reservoir neurons with tanh non-linearity and bounded input spaces, some baseline conditions for reservoir initialization can be derived. Specifically, a sufficient condition originates from seeing the reservoir as a contraction mapping, and requires that ‖W_R‖_2 < 1. If this condition is met, then the reservoir will show


contractive behavior (and hence stability) for all possible driving inputs. In this regard, it is worth recalling that the analysis of reservoirs as contraction mappings also has interesting connections to the resulting Markovian state space organization, the so-called architectural bias of RNNs [90, 92]. Initializing reservoirs under a contractive constraint inherently enables reservoir systems to discriminate among different input histories in a suffix-based way [28]. Interestingly, this observation explains, at least partially, the surprisingly good performance of reservoirs in many tasks (while at the same time also indicating classes of tasks that are more difficult to tackle with RC). A necessary condition for the ESP assumes an autonomous reservoir (i.e., with no input) and studies its stability around the zero state. The resulting condition is given by ρ(W_R) < 1, where ρ(·) denotes the spectral radius, i.e., the largest among the moduli of the eigenvalues. Both conditions are easy to implement; e.g., referring to the necessary one: after random initialization, just scale the recurrent matrix by its spectral radius, and then multiply it by the desired one. Although it does not ensure stability in the case of non-null input, the necessary condition on the spectral radius of W_R is typically the one used in RC applications.
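For concreteness, the following NumPy sketch sets up a small ESN along these lines: the recurrent matrix is rescaled to a desired spectral radius (the necessary condition above), the states of Eq. (12) are collected over the input time-series, and the readout is trained by ridge regression as in Sect. 2.2; all sizes, ranges and hyper-parameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 3, 300

W_in = rng.uniform(-0.5, 0.5, size=(n_in, n_res))
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius set to 0.9

def reservoir_states(X):
    # Run Eq. (12) over an input time-series X of shape (T, n_in).
    h = np.zeros(n_res)
    states = []
    for x in X:
        h = np.tanh(x @ W_in + h @ W_res)
        states.append(h)
    return np.array(states)

def train_readout(states, targets, lam=1e-6, washout=50):
    # Ridge regression on the collected states, discarding an initial transient.
    S, Y = states[washout:], targets[washout:]
    return np.linalg.solve(S.T @ S + lam * np.eye(n_res), S.T @ Y)

# Hypothetical usage: beta = train_readout(reservoir_states(X_train), Y_train)
#                     predictions = reservoir_states(X_test) @ beta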

5.2 Reservoir Richness Another possible issue with untrained dynamics in RNNs is the potential weakness of the developed temporal representations. Indeed, after contractive initialization, the correlation between recurrent units’ activations could be very high, thereby hampering the richness of the state dynamics. A simple rule of thumb here would prescribe setting the reservoir weights close to the limit of stability, e.g., by setting ρ(W_R) to a value very close to 1. Just controlling the value of the spectral radius, however, might not be informative enough about the quality of the developed reservoir dynamics [67, 97]. Thereby, several attempts have been made in the literature to identify quality measures for reservoirs. Notable examples are given by assessing (and trying to maximize) information-theoretic quantities, such as information storage [58], transfer entropy [84], the average state entropy of the reservoir over time [67], and the entropy of individual reservoir neurons’ distributions. For instance, maximizing the latter quantity led to the well-known intrinsic plasticity (IP) [83, 94] unsupervised adaptation algorithms for training reservoirs. From a perspective closer to the theory of dynamical systems, several works in the literature (see, e.g., [8, 54]) indicated that input-driven reservoirs that operate in a regime close to the boundary between stability and instability show higher-quality dynamics. Such a region is commonly called the edge of stability, edge of criticality, and also (with a slight abuse of terminology) edge of chaos. Relevantly, reservoirs close to such critical behavior tend to show longer short-term memory [12, 53] and improved predictive quality on certain tasks [57, 82, 98]. While on the one hand it could be questionable to assert that reservoirs should operate close to criticality for every learning task, on the other hand this seems a reasonable initialization condition


to consider when nothing is known about the properties of the input-target relation for the task at hand. Furthermore, being able to identify the criticality would be useful for knowing the actual limits of stable reservoir initialization. While the identification of critical reservoir behaviour is still an open topic of research in the RC community [63], some (more or less) practical approaches have been introduced in the literature, e.g., relying on the spectrum of local Lyapunov exponents [97], recurrence plots [9], Fisher information [57] and visibility graphs [10]. Some works also highlighted the relation between criticality and information-theoretic measures of the reservoir [12, 93]. Another stream of RC research focuses on the idea of enforcing architectural richness in reservoir systems. Typically, reservoir units are connected by following a sparse pattern of connectivity [46] where, for instance, each unit is coupled only to a small constant number of others. Besides the original idea that such sparseness would diversify the reservoir units’ activations (see, e.g., [28] for a counterexample), the real advantage is actually the sparsification of the involved reservoir matrices, which can significantly cut down the computational complexity of the prediction phase. However, a related common question arising in the community is the following: is it possible to get a reservoir organization that is better than just random? Several works in the literature seem to give a positive answer to this question, pointing out approaches for effective reservoir setup. Prominent examples here are given by initialization of recurrent connections based on a ring topology [76, 86], i.e., where all the units in the reservoir are simply connected to form a cycle. This kind of organization implies a number of advantages: the recurrent matrix of the reservoir is highly sparse, the stability of the system is easily controllable, the performance in many tasks is often optimized, and the resulting memorization skills are improved (approaching the theoretical limit in the linear case) [76, 86, 89]. Other common instances of constrained reservoir topologies include multi-ring reservoirs (where the recurrent neurons are connected to form more than one cycle), and chain reservoirs (where each recurrent neuron is connected only to the next one). On the one hand, these peculiar reservoir organizations can be studied from the perspective of architectural simplification [13, 86]; on the other hand, they can be related to the interesting concept of orthogonality in dynamical neural systems [22, 41, 104]. E.g., ring and multi-ring reservoirs can be seen as a very simple approach to obtain orthogonal recurrent weight matrices (a minimal construction is sketched below). Another way of achieving improved-quality reservoirs is to introduce depth in their architectural construction, as described in the following.
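The ring construction referenced above is straightforward to realize; a minimal NumPy sketch (with the single connection weight r as a hyper-parameter of our choosing) is the following.

import numpy as np

def ring_reservoir(n_res, r=0.9):
    # Recurrent matrix of a ring reservoir: unit i feeds only unit (i + 1) mod n_res,
    # all connections sharing the same weight r; the spectral radius is |r|.
    W_res = np.zeros((n_res, n_res))
    idx = np.arange(n_res)
    W_res[idx, (idx + 1) % n_res] = r
    return W_res

def permutation_reservoir(n_res, r=0.9, rng=None):
    # More generally, any permutation matrix scaled by r yields a (multi-)ring reservoir
    # whose recurrent matrix is r times an orthogonal matrix.
    rng = rng if rng is not None else np.random.default_rng()
    W_res = np.zeros((n_res, n_res))
    W_res[np.arange(n_res), rng.permutation(n_res)] = r
    return W_res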

6 Deep Reservoir Computing The basic idea behind the advancements on deep RNN architectures is to develop richer temporal representations that are able to exploit compositionality in time to capture the multiple levels of temporal abstractions, i.e., multiple time-scales, present in the data. This led to great success in a number of human-level applications, e.g. in the fields of speech, music and language processing [20, 40, 43, 70]. Trying to


extend the randomized RC approaches described in Sect. 5 towards deep architectures is thereby intriguing from multiple viewpoints. First of all, it would enable us to analyze the bias of deep recurrent neural systems (i.e., their capabilities before training of the recurrent connections). Moreover, it would make it possible to design efficient deep neural network methodologies for learning in time-series domains. The concept of depth in RNN design is sometimes considered questionable. Here we take a perspective similar to the authors of [70] and observe that, even if the recurrent layer’s architecture becomes multi-layered when unrolled in time, all the transitions (i) from the input to the recurrent layer, (ii) from the recurrent layer to the output, and (iii) from the previous state to the current state are indeed shallow. Depth can then be introduced in all of these transitions. Interestingly, some works in the RC literature attempted to bridge this gap. In particular, the authors of [88] proposed a hybrid architecture where an ESN module is stacked on top of a Deep Belief Network, which introduces depth into the input-to-reservoir transition. On the other hand, the authors of [11] proposed an RC model where a bi-directional reservoir system is tapped by a deep readout network, hence introducing depth into the reservoir-to-readout transition. In the rest of this chapter we focus our analysis on the case of deep reservoir-to-reservoir transitions, where multiple reservoir layers are stacked on top of each other. In particular, we keep our focus on the ESN formalism, extended to the multi-layer setting by the Deep Echo State Network (DeepESN) model.

7 Deep Echo State Networks DeepESNs, introduced in [33], are RC models whose operation can be described by the composition of a dynamical reservoir and a feed-forward readout. The crucial difference with respect to standard RC is that the dynamical part is a stacked composition of multiple reservoirs, i.e., the reservoir is deep, as illustrated in Fig. 2. The external input time-series drives the dynamics of the first reservoir in the stack, whose output then excites the dynamics of the second reservoir, and so on until the end of the pipeline. Interestingly, architecturally this corresponds to a simplification (sparsification) of a single fully-connected reservoir (see [33]).

[Fig. 2 Deep reservoir architecture: the input time-series x(t) drives representation layer 1, whose states feed representation layer 2, and so on up to representation layer L.]


From a mathematical perspective, the operation of the deep reservoir can be interpreted as that of a set of nested input-driven dynamical systems. The dynamics of the first reservoir layer are ruled by:

h^1(t) = f_R^1\left(x(t), h^1(t-1)\right) = \sigma\left((W^1)^T x(t) + (W_R^1)^T h^1(t-1)\right) ,    (14)

while the evolution of the temporal representations developed in successive layers l > 1 is given by:

h^l(t) = f_R^l\left(h^{l-1}(t), h^l(t-1)\right) = \sigma\left((W^l)^T h^{l-1}(t) + (W_R^l)^T h^l(t-1)\right) ,    (15)

where W^l denotes the weight matrix for the connections between layer l − 1 and layer l, and W_R^l is the recurrent weight matrix for layer l. Given such a mathematical formulation, it is possible to derive stability conditions for the ESP of deep RC models. This was achieved in [29] for a more general case of reservoir computing models with leaky integrator units. For the case of standard tanh neurons considered here, the sufficient condition is given by:

\max_{k=1,\dots,L} \; \sum_{i=1}^{k} \left\| W_R^i \right\|_2 \prod_{j=i+1}^{k} \left\| W^j \right\|_2 < 1 ,    (16)

while the necessary one reads as follows:

\max_{k=1,\dots,L} \; \rho\left(W_R^k\right) < 1 .    (17)

Notice that both conditions generalize (to the multi-layered setting) the respective ones for shallow reservoir systems already discussed in Sect. 5. As illustrated in Fig. 3, there are two basic settings for the readout computation. In an all-layers setup, the readout is fed by the activations of all the reservoir layers. In a last-layer setup, the readout receives only the activations of the last layer in the stack. In the former case, the learner is able to exploit the qualitatively different dynamics developed in the different layers of the recurrent architecture (possibly weighting them in a suitable way for the learning task at hand). In the latter case, the idea is that the stack of reservoirs has enriched the developed representations of the driving input in such a way that the readout operation can now be more effective. Again, training is limited to the connections pointing to the readout, and is performed as discussed in Sect. 2.2. Interestingly, the structure that is imposed on the organization of the recurrent units in the reservoir is reflected by a corresponding structure of the developed temporal representations. This has been analyzed recently from several points of view, delineating a pool of potential advantages of deep recurrent architectures that are independent of the training algorithms and shedding light on the architectural bias of deep RNNs.
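A compact NumPy sketch of the deep reservoir of (14)–(15), returning both the all-layers and the last-layer state matrices for the readout, is given below; the layer widths, scalings, and the absence of leaky integration are simplifying choices of ours.

import numpy as np

rng = np.random.default_rng(0)

def spectral_rescale(M, rho):
    return M * (rho / np.max(np.abs(np.linalg.eigvals(M))))

def init_deep_reservoir(n_in, n_res=100, n_layers=3, rho=0.9, inter_scale=0.5):
    # W[l]: input (l = 0) or inter-layer weights; W_R[l]: recurrent weights of layer l,
    # rescaled so that each layer satisfies the necessary condition rho(W_R^l) < 1.
    sizes_in = [n_in] + [n_res] * (n_layers - 1)
    W = [inter_scale * rng.uniform(-1, 1, size=(s, n_res)) for s in sizes_in]
    W_R = [spectral_rescale(rng.uniform(-1, 1, size=(n_res, n_res)), rho)
           for _ in range(n_layers)]
    return W, W_R

def deep_reservoir_states(X, W, W_R):
    # Run Eqs. (14)-(15) on a time-series X of shape (T, n_in).
    n_layers, n_res = len(W), W_R[0].shape[0]
    h = [np.zeros(n_res) for _ in range(n_layers)]
    states = []
    for x in X:
        drive = x
        for l in range(n_layers):
            h[l] = np.tanh(drive @ W[l] + h[l] @ W_R[l])
            drive = h[l]
        states.append(np.concatenate(h))
    states = np.array(states)
    return states, states[:, -n_res:]   # all-layers and last-layer readout inputs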

[Fig. 3 Readout settings for DeepESN: in the all-layers setup the readout g is fed by the states of all reservoir layers, while in the last-layer setup it receives only the states of the last layer in the stack. Trained connections are only those pointing to the readout.]

7.1 Enriched Deep Representations A first inherent benefit of depth in RNNs is given by the possibility of developing progressively more abstract representations of the driving input. In the temporal domain this means that different layers are able to focus on different time-scales, and the network as a whole is capable of representing temporal information at multiple time-scales. A first piece of evidence in this sense was given in [33], where it was shown that the effects of input perturbations last longer in the higher layers of the deep reservoir architecture. This important observation was in line with what was reported in [43] for fully trained deep RNNs, and pointed out the great role played by the layering architectural factor in the emergence of multiple time-scales. Further evidence of multiple-view representations in untrained RNN systems was given in [34], where it was shown that the different layers in deep reservoirs tend to develop different frequency responses (as emerging through a fast Fourier transform of the reservoir activations). This insight was exploited to develop an automatic algorithm for the setup of the depth in untrained deep RNNs. The basic idea was to analyze the behavior of each new reservoir layer in the architecture as a filter, and to stop adding new layers when the filtering effect becomes negligible. The resulting approach, in conjunction with IP unsupervised adaptation of reservoirs, was shown to be extremely effective in speech and music processing, achieving state-of-the-art results and beating the accuracy of more complex fully-trained gated RNN architectures, while requiring only a fraction of their respective training times [34, 35].


The richness of deep reservoir dynamics was also explored in the context of the stability of dynamical systems and local Lyapunov exponents. In this regard, the major achievement is reported in [36], where it was shown, both mathematically and experimentally, that organizing the same number of recurrent units into layers naturally (i.e., under easy conditions) has the effect of pushing the resulting system dynamics closer to the edge of criticality. From a related viewpoint, deep RC settings were found to boost the short-term memory capacity in comparison to equivalent shallow architectures [33]. More recent works on deep RC highlighted even further the role of certain aspects of the network’s architectural construction in the enrichment of the developed dynamics. In this respect, results in [31] pointed out the relevance of a proper scaling of inter-layer connections, i.e., of the weights in the matrices W^l, for l > 1, in (15). It was found that such scaling has a profound impact on the quality of the dynamics in the higher layers of the network, with larger (resp. smaller) values leading to a higher (resp. smaller) average state entropy and number of linearly-uncoupled dynamics. The importance of inter-reservoir connectivity patterns was also pointed out in the context of spiking neural networks in [109].

7.2 Deep Reservoirs for Structures In many real-world domains the information under consideration presents forms of aggregation that can be naturally represented by complex data structures, such as trees or graphs. Learning in such structured domains opens entire worlds of application opportunities and, at the same time, implies a large number of difficulties. The interested reader is referred to [6] for a gentle introduction to the research field. Here we briefly summarize the extension of RC models for dealing with trees and graphs. Starting with tree domains, the basic idea is inspired by the original concept of Recursive Neural Networks (RecNNs) [25, 85], and consists in applying a reservoir system to each node in the input tree, starting from the leaves and ending up in the root. The overall process is again seen as a composition of a representation component followed by a readout layer. In this case, the representation component is implemented by the reservoir as a state transition system that operates on discrete tree structures. The nodes in the input tree take the role of the time-steps in the computation of conventional reservoirs, and the states of the children nodes take the role of the previous state. With these concepts in mind, the state (or neural embedding) computed for each node n at layer l can be expressed as:

h^l(n) = f_R\left(x^l(n), h^l(ch_1(n)), \dots, h^l(ch_k(n))\right) = \sigma\left((W^l)^T x^l(n) + \sum_{i=1}^{k} (W_R^l)^T h^l(ch_i(n))\right) ,    (18)


where h^l(ch_i(n)) is the state computed by layer l for the i-th child of node n. Note that x^l(n) is the input information that drives the state update at the current layer: the (external) input label attached to node n for the first layer, and the state already computed for node n at the previous layer, for layers l > 1. For the case of graphs, the reservoir operation is further generalized, and the embedding computed for each vertex in the input structure becomes a function of the embeddings developed for its neighbors. The state transition of a deep graph reservoir system operating on a vertex v at layer l can be formulated as follows:

h^l(v) = f_R\left(x^l(v), \{h^l(v')\}_{v' \in N(v)}\right) = \sigma\left((W^l)^T x^l(v) + \sum_{v' \in N(v)} (W_R^l)^T h^l(v')\right) ,    (19)

where N(v) is the neighborhood of v and, as before, x^l(v) is the driving input information for vertex v at layer l. The two deep reservoir models expressed by (18) and (19) are based on randomization as in conventional RC approaches, and are formalized in [30] and [32], respectively. The experimental assessments in these papers indicate the great potential of the randomization approach also in dealing with complex data structures, often establishing new state-of-the-art accuracy results on problems in the areas of document processing, cheminformatics and social network analysis.
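To fix ideas, a minimal NumPy sketch of one graph reservoir layer in the spirit of (19) is reported below; the synchronous fixed-point iteration, the adjacency-list format and the number of iterations are our own simplifying choices, and W_R is assumed to be initialized contractively so that the iteration converges.

import numpy as np

def graph_reservoir_layer(X, adj, W, W_R, n_iter=30):
    # One reservoir layer of Eq. (19): X has shape (n_nodes, d) node features,
    # adj[v] lists the neighbors of node v, W has shape (d, n_res), W_R (n_res, n_res).
    n_nodes, n_res = X.shape[0], W.shape[1]
    H = np.zeros((n_nodes, n_res))
    for _ in range(n_iter):                       # synchronous updates towards a fixed point
        H_new = np.empty_like(H)
        for v in range(n_nodes):
            neigh = H[adj[v]].sum(axis=0)
            H_new[v] = np.tanh(X[v] @ W + neigh @ W_R)
        H = H_new
    return H   # node embeddings; for graph-level tasks they can be pooled (e.g., summed)

# Deeper variants reuse the same function, feeding the embeddings of layer l-1
# as the node features of layer l, each layer with its own random W and W_R.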

8 Conclusions In the face of the huge computational power and strong automatic differentiation capabilities exhibited by most computers and frameworks today, a focus on randomization as a quick alternative to full optimization can seem counter-productive. Yet, in this chapter we hope to have provided sufficient evidence that, despite the breakthroughs of fully-trained deep learning, randomized neural networks remain an area of research with strong promise. From a practical perspective, they can achieve significant accuracy/efficiency trade-offs in most problems, although strong performance on very large-scale problems currently remains difficult. From a theoretical perspective, they are an irreplaceable tool for the analysis of the properties and dynamics of classical neural networks. More importantly, we believe fully-trainable and fully-randomized networks stand at two extremes of a wide range of interesting architectures, a continuum that is only now starting to be more thoroughly explored. We believe our exposition summarizes some of the most promising lines of research and provides a good entry point into this ever-growing body of literature.


References 1. Albers, D., Sprott, J., Dechert, W.: Dynamical behavior of artificial neural networks with random weights. Intell. Eng. Syst. Artif. Neural Netw. 6, 17–22 (1996) 2. Alhamdoosh, M., Wang, D.: Fast decorrelated neural network ensembles with random weights. Inf. Sci. 264, 104–117 (2014) 3. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the hypothesis space for improving the generalization ability of support vector machines. In: IEEE International Joint Conference on Neural Networks (2011) 4. Anselmi, F., Rosasco, L., Tan, C., Poggio, T.: Deep convolutional networks are hierarchical kernel machines (2015). arXiv preprint arXiv:1508.01084 5. Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R., Wang, R.: On exact computation with an infinitely wide neural net (2019). arXiv preprint arXiv:1904.11955 6. Bacciu, D., Errica, F., Micheli, A., Podda, M.: A gentle introduction to deep learning for graphs (2019). arXiv preprint arXiv:1912.12693 7. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends® Mach. Learn. 4(1), 1–106 (2012) 8. Bertschinger, N., Natschläger, T.: Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput. 16(7), 1413–1436 (2004) 9. Bianchi, F., Livi, L., Alippi, C.: Investigating echo state networks dynamics by means of recurrence analysis (2016). arXiv preprint arXiv:1601.07381 10. Bianchi, F.M., Livi, L., Alippi, C., Jenssen, R.: Multiplex visibility graphs to investigate recurrent neural network dynamics. Sci. Rep. 7, 44037 (2017) 11. Bianchi, F.M., Scardapane, S., Lokse, S., Jenssen, R.: Bidirectional deep-readout echo state networks. In: Proceedings of the 26th European Symposium on Artificial Neural Networks (ESANN), pp. 425–430 (2018) 12. Boedecker, J., Obst, O., Lizier, J., Mayer, N., Asada, M.: Information processing in echo state networks at the edge of chaos. Theory Biosci. 131(3), 205–213 (2012) 13. Boedecker, J., Obst, O., Mayer, N.M., Asada, M.: Studies on reservoir initialization and dynamics shaping in echo state networks. In: ESANN (2009) 14. Cao, F., Tan, Y., Cai, M.: Sparse algorithms of random weight networks and applications. Expert. Syst. Appl. 41(5), 2457–2462 (2014) 15. Cecotti, H.: Deep random vector functional link network for handwritten character recognition. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 3628–3633. IEEE (2016) 16. Chan, T.H., Jia, K., Gao, S., Lu, J., Zeng, Z., Ma, Y.: Pcanet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24(12), 5017–5032 (2015) 17. Cho, Y., Saul, L.K.: Kernel methods for deep learning. In: Advances in Neural Information Processing Systems, pp. 342–350 (2009) 18. Coraddu, A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean. Eng. 130, 351–370 (2017) 19. Daniely, A., Frostig, R., Singer, Y.: Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In: Advances In Neural Information Processing Systems, pp. 2253–2261 (2016) 20. El Hihi, S., Bengio, Y.: Hierarchical recurrent neural networks for long-term dependencies. In: Advances in Neural Information Processing Systems, pp. 493–499 (1996) 21. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008) 22. 
Farkaš, I., Bosák, R., Gergel’, P.: Computational analysis of memory capacity in echo state networks. Neural Netw. 83, 109–120 (2016) 23. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks (2018). arXiv preprint arXiv:1803.03635 24. Frankle, J., Dziugaite, G.K., Roy, D.M., Carbin, M.: The lottery ticket hypothesis at scale (2019). arXiv preprint arXiv:1903.01611



Tensor Decompositions and Practical Applications: A Hands-on Tutorial Ilya Kisil, Giuseppe G. Calvi, Bruno Scalzo Dees, and Danilo P. Mandic

Abstract The exponentially increasing availability of big and streaming data comes as a direct consequence of the rapid development and widespread use of multi-sensor technology. The quest to make sense of such a large volume and variety of data has both highlighted the limitations of standard flat-view matrix models and the necessity to move toward more versatile data analysis tools. One such model which is naturally suited for data of large volume, variety and veracity is the multi-way array, or tensor. The associated tensor decompositions have been recognised as a viable way to break the "Curse of Dimensionality", an exponential increase in data volume with the tensor order. Owing to the scalable way in which they deal with multi-way data and their ability to exploit inherent deep data structures when performing feature extraction, tensor decompositions have found application in a wide range of disciplines, from very theoretical ones, such as scientific computing and physics, to the more practical aspects of signal processing and machine learning. It is therefore both timely and important for the wider Data Analytics community to become acquainted with the fundamentals of such techniques. Thus, our aim is not only to provide the necessary theoretical background for multi-linear analysis but also to equip researchers and interested readers with easy-to-read and easy-to-understand practical examples in the form of Python code snippets.

I. Kisil (B) · G. G. Calvi · B. Scalzo Dees, · D. P. Mandic Imperial College London, SW7 2AZ London, UK e-mail: [email protected] G. G. Calvi e-mail: [email protected] B. Scalzo Dees, e-mail: [email protected] D. P. Mandic e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies in Computational Intelligence 896, https://doi.org/10.1007/978-3-030-43883-8_4


1 Introduction
The considerable rise in the production of an ever-increasing amount of accessible data has posed unprecedented challenges, while at the same time it has provided novel opportunities to explore events and natural phenomena yet requiring full explanation. Such data falls in the category of what is referred to within the Signal Processing and Machine Learning community as Big Data. This comprises datasets characterised by the so-called "4V's of Big Data": Volume, Veracity, Variety, and Velocity, explained in Fig. 1. Thus, Big Data is not necessarily a synonym of raw size only. Big Data sets are hence far from "simple", and analysing them is a nontrivial task which highlights the limitations of standard flat-view matrix models and underlines the necessity for sophisticated techniques able to assess the information within. To this end, several modern-day data analysis tools employ multi-way generalisations of matrices, or tensors, which have been shown to offer greater flexibility in the choice of constraints matching data properties, hence providing more powerful schemes for performing extraction of latent components present in data [8]. From the Latin word tendere, i.e. "to stretch", a tensor is a geometrical object used in mathematics and physics as an extension of vectors and matrices to higher dimensions [14]. Tensors have found wide applications in physics and mathematics since the early 20th century, when American mathematician Frank Hitchcock

Fig. 1 The 4V’s of big data. Volume refers to the raw size of datasets; Veracity conceptualises missing or corrupted data; Variety refers to the presence of different data types; Velocity represents the frequency at which new data becomes available


introduced the first principles of tensor decompositions (TDs) [17], used to overcome the Curse of Dimensionality [2]. Bellman first coined the term in 1961, to indicate that the number of parameters needed to estimate an arbitrary function grows exponentially with the number of parameters of the function itself. As in standard linear algebra matrix decompositions can be effectively used to compress matrices by obtaining low-rank representations, the concept of compression of multidimensional data through TDs can be intuitively explained as follows [12]. Consider an N-variate function, $f : \mathbb{R}^N \to \mathbb{R}$, defined as

$f(\mathbf{x}) = f(x_1, x_2, \ldots, x_N)$   (1)

Supposing f has to be represented by a product of individual functions, in the simplest case it can be approximated by

$f(x_1, x_2, \ldots, x_N) \approx f^{(1)}(x_1) \, f^{(2)}(x_2) \cdots f^{(N)}(x_N)$   (2)

Unfortunately, the case in which all variables are separable as in (2) is rare in practice. Instead, the concept of TDs is considerably more flexible, as it relies on the partial separability of the variables of a higher-order tensor [7]. Most of their benefits first became apparent in the 1960s, when Ledyard Tucker devised the Tucker Decomposition (TKD), which was applied to psychometrics [25, 26], along with the rediscovered Canonical Polyadic Decomposition (CPD) [6]. Variants of these have subsequently been employed in chemometrics, food industries, and social sciences [19, 23]. Around 2000, the realisation that the TKD represents a Higher-Order Singular Value Decomposition (HOSVD), a generalisation of the well-known matrix SVD [10], opened new avenues for their utilisations in applied mathematics and scientific computing in high dimensions [3]. This, combined with the Tensor-Train (TT) decomposition introduced by Oseledets in 2011 [21], now makes it possible for tensors to find applications in a variety of disciplines, such as wireless communications, audio and image processing, and machine learning to name but a few. In the following, an overview of the major TDs is provided, together with practical examples of their applications [5, 16] through the Higher Order Tensor ToolBOX (HOTTBOX) [15].

2 Prerequisites
2.1 Terminology
Definition 1 A tensor is a multi-way array having N indices, and it is written as $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$.
Definition 2 The n-th mode of a tensor is the n-th index of the tensor.


Table 1 Basic tensor and matrix notation
$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ : Tensor of order N of size $I_1 \times I_2 \times \cdots \times I_N$
$x_{i_1 i_2 \cdots i_N} = \mathcal{X}(i_1, i_2, \ldots, i_N)$ : $(i_1, i_2, \ldots, i_N)$ entry of $\mathcal{X}$
$x$, $\mathbf{x}$, $\mathbf{X}$ : Scalar, vector, matrix
$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 I_2 \ldots I_{n-1} I_{n+1} \ldots I_N}$ : Mode-n unfolding of tensor $\mathcal{X}$
$\mathbf{A}^{(n)}$ : Factor matrices, in tensor decompositions
$(\cdot)^T$, $(\cdot)^{-1}$, $(\cdot)^{\dagger}$ : Transpose, inverse, and Moore-Penrose pseudoinverse operators for matrices
$\circ$, $\otimes$, $\odot$, $\ast$ : Outer, Kronecker, Khatri-Rao and Hadamard products
$\mathrm{vec}(\mathcal{X}) = \mathbf{x} \in \mathbb{R}^{I_1 I_2 \ldots I_N}$ : Vectorization of tensor $\mathcal{X}$
$\|\cdot\|_F$ : Frobenius norm
$\mathbf{I}_M$ : Identity matrix of size $M \times M$
$\mathbf{1}_M$ : Vector of ones of size $M$
$\langle \cdot, \cdot \rangle$ : Inner product

Definition 3 The order of a tensor is the number of its modes.
Definition 4 A fiber of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is the collection of entries obtained by fixing all but one index, e.g. $\mathbf{x}(:, i_2, i_3, \ldots, i_N)$ is obtained by fixing all $i_n$ for $n = 2, 3, \ldots, N$.
Definition 5 A slice of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is the collection of entries obtained by fixing all but two indices, e.g. $\mathcal{X}(:, :, i_3, \ldots, i_N)$ is obtained by fixing all $i_n$ for $n = 3, 4, \ldots, N$.
Definition 6 The mode-n unfolding of tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $\mathbf{X}_{(n)}$, and is a matrix with entries

$(\mathbf{X}_{(n)})_{i_n, \overline{i_1 \ldots i_{n-1} i_{n+1} \ldots i_N}} = x_{i_1, i_2, \ldots, i_N}$   (3)

where, according to the Little-Endian convention [12]:

$\overline{i_1 i_2 \ldots i_N} = i_1 + (i_2 - 1) I_1 + (i_3 - 1) I_1 I_2 + \cdots + (i_N - 1) I_1 \cdots I_{N-1}$   (4)

This is sometimes referred to in the literature as "classical" or "Kolda" unfolding [17, 29]. Unless stated otherwise, when a tensor is said to be unfolded, this refers to the mode-n unfolding defined above (Eqs. 3 and 4). The unfolding of a 3rd-order tensor, along each of its modes, is shown in Fig. 2. Given an N-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ and another M-th order tensor $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_m \times \cdots \times J_M}$, with sizes $I_n = J_m$, their $(m, n)$-contraction, $\times^m_n$, yields a third tensor $\mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-1} \times J_{m+1} \times \cdots \times J_M}$ of order $(N + M - 2)$, with entries


Fig. 2 Illustration of the mode-1, mode-2, and mode-3 unfoldings on a 3-rd order tensor

$c_{i_1,\ldots,i_{n-1},i_{n+1},\ldots,i_N,\, j_1,\ldots,j_{m-1},j_{m+1},\ldots,j_M} = \sum_{i_n=1}^{I_n} a_{i_1,\ldots,i_{n-1},i_n,i_{n+1},\ldots,i_N} \; b_{j_1,\ldots,j_{m-1},i_n,j_{m+1},\ldots,j_M}$   (5)

The operator $\times^1_n$ is equal by convention to $\times_n$ and is referred to as the "mode-n" contraction or product. It is used for contraction between tensors and matrices. A few properties of the mode-n product are in order and summarised below:
• $\mathcal{Y} = \mathcal{X} \times_n \mathbf{U} \iff \mathbf{Y}_{(n)} = \mathbf{U}\mathbf{X}_{(n)}$
• $\mathcal{X} \times_m \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n \mathbf{B} \times_m \mathbf{A}$, if $m \neq n$
• $\mathcal{X} \times_n \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n (\mathbf{B}\mathbf{A})$
The Kronecker product of matrix $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and matrix $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$ yields a matrix $\mathbf{C} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2}$ with entries

$c_{(i_1-1)J_1 + j_1,\,(i_2-1)J_2 + j_2} = a_{i_1 i_2} \, b_{j_1 j_2}$   (6)

whereas the Khatri-Rao product of $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R] \in \mathbb{R}^{J \times R}$ yields a matrix $\mathbf{C} \in \mathbb{R}^{IJ \times R}$, with columns

$\mathbf{c}_r = \mathbf{a}_r \otimes \mathbf{b}_r$   (7)

Notation of the main multi-linear products and operations is summarised in Tables 1 and 2.
Definition 7 An N-th order tensor is rank-1 if it can be written as an outer product of N vectors, that is

$\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(N)}$   (8)

In other words, each element of the tensor is the product of the corresponding vector elements

$x_{i_1 i_2 \ldots i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} \cdots a^{(N)}_{i_N}$, for all $1 \le i_n \le I_n$   (9)


Table 2 Definition of main products
$\mathcal{C} = \mathcal{A} \times^m_n \mathcal{B}$ : (m, n)-contraction product of tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ and tensor $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_m \times \cdots \times J_M}$ yields $\mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-1} \times J_{m+1} \times \cdots \times J_M}$
$\mathcal{C} = \mathcal{A} \times_n \mathbf{B}$ : Mode-n product of tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and matrix $\mathbf{B} \in \mathbb{R}^{J_n \times I_n}$ yields $\mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}$
$\mathcal{C} = [\![\mathcal{A}; \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)}]\!]$ : Full multilinear product $\mathcal{C} = \mathcal{A} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \cdots \times_N \mathbf{B}^{(N)}$
$\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(N)}$ : Outer product of vectors $\mathbf{a}^{(n)} \in \mathbb{R}^{I_n}$, $n = 1, \ldots, N$, yields a rank-1 tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$
$\mathbf{C} = \mathbf{A} \otimes \mathbf{B}$ : Kronecker product of $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$ yields $\mathbf{C} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2}$
$\mathbf{C} = \mathbf{A} \odot \mathbf{B}$ : Khatri-Rao product of $\mathbf{A} \in \mathbb{R}^{I \times R}$ and $\mathbf{B} \in \mathbb{R}^{J \times R}$ yields $\mathbf{C} \in \mathbb{R}^{IJ \times R}$

Definition 8 The rank of a tensor X, rank(X) = R, is the smallest number of rank-1 tensors the sum of which generates X exactly.

2.2 Required Software
Our aim is not only to present the cross-disciplinary community with methods and algorithms of multi-linear analysis but also to equip researchers and interested readers with easy-to-read and easy-to-understand practical examples. Hence, we will support the theoretical material with short code snippets that serve as guiding points for those who wish to enter and explore the fascinating field of tensor decompositions and their applications. All of the hands-on insights are written in Python, which has tremendous support from the community, with a lot of exciting open-source projects emerging around the world. In order to run the code for the methods and approaches discussed throughout this work, it is required to have a working installation of Python 3 and the following packages: NumPy [28], Scikit-Learn [22] and HOTTBOX [15]. All of them can be installed directly from the Python Package Index:
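For instance, with a standard Python 3 environment the packages can be installed via pip install numpy scikit-learn hottbox (package names as published on the Python Package Index; versions are intentionally not pinned here). A minimal sketch to verify the installation is given below; the version lookup relies only on the standard library.

    # Minimal environment check: confirms that the three packages imported
    # above are importable and reports the installed versions.
    from importlib.metadata import version

    import numpy      # noqa: F401
    import sklearn    # noqa: F401
    import hottbox    # noqa: F401

    for pkg in ("numpy", "scikit-learn", "hottbox"):
        print(pkg, version(pkg))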


3 Theoretical Background
3.1 Fundamental Transformations
N-dimensional arrays of data can be represented in various different forms. By applying numerical methods (algorithms for tensor decompositions) to the raw data we can obtain, for example, a Kruskal or Tucker representation. At the same time, simple data rearrangement procedures (e.g. folding, unfolding) of the raw data also yield different representations (Fig. 3).

3.1.1 Unfolding, Folding and Mode-n Product

Conventionally, unfolding is considered to be a process of element mapping from a tensor to a matrix, as depicted in Fig. 2. In other words, it arranges the mode-n fibers of a tensor to be the columns of the matrix, where a fiber is a vector obtained by fixing all but one of the indices, e.g. A[i, :, k], and is denoted as:

$\mathcal{A} \;\xrightarrow{\,2\,}\; \mathbf{A}_{(2)}$   (10)

where $\mathcal{A} \in \mathbb{R}^{I \times J \times K}$ and $\mathbf{A}_{(2)} \in \mathbb{R}^{K \times IJ}$. Thus, within the HOTTBOX API this operation requires one to specify a mode along which a tensor will be unfolded.

Fig. 3 The ecosystem of representations of an N-dimensional array


Note that, in order to be consistent with Python indexing, the count of modes and of the elements within them starts from zero.

By default, unfolding changes the data array of a tensor in place. However, in some cases it might be beneficial to get the unfolded tensor as a separate data structure, which requires specifying an additional parameter, i.e. inplace=False:
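The following is a minimal sketch of this behaviour, based on the description above (a Tensor wrapping a NumPy array, a mode argument and the inplace flag); the exact attribute and method signatures may differ slightly between HOTTBOX versions.

    import numpy as np
    from hottbox.core import Tensor

    data = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    tensor = Tensor(data)                       # wrap the raw N-dimensional array

    # Unfold along mode 0 without touching the original object:
    # inplace=False returns the unfolded result as a separate structure.
    unfolded = tensor.unfold(mode=0, inplace=False)
    print(unfolded.data.shape)                  # (2, 12): mode-0 fibers become columns of A_(0)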

Note that unfolding a tensor $\mathcal{A} \in \mathbb{R}^{I \times I \times I}$ along different modes is not equivalent to a permutation of the data:

$\mathbf{A}_{(n)} \neq \mathbf{P}\mathbf{A}_{(m)}$   (11)


Fig. 4 Illustration of folding procedure (tensorisation) of a vector into a matrix, 3rd-order tensor and 4th-order tensors

Folding is most commonly referred to as a process of element mapping from a matrix or a vector to a tensor. However, it can be extended to a more general case, in which one converts a tensor of order N into a tensor of order M, where N < M. In order to ensure the integrity of the data, this operation merely reverts the unfolding operation; therefore, it does not require specifying a mode along which a tensor should be folded, as opposed to unfold (Fig. 4).
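Continuing the sketch above (and again assuming the API behaviour described in the text), a fold simply restores the original shape:

    # In-place round trip: unfold along mode 1, then fold back.
    tensor.unfold(mode=1)        # modifies the underlying data array in place
    tensor.fold()                # reverts the unfolding; no mode argument needed
    print(tensor.data.shape)     # (2, 3, 4), i.e. the original shape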

The mode-n product is a multiplication of a tensor by a matrix along the n-th mode of a tensor. This essentially means that each mode-n fiber should be multiplied by this matrix. Mathematically, this can be expressed as:

$\mathcal{X} \times_n \mathbf{A} = \mathcal{Y} \iff \mathbf{Y}_{(n)} = \mathbf{A}\mathbf{X}_{(n)}$   (12)

This is equivalent to a projection of a tensor unfolded along a certain mode onto the space spanned by a matrix. For distinct modes in a series of multiplications, the order of the multiplication is irrelevant:

$\mathcal{X} \times_n \mathbf{A} \times_m \mathbf{B} = \mathcal{X} \times_m \mathbf{B} \times_n \mathbf{A} \quad (m \neq n)$   (13)


On the other hand, consecutive multiplications along the same mode can be simplified as: X ×n A ×n B = X ×n (BA)

(14)

In order to perform these operations, HOTTBOX equips Tensor objects with the corresponding method. Similarly to unfold and fold, it can modify the underlying data structure either in place or provide a new object. On top of that, consecutive mode-n products can be chained together for user convenience.
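A short sketch is given below: first the identity of Eq. (12) is verified with plain NumPy, and then the corresponding Tensor method is invoked. The method name mode_n_product is an assumption about the HOTTBOX API; only the existence of such a method is stated in the text.

    import numpy as np
    from hottbox.core import Tensor

    X = np.random.rand(2, 3, 4)
    A = np.random.rand(5, 3)                           # applied along mode 1

    # NumPy version of Eq. (12): unfold along mode 1, multiply, fold back.
    X1 = np.moveaxis(X, 1, 0).reshape(3, -1)           # mode-1 unfolding, shape (3, 8)
    Y = np.moveaxis((A @ X1).reshape(5, 2, 4), 0, 1)   # fold back, shape (2, 5, 4)

    # Assumed HOTTBOX equivalent (method name is an assumption):
    Y_tensor = Tensor(X).mode_n_product(A, mode=1)
    print(np.allclose(Y, Y_tensor.data))               # True if the API behaves as assumed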

3.2 Efficient Representation of Multi-dimensional Arrays
Oftentimes, N-dimensional data contains a considerable amount of repetitive and redundant information. By applying different numerical methods, i.e. tensor decompositions, this information can be represented in a more compact and efficient way. There are three major, conceptually different factorisation types: the Kruskal, Tucker and Tensor Train forms (Fig. 5). Each of them can be obtained either from the original tensor, by applying the associated tensor decomposition algorithm, or constructed from scratch. For all representations and decomposition algorithms, HOTTBOX provides an API that is as consistent as possible.

Fig. 5 Fundamental types of efficient representations of an N-dimensional array: Kruskal, Tucker and Tensor Train forms


Fig. 6 Intuition behind outer product operation and an associated tensor X of rank-1

3.2.1 Kruskal Form and CPD Algorithm

The central operator in tensor analysis is the outer product (sometimes referred to as the tensor product). Consider tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$; their outer product yields a tensor of higher order than both of them:

$\mathcal{A} \circ \mathcal{B} = \mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}, \qquad a_{i_1,\ldots,i_N} \, b_{j_1,\ldots,j_M} = c_{i_1,\ldots,i_N,\, j_1,\ldots,j_M}$   (15)

Most of the time we deal with the outer product of vectors, which significantly simplifies the general form expressed above and establishes one of the most fundamental definitions. A tensor of order N is said to be of rank-1 if it can be represented as an outer product of N vectors (Fig. 6):

$\mathcal{X} = \mathbf{a}_1 \circ \mathbf{a}_2 \circ \cdots \circ \mathbf{a}_N$   (16)

Consequently, the rank of a tensor, $\mathcal{X}$, is defined as the smallest number of rank-1 tensors that produce $\mathcal{X}$ as their linear combination. This definition of a tensor rank is commonly referred to as the Kruskal rank and is associated with one of the fundamental multi-linear structures illustrated in Fig. 7. This data format is known as the Canonical Polyadic (CP) representation, or Kruskal form, of the raw data. For a third-order tensor, $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, of rank R it can be expressed as follows:

$\mathcal{X} = \sum_{r=1}^{R} \mathcal{X}_r = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$   (17)


Fig. 7 The Canonical Polyadic representation of a third-order tensor, X, in the form of factor matrices A, B and C interconnected via a super-diagonal core tensor Λ

The vectors $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$ are oftentimes combined into the corresponding factor matrices:

$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R], \quad \mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R], \quad \mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_R]$   (18)

Thus, if we employ the mode-n product, the CP representation takes the form:

$\mathcal{X} = \boldsymbol{\Lambda} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = [\![\boldsymbol{\Lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$   (19)

where $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$ and $\boldsymbol{\Lambda} \in \mathbb{R}^{R \times R \times R}$ is a super-diagonal core tensor with values $\boldsymbol{\Lambda}[r, r, r] = \lambda_r$ and zeros elsewhere. An important characteristic of the Kruskal form is that it imposes a one-to-one relation between the column vectors of these factor matrices, each of which corresponds to a particular dimension of the raw data. There are many algorithms to obtain a Kruskal form; however, they are based on the assumption that the rank of the multi-dimensional array of data is known beforehand. The most popular approach is the alternating least squares (ALS) method, due to its simplicity and satisfactory performance for well-defined problems. The goal is to find the CP representation $[\![\boldsymbol{\Lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$ which provides the best approximation of the original tensor $\mathcal{X}$:

$\min \|\mathcal{X} - [\![\boldsymbol{\Lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]\|^2_F$   (20)

The alternating least squares approach fixes B and C to solve this optimisation problem for A, then fixes A and C to solve for B, then fixes A and B to solve for C, and continues to repeat the entire procedure until convergence criterion is satisfied. This procedure is summarised in Algorithm 1.


Algorithm 1. Basic ALS for the CP decomposition of a 3rd-order tensor
Input: $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ and rank R
Output: Factor matrices $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$, and scaling vector $\boldsymbol{\lambda} \in \mathbb{R}^{R}$
1: Initialise factor matrices A, B, C
2: while not converged or iteration limit is not reached do
3:   $\mathbf{A} \leftarrow \mathbf{X}_{(1)} (\mathbf{C} \odot \mathbf{B})(\mathbf{C}^T\mathbf{C} \ast \mathbf{B}^T\mathbf{B})^{\dagger}$
4:   $\mathbf{B} \leftarrow \mathbf{X}_{(2)} (\mathbf{C} \odot \mathbf{A})(\mathbf{C}^T\mathbf{C} \ast \mathbf{A}^T\mathbf{A})^{\dagger}$
5:   $\mathbf{C} \leftarrow \mathbf{X}_{(3)} (\mathbf{B} \odot \mathbf{A})(\mathbf{B}^T\mathbf{B} \ast \mathbf{A}^T\mathbf{A})^{\dagger}$
6:   Normalise each column of A, B and C to unit length (by computing the norm of each column vector and dividing each element of the vector by this norm) and store the norms in $\boldsymbol{\lambda}$
7: end while
8: return $[\![\boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$

Within the HOTTBOX API, the CPD-ALS algorithm is implemented through CPD class. A raw multi-dimensional array of data can be then represented in a Kruskal form by using decompose method which takes a Tensor object and corresponding value of desired rank as its inputs and produces an object of TensorCPD class.
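A minimal usage sketch follows, based on the class and method names stated above (CPD, decompose, Tensor, TensorCPD); the import path and the tuple form of the rank argument are assumptions about recent HOTTBOX releases.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.decomposition import CPD

    tensor = Tensor(np.random.rand(5, 6, 7))
    cpd = CPD()                                     # ALS-based CP decomposition
    tensor_cpd = cpd.decompose(tensor, rank=(4,))   # Kruskal form of the data
    print(type(tensor_cpd).__name__)                # 'TensorCPD'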

The most common way to evaluate the performance of the CPD-ALS algorithm is by computing the residual error of approximation:

$\dfrac{\|\mathcal{X} - \widehat{\mathcal{X}}\|^2_F}{\|\mathcal{X}\|^2_F}$   (21)


where $\mathcal{X}$ is the original multi-dimensional array and $\widehat{\mathcal{X}}$ is the reconstruction from the Kruskal form, obtained as $\boldsymbol{\Lambda} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}$. Since this metric is frequently employed throughout different stages of multi-linear analysis, HOTTBOX provides the corresponding utility function for user convenience.
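For reference, Eq. (21) can also be computed directly with NumPy, as in the sketch below; the name of the HOTTBOX helper is not specified in the text, so any particular function name would be an assumption and is avoided here.

    import numpy as np

    def relative_error(x_orig, x_hat):
        # Eq. (21): squared Frobenius norm of the residual divided by the
        # squared Frobenius norm of the original array.
        return np.linalg.norm(x_orig - x_hat) ** 2 / np.linalg.norm(x_orig) ** 2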

The Kruskal form of the raw data provides an excellent compression ratio, i.e. for an N-th-order tensor all $I^N$ elements are efficiently represented through the CPD as a linear (instead of exponential) function of the number of elements in each mode. Once an acceptable level of approximation has been achieved, the CP representation is oftentimes exported into a new file which can replace the raw data, hence significantly reducing storage requirements. For every subsequent analysis of the data, the TensorCPD object can be constructed based only on its factor matrices and the values of the super-diagonal of the core tensor.
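A sketch of such a construction from synthetic factor matrices is shown below; the keyword names fmat and core_values are assumptions about the TensorCPD constructor.

    import numpy as np
    from hottbox.core import TensorCPD

    I, J, K, R = 5, 6, 7, 4
    A, B, C = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)
    Lambda = np.ones(R)                     # values on the super-diagonal of the core

    tensor_cpd = TensorCPD(fmat=[A, B, C], core_values=Lambda)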

In this example we have shown creation of such object using synthetic data for A, B, C and Lambda, however, for the real world application the corresponding value can be obtained by reading information from a set of files. For the use cases that require the full representation of the data the process of converting TensorCPD to Tensor is as simple as:
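Continuing the sketch above, a single call recovers the dense array; the method name reconstruct is an assumption about the TensorCPD interface.

    tensor_full = tensor_cpd.reconstruct()   # dense Tensor of shape (I, J, K)
    print(tensor_full.data.shape)            # (5, 6, 7)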


Fig. 8 A third-order tensor, X, represented in the Tucker form where factor matrices A, B and C represent distinct subspaces spanned by mode-1, mode-2 and mode-3 fibers of the original tensor X

3.2.2 Tucker Form and HOSVD Algorithm

Tucker form is another member of the fundamental efficient representations of raw data and is illustrated in Fig. 8. For a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, the Tucker form represents it as a dense core tensor, $\mathcal{G} \in \mathbb{R}^{Q \times R \times P}$, and a set of factor matrices $\mathbf{A} \in \mathbb{R}^{I \times Q}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$ and $\mathbf{C} \in \mathbb{R}^{K \times P}$. Here, the tuple $(Q, R, P)$ represents the multi-linear rank of the tensor $\mathcal{X}$, where each value is the rank of the subspace spanned by the fibers associated with the corresponding mode of $\mathcal{X}$, i.e. Q for mode-0, R for mode-1 and P for mode-2. For the general case of a tensor of order N with multi-linear rank $(R_1, R_2, \ldots, R_N)$ the values $R_n$ are not necessarily the same, whereas for matrices (tensors of order 2) the equality $R_1 = R_2$ always holds, where $R_1$ and $R_2$ are the matrix column rank and row rank respectively. The Tucker form of a tensor is closely related to the CP representation and can be expressed through a sequence of mode-n products in a similar way:

$\mathcal{X} = \sum_{q=1}^{Q} \sum_{r=1}^{R} \sum_{p=1}^{P} g_{qrp} \, \mathbf{a}_q \circ \mathbf{b}_r \circ \mathbf{c}_p = \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = [\![\mathcal{G}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$   (22)


Observe that the one-to-one relation between factor vectors $\mathbf{a}_q$, $\mathbf{b}_r$ and $\mathbf{c}_p$ inherent to the Kruskal form no longer holds. On top of that, the multi-linear rank, rank-(Q, R, P), of the Tucker form provides more flexibility in approximating the original data without a detrimental effect on the compression ratio, as opposed to the CP representation. In practice, there exist several algorithms to represent a given tensor in the Tucker format. The most simplistic one is the Higher Order Singular Value Decomposition (HOSVD), which can be seen as a natural extension of the Singular Value Decomposition (SVD) in the sense that it also constrains the factor matrices to be orthogonal. The latter are computed as truncated versions of the left singular matrices of all possible mode-n unfoldings of the tensor $\mathcal{X}$:

$\mathbf{X}_{(1)} = \mathbf{U}_1 \boldsymbol{\Sigma}_1 \mathbf{V}_1^T \;\Rightarrow\; \mathbf{A} = \mathbf{U}_1[1:R_1]$
$\mathbf{X}_{(2)} = \mathbf{U}_2 \boldsymbol{\Sigma}_2 \mathbf{V}_2^T \;\Rightarrow\; \mathbf{B} = \mathbf{U}_2[1:R_2]$
$\mathbf{X}_{(3)} = \mathbf{U}_3 \boldsymbol{\Sigma}_3 \mathbf{V}_3^T \;\Rightarrow\; \mathbf{C} = \mathbf{U}_3[1:R_3]$   (23)

After factor matrices are obtained, the core tensor G is computed as a projection of the original tensor, X on the subspaces spanned by the A, B, C: G = X ×1 A T ×2 B T ×3 C T

(24)

The truncated HOSVD, however, does not provide an optimal approximation of the raw multi-dimensional array. This issue can be addressed by employing an iterative procedure of updating the factor matrices through an ALS-based algorithm, conventionally referred to as Higher Order Orthogonal Iteration (HOOI) and summarised in Algorithm 2.
Algorithm 2. Higher Order Orthogonal Iteration (HOOI)
Input: $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and multi-linear rank $(R_1, R_2, \ldots, R_N)$
Output: The core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and factor matrices $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n \in [1, N]$
1: Initialise factor matrices $\mathbf{A}^{(n)}$
2: while not converged or iteration limit is not reached do
3:   for n = 1, ..., N do
4:     $\widehat{\mathcal{X}} \leftarrow \mathcal{X} \times_{p \neq n} \{\mathbf{A}^{(p)}\}^T$
5:     $\mathbf{U} \leftarrow$ left singular matrix of $\widehat{\mathbf{X}}_{(n)}$ (via SVD)
6:     $\mathbf{A}^{(n)} \leftarrow \mathbf{U}[1:R_n]$
7:   end for
8:   $\mathcal{G} \leftarrow \mathcal{X} \times_1 \mathbf{A}^{(1)T} \times_2 \mathbf{A}^{(2)T} \times_3 \cdots \times_N \mathbf{A}^{(N)T}$
9: end while
10: return $[\![\mathcal{G}; \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!]$

Implementations of both the HOSVD and HOOI algorithms are provided by HOTTBOX; they use a similar interface to CPD-ALS and accept akin parameters, e.g. a Tensor and the values of the desired multi-linear rank. However, due to structural and conceptual differences between the Kruskal and Tucker forms, outlined in Eqs. (19) and (22)


respectively, the output of decompose method provides an object of TensorTKD class.
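A usage sketch mirroring the CPD example is given below; the class names HOSVD and HOOI follow the description above, while the import path and the tuple form of the multi-linear rank argument are assumptions.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.decomposition import HOSVD, HOOI

    tensor = Tensor(np.random.rand(5, 6, 7))
    tensor_tkd = HOSVD().decompose(tensor, rank=(3, 4, 5))       # Tucker form (TensorTKD)
    tensor_tkd_hooi = HOOI().decompose(tensor, rank=(3, 4, 5))   # iteratively refined Tucker form
    print(type(tensor_tkd).__name__)                             # 'TensorTKD'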

Observe that the relative error of approximation (see Eq. (21)) of the raw data by its Tucker form is significantly lower than for its CP representation. This is attributed to the flexibility of the multi-linear rank, rank-(Q, R, P), and the dense nature of the core tensor of the Tucker form, as opposed to the super-diagonal structure employed by the Kruskal form. Moreover, the compression ratio has not been affected either, as the reduced storage requirements of the factor matrices A, B and C compensate for the increased number of non-zero values that describe the core tensor $\mathcal{G}$. Thus, for an N-th order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ the storage costs of the TensorCPD and TensorTKD models are as follows:

$O_{\mathrm{TensorCPD}} = R \sum_{n=1}^{N} I_n, \qquad O_{\mathrm{TensorTKD}} = \sum_{n=1}^{N} I_n R_n + \prod_{n=1}^{N} R_n \approx \sum_{n=1}^{N} I_n R_n$   (25)

Due to the inherent similarity of the Kruskal and Tucker forms of efficient representation of raw data, the construction of TensorTKD objects, either from scratch or from information previously saved into an external file, follows a procedure analogous to that of TensorCPD:
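A sketch of such a construction is shown below; as for TensorCPD, the keyword names fmat and core_values are assumptions about the constructor.

    import numpy as np
    from hottbox.core import TensorTKD

    I, J, K = 5, 6, 7
    Q, R, P = 3, 4, 5
    A, B, C = np.random.rand(I, Q), np.random.rand(J, R), np.random.rand(K, P)
    G = np.random.rand(Q, R, P)              # dense core tensor

    tensor_tkd = TensorTKD(fmat=[A, B, C], core_values=G)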

3.2.3 Tensor Train Form and TTSVD Algorithm

Despite all the advantages in efficient representation and data handling offered by the Kruskal and Tucker forms, their performance deteriorates and suffers from veracity issues as the order of a multi-dimensional array grows, thus making them infeasible for real-world applications. A numerically reliable way to tackle such cases was originally introduced in the field of quantum information theory through the concept of the Tensor Network (TN). A TN represents a given raw tensor as a set of sparsely interconnected lower-order tensors and factor matrices, generally referred to as TT-cores. This makes processing of such a form more suitable for distributed storage and computation approaches. A particular case of TNs is the Tensor Train form, which only allows coupling of adjacent TT-cores, like the one depicted in Fig. 9. Mathematically speaking, the obtained Tensor Train representation of a 5-th order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_5}$ can be expressed as:

$\mathcal{X} = [\![\mathbf{A}, \mathcal{G}^{(1)}, \mathcal{G}^{(2)}, \mathcal{G}^{(3)}, \mathbf{B}]\!] = \mathbf{A} \times^1_2 \mathcal{G}^{(1)} \times^1_3 \mathcal{G}^{(2)} \times^1_3 \mathcal{G}^{(3)} \times^1_3 \mathbf{B}$   (26)


Fig. 9 The Tensor Train representation of a fifth-order tensor X ∈ R I1 ×I2 ×···×I5 . Each component of the Tensor Train exhibits a particular characteristic of the original multi-dimensional array X

where the elements of the Tensor Train structure exhibit the following dimension sizes: $\mathbf{A} \in \mathbb{R}^{I_1 \times R_1}$, $\mathcal{G}^{(1)} \in \mathbb{R}^{R_1 \times I_2 \times R_2}$, $\mathcal{G}^{(2)} \in \mathbb{R}^{R_2 \times I_3 \times R_3}$, $\mathcal{G}^{(3)} \in \mathbb{R}^{R_3 \times I_4 \times R_4}$ and $\mathbf{B} \in \mathbb{R}^{R_4 \times I_5}$. The TTSVD algorithm for a multi-dimensional array of any order involves an iterative procedure composed of a series of foldings and unfoldings of the original tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, in conjunction with the truncated SVD. At every cycle a core $\mathcal{G}^{(n)} \in \mathbb{R}^{R_n \times I_{n+1} \times R_{n+1}}$ is computed, based on the values of the TT-rank $(R_1, R_2, \ldots, R_N)$, which are either estimated on the fly or have been specified beforehand.
Algorithm 3. TTSVD for an N-th order tensor
Input: Data tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and desired accuracy $\epsilon$
Output: Core tensors $\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(N)}$ that approximate $\mathcal{X}$ with desired accuracy $\epsilon$
1: Initialise $R_0 = 1$
2: Compute truncation parameter $\delta = \frac{\epsilon}{\sqrt{N-1}} \|\mathcal{X}\|_F$
3: $\mathbf{Z} \leftarrow \mathbf{X}_{(1)}$
4: for n = 1 to N − 1 do
5:   Compute δ-truncated SVD: $\mathbf{Z} = \mathbf{U}\mathbf{S}\mathbf{V} + \mathbf{E}$, s.t. $\|\mathbf{E}\|_F \le \delta$; $\mathbf{U} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$
6:   Estimate the n-th TT-rank: $R_n = \mathrm{size}(\mathbf{U}, 2)$
7:   $\mathcal{G}^{(n)} \leftarrow \mathrm{reshape}(\mathbf{U}, [R_{n-1}, I_n, R_n])$
8:   $\mathbf{Z} \leftarrow \mathrm{reshape}(\mathbf{S}\mathbf{V}^T, [R_n I_{n+1}, \prod_{p=n+2}^{N} I_p])$
9: end for
10: Construct the last core as $\mathcal{G}^{(N)} \leftarrow \mathrm{reshape}(\mathbf{Z}, [R_{N-1}, I_N, 1])$
11: return $\mathcal{G}^{(1)}, \mathcal{G}^{(2)}, \ldots, \mathcal{G}^{(N)}$

Algorithm 3 summarises the step-by-step procedure required to represent an original multi-dimensional array in the Tensor Train form. Similarly to all previously considered methods, one can employ the TTSVD class in order to obtain the TensorTT data structure, which completes the HOTTBOX ecosystem of fundamental efficient representations of N-dimensional arrays illustrated in Fig. 5.
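A short sketch of this step is given below; the class name TTSVD follows the text, while the import path and the rank argument format are assumptions.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.decomposition import TTSVD

    tensor = Tensor(np.random.rand(4, 5, 6, 7, 8))             # 5th-order array
    tensor_tt = TTSVD().decompose(tensor, rank=(2, 3, 3, 2))   # Tensor Train form
    print(type(tensor_tt).__name__)                            # 'TensorTT'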


4 Applications
4.1 Image Compression with HOSVD
Color images can be naturally represented as a tensor of order three, $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, where I and J can take any arbitrary value since they correspond to the height and width of an image respectively. However, the last dimension of the tensor, K, is typically attributed to the RGB channels (Red, Green and Blue), thus its size is equal to three. Keeping the original structure of $\mathcal{X}$ allows us to employ multi-linear techniques. One of the direct applications of tensor decompositions is compression of the data. In Sect. 3.2 we have covered the HOSVD algorithm as well as how to take advantage of the HOTTBOX API in order to represent the original multi-dimensional array in the Tucker form. Recall that the compression rate is highly dependent on the choice of the multi-linear rank values at hand:

$\dfrac{\sum_{n=1}^{N} I_n R_n}{\prod_{n=1}^{N} I_n}$   (27)


Fig. 10 Importance of the selection of multi-linear rank for data recovery at the reconstruction. The original image (left) has been decomposed via HOOI with different values of multi-linear rank where wrong choice might lead to complete loss of information, i.e. colour distribution (right)

where the numerator and denominator are the storage complexities of the Tucker form with multi-linear rank $(R_1, R_2, \ldots, R_N)$ and of the raw tensor, respectively. It is apparent that the values of the multi-linear rank are the determinant factors in achieving a high level of compression. At the same time, different modes of an N-dimensional array exhibit distinct characteristics of the underlying data values. Thus, if one wants to exploit HOSVD for data compression purposes, then the physical meaning of every mode should be accounted for when selecting an optimal multi-linear rank for the decomposition. For instance, a color image represented in Tucker form with multi-linear rank $(R_1, R_2, 1)$ would leave us with only a single vector to describe all color information. This is insufficient to model all color variations, resulting in a reconstructed image of grey shades, as depicted in the rightmost column of Fig. 10. Likewise, if an image contains all three base colours then the choice of multi-linear rank $(R_1, R_2, 2)$ would dispose of one of them. This "side-effect" of compression can be observed in the illustration of an apple in the second column from the right of Fig. 10.
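The sketch below illustrates this workflow on a synthetic stand-in for an RGB image (a real image could be loaded with any imaging library); HOOI is used as in Fig. 10, and the method/attribute names reconstruct and data are assumptions about the TensorTKD interface.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.decomposition import HOOI

    image = np.random.rand(256, 256, 3)          # stand-in for a height x width x RGB image
    tensor = Tensor(image)

    for ml_rank in [(50, 50, 3), (50, 50, 2), (50, 50, 1)]:
        tkd = HOOI().decompose(tensor, rank=ml_rank)
        recon = tkd.reconstruct().data
        err = np.linalg.norm(image - recon) ** 2 / np.linalg.norm(image) ** 2
        print(ml_rank, round(err, 4))            # colour information degrades as the last rank shrinks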

4.2 Tensor Ensemble Learning
In recent years, the machine learning community has witnessed a rise in popularity of Ensemble Learning [11], which stems from the well-known phenomenon of the wisdom of the crowd. The essence of this framework is to combine several base models in order to improve the overall performance, by virtue of increasing the stability and the accuracy of ML algorithms through minimising noise, bias or variance. It has been proven to be a powerful way to boost the predictive power when solving


both classification and regression types of problems [13, 18]. The design of ensemble learning techniques and approaches can be generalised to inherently multidimensional data, in order to account for the underlying multi-modal dependencies intrinsic to tensor-valued data. The Tensor Ensemble Learning (TEL) framework, originally introduced in [16], is based on the direct application of tensor decompositions and the properties associated with the corresponding data representations, such as the Tucker form and the Kruskal form. Recall that the factor matrices of the Tucker form are mutually uncorrelated, as they describe different characteristics (modes) of the original multi-dimensional array. On top of that, the HOSVD algorithm enforces orthogonality constraints on every factor matrix. This can be utilised for the formation of surrogate datasets which are independent from each other, thus fulfilling a necessary condition for an ensemble learning model to be successful. Several different TEL algorithms have been incorporated into HOTTBOX, and all of them share a similar interface. Here we will consider a particular implementation, TelVI, suitable for the classification of a dataset composed of tensor-valued samples. The schematic diagram of such a model is illustrated in Fig. 11. First, every raw multi-dimensional sample has to be represented in the Tucker form through the HOSVD algorithm with a fixed multi-linear rank $(R_a, R_b, R_c)$ in order to extract latent components.
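A sketch of this per-sample step is shown below; the sample shapes, the number of samples and the rank are placeholders, and the HOSVD call follows the assumptions made in the earlier examples.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.decomposition import HOSVD

    ml_rank = (2, 2, 2)                                         # (Ra, Rb, Rc)
    samples = [np.random.rand(8, 8, 3) for _ in range(100)]     # tensor-valued samples
    hosvd = HOSVD()
    samples_tkd = [hosvd.decompose(Tensor(x), rank=ml_rank) for x in samples]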

The above snippet corresponds to the top right of Fig. 11. Since the HOSVD is applied to every sample individually, it makes no difference whether all samples are processed in a batch or after splitting them into the train and test subsets. Next, a TelVI classifier needs to be initialised; it expects a collection of separate base learners, each of which will be associated with a particular factor vector of the Tucker form, as depicted at the bottom of Fig. 11. In other words, the required number of base learners is determined by the sum of the multi-linear rank values used for the HOSVD. We will proceed with the implementation of the Support Vector Machine (SVM) provided by scikit-learn, one of the most popular Python packages for classical ML [22].


Fig. 11 The schematic diagram of the Tensor Ensemble Learning - Vector Independent (TELVI) algorithm. Original tensor-valued samples are represented in Tucker form through the HOSVD with multi-linear rank $(R_a, R_b, R_c)$. All factor vectors of the training samples, $\mathcal{X}_m$, are reorganised into separate datasets, thus requiring only $N = R_a + R_b + R_c$ classifiers to be trained. At the final stage, the new latent components $\mathbf{a}_1^{\mathrm{new}}, \ldots, \mathbf{c}_{R_c}^{\mathrm{new}}$, extracted from a new sample $\mathcal{X}^{\mathrm{new}}$, are utilised for majority voting to predict the label of $\mathcal{X}^{\mathrm{new}}$
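Continuing the sketch, one base learner per latent component is created and passed to the classifier; the parameter name base_clf is taken from the text, while the TelVI import path and the use of probability-enabled SVC learners (needed for predict_proba) are assumptions.

    from sklearn.svm import SVC
    from hottbox.algorithms.classification import TelVI

    n_learners = sum(ml_rank)                                # Ra + Rb + Rc base learners
    base_clf = [SVC(probability=True) for _ in range(n_learners)]
    tel_vi = TelVI(base_clf=base_clf)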


Here we have created a set of homogeneous base classifiers; however, the parameter base_clf can accommodate a mix of heterogeneous objects representing ML algorithms, e.g. logistic regression, neural networks, etc. The only restriction imposed on the members of base_clf is that each of them should contain an implementation of the methods fit, predict and predict_proba, i.e. adhere to the scikit-learn type API [4]. Finally, the components of the Tucker representations of the raw data obtained through HOSVD need to be reorganised into a set of new surrogate datasets, each containing incomplete information about the original samples. Next, these datasets are fed into the corresponding base learners (see bottom of Fig. 11) during the training stage. The obtained series of independent hypotheses about the underlying processes is then utilised to classify previously unseen data samples, which is achieved by aggregating the individually learned knowledge through a weighted majority vote. For user convenience, all steps of this procedure are encapsulated and available via the methods fit and score for the training and testing stages respectively.
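Continuing the sketches above, training and evaluation then reduce to two calls; the method names fit and score are taken from the text, whereas the expected input format (lists of TensorTKD objects with integer labels) is an assumption, and the labels here are random placeholders.

    import numpy as np

    labels = np.random.randint(0, 2, size=len(samples_tkd))   # placeholder class labels
    split = 80
    tel_vi.fit(samples_tkd[:split], labels[:split])
    accuracy = tel_vi.score(samples_tkd[split:], labels[split:])
    print(accuracy)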

It is common practice to perform hyper-parameter tuning of ML algorithms in order to enhance their performance. Recall that TelVI employs a series of independent base classifiers. Thus, a grid search for the optimal parameters can be conducted for every one of them with the use of the grid_search method. It requires specifying a list of key-value pairs, each of which defines the name of a hyper-parameter and a set of corresponding values to try.
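A hedged sketch of such a search is given below; only the idea of one parameter grid per base learner is taken from the text, while the grid_search signature and the specific SVC hyper-parameters are assumptions.

    # One grid of candidate hyper-parameters per base learner (here all identical).
    search_params = [{"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}
                     for _ in range(n_learners)]
    tel_vi.grid_search(search_params, samples_tkd[:split], labels[:split])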


4.3 Support Tensor Machine with Application in Finance
The Support Tensor Machine (STM) is a tensor extension of the common Support Vector Machine (SVM) [5, 24]. In other words, it is a binary classifier which aims at separating the elements of a given dataset into a "positive" and a "negative" class. Consider M input-output pairs, $\{\mathcal{X}_m, t_m\}$, where each $\mathcal{X}_m \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is an N-th order tensor and $t_m$ the desired label/target. The STM operates on the pairs $\{\mathcal{X}_m, t_m\}$ and assigns the binary label/target $\{+1, -1\}$ by solving

$\min_{\mathbf{w}_n, b} \; \frac{1}{2}\|\mathbf{w}_n\|^2 \prod_{\substack{1 \le i \le N \\ i \neq n}} \|\mathbf{w}_i\|^2 + C \sum_{m=1}^{M} \xi_m$
s.t. $t_m \big(\mathbf{w}_n^T (\mathcal{X}_m \bar{\times}_{i \neq n} \mathbf{w}_i) + b\big) \ge 1 - \xi_m$, $\quad \xi_m > 0$, $\quad m = 1, 2, \ldots, M$   (28)

Algorithm 4. Least-Squares Support Tensor Machine (LS-STM)
Input: Dataset $\{\mathcal{X}_m, t_m\}$, m = 1, ..., M, with $\mathcal{X}_m \in \mathbb{R}^{I_1 \times \cdots \times I_N}$
Output: Set of weights $\{\mathbf{w}_n\}$, n = 1, ..., N, and bias b
1: Initialise randomly $\{\mathbf{w}_n\}$, n = 1, ..., N
2: while not converged or max iterations reached do
3:   for n = 1 to N do
4:     $\eta \leftarrow \prod_{i \neq n} \|\mathbf{w}_i\|^2$
5:     $\mathbf{x}_m = \mathcal{X}_m \bar{\times}_{i \neq n} \mathbf{w}_i$
6:     Find $\mathbf{w}_n$ by optimising:
       $\min_{\mathbf{w}_n, b, \boldsymbol{\epsilon}} \; \frac{\eta}{2}\|\mathbf{w}_n\|^2 + \frac{C}{2}\boldsymbol{\epsilon}^T\boldsymbol{\epsilon}$
       s.t. $t_m(\mathbf{w}_n^T \mathbf{x}_m + b) = 1 - \epsilon_m$, $\quad m = 1, \ldots, M$
7:   end for
8: end while
9: return $\{\mathbf{w}_n\}$, n = 1, ..., N, and b

The LS-STM solution is then obtained through the alternating least-squares (ALS) iteration and is given by

$\min_{\mathbf{w}_n, b} \; \frac{1}{2}\|\mathbf{w}_n\|^2 \prod_{\substack{1 \le i \le N \\ i \neq n}} \|\mathbf{w}_i\|^2 + \frac{\gamma}{2}\boldsymbol{\epsilon}^T\boldsymbol{\epsilon}$
s.t. $t_m \big(\mathbf{w}_n^T (\mathcal{X}_m \bar{\times}_{i \neq n} \mathbf{w}_i) + b\big) = 1 - \epsilon_m$, $\quad m = 1, \ldots, M$   (29)


The corresponding procedure is summarised in Algorithm 4. Finally, a label is assigned to a new datapoint $\mathcal{X}^*$ by

$t^* = \mathrm{sign}\big(y(\mathcal{X}^*)\big) = \mathrm{sign}\big(\mathcal{X}^* \times_1 \mathbf{w}_1 \times_2 \cdots \times_N \mathbf{w}_N + b\big)$   (30)

The LSSTM framework is efficiently implemented in HOTTBOX as a binary classifier, and can be intuitively used in an object-oriented fashion as shown in the following code snippet. An instance of the lsstm = LSSTM() class is created, and methods for training and testing are called with lsstm.fit() and lsstm.predict(), respectively.
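A sketch of this workflow is given below; the names LSSTM, fit and predict are taken from the text, while the import path, the constructor argument C and the placeholder data are assumptions.

    import numpy as np
    from hottbox.core import Tensor
    from hottbox.algorithms.classification import LSSTM

    X_train = [Tensor(np.random.rand(5, 5, 3)) for _ in range(40)]   # placeholder training samples
    y_train = np.where(np.random.rand(40) > 0.5, 1, -1)              # labels in {-1, +1}
    X_test = [Tensor(np.random.rand(5, 5, 3)) for _ in range(10)]

    lsstm = LSSTM(C=10)
    lsstm.fit(X_train, y_train)
    y_pred = lsstm.predict(X_test)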

In the following, we show a practical application of the LSSTM to the problem of financial forecasting. The task is to predict the one-day ahead price movement of the S&P 500. As predictors we use gold prices and VIX (“implied index of volatility”— see [5] for details). For fair comparison, the same features are fed into an SVM, and results are shown in Fig. 12. Notice the better performance of LSSTM over SVM in terms of generated profit. Moreover, the fact that the generated curves are mostly insensitive to the change of the parameter C suggests that the LSSTM does well in exploiting the relevant structure within the data [5].

Fig. 12 Cumulative profits generated using SVM and LS-STM based strategies over 2008-2016, for varying values of C (C = 0.01, 0.1, 1, 10, 30, 50, 100) and compared against the S&P 500: (a) profit based on SVM; (b) profit based on LS-STM

5 Conclusion
A comprehensive tutorial on HOTTBOX, a Python library for tensor methods, has been provided with the aim to bridge the gap between the theory of multi-linear analysis and the direct implementation of tensor decompositions, which have applications in a wide range of practical disciplines. The tutorial has been presented as a set of easy-to-read and practical examples in the form of Python code snippets suitable for industry practitioners, academic researchers and interested readers. HOTTBOX offers a simple and consistent platform to perform tensor signal processing and machine learning in an accessible and straightforward fashion. This has been achieved by employing state-of-the-art tensor methods and operations which follow and seamlessly integrate with the standards adopted by the Python scientific community. The library's speed and ease of use have been demonstrated to allow for an efficient remedy to the well-known "Curse of Dimensionality" associated with tensor decompositions.


Future work aims to further develop the toolbox so as to cater for a broader range of tensor methods and applications, including probabilistic tensor models. HOTTBOX is available at https://hottbox.github.io. Acknowledgements The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged.

References 1. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the hypothesis space for improving the generalization ability of support vector machines. In: IEEE International Joint Conference on Neural Networks (2011) 2. Bellman, R.: Curse of Dimensionality. Adaptive Control Processes: A Guided Tour. Princeton, New Jersey (1961) 3. Beylkin, G., Mohlenkamp, M.J.: Algorithms for numerical analysis in high dimensions. SIAM J. Sci. Comput. 26(6), 2133–2159 (2005) 4. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: Experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122 (2013) 5. Calvi, G.G., Lucic, V., Mandic, D.P.: Support tensor machine for financial forecasting. In: Proceedings of the 44th International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8152–8156 (2019) 6. Carroll, J.D., Chang, J.: Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika 35(3), 283–319 (1970) 7. Cichocki, A., Lee, N., Oseledets, I., Phan, A.H., Zhao, Q., Mandic, D.P.: Tensor networks for dimensionality reduction and large-scale optimization. Part 1: Low-rank tensor decompositions. Found. Trends Mach. Learn. 9(4–5), 249–429 (2016) 8. Cichocki, A., Mandic, D.P., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015) 9. Coraddu A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean Eng. 130, 351–370 (2017) 10. De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-(r1 , r2 , ..., rn ) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000) 11. Dietterich, T.G.: Ensemble methods in machine learning. In: Proceedings of International Workshop on Multiple Classifier Systems, pp. 1–15 (2000) 12. Dolgov, S., Savostyanov, D.: Alternating minimal energy methods for linear systems in higher dimensions. SIAM J. Sci. Comput. 36(5), A2248–A2271 (2014) 13. Džeroski, S., Ženko, B.: Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54(3), 255–273 (2004) 14. Fanaee-T, H., Gama, J.: Tensor-based anomaly detection: an interdisciplinary survey. Knowl.Based Syst. 98, 130–147 (2016) 15. Kisil, I., Moniri, A., Calvi, G.G., Scalzo Dees, B., Mandic, D.P.: HOTTBOX: Higher Order Tensors ToolBOX. https://github.com/hottbox 16. Kisil, I., Moniri, A., Mandic, D.P.: Tensor ensemble learning for multidimensional data. In: Proceedings for the IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 1358–1362 (2018)


17. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009) 18. Koren, Y.: The BellKor solution to the Netflix grand prize. Netflix Prize Doc. 81, 1–10 (2009) 19. Kroonenberg, P.M.: Applied multiway data analysis, 702 (2008) 20. Oneto, L., Ridella, S., Anguita, D.: Tikhonov, ivanov and morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015) 21. Oseledets, I.V.: Tensor-train decomposition. SIAM J. Sci. Comput. 33(5), 2295–2317 (2011) 22. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 23. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences (2004) 24. Tao, D., Li, X., Hu, W., Maybank, S., Wu, X.: Supervised tensor learning. In: Proceedings of the 5th International Conference on Data Mining, pp. 8–16 (2005) 25. Tucker, L.R.: The extension of factor analysis to three-dimensional matrices. In: Contributions to Mathematical Psychology, pp. 110–127 (1964) 26. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311 (1966) 27. Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: European Conference on Technology Enhanced Learning (2015) 28. Van Der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22 (2011) 29. Zhao, Q., Zhou, G., Xie, S., Zhang, L., Cichocki, A.: Tensor ring decomposition (2016). arXiv:1606.05535

Deep Learning for Graphs Davide Bacciu and Alessio Micheli

Abstract We introduce an overview of methods for learning in structured domains covering foundational works developed within the last twenty years to deal with a whole range of complex data representations, including hierarchical structures, graphs and networks, and giving special attention to recent deep learning models for graphs. While we provide a general introduction to the field, we explicitly focus on the neural network paradigm showing how, across the years, these models have been extended to the adaptive processing of incrementally more complex classes of structured data. The ultimate aim is to show how to cope with the fundamental issue of learning adaptive representations for samples with varying size and topology.

1 Introduction

The aim of this chapter is to provide an overview of the field referred to as "learning in structured domains", with a specific focus on deep learning approaches for graph data. Since most known machine learning methods are limited to flat and fixed-size sample representations (i.e. fixed-length vectors), the possibility of also considering the relationships among individual samples, in the form of graphs and networks, opens an appealing field of research. Indeed, the processing of structured and graph data in machine learning offers an enormous opportunity to extend the range of successful application domains, in fields that span from the processing of language structures or molecular data to the analysis of social or biological networks (just to mention some noteworthy examples). On the other hand, dealing with structured and graph data requires learning models capable of adapting to complex samples of varying size and topology, capturing the relevant structural patterns to perform predictive and explorative tasks, while maintaining the efficiency and scalability necessary to process large-scale networks. In particular, the common objective among the different approaches is to directly treat structured data through learning models, without resorting to hand-crafted and partial transformations of structures into a set of fixed input features. The aim is therefore to learn a mapping (transduction) between an input structured domain, where information is represented as topology- and size-varying structured data (such as trees or graphs), and a discrete or continuous vectorial output space.

Considering the mentioned motivations, it is not surprising that there is a long tradition of machine learning works in the field of the adaptive processing of structured data (including theoretical results, new approaches and applications), starting from early extensions of neural models in the 90s, up to the renewed and increasing interest in the field of deep learning for graphs. This evolution will be presented in the sections of this chapter, following the historical line of development and the shift, in the literature, of point of view and nomenclature. Taking a mainly chronological approach, we will first introduce the basic aspects of learning with structured data, summarizing foundational works in this field developed in the last 20 years and following the line of development that leads from the neural processing of sequential data to hierarchical data. Then, we survey the major seminal neural network models for general classes of graphs, including cyclic graphs, highlighting concepts that help understanding and placing the most recent research in the appropriate context with respect to the literature. On this basis, in the second part, we delve into the most recent advancements in terms of deep learning for network and graph data, including learning structure embeddings, graph convolutions, attention-based models and structure pooling mechanisms. We conclude the chapter with a discussion on recent trends of research in the field, both from a methodological perspective and in terms of relevant applications. Finally, we highlight a couple of key issues the community is currently facing and that we believe to be fundamental for the long-term development of the field.

2 Foundational Models for Learning in Structured Domains: From Sequences and Trees to Graphs

To gently introduce the field of learning in structured domains, in this section we first recall the basic data structures by example, assuming the formal definitions from the standard literature. Then, we summarize the ideas guiding the evolution of models in the class of recursive approaches to transductions over structured domains. In particular, we first consider the adaptive processing of hierarchical data, from recurrent models for sequences to recursive models for trees, and then for more complex data. Finally, the discussion of the causality assumption in the recursive approaches leads us to introduce the main approaches to overcome the limits of such assumption and to present the two main and pioneering approaches to deal with directed/undirected cyclic/acyclic graphs by neural networks and deep learning methods.

Fig. 1 A labeled graph and its components

2.1 Basics on Data Structures and Transductions

The basic elements of a graph can be expressed by a simple example, as shown in Fig. 1, where a graph is given by its set of vertices or nodes (we will use the term nodes in the following), connected by edges (or links or arcs), which can be directed or undirected (where two directed/oriented mutual connections between two nodes can be considered equivalent to an undirected edge, see the example between the nodes denoted by the symbols "a" and "b" in Fig. 1). Labels (vectors) can be attached to each node and to each edge to convey numerical data, which in turn can codify symbolic information. A cycle can be present, as shown in the example. We define the neighbors of a node v as the set of nodes u connected by an edge (v, u) to v in the undirected version of the graph. For example, in Fig. 1, the neighbors of the node "b" are the node "a" on the left of "b", the node "a" on the right of "b" and the node "d".

For the sub-classes of graphs that will be used in the following, i.e. labeled sequences, trees (and in particular rooted k-ary trees), Directed Positional Acyclic Graphs (DPAG), Directed Acyclic Graphs (DAG) and general graphs, we refer to the classical definition of data structures as in [26]. Note that, in the order provided above, each of these classes of structures is a sub-case of the following one (e.g. a sequence is a k-ary tree with k = 1). Also, we resort to a graphical description (Fig. 2) to facilitate the intuitive line of this introduction to the topic. Throughout this section we will present models following an extension of the input domain corresponding to a similar chain of inclusions as described above and in Fig. 2. The aim is to show how it is possible to bring models closer to the nature of the data, instead of adapting the structured data to the input format used by the models (e.g. rewriting graphs into vectors to feed a feed-forward neural network, with alignment problems, loss of information, etc.), a point of view that is part of the "representation learning" area of research.
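To make the terminology above concrete, the following minimal Python sketch (purely illustrative: the class name, node symbols and labels are made up) stores a labeled graph as an adjacency structure and exposes the neighborhood function N(v) used throughout this chapter.

```python
import numpy as np

class LabeledGraph:
    """A labeled graph stored as an adjacency list, with vector labels on nodes."""

    def __init__(self):
        self.labels = {}   # node -> label vector (numpy array)
        self.adj = {}      # node -> set of neighboring nodes (undirected view)

    def add_node(self, v, label):
        self.labels[v] = np.asarray(label, dtype=float)
        self.adj.setdefault(v, set())

    def add_edge(self, v, u):
        # an undirected edge is equivalent to two mutual directed connections
        self.adj[v].add(u)
        self.adj[u].add(v)

    def neighbors(self, v):
        """N(v): the set of nodes connected to v in the undirected version of the graph."""
        return self.adj[v]

# toy graph echoing the node symbols of Fig. 1 (labels are made up)
g = LabeledGraph()
for name, lab in [("a1", [1, 0]), ("b", [0, 1]), ("a2", [1, 0]), ("d", [0, 0])]:
    g.add_node(name, lab)
g.add_edge("a1", "b"); g.add_edge("b", "a2"); g.add_edge("b", "d")
print(g.neighbors("b"))   # -> {'a1', 'a2', 'd'} (in some order)
```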


Fig. 2 Classes of graphs: examples of a labeled node, sequence, tree, Directed Positional Acyclic Graph (DPAG), general undirected graph

A transduction over such structures is a mapping from a set of input graphs g to an output domain. In a supervised transduction, the output domain, which is used to represent the targets of the relational learning task, can be either a set of scalar values, for classification or regression tasks (e.g., the transduction is a mapping g → R), called structure-to-scalar transduction in the following, or structured data, e.g. providing an output for each node or for a set of nodes, called structure-to-structure transduction in the following. The transductions considered in the following entail an encoding of the input into an inner representation space, composed of state variables associated with each node of the input graph, and a readout or output mapping function that returns the predicted values. The first family of models we consider is the family of recursive models, able to implement transductions over hierarchical classes of data.
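To fix ideas, the following minimal sketch (illustrative only: the state matrix, readout weights and dimensions are made up) shows how the same set of node states can feed either a structure-to-structure readout (one output per node) or a structure-to-scalar readout (one output per graph, after aggregating the node states).

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, state_dim, out_dim = 5, 4, 2
X = rng.normal(size=(n_nodes, state_dim))     # node states produced by some encoder
W_out = rng.normal(scale=0.5, size=(state_dim, out_dim))

# structure-to-structure transduction: one output vector per node
Y_nodes = np.tanh(X @ W_out)                  # shape (n_nodes, out_dim)

# structure-to-scalar transduction: aggregate node states, then apply the readout once
x_graph = X.mean(axis=0)                      # simple permutation-invariant aggregation
y_graph = np.tanh(x_graph @ W_out)            # shape (out_dim,)
```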

2.2 Recursive Models for Hierarchical Data

The family of recursive models has a long line of developments that includes approaches for the adaptive processing of structured data from its simplest form, i.e. a sequence, where relationships follow a serial order (e.g. time series), up to hierarchical data (rooted k-ary trees), and directed acyclic graphs. The common characteristic among all these models is to provide a recursive and parametric realization of the transduction function between structured data and the output domain. The embedding of each node of the input structure is realized through a neural state machine and the adaptivity is obtained by tuning the free parameters of the neural model from the training data at hand.

2.2.1 Recurrent Neural Networks

The first class of models that we consider in this family is that of Recurrent Neural Networks (RNN) for sequence processing. The primary form, corresponding to the Simple Recurrent Network (SRN) (or Elman network [32]), can be expressed as the following state-transition system. Given an initial state value x(0), we have, for t > 0,

x(t) = τ(l(t), x(t − 1))    (1)
y(t) = g(x(t))    (2)

where l(t) is the numerical label associated with each node/step of the sequence, τ is the state-transition function, which can be realized by a set of hidden neural units with (recurrent) self-loop feedbacks and associated free parameters (weights w), and g is the readout or output mapping function from the internal state values to the output values y(t), which can be implemented as a standard feed-forward neural network. The role of the state is to summarize the information received up to step t of the input sequence, realizing a context or node/sub-sequence embedding (a memory of the past). Note that this system is causal, since the output at step/node t (or time t for time series) depends only on the current and previous inputs. Due to the causality assumption, recurrent neural networks are able to store, in an internal state, "past" information from the sequence of inputs and to use it together with the current input. The internal state of the recurrent neural network encodes a representation sensitive to the context (e.g. the temporal context for temporal signal processing). Moreover, due to the learning capabilities, the memory of a recurrent neural network includes information depending on the task at hand. In other words, the state units discover adaptive abstract representations of past data containing the information relevant to the prediction. The recurrent neural model opened connectionist models to a broad set of temporal processing applications, such as the cognitive tasks of vision and signal processing, the modeling of system control and digital filters, and time series prediction tasks. Nowadays, RNN are the reference approach, with state-of-the-art results, for sequence processing, especially for speech recognition/processing, text processing (including machine translation), music composition, etc. Many variants of the basic model have been used in such contexts, including Bidirectional RNN, Long Short-Term Memory (LSTM) (see [49] for a recent survey), Gated Recurrent Unit models [24], efficient implementations based on randomized-weights neural networks (Echo State Networks) [38, 59], and multilayered architectures: Deep RNN [48] and Deep ESN [42, 43].
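As a concrete illustration of Eqs. (1)–(2), the sketch below (a minimal NumPy version with made-up dimensions and random, untrained weights) unrolls a simple recurrent network over an input sequence, computing the state x(t) from the current label l(t) and the previous state x(t − 1), and an output y(t) from the state.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, o = 3, 5, 2                            # label size, state size, output size (arbitrary)
W_in  = rng.normal(scale=0.5, size=(m, n))   # input-to-state weights
W_rec = rng.normal(scale=0.5, size=(m, m))   # state-to-state (recurrent) weights
W_out = rng.normal(scale=0.5, size=(o, m))   # state-to-output (readout) weights

def srn_step(l_t, x_prev):
    """x(t) = tau(l(t), x(t-1)); here tau is a tanh layer, as in an Elman network."""
    return np.tanh(W_in @ l_t + W_rec @ x_prev)

def readout(x_t):
    """y(t) = g(x(t)); a linear readout for simplicity."""
    return W_out @ x_t

sequence = [rng.normal(size=n) for _ in range(4)]  # a toy input sequence of labels
x = np.zeros(m)                                    # initial state x(0)
for l_t in sequence:
    x = srn_step(l_t, x)      # the state summarizes the sub-sequence seen so far
    y = readout(x)
```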

2.2.2 Recursive Neural Networks

Recursive Neural Networks (RecNN) extend RNN to match the inherent recursive nature of tree-structured data. Following a description similar to the state-transition system used for the RNN, a RecNN can be expressed considering a state and an output for each node v of an input tree as:

x(v) = τ(l(v), x(ch_1[v]), . . . , x(ch_K[v]))    (3)
y(v) = g(x(v))    (4)

where l(v) is the numerical label of each node, x(ch_1[v]), . . . , x(ch_K[v]) is the K-dimensional vector of states of the children of node v, τ is the state-transition function, which can be realized by a set of hidden neural units with self-loop feedbacks in a number equal to the arity K of the input k-ary trees, and g is the readout or output mapping function analogous to the one described for the RNN. Note that x(ch_1[v]), . . . , x(ch_K[v]) are the vectorial codes obtained by the application of the (recursive) encoding to the subtrees of v. The encoding process starts with null states x_0, typically coded by 0 values, for the empty leaves. The neural network realization of the hidden units can be obtained as follows:

x(v) = σ( W l(v) + Σ_{j=1}^{K} Ŵ_j x(ch_j[v]) )    (5)

where l(v) ∈ R^n is the label of each node, σ is the element-wise applied neural activation function (e.g. a sigmoidal function), x(v) ∈ R^m (m hidden recursive units), W ∈ R^{m×n} is the weight matrix associated with the label space, K is the maximum arity in the set of input trees (k-ary trees), and Ŵ_j ∈ R^{m×m} is the weight matrix associated with the jth sub-structure state space. Note that the bias vector is included in the weight matrix W. The role of the state is again to provide a node embedding by encoding the sub-tree rooted in each node. The causality of RNN is extended in the sense that, in the RecNN, the output for a node v only depends on v and its descendants. The encoding of each input tree corresponds to a dynamical process that implies a visit/traversal of the tree in a bottom-up fashion (from the leaves up to the root), applying at each step the model described in Eq. 5 (spatially extending the encoding process that a RNN applies to an input sequence). Note that this bottom-up order of visit is related to the causality constraint. The same model (i.e. the same set of free parameters w) is therefore used to encode all the nodes of the same tree and all the trees of the data set at hand (according to a stationarity, or weight sharing, assumption). Note that if the task is a structure-to-scalar transduction (e.g. for the classification of trees) the only significant output y(v) that will be considered is the output computed for the root node.
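The following sketch (illustrative NumPy code with random, untrained weights and a hypothetical tree encoding as nested tuples) applies Eq. (5) bottom-up over a binary tree (K = 2): each node's state is computed from its label and the states of its children, with empty children encoded by the null state x_0 = 0; for a structure-to-scalar task only the root state would be passed to the readout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 3, 4, 2                          # label size, state size, max arity (binary trees)
W = rng.normal(scale=0.5, size=(m, n))     # label-to-state weights
W_hat = [rng.normal(scale=0.5, size=(m, m)) for _ in range(K)]  # one matrix per child position

def encode(tree):
    """Bottom-up recursive encoding (Eq. 5). A tree is (label, [child_1, ..., child_K]),
    with None standing for an empty child."""
    if tree is None:
        return np.zeros(m)                 # null state x0 for empty children
    label, children = tree
    s = W @ np.asarray(label, dtype=float)
    for j in range(K):
        child = children[j] if j < len(children) else None
        s = s + W_hat[j] @ encode(child)
    return np.tanh(s)

# a toy labeled binary tree: root with two children, the left child has one leaf
leaf = ([0.0, 1.0, 0.0], [])
tree = ([1.0, 0.0, 0.0], [([0.0, 0.0, 1.0], [leaf, None]), leaf])
root_state = encode(tree)                  # used by the readout, e.g. for tree classification
```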

2.2.3 The RecNN Family and Applications

Models, analyses and general frameworks have been developed in further works since the 90s, which have a main historical root in the activity of the research group at the Computer Science department of the University of Pisa [25]. The basic models were described in [35, 83] (see also the references therein) and pioneering applications to the cheminformatics domain [19, 73] were based on a constructive version of RecNN (Recursive Cascade Correlation). Models that transfer the recursive idea to unsupervised learning were introduced as an extension of the Self-Organizing Map (SOM-SD) [51] and then studied within a general framework [56]. Generative recursive approaches in the form of Hidden Tree Markov Models (HTMM) have been introduced in [35]; for instance, Bottom-up Hidden Tree Markov Models, which extend HMM to trees exploiting the recursive approach and coping with the issue of efficiently decomposing the joint state transition from children to node states, are in [10, 12], while extensions of the Generative Topographic Mapping to tree processing are in [11, 45]. Also, recently, new kernels have been developed from families of HTMM [13], while the BHTMM model has been extended to nonparametric tree mixtures [4] and to more expressive joint state transitions leveraging Bayesian tensor decompositions [23]. It is also worth mentioning the hybrid use of HTMM within a neural architecture in the Hidden Tree Markov Networks (HTN) [1]. Recursive learning can be made efficient by Reservoir Computing (RC)/Echo State Network (ESN) approaches, a paradigm that exploits randomized-weights neural networks (without training) for the recurrent/recursive units, while training is only performed for the readout (often implemented as a linear model). In this line, the TreeESN [39] extends the applicability of the RC/ESN approach to tree-structured data, interestingly showing often competitive results with respect to fully trained models for trees. A multilayer version, the Deep Tree ESN, allows hierarchical abstraction both through the input structure and the architectural layers, showing both efficiency and accuracy advantages [40]. Finally, we mention that successful applications of the RecNN have been developed in fields such as document processing [29, 51], cheminformatics [16, 17, 19, 73], bioinformatics [14] and image processing [89], including recent ones, such as [2, 82, 85], which successfully popularized RecNN in the Natural Language Processing (NLP) area. A recent review of RecNN applications in NLP can be found in [3].

2.2.4 Analysis: Causality Assumption

The causality assumption characterizes the basic recursive neural network models. The stationarity and causality assumptions make possible a compositional and parsimonious adaptive computation over the domains of sequences and trees, handling the variability of the input structures without affecting the computational power. Indeed, under such assumptions, the RecNN model shows universal approximation capability for structure-to-element transductions over the sequence and tree domains [54]. However, a causal model is only able to memorize "past" information, and the output of the model depends only on the sub-structures (sub-sequences or sub-trees) of each node. As a result, causality affects the computational capability once we move to consider other kinds of transductions or more complex structures, such as Directed Acyclic Graphs. If the output (or the embedding) must depend also on other nodes of the structure (rather than only on the sub-structures) we need to consider a contextual processing. Contextual processing refers to computations that yield, for each node, an embedding that depends on the whole information represented in the structured data, according to its topology. In Fig. 3, causal processing imposes for the node with label "c" a context limited to the descending vertices (the sub-structure in the blue box), while contextual processing also includes the other nodes (white box). The Contextual RecCC (CRCC) partially relaxes/extends the causality constraints to deal with contextual processing and to move to the processing of DPAGs [72], still in the framework of recursive approaches. In particular, to properly consider DPAGs, i.e. to distinguish in terms of embedding all the possible DPAGs of a given set, the model needs to exploit contextual information, as shown in [72]. The universal capability of this approach has been studied in [55].

Fig. 3 A labeled tree and the context for the node with label "c" available to a causal model (blue box) and the global context (white box)

2.3 Cycles and General Graphs

The causality assumption in RecNN introduces issues in processing cycles, due to the mutual dependencies among state values. Indeed, in a cyclic graph, the state value of each node belonging to a cycle has an effect on the neighboring node states, which in turn depend on the states of the other nodes involved in that cycle, introducing a loop in the definitions of such state values (with possible instability issues for the resulting dynamical system). Hence, the causality assumption loses the meaning, in terms of "past" or descending nodes and sub-structures, that it had for hierarchical data (sequences, trees and DAG). For the same reason, the order of visit in the traversal of the nodes of the input structure should be free from the hierarchical nature of the tree or DAG structures and, more in general, from any numbering and alignment of the graph nodes (unless such a numbering is provided and it is related to the task). Note that this phenomenon also affects the treatment of undirected graphs, since each pair of nodes connected by an undirected edge can be interpreted in terms of a couple of directed connections making a cycle between the two nodes (see Fig. 1). Through the years, three main classes of approaches have been used to cope with cyclic graphs by neural networks.

2.3.1 Rewriting the Graph

The first approach indirectly faces the issue by rewriting the graph into hierarchical structures (applying RecNN approaches), e.g. [18], or, specifically for chemical graphs, by an atomic representation of cycles, e.g. functional groups of molecules, as applied for instance in [16, 17, 19, 73], or by rewriting the graph in terms of trees exploiting canonical representations of molecules (e.g. the SMILES format [90]). In contrast to this indirect treatment of cycles, the two other coeval approaches described in the following pursue the basic and direct intent to really move the models toward the nature of the data, and they originate the field of neural networks/deep learning for graphs, with models able to process acyclic/cyclic, directed/undirected labeled graphs.

2.3.2 Recursive Approaches on Graphs

The approaches introduced in [78] (Graph NN—GNN) and then in GraphESN [37] use a recursive model where the cyclic dependencies among states are allowed and treated by imposing constraints on the resulting dynamical system. In the "essential" version of the GraphNN (GNN) and GraphESN the equations are similar to RecNN in Eq. 5, implementing a state transition function by recursive units over the states of the neighbors N(v) of each input node v:

x(v) = σ( W l(v) + Σ_{u∈N(v)} Ŵ x(u) )    (6)

where l(v) ∈ R^n is the label of each node, σ is the element-wise applied neural activation function (e.g. a sigmoidal function), x(v) ∈ R^m (m hidden recursive units), W ∈ R^{m×n} is the hidden units weight matrix associated with the label space, and Ŵ ∈ R^{m×m} is the weight matrix associated with the hidden recursive units state space encoding of each neighbor node (hence this state space extends the state space for children expressed in Eq. 5). Note that the bias vector is included in the weight matrix W.

The encoding of each input graph is made according to a dynamical process with a traversal of the graph, where there is still the stationarity assumption used by the RecNN models, possibly enhanced, as assumed here, by also sharing the weights among all the neighbors' incoming state values (for the simplest case of unlabeled edges). In the dynamical system expressed by Eq. 6 cycles are allowed in the state computation, as the state is computed by iterating the state transition function until convergence. The stability of the recursive encoding process is guaranteed by contractive constraints on the state dynamics (Banach fixed-point theorem) [78]. In the original version of the GNN approach this is achieved by imposing constraints in the loss function (alternating learning and convergence) [78]. In GraphESN [37] the condition is inherited from the contractivity of the reservoir computing (RC) dynamics (Echo State Property conditions), allowing a very efficient approach that avoids the training of the recursive hidden units. It is fundamental to note that the encoding is not just "local", i.e. the context is not limited to the embedding of the local information from the label of the node and the states of the immediate neighbors (of radius one). Instead, through different iterations, the context is extended over all the graph nodes by diffusion on the graph topology, so that each node, at the end of the iterative process, receives a context that potentially includes information from all the graph nodes. Finally, to produce an output the model uses a readout as for the RecNN models, implementing a structure-to-structure transduction or a structure-to-scalar transduction. In the latter case, a mapping function is needed to summarize all the state values of the graph nodes into the input of the readout model. Typically this mapping is made by simple functions like the sum or the average over the state values associated with the nodes of the input graph (further variants will be discussed in Sect. 3). Current advancements of this approach include theoretical studies on the VC-dimension [79], implementations exploiting GRU models [67], and deep versions of the GraphESN approach [41].
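The sketch below (a minimal, untrained NumPy illustration with hypothetical dimensions) iterates the state transition of Eq. (6) over all nodes until the state change falls below a tolerance; scaling the recursive weight matrix to a small spectral norm is a crude way of mimicking the contractivity condition that guarantees convergence in GNN/GraphESN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, m = 3, 4
W = rng.normal(scale=0.5, size=(m, n_feat))      # label weights
W_hat = rng.normal(size=(m, m))
W_hat *= 0.5 / np.linalg.norm(W_hat, 2)          # shrink the spectral norm to encourage contractivity

# toy graph: node labels and neighbor lists (an undirected triangle)
labels = {v: rng.normal(size=n_feat) for v in range(3)}
neigh = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

x = {v: np.zeros(m) for v in labels}             # initial states
for _ in range(100):                             # iterate Eq. (6) until (approximate) convergence
    x_new = {v: np.tanh(W @ labels[v] + sum(W_hat @ x[u] for u in neigh[v])) for v in labels}
    delta = max(np.linalg.norm(x_new[v] - x[v]) for v in labels)
    x = x_new
    if delta < 1e-6:
        break

graph_state = np.mean([x[v] for v in x], axis=0)  # simple average mapping for a structure-to-scalar readout
```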

2.3.3 Layering (Not Recursive) Contextual Approaches on Graphs

The basic concept in the third approach (coeval with the recursive approaches described in Sect. 2.3.2) is to exploit the idea of layering different units to manage the mutual dependencies among state values that can occur in cyclic and/or undirected graphs. In contrast to the recursive approaches, the diffusion of the contextual information is not obtained within a single layer of recursive units (iterating the state equations) but through composition over multiple layers of non-recursive/feed-forward units. The embedding of each node can take the context of the other nodes computed in the previous layers, accessing progressively a larger context, up to the entire input graph/network. Let us assume a simple version of such an approach, using a single unit in each layer and starting from the first model in this line, the NN4G [71]:

x_1(v) = σ( W l(v) )
x_i(v) = σ( W l(v) + Σ_{j=1}^{i−1} ŵ_{ij} Σ_{u∈N(v)} x_j(u) ),   i = 2, . . . , L    (7)

where l(v) ∈ R^n is the label of each node, σ is a neural activation function (e.g. a sigmoidal function), x_i(v) ∈ R for each i (i.e. a state component for the layer i and node v of the input graph), L is the number of layers, W ∈ R^{1×n} is the hidden unit weight (row) vector associated with the label space, and ŵ_{ij} is the weight associated with the connection between the current unit i and the jth preceding hidden unit, which brings the state values of all the nodes adjacent to v (N(v)). In this original version the information is fed to unit i from all the previous layers, while a simplified version can consider just the previous layer without changing the main concepts. The bias vector is included in W. In this simplified version of the NN4G the ŵ are shared among all the incident edges (without distinguishing them, by a full stationarity assumption). Both for GNN and NN4G different solutions are possible if the edges are also labeled, adding information about them to the respective state equations.

Note that in this equation non-recursive units are used: x_i(v) depends only on units of the previous layers and therefore no cyclic dependencies are introduced in the definition of the state transition system. In particular, differently from the recursive approaches for graphs, there is no need to solve and to stabilize a dynamical system. The encoding of each input graph is made using a traversal of the graph where there is still the stationarity assumption allowing to share the same set of weights across the different nodes of the input graphs. As this is a non-recursive model, the traversal of the input nodes can be done without any constraint on the order of visit. The role of the states is to provide a node embedding by exploiting the current node label and the encoding of the context by the units in the previous layers. It is fundamental to note that the encoding is again not just "local" to the node label and the states of the immediate neighbors (of radius one). This is because, by composition, the model extends its context to other nodes through the context developed in the previous hidden layers. In particular, since each unit at a given layer i and for a given node v uses contextual information from the previous layer i − 1, which in turn encodes the context of the local neighbors of each u belonging to the neighborhood of v, the context of the state of v at layer i is extended to a context of radius two for i = 3 and, more in general, of radius i − 1 for i > 3. This nested construction allows the NN4G to include, in the state of each node, information over all the graph. The formal discussion of this property has been introduced in [71], showing that the context grows one step ahead for each added layer and that (i) the dimension of the context is proportional to the number of layers, and (ii) the structure of the composition is given by the topology of the input graph. Also, it is proved that a number of layers greater than the diameter of the graph is sufficient to involve all the vertices of the graph. As a result, in this approach, which moves toward deep architectures, the depth of the network is functional to the contextual information used by the model. This process of composition is expressed in Fig. 4, where each ellipsoid corresponds to the context available to the central node at the different layers.

Fig. 4 The growth of the context for a node v (center) through the information available at different layers of a NN4G

A characteristic feature of the original NN4G was that the layering idea was implemented by a constructive approach, based on the family of Cascade Correlation neural networks [34]. In such an approach, the learning algorithm progressively adds one new state variable (or hidden unit) at a time to a set of already trained (frozen) units. The units are trained in a layer-wise fashion, interleaving the minimization of the total error function of the output/readout layer and the maximization of the (non-normalized) correlation of the newly inserted hidden unit with the output residual error. The training process ends when the number of inserted units is sufficient to achieve the desired error of the output layer. As a result, the number of units, and hence of layers, results from an automatic construction ruled by the supervised task at hand, and there is no need to fix it prior to learning. Moreover, by a divide-et-impera approach the original task is simplified by splitting it into sub-tasks learned by the hidden units one at a time; as all the previous units are left "frozen" after training, there is no need to deal with the typical issues of gradient propagation through many layers (vanishing gradients). Finally, the implementation of the possible transductions can be done with the same approach introduced after Eq. 6. It is interesting to note the similarity between this kind of approach and the approach underlying convolutional neural networks (CNN): if we constrain the local receptive field of the CNN to the local topology of each node, instead of the local 2-dimensional matrix kernel used for image processing, the CNN can be extended to graph processing in a way similar to the approach described so far. This line of research has been developed since 2015 by many authors (as we will describe in the next section). Indeed, many concepts are shared: (i) the traversal (of the nodes) of input graphs through units with weight sharing (stationarity), which corresponds to the convolution concept in the CNN; (ii) the composition of the context through the layers and hence, (iii), the use of many layers functional to the context diffusion as described above, moving to deep architectures. However, differently from NN4G, which also provides an approach to the automatic design of the neural networks for graphs, the CNN for graphs are typically based on a fixed architecture and on the use of top-down back-propagation (end-to-end), which can be quite computationally demanding when using many layers.
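As a concrete (and heavily simplified) illustration of Eq. (7), the sketch below stacks L non-recursive layers, each with a single unit, so that the state of a node at layer i depends on its label and on the states of its neighbors at the previous layers; weights are random and untrained here, whereas the original NN4G trains and freezes one unit at a time by Cascade Correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, L = 3, 4                              # label size and number of layers (arbitrary)
labels = np.asarray([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
neigh = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}        # a small star-like toy graph
n_nodes = labels.shape[0]

w_label = [rng.normal(scale=0.5, size=n_feat) for _ in range(L)]  # one label-weight vector per layer/unit
w_ctx = rng.normal(scale=0.5, size=(L, L))                        # w_ctx[i, j]: weight from layer j to layer i

x = np.zeros((L, n_nodes))                    # x[i, v]: state of the single unit of layer i at node v
for i in range(L):
    for v in range(n_nodes):
        s = w_label[i] @ labels[v]
        for j in range(i):                    # context from all the previous (frozen) layers
            s += w_ctx[i, j] * sum(x[j, u] for u in neigh[v])
        x[i, v] = np.tanh(s)

# after L layers, each node state has seen a context of radius up to L-1 hops
graph_repr = x.mean(axis=1)                   # average over nodes, e.g. for graph classification
```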


The description of the evolution of the layering contextual approaches to graphs, and in particular of the models falling under the umbrella notion of convolutional neural networks for graphs, a line of research attracting more and more researchers after 2015, is part of the following section.

3 Deep Learning Models

The pioneering works discussed in Sect. 2 already reached a level of maturity and expressiveness that allowed processing general directed/undirected cyclic graphs. However, it was not until 2015 that the field of learning for graphs started gathering momentum, mostly due to the renewed attention of the deep learning community. In this section, we take a view of the progress of these modern deep learning approaches to graph processing, highlighting the close links, often left unacknowledged in recent works, with the earlier literature discussed in Sect. 2. The first thing to note about the deep learning for graphs community is a certain tendency to adopt non-uniform nomenclatures for the same methodologies. In this respect, the field of learning models for graph-structured data is referred to in the literature with the following interchangeable terms:

• Deep learning for graphs
• Graph neural networks
• Graph convolutional networks
• Neural networks for graphs
• (Convolutional) neural networks for/on graphs
• Learning graph/node embeddings
• Geometric deep learning.

In a recent tutorial paper [7], a unifying term has been proposed, Deep Graph Networks (DGNs), which serves to disambiguate references to specific models in the literature, such as the Graph Neural Network (GNN) [78] or the Graph Convolutional Network (GCN) [63], from the name of the overall field. The use of the DGN term also has the advantage of being general enough to accommodate models which are not inherently neural-based, such as [6]. Our literature review takes a historical perspective on modern works, presenting them in their publication order whenever possible. We will therefore begin, in Sect. 3.1, by introducing DGN models based on an analysis of the graph convolutions operated in the spectral domain. When appropriately simplified, such spectral operators can be reduced to simple forms of graph convolutions in the spatial domain. We will then show how these are a special case of the contextual graph processing approach introduced in Sect. 2, which will be further explored and detailed in Sect. 3.2 in its modern interpretation, referring to it as contextual convolutions and node embeddings. We will conclude our discussion with a recent and very active research topic on GCN models that concerns graph pooling mechanisms. Note that, in this section, we will operate a slight change in notation to reduce cluttering in the equations and to preserve their interpretability also when dealing with probabilistic formulations. In particular, node indexing will now appear as a subscript instead of between brackets, e.g. we will use l_v in place of l(v) to denote the label of node v.

3.1 Spectral Graph Convolutions

Spectral graph convolutions have been introduced in [22] leveraging the well-known Convolution Theorem [20] in Fourier analysis. This essentially states that the convolution of two functions f, g in the spatial domain is the same as the product of their respective Fourier transforms F(f), F(g) in spectral space, that is

F(f ⊗ g) = F(f) · F(g).    (8)

To apply these results to graphs, one needs to start by defining an appropriate basis for the Fourier transform, which can be straightforwardly obtained from the eigenvectors of the graph Laplacian

L = D − A

where A is the n × n adjacency matrix of the graph, while D is the diagonal degree matrix with D_vv = Σ_{u∈N(v)} a_vu. The degree matrix can be computed also for weighted graphs by substituting the adjacency matrix entry a_vu with the corresponding edge weight e_vu. The spatial vector function f = [f_1, . . . , f_N] in the graph domain is a vector of functions f_v = l_v attaching a label l_v ∈ R^d to each node v. Since the Laplacian defines an operator on the nodes' vector space, it can be applied to f, i.e. for a given node v

(Lf)(v) = Σ_{u∈N(v)} e_vu (f_v − f_u)    (9)

which has the nice interpretation of the weighted sum of the differences between the label of v and those of its neighbors. In practice, the graph Laplacian is typically used in its normalized (scale-invariant) version

L = I_n − D^{−1/2} A D^{−1/2}    (10)

which ensures that its application performs an actual local averaging of the neighbors. The term I_n is the n × n identity matrix, used to make sure that nodes take their own label into consideration when computing their local average. The eigenvector matrix Q ∈ R^{n×n} of a graph Laplacian L provides the orthonormal basis needed to compute the graph Fourier transform of the signal f defined on node labels as F(f) = Q^T f, while its inverse is simply F^{−1}(Q^T f) = Q Q^T f. As in standard convolutional neural networks, we then take the second signal g to be a parameterized filter. Computing the convolution between g and the graph signal f would then require applying (8) and the definition of graph Fourier transform, yielding

f ⊗ g = Q W Q^T f    (11)

where W = Q^T g (interpreted as a diagonal matrix) is the matrix of learnable convolutional parameters in the spectral domain. The Spectral CNN [22] leverages these results to define a spectral graph convolutional layer computing its output as

x = σ( Q W Q^T l )    (12)

where l are the node labels, σ is a suitable non-linearity and W ∈ R^{n×n} is a diagonal parameter matrix. Note that (12) refers to a single convolutional filter. It can be straightforwardly extended to a layer of K filters, each with its own parameters W_k, and to generic inputs x (i.e. the output of a preceding convolutional layer) instead of the node labels l. The spectral convolutional layer defined by (12) has a number of issues to be carefully considered. First, the parameter matrix W depends on Q, hence on the specific Laplacian for which it has been computed. This means that the learned parameters cannot be shared between graphs; in fact, applications of this model have been found primarily in network data processing (i.e. when the learning task is defined over the nodes and edges of a single graph). Second, each convolutional filter requires O(n) parameters (possibly for each input channel), where n is the number of nodes in the graph. The computational complexity of the model can quickly become an issue for larger graphs, also because of the need to compute the eigendecomposition of the Laplacian. Lastly, since the convolutional filters are learned as free parameters in the spectral domain, they may not be localized in the spatial domain. To overcome these issues, [28] has introduced ChebNets, where the spectral filter W is defined by the polynomial

W(Λ) = Σ_{i=0}^{k} w_i Λ^i,    (13)

where Λ is the diagonal eigenvalue matrix of L, w_0, . . . , w_k are learnable parameters and k is the polynomial order. To avoid the costly eigendecomposition of L, [28] approximates (13) using the following truncated Chebyshev expansion [57]

W(Λ) = Σ_{i=0}^{k} w_i T_i(Λ̃),

where Λ̃ = 2Λ/λ_max − I_n is the matrix of the eigenvalues rescaled in [−1, 1] (λ_max being the originally largest one). The term T_k(x) is the Chebyshev polynomial of order k, which is defined recursively as

T_0(x) = 1
T_1(x) = x
T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x).    (14)

By leveraging the properties binding the Chebyshev polynomial of a Laplacian to that of its eigenvalues, one can rewrite the convolution in (11) as

Q W(Λ) Q^T f = Σ_{i=0}^{k} w_i f̂_i    (15)

where f̂_i = T_i(L̃) f is the polynomial of the rescaled Laplacian L̃ = 2L/λ_max − I_n applied to the signal. According to the recursive definition in (14), f̂_i can be computed without requiring eigendecompositions as

f̂_0 = f
f̂_1 = L̃ f
f̂_k = 2 L̃ f̂_{k−1} − f̂_{k−2}.

Given a (typically) sparse Laplacian, the complexity of this convolution is O(k · m), with m being the number of edges in the graph. Differently from [22], ChebNet filters are also localized in the spatial domain, as a node can only be affected by nodes which are at most k hops away, with k being the polynomial order. ChebNets have been further simplified by [63] by using the normalized Laplacian in (10), assuming λ_max = 2 and limiting the order to k = 1. In addition to that, the two Chebyshev parameters are tied into a single one, w = w_0 = −w_1, resulting in the highly simplified convolution

w (I_n + D^{−1/2} A D^{−1/2}) f.

This simplified graph convolution is the basic building block of the Graph Convolutional Network (GCN) [63]. By following the compact formulation in (12), we can write the output of a generic GCN layer comprising h′ filters as

X′ = σ(L X W),    (16)

where X is an n × h matrix of layer inputs (e.g. outputs from a previous layer of h filters) and W ∈ R^{h×h′} is the parameter matrix. The GCN is essentially a contextual convolution in the spatial domain, despite the fact that it has been derived starting from a graph convolution in the spectral domain. Its simplicity has made it a popular choice as a basic building block in successive DGN architectures. Nonetheless, stacking several GCN layers in deep architectures is also known to suffer from oversmoothing issues [66], such that all nodes in a graph tend to converge to the same encoding in deeper layers. As a result of this, GCN has trouble learning dependencies between nodes at longer hop-distances. To this end, GCN layers are often used interleaved with pooling layers (see Sect. 3.3), to favor diversity and to bring more distant communities nearer by graph reduction, or in conjunction with skip-connections, also referred to as jumping knowledge.
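A minimal NumPy sketch of the simplified graph convolution of Eq. (16) is given below (random, untrained weights and a hypothetical two-layer stack). Note that, as a common practical choice, the propagation operator used here is the symmetrically normalized adjacency with self-loops, standing in for the operator denoted L in (16); this is an assumption of the sketch, not a prescription from the text.

```python
import numpy as np

def gcn_propagator(A):
    """Symmetrically normalized adjacency with self-loops, used as the propagation operator."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(P, X, W):
    """X' = sigma(P X W): a single graph convolutional layer with ReLU non-linearity."""
    return np.maximum(0.0, P @ X @ W)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)     # toy undirected graph
X = rng.normal(size=(4, 3))                   # node features (n x h)
W1 = rng.normal(scale=0.5, size=(3, 8))       # h x h' parameters of the first layer
W2 = rng.normal(scale=0.5, size=(8, 2))       # parameters of the second layer

P = gcn_propagator(A)
H = gcn_layer(P, X, W1)                       # first convolution: 1-hop context
Z = gcn_layer(P, H, W2)                       # stacking layers widens the context to 2 hops
```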

3.2 Contextual Graph Processing

Spectral approaches certainly had the role of reviving the field and attracting attention to the learning for graphs topic. Nonetheless, their limitations in terms of flexibility to handle graphs of different topology and of scaling up to larger structures have made the community shift its focus to spatial-domain approaches. In particular, much of the work of the community is currently on models which, at their very core, leverage the layered approach to cycle resolution and context propagation put forward, firstly, by the NN4G [71] and discussed in Sect. 2. In the following, we summarize some of the most relevant contributions leveraging the contextual approach, starting from early spatial convolution models (Sect. 3.2.1), up to the introduction of the concept of learnable node embeddings (Sect. 3.2.2) and their generalization to unsupervised learning (Sect. 3.2.3).

3.2.1 Contextual Convolutions

The GCN [63], despite its spectral-driven inspiration, has a clear spatial interpretation of its convolutional operator in (16), which is a weighted sum over a vertex's neighborhood, where the weights are given by the combination of the learned parameters W and the Laplacian entries L. To complete the parallel with the contextual approaches, the final GCN architecture is obtained by stacking several such convolutional layers interleaved by non-linear activations. GCN is not, however, the first modern model putting forward the use of spatial-contextual convolutions. The Neural Graph Fingerprint [30] has been proposed a couple of years before GCN, taking inspiration from a popular approach used in chemistry to represent molecular graphs in a vectorial form, i.e. circular fingerprints [47]. Following a procedure fully amenable to the contextual approach, circular fingerprints build a layered representation of a molecular graph, where each layer performs a relabeling of the nodes by applying a fixed hash function to the concatenated labels of their neighbors in the previous layer. The hash values are treated as integers from a finite alphabet. Whenever such an index is used to relabel a node in the graph, a 1 is placed in the corresponding position of the vectorial fingerprint. The Neural Graph Fingerprint [30] takes this deterministic hashing process and transforms it into a fully differentiable neural architecture. To this end, it substitutes the neighbors' concatenate-and-hash operation with the parameterized neural aggregation

x_v^i = σ( ( x_v^{i−1} + Σ_{u∈N(v)} x_u^{i−1} ) W^i )

where the superscript i is used to denote the ith layer and x_v^i can be interpreted as the soft-relabeling of node v at level i. The integer index generation part is made fully differentiable by having the x_v^i label pass through a softmax and by summing its result to the current value of the differentiable fingerprint vector. Despite the limitations in generality of a model conceived for molecular graphs alone, the neural fingerprint has in itself some relevant intuitions that will be further explored by future works, such as the concept of learning differentiable graph vectorial representations, which is at the core of the node embedding approach.

PATCHY-SAN [75], while remaining in the realm of contextual spatial convolutions, takes a different approach to node processing and neighborhood selection. It focuses on trying to identify coherent node orderings across different graphs, which are used both to attempt a reordering of the adjacency matrix of the graph and to determine a normalized neighborhood ordering for each vertex which is consistent across the entire data set. By this means, PATCHY-SAN can define a less-constrained weight sharing policy than the convolutional approaches described so far, which reuse the same weights for the whole neighborhood. PATCHY-SAN, on the other hand, can use a different weight for each neighbor according to the order determined by the normalization procedure. The ordering is obtained via a non-adaptive procedure leveraging algorithmic techniques for graph labelling, but it has to be noted that the problem of graph normalization is NP-complete. While, in principle, the PATCHY-SAN approach can result in a more expressive parameterization, the quality of its results is strongly influenced by the quality of the node ordering. Also, it requires fixing a priori the size of the neighborhood, making it necessary to use padding or to exclude some of the neighbors when computing the convolution.

Graph Attention Networks (GATs) [86] put forward another approach to neighborhood aggregation which, needless to say, introduced the use of attention mechanisms on graphs. It builds on the architecture and weight sharing assumptions of a GCN, extending it with the use of multi-head attention between pairs of neighboring nodes v, u to determine the importance α_vu of their incident edge. This is used to weigh the contribution of the single neighbors u when computing the encoding of a node v, as follows

x_v = σ( Σ_{u∈N(v)} α_vu W x_u ).

While the original paper [86] reported good empirical performances, these results were later revised by [80]. The complexity of the multi-head mechanism also makes it slower when compared with competing models in literature.
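The sketch below illustrates the attention-weighted aggregation in a single-head, heavily simplified form (random, untrained weights; the LeakyReLU scoring over concatenated transformed features follows the spirit of GAT, but this is not the reference implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 3, 4
W = rng.normal(scale=0.5, size=(d_out, d_in))   # shared linear transform
a = rng.normal(scale=0.5, size=2 * d_out)       # attention vector (single head)

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def attention_layer(x, neigh):
    """Single-head attention aggregation: each node averages its transformed neighbors,
    weighted by softmax-normalized scores alpha_vu."""
    h = {v: W @ x[v] for v in x}
    out = {}
    for v in x:
        nbrs = neigh[v] + [v]                   # include the node itself, a common choice
        scores = np.array([leaky_relu(a @ np.concatenate([h[v], h[u]])) for u in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                    # softmax over the neighborhood
        out[v] = np.tanh(sum(al * h[u] for al, u in zip(alpha, nbrs)))
    return out

x = {0: rng.normal(size=d_in), 1: rng.normal(size=d_in), 2: rng.normal(size=d_in)}
neigh = {0: [1], 1: [0, 2], 2: [1]}
h1 = attention_layer(x, neigh)                  # attention-weighted node encodings
```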

3.2.2 Node Embeddings

The basic idea of the node embedding approach is simple and has been largely used in other application fields, the most popular being word embeddings. The goal is to learn a representation of the discrete node objects into a dense vector space in such a way that similarities between nodes in their original representation are preserved in the metric embedding space. A high-level representation of the approach is depicted in Fig. 5. In this very general formulation, node similarity can be defined in different ways, from a user-defined function induced by prior knowledge, e.g. neighboring nodes should be embedded close-by, to a task-induced criterion, e.g. nodes with similar classification should be close in the embedding space. Following Fig. 5, we are interested in building an embedding Φ(v | C(v), W): S → R^d for a node v in the structure space S, computed by taking into account information on the node context C(v), such as the whole graph or the set of its neighbors. Of course, since we are focusing on adaptive approaches, we consider the embedding to be parameterized by weights W that can be learned to optimize a function which is directly or indirectly associated with the node similarity of choice. Given such a very general definition of the problem, models in literature differ in the definition of the context and how this is aggregated, and in the choice of the cost function.

Fig. 5 High-level view of the node embedding approach: a (possibly adaptive) embedder Φ is used to project nodes into a vectorial embedding space where some user-defined or task-induced node similarities are preserved with respect to the vectorial metric x · y

Node embeddings initially developed in the context of large-scale network data, to learn vectorial encodings of nodes for purposes of inspection and prediction (e.g. node classification, edge prediction, etc.). In this respect much attention has been devoted, at least initially, to efficient neighborhood construction and aggregation schemes. Node2vec [50] has been the first to introduce the idea of learning dense embeddings in graphs by borrowing much from word embedding models in natural language processing. It defines an efficient context-assembling mechanism based on a biased random-walk procedure alternating breadth-first and depth-first sampling. The Node2vec cost function is unsupervised and targets making the embedding of a node and those of its sampled neighborhood similar with respect to a pairwise similarity function. In [50], several similarity functions are assessed, including averaging, Hadamard product and vector norms.

GraphSAGE [52] generalized the approach in Node2vec to a wider class of predictive tasks and to more articulated and general aggregation functions, bridging the gap with convolutional and contextual models for graphs. It defines a layered architecture in which the node embedding at a given layer is obtained by means of the convolutional operator

x_v = σ( W · [ x_v ⊕ γ({ x_u | ∀u ∈ N(v) }) ] ),

where x_v ∈ R^d is, again, either the input label of node v or its embedding from a previous layer (and similarly for u), W is a learned weight matrix, ⊕ is the concatenation operator and σ an activation function. More interestingly, γ is an, allegedly, permutation-invariant aggregation function that is chosen from a set of candidates including average, max-pooling, but also a recurrent layer fed with random permutations of the x_u neighbors. The neighborhood N(v) is defined in a general way as including all nodes at a maximum of k hops from v although, for practical and computational tractability purposes, k is typically limited to 2 and the actual neighborhood only contains a fixed-size random sample of the general one. The Graph Isomorphism Network (GIN) [92] builds on GraphSAGE by analysing the limitations of its aggregation functions from a theoretical perspective, and extending it with arbitrary aggregation functions on multi-sets. The resulting model is proven to be as powerful as the Weisfeiler–Lehman test of graph isomorphism.
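The following sketch illustrates a GraphSAGE-style concatenate-and-aggregate step (mean aggregator, fixed-size neighbor sampling, random untrained weights); it is a simplified illustration under these assumptions, not the reference GraphSAGE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, sample_size = 3, 4, 2
W = rng.normal(scale=0.5, size=(d_out, 2 * d_in))   # acts on the concatenation [x_v ; gamma(...)]

def sage_layer(x, neigh):
    """One layer: concatenate a node's embedding with a mean-aggregated, sampled neighborhood."""
    out = {}
    for v in x:
        nbrs = list(neigh[v])
        if len(nbrs) > sample_size:                 # sample a fixed-size neighborhood
            nbrs = list(rng.choice(nbrs, size=sample_size, replace=False))
        gamma = np.mean([x[u] for u in nbrs], axis=0) if nbrs else np.zeros(d_in)
        out[v] = np.tanh(W @ np.concatenate([x[v], gamma]))
        out[v] /= (np.linalg.norm(out[v]) + 1e-12)  # optional normalization of the embedding
    return out

x = {v: rng.normal(size=d_in) for v in range(4)}
neigh = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
h1 = sage_layer(x, neigh)                           # stacking layers widens the receptive context
```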

3.2.3 Unsupervised Embedding Learning

Node vectorial representations can also be learned without the need to explicitly imbue information concerning a supervised learning task. These models are still little explored in the literature, but there are at least two major relevant contextual approaches, both with a probabilistic interpretation. The Contextual Graph Markov Model [6] puts forward a fully probabilistic interpretation of the contextual convolutional operator. In place of a dense embedding space, it assumes nodes to be associated with a set of C (finite) hidden states, as in hidden Markov models for sequences. Thus, it defines X_v to be the categorical latent variable denoting the state of node v and computes the probability of the node being in a specific state as the probabilistic convolution

P^{i,a}(X_v = k | x^{i,a}_{N(v)}) ≈ (1 / |N^{i,a}(v)|) Σ_{u∈N^a(v)} Σ_{j=1}^{C} P^{i,a}(X = k | x = j) x_u^i(j),    (17)

where x^{i,a}_{N(v)} is the joint assignment of hidden states to the neighbors of node v. Again, in a fully contextual fashion, such an assignment is taken from a preceding layer i with respect to the one for which we are computing the new embedding, in order to avoid issues with mutual dependencies and cycles. This complex joint conditional probability is factorized by the mixture of simpler probabilities on the right-hand side, where x_u^i(j) denotes the posterior distribution of the observable states of the neighbor nodes and P^{i,a}(X = k | x = j) is a trained distribution measuring the association between state j in a neighbor u and state k for node v. The superscript a in (17) is used to denote the fact that the distributions can be made parametric with respect to an edge label a (from a finite set). The CGMM model can be trained in a fully unsupervised way by applying expectation-maximization independently, layer by layer. This allows keeping the computational complexity contained and building the layered structure incrementally, e.g. on the basis of an external (supervised) criterion. Training is based on maximising the likelihood that each node v has a certain observable label y_v conditioned on neighboring information:

L(θ | g) = Π_{v∈Vert_g} Σ_{k=1}^{C} P(y_v | X_v = k) P(X_v = k | x_{N(v)}),    (18)

where the complex joint distribution P(X_v = k | x_{N(v)}) is factorized into a mixture of simpler probabilities using a switching-parents approximation [10, 12]. The hidden state assignments computed at each level of the CGMM architecture can then be aggregated into a vectorial representation, such as the multiset structural fingerprint proposed in [13], which can be used to train a standard supervised learning model for prediction purposes. An alternative approach to produce unsupervised node representations focuses on maximizing the mutual information between local node representations and a global summary of the graph. In particular, Deep Graph Infomax (DGI) [87] leverages a more standard GCN-like neural architecture than CGMM, but trains the convolutions using an unsupervised cost function that maximizes the mutual information between local node embeddings x_v and global graph summaries s_g. The rationale is to seek node representations that also capture some high-level information content of the entire graph. To this end, it introduces a corruption function that generates a distorted version of a graph g. Then, a discriminator is trained to distinguish between the two graphs using a bilinear score on node and graph representations. Note that, in order to use the DGI approach, one needs to manually define the corruption function, which will inject random structural noise into the learning process and, potentially, a bias as well.
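The sketch below illustrates only the DGI-style bilinear scoring between node embeddings and a graph summary (mean-plus-sigmoid summary, row-shuffled embeddings as a stand-in for the corrupted graph, untrained weights); the summary, corruption and encoder choices here are illustrative assumptions, not the reference DGI setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_bilinear = rng.normal(scale=0.5, size=(d, d))    # parameters of the bilinear discriminator

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def summary(X):
    """Global graph summary s_g: here, a sigmoid of the mean node embedding."""
    return sigmoid(X.mean(axis=0))

def discriminator(x_v, s_g):
    """Bilinear score estimating whether node embedding x_v belongs to the graph summarized by s_g."""
    return sigmoid(x_v @ W_bilinear @ s_g)

X = rng.normal(size=(5, d))                        # node embeddings from some DGN encoder
X_corrupted = X[rng.permutation(X.shape[0])]       # row-shuffled embeddings, standing in for a corrupted graph
s = summary(X)

pos_scores = [discriminator(x, s) for x in X]            # pushed towards 1 during training
neg_scores = [discriminator(x, s) for x in X_corrupted]  # pushed towards 0 during training
```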

3.3 Graph Pooling

As observed at the beginning of this section, graph reduction mechanisms may play a fundamental role in ensuring that graph convolutional layers learn rich node representations, by reducing the over-smoothing effects, and in strengthening the scalability of the model, by progressively shrinking the graph and bringing further contexts nearer. To preserve a full parallel with convolutional neural networks for images, such graph reduction mechanisms are typically referred to with the term graph pooling. Note that some form of global pooling is always necessary, at some point, in graph classification or regression tasks, where we need to obtain a single fixed-size feature vector representing the whole graph to perform the final prediction. A standard way to do so is by a global sum-pooling of the embeddings of the nodes, which is also supported by the theoretical results of [92]. On the other hand, graphs are known to be characterized by the presence of community and modular structures at different levels of detail. These have been exploited to define finer-grained structure coarsening methods, which reduce the graphs by exploiting embedding and topological information.

DiffPool [95] has introduced the first trainable pooling mechanism, where a parameterized neural layer is used to learn a clustering of the current nodes based on their embeddings at the previous layer. Such clustering is realized by means of a convolutional layer, followed by a softmax to obtain a soft-membership matrix S^i that associates nodes with clusters

S^i = softmax(GNN(A^i, X^i)),

where A^i and X^i are the adjacency and encoding matrices of layer i. The S^i matrix is then used to recombine the current graph into one of reduced size:

X′^i = (S^i)^T X^i   and   A′^i = (S^i)^T A^i S^i.    (19)

The DiffPool layer is complemented with auxiliary target functions comprising a link prediction objective, which forces nearby nodes to be pooled together, and an entropy regularization term that makes the assignment matrix close to a one-hot encoding. While the operator is proposed to produce a reduction of the graph, in practice, since the cluster assignment needs to be soft to preserve differentiability, it results in dense, unreduced adjacency matrices. TopKPool [44] overcomes this DiffPool limitation by learning a vector p that is used to compute projection scores of the node embedding matrix using the dot product, i.e.

s^i = X^i p^i / ‖p^i‖.    (20)

Such scores are then used to select the indices of the top-ranking nodes and to slice the matrices of the original graph to retain only the entries corresponding to the top nodes. Node selection is made differentiable by means of a gating mechanism built on the projection scores. SAGPool [65], instead, extends TopKPool by computing the score vector as an attention score with a GCN [63]

s^i = σ(GCN(A^i, X^i))    (21)

and leveraging the same differentiable slicing operator. EdgePool [36] operates by targeting edges in place of nodes. These are ranked based on a parametric scoring function which takes in input the concatenated embeddings of the incident nodes, that is

s^i((v, u) ∈ Edg_g) = σ(w^T [x_v^i, x_u^i] + b).    (22)

EdgePool contracts every edge iteratively, in descending order of score, unless one of the incident nodes has already been identified with another node. Other approaches take a more topological perspective on the problem, by leveraging only the structure of the graph itself as well as its communities. Being non-adaptive, such mechanisms are not required to be differentiable and their results are not task-dependent. Hence, these methods are potentially reusable in multi-task scenarios. NMFPool [5] is an example of such an approach: it provides a soft node clustering using a non-negative factorization of the adjacency matrix, which can be used to coarsen the graph without affecting differentiability in the convolutional layers, achieving results on predictive tasks comparable to those of DiffPool.
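The following sketch illustrates the DiffPool-style coarsening of Eq. (19) under strong simplifying assumptions: the GNN producing the assignments is replaced by a single untrained propagation step followed by a linear map, and no auxiliary losses or training are included.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, c = 6, 4, 2                                   # nodes, feature size, number of clusters

def softmax(Z, axis=-1):
    E = np.exp(Z - Z.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                      # random symmetric toy adjacency
X = rng.normal(size=(n, h))                         # node embeddings from previous layers

# stand-in for GNN(A, X): one untrained propagation step, then a linear map to c clusters
W_pool = rng.normal(scale=0.5, size=(h, c))
S = softmax((A + np.eye(n)) @ X @ W_pool, axis=1)   # soft cluster-membership matrix (n x c)

X_pooled = S.T @ X                                  # pooled node features (c x h), as in Eq. (19)
A_pooled = S.T @ A @ S                              # pooled (dense) adjacency (c x c), as in Eq. (19)
```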

4 Discussion and Conclusions

Despite the booming recent research, the field of learning for graphs is well rooted in a long-standing, consolidated line of research whose early approaches date back to the end of the previous millennium. This chapter started by recovering such, often forgotten, literature and discussing the evolution of the field up to its most recent developments, falling within the umbrella term of deep learning for graphs. In doing so, we did not take the path of summarizing all recent works and research topics in the most exhaustive way, as numerous recent surveys [15, 21, 46, 53, 91, 97, 98] and tutorial papers [7, 84] already serve this purpose. Rather, we focused on describing the main aggregation mechanisms for learning to represent structured data, namely spectral and spatial graph convolutions, contextual/layered aggregation and node embeddings, complementing them with a view on graph reduction techniques (pooling), as they are useful operators to instil diversity in the graph representation layers. To supplement this exposition of consolidated research, in the following we discuss some of the main research challenges and applications for the field.

One of the most active lines of recent research deals with learning to generate graphs. Sampling a graph g is an articulated endeavour which requires learning either the underlying distribution P(g) or an appropriate sampling process. Clearly, since graph structures are discrete, combinatorial and of variable size, gradient-based approaches are not trivially applicable. Thus, the generative process is typically conditioned on a latent vectorial representation of a graph from which the actual structure is decoded. In practice, many works address the problem by sampling the adjacency matrix, which somehow constrains the structural freedom allowed in the target graphs; for instance, adjacency matrices assume a specific node ordering. Examples in this sense are [64, 81], where each entry of the adjacency matrix is sampled with a


(predicted) probability, learned by a process which performs an approximate graph matching between the predicted and the ground-truth matrices in the training set. In [27], instead, the sampling process is made differentiable using a Gumbel-Softmax reparameterization. An alternative approach is to generate the graph at the node level. This is typically performed by generating a candidate set of node representations and using this information to condition a sampling process that predicts the adjacency matrix [88]. A computationally efficient approach to the problem is to model the generative process as a sequence of actions, namely the creation of a specific node and the generation of an edge between two nodes [68, 96]. Since the problem is sequential, it can be addressed by classical recurrent neural networks. However, for the same reason it assumes the availability of a fixed node ordering. In practice, works in the literature have shown that these models are able to generalize well to graphs coming from very different distributions, with minimal impact of the ordering policy [8, 9].

Another interesting, and relatively new, line of research concerns adversarial attacks on graphs. Neural networks for graphs have been shown to be subject to adversarial attacks [100], just as happens with any other data type. At the same time, generating adversarial samples for graphs is made more complex by the discrete nature of the attack, which, for instance, requires operating with non-differentiable arc insertions and deletions [93]. Other approaches attempt attacks on graphs by adding adversarial noise to the node representations [61].

Much of the attention on deep learning for graphs is fueled by the potential impact of its applications. The most widespread and consolidated application, in this sense, is certainly related to chemistry, whose early works date back 20 years [19]. Predictive tasks in this field concern learning a mapping between molecular structures and outcomes of interest, also saving the conspicuous computing time required by classical cheminformatics simulation methods [46], as well as finding structural similarities among compounds [31, 60]. Another interesting application concerns computational drug design, such as drug side-effect identification [99] and drug discovery using learning models to search for molecules with a desired set of chemical properties [62, 69, 77]. This latter task is also a typical validation field for the graph generation models described above [76]. Besides the consolidated applications to chemistry, deep learning for graphs seems likely to have impactful future applications in software analytics and recommender systems. The former leverages structured representations of code and systems, such as augmented abstract syntax trees [58] or control flow graphs [70]. In recommender systems, graphs are a natural candidate to encode the relations between users and the items to recommend, and learning models are being used to predict recommendations (new edges) from known relationships (a graph of known associations) [74, 94].

A discussion would not be complete without mentioning the main open challenges for the field. In this respect, one of the most pressing issues the community is facing is the definition of a set of rich and robust benchmarks to test and assess deep learning models for graphs in fair, consistent and reproducible conditions.
Many of the latest models have been assessed on benchmark datasets but without a shared experimental setup, in radically different model selection conditions, sometimes without relying on


those good practices that ensure robust generalization testing. Some works [33, 80] are already bringing to the attention of the community troubling trends and pitfalls concerning the datasets and methodologies used to assess such models in the literature, and are proposing robust and standardized frameworks for model validation [33]. A second challenge pertains to more formal and theoretical aspects. On the one hand, there seems to be a need for a unified framework for the description of the different adaptive graph processing models, in order to make it easier to highlight similarities, differences and novelties of the different approaches. On the other hand, there is growing interest in works [92] that tackle the theoretical and expressiveness properties of deep learning models for graphs.

Acknowledgements D. Bacciu would like to acknowledge support from the Italian Ministry of Education, University, and Research (MIUR) under project SIR 2014 LIST-IT (grant n. RBSI14STDE).

References 1. Bacciu, D.: Hidden tree markov networks: Deep and wide learning for structured data. In: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8 (2017) 2. Bacciu, D., Bruno, A.: Text summarization as tree transduction by top-down treelstm. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1411–1418 (2018) 3. Bacciu, D., Bruno, A.: Deep tree transductions - a short survey. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds.) Recent Advances in Big Data and Deep Learning, pp. 236– 245. Springer International Publishing, Cham (2020) 4. Bacciu, D., Castellana, D.: Bayesian mixtures of hidden tree markov models for structured data clustering. Neurocomputing 342, 49–59 (2019). Advances in artificial neural networks, machine learning and computational intelligence 5. Bacciu, D., Di Sotto, L.: A non-negative factorization approach to node pooling in graph convolutional neural networks. In: M. Alviano, G. Greco, F. Scarcello (eds.) AI*IA 2019 – Advances in Artificial Intelligence, pp. 294–306. Springer International Publishing (2019) 6. Bacciu, D., Errica, F., Micheli, A.: Contextual graph markov model: a deep and generative approach to graph processing. In: ICML (2018) 7. Bacciu, D., Errica, F., Micheli, A., Podda, M.: A gentle introduction to deep learning for graphs. Submitted. arXiv:1912.12693 (2019) 8. Bacciu, D., Micheli, A., Podda, M.: Graph generation by sequential edge prediction. In: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN’19). i6doc.com (2019) 9. Bacciu, D., Micheli, A., Podda, M.: Edge-based sequential graph generation with recurrent neural networks. Neurocomputing (2020) 10. Bacciu, D., Micheli, A., Sperduti, A.: Compositional generative mapping for tree-structured data - part I: bottom-up probabilistic modeling of trees. IEEE Trans. Neural Netw. Learn. Syst. 23(12), 1987–2002 (2012) 11. Bacciu, D., Micheli, A., Sperduti, A.: Compositional generative mapping for tree–structured data-part ii: Topographic projection model. IEEE Trans. Neural Netw. Learn. Syst. 24(2), 231–247 (2012) 12. Bacciu, D., Micheli, A., Sperduti, A.: An input-output hidden markov model for tree transductions. Neurocomputing 112, 34–46 (2013). Advances in artificial neural networks, machine learning, and computational intelligence


13. Bacciu, D., Micheli, A., Sperduti, A.: Generative kernels for tree-structured data. IEEE Trans. Neural Netw. Learn. Syst. 29(10), 4932–4946 (2018) 14. Baldi, P., Pollastri, G.: The principled design of large-scale recursive neural network architectures-dag-rnns and the protein structure prediction problem. J. Mach. Learn. Res. 4, 575–602 (2003) 15. Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al.: Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261 (2018) 16. Bernazzani, L., Duce, C., Micheli, A., Mollica, V., Sperduti, A., Starita, A., Tiné, M.R.: Predicting physical- chemical properties of compounds from molecular structures by recursive neural networks. J. Chem. Inf. Model. 46(5), 2030–2042 (2006) 17. Bertinetto, C.G., Duce, C., Micheli, A., Solaro, R., Tiné, M.R.: Qspr analysis of copolymers by recursive neural networks: prediction of the glass transition temperature of (meth) acrylic random copolymers. Mol. Inform. 29(8–9), 635–643 (2010) 18. Bianchini, M., Gori, M., Sarti, L., Scarselli, F.: Recursive processing of cyclic graphs. IEEE Trans. Neural Netw. 17(1), 10–18 (2006) 19. Bianucci, A., Micheli, A., Sperduti, A., Starita, A.: Application of cascade correlation networks for structures to chemistry. Appl. Intell. 12(1/2), 117–146 (2000) 20. Blackledge, J.M.: 2d fourier theory (Chapter 2). In: Blackledge J.M. (ed.) Digital Image Processing. Woodhead Publishing Series in Electronic and Optical Materials, pp. 30–49. Woodhead Publishing, Sawston. http://www.sciencedirect.com/science/article/pii/ B9781898563495500021 (2005). https://doi.org/10.1533/9780857099464.1.30 21. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34(4), 25. 18–42 (2017) 22. Bruna, J., Zaremba, W., Szlam, A., Lecun, Y.: Spectral networks and locally connected networks on graphs. In: International Conference on Learning Representations (ICLR2014), CBLS (2014) 23. Castellana, D., Bacciu, D.: Bayesian tensor factorisation for bottom-up hidden tree markov models. In: International Joint Conference on Neural Networks, IJCNN 2019 Budapest, Hungary, July 14–19, 2019, pp. 1–8 (2019) 24. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Gated feedback recurrent neural networks. In: International Conference on Machine Learning, pp. 2067–2075 (2015) 25. Computational Intelligence and Machine Learning group, University of Pisa. http://groups. di.unipi.it/groups/ciml 26. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. The MIT Press, New York (1996) 27. De Cao, N., Kipf, T.: MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models (2018) 28. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems, pp. 3844–3852 (2016) 29. Diligenti, M., Frasconi, P., Gori, M.: Hidden tree markov models for document image classification. IEEE Trans. Pattern Anal. Mach. Intell. 25(4), 519–523 (2003) 30. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., AspuruGuzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular fingerprints. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett R. (eds.) 
Advances in Neural Information Processing Systems 28, pp. 2224–2232. Curran Associates, Inc. http://papers.nips.cc/paper/5954-convolutional-networks-on-graphsfor-learning-molecular-fingerprints.pdf (2015) 31. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P.: Convolutional Networks on Graphs for Learning Molecular Fingerprints. In: Advances in Neural Information Processing Systems 28, pp. 2224–2232. Curran Associates, Inc. (2015)


32. Elman, J.L.: Finding structure in time. Cogn. Sci. 14, 179–211 (1990) 33. Errica, F., Podda, M., Bacciu, D., Micheli, A.: A fair comparison of graph neural networks for graph classification. In: International Conference on Learning Representations (2020) 34. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Touretzky, D. (ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 524–532. Morgan Kaufmann, San Mateo (1990) 35. Frasconi, P., Gori, M., Sperduti, A.: A general framework for adaptive processing of data structures. IEEE Trans. Neural Netw. 9(5), 768–786 (1998) 36. Frederik Diehl Thomas Brunner, M.T.L., Knoll, A.: Towards graph pooling by edge contraction. In: ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Data (2019) 37. Gallicchio, C., Micheli, A.: Graph echo state networks. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2010) 38. Gallicchio, C., Micheli, A.: Architectural and Markovian factors of echo state networks. Neural Netw. 24(5), 440–456 (2011) 39. Gallicchio, C., Micheli, A.: Tree echo state networks. Neurocomputing 101, 319–337 (2013) 40. Gallicchio, C., Micheli, A.: Deep reservoir neural networks for trees. Inf. Sci. 480, 174–193 (2019) 41. Gallicchio, C., Micheli, A.: Fast and deep graph neural networks. In: Proceedings of the of AAAI 2020 (2020). Accepted 42. Gallicchio, C., Micheli, A., Pedrelli, L.: Deep reservoir computing: a critical experimental analysis. Neurocomputing 268, 87–99 (2017). https://doi.org/10.1016/j.neucom.2016.12.089 43. Gallicchio, C., Micheli, A., Pedrelli, L.: Design of deep echo state networks. Neural Netw. 108, 33–47 (2018) 44. Gao, H., Ji, S.: Graph u-nets. In: International Conference on Machine Learning, pp. 2083– 2092 (2019) 45. Gianniotis, N., Tino, P.: Visualization of tree-structured data through generative topographic mapping. IEEE Trans. Neural Netw. 19(8), 1468–1493 (2008) 46. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 1263–1272. PMLR (2017) 47. Glen, R., Bender, A., Hasselgren, C., Carlsson, L., Boyer, S., Smith, J.: Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to adme (vol. 9, p. 199, 2006). IDrugs: Investig. Drugs J. 9, 311 (2006) 48. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT Press, Cambridge (2016) 49. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: Lstm: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2016) 50. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM (2016) 51. Hagenbuchner, M., Sperduti, A., Tsoi, A.: A self-organizing map for adaptive processing of structured data. IEEE Trans. Neural Netw. 14(3), 491–505 (2003) 52. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS (2017) 53. Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40(3), 52–74 (2017) 54. Hammer, B.: Learning with Recurrent Neural Networks. Springer Lecture Notes in Control and Information Sciences, vol. 254. Springer, Berlin (2000) 55. 
Hammer, B., Micheli, A., Sperduti, A.: Universal approximation capability of cascade correlation for structures. Neural Comput. 17(5), 1109–1159 (2005) 56. Hammer, B., Micheli, A., Sperduti, A., Strickert, M.: Recursive self-organizing network models. Neural Netw. 17(8–9), 1061–1085 (2004) 57. Hammond, D.K., Vandergheynst, P., Gribonval, R.: Wavelets on graphs via spectral graph theory. Appl. Comput. Harmon. Anal. 30(2), 129–150 (2011)


58. Iadarola, G.: Graph-based classification for detecting instances of bug patterns. Master’s thesis, University of Twente (2018) 59. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004) 60. Jeon, W., Kim, D.: FP2VEC: a new molecular featurizer for learning molecular properties. Bioinformatics 35(23), 4979–4985 (2019). https://doi.org/10.1093/bioinformatics/btz307 61. Jin, H., Zhang, X.: Latent adversarial training of graph convolution networks. In: ICML Workshop on Learning and Reasoning with Graph-Structured Representations (2019) 62. Jin, W., Barzilay, R., Jaakkola, T.S.: Junction tree variational autoencoder for molecular graph generation. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, pp. 2328–2337 (2018) 63. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017) 64. Kwon, Y., Yoo, J., Choi, Y.S., Son, W.J., Lee, D., Kang, S.: Efficient learning of nonautoregressive graph variational autoencoders for molecular graph generation. J. Cheminformatics 11, 70 (2019). https://doi.org/10.1186/s13321-019-0396-x 65. Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. In: International Conference on Machine Learning, pp. 3734–3743 (2019) 66. Li, Q., Han, Z., Wu, X.M.: Deeper insights into graph convolutional networks for semisupervised learning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 67. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv:1511.05493 (2015) 68. Li, Y., Vinyals, O., Dyer, C., Pascanu, R., Battaglia, P.W.: Learning deep generative models of graphs. arXiv:1803.03324 (2018) 69. Liu, Q., Allamanis, M., Brockschmidt, M., Gaunt, A.: Constrained graph variational autoencoders for molecule design. In: Advances in Neural Information Processing Systems 31, 7795–7804 (2018) 70. Massarelli, L., Di Luna, G.A., Petroni, F., Baldoni, R., Querzoni, L.: Safe: Self-attentive function embeddings for binary similarity. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 309–329. Springer (2019) 71. Micheli, A.: Neural network for graphs: a contextual constructive approach. IEEE Trans. Neural Netw. 20(3), 498–511 (2009) 72. Micheli, A., Sona, D., Sperduti, A.: Contextual processing of structured data by recursive cascade correlation. IEEE Trans. Neural Netw. 15(6), 1396–1410 (2004) 73. Micheli, A., Sperduti, A., Starita, A., Bianucci, A.: Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. J. Chem. Inf. Comput. Sci. 41(1), 202–218 (2001) 74. Monti, F., Bronstein, M.M., Bresson, X.: Geometric matrix completion with recurrent multigraph neural networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 3700–3710 (2017) 75. Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In: Proceedings of the 33rd International Conference on Machine Learning (2016) 76. Podda, M., Bacciu, D., Micheli, A.: A deep generative model for fragment-based molecule generation. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) (2020) 77. 
Samanta, B., De, A., Jana, G., Chattaraj, P.K., Ganguly, N., Rodriguez, M.G.: Nevae: A deep generative model for molecular graphs. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, pp. 1110–1117 (2019). https://doi.org/10.1609/aaai.v33i01.33011110 78. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1) (2009) 79. Scarselli, F., Tsoi, A.C., Hagenbuchner, M.: The vapnik-chervonenkis dimension of graph and recursive neural networks. Neural Netw. 108, 248–259 (2018) 80. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. In: Relational Representation Learning Workshop, NeurIPS (2018)


81. Simonovsky, M., Komodakis, N.: Graphvae: Towards generation of small graphs using variational autoencoders. In: Artificial Neural Networks and Machine Learning - ICANN 2018 - 27th International Conference on Artificial Neural Networks, pp. 412–422 (2018). https://doi.org/10.1007/978-3-030-01418-6_41 82. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642 (2013) 83. Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 8(3) (1997) 84. Stankovic, L., Mandic, D., Dakovic, M., Brajovic, M., Scalzo, B., Li, S., Constantinides, A.G.: Graph signal processing – part iii: machine learning on graphs, from graph topology to applications (2020) 85. Tai, K., Socher, R., Manning, C.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1556–1566. Association for Computational Linguistics (2015) 86. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. In: International Conference on Learning Representations (ICLR) (2018) 87. Veliˇckovi´c, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. In: International Conference on Learning Representations (2019) 88. Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., Guo, M.: Graphgan: Graph representation learning with generative adversarial nets. In: Proceedings of the ThirtySecond AAAI Conference on Artificial Intelligence, pp. 2508–2515 (2018) 89. Wang, Z., Hagenbuchner, M., Tsoi, A.C., Cho, S.Y., Chi, Z.: Image classification with structured self-organization map. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), vol. 2, 1918–1923 (2002) 90. Weininger, D., Weininger, A., Weininger, J.L.: Smiles. 2. algorithm for generation of unique smiles notation. J. Chem. Inf. Comput. Sci. 29(2), 97–101 (1989) 91. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks (2019) 92. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: ICLR (2019) 93. Yang, L., Kang, Z., Cao, X., Jin, D., Yang, B., Guo, Y.: Topology optimization based graph convolutional network. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pp. 4054–4061. AAAI Press. http://dl.acm.org/citation.cfm? id=3367471.3367605 (2019) 94. Yin, R., Li, K., Zhang, G., Lu, J.: A deeper graph neural network for recommender systems. Knowl. Based Syst. 185, 105020 (2019). https://doi.org/10.1016/j.knosys.2019.105020 95. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., Leskovec, J.: Hierarchical graph representation learning with differentiable pooling. In: Advances in Neural Information Processing Systems 31 (2018) 96. You, J., Ying, R., Ren, X., Hamilton, W.L., Leskovec, J.: GraphRNN: Generating realistic graphs with deep auto-regressive models. In: ICML (2018) 97. Zhang, S., Tong, H., Xu, J., Maciejewski, R.: Graph convolutional networks: a comprehensive review. Comput. Soc. Netw. 6(1), 11 (2019) 98. Zhang, Z., Cui, P., Zhu, W.: Deep learning on graphs: a survey. 
arXiv:1812.04202 (2018) 99. Zitnik, M., Agrawal, M., Leskovec, J.: Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13), i457–i466 (2018). https://doi.org/10.1093/ bioinformatics/bty294 100. Zügner, D., Akbarnejad, A., Günnemann, S.: Adversarial attacks on neural networks for graph data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pp. 2847–2856. ACM (2018). https://doi.org/10.1145/ 3219819.3220078

Limitations of Shallow Networks

Věra Kůrková

Abstract Although originally biologically inspired neural networks were introduced as multilayer computational models, shallow networks have been dominant in applications till the recent renewal of interest in deep architectures. Experimental evidence and successful applications of deep networks pose theoretical questions asking: When and why are deep networks better than shallow ones? This chapter presents some probabilistic and constructive results on limitations of shallow networks. It shows implications of geometrical properties of high-dimensional spaces for probabilistic lower bounds on network complexity. The bounds depend on covering numbers of dictionaries of computational units and sizes of domains of functions to be computed. Probabilistic results are complemented by constructive ones built using Hadamard matrices and pseudo-noise sequences.

1 Introduction

Originally, biologically inspired neural networks were introduced as multilayer computational models, but later one-hidden-layer (shallow) architectures became dominant in applications (see, e.g., [18, 31] and the references therein). Although multilayer networks with sigmoidal and convolutional units used as filters were proposed for pattern recognition tasks by LeCun [41, 42] already in the 1990s, their training by back-propagation was inefficient till the advent of fast graphics processing units (GPUs). While the development of GPUs was commercially motivated by computer games, they enabled the revival of interest in multilayer architectures. Around 2006, a group of researchers from the Canadian Institute for Advanced Research (Bengio, Hinton, LeCun) exploited them in training networks with several convolutional and pooling layers (see, e.g., the survey article [43]). These networks were


called deep [10, 20] to distinguish them from shallow ones with merely one hidden layer. Currently, deep networks are the state of the art in areas such as text classification, musical genre recognition, speech recognition, time-series prediction, object detection, localization, video and tomography image recognition, biomedical image analysis, hyperspectral image analysis, and, in combination with tree search, in automatic game playing (AlphaGO).

While experimental research on deep networks is rapidly evolving, theoretical analysis complementing the empirical evidence is still in its early stages. There are fundamental wide open questions related to the role of depth of network architectures: Why should deep networks be better than shallow ones and under which conditions? Bengio and LeCun, who revived the interest in deep networks, conjectured that "most functions that can be represented compactly by deep architectures cannot be represented by a compact shallow architecture" [9]. However, reservations about an overall lower complexity of deep networks compared to shallow ones have appeared. An empirical study demonstrated that shallow networks can learn some functions previously learned by deep ones using the same numbers of parameters as the original deep networks [4]. Mhaskar et al. [50] suggested that due to their hierarchical structure, deep networks could outperform shallow networks in visual recognition of pictures with objects of different scales.

Characterization of functions which can be computed by deep networks of smaller model complexity than shallow ones can be derived by comparing lower bounds on numbers of units in shallow networks with upper bounds on numbers of units in deep ones. It has long been known that under mild conditions on types of computational units, shallow networks have the universal representation property, i.e., they can exactly compute any real-valued function on a finite domain [22]. However, the arguments proving this property assume that the number of units in the last hidden layer is potentially as large as the size of the domain. Obviously, not all functions require networks with such high numbers of units. For shallow networks, various upper bounds on numbers of hidden units needed for a given approximation accuracy, in dependence on their types, input dimensions, and types of functions to be computed, are known (see, e.g., [23] and the references therein).

Derivation of lower bounds is much more difficult than derivation of upper ones. Poggio et al. [53] proposed, as a potential tool for the comparison of deep and shallow networks, an application of the topological approach from [14] for obtaining lower bounds on complexity of shallow networks exhibiting the "curse of dimensionality" (i.e., an exponential dependence on the number of parameters [8]). However, applicability of topological methods is limited only to classes of networks where best or near best approximation of functions can be obtained by a continuous selection of network parameters. We proved in [28–30] that in many common classes of networks such continuous selection is not possible due to their nonlinear and non-convex nature. Other lower bounds hold merely for types of computational units that are not commonly used, such as perceptrons with specially designed activation functions [47], or merely prove existence of worst-case errors in Sobolev spaces asymptotically [46].


In this chapter, we survey recent results on complexity and sparsity of shallow networks. Minimization of the "l0-pseudonorm", which formalizes the concept of network sparsity measured by the number of hidden units in a shallow network, is a difficult non-convex optimization problem. Thus we focus on investigation of minima of l1-norms of output-weight vectors. We present several arguments showing that the l1-norm is a good approximation of the "l0-pseudonorm" (it approximates its convexification, can be used as a stabilizer in weight-decay regularization [18], and is related to the variational norm tailored to a dictionary of computational units).

In practical applications, feedforward networks compute functions on finite domains (formed, e.g., by pixels of pictures, discretized cubes, or scattered vectors of data), which are often quite large. Functions on finite domains form linear spaces which are isomorphic to Euclidean spaces of dimensions equal to the sizes of the domains. Geometry of high-dimensional spaces has many counter-intuitive features, which have consequences for correlations between functions on large domains. We show that the combination of the concentration of measure property of high-dimensional spaces with a characterization of dictionaries of computational units in terms of their capacity and coherence, described by their covering numbers, leads to lower bounds on variational norms and l1-norms of output-weight vectors of shallow networks. Applying these estimates to dictionaries with power-type covering numbers, we conclude that computation of almost any uniformly randomly chosen function on a large domain either requires a large number of units or is unstable, as some output weights are large. Finally, we illustrate the probabilistic results by a concrete construction of a class of functions, induced by matrices, which have large variational norms with respect to the dictionary of signum perceptrons [36].

The chapter is organized as follows. Section 2 contains basic concepts and notations on feedforward networks and dictionaries of computational units. In Sect. 3, various measures of network sparsity and their relationships are studied. In Sect. 4, properties of high-dimensional spaces are applied to obtain estimates of correlations between functions to be computed and computational units. In Sect. 5, lower bounds on variational and l1-norms formulated in terms of covering numbers of dictionaries and sizes of the domains are derived. In Sect. 6, some estimates of sizes of dictionaries of computational units popular in neurocomputing are presented. In Sect. 7, probabilistic results are complemented by constructive ones. Section 8 contains some examples and Sect. 9 is a brief discussion.

2 Preliminaries

For X ⊂ R^d, we denote by F(X) := { f | f : X → R } the set of all real-valued functions on X. In practical applications, domains X ⊂ R^d are finite, but their sizes card X and/or input dimensions d can be quite large.


Fixing a linear ordering {x_1, ..., x_m} of the elements of X, we define an isomorphism ι : F(X) → R^m as ι(f) := (f(x_1), ..., f(x_m)) and thus identify F(X) with the finite-dimensional Euclidean space R^m. On F(X) we denote the induced inner product by

⟨f, g⟩ := Σ_{u∈X} f(u) g(u),

the Euclidean norm ‖f‖_2 := √⟨f, f⟩, and by S_1(X) the unit sphere in F(X),

S_1(X) := { f ∈ F(X) | ‖f‖_2 = 1 }.

By B(X) := { f | f : X → {−1, 1} } we denote the subset of F(X) formed by functions with values in {−1, 1}. For any norm or "pseudonorm" ‖·‖ on R^n or F(X), we denote by B_r(‖·‖) = { w ∈ R^n | ‖w‖ ≤ r } the ball of radius r in ‖·‖.

The set of input-output functions of a feedforward network with a single linear output has the form

span G := { Σ_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G, n ∈ N },

where w = (w_1, ..., w_n) is the vector of output weights and G is a parameterized family of functions called a dictionary. The dictionary depends on the network architecture and the types of computational units. The simplest architecture is a shallow (one-hidden-layer) network, where G is a parameterized family of functions computable by a given type of computational units. In the case of a deep network with several hidden layers, G is formed by combinations and compositions of functions representing units from lower layers. Formally, a dictionary can be described as

G(X) = G_φ(X, Y) := { φ(·, y) : X → R | y ∈ Y },

where φ : X × Y → R is a function of two variables: an input vector x ∈ X ⊆ R^d and a parameter vector y ∈ Y ⊆ R^s. Popular computational units are perceptrons, which compute functions of the form ψ(v · x + b), where v ∈ R^d is a weight vector, b ∈ R a bias, and ψ : R → R is an activation function (such as Heaviside, sigmoidal, or rectified linear). By

span_n G := { Σ_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G }


we denote the set of functions computable by networks with at most n units in the last hidden layer. Sets of the form span_n G are invariant under multiplication by scalars, i.e., c span_n G = span_n G for all c ∈ R. As ‖cf − span_n G‖ = c ‖f − span_n G‖ for all c > 0, with proper choices of scalars one can obtain examples of functions with arbitrarily large or small errors in approximation by the sets span_n G in any norm ‖·‖. So approximation and representation of functions by networks with a linear output have to be studied for cases when the functions to be approximated and the functions from G have the same norms, e.g., when all functions are normalized or, in the case of binary classification, when they have values in {−1, 1} rather than in {0, 1}.
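To make the notation concrete, here is a minimal NumPy sketch that builds an element of span_n G for a dictionary of perceptrons with the sign activation on a small finite domain; the particular domain, weights, and helper name are illustrative choices of ours, not from the chapter.

import numpy as np

# finite domain X = {x_1, ..., x_m} in R^d, here the 3-dimensional Boolean cube
d = 3
X = np.array([[int(b) for b in format(i, f"0{d}b")] for i in range(2 ** d)])  # (m, d)

def sign_perceptron(v, b):
    """Dictionary element g(x) = sgn(v . x + b), identified with its
    value vector (g(x_1), ..., g(x_m)) in R^m."""
    return np.sign(X @ v + b)

# a shallow network with n = 2 hidden units: f = w_1 g_1 + w_2 g_2 in span_2 G
g1 = sign_perceptron(np.array([1.0, -1.0, 0.5]), b=-0.2)
g2 = sign_perceptron(np.array([-0.5, 2.0, 1.0]), b=0.1)
w = np.array([0.7, -1.3])                 # output weights
f = w[0] * g1 + w[1] * g2                 # element of F(X), identified with R^m
print(f)                                  # the vector (f(x_1), ..., f(x_m))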

3 Approximate Measures of Sparsity

It has long been known that many feedforward networks have the universal representation property, i.e., they can exactly compute any function on a finite domain. Ito [22] proved the following sufficient condition on a dictionary of computational units that guarantees that shallow networks with units from the dictionary have the universal representation property.

Theorem 1 Let m be a positive integer, X = {x_1, ..., x_m} ⊂ R^d, and G_φ(X, Y) = { φ(·, y) : X → R | y ∈ Y } be such that there exist y_1, ..., y_m ∈ Y for which the m × m square matrix Φ defined as Φ_{i,j} = φ(x_i, y_j) is regular. Then F(X) = span_m G_φ(X, Y).

Regularity of the matrix Φ implies that for any f : {x_1, ..., x_m} → R, the family of m linear equations

f(x_i) = Σ_{j=1}^{m} w_j φ(x_i, y_j),  i = 1, ..., m     (1)

with m unknowns has a solution. Any solution (w_1, ..., w_m) can be used as an output-weight vector of a representation of f as an input-output function of a network with units from G_φ of the form

f(x) = Σ_{i=1}^{m} w_i φ(x, y_i).     (2)
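A minimal NumPy sketch of how Theorem 1 and Eqs. (1)–(2) yield an exact representation: for a Gaussian kernel unit (a positive definite kernel) with parameters y_i = x_i, the matrix Φ is regular, so the m × m linear system can be solved directly. The bandwidth and random data below are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 2
X = rng.normal(size=(m, d))              # finite domain {x_1, ..., x_m}
f = rng.normal(size=m)                   # arbitrary target values f(x_i)

def phi(x, y, width=1.0):
    """Gaussian computational unit phi(x, y) = exp(-||x - y||^2 / width)."""
    return np.exp(-np.sum((x - y) ** 2) / width)

# matrix Phi_{i,j} = phi(x_i, y_j) with parameters y_j = x_j
Phi = np.array([[phi(xi, yj) for yj in X] for xi in X])

w = np.linalg.solve(Phi, f)              # output weights solving Eq. (1)
assert np.allclose(Phi @ w, f)           # the network of Eq. (2) computes f exactly
print(w)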

Ito [22] verified that shallow networks with sigmoidal perceptrons satisfy the condition of Theorem 1 and thus have the universal representation property. It is easy to check that Theorem 1 also implies that this property is possessed by shallow networks


with any positive definite kernel (e.g., Gaussian, Laplace). Positive definiteness of a kernel guarantees that the matrix induced by the kernel with x_i = y_i, i = 1, ..., m, is regular.

The parameters y_1, ..., y_m for which the matrix Φ is regular, as well as the solution w_1, ..., w_m of the family of m linear equations (1), need not be unique. Thus there might exist many representations of a function f as an input-output function of a shallow network with units from G_φ. However, potentially all w_1, ..., w_m might be nonzero. Thus for large domains X, networks whose existence is guaranteed by universality results such as Theorem 1 might be too large for efficient implementations.

Many dictionaries popular in neurocomputing are linearly independent on infinite domains (see, e.g., [1, 26, 27, 39, 57]). Representations of functions as input-output functions of shallow networks with units from such dictionaries are unique up to permutations of hidden units and, in some cases, also sign-flips. In contrast, such dictionaries restricted to finite domains typically are linearly dependent. The condition of being equal on the whole R^d or on a sufficiently large compact subset of it is much stronger than the condition requiring equality merely on a finite discrete subset. In some literature, dictionaries which are not linearly independent are called overcomplete. Such dictionaries allow multiple representations of functions. For a function f ∈ F(X) and a dictionary G, we denote by

W_f(G) := { w = (w_1, ..., w_n) ∈ R^n | f = Σ_{i=1}^{n} w_i g_i, g_i ∈ G, n ∈ N }     (3)

the set of output-weight vectors of shallow networks with units from G representing f. When G induces a class of shallow networks having the universal representation capability, then the sets W_f(G) are nonempty for all f ∈ F(X). It follows from the definition that the sets W_f(G) are convex.

Proposition 1 Let X ⊂ R^d, G ⊂ F(X), and f ∈ F(X). Then W_f(G) is convex.

It is desirable to find, among all representations of f as an input-output function of a shallow network with units from G, the most sparse ones, i.e., to find in the set W_f(G) vectors with the smallest number of nonzero entries. Formally, for a vector w ∈ R^n, the number of its nonzero entries is denoted ‖w‖_0. It is called the "l0-pseudonorm", in quotation marks, as it is neither a norm nor a pseudonorm. It satisfies the triangle inequality, but it does not satisfy the homogeneity condition, which requires |λ| ‖w‖ = ‖λw‖ for all λ ∈ R. The values of ‖·‖_0 are only integers and its "balls" are not convex. The "l0-pseudonorm" satisfies the equation

‖w‖_0 = Σ_{i=1}^{n} w_i^0  (with the convention 0^0 = 0)


and it is the limit of the l_p-functionals as p tends to zero:

‖w‖_0 = lim_{p→0^+} ‖w‖_p^p.

W_f(G) is convex, and a continuous norm achieves its minimum on it (cf. Proposition 3 below), but ‖·‖_0 is not continuous. Minimization of the "l0-pseudonorm" is a difficult non-convex problem which has been studied in signal processing (see, e.g., [15, 16]). It was proven that in some cases it is NP-hard [60]. Due to its non-homogeneity, the "l0-pseudonorm" is invariant under multiplication by scalars. In contrast, any norm can be made arbitrarily large or small by multiplying a function by a suitable scalar. Thus investigation of relationships of the "l0-pseudonorm" to various norms makes sense only for functions from restricted ambient sets. The following proposition from [52] shows that when the ambient set is the unit ball in the l2-norm, then the ball of radius √r in the l1-norm (a hyperoctahedron) is a good approximation of the convexification of the "ball" of radius r in the "l0-pseudonorm".

Proposition 2 For every positive integer m and every r > 0, the balls in ‖·‖_0, ‖·‖_1, and ‖·‖_2 in R^m satisfy

conv (B_r(‖·‖_0) ∩ B_1(‖·‖_2)) ⊆ B_{√r}(‖·‖_1) ∩ B_1(‖·‖_2) ⊆ 2 conv (B_r(‖·‖_0) ∩ B_1(‖·‖_2)).

In neurocomputing, instead of the "l0-pseudonorm", l1 and l2-norms have been used as stabilizers in weight-decay regularization techniques [18]. Acting as a stabilizer, the l2-norm penalizes even a small number of large output weights but it can tolerate many small ones, while l1-norm stabilizers penalize many small output weights as well as a few large ones. This can be illustrated by the simple example of a weight vector w ∈ R^m with w_i = c/m for all i = 1, ..., m. Then ‖w‖_1 = c, while ‖w‖_2 = c/√m. So their difference increases with growing dimension m. Networks with large l1-norms of output-weight vectors have either large numbers of units or some of their output weights are large. Neither of these properties is desirable: implementation of networks with large numbers of units might not be feasible, while large output weights might lead to instability of computation.

In addition to approximating the convexification of the "l0-pseudonorm" and penalizing many small output weights, the l1-norm has several other properties which make it a good approximate measure of sparsity. We denote by

W_f(G)*_1 := { w* ∈ W_f(G) | ‖w*‖_1 = min_{w∈W_f(G)} ‖w‖_1 }  and
W_f(G)*_2 := { w* ∈ W_f(G) | ‖w*‖_2 = min_{w∈W_f(G)} ‖w‖_2 }

the subsets of W_f(G) formed by output-weight vectors of minimal l1 and l2-norms, respectively.
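Returning to the weight-decay comparison above, a quick numeric check with the illustrative value c = 1:

import numpy as np

c = 1.0
for m in (10, 100, 10000):
    w = np.full(m, c / m)                 # the weight vector with w_i = c/m
    l1 = np.abs(w).sum()                  # ||w||_1 = c, independent of m
    l2 = np.sqrt((w ** 2).sum())          # ||w||_2 = c / sqrt(m), shrinks with m
    print(m, round(l1, 6), round(l2, 6))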


Proposition 3 Let X ⊂ R^d, f ∈ F(X), and G = {g_1, ..., g_k} ⊂ F(X). If W_f(G) is nonempty, then W_f(G)*_1 is nonempty and convex.

Proof The l1-norm is continuous and coercive and, G being finite, W_f(G) is a nonempty closed convex subset of R^k, so the minimum is attained and W_f(G)*_1 is nonempty. Its convexity follows from the definition. □

Note that the l2-norm does not satisfy an analogy of Proposition 3: the set W_f(G)*_2 of vectors with minimal l2-norm contains only one point. Indeed, if w_1 ≠ w_2 were two minimizers, the strict convexity of the l2-norm would give ‖a w_1 + (1 − a) w_2‖_2 < a ‖w_1‖_2 + (1 − a) ‖w_2‖_2 for all a ∈ (0, 1), contradicting minimality. Thus Proposition 3 shows another advantage of the l1-norm over the l2-norm.

Moreover, the minimal value of the l1-norm of an output-weight vector of a network computing a function f can be expressed in terms of a norm generated by the dictionary G, called G-variation. It is defined for a bounded subset G of a normed linear space (X, ‖·‖_X) as

‖f‖_G := inf { c ∈ R_+ | f/c ∈ cl_X conv (G ∪ −G) },

where −G := { −g | g ∈ G }, cl_X denotes the closure with respect to the topology induced by the norm ‖·‖_X, and conv is the convex hull. If the set over which the infimum is taken is empty, then ‖f‖_G := ∞. So G-variation is the Minkowski functional of its unit ball cl_X conv (G ∪ −G). Variation with respect to Heaviside perceptrons (called variation with respect to half-spaces) was introduced in [6] and extended to general dictionaries in [34]. It was shown in [37] that the infimum in the definition of G-variation can be replaced with a minimum.

The next proposition shows that ‖f‖_G bounds from below the l1-norms of output-weight vectors of networks with units from G computing f. Its proof follows directly from the definition.

Proposition 4 Let X ⊂ R^d, G be a bounded subset of F(X), and f ∈ F(X) be such that ‖f‖_G is finite. Then ‖f‖_G ≤ ‖w‖_1 for all w ∈ W_f(G). When G is finite, then ‖f‖_G = ‖w*‖_1 for all w* ∈ W_f(G)*_1.

Besides being a lower bound on the approximate measure of sparsity expressed in terms of the l1-norm, G-variation is also a critical factor in upper bounds on rates of approximation by networks with increasing "l0-pseudonorms" of output-weight vectors. The following theorem is a special case, holding for the Hilbert space F(X), of the Maurey–Jones–Barron Theorem [7] as reformulated in terms of a variational norm in [35, 37].


Theorem 2 Let X ⊂ R^d be finite, G be a subset of F(X), s_G = max_{g∈G} ‖g‖_2, and f ∈ F(X). Then for every n,

‖f − span_n G‖_2 ≤ s_G ‖f‖_G / √n.

By Theorem 2 there exist functions computable by shallow networks with at most n hidden units from the dictionary G (networks with output-weight vectors with "l0-pseudonorms" at most n) approximating f within s_G ‖f‖_G / √n.
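For a finite dictionary, the minimal l1-norm over W_f(G), which by Proposition 4 equals ‖f‖_G whenever f ∈ span G, can be computed as a linear program. The following sketch uses scipy.optimize.linprog and a random binary-valued dictionary; both are illustrative choices of ours.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, k = 10, 40                               # card X = m, card G = k
G = rng.choice([-1.0, 1.0], size=(m, k))    # columns are dictionary elements
f = G @ rng.normal(size=k)                  # a function in span G

# min ||w||_1  s.t.  G w = f, via the split w = u - v with u, v >= 0
c = np.ones(2 * k)
A_eq = np.hstack([G, -G])
res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * k))
w = res.x[:k] - res.x[k:]
print("min l1-norm over W_f(G):", np.abs(w).sum())   # = ||f||_G for finite G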

4 Correlation and Concentration of Measure

As mentioned above, l2-errors in approximation by families of the form span_n G can only be compared when the functions to be approximated and the elements of the dictionary G have the same l2-norms (e.g., when all functions are normalized). The Euclidean distance of normalized functions on the unit sphere S_1(X) in F(X) is related to the angular pseudometric ρ on S_1(X) defined as ρ(f, g) = arccos |⟨f, g⟩|. Note that ρ is not a metric, merely a pseudometric, because the distance between f and −f is zero. It is related to the l2-metric by the formula ‖f − g‖_2 = 2 sin(α/2), where ρ(f, g) = α. It can be described in terms of correlation, defined as the inner product ⟨f, g⟩. The more correlated functions are, the better they can approximate each other.

As F(X) is isometric to the Euclidean space R^{card X}, with increasing size of the domain X, effects of high-dimensional geometry become apparent. In particular, on high-dimensional spheres, inner products with any fixed function tend to concentrate around their median. Let

C(g, ε) = { f ∈ S^{m−1} | ⟨f, g⟩ ≥ ε }     (4)

denote the spherical cap centered at a fixed vector g, which contains all vectors f whose angular distance from g is at most α = arccos ε or, equivalently, whose inner product ⟨f, g⟩ is at least ε (see Fig. 1). Using classical calculus (integration in spherical polar coordinates), one can bound the relative area of the unit sphere S^{m−1} in the m-dimensional Euclidean space R^m which is occupied by the spherical cap:

μ(C(g, ε)) ≤ e^{−mε²/2}     (5)


Fig. 1 Spherical cap


(see, e.g., [5]). For a fixed angle α, with increasing dimension m the normalized surface area μ of such a cap decreases exponentially fast to zero. When ε is small, the complement of C(g, ε) ∪ C(−g, ε) contains vectors which are nearly orthogonal to g. The upper bound (5) implies that most of the area of a high-dimensional sphere is concentrated around its "equator". The exponential decrease of the sizes of "polar caps" (5) is the very essence of two properties of high-dimensional spaces called the curse of dimensionality and the blessing of dimensionality. While there are only m exactly orthogonal unit vectors in R^m, for a fixed ε > 0 the number of ε-quasiorthogonal vectors (with absolute values of inner products at most ε) grows with m exponentially. So the number of highly uncorrelated functions grows exponentially with the size of their domain.

On the other hand, (5) implies the phenomenon of concentration of measure. The upper bound (5) on the size of a spherical cap can be rephrased as follows: inner products of a fixed vector in the sphere S^{m−1} with uniformly randomly chosen vectors concentrate around zero. A generalization obtained by replacing the inner product with a sufficiently smooth (Lipschitz) function on the sphere leads to the Lévy Lemma [44]. It states that almost all values of a Lipschitz function on a high-dimensional sphere are close to their median. Recall that on a metric space (S, ρ) a function h : S → R is called Lipschitz with a constant c if |h(x) − h(y)| ≤ c ρ(x, y) for all x, y ∈ S. For a probability measure P on S^{m−1} and a function F : S^{m−1} → R, the median of F is defined as med(F) := sup{ t ∈ R | P[F(x) ≤ t] = 1/2 } and it satisfies P[F(x) < med(F)] = 1/2 and P[F(x) > med(F)] = 1/2 [49, p. 337].

Theorem 3 (Lévy Lemma) Let m be a positive integer, P be the uniform probability measure (normalized surface measure) on S^{m−1}, F : S^{m−1} → R be a function, and ε ∈ [0, 1]. Then
(i) for F continuous with modulus of continuity ω,

P[ |F(x) − med(F)| > ω(ε) ] ≤ 2 e^{−mε²/2};


(ii) for F 1-Lipschitz,

P[ |F(x) − med(F)| > ε ] ≤ 2 e^{−mε²/2}.

Note that the median of a Lipschitz function is related to its mean value E[F] = ∫_{S^{m−1}} F(x) dP(x) as follows [49, p. 338].

Proposition 5 Let m be a positive integer, P be the uniform probability measure (normalized surface measure) on S^{m−1}, and F : S^{m−1} → R be 1-Lipschitz. Then

|med(F) − E[F]| ≤ 12/√m.

Thus on high-dimensional spheres, Lipschitz functions are almost constant. This property of high-dimensional spheres is a special case of properties called the concentration of measure phenomenon. A similar property was also discovered in probability theory, where it has been studied in terms of bounds on large deviations of sums of random variables by Hoeffding [21], Chernoff [11], and Azuma [3]. Concentration of measure is also the essence of the proof of the Johnson–Lindenstrauss Flattening Lemma [49, p. 358]. It guarantees the possibility of dimension reduction of d-dimensional data by a random projection to a lower dimension bounded from below by (8/ε) log d such that the projection is a near-isometry (it preserves distances within a multiplicative factor 1 ± ε).
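A minimal Monte Carlo sketch of the concentration expressed by (5): sample uniformly random unit vectors in R^m and compare the empirical tail of their inner products with a fixed unit vector to the bound 2e^{−mε²/2}. The sample sizes and dimensions are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
eps, n_samples = 0.1, 10_000
for m in (50, 200, 800):
    g = np.zeros(m); g[0] = 1.0                       # a fixed unit vector
    F = rng.normal(size=(n_samples, m))
    F /= np.linalg.norm(F, axis=1, keepdims=True)     # uniform points on S^{m-1}
    tail = np.mean(np.abs(F @ g) > eps)               # empirical P[|<f,g>| > eps]
    bound = 2 * np.exp(-m * eps ** 2 / 2)
    print(m, tail, round(bound, 4))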

5 Probabilistic Lower Bounds on Approximate Measures of Sparsity

The concentration of measure phenomenon has implications for correlations of functions on large domains with functions from dictionaries of computational units. These correlations play a crucial role in estimating variational norms and the minima of l1-norms of output-weight vectors of networks. Here, we state a special case of a geometric characterization of G-variation proven in [37]. Its proof is based on the Hahn-Banach Theorem on the separation of a point from a convex set by a hyperplane. By G^⊥ we denote the orthogonal complement of G in the Hilbert space F(X).

Theorem 4 Let X be a finite subset of R^d and G be a bounded subset of F(X). Then for every f ∈ F(X) \ G^⊥,

‖f‖_G ≥ ‖f‖_2^2 / sup_{g∈G} |⟨g, f⟩|.

Theorem 4 gives a geometric insight into the concept of variational norm. It implies that functions which are "nearly orthogonal" to all elements of a dictionary have large variations and thus cannot be computed by networks having "small" l1-norms of output-weight vectors (see Fig. 2).


Fig. 2 Function nearly orthogonal to G


The next corollary gives an upper bound on the probability that a uniformly randomly chosen normalized function on X lies in the "polar cap" C(g, ε) = { f ∈ S_1(X) | ⟨f, g⟩ ≥ ε } of angle arccos ε around a given g in the space F(X) (we use the same notation C(g, ε) as for the "polar cap" in S^{m−1}).

Corollary 1 Let X ⊂ R^d be finite with card X = m, g ∈ S_1(X), and ε ∈ [0, 1]. Then for f uniformly randomly chosen in S_1(X),

P[ |⟨f, g⟩| > ε ] ≤ 2 e^{−mε²/2}.

Proof Let F_g : S_1(X) → R be defined as F_g(f) = ⟨f, g⟩. By the Cauchy–Schwarz inequality, |F_g(f_1) − F_g(f_2)| ≤ ‖g‖_2 ‖f_1 − f_2‖_2. Thus F_g is 1-Lipschitz. By symmetry, its median is zero. So the statement follows from the Lévy Lemma (Theorem 3). □

Corollary 1 shows that for a fixed normalized function g on a large domain X, most of the area of S_1(X) (the complement of the union of the two "polar caps" formed by functions close to g and −g, formally { f ∈ S_1(X) | ⟨f, g⟩ > ε } ∪ { f ∈ S_1(X) | ⟨f, −g⟩ > ε }) contains functions which are nearly orthogonal to g. It implies that if a dictionary G is not large enough to outweigh the factor 2e^{−mε²/2}, then most functions in S_1(X) have inner products with all elements of G at most ε. It means that for a large domain X, most functions in S_1(X) are nearly orthogonal to all elements of G and thus, by Theorem 4, they have G-variation larger than 1/ε (see Fig. 3). Combining Corollary 1 and Theorem 4, we obtain for a finite dictionary the following probabilistic lower bound on G-variation.

Theorem 5 Let d be a positive integer, X ⊂ R^d with card X = m, P be a uniform probability measure on S_1(X), b > 0, and G ⊂ S_1(X) be finite with card G = k. Then for f uniformly randomly chosen in S_1(X),


Fig. 3 Spherical caps around elements of G

P[ ‖f‖_G ≥ b ] ≥ 1 − 2k e^{−2m/b²}.

Theorem 5 estimates the probability that a uniformly randomly chosen function on S_1(X) has G-variation at least b. Hence, by Proposition 4, any representation of such a function as an input-output function of a shallow network with units from G has an l1-norm of its output-weight vector larger than b, too. A similar estimate holds for dictionaries of binary-valued functions. Its proof in [38] uses a discrete version of concentration of measure in the form of the Chernoff-Hoeffding bound on sums of independent random variables.

Theorem 6 Let d be a positive integer, X ⊂ R^d with card X = m, P be a uniform probability measure on B(X), b > 0, and G ⊂ B(X) be finite with card G = k. Then for f uniformly randomly chosen in B(X),

P( ‖f‖_G ≥ b ) ≥ 1 − k e^{−m/(2b²)}.
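A quick numeric illustration of the bound of Theorem 6 (the values of m, k, and b below are illustrative choices of ours): even for dictionaries of fairly large cardinality k, the bound is close to 1 once m/(2b²) outweighs log k.

import numpy as np

for m, k, b in [(10_000, 10**9, 10.0), (65_536, 10**12, 16.0)]:
    bound = 1 - k * np.exp(-m / (2 * b ** 2))   # Theorem 6
    print(m, k, b, bound)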

Estimates from Theorem 5 can also be extended to networks with units from infinite dictionaries. Their "sizes" can be described in terms of covering and packing numbers. These were introduced in [32] as a way to measure sizes of subsets of metric spaces using small balls as measuring units. For ε > 0, an ε-net in G is a set {g_1, ..., g_n} ⊆ G such that the family of the closed balls B_ε(g_i) of radius ε centered at g_i covers G. The ε-covering number N_ε(G) of a subset G of a metric space S is the cardinality of a minimal ε-net in G, i.e.,

N_ε(G) := min { n ∈ N_+ | (∃ f_1, ..., f_n ∈ G) (G ⊆ ⋃_{i=1}^{n} B_ε(f_i)) }.

When the set over which the minimum is taken is empty, then N_ε(G) := +∞. Note that all covering numbers of a compact set are finite. The packing number M_ε(G) is defined as the maximal number of disjoint balls that fit in the set, i.e.,


M_ε(G) := max { n ∈ N_+ | (∃ f_1, ..., f_n ∈ G) (B_ε(f_i) ∩ B_ε(f_j) = ∅ for all i ≠ j) }.

Packing numbers are closely related to covering numbers: M_{2ε}(G) ≤ N_ε(G) ≤ M_ε(G). In the following theorem, we assume that the covering numbers of subsets G of S_1(X) are taken with respect to the angular pseudometric ρ(f, g) = arccos |⟨f, g⟩|.

Theorem 7 Let d be a positive integer, X ⊂ R^d with card X = m, P be a uniform probability measure on S_1(X), b > 0, and G ⊂ S_1(X) have finite covering numbers. Then for f uniformly randomly chosen in S_1(X),

P[ ‖f‖_G ≥ b ] ≥ 1 − 2 N_{arccos(2/b)}(G) e^{−2m/b²}.

Proof For g ∈ S_1(X) and ε > 0, let C(g, ε) = { f ∈ S_1(X) | ⟨f, g⟩ ≥ ε }. Let α = arccos(2/b) and {g_1, ..., g_n} be a minimal α-net in G in the angular pseudometric ρ. Then by the triangle inequality, ⋃_{i=1}^{n} (C(g_i, 2/b) ∪ C(−g_i, 2/b)) ⊇ ⋃_{g∈G} (C(g, 1/b) ∪ C(−g, 1/b)). By Theorem 4, ‖f‖_{G(X)} ≥ 1 / sup_{g∈G} |⟨f, g⟩|. Thus { f ∈ S_1(X) | ‖f‖_G ≥ b } ⊇ S_1(X) \ ⋃_{g∈G} (C(g, 1/b) ∪ C(−g, 1/b)) ⊇ S_1(X) \ ⋃_{i=1}^{n} (C(g_i, 2/b) ∪ C(−g_i, 2/b)). By Corollary 1, P[ f ∈ C(g_i, 2/b) ∪ C(−g_i, 2/b) ] ≤ 2 e^{−2m/b²}. Thus P[ f ∈ S_1(X) \ ⋃_{i=1}^{n} (C(g_i, 2/b) ∪ C(−g_i, 2/b)) ] ≥ 1 − 2 N_α(G) e^{−2m/b²}, and so the statement holds. □

Combining Theorem 7 with Proposition 4, we obtain the following lower bound on the l1-norm of the output-weight vector of any network with units from G computing a uniformly randomly chosen real-valued function on X.

Corollary 2 Let d be a positive integer, X ⊂ R^d with card X = m, P be a uniform probability measure on S_1(X), b > 0, G ⊂ S_1(X) have finite covering numbers, and f be a function uniformly randomly chosen from S_1(X). Then

2m P (∀w ∈ W f (G)) (w1 ≥ b) ≥ 1 − 2Narccos(2/b) (G) e− b2 .

For example for b = m 1/4 , Theorem 7 and Corollary 2 give the lower bound 1 − 2Narccos(2m −1/4 ) (G) e−2



m

on probability that a uniformly randomly chosen normalized function f on X has G-variation and l1 -norm of output-weight vector of any network with units from G

Limitations of Shallow Networks

143

computing f are at least m 1/4 . If a dictionary G is “relatively small” (in √ the sense that its covering number Narccos(2m −1/4 ) (G) do not outweigh the factor e−2 m , then almost any uniformly randomly chosen normalized function on X has G-variation larger than m 1/4 . Covering numbers are called  s power-type if there exists c > 0 and a positive integer s such that Nε (G) ≤ εc . So our results apply to dictionaries with powertype covering numbers. In particular, for X = {0, 1}d the d-dimensional Boolean cube and a power-type dictionary, our estimates imply for almost any uniformly randomly chosen function in S1 ({0, 1}d ) the lower bound 2d/4 on the l1 -norms of all output-weight vectors of networks computing such function.

6 Sizes of Dictionaries of Computational Units Our analysis of approximate measures of sparsity of networks with units from a dictionary G shows that when covering numbers of G grow only polynomially with the size of the domain X , then for almost any uniformly randomly chosen function on a sufficiently large X , l1 -norms of output-weight vectors of all networks with units from G computing f must be large. Such networks have large numbers of units or some of their output weights must be large. Both are not desirable. Some estimates of covering numbers of dictionaries are known for shallow networks, where dictionaries are formed by functions computable by basic computational units. Much more complicated dictionaries formed by compositions of functions which are used in deep networks are less understood. For finite dictionaries G, all covering numbers are bounded from above by card G. For some values of ε, covering numbers of finite dictionaries can even be smaller than their sizes. This can happen when a dictionary is highly coherent. Finite dictionaries are either formed by functions with finite ranges or functions with finite sets of parameters. Examples of dictionaries formed by binary-valued functions are dictionaries of Heaviside or signum perceptrons. Estimates of their sizes follow from d the upper bound 2 md! on the number of linearly separable dichotomies of m points in Rd proven by Schläfli already in the 19th century (see [54]). The dictionary of signum perceptrons Pd (X ) := {sgn(v · x + b) : X → {−1, 1} | v ∈ Rd , b ∈ R} occupies a relatively small subset of the set B(X ) of all functions on X with values in {−1, 1}. The following upper bound is a direct consequence of an upper bound on the number of linearly separable dichotomies of m points in Rd from [13] combined with a well-known estimate of partial sum of binomials (see [40]). Theorem 8 For every d and every X ⊂ Rd with card X = m, card Pd (X ) ≤ 2

md . d!

144

V. K˚urková

Thus the size of the dictionary Pd (X ) of signum perceptrons grows with the size of the domain X ⊂ Rd with cardX = m only polynomially with the polynomial degree equal to d, while the size 2m of the set B(X ) of all functions from X to {−1, 1} grows with m exponentially. Estimate of the size of the dictionary of signum perceptrons combined with Theorem 6 gives the following bounds. Theorem 9 Let d be a positive integer, X ⊂ Rd with card X = m, P be a uniform probability measure on B(X ), f uniformly randomly chosen in B(X ), and b > 0. Then   m md P  f  Pd (X ) ≥ b ≥ 1 − 4 e− 2b2 . d!

Thus for large domains X , almost any uniformly randomly chosen function from X to {−1, 1} has large variation with respect to signum perceptrons and so it cannot be l1 -sparsely represented by a shallow network with signum perceptrons. In particular, d for card X = 2d and b = 2 4 , Theorem 9 implies the following corollary. Corollary 3 Let d be a positive integer, X ⊂ Rd with card X = m, P be a uniform probability measure on B(X ), and f uniformly randomly chosen in B(X ). Then 2   2d −(2 d2 −1) d . P  f  Pd (X ) ≥ 2 4 ≥ 1 − 4 e d!

Covering numbers of the whole set S1 (X ) of all normalized functions on a finite set X are growing exponentially with card X . This follows from estimates of the quasiorthogonal dimension dimε m of Rm . It is defined for ε ∈ [0, 1] as the maximal number of unit vectors such that each pair of distinct ones has inner product at most ε, i.e., dimε m = max{card U ⊂ S m−1 | (∀u, v ∈ U, u = v)(|u · v| ≤ ε)}. Quasiorthogonal dimension can be expressed as the packing number of S m−1 . It was proven in [25] that for a fixed ε > 0, the quasiorthogonal dimension dimε m grows exponentially with the dimension m as e

mε 2 2

 ≤ dimε m

(for arguments based on graph theory see [24]). Let λ(t) = 0 for t ≤ 0 and λ(t) = t for t ≥ 0 and let L(X ) denote the dictionary L(X ) := {λ(e · . + b) : X → R | e ∈ S d−1 , b ∈ R}.

Limitations of Shallow Networks

145

The dictionary L(X ) has infinite range, but has the same size equal to the number of characteristic functions of half-spaces of X ⊂ Rd . Some dictionaries with infinite ranges have finite sets of parameters and thus they are finite. For X finite, let K (X ) := {K y | y ∈ X }, where K : X × X → R is a symmetric positive definite kernel and K y (x)=K (x, y). Such dictionaries are used in SVM and their sizes are equal to card Y [2, 51]. Covering numbers of dictionaries G φ (X, Y ), for which the function Tφ : Y → F (X ) defined as Tφ (y)(x) = φ(x, y) is Lipschitz can be derived from covering numbers of the set of parameters Y . Covering numbers in the angular pseudometrics are related to covering numbers in l2 . Indeed, for f, g ∈ S1 (X ), with ρ( f, g) = α, we have  f − g2 = 2 sin(α/2). Various estimates of covering numbers in l2 -norm are known. For example, any subset G of the set of functions on a finite domain X with range {0, 1} which has a finite VC-dimension has power-type covering numbers in l2 [19]. It was shown in [48] that for any Lipschitz continuous sigmoidal function, L2 -covering numbers of the dictionary of sigmoidal perceptrons on any bounded domain ⊂ Rd grow as  1 β , where β > 0. ε

7 Constructions of Functions with Large Variations By Theorem 4, functions which are nearly orthogonal to all elements of a dictionary G have large G-variations. To construct an example of a class of functions with large variation with respect to signum signum perceptrons, we consider functions on square domains of the form X = {x1 , . . . , xn } × {y1 , . . . , yn } ⊂ Rd . Such functions can be represented by square matrices. For a function f on X = {x1 , . . . , xn } × {y1 , . . . , yn } we denote by M( f ) the n × n matrix defined as M( f )i, j = f (xi , y j ). An n × n matrix M induces a function f M on X such that f M (xi , y j ) = Mi, j . The inner product of two functions f and g on a square domain X = {x1 , . . . , xn } × {y1 , . . . , yn } is equal to the sum of entries of the matrices M( f ) and M(g), i.e.,

146

V. K˚urková

 f, g =

n 

M( f )i, j M(g)i, j .

i, j

Thus it is invariant under permutations of rows and columns performed jointly on both matrices M( f ) and M(g). So to estimate inner products of functions represented by matrices we can reorder rows and columns whenever it is convenient. An advantage of square domains is that on such domains matrices M(g) representing signum perceptrons g ∈ Pd (X ) can be reordered in such a way that each row and each column of the reordered matrix starts with a segment of −1’s followed by a segment of +1’s as stated in the next lemma from [38]. Lemma 1 Let d = d1 + d2 , {xi | i = 1, . . . , n} ⊂ Rd1 , {y j | j = 1, . . . , n} ⊂ Rd2 , and X = {x1 , . . . , xn } × {y1 , . . . , yn } ⊂ Rd . Then for every g ∈ Pd (X ) there exists a reordering of rows and columns of the n × n matrix M(g) such that in the reordered matrix each row and each column starts with a (possibly empty) initial segment of −1’s followed by a (possibly empty) segment of +1’s. Proof Choose an expression of g ∈ Pd (X ) as g(z) = sign(a · z + b), where z = (x, y) ∈ Rd1 × Rd2 , a ∈ Rd = Rd1 × Rd2 , and b ∈ R. Let al and ar denote the left and the right part, resp, of a, i.e., ali = ai for i = 1, . . . , d1 and ari = ad1 +i for i = 1, . . . , d2 . Then sign(a · z + b) = sign(al · x + ar · y + b). Let ρ and κ be permutations of the set {1, . . . , n} such that al · xρ(1) ≤ al · xρ(2) ≤ · · · ≤ al · xρ(n) and ar · yκ(1) ≤ ar · yκ(2) ≤ · · · ≤ ar · yκ(n) . Denote by M(g)∗ the matrix obtained from M(g) by permuting its rows and columns by ρ and κ, resp. It follows from the definition of the permutations ρ and κ that each row and each column of M(g)∗ starts with a (possibly empty) initial segment of −1’s followed by a (possibly empty) segment of +1’s.  The reordering assembling −1’s and +1’s in the matrix representing a signum perceptron (guaranteed by Lemma 1) reduces estimation of their inner products with functions f : X → {−1, 1} to estimation of differences of −1’s and +1’s in submatrices of M( f ). A class of matrices whose submatrices have relatively small differences of −1’s and +1’s is the class of Hadamard matrices. A Hadamard matrix of order n is an n × n square matrix M with entries in {−1, 1} such that any two distinct rows (or equivalently columns) of M are orthogonal. It follows directly from the definition that this property is invariant under permutations of rows and columns and sign flips of all elements in a row or a column. Note that Hadamard matrices were introduced as extremal ones among all n × √n matrices with entries in {−1, 1} as they have the largest determinants equal to n. The well-known Lindsay Lemma bounds from above differences of +1’s and −1’s in submatrices of Hadamard matrices (see, e.g., [17, p. 88]). Lemma 2 (Lindsay) Let n be a positive integer and M be an n × n Hadamard matrix. Then for any subset I of the set of indices of rows and any subset J of the set of indices of columns of M,

Limitations of Shallow Networks

147

       √  Mi, j  ≤ n card I card J .   i∈I j∈J  Constructing a partition of a matrix induced by a signum perceptron into submatrices, which have all entries either equal to +1 or all entries equal to −1, and applying the Lindsay Lemma to a corresponding partition of a Hadamard matrix, one obtains the following lower bound on variation with respect to signum perceptrons for functions induced by Hadamard matrices (for details of the proof see [38]). Theorem 10 Let d = d1 + d2 , {xi | i = 1, . . . , n} ⊂ Rd1 , {y j | j = 1, . . . , n} ⊂ Rd2 , X = {xi | i = 1, . . . , m} × {y j | j = 1, . . . , m} ⊂ Rd , and f M : X → {−1, 1} be defined as f M (xi , y j ) = Mi, j , where M is an n × n Hadamard matrix. Then  f M  Pd (X ) ≥

√ n . log2 n

Theorem 10 combined with Proposition 4 implies the following corollary. Corollary 4 Let d = d1 + d2 , {xi | i = 1, . . . , n} ⊂ Rd1 , {y j | j = 1, . . . , n} ⊂ Rd2 , X = {xi | i = 1, . . . , n} × {y j | j = 1, . . . , n} ⊂ Rd , and f M : X → {−1, 1} be defined as f M (xi , y j ) = Mi, j , where M is an n × n Hadamard matrix. Then f M cannot be computed by a shallow signum perceptron network having both the number of units and absolute values of all output weights depending on log2 n polynomially. Corollary 4 shows that functions induced by Hadamard matrices cannot be computed by shallow signum or Heaviside perceptrons with numbers of units and sizes of output weights considerably smaller than sizes of their domains. Numbers of units and sizes of output weights in these networks cannot be bounded by polynomials of log2 of the sizes of their domains. Theorem 10 can be applied to domains containing sufficiently large squares, for example domains representing pictures formed by two-dimensional squares with 2k × 2k pixels or digitized d-dimensional cubes. Corollary 5 Let k be a positive integer and f M : {0, 1}k × {0, 1}k → {−1, 1} be defined as f M (xi , y j ) = Mi, j , where M is a 2k × 2k Hadamard matrix. Then  f M  Pd ({0,1}2k ) ≥

2k/2 . k

Functions generated by 2k × 2k Hadamard matrices cannot be computed by shallow signum perceptron networks with the sum of the absolute values of output weights bounded by a polynomial of k. This implies that the numbers of units and absolute

148

V. K˚urková

values of all output weights in these networks cannot be bounded by any polynomial of k. Similarly, functions defined on 2k-dimensional discretized cubes of sizes s 2k = s k × s k cannot be computed by networks with numbers of signum perceptrons and output weights smaller than s k/2 . k log2 s

(6)

8 Examples An example of a class of functions with variation with respect to Gaussian kernel units with centers in the Boolean cube {0, 1}d increasing with d exponentially is the class of d-dimensional parities. Let G K ,a = G K ,a ({0, 1}d ) := {e−a·−y | y ∈ {0, 1d } 2

denotes the dictionary of Gaussian kernel units with centers in {0, 1}d and pd : {0, 1}d → {−1, 1}, where pd (v) := −1v·u , for all u = (1, . . . , 1) ∈ {0, 1}d is the parity function. The following lower bound from [38] shows that G K ,a -variation of pd grows with d exponentially. Theorem 11 For every positive integer d and every a > 0,  pd G K ,a ({0,1}d ) > 2d/2 .

Proof By Theorem 4,  pd G K ,a ≥

 pd  supg∈G K ,a ({0,1}d ) | p d , g|

.

Let g0 be the Gaussian centered at (0, . . . , 0), then  p d , g0  = By the binomial formula,

d

k=0 (−1)

  −ak k d e . k

  d −ak e (−1) = (1 − e−a )d .  p , g0  = k k=0 d

d 

k

Using a suitable transformation of the coordinate system, we obtain the same value of the inner product with p d for the Gaussian gx centered at any x ∈ {0, 1}d such that pd (x) = 1. When the Gaussian gx is centered at x with pd (x) = −1, we get the same

Limitations of Shallow Networks

149

absolute value of the inner product by replacing pd with − pd and by a transformation 2d/2 d/2 of the coordinate system. As  pd  = 2d/2 , we get  pd G K ,a ≥ (1−e .  −a )d > 2 Applying Corollary 4 to a variety of types of Hadamard matrices one obtains many examples of functions which cannot be computed by shallow perceptron networks with numbers of units and sizes of output weights bounded by p(log2 card X ), where p is a polynomial and X is the domain of the function. Recall that if a Hadamard matrix of order n > 2 exists, then n is divisible by 4 (see, e.g., [45, p. 44]). It is conjectured that there exists a Hadamard matrix of every order divisible by 4. Various constructions of Hadamard matrices are known, such as Sylvester’s recursive construction of 2k × 2k matrices, Paley’s construction based on quadratic residues, as well as constructions based on Latin Squares, and on Steiner triples. Two Hadamard matrices are called equivalent when one can be obtained from the second one by permutations of rows and columns and sign flips of all entries in a row or a column. Listings of known constructions of Hadamard matrices and enumeration of non-equivalent Hadamard matrices of some orders can be found in [56]. The oldest construction of a class of 2k × 2k matrices with orthogonal rows and columns was discovered by Sylvester [58]. A 2k × 2k matrix is called SylvesterHadamard and denoted S(k) if it is constructed recursively starting from the matrix  S(2) =

1 1 1 −1



and iterating the Kronecker product    S(l) S(l)    S(l + 1) = S(2) ⊗ S(l) =  S(l) −S(l)  for l = 1, . . . , k − 1. Corollary 5 implies that functions generated by 2k × 2k Sylvester-Hadamard matrices cannot be represented by shallow signum perceptron k/2 networks with numbers of units and sizes of output weights smaller than 2 k . The following theorem from [38] shows that model complexities of signum or Heaviside perceptron networks computing functions generated by Sylvester-Hadamard matrices can be considerably decreased when two hidden layers are used instead of merely one hidden layer. Theorem 12 Let S(k) be a 2k × 2k Sylvester-Hadamard matrix, h k : {0, 1}k × {0, 1}k → {−1, 1} be defined as h k (u, v) = S(k)u,v . Then h k can be represented by a network with one linear output and two hidden layers with k Heaviside perceptrons in each hidden layer. An interesting class of functions with large variations with respect to perceptrons can be obtained by applying Theorem 10 to a class of circulant matrices with rows

150

V. K˚urková

formed by shifted segments of pseudo-noise sequences. These sequences are deterministic but exhibit some properties of random sequences. They have been used in acoustics [55]. An infinite sequence a0 , a1 , . . . , ai , . . . of elements of {0, 1} is called kth order linear recurring sequence if for some h 0 , . . . , h k ∈ {0, 1} ai =

k 

ai− j h k− j

mod 2

j=1

for all i ≥ k. It is called k-th order pseudo-noise (PN) sequence (or pseudo-random sequence) if it is kth order linear recurring sequence with minimal period 2k − 1. PN-sequences are generated by primitive polynomials. A polynomial h(x) =

m 

hjx j

j=0

is called primitive polynomial of degree m when the smallest integer n for which h(x) divides x n + 1 is n = 2m − 1. PN sequences have many useful applications because some of their properties mimic those of random sequences. A run is a string of consecutive 1’s or a string of consecutive 0’s. In any segment of length 2k − 1 of a kth order PN-sequence, one-half of the runs have length 1, one quarter have length 2, one-eighth have length 3, and so on. In particular, there is one run of length k of 1’s, one run of length k − 1 of 0’s. Thus every segment of length 2k − 1 contains 2k/2 ones and 2k/2 − 1 zeros [45, p. 410]. An important property of PN-sequences is their low autocorrelation. The autocorrelation of a sequence a0 , a1 , . . . , ai , . . . of elements of {0, 1} with period 2k − 1 is defined as 2k −1 1  −1a j +a j+t . (7) κ(t) = k 2 − 1 j=0 For every PN-sequence and for every t = 1, . . . , 2k − 2, κ(t) = −

2k

1 −1

[45, p. 411]. Let τ : {0, 1} → {−1, 1} be defined as τ (x) = −1x

(8)

Limitations of Shallow Networks

151

(i.e., τ (0) = 1 and τ (1) = −1). We say that a 2k × 2k matrix L k (α) is induced by a k-th order PN-sequence α = (a0 , a1 , . . . , ai , . . .) when for all i = 1, . . . , 2k , L i,1 = 1, for all j = 1, . . . , 2k , L 1, j = 1, and for all i = 2, . . . , 2k and j = 2, . . . , 2k L k (α)i, j = τ (Ai−1, j−1 ) where A is the (2k − 1) × (2k − 1) circulant matrix with rows formed by shifted segments of length 2k − 1 of the sequence α. The next proposition following from the Eqs. (7) and (8) shows that for any PN-sequence α the matrix L k (α) has orthogonal rows. Proposition 6 Let k be a positive integer, α = (a0 , a1 , . . . , ai , . . .) be a kth order PN-sequence, and L k (α) be the 2k × 2k matrix induced by α. Then all pairs of rows of L k (α) are orthogonal. Applying Theorem 10 to the 2k × 2k matrices L k (α) induced by a kth order PNk/2 sequence α we obtain a lower bound of the form 2 k on variation with respect to signum perceptrons of the function induced by the matrix L k (α). So in any shallow perceptron network computing this function, the number of units or sizes of some output weights depend on k exponentially.

9 Discussion Although current hardware allows to implement networks with large numbers of parameters, reducing network complexity is highly desirable as it can considerably improve efficiency of computation. Various studies show that also brain has sparse connectivity (each neuron is connected to only a limited number of other neurons). To obtains some theoretical understanding to limitations of shallow architectures, we investigated lower bounds on complexity of shallow networks. As minimization of “l0 -pseudonorm” (which measures the number of hidden units in a shallow network) is a difficult non convex problem, we focused on approximate measures of network sparsity. We presented several arguments for using l1 -norm of output weight vectors as an approximate measure of network sparsity: Balls in l1 -norm are good approximations of convexifications of intersections of “balls” in “l0 ” with unit balls in the ambient Euclidean metric, in contrast to l2 -norm, acting as a stabilizer l1 penalizes even large number of output weights, l1 has been used in weight-decay regularization [18], in statistical learning in the Lasso method [59], and is related to variational norm tailored to dictionary of computational units. Applying geometrical properties of high-dimensional Euclidean spaces (the concentration of measure) we derived probabilistic lower bounds on minima of variational and l1 -norms of output-weight vectors in terms of covering numbers of dictionaries. As for many types of dictionaries used in shallow networks, covering numbers are power-type, the bounds imply that almost any uniformly randomly chosen normalized function on a large domain is highly uncorrelated with all elements

152

V. K˚urková

of such dictionaries. Covering numbers of dictionaries formed by compositions of computational units characterizing deep networks are much less understood, but it is likely that they have much larger covering numbers than simple dictionaries used in shallow networks. Although probabilistic results prove that there are many functions with large variations, it is not easy to find concrete constructions of such functions. There is an interesting analogy with the central paradox of coding theory. This paradox is expressed in the title of the article “Any code of which we cannot think is good” [12]. It was proven there that any code which is truly random (in the sense that there is no concise way to generate the code) is good (it meets the Gilbert–Varshamov bound on distance versus redundancy). However despite sophisticated constructions for codes derived over the years, no one has succeeded in finding a constructive procedure that yields such good codes. Similarly, computation of “any function of which we cannot think” (truly random) by shallow perceptron networks might be untractable. The results presented in this chapter indicate that computation of functions exhibiting some randomness properties by shallow perceptron networks is difficult in the sense that it requires networks of large complexities. Some of such functions can be constructed using deterministic algorithms and have many useful applications. For example, properties of pseudo-noise sequences were exploited for constructions of codes, interplanetary satellite picture transmission, precision measurements, acoustics, radar camouflage, and light diffusers. These sequences permit designs of surfaces that scatter incoming signals very broadly making reflected energy “invisible” or “inaudible” [55]. It should be emphasized that Theorem 7 and Corollary 2 assume uniform probability distribution of functions to be computed. The assumption of uniform distribution of computational tasks (sometimes implicit) is quite common. For example, in the No Free Lunch Theorem [61], it is assumed that all functions are equally likely. However in real tasks, relevance of functions for a give application area are far from being uniform. Recently, we derived some estimates of complexity of networks computing randomly chosen functions from nonuniform probability distributions [33]. Acknowledgements This work was partially supported by the Czech Grant Foundation grant GA18-23827S and institutional support of the Institute of Computer Science RVO 67985807.

References 1. Albertini, F., Sontag, E.: For neural networks, function determines form. Neural Netw. 6(7), 975–990 (1993) 2. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the hypothesis space for improving the generalization ability of support vector machines. In: IEEE International Joint Conference on Neural Networks (2011) 3. Azuma, K.: Weighted sums of certain dependent random variables. Tohoku Math. J. 19, 357– 367 (1967)

Limitations of Shallow Networks

153

4. Ba, L.J., Caruana, R.: Do deep networks really need to be deep? In: Ghahramani, Z. et al. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1–9 (2014) 5. Ball, K.: An elementary introduction to modern convex geometry. In: Levy, S. (ed.) Flavors of Geometry, pp. 1–58. Cambridge University Press, Cambridge (1997) 6. Barron, A.R.: Neural net approximation. In: Narendra, K.S. (ed.) Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems, pp. 69–72. Yale University Press (1992) 7. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993) 8. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957) 9. Bengio, Y., LeCun, Y.: Scaling learning algorithms towards AI. In: Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (eds.) Large-Scale Kernel Machines. MIT Press, Cambridge (2007) 10. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127 (2009) 11. Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23, 493–507 (1952) 12. Coffrey, J.T., Goodman, R.Y.: Any code of which we cannot think is good. IEEE Trans. Inf. Theory 25(6), 1453–1461 (1990) 13. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326–334 (1965) 14. DeVore, R.A., Howard, R., Micchelli, C.: Optimal nonlinear approximation. Manuscr. Math. 63, 469–478 (1989) 15. Donoho, D.: For most large underdetermined systems of linear equations the minimal 1 -norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2006) 16. Donoho, D.L., Tsaig, Y.: Fast solution of 1-norm minimization problems when the solution may be sparse. IEEE Trans. Inf. Theory 54, 4789–4812 (2008) 17. Erdös, P., Spencer, H.: Probabilistic Methods in Combinatorics. Academic, Cambridge (1974) 18. Fine, T.L.: Feedforward Neural Network Methodology. Springer, Berlin (1999) 19. Haussler, D.: Sphere packing numbers for subsets of the Boolean n-cube with bounded VapnikChervonenkis dimension. J. Comb. Theory A 69(2), 217–232 (1995) 20. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 21. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963) 22. Ito, Y.: Finite mapping by neural networks and truth functions. Math. Sci. 17, 69–77 (1992) 23. Kainen, P.C., K˚urková, V., Sanguineti, M.: Dependence of computational models on input dimension: tractability of approximation and optimization tasks. IEEE Trans. Inf. Theory 58, (2012) 24. Kainen, P.C., K˚urková, V.: Quasiorthogonal dimension. In: Kosheleva, O., Shary, S., Xiang, G., Zapatrin, R. (eds.) Beyond Traditional Probabilistic Data Processing Techniques: Interval, Fuzzy, etc. Methods and Their Applications. Springer, Berlin (2020, to appear) 25. Kainen, P.C., K˚urková, V.: Quasiorthogonal dimension of Euclidean spaces. Appl. Math. Lett. 6(3), 7–10 (1993) 26. Kainen, P., K˚urková, V.: Functionally equivalent feedforward neural network. Neural Comput. 6(3), 543–558 (1994) 27. Kainen, P., K˚urková, V.: Singularities of finite scaling functions. Appl. Math. Lett. 9(2), 33–37 (1996) 28. Kainen, P.C., K˚urková, V., Vogt, A.: Approximation by neural networks is not continuous. Neurocomputing 29, 47–56 (1999) 29. 
Kainen, P.C., K˚urková, V., Vogt, A.: Geometry and topology of continuous best and near best approximations. J. Approx. Theory 105, 252–262 (2000) 30. Kainen, P.C., K˚urková, V., Vogt, A.: Continuity of approximation by neural networks in L p spaces. Ann. Oper. Res. 101, 143–147 (2001) 31. Kecman, V.: Learning and Soft Computing. MIT Press, Cambridge (2001) 32. Kolmogorov, A.: Asymptotic characteristics of some completely bounded metric spaces. Dokl. Akad. Nauk. SSSR 108, 585–589 (1956)

154

V. K˚urková

33. K˚urková, V., Sanguineti, M.: Classification by sparse neural networks. IEEE Trans. Neural Netw. Learn. Syst. 30(9), 2746–2754 (2019) 34. K˚urková, V.: Dimension-independent rates of approximation by neural networks. In: Warwick, K., Kárný, M. (eds.) Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality, pp. 261–270. Birkhäuser, Boston (1997) 35. K˚urková, V.: High-dimensional approximation and optimization by neural networks. In: Suykens, J. et al. (eds.) Advances in Learning Theory: Methods, Models, and Applications (NATO Science Series III: Computer & Systems Sciences, vol. 190), pp. 69–88. IOS Press, Amsterdam (2003) 36. K˚urková, V.: Sparsity and complexity of networks computing highly-varying functions. In: International Conference on Artificial Neural Networks, pp. 534–543 (2018) 37. K˚urková, V.: Complexity estimates based on integral transforms induced by computational units. Neural Netw. 33, 160–167 (2012) 38. K˚urková, V.: Constructive lower bounds on model complexity of shallow perceptron networks. Neural Comput. Appl. 29, 305–315 (2018) 39. K˚urková, V., Kainen, P.C.: Comparing fixed and variable-width Gaussian networks. Neural Netw. 57(10), 23–28 (2014) 40. K˚urková, V., Sanguineti, M.: Model complexities of shallow networks representing highly varying functions. Neurocomputing 171, 598–604 (2016) 41. LeCun, Y., et al.: Handwritten digit recognition with a back-propagation network. In: Proceedings of Advances in Neural Information Processing Systems, pp. 396–404 (1990) 42. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998) 43. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015) 44. Lévy, P.: Problèmes concrets d’analyse fonctionelle. Gauthier Villards, Paris (1951) 45. MacWilliams, F., Sloane, N.A.: The Theory of Error-Correcting Codes. North Holland Publishing Co., Amsterdam (1977) 46. Maiorov, V.E., Meir, R.: On the near optimality of the stochastic approximation of smooth functions by neural networks. Adv. Comput. Math. 13, 79–103 (2000) 47. Maiorov, V.E., Pinkus, A.: Lower bounds for approximation by MLP neural networks. Neurocomputing 25, 81–91 (1999) 48. Makovoz, Y.: Random approximants and neural networks. J. Approx. Theory 85, 98–109 (1996) 49. Matoušek, J.: Lectures on Discrete Geometry. Springer, New York (2002) 50. Mhaskar, H.N., Liao, Q., Poggio, T.: Learning functions: when is deep better than shallow. Center for Brains, Minds & Machines, pp. 1–12 (2016) 51. Oneto, L., Ridella, S., Anguita, D.: Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015) 52. Plan, Y., Vershynin, R.: One-bit compressed sensing by linear programming. Commun. Pure Appl. Math. 66, 1275–1297 (2013) 53. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., Liao, Q.: Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput. 14 (5), 503–519 (2017). https://doi.org/10.1007/s11633-017-1054-2 54. Schläfli, L.: Gesamelte Mathematische Abhandlungen, vol. 1. Birkhäuser, Basel (1950) 55. Schröder, M.: Number Theory in Science and Communication. Springer, New York (2009) 56. Sloane, N.A.: A library of Hadamard matrices. http://www.research.att.com/~njas/hadamard/ 57. Sussman, H.J.: Uniqueness of the weights for minimal feedforward nets with a given inputoutput map. Neural Netw. 5(4), 589–593 (1992) 58. 
Sylvester, J.J.: Thoughts on inverse orthogonal matrices, simultaneous sign successions, and tessellated pavements in two or more colours, with applications to Newton’s rule, ornamental tile-work, and the theory of numbers. Philos. Mag. 34, 461–475 (1867) 59. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58, 267–288 (1996) 60. Tillmann, A.: On the computational intractability of exact and approximate dictionary learning. IEEE Signal Process. Lett. 22, 45–49 (2015) 61. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1(67), (1997)

Fairness in Machine Learning Luca Oneto and Silvia Chiappa

Abstract Machine learning based systems are reaching society at large and in many aspects of everyday life. This phenomenon has been accompanied by concerns about the ethical issues that may arise from the adoption of these technologies. ML fairness is a recently established area of machine learning that studies how to ensure that biases in the data and model inaccuracies do not lead to models that treat individuals unfavorably on the basis of characteristics such as e.g. race, gender, disabilities, and sexual or political orientation. In this manuscript, we discuss some of the limitations present in the current reasoning about fairness and in methods that deal with it, and describe some work done by the authors to address them. More specifically, we show how causal Bayesian networks can play an important role to reason about and deal with fairness, especially in complex unfairness scenarios. We describe how optimal transport theory can be leveraged to develop methods that impose constraints on the full shapes of distributions corresponding to different sensitive attributes, overcoming the limitation of most approaches that approximate fairness desiderata by imposing constraints on the lower order moments or other functions of those distributions. We present a unified framework that encompasses methods that can deal with different settings and fairness criteria, and that enjoys strong theoretical guarantees. We introduce an approach to learn fair representations that can generalize to unseen tasks. Finally, we describe a technique that accounts for legal restrictions about the use of sensitive attributes.

L. Oneto (B) University of Pisa, Pisa, Italy e-mail: [email protected] S. Chiappa DeepMind, London, UK e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 L. Oneto et al. (eds.), Recent Trends in Learning From Data, Studies in Computational Intelligence 896, https://doi.org/10.1007/978-3-030-43883-8_7

155

156

L. Oneto and S. Chiappa

1 Introduction Machine Learning (ML) is increasingly used in a wide range of decision-making scenarios that have serious implications for individuals and society, including financial lending [20, 120], hiring [17, 77], online advertising [72, 150], pretrial and immigration detention [10, 158], child maltreatment screening [30, 171], health care [40, 105], social services [4, 48], and education [140, 143, 144]. Whilst algorithmic decision making can overcome undesirable aspects of human decision making, biases in the training data and model inaccuracies can lead to decisions that treat individuals unfavorably (unfairly) on the basis of characteristics such as e.g. race, gender, disabilities, and sexual or political orientation (sensitive attributes). ML fairness is an emerging area of machine learning that studies how to ensure that the outputs of a model do not depend on sensitive attributes in a way that is considered unfair. For example, in a model that predicts student performance based on previous school records, this could mean ensuring that the decisions do not depend on sex. Or in a model that decides whether a person should be offered a loan based on previous credit card scores, this could mean ensuring that the decisions do not depend on race. Over the last few years, researchers have introduced a rich set of definitions formalizing different fairness desiderata that can be used for evaluating and designing ML systems. Many such definitions express properties of the model outputs with respect to the sensitive attributes. However, in order to properly deal with fairness issues present in the training data, relationships among other relevant variables in the data need to be accounted for. Lack of consideration for the specific patterns of unfairness underlying the training data has lead e.g. to the use of inappropriate fairness criteria in the design and evaluation of the COMPAS pretrial risk assessment tool. This problem was highlighted in [26] using Causal Bayesian Networks (CBNs) as a simple and intuitive visual tool to describe different possible data unfairness scenarios. CBNs can also be used as a powerful quantitative tool to measure unfairness in a dataset and to help researchers develop techniques for addressing it. As such, CBNs represent an invaluable framework to formalize, measure, and deal with fairness. From a procedural viewpoint, methods for imposing fairness can roughly be divided into three families. Methods in the first family consists in pre-processing the data or in extracting representations that do not contain undesired biases, which can then be used as input to a standard ML model. Methods in the second family consist in enforcing a model to produce fair outputs through imposing fairness constraints into the learning mechanism. Methods in the third family consist in postprocessing the outputs of a model in order to make them fair. Most methods (as well as most fairness definitions) approximate fairness desiderata through requirements on the lower order moments or other functions of distributions corresponding to different sensitive attributes. Whilst facilitating model design, not imposing constraints on the full shapes of relevant distributions can be restrictive and problematic. Also, most often the goal of these methods is to create a fair model from scratch on a specific task. However, in a large number of real world applications [9, 33, 141, 170]

Fairness in Machine Learning

157

general,using the same model or part of it over different tasks might be desirable. To ensure that fairness properties generalize to multiple tasks, it is necessary to consider the learning problem in a multitask/lifelong learning framework. Finally, there still exits only a few attempts to group methods that can deal with different settings and fairness criteria into a unified framework accompanied by theoretical guarantees about their fairness properties. In this manuscript, we introduce the problem of fairness in machine learning and describe some techniques for addressing it. We focus the presentation on the issues outlined above, and describe some approaches introduced by the authors in previous work to address them. More specifically, in Sect. 2 we stress the crucial role that CBNs should play to reason about and deal with fairness, especially in complex unfairness scenarios. In Sect. 3.1 we introduce a simple post-processing method that leverage optimal transport theory to impose constraints on the full shapes of distributions corresponding to different sensitive attributes. In Sect. 3.2, we describe a unified fairness framework that enjoys strong theoretical guarantees. In Sect. 3.3 we introduce a method to learn fair representations that can generalize to unseen tasks. In Sect. 3.4, we discuss legal restrictions with the use of sensitive attributes, and introduce an in-processing method that does not require them when using the model. Finally, we draw some conclusions in Sect. 4. In order to not over-complicate the notation, we use the most suited notation in each section.

2 CBNs: An Essential Tool for Fairness Over the last few years, researchers have introduced a rich set of definitions formalizing different fairness desiderata that can be used for evaluating and designing ML systems [2, 7, 14, 15, 21–23, 25, 28, 29, 31, 34, 44–46, 52–56, 59, 66, 69, 71–75, 82, 84, 87, 89, 90, 94, 95, 97–99, 103, 104, 106, 114–116, 121, 123, 127, 138, 139, 141, 148, 149, 153, 156, 159, 165, 174, 176–180, 182–184, 186, 187, 191– 193]. We refer the reader to existing surveys (e.g. [57, 130, 172]) to get an overview of existing definitions. In this manuscript, we are interested in highlighting the risk in the use of definitions which only express properties of the model outputs with respect to the sensitive attributes. As the main sources of unfairness in ML models are biases in the training data, and since biases can arise in different forms depending on how variables relate, accounting for relationships in the data is essential for properly evaluating models for fairness and for designing fair models. This issue was pointed out in [26], by showing that Equal False Positive/Negative Rates and Predictive Parity were inappropriate fairness criteria for the design and evaluation of the COMPAS pretrial risk assessment tool. In Sect. 2.1, we re-explain the issue, and also show that making a parallel between Predictive Parity and the Outcome Test used to detect discrimination in human decision making suffers from the same problem of not accounting for relationships in the data. We do that by using Causal Bayesian Networks (CBNs) [39, 145, 146, 151, 166], which currently represents the simplest and most intuitive tool for describing relationships among random vari-

158

L. Oneto and S. Chiappa

ables, and therefore different possible unfairness scenarios underlying a dataset. In Sect. 2.2, we show that CBNs also provide us with a powerful quantitative tool to measure unfairness in a dataset and to develop techniques for addressing it.

2.1 CBNs: A Visual Tool for (Un)fairness For simplicity of exposition, we restrict ourselves to the problem of learning a binary N , where each datapoint classification model from a dataset D = {(a n , x n , y n )}n=1 n n n (a , x , y )—commonly corresponding to an individual—contains a binary outcome y n that we wish to predict, a binary attribute a n which is considered sensitive, and a vector of features x n to be used, together with a n , to form a prediction yˆ n of y n . We formulate classification as the task of estimating the probability distribution P(Y |A, X ), where1 A, X and Y are the random variables corresponding to a n , x n , and y n respectively, and assume that the model outputs an estimate s n of the probability that individual n belongs to class 1, P(Y = 1|A = a n , X = x n ). A prediction yˆ n of y n is then obtained as yˆ n = 1s n >τ , where 1s n >τ = 1 if s n > τ for a threshold τ ∈ [0, 1], and zero otherwise. Arguably, the three most popular fairness criteria used to design and evaluate classification models are Demographic Parity, Equal False Positive/Negative Rates, and Calibration/Predictive Parity [69]. |=

Demographic Parity. Demographic Parity requires Yˆ to be statistically independent of A (denoted with Yˆ A), i.e. P(Yˆ = 1|A = 0) = P(Yˆ = 1|A = 1) .

(1)

|=

This criterion has recently be extended into Strong Demographic Parity [84]. Strong Demographic Parity considers the full shape of the distribution of the random variable S representing the model output by requiring S A. This ensures that the class prediction does not depend on the sensitive attribute regardless of the value of the threshold τ used. Equal False Positive/Negative Rates (EFPRs/EFNRs). EFPRs and EFNRs require P(Yˆ = 1|Y = 0, A = 0) = P(Yˆ = 1|Y = 0, A = 1), P(Yˆ = 0|Y = 1, A = 0) = P(Yˆ = 0|Y = 1, A = 1).

(2) A|Y , often

|=

These two conditions can also be summarized as the requirement Yˆ called Equalized Odds.

1 Throughout the manuscript, we denote random variables with capital letters, and their values with

small letters.

Fairness in Machine Learning

159

|=

|=

Predictive Parity/Calibration. Calibration requires Y A|Yˆ . In the case of continuous model output s n considered here, this condition is often instead called Predictive Parity at threshold θ , P(Y = 1|S > θ, A = 0) = P(Y = 1|S > θ, A = 1), and Calibration defined as requiring Y A|S. Demographic and Predictive Parity are considered the ML decision making equivalents of, respectively, Benchmarking and the Outcome Test used for testing discrimination in human decisions. There is however an issue in making this parallel—we explain it below on a police search for contraband example. Discrimination in the US Law is based on two main concepts, namely disparate treatment and disparate impact. Disparate impact refers to an apparently neutral policy that adversely affects a protected group more than another group. Evidence that a human decision making has an unjustified disparate impact is often provided using the Outcome Test, which consists in comparing the success rate of decisions. For example, in a police search for contraband scenario, the Outcome Test would check whether minorities (A = 1) who are searched (Y¯ = 1) are found to possess contraband (Y = 1) at the same rate as whites ( A = 0) who are searched, i.e. whether P(Y = 1|Y¯ = 1, A = 0) = P(Y = 1|Y¯ = 1, A = 1). Let C denotes the set of characteristics inducing probability of possessing contraband r n = P(Y = 1|C = cn ). A finding that searches for minorities are systematically less productive than searches for whites can be considered as evidence that the police applies different thresholds τ when searching, i.e. y¯ n = 1r n >τ1 if a n = 1 whilst y¯ n = 1r n >τ0 if a n = 0 with τ1 < τ0 . This scenario can be represented using the causal Bayesian network in Fig. 1a. A Bayesian network is a directed acyclic graph where nodes and edges represent random variables and statistical dependencies. Each node X i in the graph is associated with the conditional distribution p(X i |pa(X i )), where pa(X i ) is the set of parents of X i . The joint distribution of all nodes, p(X 1 , . . . , X I I), is given by the p(X i |pa(X i )). product of all conditional distributions, i.e. p(X 1 , . . . , X I ) = i=1 When equipped with causal semantic, namely when representing the data-generation mechanism, Bayesian networks can be used to visually express causal relationships. More specifically, causal Bayesian networks enable us to give a graphical definition of causes: If there exists a causal path from A to Y , then A is a potential cause of Y . A path from A to Y is a sequence of linked nodes starting at A and ending at Y . A path is called casual if the links point from preceding towards following nodes in the sequence. The CBN of Fig. 1a has joint distribution p(A, C, Y ) = p(Y |C) p(C|A) p(A). A is a potential cause of Y , because the path A → C → Y is causal. The influence of A on Y through C is considered legitimate as indicated by label ’fair’. This in turn means that dependence of R on A is considered fair. We can interpret the Outcome Test in the CBN framework as an indirect way to understand whether the police is introducing an unfair causal path A → Y¯ as depicted in Fig. 1b, namely whether it bases the decision on whether to search a person on Race A, in addition to Characteristics C (i.e. if two individuals with the same Characteristics are treated differently if of different Race).

160

L. Oneto and S. Chiappa ( )

= Cont. ( | )

fair

fair

fair

un fai r

un fai r

¯ = Search

= Search

( | )

(a)

(b)

(c)

Fig. 1 CBNs describing a police search for contraband example

Predictive Parity is often seen as the equivalent of the Outcome Test for ML classifiers. However, the problem in making this parallel is that, in the Outcome Test, Y corresponds to actual possession of contraband whilst, in the dataset used to train a classifier, Y could instead correspond to police search and the police could be discriminating by using different thresholds for searching minorities and whites. This scenario is depicted in Fig. 1c. In this scenario in which the training data contains an unfair path A → Y , Predictive Parity is not a meaningful fairness goal. More generally, in the case in which at least one causal path from A to Y is unfair, both EFPRs/EFNRs and Predictive Parity are inappropriate criteria, as they do not require the unfair influence of A on Y to be absent from the prediction Yˆ (e.g. a perfect model (Yˆ = Y ) would automatically satisfy EFPRs/EFNRs and Predictive Parity, but would contain the unfair influence). This observation is particularly relevant to the recent debate surrounding the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) pretrial risk assessment tool developed by Equivant (formerly Northpointe) and deployed in Broward County in Florida. The debate was triggered by an exposé from investigative journalists at ProPublica [10]. ProPublica claimed that COMPAS did not satisfy EFPRs and EFNRs, as FPRs = 44.9% and FNRs = 28.0% for black defendants, whilst FPRs = 23.5% and FNRs = 47.7% for white defendants. This evidence led ProPublica to conclude that COMPAS had a disparate impact on black defendants, leading to public outcry over potential biases in risk assessment tools and machine learning writ large. In response, Equivant published a technical report [41] refuting the claims of bias made by ProPublica and concluded that COMPAS is sufficiently calibrated, in the sense that it satisfies Predictive Parity at key thresholds. As previous research has shown [81, 113, 160], modern policing tactics center around targeting a small number of neighborhoods—often disproportionately populated by non-white and low income residents—with recurring patrols and stops. This uneven distribution of police attention, as well as other factors such as funding for pretrial services [101, 168], can be rephrased in the language of CBNs as indicating the presence of a direct path A → Y (through unobserved neighborhood) in the CBN representing the data-generation mechanism, as well as an influence of A on Y through the set

Fairness in Machine Learning

161

of variables containing number of prior arrests that are used to form a prediction Yˆ of Y . The fairness of these paths is questionable, indicating that EFPRs/EFNRs and Predictive Parity are inappropriate criteria. More generally, these observations indicate that the fairness debate surrounding COMPAS gave insufficient consideration to the patterns of unfairness underlying the training data. The characterization of unfairness in a dataset as the presence of unfair causal paths has enabled us to demonstrate the danger of making parallels between tests for discrimination in ML and human decision making, and that a correct use of fairness definitions concerned with statistical properties of Yˆ with respect to A requires an understanding of the patterns of unfairness underlying the training data. Deciding whether a path is fair or unfair requires careful ethical and sociological considerations and/or might not be possible from a dataset alone. Furthermore, a path could also be only partially fair, a case that we omitted here for simplicity. Despite these limitations, this viewpoint enables us to use CBNs as simple and intuitive visual tool to reason about fairness.

2.2 CBNs: A Quantitative Tool for (Un)fairness In addition to provide us with a simple and intuitive visual tool to reason at a high level about different unfairness scenarios that may underlie a dataset, CBNs also enable us to quantify unfairness in complex scenarios, and to develop techniques for addressing it. We explain this in this section—more details can be found in [26].

2.2.1

Background on Causal Bayesian Networks

Causal Effect. Consider the CBN on the left, which contains one causal and one non-causal path from A to Y , given by A → Y and A ← C → Y respectively. Whilst the conditional distribution p(Y |A) measures information from A to Y travelA Y ing through both paths2 , the causal effect of A on Y , denoted with p→A (Y |A), measures information traveling through the causal path A → Y only. Thus, the causal effect of A on Y can be seen as the conditional distribution of Y given A restricted to causal paths. C

2 This

is the case as the non-causal path A ← C → Y is open.

162

L. Oneto and S. Chiappa

Path-Specific Effect. Let us define with Y→A=a the random variable with distribution p(Y→A=a ) = p→A=a (Y |A = a). Y→A=a is called potential outcome and we refer to it with the shorthand Y→a . Potential outcomes can be extended to allow to separate the causal effect along different causal paths. Consider a college admission scenario described by the CBN on the left, where A indicates gender, Q qualifications, D choice of department, and Y admission. D Y The causal path A → Y represents direct influence of gender A on admission Y , capturing the fact that two individuals with the same qualifications and applying to the same department can be treated differently depending on their gender. The indirect causal path A → Q → Y represents influence of A on Y through Q, capturing the fact that female applicants might have different qualifications than male applicants. The indirect causal path A → D → Y represents influence of A on Y through D, capturing the fact that female applicants more often apply to certain departments. Let us indicate with A = a and A = a¯ female and male applicants respectively. The path-specific potential outcome Y→a¯ (Q →a , D→a ) is defined as the variable with distribution equal to the conditional distribution of Y given A, restricted to causal paths, for which A has been set to the value a¯ along A → Y , and to the value a along A → Q → Y and A → D → Y . Y→a¯ (Q →a , D→a ) can be used to give an average estimate of the effect of A on Y only along A → Y . This is obtained e.g. by the Path-Specific Effect (PSE) of A = a¯ with respect to A = a, defined as A

Q

PSEa a¯ = E p(Y→a¯ (Q →a ,D→a )) [Y→a¯ (Q →a , D→a )] − E p(Y→a ) [Y→a ],

(3)

where E p(X ) [X ] denotes expectation. Path-Specific Counterfactual. The PSE measures influence of A on Y at the population level. By conditioning Y→a¯ (Q →a , D→a ) on information from a specific individual, e.g. from a female individual {a n = a, q n , d n , y n } who was not admitted, we can answer the counterfactual question of whether that individual would have been admitted had she been male along the path A → Y . To understand how p(Y→a¯ (Q →a , D→a )|a n = a, q n , d n , y n ) can be estimated using the underlying CBN, consider the following linear model associated to a CBN with the same structure A ∼ Bern(π ), Q = θ q + θaq A + q , D = θ d + θad A + d , y Y = θ y + θay A + θqy Q + θd D +  y ,

(4)

where q , d and  y are unobserved independent zero-mean Gaussian variables. The relationships between A, Q, D, Y and Y→a¯ (Q →a , D→a ) in this model can be inferred from the twin Bayesian network [145] below: In addition to A, Q, D and Y , the network contains the variables Q →a , D→a and Y→a¯ (Q →a , D→a ) corresponding to

Fairness in Machine Learning

163

the counterfactual world in which A = a¯ along A → Y , with Q →a = θ q + θaq a + q , D→a = θ d + θad a + d , y Y→a¯ (Q →a , D→a ) = θ y + θay a¯ + θqy Q →a + θd D→a +  y ,

d D→a

D y

Y→a¯ (Q →a , D→a )

Y q Q

The two groups of variables A, Q, D, Y and Q →a , D→a , Y→a¯ (Q →a , D→a ) are connected by d , q and  y , indicating that the factual and counterfactual worlds share the same unobserved randomness. From the twin network we can deduce that Y→a¯ (Q →a , D→a ) { A, Q, D, Y }| = {q , d ,  y }, and therefore that p(Y→a¯ (Q →a , D→a )|A=a, Q=q n , D = d n , Y = y n ) can be expressed as |=

A

(5)

Q →a

 

n n p(Y→a¯ (Q →a , D→a )|, a, d , y n ) p(|a, q n , d n , y n ).  q ,

(6)

As p(qn |a, q n , d n , y n ) = δqn =q n −θ q −θaq a , p(dn |a, q n , d n , y n ) = δdn =d n −θ d −θad a , and p( yn |a, q n , d n , y n ) = δ yn =y n −θ y −θay a−θqy q n −θdy d n , we obtain p(Y→a¯ (Q →a , D→a )|a, q n , d n , y n ) = 1Y→a¯ (Q →a ,D→a )=y n −θay a+θay a¯ .

(7)

Indeed, by expressing Y→a¯ (Q →a , D→a ) as a function of qn , dn and  yn , we obtain y

Y→a¯ (Q →a , D→a ) = θ y + θay a¯ + θqy Q →a + θd D→a +  yn y

= θ y + θay a¯ + θqy (θ q + θaq a + qn ) + θd (θ d + θad a + dn ) +  yn = y n − θay a + θay a. ¯

(8)

Therefore, the outcome in the counterfactual world is simply obtained by correcting y y ¯ the outcome in the factual world through replacing θa a with θa a. In the more complex case in which we are interested in p(Y→a¯ (Q →a , D→a¯ )|A = a, Q = q n , D = d n , Y = y n ), a similar reasoning would give y

Y→a¯ (Q →a , D→a¯ ) = θ y + θay a¯ + θqy (θ q + θaq a + qn ) + θd (θ d + θad a¯ + dn ) +  yn y

y

= y n − θay a + θay a¯ − θd θad a + θd θad a. ¯

(9)

¯ + dn , the desired outUsing the notation Y = f θ y (A, Q, D) +  y and dan¯ = f θ d (a) n n n n ¯ q , da¯ ) +  y . Therefore, the outcome in the come can be written as ya¯ = f θ y (a, counterfactual world is obtained by correcting d n into dan¯ , and then use dan¯ and the y y correction of θa a into θa a¯ to generate a corrected outcome.

164

L. Oneto and S. Chiappa

In a more complex scenario in which e.g. D = f θ d (A, d ) for a non-invertible non-linear function f θ d , we can sample dn,m from p(d |a, q n , d n , y n ), and perform a Monte-Carlo approximation of Eq. (6), obtaining yan¯

M 1  n = f θ y (a, ¯ q n , dan,m ¯ ,  y ), M m=1

(10)

where dan,m = f θ d (a, ¯ dn,m ). ¯ 2.2.2

Quantify Unfairness in a Dataset

The path-specific effect and counterfactual introduced above can be used to quantify unfairness in a dataset in complex scenarios, respectively at the population and individual levels. Path-Specific Unfairness. In the college admission scenario, let us assume that the departmental differences are a result of sysun temic cultural pressure—i.e. female applicants apply to specific fa ir departments at lower rates because of overt or covert societal discouragement—and that for this reason, in addition to the direct D Y path A → Y , the path A → D (and therefore A → D → Y ) is deemed unfair. One way to measure unfairness along A → Y and A → D → Y overall population would be to compute the path-specific effect A

Q

unfair

E p(Y→a¯ (Q →a ,D→a¯ )) [Y→a¯ (Q →a , D→a¯ )] − E p(Y→a ) [Y→a ].

(11)

Notice that computing this quantity requires knowledge of the CBN: If the CBN structure is miss-specified or its conditional distributions are poorly estimated, the resulting estimate could be imprecise. Path-Specific Counterfactual Unfairness. Rather than measuring unfairness along A → Y and A → D → Y overall population, we might want to know whether a specific female applicant {a n = a, q n , d n , y n } who was not admitted would have been admitted had she been male (A = a) ¯ along the direct path A → Y and the indirect path A → D → Y . This question can be answered by estimating the path-specific counterfactual p(Y→a¯ (Q →a , D→a¯ )|A = a, Q = q n , D = d n , Y = y n ). Notice that the outcome in the actual world, y n , corresponds to p(Y→a (Q →a , D→a )|A = a, Q = q n , D = d n , Y = y n ) = 1Y→a (Q →a ,D→a )=y n . 2.2.3

Imposing Fairness in a Model

In addition to quantify unfairness in a dataset, path-specific effects and counterfactuals can also be used to impose fairness in a ML model.

Fairness in Machine Learning

165

In the college admission scenario, a model would output an estimate of the proba¯ = 1|A = a n , Q = q n , D = d n ), bility that individual n belongs to class 1, s n = p(Y n where p¯ indicates the estimate of p. We denote with s→ a¯ (Q →a , D→a¯ ) the model estin n mated probability that a female applicant {a = a, q , d n } would have been admitted in a counterfactual world in which she were male along A → Y and A → D → Y , i.e. n ¯ →a¯ (Q →a , D→a¯ ) = 1|A = a n , Q = q n , D = d n ), s→ a¯ (Q →a , D→a¯ ) = p(Y

(12)

and with S→a¯ (Q →a , D→a¯ ) the corresponding random variable. Notice that, unlike above, we do not condition on y n . Path-Specific Fairness. We can use the path-specific effect to formalize the requirement that the influence of A along the unfair causal paths A → Y and A → D → Y should be absent from the model, by requiring E p(S→a¯ (Q →a ,D→a¯ )) [S→a¯ (Q →a , D→a¯ )] = E p(S→a ) [S→a ],

(13)

i.e. that the path-specific effect in the model should be zero. This criterion was called Path-Specific Fairness (PSF) in [27]. The work in [134] introduces a method to achieve PSF based on enforcing the PSE to be small during model training. When deploying the model, this method requires integrating out all variables that are descendant of A long unfair causal paths (in this case D) to correct for the unfairness in their realizations, with consequent loss in accuracy. Path-Specific Counterfactual Fairness. To overcome this issue, the work in [25] proposes to instead correct the model output s n into its counterfactual n ¯ →a¯ (D→a ) = 1|A = a n , Q = q n , D = d n ). s→ a¯ (Q →a , D→a¯ ) = p(Y

(14)

The resulting model satisfies the individual-level fairness criterion called PathSpecific Counterfactual Fairness, that can be expressed as requiring p f (Y = 1|A = a, ¯ Q = q n , D = d n ) = p f (Y = 1|A = a n , Q = q n , D = d n ), where p f indicates the distribution from the resulting model. The Monte-Carlo approximation corresponding to Eq. 10 is given by p(Ya¯ (Q →a , D→a¯ )|A = a, Q = q n , D = d n ) ≈

M 1  p(Y ¯ |A = a, ¯ Q = q n , D = d n,m ), M m=1

where d n,m = f θ d (A = a, ¯ dn,m ) is the corrected version of d n . From the discussion above we can deduce that the general procedure for computing the desired counterfactual outcome is to condition Y on the non-descendants of A, on the descendants of A that are only fairly influenced by A, and on corrected versions of the descendants of A that are unfairly influenced by A.

166

L. Oneto and S. Chiappa

3 Methods for Imposing Fairness in a Model From a procedural viewpoint, methods for imposing fairness can roughly be grouped into pre-processing, in-processing, and post-processing methods. Pre-Processing methods consist in transforming the training data to remove undesired biases [16, 21, 24, 28, 47, 51–53, 61, 64–67, 86, 90–92, 112, 114, 116, 118, 121, 122, 126, 127, 164, 174, 188, 190, 191, 193]. The resulting transformations can then be used to train a ML model in a standard way. In-Processing methods enforce a model to produce fair outputs through imposing fairness constraints into the learning mechanism. Some methods transform the constrained optimization problem via the method of Lagrange multipliers [3, 15, 34, 37, 38, 59, 97, 135, 183, 185] or add penalties to the objective [5, 14, 22, 44, 46, 56, 58, 62, 73–75, 79, 82, 87–89, 93–95, 98, 104, 115, 119, 123, 133–135, 138, 139, 154, 165, 176, 179, 180, 182], others use adversary techniques to maximize the system ability to predict the target while minimizing the ability to predict the sensitive attribute [189]. Post-Processing methods consist in transforming the model outputs in order to make them fair [2, 7, 31, 42, 51, 54, 68, 69, 99, 100, 106, 137, 147, 153, 156, 178]. Notice that this grouping is imprecise and non exhaustive. For example, the method in [25] imposes constraints into the learning mechanism during training in order to ensure that requirements for correct post-processing are met. In addition, whilst not tested for this purpose, the methods to achieve (Path-Specific) Counterfactual Fairness in [25, 106] generate corrected model outputs for fairness through correcting the variables used to generate those outputs. As such, they could be used to generate fair datasets, and therefore be categorized as pre-processing methods. From a setting viewpoint, methods for imposing fairness have for long almost entirely focused on binary classification with categorical sensitive attributes (see [44] for a review of this setting). More recently, regression [15, 22, 55, 156] has also started to be considered, as well as continuous sensitive attributes through discretization [103, 104, 149, 182]. Most methods are still fragmented and compartmentalized in terms of task and sensitive attribute types, as well as of fairness definitions considered, and most lack consistency results. These limits are partially addressed in [27, 44, 103, 141] by introducing unified frameworks that encompass methods that can deal with different settings and fairness criteria and are accompanied by theoretical guarantees about their fairness properties. The work in [27] views different fairness desiderata as requiring matching of distributions corresponding to different sensitive attributes, and leverages optimal transport theory achieve that for the binary classification and regression settings, providing one of the few approaches that do not approximate fairness desiderata through requirements on the lower order moments or other functions of distributions corresponding to different sensitive attributes. We present a simple post-processing method derived withing this framework in Sect. 3.1. In [44], simple notions of fairness are incorporated within the Empirical Risk Minimization framework. This framework is extended to cover the whole supervised learning setting with risk and fairness bounds—implying consistency properties both

Fairness in Machine Learning

167

in terms of fairness measure and risk of the selected model—in [141]. We present this work in Sect. 3.2. Most often the goal of current methods for imposing fairness is to create a fair model for a fixed scenario. However, in a large number of real world applications using the same model or part of it over different tasks might be desirable. For example, it is common to perform a fine tuning over pre-trained models [43], keeping the internal representation fixed. Unfortunately, fine tuning a model which is fair on a task on novel previously unseen tasks could lead to an unexpected unfairness behaviour (i.e. discriminatory transfer [107] or negative legacy [94]), due to missing generalization guarantees concerning its fairness properties. To avoid this issue, it is necessary to consider the learning problem in a multitask/lifelong learning framework. Recent methods leverage task similarity to learn fair representations that provably generalizes well to unseen tasks. Such methods can be seen as a sophisticated form of pre-processing methods. We discuss one such a method in Sect. 3.3. Legal requirements often forbid the explicit use of sensitives attributes in the model. Therefore research into developing methods that meet these requirements is important. In Sect. 3.4, we discuss issues with not explicitly using sensitives attributes and introduce a method that does not require the use of them during model deployment. Before delving into the explanation of specific methods, we conclude this section with a list of publicly available datasets commonly used in the ML fairness literature in Table 1.

3.1 Constraints on Distributions with Optimal Transport Most methods to obtain fair models impose approximations of fairness desiderata through constraints on lower order moments or other functions of distributions corresponding to different sensitive attributes (this is also what most popular fairness definitions require). Whilst facilitating model design, not imposing constraints on the full shapes of relevant distributions can be problematic. By matching distributions corresponding to different sensitive attributes either in the space of model outputs or in the space of model inputs (or latent representations of the inputs) using optimal transport theory, the work in [27, 84] an approach to fair classification and regression that is applicable to many fairness criteria. In this section, we describe a simple post-processing method to achieve Strong Demographic Parity that was derived in this work. Strong Demographic Parity (SDP). Let us extend the notation of Sect. 2.1 to include regression and multiple, possibly non-binary, sensitive attributes. That is, N , y n can be continuous or let us assume that, in the dataset D = {(a n , x n , y n )}n=1 n n k categorical, and a ∈ A = N (where element ai might correspond e.g. to gender). Regression and classification can be uniquely framed as the task of estimating the probability distribution p(Y |A, X ), and by assuming that the model outputs the expectation

168

L. Oneto and S. Chiappa

Table 1 Publicly available datasets commonly used in the ML fairness literature Datasets

Reference

Number of samples

Number of features

Sensitive features

Task

xAPI students performance

[8]

480

16

Gender, nationality, native-country

MC

NLSY

[19]

≈10 K

Students performance

[35]

649

33

Age, Gender

Wine quality

[36]

4898

13

Color

MC, R

Drug consumption

[50]

1885

32

Age, ethnicity, gender, country

MC

School effectiveness

[60]

15362

9

Ethnicity, gender

R

Arrhythmia

[63]

452

279

Age, gender

MC

MovieLens

[70]

100 K

≈20

Age, gender

R

Heritage health

[76]

≈60 K

≈20

Age, gender

MC, R

German credit

[78]

1K

20

Age, gender/maritalstatus

MC

Student academics perf.

[80]

300

22

Gender

MC

Heart disease

[83]

303

75

Age, gender

MC, R

Census/adult income

[102]

48842

14

Age, ethnicity, gender, native-country

BC

COMPAS

[108]

11758

36

Age, ethnicity, gender

BC, MC

Contraceptive method choice

[109]

1473

9

Age, religion

MC

CelebA faces

[111]

≈200 K

40

Gender skin-paleness, youth

BC

Chicago faces

[117]

597

5

Ethnicity, gender

MC

Diversity in faces

[129]

1M

47

Age, gender

MC, R

Bank marketing [132]

45211

17–20

Age

BC

Stop, question and frisk

[136]

84868

≈100

Age, ethnicity, gender

BC, MC

Communities and crime

[157]

1994

128

Ethnicity

R

Diabetes US

[169]

101768

55

Age, ethnicity

BC, MC

Law school admission

[175]

21792

5

Ethnicity, gender

R

Credit card

[181]

30 K

24

Age, gender

BC

default

Birth-date, BC, MC, R ethnicity, gender R

Fairness in Machine Learning

169

s n = E p(Y ¯ |A=a n ,X =x n ) [Y ] ,

(15)

where p¯ indicates the estimate of p (below we omit the distinction between p and p¯ to simplify the notation). A prediction yˆ n of y n is thus obtained as yˆ n = s n for the regression case, and as yˆ n = 1s n >τ for the classification case. We denote with Sa the output variable restricted to the group of individuals with sensitive attributes a, i.e. with distribution p(Sa ) = p(S|A = a) (we also denote this distribution with p Sa ). We can extend Demographic Parity (Sect. 2.1) to this setting by re-phrasing it as the requirement that the expectation of Yˆ should not depend on A, i.e. ˆ ¯ ∈ A. E p(Yˆ |A=a) [Yˆ ] = E p(Yˆ |A=a) ¯ [Y ], ∀a, a

(16)

|=

In the case of classification, enforcing Demographic Parity at a given threshold τ does not necessarily imply that the criterion is satisfied for other thresholds. Furthermore, to alleviate difficulties in optimizing on the class prediction Yˆ , relaxations are often considered, such as imposing the constraint E[S|A = a] = E[S|A = a], ¯ ∀a, a¯ ∈ A [59, 184]. In the case of regression, whilst Demographic Parity is a commonly used criterion [55], it represents a limited way to enforce similarity between the conditional distributions p(S|A). To deal with these limitations, the work in [27, 84] introduces the Strong Demographic Parity (SDP) criterion, which requires S A, i.e. p(Sa ) = p(Sa¯ ), ∀a, a¯ ∈ A,

(17)

and an approach to enforce SDP using optimal transport theory [152, 173]. Optimal Transport. In Monge’s formulation [131], the optimal transport problem consists in transporting a distribution to another one incurring in minimal cost. More specifically, given two distributions p X and pY on X and Y, the set T of transportation mapsfrom X to Y (where each transportation map T : X → Y satisfies  p (y)dy = T −1 (B) p X (x)d x for all measurable subsets B ⊆ Y), and a cost funcB Y tion C : X × Y → [0, ∞], the optimal transport problem consists in finding the transportation map T ∗ that minimizes the total transportation cost, i.e. such that

T ∗ = arg min WC ( p X , pY ) = arg min T ∈T

T ∈T

 C(x, T (x)) p X (x)d x.

(18)

Under appropriate conditions on C, WC ( p X , pY ) can be turned into a distance between p X and pY . Specifically, if X = Y and C = D p for some distance metric D : X × Y → R and p ≥ 1, then WC ( p X , pY )1/ p is a valid distance between p X p and pY . When X = Y = Rd and C(x, y) = x − y p , where · p indicate the L p norm, WC ( p X , pY ) corresponds to the pth power of the Wasserstein- p distance and we adopt the shorthand W p ( p X , pY ) to denote it.

170

L. Oneto and S. Chiappa

Post-Processing by Transporting Distributions to their Barycenter. We would like to perform a post-processing of the model outputs to reach SDP by transporting the distribution p Sa of each group output variable Sa to a common distribution p S¯ . In order to retain accuracy, we would like to use a transportation map Ta∗ such that Ta∗ (Sa ) remains close to Sa . For regression, Ta∗ minimizing E pSa [(Sa − Ta (Sa ))2 ] would satisfy this property in a least-squares sense, giving Ta∗ = arg min Ta ∈T ( pSa , pS¯ ) W2 ( p Sa , p S¯ ). Considering all groups, each weighted by its probability pa = p(A = a), we obtain that the distribution p S¯ inducing the minimal deviation from S is given by p S¯ = arg min p∗



pa W2 ( p Sa , p ∗ ).

(19)

a∈A

This distribution coincides with the Wasserstein-2 barycenter with weights pa . For classification, using instead the L 1 norm gives the Wasserstein-1 barycenter. This has the desirable property of inducing the minimal number of class prediction changes in expectation. Indeed, a class prediction yˆ = 1sa >τ changes due to  transportation T (sa ) if and only if τ ∈ m sTa , MsTa where m sTa = min[sa , T (sa )] and MsTa = max[sa , T (sa )]. This observation leads to the following result. Proposition 1 Let Sa and S¯ be two output variables with values in [0, 1] and with distributions p Sa and p S¯ , and let T : [0, 1] → [0, 1] be a transportation map sat isfying B p S¯ (y)dy = T −1 (B) p Sa (x)d x for any measurable subset B ⊂ [0, 1]. The following two quantities are equal:  1. W1 ( p Sa , p S¯ ) = min x∈[0,1] |x − T (x)| p Sa (x)d x, T

2. Expected class prediction changes to transporting p Sa into p S¯ through the   due ∗ ∗  map T ∗ , Eτ ∼U ([0,1]),x∼ pSa P τ ∈ m Tx , MxT , where U indicate the uniform distribution. The proof is reported in [84]. In summary, we have demonstrated that the optimal post-processing procedure to ensure fairness whilst incurring in minimal model deviation is to transport all group distributions p Sa to their weighted barycenter distribution p S¯ . Partial Transportation for Fairness-Accuracy Trade-Off. Whilst the approach described above allows to achieve SDP by retaining as much accuracy as possible, in some cases we might want to trade-off a certain amount of fairness for higher accuracy. In the remainder of this section, we explain how to obtain an optimal trade-off for the case of the Wasserstein-2 metric space. Not achieving SDP implies that each p Sa is transported to a distribution p Sa∗ that does not match the barycenter p S¯ . A valid measure of deviation from SDP is dpair =  ∗ ∗ ¯ ∈ A. For any distribution a=a¯ W2 ( p Sa∗ , p Sa¯ ) since dpair = 0 ⇔ p Sa∗ = p Sa¯ , ∀a, a p, by the triangle and Young’s inequalities,

Fairness in Machine Learning

dpair ≤



171

W2 ( p Sa∗ , p) +



W2 ( p, p Sa∗¯ )

2

a=a¯

  ≤ 2 W2 ( p Sa∗ , p) + W2 ( p, p Sa∗¯ ) = 4(|A| − 1) W2 ( p Sa∗ , p) . (20) a=a¯

a∈A

By the definition of the barycenter, this upper bound reaches its minimum when p = p S¯ . We call this tightest upper bound pseudo-dpair and use it to derive optimal trade-off solutions. For any r ∈ R+ , we say that pseudo-dpair satisfies the r -fairness constraint when it is smaller than r . To reach optimal trade-offs, we are interested in transporting p Sa to p Sa∗ under  the r -fairness constraint while minimizing the deviation from S, min p∗Sa a∈A pa W2 ( p Sa , p Sa∗ ). Assuming disjoint groups, we can optimize each group transportation in turn independently. The r -fairness constraint    ∗ on a single  group a becomes W2 ( p Sa , p S¯ ) ≤ r − d , where r = r/(4|A| − 4) and  W2 ( p Sa∗¯ , p S¯ ). Satisfying this constraint corresponds to transporting d = a∈A\{a} ¯ p Sa to the ball with center p S¯ and radius r  − d  in the Wasserstein-2 metric space. To achieve the optimal trade-off, we need to transport p Sa to a destination p Sa∗ with minimal W2 ( p Sa , p Sa∗ ). Thus we want p Sa∗ =

arg min

pa W2 ( p Sa , p ∗ ) =

p∗ s.t. W2 ( p∗ , p S¯ )≤r  −d 

arg min

W2 ( p Sa , p ∗ ), (21)

p∗ s.t. W2 ( p∗ , p S¯ )≤r  −d 

 since pa is constant with respect to p ∗ . As W2 ( p Sa , p ∗ ) ≥ W2 ( p Sa , p S¯ ) − 2 ∗ ∗ W2 ( p , p S¯ ) by triangle inequality, W2 ( p Sa , p ) reaches its minimum if and only if p ∗ lies on a shortest path between p Sa and p S¯ . Therefore it is optimal to transport p Sa along any shortest path between itself and p S¯ in the Wasserstein-2 metric space. Algorithm 1 Wasserstein-2 Geodesic N , number of bins B, model outputs {s n }, trade-off paramInput: Dataset D = {(a n , x n , y n )}n=1 eter t. ¯ Obtain group datasets {Da } and barycenter dataset D. Define the i-th quantile of Da , as

  1 i −1 qDa (i) := sup s : 1s n ≤s ≤ , Na B n n s.t. a =a

−1 (s) := sup{i ∈ [B] : qDa (i) ≤ s}. and its inverse as qD a −1 −1 −1 Define qDa,t (s) := (1 − t)qD (s) + t qD ¯ (s) giving a   −1 −1 qDa,t (i) = sup s ∈ [0, 1] : (1 − t)qD (s) + t q (s) ≤ i . ¯ a D

  −1 n (s ) . Return: qDa,t qD a

172

3.1.1

L. Oneto and S. Chiappa

Wasserstein-2 Geodesic Method

In the univariate case, we can derive a simple post-processing method for implementing the optimal trade-off between accuracy and SDP described above based on geodesics. Let the Wasserstein-2 space P2 (R) be defined as the space of all distributions p on the metric space R with finite3 absolute p-th moments, i.e. E p(s1 ) [|s1 − s0 | p ] < ∞ for ∀s0 ∈ R, equipped with the Wasserstein-2 metric. As R is a geodesic space, i.e. there exists a geodesic between every pair of points in that space, then so is P2 (R) [110]. Whilst geodesics are only locally shortest paths, shortest paths are always geodesics if they exist. In the case of P2 (R), the geodesic between p Sa and p S¯ is unique and can be parametrized by = (1 − t)PS−1 + t PS¯−1 , PS−1 a ,t a

t ∈ [0, 1],

(22)

where PSa and PS¯ are the cumulative distribution functions of Sa and S¯ [152]. This geodesic, by its uniqueness, is therefore the shortest path. The parameter t controls the level to which p Sa is moved toward the barycenter p S¯ , with t = 1 corresponding to total matching. An implementation of this method is described in Algorithm 1, where Da denotes the subset of D corresponding to the group of Na individuals with sensitive attributes a.

3.2 General Fair Empirical Risk Minimization In Sect. 3.1, we have discussed a post-processing method that was derived within a unified optimal transport approach to fairness. In this section, we present another unified framework for fairness, introduced in [44, 141], based on the empirical risk minimization strategy. This framework incorporates several notions of fairness, can deal with continuous outputs and sensitive attributes through discretization, and is accompanied by risk and fairness bounds, which imply consistency properties both in terms of fairness measure and risk of the selected model. N be a training dataset formed by N samples drawn indeLet D = {(s n , x n , y n )}n=1 pendently from an unknown probability distribution μ over S × X × Y, where y n is outcome that we wish to predict, s n the sensitive attribute, and x n a vector of features to be used to form a prediction yˆ n of y n . Notice that we indicate the sensitive attribute with s n rather than a n as in the previous sections. We will us this new notation in all remaining section. To deal with the case in which y n and s n are continuous, we define the discretization sets Y K = {t1 , . . ., t K +1 } ⊂ R and S Q = {σ1 , . . ., σ Q+1 } ⊂ R, where t1 < t2 < · · · < t K +1 , σ1 < σ2 < · · · < σ Q+1 , and K and Q are positive integers. 3 This

condition is satisfied as we use empirical approximations of distributions.

Fairness in Machine Learning

173

The sets Y K and S Q define values of the outcome and sensitive attribute that are regarded as indistinguishable—their definition is driven by the particular application under consideration. We indicate with Dk,q the subset of Nk,q individuals with y n ∈ [tk , tk+1 ) and s n ∈ [σq , σq+1 ) for 1 ≤ k ≤ K and 1 ≤ q ≤ Q. Unlike the previous sections, we indicate the model output with f (z n ), where f is a deterministic function (we refer to it as model) chosen from a set F such that f : Z → R, where Z = S × X or Z = X, i.e. Z may contain or not the sensitive attribute. The risk of the model, L( f ), is defined as L( f ) = E [( f (Z ), Y )], where  : R × Y → R is a loss function. When necessary, we indicate with a subscript the  particular loss function used and the associated risk, e.g. L p ( f ) = E  p ( f (Z ), Y ) . We aim at minimizing the risk subject to the -Loss General Fairness constraint introduced below, which generalizes previously known notions of fairness, encompasses both classification and regression and categorical and numerical sensitive attributes. Definition 1 A model f is -General Fair (-GF) if it satisfies Q K  1    k, p P ( f ) − P k,q ( f ) ≤ , 2 K Q k=1 p,q=1

(23)

where  ∈ [0, 1] and   P k,q ( f ) = P f (Z ) ∈ [tk , tk+1 ) | Y ∈ [tk , tk+1 ), S ∈ [σq , σq+1 ) .

(24)

This definition considers a model as fair if its predictions are approximately (with  corresponding to the amount of acceptable approximation) equally distributed independently of the value of the sensitive attribute. It can be further generalized as follows. Definition 2 A model f is -Loss General Fair (-LGF) if it satisfies Q K  1    k, p  k,q ( f ) − L ( f ) L  ≤ , k k K Q 2 k=1 p,q=1

(25)

where  ∈ [0, 1] and   k,q L k ( f ) = E k ( f (Z ), Y ) | Y ∈ [tk , tk+1 ), S ∈ [σq , σq+1 ) ,

(26)

where k is a loss function. This definition considers a model as fair if its errors, relative to the loss function, are approximately equally distributed independently of the value of the sensitive attribute.

174

L. Oneto and S. Chiappa

Remark 1 For k ( f (Z ), Y ) = 1 f (Z )∈[t / k ,tk+1 ) , Definition 2 becomes Definition 1. Moreover, it is possible to link Definition 2 to other fairness definitions in the literature. Let us consider the setting Y = {−1, +1}, S = {0, 1}, Y K = {−1.5, 0, +1.5}, S Q = {−0.5, 0.5, 1.5},  = 0. In this setting, if k is the 0–1-loss, i.e. k ( f (Z ), Y ) = 1 f (Z )Y ≤0 , then Definition 2 reduces to EFPRs/EFNRs (see Sect. 2.1), whilst if k is the linear loss, i.e. k ( f (Z ), Y ) = (1 − f (Z )Y )/2, then we recover other notions of fairness introduced in [46]. In the setting Y ⊆ R, S = {0, 1}, Y K = {−∞, ∞}, S Q = {−0.5, 0.5, 1.5},  = 0, Definition 2 reduces to the notion of Mean Distance introduced in [22] and also exploited in [103]. Finally, in the same setting, if S ⊆ R in [103] it is proposed to use the correlation coefficient which is equivalent to setting S Q = S in Definition 2. Minimizing the risk subject to the -LGF constraint corresponds to the following minimization problem ⎫ Q  K  ⎬    k, p  k,q min L( f ) : L k ( f ) − L k ( f ) ≤  . ⎭ f ∈F ⎩ ⎧ ⎨

(27)

k=1 p,q=1

Since μ is usually unknown and therefore the risks cannot be computed, we approximate this problem by minimizing the empirical counterpart ⎫ Q  K  ⎬    ˆ k, p  k,q ˆ f) : min L(  L k ( f ) − Lˆ k ( f ) ≤ ˆ , ⎭ f ∈F ⎩ ⎧ ⎨

(28)

k=1 p,q=1

 ˆ f ) = Eˆ [( f (Z ), Y )] = 1 (z n ,y n )∈D ( f (z n ), y n ), and Lˆ k,q where ˆ ∈ [0, 1], L( k (f) N  = N1k,q (z n ,y n )∈Dk,q k ( f (z n ), y n ). We refer to Problem (28) as General Fair Empirical Risk Minimization (G-FERM) since it generalizes the Fair Empirical Risk Minimization approach introduced in [44]. Statistical Analysis In this section we show that, if the parameter ˆ is chosen appropriately, a solution fˆ of Problem (28) is in a certain sense a consistent estimator for a solution f ∗ of Problems (27). For this purpose we require that, for any data distribution, it holds with probability at least 1 − δ with respect to the draw of a dataset that   ˆ f ) ≤ B(δ, N , F ), sup  L( f ) − L(

(29)

f ∈F

where B(δ, N , F ) goes to zero as N grows to infinity, i.e. the class F is learnable with respect to the loss [161]. Moreover B(δ, N , F ) is usually an exponential bound, which means that B(δ, N , F ) grows logarithmically with respect to the inverse of δ. Remark 2 If F is a compact subset of linear separators in a reproducing kernel Hilbert space, and the loss is Lipschitz in its first argument, then B(δ, N , F ) can be

Fairness in Machine Learning

175

obtained via Rademacher bounds [12]. In this √ case B(δ, N , F ) goes to zero at least √ as 1/N as N grows and decreases with δ as ln (1/δ). We are ready to state the first result of this section. Theorem 1 Let F be a learnable set of functions with respect to the loss function  : R × Y → R, and let f ∗ and fˆ be a solution of Problems (27) and (28) respectively, with ˆ =  +

Q K  



B(δ, Nk, p , F ).

(30)

k=1 q,q  =1 p∈{q,q  }

With probability at least 1 − δ it holds simultaneously that Q  Q K  K      k, p  k,q L k ( f ) − L k ( f ) ≤  +2



 B

k=1 q,q  =1 p∈{q,q  }

k=1 p,q=1

 δ , N , F , k, p (4K Q 2 + 2) (31)

and L( fˆ) − L( f ∗ ) ≤ 2B



 δ , N , F . (4K Q 2 + 2)

(32)

The proof is reported in [141]. A consequence of the first statement in Theorem 1 is that, as N tends to infinity, L( fˆ) tends to a value which is not larger than L( f ∗ ), i.e. G-FERM is consistent with respect to the risk of the selected model. The second statement in Theorem 1 instead implies that, as N tends to infinity, fˆ tends to satisfy the fairness criterion. In other words, G-FERM is consistent with respect to the fairness of the selected model. √ Remark 3 Since K , Q ≤ N , the bound in Theorem 1 behaves as ln (1/δ) /N in the same setting of Remark 2 which is optimal [161]. Thanks to Theorem 1, we can state that f ∗ is close to fˆ both in term of its risk and its fairness. Nevertheless, the final goal is to find a f h∗ which solves the following problem ⎫ Q K  ⎬    k, p  P ( f ) − P k,q ( f ) ≤  . min L( f ) : ⎭ f ∈F ⎩ ⎧ ⎨

(33)

k=1 p,q=1

The quantities in Problem (33) cannot be computed since the underline data generating distribution is unknown. Moreover, the objective function and the fairness constraint are non convex. Theorem 1 allows us to solve the first issue since we can safely search for a solution fˆh of the empirical counterpart of Problem (33), which is given by

176

L. Oneto and S. Chiappa

⎫ Q  K  ⎬     ˆ f) : min L(  Pˆ k, p ( f ) − Pˆ k,q ( f ) ≤ ˆ , ⎭ f ∈F ⎩ ⎧ ⎨

(34)

k=1 p,q=1

where Pˆ k,q ( f ) =

1 Nk,q



1 f (z n )∈[tk ,tk+1 ) .

(35)

(z n ,y n )∈Dk,q

Unfortunately, Problem (34) is still a difficult non-convex non-smooth problem. Therefore, we replace the possible non-convex loss function in the risk with its convex upper bound c (e.g. the square loss c ( f (Z ), Y ) = ( f (Z ) − Y )2 for regression, or the hinge loss c ( f (Z ), Y ) = max(0, 1 − f (Z )Y ) for binary classification [161]), and the losses k in the constraint with a relaxation (e.g. the linear loss l ( f (Z ), Y ) = f (Z ) − Y ) which allows to make the constraint convex. This way we look for a solution fˆc of the convex G-FERM problem ⎫ Q  K  ⎬     k, p k,q min Lˆ c ( f ) :  Lˆ l ( f ) − Lˆ l ( f ) ≤ ˆ . ⎭ f ∈F ⎩ ⎧ ⎨

(36)

k=1 p,q=1

This approximation of the fairness constraint corresponds to matching the first order moments [44]. Methods that attempt to match all moments, such as the one discussed in Sect. 3.1 or [154], or the second order moments [177] are preferable, but result in non-convex problemes. The questions that arise here are whether fˆc is close to fˆh , how much, and under which assumptions. The following proposition sheds some lights on these questions. Proposition 2 If c is a convex upper bound of the loss exploited to compute the risk then Lˆ h ( f ) ≤ Lˆ c ( f ). Moreover, if for f : X → R and for l Q  K       ˆ k, p   k, p  k,q ˆ  P ( f ) − Pˆ k,q ( f ) −  Lˆ l ( f ) − Lˆ l ( f ) ≤ ,

(37)

k=1 p,q=1

ˆ small, then also the fairness is well approximated. with The first statement of Proposition 2 tells us that the quality of the risk approximation depends on the quality of the convex approximation. The second statement of ˆ is small then the linear loss based fairness Proposition 2, instead, tells us that if is close to -LGF. This condition is quite natural, empirically verifiable, and it has been exploited in previous work [44, 124]. Moreover, in [141] the authors present ˆ is small. experiments showing that

Fairness in Machine Learning

177

The bound in Proposition 2 may be tighten by using different non-linear approximations of -LGF. However, the linear approximation proposed here gives a convex problem, and as showed in [141], works well in practice. In summary, Theorem 1 and Proposition 2 give the conditions under which a solution fˆc of Problem (28), which is convex, is close, both in terms of risk and fairness measure, to a solution f h∗ of Problem (33). 3.2.1

General Fair Empirical Risk Minimization with Kernel Methods

In this section, we introduce a specific method for the case in which the underlying space of models is a reproducing kernel Hilbert space (RKHS) [162, 163]. Let H be the Hilbert space of square summable sequences, κ : Z × Z → R a positive definite kernel, and φ : Z → H an induced feature mapping such that κ(Z , Z¯ ) = φ(Z ), φ( Z¯ ), for all Z , Z¯ ∈ Z, Functions f in the RKHS can be parametrized as f (Z ) = w, φ(Z ), Z ∈ Z, (38) for some vector of parameters w ∈ H. Whilst a bias term can be added to f , we do not include it here for simplicity of exposition. We propose to solve Problem (36) for the case in which F is a ball in the RKHS, using a convex loss function c ( f (Z ), Y ) to measure the empirical error and a linear loss function l as fairness constraint. We introduce the mean of the feature vectors associated with the training points restricted by the discretization of the sensitive attribute and real outputs, namely u k,q =

1 Nk,q



φ(z n ).

(39)

(z n ,y n )∈Dk,q

Using Eq. (38), the constraint in Problem (36) becomes Q K     w, u k, p − u k,q  ≤ ˆ ,

(40)

k=1 p,q=1

which can be written as A T w 1 ≤ ˆ , where A is the matrix having as columns the vectors u k, p − u k,q . With this notation, the fairness constraint can be interpreted as the composition of ˆ ball of the 1 norm with a linear transformation A. In practice, we solve the following Tikhonov regularization problem min w∈H

 (z n ,y n )∈D

c (w, φ(z n ), y n ) + λ w 2 ,

s.t. A w 1 ≤ ˆ ,

(41)

178

L. Oneto and S. Chiappa

where λ is a positive parameter. Note that, if ˆ = 0, the constraint reduces to the linear can be kernelized by observing that, thanks to the constraint A w = 0. Problem (41) Representer Theorem [162], w = (z n ,y n )∈D φ(z n ). The dual of Problem (41) may be derived using Fenchel duality, see e.g. [18, Theorem 3.3.5]. Finally we notice that, in the case when φ is the identity mapping (i.e. κ is the linear kernel on Rd ) and ˆ = 0, the fairness constraint of Problem (41) can be implicitly enforced by making a change of representation [44].

3.2.2

Fair Empirical Risk Minimization Through Pre-processing

In this section, we show how the in-processing G-FERM approach described above can be translated into a pre-processing approach. For simplicity of exposition, we focus on the case of binary outcome and sensitive attribute, i.e. Y = {−1, +1} and S = {0, 1}, and assume Z = X . We denote with D+,s the subset of N+,s individuals belonging to class 1 and with sensitive attribute s n = s. As above, the purpose of ˆ f) = a learning procedure is to find a model that minimizes the empirical risk L( ˆE[( f (X ), Y )]. Le us introduce slightly less general notion of fairness with respect to Definition 2 of Sect. 3.2. Definition 3 A model f is -Fair (-F) if it satisfies the condition |L +,0 ( f ) − L +,1 ( f )| ≤ , where  ∈ [0, 1] and L +,s ( f ) is the risk of the positive labeled samples with sensitive attribute s. We aim at minimizing the risk subject to the fairness constraint given by Definition 3 with h ( f (X ), Y ) = 1 f (X )Y ≤0 (corresponding to EFPRs for  = 0). Specifically, we consider the problem     +,1 ≤ . ( f ) − L ( f ) min L h ( f ) : f ∈ F ,  L +,0 h h

(42)

By replacing the deterministic quantity with their empirical counterparts, the hard loss in the risk with a convex loss function c , and the hard loss in the constraint with the linear loss l , we obtaining the convex problem     min Lˆ c ( f ) : f ∈ F ,  Lˆ l+,0 ( f ) − Lˆ l+,1 ( f ) ≤ ˆ .

(43)

In the case in which the underlying space of models is a RKHS, f can parametrized as f (X ) = w, φ(X ), X ∈ X. (44) Let u s be the barycenter in the feature space of the positively labelled points with sensitive attribute s, i.e.

Fairness in Machine Learning

179

us =

1  φ(x n ), N+,s n∈N

(45)

+,s

where N+,s = {n : y n = 1, s n = s}. Using Eq. (44), Problem (43) with Tikhonov regularization and for the case in which F is a ball in the RKHS takes the form min w∈H

N 

c (w, φ(x n ), y n ) + λ w 2 ,

  s.t. w, u ≤ ,

(46)

n=1

where u = u 0 − u 1 , and λ is a positive parameter which controls model complexity. Using the Representer Theorem and the fact that u is a linear combination of the feature (corresponding to the subset of positive labeled points), we obtain vectors N N αn φ(x n ), and therefore f (X ) = n=1 αn κ(x n , X ). w = n=1 Let K be the Gram matrix. The vector of coefficients α can then be found by solving min

α∈R N

  N N N   i  Ki j α j , y + λ αi α j K i j , i=1

j=1

   N 1  s.t.  αi N i=1

i, j=1



+,0 j∈N +,0

  1  Ki j − K i j  ≤ . N+,1 j∈N

(47)

+,1

When φ is the identity mapping (i.e. κ is the linear kernel on Rd ) and  = 0, we can solve the constraint w,  u = 0 foru wi , where the index i is such that |u i | = u ∞ , obtaining wi = − dj=1, j=i w j u ij . Consequently, the linear model rewrites   u as dj=1 w j x j = dj=1, j=i w j (x j − xi u ij ). In other words, the fairness constraint is implicitly enforced by making the change of representation x → x˜ ∈ Rd−1 , with x˜ j = x j − xi

uj , ui

j ∈ {1, . . . , i − 1, i + 1, . . . , d},

(48)

that has one feature fewer than the original one. This approach can be extended to the non-linear case by defining a fair kernel matrix instead of fair data mapping [44, 141].

3.3 Learning Fair Representations from Multiple Tasks Pre-Processing methods aim at transforming the training data to remove undesired biases, i.e., most often, to make it statistically independent of sensitive attributes. Most existing methods consider a fixed scenario with training dataset N , and achieve a transformation of x n ∈ Rd through a mapping D = {(s n , x n , y n )}n=1 d r g : R → R which is either problem independent or dependent. In the latter case,

180

L. Oneto and S. Chiappa

the model is consider as a composition f (g(X )), where g synthesizes the information needed to solve a particular task by learning a function f . We refer to such a problem dependent mapping as representation, and to a representation g such that g(X ) does not depend on sensitive attributes as fair representation. There exist several approaches to learning fair representations. The work in [16, 47, 112, 118, 126, 127, 174] propose different neural networks architectures together with modified learning strategies to learn a representation g(X ) that preserves information about X , is useful for predicting Y , and is approximately independent of the sensitive attribute. In [85] the authors show how to formulate the problem of counterfactual inference as a domain adaptation problem—specifically a covariate shift problem [155]—and derive two families of representation algorithms for counterfactual inference. In [188], the authors learn a representation that is a probability distribution over clusters, where learning the cluster of a datapoint contains noinformation about the sensitive attribute. In a large number of real world applications using the same model or part of it over different tasks might be desirable. For example, it is common to perform a fine tuning over pre-trained models [43], keeping the internal representation fixed. Unfortunately, fine tuning a model which is fair on a task on novel previously unseen tasks could lead to an unexpected unfairness behaviour (i.e. discriminatory transfer [107] or negative legacy [94]), due to missing generalization guarantees concerning its fairness properties. To avoid this issue, it is necessary to consider the learning problem in a multitask/lifelong learning framework. Recent methods leverage task similarity to learn fair representations that provably generalizes well to unseen tasks. In this section, we present the method introduced in [140] to learn a shared fair representation from multiple tasks, where each task could be a binary classification or regression problem. N the training sequence for task t, sampled Let us indicate with τt = (stn , xtn , ytn )n=1 independently from a probability distribution μt on S × X × Y. The goal is to learn a model f t : Z × S → Y for several tasks t ∈ {1, . . . , T }. For simplicity, we assume linear f t and Z = X , i.e. f t (X ) = wt , X  where wt ∈ Rd is a vector of parameters, and binary sensitive attributes S = {0, 1}—the method naturally extends to the nonlinear and multiple sensitive attribute cases. Following a general multi-task learning (MTL) formulation, we aim at minimizing the multitask empirical error plus a regularization term which leverages similarities between the tasks. A natural choice for the regularizer is given by the trace norm, namely the sum of the singular values of the matrix W = [w1 · · · wT ] ∈ Rd×T . This results in the following matrix factorization problem min A,B

T N  2 λ  1  n A 2F + B 2F , yt − bt , A xtn  + T N t=1 n=1 2

(49)

where A = [a1 . . . ar ] ∈ Rd×r , B = [b1 . . . bT ] ∈ Rr ×T with W = AB, and where · F is the Frobenius norm (see e.g. [167] and references therein). Here r ∈ N is the number of factors, i.e. the upper bound on the rank of W . If r ≥ min(d, T ), Prob-

Fairness in Machine Learning

181

lem (49) is equivalent to trace norm regularization [11] (see e.g. [32] and references therein4 ). Problem (49) can be solved by gradient descent or alternate minimization as we discuss next. Once the problem is solved, the estimated parameters of the function wt for the tasks’ linear models are simply computed as wt = Abt . Notice that the problem is stated with the square loss function for simplicity, but the observations extend to the general case of proper convex loss functions. The approach can be interpreted as learning a two-layer network with linear activation functions. Indeed, the matrix A applied to X induces the linear representation A X = (a1 X, . . . , ar X ) . We would like each component of the representation vector to be independent of the sensitive attribute on each task. This means that, for every measurable subset C ⊂ Rr and for every t ∈ {1, . . . , T }, we would like P(A X t ∈ C | S = 0) = P(A X t ∈ C | S = 1).

(50)

To turn the non-convex constraint into a convex one, we require only the means to be the same, and compute those from empirical data. For each training sequence τ ∈ (X × Y)T , we define the empirical conditional means c(τ ) =

  1 1 xn − xn, |N0 (τ )| n∈N (τ ) |N1 (τ )| n∈N (τ ) 0

(51)

1

where Ns (τ ) = {n : s n = s}, and relax the constraint of Eq. (50) to A c(τt ) = 0.

(52)

This give the following optimization problem min A,B

T N  2 λ  1  n A 2F + B 2F yt − bt , A xtn  + T N t=1 n=1 2

(53)

A c(τt ) = 0, t ∈ {1, . . . , T }. We tackle Problem (53) with alternate minimization. Let yt = [yt1 , . . . , ytN ] be the vector formed by the outputs of task t, and let X t = [(xt1 ) , . . . , (xtn ) ] be the data matrix for task t. When we regard A as fixed and solve w.r.t. B, Problem (53) can be reformulated as ⎡ ⎤ ⎡  y1 X1 A  ⎢ .. ⎥ ⎢ .. min ⎣ . ⎦ − ⎣ . B   yT 0

4 If

⎡ ⎤2 ⎤ ⎡ ⎤2  b1  0 ··· 0 b1     .. .. .. ⎥ ⎢ .. ⎥ + λ ⎢ .. ⎥ ,    ⎣ ⎦ ⎣ ⎦ ⎦ .  . . .  .    bT bT  · · · 0 XT A

(54)

r < min(d, T ), Problem (49) is equivalent to trace norm regularization plus a rank constraint.

182

L. Oneto and S. Chiappa

which can be easily solved. In particular, notice that the problem decouples across tasks, and each task specific problem amounts to running ridge regression on the data transformed by the representation matrix A . When instead B is fixed and we solve w.r.t. A, Problem (53) can be reformulated as ⎡ ⎤2 ⎡ ⎤ ⎡ ⎤ ⎡ ⎤2  a1   y1 b1,1 X 1 · · · b1,r X 1 a1      ⎢ . ⎥ ⎢ .. ⎥ ⎢  ⎥ ⎢ ⎥ . . .. min ⎣ . ⎦ − ⎣ ⎦ ⎣ .. ⎦ + λ ⎣ .. ⎦ A      ar   yT bt,1 X T · · · bt,r X T ar  ⎡ T⎤ a1  ⎢ .. ⎥  s.t. ⎣ . ⎦ ◦ c1 , . . . , cT = 0, arT

(55)

where we used the shorthand notation ct = c(τt ), and where ◦ is the Kronecker product for partitioned tensors (or Tracy-Singh product). Consequently, by alternating minimization we can solve problem. Notice also that we may relax T the original A c(τt ) 2 ≤ , where  is some tolerance paramthe equality constraint as T1 t=1 eter. In fact, this may be required when the vectors c(τt ) span the entire input space. In this case, we may also add a soft constraint in the regularizer. Notice that, if independence is satisfied at the representation level, i.e. if Eq. (50) holds true, then every model built from such a representation will satisfies independence at the output level. Likewise, if the representation satisfies the convex relaxation (Eq. 52), then it also holds that wt , c(τt ) = bt , A c(τt ) = 0, i.e. the task weight vectors satisfy the first order moment approximation. More importantly, as we show below, if the tasks are randomly observed, then independence (or its relaxation) will also be satisfied on future tasks with high probability. In this sense the method can be interpreted as learning a fair transferable representation. Learning Bound In this section we study the learning ability of the method proposed above. We consider the setting of learning-to-learn [13], in which the training tasks (and their corresponding datasets) used to find a fair data representation are regarded as random variables from a meta-distribution. The learned representation matrix A is then transferred to a novel task, by applying ridge regression on the task dataset, in which X is transformed as A X . In [125] a learning bound is presented, linking the average risk of the method over tasks from the meta-distribution (the so-called transfer risk) to the multi-task empirical error on the training tasks. This result quantifies the good performance of the representation learning method when the number of tasks grow and the data distribution on the raw input data is intrinsically high dimensional (hence learning is difficult without representation learning). We extend this analysis to the setting of algorithmic fairness, in which the performance of the algorithm is evaluated both relative to risk and the fairness constraint. We show that both quantities can be bounded by their empirical counterparts evaluated on the training tasks.

Fairness in Machine Learning

183

N Let Eμ (w) = E(X,Y )∼μ [(Y − w, X )2 ] and Eτ (w) = N1 n=1 (y n − w, x n )2 . N For every matrix A ∈ Rd×r and for every data sample τ = (x n , y n )n=1 , let b A (τ ) be the minimizer of ridge regression with modified data representation, i.e. b A (τ ) = N arg minb∈Rr N1 n=1 (y n − b, A x n )2 + λ b 2 . Theorem 2 Let A be the representation learned by solving Problem (49) and renormalized so that A F = 1. Let tasks μ1 , . . . , μT be independently sampled from a meta-distribution ρ, and let z t be sampled from μt for t ∈ {1, . . . , T }. Assume that the input marginal distribution of random tasks from ρ is supported on the unit sphere and that the outputs are in the interval [−1, 1], almost surely. Let r = min(d, T ). Then, for any δ ∈ (0, 1] it holds with probability at least 1 − δ in the drawing of the datasets τ1 , . . . , τT , that T  1   Rτ (w A (τt )) Eμ∼ρ Eτ ∼μ Rμ w A (τ ) − T t=1 t % % % % 8N T ˆ ˆ ln 2 ln 4δ 4 C ∞ 24 14 ln(N T ) C ∞ δ ≤ + + + , (56) λ N λN T λ T T

and % T 8r 2 ˆ ∞ ln  ln  1 δ +6 Ac(τt ) 2 ≤ 96 Eμ∼ρ Eτ ∼μ Ac(τ ) 2 − T t=1 T T

8r 2 δ

.

(57)

The proof is reported in [140]. Notice that the first bound in Theorem 2 improves Theorem 2 in [125]. The improvement is due to the introduction of the empirical total covariance in the second term √ in the right-hand side of the inequality. The result in [125] instead contains the term 1/T , which can be considerably larger when the raw input is distributed on a high dimensional manifold. The bounds in Theorem 2 can be extended to hold with variable sample size per task. In order to simplify the presentation, we assumed that all datasets are composed of the same number of datapoints N . The general setting can be addressed by letting the sample size be a random variable and introducing the slightly different definition of the transfer risk in which we also take the expectation w.r.t. the sample size. The hyperparameter λ is regarded as fixed in the analysis. In practice it is chosen by cross-validation. The bound on fairness measure contains two terms √ in the right-hand side, in the spirit of Bernstein’s inequality. The slow term O(1/ T ) contains the spectral norm of the covariance of difference of means across the sensitive groups. Notice that  ∞ ≤ 1, but it can be much smaller when the means are close to each other, i.e. when the original representation is already approximately fair.

184

L. Oneto and S. Chiappa

3.4 If the Explicit Use of Sensitive Attributes is Forbidden Due to legal restrictions, developing methods that work without explicit use of sensitive attributes is a central problem in ML fairness. The criterion Fairness through Unawareness was introduced to capture the legal requirement of not explicitly using sensitive attributes to form decisions. This criterion states that a model output Yˆ is fair as long as it does not make explicit use of the sensitive attribute S. However, from a modeling perspective, not explicitly using S can result in a less accurate model, without necessarily improving the fairness of the solution [46, 148, 183]. This could be the case if some variables used to form Yˆ depend on S. CBNs give us the instruments to understand that there might be even a more subtle issue with this fairness criterion, as explained in the following example introduced in [106] and discussed in [26]. Consider the CBN on the left representing the data-generation mechanism underlying a music degree scenario, where S corresponds to gender, M to music aptitude (unobserved, i.e. M ∈ / D), X to the score obtained from an ability test taken at the X Y beginning of the degree, and Y to the score obtained from an ability test taken at the end of the degree. Individuals with higher music aptitude M are more likely to obtain higher initial and final scores (M → X , M → Y ). Due to discrimination occurring at the initial testing, women are assigned a lower initial score than men for the same aptitude level (S → X ). The only path from S to Y , S → X ← M → Y , is closed as X is a collider on this path. Therefore the unfair influence of S on X does not reach Y (Y S). Nevertheless, as Y  S|X , a prediction Yˆ based on the initial score X only would contain the unfair influence of S on X . For example, assume the following linear model S

|=

|=

γ

α

β

M

Y = γ M, X = αS + β M,

with E p(S) [S 2 ] = 1, E p(M) [M 2 ] = 1.

(58)

|=

A linear predictor of the form Yˆ = θ X X minimizing E p(S) p(M) [(Y − Yˆ )2 ] would have parameters θ X = γβ/(α 2 + β 2 ), giving Yˆ = γβ(αS + β M)/(α 2 + β 2 ), i.e. Yˆ  S. Therefore, this predictor would be using the sensitive attribute to form a decision, although implicitly rather than explicitly. Instead, a predictor explicitly using the sensitive attribute, Yˆ = θ X X + θ S S, would have parameters 

θX θS



 =

α2 + β 2 α α 1

−1 

γβ 0



 =

γ /β −αγ /β

 ,

(59)

|=

i.e. Yˆ = γ M. Therefore, this predictor would be fair. In general (e.g. in a non-linear setting) it is not guaranteed that using S would ensure Yˆ S. Nevertheless, this example shows how explicit use of S in a model can ensure fairness rather than

Fairness in Machine Learning

185

leading to unfairness. In summary, not being able to explicit use sensitive attributes to build a model might be problematic for several reasons. Less strict legal requirements prohibit the explicit use of S when deploying the model, but permit it during its training (see [46] and references therein). This case can be dealt with using different approaches. A simple approach (used e.g. in [84]) would be to use explicitly the sensitive attribute only in the constraint term used to enforce fairness. This, however, might be suboptimal for similar reasons to not using the sensitive attribute at all. In this section, we describe a different approach introduced in [139]. In this approach, a function g : X → S that forms a prediction, Sˆ = g(X ), of the sensiˆ instead of S, is used to learn group tive attribute S from X is learned. Then S, specific models via a multi-task learning approach (MTL). As shown in [139], if the prediction Sˆ is accurate, this approach allows to exploit MTL to learn group specific models. Instead, if Sˆ is inaccurate, this approach acts as a randomization procedure which improves the fairness measure of the overall model. We focus on binary outcome and on categorical sensitive attribute, i.e. Y = {−1, +1} and S = {1, . . . , k}. Using the operator  ∈ {−, +}, we denote with D,s , the subset of N,s individuals belonging to class  and with sensitive attribute s. We consider the case in which the underlying space of models is a RKHS, leading to the functional form f (S, X ) = w · φ(S, X ), (S, X ) ∈ S × X, (60) where “·” is the inner product between two vectors in a Hilbert space.5 We can then learn the parameter vector w by w 2 -regularized empirical risk minimization. The average accuracy with respect to each group of a model L( f ), together with ˆ f ), are defined respectively as its empirical counterparts L( L( f ) =

1 L s ( f ), k

L s ( f ) = E [( f (S = s, X ), Y )] , s ∈ S,

(61)

s∈S

and  ˆ f) = 1 Lˆ s ( f ), L( k s∈S

1 Lˆ s ( f ) = Ns



( f (s n , x n ), y n ), s ∈ S, (62)

(x n ,s n ,y n )∈Ds

where  : R × Y → R is the error loss function. In the following, we first briefly discuss the MTL approach, and then explain how it can be enhanced with fairness constraints.

all intents and purposes, one may also assume throughout that H = Rd , the standard ddimensional vector space, for some positive integer d.

5 For

186

L. Oneto and S. Chiappa

Multi-Task Learning. We use a multi-task learning approach based on regularization around a common mean [49]. We choose φ(S, X ) = (0s−1 , ϕ(X ), 0k−s , ϕ(X )), so that f (S, X ) = w0 · ϕ(X ) + v S · ϕ(X ) for w0 , v S ∈ H. MTL jointly learns a shared model w0 and task specific models ws = w0 + vs ∀s ∈ S by encouraging them to be close to each other. This is achieved with the following Tikhonov regularization problem ⎡ ⎤ k k   1 1 ˆ 0 ) + (1 − θ ) min θ L(w ws − w0 2 ⎦ , Lˆ s (ws ) + ρ ⎣λ w0 2 + (1 − λ) w0 , · · · , w S k k s=1

s=1

(63) where the parameter λ ∈ [0, 1] encourages closeness between the shared and specific models, and the parameter θ ∈ [0, 1] captures the relative importance of the loss of the shared model and the group-specific models. The MTL problem is convex provided that the loss function used to measure the empirical errors Lˆ and Lˆ s in (63) are convex. Adding Fairness Constraints. As fairness criterion, we consider EFPRs/EFNRs (Equalized Odds). Using the notation  ∈ {−, +}, this criterion can be expressed as P{ f (S, X ) > 0|S = 1, Y = 1} = · · · = P{ f (S, X ) > 0|S = k, Y = 1}. (64) In many recent papers [1, 3, 6, 14–16, 24, 44, 46, 52, 69, 90–92, 96, 97, 128, 149, 153, 177, 183, 184, 186, 188] it has been shown how to enforce EFPRs/EFNRs during model training. Here we introduce the approach proposed in [44] since it is convex, theoretically grounded, and showed to perform favorably against state-ofthe-art alternatives. To this end, we first observe that P{ f (S, X ) > 0 | S = s, Y = 1} = 1 − E[h ( f (S = s, X ), Y = 1)] = 1 − L s ( f ),

(65) where h ( f (S, X ), Y ) = 1 f (S,X )Y ≤0 . Then, by substituting Eq. (65) in Eq. (64), replacing the deterministic quantities with their empirical counterpart, and by approximating the hard loss function h with the linear one l = (1 − f (S, X )Y )/2 we obtain the convex constraint 1 N,1

 (s n ,x n ,y n )∈D,1

f (s n , x n ) = · · · =

1 N,k



f (s n , x n ).

(66)

(s n ,x n ,y n )∈D,k

Enforcing this constraint can achieved by adding to the MTL the (k − 1) constraints w1 · u 1 = w2 · u 2 ∧ · · · ∧ w1 · u 1 = wk · u k ,

(67)

Fairness in Machine Learning

187

 where u s = N1,s (s n ,x n )∈D,s ϕ(s n , x n ). Thanks to the Representer Theorem, as shown in [44], it is straightforward to derive a kernelized version of the MTL convex problem which can be solved with any solver.

4 Conclusions In this manuscript, we have discussed an emerging area of machine learning that studies the development of techniques for ensuring that models do not treat individuals unfairly due to biases in the data and model inaccuracies. Rather than an exhaustive descriptions of existing fairness criteria and approaches to impose fairness in a model, we focused on highlighting a few important points and research areas that we believe should get more attention in the literature, and described some of the work that we have done in these areas. In particular, we have demonstrated that CBNs provide us with a precious tool to reason about and deal with fairness. A common criticism to CBNs is that the true underlying data-generation mechanism, and therefore the graph structure, is seldom known. However, our discussion has demonstrated that this framework can nevertheless be helpful for reasoning at a high level about fairness and to avoid pitfalls. As unfairness in a dataset often displays complex patterns, we need a way to characterize and account for this complexity: CBNs represents the best currently available tool for achieving that. Furthermore, imposing criteria such as path-specific fairness, whilst difficult in practice, is needed to address the arguably most common unfairness scenarios present in the real world. We have described an optimal transport approach to fairness that enables us to account for the full shapes of distributions corresponding to different sensitive attributes, overcoming the limitations of most current methods that approximate fairness desiderata by considering lower order moments, typically the first moments, of those distributions. In modern contexts, models are often not learned from scratch for solving new tasks, since datasets are too complex or small in cardinality. To ensure that fairness properties generalize to multiple tasks, it is necessary to consider the learning problem in a multitask/lifelong learning framework. We described a method to learn fair representations that can generalize to unseen task. Finally, we have discussed legal restrictions with the use of sensitive attributes, and introduced an in-processing approach that does not require the use of sensitive attributes during the deployment of the model. A limitation of this manuscript is that it does not discuss the important aspect that decisions have consequences on the future of individuals, and therefore fairness should be considered also in a temporal, rather than just static, setting. Acknowledgements This work was partially supported by Amazon AWS Machine Learning Research Award.


References 1. Adebayo, J., Kagal, L.: Iterative orthogonal feature projection for diagnosing bias in black-box models. In: Fairness, Accountability, and Transparency in Machine Learning (2016) 2. Adler, P., Falk, C., Friedler, S.A., Nix, T., Rybeck, G., Scheidegger, C., Smith, B., Venkatasubramanian, S.: Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54(1), 95–122 (2018) 3. Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., Wallach, H.: A reductions approach to fair classification. In: Proceedings of the 35th International Conference on Machine Learning, pp. 60–69 (2018) 4. AI Now Institute: Litigating algorithms: challenging government use of algorithmic decision systems (2016). https://ainowinstitute.org/litigatingalgorithms.pdf 5. Alabi, D., Immorlica, N., Kalai, A.T.: Unleashing linear optimizers for group-fair learning and optimization. In: 31st Annual Conference on Learning Theory, pp. 2043–2066 (2018) 6. Alabi, D., Immorlica, N., Kalai, A.T.: When optimizing nonlinear objectives is no harder than linear objectives (2018). CoRR arXiv:1804.04503 7. Ali, J., Zafar, M.B., Singla, A., Gummadi, K.P.: Loss-aversively fair classification. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 211–218 (2019) 8. Amrieh, E.A., Hamtini, T., Aljarah, I.: Students’ academic performance data set (2015). https://www.kaggle.com/aljarah/xAPI-Edu-Data 9. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the hypothesis space for improving the generalization ability of support vector machines. In: IEEE International Joint Conference on Neural Networks (2011) 10. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks (2016). https://www. propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing 11. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach. Learn. 73(3), 243–272 (2008) 12. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002) 13. Baxter, J.: A model of inductive bias learning. J. Artif. Intell. Res. 12, 149–198 (2000) 14. Bechavod, Y., Ligett, K.: Penalizing unfairness in binary classification (2018). CoRR arXiv:1707.00044 15. Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., Neel, S., Roth, A.: A convex framework for fair regression. In: Fairness, Accountability, and Transparency in Machine Learning (2017) 16. Beutel, A., Chen, J., Zhao, Z., Chi, E.H.: Data decisions and theoretical implications when adversarially learning fair representations (2017). CoRR arXiv:1707.00075 17. Bogen, M., Rieke, A.: Help wanted: an examination of hiring algorithms, equity, and bias. Technical report, Upturn (2018) 18. Borwein, J., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer (2010) 19. Bureau of Labor Statistics: National longitudinal surveys of youth data set (2019). https:// www.bls.gov/nls/ 20. Byanjankar, A., Heikkilä, M., Mezei, J.: Predicting credit risk in peer-to-peer lending: a neural network approach. In: IEEE Symposium Series on Computational Intelligence (2015) 21. Calders, T., Kamiran, F., Pechenizkiy, M.: Building classifiers with independency constraints. In: IEEE International Conference on Data Mining Workshops. ICDMW 2009, pp. 13–18 (2009) 22. 
Calders, T., Karim, A., Kamiran, F., Ali, W., Zhang, X.: Controlling attribute effect in linear regression. In: IEEE International Conference on Data Mining (2013) 23. Calders, T., Verwer, S.: Three naive bayes approaches for discrimination-free classification. Data Min. Knowl. Discov. 21(2), 277–292 (2010)


24. Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K.N., Varshney, K.R.: Optimized preprocessing for discrimination prevention. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 3995–4004 (2017) 25. Chiappa, S.: Path-specific counterfactual fairness. In: Thirty-Third AAAI Conference on Artificial Intelligence, pp. 7801–7808 (2019) 26. Chiappa, S., Isaac, W.S.: A causal Bayesian networks viewpoint on fairness. In: Kosta, E., Pierson, J., Slamanig, D., Fischer-Hübner, S., Krenn, S. (eds.) Privacy and Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data. Privacy and Identity 2018. IFIP Advances in Information and Communication Technology, vol. 547. Springer, Cham (2019) 27. Chiappa, S., Jiang, R., Stepleton, T., Pacchiano, A., Jiang, H., Aslanides, J.: A general approach to fairness with optimal transport. In: Thirty-Fourth AAAI Conference on Artificial Intelligence (2020) 28. Chierichetti, F., Kumar, R., Lattanzi, S., Vassilvitskii, S.: Fair clustering through fairlets. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 5036– 5044 (2017) 29. Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), 153–163 (2017) 30. Chouldechova, A., Putnam-Hornstein, E., Benavides-Prado, D., Fialko, O., Vaithianathan, R.: A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pp. 134–148 (2018) 31. Chzhen, E., Hebiri, H., Denis, C., Oneto, L., Pontil, M.: Leveraging labeled and unlabeled data for consistent fair binary classification. In: Proceedings of the 33rd Conference on Neural Information Processing Systems, pp. 12739–12750 (2019) 32. Ciliberto, C., Stamos, D., Pontil, M.: Reexamining low rank matrix factorization for trace norm regularization (2017). CoRR arXiv:1706.08934 33. Coraddu, A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean. Eng. 130, 351–370 (2017) 34. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A.: Algorithmic decision making and the cost of fairness. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806 (2017) 35. Cortez, P.: Student performance data set (2014). https://archive.ics.uci.edu/ml/datasets/ Student+Performance 36. Cortez, P.: Wine quality data set (2009). https://archive.ics.uci.edu/ml/datasets/Wine+Quality 37. Cotter, A., Gupta, M., Jiang, H., Srebro, N., Sridharan, K., Wang, S., Woodworth, B., You, S.: Training well-generalizing classifiers for fairness metrics and other data-dependent constraints (2018). CoRR arXiv:1807.00028 38. Cotter, A., Jiang, H., Sridharan, K.: Two-player games for efficient non-convex constrained optimization. In: Algorithmic Learning Theory (2019) 39. Dawid, P.: Fundamentals of statistical causality. Technical report (2007) 40. 
De Fauw, J., Ledsam, J.R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., Askham, H., Glorot, X., O’Donoghue, B., Visentin, D., Van Den Driessche, G., Lakshminarayanan, B., Meyer, C., Mackinder, F., Bouton, S., Ayoub, K., Chopra, R., King, D., Karthikesalingam, A., Hughes, C.O., Raine, R., Hughes, J., Sim, D.A., Egan, C., Tufail, A., Montgomery, H., Hassabis, D., Rees, G., Back, T., Khaw, P.T., Suleyman, M., Cornebise, J., Keane, P.A., Ronneberger, O.: Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24(9), 1342–1350 (2018) 41. Dieterich, W., Mendoza, C., Brennan, T.: COMPAS risk scales: demonstrating accuracy equity and predictive parity (2016) 42. Doherty, N.A., Kartasheva, A.V., Phillips, R.D.: Information effect of entry into credit ratings market: the case of insurers’ ratings. J. Financ. Econ. 106(2), 308–330 (2012) 43. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st International Conference on Machine Learning, pp. 647–655 (2014)


44. Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J.S., Pontil, M.: Empirical risk minimization under fairness constraints. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, pp. 2791–2801 (2018) 45. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness. In: Innovations in Theoretical Computer Science Conference (2012) 46. Dwork, C., Immorlica, N., Kalai, A.T., Leiserson, M.D.M.: Decoupled classifiers for groupfair and efficient machine learning. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pp. 119–133 (2018) 47. Edwards, H., Storkey, A.: Censoring representations with an adversary. In: 4th International Conference on Learning Representations (2015) 48. Eubanks, V.: Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press (2018) 49. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117 (2004) 50. Fehrman, E., Egan, V., Mirkes, E.M.: Drug consumption data set (2016). https://archive.ics. uci.edu/ml/datasets/Drug+consumption+%28quantified%29 51. Feldman, M.: Computational fairness: preventing machine-learned discrimination (2015). https://scholarship.tricolib.brynmawr.edu/handle/10066/17628 52. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268 (2015) 53. Fish, B., Kun, J., Lelkes, A.: Fair boosting: a case study. In: Fairness, Accountability, and Transparency in Machine Learning (2015) 54. Fish, B., Kun, J., Lelkes, A.D.: A confidence-based approach for balancing fairness and accuracy. In: SIAM International Conference on Data Mining, pp. 144–152 (2016) 55. Fitzsimons, J., Ali, A.A., Osborne, M., Roberts, S.: Equality constrained decision trees: for the algorithmic enforcement of group fairness (2018). CoRR arXiv:1810.05041 56. Fukuchi, K., Kamishima, T., Sakuma, J.: Prediction with model-based neutrality. IEICE Trans. Inf. Syst. 98(8), 1503–1516 (2015) 57. Gajane, P., Pechenizkiy, M.: On formalizing fairness in prediction with machine learning (2017). CoRR arXiv:1710.03184 58. Gillen, S., Jung, C., Kearns, M., Roth, A.: Online learning with an unknown fairness metric. In: Proceedings of the 32nd Neural Information Processing Systems, pp. 2600–2609 (2018) 59. Goh, G., Cotter, A., Gupta, M., Friedlander, M.P.: Satisfying real-world goals with dataset constraints. In: Proceedings of the 30th Conference on Neural Information Processing Systems, pp. 2415–2423 (2016) 60. Goldstein, H.: School effectiveness data set (1987). http://www.bristol.ac.uk/cmm/learning/ support/datasets/ 61. Gordaliza, P., Del Barrio, E., Fabrice, G., Jean-Michel, L.: Obtaining fairness using optimal transport theory. In: Proceedings of the 36th International Conference on International Conference on Machine Learning, pp. 2357–2365 (2019) 62. Grgi´c-Hlaˇca, N., Zafar, M.B., Gummadi, K.P., Weller, A.: On fairness, diversity and randomness in algorithmic decision making (2017). CoRR arXiv:1706.10208 63. Guvenir, H.A., Acar, B., Muderrisoglu, H.: Arrhythmia data set (1998). https://archive.ics. uci.edu/ml/datasets/Arrhythmia 64. Hajian, S., Domingo-Ferrer, J.: A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans. Knowl. Data Eng. 
25(7), 1445–1459 (2012) 65. Hajian, S., Domingo-Ferrer, J., Farràs, O.: Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Min. Knowl. Discov. 28(5–6), 1158–1188 (2014) 66. Hajian, S., Domingo-Ferrer, J., Martinez-Balleste, A.: Rule protection for indirect discrimination prevention in data mining. In: International Conference on Modeling Decisions for Artificial Intelligence (2011)


67. Hajian, S., Domingo-Ferrer, J., Monreale, A., Pedreschi, D., Giannotti, F.: Discrimination-and privacy-aware patterns. Data Min. Knowl. Discov. 29(6), 1733–1782 (2015) 68. Hajian, S., Monreale, A., Pedreschi, D., Domingo-Ferrer, J., Giannotti, F.: Injecting discrimination and privacy awareness into pattern discovery. In: IEEE International Conference on Data Mining Workshops (2012) 69. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Proceedings of the 30th Conference on Neural Information Processing Systems, pp. 3315–3323 (2016) 70. Harper, F.M., Konstan, J.A.: Movielens data set (2016). https://grouplens.org/datasets/ movielens/ 71. Hashimoto, T.B., Srivastava, M., Namkoong, H., Liang, P.: Fairness without demographics in repeated loss minimization. In: Proceedings of the 35th International Conference on on Machine Learning, pp. 1929–1938 (2018) 72. He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., Candela, J.Q.: Practical lessons from predicting clicks on ads at facebook. In: International Workshop on Data Mining for Online Advertising (2014) 73. Hébert-Johnson, U., Kim, M.P., Reingold, O., Rothblum, G.N.: Calibration for the (computationally-identifiable) masses (2017). CoRR arXiv:1711.08513 74. Heidari, H., Ferrari, C., Gummadi, K., Krause, A.: Fairness behind a veil of ignorance: a welfare analysis for automated decision making. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, pp. 1273–1283 (2018) 75. Heidari, H., Loi, M., Gummadi, K.P., Krause, A.: A moral framework for understanding of fair ml through economic models of equality of opportunity (2018). CoRR arXiv:1809.03400 76. Heritage Provider Network: Heritage health data set (2011). https://www.kaggle.com/c/hhp/ data 77. Hoffman, M., Kahn, L.B., Li, D.: Discretion in hiring. Q. J. Econ. 133(2), 765–800 (2018) 78. Hofmann, H.: Statlog (German Credit) data set (1994). https://archive.ics.uci.edu/ml/datasets/ statlog+(german+credit+data) 79. Hu, L., Chen, Y.: Fair classification and social welfare (2019). CoRR arXiv:1905.00147 80. Hussain, S., Dahan, N.A., Ba-Alwib, F.M., Ribata, N.: Student academics performance data set (2018). https://archive.ics.uci.edu/ml/datasets/Student+Academics+Performance 81. Isaac, W.S.: Hope, hype, and fear: the promise and potential pitfalls of artificial intelligence in criminal justice. Ohio State J. Crim. Law 15(2), 543–558 (2017) 82. Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., Roth, A.: Fairness in reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1617–1626 (2017) 83. Janosi, A., Steinbrunn, W., Pfisterer, M., Detrano, R.: Heart disease data set (1988). https:// archive.ics.uci.edu/ml/datasets/Heart+Disease 84. Jiang, R., Pacchiano, A., Stepleton, T., Jiang, H., Chiappa, S.: Wasserstein fair classification. In: Thirty-Fifth Uncertainty in Artificial Intelligence Conference (2019) 85. Johansson, F., Shalit, U., Sontag, D.: Learning representations for counterfactual inference. In: Proceedings of The 33rd International Conference on Machine Learning, pp. 3020–3029 (2016) 86. Johndrow, J.E., Lum, K.: An algorithm for removing sensitive information: application to race-independent recidivism prediction. Ann. Appl. Stat. 13(1), 189–220 (2019) 87. Johnson, K.D., Foster, D.P., Stine, R.A.: Impartial predictive modeling: ensuring fairness in arbitrary models (2016). CoRR arXiv:1608.00528 88. 
Joseph, M., Kearns, M., Morgenstern, J., Neel, S., Roth, A.: Rawlsian fairness for machine learning. In: Fairness, Accountability, and Transparency in Machine Learning (2016) 89. Joseph, M., Kearns, M., Morgenstern, J.H., Roth, A.: Fairness in learning: classic and contextual bandits. In: Proceedings of the 30th Conference on Neural Information Processing Systems, pp. 325–333 (2016) 90. Kamiran, F., Calders, T.: Classifying without discriminating. In: International Conference on Computer, Control and Communication (2009)


91. Kamiran, F., Calders, T.: Classification with no discrimination by preferential sampling. In: The Annual Machine Learning Conference of Belgium and The Netherlands (2010) 92. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012) 93. Kamiran, F., Karim, A., Zhang, X.: Decision theory for discrimination-aware classification. In: IEEE International Conference on Data Mining (2012) 94. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2012) 95. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: The independence of fairness-aware classifiers. In: IEEE International Conference on Data Mining Workshops (2013) 96. Kamishima, T., Akaho, S., Sakuma, J.: Fairness-aware learning through regularization approach. In: International Conference on Data Mining Workshops (2011) 97. Kearns, M., Neel, S., Roth, A., Wu, Z.S.: Preventing fairness gerrymandering: auditing and learning for subgroup fairness. In: Proceedings of the 35th International Conference on Machine Learning, pp. 2564–2572 (2018) 98. Kilbertus, N., Carulla, M.R., Parascandolo, G., Hardt, M., Janzing, D., Schölkopf, B.: Avoiding discrimination through causal reasoning. In: Proceedings of the 31th Conference on Neural Information Processing Systems, pp. 656–666 (2017) 99. Kim, M., Reingold, O., Rothblum, G.: Fairness through computationally-bounded awareness. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, pp. 4842– 4852 (2018) 100. Kim, M.P., Ghorbani, A., Zou, J.: Multiaccuracy: Black-box post-processing for fairness in classification. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247–254 (2019) 101. Koepke, J.L., Robinson, D.G.: Danger ahead: risk assessment and the future of bail reform. Wash. Law Rev. 93, 1725–1807 (2017) 102. Kohavi, R., Becker, B.: Census income data set (1996). https://archive.ics.uci.edu/ml/datasets/ census+income 103. Komiyama, J., Shimao, H.: Two-stage algorithm for fairness-aware machine learning (2017). CoRR arXiv:1710.04924 104. Komiyama, J., Takeda, A., Honda, J., Shimao, H.: Nonconvex optimization for regression with fairness constraints. In: Proceedings of the 35th International Conference on Machine Learning, pp. 2737–2746 (2018) 105. Kourou, K., Exarchos, T.P., Exarchos, K.P., Karamouzis, M.V., Fotiadis, D.I.: Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 13, 8–17 (2015) 106. Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 4069–4079 (2017) 107. Lan, C., Huan, J.: Discriminatory transfer (2017). CoRR arXiv:1707.00780 108. Larson, J., Mattu, S., Kirchner, L., Angwin, J.: Propublica COMPAS risk assessment data set (2016). https://github.com/propublica/compas-analysis 109. Lim, T.S.: Contraceptive method choice data set (1997). https://archive.ics.uci.edu/ml/ datasets/Contraceptive+Method+Choice 110. Lisini, S.: Characterization of absolutely continuous curves in Wasserstein spaces. Calc. Var. Part. Differ. Equ. 28(1), 85–120 (2007) 111. Liu, Z., Luo, P., Wang, X., Tang, X.: CelebA data set (2015). http://mmlab.ie.cuhk.edu.hk/ projects/CelebA.html 112. Louizos, C., Swersky, K., Li, Y., Welling, M., Zemel, R.: The variational fair autoencoder. 
In: 4th International Conference on Learning Representations (2016) 113. Lum, K., Isaac, W.S.: To predict and serve? Significance 13(5), 14–19 (2016) 114. Lum, K., Johndrow, J.: A statistical framework for fair predictive algorithms (2016). CoRR arXiv:1610.08077


115. Luo, L., Liu, W., Koprinska, I., Chen, F.: Discrimination-aware association rule mining for unbiased data analytics. In: International Conference on Big Data Analytics and Knowledge Discovery, pp. 108–120. Springer (2015) 116. Luong, B.T., Ruggieri, S., Turini, F.: k-nn as an implementation of situation testing for discrimination discovery and prevention. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011) 117. Ma, D.S., Correll, J., Wittenbrink, B.: Chicago face data set (2015). https://chicagofaces.org/ default/ 118. Madras, D., Creager, E., Pitassi, T., Zemel, R.: Learning adversarially fair and transferable representations (2018). CoRR arXiv:1802.06309 119. Madras, D., Pitassi, T., Zemel, R.: Predict responsibly: improving fairness and accuracy by learning to defer. In: Proceedings of the 32nd Conference on Neural Information Processing Systems, pp. 6147–6157 (2018) 120. Malekipirbazari, M., Aksakalli, V.: Risk assessment in social lending via random forests. Expert. Syst. Appl. 42(10), 4621–4631 (2015) 121. Mancuhan, K., Clifton, C.: Discriminatory decision policy aware classification. In: IEEE International Conference on Data Mining Workshops (2012) 122. Mancuhan, K., Clifton, C.: Combating discrimination using Bayesian networks. Artif. Intell. Law 22(2), 211–238 (2014) 123. Mary, J., Calauzenes, C., El Karoui, N.: Fairness-aware learning for continuous attributes and treatments. In: Proceedings of the 36th International Conference on Machine Learning, pp. 4382–4391 (2019) 124. Maurer, A.: A note on the PAC Bayesian theorem (2004). CoRR arXiv:0411099 [cs.LG] 125. Maurer, A.: Transfer bounds for linear feature learning. Mach. Learn. 75(3), 327–350 (2009) 126. McNamara, D., Ong, C.S., Williamson, B.: Costs and benefits of fair representation learning. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics and Society, pp. 263–270 (2019) 127. McNamara, D., Ong, C.S., Williamson, R.C.: Provably fair representations (2017). CoRR arXiv:1710.04394 128. Menon, A.K., Williamson, R.C.: The cost of fairness in binary classification. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pp. 107–118 (2018) 129. Merler, M., Ratha, N., Feris, R.S., Smith, J.R.: Diversity in faces data set (2019). https:// research.ibm.com/artificial-intelligence/trusted-ai/diversity-in-faces/#highlights 130. Mitchell, S., Potash, E., Barocas, S.: Prediction-based decisions and fairness: a catalogue of choices, assumptions, and definitions (2018). CoRR arXiv:1811.07867 131. Monge, G.: Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris (1781) 132. Moro, S., Cortez, P., Rita, P.: Bank marketing data set (2014). https://archive.ics.uci.edu/ml/ datasets/bank+marketing 133. Nabi, R., Malinsky, D., Shpitser, I.: Learning optimal fair policies. In: Proceedings of the 36th International Conference on Machine Learning, pp. 4674–4682 (2019) 134. Nabi, R., Shpitser, I.: Fair inference on outcomes. In: Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1931–1940 (2018) 135. Narasimhan, H.: Learning with complex loss functions and constraints. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 1646–1654 (2018) 136. New York Police Department: Stop, Question and frisk data set (2012). https://www1.nyc. gov/site/nypd/stats/reports-analysis/stopfrisk.page 137. 
Noriega-Campero, A., Bakker, M.A., Garcia-Bulle, B., Pentland, A.: Active fairness in algorithmic decision making. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 77–83 (2019) 138. Olfat, M., Aswani, A.: Spectral algorithms for computing fair support vector machines (2017). CoRR arXiv:1710.05895


139. Oneto, L., Donini, M., Elders, A., Pontil, M.: Taking advantage of multitask learning for fair classification. In: AAAI/ACM Conference on AI, Ethics, and Society (2019) 140. Oneto, L., Donini, M., Maurer, A., Pontil, M.: Learning fair and transferable representations (2019). CoRR arXiv:1906.10673 141. Oneto, L., Donini, M., Pontil, M.: General fair empirical risk minimization (2019). CoRR arXiv:1901.10080 142. Oneto, L., Ridella, S., Anguita, D.: Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015) 143. Oneto, L., Siri, A., Luria, G., Anguita, D.: Dropout prediction at university of Genoa: a privacy preserving data driven approach. In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2017) 144. Papamitsiou, Z., Economides, A.A.: Learning analytics and educational data mining in practice: a systematic literature review of empirical evidence. J. Educ. Technol. Soc. 17(4), 49–64 (2014) 145. Pearl, J.: Causality: Models. Springer, Reasoning and Inference (2000) 146. Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley (2016) 147. Pedreschi, D., Ruggieri, S., Turini, F.: Measuring discrimination in socially-sensitive decision records. In: SIAM International Conference on Data Mining (2009) 148. Pedreshi, D., Ruggieri, S., Turini, F.: Discrimination-aware data mining. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008) 149. Pérez-Suay, A., Laparra, V., Mateo-García, G., Muñoz-Marí, J., Gómez-Chova, L., CampsValls, G.: Fair kernel learning. In: Machine Learning and Knowledge Discovery in Databases (2017) 150. Perlich, C., Dalessandro, B., Raeder, T., Stitelman, O., Provost, F.: Machine learning for targeted display advertising: transfer learning in action. Mach. Learn. 95(1), 103–127 (2014) 151. Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: foundations and learning algorithms. MIT Press (2017) 152. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019) 153. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., Weinberger, K.Q.: On fairness and calibration. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 5684– 5693 (2017) 154. Quadrianto, N., Sharmanska, V.: Recycling privileged learning and distribution matching for fairness. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 677–688 (2017) 155. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset shift in machine learning. The MIT Press (2009) 156. Raff, E., Sylvester, J., Mills, S.: Fair forests: regularized tree induction to minimize model bias. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (2018) 157. Redmond, M.: Communities and crime data set (2009). http://archive.ics.uci.edu/ml/datasets/ communities+and+crime 158. Rosenberg, M., Levinson, R.: Trump’s catch-and-detain policy snares many who call the U.S. home (2018). https://www.reuters.com/investigates/special-report/usa-immigration-court 159. Russell, C., Kusner, M.J., Loftus, J., Silva, R.: When worlds collide: integrating different counterfactual assumptions in fairness. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 6414–6423 (2017) 160. Selbst, A.D.: Disparate impact in big data policing. Georg. Law Rev. 52, 109–195 (2017) 161. 
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014) 162. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004) 163. Smola, A.J., Schölkopf, B.: Learning with Kernels. MIT Press (2001) 164. Song, J., Kalluri, P., Grover, A., Zhao, S., Ermon, S.: Learning controllable fair representations (2018). CoRR arXiv:1812.04218


165. Speicher, T., Heidari, H., Grgic-Hlaca, N., Gummadi, K.P., Singla, A., Weller, A., Zafar, M.B.: A unified approach to quantifying algorithmic unfairness: measuring individual & group unfairness via inequality indices. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 166. Spirtes, P., Glymour, C.N., Scheines, R., Heckerman, D., Meek, C., Cooper, G., Richardson, T.: Causation, Prediction, and Search. MIT Press (2000) 167. Srebro, N.: Learning with matrix factorizations (2004) 168. Stevenson, M.T.: Assessing risk assessment in action. Minn. Law Rev. 103 (2017) 169. Strack, B., DeShazo, J.P., Gennings, C., Olmo, J.L., Ventura, S., Cios, K.J., Clore, J.N.: Diabetes 130-US hospitals for years 1999–2008 data set (2014). https://archive.ics.uci.edu/ ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008 170. Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: European Conference on Technology Enhanced Learning (2015) 171. Vaithianathan, R., Maloney, T., Putnam-Hornstein, E., Jiang, N.: Children in the public benefit system at risk of maltreatment: identification via predictive modeling. Am. J. Prev. Med. 45(3), 354–359 (2013) 172. Verma, S., Rubin, J.: Fairness definitions explained. In: IEEE/ACM International Workshop on Software Fairness (2018) 173. Villani, C.: Optimal Transport Old and New. Springer (2009) 174. Wang, Y., Koike-Akino, T., Erdogmus, D.: Invariant representations from adversarially censored autoencoders (2018). CoRR arXiv:1805.08097 175. Wightman, L.F.: Law school admissions (1998). https://www.lsac.org/data-research 176. Williamson, R.C., Menon, A.K.: Fairness risk measures. In: Proceedings of the 36th International Conference on Machine Learning, pp. 6786–6797 (2019) 177. Woodworth, B., Gunasekar, S., Ohannessian, M.I., Srebro, N.: Learning non-discriminatory predictors. In: Computational Learning Theory (2017) 178. Wu, Y., Wu, X.: Using loglinear model for discrimination discovery and prevention. In: IEEE International Conference on Data Science and Advanced Analytics (2016) 179. Yang, K., Stoyanovich, J.: Measuring fairness in ranked outputs. In: International Conference on Scientific and Statistical Database Management (2017) 180. Yao, S., Huang, B.: Beyond parity: Fairness objectives for collaborative filtering. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 2921–2930 (2017) 181. Yeh, I.C., Lien, C.H.: Default of credit card clients data set (2016). https://archive.ics.uci.edu/ ml/datasets/default+of+credit+card+clients 182. Yona, G., Rothblum, G.: Probably approximately metric-fair learning. In: Proceedings of the 35th International Conference on Machine Learning, pp. 5680–5688 (2018) 183. Zafar, M.B., Valera, I., Gomez Rodriguez, M., Gummadi, K.P.: Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: International Conference on World Wide Web (2017) 184. Zafar, M.B., Valera, I., Gomez Rodriguez, M., Gummadi, K.P.: Fairness constraints: mechanisms for fair classification. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 962–970 (2017) 185. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness constraints: a flexible approach for fair classification. J. Mach. Learn. Res. 20(75), 1–42 (2019) 186. 
Zafar, M.B., Valera, I., Rodriguez, M., Gummadi, K., Weller, A.: From parity to preferencebased notions of fairness in classification. In: Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 229–239 (2017) 187. Zehlike, M., Hacker, P., Wiedemann, E.: Matching code and law: achieving algorithmic fairness with optimal transport (2017). arXiv:1712.07924 188. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., Dwork, C.: Learning fair representations. In: Proceedings of the 30th International Conference on Machine Learning, pp. 325–333 (2013)


189. Zhang, B.H., Lemoine, B., Mitchell, M.: Mitigating unwanted biases with adversarial learning. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340 (2018) 190. Zhang, L., Wu, Y., Wu, X.: Achieving non-discrimination in data release. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017) 191. Zhang, L., Wu, Y., Wu, X.: A causal framework for discovering and removing direct and indirect discrimination. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 3929–3935 (2017) 192. Zhang, L., Wu, Y., Wu, X.: Achieving non-discrimination in prediction. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 3097–3103 (2018) 193. Zliobaite, I., Kamiran, F., Calders, T.: Handling conditional discrimination. In: IEEE International Conference on Data Mining (2011)

Online Continual Learning on Sequences
German I. Parisi and Vincenzo Lomonaco

Abstract Online continual learning (OCL) refers to the ability of a system to learn over time from a continuous stream of data without having to revisit previously encountered training samples. Learning continually in a single data pass is crucial for agents and robots operating in changing environments and required to acquire, fine-tune, and transfer increasingly complex representations from non-i.i.d. input distributions. Machine learning models that address OCL must alleviate catastrophic forgetting, in which hidden representations are disrupted or completely overwritten when learning from streams of novel input. In this chapter, we summarize and discuss recent deep learning models that address OCL on sequential input through the use (and combination) of synaptic regularization, structural plasticity, and experience replay. Different implementations of replay have been proposed that alleviate catastrophic forgetting in connectionist architectures via the re-occurrence of (latent representations of) input sequences and that functionally resemble mechanisms of hippocampal replay in the mammalian brain. Empirical evidence shows that architectures endowed with experience replay typically outperform architectures without it in (online) incremental learning tasks.

1 Introduction

Real-world data is naturally non-stationary and temporally correlated. Artificial learning systems, agents, and robots should thus be able to learn in a continual fashion from rich streams of non-i.i.d. input. The ability of a system to continually acquire and fine-tune knowledge is known as continual learning (CL; see [6, 62]


for recent reviews). This paradigm is also referred to as lifelong learning (LL) in the literature. Although CL and LL are arguably different (e.g., LL assumes a finite learning phase or a lifetime that ends at a given time, whereas CL is not necessarily subject to this constraint), these terms are mostly used interchangeably. Machine learning models that learn in a continual fashion from non-stationary sequential input are of great interest to both the scientific community and the industry. Firstly, CL should be a key property of large-scale systems so that they keep learning over time after their deployment, with the goal of efficiently fine-tuning and transferring knowledge and skills from continuous (and potentially infinite) streams of novel input. Secondly, the development and evaluation of CL models will help provide valuable insights into how biological mechanisms of learning and memorization work in the brain [1, 8].

Empirical evidence shows that connectionist architectures are prone to catastrophic forgetting, i.e., when learning a new class or task, the overall performance on previously learned classes and tasks may abruptly decrease due to the novel input interfering with, or completely overwriting, existing representations [15, 54]. Because catastrophic forgetting also affects deep learning models, interest in CL models has grown almost exponentially in the machine learning community. To prevent, or at least alleviate, catastrophic forgetting in neural networks, researchers have studied how to address the plasticity-stability dilemma [21], i.e., how plastic models should be to accommodate novel knowledge while preventing previously acquired knowledge from being forgotten, thus providing a certain degree of stability during the learning process.

The vast majority of CL models have taken inspiration from biological mechanisms of continual learning and can be divided into three main categories [62]: synaptic regularization, structural plasticity, and memory replay. Approaches using synaptic regularization impose additional constraints on the update of neural weights to protect consolidated knowledge (e.g., [31, 86]). Synaptic regularization is inspired by theoretical neuroscience models in which consolidated knowledge is protected from forgetting via synapses with a cascade of states yielding different levels of plasticity [16]. However, it remains unclear how to best implement it in artificial networks and, in particular, how to efficiently select the weights to be protected. Approaches with structural plasticity apply architectural changes to a model, typically expanding it with additional neural resources to accommodate representations of novel input (e.g., [64, 65, 74]). For large-scale datasets, these models may have scalability issues and require additional modulatory mechanisms that control their growth over time. Approaches with memory replay implement the re-occurrence of raw, stored training samples or latent representations of the input to prevent catastrophic forgetting (e.g., [29, 65, 69]). Intuitively, storing previously processed input samples to be subsequently replayed to a model (a process known as rehearsal) can be a very expensive and prohibitive practice. Instead, it has been shown to be more efficient to replay latent representations of the input, a process known as latent replay. This latter type of replay is closer to biological mechanisms of hippocampal replay in the mammalian brain [8, 53].

We show in Fig. 1 the most popular and recent CL approaches divided into the above-described categories and their combinations.


Fig. 1 Venn diagram of some of the most popular CL strategies: CWR [43], PNN [74], EWC [31], SI [86], LWF [40], ICARL [70], GEM [46], FearNet [29], GDM [65], ExStream [22], Pure Rehearsal, GR [79], MeRGAN [83] and AR1 [49]. The upper Rehearsal and Generative Replay categories can be seen as subsets of the replay strategies

In the diagram, we differentiate methods with rehearsal (replay of explicitly stored training samples) from methods with generative replay (replay of latent representations of the training samples). Crucially, although an increasing number of methods have been proposed, there is no consensus on which training schemes and performance metrics are best suited to evaluate CL models. Different sets of metrics have been proposed to evaluate CL performance on supervised and unsupervised learning tasks (e.g., [9, 23, 30]). In the absence of standardized metrics and evaluation schemes, it is unclear what it means to endow a method with CL capabilities. In particular, a number of CL models still require large computational and memory resources that hinder their ability to learn in real time, or with a reasonable latency, from data streams. Online CL (OCL) adds a number of desiderata to CL systems and their evaluation to ensure their ability to learn continually from potentially infinite sequential data in an online fashion, i.e., the ability to efficiently learn from one pass of the data. In the next sections, we first define CL and OCL and then summarize and discuss recent deep learning models that address OCL on data streams through the use and combination of synaptic regularization, structural plasticity, and experience replay.

2 Online Continual Learning

OCL builds on top of CL with a set of additional desiderata. In particular, online learning needs to take place as input data arrive and, since it should take place in real time, computational and memory resources must be considered. Online learning shares a number of key properties with streaming learning [22], and the boundaries between these two paradigms are fuzzy in the machine learning literature. In streaming learning, models are incrementally trained with one pattern at a time. However, it is computationally more efficient to let OCL methods process the input as mini-batches when the data is available [69]. In this section, we first define CL and then discuss a set of desiderata for OCL on data sequences.


2.1 Formalizing CL Algorithms

We define CL, and thus OCL, within the framework presented in [39]: we assume CL aims to tackle a probably approximately correct (PAC) learnable problem in the approximation of a target hypothesis h*, as well as learning from a sequence of non-i.i.d. training sets. This framework can also be seen as a generalization of the one proposed in [46], where learning happens continuously through a continuum of data and a task supervised signal t may be provided along with each training example. In CL, data can be conveniently seen as drawn from a sequence of distributions D_i. Let us define D as a potentially infinite sequence of unknown distributions D = {D_1, ..., D_N} over X × Y, where X and Y are the input and output random variables, respectively. At time i, a training set Tr_i containing one or more observations is provided from D_i to the algorithm. A task is a learning experience characterized by a unique task label t and its target function g*_t̂(x) ≡ h*(x, t = t̂), i.e., the objective of its learning. It is important to note that tasks are an abstract representation of a learning experience carrying a task label. This label helps split the full learning experience into smaller learning pieces. However, there is not necessarily a bijective correspondence between data distributions and tasks. Given h* as the general target function, i.e., our ideal prediction model, and a task label t, a CL algorithm A^CL is an algorithm with the following signature:

$$
\forall D_i \in \mathcal{D}, \quad A_i^{CL}: \langle h_{i-1}, Tr_i, M_{i-1}, t_i \rangle \rightarrow \langle h_i, M_i \rangle \qquad (1)
$$

where:
• h_i is the current hypothesis at timestep i (the parametric model learned continually);
• M_i is an external memory where we can store previous training examples or partial computation not directly related to the parametrization of the model;
• t_i is a task label that can be used to disentangle tasks and customize the hypothesis parameters. For simplicity, we can assume N as the number of tasks, one for each Tr_i;
• Tr_i is the training set of examples. Each Tr_i is composed of a number of examples e^i_j with j ∈ [1, ..., m]. Each example e^i_j = ⟨x^i_j, y^i_j⟩, where y^i_j is the feedback signal and can be the optimal hypothesis h*(x, t) (i.e., the exact label y^i_j in supervised learning), or any real tensor from which we can estimate h*(x, t), such as a reward r^i_j in reinforcement learning.

It is worth pointing out that each D_i can be considered as a stationary distribution. However, this framework also accommodates CL approaches where examples are assumed to be drawn in a non-i.i.d. fashion from each D_i over X × Y [17, 23]. A CL scenario is a specific CL setting in which the sequence of N task labels respects a certain task structure over time. Based on the proposed framework, we can define three different common scenarios:


• Single-Incremental-Task (SIT): t_1 = t_2 = · · · = t_N.
• Multi-Task (MT): ∀ i, j ∈ [1, ..., n]², i ≠ j ⟹ t_i ≠ t_j.
• Multi-Incremental-Task (MIT): ∃ i, j, k : t_i = t_j and t_j ≠ t_k.

An example of a single-incremental-task (SIT) scenario is a classification task between cats and dogs, where the class distribution changes over time. First, there may only be input images of white dogs and white cats, whereas later only black dogs and black cats. Therefore, while learning to distinguish black cats from black dogs, the algorithm should not forget how to differentiate white cats from white dogs. The task stays the same, but such a concept drift might lead to forgetting. Instead, a multi-task (MT) scenario in a classification setting would first consist of learning cats versus dogs, and later cars versus bikes, without forgetting. The task label changes when the classes change, and the algorithm can use this information to maximize its performance over time. The multi-incremental-task (MIT) scenario is the one where the same task can occur multiple times in the sequence of tasks, but such a task is not the only existing one.
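To make the A^CL signature of Eq. (1) and the scenario definitions above concrete, here is a minimal sketch; the names (CLState, cl_step, update_fn) and the naive memory policy are our own illustrative choices and are not part of the framework in [39].

```python
# Minimal sketch of A_i^CL: <h_{i-1}, Tr_i, M_{i-1}, t_i> -> <h_i, M_i>.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Example = Tuple[Any, Any]  # (x, y) pair; y may be a label or any feedback tensor


@dataclass
class CLState:
    hypothesis: Any                                        # h_i: the continually learned model
    memory: List[Example] = field(default_factory=list)    # M_i: external memory


def cl_step(state: CLState,
            train_set: List[Example],
            task_label: Optional[int],
            update_fn: Callable[[Any, List[Example], Optional[int]], Any]) -> CLState:
    """One application of the CL algorithm: update the hypothesis and the memory."""
    new_hypothesis = update_fn(state.hypothesis, train_set + state.memory, task_label)
    # Memory policy left abstract: here we simply keep a bounded tail of seen examples.
    new_memory = (state.memory + train_set)[-100:]
    return CLState(new_hypothesis, new_memory)


# Task-label sequences for the three scenarios over N = 4 training sets:
sit_labels = [0, 0, 0, 0]   # SIT: t1 = t2 = ... = tN
mt_labels = [0, 1, 2, 3]    # MT:  all task labels are distinct
mit_labels = [0, 1, 0, 2]   # MIT: a task re-occurs, but it is not the only one
```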

2.2 OCL on Sequences

OCL requires the quick integration of information during the learning process. While supervised, unsupervised, and reinforcement learning paradigms typically divide the training phase from the test phase (with possible validation steps), in OCL the learning is seamless: such a distinction can be applied to the data (i.e., using a test set that evaluates the model's performance) but cannot be used to constrain the underlying learning mechanisms. In ecological learning environments, the data stream is assumed to be non-stationary and temporally correlated. In this context, we can introduce four desiderata that better reflect the ability to continually learn from non-i.i.d. streams in an online fashion:

1. Sequential Data: We assume a potentially infinite data stream to be high-dimensional, non-stationary and temporally correlated. In OCL, incoming patterns must be processed one by one or as mini-batches for the quick integration of information while preventing catastrophic forgetting. Training strategies can take advantage of temporally correlated data streams to accelerate learning.

2. Task-Agnosticism: The model should work in the absence of supplementary supervised signals such as task boundaries or labels (i.e., t_i = ∅, ∀ D_i). In particular, the scenario in which an oracle provides the CL algorithm with the task label both during training and testing, to help reduce forgetting, disentangle representations and customize the agent's behavior, constitutes a step back with respect to the concept of CL algorithms, which should learn continuously from a never-ending stream of data with no or very sparse supervised signals.

3. Bounded Resources: The number of training sets Tr_i can potentially be unlimited, and thus computation and memory should not be proportional to the number of hypothesis updates h_i over time in any way. While available computational and storage resources can significantly vary, a finite upper bound should exist and be considered, especially with n → ∞. Within the framework proposed in [39], the resource bound may be formulated as follows: for every step in time, the number of examples contained in the memory is lower than the total number of previously seen examples, i.e.,

$$
\forall i \in [1, \dots, n], \quad |M_i| \ll \left| \bigcup_{j=1}^{i-1} Tr_j \right|.
$$

4. Experience Replay: The periodic replay of previously encountered data samples (e.g., stored in a replay buffer) can alleviate catastrophic forgetting. However, storing data samples has the general drawback of large memory requirements and re-training complexity. In the brain, hippocampal replay provides the means for the gradual integration of knowledge and is thought to occur through the reactivation of latent representations [53]. In latent replay, samples can be drawn from a probabilistic or generative model and replayed to the system for memory consolidation [73].
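As a concrete illustration of desiderata 3 and 4, the sketch below implements a fixed-capacity replay buffer based on reservoir sampling; this is one common way to keep |M_i| bounded while still providing samples for experience replay, not the specific mechanism advocated in this chapter.

```python
# Fixed-capacity replay buffer with reservoir sampling: every example seen so far
# has equal probability of being stored, regardless of stream length.
import random
from typing import Any, List, Tuple


class ReservoirBuffer:
    def __init__(self, capacity: int = 500):
        self.capacity = capacity                      # hard upper bound on memory size
        self.buffer: List[Tuple[Any, Any]] = []
        self.n_seen = 0                               # total number of examples observed

    def add(self, example: Tuple[Any, Any]) -> None:
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randint(0, self.n_seen - 1)
            if j < self.capacity:                     # keep with prob. capacity / n_seen
                self.buffer[j] = example

    def sample(self, k: int) -> List[Tuple[Any, Any]]:
        """Draw a mini-batch of stored examples to interleave with the current batch."""
        return random.sample(self.buffer, min(k, len(self.buffer)))
```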

2.3 Datasets and Benchmarks

Most of the proposed continual learning models have been designed for and evaluated with visual information. Datasets such as ImageNet and Pascal VOC provide a very good playground for classification and detection approaches. However, they were designed with static evaluation protocols in mind, i.e., the entire dataset is split into two parts: a training set is used for (one-shot) learning and a separate test set is used for performance evaluation. Splitting the training set into a number of batches is essential to train and test continual learning approaches. Most of the existing datasets are not well suited to this purpose because they lack a fundamental ingredient: the presence of multiple (unconstrained) views of the same objects taken in different video sessions (e.g., varying background, lighting, pose, occlusions). The presence of temporally coherent sessions (i.e., videos where the objects move in front of the camera) is a key feature, since temporal smoothness can be used to simplify object detection, improve classification accuracy, and address unsupervised scenarios [47, 48]. This important property indeed allows a natural interplay between sequence learning and CL approaches. In the context of object recognition, we can consider three continual learning scenarios:

• New Instances (NI): new training patterns of the same classes become available in subsequent batches with new poses and conditions (illumination, background, occlusion, etc.). A CL model is expected to incrementally consolidate its knowledge about the known classes without compromising what it has learned before.
• New Classes (NC): new training patterns belonging to different classes become available in subsequent batches. In this case, the model should be able to deal with the new classes without decreasing accuracy on the previous ones.
• New Instances and Classes (NIC): new training patterns belonging both to known and novel classes become available in subsequent training batches. A CL model is expected to consolidate its knowledge about the known classes and to learn the novel ones.

Fig. 2 Example images of the 50 objects in CORe50. Each column denotes one of the 10 categories [43]

Most CL benchmarks are adapted from other fields, for instance:

• Classification: MNIST [37], Fashion-MNIST [84], CIFAR10/100 [32], Street View House Numbers (SVHN) [59], CUB200 [82], LSUN [85], ImageNet [33], Omniglot [36] or Pascal VOC [12] (object detection and segmentation).
• Reinforcement Learning: Arcade Learning Environment (ALE) [4] for Atari games, SURREAL [13] for robot manipulation, RoboTurk for robotic skill learning through imitation [50], the CRLMaze extension of VizDoom [41], and DeepMind Lab [51].

These datasets are then split, artificially modified via image rotations or the permutation of pixels, and concatenated together to create sequences of tasks. As an example, permuted MNIST [31] and rotated MNIST [46] are CL datasets artificially created from MNIST (Fig. 2).

In Table 1, we compare datasets that we believe are better suited for CL tasks (mostly in the context of continuous object recognition, such as CORe50 [43], OpenLORIS [78] or iCubWorld-Transf [68]). Datasets where temporally coherent sequences are not available (or cannot be generated from static frames) were excluded. In the first group of datasets (NORB, COIL-100, iLab-20M, Washington RGB-D, BigBIRD, ALOI), objects are positioned on turntables and acquisition is systematically controlled in terms of pose/lighting. Neither complex backgrounds nor occlusions are present in these datasets.


Table 1 Comparison of datasets (with temporal coherent sessions) for continual object recognition. *Temporal coherent training/test sessions for NORB and COIL-100 have been defined in [47] and [48]

| Dataset | Cat. | Obj. | Sess. | Frames per sess. | Format | Acquisition setting | Outdoor sessions |
|---|---|---|---|---|---|---|---|
| NORB [38] | 5 | 25 | 20 | 20* | Grayscale | Turntable | No |
| COIL-100 [58] | – | 100 | 20 | 54* | RGB | Turntable | No |
| iLab-20M [5] | 15 | 704 | – | – | RGB | Turntable | No |
| RGB-D [77] | 51 | 300 | – | – | RGB-D | Turntable | No |
| BigBIRD [80] | – | 100 | – | – | RGB-D | Turntable | No |
| ALOI [18] | – | 1000 | – | – | RGB | Turntable | No |
| BigBrother [14] | – | 7 | 54 | ∼20 | RGB | Wall cam. | No |
| iCubWorld28 [67] | 7 | 28 | 4 | ∼150 | RGB | Hand hold | No |
| iCubWorld-Transf [68] | 15 | 150 | 6 | ∼150 | RGB | Hand hold | No |
| OpenLORIS [68] | 19 | 69 | 7 | 500 | RGB-D | Moving cam. | No |
| CORe50 [43] | 10 | 50 | 11 | ∼300 | RGB-D | Hand hold | Yes (3) |

Fig. 3 Example of 1 s recording (at 20 fps) of object #26 in session #4 (outdoor) [43]

For NORB and COIL-100, we defined in [47] a number of exploration sequences that turn the native static benchmarks into continual learning tasks; [47] also reports supervised and semi-supervised accuracy for the NI scenario. Exploration sequences can be generated for the other datasets in this group as well, by randomly walking through adjacent static frames in the multivariate parameter space; however, the obtained sequences would remain quite unnatural. The BigBrother dataset (see [14, 42]) is an interesting incremental learning setup in the face recognition domain, but copyright restrictions do not allow the public distribution of the dataset. Finally, iCubWorld [67, 68] and OpenLORIS have been acquired in a robotic vision context and are similar to CORe50. In fact, objects are hand held at a nearly constant distance from the camera and are randomly moved. Among all these datasets, CORe50 and OpenLORIS comprise a higher number of longer sessions (including outdoor ones) and more complex backgrounds, and also provide depth information that can be used as an extra feature for classification and/or to simplify object detection. We think that cross-evaluating online continual learning approaches on CORe50,¹ iCubWorld-Transf and OpenLORIS could be very interesting (Fig. 3).

1 www.vlomonaco.github.io/core50.
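To make the NI and NC protocols described above concrete for session-based datasets such as CORe50, here is a minimal sketch of how training batches could be arranged; the tuple layout and function names are our own assumptions and do not correspond to the official CORe50 or OpenLORIS loaders.

```python
# Illustrative arrangement of a session-based object-recognition dataset into
# NI (New Instances) and NC (New Classes) batches.
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Item = Tuple[Any, int, int]   # (frames, class_id, session_id) -- assumed layout


def nc_batches(dataset: List[Item], classes_per_batch: int) -> List[List[Item]]:
    """New Classes (NC): each batch introduces a disjoint set of classes."""
    by_class: Dict[int, List[Item]] = defaultdict(list)
    for item in dataset:
        by_class[item[1]].append(item)
    class_ids = sorted(by_class)
    return [
        [it for c in class_ids[i:i + classes_per_batch] for it in by_class[c]]
        for i in range(0, len(class_ids), classes_per_batch)
    ]


def ni_batches(dataset: List[Item]) -> List[List[Item]]:
    """New Instances (NI): each batch is a new session (new poses/conditions)
    of the same set of classes."""
    by_session: Dict[int, List[Item]] = defaultdict(list)
    for item in dataset:
        by_session[item[2]].append(item)
    return [by_session[s] for s in sorted(by_session)]
```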


3 Hybrid Approaches for Gradient-Based OCL

In this section, we will discuss a number of recently proposed algorithms for OCL based on gradient descent. We will start with simple algorithms inspired by structural plasticity in the brain, moving towards more complex, hybrid approaches that also use synaptic regularization as well as experience replay. We show that simply combining different and complementary approaches for OCL results in higher memory retention and, ultimately, in a better overall performance.

3.1 Copy Weight with Reinit (CWR)

CWR [43] was proposed as a baseline technique for CL from sequential batches. While this approach can work for the NC (new classes) as well as the NIC (new instances and classes) update content type, here we focus on NC under the SIT scenario. The most obvious approach to implement an SIT strategy seems to be:

1. Freeze the shared weights Θ̄ after the first batch.
2. For each batch B_i, extend the output layer with new neurons/weights for the new classes, randomly initialize the new weights but retain the optimal values for the old class weights. The old weights could then be frozen (denoted as FW in [44]) or continue to be tuned (denoted as CW in [44]).

To learn class-specific weights without interference among batches, CWR maintains two sets of weights for the output classification layer: cw are the consolidated weights used for inference and tw the temporary weights used for training. cw are initialized to 0 before the first batch, while tw are randomly re-initialized (e.g., Gaussian initialization with std = 0.01, mean = 0) before each training batch. In other words, cw can be seen as a sort of hippocampus where consolidated concepts are maintained, while tw acts as a short-term working memory in the cortex used to learn new concepts without interfering with stable ones. At the end of each batch training, the weights in tw corresponding to the classes in the current batch are scaled and copied into cw: this is trivial in the NC case because of the class segregation in different batches, but it is also possible for more complex cases [43]. To avoid forgetting in the lower levels, after the first batch B_1, all the lower-level weights Θ̄ are frozen. Weight scaling with batch-specific weights w_i is necessary in case of unbalanced batches with respect to the number of classes or the number of examples per class. In the CWR experiments reported in [43], we used models without class-shared fully connected layers (e.g., we removed FC6 and FC7 in CaffeNet) to better disentangle class-specific weights. Since the Θ̄ weights are frozen after the first batch, fully connected layer weights tend to specialize in the first-batch classes only. The CWR implementation is very simple, the extra computation is negligible and, for each batch B_i, its overhead consists of the storage of the temporary weights tw, totaling s · pn values, where s is the number of classes and pn the number of penultimate-layer neurons.
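The following sketch summarizes the CWR consolidation step described above; it is our simplified rendering of [43], with the training function, tensor shapes and scaling policy left as illustrative placeholders.

```python
# Minimal sketch of CWR's double-memory output layer.
import numpy as np


def cwr_init(num_classes: int, penultimate_units: int) -> np.ndarray:
    """cw: consolidated output-layer weights, initialized to 0 before the first batch."""
    return np.zeros((num_classes, penultimate_units))


def cwr_train_batch(cw, train_fn, batch_classes, penultimate_units, w_i=1.0):
    """One CWR batch: re-initialize tw, train it on the current batch (with the
    shared weights frozen), then copy the rows of the classes present in this
    batch into cw, scaled by the batch-specific weight w_i."""
    num_classes = cw.shape[0]
    tw = np.random.normal(0.0, 0.01, size=(num_classes, penultimate_units))
    tw = train_fn(tw)                 # placeholder for SGD on the current batch
    for c in batch_classes:
        cw[c] = w_i * tw[c]           # consolidate only the classes seen in this batch
    return cw
```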


Copy Weight with Reinit+ (CWR+)

In [49], the authors proposed two simple modifications of CWR; the resulting approach is denoted as CWR+. The first modification, mean-shift, is an automatic compensation of the batch weights w_i. Tuning such parameters is tedious, and a wrong parametrization can lead the model to underperform. Empirical evidence shows that if the weights tw learned during batch B_i are normalized by subtracting their global average, then rescaling by w_i is no longer necessary (i.e., all w_i = 1). Other reasonable forms of normalization, such as setting the standard deviation to 1, led to worse results in our experiments. The second modification, denoted as zero init, consists in setting the initial weights tw to 0 instead of using the typical Gaussian or Xavier random initialization. It is well known that neural network weights cannot all be initialized to 0, because this would cause intermediate neuron activations to be 0, thus nullifying back-propagation effects. While this is certainly true for intermediate-level weights, it is not the case for the output level [49]. What is important here is not the value 0 itself, but using the same value for all the weights (0 is used for simplicity). It has been shown that this has a significant impact on the training dynamics and on forgetting [49]. If output-level weights are initialized with Gaussian or Xavier random initialization, they typically take small values around zero; but even with small values, in the first training iterations the softmax normalization could produce strong predictions for wrong classes. This would trigger unnecessary error back-propagation, changing weights more than necessary. While this initial adjustment is irrelevant for normal batch training, we empirically found that it is detrimental for continual learning.
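A minimal sketch of the two CWR+ modifications (zero init and mean-shift) in NumPy follows; it is an illustrative rendering, not the code of [49], and the choice of computing the mean over the weights of the current batch's classes is an assumption of this sketch.

```python
import numpy as np

def cwrplus_init_tw(num_classes, pn):
    """Zero init: all temporary output weights start from the same value (0)."""
    return np.zeros((num_classes, pn))

def cwrplus_consolidate(cw, tw, batch_classes):
    """Mean-shift: subtract the average of the learned weights before copying,
    so that batch-specific scaling factors w_i are no longer needed
    (the mean is taken over the current batch's class weights in this sketch)."""
    shifted = tw[batch_classes] - tw[batch_classes].mean()
    cw[batch_classes] = shifted
    return cw
```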

3.2 Architect and Regularize (AR1)

A drawback of simple structural-plasticity approaches such as CWR and CWR+ is that the weights Θ̄ are tuned during the first batch and then frozen. AR1, first proposed in [49], consists of the combination of an Architectural and a Regularization approach. In particular, CWR+ is extended by allowing Θ̄ to be tuned across batches, subject to a regularization constraint (as per LWF [40], EWC [31] or SI [86]). The authors performed several combination experiments on CORe50 to select a regularization approach; each approach required new hyperparameter tuning with respect to the case where it was used in isolation. In the end, the choice for AR1 fell on SI [86] for the following reasons:

• LWF performs nicely in isolation, but in our experiments it does not bring relevant contributions to CWR+. Since the LWF regularization is driven by an output-stability criterion, most of its regularization effect goes to the output level, which CWR+ already manages separately.
• Both EWC and SI provide positive contributions to CWR+, and their difference is minor. While SI can sometimes be unstable when operating in isolation, it has been shown to be much more stable and easy to tune when combined with CWR+ [49].
• The SI overhead is small, since the computation of the trajectories can be easily implemented from data already computed by SGD.

Considering the low computational overhead and the fact that SGD is typically early-stopped after 2 epochs, AR1 is suitable for online continual learning.

AR1* and Latent Replay

A simple approach such as CWR+, where the fully connected layer is implemented as a double memory, is quite effective to control forgetting in the SIT-NC scenario. However, after the first training batch, CWR+ freezes all the layers except the last one, thus losing the benefit of an incremental adaptation of the underlying representation. AR1 [49] was then proposed to extend CWR+ by enabling end-to-end continual training throughout the entire network. For this purpose, the Synaptic Intelligence [86] regularization approach (similar to EWC [31]) is adopted to constrain the change of critical weights. In [45], the authors:

1. adapt CWR+ to the NIC scenario, thus making it able to reload past weights for already known classes and to adapt them with weighted contributions from different batches. As AR1 incorporates CWR+ in its main algorithm, this modification results in two continual learning strategies, denoted as CWR* and AR1*;
2. show that, in a fine-grained scenario with small and non-i.i.d. batches, Batch Normalization layers thwart the continual learning process, and that replacing them with Batch Renormalization [25] can effectively tackle this problem;
3. propose a selective weight freeze for CNN models adopting Depth-Wise Separable Convolutions;
4. reduce the computational and storage complexity of AR1 (and, in general, of EWC-like approaches) by introducing an alternative way to implement the weight update starting from the Fisher matrix.

While 1. is specific to CWR+, 2.–4. can be applied to other CL approaches as well. AR1* proved to be effective even without any kind of experience replay. However, even a small percentage of replay patterns can be extremely beneficial for taming forgetting [69]. In deep neural networks, the layers close to the input (often referred to as representation layers) usually perform low-level feature extraction and, after proper pre-training on a large dataset (e.g., ImageNet), their weights are quite stable and reusable across applications [2, 7, 60, 81]. Higher layers, instead, tend to extract class-specific discriminant features, and their fine-tuning is often important to maximize accuracy. For this reason, latent replay was introduced [69]: instead of maintaining copies of input patterns in the external memory in the form of raw data, the authors stored the activation volumes at a given layer (denoted as the Latent Replay layer; see Fig. 4). To keep the representations stable and the stored activations valid, it was proposed to slow down the learning at all the layers below the latent replay layer and to leave the layers above free to learn at full pace.


Fig. 4 Architectural diagram of Latent Replay [69]

In the limit case where the lower layers are completely frozen (i.e., slowed down to 0), latent replay is functionally equivalent to rehearsal from the input (hereafter denoted as native rehearsal), but it achieves a computational and storage saving thanks to the smaller fraction of patterns that need to flow forward and backward across the entire network and to the typical information compression that networks perform at higher layers. In the general case where the representation layers are not completely frozen, the activations stored in the external memory suffer from an aging effect (i.e., over time they tend to increasingly deviate from the activations that the same pattern would produce if feed-forwarded from the input layer). However, if the training of these layers is sufficiently slow, the aging effect is not disruptive, since the external memory has enough time to be rejuvenated with fresh patterns. When latent replay is implemented with mini-batch SGD training: (i) in the forward step, a concatenation is performed at the replay layer (on the mini-batch dimension) to join patterns coming from the input layer with activations coming from the external storage; (ii) the backward step is stopped just before the replay layer for the replay patterns. In Fig. 5, the accuracy of AR1*free with latent replay (RM_size = 1500) is shown for different choices of the rehearsal layer (reported between parentheses). As expected, when the replay layer is pushed down, the corresponding accuracy increases, showing that a continual tuning of the representation layers is important. However, after conv5_4/dw there is a sort of saturation and the model accuracy no longer improves. The residual gap (∼4%) with respect to native rehearsal is not due to the weight freezing of the lower part of the network but to the aging effect introduced above.
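A minimal PyTorch-style sketch of one latent replay training step, in the limit case where the layers below the replay layer are frozen, is given below. The split into `lower` and `upper` modules, the buffer format and the use of a cross-entropy loss are assumptions for illustration; in the general case the lower layers would be trained at a reduced learning rate rather than kept frozen.

```python
import torch
import torch.nn.functional as F

def latent_replay_step(lower, upper, x, y, latent_buffer, optimizer):
    """One mini-batch step of training with latent replay.

    lower:         layers below the latent replay layer (frozen in this sketch)
    upper:         trainable layers above the latent replay layer
    latent_buffer: (activations, labels) sampled from the external memory
    """
    with torch.no_grad():                     # replay patterns and the frozen lower
        z_new = lower(x)                      # part never receive gradients
    z_replay, y_replay = latent_buffer        # stored activation volumes
    z = torch.cat([z_new, z_replay], dim=0)   # concatenation on the mini-batch dim
    targets = torch.cat([y, y_replay], dim=0)

    logits = upper(z)                         # forward/backward only in the upper part
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), z_new.detach()        # fresh activations can refresh the memory
```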


Fig. 5 AR1*free with latent replay (RM_size = 1500) for different choices of the latent replay layer [69]. Setting the replay layer at the pool6 layer makes AR1*free equivalent to CWR*. Setting the replay layer at the "images" layer corresponds to native rehearsal. The saturation effect which characterizes the last training batches is due to the data distribution in NICv2-391 (see [45]): in particular, the lack of new instances for some classes (that already introduced all their data) slows down the accuracy trend and intensifies the effect of activation aging

Fig. 6 Accuracy results on the CORe50 NICv2-391 benchmark of CWR*, AR1*, DSLDA, iCaRL, AR1*free (conv5_4), AR1*free (pool6) [69]. Results are averaged across 10 runs in which the batch order is randomly shuffled. Colored areas indicate the standard deviation of each curve. As an exception, iCaRL was trained only on a single run given its extensive run time (∼14 days)

This can be proved by implementing an "intermediate" approach that always feeds the replay patterns from the input and stops the backward pass at conv5_4: such an intermediate approach achieved an accuracy very close to native rehearsal at the end of the training. The accuracy drop due to the aging effect can be further reduced with a better tuning of the BNR hyper-parameters and/or with the introduction of a scheduling policy that makes the global moment moving windows wider as continual learning progresses (i.e., more plasticity in the early stages and more stability in the later stages). In Fig. 6, the iCaRL accuracy over time is compared with AR1*free (conv5_4/dw) and AR1* (pool6), as well as with the top three performing rehearsal-free strategies on CORe50 NICv2-391 (CWR*, AR1* and DSLDA). While iCaRL exhibits better performance than LWF and EWC (as reported in [45]), it is far from DSLDA, CWR* and AR1*. Overall, AR1* combined with latent replay proves to be substantially more effective for OCL on sequential data streams, with a good resource consumption trade-off (see Sect. 5.1.1 of [69]).
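To make the regularization side of AR1/AR1* more concrete, below is a rough sketch of a Synaptic Intelligence-style quadratic penalty on the shared weights Θ̄, in the spirit of [86]; the per-weight importances `omega` are assumed to have been accumulated from quantities already produced by SGD (gradients and weight deltas) during previous batches. This is an illustration of the general idea, not the AR1 reference code.

```python
import torch

def si_penalty(params, anchor_params, omega, strength=1e-3):
    """Synaptic Intelligence-style surrogate loss: changes to the shared
    weights are penalized in proportion to their estimated importance.

    params:        current shared weights Theta_bar (iterable of tensors)
    anchor_params: snapshot of the same weights at the end of the previous batch
    omega:         per-weight importance accumulated along the SGD trajectory,
                   e.g. from running sums of -grad * delta_theta
    """
    penalty = torch.zeros(())
    for p, p_old, w in zip(params, anchor_params, omega):
        penalty = penalty + (w * (p - p_old) ** 2).sum()
    return strength * penalty

# During each training batch the total loss would then be, for example:
#   loss = task_loss + si_penalty(shared_params, anchors, omega)
```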

4 Growing Networks with Experience Replay

In this section, we introduce two approaches that use growing neural networks for unsupervised learning and human action classification from videos. The architectures discussed here can learn also in the absence of an explicit teaching signal such as a class label. The requirement for dense human annotations used by standard regression techniques is undesirable, as such labels are typically not present in real-world learning scenarios. The architectures comprise growing self-organizing networks for learning from sequential input. Specifically for self-organizing networks, catastrophic interference is modulated by the conditions of map plasticity, the available resources to represent information, and the similarity between new and old knowledge [66, 72]. Growing networks can dynamically add new neurons and synapses to accommodate novel knowledge and to protect consolidated embeddings from catastrophic interference. Importantly, we will describe how self-organizing learning dynamics can be leveraged to implement efficient experience replay mechanisms without the need for additional memory resources such as ad-hoc replay buffers.

4.1 Growing Self-Organizing Networks

In Parisi et al. [66], we proposed a self-organizing architecture consisting of a series of hierarchically arranged growing networks for the continual learning of human actions from videos. Each layer in the hierarchy comprises a growing recurrent network, the Gamma-GWR, and a pooling mechanism for learning action features with increasingly large spatiotemporal receptive fields (Fig. 7). The proposed deep architecture is composed of two distinct processing streams for pose and motion features, and of their subsequent integration in the STS layer. The Gamma-GWR extends the Grow When Required (GWR) network [52] with temporal context. Each neuron consists of a weight vector w_j and a number K of context descriptors c_{j,k} (with w_j, c_{j,k} ∈ R^n). Given x(t) ∈ R^n as input, the index of the best-matching unit (BMU), b, is computed as:


Fig. 7 Diagram of our deep neural architecture with Gamma-GWR networks for continual action recognition [66]

$$b = \arg\min_{j \in A}(d_j), \tag{2}$$

$$d_j = \alpha_0 \,\lVert x(t) - w_j \rVert + \sum_{k=1}^{K} \alpha_k \,\lVert C_k(t) - c_{j,k} \rVert, \tag{3}$$

$$C_k(t) = \beta \cdot w_b^{t-1} + (1 - \beta) \cdot c_{b,k-1}^{t-1}, \tag{4}$$

where ‖·‖ denotes the Euclidean distance, the α_i and β are constant values that modulate the influence of the temporal context, w_b^{t-1} is the weight vector of the BMU at t − 1, and C_k ∈ R^n is the global context of the network with C_k(t_0) = 0. For an input x(t), the activity of the network, a(t), is defined in relation to the distance between the input and its BMU (Eq. 2):

$$a(t) = \exp(-d_b), \tag{5}$$

thus yielding the highest activation value of 1 when the network can perfectly match the input sequence (d_b = 0). The training of the neurons is carried out by adapting the BMU b and its neighboring neurons n:

$$\Delta w_i = \epsilon_i \cdot h_i \cdot (x(t) - w_i), \tag{6}$$

$$\Delta c_{i,k} = \epsilon_i \cdot h_i \cdot (C_k(t) - c_{i,k}), \tag{7}$$

where i ∈ {b, n}, ε_i is a constant learning rate (ε_n < ε_b), and h_i is a habituation counter that decreases over time as the neurons are fired (i.e., selected as the BMU or its neighbor).
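A minimal NumPy sketch of Eqs. (2)-(7) follows: global-context update, BMU selection with temporal context, and the habituation-modulated update of the BMU. Array shapes, the learning-rate values and the convention c_{b,0} = w_b used to start the context recursion are assumptions of this sketch; the neighbors n would be updated analogously with a smaller rate eps_n.

```python
import numpy as np

def update_global_context(w_b_prev, c_b_prev, beta):
    """Eq. (4): C_k(t) = beta * w_b(t-1) + (1 - beta) * c_{b,k-1}(t-1).
    c_{b,0} is taken to be w_b, a convention assumed for this sketch."""
    prev = np.vstack([w_b_prev[None, :], c_b_prev[:-1]])   # c_{b,0..K-1} of previous BMU
    return beta * w_b_prev + (1.0 - beta) * prev

def gamma_gwr_step(x, W, C, Ck, hab, alpha, eps_b=0.1):
    """BMU selection and update for one input x(t), following Eqs. (2)-(3), (5)-(7).

    x: (n,) input; W: (N, n) weights w_j; C: (N, K, n) contexts c_{j,k};
    Ck: (K, n) global context C_k(t); hab: (N,) habituation counters h_i;
    alpha: (K+1,) coefficients alpha_0..alpha_K.
    """
    d = alpha[0] * np.linalg.norm(W - x, axis=1)            # Eq. (3), input term
    for k in range(Ck.shape[0]):
        d += alpha[k + 1] * np.linalg.norm(C[:, k, :] - Ck[k], axis=1)
    b = int(np.argmin(d))                                   # Eq. (2): BMU index
    a = np.exp(-d[b])                                       # Eq. (5): network activation
    W[b] += eps_b * hab[b] * (x - W[b])                     # Eq. (6)
    C[b] += eps_b * hab[b] * (Ck - C[b])                    # Eq. (7)
    return b, a
```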


Empirical studies have shown that Gamma-GWR networks with additive neurogenesis show better performance than a static GWR network with the same number of neurons [61]. While the mechanisms of structural plasticity in the Gamma-GWR do not resemble biologically plausible mechanisms (e.g., [57]), the GWR learning algorithm represents an efficient computational model that incrementally adapts to non-stationary input. Crucially, the GWR model creates new neurons whenever they are required and only after the training of existing ones. The neural update rate decreases as the neurons become more habituated, which has the effect of preventing noisy input from interfering with consolidated neural representations. Similar GWR-based approaches have been proposed for the incremental learning of body motion patterns [10, 55, 63] and human-object interaction [56]. To achieve invariance to scale and translation, we implemented MAX-pooling layers after each Gamma-GWR network, with the receptive field of the neurons increasing along the hierarchy. This approach has shown competitive results with batch learning methods on the Weizmann [19] and KTH [76] action benchmark datasets. On the KTH dataset, this method outperforms the overall accuracy reported by [27] with three different deep learning models: a convolutional neural network (CNN, 92.9%), a multiple spatiotemporal scales neural network (MSTNN, 95.3%), and a 3D CNN (96.2%). A direct comparison with these methods is however hindered by the fact that they differ in the type of input and in the number of frames per sequence used during the training and test phases. On the Weizmann dataset, the Gamma-GWR model outperforms other hierarchical models that do not rely on handcrafted features, such as a 3D CNN (90.2%, [26]) and a 3D CNN in combination with long short-term memory (94.39%, [3]). Overall, the Gamma-GWR model shows competitive results with batch learning methods on action benchmark datasets. However, the neural growth and update are driven by the minimization of the bottom-up reconstruction error and thus do not take into account top-down, task-relevant signals that can regulate the plasticity-stability balance. The mechanism that prevents the acquisition of new information from causing the forgetting of existing knowledge is embedded in the dynamics of the self-organizing learning algorithm, which allocates new neurons or updates existing ones based on the discrepancy between the input distribution and the prototype neural weights. To support this claim, we conducted an additional experiment in which we explored how our model avoids catastrophic interference when learning new action classes. For both datasets, we first trained the model with a single action class and then scaled up progressively to all the others, in order to observe how the performance of the model changes for an increasing number of action classes. The results are shown in Fig. 8, where the classification accuracy was averaged across all the combinations for a given number of action classes. Although the performance decreases as the number of action classes is increased, this decline is not catastrophic. However, it is difficult to establish whether this accuracy decrease is caused by catastrophic interference or by the labeling strategy. We observed that the overall quantization error of the networks tends to decrease over the training epochs, suggesting that the prototype neurons are effectively allocated and fine-tuned to better represent the input distribution.
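The growth rule mentioned above can be sketched as follows, in the spirit of the original GWR insertion criterion [52]: a new neuron is inserted only when the BMU both matches the input poorly (low activation) and is already well trained (habituated). The thresholds and the context initialization are illustrative assumptions, not the exact settings of [66].

```python
import numpy as np

def maybe_grow(W, C, hab, x, Ck, b, activation, a_t=0.3, h_t=0.1):
    """GWR-style growth check: add a neuron only if the activation is below
    a_t AND the BMU's habituation counter is below h_t (i.e., well trained)."""
    if activation < a_t and hab[b] < h_t:
        W = np.vstack([W, (0.5 * (W[b] + x))[None, :]])    # prototype between BMU and input
        C = np.vstack([C, (0.5 * (C[b] + Ck))[None, :]])   # context initialized analogously
        hab = np.append(hab, 1.0)                          # new neurons start fully plastic
    return W, C, hab
```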


Fig. 8 Incremental learning: classification accuracy averaged across all the combinations for a given number of action classes [66]

4.2 Replay via Neural Re-activations

The vanilla Gamma-GWR model has a set of limitations in OCL scenarios. In particular, it is non-trivial to empirically find adequate hyper-parameters to achieve an optimal plasticity-stability balance. To completely alleviate catastrophic forgetting, the model may grow until it becomes computationally expensive to process data in real time, reducing its ability to work with large-scale datasets. Instead, if the model excessively limits its growth, its overall performance over time may decrease due to progressive forgetting, as can be seen in Fig. 8. A replay mechanism could prevent the model from progressive forgetting. Additionally, although the networks would keep learning in an unsupervised fashion, a teaching signal can be used to modulate their growth during the learning phase. In [65], we proposed the growing dual-memory learning (GDM) model, comprising a deep convolutional feature extractor and two hierarchically arranged recurrent self-organizing networks (Fig. 9). The GDM model comprises an episodic memory and a semantic memory, both implemented as extended Gamma-GWR networks [66]. The growing episodic memory (G-EM) learns from sensory experience in an unsupervised fashion, i.e., the levels of structural plasticity are regulated by the ability of the network to predict the spatiotemporal patterns given as input. Instead, the growing semantic memory (G-SM) receives neural activation trajectories from G-EM and uses task-relevant signals (input annotations) to modulate the levels of neurogenesis and neural update, thereby developing more compact representations of statistical regularities. G-EM and G-SM mitigate catastrophic forgetting through self-organizing learning dynamics with structural plasticity, increasing the information storage capacity in response to novel input. For the consolidation of knowledge over time, internally generated neural activity patterns in G-EM are periodically replayed to both memories, thereby mitigating catastrophic forgetting during incremental learning tasks. To yield periodic experience replay, G-EM is equipped with synapses that learn statistically significant neural activity in the temporal domain. This results in the generation of sequence-selective neural activation trajectories that can be replayed to both networks after each learning episode, without requiring additional memory resources to store the input.


Fig. 9 a Illustration of the GDM architecture. Extracted features from image sequences are fed into a growing episodic memory (G-EM). Neural activation trajectories from G-EM are fed to the growing semantic memory (G-SM). While the learning process of G-EM remains unsupervised, G-SM uses class labels as task-relevant signals to regulate levels of plasticity. After each learning episode, neural re-activation trajectories are replayed to both memories (green arrows); b The architecture classifies image sequences at instance level (episodic experience) and at category level (semantic knowledge). For classification, neurons in G-EM and G-SM associatively learn histograms of class labels from the input (red dashed lines); c To enable memory replay in the absence of sensory input, G-EM is equipped with temporal synapses that are strengthened between consecutively activated neurons [65]

To preserve the temporal structure of the input during experience replay, the model generates temporally ordered trajectories of neural activity. The trajectories are created by using the asymmetric temporal links of G-EM to recursively reactivate sequence-selective neural activity trajectories (RNATs) embedded in the network. RNATs can be computed for each neuron in G-EM for a given temporal window and replayed to G-EM and G-SM after each learning episode triggered by external input stimulation. For each neuron j in G-EM, a RNAT S_j of length λ = K^{EM} + K^{SM} + 1 can be generated:

$$S_j = \langle w^{EM}_{s(0)}, w^{EM}_{s(1)}, \ldots, w^{EM}_{s(\lambda)} \rangle, \tag{8}$$

$$s(i) = \arg\max_{n \in A \setminus j} P_{(n,\, s(i-1))}, \quad i \in [1, \lambda], \tag{9}$$

where P_{(i,j)} is the matrix of temporal synapses and s(0) = j.
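A minimal NumPy sketch of Eqs. (8)-(9) follows: starting from a G-EM neuron j, the next element of the trajectory is the neuron with the strongest temporal link to the previous one, and the corresponding weight vectors are collected as a pseudo-sequence for replay. The (N, N) shape and indexing convention of the temporal-synapse matrix `P` are assumptions of this sketch.

```python
import numpy as np

def generate_rnat(j, W_em, P, lam):
    """Generate the RNAT S_j for G-EM neuron j (Eqs. 8-9).

    W_em: (N, n) weight vectors of the episodic-memory neurons
    P:    (N, N) matrix of temporal synapses, indexed as P[n, s(i-1)]
    lam:  trajectory length lambda
    """
    s = [j]
    for _ in range(lam):
        scores = P[:, s[-1]].astype(float)   # P(n, s(i-1)) for all candidates n
        scores[j] = -np.inf                  # exclude the starting neuron (A \ j)
        s.append(int(np.argmax(scores)))
    return W_em[s]                           # prototype sequence <w_s(0), ..., w_s(lambda)>

def generate_all_rnats(W_em, P, lam=5):      # lambda = 5 in the CORe50 experiments
    """One pseudo-sequence per G-EM neuron, replayed to G-EM and G-SM after
    each learning episode (no stored input samples are required)."""
    return [generate_rnat(j, W_em, P, lam) for j in range(W_em.shape[0])]
```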


The set of generated RNATs from all G-EM neurons is replayed to G-EM and G-SM after each learning episode. Sequence-selective prototype sequences can thus be generated and periodically replayed without the need to explicitly store the temporal relations and labels of previously seen training samples. This is in agreement with neurophysiological studies evidencing that hippocampal replay consists of the reactivation of previously stored patterns of neural activity, occurring predominantly after an experience [28, 35]. A series of experiments evaluating the performance of the GDM model on the CORe50 dataset shows the benefit of using memory replay. In the training strategy with replay, after each learning episode (i.e., a training epoch over the mini-batch), the model generates a set of RNATs S_j from the G-EM neurons. Thus, the number of RNATs of length λ = 5 is equal to the number of neurons created by G-EM.

Fig. 10 Comparison of the effects of forgetting during incremental learning with and without memory replay at an instance level (a, b) and category level (c, d). Each category contains 5 instances. The plots show the average accuracies on the categories encountered so far (a, c) and the accuracies on the first encountered category (b, d) as further new categories are learned. The shaded areas show the standard deviation [65]


RNATs are replayed to G-EM and G-SM in correspondence with novel sensory experience to reinforce previously encountered categories. Figure 10 compares the overall accuracy on all the object categories encountered so far with the accuracy on the first encountered category, as the number of encountered categories grows. At an instance level (Fig. 10a, b), incremental learning with memory replay improves the overall accuracy to 82.14% (from 75.93%) and the accuracy on the first 5 instances to 80.41% (from 69.25%). At a category level (Fig. 10c, d), the overall accuracy increases to 91.18% (from 85.53%) and the accuracy on the first encountered category to 89.21% (from 79.53%). Overall, the results show that replaying RNATs generated from G-EM significantly mitigates the effects of catastrophic forgetting over time.

5 Conclusions

OCL is a crucial but challenging aspect of learning systems. Empirical evidence shows that the use of an episodic memory can significantly alleviate catastrophic forgetting and that, when this memory replays latent input representations rather than explicitly stored training samples encountered during the learning phase, training complexity can be reduced. Overall, generative memory replay strategies may be not only empirically useful but also fundamental, especially when processing non-i.i.d. sequential data streams.

Future research should aim at modeling and integrating additional aspects of learning found in biological systems, which would allow artificial agents to learn efficiently in more complex environments. Examples of these aspects include curriculum and transfer learning. Humans and animals exhibit better learning performance when training examples are organized in a meaningful way, e.g., by making the learning tasks gradually more difficult [34]. Following this observation, it was shown in [11] that having a curriculum of progressively harder tasks leads to faster training performance in neural networks. While a number of methods have been proposed that explore the use of training curricula (e.g., [20, 71]), these methods have not been investigated in combination with continual learning. In transfer learning, previously acquired knowledge in one domain is applied to solve a problem in a novel domain [24]. For this reason, transfer learning represents a significantly valuable feature for inferring general laws from (a limited amount of) samples. Recent methods proposed, for instance, the use of growing neural networks to transfer learned low-level features and high-level policies from a simulated to a real environment (e.g., [75]). However, scalable OCL systems with transferable knowledge remain an open and exciting challenge.

Acknowledgements The authors would like to thank the ContinualAI organization and the other ContinualAI Research fellows for their support.


References 1. Aimone, J.B., Wiles, J., Gage, F.H.: Computational influence of adult neurogenesis on memory encoding. Neuron 61, 187–202 (2009) 2. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the hypothesis space for improving the generalization ability of support vector machines. In: IEEE International Joint Conference on Neural Networks (2011) 3. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) Human Behavior Understanding, pp. 29–39. Springer, Berlin (2011) 4. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013) 5. Borji, A., Izadi, S., Itti, L.: iLab-20M: a large-scale controlled object dataset to investigate deep learning. In: International Conference of Computer Vision and Pattern Recognition (CVPR), pp. 2221–2230 (2016). https://doi.org/10.1109/CVPR.2016.244 6. Chen, Z., Liu, B.: Lifelong machine learning. Synth. Lect. Artif. Intell. Mach. Learn. 12(3), 1–207 (2018) 7. Coraddu, A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean Eng. 130, 351–370 (2017) 8. Deng, W., Aimone, J.B., Gage, F.H.: New neurons and new memories: how does adult hippocampal neurogenesis affect learning and memory? Nat. Rev. Neurosci. 11(5), 339–350 (2010) 9. Díaz-Rodríguez, N., Lomonaco, V., Filliat, D., Maltoni, D.: Don’t forget, there is more than forgetting: new metrics for Continual Learning. In: Workshop on Continual Learning, NeurIPS 2018 (Neural Information Processing Systems), Montreal, Canada (2018). https://hal.archivesouvertes.fr/hal-01951488 10. Elfaramawy, N., Barros, P., Parisi, G.I., Wermter, S.: Emotion recognition from body expressions with a neural network architecture. In: Proceedings of the International Conference on Human Agent Interaction (HAI’17), Bielefeld, Germany, pp. 143–149 (2017) 11. Elman, J.L.: Learning and development in neural networks: the importance of starting small. Cognition 48(1), 71–99 (1993) 12. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111(1), 98–136 (2015) 13. Fan, L., Zhu, Y., Zhu, J., Liu, Z., Zeng, O., Gupta, A., Creus-Costa, J., Savarese, S., Fei-Fei, L.: Surreal: open-source reinforcement learning framework and robot manipulation benchmark. In: Conference on Robot Learning (2018) 14. Franco, A., Maio, D., Maltoni, D.: The big brother database: evaluating face recognition in smart home environments. In: Advances in Biometrics: 3rd International Conference (ICB), pp. 142–150 (2009) 15. French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128– 135 (1999) 16. Fusi, S., Drew, P.J., Abbott, L.F.: Cascade models of synaptically stored memories. Neuron 45(4), 599–611 (2005) 17. Gepperth, A., Hammer, B.: Incremental learning algorithms and applications. In: European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium (2016). https://hal. archives-ouvertes.fr/hal-01418129 18. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.: The Amsterdam Library of object images. Int. J. Comput. Vis. 61(1), 103–112 (2005). https://doi.org/10.1023/B:VISI.0000042993. 50813.60 19. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: ICCV’05, Beijing, China, pp. 
1395–1402 (2005) 20. Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou, J.E.A.: Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016)


21. Grossberg, S.: How does a brain build a cognitive code? Psychol. Rev. 87, 1–51 (1980) 22. Hayes, T.L., Cahill, N.D., Kanan, C.: Memory efficient experience replay for streaming learning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9769–9776 (2018) 23. Hayes, T.L., Kemker, R., Cahill, N.D., Kanan, C.: New metrics and experimental paradigms for continual learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2112–21123 (2018). https://doi.org/10.1109/CVPRW.2018. 00273 24. Holyoak, K., Thagard, P.: The analogical mind. Am. Psychol. 52, 35–44 (1997) 25. Ioffe, S.: Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In: Advances in Neural Information Processing Systems (NIPS), pp. 1945–1953 (2017) 26. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013) 27. Jung, M., Hwang, J., Tani, J.: Self-organization of spatio-temporal hierarchy via learning of dynamic visual image patterns on action sequences. PloS One 10(7), 1–16 (2015). https://doi. org/10.1371/journal.pone.0131214 28. Karlsson, M., Frank, L.: Awake replay of remote experiences in the hippocampus. Nat. Neurosci. 19(10), 913–918 (2009) 29. Kemker, R., Kanan, C.: Fearnet: brain-inspired model for incremental learning. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=SJ1XmfRb 30. Kemker, R., McClure, M., Abitino, A., Hayes, T.L., Kanan, C.: Measuring catastrophic forgetting in neural networks. In: AAAI (2017) 31. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. (2017) 32. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical Report, Citeseer (2009) 33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 34. Krueger, K.A., Dayan, P.: Flexible shaping: how learning in small steps helps. Cognition 110, 380–394 (2009) 35. Kudrimoti, H.S., Barnes, C.A., McNaughton, B.L.: Reactivation of hippocampal cell assemblies: effects of behavioral state, experience, and EEG dynamics. J. Neurosci. 19(10), 4090– 4101 (1999). https://doi.org/10.1523/JNEUROSCI.19-10-04090.1999 36. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015) 37. LeCun, Y., Cortes, C.: MNIST handwritten digit database. Public (2010). http://yann.lecun. com/exdb/mnist/ 38. LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 97–104 (2004). https://doi.org/10.1109/CVPR.2004.1315150, http://ieeexplore.ieee.org/ lpdocs/epic03/wrapper.htm?arnumber=1315150%5Cn, http://www.cs.nyu.edu/~ylclab/data/ norb-v1.0-small/ 39. Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., Díaz-Rodríguez, N.: Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. Inf. Fusion (2019). 
https://doi.org/10.1016/j.inffus.2019.12.004, https://hal.archives-ouvertes. fr/hal-02381343 40. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. (2017) 41. Lomonaco, V., Desai, K., Culurciello, E., Maltoni, D.: Continual reinforcement learning in 3d non-stationary environments (2019). arXiv:1905.10112


42. Lomonaco, V., Maltoni, D.: Comparing incremental learning strategies for convolutional neural networks. In: Artificial Neural Networks in Pattern Recognition: 7th IAPR TC3 Workshop (ANNPR 2016), pp. 175–184 (2016). https://doi.org/10.1007/978-3-319-46182-3_15 43. Lomonaco, V., Maltoni, D.: CORe50: a new dataset and benchmark for continuous object recognition. In: Levine, S., Vanhoucke, V., Goldberg, K. (eds.) Proceedings of the 1st Annual Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 78, pp. 17–26. PMLR (2017). http://proceedings.mlr.press/v78/lomonaco17a.html 44. Lomonaco, V., Maltoni, D.: CORe50: a new dataset and benchmark for continuous object recognition (2017). arXiv:1705.03550, https://arxiv.org/pdf/1705.03550v1.pdf 45. Lomonaco, V., Maltoni, D., Pellegrini, L.: fine-grained continual learning. 1–14 (2019). arXiv:1907.03799 46. Lopez-Paz, D., Ranzato, M.A.: Gradient episodic memory for continual learning. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 6467–6476. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7225-gradient-episodic-memory-forcontinual-learning.pdf 47. Maltoni, D., Lomonaco, V.: Semi-supervised tuning from temporal coherence. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2509–2514 (2016). https://doi. org/10.1109/ICPR.2016.7900013, http://ieeexplore.ieee.org/document/7900013/ 48. Maltoni, D., Lomonaco, V.: Semi-supervised tuning from temporal coherence (2016). arXiv:1511.03163 49. Maltoni, D., Lomonaco, V.: Continuous learning in single-incremental-task scenarios. Neural Netw. 116, 56–73 (2019). https://doi.org/10.1016/j.neunet.2019.03.010, http://arxiv.org/abs/ 1806.08568, https://linkinghub.elsevier.com/retrieve/pii/S0893608019300838 50. Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., Savarese, S., Fei-Fei, L.: Roboturk: a crowdsourcing platform for robotic skill learning through imitation. In: Conference on Robot Learning (2018) 51. Mankowitz, D.J., Žídek, A., Barreto, A., Horgan, D., Hessel, M., Quan, J., Oh, J., van Hasselt, H., Silver, D., Schaul, T.: Unicorn: continual learning with a universal, off-policy agent (2018). arXiv:1802.08294 52. Marsland, S., Shapiro, J., Nehmzow, U.: A self-organising network that grows when required. Neural Netw. 15(8–9), 1041–1058 (2002) 53. McClelland, J.L., McNaughton, B.L., O’Reilly, R.C.: Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102(3), 419 (1995) 54. Mermillod, M., Bugaiska, A., Bonin, P.: The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Front. Psychol. 4, 504 (2013). https://doi.org/10.3389/fpsyg.2013.00504, http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3732997&tool=pmcentrez&endertype=abstract 55. Mici, L., Parisi, G.I., Wermter, S.: An incremental self-organizing architecture for sensorimotor learning and prediction (2017). arXiv:1712.08521 56. Mici, L., Parisi, G.I., Wermter, S.: A self-organizing neural network architecture for learning human-object interactions. Neurocomputing 307, 14–24 (2018) 57. Ming, G.L., Song, H.: Adult neurogenesis in the mammalian brain: significant answers and significant questions. Neuron 70, 687–702 (2011) 58. 
Nene, S.A., Nayar, S.K., Murase, H.: Columbia Object Image Library (COIL-100). Technical Report (1996). http://www1.cs.columbia.edu/CAVE/publications/pdfs/Nene_TR96_2.pdf 59. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011). http://ufldl.stanford.edu/housenumbers/nips2011_ housenumbers.pdf 60. Oneto, L., Ridella, S., Anguita, D.: Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015)


61. Parisi, G., Ji, X., Wermter, S.: On the role of neurogenesis in overcoming catastrophic forgetting. In: NIPS’18, Workshop on Continual Learning, Montreal, Canada (2018) 62. Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: a review. Neural Netw. 113, 54–71 (2019). https://doi.org/10.1016/j.neunet. 2019.01.012, http://www.sciencedirect.com/science/article/pii/S0893608019300231 63. Parisi, G.I., Magg, S., Wermter, S.: Human motion assessment in real time using recurrent self-organization. In: Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication, New York, NY, pp. 71–79 (2016) 64. Parisi, G.I., Tani, J., Weber, C., Wermter, S.: Lifelong learning of humans actions with deep neural network self-organization. Neural Netw. 96, 137–149 (2017) 65. Parisi, G.I., Tani, J., Weber, C., Wermter, S.: Lifelong learning of spatiotemporal representations with dual-memory recurrent self-organization. Front. Neurorobotics 12, 78 (2018). https://doi.org/10.3389/fnbot.2018.00078, https://www.frontiersin.org/article/10.3389/fnbot. 2018.00078 66. Parisi, S., Ramstedt, S., Peters, J.: Goal-driven dimensionality reduction for reinforcement learning. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS) (2017). http://www.ausy.tu-darmstadt.de/uploads/Site/EditPublication/ parisi2017iros.pdf 67. Pasquale, G., Ciliberto, C., Odone, F., Rosasco, L., Natale, L.: Teaching iCub to recognize objects using deep convolutional neural networks. In: Proceedings of Workshop on Machine Learning for Interactive Systems, pp. 21–25 (2015) 68. Pasquale, G., Ciliberto, C., Rosasco, L., Natale, L.: Object identification from few examples by improving the invariance of a deep convolutional neural network. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4904–4911 (2016). https:// doi.org/10.1109/IROS.2016.7759720 69. Pellegrini, L., Graffieti, G., Lomonaco, V., Maltoni, D.: Latent replay for real-time continual learning (2019). arXiv:1912.01100 70. Rebuffi, S., Kolesnikov, A., Sperl, G., Lampert, C.H.: icarl: incremental classifier and representation learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5533–5542 (2017). https://doi.org/10.1109/CVPR.2017.587 71. Reed, S., de Freitas, N.: Neural programmer interpreters (2015). arXiv:1511.06279 72. Richardson, F.M., Thomas, M.S.: Critical periods and catastrophic interference effects in the development of self-organizing feature maps. Dev. Sci. 11(3), 371–389 (2008) 73. Robins, A.: Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci. 7(2), 123– 146 (1995). https://doi.org/10.1080/09540099550039318 74. Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R.: Progressive neural networks (2016). ArXiv e-prints 75. Rusu, A.A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., Hadsell, R.: Sim-to-real robot learning from pixels with progressive nets. In: CoRL’17, Mountain View, CA (2017) 76. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR’04, Cambridge, UK, pp. 32–36 (2004) 77. Schwarz, M., Schulz, H., Behnke, S.: RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In: IEEE International Conference on Robotics and Automation (ICRA’15), May, 1329–1335 (2015). https://doi.org/10. 
1109/ICRA.2015.7139363, http://www.ais.uni-bonn.de/papers/ICRA_2015_Schwarz_RGBD-Objects_Transfer-Learning.pdf 78. She, Q., Feng, F., Hao, X., Yang, Q., Lan, C., Lomonaco, V., Shi, X., Wang, Z., Guo, Y., Zhang, Y., Qiao, F., Chan, R.H.M.: Openloris-object: a dataset and benchmark towards lifelong object recognition (2019). CoRR arXiv:abs/1911.06487 79. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: Advances in Neural Information Processing Systems, pp. 2990–2999 (2017) 80. Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: BigBIRD: a large-scale 3d database of object instances. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516 (2014). https://doi.org/10.1109/ICRA.2014.6906903


81. Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: European Conference on Technology Enhanced Learning (2015) 82. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology (2010) 83. Wu, C., Herranz, L., Liu, X., Wang, Y., van de Weijer, J., Raducanu, B.: Memory replay GANs: learning to generate new categories without forgetting. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31, pp. 5962–5972. Curran Associates, Inc. (2018) 84. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017). arXiv:1708.07747 85. Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: LSUN: construction of a large-scale image dataset using deep learning with humans in the loop (2015). CoRR arXiv:abs/1506.03365, http://dblp. uni-trier.de/db/journals/corr/corr1506.html#YuZSSX15 86. Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3987–3995. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/zenke17a.html