Gerald Friedland

Information-Driven Machine Learning
Data Science as an Engineering Discipline

University of California, Berkeley
Berkeley, CA, USA

ISBN 978-3-031-39476-8
ISBN 978-3-031-39477-5 (eBook)
https://doi.org/10.1007/978-3-031-39477-5
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
Preface
In every branch of engineering, we see that an engineer, by tradition, can conceptually build complex constructs like airplanes, bridges, or ships, starting from scratch. This holds for software engineers as well, who are taught from the very beginning to create small programming projects and to think algorithmically, starting from basic logic circuits. However, as the professional journey progresses, the joy of building from scratch becomes rarer, as work is divided among specialists to enhance the quality of the end product. Moreover, the engineering process becomes infused with marketing, business, legal, and political aspects.

When it comes to machine learning, we observe a drastic shift from this path. Existing large models created by a handful of experts and companies are often repurposed. Instead of building models from scratch, the focus in machine learning tends to revolve around fine-tuning these existing models to improve their accuracy on benchmark tests. This practice can lead to a dominance of those with enough resources to handle the computational demands of oversized models. At the same time, it significantly limits the opportunity for introspection and incremental improvement due to the black-box nature of machine learning.

This book seeks to address this issue. Herein, I offer a different perspective on machine learning and data science, viewing them as engineering disciplines derived from physical principles. I advocate for an approach that lets practitioners analyze the complexity of a problem, estimate the required resources, and construct models piece by piece. We discuss model size, data sufficiency, resilience, and bias in terms of physically derived units. Like all engineers, practitioners of machine learning should be familiar with potential pitfalls and theoretical foundations, know the criteria for a good model, and understand the limitations of their approach.

Like any nascent field, machine learning has its share of open problems. This book provides a range of metrics, which are often averages, expectations, or limits. While these are common in engineering and the empirical sciences, it does not mean that we should rest on our laurels. We can refine our models by getting closer to the limits and fine-tuning our expectations using specific assumptions and problem statements.
This textbook stems from a graduate seminar at the University of California, Berkeley, named "Experimental Design for Machine Learning," begun in 2018. The seminar aimed to examine the interplay of machine learning and information theory. We found that information theory offers a useful tool to unify different perspectives on machine learning, from statistics to early physical modeling approaches. The class, which started with seven students, has grown in popularity, especially among Master of Engineering students, due to its practical advice on building models.

Just like the class, this book aims to connect machine learning with traditional sciences like physics, chemistry, and biology through an information-centered viewpoint on data science. It also establishes a rigorous framework for model engineering. The focus is on fundamental and transferable concepts, properties, and algorithms. The goal is to ensure that the teachings are long-lasting, even in the dynamic field of computing.

Readers of this book are expected to have previous exposure to data science and machine learning. The book is structured to facilitate teaching a three-hour class per week. Typically, students progress at a pace of one chapter per week, which includes solving exercises, further reading, and attending a discussion section. The course culminates in a project that requires students to apply what they have learned to evaluate the design and model engineering considerations of a published work.

This book abides by Occam's razor principle by aiming to provide the simplest yet most accurate explanation. Mathematical formalisms are kept to the essentials, and natural language is used whenever possible for ease of reading. While some may see this approach as philosophical, it is my belief that mathematics is a form of formalized philosophy. We use formalisms where necessary for precision but opt for natural language when it serves the purpose just as well and is easier to comprehend.

It's essential to remember that the objective of this book is not to criticize current practices but to explore and suggest alternative approaches. It's an invitation to consider machine learning from a new perspective, one that holds the potential for more independence and versatility in model building. The idea is to empower practitioners to take more ownership of their projects, to understand their models deeply, and to align them closely with specific problem contexts.

Throughout the book, you'll find many references to real-world examples, case studies, and exercises that ground the theory in practical applications. The intention is to make the concepts more relatable, applicable, and hence more retainable. It's about offering you an opportunity to get your hands dirty with the nuts and bolts of machine learning and, in the process, experience the genuine joy of building something from scratch, understanding each component's function and contribution.

Moreover, this book appreciates that different people have different learning styles. Some prefer a direct, mathematical, and formalized approach, while others favor a more conceptual and philosophical perspective. Therefore, I've attempted to strike a balance between these styles, hoping to make the content accessible and engaging to as many readers as possible.
Ultimately, this book is an exploration of the boundaries of machine learning, an endeavor to further our understanding and possibly extend those boundaries. It's about aspiring to push the envelope of machine learning as an engineering discipline, to contribute to the richness and diversity of this exciting field.

In conclusion, this book was born out of a deep belief in the potential of machine learning and artificial intelligence and the possibilities they hold for the future. It is a humble attempt to contribute to the journey of understanding, exploring, and refining these powerful tools. Whether you are a student, an industry professional, or simply someone fascinated by machine learning, I hope that this book will provide you with fresh insights, stimulate thoughtful discussions, and inspire innovative ideas. Here's to the exciting journey that lies ahead!

Writing a book is a collaborative endeavor, much like a village raising a child, and I'm simply a minor player benefiting from the wisdom of giants. Two pivotal figures who kindled my desire to pen this book are David MacKay, whose captivating textbook bridges the gap between information theory and machine learning, and Richard Feynman, whose "Lectures on Computation" demonstrated the art of conveying complex computer science concepts in an accessible and succinct manner.

A myriad of individuals have contributed to this book, either through thoughtful reviews or as insightful conversation partners, both online and offline. My gratitude extends to each one of them, and while I strive to acknowledge all, I apologize if I inadvertently miss anyone. I'd like to extend my deepest appreciation to Andrew Charman, Martin Ciupa, Adam Janin, Mario Krell, Nick Weaver, and other esteemed colleagues from the International Computer Science Institute in Berkeley. I'm indebted to Robert Mertens, Eleanor Cawthon, Alfredo Metere, Raúl Rojas, John Garofalo, Bo Li, Joanna Bryson, Bhiksha Raj, and the many enthusiastic students of my CS294-082 class. They have all played pivotal roles in this book's fruition.

And last, but by no means least, my profound gratitude goes to my wife, Adriana. She has always had my back, in every sense of the word. Thank you.

Berkeley, CA, USA
June 2023
Gerald Friedland
Contents

1 Introduction
  1.1 Science
    1.1.1 Step 1: Observation
    1.1.2 Step 2: Hypothesis
    1.1.3 Step 3: Experiment
    1.1.4 Step 4: Conclusion
    1.1.5 Additional Step: Simplification
  1.2 Data Science
  1.3 Information Measurements
  1.4 Exercises
  1.5 Further Reading

2 The Automated Scientific Process
  2.1 The Role of the Human
    2.1.1 Curiosity
    2.1.2 Data Collection
    2.1.3 The Data Table
  2.2 Automated Model Building
    2.2.1 The Finite State Machine
    2.2.2 How Machine Learning Generalizes
  2.3 Exercises
  2.4 Further Reading

3 The (Black Box) Machine Learning Process
  3.1 Types of Tasks
    3.1.1 Unsupervised Learning
    3.1.2 Supervised Learning
  3.2 Black-Box Machine Learning Process
    3.2.1 Training/Validation Split
    3.2.2 Independent But Identically Distributed
  3.3 Types of Models
    3.3.1 Nearest Neighbors
    3.3.2 Linear Regression
    3.3.3 Decision Trees
    3.3.4 Random Forests
    3.3.5 Neural Networks
    3.3.6 Support Vector Machines
    3.3.7 Genetic Programming
  3.4 Error Metrics
    3.4.1 Binary Classification
    3.4.2 Detection
    3.4.3 Multi-class Classification
    3.4.4 Regression
  3.5 The Information-Based Machine Learning Process
  3.6 Exercises
  3.7 Further Reading

4 Information Theory
  4.1 Probability, Uncertainty, Information
    4.1.1 Chance and Probability
    4.1.2 Probability Space
    4.1.3 Uncertainty and Entropy
    4.1.4 Information
    4.1.5 Example
  4.2 Minimum Description Length
    4.2.1 Example
  4.3 Information in Curves
  4.4 Information in a Table
  4.5 Exercises
  4.6 Further Reading

5 Capacity
  5.1 Intellectual Capacity
    5.1.1 Minsky's Criticism
    5.1.2 Cover's Solution
    5.1.3 MacKay's Viewpoint
  5.2 Memory-Equivalent Capacity of a Model
  5.3 Exercises
  5.4 Further Reading

6 The Mechanics of Generalization
  6.1 Logic Definition of Generalization
  6.2 Translating a Table into a Finite State Machine
  6.3 Generalization as Compression
  6.4 Resilience
  6.5 Adversarial Examples
  6.6 Exercises
  6.7 Further Reading

7 Meta-Math: Exploring the Limits of Modeling
  7.1 Algebra
    7.1.1 Garbage In, Garbage Out
    7.1.2 Randomness
    7.1.3 Transcendental Numbers
  7.2 No Rule Without Exception
    7.2.1 Example: Why Do Prime Numbers Exist?
    7.2.2 Compression by Association
  7.3 Correlation vs. Causality
  7.4 No Free Lunch
  7.5 All Models Are Wrong
  7.6 Exercises
  7.7 Further Reading

8 Capacity of Neural Networks
  8.1 Memory-Equivalent Capacity of Neural Networks
  8.2 Upper-Bounding the MEC Requirement of a Neural Network Given Training Data
  8.3 Topological Concerns
  8.4 MEC for Regression Networks
  8.5 Exercises
  8.6 Further Reading

9 Neural Network Architectures
  9.1 Deep Learning and Convolutional Neural Networks
    9.1.1 Convolutional Neural Networks
    9.1.2 Residual Networks
  9.2 Generative Adversarial Networks
  9.3 Autoencoders
  9.4 Transformers
    9.4.1 Architecture
    9.4.2 Self-attention Mechanism
    9.4.3 Positional Encoding
    9.4.4 Example Transformation
    9.4.5 Applications and Limitations
  9.5 The Role of Neural Architectures
  9.6 Exercises
  9.7 Further Reading

10 Capacities of Some Other Machine Learning Methods
  10.1 k-Nearest Neighbors
  10.2 Support Vector Machines
  10.3 Decision Trees
    10.3.1 Converting a Table into a Decision Tree
    10.3.2 Decision Trees
    10.3.3 Generalization of Decision Trees
    10.3.4 Ensembling
  10.4 Genetic Programming
  10.5 Unsupervised Methods
    10.5.1 k-Means Clustering
    10.5.2 Hopfield Networks
  10.6 Exercises
  10.7 Further Reading

11 Data Collection and Preparation
  11.1 Data Collection and Annotation
  11.2 Task Definition
  11.3 Well-Posedness
    11.3.1 Chaos and How to Avoid It
    11.3.2 Example
    11.3.3 Forcing Well-Posedness
  11.4 Tabularization
    11.4.1 Table Data
    11.4.2 Time-Series Data
    11.4.3 Natural Language and Other Varying-Dependency Data
    11.4.4 Perceptual Data
    11.4.5 Multimodal Data
  11.5 Data Validation
    11.5.1 Hard Conditions
    11.5.2 Soft Conditions
  11.6 Numerization
  11.7 Imbalanced Data
    11.7.1 Extension Beyond Simple Accuracy
  11.8 Exercises
  11.9 Further Reading

12 Measuring Data Sufficiency
  12.1 Dispelling a Myth
  12.2 Capacity Progression
  12.3 Equilibrium Machine Learner
  12.4 Data Sufficiency Using the Equilibrium Machine Learner
  12.5 Exercises
  12.6 Further Reading

13 Machine Learning Operations
  13.1 What Makes a Predictor Production-Ready?
  13.2 Quality Assurance for Predictors
    13.2.1 Traditional Unit Testing
    13.2.2 Synthetic Data Crash Tests
    13.2.3 Data Drift Test
    13.2.4 Adversarial Examples Test
    13.2.5 Regression Tests
  13.3 Measuring Model Bias
    13.3.1 Where Does the Bias Come from?
  13.4 Security and Privacy
  13.5 Exercises
  13.6 Further Reading

14 Explainability
  14.1 Explainable to Whom?
  14.2 Occam's Razor Revisited
  14.3 Attribute Ranking: Finding What Matters
  14.4 Heatmapping
  14.5 Instance-Based Explanations
  14.6 Rule Extraction
    14.6.1 Visualizing Neurons and Layers
    14.6.2 Local Interpretable Model-Agnostic Explanations (LIME)
  14.7 Future Directions
    14.7.1 Causal Inference
    14.7.2 Interactive Explanations
    14.7.3 Explainability Evaluation Metrics
  14.8 Fewer Parameters
  14.9 Exercises
  14.10 Further Reading

15 Repeatability and Reproducibility
  15.1 Traditional Software Engineering
  15.2 Why Reproducibility Matters
  15.3 Reproducibility Standards
  15.4 Achieving Reproducibility
  15.5 Beyond Reproducibility
  15.6 Exercises
  15.7 Further Reading

16 The Curse of Training and the Blessing of High Dimensionality
  16.1 Training Is Difficult
    16.1.1 Common Workarounds
  16.2 Training in Logarithmic Time
  16.3 Building Neural Networks Incrementally
  16.4 The Blessing of High Dimensionality
  16.5 Exercises
  16.6 Further Reading

17 Machine Learning and Society
  17.1 Societal Reaction: The Hype Train, Worship, or Fear
  17.2 Some Basic Suggestions from a Technical Perspective
    17.2.1 Understand Technological Diffusion and Allow Society Time to Adapt
    17.2.2 Measure Memory-Equivalent Capacity (MEC)
    17.2.3 Focus on Smaller, Task-Specific Models
    17.2.4 Organic Growth of Large-Scale Models from Small-Scale Models
    17.2.5 Measure and Control Generalization to Solve Copyright Issues
    17.2.6 Leave Decisions to Qualified Humans
  17.3 Exercises
  17.4 Further Reading

A Recap: The Logarithm

B More on Complexity
  B.1 O-Notation
  B.2 Kolmogorov Complexity
  B.3 VC Dimension
  B.4 Shannon Entropy as the Only Way to Measure Information
  B.5 Physical Work
    B.5.1 Example 1: Physical View of the Halting Problem
    B.5.2 Example 2: Why Do We Expect Diamonds to Exist?
  B.6 P vs NP Complexity

C Concepts Cheat Sheet

D A Review Form That Promotes Reproducibility

Bibliography

Index
List of Figures

Fig. 1.1  This cover of Harvard Business Review from October 2012 allegedly started the profession of the data scientist

Fig. 1.2  The scientific method (upper cycle) augmented with modern computational tools: simulation (left cycle) and machine learning (right cycle). Conceptual credit: Jeff Hittinger (LLNL)

Fig. 2.1  The traditional scientific process. A human observes and records experimental factors and experimental outcomes pertaining to the possible understanding of a phenomenon, which is formulated as a question. Later, with what is perceived to be enough experiments performed, a scientist tries to generalize the recorded table of experimental inputs and outcomes into a rule, usually formulated as a mathematical formula combining the factors of the experimental inputs to result in the experimental output. Photo: NIST, Public Domain. Scientist graphics: By DataBase Center for Life Science (DBCLS), https://doi.org/10.7875/togopic.2017.16, CC BY 4.0

Fig. 2.2  The automated scientific process. The human still observes and records experimental factors and experimental outcomes pertaining to the possible understanding of a phenomenon, which is formulated as a question. However, a machine is used to find the formula. More specifically, a finite state machine is used for this task, as explained in Sect. 2.2.1. Photo: NIST, Public Domain

Fig. 2.3  Example of a finite state machine that determines whether a binary number has at least two "0"s. The starting state $s_0$ is at the open arrow. The only final state in $F$ is $S_1$, depicted with a double circle. $S_2$ is a regular state. That is, the machine halts once a second "0" is found or once the input sequence terminates (which would be in $S_2$ if there are fewer than two "0"s)

Fig. 2.4  The data table converted into a finite state machine by assigning a state transition to each row of inputs $(x_1, \ldots, x_k)$ in the table, leading to either an accepting state or a regular state, depending on $f(\vec{x})$. Only class 1 results in accepting states. This finite state machine represents the table with maximum accuracy but no generalization

Fig. 2.5  The data table converted into a finite state machine by assigning one state transition for all rows of input. Every input results in an accepting state. This finite state machine represents the table with maximum generalization but minimal accuracy, as the accuracy equals the frequency of outcome 1. The accuracy can be maximized by defining the most frequent outcome as the accepting state

Fig. 2.6  The data table converted into a finite state machine by finding a common property of the input rows for each class and defining the state transitions using the property. If the common property perfectly describes the set of inputs for each class, the accuracy is maximal. If we are able to use the minimum number of state transitions, we maximize generalization

Fig. 3.1  Breakout is a classic computer game wherein the player controls a bar that can move either left or right. The objective of the game is to eliminate all bricks using the ball, which is bounced off the bar. A player loses one of their three lives if the bar fails to make contact with the ball. In the context of machine learning, this game scenario can be interpreted as a classification task. The environmental state of the game (the positions of the bar, ball, and bricks) needs to be classified into one of three potential actions: move the bar to the left, move it to the right, or keep it stationary. The model is trained during gameplay, applying a technique known as reinforcement learning. Picture from: https://freesvg.org/breakout, Public Domain

Fig. 3.2  The black-box, supervised machine learning process is data and model agnostic. This means that neither the data nor the model is analyzed for fitness to be learned or to generalize with the specific machine learner until the system is completely in place

Fig. 3.3  Example of a decision tree with two input features and a binary output

Fig. 3.4  Three different activation functions to smooth out the step function (on the left): a sigmoid, a Rectified Linear Unit (ReLU), and the Tangens Hyperbolicus (tanh)

Fig. 3.5  Precision, recall, sensitivity, specificity: the left half of the image with the solid dots represents individuals who have the condition, while the right half of the image with the hollow dots represents individuals who do not have the condition. The circle represents all individuals who tested positive. By FeanDoe, modified version of Walber's Precision and Recall, https://commons.wikimedia.org/wiki/File:Precisionrecall.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=94134880

Fig. 4.1  The line, square, and cube can all be partitioned into miniature (self-similar) versions of themselves. While we can choose the number of miniature versions $I$, to reconstruct the original using the miniature versions we need $N = I^D$ miniature versions. We call $I$ the magnification factor and $D$ the dimensionality (see Definition 4.16)

Fig. 5.1  Shannon's communication model applied to labeling in machine learning. A dataset consisting of n sample points and the ground truth labeling of n bits are sent to the neural network. The learning method converts it into a parameterization (i.e., network weights). In the decoding step, the network then uses the weights together with the dataset to try to reproduce the original labeling

Fig. 5.2  All 6 Boolean functions of 2 variables that have target columns with equiprobable outcomes (highest entropy). As explained in Sect. 5.2, $f_6$ (XOR) and $f_9$ (NXOR) have the highest mutual information content (4 bits) and can therefore not be modeled by a single neuron, which has a capacity of about 3.8 bits

Fig. 5.3  The shape of the $C(n, d)$ function (see Theorem 5.1). The y-axis represents the probability of memorization $P$. The x-axis represents the number of points $n$ over the dimensionality $d$. As explained in Sect. 5.1.2, $P = 1$ as long as the number of points is smaller than or equal to the number of dimensions (that is, $\frac{n}{d} \le 1$). When the dimensionality $d$ increases (various colored lines), $P = 1$ is pushed beyond $\frac{n}{d} = 1$ to $\frac{n}{d} = 2$ (where $P = 0.5$, see exercises). In fact, as explained in (MacKay, 2003), as $d \to \infty$, one can guarantee to memorize almost up to 2 points per parameter. Note that the curves are asymptotic with the x-axis for $\frac{n}{d} \to \infty$

Fig. 6.1  Generalization of 9 points using a distance function and a generalization distance $\varepsilon$. Each color represents a class. This is a special version of a so-called Voronoi diagram. Similar to the logical generalization definition, this Voronoi diagram allows overlapping generalization regions. These are indicated in blended colors. Note that nearest neighbor algorithms can also contain undefined neighborhoods. Image: Balu Ertl, CC BY-SA 4.0

Fig. 6.2  Left: visualization of a training table of 180 rows, 2 columns, and a balanced target function of 3 classes (each its own color). Middle: 1-nearest-neighbor classification of the dataset. Right: training points memorized (aka nearest-neighbor parameters) reduced to 31 by eliminating points that do not change the final decision (empty circles), except for increasing the generalization distance with regard to a representative point (squares), and points that can be classified as outliers (crosses). Outliers are defined here as those points where all three nearest neighbors of that point belong to one other class. The bottom left corner shows the numbers of the class outliers, representatives, and absorbed points for all three classes. Images: Agor153, CC BY-SA 3.0, via Wikimedia Commons

Fig. 6.3  A human visual system perceives a length difference between the vertical line and the horizontal line, even though objectively there is none. Optical illusions such as the one shown are contradictions to the assumptions underlying the models in our brain. That is, they can be interpreted as adversarial examples for biological models. From: Wikipedia, Public Domain

Fig. 7.1  Visualization of the impossibility of universal and lossless compression. Compressing all files of 2 bits length to 1 bit length requires 4 file conversions but only allows 2 without loss. The compression on the left is an example of lossless compression, as all compressed files can be reverted to their originals. However, this scheme is not universal, as it is not able to compress all files: $file_{c3}$ and $file_{c4}$ maintain their lengths. The scheme on the right universally compresses all files, but uncompressing (traveling backward on the arrows) is not possible without loss of information. One can choose a universal or a lossless compression scheme, not both

Fig. 7.2  Joking example of correlation vs. causation: dinosaur illiteracy and extinction are correlated, but that does not mean the variables had a causal relationship. Photo by: Smallbones, Creative Commons CC0

Fig. 8.1  The threshold or bias $b$ counts as a weight $w_i$ because it can easily be converted into one. Left: original neuron with bias. Right: the same neuron with the bias converted into a weight

Fig. 8.2  The single neuron (a) has 3 bits of memory-equivalent capacity and can therefore guarantee to memorize any binary classification table with up to 3 rows (more when removing redundancies). The shortcut network (newer term: residual network) (b) has $3 + 4 = 7$ bits of capacity because the second-layer neuron is not limited to the output of the first neuron. The 3-layer network in (c) has $6 + \min(3, 2) = 8$ bits of capacity. The deep network (d) has $6 + \min(6, 2) + \min(3, 2) = 10$ bits of capacity. The multi-class classifier network in (e) has $4 \cdot 3 + \min(4 \cdot 3, 3) + \min(4 \cdot 3, 3) = 12 + 3 + 3 = 18$ bits of capacity. That is, it can memorize all 8-class classification tables of maximally 6 rows (3 bits of output per row). The oddly shaped network under (f) has $3 + 1 + 4 = 8$ bits of MEC based on the rules defined in this chapter. However, the neuron in the middle really does not add anything, as it is shortcut. Therefore, a tighter bound would be $3 + 4 = 7$ bits (see also example b)

Fig. 8.3  Visualization of two neural networks with 24 bits of memory-equivalent capacity (output layer hidden behind result) approximating a 2-dimensional inner circle of points. The 1-hidden-layer network approximates the circle with a set of straight lines. The 3-hidden-layer network shows a growing complexity of features (discussed in Sect. 9.1)

Fig. 9.1  Conceptual idea of deep learning: while the first layer contains linear separators and forms hyperplanes, further layers are able to combine features from previous layers, effectively creating manifolds of increasing complexity. Image from Schulz and Behnke (2012)

Fig. 9.2  Conceptual depiction of a typical convolutional neural network architecture. Different lossy compression operations are layered in sequence (and partly specialized during the training process) to prepare the input data for the (fully connected) decision layer, which consists of regular threshold neurons. Aphex34, CC BY-SA 4.0, via Wikimedia Commons

Fig. 9.3  A residual block with a skip connection, as part of a ResNet, to avoid the vanishing gradient problem

Fig. 9.4  A simple generative adversarial network architecture. Note the use of the word "generative" for a noise filter, as explained in the text

Fig. 9.5  An example of an autoencoder architecture. The input data $(x_1, x_2, \ldots, x_n)$ are encoded into a lower-dimensional space $(z_1, z_2, \ldots, z_k)$ through the encoder hidden layers $(h_1, h_2, \ldots, h_m)$. The decoder hidden layers $(h'_1, h'_2, \ldots, h'_m)$ then reconstruct the input data, resulting in the output $(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$. The optimization process minimizes the reconstruction error between the input data and the output

Fig. 9.6  A high-level illustration of the transformer architecture. The input sequence $(x_1, x_2, \ldots, x_n)$ is encoded by encoder layers (Enc) into a continuous space representation $(z_1, z_2, \ldots, z_k)$. The decoder layers (Dec) then generate the output sequence $(y_1, y_2, \ldots, y_n)$

Fig. 10.1  In this example, the data table has 3 columns and 2 rows. The decision tree has a depth of 4, with each level corresponding to a column in the table and each node at a level corresponding to a value in that column. The leaves of the tree correspond to the output values $f(x_i)$ for each row in the table

Fig. 10.2  Generalized form of a perfect binary decision tree of depth 2. Each inner node represents a decision, and each leaf node represents an outcome that would lead to an output

Fig. 11.1  A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias on a log-log scale. An approximate Zipf distribution is shown in every single one of them (Source: SergioJimenez, CC BY-SA 4.0, Wikimedia Commons)

Fig. 11.2  A plot of the outcome of word frequencies of "random typing" by B. Mandelbrot. The logarithmic y-axis shows the word frequency $p_l$ (compare Theorem 11.1), and the logarithmic x-axis shows the rank. Consistent with the approximate result in Theorem 11.1, the resulting plot through an equal-percentile rank results in a line $l$ with slope $s = -1$ (Source: "The Zipf Mystery" by VSauce (Michael Stevens) on YouTube, 2015)

Fig. 11.3  Mental model for constant-rate communication with a Zipf distribution. The brain could use a data structure isomorphic to a minimal-effort decision tree. For the decision tree to be extendable with minimal effort, it grows linearly. By putting the most frequently occurring language element at the earliest branch, the second-most frequent at the second branch, and so on, the expected access (path) length is held constant

Fig. 12.1  Three examples of typical capacity progression curves: red is the baseline, green is highly generalizable, and black is in between

Fig. 12.2  Capacity progression on the original and two modifications of the Titanic dataset, calculated by Algorithm 11: red crosses show the capacity progression for random survival, black crosses the progression for the original dataset, and green dots the progression for a synthetic dataset where the gender determines the survival

Fig. 13.1  Adversarial examples for a stop sign detector (Evtimov et al. 2017). Examples like these should be part of quality assurance for predictors in an MLOps process. From https://www.flickr.com/photos/51035768826@N01/22237769, License: CC BY 2.0, Credit: Kurt Nordstrom

Fig. 14.1  Heatmapping the correlation between genes and their expressions

Fig. 14.2  LASSO regularization path: this plot shows the coefficients of the LASSO model for each feature as the regularization parameter ($\log_2(\alpha)$) quantizes by more bits. As the value of $\alpha$ increases, more coefficients shrink to zero, resulting in a sparser and more interpretable model. A feature (experimental variable) that "lasts longer" also has a higher significance for the outcome of the experiment. See also the discussion in Sect. 14.8

Fig. 15.1  Knowledge creation obeys an implication hierarchy: language is required to understand philosophy, math formalizes philosophy, and physics depends on math, chemistry on physics, and biology on chemistry. A model created in one field that implies a contradiction to a field that is higher in the hierarchy is typically considered wrong, independent of whether this model has been created manually or automatically

Fig. 16.1  While training is difficult in general, with the right assumptions and knowledge about the problem, even exponential decay of the error is possible. Such exponential decay implies a logarithmic number of training steps in the complexity of the trained data

Fig. 16.2  The blessing of high dimensionality: as discussed in Sect. 5.2, the 2-variable Boolean functions $f_6$ (XOR) and $f_9$ (equality) have an information content of 4 bits and can therefore not be modeled by a single 2-input neuron, which has a capacity of about 3.8 bits. However, by projecting the input into a higher-dimensional space (here, one-hot encoding the input), the single neuron's capacity can be increased by another input weight, and it becomes able to model both functions, thereby showing that a single neuron can model all 16 Boolean functions, despite what was prominently discussed in Minsky and Papert (1969)

Fig. B.1  An image of a feature inside the Mandelbrot set (often called a Julia set). Generating the fractal can be done using low Kolmogorov complexity but high computational load. Once it is generated, lossless compression cannot (even closely) reverse the string describing the image back to the length of the program that generated it. Binette228, CC BY-SA 3.0, via Wikimedia Commons

Fig. B.2  Expected number of clauses in k-CNF formulas as a function of k. The exponential growth is slowed down insignificantly by increasing k
Chapter 1
Introduction
In October 2012, the Harvard Business Review (Davenport and Patil 2012) published an article Data Scientist: The Sexiest Job of the 21st Century that is often cited as the beginning of data scientist being a term for describing a profession on the job market. The term “data scientist” had been used in various capacities prior to this, but this article mainly contributed to the role gaining mainstream recognition as a distinct profession. It has since become a highly sought-after job role in many industries due to the increasing importance and volume of data in decision-making processes. “Originally presented as a catalyst for change in management processes, it swiftly became apparent that data could enhance objectivity in decision-making. Consequently, the term ‘data science’ gained widespread popularity, leading to advocacy efforts from practitioners and researchers across various disciplines. The most significant academic initiatives came from the fields of statistics and computer science. This prompted many educational institutions to offer classes and develop curricula centered around data science. As of the writing of this book, numerous universities offer data science degrees. However, the consensus on defining the boundaries and contents of this emerging field is still in the process. Often, when data science is taught, it is typically explained through the lens of specific tools used for certain tasks. This has led to many books of the type “Using data tool XYZ to boost sales in your business.” On the other hand, there is a wealth of well-written statistics books that present the algebraic fundamentals necessary for various forms of uncertainty modeling. This book aims to present data science from a unique angle, focusing on the intricate relationship between data and scientific principles. The fundamental objective is to elucidate the innate characteristics of information and uncertainty, along with their implications on computational modeling and problem-solving decision processes. From a fundamental computational engineering standpoint, the methods introduced in this book are grounded in the laws that govern information. What makes data science and machine learning challenging is the level of abstraction beyond conventional mathematics. Indeed, many of the concepts dis© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 G. Friedland, Information-Driven Machine Learning, https://doi.org/10.1007/978-3-031-39477-5_1
Fig. 1.1 This cover of Harvard Business Review from October 2012 is often credited with creating the profession of the data scientist
discussed in this book could be regarded as “meta-math”: essentially, mathematics about mathematics. As an example, traditional math often entails examining a function or a set of functions to derive applicable properties, which can then be leveraged for further insights. In contrast, this book poses questions such as: “How many functions can this machine learning algorithm model?”, “What types of functions are these?”, and “How can we expand the range of functions it can model?” To answer these questions, we need to understand concepts that abstract away from concrete mathematical mechanisms and provide us with insights and boundaries when considering all possible mathematical mechanisms of a certain length or property. A well-known, yet often misunderstood, example of this concept is Shannon entropy, which measures the minimum complexity of a modeling space given an alphabet and a probability distribution over that alphabet. We will delve deeper into this and similar concepts in Chap. 4. With that said, we are ready to work on an understanding of the terms science and data science as used throughout the remainder of this book.
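As a small preview of Chap. 4, Shannon entropy is straightforward to compute. The following is a minimal sketch (the function name and example distributions are ours, chosen purely for illustration):

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits: H = -sum(p * log2(p)) over an alphabet's distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is more predictable
print(shannon_entropy([0.25] * 4))   # 2.0 bits: a fair four-sided die
```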
1.1 Science

Science is a disciplined approach aimed at understanding phenomena. Its primary goal is to predict the occurrence and properties of future phenomena by generalizing from those studied in the past. This is achieved by establishing a reproducible and logically sound relationship between a set of observable events and a model capable of describing these events.
The term “science” is often misused to lend apparent credibility or importance to other concepts or disciplines. However, the definition of science is precise: a discipline that strictly adheres to the scientific method can be called a science. The scientific method is the cornerstone of any scientific endeavor. It works on the premise that a theory is confirmed by verifying its predictions against empirical evidence. Alternatively, experimental observations are generalized into a theory that predicts future observations under similar parameters. The method was formalized a millennium ago, during the so-called Islamic Golden Age: an era during which much of the historically Islamic world was characterized by flourishing science, economic development, and cultural progress. Ibn al-Haytham, known as Alhazen (A.D. 965–1039), is considered by some science historians to be the father of modern scientific methodology. In the Western world, we often credit Galileo Galilei (A.D. 1564–1642) with introducing the idea of studying natural phenomena following a simple workflow:
1. Observation
2. Hypothesis
3. Experiment
4. Conclusion
1.1.1 Step 1: Observation

An event happens that can be observed, either directly (sensory experience) or indirectly (e.g., using a measurement instrument). Such an unexplained event is often called a phenomenon. In the modern world, records of observations are called data. Information is data that reduces uncertainty; more on that in Chap. 4.

Example I observe that stones always fall considerably faster than feathers.
1.1.2 Step 2: Hypothesis

A hypothesis (from Ancient Greek ὑπο- (hypo-): underlying, and θέσις (thesis): reason) is a guess about the underlying cause of a phenomenon.

Example Aristotle (384–322 B.C.) immediately concluded that this phenomenon happens because heavier bodies fall faster than lighter bodies.
1.1.3 Step 3: Experiment

An experiment (from Latin experior: to try) is a trial aimed at verifying the hypothesis. For an experiment to be called “valid,” it must reproduce the phenomenon of interest accurately, reproducibly, and in a controlled environment. These terms will be further discussed in Chaps. 4 and 14.

Example Contrary to Aristotle, who jumped straight to conclusions based on the apparent evidence, Galileo felt that there was a logical problem with that explanation. According to Aristotle’s logic, a feather connected to a stone should have hit the ground at about the average of the times of the standalone stone and feather. To prove Aristotle wrong, he had the idea to use spheres of different masses and an inclined plane that allowed for easier measurement of the time of fall, such that the same object in the same condition will always take the same time to pass from motion to rest. Galileo then measured the time for spheres of different masses to fall from the same height and noticed that the time for each sphere to pass from motion to rest never changed. He also noticed that for twice the height, the time to fall was not twice as much, but less.
1.1.4 Step 4: Conclusion

Conclusions are the generalizations that can be inferred from the results of an experiment interpreted with respect to the hypothesis. It is important to remark that an experiment alone is useless without a hypothesis, and a hypothesis alone is not sufficient to draw conclusions. Also, single experiments usually do not allow for generalization.

Example Galileo could conclude that Aristotle was wrong, as objects with different masses would ultimately hit the ground at the same time. Moreover, he discovered a quadratic relationship between the height of the fall and the time to fall, today explained by the “acceleration of gravity,” and then explained more generally by Newton a century after Galileo. The key lesson is that human perception does not always reflect the truth. What one perceives or believes often represents a manifestation of an underlying natural rule, which is usually simpler than the apparent evidence. It is a scientist’s duty to uncover such underlying rules. This challenging task must be carried out even amidst societal pressures, which often stem from a limited focus on a subject and thus, by default, favor simpler, perhaps Aristotelian, views.
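In modern notation, Galileo’s quadratic relationship reads $h = \frac{1}{2}gt^2$, so the time to fall is $t = \sqrt{2h/g}$. A quick numerical check of his observation that doubling the height does not double the time (a minimal sketch; the constant and function name are ours):

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def fall_time(height_m):
    """Time to fall from rest, ignoring air resistance: h = (1/2) * g * t^2."""
    return math.sqrt(2 * height_m / G)

print(fall_time(1.0))  # ~0.45 s
print(fall_time(2.0))  # ~0.64 s: sqrt(2) times longer, "not twice as much, but less"
```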
1.1.5 Additional Step: Simplification

Though not a strict requirement, one of the principal guidelines in the scientific method is the pursuit of simplicity, a notion often referred to as Occam’s razor. While Occam’s razor is not seen as an inviolable principle of logic or a scientific outcome, the preference for simplicity within the scientific method is rooted in the concept of falsifiability: for every accepted explanation of a phenomenon, an infinite number of increasingly complex alternatives exists. William of Ockham (c. 1287–1347) was an English Franciscan friar, theologian, and philosopher. Interestingly, the spelling discrepancy between his name and the modern reference to his principle of simplicity is a product of the complexity of his name itself.
1.2 Data Science

The core process of the scientific method has endured over centuries, but the tools available to scientists have evolved significantly. Notably, the emergence of automatic computation has empowered theorists to partially validate even complex assertions against simulated experiments, thereby enhancing efficiency by mitigating the need for constantly conducting costly, time-consuming, or hazardous experiments. In recent years, computational data analysis methods have matured to a point where they are considered an extension of the observational phase of the scientific cycle: automated and semi-automated data analysis tools can potentially forecast outcomes based on a set of observations. In simpler terms, the methods often referred to as data analytics, machine learning, or Artificial Intelligence (AI) aim to formulate hypotheses that can be validated with future experimental observations. Figure 1.2 visually represents this concept.
Data is the recording of an observation obtained from an experimental setting. The problem is that data are not always gathered using the strict process outlined above. For example, in practice, a hypothesis might be developed after the observations were made. Furthermore, there might be too many observations to analyze manually. We therefore define data science as using the scientific method to develop computational tools to assist humans in the understanding of given observations.

Definition 1.1 Data Science: The science of creating computational tools to augment the interpretation of observations.
1.3 Information Measurements
Fig. 1.2 [Diagram: three linked cycles of observation and prediction connecting Experiments, Theory, Simulation, and Machine Learning.] The scientific method (upper cycle) augmented with modern computational tools: simulation (left cycle) and machine learning (right cycle). Conceptual credit: Jeff Hittinger (LLNL)
While it is certainly useful to apply computational tools to, say, anecdotally correlate a vast array of sales figures with a variety of advertising timings and formats, the field of statistics and its applications have already produced a plethora of practical resources. The focus of this book, however, is to impart an understanding of how computer science, information theory, and physical methodologies can be harnessed to model processes in data science in general, rather than relying solely on anecdotal evidence. Physics offers mental tools that are typically outside the purview of computer scientists, such as the concept of equilibrium (see Definition 4.10). Computer science, on the other hand, acknowledges the value of projecting computational problems into higher dimensional spaces, an approach unique to the field. In fact, the projection into binary numbers was fundamental to the inception of the field. Information theory (see Chap. 4) offers the mathematical framework required to integrate these two perspectives. While philosophers and scientists have spent centuries understanding the characteristics of robust models, these valuable insights are often overlooked or even resisted in computer engineering and machine learning practices. As a result, contemporary models frequently require a vast number of observations (propagating the notion that “there is no data like more data”) and contain billions of parameters, essentially delegating the act of understanding to a machine. Over the centuries, the fundamental sciences have developed an impressive array of effective models to predict and describe nature’s processes. Hence, it seems somewhat naive to take observations from a well-designed, expensive experiment, treat them as raw numbers, feed them into a “black box” data analysis algorithm that was conceived just a few years ago, and trust that the output will be reliable. To maintain the rigor of machine learning, it is crucial to understand the limitations of a model, identify circumstances under which it fails, and know how to enhance its performance. Furthermore, without a human understanding of the decisions involved in making a prediction, a machine-driven theory would not fulfill the primary purpose of the scientific endeavor.
This book is crafted to address the aforementioned problem. The approach taken is to define and present information measurements. When an engineer constructs an airplane, a house, a bridge, or a ship, everything starts with measurements. However, this is not typically the case when a machine learning engineer develops a model. Utilizing the information measurements presented in this book will not only help verify and discuss the experimental design of machine learning experiments but ultimately facilitate the full automation of the model-building process, eliminating a great deal of guesswork, including what is commonly referred to as the tuning of hyperparameters. The information model this book adheres to is as straightforward as it is practical: a table filled with experiments comprising experimental factors and observations (training data) is recoded into a finite state machine (a machine learning algorithm operating on a classical computer). The objective of this recoding is generalization: being able to predict the (potential) observation when the experimental factors differ from those included in the table. As one can imagine, science, particularly physics, has significant insights to offer about the table of experimental data and the quality of predictions and models. Computer science is comprehensive about the finite state machine, and information theory offers substantial understanding of the recoding process. From here, we can derive the properties of specific algorithms such as neural networks. The structure of this book is linear, designed to be read from beginning to end. This is important because some of the concepts discussed often share similarities with the same concept (though named differently) in a different field. This interdisciplinary viewpoint sometimes necessitates abstraction from the semantics of the specific field, and I therefore take the liberty to rename a concept to align it with the context of this book. Once discussed, the new name is consistently used throughout the book. For instance, in Chap. 5, I define Memory Equivalent Capacity (MEC), which will then be used throughout the rest of the book. The exception to this linear progression is Chap. 3, which sweeps over knowledge that is well known to most machine learning practitioners and is here mostly for reference. Furthermore, the appendices provide further information. For example, a simplified summary of the concepts introduced in this book can be found in the form of a cheat sheet in Appendix C. A more in-depth discussion of the relationship between various complexity measures can be found in Appendix B. I hope you enjoy reading as much as I enjoyed writing!
1.4 Exercises

1. Repeat Aristotle’s experiment using a single sheet of letter or A4 paper (usually around 5 grams) that is folded into different shapes. Describe how one would measure the speed of the fall as a function of the shape of the paper.
1.5 Further Reading

Here is further reading on the history mentioned in this chapter:
• Karl R. Popper, “Conjectures and Refutations: The Growth of Scientific Knowledge”, Routledge, 2003
• Albert Einstein: “On the Method of Theoretical Physics”, in Essays in Science (Dover, 2009 [1934]), pp. 12–21
• Roger Ariew: “Ockham’s Razor: A Historical and Philosophical Analysis of Ockham’s Principle of Parsimony”, University of Illinois Champaign–Urbana (1976)
• Roshdi Rashed: Ibn al-Haytham’s Geometrical Methods and the Philosophy of Mathematics: “A History of Arabic Sciences and Mathematics”, Volume 5, Routledge (2017), p. 635
Chapter 2
The Automated Scientific Process
In the preceding chapter, we took a glance at the elements of the scientific method. This chapter delves deeper into how this ancient process has undergone a significant transformation in the recent past, particularly due to the advent of computers. Furthermore, we aim to provide a mathematical formulation for this change. This effort paves the way for understanding and applying the measurement and quantification concepts that are integral to this book and crucial for the systematic experimental design of machine learning experiments. In essence, this chapter lays the foundation for engineering the automated scientific method.
2.1 The Role of the Human

Before we formalize the automated scientific process, let us start with a formalization of the original scientific method. The components of the process, as outlined in the previous chapter, are: observation, hypothesis, experiment, and conclusion.
2.1.1 Curiosity

In the context of this book, which revolves around automation, we will consider that observation and hypothesis are always available inputs. The question of why an observed phenomenon triggers a person to question their surroundings and form a hypothesis is, in the end, an attempt to explain the essence of curiosity. From a statistical standpoint, curiosity seems to offer an evolutionary edge, but beyond that, curiosity is so context-specific that we might as well regard the individual motivations behind it as random. The upside of this is that it earmarks a distinct
and irreplaceable role for humans in the automated scientific process: that of posing intriguing questions. On a side note: introspecting just the English language, we find that there are many ways of asking questions, and we even have our own category of words to indicate what kind of property we would like to have introspected: question words. The main question words are:
• What (for a thing, when there are many things)
• Which (for a thing, when there are not many things)
• Who (for a person)
• Where (for a place)
• When (for a time)
• How (for a method)
• Whose (to ask about possession)
• Why (for a reason)
So, at this point, we have identified three crucial conceptual steps that need to be performed before we can even think of delegating scientific discovery to a machine:
1. Identify a physically observable phenomenon.
2. Form a hypothesis.
3. Ask the question that allows us to verify or falsify the hypothesis.
We will find that it is not always easy to automate the answering of all possible questions with machine learning but, in general, we can. Of the questions listed above, the most important one, and also the hardest to answer, is the “why” question. Responding to a why question satisfactorily usually requires establishing causality, which is inherently difficult for both humans and machines. This will be explained in further detail in Sect. 7.3. Furthermore, societal influence has trained us not to ask too many why questions and instead give in to ignorance in favor of a higher power (imagine soldiers asking “why?” every time they receive a command). So we should expect some questions to feel more or less comfortable and some of them even to receive societal backlash. However, none of these obstacles should prevent us from making “why?” one of the highest-priority questions. More on that in Chap. 14. Once we have the right question about a phenomenon, the next step is to perform experiments to collect data that the machine can learn from.
2.1.2 Data Collection

The method of asking questions in the scientific process has always been to perform trial-and-error experiments while observing variables that could influence the outcome. This does not change with automation. For example, if we want to respond to “How can we automatically detect tuberculosis using coughs on the telephone?,” we need to record people coughing on the telephone. Then we
let other, known-to-work tests detect whether the person coughing on the phone had tuberculosis at the time of the recording. We will find that some subjects were positive, i.e., had tuberculosis, and some subjects were negative, i.e., coughed for another reason. We collect as many samples as feasible and annotate them with the outcome of the known-to-work test. We do not discard the negatives, as they are as important as the positives, which will become clear in a moment. More on annotation in Chap. 3. Also, if data collection and/or annotation are expensive, it may be good to devise a method to know when we have done enough, which is presented in Chap. 12. Sometimes, data collection is a matter of directly recording a phenomenon, as in the tuberculosis example. Other times, it can involve finding variables explicitly, which can be tricky. For example, if someone wants to forecast the phenomenon of rising real-estate prices in an area, photos of the houses are most likely not the best or only source of data to consult. Instead, we want to take a look at factors such as year built, size of the lot, number of bedrooms, distance to infrastructure, and so on. Which of these factors have an influence and therefore should be included, versus which of them are noise, is its own set of questions. For example, is “world population” a factor? At least, annotating the variables with the actual selling price is probably a simpler task than detecting tuberculosis. In general, adding too few factors can lead to unpredictability and chaotic behavior (see Chap. 11), adding too many factors can make learning much harder (see Chap. 16), adding the wrong factors leads to bias (see Chap. 13), and knowing exactly how many and which factors to include requires us to have solved the problem.
2.1.3 The Data Table

Let us limit the confusion by at least agreeing on a formalism that describes how we will record data and present it to the machine. Ultimately, everything that is stored in a computer is stored in a one-dimensional array of singular on–off switches, which we call bits (more on that in Chap. 4). So even a time-based, multi-soundtrack, high-resolution video is ultimately serialized into that 1-dimensional memory structure. Theoretical computer science recognizes this memory structure as the “band” (tape) of a Turing machine (Sipser 1996), and many proofs show that making the band more complicated (including using multiple bands) does not add to or subtract from the expressiveness or computational power of the Turing machine and therefore to that of a computer in general (Sipser 1996). This means that 1-dimensional arrays are not only already universal but also that we can afford a slightly more convenient, that is, two-dimensional, structure: the table. The table is intuitive, as it has been used in databases and spreadsheets as a universal data structure since the early days of computing. Math recognizes it as a matrix, and whole fields of math, for example linear algebra, are dedicated to it. For the remainder of this book, we will assume that any data are given in tabularized form as follows.
Fig. 2.1 The traditional scientific process. A human observes and records experimental factors and experimental outcomes pertaining to the possible understanding of a phenomenon, which is formulated as a question. Later, once enough experiments are perceived to have been performed, a scientist tries to generalize the recorded table of experimental inputs and outcomes into a rule, usually formulated as a mathematical formula combining the factors of the experimental inputs to result in the experimental output. Photo: NIST, Public Domain. Scientist graphics: By DataBase Center for Life Science (DBCLS), https://doi.org/10.7875/togopic.2017.16, CC BY 4.0
Definition 2.1 (Data Table) A data table is a table in 1st Normal Form (1NF) where each column is dedicated to an experimental factor $x_i$ and each row is dedicated to a sample of the data, also called an instance. A data table therefore consists of n rows of D dimensions. A sample is therefore described as a vector $\vec{x} = (x_1, x_2, \ldots, x_D)$. For supervised tasks, the sample consists of input data, and an additional column is added for the experimental outcome. We will call this column $f(\vec{x})$.
The 1st Normal Form (1NF) (Codd 1970) is defined as follows:
• The table is two-dimensional with rows and columns (already guaranteed by Definition 2.1).
• Each column corresponds to exactly one attribute (experimental factor).
• Each row represents a unique instance and must be different from any other row (that is, no duplicate rows).
• All entries in any column must be of the same type. A string column contains strings, and a numeric column contains numbers.
Another important conceptual assumption that we are making in this book is that data table cells are atomic.

Definition 2.2 (Table Cell Atomicity) A data table’s cell is atomic when the cell’s content is considered indivisible.

If a cell’s value is divisible into parts (such as a string into characters or a number into digits), it is assumed to be unique in itself and only considered as a whole. This
prevents a data table’s row from expanding its dimensionality unpredictably. That is, a column can contain strings of various lengths, but each distinct string is simply treated as its own category. String similarity is not taken into account; only two cells with identical strings count as having the same value. When an $f(\vec{x})$ column is given, we assume that no two identical input rows $\vec{x}$ (if they exist) result in a different outcome. We can formalize this important constraint as:

Definition 2.3 (Functional Assumption of Supervised Machine Learning) Training data are well-formed if and only if $\vec{x}_i = \vec{x}_j \implies f(\vec{x}_i) = f(\vec{x}_j)$.

If the functional assumption is violated, two experiments with the same experimental input result in different outcomes. Mathematically speaking, $f(\vec{x})$ would not be a function and would therefore be ill-defined. Chapter 11 explains how to deal with such scenarios in the real world. Figure 2.1 describes the traditional human-based scientific process using this formalism. A set of experiments is conducted, varying experimental factors $x_i$ and then observing outcomes $f(\vec{x})$. Each experiment is recorded as a row in the table. At some point in the process, a human brain then works on figuring out an understanding of this table. If successful, the outcome is a theory, for example, in the form of a formula, that predicts future experimental outcomes given known experimental variables, to the point that it is unnecessary to perform the experiment ever again. The formula is usually accompanied by a human explanation as to which experimental variables vary the outcome most. The most important goal is to describe the experimental setup well enough that the experiment becomes reproducible. That is, we know the experimental factors, and given their values, the theoretical prediction is always observed to be true when verified with an actual experiment. For practical reasons, we can relax “always” to a certain number of times by giving the outcome an accuracy. For example, given the experimental factors age, body mass index, and vaccination status, we may be able to predict whether a person needs to be hospitalized for a disease or not with a certain accuracy. Accuracy is defined here as:

Definition 2.4 (Accuracy)
$$\text{Accuracy} = \frac{\#\text{ correct predictions}}{\#\text{ total predictions}} \times 100$$
and measured in %.¹

A prediction itself, however, does not provide understanding yet. In order to understand why a prediction is made, one has to understand how the experimental factors combine into the result. A scientific formula is exactly that. For example, in physics, force $F = ma$, where m is mass measured in kilograms and a is acceleration measured in speed increase per second, whereby speed is measured in meters per second. This gives us a clear explanation of how force is derived from the combination of the experimental factors mass and acceleration.
¹ % is not a physical unit. It is a mathematical norm. However, that does not prevent its use as a quasi-measurement unit.
We can call this the conclusion. We have explained the phenomenon of force, and it can settle in people’s brains as intuition. Moreover, this understanding, the model of force, can now be used as a building block to build more complex models. For example, direction can be added to understand how forces that work against each other combine into a single force.
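To make Definitions 2.1, 2.3, and 2.4 concrete, here is a minimal sketch of a data table, a check of the functional assumption, and an accuracy computation (assuming pandas; the column names and values are illustrative, not from the book):

```python
import pandas as pd

# A data table in 1NF: one column per experimental factor, one row per instance,
# plus an outcome column f(x) for supervised tasks.
table = pd.DataFrame({
    "age":        [25, 60, 60, 33],
    "bmi":        [22.0, 31.5, 31.5, 27.1],
    "vaccinated": [1, 0, 0, 1],
    "f(x)":       [0, 1, 1, 0],  # outcome: hospitalized (1) or not (0)
})

# Functional assumption (Definition 2.3): identical inputs imply identical outcomes.
factors = ["age", "bmi", "vaccinated"]
well_formed = (table.groupby(factors)["f(x)"].nunique() == 1).all()
print("well-formed:", well_formed)  # True: the duplicated input agrees on its outcome

# Accuracy (Definition 2.4): # correct predictions / # total predictions * 100.
predictions = pd.Series([0, 1, 0, 0])
accuracy = (table["f(x)"] == predictions).mean() * 100
print(f"accuracy: {accuracy:.1f}%")  # 75.0%
```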
2.2 Automated Model Building

The new scientific process is depicted in Fig. 2.2. Two major things change. First, we are entering a different field of human endeavor, and so the vocabulary of how we describe things changes. A row of the table, including the outcome, is now called an “instance,” and the outcomes $f(\vec{x})$ are now called “labels” or “annotations.” The experimental factors are now just elements of the input vectors $\vec{x}_i$, which are said to have a dimensionality k (even though we will find later that k is the upper bound of the dimensionality). Second, we replace the modeling brain with a computer. Instead of a scientist crunching numbers, plotting curves, and trying to understand what makes a good formula representation of the training data manually, we devise an automatic algorithm to do it. This algorithm could be as simple as linear regression, where all one tries to do is fit a straight line through the data, or as complicated as a recursive deep neural net, where one tries to fit a whole computing architecture to the data. Historically, the replacement of human labor, aka automation, has mostly succeeded where machines could do things much better than humans. For example, we use cars (“automobiles”) because they usually get us from point A to point B faster than any vehicle powered by human or animal force.
Fig. 2.2 The automated scientific process. The human still observes and records experimental factors and experimental outcomes pertaining to the possible understanding of a phenomenon, which is formulated as a question. However, a machine is used to find the formula. More specifically, a finite state machine is used for this task, as explained in Sect. 2.2.1. Photo: NIST, Public Domain
Similarly, machine learning allows us to deal with problems where traditional models, such as physical or statistical modeling, fail. These are usually problems with extremely large amounts of data and/or very high dimensionality. Examples include problems in genomics, image and video classification, and the analysis of financial transaction data. Furthermore, artificially projecting into high dimensionality can solve problems that are otherwise unsolvable (more on that in Chap. 16). Examples include generative AI bots and translation algorithms.
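As a concrete instance of the simplest such algorithm, the following is a minimal least-squares line fit (a sketch assuming numpy; the data values are made up):

```python
import numpy as np

# Toy training table: one experimental factor x, one outcome f(x).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Fit a straight line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)
print(f"model: y = {a:.2f} * x + {b:.2f}")

# The fitted line generalizes: it predicts outcomes for inputs not in the table.
print("prediction for x = 5:", a * 5 + b)
```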
2.2.1 The Finite State Machine

Similar to the way we established a convention to view everything as a table in the previous section, we now need to form another standard to handle the multitude of ways to algorithmically represent the table. Fortunately, the field of computer science offers guidance in this regard. Any machine learning algorithm, from linear regression to recursive neural networks, can be represented by a finite state machine (FSM). In general, a finite state machine has less computational power than some other models of computation, such as the Turing machine (Sipser 1996). That is, there are computational tasks that a Turing machine can do but an FSM cannot. In particular, an FSM’s memory is limited by the number of states it has. A finite state machine has the same computational power as a Turing machine that is restricted such that its head may only perform “read” operations and always has to move from left to right. These restrictions fit the task of machine learning, though: the table is both finite and always read from the experimental input $\vec{x}$ to the experimental result $f(\vec{x})$, that is, from left to right. It is important to note that machine learning is not Turing complete (Sipser 1996). That is, not every computation can be formulated as a machine learning problem. For example, $\pi$ can be computed to an arbitrary number of digits using a Turing machine. However, since $\pi$ is transcendental, no subset of $\pi$ represented in a table can be generalized using an algebraic rule to calculate an arbitrary number of digits of it. The best a machine learner can do is find a rule to learn all digits of $\pi$ given in the training data. More on that in Chap. 7. The good news is that, because of this, we do not have to solve the unsolvable halting problem (Sipser 1996) when analyzing machine learning’s complexity: representing a table using a finite state machine always halts (at least in theory). To understand this better, let us start with a formal definition of the finite state machine.

Definition 2.5 A finite state machine is a quintuple $(E, S, s_0, \delta, F)$, where:
• E is the input alphabet (a finite non-empty set of symbols).
• S is a finite non-empty set of states.
• $s_0$ is an initial state, an element of S.
• $\delta$ is the state transition function: $\delta: S \times E \to S$.
• F is the set of accepting states, a (possibly empty) subset of S.
Fig. 2.3 Example of a finite state machine that determines whether a binary number has at least two “0”s. The starting state $s_0$ is at the open arrow. The only final state in F is $S_1$, depicted with a double circle. $S_2$ is a regular state. That is, the machine halts once a second “0” is found or once the input sequence terminates (which would be in $S_2$ if there are fewer than two “0”s)
An intuitive description of the finite state machine is a directed graph modeling $\delta$. In other words, it is a set of connected arrows. The tail of an arrow contains a possible input, and the arrow is followed to its head if that input is given to the finite state machine. If a different input is given, a different arrow is followed. If an input is given that is not specified in any of the arrow tails, nothing happens. When arriving at the head of an arrow, the state changes: we are now exactly at the head of that arrow in the finite state machine and nowhere else. In order to know at which point in the graph to start, the finite state machine dedicates a special state, the initial state $s_0$. Another set of special states exists as elements of F. It is the set of accepting (or final) states. If the FSM reaches one of the accepting states, the finite state machine’s answer is “yes” or “true.” Figure 2.3 shows a finite state machine that only accepts when it is given a binary number with at least two “0”s. The set of states is $S = \{S_1, S_2\}$ and $F = \{S_1\}$. $\delta$ is visualized as a graph over the input alphabet E, which is $E = \{0, 1\}$.
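The machine of Fig. 2.3 is small enough to transcribe directly. The following is a minimal sketch (state names follow the figure; the dictionary encoding of $\delta$ is our choice):

```python
def at_least_two_zeros(bits):
    """FSM from Fig. 2.3: accept binary strings containing at least two '0's."""
    # delta: (state, symbol) -> state; s0 = start, S2 = one '0' seen, S1 = accepting.
    delta = {
        ("s0", "1"): "s0", ("s0", "0"): "S2",
        ("S2", "1"): "S2", ("S2", "0"): "S1",
        ("S1", "1"): "S1", ("S1", "0"): "S1",
    }
    state = "s0"
    for symbol in bits:
        state = delta[(state, symbol)]
    return state == "S1"  # accept iff we end in the final state

print(at_least_two_zeros("1010"))  # True: two '0's
print(at_least_two_zeros("111"))   # False: no '0's
```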
2.2.2 How Machine Learning Generalizes

Generalization is philosophically defined as follows:

Definition 2.6 (Generalization) The concept of handling different objects by a common property. The set of objects sharing the common property is called a class, and the objects are called instances.

Let us return to the challenge of representing all machine learning models. As demonstrated in the previous section, a finite state machine, in fact, goes beyond what is needed to just represent the table. This can be understood by considering that the finite state machine in Fig. 2.3 can handle inputs of arbitrary length. Given that there could be an infinite number of these, no training table could possibly represent them all. However, this does not mean that the training table cannot contain a representative sample. So machine learning models are a subset of all possible finite state machines.
Fig. 2.4 The data table converted into a finite state machine by assigning a state transition for each row of inputs $(x_1, \ldots, x_k)$ in the table to either an accepting state or a regular state, depending on $f(\vec{x})$. Only class 1 results in accepting states. This finite state machine represents the table with maximum accuracy but no generalization
What subset? Let us redirect our attention back to representing the table using a finite state machine. For the sake of this thought experiment, and without loss of generality, we will assume that the experimental results column $f(\vec{x})$ is binary. That is, we are dealing with a binary classification problem. The input vectors $\vec{x}_i$ can be any data. So, what would a finite state machine that models this table look like? Figure 2.4 shows a first attempt at modeling the table. The alphabet E is whatever the input cells of the table contain. Each of the input rows $\vec{x}$ that have the result $f(\vec{x}) = 1$ gets an arrow from the starting state to the final state. The input rows $\vec{x}$ that result in $f(\vec{x}) = 0$ are assigned to a separate FSM that accepts class 0. This can be extended to multi-class problems in the same way with more FSMs. In any case, let us analyze the properties of this algorithm. The FSM is able to get 100% accuracy on the dataset. Each experimental input that is in the table will result in the correct experimental output. However, any vector that is not present in the original dataset cannot be processed. That is, there is no generalization to even the tiniest bit of change in the input. This FSM effectively implements a dictionary or lookup table. If a word is present in a dictionary, the answer can be found. If not, the answer remains undefined. Statistics calls models with that property overfitting. In this book, we will call such a model a memorizing model. The model memorizes the training data. Figure 2.5 shows a second attempt at modeling the table. Again, the alphabet E is whatever the input cells of the table contain. The FSM contains exactly one state transition into one final state. That is, every and any input is given the same answer: “1.” The accuracy of this FSM is the best guess, as it is the same as the probability of class 1 (which we can now assume is the majority class, or make it so).

Definition 2.7 (Best Guess Accuracy) The accuracy obtainable by always predicting the class with the highest outcome frequency.

So the accuracy calls for improvement, but the generalization is excellent: any input gives an answer. That is, this model generalizes to everything. We call such a model overgeneralizing.
Fig. 2.5 The data table converted into a finite state machine by assigning one state transition for all rows of input. Every input results in an accepting state. This finite state machine represents the table with maximum generalization but minimal accuracy, as the accuracy equals the frequency of outcome 1. The accuracy can be maximized by defining the most frequent outcome as the accepting state
Fig. 2.6 The data table converted into a finite state machine by finding a common property of the input rows for each class and defining the state transitions using the property. If the common property perfectly describes the set of inputs for each class, the accuracy is maximal. If we are able to use the minimum number of state transitions, we maximize generalization
Overgeneralizing models exist in many places in society; racism, for example, or any type of stereotyping, is usually overgeneralization. Overgeneralization is a synonym for underfitting. These examples obviously constitute two extremes: 100% accuracy and no generalization vs. 100% generalization and best guess accuracy. Ideally, one would like to maximize both quantities. This implies a trade-off: achieving the highest accuracy with the minimum number of state transitions. Figure 2.6 shows an example of a finite state machine that would implement that trade-off. We return to the philosophical definition of generalization: different objects (here, the inputs) are handled by the same state transition. To define this state transition, we need to find a common property among the inputs. The common property should lead to accurate results and at the same time should capture as many rows as possible. This generalization/accuracy trade-off can be handled as a numerical optimization problem or as a problem of finding the simplest way the experimental variables work in combination with each other to create the experimental results. The former is usually associated with the field of machine learning, and the latter is usually associated with the field of human causal reasoning. Still, with both viewpoints, we have a viable explanation of why we expect a model to actually “predict,” that is, provide correct outputs for observations not recorded in the experiment. When a state transition is able to map the input to the correct result for more than one row of the training table, there is a high chance that similar rows, seen or unseen, are also mapped to the correct results.
The chance is higher the more inputs are correctly mapped by the same state transition. This concept is related to the concept of well-posedness, which we need to introduce before finishing the discussion on generalization in this chapter. Generalization will be explained in more detail in Chap. 6.
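Before moving on, here is a toy sketch of the two extreme models discussed above (the example table is made up; a memorizing model is literally a dictionary):

```python
from collections import Counter

# A labeled table: input tuple -> class.
table = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def memorizing_model(x):
    """100% accuracy on the table, zero generalization: unseen inputs fail."""
    return table[x]  # raises KeyError for any input not in the table

majority_class = Counter(table.values()).most_common(1)[0][0]

def overgeneralizing_model(x):
    """Answers any input, but only at best guess accuracy (Definition 2.7)."""
    return majority_class  # ignores the input entirely

print(memorizing_model((0, 0)))        # 1 (correct; (0, 2) would raise an error)
print(overgeneralizing_model((0, 2)))  # majority class, regardless of input
```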
Well-Posedness

The 20th-century French mathematician Jacques Hadamard, among others, believed that mathematical models of physical phenomena should have these three exact properties:
1. A solution exists.
2. The solution is unique.
3. The solution’s behavior changes continuously with the initial conditions.
Based on the contents of this chapter, for a machine learning solution to a given data table, we can now say the following:
1. A solution always exists, as with enough state transitions, we can always memorize. That is, in the worst case, we can always overfit.
2. A solution is not expected to be unique, as there could be different models that generalize equally well. However, the more general the solution, the “more unique” it is. For example, there is only one finite state machine that contains one arrow to class 0.
3. It is easy to see that for generalization to work, we cannot have a situation where a single bit of change in the input randomly results in a drastic change in prediction outcome: it would be impossible to assign more than one row to a single state transition arrow in our finite state machine. In practice, this means that, in most cases, small changes in the input only imply small or no changes in the output. Otherwise, our model will not be able to generalize. So in order to be able to find common rules between rows, we want the prediction behavior to approximately change “continuously” with a “continuous” change of the input.
We define condition 3 as well-posedness.

Definition 2.8 (Well-Posedness Assumption for Machine Learning) A machine learning problem is well-posed if and only if $|\vec{x}_i - \vec{x}_j|_A < \epsilon_k \implies |f(\vec{x}_i) - f(\vec{x}_j)| < \delta$, where $\epsilon_k$ is a set of constants, $\delta$ is a constant, and $|\cdot|_A$ is a semi-metric determined by training the machine learning algorithm.

For classification, the definition can be simplified to:

Definition 2.9 (Well-Posedness Assumption for Classification) A machine learning problem is well-posed if and only if $|\vec{x}_i - \vec{x}_j|_A < \epsilon_k \implies f(\vec{x}_i) = f(\vec{x}_j)$, where $\epsilon_k$ is a set of constants and $|\cdot|_A$ is a semi-metric determined by training the machine learning algorithm.
To illustrate Definition 2.9: Assume a model is trained on dog images and cat images. We want noise added to a cat image that changes the cat image by less than $\epsilon$ to still be classified as a cat image and not as a dog. However, some cats may be inherently closer to dogs than others, so, in practice, we may have to learn a different $\epsilon$ for a subset of cats, leading us to an indexed $\epsilon_k$. We will see in Chap. 11 that if a problem does not exhibit well-posedness, that is, small changes in the input potentially imply large changes in the output, it falls under the category of a chaotic problem that has to be dealt with using chaos theory. In contrast to Definition 2.3, the significant problem with the well-posedness definitions is that we cannot know $|\cdot|_A$ before building a model, and therefore, we cannot know whether a machine learning problem is well-posed or not. In fact, while well-posedness is a very important concept for the derivations in this book, it needs to be considered an evocative model (see Sect. 7.5). Strictly speaking, a classification problem that partitions $\mathbb{R}^n$ into its classes can never be well-posed.² For any $\epsilon > 0$, there exist $\epsilon$-environments around the boundary points of each class that also contain points of another class. For example, consider the classification of the real numbers into the two classes $x > 0$ and $x \leq 0$. Given an $\epsilon > 0$, any x with $0 < x < \epsilon$ will satisfy $|x - 0| < \epsilon$ but also $f(x) \neq f(0)$. However, in actual machine learning problems, sample points have a finite precision. They result from observations in the form of measurements and have to be represented in computer storage. Two points of different classes must have a minimum distance if their classes do not overlap, their precision being a lower bound. By choosing $\epsilon$ smaller than this distance, the problem is shown to be well-posed. In practice, any classification problem with disjoint classes is therefore well-posed. In fact, we will see this again in Chap. 16 from a different angle: projecting the input into a space of higher dimensionality can make a problem well-posed (see also Sect. 11.3.3). Applying well-posedness, we know that the fewer state transitions we need to represent our table, the higher $\epsilon$ must be. For a memorizing model, $\epsilon = 0$, $\delta = 0$, and $|\cdot|_A$ can be defined in various ways (for example, as cell-by-cell subtraction). The overly general, single-arrow finite state machine makes $\epsilon = \infty$, $\delta = 0$, and $|\cdot|_A$ can be literally anything. Again, we cannot measure $\epsilon$, as we do not know $|\cdot|_A$, but we can measure the number of state transitions that represent each row and thereby quantify the memorization/generalization trade-off (sometimes also referred to as the bias/variance trade-off) as outlined so far using the following formula:

Definition 2.10 (Finite State Machine Generalization)
$$G_{FSM} = \frac{\#\text{ correct predictions}}{\#\text{ state transitions}}$$
The higher $G_{FSM}$, the better the generalization, whereby only $G_{FSM} > 1$ guarantees that we are not memorizing the entire dataset. The process of finding the optimal G is similar to what is commonly known as Occam’s razor: among competing models that explain the data with the same accuracy, the one with the smallest description length (here: the number of state transitions) is to be favored.
² Thanks to my former colleague Lars Knipping for pointing this out.
One can easily see how this defines an iteration rule for an optimization algorithm. Definition 2.10, however, is more general, as one might obtain higher values for $G_{FSM}$ by choosing an FSM with slightly lower accuracy but a much smaller number of state transitions. Unsupervised machine learning is fundamentally the same, except that the annotation column $f(\vec{x})$ is missing from the training data. The algorithmic methods therefore change. Still, the goal is to create the same prediction for unseen data, and this means creating a finite state machine that generalizes. In later chapters, we will see how maximizing both accuracy (see Definition 2.4) and generalization at the same time leads, indeed, to the most resilient prediction model. The reader may notice already that Definition 2.10 can be seen as a compression ratio, and we are obviously reducing the description length of the table of data when using an FSM that generalizes. This will be explored in more detail later in this book, as measuring the compression ratio in bits allows us to generalize the above generalization formula to be valid for any machine learner and also to be concretely applicable to known types of machine learners, such as neural networks and decision trees. In the next chapter, we will discuss the canonical machine learning process that typically relies on just a simple measurement of accuracy. In Chap. 4, we will then find that it is possible to quantify the generalizability of a labeled dataset before modeling, which in further chapters leads us to many ways of navigating this accuracy/generalization trade-off.
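Definition 2.10 is directly computable once a model’s correct predictions and state transitions are counted. A minimal sketch (the counts are made up for illustration):

```python
def fsm_generalization(correct_predictions, state_transitions):
    """Definition 2.10: G_FSM = # correct predictions / # state transitions."""
    return correct_predictions / state_transitions

# A memorizing FSM uses one transition per (correctly answered) row:
print(fsm_generalization(1000, 1000))  # 1.0: no compression, no generalization
# A model answering the same 1000 rows with only 10 transitions compresses the table:
print(fsm_generalization(1000, 10))    # 100.0: strong generalization
```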
2.3 Exercises

1. Tables
(a) A boolean variable is a variable over the alphabet E = {0, 1}. How many states can n boolean variables assume?
(b) How many functions can be built from input vectors $\vec{x}$ consisting of n boolean variables to a binary label $f(\vec{x}) \in \{0, 1\}$?
(c) How many functions can be built from input vectors $\vec{x}$ consisting of n boolean variables to an m-class label $f(\vec{x}) \in \{0, \ldots, m - 1\}$?
(d) Assume an arbitrary table with k columns and n rows. How many different binary labelings can be created for that table?
(e) Assume a table with m columns and n rows, where each cell has an information content of q bits. How many different tables are there?
2. Partitioning
(a) How many subsets can be created from a set of n elements?
(b) Explain how creating subsets is the same as binary labeling.
(c) Assume you have n k-dimensional points in a coordinate system. How many binary labelings can you generate for these points?
(d) How many m-class labelings can you generate for the points in the previous question?
3. Finite State Machines
Draw first the table and then the smallest (arrow-and-circle) finite state machine you can think of for the following functions:
(a) Two boolean variable AND ($f(\vec{x}) = 1 \iff x_1 = 1$ and $x_2 = 1$).
(b) Two boolean variable equality ($f(\vec{x}) = 1 \iff x_1 = x_2$).
(c) Two boolean variable AND or two boolean variable equality, depending on an input parameter.
2.4 Further Reading

I recommend reading about Occam’s razor, which was actually recited from Aristotle (whom we discussed in the previous chapter), and taking a closer look at finite state machines. The YouTube video digs further into generalization:
• Charlesworth, M. J. (1956). “Aristotle’s Razor”. Philosophical Studies. 6: 105–112. https://doi.org/10.5840/philstudies1956606
• Sakarovitch, Jacques (2009). Elements of automata theory. Translated from the French by Reuben Thomas. Cambridge University Press.
• Understanding Generalization in Machine Learning: https://www.youtube.com/watch?v=UZ5vhqDKyrY
Chapter 3
The (Black Box) Machine Learning Process
In this chapter, we will outline the machine learning process as it has been employed for many years, a process we will refer to as the “black box machine learning process.” Please note that the contents of this chapter are frequently found in conventional statistical machine learning literature, so I present them here more as a reference point. Depending on your familiarity with these concepts, you might prefer to initially skip this chapter and circle back after gaining a deeper comprehension of the methodologies explored in the subsequent sections of the book.
3.1 Types of Tasks

The machine learning community in general divides tasks into supervised and unsupervised. Among the unsupervised tasks are associative memory and clustering. Among the supervised tasks are detection, classification, and regression. Let us define each of them one by one.
3.1.1 Unsupervised Learning

To understand where these tasks come from, we start with the definition of quantization.

Definition 3.1 (Quantization) Quantization is the process of constraining an input from a continuous or otherwise large set of values to a discrete or small set of values.

Quantization usually describes a simple equi-distant binning.
For example, binning the unit interval of real numbers $[0; 1]$ into a discrete set S using the rule that only the digits up to the first digit after the decimal point are kept would be called quantization. Here, all the bins are exactly pre-defined to be $S = \{\frac{0}{10}, \frac{1}{10}, \frac{2}{10}, \ldots, \frac{10}{10}\}$. Clustering is a more special form of quantization.

Definition 3.2 (Clustering) Quantization of input vectors to allow for equality comparisons between similar but not identical objects without prior labeling.

Clustering usually implies that a quantization rule is not readily known, as it is performed on high-dimensional data. The actual distribution of the bins (the clusters) needs to be estimated or learned using some rule. For example, we can try to cluster all cat images into one cluster and all dog images into another cluster. Once the clusters are formed, systems can serve as associative memory that allows for noisy queries.

Definition 3.3 (Associative Memory) A memory system that is addressed by content.

That is, we can now input any cat, and the output would be the prototypical cat, indicating that this cluster was associated with the input. Technically, clustering systems often require a hyperparameter such as how many clusters (typically denoted k) should be formed. Associative memory systems try to automatically find a set of stable states toward which the system tends to evolve, given the data. So an associative memory system would not need a hyperparameter $k = 2$ for the number of clusters when cat and dog images are given. Associative memory, just like clustering and quantization, can also be used for denoising.
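To contrast the two definitions: in quantization the bins are fixed in advance, while in clustering they are learned. A minimal numpy sketch (the data values and the k-means initialization are ours):

```python
import numpy as np

# Quantization (Definition 3.1): equi-distant binning of [0, 1] to one decimal digit.
values = np.array([0.13, 0.48, 0.52, 0.97])
print(np.round(values, 1))  # [0.1 0.5 0.5 1. ] -- the bins were pre-defined

# Clustering (Definition 3.2): the bins themselves are learned from the data.
# A tiny k-means with k = 2 on one-dimensional points:
points = np.array([0.1, 0.15, 0.2, 0.8, 0.85, 0.9])
centers = np.array([0.0, 1.0])  # initial guesses for the cluster centers
for _ in range(10):
    labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
    centers = np.array([points[labels == k].mean() for k in range(2)])
print(centers)  # ~[0.15, 0.85]: the learned "bins" (clusters)
```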
3.1.2 Supervised Learning

When the data table (see Definition 2.1) is labeled, that is, contains an $f(\vec{x})$ column, the data table implies a function from the inputs to the labels. This is further discussed in Sect. 4.4. The most frequently used supervised task in machine learning is classification.

Definition 3.4 (Classification) Quantization of a set of input vectors into a set of pre-defined classes (see also Definition 2.6).

An example would be, again, the infamous “cat” or “dog” classifier of images. However, with a supervised system, it is easier to measure accuracy, generalization, and the other metrics discussed in this book. Supervised classification is only defined for a number of classes $k > 1$. A classifier is called binary if the number of classes $k = 2$. Theoretical reasoning is often easier over binary classifiers. If $k > 2$, the classifier is called a multi-class classifier. A special case of binary classification is detection.
Definition 3.5 (Detection) Binary classification where one class is specific and the other class contains diverse inputs to serve as a baseline.

Detection is also often called identification. An example would be a cat classifier against all other animals, that is, a classifier that is used to detect whether a cat is present. On the other hand, when the number of classes becomes very high, that is, $k \to \infty$, classification becomes regression. Supervised regression predicts a number out of a range of possible numbers, for example, real estate prices based on the attributes of properties.

Definition 3.6 (Regression) Classification with a large (up to theoretically infinite) number of classes.

To understand that this list is complete, we remember that the goal of machine learning, independent of the choice of process, is to predict based on two main assumptions:
1. Two identical inputs lead to an identical output (well-definedness, Definition 2.3).
2. An input similar to a learned input leads to a similar output (well-posedness, Definition 2.8).
These two assumptions exclude various, harder-to-define tasks, such as generating art (OpenAI 2022) or building a chatbot that impresses people with sentences recombined from memorized documents. Also, since we assume the methods being used to predict are based on mathematics, automated mathematics can only work with numbers. Data that are observed in a different form have to be pre-processed and numerized as described in Chap. 11. As explained in Chap. 7, math cannot create observations. It can therefore only reduce information (see also Sect. 7.1.1). This makes any modeling task a task based on information reduction, that is, a reduction from a larger set of numbers to a smaller set of numbers (see also Sect. 7.2). All machine learning tasks are therefore special versions of quantization tasks. This includes reinforcement learning.

Definition 3.7 (Reinforcement Learning) Learning to improve behavior by classifying the current state of the environment to select an action from a pre-defined set of choices.

For example, reinforcement learning can learn to play the game of Breakout (see Fig. 3.1) by classifying the current position and past trajectory of the ball and the number and shape of blocks still on the screen into the actions $\{left, right, stay\}$ (Mnih et al. 2015). Similar to how classification involves quantizing input vectors into pre-defined classes, reinforcement learning involves learning to classify the current state of the environment and selecting an appropriate action based on that classification. However, unlike classification, where the goal is to assign a class label to each input vector, the goal of reinforcement learning is to take actions that maximize a reward signal over time.
Fig. 3.1 Breakout is a classic computer game wherein the player controls a bar that can move either left or right. The objective of the game is to eliminate all bricks using the ball, which is bounced off the bar. A player loses one of their three lives if the bar fails to make contact with the ball. In the context of machine learning, this game scenario can be interpreted as a classification task. The environmental state of the game (the positions of the bar, ball, and bricks) needs to be classified into one of three potential actions: move the bar to the left, move it to the right, or keep it stationary. The model is trained during gameplay, applying a technique known as reinforcement learning. Picture from: https://freesvg.org/breakout, Public Domain
vector, the goal of reinforcement learning is to take actions that maximize a reward signal over time. The agent learns to map each state of the environment to a set of available actions and chooses the best action based on the current state and the expected long-term rewards of each action. Therefore, reinforcement learning is a type of classification problem, in which the agent classifies the current state of the environment and selects an appropriate action from a pre-defined set of choices based on that classification.

One special case of regression or classification (depending on the setup) is forecasting.

Definition 3.8 (Forecasting) Predicting the future, given a time series of events from the past.

Forecasting is a field of its own and will not be discussed beyond Sect. 11.4.2. In fact, Sect. 11.4.2 provides the reason why forecasting is not well-posed and therefore requires a different mathematical framework than presented in this book. Unsupervised methods are tremendously interesting but are quantitatively described by other fields, such as dynamical systems. Suggestions for further reading are at the end of the chapter.

The majority of the content in this book will primarily focus on supervised classification and regression as these are currently the most widely used tasks in both industry and research. This is not to suggest that unsupervised classification is less significant. Quite the opposite, in fact. Unsupervised classification is an intriguing field of study with its own set of unique challenges and opportunities. However, for the scope of this book, we will concentrate on the more prevalently utilized supervised techniques.
3.2 Black-Box Machine Learning Process

Figure 3.2 shows the canonical machine learning process. Given a dataset, organized in a table (see Definition 2.1), the first step is to split it into training and validation set(s).
3.2.1 Training/Validation Split

One of the most important pillars of the black-box machine learning process is the splitting of the given data into training and validation sets. Furthermore, a test or benchmark set is usually held by a third party. Splitting off a validation set follows the same idea as an exam: the exam tests whether a student is able to solve the problems given in training because the student has an understanding of the underlying mechanisms. If a student were tested on the exact exercises from training, he or she could just memorize the solutions to pass the test. Memorization (as we know from Chap. 2) is overfitting. Therefore, the student as well as the model is tested on sample inputs that he, she, or it has not seen before.
[Figure 3.2: flowchart of the black-box process — Raw Data → Preprocess → Split Data → Training Data / Validation Data / Test Data; the model is tuned on the training data, checked against an error measure on the validation data, and finally produces predictions.]
Fig. 3.2 The black-box, supervised machine learning process is data and model agnostic. This means that neither the data nor the model is analyzed for fitness to be learned or to generalize with the specific machine learner until the system is completely in place
In practice, when a machine learning model is trained, the training data are used to change the parameters of the model, while the validation set is used at the end (or sometimes at regular intervals during training) to check on the progress of the training. Training continues until a certain validation accuracy is reached or until some other criterion is met, for example, a certain number of training iterations, or the validation accuracy reaching the training accuracy. Since the validation set contains input/label pairs not used to tune the parameters of the model, it is assumed that a high validation accuracy is equivalent to the machine learner having learned the function implied by the training data with higher generalization than a dictionary.

The validation set should have the same complexity as the training set, in the same way as an exam should not be harder or easier than the content learned in class. Since we do not know the complexity of the dataset or how the complexity of the dataset changes once we take a subset out for validation, the safest split is maybe a 50:50 split (training:validation). In practice, split ratios of up to 90:10 are used with a hope for the best. A better way of improving the complexity match between training set and validation set is to do cross-validation. Cross-validation methods describe different strategies for using different portions of the data to validate and train a model in different iterations. Practically, several validation sets are sampled. A third way can be to estimate the complexity of the training set and the validation set using methods such as the one presented in Sect. 8.2 and find the split that has the best complexity-estimate match between training and validation. The latter method is especially interesting when the classes of data are imbalanced.
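As a concrete illustration, the following is a minimal Python sketch of a simple shuffled split and of k-fold cross-validation, assuming the data table is given as a NumPy feature matrix X and label vector y. The function names and the 50:50 default are illustrative choices, not a prescribed API.

    import numpy as np

    def train_validation_split(X, y, validation_fraction=0.5, seed=0):
        """Shuffle the rows once, then split off a validation set."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(X))
        n_val = int(len(X) * validation_fraction)
        val_idx, train_idx = indices[:n_val], indices[n_val:]
        return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

    def k_fold_indices(n_samples, k=5, seed=0):
        """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(n_samples), k)
        for i in range(k):
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train_idx, folds[i]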
3.2.2 Independent But Identically Distributed

Behind the idea of splitting is the statistical concept of “independent and identically distributed” (“i.i.d.”): A collection of experimental outcomes is independent and identically distributed if each outcome is drawn from the same probability distribution as the others but all outcomes are mutually independent. Identically distributed means that the same formula can be used to model any set of outcomes (e.g., a Gaussian with the same mean and variance). Independent means that the samples are all independent events. In other words, they are not connected to each other in any way but through a collective rule, and knowledge of the value of one sample gives no information about the value of another, and vice versa. However, more samples make it easier to model the collective rule.
Example Let us assume we roll 10 dice at the same time.
The sum of the dice follows an approximately Gaussian distribution: the mean sum of those dice rolls, 35, is the most frequent value, and the extremes, 10 and 60, are the least frequent sums. A decision boundary could assign all rolls with a sum smaller than 30 to class 0 and those with a sum larger than or equal to 30 to class 1. The prediction task here could be to output the probability that, given the result of 5 of the dice, the class will be 0 or 1. The distribution of any large enough subset of 10-dice rolls is identical, but a single 10-dice roll conveys no information about the outcome of any of the other rolls.

The i.i.d. assumption makes it straightforward to justify why a dataset can be split into a training set and a validation set: Each sample, that is, each row of the table, is independent, so using a validation set to measure accuracy helps to make sure that training set rows are not merely memorized during training. The i.i.d. assumption and the training–validation split are an important mathematical requisite, but there are practical caveats. First, one needs to make sure that the training and validation sets are actually independent. That is, no identical instance appears in both training and validation sets. Second, both sets need to be large enough to actually be identically distributed. This condition is not empirically testable as it presents a catch-22: To know that both validation set and training set underlie the same distribution would require knowing the distribution, which in turn means having a model. However, building a model is the goal. Finally, many machine learners are not necessarily statistical in nature. For example, Decision Trees, Neural Networks, or Support Vector Machines are not required to use probabilities, nor do they actually have to model the data using randomness: They are classifiers modeling hard decision boundaries using deterministic methods, such as Linear Algebra. So why is a statistical measure imposed on them? In the black-box explanation of Machine Learning, modeling a decision boundary is an additional task on top of modeling the distribution of the input. Nothing is said about the classification or regression function that we are modeling.

An alternative approach to modeling is to not split the data but train and evaluate on the training data while minimizing the model size for generalization. While this approach certainly may increase the complexity of the training, measuring generalization separately is a much safer practice as it also increases the resilience of the model. For a detailed discussion on this, see Chap. 6. A hybrid approach, where a split is performed and generalization is still explicitly optimized, is, of course, also possible. Many times in this book we will not split into training and validation sets but purely evaluate on the training set. This is introduced in Chap. 5 as a way of making it easier to analyze the properties of a machine learning process, rather than building a prediction model.
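Returning to the dice example above, the following short simulation (a sketch, assuming fair six-sided dice) estimates the class probabilities at the decision boundary of 30:

    import numpy as np

    rng = np.random.default_rng(42)
    sums = rng.integers(1, 7, size=(100_000, 10)).sum(axis=1)  # 100,000 independent 10-dice rolls

    print("mean sum:", sums.mean())                    # close to 35
    print("P(class 0) = P(sum < 30):", np.mean(sums < 30))
    print("P(class 1) = P(sum >= 30):", np.mean(sums >= 30))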
3.3 Types of Models

In the following, we will take a look at the most-used machine learning techniques.
3.3.1 Nearest Neighbors

The easiest way to convert the data table into a model is to not model it. As explained in Chap. 6, keeping the table as a lookup table gives us 100% memorization accuracy but no generalization. However, what if we modified our approach from a lookup table with a direct equality comparison of the type

$$\text{if } \vec{input} = \vec{x}_j \text{ then } f(\vec{x}_j)$$

to a lookup table with a similarity comparison of the form

$$\text{if } |\vec{input} - \vec{x}_j| < \varepsilon \text{ then } f(\vec{x}_j)$$

(where j is the matching row number of the training data for the $\vec{input}$ vector from the validation or test data)? This approach is called the nearest neighbor approach, and, of course, the only parameter to be trained here using the black-box process is $\varepsilon$.

There is, however, another choice to make: that is, how to calculate the distance $|\vec{input} - \vec{x}_j|$. Unless there is only one element in the vector, that is, there is only a single experimental factor, vectors can be subtracted using a plethora of techniques, the most common being the Minkowski distance of order 1 or 2. The Minkowski distance is defined as follows:

Definition 3.9 The Minkowski distance of order p (where p is an integer) between two vectors $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{y} = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$ is defined as

$$D_p(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}},$$

whereby the case $p = 1$ is called taxicab (or sometimes Manhattan) distance and $D_1(\vec{x}, \vec{y}) = \sum_{i=1}^{n} |x_i - y_i|$. The intuition for this distance in two dimensions is counting the number of blocks that a taxicab has to travel both in the vertical and in the horizontal to get from point $\vec{x}$ to point $\vec{y}$. The Minkowski distance at $p = 2$ is also called Euclidean distance and gives the shortest traveling path (the length of the line) between points $\vec{x}$ and $\vec{y}$. Setting $p = 2$ reveals $D_2(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$ since $(x^2)^{\frac{1}{2}} = x$. The distances are often also referred to as L1 or L2 distances, respectively. Minkowski metrics of higher p tend to be less useful in general but can be useful in certain use cases. The choice is often based on intuition and/or empirical results based on the black-box machine learning process.

An alternative to training $\varepsilon$, which has the advantage of the prediction being able to reject inputs that are too far away, is to not train at all. Instead, we compare the similarity of the input vector to all other training data vectors and choose the output that the closest one is associated with. As a confidence measure, we could compare the outputs of the top k vectors and see if they are identical or take a majority vote on them. This approach is called k-nearest neighbors and is in frequent, practical use. It is also great as a quick baseline before building other, more sophisticated models. An obvious downside of nearest neighbors is that every time a prediction is made, the input has to be compared to all training samples. This is because the training data are not reduced to a model. All the training vectors (plus parameters needed for distance measuring) are memorized. Algorithm 1 shows the idea.
Algorithm 1 A simple implementation of k-nearest neighbors that uses a Euclidean distance metric and a simple majority vote to predict the class of each data point. Different distance metrics and voting schemes may be used to tailor the algorithm to a special use case

Require: X: array of length n containing d-dimensional feature vectors, y: a column vector of length n containing the corresponding target values, k: the number of neighbors to consider
procedure KNearestNeighbors(X, y, k)
    for all x_i ∈ X do
        distances ← calculate the distance between x_i and all x_j ∈ X
        neighbors ← the k nearest neighbors to x_i based on distances
        y_i ← the majority class among the neighbors in y
    end for
    return y    ▷ Return the predicted labels for all data points
end procedure
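For concreteness, the following is a minimal, runnable Python version of the same idea, assuming a Euclidean (L2) distance and a simple majority vote. Unlike Algorithm 1, which re-labels the training points themselves, this sketch predicts the label of a new query point; the function name is an illustrative choice.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        """Predict the label of x_query from its k nearest training neighbors."""
        distances = np.linalg.norm(X_train - x_query, axis=1)  # L2 distance to every training row
        nearest = np.argsort(distances)[:k]                    # indices of the k closest rows
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]                      # majority class

    # Example usage with a tiny data table
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([0.95, 0.9])))  # -> 1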
3.3.2 Linear Regression

The opposite of memorizing all training samples for use in prediction is trying to reduce all training samples to a minimal set of parameters: namely, the parameters of a straight line. This is called linear regression, and the line is represented with the following model:

$$f(x) = a \cdot x + b \quad (3.1)$$
Note that, for easier explanation, the above example is a 1-dimensional function. That is, it only works for data tables with $D = 1$. There are many ways to extend the definition to multiple dimensions, however. The major questions are: What are the parameters a and b, and how accurate can this simple generalization be? To train the parameters a and b to make the line fit most points, we minimize a cost function, for example, the mean squared error (MSE, see Eq. 3.22). This could be done by guessing values for a and b, calculating the results $f(x_i)$ for all $\vec{x}_i$ in the training data, calculating the MSE, and repeating until we are satisfied because the MSE is small enough. However, there is a more efficient way than randomly guessing a and b: gradient descent.
Training

The most frequent method to train linear regression is called gradient descent. It is a method of updating a and b to reduce the error function (MSE). The idea is to guess only the starting values for a and b; the values are then updated iteratively to minimize the error based on the gradient of the error function. The gradient at a point of a function is defined as the “direction and rate of fastest increase.” Consequently,
a gradient with a negative sign is the “direction and rate of fastest decrease.” To find the gradient of the error function, we take the partial derivatives of the MSE function with respect to a and b. This gives the direction and rate of fastest increase of the function at a given point. It points to the next zero crossing of the first derivative, that is, to a minimum of the error function. The resulting update rules for a and b are:

$$a = a - \alpha \cdot \frac{\partial \text{MSE}}{\partial a} = a + \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} x_i \, (Y_i - f(x_i))$$

$$b = b - \alpha \cdot \frac{\partial \text{MSE}}{\partial b} = b + \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} (Y_i - f(x_i))$$
The parameter $\alpha$ is called the learning rate and is a hyperparameter that must be specified. A smaller learning rate makes for a more careful approach to the minimum but takes more time; a larger learning rate converges sooner, but there is a chance that one could overshoot the minimum. If you do not find this specification satisfying and would rather measure the learning rate, check out Chap. 16. Algorithm 2 demonstrates linear regression using gradient descent.

Algorithm 2 A simple implementation of linear regression using gradient descent

Require: X: array of length n containing d-dimensional feature vectors, y: a column vector of length n containing the corresponding target values, α: the learning rate, e: the convergence threshold
procedure LinearRegressionGradientDescent(X, y, α, e)
    X_bias ← add a column of ones to the beginning of X
    β ← initialize the coefficients to random values
    J ← compute the mean squared error loss using β
    while J has not converged do
        β ← β − α · (1/n) · X_bias^T (X_bias β − y)    ▷ Update the coefficients
        J_new ← compute the mean squared error loss using the updated β
        if |J_new − J| < e then
            break    ▷ Exit the loop if the loss has converged
        end if
        J ← J_new    ▷ Update the loss
    end while
    return β    ▷ Return the coefficients
end procedure
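The following is a minimal runnable Python sketch of the 1-dimensional case of Eq. 3.1, trained with the update rules above; the learning rate and iteration count are illustrative guesses.

    import numpy as np

    def linear_regression_gd(x, Y, alpha=0.01, iterations=5000):
        a, b = 0.0, 0.0                                     # initial guesses for slope and intercept
        n = len(x)
        for _ in range(iterations):
            residuals = Y - (a * x + b)                     # Y_i - f(x_i)
            a += alpha * (2.0 / n) * np.sum(x * residuals)  # gradient step for a
            b += alpha * (2.0 / n) * np.sum(residuals)      # gradient step for b
        return a, b

    # Example: noisy samples of f(x) = 3x + 1
    x = np.linspace(0, 10, 50)
    Y = 3.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.5, 50)
    print(linear_regression_gd(x, Y))  # approximately (3.0, 1.0)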
3.3.3 Decision Trees

A tree, in general, is a mathematical structure consisting of a set of connected nodes. Each node in the tree can be connected to many child nodes but must be connected to exactly one parent, except for the root node, which has no parent. These constraints
Fig. 3.3 Example of a decision tree with two input features and a binary output
imply that there are no cycles (that is, no node can be its own ancestor). As a side effect, each child can be treated like the root node of its own sub-tree. A tree where each node has up to b children is called a b-nary tree, a common practical case being $b = 2$, called a binary tree. A tree is called perfect when every internal node has exactly b child nodes and all the leaf nodes are at the same level. A tree where each node represents the state of a system, and each node is connected by a threshold function, is called a decision tree.

Figure 3.3 shows an example of a decision tree with two input features ($x_1$ and $x_2$) and a binary output ($y \in \{0, 1\}$). The tree is constructed by recursively partitioning the input space based on simple threshold tests of the input features. At each leaf node of the tree, a binary output is assigned based on the majority class of the training examples that fall within the corresponding region of the input space.

Decision trees are universal in that any algorithm can be represented as a decision tree. In fact, any algorithm can be converted into a binary decision tree. When an algorithm contains loops, however, unrolling the loops into its decision tree structure can cause the decision tree to grow exponentially. For the purpose of building a model from data contained in a table (see Definition 2.1), we do not have to go to such lengths.
Training

There are several ways of building decision trees. However, the most prominent way of doing so is the so-called C4.5 algorithm (Quinlan 1993). It is also highly consistent with the information-theoretical concepts presented in this book. C4.5 builds decision trees from training data of the form $S = s_1, s_2, \ldots, s_n$ of already classified samples. That is, each sample $s_i$ is a row $(\vec{x}_i, f(\vec{x}_i))$ in the data table (see Definition 2.1) of dimension D. Entropy H (compare Eq. 4.2) over a dataset S for the algorithm is defined as

$$H(S) = \sum_{x \in X} -p(x) \log_2 p(x),$$
where S is the subset of the dataset for which entropy is being calculated, X is the set of classes in S, and $p(x)$ is the proportion of the number of elements in class x to the number of elements in set S. Note that when $H(S) = 0$, the set S is perfectly classified (i.e., all elements in S are of the same class). This property and the other notions of information used in this section are explained further in Chap. 4.

The algorithm begins with the original set S as the root node. At each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy $H(S)$ of that attribute. It selects the attribute that has the smallest entropy. The set S is then split by the selected attribute to produce subsets of the data. For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100. The algorithm then recurses on the partitioned sublists.

Algorithm 3 shows an implementation in pseudocode. It uses a function bestSplit that can be calculated by selecting the attribute that provides the highest information gain. This is done by calculating the entropy of the current set of labels, denoted as S:

$$H(S) = -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|}, \quad (3.2)$$
where C is the set of all possible classes and $|S_c|$ is the number of samples in S that belong to class c. For each value $v_i$ of the attribute being considered, the set of samples is split into two subsets $S_i$ and $S_{\neg i}$, where $S_i$ contains the samples that have the attribute value $v_i$ and $S_{\neg i}$ contains the samples that do not have that value. We calculate the entropy of each subset:

$$H(S_i) = -\sum_{c \in C} \frac{|S_{i,c}|}{|S_i|} \log_2 \frac{|S_{i,c}|}{|S_i|} \quad (3.3)$$

$$H(S_{\neg i}) = -\sum_{c \in C} \frac{|S_{\neg i,c}|}{|S_{\neg i}|} \log_2 \frac{|S_{\neg i,c}|}{|S_{\neg i}|}, \quad (3.4)$$
where $|S_{i,c}|$ is the number of samples in $S_i$ that belong to class c and $|S_{\neg i,c}|$ is the number of samples in $S_{\neg i}$ that belong to class c. We then calculate the information gain of the split:

$$IG(S, i) = H(S) - \sum_{v_i} \frac{|S_i|}{|S|} H(S_i), \quad (3.5)$$
where $|S_i|$ is the number of samples in S that have the attribute value $v_i$. We select the attribute with the highest information gain as the best split.
Note that we need to determine the possible values of the attribute being considered. If the attribute is continuous, we need to use alternatives to the entropy formula; see, for example, Sect. 4.3. Like many greedy algorithms, C4.5 does not guarantee an optimal solution (see also Sect. 7.4). It can converge upon local optima since it uses a greedy strategy, selecting the locally best attribute to split the dataset on in each iteration. Several variants of the algorithm exist, such as J48 (Hall et al. 2009), ID3, C5.0, and CART. The latter works for regression as well by utilizing a variant of the collision entropy rather than Shannon entropy (see also Sect. B.4).
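As a sketch of the computations behind bestSplit (Eqs. 3.2 and 3.5), the following Python functions compute entropy and information gain for a discrete attribute; the function names are illustrative:

    import numpy as np
    from collections import Counter

    def entropy(labels):
        """Shannon entropy H(S) of a list of class labels (Eq. 3.2)."""
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(attribute_values, labels):
        """Information gain IG(S, i) of splitting on one discrete attribute (Eq. 3.5)."""
        n = len(labels)
        remainder = 0.0
        for v in set(attribute_values):
            subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # Example: a perfectly informative attribute versus a useless one
    labels = ["cat", "cat", "dog", "dog"]
    print(information_gain(["a", "a", "b", "b"], labels))  # 1.0 bit
    print(information_gain(["a", "b", "a", "b"], labels))  # 0.0 bits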
3.3.4 Random Forests

Random forests (Breiman 2001) are a so-called ensemble learning method for classification, regression, and unsupervised learning. The idea is to construct a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned. Random decision forests correct for decision trees’ habit of overfitting to their training set as their capacity is self-regularized (see also the discussion in Sect. 10.3.4). The method is quite simple and effective and therefore enjoys widespread use in industry for various machine learning tasks. A significant disadvantage of the method is that the resulting models are hard to explain and modify after they have been trained.
Training

One of the most popular ensemble methods is called the random subspace method (Herlocker et al. 2004). Let the number of training points be n and the number of features in the training data table (see Definition 2.1) be D. Based on intuition, the user chooses a set L of models (here: decision trees) to be built. Further, based on intuition, for each individual model l, choose $n_l$ ($n_l < n$) to be the number of input points for a model $l \in L$. It is common practice to have only one value of $n_l$ for all the individual models. Ensembling then works as follows:

• Training: For each individual model l, create a training set by choosing $d_l$ features from D with replacement and train the model.
• Prediction: To apply the ensemble model to an unseen point, combine the outputs of the L individual models by majority voting or a similar method. For regression, the average can be calculated.

The pseudocode is outlined in Algorithm 4. A specialized version of random forests is called gradient boosted trees (Friedman 2001). In these algorithms, the number of trees is determined automatically as the forest is grown using gradient descent (see also Sect. 3.3.2).
Algorithm 3 Building a decision tree recursively by selecting the attribute with the highest information gain or gain ratio according to the C4.5 algorithm. If binarySplit is true, it uses a binary split on the selected attribute; otherwise, it creates a branch for each possible value of the attribute. The recursion stops when all samples have the same class, when there are no more attributes to split on, or when the current attribute does not provide sufficient gain or contains too few samples. The minimum number of samples required to create a split and the minimum information gain required for a split can be specified as parameters

Require: X: array of length n containing d-dimensional feature vectors, y: a column vector of length n containing the corresponding target values, attributes: a list of attribute names, binarySplit: boolean value indicating whether to use binary splits, minSamples: minimum number of samples required to create a split, minGain: minimum information gain required for a split
procedure BuildDecisionTree(X, y, attributes, binarySplit, minSamples, minGain)
    if all elements in y are the same then
        return a leaf node with label equal to the common value in y
    end if
    if there are no more attributes to split on then
        return a leaf node with label equal to the most common value in y
    end if
    bestSplit ← the attribute with the highest information gain or gain ratio, as described in the text
    if bestSplit does not provide sufficient gain or contains too few samples then
        return a leaf node with label equal to the most common value in y
    end if
    tree ← a decision node with bestSplit as the splitting attribute
    if binarySplit is true then
        X_bestSplit ← the column of X corresponding to bestSplit
        m ← the median value of X_bestSplit
        leftX, leftY, rightX, rightY ← split X and y on m
        tree.left ← BuildDecisionTree(leftX, leftY, attributes, binarySplit, minSamples, minGain)
        tree.right ← BuildDecisionTree(rightX, rightY, attributes, binarySplit, minSamples, minGain)
    else
        values ← the unique values of the bestSplit column of X
        for all value ∈ values do
            subsetX, subsetY ← the subset of X and y where the bestSplit attribute equals value
            if the subset contains at least minSamples samples then
                tree.value ← a sub-tree returned by BuildDecisionTree(subsetX, subsetY, attributes, binarySplit, minSamples, minGain)
            end if
        end for
    end if
    return tree
end procedure
Algorithm 4 The algorithm builds a random forest by creating numTrees decision trees. Each tree is built using a random subset of the samples, drawn with replacement, and a random subset of the features to consider at each split. For each tree, the algorithm uses the C4.5 algorithm (see Sect. 3.3.3) to select the best attribute to split on based on information gain. The tree is built recursively by splitting the data into subsets based on the selected attribute and its values. The process continues until a stopping criterion is met, such as when all samples belong to the same class or the maximum depth of the tree is reached. The algorithm returns a list of decision trees as the random forest

Require: data: array of length n containing d-dimensional feature vectors, labels: a column vector of length n containing the corresponding target values, numTrees: number of trees to generate, treeDepth: maximum depth of each tree, numFeatures: number of features to consider for each split
function BuildRandomForest(data, labels, numTrees, treeDepth, numFeatures)
    forest ← []
    for i = 1 to numTrees do
        subsetX, subsetY ← randomly select numSamples samples from data and the corresponding labels, with replacement
        tree ← BuildDecisionTree(subsetX, subsetY, treeDepth, numFeatures)
        add tree to forest
    end for
    return forest
end function

function BuildDecisionTree(X, y, treeDepth, numFeatures)
    if all elements in y are the same or the maximum depth is reached then
        return a leaf node with label equal to the common value in y
    end if
    bestSplit ← the attribute with the highest information gain or gain ratio, using C4.5 on a random subset of numFeatures attributes
    if bestSplit does not provide sufficient gain or contains too few samples then
        return a leaf node with label equal to the most common value in y
    end if
    tree ← a decision node with bestSplit as the splitting attribute
    values ← the unique values of the bestSplit column of X
    for all value ∈ values do
        subsetX, subsetY ← the subset of X and y where the bestSplit attribute equals value
        if the subset contains at least minSamples samples then
            tree.value ← a sub-tree returned by BuildDecisionTree(subsetX, subsetY, treeDepth − 1, numFeatures)
        end if
    end for
    return tree
end function
Concretely, a new model l is added to L in an attempt to correct the errors of its predecessors by adjusting its decisions using gradient descent on an error function. The most successful derivative of these methods, called XGBoost (Chen and Guestrin 2016), does not use gradient descent but Newton’s method (Nocedal and Wright 2006). The details of the algorithm are left for further reading.
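The ensemble logic of Algorithm 4 can be sketched in a few lines of Python. The helpers build_decision_tree and predict_tree are hypothetical placeholders for a tree learner such as the C4.5 procedure of Sect. 3.3.3; only the bootstrap sampling, feature subsetting, and majority vote are shown.

    import numpy as np
    from collections import Counter

    def build_random_forest(X, y, num_trees=10, num_features=2, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        forest = []
        for _ in range(num_trees):
            rows = rng.integers(0, n, size=n)                  # bootstrap: sample rows with replacement
            cols = rng.choice(d, size=min(num_features, d), replace=False)  # random feature subset
            tree = build_decision_tree(X[np.ix_(rows, cols)], y[rows])      # hypothetical tree learner
            forest.append((tree, cols))
        return forest

    def forest_predict(forest, x):
        votes = Counter(predict_tree(tree, x[cols]) for tree, cols in forest)  # hypothetical helper
        return votes.most_common(1)[0][0]                      # majority vote across trees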
3.3.5 Neural Networks

Originally driven by the fascination with how intelligence works in animals, neural networks have become a mainstream machine learning method, especially in academia and for image analysis tasks. Artificial neural networks (ANNs) are computational models inspired by the biological neural networks present in the animal brain (Haykin 1994; Rojas 1993). Originally built as electrical circuits (Widrow and Hoff 1960), they are designed to recognize patterns by generalizing. One popular technique for training ANNs is Stochastic Gradient Descent (SGD) (Bottou 2010). This section only provides a short summary of ANNs, their structure, and how to train them using SGD.

An ANN is composed of interconnected artificial neurons, organized in layers. A typical ANN consists of an input layer, one or more hidden layers, and an output layer. Each neuron receives input from other neurons, processes it, and then passes the output to the next layer (Goodfellow et al. 2016a). The connections between neurons have associated weights, which determine the strength of the signal passed from one neuron to another. The goal of training is to adjust these weights to minimize the error between the predicted and actual outputs.
Neuron Model

A single artificial neuron, or node, computes a weighted sum of its inputs, adds a bias term, and then applies an activation function to produce an output. Mathematically, the output of a neuron can be represented as

$$f(\vec{x}) := \sum_{i=1}^{d} w_i x_i \ge b, \quad (3.6)$$
where $f(\vec{x}) \in \{0, 1\}$ is the output, $w_i$ are the weights, $x_i$ are the input values, d is the number of inputs, and b is the bias term. Equation 3.6 essentially implements an energy threshold. Mathematically, a threshold represents a step function. Step functions are often hard to deal with since they are non-continuous, and therefore no derivative can be calculated, which is needed to train with gradient descent (see above). The solution to this problem is to add a softening function to the step function.
Fig. 3.4 Three different activation functions to smooth out the step function (on the left): a sigmoid, a Rectified Linear Unit (ReLU), and the Tangens Hyperbolicus (tanh)
This softening function is usually called the activation function and modifies the neuron activation as follows:

$$f(\vec{x}) = a\left( \sum_{i=1}^{n} w_i x_i + b \right), \quad (3.7)$$
where $f(\vec{x})$ is the output, a is the activation function, $w_i$ are the weights, $x_i$ are the input values, n is the number of inputs, and b is the bias term. Popular activation functions include the sigmoid, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU) (Goodfellow et al. 2016a). The sigmoid function is a direct softening of the step function, and ReLU is a closer emulation of the original hardware design of neurons. Figure 3.4 shows a plot of all three mentioned functions. The choice of the activation function is of purely practical concern as it is a workaround for training. A more straightforward understanding of neurons is gained by thinking of them as energy thresholds.
Perceptron Learning

Training is done by tuning the weights $w_i$ and the bias b such that the difference between the output of Eq. 3.6 or 3.7 and the label $f(\vec{x}_i)$ from the training table is minimized. Any of the error metrics (see Sect. 3.4) can be used here. A single neuron can be trained by Perceptron Learning (Rosenblatt 1958). Perceptron Learning is a fundamental algorithm in the field of machine learning that dates back to the late 1950s. It is a simple but powerful approach for learning binary classifiers, capable of finding a linear decision boundary that separates two classes of points in a high-dimensional input space. The algorithm operates by iteratively updating a weight vector that defines the decision boundary based on the misclassified points. The algorithm is guaranteed to converge if the capacity of the neuron (see Chap. 8) is large enough to learn the problem. This algorithm remains a cornerstone of modern machine learning and continues to be studied and refined to this day. For example, there is a multi-class version of the algorithm (Bishop 1995). The pseudocode is shown in Algorithm 5.
Algorithm 5 Perceptron learning is an algorithm to train a single neuron that is guaranteed to converge to the optimum solution if the capacity of the neuron is large enough to represent the function. Measuring the capacity as explained in Chap. 8 before engaging the algorithm is therefore advisable. w is the weight vector of length d representing the decision boundary between the two classes; the algorithm iteratively updates this vector to minimize the number of misclassifications. y'_i is the predicted label of the i-th data point in X based on the current weight vector w; y_i is the true label of the i-th data point in X from the vector Y

Require: X: a matrix of shape (n, d) representing the input data, where n is the number of data points and d is the dimensionality of each data point.
Require: Y: a vector of length n representing the target labels for the input data. Each element of Y corresponds to the label of the corresponding data point in X. The tuple (X, Y) represents the data table defined in Definition 2.1.
Require: T: the maximum number of iterations to perform. This can be replaced by any stopping criterion, e.g., an accuracy threshold.
function PerceptronLearning(X, Y, T)
    w ← random initialization
    for t = 1 to T do
        for i = 1 to n do
            y'_i ← sign(w · x_i)
            if y'_i ≠ y_i then
                w ← w + y_i x_i
            end if
        end for
    end for
    return w
end function
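A runnable Python version of Algorithm 5 can be sketched as follows, assuming labels in {−1, +1} and a fixed number of passes T over the data:

    import numpy as np

    def perceptron_learning(X, Y, T=100, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.normal(size=X.shape[1])       # random initialization of the weight vector
        for _ in range(T):
            for x_i, y_i in zip(X, Y):
                if np.sign(w @ x_i) != y_i:   # misclassified point
                    w = w + y_i * x_i         # move the boundary toward it
        return w

    # Example: a linearly separable toy problem (a constant bias input of 1 is appended)
    X = np.array([[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]], dtype=float)
    Y = np.array([1, 1, -1, -1])
    w = perceptron_learning(X, Y)
    print(np.sign(X @ w))  # correctly separates: [ 1.  1. -1. -1.]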
Backpropagation

Neurons and neural networks can also be trained by gradient descent or Stochastic Gradient Descent (SGD). SGD is a variant of gradient descent, which iteratively updates the weights to minimize the loss function (Bottou 2010). The main difference between gradient descent and SGD is that while gradient descent computes the gradient using the entire dataset, SGD estimates the gradient using a randomly selected subset of the data, called a minibatch. The weight update rule for SGD can be expressed as

$$w_{t+1} = w_t - \eta \nabla L(w_t), \quad (3.8)$$
where $w_{t+1}$ is the updated weight, $w_t$ is the current weight, $\eta$ is the learning rate, and $\nabla L(w_t)$ is the gradient of the loss function with respect to the current weight. The learning rate controls the step size of the weight update, and choosing an appropriate learning rate is crucial for the convergence of the algorithm. A simple way to adjust the learning rate is the so-called momentum update rule. The momentum update rule is
$$v_{t+1} = \alpha v_t - \eta \nabla L(w_t), \quad (3.9)$$

$$w_{t+1} = w_t + v_{t+1}, \quad (3.10)$$
where $v_{t+1}$ is the momentum term, $\alpha$ is the momentum coefficient, and the other variables have the same meaning as before. The momentum coefficient is typically set between 0.5 and 0.9, and it determines the influence of the past gradients on the current update (Sutskever et al. 2013). A more measured way of determining the learning rate is to measure the fractal dimension of the observed error function and scale it accordingly (Hayder 2022).

The gradient is calculated at the output of the network, but the weights must be updated for all neurons, independent of their location in the network. This is achieved by a technique called backpropagation (Rumelhart et al. 1986). Backpropagation calculates the gradients in a backward pass through the network, starting from the output layer and moving toward the input layer. The gradients are then used to update the weights using the SGD algorithm. It works by applying the chain rule of calculus to compute the gradient of the loss function with respect to each weight in the network.

To understand the backpropagation algorithm, consider an ANN with L layers, and let $W^{(l)}$ denote the weight matrix of the l-th layer. Let $z_j^{(l)}$ be the weighted input to the j-th neuron in the l-th layer, and let $a_j^{(l)} = f(z_j^{(l)})$ be the activation of the same neuron, where f is the activation function. The error term $\delta_j^{(l)}$ for the j-th neuron in the l-th layer can be defined as the partial derivative of the loss function with respect to the weighted input $z_j^{(l)}$:
$$\delta_j^{(l)} = \frac{\partial L}{\partial z_j^{(l)}}. \quad (3.11)$$

For the output layer L, the error term $\delta_j^{(L)}$ can be calculated directly as

$$\delta_j^{(L)} = \frac{\partial L}{\partial a_j^{(L)}} \cdot f'(z_j^{(L)}), \quad (3.12)$$

where $\frac{\partial L}{\partial a_j^{(L)}}$ is the derivative of the loss function with respect to the activation of the j-th neuron in the output layer, and $f'(z_j^{(L)})$ is the derivative of the activation function with respect to the weighted input.

For hidden layers, the error term $\delta_j^{(l)}$ can be computed recursively using the chain rule:

$$\delta_j^{(l)} = \left( \sum_{k=1}^{n_{l+1}} W_{kj}^{(l+1)} \delta_k^{(l+1)} \right) f'(z_j^{(l)}), \quad (3.13)$$
3 The (Black Box) Machine Learning Process (l)
where .nl+1 is the number of neurons in the .(l + 1)-th layer, and .Wkj is the weight connecting the j -th neuron in the l-th layer to the k-th neuron in the .(l + 1)-th layer. After computing the error terms for all neurons in the network, the gradient of (l) the loss function with respect to each weight .Wij can be calculated as ∂L .
(l)
∂Wij
(l−1) (l) δj .
= ai
(3.14)
By applying the backpropagation algorithm, the gradients of the loss function can be efficiently computed, allowing for weight updates using optimization techniques such as SGD. It is important to note that backpropagation relies heavily on the chain rule, which enables the efficient computation of derivatives in composite functions. The chain rule (Stewart 2016) expresses the derivative of a composition of functions as the product of the derivatives of the individual functions, significantly simplifying the calculation of gradients in deep neural networks. However, this requires the use of an activation function and a loss function (error metric) that has a gradient.
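The following is a minimal sketch of Eqs. 3.11–3.14 for a network with one hidden layer, sigmoid activations, and a squared-error loss, trained by single-sample SGD. The layer sizes, learning rate, and epoch count are illustrative guesses, and convergence on the XOR example is typical but not guaranteed for every random initialization.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_step(x, target, W1, W2, eta=0.1):
        # Forward pass: z = weighted input, a = activation
        z1 = W1 @ x; a1 = sigmoid(z1)              # hidden layer
        z2 = W2 @ a1; a2 = sigmoid(z2)             # output layer
        # Backward pass: error terms per Eqs. 3.12 and 3.13 (sigmoid' = a(1-a))
        delta2 = (a2 - target) * a2 * (1 - a2)     # output layer error term
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden layer error term (chain rule)
        # Gradients per Eq. 3.14, weight update per Eq. 3.8 (in place)
        W2 -= eta * np.outer(delta2, a1)
        W1 -= eta * np.outer(delta1, x)

    # Example: learn XOR (a constant bias input is appended as a third feature)
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(0, 1, (4, 3)), rng.normal(0, 1, (1, 4))
    data = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
    for epoch in range(20000):
        for x, t in data:
            sgd_step(np.array(x, float), np.array(t, float), W1, W2)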
3.3.6 Support Vector Machines

Support Vector Machines (SVMs) are a class of supervised machine learning algorithms used for classification and regression tasks (Cortes and Vapnik 1995). They were first introduced by Vapnik (1995) and have since become popular due to their strong theoretical foundation. The foundation of SVMs can be traced back to the statistical learning theory developed by Vladimir Vapnik and Alexey Chervonenkis in the 1960s (Vapnik 2013). Their work on the Vapnik–Chervonenkis (VC) dimension and the structural risk minimization principle provided the basis for developing SVMs. The first practical implementation of SVMs was proposed by Cortes and Vapnik (1995), which focused on linear classification. The introduction of the kernel trick by Boser et al. (1992) allowed SVMs to be extended to non-linear classification problems.
Linear Support Vector Machines

SVMs aim to find the best decision boundary, or hyperplane, that separates the data points of different classes. In the case of linear SVMs, the goal is to maximize the margin, defined as the distance between the decision boundary and the closest data points from each class, called support vectors. Given a set of training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ are the feature vectors and $y_i \in \{-1, 1\}$ are the class labels, the linear SVM can be formulated as the following constrained optimization problem:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \quad (3.15)$$
where $w$ is the weight vector, b is the bias term, $\xi_i$ are the slack variables that allow for some misclassification, and C is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error. The optimization problem can be solved using a well-defined mathematical method, the method of Lagrange multipliers, which leads to the dual problem:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad (3.16)$$
where $\alpha$ are the Lagrange multipliers. The dual problem is a quadratic program, and its solution yields the optimal weight vector:

$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i. \quad (3.17)$$
The decision function for a new data point $x$ is given by

$$f(x) = \operatorname{sign}(w^{*T} x + b^*), \quad (3.18)$$

where $b^*$ is the optimal bias term.
Kernel Support Vector Machines

To handle non-linearly separable data, SVMs can be extended using the kernel trick (Boser et al. 1992). The idea is to map the data points to a higher-dimensional space, where they become linearly separable. A kernel function $K(x_i, x_j)$ is used to compute the inner product between the transformed data points. Commonly used kernel functions include the polynomial kernel and the radial basis function (RBF) kernel. The dual problem for kernel SVMs becomes
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \quad (3.19)$$
The decision function for kernel SVMs is given by

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i K(x_i, x) + b^* \right). \quad (3.20)$$
Practically, the optimization, which is essentially the training, can be implemented using an algorithm called sequential minimal optimization (SMO), introduced by Platt (1998). Instead of solving the entire problem at once, SMO breaks the problem down into smaller subproblems, which involve only two Lagrange multipliers at a time. This approach allows for analytically solving the subproblems and updating the multipliers iteratively until convergence, leading to a faster and more memory-efficient algorithm compared to traditional quadratic programming methods. That said, the approach is theoretically very well-defined, except that the choice of kernel can be arbitrary, as can the slack factors.
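Once the multipliers have been found, applying Eq. 3.20 is straightforward. The following sketch assumes the optimal alpha and b_star were already produced by an optimizer such as SMO; the RBF kernel width gamma is an illustrative choice.

    import numpy as np

    def rbf_kernel(x1, x2, gamma=1.0):
        """RBF kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    def svm_decision(x, X_train, y_train, alpha, b_star, kernel=rbf_kernel):
        """Evaluate sign(sum_i alpha_i * y_i * K(x_i, x) + b*), Eq. 3.20."""
        total = sum(a_i * y_i * kernel(x_i, x)
                    for a_i, y_i, x_i in zip(alpha, y_train, X_train))
        return np.sign(total + b_star)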
3.3.7 Genetic Programming

While models such as artificial neural networks try to simulate the intelligence of the brain connected to its sensor system, genetic programming simulates the evolutionary adaptation of genes to fit the environment.
Training

The stages of a genetic programming algorithm are literally modeled after the biological process of evolution. The process of natural selection in biology starts with the selection of the fittest individuals from a population. They produce offspring that inherit the characteristics of the parents and will be added to the next generation. If the parents have better fitness, their offspring will be better than their parents and have a better chance at surviving. In biology, this process keeps iterating forever as the environment changes. With genetic programming, the iteration stops once a generation of the fittest individuals adapting to the target function has been found. Genetic programming therefore works in five steps:

1. Initial population.
2. Select population based on fitness to the target function.
3. Crossover.
4. Mutation.
5. Iterate from step 2 until a “fit enough” selection is found.
The algorithm begins with a set of individuals that is called a population. Each individual is a solution to the problem you want to solve. An individual is characterized by a set of parameters (variables) known as genes. Genes are joined into a string to form a chromosome (solution). For example, if we are approximating a function, an individual could be a formula composed of a set of terms. The terms could be drawn from the set $\{+, -, *, /, \log, \exp, \sin, \cos, \tan\}$ and all numbers. These would be called genes. The initial population is randomly generated, and so one individual could have the chromosome $I_1 = 2 * \sin(x) + 10$, and another one could be $I_2 = 5 + \log_2 10$. The number of individuals to start with is a hyperparameter that needs to be determined by the programmer.

The fitness function determines how fit an individual is. The probability that an individual will be selected for reproduction is based on its fitness score, and the fitness function also needs to be guessed by the programmer. For example, for approximating a function, one could select any of the regression error metrics (see Sect. 3.4). The lower the error, the higher the fitness score. Only individuals with a high enough score survive and can pass on their genes to the next generation.

In the next step, pairs of individuals (parents) are selected based on their fitness scores. Individuals with a higher fitness score have a higher chance to be selected for reproduction. Biological reproduction is simulated as genetic crossover. It is the most significant step in a genetic algorithm. For each pair of parents to be mated, a crossover point is chosen at random from within the genes. For example, if we performed crossover of the genes between $I_1$ and $I_2$ from above, at term 2, the result would be $CV(I_1, I_2) = 2 * \log_2 10 = I_3$ and its symmetric counterpart $CV(I_2, I_1) = 5 + \sin(x) + 10 = I_4$. The new offspring are added to the population. It is assumed that, in general, offspring will have a higher fitness score, and so, over time, the parents will be selected out.

However, with only crossover and evaluation of the fitness score, the outcome of the genetic algorithm would be completely determined by the initial population. To battle this phenomenon and to be able to escape local minima based on the initialization, mutation is introduced. That is, as new offspring are formed, the genes are subjected to a random mutation with low probability. For example, some of the terms of an individual can be exchanged after crossover, like $I_4 = 5 + \cos(x) + 10$. The algorithm terminates when the population does not produce offspring that are significantly different from the previous generation in fitness score or when the fitness score is close enough to the desired target. Genetic algorithms have been shown to be very efficient at symbolic regression (Schmidt and Lipson 2009). In fact, function approximation using symbolic regression is the main use case of genetic programming as of the writing of this book.
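The following is a minimal genetic-programming sketch for symbolic regression, assuming a small operator set and a fitness based on the MSE against a target function. Population size, tree depths, and the crossover/mutation rates are illustrative guesses, and the simple level-1 crossover is a simplification of picking an arbitrary crossover point.

    import copy
    import operator
    import random

    OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
    SAMPLES = [x / 4.0 for x in range(-8, 9)]
    TARGET = lambda x: x * x + x                  # the function to recover

    def random_tree(depth=3):
        if depth == 0 or random.random() < 0.3:   # leaf: variable or constant
            return 'x' if random.random() < 0.5 else random.uniform(-5, 5)
        return [random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1)]

    def evaluate(tree, x):
        if tree == 'x':
            return x
        if isinstance(tree, float):
            return tree
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))

    def fitness(tree):                            # lower is better (MSE)
        return sum((evaluate(tree, x) - TARGET(x)) ** 2 for x in SAMPLES) / len(SAMPLES)

    def crossover(a, b):                          # graft a branch of b onto a copy of a
        a = copy.deepcopy(a)
        if isinstance(a, list) and isinstance(b, list):
            a[random.randint(1, 2)] = copy.deepcopy(b[random.randint(1, 2)])
        return a

    def mutate(tree):                             # replace a random branch with low probability
        if isinstance(tree, list) and random.random() < 0.1:
            tree[random.randint(1, 2)] = random_tree(2)
        return tree

    population = [random_tree() for _ in range(200)]
    for generation in range(50):
        population.sort(key=fitness)
        parents = population[:50]                 # survival of the fittest
        offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(150)]
        population = parents + offspring
    print(min(population, key=fitness))           # best formula found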
3.4 Error Metrics

Error metrics are also referred to as cost functions. Throughout the book, the clearer term error metric is preferred. The most general metric is accuracy as
defined in Eq. 2.4. This accuracy metric works universally when the measurement results in the number 100%: at that number, the output of the model is identical to the output observed, whether we are working with regression, classification, imbalanced, or balanced data. However, when accuracy is not perfect, questions arise that are usually not easily answered by a simple number. First, 0% accuracy is rarely expected, as the minimum accuracy achievable without any significant work is best-guess accuracy (see also the discussion in Sect. 2.2.1). This issue is extremely prevalent in classification problems with highly imbalanced classes. For example, when 99% of the outcomes belong to class 1, it is trivial to create a model with that accuracy. However, it can be extremely hard to create a model with 100% accuracy. The issue can be addressed as discussed in Sect. 11.7. Second, various measures have been developed to analyze and tune model error from various perspectives. These metrics are domain and task dependent. This section presents a selection of them based on the task.
3.4.1 Binary Classification

For binary classification, accuracy as defined in Eq. 2.4 (and furthermore in Eq. 11.3) can be broken into components:

$$\text{Accuracy} = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}, \quad (3.21)$$
where TP is the number of correctly predicted positive instances, TN is the number of correctly predicted negative instances, P is the number of instances belonging to a class labeled as positive, and N is the number of instances of a class (or set of classes) labeled as negative. FP is the number of misclassified positives, and FN is the number of misclassified negatives. This decomposition allows one to see whether there is a bias toward the classification of a certain class. For detection tasks, the metric is often further specialized, as discussed in the following section.
3.4.2 Detection

Detection tasks are important and rarely reach 100% accuracy, so their error behavior is frequently highly optimized based on the application. For example, if one is able to detect a disease based on viral RNA strains with 90% accuracy, one can call this a useful test. However, practically, it may be important for a tested individual that the test has a low miss rate, that is, when the disease is present, it is detected with a higher percentage than 95%. This can be traded off by increasing false alarms, that is, sometimes the disease is detected when it is not present.
Fig. 3.5 Precision, recall, sensitivity, specificity—the left half of the image with the solid dots represents individuals who have the condition, while the right half of the image with the hollow dots represents individuals who do not have the condition. The circle represents all individuals who tested positive. By FeanDoe, modified version from Walber’s Precision and Recall, https://commons.wikimedia.org/wiki/File:Precisionrecall.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=94134880
The false alarm rate (also called the false positive rate) is defined as the number of false positive detections per negative instance; its complement is the specificity. The miss rate is defined as the number of false negative detections per positive instance; its complement is the sensitivity, also called recall. The precision is the fraction of true positives among all positive detections. Figure 3.5 shows the idea graphically. The equal error rate is the point where misses and false alarms are equally high and is the best point to measure accuracy (see Definition 2.4) as it represents the point of probabilistic equilibrium. The compromise between false positives and false negatives is often called the detection error trade-off (DET). Plotting the true positive rate (recall) over the false positive rate is called the ROC curve (receiver operating characteristic). Other evaluation tools include the total operating characteristic (TOC) and the area under the curve (AUC). These curves can sometimes be explicitly plotted as a function of some hyperparameter tuned in the model. There are also metrics that weigh the two measures in different ways, for example, the F-measure, which is the harmonic mean of precision and recall.
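A minimal sketch computing these quantities from binary labels and predictions (1 = positive, 0 = negative; zero-division guards are omitted for brevity):

    def detection_metrics(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp)             # true positives among all detections
        recall = tp / (tp + fn)                # sensitivity: detected positives among all positives
        specificity = tn / (tn + fp)           # correctly rejected negatives among all negatives
        miss_rate = fn / (tp + fn)             # 1 - recall
        f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, specificity, miss_rate, f_measure

    print(detection_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))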
3.4.3 Multi-class Classification

In a multi-class model, one can plot the k ROC or DET curves for k classes in a one-vs-all methodology. For example, given three classes named x, y, and z, one would have one curve for x classified against y and z, another curve for y classified against x and z, and a third one for z classified against x and y. However, it is usually better to use a confusion matrix. A confusion matrix, also known as an error matrix (in unsupervised learning, it is usually called a matching matrix), is generated as follows: Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. The formal definition is as follows.

Definition 3.10 (Confusion Matrix) The confusion matrix for a k-class classifier is a $k \times k$ matrix M with elements $a_{ij} \in M$ such that $a_{ij}$ contains the number of elements of class i that have been predicted as class j.

A perfectly diagonal confusion matrix therefore indicates 100% accuracy. Any element off the diagonal indicates confusion (that is, an incorrect prediction).
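A minimal sketch of Definition 3.10, assuming classes are numbered 0 to k−1:

    import numpy as np

    def confusion_matrix(y_true, y_pred, k):
        M = np.zeros((k, k), dtype=int)
        for t, p in zip(y_true, y_pred):
            M[t, p] += 1                   # actual class t predicted as class p
        return M

    M = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], k=3)
    print(M)                               # off-diagonal entries mark confusions
    print(np.trace(M) / M.sum())           # accuracy: 4/6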
3.4.4 Regression

Many metrics are used for regression. However, the most standard metric for regression is the mean squared error (MSE). If a vector Y of n predictions is generated from a sample of n data points, and $f(x_i)$ are the corresponding observed values from the data table, then the within-sample MSE of the predictor is computed as

Definition 3.11 (Mean Squared Error)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2, \quad (3.22)$$
where $Y_i$ is the prediction of the model and $f(x_i)$ is the observed outcome in the training/validation data. While this metric is by far the most popular metric for regression, it is hard to measure the success of a model with it. The reason is that the number is not bounded between best guess and 100%. Clearly, a lower MSE is better, but what does it mean when an MSE is, let us say, 50? Alternatively, sometimes one can re-define the notion of accuracy (Eq. 2.4) based on the well-posedness definition (Definition 2.8) and define a point to be correctly regressed toward by a model if it is within an $\varepsilon$ distance of an original point.
Definition 3.12 (Regression Accuracy)

$$A_{\text{regression}} = \frac{\#\text{ of predictions with } |Y_i - \hat{Y}_i| < \varepsilon}{\#\text{ of predictions}}, \quad (3.23)$$
where $\varepsilon$ is a problem-specific constant. For example, if the regression was to predict housing prices in a certain area based on the properties of houses, one could define $\varepsilon$ as \$1000, as one would probably be very happy if a large number of objects were predicted correctly within that error range. Furthermore, depending on the task, many other metrics defined for classification become applicable.
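A minimal sketch of Definition 3.12 with hypothetical housing prices and the illustrative $\varepsilon$ = \$1000 from the text:

    def regression_accuracy(y_true, y_pred, eps):
        hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) < eps)
        return hits / len(y_true)

    # Hypothetical observed and predicted prices
    observed = [250_000, 310_500, 198_000, 420_000]
    predicted = [250_800, 312_600, 197_500, 419_100]
    print(regression_accuracy(observed, predicted, eps=1000))  # -> 0.75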
3.5 The Information-Based Machine Learning Process

The black-box machine learning process as outlined here is completely independent of the actual model used. The model parameters could be trained in a neural network, a decision tree, or a linear regression. In fact, one could assume this to be the most general version of the model training process. This makes the process highly successful as it is seemingly easy to learn. One can reuse code from other experiments and use trial and error, guided by the only quality metric that this process defines: accuracy on the validation set. For any other design decision, however, this process relies on human intuition. That is, the intuition of the person running the experiments, as that intuition is used to make decisions about the training/validation split, model type, model size, hyperparameter settings, duration of training, and the definition of success of the entire experiment. This makes the final implementation full of implicit assumptions and “magic constants.” Consequently, the more years of experience the experimenter has and/or the more benchmarking successes the experimenter’s track record shows, the higher the trust in the expertise of the experimenter. That is, due to the significant reliance on intuition, the machine learning field is currently a field that is “driven by superstars.”

The danger with intuition-driven empiricism is that sometimes trust can go beyond what makes logical sense. Many fields started like that. In the mid-1850s, for example, the chemist and photographer Cyprien Théodore Tiffereau presented a series of papers to the Academy of Sciences in Paris outlining how, while in Mexico, he had succeeded in turning silver into gold using common reagents. These discoveries were well-argued and therefore compelled many mid-19th-century chemists to seriously reconsider the possible composite nature of metals.1 Well-respected chemists who supported the compound nature of metals openly speculated that the alchemical dream of metallic transmutation might in fact soon be realized.
1 https://www.sciencehistory.org/distillations/the-secrets-of-alchemy.
Of course, by now we know that the intuition of the early alchemists was great and did push the field and the economy. However, systematic measurements in chemistry combined with first-principle-derived physics ultimately replaced it. In summary, the evidence discussed above points in the direction that reducing the success of something as complex as a machine learning experiment to one number is an overgeneralization.

The goal of the following chapters is to dig deeper into the knowledge of the model-building process in order to come up with a finer-grained understanding and additional metrics of success. This process is referred to here as the information-based machine learning process. At the center of this process are the ideas of measuring task complexity, intellectual capacity, and data sufficiency. Measurements allow for a more controlled, engineered modeling process that usually results in smaller, more resilient models that are faster and easier to maintain. Architectures become more understandable, model construction and training more predictable. The process can be augmented with synthesized data for further quality assurance, for example, to measure bias.

The information-based machine learning process is a natural extension of the black-box process. That is, the black-box process is contained in the information-based process. The information-based process consists of the following steps:

1. Define the task and validate that the task is appropriate for machine learning or even modeling in general (compare Chaps. 7 and 11).
2. Collect and annotate the data, measuring the bias of the data (Sect. 13.3) and annotator agreement (Chap. 11).
3. Convert and validate the data for a machine learner (Chaps. 11 and 12).
4. Select the most likely most successful machine learner (Sect. 16.1.1).
5. Match the complexity of the model to the task (Chaps. 8 and 10).
6. Perform quality assurance on the model and deploy it (see Chap. 11).
7. The deployed model provides mechanisms for explaining its decisions (see Chap. 14).

The process introduces new metrics, such as the measurement of expected generalization (Chap. 6), bias (Sect. 13.3), and overfitting capacity (Chap. 5). These are explained based on the information-theoretic concepts, which are introduced in Chap. 4. The information-based machine learning process as described here is not to be presumed complete. Many more measurements and scientific and mathematical facts can possibly be included in future extensions of this process. Some of them are described in Chap. 16.
3.6 Exercises

1. Describe the differences between unsupervised and supervised machine learning tasks. Provide an example of each type of task and explain how it can be applied in practice.
2. Explain what quantization and clustering are and how they relate to unsupervised machine learning. Give an example of a real-world application where clustering could be used.
3. Define classification, regression, and forecasting in the context of supervised machine learning. Describe a scenario where each of these tasks would be applicable, and provide an example of a metric that could be used to evaluate the performance of a model in each case.
4. Explain in your own words what the i.i.d. assumption is and why it is important in machine learning. Then, discuss at least two practical caveats associated with the i.i.d. assumption.
5. Write an algorithm in your preferred programming language for implementing the k-nearest neighbor approach. Use the example provided in the text to test your implementation.
6. Explain in your own words what linear regression is and how it is used in machine learning. Then, discuss the advantages and disadvantages of using linear regression as a model.
7. Define the mean squared error (MSE) and explain why it is used in linear regression. Then, calculate the MSE for a sample dataset and provide an interpretation of the result.
8. Define gradient descent and explain how it is used to train linear regression. Then, provide an example of how the learning rate and convergence threshold affect the training process.
9. Explain in your own words the concept of a decision tree and its use in modeling a function.
10. Describe the C4.5 algorithm for building decision trees from training data. What are some of the limitations of this algorithm?
11. What is a random forest? How does it work, and what are its advantages and disadvantages compared to other machine learning methods?
12. Explain the need for activation functions in artificial neurons. Why are non-continuous functions, such as step functions, difficult to deal with in training?
13. What is the purpose of backpropagation, and what problem does it solve in training neural networks? Explain using the example of a three-layer network.
14. Explain how support vector machines (SVMs) work for binary classification. What is the goal of SVMs, and how do they find the best decision boundary? What is the optimization problem that SVMs solve, and what is the dual problem? How is the kernel trick used to extend SVMs to non-linear classification problems, and what are the common kernel functions used in SVMs?
15. Discuss the main application of genetic programming and how it is used in function approximation.
16. In which type of classification problems is the issue of imbalanced classes most prevalent? How can this issue be addressed?
17. Explain how the choice of error metric depends on the task and provide an example of a specific error metric used in a real-world application.
18. What are the main differences between the black-box machine learning process and the information-based machine learning process? Summarize and argue the pros and cons.
3.7 Further Reading

One reference for further reading on unsupervised methods and dynamical systems is the following book, which provides an introduction to the concepts of non-linear dynamics and chaos theory, including topics such as phase space, attractors, and bifurcations. It also covers applications of these concepts in fields such as physics, biology, and engineering. Chapter 10 specifically focuses on unsupervised learning and clustering:

• Strogatz, S. H. (2018). Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Westview Press.

A classic reference for forecasting is the following book. It provides a comprehensive overview of forecasting methods, including time series methods, regression analysis, and advanced methods such as neural networks and ensemble forecasting. It also covers topics such as data preprocessing, model selection, and forecast evaluation. The book includes numerous case studies and examples, making it a valuable resource for practitioners and researchers alike. The latest edition of the book was published in 2021:

• Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (2021). Forecasting: Methods and Applications. Wiley.

XGBoost is explained in the original research paper by the authors of the algorithm:

• Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
Chapter 4
Information Theory
The aim of this chapter is to equip the reader with the mathematical instruments necessary for examining the information flow within the scientific method. This chapter does not follow the conventional pattern of a textbook chapter on Shannon's information theory as applied to communication. Instead, we take a unique approach, molding these theories and principles to suit our specific requirement of understanding the information dynamics of the scientific process.

The word information is used in many contexts with different meanings. For the sake of scientific accuracy, this book will use the definition by Claude Shannon (Shannon 1948b), which has two main advantages: It is short, and it is easily translatable into math.

Definition 4.1 (Information) Reduction of uncertainty.

We can immediately see how this definition connects to our intuition: We go to the information booth at an airport in the hope of reducing our uncertainty about which gate to choose or how to connect to the local transportation. It is very important to keep in mind that information can only be acquired when there is uncertainty. That is, acquiring information requires a question. This connects information intuitively to the scientific process, as explained in Sect. 2.1. Since a reduction is always relative to an absolute, information is indeed relative to the uncertainty of the observer. In the following, we will formalize this intuition.
4.1 Probability, Uncertainty, Information

It is hard to define uncertainty sharply (without using information, which would create a very unhelpful recursive definition with Definition 4.1). However, we may say that uncertainty follows when there is a lack of observations with regard to a decision about the outcome of an experiment. Despite the lack of a strong definition,
however, uncertainty is measurable and therefore mathematically quantifiable. In order to quantify it, we have to at least know the number of possible outcomes, better yet, the set of all possible outcomes. This is called the sample space, usually denoted $\Omega$.

Definition 4.2 (Sample Space) The set of all possible results of an experiment, denoted $\Omega$.

If $\Omega$ is given, math allows us to measure and calculate the probability of an outcome and also the total uncertainty of a set of experiments. Note that while defining $\Omega$ requires a question, $\Omega$ is not defined for all questions. For example, the question "why are bananas curved?" has an uncountable number of possible outcomes. Therefore, the uncertainty implied by this question is hard to quantify. Furthermore, a set $\Omega$ with result outcomes $s_1, s_2, \ldots, s_n$ (i.e., $\Omega = \{s_1, s_2, \ldots, s_n\}$) must meet the following conditions in order to be a sample space:

1. The outcomes must be mutually exclusive, i.e., if $s_j$ occurs, then no other $s_i$ will take place, for all $i, j = 1, 2, \ldots, n$ with $i \neq j$.
2. The outcomes must be collectively exhaustive, i.e., every experiment will result in an outcome $s_i \in \Omega$ for $i \in \{1, 2, \ldots, n\}$.

For example, if the experiment is tossing a single coin, the sample space is the set $\{H, T\}$, where the outcome H means that the coin was heads and the outcome T means that the coin was tails. The possible events are $E = H$ and $E = T$. For tossing two coins, the sample space is $\{HH, HT, TH, TT\}$, where the outcome is HH if both coins are heads, HT if the first coin is heads and the second is tails, TH if the first coin is tails and the second is heads, and TT if both coins are tails. Note, however, that if we were not tossing coins but a number of thumb tacks and observed whether they landed with their points upward (U) or downward (D), there is no physical symmetry to suggest that the two outcomes should be equally likely; yet the sample space for one toss is still $\Omega = \{U, D\}$. In order to model that, we need the notion of a probability space. Let us start by introducing probabilities in the following section.
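As a quick aside, the two conditions can be checked programmatically. The following is a minimal sketch (not from the book) that enumerates the two-coin sample space:

```python
from itertools import product

# Sample space for tossing two coins: {HH, HT, TH, TT}
omega = ["".join(p) for p in product("HT", repeat=2)]
print(omega)  # ['HH', 'HT', 'TH', 'TT']

# Mutually exclusive: all outcomes are distinct.
assert len(set(omega)) == len(omega)
# Collectively exhaustive: every possible two-coin result is in omega.
assert all(a + b in omega for a in "HT" for b in "HT")
```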
4.1.1 Chance and Probability

Let us again assume we toss a coin. We know there are two possible outcomes: heads and tails. Assuming we have to make one prediction before the coin is tossed, we know that we have a 1 in 2 chance of guessing the outcome of the coin toss right, at least when we do not know anything about the influence of the environment on the coin toss or about special properties of the coin. What if the coin is tossed a couple of times and we are to guess every time? Well, every time we have a 1 in 2 chance of guessing right. Sometimes we will predict the outcome correctly, and sometimes we will predict inaccurately. The problem is: Since it is impossible to predict whether an individual prediction will be right, there is no way to actually predict how many
times a couple of guesses will be right. However, the law of large numbers (which is actually a theorem) states that, as the number of experiments goes to infinity, the average of the observations converges to the expected value (Ross 2014). That is, if the coin is tossed an infinite number of times, it will land on heads 50% of the tosses. Consequently, it will land on tails the other half of the tosses. Mathematically, this expectation is defined as a probability P.

Definition 4.3 (Probability) The expected chance of an outcome, given an infinite number of observed outcomes. We denote $P(H) = P(T) = \frac{1}{2}$.

Probabilities were first systematically studied by Gerolamo Cardano, and Pierre-Simon Laplace contributed significantly to their further development and popularization. Thomas Bayes introduced a theorem that explains how to update probabilities based on new evidence, which laid the foundation for Bayesian inference. For more details, see (Hacking 2006) and (Feller 1968).

Since we cannot know the number of correct coin toss predictions in advance, it is well possible that in an experiment one is right many times in a row or wrong more times than expected by the computed probability. This is called an anecdotal result. Anecdotal results are results that are not repeatable.

Definition 4.4 (Repeatable Result) An experiment is repeatable when it shows the same outcomes under minimally different circumstances.

One can see how a dice roll is not a repeatable result. Assuming the probability is computed correctly, the above definition is equivalent to a result being anecdotal when it is obtained at a greater or lesser frequency than given by the probability. When we repeat the experiments that led to the anecdotal result enough times, the outcome frequency must, by definition, converge to the probability. However, when a result is repeatable, we do not need to repeat the experiment many times, as the outcome will always be predictable. A stronger notion is a reproducible result.

Definition 4.5 (Reproducible Result) An experiment is reproducible when it shows the same outcomes under specified circumstances.

Specified circumstances mean that the experimental setup is described somewhere in such a way that an independent reproducer can get to the same results. Scientists usually aim for the smallest possible specification of circumstances. That is, reproducibility seeks to converge to repeatability. In general, however, the notion of "reproducing" is often subjective, and standards are set by the community of researchers working on a specific set of problems. For example, in computer science, "reproducing an experiment" is often equated with downloading code and data and running the experiment again on one's own computer. In the author's humble opinion, the circumstances here are only minimally different, since compiling and mathematical computation are deterministic and long-solved. That is, the author would classify such a method as repeating an experiment. Reproducing an experiment would be building a model from scratch given the specifications in a technical description.
This thought leads us to another important point. The probability almost certainly converges to the chance (the fraction of the number of wanted outcomes divided by the total number of outcomes) when repeating an experiment an infinite number of times, by the law of large numbers (Kolmogorov 1933).

Theorem 4.1 (Law of Large Numbers, Kolmogorov 1933) The sample average $\bar{X}_n$ converges almost surely to the average value $\mu$ when the number of samples n approaches infinity:

$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1 \tag{4.1}$$
That is, probability makes experiments repeatable once they are repeated an infinite number of times. Practically speaking, it is, of course, impossible to repeat an experiment an infinite number of times. How many samples have to be collected from an experiment for the observed frequencies to approach the probabilities turns out to be determined by the amount of uncertainty contained in the experiment. Uncertainty is defined in Definition 4.8. Chapter 12 describes a method that can be used to estimate whether a dataset contains enough samples to warrant predictions yet. When an experiment contains no uncertainty, we call it deterministic.

Definition 4.6 (Determinism) An experiment is deterministic if there is no uncertainty as to any of its outcomes.

Determinism (which originally means lack of free will) is therefore the opposite of randomness. Note that this notion of determinism is different from the notion of determinism in theoretical computer science, as discussed in Sect. B.6. The following corollary is important.

Corollary 1 In a deterministic experiment, frequency equals probability.

This is easy to see using an example: Assume 1 red ball and 2 green balls are in a drawer. The probability of blindly drawing a red ball from the drawer is $P(\text{red ball}) = \frac{1}{3}$. By Theorem 4.1, if we repeat the experiment an infinite number of times, $P(\text{red ball}) = \frac{1}{3} = \mu$, where $\mu$ is the frequency of outcomes in which one draws the red ball. Now, we remove all uncertainty and draw the balls from the drawer while actually looking at them. Let us assume the first draw will be a red ball, the second a green ball, and the third also a green ball. This can be done in any order. That is, the first could be green, the second red, and the third green again. No matter the order, the red ball can only be drawn $\frac{1}{3}$ of the time. That is, the frequency of the outcome "red ball" is $\frac{1}{3}$.

When we collect samples in modeling, we often interchange frequency with probability, including in this book. It is important to note that this is only possible under one of two assumptions: The experiment is deterministic, or one has collected enough samples that the experiment's frequencies have already converged to its probabilities.
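The convergence promised by Theorem 4.1 is easy to observe in simulation. The following sketch (illustrative only; the seed and draw counts are arbitrary choices, not from the book) repeats the ball-drawing experiment and prints the observed frequency of "red," which approaches $\frac{1}{3}$:

```python
import random

def red_ball_frequency(num_draws: int, seed: int = 0) -> float:
    """Simulate blind draws (with replacement) from a drawer containing
    1 red and 2 green balls; return the observed frequency of red."""
    rng = random.Random(seed)
    reds = sum(1 for _ in range(num_draws) if rng.random() < 1 / 3)
    return reds / num_draws

# The frequency converges to P(red) = 1/3 as the sample size grows.
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: frequency = {red_ball_frequency(n):.4f}")
```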
4.1.2 Probability Space

As promised earlier, we can now introduce the notion of a probability space, to model result sets where each outcome has a different probability. For the purposes of this book, we only deal with discrete probability spaces, and so the more general definition of a probability space can be simplified as follows.

Definition 4.7 (Discrete Probability Space) A discrete probability space is a tuple $(\Omega, P)$ consisting of two elements:

1. A sample space $\Omega$
2. A probability function P that assigns to each element $\omega \in \Omega$ a probability $p(\omega)$

For practical purposes, we will often refer to $p(\omega)$ as $p_i$, because one usually iterates through all probabilities, for example, when calculating uncertainty. We will also assume that $p(\omega) > 0$ for all $\omega \in \Omega$ (or $p_i > 0$ for all $1 \leq i \leq |\Omega|$). Another term for a probability function that we will use in this book is random variable.

To practically measure probability, we need to perform as many experiments as it takes until we see convergence to a probability, that is, repeatability.1 The number of experiments performed to reach a conclusion is usually called the sample size and denoted with n. One sufficient indicator that n is too small is that some $p(\omega)$ are still 0. For example, the probability space for the thumb tack experiment discussed in the previous section could be empirically determined by dropping one thumb tack 10,000 times and observing the outcomes. The probability function P approximated this way could be2 $p(U) = 0.7$, $p(D) = 0.3$.

Probabilities refer to the expected chances of an individual experimental outcome. Uncertainty concerns an entire system of such experiments.
4.1.3 Uncertainty and Entropy

Let us assume we flip our coin from the previous section 100 times. The size of the sample space is $|\Omega| = 2^{100}$, since there are 2 possible outcomes per coin flip. Each flip is independent of the previous one, and thus all outcomes have equal probability.
1 A warning though: The average of the results obtained from a large number of trials may fail to converge to a probability in some cases. For instance, the average of n results taken from the Cauchy distribution or some Pareto distributions will not converge as n becomes larger, due to their heavy tails (Roberts and Varadhan 1997).
2 The author searched the web extensively for empirical evidence to back up the numbers but could not find any. So these numbers are purely fictitious.
Consequently, the probability of guessing all 100 coin tosses right is $P = \frac{1}{2^{100}} = \frac{1}{1267650600228229401496703205376}$.

Uncertainty S is mathematically defined as the inverse function of probability P: Instead of being given $\Omega$ and a number of events to compute a probability, we are given probabilities and are to find out how many guesses it would minimally take to have a chance at guessing right.

So back to our coin toss example: Let us assume we are only given $P = \frac{1}{2^{100}}$ and are told that we are to guess "heads" or "tails." How many guesses would it take? Intuitively, the inverse of exponentiation is the logarithm. That is, we should compute the logarithm of P. Thus $S = \log_2 P = \log_2 \frac{1}{1267650600228229401496703205376} = -100$. The result means: We are 100 binary guesses away from being able to guess the coin-flip sequence right with probability P. The result is obviously not surprising, because we set it up this way with our initial assumption. Logarithm laws also allow us to calculate the minimum number of guesses as $S = k \log_2 P = 100 \log_2 \frac{1}{2}$, where $k = 100$ is the number of coin tosses and $P = \frac{1}{2}$ is the probability of guessing an individual coin toss right.

The formula we just used is called the Hartley entropy, introduced by Ralph Hartley in 1928. He was, however, not the first to invent the concept. The physicist Ludwig Boltzmann is justifiably credited as the first to introduce this formula for entropy, in 1870. He, however, used the natural logarithm, which makes it hard to count the number of guesses; possibly one of the reasons why entropy was not understood for a long time (cite: von Neumann). The entropy formula as given can be generalized in various ways. J. Willard Gibbs generalized it to work for non-uniform distributions in 1902. The concept was then reinvented by Claude Shannon in 1948. More generalizations exist, for example, the von Neumann entropy and the Rényi entropy. For the topics covered in this book, the following definition suffices:
Definition 4.8 (Uncertainty)

$$S = k \sum_{i=1}^{n} p_i \log_b p_i, \tag{4.2}$$

where b is the number of choices per guess, $p_i$ is the probability of the outcome $\omega \in \Omega$ indexed by i, n is the number of elements in the sample space $\Omega$, and k is the number of events in the system.
When all probabilities are equal, that is, $p_i = p_j$ for any $i, j$, Eq. 4.2 reduces algebraically to the calculation we used in the example above. We define this case as follows.

Definition 4.9 (Equilibrium Uncertainty)

$$S = k \log_b P, \tag{4.3}$$
where b is the number of choices per guess, P is the probability of any outcome, and k is the number of events in the system. In fact, all probabilities being equal is a very distinct case that warrants its own definition:

Definition 4.10 (Probabilistic Equilibrium) The state of equal probability for all factors.

Two bases b are traditionally used. Base $b = 2$ describes the number of binary guesses that are minimally needed to completely guess the state of a system with a certain probability. Some communities have proposed shannons as the name for the unit. Base $b = e = 2.718281828459045\ldots$ is used in classical physics via the natural logarithm. It is difficult to intuitively think of having an irrational number of choices per guess, but one can still assign a unit to the result: Calculating the uncertainty with base $b = e$ results in a number that is assigned the unit nats. Throughout this book, we will assume base $b = 2$ and measure in bits to be consistent with computer science. Note that the result for S will have a negative sign, just like in our example above.

Example A histogram graphing the outcomes of 1,000 experiments shows that outcome 1 occurs in 70% of all the experiments performed, outcome 2 in 27%, and outcome 3 was observed in only 3% of all experiments. (a) How many bits of memory would have to be minimally reserved to store a list of the outcomes of the experiments? (b) What is the individual uncertainty (also called surprise) of outcome 3?

To solve (a), we assume n is large enough and apply Eq. 4.2 with $k = 1000$, $p_1 = \frac{7}{10}$, $p_2 = \frac{27}{100}$, $p_3 = \frac{3}{100}$. Arithmetic yields $S \approx -1021.99$ bits. Practically, we cannot have fractional bits, so we need to reserve at least 1022 bits of memory for the list of results (assuming we can compress the list optimally well). For (b), the surprise of outcome 3 is $S_3 = \log_2 p_3 \approx -5.0589$ bits, or 5.0589 shannons.
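The arithmetic of the example can be verified with a few lines of code. This is a minimal sketch (not from the book) that implements Eq. 4.2 directly:

```python
import math

def uncertainty(probabilities, k=1, base=2):
    """Equation 4.2: S = k * sum_i(p_i * log_b(p_i)). The result is
    negative; its magnitude is the minimum number of b-ary guesses."""
    return k * sum(p * math.log(p, base) for p in probabilities if p > 0)

S = uncertainty([0.7, 0.27, 0.03], k=1000)
print(round(S, 2))       # approximately -1021.99 bits
print(math.ceil(-S))     # reserve at least 1022 bits of memory

# Surprise (self-information) of outcome 3:
print(round(math.log2(0.03), 4))  # approximately -5.0589 bits
```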
4.1.4 Information

As explained earlier (see Definition 4.1), information is reduction of uncertainty. Therefore, information can be formalized as:

Definition 4.11 (Information)

$$H = -S, \tag{4.4}$$

where S is uncertainty as defined in Definition 4.8.
We denote information with the letter H, as introduced by Claude Shannon (in honor of Hartley). Information is typically measured in binary digits (bits). Interestingly enough, information is more general than uncertainty. As we have seen so far, uncertainty can only be calculated based on a probability. However, a number of any base can be converted into a number of base 2, and the number of digits of the resulting binary number is also counted in bits. To get the number of binary digits needed to encode a natural number given in decimal, one only needs to add 1 to the floor of the logarithm base 2. That is:

Definition 4.12 (Number of Binary Digits)

$$H = 1 + \lfloor \log_2 x \rfloor \text{ [bits]}, \tag{4.5}$$

where x is a natural number encoded in decimal.

One can see how this formula can be generalized to other bases. More importantly, one can see that bits can measure both uncertainty and information. In the case of uncertainty, H gives us the minimum number of binary digits required to get to certainty. In the case of information, H gives us the minimum number of bits required to encode information without creating uncertainty. In general, negative bits indicate uncertainty, and positive bits indicate information.

Example You are challenged by a friend who wants you to guess a number secretly written on a piece of paper. The number is between 0 and 15. What is the minimum number of guesses that your friend needs to allow you: (a) for you to win (that is, guess the number) and (b) for both of you, playing perfectly, to have a 50% chance at winning (that is, the odds of you guessing the number are 50%)?

The answer is, of course, a direct application of Eq. 4.3. $|\Omega| = 16$, so the probability of guessing the number right is $P = \frac{1}{16}$, and $k = 1$ as there is only one event (the secret number). The base b is a little bit tricky, as it depends on what kind of guess is allowed. A straightforward type of guess would be to try to guess the binary digits of the secret number. The question to ask would then be: "Is the number greater than or equal to x?", where x starts at 8 and then jumps to either 4 or 12, and so on. In that case, the answer to the question is "yes" or "no," so the response is binary, which makes $b = 2$. The minimum number of binary guesses that guarantees a win is therefore $S = \log_2 \frac{1}{16} = -4$ bits. Thus, by your friend providing 4 bits of information, your uncertainty can be reduced to 0 bits, which answers question (a). If your friend only allowed the answers to 3 guesses, there is $-1$ bit of uncertainty left. Since $P = 2^{-1} = \frac{1}{2}$, you both would have a 50% chance of winning the game when the last guess is only allowed to be a number, which answers question (b).
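The guessing strategy in the example is simply a binary search over the sample space. The sketch below (a hypothetical helper, not from the book) confirms that any secret number between 0 and 15 is pinned down in exactly 4 yes/no guesses:

```python
import math

def binary_guess(secret: int, lower: int = 0, upper: int = 15) -> int:
    """Count the yes/no questions ('is the number >= x?') needed to
    identify `secret`. Each answer supplies 1 bit of information."""
    guesses = 0
    while lower < upper:
        mid = (lower + upper + 1) // 2
        guesses += 1
        if secret >= mid:   # the friend's truthful yes/no answer
            lower = mid
        else:
            upper = mid - 1
    return guesses

assert all(binary_guess(s) == 4 for s in range(16))
print(-math.log2(1 / 16))  # 4.0 bits of uncertainty resolved
```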
Information can also be defined based on a probability, as a general case of Eq. 4.3.

Definition 4.13 (Self-information)

$$H = -\log_2 P \text{ [bits]}, \tag{4.6}$$

where P is the probability of a certain outcome.
4.1.5 Example

An experiment is repeated many times with many different outcomes. The outcome o, which we prefer, seems to happen $p = \frac{1}{4}$ of the time. That is, its self-information is $H_o = -\log_2 \frac{1}{4} = 2$ bits. Section 13.3 discusses an important application of self-information.
4.2 Minimum Description Length

Let us look at our 100 coin tosses from Sect. 4.1.3 one more time. We calculated that 100 binary guesses need to be made to be able to predict all coin tosses with probability P. The question we are asking now is: What if we knew all results in advance? Then, of course, we would have a 100% chance of guessing all coin flips right. However, we would still need to provide 100 binary decisions. That is, to describe the outcome of the 100-coin-flip experiment, we need 100 bits of information, whether we started with $-100$ bits of uncertainty, with $-50$ bits of uncertainty, or with 0 bits of uncertainty. This is why the Shannon entropy H (see Definition 4.11) is also called the minimum description length (MDL). There is no shorter way to get to a correct description of the experiments than using 100 bits of information. So let us agree on the following definition.

Definition 4.14 (Decision) Making a choice between b possible outcomes, where $b > 1$ is an integer.
Note that a decision can be informed, that is, based on some prior information about the outcome, or uninformed. An uninformed decision is usually called a guess. Intuitively, a binary decision of the past is recorded as a bit. A bit in the future is a yes/no decision waiting to be made. The minimum description length of non-uniform distributions works the same way; see the example in Sect. 4.1.3. Non-uniform distributions, however, require a smaller minimum description length. This holds in general:

$$H_{uniform} = -k \sum_{i=1}^{n} p_i \log_b p_i \ (\text{with } p_i = p_j\ \forall i, j) \;\geq\; H_{variable} = -k \sum_{i=1}^{n} p_i \log_b p_i \tag{4.7}$$
That is, a probability function where all probabilities are equal requires the longest description compared to a probability function that has varying probabilities. The derivation and limits of Eq. 4.7 are discussed in greater depth in Sect. B.4. Furthermore, providing decisions of any kind, for example, 100 decisions of “head” or “tail” (accurate or not), also constitutes physical work (I encourage the disputing reader to just take a piece of paper and start writing). This makes physical work proportional to minimum description length and, as a consequence, to information and uncertainty. More on this in Appendix B. Due to this connection between physical work and minimum description length, we can use minimum description length arguments to reason about some aspects of physical work, more specifically, computational complexity.
4.2.1 Example

You want to sort n positive integers, of which the highest value can maximally be n. What is the smallest number of steps that a sorting algorithm could consist of? There are two paths to the solution.

First, based on information: n integers of highest value n represent maximally $n \cdot \lceil \log_2 n \rceil$ bits of information (that is, when all integers have the value n). Even if the algorithm itself did not have to do any computation, all numbers need to be read in for the algorithm to be exact and not approximate. This means the smallest number of steps is proportional to $n \log_2 n$.

Second, based on uncertainty: The uncertainty represented by the task is $S = n \log_2 \frac{1}{n}$ bits. This means we need at least $-S$ binary decisions to have a chance at being correct. Since logarithms of different bases are multiples of each other, that is, $\log_b n = c \cdot \ln n$ for a constant c, it does not matter whether the algorithm uses binary decisions or b-nary decisions: The minimum number of steps is proportional to $n \log n$. This result is consistent with the well-known proof of the lower bound for comparison-based sorting algorithms (Cormen et al. 2009), which is typically taught in the introductory courses of computer science programs.

There is a corollary here that is of significance.

Corollary 2 (Conservation of Computational Complexity) The number of decisions required by an optimal algorithm, $E_{comp}$, is, in general, always larger than or equal to the minimum description length of its input or output, whichever is larger. That is, $E_{comp} \geq \max(H(P_{input}), H(P_{output}))$, where $P_{input}$ is the probability function of the probability space of the input and $P_{output}$ is the probability function of the probability space of the output of the algorithm.

That is, even the optimal algorithm to solve a problem can only anecdotally execute in fewer decisions than required by the minimum description length H of the probability space that defines the input or output. Anecdotally here means for some inputs out of all possible inputs (also refer to Sects. 7.2, 7.4, and B.2).
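As a quick numeric illustration of the sorting example above (a sketch, not from the book), the information-based lower bound can be evaluated for a few input sizes:

```python
import math

def mdl_sort_lower_bound(n: int) -> float:
    """Minimum-description-length argument from the example: sorting n
    positive integers bounded by n requires on the order of n * log2(n)
    binary decisions, matching the classic comparison-sort bound."""
    return n * math.log2(n)

for n in (8, 1024, 1_000_000):
    print(f"n = {n:>9}: ~{mdl_sort_lower_bound(n):,.0f} binary decisions")
```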
We saw an example of an output-bound algorithm in the 100-coin-toss example and an example of an input-bound algorithm in the sorting example. Even if an algorithm had all solutions readily stored in a table and could access them in one step, it would still have to read in the input and/or generate the output. The input or the output can only anecdotally be shorter than the minimum description length H, and that always comes at the price of some other inputs or outputs being anecdotally longer. That is, over all possible inputs or outputs, the average will converge to the expectation H. In other words: The complexity of the probability space cannot be reduced and thus is eternally conserved.

As an intuitive consequence, we now know that we will never be able to find finite-time algorithms to calculate all digits of a transcendental number, such as $\pi$ or e. We can still approximate a solution, e.g., not consider all input bits or not generate all bits needed for the output. Furthermore, we can create less general solutions, that is, reduce the size of the sample space, e.g., by not allowing certain inputs. An obvious strategy to reduce the size of the sample space is to discard elements of low probability. This is generally referred to as pruning.

Note that the $E_{comp}$ notion of complexity is more accurate than the common loop-steps notion of complexity (see also the discussion in Appendix B). For example, a commonly used approximation is to say that finding the maximum in an array of numbers takes "linear runtime" in the length of the array. Using Corollary 2, it would take $k \lceil \log_2 n \rceil$ decisions, where n is the largest number in the array and k is the length of the array. The linear-time approximation is less accurate because it does not consider the size of the numbers: Finding the maximum in an array where each number has millions of digits will be slower than finding the maximum in an array where the numbers of digits are in the tens, even when the arrays have equal length. Furthermore, applying Corollary 2 yields an approximation with a unit, bits, which allows complexity comparisons between algorithmic solutions that use different loop structures. As the reader may guess, it also allows us to compare different machine learning algorithms, independent of their implementation.

We will further investigate the conservation of complexity in Sect. 7.2.2 as part of discussing the non-existence of universal and lossless compression. The corollary is not surprising when we remind ourselves that computational complexity is proportional to physical work (see Sect. B.5), which in turn underlies the principle of conservation of energy (Ito 2020). It is important to note, though, that computational complexity is not necessarily proportional to runtime, as the energy could be spent to execute computation in parallel. A conservation principle for computational complexity has also been observed by the Human–Computer Interface community (Norman 2013). The field investigating algorithmic complexity using information theory is called algorithmic information theory (Li and Vitányi 2008).
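Returning to the find-the-maximum example in the passage above, the decision-count estimate of Corollary 2 can be computed directly. The helper below is hypothetical, for illustration only:

```python
import math

def find_max_decision_bound(k: int, n: int) -> int:
    """Corollary 2 estimate for finding the maximum of an array:
    k numbers must be read, each of up to ceil(log2 n) bits, where
    n is the largest value occurring in the array."""
    return k * math.ceil(math.log2(n))

# Equal array lengths, but very different amounts of work when the
# numbers themselves are larger:
print(find_max_decision_bound(1000, 99))       # two-digit numbers
print(find_max_decision_bound(1000, 10**300))  # 300-digit numbers
```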
4.3 Information in Curves

Back to our goal of automating the scientific process (see Chap. 2) and to the tasks of machine learning described in Sect. 3.1. For a classification problem, the maximum total information H learnable from the observations presented in the training data can be calculated using Formula 4.2, setting n to the number of classes, k to the number of instances, and $p_i$ to the relative frequencies of the class occurrences (see the example in Sect. 4.1.3). For a regression problem, however, we are given labels that approximate continuous values rather than discrete observations. For that, we need a different way of measuring information content that is not based on discrete event probabilities. We face a similar problem when we attempt to work with time-based data, such as speech waves, moving objects in videos, or stock prices.

Fortunately, this problem was solved by Benoit Mandelbrot in 1975. He reinvented entropy coming from a very different viewpoint: geometry. He called his complexity measure the fractal dimension. His work started because he wanted to solve the "coastline paradox," also known as the "coast of England problem." The paradox refers to the counterintuitive observation that the length of a coastline can vary greatly depending on the measurement scale used. In other words, the more finely you measure, the longer the coastline appears to be. This is because coastlines are not smooth curves but instead have many twists and turns that are revealed as the measurement scale decreases, similar to regression data. At a large scale, say, 100 km, the coastline of Britain might appear relatively smooth and straight, with few noticeable indentations or protrusions. But if we decrease the scale to 1 km, we will begin to notice small bays and headlands that were not visible at the larger scale. Thus, at this smaller scale, the measured length of the coastline increases. As the scale decreases further, more and more intricate details of the coastline become visible, and the measured length continues to increase. To solve this problem, Mandelbrot reworked the definition of dimensionality by adding the concept of self-similarity.

To explain the concept of fractal dimension, it is necessary to understand what one means by dimension in the first place. The intuitive definition is that a line has 1 dimension, a plane 2 dimensions, and a cube 3 dimensions. The reason a line is 1-dimensional is that there is only 1 coordinate to describe a position on a line. Similarly, a plane is 2-dimensional because there are 2 coordinates to describe a position in the plane. And so on. In general, the number of elements of the vector $\vec{x}$ describing a position precisely is the dimensionality. Mandelbrot added to this intuition the notion of self-similarity. He argued that any geometric object, including simple ones such as the line and the plane, is composed of self-similar objects. One may break a line segment into N self-similar intervals, each with the same length, and each of which can be scaled by a factor of N to yield the original segment. When we break up a square into N self-similar sub-planes, however, the scaling factor would be $I = N^{1/2}$, as there are $I^2$ sub-planes. That is, the square may be broken into $N^2$ self-similar copies of itself, each of which must be magnified by a factor of N to yield the original figure. In analogy, we can
Fig. 4.1 The line, square, and cube can all be partitioned into miniature (self-similar) versions of themselves. While we can choose the magnification factor I, to reconstruct the original from the miniature versions we need $N = I^D$ of them. We call I the magnification factor and D the dimensionality (see Definition 4.16)
decompose a cube into $N^3$ self-similar pieces, each of which has magnification factor N. Figure 4.1 illustrates this idea.

For curves, it is not clear what the smallest self-similarity is. This means we will have to tune the scaling factor to find self-similar units that the curve can be broken into and then count them. As explained above, this problem was originally coined by Benoit Mandelbrot as measuring the length of the coast of England. Due to its rather chaotic shape, the result of any measurement will differ drastically depending on whether the scaling factor is kilometers, meters, inches, or pebbles: The coastline's measured length changes with the length of the measuring stick used. The smaller the stick, the longer the coast. Mandelbrot, however, found that one can quantify the rate of increase as a function of the decrease of the measuring stick. This rate is the fractal dimension. In other words, the fractal dimension of a coastline quantifies how the number of scaled measuring sticks required to measure the coastline changes with
the scale applied to the stick. The difference between the coastline and the example of Fig. 4.1 is that D is not an integer but a fraction; hence the name fractal dimension. The same idea can be applied to regression or time-based data given as a curve (just imagine the curve being a coastline). The formal definition is as follows:

Definition 4.15 (Fractal Dimension)

$$D = \frac{\log N}{\log I} = \log_I N, \tag{4.8}$$
where D is the fractal dimension, N the number of self-similar pieces, and I the scaling factor. This leads us to the following definition of dimension.

Definition 4.16 (Dimension) The exponent of the number of self-similar pieces with scaling factor I into which a set may be broken.

An algorithm that is able to estimate the fractal dimension of a curve is called box counting. Box counting measures the fractal dimension (more specifically, the Hausdorff dimension) by breaking a dataset, object, image, etc., into smaller and smaller pieces that are typically "box"-shaped and analyzing the pieces at each scale. This is repeated until a further decrease is not meaningful anymore because the dimensionality has converged. A practical implementation of a planar version works as illustrated in Algorithm 6. Given a curve as a two-dimensional binary image, the algorithm overlays a rectangular grid, counting how many squares (boxes) of the grid cover part of the plot, that is, contain both black and white pixels. Boxes with only white or only black pixels are discarded. The process is repeated with a finer and finer grid of boxes until the granularity is fine enough that no box contains both black and white pixels (or we have reached pixel accuracy). In the end, the fractal dimension is the slope of a regression line between the logarithms of the box counting results $B_i$ on the y-axis and the logarithms of the different magnifications $\varepsilon_i$ on the x-axis (see also Definition 4.15).

To convert a fractal dimension to bits, logarithm laws tell us to divide the result $D_{fractal}$ of running the algorithm by $\log_I 2$. Multiplying by the number of points counted in the last iteration then yields the total information represented by the curve in bits. That is,

Definition 4.17 (Curve Entropy)

$$H_{curve} = N \cdot \frac{D_{fractal}}{\log_I 2},$$
Algorithm 6 Estimating the fractal dimension of a curve. Given a binarized image of a curve, the function returns the estimated fractal dimension $D_{fractal}$, the largest magnification I, and the number of boxes counted at that magnification.

Require: image: array of boolean variables containing a 2-D plot of the curve (where 1 is a curve pixel and 0 is a "white" pixel)
1: function BOXCOUNT(image)
2:   p ← minimal dimension of image
3:   n ← greatest power of 2 ≥ p
4:   boxsizes ← []
5:   fill boxsizes with $2^n$ to $2^1$
6:   countlist ← []
7:   for size in boxsizes do
8:     $B_i$ ← number of non-empty and non-full boxes of side length size in image
9:     if $B_i$ == 0 then break
10:    end if
11:    add $B_i$ to countlist
12:  end for
13:  regression-fit a line with $\log_2 B_i \in$ countlist as y-axis and $\log_2$ size $\in$ boxsizes as x-axis
14:  D ← slope of fitted linear regression line
15:  I ← size(countlist)
16:  return −D, I, countlist[size(countlist) − 1]
17: end function
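A compact Python version of Algorithm 6 might look as follows. This is a sketch under simplifying assumptions (the image is cropped to a power-of-2 square, and NumPy's polyfit stands in for the regression step); it is not the book's reference implementation:

```python
import numpy as np

def box_count(image: np.ndarray):
    """Estimate the fractal dimension of a binarized curve image
    (True = curve pixel) in the spirit of Algorithm 6."""
    p = min(image.shape)
    n = 2 ** int(np.floor(np.log2(p)))  # largest power of 2 <= p
    image = image[:n, :n]
    sizes, counts = [], []
    size = n // 2
    while size >= 1:
        boxes = 0
        for i in range(0, n, size):
            for j in range(0, n, size):
                block = image[i:i + size, j:j + size]
                # count boxes containing both curve and background pixels
                if block.any() and not block.all():
                    boxes += 1
        if boxes == 0:
            break
        sizes.append(size)
        counts.append(boxes)
        size //= 2
    # fractal dimension = negative slope of log(count) vs. log(box size)
    slope, _ = np.polyfit(np.log2(sizes), np.log2(counts), 1)
    return -slope, counts[-1]

# A straight diagonal line should come out close to dimension 1.
img = np.zeros((256, 256), dtype=bool)
np.fill_diagonal(img, True)
d, n_points = box_count(img)
print(f"estimated dimension ~ {d:.2f} from {n_points} boxes")
```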
where $H_{curve}$ is the (estimated) information content contained in the curve in bits, $D_{fractal}$ is the (estimated) fractal dimension, I is the (estimated) scaling factor, and N is the number of points sampled from the curve at scaling I to obtain $D_{fractal}$.
Geometrically, this transformation translates the shape approximated by the boxes of Algorithm 6 into a count of the number of self-similarities required when the self-similarity chosen to represent the same structure is a binary tree branch. Such a geometric "recoding" can be hard to imagine at first, but it is exactly what happens in a computer. Computers only know "on" and "off," and so when a curve is displayed on a screen, that curve is internally recoded into a large set of binary states in memory. The decisions of whether a memory cell should be "on" or "off" can be represented as a binary decision tree branch. Note that $N < \infty$ is guaranteed only for physical measurements. Mathematical constructs can, of course, have infinitely small scaling and therefore require $N = \infty$, which results in $H_{curve} = \infty$.
4.4 Information in a Table

Back to the table used in the scientific process defined in Definition 2.1. If we assume that we have a target column $f(\vec{x})$ filled with the experimental results, then our first intuition, based on the previous sections, should be that we can now calculate the bits of information H learnable from a given table of data. If it is a classification problem, we get the probability function of the classes and apply
Formula 4.2. If it is a regression problem, we use Algorithm 6 on the target column. This is indeed a valid way to estimate the information gained from our experiments. The problem with this approach, however, is that something seems to be missing, because we do not take into account the input columns $x_i$. This intuition is valid: We need to remember that we are neither modeling the input nor the output. We are modeling the function from the input to the output, based on the experimental observation of the interaction of the two recorded in our table.

Let us look at this for a classification problem. The information-theoretic concept that models the information contained in a discrete function is called the mutual information. Mutual information is defined as follows (Shannon 1948b).

Definition 4.18 (Mutual Information)

$$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y), \tag{4.9}$$
where $H(X)$ is the information contained in the probability function X, $H(Y)$ is the information contained in the probability function Y, and $H(X, Y)$ is the joint entropy. The joint entropy is the information H gained from the joint probability of two events happening together. The joint entropy, used by mutual information (Definition 4.18), is defined as follows:

Definition 4.19 (Joint Entropy)

$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 [P(x, y)], \tag{4.10}$$
where x and y are particular elements of the probability functions X and Y, respectively, and $P(x, y)$ denotes the probability of events from X and Y occurring together. As an important technicality, $P(x, y) \log_2 [P(x, y)]$ is defined to be 0 if $P(x, y) = 0$.

Intuitively, the two probability functions X and Y each contain information by themselves: $H(X)$ and $H(Y)$. If we looked at the two outcomes together, we would have a new minimum description length of $H(X) + H(Y)$. For example, if X describes 100 coin tosses and Y describes 50 coin tosses, then $H(X) = 100$ bits and $H(Y) = 50$ bits. Recording both coin toss experiments together results in $H(X) + H(Y) = 150$ bits of information – but only if the coin tosses are independent of each other and there is no influence between the two experiments. The term $-H(X, Y)$ corrects the mutual information for the influence of the two experiments on each other. For example, if the coin tosses modeled by Y were always identical to the corresponding coin tosses modeled by X, then the coin tosses in Y would yield no additional information. In Formula 4.10, the probability of a toss in Y given the corresponding toss in X would then always be 1, which in turn sums up to $H(X, Y) = 100$ bits, and the total mutual information would be $I(X; Y) = 100\ \text{bits} + 50\ \text{bits} - 100\ \text{bits} = 50$ bits – only the information that is mutual between X and Y.
In general, two important relationships between mutual information and entropy should be kept in mind. First, the upper limit for the mutual information is

$$I(X; Y)_{max} = \min(H(X), H(Y)). \tag{4.11}$$

Intuitively, using the communication analogy: One cannot receive more information than was sent, and sending more information than can be received does not yield any gains. Second, the lower limit for the mutual information is

$$I(X; Y)_{min} = 0. \tag{4.12}$$
Note that the mutual information can also be defined over more than two variables, which is not discussed in this book. $I_{min}$ can be negative for three or more variables. That is, uncertainty can be added to a relationship of two variables by a third, confounding variable.

Back to the table generated by the scientific process (see Definition 2.1). If our table contained an input vector $\vec{x}$ of dimension 1, that is, if the table only contained one input column and one target column $f(\vec{x})$, the easiest way to determine the amount of information needed to model the function would be to calculate the mutual information $I(X; Y)$, where X describes the probability function of the input column and Y describes the probability function of the target column. Low mutual information would mean a higher need for memorization, and high mutual information would mean a smaller need for memorization. The problem, however, is that most experiments of practical value contain more than one input column.

The issue then becomes how to combine the information contributed by the various input variables $X_i$. Combining them trivially as probabilities of joint events $p(x_1, x_2, \ldots, x_d)$ will easily make all the input rows unique. That is, the joint probability of an input vector becomes $p = \frac{1}{n}$, where n is the number of rows in the table. Doing that leads to the same result as just calculating the information gained by the target column. This can be seen as follows: Let $H(X)$ be the information contained in all rows obtained by combining the probabilities of the experimental variables $x_i$ as the probability of the joint event $p(x_1, x_2, \ldots, x_d)$. Let $H(Y)$ be the information contained in the target column. By Definition 4.18, $I(X; Y) = H(X) + H(Y) - H(X, Y)$. However, since the events appear together in the table, the probability of the output occurring together with its input row is trivially $p(y | x_1, x_2, \ldots, x_d) = 1$. So the joint minimum description length is the same as the maximum information of either the input or the output. That is, $H(X, Y) = \max(H(X), H(Y))$. Since $H(Y) < H(X)$, unless there are more classes in the target function than unique input rows (in which case we have a regression task, which was excluded by our initial assumption above), $I(X; Y) = H(X) + H(Y) - H(X) = H(Y)$.

So, indeed, without knowing how to combine the vector elements $x_i \in \vec{x}$ of the input rows of our table, we cannot easily get to the mutual information. Therefore, unless the table only contains one input column, the information gained in a set of experiments is approximated by calculating $H(Y)$ of the results column.
For a classification problem, this is done by calculating the information using Eq. 4.2, and for regression (but not time-based) problems, by using Algorithm 6. There is a generalization of mutual information to a set of variables, called the total correlation (Watanabe 1960). The total correlation is the amount of information shared among the variables in a set. Total correlation is applicable to the scientific process table (Definition 2.1) and will lead to a more accurate approximation. However, calculating it can be expensive for tables with a large d. We can do the best job of estimating the information content of a function contained in a table when we know how the elements of the input vectors are combined. However, this is usually the task of a model. So in order to better estimate the information gained by an experiment, we need to look at how a specific machine learning algorithm models the input to get to the output. This is done in the following chapters.
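For the single-input-column case discussed above, $H(Y)$ and $I(X; Y)$ can be estimated from empirical frequencies. The following sketch (illustrative only, with a hypothetical toy table) applies Definitions 4.18 and 4.19 directly:

```python
import math
from collections import Counter

def entropy(values) -> float:
    """H via Eq. 4.2 with k = 1: entropy of one column in bits,
    using relative frequencies as probability estimates."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y) (Definition 4.18); the joint
    entropy is computed over the empirical joint outcomes."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy table: the target column y is a noisy function of input column x.
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 1]
print(f"H(Y)   = {entropy(y):.3f} bits")
print(f"I(X;Y) = {mutual_information(x, y):.3f} bits")
```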
4.5 Exercises

1. Bit Arithmetic:
(a) How many bits do you need to encode the integer 126?
(b) How many bits do you need to encode 32.56 (technically and theoretically)?
(c) Assume you have two positive integers of size n bits each: How many bits does the result of (1) addition, (2) subtraction, and (3) multiplication of the two numbers maximally generate?
(d) Assume you have an 8 × 8 matrix of 8-bit numbers. You now (1) select the maximum number and (2) calculate the average of that matrix. How many bits do you need to store the result in each case? How many binary matrices of size 8 × 8 are there in total?
2. Information Content:
(a) Assume you have a black-and-white (binary) image with a resolution of 64 × 64 pixels. What is the (1) minimal and (2) maximal information content in bits of that image? What do these images look like?
(b) Suppose 5 pairs of socks are in a drawer. How many socks do you have to minimally pick to guarantee that at least one pair is chosen?
(c) Each of 15 red balls and 15 green balls is marked with an integer between 1 and 100 inclusive; no integer appears on more than 1 ball. The value of a pair of balls is the sum of the numbers on the balls. Show that there are at least two pairs, consisting of 1 red and 1 green ball, with the same value. Show that this is not necessarily true if there are 13 balls of each color.
(d) Estimate the memory required to implement a "20 Questions" game in a self-contained device (see http://20q.net/). The original game was implemented using a neural network trained on user input, but, for this exercise, you can assume a decision tree.
4.6 Further Reading

For the information-theoretic concepts mentioned in this chapter, I recommend at least skimming through the cited original sources. For a historic overview of how the concept of information/entropy/energy evolved, see:

• Ayres, R.: "A Brief History of Ideas: Energy, Entropy and Evolution". In: Energy, Complexity and Wealth Maximization. The Frontiers Collection. Springer, 2016, https://doi.org/10.1007/978-3-319-30545-5_3

I also recommend digging into fractal geometry, as this topic is under-served in this book and it is massively inspiring:

• Kenneth Falconer: "Fractal Geometry: Mathematical Foundations and Applications", 3rd Edition, Wiley and Sons, 2014, ISBN 111994239X (in 3rd edition as of the writing of this book).
Chapter 5
Capacity
In the previous chapter, we investigated how to estimate the information content of the training data in bits. Since the purpose of a model is to generalize the experimental results into a rule that can predict future experiments, most people would intuitively agree that a model should not be more complex than the training data. We introduced this intuition in Chap. 2, especially Sect. 2.2.1. More formally, the information content of the training data should give an upper limit for the complexity of the model. But how does one measure the complexity of a model? To measure the complexity of the model, we need the notion of model capacity, which is the topic of this chapter.

The word capacity describes how much something can contain. For example, the capacity of an elevator describes how many people can ride in it. In general, the physical capacity of a container describes how much content can be stored in it. The capacity of a container is often considered equivalent to its volume (the actual capacity, however, must always be a bit smaller than the volume of the container, as we have to account for technicalities like the walls of the container). In computer science, memory capacity describes how much information can be stored in some information storage, for example, RAM or a hard disk. As discussed in the previous chapter, volumes of data are measured in bits and, for practical reasons, in bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, etc., whereby 1 byte equals 8 bits.

As explained in Sect. 4.4, the complexity of training data is measured by the mutual information between the input (experimental factors) and the output (experimental results). This is because the training data describe a function from the input to the output. Therefore, the information capacity of a model describes the upper-limit complexity of a function that can be stored in the model. Formally, it is defined as follows.

Definition 5.1 (Information Capacity) $C = \sup I(X; Y)$, where C is the capacity measured in bits and $I(X; Y)$ is the mutual information as defined in Definition 4.18.
The supremum (abbreviated sup) of a subset S of a partially ordered set P is the least element in P that is greater than or equal to each element of S, if such an element exists. Consequently, the supremum is also referred to as the least upper bound. Informally, the capacity is therefore the least upper bound of the mutual information over each and every possible function that could be implemented from X to Y. This notion can be counter-intuitive at first, and, in fact, it has been a topic of discussion and challenges for decades. For this reason, this book contains Appendix B. It provides a historic perspective on complexity measurements of functions that does not contribute to the main thread of this book but is intended to promote a big-picture understanding of complexity.
5.1 Intellectual Capacity

Just as we did in Chap. 2, let us first understand how capacity is modeled in humans before we apply the metrics to machine-generated models. A prominent example of an early study of what humans are able to learn was conducted by Binet and Simon in 1904 (Binet and Simon 1904). They looked at what children of different social and ethnic status can learn and ultimately came up with our currently well-known measure of the Intelligence Quotient (IQ). Their study was so impactful that, as of the writing of this book, the standard IQ test is still called the Stanford-Binet test (Terman 1986). Their definition of intelligence is short and therefore, in the author's opinion, too general, but it serves as a very valuable starting point. Binet and Simon defined intelligence as "the ability to adapt."

The problem with this definition is that there are many things that seem to be able to adapt without being intelligent. For example, when one puts a foot into sand, the sand adapts around it to form a footprint. Nobody would think sand is intelligent. Instead, the ability of the sand to adapt around the foot is explained by Isaac Newton's "actio est reactio" (action is followed by reaction) principle (Newton 1726a). Since this principle is universal, the entire universe would be intelligent as per Binet and Simon's definition. The definition worked for them, as they limited the scope of this definition to children learning in school. There were also other, socio-political issues with their work that we will not discuss here, as the scope of this book is artificial intelligence.

For our purposes, we should be able to distinguish smart phones from regular phones, or a primate's brain from a spider's ganglion. This requires being more specific about the adaptation part. What makes a smart phone more intelligent than a regular phone is that the regular phone is built for one exact task, while a smart phone allows one to install applications that change its purpose. The spider's ganglion has evolved for the spider to find a suitable place for spinning a web, catching food, killing and eating the food, and finding a mate and reproducing. A primate's brain, while essentially also performing the same tasks of nesting, metabolizing, and reproducing, allows the primate to use tools and perform tasks in a variety of
different ways than dictated by evolution. This book will therefore use the following, more specific, definition of intelligence.

Definition 5.2 (Intelligence) The ability to adapt to new tasks.

Of course, adapting to new tasks is what students do in school or what makes a personal assistant device be called "artificial intelligence." However, the above definition does not help us with any quantification of the ability. For this, again, we need capacity.

Definition 5.3 (Intellectual Capacity) The number of tasks an intelligent system is able to adapt to.

So intellectual capacity is a volume of tasks. As one can see, this is hard to measure for systems as complex as biological intelligence. A typical IQ test will approximate the number of tasks a person can adapt to by showing the subject a number of different problems to solve in a well-defined amount of time. The IQ is then a relative measure that compares the number of tasks correctly solved by an individual against the number of tasks typically solved by a population. An IQ of 100 means the person has solved the same number of tasks as an average person would. A higher IQ number means more tasks, a lower number means fewer tasks, scaled along the standard deviation of the population's IQ.

For artificially created systems, we can be much more rigorous. Since Chap. 2, we know that we adapt a finite state machine to the function implied by our training data. Therefore, we can define machine learning model capacity as follows:

Definition 5.4 (Model Capacity) The number of unique target functions a machine learning system is able to adapt to.

Even though this definition seems straightforward, measuring the actual capacity of a machine learner is not. This is discussed in the following sections.
5.1.1 Minsky's Criticism

The first account of a deeper discussion on the intellectual capacity of machine learning systems was popularized in 1969. In that year, Marvin Minsky and Seymour Papert (Minsky and Papert 1969) argued that there were a number of fundamental problems with researching neural networks. They argued that there were certain tasks, such as the calculation of topological functions of connectedness or the calculation of parity, that Perceptrons could not solve.1 Of particular significance was the inability of a single neuron to learn to evaluate the logical function of exclusive-or (XOR). The results of Minsky and Papert's analysis led them to the conclusion that, despite the fact that neurons were "interesting" to study,
1 Perceptron being the original name for a single neuron.
ultimately neurons and their possible extensions were, what they called, a "sterile" direction of research. Interestingly enough, their problem had already been solved by Thomas Cover in 1964. His PhD thesis discussed the statistical and geometrical properties of "linear threshold devices"; a summary was published as (Cover 1965). Thomas Cover was among the first people to actually follow the concluding comment of Rosenblatt's original Perceptron paper (Rosenblatt 1958): "By the study of systems such as the perceptron, it is hoped that those fundamental laws of organization which are common to all information handling systems, machines and men included, may eventually be understood." That is, Thomas Cover worked on understanding the Perceptron's information properties.
5.1.2 Cover's Solution

Thomas Cover found that a single neuron with linear activation, which is a system that thresholds the experimental inputs $x_i$ using a weighted sum $\sum_{i=1}^{n} x_i w_i \geq 0$ (with $w_i$ being the weights), has an information capacity of 2 bits per weight $w_i$. That is, the supremum of the mutual information over all functions it can model is $\sup I(X; Y) = 2 \frac{\text{bits}}{\text{weight}}$. Cover's article (Cover 1965) also investigated non-linear separation functions, but this is beyond the scope of this chapter. The main insight was that the information capacity of a single neuron could be determined by a formula that was already derived in 1852 by Ludwig Schläfli (Schläfli 1852). Cover called it the Function Counting Theorem. It counts the maximum number of linearly separable sets C of n points in arbitrary position in a d-dimensional space.

Theorem 5.1 (Function Counting Theorem)

$C(n, d) = C(n-1, d) + C(n-1, d-1)$ (5.1)

with boundary conditions $C(n, 1) = C(1, d) = 2$ and $C(n, 0) = C(0, d) = 0$.

The iterative equivalent of the formula is

$C(n, d) = 2 \sum_{l=0}^{d-1} \binom{n-1}{l}$, (5.2)

where n is the number of points and d is the number of dimensions. A notable special case of C is $C(n, n) = 2^n$. In order to get the intellectual capacity of a neuron, we set d to the number of weights and n to the number of rows of a training table in a binary classification
problem. Since $C(n, n) = 2^n$, which equals the $2^n$ possible binary labelings of an n-row table, we know that a single neuron has the intellectual capacity to train any binary classifier where the number of rows in the training table is smaller than or equal to the number of experimental factor columns. We are now already in a position to respond to Minsky's criticism: While there are $2^{2^2} = 16$ unique tables of the form $(x_1, x_2, f(x_1, x_2))$ with $f(\vec{x}) \in \{0, 1\}$, representing all 2-variable boolean functions, a single neuron can only represent $C(4, 3) = 14$ functions of 4 points with 3 parameters (weights). By exclusion (that is, providing the solution for the 14), one can easily find that the two that are missing are XOR and NXOR (boolean equality). However, there is an even better way to identify the untrainable functions: using information measurements. Information measurements allow us to find the untrainable functions a priori, that is, before even trying to model them in any way.
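For readers who want to experiment, the recursion and its closed form are straightforward to implement. The following is a minimal sketch in Python (the function names are my own, not from the book) that verifies $C(4, 3) = 14$ and the special case $C(n, n) = 2^n$:

from math import comb

def C_recursive(n: int, d: int) -> int:
    # Function Counting Theorem, Eq. 5.1, with its boundary conditions
    if n == 0 or d == 0:
        return 0
    if n == 1 or d == 1:
        return 2
    return C_recursive(n - 1, d) + C_recursive(n - 1, d - 1)

def C_closed(n: int, d: int) -> int:
    # Iterative equivalent, Eq. 5.2
    return 2 * sum(comb(n - 1, l) for l in range(d))

assert C_recursive(4, 3) == C_closed(4, 3) == 14   # 16 - 14 = 2: XOR and NXOR
assert all(C_closed(n, n) == 2 ** n for n in range(1, 12))  # C(n, n) = 2^n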
5.1.3 MacKay's Viewpoint

In his book (MacKay 2003), MacKay presents a visual proof of the Function Counting Theorem (created originally by Yaser Abu-Mostafa) together with an interesting viewpoint. The viewpoint is to understand any machine learning algorithm as a storage of parameters, that is, as an encoder of the training data within Shannon's communication model (Shannon 1948b). Figure 5.1 visualizes the idea. Any classification algorithm serves as both an encoder and a decoder in a Shannon communication model. During training, we create the encoder, and during the prediction use of the model, we use it as a decoder. This viewpoint has the advantage that communication-model-based information theory can be applied to machine learning. MacKay assumes the inputs of the encoder are n points and an arbitrary binary labeling. The output of the encoder is the parameters (weights) of the classifier. The decoder receives the (perfectly learned) weights over a lossless channel. The question he then asks is: "Given the received set of weights and the knowledge of the data, can the decoder reconstruct the original labels for all the points?" In
Fig. 5.1 Shannon's communication model applied to labeling in machine learning (data and labels → encoder [learning method] → weights → channel [identity] → weights → decoder [neural network] → labels'). A dataset consisting of n sample points and the ground truth labeling of n bits are sent to the neural network. The learning method converts it into a parameterization (i.e., network weights). In the decoding step, the network then uses the weights together with the dataset to try to reproduce the original labeling
other words, the classifier is interpreted as memory that stores a labeling of n points relative to the data, and the question is how much information can be stored by training a classifier. That is, we ask about the memory capacity of a classifier.

Note that memory itself can already be seen as a mechanism that is trainable: n bits of memory can adapt to $2^n$ different states. Defining each state as a new task (see Definition 5.3) makes memory quite intelligent: its intellectual capacity grows exponentially with the storage capacity. However, as we intuitively know, storage and recall of knowledge is just one part of being intelligent. The other part is to generalize from the knowledge and handle new, unknown situations. Plain memory cannot generalize because it has no separation function. Technically speaking, the separation function of plain memory could be defined as the identity function $f(\vec{x}) = \vec{x}$. However, seeing a machine learner as memory has advantages for understanding its behavior. We will therefore formalize this viewpoint as memory-equivalent capacity (MEC), which is presented in the following section.
5.2 Memory-Equivalent Capacity of a Model

As seen in Sect. 5.1.2, $C(4, 3) = 14$ binary functions of 4 points can be represented with 3 parameters. However, how would we know which 2 are not representable without trying? Such empirical analysis can be tedious to impossible once n and d are larger. In order to understand how to limit the search space of which functions can be represented and which cannot, we need to remember the discussion from Chap. 4, especially its conclusion, Sect. 4.4. We saw that the information content of a table can be upper-bounded by measuring the information content of the target function. As a consequence, the mutual information of any function is potentially the highest when the outcomes are equi-distributed.

Of the 16 boolean functions of 2 variables, only $\binom{4}{2} = 6$ functions have equi-distributed outcomes. Figure 5.2 shows these functions. The functions are indexed by interpreting the target column as a binary number. It is easy to see that $f_{10}$ and $f_5$ are directly correlated with $x_2$, thereby rendering $x_1$ redundant. Only the two states of $x_2$ have to be known to determine the outcomes. The mutual information is therefore 1 bit. $f_{12}$ and $f_3$ are similarly exclusively dependent on $x_1$. Only $f_6$ (XOR) and $f_9$ (boolean equality) require the full knowledge of both variables to determine the outcomes. This means XOR and boolean equality are the two boolean functions with the highest information content out of the 16 possible 2-variable functions. These two functions require memorization with a memory capacity of 4 bits.

Fig. 5.2 All 6 boolean functions of 2 variables that have target columns with equi-probable outcomes (highest entropy). As explained in Sect. 5.2, $f_6$ (XOR) and $f_9$ (NXOR) have the highest mutual information content (4 bits) and can therefore not be modeled by a single neuron, which has a capacity of about 3.8 bits

To get to the memorization capacity of a single neuron, all one has to do is to define the function as a self-similarity, the binary decision as the magnification factor, and apply Definition 4.15. The fractal dimension of a single neuron with 3 parameters (two weights and one bias) is $D = \log_2 C(4, 3) = \log_2 14 = 3.807\ldots$ bits, which is less than the 4 bits of complexity of XOR and boolean equality. Therefore, they are not representable by a single neuron with 3 parameters. The generalization of this example leads us to the following definition that allows us to measure the intellectual capacity of any machine learner in bits.

Definition 5.5 (Memory-Equivalent Capacity (MEC)) A model's intellectual capacity is memory-equivalent to n bits when the model is able to represent all $2^n$ binary labeling functions of n points.

In analogy to memory capacity, MEC is understood to be the maximum. That is, if a model's MEC is n bits, it is naturally also $n-1$ bits, but whenever specified, MEC is given as the maximum n such that all $2^n$ binary labeling functions can be represented. Intuitively speaking, a model has an MEC of n bits when it can memorize all binary functions that could be represented by a training table of n rows. While Definition 5.5 assumes that the functions are binary classification problems, it is easy to see that the definition can be extended to n-ary classification problems by switching the logarithm base (just like storing an n-ary number in a binary memory). The definition can be extended to regression problems by understanding the following: the most complex regression problem is a problem where there is a unique value in each row of the target column of the training data. That is, the most complex regression problem is equivalent to an n-class classification problem, where n is the number of rows in the training data table (Sect. 4.3 explains how to estimate the information content of a curve in bits). Therefore, we will continue to refer to memory-equivalent capacity (MEC) in bits to quantify the intellectual capacity of a machine learning algorithm, regardless of the type of classification or regression task. Note that the definition of MEC is a specialization of the VC dimension. More on that in Sect. B.3.

A direct corollary from Definition 5.5 is the following:

Corollary 3 (All Models with the Same MEC Are Equally Capable) A model with n bits of MEC is able to represent all $2^n$ binary labeling functions of n points.
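The single-neuron capacity argument above can also be checked empirically. The following sketch (a hypothetical random-search experiment of my own, not from the book) enumerates which of the 16 two-variable boolean labelings a 3-parameter threshold neuron can produce:

import itertools
import random
from math import log2

rows = list(itertools.product([0, 1], repeat=2))  # the 4 input points

def neuron(w1, w2, b, x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

representable = set()
random.seed(0)
for _ in range(200_000):  # random search over the 3 parameters
    w1, w2, b = (random.uniform(-1, 1) for _ in range(3))
    representable.add(tuple(neuron(w1, w2, b, x1, x2) for x1, x2 in rows))

print(len(representable), log2(len(representable)))  # expected: 14, ~3.807 bits
print(set(itertools.product([0, 1], repeat=4)) - representable)
# the two missing labelings, (0,1,1,0) and (1,0,0,1), are XOR and NXOR

With enough random parameter samples, exactly $C(4, 3) = 14$ labelings appear, and the two missing ones are XOR and NXOR, matching the capacity argument.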
Function representation is the only capability of a model, and beyond all $2^n$ labeling functions there are no further functions to represent. That is, architectural considerations, training success, separation functions, and other factors either reduce or increase the MEC. This can make the determination of the actual MEC difficult. However, the memory-equivalent capacity can be measured and sometimes even determined analytically for any machine learner. It therefore represents a unified capacity view on all models. In being that, it also directly connects to a notion discussed in Appendix A: the Shannon Number.

In his 1950s article (Shannon 1950), Claude Shannon devised a conservative lower bound of the game-tree complexity of chess. The game-tree complexity of a game is the number of legal game positions reachable from the initial position of the game, that is, the size of the decision tree that is spanned by taking all legal moves into account. Shannon estimated the complexity of chess to be $10^{120}$ possible game states (nodes of the decision tree), based on an average of about $10^3$ possibilities for a pair of moves consisting of a move for White followed by a move for Black, and a typical game lasting about 40 such pairs of moves. This number can be converted into bits by converting the game-tree branches into binary decision-tree branches. That is, at each move, we ask: "move figure on field $x_1, y_1$ to field $x_2, y_2$, yes/no?". This is done using Definition 4.15 by setting the number of states as the self-similarity and the magnification to 2: $\log_2 10^{120} = 398.6313\ldots$ bits. That is, the binary game tree has a depth of a little more than 398 levels. So we know that all $2^{398.6313\ldots} = 10^{120}$ game states can be memorized once the memory-equivalent capacity of any machine learner (for example, a binary decision tree) is 399 bits. We also know that a typical game requires at most 399 bits to memorize.

Example A simple upper bound for the game-tree complexity of Tic-Tac-Toe is: 9 possible initial moves, 8 possible responses, and so on, so that there are at most $9! = 362880$ total games. The memory-equivalent capacity (MEC) would therefore be $\log_2 9! = 18.469\ldots$ bits. This means a binary decision tree with 19 levels can perfectly play the game. In fact, any machine learning or artificial intelligence tool that is able to implement the sequence of 19 binary decisions can learn to play the game perfectly. If we were to train a neural network based on n observed games, we would need maximally $n \cdot 19$ bits of MEC in our neural network.

MEC helps us understand an upper bound of the intellectual capacity needed to solve a task. For example, as discussed above, the MEC for memorizing v-variable boolean functions is $\log_2 2^{2^v} = 2^v$ bits (see also the discussion in Sect. B.6). The MEC for a given training table for binary classification is smaller than or equal to the number of rows in the table. The MEC for Tic-Tac-Toe is $\approx 20$ bits. The MEC for chess is $\approx 400$ bits.

MEC, however, is not equivalent to information capacity, since information capacity depends on the dimensionality of the data (that is, the number of experimental factors) and the type of separation function used. This has been analyzed in depth by Cover (Cover 1965).
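These bit conversions are quick to reproduce; a small computational sketch:

from math import log2, factorial

print(120 * log2(10))      # chess: log2(10^120) ≈ 398.63 bits -> 399-bit MEC
print(log2(factorial(9)))  # Tic-Tac-Toe: log2(9!) ≈ 18.47 bits -> 19 levels
print(log2(26830))         # with symmetries (Beasley 2002): ≈ 14.71 bits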
Fig. 5.3 The shape of the $C(n, d)$ function (see Theorem 5.1). The y-axis represents the probability for memorization P. The x-axis represents the number of points n over the dimensionality d. As explained in Sect. 5.1.2, $P = 1$ as long as the number of points is smaller than or equal to the number of dimensions (that is, $\frac{n}{d} \leq 1$). When the dimensionality d increases (various colored lines), $P = 1$ is pushed beyond $\frac{n}{d} = 1$ towards $\frac{n}{d} = 2$ (where $P = 0.5$, see exercises). In fact, as explained in (MacKay 2003), as $d \to \infty$, one can guarantee to memorize almost up to 2 points per parameter. Note that the curves are asymptotic with the x-axis for $\frac{n}{d} \to \infty$
The dependency on the dimensionality can easily be seen by calculating the $C(n, d)$ function for large d and n: as d approaches infinity, the difference $|C(n, d) - C(2n, d)| < \epsilon$ grows smaller and smaller. This is visualized in Fig. 5.3. Consistent with Definition 5.1, the supremum of the mutual information as $d \to \infty$ is 2 bits per parameter.

Theorem 5.2 (Information Capacity of a Linear-Separator Detection Model, Cover 1965) A linear-separator model's information capacity is 2 bits per parameter when it performs binary classification.

That is, even if we tried to memorize uniformly random data with our model, if the dimensionality is large enough, we would be able to generalize almost 2:1. Counter-intuitive behavior in statistical modeling for large dimensionality is often dubbed the "curse of dimensionality." This is discussed further in Chap. 16. Note that calculating the information capacity of a machine learner that uses a different separation function (e.g., $x^2$) is discussed in (Cover 1965).

Empirically, one can easily see that the information capacity of a linear-separation machine learner for a c-class problem is $\frac{c}{c-1}$ bits per parameter, where c is the number of classes in the classification problem.

LAW 5.1 (INFORMATION CAPACITY OF A LINEAR-SEPARATOR CLASSIFICATION MODEL) A linear-separator classification model's information capacity is $\frac{c}{c-1}$ bits per parameter, where c is the number of classes.

One way to verify the law is to use the equilibrium machine learner (see Algorithm 8), train it on a dataset of randomly distributed points that is labeled with c equi-distributed classes, and divide the number of instances memorized by the number of thresholds generated. Law 5.1 holds for the equilibrium case, where we assume $w_i = 1$, as well as any case where the $w_i$ are set randomly.

Example Machine learning is used for a 3-class classification problem based on 10,000 instances of experimental data. The classes are distributed as follows: 5000 instances of class 1, 3000 instances of class 2, and 2000 instances of class 3. What is the absolute maximum size of machine learner that can be used before the concern of overfitting all samples of training data and/or wasting resources becomes inevitable? We do not know the number of attributes or anything else about the input, so we take the upper limit given by the Shannon entropy H of the class distribution, which can be determined using Eq. 4.2 and setting $b = 2$, $k = 10{,}000$, $p_1 = 0.5$, $p_2 = 0.3$, and $p_3 = 0.2$. Arithmetic determines $H = 14{,}900$ bits. Since we know that the mutual information of the function represented by the table is $I \leq H$ and we are looking for an upper bound, we assume $I = H$. The information capacity C of a classification model composed of linear separators is determined by Law 5.1 to be $C = \frac{3}{3-1} = 1.5$ bits per parameter. We should therefore start to be concerned about the machine learner being oversized if the chosen model has more than $\frac{H}{1.5} = 9933$ parameters.

Information capacity is a more accurate measure than MEC when it can be determined, but determining MEC is usually easier. Ideally, we want to steer away from the memorization of every single possible configuration and generalize from training on a subset of configurations, most importantly because it may be infeasible to train with every possible configuration. For example, it is impossible to train a machine learner with every possible instance of a cat image. Not only are there many different variants of cats, but there are also many different positions the cat can be in and many different viewpoints and angles the cat can be photographed from. A secondary aspect of generalization is efficiency. When rotations and reflections of positions are considered the same, there are only 26,830 possible Tic-Tac-Toe game configurations (Beasley 2002), i.e., $MEC_{Tic\text{-}Tac\text{-}Toe} \approx 15$ bits. In general, we want the MEC of the model to be small, but, at the same time, the model should be highly accurate, e.g., win the game. Consider the opposite: using a model with a higher MEC than the complexity of the problem adds unnecessary complexity to the solution of the problem. Chapter 6 expands on generalization and how to formally connect it to MEC.
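As a quick arithmetic check of the example above, here is a sketch (note that the exact entropy comes out slightly below the rounded 14,900 bits used in the text):

from math import log2

k, ps = 10_000, [0.5, 0.3, 0.2]
H = -k * sum(p * log2(p) for p in ps)  # ≈ 14,855 bits (text rounds to 14,900)
C_per_param = 3 / (3 - 1)              # Law 5.1: 1.5 bits per parameter
print(H, H / C_per_param)              # ≈ 9,903 parameters (text: 9,933)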
5.3 Exercises

1. Information Content (deepening):

(a) How much information in bits is represented by a clock with hour and minute hand?
(b) You have a compass. Assume you implement a binary classifier based on the direction of the compass needle (e.g., "north-facing enough?"). (1) What is the maximum amount of information in bits that this compass classifier can represent? (2) How could you modify your compass classifier to represent less information?

(c) Instead of a binary classifier, you implement a general classifier for the above compass. What is the maximum amount of information in bits you can train your compass to represent as a function of the number of thresholds?

2. Single Linear Threshold Capacity:

(a) Show that the output of a threshold neuron carries at most one bit, independent of the activation function. Hint: Do the information content exercises first.

3. Boolean Functions:

(a) Draw the decision tree for the NAND function of two boolean inputs.

(b) Draw an artificial neuron that implements NAND for two boolean inputs.

(c) Draw a three-neuron artificial neural network that implements equality for two boolean inputs.

(d) Draw a two-neuron artificial neural network that implements equality for two boolean inputs. How do you have to change it for non-boolean inputs?

4. Function Counting Theorem:

(a) Show by induction that $C(2n, n) = \frac{1}{2} C(2n, 2n)$.

(b) Assume we did not know the Function Counting Theorem. Describe a black-box process to empirically determine the $C(n, d)$ function.
5.4 Further Reading

For this chapter, I highly recommend the original sources for further reading. These are:

• Claude Shannon: "Programming a Computer for Playing Chess", Philosophical Magazine 41 (314), 1950.
• Thomas M. Cover: "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition." IEEE Transactions on Electronic Computers 3 (1965): 326–334.
• Marvin Minsky and Seymour Papert: "Perceptrons: An Introduction to Computational Geometry", MIT Press (original: 1969); Expanded, Subsequent edition (December 28, 1987).
• David J.C. MacKay: "Information theory, inference and learning algorithms", Cambridge University Press, 2003. Chapter 40.
Chapter 6
The Mechanics of Generalization
This chapter outlines the inner workings of generalization and formally derives many of the ideas presented in the previous chapters. This is done by approaching the topic from three different angles and then showing that they converge onto the same principle.
6.1 Logic Definition of Generalization

Chapter 2 introduced two requirements for machine learning. The first requirement was well-definedness (see Definition 2.3). That is, we can only model a function. The second requirement was well-posedness (see Definitions 2.8 and 2.9). That is, requiring that a small change in an input does not lead to a change of a decision or, in the case of regression, leads to only small changes in the output. This well-posedness requirement can be directly translated into a notion of generalization that fits Definition 2.6, i.e., handling different objects by a common property: given a training point $\vec{x}_t$ and a decision $f(\vec{x}_t)$ associated with it, we treat all points within a small enough radius $\varepsilon$ around $\vec{x}_t$ to output the same common decision. This can be formalized for classification as:

$\forall \vec{x}$ with $|\vec{x} - \vec{x}_i| < \varepsilon \implies f(\vec{x}) = f(\vec{x}_i)$, (6.1)

and for regression as:

$\forall \vec{x}$ with $|\vec{x} - \vec{x}_i| < \varepsilon \implies |f(\vec{x}) - f(\vec{x}_i)| \leq \delta$, (6.2)

where f is the target function, $\vec{x}_i$ is an instance from the training set, and $\vec{x}$ is a test or validation sample.
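To make the rule concrete, here is a minimal sketch (assuming the common choice of Euclidean distance; the function and variable names are illustrative):

import numpy as np

def predict(x, X_train, y_train, eps):
    # Return the label of the nearest training point if x lies within its
    # generalization ball of radius eps; the rule is silent otherwise.
    dists = np.linalg.norm(X_train - x, axis=1)
    i = int(np.argmin(dists))
    return y_train[i] if dists[i] < eps else None

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])
print(predict(np.array([0.1, -0.1]), X_train, y_train, eps=0.5))  # 0
print(predict(np.array([5.0, 5.0]), X_train, y_train, eps=0.5))   # None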
The main question is: What are the distance metrics $|\cdot|$ and, related to that, what are $\varepsilon$ (and $\delta$)? In practice, a distance metric is often chosen somewhat arbitrarily by the machine learning practitioner and is most often chosen to be the Euclidean distance, as it is very intuitive: in 2 dimensions, a constant Euclidean distance from a point defines a circle around that point (see also Algorithm 7). The definitions of $\varepsilon$ and $\delta$ are usually a matter of training. It is obvious that this approach is directly connected to the nearest neighbor algorithm (see Sect. 3.3.1). As explained in Sect. 2.2.2, the number of training points defines the number of separations that can be made. On average, the more separations are made, the smaller each generalization distance $\varepsilon$ becomes. Intuitively, the more often one slices a cake, the smaller the average piece gets. In order to maximize generalization, one may therefore want to find the smallest number of training points one can get away with while still having a high accuracy. That is, we can optimize nearest neighbor generalization $G_{NN}$ by measuring it with the following equation:

Definition 6.1 (Nearest Neighbor Generalization) $G_{NN} = \frac{\#\text{ correct predictions}}{\#\text{ training points}}$

Again, we can observe that if $\varepsilon = 0$, the number of correct predictions can maximally equal the number of training points. Consequently, there may be no generalization for $G \leq 1$. For binary classification in higher dimensions, $G_{NN} \leq 2$ could still indicate overfitting. For multi-class classification, the information limit defined in Law 5.1 holds (see also: exercises). From now on, we will therefore refer to $\varepsilon$ as the generalization distance. For regression, a prediction is defined as correct when it is within the allowable variance $\delta$.

Recalling the discussion in Sect. 3.3.1, the nearest neighbor modeling approach is the modeling approach that does not actually use a model. That is, nearest neighbors use a single defined distance rule directly on the training data. However, for real-world problems, the need arises to make the distance function and $\varepsilon$ adaptive enough to be able to surround only a particular set of instances. That is, the generalization distance and the metrics from Eqs. 6.1 and 6.2 may become indexed and of the form $|\vec{x} - \vec{x}_i|_k < \varepsilon_k$. Figure 6.1 shows an example of how such a structure may look in 2D. The practical implementation of this concept leads to the use of machine learning algorithms other than nearest neighbor. Recalling Fig. 5.1, we know that this means the training labels and the data are converted into the parameters (e.g., weights) of some machine learning algorithm. The denominator can therefore be changed as follows:

Definition 6.2 (Poor Person's Generalization) $G_{simple} = \frac{\#\text{ correct predictions}}{\#\text{ parameters in model}}$

This formula is called poor person's generalization because it is quick and easy to calculate, but it can be highly inaccurate and does not have units (as is discussed in Sect. 6.3). Figure 6.2 shows an example of such training point/parameter reduction. The simple generalization, just measured on the training set, is $G_{simple} = \frac{162}{31} = 5.22$ for Fig. 6.2 (right).
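A sketch of the two ratios, applied to the numbers from Fig. 6.2 (the helper names are my own):

def g_nn(correct: int, training_points: int) -> float:
    return correct / training_points     # Definition 6.1

def g_simple(correct: int, parameters: int) -> float:
    return correct / parameters          # Definition 6.2

print(g_nn(180, 180))     # 1.0: pure memorization, no generalization
print(g_simple(162, 31))  # ≈ 5.22 for the reduced model of Fig. 6.2 (right)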
Fig. 6.1 Generalization of 9 points using a distance function and a generalization distance $\varepsilon$. Each color represents a class. This is a special version of a so-called Voronoi diagram. Similar to the logical generalization definition, this Voronoi diagram allows overlapping generalization regions. These are indicated in blended colors. Note that nearest neighbor algorithms can also contain undefined neighborhoods. Image: Balu Ertl, CC-BY-SA-4.0
Fig. 6.2 Left: Visualization of a training table of 180 rows, 2 columns, and a balanced target function of 3 classes (each its own color). Middle: 1-nearest neighbor classification of the dataset. Right: Training points memorized (aka nearest neighbor parameters) reduced to 31 by eliminating points that do not change the final decision (empty circles) except for increasing the generalization distance with regard to a representative point (squares) and points that can be classified as outliers (crosses). Outliers are defined here as those points where all three nearest neighbors of that point belong to one other class. The left bottom corner shows the numbers of the class-outliers, representatives, and absorbed points for all three classes. Images: Agor153, CC BY-SA 3.0, via Wikimedia Commons
To use $G_{simple}$ for regression, all one needs is a metric that determines if a point of the regression function has been predicted correctly, that is, if $|f(\vec{x}_i) - \hat{f}(\vec{x}_i)| \leq \delta$, where $\hat{f}$ denotes the model's prediction. This metric is usually application specific. For example, if US housing prices were to be predicted, an error bar on the US dollar amount could be used to determine
correctness. Note that from this equation one can see that linear regression is, in general, the most general model, as it usually uses only one parameter. So if one is able to get high enough accuracy over a large number of training samples with linear regression, the problem can be considered solved, which explains the tremendous popularity of this simple method.
6.2 Translating a Table into a Finite State Machine

To view generalization from a different angle, let us return to the training table defined in Definition 2.1 and our goal to create a finite state machine that reproduces the table with the least amount of state transitions (see Sects. 2.2.1 and 2.2.2; finite state machine generalization was defined in Eq. 2.10). For now, let us, again, focus on classification, and let us first go back to the overgeneralizing finite state machine that uses one state transition for all predictions. We will call this state transition $s_0$. The probability of this state transition being chosen for any input is $P(s_0) = 1$, since there is only one state transition. The accuracy A of this overgeneralizing state machine is $A = P(class_i) \cdot 100\%$, where $class_i$ is the class chosen as stopping state. The value of $P(class_i)$ depends on the distribution of the i classes in the target function. As done in Sect. 6.1, let us apply the well-posedness assumption (see Definition 2.9). It is easy to see that this overgeneral state machine implies $\varepsilon = \infty$, independent of any possible (distance) function implemented by the state machine, as all one needs to calculate is: $|\vec{x} - \vec{x}_i| < \infty \implies class_i$.

Second, let us take a look at the finite state machine that has one state transition per instance. That is, for n rows of the training table, we have n state transitions to either stopping or non-stopping states (to perform the classification). The probability of a state $s_i$ being chosen is maximally $P(s_i) = \frac{1}{n}$. It could be smaller if the inputs presented for prediction are not uniformly distributed. This state machine has enough state transitions to calculate $|\vec{x} - \vec{x}_i| < \varepsilon_i \implies f(\vec{x}_i)$, where $i = 1 \ldots n$. That is, there can be a unique distance function per training instance. Even if the $\varepsilon_i$ were all chosen to be 0, that is, $\varepsilon_i = 0$, the finite state machine would implement a lookup table, as explained in Sect. 2.2.2. The accuracy A of such a state machine on the training set is $A = 100\%$.

If we increase the generalization distances beyond 0, we indeed may be able to get some generalization. However, if we make the generalization distances too large, the domains of the state transitions may overlap. That is, one input gets assigned two or more state transitions. In the model of a finite state machine, this condition is undefined and leads to the same conclusion that was drawn in Sect. 2.2.2: the state machine can handle less input and is therefore less general. It is easy to see that the same condition can arise if we increased the number of state transitions beyond n, even for small generalization distances. In the best case, the extra state transitions are irrelevant, and in the worst case, they make the model less general.
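A minimal sketch of the two extreme machines just discussed (illustrative names; a real implementation would operate on a full training table):

def overgeneralizer(training):
    # One state transition for everything: always predict the majority class.
    labels = [y for _, y in training]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def lookup_table(training):
    # One state transition per row: memorize with epsilon = 0.
    table = {tuple(x): y for x, y in training}
    return lambda x: table.get(tuple(x))  # undefined off the table

train = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(overgeneralizer(train)((0, 0)))  # 1: accuracy is P(class), here 75%
print(lookup_table(train)((0, 0)))     # 0: 100% on the table, no generalization
print(lookup_table(train)((2, 2)))     # None: outside the lookup table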
The easiest way to ensure generalization is therefore to decrease the number of state transitions used to model the table. This topic is revisited in Sect. 6.4.

We often do not know the generalization distance $\varepsilon$ or the distance metric implemented in real-world machine learning models. So if we want to guarantee that the model does not overfit, we want to make sure that $\varepsilon \gg 0$ and the number of state transitions is minimized. That is, we want to maximize the generalization distance while keeping the model accurate. However, since we usually do not know what the individual state transitions $s_i$ do, we can only argue about the average (expected) generalization distance that each state can implement. This statistical principle is called Expectation Maximization (Dempster et al. 1977). That is, we want to maximize the likelihood that a (correct) state transition $s_i$ is chosen by a random sample of the test or validation set. We can measure the generalization just as defined in Eq. 2.10. The higher the value of G, the higher the average generalization distance $\varepsilon$. In the next section, we will formalize this notion in a general way.
6.3 Generalization as Compression

The generalization equations of the previous sections (see Eqs. 2.10, 6.1, and 6.2) are all valid ways of measuring generalization when the generalization distance $\varepsilon$ is not readily determinable or not otherwise usable to optimize the generalization/accuracy trade-off. However, the presented definitions have the disadvantage that they are specialized to the machine learner we are using. Combining what is discussed in Chap. 4 and in this chapter, we can define a more general measure of generalization that does not require us to model each machine learning algorithm explicitly as a finite state machine or using nearest neighbors.

Definition 5.5 introduced the notion of memory-equivalent capacity (MEC). As explained in Chap. 5, MEC can be used as a unified measure to understand how many decisions of a training table can be memorized by a machine learner. MEC is, naturally, a function of the number of parameters. However, as explained in Chap. 8, different topologies of machine learners with the same number of parameters can have different MECs. Also, MEC takes into account regularization, imperfect training, and other capacity reductions that cannot be captured by counting the number of parameters alone. Section 4.4 discussed how each prediction results in the gain of a certain amount of information. This allows us to make Definition 6.2 more precise:

Definition 6.3 (Generalization) $G = \frac{\#\text{ correctly predicted bits}}{\text{memory-equivalent capacity}}$

Chapter 8 and following discuss how to analytically and empirically measure the MEC of a given model. For a general classification problem, G is calculated using the self-information of each correctly predicted instance:

Definition 6.4 (Generalization (Classification)) $G = \frac{-\sum_{i=1}^{c} k_i \log_2 p_i}{MEC}$,
where c is the number of classes, $k_i$ is the number of correctly classified instances in class i, $p_i$ is the probability of the class being i, and MEC is the memory-equivalent capacity of the machine learner used.

If $G \leq 1$, we know the model is using more than or the same number of parameters needed to memorize the training data. That is, there is a very high chance that the model is overfitting. If $1 \leq G \leq \frac{c}{c-1}$ (where c is the number of classes or table rows in a regression problem), there is still a very high chance that the model is overfitting (especially when the input dimensionality is high) because we know that the model is most likely below information capacity (see Chap. 5). Optimizing for high generalization means maximizing G. For balanced binary classes, each correctly predicted instance is worth 1 bit, and Definition 6.4 can therefore be simplified to: $G = \frac{\#\text{ correctly predicted instances}}{\text{memory-equivalent capacity}}$.

Example 1 A binary decision tree with 1024 leaves is used to classify 20000 instances of a balanced, binary classification problem with 100% accuracy. Is the tree overfitting? $G = \frac{20000}{1024} \approx 20$. On average, each leaf of the tree therefore accounts for about 20 correct predictions. The tree is not overfitting and should perform well on a test set drawn from the same experimental conditions.

Example 2 The generalization G implemented by Fig. 6.2 (right) is calculated as follows. 162 points are correctly classified. These come from 3 equi-distributed classes. That is, each point has an information content of $\log_2 3$ bits. 31 of these points were memorized as representatives. The generalization is therefore $G = \frac{162 \cdot \log_2 3}{31 \cdot \log_2 3} = 5.22 \frac{\text{bits}}{\text{bit}}$. Note that the result did not change from the calculation of $G_{simple}$ in Sect. 6.1 because the classes are uniformly distributed. However, we now know that the same classification could be implemented by a neural network or a decision tree or any other classifier, as long as the memory-equivalent capacity is $MEC = 31 \cdot \log_2 3 \approx 50$ bits (see Corollary 3).

For regression, the numerator in Definition 6.4 needs to change from discrete events to a series of independent numbers. As discussed in Sect. 4.3, this can be achieved by performing box counting (see Algorithm 6). To measure generalization, box counting needs to be run on the output of the trained regression model (e.g., based on predictions for the training data). The equation then becomes:

Definition 6.5 (Generalization (Regression)) $G = n \cdot \frac{D_{fractal} \cdot \log_2 I}{MEC}$,

where I is the scaling returned by Algorithm 6, $D_{fractal}$ is the fractal dimension returned by Algorithm 6, MEC is the memory-equivalent capacity of the model used, and n is the number of points that were correctly placed within an error distance $\delta$ (see Eq. 6.2). "Correctly placed" can be determined by any distance metric that fits the application.
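A sketch of Definition 6.4 applied to the two examples above (Example 1 assumes, as the example suggests, that the tree's MEC is 1024 bits, i.e., one bit per leaf):

from math import log2

def generalization(k, p, mec):
    # k[i]: correctly classified instances of class i; p[i]: class prior
    return -sum(ki * log2(pi) for ki, pi in zip(k, p)) / mec

# Example 1: balanced binary problem, 20000 correct predictions
print(generalization([10000, 10000], [0.5, 0.5], 1024))             # ≈ 19.5

# Example 2: 162 correct predictions over 3 balanced classes
print(generalization([54, 54, 54], [1/3, 1/3, 1/3], 31 * log2(3)))  # ≈ 5.22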
One can see the immediate connection to compression: G is a compression ratio measured in $\frac{\text{bits}}{\text{bit}}$. Since the model reduces the complexity of the decisions outlined in the training table (e.g., as rows per state transition), it acts as a compressor for the function represented in the training data. Such compression can be lossless or lossy. In the first case, $1 \leq G \leq \frac{c}{c-1}$. In the case of lossy compression, G is usually large. $G \leq 1$ indicates no compression; that is, the model has not reduced the complexity of the function represented by the training table. It therefore acts as a memorizer with small or no generalization distance, a condition frequently called overfitting. Note that compression has limitations that are described in Sect. 7.2.
6.4 Resilience

The word resilience is generally defined as follows (Norris et al. 2002): "the ability to adaptively cope with adversity, to maintain equilibrium under duress, and to recover from traumatic events." In the narrower context of machine learning and software engineering, we can simplify this definition to:

Definition 6.6 (Resilience) The ability of a system to withstand changes in its environment and still function.

For machine learning models, we can be even more specific, as we have a well-posedness requirement (see Definitions 2.8 and 2.9). That is, "similar" inputs should give "similar" outputs. The amount of change allowed within that definition of similar is what we call resilience. In general, those changes are called data drift, which is the evolution of data over time that invalidates the model. The definition below makes resilience to drift immediately measurable:

Definition 6.7 (Model Resilience) The amount of noise that can be added to a validation/test sample without changing the output of the model.

As discussed in Sect. 6.1, the generalization distance $\varepsilon$ is proportional to G from Definition 6.4. This allows us to conclude that resilience and generalization are the same: the higher the generalization distance, the more a test sample could drift without the prediction changing. Furthermore, it is easy to see that the higher the value of G, the higher the average amount of noise until a decision changes. Since G is a compression ratio, G bits of noise can be added on average to each sample before a decision changes. Noise is usually measured in decibels (dB), which is an analog measure of information (Lin and Costello 1994). The conversion works as follows:

Definition 6.8 (Average Model Resilience) $R = 20 \cdot \log_{10} G$ [dB]

We expect no noise resilience at $G \leq 1$, as the generalization distance $\varepsilon$ could be 0; thus $R = 0$ for $G = 1$. For $G < 1$, R becomes negative. This can be interpreted as follows: since the decibel is a unit of information (just like the bit), it means that at $R < 0$, each sample already carries a certain amount of prediction uncertainty (without even adding any noise). Based on Eq. 6.1, the generalization distance $\varepsilon$ can never be negative. However, generalization areas could overlap in space, therefore
adding ambiguity (uncertainty) to the outcome of a decision. The probability of overlapping generalization areas is higher the more parameters are used (that is, the lower the value of G). At any $G > 1$, we can expect some average noise resilience because $\varepsilon$ must be greater than 0 (at least for one instance). In general, each added bit of generalization equals about 6 dB of added average noise resilience.

Note that this measure of resilience is consistent with a frequent interpretation of Occam's razor or the Law of Parsimony (Sober 1984): "Among competing hypotheses, the one with fewer assumptions should be preferred." That is, models with fewer assumptions (parameters, lower MEC) for the same correct predictions have a higher resilience and are therefore more likely to pass the test of time.

Example In training, 2000 instances of a balanced, binary classification problem have been predicted correctly. Quality assurance measurements have shown that one can add an average of 24 dB of noise to each validation instance before the prediction changes. What is the maximum MEC of the machine learner in bits? Definition 6.8 can be used to convert dB back into bits: $G = \frac{R}{20 \cdot \log_{10} 2} = \frac{24}{20 \cdot \log_{10} 2} = 3.986\ldots$ bits. Since it is a balanced binary classification problem, 2000 bits of information have been predicted correctly. Definition 6.3 can be simplified and algebraically reformulated to: $MEC = \frac{\#\text{ correctly classified bits}}{G}$. Thus the maximum model complexity that can withstand this noise is $MEC = \frac{2000}{4} = 500$ bits. Note that the definition of resilience does not change for regression problems.
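A sketch of the example's arithmetic, using the ≈6 dB-per-bit rule ($20 \log_{10} 2 \approx 6.02$):

from math import log10

db_per_bit = 20 * log10(2)   # ≈ 6.02 dB per bit of generalization
R = 24                       # measured average noise resilience in dB
G = R / db_per_bit           # ≈ 3.986 bits
print(G, 2000 / G)           # MEC upper bound ≈ 500 bits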
6.5 Adversarial Examples

In the machine learning community, adversarial examples are defined as "small perturbations to an example that are negligible to humans but change the decision of a computer system." They were first reported in image object recognition (Szegedy et al. 2013a) but were later found in natural language systems as well (Jin et al. 2019). The original article described adding a small amount of noise to the image of a panda bear. The noise was so small that the photo was indistinguishable from the original for humans, but the trained image classification model misclassified it. Adversarial examples can be derived using the understanding from Sect. 6.1.

Definition 6.9 (Adversarial Example) We call a sample $\vec{x}$ adversarial iff $|\vec{x} - \vec{x}_i| < \varepsilon \implies f(\vec{x}_i) \neq f(\vec{x})$,

where $\vec{x}_i$ is a training sample within the trained generalization distance $\varepsilon$ calculated using the assumed distance metric $|\cdot|$, $f(\vec{x}_i)$ is the prediction trained for $\vec{x}_i$, and $f(\vec{x})$ is the prediction observed for $\vec{x}$.

Concretely, $\vec{x}$ would be the noisy panda image, and $\vec{x}_i$ would be the original panda image that was trained into the classifier. The assumed distance metric
Fig. 6.3 A human visual system perceives a length difference between the vertical line and the horizontal line, even though objectively there is none. Optical illusions such as the one shown are contradictions to the assumptions underlying the models in our brain. That is, they can be interpreted as adversarial examples to biological models. From: Wikipedia, Public Domain
$|\vec{x} - \vec{x}_i|$ is "perceived similarity," and $\varepsilon$ is assumed to be 0, as the perturbations to the image were unnoticeable to the human eye.

In plain English, adversarial examples are contradictions to the generalization assumption made by a model. They are not specific to natural language processing, computer vision, table data, speech recognition, or any other field. For example, optical illusions can be interpreted as adversarial examples to the perception models in human brains. The vertical line in Fig. 6.3 is perceived as longer than the horizontal line even though objectively they are the same length. Applying Definition 6.9, the difference of the lengths l of the horizontal line and the vertical line is $|l_{horizontal} - l_{vertical}| = 0$ (where the lengths are scalars and the difference operator is arithmetic subtraction), but the final observation based on human perception is $f(l_{horizontal}) \neq f(l_{vertical})$.

There can be many reasons why the generalization assumption does not hold. For example, in Fig. 6.2 (right), all points marked as crosses can be considered adversarial examples. Using the observations from Sect. 6.2, we already know that an overlap of state transitions can lead to ambiguities. Ambiguities can also arise when multiple models work together but contradict each other for certain inputs. In general, this makes adversarial examples an important tool for understanding the limits of models. This is discussed in Chap. 13.
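As a hypothetical illustration of Definition 6.9 (my own construction, not from the book), consider a linear classifier: near its decision boundary, a perturbation far smaller than any plausible generalization distance flips the decision.

import numpy as np

w, b = np.array([1.0, 1.0]), -1.0        # decision rule: w·x + b >= 0
f = lambda x: int(w @ x + b >= 0)

x_i = np.array([0.55, 0.50])             # a training sample with f(x_i) = 1
x = x_i - 0.06 * np.sign(w)              # tiny step toward the boundary
print(np.linalg.norm(x - x_i))           # ≈ 0.085: a "negligible" perturbation
print(f(x_i), f(x))                      # 1 0: the decision flips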
6.6 Exercises

1. Logic Definition of Generalization:

(a) Show empirically that the information limit of 2 prediction bits per parameter also holds for nearest neighbors.
(b) Extend your experiments to multi-class classification.

2. Finite State Machine Generalization:

(a) Implement a program that automatically creates a set of if-then clauses from the training table of a binary dataset of your choice. Implement different strategies to minimize the number of if-then clauses. Document your strategies, the number of resulting conditional clauses, and the accuracy achieved.

(b) Use the algorithms developed in (a) on different datasets. Again, observe how your choices make a difference.

(c) Finally, use the programs developed in (a) on a completely random dataset, generated artificially. Vary your strategies but also the number of input columns as well as the number of instances. How many if-then clauses do you need?

3. Compression:

(a) Create a long random string using a Python program, and use a lossless compression algorithm of your choice to compress the string. Note the compression ratio.

(b) What is the expected compression ratio in (a)? Explain why.
6.7 Further Reading

Shockingly little has been written about generalization. I do recommend an Internet search on that. Computer science literature uses the word abstraction more than generalization, except for the cartographic community.

• Burrough, Peter A.; McDonnell, Rachael; McDonnell, Rachael A.; Lloyd, Christopher D.: "8.11 Nearest neighbours: Thiessen (Dirichlet/Voronoi) polygons". Principles of Geographical Information Systems. Oxford University Press (2015). pp. 160–. ISBN 978-0-19-874284-5.
• Aurenhammer, Franz: "Voronoi Diagrams—A Survey of a Fundamental Geometric Data Structure". ACM Computing Surveys (1991). 23 (3): 345–405.

Expectation Maximization warrants a look, as it is the statistical version of generalization:

• Dempster, A.P.; Laird, N.M.; Rubin, D.B.: "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B. 39 (1): 1–38 (1977).

A more mathematical view of Occam's razor has been discussed by MacKay:

• David J.C. MacKay: "Information theory, inference and learning algorithms", Cambridge University Press, 2003. Chapter 28.
Chapter 7
Meta-Math: Exploring the Limits of Modeling
In this chapter, we will look at mathematical and statistical modeling in general and explore their limits. That is, what can we expect from modeling and what not? What are the limits of the approach we call "modeling"? In order to get there, we need to remember that we are looking at the scientific process as a whole. That is, this chapter will explore the limits of science and math itself. In order to do so rigorously, we will need to use math that describes the limits of math. This will be called meta-math in this chapter to pinpoint the fact clearly, even though math describing math is definitely not novel (Chaitin 2006).
7.1 Algebra

The first time most people are confronted with math describing math is algebra. Algebra could be named meta-arithmetic. Arithmetic teaches us that $1 + 1 = 2$ and that $1 + 3 \cdot 8 = 25$. Algebra generalizes the patterns of arithmetic and formulates rules such as the commutativity, associativity, and distributivity of several arithmetical operations. This allows the use of variables to represent numbers in computation and reasoning. In many school systems, algebra is drilled into students with such an intense focus that many students think that algebra equals math. Of course, algebra is the subfield of math that uses the patterns emerging from the properties of the arithmetic operators to simplify calculations and equations. In this way, algebra simply serves the principle of Occam's razor by simplifying a given explanation, if possible. Furthermore, algebra was generalized to what could be called meta-algebra, but it is called "abstract algebra" or "modern algebra" and is the study of algebraic structures. Algebraic structures include groups, rings, fields, modules, vector spaces, lattices, and algebras over a field. The term abstract algebra was coined in the early 20th century to distinguish this area of study from the original algebra. Finding patterns and generalizing from the generalizations helps not only to
identify practically useful patterns but also to see the limits of arithmetic and of algebra itself.
7.1.1 Garbage In, Garbage Out

Garbage in, garbage out is a commonly known concept describing that flawed or nonsense (garbage) input data produces nonsense output. This simple observation has a sound mathematical backing: the data processing inequality. The data processing inequality is used many times in this book, even though it simply states that information cannot be created by data processing. Intuitively speaking, when we take a cell phone photo of the moon, we cannot create an algorithm that zooms in and finds the flag that was put there during the first moon landing. For that, we would still need (a quite impressive) telescope. Similarly, one cannot generally recreate information that has been lost. This concept is very familiar in math, but it is usually not explained in this way. For example, if we multiply a number by zero, the result is always zero, i.e., $n \cdot 0 = 0$. While the inverse operation of multiplication is division, this particular multiplication is not reversible (more on multiplication in Sect. 7.2.1). So if we divided $\frac{0}{0}$, the result should be n. However, this is not allowed because the result would be ambiguous: it could be any n. We simply lost the information of which number was multiplied to get to zero. It turns out that lost information cannot ever be restored by math. This is called the data processing inequality and was first shown in (Cover and Thomas 2006).

Theorem 7.1 (Data Processing Inequality) $I(X; Y) \geq I(Y; Z)$

We can understand $I(X; Y)$ as the information communicated from an object and serving as an input to a model/algebraic formula, and $I(Y; Z)$ as the information that is output by the model/result of the formula. However, the inequality holds more generally. The original communication formulation of the data processing inequality is that no clever transformation of the received code (channel output) Y can give more information about the sent code (channel input) X than Y itself. This can be seen as follows. Let us start with a definition.

Definition 7.1 (Markov Chain) A Markov chain is a sequence of experimental outcomes $f(\vec{x})_1, f(\vec{x})_2, f(\vec{x})_3, \ldots$ where the probability of moving from a previous outcome to the next outcome depends only on the present outcome and not on the previous outcomes.

Let us assume three experimental outcomes (random variables) from the Markov chain $X \to Y \to Z$, implying that the conditional distribution of Z depends only on Y and is conditionally independent of X. In such a Markov chain, the joint probability of an event can be calculated as $p(x, y, z) = p(x)p(y|x)p(z|y)$. We zoom in on the first conditional probability in this term, which is defined as:
$P(y|x) = \frac{P(y \cap x)}{P(x)}$. By set theory, we know that the event $(y \cap x)$ is a subset of the event x; that is, $A \subseteq B \implies P(A) \leq P(B)$. Thus, $P(y \cap x) \leq P(x)$ and, as a consequence, $p(y|x) \leq 1$. This implies that $p(x)p(y|x) \leq p(x)$. The same argument holds for the second conditional term $p(z|y)$. In other words, the joint outcome probabilities can only get smaller as we move down the chain. That is, $-\log_2 p(x) \leq -\log_2 p(x)p(y|x) \leq -\log_2 p(x)p(y|x)p(z|y)$ (see Definition 4.1): the uncertainty about the joint outcome never decreases, and information never increases in a Markov chain. More formal proofs exist for Theorem 7.1, which make use of the elegant mathematics of information, including (Chen and Huang 2019). Regardless, feeding an algebraic model data with a high degree of uncertainty (noise) will therefore not increase its information content. It is important to understand that no specialized architecture, no formula, and no smart trick whatsoever can create new information out of nothing. In other words, the data processing inequality has significant consequences for human understanding and the meaning of creativity.
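The inequality is easy to check numerically. The following sketch (my own illustration; it checks the standard textbook form $I(X; Z) \leq I(X; Y)$ of the inequality for a chain $X \to Y \to Z$ of binary symmetric channels):

import numpy as np

def mutual_information(p_joint):
    # I(A;B) in bits, computed from a joint probability table p(a, b)
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (pa @ pb)[mask])).sum())

def bsc(eps):
    # binary symmetric channel: flips the input bit with probability eps
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.diag([0.5, 0.5])              # uniform binary source X
p_xy = px @ bsc(0.1)                  # joint distribution of X and Y
p_xz = px @ (bsc(0.1) @ bsc(0.2))     # joint distribution of X and Z (two hops)
print(mutual_information(p_xy))       # ≈ 0.53 bits
print(mutual_information(p_xz))       # ≈ 0.17 bits: processing lost information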
Creativity and the Data Processing Inequality

Creativity is commonly understood as the ability to generate original, innovative, or novel ideas, concepts, or solutions that have value or usefulness. It involves the process of connecting disparate pieces of information, drawing on existing knowledge, and employing imagination to come up with new and meaningful perspectives or approaches. Creativity is considered a key aspect of human intelligence and can be found in various domains such as the arts, sciences, technology, and everyday problem-solving. In essence, creativity represents the capacity to think beyond conventional patterns and frameworks, allowing for the discovery of unique connections and the development of groundbreaking innovations.

However, from an information-theoretic perspective, creativity can be considered a misnomer. Given the data processing inequality, creativity in both computer models and biological brains is not about creating information, but rather about processing incoming data. In other words, the primary focus of any "thinking" process lies in filtering noise. As discussed in Chap. 8, neurons implement energy thresholds, which act as signal filters. Consequently, thinking should be understood as the application of assumptions to maximize the extraction of available information bits from input data concerning a specific question. This means that creativity is about identifying the most informative aspects of the data at hand with regard to a given question (see also Definition 4.1). In the case of the human brain, this includes filtering information from an entire lifetime of experience. Therefore, information-theoretically, creativity must be understood as the process of noise filtering (see also the discussion on the practical implementation of generator models in Sect. 9.2). Note that while this understanding of creativity may differ from some conventional definitions, it does not at all diminish the difficulty of the task at hand.
7.1.2 Randomness

One very important and often overlooked limit of algebra is that there is no finite formula that generates an infinite series of random numbers. In order to see this, let us start with a definition.

Definition 7.2 (Random Number Sequence) A sequence of numbers is called random when there is no rule that predicts the next number in the sequence from any set of previous numbers.

Note that, in practice, the existence of a rule is sometimes just hidden. Also, many theoretical discussions simply assume ignorance. For example, in Sect. 4.4, it is discussed that the upper limit of the information content of a training table can be estimated by the information content of the target column because we do not know anything about the distribution of the inputs. For the sake of argument, let us assume somebody claims to have found a way to generate random numbers. We can then contradict the claim using the following argument. Definition 7.2 immediately implies that each number is independent of every other. If there were a dependency between any two numbers in the sequence, this dependency could be formulated as a rule (contradicting the definition). Since each number is independent, each number has the same probability of appearing, which is P = 1/n, where n is the size of the pool of numbers we are selecting from. This means we are in probabilistic equilibrium (see Definition 4.10) and can therefore apply Eq. 4.3 to see how many bits would be needed to generate k numbers of such a sequence. We also know from Eq. 4.7 that the information content of probabilistic equilibrium is maximal. If we now apply the data processing inequality presented in Sect. 7.1.1, which states that formulas can only maintain or reduce the information content of the input, we can immediately see that the input to any mathematical formula that generates random output needs to be at least as random as the output. That is, even using infinite recursion, generating an infinite series of random numbers would require an infinite amount of information as input to the algebraic formula. This is hardly useful and cannot be called "generating."

Corollary 4 There is no algebraic formalism that can generate an infinite sequence of random numbers.
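In practice, this is why computers only offer pseudo-random numbers: behind every generator there is a hidden rule, namely the seeded algorithm itself. A minimal Python illustration (a sketch, not from the book):

```python
import random

gen_a = random.Random(42)   # same seed ...
gen_b = random.Random(42)
print([gen_a.randint(0, 9) for _ in range(8)])
print([gen_b.randint(0, 9) for _ in range(8)])   # ... exact same "random" numbers
```

The two sequences are identical because the "randomness" is fully determined by the seed, which is exactly the hidden rule Definition 7.2 forbids.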
7.1.3 Transcendental Numbers

Another limit of algebra is that not all numbers can be described by it. That is, there are non-algebraic, so-called transcendental numbers. A transcendental number is not the root of any non-zero polynomial of finite degree with rational coefficients. In other words, there is no algebraic formula that can fully describe a transcendental number without using infinitely many terms and/or irrational numbers. Examples of such numbers are Euler's number e and
π. As a result, these numbers cannot be learned by a machine learning model: The model would have to have infinite size. That is, such a model would have the same or higher complexity than the transcendental number. Yet, a finite state machine with finite memory is able to calculate π with infinite precision. That is, one can write an algorithm of lower complexity than π that calculates all of its digits. While doing so still takes infinite time, it shows an important difference between an algebraic model and an algorithm: Algorithms are more expressive than algebraic formulas. Algorithm 7 shows a minimal program that calculates and outputs π.1
Algorithm 7 A theoretical approach to incrementally estimating π to infinite digits with a finite computer algorithm. No finite mathematical model exists.

function PI
    (x, y) ← random values
    radius ← random value
    counter ← 0
    for i ∈ [0, ∞[ do
        (x_i, y_i) ← random values
        if √((x − x_i)² + (y − y_i)²) == radius then
            if (x_i, y_i) is new then
                counter ← counter + 1
                print counter / (2 ∗ radius)
            end if
        end if
    end for
end function

1 To make this program practical, one would not look for points at exactly radius distance from the initial center but would accept points within a minimal tolerance ε.
The reason for this is an algorithm's capability to "guess and check," while models can only "check": It would be easy to build a classifier that checks whether a given point is exactly radius distance away. However, since there is no algebraic formulation for random numbers (see previous section), models cannot generate them. Recursive models can generate pseudo-random numbers, that is, number series that have a very long period. However, these would not succeed in generating all digits of π because once the period is reached, the same numbers are generated. In that case, Algorithm 7 would stop counting, as no new points are generated. Computers can use random numbers because they have a connection to the physical universe, and the physical universe has noise. At a minimum, the universe generates 2.7 Kelvin of cosmic background radiation that can be observed using a radio telescope. In practice, we do not have to go to such lengths because a pseudo-random generator can generate new numbers from a new initial seed number. So the initialization could come from any physical event (such as a user's mouse movements). The generation of random numbers is also necessary because any algorithm that tries to come up
with guesses may forget some and therefore not get to a correct end result. Guess-and-check problems over finite sets can, of course, be solved by enumerating all possible guesses and do not require randomness. In general, this important difference between the complexity of a formula representing data, which is captured by entropy (see Definition 4.11), and the complexity of an algorithm representing data is captured by Kolmogorov complexity. Kolmogorov complexity is further discussed in Sect. B.2. Note that the number of binary decisions needed to generate π is still infinite, and so Algorithm 7 does not contradict Corollary 2. Neither does the algorithm contradict the data processing inequality (see Sect. 7.1.1), as random numbers have the highest information content of all data and the algorithm reduces information to get to π.
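As a concrete starting point for Exercise 1(a) in Sect. 7.6, here is a minimal Python sketch of a practical π estimator. It replaces the exact-radius test of Algorithm 7 with the standard Monte Carlo area-ratio method (counting points inside a quarter circle), which sidesteps the ε-tolerance issue mentioned in the footnote; this variant is mine, not the book's:

```python
import random

def estimate_pi(samples: int, seed: int = 0) -> float:
    """Monte Carlo area estimator: the fraction of uniform random points that
    fall inside the unit quarter circle approaches pi/4."""
    rng = random.Random(seed)        # seeded, so the experiment is repeatable
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:     # inside the quarter circle of radius 1
            hits += 1
    return 4.0 * hits / samples

print(estimate_pi(1_000_000))        # ~3.1415, improving with more samples
```

Note how the estimator still relies on a "guess" source (the pseudo-random generator) plus a "check" (the distance test), matching the discussion above.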
7.2 No Rule Without Exception

The pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item. For example, if one has three pigeons, there must be at least three holes to put them in, or at least two pigeons will be sharing a hole. This seemingly obvious statement is not only often overlooked but can also be used to demonstrate possibly unexpected results. The well-known example is the following: Given that the population of London is greater than the maximum number of hairs that can be present on a human's head, the pigeonhole principle requires that there must be at least two people in London who have the same number of hairs on their heads. The more formal notion of the pigeonhole principle is as follows.

Definition 7.3 (Pigeonhole Principle) There cannot be an isomorphism between sets of different cardinalities.

An isomorphism is a function that maps each point of the input to a unique point of the output. That is, given the output, the input can be inferred without ambiguity. The most important consequence of the pigeonhole principle for the limits of modeling is that compression can never be both lossless and universal. Intuitively, a compression algorithm is a method that takes a string and makes it shorter. If all possible input strings can be shortened, the compression is universal. If all compressed strings can be unambiguously reversed to their uncompressed versions, the compression is lossless. The pigeonhole principle dictates that compression can only be either lossless or universal. Let us investigate this more formally.

Definition 7.4 (Description Length) We denote by dl(s) the function counting the number of symbols needed to describe a sequence of symbols ("string") s.

Definition 7.5 (Compression Scheme) A compression scheme is a function f such that dl(f(s)) < dl(s) for some strings s in a set of strings S.
Definition 7.6 (Lossless Compression) A lossless compression scheme is a compression scheme that is reversible. That is, for a compression scheme f, ∃f⁻¹ such that f⁻¹(f(s)) = s for all strings s ∈ S.

Definition 7.7 (Universal Compression) A compression scheme f is universal when dl(f(s)) < dl(s) for all s ∈ S.

Lemma 7.2 (No Universal, Lossless Compression) A compression scheme cannot be lossless and universal at the same time.

Assume S is the set of all binary strings of length n. S therefore has cardinality |S| = 2^n. Assume a universal compression f : S → C, where dl(f(s)) < dl(s) ∀s ∈ S. The maximum length of a compressed string c ∈ C is therefore n − 1, so C can contain at most 2^1 + 2^2 + ... + 2^(n−1) = 2^n − 2 strings. Since |C| < |S|, by the pigeonhole principle, there can be no isomorphism between C and S. This means ∄f⁻¹ such that f⁻¹(f(s)) = s ∀s ∈ S, and so the compression cannot be lossless.

Now assume a lossless compression. That is, ∃f⁻¹ such that f⁻¹(f(s)) = s ∀s ∈ S. f therefore needs to be an isomorphism. By the pigeonhole principle, f being an isomorphism implies that |C| = 2^n = |S|. As seen above, a universal compression allows only |C| ≤ 2^n − 2. This means not all strings can be compressed, and f is not universal. That is, a compression scheme cannot be lossless and universal at the same time. Figure 7.1 shows the visualization of the special case of compressing all files of 2 bits length to 1 bit.

As discussed in Chap. 6, a general model is a less complex representation of the function encoded by the training data. Many inputs are generalized to output the same decision by a separation function. That is, a model implements a lossy compression. The input cannot be inferred from an outcome, which can be desirable if privacy of the input data is a concern. However, some inputs may still have been overfit. A memorizing model usually implements a lossless compression; such a compression cannot be universal. A prominent example is discussed in Chap. 5: 14 of the 16 boolean functions of two variables can be losslessly compressed into 3.80 bits of neural capacity. This means, however, that 2 of them cannot. We either incur a loss (in accuracy) or we need more neurons.
Fig. 7.1 Visualization of no universal and lossless compression. Compressing all files of 2 bits length to 1 bit length requires 4 file conversions but only allows 2 without loss. The compression on the left is an example of lossless compression, as all compressed files can be reverted to their originals. However, this scheme is not universal, as it is not able to compress all files: file_c3 and file_c4 maintain their lengths. The scheme on the right universally compresses all files, but uncompressing (traveling backward on the arrows) is not possible without loss of information. One can choose a universal or a lossless compression scheme, not both
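Lemma 7.2 can be observed empirically with any real compressor. In the following Python sketch (mine, not the book's), a lossless scheme shrinks redundant data but cannot shrink maximum-entropy data; on random input it typically even adds a few bytes of overhead:

```python
import os
import zlib

redundant = b"ab" * 5_000            # low-entropy, highly compressible input
random_bytes = os.urandom(10_000)    # maximum-entropy input

for name, blob in [("redundant", redundant), ("random", random_bytes)]:
    compressed = zlib.compress(blob, level=9)
    print(name, len(blob), "->", len(compressed))
# redundant: 10000 -> a few dozen bytes; random: 10000 -> slightly *more* than 10000
```

The lossless scheme pays for its reversibility by failing to be universal, exactly as the lemma requires.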
Lemma 7.2 has many implications beyond machine learning modeling. Given a set of facts, humans have the desire to simplify things with a rule. This allows us not only to grasp the set of facts more easily, but also to communicate the facts with less bandwidth. Grammars simplify language education, laws and policies simplify communicating behavioral limits, and, as discussed in Chap. 1, scientific models help understand and communicate the outcomes of experiments. Simplification here means that a set of descriptive rules is used that is of lower complexity than the space it describes. In short, the rules compress the space. No universal, lossless compression implies that these rules either cannot describe the entire space without exceptions (that is, special cases memorizing direct examples out of this space) or are not able to describe the entire space at all. In a language grammar, an example of such exceptions would be irregular verbs. In short: There cannot be a rule without an exception. Let us loop back to Fig. 7.1 (left): The rule for the lossless compression would be "numbers are shorter omitting leading zeros." The exceptions are the numbers that do not have leading zeros. The universal compression in Fig. 7.1 (right) then enforces the rule disregarding the exceptions, accepting information loss as a consequence. On a side note: A growing community of mathematicians suggests using variants of Lemma 7.2 as a new proof idea: incompressibility proofs (Filmus 2020). We should expect to hear more about this topic in the coming decades. Let us consider a more in-depth example.
7.2.1 Example: Why Do Prime Numbers Exist?

The following is an example of the application of the meta-math principle of the non-existence of universal and lossless compression to something surprisingly fundamental: an explanation of why prime numbers exist. The Peano axioms guarantee that any natural number can be reached by addition of natural numbers. That is, with n − 1 additions of 1, any number n ∈ ℕ can be represented. To represent a natural number in such a way, one needs a string of length l = n digits + (n − 1) operators. It follows that any number can be represented by one addition of two numbers in n − 1 possible ways. Multiplication reduces this representation length. For example, instead of 1+1+1+1+1+1+1+1+1+1+1+1 or 3+3+3+3, one can write 3 ∗ 4. For large numbers, the reduction in description length can be significant: For example, 10 + 10 + ... + 10 (100,000 times) results in the use of 100,000 ∗ 2 digits + 100,000 − 1 operators = 299,999 symbols. Multiplication shortens the number of symbols needed: 10 ∗ 100,000 uses only 9 symbols. This is a remarkable compression ratio of roughly 33,000 : 1. As a result, multiplication can be understood as run-length encoding of addition. Now, by the pigeonhole principle, no reduction of description length can be universal and lossless. Multiplication as a compression is lossless, as 3 ∗ 4 unambiguously expands back to 12. However, it is not universal: While addition can reach all numbers, multiplication cannot (other than using multiplication by 1, which does not reduce
description length). The numbers that cannot benefit from the reduced description length of multiplication are commonly called prime numbers. Moreover, we can observe clear properties of compression in multiplication. For example, a necessary condition for lossless compression to work is redundancy. That is, multiplication can only compress additions in which the same symbol (number) repeats. For example, 1 + 2 + 3 cannot be expressed by multiplication, whereas 2 + 2 + 2 or 1 + 1 + 1 + 1 + 1 + 1 can easily be translated into a multiplication. Lastly, as mentioned earlier, the result n of a non-trivial multiplication can always be reversed back into at least one multiplication expression n = x ∗ y with x > 1 and y > 1. Understanding the square as a special case of a rectangle, non-prime numbers can therefore be called rectangular numbers – which is a more descriptive label than composite.2 Note that 1 is, of course, neither prime nor rectangular. The first even rectangular number is 4, and the first odd rectangular number is 9. In other words, even rectangular numbers are divisible by a number ≥ 2, and odd rectangular numbers are divisible by a number ≥ 3.
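The view of multiplication as compression can be made concrete in a few lines of Python (a sketch, not from the book): for each n, we look for the shortest rewriting a ∗ b with a, b > 1; numbers for which none exists are exactly the primes.

```python
def addition_length(n: int) -> int:
    """Symbols needed to write n as 1 + 1 + ... + 1: n digits, n - 1 operators."""
    return n + (n - 1)

def shortest_product(n: int):
    """Shortest rewriting 'a*b' with a, b > 1, or None if n is prime (or 1)."""
    best = None
    for a in range(2, int(n ** 0.5) + 1):
        if n % a == 0:
            candidate = f"{a}*{n // a}"
            if best is None or len(candidate) < len(best):
                best = candidate
    return best

for n in [12, 13, 1_000_000]:
    print(n, addition_length(n), shortest_product(n))  # None marks incompressibility
```

For 12, the 23-symbol addition compresses to the 3-symbol "2*6"; for the prime 13, no multiplicative shortening exists at all.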
7.2.2 Compression by Association

Introducing new symbols can seemingly break Lemma 7.2. For example, we can compress the infinite number of digits resulting from Algorithm 7 by associating it with a Greek character: π. In short, we can seemingly compress a number with infinite description length to description length 1 by associating it with a new symbol. However, in the end, we are only able to do algebra with the symbol π. To get to an actual numerical result, we have to either approximately calculate π or have sufficient digits memorized somewhere. Furthermore, we can universally and losslessly compress all natural binary numbers of length d to decimal numbers of length ⌈d / log₂ 10⌉. This is possible because we extend the alphabet E from two possible digits |E| = 2 to ten possible digits |E| = 10. However, this compression does not come for free either: Each digit now carries log₂ 10 = 3.32... bits of information. That is, if data were transferred at a constant bandwidth (e.g., measured in bits per second), this compression would not do any good. In general, it is harder to read each digit, as each digit comes from a larger set of choices. In practice, such alphabet extensions require definition work that is usually beyond technical practicality. For example, if we were to try to universally and losslessly compress all files on a hard disk by alphabet extension, we would have to invent a new type of hard disk that uses (at least) ternary logic instead of binary logic.
2 This notion has also been adopted by California’s Common Core math curriculum for elementary schools.
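To put numbers on the binary-versus-decimal comparison above, a quick check (a sketch, not from the book):

```python
import math

d = 64                                        # length of the binary spelling
decimal_digits = math.ceil(d / math.log2(10))
print(decimal_digits)                         # 20: the decimal string is shorter ...
print(math.log2(10))                          # ... but each digit carries ~3.32 bits
```

The total information is unchanged: fewer symbols, but more bits per symbol.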
We can also observe the same issue with alphabet extension when comparing the Latin alphabet with Chinese characters. The Latin alphabet is composed of about 26 characters that are, loosely speaking, atomic. That is, each character represents a basic sound produced by our vocal tract. It therefore allows one to reconstruct the pronunciation of a word out of the smallest pronunciation units.3 Simply put: There are rules that map a symbol to a pronunciation, with a few exceptions. In Chinese writing, each character typically represents a word, which is a semantic unit. While characters are composed out of strokes that have meanings, these strokes are aligned with semantics as well. As a consequence, it is much harder to learn to read and write Chinese characters compared to the Latin alphabet. For each of the thousands of Chinese characters, the reader needs to memorize the meaning and the pronunciation. The pronunciations of new characters and their meanings need to be explicitly communicated. There are very few rules, and most pronunciations and meanings have to be memorized (as an exception). In the Latin alphabet, new words can be introduced, and the reader can infer their pronunciation from the spelling and their meaning from the context, allowing the users of the language to adapt to new concepts more quickly. In general, while increasing the number of symbols decreases the description length of the message, it increases the complexity of the mechanism that interprets the communication. This is why studies find (e.g., Gollan et al. 2009) that, when comparing the reading of different languages with readers at the same proficiency level, the amount of information transferred per unit of time stays relatively the same. However, becoming a reader of Chinese takes a lot more effort than becoming a reader of Spanish, for example. The easiest alphabet to process is therefore the smallest alphabet. It turns out that the main insight that led to the invention of the automated computer was to use the binary system, thus reducing the alphabet used to represent numbers to E = {0, 1}. The binary system had long been discovered by Chinese scholars (who undoubtedly must have had similar thoughts to the ones outlined in the above paragraph) but was introduced to western civilization by Leibniz (1703). Shannon (1948a) wrote his master's thesis showing that electronic circuits are able to implement all functions of Boolean algebra (Boole 1847). The thesis did not influence the decision, but the computer was invented based on exactly these principles (Zuse 1993). Even before Leibniz, in the late 13th century, Ramon Llull attempted to account for all wisdom in every branch of human knowledge of the time. For that purpose, he developed a general method or "Ars generalis" based on binary combinations of a number of simple basic principles or categories, for which he has been considered a predecessor of computing science and artificial intelligence. His work is suggested as further reading.
3 Never mind English or French, but Spanish, German, Portuguese, Italian, and some other languages have at least kept most of this idea. Also, Korea introduced the Hangul alphabet, which is even closer to pronunciation than the modern uses of the Latin alphabet.
Example Consider the following English text, which is the beginning of Moby Dick (Melville 1851): Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
It has a total of 225 characters (excluding punctuation but including spaces). The Mandarin translation has 68 characters: 叫我以实玛利命名。几年前,不用管时间有多长,当时我钱包里没什么钱,海岸上也没有什么特别值得我感兴趣的东西,所以我想航行一下,看看这个世界上的海洋部分。 The information content of English characters has been estimated (Shannon 1951; Miller 1951) to be between 1.3 bits and 2 bits per character, depending on the assumptions made. Contemporary mainland Chinese characters were estimated by Chen (Chen et al. 2003) to carry about 9.8 bits per character. That is, the English text has about 225 ∗ 2 = 450 bits of information, whereas the Chinese translation has about 68 ∗ 9.8 = 666.4 bits of information. This simple experiment shows that, while the string lengths differ by more than a factor of 3, the actual information content is much closer to equal – erring on the side of the Chinese text having slightly higher overall entropy than the English text. In summary, introducing a new symbol usually means introducing an exception to the current set of rules and needs to be considered very carefully, as it complicates comprehensibility. Furthermore, we are again reminded that a data processing algorithm's runtime does not chiefly depend on the length of the input but on the bit count of the input (see also Corollary 2).
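The arithmetic of this example can be checked in two lines (the per-character rates are the estimates cited above):

```python
english_bits = 225 * 2.0          # 225 characters at ~2 bits per character
chinese_bits = 68 * 9.8           # 68 characters at ~9.8 bits per character
print(english_bits, chinese_bits)              # 450.0 vs 666.4 bits
print(225 / 68, chinese_bits / english_bits)   # length ratio ~3.3 vs info ratio ~1.5
```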
7.3 Correlation vs. Causality

As outlined in Chap. 2, the scientific process (whether automatic or manual) fully relies on observations to validate or invalidate a given model. However, it has been well known for centuries that, based on observations, all we can infer are correlations. This is usually phrased as "correlation does not imply causation" and refers to the inability to legitimately deduce a cause-and-effect relationship between outcomes and/or variables solely on the basis of an observed association or correlation between them. A joke example of this is shown in Fig. 7.2. However, there is no example (joke or not) where causation does not imply correlation. Intuitively, two series of observations are correlated when one series is able to somewhat predict the other (see also Sect. B.4). That is, if a variable causes an outcome, it will be correlated with the outcome.4 In general, the problem of
4 This simple deduction can change when multiple variables are in play.
Fig. 7.2 A joke example of correlation vs. causation: Dinosaur illiteracy and extinction are correlated, but that does not mean the variables had a causal relationship. Photo by: Smallbones, Creative Commons CC0
causation vs. correlation has been discussed for many millennia (Aristotle 350 BCE). However, it merits a quick discussion in the context of this book. Let us formalize. In mathematical logic, the technical use of the word implies is equivalent to is a sufficient condition for. That is, p ⟹ q for two boolean variables is equivalent to: whenever p is observed to be true, q must be observed to be true; otherwise, the implication is false. In general, the truth table for implication is as follows.

p  q  p ⟹ q
0  0  1
0  1  1
1  0  0
1  1  1
As a side note: The case where q is true but p is false is called "ex falso quodlibet." That is, from a false premise, anything follows. So a causal chain (e.g., a proof) that is broken can unfortunately appear to be complete, leading to a wrong conclusion. As can be seen, the implies operator ⟹ is so strong that ¬q ⟹ ¬p (this is called the contraposition). For example, we are able to say that if two variables and/or events are not correlated, they cannot be causal. The problem arises when variables and/or outcomes are correlated, as then, intuitively, the question arises whether there is a causal relationship. So let us talk about correlation. We know that two observations and/or variables are more correlated the lower their mutual uncertainty is. That is, the more one observation or variable is predictive of the other, the higher the correlation. So we are able to observe correlation as "whenever we gain knowledge about A, there is a high chance we know something about B." We quantify this knowledge about B given knowledge about A as the mutual information I(A;B), where A and B are the sets of observations about each of them, usually normed into a random variable (see also Definition 4.18).
The linkage between mutual information and causality is a matter of understanding the assumptions underlying the different viewpoints. First, logic does not deal with uncertainty. That is, all results are absolute, and any given probability would be either 1 or 0. Second, logic is a universal tool, as it can be applied to anything (it even serves as the base for arithmetic, Shannon 1948a). That is, the results of logic do not change based on the context, circumstances, or a priori distributions of what it is applied to. From a probabilistic standpoint, we therefore have to assume that logic operates without any bias, in equilibrium (see Definition 4.10). If we now assume our correlations are in equilibrium, then A and B are uniform random distributions. This means either I(A;B) = 0 (not correlated) or I(A;B) = H(A) = H(B) (completely correlated). If there is complete correlation, there is no explanation for the correlation, because A is random and so is B. So we might as well define the correlation to be causal, as it behaves identically to the logic implication A ⟹ B (which also provides no explanation for its existence).5 The issue in practice is the well-known problem that all measurements, and therefore all observations, are relative to a baseline (Saaty 2008). This baseline introduces a bias because we have a choice of baseline. If there were a universal baseline, measurements could be absolute. This means that the scientific process cannot generate observations that are unbiased (that is, in equilibrium), and therefore causality must be defined and can never be observed. As a result, much of scientific evidence is based upon a correlation of variables that are observed to occur together. Scientists are careful to point out that correlation does not necessarily mean causation. This practice should continue with the automatic scientific process. However, sometimes people commit the opposite fallacy of dismissing correlation entirely. Doing so systematically would dismiss a large swath of important scientific evidence. Unfortunately, the problem of correlation vs. causation can therefore only be solved by common sense and the acceptance of the fact that scientific results may change as knowledge evolves. There are cases, however, where a correlation is very easily legitimized as causation: the directly observable result of the application of a rule. A rule is usually formulated using (or can be reformulated into) mathematical logic. This means the rule is defined using implication ⟹, and the results are therefore causal.
Example 1 In court, the question may be whether a driver caused an accident. There is no question that two cars crashed, as witnesses observed the incident and there is observable damage correlated with two cars crashing into each other. In law, cause is established using a rule. Here, the rule is that a driver has to stop his or her car at a stop sign, and the person who does not do this is defined as having caused the accident. If there were no traffic rules, there would be no way of assigning a person to be the cause of the accident.
5 Since mutual information is symmetric, here B ⟹ A as well. To get an asymmetric argument of the same conceptual value, the Kullback–Leibler divergence can be used instead of the mutual information.
There would simply be observations of the physical world and correlations of outcomes.
Example 2 In medicine, we often search for a cause of death. Death may seem like a clear and obvious observation. However, it is not. Rules define when a doctor is to declare somebody dead. These rules have exceptions, as can occasionally be observed in sensational reports in the mainstream media (Mieno et al. 2016).
Example 3 Physics defines some universal rules, for example, the second law of thermodynamics or Newton's laws of motion. The correlations described there are so obvious and intuitive that we often interpret them as causal. However, physicists point out that thermodynamics is defined in equilibrium and that Newton's laws rest on assumptions such as "point-shaped mass in a vacuum." None of those assumptions can actually be made true in reality. As a consequence, the causality inferred from these rules is an approximation. Since there is no approximated causality (logic is absolute), the observations following physical rules are correlations based on predictions of causal models. Recalling that all models use algebraic rules (that is, rules derived using logic), we can therefore summarize: Observations are always correlations. The outputs of a mathematical model are always causal. The question remains whether the causality implied by the model reflects reality.
7.4 No Free Lunch

"There ain't no such thing as a free lunch," that is, there are no easy shortcuts to success. Similar to "garbage in, garbage out," presented earlier in this chapter, this saying emerged from folk wisdom and then got adopted into formal mathematics. The No-Free-Lunch Theorem (Wolpert and Macready 1996) may be the single most important mathematical result about tooling. It essentially says: All tools have the same power, on average. The many discussions about which type of text editor is the best, which programming language is the most powerful, or which type of machine learner beats them all are all rendered anecdotal by this theorem. Here is the formal language.

Theorem 7.3 (No Free Lunch) Given a finite set V and a finite set S of real numbers, assume that f : V → S is chosen at random according to the uniform distribution on the set S^V of all possible functions from V to S. When trying to optimize f over the set V, no algorithm performs better than guessing and checking.

Here, "guessing and checking" means that at each step of the algorithm, the element v ∈ V is chosen at random with uniform probability distribution from the elements of V that have not been chosen previously. This means that when all
functions f are equally likely, because we have no way of narrowing them down otherwise, the probability of observing an arbitrary sequence of values m (that one deems an optimal model for f), in the course of whatever optimization, does not depend upon the algorithm. Finding m faster than defined by the probability of finding m is anecdotal (see also Corollary 2). The proof of Theorem 7.3 can be found in (Wolpert and Macready 1996). The No-Free-Lunch theorem assumes that the target function is chosen from a uniform distribution over all possible functions. Such an assumption is discussed in Sect. 7.1.2. If we knew that a specific target function were more likely to be chosen than others, we would probably have more insight into the problem and would indeed choose a specialized technique for modeling rather than general-purpose machine learning. Such insight, however, usually reduces the problem from general to special. The contribution of Theorem 7.3 is that it tells us that choosing an appropriate algorithm requires making assumptions about the kinds of target functions the algorithm is being used for. With no assumptions, no "meta-algorithm," not even the scientific method, performs better than random choice. A fundamental criticism of the applicability of Theorem 7.3 to machine learning is that it seemingly contradicts Occam's Razor (see Sect. 2.2.2). Models m with lower description length (see Sect. 7.2) should have a higher probability of being discovered than models with longer description length. By Occam's razor, they are also more likely to be general. Clearly, a search among shorter sequences m is easier than a search among longer sequences. Theorem 7.3 holds for sequences of equal length, and therefore optimizing with smaller models is easier. Therefore, in general, the amount of work needed to solve an optimization problem depends on the complexity of the problem, as discussed in Chap. 5, not on the tool that is used to solve the problem – unless the problem is specialized with assumptions or further knowledge (see also Sect. 16.1). In the context of high-dimensional data, the No-Free-Lunch theorem has been challenged as well. Researchers have argued that when dealing with high-dimensional data, the assumptions made by the theorem may not hold true (Grunwald 2004; Bousquet 2004). In high-dimensional spaces, data often contain low-entropy structures or regularities, which can be exploited to perform better than random guessing. This phenomenon is known as the "Blessing of Dimensionality" (Indyk and Motwani 1998) (see also Chap. 16). It suggests that the performance of some algorithms can actually improve as the dimensionality of the data increases, provided that the algorithms are designed to leverage the phenomenon. Having said that, this does not contradict the theorem, as these algorithms are now specialized to high-dimensional data, and that comes with its own disadvantages (see the discussion in Sect. 16.4).
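The theorem can be simulated directly (a Python sketch with made-up sizes, not from the book): over functions drawn uniformly at random, a fixed scan order and a random order need the same number of evaluations, on average, to find the maximum.

```python
import random

def queries_to_max(f, order):
    """Number of evaluations until the maximum of f is found along `order`."""
    best = max(f)
    for steps, v in enumerate(order, start=1):
        if f[v] == best:
            return steps

rng = random.Random(1)
n, trials = 16, 20_000
total_fixed = total_shuffled = 0
for _ in range(trials):
    f = [rng.random() for _ in range(n)]   # a uniformly random target function
    total_fixed += queries_to_max(f, range(n))
    order = list(range(n))
    rng.shuffle(order)
    total_shuffled += queries_to_max(f, order)
print(total_fixed / trials, total_shuffled / trials)   # both ≈ (n + 1) / 2 = 8.5
```

Any apparent edge of one search order over another on a single random function averages out over the full distribution of functions.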
7.5 All Models Are Wrong

Last but not least, we finish this chapter with another common saying. "All models are wrong, but some are useful" is an aphorism that acknowledges that models always fall short of the complexities of reality but can still be useful nonetheless. The aphorism originally referred just to statistical models, but it is now sometimes used for scientific models in general; moreover, it is a common saying in the machine learning community. The aphorism is generally attributed to the statistician George Box. In his original article (Box 1976), which deals with general properties of statistical modeling, Box makes two important observations:
• "Parsimony: Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration. On the contrary, following William of Occam, he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."
• "Worrying Selectively: Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad."
In a further article (Box 1979), he explained this about the resilience of such wrong models: "Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = nRT relating pressure P, volume V, and temperature T of an 'ideal' gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules. For such a model there is no need to ask the question 'Is the model true?'. If 'truth' is to be the 'whole truth' the answer must be 'No'. The only question of interest is 'Is the model illuminating and useful?'".
This marks another important difference between the human scientific process and the automated scientific process. In the automated scientific process, a model is optimized for accuracy and generalization with very precise measures. While we know from Chap. 6 that parsimony is helpful for generalization, the automated process would never result in a model like the ideal gas law. Since ideal gases do not exist, no ground truth exists to optimize a machine learning model toward. Further elaboration on this topic is presented in Chap. 14.
7.6 Exercises

1. Algebra
   (a) Implement a practical version of Algorithm 7.
   (b) Since random numbers do not actually exist: What is a good way to make sure a randomized computer experiment can be repeated?
2. Pigeonhole Principle
   (a) Suppose 5 pairs of socks are in a drawer. How many socks do you minimally have to pick to guarantee that at least one pair is chosen?
   (b) Each of 15 red balls and 15 green balls is marked with an integer between 1 and 100 inclusive; no integer appears on more than one ball. The value of a pair of balls is the sum of the numbers on the balls. Show that there are at least two pairs, consisting of one red and one green ball, with the same value. Show that this is not necessarily true if there are 13 balls of each color.
   (c) Prove Lemma 7.2 by contradiction (an English-language explanation is enough).
3. No Free Lunch
   (a) Algorithm 7 uses the Euclidean distance to find all points that are exactly radius away from the center. Explain why this particular metric is not actually necessary.
7.7 Further Reading

Correlation vs. causation has been discussed extensively in philosophy and physics. A comprehensive summary is presented here:
• Beebee, Helen; Hitchcock, Christopher; Menzies, Peter: "The Oxford Handbook of Causation". Oxford University Press, 2009. ISBN 978-0-19-162946-4.
As outlined in Sect. 7.2.2, it is worth looking at the work of Ramon Llull:
• Alexander Fidora, Carles Sierra: "Ramon Llull: From the Ars Magna to Artificial Intelligence", Artificial Intelligence Research Institute, IIIA, 2011.
Chapter 8
Capacity of Neural Networks
Chapter 5 presented several notions of capacity for a linear separation unit, which includes the single neuron with its dot-product threshold. This chapter shows how to generalize the memory-equivalent capacity of a single neuron to a network of neurons. This question has sparked many approaches in the machine learning community, including fields such as "neural architecture search" and the wider field of "hyperparameter optimization (HPO)." However, as explained in Chap. 5, two models of the same capacity are able to represent the exact same functions. The understanding of how there can be an analytic solution to neural network capacity is facilitated by remembering the history of the neuron as a (bio-)electrical circuit element functioning as an energy threshold. Early implementations of neurons (Widrow and Hoff 1960) were hardware-only, and neural networks were therefore electrical circuits designed by electrical engineers. In electrical engineering, understanding networks of electrical components analytically, usually by seeing them as connected in series or in parallel, is standard theory and practice. It should therefore be no surprise that neural networks can be understood in the same way. Fortunately, information theory allows us to mathematically abstract away from having to go back to electrical engineering.
8.1 Memory-Equivalent Capacity of Neural Networks

Section 5.2 indicated the memory-equivalent capacity of a single neuron as 1 bit per parameter and the information capacity as 2 bits per parameter (the latter assuming the number of weights approaches infinity), based on previous work. Due to the infinity assumption, it is harder to generalize the information capacity from a single neuron to neural networks. We will therefore discuss only the generalization of the memory-equivalent capacity (MEC).
Fig. 8.1 The threshold or bias b counts as a weight w_i because it can easily be converted into one. Left: Original neuron with bias. Right: The same neuron with the bias converted into a weight
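The conversion in Fig. 8.1 can be checked numerically (a sketch with made-up numbers, not from the book): appending a constant input of 1 with weight −b makes the threshold part of the weight vector without changing any decision.

```python
import numpy as np

x = np.array([0.5, -1.0])                 # example inputs
w = np.array([2.0, 1.0])                  # example weights
b = 0.7                                   # threshold (bias)

with_bias = (x @ w) >= b                  # original neuron
x_aug = np.append(x, 1.0)                 # constant input of 1 ...
w_aug = np.append(w, -b)                  # ... whose weight -b absorbs the threshold
as_weight = (x_aug @ w_aug) >= 0.0
print(with_bias, as_weight)               # identical decisions
```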
MEC still indicates the point of guaranteed memorization capability and will maximally differ by a factor of 2 from the information capacity. In every neuron, both the weights and the threshold count as parameters, as the threshold can easily be made a weight (see Fig. 8.1). Then, recalling the Function Counting Theorem (Theorem 5.1) and its properties, we remember that MEC was deduced from the following property of the C(n, d) function:

C(n, d) = 2^n  ∀d : d ≥ n.    (8.1)
The number of possible separations (binary labelings) for n points is 2^n. Since C(n, d = n) = 2^n, it follows that log₂ C(n, d = n) = n. That is, all possible binary labelings of the d = n points can be realized, and so all binary classification training data tables (see Definition 2.1) of n rows can be memorized with n parameters in the neuron. Since the outcome of each row contributes maximally one bit of information, each parameter therefore contributes maximally 1 bit of capacity. In short, d parameters can guarantee to memorize d rows of any binary classification table. Note that, without modification, a single neuron cannot solve multi-class classification problems. This allows us to conclude:

Corollary 5 The memory-equivalent capacity of a single neuron equals the number of parameters in bits.

Moving from a single neuron to many neurons, the first question is how neurons can be combined. In practice, neurons are combined in layers, and layers are stacked. However, neurons can literally be combined in any way one sees fit. Two things, however, are always true: A neuron's input can only be 1) a data input (e.g., in a first layer) or 2) the output of another neuron. This allows us to talk about connections in parallel (when several weights are connected to the same source) and connections in series (when the output of a neuron is connected to the input of another neuron). When connections are connected to the same source, that is, in parallel, each parameter associated with a connection has a maximal MEC of one bit. Since each parameter can be set independently of any other, the total capacity is therefore the sum of the individual capacities. In other words:
Corollary 6 The memory-equivalent capacity of neurons in parallel is additive.

This is quite intuitive: Two independent storages connected in parallel to the same data source are additive, like two hard disks, two RAM banks, or two water buckets. A simple approximation to determine the MEC of a neural network would therefore be to just count the number of parameters and assume that is the MEC in bits. However, this practice overcounts quite significantly, and there is an easy way to get to a tighter upper bound for the MEC. In order to determine the memory-equivalent capacity of neurons in series, we have to remember that information can never be created by data processing (see also Sect. 7.1.1). That is, a neuron that is connected to the output of another neuron can maximally store the information received from the sending neuron.

Corollary 7 For neurons in series (e.g., in subsequent layers), the capacity of any layer cannot be larger than (is limited by) its input.

Intuitively, when calculating MEC, we think of a layer of neurons as storage. Now, if you receive 10 bits of information from somebody and you store them in a storage that has a capacity of 1000 bits, you still only have 10 bits of information. This is relevant because we try to estimate the total capacity in terms of mutual information, not just storage capacity. It is also worth noting that the capacity is still additive because each neuron still implements its own separation function. There is one more piece we need to determine the capacity: the amount of information that can be output by a neuron. The neuron implements the following separation function:

f(x) = Σ_{i=1}^{d} x_i w_i ≥ b.    (8.2)
So the output of a neuron is maximally 1 bit, namely when both outcomes of f(x) are equiprobable. That is, when P(f(x) = True) = P(f(x) = False) = 1/2 and thus −log₂(1/2) = 1 bit. The output may be less than one bit if the probabilities are not equal (see also Sect. 4.1.3).

Corollary 8 The maximum output of a single neuron is 1 bit of information.

Intuitively, one would think that the information output of a neuron depends on the activation function. For example, a separation function such as the popular ReLU activation

f(x) = max(0, Σ_{i=1}^{d} x_i w_i)    (8.3)

would yield more information, since when the weighted sum of inputs is greater than 0, the function outputs the sum as a number, not just the boolean value indicating
firing (or not firing) like the step function from Eq. 8.2. This intuition is correct, and calculating the capacity for regression using an activation function such as a ReLU could be tricky. However, for classification, this argument is less relevant because there is a decision in the end. Since our goal is to calculate the MEC, we require an output neuron making a sharp binary decision. This binary decision carries maximally 1 bit of information. The mutual information capacity of the system is therefore bottlenecked by that ultimate decision. As a consequence, the information output of a neuron can be approximated with that bottleneck, regardless of the shape of the activation function. In summary, we have derived four engineering rules to determine the memory-equivalent capacity of a neural network:
1. The output of a single neuron yields maximally one bit of information.
2. The capacity of a single neuron is the number of its parameters (weights and threshold) in bits.
3. The total capacity C_tot of M neurons in parallel is C_tot = Σ_{i=1}^{M} C_i, where C_i is the capacity of each neuron.
4. For neurons in series (e.g., in subsequent layers), the capacity of any layer cannot be larger than (is limited by) its input.
Example Figure 8.2 shows several examples of how to apply the above rules to estimate the MEC of concrete networks (see also the code sketch below). It is assumed in these examples that all connections have weights and all neurons have biases. We also assume the networks are not limited by the input. For example, a network's total capacity could be limited by a training table with very little information content (similarly to a dependent layer being limited by the output of its dependence, see Corollary 7). In layered networks, the output of a single neuron is often multiplexed over many connections. Of course, the output is still the same bit on each connection, so it should only be counted as 1 bit. Of noteworthy interest is the residual network (b). Its capacity is higher because the second-layer neuron is not limited to the output of the first neuron, as in a typical layered network. It also appeals to our intuition: The second neuron makes the decision taking into account both the original input and the "opinion" of the first neuron. Informed decisions are able to solve more complex tasks correctly.
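For plain, fully connected feedforward stacks such as (a), (c), (d), and (e) in Fig. 8.2, the four rules lend themselves to a small calculator; residual or irregular topologies such as (b) and (f) still require applying the rules by hand. This is a Python sketch of mine, not code from the book:

```python
def layered_mec(layer_sizes):
    """Memory-equivalent capacity (bits) of a fully connected feedforward
    network, per the four rules: parameters count as bits (rule 2), parallel
    capacities add (rule 3), and every layer after the first is limited by
    the at-most-1-bit output of each incoming neuron (rules 1 and 4)."""
    mec = 0
    incoming_bits = None                      # first layer: raw data inputs
    for prev, cur in zip(layer_sizes, layer_sizes[1:]):
        params = cur * (prev + 1)             # weights plus one bias per neuron
        cap = params if incoming_bits is None else min(params, incoming_bits)
        mec += cap
        incoming_bits = cur                   # each neuron emits at most 1 bit
    return mec

print(layered_mec([2, 1]))        # single neuron (a): 3 bits
print(layered_mec([2, 2, 1]))     # 3-layer network (c): 6 + min(3, 2) = 8 bits
print(layered_mec([2, 2, 2, 1]))  # deep network (d): 6 + 2 + 2 = 10 bits
print(layered_mec([3, 3, 3, 3]))  # multi-class network (e): 12 + 3 + 3 = 18 bits
```

The printed values reproduce the capacities worked out in the caption of Fig. 8.2.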
Fig. 8.2 The single neuron (a) has 3 bits of memory-equivalent capacity and can therefore guarantee to memorize any binary classification table with up to 3 rows (more when removing redundancies). The shortcut network (newer term: residual network) (b) has 3 + 4 = 7 bits of capacity because the second-layer neuron is not limited to the output of the first neuron. The 3-layer network in (c) has 6 + min(3, 2) = 8 bits of capacity. The deep network (d) has 6 + min(6, 2) + min(3, 2) = 10 bits of capacity. The multi-class classifier network in (e) has 4 ∗ 3 + min(4 ∗ 3, 3) + min(4 ∗ 3, 3) = 12 + 3 + 3 = 18 bits of capacity. That is, it can memorize all 8-class classification tables of maximally 6 rows (3 bits of output per row). The oddly shaped network under (f) has 3 + 1 + 4 = 8 bits of MEC based on the rules defined in this chapter. However, the neuron in the middle really does not add anything, as it is shortcut. Therefore, a tighter bound would be 3 + 4 = 7 bits (see also example (b))

8.2 Upper-Bounding the MEC Requirement of a Neural Network Given Training Data

As discussed in Sect. 4.4, one can upper-bound the information contained in a table by calculating the Shannon entropy of the output column. However, assuming we know that we want to use a set of linear separating functions (e.g., using a neural network or a Support Vector Machine) to solve a classification problem, we can get a closer estimate using the algorithm presented in this section.
The ultimate goal of a neural network is to implement a set of separation lines of the form represented by Eq. 8.2. Looking at Eq. 8.2, it is easy to see that the dot product has d + 1 variables that need to be tuned (with d being the dimensionality of the input vector x). This causes backpropagation to be NP-complete (Blum and Rivest 1992). To rid ourselves of those difficulties, we use the following trick that shortcuts our computational load dramatically: We assume all d dimensions are in equilibrium and can be modeled with equal weights in the dot product. In other words, we choose to ignore the training of the weights w_i by fixing them to 1 and train only the biases. To train the biases, we create a two-column table containing the 1-weighted sums of the sample vectors x and the corresponding labels. We now sort the rows of the two-column table by the first column (the sums). Finally, we iterate through the sorted table row by row, from top to bottom, and count the need for a threshold every time a class change occurs. This is the equivalent of adding a neuron with input weights 1 and the given threshold as bias to the hidden layer of a 3-layer neural network. Algorithm 8 shows a pseudocode implementation.
Algorithm 8 Calculate the upper-bound memory-equivalent capacity expected for a three-layer neural network
Require: data: array of length n containing d-dimensional vectors x; labels: a column of 0 or 1 with length n
function MEMORIZE(data, labels)
    thresholds ← 0
    for all rows do
        table[row] ← (Σ_{i=1}^{d} data[row][i], labels[row])
    end for
    sortedtable ← sort(table, key = column 0)
    class ← 0
    for all rows do
        if not sortedtable[row][1] == class then
            class ← sortedtable[row][1]
            thresholds ← thresholds + 1
        end if
    end for
    minthreshs ← log₂(thresholds + 1)
    mec ← (minthreshs ∗ (d + 1)) + (minthreshs + 1)
    return mec
end function
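A runnable Python version of the pseudocode might look as follows (a sketch; the array handling, names, and the rounding of the logarithm to an integer are my own choices):

```python
import numpy as np

def expected_mec(data, labels):
    """Estimate the MEC a 3-layer network needs to memorize the table
    (Algorithm 8): fix all weights to 1, sort by the row sums, and count
    class changes as required thresholds."""
    sums = data.sum(axis=1)                    # 1-weighted sums of the vectors
    sorted_labels = labels[np.argsort(sums)]
    thresholds = 0
    current = 0                                # start at class 0 as in the pseudocode
    for label in sorted_labels:
        if label != current:
            current = label
            thresholds += 1
    min_threshs = int(np.ceil(np.log2(thresholds + 1)))   # ideal training
    d = data.shape[1]
    return min_threshs * (d + 1) + (min_threshs + 1)

rng = np.random.default_rng(0)
X = rng.random((100, 4))                       # hypothetical training table
y = rng.integers(0, 2, 100)
print(expected_mec(X, y), "bits")
```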
For the actual complexity measurement, we can safely ignore column sums with the same value (hash collisions) but different labels: If equal sums of input vectors do not belong to the same class, Algorithm 8 counts them as needing another threshold. The assumption is that if an actual machine learner were built, training of the weights would resolve this collision. In the end, one can estimate the expected memory-equivalent capacity by assuming the machine learner is ideal and, therefore, training the weights (to something other than w_i = 1) is maximally effective. That is, perfect training can cut down the number of threshold comparisons exponentially, to at least log₂(t), where t is the number of thresholds (numerically, we have to add 1 so that 0 thresholds correspond to an MEC of 0 bits). The proof for this is given in Sect. 8.3, which assumes random inputs and balanced binary classes. So for imbalanced classes or a non-random mapping, training could result in an even steeper improvement. Also, the resulting memory-equivalent capacity number should be adjusted if there are more than two classes. The latter is left to the reader (see exercises below). As a last step, we take the number of thresholds needed and estimate the capacity requirement for a network by assuming that each threshold is implemented by a neuron in the hidden layer firing 0 or 1. The number of inputs for these neurons is given by the dimensionality of the data. We then need to connect a neuron in the output layer that has the number of hidden-layer neurons as input plus a bias. One can see that the bottleneck of Algorithm 8 is the sorting, which takes O(n log₂ n) steps (see Sect. 4.2). Finally, it is important to note that 3 layers is all one needs. By definition (see Definition 5.5), all machine learners with the same memory-equivalent
capacity can implement the same functions. So there is no reason to explore other topologies for this measurement. This is explained in the following section.
8.3 Topological Concerns

In 1900, mathematician David Hilbert composed a list of 23 problems with the goal of advancing science into the coming centuries. They were all unsolved at the time, and some of them proved to be very influential until today. Hilbert's thirteenth problem is entitled: "Impossibility of the Solution of the General Equation of the 7th Degree by Means of Functions of only two Arguments" (Hilbert 1902). It entailed proving whether a solution exists for all 7th-degree equations using algebraic (variant: continuous) functions of only two arguments. It was solved in 1961 in a more constrained, yet more general form by Vladimir Arnold and Andrey Kolmogorov (Kolmogorov and Arnold 2019) and is the reason we have artificial neural networks today. Arnold and Kolmogorov established that if f is a multivariate continuous function, then f can be written as a finite composition of continuous functions of a single variable and the binary operation of addition. More specifically, in the following form:

Theorem 8.1 (Kolmogorov–Arnold Representation Theorem, Kolmogorov and Arnold 2019)

f(x) = f(x_1, ..., x_n) = Σ_{q=0}^{2n} Φ_q ( Σ_{p=1}^{n} φ_{q,p}(x_p) )    (8.4)
It is easy to see that the inner sum is a set of neurons in a hidden layer and the outer sum is a set of neurons in an output layer. This was first established in (Arnold 2016). This result therefore establishes 3-layer neural networks as universal representations even for continuous functions. In other words, a 3-layer neural network can learn any function. Many specialized versions of Theorem 8.1 exist, including versions specialized to specific types of neural networks. They are categorized as Universal Approximation Theorems. However, practice has shown that, anecdotally, different topologies, e.g., deep layering, can be useful for technical reasons such as reducing noise from the input or including error correction. As an example, Fig. 8.3 shows a 1-hidden-layer vs. a many-hidden-layer network and how both work in different ways to approximate a circular structure. Both networks have the same MEC. Such technical considerations may regularize the network and can increase the need for MEC (see also the discussion around Corollary 3). Algorithm 8 does not account for that. However, Sect. 9.1 will dig deeper into many-layer networks.
Fig. 8.3 Visualization of two neural networks with 24 bits of memory-equivalent capacity (output layer hidden behind the result) approximating a 2-dimensional inner circle of points. The 1-hidden-layer network approximates the circle with a set of straight lines. The 3-hidden-layer network shows a growing complexity of features (discussed in Sect. 9.1)
8.4 MEC for Regression Networks

Neural networks can be used for regression. For regression modeling, the output layer represents a linear combination of hidden-layer values. That is, instead of the thresholding function f(x) = Σ_{i=1}^{d} x_i w_i ≥ b, the output layer calculates the function f(x) = Σ_{i=1}^{d} x_i w_i. Note that d here is not the dimensionality of the table but the dimensionality of the output layer, and the x_i are not the inputs of the table but the outputs of the layer that the output layer depends on. So, to first order, MEC is calculated in the same way as for a classification network. The complexity of the regression function increases with the MEC of the network. However, since the output neuron does not threshold, the exact analytical calculation of regression network MEC depends on the activation function used.
8.5 Exercises

1. Maximum MEC of Neural Networks
   What is the maximum memory-equivalent capacity of the following neural networks? Assume binary classification, all weights are non-zero, and all units
have biases. There is enough information in the input that the first layer is not limited by it:
(c) What is the maximum number of rows that each network in (a) and (b) can memorize?
(d) Answer (c) but for 4 classes instead of binary classification.
2. Draw two different neural network architectures that can guarantee to memorize the training data of a 12-instance binary classification problem with 4-dimensional inputs (assuming perfect training).
3. Train a neural network of your choice in any framework of your choice (TensorFlow, PyTorch, Scikit-Learn, Weka, self-built, etc.) to distinguish odd from even numbers:
(a) How many neurons are needed theoretically?
(b) How many neurons did you end up using?
(c) Discuss the limitations of your implementation.
4. Upper Bounds:
(a) Do Exercise 40.8 in MacKay's book (MacKay 2003). It is cited here as follows: Estimate in bits the total sensory experience that you have had in your life – visual information, auditory information, etc. Estimate how much information you have memorized. Estimate the information content of the works of Shakespeare. Compare these with the capacity of your brain assuming you have $10^{11}$ neurons each making 1000 synaptic connections and that the (information) capacity result for one neuron (two bits per connection) applies. Is your brain full yet?
Note that MacKay is right to suggest using information capacity for this estimate, as image and acoustic data are relatively high dimensional, and he also suggests 1000 connections per neuron.
(b) Expand Algorithm 8 to work with more than one binary classification.
(c) Expand Algorithm 8 to work with regression.
8.6 Further Reading

To see why neural networks can be seen as electric circuitry, it is easiest to take a look at the original sources:
• Bernard Widrow: "Adaptive "ADALINE" Neuron Using Chemical Memistors", 1960.
• J.J. Hopfield: "Artificial Neural Networks", IEEE Circuits and Devices Magazine, 1988 Sep;4(5):3–10.
Chapter 9
Neural Network Architectures
Much of the recent success of artificial neural networks needs to be attributed to larger systems of neurons, herein called architectures. This chapter exemplifies some of these systems on a conceptual level.
9.1 Deep Learning and Convolutional Neural Networks

Despite what is shown in Sect. 8.3, neural networks with many more than three layers of neurons have come into widespread use, subsumed under the term deep learning. The reasons for that are further explained in Sect. 16.1 but are mostly practical. Figure 9.1 shows the typical conceptual idea behind deep layering: Each layer gradually reduces the amount of information of the input before it is combined in an output layer. Deep learning refers to training what is called a deep neural network (DNN). A DNN is an artificial neural network with multiple layers between the input and output layers. Calculating the memory-equivalent capacity (MEC) is, of course, no different from what was explained in Chap. 8.
9.1.1 Convolutional Neural Networks

The most common type of DNN is a convolutional neural network (CNN), despite the fact that CNNs deviate from the original idea of increasingly complex feature combination. CNNs are a specialized type of artificial neural network that uses convolution functions in place of threshold functions in at least one of their layers. They are specifically designed to process pixel or audio data (that is, noisy data) and are used in image/audio recognition and processing. These types of networks were inspired by the anatomy of the animal brain (Fukushima 1980).
Fig. 9.1 Conceptual idea of deep learning: While the first layer contains linear separators and forms hyperplanes, further layers are able to combine features from previous layers, effectively creating manifolds of increasing complexity. Image from Schulz and Behnke (2012)
Neurological research has found that individual cortical neurons respond to stimuli only in a restricted region of the visual image known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual image. The connectivity patterns between neurons in convolutional networks resemble the organization of the animal visual cortex. The output of convolutional layers is therefore often referred to as feature maps. CNNs thereby automate the preprocessing compared to other image classification methods. The network learns to optimize the filters (or kernels) through automated learning, whereas historically, these filters were hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage and has led to the use of CNNs in other fields, including, as mentioned above, audio processing. Figure 9.2 shows a typical architecture for a convolutional neural network. Calculating the memory-equivalent capacity (MEC) (see Definition 5.5) of such a network can be hard. However, before we start such an endeavor, let us remind ourselves why we would calculate MEC. MEC is an estimate of when a machine learner can overfit the function implied by the training data. The convolutional part of a network, however, can never overfit because its purpose is not to fit a function. Its purpose is to convert the input part of the data into features. Since information can never be created (compare also Sect. 7.1.1), this means reducing information.
Fig. 9.2 Conceptual depiction of a typical convolutional neural network architecture. Different lossy compression operations are layered in sequence (and partly specialized during the training process) to prepare the input data for the (fully connected) decision layer, which consists of regular threshold neurons. Aphex34, CC BY-SA 4.0, via Wikimedia Commons
In other words, the convolutional layers implement a lossy compression (compare also Sect. 7.2). A convolutional neural network can therefore still overfit, but the cause of overfitting would be the decision layer (often called the fully connected layer). So, to design a CNN that does not overfit, we need to know two things: the MEC of the decision layer and the number of bits of information arriving at that decision layer. For the first one, we can use what was presented in the previous sections. To get to the information content output by the convolution layers, we need to estimate the compression ratio performed by the convolution and relate it to the input. In general, estimating the compression ratio of a convolutional layer is equivalent to understanding how many input values reduce to how many output values. In a CNN, the input is, mathematically speaking, a tensor with a shape: (number of inputs) × (input height) × (input width) × (input channels). After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape: (number of outputs) × (output height) × (output width) × (output channels). The compression ratio $G_{conv}$ is therefore simply calculated as:

Definition 9.1 (Convolutional Layer Generalization)

$$G_{conv} = \frac{(\#\,\text{inputs}) \times (\text{input height}) \times (\text{input width}) \times (\text{input channels})}{(\#\,\text{outputs}) \times (\text{output height}) \times (\text{output width}) \times (\text{output channels})} \qquad (9.1)$$
Note that here, as in the remainder of the book, we equate generalization with compression (more on that in Sect. 6.3). For example, a maxpooling layer (Scherer et al. 2010) implements a fixed filtering operation that calculates and then only propagates the maximum value of a given region. For a region of $8 \times 8$ pixels, $G_{conv}$ would be 64:1. Like all compression ratios, the compression ratios of individual layers multiply. That is, the total compression ratio of two convolutional layers in series is multiplicative.
Corollary 9 (Multiplicativity of Generalization)

$$G_{total} = G_1 \cdot G_2 \qquad (9.2)$$
Proof Assume $G_1 = \frac{n}{1}$ and $G_2 = \frac{m}{1}$, with $n, m \geq 1$. That is, $G_1$ reduces $m \cdot n$ bits to m bits, and $G_2$ further reduces the m bits of output to 1 bit. The total compression is therefore $G_{total} = \frac{n \cdot m}{1} = G_1 \cdot G_2$. ∎

A specific way of downsampling image data is called stride. A convolution filter is typically moved across the data left to right, top to bottom, with a one-column change on the horizontal movements, then a one-row change on the vertical movements. The amount of movement between applications of the convolution filter to the input data is referred to as the stride, and it is almost always symmetrical in all dimensions of the filter. Definition 9.1 assumes a stride of 1 in each dimension. If the stride s is $s > 1$, however, it has to be taken into account when calculating the generalization of a layer. For example, the stride could be $s_D = 2$ in all dimensions D of the convolution matrix. This has the effect of applying the convolution filter in such a way that the feature map output is down-sampled, resulting in the number of outputs being $\frac{1}{\prod_D s_D} = \frac{1}{4}$ of the number of inputs (where $s_D$ is the stride in dimension D). Therefore, in general, the product of the strides in each dimension, $G_{stride} = \prod_D s_D$, needs to be multiplied with the compression ratio of the convolution $G_{conv}$. As can be seen by their very definition as compression layers, deep layers cannot overfit and always generalize. However, that does not mean that the decision layer could not overfit. We are at risk of that for two main reasons. First, the MEC of the fully connected layer could be much higher than it should be for the problem. This can be seen by analyzing the training data table itself as outlined in Sect. 4.4 or by using the method from Sect. 8.2 on the output of the convolution and comparing it with the MEC of the decision layer (compare Sect. 8.1). Second, the convolutional layers could overgeneralize. For example, they could reduce all images to effectively 1 pixel. The decision layer could then overfit over general patterns. That is, input resolution and aspect ratio matter for convolutional networks. If an original input needs to be scaled up to fit a network, one can almost certainly assume that the convolutional layers will overgeneralize. The same holds when the area that contains the pattern we want to detect would be reduced to less than a pixel or audio sample.
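The following is a small sketch (with made-up shapes) of how one might compute these compression ratios in practice, following Definition 9.1 and Corollary 9:

```python
from math import prod

def g_conv(in_shape, out_shape):
    """Compression ratio of a layer per Definition 9.1.
    Shapes are (count, height, width, channels) tuples."""
    return prod(in_shape) / prod(out_shape)

# Illustrative example: 8x8 maxpooling on a 64x64 single-channel image
g_pool = g_conv((1, 64, 64, 1), (1, 8, 8, 1))
print(g_pool)  # 64.0, i.e., a 64:1 ratio

# A stride of 2 in both spatial dimensions halves height and width,
# contributing a factor of 2*2 = 4 to the layer's compression ratio.
g_stride = 2 * 2

# Corollary 9: ratios of layers in series multiply
g_total = g_pool * g_stride
print(g_total)  # 256.0
```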
9.1.2 Residual Networks Residual Networks (ResNets) were introduced by He et al. in (2015). They are a type of deep convolutional neural network (CNN) that addresses the so-called vanishing gradient problem by introducing skip connections or shortcuts (compare also Fig. 8.2) to jump over some layers.
Fig. 9.3 A residual block with a skip connection as part of a ResNet to avoid the vanishing gradient problem
The vanishing gradient problem is a common issue encountered during the training of deep neural networks using gradient-based optimization algorithms such as stochastic gradient descent. It refers to the situation where the gradients of the loss function with respect to the model parameters become very small as they propagate back through the layers during the backpropagation process. When the gradients are small, the updates to the parameters during training also become small, leading to slow convergence or the model getting stuck in its current state without any significant improvement. The main cause of the vanishing gradient problem is the repeated application of the chain rule during backpropagation, which results in the multiplication of gradients across layers. When the gradients are consistently less than 1 in magnitude, their product across layers can become very small, causing the gradients to vanish. This issue is more pronounced in deep networks with many layers, as the gradients have to pass through a larger number of multiplicative steps. The higher level cause is the Data Processing Inequality (see Sect. 7.1.1). The main idea of ResNets is to allow the network to learn residual functions with reference to the layer inputs, rather than the complete inputs. Figure 9.3 shows a simple residual block with a skip connection. The output of the block of neurons is $\vec{y} = F(\vec{x}) + \vec{x}$, where $F(\vec{x})$ is the function learned by the neurons and $\vec{x}$ is the input. That is, $F(\vec{x})$ learns a residual function and not the entire mapping. ResNets have been considered a major breakthrough in the field of deep learning, enabling the training of considerably deeper neural networks while mitigating the vanishing gradient problem. This benefit comes at the cost of a higher memory-equivalent capacity (MEC), though (see the explanation in Chap. 8), since the skip connections increase the capacity of the deeper layer. The increased MEC may result in extended training times and memory requirements, particularly for highly intricate architectures. Moreover, for the same problem, a ResNet architecture is more likely to overfit than a regularly layered architecture.
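As a concrete sketch (an illustrative block, not the exact architecture from He et al.), a residual block of the form $\vec{y} = F(\vec{x}) + \vec{x}$ can be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is a small stack of layers."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The skip connection adds the input back to the learned residual,
        # giving gradients a direct path around F during backpropagation.
        return torch.relu(self.f(x) + x)

block = ResidualBlock(channels=8)
x = torch.randn(1, 8, 32, 32)
print(block(x).shape)  # torch.Size([1, 8, 32, 32])
```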
9.2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) were first proposed by Goodfellow et al. in (2014). They consist of two neural networks, a generator G and a discriminator D, that are trained simultaneously in a game-theoretic framework. The conceptual idea, however, does not require the machine learning models to be neural networks. The generator model G creates fake samples from random noise, while the discriminator model D tries to distinguish between the fake samples and real data
Fig. 9.4 A simple generative adversarial network architecture. Note the use of the word "generative" for a noise filter, as explained in the text
samples. The training process is akin to a two-player minimax game, where the generator aims to maximize the probability of the discriminator being fooled, and the discriminator aims to minimize that probability. Figure 9.4 shows a simple GAN architecture. The generator and discriminator are trained in an alternating fashion, with the objective function given by

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log_2 D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log_2(1 - D(G(z)))], \qquad (9.3)$$

where $D(x)$ denotes the discriminator network, which takes an instance x (usually denoted $\vec{x}$ in this book) and outputs the probability that x is a sample from the real data. In other words, $D(x)$ represents the discriminator's estimate of the probability that x is a genuine sample rather than a generated one. $G(z)$ represents the generator network, which takes an input z sampled from noise. $p_{data}(x)$ is the distribution of the real data samples. The expectation $\mathbb{E}_{x \sim p_{data}(x)}$ indicates that the average is taken over the real samples. Analogously, $p_z(z)$ is the noise distribution from which noise vectors z are sampled as input for the generator. The expectation $\mathbb{E}_{z \sim p_z(z)}$ indicates that the average is taken over the generated samples. The objective of GAN training is to find the optimal generator G and discriminator D by minimizing the value function with respect to G and maximizing it with respect to D. The first term of the value function, $\mathbb{E}_{x \sim p_{data}(x)}[\log_2 D(x)]$, encourages the discriminator to correctly classify real samples as real. The second term, $\mathbb{E}_{z \sim p_z(z)}[\log_2(1 - D(G(z)))]$, encourages the discriminator to correctly classify generated samples as fake and simultaneously pushes the generator to produce samples that are more difficult for the discriminator to distinguish from the real samples. This adversarial process leads to the generator producing increasingly realistic outputs.
Distance (Heusel et al. 2017), but these can have their own limitations. This leads to a lack of control: GANs can generate diverse samples, but controlling the generation process to produce specific desired attributes can be challenging. Techniques such as conditional GANs (Mirza and Osindero 2014) and style-based generators (Karras et al. 2019) have been proposed to address this issue, but they may introduce additional complexity. Last but not least, GANs' ability to produce realistic synthetic data, such as images, videos, or text, raises ethical concerns, including the creation of deepfakes or the potential misuse of generated content for malicious purposes (see also the discussion in Chap. 17). Note that the use of the term "generator" for a noise filter is consistent with the discussion in Sect. 7.1.1. In other words, "creativity" does not generate information. Information can only be observed or filtered out of noise.
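The following is a minimal, self-contained training sketch for the objective in Eq. 9.3 on a toy 1-D problem (all network sizes and hyperparameters are illustrative assumptions). As in most implementations, it uses the natural logarithm via the binary cross-entropy loss, which differs from the $\log_2$ formulation only by a constant factor, and it uses the common "non-saturating" generator loss:

```python
import torch
import torch.nn as nn

# Toy task: G should learn to mimic samples from N(3, 1)
real_sampler = lambda n: torch.randn(n, 1) + 3.0

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(2000):
    # Discriminator step: classify real as 1 and generated as 0
    x_real, z = real_sampler(64), torch.randn(64, 1)
    loss_d = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make D classify generated samples as real
    z = torch.randn(64, 1)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 1)).mean().item())  # should approach 3.0
```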
9.3 Autoencoders

Autoencoders are a type of unsupervised learning neural network architecture that aims to learn a compact and efficient representation of the input data (Hinton and Salakhutdinov 2006; Vincent et al. 2010). Autoencoders consist of two main components: an encoder function $f_\theta$ and a decoder function $g_\phi$, which are typically parameterized by neural networks. The encoder function maps the input data to a lower-dimensional latent space, while the decoder function maps the latent representation back to the original input space. The learning objective of an autoencoder is to minimize the reconstruction error, typically measured by mean squared error (MSE) (see Eq. 3.22) or cross-entropy (see Eq. 11.2). Figure 9.5 shows a conceptual example of an autoencoder architecture. The input data are encoded into a lower-dimensional latent space and then decoded to
Fig. 9.5 An example of an autoencoder architecture. The input data $(x_1, x_2, \ldots, x_n)$ are encoded into a lower-dimensional space $(z_1, z_2, \ldots, z_k)$ through the encoder hidden layers $(h_1, h_2, \ldots, h_m)$. The decoder hidden layers $(h'_1, h'_2, \ldots, h'_m)$ then reconstruct the input data, resulting in the output $(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$. The optimization process minimizes the reconstruction error between the input data and the output
Fig. 9.6 A high-level illustration of the transformer architecture. The input sequence $(x_1, x_2, \ldots, x_n)$ is encoded by encoder layers (Enc) into a continuous space representation $(z_1, z_2, \ldots, z_k)$. The decoder layers (Dec) then generate the output sequence $(y_1, y_2, \ldots, y_n)$
reconstruct the original input data. The optimization process seeks to minimize the reconstruction error, which can be measured using various loss functions such as mean squared error (MSE) or cross-entropy loss (Bengio 2009). Autoencoders have been used for various tasks, including dimensionality reduction, feature learning, denoising, and anomaly detection (Goodfellow et al. 2016a; Chalapathy et al. 2019). Autoencoders are essentially simulations of Shannon channels (see also Chap. 5). In a statistical sense, they can be understood as non-linear Principal Component Analysis (PCA) (see also Jolliffe 2002; Pearson 1901) (Fig. 9.6).
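A minimal sketch of such an architecture (layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_latent=32):
        super().__init__()
        # Encoder f_theta: input -> lower-dimensional latent space
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(), nn.Linear(128, n_latent))
        # Decoder g_phi: latent space -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(), nn.Linear(128, n_inputs))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)              # a batch of flattened images
loss = nn.MSELoss()(model(x), x)     # reconstruction error
loss.backward()                      # train by minimizing this loss
```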
9.4 Transformers

Transformers, introduced by Vaswani et al. (2017), are a type of neural network architecture that has achieved state-of-the-art results on various natural language processing (NLP) tasks. Transformers are designed to handle sequential data more effectively than the architectures that predated them, like recurrent neural networks (RNNs) (Pollack 1990; Behnke 2001) and long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997), by leveraging the so-called self-attention mechanism (Bahdanau et al. 2014).
9.4.1 Architecture The transformer architecture consists of two main components: the encoder and the decoder. Both the encoder and the decoder are composed of multiple layers, each containing a self-attention mechanism, followed by position-wise feed-forward networks. The encoder maps the input sequence to a continuous representation, which is then fed into the decoder to generate the output sequence. In addition, transformers incorporate positional encoding to capture the relative position of elements in the input sequence.
9.4.2 Self-attention Mechanism

The self-attention mechanism is a key component of the transformer architecture. It allows the model to weigh the importance of different parts of the input sequence when generating the output. The self-attention mechanism computes an attention score for each pair of input elements, which is then used to weight their contributions to the output. The attention scores are computed using the scaled dot-product attention mechanism, defined as

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, \qquad (9.4)$$
where Q, K, and V are the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key vectors. The softmax function normalizes the attention scores, ensuring they sum to 1. Let us consider a simplified example to understand the self-attention mechanism. Given the input sequence "I like apples," the self-attention layer computes an attention score for each pair of words. The attention score indicates how much each word should contribute to the context-aware representation of every other word in the sequence. In our example, the word "apples" might have a high attention score when computing the context-aware representation of the word "like," as the two words have a strong semantic relationship. Conversely, the attention score between the words "I" and "apples" might be lower, as their relationship is weaker. Assuming we have the following attention scores for the given input sequence:
          I      Like   Apples
I         0.1    0.3    0.6
Like      0.2    0.5    0.3
Apples    0.1    0.6    0.3
These attention scores are used to compute a weighted sum of the input vectors (after adding positional encodings) for each word, resulting in context-aware representations:
• For the word "I": $0.1 \cdot InputVector(I) + 0.3 \cdot InputVector(like) + 0.6 \cdot InputVector(apples)$.
• For the word "like": $0.2 \cdot InputVector(I) + 0.5 \cdot InputVector(like) + 0.3 \cdot InputVector(apples)$.
• For the word "apples": $0.1 \cdot InputVector(I) + 0.6 \cdot InputVector(like) + 0.3 \cdot InputVector(apples)$.
These context-aware representations are then used as input for subsequent layers in the Transformer architecture. Many transformer architectures employ what was
coined “multi-head attention.” That is, it consists of multiple parallel attention layers, each with its own set of learned parameters. The outputs of these attention layers are concatenated and linearly transformed to produce the final output.
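A compact sketch of Eq. 9.4 in NumPy (single head, with made-up dimensions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. 9.4."""
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per query
    return scores @ V

rng = np.random.default_rng(0)
n_tokens, d_k = 3, 4                  # e.g., "I like apples" with 4-dim keys
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))
print(attention(Q, K, V).shape)       # (3, 4): one context-aware vector per token
```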
9.4.3 Positional Encoding

Since transformers do not have any inherent notion of position, positional encoding is used to capture the relative position of elements in the input sequence. The positional encoding is added to the input embeddings before they are fed into the encoder. A common approach for positional encoding is to use sinusoidal functions, as defined by Vaswani et al. (2017):

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10{,}000^{2i/d}}\right) \qquad (9.5)$$

$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10{,}000^{2i/d}}\right), \qquad (9.6)$$
where $PE_{(pos,i)}$ denotes the positional encoding for position pos and dimension i, and d is the dimension of the input embeddings. Let us consider a simple example to demonstrate how positional encodings are generated and added to the input sequence. Given an input sequence "I like apples," assume that the model uses a 3-dimensional word embedding and a 3-dimensional positional encoding. The input sequence is first converted into word embedding vectors (see Sect. 11.3.3):
          $x_1$    $x_2$    $x_3$
I          0.25   −0.12    0.47
Like       0.63    0.89   −0.35
Apples    −0.41    0.07    0.29
For the positional encoding, the equations above are used. For simplicity, in our example, we use the following:

$$PE(p, i) = \begin{cases} \sin(p) & \text{if } i \text{ is even} \\ \cos(p) & \text{if } i \text{ is odd} \end{cases}$$
Using the above function, the positional encodings for our 3-word input sequence can be calculated as:
Position    $PE_1$   $PE_2$   $PE_3$
1            0.84     0.54    −0.91
2            0.91    −0.42    −0.08
3            0.14    −0.99    −0.35
The positional encodings are then added element-wise to the corresponding word embedding vectors:

          $x_1 + PE_1$   $x_2 + PE_2$   $x_3 + PE_3$
I             1.09           0.42          −0.44
Like          1.54           0.47          −0.43
Apples       −0.27          −0.92          −0.06
These vectors, which now contain both semantic and positional information, are passed as input to the encoder.
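A sketch of the sinusoidal encodings from Eqs. 9.5 and 9.6 (dimensions chosen arbitrarily; note that this uses the full sinusoidal form, so the numbers differ from the simplified example above):

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encodings per Eqs. 9.5 and 9.6."""
    pe = np.zeros((n_positions, d))
    for pos in range(n_positions):
        for i in range(d):
            # paired dimensions (2i, 2i+1) share the same frequency
            angle = pos / 10_000 ** ((i // 2 * 2) / d)
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

# Word embeddings from the example above; adding the encodings yields
# the vectors that are passed to the encoder.
emb = np.array([[0.25, -0.12, 0.47],    # "I"
                [0.63,  0.89, -0.35],   # "like"
                [-0.41, 0.07,  0.29]])  # "apples"
print(emb + positional_encoding(3, 3))
```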
9.4.4 Example Transformation To illustrate the entire transformation process in a Transformer, let us consider a simple machine translation task from English to French. Given the already discussed input sentence “I like apples,” we want the Transformer to output the French translation “J’aime les pommes.” First, the input sentence “I like apples” is tokenized into individual words, resulting in a sequence of tokens: [“I”, “like”, “apples”]. Each token is then mapped to a unique integer identifier based on the model’s vocabulary, such as [1, 2, 3]. These integer identifiers are subsequently converted into continuous vectors using a learned word embedding matrix. For example, the identifiers [1, 2, 3] might be transformed into the following continuous vectors: [[0.25, -0.12, 0.47], [0.63, 0.89, -0.35], [-0.41, 0.07, 0.29]]. This transformation allows the model to capture semantic relationships between the words in the input sentence learned by the statistical co-occurrence of tokens in a large corpus (see also Sect. 11.3.3). The positional encoding (see Sect. 9.4.3) is computed for each position in the sequence and then added element-wise to the corresponding word embedding vectors. For instance, the positional encodings for the positions 1, 2, and 3 might be [[0.01, 0.02, -0.03], [-0.02, 0.03, 0.01], [0.03, -0.01, 0.02]]. After adding these positional encodings element-wise to the word embedding vectors, the resulting vectors are [[0.26, -0.10, 0.44], [0.61, 0.92, -0.34], [-0.38, 0.06, 0.31]]. This resulting sequence of vectors is then passed to the encoder, which is composed of multiple layers of self-attention and position-
wise feed-forward networks. The encoder processes the input sequence and generates a context-aware representation for each token, such as [[0.55, 0.12, -0.23], [0.78, 0.35, 0.18], [0.01, -0.29, 0.65]]. These context-aware representations capture the relationships between the input tokens and their surrounding context. The decoder then starts generating the translated output one token at a time. At each step, the decoder attends to the previously generated tokens and the context-aware representations produced by the encoder. The decoder uses masked self-attention to prevent the model from attending to future tokens in the output sequence during training. For example, when generating the second token "les" in the French translation, the decoder should only attend to the previously generated token "J'aime" and not the yet-to-be-generated token "pommes." The self-attention mechanism achieves this by setting the attention weights of future tokens to zero, ensuring that the decoder does not incorporate information from tokens that have not been generated yet. This masking allows the model to learn a proper conditional probability distribution over the target sequence. For our example, the decoder will first generate the token "J'aime," followed by "les" and finally "pommes." The output tokens are generated by taking the highest-probability token from the softmax distribution over the vocabulary at each step (see Sect. 9.4.2).
9.4.5 Applications and Limitations

Transformers have been highly successful in a wide range of natural language processing tasks, such as machine translation (Vaswani et al. 2017), sentiment analysis (Devlin et al. 2018), and question answering (Radford et al. 2018). Some of the most popular transformer-based models include BERT (Devlin et al. 2018), GPT (Radford et al. 2018), and T5 (Raffel et al. 2019). Like any method, transformers also have limitations. They can be computationally expensive due to the self-attention mechanism's quadratic complexity with respect to the input sequence length. Furthermore, transformers require large amounts of data and computational resources to achieve state-of-the-art performance, making them less accessible for researchers and practitioners with limited resources (see also Chap. 17). Finally, the memory-equivalent capacity required of transformers that create the attention mechanism and embedding out of billions of documents pretty much ensures that humans will never fully understand what was actually learned by a transformer.
9.5 The Role of Neural Architectures In recent decades, the machine learning community has devoted significant attention to developing specialized neural network architectures. The success of these
architectures is partly due to their resemblance to Shannon’s communication channel (Shannon 1948b) (also see Fig. 5.1). From a theoretical perspective, the choice of architecture mainly affects the effectiveness of the parameters in terms of memory-equivalent capacity, as indicated by Corollary 3 and Theorem 8.1 (for a detailed discussion, refer to Sect. 8.3). In practice, the architecture of a neural network plays a crucial role in facilitating human understanding and interpretability of the modeling process (not the result), much like the architecture of a house influences its livability and aesthetics. A well-designed house architecture enhances the living experience for its inhabitants. Similarly, empirical evidence indicates that carefully designed model architectures simplify the training process (Krizhevsky et al. 2012). More on this in Chap. 16. Furthermore, it enables humans to better comprehend some of the underlying relationships within the data (Goodfellow et al. 2016a), which is an important advantage for academic training. However, it is important to acknowledge that the majority of the credit for the effectiveness of these models is due to the large amounts of data they are exposed to during training. The availability of massive datasets has significantly contributed to the success of deep learning and various neural network architectures (Halevy et al. 2009).
9.6 Exercises

1. Implement a deep convolutional neural network from scratch using a popular deep learning framework (e.g., TensorFlow or PyTorch). Train and evaluate the network on a standard image classification dataset, such as CIFAR-10 or MNIST. First, experiment blindly with various hyperparameters and architectures and observe the model's performance. Second, apply the measurements proposed in this book to reduce the hyperparameter search space and observe the model's performance.
2. Convert your deep convolutional network into a Residual Network (ResNet) architecture. Discuss the challenges and trade-offs involved in adapting to a ResNet architecture.
3. Implement a Generative Adversarial Network (GAN) to generate new images similar to a given dataset. Experiment with different machine learning model types and loss functions and observe the quality and diversity of the generated images. Discuss the challenges in evaluating the performance of GANs and propose some quantitative and qualitative evaluation methods based on the methods provided in this book.
4. Conceptually develop a multimodal transformer architecture that takes both images and text as input and generates captions for the images. Analyze the challenges involved in integrating multiple modalities into the transformer architecture.
5. Investigate the impact of positional encodings in the transformer model. Experiment with different types of positional encodings (e.g., sinusoidal, learned, or fixed) and evaluate their effect on model performance. Discuss the advantages and disadvantages of different positional encoding schemes.
6. Analyze the attention patterns in a pretrained transformer model by visualizing the attention weights. Investigate the role of different attention heads and layers in capturing various linguistic properties, such as syntax, semantics, and long-range dependencies. Discuss the interpretability of transformer models based on the attention patterns.
9.7 Further Reading

• Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. This textbook provides a detailed introduction to deep learning, covering topics such as deep feed-forward networks, convolutional networks, sequence modeling, and practical methodology.
• LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. This influential review article provides an overview of the field, including a discussion of convolutional networks and their applications in computer vision and speech recognition.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30, 5998–6008. This seminal paper introduces the transformer architecture, which has become the basis for many state-of-the-art natural language processing models. The authors provide a detailed description of the self-attention mechanism and its applications in sequence-to-sequence tasks.
• Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O'Reilly Media. This hands-on book covers a wide range of machine learning topics, including deep learning and neural networks, using popular libraries such as Scikit-Learn, Keras, and TensorFlow. It provides practical examples and real-world use cases for various architectures and techniques.
Chapter 10
Capacities of Some Other Machine Learning Methods
As we have seen in previous chapters, memory-equivalent capacity (see Definition 5.5) is an important metric to understand the efficiency of parameters in a machine learning model, and it brings us one step closer to reproducibility (see Corollary 3 and Chap. 15). This chapter therefore discusses other machine learning approaches and how to calculate their memory-equivalent capacity (MEC).
10.1 k-Nearest Neighbors

The simplest supervised algorithm introduced in Chap. 3 is k-Nearest Neighbors (k-NN) (see Sect. 3.3.1), which is a non-parametric, "lazy" learning algorithm used for classification and regression tasks that does not even actually train a model. It is therefore useful to start by analyzing the memory-equivalent capacity of this algorithm as a baseline. The only variables we are able to consider for the memory-equivalent capacity (MEC) of k-NN are the number of training samples (n), the number of nearest neighbors considered (k), the intrinsic dimensionality of the data ($d_{int}$), and the distance function. While the decision function can be varied, we will consider here only the most typical decision function: majority vote. Given a set of training data points in a table (see Definition 2.1), $(\vec{x}_i, f(\vec{x}_i))$, the k-NN algorithm assigns a label to a new data point $\vec{x}$ by considering the majority vote of its k-nearest neighbors in the training dataset. In the case of regression, the output is the average of the target values of the k-nearest neighbors. The MEC of a k-NN classifier can be approximated as

$$MEC_{kNN} \approx \frac{n}{k \cdot d_{int}}. \qquad (10.1)$$
Here, n is the number of training samples, k is the number of nearest neighbors, and $d_{int}$ is the intrinsic (fractal) dimensionality of the data (see also Sect. 4.3). The intrinsic dimensionality is a measure of how the data are effectively distributed in the feature space, which can be lower than the original dimensionality if the data lie on a lower-dimensional manifold or if the distance function weighs dimensions in some way. $d_{int}$ can therefore be hard to estimate. So, as in the previous chapter, in the absence of knowledge, it is safer to use the maximum memory-equivalent capacity, which is given when $d_{int} = 1$. The memory-equivalent capacity of k-NN decreases as the number of nearest neighbors (k) increases or the intrinsic dimensionality ($d_{int}$) of the data increases. A larger k therefore reduces the risk of overfitting. On the other hand, a higher intrinsic dimensionality indicates that the data are more complex and require more capacity to generalize effectively.
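A trivial sketch of Eq. 10.1 and its conservative default:

```python
def mec_knn(n, k, d_int=1.0):
    """Approximate MEC of a k-NN classifier per Eq. 10.1.
    With the default d_int = 1, this is the safe maximum estimate."""
    return n / (k * d_int)

print(mec_knn(n=10_000, k=1))   # 10000.0 bits: 1-NN can memorize everything
print(mec_knn(n=10_000, k=25))  # 400.0 bits: larger k reduces capacity
```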
10.2 Support Vector Machines

Support Vector Machines (SVMs) are discussed in Sect. 3.3.6. In this section, we discuss the memory-equivalent capacity (MEC) of SVMs and derive the relevant formulas; as explained in Chap. 6, it is similar to the MEC of k-Nearest Neighbors. Please refer to Sect. 3.3.6 to recall the formulation of the SVM optimization problem. From that, we can see that the parameters that can influence the MEC are the number of support vectors ($n_{SV}$), the intrinsic fractal dimensionality of the input space ($d_{int}$), and the regularization parameter (C). Just as discussed in the previous section, the distance function (here: the kernel) also influences the MEC. Now, the memory-equivalent capacity $MEC_{SVM}$ of a linear SVM can be approximated as:

$$MEC_{SVM} \approx \frac{n_{SV}}{C \cdot d_{int}}. \qquad (10.2)$$
This approximation is based on the observation that each support vector contributes a certain amount of information to the classifier, just as a reduced k-Nearest Neighbor algorithm would do (see also the discussion in Sect. 6.1 and applying Eq. 10.1). The capacity is inversely proportional to both the regularization parameter C and the intrinsic dimensionality $d_{int}$. As the regularization parameter increases, the capacity decreases, indicating that the classifier becomes less prone to overfitting. When a kernel function is employed, SVMs can learn non-linear decision boundaries by implicitly mapping the input data into a higher dimensional feature space. To understand this, we can relate the kernel to the generalized version of the function counting theorem by Thomas Cover (1965). In Cover's article, the growth function $G(n)$ is the number of dichotomies (distinct binary classifications) that can be realized by a given learning algorithm on a set of n training points.
The generalized function counting theorem provides an upper bound on this growth function. As we saw in Sect. 3.3.6, in a kernelized SVM, the decision boundary is defined in a high-dimensional feature space implicitly mapped by the kernel function $K(x_i, x_j)$. The kernelized SVM aims to maximize the margin between the two classes while minimizing the classification error. The decision function can be expressed as

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \qquad (10.3)$$
Here, $\alpha_i$ are the Lagrange multipliers, $y_i$ are the class labels, and b is the bias term. To apply Cover's generalized function counting theorem to kernelized SVMs, we need to consider the properties of the kernel function and its relationship with the growth function. The kernel function implicitly maps the input data into a higher dimensional feature space, which might lead to an increase in the growth function due to the increased complexity of the classifier. Cover's formula for the information capacity (see Definition 5.1) of an r-th order variety is given as follows:

$$I(d, r) = 2 \cdot \binom{d + r}{r}. \qquad (10.4)$$
Here, $I(d, r)$ denotes the information capacity per stored parameter (in SVMs, these are the support vectors) for a kernel of rational r-th order variety, and d is the number of dimensions. We can approximate the memory-equivalent capacity to be half of that. As one can see, just like for a linear SVM, the memory-equivalent capacity of a kernelized SVM is still largely determined by the number of support vectors, as these are the stored parameters.
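A small sketch of Eq. 10.4 and the halved MEC approximation:

```python
from math import comb

def cover_capacity(d, r):
    """Information capacity per stored parameter for an r-th order
    kernel variety in d dimensions, per Eq. 10.4."""
    return 2 * comb(d + r, r)

def mec_per_support_vector(d, r):
    # MEC is approximated as half the information capacity
    return cover_capacity(d, r) / 2

print(cover_capacity(d=2, r=1))          # 6: linear variety in 2 dimensions
print(mec_per_support_vector(d=2, r=2))  # 6.0: quadratic kernel, 2 dimensions
```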
10.3 Decision Trees

Section 3.3.3 already introduced decision trees, and Sect. 3.3.4 ensembles of them as a structure. To analyze the MEC of decision trees, we start with a thought experiment (Fig. 10.1).
10.3.1 Converting a Table into a Decision Tree

We can memorize a data table (see Definition 2.1) by simply defining a (non-binary) decision tree (see Sect. 3.3.3) that checks for the exact value of column 1, row 1 at
Fig. 10.1 In this example, the data table has 3 columns and 2 rows. The decision tree has a depth of 4, with each level corresponding to a column in the table, and each node at a level corresponding to a value in that column. The leaves of the tree correspond to the output values .f (xi ) for each row in the table
level 1, then checks for the exact value of column 2, row 1, at level 2, etc., until all columns in row 1 are checked. The output is then $f(\vec{x}_1)$. This can be repeated until all rows are represented in the tree. The result is a decision tree of depth $D + 1$ (including output nodes) and base n (the number of rows). That is, a tree with a total of $(D + 1)^n$ nodes. The tree can be made a binary tree by replacing the exact-value comparison conditional with a binary conditional. That is, each node asks a yes/no question. The new depth of the binary decision tree is then $\log_2((D + 1)^n)$. Obviously, this is about the most inefficient way to build a decision tree from a table. However, the point of this thought experiment is merely to show that every table can be represented by a decision tree. Furthermore, we can see how the dimensionality D of the data table (usually defined as the number of columns) and the depth of the tree are related.
10.3.2 Decision Trees

Let us start with the most rigorous structure: the perfect binary tree. Figure 10.2 shows the general form of a perfect depth-2 binary decision tree. Each node (including the root node) represents a decision, and the leaf (terminal) nodes represent the outcomes. A decision tree can be defined over arbitrary decisions. However, for a decision tree to model a function, it will choose intervals in the domain of the function (e.g., a curve) and approximate the outputs.

Theorem 10.1 The memory-equivalent capacity of a perfect binary tree is $2^D$ bits, where D is the depth of the tree.

Proof This can be seen as follows: A perfect binary tree has $2^D$ nodes at the deepest level, where D is the depth of the tree (counting the root node as depth 0), as, at each depth
Fig. 10.2 Generalized form of a perfect binary decision tree of depth 2. Each inner node represents a decision and each leaf node represents an outcome that would lead to an output
level, the number of nodes doubles. The number of leaf nodes is therefore $2^D$. By definition, a perfect binary tree of level D is therefore able to have $2^D$ different outputs. Each output can be set arbitrarily to a value $\in \{0, 1\}$. Therefore, the tree is able to represent all possible $2^{2^D}$ binary labelings of D binary dimensions. That is, the MEC of a perfect binary tree of depth D is $\log_2 2^{2^D} = 2^D$ bits (compare Definition 5.5). ∎
10.3.3 Generalization of Decision Trees Assuming a data table of n rows and D columns (see Definition 2.1), the maximum depth required for a decision tree is D and the maximum number of leaf nodes is n. Assuming a perfect binary tree to represent the data, we know that .D = log2 n. If a binary decision tree is required to be perfect to represent the data, there can be no generalization:
Corollary 11 A perfect binary decision tree does not generalize.

Proof A perfect binary decision tree has $2^D$ leaf nodes, which by Theorem 10.1 is its memory-equivalent capacity in bits. This tree can label $2^D$ rows of a table correctly. In other words, the maximum generalization (see Definition 6.3) is $G = \frac{2^D \text{ labels}}{2^D \text{ bits}} = 1$, which is no generalization. ∎

That is, each leaf node represents exactly one label. For example, the 2-variable XOR function, discussed in Sect. 5.1.1, requires a tree of depth $D = 2$ and $n = 4 = 2^D$ labels. Since $2 = \log_2 4$, the tree is perfect. The contraposition to Corollary 11 is: Only imperfect trees generalize. Tree imperfection is a necessary but not a sufficient condition, as can be seen by constructing a tree using the method outlined in Sect. 10.3.1 for a data table filled with completely unique values, when there is a high input dimensionality and a low number of rows. However, consistent with Sect. 5.2, we now know that generalization is synonymous with the pruning of branches of a perfect decision tree. That is, minimizing the number of leaf nodes per row of the data table or, equivalently, maximizing the expectation that each path from root to leaf is taken (see Sect. 6.2).
10.3.4 Ensembling

Ensemble methods in machine learning, such as bagging, boosting, and random forests, have been shown to significantly improve the performance of decision trees by combining multiple decision trees to reduce overfitting and improve generalization (see also Sect. 3.3.4). The capacity of ensembles can be analyzed using the same principles as for individual decision trees. In fact, the memory-equivalent capacity of an ensemble of decision trees can be approximated as the sum of the memory-equivalent capacities of the individual trees.

Corollary 12 The memory-equivalent capacity of an ensemble of T decision trees is maximally $T \cdot l$ bits, where l is the number of leaf nodes per tree.

Proof By Corollary 10, each decision tree can represent up to l rows in the data table. Therefore, an ensemble of T decision trees can represent up to $T \cdot l$ rows. Since the memory-equivalent capacity is the number of representable binary labels, the memory-equivalent capacity of the ensemble is $\log_2(2^{T \cdot l}) = T \cdot l$ bits. ∎

The capacity of an ensemble of decision trees is therefore proportional to the number of trees and the number of leaf nodes per tree. The number of leaf nodes per tree is a measure of the complexity of the tree and its ability to overfit the data. As such, ensembling can be seen as a form of regularization that reduces the capacity of individual decision trees by combining them. The memory-equivalent capacity of the ensemble of trees is therefore practically much smaller and can be approximated by estimating the generalization of the ensembling function. For example, if the
ensembling implements the average of the decisions of three trees, then the capacity is one third of the maximal capacity determined by Corollary 12.
10.4 Genetic Programming

Genetic Programming (GP) was introduced in Sect. 3.3.7 as a machine learning technique that simulates natural selection to fit a target function. Since the adaptation functions are fixed in GP, the parameters to consider for determining the memory-equivalent capacity (MEC) are the size of the population (N), the information content of the "DNA" strings ($s_i$) in bits, and the number of generations (G). Intuitively, the memory-equivalent capacity of GP increases with the size of the population, the sizes of the strings, and, since a single, good-enough-fitting individual string can stop the process, the number of generations. That is, in the context of GP, the memory-equivalent capacity captures the algorithm's search space, which depends on the number of candidate strings that can be explored during the evolutionary process. A larger memory-equivalent capacity indicates a more extensive search space, which can potentially lead to more powerful and complex solutions. Of course, it also increases the risk of overfitting, particularly when the problem at hand does not require complex solutions or when the available computational resources are limited. Therefore, the memory-equivalent capacity (MEC) of GP can be upper-bounded by

$$MEC_{GP} \leq \sum_{i=1}^{N} s_i \cdot G. \qquad (10.5)$$
Here, N is the size of the population, $s_i$ is the information content of string i in bits, and G is the number of generations. A tighter upper bound can be found by analyzing the maximum rate of adaptation in bits for each iteration of a biological genetic algorithm. That is, analyzing the intellectual capacity (see Definitions 5.2 and 5.3) of genetic evolution. This question is answered in detail in MacKay (2003), Chapter 19, based on a theory of sexual reproduction. We will summarize the result as follows: If species reproduce sexually, the rate of information acquisition is upper-bounded by

$$MEC_{GP} \leq \sqrt{S} \cdot G \text{ bits}, \qquad (10.6)$$

where S is the size of the genome in bits (that is, $S = \sum_{i=1}^{N} s_i$, see above) and G is the number of generations. It is assumed that the initial strings are uniform random.
10.5 Unsupervised Methods

The memory-equivalent capacity (MEC) can also be calculated for unsupervised machine learning algorithms. In unsupervised systems, there is less concern about overfitting since, technically, no labels are fit. However, knowing the MEC still helps to compare machine learning across algorithms, and, more recently, tasks that used to require supervision have started to be solved without supervision (He et al. 2020; Devlin et al. 2019; Dosovitskiy et al. 2021; Ramsauer and Schafle 2020). In this section, we will discuss the memory-equivalent capacity of two unsupervised learning methods: k-means clustering as a baseline, and Hopfield networks.
10.5.1 k-Means Clustering

k-means is an unsupervised learning algorithm that aims to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean. The memory-equivalent capacity of k-means is determined by the number of centroids required to store the clustering information. The centroids are the means of the clusters and can be computed as

$$\mu_j = \frac{1}{n_j} \sum_{i \in C_j} \vec{x}_i, \qquad (10.7)$$
where $\mu_j$ is the centroid of cluster $C_j$, and $n_j$ is the number of data points in the cluster. The memory-equivalent capacity of k-means is proportional to the number k of clusters, and k is upper-bounded by N, the number of training points. The clusters then need to be assigned to labels, and, of course, there are $c^k$ possible assignments, where c is the number of classes. That is, the MEC of k-means is

$$MEC_{k\text{-}means} = \log_2 c^k \text{ bits}. \qquad (10.8)$$
10.5.2 Hopfield Networks Hopfield networks are a type of recurrent neural network with binary neuron states that can be used for associative memory tasks. That is, the memory-equivalent capacity of a Hopfield network is its actual memory capacity. In a Hopfield network, the memory capacity is determined by the number of stored patterns that it can reliably retrieve. According to MacKay (MacKay 2003), Chapter 42, the memory capacity of a Hopfield network can be estimated as
$$MEC_{Hopfield} = \alpha N, \qquad (10.9)$$
where N is the number of neurons in the network, and $\alpha$ is the storage capacity per neuron. MacKay suggests that for a Hopfield network with Hebbian learning and synchronous updates, the optimal value of $\alpha$ is approximately 0.14 bits. The memory-equivalent capacity of Hopfield networks therefore grows linearly with the number of neurons in the network.
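The two estimates from this section reduce to one-liners; a sketch:

```python
from math import log2

def mec_kmeans(k, c):
    """MEC of k-means per Eq. 10.8: log2 of the c**k label assignments."""
    return k * log2(c)          # log2(c**k) = k * log2(c)

def mec_hopfield(n_neurons, alpha=0.14):
    """MEC of a Hopfield network per Eq. 10.9 (alpha from MacKay)."""
    return alpha * n_neurons

print(mec_kmeans(k=10, c=2))        # 10.0 bits
print(mec_hopfield(n_neurons=100))  # 14.0 bits
```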
10.6 Exercises

1. Provide an example of how a different distance function can change the MEC of k-Nearest Neighbors.
2. Assume you have a data table with n rows and D columns. What is the maximum depth required for a perfect binary decision tree to represent the data? What is the maximum number of leaf nodes? How does this relate to the memory-equivalent capacity of the tree?
3. What is the memory-equivalent capacity of an imperfect binary decision tree? Provide a proof for your answer.
4. Explain why a perfect binary decision tree does not generalize. How does this relate to the memory-equivalent capacity of the tree?
5. Compare and contrast the MEC in supervised and unsupervised machine learning algorithms and discuss their implications for practical applications.
6. Investigate the relationship between the MEC and other measures of algorithmic complexity, such as Kolmogorov complexity (see Appendix B) and computational complexity for unsupervised learning.
10.7 Further Reading

This chapter only gives a cursory overview of the topic. Much of the suggested further reading drills deeper.
• Amit, D.J. (1985). A Theory of the Storage and Retrieval of Classical Patterns in Neural Networks. Nature, 376(6535), 215–216.
• MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
• Ackley, D.H., & Hinton, G.E. (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1), 147–169.
• Haykin, S. (2009). Neural Networks and Learning Machines (3rd ed.). Pearson.
• Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC Press.
• Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
• Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple Classifier Systems, 1857, 1–15.
• Eiben, A.E., & Smith, J.E. (2015). Introduction to Evolutionary Computing. Springer.
• Hinton, G.E., & Salakhutdinov, R.R. (2006). Unsupervised Learning. In B. Schölkopf, J.C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (pp. 689–696). MIT Press.
• Schölkopf, B., & Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Chapter 11
Data Collection and Preparation
This chapter and the subsequent Chap. 12 introduce concepts related to collection and preparation of training data for the automated scientific process. Many of these considerations (except for data collection and annotation) and transformations are not necessary in the traditional, manual scientific process because the human brain easily abstracts away many of the issues discussed in this chapter.
11.1 Data Collection and Annotation

As discussed in Sect. 2.1, all modeling begins with a question. This question creates uncertainty, which we are aiming to reduce. It also creates the frame of reference for what is relevant and what is irrelevant information (see also Sect. B.5). Data collection should always start with a specification. This not only allows for a more precise collection of the data in compliance with the task, it also allows one to implement unit testing as described in Sect. 13.2.1. A data specification does not have to be set in stone; as discussed in Sect. 13.2.1, it can be refined as more data with certain properties are collected – as long as those data are relevant in response to the original question. On top of the specification, when collecting samples, equilibrium is key (see Definition 4.10). Ideally, every experimental factor is represented in full range and with equal frequency. For example, if age plays a role, the age range of the subjects should be equally spaced from the minimum possible age to the maximum possible age, and there should be as many subjects in each age range as possible. If gender plays a role, each gender should be represented with equal frequency. Similarly, what one intuitively calls "edge and corner cases" should be represented in the data. Whenever data or outcomes are not in equilibrium, we have a bias, and we risk including this bias in our final model.
Definition 11.1 (Bias) Reduction of uncertainty toward an outcome by factor(s) not intended to be part of the experiment.

Bias is discussed in more detail in Sect. 13.3. It is also important to note at this point that no data and no outcome are ever completely unbiased. However, reducing bias is still important, as too much bias invalidates the experiment itself: When unintended and/or unknown factors play too large of a role, there is no need to perform the experiment in the first place. At the same time, data can be too diverse for inference: A unique number in each row, like a database key or an identification string, is not useful for prediction. That is, nothing can be inferred from a value whose only purpose is to identify a specific row (more on that in Sect. 11.5).

For supervised learning, annotation is very important. That is, each sample $\vec{x}$ needs to be annotated with an experimental outcome $f(\vec{x})$. We already know that Definition 2.3 forbids us from annotating the same input with different outcomes. Furthermore, it is mandatory for the quality of a model that the annotation is accurate. Machine learning cannot correct annotation errors. In the best case, annotation errors are treated as exceptions and the particular instances are memorized. In the worst case, the machine learning algorithm will just not be able to infer the right rule. In any case, the accuracy of the machine learner cannot exceed the accuracy of the annotation. Creating annotations automatically requires a model and therefore presents a catch-22.

Highly subjective tasks (for example, sentiment analysis) or tasks where it is difficult to be precise (for example, word boundaries in speech analysis) may require the use of multiple annotators per instance. Annotations should then only be counted as correct when the majority of all annotators agree. The concept of the diversity of human annotations is called annotator agreement.

Definition 11.2 (Inter-annotator Agreement) Inter-annotator agreement (also called by various similar names, such as inter-rater agreement or inter-observer reliability) is the degree of agreement among independent observers who assess the same phenomenon.

It goes without saying that in such a scenario, annotators should be independent and not communicate with each other. Furthermore, it may make sense to diversify the properties of the human annotators, that is, choose annotators of different gender, age range, race, religion, social status, etc., depending on the task at hand. Inter-annotator agreement can be analyzed just like model accuracy in a 1-vs.-majority scheme, and one can also calculate confusion matrices. Ideally, one ends up with an accurate annotation where each class has the same number of instances associated with it (see also Sect. 11.7 for mitigation if this is difficult). For a regression problem, it is best if outcome values are equally represented within the range of the regression task.
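As an illustration of majority voting and a very simple agreement measure (a sketch only; more refined metrics such as Cohen's kappa exist in the literature):

from collections import Counter

def majority_label(votes):
    # Return the majority annotation, or None if there is no majority
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def percent_agreement(annotations):
    # Fraction of instances on which all annotators gave the same label
    return sum(len(set(votes)) == 1 for votes in annotations) / len(annotations)

# Three annotators labeling four instances:
annotations = [("pos", "pos", "neg"), ("neg", "neg", "neg"),
               ("pos", "neg", "neg"), ("pos", "pos", "pos")]
print([majority_label(v) for v in annotations])  # ['pos', 'neg', 'neg', 'pos']
print(percent_agreement(annotations))            # 0.5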
11.2 Task Definition

Before engaging in the automated scientific process, the first question should be: Can the problem potentially be solved manually? For example, data of low dimensionality (two or three columns) and sample counts in the hundreds can usually be better handled using traditional mathematical and statistical methods. Computational support for tasks like these is also available in the form of spreadsheet applications, statistical programming languages, or algebra packages. Not only is a well-explainable, few-variable formula the most preferable result of the scientific process (see also the discussion in Sect. 7.5); with hundreds of years of scientific exploration behind us, there is also a high chance that a solution already exists.

Once the decision is made that machine learning is required, one needs to define the task as either supervised or unsupervised. Unsupervised machine learning, that is, clustering, seems to save the tedious annotation work. However, forming perfect clusters requires an exact rule on how to cluster, that is, a model. So unless a coarse approximation is enough, supervised modeling is required. Among supervised tasks, the two ways to represent the function implied in the data table are regression and classification. Binary classification is usually the easiest to solve and can later be extended to many classes.
11.3 Well-Posedness

Whatever the task is – supervised or unsupervised, regression or classification – no machine learning will succeed if the task is not, or cannot be made, well-posed (see Definitions 2.8 and 2.9). That is, in general, small changes of the inputs in the data should only result in small changes of the outcomes. If, in general, even small changes in the input result in large changes in the output, the problem has to be characterized as chaotic. In fact, the metaphorical example for such a problem is called the butterfly effect (Lorenz 1972): the details of a tornado (the exact time of formation, the exact path taken) being influenced by minor perturbations such as a distant butterfly flapping its wings several weeks earlier. Chaotic behavior is often observable during data collection and should be handled then. The good news is that there is a whole theory around it: chaos theory. From chaos theory, we can learn how to possibly avoid chaos.
11.3.1 Chaos and How to Avoid It

The field of physics has the longest experience with modeling nature. It is therefore no surprise that the field has developed notions of detecting and measuring chaos. In
physics, one way to detect and measure chaotic behavior is via the so-called Lyapunov exponents (Lyapunov 1892). They are defined as follows:

Definition 11.3 (Lyapunov Exponents) Given two starting trajectories (in phase space) that are infinitesimally close, with initial separation $\delta Z_0$, the two trajectories end up diverging at a rate given by $|\delta Z(t)| \approx e^{\lambda t}|\delta Z_0|$, where $t$ is the time and $\lambda$ is the Lyapunov exponent.

The rate of separation depends on the orientation of the initial separation vector, so a whole spectrum of Lyapunov exponents can exist. The number of Lyapunov exponents is equal to the number of dimensions of the phase space, though it is common to just refer to the largest one. In particular, the maximal Lyapunov exponent (MLE) is most often used because it determines the overall predictability of the system. A positive MLE is usually taken as an indication that the system is chaotic (Lyapunov 1892).

To many people, the term $e^{\lambda t}$ is more familiar in a very similar form: $e^{-\lambda t}$ indicates exponential decay (Kleinrock 1987). More precisely, $N(t) = N_0 e^{-\lambda t}$ is the equation for exponential decay, where $N(t)$ is the quantity at time $t$ and $N_0$ is the initial quantity, that is, the quantity at time $t = 0$. In other words, it describes an exponential reduction of a measured quantity over time. An intuitive characteristic of exponential decay for many people is the time required for the decaying quantity to fall to one half of its initial value. This time is called the half-life and is often denoted by the symbol $t_{1/2}$. The half-life can be written as $t_{1/2} = \frac{\ln(2)}{\lambda}$. A positive Lyapunov exponent is therefore intuitively the same as negative exponential decay, that is, exponential growth. We can also compute $t_{1/2}$; it would then be defined as the time required for the exponentially growing quantity to grow to double its initial value. For example, in a time-based experiment, noise could add to a measurement at each time step, until it reaches the point where a Lyapunov exponent becomes positive.

With this intuition, let us now understand how to modify Definition 11.3 for general modeling. We remind ourselves that $|\delta Z|$ is a distance between two trajectories. It could, however, be any numerical quantity in time, and nothing prevents us from measuring the number of bits required to store it by applying the logarithm base 2. That is, $\log_2 |\delta Z(t)| \approx \log_2 \left( e^{\lambda t} |\delta Z_0| \right) \iff \log_2 |\delta Z(t)| \approx \log_2 e^{\lambda t} + \log_2 |\delta Z_0| \iff \log_2 |\delta Z(t)| \approx \lambda t \log_2 e + \log_2 |\delta Z_0|$, which can be algebraically reformulated to $\log_2 |\delta Z(t)| - \log_2 |\delta Z_0| \approx \lambda t \log_2 e$. The term $\log_2 e$ is a constant and can be ignored (for example, by burying it in the measurement of time). As per the explanation of Definition 11.3, $\lambda > 0$ indicates chaos. That is, conceptually, chaos emerges when:
$$\frac{\log_2 |\delta Z(t)| - \log_2 |\delta Z_0|}{t} > 0. \quad (11.1)$$
The unit of the left side of the inequality is $\frac{\text{bits}}{\text{timestep}}$. In other words, when the base-2 logarithm of the distance between the two trajectories changes by more than 0 bits per
(time) step, chaos eventually emerges. One can also see that if $\lambda < 0$, the system eventually converges, since the logarithm is negative for $|\delta Z(t)| < 1$. The system has converged once $\lambda = 0$. This allows us to conclude the following for the information theory of modeling.

Corollary 13 (Chaotic Experiment) An experiment is unpredictable (chaotic) and therefore ill-posed for modeling when the amount of change observed in the experiment exceeds the amount of memory used for recording the experiment.

Proof $\log_2 |\delta Z(t)|$ and $\log_2 |\delta Z(0)|$ are the numbers of bits of memory used to store the difference between two trajectories (in one dimension) at time step $t$ and at time step 0. The unit $\frac{\text{bits}}{\text{timestep}}$ indicates a change of bits per time step. As per Eq. 11.1, a change of more than 0 bits – that is, any growth of the bits required in the current step beyond the memory used in the original step – creates chaos. ∎

At first sight, Corollary 13 only seems to apply to time-based systems, but we will see that this is, practically speaking, not true: Memory overflow creates chaos.
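As a numerical illustration (a minimal sketch, not from the text): the logistic map $x_{t+1} = 4x_t(1-x_t)$ is a classic chaotic system, and its largest Lyapunov exponent can be estimated directly from Definition 11.3 by tracking two trajectories that start infinitesimally close and renormalizing their separation at each step:

import math

def logistic(x, r=4.0):
    # One step of the logistic map; r = 4 puts it in the chaotic regime
    return r * x * (1.0 - x)

x = 0.3
d0 = 1e-9                          # infinitesimal initial separation |dZ_0|
y = x + d0

log_sum, steps = 0.0, 1000
for _ in range(steps):
    x, y = logistic(x), logistic(y)
    d1 = abs(y - x)
    log_sum += math.log(d1 / d0)   # local stretching rate of the separation
    y = x + d0 * (y - x) / d1      # renormalize so the separation stays tiny

print(log_sum / steps)             # ~0.69 = ln 2 > 0: positive MLE, i.e., chaos

Dividing the estimate by $\ln 2$ converts it to bits per time step (here about 1 bit per step), well above the 0-bit threshold of Eq. 11.1.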
11.3.2 Example

In computer systems, integers are usually allocated a fixed number of bits, for example, 16 bits. Let us now assume we add two 16-bit integers $a, b$ into a new 16-bit integer $c$. The maximum result $c = a + b$ could be $c = 2^{16} - 1 + 2^{16} - 1 = 2^{17} - 2 = 131{,}070$, which is a 17-bit number. Therefore, $c$ cannot store the result, as it is only 16 bits wide. The maximum number $c$ can store is $2^{16} - 1 = 65{,}535$. The maximum error resulting from storing the result of $a + b$ in $c$ is therefore $e_{max} = 131{,}070 - 65{,}535 = 65{,}535$, which itself is a 16-bit number. In other words, the value of $e_{max}$ is as high as the result shown by $c$. Substituting into Eq. 11.1, one can see that $\frac{\log_2 c - \log_2 a}{1} = \log_2 131{,}070 - \log_2 (2^{16} - 1) > 0$ ($t = 1$ since there is only one step). To prevent such overflows, especially from accumulating over time, computer systems usually throw an exception that prohibits storing the result in $c$.

Taking another look at Eq. 11.1, one can ask: How is it that the number of bits required to store an experimental outcome (over time or not) can actually increase? One way this can happen is when one records too few experimental variables (dimensions) that influence the experiment, for example, because not all factors are known or there are just too many. If we remember that a binary digit is the same as a dimension (see Corollary 14), we can see that if we record a data table (see Definition 2.1) of dimensionality $D$ and do not capture all relevant experimental factors $x_i$, we also use too little memory to record the experiment. So we can expect chaotic outcomes as well.
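The overflow arithmetic above can be verified in a few lines (a sketch assuming saturating storage, as in the example):

import math

BITS = 16
MAX16 = 2**BITS - 1            # 65,535: largest value 16 bits can hold

a = b = MAX16
true_sum = a + b               # 131,070 -- a 17-bit number
c = min(true_sum, MAX16)       # saturating 16-bit store, as in the text
e_max = true_sum - c           # 65,535: error as large as the stored value

# Substituting into Eq. 11.1 with t = 1:
lam = math.log2(true_sum) - math.log2(c)
print(e_max, lam > 0)          # 65535 True -> the "experiment" is chaotic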
Example

A balanced, binary classification table could contain $D = 3$ input attributes for a set of instances. This results in $2^3 = 8$ possible outcomes. A fourth experimental factor $x_4$ that we assume has been forgotten would result in a total of $2^4 = 16$ outcomes, that is, twice as many outcomes as modeled using the original table. Even if all outcomes are modeled correctly based on the original $D = 3$ table, the unknown outcomes could reduce the accuracy from 100% to best guess (here 50%). However, since the fourth factor is unknown, the model will appear to be randomly incorrect (based on the value of the fourth column). That is, it predicts chaotically.

The other source of added information during an experiment is noise, which can be defined as follows:

Definition 11.4 (Noise) Observations irrelevant to the experimental outcome.

"Signal," in turn, is defined as the exact opposite of noise: observations relevant to the experimental outcome. In general, noise can be modeled, for example, as an error bar. Modeling with an error bar assumes that each point $\vec{x}_i$ is accurate $\pm$ some value induced by noise. That is, each point $\vec{x}_i$ is given as $\vec{x}_i \pm \vec{e}$, where $\vec{e}$ is a vector of error magnitudes in each dimension. Again, we can measure the number of bits used to model each coordinate of the points $\vec{x}_i$ as well as the error. We can see that the absolute upper limit for error-bar modeling is reached when the number of bits in the error equals the number of bits used to store the actual value of the points. In that case, the error is no longer distinguishable from the points, and modeling (either the points or the error) is pointless. Of course, adding $n$ bits of error to $n$ bits of signal can maximally result in $n + 1$ bits of an observation. That is, $\lambda$ will be positive, and we have to consider the observations chaotic, as the observation could be twice the value of the original signal. On second thought, noise itself is information induced into the observations from factors unknown or not considered. So it boils down to Corollary 13 again: not enough memory or too few variables considered/recorded. The connection between missing information and exponential error has been explored in depth by John Larry Kelly Jr. (1956), who even explained its applicability for monetary gain.

Unfortunately, it is unknown whether Lyapunov's explanation of chaos being caused by exponential growth is sufficient to define all chaos. However, it is often accepted that the universe would be deterministic (see Definition 4.6) if we could model it by taking all independent variables into account. We also know that this is impossible, as such a model is expected to be as complex as the universe itself (see also Sect. 7.2). The connection between physics and information is touched upon in more detail in Sect. B.5.

In summary, we can prevent chaos and significantly increase our chances of modeling accurately by making sure all experimental factors are considered and by reducing noise in the measurements. Once convinced that the problem at hand is well-posed, one can proceed with the next steps: tabularization, data validation, and numerization.
11.3.3 Forcing Well-Posedness

Sometimes data are clearly learnable by humans, and it is clear that all factors are considered, so the system is not chaotic, yet well-posedness (see Definitions 2.8 and 2.9) is not trivially achieved. That is, while similar rows in the table should result in similar outputs, defining a sensible "similarity" seems far-fetched. One example of such data is natural language. For example, the word "king" and the word "queen" are semantically similar because they describe the same role, except for one bit of semantic difference: the gender of the person executing the role. Syntactically, however, the words have different lengths and only one letter in common. Other semantically related words, such as "prince" or "duke," also cannot be trivially connected with a syntactic similarity metric.

The trick to address this issue was first successfully applied in an approach called word2vec (Mikolov et al. 2013). Word2vec is an unsupervised learning technique that maps words to high-dimensional vectors (so-called "embeddings") in such a way that, based on statistical co-occurrence, semantically similar words have similar syntactic vector representations. A word2vec model is trained on a large corpus of text by optimizing the embeddings so that words that appear in similar contexts have similar vector representations. The goal is for the embedding space to become continuous. By learning continuous word embeddings, word2vec is able to capture semantic similarities between words that are not obvious from their syntactic representations, essentially forcing well-posedness through a high-dimensional manifold. The success of word2vec has inspired the development of other advanced natural language processing techniques, such as GloVe (Pennington et al. 2014) and BERT (Devlin et al. 2018), which further improve the representation and understanding of natural language data. This ultimately led to the creation of the first Large Language Models (Brown et al. 2023; Lewis et al. 2019). Similarly, Gene2Vec (Asgari and Mofrad 2015) is an approach to represent gene sequences as continuous vectors, inspired by the word2vec method used in natural language processing. It aims to capture the underlying features and relationships between genes in a lower-dimensional space. By encoding genes as vectors, researchers can analyze and compare gene sequences more effectively, enabling various applications such as functional prediction, gene clustering, and gene set enrichment analysis.

The underlying mechanism of why this works and the limits of this approach are discussed in Sect. 16.4. Obviously, forcing a dataset into high dimensionality does not make the dataset contain more information. That is, Sect. 11.3 still applies.
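A minimal sketch of the word2vec idea, assuming the third-party gensim library and a toy corpus (real embeddings are trained on billions of tokens, so the similarities below are only illustrative and unstable):

from gensim.models import Word2Vec

# Toy corpus; in practice, a large text collection is required
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "duke", "serves", "the", "king"],
    ["the", "prince", "serves", "the", "queen"],
]

# Map each word to a continuous vector; words appearing in similar
# contexts end up with geometrically close vectors
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1,
                 sg=1, epochs=200)

print(model.wv.similarity("king", "queen"))
print(model.wv.most_similar("king", topn=2))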
11.4 Tabularization

Since Chap. 1, it has been assumed that data is in a table of the specific format defined in Definition 2.1, and the assertion is that all data can be brought into that format. This section elaborates on that with some practical considerations.
11.4.1 Table Data

In practice, data tables (see Definition 2.1) often arrive as comma-separated-value (CSV) files. CSV files are standardized in RFC 4180 (Shafranovich 2005). CSV files can contain a header or not, and RFC 4180 leaves the ambiguities resulting from that choice to be resolved by the user. This is why a system has to have the user specify whether a header is present. RFC 4180 CSV files also do not allow for the specification of a type for a column. However, for data validation (see Sect. 11.5), it can be useful to know whether a column contains integers, floating point values, categories, or random strings. A file format that extends RFC 4180 in that way is ARFF (Hall et al. 2009). At the very minimum, one needs to know which column contains the target function $f(\vec{x})$. This is often the rightmost column; however, this cannot always be assumed. In fact, some research fields, such as computational genetics, turn data tables by 90°. That is, each column contains an instance, and each row contains the attributes for an instance. After the format of a table and its geometry have been established, there are still some high-level considerations for data validation and for building a model. Some of these are discussed in the following sections.
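A loading sketch (assuming the pandas library; the file name is hypothetical) that makes both user decisions explicit:

import pandas as pd

def load_table(path, has_header=True, target_column=-1):
    # has_header and target_column must be supplied by the user, since
    # RFC 4180 CSV files carry neither type nor target information
    df = pd.read_csv(path, header=0 if has_header else None)
    target = df.columns[target_column]        # default: rightmost column
    y = df[target]
    X = df.drop(columns=[target])
    return X, y

# X, y = load_table("experiment.csv", has_header=True)  # hypothetical file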
11.4.2 Time-Series Data

Time-series data present a special case in machine learning applications because they do not obey the i.i.d. criterion: each sample depends on at least the previous sample rather than being independent. This section describes how to untangle those dependencies with the goal of automated modeling for classification or regression, for example, to classify utterances into phonemes or to recognize that a weather pattern is a high-category storm. Forecasting, that is, predicting the future based on past observations, is not part of this section, as it could easily warrant another book. Forecasting is extrapolation, that is, predicting values by assuming that existing trends will continue. However, that is not a well-posed problem in machine learning: By Definition 2.8, machine learning assumes similar inputs lead to similar outputs (which is also called interpolation). Many articles have confirmed empirically that windowing time-series data (as explained in this section) and then training a model is not very good at forecasting,
e.g., Lim and Zohren (2021). Alternative solutions exist, including using genetic algorithms to find a calculus-based formula (Schmidt and Lipson 2009).

In physics, time is defined as "what a clock reads" and is therefore considered a human-made, virtual dimension. That is, time does not exist on its own and, once defined, is dependent (relative) on the spatial dimensions (Einstein 1916). While there is nothing wrong with collecting time-based data, it is important to realize that, because of the virtuality of time, there are no general techniques for scaling or handling time. The three most important properties of time-based data are linear dependence, smoothness, and periodicity.

Two observations in time are dependent on each other such that the later sample is at least dependent on the previous sample. Since time can be arbitrarily defined, observations are usually collected at consistent intervals over a set period of time. The duration and interval length depend on the application. To model speech, samples are usually collected in millisecond intervals, while stock market predictions are usually modeled based on one- to several-minute intervals. Physics has applied calculus to model time-based problems for hundreds of years because physical properties such as momentum or acceleration are never observed to jump discretely in time. Smoothing therefore can help reduce noise. Furthermore, interpolating between coordinates of a point that moves in time is a very reasonable operation.

So how do we fit time-series data into the data table defined in Definition 2.1, which assumes that two samples are independent from each other? To answer this question, we look at the third property: periodicity. Let us start with an observation about how clocks and calendars work. They follow the spin of the Earth, the position of the Moon relative to the Earth, and also the revolution of the Earth around the Sun. That is, humans define time to follow cycles. A new cycle starts when time has seemingly reset to a previously reached state. For example, the Earth is at the same rotational angle toward the Sun again every 24 hours. Reaching the end of a cycle and beginning a new one are usually treated as a reset of the dependencies within the cycle. Whatever mathematical operation one had to do within the dependencies of a cycle would be redundant in the new cycle. Knowing these cycles makes time-series data highly predictable, to the point where machine learning is not required. After all, periodicity has been used even by very early astronomers to predict the constellations of the stars relative to the Earth. There are many methods to estimate possible cycles (periods) in time-based data, for example, orthogonal function transforms such as the Fourier transform (Stein and Shakarchi 2003) or the more practical discrete cosine transform (Ahmed et al. 1974).

If time-series data have no obvious periodicity, predictions become much harder. All we can rely on then is yet another idiom (see also Chap. 7): "time heals all wounds." This idiom applies to machine learning practice, as time even heals "unwanted" dependencies. That is, the more time passes by, the less dependent an observation seems to be on an initial observation a while ago. For example, it is not clear if minute-by-minute stock price prediction for the next hour still depends on the stock prices from a year ago. Even with the theoretical dependency, it is not clear how much the prediction would change by including that knowledge.
Similarly,
while there may be a dependency in a speech stream on words spoken several minutes ago, the classification of the current vocal tract output does not need to take this information into account to be effective.

In summary, the standard solution to handling the dependencies in time-series data is to focus only on local dependencies. That is, one defines a reasonable time interval; everything before it is considered independent. This method is called "windowing." In windowing, a fixed time interval (window size or frame size) is chosen upfront, and the samples recorded in each window become a new input row for the data table. For example, for speech data, a frame size of 10 ms is a typical time interval. At a 16 kHz sampling rate, all 160 samples within 10 ms are considered dependent, thus creating a table of $n$ rows and 160 columns $x_1, \ldots, x_{160}$ (plus the target column $f(\vec{x})$). When windowing as explained results in artifacts, a window function (Harris 1978) can be used instead to smooth out the edges of the windows. A window function is usually zero-valued outside of the chosen interval, symmetric around the middle of the interval, near a maximum in the middle, and tapering away from the middle. A typical example is the Hamming window (Hamming 1989). While windowing forces time-based data into a data table as defined in Definition 2.1, the (residual) dependency of the samples can also be useful. For example, for speech recognition, a phonetic classifier (working under the i.i.d. assumption) is often combined with a Hidden Markov Model (using a language model that describes the dependencies between phonemes) to increase the accuracy of the phonetic classifier (Rabiner 1989).
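A minimal windowing sketch (assuming NumPy and a 1-D signal at a 16 kHz sampling rate):

import numpy as np

def window_signal(signal, frame_size=160, hop=160):
    # Cut a 1-D signal into fixed-size frames: one table row per frame.
    # frame_size=160 corresponds to 10 ms at a 16 kHz sampling rate.
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_size)   # smooth the frame edges

rng = np.random.default_rng(0)
audio = rng.standard_normal(16_000)          # one second of synthetic audio
table = window_signal(audio)                 # shape: (100, 160) -> 100 rows
print(table.shape)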
11.4.3 Natural Language and Other Varying-Dependency Data

Varying-dependency data are data where the dependency is not necessarily linear (that is, a previous sample could depend on a subsequent sample) and the dependency cycle is not necessarily of constant length. The most prominent example is natural language.
Stop-Symbol Cycles

In natural language and in other varying-dependency data, there is usually a mechanism to mark the end of a cycle. For example, in natural language, words are separated by spaces, sentences are separated by punctuation, paragraphs by new lines, etc. Interestingly enough, this property alone makes the resulting cycle lengths highly predictable. This has been observed by the linguist George Kingsley Zipf (Zipf 1935) and then explained by Benoit Mandelbrot (compare Sect. 4.3). It turns out that if a stop symbol is defined to mark the end of a word, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. That is, the frequency
Fig. 11.1 A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias in a log–log scale. An approximate Zipf distribution is shown in every single one of them (Source: SergioJimenez – CC BY-SA 4.0, Wikimedia Commons)
of the n-th most frequent word is $f_n = \frac{1}{n} \cdot f_1$, where $f_1$ is the frequency of the most frequent word. So if we plot frequency over rank on a log–log scale, we see a line with slope $-1$. That is, $\frac{\log_b \text{frequency change}}{\log_b \text{rank}} \approx -1$. Figure 11.1 illustrates this.

The literature often categorizes this distribution, which is named after George Zipf, as an unexplained phenomenon. However, Mandelbrot (Mandelbrot 1953) showed a simple mathematical solution to this observation that is not only visible in all human languages but also in all other problems where, information-theoretically, a dedicated stop symbol is used to indicate the end of a cycle.

Theorem 11.1 (Stop-Symbol Theorem (Mandelbrot 1953; Perline 2018)) Given an alphabet $\Sigma$ of cardinality $|\Sigma| = b$ with equal probabilities $p(c_1 \in \Sigma) = p(c_2 \in \Sigma)$ for any two characters $c_1, c_2$, and one symbol $s \in \Sigma$ chosen to be a stop symbol. The frequency of the words $w \in \Sigma^l$ of length $l$ between the stop symbols $s$ is distributed according to a Zipf distribution. That is, $\frac{\log_b \text{frequency change}}{\log_b \text{rank}} \approx -1$.
Proof By definition, the probability for any character $c$ appearing is $p(c \in \Sigma) = \frac{1}{b}$. That is, the probability of word length 1 is $p_1 = \frac{1}{b}$. This probability includes $s$, so we are counting the space as part of the word. The probability for word length 2 is the probability of any character but the stop symbol, and then the stop symbol: $p_2 = \frac{b-1}{b} \cdot \frac{1}{b}$. A three-letter word evolves with the probability of any character but
the stop symbol, times the probability of another symbol but the stop symbol, and then the stop symbol: $p_3 = \frac{b-1}{b} \cdot \frac{b-1}{b} \cdot \frac{1}{b}$. Consequently, an $l$-letter word evolves out of $l - 1$ non-stop symbols and then the stop symbol: $p_l = \left(\frac{b-1}{b}\right)^{l-1} \cdot \frac{1}{b}$. If we now rank order the words, it is clear that there are $b^l$ words of length $l$, the most frequent (based on $p_l$) being words of length $l = 1$, followed by $l = 2$, etc., so the log-scaled rank order is therefore $\log_b \text{rank} = \log_b b^l = l$. The frequency of words with length $l$ is $p_l$. One can immediately see that $f_1 = p_1 = \frac{1}{b}$ and therefore $f_l = p_l = p_1 \cdot \left(\frac{b-1}{b}\right)^{l-1}$. For a log–log plot, we take the logarithm of the declining frequency portion $\left(\frac{b-1}{b}\right)^{l-1}$. This results in $\log_b \left(\frac{b-1}{b}\right)^{l-1} = (l-1) \cdot \log_b \frac{b-1}{b}$. The term $\log_b \frac{b-1}{b}$ is a negative constant independent of the word length, so we assign $c := \log_b \frac{b-1}{b}$. That is, the declining frequency portion in log space becomes $(l-1) \cdot c$. That is, on a log-scaled frequency axis, words of rank $l$ appear with spacing $(l-1) \cdot c$. Since $c < 0$, the term is negative (i.e., declining). That is, $\frac{(l-1) \cdot c}{l} \approx \frac{l \cdot c}{l} \approx \frac{-l}{l} \approx -1$. ∎

Fig. 11.2 A plot of the outcome of word frequencies of "random typing" by B. Mandelbrot. The logarithmic y-axis shows the word frequency $p_l$ (compare Theorem 11.1), and the logarithmic x-axis shows the rank. Consistent with the approximate result in Theorem 11.1, the resulting plot through an equal-percentile rank results in a line with slope $s = -1$ (Source: "The Zipf Mystery" by VSauce (Michael Stevens) on YouTube, 2015)
Figure 11.2 shows a plot for $b = 27$. A cleaner but harder-to-reproduce version of the proof can be found in Perline (2018). The article also shows that the result does not change when the input is not uniformly distributed.

This is interesting for automatic modeling for two reasons. First, an i.i.d. data table (see Definition 2.1) can be created by complying with the stop symbols. For example, entire words or entire sentences can be defined as independent instances and become rows in the table. However, words and sentences are usually of varying length and therefore need to be normalized to constant length to fit the definition of the data table with $D$ cells per row. This can be done using
one of many methods in the literature (Salton et al. 1975; Singhal et al. 1996). Of course, the resulting table cells must be atomic (recall Definition 2.2). Second, the Zipf distribution implies a low memory-equivalent capacity requirement (see Definition 5.5) for the model.

The Zipf distribution is the discrete version of the more general Pareto distribution (Pareto 1896), which is used to derive the Pareto principle or "80–20 rule": 80% of outcomes are due to 20% of causes. The log–log plot of the Pareto distribution (just like current interpretations of the Zipf distribution) can have a different slope $|s| \geq 1$, defined as $s = -\frac{\log_b \text{frequency change}}{\log_b \text{rank}}$. The Pareto principle is exactly true for a Pareto distribution of slope $|s| = \log_4 5 \approx 1.16$. As the reader might have realized by now, this formula is extremely similar to the fractal dimension from Definition 4.15. To understand the absolute value of the slope as a fractal dimension, we need to conceptually define what the self-similar pieces are and what the scaling factor is. The self-similar pieces are the segments (words, sentences), and the scaling factor is the "language proficiency," defined as the lowest-rank segment that the model still needs to know. This closes the loop to Mandelbrot again. That is, $|s| = D_{fractal}$.

For the English language, one can observe that approximately 18% of the words make up more than 80% of the language. That is, with about 1000 words, one can get to 80% of the language (Hirsh and Nation 1992; Hwang 1989). Knowing about 3000 words lets one understand about 95% of the language (Hazenberg and Hulstijn 1996), and 15,851 words cover about 97.5%. The largest dictionary (Indonesian) has about 54,000 words (Sutarsyah et al. 1994). This is used by foreign-language learning curricula to speed up the training process: The words and phrases used in common situations are taught first. Similarly, machine learning models do not need the full vocabulary of the English language to be successful. In fact, early experiments, such as Eliza (Weizenbaum 1966), showed that machines can appear surprisingly intelligent even knowing just about a dozen sentences.

As a side note: A common criticism of Mandelbrot's explanation of the Zipf distribution is that his argument does not explain why the Zipf distribution can also be empirically observed when counting the number of citations of academic articles, the population counts of cities, or the use of the names of the planets. Also, the most frequent word in English, "the," is not the shortest word. So for that, one needs an additional explanation. Zipf himself already speculated that the physical principle of least effort may be at play. This can be seen by incrementally building a least-effort decision tree as an isomorphic model for how information may be arranged in a brain. Language is evolutionary, and humans are able to learn new words at any point. This means the decision tree has to be able to grow without bounds. Rebuilding an optimal Huffman tree (Huffman 1952) that is updated every time a new word is introduced would definitely not be least effort. However, growing a binary decision tree linearly (as shown in Fig. 11.3) must be considered least effort.¹ Doing so creates path lengths that correspond to the Zipf ranking.
¹ Refer to the discussion in Sect. 7.2.2 on why a binary tree is less effort than a b-ary tree with b > 2.
Fig. 11.3 Mental model for constant-rate communication with a Zipf distribution. The brain could use a data structure isomorphic to a minimal-effort decision tree. For the decision tree to be extendable with minimal effort, it grows linearly. By putting the most frequently occurring language element at the earliest branch, the second-most frequent at the second branch, and so on, the expected access (path) length is held constant
That is, a word with rank $n$ has path length $n$. Let us assume this tree to be isomorphic to neural paths in our brain, with a constant travel time per node. This way, the expected latency of reaching a child node stays constant, as visiting a path of length $r$ only occurs with frequency $\frac{1}{r}$. As discussed by Piantadosi (2014), this will also dissipate the information load on the listener and explains why speaking rates are, in general, not dependent on the frequency rank of the vocabulary uttered. The standard assumption would be that less-frequently-used words take the brain longer to find. Even this system may not have been opted into voluntarily by biology, as thought experiments like the so-called Chinese Restaurant Process (Pitman 1996) show. This goes beyond this book, however. In general, whenever variable-length cycles are indicated by a stop symbol or a human is asked to choose from a set of options, we can assume a Zipf-like distribution.
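The "random typing" experiment behind Theorem 11.1 is easy to reproduce (a sketch; here with $b = 10$, the space as stop symbol, and the slope measured by a least-squares fit in log–log space; the exact printed value fluctuates with the seed):

import math
import random
from collections import Counter

random.seed(0)
alphabet = "abcdefghi "                 # b = 10; the space is the stop symbol

# "Random typing": draw characters uniformly, then split on the stop symbol
text = "".join(random.choices(alphabet, k=2_000_000))
freqs = [n for _, n in Counter(w for w in text.split(" ") if w).most_common()]

# Least-squares slope of log(frequency) vs. log(rank) over the top 800 ranks
xs = [math.log(r) for r in range(1, 801)]
ys = [math.log(freqs[r - 1]) for r in range(1, 801)]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))                  # close to -1, as Theorem 11.1 predicts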
Non-linear Dependencies

The good news about this property is that the best way to deal with non-linear dependencies seems to be to let the machine learner figure out what the dependencies are, once the data have been appropriately windowed. This can be done using approaches such as Attention Networks (see Sect. 9.4.2).
11.4.4 Perceptual Data

"An image says more than a thousand words" is another common idiom. While that can be true, for our purposes the idiom makes clear that image data, just like audio and video data, have a higher information density than text data. In fact, text can be seen
as a lossy compression of speech: Text abstracts away the speaker-specific sounds and, in some character systems, also the pronunciation specifics (see also the discussion in Sect. 7.2.2). So the largest challenge with perceptual data is noise reduction. In fact, it has been reported that for a typical image object recognition task, there is on average only about 1 bit of information per pixel (which is defined by 24 bits) that is relevant for the pattern to be recognized (Friedland et al. 2020).

For many decades, research focused on how to compress sound and images by understanding which information of an image or a sound would actually be perceived. For example, Fraunhofer's MP3 standard was based on a psychoacoustic model (Zwicker and Fastl 1999) that provided very clear rules on which frequencies would dominate and which ones could be deleted to save bandwidth. Similarly, color science (Wyszecki and Stiles 1982), psychovisual models, and psychophysics were used to define image formats such as JPEG (Wallace 1991) and video formats such as MPEG (Le Gall 1991). In recent times, the tool of choice for compressing acoustic and image data seems to be convolutional neural networks (CNNs); see also Sect. 9.1.1. CNNs learn a specific lossy compression (noise filter) given a certain classification task. Tabularization of image data then becomes simple: Each image is placed in a row of the table, often reduced to small resolutions, like 1024 pixels. Using conventional compression techniques upfront, one can save training time, as one can reduce the number of convolutional layers (Friedland et al. 2020). The tabularization of sound and video data is further subject to the considerations discussed in Sects. 11.4.2 and 11.4.5. To get a better quantitative idea of how much noise needs to be eliminated, one can try to measure the signal-to-noise ratio, as described in the following.
Estimating the Signal-to-Noise Ratio

Measuring the signal-to-noise ratio (SNR) is hard because, to do so, one needs to determine which parts of the inputs are relevant for the experimental outcomes and which parts are not. The only way to do that precisely is to use an accurate and general model. Of course, if we had a model, we would not care about the signal-to-noise ratio anymore because the model would provide us with the correct predictions. In other words, this is a catch-22. Going back to the discussion in Sect. 11.3, we learned that we can model noise with error bars. So our goal of measuring the signal-to-noise ratio would be to measure the size of the error bars and relate it to the size of the signal. That is: $SNR = \frac{\text{signal}}{\text{noise}}$, where both the signal and the noise are measured in bits (sometimes also in dB, see also Definition 6.6). The information content of the signal is determined by the minimum number of bits that need to be known in order to represent the function implied by the training table (see also Sect. 4.4). Every other bit of information is either redundant or noise. We also know from Chap. 5 that we can always memorize the training data, including noise and signal, to
100% accuracy (by evaluating on the training data) if the trained model is at memory-equivalent capacity (MEC). This implies the following process:

1. Memorize the training data with a model at MEC (by evaluating on the training data). We call this capacity $MEC_0$.
2. Topologically reduce the maximum MEC of the model by at least 1 bit and retrain.
3. Repeat step 2 until the accuracy cannot reach 100% anymore.

The MEC for step 1 can be estimated using methods like the ones described in Sects. 4.4 and 8.2. We call the lowest MEC that still achieved 100% memorization $MEC_{min}$. We can take the model at $MEC_{min}$ as an estimate of the minimum number of bits that need to be known in order to represent the function implied by the training table. The remaining bits are noise. The signal-to-noise ratio is therefore

$$SNR = \frac{MEC_{min}}{MEC_0 - MEC_{min}}.$$

Note that this process can be time intensive, as it involves many iterations of training a model from scratch. Furthermore, it is recommended to use simple models that do not use artificial regularization, which makes it harder to calculate the MEC. For a further discussion on finding models of the data that exactly represent the signal, see also Chap. 12.
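A rough sketch of this process (using the number of decision tree leaves as a crude stand-in for MEC, which is a simplification; see Chap. 10 for proper MEC estimation):

from sklearn.tree import DecisionTreeClassifier

def estimate_snr(X, y):
    # Step 1: memorize the training data with an unconstrained model
    full = DecisionTreeClassifier().fit(X, y)
    mec0 = full.get_n_leaves()

    # Steps 2-3: shrink capacity until 100% training accuracy is lost.
    # Note: this retrains once per capacity level and can be slow.
    mec_min = mec0
    for leaves in range(mec0, 1, -1):
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves).fit(X, y)
        if tree.score(X, y) < 1.0:
            break
        mec_min = leaves

    return mec_min / (mec0 - mec_min) if mec0 > mec_min else float("inf")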
11.4.5 Multimodal Data

Multimodal data are data that we know have been captured using several independent sensory sources. For example, multimedia data (Friedland and Jain 2013) are multimodal, as they involve, for example, a combination of audio and moving images (video) or a combination of images and text.

Definition 11.5 (Multimodal Data) Data are called multimodal when they have been obtained using more than one source of multidimensional inputs.

Note that if each independent sensor signal fits into one column of the data table, then the data are multidimensional but not considered multimodal. Otherwise, any data table with more than one input column would be considered multimodal. In statistics, a multimodal distribution is a probability distribution with more than one mode. That definition is too narrow for the scope of machine learning for two reasons. First, given only data, we do not know how many modes the distribution that models the function implied by the data actually has. Second, we have to make decisions about handling the data before we model the data.

What makes multimodal data special is that they can be, for example, a mix of time-based data and i.i.d. data. Time synchronization may also be an issue. At a minimum, data from different sources will have different information densities, for example, due to varying signal-to-noise ratios of the different sensors. So one has to decide whether it is better to handle all data in one table or to model the data from separate sources individually and then find a way to combine
the decisions. The handling of multimodal data goes beyond the scope of this book; the further reading section contains recommendations.
11.5 Data Validation

The purpose of data validation is to reliably prepare data for supervised machine learning. In order to reduce bias, the goal is always to try to leave as much as possible to the machine learning algorithms themselves. However, some data glitches are truly easy to detect and represent an unnecessary burden on the training complexity.
11.5.1 Hard Conditions

Hard conditions are clear-cut, objective data glitches that have to be handled.
Rows of Different Dimensionality

Data table rows that contain more cells than defined in the header of the table must be ignored, as they make the machine learning task undefined.
Missing Input Values

Sometimes data tables contain missing or invalid values, such as NA, NaN, or ±Inf. All table cells within an input column that have a missing or invalid value should be recoded into a single new value that does not otherwise appear in the column. This allows the model to learn from the fact that values were missing. Furthermore, it preserves the i.i.d. property between rows. The practice of calculating the average, the median, or any other statistical value between rows could lead to a violation of the i.i.d. assumption and therefore potentially introduce bias. In time-based data, however, using the average between rows can sometimes be advantageous, as there is an assumption of smoothness. Calculating the average is also difficult when the column is otherwise binary or contains strings.
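A recoding sketch (assuming pandas; the sentinel values are arbitrary choices, as long as they do not otherwise occur in the column):

import numpy as np
import pandas as pd

def recode_missing(df):
    # Replace missing/invalid cells with one new per-column sentinel value
    df = df.replace([np.inf, -np.inf], np.nan)        # treat +-Inf as missing
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                sentinel = df[col].max() + 1          # value not in the column
            else:
                sentinel = "__MISSING__"              # unused string token
            df[col] = df[col].fillna(sentinel)
    return df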
Missing Target Values

Missing or invalid values in the target column are unacceptable. We assume the annotation to be ground truth. Therefore, there is no other way than to treat a row with a missing or invalid annotation as non-existent.
Less Than Two Classes

If a problem has fewer than two classes, machine learning is not necessary.
Database Key/Maximum Entropy Column

One cannot infer anything from data table columns that have one unique value per row. These columns, whether numeric or string (categorical), appear uniformly random to any mathematical model and thus can be safely ignored.
Constant Values

Data table columns that contain a constant value do not have any information (see Chap. 4). They can be safely ignored.
Redundancy

A very simple but potentially important test is to check that each row is unique. Repeating rows usually only contribute to an implicit class imbalance (see also the discussion in Sect. 11.7). Since they do not add any information, they can be safely ignored.
Contradictions

A contradiction in the data occurs when two identical inputs are labeled with two different outcomes. Contradictions make the machine learning task ill-defined, as the output is no longer a function of the input (see also Definition 2.3). Contradictions are a result of wrongful annotation and therefore cannot be handled automatically but require resolution by a human.
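Both of the last two checks are straightforward on a table (a sketch assuming pandas, with the target in a hypothetical column named "y"):

import pandas as pd

def validate(df, target="y"):
    # Drop duplicate rows; flag contradictions for human resolution
    inputs = [c for c in df.columns if c != target]

    df = df.drop_duplicates()                  # redundancy: safe to ignore

    # Contradiction: identical inputs mapped to more than one outcome
    n_labels = df.groupby(inputs)[target].nunique()
    contradictions = n_labels[n_labels > 1]
    if len(contradictions) > 0:
        raise ValueError(f"{len(contradictions)} contradictory input rows "
                         "need human resolution")
    return df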
11.5.2 Soft Conditions

Soft conditions in data validation should create warnings and offer the user the possibility to provide a decision. Sometimes, what appears to be a data glitch may be an exception or outlier that creates an insight. So one has to be careful with taking action on these data conditions. The conditions include, but are not limited to, the following.
High-Entropy Target Column

If there are very few rows per target class, the problem is probably better treated as a regression problem. In fact, many automatic machine learning systems will use this criterion to suggest regression (rather than classification) as a task.
High/Low-Entropy Input Columns

If an input column has nearly one unique value per row, it is often not meaningful for inference. The same holds if a column is nearly constant (that is, almost all values in the column are identical).
Out-of-Range Number Columns

Sometimes it is easy to detect out-of-range values, for example, when a binary column contains one other value. This might suggest a data collection mistake. Note, though, that out-of-range values can often not be detected without a model, which creates a catch-22.
11.6 Numerization

Most machine learning algorithms can only deal with numeric inputs. That is, strings, categories, dates, and other non-numeric values in the data table (see Definition 2.1) need to be converted into numeric values. Several algorithms in this book also assume only numeric values (for example, Algorithms 12 and 11). In other words, all values of the table need to be converted into numeric values. We call this process numerization.

Numerization needs to be done in a way that preserves the original structure of the problem. This can seem counter-intuitive at first, as tables containing dates, categories, and strings seemingly cannot be converted into numbers without changing the structure of the problem. However, we need to remember that a computer only understands binary numbers, and every string, date, or category is converted into a bit sequence internally. This leads us to the question of what a structure-preserving numerization is. Since our goal is to build a mathematical model, the answer will be found there.

In mathematics, an isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping. In fact, the word isomorphism is derived from the Ancient Greek: isos "equal" and morphe "form" or "shape." The interest in isomorphisms lies in the fact that two isomorphic objects have the same properties (excluding further information such as additional structure or names of objects). Isomorphisms are 1:1 mappings and are
Algorithm 9 An example of an algorithm for state-less numerization of table data

Require: table: array of length n containing d-dimensional vectors x⃗ and the f(x⃗) column
function NUMERIZE(table)
    newTable ← []
    for all rows do
        for all columns do
            try:
                newTable[row][column] ← (integer) table[row][column]
            except (not an integer):
                newTable[row][column] ← (float) table[row][column]
            except (not a float):
                newTable[row][column] ← CRC32(table[row][column])
            end except
        end for
    end for
    return newTable
end function
also discussed in Sect. 7.2 as an essential method of structure preservation, as their main attribute is reversibility. To create a column-wise isomorphism for the data – from whatever content into numbers – all we have to do is preserve the number of unique values per column and their positions in each row. This can be done as follows: Read each cell of the data table and try to convert it into a floating point number. If unsuccessful, treat it as a string. If successful, try to convert it into an integer. If that succeeds, keep the integer. If it does not, but the floating point conversion succeeded, keep the floating point number. If the cell is treated as a string, build a dictionary and replace the string either with a new dictionary index or, if the string was already part of the dictionary, with the lookup index of the string. Treat dates, categories, and any other non-numeric values as strings.

Building up various lookup tables can be difficult and is only necessary to show that reversibility is preserved. For practical reasons, it may be much faster to replace strings with a hash value of the string (which is identical for two identical strings), for example, by using the CRC32 algorithm (Castagnoli et al. 1992). While this method is faster, it is not an isomorphism. However, it does preserve the information content of each column of atomic cells (that is, the entropy of each column). Algorithm 9 shows this simplified method, which also has the advantage of not requiring any state to be saved. This means that the same numerization can be applied to the validation set and any test set without having to store the state of the preprocessing. The numerized table now only contains floating point numbers and/or integers and is ready to be used as input for the further algorithms that are able to treat the input as a mathematical matrix. Some machine learning algorithms only work with floating point numbers, and the integer typecasting step can then be skipped. Sometimes it can be useful to normalize the numbers in each column to be between 0 and 1.
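A runnable version of Algorithm 9 might look as follows (a sketch using Python's built-in zlib.crc32 for the string hashing):

import zlib

def numerize(table):
    # State-less numerization: ints stay ints, floats stay floats,
    # everything else (strings, dates, categories) becomes a CRC32 hash
    new_table = []
    for row in table:
        new_row = []
        for cell in row:
            try:
                new_row.append(int(cell))
            except (ValueError, TypeError):
                try:
                    new_row.append(float(cell))
                except (ValueError, TypeError):
                    new_row.append(zlib.crc32(str(cell).encode("utf-8")))
        new_table.append(new_row)
    return new_table

print(numerize([["3", "2.5", "red"], ["7", "1.0", "blue"]]))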
11.7 Imbalanced Data

For a classification problem, the black-box machine learning process assumes balanced experimental outcomes in the training data (see Chap. 3). This ensures that each independent event has the same probability and thus predicting it correctly yields the same information for each instance. In some cases, however, collecting enough data from a certain class can be difficult, leading the outcome classes to be imbalanced. This results in a predicted sample not yielding the same amount of information, thereby "disadvantaging" the prediction of the minority class(es). This is a concern because, in practice, the machine learning algorithm is presented with instances from the majority class most of the time while, at the same time, predicting the majority class correctly is trivial: A finite state machine with a single arrow to the majority class would yield best-guess accuracy. There are different proposals in the literature that deal with class imbalance (He and Garcia 2009; Japkowicz and Stephen 2002; Krawczyk 2016). Since information cannot be created by data processing (see Sect. 7.1.1), many approaches use assumptions to try to augment the training data with synthetic samples. Such a heuristic, however, can lead to bias (see Sect. 13.3) or, in the worst case, to data that are completely contradictory to the data that would have been collected. Instead, this section presents a way of specializing the accuracy metric to handle the imbalanced-classes case using information theory. The black-box machine learning process calculates accuracy as defined in Definition 2.4:

$$A_{balance} = \frac{k}{n} \times 100\,[\%], \quad (11.2)$$
where $k$ is the number of correctly predicted instances and $n$ is the number of total instances. Just like predicting the outcome of an unfair coin toss (see examples in Chap. 4), predicting an instance with higher probability is less surprising and therefore yields less information. Equation 2.4 can therefore be specialized for this in the following way:

$$A = \frac{-\sum_{i=1}^{c} k_i \log_2 p_i}{n \cdot H} \times 100\,[\%], \quad (11.3)$$
where $c$ is the number of classes, $k_i$ is the number of correctly classified instances in class $i$, $p_i$ is the probability of the class being $i$, $n$ is the number of total instances, and $H$ is the Shannon entropy (see Definition 4.11):

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i\,[\text{bits}], \quad (11.4)$$
where $p_i$ is the probability of the outcome being class $i$ and $c$ is the number of classes. The denominator of Eq. 11.3 has been used in machine learning for a while under the term cross-entropy loss (Goodfellow et al. 2016b). Let us now show that (1) Eq. 11.3 is a true specialization of Eq. 11.2, as $A$ generalizes to $A_{balance}$ when the classes are balanced, and (2) $A$ is the only sensible specialization of $A_{balance}$.

Theorem 11.2 $A_{balance}$ is a generalization of $A$.

Proof In the case of balanced classes, $H$ becomes $H = -\log_2 P$ (see Eq. 4.3), and we can set $-\log_2 p_i := -\log_2 P = H$ (since all $p_i$ are equal). As a result, Eq. 11.3 becomes

$$A = \frac{-\sum_{i=1}^{c} k_i \log_2 p_i}{n \cdot H} \times 100 = \frac{H \cdot \sum_{i=1}^{c} k_i}{H \cdot n} \times 100 = \frac{\sum_{i=1}^{c} k_i}{n} \times 100.$$

It is easy to see that $\sum_{i=1}^{c} k_i = k$ ($k$ from Eq. 2.4). Thus:

$$A = \frac{\sum_{i=1}^{c} k_i}{n} \times 100 = \frac{k}{n} \times 100 = A_{balance}. \qquad ∎$$

The units of the fraction in Eq. 11.3 ($A$) are $\frac{\text{bits}}{\text{bits}}$, instead of $\frac{\text{instances}}{\text{instances}}$ as in Eq. 2.4 ($A_{balance}$), but in both equations the unit cancels out, and therefore accuracy can be measured in % in both cases.

Theorem 11.3 $A$ is the only sensible specialization of $A_{balance}$.

Proof Let us assume we have the freedom to define any reward function $I$ that will increase the accuracy more when instances from the minority class(es) are classified correctly vs. instances from the majority class(es). $A$ would be defined by including $I$ on a per-instance basis. $I$ would be defined in terms of an event $i$ (a mapping from input to class label) with probability $p_i$. In order to comply with common sense and the i.i.d. assumption, $I$ would have to have the following properties:

1. $I(p) \geq 0$: The reward function should never be negative, since accuracy cannot be negative.
2. $I(1) = 0$: If there is only one class in the entire problem, the reward for learning that class is 0.
3. $I(p)$ is monotonically decreasing in $p$: an increase in the probability of the outcome of a specific class decreases the accuracy reward from an observed event, and vice versa. That is, we want to build a reward function that reduces the importance of the majority classes.
4. $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$: The reward from independent events is the sum of the rewards from each event. Since we assume i.i.d., each classification event is independent and thus, according to Bayes' rule, the joint probability of two events is multiplicative.

These four axioms are equivalent to Shannon's original formulation of the information axioms, and it can be shown that the logarithm is the only reward function that satisfies all four constraints (see also Chakrabati 2005 and the discussion in Sect. B.4). $A$ is therefore the only specialization of $A_{balance}$ that satisfies common sense and does not contradict the i.i.d. assumption. ∎
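A direct implementation of Eq. 11.3 (a sketch; the class probabilities are estimated from the label frequencies themselves):

import math
from collections import Counter

def information_weighted_accuracy(y_true, y_pred):
    # Eq. 11.3: each correct prediction of class i earns -log2(p_i) bits,
    # normalized by n * H (Eq. 11.4). Requires at least two classes.
    n = len(y_true)
    p = {c: k / n for c, k in Counter(y_true).items()}     # class priors
    H = -sum(pc * math.log2(pc) for pc in p.values())      # Shannon entropy
    bits = sum(-math.log2(p[t]) for t, y in zip(y_true, y_pred) if t == y)
    return bits / (n * H) * 100

# Always predicting the majority class scores high on A_balance (90%)
# but low on the information-weighted accuracy A:
y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 100
print(information_weighted_accuracy(y_true, y_pred))       # about 29%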
11.7.1 Extension Beyond Simple Accuracy

Note that $A$, just like $A_{balance}$, can be extended to be used with other error metrics (compare also Sect. 3.4). The same breakdown of accuracy into true positives, false negatives, etc., as defined in Eq. 3.21, can be done using $A$ by multiplying each instance by its information content in bits, as shown in Eq. 11.3.
11.8 Exercises

1. Sometimes, an already collected set of experimental observations is knowingly biased. For example, historical records have societal biases against minorities. How can we prevent the model from being trained with that bias?
2. Devise a metric for inter-annotator agreement. Assume a maximally disqualified set of annotators: What happens to your metric as you add many more annotators?
3. Prove that a machine learner can never have better accuracy than given by the annotation (hint: use methods from Chap. 7).
4. Explain why linguists use a Zipf distribution as a criterion that enough material has been collected about a language for it to be analyzed.
5. Calculate the fractal dimension of English.
6. Assume a set of 8 symbols with a Zipf distribution. Apply the Huffman-tree algorithm to generate an optimal prefix code.
7. Define sensitivity and specificity using the equation in Sect. 11.7.1 for A.
11.9 Further Reading

Further reading for this chapter expands on the surprisingly deep concepts that can evolve from the simple question of how to prepare data.
• N. Bostrom: "Anthropic Bias: Observation Selection Effects in Science and Philosophy". Routledge, 2013.
• Erik M. Bollt: "Information Theory in Dynamical Systems", in "Dynamical Systems", Chapter 9, 2018. https://webspace.clarkson.edu/~ebollt/Classes/MA563Sp18DynamicalSystems/Ch9Bollt.pdf
• Albert Einstein: "Relativity: The Special and General Theory". Prabhat Prakashan, 1948.
Chapter 12
Measuring Data Sufficiency
The problem of determining data sufficiency is a recurring issue. An often quoted response to the problem is: "There is no data like more data" (Pieraccini 2012). Indeed, increasing the sample size is a guaranteed way to increase the probability of arriving at a good model. However, data acquisition can be costly, the task can be time sensitive, or there may simply be no more data available. For example, during the COVID-19 pandemic, there was a discussion between one of the vaccine providers and the US government about whether the sample size of tested subjects was large enough to warrant approval.1 Here the task was time sensitive, data acquisition was costly, and, furthermore, the stakes were high. This chapter discusses a method to estimate data sufficiency based on the information metrics introduced earlier in this book. The method clearly indicates when there is enough data and when there are not enough samples; only in the region in between can the result be ambiguous. Most importantly, it works without building a model.
12.1 Dispelling a Myth

As we transition into the discussion of data sufficiency, it is crucial to note that such a notion indeed exists. From the conversation in Sect. 4.1, our understanding is that if the data is fraught with uncertainty, a potentially infinite number of samples may be necessary. Consequently, a popular adage in data science has emerged: "There is no data like more data." Let us consider a hypothetical scenario to further elaborate on this point. We are presented with a sequence and asked to predict the next number: 2, 4, 6, 8, 10,
1 “FDA advisers debate standards on a coronavirus vaccine for young children,” Washington Post, June 10, 2021.
x. Naturally, many of us would instinctively suggest $x = 12$, which is indeed the expected response. Such exercises are typical in IQ tests, where the ability to quickly decipher a pattern or rule from a sequence is evaluated (Binet and Simon 1904). This ability, as we have extensively discussed in Chap. 6, is central to the concept of generalization. Once we recognize the pattern, we can extend our predictions beyond the provided data. For instance, when asked about the 50th number in the sequence, we could confidently respond 100. However, if we were presented with a sequence such as 6, 5, 1, 3, x, predicting the next number would prove significantly more challenging. This sequence was purposely crafted to be perplexing, as it represents the last four digits of the author's phone number. Phone numbers are typically random, devoid of any discernible pattern, rendering predictions as accurate as mere guesswork. Nonetheless, an intriguing point to consider here is that providing additional data points would not necessarily aid in finding the rule in either sequence. Adding more numbers to the first sequence would not make the pattern any more evident. On the contrary, it might complicate the process by requiring additional cognitive load. In the case of the phone number sequence, supplying more data would not help either, given the inherent randomness. Therefore, we could propose a corollary: "There is no data like enough data to discover the rule, and outliers validate the rule." Although additional data may not help to uncover the rule, it may bolster confidence in the model derived from the sequence. For instance, in the first sequence, our pattern could potentially be disrupted if an actual measurement renders $x = 11$. This could imply our inferred rule was incorrect, $x = 11$ is an outlier, or our measurements are flawed. Thus, there is merit to the claim that more data is beneficial. However, this argument implies an endless quest for data, where each new measurement could potentially introduce error. Given that resources such as time and money are finite, the decision to collect more data can often pivot from a scientific debate to a political one. Noise, another common justification for requiring vast amounts of data (as referenced in Sect. 4.1), can be modeled as well, perhaps as an uncertainty surrounding the rule. The implications of this are left for further exploration in an exercise. One potential resolution to the question of data sufficiency, which might be intuitively apparent to the reader at this stage, is to find a quantitative solution. Specifically, we aim to devise a method to measure how feasible it is to learn the mapping between a collection of observations and their corresponding outcomes. Furthermore, we want to measure this learnability (that is, generalizability; see Chap. 6) in the context of a specific machine learning algorithm. In this chapter, we will explore the concept of general learnability.
12.2 Capacity Progression

Chapter 7 introduced energy as the most general complexity measurement, and Chap. 5 explained that an artificial neuron's representation function is the dot
product to calculate the energy of an input vector and thresholds it using a parameter b called bias:

$$f(\vec{x}) := \sum_{i=1}^{d} x_i w_i \geq b. \tag{12.1}$$
Algorithm 10 Calculate the memory-equivalent capacity needed for a binary classifier assuming weight equilibrium in a dot product

Require: data: array of length n containing d-dimensional vectors $\vec{x}$; labels: a column of 0 or 1 with length n
function memorize(data, labels)
    thresholds ← 0
    for all rows do
        table[row] ← ($\sum_{i=1}^{d}$ data[row][i], labels[row])
    end for
    sortedtable ← sort(table, key = column 0)
    class ← 0
    for all rows do
        if not sortedtable[row][1] == class then
            class ← sortedtable[row][1]
            thresholds ← thresholds + 1
        end if
    end for
    mec ← log2(thresholds + 1)
    return mec
end function
One approach to determining if our dataset is sufficient is to create and train a machine learning model using a training/validation split and test its ability to generalize on a held-out test set, as discussed in Chap. 3. However, this is not a foolproof method. If our model fails to generalize well, we cannot definitively say whether the problem lies in the adequacy of our data or the adequacy of our model’s structure or training. The concept of capacity progression, which we will discuss here, can help mitigate this uncertainty. Capacity progression offers a method for quantifying the learnability of a dataset with respect to a specific target. To grasp this concept, we need to revisit the idea of viewing generalization as compression, as outlined in Chap. 6. To illustrate, imagine a scenario where a friend has a 10 GB file and you only have a 5 GB hard drive. Can the file be stored on your drive? As is, no, but with the help of compression, it might. The success of this approach hinges on the nature of the file and the efficiency of the compression algorithm. The ideal strategy is to apply capacity progression: take increasingly large subsets of the data (for instance, 1%, 2%, 5%, 10%, etc.), compress them using each of the available tools, and observe if any tool reaches or surpasses a 2:1 compression ratio as the sample size increases. The tool that converges the fastest and performs the best is your optimal choice. If no tool can
achieve the desired compression ratio within your time and disk space constraints, you will need a larger hard drive. In machine learning, our goal is not to reproduce a file, but rather to model the function represented by the data in our dataset. This means we are trying to assess whether our machine learner can serve as a relevance compression for the function implied by the data table (see Definition 2.1, Chaps. 4 and 6 elaborate on understanding machine learning algorithms as compression tools). However, it is important to note that the method described above would require us to construct and train multiple models of different sizes, a time-consuming process. To address this, our goal is to develop a way to estimate the memory-equivalent capacity (see Definition 5.5) without actually constructing and training a model. In other words, we need to find a way to estimate the informational content of the function suggested by our labeled data.
12.3 Equilibrium Machine Learner

Chapter 16 discusses why training is difficult: finding the optimal weights in a neural network is NP-complete (Blum and Rivest 1992). We therefore want to avoid searching for parameters, and the idea is to use a generalized version of Algorithm 8. Algorithm 10 shows the pseudocode implementation. Just like Algorithm 8, Algorithm 10 shortcuts the computational load dramatically by assuming all d dimensions are in equilibrium and can be modeled with equal weights in the dot product. In other words, we choose not to train the weights $w_i$, fixing them to 1, and train only the thresholds. To find the thresholds, we create a two-column table containing the 1-weighted sums of the sample vectors $\vec{x}$ and the corresponding labels. We now sort the rows of the two-column table by the first column (the sums). Finally, we iterate through the sorted table row by row from top to bottom and count the need for a threshold every time a class change occurs. Another intuition for this algorithm is that, once the thresholds are calculated, if-then clauses of the form

for j = 0 to length(thresholds)-1 do
    if $\sum_{i=1}^{d} x_i$ < threshold[j + 1] then return label[j]
end for

could serve as a hash table for the training data. The number of rows of the training table divided by the number of if-then clauses gives us a rough estimate of the generalization possible using equilibrium energy thresholding on this data. For the actual complexity measurement, we can safely ignore column sums with the same value (hash collisions) but different labels: if equal sums of input vectors do not belong to the same class, Algorithm 10 counts them as needing another threshold. The assumption is that if an actual machine learner were built, training of the weights would resolve this collision. In the end, we assume the machine learner is ideal and therefore training the weights (to something other than $w_i = 1$) will
reduce the number of thresholds required to at least $\log_2(t)$, where t is the number of thresholds (numerically, we have to add 1 so that 0 thresholds correspond to an MEC of 0 bits). The proof for this is given in Sect. 8.3, which assumes random inputs and balanced binary classes. So for imbalanced classes or a non-random mapping, training could result in an even steeper improvement. Also, the resulting memory-equivalent capacity number should be adjusted if there are more than two classes (see exercises in Chap. 8). Just like Algorithm 8, the bottleneck of Algorithm 10 is the sorting, which takes $O(n \log_2 n)$ steps. The equilibrium dot product (energy) seems like a crude approximation, but it has two advantages. First, we are able to calculate it quickly. Second, the equilibrium state is the most general (see also the exercises at the end of this chapter). For these two reasons, equilibrium is assumed as a computational shortcut in many fields of science, for example, in particle simulations, where the dot product is also used to model energy (see Further Reading). The reader is free to employ any other method to collapse the input vector to tailor the algorithm to a more specific use case (see the discussion in Chap. 7 and also the exercises at the end of this chapter).
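For reference, the following is a minimal Python sketch of Algorithm 10, assuming NumPy, an (n, d) numeric data array, and binary 0/1 labels; function and variable names are illustrative, not from the book:

```python
import numpy as np

def memorize(data, labels):
    # 1-weighted sums (equilibrium energies) of the sample vectors.
    sums = np.asarray(data, dtype=float).sum(axis=1)
    # Sort the (sum, label) table by the sums.
    sorted_labels = np.asarray(labels)[np.argsort(sums)]
    thresholds = 0
    current = 0  # class tracker, initialized as in Algorithm 10
    for label in sorted_labels:
        if label != current:  # count a threshold at every class change
            current = label
            thresholds += 1
    # Ideal training reduces the requirement to log2(t + 1) bits of MEC.
    return np.log2(thresholds + 1)

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
print(memorize(X, rng.integers(0, 2, 1000)))           # random labels: ~9 bits
print(memorize(X, (X.sum(axis=1) > 2.5).astype(int)))  # one-threshold rule: 1 bit
```

The second call illustrates the best case: when the labels are themselves a threshold on the energy, a single threshold (1 bit) suffices regardless of n.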
12.4 Data Sufficiency Using the Equilibrium Machine Learner

Algorithm 11 shows how to implement capacity progression using Algorithm 10. It estimates the memory-equivalent capacity needed to memorize 5%, 10%, 20%, . . . , 100% of the function in the training table and outputs the values into a table. The data samples here increase exponentially because we expect a logarithmic increase in Memory-Equivalent Capacity (MEC) for a completely random mapping (as shown in Chap. 6). What one expects is that the Memory-Equivalent Capacity of the Equilibrium Machine Learner increases linearly (given exponential input growth) if all the model is able to do is memorize. This is the case when the function represented by the training data appears random to the machine learner. The machine learner then has to increase its capacity with every new sample coming in, just like memory capacity itself has to increase as one adds incompressible data. In other words, if the capacity does not stabilize at the higher percentages, the machine learner is not able to extract a rule. That is, either there is not enough training data to generalize or the representation function of the machine learner (here, equilibrium energy) is not right for the task. If the machine learner is able to learn a rule and effectively apply relevance compression, we expect the Memory-Equivalent Capacity to converge to whatever Memory-Equivalent Capacity is needed to memorize the rule. Convergence as early as possible, but certainly by the 80% mark, is ideal for concluding that the machine learner can operate based on a rule formulated with the representation function. Figure 12.1 exemplifies expected capacity progression curves. The plot shows memory capacity over the percentage of training data memorized. Needing 100% of the
Algorithm 11 Calculating the capacity progression for the equilibrium machine learner

Require: data: array of length n containing d-dimensional vectors $\vec{x}$; labels: a column of 0 or 1 with length n
Require: getSample(p) returns p percent of the data with corresponding labels.
Require: memorize(data), see Algorithm 10.
procedure CapProg(data, labels)
    sizes ← {5, 10, 20, 40, 80, 100}
    for all sizes do
        subset ← getSample(size)
        mec ← memorize(subset)
        print "MEC for " + size + "% of the data: " + mec + " bits"
    end for
end procedure
Fig. 12.1 Three examples of typical capacity progression curves (memory-equivalent capacity in bits over % of training data memorized): red is the memorizing baseline, green is highly generalizable ("rule found"), and black is in between ("somewhat generalizing")
Memory-Equivalent Capacity for memorizing 100% of the function implied by the training data by definition means the machine learner can only overfit. In practice, we expect the curve to be anywhere in between memorization and best-possible generalization. The intuition for "somewhat generalizing" is that there is a set of rules memorized, but as more data is added, exceptions to the rules still need to be added as well to classify with 100% accuracy. If the dimensionality of the input data is high, it may be advisable to plot the capacity progression curve together with a random baseline of the same dimensionality. This normalizes the difference in information capacity that comes with higher dimensionality (see Chap. 4).
Fig. 12.2 Capacity progression (MEC in bits over % of training data) on the original and two modifications of the Titanic dataset, calculated by Algorithm 11: red crosses show the capacity progression for random survival, black crosses the progression for the original dataset, and green dots the progression for a synthetic dataset where the gender determines the survival
When capacity progression is plotted for real data, the graphs look as depicted in Fig. 12.2. These plots were made using the Titanic dataset.2 The dataset consists of a part of the passenger manifest of the RMS Titanic (which sank on April 15, 1912, after striking an iceberg) with a Boolean target column "survived." The black plot shows the capacity progression on the original dataset. It is clear that a 100% accurate prediction of survival in such a disaster is impossible, and therefore we expect to never have enough data. However, historical accounts suggest that women with wealth and multiple children had higher survival rates due to prioritization in rescue boat assignments. The green curve illustrates the capacity progression on a synthetic modification of the dataset, where the "survived" column is made to coincide with the "gender" column, implying that all female passengers survived. Such a rule is easier to learn, and hence the capacity progression remains constant. The red curve shows the progression on yet another synthetic derivative of the dataset: here, the "survived" column has been completely randomized. As a result, the survival function is essentially impossible to generalize.
2 The dataset is documented and downloadable here: https://www.kaggle.com/c/titanic. To obtain the described result with the algorithm explained in this chapter, validate and numerize the data as explained in Chap. 11.
The method of capacity progression and estimating memory-equivalent capacity, however powerful, does have its limitations. For instance, it might not be appropriate to assume equilibrium in all cases. If a certain distribution is hypothesized, a different weight initialization can be used to alter the representation function. Additionally, one can iterate through different representation functions, analogous to the process used in the hard disk example discussed previously. While no method can guarantee 100% accuracy, this approach to measuring data sufficiency significantly outperforms the process of constructing and training a "black box" model in terms of speed and experimental rigor. Instead of blindly relying on a model's output, we gain a systematic way to quantify the adequacy of our data, enabling us to make more informed decisions in our machine learning endeavors.
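A minimal Python sketch of Algorithm 11, reusing the `memorize` sketch from Sect. 12.3 and assuming getSample(p) is a random subsample (one shuffle, with prefixes serving as nested samples):

```python
import numpy as np

def capacity_progression(data, labels, sizes=(5, 10, 20, 40, 80, 100), seed=0):
    data, labels = np.asarray(data), np.asarray(labels)
    idx = np.random.default_rng(seed).permutation(len(labels))
    data, labels = data[idx], labels[idx]  # prefixes act as nested getSample(p)
    for pct in sizes:
        m = max(1, len(labels) * pct // 100)
        mec = memorize(data[:m], labels[:m])
        print(f"MEC for {pct}% of the data: {mec:.2f} bits")
```

Plotting these values over the percentages yields curves like those in Figs. 12.1 and 12.2: a flattening curve suggests a rule was found, while a linearly growing one suggests pure memorization.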
12.5 Exercises

1. Given is the following noisy sequence: 2.1, 3.8, 6.2, 8.1, 10.0, 11.8, x. What is your answer for the value of x? How would you define the rule?
2. Explain, using the knowledge from Chap. 6, why the equilibrium state is the most general state for any system.
3. Give an example of an application for a different configuration of initial weights for Algorithm 10 and specify what that configuration would be.
4. Devise a method to measure the information limit of the Ideal Machine Learner. Compare the limit to the C(n, d) function discussed in Chap. 5.
5. Propose an idea to extend the Ideal Machine Learner to regression problems.
12.6 Further Reading

• Wikipedia: "Types of Equilibrium," https://en.wikipedia.org/wiki/List_of_types_of_equilibrium
• Samy Merabia: "Equilibrium Molecular Dynamics," chapter in "Nanostructured Semiconductors," 2nd edition, Jenny Stanford Publishing, 2017.
• Gabor Kassay, Vicentiu Radulescu: "Equilibrium Problems and Applications," Academic Press, October 2018.
Chapter 13
Machine Learning Operations
Machine Learning Operations (MLOps) is a core function of machine learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them. The name is derived from DevOps (short for development operations), the process of testing and releasing software to production.
13.1 What Makes a Predictor Production-Ready?

What makes a good model is the same as what makes good software in general. Good software does not crash (exit without the user wanting it to), does not become unresponsive, and is able to warn and react well when the user provides inputs that make no sense for the operations requested. In other words, the software (or the predictor) needs to be able to gracefully handle any random input. Just like any software, a model must be packaged together with all dependencies. Prediction must use the same preprocessing and tabularization (see Chap. 11) as was done for training. This usually means the preprocessing needs to be packaged with the prediction model. It may also sometimes be necessary to re-translate predictions into the original format. For example, if the model outputs numeric classes, but the original data had textual categories, the user should be presented with the textual categories as the output of the model. We will call models that are shipped together with data processing and error handling predictors, defined as follows.

Definition 13.1 (Predictor) A predictor is a model packaged as a software module that contains all pre-processing and post-processing steps and all error handling, and resolves all library dependencies required to run self-contained on a certain, pre-specified computer configuration.
Predictors are often shipped as a programmatic library (a so-called API, application programming interface), which can live on a server (aka "in the cloud"), or they can be installed as local packages using a package management tool. Predictors can also be command-line tools or integrated with a graphical user interface. DevOps usually also includes tracking of versions. As software gets updated, it changes behavior. Sometimes the change is wanted (and called a feature) and sometimes the change is unwanted (and called a bug). To keep the documentation and other communication to the user consistent, each time a new version of the software is released, it contains a version counter that is increased. Furthermore, software needs to be maintained because interfaces to dependent software change. Predictors have the same maintenance requirements. All of these requirements add to the basic requirement of a predictor having high enough accuracy. On top of the maintenance required to keep the software base for the predictor up to date, models have specific requirements as they react to real-world data. Data has the tendency to drift. That is, what is selected as a validation and test set and provides a valid basis for training and testing today may not be what works tomorrow.
13.2 Quality Assurance for Predictors

While models are developed, benchmarks are usually used to test and tune for accuracy and generalization. Benchmark data usually consists of past or publicly available data of a similar type. Needless to say, the more diverse the set of benchmark data, the more accurate the testing of the prediction algorithm (see also the discussion in Sect. 11.1).
13.2.1 Traditional Unit Testing

If it is possible to exactly specify the kind of data that needs to be predicted correctly by a predictor, then a large part of the MLOps task can be fulfilled by a typical and frequently applied DevOps method: unit testing. Unit testing is a part of the development operations process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. A machine learning model implements a function like most other parts of a software program and can therefore be handled as a unit like any other function. The problem is that unit testing usually tests a function against the original specification of the function, with all edge cases and exceptions. For example, a specification given to a programmer could be as follows: "Given an array of integers and an integer target, write a function that returns indices of the two numbers such that they add up to target. You may assume that each input would
have exactly one solution, and you may not use the same element twice. You can return the answer in any order." The unit test would then test the function developed by the programmer using examples of data generated from the specification. The problem in machine learning is that it can be extremely hard to describe, for example, an image dataset precisely enough. However, it is definitely not a bad idea to attempt to precisely describe the data that a predictor needs to handle. For example: "A classifier is to separate photographs containing either cats or dogs. All images provided to the predictor contain exactly one cat or one dog. The images are all in a 4:3 ratio with 1 megapixel resolution and 24-bit color depth. All photos are taken in full daylight outside or inside with artificial lighting. The cat or dog can be of any breed, any age, taken from any angle, in front of any background, in any position. The cat or dog can be located at any position in the picture. At the time of the photo, the animals may be moving or stationary. Cats or dogs are seen in full, not zoomed in to any specific body parts. Any cat or dog must occupy at least $\frac{1}{3}$ of the area of the image. Neither a cat nor a dog will be partially obscured. Animals will not be mirrored in any way by reflection. All images are unedited photographs with no parts being painted or rendered. The photos contain no indirections, like animals on a screen or through a window. All images have been compressed with JPEG compression at a quality level no lower than 20. The training and validation set has been labeled by 3 annotators (unanimous decision) and consists of 1000 images per class. The predictor must be able to solve the classification task with 95% accuracy on any image fitting this specification." The most practical approach to creating a data specification is an iterative one that incrementally refines the specification during data collection and annotation while comparing against the original task. Chapter 11 focuses more on data collection.
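Such a data specification translates naturally into unit tests. The sketch below uses pytest against a hypothetical `predict` interface; the module name, image shape, and label set are assumptions for illustration (1152x864 is one 4:3 framing of roughly one megapixel):

```python
import numpy as np
import pytest

from my_predictor import predict  # hypothetical packaged predictor

def test_returns_one_label_per_image():
    # Four spec-conforming dummy images: 4:3 ratio, ~1 MP, 24-bit color.
    batch = np.zeros((4, 864, 1152, 3), dtype=np.uint8)
    labels = predict(batch)
    assert len(labels) == 4
    assert set(labels) <= {"cat", "dog"}

def test_rejects_out_of_spec_input_with_informative_error():
    # Grayscale 100x100 images violate the specification above.
    with pytest.raises(ValueError):
        predict(np.zeros((4, 100, 100), dtype=np.uint8))
```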
13.2.2 Synthetic Data Crash Tests

Benchmark tests are often run with the assumption that they "simulate real-world data." However, the real world also contains exceptions that are seldom part of benchmarks. An analogy that can be used here is the following: benchmark data makes sure the car drives in traffic, in bad weather, and on a bumpy road. However, one also needs to see how the car behaves when things fail, e.g., when it crashes into a wall or when a tire blows. This kind of testing is usually best achieved with synthetic data that is specifically designed to test failure modes. The general idea of a crash test for predictors is that a predictor should never crash. That is, whatever input is given to the predictor, it should always either give a sensible prediction or provide an informative error. For example, if a predictor has been trained on D columns of input and the user only provides $D - 1$ columns, an error message should appear. Similarly, when the model was trained on numeric data and the user asks for a prediction based on strings, there should be a graceful exit with an informative error message. A typical mistake is also that a user will
ask a predictor to predict a row that is annotated with a label. This can be detected and treated as a validation request. To systematize such testing during MLOps, random data tables (see Definition 2.1) should be generated to test the predictor. It is up to the creativity of the test designer and the specification of the target use of the predictor which error conditions to include. The following list of suggestions is provided in the hope that it is useful. Typical crash tests can be prediction requests:

• Of the wrong dimensionality
• With string cells containing special characters or in a different UTF encoding (Crockford 2008)
• With numbers out of the numeric range of the specification of the predictor
• With missing values
• With non-atomic cells or mixed values (see also Definition 2.1)
• That are much noisier than the training data
• That are all 0s or all maximal values per cell
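A sketch of how such crash tests could be systematized in Python; the predictor interface and the exception types it is allowed to raise are assumptions:

```python
import numpy as np

def crash_test_requests(d):
    # One malformed prediction request per suggestion above (assumes d >= 2).
    yield "wrong dimensionality", np.zeros((1, d - 1))
    yield "special-character strings", np.array([["\u202e\x00ä"] * d])
    yield "out-of-range numbers", np.full((1, d), 1e308)
    yield "missing values", np.full((1, d), np.nan)
    yield "mixed, non-atomic values", np.array([[1, "two"] + [3.0] * (d - 2)],
                                               dtype=object)
    yield "all zeros", np.zeros((1, d))

def crash_test(predictor, d):
    for name, request in crash_test_requests(d):
        try:
            predictor(request)
        except (ValueError, TypeError) as e:
            print(f"{name}: graceful error: {e}")  # acceptable outcome
        else:
            print(f"{name}: prediction returned")  # also acceptable
        # Anything else (hang, unhandled crash) fails the crash test.
```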
13.2.3 Data Drift Test

Drift is the evolution of data that, over time, invalidates the model and thus its predictions. This is called data drift or also concept drift and holds for any observation, as predicted by the second law of thermodynamics (Landa 1958). Even the SI units have to be updated for data drift (Mohr et al. 2018). If one is worried about the longevity of a model, one should therefore dedicate tests to measuring the model's behavior under data drift. The most important indicator for the longevity of a prediction model is its generalization ratio. As we know from Sect. 6.4, the higher the generalization, the higher the resilience of the model. Beyond that, synthetic data can be generated to test data drift. This can be implemented by having the benchmark datasets "drift." Drift can be simulated by increasing the overall entropy of the data by adding noise and also by adding bias (compare Sect. 13.3). An example of such a test is discussed in Chap. 12, where the Titanic dataset is changed to all female passengers. For a data drift test, the question is then whether the accuracy still holds as if the original dataset were used. The best guarantee that a predictor can handle drifted data is re-testing the predictor regularly with recent prediction requests that have been annotated. If the accuracy on this updated test set drifts, retraining or a new model may be required.
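A drift test can be sketched by synthetically "drifting" a numeric benchmark table, adding noise (entropy) and a constant shift (bias), and re-measuring accuracy; the scale factors below are arbitrary illustrative choices:

```python
import numpy as np

def simulate_drift(X, noise_level=0.1, bias_shift=0.0, seed=1):
    # Per-column Gaussian noise raises entropy; a constant shift adds bias.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    scale = noise_level * X.std(axis=0)
    return X + rng.normal(0.0, scale, size=X.shape) + bias_shift

# Does accuracy hold as drift increases?
# for level in (0.0, 0.1, 0.2, 0.5):
#     print(level, accuracy(predictor(simulate_drift(X_test, level)), y_test))
```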
13.2.4 Adversarial Examples Test

Sometimes, just adding random noise to the data is not sufficiently representative of the type of erroneous input that could be expected. As explained in Sect. 6.5,
Fig. 13.1 Adversarial examples for a stop sign detector (Evtimov et al. 2017). Examples like these should be part of quality assurance for predictors in an MLOps process. From https://www.flickr.com/photos/51035768826@N01/22237769, License: CC BY 2.0, Credit: Kurt Nordstrom
adversarial examples are (sometimes counter-intuitive) contradictions of the generalization assumption of the model. Finding these can be crucial for the production success of a predictor. Of course, there is no general rule for how to discover adversarial examples for a given model, as it depends on the use case at hand. A prominent story of a set of adversarial examples discovered during the development of autonomous cars, however, can hopefully serve as an example of their importance for quality assurance. As part of building the autonomous driving functionality for an electric car, a stop sign detector was built and tested using benchmark sets of stop signs in front of various backgrounds, camera angles, lighting conditions, zoom factors, partial obscurings, and driving speeds. However, it turned out that the generalization assumption (red octagon with white letters saying "STOP") was easily contradicted by stop signs with graffiti or stickers on them (Evtimov et al. 2017). Figure 13.1 shows several examples. To make the model robust against these adversarial examples, instances of stop signs with alterations should be included in the training and validation sets. Furthermore, different altered stop signs need to be part of the quality assurance testing for MLOps before the predictor is included in an actual car.
13.2.5 Regression Tests

Regression tests check whether a new version of a predictor still performs the same way or better on a set of standard benchmarks, which can include synthetic
data. While measuring and comparing the average overall accuracy (see Eqs. 11.3 and 3.23) of all benchmark datasets used for a previous version of the predictor is a primary indicator for progress or regress in prediction quality of the current version of the predictor, other methods should be added. For example, when there is an accuracy difference between the benchmark results of a previous version and a current version, the samples that made the difference should be analyzed for a bias. Even consistently better results can be at the cost of some bias that has been introduced: for example, when one dataset of the benchmark set scores a lot higher accuracy than before but several others are slightly worse. Strictly speaking, a predictor really only improves when the correctly predicted samples are a true superset of the samples correctly predicted in the previous version of the predictor.
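This strict-improvement criterion can be checked mechanically; a minimal Python sketch:

```python
def strict_improvement(y_true, preds_old, preds_new):
    # A new predictor strictly improves only if the set of samples it
    # predicts correctly is a proper superset of the previous version's.
    correct_old = {i for i, (t, p) in enumerate(zip(y_true, preds_old)) if t == p}
    correct_new = {i for i, (t, p) in enumerate(zip(y_true, preds_new)) if t == p}
    return correct_new > correct_old  # proper-superset comparison on sets
```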
13.3 Measuring Model Bias

Bias is a frequently discussed topic when it comes to societal issues, but it is also very important in science and therefore for machine learning. To narrow down the definition of bias for machine learning, let us start with a textbook societal example. Five people are waiting in a room for a job interview. They have the same qualifications. In this particular moment, an unbiased interview would assign equal probability for each candidate to be hired. That is, each of them has a $\frac{1}{5}$ expected chance of being hired. Any other distribution of the probabilities would indicate bias. For example, let us assume the interviewer is more likely to hire a societal majority or has to obey a policy that alters anybody's chances in any way: any reduction or increase of uncertainty for a job candidate that is not part of the experiment (that is, in this example, not based solely on qualification) is defined as bias toward or against the candidate, respectively. From an information point of view, we already defined bias as a reduction of uncertainty toward an outcome by factor(s) not intended to be part of the experiment (see Definition 11.1). The simplest example of bias in machine learning is class imbalance in the training data. The probabilistic equilibrium (see Definition 4.10) of the classes is disturbed by an externality, e.g., the fact that not enough data of a certain class could be observed to be included in the training data. Since bias is such a well-discussed topic, many different definitions of bias exist, particularly from sociology (Taylor & Fiske 2020) and law (Surden 2014). However, these disciplines usually have the additional job of providing a cause for the bias. Since causality is not part of creating a model purely based on observations (see also the discussion in Sect. 7.3), machine learning model bias is measurable as exactly that deviation from probabilistic equilibrium. Algorithm 12 explains the idea. The underlying assumption for Algorithm 12 is that random inputs should result in random outputs. Back to the earlier scenario: if people's qualifications are completely random and only the qualifications are taken into account, completely random people should be hired. However, if the model learned to favor a certain
Algorithm 12 Measuring the bias of a given trained model. The algorithm creates an evaluation table by generating random numbers in the min and max range of each column's original values. It then creates predictions for the evaluation table and plots a histogram of the distribution of predictions. A bias-free model would have a completely flat output histogram. Any bias toward or against a certain class is indicated as a higher or lower than average bar in the histogram

Require: Model: fully trained model
Require: $T(x_1, \ldots, x_d, f(\vec{x}))$: training table with d columns and n rows (numeric only, see also Sect. 11.6)
1: function MeasureBias(Model, $T(x_1, \ldots, x_d, f(\vec{x}))$)
2:     EvalTable ← []
3:     for i ∈ [1, d] do
4:         EvalTable_i ← create-random-column(min($T(x_i)$), max($T(x_i)$), n)
5:     end for
6:     return Histogram(Model(EvalTable))
7: end function

class of people, this should be visible in the model as bias, especially when the number of rows in the evaluation table is chosen to be quite high. That is, a certain class is more or less often selected than average. The bias can, for example, be visualized as a histogram, where flat means no bias. Alternatively, we can follow Definition 11.1 word by word and measure the total model bias in bits:

Definition 13.2 (Model Bias)

$$bias = -\log_2 P + \sum_{i=1}^{c} o_i \log_2 o_i \; [\text{bits}], \tag{13.1}$$
where c is the number of classes, P is the probability of occurrence of a class i in training (assuming each class has equal frequency), and $o_i$ is the probability of occurrence of class i after testing with random input (e.g., with Algorithm 12). The bias is measured in bits, as an (undue) reduction of uncertainty, and the smaller the value, the less bias, with 0 bits of bias being ideal. If bias is to be measured with respect to a certain class, the second term can be replaced with the self-information of that class, measured as output of the model (e.g., with Algorithm 12).

Definition 13.3 (Class Bias)

$$bias = -\log_2 P + \log_2 O \; [\text{bits}], \tag{13.2}$$

where P is the probability of occurrence of a class in training (assuming each class has equal frequency), and O is the probability of occurrence of a certain class after testing with random input (e.g., with Algorithm 12).
There will then be no bias (0 bits) if the class occurs as frequently as any other class. However, if the class occurs more frequently than any other class, it adds (undue) information, and therefore the bias is positive. If the class bias is negative, the uncertainty for this class has been unduly increased by the model in favor of other classes. Entropy is the only way to measure bias without introducing bias by the measurement itself, due to the minimal axiomatic definition of entropy. See also the detailed discussion in Sect. B.4. In both Definitions 13.2 and 13.3, the first term could be exchanged with the non-equilibrium entropy if practical considerations require that.
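A minimal Python sketch combining Algorithm 12 with Definition 13.2; `model` is assumed to map a numeric table to one predicted class per row, and classes the model never outputs are ignored for simplicity:

```python
import numpy as np

def model_bias_bits(model, X_train, n_eval=100_000, seed=0):
    X_train = np.asarray(X_train, dtype=float)
    rng = np.random.default_rng(seed)
    # Algorithm 12: random evaluation table within each column's [min, max].
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    eval_table = rng.uniform(lo, hi, size=(n_eval, X_train.shape[1]))
    # Output distribution o_i over the predicted classes.
    _, counts = np.unique(model(eval_table), return_counts=True)
    o = counts / counts.sum()
    p = 1.0 / len(o)  # equal class frequency in training assumed
    return -np.log2(p) + np.sum(o * np.log2(o))  # Eq. 13.1; 0 bits is ideal
```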
13.3.1 Where Does the Bias Come from?

Since a perfect probabilistic equilibrium is unobservable (see also Sect. 7.1.2), bias is everywhere. However, being aware of bias and its cause is helpful in weighing the validity of a decision. In machine learning, bias can be introduced in many ways: through bad data collection, through intrinsic restrictions in the data, through not training to the end, by introducing regularization, and even by creating features that otherwise help generate high accuracy. Algorithm 12 and Definitions 13.2 and 13.3 quantify the bias, but they do not tell us where it comes from. Unfortunately, the definition of bias itself (see Definition 11.1) tells us that we have an influence that was outside of the factors we considered as part of the experiment. As humans, we have the ability to switch our frame of reference (see also Sect. B.5) and talk about the context of an experiment. However, when the experiment has been abstracted into a model, the context is often forgotten. So it is important to measure bias, document it, and, ideally, understand where it comes from. As part of deploying a model to the outside world, it is therefore advisable to run quality assurance tests against bias. Some of the possibilities are described as follows.
Add-One-In Test

Let us assume we suspect that a certain model classifies more people of a certain skin color as eligible for a potential employment. The question is then: Is there a bias, or is the eligibility a consequence of a correlation between the qualifications and that group of people? One way to get a quantitative answer to that question is to take another look at mutual information (see Definition 4.18), which was intuitively defined as "how much we can learn about the outcome of an experiment from the factor X." So we can add a new column to our training table and measure the mutual information $I_{bias}(X, Y)$, where X is the probability function of the new experimental factor "skin color" and Y is the probability function of the target $f(\vec{x})$ (see also
Chap. 4). If mutual information is high, correlation is high. We can measure the mutual information between all columns and the "skin color" attribute and provide a detailed analysis of which attribute in the training data has the highest mutual information with the unwanted attribute. If a specific attribute cannot be singled out, a combination of attributes is responsible for the bias. Neither of these techniques gives a perfect answer to the above question. However, depending on the data at hand, not finding bias in the data could indicate bias in the model. Finding bias in the data often requires re-collecting or re-annotating.
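The add-one-in test only needs an empirical estimate of mutual information between two discrete columns; a minimal sketch:

```python
import numpy as np

def mutual_information_bits(x, y):
    # I(X;Y) = sum over the joint table of p(x,y) * log2(p(x,y) / (p(x)p(y))).
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * np.mean(y == yv)))
    return mi

# I_bias between the suspect attribute and the target, then between every
# other column and the suspect attribute (column names are illustrative):
# print(mutual_information_bits(table["skin_color"], table["target"]))
```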
Leave-One-Out Test

Just like one can add attribute columns, one can leave out attributes that are suspected of introducing bias. Furthermore, one can change attributes to lower or higher entropy or also to be anti-correlated. An example of such a test is discussed in Chap. 12, where the Titanic dataset is changed to all female passengers or the target function is changed to be completely random. When one changes the passenger gender column to the single value "female," one effectively deletes that column. Since there was a historically documented societal bias toward female survivorship ("women and children first"), predicting the survivorship of each passenger becomes much harder when all passengers are simulated to be female. Thus the predicted memory-equivalent capacity indicates the bias experimentally.
Cheating Experiments

Cheating experiments are extremely useful for measuring bias when several machine learning algorithms are combined. Since we know the actual outcomes of the training and validation data, it is sometimes useful to replace a machine learner with the actual 100% correct prediction and then measure the bias again. This way, one can measure which machine learner in the system introduces the most bias. Note that the above tests can be generalized from attribute columns to any rule applied to the data. For example, in images, we could change colors or blank out certain pixel regions to see if there is bias, and in an audio file we could delete certain frequencies.
13.4 Security and Privacy

Traditional DevOps also has a security and privacy component. So, depending on the target application and audience, a predictor module may have to include access control for specific predictions. Also, sometimes it is important to know that the
training data cannot be reverse engineered from the model. Reverse engineering is harder when the generalization ratio is high but there is no guarantee. This is left as an exercise.
13.5 Exercises

1. MLOps Process
   (a) Draw a block diagram outlining an MLOps process as outlined in this chapter.
   (b) Assume you have a model with very few parameters. How would the process change?
2. Bias Measurements
   (a) Outline a bias measurement for a regression task.
   (b) Explain the connection between Occam's Razor and bias.
3. Privacy Considerations
   (a) Argue why a high generalization G (see Definition 6.1) allows one to infer a certain amount of privacy against reconstruction of the original training data.
   (b) Discuss why there are no guarantees, though.
13.6 Further Reading

Reading on DevOps, which is the source of MLOps:

• S. Carter, J. King, M. Younkers, J. Lothian: "Model-Driven DevOps: Increasing Agility and Security in Your Physical Network Through DevOps", 1st Edition, Addison-Wesley, September 2022.
Chapter 14
Explainability
The field of explainable artificial intelligence (XAI), or interpretable AI, or sometimes explainable machine learning, is a research field concerned with creating models via the automatic scientific process that allow humans to understand the decisions and predictions made by the model (Gilpin et al. 2018). It contrasts with the "black box" concept in machine learning (see Chap. 3), where even a model's designers cannot explain why it arrived at a specific decision.
14.1 Explainable to Whom?

To answer the question of what makes models explainable on a high level, we have to ask ourselves: "What is an explanation?" Typically, an explanation is a hypothesis that fits the observations (see also Chap. 1). An explanation then has further quality standards, such as whether it fits future observations (repeatability, see also Definition 4.4) and whether it generalizes to similar observations (reproducibility, see Definition 4.5, and generalizability, see Chap. 6). That is, in theory, a good machine learning model that is highly reproducible and generalizable is already a high-quality explanation. The reason why humans may have difficulty understanding the explanation is that it is not in a human language the user of the model is familiar with. A similar problem would arise if an explanation used Chinese characters for a reader who is only trained to read the Latin alphabet (see also the related discussion in Sect. 7.2.2). An additional problem is that a model by itself, even when highly general and accurate, usually does not provide explanations. For example, let us try to explain Newton's formula for potential energy (Newton 1726b) based on gravity: $E_{pot} = m \cdot g \cdot h$, where $E_{pot}$ is the potential energy, m is mass, g is gravity, and h is height. The formula itself provides the numeric relationships between the different physical entities. However, to understand why the results of the formula arise, as humans,
we need to understand the underlying assumptions and the explanation around the physical entities: for example, what mass, gravity, and height are, and why potential energy matters. As with every explanation, we therefore need to first ask about the receiver, that is, whom we are explaining to. An explanation to a 5th grader would certainly be different from an explanation to an expert in the field of data science or a business person. Explainability can therefore not be solved in general. At the minimum, it is relative to the target audience. Having said that, the remainder of the chapter will go through some general techniques that can be helpful in creating explainable models using the automated scientific process.
14.2 Occam's Razor Revisited

When it comes to explanation, the most often referenced concept is Occam's razor. In fact, we already introduced it in Sect. 1.1.5 and in Sect. 2.2.2 and mentioned it several more times. The original formulation (Thorburn 1915) of Occam's razor is "Do not multiply too often" and literally only referred to simplifying fractions. That is, $\frac{39}{52}$ should be simplified to $\frac{3}{4}$. This obviously helps to see that two arithmetic results are the same. The philosophy, science, and machine learning communities have long discussed two possible higher-level interpretations of Occam's razor:

1. Among models of the same accuracy, choose the simpler one.
2. Simpler models are more likely to be accurate.

For the constraints outlined in this book (see Chap. 2), the first interpretation is a corollary. For a high-accuracy model, high generalization is achieved using low memory-equivalent capacity (MEC) (see Chap. 6). Low MEC implies a short description. That is, under the assumption of the same accuracy on a training set, the lower-MEC model is preferred as it will achieve higher accuracy on a validation set. The second interpretation of Occam's razor is more difficult to sustain as it ignores the issue of over-generalization. Just as discussed in Sect. 6.2, best-guess accuracy is easy to achieve with minimal models that highly overgeneralize. For example, if the second interpretation were true, it would always be better to use linear regression (see Sect. 3.3.2). Having said that, linear regression is highly explainable. In other words, the focus on explainability implies a tendency toward over-generalization. This trend is also observable in the evocative models discussion in Sect. 7.5. All models are wrong, but some of them are so memorable that they are useful even when they only describe ideal situations, i.e., situations that can never be observed (such as the ideal gas equation). We also know from Sect. 7.3 that it is easy for humans to confuse model and observation.
In summary, if accuracy is a concern, then among models of equal accuracy, the least complex model will be the easiest to explain, even though there is no guarantee that the generated model is simple enough to be explainable at all. If accuracy is a secondary concern, then explainability can be approached bottom-up, by building increasingly complicated models that fit the data increasingly well (but likely while decreasing generalization).
14.3 Attribute Ranking: Finding What Matters

A powerful feature that can help explainability with low-dimensional data classification problems is attribute ranking. Attribute ranking is defined as follows:

Definition 14.1 (Attribute Ranking) Given a data table as defined in Definition 2.1, an attribute ranking is a list $(x_1, x_2, \ldots, x_r)$ with $r \leq D$ which is a sorted subset of the experimental factors $x_i$ of the data table such that experimental factor $x_i$ is more indicative of $f(\vec{x})$ than experimental factor $x_j$ when $i < j$.

Synonyms for experimental factors are columns, attributes, and features. As a consequence, the literature describes the same methods as attribute selection, feature ranking, etc. The definition of "more indicative" also varies, as many methods exist to correlate columns or random variables.

Algorithm 13 An example of a greedy attribute ranking algorithm. Examples of the MI function are Definition 4.18 or Algorithm 10

Require: table: array of length n containing D-dimensional vectors $\vec{x}$ and an $f(\vec{x})$ column.
1: function AttributeRank(table)
2:     ranking ← []
3:     ranking ← ranking + [argmax_{c ∈ table} MI(f($\vec{x}$), c)]
4:     while length(ranking) < D do
5:         ranking ← ranking + [argmax_{c ∈ table \ ranking} MI(f($\vec{x}$), ranking + [c])]
6:     end while
7:     return ranking
8: end function
Deriving from Chap. 4, we know that the complexity of the model representation of the function represented by the data table is measured by the mutual information (Sect. 4.4) between the experimental factors $x_i$ and the outcomes $f(\vec{x})$. The higher the mutual information, the fewer data table rows need to be memorized. It is therefore natural to apply the mutual information as a criterion for the most indicative experimental factors as well. The issue with any algorithm for attribute ranking is that there are $D!$ rankings (where D is the number of attributes). Even at $D = 20$ this would be $20! = 2{,}432{,}902{,}008{,}176{,}640{,}000$ rankings. Typical methods for attribute ranking therefore converge into a local optimum by doing a greedy search.
A greedy search only requires $D \cdot (D - 1)$ comparisons. Algorithm 13 implements such a greedy search. The algorithm builds up the ranking list incrementally by adding attributes that increase the total mutual information. The first attribute is selected into the ranking by calculating the mutual information between each single attribute column and the outcome column. The attribute with the maximum mutual information is the first element of the ranking list. After that, the algorithm adds new attributes to the ranking by calculating the joint mutual information of all the columns that are part of the ranking plus one column that is not yet in the ranking. Each time, an attribute is added to the ranking that is not yet part of it but maximizes the joint mutual information between the attributes in the new ranking and the outcome column. The greedy part of the algorithm is adding attributes to the ranking that locally maximize the mutual information. For example, it is quite possible that not adding the single attribute with the highest mutual information first, but only the next two attributes in the ranking, leads to higher joint mutual information than the result of adding all three. The algorithm therefore only returns one of many locally optimal solutions. The algorithm can be optimized in various ways. For example, we know that the maximum mutual information is bounded by the maximum entropy of either the outcome column or the attribute (see Eq. 4.11). This allows for a stopping criterion: we can stop once we have ranked enough attributes to explain the complexity of the outcome. Also, rather than calculating mutual information directly, Algorithm 10 can be used for an approximation of the mutual information. Doing so on the Titanic dataset (see also Chap. 12) leads to the following ranking of the attributes: Gender, SiblingSpouse, ParentChildren, CabinClass, PortofEmbarkation, Age, Fare, CabinNumber, TicketNumber, and PassengerId. This ranking explains the indicators for the outcome function (survived or not) quite well. See also the discussion in Sect. 13.3. Attribute rankings can be confirmed by building models from the most highly ranked features and comparing them to models that include more features. Furthermore, leaving the most important attributes out of the greedy search and repeating the search on different subsets of attributes can indicate the stability of a greedy search result. Instead of measuring mutual information, one can also build a (simple) model between each of the attributes and the output, for example, a decision tree (Burges 2010). However, imperfect model training can introduce an additional bias on top of the bias introduced by any assumption that does not consider all $D!$ rankings of attributes (see also Sect. 13.3).
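A minimal Python sketch of Algorithm 13 for discrete columns, reusing the `mutual_information_bits` estimate from Sect. 13.3; representing the table as a dictionary of column name to value array is an assumed convention:

```python
import numpy as np

def joint_mi(columns, y):
    # Encode each row tuple of the selected columns as one discrete symbol,
    # then reuse the pairwise mutual information estimate.
    joint = np.array([hash(tuple(row)) for row in np.column_stack(columns)])
    return mutual_information_bits(joint, y)

def attribute_rank(table, y):
    # table: dict mapping column name -> discrete value array.
    remaining, ranking = list(table), []
    while remaining:
        best = max(remaining,
                   key=lambda c: joint_mi([table[k] for k in ranking + [c]], y))
        ranking.append(best)  # greedy: locally maximal joint MI
        remaining.remove(best)
    return ranking
```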
14.4 Heatmapping

With low-dimensional data (low D), inference is usually fast. This allows for yet another way to provide perceived explainability: combine the predictor with a spreadsheet. This allows users to manually provide and manipulate prediction requests. The prediction could be directly displayed in the $(D+1)$-th column of
the spreadsheet. While this does not actually provide an explanation, the user feels enabled by the possibility of trial and error. In fact, "playing around" with a predictor this way can provide quite interesting insights. In general, the easiest way to provide any kind of explainability is to show the user the instances that were classified incorrectly by the model. For a detection task, categorizing the incorrectly predicted instances into false positives and false negatives can improve this poor-person's explainability method. For images and text data, a further improvement on this method has become common in the machine learning community: heatmapping. Heatmapping is a variant of attribute ranking for high-dimensional data and a given model. It is currently mostly applied to neural networks as they inherently implement attribute importance ranking: the dot-product threshold calculated at each neuron, $f(\vec{x}) := \sum_{i=1}^{D} x_i \cdot w_i \geq b$ (see Sect. 3.3.5), weighs every $x_i$ using a weight $w_i$. Higher values for a weight $w_i$ imply a higher contribution of the attribute $x_i$ to the positive class. Lower weights imply a higher contribution to the negative class for a certain attribute. Figure 14.1 shows an example. One popular method for heatmapping in neural networks is gradient-based heatmapping. In gradient-based heatmapping, the gradient of the output with respect to the input attributes is calculated. The gradient represents the sensitivity of the output to changes in each input attribute, and it can be used to highlight the most important attributes for a given prediction (Simonyan et al. 2013). A positive gradient indicates that increasing the attribute value would increase the output, while a negative gradient indicates that decreasing the attribute value would increase the output. Thus, the gradient provides information about the direction of change for each attribute.

Fig. 14.1 Heatmapping the correlation between genes and their expressions
There are several gradient-based heatmapping techniques, such as:

• Gradient: Compute the gradient of the output with respect to the input attributes (Simonyan et al. 2013).
• Integrated Gradients: Compute the gradient of the output with respect to the input attributes and average it over a straight path from a baseline input to the actual input (Sundararajan et al. 2017).
• Guided Backpropagation: Modify the backpropagation algorithm to only propagate positive gradients, which emphasizes the positive contributions of the input attributes (Springenberg et al. 2014).
• Grad-CAM: Compute the gradient of the output with respect to the feature maps of a convolutional layer, and then use the gradients to weight the feature maps, creating a heatmap that highlights important regions (Selvaraju et al. 2017).

Gradient-based heatmapping techniques can provide insight into the model's decision-making process, making them useful for interpretability and debugging purposes. However, these techniques also have limitations, such as potential sensitivity to noise in the input data, and they may not fully capture the complex interactions between input attributes.
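A minimal sketch of the first (plain gradient) technique, using PyTorch; `model` is assumed to map a batch of attribute vectors to class scores:

```python
import torch

def gradient_heatmap(model, x):
    # x: tensor of shape (1, D). Returns |d(top class score)/d(x_i)| per attribute.
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0].max()  # score of the most likely class
    score.backward()           # populates x.grad with the gradient
    return x.grad.abs().squeeze(0)
```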
14.5 Instance-Based Explanations Another approach to providing explanations for model predictions is by using instance-based explanations, which identify similar instances in the training data and provide information about their outcomes. One popular instance-based explanation technique is Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al. 2016a). LIME works by perturbing the input instance, obtaining predictions for the perturbed instances, and then fitting a simple, interpretable model (such as linear regression or a decision tree) to the perturbed instances and their predictions. The resulting interpretable model can be used to explain the prediction for the original input instance. Another instance-based explanation technique is Counterfactual Explanations (Wachter et al. 2017). Counterfactual explanations provide information about the minimal changes required to the input instance to alter the model’s prediction. This is consistent with the definition of resilience (see Definition 6.6) and adversarial examples (see Definition 6.5). This can help users understand the factors that were most influential in the model’s decision-making process. Instance-based explanations can provide valuable insights into the model’s behavior, particularly for individual predictions. However, they may not always generalize well to other instances, and their usefulness may be limited for understanding the overall behavior of the model.
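A minimal sketch of the LIME idea, in which the Gaussian perturbation strategy, the proximity weighting, and the ridge-regression surrogate are illustrative assumptions rather than the reference implementation:

import numpy as np
from sklearn.linear_model import Ridge

def local_explanation(predict_fn, x, num_samples=500, sigma=0.5):
    # predict_fn is an assumed black-box model returning numeric predictions.
    rng = np.random.default_rng(0)
    X_pert = x + rng.normal(0.0, sigma, size=(num_samples, x.shape[0]))
    y_pert = predict_fn(X_pert)                      # black-box predictions
    # Weight perturbed samples by their proximity to the original instance.
    weights = np.exp(-np.sum((X_pert - x) ** 2, axis=1) / sigma ** 2)
    surrogate = Ridge(alpha=1.0).fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_                           # local attribute effects

The returned coefficients estimate how each attribute locally influences the prediction for the instance x.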
14.6 Rule Extraction Rule extraction is another approach to generating explainable models, where the goal is to extract a set of human-readable rules from a trained model that approximate its decision-making process (Andrews 1995). There are several rule extraction techniques, such as: 1. Decision trees: Decision trees provide an intuitive and easily understandable structure that represents the decision-making process, with each node of the tree representing a condition on a feature and branches representing the possible values of the feature (Quinlan 1986). 2. Rule-based models: These models use a set of logical rules to make predictions. The rules can be easily understood by humans and can be expressed in the form of if-then statements. Examples of rule-based models include RIPPER (Cohen 1995) and genetic programming using symbolic expressions (see also Sect. 3.3.7). 3. LASSO regression: LASSO (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that includes a regularization term to force some coefficients to be exactly zero, leading to a sparse and more interpretable model (Tibshirani 1996). Figure 14.2 shows what such a path plot can look like for different features; a code sketch for computing such a path follows this list. 4. Sparse Bayesian learning: Sparse Bayesian learning is a Bayesian technique that enforces sparsity in the learned model by using appropriate priors, making the
Fig. 14.2 LASSO regularization path: This plot shows the coefficients of the LASSO model for each feature as a function of the regularization parameter ($\log_2(\alpha)$). As the value of $\alpha$ increases, more coefficients shrink to zero, resulting in a sparser and more interpretable model. A feature (experimental variable) that “lasts longer” also has a higher significance for the outcome of the experiment. See also the discussion in Sect. 14.8
model more interpretable by focusing on a small subset of relevant features (Tipping 2001). 5. Explainable Boosting Machines (EBMs): EBMs are a combination of generalized additive models and gradient boosting that provide interpretable explanations by learning additive structure and using monotonic constraints (Lou et al. 2012; Caruana et al. 2015).
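As a minimal illustration of the LASSO path behind Fig. 14.2, the following sketch computes a regularization path with scikit-learn; the synthetic dataset is a stand-in for real experimental variables:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
alphas, coefs, _ = lasso_path(X, y)
# coefs has shape (n_features, n_alphas): features whose coefficients stay
# nonzero longest as alpha grows are the most significant for the outcome.
nonzero_counts = (np.abs(coefs) > 1e-8).sum(axis=1)
print(nonzero_counts)   # larger count = feature "lasts longer" on the path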
14.6.1 Visualizing Neurons and Layers A technique that has gained popularity in recent years is the visualization of the neurons and layers in deep learning models. Specifically, in the case of convolutional neural networks (CNNs) (LeCun et al. 1998), visualizing the activations in the layers can help understand the features and patterns that the model is learning. This approach can give researchers and practitioners insights into the intermediate stages of the model's decision-making process, thus providing a form of explainability. In a CNN, the neurons in each layer can be thought of as feature detectors. Early layers in the network often detect simple patterns, such as edges and corners, while deeper layers detect more complex features, such as object parts or entire objects. By visualizing the activations in each layer, one can gain a better understanding of the hierarchical nature of the features learned by the model (Zeiler and Fergus 2014). This can help in identifying issues, such as overfitting or biases, as well as understanding the model's generalization capabilities. However, visualizing neurons and layers is not without its limitations. It is more applicable to image and video data, as interpreting intermediate activations for text or tabular data can be more challenging. Moreover, for very deep networks, visualizing activations at all layers may not be feasible or insightful, as the sheer amount of information can be overwhelming.
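As a minimal sketch of how such activations can be captured in practice, the following assumes PyTorch and uses a tiny untrained network as a stand-in; real use would replace it with a trained CNN:

import torch

model = torch.nn.Sequential(torch.nn.Conv2d(1, 4, 3), torch.nn.ReLU(),
                            torch.nn.Conv2d(4, 8, 3))
activations = {}
# A forward hook stores the first layer's output on every forward pass.
model[0].register_forward_hook(
    lambda module, inp, out: activations.update(first_layer=out.detach()))
model(torch.randn(1, 1, 28, 28))         # one forward pass on a dummy image
print(activations["first_layer"].shape)  # feature maps, one per channel

The captured feature maps can then be plotted channel by channel to inspect what each filter responds to.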
14.6.2 Local Interpretable Model-Agnostic Explanations (LIME) Another approach to explainability is to use local interpretable model-agnostic explanations (LIME) (Ribeiro et al. 2016b). LIME is a technique that aims to provide explanations for individual predictions made by any machine learning model. The key idea behind LIME is to approximate the complex model with a simpler, more interpretable model around a specific instance. This is done by generating a dataset of perturbed instances around the instance of interest and training a simple model, such as a linear regression or decision tree, on this new dataset.
The simpler model can then be used to provide an explanation for the original model’s prediction. The explanation is given in terms of the features that contributed the most to the prediction, making it easier for humans to understand the model’s decision-making process. As LIME is model-agnostic, it can be applied to any machine learning model, including deep learning and ensemble methods. However, LIME also has its limitations. As it provides local explanations, it may not fully capture the global behavior of the model. Additionally, the quality of the explanation depends on the quality of the approximation, which can be affected by the choice of the simpler model and the perturbation strategy.
14.7 Future Directions In addition to the techniques presented here, there are many other approaches to explainability being developed. As artificial intelligence becomes more ingrained in our daily lives, the need for transparent, interpretable, and accountable models will only increase. The ongoing research in explainable AI seeks to address these challenges and ensure that the decisions made by these models can be understood and trusted by humans. As the field of explainable AI continues to evolve, new techniques and methods are being developed to provide better explanations for complex models. Some future directions in this area include the following:
14.7.1 Causal Inference Causal inference (Pearl 2009) is another area that is gaining traction in the field of explainable AI. By incorporating causal relationships between features and model outputs, causal inference methods can help provide more meaningful explanations and insights. This can lead to a better understanding of the underlying mechanisms driving model predictions and help identify potential biases or confounding factors in the data. For more information on causality, see Sect. 7.3.
14.7.2 Interactive Explanations Interactive explanations aim to provide more engaging and user-friendly explanations by allowing users to explore and manipulate model predictions (Amershi et al. 2019). This can be achieved through interactive visualization tools or conversational AI systems that can answer users’ questions about the model’s decision-making process. By making explanations more accessible and engaging, interactive explanations can help bridge the gap between complex AI models and human understanding.
14.7.3 Explainability Evaluation Metrics As more explainability techniques are developed, there is a growing need for standardized evaluation metrics to assess their effectiveness (Arrieta et al. 2020). These metrics should consider aspects such as the fidelity of the explanation (how well it approximates the original model), the interpretability of the explanation (how easily it can be understood by humans), and the usefulness of the explanation for decision-making. Developing robust evaluation metrics will help drive advancements in explainable AI and ensure that the explanations provided by these techniques are both accurate and meaningful.
14.8 Fewer Parameters In the end, as discussed in Sect. 14.2, the key factor contributing to the explainability of machine learning models is their number of parameters. Models with lower memory-equivalent capacity often have fewer parameters, which makes them inherently more explainable. The rationale behind this is rooted in the principle that simpler models are usually easier to understand and interpret (Tibshirani 1996; Lou et al. 2012). As the number of parameters in a model increases, it becomes more challenging for human users to comprehend the relationships between input features and the model's output. This complexity can make it difficult to ascertain how the model is making its predictions and to provide a clear explanation of its decision-making process. Conversely, models with fewer parameters tend to have a more straightforward structure, which allows users to better understand the underlying relationships between inputs and outputs (Caruana et al. 2015). Combined with symbolic methods, such as genetic programming (see Sect. 3.3.7), explainability becomes much easier when the model is sparse.
14.9 Exercises 1. Compare and contrast LIME (Ribeiro et al. 2016b), SHAP (Lundberg and Lee 2017), and counterfactual explanations (Wachter et al. 2017) as methods for improving interpretability of complex machine learning models. Discuss the advantages and disadvantages of each method and provide examples of real-world applications where each technique might be particularly useful. 2. Rule Extraction: Implement a decision tree (Quinlan 1986) and LASSO (Tibshirani 1996) based rule extraction technique for a given dataset. Compare the interpretability and performance of the extracted rules with the original model. Discuss the trade-offs involved in using these rule extraction techniques and their potential impact on model performance and generalizability.
3. Design an explainable model for a specific task or application (e.g., healthcare, finance, or image recognition). Use the principles of intelligible models (Lou et al. 2012) and sparsity (Tipping 2001) to guide your design. Evaluate the model’s performance and interpretability and discuss the trade-offs you encountered during the design process. 4. Analyze the relationship between the number of parameters in a machine learning model and its explainability. Experiment with different model architectures and parameter settings to demonstrate the impact of parameter reduction on model interpretability. Discuss the implications of your findings for the design of explainable models. 5. Select a real-world machine learning application (e.g., predicting patient outcomes, fraud detection, or natural language processing) and evaluate the interpretability of a chosen model. Apply at least two interpretability techniques (e.g., LIME, SHAP, or counterfactual explanations) to improve the model’s explainability. Analyze the trade-offs between model performance, generalizability, and explainability, and discuss the implications of your findings for the selected application.
14.10 Further Reading Here is a list of suggested further reading materials for those interested in learning more about interpretability, explainability, and related techniques in machine learning: • Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5), 1–42. • Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81– 106. • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. • Lou, Y., Caruana, R., & Gehrke, J. (2012). Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158. • Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun), 211–244.
Chapter 15
Repeatability and Reproducibility
Reproducibility is a cornerstone of the scientific process, as it allows researchers to independently verify and validate the findings of previous studies. Repeatability and reproducibility have already been discussed as fundamental properties of experiments in Chap. 4. As per Definition 4.4, an experiment is repeatable when it shows the same outcomes under minimally different circumstances, and, as per Definition 4.5, it is reproducible when it shows the same outcomes under specified circumstances. However, the fields of computer science and data science have been facing a reproducibility crisis, with many published results being difficult or impossible to reproduce (Baker 2016b). This crisis can be attributed to several factors, including the fact that the above definitions can, at first glance, seem subjective, as it is hard to objectively define what “specified” or even “minimally different” circumstances mean. As a result, many articles have been published that rely on proprietary software, give readers no access to raw data, use overly complex and poorly documented model configurations, or lack clear reporting standards. To address the reproducibility crisis in the context of the automated scientific process, several initiatives have been proposed. These include: • Encouraging the use of open-source software, which promotes transparency and allows researchers to inspect and modify the underlying code. • Promoting the sharing of raw data, model configurations, and code through public repositories. • Establishing clear reporting standards and guidelines to ensure that published research includes sufficient information for replication. • Fostering a culture of open science, which emphasizes collaboration, openness, and transparency in the research process. These initiatives increase the repeatability of results and the transparency of science. However, achieving actual scientific reproducibility requires further steps, which are discussed in this chapter. Furthermore, in Appendix D, this
book proposes a review form that can be adapted by conferences and journals to ensure the reproducibility of experiments presented in a paper submission.
15.1 Traditional Software Engineering The history of computing has seen little discussion about repeatability and reproducibility because most of computer science focuses on the synthesis of code, that is, programming. When programming, it is usually assumed that the same machine code will result in the same outcome whenever it is run. It is also assumed that language compilers and interpreters are free of errors and that the same high-level language code is always translated into the same machine code. In other words, repeatability is assumed, and the state of technology is such that this assumption is quite reasonable, the most notable exception being the handling of floating point numbers (Goldberg 1991). When talking about software, there is virtually no difference between repeatability and reproducibility. The difference would be between finding the minimum-length program and finding any program that obtains the result. Unfortunately, deciding whether a program is of minimum length is undecidable (see also the discussion in Sect. B.2). Any program that produces the desired output, given the specified input, is therefore automatically a reproducible result. Continuing the tradition of not having to spend many mental cycles on reproducibility in software engineering, many practitioners consider a machine learning experiment reproducible when they have uploaded the data, the code that produces the model, and the desired results to an online repository for anybody to download. This, however, is enough for repeatability but not for reproducibility. The following sections elaborate on this.
15.2 Why Reproducibility Matters Before investigating what constitutes reproducibility for a machine learning experiment, it is worth spending some time to understand why reproducibility is important. In the scientific process, knowledge is only created by reproducible results. That is, while a model whose predictions coincide with observations is the result of the process, for practical reasons, the circumstances of when and how the predictions are accurate usually need to be documented. The term “reproducible research” therefore refers to the concept that results of the scientific process should be documented in such a way that their deduction is fully transparent to anybody wanting to reproduce the experiment. This requires a description of the methods used to obtain the observations (data) as well as a complete description of the model. A result is reproducible if it can be confirmed by other researchers (for example, the reviewers of a paper). Once a result is reproducible, it constitutes knowledge.
That is, under the specified circumstances, humankind now has a model to predict a described phenomenon. The quality (and often the impact) of the knowledge gained is governed by the ease of the reproducibility and the relaxation of the circumstances. That is, the more general a result is, the more widely applicable it is, and the more potential impact the knowledge has. Results that can be reproduced by many people with very few preconditions are therefore generally preferable. This immediately leads us to why reproducibility is important: rather than trusting a piece of paper, by reproducing an experimental result, one can observe the outcomes personally and therefore confirm them as truth without the need to believe. Similarly, if a result cannot be reproduced, it is falsified and therefore needs to be reworked before it can be accepted as knowledge. In other words, reproducibility is the mechanism that allows science to be confirmed or falsified and eliminates the need for belief or trust, separating science from religion, political opinion, or hearsay.
15.3 Reproducibility Standards The current practice in machine learning is outlined in Chap. 3. A model is created and then tested by measuring accuracy against a validation set. Often, for a public task, many models are trained by different people, and there is competition for the model with the highest validation accuracy (or some other metric as defined by the organizer of the benchmark). It is often accepted that the model which achieves the highest accuracy constitutes the state of the art. Many benchmarks require that models can be downloaded by third parties so that the benchmark result can be repeated. However, does benchmarking and selecting a model in this way constitute a reproducible experiment? As mentioned earlier, repeatability can be guaranteed by making the code and data available. What repeatability shows is that given the training data, the code, and all hyperparameters (including initialization), and with no introduced randomness, the prediction experiment will yield the same accuracy on the same validation set. However, what is the knowledge gained from that? How general is the conclusion from this experiment? To maximize the knowledge gained from the experiment, the experiment should be made reproducible. That is, a description (specification) of the model should be given in paper form, and any third party should be able to reproduce the model and obtain the same accuracy on the validation set. Furthermore, if the model is to be general for “this type of task,” the generated model should also be able to perform with equal accuracy on a new validation set, chosen by a third party. The properties of this new validation set need to be specified in the same paper. Given a paper and a set of external practitioners reproducing the results with their own models built to specification and their own validation data collected to specification, one would be able to conclude that a model for “this task” can be built in the way described in the paper. Alternatively, the knowledge presented in the paper can be falsified with experiments showing that the specification of the validation set is incomplete
or overly general or that the model as described does not always yield the accuracy as reported. The burden of experimentation for a reproducible result is much higher than the burden of experimentation to win a benchmark, for the simple reason that winning a benchmark could be an anecdotal result and does not guarantee reproducibility. This coincides with showing that a certain medicine is effective in general versus only anecdotally. In short, gaining general, reproducible knowledge empirically is a much higher burden than obtaining anecdotal results. Unfortunately, in recent years, the culture of publishing in machine learning has lowered the burden of reproducibility in favor of good results. This is often referred to as the reproducibility crisis (Baker 2016a). Several professional organizations, including the Association for Computing Machinery (ACM), have therefore started working on reproducibility standards (Board 2018). The ACM differentiates three classes of experimental validation mechanisms: • Repeatability (same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation. • Reproducibility (different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts. • Replicability (different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently. Note that the ACM's definition of replicability is the one that is most compatible with the definition of reproducibility in this book (see also Definition 4.5). The differentiation of repeatability within and across teams of researchers, however, is an important point. The assumption is that the original experimental team may memorize state and apply it as implicit assumptions to the experiment. The only way to make sure implicit assumptions are made explicit in the specification is to have an independent team reproduce the result.
15.4 Achieving Reproducibility Achieving reproducibility (in the sense of ACM’s replicability or Definition 4.5), where another team is able to obtain the same results with independently developed
code artifacts, seems like a big challenge in view of extremely large models that cannot be easily described reproducibly in a paper (see also Chap. 9). However, one has to remember that the burden of proof is on the researcher. That is, a company coming up with a language model with billions of parameters has to show that its result is not anecdotal for it to be scientifically valid. A paper describing the model should not be accepted as a gain of scientific knowledge unless it has been made reproducible. Appendix D represents my suggestion for a peer-review process for scientific publications that fosters reproducibility. The good news is that if one wants to be reproducible, there are several mechanisms in this book that can make things easier. Most importantly, let us remember Corollary 3. It states that all models with the same memory-equivalent capacity (MEC) are equally capable of implementing the same functions. To reproduce a result, it therefore should not matter what type of machine learner was chosen or what architecture was created in detail. Since models of the same capacity are able to implement the same functions, the intellectual capacity is the main parameter that needs to be reported to guarantee reproducibility. Unfortunately, as we know from the relevant chapters in this book, measuring the exact MEC can sometimes be tricky. However, again, the burden of proof is with us, and we have the freedom to create the machine learner we want and can therefore choose to create one whose MEC is easier to measure. A challenge is that training only results in a local minimum error (see also the discussion in Chap. 16). Since MEC assumes perfect training, two machine learners with the same MEC may not converge to the exact same accuracy, because two training runs never result in the exact same outcome. However, the uncertainty of the training outcome can be addressed by doing many training runs and reporting the average accuracy over the many runs (see also Sect. 4.1.3). In summary, if one wants to solve an anecdotal problem for a specific customer, it may be fine to deliver a model that solves the exact problem for the customer. A customer may be satisfied with a rationale that the approach was democratized by publishing the problem specification along with the dataset and running a benchmark. The redundancy in the model-building approach then shifts the blame for any fault in the approach to the data collection. However, if one wants to solve the problem for all and any customers, then a reproducible approach is required because it creates systematic knowledge on how to actually solve the task in general. Furthermore, the following section discusses that reproducibility is actually not even the highest quality standard.
15.5 Beyond Reproducibility Reproducibility as defined in this chapter may seem like a very high bar as it goes beyond customer satisfaction. Of course, this is a science book with the primary goal of creating knowledge by automating the scientific process. Having said that, a model that is general has wider applicability than a model that is specialized. However, did we actually create any knowledge when we built a reproducible
Fig. 15.1 Knowledge creation obeys an implication hierarchy: Language is required to understand philosophy, math formalizes philosophy, and physics depends on math, chemistry on physics, and biology on chemistry. A model created in one field that implies a contradiction to a field that is higher in the hierarchy is typically considered wrong—independent of whether this model has been created manually or automatically
model for a task? The answer must be yes. Not only because of the content of the previous sections but also due to the fact that everybody can now solve the task by reproducing the results. So the knowledge created is of the form: “We know how to solve problem X.” As one can see, that is a contribution, but it leaves open further questions, including “Why is this the way to solve X?” or “How do we extend the solution of X to Y?” The traditional scientific approach implies an understanding, that is, knowing when a model can be applied and when not, and knowing the assumptions underlying it and what the pitfalls and disadvantages of applying the model are. Furthermore, models need to be consistent with other models, theories, and explanations. A contradiction between two models is called a paradox and is an indication of a lack of understanding of a phenomenon. Furthermore, the knowledge-creating fields obey an implication hierarchy as shown in Fig. 15.1. The arrow is to be understood as follows: a result in a discipline further down the chain (e.g., physics) that contradicts a result further up the chain (e.g., math) is considered incorrect. That is, truth in physics is a subset of truth in math. At the beginning of the chain, language can describe anything, philosophy restricts language to what is useful for reasoning, math formalizes philosophy, physics, as a science, restricts itself to observable and reproducible truth, chemistry relies on physics, and biology relies on chemistry. Applied sciences branch off fundamental sciences; for example, engineering branches off physics, medicine off biology, and law off philosophy. Similarly, a result in medicine is incorrect if it implies a violation of biology, chemistry, or physics. So to automate the scientific process, models created by machine learning should not just be reproducible but should also fit into the implication hierarchy. In fact, obeying the implication hierarchy can be more important than empirical accuracy. For example, in Sect. 7.5 the concept of an evocative model is discussed with the example of the ideal gas equation, which is empirically incorrect but greatly contributes to an understanding and is very consistent with the hierarchy of fields.
15.6 Exercises 1. Define and explain the difference between repeatability and reproducibility in the context of machine learning experiments. Provide an example of a situation where an experiment might be repeatable but not reproducible and vice versa.
2. The ACM defines three classes of experimental validation mechanisms: repeatability, reproducibility, and replicability. Discuss the differences between these three concepts and explain why each is important in machine learning research. Give an example of a machine learning experiment that would require each of these types of validation. 3. One of the main challenges in achieving reproducibility in machine learning experiments is the difficulty of describing large, complex models in a paper. Discuss some potential solutions to this challenge, and explain why measuring the memory-equivalent capacity (MEC) of a model is an important step in ensuring reproducibility. 4. The concept of an evocative model was introduced in Sect. 7.5 as a model that may not be empirically accurate but contributes to an understanding of a phenomenon and is consistent with the hierarchy of fields. Give an example of an evocative model in machine learning, and explain how it contributes to our understanding of the field. 5. The reproducibility crisis in science has led to a push for increased transparency and open science practices. Discuss some specific practices that can be adopted to increase transparency and reproducibility in machine learning research, and explain how these practices can benefit the field.
15.7 Further Reading The following resources provide further information on repeatability, reproducibility, and quality in machine learning experiments: • Mesnard, O., Kouskoulas, Y., Kondratyev, A., & Mironchenko, A. (2019). A survey on reproducibility in machine learning. arXiv preprint arXiv:1910.09170. This survey provides an overview of the reproducibility crisis in machine learning and discusses various approaches to achieving reproducibility. • Drummond, C. (2014). Reproducible machine learning. ICML Workshop on Reliable Machine Learning in the Wild. This paper discusses the importance of reproducibility in machine learning and proposes a framework for achieving reproducibility. • Stodden, V., Guo, P., & Ma, Z. (2013). Toward reproducible computational research: An empirical analysis of data and code policy adoption by journals. PloS one, 8(6), e67111. This paper analyzes the adoption of data and code sharing policies by academic journals and discusses the importance of such policies in achieving reproducibility. • Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227. This article discusses the importance of reproducibility in computational science and proposes strategies for achieving reproducibility in research.
• Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488. This article argues for the importance of open-source software in achieving reproducibility in computational science and discusses the benefits of sharing computer programs in research.
Chapter 16
The Curse of Training and the Blessing of High Dimensionality
This chapter presents the main technical challenges to a rigorous engineering discipline that automates the scientific process. The number one challenge to what is presented in this book and many other theories is the uncertainty introduced by training. The fact that training is not guaranteed to converge to a global minimum makes us look at engineering rigor through a blurry lens. Connected to this is the second challenge, which is also an opportunity: the behavior of models in high dimensionality. Reproducibility, which is described in Chap. 15, would be trivial to implement if training were guaranteed to converge to a global optimum every time. Explainability (see Chap. 14) would be much easier if a straight reduction of information from input to output, that is, a straight reduction into lower dimensional spaces, could always be the solution. However, modeling with higher dimensional spaces, that is, the introduction of virtual dimensions, allows one to create so-called embeddings that solve problems that are otherwise ill-posed (see Definition 2.8 and the discussion in Sects. 11.3.3 and 9.4).
16.1 Training Is Difficult The training process makes the field look at machine learning through a blurry lens. In general, training cannot be expected to easily converge to a global minimum, no matter the machine learning model used or the training procedure. This is due to the fact that training is best understood as a packing problem. Intuitively, as discussed in Chap. 6, the main goal is to pack as many rows as possible from the training data table (see Definition 2.1) into each parameter of the machine learner to maximize accuracy and generalization at the same time. This has been investigated more formally in articles such as Haussler (1995) and Arora et al. (2018), both for neural networks and for machine learning in general.
“Training a 3-Node Neural Network is NP-Complete” is a seminal paper by A. Blum and R. L. Rivest published in 1989. The paper investigates the computational complexity of training a simple neural network with just three nodes. The authors prove that training such a small neural network is an NP-complete problem and therefore belongs to the hardest problems in computer science (see also Appendix B). The authors consider a network with two hidden nodes and a single output node, each computing a linear threshold function of its inputs. The goal of training the network is to find the weights and thresholds that minimize the error between the network's output and the target output for a given set of input–output pairs. Blum and Rivest show that this optimization problem is NP-complete by reducing the well-known NP-complete set-splitting problem to it. They demonstrate that an instance of the set-splitting problem can be transformed into an equivalent problem of training a 3-node neural network. The paper's results imply that training even small neural networks is computationally challenging. Similar proofs exist for the decision tree problem (Hyafil and Rivest 1976) and symbolic regression (Virgolin and Pissis 2022). In general, the process of tuning kernel or any other parameters of any machine learner is a complex optimization problem. As a consequence, the problem is approximated using methods such as grid search (Bergstra and Bengio 2012), gradient descent (Ruder 2016), or Bayesian optimization (Snoek et al. 2012). The computational expense of even these approximation techniques can be high, especially in high-dimensional spaces. What makes things even worse is that knowing whether the process actually converged to a global minimum error is impossible (again, see the discussion in Appendix B). Even to determine how far the process is away from that global minimum error, one would need to know the entire function that one is trying to fit. In general, this is a catch-22, as there would be no need to fit a function that is explicitly known. As a result, assumptions are usually used to determine when to stop the training algorithm. Of course, with all that uncertainty involved, as discussed in Chap. 3, it can be extremely tempting to center the entire process solely around accuracy.
16.1.1 Common Workarounds To speed up the process, even the approximate training problem is often tackled with workarounds. These are discussed as follows.
Hardware Support Hardware support can enable more complex models to be trained efficiently, which can in turn improve performance on the packing problem. This is particularly true for large neural network models, which can benefit greatly from hardware
acceleration. One common hardware accelerator for deep learning is the graphics processing unit (GPU). GPUs were originally developed for video games and other graphics-intensive applications, but they are well suited for training neural networks due to their ability to perform many parallel computations simultaneously. In practice, this allows for significant speedups in training times compared to using traditional CPUs alone (Wu et al. 2019). Another type of hardware accelerator is the tensor processing unit (TPU), which is a custom-built chip designed specifically for neural network training and inference. TPUs are optimized for matrix multiplication, which is a key operation in training neural networks, and can offer even greater speedups compared to GPUs (Jouppi et al. 2017). While hardware support can be an important tool for accelerating neural network training, it is important to note that it is not a silver bullet. The choice of hardware must be balanced against other factors, such as the cost and power consumption of the hardware, as well as the size and complexity of the model being trained. In some cases, it may be more efficient to train simpler models using CPUs rather than more complex models using GPUs or TPUs (Krizhevsky et al. 2017). Furthermore, hardware support alone does not solve the packing problem. Middleware or compilers that map the topological description of the neural network onto instructions for the architecture of the hardware have to solve another packing problem: that of packing calculation vectors (SIMD instructions) into parallelizable units. In practice, packing problems can be made easier by choosing constant-size packaging, especially when the overall volume of the packets is greater than the amount of content to pack (Leung and Wei 2014; Carrabs et al. 2019). In conclusion, hardware support can be an important tool for accelerating neural network training, but it is not a solution to the packing problem itself. Reducing the complexity of the model and/or changing the architecture can bring significant training efficiency improvements as well. A careful balance must be struck between the complexity of the model being trained, the hardware being used, and other factors such as the cost of hardware acquisition or renting and power consumption.
Early Stopping Early stopping is a technique originally invented to prevent overfitting by stopping the training process before the model starts to memorize the training data (Prechelt 1998). It monitors a validation metric and stops training when the metric stops improving or starts to degrade. The idea is to find a point where the model generalizes well to new data while minimizing the risk of overfitting, since the exact point for stopping is indeterminable anyway. In essence, it is a regularization technique that has the positive side effect of reducing the training time. A negative side effect is that it effectively reduces the memory-equivalent capacity of the model in an opaque way, without actually reducing the parameter footprint. As discussed in Chaps. 15 and 14, this impacts reproducibility and explainability.
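A minimal, self-contained sketch of the technique, with synthetic data, an arbitrary small network, and an illustrative patience setting, might look as follows (assuming PyTorch):

import torch

torch.manual_seed(0)
X = torch.randn(200, 4); y = (X.sum(dim=1) > 0).float()      # training data
Xv = torch.randn(100, 4); yv = (Xv.sum(dim=1) > 0).float()   # validation data
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()
best, wait, patience = float("inf"), 0, 5
for epoch in range(1000):
    opt.zero_grad()
    loss_fn(model(X).squeeze(1), y).backward()
    opt.step()
    with torch.no_grad():
        val = loss_fn(model(Xv).squeeze(1), yv).item()
    if val < best - 1e-4:
        best, wait = val, 0      # validation improved: keep training
    else:
        wait += 1
        if wait >= patience:     # no improvement for `patience` epochs: stop
            break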
Transfer Learning Transfer learning is a technique that leverages pre-trained models to improve the performance of related tasks (Pan and Yang 2009). The main idea is to use the knowledge learned from one task to initialize the model for another task, reducing the amount of required training data and time. Transfer learning typically involves using pre-trained convolutional neural networks (CNNs) for image classification or pre-trained transformers for natural language processing tasks. The pre-trained models are usually fine-tuned on the target task with a smaller learning rate to adapt the model to the new task without losing the previously learned features. Transfer learning has become a widely used technique in machine learning for leveraging pre-trained models to improve the performance of related tasks. However, transfer learning can also introduce challenges for reproducibility and explainability due to the difficulty in accurately calculating the model capacity. When using transfer learning, the pre-trained model already has a certain level of its capacity filled with data, which is difficult to quantify accurately. Furthermore, when the pre-trained model is fine-tuned on a new task, it is possible that the model will experience catastrophic forgetting, where previously learned knowledge is lost due to an overflow of capacity or other reasons (French 1999). This can make it difficult to reproduce the results of the model and can make it challenging to explain its decision-making processes. Therefore, it is important to carefully consider the use of transfer learning and to take steps to mitigate its potential limitations.
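A minimal fine-tuning sketch might look as follows, assuming PyTorch/torchvision and a hypothetical 10-class target task; only the newly added head is trained:

import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                               # freeze pre-trained features
model.fc = torch.nn.Linear(model.fc.in_features, 10)      # new task-specific head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)  # small LR, head only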
Model Selection and AutoML Transfer learning (see Sect. 16.1.1) can be combined with model selection to create an almost automatic machine learning experience (AutoML). The goal in an AutoML pipeline is to automate the process of selecting the best machine learning model for a given task from a so-called model zoo. One approach to model selection in AutoML is to use meta-learning, where a meta-learner is trained to select the best model based on the performance of a set of candidate models on a validation set. Meta-learning has been shown to be effective for a variety of tasks and domains, including image classification (Zoph et al. 2018; Liu et al. 2018) and natural language processing (Pham et al. 2018; So et al. 2019). Another approach to model selection in AutoML is to use Bayesian optimization, which involves modeling the relationship between the model’s hyperparameters and its performance and using this model to select the next set of hyperparameters to evaluate. Bayesian optimization has been shown to be effective for optimizing the hyperparameters of deep neural networks (Snoek et al. 2012; Bergstra et al. 2011, 2013). In summary, model selection is often automated by delegating it to even more parameter estimation and training techniques—which in most cases decreases the reproducibility and interpretability of the outcome.
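As a minimal illustration of automated model selection, the following sketch uses scikit-learn's exhaustive grid search with cross-validation; the dataset, the random forest model, and the grid are arbitrary choices standing in for a more elaborate AutoML pipeline:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # selected configuration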
16.2 Training in Logarithmic Time While Sect. 16.1 showed that training is extremely hard, this section illustrates that some problems can be trained extremely fast and that the right assumptions and insights into the problem can make a significant difference in specific cases. Let us consider the following thought experiment. Suppose we have a binary number that we want to train a machine learning model to memorize. Let the binary number be represented by the vector of digits $\vec{x} = (x_1, x_2, \ldots, x_n)$, where $x_i \in \{0, 1\}$ is the i-th bit of the number. We want to train a model to predict the value of each bit given the previous bits. Let $y_i$ be the predicted value of the i-th bit. We can define the loss function L as the L1 distance (see Sect. 3.3.1) between the predicted number and the actual number: $L = |y - x|$. The maximum error is then given as $L = |y - x| = 2^n - 1$, for example, when $\vec{x} = (0, 0, 0, \ldots, 0)$ and $\vec{y} = (1, 1, 1, \ldots, 1)$. Suppose we start by training the model to predict the most significant bit, $x_n$. Let $y_n$ be the predicted value of $x_n$. Since $x_n \in \{0, 1\}$, the error after predicting $x_n$ correctly is now bounded by $L = |y - x| \leq 2^{n-1} - 1$. Next, we train the model to predict the second-most significant bit, $x_{n-1}$, given the most significant bit $x_n$. The error is now bounded by $L = |y - x| \leq 2^{n-2} - 1$. We can continue this process for each subsequent bit, training the model to predict the i-th bit given the previous bits, and the error in iteration i will be bounded by $L_i \leq 2^{n-i} - 1$. Therefore, if we train the model in this order, from the most significant bit to the least significant bit, the error between the actual number and the trained number decreases exponentially as training progresses (see Fig. 16.1).
Fig. 16.1 While training is difficult in general, with the right assumptions and knowledge about the problems, even exponential decay of the error is possible. Such exponential decay implies a logarithmic number of training steps in the complexity of the trained data
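The exponential decay can be checked numerically; the following short sketch, with an arbitrary 8-bit example number, simply fixes one bit per step from most to least significant:

# Numerical check of the thought experiment: after learning bit i correctly,
# the maximum remaining error is bounded by 2^i - 1.
n = 8
target = 0b10110101
learned = 0
for i in range(n - 1, -1, -1):
    learned |= target & (1 << i)          # "train" bit i correctly
    print(f"bits learned down to {i}: max remaining error = {(1 << i) - 1}")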
The conclusion of this thought experiment, although not a universal proof applicable to all machine learning issues, demonstrates the worth of employing the appropriate assumptions and leveraging any available prior knowledge during model training. Notably, under the right circumstances, even exponential error decay and hence logarithmic training time in the size of the problem are achievable. It is important to note that in this thought experiment, the model type, architecture, or training algorithm was not considered. Thus, it becomes clear that this is equally true even when training a black box.
16.3 Building Neural Networks Incrementally One of the striking differences between machine learning practice and the practice of most other engineering disciplines is that most other engineering disciplines have smallest building blocks (e.g., bricks) that are combined using various rules and according to best practice. While the smallest bricks of deep learning could be neurons, neural networks are often assumed as “given.” The assumption is also that it is easier to perform transfer learning on a model that already exists rather than building a model from scratch. Algorithm 14 provides a simple alternative for training neural networks by building them incrementally, neuron by neuron. While this algorithm does not guarantee any better outcome either, it allows for significantly more flexibility compared to being forced to train a given network. Each neuron can be constrained by an assumption individually, which leads to self-regularization with regard to the number of neurons. It is straightforward to extend the algorithm to multiple layers. Another interesting property of using Perceptron Learning (see Algorithm 5) is that Perceptron Learning is guaranteed to converge to the optimal solution on linearly separable data (albeit in potentially exponential time). This way of building networks also corresponds to how a human would solve a packing problem, putting items into one box at a time.
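As an illustration of this incremental approach, the following minimal sketch grows a single hidden layer one neuron at a time; scikit-learn's Perceptron stands in for the perceptron learning step, and a simple majority vote stands in for the combination step of Algorithm 14:

import numpy as np
from sklearn.linear_model import Perceptron

def grow_layer(X, y, max_neurons=10):
    # Each new neuron is trained only on the still-misclassified examples.
    neurons, wrong = [], np.arange(len(y))
    while len(neurons) < max_neurons and wrong.size > 0:
        if len(np.unique(y[wrong])) < 2:     # perceptron needs both classes
            break
        neurons.append(Perceptron(max_iter=1000).fit(X[wrong], y[wrong]))
        votes = np.mean([n.predict(X) for n in neurons], axis=0)
        wrong = np.where((votes >= 0.5).astype(int) != y)[0]
    return neurons

X = np.random.default_rng(0).normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # not linearly separable
hidden = grow_layer(X, y)
print(len(hidden))                           # number of neurons grown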
16.4 The Blessing of High Dimensionality High dimensionality often leads to what is known as the curse of dimensionality, which refers to the difficulties and challenges that arise when dealing with data in high-dimensional spaces (Bellman 1961). Some of the issues include the exponential increase in the amount of data required to maintain a given level of accuracy, the increased computational complexity, and the degraded performance of many machine learning algorithms. For example, nearest neighbor methods become less effective as dimensionality increases because the distances between neighbors become more similar, making it difficult to distinguish between relevant and irrelevant points (Beyer et al. 1999). This obviously has an impact on the notion of generalization as discussed in Chap. 6. This challenge, however, can be used as
Algorithm 14 An algorithm to automatically grow a neural network. Training networks incrementally leads to a divide-and-conquer approach for the packing problem and automatically regularizes the size of the network
Load training data and labels
Start with one neuron
Train the weights connecting the input layer to the first neuron until a constraint is reached (e.g., accuracy threshold or capacity)
for n = 2 to N do
  Add a neuron to the hidden layer
  repeat
    Train the weights for the new neuron using perceptron learning on the data misclassified by the previous neurons
  until a constraint is reached (e.g., accuracy threshold or capacity)
  Combine the new neuron with the existing neurons in the hidden layer
  if no more data to train then
    break
  end if
end for
Add output layer
Randomly initialize all weights
Train using backpropagation until a constraint is reached
a feature. In fact, it is the feature that distinguishes computational methods from methods in other scientific fields. The idea of projecting into higher dimensionality originates in Cover's theorem, which states that any input becomes linearly separable when projected into a high-enough dimensional space (Cover 1965); see also the discussion in Chap. 5. The theorem can be expressed as follows:
Theorem 16.1 (Cover's Theorem)
$$P(\text{linear separability}) = \frac{1}{2^{N-1}} \sum_{i=0}^{d-1} \binom{N-1}{i},$$
where P is the probability of linear separability, N is the number of points, and d is the dimensionality of the space. As the dimensionality increases, the probability of linear separability approaches 1. This means that an otherwise not-so-well-posed problem can be made well-posed (see Definition 2.8), which, again, is a powerful tool that separates the field of machine learning from other disciplines. An important example of a transformation from a low-dimensional space into a high-dimensional space in machine learning can be found in the word2vec algorithm, which projects words into a high-dimensional space (a so-called embedding) to capture semantic relationships between them (Mikolov et al. 2013). See also the detailed discussion in Sect. 11.3.3. The word2vec problem is of a nature where the difference between inputs (e.g., using syntactical comparisons) is not readily measurable. That is, well-posedness is hard to qualify. However, if the problem is actually chaotic and tiny differences in the input can lead to large differences in the output (see Chap. 11), the machine learner will just overfit.
In general, projections into higher dimensionality are not a new trick. To see this, let us use the definition of fractal dimension (see Definition 4.15) from Chap. 4 to infer the following.
Corollary 14 (Equivalence of Digit and Dimension) The geometric interpretation of a digit of a number is a dimension.
Proof The decision tree branch (drawn as an inverted “V” for binary numbers) is the geometric self-similarity that represents a digit: when we write a number, every time we write down a digit of the number, we make a decision to write 1 out of b symbols on the piece of paper (in the case of binary, the decision is between “0” and “1”). That is, we follow the path of a decision tree with $b^D$ branches, where b is the base of the number system and D is the total number of digits we write. The self-similarity of any number is therefore the digit, and the magnification factor I is the base b, that is, $I = b$. Any number N can therefore be represented by $D = \log_I N$ digits.¹ Therefore, setting $I = b = 2$, it is easy to see that the formula in Definition 4.15 is identical to Eq. 4.5, omitting the ceiling operator and the addition of 1. That is, Formula 4.15 is a more accurate version of Eq. 4.5. The result for $I = 2$ can therefore also be assigned the unit bits. The difference is that the fractal dimension allows fractional bits. The notion of a fractional bit or a fractional digit can be counterintuitive at first but becomes clearer when remembering that b is a decision base (see Sect. 4.2). If D is a ceiled integer (as in Eq. 4.5), we get the maximum number of branches. The fractal dimension is simply more accurate as it takes into account that some decisions do not have to cover the full range b. □
As discussed in Sect. 7.2.2, using the binary system, that is, increasing the dimensionality of the numbers to work with relative to the decimal system, allowed the creation of the automatic computer in the first place. This proves the point that projections into higher dimensionality are not a new trick. The method is, however, quite new to the scientific process (see Chap. 1) in that only automatic computational methods really exploit these projections. Of course, once a problem has become linearly separable, it can be easily solved even by a single neuron. In other words, training can become a lot easier. As an example, Fig. 16.2 is a demonstration of how XOR and Boolean equality can be linearly separated with a single neuron by projecting the input into a higher dimensional space. This works around the discussion by Minsky (see also Chap. 5) that originally caused the first AI winter. Note that the projection function from binary to one-hot encoding is an isomorphism. That is, the projection is not feature extraction. This can be seen as follows. Given a binary sequence $b_{n-1} b_{n-2} \ldots b_0$, the one-hot encoding OH is defined as a sequence of $2^n$ elements such that
¹ Note that this also extends to unary numbers, since the decision base is still $b = 2$.
Fig. 16.2 The blessing of high dimensionality: As discussed in Sect. 5.2, the 2-variable Boolean functions $f_6$ (XOR) and $f_9$ (equality) have an information content of 4 bits and can therefore not be modeled by a single 2-input neuron that has a capacity of about 3.8 bits. However, by projecting the input into a higher dimensional space (here, one-hot encoding the input), the single neuron's capacity can be increased by another input weight and is able to model both functions, thereby showing that a single neuron can model all 16 Boolean functions, despite what was prominently discussed in Minsky and Papert (1969)
$$OH(i) = \begin{cases} 1 & \text{if } i = b_{n-1} b_{n-2} \ldots b_0 \text{ in binary} \\ 0 & \text{otherwise,} \end{cases}$$
where $i \in \{0, 1, \ldots, 2^n - 1\}$. The reverse process, or decoding from one-hot encoding to binary, is then defined as
.
where
OH (i) = 1.
From the data processing inequality (see Sect. 7.1.1), projection into higher dimensionality may make a string longer or a model appear more complex, but it does not create any information. A one-hot encoded number does not contain more bits of information than a binary number even though it is longer. So it cannot be information gain or hidden dynamics (latent variables) that creates the blessing. A physical intuition for why higher dimensional data is easier to model is the following: Instead of packing a set of complicated (input) shapes into a complicated (output)space, we shred the input into a large amount of small pieces. Now, the shredded pieces are much easier to pack into the output space. In other words, the packing problem (see Sect. 16.1) becomes significantly easier to solve (or even just to approximately solve) if the items become atomically sized. This is, of course, well known in the theory of computer science community, see for example Epstein and van Stee (2002). In the machine learning community, the observation that training becomes easier in higher dimensions has been attributed to the so-called lottery ticket hypothesis (Frankle and Carbin 2019). This hypothesis suggests that within large neural
networks, there exist smaller subnetworks (referred to as "winning tickets") that can achieve comparable performance when trained from their initial random weights. These smaller subnetworks can be found through a process of iterative pruning and retraining, potentially resulting in more computationally efficient models. Of course, for an individual, winning the lottery becomes easier with the number of tickets purchased.

However, there are trade-offs to consider when using high-dimensional representations. While the increased dimensionality can make problems more tractable, it typically leads to a loss of interpretability of the model (see Chap. 14), decreased incremental improvability and engineering transparency (see Chap. 17), and increased difficulty in debugging (see Chap. 13). Last but not least, the amount of unseen input grows exponentially when projecting inputs into higher dimensional space. Let us use the 2-variable Boolean functions as an example. The 2-variable Boolean function space consists of four configurations: (0, 0), (0, 1), (1, 0), and (1, 1). These combinations can be represented in the original, or base, space. Now, consider encoding these combinations into a 3-dimensional one-hot-encoded space. In this scenario, we can still represent the original four configurations, but we also introduce new, unseen configurations: (0, 1, 1), (1, 0, 1), (1, 1, 0), and (1, 1, 1). These unseen configurations do not directly correspond to any configuration in the original 2-variable Boolean space and are thus irrelevant in the context of the original Boolean function. However, some of these unseen configurations could (accidentally) become inputs when the model is used. For instance, in a Generative Adversarial Network (GAN) setting, as referenced in Sect. 9.2, the model might be fed noise, which could fall into these "unseen" configurations. The term "hallucination" refers to a phenomenon where the model produces outputs for unseen inputs that do not strictly conform to the patterns and regularities it learned from the training data, hence going beyond its generalization capabilities (Brown et al. 2023). I hypothesize that hallucinations could be the model's response to encountering unseen or unlearned configurations in its input space, as it is forced to venture beyond its ability to generalize from its training data.
16.5 Exercises

1. Research and discuss various advanced optimization techniques (e.g., Momentum, RMSProp, Adam) used to improve the convergence of training deep learning models. In a second step, try the measurement methods discussed in this book. For example, one can scale the learning rate in a neural network by the fractal dimension of the observed error curve during training. Compare and contrast all techniques in terms of their strengths, weaknesses, and applicability to different types of problems.
2. Explain how capacity is impacted by methods like early stopping and other assumptions on the training convergence.
3. Explore the concepts of dynamic and lifelong learning in neural networks, focusing on their relevance for incrementally building neural networks. Discuss the challenges and potential solutions for maintaining performance, mitigating catastrophic forgetting, and efficiently incorporating new knowledge in these learning systems.
4. Explain how the memory-equivalent capacity of a neuron changes as the dimensionality of the input grows. Contrast this with Theorem 16.1 and explain how it is consistent.
16.6 Further Reading

• Torquato, S. (2009). Random Heterogeneous Materials: Microstructure and Macroscopic Properties. Springer Science & Business Media. This book discusses the packing problem in the context of materials science, examining the microstructure and macroscopic properties of random heterogeneous materials.
• Fahlman, S. E., & Lebiere, C. (1990). The Cascade-Correlation Learning Architecture. Advances in Neural Information Processing Systems. This paper introduces the cascade-correlation learning architecture, which is an approach for incrementally building neural networks by gradually adding hidden nodes during the training process.
• Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "Nearest Neighbor" Meaningful? Proceedings of the 7th International Conference on Database Theory. This paper delves into the issue of high dimensionality, specifically addressing the implications of the "curse of dimensionality" in the context of nearest neighbor search algorithms.
Chapter 17
Machine Learning and Society
The impact of machine learning on society has been profound, with significant advancements in various fields such as medicine, finance, and transportation. However, the rapid adoption of these technologies has also led to mixed societal reactions, with some people embracing them as the key to a better future, while others express concerns about their potential negative consequences (Brynjolfsson and McAfee 2017). This chapter discusses some of the societal issues that the engineering field of machine learning will have to face, along with some technological suggestions.
17.1 Societal Reaction: The Hype Train, Worship, or Fear

The public perception of machine learning is often characterized by polarized attitudes, with one camp heralding it as the panacea for numerous challenges and the other expressing trepidation about its potentially detrimental effects, such as job displacement and privacy invasion (Bostrom 2014). As machine learning becomes more prevalent, there is a growing concern that the reliance on these algorithms may lead to the erosion of critical thinking and problem-solving skills. This could have long-term effects on society as individuals become more dependent on machines to make decisions for them, rather than fostering their own analytical abilities (Carr 2010). This has already been exemplified historically with the introduction of pocket calculators: Arithmetic was part of the skill set of every merchant. With the pocket calculator at hand, the human skill has been replaced by a dependency on technology. Today's Large Language Models, for example, are sometimes able to serve as assistants to more senior people even in creative jobs such as marketing and advertising, replacing the need for junior people.
As AI systems become increasingly sophisticated, concerns about the displacement of human workers due to the diminishing need for human skill have intensified (Frey and Osborne 2017). While automation can boost efficiency and productivity, the potential sudden loss of jobs must be carefully considered, and appropriate measures should be taken to ensure a fair transition into new roles for those affected. Among those roles could be education about the usage, engineering, and safety of machine learning and AI systems.

As machine learning permeates various sectors, ensuring the usability, quality, safety, and security of these systems is imperative. On the engineering side, an immediate thought is data engineering, governance, and privacy. For example, during the training process, machine learning models can inadvertently store sensitive information, thereby jeopardizing user privacy (Song et al. 2017). Ensuring that models do not retain confidential data is crucial in maintaining user trust and upholding ethical standards in AI development.

The substantial energy consumption associated with training and deploying large-scale machine learning models warrants a critical examination of their environmental impact. The increasing demand for computational resources exacerbates greenhouse gas emissions and climate change, necessitating the development of energy-efficient algorithms (Strubell et al. 2019). Engineering work, as suggested in this book, can optimize model efficiency and therefore energy use.

Thinking about quality, machine learning models often rely on large datasets that may contain inherent biases, resulting in AI systems that inadvertently perpetuate and amplify these biases when making decisions (Bolukbasi et al. 2016). Developing fair, transparent, and accountable AI systems is essential in minimizing the potential harm caused by biased decision-making; in general, as discussed in Sect. 13.3, balanced decision-making is better decision-making.

On the security side, one immediate thought is adversarial examples. Adversarial examples are intentionally crafted inputs designed to cause machine learning models to produce erroneous predictions or classifications (Szegedy et al. 2013b). They are explained in Sect. 6.5 and proposed as a testing mechanism in Chap. 13. However, these examples also pose a considerable security risk, as they can deceive AI systems, potentially leading to disastrous consequences in critical applications such as autonomous vehicles or cybersecurity.

Last but not least, although machine learning holds great promise in revolutionizing various aspects of our lives, the value of manual science, which offers a more transparent and incremental approach to knowledge discovery, should not be overlooked (Leonelli 2016). By embracing both machine learning and manual science, we can foster a more balanced and accountable ecosystem of knowledge production and innovation.
17.2 Some Basic Suggestions from a Technical Perspective

As with any technology, machine learning can be used for both positive and negative purposes. Knives, for instance, have been used by terrorists to hijack airplanes and
harm innocent people, yet their most common use is cutting food in our everyday lives. An outright ban on knives is not feasible. So the goal is, as with any other technology, to maximize the benefits and minimize the risks of AI and machine learning. This section offers some suggestions from a technical perspective.
17.2.1 Understand Technological Diffusion and Allow Society Time to Adapt

Societies inevitably adapt to new technologies, but the pace of adaptation must be taken into account (Rosenberg 1994). For example, candle makers have mostly been rendered obsolete since the advent of electric lighting. Memetic adaptation occurs at a certain rate, and exceeding that rate can lead to political and economic struggles. The term technological diffusion (Rogers 2003) describes the process by which innovations spread within and across societies over time. It involves the adoption of new technologies by individuals, organizations, and industries, which can be influenced by factors such as cost, accessibility, and perceived benefits. The rate of diffusion can vary significantly depending on the nature of the technology, its compatibility with existing systems, and the level of awareness and support from various stakeholders. Understanding technological diffusion is crucial for policymakers, businesses, and researchers, as it informs decision-making, investment strategies, and the overall impact of innovations on society.
17.2.2 Measure Memory-Equivalent Capacity (MEC)

As discussed in Sect. 5.2, a way to understand the capabilities of AI systems is to measure their memory-equivalent capacity (MEC). According to Corollary 3, MEC is a universal limit. By measuring and quantifying the limits of models, we can make more informed decisions about how to deploy these technologies in a safe and responsible manner. For example, society accepts pocket knives as generally harmless, but brandishing a machete in a bank would be problematic. Long story short: effective size matters.
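One rough, hedged way to probe this empirically — a sketch of my own, not the book's exact measurement procedure — is to check up to which size a model can still memorize balanced random labelings; scikit-learn and the helper name memorizes are assumptions here, and optimization failures can make the estimate conservative.

import numpy as np
from sklearn.neural_network import MLPClassifier

def memorizes(n, d=8, hidden=(4,), seed=0):
    # Can the network reach 100% training accuracy on n random binary labels?
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = rng.permutation(np.arange(n) % 2)  # balanced random labeling
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=5000, random_state=seed)
    clf.fit(X, y)
    return clf.score(X, y) == 1.0

n = 2
while memorizes(n):  # double n until memorization (and thus capacity) breaks
    n *= 2
print(f"memorization of balanced random labels breaks below n = {n}")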
17.2.3 Focus on Smaller, Task-Specific Models

As explored in Chap. 14, smaller models consume less energy (Schwartz et al. 2019) and are more understandable (see Chap. 14), less prone to errors (see Chap. 6), and easier to control (see Chap. 16). Society tends to adapt more easily to AI designed for specific tasks; for instance, very few people complained about spell-checking in word processing when it was introduced. Training smaller machine learning
models for specific tasks can alleviate some concerns associated with large-scale AI systems. By focusing on the engineering of smaller, more efficient, and task-specific models, we can harness the power of AI while mitigating some of the associated risks.
17.2.4 Organic Growth of Large-Scale Models from Small-Scale Models

Large-scale models can be developed organically from many smaller-scale models. By integrating smaller models designed for specific tasks, we can create a more robust and adaptable large-scale AI system while minimizing the risks and challenges associated with monolithic models. The truth is that most commercial AI systems do not actually consist of a single model. They consist of an interaction between several machine learning models, combined with traditional software and sometimes hardware.
17.2.5 Measure and Control Generalization to Solve Copyright Issues

Striking a balance between generalization and memorization in AI systems is crucial. The less generalization, the more an AI system is simply memorizing human work, akin to a search engine (see Chap. 6). Controlling the degree of generalization can help maintain the system's acceptability, as, for example, search engines are still able to provide references and attribution to the authors of the original work. A Large Language Model generalizes from its sources, making the identification of the originals more difficult. However, the Data Processing Inequality states that information cannot be created, only recombined (see Sect. 7.1.1). This immediately raises copyright issues that must be addressed. As AI systems generate (perceivedly) new content or ideas, questions of ownership and intellectual property rights will become increasingly relevant. Policymakers and stakeholders must work together to clarify these legal and ethical concerns; in short, how much generalization is required to constitute (new) original work?
17.2.6 Leave Decisions to Qualified Humans

Another pertinent ethical concern revolves around the role of AI in decision-making processes. It is vital to understand that AI, in its current form, serves as a tool that can aid in decision-making, similar to other instruments we use.
Societies have already established strict regulations regarding decision-making authorities. Within organizations, typically only a select few, such as the CEO, have the power to make impactful decisions. In government structures, decision-making is even more regimented, with elected individuals bearing this responsibility. Bearing this in mind, when we consider integrating AI into our societal structures, two fundamental principles should guide us:

1. An AI cannot be the primary decision-maker, as it lacks the capability to be held accountable. Thus, AI should serve only to aid human decision-makers.
2. All decisions should be feasible without AI assistance. In other words, the use of AI should remain optional, ensuring a more gradual and controlled integration into society.

The first principle aims at the preservation of human accountability in any decision-making process. Simultaneously, the second principle promotes a gradual societal adaptation to AI, reminiscent of how children learn arithmetic manually before using calculators. These principles aim to preserve human judgment and accountability while leveraging the benefits that AI can bring to our decision-making processes.
17.3 Exercises

1. Discuss the potential consequences of widespread adoption of machine learning technologies on the job market. Propose potential strategies to mitigate negative impacts.
2. Explore the concept of adversarial examples (see Sect. 6.5) as a tool for society (in contrast to their use in Chap. 13).
3. Analyze the role of transparency in the ethical deployment of AI systems. How can we design more accountable and interpretable machine learning models?
4. Research the environmental impact of machine learning, particularly in terms of energy consumption. Propose strategies for reducing the carbon footprint of AI technologies. Hint: You may use concepts described in this book.
17.4 Further Reading

• Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
• Brynjolfsson, E., & McAfee, A. (2017). The business of artificial intelligence. Harvard Business Review.
• Carr, N. (2010). The shallows: What the internet is doing to our brains. W. W. Norton & Company.
• Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change.
• Leonelli, S. (2016). Data-centric biology: A philosophical study. University of Chicago Press.
• Naam, R. (2013). The infinite resource: The power of ideas on a finite planet. UPNE.
• Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Appendix A
Recap: The Logarithm
In the simplest interpretation, the logarithm counts the number of occurrences of the same factor in repeated multiplication; e.g., since $1000 = 10 \times 10 \times 10 = 10^3$, the "logarithm base 10" of 1000 is 3, or $\log_{10}(1000) = 3$. The logarithm of x to base b is denoted as $\log_b(x)$ or, without parentheses, $\log_b x$. When no confusion is possible or when the base does not matter, the base can be omitted, e.g., $\log x$. Furthermore, consistent with Definition 4.15, the logarithm can also be denoted as a fractional exponent: $\log_b x = x^{\frac{1}{b}}$. Formally, the logarithm is the inverse function to exponentiation. That means the logarithm of a given number x is the exponent to which another fixed number, the base b, must be raised to produce that number x. Formally, $b^a = x \iff \log_b x = a$ for $a, b, x \in \mathbb{R}$. However, the logarithm does not always have a solution, see below. The following definitions hold:
$$\log_0 x = \text{undefined}. \tag{A.1}$$

This is easy to see as $\log_0 x = x^{\frac{1}{0}}$ and division by zero is undefined. ∎

Note that while $x^0 = 1$ for any $x > 0$, $0^0$ is undefined. This is, again, easily demonstrable in the fractional notation using exponential laws: $1 = \frac{x^n}{x^n} = x^{(n-n)} = x^0$ for all x. However, multiplying 0 by itself is $0^n = 0 \times 0 \times \cdots \times 0 = 0$, even for $n = 0$. This is a contradiction and therefore leads to $0^0 = \text{undefined}$.
$$\log_1 x = \text{undefined}. \tag{A.2}$$

This is because of its definition $b^a = x \iff \log_b x = a$. Since $1^a = 1$ for any a, there is no solution for x unless $x = 1$. ∎

Note that the fractional notation $x^{\frac{1}{1}}$ cannot be used to express logarithms base 1 since $x^{\frac{1}{1}} = x^1 = x$, which prevents us from making the mistake of trying to execute a logarithm base 1.
$$\log_b 0 = \text{undefined}. \tag{A.3}$$

No number a exists such that $b^a = 0$ since even $b^0 = 1$. ∎

$$\log_b 1 = 0. \tag{A.4}$$

$b^0 = 1$ for any b. ∎

$$\log_b b = 1. \tag{A.5}$$

$b^1 = b$ for any b. ∎

In addition, there are various algebraic manipulations, often coined as "logarithm laws," that cover the cases of negative and fractional logarithms. I leave the derivation of those to the reader. Throughout, $x, y, b \in \mathbb{R}^+$, $k \in \mathbb{R}$, $b > 1$ (see above).
Product Rule

$$\log x + \log y = \log xy. \tag{A.6}$$

Quotient Rule

$$\log x - \log y = \log \frac{x}{y}. \tag{A.7}$$

Especially, $-\log x = \log \frac{1}{x}$, which is used prominently in Chap. 4.

Power Rule

$$\log x^n = n \log x. \tag{A.8}$$

A typical combination of the quotient and the power rule is used frequently in this book: $\log \frac{1}{n^k} = k \log \frac{1}{n} = -k \log n$ (see also Chap. 4).

Inverse Property of Logarithm

$$\log_b(b^k) = k. \tag{A.9}$$
Inverse Property of Exponent

$$b^{\log_b k} = k. \tag{A.10}$$

Change of Base

$$\log_b x = \frac{\log_c x}{\log_c b}, \quad c \in \mathbb{R}, c > 1. \tag{A.11}$$
This algebraic trick allows one to convert the logarithm of any base c to a logarithm of base b. Fixing c and b as constants, it also shows that all logarithms are proportional to one another.
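As a quick numerical sanity check of the laws above (a minimal sketch; the chosen values of x, y, n, b, and c are arbitrary):

import math

x, y, n, b, c = 5.0, 3.0, 4, 2.0, 10.0
assert math.isclose(math.log(x) + math.log(y), math.log(x * y))      # product rule (A.6)
assert math.isclose(math.log(x) - math.log(y), math.log(x / y))      # quotient rule (A.7)
assert math.isclose(math.log(x ** n), n * math.log(x))               # power rule (A.8)
assert math.isclose(math.log2(b ** 7), 7.0)                          # inverse property (A.9)
assert math.isclose(2 ** math.log2(7.0), 7.0)                        # inverse property (A.10)
assert math.isclose(math.log2(x), math.log(x, c) / math.log(b, c))   # change of base (A.11)
print("all logarithm laws hold numerically")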
Appendix B
More on Complexity
“Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name. In the second place, and more importantly, no one knows what entropy really is, so in a debate you will always have the advantage’.” (MacKay 2006) This appendix augments Chaps. 4 and 7 with a more detailed discussion on different complexity measurements, their limits and advantages, and a comparison to memory-equivalent capacity (MEC).
B.1 O-Notation

In computer science, more often than any other notation, the O-notation is used to classify algorithms according to how their run time or space requirements grow as the input size grows. That is, O-notation serves as the primary measure of algorithm complexity. O-notation is also used in other fields to provide similar estimates. The O-notation is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others, collectively called Bachmann–Landau notation or asymptotic notation (Bachmann 1894; Landau 1909). In general, it is a mathematical notation that describes the limiting behavior of a function when the argument tends toward a particular value or infinity. The letter O was chosen by Bachmann to stand for Ordnung, German for "order," meaning the order of approximation (similar to order of magnitude). The O-notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O-notation. In computer science, usually, the input string to an algorithm or program grows, and the notation indicates the order at which the run time or space requirements of the algorithm grow. For example, if an algorithm's run time grows linearly with the input length n, this is indicated as $O(n)$, whereas a quadratic growth of run time with
increasing input length would be indicated as $O(n^2)$. Section 4.2 already discussed that the lower bound for sorting is in the order $O(n \cdot \log_2 n)$. The success of the O-notation in computer science comes from its seeming simplicity. Probably the easiest way to obtain an idea of the order of growth of a program is to count its nested loops over the input. One loop is equated with $O(n)$, two nested loops are equated with $O(n^2)$, three nested loops with $O(n^3)$, etc. If a loop over the input does not process all the inputs, the order is smaller, for example, $O(\log_2 n)$. Loops that are not over the input do not count.

To dig into this notation a bit deeper and see how it connects to Corollary 2 and what its limits are, let us look into the formal definition. Let f be the function to be estimated, and let us call g the comparison function, which is real-valued. Both functions can be defined on some unbounded subset of the positive real numbers, and let $g(n)$ be strictly positive for all large enough values of n. $f(n) = O(g(n))$ as $n \to \infty$ if the absolute value of $f(n)$ is at most a positive constant multiple M of $g(n)$ for all sufficiently large values of n. That is,

Definition B.1 (O-Notation) $f(n) = O(g(n))$ if there exist a positive real number M and a real number $n_0$ such that $|f(n)| \leq Mg(n)$ for all $n \geq n_0$.

While simple to apply by counting inner loops or recursion depth, the downside of the use of this notation in computer science is twofold: First, it really only gives an approximation of the order of complexity growth, and the constant M could be large. Second, in computer science, n is usually equated with the number of characters in the input. We know from Sect. 7.2.2 that, especially for large strings, the use of the Chinese alphabet versus a binary alphabet could therefore make a significant difference. Most importantly though, in large strings, characters are usually not uniformly distributed. In practice, we will therefore see that the exact same algorithm takes much longer when it is run on a noisy version of the same image. Even though O-notation run time, color depth, and the number of pixels did not change, by adding noise, the distribution of the pixel values becomes more uniform. This can also be observed when trying to compress the image: A noisy image is less compressible and therefore will result in a larger file size. For longer strings, one can get a closer estimate of run time and space usage by taking into account the complexity of the input string, rather than just its length. That is,

Definition B.2 (String Complexity) $C(s) = -n \cdot \sum_{i=1}^{|\Sigma|} p_i \log_2 p_i$, where n is the number of characters in the input (usually assumed to increase to $\infty$), $|\Sigma|$ is the number of possible choices for an individual character (alphabet size), and $p_i$ is the frequency of occurrence of the character with index i in $\Sigma$ (see also Eq. 4.2).

Practically, the estimation of the O-notation run time based on nested-loop counting or recursion depth estimation will not change. The length of the input is still counted as n but just becomes the number of events in the Shannon entropy (see Definition 4.8). For example, $O(n)$ just becomes $O(C(s))$, $O(n^2)$ becomes $O(C(s)^2)$, and $O(2^n)$ becomes $O(2^{C(s)})$.
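As a minimal illustration of the loop-counting heuristic described above (the function is my own toy example, not from the book): counting the nested loops over the input immediately yields the growth order.

def count_inversions(items):
    # two nested loops over the input: O(n) x O(n) = O(n^2) by loop counting
    count = 0
    for i in range(len(items)):             # O(n)
        for j in range(i + 1, len(items)):  # nested: O(n^2) overall
            if items[i] > items[j]:
                count += 1
    return count

print(count_inversions([3, 1, 2]))  # 2 inversions: (3,1) and (3,2)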
Technically, inside a computer, low entropy, that is, predictability, is rewarded with higher execution speed through better utilization of caches or other hardware optimization techniques. Theoretically, we just gave units to the run time estimate. That is, we are counting the order of binary decisions necessary to achieve our goal based on the number of bits of input. We know from Corollary 2 that for an exact result (that is, not an approximation), we expect to have to read the entire input, and $O(C(s))$ is therefore the minimum number of binary decisions that have to be made. Since n is assumed to be growing, the entropy term $-\sum_{i=1}^{|\Sigma|} p_i \log_2 p_i$ is only a constant if the characters in the string are uniformly distributed (see the discussion in Sect. 4.1.3). For example, if s were a binary string containing a single "1" in the beginning followed by a potentially unlimited number of "0"s, the result for the entropy term would be 1 bit at $n = 2$, 0.47 bits at $n = 10$, 0.08 bits at $n = 100$, or 0.0015 bits at $n = 10{,}000$. Therefore, Definition B.2 is a natural specialization of the more general but less accurate O-notation used in computer science: In ignorance of the distribution of characters, we assume equilibrium (see Definition 4.10), that is, uniform distribution of characters, and, as such, the complexity term can just be part of the assumed constant M (see Definition B.1). If we know the distribution of characters in the alphabet, we are able to come up with a more accurate estimate of the growth order. The definition of string complexity can also be used for space (memory usage) estimation, as this follows from the original definition of Shannon entropy for communication (Shannon 1948b).
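A minimal computation of Definition B.2 (the function name is mine; it uses the plug-in estimate of the character frequencies):

import math
from collections import Counter

def string_complexity(s):
    # C(s) = -n * sum_i p_i * log2(p_i), with p_i the character frequencies
    n = len(s)
    return -n * sum((c / n) * math.log2(c / n) for c in Counter(s).values())

print(string_complexity("10000000"))  # sparse string: ~4.35 bits total
print(string_complexity("10110100"))  # balanced string: 8.0 bits total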
B.2 Kolmogorov Complexity

The Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output. It is named after Andrey Kolmogorov, who first published on the subject in 1963 (Li and Vitányi 1997), and is a generalization of classical information theory. Kolmogorov complexity uses similar arguments to the ones used in Sect. 4.2. If a description $d(s)$ of a string s is of minimal length (i.e., using the fewest bits), it is called a minimal description of s, or $d_{min}(s)$, and the length of $d_{min}(s)$ (i.e., the number of bits in the minimal description) is the Kolmogorov complexity of s, written $K(s)$. Symbolically,

Definition B.3 (Kolmogorov Complexity) $K(s) = |d_{min}(s)|$

The Kolmogorov complexity therefore tries to extend the probabilistic notion of general minimum description length presented in Sect. 4.2 to a deterministic notion that pins down the optimal program for one concrete string (rather than all possible inputs). The issue with Kolmogorov complexity is that it is uncomputable. This can be seen as follows.
Fig. B.1 An image of a feature inside the Mandelbrot set (often called Julia set). Generating the fractal can be done using low Kolmogorov complexity but high computational load. Once it is generated, lossless compression cannot (even closely) reverse the string describing the image back to the length of the program that generated it. Binette228, CC BY-SA 3.0, via Wikimedia Commons
Theorem B.1 (Kolmogorov Complexity Is Uncomputable) There is no program which takes any string s as input and produces the integer $K(s)$ as output.

Proof Let us assume a function $K(x)$ that returns the Kolmogorov complexity of x. One can then write a program that tests all possible strings one by one until it finds a string with $K(x) > 100\ \mathrm{Mb}$. This can be done in a nested loop: The first loop counts the number of bits k of the string from 1 to $\infty$, and the second loop creates all the strings of k bits, starting from 0 and counting up by 1 until $2^k - 1$. That is, $0, 1, 00, 01, 10, 11, 000, 001, \ldots$. Such a program can be written in much less than 100 Mb. Thus, contradicting the result of $K(x)$, we have written a program that prints the first string with complexity 100 Mb, but the program is less than 100 Mb. So the program cannot generate the string (yet it did). Therefore, $K(x)$ cannot exist. ∎
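The contradiction can be sketched in code, assuming (impossibly) an oracle K for Kolmogorov complexity; the program below is only a few hundred bytes, yet it would output the first string whose minimal description exceeds 100 Mb.

def K(s):
    # hypothetical oracle for Kolmogorov complexity -- cannot actually exist
    raise NotImplementedError("uncomputable")

def all_bitstrings():
    # enumerate 0, 1, 00, 01, 10, 11, 000, ... as in the proof
    k = 1
    while True:
        for i in range(2 ** k):
            yield format(i, f"0{k}b")
        k += 1

def first_complex_string(threshold_bits=8 * 100 * 10 ** 6):  # ~100 Mb, in bits
    for s in all_bitstrings():
        if K(s) > threshold_bits:
            return s  # a string whose shortest description exceeds the threshold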
Let us try to analyze why Kolmogorov complexity is uncomputable using a famous example. The Kolmogorov complexity for generating a Mandelbrot set (see Fig. B.1) is very low (about 15 lines of code, see Mandelbrot 1967), yet the computational load is very high (typically about 200–1000 iterations per pixel). Likewise, compressing the fractal using lossless compression generates a string that is longer than the code that was used to generate it. Figure B.1 takes up about 7.35 Mb as a PNG file at a resolution of 1.1 Mpixels and a color depth of about 8 bits per pixel (palette indexed), which is easily explainable since the Mandelbrot set has a fractal dimension (see Definition 4.15) of 2 (Mandelbrot 1982). That is, its complexity is maximal for a two-dimensional space. As a result, we expect any compression to have to use about all bits per pixel (here, 6.68 bits per pixel). However, decompressing the compressed image will be much faster than generating the fractal with a program at the Kolmogorov complexity of the image. Figure B.1 also cannot be easily reversed to its generative code, as there is a potentially unlimited number of implementations that can generate the same image. Kolmogorov complexity is independent of computational complexity, so a program at Kolmogorov complexity may be very small but still take a very long time to compute a certain object or string. Disregarding computational complexity makes determining the Kolmogorov complexity a question of the efficiency of the syntax and semantics of the language used to describe the original string. The challenge with that is that it can be quite difficult to compare the description length of different programming languages. Not only could one shorten each of the programming language's instruction words (for example, reduce for to f), but different instructions also have different semantics. For example, it is well known that one can reduce all programming to one single instruction (Patterson and Ditzel 1980). In fact, there are several such instruction candidates. But which one works best? Based on Theorem 7.3, none of them. Some of them will lead to shorter programs for certain problems and longer ones for other problems (see also Sect. 7.2). In the end, all of them will perform equally well on average. This average can be determined by the entropy of the string.

In summary, while one can define complexity in general (see Sect. 4.2), one can always find anecdotes that are smaller (or larger) than the expectation. Complexity therefore always needs to be discussed over the entire space of possible strings. It cannot be discussed over a single string. Compression is related to Kolmogorov complexity, as one can always compress a string s and prepend the decompression algorithm to it (effectively creating what is called a self-extracting archive). A universal method could maximally yield Shannon entropy (Definition 4.8) bits per character in the string plus a constant number of bits for the program length of the decompression algorithm. For this thought experiment, however, the decompression algorithm does not have to be universal (as one would usually assume for a self-extracting archive). That is, it can be specific to the string because Kolmogorov complexity only concerns one string. Furthermore, it is allowed to take almost infinite time. In conclusion, Kolmogorov complexity shows us that we will never be able to predict the exact minimal model size that can reproduce a function encoded by a training table, never mind generalizing the function. We can define an upper bound using memory-equivalent capacity (see Definition 5.5) and apply the techniques presented in this book to estimate it approximately, similar to upper-bounding Kolmogorov complexity with self-extracting archives.
B.3 VC Dimension

A largely recognized capacity measure for machine learning theory comes from Vapnik and Chervonenkis (Vapnik 2000). It is called the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971). It is defined as the largest natural number of samples in a dataset that can be shattered by a hypothesis space. This means that for a hypothesis space having VC dimension $D_{VC}$, there exists a dataset with $D_{VC}$ samples such that for any binary labeling ($2^{D_{VC}}$ possibilities) there exists a perfect classifier f in the hypothesis space. That is, f maps the samples perfectly to the labels by memorization. Formally,

Definition B.4 (VC Dimension (Vapnik 2000)) The VC dimension $D_{VC}$ of a hypothesis space f is the maximum integer $D = D_{VC}$ such that some dataset of cardinality D can be shattered by f. Shattered by f means that any arbitrary labeling can be represented by a hypothesis in f. If there is no maximum, $D_{VC} = \infty$.

For example, $D_{VC} = \infty$ holds for 1-nearest neighbor, as there is no maximum number of points that can be labeled (compare also the discussion in Chap. 6). The definition of VC dimension comes with two major caveats. First, like memory-equivalent capacity (MEC) (see Definition 5.5), it considers only the perfect machine learner but ignores other aspects like imperfections in the optimization algorithm or even intentional loss due to regularization functions (Arpit et al. 2017). Second, and most importantly, it is sufficient to provide only one example of a dataset to match the VC dimension. That is, VC dimension, like Kolmogorov complexity, is an anecdotal measure rather than a general measure over all possible datasets of a certain size.

As discussed in Sect. 5.2, memory-equivalent capacity and VC dimension have a direct connection. Recalling from Definition 5.5: A model's intellectual capacity is memory-equivalent to n bits when the machine learner is able to represent all $2^n$ binary labeling functions of n points. If the training table represents a function of uniformly distributed input mapping to balanced, binary labels and the model is able to memorize the table of n rows (but not $n+1$), then $D_{VC} = n = MEC$. This is because $2^n$ shatterings (separations) are needed to memorize such a table, hence $D_{VC} = n$, and by definition, $MEC = n$ bits. If the mutual information in the table is higher than 0, that is, the function can be somewhat generalized, and the given model can represent all $2^n$ binary labeling functions of n rows (but not $n + 1$), the model's VC dimension is $D_{VC} = n$. The model's $MEC < n$ because the definition of MEC requires all possible functions of n bits to be representable. A function with lower mutual information will need a higher capacity to represent. MEC has the advantage of being able to work with non-binary classification and regression, as long as information content can be measured in bits. MEC can never be infinite, as there is only a finite number of instances in a training table that can be memorized.
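The definition can be checked by brute force for small hypothesis spaces. The following sketch (my helper names; the perceptron epoch cap makes the separability test approximate) verifies that a linear threshold neuron in the plane shatters 3 points in general position but not the 4 XOR points, i.e., its $D_{VC} = 3$:

import itertools
import numpy as np

def linearly_separable(X, labels, epochs=500):
    # perceptron rule: converges iff the labeling is linearly separable
    # (capped at a fixed number of epochs, so this test is approximate)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias input
    y = np.where(np.array(labels) == 1, 1, -1)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shatters(X):
    # all 2^n labelings must be representable for X to be shattered
    return all(linearly_separable(X, labels)
               for labels in itertools.product([0, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 1.0]])
four = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(shatters(three))  # True: 3 points in general position
print(shatters(four))   # False: the XOR labeling is not separable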
B.4 Shannon Entropy as the Only Way to Measure Information

Shannon entropy is introduced in Chap. 4 and used consistently throughout this book. It usually plays the role of an additional cost/reward function (see, for example, Sect. B.1) that balances out disequilibrium. Often the question arises whether such a balancing could not be done by another cost function. Also, intuitively, information can be defined in many ways, and it may seem pretentious to assume that Shannon entropy is the only solution. After all, statistics has other measures of defining information, for example, Fisher information (Fisher 1922).¹ Claude Shannon himself faced that question and responded to it by formulating axioms that characterize the function that should be used to quantify information. The axioms are as follows:

1. An event with probability 100% is perfectly unsurprising and therefore yields 0 information.
2. The less probable an event is, the more surprising it is and the more information it yields (monotonicity).
3. If two independent events are observed separately, the total amount of information is the sum of the self-information of the individual events.

An argument against these axioms is that events are never truly independent. However, in the standard machine learning process (see Chap. 3), we assume independence of training samples unless modeling time-based data. Shannon (Shannon 1948b) then showed that there is only one unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number $b > 1$ and an event x with probability P, the only way to define information content is as in Chap. 4: $H(x) := -\log_b(P)$. This uniqueness of the entropy function may contribute to the fact that it is connected to physical work/energy, which is explained in Sect. B.5. The Shannon entropy of a random variable X is then defined as $H(X) = -\sum_{i=1}^{n} p_i \log_b p_i$, consistent with what is discussed in Chap. 4. This, by definition, is equal to the expected information content of the measurement of X, since the information content of each event, $-\log_b p_i$, is multiplied by its expected frequency of occurrence (indicated by $p_i$).

This generalization is not unique, however. This was pointed out most prominently by Alfréd Rényi, who looked for the most general way to quantify information while preserving the additivity for independent events (Axiom 3). The simplicity of the end result is rather surprising:

Definition B.5 The Rényi entropy of order $\alpha$, where $\alpha \geq 0$ and $\alpha \neq 1$, is defined as
1 The Fisher information represents the curvature of the relative entropy (KL-Divergence) of a conditional distribution with respect to its parameters. KL-Divergence is derivable from Shannon Entropy.
$$H_\alpha(X) = \frac{1}{1-\alpha} \log \left( \sum_{i=1}^{n} p_i^\alpha \right). \tag{B.1}$$
Here, X is a discrete probability function (see Definition 4.7) with possible outcomes in the set $\Sigma = \{x_1, x_2, \ldots, x_n\}$ and corresponding probabilities $p_i = \Pr(X = x_i)$ for $i = 1, \ldots, n$. The logarithm is conventionally taken to be base 2.

At $\alpha = 0$, the Rényi entropy is $H_0 = \log |\Sigma|$, where $\Sigma$ is the alphabet of outcomes. At $\alpha = 1$, the Rényi entropy equals the Shannon entropy (by definition), $H(X) = -\sum_{i=1}^{|\Sigma|} p_i \log_b p_i$. At $\alpha = 2$, the Rényi entropy equals the so-called collision entropy $H_2(X) = -\log \sum_{i=1}^{n} p_i^2 = -\log P(X = Y)$. In base 2, $H_2$ specifies the number of bits of information that is expected to overlap between two independent random variables with the same probability distribution, that is, the number of times the two distributions coincidentally have the same value. More probable values are much more likely to collide. Intuitively, we can think of two fair coins: Each coin is independent of the other coin, and (by definition of fair) they are also identically distributed. That is, $p_i = \frac{1}{2}$ for each event, $H = heads$ and $T = tails$. As a result, the collision entropy is $H_2 = -\log_2\left(\left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2\right) = 1$ bit, which is correct: Out of the 4 equi-distributed states $\{HH, HT, TH, TT\}$, or $\log_2 4 = 2$ bits of information, the 2 states HH and TT show the coins with equal values, $\log_2 2 = 1$ bit in this case. For distributions that are not equally distributed, this can be less intuitive.

Such considerations of coincidental overlap are important for cryptography, as the first line of attack is to analyze the distribution of the encrypted message. The question is then how many bits can be extracted just by guessing with the same distribution. Intuitively, we can set $\alpha$ to be the number of coins flipped. At $\alpha = 1$ we are asking for the amount of repeating observations within the same random variable, whereas at $\alpha > 1$ we are asking for repeating observations between several independent but identically distributed random variables. Collision entropy is therefore also interesting for modeling, as we can measure the chance of a spurious correlation. When a correlation between two variables is said to be "more than spurious," this usually means that there is some significant information shared between the two variables. If the mutual information between these two variables (which measures this shared information) is larger than the collision entropy of each variable (which measures their inherent randomness), then it might indicate that the correlation is not just due to random "collisions" of outcomes, but actually due to some underlying relationship between the variables. To make the intuitive understanding of Rényi entropy a bit harder, $\alpha$ does not actually have to be a natural number, which puts us into the realm of fractal geometry (again). This is explained in Barros and Rousseau (2021).
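A minimal numeric check of these special cases (the function names are mine):

import math

def renyi(p, alpha):
    # H_alpha(X) in bits, for alpha >= 0 and alpha != 1 (Eq. B.1)
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

def shannon(p):
    # the alpha -> 1 limit of the Renyi entropy
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair_coin = [0.5, 0.5]
print(renyi(fair_coin, 0))   # 1.0 bit: H_0 = log2 |alphabet|
print(shannon(fair_coin))    # 1.0 bit: Shannon entropy
print(renyi(fair_coin, 2))   # 1.0 bit: collision entropy, as derived above

loaded = [0.9, 0.1]
print(renyi(loaded, 2))      # ~0.29 bits: probable values collide more often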
B.5 Physical Work

Intuitively, the more complex a problem is, the more work is required to come up with a solution. Therefore, informally, physical work and complexity are quite directly connected. We can formalize this connection since work is a function of energy and energy is directly connected to information. Consider a simple battery. The voltage U (potential energy) of the battery is maximal when all n electrons are on the cathode side of the battery, that is, when it is easy to predict where any of the electrons reside. In other words, the probability of electron i residing at the cathode is $p_i(\text{at cathode}) = 1$. Consequently, the Shannon entropy (Definition 4.11) of the probability function X of the electrons is $H(X) = 0$. When the battery is used, potential energy is transformed into work by electrons traveling from the cathode to the anode. As a consequence, $0 \leq p_i(\text{at cathode}) \leq 1$ while the battery is performing work, until the battery reaches equilibrium. In equilibrium, $p_i(\text{at cathode}) = p_i(\text{at anode}) = \frac{1}{2}$. At that point, $U = 0$ and the information content has steadily increased to its maximum $H(X) = n$ bits (where n is the number of electrons). The process can be reversed by charging the battery. That is, adding external energy will increase U back to $U_{max}$ and bring $H(X)$ back to 0 bits.

Since energy can be transferred from one form to another, this example is general and not restricted to electric energy. In fact, this direct connection between energy and information is well known (Feynman et al. 2018): It takes work (and therefore energy) to erase bits or, in other words, to make things more certain. Uncertainty is free (see also the second law of thermodynamics). Many physicists are still debating the deeper meaning of the relationship between information and energy. However, it is clear that everything we can ever observe from the universe is information, be it through vision, touch, smell, or indirectly through measurement (of radiation, for example). Therefore, the laws of information govern our perception whether we want it or not. In turn, this means that some of the work of physicists must have included work on information, and of course, that is not news, as fields like thermodynamics, statistical mechanics, and quantum physics have created their own notions of entropy. Furthermore, since energy and information are so intimately related, the principles that govern energy (which have been investigated for hundreds of years) also govern information (which has been investigated consciously for several decades). For example, Chap. 4 already introduced Corollary 2, which is directly connected to energy conservation: It is clear that, to flip a switch, a minimum amount of energy is required. Usually, one direction is free (e.g., gravity can pull down but not push up, or setting a transistor to off does not require electron flow), but the other direction requires energy. In this book, we always assume that temperature is constant and that the energy to flip an individual bit to certainty is constant, as all bits are independent of one another. That is, if we have the set of all possible input strings of n bits length that need to be reduced to 1 bit of certain output (which is, for example, the typical scenario in a binary classification problem), it will require energy proportional to n bits to do so. That is, the amount of work that needs to be done is proportional to the complexity
reduction required by the problem. That is, energy corresponds to computational complexity, which is proportional to the number of bits that need to be set to certainty.² This is quantified by the mutual information (see Definition 4.18) between the input and the output of the function implemented by the program: The more bits of the input correlate with the output, the less work has to be performed. In theory, reproducing (copying) a bit does not require energy (Feynman et al. 2018). This seems logical since no uncertainty is reduced and no decisions have to be made. In practice, copying does take work.³ This can be understood by highlighting one more property that information and energy have in common: Information, just like energy, depends on the frame of reference. That is, what is information from one frame of reference can be noise from another frame of reference. The frame of reference can be easily switched, for example, by asking a different question about the same observation (see Sect. 2.1): In the same image, the pixels relevant for "is there a dog?" are irrelevant for "is there a car?", and in fact they may make the decision harder. Going back to copying, we have n bits of memory in a source bank of memory that we want to copy to a destination bank of memory. That destination bank can be in any state. That is, it is totally uncertain in which state the destination is. So by making the destination memory bank equal to the source memory bank, we reduce the uncertainty of the state of the destination by n bits. This requires work. From the point of view of the source, however, no uncertainty was reduced, and so no energy needed to be expended. The good news is that models (of energy or information) typically do not change frame of reference, as they answer exactly one question. The ability to change the frame of reference seems, so far, exclusive to humans. Interestingly enough, memory-equivalent capacity can be interpreted as the maximum number of bits that can be set to certainty by a model. That is, given the minimum amount of energy needed to set one bit to certainty, it is a measure of energy as well. The following sections present two examples that show how physical energy and information structure are embedded within each other.
B.5.1 Example 1: Physical View of the Halting Problem

In computer science, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running or continue to run forever. Alan Turing proved in 1936 (Turing 1936) that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist. The proof is one by contradiction, similar to the proof of Theorem B.1, and, while taught in many undergraduate classes, does not offer an
² This is why we identified the expected computational complexity with $E_{comp}$ in Sect. 4.2.
³ Take a pen and a piece of paper and start copying the first page of this chapter.
explanation. However, knowing Corollary 2 and the energy conservation principle, we can derive a more constructive explanation for the undecidability of the halting problem. First, any input needs to be encoded in n bits of information and therefore, by Corollary 2, requires at least n binary decisions to process. That is, as discussed in Sect. B.5 above, it requires energy to reduce n bits of uncertainty to some number of bits of certainty. This means, just by the energetic equation, the halting problem could be solved, but only if we spend the same amount of energy solving the halting problem as we would executing the program. However, if a program does not halt, this would be an infinite amount of energy. That is, if the halting problem could be solved, a problem that requires infinite energy to run (e.g., "is an input the same as $\pi$ or not?") could, in general, be solved with a finite amount of energy. This obviously violates energy conservation and would probably enable one to build a perpetual motion machine, which we know is impossible (Atkins 2010), or to create a source of infinite energy, which would violate the laws of thermodynamics.⁴

Of course, anecdotally, we can always find a special program where the halting problem is clearly solvable without spending an infinite amount of energy. For example,

while true do
    Perform some operation
end while

clearly never halts. Notice, however, that this program does not reduce any uncertainty and therefore is just a highly inefficient way of doing nothing. That is, physically speaking, this program does not perform any work. Similarly, any program can be simulated. This is done practically in emulators and virtual machines. This is possible because simulation does not destroy or create any energy. If the simulation reaches the exact same memory state again without the program halting, the program will not halt, as it is stuck in an endless, inescapable loop. Therefore, the halting problem can be solved by simulation when we restrict ourselves to finite memory. This has already been observed by Minsky (1967). Practically speaking, this is only feasible if the computer that is simulated is extremely small, as a computer with n bits of memory has $2^n$ possible states of memory that would have to be checked for repetition.
⁴ Perpetual motion is the motion of bodies that continues forever in an unperturbed system. A perpetual motion machine is a hypothetical machine that can work infinitely without an external energy source. Such a machine is impossible, as it would violate either the first or the second law of thermodynamics or both.
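A minimal sketch of this state-repetition argument (my helper names; the machine is modeled as a deterministic step function over finitely many memory states):

def halts_with_finite_memory(step, state):
    # step(state) returns the next memory state, or None when the machine halts;
    # revisiting a state on a deterministic machine implies an endless loop
    seen = set()
    while state is not None:
        if state in seen:
            return False  # same memory state reached again: never halts
        seen.add(state)
        state = step(state)
    return True

# a countdown that halts at 0 vs. a 3-bit counter that wraps around forever:
print(halts_with_finite_memory(lambda s: None if s == 0 else s - 1, 5))  # True
print(halts_with_finite_memory(lambda s: (s + 1) % 8, 0))                # False

As the text notes, this is only feasible for tiny machines: n bits of memory mean the seen set may have to hold up to $2^n$ states.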
B.5.2 Example 2: Why Do We Expect Diamonds to Exist?

A crystal is a solid material whose constituents (such as atoms, molecules, or ions) are arranged in a highly ordered microscopic structure, forming a crystal lattice that extends in all directions. Diamonds are a solid form of the element carbon with its atoms arranged in a crystal structure. They are created in nature from carbon-containing fluids that dissolved under high pressure and temperature. Synthetic diamonds can be grown from high-purity carbon under high pressures and temperatures. From an information standpoint, the key elements here are a highly predictable structure formed under high temperature and pressure. Interestingly enough, there are two very fundamental equations of thermodynamics that bring these key elements together: the Helmholtz and the Gibbs free energy. Helmholtz's equation deals with temperature and Gibbs' equation with pressure. Without loss of generality (since the two equations are derivable from each other), we will only use the Helmholtz equation, which is defined as follows:

Definition B.6 (Helmholtz Free Energy) $A \equiv U - TS$, where

• A is the Helmholtz free energy (unit: Joules).
• U is the internal energy of the system (unit: Joules).
• T is the absolute temperature (Kelvins) of the surroundings.
• S is the entropy of the system (unit: Joules per Kelvin).
The Helmholtz free energy is a thermodynamic potential that measures "the useful work obtainable from a closed thermodynamic system at a constant temperature." For the point made in this paragraph, we do not have to go into the details of what the physical concepts mean. At the time of discovery of this equation, around 1882 (von Helmholtz 1882), Shannon or Gibbs entropy was not known. However, without even digging in any further, we can apply this formula to understand why crystals are formed. We reformulate $A \equiv U - TS$ algebraically to $\frac{A-U}{T} = -S$. Now it is easy to see that, independent of what A or U may be (and $U \leq A$), once T goes up, the result for the entire term on the left decreases toward 0 (because $\frac{n}{m} \to 0$ for $m \to \infty$). That is, S will be very small. As introduced in Chap. 4, low entropy means either very little uncertainty or a low number of bits to describe a system. In other words, a highly ordered structure, i.e., by definition, a crystal. A similar derivation can be made with the Gibbs free energy, using pressure rather than temperature. Of course, there are many more details to consider that are not described by a 4-variable equation. However, this example shows that information complexity plays a direct role in physical processes. Many physicists therefore advocate measuring S in bits, rather than in Joules per Kelvin.
B.6 P vs NP Complexity

The P versus NP problem, in informal terms, asks whether every problem whose solution can be "quickly verified" can also be "quickly solved." In formal terms, it is to determine whether every language accepted by some non-deterministic algorithm in polynomial time is also accepted by some deterministic algorithm in polynomial time. This question was introduced by Stephen Cook in a seminal paper entitled "The complexity of theorem proving procedures" (Cook 1971), where he defines and discusses the challenge using the satisfiability problem (SAT). In physics and statistics, the word "deterministic" is synonymous with "free of uncertainty"; however, the word was redefined in theoretical computer science to mean "a machine that has exactly one possible move from a given configuration to another," versus non-deterministic, "a machine that has more than one possible move from a given configuration to another."

Algorithm 15 Default algorithm as proposed by Cook (1971) to check if a Boolean formula is satisfiable

function SAT($F(x_1, \ldots, x_d)$)
    for $i \in [0, 2^d - 1]$ do
        $\vec{x} \leftarrow$ binarized(i)
        if $F(\vec{x}) == 1$ then
            return True
        end if
    end for
    return False
end function
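A direct transcription of Algorithm 15 into runnable Python (a sketch; representing the formula F as a callable is my convention, not the book's):

from itertools import product

def sat(F, d):
    # try all 2^d configurations of the d Boolean variables
    for x in product([False, True], repeat=d):
        if F(x):
            return True  # a witness; verifying it takes only O(n)
    return False  # exhausted all 2^d configurations: unsatisfiable

# (x1 or x2) and (not x1 or not x2): satisfiable, e.g., by (True, False)
print(sat(lambda x: (x[0] or x[1]) and (not x[0] or not x[1]), 2))  # True
# x1 and not x1: unsatisfiable, forcing the full 2^d scan
print(sat(lambda x: x[0] and not x[0], 1))  # False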
The general class of questions for which some algorithm can provide an answer in polynomial time (time here meaning "steps") is "P" or "class P," mostly referring to the term used in the O-notation (see Sect. B.1). That is, polynomial time is defined as maximally $O(n^k)$, where k is a constant and n the length of the string. For some questions, including SAT (see below), there is no known way to find an answer in a polynomial number of steps, but if one is provided with information showing what the answer is, it is possible to verify the answer quickly. The class of questions for which an answer can be verified in polynomial time is "NP," which stands for "non-deterministic polynomial time." An answer to the "P versus NP" question would determine whether problems that can be verified in polynomial time can also be solved in polynomial time.

To understand this issue better, let us look at SAT. A Boolean formula $F(x_1, \ldots, x_d)$ is satisfiable if there exists at least one configuration vector $\vec{x}$ of the variables such that $F(\vec{x}) = true$, that is, whether there exists an assignment to the variables such that the formula evaluates to true. The SAT problem is to find an algorithm that determines if a given formula F is satisfiable, preferably one that runs in polynomial time. The default algorithm proposed by Cook (Cook 1971) was
simple: guess a configuration and check whether the formula becomes true, until either all configurations are checked or one configuration satisfies it (see Algorithm 15). The challenge with this algorithm is that one has to go through all $2^d$ configurations of the d variables if F is not satisfiable. That is, the number of iterations of the loop in O-notation is $O(2^d)$, where d is the number of variables, which is definitely larger than polynomial, that is, larger than $d^k$ (k constant). However, checking that a specific configuration satisfies the formula can be executed in $O(n)$, where n is the length of the string describing the formula, i.e., in polynomial time. So the total run time of Algorithm 15 in O-notation is $O(n \cdot 2^d)$, where n is the number of characters in the formula and d is the number of variables in the formula.⁵ Cook argues that the algorithm could be polynomial if a "non-deterministic" (see above) machine could be built that evaluates all $2^d$ configurations at once, thus making the run time $O(n)$, that is, polynomial. The question that has been asked since 1971 is as follows: Can there be such an algorithm, or at least one that solves SAT in $O(n^k)$, k constant?

⁵ Note that processing a formula of length n is proportional to the complexity and the length n of the formula, as discussed in Sect. B.1, but this is ignored here, as in the original paper.

From a more modern perspective, we know Boolean formulas are models of data tables (see Chap. 2) that were empirically derived (Quine 1940). These tables are called truth tables. There are $2^{2^d}$ unique truth tables of d Boolean variables. Therefore, any model, be it a neuron (see Chap. 5), a binary tree, or a propositional formula, requires a memory-equivalent capacity (see Definition 5.5) of $\log_2 2^{2^d} = 2^d$ bits to be able to represent any truth table in general. We remember that O-notation characterizes functions according to their growth rates (Sect. B.1), and we know from Sect. 7.2.2 that it does not matter how many or few symbols we include in the grammar that defines a propositional formula. That is, in general, there is no possible alphabet or encoding that can make an exponential growth rate of a string polynomial, unless one restricts the total length of the string. Of course, depending on the operations we choose in the propositional formula, anecdotally, some specific truth tables may be described very briefly, while others require a longer minimum description length (see also Sects. 7.2, 7.4, and B.2). Therefore, assuming we can evaluate the configuration of any propositional formula in growth order $O(n)$ steps makes the "formula length axis" logarithmic, as the actual growth is exponential, even in the number of symbols used. Additionally, we know that Algorithm 15 must, in general, go through all $2^d$ bits, since all bits have to be touched at least once to evaluate the formula exactly (see Corollary 2). So we have to understand $O(n) := \log_2 2^d = d$ bits by the problem definition. Since a configuration string cannot be longer than the formula itself, we would have to assume the length of each configuration evaluated in Algorithm 15 to be $\log_2 2^d = d$ bits, rather than $2^d$ bits. Under that view, it appears as if the loop goes through $2^n$ configurations and evaluating each of them in F is $O(n)$. In reality, the total number of steps to execute in Algorithm 15 for an unsatisfiable formula is at minimum $O(2^{2^d})$. In other words, the original problem statement is ill-defined as it assumes that evaluating a formula takes at maximum a polynomial number of computation steps ("quickly verified") while going through all configurations takes, as far as we know, exponential time (not "quickly solved"). However, even if we found a way to go through all $2^d$ configurations in polynomial time, evaluating a formula would, in the worst case, still require an exponential number of steps. That is, the run time would be $O(n^k \cdot 2^d)$, where d is the number of variables in the formula and $n^k$ (k constant) is the number of iterations of the loop using some smart trick, including building a non-deterministic Turing machine.
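To make the counting argument above concrete, the following minimal Python sketch enumerates all $2^{2^d}$ truth tables for $d = 2$ Boolean variables (the variable names are choices made for this illustration):

from itertools import product

d = 2
rows = list(product([0, 1], repeat=d))  # the 2^d possible input rows

# A truth table assigns one output bit to each of the 2^d rows, so
# there are 2^(2^d) distinct tables, and writing one down takes 2^d
# bits in general, matching the memory-equivalent capacity argument.
tables = list(product([0, 1], repeat=len(rows)))
print(len(rows), len(tables))  # 4 16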
The community working on P/NP complexity has evolved to discussing the satisfiability problem for specific representations of Boolean formulas, named k-SAT. This allows us to take a deeper look into the problem. The specific representation is called Conjunctive Normal Form (CNF). A formula F is said to be in CNF if it consists of only conjunctions (and) of several clauses, whereby a clause is a disjunction (or) of literals and a literal is a variable or its negation, e.g., "x" or "$\neg x$." k-SAT is defined as follows: Given a Boolean formula F in CNF, in which each clause has exactly k literals, decide whether or not F is satisfiable. For example, $(x_1 \vee \neg x_2 \vee \neg x_3) \wedge (\neg x_4 \vee x_5 \vee x_6)$ is in CNF with $k = 3$, and $(x_1 \vee x_3) \wedge (x_2 \vee x_3)$ is in CNF with $k = 2$. It is easy to see that the case $k = 1$ is quite simple to solve: A formula consisting only of conjunctions of single literals is satisfiable by setting each negated variable to 0 and all other variables to 1. However, if there exist two clauses of the form $x_i \wedge \ldots \wedge \neg x_i$, it is not satisfiable. Solving 1-SAT therefore only requires a linear number of steps in the number of symbols required to describe the formula. However, since the MEC required to describe all formulas is still $2^d$ bits (see above), in the worst case, there will still be formulas that need $2^d$ bits to describe a 1-CNF formula (see again Sects. 7.2 and 7.2.2), and therefore the number of binary decisions is still exponential in the number of variables. In 1-SAT, one clause can maximally carry one bit of information (that is, if each variable appears in each term once). So the minimum number of clauses c needed for the worst-case formula is $c = \frac{2^d}{\log_2 d}$ because we need $\log_2 d$ bits to encode the variable number. Similarly, there is a quite clever polynomial algorithm to solve 2-SAT (Garey and Johnson 1976). However, the worst-case length of a formula is also still $2^d$ bits. We can encode each clause with 2 bits of information (indicating negation or not for each variable) and $2 \log_2 d$ bits of information for the numbers of the two variables. So the expected number of clauses is $c = \frac{2^d}{2 + 2\log_2 d}$.⁶ For $k \ge 3$, no algorithm is known that can generalize the outer loop of Algorithm 15 from $2^d$ steps to a polynomial number of steps. However, even if we found one, the number of clauses would still be exponential, as the expected number of clauses would be $c = \frac{2^d}{k + k\log_2 d}$. Figure B.2 graphs the expected number of clauses for different values of k.

⁶ This result is consistent with Papadimitriou (1994), showing that 2-CNF formulas can be represented in logarithmic space by a non-deterministic Turing machine. The result assumes $O(n)$ for the description length. That is, with $n = O(k(2 + 2\log_2 d))$ bits, the non-determinism can take care of k, and what is left is $O(\log_2 d)$, which is logarithmic.
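As an illustration of why the $k = 1$ case is easy, here is a minimal linear-time 1-SAT sketch; the clause encoding (signed integers, +i for $x_i$ and -i for $\neg x_i$) is an assumption made for this example:

def solve_1sat(clauses):
    # 1-SAT: each clause is a single literal. The formula is
    # satisfiable iff no variable occurs both negated and
    # non-negated; this check is linear in the formula length.
    assignment = {}
    for literal in clauses:
        var, positive = abs(literal), literal > 0
        if assignment.get(var, positive) != positive:
            return None  # contains x_i AND NOT x_i: unsatisfiable
        assignment[var] = positive
    return assignment  # a satisfying assignment

print(solve_1sat([1, -2, 3]))  # {1: True, 2: False, 3: True}
print(solve_1sat([1, -1]))     # None (unsatisfiable)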
[Fig. B.2 shows the expected number of clauses (y-axis, 0 to 120) as a function of the number of variables (x-axis, 1 to 10), with one curve each for k = 2, k = 3, and k = 10.]
Fig. B.2 Expected number of clauses in k-CNF formulas as a function of k. The exponential growth is slowed down insignificantly by increasing k
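The curves in Fig. B.2 can be reproduced by simply evaluating the clause-count expression derived above; a minimal Python sketch:

import math

def expected_clauses(d, k):
    # c = 2^d / (k + k * log2(d)): expected number of clauses needed
    # for a worst-case k-CNF formula over d variables.
    return 2 ** d / (k + k * math.log2(d))

for k in (2, 3, 10):
    print(f"k={k}:", [round(expected_clauses(d, k), 1) for d in range(1, 11)])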
It seems tempting, and also fun, to try to come up with algebraic and algorithmic tricks to shorten the number of binary decisions required to acquire certainty about a formula. But we have to remind ourselves that all current computers can do is propositional logic (see also the discussion at the end of Sect. 7.2.2) plus looping. Trying to evaluate a propositional logic formula in fewer steps than is dictated by its sample space, that is, trying to beat Corollary 2, is therefore only possible anecdotally. In general, as discussed in Sect. B.5, setting one bit to certainty requires $O(1)$ energy. We can use as much energy as we can afford to speed up time, but with the current architecture of computers, each switch requires time and energy to flip. This is most prominently evidenced when we change the way the switches are flipped: An analog computing approach can show polynomial analog time complexity on k-SAT problem instances with $k \ge 3$, but at an energy cost dependent on exponentially growing auxiliary variables (Yin et al. 2017). Consistent with the discussion in Sect. B.5.1, the conclusion for automated modeling is that, in general, models can never be predicted by other models using less work than is required to fully evaluate the original model. In the author's humble opinion, the formulation of the SAT problem should be interpreted as an evocative model (see Sect. 7.5). Like all models, it is wrong, but this one is useful in that it was definitely inspiring: As a result of it, thousands, if not millions, of computer scientists have worked to make algorithms more efficient. While I suggest exercising caution in applying the P/NP model for complexity estimates in lieu of actual bit measurements, it is not clear that the ill-definedness of SAT makes the whole theory of P/NP invalid. The Traveling Salesman Problem (TSP) (Reinelt 1994), for example, takes as input a set of distances between points. A distance is a positive integer that can be stored in $\lceil \log_2 d \rceil$ bits (where d is the largest distance), and there are only $\frac{n(n-1)}{2}$ distances between n points. The input for
TSP is therefore definitely not of exponential growth, and yet nobody has invented a polynomial algorithm to find the shortest round trip. Last but not least, the idea that there could be an alternative way to build a computer that can bring an exponential number of bits to certainty at each time step, given enough energy expenditure, is definitely not out of the question, as we can see the universe itself as a large computing machine (Zuse 1970).
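To make the growth claim tangible, the following minimal sketch counts the input size of a TSP instance in bits; the function name and the example maximum distance are assumptions made for this illustration:

import math

def tsp_input_bits(n, max_distance):
    # n*(n-1)/2 pairwise distances, each stored in ceil(log2(d)) bits,
    # where d is the largest distance: quadratic in n, not exponential.
    bits_per_distance = math.ceil(math.log2(max_distance))
    return (n * (n - 1) // 2) * bits_per_distance

for n in (10, 100, 1000):
    print(n, tsp_input_bits(n, max_distance=1000))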
Appendix C
Concepts Cheat Sheet
This cheat sheet offers a simplified overview of the main concepts covered in this book. Please use it as a handy reference, but remember that it does not replace a thorough understanding of the principles detailed within the book.
Machine Learning Experimental Design Theory Cheat Sheet

Basics
Intelligence: The ability to adapt (Binet and Simon 1904).
Machine Learner: A machine learner is a mechanism that is able to adapt to a target function in two ways: (1) through training and (2) through generalization. A mechanism that is trainable without generalization is called memory. A mechanism that generalizes without training is a quantizer.
Artificial Intelligence: A system that performs humanlike tasks, often by relying on machine learning.
Classification: Quantization of an input into a set of predefined classes (see also: Generalization).
Regression: Classification with a large (theoretically infinite) number of classes described in functional form.
Detection: Binary classification.
Clustering: Quantization.
Memory View of Machine Learning
Tabularization: Data is organized in N rows and D columns. Training data contains a (D+1)-th column with the prediction variable to be learned (labels). All inputs to the machine learner are points with D coordinates. All data in digital computers can be tabularized this way (see the sketch after this list).
Target Function: A mapping from all input points to the labels, assumed to be a mathematical function.
Representation Function: The parameterized function a machine learner uses, either standalone or in composition, to adapt to the target function, e.g., the activation function of an artificial neuron.
Bit (Binary digit): Unit of measurement for memory capacity (Shannon 1948b). One bit corresponds to one Boolean parameter of the identity function, whereby each parameter can be in one of two states with equal chance.
Intellectual Capacity: The number of unique target functions a machine learner is able to represent, as a function of the number of model parameters (MacKay 2003).
Memory-equivalent Capacity: With the identity function as representation function, N bits of memory are able to adapt to 2^N target functions. A machine learner's intellectual capacity is memory-equivalent to N bits when the machine learner is able to represent all 2^N binary labeling functions of N uniformly random inputs.
VC Dimension for uniformly random data points: Memory-equivalent Capacity.
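A minimal sketch of tabularization as defined above; the feature names and values are invented for illustration:

# N = 4 rows, D = 2 feature columns, plus a (D+1)-th label column.
table = [
    # height, weight, label
    (1.70, 65.0, 0),
    (1.85, 90.0, 1),
    (1.60, 55.0, 0),
    (1.90, 95.0, 1),
]
X = [row[:-1] for row in table]  # input points with D coordinates
y = [row[-1] for row in table]   # prediction variable (labels)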
Generalization
Generalization (∀): The concept of handling different objects by a common property. The set of objects with the common property is called a class; the elements are called instances.
Neuron: All instances of input signals below an energy threshold are ignored.
Generalization in Machine Learning: All inputs close enough to each other result in the same output, for an assumed or trained definition of "close enough" (generalization distance).
Memory: Only identical is close enough ⟹ no generalization.
Adversarial Example: An input that contradicts the generalization assumption of a machine learner (Huang et al. 2011).
Overfitting: A machine learner at memory-equivalent capacity or higher with regard to the number of inputs in the training data could just as well use the identity function as representation function and still adapt to the training data perfectly. The machine learner is said to overfit.
Accuracy: A necessary but not a sufficient condition for generalization success. At the same accuracy, overfitting potentially maximizes the number of adversarial examples, and capacity reduction minimizes it.
Measuring Generalization: Generalization of a class-balanced binary classifier is G = (# correctly predicted instances) / (memory-equivalent capacity); only G > 1 implies successful generalization (see the sketch after this list).

Training Processes
Training for Accuracy: The process of adjusting the parameters of the representation function(s) of the machine learner to approximate a target function with maximum accuracy.
Training for Generalization: The process of successively reducing the capacity of a machine learner while training for accuracy. The model with the highest accuracy and the smallest capacity is the one that uses the representation function(s) most effectively. Therefore, it has the lowest chance of failing (that is, of requiring increased capacity) when applied to unseen data from the same experimental setup (Friedland et al. 2018).
Regularization: Reducing capacity during training by restricting the freedom of the parameters, thereby potentially improving generalization. Techniques include dropout, early stopping, data augmentation, or imperfect training.
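A minimal sketch of the generalization measurement above; the numbers are placeholders, and the memory-equivalent capacity would come from a capacity estimate such as the ones described under Capacity Estimation:

def generalization(correct_predictions, mec_bits):
    # G = correctly predicted instances / memory-equivalent capacity;
    # only G > 1 implies successful generalization.
    return correct_predictions / mec_bits

print(generalization(correct_predictions=9000, mec_bits=1200))  # 7.5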
Capacity Estimation
Capacity Requirement: Build a static-parameter machine learner to memorize the training data. Assume exponential improvement through training. That is, the memory-equivalent capacity can be minimally logarithmic in the size of the static-parameter machine learner (Friedland et al. 2018).
Memory-equivalent Capacity for Neural Networks (Friedland et al. 2018), http://tfmeter.icsi.berkeley.edu (see the sketch after this list):
1. The output of a single neuron is maximally one bit.
2. The memory capacity of a single neuron is the number of parameters in bits.
3. The memory capacity of neurons is additive.
4. The memory capacity of neurons in a subsequent layer is limited by the output of the layer it depends on.
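The four rules above translate into a back-of-the-envelope calculation; the following is a minimal sketch for a fully connected network given as a list of layer sizes (an illustration of the rules, not the tfmeter implementation):

def mlp_mec_bits(layer_sizes):
    # Rules 2 and 3: each neuron contributes (inputs + 1) parameters
    # (weights plus bias) in bits, summed over the layer. Rule 4: a
    # later layer is capped by the output bits of the layer before
    # it, which by rule 1 is one bit per neuron.
    mec, prev_output_bits = 0, None
    for inputs, neurons in zip(layer_sizes, layer_sizes[1:]):
        layer_bits = neurons * (inputs + 1)
        if prev_output_bits is not None:
            layer_bits = min(layer_bits, prev_output_bits)
        mec += layer_bits
        prev_output_bits = neurons
    return mec

print(mlp_mec_bits([10, 5, 1]))  # 60 bits for a 10-5-1 network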
Generalization Estimation
Generalization Progression: Estimate the capacity needed to memorize 10%, 20%, ..., 100% of the training table. If the capacity does not stabilize at higher percentages, either there is not enough training data to generalize or the representation function(s) of the machine learner is/are not right for the task (see the sketch after this list).
Find Best Machine Learner for Data: Measure/estimate the generalization progression for different types of machine learners. Pick the one that converges to the smallest capacity.
Testing Generalization Performance: Measuring accuracy against an independent data set after training is the only way to guarantee generalization performance. Testing against hold-out or cross-validation data during training practically makes the data part of the training set.
Occam's Razor (Modern): When you have two competing theories that make exactly the same predictions, the simpler one is better (Thorburn 1915).
Occam's Razor for Machine Learning: Among equally accurate models, choose the one that requires the lowest memory-equivalent capacity (see also: Training for Generalization).
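A minimal sketch of measuring a generalization progression, using a fully grown decision tree's leaf count as a stand-in for the capacity needed to memorize each subset (the use of scikit-learn and the synthetic data are assumptions for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a learnable target function

for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
    n = int(len(X) * frac)
    tree = DecisionTreeClassifier().fit(X[:n], y[:n])  # memorize subset
    print(f"{frac:.0%} of data -> {tree.get_n_leaves()} leaves")
# If the leaf count stabilizes as the fraction grows, the capacity
# needed has converged, and generalization is plausible.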
Created by G. Friedland, v1.01 Aug 28th, 2020 [email protected] Released under Creative Commons 4.0 license: CC-BY-NC-SA.
Appendix D
A Review Form That Promotes Reproducibility
Summary
Please provide a brief summary of the paper and its main contributions.

Originality (pick one of the following):
• The paper presents a novel and original contribution to the field.
• The paper presents a significant extension or improvement over prior work.
• The paper presents a minor improvement over prior work.
• The paper does not present a significant contribution or improvement over prior work.
Technical Quality (pick one of the following):
• The paper is technically sound and well-written.
• The paper has minor technical flaws or writing issues that can be easily addressed.
• The paper has significant technical flaws or writing issues that need to be addressed.
• The paper is not technically sound or well-written.

Repeatability, Reproducibility, and Replicability (pick one of the following):
• The paper provides clear and detailed instructions for repeating the experiments by a reader from scratch.
• The paper provides sufficient information for repeating the experiments by a reader, but additional effort may be required.
• The paper provides sufficient information for repeating the experiments by a reader, given access to the authors' code repository.
• The paper cannot be replicated due to missing information or resources.
Impact (pick one of the following):
• The paper has the potential to significantly impact the field.
• The paper has the potential to moderately impact the field.
• The paper has limited potential to impact the field.
• The paper is unlikely to impact the field.
Recommendation (pick one of the following):
• Accept
• Accept with minor revisions
• Accept with major revisions
• Reject
Detailed Comments
Please provide detailed comments on the paper, including strengths and weaknesses and suggestions for improvement.

Confidential Comments
If you have any confidential comments for the program committee or editors, please provide them here.

Overall Evaluation (pick one of the following):
• Excellent
• Good
• Fair
• Poor
Overall Recommendation Justification
Please provide a brief justification for your overall recommendation, including any additional comments on the paper that you have not yet mentioned.
Bibliography
Ahmed, N., Natarajan, T. & Rao, K. R. (1974), Discrete Cosine Transform, IEEE Transactions on Computers. Amershi, S., Cakmak, M., Knox, W. B., Kulesza, T., Lau, T. & Nichols, J. (2019), Guidelines for human-ai interaction, in ‘Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems’, pp. 1–13. Andrews, D. W. K. (1995), ‘A survey of log-linear analysis’, International Statistical Review 63, 5– 24. Aristotle (350 BCE), Aristotle’s Metaphysics, Oxford University Press. Arjovsky, M. & Bottou, L. (2017), ‘Towards principled methods for training generative adversarial networks’, arXiv preprint arXiv:1701.04862. Arnold, V. I. (2016), Neural Networks, World Scientific. Arora, S., Cohen, N., Golowich, N. & Hu, W. (2018), Stronger generalization bounds for deep nets via a compression approach, in ‘Proceedings of the 35th International Conference on Machine Learning’, Vol. 80, PMLR, pp. 254–263. Arpit, D., Jastrzebski, S., Larochelle, H. & Courville, A. C. (2017), ‘A closer look at memorization in deep networks’. Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., GilLopez, S., Molina, D., Benjamins, R. et al. (2020), ‘Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible ai’, Information Fusion 58, 82–115. Asgari, E. & Mofrad, M. R. (2015), ‘Continuous distributed representation of biological sequences for deep proteomics and genomics’, PLoS ONE 10(11), e0141287. Atkins, P. W. (2010), The Laws of Thermodynamics: A Very Short Introduction, Oxford University Press. Bachmann, P. (1894), Analytische Zahlentheorie, Teubner, Leipzig. Available at: http://www. archive.org/details/analytischezahle00bachrich. Bahdanau, D., Cho, K. & Bengio, Y. (2014), ‘Neural machine translation by jointly learning to align and translate’, arXiv preprint arXiv:1409.0473. Baker, M. (2016a), ‘The repeatability crisis in science and its possible remediation’, Nature 533(7604), 452–454. Baker, M. (2016b), ‘Reproducibility crisis’, Nature 533(26), 353–66. Barros, V. & Rousseau, J. (2021), ‘Shortest distance between multiple orbits and generalized fractal dimensions’, Annales Henri Poincarè 22(6), 1853—-1885. Beasley, J. (2002), ‘Tic-tac-toe’, Mathematics Magazine 75(5), 335–338.
Behnke, S. (2001), Learning iterative image reconstruction in the Neural Abstraction Pyramid, PhD thesis, Freie Universität Berlin. Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press. Bengio, Y. (2009), ‘Learning deep architectures for ai’, Foundations and trends in Machine Learning 2(1), 1–127. Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. (2011), ‘Algorithms for hyper-parameter optimization’, NIPS. Bergstra, J. & Bengio, Y. (2012), Random search for hyper-parameter optimization, in ‘Journal of Machine Learning Research’, Vol. 13, pp. 281–305. Bergstra, J., Yamins, D. & Cox, D. (2013), Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, in ‘ICML’. Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999), ‘When is “nearest neighbor” meaningful?’, International conference on database theory pp. 217–235. Binet, A. & Simon, T. (1904), ‘Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux’, L’année Psychologique 11(1), 191–244. Bishop, C. M. (1995), Neural networks for pattern recognition, in ‘Neural information processing systems, 1995. Proceedings of the 1995 IEEE/INNS International Conference on’, IEEE, pp. 261–267. Blum, A. & Rivest, R. (1992), ‘Training a 3-node neural network is np-complete’, Neural Networks 5, 117–127. Board, A. P. (2018), ‘Reproducibility in computing: A community-led approach’, Communications of the ACM 61(5), 33–35. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V. & Kalai, A. T. (2016), Man is to computer programmer as woman is to homemaker? debiasing word embeddings, in ‘Advances in neural information processing systems’, pp. 4349–4357. Boole, G. (1847), ‘The algebra of logic’, Cambridge and Dublin Mathematical Journal 3(Suppl.), 424–433. Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), A training algorithm for optimal margin classifiers, in ‘Proceedings of the Fifth Annual Workshop on Computational Learning Theory’, ACM, pp. 144–152. Bostrom, N. (2014), Superintelligence: Paths, dangers, strategies, Oxford University Press. Bottou, L. (2010), ‘Large-scale machine learning with stochastic gradient descent’, Proceedings of COMPSTAT’2010 pp. 177–186. Bousquet, O. (2004), ‘Machine learning and the dimensionality curse’, Lecture Notes in Computer Science 3176, 7–11. Box, G. E. (1976), ‘Science and statistics’, Journal of the American statistical Association 71(356), 791–799. Box, G. E. (1979), ‘Robustness in the strategy of scientific model building’, Robustness in Statistics pp. 201–236. Breiman, L. (2001), ‘Random forests’, Machine Learning 45(1), 5–32. Brown, T., Mann, B., Ryder, L., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, S., Sainbayar, S., Askell, A., Howard, A. & Ruder, S. (2023), ‘Language models are few-shot learners’, arXiv preprint arXiv:2205.14165. Brynjolfsson, E. & McAfee, A. (2017), ‘The business of artificial intelligence’, Harvard Business Review. Burges, C. J. (2010), ‘From RankNet to LambdaRank to LambdaMart: An overview’, Learning 11(23–581), 81. Carr, N. (2010), The shallows: What the internet is doing to our brains, WW Norton & Company. Carrabs, F., Salani, M. & Bierlaire, M. (2019), ‘Waste collection routing with fixed-size containers’, Transportation Research Part C: Emerging Technologies 104, 67–85. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M. & Elhadad, N. 
(2015), ‘Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission’, Proceedings of the 21st ACM SIGKDD international conference on Knowledge discovery and data mining pp. 1721–1730.
Castagnoli, G., Bräuer, S. & Herrmann, M. (1992), ‘A cyclic redundancy check (CRC) polynomial selection method’, IEEE Transactions on Computers 41(7), 883–892. Chaitin, Gregory (2006), Meta math!: the quest for omega. Vintage. Chakrabati, S. (2005), ‘Axiomatic characterization of the entropy of a random variable’, International Journal of Mathematical Modelling and Scientific Computing 7(1), 1–18. Chalapathy, R., Menon, A. K. & Chawla, S. (2019), ‘A deep learning approach to anomaly detection in network traffic’, IEEE Transactions on Network and Service Management 16(3), 1217–1230. Chen, K.-J., Bai, M.-H. & Huang, C.-R. (2003), Information content estimation of Chinese text, in ‘The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics’, pp. 193–196. Chen, N. & Huang, Y. (2019), ‘A new perspective on data processing inequality’, IEEE Transactions on Information Theory 65(1), 450–460. Chen, T. & Guestrin, C. (2016), ‘Xgboost: A scalable tree boosting system’, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 785–794. Codd, E. F. (1970), ‘A relational model of data for large shared data banks’, Communications of the ACM 13(6), 377–387. Cohen, W. W. (1995), ‘Fast effective rule induction’, Machine learning proceedings 1995 pp. 115– 123. Cook, S. A. (1971), ‘The complexity of theorem-proving procedures’, Proceedings of the Third Annual ACM Symposium on Theory of Computing pp. 151–158. Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. (2009), Introduction to Algorithms, MIT Press. Cortes, C. & Vapnik, V. (1995), ‘Support-vector networks’, Machine learning 20(3), 273–297. Cover, T. M. (1965), ‘Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition’, IEEE Transactions on Electronic Computers EC14(3), 326–334. Cover, T. M. & Thomas, J. A. (2006), Elements of Information Theory, 2nd edn. John Wiley & Sons. Crockford, D. (2008), ‘Utf-8 everywhere’, https://utf8everywhere.org/. Davenport, T. H. & Patil, D. (2012), ‘Data Scientist: The Sexiest Job of the 21st Century’, https:// hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), ‘The expectation-maximization algorithm’, Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2018), ‘Bert: Pre-training of deep bidirectional transformers for language understanding’, arXiv preprint arXiv:1810.04805. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2019), Bert: Pre-training of deep bidirectional transformers for language understanding, in ‘Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)’, pp. 4171–4186. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. et al. (2021), An image is worth 16x16 words: Transformers for image recognition at scale, in ‘Proceedings of the International Conference on Learning Representations (ICLR)’. Einstein, A. (1916), Relativity: The Special and the General Theory, Henry Holt and Company. Epstein, L. & van Stee, R. (2002), ‘Efficient algorithms for the unary bin packing problem’, Theory of Computing Systems 35(2), 139–157. 
Evtimov, I., Eykholt, K., Fernandes, E., Kohno, T., Li, B., Prakash, A., Rahmati, A. & Song, D. (2017), ‘Robust physical-world attacks on machine learning models’, arXiv preprint arXiv:1707.08945 2(3), 4. Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. 1, John Wiley & Sons. Feynman, R. P., Hey, T. & Allen, R. (2018), Feynman Lectures on Computation, CRC Press.
Filmus, Y. (2020), Introduction to Incompressibility Proofs, Springer Nature. Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 222(594–604), 309–368. Frankle, J. & Carbin, M. (2019), The lottery ticket hypothesis: Finding sparse, trainable neural networks, in ‘International Conference on Learning Representations’. French, R. M. (1999), ‘Catastrophic forgetting in connectionist networks’, Trends in cognitive sciences 3(4), 128–135. Frey, C. B. & Osborne, M. A. (2017), ‘The future of employment: How susceptible are jobs to computerisation?’, Technological Forecasting and Social Change. Friedland, G. & Jain, R. (2013), Multimedia computing, Cambridge University Press. Friedland, G., Metere, A., & Krell, M. (2018), ‘A practical approach to sizing neural networks’. arXiv preprint arXiv:1810.02328, October 2018 https://arxiv.org/abs/1810.02328. Friedland, G., Jia, R., Wang, J., Li, B. & Mundhenk, N. (2020), On the impact of perceptual compression on deep learning, in ‘2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)’, pp. 219–224. Friedman, J. H. (2001), ‘Greedy function approximation: A gradient boosting machine’, Annals of Statistics 29(5), 1189–1232. Fukushima, K. (1980), ‘Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position’, Biological cybernetics 36(4), 193–202. Garey, M. R. & Johnson, D. S. (1976), ‘The complexity of near-optimal graph coloring’, Journal of the ACM 23(1), 43–49. Gatys, L. A., Ecker, A. S. & Bethge, M. (2016), ‘Image style transfer using convolutional neural networks’, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 2414–2423. Gilpin, L. H., Bau, D., Yuan, B., Bajwa, A., Specter, M. & Kagal, L. (2018), ‘Explaining explanations: An overview of interpretability of machine learning’, IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) pp. 80–89. Goldberg, D. (1991), ‘What every computer scientist should know about floating-point arithmetic’, ACM Computing Surveys 23(1), 5–48. Gollan, T. H., Montoya, R. I., Cera, C. & Sandoval, T. C. (2009), ‘Reading rate differences between the native languages of bilinguals: New evidence from Spanish-English bilinguals’, Psychonomic bulletin & review 16(6), 1133–1137. Goodfellow, I., Bengio, Y. & Courville, A. (2016a), Deep learning, MIT press. Goodfellow, I. J., Bengio, Y. & Courville, A. (2016b), ‘Cross-entropy loss’, pp. 464–469. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014), ‘Generative adversarial nets’, arXiv preprint arXiv:1406.2661. Grunwald, P. D. (2004), ‘A tutorial introduction to the minimum description length principle’, arXiv preprint math/0406077. Hacking, I. (2006), The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference, Cambridge University Press. Halevy, A., Norvig, P. & Pereira, F. (2009), ‘The unreasonable effectiveness of data’, IEEE Intelligent Systems 24(2), 8–12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009), ‘Weka 3: Data mining software in java’, http://www.cs.waikato.ac.nz/ml/weka/. Hamming, R. W. (1989), Digital Filters, Prentice-Hall. Harris, F. J. (1978), Window Functions and Their Applications, Proceedings of the IEEE. Haussler, D. 
(1995), ‘Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension’, Journal of Combinatorial Theory, Series A 69(2), 217–232. Hayder, N. (2022), Learning rate estimation for stochastic gradient descent, Master’s thesis, University of California, Berkeley. Haykin, S. (1994), Neural networks: a comprehensive foundation, Prentice Hall. Hazenberg, R. H. & Hulstijn, J. H. (1996), ‘Vocabulary size and reading comprehension in a second language: a study of Dutch university students’, Language learning 46(3), 519–552.
He, H. & Garcia, E. A. (2009), ‘Learning from imbalanced data’, IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. (2020), ‘Momentum contrast for unsupervised visual representation learning’, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 9729–9738. He, K., Zhang, X., Ren, S. & Sun, J. (2015), ‘Deep residual learning for image recognition’, arXiv preprint arXiv:1512.03385. Herlocker, J. L., Konstan, J. A., Borchers, A. & Riedl, J. (2004), ‘An experimental comparison of several collaborative filtering algorithms’, ACM Transactions on Information Systems 22(5), 179–206. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. (2017), ‘Gans trained by a two time-scale update rule converge to a local Nash equilibrium’, arXiv preprint arXiv:1706.08500. Hilbert, D. (1902), Mathematische Probleme, Göttinger Nachrichten. Hinton, G. E. & Salakhutdinov, R. R. (2006), ‘Reducing the dimensionality of data with neural networks’, Science 313(5786), 504–507. Hirsh, D. & Nation, I. S. P. (1992), ‘How much vocabulary is needed for reading comprehension in English?’, The Modern Language Journal 76(3), 200–207. Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural computation 9(8), 1735–1780. Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., & Tygar, J. (2011), Adversarial machine learning. in ‘Proceedings of the 4th ACM workshop on Security and Artificial Intelligence’, ACM, pp. 43–58. Huffman, D. A. (1952), ‘A method for the construction of minimum-redundancy codes’, Proceedings of the IRE 40(9), 1098–1101. Hwang, K. (1989), ‘The magical number 1000’, English Teaching Forum 27(1), 22–25. Hyafil, L. & Rivest, R. L. (1976), Constructing optimal binary decision trees is np complete, in ‘Proceedings of the 5th International Colloquium on Automata, Languages, and Programming’, Springer, pp. 15–28. Indyk, P. & Motwani, R. (1998), Approximate nearest neighbors: Towards removing the curse of dimensionality, in ‘Proceedings of the thirtieth annual ACM symposium on Theory of computing’, ACM, pp. 604–613. Ito, S.-i. (2020), ‘Fundamental limits of energy and information’, IEEE Transactions on Information Theory 66(9), 5696–5714. Japkowicz, N. & Stephen, S. (2002), ‘The class imbalance problem: A systematic study’, Intelligent Data Analysis 6(5), 429–449. Jin, D., Jin, Z., Zhou, J. T. & Szolovits, P. (2019), Generating natural language adversarial examples through probability weighted word saliency, in ‘Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies’, Association for Computational Linguistics, pp. 587–596. Jolliffe, I. T. (2002), Principal component analysis, Springer. Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A. et al. (2017), ‘In-datacenter performance analysis of a tensor processing unit’, Proceedings of the 44th Annual International Symposium on Computer Architecture pp. 1–12. Karras, T., Aila, T., Laine, S. & Lehtinen, J. (2018), ‘Progressive growing of GANs for improved quality, stability, and variation’, arXiv preprint arXiv:1710.10196. Karras, T., Laine, S. & Aila, T. (2019), ‘A style-based generator architecture for generative adversarial networks’, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 4401–4410. Kelly, J. L. 
(1956), A New Interpretation of Information Rate, Bell System Technical Journal. Kleinrock, L. (1987), ‘Exponential decay’, Queueing systems 2(1), 1–32. Kolmogorov, A. N. (1933), ‘On the law of large numbers’, Mathematical Notes 1(2), 103–111.
Kolmogorov, A. N. & Arnold, V. I. (2019), Real Analysis: Measure Theory, Integration, and Hilbert Spaces, Springer. Krawczyk, B. (2016), ‘Addressing the class imbalance problem in medical datasets’, International Journal of Medical Informatics 96, 266–280. Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012), ImageNet classification with deep convolutional neural networks, in ‘Advances in Neural Information Processing Systems’, pp. 1097–1105. Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2017), ‘ImageNet classification with deep convolutional neural networks’, Communications of the ACM 60(6), 84–90. Landa, H. M. (1958), ‘The second law of thermodynamics’, Encyclopedia of Physics 3, 175–192. Landau, E. (1909), Handbuch der Lehre von der Verteilung der Primzahlen, Chelsea Publishing Company, New York. Available at: https://www.hathitrust.org/Record/000451502. Le Gall, D. (1991), ‘Mpeg: A video compression standard for multimedia applications’, Communications of the ACM 34(4), 46–58. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998), ‘Gradient-based learning applied to document recognition’, Proceedings of the IEEE 86(11), 2278–2324. Leibniz, G. W. (1703), ‘Explication de l’arithmétique binaire’, Sämtliche Schriften und Briefe 4(2), 224–228. Leonelli, S. (2016), Data-centric biology: A philosophical study, University of Chicago Press. Leung, C.-W. & Wei, C. (2014), ‘A multi-objective genetic algorithm for container loading problem’, Applied Soft Computing 19, 66–76. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. & Zettlemoyer, L. (2019), ‘Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension’, arXiv preprint arXiv:1910.13461. Li, M. & Vitányi, P. M. (2008), An Introduction to Kolmogorov Complexity and Its Applications, Springer Science & Business Media. Li, M. & Vitányi, P. M. B. (1997), An Introduction to Kolmogorov Complexity and Its Applications, Springer. Lim, B. & Zohren, S. (2021), ‘Time-series forecasting with deep learning: a survey’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 379(2246), 20200209. Lin, S. & Costello, D. J. (1994), ‘Information theory and coding’, Encyclopedia of Electrical and Electronics Engineering 7, 397–404. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.-J. & Le, Q. V. (2018), ‘Progressive neural architecture search’, ECCV. Lorenz, Edward (1972), Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?. Transcript of a lecture given to the 139th meeting of the American Association for the Advancement of Science, Washington, DC, USA Lou, Y., Caruana, R. & Gehrke, J. (2012), ‘Intelligible models for classification and regression’, Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining pp. 150–158. Lundberg, S. & Lee, S.-I. (2017), ‘A unified approach to interpreting model predictions’, Advances in Neural Information Processing Systems 30, 4765–4774. Lyapunov, A. M. (1892), ‘The general problem of the stability of motion’, Annals of mathematics pp. 215–247. MacKay, D. J. (2006), ‘Claude E. Shannon: A retrospective’, IEEE Communications Magazine 44(3), 140–146. MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press, New York, NY, USA. Mandelbrot, B. (1953), ‘An information theory of the statistical structure of language’, Communication Theory 84(2), 486–502. Mandelbrot, B. B. 
(1967), ‘How long is the coast of Britain? statistical self-similarity and fractional dimension’, Science 156(3775), 636–638. Mandelbrot, B. B. (1982), ‘The fractal geometry of nature’, Freeman.
Melville, H. (1851), Moby-Dick; or, The Whale, Harper & Brothers. Mieno, M. N., Tanaka, N., Arai, T., Kawahara, T., Kuchiba, A., Ishikawa, S. & Sawabe, M. (2016), ‘Accuracy of death certificates and assessment of factors for misclassification of underlying cause of death’, Journal of epidemiology 26(4), 191–198. Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013), ‘Efficient estimation of word representations in vector space’, arXiv preprint arXiv:1301.3781. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. (2013), Distributed representations of words and phrases and their compositionality, in ‘Advances in neural information processing systems’, pp. 3111–3119. Miller, G. A. (1951), ‘The information capacity of the English language’, IRE Transactions on Information Theory 7(3), 194–206. Minsky, M. (1967), Unsolvability of the halting problem, Computation: Finite and Infinite Machines, Prentice-Hall. Minsky, M. & Papert, S. (1969), Perceptrons: An introduction to computational geometry, MIT Press. Mirza, M. & Osindero, S. (2014), ‘Conditional generative adversarial nets’, arXiv preprint arXiv:1411.1784. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015), ‘Human-level control through deep reinforcement learning’, Nature 518(7540), 529–533. Mohr, P. J., Newell, D. B. & Taylor, B. N. (2018), ‘Input data for the special codata-2017 adjustment’, Metrologia 55(1), 1–10. Newton, I. (1726a), Philosophiæ Naturalis Principia Mathematica, 3 edn, William and John Innys. Newton, I. (1726b), Philosophiae Naturalis Principia Mathematica, J. Tonson. Nocedal, J. & Wright, S. J. (2006), Numerical Optimization, Springer Series in Operations Research and Financial Engineering, 2nd edn, Springer, New York. Norman, D. A. (2013), Design principles for everyday cognition, in ‘Persuasive technology’, Springer, pp. 61–76. Norris, F. H., Stevens, S. P., Pfefferbaum, B., Wyche, K. F. & Pfefferbaum, R. L. (2002), ‘Resilience: A consensus statement’, Psychiatry: Interpersonal and Biological Processes 65(4), 290–297. OpenAI (2022), ‘Dall-e 2: An image generator that uses text descriptions’, arXiv preprint arXiv:2201.07285. Pan, S. J. & Yang, Q. (2009), ‘A survey on transfer learning’, IEEE Transactions on knowledge and data engineering 22(10), 1345–1359. Papadimitriou, C. H. (1994), Computational Complexity, Addison-Wesley, Reading, MA. Pareto, V. (1896), ‘Cours d’économie politique’, Lausanne. Patterson, D. A. & Ditzel, D. R. (1980), ‘A case for the reduced instruction set computer’, ACM SIGARCH Computer Architecture News 8(2), 25–33. Pearl, J. (2009), Causality: Models, reasoning and inference, 2nd edn, Cambridge University Press. Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2(6), 559–572. Pennington, J., Socher, R. & Manning, C. D. (2014), Glove: Global vectors for word representation, in ‘Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)’, pp. 1532–1543. Perline, R. (2018), ‘Strong universality of Zipf’s law from the basic principles of information theory’, Physica A: Statistical Mechanics and its Applications 492, 152–161. Pham, H., Guan, M. Y., Zoph, B., Le, Q. V. & Dean, J. (2018), Efficient neural architecture search via parameter sharing, in ‘ICML’. Piantadosi, S. T. 
(2014), ‘Zipf’s word frequency law in natural language: A critical review and future directions’, Psychonomic Bulletin& Review 21(5), 1112–1130. Pieraccini, R. (2012), There Is No Data like More Data, in ‘The Voice in the Machine: Building Computers That Understand Speech’, The MIT Press. Pitman, J. (1996), Chinese Restaurant Process, University of California, Berkeley.
Platt, J. (1998), ‘Sequential minimal optimization: A fast algorithm for training support vector machines’. Pollack, J. B. (1990), Recursive distributed representations, in ‘Proceedings of the 1990 conference on Advances in neural information processing systems’, Morgan Kaufmann Publishers Inc., pp. 527–534. Prechelt, L. (1998), Automatic early stopping using cross validation: quantifying the criteria, in ‘Neural Networks: Tricks of the Trade’, Springer, pp. 55–69. Quine, W. V. O. (1940), Mathematical Logic, Harvard University Press, Cambridge, MA. Quinlan, J. R. (1986), ‘Induction of decision trees’, Machine learning 1(1), 81–106. Quinlan, J. R. (1993), C4.5: Programs for machine learning, Technical report, Morgan Kaufmann. Rabiner, L. R. (1989), ‘A tutorial on hidden Markov models and selected applications in speech recognition’, Proceedings of the IEEE 77(2), 257–286. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. (2018), ‘Improving language understanding by generative pre-training’. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. (2019), ‘Exploring the limits of transfer learning with a unified text-to-text transformer’, arXiv preprint arXiv:1910.10683. Ramsauer, S. & Schafle, B. (2020), All you need is Hopfield networks, in ‘Advances in Neural Information Processing Systems’, pp. 7855–7864. Reinelt, G. (1994), The Traveling Salesman: Computational Solutions for TSP Applications, Vol. 840 of Lecture Notes in Computer Science, Springer-Verlag, Berlin. Ribeiro, M. T., Singh, S. & Guestrin, C. (2016a), ‘Model-agnostic interpretability of machine learning’, arXiv preprint arXiv:1606.05386. Ribeiro, M. T., Singh, S. & Guestrin, C. (2016b), “why should i trust you?”: Explaining the predictions of any classifier, in ‘Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining’, pp. 1135–1144. Roberts, G. O. & Varadhan, S. R. S. (1997), ‘Convergence of probability measures’, SpringerVerlag. Rogers, E. M. (2003), Diffusion of Innovations, 5 edn, Free Press. Rojas, R. (1993), Neural Networks: A Systematic Introduction, Springer. Rosenberg, N. (1994), Exploring the Black Box: Technology, Economics, and History, Cambridge University Press. Rosenblatt, F. (1958), ‘The perceptron: A probabilistic model for information storage and organization in the brain’, Psychological Review 65, 386–408. Ross, S. M. (2014), A First Course in Probability, Pearson. Ruder, S. (2016), An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), ‘Learning representations by backpropagating errors’, Nature 323(6088), 533–536. Saaty, T. L. (2008), ‘Relative measurement and its generalization in decision making why pairwise comparisons are central in mathematics for the measurement of intangible factors the analytic hierarchy/network process’, European journal of operational research 169(3), 687–691. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016), ‘Improved techniques for training gans’, arXiv preprint arXiv:1606.03498. Salton, G., Wong, A. & Yang, C. S. (1975), ‘A vector space model for automatic indexing’, Communications of the ACM 18(11), 613–620. Scherer, D., Müller, A. & Behnke, S. (2010), Evaluation of pooling operations in convolutional architectures for object recognition, in ‘International Conference on Artificial Neural Networks’, Springer, pp. 92–101. Schläfli, L. 
(1852), ‘Theorie der vielfachen kontinuität’, Journal für die reine und angewandte Mathematik 44, 242–286. Schmidt, M. & Lipson, H. (2009), ‘Distilling free-form natural laws from experimental data’, Science 324(5923), 81–85.
Schulz, H. & Behnke, S. (2012), Deep learning in neural networks: An overview, in ‘KI 2012: Advances in Artificial Intelligence’, Springer, pp. 57–72. Schwartz, R., Dodge, J., Smith, N. A. & Etzioni, O. (2019), ‘Green ai’, arXiv preprint arXiv:1907.10597. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. & Batra, D. (2017), ‘Grad-cam: Visual explanations from deep networks via gradient-based localization’, Proceedings of the IEEE international conference on computer vision pp. 618–626. Shafranovich, Y. (2005), ‘Common format and mime type for comma-separated values (csv) files’, https://tools.ietf.org/html/rfc4180. Shannon, C. E. (1948a), ‘A symbolic analysis of relay and switching circuits’, Bell System Technical Journal 27(3), 379–423. Shannon, C. E. (1948b), ‘The Bell System Technical Journal’, A mathematical theory of communication 27, 379–423. Shannon, C. E. (1950), ‘Programming a computer for playing chess’, Philosophical Magazine 41(314), 256–275. Shannon, C. E. (1951), ‘Prediction and entropy of printed english’, Bell Labs Technical Journal 30(1), 50–64. Simonyan, K., Vedaldi, A. & Zisserman, A. (2013), ‘Deep inside convolutional networks: Visualising image classification models and saliency maps’, arXiv:1312.6034. Singhal, A., Buckley, C. & Mitra, M. (1996), ‘Length normalization in degraded text collections’, SIGIR Forum 21(4), 71–84. Sipser, M. (1996), ‘Introduction to the theory of computation’, ACM Sigact News 27(1). Snoek, J., Larochelle, H. & Adams, R. P. (2012), Practical Bayesian optimization of machine learning algorithms, in ‘Advances in Neural Information Processing Systems’, pp. 2951–2959. So, D. R., Liang, C., Huang, Q. V. & Lee, A. (2019), Evolved transformer for accurate language generation, in ‘ICLR’. Sober, E. (1984), ‘The law of parsimony and some of its applications’, The American Philosophical Quarterly 21(2), 121–129. Song, C., Ristenpart, T. & Shmatikov, V. (2017), ‘Machine learning models that remember too much’, arXiv preprint arXiv:1709.07886. Springenberg, J. T., Dosovitskiy, A., Brox, T. & Riedmiller, M. (2014), ‘Striving for simplicity: The all convolutional net’, arXiv preprint arXiv:1412.6806. Stein, E. M. & Shakarchi, R. (2003), Fourier Analysis: An Introduction, Princeton University Press. Stewart, J. (2016), Calculus: Early Transcendentals, 8 edn, Cengage Learning. Strubell, E., Ganesh, A. & McCallum, A. (2019), Energy and policy considerations for deep learning in NLP, in ‘Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics’. Sundararajan, M., Taly, A. & Yan, Q. (2017), ‘Axiomatic attribution for deep networks’, Proceedings of the 34th International Conference on Machine Learning 70. Surden, H. (2014), ‘Machine learning and law’, Wash. L. Rev. 89, 87. Sutarsyah, S., Nation, I. S. P. & Kennedy, G. (1994), ‘Vocabulary size and reading comprehension in Indonesian’, Reading in a Foreign Language 11(2), 347–363. Sutskever, I., Martens, J., Dahl, G. & Hinton, G. (2013), ‘On the importance of initialization and momentum in deep learning’, Proceedings of the 30th International Conference on Machine Learning (ICML-13) 28, 1139–1147. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. & Fergus, R. (2013a), ‘Intriguing properties of neural networks’, arXiv preprint arXiv:1312.6199. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. & Fergus, R. (2013b), ‘Intriguing properties of neural networks’, arXiv preprint arXiv:1312.6199. Taylor, S. E. & Fiske, S. 
T. T. (2020), ‘Social cognition: From brains to culture’, Social cognition pp. 1–672. Terman, L. M. (1986), The Stanford-Binet intelligence scale, 4 edn, Riverside Publishing Company. Thorburn, W. M. (1915), ‘Occam’s razor’.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288. Tipping, M. E. (2001), ‘Sparse Bayesian learning and the relevance vector machine’, Journal of machine learning research 1(Jun), 211–244. Turing, A. M. (1936), ‘On computable numbers, with an application to the entscheidungsproblem’, Proceedings of the London Mathematical Society 42(1), 230–265. Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer-Verlag. Vapnik, V. (2000), The Nature of Statistical Learning Theory, Springer Science & Business Media. Vapnik, V. (2013), The nature of statistical learning theory, Springer Science & Business Media. Vapnik, V. N. & Chervonenkis, A. Y. (1971), ‘On the uniform convergence of relative frequencies of events to their probabilities’, Theory of Probability & Its Applications 16(2), 264–280. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. & Polosukhin, I. (2017), Attention is all you need, in ‘Proceedings of the 31st International Conference on Neural Information Processing Systems’, Curran Associates Inc., pp. 6000– 6010. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2010), ‘Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion’, Journal of Machine Learning Research 11(12), 3371–3408. Virgolin, M. & Pissis, S. P. (2022), ‘Symbolic regression is np-hard’, arXiv preprint arXiv:2207.01018. von Helmholtz, H. (1882), ‘Die thermodynamik chemischer vorgänge’, Annalen der Physik und Chemie 22(10), 665–697. Wachter, S., Mittelstadt, B. & Russell, C. (2017), ‘Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law & Technology 31(2), 841–887. Wallace, G. K. (1991), ‘A standard for the compression of digital images’, Communications of the ACM 34(4), 30–44. Watanabe, S. (1960), ‘Information theoretical analysis of multivariate correlation’, IBM Journal of Research and Development 4(1), 66–82. Weizenbaum, J. (1966), ELIZA - A Computer Program for the Study of Natural Language Communication between Man and Machine, Communications of the ACM. Widrow, B. & Hoff, M. E. (1960), ‘Adaptive switching circuits’, IRE WESCON Convention Record pp. 96–104. Wolpert, D. H. & Macready, W. G. (1996), ‘The lack of a priori distinctions between learning algorithms’, Neural computation 8(7), 1341–1390. Wu, J., Leng, C., Wang, Y. & Hu, Q. (2019), ‘An introduction to GPUs for deep learning’, Foundations and Trends® in Machine Learning 13(4), 351–418. Wyszecki, G. & Stiles, W. S. (1982), Color Science: Concepts and Methods, Quantitative Data and Formulae, John Wiley& Sons. Yin, X., Sedighi, B., Varga, M., Ercsey-Ravasz, M., Toroczkai, Z. & Hu, X. S. (2017), ‘Efficient analog circuits for Boolean satisfiability’, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26(1), 155–167. Zeiler, M. D. & Fergus, R. (2014), ‘Visualizing and understanding convolutional networks’, pp. 818–833. Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. (2018), ‘mixup: Beyond empirical risk minimization’, arXiv preprint arXiv:1710.09412. Zipf, G. K. (1935), ‘The meaning-frequency relationship in written english’, Journal of the American Statistical Association 30(191), 369–380. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. (2018), Learning transferable architectures for scalable image recognition, in ‘CVPR’. Zuse, K. 
(1970), ‘Calculating space’, Datamation 16(5), 19–22. Zuse, K. (1993), ‘The computer – my life’, Springer. Zwicker, E. & Fastl, H. (1999), Psychoacoustics: Facts and Models, Springer Science& Business Media.
Index
A
Abu-Mostafa, Yaser, 71
Accepting state, 15–18
Accuracy, v, 13, 17, 18, 20, 21, 24, 28–30, 40, 45–49, 53, 66, 86, 88–90, 94, 101, 110, 128, 148, 152, 156, 162, 167–169, 176, 178, 180–182, 184, 186, 190, 191, 203–206, 209, 210, 214, 215
Add-one-in test, 186–187
Adversarial examples, 92–93, 183, 194, 222, 225
All Models with the same MEC are Equally Capable, 79–80
Alphabet, 2, 15–17, 21, 103, 104, 157, 189, 232, 233, 238, 244
Annotation, 11, 14, 21, 147–149, 163, 164, 169, 181
Area under the curve (AUC), 47
Aristotle, 3, 4, 7, 22, 106
Arnold, V.I., 119
Associative memory, 24, 144
Atomicity, 12
Attention, 17, 131, 132, 134, 136, 160
Attribute ranking, 191–193
Autoencoder, 129–130
Automated model building, 14–21
Automation, 7, 9, 10, 14, 222
AutoML, 212
Average model resilience, 91

B
Backpropagation, 40–42, 51, 117, 127, 194
Best-guess accuracy, 17, 18, 46, 167, 190
Bias, v, 11, 20, 38, 39, 43, 46, 50, 78, 107, 114, 116–118, 121, 139, 147, 148, 163, 167, 169, 173, 182, 184–188, 192, 196, 197, 222
Bias measurement, 188
Bit, 11, 59, 73, 90, 97, 113, 125, 140, 150, 175, 181, 195, 213, 232
Black box, v, 6, 23–52, 167, 178, 189, 199, 214
Blum, A., 117, 174, 210
Breakout, 26
C
Capacity, 7, 35, 73, 89, 97, 113, 123, 137, 159, 173, 187, 190, 205, 211, 223
Capacity in parallel, 115, 116
Capacity in series, 115, 116
Capacity of neural network, 113–122
Capacity of neuron, 39, 40, 76, 78, 113–116, 219
Causality, 10, 105–108, 184, 197
Chaos, 52, 149–152
Chaos theory, 20, 52, 149
Chaotic experiment, 151
Cheating experiments, 187
Cheat sheet, 7, 249
Chervonenkis, A., 42, 236
Class bias, 185, 186
Classification, 15, 17, 19, 23–26, 29, 35, 42, 43, 46, 48, 49, 51, 64, 67, 68, 70, 76, 77, 79–82, 85–90, 92, 94, 114, 116, 117, 120–122, 124, 135, 137–139, 149, 152, 154, 156, 161, 165, 167, 169, 181, 212, 222, 236, 239
Classification model, 81, 82, 92
Clustering, 23, 24, 51, 52, 144, 149, 153
Complexity, v, 2, 5, 7, 15, 28, 29, 50, 62–64, 73, 74, 79, 80, 82, 90–92, 99, 100, 102, 104, 109, 118, 120, 124, 129, 134, 139, 142, 145, 163, 172, 174, 191, 192, 198, 210, 211, 213, 214, 231–247
Compression scheme, 100, 101
Conclusion, 3, 4, 9, 14, 57, 75, 78, 88, 106, 203, 211, 214, 235, 246
Confusion matrix, 48, 148
Conservation of computational complexity, 62
Consistency with science, 59
Constant values, 164
Contradictions, 93, 111, 164, 183, 206, 227, 240
Contraposition, 106, 142
Convolutional layer generalization, 125
Convolutional neural networks (CNNs), 123–127, 135, 161, 196, 212
Cook, S.A., 243, 244
Correlation, 70, 105–108, 186, 187, 193, 238
Cost function, 31, 45, 237
Cover’s theorem, 215
Cover, T.M., 76, 80, 81, 138, 139, 215
Creativity, 97, 129, 182
Curve entropy, 66
Cycle, 5, 6, 33, 155–157, 160, 202
Cyprien Théodore Tiffereau, 49
D
Data, 1, 10, 24, 55, 73, 86, 96, 114, 123, 137, 147, 171, 179, 191, 201, 211, 222
Database keys, 148, 164
Data collection, 10–11, 147–169, 181, 186, 205
Data drift, 91, 182
Data processing inequality, 96–98, 127, 217, 224
Data science, v, vi, 1, 2, 5, 6, 171, 190, 201
Data table, 11–14, 17–19, 24, 30, 31, 33, 35, 79, 114, 126, 139–142, 149, 151, 154–156, 158, 162–166, 174, 182, 191, 209, 244
Data validation, 152, 154, 163–165
Decisions, 1, 6, 29, 35, 38–40, 42–44, 49, 50, 53, 61–63, 67, 78, 80, 85, 87, 89–92, 100, 101, 104, 116, 125, 126, 137–141, 149, 162–164, 172, 178, 181, 186, 189, 192, 197, 216, 221–225, 233, 240, 241, 245, 246
Decision tree, 21, 29, 32–37, 49, 67, 80, 90, 139–143, 159, 160, 194–196, 198, 210, 216
Deep learning, 123–127, 135, 196, 211, 214
Description length, 21, 61–63, 68, 69, 100, 102–104, 109, 233, 235, 244, 245
Detection, 23–25, 46–47, 81, 130, 199
Detection error trade-off (DET), 47, 48
Detection tasks, 46, 193
Determinism, 56
Diamonds, 242
Dimension, 12, 30, 31, 33, 41, 42, 64–67, 69, 76, 78, 79, 81, 86, 90, 117, 126, 131, 132, 139, 141, 150–152, 155, 159, 174, 209, 216–218, 235, 236
Discrete probability space, 57

E
Early stopping, 211, 218
Ensembling, 35, 52, 139, 142–143, 197
Equal error rate, 47
Equilibrium machine learning, 81, 174–178
Equilibrium uncertainty, 59–60
Equivalence of Digit and Dimension, 216
Error metrics, 39, 42, 45–49, 51, 169
Evocative models, 110, 190, 206, 207, 246
Experiment, 3, 9, 49, 53, 73, 94, 102, 135, 139, 148, 184, 195, 201, 213, 235
Explainability, 189–199, 209, 211, 212

F
False alarm rate, 47
False negative, 47
False positive, 47, 193
Falsifiability, 5
Fear, 221–222
Fewer parameters, 198
Feynman, R., vii, 239, 240
Finite state machine, 7, 14–22, 75, 88–89, 99, 167
Finite State Machine Generalization, 20, 94
F-measure, 47
Forecasting, 26, 51, 52, 154
Fractal dimension, 41, 64–67, 78, 90, 138, 159, 169, 216, 218, 235
Functional Assumption of Supervised Machine Learning, 13
Function counting theorem, 76, 83, 138, 139
G
Galileo, 3, 4
Generalization, 4, 7, 16–21, 24, 28–31, 50, 58, 70, 79, 82, 85–95, 110, 113, 125, 126, 141–142, 168, 172–174, 176, 180, 182, 183, 188, 190, 191, 196, 209, 214, 218, 224, 233, 237
Generalization (Classification), 89
Generalization (Regression), 90
Generative adversarial network (GAN), 127–129, 135, 218
Genetic programming, 44–45, 51, 143, 195, 198
Goodfellow, I., 38, 39, 127, 130, 135, 136, 168
Graphics processing unit (GPU), 211
H
Hadamard, J., 19
Halting problem, 15, 240–241
Hard conditions, 163–164
Heatmapping, 192–194
Helmholtz free energy, 242
High entropy target, 165
Hilbert, D., 119
Hopfield networks, 144–145
Human process, 10, 13, 49, 110
Hype, 221–222
Hypothesis, 3–5, 9, 10, 92, 189, 217, 236
I
Imbalanced classes, 46, 51, 118, 167, 175
Imbalanced data, 167–169
Independence, vi, 124, 237
Independent and identically distributed, 28–29
Information, 1, 25, 53, 73, 86, 96, 113, 123, 138, 147, 171, 184, 191, 201, 209, 222
Information capacity, 73, 76, 80–82, 90, 113, 114, 116, 121, 122, 139, 176
Information capacity of a linear-separator classification model, 81
Information theory, vi, vii, 6, 7, 53–70, 77, 83, 94, 113, 167, 233
Initial state, 15, 16
Instance, 7, 12, 14, 16, 29, 46, 48, 57, 64, 82, 85, 86, 88–90, 92, 94, 121, 128, 133, 148, 152, 154, 158, 167–169, 172, 173, 178, 183, 193, 194, 196, 210, 218, 222, 223, 236, 246
Intellectual capacity, 50, 74–80, 143, 205, 236
Intelligence, vii, 5, 38, 74, 75, 80, 97, 104, 111, 197
Inter-annotator agreement, 148, 169
J
Joint entropy, 68
Julia set, 234
K
k-means, 144
k-nearest neighbors, 30, 31, 137–138
Kolmogorov, A., 56, 100, 119, 145, 233–236
Kolmogorov complexity, 100, 233–235
Kolmogorov complexity is uncomputable, 234
L
Label, 14, 25, 28, 31, 34, 36, 37, 40, 42, 64, 77, 86, 103, 117, 118, 137, 139, 142, 144, 168, 173, 174, 176, 182, 215, 236
Leave-one-out test, 187
Less than two classes, 164
Linear regression, 14, 31–32, 49, 88, 190, 195, 196
Logarithm, 58–60, 62, 66, 79, 150, 151, 158, 169, 175, 213, 214, 227–229, 238
Lossless compression, 63, 101–103, 234
Low entropy columns, 165
Lyapunov exponents, 150
M
Machine Learning operations (MLOps), 179, 180, 182, 183, 188
MacKay, D., vii, 77–78, 81, 143–145, 231
Mandelbrot, B., 64, 65, 156–159, 234, 235
Mandelbrot set, 234, 235
Markov chain, 96, 97
Maximum output, 83, 115
Mean squared error (MSE), 31, 32, 48, 51, 130
Mediocrity, 110
Memory-equivalent capacity (MEC), 7, 78–82, 89, 90, 113–115, 117, 118, 120, 123–127, 134, 137–145, 159, 162, 173–176, 178, 187, 190, 198, 205, 207, 211, 219, 223, 231, 235, 236, 240, 244
Memory-equivalent capacity of an ensemble, 142
Minkowski distance, 30
Minsky, M., 75, 77, 216, 217, 241
Missing values, 182
Miss rate, 46, 47
Model bias, 184–187
Model capacity, 71, 75, 212
Model resilience, 91
Model selection, 52, 212
Monotonicity, 237
Multimedia, 162
Multimodal data, 162–163
Multiplicativity of generalization, 126
Mutual information, 68–70, 73, 78, 79, 82, 106, 107, 115, 116, 186, 187, 191, 192, 236, 238, 240
N
Nearest neighbor generalization, 86
Nearest neighbors, 30–31, 86, 87, 89, 93, 138, 214
Neural architectures, 113, 134
Neural networks, 7, 15, 21, 29, 38–42, 44, 75, 77, 80, 90, 113–120, 123–135, 144, 161, 174, 193, 196, 210–212, 214, 215, 218, 219
No data, like more data, 171
No Free Lunch, 108–109, 111
Noise, 11, 20, 91, 92, 97, 99, 119, 127–150, 152, 155, 161, 162, 172, 182, 194, 218, 232, 240
No universal, lossless compression, 102
NP-complete, 117, 174, 210
Number of binary digits, 60
Numerization, 152, 165–166
O
Observation, 3, 5–7, 9, 18, 25, 53, 55, 64, 68, 93, 96, 105–108, 110, 138, 152, 154, 155, 157, 172, 182, 184, 189, 190, 202, 217, 238, 240
Occam’s Razor, vi, 5, 20, 92, 95, 109, 190–191
Ockham, 5
O-notation, 231–233, 243, 244
Out-of-range columns, 165
Overelaboration, 110
P
Packing problem, 209–211, 214, 215, 217, 219
Papert, S., 75, 217
Parsimony, 92, 110
Perceptron, 39, 40, 75, 76, 117, 214, 215
Perceptron learning, 39–40, 117, 214, 215
Perceptual data, 160–162
Phenomenon, 3–5, 9–12, 14, 45, 109, 148, 157, 203, 206, 207, 218
Physical work, 62, 63, 239
Pigeonhole principle, 100–102
Poor person’s generalization, 86
Prediction, 3, 6, 7, 13, 19, 21, 29–31, 35, 48, 49, 54–56, 77, 86, 88–92, 108, 148, 153, 155, 161, 167, 172, 177, 179–182, 184, 185, 187, 189, 192–198, 202, 203, 222
Predictor, 48, 179–184, 187, 192, 193
Prime numbers, 102–103
Privacy, 101, 187–188, 221, 222
Probabilistic equilibrium, 47, 59, 98, 184, 186
Probability, 2, 17, 28, 54, 88, 96, 115, 128, 157, 184, 215, 237
P vs. NP, 243–247
Q
Quality assurance, 50, 92, 180–184
Quantization, 23–25
Questions, 2, 9–12, 21, 31, 46, 53, 54, 60, 61, 77, 78, 86, 97, 106–108, 110, 113, 114, 134, 140, 143, 147, 149, 155, 165, 169, 172, 182, 186, 187, 189, 197, 206, 224, 235, 237, 238, 240, 243, 244, 247
R
Random forest, 35–38
Random number sequence, 98
Rectified Linear Unit (ReLU), 39, 115, 116
Redundancy, 103, 117, 164, 205
Regression, 23, 25, 26, 29, 35, 42, 45, 46, 48–49, 64, 66, 68–70, 79, 85–87, 90, 92, 116, 120, 137, 148, 149, 154, 165, 194, 210, 236
Regression accuracy, 49
Regression networks, 120
Regression tests, 183–184
Reinforcement learning, 25, 26
Rényi, A., 237, 238
Rényi entropy, 58, 237, 238
Repeatability, 55, 57, 189, 201–206, 251
Repeatable result, 55
Replicability, 204, 207, 251
Reproducibility, 55, 137, 189, 201–206, 209, 211, 212, 251–252
Reproducible result, 55, 202, 204
Residual networks, 111, 116, 126, 135
Resilience, v, 29, 91–92, 110, 182, 194
Review form, 202, 251–252
Rivest, R., 117, 174, 210
Role of the human, 9–14
Rosenblatt, F., 39, 76
S
Sample space, 54, 57, 58, 63, 246
SAT problem, 243, 246
Science, v–vii, 1–3, 5–7, 95, 97, 119, 161, 175, 184, 201, 203, 205–207, 210, 219
Scientific method, 3, 5, 6, 9, 53, 109
Security, 187–188, 222
Self-information, 61, 89, 185, 237
Shannon, C., 53, 58, 60, 68, 77, 80, 104, 105, 107, 135, 233, 237
Shannon entropy, 2, 35, 61, 82, 116, 167, 232, 233, 235, 237–239
Signal-to-noise (SNR), 161–162
Simplification, 5, 102
SNR ratio, 161, 162
Societal reaction, 221–222
Soft conditions, 164–165
Software engineering, 91, 202
Solution, 11, 19, 27, 35, 38, 40, 43, 45, 62, 63, 76–77, 82, 97, 113, 119, 126, 143, 149, 155–157, 161, 164, 172, 181, 192, 206, 207, 209, 211, 214, 219, 227, 237, 239, 243
State-transition, 15, 17–20, 88, 89, 91, 93
1st Normal Form (1NF), 12
String complexity, 232, 233
Support vector machine (SVM), 29, 42–44, 116, 138–139
Surprise, 59, 113, 149
Synthetic data, 129, 177, 181–182
T
Table cell atomicity, 12
Technological diffusion, 223
Tensor processing unit (TPU), 211
Time-series data, 154–156
Topological concerns, 119–120
Total correlation, 70
Total operator characteristic (TOC), 47
Training/validation split, 27–28, 173
Transfer learning, 212, 214
Transformer, 130–134, 212
True positive, 47, 169
U
Uncertainty, 1, 3, 53–62, 69, 91, 92, 97, 106, 107, 147, 148, 171–173, 184–186, 205, 209, 210, 231, 239–243
Unit testing, 147, 180
Universal compression, 101, 102

V
Vapnik, V., 42, 236
Vaswani, A., 130, 132, 134
VC dimension, 42, 79, 236
Visualization, 87, 101, 120, 196, 197
Voronoi diagram, 87

W
Well-formedness, 13
Well-posedness, 19, 20, 25, 26, 48, 85, 88, 149–154, 215
Well-posedness assumption for classification, 19
Well-posedness assumption for machine learning, 19
Word2vec, 153, 215
Worship, 221–222
Z
Zipf Distribution, 157, 159, 160, 169
Zipf, George Kingsley, 156
Zipf Mystery, 158
Zuse, K., 104, 247