Studies in Systems, Decision and Control 484
Martine Ceberio Vladik Kreinovich Editors
Uncertainty, Constraints, and Decision Making
Studies in Systems, Decision and Control Volume 484
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Systems, Decision and Control” (SSDC) covers both new developments and advances, as well as the state of the art, in the various areas of broadly perceived systems, decision making and control–quickly, up to date and with a high quality. The intent is to cover the theory, applications, and perspectives on the state of the art and future developments relevant to systems, decision making, control, complex processes and related areas, as embedded in the fields of engineering, computer science, physics, economics, social and life sciences, as well as the paradigms and methodologies behind them. The series contains monographs, textbooks, lecture notes and edited volumes in systems, decision making and control spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
Editors Martine Ceberio Department of Computer Science University of Texas at El Paso El Paso, TX, USA
Vladik Kreinovich Department of Computer Science University of Texas at El Paso El Paso, TX, USA
ISSN 2198-4182 ISSN 2198-4190 (electronic) Studies in Systems, Decision and Control ISBN 978-3-031-36393-1 ISBN 978-3-031-36394-8 (eBook) https://doi.org/10.1007/978-3-031-36394-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In the first approximation, decision making is nothing but an optimization problem: we want to select the best alternative. This description, however, is not fully accurate: it implicitly assumes that we know the exact consequences of each decision, and that, once we have selected a decision, no constraints prevent us from implementing it. In reality, we usually know the consequences only with some uncertainty, and there are also numerous constraints that need to be taken into account. The presence of uncertainty and constraints makes decision making challenging. To resolve these challenges, we need to go beyond simple optimization; we also need a good understanding of how the corresponding systems and objects operate, and of why we observe what we observe: this will help us better predict what the consequences of different decisions will be. All these problems, in relation to different application areas, are the main focus of this book. Most papers in this book are extended and selected versions of papers presented at the 15th International Workshop on Constraint Programming and Decision Making CoProD'2022 (Halifax, Nova Scotia, Canada, May 30, 2022), the 28th UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022), and several other conferences. We are greatly thankful to all the authors and referees, and to all the participants of CoProD'2022 and the other workshops. Our special thanks to Prof. Janusz Kacprzyk, the editor of this book series, for his support and help. Thanks to all of you! El Paso, TX, USA January 2023
Martine Ceberio Vladik Kreinovich
Contents
Applications to Biology and Medicine

Hunting Habits of Predatory Birds: Theoretical Explanation of an Empirical Formula ... 3
Adilene Alaniz, Jiovani Hernandez, Andres D. Muñoz, and Vladik Kreinovich

Why Rectified Power (RePU) Activation Functions are Efficient in Deep Learning: A Theoretical Explanation ... 7
Laxman Bokati, Vladik Kreinovich, Joseph Baca, and Natasha Rovelli

Aquatic Ecotoxicology: Theoretical Explanation of Empirical Formulas ... 15
Demetrius R. Hernandez, George M. Molina Holguin, Francisco Parra, Vivian Sanchez, and Vladik Kreinovich

How Hot is Too Hot ... 21
Sofia Holguin and Vladik Kreinovich

How Order and Disorder Affect People's Behavior: An Explanation ... 29
Sofia Holguin and Vladik Kreinovich

Shape of an Egg: Towards a Natural Simple Universal Formula ... 33
Sofia Holguin and Vladik Kreinovich

A General Commonsense Explanation of Several Medical Results ... 39
Olga Kosheleva and Vladik Kreinovich

Why Immunodepressive Drugs Often Make People Happier ... 45
Joshua Ramos, Ruth Trejo, Dario Vazquez, and Vladik Kreinovich

Systems Approach Explains Why Low Heart Rate Variability is Correlated with Depression (and Suicidal Thoughts) ... 49
Francisco Zapata, Eric Smith, and Vladik Kreinovich
Applications to Economics and Politics

How to Make Inflation Optimal and Fair ... 57
Sean Aguilar and Vladik Kreinovich

Why Seneca Effect? ... 63
Sean R. Aguilar and Vladik Kreinovich

Why Rarity Score Is a Good Evaluation of a Non-Fungible Token ... 69
Laxman Bokati, Olga Kosheleva, and Vladik Kreinovich

Resource Allocation for Multi-tasking Optimization: Explanation of an Empirical Formula ... 75
Alan Gamez, Antonio Aguirre, Christian Cordova, Alberto Miranda, and Vladik Kreinovich

Everyone is Above Average: Is It Possible? Is It Good? ... 79
Olga Kosheleva and Vladik Kreinovich

How Probable is a Revolution? A Natural ReLU-Like Formula That Fits the Historical Data ... 85
Olga Kosheleva and Vladik Kreinovich

Why Should Exactly 1/4 Be Returned to the Original Owner: An Economic Explanation of an Ancient Recommendation ... 91
Olga Kosheleva and Vladik Kreinovich

Why Would Anyone Invest in a High-Risk Low-Profit Enterprise? ... 95
Olga Kosheleva and Vladik Kreinovich

Which Interval-Valued Alternatives Are Possibly Optimal if We Use Hurwicz Criterion ... 99
Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, and Vladik Kreinovich
How to Solve the Apportionment Paradox ... 105
Christopher Reyes and Vladik Kreinovich

In the Absence of Information, the only Reasonable Negotiation Scheme Is Offering a Certain Percentage of the Original Request: A Proof ... 109
Miroslav Svítek, Olga Kosheleva, and Vladik Kreinovich

Applications to Education

How to Make Quantum Ideas Less Counter-Intuitive: A Simple Analysis of Measurement Uncertainty Can Help ... 117
Olga Kosheleva and Vladik Kreinovich
Physical Meaning Often Leads to Natural Derivations in Elementary Mathematics: On the Examples of Solving Quadratic and Cubic Equations ... 123
Christian Servin, Olga Kosheleva, and Vladik Kreinovich

Towards Better Ways to Compute the Overall Grade for a Class ... 127
Christian Servin, Olga Kosheleva, and Vladik Kreinovich

Why Some Theoretically Possible Representations of Natural Numbers Were Historically Used and Some Were Not: An Algorithm-Based Explanation ... 135
Christian Servin, Olga Kosheleva, and Vladik Kreinovich

Applications to Engineering

Dielectric Barrier Discharge (DBD) Thrusters–Aerospace Engines of the Future: Invariance-Based Analysis ... 143
Alexis Lupo and Vladik Kreinovich

Need for Optimal Distributed Measurement of Cumulative Quantities Explains the Ubiquity of Absolute and Relative Error Components ... 149
Hector A. Reyes, Aaron D. Brown, Jeffrey Escamilla, Ethan D. Kish, and Vladik Kreinovich

Over-Measurement Paradox: Suspension of Thermonuclear Research Center and Need to Update Standards ... 157
Hector Reyes, Saeid Tizpaz-Niari, and Vladik Kreinovich

How to Get the Most Accurate Measurement-Based Estimates ... 165
Salvador Robles, Martine Ceberio, and Vladik Kreinovich

How to Estimate the Present Serviceability Rating of a Road Segment: Explanation of an Empirical Formula ... 177
Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich

Applications to Linguistics

Word Representation: Theoretical Explanation of an Empirical Fact ... 183
Leonel Escapita, Diana Licon, Madison Anderson, Diego Pedraza, and Vladik Kreinovich

Why Menzerath's Law? ... 189
Julio Urenda and Vladik Kreinovich

Applications to Machine Learning

One More Physics-Based Explanation for Rectified Linear Neurons ... 195
Jonatan Contreras, Martine Ceberio, and Vladik Kreinovich
Why Deep Neural Networks: Yet Another Explanation ... 199
Ricardo Lozano, Ivan Montoya Sanchez, and Vladik Kreinovich

Applications to Mathematics

Really Good Theorems Are Those That End Their Life as Definitions: Why ... 205
Olga Kosheleva and Vladik Kreinovich

Applications to Physics

How to Describe Hypothetic Truly Rare Events (With Probability 0) ... 211
Luc Longpré and Vladik Kreinovich

Spiral Arms Around a Star: Geometric Explanation ... 217
Juan L. Puebla and Vladik Kreinovich

Why Physical Power Laws Usually Have Rational Exponents ... 223
Edgar Daniel Rodriguez Velasquez, Olga Kosheleva, and Vladik Kreinovich

Freedom of Will, Non-uniqueness of Cauchy Problem, Fractal Processes, Renormalization, Phase Transitions, and Stealth Aircraft ... 227
Miroslav Svítek, Olga Kosheleva, and Vladik Kreinovich

How Can the Opposite to a True Theory Be Also True? A Similar Talmudic Discussion Helps Make This Famous Bohr's Statement Logically Consistent ... 233
Miroslav Svítek and Vladik Kreinovich

How to Detect (and Analyze) Independent Subsystems of a Black-Box (or Grey-Box) System ... 237
Saeid Tizpaz-Niari, Olga Kosheleva, and Vladik Kreinovich

Applications to Psychology and Decision Making

Why Decision Paralysis ... 253
Sean Aguilar and Vladik Kreinovich

Why Time Seems to Pass Slowly for Unpleasant Experiences and Quickly for Pleasant Experiences: An Explanation Based on Decision Theory ... 257
Laxman Bokati and Vladik Kreinovich

How to Deal with Conflict of Interest Situations When Selecting the Best Submission ... 263
Olga Kosheleva and Vladik Kreinovich
Why Aspirational Goals: Geometric Explanation ... 269
Olga Kosheleva and Vladik Kreinovich

Why Hate: Analysis Based on Decision Theory ... 275
Olga Kosheleva and Vladik Kreinovich

Why Self-Esteem Helps to Solve Problems: An Algorithmic Explanation ... 281
Oscar Ortiz, Henry Salgado, Olga Kosheleva, and Vladik Kreinovich

Why Five Stages of Solar Activity, Why Five Stages of Grief, Why Seven Plus Minus Two: A General Geometric Explanation ... 287
Miroslav Svítek, Olga Kosheleva, and Vladik Kreinovich

Applications to Software Engineering

Anomaly Detection in Crowdsourcing: Why Midpoints in Interval-Valued Approach ... 295
Alejandra De La Peña, Damian L. Gallegos Espinoza, and Vladik Kreinovich

Unexpected Economic Consequence of Cloud Computing: A Boost to Algorithmic Creativity ... 301
Francisco Zapata, Eric Smith, and Vladik Kreinovich

Unreachable Statements Are Inevitable in Software Testing: Theoretical Explanation ... 305
Francisco Zapata, Eric Smith, and Vladik Kreinovich

General Computational Techniques

Why Constraint Interval Arithmetic Techniques Work Well: A Theorem Explains Empirical Success ... 313
Barnabas Bede, Marina Tuyako Mizukoshi, Martine Ceberio, Vladik Kreinovich, and Weldon Lodwick

How to Describe Relative Approximation Error? A New Justification for Gustafson's Logarithmic Expression ... 323
Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich

Search Under Uncertainty Should be Randomized: A Lesson from the 2021 Nobel Prize in Medicine ... 329
Martine Ceberio and Vladik Kreinovich

Why Convex Combination is an Effective Crossover Operation in Continuous Optimization: A Theoretical Explanation ... 335
Kelly Cohen, Olga Kosheleva, and Vladik Kreinovich
Why Optimization Is Faster Than Solving Systems of Equations: A Qualitative Explanation ... 341
Siyu Deng, K. C. Bimal, and Vladik Kreinovich

Estimating Skewness and Higher Central Moments of an Interval-Valued Fuzzy Set ... 345
Juan Carlos Figueroa Garcia, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich

How to Detect the Fundamental Frequency: Approach Motivated by Soft Computing and Computational Complexity ... 353
Eric Freudenthal, Olga Kosheleva, and Vladik Kreinovich

What if There Are Too Many Outliers? ... 363
Olga Kosheleva and Vladik Kreinovich

What Is a Natural Probability Distribution on the Class of All Continuous Functions: Maximum Entropy Approach Leads to Wiener Measure ... 371
Vladik Kreinovich and Saeid Tizpaz-Niari

An Argument in Favor of Piecewise-Constant Membership Functions ... 377
Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich

Data Processing Under Fuzzy Uncertainty: Towards More Accurate Algorithms ... 387
Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich

Epistemic Versus Aleatory: Case of Interval Uncertainty ... 401
Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich

Standard Interval Computation Algorithm Is Not Inclusion-Monotonic: Examples ... 423
Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, and Vladik Kreinovich

Monotonic Bit-Invariant Permutation-Invariant Metrics on the Set of All Infinite Binary Sequences ... 441
Irina Perfilieva and Vladik Kreinovich

Computing the Range of a Function-of-Few-Linear-Combinations Under Linear Constraints: A Feasible Algorithm ... 451
Salvador Robles, Martine Ceberio, and Vladik Kreinovich

How to Select a Representative Sample for a Family of Functions? ... 459
Leobardo Valera, Martine Ceberio, and Vladik Kreinovich
Applications to Biology and Medicine
Hunting Habits of Predatory Birds: Theoretical Explanation of an Empirical Formula Adilene Alaniz, Jiovani Hernandez, Andres D. Muñoz, and Vladik Kreinovich
Abstract Predatory birds play an important role in an ecosystem. It is therefore important to study their hunting behavior, in particular, the distribution of their waiting time. A recent empirical study showed that the waiting time is distributed according to the power law. In this paper, we use natural invariance ideas to come up with a theoretical explanation for this empirical dependence.
1 Formulation of the Problem

It is important to study hunting habits of predatory birds. Predatory birds are an important part of an ecosystem. Like all predators, they help maintain the healthy balance in nature. This balance is very delicate: unintended human interference can disrupt it. To avoid such disruption, it is important to study the hunting behavior of predatory birds.

A recent discovery. The hunting behavior of most predatory birds is cyclic. Most predatory birds, like owls, spend some time waiting for the prey, and then either attack or jump to a new location. For the same bird, the waiting time w changes randomly from one cycle to another. Researchers recently found how the probability f(t) := Prob(w ≥ t) that the waiting time is ≥ t depends on t:

A. Alaniz · J. Hernandez · A. D. Muñoz · V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
A. Alaniz e-mail: [email protected]
J. Hernandez e-mail: [email protected]
A. D. Muñoz e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_1
f(t) ≈ A · t^{−a};

see [2].

Problem. How can we explain this empirical observation?

What we do in this paper. In this paper, we provide a theoretical explanation for this empirical formula.
2 Our Explanation

We need a family of functions.
• Some birds from the same species tend to wait longer.
• Other birds tend to wait less.
So:
• We cannot have a single formula that would cover all the birds of the same species.
• We need a family of functions f(t).

What is the simplest family. The simplest family is when:
• we fix some function F(t), and
• consider all possible functions of the type C · F(t).

Resulting question. What family should we choose?

Invariance: idea. To find the appropriate family, let us take into account that the numerical value of the waiting time depends on the selection of the measuring unit. In precise terms: if we replace the original measuring unit with one which is λ times smaller, then all numerical values are multiplied by λ: t → λ · t. There seems to be no preferable measuring unit, so it makes sense to assume that the family {C · F(t)}_C should remain the same if we change the measuring unit.

Invariance: precise formulation. In other words, after re-scaling, we should get the exact same family of functions, i.e., the families {C · F(λ · t)}_C and {C · F(t)}_C should coincide, i.e., consist of exactly the same functions.

This invariance requirement leads to the desired theoretical explanation for the empirical formula. The fact that the family remains the same implies, in particular, that for every λ > 0, the function F(λ · t) should belong to the same family. Thus, for every λ > 0, there exists a constant C(λ), depending on λ, for which F(λ · t) = C(λ) · F(t). It is known (see, e.g., [1]; see also Sect. 3) that every measurable solution to this functional equation has the form F(t) = A · t^a. This is exactly the empirical probability distribution: it is the only one which does not depend on the selection of the measuring unit for time. So, we indeed have the desired theoretical explanation.
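The invariance property above can be checked numerically: for a power law F(t) = t^a, re-scaling t → λ · t gives F(λ · t) = λ^a · F(t), i.e., another member of the same family {C · F(t)}_C, with C(λ) = λ^a. The following sketch is our illustration, not code from the paper; the exponent a = −1.5 and the factor λ = 2.0 are arbitrary illustrative values:

```python
import numpy as np

# Power-law survival function F(t) = t^a
# (a = -1.5 is an arbitrary illustrative exponent)
A_EXP = -1.5

def F(t, a=A_EXP):
    return t ** a

lam = 2.0                       # change of measuring unit: t -> lam * t
t = np.linspace(1.0, 10.0, 100)

# Re-scaling maps F into the same family: F(lam * t) = C(lam) * F(t),
# with C(lam) = lam^a -- this is the invariance used in the text
C = lam ** A_EXP
assert np.allclose(F(lam * t), C * F(t))
```

The same check succeeds for every λ > 0, while for a non-power-law function such as F(t) = exp(−t) no single constant C(λ) works, which is the content of the result proved in Sect. 3.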
3 How to Prove the Result About the Functional Equation

The above result about a functional equation is easy to prove when the function F(t) is differentiable. Indeed, suppose that F(λ · t) = C(λ) · F(t). If we differentiate both sides with respect to λ, we get t · F′(λ · t) = C′(λ) · F(t). In particular, for λ = 1, we get t · F′(t) = a · F(t), where we denoted a := C′(1). So, we get

t · dF/dt = a · F.

We can separate the variables if we multiply both sides by dt and divide both sides by t and by F; then we get:

dF/F = a · dt/t.

Integrating both sides of this equality, we get ln(F) = a · ln(t) + C. By applying exp(x) to both sides, we get

F(t) = exp(a · ln(t) + C) = A · t^a,

where we denoted A := e^C.

Acknowledgements This work was supported in part by the National Science Foundation grants:
• 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and
• HRD-1834620 and HRD-2034030 (CAHSI Includes).
It was also supported by the AT&T Fellowship in Information Technology, and by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478.
The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References

1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, 2008)
2. O. Vilk, Y. Orchan, M. Charter, N. Ganot, S. Toledo, R. Nathan, M. Assaf, Ergodicity breaking in area-restricted search of avian predators. Phys. Rev. X 12, Paper 031005 (2022)
Why Rectified Power (RePU) Activation Functions are Efficient in Deep Learning: A Theoretical Explanation Laxman Bokati, Vladik Kreinovich, Joseph Baca, and Natasha Rovelli
Abstract At present, the most efficient machine learning technique is deep learning, with neurons using the Rectified Linear (ReLU) activation function s(z) = max(0, z). In many cases, the use of Rectified Power (RePU) activation functions (s(z))^p, for some p, leads to better results. In this paper, we explain these results by proving that RePU functions (or their "leaky" versions) are optimal with respect to all reasonable optimality criteria.
1 Formulation of the Problem

Need for machine learning. In many practical situations, we need to transform the input x into the desired output y. For example, we may want to predict tomorrow's weather y based on the information x about the weather today and in several past days, or we may want to find the species y of an animal based on its photo x. In many cases, we do not know an algorithm transforming x into y, but we have several (K) past cases k = 1, 2, ..., K, in which we know both the inputs x^(k) and the corresponding outputs y^(k). Based on this information, we need to design an algorithm y = f(x) that, given an input x, produces a reasonable estimate for the desired output y.

L. Bokati
Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
V. Kreinovich (B) · J. Baca · N. Rovelli
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
J. Baca e-mail: [email protected]
N. Rovelli e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_2
In mathematics, reconstructing a function based on its few values is known as interpolation/extrapolation. In computer science, such a reconstruction is known as machine learning.

Deep learning: a brief reminder. At present, the most efficient machine learning technique is deep learning; see, e.g., [6]. In this technique, the final computation is performed by a network of neurons. Each neuron takes several inputs i_1, ..., i_n and transforms them into an output o = s(w_1 · i_1 + ... + w_n · i_n + w_0), where:
• w_i are real numbers that need to be adjusted, and
• s(z) is a continuous non-linear function known as the activation function.
Since the usual training of a neural network requires computing derivatives, the activation function should be differentiable, at least everywhere except maybe a few points. In a neural network, first, the input x to the problem is fed into several neurons, then the outputs of these neurons become inputs for other neurons, etc.; the output of the final neurons is returned as the desired answer y.

Which activation functions should we use? Most successful neural networks use:
• either the so-called Rectified Linear (ReLU) activation function s(z) = max(0, z),
• or its "Leaky" version s(z) = z for z > 0 and s(z) = a · z for z < 0, for some value a ≠ 1 for which |a| ≤ 1.
Interestingly, several papers have shown that in situations in which the actual dependence y = f(x) is sufficiently smooth, we can get better results if we use a different activation function: s(z) = z^p for z > 0 and s(z) = 0 for z ≤ 0, for some parameter p. This activation function is known as the Rectified Power (RePU) activation function; see, e.g., [1, 3–5, 7–9, 11, 12]. In some cases, a "Leaky" version of RePU, with s(z) = z^p for z > 0 and s(z) = a · |z|^p for z < 0, works even better.

Resulting challenge and what we do in this paper. Several of the above-mentioned papers prove that RePU leads to an almost optimal approximation to smooth functions.
However, it remains unclear whether RePU is the only activation function with this property, and whether some other activation function would lead to even better approximations. In this paper, we analyze this problem from the theoretical viewpoint. Specifically, we prove that, for all reasonable optimality criteria, RePU and Leaky RePU are the only optimal activation functions. To prove this result, we first explain the importance of scale-invariance, in Sect. 2; then, in Sect. 3, we prove an auxiliary result: that RePU and Leaky RePU are the only scale-invariant activation functions. In Sect. 4, we use this auxiliary result to prove the main result about the optimality of RePU and Leaky RePU.
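For concreteness, the four activation functions discussed above can be written down directly. The following sketch is our illustration, not code from any of the cited papers; the default parameter values p = 3 and a = 0.1 are arbitrary:

```python
import numpy as np

def relu(z):
    # Rectified Linear: s(z) = max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.1):
    # "Leaky" ReLU: s(z) = z for z > 0 and a * z for z < 0, with |a| <= 1
    return np.where(z > 0, z, a * z)

def repu(z, p=3):
    # Rectified Power: s(z) = z^p for z > 0 and 0 for z <= 0
    return np.where(z > 0, z ** p, 0.0)

def leaky_repu(z, p=3, a=0.1):
    # "Leaky" RePU: s(z) = z^p for z > 0 and a * |z|^p for z < 0
    return np.where(z > 0, z ** p, a * np.abs(z) ** p)
```

Note that repu with p = 1 coincides with relu, which is the sense in which RePU generalizes ReLU.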
2 Importance of Scale-Invariance

Artificial neural networks versus biological ones. Artificial neural networks largely simulate networks of biological neurons in the brain. In the brain, both the input and the output signals are physical signals, i.e., values of the corresponding physical quantities. For example, z can be a frequency of pulses, or their amplitude, etc. Computer-based neurons simulate how these signals are processed in the brain. To perform such simulations, we describe both input and output signals by their numerical values.

Importance of scale-invariance. It is important to emphasize that the numerical value of a physical quantity depends on the choice of a measuring unit. For example, we can measure frequency as the number of pulses per second or as the number of pulses per minute, and we get different numerical values describing the same frequency: 1 pulse per second is equivalent to 60 pulses per minute. In general, if we change a measuring unit to the one which is λ > 0 times smaller, then all the numerical results will be multiplied by λ. This is similar to the fact that if we, e.g., replace meters with centimeters (a 100 times smaller measuring unit), then all numerical values of lengths get multiplied by 100, so 1.7 m becomes 1.7 · 100 = 170 cm.

There is no preferred measuring unit, so it makes sense to require that the transformation described by the activation function o = s(z) should not change if we simply change the measuring unit for the input signal. To be more precise, if we have the relation o = s(z) in the original units, then we should have a similar relation O = s(Z) in the new units, and for each change of units for measuring z, there should be an appropriate selection of the corresponding unit for measuring o. This property is known as scale-invariance. Let us describe this property in precise terms.
Definition 1 We say that a function o = s(z) is scale-invariant if for each real number λ > 0 there exists a value μ > 0 (depending on λ) for which, once we have o = s(z), then we should also have O = s(Z ), where O = μ · o and Z = λ · z.
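As a sanity check of Definition 1, consider the RePU function s(z) = z^p for z > 0 and s(z) = 0 for z ≤ 0: for every λ > 0, the choice μ = λ^p makes the definition hold, since (λ · z)^p = λ^p · z^p and both sides vanish for z ≤ 0. A hypothetical numerical check (the values p = 3 and λ = 2.5 are our arbitrary choices):

```python
import numpy as np

def repu(z, p=3):
    # RePU: s(z) = z^p for z > 0 and 0 for z <= 0
    return np.where(z > 0, z ** p, 0.0)

p, lam = 3, 2.5
mu = lam ** p                    # the value mu(lambda) from Definition 1
z = np.linspace(-5.0, 5.0, 201)

# Definition 1: o = s(z) implies O = s(Z), i.e., mu * s(z) = s(lam * z)
assert np.allclose(repu(lam * z, p), mu * repu(z, p))
```

The same check with μ = λ^p fails for a non-scale-invariant function such as s(z) = log(1 + e^z), in line with Proposition 1 below.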
3 First Auxiliary Result: RePU and Leaky RePU are, In Effect, the Only Scale-Invariant Activation Functions

Discussion. If we simply change the unit for the output to the one which is c > 0 times smaller, then, from the purely mathematical viewpoint, we get a new activation function c · s(z). But, of course, from the physical viewpoint, everything remains the same. In this sense, the activation functions s(z) and c · s(z) are equivalent. Similarly, we get a function which is physically the same, but mathematically different, if we change the direction of z, from z to −z. Let us describe this equivalence in precise terms.
L. Bokati et al.
Definition 2 We say that two activation functions s(z) and S(z) are equivalent if there exists a constant c ≠ 0 for which:
• either we have S(z) = c · s(z) for all z,
• or we have S(z) = c · s(−z) for all z.

Now, we can formulate our first result.

Definition 3 By a Leaky RePU, we mean a function s(z) = z^p for z ≥ 0 and s(z) = a · |z|^p for z < 0, for some p ≥ 0 and for some a with |a| ≤ 1.

Proposition 1 Every continuous non-linear scale-invariant function is equivalent to a Leaky RePU.

Proof Scale-invariance means that o = s(z) implies that O = s(Z), i.e., that μ(λ) · o = s(λ · z). Substituting o = s(z) into this equality, we get

μ(λ) · s(z) = s(λ · z).   (1)
It is known—see, e.g., [2]—that for z > 0, every continuous solution to this functional equation has the form s(z) = A₊ · z^{p₊} for some p₊ > 0; in this case, μ(λ) = λ^{p₊}. Similarly, for z < 0, we get s(z) = A₋ · |z|^{p₋} for some p₋ > 0; in this case, μ(λ) = λ^{p₋}. Since the function μ(λ) must be the same for positive and negative z, we thus have p₊ = p₋. Let us denote the common value of this exponent by p = p₊ = p₋. If A₋ > A₊, then, after the transformation z → −z, we will have A₋ ≤ A₊. Thus, for c = A₊ and a = A₋/A₊, we get the desired equivalence. The proposition is proven.

Comment. The general proof from [2] is somewhat too complicated to be reproduced here, but since we are interested in differentiable activation functions, we can give a simpler proof of this result. Indeed, since the function s(z) is differentiable, the function

μ(λ) = s(λ · z)/s(z)

is differentiable as well—as the ratio of two differentiable functions. Thus, all the functions in the equality (1) are differentiable. So, we can differentiate both sides of this equality by λ. As a result, we get

μ′(λ) · s(z) = z · s′(λ · z).
In particular, for λ = 1, we get p · s(z) = z · s′(z), where we denoted p = μ′(1). Thus, we have

p · s = z · (ds/dz).
Why Rectified Power (RePU) Activation Functions are Efficient …
In this equality, we can separate the variables: if we divide both parts by s and z and multiply both parts by dz, then we get

p · (dz/z) = ds/s.
Integrating both parts, we get p · ln(z) + C = ln(s), where C is an integration constant. By applying the function exp(x) to both sides, and taking into account that exp(ln(s)) = s and that

exp(p · ln(z) + C) = exp(p · ln(z)) · exp(C) = (exp(ln(z)))^p · exp(C) = exp(C) · z^p,

we get the desired expression s(z) = const · z^p.
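As a quick sanity check of the functional equation μ(λ) · s(z) = s(λ · z), the following sketch verifies it numerically for a Leaky RePU; the specific parameters p = 2 and a = 0.1 are our own illustrative choices, not values from the text:

```python
def leaky_repu(z, p=2.0, a=0.1):
    """Leaky RePU: z**p for z >= 0, a * |z|**p for z < 0 (illustrative p, a)."""
    return z**p if z >= 0 else a * abs(z)**p

def mu(lam, p=2.0):
    """The rescaling factor mu(lambda) = lambda**p from the proof."""
    return lam**p

# Check mu(lam) * s(z) == s(lam * z) for several unit changes lam and inputs z.
for lam in [0.5, 2.0, 3.7]:
    for z in [-2.0, -0.3, 0.4, 1.5]:
        assert abs(mu(lam) * leaky_repu(z) - leaky_repu(lam * z)) < 1e-9
```

Note that the same μ(λ) = λ^p works on both the positive and the negative half-line, exactly as used in the proof.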
4 Main Result: RePU and Leaky RePU are Optimal Activation Functions

What do we want to select. Two equivalent activation functions represent, in effect, the same neurons, and thus the same neural networks. Therefore, what we want to select is not a single activation function, but an equivalence class—i.e., the set of all the activation functions which are equivalent to the given one.

Discussion: what do we mean by optimal? We can have many different optimality criteria. Usually, we select some objective function, and we say that an alternative is optimal if it has the largest possible value of this objective function. However, this is not the only possible formulation of optimality. For example, if it turns out that there are several activation functions whose use leads to the same success rate in determining an animal's species from a photo, we can select, among these activation functions, the one for which the training time is the shortest. In this case, the optimality criterion is more complex than simply comparing the values of the objective function: we also need to take into account training times. If this modified criterion still does not lead to a unique choice, this means that our criterion is not final: we can use the remaining non-uniqueness to optimize something else.

In general, in all such complex optimality criteria, what we have is an ability to compare two alternatives a and b and say that some are better (we will denote it by a > b) and some are of the same quality (we will denote it by a ∼ b). Of course, these comparisons must be consistent: if a is better than b and b is better than c, then a should be better than c. Thus, we arrive at the following definition.

Definition 4 ([10]) Let A be a set; its elements will be called alternatives. By an optimality criterion, we mean a pair (>, ∼) of binary relations on this set that satisfy the following properties for all a, b, c ∈ A:
• if a > b and b > c, then a > c;
• if a > b and b ∼ c, then a > c;
• if a ∼ b and b > c, then a > c;
• if a ∼ b and b ∼ c, then a ∼ c;
• if a ∼ b, then b ∼ a;
• a ∼ a;
• if a > b, then we cannot have a ∼ b.
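For intuition, here is a minimal toy sketch (our own example, not from the text) of a criterion on three alternatives a > b > c that satisfies these consistency properties, with ∼ taken to be plain equality:

```python
from itertools import product

# Toy criterion on alternatives {a, b, c} with a > b > c; "same quality" (~)
# is simply equality here.
better = {("a", "b"), ("b", "c"), ("a", "c")}
same = {(x, x) for x in "abc"}
alts = "abc"

# Check the consistency properties listed above.
for x, y, z in product(alts, repeat=3):
    if (x, y) in better and (y, z) in better:
        assert (x, z) in better
    if (x, y) in better and (y, z) in same:
        assert (x, z) in better
    if (x, y) in same and (y, z) in better:
        assert (x, z) in better
for x, y in product(alts, repeat=2):
    if (x, y) in better:
        assert (x, y) not in same

# Here "a" is better than or equivalent to every alternative, i.e., optimal.
assert all((("a", y) in better) or (("a", y) in same) for y in alts)
```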
Definition 5 We say that an alternative a is optimal with respect to the criterion (>, ∼) if for every b ∈ A, we have either a > b or a ∼ b.

Definition 6 We say that the optimality criterion is final if there exists one and only one optimal alternative.

Finally, since there is no preferred measuring unit for signals, it makes sense to require that the comparison between equivalence classes of activation functions should not change if we simply change the measuring unit for the input signal.

Definition 7 For each function s(z) and for each λ > 0, by a λ-rescaling Tλ(s), we mean the function s(λ · z).

One can easily check that if two functions are equivalent, then their λ-rescalings are also equivalent.

Definition 8 We say that the optimality criterion on the set S of all equivalence classes of continuous non-linear functions is scale-invariant if for every two classes e1 and e2 and for every λ > 0, the following two conditions are satisfied:
• if e1 > e2, then Tλ(e1) > Tλ(e2); and
• if e1 ∼ e2, then Tλ(e1) ∼ Tλ(e2).

Proposition 2 For every final scale-invariant optimality criterion on the set S of all equivalence classes of continuous non-linear functions, every function from the optimal class is equivalent to a Leaky RePU.

Proof Let eopt be the optimal class. This means that for every other class e, we have either eopt > e or eopt ∼ e. This is true for every class e; in particular, this is true for the class T1/λ(e). Thus, for every e, we have either eopt > T1/λ(e) or eopt ∼ T1/λ(e). So, by scale-invariance, we have:
• either Tλ(eopt) > Tλ(T1/λ(e)) = e
• or Tλ(eopt) ∼ Tλ(T1/λ(e)) = e.

Thus, by the definition of an optimal alternative, the class Tλ(eopt) is optimal. However, since the optimality criterion is final, there is only one optimal class, so Tλ(eopt) = eopt. Thus, the class eopt is scale-invariant. Hence, by Proposition 1, every function from the optimal class is equivalent to a Leaky RePU. The proposition is proven.
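The key step above, that λ-rescaling maps the equivalence class of a Leaky RePU to itself, can also be checked numerically: s(λ · z) = λ^p · s(z), i.e., the rescaled function equals c · s(z) with c = λ^p. A sketch with illustrative parameters (p = 3, a = 0.25, λ = 1.7 are our own choices):

```python
def leaky_repu(z, p=3.0, a=0.25):
    """Leaky RePU with illustrative parameters p = 3, a = 0.25."""
    return z**p if z >= 0 else a * abs(z)**p

lam = 1.7
c = lam**3.0  # the constant from Definition 2: here c = lam**p
for z in [-2.0, -0.5, 0.0, 0.8, 2.3]:
    rescaled = leaky_repu(lam * z)  # the lam-rescaling (T_lam s)(z) = s(lam * z)
    assert abs(rescaled - c * leaky_repu(z)) < 1e-9
```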
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. A. Abdeljawad, P. Grohs, Integral Representations of Shallow Neural Network with Rectified Power Unit Activation Function (2021), arXiv:2112.11157v1
2. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, 2008)
3. M. Ali, A. Nouy, Approximation of smoothness classes by deep rectifier networks. SIAM J. Numer. Anal. 59(6), 3032–3051 (2021)
4. C.K. Chui, X. Li, H.N. Mhaskar, Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)
5. C.K. Chui, H.N. Mhaskar, Deep nets for local manifold learning, in Frontiers in Applied Mathematics and Statistics, vol. 4, Paper 00012 (2018)
6. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, Massachusetts, 2016)
7. R. Gribonval, G. Kutyniok, M. Nielsen, F. Voigtlaender, Approximation spaces of deep neural networks. Constr. Approx. 55, 259–367 (2022)
8. B. Li, S. Tang, H. Yu, Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. Commun. Comput. Phys. 27(2), 379–411 (2020)
9. H.N. Mhaskar, Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1(1), 61–80 (1993)
10. H.T. Nguyen, V. Kreinovich, Applications of Continuous Mathematics to Computer Science (Kluwer, Dordrecht, 1997)
11. J.A.A. Opschoor, Ch. Schwab, J. Zech, Exponential ReLU DNN Expression of Holomorphic Maps in High Dimension, Seminar on Applied Mathematics, Technical Report 35 (Swiss Federal Institute of Technology ETH Zürich, 2019)
12. E. Weinan, Y. Bing, The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6(1), 1–12 (2018)
Aquatic Ecotoxicology: Theoretical Explanation of Empirical Formulas Demetrius R. Hernandez, George M. Molina Holguin, Francisco Parra, Vivian Sanchez, and Vladik Kreinovich
Abstract To analyze the effect of pollution on marine life, it is important to know how exactly the concentration of toxic substances decreases with time. There are several semi-empirical formulas that describe this decrease. In this paper, we provide a theoretical explanation for these empirical formulas.
1 Formulation of the Problem

Background. Pollution is ubiquitous, and oceans, lakes, and rivers are not immune from it. The good news is that many toxic substances are not stable: their concentration C in sea creatures decreases with time. It is important to be able to predict how this decrease will go.

What is known. Several semi-empirical equations

dC/dt = f(C)
D. R. Hernandez · G. M. Molina Holguin · F. Parra · V. Sanchez · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] D. R. Hernandez e-mail: [email protected] G. M. Molina Holguin e-mail: [email protected] F. Parra e-mail: [email protected] V. Sanchez e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_3
D. R. Hernandez et al.
describe this decrease (see, e.g., [2]):
• In most cases, f(C) is a polynomial (mostly quadratic).
• In other cases, the formula is fractional-linear:

f(C) = (a + b · C)/(1 + d · C).
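To illustrate what such a decay law predicts, here is a minimal sketch that integrates dC/dt = f(C) by Euler's method for a quadratic f(C) = −k1·C − k2·C²; the rate constants, initial concentration, and step size are hypothetical, chosen only for illustration:

```python
# Hypothetical quadratic decay law f(C) = -k1*C - k2*C**2.
k1, k2 = 0.5, 0.1

def f(C):
    return -k1 * C - k2 * C**2

# Forward-Euler integration of dC/dt = f(C) from C(0) = 10 up to t = 10.
C, dt = 10.0, 0.001
for _ in range(10_000):
    C += dt * f(C)

# The concentration decays but stays positive, as expected for a decay law.
assert 0.0 < C < 10.0
```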
Problem. How can we explain these formulas?

What we do in this paper. In this paper, we use invariance ideas to provide a theoretical explanation for these empirical formulas.
2 Invariance Ideas

General idea. The numerical value of each physical quantity depends:
• on the selection of the measuring unit, and
• on the selection of the starting point.

Scaling and scale-invariance. If we replace the original measuring unit with one which is λ times smaller, all numerical values multiply by λ: x → λ · x. This transformation is known as scaling. In many cases, there is no preferable measuring unit. In such cases, it makes sense to assume that the formulas should remain the same if we change the measuring unit.

Shift and shift-invariance. If we replace the original starting point with one which is x0 units earlier, we add x0 to all numerical values: x → x + x0. This transformation is known as a shift. For quantities such as temperature or time, there is no preferred starting point. In this case, it makes sense to assume that the formulas should remain the same if we change the starting point.

More general invariance. If we change both the starting point and the measuring unit, we get a general linear transformation: x → λ · x + x0.
A typical example of such a transformation is converting temperature from Celsius to Fahrenheit: t_F = 1.8 · t_C + 32.
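A minimal sketch of the two elementary transformations and their composition; the Celsius-to-Fahrenheit conversion above is one instance of the general linear transformation x → λ · x + x0:

```python
def rescale(x, lam):
    """Unit change: a lam-times-smaller unit multiplies all values by lam."""
    return lam * x

def shift(x, x0):
    """Starting-point change: a starting point x0 units earlier adds x0."""
    return x + x0

# Celsius -> Fahrenheit is the composition with lam = 1.8 and x0 = 32.
t_C = 100.0
t_F = shift(rescale(t_C, 1.8), 32.0)
assert t_F == 212.0
```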
3 Let Us Apply Invariance Ideas to Our Problem

What we want. We want to find out how the reaction rate

r = dC/dt

depends on the concentration C.

We need a family of functions. In different places, this dependence is different. Thus:
• we cannot look for a single function r = f(C);
• we should look for a family of functions.

Simplest types of families. The simplest type of family is:
• when we fix several functions e1(t), …, en(t), and
• we consider all possible linear combinations of these functions: C1 · e1(t) + … + Cn · en(t).

Examples. For example:
• if we take e1(t) = 1, e2(t) = t, e3(t) = t², etc.,
• then we get polynomials.

If we select sines and cosines, we get what is called trigonometric polynomials, etc.

Remaining problem. Which family should we select?
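A minimal sketch of this "family of linear combinations" idea, using the polynomial basis e1(t) = 1, e2(t) = t, e3(t) = t² mentioned above:

```python
def family_member(coeffs, t):
    """Linear combination C1*e1(t) + C2*e2(t) + C3*e3(t) with basis 1, t, t**2."""
    basis = [1.0, t, t**2]
    return sum(c * e for c, e in zip(coeffs, basis))

# The member with coefficients (3, -2, 1) is the quadratic 3 - 2t + t**2.
t = 2.0
assert family_member([3.0, -2.0, 1.0], t) == 3.0 - 2.0 * t + t**2
```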
4 Let Us Apply Invariance Ideas to Our Case

Reminder. For time, there is no preferred measuring unit and no preferred starting point.

Resulting requirement. So, it makes sense to select a family which is invariant with respect to scalings and shifts.

This leads to the desired explanation of polynomial formulas. It is known that in all such invariant families, all the functions are polynomials; see Sect. 6. This explains why, in most cases, the empirical dependence r = f(C) is polynomial.
5 How to Explain Fractional-Linear Dependence?

Some natural transformations are non-linear. In the previous text, we only considered linear transformations. However, in some cases, we can also have natural non-linear transformations between different scales. For example:
• There are two natural scales for describing earthquakes: the energy scale and the logarithmic (Richter) scale.
• Our reaction to sound is best described not by its power, but by the logarithm of the power—decibels, etc.

How can we describe the class of all natural transformations?

What are natural non-linear transformations? As we have mentioned, linear transformations are natural. Also:
• If we have a natural transformation between scales A and B, then the inverse transformation should also be natural.
• If we apply two natural transformations one after another, then the resulting composition should also be natural.

Thus, the class of all natural transformations should be closed under composition and inverse. Such classes are known as transformation groups. Also, we want to describe all these transformations in a computer. In each computer, we can only store finitely many parameters. Thus, we should restrict ourselves to finite-parametric transformation groups that contain all linear transformations.

A known result about such groups explains fractional-linear dependence. It is known that every element of a finite-parametric transformation group that contains all linear transformations is a fractional-linear function; see, e.g., [1, 3, 5]. This explains why the dependence r = f(C) is sometimes described by a fractional-linear function: r and C can be viewed as two different scales for describing the same phenomena.
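The group structure of fractional-linear maps can be made concrete: representing x → (a·x + b)/(c·x + d) by the matrix (a, b; c, d), composition of maps corresponds to matrix multiplication, so the class is closed under composition (and, via matrix inversion, under inverse). A sketch with arbitrary illustrative coefficients:

```python
def frac_linear(m, x):
    """Apply the fractional-linear map given by matrix m = (a, b, c, d)."""
    a, b, c, d = m
    return (a * x + b) / (c * x + d)

def compose(m1, m2):
    """Matrix product m1 * m2: apply the map of m2 first, then the map of m1."""
    a1, b1, c1, d1 = m1
    a2, b2, c2, d2 = m2
    return (a1*a2 + b1*c2, a1*b2 + b1*d2, c1*a2 + d1*c2, c1*b2 + d1*d2)

m1 = (2.0, 1.0, 1.0, 3.0)
m2 = (1.0, -1.0, 0.5, 2.0)
x = 0.7
# Composition of the two maps equals the map of the matrix product.
assert abs(frac_linear(compose(m1, m2), x) - frac_linear(m1, frac_linear(m2, x))) < 1e-9
```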
6 How to Prove the Result about the Invariant Families

This result is easy to prove when the functions ei(t) are differentiable.

Let us first use shift-invariance. Shift-invariance implies that for each function ei(t), its shift ei(t + t0) belongs to the same family. So, for some coefficients Cij(t0), we get:

ei(t + t0) = Ci1(t0) · e1(t) + … + Cin(t0) · en(t).

If we differentiate both sides with respect to t0, we get
e′i(t + t0) = C′i1(t0) · e1(t) + … + C′in(t0) · en(t).
In particular, for t0 = 0, we get

e′i(t) = ci1 · e1(t) + … + cin · en(t),
where we denoted cij = C′ij(0). So, we get a system of linear differential equations with constant coefficients. It is known (see, e.g., [4, 6]) that its solutions are linear combinations of terms t^m · exp(k · t), where k is an eigenvalue of the matrix cij, and m is a non-negative integer which is smaller than k's multiplicity.

Let us now use scale-invariance. Similarly, scale-invariance implies that for each function ei(t), its scaling ei(λ · t) belongs to the same family. So, for some coefficients Cij(λ), we get:

ei(λ · t) = Ci1(λ) · e1(t) + … + Cin(λ) · en(t).

If we differentiate both sides with respect to λ, we get

t · e′i(λ · t) = C′i1(λ) · e1(t) + … + C′in(λ) · en(t).
In particular, for λ = 1, we get

t · (dei(t)/dt) = ci1 · e1(t) + … + cin · en(t),

where we denoted cij = C′ij(1). In this case,

dt/t = dT,

where we denoted T = ln(t). Thus, for the new variable T, we get the same system as before:

e′i(T) = ci1 · e1(T) + … + cin · en(T).

So, its solution is a linear combination of the functions

T^m · exp(k · T) = (ln(t))^m · exp(k · ln(t)) = (ln(t))^m · t^k.
What if a family is both shift- and scale-invariant? The only functions that can be represented in both forms t^m · exp(k · t) and (ln(t))^m · t^k are polynomials. Thus, every element of a scale- and shift-invariant family is indeed a polynomial.

Acknowledgements This work was supported in part by the National Science Foundation grants:
• 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and
• HRD-1834620 and HRD-2034030 (CAHSI Includes).

It was also supported by the AT&T Fellowship in Information Technology, and by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478. The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References

1. V.M. Guillemin, S. Sternberg, An algebraic model of transitive differential geometry. Bull. Am. Math. Soc. 70(1), 16–47 (1964)
2. M.C. Newman, Quantitative Methods in Aquatic Ecotoxicology (Lewis Publishers, Boca Raton, Ann Arbor, London, and Tokyo, 1995)
3. H.T. Nguyen, V. Kreinovich, Applications of Continuous Mathematics to Computer Science (Kluwer, Dordrecht, 1997)
4. J.C. Robinson, An Introduction to Ordinary Differential Equations (Cambridge University Press, Cambridge, UK, 2004)
5. I.M. Singer, S. Sternberg, Infinite groups of Lie and Cartan, Part 1. J. d'Analyse Mathématique XV, 1–113 (1965)
6. M. Tenenbaum, H. Pollard, Ordinary Differential Equations (Dover Publications, New York, 2022)
How Hot is Too Hot Sofia Holguin and Vladik Kreinovich
Abstract A recent study has shown that the temperature threshold—after which even young healthy individuals start feeling the effect of heat on their productivity—is 30.5° ± 1°. In this paper, we use decision theory ideas to provide a theoretical explanation for this empirical finding.
1 Introduction

Formulation of the problem. Humans can tolerate heat, but when the weather becomes too hot, our productivity decreases. And it is not only productivity that decreases: continuous exposure to high temperature stresses the organism and can lead to illness and even to death. This is especially true for older people or for people who are not feeling well, but heat affects young healthy people as well. An important question is: what is the threshold temperature T0 after which even young healthy people will be affected?

Comment. Of course, our perception of heat depends not only on the temperature; it also depends on humidity. Because of this, to describe human perception of heat, researchers use a characteristic known as the "wet-bulb temperature": the temperature measured by a regular thermometer whose bulb is covered by a water-soaked cloth.
S. Holguin · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] S. Holguin e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_4
S. Holguin and V. Kreinovich
Traditional estimate for this threshold. Until recently, it was believed that the corresponding threshold is T0 = 35°, i.e., that:
• wet-bulb temperatures below 35 °C do not affect the productivity of young healthy individuals, while
• higher temperatures lead to a loss of productivity.

This threshold was based on the original research [10].

Recent research result. A recent study [11] used more accurate measurements of the effect of heat on humans. These more accurate measurements led to the conclusion that wet-bulb temperatures below 35° also affect people. Specifically, it was shown that the actual threshold above which even young healthy individuals lose productivity is T0 = 30.5° ± 1°.

Resulting problem. How can we explain this empirical threshold?

What we do in this paper. In this paper, we provide a possible explanation for this empirical finding.
2 Our Explanation

Preliminary analysis of the problem. According to the average person's opinion, the most comfortable temperature is about 24°; see, e.g., [3]. This is the typical temperature of the countries and regions known as tropical paradises, where the temperature stays close to this most comfortable level all year long. Clearly, at this level, there should be no bad effect on humans.

On the other extreme, when the outside temperature reaches the average normal human body temperature of 36.6°, this clearly makes us uncomfortable, since in this case, excess heat generated by our bodies cannot dissipate into the surrounding air—as it can at lower temperatures. So:
• the temperature of 24° is clearly below the threshold, while
• the temperature of 36.6° is clearly above the threshold.

Thus, the threshold T0 is somewhere between these two temperatures, i.e., somewhere on the interval [24, 36.6].

Let us use the Laplace Indeterminacy Principle. We do not know which temperature in this interval corresponds to the desired threshold. Such situations of uncertainty are common in real life. In such situations, if we have several alternative hypotheses and we have no reason to believe that some of them are more probable than others, a natural idea is to assume that all these hypotheses are equally probable, i.e., that each of n hypotheses has the same probability 1/n.
This natural idea was first formulated by Laplace, one of the pioneers of probability theory, and is thus known as the Laplace Indeterminacy Principle; see, e.g., [4].

In our case, possible hypotheses correspond to possible values from the interval [T, T̄] = [24, 36.6], where T denotes the lower endpoint and T̄ the upper endpoint. We have no reason to believe that some values from this interval are more probable than others. Thus, it is reasonable to assume that all these values are equally probable, i.e., that we have a uniform distribution on this interval.

Based on this probability distribution, what is the most reasonable estimate? We need to select a single value from this interval. Ideally, this value should be close to all the values from this interval. In practice, we only measure temperature with some accuracy ε—and we can feel temperature only up to some accuracy. This means that we cannot distinguish two temperatures—neither by measurement nor by their effect on a human body—if the difference between them is smaller than ε. For example, all the values between T and T + ε are indistinguishable from each other. So, in effect, what we possibly have are the following values:

T1 = T, T2 = T + ε, T3 = T + 2ε, …, Tk = T̄.

We want to make sure that the selected value T0 is close to all these values, i.e., that we have

T0 ≈ T, T0 ≈ T + ε, T0 ≈ T + 2ε, …, T0 ≈ T̄.

In other words, we want to make sure that the vector (T0, T0, T0, …, T0) formed by the left-hand sides is close to the vector
(T, T + ε, T + 2ε, …, T̄)

formed by the right-hand sides. The distance between two vectors a = (a1, …, ak) and b = (b1, …, bk) is naturally represented by the Euclidean formula

d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (ak − bk)²).

In particular, in our case, the desired distance d between the two vectors takes the form d = √D, where we denoted

D = (T0 − T)² + (T0 − (T + ε))² + (T0 − (T + 2ε))² + … + (T0 − T̄)².
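The Euclidean distance formula used here, as a one-line sketch:

```python
import math

def d(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

# Classic 3-4-5 check.
assert d((0.0, 0.0), (3.0, 4.0)) == 5.0
```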
One can see that this expression D is related to the general expression for the integral sum

∫ₐᵇ f(x) dx ≈ f(x1) · Δx + f(x2) · Δx + f(x3) · Δx + … + f(xk) · Δx,

where x1 = a, x2 = a + Δx, x3 = a + 2Δx, …, xk = b. This expression is very accurate for small Δx.

Specifically, the expression D is very similar to the integral sum for the integral

∫_T^T̄ (T0 − T)² dT

corresponding to the values T1 = T, T2 = T + ε, T3 = T + 2ε, …, Tk = T̄, which has the following form:

∫_T^T̄ (T0 − T)² dT ≈ (T0 − T)² · ε + (T0 − (T + ε))² · ε + (T0 − (T + 2ε))² · ε + … + (T0 − T̄)² · ε.

All the terms in the right-hand side have a common factor ε. By separating this common factor, we get

∫_T^T̄ (T0 − T)² dT ≈ ε · [(T0 − T)² + (T0 − (T + ε))² + (T0 − (T + 2ε))² + … + (T0 − T̄)²].

The sum in the right-hand side of this formula is exactly our expression D, so

∫_T^T̄ (T0 − T)² dT ≈ ε · D,

and thus

D ≈ (1/ε) · ∫_T^T̄ (T0 − T)² dT.
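A numerical sketch of this approximation, comparing ε · D with the exact value of the integral of (T0 − T)² over [24, 36.6]; the values of T0 and ε below are illustrative:

```python
lo, hi = 24.0, 36.6
T0, eps = 28.0, 0.001

# The sum D from the text, over the grid T, T + eps, ..., Tbar.
n = round((hi - lo) / eps)
D = sum((T0 - (lo + i * eps))**2 for i in range(n + 1))

# Exact value of the integral of (T0 - T)**2 over [lo, hi].
integral = ((T0 - lo)**3 - (T0 - hi)**3) / 3.0

# eps * D approximates the integral, up to O(eps) discretization error.
assert abs(eps * D - integral) < 0.1
```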
So, minimizing the distance d = √D means minimizing the expression

√D ≈ √((1/ε) · ∫_T^T̄ (T0 − T)² dT).

One can check that this expression attains its smallest value if and only if the integral

∫_T^T̄ (T0 − T)² dT

attains its smallest value. Differentiating this integral with respect to the unknown T0 and equating the derivative to 0, we conclude that

∫_T^T̄ 2 · (T0 − T) dT = 0.

Dividing both sides by 2 and taking into account that the integral of a difference is equal to the difference of integrals, we get

∫_T^T̄ T0 dT − ∫_T^T̄ T dT = 0.

The first integral in this expression is the integral of a constant, so it is equal to T0 · (T̄ − T). The second integral can be computed from the antiderivative

∫ T dT = (1/2) · T²,

so the second integral is equal to

∫_T^T̄ T dT = (1/2) · T² |_T^T̄ = (1/2) · (T̄² − T²).

Thus, the resulting equation takes the form

(T̄ − T) · T0 − (1/2) · (T̄² − T²) = 0,

hence

T0 = [(1/2) · (T̄² − T²)] / (T̄ − T) = (T + T̄)/2.
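As a numerical sanity check of this derivation, a brute-force minimization of the discretized sum D(T0) over candidate thresholds indeed picks out the midpoint (24 + 36.6)/2 = 30.3; the grid step is our own illustrative choice:

```python
lo, hi = 24.0, 36.6
grid = [lo + i * 0.01 for i in range(int(round((hi - lo) / 0.01)) + 1)]

def D(T0):
    """Discretized sum of squared deviations, as in the text."""
    return sum((T0 - T)**2 for T in grid)

# Brute-force argmin over candidate thresholds with step 0.01.
best = min(grid, key=D)
assert abs(best - (lo + hi) / 2) < 0.005
```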
This conclusion is in perfect accordance with the recommendations of general decision theory (see, e.g., [1, 2, 5–9]), according to which a rational decision maker should gauge the quality of each alternative by the mean value of the corresponding utility. In our case, we have a uniform distribution on the interval [T, T̄], and it is known that the mean value of the corresponding random variable is equal to the midpoint (T + T̄)/2 of this interval.

In our case, T = 24 and T̄ = 36.6, so we get

T0 = (24 + 36.6)/2 = 60.6/2 = 30.3.
This number is in perfect accordance with the empirical value 30.5 ± 1. Thus, we have indeed explained this empirical value. Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. P.C. Fishburn, Utility Theory for Decision Making (John Wiley & Sons Inc., New York, 1969)
2. P.C. Fishburn, Nonlinear Preference and Utility Theory (The Johns Hopkins Press, Baltimore, Maryland, 1988)
3. A.P. Gagge, J.A.J. Stolwijk, J.D. Hardy, Comfort and thermal sensations and associated physiological responses at various ambient temperatures. Environ. Res. 1(1), 1–20 (1967)
4. E.T. Jaynes, G.L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, UK, 2003)
5. V. Kreinovich, Decision making under interval uncertainty (and beyond), in Human-Centric Decision-Making Models for Social Sciences, ed. by P. Guo, W. Pedrycz (Springer, 2014), pp. 163–193
6. R.D. Luce, H. Raiffa, Games and Decisions: Introduction and Critical Survey (Dover, New York, 1989)
7. H.T. Nguyen, O. Kosheleva, V. Kreinovich, Decision making beyond Arrow's 'impossibility theorem', with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009)
8. H.T. Nguyen, V. Kreinovich, B. Wu, G. Xiang, Computing Statistics under Interval and Fuzzy Uncertainty (Springer, Berlin, Heidelberg, 2012)
9. H. Raiffa, Decision Analysis (McGraw-Hill, Columbus, Ohio, 1997)
10. S.C. Sherwood, M. Huber, An adaptability limit to climate change due to heat stress. Proc. Natl. Acad. Sci. USA 107, 9552–9555 (2010). https://doi.org/10.1073/pnas.0913352107
11. D.J. Vecellio, S.T. Wolf, R.M. Cottle, W.L. Kenney, Evaluating the 35°C wet-bulb temperature adaptability threshold for young, healthy subjects (PSU HEAT Project). J. Appl. Physiol. 132, 340–345 (2022)
How Order and Disorder Affect People’s Behavior: An Explanation Sofia Holguin and Vladik Kreinovich
Abstract Experimental data show that people placed in orderly rooms donate more to charity and make healthier food choices than people placed in disorderly rooms. On the other hand, people placed in disorderly rooms show more creativity. In this paper, we provide a possible explanation for these empirical phenomena.
1 Experimental Results

Experiments described in [5] have shown:
• that people placed in an orderly room donate, on average, larger amounts to charity than people placed in disorderly rooms,
• that people placed in an orderly room make much healthier food choices than people placed in disorderly rooms, and
• that people placed in disorderly rooms show more creativity in solving problems than people placed in orderly rooms.

The corresponding differences are not small:
• from an orderly room, almost twice as many people donated to charity as from a disorderly room: 82% of participants placed in an orderly room donated their money, whereas only 47% of the people in the disorderly room donated;
• also, those people from an organized room who decided to donate gave, on average, somewhat more than double the amount of money given by the people in the disorganized rooms.

In short, the desire to donate to charity doubles—whether you measure it by the number of participants who decided to donate or by the amount each of them donated.

S. Holguin · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
S. Holguin
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_5
The first conclusion is somewhat similar to the results of previous experiments [2, 3, 6] that showed that cleanliness of the environment (e.g., cleaning-related scents) enhances morally good behavior, including reciprocity. That such a drastic change in people’s behavior can be caused by simply being placed in a room for a few minutes or even by a clean scent is amazing. How can we explain such changes? How can we explain why the desire to donate to charity is increased by a factor of two—and not by any other factor?
2 Why Order and Cleanness Affect Human Behavior: Qualitative Explanation

General idea behind our explanation of the first two empirical phenomena. Our environments—and we ourselves—are never perfect; there are many things that we would like to improve. We would like to be in a nice clean ordered place where everything works and where it is easy to find things. We would like to be healthy. We would like to be kind to others, etc. Achieving all these goals requires some time and some effort, and our time is limited. Some of our objectives are easier to reach, some are more difficult to reach. Naturally, we choose to spend our efforts on the goals where these efforts will bring the largest benefit.

How this general idea explains the observed human behavior. It is reasonably easy to put things in order, and it is reasonably easy to clean a room—at least in comparison with such more ambitious goals as living a healthy lifestyle or becoming more socially active. So, not surprisingly, if a person is placed in a room that really needs ordering, or in a room that does not have a clean smell—and so, can probably be cleaned—the person's natural environment-improving efforts are directed towards such easier tasks as putting the room in order or cleaning it, instead of more complex tasks such as living a healthier lifestyle or becoming more socially active.

Comment. In addition to the drastic changes caused by order versus disorder, the paper [5] also reported another, relatively smaller effect: people in the disorderly room showed greater creativity when solving problems than people in the orderly room. This fact has an easy explanation: people in the disorderly rooms, when looking around, can see many unrelated things lying around, such as a physics textbook. This may inspire them to relate their problems to physics (or to whatever other topic is prompted by these unrelated objects) and to use ideas typically used to solve problems in physics (or in another area).
This unexpected use of seemingly unrelated ideas will clearly be perceived as a sign of creativity.
How Order and Disorder Affect People’s Behavior: An Explanation
3 How Order and Cleanness Affect Human Behavior: Quantitative Explanation

Problem: reminder. In the previous section, we explained why order and cleanness affect human behavior. What remains to be explained is the size of this effect, i.e., the fact that both individual amounts of donations to charity and the number of participants who donate to charity double when the experiment switches from a disorderly to an orderly room.

Towards an explanation. To explain the quantitative observations, let us reformulate them in the following equivalent form: the average donation to charity and the number of participants who donate to charity are both decreased by half when the experiment switches from an orderly to a disorderly room. Let D be the average amount donated by participants in an orderly room, and let d be the average amount donated by participants in a disorderly room. All we know—from the experiment and from the explanation provided in the previous section—is that the amount d is smaller than the amount D, i.e., in mathematical terms, that d belongs to the interval [0, D). What reasonable estimate for d can we produce based on this information?

Laplace Indeterminacy Principle. The situation would be simpler if we knew the probabilities of different values d ∈ [0, D). Then, as a reasonable estimate, we could take the expected value of d; see, e.g., [4]. We do not know these probabilities, but maybe we can estimate them and then use these estimated values to compute the corresponding expected value of d, thus providing a reasonable estimate for d? The problem of estimating probabilities in situations of uncertainty is well known; it goes back to the 19th-century mathematician Pierre-Simon Laplace, who was the first to formulate, in precise terms, the following natural idea: if we have no reason to believe that two alternatives have different probabilities, then, based on the available information, we should provide equal estimates for these probabilities.
For example, if several people are suspected of the same crime, and there are no additional factors that would enable us to consider someone's guilt as more probable than the others', then a natural idea is to assign equal probabilities to all these suspects. This natural idea is known as the Laplace Indeterminacy Principle; it is a particular case of a general approach—known as the Maximum Entropy approach—according to which we should not add unnecessary certainty when there is no support for it; see, e.g., [1]. For example, in the above criminal example, we have no certainty about who is guilty, but assigning a larger probability value to one of the suspects would implicitly imply that we have reasons—with some certainty—to believe that this person is guilty.

In our case, the Laplace Indeterminacy Principle means that since we do not have any reason to assume that some values from the interval [0, D) are more probable than others, a reasonable idea is to assign equal probability to all the values from this interval, i.e., to assume that the value d is uniformly distributed on this interval. We can get the exact same conclusion if we apply the precise formulas of the Maximum Entropy approach [1].
S. Holguin and V. Kreinovich
It turns out that this leads to the desired explanation.

Resulting explanation. We have shown that it is reasonable to assume that the value d is uniformly distributed on the interval [0, D). Thus, as a reasonable estimate for d, we can take the expected value of this random variable. The uniform distribution on an interval [0, D) is described by the probability density

f(x) = 1/D.

Thus, the desired expected value of d with respect to this distribution is equal to

d = ∫₀ᴰ x · f(x) dx = ∫₀ᴰ x · (1/D) dx = (1/D) · (x²/2)|₀ᴰ = (1/D) · (D²/2 − 0) = D/2.
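The expected value of the uniform distribution on [0, D) can also be checked by a simple Monte Carlo simulation. In the following sketch, the function name, sample size, and seed are our own illustrative choices; the average of many values drawn uniformly from [0, D) comes out close to D/2:

```python
import random

def estimate_mean_uniform(D: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected value of a quantity
    that is uniformly distributed on the interval [0, D)."""
    rng = random.Random(seed)
    return sum(rng.uniform(0.0, D) for _ in range(n)) / n

# For D = 10, the estimate should be close to D / 2 = 5.
print(estimate_mean_uniform(10.0))
```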
Thus, we conclude that the reasonable estimate for the amount d is exactly half of the amount D—exactly what was observed in the experiments. So, we indeed have an explanation for this numerical result as well.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. E.T. Jaynes, G.L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, UK, 2003)
2. K. Liljenquist, C.-B. Zhong, A.D. Galinsky, The smell of virtue: clean scents promote reciprocity and charity. Psychol. Sci. 21, 381–383 (2010)
3. N. Mazar, C.-B. Zhong, Do green products make us better people? Psychol. Sci. 21, 494–498 (2010)
4. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, Florida, 2011)
5. K.D. Vohs, J.P. Redden, R. Rahinel, Physical order produces healthy choices, generosity, and conventionality, whereas disorder produces creativity. Psychol. Sci. 24(9), 1860–1867 (2013)
6. C.-B. Zhong, B. Strejcek, N. Sivanathan, A clean self can render harsh moral judgment. J. Exp. Soc. Psychol. 46, 859–862 (2010)
Shape of an Egg: Towards a Natural Simple Universal Formula

Sofia Holguin and Vladik Kreinovich
Abstract Eggs of different bird species have different shapes. There exist formulas describing the usual egg shapes—e.g., the shapes of chicken eggs. However, some egg shapes are more complex. A recent paper proposed a general formula describing all possible egg shapes; however, this formula is purely empirical and does not have any theoretical foundations. In this paper, we use a theoretical analysis of the problem to provide an alternative—theoretically justified—general formula. Interestingly, the new general formula is easier to compute than the previously proposed one.
1 What Is the Shape of an Egg: Formulation of the Problem

It is important to describe egg shapes. Biologists are interested in describing different species and different individuals within these species. An important part of this description is the geometric shape of each biological object. Of course, in addition to the shape, we need to describe many other characteristics related to the object's dynamics.

From this viewpoint, the simplest objects to describe are bird eggs. First, their shapes are usually symmetric—and do not vary as much as the shapes of living creatures in general. Second, eggs are immobile. In contrast to other creatures, their shape does not change—until they hatch. So, to start solving the more general problem of describing shapes of living creatures, a natural first step is to describe the shapes of different eggs.

How egg shapes are described now. At present, there are several formulas that describe egg shapes.

S. Holguin · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
S. Holguin e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_6
All known eggs have an axis of rotation. Usually, in describing the egg shape, this axis is used as the x-axis. Because of this symmetry, to describe the shape of an egg, it is sufficient to describe its projection on the xy-plane. Once we have this projection, all other points can be obtained by rotating this projection around the x-axis. This projection can, in general, be described by a formula F(x, y) = 0 for an appropriate function F(x, y).

Historically, the first formula for the egg shape was produced by Fritz Hügelschäffer; see [1, 3–7, 9]. According to this formula, the egg shape has the form

y² = (a₀ + a₁·x + a₂·x²) / (b₀ + b₁·x).  (1)
This formula can be further simplified. First, we can use the same trick that is used to derive the formula for solving quadratic equations: completing the square. Namely, by an appropriate selection of the starting point of the x-axis, i.e., by using a new variable x′ = x + x₀ with x₀ = a₁/(2a₂), we can bring the numerator a₀ + a₁·x + a₂·x² to a simpler form. Indeed, since

a₂·(x′)² = a₂·(x + a₁/(2a₂))² = a₂·x² + a₁·x + a₁²/(4a₂),

we have a₀ + a₁·x + a₂·x² = a₀′ + a₂·(x′)², where a₀′ ≝ a₀ − a₁²/(4a₂).

Second, we can divide both the numerator and the denominator by a₂ > 0, resulting in a simplified formula

y² = (ã₀ + (x′)²) / (b̃₀ + b̃₁·x′),  (2)

where ã₀ ≝ a₀′/a₂, b̃₀ ≝ b₀/a₂, and b̃₁ ≝ b₁/a₂.
For practical purposes, it turned out to be convenient to use a different formula for the exact same dependence:

y² = (B²/4) · (L² − 4x²) / (L² + 8w·x + 4w²),  (3)

for some parameters B, L, and w.

It turns out that some eggs are not well described by these formulas. To describe such unusual egg shapes, [3] proposed a different formula:

y² = (B²/4) · (L² − 4x²)·L / (2(L − 2w)·x² + (L² + 8L·w − 4w²)·x + 2L·w² + L²·w + L³).  (4)
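The standard formula (3) is straightforward to evaluate numerically. As a minimal sketch (with illustrative parameter values, not data for any real egg), the following computes the half-profile y(x) corresponding to formula (3) and checks that the profile vanishes at the two ends x = ±L/2, as an egg outline should:

```python
import math

def egg_half_profile(x: float, B: float, L: float, w: float) -> float:
    """Half-profile y(x) >= 0 of the standard egg shape, formula (3):
    y^2 = (B^2 / 4) * (L^2 - 4 x^2) / (L^2 + 8 w x + 4 w^2)."""
    y2 = (B**2 / 4.0) * (L**2 - 4.0 * x**2) / (L**2 + 8.0 * w * x + 4.0 * w**2)
    return math.sqrt(max(y2, 0.0))

# Illustrative parameters: maximum breadth B, length L, asymmetry parameter w.
B, L, w = 4.0, 6.0, 0.5
assert egg_half_profile(L / 2, B, L, w) == 0.0   # one end of the egg
assert egg_half_profile(-L / 2, B, L, w) == 0.0  # the other end
print(egg_half_profile(0.0, B, L, w))
```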
It is therefore desirable to come up with a single formula for egg shapes that would cover both the usual shapes (2)–(3) and the non-standard shapes (4).

Formulation of the problem. The following empirical general formula was proposed in [3]:

y = ±(B/2) · √[(L² − 4x²)/(L² + 8w·x + 4w²)] ×
  [1 − (√(5.5L² + 11Lw + 4w²) · (√3·B·L − 2·D_{L/4}·√(L² + 2wL + 4w²))) / (√3·B·L · (√(5.5L² + 11Lw + 4w²) − 2·√(L² + 2wL + 4w²))) ×
  (1 − √(L·(L² + 8wx + 4w²) / (2(L − 2w)·x² + (L² + 8Lw − 4w²)·x + 2Lw² + L²w + L³)))],

where D_{L/4} is an additional parameter: the egg's diameter at the distance L/4 from the pointed end.
However, this formula is purely empirical. It is desirable to come up with a general formula that has some theoretical foundations. This is what we do in this paper.
2 Towards a General Formula

Main idea. In general, any shape can be described by a formula P(x, y) = 0 for an appropriate function P(x, y). We have no a priori information about this shape. So, to get a good first approximation to the actual shape, let us use the usual trick that physicists use in such situations: expand the unknown function P(x, y) in a Taylor series and keep only the first few terms of this expansion; see, e.g., [2, 8]. The first non-trivial terms provide a reasonable first approximation; if we take more terms into account, we can get a more accurate description.

Details. In general, if we expand the expression P(x, y) in a Taylor series in terms of y, we get the following equation for the egg's shape:

P₀(x) + y·P₁(x) + y²·P₂(x) + … = 0,  (5)

for some functions Pᵢ(x). As we have mentioned, the shape of an egg is symmetric with respect to rotations around an appropriate line. So, if we take this line as the x-axis, we conclude that with each point (x, y), the shape also contains the point (x, −y), which is obtained by rotating (x, y) by 180° around this axis. Thus, once Eq. (5) is satisfied for some x and y, the same equation must be satisfied if we plug in −y instead of y. Substituting −y for y into the formula (5), we get:
P₀(x) − y·P₁(x) + y²·P₂(x) + … = 0.  (6)
Adding (5) and (6) and dividing the result by 2, we get a simpler equation:

P₀(x) + y²·P₂(x) + … = 0.  (7)
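The step from (5) and (6) to (7) can be illustrated numerically: averaging P(x, y) and P(x, −y) cancels the odd-in-y term. In the sketch below, the coefficient functions P₀, P₁, and P₂ are arbitrary illustrative choices, not taken from any actual egg shape:

```python
def P(x: float, y: float) -> float:
    """An illustrative truncated expansion P0(x) + y*P1(x) + y^2*P2(x),
    with arbitrarily chosen coefficient functions."""
    P0 = 1.0 + 2.0 * x - x**2
    P1 = 0.5 - 3.0 * x
    P2 = 2.0 + x
    return P0 + y * P1 + y**2 * P2

def P_even(x: float, y: float) -> float:
    """Average of P(x, y) and P(x, -y): the odd term y*P1(x) cancels."""
    return 0.5 * (P(x, y) + P(x, -y))

x, y = 0.7, 1.3
# The average coincides with P0(x) + y^2 * P2(x), with no P1 contribution.
expected = (1.0 + 2.0 * x - x**2) + y**2 * (2.0 + x)
assert abs(P_even(x, y) - expected) < 1e-12
```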
We want to find the smallest powers of y that lead to some description of the shape. According to the formula (7), the smallest power of y that we can keep and still get some shape is the quadratic term—the term proportional to y². If we only keep terms up to quadratic in terms of y, Eq. (7) takes the following form:

P₀(x) + y²·P₂(x) = 0,  (8)

i.e., equivalently,

y² = −P₀(x) / P₂(x).  (9)
Since we kept only terms which are no more than quadratic in y, it makes sense to also keep only terms which are no more than quadratic in x. In other words, instead of the general functions −P₀(x) and P₂(x), let us keep only up-to-quadratic terms in the corresponding Taylor series:

• instead of a general function −P₀(x), we only consider a quadratic expression −P₀(x) = a₀ + a₁·x + a₂·x²,
• and, instead of a general function P₂(x), we only consider a quadratic expression P₂(x) = b₀ + b₁·x + b₂·x².

In this case, Eq. (9) takes the form

y² = (a₀ + a₁·x + a₂·x²) / (b₀ + b₁·x + b₂·x²).  (10)
Similarly to the transition from the formula (1) to the formula (2), we can simplify this formula by introducing a new variable x′ = x + x₀ for which the numerator of the formula (10) no longer contains a linear term. In terms of this new variable, the formula (10) takes the simplified form given in the next section.
3 General Formula

Resulting formula.

y² = (a₀ + a₂·(x′)²) / (b₀ + b₁·x′ + b₂·(x′)²).  (11)
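Formula (11) is also easy to work with computationally. The following sketch (with arbitrary illustrative coefficients) evaluates (11), checks that dividing the numerator and the denominator by a₀ does not change its values, and checks that setting b₂ = 0 yields a Hügelschäffer-type dependence of the form (2):

```python
def y2_general(x: float, a0: float, a2: float,
               b0: float, b1: float, b2: float) -> float:
    """General formula (11): y^2 = (a0 + a2*x^2) / (b0 + b1*x + b2*x^2),
    with x standing for the shifted variable x'."""
    return (a0 + a2 * x**2) / (b0 + b1 * x + b2 * x**2)

def y2_reduced(x: float, a2p: float, b0p: float,
               b1p: float, b2p: float) -> float:
    """Four-parameter form obtained by dividing the numerator and the
    denominator of (11) by a0."""
    return (1.0 + a2p * x**2) / (b0p + b1p * x + b2p * x**2)

a0, a2, b0, b1, b2 = 2.0, -0.5, 3.0, 0.4, 0.1
x = 0.9
# Dividing through by a0 leaves the value of the expression unchanged:
assert abs(y2_general(x, a0, a2, b0, b1, b2)
           - y2_reduced(x, a2 / a0, b0 / a0, b1 / a0, b2 / a0)) < 1e-12
# With b2 = 0, (11) is exactly a ratio of the Hügelschäffer type (2):
assert y2_general(x, a0, a2, b0, b1, 0.0) == (a0 + a2 * x**2) / (b0 + b1 * x)
```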
Previous formulas are particular cases of this general formula. Let us show that the previous formulas (2)–(3) and (4) are particular cases of this formula. Indeed:

• the standard formula (2)–(3) corresponds to the case when b₂ = 0, i.e., when there is no quadratic term in the denominator;
• the formula (4) fits exactly into this formula.

Our general formula has exactly as many parameters as needed. The complex general formula proposed in [3] contains 4 parameters. At first glance, one may get the impression that our formula (11) has 5 parameters: a₀, a₂, b₀, b₁, and b₂. However, we can always divide both the numerator and the denominator by a₀ without changing the value of the expression. In this case, we get an equivalent expression with exactly four parameters:

y² = (1 + a₂′·(x′)²) / (b₀′ + b₁′·x′ + b₂′·(x′)²),  (12)

where a₂′ ≝ a₂/a₀ and bᵢ′ ≝ bᵢ/a₀. Thus, our formula (11) is not too general—it has exactly as many parameters as the previously proposed general formula.

In what sense is our general formula better. First, our formula is theoretically justified. Second, as one can see, it is easier to compute than the previously proposed general formula.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. R. Ferréol, Hügelschäffer egg, Encyclopédie des formes mathématiques remarquables. 2D Curves (2022), http://www.mathcurve.com/courbes2d.gb/oeuf/oeuf.shtml. Accessed 28 Feb 2022
2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005)
3. V.G. Narushin, M.N. Romanov, D.K. Griffin, Egg and math: introducing a universal formula for egg shape. Ann. N. Y. Acad. Sci. 1, 1–9 (2021)
4. M. Obradović, B. Malešević, M. Petrović, G. Djukanović, Generating curves of higher order using the generalization of Hügelschäffer's egg curve construction, Scientific Bulletin of the "Politehnica" University of Timişoara. Trans. Hydrotechn. 58, 110–114 (2013)
5. M. Petrović, M. Obradović, The complement of the Hugelschaffer's construction of the egg curve, in Proceedings of the 25th National and 2nd International Scientific Conference moNGeometrija, Belgrad, Serbia, June 24–27, 2010, ed. by M. Nestorović (Serbian Society for Geometry and Graphics, Belgrad, 2010), pp. 520–531
6. M. Petrović, M. Obradović, R. Mijailović, Suitability analysis of Hugelschaffer's egg curve application in architectural and structure's geometry, Buletinul Institutului Politehnic Din Iaşi, 2011, vol. 57 (61), No. 3, Section Construcţii de Maşini, pp. 115–122
7. H. Schmidbauer, Eine exakte Eierkurvenkonstruktion mit technischen Anwendungen. Elem. Math. 3, 67–68 (1948)
8. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
9. O. Ursinus, Kurvenkonstruktionen für den Flugzeugentwurf. Flugsport 36(9), 15–18 (1944)
A General Commonsense Explanation of Several Medical Results

Olga Kosheleva and Vladik Kreinovich
Abstract In this paper, we show that many recent experimental medical results about the effect of different factors on our health can be explained by commonsense ideas.
1 Introduction

Unexpected medical results appear all the time. New medical results are published all the time. Many of these results describe the effect of different factors on our health and, as a result, on our longevity. Many of these results unexpectedly relate seemingly unrelated phenomena—e.g., a recent discovery that flu vaccination slows down Alzheimer's disease. Let us list several such recent results.

Examples of recent somewhat unexpected medical results. Let us list—in chronological order—several recent medical results that made the general news:

• bilingualism helps healthy ageing [2];
• flu vaccination helps prevent Alzheimer's [1];
• leisure activity helps prevent dementia [5];
• cold temperatures enhance diabetes [4];
• exercise decreases mortality [6];
• pollution promotes autoimmune diseases [8];
• immuno-depressive drugs increase longevity [3];
• loneliness enhances mental decline [7].
O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_7
What we do in this paper. In this paper, we show that all the above-mentioned medical results can be explained—on a qualitative level—by common sense.

Structure of the paper. We start, in Sect. 2, with a general commonsense description of a biological organism. In Sect. 3, we list general conclusions that can be made based on this description. Finally, in Sect. 4, we show that these conclusions explain all the above results.
2 A Biological Organism and Its Constant Struggle with Negative Factors: A General Commonsense Description

Negative factors. A rock can stay stable for millennia. In contrast, a biological organism is constantly affected by many negative factors:

• the challenging environment outside the body,
• bacteria, viruses, and dangerous chemicals that penetrate the body from the outside, and
• the fact that the cells forming the body tend to deteriorate with time and need to be rejuvenated or replaced.

So, to stay alive and active, the organism is constantly fighting against all these negative factors.

Let us analyze. Let us analyze what this commonsense description implies about our health and our mortality.
3 General Conclusions

Possible conclusions. The negative effect comes from the interaction between the negative factors (i.e., the environment) and the organism. So, to analyze how this effect changes, we need to analyze:

• how changes in the environment affect the organism, and
• how the activity of the organism itself affects the organism's reaction to these negative factors.

Effect of environment.

• When the environment worsens, it becomes more difficult for the organism to fight all the negative factors, so their effect increases.
• Vice versa, when the environment improves—and/or when we artificially make it better, e.g., by neutralizing some of the potentially harmful bacteria and viruses—all negative effects decrease.
Effect of the organism's activity. To be able to repair a problem, we first need to detect that there is a problem. The reason why we called the above factors negative is that they negatively affect our ability to perform different actions. So, a natural way to detect a problem is by performing—or at least trying to perform—a corresponding action and noticing that the results are not perfect. The more frequently we perform an action, the more chances there are that we will notice a problem and thus be able to repair it before the negative effect becomes too strong. For example, if we have a minor sprain of an ankle, we may not notice it when we are lying down, but we will immediately notice it when we try to step on the corresponding foot.

So, to make sure that all the problems are detected early, before they become serious, it is beneficial to frequently perform different types of activities:

• The more activities we perform, the higher the chances that we will notice possible faults and repair them—and the smaller the chances that we will miss a serious problem; thus, the more chances that we will remain healthy.
• Vice versa, the fewer activities we perform, the larger the chance that some fault will remain undetected, and thus, that our health deteriorates—all the way to a possible death.

Specifics of auto-immune diseases. The organism does not have an exact understanding of the negative effects, so it cannot use exactly the right level of defense:

• Sometimes, the organism underestimates the danger—and some negative effects still happen.
• In other cases, the organism overestimates the danger—and since all these fights take away resources that are otherwise needed for normal functioning, this overestimation brings harm to the organism. The corresponding diseases are known as auto-immune.

When the effect of the negative factors is small, this over-reaction is also proportionally small and thus practically harmless.
However, when the effect of the negative factors becomes large, the over-reaction is also proportionally large and can thus cause serious harm.

Comment. It should be mentioned that while, eventually, a biological organism adapts to changes in the environment, this adaptation is rather slow: it takes at least several generations. As a result, the organism's reactions are based not on the current state of the environment, but on the state in which this environment was several generations ago. For humans, this means that the organism's reaction goes back to the times when we did not know antibiotics and other powerful ways to treat some diseases. In those days, a minor infection could be fatal; so, when encountering such an infection, the organism often develops a very strong reaction—a reaction that now, when we have powerful medicine, looks like an over-reaction, and often causes more harm than help. This explains why, as medicine progresses, we see more and more auto-immune diseases.
4 These Conclusions Provide a Commonsense Explanation for the Recent Medical Results

Effect of environment.

• In line with the general conclusion that a worsening of the environment worsens diseases, cold temperatures enhance diabetes [4].
• In line with the general conclusion that an improvement of the environment makes the organism healthier, flu vaccination helps prevent Alzheimer's disease [1].

Effect of the organism's activity.

• In line with the general conclusion that activity helps the organism and thus makes the organism healthier:
  – physical exercises decrease mortality [6];
  – both physical and mental activities help prevent dementia [5]; and
  – bilingualism—which necessitates increased mental activity—helps healthy ageing [2].
• In line with the general conclusion that lack of activity harms the organism and thus makes the organism less healthy, loneliness—which means decreased communication and thus decreased mental activity—enhances mental decline [7].

Auto-immune effects.

• In line with the general conclusion that a worsening of the environment increases the risk and intensity of autoimmune diseases, pollution promotes and enhances these diseases [8].
• In line with the comment that, in current environments, organisms often overreact to challenges, leading to autoimmune diseases, immunodepressive drugs—which decrease the organism's reaction to challenges—decrease this overreaction. This makes the organism healthier and thus helps to increase longevity [3].

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. A.S. Bukhbinder, Y. Ling, O. Hasan, X. Jiang, Y. Kim, K.N. Phelps, R.E. Schmandt, A. Amran, R. Coburn, S. Ramesh, Q. Xiaoc, P.E. Schulz, Risk of Alzheimer's disease following influenza vaccination: a claims-based cohort study using propensity score matching. J. Alzheimers Dis. (2022). https://doi.org/10.3233/JAD-220361
2. F. Gallo, J. Kubiak, A. Myachykov, Add bilingualism to the mix: L2 proficiency modulates the effect of cognitive reserve proxies on executive performance in healthy aging. Front. Psychol. 13, Paper 780261 (2022)
3. P. Juricic, Y.-X. Lu, T. Leech, L.F. Drews, J. Paulitz, J. Lu, T. Nespital, S. Azami, J.C. Regan, E. Funk, J. Fröhlich, S. Grönke, L. Partridge, Long-lasting geroprotection from brief rapamycin treatment in early adulthood by persistently increased intestinal autophagy. Nat. Ageing, Paper 278 (2022). https://doi.org/10.1038/s43587-022-00278-w
4. L.N.Y. Qiu, S.V. Cai, D. Chan, R.S. Hess, Seasonality and geography of diabetes mellitus in United States of America dogs. PLoS ONE 17(8), Paper e0272297 (2022)
5. S. Su, L. Shi, Y. Zheng, Y. Sun, X. Huang, A. Zhang, J. Que, X. Sun, J. Shi, Y. Bao, J. Deng, L. Lu, Leisure activities and the risk of dementia: a systematic review and meta-analysis. Neurology (2022). https://doi.org/10.1212/WNL.0000000000200929
6. E.L. Watts, C.E. Matthews, J.R. Freeman, J.S. Gorzelitz, H.G. Hong, L.M. Liao, K.M. McClain, P.F. Saint-Maurice, E.J. Shiroma, S.C. Moore, Association of leisure time physical activity types and risks of all-cause, cardiovascular, and cancer mortality among older adults. JAMA Netw. Open 5(8), Paper e2228510 (2022)
7. X. Yu, A.C. Westrick, L.C. Kobayashi, Cumulative loneliness and subsequent memory function and rate of decline among adults aged ≥50 in the United States, 1996 to 2016. Alzheimer's & Dementia, Paper 12734 (2022)
8. N. Zhao, A. Smargiassi, S. Jean, P. Gamache, E.-A. Laouan-Sidi, H. Chen, M.S. Goldberg, S. Bernatsky, Long-term exposure to fine particulate matter and ozone and the onset of systemic autoimmune rheumatic diseases: an open cohort study in Quebec, Canada. Arthritis Res. Ther. 24, Paper 151 (2022)
Why Immunodepressive Drugs Often Make People Happier

Joshua Ramos, Ruth Trejo, Dario Vazquez, and Vladik Kreinovich
Abstract Many immunodepressive drugs have an unusual side effect on the patient's mood: they often make the patient happier. This side effect has been observed for many different immunodepressive drugs with different chemical compositions. Thus, it is natural to conclude that there must be some general reason for this empirical phenomenon—a reason related not to the chemical composition of any specific drug, but rather to their general functionality. In this paper, we provide such an explanation.
1 Formulation of the Problem

Description of the phenomenon. The purpose of most medical drugs is to cure physical diseases. In addition to fulfilling their main purpose, many drugs also have side effects. Most side effects unfavorably affect physical conditions. For example, some drugs cause an increase in blood pressure, and some cause nausea.

Interestingly, some drugs also have an effect on mood. For example, statistics have shown that many immunodepressive drugs affect people's mood. In most cases when mood is affected, the drugs make people feel happier; see, e.g., [1–6].

Comment. To be more precise, a similar effect is known for most drugs: people start feeling better—and thus get in a better mood—just by taking the medicine. However,

J. Ramos · R. Trejo · D. Vazquez · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
J. Ramos e-mail: [email protected]
R. Trejo e-mail: [email protected]
D. Vazquez e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_8
for immunodepressive drugs, this mood-changing effect is much larger than for drugs of any other type.

This phenomenon is universal. Interestingly, this effect is observed for many different immunodepressive drugs. This means that the happiness effect is not caused by the specific chemical composition of a drug; the effect seems to be universal. So, there must be a general explanation for this effect that does not use chemical specifics.

What we do in this paper. Such an explanation is what we try to come up with in this paper.
2 Why Do We Need Immunodepressive Drugs?

To provide the desired explanation, let us recall the main purpose of immunodepressive drugs. The main purpose of these drugs is to decrease the immune reaction. This is very important, e.g., in organ transplants: indeed, otherwise, the immune system would attack and destroy the implanted organ.
3 But Why Is There an Immune Reaction in the First Place?

But why is there an immune reaction in the first place? In general:

• if genetically alien material appears inside the body,
• this is an indication that someone from outside the organism is trying to take control—or at least to parasitize.

This could be a virus or a bacterium; it could be a tapeworm. Naturally, all such appearances of alien bio-matter are perceived as danger. Because of this potential danger, a natural reaction is to attack and destroy the alien biomaterial.
4 Immune Reaction Is Usually Multi-level

There are many ways for a body to fight, on all levels; for example:

• On the level of cells, individual cells attack invader cells.
• On the level of the organism as a whole, the body temperature increases, making all the possible fighting cells more active.

Usually, all these levels are involved at the same time.
5 How Immunodepressive Drugs Work

Immunodepressive drugs do not just affect one of the immune mechanisms: if they did, there would be many other ways for the body to attack and destroy the life-saving implant. Most immunodepressive drugs suppress all immune mechanisms, on all levels. This is why a person taking such drugs is vulnerable to flu or cold:

• While the drugs suppress the reaction to an implant, they also suppress the reaction to flu viruses.
• Thus, even a usually harmless—and easily defeated—virus can be dangerous to patients who take these drugs.
6 Resulting Explanation

The general idea of immunodepressive drugs is to decrease the body's reaction to possibly dangerous situations. This reaction has to be decreased on all levels, and an important level is always the mental level. Thus, the person becomes less worried about possibly dangerous situations. This naturally makes the person more optimistic—and thus happier. This explains the puzzling empirical phenomenon.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences (Las Cruces, New Mexico, April 2, 2022) for valuable suggestions.
References 1. E.S. Brown, P.A. Chandler, Mood and cognitive changes during systemic corticosteroid therapy. Primary Care Companion J. Clin. Psychiatry 3(1), 17–21 (2001) 2. E.S. Brown, D.A. Khan, V.A. Nejtek, The psychiatric side effects of corticosteroids. Ann. Allergy Asthma Immunol. 83(6), 495–504 (1999) 3. D.A. Khan, E.S. Brown, T. Suppes, T.J. Carmody, Mood changes during prednisone bursts for asthma. Am. J. Respir. Crit. Care Med. 159, A919 (1999) 4. D.M. Mitchell, J.V. Collins, Do corticosteroids really alter mood? Postgrad. Med. J. 60, 467–470 (1984)
J. Ramos et al.
5. D. Naber, P. Sand, B. Heigl, Psychopathological and neurophysiological effects of 8-days corticosteroid treatment: a prospective study. Psychoneuroendocrinology 21, 25–31 (1996) 6. K. Wada, N. Yamada, H. Suzuki, Y. Lee, S. Kuroda, Recurrent cases of corticosteroid-induced mood disorder: clinical characteristics and treatment. J. Clin. Psychiatry 61(4), 261–267 (2000)
Systems Approach Explains Why Low Heart Rate Variability is Correlated with Depression (and Suicidal Thoughts) Francisco Zapata, Eric Smith, and Vladik Kreinovich
Abstract Depression is a serious medical problem. If diagnosed early, it can usually be cured, but if left undetected, it can lead to suicidal thoughts and behavior. The early stages of depression are difficult to diagnose. Recently, researchers found a promising approach to such diagnosis—it turns out that depression is correlated with low heart rate variability. In this paper, we show that the general systems approach can explain this empirical relation.
1 Formulation of the Problem
Medical problem. Depression is a serious medical condition. If it is diagnosed early, in most cases it can be cured. However, early diagnosis of depression is difficult. As a result, depression often becomes severe and leads to suicidal thoughts and behavior; at present, suicide is the second leading cause of death among adolescents. The situation is not improving, it is getting worse: the number of suicide attempts has been rising exponentially; see, e.g., [1].
Towards a possible solution. Recent research seems to have found a way towards the early diagnosis of depression: namely, it turns out that depression is correlated with the so-called heart rate variability (HRV); see, e.g., [3, 4, 6]. To explain this result, let us recall what HRV is. In the first approximation, the heartbeat looks like a perfectly periodic signal. Our heart rate increases when we perform physical activity or get stressed, and it goes down when we calm down or go to sleep; but during short time periods, when there are no changes in behavior and in emotions, the heart rate seems to be pretty stable. Interestingly (and somewhat unexpectedly), it turns out that even during such stable 1-min intervals, the times between two consecutive heart beats somewhat change.
F. Zapata · E. Smith · V. Kreinovich (B) University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected]
E. Smith e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_9
These changes are not easy to detect—even trained medical doctors cannot detect them by listening to the heartbeat. The only way to detect such changes is by using specialized devices. The existence of this heart rate variability has been known for some time. It has been actively used in cardiology, since it turned out that unusually large heart beat variations may be a signal of forthcoming arrhythmia—a serious condition in which the heartbeat becomes irregular, so serious that it can cause death. From the cardiologist's viewpoint, large heart rate variations indicate the possibility of a serious disease, while smaller variations—even variations which are much smaller than average—are one of the indications that the heart is functioning normally.
The papers [3, 4, 6] show that while, on the one hand, unusually small heart rate variability means a healthy heart, on the other hand, it also indicates a high probability of another type of health problem: beginning depression.
Remaining challenge. An important open question is why there is a correlation between low heart rate variability and depression.
What we do in this paper. In this paper, we show that this correlation can be explained by a simple systems approach.
2 Systems Approach: A Brief Reminder
What a system wants—or what we want from a system. In general, a system has some objective: to reach a certain state, maybe to stay in its current comfortable state, etc. For example:
• an airplane wants to reach its destination,
• the electric grid wants to supply all needed electricity to the customers,
• an economy wants to reach the desired GDP growth while keeping unemployment, inflation, and other undesirable phenomena under control.
Ideal case of perfect knowledge. In the ideal case, when we have full information about the world, we can find the corresponding optimal trajectory, i.e., we can determine how all the quantities x1, …, xn describing the state of the system should change with time t.
In practice, a system needs to be stable. Real-life situations are more complicated: there are usually many difficult-to-predict disturbances that push the system away from its planned trajectory. We therefore need to make sure that the system takes care of these unexpected deviations and returns to its desired state. In systems theory, this is known as stability. For example:
• When an airplane encounters an unexpected thunderstorm straight ahead, it deviates from the original trajectory to avoid this thunderstorm, but then it needs to get back to the original trajectory.
• Similarly, if a storm downs electric lines, disrupting the electricity flow, the electric grid needs to get back to the desired state as soon as possible.
How can we describe the system's reaction to unexpected deviations? If the actual values Xi(t) of the corresponding quantities start deviating from the desired values xi(t), i.e., if the differences di(t) = Xi(t) − xi(t) become different from 0, then we need to change the values of the corresponding quantities, so that at the next moment of time, the system will be closer to the desired state. For example:
• an airplane needs to adjust its direction and velocity,
• the electric grid needs to change the amount of electricity going to different hubs and to different customers, etc.
In other words, we need to apply some changes c = (c1, …, cn)—depending on the observed deviations d(t) = (d1(t), …, dn(t)). Then, at the next moment of time t + Δt, the deviations will take the form

d(t + Δt) = d(t) − c(d(t)).     (1)
Possibility of linearization. Deviations d(t) are usually reasonably small: e.g., an airplane may move a few kilometers away from the trajectory of a 10,000 km intercontinental flight. Situations when deviations are small are ubiquitous in physics; see, e.g., [2, 5]. In such situations, terms which are quadratic or of higher order with respect to these deviations can be safely ignored: e.g., the square of 1% is 0.01%, which is indeed much smaller than 1%. Thus, the usual approach to such situations is to expand the dependence c(d) in a Taylor series and to ignore quadratic and higher order terms, i.e., to keep only linear terms in this expansion:

ci(d) = ci0 + ci1 · d1 + … + cin · dn.

When all deviations are 0s, there is no need for any corrections, so ci0 = 0. Thus, the formula (1) takes the following form:

di(t + Δt) = di(t) − ci1 · d1(t) − … − cin · dn(t).     (2)
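As an illustration, the update rule (2) can be simulated directly; the reaction coefficients cij below are made-up illustrative values, not taken from the text:

```python
# Simulating the linearized control law (2):
#   d_i(t + dt) = d_i(t) - c_i1 * d_1(t) - ... - c_in * d_n(t)
# with illustrative (made-up) reaction coefficients c_ij.
C = [[0.8, 0.1],
     [0.0, 0.9]]
d = [1.0, -2.0]  # initial deviations from the desired trajectory

for _ in range(50):
    correction = [sum(C[i][j] * d[j] for j in range(2)) for i in range(2)]
    d = [d[i] - correction[i] for i in range(2)]

# With these coefficients the deviations shrink toward zero.
assert max(abs(v) for v in d) < 1e-10
```

With coefficients close to the "ideal" values described below (cii near 1, small cij otherwise), repeated application of (2) drives the deviations to zero.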
Ideal control and real control. In the ideal case, we should take cii = 1 for all i and cij = 0 for all i ≠ j. In this case, at the next moment of time, the deviation will get to 0. However, such exact control is rarely possible:
• First, we usually have only an approximate understanding of the system, so even if we aim for the exact changes, the actual deviation can be somewhat different.
• Second, we may not be able to implement such a change right away:
– a plane cannot immediately get back, since that would mean uncomfortable accelerations for the passengers,
– an electric grid may take time to repair an electric line if it was damaged in a faraway area, etc.
As a result, the actual control is different from the desired one. What are the consequences of this difference?
Let us first consider the 1-D case. To illustrate how the difference between the ideal and the actual control affects the system's behavior, let us first consider the simplest 1-D case, i.e., the case when the state of the system is determined by only one quantity x1. In this case, the formula (2) takes the form

d1(t + Δt) = d1(t) − c11 · d1(t) = (1 − c11) · d1(t).     (3)

So, if no other disturbances appear, at the following moments of time, we have:

d1(t + k · Δt) = (1 − c11)^k · d1(t).     (4)
If c11 is close to 1, the original deviation decreases fast. But what if the coefficient c11 describing our reaction is different from 1?
• If c11 is much larger than 1—e.g., larger than 2—then c11 − 1 > 1 and thus, the absolute value of the deviation increases with time instead of decreasing—and it increases exponentially fast:

|d1(t + k · Δt)| = |1 − c11|^k · |d1(t)|.     (5)

In this case, the system is in immediate danger.
• On the other hand, if c11 is much smaller than 1, i.e., close to 0, we get practically no decrease, so the original deviation largely stays. As we accumulate more and more outside-caused deviations, the system moves further and further away from the desired trajectory—and the system can also be in danger.
General multi-dimensional case is similar. In general, we have a similar behavior if we use a basis formed by eigenvectors. In this case, in each direction, we have, in effect, the 1-D behavior.
3 How This Can Explain the Empirical Correlation
Let us show how the above facts about general systems explain the empirical correlation between heart rate variability and the potential for severe depression and suicidal thoughts. Indeed, a human body is constantly affected by different outside (and inside) disturbances. As a reaction, the body changes the values of several quantities; for example:
• it changes the body temperature,
• it changes the muscle tension, and
• it changes the heart beat rate.
Some changes are inevitably slower: e.g., it is not possible to change the body temperature fast. However, the heart beat rate can be changed right away. The average heart rate variability describes the healthy reaction to these disturbances, corresponding—in the notations of the previous section—to c11 ≈ 1.
If the heart rate variability is much larger than the average, this means that the body's reaction is too strong: c11 ≫ 1. We have already shown that, in systems terms, this overreaction leads to a disaster—this is why high values of heart rate variability may indicate a potentially life-threatening heart condition.
On the other hand, if the heart rate variability is much smaller than the average, this means that the body's reaction is too weak. In other words, it means that the body practically does not react to outside phenomena. In clinical terms, this indifference to everything that occurs outside is exactly what is called depression. This explains why, empirically, low heart rate variability and depression are correlated.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. Centers for Disease Control and Prevention, Web-Based Injury Statistics Query and Reporting System (WISQARS) (National Center for Injury Prevention and Control, Atlanta, Georgia, 2019) 2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, 2005) 3. T. Forkmann, J. Meessen, T. Teismann, S. Sütterlin, S. Gauggel, V. Mainz, Resting vagal tone is negatively associated with suicide ideation. J. Affect. Disord. 194, 30–32 (2016) 4. D.C. Sheridan, S. Baker, R. Dehart, A. Lin, M. Hansen, L.G. Tereshchenko, N. Le, C.D. Newgard, B. Nagel, Heart rate variability and its ability to detect worsening suicidality in adolescents: a pilot trial of wearable technology. Psychiatry Invest. 18(10), 928–935 (2021) 5. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017) 6. S.T. Wilson, M. Chesin, E. Fertuck, J. Keilp, B. Brodsky, J.J. Mann, C.C. Sönmez, C. Benjamin-Phillips, B. Stanley, Heart rate variability and suicidal behavior. Psychiatry Res. 240, 241–247 (2016)
Applications to Economics and Politics
How to Make Inflation Optimal and Fair Sean Aguilar and Vladik Kreinovich
Abstract A reasonably small inflation helps the economy as a whole—by encouraging spending—but it also hurts people by decreasing the value of their savings. It is therefore reasonable to come up with an optimal (and fair) level of inflation that would stimulate the economy without hurting people too much. In this paper, we describe how this can potentially be done.
1 Formulation of the Problem: Inflation is Often Useful
Sometimes, countries need more capital. In some cases, the country's economy is stagnant, and to boost its economy, the country needs more capital—e.g., to invest in traffic or electronic infrastructure, to replace obsolete manufacturing abilities with more modern ones, etc.
How a country can get more capital: two ways. A natural way to get this capital is to borrow money, either from its own citizens or from abroad. Borrowing money means imposing a debt on the next generations—i.e., in particular, on the country's young citizens—but since these new generations will benefit from the resulting economic growth, this additional burden sounds fair (provided, of course, that the borrowed money is indeed invested in economic growth and does not just lead to increased consumption). In the past, when money was backed by valuable assets like gold or silver, this was the only way to get more capital. However, nowadays, when everyone uses paper or electronic money without a guaranteed value, there is another way of getting
S. Aguilar College of Business Administration, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected]
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_10
more capital: to print more money. Since the value of the country’s monetary unit is, crudely speaking, determined by the overall value of the economy divided by the number of monetary units in circulation, printing more money means decreasing the value of each monetary unit. Equivalently, it means that the prices—i.e., the value of each good expressed in the decreased-in-value units—increase, i.e., that we have inflation. Inflation means that the country, in effect, decreases its current debts (at least those promised in the country’s monetary units) and thus, has additional money to invest in the economy. It is known that an appropriate level of inflation boosts the country’s economy. Natural questions. What is the optimal level of inflation? Moreover, how can we make it fair for everyone? These are the two questions that we study in this paper.
2 What is the Optimal Level of Inflation
Main idea behind inflation: reminder. If we print more money, the prices go up, and people can consume less this year; but hopefully, the economy will be boosted, and so next year, people will be able to consume more.
Situation without inflation: a simplified description. Let us consider a simplified model in which we only trace two years of the economy: the current year and the following year. Let us denote this year's average income by a0 and the expected next year's income by a1. Let s denote the average amount that people save. In the simplified two-year analysis, whatever we save this year will be consumed next year. This means that this year, we consume the value a0 − s, while next year, we consume the value a1 + s.
It is known that utility is proportional to the square root of the consumed value; see, e.g., [1]. This means that this year, our utility is proportional to √(a0 − s), and next year, the utility is proportional to √(a1 + s). When we make decisions, we add, to the current utility, next year's utility multiplied by a certain "discount" factor q < 1; see, e.g., [1]. Thus, to decide on how much we save, we optimize the expression

√(a0 − s) + q · √(a1 + s).     (1)

If the economy is not doing well, we expect a1 < a0, so instead of spending the money, we save it for the future, to make sure that we will have something to live on when our incomes decrease. This sounds reasonable, but, as a result, this year's consumption a0 − s decreases, there is less demand for goods, there is less incentive to produce, and the economy shrinks even more.
Inflation may help. How can we avoid this shrinkage? It turns out that inflation helps. With inflation, if we save the amount s, next year, this same amount of money
will enable us to buy fewer things than this year, to the equivalent of d · s, for some coefficient d < 1. Thus, the effective consumption next year will be not a1 + s but a1 + d · s. As a result, the overall utility that we want to optimize is described by the following modification of the formula (1):

u(s) = √(a0 − s) + q · √(a1 + d · s).     (2)
So what is the optimal level of inflation? To maximally boost the economy, we want to maximize this year's consumption, i.e., minimize this year's savings s. In principle, the saved portion s of this year's income a0 can be any value from 0 to a0. Thus, the smallest possible savings is s = 0. So, we want to select the inflation parameter d in such a way that, out of all possible values s ∈ [0, a0], the largest value of u(s) is attained when s = 0.
According to calculus, if a function f(x) attains its largest value on an interval at its lower endpoint, this means that the derivative f′(x) of this function at this point is nonpositive, f′(x) ≤ 0: otherwise, if it were positive, the function would reach higher values in the close vicinity of the left endpoint, and thus, the left endpoint would not be the maximum. In our case, this means that u′(0) ≤ 0. The derivative of the expression (2) takes the form

u′(s) = −1/(2√(a0 − s)) + (q · d)/(2√(a1 + d · s)).

Thus, the condition u′(0) ≤ 0 takes the form

−1/(2√a0) + (q · d)/(2√a1) ≤ 0,

i.e., equivalently, that

d ≤ √(a1/a0) · q.     (3)

Many different values of the inflation parameter d satisfy this inequality: all the values from the largest value that satisfies this inequality down to values d ≈ 0, which correspond to so-called hyper-inflation, when money loses its value almost completely. As we have mentioned, inflation decreases the consumption abilities of people. So, out of all possible values of d, we should select the least painful value—i.e., the one which is the largest. Out of all the values of d that satisfy the inequality (3), the largest possible value is

dopt = √(a1/a0) · q.     (4)
In particular, in the case of stagnation a1 = a0 , the optimal level of inflation is dopt = q.
Comment. Of course, the recommendation (4) is only approximate: it comes from a simplified two-year description of the economy. The same ideas can be used to analyze longer-term descriptions; in this case, we would probably not get simple explicit formulas and would instead need to perform numerical optimization.
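Even for the two-year model, the claim can be checked numerically: the sketch below (with made-up income and discount values) verifies by grid search that, for d = dopt from formula (4), the utility u(s) from formula (2) is indeed maximized at s = 0:

```python
import math

def u(s, a0, a1, q, d):
    """Two-year utility, formula (2): sqrt(a0 - s) + q * sqrt(a1 + d * s)."""
    return math.sqrt(a0 - s) + q * math.sqrt(a1 + d * s)

# Made-up illustrative values: stagnation (a1 = a0) and discount q = 0.9.
a0, a1, q = 100.0, 100.0, 0.9
d_opt = math.sqrt(a1 / a0) * q  # formula (4); here d_opt = q

# Grid search over possible savings s in [0, a0].
grid = [i * a0 / 1000 for i in range(1001)]
best_s = max(grid, key=lambda s: u(s, a0, a1, q, d_opt))
assert best_s == 0.0  # with the optimal inflation level, saving nothing is optimal
```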
3 How to Make Inflation Fair
But is inflation fair? Increased prices make life harder for now, but in a few years—if the inflation level is selected properly—the resulting economic boom will help the country's people. With the boom, unemployed folks will become employed, and many employed people will have their salaries increased. However, a significant group of people will not benefit from this boom at all: namely, retired people who are no longer working. The rise of prices decreases their buying abilities. In the old days, when most retired people received a pension, this could be compensated by adjusting the pension to the annual inflation. However, nowadays, in the US and many other countries, a large portion of retired people's income comes from their investments—in stocks, bonds, and special retirement funds. This income is not compensated, as a result of which inflation simply means that their standard of living decreases.
Unfortunately, the resulting unfairness is not well understood. It is commonly understood—and often cited by journalists and politicians—that borrowing money means placing a burden on the next generations. This burden is somewhat lifted by the fact that the next generations will benefit from the country's prosperity resulting from this borrowing. It is not as commonly understood that inflation also means placing an extra burden—this time, on retired people—a burden which is not lifted at all. As a result, when the country's central bank selects an inflation rate, its main concern is helping the economy—growth of the country's Gross Domestic Product (GDP), decrease in unemployment, etc.—and the adverse effect on retired people, who form a significant proportion of the population, is not taken into account.
How can we make the inflation effects fair? A seemingly natural idea that does not work. Since retired people have invested their money in all possible investment venues—stocks, bonds, bank accounts—the seemingly natural idea is to adjust all these investments based on the inflation rate. However, such a blanket increase defeats the whole purpose of inflation—the purpose is to decrease the actual value of the country's debt by decreasing the values of all the bonds, etc. If we automatically adjusted all investments, we would not have extra money to invest.
A solution that can work. An alternative solution is to take into account that, in most cases, retirement income comes from specially designated retirement investments. So, a fair approach is to automatically compensate such specially designated investments.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Reference 1. D. Kahneman, Thinking, Fast and Slow (Farrar, Straus, and Giroux, New York, 2011)
Why Seneca Effect? Sean R. Aguilar and Vladik Kreinovich
Abstract Already the ancients noticed that decline is usually faster than growth—whether we talk about companies or empires. The modern researcher Ugo Bardi confirmed that this phenomenon is still valid today. He called it the Seneca effect, after the ancient philosopher Seneca—one of those who observed this phenomenon. In this paper, we provide a natural explanation for the Seneca effect.
1 Formulation of the Problem
Empirical fact. The ancient philosopher Seneca observed, in his Letters to Lucilius [11], that in many situations, "Fortune is of sluggish growth, but ruin is rapid" (Letter 91, Part 6). In [1], the modern researcher Ugo Bardi showed that this observation is still valid—and that it applies to many social phenomena, be they states, companies, etc. Bardi called this the Seneca effect.
Problem. A natural question is: how can we explain this widespread phenomenon?
What we do in this paper. In this paper, we provide a possible explanation of the Seneca effect.
S. R. Aguilar College of Business Administration, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_11
2 Our Explanation
Case study. We will illustrate our explanation with an example where the Seneca effect has a clear numerical description: namely, the growth and collapse of a company. Historically, many companies first rather slowly grow to a significant size and then—much faster—decline. In such cases, the value of a company can be measured by the overall value of its stock.
What affects the company's valuation: a brief reminder. The overall value of a company's stock, like every other monetary value, is determined by supply and demand:
• When more people want to buy this stock than to sell it, the price of the stock goes up—and, thus, the overall company's valuation increases.
• Vice versa, when more people want to sell this stock than to buy it, the price of the stock goes down—and, thus, the overall company's valuation decreases.
Whether people want to buy or to sell a stock depends on this stock's trend:
• if the price of the stock increases, more people want to buy it, while
• if the price of the stock decreases, more people want to sell it.
The resulting positive feedback. As a result, there is a positive feedback that enhances both the increase and the decrease of the company's valuation:
• When a company's stock increases in value, more people want to buy it, and this increases its value even more.
• Similarly, when a company's stock decreases in value, more people want to sell it, and this decreases its value even more.
At first glance, the corresponding symmetry does not leave space for the Seneca effect. In all the above observations, increase and decrease of the company's valuation are described in similar terms. Because of this "symmetry", it looks like the speed of growth should be similar to the speed of decline—while in reality, the observed Seneca effect indicates that in many cases, these speeds are different. How can we explain this difference? In order to come up with such an explanation, let us go deeper into the behavior of investors. For this purpose, let us recall how—according to decision theory—investors (and people in general) make their decisions.
How people, in general, make decisions: a brief reminder. According to decision theory, the preferences of a rational person can be described by a special function called utility, which assigns a numerical value to each possible alternative. When people are faced with several alternatives, they select the one with the largest possible value of utility; see, e.g., [2, 3, 5–9]. It is important to mention that utility is not exactly proportional to money. A good approximation is that utility is proportional to the square root of a monetary gain or loss m; see, e.g., [4]:
• for gains m > 0, the utility is approximately equal to u ≈ c+ · √m; while
• for losses m < 0, the utility is approximately equal to u ≈ −c− · √|m|.
The fact that this dependence is non-linear makes sense:
• the same increase in utility means the same gain in good feelings;
• however, clearly, an increase from 0 to $100 is much more important to a person than a similar-size increase from $10,000 to $10,100.
It should also be mentioned that people are much more sensitive to losses than to gains, i.e., that c− > c+ [4]. Let us use this description of human decision making to analyze our situation.
Analysis of the situation. Each investor's action—whether it is buying or selling—requires some effort and comes with some costs. Because of this, investors only perform this action when the expected change in utility is larger than the decrease in utility caused by these auxiliary costs. Let us denote the decrease in utility corresponding to the auxiliary cost of buying or selling a single stock by u0. From this viewpoint, let us analyze when the investor will make a decision to buy or sell a given stock. Let us denote the current value of the investor's portfolio by x, and let us denote the expected change in the stock's price by x0.
If the price of this stock is increasing—and expected to continue to increase—then, if the investor spends some amount to buy this stock, the value of his/her portfolio is expected to increase by x0, to the new value x + x0. Thus, the investor's utility is expected to increase from the original amount c+ · √x to the new amount c+ · √(x + x0). So, the expected increase in utility is equal to the difference

c+ · √(x + x0) − c+ · √x.

The investor will buy this stock when this difference exceeds the threshold u0:

c+ · √(x + x0) − c+ · √x ≥ u0,     (1)

i.e., equivalently, when

√(x + x0) − √x ≥ c0,     (2)

where we denoted c0 = u0/c+.
Similarly, if the price of this stock is decreasing—and expected to continue to decrease—then, if the investor does not sell this stock, the value of his/her portfolio is expected to decrease by x0, to the new value x − x0. Thus, the investor's utility is expected to decrease from the original amount c+ · √x to the new amount c+ · √(x − x0). So, the expected decrease in utility is equal to the difference

c+ · √x − c+ · √(x − x0).
The investor will sell this stock when this difference exceeds the threshold u0:

c+ · √x − c+ · √(x − x0) ≥ u0,     (3)

i.e., equivalently, when

√x − √(x − x0) ≥ c0,     (4)
where the value c0 is the same as before.
Towards the resulting explanation of the Seneca effect. According to our analysis, the investor buys the stock when the difference √(x + x0) − √x from the left-hand side of the formula (2) exceeds the threshold c0. So, the smallest increase x0 = x+ at which the investor starts buying the stock corresponds to the case when this difference is equal to the threshold value c0:

√(x + x+) − √x = c0.

By adding √x to both sides, we get

√(x + x+) = √x + c0.

Squaring both sides, we get x + x+ = x + 2c0 · √x + c0², hence

x+ = 2c0 · √x + c0².     (5)
Similarly, the investor sells the stock when the difference √x − √(x − x0) from the left-hand side of the formula (4) exceeds the threshold c0. So, the smallest decrease x0 = x− at which the investor starts selling the stock corresponds to the case when this difference is equal to the threshold value c0:

√x − √(x − x−) = c0.

By adding √(x − x−) − c0 to both sides, we get

√x − c0 = √(x − x−).

Squaring both sides, we get x − 2c0 · √x + c0² = x − x−, hence

x− = 2c0 · √x − c0².     (6)
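Formulas (5) and (6) are easy to verify numerically; the portfolio value x and threshold c0 below are arbitrary illustrative numbers:

```python
import math

x = 10_000.0  # current portfolio value (illustrative)
c0 = 2.0      # threshold c0 = u0 / c+ (illustrative)

x_plus = 2 * c0 * math.sqrt(x) + c0 ** 2   # smallest buying-triggering increase, formula (5)
x_minus = 2 * c0 * math.sqrt(x) - c0 ** 2  # smallest selling-triggering decrease, formula (6)

# These values satisfy the defining threshold conditions (2) and (4):
assert abs((math.sqrt(x + x_plus) - math.sqrt(x)) - c0) < 1e-9
assert abs((math.sqrt(x) - math.sqrt(x - x_minus)) - c0) < 1e-9

# Selling is triggered by a smaller price change than buying:
assert x_minus < x_plus
```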
We can see that x− < x+, so selling starts at a smaller change in the stock price than buying. Thus, if we compare two situations:
• when the stock price increases with time, from some initial value, by the amount a(t), and
• when the stock price similarly decreases with time, from the same initial value, by the amount a(t),
then selling in the second situation starts earlier than buying in the first situation. We have mentioned that both selling and buying affect the stock price:
• selling further decreases the stock price, while
• buying further increases the stock price.
Thus, the fact that selling starts earlier than buying means that the corresponding selling-caused decrease starts earlier than the buying-caused increase. Therefore, the resulting selling-caused decrease will be larger than a similar buying-caused increase. In other words, a decrease in the stock price happens faster than the corresponding increase—and this is exactly the Seneca effect that we wanted to explain.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Why Rarity Score Is a Good Evaluation of a Non-Fungible Token Laxman Bokati, Olga Kosheleva, and Vladik Kreinovich
Abstract One of the new forms of investment is investing in so-called non-fungible tokens—unique software objects associated with different real-life objects like songs, paintings, photos, videos, characters in computer games, etc. Since these tokens are a form of financial investment, investors would like to estimate the fair price of such tokens. For tokens corresponding to objects that have their own price—such as a song or a painting—a reasonable estimate is proportional to the price of the corresponding object. However, for tokens corresponding to computer game characters, we cannot estimate their price this way. Based on the market prices of such tokens, an empirical expression—named rarity score—has been developed. This expression takes into account the rarity of different features of the corresponding character. In this paper, we provide a theoretical explanation for the use of the rarity score to estimate the prices of non-fungible tokens.
L. Bokati · V. Kreinovich (B), Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. O. Kosheleva, Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_12

1 Formulation of the Problem

What is a non-fungible token. One of the new ways to invest money is to invest in non-fungible tokens (NFTs). For many real-life objects—e.g.:
• for a song,
• for a painting,
• for a photo,
• for a video,
• for a character in a computer game,
a special unique software object, called a non-fungible token, is designed. For each real-life object, there is at most one non-fungible token. Owning a token does not mean owning the corresponding object, but people still buy these tokens, often for high prices: for example, if a person cannot afford to buy the actual painting, he/she can still be a proud owner of this painting's non-fungible token (which is cheaper).

Non-fungible tokens are financial instruments. Non-fungible tokens have become an investment instrument—since, like many other good investments, they, on average, increase in price. At present (2022), the overall market price of all such tokens is in the billions of dollars.

How to evaluate the price of a non-fungible token? Since tokens serve as investment instruments, buyers and sellers are interested in estimating the fair price of each such token. Such an estimate is easier for tokens of objects that themselves have a price—e.g., songs, paintings, etc. For such tokens, it is reasonable to assume that the price of a token is proportional to the price of the actual object. This makes sense:
• If a multi-million-dollar painting by a famous artist turns out to be a later forgery, the actual price of this painting plummets and, naturally, the price of its token should fall too.
• On the other hand, if an obscure painting turns out to be painted by a famous artist, the price of this painting skyrockets and, correspondingly, the price of its token should drastically increase.
But how can we estimate the price of a token corresponding to something that does not have a naturally defined price—e.g., a character in a computer game?

Enter rarity score. People actually sell and buy such tokens all the time. Usually, as with all other objects that can be sold and bought, after some oscillation, the market more or less settles on some price. How can we estimate this price? Such an estimate will be of great value to those who want to buy and/or sell such tokens. Several empirical formulas have been proposed to estimate the price of such tokens; see, e.g., [1]. The most accurate estimates are based on the so-called rarity score. To understand this notion, it is necessary to take into account that each game character has several different features that are useful in different circumstances. For example:
• a character can fly or can jump high, which is helpful in avoiding obstacles or pursuing some other dynamic goals like running away from danger;
• a character can have X-ray vision that helps this character clearly see the situation, etc.
For each of these features, we can define the rarity score as the result of dividing the overall number of characters in the given game by the number of characters with the given feature. For example, if out of 100 game characters, 5 can fly, then the rarity score of flying is 100/5 = 20.

The rarity score of a character is then estimated as the sum of the rarity scores of all its features. For example, suppose that a character:
• can fly—the rarity score of this feature is 100/5 = 20.0;
• has normal vision—as do 79 other characters, so that the rarity score of this feature is 100/80 = 1.25; and
• has a magic wand—as do 19 other characters, so that the rarity score of this feature is 100/20 = 5.0.
Then, the overall rarity score of this character is 20.0 + 1.25 + 5.0 = 26.25.
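The two-step computation above is short enough to sketch directly; the feature counts (5 flyers, 80 with normal vision, 20 with magic wands, out of 100 characters) repeat the example from the text:

```python
def feature_rarity(total, with_feature):
    # rarity score of one feature: all characters / characters having it
    return total / with_feature

def character_rarity(total, feature_counts):
    # rarity score of a character: sum of its features' rarity scores
    return sum(feature_rarity(total, c) for c in feature_counts)

score = character_rarity(100, [5, 80, 20])
print(score)  # 20.0 + 1.25 + 5.0 = 26.25
```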
But why? The rarity score provides a reasonable estimate of the market value of the token corresponding to the character, but why it works is not clear.

What we do in this paper. In this paper, we provide a possible explanation of the efficiency of the rarity-score estimation.
2 Our Explanation

A natural analogy. For people who do not play computer games, it may be difficult to reason about computer game characters. So, to explain our reasoning, let us consider the somewhat similar situation of coins or postage stamps. We all receive letters with postage stamps, and we all get coins as change and use coins to buy things (although both stamps and coins are becoming rarer and rarer). Most of the stamps we see on our letters are mass-produced, easily available ones—but sometimes, we see an unusual stamp. Similarly, most coins we get and use are easily available ones, but sometimes, we accidentally encounter an unusual coin—e.g., a rare so-called "zinc" US penny produced in 1943. Rare stamps and rare coins are highly valued by collectors.
What is a reasonable way to estimate the price of a rare stamp or a rare coin?

How to estimate the price of a rare stamp or a rare coin: a natural idea. A natural way to look for a rare penny—unless we want to simply buy it—is to inspect every penny that we have, until we find the desired one. The rarer the coin, the more time we will need to spend to find it—i.e., the more work we will have to perform. It is therefore reasonable to set up the price of a rare coin as a per-hour pay, i.e., proportional to the average time that a person needs to spend to find this rare coin. Let us transform this idea into a precise estimate.

Towards a precise estimate. Let us estimate the average time needed to find a rare coin. Let us denote the proportion of rare coins among all the coins by p. Then, if we inspect coins one by one, a rare coin can appear:
• as the first one in our search; in this case we spend 1 unit of effort—by having to look at just one coin;
• as the second one in our search; in this case we spend 2 units of effort—by having to look at two coins;
etc. Thus, the average number a of units of effort to find a rare coin can be computed as follows:

a = p1 · 1 + p2 · 2 + . . . ,  (1)

where p1 denotes the probability that the first coin is rare, p2 is the probability that the first coin is not rare but the second coin is rare, etc. In general, for every positive integer k, pk is the probability that the first k − 1 coins were not rare but the k-th coin is rare. The probability that a randomly selected coin is rare is equal to p, and the probability that this coin is not rare is equal to 1 − p. Different coins are independent, so the probability pk is equal to the product of k − 1 terms equal to 1 − p and one term equal to p:

pk = (1 − p)^(k−1) · p.  (2)

Substituting the expression (2) into the formula (1), we conclude that

a = p · 1 + (1 − p) · p · 2 + (1 − p)² · p · 3 + . . . + (1 − p)^(k−1) · p · k + . . .  (3)

All the terms in this formula have a common factor p, so we can conclude that

a = p · A,  (4)

where we denoted

A = 1 + (1 − p) · 2 + (1 − p)² · 3 + . . . + (1 − p)^(k−1) · k + . . .  (5)
To estimate the value of A, let us multiply both sides of the formula (5) by 1 − p; then we get:

(1 − p) · A = (1 − p) · 1 + (1 − p)² · 2 + . . . + (1 − p)^(k−1) · (k − 1) + . . .  (6)

If we subtract (6) from (5), then we get 1 and also, for each k, we subtract the term (1 − p)^(k−1) · (k − 1) from the term (1 − p)^(k−1) · k, resulting in

(1 − p)^(k−1) · k − (1 − p)^(k−1) · (k − 1) = (1 − p)^(k−1) · (k − (k − 1)) = (1 − p)^(k−1).  (7)

Thus, the difference

A − (1 − p) · A = (1 − (1 − p)) · A = p · A  (8)

takes the form

p · A = 1 + (1 − p) + (1 − p)² + . . . + (1 − p)^(k−1) + . . .  (9)

To compute the sum in the right-hand side of the formula (9), we can use the same trick: multiply both sides of this formula by 1 − p and subtract the resulting equality from (9). Multiplying both sides of the equality (9) by 1 − p, we conclude that

(1 − p) · p · A = (1 − p) + (1 − p)² + . . . + (1 − p)^(k−1) + . . .  (10)

Subtracting (10) from (9), we conclude that

p · A − (1 − p) · p · A = (1 − (1 − p)) · p · A = p² · A = 1.  (11)

Thus,

A = 1/p²,  (12)

and so, due to formula (4),

a = p · A = p · (1/p²) = 1/p.  (13)
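The conclusion a = 1/p can be checked by a quick Monte Carlo simulation of the search process: inspect coins one by one until a rare one appears, and average the number of inspections. (The specific proportion p = 0.05 below is an arbitrary illustrative choice.)

```python
import random

def average_effort(p, trials=100_000, seed=0):
    """Monte Carlo estimate of the average number of coins inspected
    until the first rare one (each coin is rare with probability p)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        k = 1
        while rng.random() >= p:   # keep inspecting non-rare coins
            k += 1
        total += k
    return total / trials

p = 0.05
print(average_effort(p))  # close to 1/p = 20
```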
How is this estimate related to the rarity score of a feature. The probability p is equal to the ratio of the number of rare coins to the overall number of coins. Thus, the expression (13)—the inverse 1/p—is the ratio of the overall number of coins to the number of rare coins—exactly what is called the rarity score of a feature.

From the rarity score of a feature to the rarity score of the character. To explain how to go from the rarity scores of different features to the rarity score of the character with
these features, let us use a different analogy. Suppose that at an international airport, there are three souvenir stores:
• one sells US souvenirs and only accepts US dollars,
• the second one sells Canadian souvenirs and only accepts Canadian dollars, and
• the third one sells Mexican souvenirs and only accepts Mexican pesos.
If we have a wallet with money of all three countries, then the overall price of our wallet is the sum of the prices of all three components. Similarly, since different features can be used in different situations—just as the different currencies in the above example can be used in different stores—it is reasonable to estimate the overall price of a character as the sum of the prices corresponding to these different features, i.e., as the sum of the corresponding rarity scores. This is exactly how the character's rarity score is estimated. Thus, we have indeed found an explanation for this estimate.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Reference

1. Ranking Rarity: Understanding Rarity Calculation Methods, https://raritytools.medium.com/ranking-rarity-understanding-rarity-calculation-methods-86ceaeb9b98c, downloaded on June 22, 2022
Resource Allocation for Multi-tasking Optimization: Explanation of an Empirical Formula Alan Gamez, Antonio Aguirre, Christian Cordova, Alberto Miranda, and Vladik Kreinovich
Abstract For multi-tasking optimization problems, it has been empirically shown that the most effective resource allocation is attained when we assume that the gain of each task logarithmically depends on the computation time allocated to this task. In this paper, we provide a theoretical explanation for this empirical fact.
A. Gamez · A. Aguirre · C. Cordova · A. Miranda · V. Kreinovich (B), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_13

1 Formulation of the Problem

Formulation of the practical problem. In many practical situations, we have several possible objective functions. For example, when we design a plant:
• we can look for the cheapest design,
• we can look for the most durable design,
• we can look for the most environmentally friendly design, etc.
To make a decision, it is desirable to find designs which are optimal with respect to each of these criteria.
• This way, if we find, e.g., that the most environmentally friendly design is close to the cheapest one, we can add a little more money and make the design environmentally friendly.
• On the other hand, if these two designs are too far apart, we can apply for an environment-related grant to make the design environmentally friendly.

Optimizing different functions on the same domain involves several common domain-specific computational modules. Thus, it makes sense to perform all these optimization tasks on the same computer. In this case, the important question is how to distribute resources between the different tasks: e.g., when one task is close to completion:
• this task does not require many resources,
• while other tasks may require a lot of resources.

Let us formulate this problem in precise terms. At each time period, we need to distribute the available computation time T between the tasks, i.e., find values tk for which

t1 + t2 + . . . = T.

We should select the values tk for which the overall gain is the largest:

∑k gk(t̄k + tk) → max,

where:
• t̄k is the time already spent on task k, and
• gk(t) indicates how much the k-th task will gain if it is allowed computation time t.

Related challenge and how to overcome it. We do not know exactly how each gain gk will change with time. So, a natural idea is:
• to select a family of functions g(t) depending on a few parameters;
• for each task, to find the parameters that lead to the best fit; and
• then to use these parameters to predict the value gk(t̄k + tk).

Empirical fact. An empirical fact is that, among 2-parametric families, the best results are achieved for g(t) = a · ln(t) + A; see, e.g., [2].

Remaining problem. How can we explain this empirical fact?

What we do in this paper. In this paper, we provide a theoretical explanation for this empirical fact.
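The best-fit step of the idea above reduces to a linear regression of the observed gains against ln(t). A minimal sketch (the synthetic gain data and parameter values are illustrative assumptions, not from the cited experiments):

```python
import math

def fit_log_gain(ts, gs):
    """Least-squares fit of g(t) = a*ln(t) + A to observed (t, g) pairs:
    a linear regression of g against x = ln(t)."""
    xs = [math.log(t) for t in ts]
    n = len(xs)
    mx = sum(xs) / n
    mg = sum(gs) / n
    a = (sum((x - mx) * (g - mg) for x, g in zip(xs, gs))
         / sum((x - mx) ** 2 for x in xs))
    A = mg - a * mx
    return a, A

# synthetic gains that exactly follow g(t) = 2*ln(t) + 1
ts = [1.0, 2.0, 4.0, 8.0]
gs = [2 * math.log(t) + 1 for t in ts]
a, A = fit_log_gain(ts, gs)
print(round(a, 6), round(A, 6))  # 2.0 1.0
```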
2 Our Explanation

Invariance: the main idea behind our explanation. The numerical value of each physical quantity depends:
• on the selection of the measuring unit, and
• on the selection of the starting point.
Specifically:
• If we replace the original measuring unit with one which is λ times smaller, all numerical values multiply by λ: x → λ · x.
• If we select a new starting point which is x0 units smaller, then we get x → x + x0.
In many cases, there is no preferable measuring unit. In such cases, it makes sense to assume that the formulas should remain the same if we change the measuring unit.

Let us apply this idea to our case. In our case:
• there is a clear starting point for time: the moment when the computations started;
• however, there is no preferred measuring unit.
Thus, it is reasonable to require that if we change t to λ · t, we will get the same resource allocation.

Clarification and the resulting requirement. Invariance does not necessarily mean that g(λ · t) = g(t), since the functions gk(t) and gk(t) + const lead to the same resource allocation. Thus, we require that for every λ > 0, we have g(λ · t) = g(t) + c for some constant c depending on λ.

This requirement explains the above empirical fact. It is known (see, e.g., [1]) that every measurable solution to this functional equation has the form g(t) = a · ln(t) + A. This explains the empirical fact: the dependencies g(t) = a · ln(t) + A are the only ones that do not depend on the choice of a measuring unit for time.
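The scale-invariance requirement g(λ · t) = g(t) + c(λ) is easy to verify numerically for the logarithmic family, with c(λ) = a · ln(λ). A minimal check (the values of a and A are arbitrary illustrative choices):

```python
import math

# Check that g(t) = a*ln(t) + A satisfies g(lam*t) = g(t) + c(lam),
# where the shift c(lam) = a*ln(lam) does not depend on t.
a, A = 3.0, -1.5

def g(t):
    return a * math.log(t) + A

for lam in (0.5, 2.0, 10.0):
    for t in (0.1, 1.0, 7.0, 42.0):
        # the same shift for every t: rescaling time only adds a constant
        assert math.isclose(g(lam * t) - g(t), a * math.log(lam))
```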
3 How to Prove the Result About the Functional Equation

This result is easy to prove when the function g(X) is differentiable. Indeed, suppose that g(λ · X) = g(X) + c(λ). If we differentiate both sides with respect to λ, we get X · g′(λ · X) = c′(λ). In particular, for λ = 1, we get X · g′(X) = a, where we denoted a = c′(1). So:

X · (dg/dX) = a.
We can separate the variables if we multiply both sides by dX and divide both sides by X; then we get

dg = a · (dX/X).

Integrating both sides of this equality, we get the desired formula g(X) = a · ln(X) + C.

Acknowledgements This work was supported in part by the National Science Foundation grants:
• 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and
• HRD-1834620 and HRD-2034030 (CAHSI Includes).
It was also supported by the AT&T Fellowship in Information Technology, and by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478. The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References 1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, Cambridge, 2008) 2. T. Wei, J. Zhong, Towards generalized resource allocation on evolutionary multitasking for multi-objective optimization, in IEEE Computational Intelligence Magazine (2021), pp. 20–36
Everyone is Above Average: Is It Possible? Is It Good? Olga Kosheleva and Vladik Kreinovich
Abstract Starting with the 1980s, a popular US satirical radio show described a fictitious town Lake Wobegon where “all children are above average”—parodying the way parents like to talk about their children. This everyone-above-average situation was part of the fiction since, if we interpret the average in the precise mathematical sense, as average over all the town’s children, then such a situation is clearly impossible. However, usually, when parents make this claim, they do not mean town-wise average, they mean average over all the kids with whom their child directly interacts. Somewhat surprisingly, it turns out that if we interpret average this way, then the everyone-above-average situation becomes quite possible. But is it good? At first glance, this situation seems to imply fairness and equality, but, as we show, in reality, it may lead to much more inequality than in other cases.
O. Kosheleva, Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. V. Kreinovich (B), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_14

1 Everyone is Above Average: Formulation of the Problem

Where this expression comes from. A Prairie Home Companion, a satirical radio show very popular in the US from the 1980s to the 2000s, regularly described events from the fictitious town of Lake Wobegon, where, among other positive things, "all children are above average".

Why this situation is usually considered to be impossible. This phrase—which is a part of the usual parents' bragging about their children—was intended as satire because, of course, if all the values are above a certain level, then the average is also larger than this level—and so, this level cannot be the average level.
But is it really impossible? It depends on what we mean by average. But is such a situation indeed impossible? It would indeed be impossible if by "average" the speaker meant that, e.g., the IQ level of each child is higher than the town (or state) average. However, when people brag about their kids, they usually do not mean a comparison with the town or state average—especially since such an average is rarely computed and practically never known. When we say that someone is above average, we usually mean that, in terms of the desired characteristic(s), this person is above the average of his/her nearest neighbors—i.e., of the people with whom he/she constantly interacts and thus has the opportunity to compare the corresponding characteristic—be it IQ, school grades, sports achievements, etc.

Resulting problem. If we interpret "above average" in this localized way, a natural question is: in this interpretation, is it possible that everyone is above average? And if it is possible, is it good? To some extent, this situation is a version of equality—no one is left behind—but equality does not necessarily mean that the situation is good: e.g., it is not good when everyone is equally poor. In this paper, we provide simple answers to both questions.
2 Is It Possible?

Let us formulate the problem in precise terms. To answer the above question, let us describe the problem in precise terms. First, let us consider a continuous approximation. This is a must in physics—see, e.g., [1, 3]—where matter consists of atoms, but since there are too many of them to have an efficient atom-simulating computational model, it is reasonable to consider a continuous approximation. For example, instead of the masses of individual atoms, we characterize these masses by a function ρ(x, y) which is proportional to the average mass of all the particles in a small vicinity of the spatial point (x, y).

Similarly, there are too many people on Earth to have an efficient each-person-simulating computational model, so it is reasonable to consider a continuous model. In other words, we assume that the distribution of the desired quantity u among humans is described by a function u(x, y)—the average value of this quantity in a small vicinity of the spatial point (x, y). By the average a(x, y) to which a person located at a point (x, y) is comparing his/her characteristics, it is reasonable to understand the average value of this function in some vicinity of this point. This notion becomes precise if we fix the radius ε > 0 of this vicinity. In this case, the average takes the form

a(x, y) = (1 / ((4/3) · π · ε³)) · ∫∫_{(X,Y): r ≤ ε} u(X, Y) dX dY, where r = d((x, y), (X, Y)).  (1)
In these terms, the problem takes the following form: is it possible to have a function u(x, y) for which u(x, y) > a(x, y) for all points (x, y)?

Analysis of the problem. The vicinity in which people have a lot of contact is usually small—the overall number of people each of us seriously knows is in the hundreds, which is much, much smaller than the billions of the world's population. Since this vicinity is small, in this vicinity, we can expand the function u(X, Y) in a Taylor series in terms of the small differences X − x and Y − y, and keep only the few main terms in this expansion. In other words, we take

u(X, Y) = a_0 + a_x · (X − x) + a_y · (Y − y) + a_xx · (X − x)² + a_xy · (X − x) · (Y − y) + a_yy · (Y − y)²,  (2)

where

a_0 = u(x, y), a_x = ∂u/∂x, a_y = ∂u/∂y, a_xx = (1/2) · ∂²u/∂x², a_xy = ∂²u/(∂x ∂y), a_yy = (1/2) · ∂²u/∂y².  (3)
Let us compute the average of this expression over the vicinity. The average of a linear combination is equal to the linear combination of the averages; thus, to find the average of the expression (2), it is sufficient to compute the averages of the terms 1, X − x, Y − y, (X − x)², (X − x) · (Y − y), and (Y − y)².

The average value of the constant 1 is, of course, this same constant. The average value of the term X − x is 0, since the circle is invariant with respect to reflections in the line X = x and thus, for every point (X − x, Y − y), we have a point (−(X − x), Y − y) in the same vicinity, and the contributions of these two points to the integral describing the average cancel each other. Similarly, the averages of the terms Y − y and (X − x) · (Y − y) are zeros.

Let us find the average A of the term (X − x)². Since the circle is invariant with respect to swapping x and y, the average of (X − x)² is equal to the average of (Y − y)²; thus the average of the sum (X − x)² + (Y − y)² is equal to 2A. In polar coordinates, this sum is equal to r²; thus the integral

∫∫_{(X,Y): r ≤ ε} r² dX dY, where r = d((x, y), (X, Y)),

is equal to

∫₀^ε r² · 2 · π · r dr = 2 · π · ∫₀^ε r³ dr = 2 · π · (ε⁴/4) = (1/2) · π · ε⁴.

Thus, the average of this sum is equal to
2A = ((1/2) · π · ε⁴) / ((4/3) · π · ε³) = (3/8) · ε.

Hence, the average A of each of the two terms (X − x)² and (Y − y)² is equal to

A = (3/16) · ε.

Thus, for the average a(x, y), we get the following expression:

a(x, y) = u(x, y) + (3/16) · ε · (a_xx + a_yy).

Here, the sum

a_xx + a_yy = ∂²u/∂x² + ∂²u/∂y²

is known as the Laplacian; it is usually denoted by Δu, so

a(x, y) = u(x, y) + (3/16) · ε · Δu(x, y).  (4)
So when is this possible? If the Laplacian is everywhere negative, then at each point (x, y), the value u(x, y) of the desired quantity is larger than the corresponding average a(x, y). This happens, e.g., when the function u(x, y) is strictly concave; see, e.g., [2]—and there are many such functions. Also, on a local level, this property holds in a vicinity of a maximum—provided that the corresponding matrix of second derivatives (known as the Hessian) is not degenerate.

Comment. This fits well with the fact that the fictitious Lake Wobegon is located in the United States—one of the richest countries on Earth—i.e., indeed, close to the maximum.
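The everyone-above-their-local-average property of concave functions can be checked numerically. Below is a small Monte Carlo sketch: for a strictly concave u, the value at each point exceeds the average of u over a small disc around that point (the specific function, points, and radius are illustrative choices):

```python
import math
import random

def disc_average(u, x, y, eps, n=200_000, seed=1):
    """Monte Carlo average of u over the disc of radius eps around (x, y)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # uniform random point in the disc: r ~ eps*sqrt(U), angle uniform
        r = eps * math.sqrt(rng.random())
        th = 2 * math.pi * rng.random()
        total += u(x + r * math.cos(th), y + r * math.sin(th))
    return total / n

u = lambda x, y: -(x * x + y * y)   # strictly concave; Laplacian = -4 < 0
for (x, y) in [(0.0, 0.0), (1.0, -2.0), (3.5, 0.2)]:
    a = disc_average(u, x, y, eps=0.1)
    assert u(x, y) > a   # everyone is "above average" of their neighborhood
```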
3 It is Possible, But is It Good?

Not always. If everyone is above average, it may sound good, but in reality, it is not always good. As an example, let us consider a strictly concave 1-D function, such as u(x) = −x². For a concave function, the second derivative is negative; thus, the first derivative can only decrease. Thus, as we go further in the direction in which the function
decreases, its rate of decrease cannot get smaller; it only continues to increase, and the further we go from the maximum, the smaller the value of the function u(x)—and this decrease is at least linear in terms of the distance to the function's maximum location. In other words, in this case, we would have a drastic decrease of quality—which we would not see if we did not have this seemingly positive everyone-above-average situation.

Discussion. The last argument is the second time in this paper when our intuition seems to mislead us:
• First, our intuition seems to indicate that a situation in which everyone is above average is not possible. However, contrary to this intuition, it turns out that such situations are possible, and rather typical—e.g., any concave function has this property.
• Second, our intuition seems to indicate that if everyone is above average, this kind of implies fairness and equality. In reality, in this case, income—or whatever characteristic we are looking at—becomes even more unequally distributed than in situations where this everyone-above-average property is not satisfied.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, 2005)
2. R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1997)
3. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, 2017)
How Probable is a Revolution? A Natural ReLU-Like Formula That Fits the Historical Data Olga Kosheleva and Vladik Kreinovich
Abstract In his recent book “Principles for Dealing with the Changing World Order”, Ray Dalio considered many historical crisis situations, and came up with several data points showing how the probability of a revolution or a civil war depends on the number of economic red flags. In this paper, we provide a simple empirical formula that is consistent with these data points.
O. Kosheleva · V. Kreinovich (B), University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_15

1 Formulation of the Problem

Economic crises often lead to internal conflicts. On many historical occasions, economic problems such as high inequality, high debt levels, high deficits, high inflation, and slow (or absent) growth led to internal conflicts: revolutions and civil wars. However, an internal conflict is not pre-determined by the state of the economy: sometimes an economic crisis causes a revolution, and sometimes a similar crisis is resolved peacefully.

What we would like to predict. There are many factors besides the economy that affect human behavior. So, if we only consider the state of the economy, we cannot exactly predict whether this state will cause a revolution or not. At best, we can predict the probability of a revolution.

What information we can use to make this prediction. Clearly, the more severe the economic crisis, the more probable the revolution. A natural measure of severity is the proportion of economic "red flags"; for example:
• if we only have high inequality, but inflation remains in check and the economy continues growing, then the probability of a revolution is low;
85
86
O. Kosheleva and V. Kreinovich
• on the other hand, in situations when high inequality is accompanied by high inflation and slow growth, revolutions becomes much more probable. Empirical data. In Chap. 5 of his book [1], Ray Dalio describes the results of his statistical analysis of the problem. Specifically, he considered all known economic crises: • he grouped them into groups depending on the proportion of red flags, and • for each group, he counted the proportion of cases in which the economic crisis caused internal strife. His results are as follows: • when the proportion of red flags is smaller than 40%, then the probability of a revolution is 12%; • when the proportion of red flags is between 40% and 60%, the probability of a revolution is 11%; • when the proportion of red flags is between 60% and 80%, the probability of a revolution is 17%; and • when the proportion of red flags is larger than 80%, the probability of a revolution is 30%. What we do in this paper. In this paper, we provide a simple natural formula that is consistent with this empirical data.
2 Analysis of the Problem and the Resulting Formula

What happens when there are few red flags. It is natural to start our analysis with the first of the above groups, i.e., with the cases in which the proportion of red flags was reasonably low. In such cases, we observe an interesting phenomenon: even if we increase the proportion of red flags, the probability of a revolution does not increase. Actually, it looks like the probability even decreases—from 12 to 11%—but this small decrease is not statistically significant and is, thus, probably caused by the fact that the sample size is not large. It looks like in cases where there are very few red flags, the probability of a revolution remains at the same level (small, but still the same), all the way from 0% to 60% of red flags. The fact that revolutions happen even in good economic situations, when there are few economic red flags—or even none—is in good accordance with our above comment that while the economy is important, it is not the only factor affecting human behavior.

Natural conclusion: need to analyze the excess probability. Since it looks like there is some probability of a revolution even when the economy is in good shape, what we can predict based on the proportion of red flags is the increase of this probability, i.e., the difference between:
• the probability corresponding to a given proportion of red flags and
• the economy-independent probability of a revolution.

The largest value of the proportion of red flags at which the probability of a revolution still does not increase is 60%. At this level, the probability of a revolution is 11%. Thus, to study how the economy affects the probability of a revolution, we need to compute the excess probability, i.e., to subtract 11% from all the empirical probabilities. As a result, we get the following data:

• when the proportion of red flags is smaller than 60%, the excess probability of a revolution is 11 − 11 = 0%;
• when the proportion of red flags is between 60% and 80%, the excess probability of a revolution is 17 − 11 = 6%; and
• when the proportion of red flags is larger than 80%, the excess probability of a revolution is 30 − 11 = 19%.

Preparing for numerical interpolation. We know three values, and we want to extend these values to a formula that would describe the probability of a revolution y as a function of the proportion of red flags x. In mathematics, such an extension is known as interpolation. Usual interpolation formulas for the dependence y = f(x) assume that we know, in several cases, the value of x and the corresponding value of y. In our case:

• we know the exact values of y;
• however, for x, we only know intervals.

So, to apply the usual interpolation techniques, we need to select a single point within each of these intervals. Which point should we select? In statistical analysis, for each random variable X, the natural idea is to select a value x for which the mean square difference E[(X − x)²] is the smallest possible. It is known (see, e.g., [5]) that this value is equal to the mean x = E[X] of the corresponding random variable. In our case, we only know that the value x belongs to some interval [x̲, x̄]; we do not know the probabilities of different values from this interval.
Since we have no reason to believe that some values x from this interval are more probable than others, it makes sense to assume that all these probabilities are equal, i.e., that we have a uniform distribution on this interval. This argument—widely used in statistical analysis—is known as Laplace Indeterminacy Principle; it is an important particular case of the general Maximum Entropy approach; see, e.g., [4]. For the uniform distribution on an interval, its mean value is this interval’s midpoint. Thus, for interpolation purposes, we will replace each interval by its midpoint:
• the interval [60, 80] will be replaced by its midpoint (60 + 80)/2 = 70, and
• the interval [80, 100] will be replaced by its midpoint (80 + 100)/2 = 90.

After this replacement, we get the following conclusion:

• when the proportion of red flags is x1 = 60%, the excess probability of a revolution is y1 = 0%;
• when the proportion of red flags is x2 = 70%, the excess probability of a revolution is y2 = 6%; and
• when the proportion of red flags is x3 = 90%, the excess probability of a revolution is y3 = 19%.

Actual interpolation. Interpolation usually starts with trying the simplest possible dependence—i.e., a linear one y = a + b · x, for which the difference in y is proportional to the difference in x: yi − yj = b · (xi − xj) for all i and j, i.e., for which the ratio

(yi − yj)/(xi − xj)

is the same for all pairs i ≠ j. In our case, we have

(y2 − y1)/(x2 − x1) = (6 − 0)/(70 − 60) = 6/10 = 0.6

and

(y3 − y1)/(x3 − x1) = (19 − 0)/(90 − 60) = 19/30 ≈ 0.63.

These two ratios are indeed close. If we take into account that, as we have mentioned earlier, a 1% difference between y-values is below the statistical significance level, and that 19 could as well be 18—for which

(y3 − y1)/(x3 − x1) = (18 − 0)/(90 − 60) = 18/30 = 0.6,

we get a perfect fit. Thus, we conclude that for x > 60%, the dependence of y on x is linear: y = 0.6 · (x − 60). So, we arrive at the following formula.

Resulting formula. When we know the proportion x of red flags, we expect the excess probability y of a revolution to be equal to:
• y = 0 when x ≤ 60, and
• y = 0.6 · (x − 60) for x ≥ 60.

These two cases can be described by a single formula

y = max(0, 0.6 · (x − 60)).
(1)
To get the full probability p of a revolution, we need to add the non-economic 11% level to y, and get

p = 11 + max(0, 0.6 · (x − 60)). (2)
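As a quick sanity check, the fit of formula (2) against Dalio's group probabilities can be verified in a few lines of Python. This is only a sketch: the midpoints 30, 50, 70, and 90 stand in for the four red-flag groups, with 30 and 50 representing the below-60% range.

```python
def revolution_probability(x):
    """Formula (2): an 11% baseline plus a ReLU-like excess term.

    x is the proportion of economic red flags, in percent;
    the result is the probability of a revolution, in percent.
    """
    return 11 + max(0.0, 0.6 * (x - 60))

# Interval midpoints paired with Dalio's empirical probabilities:
data = [(30, 12), (50, 11), (70, 17), (90, 30)]
for x, observed in data:
    # every deviation stays within the ~1% statistical noise of the data
    assert abs(revolution_probability(x) - observed) <= 1.0
```

At x = 90 the formula gives 29% rather than the observed 30%, which matches the remark above that a 1% difference is below the statistical significance level.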
Comment. The formula (1) is very similar to the Rectified Linear Unit (ReLU) function—an activation function used in deep learning, which is, at present, the most effective machine learning tool; see, e.g., [3].

Why this formula? So far, we have provided a purely mathematical/computational justification for this formula. How can we explain it from the viewpoint of economics and human behavior? A natural explanation of the first part—where the probability of a revolution does not increase with x—is that in these situations, just as in physics, there is static friction: systems do not drastically change all the time; one needs to reach a certain threshold level of external forces to make the system change drastically. The linear dependence after that is also easy to explain: as in physics (see, e.g., [2, 6]), most real-life dependencies are smooth and even analytical, and in the first approximation, any analytical dependence can be described by a linear function.

Caution. Of course, these are preliminary results; we should not put too much trust in formulas based on a few data points.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to Omar Badreddin for his valuable advice.
References

1. R. Dalio, Principles for Dealing with the Changing World Order: Why Nations Succeed and Fail (Avid Reader Press, New York, 2021)
2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, 2005)
3. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
4. E.T. Jaynes, G.L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, 2003) 5. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, 2011) 6. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, 2017)
Why Should Exactly 1/4 Be Returned to the Original Owner: An Economic Explanation of an Ancient Recommendation Olga Kosheleva and Vladik Kreinovich
Abstract What if someone bought a property in good faith, not realizing that this property had been unjustly confiscated from the previous owner? In such situations, if the new owner decided to sell this property, the Talmud recommended that a fair way is to return 1/4 of the selling price to the original owner. However, the Talmud does not provide any explanation of why exactly 1/4—and not some other portion—is to be returned. In this paper, we provide an economic explanation for this recommendation, an explanation that fits well with other ancient recommendations about debts.
1 Formulation of the Problem

An ethical problem. Life is unfair. Many times in history, unjust governments forcibly and unjustly confiscated property from its rightful owners. Once such a government is no longer in power, the previous owners naturally want to either get this property back or at least get some compensation. If this property was simply used by the unjust government itself, then the solution is straightforward: we should return this property to the original owner. But in many cases, the situation is not so clear. Someone may have bought this property with their own hard-earned money without realizing that it was confiscated unjustly; maybe after that, someone else bought this property, etc. If this is an object of luxury, like a painting worth millions of dollars, then the current tendency is to return it to the original owner without any compensation to the current owner. This may not be fair
O. Kosheleva Department of Teacher Education, University of Texas at El Paso 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_16
to the new owner—but, honestly, if the new owner could afford to buy a multi-million-dollar painting, losing this painting will not leave him/her starving or homeless. But what if this property was a small apartment or a small farm? The new owner did not do anything wrong, so why penalize him/her? This may be an apartment for which he/she saved money all his/her life—and now we confiscate this money and make the new owner homeless? This does not seem fair. So what can we do?

Ancient solution: general idea. The above ethical problem is not new—just like unjust confiscations of property are not new. Such confiscations happened in ancient times as well, and so the resulting ethical problem appeared already then. This problem was the subject of many discussions. In particular, the Talmud offers the following solution to this ethical problem [5]:

• The new owner—who is not guilty of any wrongdoing—is not penalized. He or she can keep this property, and if he/she sells this property, the new owner is entitled to get back the full price of this property, i.e., the price that he/she paid when he/she bought it.
• If the new owner is willing to sell this property, and the original owner is willing to buy it back, then the original owner has priority—in such a situation, the new owner is obliged to sell to the original owner and not to anyone else.
• However, if this property is on sale, and the original owner is unwilling or unable to buy it, then the situation changes: the new buyer is now aware that he/she is buying a tainted property, so the new buyer is obliged to return a portion of this property to the original owner—either as a piece of land or as a monetary payment.

Ancient solution: details. The general idea sounds reasonable, but the Talmud then goes into details about what portion should be returned to the original owner. No general rule for determining this portion is provided, only an example. In this example, when 12 months passed between the original confiscation and the forthcoming sale, the new buyer must return one quarter of the property to the original owner.

Why 1/4? The Talmud rarely provides exact algorithms; it usually provides examples—and the intent was that these examples should be sufficient to reconstruct the desired algorithm. This explanation by examples was typical in the ancient world—this is how mathematical ideas were usually explained (see, e.g., [1]). The fact that in this case there was only one example seems to indicate that the authors of this advice believed that this example would be sufficient to come up with a general idea. So what is this general idea?

What we do in this paper. In this paper, we provide a possible explanation of the 1/4.
2 Towards a Possible Explanation

How did the buyer accumulate the desired amount of money? Let us denote the amount that the buyer is paying by m. If the buyer had this amount a year ago, he/she could already have bought a similar property then. The fact that the buyer did not do this a year ago means that a year ago, the buyer did not have this amount of money. We can estimate the amount of money p that the buyer had a year ago if we know the interest rate r. Indeed, if a year ago the buyer had the amount p, then now he/she has p · (1 + r).

• Clearly, this amount p · (1 + r) must be larger than or equal to m—otherwise, the buyer would not be able to pay now.
• It is also not possible for the amount p · (1 + r) to be larger than m—this would mean that the buyer accumulated the desired sum m before the 12 months passed, in which case he/she would have bought a similar property at that past moment and would not need to wait until the current moment.

So, the only remaining possibility is that p · (1 + r) = m, or, equivalently, that

p = m / (1 + r).
(1)
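The arithmetic behind formula (1) can be checked with a short Python sketch; it uses the historical interest rate r = 1/3 discussed later in the text, and exact fractions to avoid any rounding.

```python
from fractions import Fraction

r = Fraction(1, 3)   # the ancient Middle East annual interest rate
m = Fraction(1)      # normalize the current sale price to 1 monetary unit

p = m / (1 + r)      # formula (1): the amount the buyer had a year ago
returned = m - p     # the share that goes back to the original owner

assert p == Fraction(3, 4)         # the buyer keeps 3/4 of the property
assert returned == Fraction(1, 4)  # exactly the 1/4 recommended by the Talmud
```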
Let us now take fairness into account. A year ago, the original owner had the property worth m monetary units, and the buyer had the amount of money described by the formula (1). This past year was not very good for the original owner—he lost his property—and probably not very good for many other people, since unjust confiscations are usually a mass phenomenon. The buyer did not suffer; he/she honestly earned his/her amount p. In general, in a normal situation, it would be fair for the buyer to also enjoy his/her interest, but in view of the general massively bad situation, it does not seem fair to let the buyer enjoy the interest while many others suffered. We cannot do anything about what happened to others, but since the buyer in effect benefits from the misfortune of others, it sounds fair not to allow the buyer to benefit from the last year's interest when others suffered. From this viewpoint, the buyer is only entitled to the part of the property which is worth his/her last year's amount p—and it is fair to give the remaining amount to the original owner, as a partial compensation.

Let us describe this conclusion in numerical terms. According to historians, in the ancient Middle East, a person could get 1/3 of the original amount per year, i.e., we have r = 1/3; see, e.g., [3, 4]. Thus, according to the formula (1), the buyer is only entitled to the portion

p = m / (1 + 1/3) = m / (4/3) = (3/4) · m

of the original property. The remaining amount

m − p = m − (3/4) · m = (1/4) · m
should be returned to the original owner. This is exactly the amount recommended by the Talmud—so the above fairness arguments explain the mysterious appearance of 1/4 in this advice.

Auxiliary explanation. The same ancient interest rate r = 1/3 explains another historical fact: according to the Code of Hammurabi [2], a person who is unable to return his/her debt to another person becomes the other person's slave (i.e., an unpaid worker) for 3 years. Indeed, the interest rate of 1/3 means, in particular, that if we invest money in a person—e.g., by hiring a worker—we will get, through this person's labor, on average 1/3 of our original amount per year. So, each year, the person pays off 1/3 of the original debt—and thus, in 3 years, will pay off his/her full debt. However, in three years, the person just returns the original debt, not the accumulated interest. To get back more than the original amount, one needs to make the debtor work for a longer time—this is, e.g., why in such a situation the Bible recommends six years of unpaid work.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. C.B. Boyer, U.C. Merzbach, History of Mathematics (Wiley, Hoboken, 2011)
2. Hammurabi, The Code of Hammurabi (Compass Circle, Virginia Beach, Virginia, 2019)
3. M. Hudson, How interest rates were set, 2500 BC–1000 AD: máš, tokos, and foenus as metaphors for interest accruals. J. Econ. Soc. Hist. Orient 43(2), 132–161 (2000)
4. K. Muroi, The oldest example of compound interest in Sumer: seventh power of four-thirds (2015), https://arxiv.org/ftp/arxiv/papers/1510/1510.00330.pdf
5. Babylonian Talmud, Gittin 55b, https://www.sefaria.org/Gittin.55b?lang=bi
Why Would Anyone Invest in a High-Risk Low-Profit Enterprise? Olga Kosheleva and Vladik Kreinovich
Abstract Strangely enough, investors invest in high-risk low-profit enterprises as well. At first glance, this seems to contradict common sense and financial basics. However, we show that such investments make perfect sense as long as the related risks are independent of the risks of other investments. Moreover, we show that an optimal investment portfolio should allocate some investment to these enterprises [1–6].
1 Formulation of the Problem

Puzzle. Once in a while, in our city, we encounter a struggling enterprise—e.g., a restaurant or a store—whose profit level is clearly low and whose risk level is high. Interestingly, not only does this enterprise exist, it often even manages to get companies and people to invest some money in it. Why would anyone want to invest in a high-risk low-profit enterprise? This seems to contradict all financial basics—and common sense.

What we do in this paper. In this paper, we provide an explanation for this puzzle. Specifically, we show that:

• while, of course, it does not make any sense for an investor to invest all his/her money into this enterprise,
• it makes perfect sense to invest some of the money into it—as long as its risks are independent of the risks of other enterprises.

O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_17
2 Analysis of the Problem and the Resulting Explanation

Optimal portfolio selection: a brief reminder. In general, there are different financial instruments in which we can invest, ranging:

• from low-risk low-profit financial instruments like bonds
• to high-risk high-profit instruments like risky stocks.

A smart investor distributes his/her money between different instruments, so as to maximize the expected profit under the condition that the risk remains tolerable. Risk means that the actual profit may differ from its expected value. In statistics, the difference between the actual value of a random variable and its expected value is characterized by its variance V. Thus, variance is a natural measure of investment risk. From this viewpoint, each investor has a tolerance level V0, so that for the selected portfolio, the variance V should not exceed this level:

V ≤ V0.
(1)
Under this constraint (1), we need to select a portfolio that provides the largest possible expected gain. Each financial instrument i can be characterized by its expected gain μi and its variance Vi. The investor distributes his/her money between these instruments, so that each instrument gets a portion wi ≥ 0 of the overall invested amount, with

w1 + w2 + · · · + wn = 1. (2)

The expected gain μ of a portfolio is, therefore, the sum of the gains of the individual investments, i.e.,

μ = w1 · μ1 + w2 · μ2 + · · · + wn · μn. (3)

In the simplified case when all risks are independent, the overall risk V can be obtained as

V = w1² · V1 + w2² · V2 + · · · + wn² · Vn. (4)
A similar—but slightly more complex—formula describes the overall variance in situations when there is correlation between different risks. In these terms, the problem is to maximize the expected gain (3) under the condition that the variance (4) satisfies the inequality (1). This problem was first formulated in the 1950s by Harry Markowitz, who came up with an explicit solution to this optimization problem—for which he got a Nobel Prize in Economics in 1990; see, e.g., [1–4, 6].
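For concreteness, formulas (2)–(4) can be sketched in a few lines of Python; the two-instrument numbers below (a bond-like and a stock-like instrument) are made up purely for illustration.

```python
def portfolio_stats(weights, gains, variances):
    """Expected gain (3) and variance (4) of a portfolio with independent risks."""
    assert abs(sum(weights) - 1.0) < 1e-9           # constraint (2)
    mu = sum(w * g for w, g in zip(weights, gains))
    V = sum(w ** 2 * v for w, v in zip(weights, variances))
    return mu, V

# Hypothetical instruments: a safe bond and a risky stock.
mu, V = portfolio_stats([0.7, 0.3], [0.02, 0.10], [0.0001, 0.04])
V0 = 0.01                                           # investor's tolerance level
assert V <= V0                                      # inequality (1) is satisfied
```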
Let us use this portfolio optimization problem to solve our puzzle. In portfolio terms, what we want to show is that in the optimal portfolio, the portion wp allocated to the puzzling low-profit high-risk enterprise is positive. To show this, we will show that:

• if the portfolio does not include this weird investment,
• then we can get a better portfolio by investing some money into this investment.

This would mean that a portfolio that excludes the given enterprise cannot be optimal, i.e., that the optimal portfolio must include the given enterprise. Indeed, let us assume that we have an optimal portfolio that does not include the given puzzling enterprise p. Let μp > 0 be the given enterprise's expected gain and let Vp be this enterprise's variance. An optimal portfolio usually includes an almost-sure investment a for which the gain μa is relatively small—in particular, smaller than μp—but the risk Va is also very small (and reasonably independent of all other risks). Let wa be the portion of the overall investment that goes into this instrument. Let us show that for a sufficiently small ε > 0, if we re-allocate a portion ε from the investment a to our investment p, we will get a better portfolio—i.e.:

• the expected gain will increase, while
• the variance will decrease.

Indeed, after the reallocation, the original term wa · μa in the sum (3) will be replaced by the sum

(wa − ε) · μa + ε · μp.
(5)
The difference between the new sum and the old term—and, thus, between the new and the old values of the expected gain—is equal to (wa − ε) · μa + ε · μ p − wa · μa = ε · (μ p − μa ).
(6)
Since μp > μa, this difference is always positive. So, for all ε > 0, reallocating the investment indeed increases the expected gain.

Let us now analyze what happens to the portfolio's variance under such a reallocation. In this case, the original term wa² · Va in the sum (4) is replaced by the sum

(wa − ε)² · Va + ε² · Vp. (7)

The difference between the new sum and the old term—and, thus, between the new and the old values of the variance—is equal to

(wa − ε)² · Va + ε² · Vp − wa² · Va = −2ε · wa · Va + ε² · Va + ε² · Vp. (8)

For small ε, the dominant term in this expression is the linear term −2ε · wa · Va, which is negative. Thus, for sufficiently small ε, the difference is negative, and reallocating the investment indeed decreases the portfolio's variance.
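The effect of such a re-allocation can be illustrated numerically; the gains and variances below are hypothetical, chosen only so that μp > μa while Vp is much larger than Va.

```python
# Hypothetical numbers: a safe instrument a and the puzzling enterprise p.
mu_a, V_a = 0.02, 0.0001   # low gain, very low risk
mu_p, V_p = 0.03, 0.09     # somewhat higher gain, much higher risk
w_a = 0.5                  # portion currently allocated to the safe instrument
eps = 1e-4                 # small portion re-allocated from a to p

# Change in expected gain, formula (6): positive whenever mu_p > mu_a.
d_gain = eps * (mu_p - mu_a)

# Change in variance: the new terms of the sum (4) minus the old term.
d_var = ((w_a - eps) ** 2 * V_a + eps ** 2 * V_p) - w_a ** 2 * V_a

assert d_gain > 0   # the expected gain increases...
assert d_var < 0    # ...while the variance decreases, for small enough eps
```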
3 Conclusion

We have shown that in the optimal portfolio, some portion of the investment should be allocated to the given enterprise. In other words, contrary to our intuition, investing some money in a low-profit high-risk enterprise makes perfect sense: it is even required if we want to have an optimal portfolio.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. M.J. Best, Portfolio Optimization (Chapman and Hall/CRC, Boca Raton, Florida, 2010)
2. F.J. Fabozzi, P.N. Kolm, D. Pachamanova, S.M. Focardi, Robust Portfolio Optimization and Management (Wiley, Hoboken, New Jersey, 2007)
3. R. Korn, E. Korn, Option Pricing and Portfolio Optimization: Modern Methods of Financial Mathematics (American Mathematical Society, Providence, Rhode Island, 2001)
4. H.M. Markowitz, Portfolio selection. J. Financ. 7(1), 77–91 (1952)
5. R.O. Michaud, R.O. Michaud, Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation (Oxford University Press, 2008)
6. J.-L. Prigent, Portfolio Optimization and Performance Analysis (Chapman and Hall/CRC, Boca Raton, Florida, 2007)
Which Interval-Valued Alternatives Are Possibly Optimal if We Use Hurwicz Criterion Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, and Vladik Kreinovich
Abstract In many practical situations, for each alternative i, we do not know the corresponding gain xi; we only know the interval [x̲i, x̄i] of possible gains. In such situations, a reasonable way to select an alternative is to choose some value α from the interval [0, 1] and select the alternative i for which the Hurwicz combination α · x̄i + (1 − α) · x̲i is the largest possible. In situations when we do not know the user's α, a reasonable idea is to select all alternatives that are optimal for some α. In this paper, we describe a feasible algorithm for such a selection.
1 Formulation of the Problem

Need to make decisions under interval uncertainty. In the ideal case, when we know the exact expected gain of different investments, a natural idea is to select the investment for which the expected gain is the largest. In practice, however, we rarely know the exact values of the gains. At best, for each alternative investment i, we know the interval [x̲i, x̄i] of possible values of the gain. We therefore need to make a decision based on this incomplete information.
M. Tuyako Mizukoshi Federal University of Goias, Goiania, Brazil e-mail: [email protected] W. Lodwick Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA e-mail: [email protected] M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_18
Hurwicz approach. To make a decision, we need to announce the price that we are willing to pay for each interval-valued alternative. For this, we need to select a function v([x̲, x̄]) that assigns a numerical value to each interval. This function must satisfy two natural conditions.

The first condition is that this price must be somewhere between the lower bound x̲ and the upper bound x̄: it makes no sense to pay more than we are expecting to gain.

The second condition is related to the fact that some investments consist of two independent parts. We can view these two parts, with interval gains [x̲1, x̄1] and [x̲2, x̄2], separately. This way, we pay the price v([x̲1, x̄1]) for the first part and the price v([x̲2, x̄2]) for the second part, for a total of v([x̲1, x̄1]) + v([x̲2, x̄2]). For example, if in the first part of the investment we gain between 1 and 2 dollars, and in the second part of the investment we also gain between 1 and 2 dollars, then overall, in both parts, our gain is between 2 and 4 dollars. Alternatively, we can view the two parts as a single investment, with the interval of possible gains

{x1 + x2 : x1 ∈ [x̲1, x̄1] and x2 ∈ [x̲2, x̄2]} = [x̲1 + x̲2, x̄1 + x̄2].

In this case, the price will be equal to v([x̲1 + x̲2, x̄1 + x̄2]). These are two ways to describe the same overall investment, so it makes sense to require that the resulting overall price should not depend on how we describe it—as a single investment or as an investment consisting of two parts. So, we should have

v([x̲1 + x̲2, x̄1 + x̄2]) = v([x̲1, x̄1]) + v([x̲2, x̄2]).

It turns out that every function v([x̲, x̄]) that satisfies these two conditions has the form

v([x̲, x̄]) = α · x̄ + (1 − α) · x̲, (1)

for some value α ∈ [0, 1]; see, e.g., [1–3]. So, to make a decision under interval uncertainty, we should select some α ∈ [0, 1] and then select an alternative for which the corresponding α-combination is the largest.
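A minimal Python sketch of this criterion, with hypothetical intervals:

```python
def hurwicz(lo, hi, alpha):
    """Hurwicz value of an interval gain [lo, hi], formula (1):
    alpha weighs the optimistic (upper) bound."""
    return alpha * hi + (1 - alpha) * lo

# An optimist (alpha near 1) prefers the wide interval [0, 10];
# a pessimist (alpha near 0) prefers the safer interval [2, 7].
assert hurwicz(0, 10, 0.9) > hurwicz(2, 7, 0.9)
assert hurwicz(0, 10, 0.1) < hurwicz(2, 7, 0.1)
```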
This recommendation was first proposed by the economist Leonid Hurwicz—who later received a Nobel Prize in Economics—and it is thus called the Hurwicz criterion.

Remaining problem. If we know the customer's α, we can easily find the optimal alternative. However, in some cases, we need to make a recommendation without knowing α. In this case, the only thing we can do is come up with a list of possibly optimal alternatives, i.e., alternatives which are optimal for some α ∈ [0, 1]. In this paper, we describe an algorithm for generating such a list. To describe this algorithm, we first, in Sect. 2, explain the current idea for such a selection, and provide an example showing that this idea is not sufficient—sometimes it returns
alternatives which are not possibly optimal. Then, in Sect. 3, we describe an algorithm that returns all possibly optimal alternatives – and only them.
2 Current Idea and Its Limitations

Idea. If for two alternatives $[\underline{x}_i, \overline{x}_i]$ and $[\underline{x}_j, \overline{x}_j]$ we have $\underline{x}_i < \underline{x}_j$ and $\overline{x}_i < \overline{x}_j$, then, clearly, for every $\alpha \in [0, 1]$, we will have $\alpha \cdot \overline{x}_i + (1 - \alpha) \cdot \underline{x}_i < \alpha \cdot \overline{x}_j + (1 - \alpha) \cdot \underline{x}_j$. Thus, the alternative $i$ will never be optimal. So, a reasonable idea is to dismiss all alternatives $i$ which are dominated by some other alternative $j$, i.e., for which $\underline{x}_i < \underline{x}_j$ and $\overline{x}_i < \overline{x}_j$. After this dismissal, the remaining list will contain all possibly optimal alternatives. But are all alternatives in the remaining list possibly optimal? Sometimes they are, but, as we will show, sometimes they are not. Let us describe an example in which one of the remaining alternatives is not possibly optimal.

An example of a remaining alternative which is not possibly optimal. Let us assume that we have three alternatives, with interval gains $[\underline{x}_1, \overline{x}_1] = [0, 10]$, $[\underline{x}_2, \overline{x}_2] = [1, 8]$, and $[\underline{x}_3, \overline{x}_3] = [2, 7]$. Let us prove, by contradiction, that the alternative $[\underline{x}_2, \overline{x}_2] = [1, 8]$ cannot be optimal. Indeed, suppose that this alternative is optimal for some $\alpha$. This means, in particular, that for this $\alpha$, this alternative is better than (or of the same quality as) the alternative $[0, 10]$. This means that $\alpha \cdot 8 + (1 - \alpha) \cdot 1 \ge \alpha \cdot 10 + (1 - \alpha) \cdot 0$, i.e., $8\alpha + 1 - \alpha \ge 10\alpha$, so $3\alpha \le 1$ and $\alpha \le 1/3$. Similarly, the fact that this alternative is better than $[2, 7]$ means that $\alpha \cdot 8 + (1 - \alpha) \cdot 1 \ge \alpha \cdot 7 + (1 - \alpha) \cdot 2$, i.e., $8\alpha + 1 - \alpha \ge 7\alpha + 2 - 2\alpha$, so $2\alpha \ge 1$ and $\alpha \ge 1/2$. A number cannot at the same time be larger than or equal to 1/2 and smaller than or equal to 1/3, so our assumption that the alternative $[\underline{x}_2, \overline{x}_2] = [1, 8]$ can be optimal leads to a contradiction. Thus, this alternative cannot be optimal.
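The argument above can also be checked numerically: the following Python sketch sweeps $\alpha$ over a fine grid and records which of the three alternatives ever attains the maximal $\alpha$-combination (ties included):

```python
# Numerical check of the example: for each alpha in [0, 1], which of the
# alternatives [0, 10], [1, 8], [2, 7] maximizes alpha*hi + (1 - alpha)*lo?
intervals = [(0, 10), (1, 8), (2, 7)]

ever_optimal = set()
for step in range(1001):
    alpha = step / 1000
    values = [alpha * hi + (1 - alpha) * lo for lo, hi in intervals]
    best = max(values)
    ever_optimal.update(i for i, v in enumerate(values) if abs(v - best) < 1e-12)

print(sorted(ever_optimal))  # → [0, 2]: the alternative [1, 8] never wins
```

For $\alpha \le 0.4$, the interval $[2, 7]$ wins, and for $\alpha \ge 0.4$, the interval $[0, 10]$ wins—matching the $\alpha \le 1/3$ vs. $\alpha \ge 1/2$ contradiction derived above.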
M. Tuyako Mizukoshi et al.
3 Algorithm for Selecting Possibly Optimal Alternatives

General analysis of the problem. Let us first weed out all dominated alternatives. We need to decide, for each of the remaining alternatives $i$, whether this alternative is possibly optimal, i.e., whether it is optimal for some $\alpha \in [0, 1]$. This optimality happens if for all $j \ne i$, we have $\alpha \cdot \overline{x}_i + (1 - \alpha) \cdot \underline{x}_i \ge \alpha \cdot \overline{x}_j + (1 - \alpha) \cdot \underline{x}_j$, i.e., equivalently, that $\underline{x}_i + \alpha \cdot w_i \ge \underline{x}_j + \alpha \cdot w_j$, where, for each interval $[\underline{x}_k, \overline{x}_k]$, by $w_k$ we denote its width $w_k \stackrel{\text{def}}{=} \overline{x}_k - \underline{x}_k$. This inequality can be equivalently reformulated as

$$\underline{x}_i - \underline{x}_j \ge \alpha \cdot (w_j - w_i). \quad (2)$$
Here, we have three options: $w_j - w_i = 0$, $w_j - w_i > 0$, and $w_j - w_i < 0$. Let us consider them one by one.

First case. Let us first consider the case when $w_j - w_i = 0$, i.e., when $w_i = w_j$. In this case, the right-hand side of the inequality (2) is equal to 0, so this inequality takes the form

$$\underline{x}_i - \underline{x}_j \ge 0. \quad (2a)$$
Let us prove, by contradiction, that when $w_i = w_j$, the inequality (2a) is always satisfied after we filter out dominated alternatives. Indeed, suppose that the inequality (2a) is not satisfied, i.e., that $\underline{x}_i - \underline{x}_j < 0$. In this case, we would have $\underline{x}_i < \underline{x}_j$ and thus, $\overline{x}_i = \underline{x}_i + w_i < \underline{x}_j + w_i = \underline{x}_j + w_j = \overline{x}_j$. So, the $i$th alternative is dominated by the $j$th alternative—but we assumed that we have already dismissed all dominated alternatives. So, the assumption that $\underline{x}_i - \underline{x}_j < 0$ leads to a contradiction. This contradiction shows that when $w_i = w_j$, the inequality (2a) is always satisfied for the remaining (after-filtering) intervals. In other words, in situations when $w_i = w_j$, the inequality (2) is satisfied for all $\alpha$.

Second case. Let us now consider the case when $w_j - w_i > 0$, i.e., $w_j > w_i$. In this case, dividing both sides of the inequality (2) by the positive number $w_j - w_i$, we get an equivalent inequality

$$\alpha \le \frac{\underline{x}_i - \underline{x}_j}{w_j - w_i}.$$
This inequality must be satisfied for all $j$ for which $w_j > w_i$, so we must have

$$\alpha \le \min_{j:\, w_j > w_i} \frac{\underline{x}_i - \underline{x}_j}{w_j - w_i}.$$

We must also have $\alpha \le 1$, so we must have

$$\alpha \le \min\left(1,\ \min_{j:\, w_j > w_i} \frac{\underline{x}_i - \underline{x}_j}{w_j - w_i}\right). \quad (3)$$
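Anticipating the symmetric lower bound obtained in the third case below, the resulting test can be sketched in Python: an alternative is possibly optimal if and only if the admissible range of $\alpha$, clipped to $[0, 1]$, is nonempty (the function name is ours):

```python
def possibly_optimal(intervals):
    """Indices of the alternatives that are optimal for at least one
    alpha in [0, 1] under the Hurwicz criterion."""
    # Step 1: drop dominated alternatives.
    keep = [i for i, (lo, hi) in enumerate(intervals)
            if not any(lo < lo2 and hi < hi2
                       for j, (lo2, hi2) in enumerate(intervals) if j != i)]
    result = []
    for i in keep:
        lo_i, hi_i = intervals[i]
        w_i = hi_i - lo_i
        a_min, a_max = 0.0, 1.0          # admissible range for alpha
        for j in keep:
            if j == i:
                continue
            lo_j, hi_j = intervals[j]
            w_j = hi_j - lo_j
            if w_j > w_i:                # upper bound on alpha (second case)
                a_max = min(a_max, (lo_i - lo_j) / (w_j - w_i))
            elif w_j < w_i:              # lower bound on alpha (third case)
                a_min = max(a_min, (lo_i - lo_j) / (w_j - w_i))
            # w_j == w_i: the inequality holds for every alpha (first case)
        if a_min <= a_max:
            result.append(i)
    return result

print(possibly_optimal([(0, 10), (1, 8), (2, 7)]))  # → [0, 2]
```

On the running example, the alternative $[1, 8]$ is correctly excluded even though it survives the dominance filter.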
Third case. When $w_j - w_i < 0$, i.e., when $w_j < w_i$, then, dividing both sides of the inequality (2) by the negative number $w_j - w_i$, we get the opposite inequality

$$\alpha \ge \frac{\underline{x}_i - \underline{x}_j}{w_j - w_i}.$$

This inequality must be satisfied for all $j$ for which $w_j < w_i$; we must also have $\alpha \ge 0$, so we must have

$$\alpha \ge \max\left(0,\ \max_{j:\, w_j < w_i} \frac{\underline{x}_i - \underline{x}_j}{w_j - w_i}\right). \quad (4)$$

Thus, the $i$th alternative is possibly optimal if and only if the lower bound (4) does not exceed the upper bound (3).

In the Absence of Information, the only Reasonable Negotiation …

M. Svítek et al.

To define utilities, decision theory selects two extreme alternatives: a very bad alternative $A_0$ and a very good alternative $A_1$, so that every alternative $A$ of interest is better than $A_0$ and worse than $A_1$. For each probability $p$, let $L(p)$ denote the lottery in which we get $A_1$ with probability $p$ and $A_0$ with probability $1 - p$; then $L(1) = A_1 > A > L(0) = A_0$. One can prove that there exists a threshold value $u(A)$:

• for which $L(p) > A$ for $p > u(A)$ and
• for which $L(p) < A$ for $p < u(A)$.

This threshold value is known as the utility of the alternative $A$. The numerical value $u(A)$ depends on our choice of the two extreme alternatives $A_i$. It turns out that if we select a different pair of extreme alternatives, then the new values of the utility $u'(A)$ can be obtained from the previous values $u(A)$ by a linear transformation $u'(A) = a + b \cdot u(A)$, for some constants $a$ and $b > 0$.

Comment. When the outcomes are purely financial, in the first approximation, we can view the money amounts as utility values. However, it is important to take into account that many negotiations involve non-financial issues as well, and that even for financial issues, utility is not always proportional to money.

Cooperative group decision making: case of full information. Cooperative decision making is when we have a status quo situation $S$, and several ($n$) agents are looking for alternatives that can make the outcomes better for all of them. In many real-life situations, everyone knows everyone's utility $u_i(A)$ for each alternative $A$. Since utility values are defined only modulo a linear transformation, it makes sense to consider decision strategies for which:

• the resulting solution would not change if we apply a linear re-scaling $u_i(A) \to a_i + b_i \cdot u_i(A)$ to one of the utilities, and
• the resulting solution would not change if we simply rename the agents.

It turns out that the only such not-changing (= invariant) scheme is when the agents maximize the product of their utility gains, i.e., the value $(u_1(A) - u_1(S)) \cdot \ldots \cdot (u_n(A) - u_n(S))$; see, e.g., [5–7]. This scheme was first proposed by the Nobelist John Nash and is thus known as Nash's bargaining solution.

Sometimes, we do not have the full knowledge.
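For the full-information case just described, Nash's bargaining solution over a finite list of alternatives can be sketched by brute force (the function name and the sample utility tuples are ours):

```python
import math

def nash_solution(alternatives, status_quo):
    """Among the alternatives in which every agent strictly gains over the
    status quo S, pick the one maximizing the product of utility gains
    (u_1(A) - u_1(S)) * ... * (u_n(A) - u_n(S))."""
    feasible = [A for A in alternatives
                if all(u > s for u, s in zip(A, status_quo))]
    return max(feasible,
               key=lambda A: math.prod(u - s for u, s in zip(A, status_quo)))

# With three hypothetical alternatives and status quo (0, 0), the product
# criterion picks the balanced outcome:
print(nash_solution([(1, 3), (2, 2), (3, 1)], (0, 0)))  # → (2, 2)
```

Re-scaling one agent's utilities by $u \to a + b \cdot u$ with $b > 0$ multiplies that agent's gain by $b$ in every product, so the maximizer does not change—the invariance property used above.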
In some situations—e.g., when two countries have a territorial dispute—we do have full information about what each side wants. However, in other situations—e.g., in negotiations between companies—agents are reluctant to disclose their utilities: while in this case they are collaborating, in the future they may be competing, and any information about each other provides a competitive advantage.
In the Absence of Information, the only Reasonable Negotiation …
In many cases, the only two things we know for each agent $i$ are:

• the agent's utility $u_i(S)$ corresponding to the status quo—which is usually known because of the reporting requirements, and
• the agent's original offer $u_i^{(0)}$—which usually means the best outcome that the agent can reach: $u_i^{(0)} = \overline{u}_i \stackrel{\text{def}}{=} \max_A u_i(A)$.

In other words, for each agent, we know the largest value $g_i^{(0)} \stackrel{\text{def}}{=} u_i^{(0)} - u_i(S)$ that this agent can gain. Based on this information, how can we make a joint decision?
General idea. In practice, it is never possible for each agent to get the largest possible gain; there is usually a need for a trade-off between agents. Since agents cannot all get their maximum gain, a natural idea is to somewhat lower their requests, from the original values $g_i^{(0)}$ to smaller values $g_i^{(1)} = f_i(g_i^{(0)})$, for some function $f_i(g)$ for which $f_i(g) < g$. Then:

• if it is possible to satisfy all reduced requests, then this is the desired joint solution;
• if it is not possible to satisfy all reduced requests, then the request amounts should be reduced again, to values $g_i^{(2)} = f_i(g_i^{(1)})$, etc.

The procedure should be fair, meaning that the same reducing function $f(g)$ should be applied for all the agents: $f_i(g) = f(g)$ for all $i$.

Question. The main question is: what reducing function $f(g)$ should we use?
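The procedure above can be sketched for a generic reducing function; here `can_satisfy` is a hypothetical placeholder for whatever check determines that all reduced requests can be met simultaneously:

```python
def negotiate(initial_requests, f, can_satisfy, max_rounds=1000):
    """Shrink all requests by the same reducing function f (with f(g) < g)
    until the reduced requests can all be satisfied."""
    g = list(initial_requests)        # the values g_i^(0)
    for _ in range(max_rounds):
        if can_satisfy(g):
            return g
        g = [f(x) for x in g]         # g_i^(k+1) = f(g_i^(k))
    raise RuntimeError("no agreement reached within max_rounds")
```

For instance, with $f(g) = 0.9 \cdot g$ and the illustrative feasibility condition that the requests sum to at most 10, the initial requests [6, 6, 6] settle after six rounds of reduction.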
2 Which Reducing Function Should We Use

Natural requirement: reminder. As we have mentioned in our description of Nash's bargaining solution, since utilities are only known modulo linear transformations, a natural requirement is that the resulting decision should not change if we re-scale an agent's utilities, from the original values $u_i(A)$ to new values $a_i + b_i \cdot u_i(A)$. Let us apply the same invariance criterion to our problem.

Which reducing functions are invariant under re-scaling? Under the linear transformation $u_i(A) \to a_i + b_i \cdot u_i(A)$, we get $u_i(S) \to a_i + b_i \cdot u_i(S)$ and $u_i^{(0)} \to a_i + b_i \cdot u_i^{(0)}$. Thus, the difference $g_i = u_i^{(0)} - u_i(S)$ gets transformed into $g_i \to (a_i + b_i \cdot u_i^{(0)}) - (a_i + b_i \cdot u_i(S)) = b_i \cdot (u_i^{(0)} - u_i(S)) = b_i \cdot g_i$. So, using a new scale means multiplying all gain values $g_i$ by $b_i$. (Vice versa, to transform the new-scale gain value into the original scale, we need to divide this new-scale value by $b_i$.)
M. Svítek et al.
Let us see when the reducing function is invariant under such re-scaling.

• In the original scale, we transform the gain $g_i^{(k)}$ into the reduced gain $g_i^{(k+1)} = f(g_i^{(k)})$. In the new scale, this reduced gain takes the form $b_i \cdot f(g_i^{(k)})$.
• Let us see what happens if we apply the reducing function to the values described in the new scale. The original gain $g_i^{(k)}$ in the new scale has the form $b_i \cdot g_i^{(k)}$. When we apply the reducing function to this value, we get the new-scale value $f(b_i \cdot g_i^{(k)})$.

Invariance means that both new-scale gains should be equal, i.e., that we should have $f(b_i \cdot g_i^{(k)}) = b_i \cdot f(g_i^{(k)})$. This equality should hold for all possible values of $b_i$ and $g_i^{(k)}$. In particular, for any number $g > 0$, for $b_i = g$ and $g_i^{(k)} = 1$, we get $f(g) = c \cdot g$ for the constant $c \stackrel{\text{def}}{=} f(1)$. Since we must have $f(g) < g$, we should have $c < 1$. Thus, we arrive at the following conclusion.

Resulting recommendation. In the above absence-of-information case, the only fair scale-invariant scheme is for all agents to select a certain percentage $c$ of their original request—e.g., 70% or 60%—and to decrease this percentage $c$ until we find a solution in which each agent can get this percentage of his/her original request.

Comments. Once this value $c$ is found, the resulting gain $g_i = u_i - u_i(S)$ will be equal to $g_i = c \cdot (\overline{u}_i - u_i(S))$, and thus the utility of the $i$th agent will be equal to $u_i = u_i(S) + c \cdot (\overline{u}_i - u_i(S)) = c \cdot \overline{u}_i + (1 - c) \cdot u_i(S)$. In this problem, the smallest utility $\underline{u}_i$ that the agent can end up with is the status-quo value $u_i(S)$. In these terms, the resulting utility has the form $u_i = c \cdot \overline{u}_i + (1 - c) \cdot \underline{u}_i$. Interestingly, this expression coincides with another expression from decision theory: the Hurwicz formula that describes the equivalent utility of a situation in which all we know is that the actual utility will be in the interval $[\underline{u}_i, \overline{u}_i]$; see, e.g., [3–5]. The formulas are similar, but there is an important difference between these two situations:

• in the Hurwicz situation, the coefficient $c$ depends on each user; it describes how optimistic the user is;
• in contrast, in our situation, the coefficient $c$ is determined by the group as a whole; it describes how much the group can achieve.
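As a sketch of the resulting scheme: if, for illustration, feasibility means that the reduced requests must fit into a fixed available total (as in the bankruptcy problem discussed below), then the largest feasible fraction $c$ can be computed directly (the function name and this feasibility model are our assumptions):

```python
def proportional_share(requests, total):
    """Largest c in [0, 1] for which the reduced requests c * g_i fit into
    the available total; every agent then gets the same fraction c of
    his/her original request."""
    c = min(1.0, total / sum(requests))
    return c, [c * g for g in requests]
```

For example, requests of 50, 30, and 20 against an available total of 25 yield $c = 0.25$, i.e., every agent receives 25 cents on the dollar.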
The above solution is also similar to the usual solution to the bankruptcy problem, when everyone gets a certain percentage $c \cdot m_i$ of the amount it is owed—e.g., 20 cents on the dollar—and the coefficient $c$ is the largest for which such a solution is possible.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References
1. P.C. Fishburn, Utility Theory for Decision Making (Wiley, New York, 1969)
2. P.C. Fishburn, Nonlinear Preference and Utility Theory (The Johns Hopkins Press, Baltimore, Maryland, 1988)
3. L. Hurwicz, Optimality Criteria for Decision Making Under Ignorance, Cowles Commission Discussion Paper, Statistics, No. 370 (1951)
4. V. Kreinovich, Decision making under interval uncertainty (and beyond), in Human-Centric Decision-Making Models for Social Sciences, ed. by P. Guo, W. Pedrycz (Springer, 2014), pp. 163–193
5. R.D. Luce, R. Raiffa, Games and Decisions: Introduction and Critical Survey (Dover, New York, 1989)
6. J. Nash, The bargaining problem. Econometrica 18(2), 155–162 (1950)
7. H.P. Nguyen, L. Bokati, V. Kreinovich, New (simplified) derivation of Nash's bargaining solution. J. Adv. Comput. Intell. Intell. Inf. (JACIII) 24(5), 489–592 (2020)
8. H.T. Nguyen, O. Kosheleva, V. Kreinovich, Decision making beyond Arrow's 'impossibility theorem', with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009)
9. H.T. Nguyen, V. Kreinovich, B. Wu, G. Xiang, Computing Statistics under Interval and Fuzzy Uncertainty (Springer, Berlin, Heidelberg, 2012)
10. H. Raiffa, Decision Analysis (McGraw-Hill, Columbus, Ohio, 1997)
Applications to Education
How to Make Quantum Ideas Less Counter-Intuitive: A Simple Analysis of Measurement Uncertainty Can Help Olga Kosheleva and Vladik Kreinovich
Abstract Our intuition about physics is based on macro-scale phenomena, phenomena which are well described by non-quantum physics. As a result, many quantum ideas sound counter-intuitive—and this slows down students’ learning of quantum physics. In this paper, we show that a simple analysis of measurement uncertainty can make many of the quantum ideas much less counter-intuitive and thus, much easier to accept and understand.
1 Formulation of the Problem Many quantum ideas are counter-intuitive. Many ideas of quantum physics are inconsistent with our usual physics intuition; see, e.g., [1, 3]. This counter-intuitive character of quantum ideas is an additional obstacle for students learning about these effects. Problem: how can we make these ideas less counter-intuitive? To enhance the teaching of quantum ideas, it is desirable to come up with ways to make quantum ideas less counter-intuitive. What we do in this paper. In this paper, inspired by the ideas from [2], we propose some ways to make quantum ideas less counter-intuitive. Specifically, in Sect. 2, we overview the main counter-intuitive ideas of quantum physics. In Sect. 3, we explain how a simple analysis of measurement uncertainty can help.
O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_21
2 Main Counter-Intuitive Ideas of Quantum Physics

Quantum—discrete—character of measurement results. In non-quantum physics, most physical quantities can take any real values—or at least any real values within a certain range (e.g., possible values of speed are bounded by the speed of light). The change of values with time is smooth. It may be abrupt—as in phase transitions—but still continuous. In contrast, in quantum physics, for most quantities, only a discrete set of values is possible. Correspondingly, changes are by discontinuous "jumps": at one moment of time, we have one energy value; at the next moment of time, we may measure a different energy value, with no intermediate values.

Non-determinism. We are accustomed to the fact that non-quantum physics is deterministic: if we know the exact current state, we can uniquely predict all future observations and all future measurement results. This does not mean, of course, that we can always predict everything: e.g., it is difficult to predict tomorrow's weather, because it depends on many of today's values that we do not know. However, the more sensors we place to measure today's temperature, wind speed, and humidity at different locations (and at different heights), the more accurate our predictions—in line with the belief that if we knew where every atom was now, we could, theoretically, predict the weather exactly. In contrast, in quantum physics, even if we know the exact current state, we cannot uniquely predict the results of future measurements. (At best, we can predict the probabilities of future measurement results.)

This non-determinism is mostly observed on the micro-level. Quantum effects are sometimes observed on the macro-level—e.g., with lasers—but mostly, they are observed on the micro-level.

There is no non-determinism in subsequent measurements of the same quantity. In quantum physics, we cannot predict the result of a measurement: in the same state, we may get different results.
However, in quantum physics, if after measuring the value of a quantity, we measure the same quantity again, we get the exact same result.

Entanglement. According to relativity theory, all speeds are bounded by the speed of light. As a result, an event happening at a large distance from here cannot immediately affect the result of a local experiment. However, in quantum physics, we can prepare a pair of particles in an "entangled" state. Then, we separate these two particles to a large distance from each other, and perform a measurement on each of these particles. In this arrangement, we cannot predict the result of each measurement, but, surprisingly, if we know the result of one of these measurements, we can make some prediction about the result of the second one as well.
This fact—known as the Einstein–Podolsky–Rosen paradox—has been experimentally observed. It has been shown that, strictly speaking, it does not violate relativity theory: namely, this phenomenon cannot be used to communicate signals faster than the speed of light. However, this phenomenon still sounds very counter-intuitive.
3 A Simple Analysis of Measurement Uncertainty and How It Can Help

A simple analysis of measurement uncertainty. Our information about the values of physical quantities comes from measurements. Measurements are never absolutely accurate—and even if they were accurate, we would not be able to produce all the infinitely many digits needed to represent the exact value; we have to stop at some point. For example, if we use decimal fractions, we cannot even exactly represent the value 1/3 = 0.33…; we have to stop at some point and use an approximate value 1/3 ≈ 0.33…3. In general, there is a smallest possible value h that we can represent, and all other values that we can represent are integer multiples of this value: 0, h, 2h, …, k · h, …, all the way to the largest value L = N · h that we can measure and represent.

This immediately makes discreteness and "jumps" less counter-intuitive. The first trivial consequence of the above analysis is that the discrete character of measurement results—and the abrupt transitions between measurement results—become natural. How are these measurement results related to the actual (unknown) value of the corresponding quantity? In the ideal situation, when the measurement instruments are very accurate, and the only reason for discreteness is the need to have a finitely long representation, the result of the measurement comes from rounding the actual value—just like 0.33…3 is the result of rounding 1/3, and 0.66…67 is the result of rounding 2/3. So:

• the measurement result 0 means that the actual value is somewhere between −0.5h and 0.5h;
• the measurement result h means that the actual value is somewhere between 0.5h and 1.5h; and
• in general, the measurement result k · h means that the actual value is somewhere between (k − 1/2) · h and (k + 1/2) · h.

Non-determinism becomes less counter-intuitive. Let us consider the following simple situation.
We have an inertial body traveling at a constant speed in the same direction. We measure the distance d that it travels in the first second, and, based on this measurement, we want to predict the distance D that this body will travel in 2 s.
From the purely mathematical viewpoint, if we ignore measurement uncertainty, the answer is trivial: D = 2d. What will happen if we take measurement uncertainty into account? Suppose that the measurement of the 1 s distance led to the value k · h. This means that the actual (unknown) distance is located somewhere in the interval [(k − 1/2) · h, (k + 1/2) · h]. In this case, one can easily see that the value D = 2d is located in the interval [(2k − 1) · h, (2k + 1) · h]. If we round different values from this interval, we get three possible values: (2k − 1) · h, 2k · h, and (2k + 1) · h. Thus, even when we know that the original value was k · h, we cannot uniquely predict the result of the next measurement: it can be one of three different values. In other words, we have non-determinism—exactly as in quantum physics.

This non-determinism is mostly observed on the micro-level. On the macro-level, when the values k are much larger than 1 (which is usually denoted by 1 ≪ k), the non-determinism is relatively small: the relative error in predicting the value D̂ = 2k · h when the actual value is D = (2k − 1) · h is equal to (D̂ − D)/D̂ = h/(2k · h) = 1/(2k) ≪ 1. On the other hand, for small values, e.g., for k = 1, the resulting relative error is 1/(2k) = 50%—a very large uncertainty.

There is no non-determinism in subsequent measurements of the same quantity. We cannot predict what the result of measuring D will be, but if we measure it again right away, we will, of course, get the same result as last time—just like in quantum physics.

Entanglement. Let us assume that we have two particles 1 and 2 at the same location. Measurements show that both their electric charges are 0, and that the overall charge of both particles is 0. Then, we separate these particles and place them in an electric field where the force acting on a particle with charge q is equal to 4q. Let us see what will happen in this situation if we take measurement uncertainty into account.
The fact that the measured values of both charges q1 and q2 are 0 means that the actual value of each of these charges is located in the interval [−0.5h, 0.5h]. The fact that the measurement of the overall charge q1 + q2 resulted in the value 0 means that this overall charge is also located in the same interval [−0.5h, 0.5h]. For each particle i, the only thing we can conclude about the actual value of the force Fi = 4qi is that this value is located in the interval [−2h, 2h]. So, in principle, the possible measured values of the force are −2h, −h, 0, h, and 2h. The fact that the overall charge was measured as 0 does not restrict the possible values of the measured force; indeed, for the first particle:
• it could be that q1 = −0.5h and q2 = 0; in this case, the observed value of the force F1 is −2h;
• it could be that q1 = −0.3h and q2 = 0; in this case, the observed value of the force F1 is −h;
• it could be that q1 = 0 and q2 = 0; in this case, the observed value of the force F1 is 0;
• it could be that q1 = 0.3h and q2 = 0; in this case, the observed value of the force F1 is h;
• it could be that q1 = 0.5h and q2 = 0; in this case, the observed value of the force F1 is 2h.

Similarly, all five values are possible for the measured value of the force F2. However, this does not mean that all combinations of measured force values are possible. Indeed, while it is possible that the measured value of F1 is −2h, and it is possible that the measured value of F2 is −2h, it is not possible that both these measured values are equal to −2h. Indeed:

• such a situation would mean that F1 ∈ [−2.5h, −1.5h] and F2 ∈ [−2.5h, −1.5h], thus F1 + F2 ∈ [−5h, −3h], and F1 + F2 ≤ −3h;
• however, we know that F1 + F2 = 4 · (q1 + q2), and since q1 + q2 ∈ [−0.5h, 0.5h], we get F1 + F2 ∈ [−2h, 2h], and −2h ≤ F1 + F2.

The two resulting inequalities cannot both be true, so this case is indeed impossible. Thus, if we measure the force F1 acting on the first particle and get the measurement result −2h, we can immediately predict which values of the force measured at a faraway particle are possible and which are not possible—exactly as in the quantum entanglement situation.

Warning. While, as we have shown, the analysis of measurement uncertainty helps make quantum ideas less counter-intuitive, it is important to remember that the corresponding mathematics of quantum physics is very different from the mathematics of measurement uncertainty. On the qualitative level, yes, there is some similarity, and it may help the students; but on the quantitative level, quantum physics is completely different.

Comment.
In our analysis, we used simple examples of arithmetic operations on uncertain numbers. It may be beneficial to provide a general description of such situations: if we have the measurement results i · h and j · h, then for each arithmetic operation ∗—be it addition, subtraction, multiplication, or division—we can define (i · h) ∗ (j · h) as the set of all roundings of all the values a ∗ b, where a ∈ [(i − 1/2) · h, (i + 1/2) · h] and b ∈ [(j − 1/2) · h, (j + 1/2) · h].
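This general description can be turned into a small computational sketch; the grid sampling of the two uncertainty intervals and the function name are ours:

```python
def measured_op(i, j, op, h=1.0, steps=200):
    """Set of possible measurement results (as multiples of h) of op(a, b),
    where a is any value that rounds to i*h and b is any value that rounds
    to j*h; the two uncertainty intervals are sampled on a fine grid."""
    results = set()
    for p in range(steps + 1):
        a = (i - 0.5 + p / steps) * h
        for q in range(steps + 1):
            b = (j - 0.5 + q / steps) * h
            results.add(round(op(a, b) / h))
    return sorted(results)

print(measured_op(0, 0, lambda a, b: a + b))  # → [-1, 0, 1]
```

For example, for addition with i = j = 0, the possible measured sums are −h, 0, and h; for i = 3, j = 0, they are 2h, 3h, and 4h—the same three-outcome non-determinism as in the doubling example above.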
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are grateful to Louis Lopez for valuable discussions.
References
1. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison-Wesley, Boston, Massachusetts, 2005)
2. L. Lopez, The Meaning of the Universe. http://louislopez.com/
3. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
Physical Meaning Often Leads to Natural Derivations in Elementary Mathematics: On the Examples of Solving Quadratic and Cubic Equations Christian Servin, Olga Kosheleva, and Vladik Kreinovich
Abstract The usual derivation of many formulas of elementary mathematics—such as the formulas for solving quadratic equations—often leaves an unfortunate impression that mathematics is a collection of unrelated unnatural tricks. In this paper, on the example of formulas for solving quadratic and cubic equations, we show that these derivations can be made much more natural if we take physical meaning into account.
1 Formulation of the Problem

In elementary mathematics, many derivations feel unnatural. Derivations of many formulas of elementary mathematics are usually presented as useful tricks. These tricks result from insights of ancient geniuses. The formula for solving quadratic equations is a good example. These derivations do not naturally follow from the formulation of the problem.

Resulting problem. The perceived un-naturalness of these derivations leads to two problems.

• Short-term problem: the perceived un-naturalness makes these derivations—and thus, the resulting formulas—difficult to remember.
C. Servin Information Technology Systems Department, El Paso Community College (EPCC), 919 Hunter Dr., El Paso, TX 79915-1908, USA e-mail: [email protected] O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_22
• Long-term problem: this un-naturalness adds to the unfortunate students' impression that mathematics is an un-natural collection of unrelated tricks.

From this viewpoint, it is desirable to come up with natural—or at least more natural—derivations of such formulas.

What we do in this paper. In this paper, we show that taking into account physical meaning can help. To be more precise:

• We are not proposing radically new derivations.
• We just show that physical meaning makes some of the available derivations natural.
2 Solving Quadratic Equations

Simplest case. It is straightforward to solve simple quadratic equations of the type a · x² + c = 0. Indeed:

• First, similarly to linear equations, we subtract c from both sides and divide both sides by a.
• This way, we get x² = −c/a.
• Then we extract the square root, and get x = ±√(−c/a).

Need for a general case. But what if we have a generic equation a · x² + b · x + c = 0? This happens, e.g., when:

• we know the area s = x · y and the perimeter p = 2x + 2y of a rectangular region, and
• we want to find its sizes x and y.

In this case, we can plug in y = p/2 − x into s = x · y and get a quadratic equation

x · (p/2 − x) = s.

By opening the parentheses and moving all the terms to the right-hand side, we get

x² − (p/2) · x + s = 0.
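This rectangle example can be checked numerically, by solving x² − (p/2) · x + s = 0 with the standard quadratic formula (the function name is ours):

```python
import math

def rectangle_sides(s, p):
    """Sides x >= y of a rectangle with area s and perimeter p, obtained
    by solving x**2 - (p/2)*x + s = 0."""
    b = p / 2                        # x + y
    root = math.sqrt(b * b - 4 * s)  # discriminant of x**2 - b*x + s = 0
    return ((b + root) / 2, (b - root) / 2)

print(rectangle_sides(12, 14))  # → (4.0, 3.0): a 4-by-3 rectangle
```

E.g., area 12 and perimeter 14 give sides 4 and 3.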
Physical meaning. In real-life applications, x is often a numerical value of some physical quantity. Numerical values depend:
• on the choice of the measuring unit and
• on the choice of a starting point.

So, a natural idea is to select a new scale in which the equation would get a simpler form.

What happens if we change a measuring unit? If we replace a measuring unit by a new one which is λ times larger, we get a new value x′ for which x = λ · x′. Substituting this expression instead of x into the equation, we get a new equation a · (λ · x′)² + b · λ · x′ + c = 0. However, this does not help us solve the equation.

What happens if we change a starting point? Selecting a new starting point changes the original numerical value x to the new value x′ for which x = x′ + x₀. Substituting this expression for x into the given quadratic equation, we get

a · (x′ + x₀)² + b · (x′ + x₀) + c = a · (x′)² + (2a · x₀ + b) · x′ + (a · x₀² + b · x₀ + c) = 0.

For an appropriate x₀—namely, for x₀ = −b/(2a)—we can make the coefficient at x′ equal to 0. Thus, we get a simple quadratic equation—which we know how to solve.
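The shift-based derivation translates directly into code (a sketch for the case of real roots; the function name is ours):

```python
import math

def solve_quadratic(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, found by the shift x = xp + x0
    with x0 = -b/(2a), which removes the linear term."""
    x0 = -b / (2 * a)
    c_shifted = a * x0 * x0 + b * x0 + c  # constant term after the shift
    xp = math.sqrt(-c_shifted / a)        # solve a*xp**2 + c_shifted = 0
    return (x0 + xp, x0 - xp)

print(solve_quadratic(1, -7, 12))  # → (4.0, 3.0)
```

The two returned roots are x₀ ± x′, i.e., the familiar (−b ± √(b² − 4ac))/(2a) in disguise.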
3 Cubic Equations

General description. A general cubic equation has the form a · x³ + b · x² + c · x + d = 0.

The same idea helps, but does not yet lead to a solution. We can use the same idea as for the quadratic equation—namely, change from x to x = x′ + x₀—and get a simplified equation a · (x′)³ + c′ · x′ + d′ = 0. In contrast to the quadratic case, however, the resulting equation is still difficult to solve. We still need to reduce it to an easy-to-solve case with no linear term.

Additional physics-related idea. To simplify the equation, let us take into account another natural physics-related idea: that often, a quantity is the sum of several ones. For example, the mass of a complex system is the sum of the masses of its components.
C. Servin et al.
This idea works. Let us consider the simplest case x′ = u + v. Substituting this expression into the above equation, we get

a · (u + v)³ + c′ · (u + v) + d′ = a · (u³ + v³) + 3a · u · v · (u + v) + c′ · (u + v) + d′ =

a · (u³ + v³) + (3a · u · v + c′) · (u + v) + d′ = 0.

When 3a · u · v = −c′, we get the desired reduction, so u³ + v³ = −d′/a. So, we need to select u and v for which u³ + v³ = −d′/a and u · v = −c′/(3a). The second equality leads to u³ · v³ = (−c′/(3a))³. We know the sum and the product of u³ and v³, so, as we have mentioned earlier, we can find both by solving the corresponding quadratic equation. Thus, we can find u, v, and finally x′ = u + v.

Acknowledgements This work was supported in part by the National Science Foundation grants:
• 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and
• HRD-1834620 and HRD-2034030 (CAHSI Includes).
It was also supported:
• by the AT&T Fellowship in Information Technology, and
• by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478.
Towards Better Ways to Compute the Overall Grade for a Class Christian Servin, Olga Kosheleva, and Vladik Kreinovich
Abstract The traditional way to compute the overall grade for a class is to use the weighted sum of the grades for all the assignments and exams, including the final exam. In terms of encouraging students to study hard throughout the semester, this grading scheme is better than an alternative scheme, in which all that matters is the grade on the final exam: in contrast to this alternative scheme, in the weighted-sum approach, students are penalized if they did not do well in the beginning of the semester. In practice, however, instructors sometimes deviate from the weighted-sum scheme: indeed, if the weighted sum is below the passing threshold, but a student shows good knowledge on the comprehensive final exam, it makes no sense to fail the student and make him/her waste time re-learning what this student has already learned. So, in this case, instructors usually raise the weighted-sum grade to the passing level and pass the student. This sounds reasonable, but this idea has a limitation similar to the limitation of the alternative scheme: namely, it does not encourage those students who were initially low-performing to do as well as possible on the final exam. Indeed, within this idea, a small increase in the student's grade on the final exam will not change the overall grade for the class. In this paper, we provide a natural idea of how we can overcome this limitation.
C. Servin Information Technology Systems Department, El Paso Community College (EPCC), 919 Hunter Dr., El Paso, TX 79915-1908, USA e-mail: [email protected] O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_23
1 How the Overall Grade Is Computed: The Usual Description

How the overall grade is computed now. Usually in the US, an overall grade is computed as a weighted linear combination of grades for all the class assignments, tests, quizzes, a comprehensive final exam, etc. The number of points one can get for each assignment is limited. As a result, the overall grade is limited by some number M—usually, 100—corresponding to the situation when the student gets a perfect grade for each assignment. There is usually a passing threshold P, so that:

• if the student's overall grade G for the class is P or larger, the student passes the class;
• however, if the student's overall grade G for the class is smaller than P, the student fails the class; in this case, the student usually has a chance to re-take this class next semester.

Usually, P = 70, which means that C is the passing grade; however, for some classes, the passing grade is P = 60 (equivalent to D).

Alternative grading systems and their limitations. In Russia (where two of us are from) and in many other countries, the overall grade for the class is just the grade E for the final exam. The student still needs to maintain some minimum number of points on all the previous assignments, since without this minimum he/she will not be allowed to take the final exam. However, even if the student did poorly on all the previous assignments, when this student's grade on the final exam is perfect, this perfect grade will be the overall student's grade for this class. The main limitation of this scheme is that, while ideally, a student should study hard the whole semester, in this scheme, many students study not so hard during the semester and then cram the material in the last few weeks—not the best arrangement, often leading to imperfect knowledge. In contrast, the US scheme penalizes students who do not study hard in the beginning, and thus motivates them to study hard during the whole semester.

What we do in this paper.
In this paper, we show that the actual grading by US instructors is somewhat different from the usually assumed weighted-sum scheme. Specifically, in Sect. 2, we explicitly describe how the actual grading is done. In Sect. 3, we show that, in some cases, this actual approach faces limitations similar to the limitations of the alternative grading scheme. Finally, in Sect. 4, we explain a natural way to modify the actual grading scheme that will allow us to overcome these limitations. Comment. The results of this paper were first announced in [1].
2 How the Overall Grade Is Actually Computed

The problem with the usual weighted-sum approach. The above weighted-combination scheme is what most instructors place in their syllabi and claim to follow. However, in reality, what they do is more complicated, and it is more complicated for the following very simple reason. If the final exam results show that the student has mastered the material but the weighted average leads to failing, do we really want to fail this student? If this student has mastered the material already, what is the purpose of forcing him/her to retake this class?

So how is the overall grade actually computed? Yes, we want to penalize the student for not working hard in the beginning of the semester, but we do not want the student to fail. As a result, if the grade on the final exam is larger than or equal to the passing grade, then, even if the weighted sum leads to failing, we still pass the student—i.e., in effect, increase the student's overall grade for this class to P.

Let us describe this scheme in precise terms. The weighted-sum overall grade G is a weighted combination of all the student's grades for this class, including this student's grade E on the final exam: G = G⁻ + w_E · E, where G⁻ denotes a weighted combination of the grades for all previous assignments, and w_E is the weight of the final exam. In these terms, the actual overall grade A has the following form:

• if G < P and E ≥ P, then A = P;
• in all other cases, A = G.
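This actually used rule can be summarized in a few lines of Python (a minimal sketch; the function name and the default P = 70 are ours, taken from the typical values mentioned above):

```python
def actual_grade(G, E, P=70):
    """Overall grade actually assigned: if the weighted sum G fails
    but the final-exam grade E reaches the passing threshold P,
    the overall grade is raised to P; otherwise it is just G."""
    if G < P and E >= P:
        return P
    return G

# Two students with no earlier assignments (G- = 0), final-exam
# weight w_E = 0.35: one gets E = 80, the other a perfect E = 100.
w_E = 0.35
print(actual_grade(w_E * 80, 80))    # -> 70
print(actual_grade(w_E * 100, 100))  # -> 70
```

Note that both calls print 70: once the weighted sum fails but the final-exam grade passes, the exact final-exam grade no longer matters; this is exactly the limitation analyzed in Sect. 3.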
3 Limitations of the Actually Used Scheme

Limitation: a general description. Suppose that we have G⁻ + w_E · E < P and E ≥ P; then, according to this scheme, we take A = P. Since G⁻ + w_E · E < P, if we increase E by a sufficiently small amount ε > 0, then for the resulting value E′ = E + ε, we will still have G⁻ + w_E · E′ < P. Also, since E′ ≥ E and E ≥ P, we will still have E′ ≥ P. Thus, according to the above-described scheme, the resulting overall student's grade for this class will be the same. So, whether the student gets the grade E for the final exam or a larger grade E′ > E, in both cases, the overall grade for the class remains the same: A = P. Thus, the student does not have any incentive to study better: he/she will get the same grade P:

• if he/she barely knows the material by the time of the final exam, and
• if he/she knows much more than the required minimum.

This lack of motivation is exactly the same limitation that explains the advantages of the US weighted-sum scheme in comparison with the alternative scheme, where the grade on the final exam is all that matters.
Limitation: an example. Let us consider a typical situation when the final exam has weight 0.35. In this case, if the student did not submit any assignments before attending the final exam, this student will get G⁻ = 0 points for all these assignments. So, according to the weighted-sum formula, the overall student's grade for this class should be G = G⁻ + 0.35 · E = 0.35 · E. So, even if the student gets the perfect grade E = 100 on the final exam, we will have G = 0.35 · 100 = 35 < 70. Thus, if we literally follow the weighted-sum scheme, we should fail this student. In practice, very few instructors would fail such a student. We would definitely penalize him/her—we will not give him/her 100 points overall for the class—but we will not fail this student. In this case, according to the above description of the actual grading, we will assign A = 70. However, if another student also gets 0s on all previous assignments, but gets 80—above the passing grade of 70—on the final exam, then, according to the above formulas, we also assign the grade A = 70. So, whether the student gets E = 80 or E = 100 on the final exam, this will not change the student's overall grade—so there is no incentive for the student to study hard for the final exam.
4 How Can We Overcome This Limitation

The actually used grading scheme: reminder. To analyze how we can overcome the above limitation, let us summarize the actually used grading scheme. In this scheme, if both the overall grade G and the grade E on the final exam are smaller than the passing grade P, the student fails the class anyway. The only case when a student passes the class is when:

• either this student's weighted-sum overall grade G is larger than or equal to the passing threshold P,
• or this student's grade E on the final exam is larger than or equal to P.

In this case:

• if G ≥ P, we assign the grade A = G to this student for this class;
• otherwise, if G < P and E ≥ P, we assign the grade A = P.

These two subcases can be summarized into a single formula—applicable when either G ≥ P or E ≥ P:

A = max(G, P). (1)

Analysis of the problem. As we have mentioned, the limitation of the actually used formula (1) is that the value (1) remains the same—equal to P—when we increase G < P to a larger value G′ < P, so a student has no incentive to study better for the final exam. The smallest possible value G that a student can get when E ≥ P is when G⁻ = 0; then G = w_E · P. To give the student an incentive to study better, a reasonable idea
is to give, to this student, a few extra points proportional to every single point beyond that, i.e., proportional to the difference G − w_E · P. In other words, we consider, for the cases when G ≤ P, the dependence of the type

A = max(G, P) + α0 · (G − w_E · P), (2)

for some small value α0 > 0. This way, any increase in G increases the final grade—so the students get the desired incentive. The only absolute limitation on α0 is that we should not exceed the maximum grade M. Also, this value A should not be equal to M—otherwise, there is no incentive for a student to get perfect grades on all the assignments and on the final exam, if this student can earn the perfect grade with G < P. So, in this case, we must have A < M for all G ≤ P. Since we consider the case when G ≤ P, the largest possible value of G is P, and the corresponding largest possible value of the expression (2) is P + α0 · (P − w_E · P). Thus, the requirement that this value does not exceed M means that P + α0 · (P − w_E · P) < M, i.e., that

α0 < (M − P)/(P − w_E · P). (3)

For G > P, we cannot use the same formula (2)—otherwise, for G = M, we will have the overall grade larger than M. Indeed, for G = M, we already have max(G, P) = M, so we cannot add anything to this grade. So, for G > P, in the formula (2), instead of a constant α0, we should have an expression α(G) that goes from α(P) = α0 for G = P to α(M) = 0 for G = M. Which function α(G) should we choose? The simplest functions are linear functions. A linear function is uniquely determined by its values at two points, so we have

α(G) = α0 · (M − G)/(M − P), (4)

and thus,

A = G + α0 · (M − G)/(M − P) · (G − w_E · P). (5)
We need to make sure that the larger G, the larger the resulting grade A. The expression (5) has the form

A = G + α0 · (M · G − M · w_E · P − G² + G · w_E · P)/(M − P). (6)

Thus, the derivative of the expression (5) with respect to G must be positive, i.e., we should have
1 + α0 · (M − 2G + w_E · P)/(M − P) ≥ 0 (7)
for all G. The expression (7) for this derivative is decreasing in G. So, to guarantee that this derivative is positive for all G, it is sufficient to make sure that it is positive for the largest possible value G = M. This leads to the following inequality:

1 + α0 · (−M + w_E · P)/(M − P) ≥ 0, (8)

i.e., equivalently, to

α0 ≤ (M − P)/(M − w_E · P). (9)
In particular, for M = 100, P = 70, and w_E = 0.35, this condition takes the form α0 ≤ 0.4. Now, we have two inequalities bounding α0: inequalities (3) and (9). However, since P < M, we have

(M − P)/(M − w_E · P) < (M − P)/(P − w_E · P), (10)

so if (9) is satisfied, the inequality (3) is satisfied too. Thus, it is sufficient to satisfy the inequality (9). Let us summarize the resulting scheme.

Resulting proposal. Let P be the passing threshold, M be the maximum possible numerical grade, and w_E be the weight of the final exam. To specify the proposed arrangement, we need to select a positive real number α0 for which

α0 ≤ (M − P)/(M − w_E · P). (9)
Suppose now that, for some student, the grade for the final exam is E and the weighted-sum combination of this grade and all the grades for the previous tests and assignments is G. Then:

• If G < P and E < P, the student fails the class.
• If G ≤ P and E ≥ P, we assign, to this student, the grade

A = P + α0 · (G − w_E · P). (11)

• If G > P, then we assign, to this student, the grade

A = G + α0 · (M − G)/(M − P) · (G − w_E · P). (5)
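The three cases of the resulting proposal can be sketched as a short Python function (a minimal illustration; the function name is ours, and the default values M = 100, P = 70, w_E = 0.35, and α0 = 0.05 match the numerical example that follows):

```python
def proposed_grade(G, E, P=70, M=100, w_E=0.35, alpha0=0.05):
    """Proposed overall grade: formula (11) for G <= P, formula (5) for G > P.

    alpha0 must satisfy inequality (9): alpha0 <= (M - P)/(M - w_E * P),
    which guarantees that A grows with G and never exceeds M.
    """
    assert 0 < alpha0 <= (M - P) / (M - w_E * P)
    if G < P and E < P:
        return None                                  # the student fails the class
    if G <= P and E >= P:                            # formula (11)
        return P + alpha0 * (G - w_E * P)
    return G + alpha0 * (M - G) / (M - P) * (G - w_E * P)  # formula (5)

# Solid C, B, A, and perfect students (G = E):
for g in (70, 80, 90, 100):
    print(g, proposed_grade(g, g))
```

Unlike the actually used rule, here any increase in G strictly increases the overall grade, while a perfect student still gets exactly M.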
Numerical example. Let us assume that M = 100, P = 70, and w_E = 0.35. Let us take α0 = 0.05. The worst-case passing student is when G⁻ = 0 and E = P = 70. In this case, G = w_E · E = 0.35 · 70 = 24.5, and the formula (11) leads to A = 70—the smallest possible passing grade, exactly as planned. Suppose now we have a solid C student, i.e., a student for whom G = E = 70. For this student, the formula (11) leads to A = 70 + 0.05 · (70 − 0.35 · 70) = 70 + 0.05 · (70 − 24.5) = 70 + 0.05 · 45.5 = 70 + 2.275 ≈ 72.
For a solid B student, for whom G = E = 80, the formula (5) leads to

A = 80 + 0.05 · (100 − 80)/(100 − 70) · (80 − 0.35 · 70) = 80 + 0.05 · (2/3) · (80 − 24.5) ≈ 82.
For a solid A student, for whom G = E = 90, the formula (5) leads to

A = 90 + 0.05 · (100 − 90)/(100 − 70) · (90 − 0.35 · 70) = 90 + 0.05 · (1/3) · (90 − 24.5) ≈ 91.
Finally, for a perfect student, for whom G = E = 100, the formula (5) leads to the expected perfect value A = 100.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Reference

1. O. Kosheleva, V. Kreinovich, C. Servin, Motivations do not decrease procrastination, so what can we do?, in Proceedings of the 11th International Scientific-Practical Conference “Mathematical Education in Schools and Universities” MATHEDU’2022, March 28-April, 2022 (Kazan, Russia, 2022), pp. 185–188
Why Some Theoretically Possible Representations of Natural Numbers Were Historically Used and Some Were Not: An Algorithm-Based Explanation Christian Servin, Olga Kosheleva, and Vladik Kreinovich
Abstract Historically, people have used many ways to represent natural numbers: from the original “unary” arithmetic, where each number is represented as a sequence of, e.g., cuts (4 is IIII), to modern decimal and binary systems. However, with all this variety, some seemingly reasonable ways of representing natural numbers were never used. For example, while it may seem reasonable to represent numbers as products—e.g., as products of prime numbers—such a representation was never used in history. So why were some theoretically possible representations of natural numbers historically used while some were not? In this paper, we propose an algorithm-based explanation for this difference: namely, historically used representations have decidable theories—i.e., for each such representation, there is an algorithm that, given a formula, decides whether this formula is true or false—while for unused representations, no such algorithm is possible.
1 Formulation of the Problem

Historical representations of natural numbers: a brief overview (see, e.g., [2] for more details). There are, in principle, infinitely many natural numbers 0, 1, 2, … To represent all such numbers, a natural idea is:
C. Servin Information Technology Systems Department, El Paso Community College (EPCC), 919 Hunter Dr., El Paso, TX 79915-1908, USA e-mail: [email protected] O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, Texas 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_24
• to select one or more basic numbers, and
• to select operations that would allow us to build new natural numbers based on the existing ones.

Which operations should we select? On the set of all natural numbers, the most natural operations are arithmetic ones. They are, in the increasing order of computational complexity: addition (of which the simplest case is adding 1), subtraction, multiplication, and division.

The historically first representation is the one corresponding to “unary” numbers, where each natural number (e.g., 5) is represented by exactly that many cuts on a stick. Roman numerals preserve some trace of this representation, since there 1 is represented as I, 2 as II, 3 as III, and sometimes even 4 as IIII. In this representation, the only basic number is 0 (or rather, since the ancient folks did not know zero, 1), and the only operation is adding 1.

Such a unary representation works well for small numbers, but for large numbers—e.g., hundreds or thousands—such a representation requires an unrealistic amount of space. A natural way to shorten the representation is to use other basic numbers in addition to 0, and to use more complex operations—e.g., full addition instead of adding 1. This is exactly how, e.g., the Biblical number system worked: some letters represented numbers from 1 to 9, some letters represented numbers 10, 20, …, 90, some letters represented 100, 200, etc., and a generic number was described by a combination of such letters—interpreted as the sum of the corresponding numbers.

A natural next step is to add the next-simplest arithmetic operation: subtraction. This is what the Romans did, and indeed, IV meaning 5 − 1 is a much shorter representation of the number 4 than the addition-based IIII meaning 1 + 1 + 1 + 1.

Historically, the next step was the current positional system, which means adding, to addition, some multiplication.
The simplest such system is binary arithmetic, where we only allow multiplication of 2s—i.e., in effect, computing the powers of 2—so that any natural number is represented as the sum of the corresponding powers of 2. Decimal numbers are somewhat more complicated, since in decimal numbers, a sum representing a natural number includes not only powers of 10, but also products of these powers with numbers 1 through 9.

Why not go further? In view of our listing of arithmetic operations in the increasing order of their complexity, a natural idea is to go further and use general multiplication—just like general addition, instead of its special case of adding 1, turned unary numbers into a more compact Biblical representation. If we add multiplication, this would mean, e.g., that we can represent numbers as products: e.g., 851 could be represented as the product 37 · 23. Such representations can be useful in many situations: e.g., if we want to divide some amount between several people. This need for making division easy is usually cited as the main reason why, e.g., the Babylonians used a base-60 system: 60 can easily be divided into 3, 4, 5, 6, 10, 12, 15, 20, or 30 equal pieces. So why was such a product-based representation never proposed?

An even simpler idea is to take into account that different cultures used different bases. Because of this, a natural way to describe totals is by adding numbers
in different bases—i.e., for example, to allow sums of decimal and 60-ary numbers. This was also never used. Why?

What we do in this paper. In this paper, we propose a possible answer to this question.
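As a simple quantitative illustration of the space argument from the overview above—unary length grows linearly with the number, positional length only logarithmically—one can compare the two directly (a minimal sketch; the function names are ours):

```python
def unary_length(n):
    """Number of symbols needed to write n in the unary ("cuts") system."""
    return n

def positional_length(n, base=10):
    """Number of digits of n in a positional system with the given base."""
    digits = 1
    while n >= base:
        n //= base
        digits += 1
    return digits

# Unary length grows linearly, positional length only logarithmically:
for n in (4, 100, 10**6):
    print(n, unary_length(n), positional_length(n, 10), positional_length(n, 2))
```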
2 Our Explanation

Main idea behind our explanation. We want to be able to reason about numbers; we want to be able to decide which properties are true and which are not. From this viewpoint, it is reasonable to select a representation of numbers for which such a decision is algorithmically possible, i.e., for which there is an algorithm that, given any formula related to such a representation, decides whether this formula is true or not. Let us analyze the historical and potential number representations from this viewpoint.

Case of unary numbers. Let us start with the simplest case of unary numbers. In this case, the basic object is 0 and the only operation is adding 1, which we can denote by s(n) := n + 1. It is therefore reasonable to consider, as elementary formulas, equalities between terms, where a term is something that is obtained from variables and a constant 0 by using the function s(n). Then, we can combine elementary formulas by using logical connectives “and”, “or”, “not”, “implies”, and quantifiers ∀n and ∃n, where the variables n can take any values from the set N = {0, 1, 2, …} of all natural numbers. The resulting “closed” formulas—i.e., formulas in which each variable is covered by some quantifier—are known as first-order formulas in the signature ⟨N, 0, s⟩. It is known that there is an algorithm that, given any such formula, checks whether this formula is true or not.

Case of Biblical numbers. In this case, we also consider addition, i.e., we consider the signature ⟨N, 0, n1, …, nk, +⟩, where n1, …, nk are basic numbers. From the logical viewpoint, we can describe s(n) and all the numbers ni in terms of addition:

m = s(n) ⇔ (∃a (m = n + a & a ≠ 0) & ¬(∃a ∃b (m = n + a + b & a ≠ 0 & b ≠ 0))),
and ni is simply s(… s(0) …), where s(n) is applied ni times. Thus, the class of all such formulas is equivalent to the class of all the formulas in the signature ⟨N, 0, +⟩. For this theory—known as Presburger arithmetic—it is also known that there is an algorithm that, given any such formula, checks whether this formula is true or not.

Case of Roman-type numerals. The only thing that Roman numerals add is subtraction. However, from the logical viewpoint, subtraction can be described in terms
of addition: a − b = c ⇔ a = b + c, so this case is also reducible to Presburger arithmetic, for which the deciding algorithm is possible.

Case of binary and decimal numbers. In this case, the signature is ⟨N, 0, n1·, …, nk·, +, bⁿ⟩, where b is the base of the corresponding number system (in our examples, b = 2 or b = 10), and ni· means the operation of multiplying by ni (e.g., by 1, …, 9 in the decimal case). Each operation n → ni · n is simply equivalent to addition repeated ni times: ni · n = n + … + n. Thus, from the logical viewpoint, the signature can be reduced simply to ⟨N, 0, +, bⁿ⟩. For such a signature, the existence of the deciding algorithm was, in effect, proven in [4].

Comment. To be more precise, the paper [4] explicitly states the existence of the deciding algorithm for b = 2. For b > 2, the existence of the deciding algorithm follows from the general result from [4] about the signature ⟨N, 0, +, f(n)⟩ for exponentially growing functions f(n).

What if we add multiplication? If we also allow general multiplication, i.e., if we consider the signature ⟨N, 0, +, ·⟩, then—as is well known—the corresponding first-order theory is undecidable, in the sense that no general algorithm is possible that, given a formula, would decide whether this formula is true or not. This was, in effect, proven by Kurt Gödel in his famous results.

What if we allow numbers in different bases? If we allow even two bases b1 and b2, this would mean that we consider the signature ⟨N, 0, +, b1ⁿ, b2ⁿ⟩. It is known that—unless b1 and b2 are both powers of the same number—the corresponding first-order theory is undecidable; see, e.g., [1, 3, 5].

Conclusion.
So, our idea—that people select representations for which a deciding algorithm is possible—indeed explains why some representations were historically used while some weren't:

• for representations that were used, there exists an algorithm that decides whether each formula is true or not;
• in contrast, for representations that were not historically used, such an algorithm is not possible.

Clarification. We definitely do not claim that, e.g., the Romans knew modern sophisticated deciding algorithms; what we claim is that they had intuition about this—intuition that was later confirmed by theorems.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology.
It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are grateful to Mikhail Starchak for valuable discussions.
References

1. A. Bès, Undecidable extensions of Büchi arithmetic and Cobham–Semënov theorem. J. Symb. Logic 62(4), 1280–1296 (1997)
2. C.B. Boyer, U.C. Merzbach, History of Mathematics (Wiley, Hoboken, New Jersey, 2011)
3. P. Hieronymi, C. Schulz, A strong version of Cobham’s theorem (2021). arXiv:2110.11858
4. A.L. Semenov, Logical theories of one-place functions on the set of natural numbers. Izvestiya: Math. 22(3), 587–618 (1984)
5. R. Villemaire, The theory of ⟨N, +, Vk, Vl⟩ is undecidable. Theor. Comput. Sci. 106(2), 337–349 (1992)
Applications to Engineering
Dielectric Barrier Discharge (DBD) Thrusters–Aerospace Engines of the Future: Invariance-Based Analysis Alexis Lupo and Vladik Kreinovich
Abstract One of the most promising aerospace engines is a Dielectric Barrier Discharge (DBD) thruster–an effective electric engine without moving parts. The efficiency of this engine depends on the proper selection of the corresponding electric field. To make this selection, we need to know, in particular, how its thrust depends on the atmospheric pressure. At present, for this dependence, we only know an approximate semi-empirical formula. In this paper, we use natural invariance requirements to come up with a theoretical explanation for this empirical dependence, and to propose a more general family of models that can lead to a more accurate description of the DBD thruster’s behavior.
1 Formulation of the Problem

1.1 How to Fly on Mars: Dielectric Barrier Discharge (DBD) Thrusters

Need to fly on other planets. A large amount of information about Earth comes from air-based observations. It is therefore desirable to have similar studies of planets with an atmosphere.

Related challenge. One of the problems is that on Earth, flying devices use fuel-based engines. On other planets, we do not have ready sources of fuel, and bringing fuel from Earth is too expensive.
The main source of energy for planetary missions is electricity. Electricity can be generated by solar batteries and/or by radioactive energy sources. It is therefore desirable to use electricity to power the flying devices.

How to design electricity-powered flying machines. A natural idea is to use electrostatic forces between two electrodes. When the voltage is high, an electric arc appears–as in lightning. The arc means that atmospheric atoms are ionized into negatively charged electrons and positively charged ions:

• ions move towards the negative electrode, while
• electrons move towards the positive electrode.

Since the ions move towards the negative electrode, the density near the negative electrode becomes smaller than nearby. Thus, the atmospheric gases are sucked into this area. These gases also become ionized, so they also move towards the negative electrode. The mass of ions is much larger than the mass of electrons, so the ion flow produces momentum and thus, thrust. This is the main idea behind what is called Dielectric Barrier Discharge (DBD) thrusters; see, e.g., [1, 2, 5].
1.2 DBD Thrusters Are Useful on Earth Too

The DBD thrusters are useful on Earth too:

• They do not have moving parts, so they are durable and reliable.
• They do not burn fuel, so they do not pollute the environment.
• They have a higher efficiency.

Let us explain why they have higher efficiency. In fuel-using flying devices, energy is wasted at two stages:

• when fuel is burning–a large part of the energy goes into useless heat, and
• when turbines are used–a part of the energy is spent on friction.

In contrast, in an electric device, there is only one stage, so less energy is wasted.
1.3 What Electric Field E Should We Select

For a given design and a given value E of the electric field, the efficiency of a thruster changes with atmospheric pressure p:

• When the pressure is too low, we do not have enough ions to generate thrust.
• On the other hand, when the pressure is too high, the air resistance becomes too strong: the atmosphere is very dense, and moving through it becomes practically impossible.

For each value E of the electric field, there is an optimal pressure at which the thruster is the most efficient. So, for each value of the atmospheric pressure p, we should select this optimal E. In particular, since the atmospheric pressure decreases with height, E should change with height. To find the optimal value E of the electric field, we need to know how the thrust F depends on the pressure p.
1.4 Remaining Problems

As we have mentioned, to find the optimal value E of the electric field, we need to know how the thrust F depends on the pressure p. At present, we only have an approximate semi-empirical formula F(p) = c · p · exp(a · p) based on a simplified model; see, e.g., [6]. It is therefore desirable to provide a theoretical explanation for this formula. Another issue is that this formula provides a rather crude approximation to the data. It is desirable to come up with more accurate formulas. These are the two problems with which we deal in this paper.
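As an aside, the semi-empirical formula F(p) = c · p · exp(a · p) is easy to fit to data, since taking logarithms makes it linear in the unknowns. The sketch below illustrates this; the data points and the "true" parameter values are made-up illustration values, not the experimental data from [6].

```python
import math

# Made-up illustration data generated from F(p) = c * p * exp(a * p).
true_c, true_a = 2.0, -0.8
pressures = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
thrusts = [true_c * p * math.exp(true_a * p) for p in pressures]

# Taking logarithms turns the formula into a linear model:
#   ln(F / p) = ln(c) + a * p,
# so ordinary least squares on (p, ln(F / p)) recovers a and ln(c).
xs = pressures
ys = [math.log(F / p) for p, F in zip(pressures, thrusts)]
m = len(xs)
mx = sum(xs) / m
my = sum(ys) / m
a_fit = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
c_fit = math.exp(my - a_fit * mx)
```

On noise-free data this log-linearization recovers the parameters exactly; on real noisy data it gives a reasonable starting point for a nonlinear fit.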
2 Analysis of the Problem

We should select a family of functions F(p). For different designs and for different values of the electric field E, we have, in general, different dependencies F(p). So, we cannot use a single function F(p); we should select a family of functions.

A natural way to describe a family of functions. A natural way to describe a family of functions is:
• to select "basic" functions e1(p), …, en(p), and
• to consider all possible linear combinations of these basic functions, i.e., functions of the type C1 · e1(p) + … + Cn · en(p).

Example. For example, when e1(p) = 1, e2(p) = p, e3(p) = p², we get the family of quadratic polynomials.

Resulting question. Which family should we select? In other words, which basic functions should we select?
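The "linear combinations of basic functions" construction can be sketched directly in code; the coefficients below are arbitrary illustration values.

```python
# A family of functions represented by linear combinations of basic
# functions: with e1(p) = 1, e2(p) = p, e3(p) = p^2, we get the family
# of quadratic polynomials, exactly as in the example above.
basis = [lambda p: 1.0, lambda p: p, lambda p: p * p]

def member(coeffs):
    """Return the family member C1*e1 + ... + Cn*en as a function of p."""
    return lambda p: sum(C * e(p) for C, e in zip(coeffs, basis))

f = member([3.0, -2.0, 1.0])   # f(p) = 3 - 2*p + p^2
```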
A. Lupo and V. Kreinovich
Shift-invariance. An important feature of pressure is that its effects, in some sense, do not depend on the selection of the starting point. For example, when there are no cars on the road:
• we say that the pressure on the pavement is 0,
• while in reality, there is always a strong atmospheric pressure.
We can select different starting points: e.g., we can take 0 to be vacuum, or 0 to be atmospheric pressure. If we replace the original starting point with the one which is p0 units smaller, all numerical values are shifted: p → p + p0. The selection of a starting point is rather arbitrary. So, it makes sense to select an approximating family that does not change under such shifts.

What we can conclude based on shift-invariance. Shift-invariance implies, in particular, that for each function ei(p), its shift ei(p + p0) belongs to the same family. This means that for some coefficients Cij(p0), we get:

ei(p + p0) = Ci1(p0) · e1(p) + … + Cin(p0) · en(p).

If we differentiate both sides with respect to p0, we get

ei′(p + p0) = Ci1′(p0) · e1(p) + … + Cin′(p0) · en(p).
In particular, for p0 = 0, we get

ei′(p) = ci1 · e1(p) + … + cin · en(p),
where we denoted cij = Cij′(0). So, we get a system of linear differential equations with constant coefficients. It is known (see, e.g., [3, 4]) that all solutions to such systems are linear combinations of functions t^m · exp(k · t), where:
• k is an eigenvalue of the matrix cij, and
• m is a non-negative integer which is smaller than the multiplicity of k.

What are the simplest cases. The simplest case is when n = 1. In this case, we get functions F(p) = C1 · exp(k · p). However, such a function does not satisfy the condition F(0) = 0, meaning that when there is no atmosphere, there is no thrust. So, to satisfy this condition, we need to take n ≥ 2. The simplest of such cases is n = 2. In this case, we have three possible options:
• we may have a single eigenvalue of multiplicity 2;
• we may have two complex-valued eigenvalues; and
• we may have two different real-valued eigenvalues.
Let us consider these three options one by one.
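The repeated-eigenvalue case can be checked numerically: the sketch below integrates a 2 × 2 linear system whose matrix has a single eigenvalue k of multiplicity 2 and compares the result with the closed-form solutions exp(k · p) and p · exp(k · p). The value of k and the step size are arbitrary illustration values.

```python
import math

# The system e1' = k*e1, e2' = e1 + k*e2 has the constant matrix
# [[k, 0], [1, k]] with a single eigenvalue k of multiplicity 2, so its
# solutions are combinations of exp(k*p) and p * exp(k*p).
k = -0.7

def rhs(e):
    e1, e2 = e
    return (k * e1, e1 + k * e2)

def rk4_step(e, h):
    # one classical 4th-order Runge-Kutta step for e' = rhs(e)
    k1 = rhs(e)
    k2 = rhs(tuple(x + 0.5 * h * d for x, d in zip(e, k1)))
    k3 = rhs(tuple(x + 0.5 * h * d for x, d in zip(e, k2)))
    k4 = rhs(tuple(x + h * d for x, d in zip(e, k3)))
    return tuple(x + (h / 6.0) * (d1 + 2 * d2 + 2 * d3 + d4)
                 for x, d1, d2, d3, d4 in zip(e, k1, k2, k3, k4))

e, p, h = (1.0, 0.0), 0.0, 0.001   # initial conditions e1(0)=1, e2(0)=0
while p < 2.0 - 1e-12:
    e = rk4_step(e, h)
    p += h

# Closed-form solutions with the same initial conditions.
exact = (math.exp(k * p), p * math.exp(k * p))
```

The numerically integrated trajectory agrees with exp(k · p) and p · exp(k · p) to within the integrator's accuracy.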
First option explains the semi-empirical formula. If we have an eigenvalue of multiplicity 2, we get F(p) = C1 · exp(k · p) + C2 · p · exp(k · p). The condition F(0) = 0 implies that C1 = 0, so we have exactly the semi-empirical formula from [6].

Second option is not applicable. When the eigenvalues are complex-valued, k = a ± b · i, then F(p) = exp(a · p) · (C1 · cos(b · p) + C2 · sin(b · p)). This expression is not applicable to our case; indeed:
• this expression is negative for some p, while
• the thrust force is always non-negative.

Third option explains how to get more accurate formulas. When the eigenvalues k1 < k2 are real and different, we get F(p) = C1 · exp(k1 · p) + C2 · exp(k2 · p). The condition F(0) = 0 implies C2 = −C1, so F(p) = C1 · (exp(k1 · p) − exp(k2 · p)). In the limit k2 → k1, we get the expression F(p) = c · p · exp(k · p). In general, we get a 3-parametric family that may lead to a better description of the experimental data.

Comment. If this is not accurate enough, we can use shift-invariant families with n = 3, 4, …

Acknowledgements This work was supported in part by the National Science Foundation grants:
• 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and
• HRD-1834620 and HRD-2034030 (CAHSI Includes).
It was also supported by the AT&T Fellowship in Information Technology, and by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478. The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References

1. T.C. Corke, M.L. Post, D.M. Orlov, SDBD plasma enhanced aerodynamics: concepts, optimization and applications. Prog. Aerosp. Sci. 43(7–8), 193–217 (2007)
2. A.D. Greig, C.H. Birzer, M. Arjomandi, Atmospheric plasma thruster: theory and concept. AIAA J. 51(2), 362–371 (2013)
3. J.C. Robinson, An Introduction to Ordinary Differential Equations (Cambridge University Press, Cambridge, UK, 2004)
4. M. Tenenbaum, H. Pollard, Ordinary Differential Equations (Dover Publ., New York, 2022)
5. F. Thomas, T. Corke, M. Iqbal, A. Kozlov, D. Schatzman, Optimization of dielectric barrier discharge plasma actuators for active aerodynamic flow control. AIAA J. 47, 2169–2178 (2009)
6. Z. Wu, J. Xu, P. Chen, K. Xie, N. Wang, Maximum thrust of single dielectric barrier discharge thruster at low pressure. AIAA J. 56(6), 2235–2241 (2018)
Need for Optimal Distributed Measurement of Cumulative Quantities Explains the Ubiquity of Absolute and Relative Error Components

Hector A. Reyes, Aaron D. Brown, Jeffrey Escamilla, Ethan D. Kish, and Vladik Kreinovich

Abstract In many practical situations, we need to measure the value of a cumulative quantity, i.e., a quantity that is obtained by adding measurement results corresponding to different spatial locations. How can we select the measuring instruments so that the resulting cumulative quantity can be determined with known accuracy, and, to avoid unnecessary expenses, not more accurately than needed? It turns out that the only case where such an optimal arrangement is possible is when the required accuracy means selecting upper bounds on the absolute and relative error components. This result provides a possible explanation for the ubiquity of such two-component accuracy requirements.
1 Formulation of the Problem Need for distributed measurements. In many practical situations, we are interested in estimating the value x of a cumulative quantity: e.g., we want to estimate the overall amount of oil in a given area, the overall amount of CO2 emissions, etc. How to perform distributed measurements. Measuring instruments usually measure quantities in a given location, i.e., they measure local values x1 , . . . , xn that H. A. Reyes · A. D. Brown · J. Escamilla · E. D. Kish · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] H. A. Reyes e-mail: [email protected] A. D. Brown e-mail: [email protected] J. Escamilla e-mail: [email protected] E. D. Kish e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_26
H. A. Reyes et al.
together form the desired value x = x1 + … + xn. So, a natural way to produce an estimate x̃ for the cumulative value x is:
• to place measuring instruments at several locations within the given area,
• to measure the values xi of the desired quantity at these locations, and
• to add up the results x̃1, …, x̃n of these measurements: x̃ = x̃1 + … + x̃n.

Need for optimal planning. Usually, we want to reach a certain estimation accuracy. To achieve this accuracy, we need to plan how accurate the deployed measuring instruments should be. The use of accurate measuring instruments is often very expensive, while budgets are usually limited. It is therefore desirable to come up with a deployment plan that would achieve the desired overall accuracy at minimal cost. This implies, in particular, that the resulting estimate should not be more accurate than needed, since this would mean that we could have used less accurate (and thus cheaper) measuring instruments.

What we do in this paper. In this paper, we provide a condition under which such optimal planning is possible, and the corresponding optimal planning algorithm. The resulting condition will explain why, usually, measuring instruments are characterized by their absolute and relative accuracy.
2 Let Us Formulate the Problem in Precise Terms

How we can describe measurement accuracy. Measurements are never absolutely accurate: the measurement result x̃i is, in general, different from the actual (unknown) value xi of the corresponding quantity; see, e.g., [5]. In other words, the difference Δxi = x̃i − xi is, in general, different from 0. This difference is known as the measurement error.

For each measuring instrument, we should know how large the measurement error can be. In precise terms, we need to know an upper bound Δ on the absolute value |Δxi| of the measurement error. This upper bound should be provided by the manufacturer of the measuring instrument. Indeed, if no such upper bound is known, this means that whatever the reading of the measuring instrument, the actual value can be as far away from this reading as possible, so we get no information whatsoever about the actual value; in this case, we do not have a measuring instrument, we have a wild guess.

Ideally, in addition to knowing that the measurement error Δxi is somewhere in the interval [−Δ, Δ], it is desirable to know how probable different values from this interval are, i.e., what is the probability distribution of the measurement error.
Sometimes, we know this probability distribution, but in many practical situations, we do not know it, and the upper bound Δ is all we know. So, in this section, we will consider this value Δ as the measure of the instrument's accuracy; see, e.g., [1–4].

This upper bound may depend on the measured value. For example, if we are measuring current in the range from 1 mA to 1 A, then it is relatively easy to maintain an accuracy of 0.1 mA when the actual current is 1 mA: this means measuring with one correct decimal digit. We can get values 0.813…, 0.825…, but since the measurement accuracy is 0.1, these measurement results may correspond to the same actual value. In other words, whatever the measuring instrument shows, only one digit is meaningful and significant; all the other digits may be caused by measurement errors. On the other hand, maintaining the same accuracy of 0.1 mA when we measure currents close to 1 A would mean that we need to distinguish between values such as 0.94651 A = 946.51 mA and 0.94637 A = 946.37 mA, since the difference between these two values is larger than 0.1 mA. This would mean that the measurement result should have not one, but four significant digits, and this would require much more accurate measurements. Because of this, we will explicitly take into account that the accuracy depends on the measured value: Δ = Δ(x). Usually, small changes in x lead to only small changes in the accuracy, so we can safely assume that the dependence Δ(x) is smooth.

What we want. We want to estimate the desired cumulative value x with some accuracy δ. In other words, we want to make sure that the difference between our estimate x̃ and the actual value x does not exceed δ: |x̃ − x| ≤ δ. The cumulative value is estimated based on n measurement results.
As we have mentioned, the accuracy that we can achieve in each measurement, in general, depends on the measured value: the larger the value of the measured quantity, the more difficult it is to maintain the corresponding accuracy. It is therefore reasonable to conclude that, whatever measuring instruments we use to measure each value xi, it will be more difficult to estimate a larger cumulative value x with the same accuracy. Thus, it makes sense to require that the desired accuracy δ should also depend on the value that we want to estimate, δ = δ(x): the larger the value x, the larger the uncertainty δ(x) that we can achieve. So, our problem takes the following form:
• we want to be able to estimate the cumulative value x with a given accuracy δ(x), i.e., we are given a function δ(x) and we want to estimate the cumulative value with this accuracy;
• we want to find the measuring instruments that would guarantee this estimation accuracy, and that would be optimal for this task, i.e., that would not provide better accuracy than needed.

Let us describe what we want in precise terms. To formulate this problem in precise terms, let us analyze what estimation accuracy we can achieve if we use, for each of the n measurements, a measuring instrument characterized by the accuracy Δ(x). Based on each measurement result x̃i, we can conclude that the actual value xi of the corresponding quantity is located somewhere in the interval
[x̃i − Δ(x̃i), x̃i + Δ(x̃i)]:

the smallest possible value is x̃i − Δ(x̃i), the largest possible value is x̃i + Δ(x̃i). When we add the measurement results, we get the estimate x̃ = x̃1 + … + x̃n for the desired quantity x. What are the possible values of this quantity? The sum x = x1 + … + xn attains its smallest value when all the values xi are the smallest, i.e., when

x = (x̃1 − Δ(x̃1)) + … + (x̃n − Δ(x̃n)) = (x̃1 + … + x̃n) − (Δ(x̃1) + … + Δ(x̃n)),

i.e., when x = x̃ − (Δ(x̃1) + … + Δ(x̃n)). Similarly, the sum x = x1 + … + xn attains its largest value when all the values xi are the largest, i.e., when

x = (x̃1 + Δ(x̃1)) + … + (x̃n + Δ(x̃n)) = (x̃1 + … + x̃n) + (Δ(x̃1) + … + Δ(x̃n)),

i.e., when x = x̃ + (Δ(x̃1) + … + Δ(x̃n)). Thus, all we can conclude about the value x is that this value belongs to the interval

[x̃ − (Δ(x̃1) + … + Δ(x̃n)), x̃ + (Δ(x̃1) + … + Δ(x̃n))].

This means that we get an estimate of x with accuracy Δ(x̃1) + … + Δ(x̃n). Our objective is to make sure that this is exactly the desired accuracy δ(x). In other words, we want to make sure that whenever x = x1 + … + xn, we should have δ(x) = Δ(x1) + … + Δ(xn). Substituting x = x1 + … + xn into this formula, we get

δ(x1 + … + xn) = Δ(x1) + … + Δ(xn).   (1)
We do not know a priori what the values xi will be, so if we want to maintain the desired accuracy δ(x), and make sure that we do not get more accuracy, we should make sure that the equality (1) is satisfied for all possible values x1, …, xn. In these terms, the problem takes the following form:
• For which functions δ(x) is it possible to have a function Δ(x) for which the equality (1) is satisfied? and
• For the functions δ(x) for which such a function Δ(x) is possible, how can we find this function Δ(x), which describes the corresponding measuring instrument?
This is the problem that we solve in this paper.
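The interval-propagation step above, deriving the enclosure for the sum from the per-instrument accuracy bounds, can be sketched in a few lines of code; the accuracy parameters and the readings below are hypothetical illustration values.

```python
# Interval propagation for a cumulative quantity: each instrument
# guarantees |measured - actual| <= Delta(measured), so the sum is
# enclosed with half-width Delta(x~1) + ... + Delta(x~n).
def delta_acc(x, a=0.03, b=0.01):
    # hypothetical accuracy: absolute component a plus relative component b*x
    return a + b * x

readings = [10.2, 7.9, 12.4]                   # hypothetical local readings
estimate = sum(readings)                       # x~ = x~1 + ... + x~n
half_width = sum(delta_acc(x) for x in readings)
enclosure = (estimate - half_width, estimate + half_width)
```

The guaranteed enclosure here is [estimate − half_width, estimate + half_width]; its half-width is exactly the sum of the individual accuracy bounds, as in equation (1).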
3 When Is Optimal Distributed Measurement of Cumulative Quantities Possible?

Let us first analyze when the optimal distributed measurement of a cumulative quantity is possible, i.e., for which functions δ(x) there exists a function Δ(x) for which the equality (1) is always satisfied. We have assumed that the function Δ(x) is smooth, i.e., differentiable. Thus, the sum δ(x) of such functions is differentiable too. Since both functions Δ(x) and δ(x) are differentiable, we can differentiate both sides of the equality (1) with respect to one of the variables, e.g., with respect to the variable x1. The terms Δ(x2), …, Δ(xn) do not depend on x1 at all, so their derivatives with respect to x1 are 0, and the resulting formula takes the form

δ′(x1 + … + xn) = Δ′(x1),   (2)

where, as usual, δ′ and Δ′ denote the derivatives of the corresponding functions. The equality (2) holds for all possible values x2, …, xn. For every real number x0, we can take, e.g., x2 = x0 − x1 and x3 = … = xn = 0; then we will have x1 + … + xn = x0, and the equality (2) takes the form δ′(x0) = Δ′(x1). The right-hand side does not depend on x0, which means that the derivative δ′(x0) is a constant not depending on x0 either. The only functions whose derivative is a constant are linear functions, so we conclude that the dependence δ(x) is linear: δ(x) = a + b · x for some constants a and b. Interestingly, this fits well with the usual description of measurement error [5] as consisting of two components:
• the absolute error component a that does not depend on x at all, and
• the relative error component, according to which the bound on the measurement error is a certain percentage of the actual value x, i.e., has the form b · x for some constant b (e.g., for 10% accuracy, b = 0.1).
Thus, our result explains this usual description.
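This result can be checked numerically: a linear δ(x) = a + b · x satisfies requirement (1) with the per-instrument accuracy Δ(x) = a/n + b · x derived in the next section, while a nonlinear δ fails for the analogous choice of Δ. The values of a, b, n, and xs below are arbitrary illustration values.

```python
# Numeric check of the linearity result.
a, b, n = 0.1, 0.02, 4
xs = [1.0, 2.5, 0.3, 4.2]   # n sample local values

# Linear case: delta(x1 + ... + xn) equals Delta(x1) + ... + Delta(xn),
# with delta(x) = a + b*x and Delta(x) = a/n + b*x.
gap_linear = abs((a + b * sum(xs)) - sum(a / n + b * x for x in xs))

# Nonlinear (quadratic) delta: the analogous construction fails.
gap_quad = abs((a + b * sum(xs) ** 2) - sum(a / n + b * x ** 2 for x in xs))
```

Here gap_linear is zero up to rounding, while gap_quad is visibly nonzero, illustrating why only linear δ(x) admits an exactly matching Δ(x).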
4 What Measuring Instrument Should We Select to Get the Optimal Distributed Measurement of a Cumulative Quantity?

Now that we know for which desired accuracies δ(x) we can have the optimal distributed measurement of a cumulative quantity, the natural next question is: given one of such functions δ(x), what measuring instrument, i.e., what function Δ(x), should we select for this optimal measurement? To answer this question, we can take x1 = … = xn. In this case, Δ(x1) = … = Δ(xn), so the equality (1) takes the form

δ(n · x1) = n · Δ(x1).   (3)

We know that δ(x) = a + b · x, so the formula (3) takes the form a + b · n · x1 = n · Δ(x1). If we divide both sides of this equality by n, and rename x1 into x, we get the desired expression for Δ(x):

Δ(x) = a/n + b · x.

In other words:
• the bound on the relative error component of each measuring instrument should be the same as the desired relative accuracy of the cumulative quantity, and
• the bound on the absolute error component should be n times smaller than the desired bound on the absolute accuracy of the cumulative quantity.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References

1. L. Jaulin, M. Kiefer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
2. B.J. Kubica, Interval Methods for Solving Nonlinear Constraint Satisfaction, Optimization, and Similar Problems: from Inequalities Systems to Game Solutions (Springer, Cham, Switzerland, 2019)
3. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
4. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
5. S.G. Rabinovich, Measurement Errors and Uncertainty: Theory and Practice (Springer Verlag, New York, 2005)
Over-Measurement Paradox: Suspension of Thermonuclear Research Center and Need to Update Standards Hector Reyes, Saeid Tizpaz-Niari, and Vladik Kreinovich
Abstract In general, the more measurements we perform, the more information we gain about the system and thus, the more adequate decisions we will be able to make. However, in situations when we perform measurements to check for safety, the situation is sometimes the opposite: the more additional measurements we perform beyond what is required, the worse the decisions will be: namely, the higher the chance that a perfectly safe system will be erroneously classified as unsafe and, therefore, that unnecessary additional features will be added to the system design. This is not just a theoretical possibility: exactly this phenomenon is one of the reasons why the construction of a world-wide thermonuclear research center has been suspended. In this paper, we show that the reason for this paradox lies in the way the safety standards are currently formulated: a formulation that was right when sensors were much more expensive is no longer adequate now that sensors and measurements are much cheaper. We also propose how to modify the safety standards so as to avoid this paradox and make sure that additional measurements always lead to better solutions.
H. Reyes · S. Tizpaz-Niari · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
H. Reyes e-mail: [email protected]
S. Tizpaz-Niari e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_27

1 What Is the Over-Measurement Paradox

General case: the more measurements, the better. Most of our knowledge about the world comes from measurements; see, e.g., [8]. Each measurement provides us with additional information about the world, and once we have a sufficient number of measurements of the same system, we may be able to find the equations that describe
H. Reyes et al.
the dynamics of this system and thus, to get even more additional information that was hidden in the original measurements. The more measurements we perform, the more information we gain about the system, the more accurate our estimates, and thus, the better our decisions will be. From this viewpoint:
• the more measurements we perform,
• the better.
We only expect one limitation on the number of measurements: the financial one. Indeed, at some point, after we have performed a large number of measurements, we get a very accurate picture of the measured system. Decisions based on this picture are close to optimal, and a very small expected increase in optimality may not be worth spending money on additional measurements.

Over-measurement paradox: case study. Most of our energy comes from the Sun. In the Sun, as in most stars, energy is generated by thermonuclear synthesis, in which protons, i.e., nuclei of Hydrogen (H), combine together to form nuclei of Helium (He). This is a very efficient way of generating energy, a way that does not lead to pollution or other side effects. The majority of physicists believe that this is the way to get energy for our civilization: instead of relying on direct or indirect energy from the thermonuclear reactions inside the Sun, why not run the same reactions ourselves? This would be a very effective and clean source of energy. The idea is theoretically feasible, but technically, this is a very difficult task. Researchers and engineers all over the world have been working on it since the 1950s. To speed up the process, researchers from 35 major world countries decided to join efforts, and allocated $65 billion to build an international research center where specialists from all over the world will work on this topic. This project is named ITER; this is both:
• an abbreviation of International Thermonuclear Experimental Reactor and
• the Latin word meaning "the way"; see, e.g., [5].
The problem is that, as of now, this project is suspended, and one of the main reasons for this suspension is over-measurement; see, e.g., [6]. In a nutshell, the requirement was that, to guarantee safety, the level of danger, e.g., the level of radiation, was supposed to be below the safety threshold at a certain number of locations and scenarios.
• The current design does satisfy this criterion.
• However, the designers decided to be thorough and simulated more measurement situations. Unfortunately, some of the expected measurement results exceed the threshold.
As a result, the whole project is in suspension. Making sure that all future measurements satisfy the criterion would require a drastic redesign and a drastic further increase in the cost of the whole project, so drastic that it is doubtful that this additional funding will appear, especially in the current economic situation.
Why is it a paradox? If the designers had not performed these additional measurements, the design would have been approved and the project would have started. So in this case, additional measurements made the situation much worse, not only for the researchers, but for humankind as a whole. This is a clear situation where additional measurements do not help at all. But is it really a paradox? Maybe it is good that the project stopped; maybe the additional measurements revealed that the original design was unsafe?

What we do in this paper. In this paper, we analyze the situation from the general measurement viewpoint and come up with several conclusions:
• First, we show that this situation is, in principle, ubiquitous: a similar problem will surface in many other projects, including those that have already been approved and designed and seem to function OK.
• Second, although it may look like the problem is caused by insufficient safety of the original design, we show that this is not the case: practically any design, no matter how safe, will fail the currently used criteria if we perform sufficiently many measurements.
• Finally, we propose a natural suggestion on how to change the corresponding standards so as to avoid such unfortunate situations.
2 Analysis of the Problem

Let us formulate the situation in precise terms. We are interested in studying states of different systems. A usual way to describe each state is by describing the values of the corresponding quantities at different locations and at different moments of time. Usually, specifications include constraints on the values of some of these quantities. These may be constraints on the radioactivity level, constraints on the concentration of potentially harmful chemicals, on the temperature, etc. In all these cases, a typical constraint is that the value of some quantity q should not exceed some threshold q0: q ≤ q0.

How can we check this constraint: seemingly natural idea. In the ideal world, we should be able to measure the value q(x, t) at all possible spatial locations x and for all possible moments of time t, and check that all these values do not exceed q0. Of course, in real life, we can only perform finitely many measurements. So, a seemingly natural idea is to perform several measurements, and to check that all measurement results q1, …, qn do not exceed q0. However, it is known that this seemingly natural idea can lead to dangerous consequences; see, e.g., [8]. Let us explain why.

Why the above seemingly natural idea is dangerous. The actual value of the quantity q depends on many factors which are beyond our control. For example, the actual radioactivity level at a given location is affected by the natural radioactivity
level at this location, a level that can change based, e.g., on weather conditions, when the wind brings matter from neighboring areas where this natural level is somewhat higher. There are many small independent factors affecting the actual value of the quantity q. In addition, the measurement result is somewhat different from the actual value of the measured quantity; see, e.g., [8]. We may be able to get rid of major sources of such measurement errors, but there are always a lot of small independent factors that lead to small changes in the measurement results. Because of both types of random factors, the measured value differs from its locally-average level, and this difference is the result of the joint effect of a large number of small independent factors. It is known (see, e.g., [9]) that such a joint effect is usually well described by a normal (Gaussian) distribution. To be more precise:
• What is known is that in the limit, when the number N of small independent random factors increases (and the size of each factor appropriately decreases), the probability distribution of the joint effect of all these factors tends to the normal distribution, which thus appears as the limit of the actual distributions as N increases.
• By the definition of the limit, this means exactly that when the number N of factors is large, and in many practical situations it is large, the actual distribution is very close to normal.
So, with high accuracy, we can safely assume that this distribution is normal. This assumption explains why the above seemingly natural idea is dangerous. Indeed, what we have is several measurement results q1, …, qn, i.e., in effect, several samples from a normal distribution. Usually, measurement errors corresponding to different measurements are practically independent, and the same can be said about the random factors affecting the value of the quantity q at different spatial locations and at different moments of time.
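The central-limit argument above can be illustrated numerically: the sketch below sums many small independent uniform factors and compares the spread of the resulting sums with the normal-distribution prediction. The factor sizes and sample counts are arbitrary illustration values.

```python
import random

# Sketch of the central-limit argument: the joint effect of many small
# independent factors is approximately normal.
random.seed(0)
N = 100          # number of small independent factors per observation
trials = 5000    # number of simulated observations

# Each observation is a sum of N small uniform(-1/2, 1/2) factors.
sums = [sum(random.uniform(-0.5, 0.5) for _ in range(N)) for _ in range(trials)]

mean = sum(sums) / trials
std = (sum((s - mean) ** 2 for s in sums) / trials) ** 0.5
std_predicted = (N / 12.0) ** 0.5   # each uniform factor has variance 1/12

# For a normal distribution, about 95.4% of samples fall within 2 sigma.
within_2sigma = sum(abs(s - mean) <= 2 * std for s in sums) / trials
```

The observed spread matches the √(N/12) prediction, and the fraction of sums within two standard deviations is close to the normal-distribution value of about 95%.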
From this viewpoint, what we observe are n independent samples from a normal distribution. If we only require that qi ≤ q0, we thus require that max(q1, …, qn) ≤ q0. Usually, our resources are limited, so we try to make the minimal effort needed to satisfy the requirements. In other words, when we institute more and more efficient filters, thus slowly decreasing the values qi, and finally reach the condition max(q1, …, qn) ≤ q0, we stop and declare the design to be safe.
• We start with a design for which max(q1, …, qn) > q0.
• So the first time we satisfy the desired constraint max(q1, …, qn) ≤ q0 is when we get max(q1, …, qn) = q0.
This again may sound reasonable, but it is known that the probability that the next sample from the same distribution will exceed the maximum max(q1, …, qn) is equal to 1/(n + 1). So:
• even if we perform 40 measurements, and this is, e.g., what measurement theory requires for a thorough analysis of a measuring instrument (see, e.g., [8]),
• we get approximately a 1/40 ≈ 2.5% probability that next time, we will go beyond the safety threshold.
This is clearly not an acceptable level of safety, especially when we talk about serious, potentially deadly dangers like radioactivity or dangerous chemicals.

So what can be done to avoid this danger. To simplify our analysis, let us suppose that the mean value of q is 0. This can always be achieved if we simply subtract the actual mean value from all the measurement results, i.e., for example, consider not the actual radioactivity level, but the excess radioactivity over the average value of the natural radioactivity background. In this case, the measurement results q1, …, qn form a sample from a normal distribution with 0 mean and some standard deviation σ.
• Of course, no matter how small σ is, a normally distributed random variable always has a non-zero probability of being as large as possible, since the probability density function of a normal distribution is always positive, and never reaches 0.
• So, we cannot absolutely guarantee that all future values of q will be smaller than or equal to q0.
• We can only guarantee that the probability of this happening is smaller than some given probability p0, i.e., that Prob(q > q0) ≤ p0.
So, to drastically decrease the probability of a possible disaster, from the unsafe 2.5% to a much smaller safety level p0 ≪ 2.5%:
• instead of the original threshold q0,
• we select a smaller threshold q̃0 < q0 that guarantees that the conditional probability of exceeding q0 is small:

Prob(q > q0 | max(q1, …, qn) ≤ q̃0) ≤ p0.

In this case:
• if we have n measurement results q1, …, qn all below q̃0,
• then we guarantee, with almost-1 probability 1 − p0, that the next value will be below the actual danger threshold q0.
This value q̃0 depends on q0 and on the number of measurements n:

• the larger n,
• the larger the value q̃0.

When n increases, this value q̃0 tends to q0.

So what is included in the safety standard? When safety standards are designed, one of the objectives is to make them easy to follow:
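To make the role of the conditional probability Prob(q > q0 | max(q1, ..., qn) ≤ q̃0) concrete, here is a minimal Monte-Carlo sketch. It is an illustration only: it assumes that the standard deviation σ is known only up to a range (drawn uniformly from 0.5 to 2.0), and the danger threshold q0 = 4 and all other numbers are assumed values, not values from the text.

```python
import random

# Monte-Carlo sketch of Prob(q > q0 | max(q1, ..., qn) <= q0_tilde).
# The conditioning is informative here because sigma is not known exactly:
# sigma is drawn uniformly from an ASSUMED range [0.5, 2.0]; q0 = 4 and the
# candidate thresholds are also illustrative assumptions.
random.seed(0)

def conditional_exceed_prob(q0, q0_tilde, n, trials=30_000):
    exceed = passed = 0
    for _ in range(trials):
        sigma = random.uniform(0.5, 2.0)
        sample_max = max(random.gauss(0.0, sigma) for _ in range(n))
        if sample_max <= q0_tilde:                 # campaign classified as safe
            passed += 1
            if random.gauss(0.0, sigma) > q0:      # next value exceeds danger level
                exceed += 1
    return exceed / passed

p_loose = conditional_exceed_prob(q0=4.0, q0_tilde=4.0, n=40)
p_tight = conditional_exceed_prob(q0=4.0, q0_tilde=2.0, n=40)
print(p_loose, p_tight)   # a stricter testing threshold gives a smaller risk
```

Lowering the testing threshold q̃0 filters out the high-σ situations, which is exactly why the conditional exceedance probability drops.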
H. Reyes et al.
• We do not want practitioners–who need to follow these standards–to perform complex computations of conditional probabilities.
• We need to give them clear, simple recommendations.

From this viewpoint, the easiest thing to check is whether the measurement results satisfy a given inequality. So, a reasonable way to set up the corresponding standard is to specify:

• the new threshold q̃0 and
• the minimal necessary number of measurements n.

The standard then says that:

• if we perform n measurements, and the results q1, ..., qn of all these n measurements do not exceed the threshold q̃0, then the situation is safe;
• otherwise, the situation is not safe, and additional measures need to be undertaken to make this situation safer.

Resulting common misunderstanding. The fact that safety standards provide such a simplified description–and rarely mention the actual threshold q0 > q̃0–makes most people assume that the critical value q̃0 provided by a standard is the actual danger level, so that any situation in which a measured value exceeds q̃0 is unacceptable. This is exactly what happened in the above case study. And this is a wrong conclusion:

• if we perform a sufficiently large number of measurements,
• we will eventually get beyond any threshold.

Indeed, according to extreme value theory (see, e.g., [1–4, 7]), for a normal distribution with mean 0 and standard deviation σ, the average value An of the maximum max(q1, ..., qn) grows with n as

An ∼ γ · √(2 ln(n)) · σ,

where γ ≈ 0.5772 is Euler's constant:

γ = lim (n→∞) (1 + 1/2 + ... + 1/n − ln(n)).
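The growth of the expected maximum can be checked numerically; the sketch below (with an assumed σ = 1 and sample sizes 10, 100, 1000) simply verifies that the average of max(q1, ..., qn) keeps increasing with n, roughly following the √(2 ln n) rate quoted above.

```python
import math
import random

# Numerical illustration: the average of max(q1, ..., qn) for standard
# normal samples (sigma = 1 assumed) keeps growing with n.
random.seed(1)

def avg_max(n, trials=2000):
    return sum(max(random.gauss(0.0, 1.0) for _ in range(n))
               for _ in range(trials)) / trials

averages = {n: avg_max(n) for n in (10, 100, 1000)}
for n, a in averages.items():
    # compare against the sqrt(2 ln n) growth rate from the text
    print(n, round(a, 3), round(math.sqrt(2 * math.log(n)), 3))
```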
So, this mean value indeed grows with n.

Why does this problem surface only now? The Gaussian distribution was introduced by Gauss in the early 19th century, and measurements have been performed since antiquity, so why is this problem surfacing only now? Why did it not surface much earlier? The main reason, in our opinion, is that, until recently:

• sensors were reasonably expensive–especially accurate ones–and the cost of measurements was non-negligible;
• in this case, while it was in principle possible to perform more measurements than required for safety testing, this would have led to useless costs.

Lately, however:

• sensors have become very cheap: kids buy them to make robots, and the cheapest cell phones have very accurate sensors of position, acceleration, etc.;
• as a result, it is reasonably inexpensive to perform many more measurements than required;
• and, as we have mentioned, as a result, in situations that would previously–based on only the required number of measurements–be classified as safe, we now get values exceeding the threshold q̃0 provided by the standard–and thus, we end up classifying perfectly safe situations as unsafe.
3 So What Do We Propose

What is the problem now: summarizing our findings. The reason why we have the over-measurement paradox is that current safety standards usually list only two numbers:

• the recommended threshold q̃0 and
• the recommended number of measurements n.

The idea is that the results of all the measurements must not exceed q̃0 for the situation to be considered safe. The problem is that the recommended threshold q̃0 is actually not the safety threshold q0; it is smaller than the safety threshold–smaller so that, for the prescribed number of measurements n, we can guarantee that:

• for all future values,
• the probability of exceeding the real safety threshold q0 is smaller than the desired small value p0.

When, in an actually safe situation–one in which the probability of exceeding q0 does not exceed p0–we perform more measurements than recommended, it is eventually inevitable that some of them will be larger than the recommended threshold q̃0–even though they will still, with almost-1 probability, be not larger than the actual danger threshold q0. This leads to the following natural solution to the over-measurement problem.

Proposed solution: we need to change the standards. In addition to providing the two numbers q̃0 and n, we should provide a formula describing the dependence of the testing safety threshold t(n′) on the number n′ ≥ n of actually performed measurements, so that for all n′, we have

Prob(q > q0 | max(q1, ..., qn′) ≤ t(n′)) ≤ p0.
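Both halves of this picture can be illustrated with a short sketch: first, that a fixed testing threshold gets exceeded almost surely as the number n′ of measurements grows (the over-measurement paradox itself); second, one simple way–an illustrative choice, not taken from the text–to make a threshold t(n′) grow with n′, namely by keeping the probability that an entire safe campaign stays below t(n′) at a fixed level 1 − α. The quantile level 99.9% and α = 0.05 are assumed values, with σ = 1.

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, sigma 1 (assumed)

# 1) With a FIXED testing threshold (assumed here: the 99.9% quantile),
#    the chance that at least one of n' measurements exceeds it grows with n'.
t_fixed = nd.inv_cdf(0.999)
false_alarm = {n: 1.0 - nd.cdf(t_fixed) ** n for n in (40, 400, 4000)}
print({n: round(p, 3) for n, p in false_alarm.items()})

# 2) A growing threshold t(n'): choose it so that an entire safe campaign
#    of n' measurements stays below t(n') with fixed probability 1 - alpha.
alpha = 0.05
t = {n: nd.inv_cdf((1.0 - alpha) ** (1.0 / n)) for n in (40, 400, 4000)}
print({n: round(v, 2) for n, v in t.items()})
```

The first dictionary shows the false-alarm probability climbing toward 1; the second shows the compensating threshold t(n′) slowly increasing with n′.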
At the very least, we should provide the value t(n′) for several different values n′, thus taking care of the cases when, due to thoroughness, practitioners perform more measurements than required.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. J. Beirlant, Y. Goegebeur, J. Teugels, J. Segers, Statistics of Extremes: Theory and Applications (Wiley, Chichester, 2004)
2. L. de Haan, A. Ferreira, Extreme Value Theory: An Introduction (Springer Verlag, Berlin, Heidelberg, New York, 2006)
3. P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling Extremal Events for Insurance and Finance (Springer Verlag, Berlin, Heidelberg, New York, 2012)
4. E.J. Gumbel, Statistics of Extremes (Dover Publ., New York, 2004)
5. ITER Project, https://www.iter.org/
6. D. Kramer, Further delays at ITER are certain, but their duration isn't clear. Phys. Today 20–22 (2022)
7. J. Lorkowski, O. Kosheleva, V. Kreinovich, S. Soloviev, How design quality improves with increasing computational abilities: general formulas and case study of aircraft fuel efficiency. J. Adv. Comput. Intell. Intell. Inform. (JACIII) 19(5), 581–584 (2015)
8. S.G. Rabinovich, Measurement Errors and Uncertainty: Theory and Practice (Springer Verlag, New York, 2005)
9. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, Florida, 2011)
How to Get the Most Accurate Measurement-Based Estimates Salvador Robles, Martine Ceberio, and Vladik Kreinovich
Abstract In many practical situations, we want to estimate a quantity y that is difficult–or even impossible–to measure directly. In such cases, often, there are easier-to-measure quantities x1 , . . . , xn that are related to y by a known dependence y = f (x1 , . . . , xn ). So, to estimate y, we can measure these quantities xi and use the measurement results to estimate y. The two natural questions are: (1) within limited resources, what is the best accuracy with which we can estimate y, and (2) to reach a given accuracy, what amount of resources do we need? In this paper, we provide answers to these two questions.
1 Introduction Need for data processing and indirect measurements. One of the main objectives of science is to describe the current state of the world. The state of the world– and, in particular, the state of different objects and systems–is usually described by specifying the values of the corresponding quantities. For example, from the viewpoint of celestial mechanics, to describe the state of a planet, we need to know its location and its velocity. Some quantities we can measure directly. For example, we can directly measure the size of an office or the weight of a person. Other quantities are difficult to measure directly. For example, at present, there is no way to directly measure the size or the weight of a planet. To estimate the value of each such quantity y, a natural idea is to S. Robles · M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] S. Robles e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_28
find some easier-to-measure quantities x1, ..., xn that are related to y by a known dependence y = f(x1, ..., xn). Then, to estimate y:

• we measure these auxiliary quantities x1, ..., xn, and then
• we use the results x̂1, ..., x̂n of measuring these quantities to produce the estimate for y: ŷ = f(x̂1, ..., x̂n).

Computing this estimate is an important case of data processing, and the whole measurement-based procedure for estimating y is known as indirect measurement (in contrast to direct measurement, when we simply measure the value y); see, e.g., [1].

Another objective of science is to predict the future state of the world–and this future state is also characterized by the future values of the corresponding quantities. For example, we may want to predict tomorrow's weather, i.e., temperature, wind speed, etc. To predict the future value y of each of these quantities, we can use the current values x1, ..., xn of these quantities and of other related quantities, and we need to know how the future value y depends on these values xi, i.e., we need to have an algorithm y = f(x1, ..., xn) describing this prediction. In this case too, to estimate y:

• we measure the quantities x1, ..., xn, and then
• we use the results x̂1, ..., x̂n of measuring these quantities to produce the estimate for y: ŷ = f(x̂1, ..., x̂n).

This is another example of data processing and indirect measurement.

One of the main objectives of engineering is to describe how to make the world better. We may be looking for the best control strategy for a plant or for a car–one that will make it the most efficient and/or the least polluting–or we may be looking for a gadget design that leads to the best results. In all these cases, we need to find the optimal values of the quantities that characterize this control or this gadget. To find each such value y, we need to solve the corresponding optimization problem and find an algorithm y = f(x1, ..., xn) that determines this value based on the known values of the quantities xi that describe the corresponding environment. For example, the best control strategy for a petrochemical plant may depend on the amounts of different chemical compounds in the incoming oil. In this case too, to find the desired value y, we measure the values of these quantities, and we use the results of these measurements to estimate y, i.e., we also apply data processing.

We want the most accurate data processing results, but our resources are limited. In all the above problems, we would like to get an estimate ŷ which is as accurate as possible: we want the most accurate predictions of tomorrow's weather, we want the most accurate control of a petrochemical plant, etc. To get more accurate estimates of y, we need to measure the auxiliary quantities x1, ..., xn more accurately. However, high-accuracy measurements are more costly, and our resources are limited. So, we face the following problems:

• Within a limited budget, how accurately can we estimate the desired quantity y?
• When we want to estimate y with a given accuracy, how much money do we need?
2 Let Us Formulate Both Problems in Precise Terms

What do we need to do to formulate the problems in precise terms. In both problems, we want to find the relation between the cost of the corresponding measurements and the accuracy with which these measurements enable us to estimate y. So, to describe these problems in precise terms, we need to be able to describe the accuracy of the measurements:

• we need to know how these accuracies affect the accuracy of estimating y, and
• we need to know how the cost of a measurement depends on its accuracy.

Let us analyze these questions one by one.

What we know about measurement accuracy. Measurements are never 100% accurate, so the result x̂i of measuring the quantity xi is, in general, different from the actual value xi. The difference Δxi = x̂i − xi is known as the measurement error.

In most cases, many different factors affect the measurement accuracy. As a result, the measurement error is the joint effect of many different independent factors. According to the Central Limit Theorem (see, e.g., [2]), when the number of these factors is large, the probability distribution of these errors is close to Gaussian (normal).

In general, a normal distribution is uniquely determined by two parameters: mean and standard deviation. Measuring instruments are usually calibrated: before the instrument is released, its readings are compared with those of some more accurate (standard) measuring instrument. The value xi,st recorded by this more accurate measuring instrument is very close to the actual value xi: xi,st ≈ xi. So, the difference x̂i − xi,st between the two measurement results is very close to the measurement error Δxi = x̂i − xi: x̂i − xi,st ≈ Δxi. If the mean value of the measurement error is not 0, then, after a reasonable number of comparisons, the average value of the differences x̂i − xi,st will be close to this mean.
Once we realize that there is this mean difference, we can "recalibrate" the measuring instrument–namely, we can subtract this mean value (known as bias) from all the measurement results. For example, if a person knows that his/her watch is 5 minutes behind, this person can simply add 5 minutes to all the watch's readings–or, better yet, re-set the watch so that it shows the correct time.

Because of this calibration process, which eliminates possible biases, we can safely assume that the mean value of the measurement error is 0. Thus, the only characteristic that describes the measurement error is its standard deviation σi.

Usually, for different measurements, measurement errors are caused by different factors. Thus, the random variables Δxi describing different measurement errors are independent.

How measurement errors affect the accuracy with which we estimate y. We know that the desired quantity y is equal to y = f(x1, ..., xn). So, if we knew the exact
values of the quantities xi, we would be able to apply the algorithm f and get the exact value of y. As we have mentioned, in practice, we only know the measurement results x̂i, which are, in general, somewhat different from the actual values xi. Based on the measurement results, we compute the estimate ŷ = f(x̂1, ..., x̂n) for y. We want to find out how the estimation error Δy = ŷ − y depends on the accuracies with which we measure xi. Since we have argued that a reasonable description of each measurement accuracy is provided by the standard deviation σi, the question is: how does the estimation accuracy depend on these standard deviations σi?

To answer this question, let us first describe the estimation error Δy = ŷ − y in terms of the measurement errors Δxi. Substituting the expressions for ŷ and y into the formula that describes Δy, we conclude that

Δy = f(x̂1, ..., x̂n) − f(x1, ..., xn).

By definition of the measurement error Δxi = x̂i − xi, we have xi = x̂i − Δxi. Thus, the above expression for Δy takes the following form:

Δy = f(x̂1, ..., x̂n) − f(x̂1 − Δx1, ..., x̂n − Δxn).

Measurement errors Δxi are usually small, so terms which are quadratic (or of higher order) in Δxi can be safely ignored. Thus, we can expand the above expression in Taylor series in terms of Δxi and keep only the linear terms in this expansion. As a result, we get the following formula:

Δy = (∂f/∂x1)(x̂1, ..., x̂n) · Δx1 + ... + (∂f/∂xn)(x̂1, ..., x̂n) · Δxn,

i.e.,

Δy = c1 · Δx1 + ... + cn · Δxn,   (1)

where we denoted

ci = (∂f/∂xi)(x̂1, ..., x̂n).
According to the formula (1), the approximation error Δy is a linear combination of independent measurement errors Δxi. As we have mentioned, each measurement error is normally distributed with mean 0 and standard deviation σi. It is known that a linear combination of several independent normally distributed random variables is also normally distributed. Thus, we conclude that Δy is also normally distributed. As we have mentioned, a normal distribution is uniquely determined by its mean and its standard deviation σ.
The mean of a linear combination of random variables is equal to the linear combination of the means. Since the means of all the variables Δxi are 0, this implies that the mean of Δy is also 0.

The quantity Δy is the sum of n independent random variables ci · Δxi. When we multiply a random variable by a constant c, its variance is multiplied by c². So, for each term ci · Δxi, the variance is equal to ci² · σi². It is known that the variance σ² of the sum of several independent random variables is equal to the sum of their variances. Thus, we conclude that

σ² = c1² · σ1² + ... + cn² · σn².   (2)
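Formula (2) can be sanity-checked on a toy example. In the sketch below, the function f(x1, x2) = x1 · x2 and all numerical values are assumptions for illustration; the linearized standard deviation from (2) is compared with a direct Monte-Carlo estimate.

```python
import math
import random

# Toy check of formula (2) for y = f(x1, x2) = x1 * x2 (assumed example).
random.seed(2)
x1_hat, x2_hat = 3.0, 4.0          # measurement results (assumed)
s1, s2 = 0.01, 0.02                # standard deviations of the errors (assumed)

# Linearization: c1 = df/dx1 = x2, c2 = df/dx2 = x1, at the measured values.
c1, c2 = x2_hat, x1_hat
sigma_lin = math.sqrt(c1**2 * s1**2 + c2**2 * s2**2)   # formula (2)

# Monte-Carlo estimate of the same standard deviation.
trials = 100_000
vals = [(x1_hat + random.gauss(0, s1)) * (x2_hat + random.gauss(0, s2))
        for _ in range(trials)]
mean = sum(vals) / trials
sigma_mc = math.sqrt(sum((v - mean) ** 2 for v in vals) / trials)
print(round(sigma_lin, 4), round(sigma_mc, 4))
```

The two numbers agree closely because the measurement errors are small, so the quadratic terms dropped in the Taylor expansion are negligible.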
How the cost of measuring xi depends on the measurement accuracy. A natural way to increase the accuracy is to repeat the measurement several times and take the average of the measurement results. How does this affect the accuracy?

Let us assume that we have a measuring instrument that measures the quantity xi with mean error 0 and standard deviation mi. To increase the accuracy, we use this instrument ni times, resulting in ni measurement results x̂i,1, ..., x̂i,ni, and then we return the average as an estimate for the desired value xi:

x̂i = (x̂i,1 + ... + x̂i,ni)/ni.

What is the accuracy of this estimate? Subtracting the actual value xi from both sides of this equality, we get the following expression for the measurement error Δxi = x̂i − xi:

Δxi = (x̂i,1 + ... + x̂i,ni)/ni − xi.

To simplify this expression, we can combine the two terms in the right-hand side and then re-group the terms in the numerator:

Δxi = (x̂i,1 + ... + x̂i,ni − ni · xi)/ni = ((x̂i,1 − xi) + ... + (x̂i,ni − xi))/ni = (Δxi,1 + ... + Δxi,ni)/ni,   (3)

where Δxi,j = x̂i,j − xi is the measurement error of the j-th measurement.

The numerator of the formula (3) is the sum of ni independent random variables, so its variance is equal to the sum of their variances. The variance of each measurement error Δxi,j is mi², so the variance of the numerator is equal to ni · mi². When we
multiply a random variable by a constant c, its variance is multiplied by c². In our case, c = 1/ni; thus, the variance σi² of the measurement error Δxi is equal to

σi² = (1/ni)² · ni · mi² = mi²/ni.   (4)
What about the cost? Let di be the cost of each individual measurement. When we repeat the measurement ni times, the overall cost of measuring the i-th quantity xi is ni times larger, i.e.,

Di = ni · di.   (5)

Let us analyze how the cost depends on accuracy. Once we are given the desired accuracy σi, we can find, from the formula (4), the value ni corresponding to this accuracy:

ni = mi²/σi².

Substituting this value ni into the formula (5), we get

Di = (mi²/σi²) · di,

i.e.,

Di = Ci/σi²,   (6)

where we denoted

Ci = mi² · di.   (7)
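The chain (4) → (5) → (6)–(7) can be traced on assumed numbers: an instrument with single-measurement noise mi = 0.5 and per-measurement cost di = 2 (illustrative values, not from the text).

```python
# Tracing formulas (4)-(7) on assumed illustrative numbers.
m_i, d_i = 0.5, 2.0                  # single-measurement noise and cost (assumed)
C_i = m_i ** 2 * d_i                 # formula (7)

target_sigma = 0.05                  # desired accuracy sigma_i (assumed)
n_i = m_i ** 2 / target_sigma ** 2   # repetitions needed, from formula (4)
D_i = n_i * d_i                      # total cost, formula (5)

# Formula (6) gives the same total cost directly: D_i = C_i / sigma_i^2.
print(round(n_i), round(D_i, 6), round(C_i / target_sigma ** 2, 6))
```

Halving the target σi quadruples ni and hence quadruples the cost, which is exactly the 1/σi² dependence in (6).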
Comment. In some cases, to get a more accurate measurement, we explicitly repeat a measurement several times. An example is the way some devices for measuring blood pressure work: they measure blood pressure three times and take the average. Some measuring devices do it implicitly: e.g., super-precise clocks usually consist of several independent clocks, and the result is obtained by processing the readings of all these clocks. In general, it is reasonable to assume that the dependence of measurement cost on measurement accuracy has the form (6)–(7). This allows us to answer a similar question about estimating y.
How the cost of estimating y depends on the estimation accuracy. If we measure each quantity xi with accuracy σi, then the accuracy σ² of the resulting estimate for y is determined by the formula (2), and the overall measurement cost D = D1 + ... + Dn can be obtained by adding up the costs of all n measurements:

D = C1/σ1² + ... + Cn/σn².   (8)

So, we arrive at the following precise formulation of the above two problems.

What we want. In both problems, we want to find the accuracies σi with which we need to measure each quantity xi.

Formulating the first problem in precise terms. A limited budget means that our expenses D are bounded by some given value D0. This means that we want to find the values σi for which the estimation error (2) is the smallest possible under the constraint D ≤ D0, i.e.,

C1/σ1² + ... + Cn/σn² ≤ D0.   (9)

Formulating the second problem in precise terms. In the second problem, we are given the accuracy σ² with which we want to estimate y. So, we want to find the values σi for which the cost (8) is the smallest possible under the constraint (2).
3 Within Limited Resources, What Is the Best Accuracy with Which We Can Estimate y?

Analysis of the problem. In this problem, to find the desired values σi, we minimize the expression (2) under the non-strict inequality constraint (9). The minimum of a function under a non-strict inequality constraint is attained:

• either when the inequality is strict–in this case, it is a local minimum of the objective function (2),
• or when the inequality becomes an equality.

Let us show that the first case is not possible. Indeed, at a local minimum, all derivatives of the objective function (2) should be equal to 0. If we differentiate the expression (2) with respect to each unknown σi and equate the derivative to 0, we get σi = 0. In this case, the left-hand side of the inequality (9) is infinite, so this inequality is not satisfied. Thus, the desired minimum occurs when the inequality (9) becomes an equality, i.e., when we have
C1/σ1² + ... + Cn/σn² = D0.   (10)

To minimize the objective function (2) under the equality constraint (10), we can use the Lagrange multiplier method, according to which this constrained optimization problem can be reduced, for an appropriate value λ, to the unconstrained optimization problem of minimizing the expression

c1² · σ1² + ... + cn² · σn² + λ · (C1/σ1² + ... + Cn/σn² − D0),   (11)

where the value λ can be determined by the condition that the minimizing values σi satisfy the equality (10). Differentiating the expression (11) with respect to σi and equating the derivative to 0, we conclude that

2ci² · σi − 2λ · (Ci/σi³) = 0,

i.e., equivalently,

2ci² · σi = 2λ · (Ci/σi³).

To find an explicit expression for σi, we multiply both sides by σi³ and divide both sides by 2ci². As a result, we get

σi⁴ = λ · (Ci/ci²),

thus

σi² = √λ · (√Ci/|ci|).   (12)
To find λ, we substitute this expression into the equality (10) and get

D0 = C1 · |c1|/(√λ · √C1) + ... + Cn · |cn|/(√λ · √Cn) = (1/√λ) · (√C1 · |c1| + ... + √Cn · |cn|).

Thus,

√λ = (1/D0) · (√C1 · |c1| + ... + √Cn · |cn|).

Substituting this expression for √λ into the formula (12), we get the following answer.
Optimal selection of measuring instruments. To get the best accuracy under limited resources, we need to select measuring instruments for which

σi² = (1/D0) · (√C1 · |c1| + ... + √Cn · |cn|) · (√Ci/|ci|).   (13)

Substituting these expressions into the formula (2), we get the best accuracy σ² that we can achieve under this resource limitation:

σ² = (1/D0) · (√C1 · |c1| + ... + √Cn · |cn|) · (√C1 · |c1| + ... + √Cn · |cn|),

i.e.,

σ² = (1/D0) · (√C1 · |c1| + ... + √Cn · |cn|)².   (14)
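Formulas (13)–(14) can be verified numerically on a small assumed example; the sketch also compares the optimal allocation with a naive equal split of the same budget.

```python
import math

# Checking formulas (13)-(14) on an assumed three-quantity example.
c = [1.0, 2.0, 0.5]        # sensitivities c_i (assumed)
C = [4.0, 1.0, 9.0]        # cost constants C_i (assumed)
D0 = 100.0                 # budget (assumed)

S = sum(math.sqrt(Cj) * abs(cj) for Cj, cj in zip(C, c))
sig2_opt = [S * math.sqrt(Ci) / (abs(ci) * D0) for Ci, ci in zip(C, c)]  # (13)
var_opt = sum(ci**2 * s2 for ci, s2 in zip(c, sig2_opt))   # should equal S^2/D0, i.e. (14)
cost_opt = sum(Ci / s2 for Ci, s2 in zip(C, sig2_opt))     # should use the full budget D0

# Naive alternative: spend D0/n on measuring each quantity.
sig2_naive = [Ci / (D0 / len(C)) for Ci in C]
var_naive = sum(ci**2 * s2 for ci, s2 in zip(c, sig2_naive))
print(round(var_opt, 4), round(var_naive, 4), round(cost_opt, 4))
```

The optimal allocation attains a strictly smaller variance than the equal split for the same total cost, as formula (14) predicts.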
4 To Reach a Given Accuracy, What Amount of Resources Do We Need?

Analysis of the problem. In this problem, to find the desired values σi, we minimize the expression (8) under the constraint (2). To minimize the objective function (8) under the equality constraint (2), we can use the Lagrange multiplier method, according to which this constrained optimization problem can be reduced, for an appropriate value λ, to the unconstrained optimization problem of minimizing the expression

C1/σ1² + ... + Cn/σn² + λ · (c1² · σ1² + ... + cn² · σn² − σ²),   (15)

where the value λ can be determined by the condition that the minimizing values σi satisfy the equality (2). Differentiating the expression (15) with respect to σi and equating the derivative to 0, we conclude that

−2 · (Ci/σi³) + 2λ · ci² · σi = 0,

i.e., equivalently,

2 · (Ci/σi³) = 2λ · ci² · σi.

To find an explicit expression for σi, we multiply both sides by σi³ and divide both sides by 2λ · ci². As a result, we get
σi⁴ = (1/λ) · (Ci/ci²),

thus

σi² = (1/√λ) · (√Ci/|ci|).   (16)
To find λ, we substitute this expression into the equality (2) and get

σ² = c1² · (1/√λ) · (√C1/|c1|) + ... + cn² · (1/√λ) · (√Cn/|cn|) = (1/√λ) · (√C1 · |c1| + ... + √Cn · |cn|).

Thus,

1/√λ = σ²/(√C1 · |c1| + ... + √Cn · |cn|).   (17)
Substituting this expression into the formula (16), we get the following answer.

Optimal selection of measuring instruments. To get the smallest cost guaranteeing the given accuracy σ², we need to select measuring instruments for which

σi² = (σ²/(√C1 · |c1| + ... + √Cn · |cn|)) · (√Ci/|ci|).   (18)

Substituting these expressions into the formula (8), we get the smallest cost that we can achieve to provide the desired accuracy:

D = (1/σ²) · (√C1 · |c1| + ... + √Cn · |cn|) · (√C1 · |c1| + ... + √Cn · |cn|),

i.e.,

D = (1/σ²) · (√C1 · |c1| + ... + √Cn · |cn|)².   (19)
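Similarly, formulas (18)–(19) can be checked on assumed numbers; note that the resulting minimal cost is the mirror image of formula (14): cost and achievable variance trade off through the same constant (√C1·|c1| + ... + √Cn·|cn|)².

```python
import math

# Checking formulas (18)-(19) on the same kind of assumed example.
c = [1.0, 2.0, 0.5]        # sensitivities c_i (assumed)
C = [4.0, 1.0, 9.0]        # cost constants C_i (assumed)
sigma2 = 0.25              # required accuracy sigma^2 (assumed)

S = sum(math.sqrt(Cj) * abs(cj) for Cj, cj in zip(C, c))
sig2 = [sigma2 * math.sqrt(Ci) / (abs(ci) * S) for Ci, ci in zip(C, c)]  # (18)
achieved = sum(ci**2 * s2 for ci, s2 in zip(c, sig2))   # constraint (2) holds
D = sum(Ci / s2 for Ci, s2 in zip(C, sig2))             # minimal cost
print(round(achieved, 6), round(D, 6), round(S**2 / sigma2, 6))  # D matches (19)
```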
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. S.G. Rabinovich, Measurement Errors and Uncertainty: Theory and Practice (Springer Verlag, New York, 2005)
2. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, Florida, 2011)
How to Estimate the Present Serviceability Rating of a Road Segment: Explanation of an Empirical Formula Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich
Abstract An accurate estimation of the road quality requires a lot of expertise, and there are not enough experts to provide such estimates for all the road segments. It is therefore desirable to estimate this quality based on easy-to-estimate and easy-to-measure characteristics. Recently, an empirical formula was proposed for such an estimate. In this paper, we provide a theoretical explanation for this empirical formula.
E. D. Rodriguez Velasquez
Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru
e-mail: [email protected]; [email protected]
Department of Civil Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_29

1 Formulation of the Problem

Empirical formula. In pavement engineering, it is important to gauge the remaining quality of the road pavement based on its current status. The pavement quality is estimated by experts on a scale from 0 to 5. The corresponding expert estimate is known as the Present Serviceability Index (PSI). However, estimating the road quality requires a lot of expertise and a lot of experience, and there are not enough experts to gauge the quality of all the road segments. It is therefore desirable to estimate this value based on easy-to-observe and easy-to-measure characteristics. Such estimates of PSI are known as the Present Serviceability Rating (PSR).

Several characteristics can be used to gauge the pavement quality. One of the most widely used characteristics is the Pavement Condition Index (PCI), which is a combination of values characterizing different observed and measured pavement
faults. In principle, it is possible to provide a rough estimate of PSR based on PCI. However, to get a more accurate estimate of PSR, it is desirable to explicitly take into account the values of the most important characteristics of the pavement imperfection.

In [1], an empirical formula was presented for estimating the Present Serviceability Rating (PSR) of a road segment based on the PCI and on the following three specific characteristics of the pavement:

• the mean rut depth of the pavement RD,
• the area of medium and high severity patching Patch, and
• the sum of the lengths of medium and high severity cracks LTCrack.

This formula has the following form:

PSR = a0 + aPCI · ln(PCI) + aPatch · Patch + aPatch,2 · Patch² + aPatch,3 · Patch³ + aRD · RD² + aLTCrack · LTCrack · RD.

Problem. How can we explain this empirical formula? In other words, how can we answer the following questions–and similar other questions:

• Why are there up to cubic terms in Patch but only linear terms in the crack length–which seems to indicate that higher-order terms in this length did not lead to any statistically significant improvement of the resulting estimate?
• Why are there quadratic terms in the rut depth but only linear terms in the crack length?

What we do in this paper. In this paper, we provide a theoretical explanation for this formula. This explanation provides answers to both questions above–and to several similar questions.
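The empirical formula can be transcribed directly into code. The coefficient values in the sketch below are placeholders chosen only so that the example runs and behaves plausibly (higher PCI raises PSR, larger faults lower it); the actual fitted coefficients are given in [1] and are not reproduced here.

```python
import math

# Direct transcription of the empirical PSR formula above.
# WARNING: all coefficient defaults are HYPOTHETICAL placeholders,
# not the fitted values from [1].
def psr(pci, patch, rd, ltcrack,
        a0=0.0, a_pci=1.08,
        a_patch=-0.01, a_patch2=1e-4, a_patch3=-1e-7,
        a_rd=-0.1, a_ltcrack=-0.001):
    return (a0 + a_pci * math.log(pci)
            + a_patch * patch + a_patch2 * patch**2 + a_patch3 * patch**3
            + a_rd * rd**2 + a_ltcrack * ltcrack * rd)

good = psr(pci=80, patch=10, rd=0.5, ltcrack=20)   # mild faults
worn = psr(pci=80, patch=10, rd=2.0, ltcrack=20)   # deeper rutting
print(round(good, 2), round(worn, 2))
```

With these placeholder signs, a deeper rut lowers the estimated serviceability, matching the qualitative behavior the formula is meant to capture.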
2 Our Explanation

How to explain the dependence on PCI. In the above empirical formula, the Present Serviceability Rating (PSR) depends logarithmically on the Pavement Condition Index (PCI). In general, the logarithmic dependence is one of the basic dependencies corresponding to scale-invariance; see, e.g., [2]. So, such scale-invariance is the most probable explanation of the logarithmic dependence of PSR on PCI.

Remaining questions: dependence on other characteristics. The remaining questions are: how can we explain the dependence on the other characteristics–the mean rut depth RD, the patching area Patch, and the cracking length LTCrack?

What is important is the area. The larger the area affected by a fault, the larger this fault's effect on the traffic. Thus, in the first approximation, the effect is proportional to the area.
This means that if a fault is described by its length–as cracks are–or by its depth–as the rut depth–the effect should be, in the first approximation, proportional to the corresponding area, i.e., to expressions which are quadratic in terms of these characteristics: RD², LTCrack², and LTCrack·RD.

From this viewpoint, in the first approximation, the important terms are the terms which are linear in Patch, RD², LTCrack², and LTCrack·RD. To get a more accurate description, we may need to add terms which are quadratic, cubic, etc., in terms of these characteristics.

How faults appear: a reminder. Which of the possible terms are the most important? To answer this question, let us recall how faults appear.

• First, we have a pavement deformation–which is described by the rut depth.
• After some time, cracks appear.

How this affects the relative size of terms depending on crack length and rut depth. Since the cracks appear much later than the rut, they have less time to develop, and thus the effect of the crack size is, in general, much smaller than the effect of the rut depth: RD ≫ LTCrack. This implies that

RD² ≫ RD·LTCrack ≫ LTCrack².

The largest of these three terms is RD², so we need to take this term into account. However, if we only take this term into account, we will completely ignore the effect of the cracks–and cracks are important. So, to take cracks into account, we need to consider at least one more term, RD·LTCrack. So, a reasonable idea is to consider terms which are linear in RD² and in LTCrack·RD.

What about patches? When the rut and the cracks drastically decrease the pavement quality, the pavement is patched. Current patching technology does not allow precise patching that would cover only the affected areas. As a result, the patch does not cover just the cracks and the rut: it covers a much larger region.
Thus, the patch area Patch is much larger than the terms RD², LTCrack², and LTCrack·RD describing the actual faults. This explains the appearance of higher-order patch-related terms: since the relative area covered by patches is larger than the ratio of RD² to the pavement area, the terms quadratic and even cubic in Patch are much larger than the similar terms quadratic and cubic in RD². Thus, even though terms quadratic and cubic in RD² can be safely ignored, the much larger terms–quadratic and cubic in Patch–cannot be ignored and have to be present in the estimation formula. All this explains the above empirical formula.
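To illustrate the structure of the resulting estimation formula, here is a hedged sketch: a least-squares fit over exactly the candidate terms identified above–ln(PCI), RD², LTCrack·RD, and Patch up to the cubic term. All measurements and coefficients below are synthetic illustrations, not the paper's actual formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# hypothetical pavement measurements (synthetic, for illustration only)
PCI = rng.uniform(10, 100, n)      # Pavement Condition Index
RD = rng.uniform(0, 2, n)          # mean rut depth
LTCrack = rng.uniform(0, 50, n)    # cracking length
Patch = rng.uniform(0, 30, n)      # patching area

# candidate terms suggested by the analysis above
terms = np.column_stack([
    np.ones(n),       # intercept
    np.log(PCI),      # logarithmic dependence on PCI
    RD ** 2,          # main deformation term
    LTCrack * RD,     # crack term
    Patch,            # patch terms, up to cubic
    Patch ** 2,
    Patch ** 3,
])

# synthetic "true" PSR, only to exercise the fit; coefficients are made up
psr = 1.5 * np.log(PCI) - 0.3 * RD ** 2 - 0.02 * Patch + rng.normal(0, 0.05, n)

coef, *_ = np.linalg.lstsq(terms, psr, rcond=None)
print(np.round(coef, 3))
```

With real PSR ratings in place of the synthetic ones, the same fit would produce the coefficients of an estimation formula of the form argued for above.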
E. D. Rodriguez Velasquez and V. Kreinovich
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. J. Bryce, R. Boadi, J. Groeger, Relating pavement condition index and present serviceability rating for asphalt-surfaced pavements. Transp. Res. Rec. 2673(3), 308–312 (2019) 2. E.D. Rodriguez Velasquez, V. Kreinovich, O. Kosheleva, Invariance-based approach: general methods and pavement engineering case study. Int. J. Gen. Syst. 50(6), 672–702 (2021)
Applications to Linguistics
Word Representation: Theoretical Explanation of an Empirical Fact Leonel Escapita, Diana Licon, Madison Anderson, Diego Pedraza, and Vladik Kreinovich
Abstract There is a reasonably accurate empirical formula that predicts, for two words i and j, the number X_ij of times when the word i will appear in the vicinity of the word j. The parameters of this formula are determined by using the weighted least squares approach. Empirically, the predictions are the most accurate if we use weights proportional to a power of X_ij. In this paper, we provide a theoretical explanation for this empirical fact.
1 Formulation of the Problem

Background: need to describe the relation between words from natural language by a numerical characteristic. To better understand and process natural-language texts, computers need to have an accurate description of the meaning of different words and different texts. In particular, we need to represent, in a computer, to what extent different words from natural language are related to each other. This relation can be described, e.g., by the number X_ij of times when word i appears in the context of word j.
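A minimal sketch of how such counts X_ij can be gathered from text (the toy corpus and window size are illustrative assumptions):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count X[(i, j)]: how often word i appears within
    `window` positions of word j (symmetric)."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        for neighbor in tokens[lo:pos]:
            counts[(word, neighbor)] += 1
            counts[(neighbor, word)] += 1
    return counts

corpus = "the cow gives milk and the cow eats grass".split()
X = cooccurrence_counts(corpus)
print(X[("cow", "milk")])  # 1
```

On a realistic corpus, the same counting scheme (with a larger window and many more tokens) produces the matrix of values X_ij discussed below.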
L. Escapita · D. Licon · M. Anderson · D. Pedraza · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] L. Escapita e-mail: [email protected] D. Licon e-mail: [email protected] M. Anderson e-mail: [email protected] D. Pedraza e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_30
How this characteristic can be estimated: a straightforward approach. In principle, we can get the values X_ij by analyzing several natural-language texts. The more texts we analyze, the more accurately we can represent this dependence. However, there are so many natural-language texts that it is not feasible to process them all, so we have to limit ourselves to the values obtained by processing some of the available texts.

Limitations of the straightforward approach. In these texts, for words which are closely related–e.g., “cow” and “milk”–we will probably have a sufficient number of situations in which one of these two words appeared in the context of the other. In such situations, the values that we measure based on these texts provide a statistically accurate description of this relation–and we can use this relation to predict how many pairs we will have if we add additional texts to our analysis. On the other hand, if the two words are not so closely related, then in the current texts, we may have only a few examples of such pairs. In general, in statistics, when the sample is small, the corresponding estimates are not very accurate and do not lead to good predictions.

How can we get more accurate estimates: a general idea. The good news is that, in general, predictions are not only based on statistics–otherwise, we would never be able to predict rare events like solar eclipses–they are also based on the known dependencies between the corresponding quantities. For example, if we want to know with what force two charged bodies attract or repel each other, we do not need to perform experiments with all possible pairs of such bodies: we know Coulomb's law, according to which we can predict this force if we know the charges of both bodies (and the distance between them). Similarly, Newton's law of gravitation allows us to predict the gravitational force between two bodies if we know their masses (and the distance between the bodies).
It is therefore desirable to look for a similar dependence–one that would describe the quantity X_ij characterizing the relation between the two words in terms of some numerical characteristics of the two words i and j.

How this general idea is used. At present, these characteristics are determined by training a neural network; see, e.g., [2] and references therein. There is also a reasonably good approximate analytical formula describing this dependence (see, e.g., [3]):

ln(X_ij) ≈ b_i + b_j + w_i · w_j,

where b_i and b_j are numbers, w_i and w_j are vectors, and a · b denotes the dot (scalar) product. The values b_i, b_j, w_i, and w_j can be found by using the Least Squares method, i.e., by solving the minimization problem

J =def Σ_{i,j} f(X_ij) · (b_i + b_j + w_i · w_j − ln(X_ij))² → min
for an appropriate weight function f(X).

Empirical fact. The efficiency of this method depends on the appropriate choice of the weight function f(X). Empirical data shows that the most efficient weight function is the power law f(X) = X^a.

Remaining problem. How can we explain this empirical fact?

What we do in this paper. In this paper, we provide a theoretical explanation for this fact.
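A hedged sketch of this weighted minimization on a toy count matrix, using plain gradient descent (the matrix size, vector dimension, learning rate, and the choice a = 0.75 are illustrative assumptions; the actual fitting procedure of [3] differs in its details):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(1, 50, size=(5, 5)).astype(float)  # toy co-occurrence counts
a = 0.75          # power-law exponent of the weight function f(X) = X**a
f = X ** a

dim, lr = 3, 0.005
w = rng.normal(0, 0.1, (5, dim))    # word vectors w_i
wc = rng.normal(0, 0.1, (5, dim))   # context vectors w_j
b = np.zeros(5)                     # biases b_i
bc = np.zeros(5)                    # biases b_j
logX = np.log(X)

def objective():
    err = b[:, None] + bc[None, :] + w @ wc.T - logX
    return (f * err ** 2).sum(), err

J0, _ = objective()
for _ in range(5000):
    _, err = objective()
    g = f * err                     # weighted residuals
    dw, dwc = g @ wc, g.T @ w
    db, dbc = g.sum(axis=1), g.sum(axis=0)
    w -= lr * dw
    wc -= lr * dwc
    b -= lr * db
    bc -= lr * dbc

J, _ = objective()
print(J0, "->", J)
```

The printed objective value decreases as the biases and vectors adjust to reproduce ln(X_ij); the role of the weights f(X_ij) is to emphasize the well-sampled (large-count) pairs.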
2 Our Explanation

Analysis of the problem. The values X_ij depend on the size of the corpus. For example, if we consider a corpus half the size, each value X_ij will decrease approximately by half. In general, if we consider a λ times larger corpus, we will get new values which are close to λ · X_ij.

A natural requirement. The word representation should depend only on the words themselves, not on the corpus size. So, the resulting representation should not change if we replace X_ij with λ · X_ij.

How to describe this requirement in precise terms: discussion. Of course, if we replace X_ij with λ · X_ij, the weights will change. However, this does not necessarily mean that the resulting representations will change. Namely, for any c > 0, optimizing any function J is equivalent to optimizing the function c · J. For example, if we are looking for the richest person on Earth, the same person will be selected as the richest whether we count his wealth in dollars or in pesos. So, if we replace f(X) with c · f(X), we will get a new objective function c · J instead of the original objective function J, but we will get the same representations w_i.

Resulting requirement. In view of the above discussion, if we have f(λ · X) = c · f(X), then the resulting representation will be the same. So, invariance with respect to the corpus size can be described as follows: for every real number λ > 0, there exists a real number c > 0, depending on λ, for which f(λ · X) = c(λ) · f(X).

What are the consequences of this requirement. It is known that all measurable solutions to the functional equation f(λ · X) = c(λ) · f(X) are power laws f(X) = A · X^a; see, e.g., [1]. So, the invariant weight function should have the form
f(X) = A · X^a. As we have mentioned, multiplying all the values of the objective function by a constant does not change the resulting optimizing values. Because of this, the weight function A · X^a and the weight function X^a lead to the same values w_i. Thus, it is sufficient to consider the weight function f(X) = X^a.

We get the desired explanation. This explains why the power law weights work the best: power law weights are the only ones for which the resulting representation does not depend on the corpus size.
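A quick numeric check of the defining property f(λ · X) = c(λ) · f(X) for the power-law weight (the values of a, λ, and X are arbitrary):

```python
def f(X, a=0.75):
    # power-law weight function f(X) = X**a
    return X ** a

# f(lam * X) should equal c(lam) * f(X) with c(lam) = lam ** a
for lam in (0.5, 2.0, 10.0):
    for X in (1.0, 3.0, 7.5):
        assert abs(f(lam * X) - lam ** 0.75 * f(X)) < 1e-12
print("power-law weights are invariant under corpus-size rescaling")
```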
3 How to Prove the Result About the Functional Equation

The above result about the functional equation is easy to prove when the function f(X) is differentiable. Indeed, suppose that f(λ · X) = c(λ) · f(X). If we differentiate both sides with respect to λ, we get

X · f′(λ · X) = c′(λ) · f(X).

In particular, for λ = 1, we get X · f′(X) = a · f(X), where a =def c′(1), i.e.,

X · (df/dX) = a · f.

We can separate the variables if we multiply both sides of this equality by dX/(X · f); then we get

df/f = a · (dX/X).

Integrating both sides of this equality, we get ln(f) = a · ln(X) + C. By applying exp(x) to both sides, we get

f(X) = exp(a · ln(X) + C) = A · X^a, where A =def e^C,

which is exactly the desired formula.
Acknowledgements This work was supported in part by the National Science Foundation grants: • 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and • HRD-1834620 and HRD-2034030 (CAHSI Includes). It was also supported by the AT&T Fellowship in Information Technology, and by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478. The authors are thankful to the participants of the 2022 UTEP/NMSU Workshop on Mathematics, Computer Science, and Computational Science (El Paso, Texas, November 5, 2022) for valuable discussions.
References 1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, 2008) 2. D. Jurafsky, J.H. Martin, Speech and Language Processing (Prentice Hall, Upper Saddle River, New Jersey, 2023) 3. J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing EMNLP, Doha, Qatar (2014), pp. 1532–1543
Why Menzerath’s Law? Julio Urenda and Vladik Kreinovich
Abstract In linguistics, there is a dependence between the length of the sentence and the average length of the word: the longer the sentence, the shorter the words. The corresponding empirical formula is known as Menzerath's Law. A similar dependence can be observed in many other application areas, e.g., in the analysis of genomes. The fact that the same dependence is observed in many different application domains seems to indicate that there should be a general domain-independent explanation for this law. In this paper, we show that, indeed, this law can be derived from natural invariance requirements.
1 Formulation of the Problem

Menzerath's law: a brief description. It is known that in linguistics, in general, the longer the sentence, the shorter its words. There is a formula–known as Menzerath's Law–that describes the dependence between the average length x of the word and the length y of the corresponding sentence: y = a · x^(−b) · exp(−s · x); see, e.g., [2, 3, 5–7, 9]. To be more precise, the original formulation of this law described the dependence between the number Y of words in a sentence and the average length of the word: Y = a · x^(−B) · exp(−s · x). However, taking into account that the average length x of the word is equal to y/Y, we conclude that y = x · Y = a · x^(−(B−1)) · exp(−s · x), i.e., that the relation between y and x has exactly the same form.
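The algebraic step above–multiplying Y = a · x^(−B) · exp(−s · x) by x to get y = a · x^(−(B−1)) · exp(−s · x)–can be checked numerically (the parameter values are arbitrary illustrations):

```python
import math

a, B, s = 2.0, 1.4, 0.3   # arbitrary illustrative parameters

def Y(x):
    # number of words in the sentence, as a function of average word length
    return a * x ** (-B) * math.exp(-s * x)

def y(x):
    # sentence length: y = x * Y, same functional form with exponent -(B - 1)
    return a * x ** (-(B - 1)) * math.exp(-s * x)

for x in (0.5, 1.0, 2.5, 4.0):
    assert math.isclose(x * Y(x), y(x), rel_tol=1e-12)
print("y = x * Y has the same functional form as Y")
```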
J. Urenda Departments of Mathematical Sciences and Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_31
Menzerath’s law is ubiquitous. A similar dependence was found in many other application areas. For example, the same formula describes the dependence between the length of DNA and the lengths of genes forming this DNA; see, e.g., [4, 8]. Challenge. Since this law appears in many different application areas, there must be a generic explanation. The main objective of this paper is to provide such an explanation.
2 Analysis of the Problem

There are different ways to describe length. We can describe the length of a word or a phrase by the number of letters in it. We can describe this length by the number of bits or bytes needed to store this word in the computer. Alternatively, we can describe the word in an international phonetic alphabet; this usually makes its description longer. Some of these representations make the length larger, some smaller. A good example is the possibility to describe the length in bits or in bytes. In this case, since 1 byte is 8 bits, the length of the word in bits is exactly 8 times larger than its length in bytes. In general, we can use different units for measuring length, and, on average, when we use a different unit, the length x in the original units becomes the length x′ = c · x in the new units, where c is the ratio between the two units.

For the length of the word x, there is an additional possibility. For example, in many languages, the same meaning can be described in two ways: by a prefix or a postfix, or by a preposition. For example, when Julio owns a book, we can say that it is a book of Julio, or we can say that it is Julio's book. Similarly, we can say that a function is not linear or that a function is non-linear. There are many cases like this. In all these examples, in the first case, we have two words, while in the second case, we have one longer word–a word to which a “tail” of fixed length was added. So, if we perform the transformation from the first representation to the second one, then the average length of the meaningful words will increase by a constant x0: the average length of such an addition (and the number of words in general will decrease). In this case, the average length of the word changes from x to x′ = x + x0.

The relation between x and y should not depend on how we describe length. As we have mentioned, there are several different ways to describe both the length of the phrase y and the average length of the words x.
There seems to be no reason to conclude that some of these ways are preferable. It is therefore reasonable to require that the dependence y = f(x) should have the same form no matter what representation we use: if we change the way we describe the length x, the dependence between x and y should remain the same. Of course, this does not mean that if we change from x to x′, we should have the same formula y = f(x′). For example, the dependence d = v · t describing how the path depends on the velocity v and the time t does not change if we change the unit of velocity, e.g., from km/h to miles per hour. However, for the formula to remain valid, we need to also change the unit of distance, from km to miles.
In general, invariance of the relation y = f(x) means that for each re-scaling x → x′ of the input x, there should exist an appropriate re-scaling y → y′ of the output y such that if we have y = f(x) in the original units, then we should have the exact same relation y′ = f(x′) in the new units. Let us describe what this invariance requirement implies for the above two types of re-scaling.

Invariance with respect to scaling x → c · x. For this re-scaling, invariance means that for every c > 0, there exists a value C(c) (depending on c) for which y = f(x) implies that y′ = f(x′), where y′ = C(c) · y and x′ = c · x. Substituting the expressions for x′ and y′ into the formula y′ = f(x′), we conclude that C(c) · y = f(c · x). Since y = f(x), we conclude that C(c) · f(x) = f(c · x). It is known (see, e.g., [1]) that every measurable function (in particular, every definable function) that satisfies this functional equation has the form f(x) = A · x^p for some real numbers A and p.

Invariance with respect to shift x → x + x0. For this re-scaling, invariance means that for every x0, there exists a value C(x0) (depending on x0) for which y = f(x) implies that y′ = f(x′), where y′ = C(x0) · y and x′ = x + x0. Substituting the expressions for x′ and y′ into the formula y′ = f(x′), we conclude that C(x0) · y = f(x + x0). Since y = f(x), we conclude that C(x0) · f(x) = f(x + x0). It is known (see, e.g., [1]) that every measurable function (in particular, every definable function) that satisfies this functional equation has the form f(x) = D · exp(q · x) for some real numbers D and q.

How can we combine these two results? We wanted to find the dependence y = f(x), but instead we found two different dependencies y1(x) = A · x^p and y2(x) = D · exp(q · x). We therefore need to combine these two dependencies, i.e., to come up with a combined dependence y(x) = F(y1(x), y2(x)). Which combination function F(y1, y2) should we choose?
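Before combining the two results, the two solution families just derived can be verified numerically against their respective invariance requirements (all parameter values are arbitrary illustrations):

```python
import math

A, p = 1.7, -0.8    # power-law family y1(x) = A * x**p
D, q = 0.9, -0.25   # exponential family y2(x) = D * exp(q * x)

def y1(x):
    return A * x ** p

def y2(x):
    return D * math.exp(q * x)

for x in (0.5, 3.0, 10.0):
    for c in (0.25, 2.5):
        # scaling x -> c*x only rescales y1, by C(c) = c**p
        assert math.isclose(y1(c * x), c ** p * y1(x), rel_tol=1e-12)
    for x0 in (-1.0, 1.2):
        # shifting x -> x + x0 only rescales y2, by C(x0) = exp(q * x0)
        assert math.isclose(y2(x + x0), math.exp(q * x0) * y2(x), rel_tol=1e-12)
print("both solution families satisfy their invariance requirements")
```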
Since the quantity y is determined modulo scaling, it is reasonable to select a scale-invariant combination function, i.e., a function for which, for all possible pairs of values c1 > 0 and c2 > 0, there exists a value C(c1, c2) for which y = F(y1, y2) implies that y′ = F(y1′, y2′), where y′ = C(c1, c2) · y, y1′ = c1 · y1, and y2′ = c2 · y2. Substituting the expressions for y′, y1′, and y2′ into the formula y′ = F(y1′, y2′), we conclude that C(c1, c2) · y = F(c1 · y1, c2 · y2). Since y = F(y1, y2), we conclude that C(c1, c2) · F(y1, y2) = F(c1 · y1, c2 · y2). It is known (see, e.g., [1]) that every measurable function (in particular, every definable function) that satisfies this functional equation has the form F(y1, y2) = C · y1^(p1) · y2^(p2) for some real numbers C, p1, and p2. Substituting y1(x) = A · x^p and y2(x) = D · exp(q · x) into this expression, we get
y(x) = C · (A · x^p)^(p1) · (D · exp(q · x))^(p2) = C · A^(p1) · D^(p2) · x^(p·p1) · exp(p2 · q · x),

i.e., in effect, the desired expression y = a · x^(−b) · exp(−s · x), where a = C · A^(p1) · D^(p2), b = −p · p1, and s = −p2 · q. Thus, we have indeed explained how Menzerath's law can be derived from natural invariance requirements.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, 2008) 2. G. Altmann, Prolegomena to Menzerath's law. Glottometrika 2, 1–10 (1980) 3. G. Altmann, M.H. Schwibbe, Das Menzerathsche Gesetz in informationsverarbeitenden Systemen (Olms, Hildesheim/Zürich/New York, 1989) 4. S. Eroglu, Language-like behavior of protein length distribution in proteomes. Complexity 20(2), 12–21 (2014) 5. R. Ferrer-i-Cancho, N. Forns, The self-organization of genomes. Complexity 15(5), 34–36 (2009) 6. L. Hřebíček, Text Levels: Constituents and the Menzerath-Altmann Law (Wissenschaftlicher Verlag Trier, 1995) 7. R. Köhler, Zur Interpretation des Menzerathschen Gesetzes. Glottometrika 6, 177–183 (1984) 8. W. Li, Menzerath's law at the gene-exon level in the human genome. Complexity 17(4), 49–53 (2012) 9. J. Milička, Menzerath's law: the whole is greater than the sum of its parts. J. Quant. Linguist. 21(2), 85–99 (2014)
Applications to Machine Learning
One More Physics-Based Explanation for Rectified Linear Neurons Jonatan Contreras, Martine Ceberio, and Vladik Kreinovich
Abstract The main idea behind artificial neural networks is to simulate how data is processed in the data processing device that has been optimized by millions of years of natural selection–our brain. Such networks are indeed very successful, but interestingly, the most recent successes came when researchers replaced the original biology-motivated sigmoid activation function with a completely different one–known as the rectified linear function. In this paper, we explain that this somewhat unexpected function actually naturally appears in physics-based data processing.
1 Formulation of the Problem

Neural networks: main idea. One of the objectives of Artificial Intelligence is to enrich computers with intelligence, i.e., with the ability to intelligently process information and make decisions. A natural way to do it is to analyze how we humans make intelligent decisions, and to borrow useful features of this human-based data processing and decision making. We humans process information in the brain, by using special cells called neurons. Different values of the inputs are represented as electric signals–in the first approximation, the frequency of pulses is proportional to the value of the corresponding quantity. In the first approximation, the output signal y of a neuron is related to its inputs x1, . . . , xn by the relation

y = s(w1 · x1 + . . . + wn · xn − w0),

J. Contreras · M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, Texas 79968, USA e-mail: [email protected] J. Contreras e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_32
where the wi are constants and s(z) is a non-linear function called the activation function. Outputs of most neurons serve as inputs to other neurons. Devices that simulate such biological networks of neurons are known as artificial neural networks, or simply neural networks, for short.

Choice of activation functions. In the first approximation, the activation function of biological neurons has the form s(z) = 1/(1 + exp(−z)); this function is known as a sigmoid. Because of this fact, at first, artificial neural networks used this activation function; see, e.g., [1]. However, it later turned out that it is much more efficient to use a different activation function–the so-called Rectified Linear (ReLU) function s(z) = max(0, z); see, e.g., [3].

Comment. It is known that the use of rectified linear activations in neural networks is equivalent to using similar functions s(z) = max(a, z) and s(z) = min(a, z) for some constant a–functions that can be obtained from the rectified linear function by shifts of the input and the output and, if needed, by changing the signs of the input and/or output.

Question. How can we explain the success of rectified linear activation functions?

What we do in this paper. In this paper, we show that, from the physics viewpoint, the rectified linear function is a very natural transformation.
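Both activation functions, as well as the shift-and-sign-flip equivalence mentioned in the comment above, can be sketched directly (the constant a and the test points are arbitrary illustrations):

```python
import math

def sigmoid(z):
    # biology-motivated activation: s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # rectified linear activation: s(z) = max(0, z)
    return max(0.0, z)

a = 1.5  # arbitrary constant from the comment above

for z in (-3.0, 0.0, 1.0, 1.5, 2.0, 10.0):
    # max(a, z) = a + relu(z - a): shift the input by -a, the output by +a
    assert max(a, z) == a + relu(z - a)
    # min(a, z) = a - relu(a - z): additionally flip the signs
    assert min(a, z) == a - relu(a - z)

assert 0.0 < sigmoid(0.2) < 1.0  # sigmoid outputs always lie in (0, 1)
print("max(a, z) and min(a, z) reduce to shifted/flipped ReLU")
```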
2 Analysis of the Problem

Many physical quantities are bounded. In most cases, when we write down formulas relating physical quantities, we implicitly assume that these quantities can take any possible real values. A good example of such a formula is the formula d = v · t that relates the distance d traveled by an inertial object, the object's velocity v, and the duration t of this object's travel. In reality, all these quantities are bounded:

• the distance cannot be larger than the current size of the Universe,
• the velocity cannot be larger than the speed of light, and
• the duration cannot be larger than the lifetime of the Universe;

see, e.g., [2, 4]. Of course, in most practical problems we can safely ignore these fundamental bounds, but there are usually more practical bounds that need to be taken into account. For example, if we plan a schedule for a truck driver:

• the velocity v is bounded by the speed limits on the corresponding roads, and
• the duration t is limited by the fact that after a long drive, the driver will be too tired to continue–the driver's reaction time will decrease to a dangerous level, which is why regulations prohibit long workdays.
How can we take these bounds into account? Suppose that by analyzing a simplified model–a model that does not take the bound a on a quantity z into account–we came up with an optimal-within-this-model value z0 of this quantity. How can we now take the bound into account? One possibility would be to start from scratch, and to re-solve the problem while taking the bound into account. However, this is often too time-consuming: if we could do this, there would be no need to first solve a simplified model. Since we cannot re-solve this problem in a more realistic setting, a more practical approach is to transform the possibly un-bounded result z0 of solving the simplified problem into a bounded value z. Let us describe the corresponding transformation by z = f(z0). Which transformation f(z0) should we choose? To answer this question, let us consider natural requirements on the function f(z0).

The result of this transformation must satisfy the desired constraint. The whole purpose of the transformation z = f(z0) is to produce a value z that satisfies the desired constraint z ≤ a. Thus, we must have f(z0) ≤ a for all possible values z0.

The transformation must preserve optimality. If the solution z0 of the simplified model–the model that does not take bounds into account–already satisfies the desired constraint z0 ≤ a, this means that this value is optimal even if we take the constraint into account. So, in this case, we should return this same value, i.e., take z = z0.

Monotonicity. In many practical situations, the order between different values of a quantity has a physical meaning; e.g.:

• a larger velocity leads to faster travel,
• a smaller energy consumption means saving money and saving the environment, etc.

It is therefore reasonable to require that the desired transformation f(z0) preserve this order, i.e., that if z0 ≤ z0′, then we should have f(z0) ≤ f(z0′).

Let us see what we can conclude based on these three requirements.
3 Main Result

Proposition. Let a be a real number, and let f(z0) be a function from real numbers to real numbers that satisfies the following three requirements:

• for every z0, we have f(z0) ≤ a;
• for all z0 ≤ a, we have f(z0) = z0; and
• if z0 ≤ z0′, then we have f(z0) ≤ f(z0′).

Then, f(z0) = min(a, z0).
Comment. Vice versa, it is easy to show that the function f (z 0 ) = min(a, z 0 ) satisfies all three requirements. Proof. To prove this proposition, let us consider two possible cases: z 0 ≤ a and z 0 > a. In the first case, when z 0 ≤ a, then from the second requirement, it follows that f (z 0 ) = z 0 . In this case, we have min(a, z 0 ) = z 0 , so indeed f (z 0 ) = min(a, z 0 ). In the second case, when z 0 > a, then from the first requirement, it follows that f (z 0 ) ≤ a. On the other hand, since a < z 0 , the third requirement implies that f (a) ≤ f (z 0 ). The second requirement implies that f (a) = a, thus we get a ≤ f (z 0 ). From f (z 0 ) ≤ a and a ≤ f (z 0 ), we conclude that f (z 0 ) = a. In this case, we have min(a, z 0 ) = a, so we indeed have f (z 0 ) = min(a, z 0 ). The proposition is proven. Conclusion. The above proposition shows that the function f (z 0 ) = min(a, z 0 ) naturally follows from a simple physical analysis of the problem. We have mentioned that the use of this function is equivalent to using the rectified linear activation function. Thus, the rectified linear activation function also naturally follows from physics. Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006) 2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005) 3. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, Massachusetts, 2016) 4. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
Why Deep Neural Networks: Yet Another Explanation Ricardo Lozano, Ivan Montoya Sanchez, and Vladik Kreinovich
Abstract One of the main motivations for using artificial neural networks was to speed up computations. From this viewpoint, the ideal configuration is when we have a single nonlinear layer: this configuration is computationally the fastest, and it already has the desired universal approximation property. However, the last decades have shown that for many problems, deep neural networks, with several nonlinear layers, are much more effective. How can we explain this puzzling fact? In this paper, we provide a possible explanation for this phenomenon: the universal approximation property is only true in the idealized setting, when we assume that all computations are exact. In reality, computations are never absolutely exact. It turns out that if we take this non-exactness into account, then one-nonlinear-layer networks no longer have the universal approximation property: several nonlinear layers are needed–and several layers is exactly what deep networks are about.
1 Formulation of the Problem Neural networks: a brief reminder. A natural way to perform data processing is to simulate how data is processed in a brain. In a brain, signals are processed by special cells called neurons. In the first approximation, a neuron transforms the inputs x1 , . . . , xn into a value
R. Lozano · I. Montoya Sanchez · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA e-mail: [email protected] R. Lozano e-mail: [email protected] I. Montoya Sanchez e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_33
yk(x1, . . . , xn) = s(Σ_{i=1}^{n} wki · xi − wk0).
Here, s(z) is a nonlinear continuous function known as an activation function. Output signals become inputs for other neurons, etc. This network of neurons is called a neural network. This is exactly what is implemented in artificial neural networks; see, e.g., [1, 2].

Why neural networks? When a computer program recognizes a face, it performs millions of elementary computational operations. The results still come out fast, since a computer performs billions of operations per second. A human can also recognize a face in a fraction of a second. The big difference is that it takes milliseconds for a biological neuron to perform any operation. The reason why such slow devices return the results fast is that the brain is heavily parallel. When we look, millions of neurons simultaneously process different parts of the image. This time-saving parallelism was one of the main original motivations for neural computing.

(Artificial) neural networks: a brief history. As we have just mentioned, one of the main objectives of a neural network was to speed up computations. In such a heavily parallel system, the processing time is proportional to the number of computational stages. So, it is desirable to use the smallest possible number of stages. It turns out that already one nonlinear stage is sufficient, when:

• first, signals go into nonlinear neurons, and
• then a linear combination of the neurons' results is returned:

y(x1, . . . , xn) = Σ_{k=1}^{K} Wk · yk(x1, . . . , xn) − W0.
To be more precise, it has been proven that such one-nonlinear-layer networks have the following universal approximation property: • for every continuous function f (x1 , . . . , xn ) on a closed bounded domain (|xi | ≤ B for all i) and for every desired accuracy ε > 0, • there exist weights wki and Wk for which the resulting value y(x1 , . . . , xn ) is ε-close to f (x1 , . . . , xn ) for all xi . The proof follows, e.g., from Fourier series and Fourier transform. Indeed, Newton’s prism experiment showed that every light can be decomposed into pure colors. In mathematical terms, a color is pure when the signal depends on time as a sinusoid. Thus, every continuous function can be represented as a linear combination of sinusoids. This representation is what is known as Fourier transform of a function.
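This universal approximation property is easy to check numerically for the sinusoidal activation s(z) = sin(z). The following minimal sketch fixes the inner weights wki at random and finds the outer weights Wk by least squares; the target function f(x) = |x|, the number of neurons, and the weight distributions are all illustrative choices, not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to approximate on a bounded domain.
def f(x):
    return np.abs(x)

# One nonlinear layer: y(x) = sum_k W_k * sin(w_k * x - w_k0).
K = 200                            # number of neurons (illustrative)
w = rng.normal(0.0, 3.0, K)        # inner weights w_k, fixed at random
b = rng.uniform(0.0, 2 * np.pi, K) # biases w_k0

x = np.linspace(-1.0, 1.0, 400)
hidden = np.sin(np.outer(x, w) - b)  # outputs of the nonlinear neurons

# The outer layer is linear, so the weights W_k are found by least squares.
W, *_ = np.linalg.lstsq(hidden, f(x), rcond=None)

max_err = np.max(np.abs(hidden @ W - f(x)))
print(f"max approximation error on the grid: {max_err:.4f}")
```

Even with randomly chosen inner weights, a single nonlinear layer of sinusoidal neurons fits the non-smooth target closely, in line with the universal approximation property.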
Why Deep Neural Networks: Yet Another Explanation
This is the desired result for s(z) = sin(z). So: • one nonlinear stage is the fastest, and • with a single such stage, we can, in principle, represent any dependence. Thus, for several decades, people mostly used one-nonlinear-layer neural networks.

Puzzling development. Surprisingly, the last decades have shown that often, multi-stage (“deep”) neural networks are much more efficient; see, e.g., [2]. Many machine learning problems: • could not be solved with traditional (“shallow”) networks, but • were successfully solved by deep ones. A natural question is: why are deep networks effective?
2 Towards a Possible Explanation

Idea. In the above analysis, we implicitly assumed that all computations are exact. In practice, computational devices are never absolutely accurate—e.g., due to rounding. For example, in a usual binary computer we cannot even represent 1/3 exactly. As a result, the computed value 3 · (1.0/3) may be slightly different from 1. So, we can only implement an activation function approximately, with some accuracy δ > 0. Thus, all we know is that the actual neurons apply some function s̃(z) which is δ-close to s(z).

How this affects the universal approximation property: we need more than one nonlinear layer. It is known that every continuous function on a closed bounded domain can be approximated, with any given accuracy, by a polynomial. This is, by the way, how all special functions like exp(x), sin(x), etc., are computed in a computer: by computing the corresponding approximating polynomial. So, it is possible that the neurons use a polynomial function s̃(z). Let d be the order (= degree) of this polynomial, i.e., the largest overall degree of its terms. Then, in a traditional neural network, all the outputs yk(x1, . . . , xn) are polynomials of order d. Thus, their linear combination y(x1, . . . , xn) is also a polynomial of order d. It is known that polynomials of fixed order are not universal approximators. So, we need more than one nonlinear stage (layer).

What if we use two nonlinear layers? What if we have two nonlinear stages, so that the outputs yk of the first layer serve as inputs to the second one? In this case, the resulting function is obtained by applying a polynomial of degree d to polynomials of degree d. This results in a polynomial of degree d². Thus, it is also not a universal approximator.

What if we use a fixed number of layers? For any fixed number L of layers, we will get polynomials of degree bounded by d^L.
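This degree-growth argument can be checked directly: composing polynomials multiplies their degrees, so L layers of a degree-d activation yield a polynomial of degree d^L. A minimal sketch (the degree-3 activation below is an arbitrary illustrative choice):

```python
from numpy.polynomial import Polynomial

# A polynomial activation of degree d = 3; coefficients are arbitrary
# illustrative values, listed lowest degree first.
s = Polynomial([0.5, 1.0, -0.2, 0.1])
d = s.degree()

# Stacking nonlinear layers corresponds to composing s with the previous
# layers' output; after L layers the result has degree d**L -- bounded,
# hence never a universal approximator for any fixed L.
output = Polynomial([0.0, 1.0])  # start from the identity, z -> z
for L in range(1, 4):
    output = s(output)           # one more nonlinear layer
    print(f"{L} layer(s): degree {output.degree()} = {d}**{L}")
```

Here composition is done via numpy's polynomial substitution: calling a `Polynomial` with another `Polynomial` as its argument returns the composed polynomial.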
Conclusion. Thus, in this realistic setting: • to get the universal approximation property, • we cannot limit the number of layers. And this is exactly what deep networks are about. Acknowledgements This work was supported in part by the National Science Foundation grants: • 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and • HRD-1834620 and HRD-2034030 (CAHSI Includes). It was also supported: • by the AT&T Fellowship in Information Technology, and • by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478. The authors are also thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences, Las Cruces, New Mexico, April 2, 2022, for valuable suggestions.
References 1. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006) 2. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, Massachusetts, 2016)
Applications to Mathematics
Really Good Theorems Are Those That End Their Life as Definitions: Why

Olga Kosheleva and Vladik Kreinovich
Abstract It is known that often, after it is proven that a new statement is equivalent to the original definition, this new statement becomes the accepted new definition of the same notion. In this paper, we provide a natural explanation for this empirical phenomenon.
1 Formulation of the Problem Empirical fact. A recent book [1] cites a statement that is widely believed by mathematicians: that really good theorems are those that end their life as definitions. This is indeed an empirical fact with which many mathematicians are familiar – often: • after an interesting and useful theorem is proven that provides an equivalent condition to the original definition, • this equivalent statement eventually becomes a new definition. A natural why-question. A natural question is: why is this phenomenon really ubiquitous? What we do in this paper. In this paper, we provide a natural explanation for this natural phenomenon.
O. Kosheleva · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_34
2 Analysis of the Problem and the Resulting Explanation

Which of the equivalent statements should we select? For many mathematical notions – be it continuity, compactness, etc. – there are several different properties equivalent to this notion. We need to select one of these properties as the definition; then all other properties become theorems. Which of the equivalent formulations should we select?

What do we want from a definition: a natural idea. The main objective of mathematics is to prove theorems. From this viewpoint, the natural role of a definition is to help prove theorems about the corresponding notion. Thus, we should select a definition which, on average, makes it easier to derive different properties of the corresponding notion.

Let us describe this idea in precise terms. Let A1, . . . , An denote all the known results about the corresponding notion. For each possible definition D, we can describe the complexity of deriving the property Ai – as measured, e.g., by the length of the proof – by c(D → Ai). This is the complexity of an individual derivation. To combine these individual complexities into a meaningful average, we need to take into account that different statements Ai may have different importance: • Some of these statements Ai are important: they appear in many different areas. • On the other hand, some results Ai are obscure, important for maybe a single result about some unusual and rarely appearing object. To take this difference into account, we need to assign, to each statement Ai, a number wi describing, e.g., the relative frequency with which this statement is used in the mathematical literature. Then, the average complexity c(D) of selecting the statement D as the definition can be obtained if we add up all individual complexities multiplied by these weights wi:

c(D) = Σ_{i=1}^{n} wi · c(D → Ai).  (1)
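As an illustration, the average complexity (1) can be computed for two candidate definitions and the smaller value selected; all the weights and proof lengths below are made-up illustrative numbers, not data from the paper:

```python
# Toy illustration of formula (1): the average complexity of a candidate
# definition D is the importance-weighted sum of the derivation costs
# c(D -> A_i).

def average_complexity(weights, costs):
    """c(D) = sum_i w_i * c(D -> A_i)."""
    assert len(weights) == len(costs)
    return sum(w * c for w, c in zip(weights, costs))

# Relative frequencies w_i of the known results A_1..A_4 (hypothetical).
w = [0.5, 0.3, 0.15, 0.05]

# Hypothetical proof lengths from the original definition D and from an
# equivalent reformulation D'.
cost_D       = [10, 12, 4, 3]
cost_D_prime = [3, 5, 9, 8]

c_D = average_complexity(w, cost_D)
c_Dp = average_complexity(w, cost_D_prime)
print(f"c(D) = {c_D:.2f}, c(D') = {c_Dp:.2f}")

# The formulation with the smaller average complexity is the better
# candidate for the definition.
best = "D'" if c_Dp < c_D else "D"
print("preferred definition:", best)
```

In this made-up example, D' makes the frequently used results much easier to derive, so its weighted average is smaller and it wins as the definition — exactly the mechanism the paper describes.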
What does it mean that a theorem is good? As we have mentioned, often, there are many equivalent statements describing the same notion, i.e., there are several statements D′, D′′, . . . , which are all equivalent to D: D ↔ D′ ↔ D′′ ↔ . . . What do we mean when we say that a theorem proving the equivalence D ↔ D′ is a good theorem? This usually means that once we know that D′ is equivalent to D, it makes it easier to derive several different results Ai.
What does it mean that a theorem is very good? Similarly, when we say that the equivalence D ↔ D′ is a very good theorem, this means that, once we know that D′ is equivalent to D, it makes it easier to derive many different results Ai.

So when does a very good theorem become a new definition? So when does the new equivalent formulation D′ become a new definition? This happens when the average length of the derivations from the new definition D′ becomes smaller than the average length of deriving statements from the original definition D. Let G denote the set of all the indices i for which the derivation of Ai from D′ is easier; for all other statements Aj, the easiest derivation from D′ is to first derive D (at the cost c(D′ → D)) and then derive Aj from D as before. Thus, D′ becomes the better definition when the gain on the statements from G exceeds this overhead, i.e., when

Σ_{i∈G} wi · (c(D → Ai) − c(D′ → Ai)) > c(D′ → D) · Σ_{j∉G} wj.  (3)
Clearly, when the set G becomes sufficiently large, the left-hand side of the formula (3) indeed becomes larger than the right-hand side, and thus, it becomes reasonable to select D′ as the new definition. This explains the phenomenon described in [1].
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Reference 1. A.V. Borovik, Shadows of the Truth: Metamathematics of Elementary Mathematics (AMS, Providence, RI, 2022)
Applications to Physics
How to Describe Hypothetic Truly Rare Events (With Probability 0)

Luc Longpré and Vladik Kreinovich
Abstract In probability theory, rare events are usually described as events with low probability p, i.e., events for which in N observations, the event happens n(N ) ∼ p · N times. Physicists and philosophers suggested that there may be events which are even rarer, in which n(N ) grows slower than N . However, this idea has not been developed, since it was not clear how to describe it in precise terms. In this paper, we propose a possible precise description of this idea, and we use this description to answer a natural question: when two different functions n(N ) lead to the same class of possible “truly rare” sequences.
1 Formulation of the Problem How rare events are described now. The current approach to describing rare events comes from probability theory. In the probability theory, each class of events is characterized by its probability p. From the observational viewpoint, probability means a frequency with which such an event is observed in a long series of observations. Specifically, this means that if we make a large number N of observations, then the number of observations n(N ) in which the event occurred is asymptotically proportional to n(N ) ∼ p · N ; see, e.g., [1]. In these terms, rare events correspond to cases when the probability p is small: the smaller the probability p, the fewer the occurrences of the events, so the rarer is this event. Challenge. Probability theory deals with events for which the number of occurrences n(N ) is proportional to the number of observations N . But what if we have an event L. Longpré · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] L. Longpré e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_35
which is even rarer, i.e., an event for which the number of occurrences n(N) grows even slower—e.g., as C · N^α for some C > 0 and α < 1? Such a hypothetical possibility was proposed, on the intuitive level, by several physicists and philosophers. This idea may sound interesting, but the big challenge is that it is not clear how to develop it into something quantitative—since, in contrast to probability theory, there is no readily available formalism to describe such rare events.

What we do in this paper. In this paper, we provide a possible way to formally describe such rare events—and thus, to be able to analyze them.
2 Analysis of the Problem and the Resulting Proposal

Analysis of the situation. In general, when the expected number of events observed during the first N observations is described by a formula n(N), the expected number of events occurring during the N-th observation can be estimated as the difference n(N) − n(N − 1) between:
• the overall number n(N) of events observed during the first N observations and
• the overall number n(N − 1) of events observed during the first N − 1 observations.
This expected number of events occurring during a single observation is nothing else but the probability p_N that the N-th observation leads to the desired effect. In the traditional probability approach, when n(N) = p · N, this probability is equal to p:

p_N = n(N) − n(N − 1) = p · N − p · (N − 1) = p.

In the desired random case, the corresponding difference decreases with N. For example, for n(N) ∼ C · N^α, we have

p_N = C · N^α − C · (N − 1)^α = C · N^α · (1 − ((N − 1)/N)^α).

Here,

((N − 1)/N)^α = (1 − 1/N)^α ∼ 1 − α/N,

and therefore

p_N ∼ C · N^α · (α/N) = c · N^a,

where we denoted c = C · α and a = α − 1. This leads to the following proposal.
Proposal. We describe a rare event with n(N)/N → 0 as a random sequence in which the N-th observation leads to an event with probability p_N = n(N) − n(N − 1), and the occurrences for different N are independent.
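This proposal is easy to simulate. The sketch below uses the illustrative choice n(N) = C · N^α with C = 1 and α = 0.5 (these values are ours, not from the paper): the event keeps occurring, yet its observed frequency tends to 0:

```python
import numpy as np

rng = np.random.default_rng(42)

# Truly rare events following the proposal: the N-th observation produces
# the event with probability p_N = n(N) - n(N-1), independently.
# Illustrative choice: n(N) = C * N**alpha with alpha < 1.
C, alpha = 1.0, 0.5
N_max = 100_000

N = np.arange(1, N_max + 1)
n = C * N ** alpha
p = np.diff(np.concatenate(([0.0], n)))  # p_N = n(N) - n(N-1)

occurred = rng.random(N_max) < p         # independent occurrences
count = int(occurred.sum())

print(f"observed {count} events in {N_max} observations")
print(f"expected about n(N_max) = {n[-1]:.1f}; frequency {count / N_max:.5f}")
# The frequency of occurrences tends to 0 as N grows: in the usual
# frequency sense this event has "probability 0", yet it keeps happening.
```

With these parameters, the number of occurrences stays close to n(N_max) = √N_max ≈ 316, while the frequency is only a fraction of a percent, matching the intended "rarer than any positive probability" behavior.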
3 A Natural Question: Which Probabilities p_N and q_N Lead to the Same Class of Rare Observations?

Now we can start analyzing the new notion of rareness. The fact that now we have a formal definition enables us to start analyzing quantitative questions. Let us start with the question described in the title of this section.

Why this question is important. In the usual probabilistic setting, we need only one parameter to describe the degree of rareness: the probability p. The value of this parameter can be determined based on the observations, as the limit of n(N)/N when N increases. The more observations we make, the more accurately we can determine this probability. In contrast, when we have a sequence of events with different probabilities p_N, for each N, we only have one case in which the event either occurred or did not occur. Clearly, this information is not sufficient to uniquely determine the corresponding probability p_N. In other words, this means that different sequences p_N can lead to the exact same sequence of observations. In mathematical terms, based on the observations, we cannot uniquely determine the sequence p_N; we can only determine a class of sequences consistent with this observation. This leads to the following natural questions: what is this class? Or, in other terms: when do two sequences p_N and q_N lead to the same sequence of observations?

Let us describe this question in precise terms. The usual way to apply probabilistic knowledge to practice is as follows: we prove that some formula is true with probability 1, and we conclude that this formula must be true for the actual random sequence. For example, it is proven that if we have a sequence of independent events with probability p each, then the ratio n′(N)/N, where n′(N) denotes the actual number of occurrences of this event in the first N observations, tends to p with probability 1.
We therefore conclude that for the actual sequence of observations, we will have n′(N)/N → p. Similarly, we conclude that the actual sequence of observations should satisfy the Central Limit Theorem, according to which the distribution of the deviations n′(N)/N − p of this ratio from the actual probability is asymptotically normal, etc. In other words, the usual application assumes that when we have a random sequence, it should satisfy all the laws that occur with probability 1—or, equivalently, that this sequence should not belong to any definable set of probability measure 0. This idea underlies the formal definition of a random sequence in Algorithmic Information Theory—it is defined as a sequence that does not belong to any definable set of measure 0; see, e.g., [2].
Here, “definable” can be formalized in different ways: computable in some sense, describable by a formula from some class, etc. No matter how we describe it, there are only countably many such sets. So, the measure of their union is 0, and if we delete this union from the set of all possible binary sequences, we therefore still get a set of measure 1.

Let us apply this formalization to our question. When we select the probabilities p_N, we thus determine a probability measure on the set of all binary sequences, and we can use the above definition to formally define when a binary sequence is random with respect to this measure. So, the above question takes the following form: for which pairs of sequences p_N and q_N does there exist a binary sequence which is random with respect to both measures? Let us call such probabilities p_N and q_N indistinguishable. The following proposition explains when two sequences are indistinguishable.

Proposition. Sequences p_N → 0 and q_N → 0 are indistinguishable if and only if

Σ_{N=1}^{∞} (√p_N − √q_N)² < ∞.

Proof. According to [3] (see also Problem 4.5.14 in [2]), the two probabilities p_N and q_N are indistinguishable if and only if

Σ_{N=1}^{∞} [ (√p_N − √q_N)² + (√(1 − p_N) − √(1 − q_N))² ] < ∞.  (1)

In our case, when p_N → 0 and q_N → 0, we have √(1 − p_N) ∼ 1 − 0.5 · p_N, thus

√(1 − p_N) − √(1 − q_N) ∼ −0.5 · (p_N − q_N) = −0.5 · (√p_N − √q_N) · (√p_N + √q_N).

Therefore,

(√(1 − p_N) − √(1 − q_N))² ∼ 0.25 · (√p_N − √q_N)² · (√p_N + √q_N)².

Here, p_N → 0 and q_N → 0, thus (√p_N + √q_N)² → 0. Hence, the second term in the left-hand side of the formula (1) is asymptotically negligible in comparison with the first term, and the convergence of the sum is thus equivalent to the convergence of the first term. The proposition is proven.

Comment. By using this proposition, one can check that the probabilities

p_N = c · N^a and q_N = d · N^b
are indistinguishable only when they coincide. Thus, all the probabilities p_N = c · N^a corresponding to different pairs (c, a) are distinguishable.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, Florida, 2011) 2. M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, New York, 2008) 3. V.G. Vovk, On a randomness criterion. Soviet Math. Doklady 35, 656–660 (1987)
Spiral Arms Around a Star: Geometric Explanation

Juan L. Puebla and Vladik Kreinovich
Abstract Recently, astronomers discovered spiral arms around a star. While their shape is similar to the shape of the spiral arms in galaxies, because of the different scale, galaxy-related physical explanations of galactic spirals cannot be directly applied to explaining star-size spiral arms. In this paper, we show that, in contrast to more specific physical explanations, more general symmetry-based geometric explanations of galactic spirals can explain spiral arms around a star.
1 Formulation of the Problem Spiral shapes are ubiquitous in astronomy. Spiral arms are typical in galaxies. For example, our own Galaxy consists largely of such arms.
J. L. Puebla · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] J. L. Puebla e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_36
A recent discovery and a related challenge. Recently, very similar spiral arms were discovered around a star; see, e.g., [4]; see also [3]. How can we explain the appearance of spiral arms around a star? Why is this a challenge? There are many physical theories that explain the appearance of logarithmic spiral arms in galaxies. However, because of the different scale, galaxy-related explanations cannot be directly applied to explaining spiral arms around a star. What we do in this paper. In this paper, we show that while specific physical explanations of galaxy spiral arms cannot be applied to explain spiral arms around a star, a more general symmetry-based geometric explanation can be applied to the newly discovered phenomenon.
2 Analysis of the Problem and the Resulting Explanation How are spiral arms in galaxies explained? For galaxies, several dozen different physical theories have been proposed that explain the same spiral shape. All these theories, based on completely different physics, successfully explain the same shape. This led researchers to conclude that this shape must have a simple geometric explanation. Such an explanation has indeed been proposed; see, e.g., [1, 2].
Let us recall this explanation.

Initial symmetries and inevitability of symmetry violations. The distribution of matter close to the Big Bang was practically uniform and homogeneous. Thus, this distribution was invariant with respect to shifts, rotations, and scalings. However, such a distribution is unstable. If at some point, density increases, then matter will be attracted to this point, and the disturbance will increase. Thus, the original symmetry will be violated.

Which symmetry violations are more probable? According to statistical physics, it is more probable to go from a symmetric state to a state where some symmetries are preserved. For example, when heated, matter rarely goes directly from a highly symmetric crystal state to the completely asymmetric gas state. It first goes through an intermediate liquid state, where some symmetries are preserved. The more symmetries are preserved, the more probable the transition. From this viewpoint, it is most probable that matter forms a state with the largest group of symmetries.

Resulting shapes consist of orbits of symmetry groups. If there is a disturbance at some point a, and the situation is invariant with respect to some transformation g, this means that there is a disturbance at the point g(a) as well. So, with each point a, the resulting shape contains all the points G(a) = {g(a) : g ∈ G}. The set G(a) is known as an orbit of the group G. For example, if G is the group of all rotations around a point, then G(a) is the sphere containing a.

Original symmetry group. In the beginning, we have the following 1-dimensional families of basic transformations:
• three families of shifts x → x + a—in all three dimensions,
• three families of rotations x → Rx—around all three axes, and
• one family of scalings x → λ · x.
We can combine them, so we get a 7-dimensional symmetry group.

What is the next shape? What is the shape with the largest symmetry?
If we have all three shifts, then from each point, we can get to every other point. In this case, the whole 3-D space is the shape. Thus, for perturbation shapes, we can have at most two families of shifts. If we apply two families of shifts to a point, we get a plane. The plane also has: • one family of rotations inside the plane, and • one family of scalings. Thus, the plane has a 4-dimensional symmetry group. If we have one family of shifts, we get a straight line. It also has rotations around it and scaling. So, a straight line has a 3-dimensional symmetry group.
If we have no shifts, but all rotations, then we get a sphere. A sphere has a 3-dimensional symmetry group. So far, the most symmetric shape is the plane. A detailed analysis of all possible symmetry groups confirms this. So, the most probable first perturbation shape is a plane. This is in accordance with astrophysics, where such a proto-galaxy shape is called a pancake.

What next after a pancake? The planar shape is still not stable. So, the symmetry group decreases further. From the 2-D shape of a plane, we go to a symmetric 1-D shape. The generic form of a 1-D orbit is exactly the logarithmic spiral: in polar coordinates, it has the form r = a · exp(k · θ). For every two of its points (r, θ) and (r′, θ′), there is a spiral's symmetry transforming (r, θ) into (r′, θ′):
• first, we rotate by δ = θ′ − θ;
• then, we scale r → exp(k · δ) · r.
This is what we observe in galaxies.

What is after spirals? What will happen next? The spiral is also unstable. So, we go from the continuous 1-D symmetry group to a discrete one. In the resulting shape, we have points whose distances from a central point form a geometric progression r_n = r_0 · q^n. Interestingly, this is exactly the Titius–Bode formula describing planets' distances from the Sun. The only exception to this formula is that after Earth and Mars, this formula gives a distance to the asteroid belt between Mars and Jupiter. Astrophysicists believe that there was a proto-planet there, torn apart by Jupiter's gravity, and the asteroids are its remainder.

Resulting explanation of spiral arms around a star. A general geometric analysis shows that at some point, we should reach a spiral shape. This explains the observed spiral arms around a star. This also explains why such arms are rare and were never observed before. Indeed, in our Solar system—and in other places—we have moved to the next stage, of planets following the Titius–Bode law.
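The self-similarity of the logarithmic spiral described above can be verified numerically: rotating by δ and then scaling by exp(k · δ) maps the spiral onto itself. The parameters a, k, and δ below are arbitrary illustrative values:

```python
import numpy as np

# Logarithmic spiral in polar coordinates: r = a * exp(k * theta).
a, k = 1.0, 0.2          # illustrative spiral parameters

def r(theta):
    return a * np.exp(k * theta)

theta = np.linspace(0.0, 4 * np.pi, 200)  # sample points on the spiral

# The spiral's symmetry: rotate by delta, then scale by exp(k * delta).
delta = 0.7
theta_new = theta + delta
r_new = np.exp(k * delta) * r(theta)

# The transformed points still satisfy r = a * exp(k * theta):
assert np.allclose(r_new, r(theta_new))
print("rotation by delta followed by scaling exp(k*delta) maps the spiral to itself")
```

This is exactly the 1-D symmetry group described in the text: each δ gives one rotation-plus-scaling that leaves the spiral invariant.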
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences (Las Cruces, New Mexico, April 2, 2022) for valuable suggestions.
References 1. A. Finkelstein, O. Kosheleva, V. Kreinovich, Astrogeometry: towards mathematical foundations. Int. J. Theor. Phys. 36(4), 1009–1020 (1997) 2. A. Finkelstein, O. Kosheleva, V. Kreinovich, Astrogeometry: geometry explains shapes of celestial bodies. Geombinatorics 6(4), 125–139 (1997) 3. X. Lu, G.X. Li, Q. Zhang, Y. Lin, A massive Keplerian protostellar disk with flyby-induced spirals in the central molecular zone. Nat. Astron. (2022). https://doi.org/10.1038/s41550-022-01681-4 4. L.M. Perez, J.M. Carpenter, S.M. Andrews, L. Ricci, A. Isella, H. Linz, A.I. Sargent, D.J. Wilner, T. Henning, A.T. Deller, C.J. Chandler, C.P. Dullemond, J. Lazio, K.M. Menten, S.A. Corder, S. Storm, L. Testi, M. Tazzari, W. Kwon, N. Calvet, J.S. Greaves, R.J. Harris, L.G. Mundy, Spiral density waves in a young protoplanetary disk. Science 353(6307), 1519–1521 (2016)
Why Physical Power Laws Usually Have Rational Exponents

Edgar Daniel Rodriguez Velasquez, Olga Kosheleva, and Vladik Kreinovich
Abstract Many physical dependencies are described by power laws y = A · x^a, for some exponent a. This makes perfect sense: in many cases, there are no preferred measuring units for the corresponding quantities, so the form of the dependence should not change if we simply replace the original unit with a different one. It is known that such invariance implies a power law. Interestingly, not all exponents are possible in physical dependencies: in most cases, we have power laws with rational exponents. In this paper, we explain the ubiquity of rational exponents by taking into account that in many cases, there is also no preferred starting point for the corresponding quantities, so the form of the dependence should also not change if we use a different starting point.
1 Formulation of the Problem Power laws are ubiquitous. In many application areas, we encounter power laws, when the dependence of a quantity y on another quantity x takes the form y = A · x a for some constants A and a; see, e.g., [2, 3, 5]. E. D. Rodriguez Velasquez Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru e-mail: [email protected]; [email protected] Department of Civil Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_37
This ubiquity has a natural explanation. This explanation comes from the fact that the numerical values of physical quantities depend on the choice of a measuring unit. If we replace the original unit with one which is λ times smaller, then all the numerical values are re-scaled, namely, multiplied by λ: x → X = λ · x. For example, 1.7 m becomes 100 · 1.7 = 170 cm. In many cases, there is no physically preferable measuring unit. In such situations, it makes sense to require that the formula y = f(x) describing the dependence between x and y should be invariant (= does not change) if we simply change the measuring unit for x. To be more precise, for each re-scaling x → X = λ · x of the variable x, there exists an appropriate re-scaling y → Y = μ(λ) · y of the variable y for which y = f(x) implies that Y = f(X). Substituting the expressions for X and Y into this formula, we conclude that f(λ · x) = μ(λ) · y, i.e., since y = f(x), that f(λ · x) = μ(λ) · f(x). It is easy to check that:
• all power laws satisfy this functional equation, for an appropriate function μ(λ); vice versa,
• it is known that the only differentiable functions f(x) that satisfy this functional equation are power laws; see, e.g., [1].

Remaining problem and what we do in this paper. Not all power laws appear in physical phenomena: namely, in almost all the cases, we encounter only power laws with rational exponents a. How can we explain this fact? In this paper, we show that a natural expansion of the above invariance-based explanation for the ubiquity of power laws explains why physical power laws usually have rational exponents.
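The first of these two claims — that every power law satisfies the functional equation f(λ · x) = μ(λ) · f(x) with μ(λ) = λ^a — is easy to verify numerically; A and a below are arbitrary illustrative values:

```python
import numpy as np

# For a power law f(x) = A * x**a, re-scaling the unit of x by lambda
# simply re-scales y by mu(lambda) = lambda**a, i.e.,
# f(lambda * x) = mu(lambda) * f(x) for every lambda > 0.
A, a = 2.5, 1.5   # illustrative constants

def f(x):
    return A * x ** a

x = np.linspace(0.1, 10.0, 50)
for lam in (0.5, 2.0, 7.3):
    mu = lam ** a        # the induced re-scaling of y
    assert np.allclose(f(lam * x), mu * f(x))
print("f(lambda * x) = lambda**a * f(x) holds for every tested lambda")
```

This is only the easy direction of the argument; the converse — that differentiable solutions of this functional equation must be power laws — is the cited analytical result [1].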
2 Our Explanation

Main idea. The main idea behind our explanation is to take into account that for many physical quantities, their numerical values depend not only on the choice of a measuring unit, but also on the choice of a starting point. For example, if we change the starting point for measuring time to one which is x0 moments earlier, then all numerical values of x are replaced by new "shifted" numerical values X = x + x0. In such situations, it is reasonable to require that the formulas do not change if we simply change the starting point for x.

How can we apply this idea to our situation. For a power law y = A · x^a, we cannot apply the above idea of "shift-invariance" directly: if we change x to x + x0, then the original power law takes a different form y = A · (x + x0)^a.
Why Physical Power Laws Usually Have Rational Exponents
This situation is somewhat similar to what we had when we derived the power law from scale-invariance: if we simply replace x with λ · x in the formula y = A · x^a, we get a different formula y = A · (λ · x)^a = (A · λ^a) · x^a. So, in this sense, the power law formula y = A · x^a is not scale-invariant. What is scale-invariant is the 1-parametric family of functions {C · (A · x^a)}_C corresponding to different values C—and this scale-invariance is exactly what leads to the power law. With respect to shifts, however, the 1-parametric family {C · (A · x^a)}_C is not invariant, so a natural idea is to consider a multi-parametric family, i.e., to fix several functions e1(x), …, en(x), and consider the family of all possible linear combinations of these functions:

{C1 · e1(x) + … + Cn · en(x)}_{C1,…,Cn}   (1)
corresponding to all possible values of the parameters Ci. For this family, we can try to require both scale- and shift-invariance. In this case, we get the following result:

Proposition. Let e1(x), …, en(x) be differentiable functions for which the family (1) is scale- and shift-invariant and for which this family contains a power law f(x) = A · x^a. Then, a is an integer.

Proof. It is known—see, e.g., [4]—that if for some differentiable functions ei(x), the family (1) is scale- and shift-invariant, then all the functions from this family are polynomials, i.e., functions of the type a0 + a1 · x + … + ap · x^p. In particular, this means that the power law f(x) = A · x^a—which is also a member of this family—is a polynomial. The only case when the power law is a polynomial is when the exponent a is a non-negative integer. The proposition is proven.

How this explains the prevalence of rational exponents. What we proved so far was an explanation of why we often have integer exponents y = A · x^n for some integer n. But how can we get rational exponents? First, we notice that if the dependence of y on x has the form y = A · x^n, with an integer exponent n, then the dependence of x on y has the form x = B · y^{1/n} for some constant B, with a rational exponent which is no longer an integer. Another thing to notice is that the relation between two quantities x and y is rarely direct. For example, it may be that y depends on some auxiliary quantity z which, in turn, depends on x. In general, y depends on some auxiliary quantity z1, this quantity depends on another auxiliary quantity z2, etc., and finally, the last auxiliary quantity zk depends on x. If all these dependencies are described by power laws, then we have

y = A0 · z1^{a0}, z1 = A1 · z2^{a1}, …, z_{k−1} = A_{k−1} · zk^{a_{k−1}}, zk = Ak · x^{ak},

with coefficients ai which are either integers or inverse integers. Then, we have

z_{k−1} = A_{k−1} · zk^{a_{k−1}} = A_{k−1} · (Ak · x^{ak})^{a_{k−1}} = const · x^{a_{k−1}·ak},
similarly,

z_{k−2} = A_{k−2} · z_{k−1}^{a_{k−2}} = A_{k−2} · (const · x^{a_{k−1}·ak})^{a_{k−2}} = const · x^{a_{k−2}·a_{k−1}·ak},
etc., and finally y = const · x^a, where a = a0 · a1 · … · ak. Since all the values ai are rational numbers, their product is also rational—and every rational exponent a = m/n can be thus obtained, if we take y = const · z1^m and z1 = const · x^{1/n}. This explains why rational exponents are ubiquitous.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, Cambridge, 2008) 2. B.M. Das, Advanced Soil Mechanics (CRC Press, Boca Raton, 2019) 3. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005) 4. H.T. Nguyen, V. Kreinovich, Applications of Continuous Mathematics to Computer Science (Kluwer, Dordrecht, 1997) 5. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
Freedom of Will, Non-uniqueness of Cauchy Problem, Fractal Processes, Renormalization, Phase Transitions, and Stealth Aircraft Miroslav Svítek, Olga Kosheleva, and Vladik Kreinovich
Abstract We all know that we can make different decisions, decisions that change—at least locally—the state of the world. This is what is known as freedom of will. On the other hand, according to physics, the future state of the world is uniquely pre-determined by its current state, so there is no room for freedom of will. How can we resolve this contradiction? In this paper, we analyze this problem, and we show that many physical phenomena can help resolve this contradiction: the fractal character of equations, renormalization, phase transitions, etc. Usually, these phenomena are viewed as somewhat exotic, but our point is that if we want physics to be consistent with freedom of will, then these phenomena need to be ubiquitous.
1 Freedom of Will and Physics: A Problem

A problem: reminder. We all know that in some situations, we can make different decisions, and these decisions will change our state and the state of others—i.e., change the state of the world. The possibility to make different decisions is known as freedom of will. The problem is that this experience is inconsistent with physics. This inconsistency is very clear in Newtonian mechanics, where the state of the world at any future moment of time is uniquely determined by the current state. Strictly speaking, in Newtonian physics, all our actions are pre-determined—just like all other changes in the state of the world are pre-determined, so freedom of will is an illusion. Interestingly, Einstein himself seriously believed that freedom of will

M. Svítek
Faculty of Transportation Sciences, Czech Technical University in Prague, Konviktska 20, CZ-110 00 Prague 1, Czech Republic
e-mail: [email protected]
O. Kosheleva · V. Kreinovich (B)
University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
O. Kosheleva
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_38
is an illusion: he could not swim, but he liked to go yachting alone, and when his friends expressed concern about this, he assured them that everything is pre-determined and nothing can be changed. For most of us, however, freedom of will is real. The situation is a little bit less pre-determined in quantum physics, but in quantum physics, still, the current state of the world uniquely pre-determines the probabilities of all future states. To be more precise, the Schrödinger equation pre-determines the values of the wave function, and the wave function uniquely determines all future probabilities; see, e.g., [1, 3]. So, if we face several similar decisions at future moments of time:

• it is not possible to predict how exactly the state will change each time,
• but the proportion of times in which we make a certain decision is pre-determined by the past state of the world.

This also contradicts our intuition. In a nutshell, freedom of will means that we can change the state of the world just by making a mental decision. Many people do claim that they can move things or otherwise change the state of the world by simply using their thoughts, but so far, none of these claims has been confirmed: human thought cannot change the state of even a single particle. So how can we resolve the above contradiction between modern physics and common sense?

What we do in this paper. In this paper, we propose a possible solution to this problem.
2 Possible Solutions

How physical theories are described. Traditionally, physical theories have been described by differential equations—equations that describe how the state's rate of change depends on the current state. Lately, however, most theories are described by specifying a functional—known as the action—for which the actual trajectory of how the system's state changes with time is the one that minimizes the action [1, 3]. To be more precise, what is often described is not the action itself but the so-called Lagrangian, whose integral over space and time forms the action. In line with this, let us consider how freedom of will can be explained in both these approaches.

Case of differential equations. Let us first consider the case when the physical laws are described in terms of a differential equation ẋ = f(x), where x is the state of the world at the moment of time t—as described by the states of the particles, the states of the fields, etc. In general, a differential equation enables us, given the state x(t0) at some moment of time, to predict the states x(t) at all future moments of time. The problem of predicting the future state x(t) based on the current state x(t0) is known as the Cauchy problem.
Sometimes, the solution to the Cauchy problem is not unique. In mathematics, it is known that for some differential equations, the Cauchy problem has several different solutions. Such situations happen with physically meaningful differential equations as well: see, e.g., [2], where this non-uniqueness is used to explain the "time arrow"—the irreversibility of many macro-phenomena.

Non-uniqueness should be ubiquitous. Non-uniqueness helps to resolve the contradiction between physics and freedom of will: we can follow several different trajectories without violating the physical laws. However, by itself, the current non-uniqueness does not fully resolve this contradiction: namely, we can make decisions at any moment of time, while in physics, non-uniqueness is rare—it is limited to a few exceptional cases and/or exceptional moments of time. To fully resolve the contradiction, we need to make sure that non-uniqueness is ubiquitous. How can we do it?

Need for non-smooth (fractal) equations. If the right-hand side f(x) of the corresponding physical equation is analytical, then usually, there is a unique solution (at least locally); this was proven already in the 19th century by Cauchy himself, the mathematician who first started a systematic study of what we now call the Cauchy problem. This means that, if we want consistency with common sense, we have to consider right-hand sides which are not analytical and probably not even smooth—e.g., fractal. Let us show that non-smooth right-hand sides indeed lead to non-uniqueness. Indeed, let us consider the simplest possible case when the state of the world is described by a single variable x, and the equation takes the form ẋ = x^α for some 0 < α < 1, with initial condition x(t0) = 0. In this case, one possibility is to have x(t) = 0 for all t.
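In fact, a solution can also leave zero at an arbitrary later moment. A numeric sketch (with the illustrative choices α = 1/2 and switching moment t1 = 1; the constant (1 − α)^{1/(1−α)} is what makes the derivative of the delayed branch match x^α exactly) checks two different solutions of the same Cauchy problem by finite differences:

```python
alpha = 0.5        # any 0 < alpha < 1 works; 1/2 is just an illustration
t1 = 1.0           # moment at which the delayed solution leaves zero

def x_zero(t):
    return 0.0

def x_delayed(t):
    # stays at 0 until t1, then follows ((1 - alpha) * (t - t1))**(1/(1 - alpha))
    return 0.0 if t <= t1 else ((1 - alpha) * (t - t1)) ** (1 / (1 - alpha))

h = 1e-6
for sol in (x_zero, x_delayed):
    for t in (0.3, 0.7, 1.5, 2.0, 3.0):
        dx = (sol(t + h) - sol(t - h)) / (2 * h)   # central finite difference
        assert abs(dx - sol(t) ** alpha) < 1e-4    # x' = x**alpha holds
```

Both functions pass the same residual check, so the initial condition x(t0) = 0 indeed does not pin down the trajectory.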
On the other hand, there are many other possibilities: e.g., we can have a solution which is equal to 0 until some moment t1 ≥ t0, and then switches to x(t) = C · (t − t1)^{1/(1−α)} for an appropriate constant C. So, introducing fractal-ness into the equations can help resolve the freedom-of-will-vs-physics problem.

Discussion. The need for non-smoothness is well known in some areas of physics. For example, if we limit ourselves to infinitely smooth solutions of the aerodynamics equations, then we arrive at the conclusion that Lord Kelvin made in the late 19th century—that human-carrying heavier-than-air flying machines are not possible—a conclusion that was experimentally disproved by the appearance of airplanes [1].

Renormalization: another reason for non-uniqueness and another opening for freedom of will. In some cases, a solution to a differential equation is locally unique, but after some time, it leads to a physically meaningless infinite value of some physical quantity. For example, a general solution to the equation ẋ = x² is x(t) = 1/(C − t) for some constant C, so for t = C, we get an infinite value. This phenomenon is not purely mathematical; it is a well-known physical phenomenon. The simplest example of such a phenomenon is an attempt to compute the overall energy of an electron's electric field [1]. According to relativity theory,
since the electron is an elementary particle and not a combination of several independent sub-particles, it must be a point-wise particle: otherwise, due to the fact that all communication speeds are limited by the speed of light, the states of the different spatial parts of the electron at the same moment of time cannot affect each other and would thus serve as such independent sub-particles. For a point-wise particle, the electric field is proportional to r^{−2}, where r is the distance to the electron, and the energy density of the field is proportional to the square of the field, i.e., to r^{−4}. Thus, the overall energy of the electron's electric field is equal to the integral of this energy density r^{−4} over the whole space—and one can check, by using radial coordinates, that this integral is infinite. So, the overall energy of the electron—which is the sum of its rest-mass energy m0 · c² and the overall energy of its electric field—is supposed to be infinite, while we know that it is finite and very small. Of course, our analysis ignored quantum effects, but if we take quantum effects into account, the result remains infinite. How is this problem resolved now? The usual way—known as renormalization—is to consider, for each ε > 0, a model in which the electron has a finite radius ε > 0. For this model, the overall energy of the electric field is finite. Within this model, the rest mass m0(ε) is then selected in such a way that the overall energy of the electron—the sum of the rest-mass energy m0(ε) · c² and the overall energy of the electric field—becomes equal to the observed value. For each quantity of interest, as a prediction, we then take the limit of predictions in different models when ε tends to 0. For each ε > 0, we have uniqueness, but there is no guarantee that the corresponding predictions will tend to some limit.
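The ε-bookkeeping behind this procedure can be sketched in a few lines. Everything here is illustrative: units are suppressed, c = 1, and the observed total energy is set to 1. In radial coordinates, the field energy outside a cutoff radius ε is proportional to the integral of r^{−4} · 4πr² dr from ε to infinity, i.e., to 4π/ε, which blows up as ε tends to 0; the bare rest energy m0(ε) · c² is then chosen so that the total matches the observed value:

```python
import math

E_OBSERVED = 1.0                      # hypothetical observed total energy (c = 1)

def field_energy(eps):
    # closed form of the integral of r**-4 * 4*pi*r**2 dr from eps to infinity
    return 4 * math.pi / eps

def bare_rest_energy(eps):
    # the "renormalized" bare rest energy m0(eps) * c**2: chosen (and driven
    # to minus infinity as eps -> 0) so that the observable total stays fixed
    return E_OBSERVED - field_energy(eps)

for eps in (1e-1, 1e-3, 1e-6):
    total = bare_rest_energy(eps) + field_energy(eps)
    assert abs(total - E_OBSERVED) < 1e-6     # the divergent pieces cancel
assert field_energy(1e-6) > field_energy(1e-3)  # divergence as eps shrinks
```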
In situations when the sequence of predictions corresponding to different ε does not converge, we do not have a definite prediction—which also opens room for freedom of will.

What about the optimization approach. In the optimization approach, non-uniqueness appears when we have two or more trajectories or states with the exact same smallest possible value of the objective function. This phenomenon is known in physics: e.g., during a phase transition such as melting, at some point, both the solid and the liquid states have the same value of the objective function. There is also a related phenomenon of unstable equilibrium, when, e.g., the smallest push can move a body on top of a rotation-invariant mountain downhill—but it is difficult to predict in which direction it will move. So maybe we can test the ability of people to use their thoughts to change the state of the world—by testing this ability in such unstable-equilibrium situations. An additional feature of such phase transitions is non-smoothness. Non-smoothness (and even discontinuity) is typical for optimization problems, where it is known as "bang-bang control". For example, the stealthiest shape of an aircraft is when its surface is not smooth, but is formed by several planar parts. For a smooth surface, there are always parts of the shape that reflect the radar's signal back to the radar, but for such a piece-wise planar shape, there are only a few reflected directions, and the probability that one of them goes back to the radar is close to 0.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005) 2. O. Kosheleva, V. Kreinovich, Brans-Dicke scalar-tensor theory of gravitation may explain time asymmetry of physical processes. Math. Struct. Model. 27, 28–37 (2013) 3. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
How Can the Opposite to a True Theory Be Also True? A Similar Talmudic Discussion Helps Make This Famous Bohr’s Statement Logically Consistent Miroslav Svítek and Vladik Kreinovich
Abstract In his famous saying, the Nobel laureate physicist Niels Bohr claimed that the sign of a deep theory is that while this theory is true, its opposite is also true. While this statement makes heuristic sense, it does not seem to make sense from a logical viewpoint, since, in logic, the opposite of true is false. In this paper, we show how a similar Talmudic discussion can help come up with an interpretation in which Bohr's statement becomes logically consistent.
1 Formulation of the Problem

Bohr's statement: reminder. In [1], the famous physicist Niels Bohr cited "the old saying of the two kinds of truth. To the one kind belong statements so simple and clear that the opposite assertion obviously could not be defended. The other kind, the so-called 'deep truths', are statements in which the opposite also contains deep truth." On the intuitive level, this makes sense. This statement summarizes many of Bohr's ideas about quantum physics. For example:

• in some cases, it makes sense to consider elementary particles like electrons, protons, or photons, as point-wise particles;
• on the other hand, in some other cases, it makes sense to consider elementary particles not as point-wise particles but as waves.
M. Svítek Faculty of Transportation Sciences, Czech Technical University in Prague, Konviktska 20, CZ-110 00 Prague 1, Czech Republic e-mail: [email protected] V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_39
However, from the purely logical viewpoint, Bohr's statement does not seem to make sense, since in logic, the opposite of a true statement is false.

What we plan to do about it. The above statement of Bohr's is very famous, and it has been discussed a lot. Some of the discussants mentioned that similar statements were made in the past; for example, a similar statement was made in the Talmud [3], where the corresponding statement is accompanied by a more detailed discussion. In this paper, we show that the ideas from this more detailed discussion can help make Bohr's statement logically consistent.
2 Analysis of the Problem and the Resulting Solution

What the Talmudic discussion says. The corresponding Talmudic statement is related to a contradiction between statements made by two respected Rabbinic schools: "Make for yourself a heart of many rooms, and enter into it the words of Beit Shammai and the words of Beit Hillel, the words of those who declare a matter impure, and those who declare it pure."

How we can interpret this discussion. An important point is that this discussion does not say that we should simply add both contradictory statements to our reasoning—this would cause a contradiction. What this discussion seems to say is that we need to keep these abstract statements separate—"in separate rooms":

• We should not consider these two statements together when we argue about abstract things, like whether some matter is pure or not—otherwise we will get a contradiction. So, when we try to decide how to act, we should not combine arguments of both sides.
• However, it seems to make sense to accept both:
– the practical advice coming from the arguments of the first school and
– the practical advice coming from the arguments of the second school.

Let us describe this interpretation in precise terms. We have a general set of statements. Some statements from this set are about practical consequences; we will call them practical. Intuitively, if two statements A and B are about practical consequences, then their logical combinations A & B, A ∨ B, and ¬A are also about practical consequences. Thus, the set of practical statements must be closed under such combinations. We also have a statement a which is abstract—i.e., which is not equivalent to any practical statement. What the Talmudic discussion seems to recommend is to consider both the practical consequences of the statement a and the practical consequences of its negation ¬a. Can this lead to a logically consistent set of recommendations? Let us describe this setting in precise terms.
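Before the formal treatment, this interpretation can be sanity-checked by brute force in a toy propositional setting (everything here is an illustrative choice, not part of the formal construction: practical variables p1 and p2, an abstract variable q, and the theory "q implies p1"):

```python
from itertools import product

# worlds: truth assignments to (p1, p2, q); p1, p2 are "practical", q "abstract"
WORLDS = list(product([False, True], repeat=3))

def theory(p1, p2, q):
    return (not q) or p1            # illustrative theory T: q -> p1

def practical_models(q_value):
    """Projections onto (p1, p2) of the models of T with the given value of q."""
    return {(p1, p2) for (p1, p2, q) in WORLDS
            if theory(p1, p2, q) and q == q_value}

# A practical world satisfies *every* practical consequence of T + (q = v)
# exactly when it lies in the corresponding projection; so the union of the
# consequences of q and of "not q" is consistent iff the projections intersect.
assert practical_models(True) & practical_models(False)

# q is abstract here: its value is not determined by the practical part
# (for p1 = True, both q = True and q = False are compatible with T)
assert any(theory(True, p2, True) and theory(True, p2, False)
           for p2 in (False, True))
```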
Definition.

• By a signature Σ, we mean a finite list of:
– variable types t1, t2, …,
– constants c1, c2, …, of different types,
– predicate symbols P1(x1, …, x_{n1}), P2(x1, x2, …, x_{n2}), …, and function symbols f1(x1, …, x_{m1}), f2(x1, …, x_{m2}), …, where the xi are variables of the given types.
• By a formula in the signature Σ, we mean any formula obtained by using the logical connectives &, ∨, ¬, →, ↔ and the quantifiers ∃x and ∀x over variables of each type.
• We say that a formula is closed if each variable is covered by some quantifier. Closed formulas will also be called statements.
• By a theory T, we mean a finite list of statements.
• We say that a statement f is true if it follows from the statements of T.
• We say that a statement f is independent if neither this statement nor its negation is true.
• We say that a theory T is consistent if whenever it implies a statement s, it does not imply the opposite statement ¬s.
• Let σ be a subset of Σ—one that includes only some of the types, constants, predicates, and function symbols from Σ. Statements using only variable types, constants, predicates, and functions from the signature σ will be called practical.
• We say that a statement a is abstract if this statement is not equivalent to any practical statement.

Proposition. Let T be a consistent theory, and let a be an abstract independent statement. Then, if you add to the set of all true practical statements:

• all the practical statements that follow from a, and
• all the practical statements that follow from ¬a,

the resulting set of statements will still be consistent.

Discussion. So, if we add the practical consequences of an abstract statement and of its negation, we will indeed get no contradiction. In this sense, Bohr's statement is indeed logically consistent.

Proof. The above proposition—and its proof—extend a similar result proved in our earlier paper [2]. Let us prove this proposition by contradiction.
Let us assume that the enlarged set is inconsistent, i.e., that it contains both a practical statement p and its negation ¬p. These two statements p and ¬p cannot both follow from a—otherwise, that would mean that a implies a contradiction and thus that a is not true—while we assumed that a is independent. Similarly, the two statements p and ¬p cannot both follow from ¬a. Thus, one of these two statements p and ¬p follows from a, and the other one from ¬a. If ¬p follows from a, then we can rename the statements, taking p′ = ¬p, so that ¬p′ = p. So, without losing generality, we can conclude that a implies p and ¬a implies ¬p.
From the fact that ¬a implies ¬ p, it follows that p implies a. Thus, a implies p and p implies a, i.e., the statements a and p are equivalent. However, we assumed that the statement a is abstract, so it cannot be equivalent to any practical statement. This contradiction shows that the enlarged set cannot be inconsistent—i.e., that the enlarged set is consistent. The proposition is proven. Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. N. Bohr, Discussion with Einstein on epistemological problems in atomic physics, in Albert Einstein: Philosopher-Scientist. ed. by P.A. Schlipp (Open Court, Peru, Illinois, 1949), pp. 201– 241; reprinted in: J. Kalckar, Niels Bohr Collected Works, Vol. 7, Foundations of Quantum Physics II (1933–1958) (Elsevier, Amsterdam, The Netherlands, 1996), pp. 339–381 2. H.T. Nguyen, V. Kreinovich, Classical-logic analogue of a fuzzy ‘paradox’, in Proceedings of the 1996 IEEE International Conference on Fuzzy Systems, vol. 1 (New Orleans, 1996), pp. 428–431 3. Tosefta Sotah 7:12. https://www.sefaria.org/Tosefta_Sotah.7.12
How to Detect (and Analyze) Independent Subsystems of a Black-Box (or Grey-Box) System Saeid Tizpaz-Niari, Olga Kosheleva, and Vladik Kreinovich
Abstract Often, we deal with black-box or grey-box systems where we can observe the overall system’s behavior, but we do not have access to the system’s internal structure. In many such situations, the system actually consists of two (or more) independent components: (a) how can we detect this based on observed system’s behavior? (b) what can we learn about the independent subsystems based on the observation of the system as a whole? In this paper, we provide (partial) answers to these questions.
1 Need to Detect (and Determine) Independent Subsystems: Formulation of the Problem

1.1 Black-Box and Grey-Box Systems

Often, we only have so-called black-box access to a system: namely, we can check how the system reacts to different inputs, but we do not know what exactly is happening inside this system. For example, we may have proprietary software: we can feed different inputs to this software and observe the results, but we do not know how exactly each result is generated. In other cases, we have what is known as grey-box access: we have some information about the system, but not enough to find out what is happening inside the system. Such situations are also typical in biomedicine, in physics, and in engineering—e.g., when we try to reverse engineer a proprietary system.

S. Tizpaz-Niari · O. Kosheleva · V. Kreinovich (B)
University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
S. Tizpaz-Niari
e-mail: [email protected]
O. Kosheleva
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_40
1.2 Independent Subsystems: How Can We Detect Them?

In many practical situations, a black-box system consists of two or more independent subsystems. For physical systems, the overall energy of the system—which we can observe—is equal to the sum of the energies of its components (which we cannot observe); see, e.g., [2, 3]. Can we detect, based on the observed energies, that the system consists of two subsystems? And if yes, can we determine the energies of the subsystems? Similarly, if a software system consists of two independent components, then each observable state of the system as a whole consists of the states of the two components. Since the components are independent, the probability of each such state is equal to the product of the probabilities of the corresponding states of the two components. Can we detect, based on the observed probabilities of different states, that the system consists of two subsystems? And if yes, can we determine the probabilities corresponding to the subsystems?
1.3 What We Do in This Paper

Sometimes, the accuracy with which we measure energy or probability is very low. In such cases, we can hardly make any conclusions about the system's structure. In this paper, we consider the frequent case when we know these values with high accuracy, so that in the first approximation, we can safely ignore the corresponding uncertainty and assume that we know the values of the energies or probabilities. In such cases, we show that the detection of subsystems—and the determination of their energies or probabilities—is possible (and computationally feasible) in almost all situations.
2 Formulation of the Problem in Precise Terms and the Main Result: Case of Addition

In this section, we consider the case of addition.

Definition 1

• Let A and B be finite sets of real numbers, each of which has more than one element. We will call A the set of possible values of the first subsystem and B the set of possible values of the second subsystem.
• For each pair (A, B), by the observed set, we mean the set

A + B = {a + b : a ∈ A, b ∈ B}

of possible sums a + b when a ∈ A and b ∈ B.
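The observed set is a two-line computation, and it already exhibits the basic ambiguity of the reconstruction problem (the integer sets here are arbitrary examples): shifting A up by a constant c and B down by the same c leaves the observed set unchanged.

```python
def observed_set(A, B):
    # the set of all possible sums a + b
    return {a + b for a in A for b in B}

A = {0, 1, 3}
B = {0, 5}
S = observed_set(A, B)               # {0, 1, 3, 5, 6, 8}

# shifting A by +c and B by -c produces the very same observed set,
# so reconstruction is only possible modulo such a shift
c = 7
assert observed_set({a + c for a in A}, {b - c for b in B}) == S
```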
Comment. In mathematics, the set A + B is known as the Minkowski sum of the sets A and B.

Discussion. The problem is, given the observed set A + B, to reconstruct A and B. Of course, we cannot reconstruct A and B exactly, since if we add a constant c to all the elements of A and subtract c from all the elements of B, the observed set A + B will remain the same. Indeed, in this case, each pair of elements a ∈ A and b ∈ B gets transformed into a + c and b − c, so we have (a + c) + (b − c) = a + b and thus, indeed, each sum a + b ∈ A + B remains the same. So, by the ability to reconstruct, we mean the ability to reconstruct modulo such an addition-subtraction. In some cases, it is not possible to reconstruct the components: for example, for A = {0, 1} and B = {0, 2}, the Minkowski sum A + B = {0, 1, 2, 3} can also be represented as {0, 1} + {0, 1, 2}. This alternative representation has a different number of elements and thus, cannot be obtained from the original one by addition and subtraction of a constant. What we will show, however, is that in almost all cases, reconstruction is possible.

Definition 2 We say that the pair (A, B) is sum-generic if the following numbers are all different:

• all the non-zero differences a − a′ between elements of the set A,
• all the non-zero differences b − b′ between elements of the set B, and
• all the sums (a − a′) + (b − b′) of these non-zero differences.

Proposition 1 For every two natural numbers n > 1 and m > 1, the set of all non-sum-generic pairs (A, B) with n elements in A and m elements in B has Lebesgue measure 0.

Discussion. In other words, almost all pairs (A, B) are sum-generic.

Proof Indeed, the set of non-sum-generic pairs is a finite union of the sets described by equalities like ai − aj = ak − aℓ, ai − aj = bk − bℓ, etc. Each such set is a hyperplane in the (n + m)-dimensional space of all the pairs (A, B) = ({a1, …, an}, {b1, …, bm}) and thus, has measure 0.
The union of finitely many sets of measure 0 is also of measure 0. So, the proposition is proven.

Proposition 2 If (A, B) and (A′, B′) are sum-generic pairs for which A + B = A′ + B′, then:
• either there exists a constant c for which A′ = {a + c : a ∈ A} and B′ = {b − c : b ∈ B},
• or there exists a constant c for which A′ = {b − c : b ∈ B} and B′ = {a + c : a ∈ A}.
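Before turning to the proof, note that the condition of Definition 2 can be checked mechanically. The sketch below is our illustration (the function name is ours, and exact floating-point comparison is used only for this small example):

```python
def is_sum_generic(A, B):
    """Check the condition of Definition 2: all non-zero differences
    within A, all non-zero differences within B, and all sums of such
    differences must be pairwise distinct."""
    diffs_a = [a1 - a2 for a1 in A for a2 in A if a1 != a2]
    diffs_b = [b1 - b2 for b1 in B for b2 in B if b1 != b2]
    sums = [da + db for da in diffs_a for db in diffs_b]
    values = diffs_a + diffs_b + sums
    # sum-generic means: no two of the collected numbers coincide
    return len(values) == len(set(values))
```

On the failure example from the discussion above, A = {0, 1} and B = {0, 2}, the check correctly reports that the pair is not sum-generic, since the difference 1 − 0 within A coincides with one of the sums of differences.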
S. Tizpaz-Niari et al.
Discussion. Thus, in the sum-generic case, we can reconstruct the components A and B from the observed set S = A + B—as uniquely as possible.

Proof 1◦. Let us sort the elements of each of the four sets A, B, A′, and B′ in decreasing order: a1 > · · · > an, b1 > · · · > bm, a′1 > · · · > a′n′, and b′1 > · · · > b′m′.
2◦. The largest element of the set A + B is the sum a1 + b1 of the largest element a1 of the set A and the largest element b1 of the set B. The next largest element of the set A + B is obtained when we add the largest element of one of the sets and the second largest element of the other set, so it is either a1 + b2 or a2 + b1. These two numbers are different, since if they were equal, we would have a1 − a2 = b1 − b2, which contradicts the assumption that the pair (A, B) is sum-generic. Without loss of generality, let us assume that a1 + b2 > a2 + b1. Similarly, in the Minkowski sum A′ + B′, the largest element is a′1 + b′1, and the next element is either a′1 + b′2 or a′2 + b′1. If the value a′2 + b′1 is the second largest, let us rename A′ into B′ and B′ into A′. Thus, without loss of generality, we can assume that the element a′1 + b′2 is the second largest. Since the sets A + B and A′ + B′ coincide, this means that they have the same largest element and the same second largest element, i.e., that a1 + b1 = a′1 + b′1 and a1 + b2 = a′1 + b′2.
3◦. Let us prove that for c = a′1 − a1 (this equality is the definition of c), we have a′i = ai + c and b′j = bj − c for all i and j.

3.1◦. For i = 1, the condition a′1 = a1 + c follows from the definition of c.

3.2◦. Let us prove the desired equality for j = 1. Indeed, from the fact that a1 + b1 = a′1 + b′1, we conclude that b′1 = b1 − (a′1 − a1), i.e., that b′1 = b1 − c.

3.3◦. Let us now prove the desired equality for j = 2. Indeed, the difference between the largest and the second largest elements of the set A + B is equal to (a1 + b1) − (a1 + b2) = b1 − b2. Based on the set A′ + B′, we conclude that the same difference is equal to b′1 − b′2. From b′1 − b′2 = b1 − b2, we conclude that b′2 = b2 + (b′1 − b1) = b2 − c. So, the desired equality holds for j = 2 as well.

3.4◦. Let us prove the desired equality for all i.
Since the pair (A, B) is sum-generic, all the differences (ai + bj) − (ak + bℓ) corresponding to i ≠ k, or to i = k and (j, ℓ) ≠ (1, 2), are different from the difference b1 − b2. So, the only pairs of elements from the set A + B whose difference is equal to b1 − b2 are differences of the type (ai + b1) − (ai + b2). There are exactly n such pairs corresponding to different values ai ∈ A, and the sorting of the largest elements of these pairs leads to the order a1 + b1 > a2 + b1 > · · · > an + b1. Similarly, in the set A′ + B′, we have exactly n′ pairs whose difference is equal to b′1 − b′2, and the sorting of the largest elements of these pairs leads to the order a′1 + b′1 > a′2 + b′1 > · · · > a′n′ + b′1. Since A + B = A′ + B′ and b1 − b2 = b′1 − b′2, these two orders must coincide, so we must have n = n′ and ai + b1 = a′i + b′1 for all i. Thus, indeed, a′i = ai − (b′1 − b1) = ai + c.

3.5◦. Finally, let us prove the desired equality for all j. Indeed, since the pair (A, B) is sum-generic, all the differences (ai + bj) − (ak + bℓ) corresponding to j ≠ ℓ, or to j = ℓ and (i, k) ≠ (1, 2), are different from the difference a1 − a2. So, the only pairs of elements from the set A + B whose difference is equal to a1 − a2 are differences of the type (a1 + bj) − (a2 + bj). There are exactly m such pairs corresponding to different values bj ∈ B, and the sorting of the largest elements of these pairs leads to the order a1 + b1 > a1 + b2 > · · · > a1 + bm. Similarly, in the set A′ + B′, we have exactly m′ pairs whose difference is equal to a′1 − a′2, and the sorting of the largest elements of these pairs leads to the order a′1 + b′1 > a′1 + b′2 > · · · > a′1 + b′m′. Since A + B = A′ + B′ and a1 − a2 = a′1 − a′2, these two orders must coincide, so we must have m = m′ and a1 + bj = a′1 + b′j for all j. Thus, indeed, b′j = bj − (a′1 − a1) = bj − c. The proposition is proven.
Proposition 3 There exists a quadratic-time algorithm that, given a finite set S of real numbers: • checks whether this set can be represented as S = A + B for some sum-generic pair (A, B), and
• if S can be thus represented, computes the elements of the corresponding sets A and B.

Proof We are given the observed set S. If this set can be represented as a Minkowski sum, we want to find the values a1 > · · · > an and b1 > · · · > bm for which the sums ai + bj are exactly the elements of the given set S. In our algorithm, we will follow the steps of the previous proof.
1◦. Let us first sort all s elements of the given observed set S in decreasing order: e1 > e2 > · · · > es. Sorting requires time O(s · ln(s)); see, e.g., [1].

2◦. Let us take a1 = e1 and b1 = 0; then we have a1 + b1 = e1. Let us also take b2 = −(e1 − e2); then e2 = a1 + b2 and b1 − b2 = e1 − e2. According to the previous proof, this will work if S is the desired Minkowski sum.

3◦. Let us now try all pairs (ei, ej) and find all the pairs for which ei − ej = e1 − e2. Testing all the pairs requires time O(s²). If S is the desired Minkowski sum, then, as we have shown in the previous proof, the number of such pairs is n—the number of elements in the set A—and if we sort the largest elements of these pairs in decreasing order, we will get elements u1 = a1 + b1 > u2 = a2 + b1 > · · · > un = an + b1. Thus, based on these elements ui, we get ai = ui − b1, i.e., since we chose b1 = 0, we get ai = ui.

4◦. Let us now find all the pairs (ei, ej) for which ei − ej = a1 − a2. Testing all the pairs requires time O(s²). If S is the desired Minkowski sum, then, as we have shown in the previous proof, the number of such pairs is m—the number of elements in the set B—and if we sort the largest elements of these pairs in decreasing order, we will get elements v1 = a1 + b1 > v2 = a1 + b2 > · · · > vm = a1 + bm. Thus, based on these elements vj, we get bj = vj − a1, i.e., since we chose a1 = e1, we get bj = vj − e1.

5◦. Finally, we form the list of all the sums ai + bj, sort it (which takes time O(s · ln(s))), and check that the sorted list coincides element-by-element with the original sorted list e1 > e2 > · · ·. If it does, this means that the given set can be represented as A + B—and we have the desired sets A and B. If the two lists are different, this means—according to the previous proof—that the given set S cannot be represented as the Minkowski sum. The overall computation time is equal to
O(s · ln(s)) + O(s²) + O(s²) + O(s · ln(s)) = O(s²).

The proposition is proven.

First numerical example. Let us illustrate the above algorithm on a simple example when A = {1, 2} and B = {1, 1.3}. In this case, the observed set is S = A + B = {2, 2.3, 3, 3.3}. Let us show how the above algorithm will, given this set S, reconstruct the component-related sets A and B.

1◦. Sorting the elements of the set S in decreasing order leads to e1 = 3.3 > e2 = 3 > e3 = 2.3 > e4 = 2.
2◦ . According to the algorithm, we then take a1 = e1 = 3.3, b1 = 0, and b2 = −(e1 − e2 ) = −(3.3 − 3) = −0.3. 3◦ . Then, we find all pairs (ei , e j ) for which ei − e j = e1 − e2 = 0.3. There are exactly n = 2 such pairs: • the pair (3.3, 3), and • the pair (2.3, 2). So, we conclude that n = 2. We then sort the largest elements of these pairs—i.e., the values 3.3 and 2.3—in decreasing order: u 1 = 3.3 > u 2 = 2.3, and take a1 = 3.3 and a2 = 2.3. 4◦ . Let us now find all the pairs (ei , e j ) for which ei − e j = a1 − a2 = 3.3 − 2.3 = 1. There are exactly m = 2 such pairs: • the pair (3.3, 2.3), and • the pair (3, 2). So, we conclude that m = 2. We then sort the largest elements of these pairs—i.e., the values 3.3 and 3—in decreasing order: v1 = 3.3 > v2 = 3, and take b1 = v1 − e1 = 3.3 − 3.3 = 0 and b2 = 3 − 3.3 = −0.3. 5◦ . Finally, we form all the sums ai + b j : a1 + b1 = 3.3 + 0 = 3.3, a2 + b1 = 2.3 + 0 = 2.3, a1 + b2 = 3.3 + (−0.3) = 3, a2 + b2 = 2.3 + (−0.3) = 2.
We then sort these sums into a decreasing sequence: 3.3 > 3 > 2.3 > 2. This is exactly the given set S. Thus, this set does correspond to independent components, and we have found the sets A and B corresponding to these components. Namely, we found the sets A = {3.3, 2.3} and B = {0, −0.3}. If we subtract 1.3 from the elements of this A and add 1.3 to the elements of this B, we get exactly the original sets A and B.

Second numerical example. Let us illustrate the above algorithm on an example when the observed set S = {2, 2.7, 3, 3.3} cannot be represented as a Minkowski sum.

1◦. Sorting the elements of the set S in decreasing order leads to e1 = 3.3 > e2 = 3 > e3 = 2.7 > e4 = 2.
2◦. According to the algorithm, we then take a1 = e1 = 3.3, b1 = 0, and b2 = −(e1 − e2) = −(3.3 − 3) = −0.3.

3◦. Then, we find all pairs (ei, ej) for which ei − ej = e1 − e2 = 0.3. There are exactly n = 2 such pairs:
• the pair (3.3, 3), and
• the pair (3, 2.7).
So, we conclude that n = 2. We then sort the largest elements of these pairs—i.e., the values 3.3 and 3—in decreasing order: u1 = 3.3 > u2 = 3, and take a1 = 3.3 and a2 = 3.

4◦. Let us now find all the pairs (ei, ej) for which ei − ej = a1 − a2 = 3.3 − 3 = 0.3. There are exactly m = 2 such pairs:
• the pair (3.3, 3), and
• the pair (3, 2.7).
So, we conclude that m = 2. We then sort the largest elements of these pairs—i.e., the values 3.3 and 3—in decreasing order: v1 = 3.3 > v2 = 3, and take b1 = v1 − e1 = 3.3 − 3.3 = 0 and b2 = 3 − 3.3 = −0.3.

5◦. Finally, we form all the sums ai + bj:
a1 + b1 = 3.3 + 0 = 3.3, a2 + b1 = 3 + 0 = 3, a1 + b2 = 3.3 + (−0.3) = 3, a2 + b2 = 3 + (−0.3) = 2.7. We then sort these sums into a decreasing sequence: 3.3 > 3 = 3 > 2.7. This is different from the sorting of the given set S. Thus, the original set S does not correspond to independent components.
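The five steps of the algorithm, together with both numerical examples above, can be sketched in code. This is our illustration of the proof of Proposition 3, not an implementation supplied by the authors; the function name and the floating-point tolerance are ours:

```python
def decompose_sum(S, tol=1e-9):
    """Try to represent the finite set S as a Minkowski sum A + B of a
    sum-generic pair, following the steps of the proof of Proposition 3.
    Returns (A, B) with the normalization b1 = 0, or None."""
    e = sorted(S, reverse=True)          # step 1: e1 > e2 > ...
    if len(e) < 2:
        return None
    d = e[0] - e[1]                      # step 2: b1 - b2 = e1 - e2
    # step 3: pairs with difference e1 - e2; their larger elements
    # are a_i + b1 = a_i (since b1 = 0)
    A = sorted({x for x in e for y in e if abs((x - y) - d) < tol},
               reverse=True)
    if len(A) < 2:
        return None
    d2 = A[0] - A[1]                     # a1 - a2
    # step 4: pairs with difference a1 - a2; their larger elements
    # are a1 + b_j, so b_j = v_j - a1 = v_j - e1
    V = sorted({x for x in e for y in e if abs((x - y) - d2) < tol},
               reverse=True)
    B = [v - e[0] for v in V]
    # step 5: check that the sums a_i + b_j reproduce S exactly
    sums = sorted((a + b for a in A for b in B), reverse=True)
    if len(sums) == len(e) and all(abs(s - x) < tol
                                   for s, x in zip(sums, e)):
        return A, B
    return None
```

For S = {2, 2.3, 3, 3.3}, this sketch returns A = {3.3, 2.3} and B = {0, −0.3}, the same shifted copies of the original components found in the first example; for S = {2, 2.7, 3, 3.3}, it returns None, matching the second example. The two double loops keep the overall time quadratic in s, as in the proposition.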
3 Generalization to Probabilistic Setting with Multiplication

In this section, we consider the case of multiplication.

Definition 3
• Let P and Q be finite sets of positive real numbers, each of which has more than one element, and for each of which, the sum of all its elements is equal to 1. We will call P the set of possible probabilities of the first subsystem and Q the set of possible probabilities of the second subsystem.
• For each pair (P, Q), by the observed probabilities set, we mean the set P · Q = {p · q : p ∈ P, q ∈ Q} of possible products p · q when p ∈ P and q ∈ Q.

Observation: the product case can be reduced to the sum case. The logarithm of the product is equal to the sum of the logarithms. Thus, when the observed probabilities are equal to the products pi · qj, the logarithms of these observed probabilities are equal to the sums ai + bj, where ai = ln(pi) and bj = ln(qj). Thus, by taking logarithms, we can reduce this case to the case of the sum, and so get the following results.

Definition 4 We say that the pair (P, Q) is product-generic if the following numbers are all different:
• all the ratios p/p′ ≠ 1 between elements of the set P,
• all the ratios q/q′ ≠ 1 between elements of the set Q, and
• all the products (p/p′) · (q/q′) of these ratios that are not 1.

Proposition 4 For every two natural numbers n > 1 and m > 1, the set of all non-product-generic pairs (P, Q) with n elements in P and m elements in Q has Lebesgue measure 0.

Discussion. In other words, almost all pairs (P, Q) are product-generic.
Proposition 5 If (P, Q) and (P′, Q′) are product-generic pairs for which P · Q = P′ · Q′, then:
• either P′ = P and Q′ = Q,
• or P′ = Q and Q′ = P.

Discussion. If we did not have the condition that the sum of the P-probabilities be equal to 1, we would have uniqueness modulo multiplication of all probabilities by a constant. However, due to this condition, in the product-generic case, we can uniquely reconstruct the components P and Q from the observed probabilities set S = P · Q.

Proposition 6 There exists a quadratic-time algorithm that, given a finite set S of positive real numbers:
• checks whether this set can be represented as S = P · Q for some product-generic pair (P, Q), and
• if S can be thus represented, computes the elements of the corresponding sets P and Q.

Proof In line with the general reduction, we form the set L of logarithms of the elements of S, then apply the algorithm from Proposition 3 to these logarithms, and, if the set L can be represented as a Minkowski sum A + B, take pi = exp(ai) and qj = exp(bj). We can then normalize these probabilities to take into account that the sum of the piact and the sum of the qjact must each equal 1, i.e., we should take piact = pi/(p1 + · · · + pn) and qjact = qj/(q1 + · · · + qm).
Alternatively, we can avoid computing the exp and ln functions, and deal directly with ratios and products instead of sums and differences of logarithms. Let us describe this idea in detail. We are given the observed set S. If this set can be represented as a product, we want to find the values p1 > · · · > pn and q1 > · · · > qm for which the products pi · qj are exactly the elements of the given set S.

1◦. Let us first sort all s elements of the given observed set S in decreasing order: e1 > e2 > · · · > es. As we have mentioned, sorting requires time O(s · ln(s)).

2◦. Let us take p1 = e1 and q1 = 1; then we have p1 · q1 = e1. Let us also take q2 = e2/e1; then e2 = p1 · q2 and q1/q2 = e1/e2.
3◦. Let us now try all pairs (ei, ej) and find all the pairs for which ei/ej = e1/e2. Testing all the pairs requires time O(s²). If S is the desired product, then the number of such pairs is n—the number of elements in the set P—and if we sort the largest elements of these pairs in decreasing order, we will get elements u1 = p1 · q1 > u2 = p2 · q1 > · · · > un = pn · q1. Thus, based on these elements ui, we get pi = ui/q1, i.e., since we chose q1 = 1, we get pi = ui.

4◦. Let us now find all the pairs (ei, ej) for which ei/ej = p1/p2. Testing all the pairs requires time O(s²). If S is the desired product, then, as we have shown in the previous proof, the number of such pairs is m—the number of elements in the set Q—and if we sort the largest elements of these pairs in decreasing order, we will get elements v1 = p1 · q1 > v2 = p1 · q2 > · · · > vm = p1 · qm. Thus, based on these elements vj, we get qj = vj/p1, i.e., since we chose p1 = e1, we get qj = vj/e1.

5◦. Finally, we form the list of all the products pi · qj, sort it (which takes time O(s · ln(s))), and check that the sorted list coincides element-by-element with the original sorted list e1 > e2 > · · ·. If it does, this means that the given set can be represented as P · Q—and we have (after normalization) the desired sets P and Q. If the two lists are different, this means that the given set S cannot be represented as the product.

Numerical example. Let us consider the case when P = {0.2, 0.8} and Q = {0.3, 0.7}. In this case, the set of observed probabilities is S = {0.06, 0.24, 0.14, 0.56}. Let us show how the above algorithm will, given this set S, reconstruct the component-related sets P and Q.

1◦. Let us first sort all the elements of the given observed probabilities set in decreasing order: e1 = 0.56 > e2 = 0.24 > e3 = 0.14 > e4 = 0.06.
2◦ . Let us take p1 = e1 = 0.56 and q1 = 1, then we have p1 · q1 = e1 . Let us also take q2 = e2 /e1 = 0.24/0.56 = 3/7, then e2 = p1 · q2 and q1 /q2 = e1 /e2 . 3◦ . Let us now try all pairs (ei , e j ) and find all the pairs for which ei /e j = e1 /e2 = 0.56/0.24 = 7/3. There are exactly n = 2 such pairs: • the pair (0.56, 0.24) and • the pair (0.14, 0.06). Thus, we conclude that n = 2. If we sort the largest elements of these pairs in the decreasing order, we will get elements
u1 = e1 = 0.56 > u2 = e3 = 0.14. Thus, based on these elements ui, we get pi = ui/q1, i.e., since we chose q1 = 1, we get pi = ui, i.e., p1 = 0.56 and p2 = 0.14.

4◦. Let us now find all the pairs (ei, ej) for which ei/ej = p1/p2 = 0.56/0.14 = 4. There are exactly two such pairs:
• the pair (0.56, 0.14) and
• the pair (0.24, 0.06).
If we sort the largest elements of these pairs in decreasing order, we will get elements v1 = 0.56 > v2 = 0.24. Thus, based on these elements vj, we get qj = vj/e1, i.e., q1 = 0.56/0.56 = 1 and q2 = 0.24/0.56 = 3/7.

5◦. Finally, we form the list of all the products pi · qj: p1 · q1 = 0.56 · 1 = 0.56, p2 · q1 = 0.14 · 1 = 0.14, p1 · q2 = 0.56 · (3/7) = 0.24, p2 · q2 = 0.14 · (3/7) = 0.06. If we sort these numbers in decreasing order, we get 0.56 > 0.24 > 0.14 > 0.06, i.e., exactly the original sorted sequence of elements of the set S. Thus, the set S can be represented as a combination of two independent components.

To find the actual probabilities piact and qjact of each component, we need to normalize the values pi and qj:

p1act = p1/(p1 + p2) = 0.56/(0.56 + 0.14) = 0.56/0.7 = 0.8;

p2act = p2/(p1 + p2) = 0.14/(0.56 + 0.14) = 0.14/0.7 = 0.2;

q1act = q1/(q1 + q2) = 1/(1 + 3/7) = 1/(10/7) = 7/10 = 0.7;

q2act = q2/(q1 + q2) = (3/7)/(1 + 3/7) = (3/7)/(10/7) = 3/10 = 0.3.
This is exactly what we needed to reconstruct.
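The direct, logarithm-free version of the algorithm can be sketched as follows. This is our illustration of the proof of Proposition 6 (the function name and the floating-point tolerance are ours), including the final normalization step:

```python
def decompose_product(S, tol=1e-9):
    """Try to represent a set S of observed probabilities as P * Q for
    a product-generic pair (P, Q), following the steps of the proof of
    Proposition 6; normalize each factor to sum to 1. Returns (P, Q)
    or None."""
    e = sorted(S, reverse=True)              # step 1: e1 > e2 > ...
    if len(e) < 2:
        return None
    r = e[0] / e[1]                          # step 2: q1/q2 = e1/e2
    # step 3: pairs with ratio e1/e2; their larger elements are
    # p_i * q1 = p_i (since q1 = 1)
    P = sorted({x for x in e for y in e if abs(x / y - r) < tol},
               reverse=True)
    if len(P) < 2:
        return None
    r2 = P[0] / P[1]                         # p1/p2
    # step 4: pairs with ratio p1/p2; their larger elements are
    # p1 * q_j, so q_j = v_j / p1 = v_j / e1
    V = sorted({x for x in e for y in e if abs(x / y - r2) < tol},
               reverse=True)
    Q = [v / e[0] for v in V]
    # step 5: check that the products p_i * q_j reproduce S
    prods = sorted((p * q for p in P for q in Q), reverse=True)
    if len(prods) != len(e) or any(abs(s - x) >= tol
                                   for s, x in zip(prods, e)):
        return None
    # normalization: each factor's probabilities must sum to 1
    sum_p, sum_q = sum(P), sum(Q)
    return [p / sum_p for p in P], [q / sum_q for q in Q]
```

On the example above, this sketch recovers P = {0.8, 0.2} and Q = {0.7, 0.3}, exactly as in the worked computation.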
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. Th.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press, Cambridge, Massachusetts, 2009) 2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005) 3. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
Applications to Psychology and Decision Making
Why Decision Paralysis Sean Aguilar and Vladik Kreinovich
Abstract If a person has a small number of good alternatives, this person can usually make a good decision, i.e., select one of the given alternatives. However, when we have a large number of good alternatives, people take much longer to make a decision—sometimes so long that, as a result, no decision is made. How can we explain this seemingly non-optimal behavior? In this paper, we show that this “decision paralysis” can be naturally explained by using the usual decision making ideas.
1 Formulation of the Problem

Decision paralysis: a paradoxical human behavior. If we have a few good alternatives, it is usually easy to select one of them. If we add more good alternatives, the decision situation becomes even more favorable, since some of these new alternatives may be better than the ones we had before. One would therefore expect that, in general, the more alternatives, the better the person's decision will be. However, in practice, often, the opposite happens:
• when a person faces a few good alternatives, this person usually makes a reasonable decision, while
• when the number of possible alternatives becomes large, a decision maker sometimes spends a lot of time deciding which alternative to select and, as a result, does not select any of these good alternatives at all.
This phenomenon is known as decision paralysis; see, e.g., [6, 12] and references therein.

S. Aguilar
College of Business Administration, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_41
Example. When a store carries one or two types of cereal, many customers buy cereal. However, when a supermarket carries dozens of different brands of cereal, a much larger proportion of customers end up not buying any cereal at all.

How can we explain this non-optimal behavior? From the person's viewpoint, not selecting any of the good alternatives is a worse outcome than selecting one of them. So why do people exhibit such a non-optimal behavior? Usual explanations for this phenomenon are based on psychology—people are not very confident in their ability to make decisions, etc. However, while this explains how exactly people end up making non-optimal decisions, it does not explain why we humans—the product of billions of years of evolution—make such obviously non-optimal decisions.

What we do in this paper. In this paper, we try to explain the decision paralysis phenomenon from the viewpoint of decision making. We show that it is exactly the desire to make an optimal decision that leads to the decision paralysis phenomenon.
2 Our Explanation

Rational decision making. According to decision theory, decisions by a rational person are equivalent to maximizing the value of a certain quantity known as utility; see, e.g., [4, 5, 7–11]. From this viewpoint, if we have n alternatives A1, …, An, then a rational decision is to select the alternative for which the utility u(Ai) is the largest.

How can we make this rational decision: a general description. A natural idea is to compute all n utility values and to select the alternative i for which the utility u(Ai) is the largest.

How can we make a decision: towards a more detailed description. For each real number, an exact representation requires infinitely many digits. In real life, we can only generate finitely many digits, i.e., we can only compute this number with some accuracy. Of course, the more accuracy we want, the longer it takes to compute the value with this accuracy. Thus, a natural idea is to compute only with the accuracy which is sufficient for our purpose. For example, if we select the world champion in a sprint, then we sometimes need an accuracy of 0.01 seconds to decide on the winner, but if we want to decide whether a person has a fever or not, there is no need to measure this person's temperature with very high accuracy. So, we compute all the utilities and find the one whose utility is the largest. We can only compute each utility with some accuracy, and the more accuracy we want, the longer it takes to compute with this accuracy.

The more alternatives we have, the more time we need to make a decision: explanation of decision paralysis. If we have only two alternatives, with utilities
Why Decision Paralysis
255
on some interval, e.g., [0, 1], then on average, these two values are not very close, so a low accuracy is sufficient to decide which is larger. On the other hand, if the number n of alternatives is large, the average distance between the utilities becomes much smaller: on average, this distance is equal to 1/(n + 1); see, e.g., [1–3]. Thus, the more alternatives we need to consider, the larger the accuracy with which we need to compute all the utilities to select the best one. Hence, the more time we need to compute all the utilities. This explains the observed decision paralysis phenomenon. Acknowledgements This work was supported in part by the National Science Foundation grants: • 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and • HRD-1834620 and HRD-2034030 (CAHSI Includes). It was also supported: • by the AT&T Fellowship in Information Technology, • by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and • by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
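The 1/(n + 1) estimate used in the explanation above is easy to confirm by simulation. The sketch below (ours) estimates the expected gap between the largest and second-largest of n utilities drawn uniformly from [0, 1]; by the order-statistics results cited above, this gap is 1/(n + 1) on average:

```python
import random

def mean_top_gap(n, trials=100_000, seed=0):
    """Estimate the expected gap between the largest and the
    second-largest of n values drawn uniformly from [0, 1].
    Order statistics give the exact value 1/(n + 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n))
        total += xs[-1] - xs[-2]
    return total / trials
```

For n = 2 the estimate is close to 1/3, and for n = 9 it is close to 1/10: the more alternatives, the finer the accuracy needed to separate the best one, and hence the longer the computation.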
References 1. M. Ahsanullah, V.B. Nevzorov, M. Shakil, An Introduction to Order Statistics (Atlantis Press, Paris, 2013) 2. B.C. Arnold, N. Balakrishnan, H.N. Nagaraja, A First Course in Order Statistics (Society for Industrial and Applied Mathematics (SIAM), Philadelphia, Pennsylvania, 2008) 3. H.A. David, H.N. Nagaraja, Order Statistics (Wiley, New York, 2003) 4. P.C. Fishburn, Utility Theory for Decision Making (Wiley, New York, 1969) 5. P.C. Fishburn, Nonlinear Preference and Utility Theory (The Johns Hopkins Press, Baltimore, Maryland, 1988) 6. F. Huber, S. Köcher, J. Vogel, F. Meyer, Dazing diversity: investigating the determinants and consequences of decision paralysis. Psychol. Mark. 29(6), 467–478 (2012) 7. V. Kreinovich, Decision making under interval uncertainty (and beyond), in Human-Centric Decision-Making Models for Social Sciences. ed. by P. Guo, W. Pedrycz (Springer, 2014), pp. 163–193 8. R.D. Luce, H. Raiffa, Games and Decisions: Introduction and Critical Survey (Dover, New York, 1989) 9. H.T. Nguyen, O. Kosheleva, V. Kreinovich, Decision making beyond Arrow's 'impossibility theorem', with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009) 10. H.T. Nguyen, V. Kreinovich, B. Wu, G. Xiang, Computing Statistics under Interval and Fuzzy Uncertainty (Springer, Berlin, Heidelberg, 2012) 11. H. Raiffa, Decision Analysis (McGraw-Hill, Columbus, Ohio, 1997) 12. B. Talbert, Overthinking and other minds: the analysis paralysis. Soc. Epistemol. 31(6), 545–556 (2017)
Why Time Seems to Pass Slowly for Unpleasant Experiences and Quickly for Pleasant Experiences: An Explanation Based on Decision Theory Laxman Bokati and Vladik Kreinovich
Abstract It is known that our perception of time depends on our level of happiness: time seems to pass slower when we have unpleasant experiences and faster if our experiences are pleasant. Several explanations have been proposed for this effect. However, these explanations are based on specific features of human memory and/or human perception, features that, in turn, need explaining. In this paper, we show that this effect can be explained on a much more basic level of decision theory, without utilizing any specific features of human memory or perception.
1 Formulation of the Problem

Perceived time is different from actual time: empirical fact. Many of us have felt that time seems to pass quickly for pleasant experiences and slowly for unpleasant ones. We seem to overestimate the time elapsed during unpleasant experiences and underestimate the time elapsed during pleasant ones. For example:
• If we are watching a movie we really like, or if we are out on a date with someone we love, hours go by and it feels like only a few minutes have passed.
• On the other hand, if we are in a situation where we have to conceal something from people around us—a situation that makes us uncomfortable—we cannot wait for the focus of people or the topic of conversation to be shifted to someone or something else; in such a situation, even a few seconds feel like long minutes; see, e.g., [10].
L. Bokati Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_42
In all these cases, while the actual time for those experiences doesn’t change, our perception of this time changes depending on the experience. How is this phenomenon explained now. There are several explanations for this effect; see e.g. [1–3, 13], but these explanations are based on specific features of human memory and/or human perception, features that, in turn, need explaining. What we do in this paper. In this paper, we show that this effect can be explained on a much more basic level of decision theory, without utilizing any specific features of human memory or perception.
2 Decision Theory: A Brief Reminder and the Resulting Explanation

What is decision theory. Decision theory describes the preferences of a rational agent—i.e., a person who, e.g., if he/she prefers option A to option B and prefers B to C, will always prefer option A when provided with the two options A and C; see, e.g., [5, 7, 9, 11, 12].

Comment. In real life, people are rarely fully rational, since our ability to process information in a short time and make optimal decisions is limited; see, e.g., [6, 8]. However, decision theory still provides a reasonably accurate description of human behavior.

Utility: the main notion of decision theory. Decision theory shows that for each rational agent, we can assign a number—known as utility—to each possible alternative, so that in each situation of choice, the agent selects an alternative whose utility is the largest [5, 7, 9, 11, 12].

Utility is influenced by past events as well. People's moods and preferences are affected not only by the current events, but also by the events that happened in the past. This makes sense: past experiences help make reasonable decisions. In decision theory terms, preferences are described by the utility values. In these terms, a person's current utility value depends not only on the situation at the present moment of time t0, but also on the utilities u1, u2, …, at the previous moments of time t1 > t2 > . . .

How can we describe this influence: the notion of discounting. Let us denote by u0 the utility that we would have gotten at the current moment of time t0 if the past did not influence our current decision making. To describe this influence, we need to find out how the current utility value u depends on this utility u0 and on the utilities u1, u2, etc., at the previous moments of time, i.e., what is the dependence

u = f(u0, u1, u2, . . .)
(1)
The effect of past events on our behavior is relatively small; the largest factor in our decisions is still the current situation. Since the effect of the values u1, u2, …,
etc., is small, we can do what physicists do in similar situations (see, e.g., [4, 14]): expand the dependence (1) in a Taylor series in terms of the ui and keep only the linear terms in this expansion. Thus, we get the following formula:

u = f0(u0) + f1(u0) · u1 + f2(u0) · u2 + . . .
(2)
When there is no influence of past events, the resulting utility is simply equal to u0—by definition of u0. Thus, we have

f0(u0) = u0, (3)

so

u = u0 + f1(u0) · u1 + f2(u0) · u2 + . . . (4)

The value fi(u0) depends on the moment of time ti—naturally, the more recent moments affect us more, and the more distant ones affect us less. Let us describe this dependence by writing fi(u0) as F(u0, ti). Here, the function F(u0, t) is non-negative—since positive past events make us feel better, while negative past events make us feel worse. As the time t decreases, the effect decreases, i.e., the function F(u0, t) is increasing, going from the zero limit value of F(u0, t) when t → −∞ to a positive value of F(u0, t) for moments t which are close to the current moment t0. Thus, the formula (4) takes the following form:

u = u0 + F(u0, t1) · u1 + F(u0, t2) · u2 + . . .
(5)
Comment. As we have mentioned, the past values u 1 , u 2 , etc., affect the utility less than the current value u 0 . In economics, a similar decrease of value with time is a particular example of a discount. Because of this analogy—and since economics is one of the main application areas of decision theory—this decrease is known as discounting. Time is subjective. In contrast to computers, which have a reasonably precise internal clock, we humans operate on subjective time: sometimes our mental processes go faster, sometimes they go slower—e.g., when we are sleepy or asleep. The empirical phenomenon that we are trying to explain is exactly about the difference between this subjective time and clock-measured physical time. To a large extent, whether we slow down or speed up our perception of time is within our brain's control. So how does the brain select whether to slow down or to speed up subjective time? Let us apply decision theory to our selection of subjective time. In line with the general ideas of decision theory, our brain decides whether to slow down or to speed up depending on which option leads to the larger utility. How does subjective time affect our utility? Subjective time is all we observe, all we feel. Thus, in the formula (5) describing current utility, we should use subjective
times s1 , s2 , etc., instead of the actual (physical) moments of time ti . In other words, in view of the fact that the subjective time may be different from the physical time, the formula (5) describing current utility value takes the following form: u = u 0 + F(u 0 , s1 ) · u 1 + F(u 0 , s2 ) · u 2 + . . .
(6)
Interestingly, this formula already leads to the desired explanation. Let us explain how. Resulting explanation: case of pleasant experiences. If we are in the middle of a positive time period, i.e., if the current experience and all recent past experiences are positive, this means that u i > 0 for all i. In this case, the larger each coefficient F(u 0 , si ), the larger the resulting sum (6). We have mentioned that the function F(u 0 , t) is an increasing function of time. Thus, to increase the value F(u 0 , si ), we need to select the subjective value si of the i-th past moment as large as possible—i.e., bring it as close to the current moment of time as possible, i.e., take si > ti . In this case, the subjective time duration t0 − si decreases in comparison with the corresponding period t0 − ti of physical time, i.e., subjective time goes faster—exactly as we observe. Resulting explanation: case of unpleasant experiences. On the other hand, if we are in the middle of a negative time period, i.e., if the current experience and all recent past experiences are negative, this means that u i < 0 for all i. In this case, the smaller each coefficient F(u 0 , si ), the larger the resulting sum (6). We have mentioned that the function F(u 0 , t) is an increasing function of time. Thus, to decrease the value F(u 0 , si ), we need to select the subjective value si of the i-th past moment as small as possible—i.e., bring it as far away from the current moment of time as possible, i.e., take si < ti . In this case, the subjective time duration t0 − si increases in comparison with the corresponding period t0 − ti of physical time, i.e., subjective time goes slower—exactly as we observe. Conclusion. So, in both cases, decision theory indeed explains the desired phenomenon.
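This mechanism can be illustrated numerically. The sketch below (Python; the exponential form of F is our own illustrative assumption—the argument only requires F to be non-negative and increasing in t) checks that shifting subjective times toward the present increases the utility (6) when past utilities are positive, and shifting them away increases it when they are negative:

```python
import math

def F(u0, t, t0=0.0):
    # Illustrative discounting weight: non-negative, increasing in t,
    # tending to 0 as t -> -infinity, as required in the text.
    return math.exp(t - t0)

def total_utility(u0, past_utilities, subjective_times, t0=0.0):
    # Formula (6): u = u0 + F(u0, s1)*u1 + F(u0, s2)*u2 + ...
    return u0 + sum(F(u0, s, t0) * u
                    for u, s in zip(past_utilities, subjective_times))

physical_times = [-1.0, -2.0, -3.0]   # t1 > t2 > t3, all before t0 = 0

# Pleasant experiences (u_i > 0): moving subjective times closer to now helps.
pleasant = [1.0, 1.0, 1.0]
closer = [t + 0.5 for t in physical_times]    # s_i > t_i
assert total_utility(0.0, pleasant, closer) > \
       total_utility(0.0, pleasant, physical_times)

# Unpleasant experiences (u_i < 0): moving subjective times away helps.
unpleasant = [-1.0, -1.0, -1.0]
farther = [t - 0.5 for t in physical_times]   # s_i < t_i
assert total_utility(0.0, unpleasant, farther) > \
       total_utility(0.0, unpleasant, physical_times)
```

Both assertions hold for any non-negative F that increases in t, which is exactly the two-case argument above.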
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. A. Angrilli, P. Cherubini, A. Pavese, S. Manfredini, The influence of affective factors on time perception. Percept. Psychophys. 59(6), 972–982 (1997) 2. Y. Bar-Haim, A. Kerem, D. Lamy, D. Zakay, When time slows down: the influence of threat on time perception in anxiety. Cogn. Emot. 24(2), 255–263 (2010) 3. D.M. Eagleman, P.U. Tse, D. Buonomano, P. Janssen, A.C. Nobre, A.O. Holcombe, Time and the brain: how subjective time relates to neural time. J. Neurosci. 25(45), 10369–10371 (2005) 4. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005) 5. P.C. Fishburn, Utility Theory for Decision Making (Wiley, New York, 1969) 6. D. Kahneman, Thinking, Fast and Slow (Farrar, Straus, and Giroux, New York, 2011) 7. V. Kreinovich, Decision making under interval uncertainty (and beyond), in Human-Centric Decision-Making Models for Social Sciences. ed. by P. Guo, W. Pedrycz (Springer, 2014), pp.163–193 8. J. Lorkowski, V. Kreinovich, Bounded Rationality in Decision Making Under Uncertainty: Towards Optimal Granularity (Springer, Cham, Switzerland, 2018) 9. R.D. Luce, R. Raiffa, Games and Decisions: Introduction and Critical Survey (Dover, New York, 1989) 10. I. Matsuda, A. Matsumoto, H. Nittono, Time passes slowly when you are concealing something. Biol. Psychol. vol. 155, Paper 107932 (2020) 11. H.T. Nguyen, O. Kosheleva, V. Kreinovich, Decision making beyond Arrow’s ‘impossibility theorem’, with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009) 12. H. Raiffa, Decision Analysis (McGraw-Hill, Columbus, Ohio, 1997) 13. C. Stetson, M.P. Fiesta, D.M. Eagleman, Does time really slow down during a frightening event? PLOS One 2, 1–3 (2007) 14. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
How to Deal with Conflict of Interest Situations When Selecting the Best Submission Olga Kosheleva and Vladik Kreinovich
Abstract In many practical situations when we need to select the best submission— the best paper, the best candidate, etc.—there are so few experts that we cannot simply dismiss all the experts who have conflict of interest: we do not want them to judge their own submissions, but we would like to take into account their opinions of all other submissions. How can we take these opinions into account? In this paper, we show that a seemingly reasonable idea can actually lead to bias, and we explain how to take these opinions into account without biasing the final decision.
1 Formulation of the Problem Need for expert opinions. In many practical situations, we rely on human expertise. This happens when we review papers, this happens when we decide on an award, this happens when we decide which of the faculty candidates to hire, etc. Usually, each expert i provides a numerical estimate ei j of the quality of each submission j: the larger this estimate, the higher the quality. Then, for each submission j, we take the sum

s j = Σi ei j (1)
of all the scores given by different experts. We then make a decision based on these scores s j : if we want to select a single award-winner or a single faculty candidate, we select the submission with the largest score. Conflict of interest situations and how they are usually handled. Sometimes, some experts have a conflict of interest—e.g., such an expert is a co-author of one of the papers considered for the award, or a close relative of one of the nominees; there may be many other reasons; see, e.g., [1, 2]. The opinion of such experts is potentially biased. Because of this potential bias, they are usually excused from the judgment process. Sometimes, we still need the opinion of experts who have a conflict of interest. However, in some situations, this simple solution may not be perfect. For example, in a small awards committee of a broad-range conference or journal, the person with a conflict of interest may be one of the few who have expertise in the corresponding subarea. We may not trust this person's opinion about a submission to which he/she is closely related, but we would like to use this person's expertise when comparing other submissions from the same subarea. How can we do it?
O. Kosheleva Department of Teacher Education, University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_43
2 A Seemingly Natural Solution and Why it is Not Fair What we need. Suppose that we have E experts, and one of the experts i 0 has a conflict of interest with one of the submissions j0 —e.g., he/she is a co-author of the nominated paper. Since we are interested in i 0 's opinions about all other submissions, we ask i 0 to provide the scores ei j for all the submissions except for the one to which he/she is closely related. This way, for all submissions j ≠ j0 , we have opinions ei j provided by all E experts, so we can compute the sum (1). To be able to compare different submissions, we need to also provide a reasonable score for the submission j0 . For this submission, we only have E − 1 estimates ei j0 —namely, the estimates corresponding to experts i ≠ i 0 . To compute the desired score, we need to provide some estimate for the missing value ei0 j0 . How can we estimate this missing value? A natural idea. In general, comparing sums s j is equivalent to comparing averages

a j = (1/E) · Σi ei j .

Indeed, each average is simply equal to the corresponding sum divided by E, and if we divide all the values by the same number E, their order does not change. For all submissions j except for the submission j0 , we have E estimates, but for j0 we only have E − 1 estimates. So, a natural idea is to take the average of all these E − 1 estimates:

(1/(E − 1)) · Σi≠i0 ei j0 .

Multiplying this average by E, we get an equivalent score

(E/(E − 1)) · Σi≠i0 ei j0 ,

which is equal to

Σi≠i0 ei j0 + (1/(E − 1)) · Σi≠i0 ei j0 . (2)

This formula has the same form as the formula (1), with

ei0 j0 = (1/(E − 1)) · Σi≠i0 ei j0 . (3)
In other words, when comparing submissions, as the missing score ei0 j0 , we take the average of the scores ei j0 assigned to this submission j0 by all other experts. This natural idea does not provide an unbiased estimate. Let us show that this seemingly natural idea does not work. Indeed, suppose that the expert i 0 assigns very small scores—e.g., the smallest possible score of 0—to all the submissions j ≠ j0 . In this case, even if all other experts provide the exact same score e to all the submissions, then: • for j = j0 , the average score is e and thus, the sum score is e · E, while • for all other submissions j ≠ j0 , the sum score is e · (E − 1), which is smaller than e · E. On the other hand, if in the same situation we excluded the conflict-of-interest expert, all the submissions would have gotten the same score e · (E − 1). Thus, by including the expert i 0 in the decision process, and without explicitly asking his/her opinion about the submission j0 , we nevertheless bias the group decision in the direction of favoring the submission to which he/she is closely related—and this bias is exactly what we want to avoid. Maybe we can modify the above scheme? To avoid the above situation, we can take, as ei0 j0 , the average score of i 0 over all submissions j ≠ j0 . In this case, assigning 0s to all other submissions will not lead to a bias, but a bias is still possible. To show this, let us consider the case when, among the submissions, only two submissions are very good—the submission j0 and some other submission j1 ≠ j0 . Suppose that if we only take into account the opinions of all the experts without a conflict of interest, then these two submissions get equal scores. Suppose now that i 0 : • assigns good scores to all the submissions except for the submission j1 , and • to the submission j1 , he/she assigns the score 0. Then, we get ei0 j1 = 0, while as ei0 j0 , we take the average of all the scores ei0 j , which is positive.
So here, too, taking i 0 's opinion into account biases the decision in favor of the submission to which i 0 is closely related—exactly the bias that we wanted to avoid.
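The first counterexample is easy to check numerically. A small Python sketch (the function name and the concrete scores are our own illustrative choices):

```python
def scores_with_imputed_average(e, i0, j0):
    # e[i][j]: score of expert i for submission j; e[i0][j0] is missing (None).
    # Impute the missing score as the average (3) of the other experts' scores
    # for j0, then return the sums (1) for all submissions.
    E = len(e)
    n = len(e[0])
    imputed = sum(e[i][j0] for i in range(E) if i != i0) / (E - 1)
    sums = []
    for j in range(n):
        s = sum(e[i][j] for i in range(E) if not (i == i0 and j == j0))
        if j == j0:
            s += imputed
        sums.append(s)
    return sums

# 3 experts, 3 submissions; expert 0 has a conflict of interest with
# submission 0. The other experts give the same score 5 to everything;
# expert 0 gives the lowest score 0 to every submission he/she may judge.
e = [[None, 0, 0],
     [5, 5, 5],
     [5, 5, 5]]
s = scores_with_imputed_average(e, i0=0, j0=0)
print(s)  # the conflicted submission 0 gets 15, the others only 10
assert s[0] > s[1] == s[2]
```

Even though the other experts see no difference between the submissions, the imputed-average rule ranks the conflicted submission first.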
So shall we just exclude the conflict-of-interest experts? So maybe the situation is hopeless, and the only solution is to completely ignore the opinions of all the conflict-of-interest experts? The good news is that there is a scheme enabling us to take these experts' opinions into account without introducing the undesired bias. Let us describe this scheme.
3 How to Take into Account Opinion of Conflict-of-interest Experts Without Introducing the Bias: Analysis of the Problem We want to avoid the situations in which the opinions of the conflict-of-interest expert i 0 would bias our decision in favor of the submission j0 to which he/she is closely related. In other words, in situations in which we decide that j0 is the best alternative, we should not take i 0 ’s opinions into account. So, a natural idea is to first decide whether j0 is indeed the best submission. This has to be decided without taking into account i 0 ’s opinions. So: • If, based on the scores of all other experts, j0 is selected as the best option, we just declare it the best option—and this is the end of the selection process. • On the other hand, if j0 is not selected as the best option, we dismiss j0 and only consider all other options. In this new process, i 0 no longer has a conflict of interest, so we can take his/her opinion into account. Thus, we arrive at the following process. The resulting process works no matter how we make the collective decision, whether we take the sum of the scores or whether we make any other comparison.
4 Resulting Process What is given. Suppose that we have a process P that allows us, given values ei j assigned to different submissions j by different experts i, to select one of the alternatives j. This process works when no one has any conflict of interest, and thus, when every expert i provides a score for every submission j. In real life, some experts i may have a conflict of interest with some submissions. In this situation, every expert i provides his/her score ei j only for the submissions with which this expert does not have any conflict of interest. First stage. At first, we ignore all the experts who have a conflict of interest, and make a preliminary decision by applying the process P only to the experts who do not have any conflict of interest. If this first-stage selection results in selecting one of the submissions that have a conflict of interest, we declare this selection to be the final winner.
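The resulting two-stage procedure (the second stage, for the case when the first-stage winner has no conflict of interest, is described below) can be sketched as follows. This is a rough Python sketch; for concreteness, we instantiate the generic process P with the sum-of-scores rule (1), but the scheme works for any collective-decision process:

```python
def process_P(scores, submissions, experts):
    # Generic collective-decision process P, instantiated here (as an
    # assumption, for concreteness) by the sum-of-scores rule (1):
    # pick the submission with the largest sum of available scores.
    def total(j):
        return sum(scores[i][j] for i in experts if scores[i][j] is not None)
    return max(submissions, key=total)

def select_winner(scores, conflicts):
    # scores[i][j]: score of expert i for submission j (None when i has a
    # conflict of interest with j); conflicts: set of (expert, submission).
    experts = range(len(scores))
    submissions = range(len(scores[0]))
    conflicted_experts = {i for (i, j) in conflicts}
    conflicted_submissions = {j for (i, j) in conflicts}

    # First stage: ignore all conflict-of-interest experts.
    clean_experts = [i for i in experts if i not in conflicted_experts]
    winner = process_P(scores, list(submissions), clean_experts)
    if winner in conflicted_submissions:
        return winner  # a conflicted submission won fairly: final decision

    # Second stage: dismiss the conflicted submissions; now every expert's
    # opinion (including the formerly conflicted ones) can be used.
    remaining = [j for j in submissions if j not in conflicted_submissions]
    return process_P(scores, remaining, list(experts))

# Expert 0 is, e.g., a co-author of submission 0:
scores = [[None, 9, 2],
          [6, 5, 5],
          [4, 5, 6]]
print(select_winner(scores, conflicts={(0, 0)}))  # -> 1
```

In the example, the conflict-free experts rank submission 2 first, so submission 0 is dismissed; with expert 0's opinions now included, submission 1 wins.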
Possible second stage. If the submission selected by the first-stage selection does not have any conflict of interest with any expert, this means that the submissions that do have a conflict of interest are not as good. Thus, the conflict-of-interest submissions can be dismissed from our search for the best submission. So: • We dismiss all conflict-of-interest submissions. • Then, to make a final selection, we apply the process P again to all remaining submissions. This time, we take into account the opinions of all the experts (including those that originally had a conflict of interest).

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. D. Koepsell, Scientific Integrity and Research Ethics: An Approach from the Ethos of Science (Springer, 2017) 2. D.B. Resnik, Institutional conflicts of interest in academic research. Sci. Eng. Ethics 25(6), 1661–1669 (2019)
Why Aspirational Goals: Geometric Explanation Olga Kosheleva and Vladik Kreinovich
Abstract Business gurus recommend that an organization should have, in addition to clearly described realistic goals, also additional aspirational goals—goals for which we may not have resources and which most probably will not be reached at all. At first glance, adding such a vague goal cannot lead to a drastic change in how the company operates, but surprisingly, for many companies, the mere presence of such aspirational goals boosts the company’s performance. In this paper, we show that a simple geometric model of this situation can explain the unexpected success of aspirational goals.
1 What Are Aspirational Goals Business gurus recommend that, in addition to realistic achievable goals, organizations and people should also have aspirational goals: goals which may never be achieved, goals for which we do not have the resources—and may never have such resources; see, e.g., [1]. For example, a small-size US university may have the realistic goal of becoming the best of the small-size universities in its state or even in the US as a whole, but its aspirational goal may be to overtake MIT and other top schools and become the world's top university.
O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_44
2 Can Aspirational Goals Help? A clear formulation of realistic goals helps the company—or an individual—to progress towards these goals. One of the main reasons why realistic goals help is that they provide a clear path to achieving these goals. From this viewpoint, one would expect that vague pie-in-the-sky aspirational goals cannot be of much help. (Yes, sometimes, a visionary reaches his or her goals, but such situations are rare.)
3 Surprisingly, Aspirational Goals Do Help Contrary to the above-mentioned natural pessimistic expectations, in practice, aspirational goals help: their presence often helps a company to achieve its realistic goals—provided, of course, that these goals are properly aligned with the realistic goals.
4 But Why? A natural question is: how can we explain the success of aspirational goals? In this paper, we provide a simple geometric explanation of this success.
5 How to Describe Goals in Precise Terms: A Simplified Geometric Model To describe the current state of any system—in particular, the state of an organization or of an individual—we need to know the current values of the numerical characteristics x1 , . . . , xn characterizing this system. Thus, the current state can be naturally described as a point x = (x1 , . . . , xn ) with coordinates xi in an n-dimensional space. A realistic goal is, in effect, a description of the desired reasonably-short-term future state of the system. So, from this viewpoint, a goal can also be represented by an n-dimensional point y = (y1 , . . . , yn ), where y1 , . . . , yn are the values of the corresponding characteristics in the desired future state. We want to go from the original state x to the state y corresponding to the realistic goal. In general, the fastest way to go from x to y is to follow a straight line from x to y.
6 In This Simplified Model, Aspirational Goals—Even When Perfectly Aligned—Cannot Help In this simplified description, if we supplement a realistic goal y with an additional aspirational goal z = (z 1 , . . . , z n )—even one which is perfectly aligned with the realistic goal y—the path from x to y remains the same.
In other words, in this simplified model, the addition of an aspirational goal cannot help.
7 A More Realistic Geometric Model Let us now consider a more realistic geometric model, namely, a model that takes uncertainty into account. Specifically: • we usually have very good information about the current state x of the system, • however, we usually only know the future state with uncertainty. For example, a university knows exactly how many students are enrolled now (e.g., 23,456), but even a realistic plan for increased future enrollment cannot be that accurate—it may say something like “around 30 thousand students”, meaning, e.g., anywhere from 28 to 32 thousand. From this viewpoint, since we do not know the exact future state of the system, the resulting trajectory is oriented not exactly towards the actual future goal y, but towards a point which is ε-close to y, where ε is the accuracy with which we can determine future goals.
8 What if We Now Add a Second, Longer-Term Goal? Let us see what happens if, in this more realistic geometric model, we add a perfectly aligned additional longer-term goal—for which the corresponding state z is also known with the same accuracy ε.
Let us show that in this case, the deviation of the actual trajectory from the desired one is smaller than without the additional goal. Indeed: • Without the additional goal, the angle α between the actual and desired trajectories is approximately equal to α ≈ ε / d(x, y), where d(x, y) is the distance between the current state x and the state y corresponding to the realistic goal. Thus, the deviation d of the resulting state from the desired state y is approximately equal to d ≈ α · d(x, y) ≈ ε. • In the presence of a perfectly aligned additional goal, the angle αadd between the actual and desired trajectories is approximately equal to αadd ≈ ε / d(x, z), where d(x, z) is the distance between the current state x and the state z corresponding to the additional goal. Thus, the deviation dadd of the resulting state from the desired state y is approximately equal to

dadd ≈ αadd · d(x, y) ≈ ε · d(x, y) / d(x, z). (1)
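These small-angle estimates can be checked numerically. A brief sketch (Python; the concrete points and the value of ε are our own illustrative assumptions):

```python
import math

def deviation(x, y, target, eps):
    # Travel from x a distance d(x, y) along a direction that is aimed at a
    # point eps-off the target: by the small-angle approximation, the angle
    # is alpha ~ eps / d(x, target), and the deviation at y is alpha * d(x, y).
    d_xy = math.dist(x, y)
    d_xtarget = math.dist(x, target)
    alpha = eps / d_xtarget
    return alpha * d_xy

x = (0.0, 0.0)
y = (10.0, 0.0)    # realistic goal
z = (40.0, 0.0)    # longer-term (aspirational) goal, farther away
eps = 0.5

d_plain = deviation(x, y, y, eps)   # aiming at the realistic goal: ~ eps
d_add = deviation(x, y, z, eps)     # aiming at the aspirational goal
assert abs(d_plain - eps) < 1e-9
assert d_add < d_plain              # the deviation shrinks by d(x,y)/d(x,z)
```

Here the deviation shrinks from 0.5 to 0.125, i.e., by the factor d(x, y)/d(x, z) = 1/4, in line with formula (1).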
The additional goal is a longer-term goal than the realistic goal. Thus, the state z corresponding to the additional goal is located further away from the current state x than the state y corresponding to the realistic goal, i.e., d(x, y) < d(x, z). So, the ratio d(x, y)/d(x, z) is smaller than 1, and thus, by the formula (1), the deviation dadd is smaller than the deviation d ≈ ε that we have without the additional goal—which explains why a properly aligned aspirational goal helps.

… 0 and u (0) j > 0, then, according to the formula (1), adding negative emotions can only decrease their utility. From this viewpoint, we should not be experiencing negative emotions towards each other—but negative emotions, including hate, have been ubiquitous throughout history. So why hate? What is the benefit of having negative feelings towards others?
Why Hate: Analysis Based on Decision Theory
2 Analysis of the Problem, the Resulting Explanation, and Recommendations When hate makes sense: analysis of the problem. Person i wants to increase his/her happiness, in particular, by selecting appropriate values ai j . • According to the formula (1), when the objective utility u (0) j of another person is positive, to increase his/her utility, person i should have positive feelings towards person j. In this situation, indeed, negative feelings ai j < 0 make no sense. • The only case when it makes sense to have negative feelings towards person j is when person j’s objective utility u (0) j is negative. In this case, indeed, having a negative attitude ai j < 0 will increase the utility u i of the ith person. So, hate—to be more precise, negative feelings—makes sense only towards people who are unhappy. Do we really have negative feelings towards unhappy people? One may ask: why would we have negative feelings towards an unhappy person—e.g., a person who is poor and/or sick? Indeed, if a person is blameless, it is difficult to imagine hating him/her. But nobody is perfect: we all make mistakes. In the case of a poor and/or sick person, these mistakes—drugs, alcohol, crime—may have contributed to his/her current condition. The resulting negative feelings sound justified, but these are exactly the feelings described by the above analysis: negative feelings towards someone who is already unhappy. When is hate possible? When is it (practically) necessary? If a person i is surrounded both by happy and unhappy people, then this person can increase his/her utility • either by expressing positive emotions towards happy people, • or by expressing negative emotions towards unhappy people, • or by doing both. In such a situation, while hate is possible, it is not necessary: we can increase our utility by being positive towards happy people and neutral towards unhappy ones.
The only case when negative feelings become necessary is when a person is surrounded only by unhappy people. In this case, the only way to increase one’s own happiness is to have negative feelings towards others—at least some others. Examples of such situations. Let us give some examples of situations when negative feelings are practically necessary. • A classical example is war. In a war, soldiers on both sides are objectively unhappy—but the fact that they have strong negative feelings towards soldiers from the other side makes their lives worth living. • Another example is criminals sitting in jails—especially in overcrowded jails in poor countries: they are often divided into gangs that hate each other.
• Yet another example is citizens of a poor multi-ethnic state. Objectively, practically all of them are poor and thus unhappy. The only way to increase their happiness is to start having negative feelings towards each other. In the last example, sometimes, such feelings are encouraged by the oppressors—the divide-and-conquer policy followed by all empires starting with ancient Rome (and probably even earlier), to prevent poor people from fighting together against their oppressors. But even without the oppressors, such attitudes are, unfortunately, frequent. Recommendations. How can we eliminate hate and other negative feelings—or at least make them less frequent? Based on the above conclusions, the best strategy is to make people objectively happier—this will make hatred counterproductive and thus, hopefully, eliminate it. While this noble goal is being pursued, a natural intermediate solution is: • to avoid situations when objectively unhappy people only communicate with each other: have mixed neighborhoods, have mixed schools—this will make hate and other negative feelings unnecessary, and also • to teach people positive feelings—this will hopefully decrease the spread of still-possible negative feelings. Acknowledgements This work was supported in part by the National Science Foundation grants: • 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and • HRD-1834620 and HRD-2034030 (CAHSI Includes). It was also supported: • by the AT&T Fellowship in Information Technology, • by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and • by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Why Self-Esteem Helps to Solve Problems: An Algorithmic Explanation Oscar Ortiz, Henry Salgado, Olga Kosheleva, and Vladik Kreinovich
Abstract It is known that self-esteem helps solve problems. From the algorithmic viewpoint, this seems like a mystery: a boost in self-esteem does not provide us with new algorithms, and it does not provide us with the ability to compute faster—but somehow, with the same algorithmic tools and the same ability to perform the corresponding computations, students become better problem solvers. In this paper, we provide an algorithmic explanation for this surprising empirical phenomenon.
1 Formulation of the Problem Self-esteem helps to solve problems: a well-known phenomenon. It is known that good self-esteem helps to solve problems; see, e.g., [1, 5, 8, 9, 25–28, 30, 32]. There is a correlation between students’ academic performance and their general self-esteem, and an even higher correlation between students’ academic performance and their discipline-related self-esteem. It is also known that helping students to boost their self-esteem makes them, on average, better problem solvers. O. Ortiz · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Ortiz e-mail: [email protected] H. Salgado College of Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_46
Why is this a problem. The above phenomenon is such a commonly known fact that people do not even realize that from the algorithmic viewpoint, this does not seem to make sense. Indeed: • From the algorithmic viewpoint, what we need to solve a problem is an appropriate algorithm and a sufficient amount of computation time. • However, self-esteem does not mean we know new ways of solving the problem, it does not mean that we have gained additional time. So why does it help? What we do in this paper. In this paper, we provide a possible algorithmic explanation for this phenomenon.
2 Our Explanation
Clarification of the phenomenon: it is about problems with unique solution. The correlation between self-esteem and academic performance is usually at the level of straightforward problems—like arithmetic—where there is exactly one correct solution. This does not mean that this phenomenon is limited to simple arithmetic; for example:
• the derivative of a function is also uniquely determined by this function, and
• most differential equations with given initial conditions have a unique solution.
The reason for this uniqueness limitation is clear:
• in such situations, we have an objective characteristic of the academic performance—e.g., the number of correct answers;
• while in more vague situations—e.g., writing an essay about a book—criteria are much more subjective and thus difficult to quantify exactly.
How uniqueness helps: reminder. Interestingly, uniqueness does help to find solutions to well-defined problems: namely, there is a general computational result that uniqueness implies computability; see, e.g., [19, 23] and references therein. Let us formulate this result in precise terms. For real-life well-defined problems, solutions usually can be described by a finite list of numbers. Let us give two examples.
• If we are looking for a shape of a building or an airplane, this shape is usually described by splines, i.e., by several glued-together pieces each of which is characterized by a polynomial equation of a given degree. Thus, to describe the shape, it is sufficient to describe the coefficients of all these polynomial equations.
• If we are looking for the best way to invest money, then a solution means describing, for each possible way of investing (different stocks, different bonds, etc.), the exact amount of money that we should invest this way.
Why Self-Esteem Helps to Solve Problems: An Algorithmic Explanation
In all these cases, finding the desired solution means finding the values of the corresponding numerical quantities x1, . . . , xn. For each real-life quantity, there are usually limitations on possible ranges; examples:
• The amount of money invested in a given stock must be greater than or equal to 0, and cannot exceed the overall amount of money that we need to invest.
• Similarly, speed in general is limited by the speed of light, and the speed of a robot is limited by the power of its motors.
There may be other general limitations on these values xi. For example, the building code may require that each building is able to withstand the most powerful winds that happen in this geographic area. To properly describe such general limitations, it is important to take into account that in most real-life problems, the values xi can only be implemented with some accuracy ε > 0. In other words, what we will have in reality when we, e.g., design the building, is some values x̃i which are ε-close to xi, i.e., for which |x̃i − xi| ≤ ε. We want to be able to make sure that for all such ε-close values x̃i, the resulting design will satisfy the desired general limitation—e.g., withstand the most powerful winds that happen in this geographic area. For each ε, we can list all possible approximations of this type. For example, if we consider values between 0 and 10 with accuracy 0.1, we can consider values 0, 0.1, 0.2, …, 9.9, and 10.0. In general, we can similarly take values 0, ε, 2ε, …. By combining such values and selecting only the combinations that satisfy the desired constraints, we will have a finite list of possible tuples (x1, . . . , xn) such that every possible tuple is ε-close to one of the tuples from this list. A set for which we can algorithmically find such a list for each ε is known as constructively compact.
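The grid construction just described can be sketched in code. This is an illustrative sketch only: the two quantities, their ranges, and the budget-style constraint are hypothetical examples, not taken from the paper.

```python
from itertools import product

def epsilon_grid(ranges, eps):
    """For each quantity x_i with range [lo, hi], take the grid values
    lo, lo + eps, lo + 2*eps, ...; every possible tuple of values is then
    eps-close (coordinate-wise) to some tuple of this finite grid."""
    axes = []
    for lo, hi in ranges:
        steps = int(round((hi - lo) / eps))
        axes.append([lo + k * eps for k in range(steps + 1)])
    return list(product(*axes))

# Two quantities, each ranging over [0, 10], with accuracy eps = 1; we keep
# only the combinations satisfying a budget-style constraint x1 + x2 <= 10.
tuples = [t for t in epsilon_grid([(0, 10), (0, 10)], 1.0) if t[0] + t[1] <= 10]
```

The resulting finite list of tuples is exactly the kind of ε-net that makes the set of possible designs constructively compact.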
We are interested in finding the values xi which are possible—in the sense that they satisfy the general constraints—and which satisfy the specific constraints corresponding to this particular problem. These constraints are usually described by a system of equations and inequalities

f1(x1, . . . , xn) = 0, . . . , fm(x1, . . . , xn) = 0,
g1(x1, . . . , xn) ≥ 0, . . . , gp(x1, . . . , xn) ≥ 0,     (1)
with computable functions f i (x1 , . . . , xn ) and g j (x1 , . . . , xn ). For example: • we may be given the overall budget—a constraint of equality type, and • within this budget, we want to make sure that the building has enough space to adequately host at least a pre-determined number of inhabitants—a constraint of an inequality type. The first version of the uniqueness-implies-computability result is as follows; see Appendix for a brief overview of the corresponding definitions of computability:
• In general, no algorithm is possible that would always, given computable functions fi and gj on a constructive compact set C for which the system (1) has a solution, return one of these solutions; see, e.g., [2–4, 6, 7, 18, 20, 29, 31].
• There does exist a general algorithm that, given computable functions fi and gj on a constructive compact set C for which the system (1) has a unique solution, always returns this solution; see, e.g., [10–15, 18, 19, 21–24].
Comment about optimization. In many practical situations, the corresponding system (1) has many solutions. In this case, we are looking for the best solution, i.e., for the solution for which some objective function h(x1, . . . , xn) attains its largest (or its smallest) value. For example, we may want to minimize cost, minimize pollution, or maximize the number of passengers on an airplane. For such problems, uniqueness also helps:
• In general, no algorithm is possible that would always, given a computable function h on a constructive compact set C, return the tuple (x1, . . . , xn) ∈ C at which this function attains its largest possible value; see, e.g., [2–4, 6, 7, 18, 20, 29, 31].
• There does exist a general algorithm that, given a computable function h on a constructive compact set C for which the maximum of h is attained at only one tuple, always returns this tuple; see, e.g., [10–15, 18, 19, 21–24].
Resulting explanation. For many problems given to K-12 students, there is exactly one solution.
• In these terms, self-esteem means that a student is confident that he/she can come up with a solution. By virtue of the above algorithmic result, this means that the student can, in principle, simply apply the general uniqueness-implies-computability algorithm and find the solution—even when this student is not yet fully able to apply techniques studied in class.
• On the other hand, without good self-esteem, the student is not confident that he/she will come up with a solution.
In such a situation of non-uniqueness, no general algorithm is possible, so a student who is still struggling with the class material is, in general, not able to solve the corresponding problem.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT & T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences (Las Cruces, New Mexico, April 2, 2022) for valuable suggestions.
Appendix: What Is Computable: An Overview of the Main Definitions
• We say that a real number x is computable if there exists an algorithm that, given a natural number k, returns a rational number r which is 2−k-close to x: |x − r| ≤ 2−k.
• We say that a tuple of real numbers x = (x1, . . . , xn) is computable if all the numbers xi in this tuple are computable.
• We say that a function f (x) from tuples of real numbers to real numbers is computable if there exists an algorithm that, given algorithms for computing xi and a natural number k, returns a rational number which is 2−k-close to f (x). This algorithm for computing f (x) can call the algorithms for computing approximations to the inputs xi.
• We say that a compact set S ⊆ IRn is constructive if there exists an algorithm that, given a natural number k, returns a finite list of computable tuples x(1), x(2), … from the set S which serve as a 2−k-net for S, i.e., for which every tuple x ∈ S is 2−k-close to one of the tuples x(i) from this list.
See [2–4, 6, 7, 18, 20, 29, 31] for details.
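To illustrate how uniqueness helps, here is a toy sketch in the spirit of these definitions: on a 2−k-net of an interval, we keep the points at which the constraint function is provably small; when the root is unique, the surviving points form ever smaller clusters around it as k grows. The tolerance rule and the example function here are hypothetical simplifications, not the actual algorithm from the cited papers.

```python
def candidates_on_net(f, k, lo=0.0, hi=1.0):
    """Keep the points of a 2^(-k)-net of [lo, hi] at which |f(x)| is small.
    The tolerance 2 * step is a crude stand-in for the modulus of continuity
    that a real uniqueness-implies-computability algorithm would compute."""
    step = 2.0 ** (-k)
    points = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return [x for x in points if abs(f(x)) <= 2 * step]

# f(x) = x - 0.3 has a unique root in [0, 1]; the surviving candidates
# cluster more and more tightly around 0.3 as the net gets finer.
coarse = candidates_on_net(lambda x: x - 0.3, 3)
fine = candidates_on_net(lambda x: x - 0.3, 9)
```

With several well-separated roots, the survivors would instead form several clusters, and no single accumulation point could be reported, which is where non-computability enters.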
References 1. R.F. Baumeister, J.D. Campbell, J.I. Krueger, K.D. Vohs, Does high self-esteem cause better performance, interpersonal success, happiness, or healthier lifestyles? Psychol. Sci. Public Interes. 4(1), 1–44 (2003) 2. M.J. Beeson, Foundations of Constructive Mathematics (Springer, N.Y., 1985) 3. E. Bishop, Foundations of Constructive Analysis (McGraw-Hill, 1967) 4. E. Bishop, D.S. Bridges, Constructive Analysis (Springer, N.Y., 1985) 5. M.Z. Booth, J.M. Gerard, Self-esteem and academic achievement: a comparative study of adolescent students in England and the United States. Comp.: J. Comp. Int. Educ. 41(5), 629– 648 (2011) 6. D.S. Bridges, Constructive Functional Analysis (Pitman, London, 1979) 7. D.S. Bridges, S.L. Vita, Techniques of Constructive Analysis (Springer, New York, 2006) 8. A. Di Paula, J.D. Campbell, Self-esteem and persistence in the face of failure. J. Pers. Soc. Psychol. 83(3), 711–724 (2002) 9. B. Harris, Self-Esteem: 150 Ready-to-use Activities to Enhance the Self-Esteem of Children and Teenagers to Increase Student Success and Improve Behavior (CGS Communications, San Antonio, Texas, 2016) 10. U. Kohlenbach, Theorie der majorisierbaren und stetigen Funktionale und ihre Anwendung bei der Extraktion von Schranken aus inkonstruktiven Beweisen: Effektive Eindeutigkeitsmodule bei besten Approximationen aus ineffektiven Eindeutigkeitsbeweisen, Ph.D. Dissertation, Frankfurt am Main (1990) 11. U. Kohlenbach, Effective moduli from ineffective uniqueness proofs: an unwinding of de La Vallée Poussin’s proof for Chebycheff approximation. Ann. Pure Appl. Log. 64(1), 27–94 (1993)
12. U. Kohlenbach, Applied Proof Theory: Proof Interpretations and their Use in Mathematics (Springer, Berlin-Heidelberg, 2008) 13. V. Kreinovich, Uniqueness implies algorithmic computability, in Proceedings of the 4th Student Mathematical Conference (Leningrad University, Leningrad, 1975), pp. 19–21 (in Russian) 14. V. Kreinovich, Reviewer’s remarks in a review of D.S. Bridges, in Constructive Functional Analysis (Pitman, London, 1979); Zentralblatt für Mathematik, 401, 22–24 (1979) 15. V. Kreinovich, Categories of space-time models, Ph.D. Dissertation, Novosibirsk, Soviet Academy of Sciences, Siberian Branch, Institute of Mathematics (1979) (in Russian) 16. V. Kreinovich, Philosophy of Optimism: Notes on the Possibility of Using Algorithm Theory When Describing Historical Processes, Leningrad Center for New Information Technology “Informatika”, Technical Report, Leningrad (1989). ((in Russian)) 17. V. Kreinovich, Physics-motivated ideas for extracting efficient bounds (and algorithms) from classical proofs: beyond local compactness, beyond uniqueness, in Abstracts of the Conference on the Methods of Proof Theory in Mathematics (Max-Planck Institut für Mathematik, Bonn, Germany, 3–10 June 2007), p. 8 18. V. Kreinovich, A. Lakeyev, J. Rohn, P. Kahl, Computational Complexity and Feasibility of Data Processing and Interval Computations (Kluwer, Dordrecht, 1998) 19. V. Kreinovich, K. Villaverde, Extracting computable bounds (and algorithms) from classical existence proofs: girard domains enable us to go beyond local compactness. Int. J. Intell. Technol. Appl. Stat. (IJITAS) 12(2), 99–134 (2019) 20. B.A. Kushner, Lectures on Constructive Mathematical Analysis (American Mathematical Society, Providence, Rhode Island, 1984) 21. D. Lacombe, Les ensembles récursivement ouvert ou fermés, et leurs applications à l’analyse récurslve. Comptes Rendus de l’Académie des Sci. 245(13), 1040–1043 (1957) 22. V.A. 
Lifschitz, Investigation of constructive functions by the method of fillings. J. Sov. Math. 1, 41–47 (1973) 23. L. Longpré, O. Kosheleva, V. Kreinovich, Baudelaire’s ideas of vagueness and uniqueness in art: algorithm-based explanations, in Decision Making under Uncertainty and Constraints: A Why-Book, ed. by M. Ceberio, V. Kreinovich (Springer, Cham, Switzerland, 2022) (to appear) 24. L. Longpré, V. Kreinovich, W. Gasarch, G.W. Walster, m solutions good, m − 1 solutions better. Appl. Math. Sci. 2(5), 223–239 (2008) 25. C.W. Loo, J.L.F. Choy, Sources of self-efficacy influencing academic performance of engineering students. Am. J. Educ. Res. 1(3), 86–92 (2013) 26. H.W. Marsh, Causal ordering of academic self-concept and academic achievement: a multiwave, longitudinal path analysis. J. Educ. Psychol. 82(4), 646–656 (1990) 27. L. Noronha, M. Monteiro, N. Pinto, A study on the self-esteem and academic performance among the students. Int. J. Health Sci. Pharm. (IJHSP) 2(1), 1–7 (2018) 28. A. Ntem, Every student’s compass: a simple guide to help students deal with low self-esteem, set academic goals, choose the right career and make a difference in the society (2022) 29. M. Pour-El, J. Richards, Computability in Analysis and Physics (Springer, New York, 1989) 30. M. Rosenberg, C. Schooler, C. Schoenbach, F. Rosenberg, Global self-esteem and specific self-esteem: different concepts, different outcomes. Am. Sociol. Rev. 60(1), 141–156 (1995) 31. K. Weihrauch, Computable Analysis: An Introduction (Springer, Berlin, Heidelberg, 2000) 32. S.Y. Yoon, P.K. Imbrie, T. Reed, Development of the leadership self-efficacy scale for engineering students, in Proceedings of the 123rd Annual Conference and Exposition of the American Society for Engineering Education (ASEE) (New Orleans, Louisiana, 26–29 June 2016) (Paper 15784)
Why Five Stages of Solar Activity, Why Five Stages of Grief, Why Seven Plus Minus Two: A General Geometric Explanation Miroslav Svítek, Olga Kosheleva, and Vladik Kreinovich
Abstract A recent paper showed that the solar activity cycle has five clear stages, and that taking these stages into account helps to make accurate predictions of future solar activity. Similar 5-stage models have been effective in many other application areas, e.g., in psychology, where a 5-stage model provides an effective description of grief. In this paper, we provide a general geometric explanation of why 5-stage models are often effective. This result also explains other empirical facts, e.g., the seven plus minus two law in psychology and the fact that only five space-time dimensions have found direct physical meaning.
1 Introduction
Empirical fact. A recent study of solar activity [6] has shown that the solar activity cycle can be divided into five clearly different stages, and that explicitly taking these stages into account leads to a much more effective technique for predicting future solar activity. Of course, we can further subdivide each of these five stages into sub-stages, but the gist of the dynamics is already well captured by this 5-stage description. Taking into account that five stages naturally appear in many other dynamical situations—e.g., in the well-known five-stages-of-grief model [5]—a natural conclusion is that there may be a general explanation of why five-stage models are effective.
M. Svítek Faculty of Transportation Sciences, Czech Technical University in Prague, Konviktska 20, CZ-110 00 Prague 1, Prague, Czech Republic e-mail: [email protected]
O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
O. Kosheleva e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_47
M. Svítek et al.
What we do in this paper. In this paper, we provide a general geometric explanation for the effectiveness of 5-stage dynamical models.
2 Our Explanation
What does the 5-stage description mean. On each stage, a different type of activity is prevalent. So, at each stage, a different quantity characterizes the system’s behavior. As an example, let us consider the dynamics of a simple pendulum. In its stationary state, the pendulum is positioned at its lowest point, and its velocity is 0. When the pendulum starts moving, each cycle of its motion consists of two stages following one another:
• One of them is the stage where the pendulum is close to its lowest point. On this stage, the height of the pendulum’s position is close to the stationary value. What is drastically different from the stationary state is the speed.
• Another is the stage where the pendulum is close to one of its highest points. On this stage, the velocity is close to the stationary one. What is drastically different from the stationary state is the height of the pendulum’s position.
So, each of the two stages is indeed characterized by its own characteristic quantity:
• velocity at the first stage, and
• height at the second stage.
When we have s stages, this means that, to provide a reasonable description of this system’s dynamics, we need to trace the values of s different quantities, each of which corresponds to one of these stages. In other words, to get a reasonable description of the system’s dynamics, we can view this system as a system in an s-dimensional space. In particular, the fact that the system exhibits 5 stages means that this system’s dynamics can be reasonably well described by modeling this system’s behavior in 5-dimensional space.
How many stages? We can use spaces of different dimension, and we get different approximate descriptions of the given system. Which of these descriptions is the most informative, the most adequate? To fully describe a complex system, we need to know the values of a large number of variables. Thus, a complex system can be in a large number of different states.
When we describe the complex system as a system in an s-dimensional space, for some small s, we thus limit the number of possible states, and therefore, make the description approximate. The fewer states we use, the cruder the approximation. Let us illustrate this natural idea on a simple example of approximating real numbers from the interval [0, 1] by a finite set of points.
• If we only allow one approximating point, then the most accurate description we can achieve is if we select the point 0.5. In this case, the worst-case approximation error is attained when we are trying to approximate the borderline values 0 and 1—this approximation error is equal to |0 − 0.5| = |1 − 0.5| = 0.5.
• If we allow 2 points, then we can use points 0.25 and 0.75, in which case the worst-case approximation error is half as large—it is equal to 0.25.
• In general, if we allow n points, then we should select points
1/(2n), 3/(2n), 5/(2n), . . . , (2n − 1)/(2n).
In this case, the worst-case approximation error is equal to 1/(2n).
The more points we use to approximate, the more accurate our approximation. In our case, we approximate original states by states in an s-dimensional space. In physics, the values of all the quantities are bounded. For example, in general:
• velocity is limited by the speed of light,
• distance is limited by the size of the Universe, etc.
For practical applications, there are even stricter bounds. By appropriately re-scaling each quantity, we can make sure that each bound is 1. After this re-scaling, instead of considering the whole space, we only need to consider a unit ball in this space. The number of possible states is proportional to the volume of this unit ball—to be more precise, it is equal to the ratio between the volume of the unit ball and the volume of a small cell in which states are indistinguishable. Thus, to get the most accurate description of the original system, we need to select the dimension s for which the unit ball has the largest possible volume. What is this dimension? At first glance, it may look like the larger the dimension, the larger the volume. Indeed:
• The 1-dimensional volume (i.e., length) of the 1-dimensional unit ball—i.e., of the interval [−1, 1]—is 2.
• The 2-dimensional volume (i.e., area) of the 2-dimensional unit ball—i.e., of the unit disk—is π > 1.
• The 3-dimensional volume of the 3-dimensional unit ball is (4/3) · π > π, etc.
However, this impression is false. It turns out that the increase continues only up to s = 5 dimensions, after which the volume starts decreasing—and tends to 0 as the dimension increases; see, e.g., [11]. So, the largest volume of the unit ball is attained when the dimension is equal to 5. In view of the above, this means that 5-dimensional approximations—corresponding to 5-stage descriptions—indeed provide the most adequate first-approximation description of a dynamical system. This explains why 5-stage descriptions are effective in areas ranging from solar activity to grief.
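This statement about unit-ball volumes is easy to check numerically, using the standard formula V(d) = π^(d/2)/Γ(d/2 + 1) for the volume of the d-dimensional unit ball (see, e.g., [11]):

```python
from math import pi, gamma

def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

volumes = {d: unit_ball_volume(d) for d in range(1, 16)}
# V(1) = 2, V(2) = pi, V(3) = (4/3)*pi ~ 4.19, V(4) ~ 4.93, V(5) ~ 5.26,
# then V(6) ~ 5.17, and the volumes keep decreasing toward 0.
best = max(volumes, key=volumes.get)  # the dimension with the largest volume
```

Running this confirms that the maximum is attained at dimension 5.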
3 Other Applications of This Conclusion
Five-dimensional and higher-dimensional models of space-time. Originally, physicists believed that there is a 3-dimensional space and there is 1-dimensional time. Special relativity showed that it is convenient to combine them into a single 4-dimensional space-time. This way, for example, it became clear that electric and magnetic fields are actually one single electromagnetic field; see, e.g., [2, 13]. A natural idea is to try to see if adding additional dimensions will enable us to combine other fields as well. This idea immediately led to a success—it turns out (see, e.g., [3, 4, 7, 12]) that if we formally write down the equations of General Relativity Theory in a 5-dimensional space, then we automatically get both the usual equations of the gravitational field and Maxwell’s equations of electrodynamics. Since then, many other attempts have been made to add even more dimensions—but so far, these attempts, while mathematically and physically interesting, did not lead to a natural integration of any additional fields; see, e.g., [13]. From this viewpoint, dimension 5 seems to be a natural way to describe physical fields—in perfect accordance with the above result.
Seven plus minus two law. It is known that when we classify objects, we divide them into five to nine categories; see, e.g., [9, 10]. In particular, when we divide a process into stages, we divide it into five to nine stages:
• some people tend to divide everything into five stages;
• some people tend to divide everything into six stages;
• …, and
• some people tend to divide everything into nine stages.
The need to have at least five stages—and not four—again can be explained by the fact that, as we have shown, 5-stage representations provide the most adequate description of complex systems.
But why not ten? The above result explains why people use models that have at least 5 stages, i.e., it explains the lower bound 5 on the number of stages. But how can we explain the upper bound? Why at most 9 stages? Why can we not divide the process into 10 or more stages? For this question, we only have partial answers, which are not as clear as the above explanation of why we need at least 5 stages. Actually, we have the following three answers, all showing that starting with dimension 10 (corresponding to 10 stages), the situation becomes completely different.
• The paper [1] shows that while we can feasibly analyze dynamical systems in dimensions up to 9, analysis of 10-dimensional dynamical systems is, in general, not feasible.
• Another specific feature of dimension 10 is that this is the smallest dimension in which we can have a consistent quantum field theory—which is not possible in all dimensions up to 9 [13]. This shows that 10-dimensional spaces are drastically different from lower-dimensional ones.
• Finally, when we consider cooperative situations with 10 agents—i.e., when the space of possible actions of all the players is 10-dimensional—there are situations in which no stable solution (also known as von Neumann-Morgenstern solution) is possible [8], while no such situations were found for smaller dimensions (i.e., for a smaller number of agents).
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT & T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No.
075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References
1. G. Datseris, A. Wagemakers, Effortless estimation of basins of attraction. Chaos 32 (2022) (Paper 023104)
2. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, Massachusetts, 2005)
3. T. Kaluza, Sitzungsberichte der K. Prussischen Akademie der Wissenschaften zu Berlin (1921), p. 966 (in German); English translation “On the unification problem in physics” in [7], pp. 1–9
4. O. Klein, Zeitschrift für Physik, vol. 37 (1926), p. 895 (in German); English translation “Quantum theory and five-dimensional relativity” in [7], pp. 10–23
5. E. Kübler-Ross, On Death and Dying: What the Dying Have to Teach Doctors, Nurses, Clergy and Their Own Families (Scribner, New York, 2014)
6. R.J. Leamon, S.W. McIntosh, A.M. Title, Deciphering solar magnetic activity: the solar cycle clock. Front. Astron. Space Sci. 19 (2022) (Paper 886670)
7. H.C. Lee (ed.), An Introduction to Kaluza-Klein Theories (World Scientific, Singapore, 1984)
8. W.F. Lucas, The proof that a game may not have a solution. Trans. Am. Math. Soc. 136, 219–229 (1969)
9. G.A. Miller, The magical number seven plus or minus two: some limits on our capacity for processing information. Psychol. Rev. 63(2), 81–97 (1956)
10. S.K. Reed, Cognition: Theories and Application (SAGE Publications, Thousand Oaks, California, 2022)
11. D.J. Smith, M.K. Vamanamurthy, How small is a unit ball? Math. Mag. 62(2), 101–107 (1989)
12. S.A. Starks, O. Kosheleva, V. Kreinovich, Kaluza-Klein 5D ideas made fully geometric. Int. J. Theor. Phys. 45(3), 589–601 (2006)
13. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, New Jersey, 2017)
Applications to Software Engineering
Anomaly Detection in Crowdsourcing: Why Midpoints in Interval-Valued Approach Alejandra De La Peña, Damian L. Gallegos Espinoza, and Vladik Kreinovich
Abstract In many practical situations—e.g., when preparing examples for a machine learning algorithm—we need to label a large number of images or speech recordings. One way to do it is to pay people around the world to perform this labeling; this is known as crowdsourcing. In many cases, crowd-workers generate not only answers, but also their degrees of confidence that the answer is correct. Some crowd-workers cheat: they produce almost random answers without bothering to spend time analyzing the corresponding image. Algorithms have been developed to detect such cheaters. The problem is that many crowd-workers cannot describe their degree of confidence by a single number; they are more comfortable providing an interval [x̲, x̄] of possible degrees. To apply anomaly-detecting algorithms to such interval data, we need to select a single number from each such interval. Empirical studies have shown that the most efficient selection is when we select the arithmetic average. In this paper, we explain this empirical result by showing that the arithmetic average is the only selection that satisfies natural invariance requirements.
1 Formulation of the Problem What is crowdsourcing: a brief reminder. In many practical situations, we need to perform a large number of reasonably simple tasks, tasks that do not require high qualifications. For example, deep learning requires that a large number of labeled examples are available (see, e.g., [1]). In many cases, we do not have that many A. De La Peña · D. L. Gallegos Espinoza · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] A. De La Peña e-mail: [email protected] D. L. Gallegos Espinoza e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_48
A. De La Peña et al.
labeled examples, so we need someone to label a large number of photos, or a large number of speech recordings. One way to perform these tasks is crowdsourcing, when people all over the world are paid to solve the corresponding tasks—e.g., to label pictures that will be used for training a machine learning algorithm.
Need to detect anomalies. Most crowd-workers work conscientiously. However, since the payment is proportional to the number of answers, there are also many cases when crowd-workers do a sloppy job, not spending enough time on analyzing the corresponding picture and therefore producing answers that are often wrong. Such wrong answers prevent machine learning algorithms from getting high quality results. It is therefore important to be able to detect such anomalous crowd-workers and dismiss their answers. A natural way to do it is to include examples with known labels into the list of tasks. Then, we can gauge the quality of a crowd-worker by the number of wrong answers that he/she has on these examples. If this number is unusually high, then all the answers provided by this crowd-worker should be dismissed.
Need to take into account degrees of confidence. Crowd-workers are often not 100% confident in their answers. To help machine learning, it is therefore desirable to collect not only the answers, but also the degrees indicating how confident the crowd-worker is in each answer. This way, the neural network will be able to weigh these answers with different weights: if its answer differs from the confident answer of a crowd-worker, then the algorithm should continue training, but if it differs only from the answers of not-very-confident crowd-workers, then maybe there is no need to adjust. Because of this, some crowdsourcing algorithms require the crowd-worker to submit not only the answer, but also his/her degree of confidence in this answer—as expressed by a number on some scale [X̲, X̄], e.g., from 0 to 10, or from 0 to 1.
Usually, larger numbers correspond to larger degrees of confidence. Usually, linear transformations are used to transform between different scales. For example, the value 7 on a scale from 0 to 10 is transformed into 7/10 on the scale from 0 to 1. Similarly, the value 0 on the scale [−1, 1] is transformed into the value 0.5 on the scale [0, 1]. These degrees of confidence are used to detect anomalies: if the answer is wrong but the crowd-worker is not very confident about it, this may be an honest mistake, but if there are many wrong answers with high degrees of confidence, this indicates an anomaly. Sometimes, these degrees also affect the amount of payment: the higher the degree of confidence, the higher the pay—since one way to gain more confidence is to spend more time analyzing the corresponding picture or recording.
Interval-valued degrees of confidence. Crowd-workers are usually unable to describe their degree of confidence by a single number: in general, people cannot meaningfully distinguish, e.g., between degrees of confidence 0.70 and 0.71 on a scale from 0 to 1. So, it makes sense to allow the crowd-workers to mark their confidence by selecting an interval [x̲, x̄] of possible degrees: e.g., the interval [0.7, 0.8].
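Such linear re-scaling between confidence scales can be written down explicitly; a minimal sketch (the function name is ours):

```python
def rescale(x, old_lo, old_hi, new_lo, new_hi):
    """Linearly map a degree of confidence x from the scale
    [old_lo, old_hi] to the scale [new_lo, new_hi]."""
    t = (x - old_lo) / (old_hi - old_lo)   # relative position on the old scale
    return new_lo + t * (new_hi - new_lo)

# Examples from the text:
# 7 on the 0-to-10 scale becomes 0.7 on the 0-to-1 scale;
# 0 on the [-1, 1] scale becomes 0.5 on the [0, 1] scale.
```

Any such map has the form x → a + b · x with b > 0, which is exactly the family of transformations used in the invariance argument below.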
Anomaly Detection in Crowdsourcing: Why Midpoints in Interval-Valued Approach
How to detect anomalies based on interval-valued degrees: formulation of the problem. A natural idea is to utilize formulas that have been successful in detecting anomalies based on numerical degrees. To apply these formulas, we need to select a single value x from the corresponding interval [x, x]. In other words, we need an algorithm x = f (x, x) that generates a number based on the bounds of the worker-generated interval. Which algorithm f (x, x) should we select? We can take the arithmetic average, we can take the geometric average √(x · x), and we can have many other choices. An empirical analysis described in [2] has shown that anomaly detection is more accurate when we use the arithmetic average (x + x)/2. How can we explain this empirical result?

What we do in this paper. In this paper, we use natural invariances to explain this empirical result.
2 Our Explanation

Invariance. We can have different scales, so it is reasonable to require that the desired algorithm f (x, x) should not change if we apply a linear transformation to a different scale. Thus, we arrive at the following definition.

Definition 1 We say that a function f (x, x) is scale-invariant if for every linear transformation x → a + b · x with b > 0, and for all possible values x < x, once we have x = f (x, x), then we should also have y = f (y, y), where y = a + b · x, y = a + b · x, and y = a + b · x.

Additional requirement related to negation. In situations where there are only two possible choices A and B, if we use the scale from 0 to 1, then one way to interpret the degree of confidence x is as the probability that the correct choice is A. This same situation can be interpreted in terms of the probability 1 − x that the correct choice is B. If, instead of the exact probability x, we have an interval [x, x] of possible values of the A-probability, then the corresponding values of the B-probability 1 − x form an interval [1 − x, 1 − x].

• We can apply the desired function to the original interval [x, x] and thus get some probability x.
• Alternatively, we can apply the same function to the negation-related interval [1 − x, 1 − x] and get some probability y—in which case, for the probability of A, we get the value 1 − y.
A. De La Peña et al.
Since these are two ways to describe the same situation, it is reasonable to require that we should get the same probability, i.e., that we should have x = 1 − y. Thus, we arrive at the following definition.

Definition 2 We say that a function f (x, x) is negation-invariant if for all possible values 0 ≤ x < x ≤ 1, once we have x = f (x, x), then we should also have y = f (y, y), where y = 1 − x, y = 1 − x, and y = 1 − x.

Proposition The only scale-invariant and negation-invariant function f (x, x) is the arithmetic average

f (x, x) = (x + x)/2.

Proof It is easy to check that the arithmetic average is scale-invariant and negation-invariant. Let us prove that, vice versa, every scale-invariant and negation-invariant function f (x, x) is the arithmetic average.

Indeed, let us denote f (0, 1) by α. Here, x = 0, x = 1, and x = α. Then, for every two numbers x1 < x2, we can take a = x1 and b = x2 − x1. In this case, y = a + b · x = x1, y = a + b · x = x1 + (x2 − x1) = x2, and y = a + b · x = x1 + α · (x2 − x1) = α · x2 + (1 − α) · x1. Thus, due to scale-invariance, we conclude that

f (x1, x2) = α · x2 + (1 − α) · x1.  (1)
To find α, let us now use negation-invariance. According to this property, we should have f (1 − x2, 1 − x1) = 1 − f (x1, x2). Substituting the expression (1) into this formula, we conclude that

α · (1 − x1) + (1 − α) · (1 − x2) = 1 − α · x2 − (1 − α) · x1.

If we open the parentheses, we conclude that

1 − α · x1 − (1 − α) · x2 = 1 − α · x2 − (1 − α) · x1.

The two linear functions on both sides of this formula should be equal for all x1 < x2. Thus, the coefficients at x1 must coincide, so α = 1 − α and thus, α = 1/2—and therefore, the formula (1) becomes the arithmetic average. The proposition is proven.

Conclusion. We have explained why the arithmetic average works well: it is the only function that satisfies the natural invariance requirements.
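The two invariance properties are easy to verify numerically for the midpoint; the following sanity check is ours, not from the paper:

```python
def midpoint(lo, hi):
    """Arithmetic average f(lo, hi) = (lo + hi) / 2 of the interval bounds."""
    return (lo + hi) / 2

lo, hi = 0.7, 0.8
x = midpoint(lo, hi)

# Scale-invariance: re-scaling the bounds by y = a + b*x re-scales the result the same way.
a, b = 32.0, 1.8
assert abs(midpoint(a + b * lo, a + b * hi) - (a + b * x)) < 1e-9

# Negation-invariance: the midpoint of [1 - hi, 1 - lo] equals 1 - midpoint(lo, hi).
assert abs(midpoint(1 - hi, 1 - lo) - (1 - x)) < 1e-9

# The geometric average, in contrast, already fails scale-invariance for a shift by 1:
geom = lambda u, v: (u * v) ** 0.5
print(geom(1 + lo, 1 + hi) - (1 + geom(lo, hi)))  # nonzero
```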
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). One of the authors (VK) is thankful to all the participants of the 19th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU’2022 (Milan, Italy, July 11–15, 2022), especially to Chenyi Hu, for valuable discussions.
References

1. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, Massachusetts, 2016)
2. M. Spurling, C. Hu, H. Zhan, V.S. Sheng, Anomaly detection in crowdsourced work with interval-valued labels, in Proceedings of the 19th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU'2022 (Milan, Italy, 11–15 July 2022), pp. 504–516
Unexpected Economic Consequence of Cloud Computing: A Boost to Algorithmic Creativity Francisco Zapata, Eric Smith, and Vladik Kreinovich
Abstract While theoreticians have been designing more and more efficient algorithms, in the past, practitioners were not very interested in this activity: if a company already owns computers that provide computations in the required time, there is nothing to gain by using faster algorithms. We show that this situation has drastically changed with the transition to cloud computing: many companies have not yet realized this, but now any algorithmic speed-up leads to an immediate financial gain. This also has serious consequences for the whole computing profession: there is a need for professionals better trained in the subtle aspects of algorithmics.
1 Algorithms Can Be Made More Efficient

It is well known that many algorithms that we traditionally use are not the most efficient ones. This was a big revelation in the 20th century, when it turned out that not only the traditional algorithms for such exotic things as the Fourier transform were not the most efficient ones, but also traditional algorithms for multiplying matrices—and even for multiplying numbers—are far from efficient; see, e.g., [2].
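For instance, the schoolbook method multiplies two n-digit numbers in roughly n² digit operations, while Karatsuba's classical trick—one example of such a more efficient algorithm; this sketch is ours, not from the paper—needs only three recursive half-size multiplications per step:

```python
def karatsuba(x, y):
    """Multiply non-negative integers in ~n^1.585 digit operations instead of n^2."""
    if x < 10 or y < 10:          # base case: at least one single-digit factor
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    p = 10 ** half
    a, b = divmod(x, p)           # x = a * 10^half + b
    c, d = divmod(y, p)           # y = c * 10^half + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    cross = karatsuba(a + b, c + d) - ac - bd   # equals a*d + b*c, one multiplication saved
    return ac * 10 ** (2 * half) + cross * p + bd

print(karatsuba(1234, 5678))  # 7006652
```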
F. Zapata · E. Smith · V. Kreinovich (B) University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] F. Zapata e-mail: [email protected] E. Smith e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_49
2 Theoreticians Win Prizes, but Practitioners Are Not Impressed

Researchers still actively work on designing new, more and more efficient algorithms. These new algorithms win research prizes and academic acclaim, but in most cases, these developments do not attract much interest from practitioners. As a result, while there is a lot of effort invested in speeding up general algorithms—efforts supported by national funding agencies—there is very little effort invested in speeding up specialized algorithms intended for geo-, bio-, engineering, and other applications.
3 Why, in the Past, Practitioners Were Not Interested In the recent past, this lack of interest was mostly caused by the fact that we were still operating under Moore’s Law, according to which computer speed doubled every two years or so. So why invest in speeding up the algorithm by 20% if in two years, we will get a 100% increase for free?
4 Practitioners Are Still Not Very Interested Nowadays, Moore’s Law is over, but practitioners are still not very interested. The main reason for this is that—with the exception of a few time-critical situations when computation time is important—what would a company gain by using a faster algorithm? This company most probably already has computer hardware allowing it to perform all the computations it needs within the required time—so there will be no financial gain if these computations are performed faster.
5 But What Happens with the Transition to Cloud Computing? The above arguments work well when the company owns its computers. However, lately, a large portion of computations is done in the cloud—i.e., on computers owned by a cloud service (to which the company pays for these computations); see, e.g., [1]. What we plan to show is that this drastically changes the situation: many companies may not have realized that, but now it has become financially beneficial to support algorithmic creativity.
6 How a Company Pays for Cloud Computing: A Reminder

With cloud computing, a company only pays for the actual computations. This fact is the main reason why cloud computing is economically beneficial for companies. For example, a chain of stores does not need to buy additional computers to cover spikes in purchase-processing needs during the Christmas season—additional computers that would be mostly idle at other times. Instead, it can pay only for the additional computations during this season—and not spend any money at other times.
7 This Leads to a Boost in Algorithmic Creativity

Now that the company's payment is directly proportional to computation time, any decrease in computation time leads to immediate financial savings. For example, if a company spends 3 million dollars a year on cloud computing services, and its computer specialists manage to make its algorithms 20% faster, the company immediately saves 600,000 dollars—not an insignificant amount. With this in mind, it has become financially beneficial to try to speed up existing algorithms—i.e., to boost algorithmic creativity. Many companies have not yet recognized this—and this paper is one of the ways to convince them—but the financial logic is clear: the more algorithmic creativity, the larger the company's profit.
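The arithmetic behind this saving is worth making explicit (a trivial sketch of ours): under pay-per-compute billing, a fractional speed-up cuts the bill by the same fraction.

```python
def annual_savings(cloud_spend_per_year, speedup_fraction):
    """With pay-per-compute billing, compute time (and hence cost) drops
    by the same fraction as the algorithmic speed-up."""
    return cloud_spend_per_year * speedup_fraction

print(annual_savings(3_000_000, 0.20))  # 600000.0
```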
8 How This Will Affect the Education of Computing Professionals

As of now, most companies are not interested in computational efficiency, so they hire people who can code—without requiring that these folks be familiar with all the techniques used in making algorithms faster. The resulting demand leads to an emphasis on basic skills when teaching computing professionals. As more and more companies realize that algorithmic creativity is profitable, there will be a larger need for professionals who are more skilled in algorithmics—and the resulting demand will definitely change the way computing professionals are educated. This phenomenon will also boost the corresponding theory—and modify it: since companies will be interested in actual computation time, the corresponding problems will switch from optimizing approximate characteristics, like the number of elementary computational steps, to more sophisticated characteristics that provide a better approximation to the actual computation time.
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. D. Comer, The Cloud Computing Book: The Future of Computing Explained (CRC Press, Boca Raton, Florida, 2021) 2. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press, Cambridge, Massachusetts, 2022)
Unreachable Statements Are Inevitable in Software Testing: Theoretical Explanation Francisco Zapata, Eric Smith, and Vladik Kreinovich
Abstract Often, there is a need to migrate software to a new environment. The existing migration tools are not perfect, so, after applying such a tool, we need to test the resulting software. If a test reveals an error, this error needs to be corrected. Usually, the test also produces some warnings. One of the most typical warnings is that a certain statement is unreachable. The appearance of such warnings is often viewed as an indication that the original software developer was not very experienced. In this paper, we show that this view oversimplifies the situation: unreachable statements are, in general, inevitable. Moreover, a wide use of the above-mentioned view can be counterproductive: developers who want to appear more experienced will skip potentially unreachable statements and thus make the software less reliable.
1 Unreachable Statements Happen

Many software systems periodically migrate to new software environments:

• new operating systems,
• new compilers,
• sometimes even a new programming language.

Usually, most of this migration is performed automatically, by using special migration-enhancing tools: without such tools, migration of a million-line code base would not be possible.
F. Zapata · E. Smith · V. Kreinovich (B) University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] F. Zapata e-mail: [email protected] E. Smith e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_50
However, the result is rarely perfect. For example, some tricks that use specific features of the original operating system may not work in the new software environment. So, before using the migrated code, it is important to test the result of automatic migration. Compilers and other testing tools check the migrated code and produce errors and warnings; see, e.g., [1, 3, 4].

• Errors clearly indicate that something is wrong with the program. Once a compiler or a tester finds an error, this error needs to be corrected.
• In contrast, a warning does not necessarily mean that something is wrong with the program: it may simply mean that a software developer should double-check this part of the code.

One of the most frequent warnings is a warning that a certain statement is unreachable. Such warnings are the main object of this study.
2 Unreachable Statements: Why? The ubiquity of unreachable statements prompts the need to explain where they come from. An unreachable statement means that: • the original software developer decided to add a certain statement to take care of a situation when certain unusual conditions are satisfied, but • the developer did not realize that these conditions cannot be satisfied—while the newly applied compiler or tester detected this impossibility.
3 Can We Avoid Unreachable Statements?

As we have mentioned, the main reason for the appearance of unreachable statements is that the original software developer(s) did not realize that the corresponding conditions are never satisfied. This implies that a more skilled developer would have been able to detect this fact and avoid these statements. So maybe, if we have sufficiently skilled developers, we can avoid unreachable statements altogether?

Unfortunately, a simple analysis of this problem shows that, in general, it is not possible to detect all unreachable statements—and thus, that unreachable statements are inevitable. Indeed, the condition that needs to be satisfied to reach such a statement is often described as a boolean expression, i.e., as an expression formed from elementary true-false conditions c1, . . . , cn by using the boolean operations "and" (&), "or" (∨), and "not" (¬). For example, we can have the condition (c1 ∨ c2) & (¬c1 ∨ ¬c2). If the corresponding boolean expression is always false, then the condition is never satisfied and thus, the corresponding statement is unreachable.
Unreachable Statements Are Inevitable in Software Testing: Theoretical Explanation
307
There are boolean expressions which are never satisfied. For example, the following boolean expression is always false:

(c1 ∨ c2) & (c1 ∨ ¬c2) & (¬c1 ∨ c2) & (¬c1 ∨ ¬c2).

One can easily check that this expression is always false by considering all four possible combinations of truth values of the boolean variables c1 and c2:

• both c1 and c2 are true,
• c1 is true and c2 is false,
• c1 is false and c2 is true, and
• both c1 and c2 are false.
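This exhaustive check over all truth assignments is easy to automate; the following brute-force sketch (ours) verifies both expressions mentioned above:

```python
from itertools import product

def always_false(expr, n_vars):
    """Brute-force check that a boolean expression is false under
    every assignment of truth values to its n_vars variables."""
    return not any(expr(*values) for values in product([False, True], repeat=n_vars))

# (c1 v c2) & (c1 v ~c2) & (~c1 v c2) & (~c1 v ~c2) -- always false:
e1 = lambda c1, c2: (c1 or c2) and (c1 or not c2) and (not c1 or c2) and (not c1 or not c2)
print(always_false(e1, 2))  # True

# (c1 v c2) & (~c1 v ~c2) -- satisfiable (e.g., c1 true, c2 false):
e2 = lambda c1, c2: (c1 or c2) and (not c1 or not c2)
print(always_false(e2, 2))  # False
```

Of course, this check takes 2^n evaluations of the expression, so it is only feasible for a small number of variables.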
The problem of checking whether a given boolean expression is always false is known to be NP-hard; see, e.g., [2]. This means, crudely speaking, that unless P = NP (which most computer scientists believe to be false), no feasible algorithm is possible that would always perform this check. Not only is this problem NP-hard, it is actually, historically, the first problem for which NP-hardness was proven. (To be more precise, the historically first NP-hard problem was to check whether a given boolean expression E is always true, but this is, in effect, the same problem, since an expression E is always false if and only if its negation ¬E is always true.)

This NP-hardness result means that no matter what testing algorithm the software developer uses, for a sufficiently large and sufficiently complex software package, testing will not reveal all unreachable statements. In other words, unreachable statements are inevitable. To be more precise:

• a novice software developer may leave more such statements in the code;
• an experienced software developer will leave fewer unreachable statements;

however, in general, unreachable statements are inevitable.
4 So Can We Use the Number of Detected Unreachable Statements to Gauge the Experience of the Original Software Developer? At first glance, the conclusion from the previous section seems to be that we can use the number of detected unreachable statements to gauge the experience of the original software developer: • if the migrated software has a relatively large number of detected unreachable statements, this seems to indicate that the original software developer was not very skilled;
• on the other hand, if the migrated software has a relatively small number of detected unreachable statements, this seems to indicate that the original software developer was more skilled.

But can we really make such definite conclusions? As we have mentioned earlier, the main reason why unreachable statements appear is that a software developer is not sure whether the corresponding condition is possible—and, as we have mentioned in the previous section, no feasible algorithm can always check whether the given condition is possible. So, if we start seriously considering the number of detected unreachable statements as a measure of the quality (experience) of the original software developer, developers who want to boost their reputation would have a simple way of increasing their perceived quality: if we do not know whether a condition is possible or not, just do not add any statement for this condition. This, by the way, will make the program more efficient, since there will be no need to spend computer time checking all these suspicious statements. In this case, the number of detected unreachable statements will be 0, so, from the viewpoint of this criterion, the developer will look very experienced.

But what if one of these conditions is actually satisfied sometimes? In this case, for this unexpected condition—for which we did not prepare a proper answer—the program will produce God knows what, i.e., we will have an error. Is this what we want? We have made the program more efficient, faster to run, but it is now less reliable. Is a small increase in efficiency indeed so important that we can sacrifice reliability to achieve it? Not really:

• With most modern computer applications, small increases in running speed—e.g., time savings obtained by not checking some conditions—do not add any useful practical features: e.g., whether processing a patient's X-ray takes 60 or 55 s does not make any difference.
• On the other hand, reliability is a serious issue.
For example, if the software misses a disease clearly visible on an X-ray, we may miss a chance to prevent this disease from evolving into a more serious (possibly deadly) condition. From this viewpoint, the presence of unreachable statements is, counter-intuitively, a good sign: it makes us more confident that the program is reliable.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. G. Blokdyk, Software Modernization: A Complete Guide (The Art of Service, 2020) 2. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press, Cambridge, Massachusetts, 2022) 3. F. Zapata, O. Kosheleva, V. Kreinovich, How to estimate time needed for software migration. Appl. Math. Sci. 15(1), 9–14 (2021) 4. F. Zapata, O. Lerma, L. Valera, V. Kreinovich, How to speed up software migration and modernization: successful strategies developed by precisiating expert knowledge, in Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society NAFIPS’2015 and 5th World Conference on Soft Computing (Redmond, Washington, 17–19 Aug 2015)
General Computational Techniques
Why Constraint Interval Arithmetic Techniques Work Well: A Theorem Explains Empirical Success Barnabas Bede, Marina Tuyako Mizukoshim, Martine Ceberio, Vladik Kreinovich, and Weldon Lodwick
Abstract Often, we are interested in a quantity that is difficult or impossible to measure directly, e.g., tomorrow's temperature. To estimate this quantity, we measure auxiliary easier-to-measure quantities that are related to the desired one by a known dependence, and use the known relation to estimate the desired quantity. Measurements are never absolutely accurate: there is always a measurement error, i.e., a non-zero difference between the measurement result and the actual (unknown) value of the corresponding quantity. Often, the only information that we have about each measurement error is a bound on its absolute value. In such situations, after each measurement, the only information that we gain about the actual (unknown) value of the corresponding quantity is that this value belongs to the corresponding interval. Thus, the only information that we have about the value of the desired quantity is that it belongs to the range of values of the corresponding function when its inputs are in these intervals. Computing this range is one of the main problems of interval computations. Lately, it was shown that in many cases, it is more efficient to compute the range if we first re-scale each input to the interval [0, 1]; this is one of
B. Bede DigiPen Institute of Technology, 9931 Willows Rd, Redmond, WA 98052, USA e-mail: [email protected] M. Tuyako Mizukoshim Federal University of Goias, Goiânia, Brazil e-mail: [email protected] M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] W. Lodwick Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_51
the main ideas behind Constraint Interval Arithmetic techniques. In this paper, we explain the empirical success of this idea and show that, in some reasonable sense, this re-scaling is the best.
1 Formulation of the Problem

Need for data processing. In many practical situations, we are interested in a quantity y that is difficult or impossible to measure directly, such as the distance to a faraway star or tomorrow's weather. The usual way to estimate this quantity is:

• to find easier-to-measure characteristics $x_1, \ldots, x_n$ that are related to y by a known relation $y = f(x_1, \ldots, x_n)$,
• to measure the values of these characteristics, and
• to use the results $\tilde{x}_i$ of these measurements to estimate y as $\tilde{y} = f(\tilde{x}_1, \ldots, \tilde{x}_n)$.

For example, to estimate tomorrow's temperature y at some location, we can use the measurements of temperature, atmospheric pressure, humidity, wind speed, etc., in this and nearby locations. Such an estimation is what is usually called data processing.

Need to take interval uncertainty into account. Measurements are never absolutely accurate: there is always a difference $\Delta x \stackrel{\text{def}}{=} \tilde{x} - x$ between the measurement result $\tilde{x}$ and the actual (unknown) value x of the corresponding quantity. This difference is known as the measurement error. Since the measurement results $\tilde{x}_i$ are, in general, different from the actual values $x_i$, the value $\tilde{y} = f(\tilde{x}_1, \ldots, \tilde{x}_n)$ obtained by processing these measurement results is, in general, different from the desired value $y = f(x_1, \ldots, x_n)$.

In many situations, the only information that we have about each measurement error $\Delta x_i$ is the upper bound $\Delta_i > 0$ on its absolute value: $|\Delta x_i| \le \Delta_i$; see, e.g., [7]. In such situations, after the measurement, the only information that we gain about the actual value $x_i$ is that this value belongs to the interval $[\underline{x}_i, \overline{x}_i]$, where $\underline{x}_i \stackrel{\text{def}}{=} \tilde{x}_i - \Delta_i$ and $\overline{x}_i \stackrel{\text{def}}{=} \tilde{x}_i + \Delta_i$.

Different values of $x_i$ from these intervals lead, in general, to different values of $y = f(x_1, \ldots, x_n)$. It is therefore desirable to find the range of possible values of y, i.e., the range

$[\underline{y}, \overline{y}] = \{f(x_1, \ldots, x_n) : x_1 \in [\underline{x}_1, \overline{x}_1], \ldots, x_n \in [\underline{x}_n, \overline{x}_n]\}.$  (1)

Here,

$\underline{y} = \min\{f(x_1, \ldots, x_n) : x_1 \in [\underline{x}_1, \overline{x}_1], \ldots, x_n \in [\underline{x}_n, \overline{x}_n]\}$  (2)
and

$\overline{y} = \max\{f(x_1, \ldots, x_n) : x_1 \in [\underline{x}_1, \overline{x}_1], \ldots, x_n \in [\underline{x}_n, \overline{x}_n]\}.$  (3)
Computing this range is one of the main problems of interval computations; see, e.g., [1, 5, 6].

Constraint Interval Arithmetic techniques and their successes. Recently, it turned out (see, e.g., [2–4]) that in many cases, to solve the optimization problems (2) and (3), it is beneficial to first perform a linear re-scaling of the variables. Specifically, instead of the original variables $x_i$ whose range is the interval $[\underline{x}_i, \overline{x}_i]$, with $\underline{x}_i < \overline{x}_i$, it is useful to introduce auxiliary variables

$\alpha_i \stackrel{\text{def}}{=} \dfrac{x_i - \underline{x}_i}{\overline{x}_i - \underline{x}_i}$  (4)

whose range is [0, 1]. In terms of $\alpha_i$, each original variable $x_i$ has the form

$x_i = \underline{x}_i + \alpha_i \cdot (\overline{x}_i - \underline{x}_i),$  (5)

and thus, the value $y = f(x_1, \ldots, x_n)$ takes the form $y = F(\alpha_1, \ldots, \alpha_n)$, where

$F(\alpha_1, \ldots, \alpha_n) \stackrel{\text{def}}{=} f(\underline{x}_1 + \alpha_1 \cdot (\overline{x}_1 - \underline{x}_1), \ldots, \underline{x}_n + \alpha_n \cdot (\overline{x}_n - \underline{x}_n)).$  (6)
This is one of the main ideas behind Constraint Interval Arithmetic techniques. Interestingly, in many cases, such a simple re-scaling improves the optimization results. But why?

That re-scaling of the variables often helps with optimization is not surprising. For example, the basic minimization method—gradient descent, in which we iteratively replace the values $x_i^{(k)}$ of the current iteration with the new values

$x_i^{(k+1)} = x_i^{(k)} - \lambda \cdot \left.\dfrac{\partial f}{\partial x_i}\right|_{x_1 = x_1^{(k)}, \ldots, x_n = x_n^{(k)}},$

behaves differently when we re-scale the original variables $x_i$ into new variables $y_i \stackrel{\text{def}}{=} c_i \cdot x_i$. Another example where re-scaling is helpful is preconditioning of systems of linear equations.

Of course, the fact that re-scaling is sometimes helpful does not mean that re-scaling will always help: it depends on the optimization technique. Some sophisticated optimization packages already perform some variable re-scaling themselves, in which case additional prior re-scaling does not make much sense. However, for less sophisticated packages—e.g., the ones that rely on gradient descent (at least in the early stages of optimization)—the prior re-scaling helps. So, a natural question is: how can we explain the empirical success of the re-scaling (4)–(5)? And is this the best re-scaling we can apply—or are there better re-scalings?
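To make the change of variables concrete, here is a minimal sketch (ours, not from the paper) that estimates the range of f over a box by evaluating the re-scaled function F of (6) on a grid of α-values in [0, 1]^n, using formula (5) to map each α_i back to x_i:

```python
from itertools import product

def rescaled_range(f, boxes, steps=50):
    """Estimate the range of f over the given intervals by sampling the
    re-scaled function F(alpha_1, ..., alpha_n) on a grid in [0, 1]^n."""
    grid = [k / steps for k in range(steps + 1)]
    lo, hi = float("inf"), float("-inf")
    for alphas in product(grid, repeat=len(boxes)):
        # formula (5): x_i = lower_i + alpha_i * (upper_i - lower_i)
        xs = [a + alpha * (b - a) for alpha, (a, b) in zip(alphas, boxes)]
        y = f(*xs)
        lo, hi = min(lo, y), max(hi, y)
    return lo, hi

# Range of f(x1, x2) = x1 * x2 over [1, 2] x [3, 4] is [3, 8]:
print(rescaled_range(lambda x1, x2: x1 * x2, [(1, 2), (3, 4)]))  # (3.0, 8.0)
```

In a real implementation, the grid search would be replaced by a proper optimizer over [0, 1]^n; the point here is only the re-scaling of the variables.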
What we do in this paper. In this paper, we provide answers to both questions:

• we explain why the re-scaling (4)–(5) works, and
• we show that this re-scaling is, in some reasonable sense, optimal.
2 Scale Invariance Explains the Empirical Success of Constraint Interval Arithmetic Techniques

Similar re-scalings naturally appear in practical situations. While in Constraint Interval Arithmetic techniques, re-scalings are introduced artificially, to help solve the corresponding optimization problems, similar re-scalings are well known (and widely used) in data processing. Indeed, the use of re-scalings is related to the fact that when we process data, we intend to deal with the actual physical quantities, but what we actually deal with are numerical values of these quantities. Numerical values depend on the choice of the measuring unit. For example, if we replace meters with centimeters, all numerical values get multiplied by 100: e.g., 1.7 m becomes 170 cm. In general, if we replace the measuring unit for measuring $x_i$ by a new unit which is $a_i > 0$ times smaller, then instead of the original numerical value $x_i$, we get a new value $a_i \cdot x_i$ describing the same quantity.

For many physical quantities, such as temperature or time, the numerical value also depends on the choice of the starting point. If we select a new starting point which is $b_i$ units earlier than the original one, then instead of each original value $x_i$, we get a new numerical value $x_i + b_i$ describing the same amount of the corresponding quantity. If we replace both the measuring unit and the starting point, then we get a general linear re-scaling $x_i \to a_i \cdot x_i + b_i$. A classical example of such a re-scaling is the transition between the temperatures $t_F$ and $t_C$ in the Fahrenheit and Celsius scales: $t_F = 1.8 \cdot t_C + 32$. Under such a linear transformation, an interval $[\underline{x}_i, \overline{x}_i]$ gets transformed into the interval $[a_i \cdot \underline{x}_i + b_i, a_i \cdot \overline{x}_i + b_i]$.

Need for scale-invariance and permutation-invariance. Re-scalings related to changing the measuring unit and the starting point change numerical values, but they do not change the practical problem.
The practical problem remains the same whether we measure length in meters or in centimeters (or in inches). It is therefore reasonable, before feeding the problem to an optimization software package, to first perform some additional re-scaling, so that the resulting re-scaled optimization problem does not depend on what measuring units and what starting points we used for our measurements. It is also reasonable to require that the resulting re-scaled optimization problem not depend on which variable we call first, which second, etc., i.e., that it should be invariant with respect to all possible permutations.
Why Constraint Interval Arithmetic Techniques Work Well …
317
Let us formulate this requirement in precise terms. After the measurements, the only information that we get is, in effect, the endpoints x̲_i and x̄_i. So, a strategy for a proper additional re-scaling can use all these endpoints.

Definition 1 Let us fix an integer n.

• By a problem, we mean a tuple ⟨f(x_1, …, x_n), [x̲_1, x̄_1], …, [x̲_n, x̄_n]⟩, where f(x_1, …, x_n) is a real-valued function of n real variables and [x̲_i, x̄_i] are intervals, with x̲_i < x̄_i.
• We say that problems ⟨f(x_1, …, x_n), [x̲_1, x̄_1], …, [x̲_n, x̄_n]⟩ and ⟨g(y_1, …, y_n), [y̲_1, ȳ_1], …, [y̲_n, ȳ_n]⟩ are obtained from each other by permutation if for some permutation π : {1, …, n} → {1, …, n}:
– we have g(x_1, …, x_n) = f(x_π(1), …, x_π(n)) for all x_i, and
– we have [y̲_i, ȳ_i] = [x̲_π(i), x̄_π(i)] for all i.
• We say that problems ⟨f(x_1, …, x_n), [x̲_1, x̄_1], …, [x̲_n, x̄_n]⟩ and ⟨g(y_1, …, y_n), [y̲_1, ȳ_1], …, [y̲_n, ȳ_n]⟩ are obtained from each other by re-scaling if for some real numbers a_1 > 0, …, a_n > 0, b_1, …, b_n:
– we have g(y_1, …, y_n) = f(a_1 · y_1 + b_1, …, a_n · y_n + b_n) for all y_i, and
– we have [y̲_i, ȳ_i] = [a_i · x̲_i + b_i, a_i · x̄_i + b_i] for all i.
• By a re-scaling strategy, we mean a tuple of functions s = ⟨p_1(x̲_1, x̄_1, …, x̲_n, x̄_n), q_1(x̲_1, x̄_1, …, x̲_n, x̄_n), …, p_n(x̲_1, x̄_1, …, x̲_n, x̄_n), q_n(x̲_1, x̄_1, …, x̲_n, x̄_n)⟩.
• By the result of applying a re-scaling strategy ⟨p_1, …⟩ to a problem ⟨f(x_1, …, x_n), [x̲_1, x̄_1], …, [x̲_n, x̄_n]⟩, we mean the problem ⟨f(p_1 · x_1 + q_1, …, p_n · x_n + q_n), [p_1 · x̲_1 + q_1, p_1 · x̄_1 + q_1], …, [p_n · x̲_n + q_n, p_n · x̄_n + q_n]⟩.
• We say that the re-scaling strategy is permutation-invariant if whenever two problems are obtained from each other by permutation, the results of applying this strategy to both problems should be obtained from each other by the same permutation.
318
B. Bede et al.
• We say that the re-scaling strategy is scale-invariant if whenever two problems are obtained from each other by re-scaling, the results of applying this strategy to both problems will be the same.

Comment. In particular, the re-scaling strategy corresponding to Constraint Interval Arithmetic techniques has the form

p_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = 1/(x̄_i − x̲_i) and q_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = −x̲_i/(x̄_i − x̲_i).

This re-scaling transforms each interval [x̲_i, x̄_i] into the interval

[p_i · x̲_i + q_i, p_i · x̄_i + q_i] = [(x̲_i − x̲_i)/(x̄_i − x̲_i), (x̄_i − x̲_i)/(x̄_i − x̲_i)] = [0, 1].
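This computation is easy to check numerically; the following sketch (our own code, with hypothetical names) builds the coefficients p_i and q_i of this strategy and verifies that every interval is mapped to [0, 1]:

```python
# Coefficients of the Constraint-Interval-Arithmetic-style re-scaling:
# p_i = 1 / (ub_i - lb_i), q_i = -lb_i / (ub_i - lb_i).
def cia_coefficients(intervals):
    return [(1.0 / (ub - lb), -lb / (ub - lb)) for (lb, ub) in intervals]

intervals = [(2.0, 5.0), (-1.0, 4.0), (0.25, 0.75)]
coeffs = cia_coefficients(intervals)

# Image of each interval under x -> p * x + q: always [0, 1].
images = [(p * lb + q, p * ub + q)
          for (lb, ub), (p, q) in zip(intervals, coeffs)]
```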
Alternatively, we could use a different re-scaling strategy: e.g., we could try to place all the intervals into [0, 1] by applying the same linear transformation to all i, e.g., the strategy

p_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = 1/(max_j x̄_j − min_j x̲_j) and q_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = −(min_j x̲_j)/(max_j x̄_j − min_j x̲_j).

This re-scaling transforms each interval [x̲_i, x̄_i] into an interval [p_i · x̲_i + q_i, p_i · x̄_i + q_i] ⊆ [0, 1], which is, in general, different from [0, 1].

Proposition 1 For each re-scaling strategy, the following two conditions are equivalent to each other:
• the re-scaling strategy is permutation-invariant and scale-invariant, and
• for some values A > 0 and B, the re-scaling strategy has the form

p_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = A · 1/(x̄_i − x̲_i)  (7)

and

q_i(x̲_1, x̄_1, …, x̲_n, x̄_n) = −A · x̲_i/(x̄_i − x̲_i) + B.  (8)
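The contrast between the two strategies can also be checked numerically. In the following sketch (our own code; all names are ours), we re-scale the variables of a problem and verify that the per-interval strategy (7)-(8) with A = 1 and B = 0 produces the same re-scaled intervals before and after, while the "same transformation for all i" strategy does not:

```python
def apply_strategy(intervals, coeffs):
    # Image of each interval [lb, ub] under x -> p * x + q.
    return [(p * lb + q, p * ub + q)
            for (lb, ub), (p, q) in zip(intervals, coeffs)]

def per_interval_coeffs(intervals):
    # Strategy (7)-(8) with A = 1, B = 0: each interval goes to [0, 1].
    return [(1.0 / (ub - lb), -lb / (ub - lb)) for (lb, ub) in intervals]

def common_coeffs(intervals):
    # The alternative strategy: one transformation shared by all variables.
    lo = min(lb for lb, _ in intervals)
    hi = max(ub for _, ub in intervals)
    return [(1.0 / (hi - lo), -lo / (hi - lo))] * len(intervals)

def close(res1, res2, tol=1e-9):
    return all(abs(u - v) < tol and abs(s - t) < tol
               for (u, s), (v, t) in zip(res1, res2))

orig = [(0.0, 1.0), (10.0, 30.0)]
# Re-scale variable 1 by x -> 2x + 5 and variable 2 by x -> 0.5x - 3.
rescaled = [(2.0 * 0.0 + 5.0, 2.0 * 1.0 + 5.0),
            (0.5 * 10.0 - 3.0, 0.5 * 30.0 - 3.0)]

per_interval_same = close(apply_strategy(orig, per_interval_coeffs(orig)),
                          apply_strategy(rescaled, per_interval_coeffs(rescaled)))
common_same = close(apply_strategy(orig, common_coeffs(orig)),
                    apply_strategy(rescaled, common_coeffs(rescaled)))
```

Here per_interval_same comes out true and common_same false, in line with the proposition.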
Discussion. For A = 1 and B = 0, we get the re-scaling strategy

x_i → (x_i − x̲_i)/(x̄_i − x̲_i)

used in Constraint Interval Arithmetic techniques. In general, for a permutation-invariant and scale-invariant strategy, the corresponding re-scaling has the form

x_i → A · (x_i − x̲_i)/(x̄_i − x̲_i) + B.  (9)
This re-scaling can be described as follows:
• first, we apply the re-scaling used in Constraint Interval Arithmetic techniques, and
• then, we apply an additional re-scaling x → A · x + B.

This result (almost) explains why the re-scaling strategy corresponding to Constraint Interval Arithmetic techniques is so effective. We said "almost" since it is possible to use different values of A and B. The selection of A = 1 and B = 0 can be explained, e.g., by the fact that this selection leads to the simplest possible expression (9), with the smallest number of arithmetic operations needed to compute the value of this expression.

Proof of Proposition 1. One can easily check that the re-scaling strategy (7)–(8) is permutation-invariant and scale-invariant. Vice versa, let us prove that every permutation-invariant and scale-invariant re-scaling strategy has the form (7)–(8). Indeed, let us assume that we have a permutation-invariant and scale-invariant re-scaling strategy. Then, by applying the re-scaling

x_i → (x_i − x̲_i)/(x̄_i − x̲_i),

we can reduce each problem to the form in which all intervals are equal to [0, 1]. For this form, the formulas describing the given re-scaling strategy become p_i(0, 1, …, 0, 1) and q_i(0, 1, …, 0, 1). In this case, permutation-invariance means that p_i(0, 1, …, 0, 1) = p_π(i)(0, 1, …, 0, 1) for all permutations π. Thus, we have p_1(0, 1, …, 0, 1) = … = p_n(0, 1, …, 0, 1). Let us denote the common value of all p_i(0, 1, …, 0, 1) by A. Similarly, we can conclude that q_1(0, 1, …, 0, 1) = … = q_n(0, 1, …, 0, 1). Let us denote the common value of all q_i(0, 1, …, 0, 1) by B. Thus, for this [0, 1]-case, the re-scaling strategy performs the same transformation x_i → A · x_i + B for all i. By applying this transformation to the result of transforming the original problem into the [0, 1]-case, we get exactly the transformation (7)–(8). The proposition is proven.
3 Which Re-Scaling Strategy Is Optimal?

What do we mean by "optimal"? Usually, optimal means that there is a numerical characteristic that describes the quality of different alternatives, and an alternative is optimal if it has the largest (or the smallest) value of this characteristic. However, this description does not capture all the meanings of optimality. For example, if we are designing a computer network with the goal of maximizing the throughput, and several plans lead to the same throughput, this means we can use this non-uniqueness to optimize something else: e.g., minimize the cost or minimize the ecological impact. In this case, the criterion for comparing two alternatives is more complex than a simple numerical comparison. Indeed, in this case, an alternative A is better than the alternative B if it either has larger throughput, or it has the same throughput but smaller cost.

We can have even more complex criteria. The only common feature of all these criteria is that they should decide, for each pair of alternatives A and B, whether A is better than B (we will denote this by A > B), or B is better than A (B > A), or A and B are of the same quality to the user (we will denote this by A ∼ B). This comparison must be consistent: e.g., if A is better than B, and B is better than C, then we expect A to be better than C. Also, as we have mentioned, if, according to a criterion, there are several equally good optimal alternatives, then we can use this non-uniqueness to optimize something else—i.e., this optimality criterion is not final. Once the criterion is final, there should therefore be only one optimal strategy. Finally, it is reasonable to require that which re-scaling strategy is better should not change if we first apply some permutation to both strategies—or first apply some re-scaling to all the variables. Thus, we arrive at the following definitions.

Definition 2
• Let S be a set; its elements will be called alternatives.
By an optimality criterion on the set S, we mean a pair of relations (>, ∼) for which:
– A > B and B > C imply A > C,
– A ∼ B and B > C imply A > C,
– A > B and B ∼ C imply A > C,
– A ∼ B and B ∼ C imply A ∼ C, and
– A > B implies that we cannot have A ∼ B.
• An alternative A is called optimal with respect to criterion (>, ∼) if for every B ∈ S, we have either A > B or A ∼ B. • An optimality criterion is called final if it has exactly one optimal alternative.
Definition 3 Let s = ⟨p_1(x̲_1, x̄_1, …, x̲_n, x̄_n), q_1(x̲_1, x̄_1, …, x̲_n, x̄_n), …, p_n(x̲_1, x̄_1, …, x̲_n, x̄_n), q_n(x̲_1, x̄_1, …, x̲_n, x̄_n)⟩ be a re-scaling strategy.

• For each permutation π : {1, …, n} → {1, …, n}, by the result π(s) of applying this permutation to s, we mean the re-scaling strategy π(s) = ⟨p_π(1)(x̲_π(1), x̄_π(1), …, x̲_π(n), x̄_π(n)), q_π(1)(x̲_π(1), x̄_π(1), …, x̲_π(n), x̄_π(n)), …, p_π(n)(x̲_π(1), x̄_π(1), …, x̲_π(n), x̄_π(n)), q_π(n)(x̲_π(1), x̄_π(1), …, x̲_π(n), x̄_π(n))⟩.
• We say that the optimality criterion is permutation-invariant if for every permutation π:
– s > s′ is equivalent to π(s) > π(s′), and
– s ∼ s′ is equivalent to π(s) ∼ π(s′).
• For each tuple of re-scalings t = ⟨a_1, b_1, …, a_n, b_n⟩, by the result t(s) of applying these re-scalings to s, we mean the following re-scaling strategy: t(s) = ⟨p_1(a_1 · x̲_1 + b_1, a_1 · x̄_1 + b_1, …, a_n · x̲_n + b_n, a_n · x̄_n + b_n), q_1(a_1 · x̲_1 + b_1, a_1 · x̄_1 + b_1, …, a_n · x̲_n + b_n, a_n · x̄_n + b_n), …, p_n(a_1 · x̲_1 + b_1, a_1 · x̄_1 + b_1, …, a_n · x̲_n + b_n, a_n · x̄_n + b_n), q_n(a_1 · x̲_1 + b_1, a_1 · x̄_1 + b_1, …, a_n · x̲_n + b_n, a_n · x̄_n + b_n)⟩.
• We say that the optimality criterion is scale-invariant if for every tuple of re-scalings t:
– s > s′ is equivalent to t(s) > t(s′), and
– s ∼ s′ is equivalent to t(s) ∼ t(s′).

Proposition 2 For every final permutation-invariant and scale-invariant optimality criterion, the optimal re-scaling strategy has the form (7)–(8).

Discussion. Thus, re-scalings which are similar to the ones used in Constraint Interval Arithmetic techniques are indeed optimal, and not just optimal with respect to one specific optimality criterion—they are optimal with respect to any reasonable optimality criterion.
Proof of Proposition 2. 1°. Let us first prove that, in general, for any invertible transformation T, if a final optimality criterion is invariant with respect to this transformation, then the corresponding optimal alternative a_opt is also invariant with respect to T.

Indeed, the fact that this alternative is optimal means that for every alternative a, we have a_opt > a or a_opt ∼ a. In particular, for every alternative a, this property is true for the result T⁻¹(a) of applying the inverse transformation T⁻¹ to this alternative a. In other words, for every alternative a, we have either a_opt > T⁻¹(a) or a_opt ∼ T⁻¹(a). Since the optimality criterion is T-invariant, the condition a_opt > T⁻¹(a) implies that T(a_opt) > T(T⁻¹(a)) = a, and similarly, the condition a_opt ∼ T⁻¹(a) implies that T(a_opt) ∼ T(T⁻¹(a)) = a. Thus, for every alternative a, we have either T(a_opt) > a or T(a_opt) ∼ a. By definition of the optimal alternative, this means that the alternative T(a_opt) is optimal. We assumed that our optimality criterion is final, which means that there is only one optimal alternative. Thus, we must have T(a_opt) = a_opt. In other words, the optimal alternative a_opt is indeed T-invariant.

2°. In our case, the statement from Part 1 means that the optimal re-scaling strategy is permutation-invariant and scale-invariant. According to Proposition 1, this implies that the optimal re-scaling strategy has the form (7)–(8). The proposition is proven.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT & T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No.
075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. L. Jaulin, M. Kieffer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001) 2. W.A. Lodwick, Constrained Interval Arithmetic, University of Colorado at Denver, Center for Computational Mathematics CCM, Report 138 (1999) 3. W.A. Lodwick, Interval and fuzzy analysis: a unified approach, in Advances in Imaging and Electronic Physics, vol. 148, ed. by P.W. Hawkes (Elsevier Press, 2017), pp. 75–192 4. W.A. Lodwick, D. Dubois, Interval linear systems as a necessary step in fuzzy linear systems. Fuzzy Sets and Systems 281, 227–251 (2015) 5. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017) 6. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009) 7. S.G. Rabinovich, Measurement Errors and Uncertainties: Theory and Practice (Springer, New York, 2005)
How to Describe Relative Approximation Error? A New Justification for Gustafson’s Logarithmic Expression Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich
Abstract How can we describe relative approximation error? When a value b approximates a value a, the usual description of this error is the ratio |b − a|/|a|. The problem with this approach is that, contrary to our intuition, we get different numbers gauging how well a approximates b and how well b approximates a. To avoid this problem, John Gustafson proposed to use the logarithmic measure |ln(b/a)|. In this paper, we show that this is, in effect, the only regular scale-invariant way to describe the relative approximation error.
1 Formulation of the Problem

It is desirable to describe relative approximation error. If we use a value a to approximate a value b, then the natural measure of the accuracy of this approximation is the absolute value |a − b| of the difference between these two values. This quantity is known as the absolute approximation error. In situations when both a and b represent values of some physical quantity, the absolute error changes when we replace the original measuring unit with one which is λ > 0 times smaller. After this replacement, the numerical values describing the corresponding quantities get multiplied by λ: a → a′ = λ · a and b → b′ = λ · b; for example, if we replace meters by centimeters, 1.7 m becomes 100 · 1.7 = 170 cm. In this case, the numerical value of the absolute approximation error also gets multiplied by λ:
M. Ceberio · O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_52
|a′ − b′| = |λ · a − λ · b| = λ · |a − b|.

However, it is sometimes desirable to provide a measure of approximation error that would not depend on the choice of the measuring unit. Such measures are known as relative approximation errors.

Traditional description of relative approximation error and its limitations. Usually, the relative approximation error is described by the ratio

|a − b|/|b|;  (1)

see, e.g., [3]. The problem with this measure is that, intuitively, the value a approximates the value b with exactly the same accuracy as b approximates a. However, with the measure (1), this is not true. For example, 0.8 approximates 1 with relative accuracy

|0.8 − 1|/1 = 0.2,

while 1 approximates 0.8 with relative accuracy

|1 − 0.8|/0.8 = 0.25 ≠ 0.2.

Logarithmic measure of relative accuracy. To avoid the above-described asymmetry, John Gustafson [2] proposed to use the following alternative expression for relative approximation accuracy:

|ln(a/b)|.  (2)

One can easily check that this expression is indeed symmetric: |ln(a/b)| = |ln(b/a)|.

Natural question. There can be several different symmetric measures, so why the logarithmic one?

What we do in this paper. In this paper, we provide a natural explanation for selecting the logarithmic measure.
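The asymmetry of the traditional ratio and the symmetry of the logarithmic measure are easy to reproduce numerically; the following sketch (our own code, with names of our choosing) repeats the 0.8-versus-1 example:

```python
import math

def traditional_rel_err(a, b):
    # The traditional ratio |a - b| / |b|, where b is the approximated value.
    return abs(a - b) / abs(b)

def log_rel_err(a, b):
    # Gustafson's logarithmic measure |ln(a/b)|.
    return abs(math.log(a / b))

r1 = traditional_rel_err(0.8, 1.0)  # 0.8 approximating 1: 0.2
r2 = traditional_rel_err(1.0, 0.8)  # 1 approximating 0.8: 0.25
s1 = log_rel_err(0.8, 1.0)
s2 = log_rel_err(1.0, 0.8)          # equal to s1: the measure is symmetric
```

The mismatch between r1 and r2 is exactly the asymmetry described above, while s1 and s2 coincide.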
2 Why Logarithmic Measure: An Explanation

What we want. What we want is, in effect, a metric on the set ℝ⁺ of all positive real numbers, i.e., a function d(a, b) ≥ 0 for which:
• d(a, b) = 0 if and only if a = b;
• d(a, b) = d(b, a) for all a and b, and
• d(a, c) ≤ d(a, b) + d(b, c) for all a, b, and c.
We want this metric to be scale-invariant in the following precise sense:

Definition 1. We say that a metric d(a, b) on the set of all positive real numbers is scale-invariant if

d(λ · a, λ · b) = d(a, b)  (3)

for all λ > 0, a > 0, and b > 0.

It is also reasonable to require that the desired metric—just like the usual Euclidean metric—is uniquely generated by its local properties, in the sense that the distance between every two points is equal to the length of the shortest path connecting these points:

Definition 2. Let M be a metric space with metric d(a, b).
• By a path s from a point a ∈ M to a point b ∈ M, we mean a continuous mapping s : [0, 1] → M with s(0) = a and s(1) = b.
• We say that a path has length L if for every ε > 0 there exists a δ > 0 such that if we have a sequence t_0 = 0 < t_1 < … < t_{n−1} < t_n = 1 for which t_{i+1} − t_i ≤ δ for all i, then

|Σ_{i=0}^{n−1} d(s(t_i), s(t_{i+1})) − L| ≤ ε.
• We say that a metric d(a, b) is regular if every two points a, b ∈ M can be connected by a path of length d(a, b).

Proposition 1.
• Every scale-invariant regular metric on the set of all positive real numbers has the form d(a, b) = k · |ln(b/a)| for some k > 0.
• For every k > 0, the metric d(a, b) = k · |ln(b/a)| is regular and scale-invariant.

Comment. Thus, we have indeed justified the use of the logarithmic metric.

Proof. It is easy to see that the metric d(a, b) = k · |ln(b/a)| is regular and scale-invariant. Let us prove that, vice versa, every scale-invariant regular metric on the set of all positive numbers has this form.

Let us first note that on the shortest path, each point occurs only once: otherwise, if we had a point c repeated twice, we could cut out the part of the path that connects the first and second occurrences of this point, and thus get an even shorter path. On the set of all positive real numbers, the only path between two points a < b that does not contain repetitions is a continuous monotonic mapping of the interval [0, 1] onto the interval [a, b]. So, this is the shortest path.

On each shortest path between points a and c for which a < c, for each intermediate point b, we have d(a, c) = d(a, b) + d(b, c). Indeed, the sequence t_i contains a point t_j close to b; thus, each sum Σ d(s(t_i), s(t_{i+1})) is close to the sum of two subsums –
326
M. Ceberio et al.
before this point and after this point. When ε → 0, the first subsum – corresponding to the shortest path from a to b – tends to d(a, b), and the second subsum tends to d(b, c). Thus, in the limit, for all a < b < c, we indeed have

d(a, c) = d(a, b) + d(b, c).  (4)
By scale-invariance, for λ = 1/a, we have d(a, b) = d(1, b/a). For all x > 1, let us denote f(x) := d(1, x). In these terms, for a < b, we have d(a, b) = f(b/a), and thus the equality (4) takes the form f(c/a) = f(b/a) + f(c/b). In particular, for each x ≥ 1 and y ≥ 1, we can take a = 1, b = x, and c = x · y, and conclude that f(x · y) = f(x) + f(y). It is known (see, e.g., [1]) that the only non-negative non-zero solutions to this functional equation are f(x) = k · ln(x) for some k > 0. Thus, indeed, for a < b, we have d(a, b) = f(b/a) = k · ln(b/a). Since d(a, b) = d(b, a), for a > b, we get d(a, b) = d(b, a) = k · ln(a/b), i.e., exactly d(a, b) = k · |ln(b/a)|. The proposition is proven.
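The two key properties used in this proof, scale-invariance (3) and the additivity (4) along the ordered line, can be sanity-checked numerically for the metric d(a, b) = k · |ln(b/a)|; the sketch below (our own code) does this for k = 1:

```python
import math

def d(a, b, k=1.0):
    # The metric d(a, b) = k * |ln(b/a)| on the positive real numbers.
    return k * abs(math.log(b / a))

# Scale-invariance (3): d(lam * a, lam * b) = d(a, b).
scale_invariant = abs(d(3.0 * 2.0, 3.0 * 5.0) - d(2.0, 5.0)) < 1e-12

# Additivity (4): for a < b < c, d(a, c) = d(a, b) + d(b, c).
additive = abs(d(2.0, 7.0) - (d(2.0, 4.0) + d(4.0, 7.0))) < 1e-12
```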
3 Related Result

In the previous text, we considered situations in which we can select different measuring units. For some quantities, we can also select different starting points: e.g., when we measure time, we can start from any moment of time. For such quantities, if we select a new starting point which is a_0 moments earlier, then the numerical value corresponding to the same moment of time is shifted from a to a′ = a + a_0. In such cases, it is reasonable to consider shift-invariant metrics:

Definition 3. We say that a metric d(a, b) on the set of all real numbers is shift-invariant if

d(a + a_0, b + a_0) = d(a, b)  (5)

for all a, b, and a_0.

Proposition 2.
• Every shift-invariant regular metric on the set of all real numbers has the form d(a, b) = k · |a − b| for some k > 0.
• For every k > 0, the metric d(a, b) = k · |a − b| is regular and shift-invariant.

Comment. So, in this case, we get the usual description of the absolute approximation error.

Proof. It is easy to see that the metric d(a, b) = k · |a − b| is regular and shift-invariant. Let us prove that, vice versa, every shift-invariant regular metric on the set of all real numbers has this form.
Similarly to the proof of Proposition 1, we conclude that for all a < b < c, the equality (4) is satisfied. By shift-invariance, for a_0 = −a, we have d(a, b) = d(0, b − a). For all x > 0, let us denote g(x) := d(0, x). In these terms, for a < b, we have d(a, b) = g(b − a), and thus the equality (4) takes the form g(c − a) = g(b − a) + g(c − b). In particular, for each x > 0 and y > 0, we can take a = 0, b = x, and c = x + y, and conclude that g(x + y) = g(x) + g(y). It is known (see, e.g., [1]) that the only non-negative non-zero solutions to this functional equation are g(x) = k · x for some k > 0. Thus, indeed, for a < b, we have d(a, b) = g(b − a) = k · (b − a). Since d(a, b) = d(b, a), for a > b, we get d(a, b) = d(b, a) = k · (a − b), i.e., exactly d(a, b) = k · |a − b|. The proposition is proven.

Acknowledgements This work was supported by:
• the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes),
• the AT & T Fellowship in Information Technology,
• the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and
• a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References 1. J. Aczél, J. Dhombres, Functional Equations in Several Variables (Cambridge University Press, 2008) 2. J. Gustafson, Posting on Reliable Computing Mailing List, 17 Feb. 2022 3. S.G. Rabinovich, Measurement Errors and Uncertainties: Theory and Practice (Springer, New York, 2005)
Search Under Uncertainty Should be Randomized: A Lesson from the 2021 Nobel Prize in Medicine Martine Ceberio and Vladik Kreinovich
Abstract In many real-life situations, we know that one of several objects has the desired property, but we do not know which one. To find the desired object, we need to test these objects one by one. In situations when we have no additional information, there is no reason to prefer any testing order, and thus a usual recommendation is to test the objects in any order. This is usually interpreted as ordering the objects by the increasing value of some seemingly unrelated quantity. A possible drawback of this approach is that it may turn out that the selected quantity is correlated with the desired property, in which case we will need to test all the given objects before we find the desired one. This is not just an abstract possibility: this is exactly what happened in the research efforts that led to the 2021 Nobel Prize in Medicine. To avoid such situations, we propose to use randomized search. Such a search would have cut in half the multi-year time spent on this Nobel-Prize-winning research effort.
1 Formulation of the Problem

1.1 Search Under Uncertainty: A General Problem

In many practical situations, we strongly suspect that one of several objects satisfies the desired property, but we do not know which one—and we have no reason to assume that some of these objects are more likely than others to have the desired property. To find the desired object, we must therefore try all the objects one by one until we find the object that has the desired property. Once we have found the desired object, we can stop the search.

M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_53
In some situations, checking the desired property is very time- and/or resource-consuming. It is therefore desirable to minimize the number of such checks. So, a natural question is: in what order should we test the given objects?
1.2 What is Known

In this situation, we know the following (see, e.g., [1]):
• The best situation is when the very first object that we check has the desired property. In this case, we only need a single check.
• The worst situation is when the very last object that we check has the desired property—or even none of the tested objects has the desired property. In this situation, we need to perform n checks, where n is the number of objects.

It is also possible to compute the average number of checks. Indeed, since we have no reason to believe that some objects are more probable to satisfy the desired property, it is reasonable to assume that each object has the exact same probability of having the desired property. This idea—going back to Laplace—is known as the Laplace Indeterminacy Principle; see, e.g., [4]. Under this assumption:
• with probability 1/n, the first object has the desired property, so we will need only 1 check;
• with probability 1/n, the second object has the desired property, so we will need 2 checks;
• …, and
• with probability 1/n, the last object has the desired property, so we will need n checks.

The average number of checks is therefore equal to

(1/n) · (1 + 2 + · · · + n) = (1/n) · (n · (n + 1)/2) = (n + 1)/2.
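This average is easy to confirm both exactly and by simulation; the following sketch (our own code) checks that the mean of 1, …, n equals (n + 1)/2 and that a simulated search matches it:

```python
import random

n = 10

# Exact average of the number of checks: (1 + 2 + ... + n) / n.
exact_average = sum(range(1, n + 1)) / n  # equals (n + 1) / 2

# Simulation: the desired object is equally likely to be in any position
# of the testing order; we count how many checks are needed on average.
rng = random.Random(0)
trials = 100_000
total = sum(rng.randrange(n) + 1 for _ in range(trials))
simulated_average = total / trials
```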
1.3 The Usual (Seemingly Reasonable) Recommendation

We have no information about the objects; we do not know which objects are more probable to be the desired one and which are less probable. From this viewpoint, it seems like we cannot make any recommendation about the order of testing, so any order should be OK.
1.4 What We Do in This Paper

In this paper, we argue that this usual recommendation is somewhat misleading, and that in situations of uncertainty, it is better to use a randomized order. This is not just a theoretical conclusion—we show that following the new recommendation would have sped up the research that led to the 2021 Nobel Prize in Medicine.
2 Analysis of the Problem and the New Recommendation

2.1 How People Usually Interpret the "Any Order" Recommendation

When people read "any order", they usually follow some seemingly unrelated order—e.g., the alphabetic order of the objects' names, or the order given by the values of some seemingly unrelated quantity of the different objects.
2.2 Sometimes, This Works, but Sometimes, it Doesn't

If the quantity used in the ordering is really unrelated to the property that we want to test, then this interpretation works well. However, since we know nothing about the desired property, it could be that the quantity used in the ordering is actually correlated with the desired property. And if this correlation is positive, and we sort the objects in the increasing order of the selected quantity, then the desired object will be the last one—i.e., we arrive at the worst-case situation.

Of course, in this case, if we sort the objects in the decreasing order of the selected quantity, then we arrive at the best-case situation: we will pick the desired object right away, or at least almost right away. So maybe it makes sense to first try the first and the last objects in the selected order, then the second and the last but one, etc.? This would cover both the cases of positive and negative correlation, but what if the dependence of the desired quality on the selected quantity is quadratic, with the maximum at the midpoint value? In this case, we again arrive at the worst-case situation. How can we avoid this?
2.3 Natural Idea

Since ordering by a fixed quantity may lead to the worst-case situation, a natural idea is not to use any deterministic approach, not to use any quantity to sort the objects, but to use a random order. In other words:
• as the first object to test, we select any of the n given objects, with equal probability 1/n;
• if the first selected object does not satisfy the desired property, then, for the second check, we select any of the remaining n − 1 objects with equal probability 1/(n − 1);
• if the second selected object does not satisfy the desired property either, then, for the third check, we select any of the remaining n − 2 objects with equal probability 1/(n − 2), etc.
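Selecting the objects one by one with equal probabilities, as described above, is equivalent to testing them in a uniformly random permutation; the sketch below (our own code, with hypothetical names) implements this randomized search:

```python
import random

def randomized_search(objects, has_property, rng):
    # Testing in a uniformly random permutation is equivalent to picking
    # each next object uniformly among the ones not yet tested.
    order = list(objects)
    rng.shuffle(order)
    for checks, obj in enumerate(order, start=1):
        if has_property(obj):
            return obj, checks
    return None, len(order)

rng = random.Random(1)
found, num_checks = randomized_search(range(5), lambda x: x == 3, rng)
```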
2.4 This Idea is, in Effect, Well Known

While data randomization is not always performed (and, as promised, we will show a Nobel-Prize-related example when it was not), the need for randomization in data processing is, in general, well understood. For example, in machine learning, the usual practice is to shuffle the data set before carrying out cross-validation or a partition into training and test data—unless it is already known that the data come in a random order.
2.5 How Many Checks Will We Need if We Follow This Recommendation

Under such a random order, the average number of checks is equal to (n + 1)/2—this follows from the same argument as before. It is possible that we will get into the worst-case situation, but the probability of this happening is the probability that the desired object will be the last in the random order—i.e., 1/n. For large n, this probability is very small.
2.6 But is All This Really Important?

Theoretically, the new recommendation is reasonable, but a natural question arises: how important is the difference between this new recommendation and the usual one?
In the next section, we show that this difference is not purely theoretical: namely, we show that following the new recommendation would have sped up the research that led to the 2021 Nobel Prize in Medicine.
3 Case Study

3.1 Research that Led to the Nobel Prize

The 2021 Nobel Prize in Physiology or Medicine was awarded to Dr. David Julius and Dr. Ardem Patapoutian, who discovered receptors for temperature and touch; see, e.g., [2, 3, 5, 6, 8]. In this research, they first identified 72 genes as potential sensors of mechanical force. To find out which of these genes is a receptor for touch, they silenced each of these genes one by one and checked whether the cells would still react to poking. Interestingly, when they silenced each of the first 71 genes in their ordering, the cell was still reacting to touch. It was only when they silenced ("switched off") the last, 72nd gene that they saw that the reaction to poking disappeared—which showed that this gene was a receptor for touch. It is not easy to switch off a gene, so this research took several years.
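The situation just described can be re-created schematically (this is our own illustration, not the actual experimental data): with 72 candidates and the desired one last in the deterministic order, the deterministic search needs all 72 checks, while a randomized order needs about (72 + 1)/2 = 36.5 checks on average:

```python
import random

n = 72
desired = n - 1  # the desired gene happens to be last in the fixed order

# Deterministic search in the order 0, 1, ..., n-1: the worst case.
deterministic_checks = desired + 1

# Randomized search: average number of checks over many random orders.
rng = random.Random(0)
trials = 20_000
total = 0
for _ in range(trials):
    order = list(range(n))
    rng.shuffle(order)
    total += order.index(desired) + 1
randomized_average = total / trials  # close to (n + 1) / 2 = 36.5
```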
3.2 How This Research Could be Sped Up

Clearly, in this study, the researchers hit what we described as the worst-case situation, when the quantity selected for ordering was correlated with the desired property. If, instead of selecting a deterministic order, they had used a random order, they would, on average, have cut the number of tests—and thus, the research duration—in half, with the probability of encountering the worst-case situation as small as 1/72 ≈ 1.4%.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT & T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to the anonymous referees for valuable suggestions.
References
1. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press, Cambridge, MA, 2009)
2. B. Coste, J. Mathur, M. Schmidt, T.J. Earley, S. Ranade, M.J. Petrus, A.E. Dubin, A. Patapoutian, Piezo1 and Piezo2 are essential components of distinct mechanically activated cation channels. Science 330, 55–60 (2010)
3. Howard Hughes Medical Institute (HHMI): David Julius and Ardem Patapoutian Awarded the 2021 Nobel Prize in Physiology or Medicine. https://www.hhmi.org/news/david-julius-and-ardem-patapoutian-awarded-2021-nobel-prize-physiology-or-medicine. Accessed 4 Oct. 2021
4. E.T. Jaynes, G.L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, UK, 2003)
5. The Nobel Prize Committee, Press release: The Nobel Prize in Physiology or Medicine 2021. https://www.nobelprize.org/prizes/medicine/2021/press-release/. Accessed 4 Oct. 2021
6. S.S. Ranade, S.H. Woo, A.E. Dubin, R.A. Moshourab, C. Wetzel, M. Petrus, J. Mathur, V. Bégay, B. Coste, J. Mainquist, A.J. Wilson, A.G. Francisco, K. Reddy, Z. Qiu, J.N. Wood, G.R. Lewin, A. Patapoutian, Piezo2 is the major transducer of mechanical forces for touch sensation in mice. Nature 516, 121–125 (2014)
7. D.J. Sheskin, Handbook of Parametric and Non-Parametric Statistical Procedures (Chapman & Hall/CRC, London, UK, 2011)
8. S.H. Woo, V. Lukacs, J.C. de Nooij, D. Zaytseva, C.R. Criddle, A. Francisco, T.M. Jessell, K.A. Wilkinson, A. Patapoutian, Piezo2 is the principal mechanotransduction channel for proprioception. Nat. Neurosci. 18, 1756–1762 (2015)
Why Convex Combination is an Effective Crossover Operation in Continuous Optimization: A Theoretical Explanation Kelly Cohen, Olga Kosheleva, and Vladik Kreinovich
Abstract When evolutionary computation techniques are used to solve continuous optimization problems, usually, convex combination is used as a crossover operation. Empirically, this crossover operation works well, but this success is, from the foundational viewpoint, a challenge: why this crossover operation works well is not clear. In this paper, we provide a theoretical explanation for this empirical success.
1 Formulation of the Problem

Biological evolution has successfully managed to come up with creatures that fit well into different complex environments, i.e., creatures for each of which the value of its fitness is close to optimal. So, when we want to optimize a complex objective function, a natural idea is to emulate biological evolution. This idea is known as evolutionary computation or genetic programming. Each living creature is described by its DNA, i.e., in effect, by a discrete (4-ary) code. In other words, each creature is characterized by a sequence x1 x2 . . ., where each xi is an element of a small discrete set (a 4-element set for biological evolution). The environment determines how fit a creature is: the higher the fitness level, the higher the probability that this creature will survive—and also the higher the probability that it will be able to mate and give birth to offspring. During a creature's lifetime, the code is slightly changed by mutations—when some of the values xi change with a small probability.

K. Cohen Department of Aerospace Engineering and Engineering Mechanics, College of Engineering and Applied Science, University of Cincinnati, 2850 Campus Way, Cincinnati, OH 45221-0070, USA e-mail: [email protected] O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_54
The most drastic changes happen when a new creature appears: its code is obtained from the codes of the two parents by a special procedure called crossover. For biological creatures, crossover is organized as follows: the two DNA sequences are placed together, then a few random locations i1 < i2 < . . . are marked on both sequences, and the resulting DNA is obtained by taking the segment of the 1st parent until the first location, then the segment of the second parent between the 1st and 2nd locations, then again the segment from the 1st parent, etc. In this scheme, for each location i, the child's value ci at this location is equal:
• either to the value fi of the first parent at this location,
• or to the value si of the second parent at this location.
To be more precise:
• for i < i1, we take ci = fi;
• for i1 ≤ i < i2, we take ci = si;
• for i2 ≤ i < i3, we take ci = fi, etc.
This is how crossover is performed in nature, and this is how it is performed in discrete optimization problems, where each possible option is represented as a sequence of discrete units—e.g., 0-or-1 bits. Similar techniques have been proposed—and successfully used—for continuous optimization problems, in which each possible alternative is represented by a sequence x1 x2 . . . of real numbers. In this arrangement—known as geometric genetic programming—mutations correspond to small changes of the corresponding values, and crossover is usually implemented as a convex combination

ci = α · fi + (1 − α) · si   (1)

for some real value α—which may be different for different locations i; see, e.g., [1–4]. Empirically, the crossover operation (1) works really well, but why it works well is not clear. In this paper, we provide a theoretical explanation for its efficiency.
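As an illustration, here is a minimal sketch of the convex-combination crossover (1), applied coordinate by coordinate with a separate α for each location i. This is our own illustrative code, not taken from the implementations cited in [1–4].

```python
import random

def convex_crossover(parent_f, parent_s):
    """Crossover (1): c_i = alpha * f_i + (1 - alpha) * s_i, with alpha drawn
    separately (here, uniformly from [0, 1]) for each location i."""
    child = []
    for f_i, s_i in zip(parent_f, parent_s):
        alpha = random.uniform(0.0, 1.0)
        child.append(alpha * f_i + (1 - alpha) * s_i)
    return child

random.seed(1)
f = [2.0, 10.0, -3.0]   # first parent
s = [4.0, 6.0, 5.0]     # second parent
c = convex_crossover(f, s)
# With alpha in [0, 1], every child coordinate lies between the parents' values.
for f_i, s_i, c_i in zip(f, s, c):
    assert min(f_i, s_i) <= c_i <= max(f_i, s_i)
```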
2 Our Explanation

Usually, the numbers xi that characterize an alternative are values of the corresponding physical quantities. For example, if we are designing a car, the values xi can be lengths, heights, weights, etc. of its different components. The numerical value of each physical quantity depends on the choice of a measuring unit and, for some properties like temperature or time, on the choice of the starting point. If we replace the original measuring unit by a new unit which is a times smaller, then all the numerical values are multiplied by a: x → a · x. For example, if we replace meters with a 100 times smaller unit—centimeters—then 2 m becomes 100 · 2 = 200 cm. For some quantities like electric current, the sign is also arbitrary, so we can change x to −x. If we replace the original starting point with a new starting point which is b
units earlier, then this value b is added to all the numerical values: x → x + b. If we change both the measuring unit and the starting point, then we get a generic linear transformation x → a · x + b. If we change the measuring unit and the starting point, then the numerical values change, but the actual values of the physical quantities do not change. Thus, a reasonable crossover operation should not change if we simply change the measuring unit and the starting point: it does not make sense to use one optimization algorithm if we have meters and another one if we use feet or centimeters. It turns out that this natural requirement explains the appearance of the convex combination crossover operation (1). Let us describe this result in precise terms.

Definition 1. By a crossover operation, we mean a real-valued function of two real-valued inputs F : ℝ × ℝ → ℝ.

Definition 2. We say that a crossover operation is scale-invariant if for all a ≠ 0 and b, whenever we have c = F(f, s), we will also have c′ = F(f′, s′), where c′ = a · c + b, f′ = a · f + b, and s′ = a · s + b.

Proposition 1. A crossover operation is scale-invariant if and only if it has the form F(f, s) = α · f + (1 − α) · s for some α.

Discussion. This result explains why crossover (1) is effective: it is the only crossover operation that satisfies the reasonable invariance requirements.

Comment. Usually, only values α from the interval [0, 1] are used. However, Proposition 1 allows values α outside this interval. So maybe such values α < 0 or α > 1 can be useful in some optimization applications?

Proof. It is easy to check that the crossover operation (1) is scale-invariant. Let us prove that, vice versa, every scale-invariant crossover operation has the form (1).
Let us denote α = F(1, 0). By scale-invariance, for each a ≠ 0 and b, we have a · α + b = F(a · 1 + b, a · 0 + b). For each two different numbers f ≠ s, we can take a = f − s and b = s. In this case, a · 1 + b = f − s + s = f, a · 0 + b = s, and a · α + b = (f − s) · α + s = α · f + (1 − α) · s, and thus, scale-invariance implies that α · f + (1 − α) · s = F(f, s). The desired equality is thus proven for all the cases when f ≠ s. So, to complete the proof, it is sufficient to prove this equality for the case when f = s. In this case, scale-invariance means that if c = F(f, f), then for every a ≠ 0 and b, we should have a · c + b = F(a · f + b, a · f + b). Let us take a = 2 and b = −f. In this case, a · f + b = 2 · f + (−f) = f, so we get 2c − f = F(f, f). We already know that c = F(f, f), thus 2c − f = c and therefore, c = f. In this case, the right-hand side of the desired equality is equal to α · f + (1 − α) · f = f, so the desired equality is also true. The proposition is proven.
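The property in Definition 2 is also easy to check numerically. The following sketch (our own illustration; the particular non-linear counterexample is an arbitrary choice) confirms that the convex combination commutes with every re-scaling x → a · x + b, while a non-linear combination, in general, does not:

```python
def convex(f, s, alpha=0.3):
    # Crossover (1) with a fixed alpha.
    return alpha * f + (1 - alpha) * s

def rescale(x, a, b):
    # Change of measuring unit and starting point: x -> a * x + b.
    return a * x + b

# Scale invariance (Definition 2): c = F(f, s) implies a*c + b = F(a*f + b, a*s + b).
for f, s in [(2.0, 5.0), (-1.0, 4.0), (3.5, 3.5)]:
    for a, b in [(2.0, 1.0), (-3.0, 0.5), (0.1, -7.0)]:
        lhs = rescale(convex(f, s), a, b)
        rhs = convex(rescale(f, a, b), rescale(s, a, b))
        assert abs(lhs - rhs) < 1e-9

# By contrast, a non-linear combination such as F(f, s) = (f**2 + s**2) / (f + s)
# is not scale-invariant:
def nonlinear(f, s):
    return (f ** 2 + s ** 2) / (f + s)

f, s, a, b = 2.0, 5.0, 2.0, 1.0
assert abs(rescale(nonlinear(f, s), a, b)
           - nonlinear(rescale(f, a, b), rescale(s, a, b))) > 1e-6
```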
3 An Alternative Explanation

In the previous section, we proved that the currently used crossover operation is the only one that satisfies some reasonable requirement. In this section, we will show that this operation is also optimal in some reasonable sense. To explain this result, let us first recall what optimal means. In many cases, optimal means attaining the largest or the smallest value of some objective function G(x). In this case, the optimality criterion is straightforward: an alternative x is better than the alternative y if—for the case of minimization—we have G(x) < G(y). For example, we can select the crossover operation for which the average degree of sub-optimality is the smallest possible, i.e., for which the results are, on average, the closest to the desired optimum of the fitness function. However, not all optimal situations are like that. For example, it may turn out that several different crossover operations lead to the same average degree of sub-optimality. In this case, we can use this non-uniqueness to optimize some other function H(x): e.g., the worst-case degree of sub-optimality. In this case, the optimality criterion is more complex than simply comparing the values of a single function—here, x is better than y if:
• either G(x) < G(y),
• or G(x) = G(y) and H(x) < H(y).
If this still leaves us with several equally optimal alternatives, this means that our optimality criterion is not final—we can again use the non-uniqueness to optimize something else. For a final optimality criterion, there should be only one optimal alternative. What all the resulting complex optimality criteria have in common is that they allow, for at least some pairs of alternatives a and b, to decide:
• whether a is better according to this criterion—we will denote it by a > b,
• or whether b is better,
• or whether these two alternatives are equally good; we will denote it by a ∼ b.
Of course, these decisions must be consistent: e.g., if a is better than b and b is better than c, then a should be better than c. Thus, we arrive at the following definition.

Definition 3. Let A be a set. Its elements will be called alternatives.
• By an optimality criterion on the set A, we mean a pair of binary relations > and ∼ that satisfy the following conditions for all a, b, and c:
 – if a > b and b > c, then a > c;
 – if a > b and b ∼ c, then a > c;
 – if a ∼ b and b > c, then a > c;
 – if a ∼ b and b ∼ c, then a ∼ c;
 – if a > b, then we cannot have a ∼ b;
 – we have a ∼ a.
• We say that an alternative aopt is optimal if for every a ∈ A, either aopt > a or aopt ∼ a.
• We say that the optimality criterion is final if there exists exactly one optimal alternative.

In our case, alternatives are crossover operations. It is reasonable to require that which crossover operation is better should not depend on re-scaling the corresponding quantity. Let us describe this natural requirement in precise terms. Let us denote the corresponding re-scaling by Ta,b(x) = a · x + b. If we apply a crossover operation F(f, s) in a new scale, this means that in the new scale, we get the value c′ = F(f′, s′) = F(Ta,b(f), Ta,b(s)). If we transform this value into the original scale, i.e., apply the inverse operation Ta,b⁻¹(c′) = a⁻¹ · (c′ − b), then we get c = Ta,b⁻¹(F(Ta,b(f), Ta,b(s))). Thus, the use of the crossover operation F in the new scale is equivalent to applying, in the original scale, a crossover operation Ta,b(F) defined as (Ta,b(F))(f, s) = Ta,b⁻¹(F(Ta,b(f), Ta,b(s))). In these terms, invariance means that if, e.g., F > F′, then we should have Ta,b(F) > Ta,b(F′), etc.

Definition 4.
• For every two numbers a ≠ 0 and b, let us denote Ta,b(x) = a · x + b and Ta,b⁻¹(x) = a⁻¹ · (x − b).
• For each a ≠ 0 and b, and for each crossover operation F(f, s), by Ta,b(F) we will denote the crossover operation defined as

(Ta,b(F))(f, s) = Ta,b⁻¹(F(Ta,b(f), Ta,b(s))).

• We say that an optimality criterion on the set of all crossover operations is scale-invariant if for all crossover operations F and F′ and for all a ≠ 0 and b, F > F′ implies that Ta,b(F) > Ta,b(F′) and F ∼ F′ implies that Ta,b(F) ∼ Ta,b(F′).

Proposition 2. For every scale-invariant final optimality criterion on the set of all crossover operations, the optimal operation has the form F(f, s) = α · f + (1 − α) · s for some α.

Proof. Let Fopt be the optimal crossover operation. This means that for every other crossover operation F, we have Fopt > F or Fopt ∼ F. In particular, this is true for the operations Ta,b⁻¹(F), i.e., we have either Fopt > Ta,b⁻¹(F) or Fopt ∼ Ta,b⁻¹(F). Due to the fact that the optimality criterion is scale-invariant, we get either Ta,b(Fopt) > F or Ta,b(Fopt) ∼ F. This is true for all crossover operations F. Thus, by the definition of an optimal alternative, the operation Ta,b(Fopt) is optimal. However, we already know that the operation Fopt is optimal, and we assumed that the optimality criterion is final,
which means that there is only one optimal alternative. Thus, we have Ta,b(Fopt) = Fopt. This equality holds for all a ≠ 0 and b. Thus, the crossover operation Fopt is scale-invariant. For such operations, we have already proved, in Proposition 1, that they have the desired form. The proposition is proven.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References
1. M. Castelli, L. Manzoni, GSGP-C++ 2.0: a geometric semantic genetic programming framework. SoftwareX 10, paper 100313 (2019)
2. K. Krawiec, T. Pawlak, Locally geometric semantic crossover: a study on the roles of semantics and homology in recombination operators. Gen. Program. Evol. Mach. 14, 31–63 (2013)
3. A. Moraglio, K. Krawiec, C.G. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature PPSN XII (Springer, Heidelberg, Germany, 2012), pp. 21–31
4. T.P. Pawlak, B. Wieloch, K. Krawiec, Review and comparative analysis of geometric semantic crossovers. Gen. Program. Evol. Mach. 16, 351–386 (2015)
Why Optimization Is Faster Than Solving Systems of Equations: A Qualitative Explanation Siyu Deng, K. C. Bimal, and Vladik Kreinovich
Abstract Most practical problems lead either to solving a system of equations or to optimization. From the computational viewpoint, both classes of problems can be reduced to each other: optimization can be reduced to finding the points at which all partial derivatives are zeros, and solving systems of equations can be reduced to minimizing sums of squares. It is therefore natural to expect that, on average, both classes of problems have the same computational complexity—i.e., require about the same computation time. However, empirically, optimization problems are much faster to solve. In this paper, we provide a possible explanation for this unexpected empirical phenomenon.
1 Formulation of the Problem

Practical problems: a general description. In many practical situations, we need to make a decision:
• decide what controls to apply,
• decide which proportion of money to invest in stocks and in bonds,
• decide the proper dose of a medicine, etc.

What are the possible options. Let us describe such problems in precise terms. Let x1, . . . , xn be parameters describing possible decisions.

S. Deng · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] S. Deng e-mail: [email protected] K. C. Bimal Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_55
We usually have some constraints on the values of these parameters. Many of these constraints are equalities: fi(x1, . . . , xn) = 0 for some functions fi(x1, . . . , xn), 1 ≤ i ≤ m.

Sometimes, there is only one possible solution. In some practical situations, there are so many constraints that these constraints uniquely determine the values xi. In this case, to find xi, we need to solve the corresponding system of equations f1(x1, . . . , xn) = 0, . . . , fm(x1, . . . , xn) = 0.

Sometimes, there are many possible solutions. In other situations, constraints do not determine the solution uniquely. In this case, we must select the best of the possible solutions. Usually, the quality of a possible solution x1, . . . , xn can be described in numerical terms, as f(x1, . . . , xn). The corresponding function f(x1, . . . , xn) is known as the objective function. Thus, we must select the possible solution that maximizes the value of the objective function. In other words, we need to solve an optimization problem.

Unconstrained versus constrained optimization. In some cases, we do not have any constraints. In such cases, we need to find the values xi that maximize the given function f(x1, . . . , xn). In other practical situations, we need to optimize the objective function f(x1, . . . , xn) under given constraints f1(x1, . . . , xn) = 0, . . . , fm(x1, . . . , xn) = 0. The Lagrange multiplier method reduces this problem to the unconstrained optimization of the auxiliary function

F(x1, . . . , xn) = f(x1, . . . , xn) + Σ_{i=1}^{m} λi · fi(x1, . . . , xn).
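To make the Lagrange-multiplier reduction concrete, consider a small worked example (our own illustration): maximize f(x, y) = x + y under the constraint x² + y² − 1 = 0. Equating the partial derivatives of the auxiliary function F = f + λ · (x² + y² − 1) to zero gives x = y, and the constraint then yields x = y = 1/√2. The sketch below cross-checks this by brute force over the constraint set:

```python
import math

# Lagrange conditions for F = x + y + lam * (x**2 + y**2 - 1):
#   dF/dx = 1 + 2 * lam * x = 0  and  dF/dy = 1 + 2 * lam * y = 0,
# hence x = y, and the constraint x**2 + y**2 = 1 gives x = y = 1 / sqrt(2).
x_star = y_star = 1 / math.sqrt(2)

# Cross-check: evaluate f on a fine parametrization of the constraint circle.
best_val, best_x, best_y = max(
    (math.cos(t) + math.sin(t), math.cos(t), math.sin(t))
    for t in (2 * math.pi * k / 100000 for k in range(100000))
)
assert abs(best_x - x_star) < 1e-3 and abs(best_y - y_star) < 1e-3
assert abs(best_val - math.sqrt(2)) < 1e-6
```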
Summarizing. Practical problems lead to two types of mathematical problems. • Some problems lead to solving systems of equations: f 1 (x1 , . . . , xn ) = 0, . . . , f m (x1 , . . . , xn ) = 0. • Some problems lead to finding the values x1 , . . . , xn for which the given function F(x1 , . . . , xn ) attains its largest value.
Which of these two types of problems is, in general, easier to solve? It is known that these two problems can be reduced to each other.
• Optimization F → max can be reduced to solving the system of equations obtained by equating all partial derivatives to 0: ∂F/∂xi = 0.
• Solving a system of equations f1(x1, . . . , xn) = 0, . . . , fm(x1, . . . , xn) = 0 is equivalent to minimizing the sum

Σ_{i=1}^{m} (fi(x1, . . . , xn))².
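The second reduction can be illustrated on a toy example (our own sketch; the system and the hand-computed gradient are purely illustrative): to solve x + y − 3 = 0 and x − y − 1 = 0, we minimize the sum of squares of the residuals by gradient descent.

```python
def residuals(x, y):
    # System of equations: f1 = x + y - 3 = 0, f2 = x - y - 1 = 0
    # (its unique solution is x = 2, y = 1).
    return x + y - 3.0, x - y - 1.0

def sum_of_squares(x, y):
    f1, f2 = residuals(x, y)
    return f1 ** 2 + f2 ** 2

def gradient(x, y):
    # Gradient of the sum of squares, computed by hand for this small system.
    f1, f2 = residuals(x, y)
    return 2 * f1 + 2 * f2, 2 * f1 - 2 * f2

x, y, step = 0.0, 0.0, 0.1
for _ in range(200):
    gx, gy = gradient(x, y)
    x, y = x - step * gx, y - step * gy

# The minimizer of the sum of squares is exactly the solution of the system.
assert abs(x - 2.0) < 1e-6 and abs(y - 1.0) < 1e-6
assert sum_of_squares(x, y) < 1e-12
```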
Because of this possible mutual reduction, one would expect their computational complexity to be comparable.

Surprising empirical fact and the resulting challenge. In spite of this expectation, empirically, in general, optimization problems are faster to solve; see, e.g., [1]. Thus, we have the following challenge: we have two types of problems, solving systems of equations and optimization. Because of the possible mutual reduction, one would expect their computational complexity to be comparable. However, empirically, in general, optimization problems are faster to solve. How can we explain this unexpected empirical fact?
2 Possible Explanation

General observation about relative computation time. To provide an explanation, let us recall cases when one class of problems is computationally easier than another. In each computational problem, there are one or more inputs and desired outputs. The output of computations is usually uniquely determined by the inputs. In mathematical terms, this means that the output is a function of the inputs. In this sense, every computation is a computation of the value of an appropriate function. In general:
• functions of two variables take more time to compute than functions of one variable,
• functions of three variables take more time to compute than functions of two variables, etc.
In other words, the more inputs we have, the more computation time the problem requires.
Let us apply this general observation to our problem. For both the optimization problem and the problem of solving a system of equations, the inputs are functions. The difference is in how many functions form the input.
• To describe an optimization problem, we need to describe only one function F(x1, . . . , xn); this function is to be maximized.
• On the other hand, to describe a system of m equations with n unknowns, we need to describe m functions f1(x1, . . . , xn), . . . , fm(x1, . . . , xn).
So:
• The input to an optimization problem is a single function.
• The input to a solving-a-system-of-equations problem consists of several functions.
Thus, solving systems of equations requires more inputs than optimization. So, not surprisingly, optimization problems are, in general, faster to solve. This explains the above empirical fact.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences (Las Cruces, New Mexico, April 2, 2022) for valuable suggestions.
Reference 1. R.L. Muhanna, R.L. Mullen, M.V. Rama Rao, Nonlinear interval finite elements for beams, in Proceedings of the Second International Conference on Vulnerability and Risk Analysis and Management (ICVRAM) and the Sixth International Symposium on Uncertainty, Modeling, and Analysis (ISUMA), Liverpool, UK, 13–16 Jul. 2014
Estimating Skewness and Higher Central Moments of an Interval-Valued Fuzzy Set Juan Carlos Figueroa Garcia, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich
Abstract A known relation between membership functions and probability density functions allows us to naturally extend statistical characteristics like central moments to the fuzzy case. In case of interval-valued fuzzy sets, we have several possible membership functions consistent with our knowledge. For different membership functions, in general, we have different values of the central moments. It is therefore desirable to compute, in the interval-valued fuzzy case, the range of possible values for each such moment. In this paper, we provide efficient algorithms for this computation.
1 Outline

From the purely mathematical viewpoint, the main difference between a membership function μ(x) (see, e.g., [1–5, 7]) and a probability density function f(x) is that they are normalized differently:
• for μ(x), we require that its maximum is equal to 1, while
• for f(x), we require that its integral is equal to 1.
Both functions can be re-normalized, so there is a natural probability density function assigned to each membership function, and vice versa. These assignments allow us to

J. C. F. Garcia Departamento de Ingenieria Industrial, Universidad Distrital, Bogota, Colombia e-mail: [email protected] M. Ceberio · O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_56
naturally extend probabilistic notions to membership functions, and to define the mean (which turns out to be equivalent to the centroid), skewness, etc. These transformations implicitly assume that for each x, we can extract the exact value μ(x) from the expert. In practice, the experts can, at best, provide an interval [μ̲(x), μ̄(x)] of possible values of μ(x). This case—known as the interval-valued fuzzy case—means that we have many different possible membership functions μ(x): all the functions for which μ(x) ∈ [μ̲(x), μ̄(x)] for all x. For different possible membership functions, we have different values of skewness and of other characteristics. It is therefore desirable to estimate the range of possible values of such a characteristic. In this paper, we provide efficient algorithms for such an estimation. The structure of this paper is as follows. In Sect. 2, we formulate the problem in precise terms. In Sect. 3, we analyze the resulting computational problem for the case of skewness. In Sect. 4, we use this analysis to describe the resulting algorithm. In Sect. 5, we extend this algorithm to all central moments.
2 Formulation of the Problem

Known relation between probabilistic and fuzzy uncertainty. As Zadeh himself, the father of fuzzy logic, emphasized several times, the same data can be described by a probability density function f(x) and by a membership function μ(x). The main difference between these two descriptions is in the normalization:
• for a probability density function, the normalizing requirement is that the overall probability should be equal to 1:

∫ f(x) dx = 1,   (1)

• while for a membership function, the normalizing requirement is that the largest value of this function must be equal to 1:

max_x μ(x) = 1.   (2)

Because of this relation, it is possible to re-normalize each of these functions by multiplying it by an appropriate constant. For example, to each membership function μ(x), we can assign the corresponding probability density function:

f(x) = μ(x) / ∫ μ(y) dy.   (3)

This allows us to naturally extend probabilistic characteristics to the fuzzy case. The transformation (3) enables us to naturally extend known characteristics of a
probability distribution—such as its central moments (see, e.g., [6])—to the fuzzy case. The first moment—the mean value—is defined as:

M1 = ∫ x · f(x) dx.   (4)

Substituting the expression (3) for f(x) into this formula, we get the corresponding fuzzy expression:

M1 = (∫ x · μ(x) dx) / (∫ μ(x) dx).   (5)

Interestingly, this is exactly what is called centroid defuzzification. Similarly, for any natural number n ≥ 2, we can plug the expression (3) into the formula for the n-th order central moment:

Mn = ∫ (x − M1)ⁿ · f(x) dx,   (6)

and get the corresponding fuzzy expression:

Mn = (∫ (x − M1)ⁿ · μ(x) dx) / (∫ μ(x) dx).   (7)
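Formulas (5) and (7) are straightforward to evaluate numerically once the membership function is sampled on a grid. The sketch below is our own illustration (triangular membership functions are used only as a convenient test case); it computes the centroid M1 and the skewness M3 by the midpoint rectangle rule:

```python
def triangular(a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if a < x <= b:
            return (x - a) / (b - a)
        if b < x < c:
            return (c - x) / (c - b)
        return 0.0
    return mu

def central_moment(mu, n, lo, hi, steps=20000):
    """Fuzzy mean (5) for n = 1, and the n-th central moment (7) for n >= 2,
    computed by the midpoint rectangle rule."""
    dx = (hi - lo) / steps
    xs = [lo + (i + 0.5) * dx for i in range(steps)]
    norm = sum(mu(x) for x in xs) * dx
    m1 = sum(x * mu(x) for x in xs) * dx / norm
    if n == 1:
        return m1
    return sum((x - m1) ** n * mu(x) for x in xs) * dx / norm

mu_sym = triangular(0.0, 1.0, 2.0)     # symmetric: peak in the middle
assert abs(central_moment(mu_sym, 1, 0.0, 2.0) - 1.0) < 1e-6  # centroid
assert abs(central_moment(mu_sym, 3, 0.0, 2.0)) < 1e-6        # zero skewness

mu_skewed = triangular(0.0, 0.5, 2.0)  # peak shifted left: longer right tail
assert central_moment(mu_skewed, 3, 0.0, 2.0) > 0             # positive skewness
```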
Need for interval-valued fuzzy. The traditional [0, 1]-based fuzzy approach implicitly assumes that for each x, we can extract the exact value μ(x) from the expert. In practice, the experts cannot very accurately gauge their degrees of confidence. At best, they provide an interval [μ̲(x), μ̄(x)] of possible values of μ(x). This case—known as the interval-valued fuzzy case—means that we have many different possible membership functions μ(x): all the functions for which μ(x) ∈ [μ̲(x), μ̄(x)] for all x. For different possible membership functions, we have different values of the skewness M3 and of other central moments Mn. It is therefore desirable to estimate the range of possible values of each of these characteristics. In this paper, we provide efficient algorithms for such an estimation.
3 Case of Skewness: Analysis of the Problem

Case of skewness: explicit formula. Let us start with the case when n = 3. In this case, the formula (7) takes the form

M3 = (∫ (x − M1)³ · μ(x) dx) / (∫ μ(x) dx).   (8)
Here,

(x − M1)³ = x³ − 3 · x² · M1 + 3 · x · M1² − M1³,   (9)

thus the expression (8) takes the form
M3 = (∫ x³ · μ(x) dx) / (∫ μ(x) dx) − 3 · M1 · (∫ x² · μ(x) dx) / (∫ μ(x) dx) + 3 · M1² · (∫ x · μ(x) dx) / (∫ μ(x) dx) − M1³.   (10)

Substituting the formula (5) for M1 into this expression, we get

M3 = (∫ x³ · μ(x) dx) / (∫ μ(x) dx) − 3 · (∫ x² · μ(x) dx · ∫ x · μ(x) dx) / (∫ μ(x) dx)² + 3 · (∫ x · μ(x) dx)³ / (∫ μ(x) dx)³ − (∫ x · μ(x) dx)³ / (∫ μ(x) dx)³

= (∫ x³ · μ(x) dx) / (∫ μ(x) dx) − 3 · (∫ x² · μ(x) dx · ∫ x · μ(x) dx) / (∫ μ(x) dx)² + 2 · (∫ x · μ(x) dx)³ / (∫ μ(x) dx)³.   (11)

Basic facts from calculus: reminder. In general, according to calculus, a function F(v) attains its maximum with respect to an input v ∈ [v̲, v̄] in one of three cases:
• It can be that this maximum is attained inside the interval; in this case, at this point, the derivative F′(v) of the maximized function is equal to 0.
• It can be that this maximum is attained at the lower endpoint v̲ of the given interval. In this case, the derivative F′(v̲) at this point has to be non-positive: otherwise, if this derivative was positive, then for a sufficiently small ε > 0, at the point v̲ + ε we would have larger values of F(v)—which contradicts our assumption that the largest value of the function F(v) is attained for v = v̲.
• It can also be that this maximum is attained at the upper endpoint v̄ of the given interval. In this case, the derivative F′(v̄) at this point has to be non-negative: otherwise, if this derivative was negative, then for a sufficiently small ε > 0, at the point v̄ − ε we would have larger values of F(v)—which contradicts our assumption that the largest value of the function F(v) is attained for v = v̄.

Similarly, a function F(v) attains its minimum with respect to an input v ∈ [v̲, v̄] in one of three cases:
• It can be that this minimum is attained inside the interval; in this case, at this point, the derivative F′(v) of the minimized function is equal to 0.
• It can be that this minimum is attained at the lower endpoint v̲ of the given interval. In this case, the derivative F′(v̲) at this point has to be non-negative: otherwise,
if this derivative was negative, then for a sufficiently small ε > 0, at a point v + ε we would have smaller values of F(v)—which contradicts to our assumption that the smallest value of the function F(v) is attained for v = v. • It can also be that this minimum is attained at the upper endpoint v of the given interval. In this case, the derivative F (v) at this point has to be non-positive: otherwise, if this derivative was positive, then for a sufficiently small ε > 0, at a point v − ε we would have smaller values of F(v)—which contradicts to our assumption that the smallest value of the function F(v) is attained for v = v. Let us apply this idea to our case. In our case, unknown are values μ(x). For each expression def
Ik = ∫ xᵏ · μ(x) dx, (12)
the derivative with respect to μ(x) is equal to

∂Ik/∂μ(x) = xᵏ. (13)
Thus, for the expression (11)—which, in terms of the notations Ik, has the form

M3 = I3/I0 − 3 · (I2 · I1)/I0² + 2 · I1³/I0³, (14)
the derivative d(x) with respect to μ(x) takes the following form:

d(x) = (x³ · I0 − I3)/I0² − 3 · ((x² · I1 + x · I2) · I0² − 2 · I0 · I1 · I2)/I0⁴ + 2 · (3 · I1² · x · I0³ − 3 · I1³ · I0²)/I0⁶. (15)
In other words, this derivative is a cubic polynomial

d(x) = a0 + a1 · x + a2 · x² + a3 · x³, (16)
with a positive coefficient a3 = 1/I0 at x³. It is known that a cubic function has either 1 or 3 real roots, i.e., in the general case, we have values x1 ≤ x2 ≤ x3 at which d(x) = 0. Since the coefficient at x³ is positive:
• we have d(x) < 0 for x < x1,
• we have d(x) > 0 for x1 < x < x2,
• we have d(x) < 0 for x2 < x < x3, and
• we have d(x) > 0 for x > x3.
When d(x) > 0, then, as we have mentioned:
• the maximum cannot be attained inside the interval [μ̲(x), μ̄(x)], and it cannot be attained at the lower endpoint of this interval, so the maximum has to be attained for μ(x) = μ̄(x);
• similarly, the minimum cannot be attained inside the interval [μ̲(x), μ̄(x)], and it cannot be attained at the upper endpoint of this interval, so the minimum has to be attained for μ(x) = μ̲(x).

When d(x) < 0:
• the maximum cannot be attained inside the interval [μ̲(x), μ̄(x)], and it cannot be attained at the upper endpoint of this interval, so the maximum has to be attained for μ(x) = μ̲(x);
• similarly, the minimum cannot be attained inside the interval [μ̲(x), μ̄(x)], and it cannot be attained at the lower endpoint of this interval, so the minimum has to be attained for μ(x) = μ̄(x).

Thus, once we know the values xi, we can uniquely determine the membership functions μ(x) ∈ [μ̲(x), μ̄(x)] at which the skewness attains, for these xi, its largest and smallest values. So, to find the overall largest and smallest values of the skewness, it is sufficient to find the values xi for which the resulting skewness is the largest and for which it is the smallest. Hence, we arrive at the following algorithm for computing the range [M̲3, M̄3] of possible values of skewness.
4 Resulting Algorithm for Computing the Range [M̲3, M̄3] of Possible Values of Skewness

What is given and what we want. For each x, we have an interval [μ̲(x), μ̄(x)]. We want to find the range [M̲3, M̄3] of all possible values of the skewness M3—as defined by the formulas (5) and (8)—over all functions μ(x) for which μ(x) ∈ [μ̲(x), μ̄(x)] for all x.

Computing M̄3. For each triple of real numbers x1 ≤ x2 ≤ x3, we compute the skewness M̄3(x1, x2, x3) corresponding to the following membership function:
• when x < x1 or x2 < x < x3, we take μ(x) = μ̲(x); and
• we take μ(x) = μ̄(x) when x1 < x < x2 or x > x3.

Then, we use an optimization algorithm to compute

M̄3 = max_{x1,x2,x3} M̄3(x1, x2, x3).
(17)
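The recipe above can be sketched in code. The following Python sketch (the function names, the grid, and the brute-force search over triples of grid points are our own illustrative stand-ins for the optimization algorithm mentioned in the text) computes M̄3 for a toy interval-valued membership function:

```python
import itertools
import numpy as np

def central_third_moment(x, mu, dx):
    """M3 via the moments I_k = integral of x^k * mu(x) dx (rectangle rule)."""
    I = [np.sum(x**k * mu) * dx for k in range(4)]
    return I[3]/I[0] - 3*I[2]*I[1]/I[0]**2 + 2*I[1]**3/I[0]**3

def upper_skewness(x, mu_lo, mu_hi):
    """Brute-force stand-in for the optimization over triples x1 <= x2 <= x3:
    for each triple of grid points, build the extremal membership function
    (mu_lo below x1 and on (x2, x3); mu_hi on (x1, x2) and above x3) and
    keep the largest resulting skewness."""
    dx = x[1] - x[0]
    best = -np.inf
    for i, j, k in itertools.combinations_with_replacement(range(len(x)), 3):
        x1, x2, x3 = x[i], x[j], x[k]
        lo_region = (x < x1) | ((x2 < x) & (x < x3))
        mu = np.where(lo_region, mu_lo, mu_hi)
        best = max(best, central_third_moment(x, mu, dx))
    return best

# Toy example: a symmetric triangular membership function, known only
# up to a factor of 2 (mu_lo = mu_hi / 2); all numbers are illustrative.
x = np.linspace(0.0, 2.0, 41)
mu_hi = np.maximum(0.0, 1.0 - np.abs(x - 1.0))
mu_lo = 0.5 * mu_hi
M3_upper = upper_skewness(x, mu_lo, mu_hi)
```

A real implementation would replace the brute-force loop with a proper optimizer; the point here is only the construction of the extremal membership function from a triple (x1, x2, x3).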
Computing M̲3. For each triple of real numbers x1 ≤ x2 ≤ x3, we compute the skewness M̲3(x1, x2, x3) corresponding to the following membership function:
• when x < x1 or x2 < x < x3, we take μ(x) = μ̄(x); and
• we take μ(x) = μ̲(x) when x1 < x < x2 or x > x3.

Then, we use an optimization algorithm to compute

M̲3 = min_{x1,x2,x3} M̲3(x1, x2, x3).
(18)
5 General Case of Arbitrary Central Moments

What is given and what we want. For each x, we have an interval [μ̲(x), μ̄(x)]. We are also given a natural number n. We want to find the range [M̲n, M̄n] of all possible values of the n-th central moment Mn—as defined by the formulas (5) and (7)—over all functions μ(x) for which μ(x) ∈ [μ̲(x), μ̄(x)] for all x.

Discussion. In the general case of the n-th central moment, similar arguments lead to an n-th order polynomial d(x) with a positive coefficient at xⁿ. Thus, similar arguments lead to the following algorithm.

Computing M̄n. For each n-tuple of real numbers x1 ≤ x2 ≤ … ≤ xn, we take x0 = −∞ and xn+1 = +∞. Thus, the whole real line is divided into n + 1 intervals (x0, x1), (x1, x2), …, (xn, xn+1). Then, we compute the n-th central moment M̄n(x1, …, xn) corresponding to the following membership function:
• for x ∈ (xk, xk+1) for which n − k is odd, we take μ(x) = μ̲(x); and
• when n − k is even, we take μ(x) = μ̄(x).

Then, we use an optimization algorithm to compute

M̄n = max_{x1,…,xn} M̄n(x1, …, xn).
(19)
Computing M̲n. For each n-tuple of real numbers x1 ≤ x2 ≤ … ≤ xn, we take x0 = −∞ and xn+1 = +∞. Thus, the whole real line is divided into n + 1 intervals (x0, x1), (x1, x2), …, (xn, xn+1). Then, we compute the n-th central moment M̲n(x1, …, xn) corresponding to the following membership function:
• for x ∈ (xk, xk+1) for which n − k is odd, we take μ(x) = μ̄(x); and
• when n − k is even, we take μ(x) = μ̲(x).

Then, we use an optimization algorithm to compute

M̲n = min_{x1,…,xn} M̲n(x1, …, xn).
(20)
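The parity rule above can be sketched as follows (the function name and array conventions are our own; for n = 3 and maximize=True it reproduces the recipe of Sect. 4):

```python
import numpy as np

def extremal_membership(x, mu_lo, mu_hi, cuts, maximize=True):
    """Builds the membership function used in Sect. 5. cuts = (x1, ..., xn),
    sorted; on the interval (x_k, x_{k+1}) (with x_0 = -inf, x_{n+1} = +inf)
    we take the lower bound mu_lo when n - k is odd and the upper bound
    mu_hi when n - k is even; for the minimum, the two roles are swapped."""
    cuts = np.asarray(cuts)
    n = len(cuts)
    k = np.searchsorted(cuts, x)       # which of the n + 1 intervals each x is in
    take_hi = (n - k) % 2 == 0
    if not maximize:
        take_hi = ~take_hi
    return np.where(take_hi, mu_hi, mu_lo)

# For n = 3 this reproduces the skewness recipe of Sect. 4:
x = np.array([0.2, 0.7, 1.2, 1.8])
mu = extremal_membership(x, np.zeros(4), np.ones(4), [0.5, 1.0, 1.5])
print(mu)   # [0. 1. 0. 1.]: mu_lo below x1 and on (x2, x3), mu_hi elsewhere
```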
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
How to Detect the Fundamental Frequency: Approach Motivated by Soft Computing and Computational Complexity

Eric Freudenthal, Olga Kosheleva, and Vladik Kreinovich
Abstract Psychologists have shown that most information about the mood and attitude of a speaker is carried by the lowest (fundamental) frequency. Because of this frequency's importance, even when the corresponding Fourier component is weak, the human brain reconstructs this frequency based on higher harmonics. The problem is that many people lack this ability. To help them better understand moods and attitudes in social interaction, it is therefore desirable to come up with devices and algorithms that would reconstruct the fundamental frequency. In this paper, we show that ideas from soft computing and computational complexity can be used for this purpose.
1 Outline

According to psychologists, the fundamental frequency components of human speech carry the bulk of information about the mood and attitude of the speaker. Because of the importance of the fundamental frequency signal, even when the actual Fourier component corresponding to this frequency is absent, the brain automatically reconstructs this frequency. The problem is that many people lack this automatic ability and thus miss important speech-related social cues. In this paper, we use ideas from soft computing and computational complexity to reconstruct the fundamental frequency component and thus to help these people better understand the social aspects.
E. Freudenthal · O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] E. Freudenthal e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_57
The structure of this paper is as follows. In Sect. 2, we explain what the fundamental frequency is, why it is important, and why reconstructing it is important. In Sect. 3, we explain ideas about this reconstruction motivated by soft computing and computational complexity. In Sect. 4, we explain that these ideas indeed help.
2 What is Fundamental Frequency and Why it is Important

Speech uses different frequencies. Some people have lower voices, some higher ones. In physical terms, this difference means that the sounds are—locally—periodic with different frequencies:
• lower frequencies for lower voices,
• higher frequencies for higher voices.

Fundamental frequency and harmonics. Every periodic signal x(t) with frequency f0 can be represented as a sum of sinusoids corresponding to the frequencies f = f0, f = 2 f0, …:

x(t) = A1 · sin( f0 · t + ϕ1) + A2 · sin(2 f0 · t + ϕ2) + … + An · sin(n · f0 · t + ϕn) + … (1)
for some coefficients Ai ≥ 0 and ϕi; this is known as the Fourier series; see, e.g., [2–4, 6, 13, 16, 19]. The lowest frequency f0—corresponding to the given period—is known as the fundamental frequency. Frequencies f = n · f0 for n > 1 are known as harmonics. The overall energy of the signal is equal to the sum of the energies of all the harmonics, and the energy of each harmonic is proportional to the square An² of its amplitude. Since the overall energy of the signal is finite, in general, this means that the amplitudes of the harmonics decrease with n—otherwise, if the amplitudes did not decrease, the overall energy would be infinite.

This is not just a mathematical representation of audio signals: in the first approximation, this is how we perceive sounds: there are, in effect, biological sensors in our ears which are attuned to different frequencies f; see, e.g., [5, 18].

Which frequencies are most useful for conveying information. We can convey information by slightly changing the amplitude of each harmonic during each cycle. Higher-frequency harmonics have more cycles and thus can convey more information. Because of this, most information is carried by such high-frequency harmonics; see, e.g., [17].

Which frequencies are most useful for conveying mood and attitude. In addition to information, we also want to convey our mood, our attitude to the person. This attitude does not change that fast, so the best way to convey the attitude is to use the harmonics which change the slowest—i.e., which have the lowest frequency.
And indeed, most such socially important information is conveyed on the lowest frequencies, especially on the lowest—fundamental—frequency; see, e.g., [11]. In most people, the brain automatically reconstructs the fundamental frequency. Because of the social importance of understanding the speaker’s mood and attitude, our brain is actively looking for this information. So, even when the actual fundamental frequency is suppressed, our brain reconstructs it based on the other harmonics. For example, while many people have very low voices, with fundamental frequencies so low that these frequencies are cut off by the usual phone systems, we clearly hear their basso voices over the phone—because in most people, the brain automatically reconstructs the fundamental frequency. Related problem. The problem is that for many people, this reconstruction does not work well. As a result, these people do not get the mood and other social information conveyed on the fundamental frequency—a big disadvantage in social interactions. It is therefore desirable to come up with some signal transformation that would help these people detect the fundamental frequency.
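The harmonic decomposition (1) behind this discussion can be made concrete with a small numerical sketch (frequency, sampling rate, and amplitudes are arbitrary illustrative values; the conventional 2π factor is written out explicitly):

```python
import numpy as np

# Build a periodic signal from its first three harmonics, with amplitudes
# A_n decreasing with n, and check that the spectrum has energy only at
# multiples of the fundamental frequency f0.
f0, fs = 5.0, 1000
t = np.arange(0, 1, 1 / fs)               # exactly 5 fundamental cycles
amplitudes = {1: 1.0, 2: 0.5, 3: 0.25}    # A1 > A2 > A3
x = sum(a * np.sin(2 * np.pi * n * f0 * t) for n, a in amplitudes.items())

spectrum = np.abs(np.fft.rfft(x)) / (len(t) / 2)   # normalized amplitudes
freqs = np.fft.rfftfreq(len(t), 1 / fs)            # 1 Hz per bin here
for n, a in amplitudes.items():
    print(n * f0, spectrum[int(n * f0)])   # peaks of height ~A_n at n * f0
```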
3 Ideas Motivated by Soft Computing and Computational Complexity

Where can we get ideas: a general comment. We would like to use computers to help people reconstruct the fundamental frequency. In other words, we need to find an appropriate connection between people's brains and computers. In general, if we want to find a connection between two topics A and B, a natural idea is:
• either to start with topic A and try to move towards topic B,
• or to start with topic B and try to move towards topic A.

Let us try both these approaches.

Let us start with the brain: ideas motivated by soft computing. Let us first analyze what we can get if we start with the brain side, i.e., with the what-we-want side. From this viewpoint, what we want is to somehow enhance the signal x(t) so that the enhanced signal will provide the listener with a better understanding of the speaker's mood and attitude. Since we started from the brain side, not from the computer side, the above goal is naturally imprecise: it is not formulated in precise, computer-understandable terms; it is formulated using imprecise ("fuzzy") words from natural language. Our ultimate objective is to design a computer-based gadget that would pursue this goal. Thus, we need to translate this goal into a computer-understandable form.
The need for such a translation has been well known since the 1960s, when it was realized that a significant part of expert knowledge is formulated in vague natural-language terms. To translate this knowledge into precise terms, Lotfi Zadeh came up with the idea of fuzzy logic, in which we describe each imprecise property like "small" by assigning:
• to each possible value of the corresponding quantity,
• a degree to which this value satisfies the property of interest—e.g., is small;

see, e.g., [1, 10, 12, 14, 15, 20]. Fuzzy techniques are a particular case of what is known as soft computing—intelligent techniques motivated by not-fully-precise ideas like neural networks or evolutionary computations. In our case, the desired property is "being informative":
• when the signal is 0, we gain no information;
• the stronger the signal, the more information it conveys.

The simplest way to capture this idea is to have a degree d which is proportional to the signal's strength x(t): d = c · x(t) for some constant c. We want to enhance this information. In terms of fuzzy techniques, what does enhancing mean? In general, it means going:
• from tall to very tall,
• from strong to very strong, etc.

In the fuzzy approach, the easiest way to describe "very" is to square the degree. For example:
• if the degree to which some value is small is 0.7,
• then the degree to which this value is very small is estimated as 0.7² = 0.49.

So, whenever the original degree was proportional to x(t)—corresponding to the signal x(t)—the enhanced degree is proportional to (x(t))²—which corresponds to the signal (x(t))². Thus, from the viewpoint of soft computing, it makes sense to consider the square (x(t))² of the original signal. An important point is what to do with the sign, since the signal x(t) can be both positive and negative while the degree is always non-negative. So, we have two choices.
The first choice is to take the formula (x(t))² literally, and thus to consider only non-negative values for the enhanced signal. The second choice is to preserve the sign of the original signal, i.e.:
• to take (x(t))² when x(t) ≥ 0 and
• to take −(x(t))² when x(t) ≤ 0.
In the following text, we will consider both these options.

What if this does not work. What if we need an additional enhancement? From the viewpoint of fuzzy techniques, if adding one "very" does not help, a natural idea is to use something like "very very"—i.e., to apply the squaring twice or even more times. As a result, we get the idea of using (x(t))ⁿ for some n > 2.

What if we start with a computer: ideas motivated by computational complexity. Because the desired signal enhancement is very important for many people in all their communications, we want the signal enhancement to be easily computable—not just available when the user has access to a fast computer. In a computer, the fastest—hardware-supported—operations are the basic arithmetic operations:
• addition and subtraction are the fastest,
• multiplication is next fastest, and
• division is the slowest.

So, a natural idea is to use as few arithmetic operations as possible—just one if possible—and to select the fastest arithmetic operations. Using only addition or subtraction does not help—these are linear operations, and a linear transformation does not change the frequency. So, we need a non-linear transformation, and the fastest non-linear operation is multiplication. At each moment of time t, all we have is the signal value x(t), so the only thing we can do is multiply this signal value by itself—thus getting (x(t))². Similarly to the soft-computing-motivated case, we can deal with the sign in the same two different ways; this does not change the computation time. And similarly to the soft computing case, if a single multiplication does not help, we can multiply again and again—thus getting (x(t))ⁿ for some n > 2.

Discussion. The fact that two different approaches lead to the exact same formulas makes us confident that these formulas will work. So let us analyze what happens when we try them.
4 These Ideas Indeed Help

What if we use squaring. Let us start with squaring the signal. We consider the case when the component corresponding to the fundamental frequency—i.e., proportional to A1—is, in effect, absent, i.e., when for all practical purposes, we have A1 = 0. In this case, the general formula (1) takes the form

x(t) = A2 · sin(2 f0 · t + ϕ2) + A3 · sin(3 f0 · t + ϕ3) + …
(2)
Since, as we have mentioned, in general, the effect of higher harmonics—proportional to An—decreases with n, in the first approximation, it makes sense to only consider the largest terms in the expansion (2)—i.e., the terms corresponding to the smallest possible n. The simplest possible approximation is to consider only the second harmonic, i.e., to take x(t) ≈ A2 · sin(2 f0 · t + ϕ2). However, in this case, all we have is a periodic process with frequency 2 f0; all the knowledge about the original fundamental frequency f0 is lost. Thus, to be able to reconstruct the fundamental frequency f0, we need to consider at least one more term, i.e., take

x(t) ≈ A2 · sin(2 f0 · t + ϕ2) + A3 · sin(3 f0 · t + ϕ3),
(3)
in which, as we have mentioned, A2 > A3. Since A2 > A3, the main term in (x(t))² is proportional to A2². However, we cannot restrict ourselves to this term only; we need to take the A3-term into account as well. Let us therefore take into account the next-largest terms, proportional to A2 · A3. Thus, we get

(x(t))² = A2² · sin²(2 f0 · t + ϕ2) + 2 A2 · A3 · sin(2 f0 · t + ϕ2) · sin(3 f0 · t + ϕ3). (4)

From trigonometry, we know that cos(a) = sin(a + π/2),

sin²(a) = (1/2) · (1 − cos(2a)) = (1/2) · (1 − sin(2a + π/2)),

and

sin(a) · sin(b) = (1/2) · (cos(a − b) − cos(a + b)) = (1/2) · (sin(a − b + π/2) − sin(a + b + π/2)).

Thus, we have

(x(t))² = A2²/2 − (A2²/2) · sin(4 f0 · t + 2ϕ2 + π/2) + A2 · A3 · sin( f0 · t + (ϕ3 − ϕ2 + π/2)) − A2 · A3 · sin(5 f0 · t + (ϕ2 + ϕ3 + π/2)),

hence, in this case, (x(t))² indeed contains a component with the fundamental frequency.
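This derivation can be checked numerically. In the following sketch (frequency, amplitudes, and phases are arbitrary illustrative values, with the conventional 2π factor written out), the original signal has no spectral line at f0, while its square has one of amplitude A2 · A3:

```python
import numpy as np

f0, fs = 4.0, 1024
t = np.arange(0, 1, 1 / fs)               # exactly 4 fundamental cycles
A2, A3, phi2, phi3 = 1.0, 0.6, 0.3, 1.1
# Signal (3): second and third harmonics only, no fundamental component.
x = A2 * np.sin(2*np.pi*2*f0*t + phi2) + A3 * np.sin(2*np.pi*3*f0*t + phi3)

def amplitude_at(signal, f):
    """Normalized amplitude of the spectral line at frequency f (1 Hz bins)."""
    spec = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    return spec[int(round(f))]

print(amplitude_at(x, f0))        # ~0: no component at f0 in x(t)
print(amplitude_at(x**2, f0))     # ~A2 * A3 = 0.6: the square has one
```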
What if the fundamental frequency component in (x(t))² is not large enough. In this situation, both the soft computing and the computational complexity approaches recommend using (x(t))ⁿ for larger n. Will this work? Let us analyze this case. To perform this analysis, let us take into account that:
• while we usually consider time to be a continuous variable,
• in reality, any sensor—be it biological sensors in our ears or electronic sensors—only measures values at several moments of time t1 < t2 < … < tN during each cycle.

In these terms, a signal is represented by an N-dimensional tuple (x(t1), …, x(tN)). At one of these moments of time, the absolute value of the signal takes the largest value during the cycle. In the N-dimensional space of all such tuples, tuples for which |x(ti)| = |x(tj)| for some i ≠ j form a set of smaller dimension—thus, a set of measure 0. Thus, for almost all tuples, such an equality is not possible—and hence, the largest absolute value of the signal is attained at one single point on this cycle. Let us denote this point—where the maximum is attained—by tm. This means that for all other moments of time t, we have |x(t)| < |x(tm)|, i.e., |x(t)| / |x(tm)| < 1. As n increases, the ratio
|x(t)|ⁿ / |x(tm)|ⁿ = (|x(t)| / |x(tm)|)ⁿ
tends to 0. Thus, as n increases, the (appropriately rescaled) signal (x(t))ⁿ tends to a function that is equal to 0 everywhere except for the one single point tm on each cycle. Such a function is known as a delta-function δ(t − tm). It is known that the Fourier transform of the periodic delta-function contains all the components—including the component corresponding to the fundamental frequency—with equal amplitude. Thus, even if the square (x(t))² does not have a visible fundamental frequency component, eventually, for some n, the signal (x(t))ⁿ will have this component of a size comparable to all other components—and thus, visible.

Limitations of squaring. Squaring the signal works well when the original signal x(t) does not have a visible fundamental frequency component. However, if x(t) has a strong component corresponding to the fundamental frequency—e.g., if this component prevails and we have

x(t) ≈ A1 · sin( f0 · t + ϕ1),
then squaring leads to

(x(t))² = (A1²/2) · (1 − sin(2 f0 · t + 2ϕ1 + π/2)),
i.e., to a signal that no longer has any fundamental frequency component—indeed, it is periodic with the double frequency 2 f0. So, in principle, we should consider both:
• the original signal x(t) and
• its square (x(t))².

It is desirable to have a single signal instead. One such possibility is to use the sign-modified signal (x(t))² · sign(x(t)), where:
• sign(x) = 1 for x > 0 and
• sign(x) = −1 for x < 0.

Indeed, in this case:
• even if the original signal is a pure sinusoid x(t) ≈ A1 · sin( f0 · t + ϕ1) with frequency f0,
• its transformation has a non-zero Fourier coefficient A1—since this coefficient is proportional to

∫ sin( f0 · t) · sin²( f0 · t) · sign(sin( f0 · t)) dt = ∫ |sin( f0 · t)| · sin²( f0 · t) dt > 0.

Preliminary confirmation. Our preliminary simulation experiments confirmed these theoretical findings; see, e.g., [7–9].

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
The authors are thankful to all the participants of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences, Las Cruces, New Mexico, April 2, 2022 for valuable suggestions.
References

1. R. Belohlavek, J.W. Dauben, G.J. Klir, Fuzzy Logic and Mathematics: A Historical Perspective (Oxford University Press, New York, 2017)
2. A. Brown, Guide to the Fourier Transform and the Fast Fourier Transform (Cambridge Paperbacks, Cambridge, UK, 2019)
3. C.S. Burrus, M. Frigo, G.S. Johnson, Fast Fourier Transforms (Samurai Media Limited, Thames Ditton, UK, 2018)
4. Th.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press, Cambridge, MA, 2022)
5. S. Da Costa, W. van der Zwaag, L.M. Miller, S. Clarke, M. Saenz, Tuning in to sound: frequency-selective attentional filter in human primary auditory cortex. J. Neurosci. 33(5), 1858–1863 (2013)
6. J.L. de Lyra, Fourier Transforms: Mathematical Methods for Physics and Engineering (2019)
7. L.L.C. Frudensing, Freudensong: Unleashing the Human Voice. https://www.freudensong.com/
8. E. Freudenthal, R. Alvarez-Lopez, V. Kreinovich, B. Usevitch, D. Roundy, Work-in-progress: vocal intonation regeneration through heterodyne mixing of overtone series, in Abstracts of the 27th Joint NMSU/UTEP Workshop on Mathematics, Computer Science, and Computational Sciences, Las Cruces, New Mexico, 2 Apr. 2022
9. E. Freudenthal, V. Mueller, Summary of Vocal Intonation Boosting (VIB). https://www.freudensong.com/vib-white-paper
10. G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic (Prentice Hall, Upper Saddle River, NJ, 1995)
11. N. Kuga, R. Abe, K. Takano, Y. Ikegaya, T. Sasaki, Prefrontal-amygdalar oscillations related to social behavior in mice. eLife 11, paper e78428 (2022)
12. J.M. Mendel, Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions (Springer, Cham, Switzerland, 2017)
13. D.F. Mix, Fourier, Laplace, and z Transforms: Unique Insight into Continuous-Time and Discrete-Time Transforms, Their Definition and Applications (2019). https://www.amazon.com
14. H.T. Nguyen, C.L. Walker, E.A. Walker, A First Course in Fuzzy Logic (Chapman and Hall/CRC, Boca Raton, FL, 2019)
15. V. Novák, I. Perfilieva, J. Močkoř, Mathematical Principles of Fuzzy Logic (Kluwer, Boston, Dordrecht, 1999)
16. B.G. Osgood, Lectures on the Fourier Transform and Its Applications (American Mathematical Society, Providence, Rhode Island, 2019)
17. T. Pendlebury, How to Improve the Speech On Your TV to Make It More Understandable (2022). https://www.cnet.com/tech/home-entertainment/having-a-hard-time-hearing-the-tvyour-speaker-settings-may-be-the-culprit/
18. S.K. Reed, Cognition: Theories and Application (SAGE Publications, Thousand Oaks, CA, 2022)
19. J.V. Stone, The Fourier Transform: A Tutorial Introduction (Sebtel Press, Sheffield, UK, 2021)
20. L.A. Zadeh, Fuzzy sets. Information and Control 8, 338–353 (1965)
What if There Are Too Many Outliers?

Olga Kosheleva and Vladik Kreinovich
Abstract Sometimes, a measuring instrument malfunctions, and we get a value which is very different from the actual value of the measured quantity. Such values are known as outliers. Usually, data processing considers situations when there are relatively few outliers. In such situations, we can delete most of them. However, there are situations when most results are outliers. In this case, we cannot produce a single value which is close to the actual value, but we can generate several values, one of which is close. Of course, all the values produced by the measuring instrument(s) satisfy this property, but there are often too many of them; we would like to compress this set into a smaller-size one. In this paper, we prove that, irrespective of the size of the original sample, we can always compress this sample into a compressed sample of fixed size.
1 Formulation of the Problem

1.1 What Are Outliers and How They Are Treated Now

Usually, measuring instruments work reliably and produce a measurement result which is close to the actual value of the measured quantity. However, sometimes, measuring instruments malfunction, and the values they produce are drastically different from the actual value of the corresponding quantity. Such values are known as "outliers".

O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_58
When we process measurement results, it is important to delete as many outliers as possible. Indeed, if we take them at face value, we may get a biased impression of the situation and thus, based on this biased impression, we will make a wrong decision. In some cases, outliers are easy to detect:
• If I step on a scale and get my weight as 10 kg, clearly something is wrong.
• If I measure my body temperature and the result is 30 °C, this cannot be right.
• Similarly, if a patient has a clear fever, but the thermometer shows 36 °C, something is wrong with this thermometer.

However, in many other situations, it is not as easy to detect outliers. Usual methods for detecting outliers are based on the assumption that the majority of measurement results are correct. In this case, e.g., if we take the median of all the measurement results, this guarantees that the result will not be an outlier.
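The median claim can be illustrated with a few made-up readings: if more than half of the results are correct, then more than half of the sorted values are correct, so the middle element is one of the correct values.

```python
import statistics

# Hypothetical thermometer readings: three correct values, two malfunctions.
readings = [36.5, 36.6, 36.4, 10.0, 99.0]
print(statistics.median(readings))   # 36.5: the median is not an outlier
```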
1.2 But What if There Are Too Many Outliers?

However, in some practical situations, it is the outliers that form the majority: the measuring instrument is malfunctioning most of the time, and the correct measurement results are in the minority. The usual approach to such a situation is to ignore all the values and to try to improve the measuring instrument.
1.3 Formulation of the Problem

If we made 1000 measurements and 60% of the results are outliers, there are still 400 correct measurement results. Clearly, these results contain a lot of information about the studied system. It is therefore desirable to extract some information from these values. How can we do it?
2 How to Extract Information: Idea

2.1 Why This Extraction Is Not Easy

When we have a small number of outliers, we can delete them and produce a value which is close to the actual value of the measured quantity. Unfortunately, in situations when the majority of results are outliers, this is not possible.
For example, suppose that:
• 1/3 of the measurement results are 0s,
• 1/3 of the measurement results are 1s,
• 1/3 of the measurement results are 2s,

and we know that no more than 2/3 of these results are outliers. In this case:
• it could be that 0 is the actual value, and 1 and 2 are outliers;
• it could be that 1 is the actual value, and 0 and 2 are outliers; and
• it could be that 2 is the actual value, and 0 and 1 are outliers.

In such a situation, the only conclusion that we can make is that one of these three results 0, 1, and 2 is close to the actual value—and we do not know which one. This list of possible values does not have to include all measurement results. For example, if we measure with accuracy 0.1, we know that no more than 2/3 of the results are outliers, and:
• 1/3 of the measurement results are 0s,
• 1/2 of the measurement results are 1s, and
• 1/6 of the measurement results are 2s,

then only 0 and 1 can be close to the actual value. Indeed, if 2 were the actual value, then we would have 5/6 outliers, which contradicts our knowledge that the proportion of outliers does not exceed 2/3.
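This reasoning can be turned into a short filter. The sketch below (the function name is ours; it assumes that correct results repeat exactly, rather than merely being close within the measurement accuracy) reproduces both examples:

```python
from collections import Counter

def possible_actual_values(measurements, max_outlier_fraction):
    """A value v can be the actual one only if treating every result
    different from v as an outlier does not exceed the known bound on
    the outlier fraction."""
    n = len(measurements)
    counts = Counter(measurements)
    return sorted(v for v, c in counts.items()
                  if (n - c) / n <= max_outlier_fraction)

# First example: equal thirds of 0s, 1s, and 2s, outlier fraction <= 2/3.
data = [0]*4 + [1]*4 + [2]*4
print(possible_actual_values(data, 2/3))    # [0, 1, 2]

# Second example: 1/3 zeros, 1/2 ones, 1/6 twos: only 0 and 1 remain.
data2 = [0]*2 + [1]*3 + [2]*1
print(possible_actual_values(data2, 2/3))   # [0, 1]
```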
2.2 Main Idea As we have mentioned, in situations when there are too many outliers, we cannot select a single result which is close to the actual value of the measured quantity. A natural idea is thus to extract a finite list of results so that one of them is close to the actual value—and ideally, we should make this list as small as possible.
3 Let Us Formulate This Problem in Precise Terms

3.1 Often, a Sensor Measures Several Quantities

A simple measuring instrument—such as a thermometer—measures only one quantity. However, many measuring instruments measure several quantities at the same time. For example, chemical sensors often measure concentrations of several substances, wind measurements usually involve measuring not only the wind's speed but also its direction, sophisticated meteorological instruments measure humidity in addition to temperature, etc.
O. Kosheleva and V. Kreinovich
In principle, we can simply consider such a complex measuring instrument as a collection of several instruments measuring different quantities. However, if we do this, we will miss the fact that such an instrument usually malfunctions as a whole: if one of its values is an outlier, this means that the instrument malfunctioned, and we should not trust the other values that it produced either. In other words, either all its values are correct, or all the values that this instrument produced are outliers. So, from the viewpoint of outliers, it makes sense to consider it as a single measuring instrument producing several values.
In all such situations, as a result of the jth measurement of the values of several (d) quantities, we get d values x_{j1}, …, x_{jd}. These values can be naturally represented by a point x_j = (x_{j1}, …, x_{jd}) in the d-dimensional space.
3.2 How Can We Represent Uncertainty

In the 1-D case, the usual information about the measurement error Δx_j def= x_j − x—i.e., the difference between the measurement result x_j and the (unknown) actual value x of the desired quantity—is the lower bound Δ− < 0 and the upper bound Δ+ > 0 on its value: Δ− ≤ Δx_j ≤ Δ+; see, e.g., [12]. In this case, once we know the measurement result, we can conclude that the actual value x is located somewhere in the set x_j − [Δ−, Δ+], where, as usual, x_j − U means the set of possible values x_j − u when u ∈ U; see, e.g., [8–10, 12].
In the general case, when both x_j and x are d-dimensional, the measurement error Δx_j = x_j − x is also d-dimensional. We usually know a set U of possible values of the measurement error.
• This set may be a box, i.e., the set of all the tuples (Δx_{j1}, …, Δx_{jd}) for which Δ_i− ≤ Δx_{ji} ≤ Δ_i+ for all i.
• This set may be a subset of this box—e.g., an ellipsoid; see, e.g., [2–6, 11, 15, 16, 18, 20].
In all these cases, the set U is a convex set containing 0. An important aspect is that often, we do not know the set U—i.e., we are not 100% sure about the measurement accuracy. Let us summarize this information.
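As an illustration, membership of a measurement-error vector in the two kinds of uncertainty sets mentioned above can be checked as follows (a minimal sketch; the helper names and the example numbers are ours):

```python
def in_box(delta, lower, upper):
    """Is the error vector delta inside the box [lower_i, upper_i] for all i?"""
    return all(lo <= d <= hi for d, lo, hi in zip(delta, lower, upper))

def in_ellipsoid(delta, radii):
    """Is delta inside the axis-aligned ellipsoid sum((d_i / r_i)^2) <= 1?"""
    return sum((d / r) ** 2 for d, r in zip(delta, radii)) <= 1.0

# A 2-D example: the box [-0.1, 0.1] x [-0.2, 0.2] and its inscribed ellipsoid.
print(in_box((0.05, -0.15), (-0.1, -0.2), (0.1, 0.2)))  # True
print(in_ellipsoid((0.05, -0.15), (0.1, 0.2)))          # 0.25 + 0.5625 <= 1: True
print(in_ellipsoid((0.09, 0.19), (0.1, 0.2)))           # inside the box, outside the ellipsoid: False
```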
3.3 What We Have and What We Want

We have n points x_j = (x_{j1}, …, x_{jd}) in a d-dimensional space. We know that there is a convex set U containing 0 that describes the measurement uncertainty. We know the lower bound k on the number of correct measurements. Usually, we know the proportion ε of correct measurements; in this case, k = ε · n.
What if There Are Too Many Outliers?
In precise terms, this means that there exists a point x such that for at least k of the original n points, we have x_j − x ∈ U. Our objective is to generate a finite set S—with as few points as possible—so that for one of the elements s of this set, we have s − x ∈ U. Let us describe this summary in precise terms.
3.4 Definition

Definition 1 Let X = {x_1, …, x_n} be a set of points in a d-dimensional space, and let k < n be an integer. By the set X's (n, k)-compression, we mean a finite set S with the following property: For every pair (U, x), for which:
• U ⊆ IR^d is a convex set containing 0,
• x ∈ IR^d, and
• x_j − x ∈ U for at least k different indices j,
there exists an element s ∈ S for which s − x ∈ U.
4 Main Result, Its Proof, and Discussion

4.1 Result

Proposition 1 For every d and for every δ > 0, there exists a constant c_{d,δ} such that for all ε, if we take k = ε · n, then for every set of n points, there exists an (n, k)-compression with no more than c_{d,δ} · ε^{−(d−0.5+δ)} elements.
4.2 Discussion

The good news is that the above upper bound on the number of elements in a compression does not depend on n at all:
• we can have thousands of measurement results,
• we can have billions of measurement results,
but no matter how many measurement results we have, we can always compress this information into a finite set whose size remains the same no matter what n we choose.
4.3 Proof

This result follows from a known result of combinatorial convexity about so-called weak ε-nets; see, e.g., [1, 14]. A set S is called a weak ε-net with respect to the set X = {x_1, …, x_n} if for every subset Y of X with at least ε · n elements, the convex hull of Y contains at least one element s ∈ S. It is known that for every dimension d and for every number δ > 0, there exists a constant c_{d,δ} such that for every set X, there exists a weak ε-net with ≤ c_{d,δ} · ε^{−(d−0.5+δ)} elements.
To complete our proof, we need to show that each weak ε-net is an (n, k)-compression. Indeed, we know that for at least k points, we have x_j − x ∈ U. Let us denote k of these points by x_{j1}, …, x_{jk}. In these terms, we have x_{j1} − x ∈ U, …, x_{jk} − x ∈ U. By the definition of a weak ε-net, one of the elements s ∈ S belongs to the convex hull of the points x_{j1}, …, x_{jk}, i.e., has the form s = α_1 · x_{j1} + · · · + α_k · x_{jk} for some values α_i ≥ 0 for which α_1 + · · · + α_k = 1. Thus, we have
s − x = α_1 · (x_{j1} − x) + · · · + α_k · (x_{jk} − x).
In other words, the difference s − x is a convex combination of the differences x_{ji} − x. The differences x_{j1} − x, …, x_{jk} − x all belong to the set U, and the set U is convex; thus, their convex combination s − x also belongs to U. The proposition is proven.
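The key step of this proof (a convex combination of points of a convex set U stays in U) can be illustrated numerically; in the sketch below, the numbers are made up, and U is taken to be a box:

```python
# Differences x_{j_i} - x, each inside the convex set U = [-1, 1] x [-1, 1]:
diffs = [(0.5, -0.2), (-0.8, 0.9), (0.1, 0.4)]
alphas = [0.2, 0.3, 0.5]          # nonnegative coefficients summing to 1

# s - x = alpha_1 (x_{j_1} - x) + ... + alpha_k (x_{j_k} - x):
s_minus_x = tuple(sum(a * p[i] for a, p in zip(alphas, diffs)) for i in range(2))

print(s_minus_x)                                 # approximately (-0.09, 0.43)
print(all(-1.0 <= c <= 1.0 for c in s_minus_x))  # True: the combination stays in U
```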
4.4 There Exists an Algorithm That Computes the Desired Compression

Compressions are exactly weak ε-nets, and the problem of finding such a net can be described in terms of the first-order language of real numbers—with addition, multiplication, inequalities, and finitely many quantifiers over real numbers. Thus, this problem can be solved by an algorithm—this can be either the original Tarski–Seidenberg algorithm [17, 19] or one of its later improvements; see, e.g., [7, 13].
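To make the weak ε-net condition concrete, here is a brute-force check for the 1-D case (our own illustration, not an algorithm from the text), where the convex hull of a subset Y is simply the interval [min Y, max Y]:

```python
from itertools import combinations

def is_weak_net_1d(S, X, k):
    """Brute-force 1-D check: for every k-element subset Y of X, the convex
    hull [min Y, max Y] must contain some point s in S.  By the argument in
    the proof, such an S is an (n, k)-compression.  The loop over all
    k-subsets is exponential in n, so this is an illustration only."""
    for Y in combinations(X, k):
        lo, hi = min(Y), max(Y)
        if not any(lo <= s <= hi for s in S):
            return False
    return True

X = [0, 1, 2, 3, 4, 5]
print(is_weak_net_1d([1, 4], X, 3))   # True: every 3-point hull meets {1, 4}
print(is_weak_net_1d([2.5], X, 3))    # False: e.g., Y = {0, 1, 2} misses 2.5
```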
5 Remaining Open Problems

5.1 Can We Get Even Smaller Compressions?

Researchers in combinatorial convexity believe that we can have a weak ε-net of size ≤ c_d · ε^{−1} · (log(1/ε))^C for some C—and thus, a smaller-size compression—but this still needs to be proven.
5.2 How to Efficiently Compute a Compression

While there exist algorithms for computing a small-size compression, these general algorithms require exponential time—even checking the condition that s ∈ Conv(Y) for all subsets Y of size k requires checking an exponential number of sets Y. Thus, for large n, these algorithms are not feasible. It is therefore desirable to come up with a feasible algorithm for computing the desired compression.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to the anonymous referees for valuable suggestions.
References

1. I. Bárány, Combinatorial Convexity (American Mathematical Society, Providence, 2021)
2. G. Belforte, B. Bona, An improved parameter identification algorithm for signal with unknown-but-bounded errors, in Proceedings of the 7th IFAC Symposium on Identification and Parameter Estimation, York, UK (1985)
3. F.L. Chernousko, Estimation of the Phase Space of Dynamic Systems (Nauka Publishing House, Moscow, 1988). (in Russian)
4. F.L. Chernousko, State Estimation for Dynamic Systems (CRC Press, Boca Raton, 1994)
5. A.F. Filippov, Ellipsoidal estimates for a solution of a system of differential equations. Interval Comput. 2(4), 6–17 (1992)
6. E. Fogel, Y.F. Huang, On the value of information in system identification. Bounded noise case. Automatica 18, 229–238 (1982)
7. D. Yu. Grigor'ev, N.N. Vorobjov (Jr.), Solving systems of polynomial inequalities in subexponential time. J. Symb. Comput. 5(1/2), 37–64 (1988)
8. L. Jaulin, M. Kiefer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
9. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
10. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
11. J.P. Norton, Identification and application of bounded parameter models, in Proceedings of the 7th IFAC Symposium on Identification and Parameter Estimation, York, UK (1985)
12. S.G. Rabinovich, Measurement Errors and Uncertainties: Theory and Practice (Springer, New York, 2005)
13. S. Ratschan, Efficient solving of quantified inequality constraints over the real numbers. ACM Trans. Comput. Log. 7(4), 723–748 (2006)
14. N. Rubin, Stronger bounds for weak ε-nets in higher dimensions, in Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing STOC 2021, Rome, Italy, June 21–25 (2021), pp. 989–1002
15. F.C. Schweppe, Recursive state estimation: unknown but bounded errors and system inputs. IEEE Trans. Autom. Control 13, 22–28 (1968)
16. F.C. Schweppe, Uncertain Dynamic Systems (Prentice Hall, Englewood Cliffs, 1973)
17. A. Seidenberg, A new decision method for elementary algebra. Ann. Math. 60, 365–374 (1954)
18. S.T. Soltanov, Asymptotics of the function of the outer estimation ellipsoid for a linear singularly perturbed controlled system, in Interval Analysis, ed. by S.P. Shary, Yu.I. Shokin. Academy of Sciences Computing Center, Krasnoyarsk, Technical Report No. 17 (1990), pp. 35–40. (in Russian)
19. A. Tarski, A Decision Method for Elementary Algebra and Geometry, 2nd edn. Berkeley and Los Angeles (1951)
20. G.S. Utyubaev, On the ellipsoid method for a system of linear differential equations, in Interval Analysis, ed. by S.P. Shary. Academy of Sciences Computing Center, Krasnoyarsk, Technical Report No. 16 (1990), pp. 29–32. (in Russian)
What Is a Natural Probability Distribution on the Class of All Continuous Functions: Maximum Entropy Approach Leads to Wiener Measure

Vladik Kreinovich and Saeid Tizpaz-Niari

Abstract While many data processing techniques assume that we know the probability distributions, in practice, we often have only partial information about these probabilities—so that several different distributions are consistent with our knowledge. Thus, to apply these data processing techniques, we need to select one of the possible probability distributions. There is a reasonable approach for such a selection—the Maximum Entropy approach. This approach selects a uniform distribution if all we know is that the random variable is located in an interval; it selects a normal distribution if all we know is the mean and the variance. In this paper, we show that the Maximum Entropy approach can also be applied if what we do not know is a continuous function. It turns out that among all probability distributions on the class of such functions, this approach selects the Wiener measure—the probability distribution corresponding to Brownian motion.
1 Formulation of the Problem

Need to select a probability distribution under uncertainty. Usually, we do not have full knowledge of the situation; we only have partial knowledge—so that several different states of the studied system are possible. Usual data processing techniques assume that we know the probabilities of different states—i.e., in mathematical terms, that we know the probability distribution on the set of possible states; see, e.g., [6, 7]. However, in practice, we often do not have full knowledge of these probabilities—i.e., there are several probability distributions consistent with our knowledge. For example, often, for some quantities x, we only know the interval of possible values [x̲, x̄]—but we do not have any information
V. Kreinovich (B) · S. Tizpaz-Niari Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected]
S. Tizpaz-Niari e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_59
about the probabilities of different values from this interval; see, e.g., [1, 3, 4, 6]. To apply the usual data processing techniques to such situations, we need to select one of the possible probability distributions.
How can we make such a selection: Maximum Entropy approach. If all we know is the interval of possible values, then, since there is no reason to assume that some values are more probable than others, it is reasonable to select a distribution in which all these values are equally probable—i.e., the uniform distribution on this interval. This idea goes back to Laplace and is therefore known as Laplace Indeterminacy Principle; see, e.g., [2].
More generally, a natural idea is not to introduce certainty where there is none. For example, if all we know is that the actual value of the quantity x is located on the interval [x̲, x̄], then one of the possible probability distributions is when x is equal to x̲ with probability 1. However, if we select this distribution as a representation of the original interval uncertainty, we would thus introduce certainty where there is none.
To describe this idea in precise terms, we need to formally describe what we mean by certainty. A natural measure of uncertainty is the number of binary ("yes"-"no") questions that we need to ask to determine the exact value of the desired quantity—or, alternatively, to determine this value with some given accuracy ε > 0. It is known that this number of questions can be gauged by Shannon's entropy

S = − ∫ ρ(x) · ln(ρ(x)) dx,
where ρ(x) is the probability density; see, e.g., [2, 5]. In these terms, the requirement not to add certainty—i.e., equivalently, not to decrease uncertainty—means that we should not select a distribution if there is another possible distribution with higher uncertainty. In other words, out of all possible distributions, we should select the one for which the entropy attains its maximal value. This idea is known as the Maximum Entropy approach; see, e.g., [2].
Successes of the Maximum Entropy approach. The Maximum Entropy approach has many successful applications. Let us give a few examples:
• If all we know about a quantity x is that its value is located somewhere in the interval [x̲, x̄], and we have no information about the probabilities of different values within this interval, then the Maximum Entropy approach leads to the uniform distribution on this interval—in full accordance with the Laplace Indeterminacy Principle.
• If we know the probability distributions of two quantities x and y, but we do not have any information about the possible correlation between these two quantities—it could be positive, it could be negative—the Maximum Entropy approach leads to the conclusion that the variables x and y are independent. This is also in good accordance with common sense: if we have no information about the dependence between the two quantities, it makes sense to assume that they are independent.
• Finally, if all we know is the mean and standard deviation of a random variable, then out of all possible distributions with these two values, the Maximum Entropy approach selects the Gaussian (normal) distribution. This selection also makes
perfect sense: indeed, in most practical situations, there are many different factors contributing to randomness, and it is known that under reasonable conditions, the joint effect of many factors leads to the normal distribution; the corresponding mathematical result is known as the Central Limit Theorem; see, e.g., [7].
Remaining challenge. Usually, the Maximum Entropy approach is applied to situations when the state of the system is described by finitely many quantities x^{(1)}, x^{(2)}, …, x^{(n)}, and the question is to select a probability distribution on the set of all possible tuples x = (x^{(1)}, x^{(2)}, …, x^{(n)}).
In some practical situations, however, what is not fully known is a function—e.g., the function that describes how the system's output is related to the system's input. In most cases, this dependence is continuous—small changes in x cause small changes in y = f(x). In such situations, often, we know a class of possible functions f(x), but we do not have full information about the probabilities of different functions from this class (and in many cases, we have no information at all about these probabilities). In such cases, several different probability distributions on the class of all continuous functions are consistent with our knowledge. It is desirable to select a single probability distribution from this class. How can we do it?
What we do in this paper. In this paper, we show that the Maximum Entropy approach can help to solve this challenge: namely, this approach leads to the selection of the Wiener measure, a probability distribution that describes the Brownian motion. In this distribution:
• the joint distribution of any combination of values f(x), f(y), …, is Gaussian, and
• for some constant C > 0, we have E[f(x)] = 0 and E[(f(x) − f(y))²] = C · |x − y| for all x and y (where E[·], as usual, means the expected value).
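As a quick numerical illustration of the Maximum Entropy idea (our own sketch, not from the text): among discrete distributions on the same finite set of values, the uniform one has the largest Shannon entropy.

```python
from math import log

def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    assert abs(sum(p) - 1.0) < 1e-9
    return -sum(pi * log(pi) for pi in p if pi > 0)

n = 4
uniform = [1.0 / n] * n
peaked = [0.7, 0.1, 0.1, 0.1]    # some other distribution on the same 4 values

print(entropy(uniform))   # log(4), the maximum possible on 4 values
print(entropy(peaked))    # strictly smaller
```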
2 Main Result

From the practical viewpoint, all the values are discrete. From the purely mathematical viewpoint, there are infinitely many real numbers and thus, there are infinitely many possible values of the input quantity x. However, from the practical viewpoint, we should take into account that at each moment of time, there is a limit ε > 0 on the accuracy with which we can measure x. As a result, we have a discrete set of distinguishable values of the quantity x: e.g., the values x_0 = 0, x_1 = x_0 + ε = ε, x_2 = x_1 + ε = 2ε, and, in general, x_k = k · ε. In this description, to describe a function y = f(x) means to describe its values y_k = f(x_k) on these values x_k.
Comment. The actual values of x correspond to the limit when our measurements become more and more accurate, i.e., when ε tends to 0.
In these terms, what does continuity mean? In the discretized description, the continuity of the dependence y = f(x) means, in effect, that the values y_{k−1} and y_k corresponding to close values of x cannot differ by more than some small constant δ, i.e., |Δ_k| ≤ δ, where we denoted Δ_k def= y_k − y_{k−1}. Thus, to describe a probability distribution on the class of all continuous functions means describing a probability distribution on the set of all the tuples y = (y_0, y_1, …).
Now, we are ready to apply the Maximum Entropy approach. Once we know the value y_0—e.g., we know that y_0 = 0—and we know the differences Δ_k, we can easily reconstruct the values y_k:

y_k = y_0 + Δ_1 + Δ_2 + · · · + Δ_k.  (1)
So, to describe the probability distribution on the set of all possible tuples, it is sufficient to describe the distribution on the set of all possible tuples of differences Δ = (Δ_1, Δ_2, …).
To find this distribution, let us apply the Maximum Entropy approach. In our description, the only thing that we know about the values Δ_k is that each of these values is located somewhere in the interval [−δ, δ]. Thus, as we have mentioned earlier, the Maximum Entropy approach implies:
• that each difference Δ_k is uniformly distributed on this interval, and
• that the differences Δ_k and Δ_ℓ corresponding to k ≠ ℓ are independent.
This indeed leads to the Wiener measure. For each variable Δ_k, its mean is 0, and its variance is equal to δ²/3. For the sum of independent random variables, the mean is equal to the sum of the means, and the variance is equal to the sum of the variances. Thus, we conclude that for each quantity y_k, its mean E[y_k] and variance V[y_k] are equal to:

E[y_k] = 0,  V[y_k] = k · δ²/3.  (2)

Let us reformulate this expression in terms of the values f(x). The value y_k denotes f(x) for x = k · ε, so k = x/ε, and thus, the formula (2) implies that E[f(x)] = 0 and V[f(x)] = C · x, where we denoted

C def= δ²/(3ε).  (3)
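The mean and variance in formula (2) can be checked by a small Monte Carlo simulation (a sketch; the values of δ, k, and the sample size are arbitrary choices of ours):

```python
import random
from statistics import mean, variance

random.seed(1)
delta, k, n_paths = 0.2, 50, 20000

# Each path: y_k = Delta_1 + ... + Delta_k (with y_0 = 0), where the Delta_i
# are independent and uniformly distributed on [-delta, delta].
samples = [sum(random.uniform(-delta, delta) for _ in range(k))
           for _ in range(n_paths)]

print(mean(samples))        # close to 0
print(variance(samples))    # close to k * delta**2 / 3
print(k * delta ** 2 / 3)   # the value predicted by formula (2)
```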
Similarly, for the difference y_ℓ − y_k = Δ_{k+1} + Δ_{k+2} + · · · + Δ_ℓ (for k < ℓ), we conclude that

E[(y_k − y_ℓ)²] = |ℓ − k| · δ²/3,

and thus, that for every two values x and y, we have

E[(f(x) − f(y))²] = C · |x − y|.  (4)
The formulas (3) and (4) are exactly the formulas for the Wiener measure. To complete our conclusion, we need to show that the probability distribution of each value f(x)—as well as the joint distribution of values f(x), f(y), etc.—is Gaussian. Indeed, according to the formula (1), each value f(x) = y_k is the sum of k = x/ε independent random variables Δ_i. As our measurements become more and more accurate, i.e., as ε tends to 0, the number of such terms tends to infinity, and thus, according to the above-mentioned Central Limit Theorem, the distribution of f(x) tends to Gaussian. Similarly, we can show that the joint distribution of several values f(x), f(y), …, is also Gaussian. So, we can indeed conclude that the selected probability distribution on the set of all continuous functions is the Wiener measure.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. L. Jaulin, M. Kiefer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
2. E.T. Jaynes, G.L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, 2003)
3. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
4. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
5. H.T. Nguyen, V. Kreinovich, B. Wu, G. Xiang, Computing Statistics under Interval and Fuzzy Uncertainty (Springer, Berlin, 2012)
6. S.G. Rabinovich, Measurement Errors and Uncertainties: Theory and Practice (Springer, New York, 2005)
7. D.J. Sheskin, Handbook of Parametric and Non-Parametric Statistical Procedures (Chapman & Hall/CRC, London, 2011)
An Argument in Favor of Piecewise-Constant Membership Functions

Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich
Abstract Theoretically, we can have membership functions of arbitrary shape. However, in practice, at any given moment of time, we can only represent finitely many parameters in a computer. As a result, we usually restrict ourselves to finite-parametric families of membership functions. The most widely used families are piecewise linear ones, e.g., triangular and trapezoid membership functions. The problem with these families is that if we know a nonlinear relation y = f(x) between quantities, the corresponding relation between membership functions is only approximate—since for piecewise linear membership functions for x, the resulting membership function for y is not piecewise linear. In this paper, we show that the only way to preserve, in the fuzzy representation, all relations between quantities is to limit ourselves to piecewise constant membership functions, i.e., in effect, to use a finite set of certainty degrees instead of the whole interval [0, 1].
M. T. Mizukoshi Federal University of Goias, Goiania, Brazil e-mail: [email protected] W. Lodwick Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA e-mail: [email protected] M. Ceberio · O. Kosheleva · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_60
1 Formulation of the Problem

Need for interpolation. In this paper, we consider fuzzy techniques; see, e.g., [2–7]. A membership function μ(x) corresponding to a natural-language property like "small" describes, for each real number x, the expert's degree of confidence that the given value x satisfies the corresponding property—e.g., that the value x is small. Of course, there are infinitely many real numbers x, but we cannot ask infinitely many questions to the expert about all these values x. So, in practice, we ask finitely many questions, and then apply some interpolation/extrapolation to get the values μ(x) for all other x.
Typical example: linear interpolation. The simplest interpolation is linear interpolation. So, if we know:
• that some value x_0 (e.g., x_0 = 0) is absolutely small—i.e., μ(x_0) = 1, and
• that some values x_− < x_0 and x_+ > x_0 are definitely not small—i.e., μ(x_−) = μ(x_+) = 0,
then we can use linear interpolation and get the frequently used triangular membership functions. Similarly, if we know that a certain property holds for sure for two values x_{0−} and x_{0+}, then linear interpolation leads to trapezoid membership functions.
General case. In general, since we can only ask finitely many questions and thus, only get finitely many parameters describing a membership function, we must limit ourselves to a finite-parametric class of membership functions.
Resulting problem. This idea of a linear interpolation works well when all we have is unrelated fuzzy quantities. In this case, we can describe all quantities either by triangular or trapezoid membership functions or—if we have more values to interpolate from—by more general piecewise linear membership functions, i.e., functions for which there exist values x_1 < x_2 < · · · < x_n such that on each interval (−∞, x_1], [x_1, x_2], …, [x_{n−1}, x_n], [x_n, ∞) the function is linear; see, e.g., [1].
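The linear-interpolation construction above can be written down directly (a sketch; the function name and the example endpoints are ours):

```python
def triangular(x, x_minus, x0, x_plus):
    """Triangular membership function: 0 at x_minus and x_plus, 1 at x0,
    and linear interpolation in between."""
    if x_minus < x <= x0:
        return (x - x_minus) / (x0 - x_minus)
    if x0 < x < x_plus:
        return (x_plus - x) / (x_plus - x0)
    return 0.0

print(triangular(0.0, -1.0, 0.0, 1.0))  # 1.0: absolutely small
print(triangular(0.5, -1.0, 0.0, 1.0))  # 0.5: halfway down the right slope
print(triangular(2.0, -1.0, 0.0, 1.0))  # 0.0: definitely not small
```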
The situation becomes more complicated if we take into account that some quantities are related to each other by non-linear dependencies y = f(x) (or y = f(x_1, …, x_n)). Indeed, it is desirable to require that if our knowledge about a quantity x is described by a membership function from the selected family, then the resulting knowledge about the quantity y = f(x) should also be described by a membership function from this family.
It is straightforward, given a membership function μ_X(x) corresponding to x, to compute a membership function μ_Y(y) corresponding to y = f(x)—there is Zadeh's extension principle for this:

μ_Y(y) = max{μ_X(x) : f(x) = y}.  (1)

The problem is that:
• when the membership function μ_X(x) is triangular, and f(x) is nonlinear, then the resulting membership function for y is not piecewise linear, and,
• vice versa, if the membership function for y is piecewise linear, then the corresponding membership function for x = f^{−1}(y) is not piecewise linear.
So, the class of all triangular functions does not satisfy the above property. How can we find a finite-parametric class of membership functions for which, for every nonlinear function f(x), the result of applying this function to a membership function from this class would also lead to a membership function from this same class?
What we plan to do in this paper. In this paper, we show that the only way to reach this goal is to limit ourselves to piecewise constant membership functions, i.e., membership functions for which there exist values x_1 < x_2 < · · · < x_n such that on each interval (−∞, x_1), (x_1, x_2), …, (x_{n−1}, x_n), (x_n, ∞) the function is constant.
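To see the problem concretely, one can discretize Zadeh's extension principle (1) on a finite grid (a sketch; the grid size and names are ours). For the triangular function μ_X(x) = max(0, 1 − |x|) and f(x) = x², the result is μ_Y(y) = 1 − √y on [0, 1], which is not piecewise linear:

```python
def zadeh_extension(mu_x, f, xs):
    """Discretized Zadeh extension principle: mu_Y(y) = max{mu_X(x) : f(x) = y},
    with the maximum taken over a finite grid xs."""
    mu_y = {}
    for x in xs:
        y = f(x)
        mu_y[y] = max(mu_y.get(y, 0.0), mu_x(x))
    return mu_y

mu_x = lambda x: max(0.0, 1.0 - abs(x))   # a triangular membership function
xs = [i / 10 for i in range(-10, 11)]     # grid on [-1, 1]
mu_y = zadeh_extension(mu_x, lambda x: x * x, xs)

print(mu_y[0.0])    # 1.0
print(mu_y[0.25])   # 0.5 = 1 - sqrt(0.25), not the linearly interpolated 0.75
print(mu_y[1.0])    # 0.0
```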
2 Definitions

To formulate our result, let us define, in precise terms:
• what we mean by a finite-parametric family,
• what we mean by the requirement that this family should be closed under applying Zadeh's extension principle, and
• what we mean by a piecewise constant membership function.
2.1 What Is a Finite-Parametric Family

In general, a finite-parametric class of objects means that we have a continuous mapping that assigns, to each tuple of real numbers (c_1, …, c_n) from some open set, a corresponding object. To describe what is a continuous mapping, we need to have a metric on the set of all the objects. So, to describe what we mean by a finite-parametric family of membership functions, we need to describe a reasonable metric on the set of all membership functions, i.e., we need to describe, for each real number ε > 0, what it means that two membership functions are ε-close.
Let us analyze what it means for two membership functions to be ε-close. Intuitively, two real numbers x_1 and x_2 are ε-close if, whenever we measure them with some accuracy ε, we may not be able to distinguish between them. Similarly, the two functions μ_1(x) and μ_2(x) are ε-close if, whenever we measure both the real values x and μ_i(x) with accuracy ε, we will not be able to distinguish these functions. In precise terms, when we only know x_1 with some accuracy ε, i.e., when we only know that x_1 ∈ [x̃_1 − ε, x̃_1 + ε] for the measurement result x̃_1, then all we know is that the value μ_1(x_1) is somewhere between the minimum and the maximum of the function μ_1(x) on this interval. If we know that the function μ_2(x) is ε-close to μ_1(x), this means that the value μ_1(x) is ε-close to one of the values μ_2(x′) for x′ from the x-interval [x̃_1 − ε, x̃_1 + ε], i.e., that we have μ_2(x′) − ε ≤ μ_1(x) ≤ μ_2(x′′) + ε for some points x′ and x′′ from the above x-interval. Thus, we arrive at the following definition.
Definition 1 We say that two membership functions μ_1(x) and μ_2(x) are ε-close if:
• for every real number x_1, there exist values x_2 and x_2′ that are ε-close to x_1 and for which μ_1(x_1) is ε-close to some number between μ_2(x_2) and μ_2(x_2′), and
• for every real number x_2, there exist values x_1 and x_1′ that are ε-close to x_2 and for which μ_2(x_2) is ε-close to some number between μ_1(x_1) and μ_1(x_1′).
We will denote this relation by μ_1 ≈_ε μ_2.
Examples. If both membership functions are continuous—e.g., if both are triangular—then we can simply take x_2 = x_2′ = x_1.
Definition 2 By the distance d(μ_1, μ_2) between two membership functions, we mean the infimum of all the values ε for which these functions are ε-close:

d(μ_1, μ_2) def= inf{ε : μ_1 ≈_ε μ_2}.

One can prove that the distance thus defined satisfies the triangle inequality and is, therefore, a metric. Now, we are ready to define what we mean by a finite-parametric family of membership functions.
Definition 3 By a finite-parametric family of membership functions, we mean a continuous mapping from an open subset of IR^n, for some integer n, to the class of all membership functions with the metric defined by Definition 2.
2.2 When Is a Family Closed Under Applying Zadeh's Extension Principle

Definition 4 We say that a family of membership functions is closed under transformations if for every function μ_X(x) from this family and for every function f(x), the function μ_X(f^{−1}(y)) also belongs to this family.
2.3 What Is a Piecewise Constant Function

Definition 5 We say that a function f(x) from real numbers to real numbers is piecewise constant if there exist values x_1 < x_2 < · · · < x_n such that on each interval (−∞, x_1), (x_1, x_2), …, (x_{n−1}, x_n), (x_n, ∞), the function f(x) is constant.
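A piecewise constant function in the sense of Definition 5 can be represented by its finitely many jump points and levels (a sketch; the class and method names are ours, and we adopt a left-continuous convention at jump points). Applying an increasing 1-1 transformation merely relabels the jump points and keeps the same finite set of levels, which is why such functions behave well under the transformations of Definition 4:

```python
import bisect
import math

class StepFunction:
    """Piecewise constant function: value vs[0] on (-inf, xs[0]], vs[i] on
    (xs[i-1], xs[i]], and vs[-1] on (xs[-1], inf); xs is strictly increasing."""
    def __init__(self, xs, vs):
        assert len(vs) == len(xs) + 1
        self.xs, self.vs = list(xs), list(vs)

    def __call__(self, x):
        return self.vs[bisect.bisect_left(self.xs, x)]

    def transform(self, f):
        """For an increasing 1-1 function f, y -> self(f^{-1}(y)) is again a
        step function: same levels, with the jump points moved to f(x_i)."""
        return StepFunction([f(x) for x in self.xs], self.vs)

mu = StepFunction([0.0, 1.0], [0.0, 0.5, 1.0])  # a non-strictly increasing step function
nu = mu.transform(math.exp)                      # resulting knowledge about y = exp(x)

print(nu(math.exp(0.5)) == mu(0.5))  # True: the levels carry over unchanged
print(nu(0.5), nu(2.0), nu(5.0))     # 0.0 0.5 1.0
```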
3 Main Result

Simplifying assumption. For simplicity, let us consider non-strictly increasing membership functions. Such functions correspond to properties like "large". Our result can be easily extended to other membership functions that consist of two or more non-strictly increasing and non-strictly decreasing segments.

Definition 6 We say that a membership function μ(x) is non-strictly increasing if x ≤ x′ implies that μ(x) ≤ μ(x′).

Definition 7 We say that a family of membership functions is closed under increasing transformations if for every function μX(x) from this family and for every increasing 1-1 function f(x), the function μX(f⁻¹(y)) also belongs to this family.

Discussion. For a membership function μX(x) and for a 1-1 increasing function f(x), the only value x for which f(x) = y is the value f⁻¹(y). So, in this case, Zadeh's extension principle means that for y = f(x), the membership function μY(y) has the form μY(y) = μX(f⁻¹(y)). One can check that if the function μX(x) was increasing, then the function μY(y) = μX(f⁻¹(y)) will be increasing as well.

Now, we are ready to formulate our main result.

Proposition. Let F be a finite-parametric family of non-strictly increasing membership functions which is closed under increasing transformations. Then, each function from this family is piecewise constant.

Comment. For the readers' convenience, the proof is placed in a special last section.

Discussion. So, if we want to have a finite-parametric family of membership functions for which each relation y = f(x) between physical quantities leads to a similar
382
M. T. Mizukoshi et al.
relation between the corresponding membership functions, we should use piecewise constant membership functions, i.e., functions from the real line to a finite subset of the interval [0, 1]. This is equivalent to using this finite subset instead of the whole interval [0, 1], i.e., to considering a finite-valued logic instead of the usual full [0, 1]-based fuzzy logic. Once we limit ourselves to this finite set of confidence degrees, Zadeh's extension principle will keep the values within this set. Indeed, this principle is based on using minimum and maximum, and neither operation introduces new values; they just select from the existing values.
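The closure claim in the last sentence is easy to confirm directly: min and max only select among their arguments, so degrees from a finite set stay in that set. A tiny check (the 5-element degree set is just an example of ours):

```python
# Degrees of a finite-valued logic: min and max never produce a value
# outside the finite set they are applied to.
degrees = {0.0, 0.25, 0.5, 0.75, 1.0}
closure = {op(a, b) for a in degrees for b in degrees for op in (min, max)}
assert closure <= degrees
```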
4 Proof

1◦. Let μ(x) be a membership function from a finite-parametric family of non-strictly increasing membership functions which is closed under increasing transformations. To prove the proposition, it is sufficient to prove that all its values form a finite set. Then, the fact that this function is piecewise constant would follow from this result and from the fact that the function μ(x) is non-strictly increasing. We will prove this statement by contradiction. Let us assume that the set V of values of the function μ(x) is infinite.

2◦. Let us pick an infinite sequence S of different values μ(x1), μ(x2), …, from this set one by one. To do that, we pick one value μ(x1) from the set V. Once we have picked the values μ(x1), …, μ(xk), since the set V is infinite, it has values other than these selected ones, so we pick one of them as μ(xk+1). This way, we get a sequence of different values μ(xk). These values correspond to x-values x1, x2, … Since the values μ(xn) are all different, the values xn are also all different.

3◦. It is known that if we add an infinity point to the real line and consider the usual convergence to infinity, we get a set which is topologically equivalent to a circle: after all positive numbers, we have the infinity point, and after this infinity point, we have all negative numbers. This is known as the one-point compactification of the real line. A circle is a compact set. So, by the properties of a compact set, from any sequence—including the sequence {xn}n—we can extract a convergent subsequence. Let us denote this subsequence by {cn}n, and its limit by

x0 ≝ lim_{n→∞} cn.
4◦. Let us prove that this sequence {cn}n either contains infinitely many elements that follow x0, or contains infinitely many elements which precede x0. Here:
• for finite x0, follow and precede simply mean, correspondingly, larger and smaller;
• for x0 = ∞, follow means that the values are negative, and precede means that these values are positive.
This statement—that the sequence {cn}n either contains infinitely many elements that follow x0 or infinitely many elements which precede x0—can be easily proven by contradiction. Indeed, if we had only finitely many elements cn preceding x0 and only finitely many elements cn following x0, then the whole sequence {cn}n would only have finitely many elements (at most one element can be equal to x0 itself)—but we know this sequence has infinitely many different values.

5◦. Let us prove that when we have infinitely many elements preceding x0, then from the sequence {cn}n, we can extract a strictly increasing subsequence {Xn}n. Let us consider only elements of the sequence {cn}n that precede x0. Let us denote the subsequence of the sequence {cn}n formed by such elements by {sn}n. Then, as X1, let us take X1 = s1. Because of the convergence sn → x0, all the elements sn—starting with some element—get close to x0 and, thus, become larger than X1. Let us take the first element of the sequence sn which is larger than X1 as X2. Similarly, there exist elements which are larger than X2. Let us select the first of them as X3, etc. Thus, we get a strictly increasing sequence X1 < X2 < ⋯ for which all the values μ(Xn) are different, i.e., for which μ(X1) < μ(X2) < ⋯.

6◦. Let us prove that when we have infinitely many elements following x0, then from the sequence {cn}n, we can extract a strictly decreasing subsequence {Xn}n. Let us consider only elements of the sequence {cn}n that follow x0. Let us denote the subsequence formed by such elements by {sn}n. As X1, let us take X1 = s1. Because of the convergence sn → x0, all the elements sn—starting with some element—get close to x0 and, thus, become smaller than X1. Let us take the first element of the sequence sn which is smaller than X1 as X2. Similarly, there exist elements which are smaller than X2. Let us select the first of them as X3, etc.
Thus, we get a strictly decreasing sequence X1 > X2 > ⋯ for which all the values μ(Xn) are different, i.e., for which μ(X1) > μ(X2) > ⋯.

7◦. In both cases, we have either a strictly increasing or a strictly decreasing sequence {Xn}n for which the corresponding values μ(Xn) are, correspondingly, either strictly increasing or strictly decreasing.

7.1◦. When the sequence {Xn}n is strictly increasing, we can take the values

Yn ≝ inf{x : μ(x) = μ(Xn)},

and so Y1 < Y2 < ⋯. For any other strictly increasing sequence Z1 < Z2 < ⋯ tending to the same limit, we can form a piecewise linear transformation function f(x) that maps Zn into Yn, namely:
• for x ≤ Z1, we take f(x) = x + (Y1 − Z1);
• for Zn ≤ x ≤ Zn+1, we take

f(x) = Yn + ((Yn+1 − Yn)/(Zn+1 − Zn)) · (x − Zn);
[Figure: the piecewise-linear map f sends the segment [Zn, Zn+1] on the x-axis to [Yn, Yn+1] on the y-axis.]
and
• if the limit x0 is finite, then for x ≥ x0, we take f(x) = x.

For each such sequence, consider the transformed membership function μ′(x) = μ(f(x)); since μ′(x) = μ(g⁻¹(x)) for the increasing 1-1 function g = f⁻¹, by closedness it also belongs to the given family. For this function, we have inf{x : μ′(x) = μ(Xn)} = Zn. Thus, all these functions μ′(x) are different. So, the given family contains an infinite-parametric subfamily determined by infinitely many parameters Zn. This contradicts our assumption that the family is finite-parametric.

7.2◦. Similarly, when the sequence {Xn}n is strictly decreasing, we can take the values

Yn ≝ sup{x : μ(x) = μ(Xn)},

for which, too, Y1 > Y2 > ⋯. For any other decreasing sequence Z1 > Z2 > ⋯ tending to the same limit, we can form a piecewise linear transformation function f(x) that maps Zn into Yn, namely:
• for x ≥ Z1, we take f(x) = x + (Y1 − Z1);
• for Zn+1 ≤ x ≤ Zn, we take

f(x) = Yn+1 + ((Yn − Yn+1)/(Zn − Zn+1)) · (x − Zn+1);
[Figure: the piecewise-linear map f sends the segment [Zn+1, Zn] on the x-axis to [Yn+1, Yn] on the y-axis.]
and
• if the limit x0 is finite, then for x ≤ x0, we take f(x) = x.

For each such sequence, the transformed membership function μ′(x) = μ(f(x)), which, by closedness under increasing transformations (applied to the increasing 1-1 function f⁻¹), also belongs to the given family. For this function μ′(x), we have sup{x : μ′(x) = μ(Xn)} = Zn. Thus, all these functions are different. So, the given family contains an infinite-parametric subfamily determined by infinitely many parameters Zn—which contradicts our assumption that the family is finite-parametric.

8◦. In both cases, we get a contradiction. So, the set of values of the membership function μ(x) cannot be infinite and must, thus, be finite. The proposition is proven.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
Data Processing Under Fuzzy Uncertainty: Towards More Accurate Algorithms Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich
Abstract Data that we process comes either from measurements or from experts—or from the results of previous data processing that were also based on measurements and/or expert estimates. In both cases, the data is imprecise. To gauge the accuracy of the results of data processing, we need to take the corresponding data uncertainty into account. In this paper, we describe a new algorithm for taking fuzzy uncertainty into account, an algorithm that, for a small number of inputs, leads to the same or even better accuracy than the previously proposed methods.
M. T. Mizukoshi
Federal University of Goias, Goiania, Brazil
e-mail: [email protected]

W. Lodwick
Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA
e-mail: [email protected]

M. Ceberio · O. Kosheleva · V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

M. Ceberio
e-mail: [email protected]

O. Kosheleva
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_61

1 Outline

In many practical situations, our information about the values of physical quantities x1, …, xn comes from experts, and experts usually describe their estimates by using imprecise ("fuzzy") words from natural language. A natural way to describe this information in computer-understandable terms is to use fuzzy techniques. When we
process this data, i.e., when we estimate the quantity y which is related to xi by a known dependence y = f(x1, …, xn), we need to take the experts' imprecision into account. Algorithms for computing the corresponding information about the desired quantity y are usually based on interval computations; these algorithms are described in Sect. 2. In general, the problem of interval computations is NP-hard, so all known feasible algorithms provide only an approximate estimate of the information about y. If we want more accurate estimates, we can, in principle, use more accurate (and more time-consuming) interval computation techniques. What we show in this paper is that for applications to data processing under fuzzy uncertainty, there are other approaches to improve the accuracy, approaches that, for small numbers of inputs, are comparable in accuracy to—or even more accurate than—the currently used ones.
2 Data Processing Under Fuzzy Uncertainty: Definitions, State of the Art, and Remaining Problems Need for data processing. In many practical situations, we want to know the value of a quantity y which is difficult or even impossible to measure directly. For example, we want to know tomorrow’s temperature or the distance to a faraway star. Since we cannot measure the desired quantity directly, a natural idea is to estimate it indirectly: • find easier-to-measure-or-estimate quantities x1 , . . . , xn that are related to y by a known relation y = f (x1 , . . . , xn ), • estimate xi , and • use the resulting estimates for xi and the known relation between y and xi to estimate y. This is an important case of data processing. Need for data processing under fuzzy uncertainty. In many cases, the only way to estimate the values xi is to ask experts, and expert’s estimates are often given not in precise terms, but rather by using imprecise (“fuzzy”) words from natural language. For example, an expert may say that the value of a quantity is very small, or that this value is close to 0.1. In this case, to find a resulting estimate for y, we need to perform data processing under such fuzzy uncertainty. Fuzzy techniques: main idea. To process such estimates, Lotfi Zadeh proposed a technique that he called fuzzy; see, e.g., [1, 3, 7, 10–12]. In this technique, for each imprecise natural-language term P like “very small” and for each possible value x of the corresponding quantity, we ask an expert to estimate, on a scale from 0 to 1, the degree μ P (x) to which this value can be described by this term. Of course, there are infinitely many possible values x, and we cannot ask infinitely many questions to an expert; so:
• we ask this question about finitely many values, and then
• we use some interpolation/extrapolation algorithm to estimate the degrees for all other values x.

The resulting function μP(x) is called a membership function or a fuzzy set.

Fuzzy "and"- and "or"-operations. In many cases, the expert's rules involve logical connectives like "and" and "or". For example, a rule may explain what to do if the car in front is close and it slows down a little bit. Ideally, we could ask the expert to estimate his/her degree of confidence in all such complex statements, but there are too many possible complex statements, and it is not feasible to ask the expert about all of them. So, we need to be able, given the degrees of confidence a and b in statements A and B, to compute estimates for the expert's degree of confidence in the statements A & B and A ∨ B. The algorithms for computing such estimates are known as, correspondingly, "and"-operations f&(a, b) (also known as t-norms) and "or"-operations f∨(a, b) (also known as t-conorms). The simplest "and"- and "or"-operations are f&(a, b) = min(a, b) and f∨(a, b) = max(a, b).

Data processing under fuzzy uncertainty. Suppose that we have fuzzy information μi(xi) about each input xi. What can we say about the value y = f(x1, …, xn)? A number y is a possible value of the resulting quantity if for some tuple (x1, …, xn) for which y = f(x1, …, xn), x1 is a possible value of the first input, x2 is a possible value of the second input, etc. We know the degrees μi(xi) to which each value xi is possible. Here, "for some" means "or", so if we use min to describe "and" and max to describe "or", we get the following formula:

μ(y) = sup_{(x1, …, xn): f(x1, …, xn) = y} min(μ1(x1), …, μn(xn)).   (1)
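On finite grids, formula (1) can be evaluated by brute force; the following sketch is an illustration of ours (not the algorithm proposed in this paper), in which the sup becomes a max over grid tuples:

```python
import itertools

def zadeh_extend(f, mus, grids):
    """Brute-force formula (1): for each attainable y = f(x1, ..., xn),
    mu(y) = max over all grid tuples with f(x1, ..., xn) = y of
    min(mu1(x1), ..., mun(xn))."""
    mu_y = {}
    for xs in itertools.product(*grids):
        y = f(*xs)
        degree = min(mu(x) for mu, x in zip(mus, xs))
        mu_y[y] = max(mu_y.get(y, 0.0), degree)
    return mu_y
```

For example, for f(x1, x2) = x1 + x2 on two-point grids, the result agrees with a hand computation of (1).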
This formula was first described by Zadeh and is thus known as Zadeh's extension principle.

Data processing under fuzzy uncertainty: reduction to interval computations. It is known that the formula (1) becomes simpler if, instead of the original membership functions μi(xi), we consider their α-cuts

[xi(α), x̄i(α)] ≝ {xi : μi(xi) ≥ α}

corresponding to different α ∈ (0, 1].
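An α-cut can be approximated on a grid as follows (our sketch; for the usual increasing-then-decreasing membership functions, the set of grid points with μ(x) ≥ α is an interval, so keeping its endpoints is enough):

```python
def alpha_cut(mu, xs, alpha):
    """Endpoints of the alpha-cut {x : mu(x) >= alpha}, estimated on a
    finite grid xs; returns None if the cut misses the grid entirely."""
    pts = [x for x in xs if mu(x) >= alpha]
    return (min(pts), max(pts)) if pts else None
```

For the triangular membership function μ(x) = max(0, 1 − |x|), the 0.5-cut is [−0.5, 0.5] and the 1-cut is the single point 0.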
[Figure: a membership function μ(x) with the α-cut [x(α), x̄(α)] at level α.]
It is convenient to denote each α-cut by xi(α) ≝ [xi(α), x̄i(α)]. In these notations, each α-cut y(α) = {y : μ(y) ≥ α} corresponding to y is equal to y(α) = f(x1(α), …, xn(α)), where for each tuple of sets X1, …, Xn, the y-range f(X1, …, Xn) is defined as
f(X1, …, Xn) ≝ { f(x1, …, xn) : x1 ∈ X1, …, xn ∈ Xn }.

Usually, membership functions corresponding to terms like "very small" or "close to 0.1" are first non-strictly increasing and then non-strictly decreasing. In this case, each α-cut is an interval, and the problem of computing each α-cut y(α) becomes the problem of computing the set f(X1, …, Xn) for n intervals Xi = xi(α). The problem of computing the y-range f(X1, …, Xn) for intervals X1, …, Xn is known as the problem of interval computations; see, e.g., [2, 5, 6, 8].

Comment. To reconstruct the membership function exactly, we would need to know the α-cuts corresponding to all possible values α. Of course, there are infinitely many real numbers α in the interval (0, 1], but at any period of time, we can only perform finitely many computations. So, in practice, we compute the α-cuts for finitely many values 0 < α1 < α2 < ⋯ < αm ≤ 1. We try to select these values αk so as to provide a good approximation to the original membership function; thus, to avoid large gaps, we make sure that the differences αk+1 − αk between consecutive values of α are small.

Interval computations: a brief reminder. In general, the interval computation problem is NP-hard (see, e.g., [4]). This means, crudely speaking, that no feasible algorithm can always compute the exact y-range f(X1, …, Xn) for intervals Xi = [xi, x̄i]. Thus, each feasible algorithm provides only an approximation to the y-range.

The usual algorithms for computing the y-range can be explained in several steps. The first step is the fact that when the function f(x1, x2) corresponds to a simple arithmetic operation—addition, subtraction, multiplication, or division—the range of the result can be explicitly computed:

[x1, x̄1] + [x2, x̄2] = [x1 + x2, x̄1 + x̄2];
[x1, x̄1] − [x2, x̄2] = [x1 − x̄2, x̄1 − x2];
[x1, x̄1] · [x2, x̄2] = [min(x1 · x2, x1 · x̄2, x̄1 · x2, x̄1 · x̄2), max(x1 · x2, x1 · x̄2, x̄1 · x2, x̄1 · x̄2)];
1/[x2, x̄2] = [1/x̄2, 1/x2] if 0 ∉ [x2, x̄2]; and [x1, x̄1]/[x2, x̄2] = [x1, x̄1] · (1/[x2, x̄2]).

These formulas are known as interval arithmetic (sometimes called standard interval arithmetic, to distinguish it from alternative techniques for processing intervals).

The second step is related to the fact that in a computer, arithmetic operations are the only ones that are hardware supported. Thus, in a computer, every computation is actually a sequence of arithmetic operations. It is known that if we replace each arithmetic operation on numbers with the corresponding operation of interval arithmetic, then we get an enclosure Y for the desired range y, i.e., a set Y for which y ⊆ Y. This procedure of replacing each arithmetic operation with the corresponding operation of interval arithmetic is known as straightforward interval computations. (This procedure is also known as naive interval computations, since it is sometimes used—usually unsuccessfully—by people who naively think that this is what interval computations are about, not realizing that they are missing an important part of interval techniques.)

The third step is the so-called centered form, according to which the desired range y = f(x1, …, xn) is enclosed by the following interval:

y ⊆ ỹ + Σ_{i=1}^n di · [−Δi, Δi],

where

Δi ≝ (x̄i − xi)/2

is the half-width (= radius) of the i-th interval, ỹ ≝ f(x̃1, …, x̃n), where

x̃i ≝ (xi + x̄i)/2
is the midpoint of the i-th interval, and di is the result of applying straightforward interval computations to estimate the range of the i-th partial derivative of f on the intervals xi:

di ⊇ (∂f/∂xi)(x1, …, xn).

(There is also an additional idea—that we can simplify computations if the function f(x1, …, xn) is monotonic with respect to some of the variables—as was, e.g., the case for arithmetic operations.)

Since, as we have mentioned, the problem of computing the exact y-range is NP-hard, the centered form leads, in general, to an approximation to the actual y-range.

What if we want a more accurate estimate? The approximation error of the centered form can be estimated if we take into account that the centered form is, in effect, based on the linear terms in the Taylor expansion of the function f(x1, …, xn)—and indeed, for linear functions f(x1, …, xn), the centered form formula leads to the exact range. Thus, the approximation error of this approximation is of the same order of magnitude as the largest terms that we ignore in this approach—i.e., quadratic terms, terms proportional to the products Δi · Δj. From this viewpoint, to get a more accurate estimate, we can:
• divide (bisect) one of the intervals into two equal parts,
• estimate the range corresponding to each of these parts, and then
• take the union of the corresponding y-ranges.
After bisection, the width of one of the intervals decreases by half, so some terms in the general quadratic expression become smaller, and the approximation error decreases.

Remaining problems. The current algorithm requires a certain computation time and leads to a certain accuracy.
• In some cases, we need the results faster—and it is OK if the accuracy is slightly lower; in this case, we can use an alternative algorithm described in [9].
• In other cases, we are interested in higher accuracy—even if it takes more computation time.
This is a problem with which we deal in this paper. What we do in this paper. As we have mentioned, a general way to get more accurate estimates is to apply bisection. In this paper, we propose another way of increasing accuracy, which is specifically tailored to the case of fuzzy data processing.
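Before turning to the new algorithm, the machinery of Sect. 2 can be summarized in code. The sketch below is our own (the names `Interval`, `centered_form`, and `grad` are not from the paper): it implements standard interval arithmetic and the centered-form enclosure, with the derivative enclosures di supplied by the caller.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

def centered_form(f, grad, box):
    """Centered-form enclosure: f(midpoints) + sum_i d_i * [-Delta_i, Delta_i].
    `grad` must return, for the given box, interval enclosures d_i of the
    partial derivatives df/dx_i (e.g., obtained by straightforward interval
    computations)."""
    mids = [(b.lo + b.hi) / 2 for b in box]
    halves = [(b.hi - b.lo) / 2 for b in box]
    yc = f(*mids)
    enclosure = Interval(yc, yc)
    for d, h in zip(grad(box), halves):
        enclosure = enclosure + d * Interval(-h, h)
    return enclosure
```

For f(x1, x2) = x1 · x2 − x1 on [0, 1] × [2, 3], whose exact range is [0, 2], this centered form returns the (wider) enclosure [−0.75, 2.25].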
3 New Algorithm: Motivations, Description, and Comparative Analysis

Motivations. For each αk, the corresponding y-α-cut y(αk) is equal to the y-range f(B(αk)) of the function f(x1, …, xn) over the corresponding x-range box
B(αk) ≝ [x1(αk), x̄1(αk)] × ⋯ × [xn(αk), x̄n(αk)].

For example, for n = 2, the x-range box has the following form:

[Figure: the box [x1(αk), x̄1(αk)] × [x2(αk), x̄2(αk)] in the (x1, x2)-plane.]
From the definition of the α-cut, we can conclude that the x-ranges corresponding to different values αk are getting narrower as α increases: B(α1 ) ⊇ B(α2 ) ⊇ · · · ⊇ B(αm ).
[Figure: the nested boxes B(α1) ⊇ B(α2) ⊇ B(α3) ⊇ ⋯]
Because of this, each x-range can be represented as a union of the differences:

B(αk) = B(αm) ∪ (B(αm−1) − B(αm)) ∪ ⋯ ∪ (B(αk) − B(αk+1)),

where, as usual, B(αj) − B(αj+1) denotes the difference between the two sets:

B(αj) − B(αj+1) ≝ {a : a ∈ B(αj) and a ∉ B(αj+1)}.
[Figure: the shells B(α1) − B(α2), B(α2) − B(α3), B(α3) − B(α4), …]
For example, for n = 1, we get:

[Figure: for n = 1, the interval B(αk) splits into the inner interval B(αm) and the differences B(αm−1) − B(αm), B(αm−2) − B(αm−1), etc., along the x1-axis.]

Thus, the desired y-range y(αk) can be represented as the union of the y-ranges corresponding to the component sets:

f(B(αk)) = f(B(αm)) ∪ f(B(αm−1) − B(αm)) ∪ ⋯ ∪ f(B(αk) − B(αk+1)).   (2)

Each difference B(αk) − B(αk+1) is the union of (somewhat overlapping) 2n boxes corresponding to i = 1, …, n:

Bi⁻(αk) = [x1(αk), x̄1(αk)] × ⋯ × [xi−1(αk), x̄i−1(αk)] × [xi(αk), xi(αk+1)] × [xi+1(αk), x̄i+1(αk)] × ⋯ × [xn(αk), x̄n(αk)]   (3)

and

Bi⁺(αk) = [x1(αk), x̄1(αk)] × ⋯ × [xi−1(αk), x̄i−1(αk)] × [x̄i(αk+1), x̄i(αk)] × [xi+1(αk), x̄i+1(αk)] × ⋯ × [xn(αk), x̄n(αk)].   (4)
[Figure: for n = 2, the shell B(αk) − B(αk+1) is covered by the four boxes B1⁻(αk), B1⁺(αk), B2⁻(αk), and B2⁺(αk).]

Thus, the y-range corresponding to each difference B(αk) − B(αk+1) is equal to the union of y-ranges corresponding to the component boxes:

f(B(αk) − B(αk+1)) = f(B1⁻(αk)) ∪ f(B1⁺(αk)) ∪ ⋯ ∪ f(Bn⁻(αk)) ∪ f(Bn⁺(αk)).   (5)

So, we can compute the y-ranges over these component boxes, and then take the union. In each of the component boxes (3) and (4), the width of one of the sides is much narrower than in the box B(αk). In this case, as we have mentioned, we get a more accurate result. This leads us to the following algorithm.

Algorithm.
• We compute the y-range f(B(αm)) over the box
B(αm) = [x1(αm), x̄1(αm)] × ⋯ × [xn(αm), x̄n(αm)].   (6)
• For each k from 1 to m − 1 and for each i from 1 to n, we use the centered form (or any other interval computation technique) to estimate the ranges f(Bi⁻(αk)) and f(Bi⁺(αk)) over the boxes (3) and (4).
• For each k from 1 to m − 1, we compute the range f(B(αk) − B(αk+1)) by using the formula (5).
• Finally, for each k from 1 to m − 1, we compute the y-range y(αk) = f(B(αk)) by using the formula (2).

Is this algorithm better than bisection? In the above-described usual algorithm for data processing under interval uncertainty, we apply interval computations m times—as many times as there are different values αk. In the new algorithm, we apply them 1 + (m − 1) · (2n) times. So, clearly, the new algorithm takes longer: approximately 2n times longer.
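The steps above can be sketched as follows. This is our own illustration: brute-force sampling stands in for the centered form as the range estimator, boxes are lists of (lower, upper) pairs per coordinate, and the unions in formulas (2) and (5) are taken as interval hulls:

```python
import itertools

def sample_range(f, box, pts=11):
    """Stand-in for the centered form: estimate the y-range of f over the
    box by brute-force sampling on a grid."""
    axes = [[lo + (hi - lo) * t / (pts - 1) for t in range(pts)]
            for lo, hi in box]
    vals = [f(*x) for x in itertools.product(*axes)]
    return (min(vals), max(vals))

def hull(intervals):
    return (min(lo for lo, _ in intervals), max(hi for _, hi in intervals))

def nested_alpha_ranges(f, boxes):
    """boxes[k] = B(alpha_k), with B(alpha_1) containing B(alpha_2), etc.
    Builds the y-ranges y(alpha_k) from the innermost box outward: the range
    over the innermost box, plus the ranges over the 2n component boxes
    (3)-(4) of each shell B(alpha_k) - B(alpha_{k+1}), combined as in (2)."""
    m, n = len(boxes), len(boxes[0])
    y = [None] * m
    y[m - 1] = sample_range(f, boxes[m - 1])
    for k in range(m - 2, -1, -1):
        outer, inner = boxes[k], boxes[k + 1]
        pieces = [y[k + 1]]
        for i in range(n):
            # B_i^-: the slab between the outer and inner lower bounds
            pieces.append(sample_range(
                f, outer[:i] + [(outer[i][0], inner[i][0])] + outer[i + 1:]))
            # B_i^+: the slab between the inner and outer upper bounds
            pieces.append(sample_range(
                f, outer[:i] + [(inner[i][1], outer[i][1])] + outer[i + 1:]))
        y[k] = hull(pieces)
    return y
```

For f(x) = x² with the nested α-cuts [−2, 2] ⊇ [−1, 1], this yields the y-ranges [0, 4] and [0, 1].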
Similarly, if we use bisection, we get more accurate results, but also at the expense of a longer computation time. So, the question is which of these two algorithms leads to more accurate estimates if we use the same computation time. Let us analyze this question for different values n.

Case when n = 1: the new algorithm is more accurate. For n = 1, the new algorithm requires 2n = 2 times longer than the original centered form. So, we need to compare it with the case when we apply bisection once—in this case, we need two applications of the centered form, i.e., we also increase the computation time by a factor of 2.

[Figure: the interval bisected into two halves, each of width Δ1/2.]

The approximation error of the original centered form is proportional to Δ1². If we bisect, the width decreases to Δ1/2, so the approximation error of the bisection result—which is proportional to (Δ1/2)² = Δ1²/4—decreases by a factor of 4. On the other hand, for the new algorithm, the approximation error is proportional to the square of much smaller widths like xi(αk) − xi(αk+1), which are, in general, m times smaller than the original width. So, with the new algorithm, the approximation error decreases by a factor of m², which, for the usual m = 7 ± 2, is much better than for bisection. Thus, for n = 1, the new method is much more accurate than bisection.
Case when n = 2: both algorithms are of the same accuracy. In this case, the new algorithm requires 2n = 4 times longer than the original centered form. So, we need to compare it with the case when we apply bisection twice—then we also increase the computation time by a factor of 4.

[Figure: for n = 2, the box is bisected in both coordinates into four sub-boxes of widths Δ1/2 and Δ2/2.]

In this case, we bisect both input intervals, so both widths are decreased by a factor of 2, and thus, all products Δi · Δj involved in our estimate for the approximation error also decrease by a factor of 4. Thus, for bisection, the new approximation error is 4 times smaller than for the original centered form. In the new method, in each of the four estimates, only one side has the original width; the other side is much smaller. Thus, of the four terms Δi · Δj (i, j = 1, 2) forming the original approximation error, only one term remains the same size—all others are much smaller. So, the new method also decreases the approximation error by a factor of 4. Thus, for n = 2, the two methods have comparable accuracy.
Case when n = 3: both algorithms are of the same accuracy. In this case, the new algorithm requires 2n = 6 times longer than the original centered form. So, we need to compare it with the case when we apply bisection and get 6 smaller boxes. This way, for all resulting boxes, we can get the first two coordinates divided by two. However, we do not have enough boxes to make sure that for all the boxes, all the widths are divided by 2—this would require 8 boxes. Thus, at least for some of the smaller boxes, the third coordinate will be the same size as before. So:
• for the 4 terms Δi · Δj corresponding to i, j ≤ 2, we get 1/4 of the original size,
• for the 2 · 2 = 4 terms Δi · Δ3 and Δ3 · Δi, i ≤ 2, we get 1/2 of the original size, and
• the term Δ3² remains the same.
Thus, the overall approximation error—which originally consisted of 3 × 3 = 9 original-size terms Δi · Δj—is now reduced to 4 · (1/4) + 4 · (1/2) + 1 = 4 times the original product term. So, for bisection, the approximation error is smaller by a factor of 9/4 = 2.25.

In the new method, for each box, we only have two sides of the original width; the third side is much narrower. So, out of the 9 products, only the 4 products corresponding to the original-width sides remain the same size; the rest become much smaller and can, thus, safely be ignored. Thus, for n = 3, the new method also decreases the approximation error by the same factor of 9/4 = 2.25, so the two methods have comparable accuracy.

Case of larger n: bisection is more accurate. In general, if we bisect the box over v variables, we get 2^v different boxes—and thus, 2^v times longer computations. To compare with the new algorithm that requires 2n times more computations, we need to select v for which 2^v = 2n, i.e., v = log2(2n). Here, v ≪ n.
In this case:
• for v² pairs of bisected variables, the product Δi · Δj is decreased by a factor of 4;
• for 2 · v · (n − v) pairs of a bisected and a non-bisected variable, the product Δi · Δj is decreased by a factor of 2; and
• for (n − v)² pairs of non-bisected variables, the product Δi · Δj remains the same.

Thus, after this bisection, instead of the sum of n² products of the original size, we have a sum proportional to

(1/4) · v² + (1/2) · 2 · v · (n − v) + (n − v)² = (1/4) · v² + v · n − v² + n² − 2 · v · n + v² =
n² − v · n + (1/4) · v² = n² − log2(2n) · n + O(log²(n)).
On the other hand, for the new algorithm, for each box, n − 1 sizes are the same, and one is much smaller, so we reduce the sum of n² terms to the sum of (n − 1)² terms, i.e., to (n − 1)² = n² − 2 · n + 1. For large n, we have log2(2n) > 2, and thus, bisection leads to more accurate results.

This advantage starts with the case when n = 4. In this case, the new algorithm requires 2n = 8 uses of the centered form. During this time, we can bisect each of the first 3 inputs—this will also lead to 2³ = 8 boxes and, thus, 8 uses of the centered form. Without loss of generality, let us assume that we bisect the first three inputs. Then:
• for the 9 terms Δi · Δj corresponding to i, j ≤ 3, we get 1/4 of the original size,
• for the 2 · 3 = 6 terms Δi · Δ4 and Δ4 · Δi, i ≤ 3, we get 1/2 of the original size, and
• the term Δ4² remains the same.
Thus, the overall approximation error—which originally consisted of 4 × 4 = 16 original-size terms Δi · Δj—is now reduced to 9 · (1/4) + 6 · (1/2) + 1 = 6.25 times the original product term. So, for bisection, the approximation error is smaller by a factor of 16/6.25 = 2.56.

On the other hand, in the new method, for each box, we only have 3 sides of the original width; the 4th side is much narrower. So, out of the 16 products, only 3² = 9 products corresponding to the original-width sides remain the same size; the rest become much smaller and can, thus, safely be ignored. Thus, for n = 4, the new method only decreases the approximation error by a factor of 16/9 ≈ 1.78, which is not as good as bisection.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No.
075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References
1. R. Belohlavek, J.W. Dauben, G.J. Klir, Fuzzy Logic and Mathematics: A Historical Perspective (Oxford University Press, New York, 2017)
2. L. Jaulin, M. Kiefer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
3. G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic (Prentice Hall, Upper Saddle River, 1995)
Data Processing Under Fuzzy Uncertainty: Towards More Accurate Algorithms
4. V. Kreinovich, A. Lakeyev, J. Rohn, P. Kahl, Computational Complexity and Feasibility of Data Processing and Interval Computations (Kluwer, Dordrecht, 1998)
5. B.J. Kubica, Interval Methods for Solving Nonlinear Constraint Satisfaction, Optimization, and Similar Problems: From Inequalities Systems to Game Solutions (Springer, Cham, 2019)
6. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
7. J.M. Mendel, Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions (Springer, Cham, 2017)
8. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
9. H.T. Nguyen, O. Kosheleva, V. Kreinovich, Data processing under fuzzy uncertainty: towards more efficient algorithms, in Proceedings of the 2022 IEEE World Congress on Computational Intelligence IEEE WCCI'2022, Padua, Italy, July 18–23 (2022)
10. H.T. Nguyen, C.L. Walker, E.A. Walker, A First Course in Fuzzy Logic (Chapman and Hall/CRC, Boca Raton, 2019)
11. V. Novák, I. Perfilieva, J. Močkoř, Mathematical Principles of Fuzzy Logic (Kluwer, Boston, 1999)
12. L.A. Zadeh, Fuzzy sets. Inf. Control 8, 338–353 (1965)
Epistemic Versus Aleatory: Case of Interval Uncertainty Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, Olga Kosheleva, and Vladik Kreinovich
Abstract Interval computations usually deal with the case of epistemic uncertainty, when the only information that we have about the value of a quantity is that this value is contained in a given interval. However, intervals can also represent aleatory uncertainty, when we know that each value from this interval is actually attained for some object at some moment of time. In this paper, we analyze how to take such aleatory uncertainty into account when processing data. We show that in the case when different quantities are independent, we can use the same formulas for dealing with aleatory uncertainty as we use for the epistemic one. We also provide formulas for processing aleatory intervals in situations when we have no information about the dependence between the input quantities.
M. T. Mizukoshi Federal University of Goias, Goiania, Brazil e-mail: [email protected] W. Lodwick Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA e-mail: [email protected] M. Ceberio · O. Kosheleva · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_62
1 Outline There exist many algorithms for dealing with epistemic uncertainty, when we do not have full information about the actual value of a physical quantity x. An important case of such uncertainty is the case of interval uncertainty, when all we know is the
lower bound $\underline{x}$ and upper bound $\overline{x}$ on the actual (unknown) value $x$. This case is called interval uncertainty, because in this case the set of possible values of $x$ forms an interval $[\underline{x}, \overline{x}]$. In the case of epistemic uncertainty, we have a single (unknown) actual value, and the only thing we know about this value is that it belongs to the given interval. However, intervals emerge also in other situations, e.g., when we actually have several different values of the physical quantity, and the interval represents the range of these actual values. This situation is known as aleatory uncertainty. In this paper, we analyze how to take aleatory uncertainty into account during data processing.
2 Epistemic Interval Uncertainty: A Brief Reminder What is epistemic interval uncertainty: a reminder. In many practical situations, we do not know the exact value of a physical quantity $x$; the only thing we know about this value $x$ is an interval $[\underline{x}, \overline{x}]$ that contains this value; see, e.g., [13]. In this case, the interval represents our lack of knowledge and is, thus, a particular case of epistemic uncertainty.
How to process data under epistemic interval uncertainty: general case. In general, data processing means that we apply an algorithm $f(x_1, \ldots, x_n)$ to some values $x_1, \ldots, x_n$, and get the result $y = f(x_1, \ldots, x_n)$. This procedure allows us, given the values $x_i$, to find the value of a quantity $y$ that is related to the $x_i$ by the dependence $y = f(x_1, \ldots, x_n)$. For example, if we know the resistance $x_1$ and the current $x_2$, then we can use the known dependence $y = x_1 \cdot x_2$ (Ohm's law) to compute the voltage $y$. In the case of epistemic interval uncertainty, we only know the intervals $[\underline{x}_i, \overline{x}_i]$ that contain the actual (unknown) values $x_i$. Thus, the only thing that we can conclude about the value $y = f(x_1, \ldots, x_n)$ is that this value is contained in the set
$$\{f(x_1, \ldots, x_n) : x_1 \in [\underline{x}_1, \overline{x}_1], \ldots, x_n \in [\underline{x}_n, \overline{x}_n]\}. \quad (1)$$
This set is usually called the y-range.
Definition 1 By a y-range of a function $y = f(x_1, \ldots, x_n)$ on the intervals $[\underline{x}_1, \overline{x}_1]$, …, $[\underline{x}_n, \overline{x}_n]$, we mean the set
$$f([\underline{x}_1, \overline{x}_1], \ldots, [\underline{x}_n, \overline{x}_n]) \stackrel{\text{def}}{=} \{f(x_1, \ldots, x_n) : x_1 \in [\underline{x}_1, \overline{x}_1], \ldots, x_n \in [\underline{x}_n, \overline{x}_n]\}. \quad (2)$$
Data processing functions $y = f(x_1, \ldots, x_n)$ are usually continuous. For such functions, the following fact is known from calculus.
Proposition 1 For each continuous function $f(x_1, \ldots, x_n)$ and for all intervals $[\underline{x}_1, \overline{x}_1]$, …, $[\underline{x}_n, \overline{x}_n]$, the y-range is an interval.
In the following text, we will denote the endpoints of the y-range by $\underline{y}$ and $\overline{y}$. Then, the y-range is equal to the interval $[\underline{y}, \overline{y}]$. Computing this interval y-range is known as interval computations [4, 8–10]. In general, the interval computation problem is NP-hard already for quadratic functions $y = f(x_1, \ldots, x_n)$; see, e.g., [6]. This means that unless P = NP (which most computer scientists believe not to be true), no feasible algorithm can compute all the y-ranges. However, there are many efficient algorithms that help to solve many practical cases of this general problem; see, e.g., [4, 8–10].
Linearized case. In the important and frequent case when we know the values $x_i$ with good accuracy, there exist feasible algorithms (actually, straightforward formulas) for interval computations. This case is known as the linearized case, and the corresponding technique is known as linearization; see, e.g., [5, 6, 11]. To explain these formulas, we need to take into account that each interval $[\underline{x}_i, \overline{x}_i]$ can be represented as $[\widetilde{x}_i - \Delta_i, \widetilde{x}_i + \Delta_i]$, where
$$\widetilde{x}_i \stackrel{\text{def}}{=} \frac{\underline{x}_i + \overline{x}_i}{2} \quad (3)$$
is the interval's midpoint and
$$\Delta_i \stackrel{\text{def}}{=} \frac{\overline{x}_i - \underline{x}_i}{2} \quad (4)$$
is the interval's half-width. Each value $x_i$ from this interval can be represented as $\widetilde{x}_i + \Delta x_i$, where the difference
$$\Delta x_i \stackrel{\text{def}}{=} x_i - \widetilde{x}_i \quad (5)$$
satisfies the condition $|\Delta x_i| \le \Delta_i$. In these terms, each value $y = f(x_1, \ldots, x_n)$ takes the form $y = f(\widetilde{x}_1 + \Delta x_1, \ldots, \widetilde{x}_n + \Delta x_n)$. Since the values $\Delta x_i$ are small, we can expand this expression in Taylor series and keep only linear terms in this expansion. Thus, we get
$$y = f(\widetilde{x}_1 + \Delta x_1, \ldots, \widetilde{x}_n + \Delta x_n) = \widetilde{y} + \sum_{i=1}^{n} c_i \cdot \Delta x_i,$$
where we denoted
$$\widetilde{y} \stackrel{\text{def}}{=} f(\widetilde{x}_1, \ldots, \widetilde{x}_n) \quad (6)$$
and
$$c_i \stackrel{\text{def}}{=} \left.\frac{\partial f}{\partial x_i}\right|_{x_1 = \widetilde{x}_1, \ldots, x_n = \widetilde{x}_n}. \quad (7)$$
In other words, the data processing function $y = f(x_1, \ldots, x_n)$ becomes linear:
$$y = \widetilde{y} + \sum_{i=1}^{n} c_i \cdot (x_i - \widetilde{x}_i). \quad (8)$$
For such linear functions, there exists a known explicit formula for their y-ranges.
Proposition 2 The y-range of a linear function (8) on intervals $[\widetilde{x}_1 - \Delta_1, \widetilde{x}_1 + \Delta_1]$, …, $[\widetilde{x}_n - \Delta_n, \widetilde{x}_n + \Delta_n]$ is equal to $[\widetilde{y} - \Delta, \widetilde{y} + \Delta]$, where $\widetilde{y} = f(\widetilde{x}_1, \ldots, \widetilde{x}_n)$ and
$$\Delta = \sum_{i=1}^{n} |c_i| \cdot \Delta_i. \quad (9)$$
Comment. This proposition is known (see, e.g., [5, 6, 11]), but for completeness, we include its proof. For readers’ convenience, all the proofs are placed in the special (last) Proofs section.
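Comment. To illustrate how formulas (3), (4), (7), and (9) work together, here is a small computational sketch. It is ours, not part of the original development: the function name linearized_range and the use of central differences to approximate the partial derivatives $c_i$ are our own choices.

```python
# Sketch of the linearized estimate (9); the partial derivatives c_i of
# formula (7) are approximated numerically by central differences.

def linearized_range(f, boxes, h=1e-6):
    """Estimate the y-range of f over narrow intervals [lo_i, hi_i].

    Returns (y_tilde - Delta, y_tilde + Delta), where y_tilde = f(midpoints)
    and Delta = sum(|c_i| * Delta_i), as in formula (9).
    """
    mids = [(lo + hi) / 2 for lo, hi in boxes]   # midpoints, formula (3)
    half = [(hi - lo) / 2 for lo, hi in boxes]   # half-widths, formula (4)
    y_tilde = f(*mids)
    delta = 0.0
    for i, m in enumerate(mids):
        plus = mids.copy();  plus[i] = m + h
        minus = mids.copy(); minus[i] = m - h
        c_i = (f(*plus) - f(*minus)) / (2 * h)   # c_i, formula (7)
        delta += abs(c_i) * half[i]              # accumulate formula (9)
    return y_tilde - delta, y_tilde + delta

# Ohm's-law example from the text: y = x1 * x2 (voltage = resistance * current)
lo, hi = linearized_range(lambda x1, x2: x1 * x2, [(0.99, 1.01), (1.99, 2.01)])
```

For a linear function, Proposition 2 guarantees that this estimate is exact; for a nonlinear function such as the product above, it is only a first-order approximation, valid for narrow intervals.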
3 Aleatory Interval Uncertainty: Towards the Precise Formulation of the Problem Why interval uncertainty in the aleatory case. Aleatory uncertainty means that in different situations, we have different actual values of the corresponding quantity. What can we say about the set of these actual values? For each physical quantity, there are usually bounds; see, e.g., [3, 14]: for example, all speeds are limited by the speed of light, all distances are bounded by the size of the Universe, etc. So, for each physical quantity, its set of actual values is bounded. For each physical quantity $x$, its value changes with time as $x(t)$, and in physics, these changes are usually continuous; see, e.g., [3, 14]. As we have mentioned, the range of a continuous function $x(t)$ on any time interval is an interval. So, we can conclude that the set of all actual values of a quantity is an interval.
Comment. In this section, we consider the situation when all the values from this interval actually occur for some objects at some moments of time. Of course, in quantum physics, changes may be discrete, and we may have a discrete set of values. The following text analyzes what happens if we take into account that not all the values from the interval are physically possible.
Notations. To distinguish an aleatory interval from an epistemic one, we will denote aleatory intervals with capital letters. For example, while the epistemic interval for a quantity $x$ is denoted by $[\underline{x}, \overline{x}]$, its aleatory interval will be denoted by $[\underline{X}, \overline{X}]$.
Resulting problem: informal description. How can we take aleatory interval uncertainty into account in data processing, when:
• we have aleatory information about the quantities $x_1, \ldots, x_n$,
• we have a data processing algorithm $y = f(x_1, \ldots, x_n)$, and
• we want to find out what aleatory knowledge we have about the quantity $y$.
To be more precise: for each of the $n$ quantities $x_1, \ldots, x_n$, we know an aleatory interval $[\underline{X}_i, \overline{X}_i]$, i.e., we know that all the values $x_i$ from this interval $[\underline{X}_i, \overline{X}_i]$ are actually occurring. We want to know which values $y$ are actually occurring.
Strictly speaking, to find out which values $y$ occur, we need to know which n-tuples $(x_1, \ldots, x_n)$ occur, i.e., we need to know the set $S$ of all the actually occurring n-tuples $(x_1, \ldots, x_n)$. Once we know this set, we can conclude that the set of actual values of $y$ is equal to
$$f(S) \stackrel{\text{def}}{=} \{f(x_1, \ldots, x_n) : (x_1, \ldots, x_n) \in S\}.$$
We do not know this set $S$; we only have three pieces of information about this set:
• first, since we only know that the values $x_i \in [\underline{X}_i, \overline{X}_i]$ actually occur, the set $S$ is contained in the "box" $[\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ of all the n-tuples for which $x_i \in [\underline{X}_i, \overline{X}_i]$ for all $i$;
• second, since each value $x_i$ from each aleatory interval $[\underline{X}_i, \overline{X}_i]$ is actually occurring, for each $i$ and for each value $x_i \in [\underline{X}_i, \overline{X}_i]$, there must exist an n-tuple in the set $S$ for which the $i$th component is equal exactly to this value;
• third, since we consider the case when all dependencies are continuous, we can conclude that the set $S$ is connected, as the set of values of a continuous function over a time interval.
We will call sets $S$ satisfying these three properties conceivable sets. A value $y$ is definitely attained if $y$ is contained in $f(S)$ for all conceivable sets. Thus, we arrive at the following definition.
Definition 2 Let $[\underline{X}_1, \overline{X}_1]$, …, $[\underline{X}_n, \overline{X}_n]$ be intervals, and let $f(x_1, \ldots, x_n)$ be a continuous real-valued function of $n$ real variables.
• We say that a set $S \subseteq [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ is conceivable if this set is connected and for every $i$ and for every value $x_i \in [\underline{X}_i, \overline{X}_i]$, there exists an n-tuple $(x_1, \ldots, x_i, \ldots, x_n) \in S$ whose $i$th component is equal to this value $x_i$.
• By the aleatory y-set, we mean the set of all the real numbers $y$ for which, for every conceivable set $S$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ for which $f(x_1, \ldots, x_n) = y$.
Notation. In the following text, we will denote the aleatory y-set by $f_a([\underline{X}_1, \overline{X}_1], \ldots, [\underline{X}_n, \overline{X}_n])$.
Proposition 3 The aleatory y-set is either an interval, or an empty set.
Comment. An example when the aleatory y-set is empty will be given in the following section.
Notation. Because of this result, in the following text, we will also call the aleatory y-set the aleatory y-interval. We denote this interval by $[\underline{Y}, \overline{Y}]$.
4 How to Compute the Aleatory y-Interval: Linearized Case Let us denote the midpoint of the $i$th aleatory interval by $\widetilde{X}_i$ and its half-width by $\delta_i > 0$. Let us consider the linearized case, when the data processing algorithm takes the form
$$y = \widetilde{Y} + \sum_{i=1}^{n} c_i \cdot (x_i - \widetilde{X}_i), \quad (10)$$
where $\widetilde{X}_i$ is the midpoint of the aleatory interval $[\underline{X}_i, \overline{X}_i]$,
$$c_i \stackrel{\text{def}}{=} \left.\frac{\partial f}{\partial x_i}\right|_{x_1 = \widetilde{X}_1, \ldots, x_n = \widetilde{X}_n},$$
and
$$\widetilde{Y} \stackrel{\text{def}}{=} f(\widetilde{X}_1, \ldots, \widetilde{X}_n). \quad (11)$$
In this case, we can efficiently compute the aleatory y-interval by using the following result.
Proposition 4 Suppose that we have $n$ intervals $[\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]$, and the function $y = f(x_1, \ldots, x_n)$ has the form (10). Let us denote
$$\delta \stackrel{\text{def}}{=} 2 \cdot \max_{i=1,\ldots,n} |c_i| \cdot \delta_i - \sum_{i=1}^{n} |c_i| \cdot \delta_i.$$
Then, the aleatory y-interval has the following form:
• if $\delta < 0$, then $f_a([\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]) = \emptyset$; and
• if $\delta \ge 0$, then $f_a([\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]) = [\widetilde{Y} - \delta, \widetilde{Y} + \delta]$.
Corollary 1 For the case when we have two intervals $[\underline{X}_1, \overline{X}_1]$ and $[\underline{X}_2, \overline{X}_2]$ and $f(x_1, x_2) = x_1 + x_2$, we get the aleatory y-interval
$$[\underline{Y}, \overline{Y}] = [\underline{X}_1, \overline{X}_1] +_a [\underline{X}_2, \overline{X}_2] = [\min(\underline{X}_1 + \overline{X}_2, \overline{X}_1 + \underline{X}_2), \max(\underline{X}_1 + \overline{X}_2, \overline{X}_1 + \underline{X}_2)].$$
Corollary 2 For the case when we have two intervals $[\underline{X}_1, \overline{X}_1]$ and $[\underline{X}_2, \overline{X}_2]$ and $f(x_1, x_2) = x_1 - x_2$, we get the aleatory y-interval
$$[\underline{Y}, \overline{Y}] = [\underline{X}_1, \overline{X}_1] -_a [\underline{X}_2, \overline{X}_2] = [\min(\underline{X}_1 - \underline{X}_2, \overline{X}_1 - \overline{X}_2), \max(\underline{X}_1 - \underline{X}_2, \overline{X}_1 - \overline{X}_2)].$$
Example. For intervals [0, 2] and [1, 4], we get
$$[0, 2] +_a [1, 4] = [\min(0 + 4, 2 + 1), \max(0 + 4, 2 + 1)] = [\min(4, 3), \max(4, 3)] = [3, 4].$$
This is clearly different from the usual interval addition $[0, 2] + [1, 4] = [0 + 1, 2 + 4] = [1, 6]$ corresponding to the epistemic or independent aleatory case.
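Comment. The following sketch (ours; the helper name aleatory_range is hypothetical) implements the $\delta$-formula of Proposition 4 for linear functions and reproduces the example above.

```python
# Illustration of Proposition 4 for a linear function
# y = Y_tilde + sum_i c_i * (x_i - X_tilde_i) on aleatory intervals.

def aleatory_range(c, intervals):
    """Aleatory y-interval for a linear map with coefficients c on the
    given intervals; returns None for the empty set (delta < 0)."""
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    half = [(hi - lo) / 2 for lo, hi in intervals]
    y_tilde = sum(ci * m for ci, m in zip(c, mids))
    terms = [abs(ci) * d for ci, d in zip(c, half)]
    delta = 2 * max(terms) - sum(terms)      # delta from Proposition 4
    if delta < 0:
        return None                          # empty aleatory y-set
    return (y_tilde - delta, y_tilde + delta)

# Example from the text: [0, 2] +_a [1, 4] = [3, 4],
# while the epistemic sum would be [1, 6].
result = aleatory_range([1, 1], [(0, 2), (1, 4)])

# Three equal-width intervals under a sum: delta = 2 - 3 < 0, so the
# aleatory y-set is empty.
empty = aleatory_range([1, 1, 1], [(0, 2), (0, 2), (0, 2)])
```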
5 Computing the Aleatory y-Interval Is, in General, NP-Hard Discussion. In the previous section, we showed that when the function $y = f(x_1, \ldots, x_n)$ is linear, we can efficiently compute the resulting aleatory y-interval. It turns out that, similarly to the above-mentioned case of epistemic interval uncertainty, if we consider the next-in-complexity class of functions, namely quadratic functions, the problem becomes NP-hard.
Proposition 5 For quadratic functions $f(x_1, \ldots, x_n)$, the problem of computing the aleatory y-interval is NP-hard.
6 Case of Partial or Full Independence Discussion. In the above development, we considered the case when all we know is the set of actual values for each quantity $x_i$, and we do not know whether there is any dependence between these quantities. Because we allow cases of possible dependence, we can have somewhat counter-intuitive conclusions. For example, for the function $f(x_1, x_2) = x_1 \cdot x_2$ and aleatory intervals $[\underline{X}_1, \overline{X}_1] = [0, a_1]$ and $[\underline{X}_2, \overline{X}_2] = [0, a_2]$ for some $a_i > 0$, the aleatory y-interval consists of a single value 0.
Indeed, in this case, each conceivable set $S$ contains a 2-tuple $(0, x_2)$ for which $x_1 \cdot x_2 = 0$. Thus, all y-ranges contain 0 and hence, the aleatory y-interval, which is the intersection of these y-ranges, also contains 0. On the other hand, the set $S$ consisting of all the points $(0, x_2)$ and $(x_1, 0)$ is conceivable. For this set, $f(S) = \{0\}$. Thus, the intersection of all y-ranges cannot contain any non-zero values and is, thus, indeed equal to $[\underline{Y}, \overline{Y}] = [0, 0]$.
In some cases, however, we know that some sets of quantities are independent, i.e., that the fact that one of them has a value $x_i$ should not affect the set of actually occurring values of the other quantities $x_j$ from this set. In such cases, all possible combinations $(x_i, x_j, \ldots)$ of actually occurring values are also actually occurring. So, we get the following modification of Definition 2.
Definition 3 Let $[\underline{X}_1, \overline{X}_1]$, …, $[\underline{X}_n, \overline{X}_n]$ be intervals, let $f(x_1, \ldots, x_n)$ be a continuous real-valued function of $n$ real variables, and let $\mathcal{F}$ be a class of subsets $F \subseteq \{1, \ldots, n\}$ that contains all 1-element subsets $\{1\}, \ldots, \{n\}$.
• We say that a set $S \subseteq [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ is $\mathcal{F}$-conceivable if:
– this set is connected, and
– for each set $F = \{i_1, i_2, \ldots\} \in \mathcal{F}$ and for each combination $(x_{i_1}, x_{i_2}, \ldots)$ of values $x_{i_k} \in [\underline{X}_{i_k}, \overline{X}_{i_k}]$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ with these values $x_{i_1}, x_{i_2}, \ldots$
• By the $\mathcal{F}$-aleatory y-set, we mean the set of all the real numbers $y$ for which, for every $\mathcal{F}$-conceivable set $S$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ for which $f(x_1, \ldots, x_n) = y$.
Notation. In the following text, we will denote the $\mathcal{F}$-aleatory y-set by $f_a^{\mathcal{F}}([\underline{X}_1, \overline{X}_1], \ldots, [\underline{X}_n, \overline{X}_n])$.
Proposition 6 The $\mathcal{F}$-aleatory y-set is either an interval, or an empty set.
Notation. Because of this result, in the following text, we will also call the $\mathcal{F}$-aleatory y-set the $\mathcal{F}$-aleatory y-interval. We denote this interval by $[\underline{Y}, \overline{Y}]$.
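Comment. A quick numerical illustration (ours) of the argument above: we sample the "cross" conceivable set $S$ consisting of the points $(0, x_2)$ and $(x_1, 0)$; every value of each input is attained on $S$, yet the image under multiplication is just $\{0\}$.

```python
# The "cross" conceivable set S = ({0} x [0, a2]) U ([0, a1] x {0}):
# it is connected (both arms meet at the origin), and every value of x1
# in [0, a1] and of x2 in [0, a2] occurs in some tuple, yet the product
# f(x1, x2) = x1 * x2 is identically 0 on S.

a1, a2 = 2.0, 3.0
n = 101
cross = [(0.0, a2 * k / (n - 1)) for k in range(n)] + \
        [(a1 * k / (n - 1), 0.0) for k in range(n)]

# every sampled value of each coordinate is attained on S ...
x1_values = {p[0] for p in cross}
x2_values = {p[1] for p in cross}

# ... yet the image f(S) contains only 0
image = {x1 * x2 for (x1, x2) in cross}
```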
Case of full independence. In the case of full independence, when the class $\mathcal{F}$ contains the set $\{1, \ldots, n\}$, all n-tuples from the box are conceivable, and there is only one conceivable set, the set of all combinations of the actual values of $x_i$: $S = [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$. Thus, the $\mathcal{F}$-aleatory y-interval is equal to the y-range of the function $f(x_1, \ldots, x_n)$ on these intervals:
$$f([\underline{X}_1, \overline{X}_1], \ldots, [\underline{X}_n, \overline{X}_n]) \stackrel{\text{def}}{=} \{f(x_1, \ldots, x_n) : x_1 \in [\underline{X}_1, \overline{X}_1], \ldots, x_n \in [\underline{X}_n, \overline{X}_n]\}.$$
This is exactly the same formula as for the epistemic uncertainty. Thus, in this case of full independence, to compute the aleatory y-interval, we can use the same interval computations techniques as for epistemic intervals.
Linearized case. In the linearized case, there are explicit formulas for the $\mathcal{F}$-aleatory y-interval:
Proposition 7 Suppose that we have $n$ intervals $[\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]$, and the function $y = f(x_1, \ldots, x_n)$ has the form (10). Let us denote
$$\delta \stackrel{\text{def}}{=} 2 \cdot \max_{F \in \mathcal{F}} \sum_{i \in F} |c_i| \cdot \delta_i - \sum_{j=1}^{n} |c_j| \cdot \delta_j.$$
Then, the $\mathcal{F}$-aleatory y-interval has the following form:
• if $\delta < 0$, then $f_a^{\mathcal{F}}([\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]) = \emptyset$; and
• if $\delta \ge 0$, then $f_a^{\mathcal{F}}([\widetilde{X}_1 - \delta_1, \widetilde{X}_1 + \delta_1], \ldots, [\widetilde{X}_n - \delta_n, \widetilde{X}_n + \delta_n]) = [\widetilde{Y} - \delta, \widetilde{Y} + \delta]$.
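Comment. A small sketch (ours; the function name f_aleatory_delta is hypothetical) of the $\delta$-formula from Proposition 7, showing how independence enlarges $\delta$: with only the singleton sets (no known independence) the y-set can be empty, while with full independence $\delta$ equals $\sum_j |c_j| \cdot \delta_j$, i.e., the epistemic half-width of Proposition 2.

```python
# delta from Proposition 7: it depends on the class F of index sets
# known to be independent.

def f_aleatory_delta(c, half_widths, independent_sets):
    """delta = 2 * max_{F in F} sum_{i in F} |c_i| d_i  -  sum_j |c_j| d_j."""
    terms = [abs(ci) * d for ci, d in zip(c, half_widths)]
    best = max(sum(terms[i] for i in F) for F in independent_sets)
    return 2 * best - sum(terms)

c = [1.0, 1.0, 1.0]
d = [1.0, 1.0, 1.0]
singletons = [{0}, {1}, {2}]            # no independence: as in Definition 2
full = singletons + [{0, 1, 2}]         # full independence

delta_dependent = f_aleatory_delta(c, d, singletons)   # 2*1 - 3 = -1: empty set
delta_independent = f_aleatory_delta(c, d, full)       # 2*3 - 3 =  3: full range
```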
7 What if We Take Discreteness Into Account In the previous text, we assumed that all the changes are continuous and thus, that all the ranges are connected. As we have mentioned, according to quantum physics, there can be discrete transitions. In most cases, however, these transitions are small, so that the distance between the previous state and the new state does not exceed some small number $\varepsilon > 0$. In this case, for the set of actual combinations of values, instead of the original connectedness, we have a similar notion of $\varepsilon$-connectedness:
Definition 4 Let $\varepsilon > 0$ be a real number. We say that a set $S \subseteq \mathbb{R}^n$ is $\varepsilon$-connected if every two points $x, x' \in S$ can be connected by a sequence $x = x^{(1)}, x^{(2)}, \ldots, x^{(m-1)}, x^{(m)} = x'$ for which $d(x^{(i)}, x^{(i+1)}) \le \varepsilon$ for all $i = 1, \ldots, m - 1$.
It turns out that such sets can be approximated by connected sets.
Definition 5 We say that a connected set $C$ is $\varepsilon$-close to the set $S$ if $S \subseteq C$ and every element of $C$ is $\varepsilon$-close to some element of the set $S$.
Comment. In particular, we say that an interval $[\underline{X}, \overline{X}]$ is an $\varepsilon$-aleatory interval if for every value $x$ from this interval, there is an $\varepsilon$-close actual value.
Proposition 8 For every $\varepsilon$-connected set $S$, there exists an $\varepsilon$-close connected set $C$.
One can easily see that for continuous functions $f(x_1, \ldots, x_n)$, the image of an $\varepsilon$-connected set is $\varepsilon'$-connected, for an appropriate $\varepsilon'$.
Proposition 9 For each box $B = [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ and for every $\varepsilon$-connected set $S \subseteq B$, the y-range $f(S)$ is $\varepsilon'$-connected, for
$$\varepsilon' = c_f(\varepsilon) \stackrel{\text{def}}{=} \sup\{d(f(x), f(x')) : d(x, x') \le \varepsilon\}. \quad (12)$$
Thus, we arrive at the following definition:
Definition 6 Let $\varepsilon > 0$ be given, let $[\underline{X}_1, \overline{X}_1]$, …, $[\underline{X}_n, \overline{X}_n]$ be intervals for which $\overline{X}_i - \underline{X}_i \ge 2\varepsilon$, and let $f(x_1, \ldots, x_n)$ be a continuous real-valued function of $n$ real variables.
• We say that a set $S \subseteq [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ is $\varepsilon$-conceivable if this set is connected and for every $i$ and for every value $x_i \in [\underline{X}_i, \overline{X}_i]$, there exists an n-tuple $(x_1, \ldots, x_i, \ldots, x_n) \in S$ whose $i$th component is $\varepsilon$-close to this value.
• By the $\varepsilon$-aleatory y-set, we mean the set of all the real numbers $y$ for which, for every $\varepsilon$-conceivable set $S$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ for which $f(x_1, \ldots, x_n) = y$.
Notation. In the following text, we will denote the $\varepsilon$-aleatory y-set by $f_{a,\varepsilon}([\underline{X}_1, \overline{X}_1], \ldots, [\underline{X}_n, \overline{X}_n])$.
Discussion. From the computational viewpoint, this case can be reduced to the continuous case:
Proposition 10 For every function $f(x_1, \ldots, x_n)$ and for all intervals $[\underline{X}_1, \overline{X}_1]$, …, $[\underline{X}_n, \overline{X}_n]$, the $\varepsilon$-aleatory y-set has the form
$$f_{a,\varepsilon}([\underline{X}_1, \overline{X}_1], \ldots, [\underline{X}_n, \overline{X}_n]) = f_a([\underline{X}_1 + \varepsilon, \overline{X}_1 - \varepsilon], \ldots, [\underline{X}_n + \varepsilon, \overline{X}_n - \varepsilon]).$$
What if we allow discontinuities of arbitrary size? In this case, there is no justification for connectedness or $\varepsilon$-connectedness, so we can have the following new definition, in which $*$ indicates that we no longer require connectedness:
Definition 7 Let $[\underline{X}_1, \overline{X}_1]$, …, $[\underline{X}_n, \overline{X}_n]$ be intervals, let $f(x_1, \ldots, x_n)$ be a continuous real-valued function of $n$ real variables, and let $\varepsilon > 0$.
• We say that a set $S \subseteq [\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$ is $*$-conceivable if for every $i$ and for every value $x_i \in [\underline{X}_i, \overline{X}_i]$, there exists an n-tuple $(x_1, \ldots, x_i, \ldots, x_n) \in S$ whose $i$th component is $\varepsilon$-close to this value.
• By the $*$-aleatory y-set, we mean the set of all the real numbers $y$ for which, for every $*$-conceivable set $S$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ for which
$f(x_1, \ldots, x_n) = y$.
In some cases, we still have a non-empty $*$-aleatory y-set: e.g., in the above example of multiplication and $[\underline{X}_i, \overline{X}_i] = [0, a_i]$, we still have $[0, 0]$ as the $*$-aleatory y-set. However, in the linearized case, this definition does not lead to any meaningful result: this set is always empty.
Proposition 11 For each linear function $f(x_1, \ldots, x_n)$ of $n \ge 2$ variables that actually depends on all its variables, i.e., for which all the coefficients are different from 0, the $*$-aleatory y-set is empty.
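Comment. In the linearized case, Proposition 10 combines with Proposition 4 into a one-line reduction: shrink each interval by $\varepsilon$ at both endpoints, then apply the continuous-case formula. A sketch (ours; both function names are hypothetical):

```python
# Combining Proposition 10 with Proposition 4 for a linear function: the
# epsilon-aleatory y-interval is the ordinary aleatory y-interval on the
# intervals shrunk by epsilon at both endpoints.

def linear_aleatory(c, intervals):
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    half = [(hi - lo) / 2 for lo, hi in intervals]
    terms = [abs(ci) * d for ci, d in zip(c, half)]
    delta = 2 * max(terms) - sum(terms)          # Proposition 4
    y = sum(ci * m for ci, m in zip(c, mids))
    return None if delta < 0 else (y - delta, y + delta)

def linear_eps_aleatory(c, intervals, eps):
    shrunk = [(lo + eps, hi - eps) for lo, hi in intervals]
    return linear_aleatory(c, shrunk)            # Proposition 10

exact = linear_aleatory([2, 1], [(0, 2), (1, 4)])           # continuous case
fuzzy = linear_eps_aleatory([2, 1], [(0, 2), (1, 4)], 0.5)  # discrete jumps up to 0.5
```

As expected, allowing discrete jumps of size up to $\varepsilon$ can only shrink the set of definitely attained values.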
8 Proofs Proof of Proposition 2 The largest value of the expression (8) is attained when each term $c_i \cdot \Delta x_i$ in the sum is the largest.
• When $c_i$ is positive, this largest value is attained when $\Delta x_i$ is the largest, i.e., when $\Delta x_i = \Delta_i$.
• When $c_i$ is negative, this largest value is attained when $\Delta x_i$ is the smallest, i.e., when $\Delta x_i = -\Delta_i$.
In both cases, the largest value of this term is equal to $|c_i| \cdot \Delta_i$, so the largest value of $y$ is equal to $\widetilde{y} + \Delta$, where we denoted
$$\Delta = \sum_{i=1}^{n} |c_i| \cdot \Delta_i.$$
Similarly, we can show that the smallest value of $y$ is equal to $\widetilde{y} - \Delta$. Thus, the range of conceivable values of $y$ is indeed equal to $[\widetilde{y} - \Delta, \widetilde{y} + \Delta]$. The proposition is proven.
Proof of Proposition 3 By Definition 2, a real number $y$ belongs to the aleatory y-set if and only if for every conceivable set $S$, there exists an n-tuple $(x_1, \ldots, x_n) \in S$ for which $f(x_1, \ldots, x_n) = y$, i.e., for which $y$ belongs to the y-range $f(S) = \{y = f(x_1, \ldots, x_n) : (x_1, \ldots, x_n) \in S\}$. This means that the aleatory y-set is the intersection of the y-ranges $f(S)$ corresponding to all conceivable sets $S$. Each conceivable set $S$ is bounded, since it is contained in the bounded box $[\underline{X}_1, \overline{X}_1] \times \cdots \times [\underline{X}_n, \overline{X}_n]$, and connected. The function $f(x_1, \ldots, x_n)$ is continuous, thus the corresponding range $f(S)$ is also bounded and connected, i.e., is an interval. The intersection of intervals is either an interval or an empty set. The proposition is proven.
Proof of Proposition 4 In this proof, we use ideas from [7].
1°. One can see that to prove the proposition, we need to prove the following two statements:
• that when $\delta \ge 0$ and $|y - \widetilde{Y}| \le \delta$, then, for each conceivable set $S$, we have $y = f(x_1, \ldots, x_n)$ for some n-tuple $(x_1, \ldots, x_n) \in S$, and
• that when $|y - \widetilde{Y}| > \delta$, there exists a conceivable set $S$ for which, for all n-tuples $(x_1, \ldots, x_n) \in S$, we have $y \ne f(x_1, \ldots, x_n)$.
Let us prove these two statements one by one.
2°. Let us first prove that if $\delta \ge 0$ and $|y - \widetilde{Y}| \le \delta$, then the aleatory y-interval contains the corresponding value $y$.
2.1°. Let us first prove that the aleatory y-interval contains both values $\widetilde{Y} - \delta$ and $\widetilde{Y} + \delta$, i.e., that for every conceivable set $S$, the range $f(S)$ contains both these values. To prove this, let us denote by $i_0$ the index at which the product $|c_i| \cdot \delta_i$ attains its maximum:
$$|c_{i_0}| \cdot \delta_{i_0} = \max_i |c_i| \cdot \delta_i.$$
In these terms, the expression for $\delta$ has the following form:
$$\delta = 2 \cdot |c_{i_0}| \cdot \delta_{i_0} - \sum_{i=1}^{n} |c_i| \cdot \delta_i.$$
The sum in the right-hand side can be represented as
$$\sum_{i=1}^{n} |c_i| \cdot \delta_i = |c_{i_0}| \cdot \delta_{i_0} + \sum_{i \ne i_0} |c_i| \cdot \delta_i,$$
thus
$$\delta = |c_{i_0}| \cdot \delta_{i_0} - \sum_{i \ne i_0} |c_i| \cdot \delta_i.$$
Let us also denote, by $s_i$, the sign of the coefficient $c_i$, i.e.:
• $s_i = 1$ if $c_i \ge 0$, and
• $s_i = -1$ if $c_i < 0$.
In this case, for all $i$, we have $c_i \cdot s_i = |c_i|$. Since the set $S$ is conceivable, there exists an n-tuple $(x_1, \ldots, x_n)$ for which $x_{i_0} = \widetilde{X}_{i_0} + s_{i_0} \cdot \delta_{i_0}$ and $|x_i - \widetilde{X}_i| \le \delta_i$ for all other $i$. For this n-tuple, we have
$$f(x_1, \ldots, x_n) = \widetilde{Y} + c_{i_0} \cdot (x_{i_0} - \widetilde{X}_{i_0}) + \sum_{i \ne i_0} c_i \cdot (x_i - \widetilde{X}_i). \quad (13)$$
Here,
$$c_{i_0} \cdot (x_{i_0} - \widetilde{X}_{i_0}) = c_{i_0} \cdot s_{i_0} \cdot \delta_{i_0} = |c_{i_0}| \cdot \delta_{i_0}, \quad (14)$$
413
and for each i = i 0 , we have X i ) ≥ −|ci · (xi − X i )| = −|ci | · |xi − X i | ≥ −|ci | · δi . ci · (xi −
(15)
Due to (14) and (15), we have + |ci0 | · δi0 − f (x1 , . . . , xn ) ≥ Y
|ci | · δi ,
i =i 0
+ δ. Thus, for each conceivable set S, the y-range f (S) i.e., f (x1 , . . . , xn ) ≥ Y + δ. Let us denote this value contains a value which is larger than or equal to Y by y+ . Since the set S is conceivable, there exists an n-tuple (x1 , . . . , xn ) for which X i0 − si0 · δi0 and |xi − X i | ≤ δi for all other i. For this n-tuple, we have the xi0 = formula (13). Here, X i0 ) = ci0 · (−si0 · δi0 ) = −|ci0 | · δi0 , ci0 · (xi0 −
(16)
and for each i = i 0 , we have X i ) ≤ |ci · (xi − X i )| = |ci | · |xi − X i | ≤ |ci | · δi . ci · (xi −
(17)
Due to (16) and (17), we have − |ci0 | · δi0 + f (x1 , . . . , xn ) ≤ Y
|ci | · δi ,
i =i 0
− δ. Thus, for each conceivable set S, the y-range f (S) i.e., f (x1 , . . . , xn ) ≤ Y − δ. Let us denote this value contains a value which is smaller than or equal to Y by y− . The y-range f (S) contains the values y+ and y− for which − δ ≤ Y + δ ≤ y+ . y− ≤ Y Since the y-range f (S) is an interval, it must also contain both intermediate values − δ and Y + δ. So, this statement is proven. Y ◦ | ≤ δ, the 2.2 . Let us now prove that, once we have a value y for which |y − Y aleatory y-interval contains the value y. | ≤ δ means that Y − δ ≤ y ≤ Y + δ. We have Indeed, the inequality |y − Y − δ and Y + δ. Thus, proved that the aleatory y-interval contains both values Y since the aleatory y-set is an interval, it should also contain the intermediate value y. The first part of the statement 1◦ is proven. | > δ, then there exists a conceivable set S for 3◦ . Let us now prove that if |y − Y which, for all n-tuples (x1 , . . . , xn ) ∈ S, we have f (x1 , . . . , xn ) = y.
414
M. T. Mizukoshi et al.
> δ. In this case, Without loss of generality, we can consider the case when y − Y + δ. The case when y − Y < −δ can be proven the same way. y>Y Let us take, as S, the set consisting of all n-tuples of the type X i−1 − si−1 · δi−1 , xi , X i+1 − si+1 · δi+1 , . . . , X n − sn · δn ), ( X 1 − s1 · δ1 , . . . , (18) X i − δi , X i + δi ]. for all i and for all xi ∈ [ This set consists of n connected components corresponding to different values X 2 − s2 · δ2 , . . . , Xn − i. All these components have a common point ( X 1 − s1 · δ1 , sn · δn ) through which we can connect points from the two different component sets. Thus, the whole set S is connected. X i − δi , X i + δi ], there It is easy to see that for every i, for every point xi ∈ [ exists an n-tuple from the set S with exactly this value of xi – namely, we can take the corresponding point from the ith component of the set S. Thus, the set S is conceivable. Let us show that for all n-tuples (x1 , . . . , xn ) from this set S, we have f (x1 , . . . , xn ) + δ and thus, f (x1 , . . . , xn ) < y and f (x1 , . . . , xn ) = y. Since the value y does ≤Y not belong to the y-range f (S) for one of the conceivable sets, this means that this value y does not belong to the y-aleatory interval [Y , Y ] which is the intersection of the y-ranges corresponding to all conceivable sets. Indeed, for each n-tuple (18), we have + f (x1 , . . . , xn ) = Y
n
+ ci · (xi − c j · (x j − X j) = Y Xi ) +
c j · (x j − X j) =
j =i
j=1
y + ci · (xi − Xi ) +
c j · s j · (−δ j ) = y + ci · (xi − Xi ) −
j =i
|c j | · δ j .
j =i
Here, |xi − X i | ≤ δi , so ci · (xi − X i ) ≤ |ci | · δi and thus, f (x1 , . . . , xn ) ≤ y + |ci | · δi −
⎛ + ⎝2 · |ci | · δi − |c j | · δ j = Y
j =i
n
⎞ |c j | · δ j ⎠ .
j=1
We have |ci | · δi ≤ max |c j | · δ j , j
therefore ⎛ + ⎝2 · max |c j | · δ j − f (x1 , . . . , xn ) ≤ Y j
The statement is proven, and so is the Proposition.
n j=1
⎞ + δ. |c j | · δ j ⎠ = Y
Epistemic Versus Aleatory: Case of Interval Uncertainty
415
Proof of Corollary 1 1◦ . For the sum function f (x1 , x2 ) = x1 + x2 , we have n = 2 and c1 = c2 = 1. In this case, the expression for δ takes the form δ = 2 · max(δ1 , δ2 ) − (δ1 + δ2 ). The expression max(δ1 , δ2 ) is equal either to δ1 or to δ2 —depending on which of these values is larger. Let us consider both case δ1 ≥ δ2 and δ1 < δ2 . 1.1◦ . Let us first consider the case when δ1 ≥ δ2 . In this case, max(δ1 , δ2 ) = δ1 , and we have δ = 2δ1 − (δ1 + δ2 ) = δ1 − δ2 . Here, = X 2 , so Y X1 + − δ = X1 + X 2 − δ1 + δ2 = ( X 1 − δ1 ) + ( X 2 + δ2 ) = X 1 + X 2 Y =Y and similarly, + δ = Y =Y X1 + X 2 + δ1 − δ2 = ( X 1 + δ1 ) + ( X 2 − δ2 ) = X 1 + X 2 . Since always Y ≤ Y , we thus have X 1 + X 2 ≤ X 1 + X 2 . 1.2◦ . Similarly, when δ1 ≤ δ2 , we get Y = X 1 + X 2 ≤ Y = X 1 + X 2 . 2◦ . In both cases, Y = min(X 1 + X 2 , X 1 + X 2 ) and Y = max(X 1 + X 2 , X 1 + X 2 ). The corollary is proven. Proof of Corollary 2 The difference x1 − x2 can be represented as x1 + (−x2 ), where the set of known actual values of −x2 is equal to {−x2 : x2 ∈ [X 2 , X 2 ]} = [−X 2 , −X 2 ]. If we apply the formula from Corollary 1 to this expression, we get exactly the expression from Corollary 2. The statement is proven. Proof of Proposition 5 1◦ . By definition, a problem is NP-hard if every problem from the class NP can be reduced to this problem; see, e.g., [6, 12]. Thus, to prove that our problem is NP-hard, it is sufficient to show that a known NP-hard problem can be reduced to our problem. Then, for any problem from the class NP, by combining the reduction to the known problem with the reduction of the known problem, we will get the desired reduction to our problem—and thus, prove that our problem is NP-hard. In this proof, as a known NP-hard problem, we take the following partition problem: given m positive integers s1 , . . . , sm , divide them into two groups with equal sum. If we move all the terms of the desired equality i∈G
si =
j ∈G /
sj
M. T. Mizukoshi et al.
to the left-hand side, we get an equivalent equality
$$\sum_{i=1}^{m} \eta_i \cdot s_i = 0, \qquad (19)$$
where $\eta_i = 1$ if $i \in G$ and $\eta_i = -1$ otherwise. Thus, the partition problem is equivalent to checking whether there exist values $\eta_i \in \{-1, 1\}$ for which the equality (19) is true.

Let us show that each instance of this problem can be reduced to the following instance of the problem of computing the aleatory y-interval:
• $n = m + 1$,
• $[\underline X_i, \overline X_i] = [-s_i, s_i]$ for all $i \le m$, and
$$[\underline X_{m+1}, \overline X_{m+1}] = [0, s], \qquad (20)$$
where we denoted
$$s \stackrel{\text{def}}{=} \frac{1}{m} \cdot \sum_{i=1}^{m} s_i^2, \qquad (21)$$
and
• $$f(x_1, \ldots, x_m, x_{m+1}) = x_{m+1} - V, \qquad (22)$$
where
$$V \stackrel{\text{def}}{=} \frac{1}{m} \cdot \sum_{i=1}^{m} x_i^2 - \left(\frac{1}{m} \cdot \sum_{i=1}^{m} x_i\right)^2. \qquad (23)$$
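To make this reduction concrete, here is a small brute-force Python sketch (the function names are ours, purely illustrative). It checks, for tiny instances, the key property used in this proof: the maximum $M$ of $V$ over the box $[-s_1, s_1] \times \cdots \times [-s_m, s_m]$ equals the threshold $s$ from (21) exactly when the partition instance has a solution. Since $V$ is a convex quadratic, its maximum over a box is attained at a vertex, so enumerating the sign vectors $\eta_i \in \{-1, 1\}$ suffices.

```python
from itertools import product

def sample_variance(xs):
    # V = (1/m) * sum(x_i^2) - ((1/m) * sum(x_i))^2, as in formula (23)
    m = len(xs)
    return sum(x * x for x in xs) / m - (sum(xs) / m) ** 2

def max_variance_on_box(s):
    # V is a convex quadratic, so its maximum over the box
    # [-s_1, s_1] x ... x [-s_m, s_m] is attained at a vertex,
    # i.e., at some point x_i = eta_i * s_i with eta_i in {-1, +1}
    return max(sample_variance([e * si for e, si in zip(eta, s)])
               for eta in product((-1, 1), repeat=len(s)))

def has_partition(s):
    # brute-force check of the partition problem itself
    total = sum(s)
    if total % 2:
        return False
    return any(sum(si for pick, si in zip(mask, s) if pick) == total // 2
               for mask in product((0, 1), repeat=len(s)))

for s in ([1, 2, 3], [1, 2, 4]):   # 1 + 2 = 3 has a partition; 1, 2, 4 does not
    M = max_variance_on_box(s)
    s_bar = sum(si * si for si in s) / len(s)   # the threshold s from (21)
    print(has_partition(s), abs(M - s_bar) < 1e-9)
# prints "True True", then "False False": M reaches s exactly when a partition exists
```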
Comment. In the following text, we will use the fact that the expression $V$—which is actually the expression for sample variance—can be equivalently reformulated as
$$V = \frac{1}{m} \cdot \sum_{i=1}^{m} \left(x_i - \frac{1}{m} \cdot \sum_{j=1}^{m} x_j\right)^2,$$
and is, thus, always non-negative: $V \ge 0$.

2◦. Let us recall—see, e.g., [1, 2]—that for the maximum
$$M \stackrel{\text{def}}{=} \max_{x_i \in [-s_i, s_i]} V, \qquad (24)$$
we have the following property:
• if the original instance of the partition problem has a solution, then $M = s$, and
• if the original instance of the partition problem does not have a solution, then
$M < s$.

Indeed, since $|x_i| \le s_i$, we have $x_i^2 \le s_i^2$ and thus,
$$V = \frac{1}{m} \cdot \sum_{i=1}^{m} x_i^2 - \left(\frac{1}{m} \cdot \sum_{i=1}^{m} x_i\right)^2 \le \frac{1}{m} \cdot \sum_{i=1}^{m} x_i^2 \le \frac{1}{m} \cdot \sum_{i=1}^{m} s_i^2 = s. \qquad (25)$$
Hence, the maximum $M$ of this expression is always smaller than or equal to $s$.

When the original instance of the partition problem has a solution $\eta_i$, then for $x_i = \eta_i \cdot s_i$, we have
$$\sum_{i=1}^{m} x_i = \sum_{i=1}^{m} \eta_i \cdot s_i = 0 \qquad (26)$$
and $x_i^2 = s_i^2$, thus
$$V = \frac{1}{m} \cdot \sum_{i=1}^{m} x_i^2 - \left(\frac{1}{m} \cdot \sum_{i=1}^{m} x_i\right)^2 = \frac{1}{m} \cdot \sum_{i=1}^{m} x_i^2 = \frac{1}{m} \cdot \sum_{i=1}^{m} s_i^2 = s.$$
Thus, in this case, the maximum $M$ is indeed equal to $s$.

Vice versa, if the maximum $M$ is equal to $s$, this means that for some tuple $(x_1, \ldots, x_m)$, we have equality in both inequalities that form the formula (25). The fact that we have equality in the first inequality means that we have the equality (26), i.e., that $\sum_{i} x_i = 0$. The fact that we have equality in the second inequality means that we have $x_i^2 = s_i^2$ for all $i$, i.e., that $x_i = \pm s_i$, i.e., that $x_i = \eta_i \cdot s_i$ for some $\eta_i \in \{-1, 1\}$. Together, these two facts mean that the values $\eta_i$ form a solution to the original instance of the partition problem.

Thus, if the original instance of the partition problem does not have a solution, we cannot have $M = s$. Since we always have $M \le s$, this means that in this case, we must have $M < s$. The statement is proven.

Comment. As shown in [1, 2], we can feasibly compute a value $\delta > 0$ such that if the original instance of the partition problem does not have a solution, then we have $M \le s - \delta$.

3◦. Let us prove that for the above problem, the aleatory y-interval is equal to $[0, s - M]$.

3.1◦. First, let us prove that for every conceivable set $S$, the y-range $f(S)$ contains the interval $[0, s - M]$.

3.1.1◦. Let us first prove that the y-range $f(S)$ contains a value which is larger than or equal to $s - M$.
Indeed, by definition of a conceivable set, the set $S$ contains an $n$-tuple for which $x_{m+1} = s$. The value $y_+ \stackrel{\text{def}}{=} f(x_1, \ldots, x_m, s)$ corresponding to this $n$-tuple is obtained by subtracting, from $x_{m+1} = s$, the expression $V$ whose maximum is $M$. Thus, $V \le M$, and therefore, $y_+ = s - V \ge s - M$. So, the y-range $f(S)$ contains a value $y_+ \ge s - M$.

3.1.2◦. Let us now prove that the y-range $f(S)$ contains a value which is smaller than or equal to 0.

Indeed, by definition of a conceivable set, the set $S$ contains an $n$-tuple for which $x_{m+1} = 0$. The value $y_- \stackrel{\text{def}}{=} f(x_1, \ldots, x_m, 0)$ corresponding to this $n$-tuple is obtained by subtracting, from $x_{m+1} = 0$, a non-negative expression $V$. Thus, $y_- = 0 - V \le 0$.

3.1.3◦. The interval $f(S)$ contains a value $y_- \le 0$ and a value $y_+ \ge s - M$, thus this interval must contain all the values between $y_-$ and $y_+$, including all the values from the interval $[0, s - M]$. The Statement 3.1 is proven.

3.2◦. Let us now prove that there exists a conceivable set $S$ for which the y-range $f(S)$ does not contain any values larger than $s - M$.

Indeed, let $x_1^{\rm opt}, \ldots, x_m^{\rm opt}$ be a tuple at which the expression $V$ attains its maximum $M$. Then, we can take the set $S$ consisting of:
• all the $(m+1)$-tuples $(x_1^{\rm opt}, \ldots, x_m^{\rm opt}, x_{m+1})$ corresponding to all the values $x_{m+1} \in [0, s]$, and
• all the $(m+1)$-tuples $(x_1, \ldots, x_m, 0)$ corresponding to all $m$-tuples $(x_1, \ldots, x_m)$.

One can easily check that this set is conceivable.
• For $(m+1)$-tuples from the first component of this set, we have $y = x_{m+1} - M \le s - M$.
• For $(m+1)$-tuples from the second component of this set, we have $y = 0 - V \le 0$ and thus, $y \le s - M$.

Thus, for this set $S$, all the values from the y-range $f(S)$ are indeed smaller than or equal to $s - M$.

3.3◦. Let us prove that there exists a conceivable set $S$ for which the y-range $f(S)$ does not contain any negative values.

Indeed, let us take the set $S$ consisting of:
• all the $(m+1)$-tuples $(0, \ldots, 0, x_{m+1})$ corresponding to all the values $x_{m+1} \in [0, s]$, and
• all the $(m+1)$-tuples $(x_1, \ldots, x_m, s)$ corresponding to all $m$-tuples $(x_1, \ldots, x_m)$.

One can easily check that this set is conceivable.
• For $(m+1)$-tuples from the first component of this set, we have $V = 0$, thus $y = x_{m+1} - V = x_{m+1} \ge 0$.
• For $(m+1)$-tuples from the second component of this set, we have $y = s - V$ and, since $V \le M$, we have $y \ge s - M \ge 0$.

Thus, for this set $S$, all the values from the y-range $f(S)$ are indeed non-negative.

3.4◦. Due to Parts 3.1–3.3 of this proof, the aleatory y-interval—which is equal to the intersection of all the y-ranges $f(S)$ corresponding to conceivable sets $S$:
• contains the interval $[0, s - M]$, but
• does not contain any values outside this interval.

Thus, indeed, the aleatory y-interval is equal to $[0, s - M]$.

4◦. Now we can prove the desired reduction. If we could compute the aleatory y-interval, we would then be able to compute the value $s - M$ and thus, we would be able to check whether $s - M > 0$, i.e., whether $M < s$ or $M = s$. Due to Part 2 of this proof, we would then be able to check whether the original instance of the partition problem has a solution. So, we indeed have a reduction of a known NP-hard problem to our problem. Thus, the problem of computing the aleatory y-interval is NP-hard. The proposition is proven.

Proof of Proposition 6 is similar to the proof of Proposition 3.

Proof of Proposition 7 is similar to the proof of Proposition 4.

Proof of Proposition 8 It is sufficient to take, as the desired set $C$, the union of all straight line segments that connect all the $\varepsilon$-close pairs of points $x, x' \in S$. One can easily see that this set is connected and $\varepsilon$-close to the original set $S$.

Proof of Proposition 10 An $\varepsilon$-conceivable set $S$ must contain, for each $i$:
• $n$-tuples for which $x_i$ is $\varepsilon$-close to $\underline X_i$, i.e., for which $x_i \le \underline X_i + \varepsilon$, and
• $n$-tuples for which $x_i$ is $\varepsilon$-close to $\overline X_i$, i.e., for which $x_i \ge \overline X_i - \varepsilon$.

Since the set $S$ is connected, the set $S_i$ of all values $x_i$ corresponding to its $n$-tuples is also connected, i.e., is an interval containing points $x_i \le \underline X_i + \varepsilon$ and $x_i \ge \overline X_i - \varepsilon$. Thus, the set $S_i$ contains all the values from the "$\varepsilon$-reduced" interval $[\underline X_i + \varepsilon, \overline X_i - \varepsilon]$, so it must be conceivable for the reduced intervals.
Vice versa, if we have a set $S$ which is conceivable for the reduced intervals, then, as one can easily check, this set $S$ is $\varepsilon$-conceivable for the original intervals $[\underline X_i, \overline X_i]$. Thus, the desired equality follows from Definition 2.

Proof of Proposition 11 Connected sets are a particular case of general sets, so the ∗-aleatory y-interval is a subset of the aleatory y-interval $[\underline Y, \overline Y]$. To prove the proposition, we need to prove that for every $y \in [\underline Y, \overline Y]$, there exists a ∗-conceivable set $S$ for which $y \notin f(S)$. As such a set, let us take
$$S = \{(x_1, \ldots, x_n) : f(x_1, \ldots, x_n) \ne y\}.$$
The only thing we need to check is that for each $i$, all the values $x_i \in [\underline X_i, \overline X_i]$ are represented by $n$-tuples from this set. Indeed, otherwise, if some value $x_i^{(0)}$ was not represented, this would mean that $f(x_1, \ldots, x_{i-1}, x_i^{(0)}, x_{i+1}, \ldots, x_n) = y$ for all combinations $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$—but for a linear function, this would mean that this function does not depend on the variables $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ at all—which contradicts our assumption. The proposition is proven.

Comment. For a non-linear function, as the example of multiplication shows, the ∗-aleatory y-interval can be non-empty—exactly because for multiplication, there is a value $x_1^{(0)} = 0$ for which $f(x_1^{(0)}, x_2) = 0 \cdot x_2 = 0$ for all $x_2$.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References
1. S. Ferson, L. Ginzburg, V. Kreinovich, M. Aviles, Exact bounds on sample variance of interval data, in Extended Abstracts of the 2002 SIAM Workshop on Validated Computing, Toronto, Canada, May 23–25 (2002), pp. 67–69
2. S. Ferson, L. Ginzburg, V. Kreinovich, L. Longpré, M. Aviles, Computing variance for interval data is NP-hard. ACM SIGACT News 33(2), 108–118 (2002)
3. R. Feynman, R. Leighton, M. Sands, The Feynman Lectures on Physics (Addison Wesley, Boston, 2005)
4. L. Jaulin, M. Kiefer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
5. V. Kreinovich, Interval computations and interval-related statistical techniques: estimating uncertainty of the results of data processing and indirect measurements, in Advanced Mathematical and Computational Tools in Metrology and Testing AMTCM'X, ed. by F. Pavese, W. Bremser, A. Chunovkina, N. Fisher, A.B. Forbes (World Scientific, Singapore, 2015), pp. 38–49
6. V. Kreinovich, A. Lakeyev, J. Rohn, P. Kahl, Computational Complexity and Feasibility of Data Processing and Interval Computations (Kluwer, Dordrecht, 1998)
7. V. Kreinovich, V.M. Nesterov, N.A. Zheludeva, Interval methods that are guaranteed to underestimate (and the resulting new justification of Kaucher arithmetic). Reliab. Comput. 2(2), 119–124 (1996)
8. B.J. Kubica, Interval Methods for Solving Nonlinear Constraint Satisfaction, Optimization, and Similar Problems: From Inequalities Systems to Game Solutions (Springer, Cham, 2019)
9. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
10. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
11. H.T. Nguyen, V. Kreinovich, B. Wu, G. Xiang, Computing Statistics under Interval and Fuzzy Uncertainty (Springer, Berlin, 2012)
12. C. Papadimitriou, Computational Complexity (Addison-Wesley, Reading, 1994)
13. S.G. Rabinovich, Measurement Errors and Uncertainty: Theory and Practice (Springer, New York, 2005)
14. K.S. Thorne, R.D. Blandford, Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics (Princeton University Press, Princeton, 2017)
Standard Interval Computation Algorithm Is Not Inclusion-Monotonic: Examples

Marina Tuyako Mizukoshi, Weldon Lodwick, Martine Ceberio, and Vladik Kreinovich
Abstract When we usually process data, we, in effect, implicitly assume that we know the exact values of all the inputs. In practice, these values come from measurements, and measurements are never absolutely accurate. In many cases, the only information about the actual (unknown) value of each input is that this value belongs to an appropriate interval. Under this interval uncertainty, we need to compute the range of all possible results of applying the data processing algorithm when the inputs are in these intervals. In general, the problem of exactly computing this range is NP-hard, which means that in feasible time, we can, in general, only compute approximations to these ranges. In this paper, we show that, somewhat surprisingly, the usual standard algorithm for this approximate computation is not inclusion-monotonic, i.e., switching to more accurate measurements and narrower subintervals does not necessarily lead to narrower estimates for the resulting range.
1 Formulation of the Problem

Need for data processing. In many practical situations, we are interested in a quantity $y$ that is difficult—or even impossible—to measure directly, e.g., tomorrow's temperature or the distance to a faraway star.

M. T. Mizukoshi
Federal University of Goias, Goiania, Brazil
e-mail: [email protected]

W. Lodwick
Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer Street, Denver, CO 80204, USA
e-mail: [email protected]

M. Ceberio · V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

M. Ceberio
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_63
Since we cannot measure $y$ directly, the only way to estimate $y$ is:
• to find some easier-to-measure quantities $x_1, \ldots, x_n$ that are related to $y$ by a known dependence $y = f(x_1, \ldots, x_n)$, and
• to use the results $\widetilde x_i$ of measuring $x_i$ to produce an estimate $\widetilde y = f(\widetilde x_1, \ldots, \widetilde x_n)$.

Computation of this estimate is an important example of data processing.

Need for interval computation. Measurements are never absolutely accurate: there is usually a non-zero measurement error $\Delta x_i \stackrel{\text{def}}{=} \widetilde x_i - x_i$, the difference between the measurement result $\widetilde x_i$ and the actual (unknown) value $x_i$ of the corresponding quantity. In many practical situations, the only information that we have about each measurement error $\Delta x_i$ is the upper bound $\Delta_i$ on its absolute value: $|\Delta x_i| \le \Delta_i$; see, e.g., [9]. In such situations, once we know the measurement result $\widetilde x_i$, the only information that we have about the actual value $x_i$ is that it belongs to the interval
$$\mathbf{x}_i \stackrel{\text{def}}{=} [\underline x_i, \overline x_i] = [\widetilde x_i - \Delta_i, \widetilde x_i + \Delta_i].$$

In this case, the only information that we can have about the value of the desired quantity $y$ is that this value belongs to the following y-range:
$$\mathbf{y} \stackrel{\text{def}}{=} f(\mathbf{x}_1, \ldots, \mathbf{x}_n) = \{f(x_1, \ldots, x_n) : x_1 \in \mathbf{x}_1, \ldots, x_n \in \mathbf{x}_n\}.$$

When the function $y = f(x_1, \ldots, x_n)$ is continuous—and most data processing functions are continuous—this y-range is an interval. Because of this, computation of the y-range $\mathbf{y}$ is known as interval computations; see, e.g., [2, 6–8].

Comment about notations. In this paper, we will follow the usual practice of interval computations, where bold-face letters indicate intervals. For example:
• $x_i$ is a number, while
• $\mathbf{x}_i$ is an interval.

Range is inclusion-monotonic. By definition of the range, we can easily see that the y-range is inclusion-monotonic in the sense that if for some intervals $\mathbf{x}_1, \ldots, \mathbf{x}_n$, $\mathbf{X}_1, \ldots, \mathbf{X}_n$, we have $\mathbf{x}_i \subseteq \mathbf{X}_i$ for all $i$, then we should have $f(\mathbf{x}_1, \ldots, \mathbf{x}_n) \subseteq f(\mathbf{X}_1, \ldots, \mathbf{X}_n)$.

Need for approximate algorithms. It is known—see, e.g., [5]—that computing the y-range is, in general, NP-hard: it is actually NP-hard already for quadratic functions $y = f(x_1, \ldots, x_n)$. This means, in effect, that we cannot compute the exact y-range in feasible time: the only thing we can do is use approximate algorithms, i.e., algorithms that compute an approximate y-range. We will denote these algorithms by $f_{\rm approx}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$; here, $f_{\rm approx}(\mathbf{x}_1, \ldots, \mathbf{x}_n) \approx f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$.
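As a crude illustration of the y-range definition and of its inclusion monotonicity (not one of the paper's algorithms; the helper name is ours), the range can be approximated from the inside by simple grid sampling:

```python
from itertools import product

def range_bruteforce(f, boxes, steps=200):
    # Sample each interval [lo, hi] on a uniform grid and take min/max of f.
    # This approximates the true y-range from the inside; it is NOT a
    # guaranteed enclosure (computing the exact range is NP-hard in general).
    grids = [[lo + (hi - lo) * k / steps for k in range(steps + 1)]
             for (lo, hi) in boxes]
    values = [f(*point) for point in product(*grids)]
    return min(values), max(values)

f = lambda x1: x1 * (1 - x1)
print(range_bruteforce(f, [(0.0, 1.0)]))   # close to the true range [0, 0.25]
print(range_bruteforce(f, [(0.0, 0.5)]))   # a narrower input box gives a sub-range
```

Because a sub-box can only lose points of the original box, the sampled range over a sub-box is always contained in the range over the full box, which is exactly the inclusion-monotonicity property of the true y-range.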
Natural question: are approximate interval computation algorithms inclusion-monotonic? A natural question is whether these approximate algorithms are inclusion-monotonic, i.e., whether the fact that $\mathbf{x}_i \subseteq \mathbf{X}_i$ for all $i$ implies that $f_{\rm approx}(\mathbf{x}_1, \ldots, \mathbf{x}_n) \subseteq f_{\rm approx}(\mathbf{X}_1, \ldots, \mathbf{X}_n)$.

What we do in this paper. In this paper, we consider the standard algorithm for computing an approximation for the interval y-range. We show that while many components of this algorithm are inclusion-monotonic, the algorithm itself is not: there are simple counter-examples.
2 Standard Interval Computations Algorithm: Reminder

Interval arithmetic. Let us first start with describing the standard interval computation algorithm. To describe this algorithm, we will first describe the preliminary algorithms of which this standard algorithm is composed. For each preliminary algorithm, we will also explain its motivations.

Taking monotonicity into account. Many functions are increasing or decreasing. We say that a function $y = f(x_1, \ldots, x_n)$ is (non-strictly) increasing with respect to the variable $x_i$ if for all possible values $x_1, \ldots, x_{i-1}, x_i, X_i, x_{i+1}, \ldots, x_n$, the inequality $x_i \le X_i$ implies that
$$f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) \le f(x_1, \ldots, x_{i-1}, X_i, x_{i+1}, \ldots, x_n).$$
Similarly, we say that a function $f(x_1, \ldots, x_n)$ is (non-strictly) decreasing with respect to the variable $x_i$ if for all possible values $x_1, \ldots, x_{i-1}, x_i, X_i, x_{i+1}, \ldots, x_n$, the inequality $x_i \le X_i$ implies that
$$f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) \ge f(x_1, \ldots, x_{i-1}, X_i, x_{i+1}, \ldots, x_n).$$
If a function is increasing or decreasing in $x_i$, we say that it is monotonic in $x_i$. For example, when $x_1 > 0$ and $x_2 > 0$, then:
• the function $y = f(x_1, x_2) = x_1 \cdot x_2$ is increasing in each of the inputs $x_1$ and $x_2$, while
• the function $y = f(x_1, x_2) = x_1 / x_2$ is increasing in $x_1$ and decreasing in $x_2$.

When the function $y = f(x_1, \ldots, x_n)$ is monotonic—(non-strictly) increasing or (non-strictly) decreasing—with respect to each variable, then its y-range can be easily computed by considering the corresponding endpoints. For example, suppose that a function $y = f(x_1, \ldots, x_n)$ is increasing with respect to each of its variables, and we have a combination of values $(x_1, \ldots, x_n)$ for which $\underline x_i \le x_i \le \overline x_i$ for all $i$. Then:
• monotonicity with respect to $x_1$ implies that $f(\underline x_1, \underline x_2, \ldots, \underline x_n) \le f(x_1, \underline x_2, \underline x_3, \ldots, \underline x_n)$;
• monotonicity with respect to $x_2$ implies that $f(x_1, \underline x_2, \underline x_3, \ldots, \underline x_n) \le f(x_1, x_2, \underline x_3, \ldots, \underline x_n)$;
• …, and
• monotonicity with respect to $x_n$ implies that $f(x_1, \ldots, x_{n-1}, \underline x_n) \le f(x_1, \ldots, x_{n-1}, x_n)$.

So, we have:
$$f(\underline x_1, \underline x_2, \ldots, \underline x_n) \le f(x_1, \underline x_2, \underline x_3, \ldots, \underline x_n) \le f(x_1, x_2, \underline x_3, \ldots, \underline x_n) \le \cdots \le f(x_1, \ldots, x_{n-1}, \underline x_n) \le f(x_1, \ldots, x_{n-1}, x_n)$$
and thus, by transitivity of inequality:
$$f(\underline x_1, \ldots, \underline x_n) \le f(x_1, \ldots, x_n).$$
Similarly:
• monotonicity with respect to $x_1$ implies that $f(x_1, x_2, \ldots, x_n) \le f(\overline x_1, x_2, x_3, \ldots, x_n)$;
• monotonicity with respect to $x_2$ implies that $f(\overline x_1, x_2, x_3, \ldots, x_n) \le f(\overline x_1, \overline x_2, x_3, \ldots, x_n)$;
• …, and
• monotonicity with respect to $x_n$ implies that $f(\overline x_1, \ldots, \overline x_{n-1}, x_n) \le f(\overline x_1, \ldots, \overline x_{n-1}, \overline x_n)$.

So, we have:
$$f(x_1, x_2, \ldots, x_n) \le f(\overline x_1, x_2, x_3, \ldots, x_n) \le f(\overline x_1, \overline x_2, x_3, \ldots, x_n) \le \cdots \le f(\overline x_1, \ldots, \overline x_{n-1}, x_n) \le f(\overline x_1, \ldots, \overline x_{n-1}, \overline x_n)$$
and hence, by transitivity of inequality:
$$f(x_1, \ldots, x_n) \le f(\overline x_1, \ldots, \overline x_n).$$
Thus, we always have
$$f(\underline x_1, \ldots, \underline x_n) \le f(x_1, \ldots, x_n) \le f(\overline x_1, \ldots, \overline x_n),$$
so the y-range of possible values of $y = f(x_1, \ldots, x_n)$ is contained in the interval $[f(\underline x_1, \ldots, \underline x_n), f(\overline x_1, \ldots, \overline x_n)]$. On the other hand, both endpoints of this interval are clearly part of the desired y-range, so the y-range is simply equal to this interval:
$$f([\underline x_1, \overline x_1], \ldots, [\underline x_n, \overline x_n]) = [f(\underline x_1, \ldots, \underline x_n), f(\overline x_1, \ldots, \overline x_n)].$$
This general fact has the following immediate implications.

Interval arithmetic. For example, the function $y = f(x_1, x_2) = x_1 + x_2$ is increasing in $x_1$ and in $x_2$, so, according to the above formula, its y-range is equal to $f([\underline x_1, \overline x_1], [\underline x_2, \overline x_2]) = [f(\underline x_1, \underline x_2), f(\overline x_1, \overline x_2)]$, i.e., we have
$$[\underline x_1, \overline x_1] + [\underline x_2, \overline x_2] = [\underline x_1 + \underline x_2, \overline x_1 + \overline x_2].$$
Similarly, the function $y = f(x_1, x_2) = x_1 - x_2$ is increasing in $x_1$ and decreasing in $x_2$, so we have
$$[\underline x_1, \overline x_1] - [\underline x_2, \overline x_2] = [\underline x_1 - \overline x_2, \overline x_1 - \underline x_2].$$
Similarly, we get:
$$[\underline x_1, \overline x_1] \cdot [\underline x_2, \overline x_2] = [\min(\underline x_1 \underline x_2, \underline x_1 \overline x_2, \overline x_1 \underline x_2, \overline x_1 \overline x_2), \max(\underline x_1 \underline x_2, \underline x_1 \overline x_2, \overline x_1 \underline x_2, \overline x_1 \overline x_2)];$$
$$\frac{1}{[\underline x_1, \overline x_1]} = \left[\frac{1}{\overline x_1}, \frac{1}{\underline x_1}\right] \text{ if } 0 \notin [\underline x_1, \overline x_1]; \text{ and } \quad \frac{[\underline x_1, \overline x_1]}{[\underline x_2, \overline x_2]} = [\underline x_1, \overline x_1] \cdot \frac{1}{[\underline x_2, \overline x_2]}.$$
All these formulas are known as interval arithmetic.

Comment. Similar formulas can be described for monotonic elementary functions such as $\exp(x)$, $\ln(x)$, $x^m$ for odd $m$, and for elementary functions $y = f(x)$ which are monotonic on known x-intervals, such as $x^m$ for even $m$, $\sin(x)$, etc.

First preliminary algorithm—straightforward interval computation: idea. In a computer, the only hardware-supported operations are arithmetic operations. Thus, no matter what we want to compute, the computer will actually perform a sequence of arithmetic operations. For example, if we ask a computer to compute $\exp(x)$, most
computers will simply compute the sum of the first few ($N$) terms of this function's Taylor series:
$$\exp(x) \approx 1 + \frac{x^1}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^N}{N!}.$$
So, we arrive at the following natural idea, known as straightforward interval computations.

Straightforward interval computations: algorithm. In this algorithm, we replace each arithmetic operation with the corresponding operation of interval arithmetic.

Important comments.
• It is known that, as a result of straightforward interval computations, we get an enclosure $\mathbf{Y} = f_{\rm approx}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ for the desired y-range $\mathbf{y}$, i.e., an interval $\mathbf{Y}$ for which $\mathbf{y} \subseteq \mathbf{Y}$.
• We can also replace the application of elementary functions with the corresponding interval expressions.

Straightforward interval computations: first example. For example, if we want to compute the y-range of the function $y = f(x_1) = x_1 \cdot (1 - x_1)$ on the interval $\mathbf{x}_1 = [0, 1]$, we take into account that computing this function involves:
• first the subtraction $r \stackrel{\text{def}}{=} 1 - x_1$, and
• then the multiplication $y = x_1 \cdot r$.

So, according to the general description of straightforward interval computations:
• we first compute $\mathbf{r} = 1 - \mathbf{x}_1 = 1 - [0, 1] = [1, 1] - [0, 1] = [1 - 1, 1 - 0] = [0, 1]$, and
• then we compute
$$\mathbf{Y} = \mathbf{x}_1 \cdot \mathbf{r} = [0, 1] \cdot [0, 1] = [\min(0 \cdot 0, 0 \cdot 1, 1 \cdot 0, 1 \cdot 1), \max(0 \cdot 0, 0 \cdot 1, 1 \cdot 0, 1 \cdot 1)] = [\min(0, 0, 0, 1), \max(0, 0, 0, 1)] = [0, 1].$$
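The first example can be reproduced with a minimal interval-arithmetic sketch in Python (the class and names are ours, purely illustrative; production interval libraries also perform outward rounding, which we ignore here):

```python
class Interval:
    # Minimal interval arithmetic: only the +, -, * rules from above.
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def f_straightforward(x1):
    # y = x1 * (1 - x1), each operation replaced by its interval version
    r = Interval(1, 1) - x1
    return x1 * r

print(f_straightforward(Interval(0, 1)))      # [0, 1]; the true range is only [0, 0.25]
print(f_straightforward(Interval(0, 0.5)))    # matches the second example below
print(f_straightforward(Interval(0.4, 0.8)))  # matches [0.08, 0.48] (up to rounding)
```

The overestimation appears because the two occurrences of $x_1$ in the expression are treated as independent intervals, which is exactly why the following examples look for better algorithms.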
In this example, the actual y-range—as one can easily check—is [0, 0.25], which is much narrower than our estimate.

Straightforward interval computations: second example. Similarly, to compute the y-range of the function $y = f(x_1) = x_1 \cdot (1 - x_1)$ on the interval $\mathbf{x}_1 = [0, 0.5]$,
• we first compute $\mathbf{r} = 1 - [0, 0.5] = [1, 1] - [0, 0.5] = [1 - 0.5, 1 - 0] = [0.5, 1]$,
and
• then we compute $\mathbf{Y} = \mathbf{x}_1 \cdot \mathbf{r} = [0, 0.5] \cdot [0.5, 1] = [0, 0.5]$.

The actual y-range is [0, 0.25].

Straightforward interval computations: third example. Finally, to compute the y-range of the function $y = f(x_1) = x_1 \cdot (1 - x_1)$ on the interval $\mathbf{x}_1 = [0.4, 0.8]$,
• we first compute
$$\mathbf{r} = 1 - \mathbf{x}_1 = 1 - [0.4, 0.8] = [1, 1] - [0.4, 0.8] = [1 - 0.8, 1 - 0.4] = [0.2, 0.6],$$
and
• then we compute $\mathbf{Y} = \mathbf{x}_1 \cdot \mathbf{r} = [0.4, 0.8] \cdot [0.2, 0.6] = [0.08, 0.48]$.

The actual y-range is [0.16, 0.25].

Need to go beyond straightforward interval computations. In all these examples, the actual y-range is much narrower than our estimate. So, clearly, we need a better algorithm.

Checking monotonicity. So far, we have only used the monotonicity idea for functions corresponding to elementary arithmetic operations. A natural idea is to use monotonicity for other functions as well. How can we check whether the function $y = f(x_1, \ldots, x_n)$ is monotonic with respect to some of the variables?
• According to calculus, a function $y = f(x_1, \ldots, x_n)$ is (non-strictly) increasing with respect to $x_i$ if and only if the corresponding partial derivative $\partial f / \partial x_i$ is non-negative for all possible values $x_i \in \mathbf{x}_i$.
• Similarly, a function $y = f(x_1, \ldots, x_n)$ is (non-strictly) decreasing with respect to $x_i$ if and only if the corresponding partial derivative $\partial f / \partial x_i$ is non-positive for all possible values $x_i \in \mathbf{x}_i$.

We cannot directly check the corresponding inequalities for all the infinitely many combinations of $x_i \in \mathbf{x}_i$, but what we can do is use straightforward interval computations to find an enclosure $\mathbf{d}_i = [\underline d_i, \overline d_i]$ for the range of this partial derivative over
the whole box $\mathbf{x}_1 \times \cdots \times \mathbf{x}_n$. If $\underline d_i \ge 0$, this means that the partial derivative is always non-negative and thus, that the function $y = f(x_1, \ldots, x_n)$ is (non-strictly) increasing in $x_i$.

Since $\underline x_i \le x_i$, this means, in particular, that for all possible values $x_1, \ldots, x_n$ from the corresponding intervals, we have
$$f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n) \le f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n).$$
Thus, whatever value $f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)$ of the function we have, there is a smaller (or equal) value of the type $f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$. So, to compute the smallest possible value $\underline y$ of the function $y = f(x_1, \ldots, x_n)$, it is sufficient to only consider the values of the type $f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$. In other words, to compute $\underline y$, it is sufficient to consider the y-range of the function
$$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mapsto f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$$
of $n - 1$ variables.

Similarly, since $x_i \le \overline x_i$, for all possible values $x_1, \ldots, x_n$ from the corresponding intervals, we have
$$f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) \le f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n).$$
Thus, whatever value $f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)$ of the function we have, there is a larger (or equal) value of the type $f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$. So, to compute the largest possible value $\overline y$ of the function $y = f(x_1, \ldots, x_n)$, it is sufficient to only consider the values of the type $f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$. In other words, to compute $\overline y$, it is sufficient to consider the y-range of the function
$$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mapsto f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$$
of $n - 1$ variables.

When $\overline d_i \le 0$, this means that the partial derivative is always non-positive and thus, that the function $f(x_1, \ldots, x_n)$ is (non-strictly) decreasing in $x_i$.
Since $\underline x_i \le x_i$, this means, in particular, that for all possible values $x_1, \ldots, x_n$ from the corresponding intervals, we have
$$f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) \le f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n).$$
Thus, whatever value $f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)$ of the function we have, there is a larger (or equal) value of the type $f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$. So, to compute the largest possible value $\overline y$ of the function $y = f(x_1, \ldots, x_n)$, it is sufficient to only consider the values of the type $f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$. In other words, to compute $\overline y$, it is sufficient to consider the y-range of the function
$$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mapsto f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n)$$
of $n - 1$ variables. Similarly, since $x_i \le \overline x_i$, for all possible values $x_1, \ldots, x_n$ from the corresponding intervals, we have
$$f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n) \le f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n).$$
Thus, whatever value $f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)$ of the function we have, there is a smaller (or equal) value of the type $f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$. So, to compute the smallest possible value $\underline y$ of the function $y = f(x_1, \ldots, x_n)$, it is sufficient to only consider the values of the type $f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$. In other words, to compute $\underline y$, it is sufficient to consider the y-range of the function
$$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mapsto f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n)$$
of $n - 1$ variables.

Thus, we arrive at the following algorithm.

Second preliminary algorithm: taking monotonicity into account. We select one of the variables $x_i$, and we use straightforward interval computations to find an enclosure $\mathbf{d}_i = [\underline d_i, \overline d_i]$ for the range of the $i$-th partial derivative
$$\frac{\partial f}{\partial x_i}(x_1, \ldots, x_n).$$
If $\underline d_i \ge 0$ or $\overline d_i \le 0$, then we reduce the original problem with $n$ variables to two problems of finding the y-ranges of the following functions of $n - 1$ variables:
$$f(x_1, \ldots, x_{i-1}, \underline x_i, x_{i+1}, \ldots, x_n) \text{ and } f(x_1, \ldots, x_{i-1}, \overline x_i, x_{i+1}, \ldots, x_n).$$
• When $\underline d_i \ge 0$, we produce the interval formed by the lower endpoint of the first y-range and the upper endpoint of the second y-range as the desired estimate for the y-range of the original function.
• When $\overline d_i \le 0$, we produce the interval formed by the lower endpoint of the second y-range and the upper endpoint of the first y-range as the desired estimate for the y-range of the original function.
For each of the resulting functions of n − 1 variables—or, if no monotonicity was discovered, for the original function y = f (x1 , . . . , xn )—we check monotonicity with respect to other variables, etc.
At the end, when we have checked monotonicity with respect to all the variables and we are still left with the need to estimate the y-range, we use straightforward interval computations. Second preliminary algorithm: example. Let us find the y-range of the function y = f (x1 ) = x1 · (1 − x1 ) on the interval [0, 0.5]. The standard differentiation algorithm leads to the derivative f (x1 ) = 1 · (1 − x1 ) + x1 · (−1) = 1 − 2x1 . For this derivative, straightforward interval computations lead to the range d1 = 1 − 2 · [0, 0.5] = [1, 1] − [0, 1] = [1 − 1, 1 − 0] = [0, 1]. Here, d 1 ≥ 0, so we conclude that this function is monotonic with respect to d1 and thus, its y-range is equal to [ f (x 1 ), f (x 1 )] = [ f (0), f (0.5)] = [0, 0.25]. So, in this case, we get the exact y-range—which is much better than a wider enclosure that we got when we use straightforward interval computations. Need for a better algorithm. For estimating the y-range of the function y = f (x1 ) = x1 · (1 − x1 ) on the interval [0, 0.5], where this function is monotonic, using monotonicity leads to a better result. However, for the other two intervals [0, 1] and [0.4, 0.8] the given function is not monotonic. Accordingly, our ranges for the derivative are equal to d1 = [d 1 , d 1 ] = 1 − 2 · [0, 1] = [1, 1] − [0, 2] = [−1, 1] and to d1 = [d 1 , d 1 ] = 1 − 2 · [0.4, 0.8] = [1, 1] − [0.8, 1.6] = [−0.6, 0.2]. In both case, we have neither d 1 ≥ 0 nor d 1 ≤ 0. Thus, the second preliminary algorithm still leads to the same straightforward interval computations that led to a very wide enclosure. So, it is still desirable to have better estimates for the y-range. xi − Δi , xi + Δi ] can be Centered form: idea. Each value xi ∈ xi = [x i , x i ] = [ xi + Δxi , where |Δxi | ≤ Δi . It is know that for each combination represented as xi = of such values, we have x1 + Δx1 , . . . , xn + Δxn ) = f (x1 , . . . , xn ) = f (
Standard Interval Computation Algorithm Is Not Inclusion-Monotonic: Examples
f(x1, . . . , xn) = f(x̃1 + Δx1, . . . , x̃n + Δxn) = f(x̃1, . . . , x̃n) + Σ_{i=1}^n ∂f/∂xi (x̃1 + ζ1, . . . , x̃n + ζn) · Δxi,

for some ζi ∈ [−Δi, Δi]. Each value x̃i + ζi belongs to the interval xi. Thus, the corresponding value of the partial derivative ∂f/∂xi belongs to the range of this derivative over the box x1 × . . . × xn, and hence belongs to the enclosure di of this range. The values Δxi belong to the interval [−Δi, Δi]. Hence, for every combination of values xi ∈ xi, the value f(x1, . . . , xn) belongs to the interval

f(x̃1, . . . , x̃n) + Σ_{i=1}^n di · [−Δi, Δi].
This interval contains the whole desired y-range of the function y = f(x1, . . . , xn) and is, therefore, an enclosure for this y-range. This enclosure is known as the centered form. By using this enclosure, we get the following algorithm.

Third preliminary algorithm.
• First, we check monotonicity, as in the second preliminary algorithm.
• Once we are left with a box or boxes on which there is no monotonicity, we use the centered form to compute the enclosure.

Third preliminary algorithm: first example. For estimating the y-range of the function y = f(x1) = x1 · (1 − x1) on the interval [0, 1], for which x̃1 = 0.5 and Δ1 = 0.5, and for which we already know that d1 = [−1, 1], the centered form leads to f(0.5) + [−1, 1] · [−0.5, 0.5] = 0.25 + [−0.5, 0.5] = [−0.25, 0.75]. This is still much wider than the actual y-range [0, 0.25].

Third preliminary algorithm: second example. For estimating the y-range of the function y = f(x1) = x1 − x1² on the interval [0.4, 0.8], for which x̃1 = 0.6 and Δ1 = 0.2, and for which we already know that d1 = [−0.6, 0.2], the centered form leads to f(0.6) + [−0.6, 0.2] · [−0.2, 0.2] = 0.24 + [−0.12, 0.12] = [0.12, 0.36]. This is still much wider than the actual y-range [0.16, 0.25], but much narrower than the enclosure [0.08, 0.48] obtained by straightforward interval computations.
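The second and third preliminary algorithms are easy to sketch in code. The following Python sketch (an illustration of the idea, not the authors' implementation) represents intervals as pairs of floats, hard-codes the derivative f′(x1) = 1 − 2x1 of the running example f(x1) = x1 · (1 − x1), and ignores the outward rounding that a real interval library would perform:

```python
# Sketch of the second + third preliminary algorithms for
# f(x1) = x1 * (1 - x1), whose derivative is f'(x1) = 1 - 2*x1.
# Intervals are (lo, hi) tuples of floats; outward rounding is ignored.

def f(x):
    return x * (1 - x)

def deriv_range(lo, hi):
    # Straightforward interval evaluation of f'(x1) = 1 - 2*x1:
    # 1 - 2*[lo, hi] = [1 - 2*hi, 1 - 2*lo].
    return (1 - 2 * hi, 1 - 2 * lo)

def y_range(lo, hi):
    d_lo, d_hi = deriv_range(lo, hi)
    if d_lo >= 0:                  # non-decreasing: exact y-range
        return (f(lo), f(hi))
    if d_hi <= 0:                  # non-increasing: exact y-range
        return (f(hi), f(lo))
    # No monotonicity: centered form f(mid) + d1 * [-delta, delta].
    mid, delta = (lo + hi) / 2, (hi - lo) / 2
    m = max(abs(d_lo), abs(d_hi)) * delta
    return (f(mid) - m, f(mid) + m)

print(y_range(0.0, 0.5))   # exact y-range [0, 0.25], via monotonicity
print(y_range(0.0, 1.0))   # centered form: [-0.25, 0.75]
print(y_range(0.4, 0.8))   # centered form: roughly [0.12, 0.36]
```

The three calls reproduce the three examples above: the monotone case gives the exact range, and the two non-monotone cases give the centered-form enclosures.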
Need for a better algorithm. The estimates for the desired y-ranges are still too wide, so we need better estimates.

Bisection: idea. The centered form means, in effect, that we approximate the original function by the linear terms of its Taylor expansion. Thus, the inaccuracy of this method is of the same size as the largest ignored terms in this expansion, i.e., the quadratic terms. These terms are proportional to Δi². Thus, to decrease these terms, a natural idea is to decrease Δi. A natural way to do it is to bisect, i.e., to divide one of the intervals into two equal halves, each with half the value of Δi. By using this idea, we arrive at the following standard interval computations algorithm.

Standard interval computations algorithm. First, we follow the third preliminary algorithm. If we are not satisfied with the result:
• we select one of the variables i,
• we divide the corresponding interval xi into two equal-size sub-intervals [x̲i, x̃i] and [x̃i, x̄i], and
• we apply the third preliminary algorithm to estimate the y-ranges f(x1, . . . , xi−1, [x̲i, x̃i], xi+1, . . . , xn) and f(x1, . . . , xi−1, [x̃i, x̄i], xi+1, . . . , xn).
We then take the union of these two y-range estimates as the enclosure for the desired y-range. If we are still not happy with the result, we again apply bisection, etc.

Standard algorithm: first example. Let us consider the problem of estimating the y-range of the function y = f(x1) = x1 · (1 − x1) on the interval [0, 1]. For this problem, since we did not get monotonicity, a reasonable idea is to try at least one bisection. In this case, there is only one variable x1, so bisection simply means considering the two intervals [0, 0.5] and [0.5, 1].

For the interval [0, 0.5], the range of the derivative f′(x1) = 1 − 2x1 is estimated as [d̲1, d̄1] = 1 − 2 · [0, 0.5] = 1 − [0, 1] = [1 − 1, 1 − 0] = [0, 1]. In this case, d̲1 ≥ 0, so the function y = f(x1) is increasing, and its y-range is equal to f([0, 0.5]) = [f(0), f(0.5)] = [0, 0.25].

For the interval [0.5, 1], the range of the derivative f′(x1) = 1 − 2x1 is estimated as [d̲1, d̄1] = 1 − 2 · [0.5, 1] = 1 − [1, 2] = [1 − 2, 1 − 1] = [−1, 0].
In this case, d̄1 ≤ 0, so the function y = f(x1) is decreasing, and its y-range is equal to f([0.5, 1]) = [f(1), f(0.5)] = [0, 0.25]. The y-range f([0, 1]) of this function on the whole interval [0, 1] can be computed as the union of its y-ranges on the two subintervals: f([0, 1]) = f([0, 0.5]) ∪ f([0.5, 1]) = [0, 0.25] ∪ [0, 0.25] = [0, 0.25]. In this example, we get the exact y-range.

Standard algorithm: second example. Let us now consider the problem of estimating the y-range of the function y = f(x1) = x1 · (1 − x1) on the interval [0.4, 0.8]. For this problem, since we did not get monotonicity, a reasonable idea is to try at least one bisection. In this case, there is only one variable x1, so bisection simply means considering the two intervals [0.4, 0.6] and [0.6, 0.8].

For the interval [0.4, 0.6], the range of the derivative f′(x1) = 1 − 2x1 is estimated as [d̲1, d̄1] = 1 − 2 · [0.4, 0.6] = 1 − [0.8, 1.2] = [1 − 1.2, 1 − 0.8] = [−0.2, 0.2]. This range includes both positive and negative values, so we cannot use monotonicity; we have to use the centered form. In this case, the midpoint x̃1 of the interval is x̃1 = 0.5 and its half-width Δ1 is Δ1 = 0.1, so the centered form leads to the following estimate: f(0.5) + [−0.2, 0.2] · [−0.1, 0.1] = 0.25 + [−0.02, 0.02] = [0.23, 0.27].

For the interval [0.6, 0.8], the range of the derivative f′(x1) = 1 − 2x1 is estimated as [d̲1, d̄1] = 1 − 2 · [0.6, 0.8] = 1 − [1.2, 1.6] = [1 − 1.6, 1 − 1.2] = [−0.6, −0.2]. In this case, d̄1 ≤ 0, so the function y = f(x1) is decreasing, and its y-range is equal to f([0.6, 0.8]) = [f(0.8), f(0.6)] = [0.16, 0.24].

The y-range f([0.4, 0.8]) of this function on the whole interval [0.4, 0.8] can be estimated as the union of its y-range estimates on the two subintervals: f_approx([0.4, 0.8]) = f_approx([0.4, 0.6]) ∪ f([0.6, 0.8]) = [0.23, 0.27] ∪ [0.16, 0.24] = [0.16, 0.27].
436
M. T. Mizukoshi et al.
This is better than without bisection (we got [0.12, 0.36] there), but still wider than the actual y-range [0.16, 0.25]. To get a better estimate, we can again apply bisection: namely, we bisect the interval [0.4, 0.6]. (By the way, after this second bisection, the standard algorithm leads to the exact y-range.)
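The full standard algorithm for the running example, with the monotonicity check, the centered form as a fallback, and a user-chosen number of bisection rounds, can be sketched as a small recursive procedure. Again, this is only an illustrative sketch (the function and parameter names are ours); exact rational endpoints are used instead of floats so that the enclosures come out exactly as in the text:

```python
from fractions import Fraction as F

# Sketch of the standard algorithm for f(x1) = x1 * (1 - x1):
# monotonicity check, centered form as a fallback, and up to
# `depth` rounds of bisection, taking the union of the sub-results.
# Exact rational endpoints (Fraction) sidestep rounding issues.

def f(x):
    return x * (1 - x)

def enclosure(lo, hi, depth=1):
    d_lo, d_hi = 1 - 2 * hi, 1 - 2 * lo   # range of f'(x1) = 1 - 2*x1
    if d_lo >= 0:                         # non-decreasing: exact y-range
        return (f(lo), f(hi))
    if d_hi <= 0:                         # non-increasing: exact y-range
        return (f(hi), f(lo))
    if depth == 0:                        # no bisections left: centered form
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        m = max(abs(d_lo), abs(d_hi)) * half
        return (f(mid) - m, f(mid) + m)
    mid = (lo + hi) / 2                   # bisect and take the union
    lo1, hi1 = enclosure(lo, mid, depth - 1)
    lo2, hi2 = enclosure(mid, hi, depth - 1)
    return (min(lo1, lo2), max(hi1, hi2))

print(enclosure(F(0), F(1), depth=1))        # exact y-range [0, 0.25]
print(enclosure(F(2, 5), F(4, 5), depth=1))  # [0.16, 0.27], as in the text
print(enclosure(F(2, 5), F(4, 5), depth=2))  # exact y-range [0.16, 0.25]
```

With depth = 2, the subinterval [0.4, 0.6] gets bisected once more, which is exactly the second bisection mentioned above.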
3 Inclusion Monotonicity: What Is Known and New Counterexamples

What is known. One can show, by induction over the number of arithmetic steps, that the straightforward interval computations algorithm is inclusion-monotonic. It is also known that the centered form itself is inclusion-monotonic [3, 4]; see also [1, 10].

First counter-example. As we have mentioned earlier, for estimating the y-range of the function y = f(x1) = x1 · (1 − x1) on the interval [0, 1], the standard interval computations algorithm with one bisection computes the exact y-range f_approx([0, 1]) = f([0, 1]) = [0, 0.25]. Let us show that when we use the same one-bisection version of the standard interval computations algorithm to estimate the y-range of this function on the subinterval [0, 0.8] ⊂ [0, 1], we do not get a subinterval of [0, 0.25].

Indeed, in this case the range d1 of the derivative on this interval is estimated as 1 − 2 · [0, 0.8] = 1 − [0, 1.6] = [−0.6, 1]. This range contains both positive and negative values, so there is no monotonicity, and thus, we need to bisect. Bisecting means dividing the interval [0, 0.8] into the two subintervals [0, 0.4] and [0.4, 0.8].

For the interval [0, 0.4], the range of the derivative f′(x1) = 1 − 2x1 is estimated as [d̲1, d̄1] = 1 − 2 · [0, 0.4] = 1 − [0, 0.8] = [1 − 0.8, 1 − 0] = [0.2, 1]. In this case, d̲1 ≥ 0, so the function y = f(x1) is increasing, and its y-range is equal to f([0, 0.4]) = [f(0), f(0.4)] = [0, 0.24].

For the interval [0.4, 0.8], the range of the derivative f′(x1) = 1 − 2x1 is estimated as
[d̲1, d̄1] = 1 − 2 · [0.4, 0.8] = 1 − [0.8, 1.6] = [1 − 1.6, 1 − 0.8] = [−0.6, 0.2]. This range includes both positive and negative values, so we cannot use monotonicity; we have to use the centered form. In this case, the midpoint x̃1 of the interval is x̃1 = 0.6 and its half-width Δ1 is Δ1 = 0.2, so the centered form leads to the following estimate for the y-range: f(0.6) + [−0.6, 0.2] · [−0.2, 0.2] = 0.24 + [−0.12, 0.12] = [0.12, 0.36].

The y-range f([0, 0.8]) of this function on the whole interval [0, 0.8] can be estimated as the union of the y-range estimates corresponding to the two subintervals: f_approx([0, 0.8]) = f([0, 0.4]) ∪ f_approx([0.4, 0.8]) = [0, 0.24] ∪ [0.12, 0.36] = [0, 0.36].
This is clearly not a subinterval of the interval estimate [0, 0.25] corresponding to the wider input [0, 1]. So, we have [0, 0.8] ⊆ [0, 1], but f_approx([0, 0.8]) = [0, 0.36] ⊄ [0, 0.25] = f_approx([0, 1]), so the estimate is not inclusion-monotonic.

Second counter-example. The previous example may create a false impression that the presence of bisection was essential for such an example. To avoid this impression, let us provide a counter-example that does not use bisection. In this example, the function whose y-range we want to estimate is y = f(x1, x2) = x1 + 0.5 · x2⁴ · (1 − x1²). As the larger input x-ranges, we take X1 = [−1, 1] and X2 = [−1, 1]. As the smaller input x-ranges, we take x1 = [−1, 0] and x2 = [−1, 1].

Let us first consider the case of the larger input x-ranges X1 = [−1, 1] and X2 = [−1, 1]. In this case, the partial derivative with respect to x1 has the form ∂f/∂x1 = 1 + 0.5 · x2⁴ · (−2x1) = 1 − x2⁴ · x1. By using straightforward interval computations, we can estimate the range of this derivative as [d̲1, d̄1] = 1 − [−1, 1] · [−1, 1] · [−1, 1] · [−1, 1] · [−1, 1] = [1, 1] − [−1, 1] = [1 − 1, 1 − (−1)] = [0, 2],
so the function y = f(x1, x2) is (non-strictly) increasing with respect to x1.

Comment. It is important to mention that we will get the same conclusion if, instead of interpreting x2⁴ as the result of three multiplications, we use the known estimate for the range of this term, which in this case is equal to [0, 1].

Thus, to estimate the lower endpoint of the y-range, it is sufficient to consider only the value x1 = x̲1 = −1. For this value, the function is simply equal to f(−1, x2) = −1 + 0.5 · x2⁴ · (1 − (−1)²) = −1, so −1 is the lower endpoint of the desired y-range. To estimate the upper endpoint of the y-range, it is sufficient to consider only the value x1 = x̄1 = 1. For this value, the function is simply equal to f(1, x2) = 1 + 0.5 · x2⁴ · (1 − 1²) = 1, so 1 is the upper endpoint of the desired y-range. Thus, the desired y-range is equal to [−1, 1], and this is the exact value of this y-range.

Let us now consider the smaller x-ranges x1 = [−1, 0] and x2 = [−1, 1]. In this case, we still have monotonicity with respect to x1. Thus, to estimate the lower endpoint of the y-range, it is sufficient to consider only the value x1 = x̲1 = −1. For this value, as we have mentioned, the function y = f(x1, x2) is simply equal to −1, so this is the lower endpoint of the desired y-range. Similarly, to estimate the upper endpoint of the y-range, it is sufficient to consider only the value x1 = x̄1 = 0. For this value, the original function is equal to f(0, x2) = 0 + 0.5 · x2⁴ · (1 − 0²) = 0.5 · x2⁴.

So, to find the upper endpoint of the y-range, we need to find the range of the function 0.5 · x2⁴ on the interval [−1, 1]. The derivative of this function is equal to 2 · x2³, thus the range d2 of this derivative on the interval [−1, 1] is equal to d2 = 2 · [−1, 1] · [−1, 1] · [−1, 1] = [−2, 2]. This range contains both positive and negative values, so, according to the standard algorithm, we need to use the centered form.
Here, x̃2 = 0 and Δ2 = 1, so the centered form leads to the following estimate for the y-range: f(0) + d2 · [−Δ2, Δ2] = 0 + [−2, 2] · [−1, 1] = [−2, 2]. This estimate for the y-range is clearly wider than the estimate [−1, 1] that we got for the wider x-range of x1, so inclusion monotonicity is clearly violated here:
x1 = [−1, 0] ⊂ X1 = [−1, 1] and x2 = [−1, 1] ⊆ X2 = [−1, 1], but for the corresponding y-range estimates f_approx we do not have inclusion: f_approx([−1, 0], [−1, 1]) = [−2, 2] ⊄ f_approx([−1, 1], [−1, 1]) = [−1, 1].

Comment. It is important to mention that we will get the exact same conclusion if, instead of treating x2³ as the result of two multiplications, we take into account that the function x2 → x2³ is increasing and thus, its range on the interval [−1, 1] is equal to [(−1)³, 1³] = [−1, 1].

Remaining open question. We showed the existence of counter-examples to inclusion monotonicity for the standard interval computations algorithm. A natural question is: are there other feasible algorithms that provide the same asymptotic approximation accuracy as the standard algorithm, but which are inclusion-monotonic? Our hypothesis is that no such algorithms are possible.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. M. Gavriliu, Towards more efficient interval analysis: corner forms and a remainder interval Newton method. Ph.D. Dissertation, California Institute of Technology, Pasadena, California (2005)
2. L. Jaulin, M. Kieffer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
3. R. Krawczyk, Centered forms and interval operators. Computing 34, 234–259 (1985)
4. R. Krawczyk, K. Nickel, Die zentrische Form in der Intervallarithmetik, ihre quadratische Konvergenz und ihre Inklusionsisotonie. Computing 28, 117–127 (1982)
5. V. Kreinovich, A. Lakeyev, J. Rohn, P. Kahl, Computational Complexity and Feasibility of Data Processing and Interval Computations (Kluwer, Dordrecht, 1998)
6. B.J. Kubica, Interval Methods for Solving Nonlinear Constraint Satisfaction, Optimization, and Similar Problems: from Inequalities Systems to Game Solutions (Springer, Cham, 2019)
7. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
8. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
9. S.G. Rabinovich, Measurement Errors and Uncertainty: Theory and Practice (Springer, New York, 2005)
10. V. Stahl, Interval methods for bounding the range of polynomials and solving systems of nonlinear equations. Ph.D. Dissertation, Johannes Kepler University, Linz, Austria (1995)
Monotonic Bit-Invariant Permutation-Invariant Metrics on the Set of All Infinite Binary Sequences Irina Perfilieva and Vladik Kreinovich
Abstract In a computer, all the information about an object is described by a sequence of 0s and 1s. At any given moment of time, we only have partial information, but as we perform more measurements and observations, we get a longer and longer sequence that provides a more and more accurate description of the object. In the limit, we get a perfect description by an infinite binary sequence. If two objects are similar, the measurement results are similar, so the resulting binary sequences are similar. Thus, to gauge the similarity of two objects, a reasonable idea is to define an appropriate metric on the set of all infinite binary sequences. Several such metrics have been proposed, but their limitation is that while the order of the bits is rather irrelevant—if we have several simultaneous measurements, we can place them in the computer in different orders—the distance measured by the current formulas changes if we select a different order. It is therefore natural to look for permutation-invariant metrics, i.e., distances that do not change if we select a different order. In this paper, we provide a full description of all such metrics. We also explain the limitation of these new metrics: they are, in general, not computable.
I. Perfilieva
Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, 30. dubna 22, 701 03 Ostrava 1, Ostrava, Czech Republic
e-mail: [email protected]

V. Kreinovich (B)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_64

1 Formulation of the Problem

Need for a metric. To gain knowledge about an object or a system, we perform measurements and observations. Nowadays, the results of both measurements and observations are stored in a computer, and in a computer, everything is represented as a binary sequence, i.e., as a sequence of 0s and 1s. Thus, our information about
an object can be represented as a potentially infinite binary sequence—potentially infinite, since, to gain more information, we can always perform more measurements. A natural way to describe the closeness between two objects is to compare the corresponding binary sequences: if two objects are similar, the measurement results should be similar, so the corresponding sequences should be close. To describe this similarity, it is therefore reasonable to have a metric on the set B = {0, 1}^N⁺ of all infinite binary sequences x = x1 x2 . . . xn . . .

Comment on notations. Here, we denoted:
• by N⁺ = {1, 2, . . .}, the set of all positive integers, and
• by A^B we—as usual—denote the set of all the functions from the set B to the set A.

What is a metric: reminder. What are the reasonable properties of this metric? To formulate these properties, let us first recall the usual definition of a metric.

Definition 1 Let X be a set. A metric is a mapping d : X × X → ℝ⁺₀ that assigns, to every pair (x, y) of elements of the set X, a non-negative number d(x, y), and that satisfies the following three properties for all elements x, y, and z:
• d(x, y) = 0 if and only if x = y;
• d(x, y) = d(y, x) (symmetry); and
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

First reasonable property: bit invariance. When we represent observation results in a computer, what is important is that for each bit, we have two different values. It is not important which of these values is associated with 1 and which with 0. For example, to describe whether a person is sick, we have the following two options:
• We can ask whether the person is sick. In this case, a sick person corresponds to the “yes” answer, i.e., to 1.
• Alternatively, we can ask whether the person is healthy. In this case, a sick person corresponds to the “no” answer, i.e., to 0.
Since the selection of 0 or 1 for each n is arbitrary, it is reasonable to require that the metric does not change if we simply swap some 0s and 1s.
Let us describe this property in precise terms.

Definition 2
• Let S ⊆ N⁺ be a subset of the set of all positive integers. For each infinite binary sequence x, by its S-swap BS(x), we mean the sequence y = y1 y2 . . . where:
– for n ∈ S, we have yn = 1 − xn, and
– for n ∉ S, we have yn = xn.
• We say that a metric d : B × B → ℝ⁺₀ is bit-invariant if for every set S ⊆ N⁺ and for every two sequences x, y ∈ B, we have
d(x, y) = d(BS (x), BS (y)).
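The S-swap, and the reason bit-invariance is automatic for metrics built from the per-position differences |xn − yn|, can be illustrated on finite truncations. The sketch below is only an illustration (it works with Python lists of bits, not actual infinite sequences; the helper name s_swap is ours):

```python
# S-swap on a finite truncation of a binary sequence: flip the bits
# whose (1-based) positions lie in S, leave the others unchanged.

def s_swap(bits, S):
    return [1 - b if n in S else b for n, b in enumerate(bits, start=1)]

x = [0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1]
S = {1, 3, 4}

X, Y = s_swap(x, S), s_swap(y, S)

# The swap preserves, position by position, whether the two sequences
# agree or disagree; this is why any metric built only from the
# per-position differences |xn - yn| is automatically bit-invariant.
assert [a == b for a, b in zip(x, y)] == [a == b for a, b in zip(X, Y)]
print(X, Y)
```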
Comment on notations. By ℝ⁺₀, we denote the set of all non-negative real numbers.

Second reasonable property: monotonicity. Suppose that we have three sequences x, y, and z such that z differs from x in all places in which y differs from x, and maybe in other places as well. In other words, z is “more different” from x than y. In this case, it is reasonable to require that the distance between x and z is larger—or at least the same, but not smaller—than the distance between x and y. Let us describe this reasonable property in precise terms.

Definition 3 We say that a metric d : B × B → ℝ⁺₀ is monotonic if for all triples of sequences (x, y, z) for which xn ≠ yn implies xn ≠ zn, we have d(x, y) ≤ d(x, z).

Examples of metrics that satisfy both properties. There are many metrics on the set B of all infinite binary sequences that are both bit-invariant and monotonic. The most well-known example (see, e.g., [1, 2]) is the metric

d(x, y) = Σ_{n=1}^∞ 2^{−n} · |xn − yn|.   (1)

It is also possible to have a more general family of bit-invariant and monotonic metrics

d(x, y) = Σ_{n=1}^∞ q^n · |xn − yn|   (2)
corresponding to different values q ∈ (0, 1).

Limitations of the known metrics. The problem with these metrics is related to the fact that the numbering of the bits in a binary sequence is rather arbitrary: if several sensors measured some quantities at the same time, we can place the results of the 1st sensor first, or the results of the 2nd sensor first, etc. From this viewpoint, the closeness d(x, y) of the two objects should not depend on which order we choose, i.e., should not change if we apply the same permutation to the sequences x and y. However, the above two metrics—and most other metrics—change if we permute the bits. For example, in the expression (1), the difference |x1 − y1| between the first bits enters with the coefficient 1/2, while the difference |x2 − y2| between the second bits enters with the coefficient 1/4. So, if we swap the first and the second bits, we get, in general, a different distance value.

Resulting problem and what we do in this paper. Because of the above limitation of the known metrics, it is desirable to describe permutation-invariant metrics. This is what we do in this paper.
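The permutation-sensitivity of the metric (1) is easy to confirm numerically on truncated sequences. In the sketch below (an illustration only; the helper d1 computes the sum (1) over the first few bits), swapping the first two bits of both sequences changes the distance:

```python
# Truncated version of metric (1): d(x, y) = sum over n of
# 2^(-n) * |xn - yn|, computed over the first len(x) bits only.

def d1(x, y):
    return sum(2 ** -(n + 1) * abs(a - b) for n, (a, b) in enumerate(zip(x, y)))

x = [1, 0, 0, 0]
y = [0, 0, 0, 0]

# Apply the same swap of bits 1 and 2 to both sequences:
x_swapped = [x[1], x[0]] + x[2:]
y_swapped = [y[1], y[0]] + y[2:]

print(d1(x, y))                  # 0.5: the single difference sits at position 1
print(d1(x_swapped, y_swapped))  # 0.25: the same difference moved to position 2
```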
2 Definitions and the Main Result

Definition 4
• By a permutation, we mean a 1-1 mapping π : N⁺ → N⁺ for which π(N⁺) = N⁺.
• By the result π(x) of applying a permutation π to an infinite sequence x = x1 x2 . . ., we mean the sequence xπ(1) xπ(2) . . .

Definition 5 We say that a metric d : B × B → ℝ⁺₀ is permutation-invariant if for every two sequences x and y and for every permutation π, we have d(π(x), π(y)) = d(x, y).

Definition 6
• We say that the sequences x and y have k different bits if there are exactly k indices n for which xn ≠ yn.
• We say that the sequences x and y have k common bits if there are exactly k indices n for which xn = yn.

Proposition 1 For a metric d on the set B of all infinite binary sequences, the following two conditions are equivalent to each other:
• the metric d is monotonic, bit-invariant, and permutation-invariant;
• there exist real numbers

0 = d≠0 ≤ d≠1 ≤ · · · ≤ d≠k ≤ · · · ≤ d∞ ≤ · · · ≤ d=k ≤ · · · ≤ d=2 ≤ d=1 ≤ d=0

for which

d≠k+ℓ ≤ d≠k + d≠ℓ and d=k−ℓ ≤ d=k + d≠ℓ   (3)

so that:
– when x and y have k different bits, then d(x, y) = d≠k;
– when x and y have k common bits, then d(x, y) = d=k; and
– when x and y have infinitely many different bits and infinitely many common bits, then d(x, y) = d∞.
Example. One can check that the desired inequalities are satisfied for d≠k = 1 − 2^{−k}, d∞ = 1, and d=k = 1 + 2^{−k}. Indeed:
• The first of the two inequalities (3) takes the form 1 − 2^{−(k+ℓ)} ≤ 1 − 2^{−k} + 1 − 2^{−ℓ}, i.e., if we denote a = 2^{−k} and b = 2^{−ℓ}, the form 1 − a · b ≤ 1 − a + 1 − b. If we subtract the left-hand side from both sides of this inequality, we get an equivalent inequality 0 ≤ 1 − a − b + a · b, i.e., 0 ≤ (1 − a) · (1 − b), which is, of course, always true, since a = 2^{−k} ≤ 1 and b = 2^{−ℓ} ≤ 1.
• The second of the two inequalities (3) takes the form 1 + 2^{−(k−ℓ)} ≤ 1 + 2^{−k} + 1 − 2^{−ℓ}, i.e., if we denote a = 2^{−(k−ℓ)} and b = 2^{−ℓ}, the form 1 + a ≤ 1 + a · b + 1 − b. If we subtract the left-hand side from both sides of this inequality, we get an equivalent inequality 0 ≤ 1 − a − b + a · b, i.e., 0 ≤ (1 − a) · (1 − b), which is always true.

Proof 1◦. One can easily check that the function d(x, y) described by the second condition is indeed a monotonic, bit-invariant, and permutation-invariant metric. Specifically, monotonicity and the invariances are easy to prove. The only thing that is not fully trivial to prove is that d, as a metric, satisfies the triangle inequality. Below, we provide the proof.

1.1◦. If:
• x differs from y in k places—so that d(x, y) = d≠k, and
• y differs from z in ℓ places—so that d(y, z) = d≠ℓ,
then x and z can differ in no more than k + ℓ places, so d(x, z) = d≠m for some m ≤ k + ℓ. Due to the inequalities between the values d≠k, we have d(x, z) = d≠m ≤ d≠k+ℓ, and due to (3), we have

d(x, z) ≤ d≠k+ℓ ≤ d≠k + d≠ℓ = d(x, y) + d(y, z),

exactly what we wanted to prove.

1.2◦. If:
• x coincides with y in k places—so that d(x, y) = d=k, and
• y differs from z in ℓ places—so that d(y, z) = d≠ℓ,
then, if k > ℓ, x and z coincide in at least k − ℓ places, so d(x, z) = d=m for some m ≥ k − ℓ. Due to the inequalities between the values d=k, we have d(x, z) = d=m ≤ d=k−ℓ, and due to (3), we have

d(x, z) ≤ d=k−ℓ ≤ d=k + d≠ℓ = d(x, y) + d(y, z),

exactly what we wanted to prove.
1.3◦. For the cases:
• when at least one of the pairs (x, y) or (y, z) has infinitely many coinciding and infinitely many differing indices, and
• when both pairs coincide in finitely many places,
the triangle inequality can be proven similarly.

2◦. Let us now show that, vice versa, every monotonic, bit-invariant, and permutation-invariant metric on the set of all infinite binary sequences has this form. Specifically, we will show that the metric has this form for the following values:

d≠k = d(00 . . . 0 . . . , 1 . . . 1 (k times) 0 . . . 0 . . .);
d=k = d(00 . . . 0 . . . , 0 . . . 0 (k times) 1 . . . 1 . . .); and
d∞ = d(00 . . . 0 . . . , 0101 . . . 01 . . .).

3◦. Let us first use bit-invariance. In general, when we apply the same S-swap BS to two sequences x and y, resulting in X = BS(x) and Y = BS(y), then for each n:
• if xn = yn, then Xn = Yn, and
• if xn ≠ yn, then Xn ≠ Yn.
Let us take, as the set S, the set of all indices n for which xn = 1. Then, the sequence X = BS(x) has the form X = 00 . . . 0 . . ., and for each n, Xn and Yn have equal or different values depending on whether xn and yn were equal or not. Due to bit-invariance, we have d(x, y) = d(X, Y).

4◦. Let us now use permutation-invariance, according to which d(X, Y) = d(π(X), π(Y)) for all permutations π.

4.1◦. If the sequences x and y differ in exactly k places, then X and Y also differ in exactly k places. Since the values Xn are all zeros, this means that the sequence Y is equal to 1 in exactly k places. We can thus perform a permutation π that moves these k places to the first k locations 1, …, k. This permutation does not change the all-zeros sequence X, so we have

d(x, y) = d(X, Y) = d(X, π(Y)) = d(00 . . . 0 . . . , 1 . . . 1 (k times) 0 . . . 0 . . .),

i.e., d(x, y) = d≠k.

4.2◦. If the sequences x and y coincide in exactly k places, then X and Y also coincide in exactly k places. Since the values Xn are all zeros, this means that the sequence
Y is equal to 0 in exactly k places. We can thus perform a permutation π that moves these k places to the first k locations 1, …, k. This permutation does not change the all-zeros sequence X, so we have

d(x, y) = d(X, Y) = d(X, π(Y)) = d(00 . . . 0 . . . , 0 . . . 0 (k times) 1 . . . 1 . . .),

i.e., d(x, y) = d=k.

4.3◦. Finally, if the sequences x and y differ in infinitely many places and coincide in infinitely many places, then X and Y also differ in infinitely many places and coincide in infinitely many places. Since the values Xn are all zeros, this means that the sequence Y has infinitely many 0s and infinitely many 1s. We can thus perform a permutation π that moves all the coinciding places into odd locations and all the differing places into even locations. This permutation does not change the all-zeros sequence X, so we have

d(x, y) = d(X, Y) = d(X, π(Y)) = d(00 . . . 0 . . . , 0101 . . . 01 . . .),

i.e., d(x, y) = d∞.

5◦. Let us now prove that the values d≠k and d=k satisfy the desired inequalities (3).

5.1◦. Let us first prove the first of the two inequalities (3). The triangle inequality implies that

d(00 . . . 0 . . . , 1 . . . 1 (k + ℓ times) 0 . . . 0 . . .) ≤ d(00 . . . 0 . . . , 1 . . . 1 (k times) 0 . . . 0 . . .) + d(1 . . . 1 (k times) 0 . . . 0 . . . , 1 . . . 1 (k + ℓ times) 0 . . . 0 . . .).   (4)

The first two distances in this inequality are d≠k+ℓ and d≠k, so

d≠k+ℓ ≤ d≠k + d(1 . . . 1 (k times) 0 . . . 0 . . . , 1 . . . 1 (k + ℓ times) 0 . . . 0 . . .).   (5)

Applying bit-invariance with S = {1, . . . , k}, we have

d(1 . . . 1 (k times) 0 . . . 0 . . . , 1 . . . 1 (k + ℓ times) 0 . . . 0 . . .) = d(0 . . . 0 (k times) 0 . . . 0 . . . , 0 . . . 0 (k times) 1 . . . 1 (ℓ times) 0 . . . 0 . . .).   (6)

By applying a permutation π that places the numbers from k + 1 to k + ℓ in the first ℓ positions, we conclude that

d(1 . . . 1 (k times) 0 . . . 0 . . . , 1 . . . 1 (k + ℓ times) 0 . . . 0 . . .) =
d(0 . . . 0 . . . , 1 . . . 1 (ℓ times) 0 . . . 0 . . .).   (7)

So, this term is equal to d≠ℓ, and the inequality (5) takes the desired form

d≠k+ℓ ≤ d≠k + d≠ℓ.
dk+ ≤ dk + d . 5.2◦ . Let us now prove the second of the two inequalities (3). Triangle inequality implies that d(00 . . . 0 . . . , 0 . . . 0 (k − times) 1 . . . 1 . . .) ≤ d(00 . . . 0 . . . , 0 . . . 0 (k times) 1 . . . 1 . . .) + d(0 . . . 0 (k times) 1 . . . 1 . . . , 0 . . . 0 (k − times) 1 . . . 1 . . .).
(8)
= and dk= , so The first two distances in this equality are dk− = ≤ dk= + d(0 . . . 0 (k times) 1 . . . 1 . . . , 0 . . . 0 (k − times) 1 . . . 1 . . .). (9) dk−
Applying bit-invariance with S = {k − + 1, k − + 2, . . .}, we have d(0 . . . 0 (k times) 1 . . . 1 . . . , 0 . . . 0 (k − times) 1 . . . 1 . . .) = d(0 . . . 0 (k − times) 1 . . . 1 ( times) 0 . . . 0 . . . , 0 . . . 0 . . .).
(10)
By applying a permutation π that places the numbers from k − ℓ + 1 to k in the first ℓ positions, we conclude that

d(0 . . . 0 (k − ℓ times) 1 . . . 1 (ℓ times) 0 . . . 0 . . . , 0 . . . 0 . . .) = d(1 . . . 1 (ℓ times) 0 . . . 0 . . . , 0 . . . 0 . . .).   (11)

So, this term is equal to d≠ℓ, and the inequality (9) takes the desired form

d=k−ℓ ≤ d=k + d≠ℓ.

The proposition is thus proven.
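As a sanity check, the inequalities (3) can be verified numerically for the example family d≠k = 1 − 2^{−k}, d∞ = 1, d=k = 1 + 2^{−k} given after Proposition 1; the brute-force sketch below is only an illustration (the function names are ours):

```python
# Brute-force check of the inequalities (3) for the example family
# d_neq(k) = 1 - 2^(-k), d_inf = 1, d_eq(k) = 1 + 2^(-k).

def d_neq(k):   # distance when x and y have exactly k different bits
    return 1 - 2 ** -k

def d_eq(k):    # distance when x and y have exactly k common bits
    return 1 + 2 ** -k

for k in range(0, 20):
    for l in range(0, 20):
        assert d_neq(k + l) <= d_neq(k) + d_neq(l)
        if k > l:
            assert d_eq(k - l) <= d_eq(k) + d_neq(l)

print("inequalities (3) hold for all tested k, l")
```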
3 Auxiliary Result: Why Do We Need Metrics Which Are Not Permutation-Invariant?

So why use the existing metrics? Since it is possible to have permutation-invariant metrics, why do we need metrics like (1) and (2) which are not permutation-invariant? It turns out that this need comes from yet another natural requirement: computability.

Additional requirement: computability. At any given moment of time, we have finitely many observations and measurements, i.e., we only know finitely many bits of the sequences x and y. Thus, we need to estimate the distance based on this information. A natural requirement is that, as we get more and more bits, we should get a closer and closer approximation to the actual distance d(x, y)—in other words, that for any desired accuracy ε, we should be able to find a natural number n so that, by knowing the first n bits of each of the two sequences x and y, we should be able to estimate the distance d(x, y) with accuracy ε. Let us formulate this requirement in precise terms.

Definition 7 We say that a metric d : B × B → ℝ⁺₀ is computable if for every positive real number ε > 0 there exists an integer n such that if x1 . . . xn = X1 . . . Xn and y1 . . . yn = Y1 . . . Yn, then |d(x, y) − d(X, Y)| ≤ ε.

Discussion. One can check that the usual, non-permutation-invariant metric (1) is computable: we can take n for which 2^{−n} ≤ ε, i.e., n ≥ − log2(ε). Similarly, each metric (2) is also computable. Let us prove, however, that a non-trivial permutation-invariant metric cannot be computable.

Definition 8 We say that a metric is trivial if there exists a constant d0 > 0 such that d(x, x) = 0 and d(x, y) = d0 for all x ≠ y.

Proposition 2 The only computable monotonic bit-invariant permutation-invariant metric on the set of all infinite binary sequences is the trivial metric.

Proof Let d(x, y) be a computable monotonic bit-invariant permutation-invariant metric on the set of all infinite binary sequences.
This metric has the form described in the formulation of Proposition 1. Let x and y be any two different sequences, and let ε > 0 be any positive real number. By the definition of computability, there exists an integer n for which, if x_1 … x_n = X_1 … X_n and y_1 … y_n = Y_1 … Y_n, then |d(x, y) − d(X, Y)| ≤ ε. Let us take:

• as X, an infinite sequence that continues the starting fragment x_1 … x_n of the sequence x with all zeros, i.e., X = x_1 … x_n 000 …, and
I. Perfilieva and V. Kreinovich
• as Y, an infinite sequence that continues the starting fragment y_1 … y_n of the sequence y with an infinite repetition of 01's, i.e., Y = y_1 … y_n 0101 … 01 …

Then, X and Y have infinitely many common bits and infinitely many different bits, so d(X, Y) = d_∞. Thus, we have |d(x, y) − d_∞| ≤ ε. This is true for every ε, so we have d(x, y) = d_∞. Thus, the metric d(x, y) is indeed trivial, with d_0 = d_∞. The proposition is proven.

Possible future work. In principle, instead of considering all possible permutations, as we did, we can consider only computable measure-preserving permutations; see, e.g., [3]. It would be interesting to analyze which metrics are invariant with respect to such permutations.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The work of Irina Perfilieva was supported by the European Regional Development Fund (ERDF) and the European Social Fund (ESF) via the project “Centre for the development of Artificial Intelligence Methods for the Automotive Industry of the region” No. CZ.02.1.01/0.0/0.0/17_049/0008414. The authors are thankful to all the participants of the 19th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU’2022 (Milan, Italy, July 11–15, 2022), especially to Alexander Šostak, for valuable discussions.
References

1. R. Bēts, A. Šostak, E.M. Miķelsons, Parameterized metrics and their applications in word combinatorics, in Proceedings of the 19th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems IPMU’2022, vol. 1, Milan, Italy, July 11–15 (2022), pp. 270–281
2. C.S. Calude, H. Jürgensen, L. Staiger, Topology on words. Theor. Comput. Sci. 410(24–25), 2323–2335 (2009)
3. M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, Berlin, 2008)
Computing the Range of a Function-of-Few-Linear-Combinations Under Linear Constraints: A Feasible Algorithm Salvador Robles, Martine Ceberio, and Vladik Kreinovich
Abstract In many practical situations, we need to find the range of a given function under interval uncertainty. For nonlinear functions—even for quadratic ones—this problem is, in general, NP-hard; however, feasible algorithms exist for many specific cases. In particular, recently a feasible algorithm was developed for computing the range of the absolute value of a Fourier coefficient under uncertainty. In this paper, we generalize this algorithm to the case when we have a function of a few linear combinations of inputs. The resulting algorithm also handles the case when, in addition to intervals containing each input, we also know that these inputs satisfy several linear constraints.
1 Formulation of the Problem

First case study: Fourier transform. In many application areas, an important data processing technique is the Fourier transform, which transforms, e.g., the values x_0, x_1, …, x_{n−1} of a certain quantity at several moments of time into the values

$$X_k = \sum_{i=0}^{n-1} x_i \cdot \exp\left(-\mathrm{i} \cdot \frac{2\pi \cdot k \cdot i}{n}\right),$$

where $\mathrm{i} \stackrel{\text{def}}{=} \sqrt{-1}$. These values are known as Fourier coefficients.
S. Robles · M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] S. Robles e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_65
Since exp(i · x) = cos(x) + i · sin(x), the value X_k can be written as X_k = A_k + i · B_k, where

$$A_k = \sum_{i=0}^{n-1} x_i \cdot \cos\left(\frac{2\pi \cdot k \cdot i}{n}\right) \quad\text{and}\quad B_k = -\sum_{i=0}^{n-1} x_i \cdot \sin\left(\frac{2\pi \cdot k \cdot i}{n}\right).$$

In addition to the real part A_k and the imaginary part B_k of each Fourier coefficient, it is also important to know the absolute value (modulus) $M_k \stackrel{\text{def}}{=} |X_k| = \sqrt{A_k^2 + B_k^2}$ of each Fourier coefficient.
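As a quick sanity check on these formulas, they can be evaluated directly; the sketch below (plain Python written for clarity rather than speed; all names are ours, not the paper's) computes A_k, B_k, and M_k and compares them with the complex-exponential definition of X_k.

```python
import cmath
import math

def fourier_coefficients(x, k):
    """Compute A_k, B_k, and M_k = sqrt(A_k^2 + B_k^2) for the sequence x."""
    n = len(x)
    a_k = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
    b_k = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
    m_k = math.sqrt(a_k ** 2 + b_k ** 2)
    return a_k, b_k, m_k

# Sanity check against X_k = sum_i x_i * exp(-i * 2*pi*k*i/n):
x = [1.0, 2.0, 0.5, -1.0]
for k in range(len(x)):
    a_k, b_k, m_k = fourier_coefficients(x, k)
    x_k = sum(x[i] * cmath.exp(-1j * 2 * math.pi * k * i / len(x))
              for i in range(len(x)))
    assert abs(x_k - complex(a_k, b_k)) < 1e-12
    assert abs(m_k - abs(x_k)) < 1e-12
```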
Need for interval uncertainty. The values x_i come from measurements, and measurements are never absolutely exact: the measurement result $\tilde{x}_i$ is, in general, different from the actual (unknown) value x_i of the corresponding quantity; see, e.g., [8]. In many practical situations, the only information that we have about the measurement error $\Delta x_i \stackrel{\text{def}}{=} \tilde{x}_i - x_i$ is the upper bound Δ_i on its absolute value: |Δx_i| ≤ Δ_i. In such situations, after the measurement, the only information that we have about the actual value x_i is that this value belongs to the interval $[\underline{x}_i, \overline{x}_i]$, where $\underline{x}_i = \tilde{x}_i - \Delta_i$ and $\overline{x}_i = \tilde{x}_i + \Delta_i$; see, e.g., [4, 6, 7].

Need to estimate the ranges under interval uncertainty. In general, processing data x_0, …, x_{n−1} means applying an appropriate algorithm f(x_0, …, x_{n−1}) to compute the desired value y = f(x_0, …, x_{n−1}). In the case of interval uncertainty, for different possible values $x_i \in [\underline{x}_i, \overline{x}_i]$ we have, in general, different possible values of y = f(x_0, …, x_{n−1}). It is therefore desirable to find the range of possible values of y:

$$[\underline{y}, \overline{y}] \stackrel{\text{def}}{=} \{ f(x_0, \ldots, x_{n-1}) : x_i \in [\underline{x}_i, \overline{x}_i] \text{ for all } i \}. \quad (1)$$
In particular, it is desirable to compute such ranges for the values A_k, B_k, and M_k corresponding to the Fourier transform.

Ranges of Fourier coefficients: what is known. The values A_k and B_k are linear functions of the quantities x_i, and for a linear function

$$y = c_0 + \sum_{i=0}^{n-1} c_i \cdot x_i,$$

the range is easy to compute: this range is equal to $[\tilde{y} - \Delta, \tilde{y} + \Delta]$, where

$$\tilde{y} = c_0 + \sum_{i=0}^{n-1} c_i \cdot \tilde{x}_i \quad\text{and}\quad \Delta = \sum_{i=0}^{n-1} |c_i| \cdot \Delta_i.$$
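The range formula for a linear function is straightforward to implement; the following sketch (a hypothetical helper, not taken from the paper) computes $[\tilde{y} - \Delta, \tilde{y} + \Delta]$ and verifies it against brute-force enumeration of all 2^n interval endpoints, which is exact for linear functions.

```python
from itertools import product

def linear_range(c0, c, boxes):
    """Range of y = c0 + sum_i c[i]*x[i] when x[i] lies in boxes[i] = (lo, hi)."""
    centers = [(lo + hi) / 2 for lo, hi in boxes]
    radii = [(hi - lo) / 2 for lo, hi in boxes]
    y_tilde = c0 + sum(ci * xi for ci, xi in zip(c, centers))
    delta = sum(abs(ci) * di for ci, di in zip(c, radii))
    return y_tilde - delta, y_tilde + delta

# A linear function attains its extreme values at interval endpoints,
# so the formula can be checked by enumerating all 2^n corners:
c0, c = 1.0, [2.0, -3.0, 0.5]
boxes = [(-1.0, 1.0), (0.0, 2.0), (-0.5, 0.5)]
corners = [c0 + sum(ci * xi for ci, xi in zip(c, pt))
           for pt in product(*boxes)]
lo, hi = linear_range(c0, c, boxes)
assert abs(lo - min(corners)) < 1e-12 and abs(hi - max(corners)) < 1e-12
```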
The problem of computing the range of M_k, or, what is equivalent, the range of its square M_k^2, is more complicated, since M_k^2 is a quadratic function of the inputs,
and, in general, for quadratic functions, the problem of computing the range under interval uncertainty is NP-hard; see, e.g., [5]. However, for the specific case of M_k^2, feasible algorithms, i.e., algorithms that compute the range in time limited by a polynomial of n, are known; see [2].

Need to take constraints into account. In addition to knowing the ranges $[\underline{x}_i, \overline{x}_i]$ for each quantity x_i, we also often know that the actual values x_i do not change too fast, i.e., e.g., that the consecutive values x_i and x_{i+1} cannot differ by more than some small value ε > 0: |x_{i+1} − x_i| ≤ ε, i.e., equivalently, −ε ≤ x_{i+1} − x_i ≤ ε.

Motivation: second case study. In the previous example, we had a nonlinear function, namely, the sum of two squares, applied to linear combinations of the inputs. There is another important case when a nonlinear function is applied to such a linear combination: data processing in an artificial neural network, where a nonlinear function s(z), known as the activation function, is applied to a linear combination $z = \sum_{i=0}^{n-1} w_i \cdot x_i + w_0$ of the inputs x_0, …, x_{n−1}, resulting in the output signal $y = s\left(\sum_{i=0}^{n-1} w_i \cdot x_i + w_0\right)$; see, e.g., [1, 3]. In this case, we may also be interested in
finding the range of the possible values of y when we know intervals of possible values of the inputs x_i, and maybe, as in the previous case, some additional constraints on the inputs.

General formulation of the problem. In both case studies, we have a function f(x_0, …, x_{n−1}) which has the form

$$f(x_0, \ldots, x_{n-1}) = F(y_1, \ldots, y_k), \quad (2)$$
where k is much smaller than n (this is usually denoted by k ≪ n) and each y_j is a linear combination of the inputs

$$y_j = \sum_{i=0}^{n-1} w_{j,i} \cdot x_i + w_{j,0}. \quad (3)$$
• In the Fourier coefficient case, k = 2, and F(y_1, y_2) = y_1^2 + y_2^2.
• In the case of a neuron, k = 1, and F(y_1) is the activation function.

We know the intervals $[\underline{x}_i, \overline{x}_i]$ of possible values of all the inputs x_i, and we may also know some linear constraints on the input values, i.e., constraints of the form

$$\sum_{i=0}^{n-1} c_{a,i} \cdot x_i \le c_{a,0}, \qquad c_{b,0} \le \sum_{i=0}^{n-1} c_{b,i} \cdot x_i, \qquad\text{or}\qquad \sum_{i=0}^{n-1} c_{d,i} \cdot x_i = c_{d,0}, \quad (4)$$
for some constants c_{a,i}, c_{b,i}, and c_{d,i}, and we want to find the range $[\underline{y}, \overline{y}]$ of the function (2)–(3) under all these constraints, i.e., under the interval constraints $x_i \in [\underline{x}_i, \overline{x}_i]$ and the additional constraints (4). In other words, we want to find the following interval:

$$[\underline{y}, \overline{y}] = \Bigl\{ F\Bigl(\sum_{i=0}^{n-1} w_{1,i} \cdot x_i + w_{1,0}, \; \ldots, \; \sum_{i=0}^{n-1} w_{k,i} \cdot x_i + w_{k,0}\Bigr) : x_i \in [\underline{x}_i, \overline{x}_i],$$
$$\sum_{i=0}^{n-1} c_{a,i} \cdot x_i \le c_{a,0}, \; c_{b,0} \le \sum_{i=0}^{n-1} c_{b,i} \cdot x_i, \; \sum_{i=0}^{n-1} c_{d,i} \cdot x_i = c_{d,0} \Bigr\}. \quad (5)$$
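To make the problem statement concrete, here is an exponentially slow brute-force baseline for estimating the range (5): sample the x-box on a grid, discard points violating the linear constraints, and track the min and max of f. This is only an illustration (all names are ours); avoiding this blow-up is exactly what the feasible algorithm of the next section is for.

```python
import itertools

def grid_range(F, weights, boxes, constraints, steps=40):
    """Brute-force estimate of the range (5): sample each interval on a grid,
    keep points satisfying every constraint g(x) <= 0, and track min/max of
    F applied to the linear combinations y_j = sum_i weights[j][i]*x[i]."""
    axes = [[lo + (hi - lo) * t / steps for t in range(steps + 1)]
            for lo, hi in boxes]
    lo = hi = None
    for x in itertools.product(*axes):
        if any(g(x) > 0 for g in constraints):
            continue
        ys = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
        val = F(ys)
        lo = val if lo is None else min(lo, val)
        hi = val if hi is None else max(hi, val)
    return lo, hi

# Example: k = 2 linear combinations y1 = x1 + x2 and y2 = x1 - x2,
# F(y1, y2) = y1^2 + y2^2, box [-1,1]^2, one constraint x1 + x2 <= 0.
lo, hi = grid_range(lambda y: y[0] ** 2 + y[1] ** 2,
                    [[1.0, 1.0], [1.0, -1.0]],
                    [(-1.0, 1.0), (-1.0, 1.0)],
                    [lambda x: x[0] + x[1]])
# Here F = 2*(x1^2 + x2^2), so the exact constrained range is [0, 4].
assert abs(lo - 0.0) < 1e-9 and abs(hi - 4.0) < 1e-9
```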
In this paper, we design a feasible algorithm that computes the desired range, under some reasonable conditions on the function F(y_1, …, y_k).
2 Analysis of the Problem and the Resulting Algorithm

What are reasonable conditions on the function F(y_1, …, y_k): discussion. We want to come up with a feasible algorithm for computing the desired range of the function f(x_0, …, x_{n−1}). In general, in computer science, feasible means that the computation time t should not exceed a polynomial of the size of the input, i.e., equivalently, that t ≤ v · n^p for some values v and p; otherwise, if this time grows faster, e.g., exponentially, then for reasonable values n, we will require computation times longer than the lifetime of the Universe; see, e.g., [5].

For computations with real numbers, it is also reasonable to require that the value of the function does not grow too fast, i.e., that it is bounded by a polynomial of the values of the inputs. It is also necessary to take into account that, in general, the value of a real-valued function can only be computed with some accuracy ε > 0, and that the inputs x_i can also only be determined with some accuracy δ > 0. Thus, it is also reasonable to require that:

• the time needed to compute the function with accuracy ε is bounded by some polynomial of ε (and of the values of the inputs), and
• the accuracy δ with which we need to know the inputs to compute the value of the function with the desired accuracy ε should also be bounded from below by some polynomial of ε (and of the values of the inputs).

To make sure that the function f(x_0, …, x_{n−1}) has these “regularity” properties, we need to restrict ourselves to functions F(y_1, …, y_k) that have similar regularity properties: otherwise, if even computing a single value F(y_1, …, y_k) is not feasible, we cannot expect computation of the range of this function to be feasible either. Thus, we arrive at the following definition.

Definition 1 Let $T \stackrel{\text{def}}{=} (v_F, p_F, v_a, p_a, q_a, t_c, p_c, q_c)$ be a tuple of real numbers. We say that a function F(y_1, …, y_k) is T-regular if the following conditions are satisfied:
Computing the Range of a Function-of-Few-Linear-Combinations …
455
• for all inputs y_j, we have $|F(y_1, \ldots, y_k)| \le v_F \cdot (\max_j |y_j|)^{p_F}$;
• for each ε > 0, if $|\tilde{y}_j - y_j| \le \delta \stackrel{\text{def}}{=} v_a \cdot (\max_j |y_j|)^{p_a} \cdot \varepsilon^{q_a}$ for all j, then $|F(\tilde{y}_1, \ldots, \tilde{y}_k) - F(y_1, \ldots, y_k)| \le \varepsilon$;
• there exists an algorithm that, given inputs y_j and ε > 0, computes the value F(y_1, …, y_k) with accuracy ε in time $T_F(y_1, \ldots, y_k) \le t_c \cdot (\max_j |y_j|)^{p_c} \cdot \varepsilon^{q_c}$.

Comment. One can easily check that the Fourier-related function F(y_1, y_2) = y_1^2 + y_2^2 is T-regular for an appropriate tuple T.

Definition 2 Let T be a tuple, let F(y_1, …, y_k) be a T-regular function, and let W, X, and ε be real numbers. By a problem of computing the range of a function-of-few-linear-combinations under linear constraints, we mean the following problem. Given:

• a function (2)–(3), where |w_{j,i}| ≤ W for all i and j,
• n intervals $[\underline{x}_i, \overline{x}_i]$ for which $|\underline{x}_i| \le X$ and $|\overline{x}_i| \le X$ for all i, and
• m linear constraints (4),

compute the ε-approximation to the range $[\underline{y}, \overline{y}]$ of this function over all tuples of values x_i from the given intervals that satisfy the given constraints.

Proposition For each tuple T, for each T-regular function, and for each selection of values W, X, and ε, there exists a feasible algorithm for computing the range of a function-of-few-linear-combinations under linear constraints.

Comment. In other words, we have an algorithm that finishes computations in time bounded by a polynomial of n and m.

Proof Since $|\underline{x}_i| \le X$ and $|\overline{x}_i| \le X$, we conclude that for all values $x_i \in [\underline{x}_i, \overline{x}_i]$, we have |x_i| ≤ X. Since |w_{j,i}| ≤ W, from the formula (3), we conclude that

$$|y_j| \le n \cdot W \cdot X + W,$$

hence max_j |y_j| ≤ n · W · X + W. To compute the value of the function F(y_1, …, y_k) with the desired accuracy ε, we need to know each y_j with accuracy δ proportional to (max_j |y_j|)^{p_a}. In view of the above estimate for max_j |y_j|, we need δ ∼ n^a, where a = p_a. We can divide each interval

$$[-(n \cdot W \cdot X + W), \; n \cdot W \cdot X + W]$$
of possible values of y_j into sub-intervals of size 2δ. There will be

$$\sim \frac{2(n \cdot W \cdot X + W)}{2\delta} \sim \frac{n}{n^{a}} = n^{1-a}$$

such subintervals. By combining the subintervals corresponding to each of the k variables y_j, we get ∼ (n^{1−a})^k = n^{(1−a)·k} boxes. Each side of each box has size 2δ. Thus, each value y_j from this side differs from its midpoint $\tilde{y}_j$ by no more than δ. So, by our choice of δ, for each point y = (y_1, …, y_k) from the box, the value F(y_1, …, y_k) differs from the value $F(\tilde{y}_1, \ldots, \tilde{y}_k)$ at the corresponding midpoint by no more than ε.

Hence, to find the desired range of the function f, it is sufficient to find the values $F(\tilde{y}_1, \ldots, \tilde{y}_k)$ at the midpoints of all the boxes that contain values y satisfying the constraints. Since each value of f is ε-close to one of these midpoint values, the largest possible value of f is ε-close to the largest of these midpoint values, and the smallest possible value of f is ε-close to the smallest of these midpoint values. Thus, once we know which boxes are possible and which are not, we will be able to compute both endpoints of the desired range with accuracy ε. There are no more than ∼ n^{(1−a)·k} such midpoint values, and computing each value requires ∼ n^{p_c} time. So, once we determine which boxes are possible and which are not, we will need computation time ∼ n^{(1−a)·k} · n^{p_c} = n^{(1−a)·k + p_c}.

How can we determine whether a box is possible? Each box [y_1^−, y_1^+] × ⋯ × [y_k^−, y_k^+] is determined by 2k linear inequalities y_j^− ≤ y_j and y_j ≤ y_j^+, j = 1, …, k. Substituting the expressions (3) into these inequalities, and combining them with the m constraints (4), we get 2k + m linear constraints that determine whether a box is possible: if all these constraints can be satisfied, then the box is possible; otherwise, the box is not possible. The problem of checking whether a system of linear constraints can be satisfied is known as linear programming.
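The box-feasibility test at the heart of this argument can be sketched with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog with a zero objective (our choice of tool, not something the paper prescribes), so that solver success simply means that the 2k + m linear constraints are simultaneously satisfiable.

```python
import numpy as np
from scipy.optimize import linprog

def box_is_possible(weights, y_lo, y_hi, A_ub, b_ub, bounds):
    """Check whether some x with x_i in bounds satisfies A_ub @ x <= b_ub
    and y_lo[j] <= sum_i weights[j][i]*x_i <= y_hi[j] for every j."""
    W = np.asarray(weights, dtype=float)
    # Stack the m extra constraints with the 2k box constraints W x <= y_hi
    # and -W x <= -y_lo, then solve a feasibility LP (zero objective).
    A = np.vstack([np.asarray(A_ub, dtype=float), W, -W])
    b = np.concatenate([np.asarray(b_ub, dtype=float),
                        np.asarray(y_hi, dtype=float),
                        -np.asarray(y_lo, dtype=float)])
    res = linprog(c=np.zeros(W.shape[1]), A_ub=A, b_ub=b,
                  bounds=bounds, method="highs")
    return res.status == 0  # 0 = optimum found, i.e., constraints satisfiable

# y = x1 + x2 with x in [-1,1]^2 and the extra constraint x1 + x2 <= 0:
w = [[1.0, 1.0]]
A_ub, b_ub = [[1.0, 1.0]], [0.0]
bounds = [(-1.0, 1.0), (-1.0, 1.0)]
assert box_is_possible(w, [-1.0], [-0.5], A_ub, b_ub, bounds)    # reachable
assert not box_is_possible(w, [0.5], [1.0], A_ub, b_ub, bounds)  # cut off
```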
There exist feasible algorithms for solving this problem; e.g., it can be solved in time ∼ (n + (2k + m)) · n^{1.5}; see, e.g., [9, 10]. Checking this for all ∼ n^{(1−a)·k} boxes requires time ∼ n^{(1−a)·k} · (n + 2k + m) · n^{1.5}.

The overall time for our algorithm consists of the checking time and the time for the actual computation, i.e., it is bounded by

$$\sim n^{(1-a)\cdot k} \cdot (n + m) \cdot n^{1.5} + n^{(1-a)\cdot k + p_c}.$$

This upper bound is polynomial in n and m. Thus, our algorithm is indeed feasible. The proposition is proven.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology. It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and by a grant from the Hungarian National Research, Development and Innovation Office (NRDI). The authors are thankful to Michael Beer and Nigel Ward for valuable discussions.
References

1. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006)
2. M. De Angelis, M. Behrendt, L. Comerford, Y. Zhang, M. Beer, Forward interval propagation through the Discrete Fourier Transform, in Proceedings of the International Workshop on Reliable Engineering Computing REC’2021, Taormina, Italy, May 17–20, ed. by A. Sofi, G. Muscolino, R.L. Muhanna (2021), pp. 39–52
3. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
4. L. Jaulin, M. Kieffer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
5. V. Kreinovich, A. Lakeyev, J. Rohn, P. Kahl, Computational Complexity and Feasibility of Data Processing and Interval Computations (Kluwer, Dordrecht, 1998)
6. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
7. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
8. S.G. Rabinovich, Measurement Errors and Uncertainties: Theory and Practice (Springer, New York, 2005)
9. G. Sierksma, Y. Zwols, Linear and Integer Optimization: Theory and Practice (CRC Press, Boca Raton, 2015)
10. P.M. Vaidya, Speeding-up linear programming using fast matrix multiplication, in Proceedings of the 30th IEEE Annual Symposium on Foundations of Computer Science FOCS’1989, Research Triangle Park, North Carolina, USA, October 30–November 1 (1989), pp. 332–337
How to Select a Representative Sample for a Family of Functions? Leobardo Valera, Martine Ceberio, and Vladik Kreinovich
Abstract Predictions are rarely absolutely accurate. Often, the future values of quantities of interest depend on some parameters that we only know with some uncertainty. To make sure that all possible solutions satisfy desired constraints, it is necessary to generate a representative finite sample, so that if the constraints are satisfied for all the functions from this sample, then we can be sure that these constraints will be satisfied for the actual future behavior as well. At present, such a sample is selected based on Monte-Carlo simulations, but, as we show, such a selection may underestimate the danger of violating the constraints. To avoid such an underestimation, we propose a different algorithm that uses interval computations.
1 Formulation of the Problem

Often, we only know a family of functions. One of the important objectives of science and engineering is to predict the behavior of different systems. Examples include predicting the trajectories of celestial bodies (including the trajectories of satellites), predicting weather, predicting how a building will react to a strong earthquake, etc. In many such situations, we know the differential equations that describe how the corresponding quantities change with time, and we can use these equations to make predictions.
L. Valera Department of Mathematical Sciences, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Ceberio and V. Kreinovich (eds.), Uncertainty, Constraints, and Decision Making, Studies in Systems, Decision and Control 484, https://doi.org/10.1007/978-3-031-36394-8_66
Sometimes, we can make (almost) exact predictions: e.g., we can predict the trajectories of celestial bodies hundreds of years into the future. In such cases, we predict how the values of the corresponding quantities x change with time t, i.e., we know the exact form of the dependence x(t).

In many other situations, however, the future values x(t) of these quantities are not uniquely determined by the available information: they also depend on the values of some quantities c_1, …, c_n which are not known exactly. In such situations, we know an algorithm that, given the values of the parameters c_i, returns the corresponding function x(t, c_1, …, c_n). We do not know the exact values of the parameters c_i, we only know the approximate values $\tilde{c}_i$. Based on these approximate values, we can find the approximate function

$$\tilde{x}(t) = x(t, \tilde{c}_1, \ldots, \tilde{c}_n). \quad (1)$$

Since the (unknown) actual values $c_i^{\mathrm{act}}$ of the corresponding parameters are, in general, different from the approximate values, the actual dependence

$$x^{\mathrm{act}}(t) = x\bigl(t, c_1^{\mathrm{act}}, \ldots, c_n^{\mathrm{act}}\bigr) \quad (2)$$

is, in general, different from the approximate dependence $\tilde{x}(t)$. How different these two dependencies can be depends on the accuracy of the approximations $\tilde{c}_i$.

Usually, we know the accuracy of the corresponding approximation, i.e., we know the upper bounds Δ_i on the absolute values of the approximation errors $\Delta c_i \stackrel{\text{def}}{=} \tilde{c}_i - c_i$, i.e., the bounds for which |Δc_i| ≤ Δ_i. (If we did not know such bounds, then we could not make any conclusion at all.) In this case, the actual value of the quantity c_i can take any value from the interval $[\tilde{c}_i - \Delta_i, \tilde{c}_i + \Delta_i]$.

For each combination of values c_i, we have, in general, a different dependence on time x(t) = x(t, c_1, …, c_n). In such situations, we do not know the exact dependence x(t), we only know a family of functions that contains the desired function:

$$F = \{ t \mapsto x(t, c_1, \ldots, c_n) : |\tilde{c}_i - c_i| \le \Delta_i \text{ for all } i \}. \quad (3)$$
Comment. A similar problem appears if we want to predict a field, i.e., the future values of a quantity at different moments of time and at different spatial locations. In this case, the formulas (and algorithms) are similar, the only difference is that in this case, t is not a single number but a tuple of numbers—e.g., the moment of time and the spatial coordinates. What we want. One of the reasons we want to make a prediction is to study how possible future conditions can affect our system. For example, we want to study how seismic waves from possible future earthquakes will affect the building that we are currently designing. For a chemical plant, we want to make sure that the concentration of the pollutants in the atmosphere does not exceed the tolerable micro-level, etc.
In all these cases, we want to check whether an appropriate numerical characteristic q(x(t)) of the solution x(t) stays within the desired bounds, or goes beyond the desired bounds, into the danger zone. Usually, there are several such characteristics q. For example, for a building, we want the stress in all critical locations not to exceed the desired threshold. For a chemical plant, we want the level of all possible pollutants not to exceed the desired level, etc.

It is desirable to have a representative sample. Usually, we have a simulator that can describe the effect of each possible function x(t), i.e., that estimates the corresponding values q(x(t)). The problem is that in situations with uncertainty, there are infinitely many possible values of each quantity c_i and thus, infinitely many possible functions x(t). We cannot test them all, so we need to select a representative sample. In other words, we want to select a finite list of functions x_1(t), …, x_L(t), each of which is either itself a possible solution or close to a possible solution, so that once we compute the value of the quantity q(x_ℓ(t)) on all these functions, the range

$$\bigl[\min_{\ell} q(x_\ell(t)), \; \max_{\ell} q(x_\ell(t))\bigr] \quad (4)$$

will give us a good approximation for the actual range

$$\{ Q(c_1, \ldots, c_n) : c_i \in [\tilde{c}_i - \Delta_i, \tilde{c}_i + \Delta_i] \}, \quad (5)$$

where we denoted $Q(c_1, \ldots, c_n) \stackrel{\text{def}}{=} q(x(t, c_1, \ldots, c_n))$.

Comment. We want to make sure that we do not underestimate the danger, i.e., that all possible deviations of the actual range (5) from the desired bounds are captured by the sample-based estimate (4) for this range. In other words, we want to make sure that the sample-based estimate (4) contains the actual range (5).

How this sample is usually selected: description and limitations. The usual way of selecting the sample is to choose, several times, random values of the parameters c_i from the corresponding intervals, e.g., values uniformly distributed on each interval. The problem with this method is that it often underestimates the effect. Indeed, we want to check, e.g., whether the designed building will remain stable for all possible values of the quantities c_i describing the earthquake, i.e., whether a certain quantity q depending on the function x(t) (and, thus, on the values of c_i) does not exceed a danger threshold. Usually, the deviations Δc_i are relatively small, so in the first approximation, we can expand the dependence of Q on $c_i = \tilde{c}_i - \Delta c_i$ in a Taylor series and only keep the linear terms in this expansion:

$$Q(c_1, \ldots, c_n) = Q(\tilde{c}_1 - \Delta c_1, \ldots, \tilde{c}_n - \Delta c_n) = \tilde{Q} - \sum_{i=1}^{n} Q_i \cdot \Delta c_i, \quad (6)$$
where we denoted $\tilde{Q} \stackrel{\text{def}}{=} Q(\tilde{c}_1, \ldots, \tilde{c}_n)$ and

$$Q_i \stackrel{\text{def}}{=} \left.\frac{\partial Q}{\partial c_i}\right|_{c_1 = \tilde{c}_1, \ldots, c_n = \tilde{c}_n}.$$

The expression (6) is monotonic in Δc_i, so its largest possible value when Δc_i ∈ [−Δ_i, Δ_i] is attained:

• when Δc_i = Δ_i for indices i for which Q_i ≥ 0 and
• when Δc_i = −Δ_i for indices i for which Q_i ≤ 0.

The resulting largest difference Δ between $\tilde{Q}$ and Q is thus equal to

$$\Delta = \sum_{i=1}^{n} |Q_i| \cdot \Delta_i. \quad (7)$$
On the other hand, if we use independent random values Δc_i, then, due to the Central Limit Theorem (see, e.g., [4]), for large n, the distribution of the sum (6) is close to Gaussian. For independent random variables, the mean is the sum of the means, and the variance is the sum of the variances. Each mean is 0, so the overall mean a is equal to 0 too. The variance of the uniform distribution on the interval [−Δ_i, Δ_i] is equal to (1/3) · Δ_i^2. Thus, the overall variance is equal to

$$V = \frac{1}{3} \cdot \sum_{i=1}^{n} Q_i^2 \cdot \Delta_i^2,$$

with the standard deviation σ equal to

$$\sigma = \sqrt{V} = \sqrt{\frac{1}{3} \cdot \sum_{i=1}^{n} Q_i^2 \cdot \Delta_i^2}. \quad (8)$$

For the Gaussian (normal) distribution with mean a and standard deviation σ, with high confidence, all the random values are within the 3-sigma interval [a − 3σ, a + 3σ]. So, with high confidence, all the values q generated by random simulations do not exceed 3σ. Herein lies a problem: in the simplest case when all the values Q_i are equal to 1 and Δ_1 = ⋯ = Δ_n:

• the formula (7) leads to Δ = n · Δ_c (where we denoted by Δ_c the common value of all the Δ_i), while,
• according to the formula (8), we have V = (1/3) · n · Δ_c^2 and thus,

$$3\sigma = 3 \cdot \sqrt{V} = \sqrt{3} \cdot \sqrt{n} \cdot \Delta_c.$$

For large n, we have $\sqrt{n} \ll n$, so this method indeed underestimates possible dangers.
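The gap between the worst-case bound (7) and the Monte-Carlo 3σ bound (8) is easy to see numerically. The sketch below uses the simplest case discussed above, Q_i = 1 and Δ_i = Δ_c for all i (the function names are ours, for illustration only).

```python
import math
import random

def worst_case(n, delta_c):
    # Formula (7) with Q_i = 1 and Delta_i = delta_c for all i.
    return n * delta_c

def three_sigma(n, delta_c):
    # Formula (8): V = (1/3) * n * delta_c^2, so 3*sigma = sqrt(3*n) * delta_c.
    return 3.0 * math.sqrt(n * delta_c ** 2 / 3.0)

n, delta_c = 100, 1.0
rng = random.Random(0)
# Monte-Carlo: sums of n independent uniform deviations on [-delta_c, delta_c].
samples = [sum(rng.uniform(-delta_c, delta_c) for _ in range(n))
           for _ in range(2000)]
largest_seen = max(abs(s) for s in samples)

# The simulated deviations stay far below the true worst case n*delta_c = 100:
assert largest_seen <= 2 * three_sigma(n, delta_c)
assert three_sigma(n, delta_c) < worst_case(n, delta_c)
```

Here 3σ ≈ 17.3 while the worst case is 100, so a sample built from random draws alone would report a much smaller danger zone than actually exists.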
Resulting problem. How to select a representative sample that would not underestimate possible dangers?
2 Analysis of the Problem

Analysis of the problem. In the linearized case (6), as we have mentioned, the largest and the smallest values of Q are attained when each parameter c_i is either equal to its largest possible value $c_i = \tilde{c}_i + \Delta_i$ or to its smallest possible value $c_i = \tilde{c}_i - \Delta_i$. The same property holds if we take into account that the linear expression (6) is only an approximation: in a generic point $(\tilde{c}_1, \ldots, \tilde{c}_n)$, all partial derivatives are different from 0 and thus, they are also different from 0 in a small vicinity of this point.

First idea. Thus, in principle, as the desired sample, we can take the functions x(t, c_1, …, c_n) corresponding to all possible combinations of the values $c_i = \tilde{c}_i \pm \Delta_i$.

Limitations of the first idea. For each parameter c_i, we need to select one of the two possible values, and we need to consider all possible combinations of n such selections, and there are 2^n such combinations. The above first idea can be implemented for small n, but in realistic situations when n is large, the number of combinations becomes astronomical and not realistic.

Second idea. Since we cannot consider extreme values of all n parameters c_i, a natural next idea is to select the k < n most important parameters, i.e., the parameters c_i for which the dependence on c_i, as expressed, e.g., by the mean square value

$$D_i \stackrel{\text{def}}{=} \int \bigl(x(t, \tilde{c}_1, \ldots, \tilde{c}_{i-1}, \tilde{c}_i + \Delta_i, \tilde{c}_{i+1}, \ldots, \tilde{c}_n) - \tilde{x}(t)\bigr)^2 \, dt, \quad (9)$$

is the largest. In other words, we compute the values D_i for all i, sort them in non-increasing order

$$D_1 \ge D_2 \ge \cdots \ge D_k \ge \cdots \ge D_n, \quad (10)$$

and select the parameters c_i corresponding to the first k terms in this order (10). Without losing generality, we can assume that the parameters c_i are already sorted in the non-increasing order of the corresponding values D_i. This way, we will only need to use 2^k combinations of the values $c_i = \tilde{c}_i \pm \Delta_i$ corresponding to i = 1, …, k. For each of the remaining parameters c_{k+1}, …, c_n, we have to use a fixed value, e.g., the value $c_j = \tilde{c}_j$.
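The second idea can be sketched as: rank the parameters by the sensitivity score D_i of formula (9) (here approximated by a discrete sum over time points), keep the top k, and emit the 2^k corner parameter vectors. All names and the toy model x(t, c) below are illustrative only, not taken from the paper.

```python
from itertools import product

def select_corner_sample(x, c_tilde, deltas, t_grid, k):
    """Rank parameters by D_i ~ sum_t (x(t, c~ + Delta_i e_i) - x(t, c~))^2,
    then return the 2^k corner parameter vectors for the k most influential
    parameters (remaining parameters frozen at their approximate values)."""
    n = len(c_tilde)
    base = [x(t, c_tilde) for t in t_grid]
    def score(i):
        bumped = list(c_tilde)
        bumped[i] += deltas[i]
        return sum((x(t, bumped) - b) ** 2 for t, b in zip(t_grid, base))
    top = sorted(range(n), key=score, reverse=True)[:k]
    corners = []
    for signs in product((-1.0, 1.0), repeat=k):
        c = list(c_tilde)
        for s, i in zip(signs, top):
            c[i] = c_tilde[i] + s * deltas[i]
        corners.append(c)
    return top, corners

# Toy model: x(t, c) = c1*t + c2*t^2 + 0.01*c3 (c3 barely matters here).
x = lambda t, c: c[0] * t + c[1] * t * t + 0.01 * c[2]
t_grid = [j / 10 for j in range(11)]
top, corners = select_corner_sample(x, [1.0, 1.0, 1.0], [0.5, 0.5, 0.5],
                                    t_grid, k=2)
assert set(top) == {0, 1} and len(corners) == 4
```

As the limitation below explains, these 2^k corners alone would still underestimate the effect of the frozen parameters; the final algorithm therefore adds interval bounds over c_{k+1}, …, c_n.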
Limitations of the second idea. By using the formula (6), we can conclude that, by using this sample, the largest possible difference between the sample-based values and the nominal value $\tilde{Q}$ is equal to

$$\Delta^{(k)} = \sum_{i=1}^{k} |Q_i| \cdot \Delta_i. \quad (11)$$

In general, this value is smaller than the largest possible value (7), since we ignore the terms corresponding to i = k + 1, …, n. Thus, this idea shares the same limitation as the traditional method: it underestimates the possible difference between the actual (unknown) value q and the nominal value $\tilde{q}$ and thus, underestimates the danger.

Towards the final idea. To avoid underestimation, for each combination of the parameter values $c_1 = \tilde{c}_1 \pm \Delta_1$, …, $c_k = \tilde{c}_k \pm \Delta_k$, we need to provide bounds on the values of the function x(t, c_1, …, c_k, c_{k+1}, …, c_n) corresponding to all possible values of

$$c_{k+1} \in [\tilde{c}_{k+1} - \Delta_{k+1}, \tilde{c}_{k+1} + \Delta_{k+1}], \; \ldots, \; c_n \in [\tilde{c}_n - \Delta_n, \tilde{c}_n + \Delta_n].$$

Techniques for providing such bounds are known as techniques of interval computations; see, e.g., [1–3]. Thus, we arrive at the following algorithm for selecting a representative sample.
3 Resulting Algorithm

What is given:

• an algorithm that, given n values c_i, returns a function x(t, c_1, …, c_n),
• approximate values $\tilde{c}_1$, …, $\tilde{c}_n$, and
• upper bounds Δ_1, …, Δ_n on the corresponding approximation errors.

Based on the approximate values $\tilde{c}_i$, we compute the approximate function $\tilde{x}(t) = x(t, \tilde{c}_1, \ldots, \tilde{c}_n)$.

What we want. We want to generate a finite list of functions x_1(t), …, x_L(t) with the following properties:

• each of these functions is close to one of the possible solutions x(t, c_1, …, c_n) for some $c_i \in [\tilde{c}_i - \Delta_i, \tilde{c}_i + \Delta_i]$, and
• for each characteristic q, the sample-based range (4) of the values of this characteristic contains the actual range (5), and is close to the actual range.
Preliminary step. For each i, we use the given algorithm to compute the function

$$x_{+i}(t) \stackrel{\text{def}}{=} x(t, \tilde{c}_1, \ldots, \tilde{c}_{i-1}, \tilde{c}_i + \Delta_i, \tilde{c}_{i+1}, \ldots, \tilde{c}_n)$$

and then compute the value

$$D_i = \int \bigl(x_{+i}(t) - \tilde{x}(t)\bigr)^2 \, dt.$$

We then sort the n parameters c_i in the non-increasing order of the values D_i. Without losing generality, we assume that D_1 ≥ D_2 ≥ ⋯ ≥ D_n.

Main step. We select some value k so that we will be able to generate 2 · 2^k functions. Then, for each of the 2^k combinations of signs ε = (ε_1, …, ε_k), where ε_i ∈ {−, +}, we apply interval computations to find an estimate $[\underline{X}_\varepsilon(t), \overline{X}_\varepsilon(t)]$ for the range of the function

$$x(t, \tilde{c}_1 + \varepsilon_1 \cdot \Delta_1, \ldots, \tilde{c}_k + \varepsilon_k \cdot \Delta_k, c_{k+1}, \ldots, c_n)$$

when

$$c_{k+1} \in [\tilde{c}_{k+1} - \Delta_{k+1}, \tilde{c}_{k+1} + \Delta_{k+1}], \; \ldots, \; c_n \in [\tilde{c}_n - \Delta_n, \tilde{c}_n + \Delta_n].$$

The resulting 2 · 2^k functions $\underline{X}_\varepsilon(t)$ and $\overline{X}_\varepsilon(t)$ form the desired list.

Comment. Since we use interval computations to take care of all possible values of c_{k+1}, …, c_n, we expect that the sample-based range will indeed contain the actual range (5) of each quantity q. For a sufficiently large k, the effect of the quantities c_{k+1}, …, c_n is small, so:

• each selected function $\underline{X}_\varepsilon(t)$ or $\overline{X}_\varepsilon(t)$ is close to the corresponding actual solution $x(t, \tilde{c}_1 + \varepsilon_1 \cdot \Delta_1, \ldots, \tilde{c}_k + \varepsilon_k \cdot \Delta_k, c_{k+1}, \ldots, c_n)$,
• and thus, the sample-based range should not differ too much from the actual range (5).

Acknowledgements This work was supported by:

• the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes),
• the AT&T Fellowship in Information Technology,
• the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478, and
• a grant from the Hungarian National Research, Development and Innovation Office (NRDI).
References

1. L. Jaulin, M. Kieffer, O. Didrit, E. Walter, Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control, and Robotics (Springer, London, 2001)
2. G. Mayer, Interval Analysis and Automatic Result Verification (de Gruyter, Berlin, 2017)
3. R.E. Moore, R.B. Kearfott, M.J. Cloud, Introduction to Interval Analysis (SIAM, Philadelphia, 2009)
4. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman and Hall/CRC, Boca Raton, 2011)