Studies in Systems, Decision and Control 306
Martine Ceberio Vladik Kreinovich Editors
How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies
Studies in Systems, Decision and Control Volume 306
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Systems, Decision and Control” (SSDC) covers both new developments and advances, as well as the state of the art, in the various areas of broadly perceived systems, decision making and control–quickly, up to date and with a high quality. The intent is to cover the theory, applications, and perspectives on the state of the art and future developments relevant to systems, decision making, control, complex processes and related areas, as embedded in the fields of engineering, computer science, physics, economics, social and life sciences, as well as the paradigms and methodologies behind them. The series contains monographs, textbooks, lecture notes and edited volumes in systems, decision making and control spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/13304
Editors Martine Ceberio Department of Computer Science University of Texas at El Paso El Paso, TX, USA
Vladik Kreinovich Department of Computer Science University of Texas at El Paso El Paso, TX, USA
ISSN 2198-4182 ISSN 2198-4190 (electronic) Studies in Systems, Decision and Control ISBN 978-3-030-65323-1 ISBN 978-3-030-65324-8 (eBook) https://doi.org/10.1007/978-3-030-65324-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In some situations, we have perfect knowledge of the corresponding processes. Based on this knowledge and on the current and past measurement results, we can predict the future state of the system and, if needed, decide on the best way to control the system so as to get to the desired future state. This is the situation, e.g., in describing the trajectory of a spaceship.

In many other application areas, however, we do not have perfect knowledge of the situation, so we have to rely on empirically determined dependencies, dependencies that often have no convincing theoretical explanations. Because of this lack of explanations, practitioners are somewhat reluctant to use these dependencies: without convincing theoretical explanations, these empirical dependencies may (and sometimes do) turn out to be accidental coincidences which only hold for some cases and are not satisfied in other situations. It is therefore desirable to come up with a theoretical explanation of these dependencies.

The problem of finding and explaining empirical dependencies is made more complex by the fact that measurements and observations come with uncertainty, as a result of which the measurement results provide only an approximate picture of the corresponding processes. In this book, we collected several papers explaining how we can take this uncertainty into account when providing the desired theoretical explanations.

Most of these papers have been presented, in different years, at annual International Workshops on Constraint Programming and Decision Making and at other related workshops that we were organizing. This volume presents extended versions of selected papers from these workshops.
These papers provide explanations of empirical dependencies in different application areas:

• decision making [1], where the authors explain how the status quo bias is useful,
• electrical engineering [3], where the authors explain why class D audio amplifiers work so well,
• image processing [15], where the authors explain why representing contour segments by fragments of straight lines, circles, and hyperbolas works well,
• geological sciences [6, 7], where the authors explain the ubiquity of the gamma distribution in seismology and the observed absence of large triggered earthquakes,
• logic [10, 19, 20], where the authors explain why we use “and”, “or”, and “not” and not other operations, how to describe the degree of implication, and why either “and”- or “or”-operations are approximate,
• pedagogy [5, 18], where the authors explain why we all have similar learning potential and explain the empirical effect of repetitions on short-term and long-term learning,
• physics [2, 11, 12], where the authors explain quark confinement, the presence of galaxy superclusters, and the minimum entropy production principle of non-equilibrium thermodynamics,
• psychology [4, 9, 13, 14], where the authors explain why we use certain number systems, why we use 3 basic colors and 4 basic tastes, how to avoid the echo chamber phenomenon, and how to achieve family happiness,
• quantum computing [8], where the authors provide one more explanation of the effectiveness of quantum computing, and
• transportation engineering [16, 17], where the authors explain empirical formulas describing the current strength of the road pavement and how this strength deteriorates with time.

We are greatly thankful to all the authors and referees, and to all the participants of the CoProd and other workshops. Our special thanks to Prof. Janusz Kacprzyk, the editor of this book series, for his support and help. Thanks to all of you!

El Paso, USA
March 2021
Martine Ceberio Vladik Kreinovich
References

1. Acosta, G., Smith, E., Kreinovich, V.: Status Quo Bias Actually Helps Decision Makers to Take Nonlinearity into Account: An Explanation. this volume
2. Acosta, G., Smith, E., Kreinovich, V.: A Natural Explanation for the Minimum Entropy Production Principle. this volume
3. Alvarez, K., Urenda, J.C., Kreinovich, V.: Why Class-D Audio Amplifiers Work Well: A Theoretical Explanation. this volume
4. Bokati, L., Kosheleva, O., Kreinovich, V.: How Can We Explain Different Number Systems? this volume
5. Bokati, L., Urenda, J., Kosheleva, O., Kreinovich, V.: Why Immediate Repetition Is Good for Short-Term Learning Results but Bad For Long-Term Learning: Explanation Based on Decision Theory. this volume
6. Bokati, L., Velasco, A., Kreinovich, V.: Absence of Remotely Triggered Large Earthquakes: A Geometric Explanation. this volume
7. Bokati, L., Velasco, A., Kreinovich, V.: Why Gamma Distribution of Seismic Inter-Event Times: A Theoretical Explanation. this volume
8. Ceberio, M., Kreinovich, V.: Quantum Computing as a Particular Case of Computing With Tensors. this volume
9. Kosheleva, O., Kreinovich, V.: A ‘Fuzzy’ Like Button Can Decrease Echo Chamber Effect. this volume
10. Kosheleva, O., Kreinovich, V.: Intuitive Idea of Implication vs. Formal Definition: How to Define the Corresponding Degree. this volume
11. Kreinovich, V.: Dimension Compactification – a Possible Explanation for Superclusters and for Empirical Evidence Usually Interpreted as Dark Matter. this volume
12. Kreinovich, V.: Fundamental Properties of Pair-Wise Interactions Naturally Lead to Quarks and Quark Confinement: A Theorem Motivated by Neural Universal Approximation Results. this volume
13. Kreinovich, V.: Linear Neural Networks Revisited: From PageRank to Family Happiness. this volume
14. Kreinovich, V.: Why 3 Basic Colors? Why 4 Basic Tastes? this volume
15. Kreinovich, V., Quintana, C.: What Segments Are the Best in Representing Contours? this volume
16. Rodriguez Velazquez, E.D., Kreinovich, V.: Strength of Lime Stabilized Pavement Materials: Possible Theoretical Explanation of Empirical Dependencies. this volume
17. Rodriguez Velazquez, E.D., Kreinovich, V.: Towards a Theoretical Explanation of How Pavement Condition Index Deteriorates over Time. this volume
18. Servin, C., Kosheleva, O., Kreinovich, V.: A Recent Result about Random Metrics Explains Why All of Us Have Similar Learning Potential. this volume
19. Urenda, J., Kosheleva, O., Kreinovich, V.: Finitely Generated Sets of Fuzzy Values: If ‘And’ Is Exact, Then ‘Or’ Is Almost Always Approximate, and Vice Versa. this volume
20. Urenda, J., Kosheleva, O., Kreinovich, V.: Fuzzy Logic Explains the Usual Choice of Logical Operations in 2-Valued Logic. this volume
Contents
Status Quo Bias Actually Helps Decision Makers to Take Nonlinearity into Account: An Explanation
Griselda Acosta, Eric Smith, and Vladik Kreinovich

A Natural Explanation for the Minimum Entropy Production Principle
Griselda Acosta, Eric Smith, and Vladik Kreinovich

Why Class-D Audio Amplifiers Work Well: A Theoretical Explanation
Kevin Alvarez, Julio C. Urenda, and Vladik Kreinovich

How Can We Explain Different Number Systems?
Laxman Bokati, Olga Kosheleva, and Vladik Kreinovich

Why Immediate Repetition Is Good for Short-Time Learning Results but Bad for Long-Time Learning: Explanation Based on Decision Theory
Laxman Bokati, Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich

Absence of Remotely Triggered Large Earthquakes: A Geometric Explanation
Laxman Bokati, Aaron Velasco, and Vladik Kreinovich

Why Gamma Distribution of Seismic Inter-Event Times: A Theoretical Explanation
Laxman Bokati, Aaron Velasco, and Vladik Kreinovich

Quantum Computing as a Particular Case of Computing with Tensors
Martine Ceberio and Vladik Kreinovich

A “Fuzzy” Like Button Can Decrease Echo Chamber Effect
Olga Kosheleva and Vladik Kreinovich

Intuitive Idea of Implication Versus Formal Definition: How to Define the Corresponding Degree
Olga Kosheleva and Vladik Kreinovich

Dimension Compactification—A Possible Explanation for Superclusters and for Empirical Evidence Usually Interpreted as Dark Matter
Vladik Kreinovich

Fundamental Properties of Pair-Wise Interactions Naturally Lead to Quarks and Quark Confinement: A Theorem Motivated by Neural Universal Approximation Results
Vladik Kreinovich

Linear Neural Networks Revisited: From PageRank to Family Happiness
Vladik Kreinovich

Why 3 Basic Colors? Why 4 Basic Tastes?
Vladik Kreinovich

What Segments Are the Best in Representing Contours?
Vladik Kreinovich and Chris Quintana

Strength of Lime Stabilized Pavement Materials: Possible Theoretical Explanation of Empirical Dependencies
Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich

Towards a Theoretical Explanation of How Pavement Condition Index Deteriorates over Time
Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich

A Recent Result About Random Metric Spaces Explains Why All of Us Have Similar Learning Potential
Christian Servin, Olga Kosheleva, and Vladik Kreinovich

Finitely Generated Sets of Fuzzy Values: If “And” Is Exact, Then “Or” Is Almost Always Approximate, and Vice Versa—A Theorem
Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich

Fuzzy Logic Explains the Usual Choice of Logical Operations in 2-Valued Logic
Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich
Status Quo Bias Actually Helps Decision Makers to Take Nonlinearity into Account: An Explanation Griselda Acosta, Eric Smith, and Vladik Kreinovich
Abstract One of the main motivations for designing computer models of complex systems is to come up with recommendations on how to best control these systems. Many complex real-life systems are so complicated that it is not computationally possible to use realistic nonlinear models to find the corresponding optimal control. Instead, researchers make recommendations based on simplified (e.g., linearized) models. The recommendations based on these simplified models are often not realistic but, interestingly, they can be made more realistic if we “tone them down”, i.e., consider predictions and recommendations which are close to the current status quo state. In this paper, we analyze this situation from the viewpoint of general system analysis. This analysis explains the above empirical phenomenon: namely, we show that this “status quo bias” indeed helps decision makers to take nonlinearity into account.
G. Acosta
Department of Electrical and Computer Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

E. Smith
Department of Industrial, Manufacturing, and Systems Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

V. Kreinovich (corresponding author)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_1

1 Formulation of the Problem

Real-life problems. In his presentation [1] at the 2019 World Congress of the International Fuzzy Systems Association (IFSA), Professor Kacprzyk recalled his experience of optimizing large systems (like an economic region) at the International Institute for Applied Systems Analysis (IIASA) in Laxenburg, Austria.

Linearization is needed. Of course, many dependencies in complex real-life systems are non-linear. However, even with modern computers, optimizing a complex system under nonlinear constraints would require an unrealistic amount of computation time. As a result, in the optimization, most processes in a real-life system were approximated by linear models.

Comment. To be more precise, when analyzing the effect of one specific strategy on a system, we can afford to take non-linearity into account. However, when we need to solve an optimization problem of selecting the optimal strategy and/or optimal parameters of such a strategy, we have to use linearization.

Recommendations based on the linearized model were often not realistic. Not surprisingly, since the optimization process involved simplification of the actual system, recommendations based on the resulting simplified model were often not realistic and could not be directly implemented. This was not just a subjective feeling: when the researchers tested, on the nonlinear model, the effect of a strategy selected based on linearization, the results were often not so good.

Status quo bias helped. One of the reasons that people listed for being reluctant to accept the center’s recommendations was that these recommendations differed too much from what they expected. This phenomenon of unwillingness to follow recommendations if they are too far away from the status quo is known as the status quo bias; see, e.g., [2, 3]. Interestingly, when the center’s researchers “toned down” their recommendations by making them closer to the status quo, the resulting recommendations led to much better results (e.g., as tested on the nonlinear models).
In other words, the toning down, which corresponds to what we understand as the status quo bias, actually improved the decisions in comparison to simply using recommendations based on the simplified linear models. Thus, the status quo bias somehow takes non-linearity into account and is, thus, not a deviation from optimal decision making (as the word “bias” suggests) but rather a reasonable way to come up with a better decision.

But why? The phenomenon described above seems mysterious. Why would getting closer to the status quo lead to a better solution? In this paper, we analyze this phenomenon from the general system approach. Our analysis allows us to explain why the status quo bias indeed helps to take some nonlinearity into account.
2 Analysis of the Problem and the Resulting Explanation

A general description of a system: a brief reminder. In general, the state of a system at each moment of time $t$ can be described by listing the values $x_1(t), \ldots, x_n(t)$ of all the quantities $x_1, \ldots, x_n$ that characterize this system. Similarly, the change in the system can be described by differential equations that explain how the value of each of these quantities changes with time:

$$\frac{dx_i(t)}{dt} = f_i(x_1(t), \ldots, x_n(t)), \quad i = 1, \ldots, n.$$

Here the expressions $f_i(x_1, \ldots, x_n)$ describe how the rate of change in each of the quantities depends on the state $x = (x_1, \ldots, x_n)$; in general, the expressions $f_i(x_1, \ldots, x_n)$ are non-linear. In particular, in the simplest case when we use the value of only one quantity $x_1$ to describe the state, we get the equation

$$\frac{dx_1(t)}{dt} = f_1(x_1(t)).$$

What happens when we linearize. When we linearize the description of the system, we replace the nonlinear functions $f_i(x_1, \ldots, x_n)$ by their linear approximations

$$f_i(x_1, \ldots, x_n) \approx a_i + \sum_{j=1}^{n} a_{ij} \cdot x_j.$$

In particular, in the case when we use only one quantity $x_1$, we get the following approximate equation:

$$\frac{dx_1}{dt} = a_1 + a_{11} \cdot x_1.$$

In this case (assuming $a_{11} \neq 0$), for the auxiliary variable $y_1 \stackrel{\rm def}{=} x_1 + \dfrac{a_1}{a_{11}}$, we get

$$\frac{dy_1}{dt} = a_{11} \cdot y_1.$$

The solution to this simple differential equation is well known: it is $y_1(t) = y_1(0) \cdot \exp(a_{11} \cdot t)$ and thus,

$$x_1(t) = y_1(t) - \frac{a_1}{a_{11}} = y_1(0) \cdot \exp(a_{11} \cdot t) - \frac{a_1}{a_{11}}.$$
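As a quick sanity check of this closed-form solution, the following minimal sketch compares it with a crude numerical integration of the same linear equation. The function names and the coefficient values are hypothetical, chosen only for illustration; they do not come from the text.

```python
import math

def linear_solution(x0, a1, a11, t):
    """Closed-form solution of dx/dt = a1 + a11*x with x(0) = x0 (needs a11 != 0)."""
    shift = a1 / a11                 # y = x + a1/a11 satisfies dy/dt = a11*y
    return (x0 + shift) * math.exp(a11 * t) - shift

def euler_solution(x0, a1, a11, t, steps=100_000):
    """Forward-Euler integration of the same equation, used only as a cross-check."""
    dt = t / steps
    x = x0
    for _ in range(steps):
        x += dt * (a1 + a11 * x)
    return x

# Hypothetical coefficients and time horizon (illustration only).
x0, a1, a11, t = 1.0, 0.1, 0.8, 2.0
closed = linear_solution(x0, a1, a11, t)
numeric = euler_solution(x0, a1, a11, t)
print(f"closed form: {closed:.5f}, Euler: {numeric:.5f}")
```

The two values should agree to within the discretization error of the Euler scheme, which shrinks as `steps` grows.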
In situations when the value of $x_1$ (and thus, of $y_1$) decreases, we have $a_{11} < 0$. In such situations, the value $y_1$ decreases to 0. Such things happen. However, in situations when we want to describe growth, i.e., when $a_{11} > 0$, we get exponential growth. Exponential growth may be a good approximation for some period of time, but eventually the predicted values start growing too fast to be realistic. For example, in good times, economies grow, but we do not expect, e.g., production of meat to grow thousands of times. Similarly, pollution grows, or, on a somewhat less negative side, populations grow, but we do not expect the thousands-fold increases predicted by the exponential models.

This phenomenon of models growing too fast is not limited to the case when the system is described by only one variable. In general, a solution to a system of linear differential equations with constant coefficients is a linear combination of oscillatory terms and exponential terms. So, if we describe growth, the models will make it unrealistically exponential.

How to make conclusions more realistic. Linear models are not realistic: the deviations of these models’ predictions from the current values become too large to be realistic. Thus, a natural way to make the models more realistic is to take this phenomenon into account, i.e., instead of the results of the linearized models, consider states which are closer to the original state $x(0) = (x_1(0), \ldots, x_n(0))$.
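The qualitative argument above can be illustrated with a small numerical sketch. Everything specific in it is an assumption made for illustration and is not taken from the text: the "true" nonlinear model is assumed to be logistic growth, the parameter values are arbitrary, and "toning down" is modeled, hypothetically, as moving only a fraction `weight` of the way from the status quo $x(0)$ toward the linearized prediction.

```python
import math

# Assumed "true" nonlinear model (not from the text): logistic growth
#   dx/dt = f(x) = r*x*(1 - x/K)
r, K, x0 = 1.0, 10.0, 1.0

def f(x):
    return r * x * (1.0 - x / K)

def f_prime(x):
    return r * (1.0 - 2.0 * x / K)

def true_state(t):
    """Exact solution of the logistic equation with x(0) = x0."""
    return K / (1.0 + (K / x0 - 1.0) * math.exp(-r * t))

# Linearization f(x) ~ a1 + a11*x around the current state x0.
a11 = f_prime(x0)
a1 = f(x0) - a11 * x0

def linear_state(t):
    """Solution of the linearized equation dx/dt = a1 + a11*x, x(0) = x0."""
    shift = a1 / a11
    return (x0 + shift) * math.exp(a11 * t) - shift

def toned_down(t, weight):
    """Hypothetical 'toning down': go only a fraction of the way from x0."""
    return x0 + weight * (linear_state(t) - x0)

t = 5.0
for name, value in [("logistic (assumed truth)", true_state(t)),
                    ("linearized", linear_state(t)),
                    ("toned down, weight 0.2", toned_down(t, 0.2))]:
    print(f"{name:>26}: {value:8.2f}")
```

With these assumed values, the linearized prediction overshoots the saturation level `K`, while the toned-down value lies closer to the logistic state. A different weight or time horizon could change the picture, so this is only an illustration of the argument, not a proof of it.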
Conclusion. This idea of considering the states which are closer to the original state than the model suggests is exactly what the status quo bias is about. Thus, indeed, the status quo bias helps to make models more realistic. The unrealistic character of the linearized model’s recommendation is caused by the fact that this model is only approximate: it ignores nonlinear terms. So, by making recommendations more realistic, the status quo bias, in effect, helps us to take nonlinearity into account.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References

1. Kacprzyk, J.: Cognitive biases in choice and decision making: a potential role of fuzzy logic. In: Proceedings of the Joint World Congress of International Fuzzy Systems Association and Annual Conference of the North American Fuzzy Information Processing Society IFSA/NAFIPS’2019, Lafayette, Louisiana, June 18–21 (2019)
2. Kahneman, D., Knetsch, J.L., Thaler, R.H.: Anomalies: the endowment effect, loss aversion, and status quo bias. J. Econ. Perspect. 5(1), 193–206 (1991)
3. Samuelson, W., Zeckhauser, R.: Status quo bias in decision making. J. Risk Uncertain. 1, 7–59 (1988)
A Natural Explanation for the Minimum Entropy Production Principle Griselda Acosta, Eric Smith, and Vladik Kreinovich
Abstract It is well known that, according to the second law of thermodynamics, the entropy of a closed system increases (or at least stays the same). In many situations, this increase is the smallest possible. The corresponding minimum entropy production principle was first formulated and explained by the future Nobelist Ilya Prigogine. Since then, many possible explanations of this principle have appeared, but all of them are very technical, based on complex analysis of differential equations describing the system’s dynamics. Since this phenomenon is ubiquitous for many systems, it is desirable to look for a general system-based explanation, an explanation that would not depend on the specific technical details. Such an explanation is presented in this paper.
1 Formulation of the Problem Minimum entropy production principle. It is well known that, according to the second law of thermodynamics, the entropy of any closed system—including the Universe as a whole—cannot decrease, it can only either increase or stay the same; see, e.g., [3, 23].
G. Acosta
Department of Electrical and Computer Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

E. Smith
Department of Industrial, Manufacturing, and Systems Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

V. Kreinovich (corresponding author)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_2
It is somewhat less well known that in many situations, this entropy increase is the smallest possible; this fact is known as the minimum entropy production principle. This principle was first formulated in 1945 by the future Nobelist Ilya Prigogine [20]; see also [6, 8, 9, 16–18, 21].

In contrast to the second law of thermodynamics (which is always true), the minimum entropy production principle is not always valid (see, e.g., [7]), but it is still valid in many practical situations. In particular, it explains why usually, a symmetric state, when perturbed, does not immediately turn into a state with no symmetries at all; usually, some symmetries are preserved, and the more symmetries are preserved, the more frequent are such transitions. For example, when heated, a highly symmetric solid-body state usually does not immediately turn into a completely symmetry-less gas state; it first transitions into a liquid state in which some symmetries are preserved. Sometimes, a solid state does turn directly into gas: e.g., dry ice used to keep ice cream cold goes directly into a gas state without becoming a liquid. However, usually, symmetries are broken sequentially, not all at once. This seemingly simple idea explains many physical phenomena: e.g., it explains the observable shapes of celestial bodies, the relative frequency of different shapes, and how shapes change with time; see, e.g., [4, 5, 15].

Challenge: to provide a simple explanation for the minimum entropy production principle. While the principle itself sounds reasonable, all its available derivations are very technical and non-intuitive. Usually, in physics, no matter how complex the corresponding equations, there is a reasonably simple explanation (at least a qualitative one) of the observed phenomena [3, 23]. However, for the minimum entropy production principle, such an explanation has been lacking.

What we do in this paper. In this paper, we provide a general system-based explanation for the ubiquity of the minimum entropy production principle, an explanation which, unlike the existing ones, uses only simple easy-to-understand math. In this explanation, we first analyze how complex problems are solved, and then we explain how this analysis helps explain the minimum entropy production principle.
2 How Complex Problems Are Solved: Reminder and Related Analysis

NP-complete problems: a brief reminder. As we have mentioned, our explanation for the minimum entropy production principle starts not with physics, but with the known fact that in real life, we need to solve complex problems:

• we may need to find a path that leads from point A to point B,
• a mechanic needs to find a way to repair a broken car,
• a medical doctor needs to cure the patients.
In most such problems, it may be difficult to come up with a solution, but once we have a candidate for a solution, we can relatively easily check whether this is indeed a solution. For example, it may be difficult to find a way to repair a car, but if we follow some sequence of actions and the car starts running, we clearly have a solution; otherwise, if the car does not start running, the sequence is not a solution.

The class of all such problems, i.e., problems in which we can, in reasonable (“feasible”) time, check whether a given candidate for a solution is indeed a solution, is known as the class NP. Within this class, there is a subclass of all the problems that can be solved in reasonable time. This subclass is usually denoted by P; see, e.g., [12, 19] for details. Most computer scientists believe that there are problems in NP that cannot be solved in reasonable time, i.e., that P is different from NP; however, this has never been proven, and it is still an open problem.

What is known is that in the class NP, there are problems which are as hard as possible, in the sense that all other problems from this class can be reduced to each of them. Such problems are known as NP-complete. Historically the first NP-complete problem was the following propositional satisfiability problem for 3-SAT formulas.

• We start with Boolean (propositional) variables $x_1, \ldots, x_n$, i.e., variables that can take only two values: true (1) and false (0).
• A literal is either a variable $x_i$, or its negation $\neg x_i$.
• A clause (disjunction) is an expression of the type $a \vee b$ or $a \vee b \vee c$, where $a$, $b$, and $c$ are literals.
• Finally, a 3-SAT formula is an expression of the type $C_1 \,\&\, C_2 \,\&\, \ldots \,\&\, C_m$, where $C_j$ are clauses.

An example is the 3-clause formula $(x_1 \vee x_2) \,\&\, (\neg x_1 \vee x_2 \vee x_3) \,\&\, (x_1 \vee \neg x_2 \vee \neg x_3)$.

The general problem is:

• given a 3-SAT formula,
• check whether this formula is satisfiable, i.e., whether there exist values of the variables that make it true.

How NP-complete problems are solved now. If P ≠ NP, this means, in particular, that no feasible algorithm is possible that would solve all the instances of the general 3-SAT problem. So, in practice, when only feasible algorithms are possible, we have to use heuristic algorithms, i.e., algorithms which do not always lead to a solution.

Many such algorithms start by selecting a literal, i.e., equivalently, by selecting one of the Boolean variables $x_i$ and selecting its truth value. Then, when we substitute this value into the original formula, we get a new propositional formula with one fewer variable. If the original formula was satisfiable and we selected the literal correctly, then the new formula is also satisfiable. So, by repeating this procedure again and again, we will confirm that the formula is satisfiable (and also find the values of the variables $x_i$ that make the formula true).

Which literal should we select? In general, a satisfiable 3-SAT formula has several satisfying vectors. For example, by trying all 8 possible combinations of truth values, we can check that the above sample 3-SAT formula has four different solutions: (101), (110), (111), and (010). By selecting a literal, we restrict the number of solutions, from the original number $N$ to a new, usually smaller, number $N' \le N$. A priori we do not know which vectors of Boolean values are solutions; all $2^n$ such vectors are equally probable to be a solution. Thus, the more vectors remain, the higher the probability that by this restriction we do not miss a solution. It is therefore reasonable to select a literal for which the estimated number of satisfying vectors is the largest possible; see, e.g., [1, 2, 10, 11] and references therein.

For a general 3-SAT formula, the expected number of solutions can be estimated, e.g., as follows:

• a formula $a \vee b$ is satisfied by 3 out of 4 combinations of the values $(a, b)$ (the only combination which does not make this formula true is $a = b = {\rm false}$); thus, the probability that this clause will be satisfied by a random Boolean vector is 3/4;
• a formula $a \vee b \vee c$ is satisfied by 7 out of 8 combinations of the values $(a, b, c)$ (the only combination which does not make this formula true is $a = b = c = {\rm false}$); thus, the probability that this clause will be satisfied by a random Boolean vector is 7/8.
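The sample formula is small enough to check by exhaustive enumeration. The following minimal sketch does so; the signed-integer encoding of literals and all function names are assumptions of this sketch, not notation from the text. It lists the satisfying vectors, counts how many of them remain after each literal is fixed, and confirms the per-clause fractions 3/4 and 7/8. Such enumeration is feasible only for very small $n$, which is exactly why heuristics are needed in general.

```python
from itertools import product

# Sample formula (x1 v x2) & (~x1 v x2 v x3) & (x1 v ~x2 v ~x3).
# Assumed encoding: literal +i stands for x_i, literal -i for "not x_i".
FORMULA = [(1, 2), (-1, 2, 3), (1, -2, -3)]
N_VARS = 3

def satisfies(vector, formula):
    """Check a candidate vector; vector[i-1] is the truth value of x_i."""
    return all(
        any(vector[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in formula
    )

def all_solutions(formula, n_vars):
    """Exhaustive search over all 2**n_vars Boolean vectors."""
    return [v for v in product((False, True), repeat=n_vars)
            if satisfies(v, formula)]

def solutions_after(literal, formula, n_vars):
    """Satisfying vectors that remain once the given literal is made true."""
    return [v for v in all_solutions(formula, n_vars)
            if v[abs(literal) - 1] == (literal > 0)]

def clause_fraction(k):
    """Fraction of the 2**k value combinations satisfying a k-literal clause."""
    combos = list(product((False, True), repeat=k))
    return sum(any(c) for c in combos) / len(combos)

solutions = all_solutions(FORMULA, N_VARS)
print(["".join(str(int(b)) for b in v) for v in solutions])
for literal in (1, -1, 2, -2, 3, -3):
    print(literal, len(solutions_after(literal, FORMULA, N_VARS)))
print(clause_fraction(2), clause_fraction(3))
```

Checking a single candidate with `satisfies` takes time proportional to the formula size, whereas `all_solutions` visits all $2^n$ vectors; this is the gap between verifying and finding a solution described above.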
For example, for the above 3-SAT formula, the corresponding probability is $(3/4) \cdot (7/8) \cdot (7/8)$, and the estimated number of satisfying Boolean vectors is $(3/4) \cdot (7/8) \cdot (7/8) \cdot 2^3 \approx 4.6$. In this formula, we have three variables, so we have six possible literals. Which one should we select?
• if we select $x_1$ to be true, then the first and the third clauses are always satisfied, and the formula becomes $x_2 \lor x_3$ (this is the reduced clause consistent with the solutions (101), (110), and (111) listed above); here, the estimated number of solutions is $(3/4) \cdot 2^2 = 3$;
• if we select a literal $\neg x_1$, i.e., we select $x_1$ to be false, then the second clause is satisfied, and the formula becomes $x_2 \,\&\, (\neg x_2 \lor \neg x_3)$; here, the estimated number of solutions is $(1/2) \cdot (3/4) \cdot 2^2 = 1.5$;
• if we select a literal $x_2$, then the formula becomes $x_1 \lor \neg x_3$; here, the estimated number of solutions is $(3/4) \cdot 2^2 = 3$;
• if we select a literal $\neg x_2$, then the formula becomes $x_1 \,\&\, (\neg x_1 \lor x_3)$; here, the estimated number of solutions is $(1/2) \cdot (3/4) \cdot 2^2 = 1.5$;
• if we select a literal $x_3$, then the formula becomes $(x_1 \lor x_2) \,\&\, (x_1 \lor \neg x_2)$; here, the estimated number of solutions is $(3/4) \cdot (3/4) \cdot 2^2 = 2.25$;
• finally, if we select a literal $\neg x_3$, then the formula becomes $(x_1 \lor x_2) \,\&\, (\neg x_1 \lor x_2)$; here, the estimated number of solutions is $(3/4) \cdot (3/4) \cdot 2^2 = 2.25$.

The largest estimate of remaining Boolean vectors is attained when we select $x_1$ or $x_2$. So, on the first step, we should select either the literal $x_1$ or the literal $x_2$. One can check that in both cases, we do not miss a solution (and in each of these cases, we actually get 3 solutions, exactly the number that we estimated).

General case. The same idea is known to be efficient for many other complex problems; see, e.g., [10]. For example, a similar algorithm has been successfully used to solve another complex problem: a discrete optimization knapsack problem, where:
• given the resources $r_1, \ldots, r_n$ needed for each of $n$ projects, the overall amount $r$ of available resources, and the expected gains $g_1, \ldots, g_n$ from each of the projects,
• we need to select a set of projects $S \subseteq \{1, \ldots, n\}$ which has the largest expected gain $\sum_{i \in S} g_i$ among all the sets that we can afford, i.e., among all the sets $S$ for which $\sum_{i \in S} r_i \le r$.
The corresponding algorithms are described, e.g., in [14, 22]. In general, it is important to keep as many solution options open as possible. In decision making, one of the main errors is to focus too quickly and to become blind to alternatives. This is a general problem-solving principle which the above SAT example illustrates very well.
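The literal-selection heuristic from the SAT example can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the clause list is an assumption, reconstructed from the reduced formulas and the four satisfying vectors given in the example (the sample formula itself is stated earlier in the chapter), and the names `estimate`, `substitute`, and `best_literals` are hypothetical.

```python
from fractions import Fraction

# Assumed sample formula: (x1 v x2) & (~x1 v x2 v x3) & (x1 v ~x2 v ~x3).
# A literal is a signed integer: +i means x_i, -i means "not x_i".
CLAUSES = [(1, 2), (-1, 2, 3), (1, -2, -3)]
N_VARS = 3


def estimate(clauses, n_free):
    """Estimated number of satisfying vectors, treating clauses as independent."""
    result = Fraction(2) ** n_free
    for clause in clauses:
        # A clause with k literals fails for exactly 1 of its 2**k combinations.
        result *= 1 - Fraction(1, 2 ** len(clause))
    return result


def substitute(clauses, literal):
    """Make `literal` true: drop satisfied clauses, shorten the others."""
    reduced = []
    for clause in clauses:
        if literal in clause:
            continue  # this clause is already satisfied
        reduced.append(tuple(l for l in clause if l != -literal))
    return reduced


def best_literals(clauses, n_vars):
    """Return the estimate for every literal and the literals with the largest one."""
    scores = {}
    for var in range(1, n_vars + 1):
        for literal in (var, -var):
            scores[literal] = estimate(substitute(clauses, literal), n_vars - 1)
    top = max(scores.values())
    return scores, sorted(l for l, s in scores.items() if s == top)


scores, winners = best_literals(CLAUSES, N_VARS)
```

With these assumed clauses, the estimates come out as 3 for $x_1$ and $x_2$, 1.5 for their negations, and 2.25 for $x_3$ and $\neg x_3$, so `winners` contains the literals $x_1$ and $x_2$, in agreement with the values in the example.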
3 How This Analysis Helps Explain the Minimum Entropy Production Principle

How is all this related to entropy. From the physical viewpoint, entropy is proportional to the logarithm of the number of micro-states forming a given macro-state; see, e.g., [3, 23]. In the case of the SAT problems, micro-states are satisfying vectors, so the number of micro-states is the number of such vectors. Similarly, in other complex problems, solution options are micro-states, and the number of micro-states is the number of such options. As we solve each problem, the number of states decreases, but it decreases as slowly as possible. Thus, the entropy, which is the logarithm of the number of states, also decreases, but decreases as slowly as possible, at the minimal possible rate. So, if we consider the dependence of entropy on time, then, in the backward-time direction (i.e., in the direction in which entropy increases), this increase is the smallest possible.
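To make the link to entropy concrete, the following sketch counts the satisfying vectors that survive each choice of the first literal and reports the resulting change in the logarithm of that count. It uses only the four satisfying vectors listed in the SAT example; taking the natural logarithm with the proportionality constant set to 1 is an assumption made for illustration, and the helper names are hypothetical.

```python
import math

# The four satisfying vectors (x1, x2, x3) listed in the SAT example.
SOLUTIONS = ["101", "110", "111", "010"]


def remaining(solutions, var, value):
    """Solutions still available after fixing variable `var` (1-based) to `value`."""
    return [s for s in solutions if s[var - 1] == str(int(value))]


def entropy_change(solutions, var, value):
    """Change in ln(number of solutions); -inf if no solution survives."""
    left = len(remaining(solutions, var, value))
    if left == 0:
        return float("-inf")
    return math.log(left) - math.log(len(solutions))


changes = {
    (var, value): entropy_change(SOLUTIONS, var, value)
    for var in (1, 2, 3)
    for value in (True, False)
}
# The choices that lose the least entropy keep the most options open.
slowest = max(changes.values())
best = sorted(choice for choice, c in changes.items() if c == slowest)
```

Here `best` contains the choices $x_1 = \text{true}$ and $x_2 = \text{true}$: each takes the count from 4 to 3, i.e., changes the log-count by $\ln(3/4)$, the smallest possible decrease among the six literals.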
How is all this related to physics. At first glance, the above text may be more relevant for human and computer problem solving than for physics, since, at first glance, nature does not solve problems. However, in some reasonable sense it does; let us explain this.

Traditionally, physical theories, starting from Newton’s mechanics, have been formulated in terms of differential equations. In this formulation, there is no problem to solve: once we know the state at a given moment of time, we can compute the rate at which each variable describing the state changes with time. This computation may be tedious and may require a lot of computation time on a high-performance computer, but it does not constitute a challenging NP-complete problem.

At present, however, the most typical way to describe a physical theory is in the form of a variational principle, i.e., in the form of an objective function whose optimization corresponds to the actual behavior of the physical systems; see, e.g., [3, 13, 23]. This formulation is especially important if we take quantum effects into account:
• while in non-quantum physics, optimization is exact and is just another equivalent form of describing the corresponding differential equations,
• in quantum physics, optimization is approximate: a quantum system tries to optimize, but its result is close to (but not exactly equal to) the corresponding optimum.

In this formulation, what nature does is solve a complex optimization problem: namely, it tries to optimize the value of the corresponding functional. We therefore expect to see the same pattern of entropy changes as in general problem solving: in the direction in which entropy is increasing, this increase is the smallest possible. Increasing entropy is exactly how we determine the direction of physical time.
For example:
• if we see a movie in which a cup falls down and breaks, we understand that the movie is shown in the actual time direction, while
• if we see the same movie played backward, when the pieces of a broken cup mysteriously come together to form a whole cup, we realize that we saw this movie in reverse.

From this viewpoint, the above statement means that in the forward-time direction, i.e., in the direction in which entropy increases, the rate of the entropy increase is the smallest possible. We thus have a natural systems-based explanation for the minimum entropy production principle.

Acknowledgements This work was supported by the Institute of Geodesy, Leibniz University of Hannover. It was also supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence). This paper was partly written when V. Kreinovich was a visiting researcher at the Leibniz University of Hannover.
References
1. Dubois, O.: Counting the number of solutions for instances of satisfiability. Theor. Comput. Sci. 81, 49–64 (1991)
2. Dubois, O., Carlier, J.: Probabilistic approach to the satisfiability problem. Theor. Comput. Sci. 81, 65–85 (1991)
3. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston, Massachusetts (2005)
4. Finkelstein, A., Kosheleva, O., Kreinovich, V.: Astrogeometry: towards mathematical foundations. Int. J. Theor. Phys. 36(4), 1009–1020 (1997)
5. Finkelstein, A., Kosheleva, O., Kreinovich, V.: Astrogeometry: geometry explains shapes of celestial bodies. Geombinatorics VI(4), 125–139 (1997)
6. Glansdorff, P., Prigogine, I.: Thermodynamic Theory of Structure, Stability and Fluctuations. Wiley-Interscience, London (1971)
7. Grandy Jr., W.T.: Entropy and the Time Evolution of Macroscopic Systems. Oxford University Press, Oxford (2008)
8. Jaynes, E.T.: The minimum entropy production principle. Annu. Rev. Phys. Chem. 31, 579–601 (1980)
9. Klein, M.J., Meijer, P.H.E.: Principle of minimum entropy production. Phys. Rev. 96, 250–255 (1954)
10. Kreinovich, V.: S. Maslov’s iterative method: 15 years later (Freedom of choice, neural networks, numerical optimization, uncertainty reasoning, and chemical computing). In: Kreinovich, V., Mints, G. (eds.) Problems of Reducing the Exhaustive Search, pp. 175–189. American Mathematical Society, Providence, Rhode Island (1997)
11. Kreinovich, V., Fuentes, O.: High-concentration chemical computing techniques for solving hard-to-solve problems, and their relation to numerical optimization, neural computing, reasoning under uncertainty, and freedom of choice. In: Katz, E. (ed.) Molecular and Supramolecular Information Processing: From Molecular Switches to Logical Systems, pp. 209–235. Wiley-VCH, Weinheim, Germany (2012)
12. Kreinovich, V., Lakeyev, A., Rohn, J., Kahl, P.: Computational Complexity and Feasibility of Data Processing and Interval Computations. Kluwer, Dordrecht (1998)
13. Kreinovich, V., Liu, G.: We live in the best of possible worlds: Leibniz’s insight helps to derive equations of modern physics. In: Pisano, R., Fichant, M., Bussotti, P., Oliveira, A.R.E. (eds.) The Dialogue Between Sciences, Philosophy and Engineering. New Historical and Epistemological Insights, Homage to Gottfried W. Leibnitz 1646–1716, pp. 207–226. College Publications, London (2017)
14. Kreinovich, V., Shukeilo, S.: A new probabilistic approach to the knapsack problem. In: Proceedings of the Third USSR All-Union School on Discrete Optimization and Computers, Tashtagol, Russia, December 2–9, 1987, pp. 123–124. Moscow (1987) (in Russian)
15. Li, S., Ogura, Y., Kreinovich, V.: Limit Theorems and Applications of Set Valued and Fuzzy Valued Random Variables. Kluwer Academic Publishers, Dordrecht (2002)
16. Livi, R., Politi, P.: Non-Equilibrium Statistical Physics: A Modern Perspective. Cambridge University Press, Cambridge, UK (2017)
17. Maes, C., Netočný, K.: Minimum entropy production principle from a dynamical fluctuation law. J. Math. Phys. 48, Paper 053306 (2007)
18. Martyushev, I.N., Nazarova, A.S., Seleznev, V.D.: On the problem of the minimum entropy production in the nonequilibrium stationary state. J. Phys. A: Math. Theor. 40(3), 371–380 (2007)
19. Papadimitriou, C.H.: Computational Complexity. Addison-Wesley, San Diego (1994)
20. Prigogine, I.: Modération et transformations irréversibles des systèmes ouverts. Bulletin de la Classe des Sciences, Académie Royale de Belgique 31, 600–606 (1945)
21. Prigogine, I.: Etude Thermodynamique des phénomènes irréversibles. Desoer, Liège, Belgium (1947)
22. Shukeilo, S.: A New Probabilistic Approach to the Knapsack Problem. Leningrad Electrical Engineering Institute (LETI), Master’s Thesis (1988) (in Russian)
23. Thorne, K.S., Blandford, R.D.: Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics. Princeton University Press, Princeton, New Jersey (2017)
Why Class-D Audio Amplifiers Work Well: A Theoretical Explanation
Kevin Alvarez, Julio C. Urenda, and Vladik Kreinovich
Abstract Most current high-quality electronic audio systems use class-D audio amplifiers (D-amps, for short), in which a signal is represented by a sequence of pulses of fixed height whose duration at any given moment of time depends linearly on the amplitude of the input signal at this moment of time. In this paper, we explain the efficiency of this signal representation by showing that it is the least vulnerable both to additive noise (which affects measuring the signal itself) and to errors in measuring time.
1 Formulation of the Problem

Most current electronic audio systems use class-D audio amplifiers, where a signal x(t) is represented by a sequence s(t) of pulses whose height is fixed and whose duration at time t linearly depends on the amplitude x(t) of the input signal at this moment of time; see, e.g., [1] and references therein. Starting with the first commercial applications in 2001, D-amps have been used in many successful devices. However, why they are so efficient is not clear. In this paper, we provide a possible theoretical explanation for this efficiency.
K. Alvarez
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
J. C. Urenda
Department of Mathematical Sciences and Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (corresponding author)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_3
We will do it in two sections. In the first of these sections, we explain why using pulses makes sense. In the following section, we explain why the pulse’s duration should linearly depend on the amplitude of the input signal x(t).
2 Why Pulses

Let us start with some preliminaries.

There are bounds on possible values of the signal. For each device, there are bounds $\underline{s}$ and $\overline{s}$ on the amplitude of a signal that can be represented by this device. In other words, for each signal s(t) represented by this device and for each moment of time t, we have $\underline{s} \le s(t) \le \overline{s}$.

Noise is mostly additive. In every audio system, there is noise n(t). Most noises are additive: they add to the signal. For example, if we want to listen to a musical performance, then someone talking, coughing, or shuffling in a chair produces a noise signal that adds to the original signal. Similarly, if we have a weak signal s(t) which is being processed by an electronic system, then additional signals from other electric and electronic devices act as an additive noise n(t), so the actual amplitude is equal to the sum s(t) + n(t).

What do we know about the noise. In some cases (e.g., in a predictable industrial environment) we know what devices produce the noise, so we can predict some statistical characteristics of this noise. However, in many other situations (e.g., in TV sets) noises are rather unpredictable. All we may know is the upper bound n on possible values of the noise n(t): |n(t)| ≤ n.
We want to be the least vulnerable to noise. To improve the quality of the resulting signal, it is desirable to select a signal’s representation in which this signal would be the least vulnerable to noise. Additive noise n(t) changes the original value s(t) of the signal to the modified value s(t) + n(t). Ideally, we want to make sure that, based on this modified signal, we will be able to uniquely reconstruct the original signal s(t).

For this to happen, not all signal values are possible. Let us show that for this desired feature to happen, we must select a representation s(t) in which not all signal values are possible.
Indeed, if we have two possible values $s_1 < s_2$ whose difference does not exceed $2n$, i.e., for which $\frac{s_2 - s_1}{2} \le n$, then we have a situation in which:
• to the first value $s_1$, we add noise $n_1 = \frac{s_2 - s_1}{2}$ for which $|n_1| \le n$, thus getting the value $s_1 + n_1 = \frac{s_1 + s_2}{2}$; and
• to the second value $s_2$, we add noise $n_2 = -\frac{s_2 - s_1}{2}$ for which also $|n_2| \le n$, thus also getting the same value $s_2 + n_2 = \frac{s_1 + s_2}{2}$.

Thus, based on the modified value $s_i + n_i = \frac{s_1 + s_2}{2}$, we cannot uniquely determine the original value of the signal: it could be $s_1$ or it could be $s_2$.

Resulting explanation of using pulses. The larger the noise level n against which we are safe, the larger must be the difference between possible values of the signal. Thus, to make sure that the signal representation is safe against the strongest possible noise, we must have the differences as large as possible. On the interval $[\underline{s}, \overline{s}]$, the largest possible difference between the possible values is attained when one of these values is $\underline{s}$ and the other is $\overline{s}$. Thus, we end up with a representation in which at each moment of time, the signal is equal either to $\underline{s}$ or to $\overline{s}$. This signal can be equivalently represented as a constant level $\underline{s}$ and a sequence of pulses of the same height (amplitude) $\overline{s} - \underline{s}$.
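The argument above can be checked numerically. The sketch below is an illustration under assumed values (the device bounds, the noise bound `n_max`, and the function names are hypothetical, not taken from the paper): it shows two levels that are closer than twice the noise bound being mapped to the same noisy value, and a two-level signal being decoded correctly by a midpoint threshold whenever the noise stays below half the gap between the two levels.

```python
import random

S_LOW, S_HIGH = 0.0, 1.0   # assumed device bounds (lower and upper signal level)


def collide(s1, s2):
    """Noisy values reached from s1 and s2 with noise +/-(s2 - s1)/2."""
    half_gap = (s2 - s1) / 2
    return s1 + half_gap, s2 - half_gap


def decode_two_level(received):
    """Recover a two-level signal by thresholding at the midpoint."""
    midpoint = (S_LOW + S_HIGH) / 2
    return S_HIGH if received > midpoint else S_LOW


# Levels 0.25 and 0.75 with noise bound 0.25: both can turn into 0.5.
a, b = collide(0.25, 0.75)

# A two-level signal with noise strictly below half the gap is always recovered.
rng = random.Random(0)
n_max = 0.49 * (S_HIGH - S_LOW)
levels = [rng.choice((S_LOW, S_HIGH)) for _ in range(1000)]
noisy = [s + rng.uniform(-n_max, n_max) for s in levels]
recovered = [decode_two_level(r) for r in noisy]
```

In this sketch `a` and `b` coincide, so the receiver cannot tell the two intermediate levels apart, while `recovered` equals `levels` for the two-level signal.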
3 Why the Pulse’s Duration Should Linearly Depend on the Amplitude of the Input Signal

Pulse representation: reminder. As we concluded in the previous section, to make the representation as noise-resistant as possible, we need to represent the time-dependent input signal x(t) by a sequence of pulses.

How can we encode the amplitude of the input signal. Since the height (amplitude) of each pulse is the same, the only way that we can represent information about the input signal’s amplitude x(t) at a given moment of time t is to make the pulse’s width (duration) w, and thus the time interval between the pulses, dependent on x(t). Thus, we should select w depending on the amplitude: w = w(x).

Which dependence w = w(x) should we select: analysis of the problem. Which dependence of the width w on the initial amplitude should we select?
Since the amplitude x of the input is encoded by the pulse’s width w(x), the only way to reconstruct the original signal x is to measure the width w, and then find x for which w(x) = w. We want this reconstruction to be as accurate as possible. Let ε > 0 be the accuracy with which we can measure the pulse’s duration w. A change Δx in the input signal, after which the signal becomes equal to x + Δx, leads to the changed width w(x + Δx). For small Δx, we can expand the expression w(x + Δx) in Taylor series and keep only linear terms in this expansion:

$$w(x + \Delta x) \approx w(x) + \Delta w,$$

where we denoted $\Delta w \stackrel{\text{def}}{=} w'(x) \cdot \Delta x$. Differences in width which are smaller than ε will not be detected. So the smallest difference in Δx that will be detected comes from the formula $|\Delta w| = |w'(x) \cdot \Delta x| = \varepsilon$, i.e., is equal to $|\Delta x| = \frac{\varepsilon}{|w'(x)|}$. In general, the guaranteed accuracy with which we can determine the signal is thus equal to the largest of these values, i.e., to the value

$$\delta = \max_x \frac{\varepsilon}{|w'(x)|} = \frac{\varepsilon}{\min_x |w'(x)|}.$$

We want this value to be the smallest, so we want the denominator $\min_x |w'(x)|$ to attain the largest possible value. The limitation is that the overall widths of all the pulses corresponding to times $0 = t_1$, $t_2 = t_1 + \Delta t$, ..., $t_n = T$ should fit within time T, i.e., we should have

$$\sum_{i=1}^{n} w(x(t_i)) \le T.$$

This inequality must be satisfied for all possible input signals. In real life, everything is bounded, so at each moment of time, possible values x(t) of the input signal must be between some bounds $\underline{x}$ and $\overline{x}$: $\underline{x} \le x(t) \le \overline{x}$.

We want to be able to uniquely reconstruct x from w(x). Thus, the function w(x) should be monotonic: either increasing or decreasing. If the function w(x) is increasing, then we have $w(x(t)) \le w(\overline{x})$ for all t. Thus, to make sure that the above inequality is satisfied for all possible input signals, it is sufficient to require that this inequality is satisfied when $x(t) = \overline{x}$ for all t, i.e., that $n \cdot w(\overline{x}) \le T$, or, equivalently, that $w(\overline{x}) \le \Delta t = \frac{T}{n}$.

If the function w(x) is decreasing, then similarly, we conclude that the requirement that the above inequality is satisfied for all possible input signals is equivalent to requiring that $w(\underline{x}) \le \Delta t$.
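The accuracy measure and the width constraint introduced above can be evaluated numerically for any candidate encoding. The sketch below is illustrative only: the input range, the slot length `DT`, the timing accuracy `EPS`, the two candidate encodings, and the finite-difference estimate of the derivative are all assumptions made for the example.

```python
X_LOW, X_HIGH = -1.0, 1.0   # assumed bounds on the input signal
DT = 1.0                    # assumed time slot per sample (T / n)
EPS = 0.01                  # assumed accuracy of measuring a pulse width


def linear_width(x):
    """Width that grows linearly from 0 at X_LOW to DT at X_HIGH."""
    return DT * (x - X_LOW) / (X_HIGH - X_LOW)


def quadratic_width(x):
    """Another increasing encoding with the same end points, for comparison."""
    u = (x - X_LOW) / (X_HIGH - X_LOW)
    return DT * u * u


def guaranteed_accuracy(width, samples=1000):
    """delta = EPS / min |w'(x)|, with w' estimated by finite differences."""
    step = (X_HIGH - X_LOW) / samples
    slopes = []
    for i in range(samples):
        x = X_LOW + i * step
        slopes.append(abs(width(x + step) - width(x)) / step)
    smallest = min(slopes)
    return float("inf") if smallest == 0 else EPS / smallest


delta_linear = guaranteed_accuracy(linear_width)
delta_quadratic = guaranteed_accuracy(quadratic_width)
```

Both candidate encodings satisfy the constraint that the width at the upper bound does not exceed `DT`. Under these assumptions, the linear encoding gives a value of δ close to 0.02, while the quadratic one, whose slope nearly vanishes near the lower bound, gives a much larger δ.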
Thus, to find the best dependence w(x), we must solve the following optimization problem.

Which dependence w = w(x) should we select: precise formulation of the problem. To find the encoding w(x) that leads to the most accurate reconstruction of the input signal x(t), we must solve the following two optimization problems.
• Of all increasing functions $w(x) \ge 0$ defined for all $x \in [\underline{x}, \overline{x}]$ and for which $w(\overline{x}) \le \Delta t$, we must select a function for which the minimum $\min_x |w'(x)|$ attains the largest possible value.
• Of all decreasing functions $w(x) \ge 0$ defined for all $x \in [\underline{x}, \overline{x}]$ and for which $w(\underline{x}) \le \Delta t$, we must select a function for which the minimum $\min_x |w'(x)|$ attains the largest possible value.

Solution to the above optimization problem. Let us first show how to solve the problem for the case when the function w(x) is increasing. We will show that in this case, the largest possible value of the minimum $\min_x |w'(x)|$ is attained for the function

$$w(x) = \Delta t \cdot \frac{x - \underline{x}}{\overline{x} - \underline{x}}.$$

Indeed, in this case, the dependence of w(x) on x is linear, so we have the same value of $w'(x)$ for all x, which is thus equal to the minimum:

$$\min_x |w'(x)| = w'(x) = \frac{\Delta t}{\overline{x} - \underline{x}}.$$

Let us prove that the minimum cannot attain any larger value. Indeed, if we could have

$$m \stackrel{\text{def}}{=} \min_x |w'(x)| > \frac{\Delta t}{\overline{x} - \underline{x}},$$

then we would have, for each x,

$$w'(x) \ge m > \frac{\Delta t}{\overline{x} - \underline{x}}.$$

Integrating both sides of the inequality $w'(x) \ge m$ over x from $\underline{x}$ to $\overline{x}$, and taking into account that

$$\int_{\underline{x}}^{\overline{x}} w'(x)\, dx = w(\overline{x}) - w(\underline{x})$$

and

$$\int_{\underline{x}}^{\overline{x}} m\, dx = m \cdot (\overline{x} - \underline{x}),$$

we conclude that $w(\overline{x}) - w(\underline{x}) \ge m \cdot (\overline{x} - \underline{x})$. Since $m > \frac{\Delta t}{\overline{x} - \underline{x}}$, we have $m \cdot (\overline{x} - \underline{x}) > \Delta t$, hence

$$w(\overline{x}) - w(\underline{x}) > \Delta t.$$

On the other hand, from the fact that $w(\overline{x}) \le \Delta t$ and $w(\underline{x}) \ge 0$, we conclude that $w(\overline{x}) - w(\underline{x}) \le \Delta t$, which contradicts the previous inequality. This contradiction shows that the minimum cannot attain any value larger than $\frac{\Delta t}{\overline{x} - \underline{x}}$, and thus, that the linear function $w(x) = \Delta t \cdot \frac{x - \underline{x}}{\overline{x} - \underline{x}}$ is indeed optimal.

Similarly, in the decreasing case, we can prove that the linear function $w(x) = \Delta t \cdot \frac{\overline{x} - x}{\overline{x} - \underline{x}}$ is optimal.

In both cases, we proved that the width of the pulse should linearly depend on the input signal, i.e., that the encoding used in class-D audio amplifiers is indeed optimal.

Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
Reference
1. Santo, B.: Bringing big sound to small devices: one module revolutionized how we listen. IEEE Spectr. (10), 17 (2019)
How Can We Explain Different Number Systems?
Laxman Bokati, Olga Kosheleva, and Vladik Kreinovich
Abstract At present, we mostly use the decimal (base-10) number system, but in the past, many other systems were used: base-20, base-60 (which is still reflected in how we divide an hour into minutes and a minute into seconds), and many others. There is a known explanation for the base-60 system: 60 is the smallest number that can be divided by 2, by 3, by 4, by 5, and by 6. Because of this, e.g., half an hour, one-third of an hour, all the way to one-sixth of an hour all correspond to a whole number of minutes. In this paper, we show that a similar idea can explain all historical number systems if, instead of requiring that the base be divisible by all numbers from 2 to some value, we require that the base be divisible by all but one (or all but two) of such numbers.
L. Bokati
Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
O. Kosheleva
Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (corresponding author)
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_4

1 Formulation of the Problem

A problem. Nowadays, everyone uses a decimal (base-10) system for representing integers, with the exception, of course, of computers, which use a binary (base-2) system. However, in the past, many cultures used different number systems; see, e.g., [1, 3–5, 9, 11–13, 15–18]. Some of these systems have been in use until reasonably recently (and are still somewhat used in colloquial speech, e.g., when we count in dozens). Other systems are only known directly from historical records or from indirect sources such as linguistics. An interesting question is: why were some number systems (i.e., some bases) used while some similar bases were not?

Case when an explanation is known. One of the known reasons for selecting a base comes from the base-60 system B = 60 used by the ancient Babylonians; see, e.g., [2, 6, 8]. We still have a trace of that system, which was widely used throughout the ancient world, in our division of the hour into 60 min and a minute into 60 s. A natural explanation for the use of this system is that it makes it easy to divide by small numbers: namely, when we divide 60 by 2, 3, 4, 5, and 6, we still get an integer. Thus, if we divide the hour into 60 min as we do, 1/2, 1/3, 1/4, 1/5, and 1/6 of the hour are all represented by a whole number of minutes, which makes it much easier for people to handle. And one can easily show that 60 is the smallest integer which is divisible by 2, 3, 4, 5, and 6.

Our idea. Let us use this explanation for the base-60 system as a sample, and see what we can get if we make a similar assumption of divisibility, but for fewer numbers, or with all numbers but one or but two. It turns out that many historically used number systems can indeed be explained this way.
2 Which Bases Appear if We Consider Divisibility by All Small Numbers from 1 to Some k

Let us consider which bases appear if we consider divisibility by all small natural numbers, i.e., by all natural numbers from 1 to some small number k. We will consider this for all values k from 1 to 7, and we will explain why we do not go further.

Case when k = 2. In this case, the smallest number divisible by 2 is the number 2 itself, so we get the binary (base-2) system B = 2 used by computers. Some cultures used powers of 2 as the base, e.g., B = 4 or B = 8 (see, e.g., [1]). This, in effect, is the same as using the original binary system: e.g., the fact that we have a special word for a hundred, $100 = 10^2$, does not mean that we use a base-100 system.

Case when k = 3. The smallest number divisible by 2 and 3 is B = 6. The base-6 number system has indeed been used, by the Morehead-Maro language of Southern New Guinea; see, e.g., [10, 14].
Case when k = 4. The smallest number divisible by 2, 3, and 4 is B = 12. The base-12 number system has been used in many cultures; see, e.g., [2, 11, 16], and the use of dozens in many languages is an indication of this system’s ubiquity.

Case when k = 5. The smallest number divisible by 2, 3, 4, and 5 is B = 60, the familiar Babylonian base. Since this number is also divisible by 6, the case k = 6 leads to the exact same base and thus does not need to be considered separately.

Case when k = 7. The smallest number which is divisible by 2, 3, 4, 5, 6, and 7 is B = 420. This number looks too big to serve as the base of a number system, so we will not consider it. The same applies to larger values k > 7. Thus, in this paper, we only consider values k ≤ 6.
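The bases in this section are least common multiples, so they can be reproduced with a short computation. The following is a minimal sketch; the function names `lcm` and `smallest_base` are hypothetical and introduced only for this illustration.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def smallest_base(k):
    """Smallest number divisible by every integer from 1 to k."""
    return reduce(lcm, range(1, k + 1), 1)


# Bases for k = 2, ..., 7, as discussed in this section.
bases = {k: smallest_base(k) for k in range(2, 8)}
```

The resulting dictionary maps k = 2, 3, 4, 5, 6, 7 to 2, 6, 12, 60, 60, and 420, matching the cases above.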
3 What if We Can Skip One Number

What happens if we consider bases which are divisible not by all, but by all-but-one numbers from 1 to k? Of course, if we skip the number k itself, this is simply equivalent to being divisible by all the small numbers from 1 to k − 1, and we have already analyzed all such cases. So, it makes sense to skip a number which is smaller than k. Let us analyze all the previous cases k = 1, . . . , 6 from this viewpoint.

Case when k = 2. In this case, there is nothing to skip, so we still get a binary system.

Case when k = 3. In this case, the only number that we can skip is the number 2. The smallest integer divisible by 3 is the number 3 itself, so we get the ternary (base-3) system B = 3; see, e.g., [5]. There is some evidence that people also used powers of 3, such as 9; see, e.g., [7, 15].

Case when k = 4. For k = 4, in principle, we could skip 2 or we could skip 3. Skipping 2 makes no sense, since if the base is divisible by 4, it is also divisible by 2. Thus, the only number that we can meaningfully skip is the number 3. In this case, the smallest number which is divisible by the remaining numbers 2 and 4 is the number 4. As we have mentioned, the base-4 system is, in effect, the same as the binary system: one digit of the base-4 system contains two binary digits, just as, in the more familiar base-8 and base-16 systems, one digit corresponds to 3 or 4 binary digits.

Case when k = 5. In this case, we can skip 2, 3, or 4.
• Skipping 2 does not make sense, since then 4 remains, and, as we have mentioned earlier, if the base is divisible by 4, it is divisible by 2 as well.
• Skipping 3 leads to B = 20, the smallest number divisible by 2, 4, and 5. Base-20 numbers have indeed been actively used, e.g., by the Mayan civilization; see, e.g., [2–4, 6, 8]. In Romance languages, 20 is still described in a different way than 30, 40, and other similar numbers.
• Skipping 4 leads to B = 30, the smallest number divisible by 2, 3, and 5. This seems to be the only case when the corresponding number system was not used by anyone.

Case when k = 6. In this case, in principle, we can skip 2, 3, 4, or 5. Skipping 2 or 3 does not make sense, since any number divisible by 6 is also divisible by 2 and 3. So, to get meaningful examples, we only consider skipping 4 or 5.
• If we skip 4, we get the same unused base B = 30 that we have obtained for k = 5.
• If we skip 5, then the smallest number divisible by 2, 3, 4, and 6 is the base B = 12 which we already discussed earlier.
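The skip-one analysis above can be reproduced mechanically. The sketch below is only an illustration with hypothetical function names: for each k it tries every skipped number d < k and keeps the result only when the skip is meaningful, i.e., when the resulting base is really not divisible by d.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def base_skipping(k, skipped):
    """Smallest number divisible by all of 2..k except those in `skipped`."""
    kept = [d for d in range(2, k + 1) if d not in skipped]
    return reduce(lcm, kept, 1)


def meaningful_single_skips(k):
    """Map each skipped d < k to its base, dropping skips that change nothing."""
    result = {}
    for d in range(2, k):
        base = base_skipping(k, {d})
        if base % d != 0:      # the base really is not divisible by d
            result[d] = base
    return result


skip_one = {k: meaningful_single_skips(k) for k in range(3, 7)}
```

This yields base 3 for k = 3, base 4 for k = 4, bases 20 and 30 for k = 5 (skipping 3 and 4, respectively), and bases 30 and 12 for k = 6 (skipping 4 and 5), as in the cases listed above.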
4 What if We Can Skip Two Numbers

What happens if we consider bases which are divisible by all-but-two numbers from 1 to k? Of course, to describe new bases, we need to only consider skipped numbers which are smaller than k.

Cases when k = 2 or k = 3. In these cases, we do not have two intermediate numbers to skip.

Case when k = 4. In this case, we skip both intermediate numbers 2 and 3 and consider only divisibility by 4. The smallest number divisible by 4 is the number 4 itself, and we have already considered base-4 numbers.

Case when k = 5. In this case, we have three intermediate numbers: 2, 3, and 4. In principle, we can form three pairs of skipped numbers: (2, 3), (2, 4), and (3, 4). Skipping the first pair makes no sense, since then 4 still remains, and if the base is divisible by 4, then it is automatically divisible by 2 as well. Thus, we have only two remaining options:
• We can skip 2 and 4. In this case, the smallest number divisible by the two remaining numbers 3 and 5 is B = 15. Historically, there is no direct evidence of base-15 systems, but there is indirect evidence: e.g., Russia used to have 15-kopeck coins, a very unusual denomination.
• We can skip 3 and 4. In this case, the smallest number divisible by the two remaining numbers 2 and 5 is B = 10. This is our usual decimal system.

Case when k = 6. In this case, we have four intermediate values 2, 3, 4, and 5. Skipping 2 or 3 makes no sense: if the base is divisible by 6, it is automatically divisible by 2 and 3. Thus, the only pair of values that we can skip is 4 and 5. In this case, the smallest number divisible by 2, 3, and 6 is the value B = 6, which we have already considered earlier.
5 What if We Can Skip Three or More Numbers

What if we skip three numbers? What happens if we consider bases which are divisible by all-but-three numbers from 1 to k? Of course, to describe new bases, we need to only consider skipped numbers which are smaller than k.

Cases when k = 2, k = 3, or k = 4. In these cases, we do not have three intermediate numbers to skip.

Case when k = 5. In this case, skipping all three intermediate numbers 2, 3, and 4 leaves us with B = 5. The base-5 system has actually been used; see, e.g., [3].

Case when k = 6. In this case, we have four intermediate numbers, so skipping 3 of them means that we keep only one. It does not add to the list of bases if we keep 2 or 3, since then the smallest number divisible by 6 and by one of them is still 6, and we have already considered base-6 systems. Thus, the only options are keeping 4 and keeping 5. If we keep 4, then the smallest number divisible by 4 and 6 is B = 12, our usual counting with dozens, which we have already considered. If we keep 5, then the smallest number divisible by 5 and 6 is B = 30, which we have also already considered.

What if we skip more than three numbers. If we also count 1 among the numbers smaller than k, then the only values k ≤ 6 for which more than three numbers can be skipped are k = 5 and k = 6. For k = 5, skipping more than three numbers means skipping all four of them, so the resulting base is B = 5, which we already considered. For k = 6, for which there are five such numbers, skipping more than three means either skipping all of them, in which case we have B = 6, or keeping one of them. Keeping 1, 2, or 3 still leaves us with B = 6, keeping 4 leads to B = 12, and keeping the number 5 leads to B = 30. All these bases have already been considered.
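The case analysis of Sects. 2 to 5 can be summarized by enumerating, for each k ≤ 6, every subset of the numbers 2, . . . , k − 1 that could be skipped. The sketch below is an illustration with hypothetical names; it records, for each resulting base, the first pair (k, skipped numbers) that produces it.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def all_bases(max_k=6):
    """Bases obtained by skipping any subset of 2..k-1, for every k <= max_k."""
    found = {}
    for k in range(2, max_k + 1):
        intermediate = list(range(2, k))
        for m in range(len(intermediate) + 1):
            for skipped in combinations(intermediate, m):
                kept = [d for d in range(2, k + 1) if d not in skipped]
                base = reduce(lcm, kept, 1)
                # Record the first (k, skipped) that produces this base.
                found.setdefault(base, (k, skipped))
    return found


bases = all_bases()
```

Under this enumeration, the distinct bases for k ≤ 6 are 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, and 60, which are exactly the bases discussed in the text; for example, 15 first appears for k = 5 with 2 and 4 skipped, and 10 for k = 5 with 3 and 4 skipped.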
Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
L. Bokati et al.
References
1. Ascher, M.: Ethnomathematics: A Multicultural View of Mathematical Ideas. Routledge, Milton Park, Abingdon, Oxfordshire, UK (1994)
2. Boyer, C.B., Merzbach, U.C.: A History of Mathematics. Wiley, New York (1991)
3. Heath, T.L.: A Manual of Greek Mathematics. Dover, New York (2003)
4. Ifrah, G.: The Universal History of Numbers: From Prehistory to the Invention of the Computer. Wiley, Hoboken, New Jersey (2000)
5. Knuth, D.: The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd edn. Addison Wesley, Boston, Massachusetts (1998)
6. Kosheleva, O.: Mayan and Babylonian arithmetics can be explained by the need to minimize computations. Appl. Math. Sci. 6(15), 697–705 (2012)
7. Kosheleva, O., Kreinovich, V.: Was there a pre-biblical 9-ary number system? Math. Struct. Model. 50, 87–90 (2019)
8. Kosheleva, O., Villaverde, K.: How Interval and Fuzzy Techniques Can Improve Teaching. Springer, Cham, Switzerland (2018)
9. Kroeber, A.L.: Handbook of the Indians of California, Bulletin 78. Bureau of American Ethnology of the Smithsonian Institution, Washington, DC (1919)
10. Lean, G.: Counting Systems of Papua New Guinea, vols. 1–17. Papua New Guinea University of Technology, Lae, Papua New Guinea (1988–1992)
11. Luria, A.R., Vygotsky, L.S.: Ape, Primitive Man, and Child: Essays in the History of Behavior. CRC Press, Boca Raton, Florida (1992)
12. Mallory, J.P., Adams, D.Q.: Encyclopedia of Indo-European Culture. Fitzroy Dearborn Publishers, Chicago and London (1997)
13. Nissen, H.J., Damerow, P., Englund, R.K.: Archaic Bookkeeping, Early Writing, and Techniques of Economic Administration in the Ancient Near East. University of Chicago Press, Chicago, Illinois (1993)
14. Owens, K.: The work of Glendon Lean on the counting systems of Papua New Guinea and Oceania. Math. Educ. Res. J. 13(1), 47–71 (2001)
15. Parkvall, M.: Limits of Language: Almost Everything You Didn’t Know about Language and Languages. William James & Company, Portland, Oregon (2008)
16. Ryan, P.: Encyclopaedia of Papua and New Guinea. Melbourne University Press and University of Papua and New Guinea, Melbourne (1972)
17. Schmandt-Besserat, D.: How Writing Came About. University of Texas Press, Austin, Texas (1996)
18. Zaslavsky, C.: Africa Counts: Number and Pattern in African Cultures. Chicago Review Press, Chicago, Illinois (1999)
Why Immediate Repetition Is Good for Short-Time Learning Results but Bad for Long-Time Learning: Explanation Based on Decision Theory

Laxman Bokati, Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich
Abstract It is well known that repetition enhances learning; the question is: when is a good time for this repetition? Several experiments have shown that immediate repetition of the topic leads to better performance on the resulting test than a repetition after some time. Recent experiments showed, however, that while immediate repetition leads to better results on the test, it leads to much worse performance in the long term, i.e., several years after the material has been studied. In this paper, we use decision theory to provide a possible explanation for this unexpected phenomenon.
L. Bokati, Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
J. C. Urenda, Department of Mathematical Sciences and Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
O. Kosheleva, Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (corresponding author), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_5
1 Formulation of the Problem: How to Explain Recent Observations Comparing Long-Term Results of Immediate and Delayed Repetition

Repetitions are important for learning. A natural way to make students better understand and better learn the material is to repeat this material: the more times we repeat, the better the learning results. This repetition can be explicit, e.g., when we go over the material once again before the test. This repetition can also be implicit, e.g., when we give the students a scheduled quiz on the topic, so that they repeat the material themselves when preparing for this quiz.

When should we repeat? The number of repetitions is limited by the available time. Once the number of repetitions is fixed, it is necessary to decide when we should have a repetition:
• shall we have it immediately after the students have studied the material, or
• shall we have it some time after this studying, i.e., after we have studied something else.

What was the recommendation until recently. Experiments have shown that repeating the material almost immediately after the corresponding topic was first studied (e.g., by giving a quiz on this topic) enhances the knowledge of this topic that the students have after the class as a whole. This enhancement was much larger than when a similar quiz, reinforcing the students’ knowledge, was given a certain period of time after studying the topic.

New data seems to promote the opposite recommendation. The immediate-repetition idea has been successfully used by many instructors. However, a recent series of experiments has made many researchers doubt this widespread strategy.
Specifically, these experiments show that (see, e.g., [1] and references therein):
• while immediate repetition indeed enhances the amount of short-term (e.g., semester-wide) learning more than a later repetition,
• from the viewpoint of long-term learning, i.e., what the student will be able to recall in a few years (when he or she starts using this knowledge to solve real-life problems), the result is the opposite: delayed repetitions lead to much better long-term learning than the currently fashionable immediate ones.

Why? The above empirical result is somewhat unexpected, so how can we explain it? We have partially explained the advantages of interleaving (a time interval between the study and the repetition) from the general geometric approach; see, e.g., [3, 5]. However, this explanation does not cover the difference between short-term and long-term memories.

So how can we explain this observed phenomenon? We can simply follow the newer recommendations, arguing that human psychology is difficult, has
many weird features, so we should trust whatever the specialists tell us. This may sound reasonable at first glance, but we have followed this path in the past and came up with what now seems to be a wrong recommendation. This encourages us to pause, first try to understand the observed phenomenon, and follow the new recommendation only if it makes sense. This is exactly the purpose of this paper: to provide a reasonable explanation for the observed phenomenon.
2 Main Idea Behind Our Explanation: Using Decision Theory

Main idea: using decision theory. Our memory is limited in size. We cannot memorize everything that is happening to us. Thus, our brain needs to decide what to store in short-term memory, what to store in long-term memory, and what not to store at all. How can we make this decision?

There is a whole area of research called decision theory that describes how we make decisions or, to be more precise, how a rational person should make decisions. Usually, this theory is applied to conscious decisions, i.e., decisions that we make after some deliberation. However, it is reasonable to apply it also to decisions that we make on a subconscious level, e.g., to decisions on what to remember and what not to remember: indeed, these decisions should also be made rationally.

Decision theory: a brief reminder. To show how decision theory can be applied to our situation, let us briefly recall its main ideas and formulas; for details, see, e.g., [2, 4, 6, 7]. To make a reasonable decision, we need to know the person’s preferences. To describe these preferences, decision theory uses the following notion of utility.

Let us denote possible alternatives by $A_1, \ldots, A_n$. To describe our preference between alternatives in precise terms, let us select two extreme situations:
• a very good situation $A_+$ which is, according to the user, much better than any of the available alternatives $A_i$, and
• a very bad situation $A_-$ which is, according to the user, much worse than any of the available alternatives $A_i$.

Then, for each real number $p$ from the interval $[0, 1]$, we can form a lottery, which we will denote by $L(p)$, in which:
• we get the very good situation $A_+$ with probability $p$ and
• we get the very bad situation $A_-$ with the remaining probability $1 - p$.

Clearly, the larger the probability $p$, the more chances that we will get the very good situation. So, if $p < p'$, then $L(p')$ is better than $L(p)$.
Let us first consider the extreme cases p = 1 and p = 0.
• When $p = 1$, the lottery $L(p) = L(1)$ coincides with the very good situation $A_+$ and is, thus, better than any of the alternatives $A_i$; we will denote this by $A_i < L(1)$.
• When $p = 0$, the lottery $L(p) = L(0)$ coincides with the very bad situation $A_-$ and is, thus, worse than any of the alternatives $A_i$: $L(0) < A_i$.

For all other possible probability values $p \in (0, 1)$, for each $i$, the selection between the alternative $A_i$ and the lottery $L(p)$ is not pre-determined: the decision maker will have to select between $A_i$ and $L(p)$. As a result of this selection, we have:
• either $A_i < L(p)$,
• or $L(p) < A_i$,
• or the case when, to the decision maker, the alternatives $A_i$ and $L(p)$ are equivalent; we will denote this by $A_i \sim L(p)$.

Here:
• If $A_i < L(p)$ and $p < p'$, then $A_i < L(p')$.
• Similarly, if $L(p) < A_i$ and $p' < p$, then $L(p') < A_i$.

Based on these two properties, one can prove that for the probability

$u_i \stackrel{\text{def}}{=} \sup\{p : L(p) < A_i\}$,

we have:
• $L(p) < A_i$ for all $p < u_i$ and
• $A_i < L(p)$ for all $p > u_i$.

This “threshold” value $u_i$ is called the utility of the alternative $A_i$. For every $\varepsilon > 0$, no matter how small it is, we have $L(u_i - \varepsilon) < A_i < L(u_i + \varepsilon)$. In this sense, we can say that the alternative $A_i$ is equivalent to the lottery $L(u_i)$. We will denote this new notion of equivalence by $\equiv$: $A_i \equiv L(u_i)$.

Because of this equivalence, if $u_i < u_j$, this means that $A_i < A_j$. So, we should always select an alternative with the largest possible value of utility.

This works well if we know exactly what alternative we will get. In practice, when we perform an action, we may end up in different situations, i.e., with different alternatives. For example, we may have the alternatives of being wet without an umbrella and being dry with the extra weight of an umbrella to carry, but when we decide whether to take the umbrella, we do not know for sure whether it will rain, so we cannot know the exact alternative. In such situations, instead of knowing the exact alternative $A_i$, we usually know the probability $p_i$ of encountering each alternative $A_i$ when the corresponding action is performed. If we know several actions like this, which action should we select?
Each alternative $A_i$ is equivalent to a lottery $L(u_i)$ in which we get the very good alternative $A_+$ with probability $u_i$ and the very bad alternative $A_-$ with the remaining probability $1 - u_i$. Thus, the analyzed action is equivalent to a two-stage lottery in which:
• first, we select one of the alternatives $A_i$ with probability $p_i$, and then
• depending on which alternative $A_i$ we selected at the first stage, we select $A_+$ or $A_-$ with probabilities $u_i$ and $1 - u_i$, respectively.

As a result of this two-stage lottery, we end up either with $A_+$ or with $A_-$. The probability of getting $A_+$ can be computed by using the formula of full probability, as

$u = \sum_i p_i \cdot u_i$.

So, the analyzed action is equivalent to getting $A_+$ with probability $u$ and $A_-$ with the remaining probability $1 - u$. By definition of utility, this means that the utility of the action is equal to $u$.

The above formula for $u$ is exactly the formula for the expected value of the utility. Thus, we conclude that the utility of an action is equal to the expected value of the utility corresponding to this action.

Let us apply this to learning. If we learn the material, we spend some resources on storing it in memory. If we do not learn the material, we may lose some utility the next time this material is needed. So, whether we store the material in memory depends on which of the two possible actions (to learn or not to learn) has the larger utility or, equivalently, the smaller losses. Let us describe this idea in detail.
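The expected-utility rule derived in this section can be illustrated with the umbrella example mentioned earlier. This is only a sketch: the probability of rain and all utility values below are hypothetical numbers chosen for illustration, and the function name `expected_utility` is not from the original text.

```python
def expected_utility(outcomes):
    """Utility of an action: u = sum of p_i * u_i over its possible outcomes.

    `outcomes` is a list of (probability, utility) pairs; utilities are on the
    [0, 1] scale defined by the lotteries L(p).
    """
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)


# Hypothetical numbers, for illustration only
p_rain = 0.3
actions = {
    # dry either way, but we always carry the extra weight
    "take umbrella": [(p_rain, 0.8), (1 - p_rain, 0.8)],
    # wet if it rains, unburdened otherwise
    "leave umbrella": [(p_rain, 0.2), (1 - p_rain, 1.0)],
}

# Select the action with the largest expected utility
best = max(actions, key=lambda name: expected_utility(actions[name]))
for name, outcomes in actions.items():
    print(name, round(expected_utility(outcomes), 3))
print("selected:", best)
```

With these assumed values, taking the umbrella has expected utility 0.8 and leaving it has 0.76, so the first action is selected.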
3 So When Do We Learn: Analysis of the Problem and the Resulting Explanation

Notations. To formalize the above idea, let us introduce some notations.
• Let m denote the losses (= negative utility) needed to store a piece of material in the corresponding memory (short-term or long-term).
• Let L denote the losses that occur when we need this material but do not have it in our memory.
• Finally, let p denote our estimate of the probability that this material will be needed in the corresponding time interval (short-term time interval for short-term memory or long-term time interval for long-term memory).

If we learn, we have loss m. If we do not learn, then the expected loss is equal to p · L. We learn the material if the second loss is larger, i.e., if p · L > m or, equivalently, if p > m/L.

Comment. Sometimes, students underestimate the usefulness of the studied material, i.e., underestimate the value L. In this case, L is low, so the ratio m/L is high, and for most probability estimates p, learning does not make sense. This unfortunate
situation can be easily repaired if we explain to the students how important this knowledge can be, and thus make sure that they estimate the potential loss L correctly.

Discussion. For different pieces of the studied material, we have different ratios m/L. These ratios do not depend on the learning technique. As we will show later, the estimated probability p may differ for different learning techniques. So, if one technique consistently leads to higher values p, this means that, in general, for more pieces of material we will have p > m/L, and thus more pieces of material will be learned. So, to compare two different learning techniques, we need to compare the corresponding probability estimates p. Let us formulate the problem of estimating the corresponding probability p in precise terms.

Towards a precise formulation of the probability estimation problem. In the absence of other information, to estimate the probability that this material will be needed in the future, the only information that our brain can use is that there were two moments of time at which we needed this material in the past:
• the moment $t_1$ when the material was first studied, and
• the moment $t_2$ when the material was repeated.

In the immediate repetition case, the moment $t_2$ was close to $t_1$, so the difference $t_2 - t_1$ was small. In the delayed repetition case, the difference $t_2 - t_1$ is larger. Based on this information, the brain has to estimate the probability that the material will be needed again at some moment during a given future time interval. How can we do that?

Let us first consider a deterministic version of this problem. Before we start solving the actual probability-related problem, let us consider the following simplified deterministic version of this problem:
• we know the times $t_1 < t_2$ when the material was needed;
• we need to predict the next time $t_3$ when the material will be needed.
We can reformulate this problem in more general terms:
• we observed some event at moments $t_1$ and $t_2 > t_1$;
• based on this information, we want to predict the moment $t_3$ at which the same event will be observed again.

In other words, we need to have a function $t_3 = F(t_1, t_2) > t_2$ that produces the desired estimate.

What are the reasonable properties of this prediction function? The numerical value of the moment of time depends on what unit we use to measure time, e.g., hours, days, or months. It also depends on what starting point we choose for measuring time. We can measure it from Year 0 or, following the Muslim or Buddhist calendars, from some other date.
If we replace the original measuring unit with a new one that is $a$ times smaller, then all numerical values are multiplied by $a$: $t \to t' = a \cdot t$. For example, if we replace seconds with milliseconds, all numerical values are multiplied by 1000, so, e.g., 2 sec becomes 2000 msec. Similarly, if we replace the original starting point with a new one that is $b$ units earlier, then the value $b$ is added to all numerical values: $t \to t' = t + b$. It is reasonable to require that the resulting prediction $t_3$ not depend on the choice of the unit or on the choice of the starting point. Thus, we arrive at the following definitions.

Definition 1 We say that a function $F(t_1, t_2)$ is scale-invariant if for every $t_1$, $t_2$, $t_3$, and $a > 0$, if $t_3 = F(t_1, t_2)$, then for $t_i' = a \cdot t_i$, we get $t_3' = F(t_1', t_2')$.

Definition 2 We say that a function $F(t_1, t_2)$ is shift-invariant if for every $t_1$, $t_2$, $t_3$, and $b$, if $t_3 = F(t_1, t_2)$, then for $t_i' = t_i + b$, we get $t_3' = F(t_1', t_2')$.

Proposition 1 A function $F(t_1, t_2) > t_2$ is scale- and shift-invariant if and only if it has the form $F(t_1, t_2) = t_2 + \alpha \cdot (t_2 - t_1)$ for some $\alpha > 0$.

Proof Let us denote $\alpha \stackrel{\text{def}}{=} F(-1, 0)$. Since $F(t_1, t_2) > t_2$, we have $\alpha > 0$. Let $t_1 < t_2$; then, due to scale-invariance with $a = t_2 - t_1 > 0$, the equality $F(-1, 0) = \alpha$ implies that $F(t_1 - t_2, 0) = \alpha \cdot (t_2 - t_1)$. Now, shift-invariance with $b = t_2$ implies that $F(t_1, t_2) = t_2 + \alpha \cdot (t_2 - t_1)$. Conversely, one can easily check that every function of this form is scale- and shift-invariant. The proposition is proven.

Discussion. Many physical processes are reversible: if we have a sequence of three events occurring at moments $t_1 < t_2 < t_3$, then we can also have a sequence of events at times $-t_3 < -t_2 < -t_1$. It is therefore reasonable to require that:
• if our prediction works for the first sequence, i.e., if, based on $t_1$ and $t_2$, we predict $t_3$,
• then our prediction should work for the second sequence as well, i.e., based on $-t_3$ and $-t_2$, we should predict the moment $-t_1$.

Let us describe this requirement in precise terms.

Definition 3 We say that a function $F(t_1, t_2)$ is reversible if for every $t_1$, $t_2$, and $t_3$, the equality $F(t_1, t_2) = t_3$ implies that $F(-t_3, -t_2) = -t_1$.

Proposition 2 The only scale- and shift-invariant reversible function $F(t_1, t_2)$ is the function $F(t_1, t_2) = t_2 + (t_2 - t_1)$.

Comment. In other words, if we encounter two events separated by the time interval $t_2 - t_1$, then the natural prediction is that the next such event will happen after exactly the same time interval.

Proof In view of Proposition 1, all we need to do is to show that for a reversible function we have $\alpha = 1$. Indeed, for $t_1 = -1$ and $t_2 = 0$, we get $t_3 = \alpha$. Then, due to Proposition 1, we have $F(-t_3, -t_2) = F(-\alpha, 0) = 0 + \alpha \cdot (0 - (-\alpha)) = \alpha^2$. The requirement that this value should be equal to $-t_1 = 1$ implies that $\alpha^2 = 1$, i.e., since $\alpha > 0$, that $\alpha = 1$. The proposition is proven.
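Propositions 1 and 2 can be checked numerically. The sketch below is illustrative only: the function names and the sample values of $t_1$, $t_2$, $a$, and $b$ are assumptions, not part of the original text. It tests the family $F(t_1, t_2) = t_2 + \alpha \cdot (t_2 - t_1)$ for several values of $\alpha$.

```python
import math


def predict_next(t1, t2, alpha=1.0):
    """Deterministic prediction F(t1, t2) = t2 + alpha * (t2 - t1)."""
    return t2 + alpha * (t2 - t1)


def is_scale_invariant(alpha, t1, t2, a):
    """Changing the time unit by a factor a should rescale the prediction by a."""
    return math.isclose(predict_next(a * t1, a * t2, alpha),
                        a * predict_next(t1, t2, alpha), abs_tol=1e-9)


def is_shift_invariant(alpha, t1, t2, b):
    """Moving the starting point by b should shift the prediction by b."""
    return math.isclose(predict_next(t1 + b, t2 + b, alpha),
                        predict_next(t1, t2, alpha) + b, abs_tol=1e-9)


def is_reversible(alpha, t1, t2):
    """From -t3 and -t2 we should predict -t1 (Definition 3)."""
    t3 = predict_next(t1, t2, alpha)
    return math.isclose(predict_next(-t3, -t2, alpha), -t1, abs_tol=1e-9)


for alpha in (0.5, 1.0, 2.0):
    print(alpha,
          is_scale_invariant(alpha, 3.0, 5.0, a=60.0),
          is_shift_invariant(alpha, 3.0, 5.0, b=1000.0),
          is_reversible(alpha, 3.0, 5.0))
```

For these sample values, all three choices of $\alpha$ pass the scale and shift checks, while only $\alpha = 1$ passes the reversibility check, in line with Proposition 2.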
From the simplified deterministic case to the desired probabilistic case. In practice, we cannot predict the actual time $t_3$ of the next occurrence; we can only predict the probabilities of different times $t_3$. Usually, the corresponding uncertainty is caused by a joint effect of many different independent factors. It is known that in such situations, the resulting probability distribution is close to Gaussian: this is the essence of the Central Limit Theorem, which explains the ubiquity of Gaussian distributions; see, e.g., [8]. It is therefore reasonable to conclude that the distribution for $t_3$ is Gaussian, with some mean $\mu$ and standard deviation $\sigma$.

There is a minor problem with this conclusion; namely:
• the Gaussian distribution has non-zero probability density for all possible real values, while
• we want to have only values $t_3 > t_2$.

This can be taken into account if we recall that in practice, values outside a certain $k\sigma$-interval $[\mu - k \cdot \sigma, \mu + k \cdot \sigma]$ have so little probability that they are considered to be impossible. Depending on how low we want this probability to be, we can take $k = 3$, or $k = 6$, or some other value $k$. So, it is reasonable to assume that the lower endpoint of this interval corresponds to $t_2$, i.e., that $\mu - k \cdot \sigma = t_2$. Hence, for given $t_1$ and $t_2$, once we know $\mu$, we can determine $\sigma$. Thus, to find the corresponding distribution, it is sufficient to find the corresponding value $\mu$.

As this mean value $\mu$, it is reasonable to take the result of the deterministic prediction, i.e., $\mu = t_2 + (t_2 - t_1)$. In this case, from the above formula relating $\mu$ and $\sigma$, we conclude that $\sigma = (t_2 - t_1)/k$.

Finally, an explanation. Now we are ready to explain the observed phenomenon. In the case of immediate repetition, when the difference $t_2 - t_1$ is small, most of the probability (close to 1) is located in a small vicinity of $t_2$, namely in the $k\sigma$-interval, which now takes the form $[t_2, t_2 + 2(t_2 - t_1)]$.
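The Gaussian model just described can be explored numerically. The following sketch is an illustration under stated assumptions, not the authors' computation: time is measured in days, $k = 3$, the "short-term" window is taken to be the first 10 days after the repetition and the "long-term" window everything later, and the costs m = 1 and L = 5 are hypothetical. It combines the formulas $\mu = t_2 + (t_2 - t_1)$ and $\sigma = (t_2 - t_1)/k$ with the decision rule $p \cdot L > m$.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_next_need(t1, t2, start, end, k=3.0):
    """Probability that t3 falls in [start, end], with mu and sigma as in the text."""
    gap = t2 - t1
    mu = t2 + gap          # deterministic prediction
    sigma = gap / k        # from mu - k * sigma = t2
    return normal_cdf(end, mu, sigma) - normal_cdf(start, mu, sigma)


def worth_storing(p, m, loss):
    """Decision rule: store the material if expected loss p * L exceeds cost m."""
    return p * loss > m


HORIZON = 10.0        # hypothetical boundary (days) between short and long term
m, loss = 1.0, 5.0    # hypothetical storage cost m and loss L

for label, (t1, t2) in {"immediate": (0.0, 1.0), "delayed": (0.0, 20.0)}.items():
    p_short = prob_next_need(t1, t2, t2, t2 + HORIZON)
    p_long = prob_next_need(t1, t2, t2 + HORIZON, math.inf)
    print(label,
          round(p_short, 3), worth_storing(p_short, m, loss),
          round(p_long, 3), worth_storing(p_long, m, loss))
```

With these assumed numbers, immediate repetition gives a short-term probability of about 0.999 and a long-term probability of about 0, while delayed repetition gives about 0.065 and 0.933; with the threshold m/L = 0.2, only the short-term storage is justified in the first case and only the long-term storage in the second.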
Thus, in the case of immediate repetition, we have:
• an (almost highest possible) probability $p \approx 1$ that the next occurrence will happen in the short-term time interval and
• a close-to-0 probability that it will happen in the long-term time interval.

Not surprisingly, in this case, we get:
• better short-term learning than for other learning strategies, but
• much worse long-term learning.

In contrast, in the case of delayed repetition, when the difference $t_2 - t_1$ is large, the interval $[t_2, t_2 + 2(t_2 - t_1)]$ of possible values $t_3$ spreads over long-term times as well. Thus, here:
• the probability $p$ to be in the short-term interval is smaller than the value $\approx 1$ corresponding to immediate repetition, but
• the probability to be in the long-term interval is larger than the value $\approx 0$ corresponding to immediate repetition.

As a result, for this learning strategy:
• we get worse short-term learning but • we get much better long-term learning, exactly as empirically observed. Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References
1. Epstein, D.: Range: Why Generalists Triumph in the Specialized World. Riverhead Books, New York (2019)
2. Fishburn, P.C.: Utility Theory for Decision Making. Wiley, New York (1969)
3. Kosheleva, O., Villaverde, K.: How Interval and Fuzzy Techniques Can Improve Teaching. Springer, Cham, Switzerland (2018)
4. Kreinovich, V.: Decision making under interval uncertainty (and beyond). In: Guo, P., Pedrycz, W. (eds.) Human-Centric Decision-Making Models for Social Sciences, pp. 163–193. Springer (2014)
5. Lerma, O., Kosheleva, O., Kreinovich, V.: Interleaving enhances learning: a possible geometric explanation. Geombinatorics 24(3), 135–139 (2015)
6. Luce, R.D., Raiffa, H.: Games and Decisions: Introduction and Critical Survey. Dover, New York (1989)
7. Raiffa, H.: Decision Analysis. Addison-Wesley, Reading, Massachusetts (1970)
8. Sheskin, D.J.: Handbook of Parametric and Non-Parametric Statistical Procedures. Chapman & Hall/CRC, London, UK (2011)
Absence of Remotely Triggered Large Earthquakes: A Geometric Explanation

Laxman Bokati, Aaron Velasco, and Vladik Kreinovich
Abstract It is known that seismic waves from a large earthquake can trigger earthquakes in distant locations. Some of the triggered earthquakes are strong themselves. Interestingly, strong triggered earthquakes only happen within a reasonably small distance (less than 1000 km) from the original earthquake. Even catastrophic earthquakes do not trigger any strong earthquakes beyond this distance. In this paper, we provide a possible geometric explanation for this phenomenon.
L. Bokati, Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
A. Velasco, Department of Geological Sciences, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (corresponding author), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_6

1 Formulation of the Problem

Triggered earthquakes: original expectations. It is known that seismic waves from a large earthquake can trigger earthquakes at some distance from the original quake; see, e.g., [1–6, 8, 9]. At first glance, it seems reasonable to conclude that the stronger the original earthquake, the stronger the triggered earthquakes will be, so that catastrophic earthquakes will trigger strong earthquakes even far away from the original location.

Unexpected empirical fact. Somewhat surprisingly, it turned out that no matter how strong the original earthquake, strong triggered earthquakes are limited to a distance of about 1000 km from the original event. At larger distances, the triggered (secondary) earthquakes are all low-magnitude, with magnitude M < 5 on the Richter scale; see, e.g., [7].

Why? At present, there is no convincing explanation for this empirical fact.

What we do in this paper. In this paper, we provide a possible geometric explanation for the observed phenomenon.
2 Geometric Explanation

Main idea. Our explanation is based on a very natural idea: if we have a phenomenon that is symmetric, i.e., invariant with respect to some reasonable transformation, then the effects of this phenomenon will also be invariant with respect to the same transformation. For example, if we have a plank placed symmetrically over a fence, so that it has exactly the same length to the left and to the right of the fence, and we apply similar forces to the left and right ends of this plank, we expect it to curve the same way to the left and to the right of the fence.

What are reasonable transformations here? All related physical processes do not change if we simply shift from one place to another and/or rotate the corresponding configuration by some angle. If we describe each point $x$ by its coordinates $x_i$, then a shift means that each coordinate $x_i$ is replaced by a shifted value $x_i' = x_i + a_i$, and a rotation means that we replace the original coordinates $x_i$ with the rotated ones $x_i' = \sum_{j=1}^{n} r_{ij} \cdot x_j$ for an appropriate rotation matrix $r_{ij}$.
In addition, many physical processes, like electromagnetic or gravitational forces, do not have a fixed spatial scale. If we scale down or scale up, we get the same physical phenomenon (of course, we need to be careful when scaling down or scaling up). This is how, e.g., airplanes were tested before computer simulations were possible: you test a scaled-down model of a plane in a wind tunnel, and it provides a very accurate description of what will happen to the actual airplane. So, to shift and rotation, it is reasonable to add scaling $x_i \to \lambda \cdot x_i$, for an appropriate value $\lambda$.

What is the symmetry of the propagating seismic wave? In a reasonable first approximation, the seismic wave propagates equally in all directions with approximately the same speed. So, in this approximation, at any given moment of time, the locations reached by the wave form a circle with radius $r$ equal to the propagation speed times the time elapsed since the original earthquake. When we are close to the earthquake location, we can easily see that the set of all these locations is not a straight line segment: it is a curved part of a circle. However, as we get further and further away from the original earthquake location, this curving
becomes less and less visible, just like we easily notice the curvature of a ball, but it is difficult to notice the curvature of the Earth's surface; for most experiments, it is safe to assume that locally, the Earth is flat (and this is what people believed for a long time, until more sophisticated measurements showed that it is not flat). So:
• in places close to the original earthquake, the set of locations affected by the incoming seismic wave can be approximated as a circle’s arc, i.e., a local part of a circle, while
• in places far away from the original earthquake, the set of locations affected by the incoming seismic wave can be well approximated by a straight line segment.

It is important to emphasize that the difference between these two situations depends only on the distance to the original earthquake location; it does not depend on the strength of the earthquake. It is the same for very weak and for very strong earthquakes.

What is the effect of these two different symmetries? Out of all possible symmetries (shifts, rotations, and scalings), a circle is only invariant with respect to rotations around its center. Thus, we expect the effect of the resulting seismic wave to be also invariant with respect to such rotations, so the area A affected by the incoming wave should be similarly invariant. This means that with each point a, this area must contain the whole circle that passes through a and is centered at the original earthquake location. As a result, this area consists of one or several such circles. From the viewpoint of this invariance, it could be that the affected area is limited to the circle itself, in which case the area is small, and its effect is small. It can also be that the area includes many concentric circles, in which case the affected area may be significant, and its effect may be significant.

On the other hand, a straight line has different symmetries: it is invariant with respect to shifts along this line and arbitrary scalings.
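The two symmetry groups named here can be illustrated with a small numerical check. This is only a sketch with hypothetical coordinates: the circle is centered at the origin, the straight line is taken to be the x-axis, and scalings are taken about the origin, which lies on that line. The last computation also shows what repeated scaling does to a point that lies off the line.

```python
import math


def rotate(p, angle):
    """Rotate point p = (x, y) around the origin by `angle` radians."""
    x, y = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def scale(p, lam):
    """Scale point p about the origin by the factor lam."""
    return (lam * p[0], lam * p[1])


def shift_along_line(p, d):
    """Shift point p by d along the x-axis (our model of the straight line)."""
    return (p[0] + d, p[1])


def on_circle(p, r, tol=1e-9):
    return abs(math.hypot(p[0], p[1]) - r) <= tol


def on_line(p, tol=1e-9):
    return abs(p[1]) <= tol


r = 2.0
circle_pt = (r * math.cos(0.7), r * math.sin(0.7))
line_pt = (3.0, 0.0)
off_line_pt = (3.0, 0.5)

# A circle is preserved by rotations about its center, but not by scalings
circle_rotated_ok = on_circle(rotate(circle_pt, 1.3), r)
circle_scaled_ok = on_circle(scale(circle_pt, 2.0), r)

# A line is preserved both by shifts along it and by scalings
line_ok = on_line(shift_along_line(line_pt, 5.0)) and on_line(scale(line_pt, 10.0))

# A point off the line is pushed arbitrarily far from the line by scalings
distances = [abs(scale(off_line_pt, lam)[1]) for lam in (1, 10, 100)]

print(circle_rotated_ok, circle_scaled_ok, line_ok, distances)
```

In this toy setting, the rotated circle point stays on the circle while the scaled one does not, the line points stay on the line under both transformations, and the off-line point ends up at distances 0.5, 5.0, and 50.0 from the line.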
Thus, it is reasonable to conclude that the area affected by such an almost-straight-line seismic wave is also invariant with respect to the same symmetries. This implies that this area is limited to the line itself: otherwise, if the area A had at least one point outside the line, then:
• by shifting along the original line, we can form a whole line parallel to the original line, and then
• by applying different scalings, we would get all the lines parallel to the original line, no matter at what distance, and thus, we would get the whole plane, while the affected area has to be bounded.

Thus, in such situations, the effect of the seismic wave is limited to the line itself, i.e., in effect, to a narrow area around this line, and will, thus, be reasonably weak.

This indeed explains the absence of remotely triggered large earthquakes. Indeed, for locations close to the earthquake, the resulting phenomenon is (approximately) invariant with respect to rotations, and thus, its effect should be similarly invariant. This leaves open the possibility that a large area will be affected and thus, that the resulting effect will be strong, which explains why in a small vicinity, it is possible to have a triggered large earthquake.
On the other hand, in remote locations, i.e., locations far away from the original earthquake, the resulting phenomenon is invariant with respect to shifts and scalings, and thus its effect should be similarly invariant. As a result, only a very small area is affected, which explains why, no matter how strong the original earthquake, it never triggers a large earthquake in such remote locations.

Comments.
• It should be mentioned that our analysis is about the geometric shape of the area affected by the seismic wave, not about the physical properties of the seismic wave itself. From the physical viewpoint, at each sensor location, the seismic wave can definitely be treated as a planar wave already at much shorter distances from the original earthquake than 1000 km. However, if instead of limiting ourselves to the location of a single sensor, we consider the whole area affected by the seismic wave (which may include many seismic sensors), then, at distances below 1000 km, we can no longer ignore the fact that the front of the incoming wave is curved. (At larger distances from the earthquake, even at such a macro-level, the curvature can be ignored.)
• It should also be mentioned that what we propose is a simple qualitative explanation of the observed phenomenon. To be able to explain it quantitatively (e.g., to understand why 1000 km and not any other distance is an appropriate threshold, and why exactly the Richter scale M = 5 is the right threshold), we probably need to supplement our simplified geometric analysis with a detailed physical analysis of the corresponding phenomena.

Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References
1. Brodsky, E.E., Karakostas, V., Kanamori, H.: A new observation of dynamically triggered regional seismicity: earthquakes in Greece following the August 1999 Izmit, Turkey earthquake. Geophys. Res. Lett. 27, 2741–2744 (2000)
2. Gomberg, J., Bodin, P., Larson, K., Dragert, H.: Earthquake nucleation by transient deformations caused by the M = 7.9 Denali, Alaska, earthquake. Nature 427, 621–624 (2004)
3. Gomberg, J., Reasenberg, P., Bodin, P., Harris, R.: Earthquake triggering by transient seismic waves following the Landers and Hector Mine, California, earthquake. Nature 411, 462–466 (2001)
4. Hill, D.P., et al.: Seismicity in the Western United States triggered by the M 7.4 Landers, California, earthquake of June 28, 1992. Science 260, 1617–1623 (1993)
5. Hough, S.E., Seeber, L., Armbruster, J.G.: Intraplate triggered earthquakes: observations and interpretation. Bull. Seismol. Soc. Am. 93(5), 2212–2221 (2003)
6. Kilb, D., Gomberg, J., Bodin, P.: Triggering of earthquake aftershocks by dynamic stresses. Nature 408, 570–574 (2000)
7. Parsons, T., Velasco, A.A.: Absence of remotely triggered large earthquakes beyond the mainshock region. Nat. Geosci. 4, 312–316 (2011)
Absence of Remotely Triggered Large Earthquakes: A Geometric Explanation
8. Velasco, A.A., Hernandez, S., Parsons, T., Pankow, K.: Global ubiquity of dynamic earthquake triggering. Nat. Geosci. 1, 375–379 (2008)
9. West, M., Sanchez, J.J., McNutt, S.R.: Periodically triggered seismicity at Mount Wrangell, Alaska, after the Sumatra earthquake. Science 308, 1144–1146 (2005)
Why Gamma Distribution of Seismic Inter-Event Times: A Theoretical Explanation Laxman Bokati, Aaron Velasco, and Vladik Kreinovich
Abstract It is known that the distribution of seismic inter-event times is well described by the Gamma distribution. Recently, this fact has been used to successfully predict major seismic events. In this paper, we show that the Gamma distribution of seismic inter-event times can be naturally derived from first principles.
1 Formulation of the Problem

Gamma distribution of seismic inter-event times: empirical fact. Detailed analysis of the seismic inter-event times t (i.e., of the times between two consecutive seismic events occurring in the same area) shows that these times are distributed according to the Gamma distribution, with probability density

$$\rho(t) = C \cdot t^{\gamma - 1} \cdot \exp(\mu \cdot t), \tag{1}$$
for appropriate values γ , μ, and C; see, e.g., [1, 2]. Lately, there has been a renewed interest in this seemingly very technical result, since a recent paper [4] has shown that the value of the parameter μ can be used
L. Bokati Computational Science Program, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] A. Velasco Department of Geological Sciences, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_7
to predict a major seismic event based on the preceding foreshocks. Specifically, it turns out that more than 70% of major seismic events in Southern California could be predicted some time in advance, on average about two weeks in advance.

Why gamma distribution? This interest raises a natural question: why do the inter-event times follow the Gamma distribution? In this paper, we provide a possible theoretical explanation for this empirical fact.
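As a quick sanity check on formula (1), the following sketch evaluates the density numerically. It rests on assumptions that are not in the text: that μ < 0 (needed for the density to be normalizable, a sign condition the formula leaves implicit), the standard normalization C = (−μ)^γ / Γ(γ), and the hypothetical values γ = 2 and μ = −0.5, chosen only for illustration.

```python
import math


def gamma_density(t, gamma, mu):
    """Density rho(t) = C * t**(gamma - 1) * exp(mu * t) for t > 0.

    Assumes gamma > 0 and mu < 0, so that the density can be normalized.
    """
    if gamma <= 0 or mu >= 0:
        raise ValueError("need gamma > 0 and mu < 0 for a normalizable density")
    # Standard normalizing constant of the Gamma distribution.
    c = (-mu) ** gamma / math.gamma(gamma)
    return c * t ** (gamma - 1) * math.exp(mu * t)


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


GAMMA, MU = 2.0, -0.5  # hypothetical parameter values

# Total probability (should be close to 1) and mean inter-event time.
total = integrate(lambda t: gamma_density(t, GAMMA, MU), 0.0, 100.0)
mean = integrate(lambda t: t * gamma_density(t, GAMMA, MU), 0.0, 100.0)
```

With these values, the numerical mean should agree with the closed-form mean γ / (−μ) of a Gamma distribution.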
2 Our Explanation

Maximum entropy: general idea. In our explanation, we will use Laplace’s Indeterminacy Principle, which is also known as the maximum entropy approach; see, e.g., [3]. The simplest case of this approach is when we have n alternatives, and we have no reasons to believe that one of them is more probable than the others. In this case, a reasonable idea is to consider these alternatives to be equally probable, i.e., to assign, to each of these n alternatives, the same probability $p_1 = \ldots = p_n$. Since the probabilities should add up to 1, i.e., $\sum_{i=1}^{n} p_i = 1$, we thus get $p_i = 1/n$.
In this case, we did not introduce any new degree of certainty into the situation that was not there before, as would have happened, e.g., if we selected a higher probability for one of the alternatives. In other words, out of all possible probability distributions, i.e., out of all possible tuples $(p_1, \ldots, p_n)$ for which $\sum_{i=1}^{n} p_i = 1$, we selected the one with the largest possible uncertainty.

In general, it is known that uncertainty can be described by the entropy S, which in the case of finitely many alternatives has the form $S = -\sum_{i=1}^{n} p_i \cdot \ln(p_i)$, and in the case of a continuous random variable with probability density ρ(x), for which the sum becomes an integral, a similar form

$$S = -\int \rho(x) \cdot \ln(\rho(x)) \, dx. \tag{2}$$

Maximum entropy: examples. If the only information that we have about a probability distribution is that it is located somewhere on the given interval [a, b], then the only constraint on the corresponding probability density function is that the overall probability over this interval is 1, i.e., that

$$\int_a^b \rho(x) \, dx = 1. \tag{3}$$
So, to apply the maximum entropy approach, we need to maximize the objective function (2) under the constraint (3). The usual way of solving such a constrained
optimization problem is to apply the Lagrange multiplier method, which reduces the original constrained optimization problem to an appropriate unconstrained optimization problem. In this case, this new problem means maximizing the expression

$$-\int_a^b \rho(x) \cdot \ln(\rho(x)) \, dx + \lambda \cdot \left( \int_a^b \rho(x) \, dx - 1 \right), \tag{4}$$
where the parameter λ—known as Lagrange multiplier—needs to be determined from the condition that the solution to this optimization problem satisfies the constraint (3). For this problem, the unknowns are the values ρ(x) corresponding to different x. Differentiating the expression (4) with respect to ρ(x) and equating the derivative to 0, we get the following equation: − ln(ρ(x)) − 1 + λ = 0. (Strictly speaking, we need to use variational differentiation, since the unknown is a function.) The above equation implies that ln(ρ(x)) = λ − 1, and thus, that ρ(x) = const. So, we get a uniform distribution—in full accordance with the original idea that, since we do not have any reasons to believe that some points on this interval are more probable than others, we consider all these points to be equally probable. If, in addition to the range [a, b], we also know the mean value
$$\int_a^b x \cdot \rho(x) \, dx = m \tag{5}$$

of the corresponding random variable, then we need to maximize the entropy (2) under the two constraints (3) and (5). In this case, the Lagrange multiplier method leads to the unconstrained optimization problem of maximizing the following expression:

$$-\int_a^b \rho(x) \cdot \ln(\rho(x)) \, dx + \lambda \cdot \left( \int_a^b \rho(x) \, dx - 1 \right) + \lambda_1 \cdot \left( \int_a^b x \cdot \rho(x) \, dx - m \right). \tag{6}$$
Differentiating this expression with respect to ρ(x) and equating the derivative to 0, we get the following equation: $-\ln(\rho(x)) - 1 + \lambda + \lambda_1 \cdot x = 0$, hence $\ln(\rho(x)) = (\lambda - 1) + \lambda_1 \cdot x$ and so, we get a (restricted) Laplace distribution, with the probability density $\rho(x) = C \cdot \exp(\mu \cdot x)$, where we denoted $C \stackrel{\text{def}}{=} \exp(\lambda - 1)$ and $\mu \stackrel{\text{def}}{=} \lambda_1$.

Comment. It is worth mentioning that if, in addition to the mean value, we also know the second moment of the corresponding random variable, then similar arguments lead us to the conclusion that the corresponding distribution is Gaussian. This conclusion is in good accordance with the ubiquity of Gaussian distributions.

What are the reasonable quantities in our problem. We are interested in the probability distribution of the inter-event time t. Based on the observations, we can find the mean inter-event time, so it makes sense to assume that we know the mean value of this time.

Usual (astronomical) time versus internal time: general idea. This mean value is estimated if we use the usual (astronomical) time t, as measured, e.g., by the rotation of the Earth around its axis and around the Sun. However, it is known that many processes also have their own “internal” time, based on the corresponding internal cycles. For example, we can measure the biological time of an animal (or a person) by such natural cyclic activities as breathing or heartbeat. Usually, breathing and heart rate are more or less constant, but, e.g., during sleep, they slow down, as most other biological processes do. On the other hand, in stressful situations, e.g., when the animal’s life is in danger, all the biological processes speed up, including breathing and heart rate. To adequately describe how different biological characteristics change with time, it makes sense to consider not only how they change in astronomical time, but also how they change in the corresponding internal time, measured not by the number of Earth’s rotations around the Sun, but rather by the number of heartbeats. An even more drastic slowdown occurs when an animal hibernates. In general, the system’s internal time can be sometimes slower than astronomical time, and sometimes faster.

Usual (astronomical) time versus internal time: case of seismic events.
In our problem, there is a similar phenomenon: usually, seismic events are reasonably rare. However, the observations indicate that the frequency with which foreshocks appear increases when we get closer to a major seismic event. In such a situation, the corresponding seismic processes speed up, so we can say that the internal time speeds up. In general, internal time is often a more adequate description of the system’s changes than astronomical time. It is therefore reasonable to supplement the mean value of the inter-event time measured in astronomical time by the mean value of the inter-event time measured in the corresponding internal time.

How internal time depends on astronomical time: general idea. To describe this idea in precise terms, we need to know how this internal time τ depends on the astronomical time. As we have mentioned, the usual astronomical time is measured by natural cycles, i.e., by processes which are periodic in terms of the time t. So, to
find the expression for internal time, we need to analyze what cycles naturally appear in the studied system, and then define internal time in terms of these cycles. To describe the system’s dynamics means to describe how the corresponding physical quantities x(t) change with time t. In different physical situations, we can have different functions x(t). In principle, to describe a general function, we need infinitely many parameters, e.g., the values of this function at different moments of time. In practice, however, we can only have finitely many parameters. So, it is reasonable to consider finite-parametric families of functions. The simplest and most natural approach is to select some basic functions $e_1(t), \ldots, e_n(t)$, and to consider all possible linear combinations of these functions, i.e., all possible functions of the type

$$x(t) = C_1 \cdot e_1(t) + \cdots + C_n \cdot e_n(t), \tag{7}$$

where $C_1, \ldots, C_n$ are the corresponding parameters. This is indeed what is done in many situations: sometimes, we approximate the dynamics by polynomials (linear combinations of powers $t^k$), sometimes we use linear combinations of sinusoids, sometimes linear combinations of exponential functions, etc.

How internal time depends on astronomical time: case of seismic events. The quality of this approximation depends on how adequate the corresponding basis functions are for the corresponding physical process. Let us analyze which families are appropriate for our specific problem: the analysis of foreshocks preceding a major seismic event. In this analysis, we can use the fact that, in general, to transform a physical quantity into a numerical value, we need to select a starting point and a measuring unit. If we select a different starting point and/or a different measuring unit (e.g., minutes instead of seconds), we will get different numerical values for the same quantity. For the inter-event times, the starting point is fixed: it is 0, the case when the next seismic event follows immediately after the previous one. So, the only remaining change is the change of a measuring unit. If we replace the original time unit with one which is r times smaller, then all numerical values are multiplied by r, i.e., instead of the original value t, we get a new value $t_{\text{new}} = r \cdot t$. For example, if we replace minutes by seconds, then the numerical values of all time intervals are multiplied by 60, so that, e.g., 2.5 min becomes 60 · 2.5 = 150 s.

Some seismic processes are faster, some are slower. This means that, in effect, they differ by such slower-to-faster or faster-to-slower transformations t → r · t. We would like to have a general description that would fit all these cases.
In other words, we would like to make sure that the class (7) remains the same after this “re-scaling”, i.e., that for each i and for each r, the re-scaled function $e_i(r \cdot t)$ belongs to the same class (7). In other words, we require that for each i and r, there exist values $C_{ij}(r)$ for which

$$e_i(r \cdot t) = C_{i1}(r) \cdot e_1(t) + \cdots + C_{in}(r) \cdot e_n(t). \tag{8}$$
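To make requirement (8) concrete, here is a small numerical check on a hypothetical two-function family, $e_1(t) = t^a$ and $e_2(t) = t^a \cdot \ln(t)$. The family, the exponent a = 0.7, and the test points are assumptions chosen for illustration only; the sketch merely confirms that, for this family, the re-scaled functions are again linear combinations of the original ones.

```python
import math

A = 0.7  # hypothetical exponent, chosen only for illustration


def e1(t):
    return t ** A


def e2(t):
    return t ** A * math.log(t)


def rescaling_coefficients(r):
    """Coefficients C_ij(r) with e_i(r*t) = C_i1(r)*e1(t) + C_i2(r)*e2(t).

    They follow from (r*t)**A = r**A * t**A and ln(r*t) = ln(r) + ln(t).
    """
    ra = r ** A
    return [[ra, 0.0],
            [ra * math.log(r), ra]]


def max_residual(r, ts):
    """Largest violation of requirement (8) over the sample points ts."""
    c = rescaling_coefficients(r)
    worst = 0.0
    for t in ts:
        for i, e in enumerate((e1, e2)):
            combo = c[i][0] * e1(t) + c[i][1] * e2(t)
            worst = max(worst, abs(e(r * t) - combo))
    return worst


# r = 60 corresponds to switching from minutes to seconds.
residual = max_residual(r=60.0, ts=[0.5, 1.0, 2.5, 10.0])
```

The residual should vanish up to floating-point rounding, which is what closure under re-scaling means for this family.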
Let us solve the resulting systems of equations. Seismic waves may be changing fast but, in general, they are still smooth. It is therefore reasonable to consider only smooth functions $e_i(t)$. If we pick n different values $t_1, \ldots, t_n$, then, for each r and for each i, we get a system of n linear equations for determining the n unknowns $C_{i1}(r), \ldots, C_{in}(r)$:

$$e_i(r \cdot t_1) = C_{i1}(r) \cdot e_1(t_1) + \cdots + C_{in}(r) \cdot e_n(t_1);$$
$$\ldots$$
$$e_i(r \cdot t_n) = C_{i1}(r) \cdot e_1(t_n) + \cdots + C_{in}(r) \cdot e_n(t_n).$$

Due to Cramer’s rule, each component $C_{ij}(r)$ of the solution to this system of linear equations is a ratio of two determinants and is, thus, a smooth function of the corresponding coefficients $e_i(r \cdot t_j)$ and $e_k(t_j)$. Since the functions $e_i(t)$ are differentiable, we conclude that the functions $C_{ij}(r)$ are also differentiable.

Since all the functions $e_i(t)$ and $C_{ij}(r)$ are differentiable, we can differentiate both sides of the formula (8) with respect to r and get:

$$e_i'(r \cdot t) \cdot t = C_{i1}'(r) \cdot e_1(t) + \cdots + C_{in}'(r) \cdot e_n(t),$$

where for each function f, the expression f′, as usual, denotes the derivative. In particular, for r = 1, we get

$$e_i'(t) \cdot t = c_{i1} \cdot e_1(t) + \cdots + c_{in} \cdot e_n(t),$$

where we denoted $c_{ij} \stackrel{\text{def}}{=} C_{ij}'(1)$. For the auxiliary variable T = ln(t), for which t = exp(T), we have $dT = \dfrac{dt}{t}$, hence $\dfrac{de_i(t)}{dt} \cdot t = \dfrac{de_i(t)}{dT}$. So, for the auxiliary functions $E_i(T) \stackrel{\text{def}}{=} e_i(\exp(T))$, we get

$$E_i'(T) = c_{i1} \cdot E_1(T) + \cdots + c_{in} \cdot E_n(T).$$

So, for the functions $E_i(T)$, we get a system of linear differential equations with constant coefficients. It is well known that a general solution to such a system is a linear combination of the expressions $T^k \cdot \exp(a \cdot T) \cdot \sin(\omega \cdot T + \varphi)$ for some natural number k and real numbers a, ω, and φ. Thus, each function $e_i(t) = E_i(\ln(t))$ is a linear combination of the expressions

$$(\ln(t))^k \cdot \exp(a \cdot \ln(t)) \cdot \sin(\omega \cdot \ln(t) + \varphi) = (\ln(t))^k \cdot t^a \cdot \sin(\omega \cdot \ln(t) + \varphi). \tag{9}$$

So, the general expression (7) is also a linear combination of such functions. The periodic part of this expression is a sine or cosine function of ln(t), so we can conclude that for seismic processes, the internal time τ is proportional to the
logarithm ln(t) of the astronomical time: τ = c · ln(t) for some constant c.

This explains the ubiquity of Gamma distributions. Indeed, suppose that we know the mean value $m_t$ of the astronomical time t and the mean value $m_\tau$ of the internal time τ = c · ln(t). This means that the corresponding probability density function ρ(t), in addition to the usual constraint $\int \rho(t) \, dt = 1$, also satisfies the constraints

$$\int t \cdot \rho(t) \, dt = m_t \quad \text{and} \quad \int c \cdot \ln(t) \cdot \rho(t) \, dt = m_\tau.$$

Out of all possible distributions satisfying these three constraints, we want to select the one that maximizes the entropy $-\int \rho(t) \cdot \ln(\rho(t)) \, dt$.

To solve the resulting constrained optimization problem, we can apply the Lagrange multiplier method and reduce it to the unconstrained optimization problem of maximizing the expression

$$-\int \rho(t) \cdot \ln(\rho(t)) \, dt + \lambda \cdot \left( \int \rho(t) \, dt - 1 \right) + \lambda_t \cdot \left( \int t \cdot \rho(t) \, dt - m_t \right) + \lambda_\tau \cdot \left( \int c \cdot \ln(t) \cdot \rho(t) \, dt - m_\tau \right),$$

for some values $\lambda$, $\lambda_t$, and $\lambda_\tau$. Differentiating this expression with respect to each unknown ρ(t) and equating the derivative to 0, we conclude that $-\ln(\rho(t)) - 1 + \lambda + \lambda_t \cdot t + \lambda_\tau \cdot c \cdot \ln(t) = 0$, i.e., that $\ln(\rho(t)) = (\lambda - 1) + \lambda_t \cdot t + (\lambda_\tau \cdot c) \cdot \ln(t)$. Exponentiating both sides, we get the desired Gamma distribution form (1):

$$\rho(t) = C \cdot t^{\gamma - 1} \cdot \exp(\mu \cdot t),$$

with $C = \exp(\lambda - 1)$, $\gamma = \lambda_\tau \cdot c + 1$, and $\mu = \lambda_t$. Thus, we have indeed explained the ubiquity of the Gamma distribution.
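The derivation says that the two mean values fix the parameters of the Gamma density. The following sketch, under stated assumptions, shows how γ and μ could be recovered numerically from the mean of t and the mean of ln(t) (i.e., $m_\tau / c$). It relies on two standard facts about the Gamma distribution that the text does not state: E[t] = γ / (−μ) and E[ln t] = ψ(γ) − ln(−μ), where ψ is the digamma function. The helper names `digamma` and `fit_gamma` are hypothetical, and the digamma routine is a simple recurrence-plus-asymptotic-series approximation, not a library-grade implementation.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift the argument up, using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def fit_gamma(mean_t, mean_log_t):
    """Return (gamma, mu) of rho(t) = C * t**(gamma-1) * exp(mu*t).

    Solves ln(gamma) - psi(gamma) = ln(mean_t) - mean_log_t by bisection;
    the left-hand side decreases from +infinity to 0 as gamma grows.
    """
    target = math.log(mean_t) - mean_log_t
    if target <= 0:
        raise ValueError("need ln(E[t]) > E[ln t] (Jensen's inequality)")
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if math.log(mid) - digamma(mid) > target:
            lo = mid  # value too large, so gamma must be larger
        else:
            hi = mid
    gamma = math.sqrt(lo * hi)
    return gamma, -gamma / mean_t


# Round trip for hypothetical parameters gamma = 2, mu = -0.5.
true_mean_t = 2.0 / 0.5
true_mean_log_t = digamma(2.0) - math.log(0.5)
gamma_hat, mu_hat = fit_gamma(true_mean_t, true_mean_log_t)
```

This is only a consistency check of the parameter count: two constraints beyond normalization, two free parameters γ and μ.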
Acknowledgements This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References
1. Corral, A.: Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes. Phys. Rev. Lett. 92, Paper 108501 (2004)
2. Hainzl, S., Scherbaum, F., Beauval, C.: Estimating background activity based on interevent-time distribution. Bull. Seismol. Soc. Am. 96(1), 313–320 (2006)
3. Jaynes, E.T., Bretthorst, G.L.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK (2003)
4. Trugman, D.T., Ross, Z.E.: Pervasive foreshock activity across Southern California. Geophys. Res. Lett. 46 (2019). https://doi.org/10.1029/2019GL083725
Quantum Computing as a Particular Case of Computing with Tensors Martine Ceberio and Vladik Kreinovich
Abstract One of the main potential applications of uncertainty in computations is quantum computing. In this paper, we show that the success of quantum computing can be explained by the fact that quantum states are, in effect, tensors.
1 Why Tensors

One of the main problems of modern computing is that:
• we have to process large amounts of data;
• and therefore, a long time is required to process this data.
A similar situation occurred in 19th century physics:
• physicists had to process large amounts of data;
• and, because of the large amount of data, a long time was required to process this data.
Recall that in the 19th century, the problem was solved by using tensors. It is therefore a natural idea to also use tensors to solve the problems of modern computing.
M. Ceberio · V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] M. Ceberio e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_8
2 Tensors in Physics: A Brief Reminder

Let us recall how tensors helped 19th century physics; see, e.g., [6]. Physics starts with measuring and describing the values of different physical quantities. It goes on to equations which enable us to predict the values of these quantities. A measuring instrument usually returns a single numerical value. For some physical quantities (like mass m), the single measured value is sufficient to describe the quantity. For other quantities, we need several values. For example, we need three components $E_x$, $E_y$, and $E_z$ to describe the electric field at a given point. To describe the tension inside a solid body, we need even more values: we need 6 values $\sigma_{ij} = \sigma_{ji}$ corresponding to different values $1 \le i, j \le 3$: $\sigma_{11}$, $\sigma_{22}$, $\sigma_{33}$, $\sigma_{12}$, $\sigma_{23}$, and $\sigma_{13}$.

The problem was that in the 19th century, physicists used a separate equation for each component of the field. As a result, equations were cumbersome and difficult to solve. The main idea of the tensor approach is to describe all the components of a physical field as a single mathematical object:
• a vector $a_i$,
• or, more generally, a tensor $a_{ij}$, $a_{ijk}$, ...
As a result, we got simplified equations and faster computations. It is worth mentioning that originally, mostly vectors (rank-1 tensors) were used. However, 20th century physics has shown that higher-rank tensors are also useful. For example:
• matrices (rank-2 tensors) are actively used in quantum physics,
• higher-order tensors such as the rank-4 curvature tensor $R_{ijkl}$ are actively used in the General Relativity Theory.
3 From Tensors in Physics to Computing with Tensors

As we have mentioned earlier, 19th century physics encountered a problem of too much data. Tensors helped to solve this problem. Modern computing suffers from a similar problem. A natural idea is that tensors can help here as well. Two examples justify our optimism:
• modern algorithms for fast multiplication of large matrices; and
• quantum computing.
4 Modern Algorithm for Multiplying Large Matrices In many data processing algorithms, we need to multiply large-size matrices:
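$$\begin{pmatrix} a_{11} & \ldots & a_{1n} \\ \ldots & \ldots & \ldots \\ a_{n1} & \ldots & a_{nn} \end{pmatrix} \begin{pmatrix} b_{11} & \ldots & b_{1n} \\ \ldots & \ldots & \ldots \\ b_{n1} & \ldots & b_{nn} \end{pmatrix} = \begin{pmatrix} c_{11} & \ldots & c_{1n} \\ \ldots & \ldots & \ldots \\ c_{n1} & \ldots & c_{nn} \end{pmatrix}; \tag{1}$$

$$c_{ij} = a_{i1} \cdot b_{1j} + \cdots + a_{ik} \cdot b_{kj} + \cdots + a_{in} \cdot b_{nj}. \tag{2}$$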
There exist many efficient algorithms for matrix multiplication. The problem is that for large matrix size n, there is no space for both A and B in the fast (cache) memory. As a result, the existing algorithms require lots of time-consuming data transfers (“cache misses”) between different parts of the memory. An efficient solution to this problem is to represent each matrix as a matrix of blocks; see, e.g., [2, 10]:

$$A = \begin{pmatrix} A_{11} & \ldots & A_{1m} \\ \ldots & \ldots & \ldots \\ A_{m1} & \ldots & A_{mm} \end{pmatrix}, \tag{3}$$

then

$$C_{\alpha\beta} = A_{\alpha 1} \cdot B_{1\beta} + \cdots + A_{\alpha\gamma} \cdot B_{\gamma\beta} + \cdots + A_{\alpha m} \cdot B_{m\beta}. \tag{4}$$
Comment. For general arguments about the need to use non-trivial representations of 2-D (and multi-dimensional) objects in the computer memory, see, e.g., [21, 22].

In the above idea,
• we start with a large matrix A of elements $a_{ij}$;
• we represent it as a matrix consisting of block sub-matrices $A_{\alpha\beta}$.
This idea has a natural tensor interpretation: each element of the original matrix is now represented as an (x, y)-th element of a block $A_{\alpha\beta}$, i.e., as an element of a rank-4 tensor $(A_{\alpha\beta})_{xy}$. So, in this case, an increase in tensor rank improves efficiency.

Comment. Examples when an increase in tensor rank is beneficial are well known in physics: e.g., a representation of a rank-1 vector as a rank-2 spinor works in relativistic quantum physics [6].
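The block formula (4) and the rank-4 view $(A_{\alpha\beta})_{xy}$ can be sketched in a few lines. This is a minimal illustration, assuming square matrices stored as nested Python lists and a block size s that divides n; the function names are hypothetical. Plain Python lists will not show the cache benefit described above, so the sketch only demonstrates the index structure and checks that the blocked product agrees with the direct formula (2).

```python
def matmul_naive(a, b):
    """Direct product: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def to_blocks(a, s):
    """Rank-4 view of an n x n matrix: blocks[alpha][beta][x][y]."""
    m = len(a) // s
    return [[[[a[al * s + x][be * s + y] for y in range(s)]
              for x in range(s)]
             for be in range(m)]
            for al in range(m)]


def from_blocks(blocks):
    """Inverse of to_blocks: flatten the rank-4 structure to n x n."""
    m, s = len(blocks), len(blocks[0][0])
    n = m * s
    return [[blocks[i // s][j // s][i % s][j % s] for j in range(n)]
            for i in range(n)]


def matmul_blocked(a, b, s):
    """Block product: C[al][be] = sum over ga of A[al][ga] * B[ga][be]."""
    n = len(a)
    if n % s != 0:
        raise ValueError("block size must divide the matrix size")
    m = n // s
    ab, bb = to_blocks(a, s), to_blocks(b, s)
    cb = [[[[0] * s for _ in range(s)] for _ in range(m)] for _ in range(m)]
    for al in range(m):
        for be in range(m):
            c_block = cb[al][be]
            for ga in range(m):
                a_block, b_block = ab[al][ga], bb[ga][be]
                # Small s x s product; in a real implementation this is
                # the part intended to stay in fast memory.
                for x in range(s):
                    for z in range(s):
                        a_xz = a_block[x][z]
                        for y in range(s):
                            c_block[x][y] += a_xz * b_block[z][y]
    return from_blocks(cb)


a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[(i + 1) * (j + 2) % 5 for j in range(4)] for i in range(4)]
c_blocked = matmul_blocked(a, b, 2)
```

The block size s is the tuning knob: in practice it would be chosen so that three s × s blocks fit in the cache.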
5 Quantum Computing as Computing with Tensors

Classical computation is based on the idea of a bit: a system with two states 0 and 1. In quantum physics, due to the superposition principle, we can have states

$$c_0 \cdot |0\rangle + c_1 \cdot |1\rangle \tag{5}$$
with complex values $c_0$ and $c_1$; such states are called quantum bits, or qubits, for short. The meaning of the coefficients $c_0$ and $c_1$ is that they describe the probabilities to measure 0 and 1 in the given state: $\text{Prob}(0) = |c_0|^2$ and $\text{Prob}(1) = |c_1|^2$. Because of this physical interpretation, the values $c_0$ and $c_1$ must satisfy the constraint $|c_0|^2 + |c_1|^2 = 1$. For an n-(qu)bit system, a general state has the form

$$c_{0 \ldots 00} \cdot |0 \ldots 00\rangle + c_{0 \ldots 01} \cdot |0 \ldots 01\rangle + \cdots + c_{1 \ldots 11} \cdot |1 \ldots 11\rangle. \tag{6}$$
From this description, one can see that each quantum state of an n-bit system is, in effect, a tensor $c_{i_1 \ldots i_n}$ of rank n. In these terms, the main advantage of quantum computing is that it can enable us to store the entire tensor in only n (qu)bits. This advantage explains the known efficiency of quantum computing. For example:
• we can search in an unsorted list of n elements in time $\sqrt{n}$, which is much faster than the time n which is needed on non-quantum computers [8, 9, 15];
• we can factor a large integer in time which does not exceed a polynomial of the length of this integer, and thus, we can break most existing cryptographic codes like the widely used RSA codes, which are based on the difficulty of such a factorization on non-quantum computers [15, 18, 19].
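To see why a state of form (6) is a rank-n tensor, it may help to write one down explicitly on an ordinary computer. The sketch below is an illustration under stated assumptions, not a quantum algorithm: it stores the amplitudes $c_{i_1 \ldots i_n}$ of a hypothetical equal superposition in a dictionary keyed by the index tuple $(i_1, \ldots, i_n)$, so the classical storage cost of $2^n$ numbers is visible directly. The function names are made up for this example.

```python
import math
from itertools import product


def uniform_state(n):
    """Equal superposition of all 2**n basis states.

    Returned as a dict mapping the index tuple (i_1, ..., i_n),
    each i_k in {0, 1}, to the complex amplitude c_{i_1...i_n}.
    """
    amplitude = complex(1.0 / math.sqrt(2 ** n), 0.0)
    return {bits: amplitude for bits in product((0, 1), repeat=n)}


def probabilities(state):
    """Measurement probabilities |c|**2 for each basis state."""
    return {bits: abs(c) ** 2 for bits, c in state.items()}


def marginal_first_qubit(state):
    """Probabilities of measuring 0 and 1 on the first qubit."""
    p = [0.0, 0.0]
    for bits, c in state.items():
        p[bits[0]] += abs(c) ** 2
    return p


state = uniform_state(3)           # 3 qubits, 2**3 = 8 tensor components
total = sum(probabilities(state).values())
first = marginal_first_qubit(state)
```

A classical simulation has to keep all $2^n$ components, whereas the physical system carries them in n qubits; this is the storage advantage the text refers to.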
6 New Idea: Tensors to Describe Constraints

A general constraint between n real-valued quantities is a subset $S \subseteq \mathbb{R}^n$. A natural idea is to represent this subset block-by-block, by enumerating the sub-blocks that contain elements of S. Each block $b_{i_1 \ldots i_n}$ can be described by n indices $i_1, \ldots, i_n$. Thus, we can describe a constraint by a boolean-valued tensor $t_{i_1 \ldots i_n}$ for which:
• $t_{i_1 \ldots i_n}$ = “true” if $b_{i_1 \ldots i_n} \cap S \ne \emptyset$; and
• $t_{i_1 \ldots i_n}$ = “false” if $b_{i_1 \ldots i_n} \cap S = \emptyset$.
Processing such constraint-related sets can also be naturally described in tensor terms. This representation speeds up computations; see, e.g., [3, 4].
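As a small illustration of such a boolean tensor, consider a hypothetical two-dimensional constraint: S is the closed unit disk $x^2 + y^2 \le 1$, and the blocks form a regular k × k grid over the square [−2, 2] × [−2, 2]. Both the constraint and the grid are assumptions made for this sketch. A box meets the disk exactly when its point nearest to the origin lies in the disk, which gives a simple intersection test.

```python
def box_intersects_disk(lo, hi, radius=1.0):
    """True if the box [lo[0], hi[0]] x [lo[1], hi[1]] meets the closed
    disk of the given radius centered at the origin."""
    # Nearest point of the box to the origin: clamp 0 into each interval.
    nearest_sq = sum(min(max(0.0, l), h) ** 2 for l, h in zip(lo, hi))
    return nearest_sq <= radius ** 2


def constraint_tensor(k, low=-2.0, high=2.0):
    """Boolean tensor t[i][j]: does block (i, j) of the grid meet S?"""
    step = (high - low) / k
    edges = [low + i * step for i in range(k + 1)]
    return [[box_intersects_disk((edges[i], edges[j]),
                                 (edges[i + 1], edges[j + 1]))
             for j in range(k)]
            for i in range(k)]


t = constraint_tensor(4)
true_count = sum(cell for row in t for cell in row)
```

For n quantities the same construction yields a rank-n tensor; only the intersection test depends on the particular constraint S.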
7 Computing with Tensors Can Also Help Physics So far, we have shown that tensors can help computing. It is possible that the relation between tensors and computing can also help physics.
As an example, let us consider Kaluza-Klein-type high-dimensional space-time models of modern physics; see, e.g., [7, 11–13, 16, 20]. Einstein’s original idea [5] was to use “tensors” with integer or circular values to describe these models. From the mathematical viewpoint, such “tensors” are unusual. However, in computer terms, integer or circular data types are very natural: e.g., circular data type means fixed point numbers in which the overflow bits are ignored. Actually, from the computer viewpoint, integers and circular data are even more efficient to process than standard real numbers.
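The remark about circular data types can be illustrated by arithmetic in which overflow bits are simply dropped. The sketch below assumes a hypothetical 8-bit word; the word size and function names are illustrative choices, not taken from the cited models.

```python
BITS = 8            # hypothetical word size, for illustration only
MASK = (1 << BITS) - 1


def circ_add(a, b):
    """Addition in which bits above the word size are ignored."""
    return (a + b) & MASK


def circ_mul(a, b):
    """Multiplication in which bits above the word size are ignored."""
    return (a * b) & MASK


wrapped = circ_add(250, 10)   # 260 wraps around the 256-value circle
```

Values live on a circle of $2^8$ points, and the usual algebraic laws (e.g., associativity of addition) still hold on that circle.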
8 Remaining Open Problem

One area where tensors naturally appear is an efficient Taylor series approach to uncertainty propagation; see, e.g., [1, 14, 17]. Specifically, the dependence of the result y on the inputs $x_1, \ldots, x_n$ is approximated by the Taylor series:

$$y = c_0 + \sum_{i=1}^{n} c_i \cdot x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \cdot x_i \cdot x_j + \cdots \tag{7}$$

The resulting tensors $c_{i_1 \ldots i_r}$ are symmetric:

$$c_{i_1 \ldots i_r} = c_{\pi(i_1) \ldots \pi(i_r)} \tag{8}$$

for each permutation π. As a result, the standard computer representation leads to an r!-fold duplication. An important problem is how to decrease this duplication.

Acknowledgements This work was supported in part by NSF grant HRD-0734825 and by Grant 1 T36 GM078000-01 from the National Institutes of Health. The authors are thankful to Fred G. Gustavson and Lenore Mullin for their encouragement.
References
1. Berz, M., Hoffstätter, G.: Computation and application of Taylor polynomials with interval remainder bounds. Reliab. Comput. 4(1), 83–97 (1998)
2. Bryant, R.E., O’Hallaron, D.R.: Computer Systems: A Programmer’s Perspective. Prentice Hall, Upper Saddle River, New Jersey (2003)
3. Ceberio, M., Ferson, S., Kreinovich, V., Chopra, S., Xiang, G., Murguia, A., Santillan, J.: How to take into account dependence between the inputs: from interval computations to constraint-related set computations, with potential applications to nuclear safety, bio- and geosciences. J. Uncertain Syst. 1(1), 11–34 (2007)
4. Ceberio, M., Kreinovich, V., Pownuk, A., Bede, B.: From interval computations to constraint-related set computations: towards faster estimation of statistics and ODEs under interval, p-Box, and fuzzy uncertainty. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.), Foundations of Fuzzy Logic and Soft Computing, Proceedings of the World Congress of the International Fuzzy Systems Association IFSA’2007, Cancun, Mexico, June 18–21, 2007, Springer Lecture Notes on Artificial Intelligence, vol. 4529, pp. 33–42 (2007)
5. Einstein, A., Bergmann, P.: On the generalization of Kaluza’s theory of electricity. Ann. Phys. 39, 683–701 (1938)
6. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston, Massachusetts (2005)
7. Green, M.B., Schwarz, J.H., Witten, E.: Superstring Theory, vols. 1, 2. Cambridge University Press (1988)
8. Grover, L.K.: A fast quantum mechanical algorithm for database search. In: Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, May 1996, pp. 212-ff
9. Grover, L.K.: From Schrödinger’s equation to quantum search algorithm. Am. J. Phys. 69(7), 769–777 (2001)
10. Gustavson, F.G.: The relevance of new data structure approaches for dense linear algebra in the new multi-core/many core environments. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski J. (eds.) Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics PPAM’2007, Gdansk, Poland, September 9–12, 2007, Springer Lecture Notes in Computer Science, vol. 4967, pp. 618–621 (2008)
11. Kaluza, Th.: Sitzungsberichte der K. Preussischen Akademie der Wissenschaften zu Berlin (1921), p. 966 (in German); Engl. translation “On the unification problem in physics” in [13], pp. 1–9
12. Klein, O.: Zeitschrift für Physik (1926), vol. 37, p. 895 (in German); Engl. translation “Quantum theory and five-dimensional relativity” in [13], pp. 10–23
13. Lee, H.C. (ed.): An Introduction to Kaluza-Klein Theories. World Scientific, Singapore (1984)
14. Neumaier, A.: Taylor forms. Reliab. Comput. 9, 43–79 (2002)
15. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2000)
16. Polchinski, J.: String Theory, vols. 1, 2. Cambridge University Press (1998)
17. Revol, N., Makino, K., Berz, M.: Taylor models and floating-point arithmetic: proof that arithmetic operations are validated in COSY. J. Log. Algebr. Program. 64(1), 135–154 (2005)
18. Shor, P.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. In: Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, NM, Nov. 20–22 (1994)
19. Shor, P.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Sci. Statist. Comput. 26, 1484-ff (1997)
20. Starks, S.A., Kosheleva, O., Kreinovich, V.: Kaluza-Klein 5D ideas made fully geometric. Int. J. Theor. Phys. 45(3), 589–601 (2006)
21. Tietze, H.: Famous Problems of Mathematics: Solved and Unsolved Mathematical Problems, from Antiquity to Modern Times. Graylock Press, New York (1965)
22. Zaniolo, C., Ceri, S., Faloutsos, C., Snodgrass, R.T., Subrahmanian, V.S., Zicari, R.: Advanced Database Systems. Morgan Kaufmann (1997)
A “Fuzzy” Like Button Can Decrease Echo Chamber Effect Olga Kosheleva and Vladik Kreinovich
Abstract One of the big problems of the US political life is the echo chamber effect—in spite of the abundance of materials on the web, many people only read materials confirming their own opinions. The resulting polarization often deadlocks the political situation and prevents politicians from reaching compromises needed to make needed changes. In this paper, we show, on a simplified model, that the echo chamber effect can be decreased if we simply replace the currently prevalent binary (yes-no) Like button on webpages with a more gradual (“fuzzy”) one—a button that will capture the relative degree of likeness.
1 Formulation of the Problem What is echo chamber effect. The Internet is full of different political opinions from all sides of the spectrum. One would expect that this would promote people’s understanding of different ideas and make them better understand each other. In reality, the effect is exactly the opposite: many people completely ignore different opinions and only communicate with those whose opinions are identical—and thus, persevere in their opinions. This phenomenon is known as echo chamber effect—as in the echo chamber, when it sounds like several people are yelling but in reality, it is the same voice repeated several times.
O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_9
Why? One of the reasons why people do not read opposing opinions is that the Internet is so vast that some filtering is needed. Usually, filtering is done by the binary Like button: a person marks materials that he/she likes with Like and materials that he/she dislikes with Dislike. The system learns fast and delivers to this person only materials similar to what this person liked in the past. Naturally, most people dislike opposing political opinions and like their own. As a result, opposing political opinions are completely filtered out and never reach this person. This is exactly what leads to the echo chamber.
What we do in this paper. It is desirable to decrease the echo chamber effect. In this paper, we show that achieving this goal does not require drastic changes: a natural modification that makes the Like button describe people’s opinions more adequately, i.e., turning it into a “fuzzy” button, will actually lead to the desired decrease.
Comment. The idea of making the Like button fuzzy, and the suggestion that this may lead to a decrease in the echo chamber effect, was first proposed by our colleague Ramon Ravelo [4]. In this paper, we provide a model detailing and justifying this suggestion.
2 Proposed Solution
Need to make information about likes more adequate. The current systems treat our likes as a binary quality: either we like something or we don’t. In reality, there are different degrees of liking and different degrees of disliking. To give the system more accurate information about our preferences, it is thus desirable to replace the binary Like button with a “fuzzy” button that will enable us to evaluate degrees of likeness, e.g., on a scale from −5 to 5 where:
• 5 would mean “like very much”,
• −5 would mean “dislike very much”,
• positive values smaller than 5 would mean like somewhat,
• negative values larger than −5 would mean dislike somewhat, and
• 0 would mean indifference.
The system would then supply, to the user:
• all the materials with predicted degree of likeness 5,
• some materials with predicted degree of likeness 4,
• a little fewer materials with degree of likeness 3, etc., and
• a little bit of materials with degree of likeness 1.
Comment. We are not requiring that a person gets anything that he/she dislikes or to which he/she is indifferent.
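The delivery rule described above can be sketched in a few lines. The text only fixes the endpoints (everything at degree 5, progressively less down to degree 1, nothing at degree 0 or below); the linear rule used here, and the function name `delivery_probability`, are illustrative assumptions rather than part of the proposal.

```python
def delivery_probability(predicted_degree: int, max_degree: int = 5) -> float:
    """Fraction of materials with a given predicted degree of likeness
    that is shown to the user.

    Assumption (not from the text): the fraction decreases linearly
    from 1 at the maximal degree to 1/max_degree at degree 1.
    """
    if not -max_degree <= predicted_degree <= max_degree:
        raise ValueError("degree must lie between -max_degree and max_degree")
    if predicted_degree <= 0:
        # Disliked or indifferent materials are never delivered.
        return 0.0
    return predicted_degree / max_degree


# Fractions delivered for each degree on the scale from -5 to 5.
schedule = {d: delivery_probability(d) for d in range(-5, 6)}
```

Any other rule that equals 1 at degree 5, decreases with the degree, and vanishes for non-positive degrees would fit the description equally well.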
How this can help decrease the echo chamber effect. Let us suppose that, similar to how it is now, most people occupy two opposite levels of the political spectrum. For simplicity, let us mark these levels by −1 and 1. There are also some people in between, but they are the absolute minority. People do hear about the opinions of such in-between people, as long as these folks are on average on the same side of the debate:
• If a Republican Senator criticizes a Republican President, Democrat-oriented press usually covers this, although on many issues, this particular senator still has many opinions which are opposite to theirs.
• Similarly, when a Democratic Senator criticizes the Democratic leadership, the Republican-oriented media likes to mention it.
But, and this is a big but, the media only briefly mention these opinions; they do not provide any detailed and coherent explanation of them. In most cases, the mention consists of something not very informative like this: “Even some Republicans, such as Senator X, disagree with the President’s policy on Y.” Such mentions, by themselves, do not decrease the echo chamber effect; we describe them just to emphasize that such opinions are bound to get mild likes.
Let us also suppose that when we seriously read someone’s opinion, we tend to move somewhat in that person’s direction. This is inevitable: serious political opinions do have reasons behind them, otherwise they would not be supported by big groups of people. Serious politicians understand that there are two sides to each story; it is very rarely, if ever, angels with wings against devils with horns and hoofs.
At the risk of being crucified by extreme supporters of one of the viewpoints, let us give three examples:
• Raising the minimum wage can help poor people live better, but it will also force some small businesses (especially those that are currently struggling to survive) not to hire people and maybe even to fire those whom they hired in the past, and thus, to increase unemployment. In our opinion, the correct solution is to take both effects into account and find a compromise which works best for the society.
• If basic medical help becomes free for a larger group of US folks than now, this will help improve people’s health, but this would require either raising taxes or cutting some current federal expenses. Again, the correct solution is to take into account both effects when making the corresponding decision.
• Requiring that phone and internet companies provide backdoor access to encrypted data at the government’s request will help fight crime, but will also decrease our privacy, etc.
In the usual situation, for a person at level 1, his/her likes are opinions of people at the same level. So this person listens to similar opinions all the time and thus, remains at the same level. Similarly, persons at level −1 remain at this level after all the “liked” materials that they read.
What happens when we allow mild likes? Let us consider the simplest model in which:
• almost a half of the people (the proportion 0.5 · (1 − ε) for some small ε > 0) are at level 1,
• the same proportion of people are at level −1,
• the proportion ε/2 is at some level δ > 0, for some small number δ, and
• the same proportion is at level −δ.
Let us also assume that on average, all the people post the same number of postings, and that, due to a minor degree of liking,
• the proportion of postings of δ-folks reaching people on level 1 is p > 0, and similarly,
• the proportion of postings of (−δ)-folks reaching people on level −1 is also p > 0.
Finally, let us assume that:
• a message from a level differing from the reader’s current level by the difference d will move the reader’s level by the value α · d in the message’s direction, for some α > 0, and
• the actual resulting level shift can be obtained by averaging level shifts corresponding to all the messages.
Comment. This is a simplified version of a model presented in [3]. Readers interested in more detailed models of how people make political decisions can look at [1, 2] and references therein.
In the initial moment of time, most people were at levels x0 = 1 or x0 = −1. Let us analyze what will happen at the next moment of time. Since the situation is completely symmetric, let us consider only what happens to people who were initially at level 1. At the next moment of time, each person on level x0 will get:
• 1/2 − ε/2 messages from the same level x0, and
• p · (ε/2) messages from level δ.
Messages from the same level x0 do not change people’s opinions, but each message from level δ will shift their opinion toward δ by the value α · (x0 − δ). So:
• 1/2 − ε/2 messages do not change the level, while
• p · (ε/2) messages shift the opinion by the value α · (x0 − δ).
Thus, the average shift is equal to
$$\frac{(1/2 - \varepsilon/2) \cdot 0 + p \cdot (\varepsilon/2) \cdot \alpha \cdot (x_0 - \delta)}{(1/2 - \varepsilon/2) + p \cdot (\varepsilon/2)},$$
i.e., to k · (x0 − δ), where we denoted
$$k \stackrel{\rm def}{=} \frac{\alpha \cdot p \cdot (\varepsilon/2)}{(1/2 - \varepsilon/2) + p \cdot (\varepsilon/2)}.$$
Thus, at the next moment of time, these folks will be at level x1 = x0 − k · (x0 − δ), for which, therefore, x1 − δ = (1 − k) · (x0 − δ). Due to symmetry, the folks who were at level −x0 will now move to the level −x1. Similarly, at the next moment of time, we will have x2 − δ = (1 − k)^2 · (x0 − δ), and, in general, after t moments of time, we will have xt − δ = (1 − k)^t · (x0 − δ). For small ε, we have 0 < k < 1, so when t → ∞, we will have xt − δ → 0, i.e., xt → δ.
So, in this simplified model, eventually, everyone will take centrist opinions while retaining their main emphasis of being on one side or another. Of course, in reality, things are more complicated, but this makes us hope that such a simple idea as having a “fuzzy” like button will indeed help decrease the echo chamber effect.
Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence). The authors are greatly thankful to Ramon Ravelo, who proposed the main idea and provided us with a lot of advice. Truly, he is practically a co-author of this paper.
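The recurrence above is easy to check numerically. The following is a minimal sketch, assuming that the coefficient k includes the factor α from the per-message shift and that the moderate level is δ; the function names and all parameter values are hypothetical and chosen only for illustration.

```python
def shift_coefficient(alpha: float, p: float, eps: float) -> float:
    """Coefficient k of the averaged shift k * (x - delta)."""
    same_level = 0.5 - eps / 2   # messages from the reader's own level
    mild_likes = p * eps / 2     # messages from the moderate level delta
    return alpha * mild_likes / (same_level + mild_likes)


def simulate(x0: float, delta: float, alpha: float, p: float,
             eps: float, steps: int) -> list[float]:
    """Iterate x_{t+1} = x_t - k * (x_t - delta), starting from x0."""
    k = shift_coefficient(alpha, p, eps)
    levels = [x0]
    for _ in range(steps):
        x = levels[-1]
        levels.append(x - k * (x - delta))
    return levels


# Illustrative values only: 10% moderates, 20% of their posts get through.
levels = simulate(x0=1.0, delta=0.1, alpha=0.5, p=0.2, eps=0.1, steps=2000)
```

With these values k is about 0.01, so the level moves slowly but steadily from 1 toward δ = 0.1, in agreement with the closed-form expression (1 − k)^t · (x0 − δ) for the remaining distance.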
References
1. Lau, R.R.: Models of decision making. In: Sears, D.O., Huddy, L., Jervis, R. (eds.) Oxford Handbook of Political Psychology, pp. 19–59. Oxford University Press, New York (2003)
2. Lau, R.R., Kleinberg, M.S., Ditonto, T.M.: Measuring voter decision strategies in political behavior and public opinion research. Public Opinion Q. 82(Special 2018 Issue), 911–936 (2018)
3. McCullen, N.J., Ivanchenko, M.V., Shalfeev, V.D., Gale, W.F.: A dynamical model of decision-making behavior in a network of consumers with application to energy choices. Int. J. Bifurcat. Chaos 21, 2467–2480 (2011)
4. Ravelo, R.: Private Communication (2019)
Intuitive Idea of Implication Versus Formal Definition: How to Define the Corresponding Degree Olga Kosheleva and Vladik Kreinovich
Abstract Formal implication does not capture the intuitive idea of “if A then B”, since in formal implication, every two true statements—even completely unrelated ones—imply each other. A more adequate description of intuitive implication happens if we consider how much the use of A can shorten a derivation of B. At first glance, it may seem that the number of bits by which we shorten this derivation is a reasonable degree of implication, but we show that this number is not in good accordance with our intuition, and that a natural formalization of this intuition leads to the need to use, as the desired degree, the ratio between the shortened derivation length and the original length.
1 Formulation of the Problem
Intuitive idea of implication is different from a formal definition. A traditional formal definition of implication is that A implies B (denoted A → B) if either B is true or A is false. This formal definition is very different from the intuitive idea of what we mean when we say that A implies B: e.g., it allows formally true but intuitively meaningless implications such as “if the Moon is made out of cheese, then ghosts like to scare people.” Also, in the sense of the formal definition, all true statements imply each other, e.g., the Pythagorean theorem implies the four-color theorem, which also intuitively makes no sense.
O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_10
How can we formalize the intuitive idea of implication. A better formalization of the intuitive idea of implication is that if we assume A, then we can get a shorter derivation of B, e.g., a derivation that takes fewer bits when stored in the computer memory. To describe this idea in precise terms, let us fix a formal theory T, and let us denote:
• by L(B), the shortest length of a derivation of B in this theory, and
• by L(B | A), the shortest length of a derivation of B from the theory T ∪ {A} (to which A is added as an additional axiom).
In these terms, we can say that A implies B if L(B | A) < L(B).
Comments.
• For every true statement B and for every statement A, we can algorithmically compute the values L(B) and L(B | A) by simply trying all possible derivations of a given length. Thus, we can algorithmically check whether A intuitively implies B. Of course, this computation may take a long time, since trying all possible combinations of n steps requires time which is exponential in n and thus, not feasible for large n.
• In contrast to the formal implication, this intuitive implication has many unusual properties. For example, it is not transitive. Indeed, if true statements p and q are completely independent, then the use of p does not help us derive q: L(q | p) = L(q), hence p does not imply q in this sense. On the other hand:
– if we use p, then deriving p & q is easier than without p: we only need to derive q, so L(p & q | p) = L(q) and L(p & q) = L(p) + L(q) > L(q); thus, we can intuitively derive p & q from p;
– the derivation of q from p & q is even easier: it takes just one step, so L(q | p & q) = 1 ≪ L(q); thus, we can intuitively derive q from p & q.
So, in this example, p intuitively implies p & q, the statement p & q intuitively implies q, but p does not intuitively imply q.
Need to consider degrees of implication.
Intuitively, there is a big difference between a situation in which the use of A decreases the length of the derivation of B by a small amount (e.g., by 1 step), and a situation in which the use of A drastically decreases the length of such a derivation. In the second situation, A strongly implies B, while in the first situation, the degree to which A implies B is very small. It is therefore desirable to come up with a way to measure the degree to which A implies B. This is the problem that we consider in this paper.
Comment. This conclusion is in line with the general idea of Lotfi Zadeh and fuzzy logic: that everything is a matter of degree; see, e.g., [1, 2, 4–8].
2 How to Define the Degree of Implication: A Seemingly Reasonable Idea and Its Limitations
A seemingly reasonable idea. Intuitively, the more the use of A decreases the length of the derivation of B, the larger the degree to which A implies B. It therefore seems reasonable to define this degree as the difference L(B) − L(B | A) between the corresponding lengths [3]:
• when this difference is large, A strongly implies B, while
• when this difference is small, the degree to which A implies B is very small.
Limitations of this idea. The problem with the above seemingly natural idea is that it does not fully capture the intuitive notion of implication. Indeed, let us assume that we have two independent derivations with the same degree of implication:
• with this degree, A1 implies B1, and
• with the same degree, A2 implies B2.
Then, when we simply combine the corresponding statements and the corresponding proofs, we expect to conclude that the conjunction A1 & A2 implies the conjunction B1 & B2 with the same degree of implication. What will happen if we compute this degree by using the above idea?
• Since the derivations are independent, the only way to prove B1 & B2 is to prove both B1 and B2, so L(B1 & B2) = L(B1) + L(B2).
• Similarly, the only way to prove B1 & B2 from A1 & A2 is to prove B1 from A1 and B2 from A2, so L(B1 & B2 | A1 & A2) = L(B1 | A1) + L(B2 | A2).
Thus, for the difference, we have L(B1 & B2) − L(B1 & B2 | A1 & A2) = L(B1) + L(B2) − L(B1 | A1) − L(B2 | A2) = (L(B1) − L(B1 | A1)) + (L(B2) − L(B2 | A2)). Thus, if both derivations had the same degree of implication d, i.e., if we had L(B1) − L(B1 | A1) = L(B2) − L(B2 | A2) = d, then for the degree L(B1 & B2) − L(B1 & B2 | A1 & A2), we get not the same degree d, but the much larger value 2d. So, if we combine two independent derivations with low degrees of implication, then magically the degree of implication increases, which does not make sense.
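The doubling effect can be seen on a small numerical example. In the sketch below, the derivation lengths (10 and 8) are made-up numbers, and `difference_degree` and `combine` are hypothetical helper names; the only ingredient taken from the text is that lengths of independent derivations add up.

```python
def difference_degree(length_b: int, length_b_given_a: int) -> int:
    """Difference-based degree L(B) - L(B | A)."""
    return length_b - length_b_given_a


def combine(pairs: list[tuple[int, int]]) -> tuple[int, int]:
    """Lengths for the conjunction of independent pairs (A_i, B_i):
    both L(B) and L(B | A) simply add up."""
    total_b = sum(length_b for length_b, _ in pairs)
    total_b_given_a = sum(cond for _, cond in pairs)
    return total_b, total_b_given_a


# Two independent pairs, each with L(B_i) = 10 and L(B_i | A_i) = 8.
single = (10, 8)
d_single = difference_degree(*single)                        # d
d_combined = difference_degree(*combine([single, single]))   # 2d
```

Each pair alone has difference 2, while the combined pair has difference 4, although intuitively nothing about the strength of the implication has changed.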
Challenge and what we do in this paper. Since the seemingly natural definition contradicts our intuition, it is desirable to come up with another definition, a definition that will be consistent with our intuition. This is what we will do in this paper.
3 Towards a New Definition of Degree of Implication
Towards a formal definition. Let us fix a certain degree of implication. This degree occurs for some examples, when the length of the original derivation L(B) is b and the length of the conditional derivation L(B | A) is a. The smaller a, the larger the degree of implication. Thus, not for every length b can we have the given degree of implication: e.g.,
• for b = 2, we only have two options a = 0 and a = 1, i.e., only two possible degrees of implication; while
• for b = 10, we have ten possible values a = 0, . . . , 9 corresponding to ten possible degrees of implication.
For each possible degree d, let f_d(b) describe the length of the conditional derivation corresponding to the length b of the original derivation and this particular degree. As we have mentioned, this function is not necessarily defined for all b.
Let us assume that the function f_d(b) is defined for some value b. This means that there exists an example with this particular degree of implication in which the original derivation has length b and the conditional derivation has length a = f_d(b). Then, as we have mentioned in the previous section, if we have k different independent pairs of statements (A_i, B_i) of this type, for which L(B_1) = . . . = L(B_k) = b and L(B_1 | A_1) = . . . = L(B_k | A_k) = a, then, for the derivation of B_1 & . . . & B_k from A_1 & . . . & A_k, which has the same degree of implication, we have
L(B_1 & . . . & B_k) = L(B_1) + . . . + L(B_k) = k · b
and
L(B_1 & . . . & B_k | A_1 & . . . & A_k) = L(B_1 | A_1) + . . . + L(B_k | A_k) = k · a.
Thus, if f_d(b) = a, then for each integer k > 1, we have f_d(k · b) = k · a = k · f_d(b). So, we arrive at the following definition.
Definition 1 By a function corresponding to a given degree of implication, we mean a partial function f_d(b) from natural numbers to natural numbers that satisfies the following property:
• if this function is defined for some b,
• then for each natural number k > 1, it is also defined for k · b, and f_d(k · b) = k · f_d(b).
Proposition 1 Let f_d be a function corresponding to a given degree of implication. Then, for every two values b and b′ for which f_d is defined, we have
$$\frac{f_d(b)}{b} = \frac{f_d(b')}{b'}.$$
Proof Since the function f_d is defined for b, then, by taking k = b′, we conclude that this function is also defined for b · b′, and f_d(b · b′) = b′ · f_d(b). Similarly, from the fact that this function is defined for b′, by taking k = b, we get f_d(b · b′) = b · f_d(b′). Thus, we conclude that b′ · f_d(b) = b · f_d(b′). Dividing both sides by b · b′, we get the desired equality.
Conclusion. Different degrees of implication correspond to different values of the ratio
$$\frac{f_d(b)}{b} = \frac{L(B \mid A)}{L(B)}.$$
Thus, we can use this ratio as the desired degree to which A implies B.
Comment. It may be better to use a slightly different measure
$$1 - \frac{L(B \mid A)}{L(B)} = \frac{L(B) - L(B \mid A)}{L(B)},$$
since this measure is 0 if the use of A does not help at all and 1 if it decreases the length of the proof maximally.
Acknowledgements This work was supported in part by the National Science Foundation via grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
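As a quick illustration of the ratio-based measure from the Comment, here is a minimal sketch using exact rational arithmetic. The function name `implication_degree` and the sample lengths are hypothetical; the formula itself is the normalized measure 1 − L(B | A)/L(B).

```python
from fractions import Fraction


def implication_degree(length_b: int, length_b_given_a: int) -> Fraction:
    """Normalized degree 1 - L(B | A) / L(B), computed exactly."""
    if length_b <= 0:
        raise ValueError("L(B) must be positive")
    if not 0 <= length_b_given_a <= length_b:
        raise ValueError("L(B | A) must lie between 0 and L(B)")
    return 1 - Fraction(length_b_given_a, length_b)


# One pair with L(B) = 10, L(B | A) = 8, and three independent copies of it
# combined into a single conjunction (lengths add up: 30 and 24).
single = implication_degree(10, 8)
tripled = implication_degree(3 * 10, 3 * 8)
```

Unlike the difference L(B) − L(B | A), this measure gives the same value for the single pair and for the conjunction of independent copies, which is the invariance that the definition was designed to capture.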
References
1. Belohlavek, R., Dauben, J.W., Klir, G.J.: Fuzzy Logic and Mathematics: A Historical Perspective. Oxford University Press, New York (2017)
2. Klir, G., Yuan, B.: Fuzzy Sets and Fuzzy Logic. Prentice Hall, Upper Saddle River, New Jersey (1995)
3. Kreinovich, V., Longpré, L., Nguyen, H.T.: Towards formalization of feasibility, randomness, and commonsense implication: Kolmogorov complexity, and the necessity of considering (fuzzy) degrees. In: Phuong, N.H., Ohsato, A. (eds.), Proceedings of the Vietnam-Japan Bilateral Symposium on Fuzzy Systems and Applications VJFUZZY’98, HaLong Bay, Vietnam, 30th September–2nd October (1998), pp. 294–302
4. Mendel, J.M.: Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions. Springer, Cham, Switzerland (2017)
5. Nguyen, H.T., Kreinovich, V.: Nested intervals and sets: concepts, relations to fuzzy sets, and applications. In: Kearfott, R.B., Kreinovich, V. (eds.) Applications of Interval Computations, pp. 245–290. Kluwer, Dordrecht (1996)
6. Nguyen, H.T., Walker, C., Walker, E.A.: A First Course in Fuzzy Logic. Chapman and Hall/CRC, Boca Raton, Florida (2019)
7. Novák, V., Perfilieva, I., Močkoř, J.: Mathematical Principles of Fuzzy Logic. Kluwer, Boston, Dordrecht (1999)
8. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
Dimension Compactification—A Possible Explanation for Superclusters and for Empirical Evidence Usually Interpreted as Dark Matter Vladik Kreinovich
Abstract According to modern quantum physics, at the microlevel, the dimension of space-time is ≥11; we only observe 4 dimensions because the others are compactified: the size along each of the other dimensions is much smaller than the macroscale. There is no universally accepted explanation of why exactly 4 dimensions remain at the macroscopic level. A natural suggestion is: maybe there is no fundamental reason why exactly 4 dimensions should remain, maybe when we go to even larger scales, some of the 4 dimensions will be compactified as well? In this paper, we explore the consequences of the compactification suggestion, and we show that, on the qualitative level, these consequences have actually been already observed: as superclusters and as evidence that is usually interpreted as justifying dark matter. Thus, we get a new possible explanation of both superclusters and dark matter evidence, via dimension compactification.
1 Main Idea
According to modern quantum physics, at the microlevel, space-time has at least 11 dimensions; we only observe 4 dimensions in our macroobservations because the rest are compactified: the size along each of the remaining directions is so small that on macrolevel, these dimensions can be safely ignored; see, e.g., [3, 8]. There is no universally accepted explanation of why exactly 4 dimensions remain at the macroscopic level. A natural suggestion is: maybe there is no fundamental reason why exactly 4 dimensions should remain. Maybe when we go to even larger scales, some of the 4 dimensions will be compactified as well?
Could one rigorously argue for “compactification” when these effects must occur over the accessible, kiloparsec and larger, scales? In modern physics, indeed, compactification is related to quantum-size distances, but there is nothing inherently quantum in the compactification idea. Indeed, the very first paper that proposed compactification, the 1938 paper by Einstein and Bergmann [1], described it as simply the extra dimension being a circle, so that the entire space-time looks like a thin cylinder whose width is negligible if we operate at large enough scales.
In this paper, we explore the consequences of the compactification suggestion, and we show that, on the qualitative level, these consequences have actually been already observed: as superclusters and as evidence that is usually interpreted as justifying dark matter. Thus, we get a new possible explanation of both superclusters and dark matter evidence, via dimension compactification.
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_11
2 Geometric Consequences of the Main Idea If our idea is correct, then, as we increase the scale, we will observe how a 3D picture is replaced by a 2D and then by a 1D one. This is exactly what we observe: while at a macrolevel, we see a uniform distribution of galaxies in a 3D space, at a larger-scale level, galaxies start forming superclusters—long and thin strands of clusters and galaxies; see, e.g., [2]. Superclusters are either close to a 2D shape (as the “Great Wall” discovered in the 1980s) or close to 1D.
3 Towards Physical Consequences of the Main Idea
How does the change in dimension affect physics? In non-relativistic approximation, the gravitational potential ϕ is related to the mass density ρ by the Poisson equation Δϕ = ρ (in appropriate units). In the 3D space, this leads to Newton’s potential ∼1/r, with the force
$$F(r) = \frac{G_0 \cdot m_1 \cdot m_2}{r^2};$$
in a 2D space, we get potential ∼ log(r), with the force
$$F(r) = \frac{G_1 \cdot m_1 \cdot m_2}{r}$$
for some constant G_1. For intermediate scales, it is reasonable to consider a combination of these two forces:
$$F(r) = \frac{G_0 \cdot m_1 \cdot m_2}{r^2} + \frac{G_1 \cdot m_1 \cdot m_2}{r}. \qquad (1)$$
In the following text, we explain, on the qualitative level, why such a combination naturally follows from the above compactification idea.
Let us consider the simplest possible compactification (along the lines of the original Einstein-Bergmann paper), where the space-time is wrapped as a cylinder of circumference R along one of the coordinates. What will the gravitational potential look like in this simple model? The easiest way to solve the corresponding Newton’s equation in this cylinder is to “unwrap” the cylinder into a full space. After this “unwrapping”, each particle in a cylindrical space (in particular, each source of gravitation) is represented as infinitely many different bodies at distances R, 2R, etc., from each other. For further simplicity, let us consider the potential force between the two bodies on the wrapping line at a distance r from each other. The effect of the second body on the first one in cylindrical space is equivalent to the joint effect of multiple copies of the second body in the unwrapped space:
[Figure: copies of the second body along the wrapping line; the nearest copy is at distance r from the first body, and consecutive copies are a distance R apart.]
The resulting gravitational potential of a unit mass can be described as a sum of potentials corresponding to all these copies, i.e.,
$$\varphi(r) = \frac{1}{r} + \frac{1}{r + R} + \frac{1}{r + 2R} + \cdots + \frac{1}{r + k \cdot R} + \cdots \qquad (2)$$
From the purely mathematical viewpoint, this sum is infinite. From the physical viewpoint, however, the actual potential is not infinite: due to relativistic effects, at the current moment of time t0, the influence of a source at a distance d = r + k · R is determined by this source’s location at a time t0 − d/c (where c is the speed of light). Thus, we only need to add the terms for which d/c is smaller than the age of the Universe. As a result, we can ignore the slowly increasing infiniteness of the sum when k → ∞.
How can we estimate this potential? The formula (2) has the form f(0) + f(1) + · · · + f(k) + · · · , where we denoted
$$f(x) \stackrel{\rm def}{=} \frac{1}{r + x \cdot R}.$$
It is difficult to get an analytical expression for the exact sum, but we can use the fact that this sum is an integral sum for the integral $\int_0^\infty f(x)\,dx$; this integral has an analytical expression: it is const − (1/R) · ln(r). In this approximation
$$\int_0^\infty f(x)\,dx \approx f(0) + f(1) + \cdots + f(k) + \cdots,$$
we used the fact that
$$\int_0^\infty f(x)\,dx = \int_0^1 f(x)\,dx + \int_1^2 f(x)\,dx + \cdots + \int_k^{k+1} f(x)\,dx + \cdots, \qquad (3)$$
and we approximated each term $\int_k^{k+1} f(x)\,dx$ by f(k). This approximation is equivalent to approximating the function f(x) on the interval [k, k + 1] by its value f(k) at the left endpoint of this interval, i.e., by the first term in the Taylor expansion of the function f(x) around the point k. A natural next approximation is when instead of only taking the first term, we consider the first two terms in this Taylor expansion, i.e., when on each interval [k, k + 1], we approximate the function f(x) by the formula f(k) + f′(k) · (x − k). Under this approximation,
$$\int_k^{k+1} f(x)\,dx \approx f(k) + \frac{1}{2} \cdot f'(k),$$
and therefore,
$$\int_0^\infty f(x)\,dx = \int_0^1 f(x)\,dx + \int_1^2 f(x)\,dx + \cdots + \int_k^{k+1} f(x)\,dx + \cdots \approx$$
$$(f(0) + f(1) + \cdots + f(k) + \cdots) + \frac{1}{2} \cdot (f'(0) + f'(1) + \cdots + f'(k) + \cdots). \qquad (4)$$
The second term in the right-hand side can be (similarly to the formula (3)) approximated by
$$f'(0) + f'(1) + \cdots + f'(k) + \cdots \approx \int_0^\infty f'(x)\,dx = -f(0),$$
since f(x) → 0 when x → ∞. Thus, from the formula (4), we can conclude that
$$f(0) + f(1) + \cdots + f(k) + \cdots \approx \int_0^\infty f(x)\,dx + \frac{1}{2} \cdot f(0).$$
In particular, for our function f(x), we get
$$\varphi(r) = f(0) + f(1) + \cdots + f(k) + \cdots \approx {\rm const} - \frac{1}{R} \cdot \ln(r) + \frac{1}{2} \cdot \frac{1}{r}.$$
Differentiating relative to r, we get the desired formula (1) for the gravitational force:
$$F(r) = -\frac{d\varphi(r)}{dr} = \frac{1}{R \cdot r} + \frac{1}{2r^2},$$
with R = 2G_0/G_1. According to estimates from [4], we expect R to be between ≈10 and ≈30 kpc.
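The quality of this approximation can be checked numerically. Differentiating the sum (2) term by term gives the force Σ 1/(r + k · R)², which, unlike the potential, converges, so it can be compared directly with 1/(R · r) + 1/(2r²). The sketch below does this for a unit mass in units where the gravitational constant is 1; the function names, the truncation length, and the sample values of r and R are assumptions made for illustration only.

```python
def summed_force(r: float, R: float, terms: int = 200_000) -> float:
    """Force from the copies at distances r, r + R, r + 2R, ...
    (term-by-term derivative of the potential sum, truncated)."""
    return sum(1.0 / (r + k * R) ** 2 for k in range(terms))


def approximate_force(r: float, R: float) -> float:
    """Two-term approximation 1/(R*r) + 1/(2*r^2) derived in the text."""
    return 1.0 / (R * r) + 1.0 / (2.0 * r * r)


# Sample point: separation five times the compactification circumference.
r, R = 5.0, 1.0
exact = summed_force(r, R)
approx = approximate_force(r, R)
relative_error = abs(exact - approx) / exact
```

For separations r that are not small compared to R, the two values agree closely. For r much smaller than R, the single nearest copy dominates the sum, and the two-term formula should be expected to be less accurate.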
4 Physical Consequences of the Main Idea
The force described by formula (1) is exactly the force that, according to Kuhn, Milgrom et al. [4–6, 9], is empirically needed to describe the observations if we want to avoid dark matter. Indeed, in Newtonian mechanics, for any large-scale rotating gravitational system, if we know the rotation speed v at a distance r from the center, we can find the mass M_g(r) inside the sphere of radius r by equating the acceleration v²/r with the acceleration G_0 · M_g(r)/r² provided by Newton’s law. As a result, we get M_g(r) = r · v²/G_0. Alternatively, we can also count masses of different observed bodies and get M_L(r), the total mass of luminescent bodies. It turns out that M_L(r) ≪ M_g(r); hence the traditional explanation that in addition to luminescent bodies, there is also “dark” (non-luminescent) matter.
An alternative explanation is not to introduce any new unknown type of matter, i.e., to assume that M_g(r) ≈ M_L(r), but rather to change the expression for the force, or, equivalently, to assume that the gravitational constant G_0 is not a constant but may depend on r: G_0 = G(r). Equating the acceleration v²/r with the acceleration G(r) · M_L(r)/r² provided by the new gravity law, we can determine G(r) as G(r) = v² · r/M_L(r). Observation data show that G(r) = G_0 + G_1 · r for some constant G_1, i.e., that the dependence of the gravity force on distance is described by the formula
$$F(r) = \frac{G(r) \cdot m_1 \cdot m_2}{r^2} = \frac{G_0 \cdot m_1 \cdot m_2}{r^2} + \frac{G_1 \cdot m_1 \cdot m_2}{r},$$
which is exactly what we deduced from our dimension compactification idea.
This idea has been proposed 20 years ago, and one of the reasons why it has not been universally accepted is that it was difficult to get a natural field theory explanation of this empirical law. We have just shown that an explanation of this empirical law comes naturally if we consider the possibility of dimension compactification.
This explanation is in line with Milgrom’s own explanation; see, e.g., [7].

As we go further up in scale, one more dimension starts compactifying, so we start getting a 1D space in which the Laplace equation leads to the potential $\varphi(r) \sim r$. Thus, at a large-scale level, we should have a term proportional to $r$ added to the normal gravity potential formulas. This additional term is exactly what is added when we take a cosmological constant $\Lambda$ into consideration.
5 Observable Predictions of Our New Idea

A possible observable consequence of the additional term $\varphi(r) \sim r$ is that it leads to an additional constant term in the gravitational force and, therefore, to a formula $G(r) = G_0 + G_1 \cdot r + G_2 \cdot r^2$. Thus, if the empirical dependence of $G(r)$ on $r$ turns out to be not exactly linear but rather slightly quadratic, it will be a strong argument in favor of our compactification idea.
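The relations used in Sections 4 and 5 can be illustrated with a minimal Python sketch. The function names `effective_G` and `modified_force`, as well as all numerical values, are hypothetical and chosen only for illustration (arbitrary consistent units); they are not taken from the observations discussed in the text.

```python
def effective_G(v, r, M_L):
    """Effective gravitational 'constant' G(r) = v^2 * r / M_L(r),
    obtained by equating v^2 / r with G(r) * M_L(r) / r^2."""
    return v * v * r / M_L


def modified_force(r, m1, m2, G0, G1, G2=0.0):
    """Force F(r) = G(r) * m1 * m2 / r^2 with G(r) = G0 + G1*r + G2*r^2."""
    G = G0 + G1 * r + G2 * r * r
    return G * m1 * m2 / (r * r)


# Hypothetical coefficients, for illustration only.
G0, G1, G2 = 1.0, 0.1, 0.01
m1 = m2 = 1.0
for r in (1.0, 10.0, 100.0):
    newton = G0 * m1 * m2 / r**2       # usual 1/r^2 term
    extra_1_over_r = G1 * m1 * m2 / r  # 1/r term coming from G1
    extra_const = G2 * m1 * m2         # distance-independent term coming from G2
    total = modified_force(r, m1, m2, G0, G1, G2)
    print(r, newton, extra_1_over_r, extra_const, total)
```

In this sketch, the $G_2$ term contributes the same amount to the force at every distance, which is the "additional constant term" mentioned above.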
6 Natural Open Questions

In this paper, we simply formulate the idea and explain why we believe this idea to be promising. Many related questions are still open:
• how to concoct a geometry that accomplishes “compactification”: what would a metric look like that transitions from large to small scales?
• could we solve for the force law in such a geometry with some rigor?
• will such a profound geometrical perturbation as the one we propose have other consequences beyond a force law change? Perhaps not, but it is worth investigating.

Acknowledgements This work was supported in part by NASA under cooperative agreement NCC5-209, by the Future Aerospace Science and Technology Program (FAST) Center for Structural Integrity of Aerospace Systems, effort sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-00-1-0365, and by NSF grants EAR-0112968 and EAR-0225670. The author is very thankful to Jeffrey R. Kuhn for his ideas and for his valuable advice.
References
1. Einstein, A., Bergmann, P.: On the generalization of Kaluza’s theory of electricity. Ann. Phys. 39, 683–701 (1938)
2. Fairall, A.: Large-Scale Structures in the Universe. Wiley, New York (1998)
3. Green, M.B., Schwarz, J.H., Witten, E.: Superstring Theory, vol. 1, 2. Cambridge University Press, Cambridge (1988)
4. Kuhn, J.R., Krugliak, L.: Non-Newtonian forces and the invisible mass problem. Astrophys. J. 313, 1–12 (1987)
5. Milgrom, M.: A modification of the Newton dynamics as a possible alternative to the hidden mass hypothesis. Astrophys. J. 270, 365–370 (1983)
6. Milgrom, M.: Does dark matter really exist? Sci. Am. 42–52 (2002)
7. Milgrom, M.: MOND - theoretical aspects. New Astron. Rev. 46, 741–753 (2002)
8. Polchinski, J.: String Theory, vol. 1, 2. Cambridge University Press, Cambridge (1998)
9. Sanders, R.H., McGaugh, S.S.: Modified Newtonian dynamics as an alternative to dark matter. Annu. Rev. Astron. Astrophys. 40, 263–317 (2002)
Fundamental Properties of Pair-Wise Interactions Naturally Lead to Quarks and Quark Confinement: A Theorem Motivated by Neural Universal Approximation Results

Vladik Kreinovich

Abstract In traditional mechanics, most interactions are pair-wise; if we omit one of the particles from our description, then the original pair-wise interaction can sometimes only be represented as interaction between triples, etc. It turns out that, vice versa, every possible interaction between N particles can be represented as pair-wise interaction if we represent each of the original N particles as a triple of new ones (and two new ones are not enough for this representation). The resulting three “particles” actually represent a single directly observable particle and, in this sense, cannot be separated. So, this representation might ultimately explain the three-quark model of basic baryons, and explain quark confinement. The representation is based on a deep mathematical result (namely, a minor modification of Kolmogorov’s solution to Hilbert’s 13th problem) which has already been used in foundations of neural networks and in foundations of physics, to explain why fundamental physical equations are of second order, and why all these fundamental equations naturally lead to non-smooth solutions like singularity.
1 Formulation of the Problem

In traditional (Newtonian) particle mechanics, most interactions are pair-wise (see, e.g., [11]): the total force $F^{(a)}$ acting on a particle $a$ is equal to the sum of all the forces $F^{(ab)}$ with which all other particles $b$ act on this $a$:

$$F^{(a)} = \sum_{b \ne a} F^{(ab)}\left(x^{(a)}, x^{(b)}\right); \tag{1}$$
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_12
here, $x^{(a)}$ denotes all the parameters which characterize the state of a particle $a$ (its coordinates, its charge, etc.). For example, the Newtonian gravitational interaction has the form

$$F^{(a)} = \sum_{b \ne a} \frac{G \cdot m_a \cdot m_b \cdot \left(r^{(a)} - r^{(b)}\right)}{\left\|r^{(a)} - r^{(b)}\right\|^3};$$

here, $x^{(a)} = (r^{(a)}, m_a)$, and

$$F^{(ab)} = \frac{G \cdot m_a \cdot m_b \cdot \left(r^{(a)} - r^{(b)}\right)}{\left\|r^{(a)} - r^{(b)}\right\|^3}.$$
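As an illustration of the pair-wise form (1), here is a minimal Python sketch that sums pair terms over all other particles. The names `pair_force` and `total_force`, the default value `G=1.0`, and the representation of a state as a `(position, mass)` tuple are assumptions made for this example; the pair term follows the formula exactly as written above, including its sign convention.

```python
import math


def pair_force(xa, xb, G=1.0):
    """Pair term F^(ab) as written in the text:
    G * m_a * m_b * (r_a - r_b) / ||r_a - r_b||^3.
    Each state x is a (position, mass) tuple."""
    (ra, ma), (rb, mb) = xa, xb
    d = [p - q for p, q in zip(ra, rb)]
    dist = math.sqrt(sum(c * c for c in d))
    k = G * ma * mb / dist**3
    return [k * c for c in d]


def total_force(a, states, G=1.0):
    """Formula (1): sum of the pair terms over all particles b != a."""
    total = [0.0] * len(states[a][0])
    for b, xb in enumerate(states):
        if b != a:
            f = pair_force(states[a], xb, G)
            total = [t + c for t, c in zip(total, f)]
    return total


# Hypothetical configuration: two equal masses placed symmetrically around particle 0.
states = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 2.0), ((-1.0, 0.0, 0.0), 2.0)]
print(total_force(0, states))  # the two pair terms cancel by symmetry
```

The key point is structural: `total_force` never needs a function of three or more states, only a sum of two-state terms.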
Relativistic non-quantum particle interactions can be described by a similar formula, with only two changes: first, we must use a relativistic expression for the force, and second, in determining the force which acts on a particle $a$ at a given moment of time $t$, we must use the values $x^{(b)}$ at the appropriately retarded moments of time $t' < t$ (see, e.g., [2]).

From the general mathematical viewpoint, we can expect more general interactions which are not necessarily pair-wise; namely, we can expect a general dependence of the type

$$F^{(a)} = F^{(a)}\left(x^{(a)}, x^{(b)}, x^{(c)}, \ldots\right), \tag{2}$$
which is not necessarily representable as a sum of type (1).

General formulas of type (2) are not only mathematically possible, they are physically quite possible even when interactions are actually pair-wise: namely, if we omit one of the particles $c$ from our description, then particles $b$ and $b'$ act on $a$ both directly (in the form (1)) and indirectly (via the unaccounted particle $c$), and in this indirect interaction, terms corresponding to $x^{(b)}$ and $x^{(b')}$ may not be separable.

A similar idea can be described in an even clearer way using the example of chemical interactions. Suppose that we have pair-wise interactions in which two substances $a$ and $b$ form a compound $ab$, and this compound interacts with a third substance $c$ to form a new substance $abc$:

$$a + b \to ab; \quad ab + c \to abc. \tag{3}$$
If the intermediate compound $ab$ is short-lived, then it is possible to omit it, and treat the reaction as a single (non-pair-wise) reaction between three chemical substances $a$, $b$, and $c$:

$$a + b + c \to abc. \tag{4}$$
In this case, the non-pair-wise interaction (4) can be reformulated in a pair-wise form (3) if we add an additional substance $ab$. In a more general case, we may have to add several additional substances.
Similarly, some non-pair-wise particle interactions (2) can be reformulated in a pair-wise form (1) if we add one or more additional particles. A natural question is: Can an arbitrary particle interaction be reformulated as pair-wise interaction by adding additional particles? In this paper, we will show that the answer to this question is “yes”; this answer will also, hopefully, bring us closer to the fundamental explanation of why a nucleon consists of exactly three (not two and not five) hard-to-separate particles (quarks).
2 Definitions and the Main Result

Definition. Let $m$ and $N$ be positive integers, and let $B$ be a positive real number. The integer $N$ will be called the number of particles, and the number $B$ will be called a bound.
• By a state space, we mean the set $S = [-B, B]^m$ of all $m$-tuples $s = (s_1, \ldots, s_m)$ for which $|s_i| \le B$ for all $i$.
• By a particle interaction between $N$ particles, we mean a sequence of $N$ continuous functions $f^{(a)}: S^N \to \mathbb{R}^m$ ($1 \le a \le N$), i.e., functions which transform every $N$-tuple of states $(s^{(1)}, \ldots, s^{(N)})$ into a vector $v^{(a)} \in \mathbb{R}^m$: $v^{(a)} = f^{(a)}(s^{(1)}, \ldots, s^{(N)})$.
• We say that a particle interaction $(f^{(1)}, \ldots, f^{(N)})$ is pair-wise if every function $f^{(a)}$ can be represented as a sum:

$$f^{(a)}\left(s^{(1)}, \ldots, s^{(N)}\right) = \sum_{b=1}^{N} f^{(ab)}\left(s^{(a)}, s^{(b)}\right), \quad 1 \le a \le N,$$

for some functions $f^{(ab)}: S^2 \to \mathbb{R}^m$.
• We say that a particle interaction $(f^{(1)}, \ldots, f^{(N)})$ can be represented in pair-wise form by adding $E$ extra particles if there exists a particle interaction $(g^{(1)}, \ldots, g^{(N)}, \ldots, g^{(N+E)})$ in which:
– for $1 \le a \le N$, the functions $g^{(a)}$ may depend on the states of all $N + E$ particles:

$$g^{(a)}\left(s^{(1)}, \ldots, s^{(N)}, \ldots, s^{(N+E)}\right); \tag{5}$$

– for $N + 1 \le a \le N + E$, the functions $g^{(a)}$ depend only on the states of the first $N$ particles:

$$g^{(a)}\left(s^{(1)}, \ldots, s^{(N)}, \ldots, s^{(N+E)}\right) = g^{(a)}\left(s^{(1)}, \ldots, s^{(N)}\right), \tag{6}$$

and describe the states $s^{(a)} = v^{(a)}$ of the new particles;
– for every sequence of $N$ states $(s^{(1)}, \ldots, s^{(N)})$ from $S^N$ and for every $a$ from 1 to $N$, if we substitute the values $s^{(b)} = g^{(b)}(s^{(1)}, \ldots, s^{(N)})$, $N + 1 \le b \le N + E$, into the expression (5), we get the value $f^{(a)}(s^{(1)}, \ldots, s^{(N)})$.

Theorem. An arbitrary particle interaction can be represented in pair-wise form by adding a finite number of extra particles.
3 Proof

This proof uses a result proven by Kolmogorov [8] as a solution to the conjecture of Hilbert, formulated as the thirteenth problem [5]: one of the 23 problems that Hilbert proposed in 1900 as a challenge to 20th-century mathematics.

We will present this result in the form given by Sprecher in [13] (see also [12]): For an arbitrary integer $n > 0$ and for an arbitrary real number $B$, there exist $2n + 1$ continuous increasing functions $\varphi_q: [-B, B] \to \mathbb{R}$, $1 \le q \le 2n + 1$, and $n$ real numbers $\lambda_1, \ldots, \lambda_n$ such that if we denote

$$y_q = \sum_{p=1}^{n} \lambda_p \cdot \varphi_q(x_p), \tag{7}$$

then we can represent each continuous function $f: [-B, B]^n \to \mathbb{R}$ in the form

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g(y_q) \tag{8}$$

for some continuous function $g: [-B, B] \to \mathbb{R}$. In particular, if we have $n$ different functions $f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n)$, then we can represent each of them in this form:

$$f_p(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g_p(y_q), \quad 1 \le p \le n, \tag{9}$$

for some continuous functions $g_1, \ldots, g_n: [-B, B] \to \mathbb{R}$.

This result was previously used in different application areas:
• in computational physics (see, e.g., [3]), where it helped to speed up computations;
• in neural networks, to prove that standard neural networks are universal approximators, in the sense that for every continuous function $F(x_1, \ldots, x_n)$ and for every $\varepsilon > 0$, there exists a neural network for which the corresponding input-output function is $\varepsilon$-close to $F(x_1, \ldots, x_n)$ [4, 6, 7, 9, 10]; and
• in foundations of physics (see, e.g., [14]), where it was used to explain why fundamental physical equations are of second order, and why all these fundamental equations naturally lead to non-smooth solutions like singularity.

In our physical problem, we have $N$ vector-valued functions of $N$ vector variables. Each of these vector variables has $m$ components, so we can view each of the $m$ components of each of the $N$ functions as a function of $n = N \cdot m$ scalar variables:

$$f_i^{(a)}\left(s^{(1)}, \ldots, s^{(N)}\right) = f_i^{(a)}\left(s_1^{(1)}, \ldots, s_m^{(1)}, \ldots, s_1^{(N)}, \ldots, s_m^{(N)}\right).$$

Thus, we have $n$ functions $f_i^{(a)}$ of $n$ scalar variables $s_i^{(a)} \in [-B, B]$; so, according to Sprecher’s theorem, we can find $2n + 1$ functions $\varphi_q(x)$ and functions $g_p(x)$ for which the formulas (7) and (9) hold.

Hence, we have $2n + 1$ extra variables $y_1, \ldots, y_{2n+1}$, for which each original function $f_p$ is represented as a sum of functions of one variable depending on these new variables (formula (9)), and each of the new variables $y_q$ is represented as a sum of functions of one variable depending only on the old variables $x_p$ (formula (7)).

This is almost what we want in our definition of “represented in pair-wise form by adding $E$ extra particles”; to get exactly what we want, we add $m - 1$ fictitious new variables $y_q$ which do not affect anything, and divide the resulting $(2n + 1) + (m - 1) = 2n + m = 2N \cdot m + m = m \cdot (2N + 1)$ new variables into $E = 2N + 1$ groups of $m$ variables each. Then, if we interpret the variables from each group as representing the state of one of the new particles, we get the desired representation (5), (6). The theorem is proven.
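The counting step at the end of the proof can be checked with a short Python sketch. The function name `group_new_variables` and the labels `y1, y2, ...` and `dummy1, ...` are hypothetical and serve only to make the bookkeeping visible; the sketch does not construct the functions $\varphi_q$ or $g_p$ themselves.

```python
def group_new_variables(N, m):
    """Split the new scalar variables from the proof into groups of m.

    n = N * m scalar inputs give 2n + 1 variables y_q; we pad them with
    m - 1 fictitious variables so that the total is m * (2N + 1).
    """
    n = N * m
    labels = [f"y{q}" for q in range(1, 2 * n + 2)]  # y1 .. y_{2n+1}
    labels += [f"dummy{j}" for j in range(1, m)]     # m - 1 fictitious variables
    assert len(labels) == m * (2 * N + 1)
    # Each group of m consecutive variables is read as the state of one new particle.
    return [labels[i:i + m] for i in range(0, len(labels), m)]


groups = group_new_variables(N=2, m=3)
print(len(groups))  # number E of extra particles, here 2*2 + 1 = 5
print(groups[-1])   # the last group contains the padding variables
```

For any positive `N` and `m`, the number of groups equals $2N + 1$ and does not depend on `m`, which is the value of $E$ used in the text.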
4 Physical Interpretation of the Result

Our result explains why in classical Newtonian mechanics, we only consider pair-wise particle interactions: indeed, as this theorem shows, by adding extra (“fictitious”) particles, we can describe an arbitrary particle interaction as a pair-wise one.

This result does more than simply explain this general possibility; it tells us how many of these additional particles we need to add to describe an arbitrary particle interaction. Indeed, if we start with $N$ particles, then we need $2N + 1$ additional (“fictitious”) particles, and we can therefore describe the original physical system by pair-wise interactions between the resulting $3N + 1$ particles.

If we add one “real” particle to the original physical system (i.e., if we go from $N$ to $N + 1$), then, to preserve the pair-wise description of the particle interaction, we need to go from $3N + 1$ to $3(N + 1) + 1 = 3N + 4 = (3N + 1) + 3$ particles in the pair-wise description, i.e., we need to add 3 particles to that description.
In other words, to represent particle interaction as pair-wise, we must represent each original particle as a triple of new ones. The resulting three “particles” actually represent a single directly observable particle and, in this sense, cannot be separated. So, this representation might ultimately explain the three-quark model of basic baryons.

This explanation can also potentially explain quark confinement. Indeed, in this model, the new particles are artificial mathematical constructions added to the list of original particles to describe their interaction; these new particles do not have any physical meaning of their own and therefore, there is no physical way to extract a single new particle. Qualitatively, this impossibility is exactly what physicists mean by quark confinement.

Comments.
• To avoid misunderstanding, we must emphasize again that this paper is purely foundational. Its goal is to derive the desired facts (such as the fact that a particle consists of only 3 quarks) in the most general context, by using the weakest possible assumptions. For example, when we consider arbitrary many-body interactions, we make this consideration not because we have any deep physical reasons to believe that, e.g., nucleonic forces are many-body, but because this is the most general type of interaction (more physically useful pair-wise interactions are a particular case of this more general class).
• We have explained why 3 new particles are enough; the only remaining mathematical question is: are 3 new particles really necessary? Can we do the same with two new particles instead of each original one? A (partial) negative answer to this question comes from a theorem proven by Doss [1]: that (at least for some $n$), in Sprecher’s result, we cannot use fewer than $2n + 1$ new variables $y_q$. Thus, two new particles are not enough, and we get an explanation of why exactly 3 quarks are needed.
• Our main objective is to describe a simple idea and its explanatory potential. To develop this idea further, it is necessary to look more attentively into the corresponding physics. For example, we simply show that we need at most 3 particles per one “old” one. This explanation does not explain why baryons (such as the proton or the neutron) consist of 3 quarks, while, say, pions consist of only 2 quarks, and photons and electrons look like elementary particles (which cannot be further decomposed into quark-like pieces).
• At present, our idea is a purely mathematical idea. In its present form, it shows that an arbitrary interaction can be explained by using a quark-type structure. Since this is a universal representation of arbitrary forces, it cannot therefore lead, by itself, to any experimental predictions. It may be possible, however, that in combination with other physically meaningful assumptions, we may get experimentally verifiable results. This would lead to an experimental test of this interpretation of quarks.
Acknowledgements This work was supported in part by NASA under cooperative agreement NCC5-209, by NSF grant No. DUE-9750858, by the United Space Alliance, grant No. NAS 920000 (PWO C0C67713A6), by the Future Aerospace Science and Technology Program (FAST) Center for Structural Integrity of Aerospace Systems, effort sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant number F49620-95-1-0518, and by the National Security Agency under Grant No. MDA904-98-1-0561. The author is thankful to Ronald R. Yager for valuable discussions.
References
1. Doss, R.: On the representation of continuous functions of two variables by means of addition and continuous functions of one variable. Colloq. Math. 10, 249–259 (1963)
2. Feynman, R.P., Leighton, R.B., Sands, M.L.: The Feynman Lectures on Physics. Addison-Wesley, Redwood City (1989)
3. Frisch, H.L., Borzi, C., Ord, G., Percus, J.K., Williams, G.O.: Approximate representation of functions of several variables in terms of functions of one variable. Phys. Rev. Lett. 63(9), 927–929 (1989)
4. Hecht-Nielsen, R.: Kolmogorov’s mapping neural network existence theorem. In: Proceedings of First IEEE International Conference on Neural Networks, San Diego, CA, pp. 11–14 (1987)
5. Hilbert, D.: Mathematical problems, lecture delivered before the international congress of mathematics in Paris in 1900, translated in Bull. Am. Math. Soc. 8, 437–479 (1902)
6. Hornik, K.: Approximation capabilities of multilayer feedforward neural networks. Neural Netw. 4, 251–257 (1991)
7. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward neural networks are universal approximators. Neural Netw. 2, 359–366 (1989)
8. Kolmogorov, A.N.: On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 369–373 (1957)
9. Kůrková, V.: Kolmogorov’s theorem is relevant. Neural Comput. 3, 617–622 (1991)
10. Kůrková, V.: Kolmogorov’s theorem and multilayer neural networks. Neural Netw. 5, 501–506 (1992)
11. Landau, L.D., Lifshitz, E.M.: The Classical Theory of Fields. Butterworth Heinemann, Bristol (1997)
12. Lorentz, G.G.: The 13-th problem of Hilbert. In: Browder, F.E. (ed.) Mathematical Developments Arising from Hilbert’s Problems, pp. 419–430. American Mathematical Society, Providence (1976) (Part 2)
13. Sprecher, D.A.: On the structure of continuous functions of several variables. Trans. Am. Math. Soc. 115(3), 340–355 (1965)
14. Yamakawa, T., Kreinovich, V.: Why fundamental physical equations are of second order? Int. J. Theor. Phys. 38(6), 1763–1770 (1999)
Linear Neural Networks Revisited: From PageRank to Family Happiness

Vladik Kreinovich
Abstract The study of Artificial Neural Networks started with the analysis of linear neurons. It was then discovered that networks consisting only of linear neurons cannot describe non-linear phenomena. As a result, most currently used neural networks consist of non-linear neurons. In this paper, we show that in many cases, linear neurons can still be successfully applied. This idea is illustrated by two examples: the PageRank algorithm underlying the successful Google search engine and the analysis of family happiness.
1 Linear Neural Networks: A Brief Reminder

Neural networks. A general neural network consists of several neurons exchanging signals. At each moment of time, for each neuron, we need finitely many numerical parameters to describe the current state of this neuron and the signals generated by this neuron. The state of the neuron at the next moment of time and the signals generated by the neuron at the next moment of time are determined by its current state and by the signals obtained from other neurons.

Non-linear and linear neural networks. In general, the dependence of the neuron’s next state and/or next signal on its previous state and previous signals is non-linear; see, e.g., [5]. However, original neural networks started with linear neurons, and, as we will argue, there are still cases when linear neurons work well.

What we do in this paper. In this paper, we show that there are indeed many useful applications of linear neural networks.
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_13
2 Linear Neural Networks: A Precise Description

Description. The state of each neuron $i$ at each moment of time $t$ can be described by the values $w_{i,1}(t), \ldots, w_{i,n_i}(t)$ of the corresponding parameters. Since we consider linear neural networks, the transformation to the next moment of time is linear in each of these parameters:

$$w_{i,j}(t + 1) = a_{i,j} + \sum_{k,l} a_{i,j,k,l} \cdot w_{k,l}(t).$$

Simplification: step one. In neural networks, to simplify the description, it is usually convenient to add a fictitious parameter $w_0$ whose value is 1; then, the above formula takes the simplified form

$$w_{i,j}(t + 1) = \sum_{k,l} a_{i,j,k,l} \cdot w_{k,l}(t),$$

where now the sum includes the case when $(k, l) = 0$; for this case, we take $a_{i,j,0} = a_{i,j}$.

Simplification: step two. This formula can be further simplified if we ignore the relation between parameters and neurons and simply consider all parameters of all the neurons. In precise terms, we take $A$ to be an index that goes over all pairs $(i, j)$. In this case, we have

$$w_A(t + 1) = \sum_B a_{A,B} \cdot w_B(t).$$
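The first simplification step can be illustrated by a small Python sketch: an update with a constant term gives the same result as a purely linear update on a state extended by a fictitious component equal to 1. The function names and the numbers are hypothetical; in particular, the extra top row `[1, 0, ..., 0]`, which keeps the fictitious component equal to 1 at the next moment of time, is an assumption of this sketch (the text only states that the value of $w_0$ is 1).

```python
def affine_step(a, M, w):
    """One step w_A(t+1) = a_A + sum_B M[A][B] * w_B(t)."""
    return [a[i] + sum(M[i][j] * w[j] for j in range(len(w)))
            for i in range(len(w))]


def augment(a, M):
    """Matrix acting on (w0, w1, ..., wn), where w0 = 1 is the fictitious parameter."""
    n = len(a)
    top = [1.0] + [0.0] * n  # assumption: this row keeps w0 equal to 1
    rows = [[a[i]] + list(M[i]) for i in range(n)]
    return [top] + rows


def linear_step(A, w):
    """One purely linear step w(t+1) = A w(t)."""
    return [sum(A[i][j] * w[j] for j in range(len(w))) for i in range(len(A))]


# Hypothetical coefficients and state, for illustration only.
a = [1.0, -2.0]
M = [[0.5, 0.0], [1.0, 2.0]]
w = [3.0, 4.0]
print(affine_step(a, M, w))                # [2.5, 9.0]
print(linear_step(augment(a, M), [1.0] + w))  # [1.0, 2.5, 9.0]
```

The last two components of the second result coincide with the first result, so the constant terms have been absorbed into the matrix.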
Asymptotic behavior is what we are interested in. In a general neural network, the dynamical process is important, but what is even more important is what happens after all these changes are done. In other words, the learning process is important, but the main objective is the result of this process. From this viewpoint, we are interested in the asymptotic behavior of the above dynamical system.

How to describe asymptotic behavior: enter eigenvectors and eigenvalues. The behavior of a general linear dynamical system is easier to describe if we use the basis consisting of the eigenvectors of the matrix $a_{AB}$. In this basis, the matrix becomes diagonal, with the eigenvalues $\lambda_A$ on the diagonal, so each component gets transformed as $w_A(t + 1) = \lambda_A \cdot w_A(t)$. Thus, we get $w_A(t) = \lambda_A^t \cdot w_A(0)$.

In the general case, all initial values $w_A(0)$ are non-zero. In this case, for large $t$, the coefficients $w_A(t)$ corresponding to the eigenvalue with the largest $|\lambda_A|$ become much larger than all other coefficients. Hence, for large $t$, the state vector becomes proportional to the eigenvector corresponding to this largest eigenvalue.
So, asymptotically, a linear neural network transforms the original state into the eigenvector corresponding to the largest eigenvalue (largest by the absolute value). Comment. The above ideas can be viewed as a particular case of the general ideas of neural-network-related dynamics logic; see, e.g., [16–18].
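The asymptotic behavior described in this section can be observed numerically with a minimal Python sketch of repeated linear updates. The function name `iterate_linear_network` and the example matrix are hypothetical. The rescaling after each step is an addition made here only to avoid numerical overflow; it changes the length of the state vector but not its direction.

```python
def iterate_linear_network(A, w, steps=100):
    """Repeatedly apply w <- A w, rescaling so that the largest |component| is 1."""
    for _ in range(steps):
        w = [sum(A[i][j] * w[j] for j in range(len(w))) for i in range(len(A))]
        scale = max(abs(x) for x in w)
        w = [x / scale for x in w]
    return w


# Hypothetical symmetric matrix with eigenvalues 3 (eigenvector (1, 1))
# and 1 (eigenvector (1, -1)).
A = [[2.0, 1.0], [1.0, 2.0]]
print(iterate_linear_network(A, [1.0, 0.0]))  # approaches a multiple of (1, 1)
```

Here the initial state has non-zero components along both eigenvectors, and the component along the eigenvector with the smaller eigenvalue shrinks, relative to the other one, by a factor of 3 at each step.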
3 Linear Neural Networks: From the General Case to the Simplest Case

Simplest case: we only know connections, not weights. In general, to describe a linear neural network, we need to describe real values $a_{AB}$ corresponding to different pairs of nodes. In geometric terms, $a_{AB} \ne 0$ means that there is a connection from node $B$ to node $A$. Thus, to describe a linear neural network, we need to describe:
• all the connections between the nodes, and
• for connected pairs of nodes, the strength $a_{AB}$ of this connection.
The simplest case is when we do not know the strengths, we only know which neurons are connected to each other. Let us see how, in this case, we can naturally define the strengths.

Limitations on the values $w_A$ of the state variables. In general, as we have mentioned, the values $w_A(t)$ of the state variables grow exponentially with time $t$. To restrict this growth, we can put a restriction on these values. The simplest limitations are linear ones, i.e., limitations of the type

$$\sum_A c_A \cdot w_A = \text{const}.$$

In the generic case, the constant is not 0. By dividing all the coefficients $c_A$ by this constant, we get a simplified relation $\sum_A c_A \cdot w_A = 1$.

A priori, we have no reasons to prefer one state variable over another. Thus, it makes sense to require that the weights $c_A$ corresponding to all the state variables are the same, i.e., that $c_A = c$ for all $A$ and for some constant $c$. Thus, we conclude that $\sum_A c \cdot w_A = 1$.

We can simplify this limitation even further if we change the unit in which we measure the values of the variables $w_A$ to a new measuring unit which is $c$ times smaller, so that 1 old unit is equivalent to $c$ new units. With this new choice of a unit, the value $w_A$ in the old units becomes $c \cdot w_A$ in the new units. Thus, if we describe the values $w_A$ in the new units, the above limitation takes a simplified form

$$\sum_A w_A = 1.$$
Resulting limitations on dynamics. The above limitation on the states means that we have to restrict the dynamics as well: to make sure that if the above restriction holds at some moment of time, then it will also hold at the next moment of time. In precise terms, if $\sum_A w_A = 1$ and $w'_A = \sum_B a_{AB} \cdot w_B$, then we should have $\sum_A w'_A = 1$.

Let us describe this relation in terms of the coefficients $a_{AB}$. By substituting the expression for $w'_A$ into the desired formula $\sum_A w'_A = 1$, we can reformulate the above condition as follows: if $\sum_A w_A = 1$, then $\sum_A \sum_B a_{AB} \cdot w_B = 1$, i.e., in other words,

$$\sum_A c_A \cdot w_A = 1, \quad \text{where } c_A \stackrel{\text{def}}{=} \sum_B a_{BA}.$$

In geometric terms, both the original equation $\sum_A w_A = 1$ and the new equation $\sum_A c_A \cdot w_A = 1$ describe planes, and the above if-then condition means that the first plane is a subset of the second plane. This can only happen when the two planes coincide, and this, in turn, means that the corresponding coefficients must coincide, i.e., that $\sum_B a_{BA} = 1$ for all $A$.

Resulting formulas for the weights. For each $A$ which is connected to some variables $B$, we need to assign weights $a_{BA}$ to all such connections. Since there is no reason to distinguish between the different $B$ to which there is a connection from $A$, it makes sense to make all these weights equal to each other. Due to the condition $\sum_B a_{BA} = 1$, this means that $a_{BA} = \frac{1}{n_A}$, where $n_A$ denotes the total number of variables $B$ for which there is a connection from $A$ to $B$.

For some variables $A$, there is no connection to any $B$. In this case, similarly, there is no reason to prefer one of the $B$’s, so it makes sense to assign equal weights to all the corresponding values $a_{BA}$. Thus, for such variables $A$, we get $a_{BA} = \frac{1}{n}$, where $n$ is the total number of variables.

A problem. Our main interest is in eigenvectors. The problem is that for the above matrix $a_{AB}$, there are many different eigenvectors. For example, if we have a group of variables which are only connected to each other, then there is an eigenvector in which $w_A \ne 0$ only for these variables $A$. To get this eigenvector, we can start, e.g., with equal weights assigned to all these variables. Since these variables are only connected to each other, we will never get $w_A \ne 0$ for any variable $A$ outside the group.

In general, the abundance of eigenvectors is not a big problem since we distinguish eigenvectors by their eigenvalues, and select only the one with the largest eigenvalue. However, in our case, due to the restriction $\sum_A w_A(t) = 1$, all the eigenvectors that satisfy this restriction correspond to the same eigenvalue 1.
So, to restrict the number of eigenvectors, it is desirable to modify the original matrix $a_{AB}$.

How to modify the original matrix. The above problem occurs when some variables are not connected with others. So, to resolve this problem, a natural idea is to add small connections $a'_{AB} \ne 0$ between every two nodes. Again, there is no reason to prefer one pair of variables to any other pair, so it makes sense to make all these values equal to each other: $a'_{AB} = \delta \ne 0$ for all $A$ and $B$.

If we simply add these values $a'_{AB} = \delta$ to all the entries of the original matrix $a_{AB}$, we will violate the condition $\sum_B a_{BA} = 1$, since for the matrix $\tilde a_{AB} = a_{AB} + a'_{AB}$, we will have

$$\sum_B \tilde a_{BA} = \sum_B a_{BA} + \sum_B a'_{BA} = 1 + \varepsilon,$$

where we denoted $\varepsilon \stackrel{\text{def}}{=} n \cdot \delta$. To preserve this condition, we should therefore multiply all the entries of the original matrix $a_{AB}$ by $1 - \varepsilon$. This multiplication does not change the eigenvectors and the order of eigenvalues of the original matrix, and for the new matrix $\tilde a_{AB} = (1 - \varepsilon) \cdot a_{AB} + a'_{AB}$, we indeed have

$$\sum_B \tilde a_{BA} = (1 - \varepsilon) \cdot \sum_B a_{BA} + \sum_B a'_{BA} = (1 - \varepsilon) + \varepsilon = 1.$$

One can check that for nodes $A$ that have no outgoing connections, the new formula leads to exactly the same equal values $\tilde a_{BA}$ as the formula for $a_{BA}$, which makes perfect sense, since for this node $A$, we did not introduce any differentiation between different nodes $B$. Thus, we arrive at the following natural matrix.

Final formula for the matrix $a_{BA}$. We start with a small number $\varepsilon$ and a directed graph in which vertices correspond to variables $A$ and there is an edge from $A$ to $B$ if and only if the variable $A$ can influence the variable $B$.

When a node $A$ cannot influence any variable $B$, i.e., when there is no edge coming out of $A$ in our graph, then we take $a_{BA} = \frac{1}{n}$, where $n$ is the total number of nodes. For every other node $A$, we take:
• $a_{BA} = \frac{\varepsilon}{n}$ for all $B$ for which there is no connection from $A$, and
• $a_{BA} = (1 - \varepsilon) \cdot \frac{1}{n_A} + \frac{\varepsilon}{n}$ for every node $B$ for which there is an edge from $A$ to $B$, where $n_A$ is the number of such nodes $B$.
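The final formula can be turned into a short Python sketch that builds the matrix from a list of edges and checks that every column sums to 1. The function name `transition_matrix`, the numbering of nodes from `0` to `n - 1`, and the example graph are assumptions made for illustration.

```python
def transition_matrix(n, edges, eps):
    """Return a, with a[B][A] given by the final formula of this section.

    n     -- total number of nodes, numbered 0 .. n-1 (an assumption of this sketch)
    edges -- iterable of pairs (A, B), meaning there is an edge from A to B
    eps   -- the small number epsilon
    """
    out = {A: {B for (X, B) in edges if X == A} for A in range(n)}
    a = [[0.0] * n for _ in range(n)]
    for A in range(n):
        targets = out[A]
        for B in range(n):
            if not targets:        # no edge coming out of A
                a[B][A] = 1.0 / n
            elif B in targets:     # edge from A to B
                a[B][A] = (1 - eps) / len(targets) + eps / n
            else:                  # A has outgoing edges, but none to B
                a[B][A] = eps / n
    return a


# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no outgoing edges.
a = transition_matrix(3, [(0, 1), (0, 2), (1, 2)], eps=0.15)
for A in range(3):
    print(A, sum(a[B][A] for B in range(3)))  # each column sum is 1 up to rounding
```

Since every column sums to 1, applying this matrix to a state vector with $\sum_A w_A = 1$ again gives a vector whose components sum to 1, as required above.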
4 First Application: PageRank Algorithm as an Example of a Linear Neural Network PageRank algorithm: brief reminder. Web search engines return all the webpages that contain the desired phrase or keywords. Often, there are thousands and millions of such webpages. So, to make search results meaningful, search engines sort the webpages and only return the top ranking ones. The most successful and the most widely used search engine—Google—uses a special PageRank algorithm for sorting. This algorithm is based on the following idea. Let us assume that a user starts with a randomly selected page. If this page has no links on it, the user picks another random page—with equal probability of selecting each of the webpages. If there are links, the user either goes to one of these links (with equal probability), or, with a certain probability ε > 0, starts the process anew. As a result: • reputable pages, i.e., pages to which many pages link will be visited more frequently, while • obscure pages, i.e., pages to which no pages link will be visited much less frequently. The probability of visiting a page in the above process is then used as a rank of this page; see, e.g., [6, 13] and references therein. PageRank and linear neural networks. One can represent the world wide web as a directed graph, in which webpages are nodes and an edge from page A to page B means that there is a link to page B on page A. From the above description of the PageRank algorithm, one can see that the probabilities a B A of going from A to B are described exactly by our linear neural network formulas, so the final probabilities— ranks of different pages—form an eigenvector of the corresponding matrix. Thus, PageRank can be viewed as a natural application of linear neural networks. PageRank and linear neural networks beyond web search. Similar ideas and formulas have been used beyond web search, e.g., to describe the degree of trust of different participants in a peer-to-peer network; see, e.g., [11]. 
This idea can also be applied to other situations if, instead of the full graph of all the connections, we apply this algorithm to the appropriate subgraph. In some cases, this restriction leads to heterogeneous networks (see, e.g., [25]); examples will be given below. Let us list several possible applications of this type. We will see that the result depends on how much information we keep in the subgraph.

First example: ranking papers based on citations. A paper which is cited more is, in general, considered to be more valuable. This is why papers are often ranked by the number of citations. However, this simple count is too crude: a citation in a more important (more cited) paper should count for more than a citation in an obscure paper, i.e., in a paper that no one cites.
A natural way to assign rank to papers based on citations is to form a directed graph in which nodes are papers and there is a link from A to B if the paper A cites the paper B. Then, we can use the PageRank algorithm (i.e., in effect, linear neural networks) to provide a good ranking of papers by their importance; this idea was already proposed by the authors of the PageRank algorithm.

Second example: ranking of authors. To nodes describing papers, we can add nodes describing authors, with a link from each author to each of his or her papers, and with links from each paper to each of its authors.

Computational comment. This is an example of a heterogeneous graph: we clearly have nodes of two different types, papers and authors. We can also consider a simplified (and homogeneous) version of this graph, in which there are only author nodes, and each author links to co-authors; it will provide the ranking by intensity of collaborations.

Practical comment. Because of the inter-disciplinary character of modern research, it makes sense, when ranking authors from a certain area (e.g., from geosciences), to include papers from other areas in this graph as well: this way, if a computer science paper is citing a geoscience paper, we give this citation more credit if this computer science paper is itself more important (i.e., in this description, more cited).

Third example: ranking journals and conferences. To do that, we add, to the graph describing authors and papers, additional nodes describing journals and conferences. Then, we have a link from each paper to the journal or conference where it appeared, and we have links from each journal or conference to all the papers that have been published there.
5 Second Application: Family Dynamics

From the simplest case to the general case. In the previous two sections, we only considered the simplest case when we only know which nodes can influence other nodes, but we do not have specific information about the degree to which they can influence each other. In many practical cases, however, we know these degrees as well. Let us describe a natural example of such a case.

Description of the example. A natural example of dependence is between people. Each person’s utility (i.e., degree of happiness, degree of satisfaction) is, in general, determined not only by the objective factors (like what this person gets and what others get) but also by the degree of happiness of other people. Normally, this dependence is positive, i.e., we feel happier if other people are happy. However, negative emotions such as jealousy are also common, when someone else’s happiness makes a person unhappy. The idea that the utility of a person depends on the utilities of others was first described in [20, 21]. It was further developed by another future Nobelist Gary Becker; see, e.g., [1]; see also [4, 7, 10, 14, 26].
Interdependence of utilities: general description. In general, the utility $u_i$ of the $i$th person under interdependence can be described as $u_i = f_i(u_i^{(0)}, u_j)$, where $u_i^{(0)}$ is the utility that does not take interdependence into account, and $u_j$ are utilities of other people. The effects of interdependence can be illustrated on the example of linear approximation, when we approximate the dependence by the first (linear) terms in its expansion into Taylor series, i.e., when the utility $u_i$ of the $i$th person is equal to

$$u_i = u_i^{(0)} + \sum_{j \ne i} a_{ij} \cdot u_j,$$

where the interdependence is described by the coefficients $a_{ij}$, i.e., in effect, as a limit state of a linear neural network; the ideas of using neural networks to describe social interactions can also be found in [15].

Paradoxes of love. This simple and seemingly natural model leads to interesting and somewhat paradoxical conclusions; see, e.g., [2, 3, 12, 14]. For example, mutual affection between persons $P_1$ and $P_2$ means that $a_{12} > 0$ and $a_{21} > 0$. In particular, selfless love, when someone else’s happiness means more than one’s own, corresponds to $a_{12} > 1$. In general, for two persons, we thus have

$$u_1 = u_1^{(0)} + a_{12} \cdot u_2; \qquad u_2 = u_2^{(0)} + a_{21} \cdot u_1.$$

Once we know the original utility values $u_1^{(0)}$ and $u_2^{(0)}$, we can solve this system of linear equations and find the resulting values of utility:

$$u_1 = \frac{u_1^{(0)} + a_{12} \cdot u_2^{(0)}}{1 - a_{12} \cdot a_{21}}; \qquad u_2 = \frac{u_2^{(0)} + a_{21} \cdot u_1^{(0)}}{1 - a_{12} \cdot a_{21}}.$$
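To get a feeling for these formulas, here is a minimal Python sketch that evaluates the two-person solution for a few coefficient values. The function name `two_person_utilities` and the sample numbers are assumptions chosen only for illustration; the formulas themselves are the ones given above, and they apply only when $a_{12} \cdot a_{21} \ne 1$.

```python
def two_person_utilities(u1_0, u2_0, a12, a21):
    """Solve u1 = u1_0 + a12*u2, u2 = u2_0 + a21*u1 for (u1, u2)."""
    det = 1.0 - a12 * a21
    if det == 0.0:
        # the linear system is degenerate when a12 * a21 == 1
        raise ValueError("a12 * a21 must differ from 1")
    u1 = (u1_0 + a12 * u2_0) / det
    u2 = (u2_0 + a21 * u1_0) / det
    return u1, u2


# Illustrative values (not taken from the text):
moderate = two_person_utilities(1.0, 1.0, 0.5, 0.5)  # moderate mutual affection
deep = two_person_utilities(1.0, 1.0, 2.0, 2.0)      # both coefficients exceed 1
print(moderate, deep)
```

With moderate mutual affection, both resulting utilities exceed the original ones; with both coefficients larger than 1, the same positive original utilities produce negative resulting utilities, because the denominator $1 - a_{12} \cdot a_{21}$ changes sign.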
As a result, when two people are deeply in love with each other ($a_{12} > 1$ and $a_{21} > 1$), then positive original pleasures $u_i^{(0)} > 0$ lead to $u_i < 0$, i.e., to unhappiness. This phenomenon may be one of the reasons why people in love often experience deep negative emotions. From this viewpoint, a situation when one person loves deeply and another rather allows him- or herself to be loved may lead to more happiness than mutual passionate love. The fact that the coefficients $a_{ij}$, in general, change with time explains a frequent family dynamics, when a passionate happy marriage surprisingly drifts into negative emotions.

A similar negative consequence of love can also happen in situations like selfless Mother’s love, when $a_{12} > 0$ may be not so large but $a_{21}$ is so large that $a_{12} \cdot a_{21} > 1$.

There are also interesting consequences when we try to generalize these results to more than 2 persons. For example, we can define an ideal love, when each person treats other’s emotions almost the same way as one’s own, i.e., $a_{12} = a_{21} = a = 1 - \varepsilon$ for a small $\varepsilon > 0$. For two people, from $u_i^{(0)} > 0$, we get $u_i > 0$, i.e., we can still
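have happiness. However, if we have three or more people in the state of mutual affection, i.e., if

$$u_i = u_i^{(0)} + a \cdot \sum_{j \ne i} u_j,$$

then in case when everything is fine, e.g., $u_i^{(0)} = u^{(0)} > 0$ for all $i$ (so that, by symmetry, all $n$ values $u_i$ coincide), we have

$$u_i \cdot (1 - a \cdot (n - 1)) = u_i \cdot (2 - \varepsilon - (1 - \varepsilon) \cdot n) = u^{(0)},$$

hence

$$u_i = \frac{u^{(0)}}{2 - \varepsilon - (1 - \varepsilon) \cdot n} < 0,$$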
i.e., we have unhappiness. This may be the reason why 2-person families are the main form. In other words, if two people care about the same person (e.g., his mother and his wife), all three of them are happier if there is some negative feeling (e.g., jealousy) between them.

Comment. It is important to distinguish between emotional interdependence, in which one’s utility is determined by the utility of other people, and “objective” altruism, in which one’s utility depends on the material gain of other people but not on their subjective utility values, i.e., in which (in the linearized case)

$$u_i = u_i^{(0)} + \sum_{j} a_{ij} \cdot u_j^{(0)}.$$

In this approach, when we care about others’ well-being and not their emotions, no paradoxes arise, and any degree of altruism only improves the situation; see, e.g., [8, 9, 19]. This objective approach to interdependence was proposed and actively used by yet another Nobel Prize winner: Amartya K. Sen; see, e.g., [22–24].

Acknowledgements This work was supported in part by the National Science Foundation grants HRD-0734825 and DUE-0926721, by Grant 1 T36 GM078000-01 from the National Institutes of Health, by Grant MSM 6198898701 from MŠMT of Czech Republic, and by Grant 5015 “Application of fuzzy logic with operators in the knowledge based systems” from the Science and Technology Centre in Ukraine (STCU), funded by European Union. The author is thankful to Jiawei Han, Leonid Perlovsky, and Paulo Pinheiro da Silva for valuable discussions.
References

1. Becker, G.S.: A Treatise on the Family. Harvard University Press, Cambridge (1991)
2. Bergstrom, T.: Love and spaghetti, the opportunity cost of virtue. J. Econ. Perspect. 3, 165–173 (1989)
3. Bergstrom, T.: Systems of benevolent utility interdependence. Technical report, University of Michigan (1991)
4. Bernheim, B.D., Stark, O.: Altruism within the family reconsidered: do nice guys finish last? Am. Econ. Rev. 78(5), 1034–1045 (1988)
5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2007)
6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
7. Friedman, D.D.: Price Theory. South-Western Publication, Cincinnati (1986)
8. Harsanyi, J.S.: Rational Behavior and Bargaining Equilibrium in Games and Social Situations. Cambridge University Press, New York (1977)
9. Harsanyi, J.S.: Morality and the theory of rational behavior. In: Sen, A., Williams, B. (eds.) Utilitarianism and Beyond, pp. 39–62. Cambridge University Press, Cambridge (1982)
10. Hori, H., Kanaya, S.: Utility functionals with nonpaternalistic intergenerational altruism. J. Econ. Theory 49, 241–265 (1989)
11. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The EigenTrust algorithm for reputation management in P2P networks. In: Proceedings of the 12th International World Wide Web Conference WWW’2003, Budapest, Hungary, 20–24 May 2003
12. Kreinovich, V.: Paradoxes of love: game-theoretic explanation. Technical report UTEP-CS-9016, University of Texas at El Paso, Department of Computer Science (1990)
13. Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: the Science of Search Engine Rankings. Princeton University Press, Princeton (2006)
14. Nguyen, H.T., Kosheleva, O., Kreinovich, V.: Decision making beyond Arrow’s ‘impossibility theorem’ with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009)
15. Perlovsky, L.: Evolution of languages, consciousness, and culture. IEEE Comput. Intell. Mag. 2(3), 25–39 (2009)
16. Perlovsky, L.I.: Neural network with fuzzy dynamic logic. In: Proceedings of the International IEEE and INNS Joint Conference on Neural Networks IJCNN’05, Montreal, Quebec, Canada (2005)
17. Perlovsky, L.I.: Fuzzy dynamic logic. New Math. Nat. Comput. 2(1), 43–55 (2006)
18. Perlovsky, L.I.: Neural networks, fuzzy models and dynamic logic. In: Köhler, R., Mehler, A. (eds.) Aspects of Automatic Text Analysis: Festschrift in Honor of Burghard Rieger, pp. 363–386. Springer, Berlin (2007)
19. Putterman, L.: Peasants, Collectives, and Choice: Economic Theory and Tanzania’s Villages. JAI Press, Greenwich (1986)
20. Rapoport, A.: Some game theoretic aspects of parasitism and symbiosis. Bull. Math. Biophys. 18 (1956)
21. Rapoport, A.: Strategy and Conscience. New York (1964)
22. Sen, A.K.: Labor allocation in a cooperative enterprise. Rev. Econ. Stud. 33(4), 361–371 (1966)
23. Sen, A.K.: Collective Choice and Social Welfare. Holden-Day, San Francisco (1970)
24. Sen, A.K.: Resources, Values, and Development. Harvard University Press, Cambridge (1984)
25. Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: RankClus: integrating clustering with ranking for heterogeneous information network analysis. In: Proceedings of the European Conference on Data Base Theory EDBT’2009, Saint Petersburg, Russia, 24–26 March 2009
26. Tipler, F.J.: The Physics of Immortality. Doubleday, New York (1994)
Why 3 Basic Colors? Why 4 Basic Tastes?

Vladik Kreinovich
Abstract In this paper, we give a general explanation of why there are 3 basic colors and 4 basic tastes. One of the advantages of having an explanation on a system level (without involving physiological details) is that a general explanation works not only for humans, but for potential extra-terrestrial intelligent beings as well.
1 Introduction

Why 3 basic colors? The mainstream color perception theory, dating back to Thomas Young (1801) and Hermann von Helmholtz (1850s), states that every color that we see can be expressed as a composition of three basic colors: red, green, and blue (see, e.g., [2]). It is not only a theory: TV sets give us perfect colors by combining the three basic ones, and the nice colors on the computer that I am typing this text on are also composed of the same three basic colors. Why three?

Colors from a physical viewpoint. From a physical viewpoint, different colors correspond to different wavelengths (or, equivalently, different frequencies) of light. However, the fact that three colors are sufficient for a human eye has nothing to do with this physical observation: in the real physical world, no combination of lights of wavelengths $\lambda_1, \ldots, \lambda_k$ can make a light with a new wavelength $\lambda \ne \lambda_i$. So, the physical viewpoint does not help us to answer the question: why 3 basic colors?

There are 4 basic tastes. According to the mainstream theory (see, e.g., [1, 7]), any taste can be represented as a combination of four basic ones: salt, sour, sweet, and bitter. Why four?
V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_14
Existing explanations. One of the possible explanations (see, e.g., a popular survey [3] and references therein) is that the 4 basic tastes have the following meaning: sweet means high-energy food, salt means salt, bitter means poison, and sour means unripe.

What is wrong with this explanation. It is not very convincing. If nature decided to distinguish between high-energy food and salts, why not go one step further and distinguish between different types of sugars? Or between sugars and proteins? Another reason why this explanation may not sound convincing is that although some other species (with a similar biochemistry of life) also have 4 basic tastes, their tastes correspond to somewhat different sets of chemical substances (see, e.g., [1]).

What we are going to do in this paper is to provide a new explanation of the existence of exactly 3 basic colors and exactly 4 basic tastes. This explanation will not be based on any biochemical details, and can thus be potentially applied to any lifeform.
2 Why 3 Basic Colors?

The following idea was first proposed in [5].

We want the smallest possible sufficient number of colors. The more colors, the harder it is to implement them. So, let us find the smallest number of colors that would be sufficient for a human being.

What are colors for? Informal description of a problem. Among the main purposes of vision in a living creature are the necessities to notice potential food and potential menace (predator). How many colors are sufficient for that purpose? Suppose that we are approaching some surface (e.g., the border of a forest, or a hill). A predator does not want to be noticed beforehand, so he tries to hide on that surface. Similarly, the creatures that are potential food for us want to hide. When is hiding efficient?

Formalizing this problem. If no colors are used (i.e., if everything is in only one color, black and white), then every point $\vec x = (x, y)$ on this surface is characterized by its brightness $I(\vec x)$, i.e., in physical terms, by the intensity of the light emitted by this point. If we use $c$ basic colors, then to characterize a point, we need the intensity of each of these colors. In other words, to characterize a point $\vec x$, we use $c$ numbers $I_1(\vec x), \ldots, I_c(\vec x)$. Suppose that the color of a creature that wants to hide (i.e., a predator or a prey) is described by parameters $p_1, \ldots, p_c$. Then, a creature will be able to hide successfully if he can find a point $(x, y)$ in which his colors are exactly the same as the background, i.e., if $I_1(x, y) = p_1$, …, $I_c(x, y) = p_c$. So, we have $c$ equations for two unknowns (the unknowns are the coordinates $x$ and $y$). Whether a system of equations has a solution depends on the relationship between the number of equations (in our case, $c$) and the number of unknowns (in our case,
2). If there are more unknowns than equations, then in general, there is a solution. If there is exactly the same number of equations as there are unknowns, then there is in general a unique solution. If the number of equations exceeds the number of unknowns, then the system is over-determined, and in general, it has no solutions. In our case, if $c \le 2$, then in the general case, we do have a solution, so 1 or 2 colors are not sufficient. If $c = 3, 4, \ldots$, then we have a system of $c > 2$ equations with 2 unknowns. In general, such a system is inconsistent. Therefore, the smallest number of basic colors that enables us to detect a hiding creature is 3. This explains why there are 3 basic colors.

Comment. The relationship between the numbers of parameters and the existence of the solution is an informal idea. However, under the appropriate formalization of the term “in general”, this statement can be made precise (for an example of how a similar argument in thermodynamics can be formalized, see, e.g., [4], Chap. 10). Other physical applications of a similar line of argument can be found in [6].
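The counting argument can be illustrated by a small Python sketch. As a simplifying assumption (not made in the text), each background intensity is taken to be a linear function $I_k(x, y) = a_k x + b_k y + c_k$, and the first two intensity fields are assumed to be independent; the function name `hiding_point` and all numbers are hypothetical.

```python
def hiding_point(fields, colors, tol=1e-9):
    """Look for a point (x, y) where every background intensity
    I_k(x, y) = a*x + b*y + c equals the creature's color p_k.

    fields: list of at least two triples (a, b, c), one per basic color
    colors: list of target values p_k, same length as fields
    Returns (x, y) if the creature can hide, otherwise None.
    """
    (a1, b1, c1), (a2, b2, c2) = fields[:2]
    det = a1 * b2 - a2 * b1
    if abs(det) <= tol:
        raise ValueError("the first two fields are assumed independent")
    r1, r2 = colors[0] - c1, colors[1] - c2
    # Cramer's rule: two equations determine the two unknowns x and y
    x = (r1 * b2 - r2 * b1) / det
    y = (a1 * r2 - a2 * r1) / det
    # every additional color is an extra equation that must also hold
    for (a, b, c), p in zip(fields[2:], colors[2:]):
        if abs(a * x + b * y + c - p) > tol:
            return None
    return (x, y)


fields = [(1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (2.0, 1.0, 0.0)]
colors = [0.6, 0.2, 0.5]
print(hiding_point(fields[:2], colors[:2]))  # two colors: a hiding point exists
print(hiding_point(fields, colors))          # three colors: None
```

With two colors, the two equations single out one hiding point; the third color adds an equation that this point does not satisfy, so the creature is detected. A third color chosen to match exactly would still allow hiding, which is why the text says “in general”.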
3 Why 4 Basic Tastes?

What are tastes for? Informal description. When we have a mouthful of useful food, it is necessary to detect whether there is any unwelcome part in it. One of the purposes of taste is to detect this part and thus not allow it into the organism. When is it possible?

Formal description. In this case, we have a 3-dimensional area filled with pieces of food that have different tastes. If we use $t$ basic tastes, then to describe the taste at each point $\vec x = (x, y, z)$, we need $t$ values: components of these tastes $I_1(\vec x), \ldots, I_t(\vec x)$. Suppose that the taste of the unwelcome part of food is described by parameters $q_1, \ldots, q_t$. Then, this part can remain undetected only if there is a place in this chunk of food with exactly the same values of taste parameters, i.e., when there is a point $\vec x = (x, y, z)$ such that $I_1(x, y, z) = q_1$, …, $I_t(x, y, z) = q_t$. So, we have $t$ equations for three unknowns $x$, $y$, and $z$. Similarly to the colors case, we can conclude that in general, this system is solvable if and only if the number of equations does not exceed the number of unknowns (i.e., when $t \le 3$). Therefore, the smallest number $t$ of basic tastes for which this system is not in general solvable (and for which, therefore, tastes make it impossible to hide an alien object inside the food) is $t = 4$. This explains why there are 4 basic tastes.
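The two arguments follow the same counting pattern, which can be written compactly. The symbols $d$ (dimension of the region in which an object can hide) and $n$ (number of basic sensations) are introduced here only for illustration, and “in general” is used in the same informal sense as in the text.

```latex
\[
  \underbrace{I_1(\vec x) = q_1,\ \ldots,\ I_n(\vec x) = q_n}_{n\ \text{equations}},
  \qquad \vec x \in \mathbb{R}^d \ (d\ \text{unknowns}):
  \qquad \text{solvable in general} \iff n \le d .
\]
\[
  n_{\min} = d + 1:
  \qquad d = 2 \ \Rightarrow\ n_{\min} = 3 \ \text{(colors)},
  \qquad d = 3 \ \Rightarrow\ n_{\min} = 4 \ \text{(tastes)} .
\]
```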
References

1. Beidler, L.M. (ed.): Taste. Springer, Berlin (1971)
2. De Grandis, L.: Theory and Use of Color. Abrams, New York (1986)
3. Freedman, D.H.: In the realm of the chemical. Discover 7, 69–76 (1993)
4. Gilmore, R.: Catastrophe Theory for Scientists and Engineers. Wiley, New York (1981)
5. Kreinovich, V.: Connection between dimensions c of colour space and d of real space. Not. Am. Math. Soc. 26(7), A-615 (1979)
6. Kreinovich, V.: Only particles with spin ≤ 2 are mediators for fundamental forces: why? Phys. Essays 5(4), 458–462 (1992)
7. Laing, D.C., et al. (eds.): Perception of Complex Smells and Tastes. Academic, Sydney (1989)
What Segments Are the Best in Representing Contours?

Vladik Kreinovich and Chris Quintana
Abstract A pixel-by-pixel (iconic) computer representation of an image takes too much memory, demands lots of time to process, and is difficult to translate into terms that are understandable for human users. Therefore, several image compression methods have been proposed and efficiently used. The natural idea is to represent only the contours that separate different regions of the scene. At first these contours were mainly approximated by a sequence of straight-line segments. In this case, for smooth contours a few segments are sufficient to represent the contours; the more curved it is, the more segments we need and the less compression we get as a result. To handle curved edges, circle arcs are now used in addition to straight lines. Since the choice of segments essentially influences the quality of the compression, it is reasonable to analyze what types of segments are the best. In the present paper we formulate the problem of choosing the segment type as a mathematical optimization problem and solve it under different optimality criteria. As a result, we get a list of segment types that are optimal under these criteria. This list includes both the segments that were empirically proven to be efficient for some problems, and some new types of segments that may be worth trying.
1 Introduction to the Problem

Why is contour segmentation necessary? A pixel-by-pixel (iconic) computer representation of a map takes too much memory, demands lots of time to process, and is difficult to translate into terms that are understandable for human users [2, 4, 5]. Therefore, several image compression methods have been proposed and efficiently used. The natural idea is to represent only the contours that separate different regions of the scene.

V. Kreinovich (B) · C. Quintana
Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_15
Storing the contour pixel-by-pixel would also take too much memory, and therefore too much time to process. Therefore, in order to represent a contour it is reasonable to choose a class of standard segments (e.g., straight-line segments), approximate a contour by a sequence of segments, and represent this contour by the parameters that characterize these segments. Such representations are successfully used in many areas:

• for automated navigation [5];
• in robotic vision; for example, for the analysis of an image that contains industrial parts that have to be assembled (or processed in some way) [2, 4, 10, 11];
• for automated analysis of line drawings in documents, especially the analysis of flow charts [2, 18].

What segments are used now? The simplest approximations for curves are chains of straight-line segments, and such approximations have been intensively used and studied in image processing [1, 5, 7, 17, 19]. In this case, a few segments are sufficient to represent a smooth contour; the more curved the contour is, the more segments we need and the less compression we get as a result. To handle curved edges, it was proposed in [2, 11] to use circle arcs in addition to straight lines. This addition essentially improved the quality of the compression.

Formulation of the problem. Since the choice of segments essentially influences the quality of the compression, we can gain much by choosing new types of segments. Therefore, it is necessary to analyze what types of segments are the best to choose (in some reasonable sense).

What we are planning to do. In the present paper, we formulate the problem of choosing a segment type as a mathematical optimization problem, and solve it under different optimality criteria. As a result, we get a list of segment types that are optimal under these criteria. This list includes both the segments that were empirically proven to be efficient for some problems (that is, line segments, circle arcs, etc.), and some new types of segments (like hyperbolic curves) that may be worth trying.
2 Motivations of the Proposed Mathematical Definitions

How to represent segments? How can we represent a contour in mathematical terms? From a mathematical viewpoint, a contour is a curve, i.e., when we follow a contour, both the $x$ and $y$ coordinates change. In principle, we can represent a contour by describing how, for example, the $y$ coordinate depends on $x$, i.e., by a function $y = f(x)$. But in many applications it is important to be able to rotate the image: e.g., in robotic vision, when the robot moves, the image that it sees rotates; in automated navigation it is desirable to have a map in which the current course of the ship is vertical; in this case, when a ship turns, the map has to be rotated. Rotating a
map in mathematical terms means that we change our coordinate frame from $x, y$ to the new coordinates $x'$ and $y'$:

$$x' = x\cos(\varphi) + y\sin(\varphi), \qquad y' = -x\sin(\varphi) + y\cos(\varphi).$$

It is difficult to transform the dependency $y(x)$ into the dependency of $y'$ on $x'$. Therefore, instead of describing the explicit dependency $y = f(x)$, it is more reasonable to represent a curve by an implicit function, i.e., by an equation $F(x, y) = 0$. This equation describes a curve just as the equation $x^2 + y^2 = 1$ describes a circle in analytical geometry. For such implicit representations it is very easy to change the representation if we make a rotation: it is sufficient to substitute into the expression for $F(x, y)$ the expressions for $x$ and $y$ in terms of $x'$ and $y'$:

$$x = x'\cos(\varphi) - y'\sin(\varphi), \qquad y = x'\sin(\varphi) + y'\cos(\varphi).$$

The segment $F(x, y) = 0$ must fit different curves, therefore we must have an expression for $F(x, y)$ with one or several parameters, so that by adjusting the values of these parameters we would be able to get the best fit for each segment. The more parameters we allow, the better the approximation, but, on the other hand, the more parameters we need to store. So we must somehow fix the number $m$ of parameters that we will use.

The most commonly used $m$-parametric families are obtained as follows: we fix $m$ smooth functions $f_i(x, y)$ (their set is called a basis because they are the basis for our approximations), and then use arbitrary functions $F(x, y) = \sum_{i=1}^{m} C_i f_i(x, y)$, where $C_i$ are the adjustable parameters. The functions that are obtained for different values of $C_i$ form a family that is $m$-parametric (or $m$-dimensional) in the sense that in order to choose a function from that family, it is necessary to give the values of $m$ parameters. We’ll consider only functions $f_i(x, y)$ that are maximally smooth, namely, analytical (i.e., can be expanded into Taylor series). Let’s give two examples.
Example 1 In particular, if we take $m = 3$, $f_1(x, y) = 1$, $f_2(x, y) = x$ and $f_3(x, y) = y$, we get all the curves of the type $C_1 + C_2 x + C_3 y = 0$, i.e., all possible straight lines.

Example 2 If we take $m = 4$ and, in addition to the above functions $f_i(x, y)$ for $i = 1, 2, 3$, take $f_4(x, y) = x^2 + y^2$, we obtain all possible curves of the type $C_1 + C_2 x + C_3 y + C_4 (x^2 + y^2) = 0$. The geometric meaning of this expression is easy to get. If $C_4 = 0$, we get straight lines. If $C_4 \ne 0$, we can divide both sides of the equation $F(x, y) = 0$ by $C_4$, thus reducing it to the equation $x^2 + y^2 + \tilde C_2 x + \tilde C_3 y + \tilde C_1 = 0$, where $\tilde C_i = C_i / C_4$. If we apply to this equation the transformation that helps to solve ordinary quadratic equations (namely, $x^2 + ax + b = (x + (a/2))^2 + (b - (a/2)^2)$), we come to the conclusion that the original equation $F(x, y) = 0$ is equivalent to the following equation: $(x - x_0)^2 + (y - y_0)^2 = Z$, where $x_0 = -\tilde C_2 / 2$, $y_0 = -\tilde C_3 / 2$ and $Z = (\tilde C_2^2 + \tilde C_3^2)/4 - \tilde C_1$. If $Z < 0$, then this equation has no solutions at all, because its left-hand side is always non-negative. If $Z = 0$, the solution is possible only when $x = x_0$ and $y = y_0$, i.e., the curve consists of only one point $(x_0, y_0)$. If $Z > 0$, then by introducing a new variable $R = \sqrt{Z}$ we can reduce the above equation to the equation $(x - x_0)^2 + (y - y_0)^2 = R^2$, which describes a circle with center at the point $(x_0, y_0)$ and radius $R$.

Summarizing these two possible cases ($C_4 = 0$ and $C_4 \ne 0$), we come to the following conclusion. If we use segments that are described by the curves $\sum_{i=1}^{4} C_i f_i(x, y) = 0$
with the above-described functions $f_i(x, y)$, then we approximate a curve by straight-line segments and circle arcs.

Main problem. When the functions $f_i(x, y)$ are chosen, the problem of choosing $C_i$ is easy to solve: there exist many software programs that implement least-squares methods and find the values $C_i$ that are the best fit for a curve that is normally described by a sequence of points $(x_j, y_j)$ that densely follow each other. However, as we have already mentioned, the quality of this approximation and the quality of the resulting compression essentially depend on the choice of the basis: for some bases the approximation is much better, for some it is much worse. So the problem is: what basis to choose for approximations?

Why is this problem difficult? We want to find a basis that is in some reasonable sense the best. For example, we may look for a basis for which an average error is the smallest possible, or for which the running time of the corresponding approximation algorithm is the smallest possible, etc. The trouble is that even for the simplest bases we do not know how to compute any of these possible characteristics. How can we find a basis for which some characteristic is optimal if we cannot compute this characteristic even for a single basis? There does not seem to be a likely answer.

However, we will show that this problem is solvable (and give the solution). The basic idea of our solution is that we consider all possible optimization criteria on the set of all bases, impose some reasonable invariance demands, and from them deduce the precise formulas for the optimal basis. This approach has been applied to various problems in [6, 13–16].

What family is the best? Among all $m$-dimensional families of functions, we want to choose the best one.
In formalizing what “the best” means we follow the general idea outlined in [13] and applied to various areas of computer science (expert systems in [14], neural networks in [15], and genetic algorithms in [16]); there exists also an application to physical chemistry, see [6]. The criteria to choose may be computational simplicity, minimal average approximation error, or something else. In mathematical optimization problems, numeric criteria are most frequently used, where to every family we assign some value expressing its performance, and choose a family for which this value is maximal. However,
it is not necessary to restrict ourselves to such numeric criteria only. For example, if we have several different families that have the same average approximation error E, we can choose among them the one for which the average running time T of an approximation algorithm is the smallest. In this case, the actual criterion that we use to compare two families is not numeric, but more complicated: a family Φ1 is better than the family Φ2 if and only if either E(Φ1 ) < E(Φ2 ) or E(Φ1 ) = E(Φ2 ) and T (Φ1 ) < T (Φ2 ). A criterion can be even more complicated. What a criterion must do is to allow us for every pair of families to tell whether the first family is better with respect to this criterion (we’ll denote it by Φ1 > Φ2 ), or the second is better (Φ1 < Φ2 ) or these families have the same quality in the sense of this criterion (we’ll denote it by Φ1 ∼ Φ2 ). The criterion for choosing the best family must be consistent. Of course, it is necessary to demand that these choices be consistent, e.g., if Φ1 > Φ2 and Φ2 > Φ3 then Φ1 > Φ3 . The criterion must be final. Another natural demand is that this criterion must be final in the sense that it must choose a unique optimal family (i.e., a family that is better with respect to this criterion than any other family). The reason for this demand is very simple. If a criterion does not choose any family at all, then it is of no use. If several different families are “the best” according to this criterion, then we still have a problem choosing the absolute “best”. Therefore, we need some additional criterion for that choice. For example, if several families turn out to have the same average approximation error, we can choose among them a family with minimal computational complexity. 
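The non-numeric criterion described above (first compare the average errors, and use the running time only to break ties) is a lexicographic comparison. The following Python sketch shows one way to express it; the class name `FamilyScore`, the function `compare`, and the sample numbers are hypothetical and serve only as an illustration.

```python
from typing import NamedTuple


class FamilyScore(NamedTuple):
    avg_error: float  # E(family): average approximation error
    avg_time: float   # T(family): average running time


def compare(s1: FamilyScore, s2: FamilyScore) -> str:
    """Return '>' if family 1 is better, '<' if family 2 is better,
    and '~' if the two families have the same quality."""
    # tuples are compared lexicographically: error first, time on ties
    k1 = (s1.avg_error, s1.avg_time)
    k2 = (s2.avg_error, s2.avg_time)
    if k1 < k2:
        return ">"
    if k2 < k1:
        return "<"
    return "~"


# Hypothetical scores of three families:
lines = FamilyScore(avg_error=0.10, avg_time=5.0)
arcs = FamilyScore(avg_error=0.10, avg_time=3.0)
other = FamilyScore(avg_error=0.20, avg_time=1.0)
print(compare(lines, other), compare(lines, arcs), compare(arcs, arcs))
```

Here a smaller error always wins regardless of the running time, and the running time matters only when the errors coincide.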
So what we actually do in this case is abandon that criterion for which there were several “best” families, and consider a new “composite” criterion instead: $\Phi_1$ is better than $\Phi_2$ according to this new criterion if either it was better according to the old criterion, or according to the old criterion they had the same quality and $\Phi_1$ is better than $\Phi_2$ according to the additional criterion. In other words, if a criterion does not allow us to choose a unique best family, it means that this criterion is not ultimate; we have to modify it until we come to a final criterion that will have that property.

The criterion must be reasonably invariant. As we have already mentioned, in many applications it is desirable to be able to rotate the image, i.e., change the coordinates from $r = (x, y)$ to the rotated ones $r' = (x', y') = U(r)$, where by $U$ we denoted the coordinate transformation induced by rotation (its explicit expression was given before).

Suppose now that we first fixed some coordinates, compared two different bases $f_i(r)$ and $\tilde f_i(r)$, and it turned out that the basis $f_i(r)$ is better (or, to be more precise, that the family $\Phi = \{\sum_i C_i f_i(r)\}$ is better than the family $\tilde\Phi = \{\sum_i C_i \tilde f_i(r)\}$). This means the following: suppose that we have a family of curves $(x_j^c, y_j^c)$ (where $c$ is an index that characterizes the curve). Then for these curves, in some reasonable average sense, the quality of approximation by segments $F(x, y) = 0$ with $F \in \Phi$ is better than the quality of approximation by the segments $F(x, y) = 0$ with $F \in \tilde\Phi$.
It sounds reasonable to expect that the relative quality of two bases should not change if we rotate the axes. After such a rotation, a curve that was initially described by its points $r_j^c = (x_j^c, y_j^c)$ will now be characterized by the values $U(r_j^c)$. So we expect that when we apply the same approximation algorithms to the rotated data, the results of approximating by segments from $\Phi$ will still be better than the results of applying $\tilde\Phi$.

Let us now take into consideration that rotation is a symmetry transformation in the sense that the same approximation problem can be viewed from two different viewpoints. Namely, if we describe this approximation problem in the new coordinates $(x', y')$, then the description is that we approximate the rotated curve $U(r_j^c)$ by segments that have the prescribed type in these new coordinates, i.e., they are described by the formula $F(r') = 0$ with $F$ from an appropriate family. But we could as well consider the same problem in the old coordinates $(x, y)$ that can be determined from the new ones by the inverse transformation $U^{-1}$: $r = U^{-1}(r')$. In this case the curve is described by its old coordinates $r_j$; as for the approximating segments, their expression in the old coordinates can be obtained by substituting the expression $U(r)$ instead of $r'$ into the expression that describes this segment in terms of the new coordinates. So this expression is $F(U(r)) = 0$.

Let's now substitute into the equation $F(U(r)) = 0$ the expression for $F(x, y)$ in terms of the basis functions ($F(x, y) = \sum_i C_i f_i(x, y)$, or $F(r) = \sum_i C_i f_i(r)$). We can then conclude that from the viewpoint of the old coordinates we are approximating the same curve with segments that are described by the equations $\sum_i C_i f_i(U(r)) = 0$. These are arbitrary linear combinations of the functions $f_i(U(r))$. In other words, from the viewpoint of the old coordinate system we are approximating the same curve, but we are using a different basis, consisting of functions $f_i^U(r) = f_i(U(r))$. Similarly, approximation of a rotated image by functions from $\tilde\Phi$ is equivalent to approximating the original (non-rotated) image by the functions from a basis $\{\tilde f_i(U(r))\}$.

So from the demand that the relative quality of the two bases should not change after the rotation, we can conclude that if a basis $\{f_i(r)\}$ is better than $\{\tilde f_i(r)\}$, then for every rotation $U$ the basis $\{f_i^U(r)\}$ must be better than $\{\tilde f_i^U(r)\}$, where $f_i^U(r) = f_i(U(r))$ and $\tilde f_i^U(r) = \tilde f_i(U(r))$.

Another reasonable demand is related to the possibility to change the origin of the coordinate system. This happens in navigation problems or in robotic vision, if we take our position as an origin. Then, whenever we move, the origin changes and therefore all the coordinates change: a point that initially had coordinates $r = (x, y)$ will now have new coordinates $r' = (x - x_0, y - y_0)$, where $x_0$ and $y_0$ are the old coordinates of our current position. This transformation is called translation and is usually denoted by $T$. It is again reasonable to expect that the relative quality of two bases should not change if we change the origin (i.e., apply an arbitrary translation). If we consider this as a demand that the reasonable preference relation must satisfy, then arguments like the ones we used for rotation allow us to make the following conclusion: if
a basis $\{f_i(r)\}$ is better than $\{\tilde f_i(r)\}$, then for every translation $T$ the basis $\{f_i^T(r)\}$ must be better than $\{\tilde f_i^T(r)\}$, where $f_i^T(r) = f_i(T(r))$ and $\tilde f_i^T(r) = \tilde f_i(T(r))$.

The last invariance demand is related to the possibility of using different units to measure coordinates. For example, in navigation we can measure them in miles or in kilometers, and thus get different numerical values for the same point on the map. If $(x, y)$ are the numerical values of the coordinates in miles, then the values of the same coordinates in kilometers will be $(cx, cy)$, where $c = 1.6\ldots$ is the ratio of these two units (number of kilometers per mile).

Suppose now that we first used one unit, compared two different bases $f_i(r)$ and $\tilde f_i(r)$, and it turned out that $f_i(r)$ is better (or, to be more precise, that the family $\Phi = \{\sum_i C_i f_i(r)\}$ is better than the family $\tilde\Phi = \{\sum_i C_i \tilde f_i(r)\}$). It sounds reasonable to expect that the relative quality of the two bases should not depend on what units we used. So we expect that when we apply the same methods, but to the data in which coordinates are expressed in the new units (in which we have $c r_j$ instead of $r_j$), the results of applying $f_i(r)$ will still be better than the results of applying $\tilde f_i(r)$.

But again we can view this same approximation process from two different viewpoints: if we consider the numeric values expressed in the new units (kilometers), then we approximate the values $r'_j = c r_j$ by the segments $\sum_i C_i f_i(r') = 0$. But we could as well consider the same approximation problem in old units (in this case, in miles). Then we have a problem of approximating the points $r_j$ by the segments whose equations are $\sum_i C_i f_i(cr) = 0$. This is equivalent to applying new basis functions $f_i^c(r)$, defined as $f_i^c(r) = f_i(cr)$, to the numerical values of coordinates in the old units.

So, just like in the two previous cases, we conclude that if a basis $\{f_i(r)\}$ is better than $\{\tilde f_i(r)\}$, then the basis $\{f_i^c(r)\}$ must be better than $\{\tilde f_i^c(r)\}$, where $f_i^c(r) = f_i(cr)$ and $\tilde f_i^c(r) = \tilde f_i(cr)$. This must be true for every $c$, because we can use not only miles and kilometers, but other units as well. Now we are ready for the formal definitions.
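The "two viewpoints" argument can be illustrated numerically. In the sketch below, `rescaled` and `translated` build the functions $f^c(r) = f(cr)$ and $f^T(r) = f(T(r))$ from a given $f$. The segment equation `F`, the sample point, and the helper names are assumptions made for illustration only.

```python
KM_PER_MILE = 1.609344  # the ratio c between the two units


def rescaled(f, c):
    """Return f^c, defined by f^c(x, y) = f(c*x, c*y)."""
    return lambda x, y: f(c * x, c * y)


def translated(f, x0, y0):
    """Return f^T, defined by f^T(x, y) = f(x - x0, y - y0)."""
    return lambda x, y: f(x - x0, y - y0)


def F(x, y):
    """Hypothetical segment equation F(r') = 0, written in kilometers."""
    return x * y - 4.0


# A point on the curve, in the new units (kilometers) ...
point_km = (2.0, 2.0)
# ... and the same point expressed in the old units (miles).
point_mi = (point_km[0] / KM_PER_MILE, point_km[1] / KM_PER_MILE)

# Viewed in miles, the same curve is the zero set of F^c(r) = F(c*r).
F_c = rescaled(F, KM_PER_MILE)
residual = F_c(*point_mi)
print(abs(residual) < 1e-9)
```

The point satisfies $F(r') = 0$ in kilometers and, up to rounding, $F^c(r) = 0$ in miles, which is the equivalence used in the text.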
3 Definitions and the Main Result

Definitions. Assume that an integer $m$ is fixed. By a basis we mean a set of $m$ analytical linearly independent functions $f_i(x, y)$, $i = 1, 2, \ldots, m$ (from 2-dimensional space $R^2$ to the set $R$ of real numbers). By an $m$-dimensional family of functions we mean the set of all functions of the type $f(x, y) = \sum_{i=1}^{m} C_i f_i(x, y)$ for some basis $\{f_i(x, y)\}$, where $C_i$ are arbitrary constants. The set of all $m$-dimensional families will be denoted by $S_m$.

Comment. “Linearly independent” means that all these linear combinations $\sum_i C_i f_i(x, y)$ are different. If the functions $f_i(x, y)$ are not linearly independent, then one of them can be expressed as a linear combination of the others,
and so the set of all their linear combinations can be obtained by using a subset of […]

[…] 0 the following two conditions are true:

(i) if $\Phi$ is better than $\tilde\Phi$ in the sense of this criterion (i.e., $\tilde\Phi < \Phi$), then $c\tilde\Phi < c\Phi$;

(ii) if $\Phi$ is equivalent to $\tilde\Phi$ in the sense of this criterion (i.e., $\Phi \sim \tilde\Phi$), then $c\Phi \sim c\tilde\Phi$.

Comment. As we have already remarked, the demands that the optimality criterion is final, rotation-, translation- and unit-invariant are quite reasonable. The only problem with them is that at first glance they may seem rather weak. However, they are not, as the following theorem shows:

Theorem 1 If an $m$-dimensional family $\Phi$ is optimal in the sense of some optimality criterion that is final, rotation-, translation- and unit-invariant, then all its elements are polynomials.

(The proofs are given in Sect. 4.)

Comment. Curves that are described by the equation $F(x, y) = 0$ with a polynomial $F$ are called algebraic curves in mathematics, so we can reformulate Theorem 1 by saying that every optimal approximation family consists only of algebraic curves.

For small $m$ we can explicitly enumerate the optimal curves. In order to do it, let's introduce some definitions.
Definitions. If a function $F(x, y)$ belongs to a family $\Phi$, we say that a curve $F(x, y) = 0$ belongs to a family $\Phi$; we'll also say that a family $\Phi$ consists of all the curves $F(x, y) = 0$ for all functions $F(x, y)$ from $\Phi$. By a hyperbola we mean a curve $y = C/x$ and all the results of its rotation, translation or unit change.

Theorem 2 Suppose that an $m$-dimensional family $\Phi$ is optimal in the sense of some optimality criterion that is final, rotation-, translation- and unit-invariant. Then:
• for $m = 1$, $\Phi$ consists of all constant functions (so the equation $F(x, y) = 0$ does not define any approximation curves);
• for $m = 2$, there is no such criterion;
• for $m = 3$, the optimal family consists of all straight lines;
• for $m = 4$, the optimal family consists of all straight lines and circles;
• for $m = 5$, the optimal family consists of all straight lines and hyperbolas.

Comments.
(1) This result explains why straight-line approximation is the most often used, and why approximation by straight line segments and circle arcs turned out to be sufficiently good [2, 11].
(2) This result also shows that the next family of approximation curves that is worth trying is segments of hyperbolas. Preliminary experiments with hyperbolas were performed in [9]: the isolines (lines of equal depth) of the Pacific Ocean were approximated, and hyperbolas turned out to be a reasonably good approximation: namely, when using them one needs 2–3 times fewer parameters than when using straight line segments and circle arcs. So we get a 2–3 times compression.
(3) Theorems 1 and 2 subsume the (weaker) results that appeared in [8, 12].
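As a sanity check of Theorem 2 (not a proof), one can verify with exact rational arithmetic that the three families it names are mapped into themselves by a sample rotation, a sample translation and a sample unit change. The polynomial representation, the helper names and the specific transformations below are all illustrative assumptions. The family `only_x2` is a made-up counterexample that is not rotation-invariant.

```python
from fractions import Fraction as Fr

# A polynomial in x, y is a dict {(i, j): coefficient of x**i * y**j}.
ONE, X, Y = {(0, 0): Fr(1)}, {(1, 0): Fr(1)}, {(0, 1): Fr(1)}


def add(p, q):
    out = dict(p)
    for mono, coef in q.items():
        out[mono] = out.get(mono, 0) + coef
    return {m: c for m, c in out.items() if c != 0}


def mul(p, q):
    out = {}
    for (i, j), a in p.items():
        for (k, l), b in q.items():
            out[(i + k, j + l)] = out.get((i + k, j + l), 0) + a * b
    return {m: c for m, c in out.items() if c != 0}


def scale(p, factor):
    return {m: c * factor for m, c in p.items() if c * factor != 0}


def substitute(p, new_x, new_y):
    """Return the polynomial p(new_x, new_y)."""
    result = {}
    for (i, j), a in p.items():
        term = {(0, 0): a}
        for _ in range(i):
            term = mul(term, new_x)
        for _ in range(j):
            term = mul(term, new_y)
        result = add(result, term)
    return result


def rank(polys):
    """Rank of a list of polynomials, by exact Gaussian elimination."""
    monos = sorted({m for p in polys for m in p})
    rows = [[Fr(p.get(m, 0)) for m in monos] for p in polys]
    r = 0
    for col in range(len(monos)):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                f = rows[i][col] / rows[r][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_invariant(basis, new_x, new_y):
    """True if every transformed basis function stays in span(basis)."""
    moved = [substitute(f, new_x, new_y) for f in basis]
    return rank(basis + moved) == rank(basis)


# Sample transformations; the rotation uses cos = 3/5, sin = 4/5 to stay exact.
ROTATION = (add(scale(X, Fr(3, 5)), scale(Y, Fr(-4, 5))),
            add(scale(X, Fr(4, 5)), scale(Y, Fr(3, 5))))
TRANSLATION = (add(X, scale(ONE, Fr(-2))), add(Y, scale(ONE, Fr(7))))
UNIT_CHANGE = (scale(X, Fr(3)), scale(Y, Fr(3)))

X2, Y2, XY = mul(X, X), mul(Y, Y), mul(X, Y)
lines = [ONE, X, Y]                                   # m = 3
circles = lines + [add(X2, Y2)]                       # m = 4
hyperbolas = lines + [add(X2, scale(Y2, Fr(-1))), XY]  # m = 5
only_x2 = lines + [X2]                                # counterexample

for name, fam in [("lines", lines), ("circles", circles),
                  ("hyperbolas", hyperbolas), ("only_x2", only_x2)]:
    checks = [is_invariant(fam, *t) for t in (ROTATION, TRANSLATION, UNIT_CHANGE)]
    print(name, checks)
```

Passing these spot checks for one rotation, one translation and one scaling does not establish invariance under all of them; it only illustrates what the invariance conditions require of a family.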
4 Proofs

1. Let us first prove that the optimal family $\Phi_{\mathrm{opt}}$ exists and is rotation-invariant in the sense that $\Phi_{\mathrm{opt}} = U(\Phi_{\mathrm{opt}})$ for an arbitrary rotation $U$. Indeed, we assumed that the optimality criterion is final, therefore there exists a unique optimal family $\Phi_{\mathrm{opt}}$. Let's now prove that this optimal family is rotation-invariant. The fact that $\Phi_{\mathrm{opt}}$ is optimal means that for every other $\Phi$, either $\Phi < \Phi_{\mathrm{opt}}$ or $\Phi_{\mathrm{opt}} \sim \Phi$. If $\Phi_{\mathrm{opt}} \sim \Phi$ for some $\Phi \ne \Phi_{\mathrm{opt}}$, then from the definition of the optimality criterion we can easily deduce that $\Phi$ is also optimal, which contradicts the fact that there is only one optimal family. So for every $\Phi$ either $\Phi < \Phi_{\mathrm{opt}}$ or $\Phi_{\mathrm{opt}} = \Phi$. Take an arbitrary rotation $U$ and let $\Phi = U(\Phi_{\mathrm{opt}})$. If $\Phi = U(\Phi_{\mathrm{opt}}) < \Phi_{\mathrm{opt}}$, then from the invariance of the optimality criterion (condition (ii)) we conclude that $U^{-1}(U(\Phi_{\mathrm{opt}})) = \Phi_{\mathrm{opt}} < U^{-1}(\Phi_{\mathrm{opt}})$,
and that conclusion contradicts the choice of $\Phi_{\mathrm{opt}}$ as the optimal family. So the inequality $\Phi = U(\Phi_{\mathrm{opt}}) < \Phi_{\mathrm{opt}}$ is impossible, and therefore $\Phi_{\mathrm{opt}} = \Phi = U(\Phi_{\mathrm{opt}})$, i.e., the optimal family is really rotation-invariant.

2. Similar arguments show that the optimal family is translation-invariant and unit-invariant, i.e., $\Phi_{\mathrm{opt}} = T(\Phi_{\mathrm{opt}})$ for any translation $T$ and $\Phi_{\mathrm{opt}} = c\Phi_{\mathrm{opt}}$ for any $c > 0$.

3. Let's use these invariances to prove that all the functions $F(x, y)$ from the optimal family are polynomials. We supposed that all the functions $F(x, y)$ from $\Phi$ are analytical, so they can be expanded into a Taylor series:

$$F(x, y) = a + bx + cy + dx^2 + exy + fy^2 + \cdots$$

Every term in this expansion is of the type $c\,x^k y^l$, where $c$ is a real number and $k$ and $l$ are non-negative integers. The sum $k + l$ is called the order of this term (so constant terms are of order 0, linear terms are of order 1, quadratic terms are of order 2, etc.). For every function $F(x, y)$, by $F^{(k)}(x, y)$ we'll denote the sum of all terms of order $k$ in its expansion. Then

$$F(x, y) = F^{(0)}(x, y) + F^{(1)}(x, y) + F^{(2)}(x, y) + \cdots,$$

where $F^{(0)}(x, y) = a$, $F^{(1)}(x, y) = bx + cy$, $F^{(2)}(x, y) = dx^2 + exy + fy^2$, etc. Some terms in this equation can be equal to 0. For example, for a linear function $F^{(0)}(x, y) = 0$, for quadratic functions $F^{(0)}(x, y) = F^{(1)}(x, y) = 0$ and the first non-zero term is $F^{(2)}(x, y)$, etc. The first term $F^{(k)}(x, y)$ in the above expansion that is different from 0 is called the main term of this function.

Comment. The reason why such a definition is often used is that when $x, y \to 0$, then these terms are really the main terms in the sense that all the others are asymptotically smaller.

4. Let us now use the unit-invariance of the optimal family $\Phi_{\mathrm{opt}}$ to prove that if a function $F(x, y)$ belongs to it, then so does its main term $F^{(k)}(x, y)$. Indeed, due to unit invariance, for every $c > 0$ the function $F(cr) = F(cx, cy)$ also belongs to $\Phi_{\mathrm{opt}}$.
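The decomposition into terms of a given order, and the sense in which the main term dominates near the origin, can be illustrated on a small polynomial. The dictionary representation, the helper names and the sample polynomial `F` are assumptions chosen for this sketch.

```python
def terms_by_order(poly):
    """Group monomials {(k, l): coeff} of x**k * y**l by their order k + l."""
    parts = {}
    for (k, l), coeff in poly.items():
        if coeff != 0:
            parts.setdefault(k + l, {})[(k, l)] = coeff
    return parts


def main_term(poly):
    """Return (order, term) for the lowest-order non-zero part of poly."""
    parts = terms_by_order(poly)
    order = min(parts)
    return order, parts[order]


def evaluate(poly, x, y):
    return sum(c * x**k * y**l for (k, l), c in poly.items())


# Hypothetical example: F(x, y) = 3xy - y^2 + 2x^3 + x^2 y^2
F = {(1, 1): 3.0, (0, 2): -1.0, (3, 0): 2.0, (2, 2): 1.0}
order, lead = main_term(F)
print(order, lead)


def ratio(c, x=0.7, y=-0.4):
    """F / (main term), both evaluated at the shrunken point (c*x, c*y)."""
    return evaluate(F, c * x, c * y) / evaluate(lead, c * x, c * y)


for c in (1.0, 0.1, 0.001):
    print(c, ratio(c))  # the ratio approaches 1 as c shrinks
```

As the point approaches the origin, the ratio tends to 1, which is the asymptotic dominance mentioned in the Comment above.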
Since any family is a set of all linear combinations of some functions $f_i(x, y)$, it is a linear space, i.e., a linear combination of any two functions from this family also belongs to this family. In particular, since $F(cx, cy)$ belongs to $\Phi_{\mathrm{opt}}$, we conclude that the function $c^{-k} F(cx, cy)$ belongs to $\Phi_{\mathrm{opt}}$. Since the main term in $F(x, y)$ has order $k$, the expansion for $F(x, y)$ starts with the $k$th term:

$$F(x, y) = F^{(k)}(x, y) + F^{(k+1)}(x, y) + F^{(k+2)}(x, y) + \cdots$$
Substituting $cx$ and $cy$ instead of $x$ and $y$, we conclude that

$$F(cx, cy) = F^{(k)}(cx, cy) + F^{(k+1)}(cx, cy) + F^{(k+2)}(cx, cy) + \cdots$$

Every term of order $p$ is a sum of several monomials of order $p$, i.e., monomials of the type $x^q y^{p-q}$. If we substitute $cx$ and $cy$ instead of $x$ and $y$ into each of these monomials, we get $c^q x^q c^{p-q} y^{p-q}$. If we combine together the powers of $c$, we conclude that this monomial turns into $c^p x^q y^{p-q}$. So the result of substituting $cx$ and $cy$ into each monomial of $F^{(p)}(x, y)$ is equivalent to multiplying this monomial by $c^p$. Therefore, the same is true for the whole term, i.e., $F^{(p)}(cx, cy) = c^p F^{(p)}(x, y)$. Substituting this formula into the above expression for $F(cx, cy)$, we conclude that

$$F(cx, cy) = c^k F^{(k)}(x, y) + c^{k+1} F^{(k+1)}(x, y) + c^{k+2} F^{(k+2)}(x, y) + \cdots$$

After multiplying both sides of this equation by $c^{-k}$ we conclude that

$$c^{-k} F(cx, cy) = F^{(k)}(x, y) + c F^{(k+1)}(x, y) + c^2 F^{(k+2)}(x, y) + \cdots$$

We have already proved that the left-hand side of this equality belongs to $\Phi_{\mathrm{opt}}$. The linear space $\Phi_{\mathrm{opt}}$ is finite-dimensional, and therefore it is closed (i.e., it contains the limit of every convergent sequence of elements from it). If we let $c$ tend to 0, then all the terms in the right-hand side converge to 0, except the first one, $F^{(k)}(x, y)$. So the function $F^{(k)}(x, y)$ is the limit of the functions $c^{-k} F(cx, cy)$ from $\Phi_{\mathrm{opt}}$, and therefore this function also belongs to the optimal family $\Phi_{\mathrm{opt}}$. Thus we proved that the main term of any function from an optimal family belongs to this same optimal family.

5. Let us now prove that if a function $F(x, y)$ belongs to an optimal family, then all its terms $F^{(p)}(x, y)$ (and not only its main term) belong to this same family. We will prove this consecutively (i.e., by induction) for $p = k, k + 1, k + 2, \ldots$ For $p = k$ it is already true. Suppose now that we have already proved that $F^{(p)}(x, y) \in \Phi_{\mathrm{opt}}$ for $p = k, k + 1, k + 2, \ldots, q$, and let's prove this statement for $p = q + 1$. If $F^{(q+1)}(x, y)$ is identically 0, then this is trivially true, because $\Phi_{\mathrm{opt}}$ is a linear space, and therefore it has to contain 0. So it is sufficient to consider the case when $F^{(q+1)}(x, y) \ne 0$. In this case, by definition of the terms $F^{(p)}(x, y)$, we have the following expansion:

$$F(x, y) = F^{(k)}(x, y) + F^{(k+1)}(x, y) + \cdots + F^{(q)}(x, y) + F^{(q+1)}(x, y) + \cdots$$

We assumed that $F(x, y)$ belongs to $\Phi_{\mathrm{opt}}$, and we proved that $F^{(p)}(x, y)$ belongs to $\Phi_{\mathrm{opt}}$ for $p = k, k + 1, \ldots, q$. If we move all the terms which we have already proven to belong to $\Phi_{\mathrm{opt}}$ to the left-hand side, we get the following equation:

$$F(x, y) - F^{(k)}(x, y) - F^{(k+1)}(x, y) - \cdots - F^{(q)}(x, y) = F^{(q+1)}(x, y) + \cdots$$
Since we consider the case when $F^{(q+1)}(x, y) \ne 0$, this means that $F^{(q+1)}(x, y)$ is the main term of the function in the left-hand side. But $\Phi_{\mathrm{opt}}$ is a linear space, therefore it contains a linear combination of any functions from it. In particular, it contains the function $F(x, y) - F^{(k)}(x, y) - F^{(k+1)}(x, y) - \cdots - F^{(q)}(x, y)$. So according to statement 4 of this proof, the main term of this function also belongs to $\Phi_{\mathrm{opt}}$. But we have already mentioned that this main term is $F^{(q+1)}(x, y)$. So $F^{(q+1)}(x, y)$ also belongs to $\Phi_{\mathrm{opt}}$. The inductive step is proven, and so is this statement 5.

6. Now we are ready to prove Theorem 1, i.e., to prove that all functions from the optimal family $\Phi_{\mathrm{opt}}$ are polynomials. Indeed, suppose that some function $F(x, y)$ from this family is not a polynomial. This means that its Taylor expansion has infinitely many terms, and, therefore, infinitely many terms $F^{(k)}(x, y)$ in its expansion are different from 0. According to 5, all these terms belong to $\Phi_{\mathrm{opt}}$. But since they are all polynomials of different orders, they are all linearly independent. So $\Phi_{\mathrm{opt}}$ contains infinitely many linearly independent functions, which contradicts our assumption that $\Phi_{\mathrm{opt}}$ is finite-dimensional and can therefore contain at most $m$ linearly independent functions. This contradiction proves that the Taylor expansion of $F(x, y)$ cannot contain infinitely many terms, so it contains only finitely many terms and $F(x, y)$ is therefore a polynomial. Theorem 1 is proved.

7. In order to prove Theorem 2, let us first prove that if $F(x, y)$ belongs to $\Phi_{\mathrm{opt}}$, then so do its partial derivatives $\partial F(x, y)/\partial x$ and $\partial F(x, y)/\partial y$. Indeed, since $\Phi_{\mathrm{opt}}$ is translation-invariant, it contains the function $F(x + a, y)$ that is obtained from $F(x, y)$ by a translation by $r_0 = (a, 0)$. Since $\Phi_{\mathrm{opt}}$ is a linear space and it contains both $F(x, y)$ and $F(x + a, y)$, it must also contain their difference $F(x + a, y) - F(x, y)$ and, moreover, the linear combination $(F(x + a, y) - F(x, y))/a$.
We have already mentioned that since $\Phi_{\mathrm{opt}}$ is finite-dimensional, it is closed, i.e., contains the limit of any convergent sequence of its elements. As $a \to 0$, the expression $(F(x + a, y) - F(x, y))/a$ evidently converges to the partial derivative $\partial F(x, y)/\partial x$, so we conclude that the partial derivative also belongs to $\Phi_{\mathrm{opt}}$. For the partial derivative with respect to $y$, the proof is similar.

8. Let's now prove that if $\Phi_{\mathrm{opt}}$ contains at least one term of order $k > 0$, then it must contain non-zero terms of orders $k - 1, k - 2, \ldots, 0$. It is sufficient to show that it contains a non-zero term of order $k - 1$; all the other cases can then be proved by mathematical induction. Indeed, suppose that $F(x, y)$ is of order $k$. Then at least one of the derivatives $\partial F(x, y)/\partial x$ and $\partial F(x, y)/\partial y$ is different from 0 (because otherwise $F$ is identically constant and therefore cannot be of order $k > 0$). This non-zero partial derivative is of order $k - 1$ and, according to 7, it belongs to $\Phi_{\mathrm{opt}}$. The statement is proven.

9. From 8 we conclude that an optimal $m$-dimensional family can contain only terms of order at most $m - 1$. Indeed, with any term of order $k$ it contains $k + 1$ linearly independent terms of different orders $k, k - 1, k - 2, \ldots, 2, 1, 0$, and in an
$m$-dimensional space there can be at most $m$ linearly independent elements. So $k + 1 \le m$, hence $k \le m - 1$.

10. For $m = 1$ we can already conclude that $\Phi_{\mathrm{opt}}$ contains only constants: in this case this inequality turns into $k \le 0$. So all the functions $F(x, y)$ can contain only terms of order 0, i.e., they are all constants.

11. Let us now consider the case $2 \le m \le 3$. According to 9, the family $\Phi_{\mathrm{opt}}$ can contain only terms of order $\le 2$. If it contains only terms of order 0, then all the functions $F(x, y)$ from this family are constants, and therefore the dimension $m$ of this family is $m = 1$. So since $m \ge 2$, the family $\Phi_{\mathrm{opt}}$ must contain at least one function of order $> 0$. If it contains a function of order 2, then according to 8 it must also contain a linear function. So in all the cases $\Phi_{\mathrm{opt}}$ contains a non-zero linear function $F(x, y) = bx + cy$. Applying the same statement 8 once again, we conclude that $\Phi_{\mathrm{opt}}$ must also contain a 0th order function (i.e., a constant).

Now we can use rotation-invariance of the optimal family, i.e., the fact that the optimal family contains a function $F(U(x, y))$ together with any $F(x, y)$ for an arbitrary rotation $U$. In particular, if we apply a rotation $U$ by $90^\circ$ (for which $U(x, y) = (y, -x)$) to a function $F(x, y) = bx + cy$, we conclude that $F(U(x, y)) = by - cx$ also belongs to $\Phi_{\mathrm{opt}}$. Let's show that these two functions $bx + cy$ and $by - cx$ are linearly independent. Indeed, if $c = 0$ or $b = 0$, this is evident. Let us consider the case when $b \ne 0$ and $c \ne 0$. Then if they are linearly dependent, this means that $bx + cy = \lambda(by - cx)$ for some $\lambda$ and for all $x, y$. If two linear functions coincide, then their coefficients coincide, i.e., $b = -\lambda c$ and $c = \lambda b$, and, multiplying these equalities, we get $bc = -\lambda^2 bc$. Since $b \ne 0$ and $c \ne 0$, we conclude that $\lambda^2 = -1$, but for real numbers $\lambda$ this is impossible. So these two functions are linearly independent elements of the 2-dimensional space of all linear functions $\{C_1 x + C_2 y\}$.
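The independence of $bx + cy$ and $by - cx$ can also be seen from a $2 \times 2$ determinant: it equals $b^2 + c^2$, which is positive unless $b = c = 0$. A minimal sketch, with hypothetical helper names and sample coefficients:

```python
def rotate_90(b, c):
    """Coefficients (of x, of y) of F(U(x, y)) for F = b*x + c*y
    and the 90-degree rotation U(x, y) = (y, -x): F(U) = -c*x + b*y."""
    return (-c, b)


def linearly_independent(u, v):
    """Two coefficient pairs are independent iff their determinant is non-zero."""
    return u[0] * v[1] - u[1] * v[0] != 0


b, c = 3.0, -2.0  # made-up coefficients, not both zero
original = (b, c)
turned = rotate_90(b, c)

# The determinant of the rows (b, c) and (-c, b) is b*b + c*c.
determinant = original[0] * turned[1] - original[1] * turned[0]
print(determinant, linearly_independent(original, turned))
```

This is an alternative check of the same fact, not the argument used in the text.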
So their linear combinations form a 2-dimensional subspace of a 2-dimensional space, and therefore coincide with the whole space. So the optimal family $\Phi_{\mathrm{opt}}$ contains all linear functions. We have already shown that it contains all constants, so its dimension is at least 3. Therefore $m = 2$ is impossible. If $m = 3$, then we already have 3 linearly independent elements in our space: $1$, $x$ and $y$. Every three linearly independent elements of a 3-dimensional space form its basis, so the elements $F(x, y)$ of $\Phi_{\mathrm{opt}}$ are arbitrary linear functions $a + bx + cy$. Therefore all the curves of the type $F(x, y) = 0$ are straight lines.

12. Now let us consider the case $m = 4$. In this case, $\Phi_{\mathrm{opt}}$ cannot contain only linear functions, because then its dimension would be equal to 3. So it must contain at least one function of order $\ge 2$. If it contains a function of order 3 or more, then it must also contain a function of order 2, of order 1, etc. From the fact that it contains a function of order 1 we conclude (as in 11) that it contains all linear functions. So in this case we have at least five linearly independent functions: $1$, $x$, $y$, a function of order 2, and a function of order 3. But the whole dimension of the space $\Phi_{\mathrm{opt}}$ equals 4, so there cannot be 5 linearly independent functions in it.
This contradiction proves that $\Phi_{\mathrm{opt}}$ cannot (for $m = 4$) contain any function of order greater than 2. So it must contain a function of order 2. Any term of order 2 can be represented as $F(x, y) = dx^2 + exy + fy^2$ for some real numbers $d, e, f$, i.e., as a quadratic form. It is known in linear algebra that by an appropriate rotation $U$ (namely, a rotation to a basis consisting of eigenvectors) any quadratic form can be represented in a diagonal form $F(x, y) = C_1 (x')^2 + C_2 (y')^2$, where $(x', y') = U(x, y)$. Since the optimal family is rotation-invariant, we can conclude that the function $\tilde F(x, y) = F(U^{-1}(x, y)) = C_1 x^2 + C_2 y^2$ also belongs to $\Phi_{\mathrm{opt}}$.

Let us show that $C_1 = C_2$. Indeed, if $C_1 \ne C_2$ and $C_1 \ne -C_2$, then by applying to this function rotation-invariance with the same rotation $\tilde U(x, y) = (y, -x)$ as in statement 11, we conclude that $\Phi_{\mathrm{opt}}$ must contain the function $\tilde F(\tilde U(x, y)) = C_2 x^2 + C_1 y^2$, which is linearly independent of $\tilde F(x, y)$. So we get at least 5 linearly independent functions in $\Phi_{\mathrm{opt}}$: $1$, $x$, $y$ and these two. This is impossible because the total dimension $m$ of $\Phi_{\mathrm{opt}}$ is 4.

If $C_2 = -C_1$, then since $\Phi_{\mathrm{opt}}$ is a linear space, it must contain, together with $\tilde F(x, y)$, the function $\tilde F(x, y)/C_1 = x^2 - y^2$. Applying a $45^\circ$ rotation $U$, we conclude that $\Phi_{\mathrm{opt}}$ must also contain the function $xy$. So again we have two linearly independent quadratic functions, $x^2 - y^2$ and $xy$, and the total dimension is at least 5, which contradicts $m = 4$.

So in all the cases except $C_1 = C_2$ we get a contradiction. Therefore $C_1 = C_2$, and we conclude that $\Phi_{\mathrm{opt}}$ contains the function $C_1(x^2 + y^2)$. Since $\Phi_{\mathrm{opt}}$ is a linear space, it must also contain the function that is obtained from that one by dividing by $C_1$, i.e., $x^2 + y^2$. So the 4-dimensional space $\Phi_{\mathrm{opt}}$ contains 4 linearly independent functions $1$, $x$, $y$, $x^2 + y^2$, hence it coincides with the set of their linear combinations. We have already shown that the corresponding curves are straight lines and circles.
For $m = 4$, the statement is proved.

13. Let us finally consider the case $m = 5$. Let us first show that in this case $\Phi_{\mathrm{opt}}$ also cannot contain terms of order 3 or more. Indeed, in this case out of 5 possible linearly independent functions at least one must be of order $> 2$, so the set of all polynomials of order $\le 2$ in $\Phi_{\mathrm{opt}}$ is of dimension $\le 4$. We have already proven (in 12) that in this case, all second order terms are proportional to $x^2 + y^2$. If there is a term of order 3 or more, there must be a non-trivial term of order 3 (according to 8). Its partial derivatives must also belong to $\Phi_{\mathrm{opt}}$ (due to 7) and, since they are of second order, they must be proportional to $x^2 + y^2$. So

$$\frac{\partial F(x, y)}{\partial x} = a(x^2 + y^2)$$

and

$$\frac{\partial F(x, y)}{\partial y} = b(x^2 + y^2)$$
for some constants $a$ and $b$ that are not both equal to 0 (because otherwise $F(x, y)$ would be a constant). If we differentiate the left-hand side of the first equality with respect to $y$ and the second one with respect to $x$, we get the same results. So the same must be true if we differentiate the right-hand sides. Therefore, we conclude that $2ay = 2bx$ for all $x, y$. This is possible only if $a = b = 0$, and, as we have already remarked, at least one of the values $a, b$ must be non-zero. This contradiction shows that third order terms in $\Phi_{\mathrm{opt}}$ are (for $m = 5$) impossible, and therefore $\Phi_{\mathrm{opt}}$ can contain only terms of order at most 2. If all second-order terms are proportional to $x^2 + y^2$, then the total dimension is 4. So there must be at least one quadratic term $F(x, y)$ in $\Phi_{\mathrm{opt}}$ that is not proportional to $x^2 + y^2$. Like in 12, we conclude that the family $\Phi_{\mathrm{opt}}$ must contain a function $C_1 x^2 + C_2 y^2$ with $C_1 \ne C_2$. Again like in 12, we can apply the $90^\circ$ rotation and conclude that the family also contains the function $C_2 x^2 + C_1 y^2$. Since $\Phi_{\mathrm{opt}}$ is a linear space, it must contain the difference of these functions, $(C_1 - C_2)(x^2 - y^2)$, and, since $C_1 \ne C_2$ (and hence $C_1 - C_2 \ne 0$), their linear combination $x^2 - y^2$. Again like in 12, we can now apply a $45^\circ$ rotation and thus prove that the function $xy$ belongs to $\Phi_{\mathrm{opt}}$. So we already have 5 linearly independent functions $1$, $x$, $y$, $x^2 - y^2$, $xy$ in the 5-dimensional linear space $\Phi_{\mathrm{opt}}$. Therefore this space $\Phi_{\mathrm{opt}}$ coincides with the set of all possible linear combinations of these functions. It can then be easily checked that the corresponding curves are either straight lines or hyperbolas.
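The final "easily checked" step can be made explicit with the standard discriminant test for conics. For the curve $a + bx + cy + d(x^2 - y^2) + exy = 0$ the quadratic part has coefficients $A = d$, $B = e$, $C = -d$, so $B^2 - 4AC = e^2 + 4d^2$, which is positive whenever $d$ and $e$ are not both zero. The function name and the tolerance below are illustrative assumptions.

```python
def classify(a, b, c, d, e, tol=1e-12):
    """Classify the curve a + b*x + c*y + d*(x**2 - y**2) + e*x*y = 0."""
    if abs(d) <= tol and abs(e) <= tol:
        if abs(b) <= tol and abs(c) <= tol:
            return "no curve (constant function)"
        return "straight line"
    # Quadratic part A*x^2 + B*x*y + C*y^2 with A = d, B = e, C = -d.
    # B^2 - 4AC = e^2 + 4*d^2 > 0 means the conic is of hyperbolic type.
    # A + C = 0 additionally means its asymptotes are perpendicular,
    # as for y = C/x and its rotations, translations and rescalings.
    discriminant = e * e - 4.0 * d * (-d)
    assert discriminant > 0
    return "hyperbola (possibly degenerate)"


print(classify(1.0, 2.0, 3.0, 0.0, 0.0))   # a + 2x + 3y = 0
print(classify(-1.0, 0.0, 0.0, 0.0, 1.0))  # xy = 1
print(classify(0.0, 0.0, 0.0, 1.0, 0.0))   # x^2 - y^2 = 0, a degenerate case
```

The "possibly degenerate" qualifier covers cases such as $x^2 - y^2 = 0$, where the curve is a pair of intersecting straight lines.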
References

1. Ballard, D.H.: Strip trees, a hierarchical representation for curves. Commun. ACM 24(5), 310–321 (1981)
2. Bartneck, N.: A general data structure for image analysis based on a description of connected components. Computing 42, 17–34 (1989)
3. Bellman, R.: Introduction to Matrix Analysis. McGraw-Hill, New York (1970)
4. Bunke, H.: Modellgesteuerte Bildanalyse. B. G. Teubner, Stuttgart (1985)
5. Davis, E.: Representing and Acquiring Geographic Knowledge. Pitman, London; Morgan Kaufmann, Los Altos (1986)
6. Davis, M., Ybarra, A., Kreinovich, V., Quintana, C.: Patterns of molecular aggregation in mixtures: what mathematical models fit them best? Technical report, University of Texas at El Paso, Computer Science Department (1992)
7. Freeman, H.: Computer processing of line-drawn images. Comput. Surv. 6(1), 57–97 (1974)
8. Freidzon, R.I., Kreinovich, V., et al.: A knowledge-based navigation system. Technical report, Soviet Ministry of Marine Industry, Soviet National Bureau of Cartography and Geodesy and Soviet Ministry of Defence, Leningrad (1989) (in Russian)
9. Gerasimov, A.I.: Applications of projective (piecewise-linear) transformations. Master's thesis, Leningrad Polytechnic Institute, Leningrad (1989) (in Russian)
10. Grebner, K.: Model based analysis of industrial scenes. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, pp. 28–33 (1986)
11. Kirchner, U.: Problem adapted modeling for industrial scene analysis. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, San Diego, CA (1989)
12. Kreinovich, V.: Sea cartography: optimal data representation. Technical report, Center for New Informational Technology Informatika, Leningrad (1989)
13. Kreinovich, V.: Group-theoretic approach to intractable problems. Lecture Notes in Computer Science, vol. 417, pp. 112–121. Springer, Berlin (1990)
14. Kreinovich, V., Kumar, S.: Optimal choice of &- and ∨-operations for expert values. In: Proceedings of the 3rd University of New Brunswick Artificial Intelligence Workshop, Fredericton, New Brunswick, Canada, pp. 169–178 (1990)
15. Kreinovich, V., Quintana, C.: Neural networks: what non-linearity to choose. In: Proceedings of the 4th University of New Brunswick Artificial Intelligence Workshop, Fredericton, New Brunswick, pp. 627–637 (1991)
16. Kreinovich, V., Quintana, C., Fuentes, O.: Genetic algorithms: what fitness scaling is optimal? Technical report, University of Texas at El Paso, Computer Science Department (1992)
17. Montanari, U.: A note on minimal length polygon approximation. Commun. ACM 13, 41–47 (1970)
18. Niemann, H.: Pattern Analysis. Springer, Berlin (1981)
19. Pavlidis, T.: Curve fitting as a pattern recognition problem. In: Proceedings of the 6th International Joint Conference on Pattern Recognition, Munich, vol. 2, p. 853 (1982)
Strength of Lime Stabilized Pavement Materials: Possible Theoretical Explanation of Empirical Dependencies

Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich
Abstract When building a road, it is often necessary to strengthen the underlying soil layer. This strengthening is usually done by adding lime. There are empirical formulas that describe how the resulting strength depends on the amount of added lime. In this paper, we provide a theoretical explanation for these empirical formulas.
1 Formulation of the Problem

Need for lime stabilized pavement layers. To have a stable road, it is often necessary to enhance the mechanical properties of the underlying soil layer. The most cost-efficient way of this enhancement is to mix soil with lime (sometimes coal fly ash is also added). Water is then added to this mix, and after a few days, the upper level of the soil becomes strengthened. The needed amount of lime depends on the soil.

How to determine the optimal amount of lime. To determine the proper amount of lime, soil specimens are brought into the lab, mixed with different amounts of lime, and tested. All chemical processes become faster when the temperature increases. So, to speed up the testing process, instead of simply waiting for several weeks as in the field, practitioners heat the sample to a higher temperature, thus speeding up the strengthening process; this higher-temperature speed-up is known as curing.
E. D. Rodriguez Velasquez Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru e-mail: [email protected]; [email protected] Department of Civil Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_16
Based on the testing results, we need to predict the strength of the soil in the field for different possible lime amounts, and thus select the lime amount that guarantees the desired strength.
Need for formulas describing the dependence of strength on curing temperature and other parameters. To be able to make this prediction, we need to know how the strength depends on the lime content $L$ (which is usually measured in percentage of lime in the dry weight of the mix). To be more precise, we need a formula with one or more parameters depending on the soil. We can then:
• determine the parameters based on the testing results, and then
• use the corresponding formula to predict the soil strength.
It turns out that the resulting empirical formulas differ depending on the porosity $\eta$ of the mix, i.e., the percentage of voids in the overall volume of the soil: for different values of $\eta$, we have, in general, different dependencies on lime content $L$.
Known empirical formulas. The mix is isotropic, so its mechanical strength can be characterized by two parameters:
• the unconfined compressive strength $q_u$, which describes the smallest value of pressure (force over area) applied at the top of a cylindrical sample at which this sample fails;
• the tensile strength $q_t$, which describes the corresponding value when the force is applied orthogonally to the cylinder’s axis.
For both types of strength $q$, the empirical formulas describing the dependence of strength on $\eta$ and $L$ are
$$q = c_1 \cdot \eta^{e_\eta} \cdot L^{e_L} \tag{1}$$
for some parameters $c_1$, $e_\eta$, and $e_L$; see, e.g., [2–6]. The corresponding constant $c_1$ depends on the dry density $\rho$. The dependence on $\rho$ takes the form
$$c_1 = c_2 \cdot \rho^{e_\rho} \tag{2}$$
for some constants $c_2$ and $e_\rho$; see, e.g., [4]. Substituting the formula (2) into the formula (1), we get
$$q = c_2 \cdot \rho^{e_\rho} \cdot \eta^{e_\eta} \cdot L^{e_L}. \tag{3}$$
The above three formulas are supported by a large amount of evidence.
Comment. The strength also depends on the curing temperature $T$ and on the duration $d$ of curing.
However, the dependence of the strength on these two quantities is less well-studied; see, e.g., [2, 6]. While there are some useful empirical formulas, these formulas are not yet fully consistent with the physical meaning. For example:
• When the curing time $d$ is set to 0, i.e., when there is no curing at all, we should expect that these formulas produce the same non-cured result, irrespective of the curing temperature.
• However, for $d = 0$, different existing formulas lead to values differing by an order of magnitude.
Need for a theoretical justification. In real-life applications, we can have many different combinations of the values of $L$, $\eta$, and $\rho$, and it is not realistically possible to test all of these combinations. It is therefore desirable to come up with a theoretical justification for these empirical formulas. This will make the user more confident that the formulas can be applied to different combinations of the values of these three quantities.
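The two-step use of such formulas described in this section (determine the parameters from test results, then predict) can be sketched for the simplest case: the dependence on the lime content $L$ alone, at fixed porosity and density, so that $q = c \cdot L^{e}$. The sketch below is only an illustration under stated assumptions: the function name `fit_power_law` is hypothetical, the data are synthetic (generated from a known power law, not taken from any of the cited experiments), and ordinary least squares on a log-log scale is one common way to fit a power law, not necessarily the procedure used in [2–6].

```python
import math


def fit_power_law(xs, qs):
    """Least-squares fit of q = c * x**e on a log-log scale; returns (c, e)."""
    log_x = [math.log(x) for x in xs]
    log_q = [math.log(q) for q in qs]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_q = sum(log_q) / n
    # Slope of the regression line ln(q) = ln(c) + e * ln(x)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxq = sum((u - mean_x) * (v - mean_q) for u, v in zip(log_x, log_q))
    e = sxq / sxx
    c = math.exp(mean_q - e * mean_x)
    return c, e


# Synthetic "lab" data (an assumption for illustration): q = 50 * L**0.8
lime = [3.0, 5.0, 7.0, 9.0]                  # lime content, in percent
strength = [50.0 * L ** 0.8 for L in lime]   # strength, in arbitrary units

# Step 1: determine the parameters from the test results
c_hat, e_hat = fit_power_law(lime, strength)

# Step 2: use the fitted formula to predict strength for an untested lime content
predicted = c_hat * 6.0 ** e_hat
print(c_hat, e_hat, predicted)
```

With real measurements, the points would not lie exactly on a line in log-log coordinates, and the quality of the fit would indicate how well the power law describes the given soil.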
2 Our Explanation
Main idea: scale-invariance. Our objective is to explain the dependence of the strength on the porosity, lime content, and dry density, i.e., the dependencies which are confirmed by a large number of experiments. The main idea behind our explanation is that while we are interested in physical quantities, when we process data, what we process are numerical values of these quantities, and these numerical values depend on the choice of the measuring unit. If we replace the original measuring unit with one which is $\lambda$ times smaller, then the physics remains the same, but each numerical value $x$ gets multiplied by $\lambda$: $x \to \lambda \cdot x$. For example, if we replace meters with centimeters, 2 m becomes $100 \cdot 2 = 200$ cm.
Since such “scaling” does not change the physics, it is reasonable to require that all dependencies remain the same when we change the corresponding units. Of course, if we have a dependence $y = f(x)$, e.g., that the area $y$ of a square domain is equal to the square of its side $x$ ($y = x^2$), then, when we change the unit for measuring $x$, we should appropriately change the unit for measuring $y$. In other words, for each $\lambda > 0$, there should exist a value $\mu(\lambda)$ depending on $\lambda$ such that:
• if we have $y = f(x)$,
• then we will have $y' = f(x')$ for $x' = \lambda \cdot x$ and $y' = \mu(\lambda) \cdot y$.
How to describe scale-invariant dependencies. Substituting the expressions $x' = \lambda \cdot x$ and $y' = \mu(\lambda) \cdot y$ into the formula $y' = f(x')$, we conclude that $f(\lambda \cdot x) = \mu(\lambda) \cdot y$. Here, $y = f(x)$, so we conclude that
$$f(\lambda \cdot x) = \mu(\lambda) \cdot f(x). \tag{4}$$
From the physical viewpoint, the dependence $y = f(x)$ is usually continuous. It is known (see, e.g., [1]) that every continuous solution of the functional equation (4) has the form
$$f(x) = c \cdot x^{e} \tag{5}$$
for some parameters $c$ and $e$.
How can we apply these ideas to our formulas. At first glance, the above ideas are not applicable to the formulas (1)–(3), since these formulas deal:
• not with values measured in some physical units, where scaling would make sense,
• but rather with percentages, i.e., with values for which re-scaling does not make physical sense.
However, it is possible to apply the scaling ideas if we take into account that each of these quantities is actually a ratio of physical quantities for which re-scaling makes sense:
• the lime content $L$ is the ratio of the lime volume $V_L$ to the overall volume $V_d$ of the dry mix: $L = \dfrac{V_L}{V_d}$, and
• the porosity $\eta$ is the ratio of the total volume of voids $V_v$ to the total volume $V$: $\eta = \dfrac{V_v}{V}$.
From this viewpoint, for each sample:
• the dependence of the strength $q$ on the lime content takes the form $q = \text{const} \cdot V_L^{e_L}$;
• the dependence of the strength $q$ on the porosity takes the form $q = \text{const} \cdot V_v^{e_\eta}$; and
• the dependence of the strength $q$ on the dry density $\rho$ takes the form $q = \text{const} \cdot \rho^{e_\rho}$.
In all three cases, we now have a dependence between physical quantities for which re-scaling makes perfect sense, so it is reasonable to expect that each of these dependencies is described by a power law.
Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Aczel, J., Dhombres, J.: Functional Equations in Several Variables. Cambridge University Press, Cambridge (1989) 2. Consoli, N.C., da Rocha, C.G., Silvani, C.: Effect of curing temperature on the strength of sand, coal fly ash, and lime blends. J. Mater. Civ. Eng. 26(8) (2014). https://doi.org/10.1061/ (ASCE)MT.1943-5533.0001011 3. Consoli, N.C., Saldanha, R.B., Novaes, J.F., Sheuermann Filho, H.C.: On the durability and strength of compacted coal fly ash-carbide lime blends. Soils Rocks 40(2), 155–161 (2017) 4. Kumar, S., Prasad, A.: Parameters controlling strength of red mud-lime mix. Eur. J. Environ. Civ. Eng. 23(6), 743–757 (2019) 5. Saldanha, R.B., Consoli, N.C.: Accelerated mix design of lime stabilized materials. J. Mater. Civ. Eng. 28(3) (2016). https://doi.org/10.1061/(ASCE)MT.1943-5533.0001437 6. Silvani, C., Braun, E., Masuero, G.B., Consoli, N.C.: Behavior of soil-fly blends under different curing temperature. In: Proceedings of the Third International Conference on Transportation Geotechnics ICTG’2016, Procedia Engineering, vol. 143, pp. 220–228 (2016)
Towards a Theoretical Explanation of How Pavement Condition Index Deteriorates over Time Edgar Daniel Rodriguez Velasquez and Vladik Kreinovich
Abstract To predict how the Pavement Condition Index will change over time, practitioners use a complex empirical formula derived in the 1980s. In this paper, we provide a possible theoretical explanation for this formula, an explanation based on general ideas of invariance. In general, the existence of a theoretical explanation makes a formula more reliable; thus, we hope that our explanation will make predictions of road quality more reliable.
1 Formulation of the Problem
The quality of a road pavement is described by a Pavement Condition Index (PCI) that takes into account all possible pavement imperfections [2]. The perfect condition of the road corresponds to PCI = 100, and the worst possible condition corresponds to PCI = 0. As the pavement ages, its quality deteriorates. To predict this deterioration, practitioners use an empirical formula developed in [5]:
$$\mathrm{PCI} = 100 - \frac{R}{(\ln(\alpha) - \ln(t))^{1/\beta}}, \tag{1}$$
where $t$ is the pavement’s age, and $R$, $\alpha$, and $\beta$ are the corresponding parameters.
E. D. Rodriguez Velasquez Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru e-mail: [email protected]; [email protected] Department of Civil Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_17
In this paper, we propose a possible theoretical explanation for this empirical formula, an explanation based on the general notions of invariance. In general, when a formula has a theoretical explanation, it increases the users’ confidence in using this formula; this was our main motivation for providing such an explanation.
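To make the empirical formula (1) concrete, here is a minimal sketch that evaluates it for several pavement ages. The parameter values are hypothetical and chosen only for illustration (they are not taken from [5]), and the function name `pci` is our own. Note that the expression is defined only for $0 < t < \alpha$, since otherwise $\ln(\alpha) - \ln(t)$ is not positive.

```python
import math


def pci(t, R, alpha, beta):
    """Formula (1): PCI = 100 - R / (ln(alpha) - ln(t))**(1/beta), for 0 < t < alpha."""
    if not 0 < t < alpha:
        raise ValueError("formula (1) requires 0 < t < alpha")
    return 100.0 - R / (math.log(alpha) - math.log(t)) ** (1.0 / beta)


# Hypothetical parameter values, for illustration only
R, alpha, beta = 20.0, 40.0, 1.0

ages = [1, 5, 10, 20, 30]   # pavement age t, in the same unit as alpha
values = [pci(t, R, alpha, beta) for t in ages]

for t, v in zip(ages, values):
    print(f"t = {t:2d}  PCI = {v:.1f}")
```

For positive $R$ and $\beta$, the predicted PCI decreases as $t$ grows, and the subtracted term grows without bound as $t$ approaches $\alpha$, so in practice the formula is meaningful only on the range of ages where the result stays between 0 and 100.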
2 General Invariances
Main natural transformations. In order to describe our explanation, let us recall the basic ideas of invariance. Many of these ideas come from the fact that in data processing, we use numerical values of physical properties. In general, for the same physical quantity, we get different numerical values depending on what measuring unit we use and what starting point we use as a 0 value. For example, if we replace meters with centimeters, all numerical values get multiplied by 100. In general, if we replace the original measuring unit with a new unit which is $\lambda$ times smaller, then each original numerical value $x$ is replaced by the new numerical value $\lambda \cdot x$. This transformation is known as scaling.
Similarly, if for measuring temperature, we select a new starting point which is 32° below the original zero, then this number 32 is added to all the numerical values. In general, if we choose a new starting point which precedes the original one by $x_0$ units, then each original numerical value $x$ is replaced by the new numerical value $x + x_0$. This transformation is known as shift.
Notion of invariance. The physics does not change if we simply change the measuring unit and/or change the starting point. Thus, it makes sense to require that the physical properties should not change if we apply the corresponding natural transformations: scaling or shift; see, e.g., [3, 6]. In particular, it makes sense to require that the relations $y = f(x)$ between physical quantities be invariant in this sense.
Of course, if we change the measuring unit for $x$, then we may need to change the measuring unit for $y$. For example, the formula $y = x^2$ for the area $y$ of the square of linear size $x$ does not depend on the units, but if we change meters to centimeters, we also need to correspondingly change square meters to square centimeters. So, the proper definition of invariance is that for each natural transformation $x \to X$ there exists an appropriate transformation $y \to Y$ such that if, in the original scale, we had $y = f(x)$, then in the new scale, we will have $Y = f(X)$.
Invariant dependencies: towards a general description. There are two possible types of natural transformations for $x$: scaling and shift. Similarly, there are two possible types of natural transformations of $y$. Depending on which class of natural transformations we choose for $x$ and for $y$, we thus get $2 \cdot 2 = 4$ possible cases. Let us describe these four cases explicitly, and let us describe, for each of the cases, what the corresponding invariant functions are.
Case of x-scaling and y-scaling. In this case, for each $\lambda > 0$, there exists a value $\mu > 0$ such that if we have $y = f(x)$, then for $X = \lambda \cdot x$ and $Y = \mu \cdot y$, we will have
$Y = f(X)$. Substituting the expressions for $Y$ and $X$ into this formula, and explicitly taking into account that $\mu$ depends on $\lambda$, we get the following equation:
$$\mu(\lambda) \cdot f(x) = f(\lambda \cdot x). \tag{2}$$
All continuous (and even discontinuous but measurable) solutions to this equation are known (see, e.g., [1]): they all have the form
$$f(x) = A \cdot x^{b} \tag{3}$$
for some $A$ and $b$.
Comment. The proof for general measurable functions is somewhat complex, but in the natural case of differentiable dependence $f(x)$, the proof is easy. Namely, from the Eq. (2), we conclude that $\mu(\lambda) = \dfrac{f(\lambda \cdot x)}{f(x)}$. Since the right-hand side of this equality is differentiable, the left-hand side $\mu(\lambda)$ is differentiable too. Thus, we can differentiate both sides of (2) with respect to $\lambda$ and take $\lambda = 1$. As a result, we get $b \cdot f(x) = x \cdot \dfrac{df(x)}{dx}$, where $b \stackrel{\text{def}}{=} \mu'(1)$. We can separate the variables $x$ and $f$ by moving all the terms containing $x$ to one side and all the terms containing $f$ to the other side; then, we get $b \cdot \dfrac{dx}{x} = \dfrac{df}{f}$. Integrating both sides, we get $\ln(f) = b \cdot \ln(x) + C$, thus indeed $f = \exp(\ln(f)) = A \cdot x^{b}$, where $A = \exp(C)$.
Case of x-scaling and y-shift. In this case, for each $\lambda > 0$, there exists a value $y_0(\lambda)$ such that if $y = f(x)$, then $Y = f(X)$ for $Y = y + y_0$ and $X = \lambda \cdot x$. Thus, we get
$$f(x) + y_0(\lambda) = f(\lambda \cdot x). \tag{4}$$
If the function $f(x)$ is differentiable, then the difference $y_0(\lambda) = f(\lambda \cdot x) - f(x)$ is also differentiable. Differentiating both sides of the Eq. (4) with respect to $\lambda$ and taking $\lambda = 1$, we get $b = x \cdot \dfrac{df(x)}{dx}$, where $b \stackrel{\text{def}}{=} y_0'(1)$. Separating the variables, we get $df = b \cdot \dfrac{dx}{x}$, and after integration, we get
$$f(x) = A + b \cdot \ln(x) \tag{5}$$
for some constant $A$.
Comment. Similarly to the previous case, the same formula (5) holds if we only assume that $f(x)$ is measurable [1].
Case of x-shift and y-scaling. In this case, for each $x_0$, there exists a value $\lambda(x_0)$ such that if $y = f(x)$, then $Y = f(X)$ for $Y = \lambda \cdot y$ and $X = x + x_0$. Thus, we get
$$\lambda(x_0) \cdot f(x) = f(x + x_0). \tag{6}$$
If the function $f(x)$ is differentiable, then the ratio $\lambda(x_0) = \dfrac{f(x + x_0)}{f(x)}$ is also differentiable. Differentiating both sides of the Eq. (6) with respect to $x_0$ and taking $x_0 = 0$, we get $b \cdot f(x) = \dfrac{df(x)}{dx}$, where $b \stackrel{\text{def}}{=} \lambda'(0)$. Separating the variables, we get $\dfrac{df}{f} = b \cdot dx$, and after integration, we get $\ln(f) = b \cdot x + C$, hence
$$f(x) = A \cdot \exp(b \cdot x), \tag{7}$$
where $A = \exp(C)$.
Comment. Similarly to the previous two cases, the same formula (7) holds if we only assume that $f(x)$ is measurable [1].
Case of x-shift and y-shift. In this case, for each $x_0$, there exists a value $y_0(x_0)$ such that if $y = f(x)$, then $Y = f(X)$ for $Y = y + y_0$ and $X = x + x_0$. Thus, we get
$$f(x) + y_0(x_0) = f(x + x_0). \tag{8}$$
If the function $f(x)$ is differentiable, then the difference $y_0(x_0) = f(x + x_0) - f(x)$ is also differentiable. Differentiating both sides of the Eq. (8) with respect to $x_0$ and taking $x_0 = 0$, we get $b = \dfrac{df(x)}{dx}$, where $b \stackrel{\text{def}}{=} y_0'(0)$. Thus,
$$f(x) = A + b \cdot x \tag{9}$$
for some constant $A$.
Comment. In this case too, the same formula (9) holds if we only assume that $f(x)$ is measurable [1].
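The four invariant families derived in this section can be checked numerically. The sketch below is an illustration, not part of the original derivation: the values of `A`, `b`, `lam`, and `x0` are hypothetical, and the re-scaling or shift applied to $y$ in each case ($\lambda^b$, $b \cdot \ln(\lambda)$, $\exp(b \cdot x_0)$, and $b \cdot x_0$) is what follows from substituting the closed-form expressions (3), (5), (7), and (9) into the Eqs. (2), (4), (6), and (8).

```python
import math

A, b = 2.0, 0.7        # hypothetical parameters of the four families
lam, x0 = 3.0, 1.5     # hypothetical scaling factor and shift


def power(x):          # formula (3): x-scaling, y-scaling
    return A * x ** b


def logarithm(x):      # formula (5): x-scaling, y-shift
    return A + b * math.log(x)


def exponential(x):    # formula (7): x-shift, y-scaling
    return A * math.exp(b * x)


def linear(x):         # formula (9): x-shift, y-shift
    return A + b * x


def max_residual(f, transform_x, transform_y, xs):
    """Largest violation of Y = f(X), where X = transform_x(x) and Y = transform_y(f(x))."""
    return max(abs(f(transform_x(x)) - transform_y(f(x))) for x in xs)


xs = [0.5, 1.0, 2.0, 4.0]
residuals = {
    "x-scaling, y-scaling": max_residual(
        power, lambda x: lam * x, lambda y: lam ** b * y, xs),
    "x-scaling, y-shift": max_residual(
        logarithm, lambda x: lam * x, lambda y: y + b * math.log(lam), xs),
    "x-shift, y-scaling": max_residual(
        exponential, lambda x: x + x0, lambda y: math.exp(b * x0) * y, xs),
    "x-shift, y-shift": max_residual(
        linear, lambda x: x + x0, lambda y: y + b * x0, xs),
}

for case, r in residuals.items():
    print(f"{case}: max residual = {r:.2e}")
```

In each case, the residual is at the level of floating-point rounding, i.e., each family is invariant under its pair of transformations, as the derivations above state.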
3 First Attempt: Let Us Directly Apply Invariance Ideas to Our Problem PCI and age: scale-invariance or shift-invariance? We are analyzing how PCI depends on the pavement’s age t. To apply invariances to our dependence, we need first to analyze which invariances are reasonable for the corresponding variables— PCI and age. For age, the answer is straightforward: there is a clear starting point for measuring age, namely, the moment when the road was built. On the other hand, there is no fixed
measuring unit: we can measure age in years, in months, or (for good roads) in decades. Thus, for age:
• shift-invariance, which corresponds to the possibility of changing the starting point, makes no physical sense, while
• scale-invariance, which corresponds to the possibility of changing the measuring unit, makes perfect sense.
For PCI, the situation is similar. Namely, there is a very clear starting point: the point corresponding to the newly built practically perfect road, when PCI = 100. From this viewpoint, for PCI, shifts do not make much physical sense. If we select 100 as the starting point (i.e., as 0), then instead of the original numerical values PCI, we get shifted values PCI − 100. A minor problem with these shifted values is that they are all negative, while it is more convenient to use positive numbers. Thus, we change the sign and consider the difference 100 − PCI.
On the other hand, the selection of point 0 is rather subjective. What is marked as 0 in a developed country that can afford to invest money into road repairs may be a passable road in a poor country, where most of the roads are, from the viewpoint of US standards, very bad; see, e.g., [4]. So, for PCI (or, to be more precise, for 100 − PCI), it probably makes sense to use scaling.
Let us directly apply the invariance ideas. In view of the above analysis, we should be looking for a dependence of $y = 100 - \mathrm{PCI}$ on $x = t$ which is invariant with respect to x-scaling and y-scaling. As we have discussed in the previous section, this requirement leads to $y = A \cdot x^{b}$, i.e., to $100 - \mathrm{PCI} = A \cdot t^{b}$ and $\mathrm{PCI} = 100 - A \cdot t^{b}$.
This formula may be reasonable from the purely mathematical viewpoint, but in practice, it is a very crude description of what we actually observe. Thus, the direct application of invariance ideas does not lead to good results.
4 Let Us Now Apply Invariance Ideas Indirectly Idea. Since we cannot apply the invariance requirements directly—to describe the dependence of y = 100 − PCI on x = t, a natural idea is to apply these requirements indirectly. Namely, we assume that there is some auxiliary intermediate variable z such that: • y depends on z, • z depends on x, and • both these y-on-z and z-on-x dependencies are, in some reasonable sense, invariant. Options. We know that for x and for y, only scaling makes sense. However, for the auxiliary variable z, in principle, both shifts and scalings may be physically
reasonable. Depending on which of the two types of transformations we use for $z$ when describing the y-on-z and z-on-x dependencies, we get four possible options:
• for both y-on-z and z-on-x dependencies, we use z-shift;
• for both y-on-z and z-on-x dependencies, we use z-scaling;
• for y-on-z dependence, we use z-shift, while for z-on-x dependence, we use z-scaling;
• for y-on-z dependence, we use z-scaling, while for z-on-x dependence, we use z-shift.
Let us consider these four cases one by one.
Case when for both y-on-z and z-on-x dependencies, we use z-shift. In this case, in accordance with the results presented in Sect. 2, we have $z = A + b \cdot \ln(x)$ and $y = A_1 \cdot \exp(b_1 \cdot z)$. Substituting the expression for $z$ into the formula for $y$, we get
$$y = A_1 \cdot \exp(b_1 \cdot (A + b \cdot \ln(x))) = (A_1 \cdot \exp(b_1 \cdot A)) \cdot (\exp(\ln(x)))^{b \cdot b_1} = A_2 \cdot x^{b_2},$$
where $A_2 \stackrel{\text{def}}{=} A_1 \cdot \exp(b_1 \cdot A)$ and $b_2 \stackrel{\text{def}}{=} b \cdot b_1$. This is exactly the power-law formula coming from the direct application of invariance requirements, and we already know that this formula is not very adequate for describing the experimental data.
Case when for both y-on-z and z-on-x dependencies, we use z-scaling. In this case, we have $z = A \cdot x^{b}$ and $y = A_1 \cdot z^{b_1}$. Thus, here,
$$y = A_1 \cdot \left(A \cdot x^{b}\right)^{b_1} = A_2 \cdot x^{b_2},$$
where $A_2 \stackrel{\text{def}}{=} A_1 \cdot A^{b_1}$ and $b_2 \stackrel{\text{def}}{=} b \cdot b_1$. Thus, in this case, we also get the same formula as for the direct application of invariance.
Case when for y-on-z dependence, we use z-shift, while for z-on-x dependence, we use z-scaling. Here, $z = A \cdot x^{b}$ and $y = A_1 \cdot \exp(b_1 \cdot z)$, thus $y = A_1 \cdot \exp\left(b_1 \cdot A \cdot x^{b}\right)$, i.e., $y = A_1 \cdot \exp\left(b_2 \cdot x^{b}\right)$, where $b_2 \stackrel{\text{def}}{=} b_1 \cdot A$. So, for $\mathrm{PCI} = 100 - y$ and $x = t$, we get the dependence
$$\mathrm{PCI} = 100 - A_1 \cdot \exp\left(b_2 \cdot t^{b}\right). \tag{10}$$
Interestingly, this is one of the formulas that were tested in [5] and that turned out to work not as well as the formula that was selected.
Case when for y-on-z dependence, we use z-scaling, while for z-on-x dependence, we use z-shift. In this case, $z = A + b \cdot \ln(x)$ and $y = A_1 \cdot z^{b_1}$, thus $y = A_1 \cdot (A + b \cdot \ln(x))^{b_1}$. So, for $\mathrm{PCI} = 100 - y$ and $x = t$, we get
$$\mathrm{PCI} = 100 - A_1 \cdot (A + b \cdot \ln(t))^{b_1}. \tag{11}$$
Let us show that this is indeed the desired formula (1).
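Indeed, here,
$$A + b \cdot \ln(x) = (-b) \cdot \left(-\frac{A}{b} - \ln(x)\right). \tag{12}$$
For $\alpha \stackrel{\text{def}}{=} \exp\left(-\dfrac{A}{b}\right)$, we have $\ln(\alpha) = -\dfrac{A}{b}$, so for $x = t$ the formula (12) takes the form $A + b \cdot \ln(t) = (-b) \cdot (\ln(\alpha) - \ln(t))$. Thus, the formula (11) takes the form
$$\mathrm{PCI} = 100 - A_1 \cdot (-b)^{b_1} \cdot (\ln(\alpha) - \ln(t))^{b_1},$$
i.e., the desired form (1) with $R = A_1 \cdot (-b)^{b_1}$ and $\beta = -\dfrac{1}{b_1}$.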
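The parameter correspondence obtained above ($R = A_1 \cdot (-b)^{b_1}$, $\alpha = \exp(-A/b)$, $\beta = -1/b_1$) can be verified numerically. The following sketch uses hypothetical values of $A_1$, $A$, $b$, and $b_1$, chosen with $b < 0$ so that $(-b)^{b_1}$ is a real number and with $b_1 < 0$ so that $\beta > 0$; the function names are our own.

```python
import math


def pci_composite(t, A1, A, b, b1):
    """Formula (11): PCI = 100 - A1 * (A + b*ln(t))**b1."""
    return 100.0 - A1 * (A + b * math.log(t)) ** b1


def pci_empirical(t, R, alpha, beta):
    """Formula (1): PCI = 100 - R / (ln(alpha) - ln(t))**(1/beta)."""
    return 100.0 - R / (math.log(alpha) - math.log(t)) ** (1.0 / beta)


# Hypothetical values of the parameters of formula (11)
A1, A, b, b1 = 5.0, 3.0, -0.5, -2.0

# Parameters of formula (1) according to the derivation
R = A1 * (-b) ** b1
alpha = math.exp(-A / b)
beta = -1.0 / b1

ages = [1.0, 2.0, 5.0, 10.0]   # all smaller than alpha, so both formulas are defined
gaps = [abs(pci_composite(t, A1, A, b, b1) - pci_empirical(t, R, alpha, beta))
        for t in ages]
print(R, alpha, beta, max(gaps))
```

The two expressions agree up to rounding errors for all tested ages, which is what the algebraic transformation from (11) to (1) predicts.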
Conclusion. We indeed derived the empirical formula (1) for the decrease of PCI over time from the general invariance requirements. To be more precise, from the invariance requirements, we can derive two possible formulas: • the desired formula (1)—which is in good accordance with the empirical data, and • the alternative formula (10)—which is not a good fit for empirical data. Since in general, the existence of a theoretical explanation makes a formula more reliable, we hope that our explanation will make predictions of road quality more reliable. Acknowledgements This work was partially supported by the Universidad de Piura in Peru (UDEP) and by the US National Science Foundation via grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Aczel, J., Dhombres, J.: Functional Equations in Several Variables. Cambridge University Press, Cambridge (1989) 2. American Society for Testing and Materials (ASTM), Standard test method for measuring the longitudinal profile of traveled surfaces with an accelerometer established inertial profiling reference, ASTM Standard E950/E950M-09 (2018) 3. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston (2005) 4. Greenbaum, E.: Emerald Labyrinth: a Scientist’s Adventures in the Jungles of the Congo. ForeEdge, Lebanon, New Hampshire (2018) 5. Metropolitan Transportation Commission (MTC), Technical appendices describing the development and operation of the Bay Area pavement management system (PMS), prepared by Roger E. Smith (1987) 6. Thorne, K.S., Blandford, R.D.: Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics. Princeton University Press, Princeton (2017)
A Recent Result About Random Metric Spaces Explains Why All of Us Have Similar Learning Potential Christian Servin, Olga Kosheleva, and Vladik Kreinovich
Abstract In the same class, after the same lesson, the amount of learned material often differs drastically, by a factor of ten. Does this mean that people's learning abilities differ that much? Not really: experiments show that among different students, learning abilities differ by no more than a factor of two. This fact has been successfully used in designing innovative teaching techniques, techniques that help students realize their full learning potential. In this paper, we deal with a different question: how to explain the above experimental result. It turns out that this result about learning abilities (which are, due to genetics, randomly distributed among the human population) can be naturally explained by a recent mathematical result about random metrics.
1 Formulation of the Problem
Learning results differ a lot. In the same class, whether it is an elementary school, a high school, or a university, some students learn a lot right away, and some stay behind and require a lot of time to learn the same concepts.
C. Servin Computer Science and Information Technology Systems Department, El Paso Community College (EPCC), 919 Hunter Dr., El Paso, TX 79915-1908, USA e-mail: [email protected] O. Kosheleva Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] V. Kreinovich (B) Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_18
How can we explain this disparity? There are several possible explanations of the above phenomenon. Crudely speaking, these explanations can be divided into two major categories:
• elitist explanations claim that people are different: some are born geniuses, some have low IQ; nasty elitists claim that the IQ is correlated with gender, race, etc., others believe that geniuses are evenly distributed among people of different genders and different races, but they all believe that some people are much more capable than others, and that this difference explains the observed difference in learning results;
• liberal explanations assume that everyone is equal, that every person has approximately the same learning potential, so even the slowest-to-learn students can learn very fast, we just need to find an appropriate teaching approach; see, e.g., [2, 3].
Which explanation is correct: an experiment. To answer this question, Russian specialists in pedagogy performed the following experiment, cited in [1, 4, 6, 7]. After a regular class, with the usual widely varying results, the researchers asked the students to write down everything that they remembered from this class period: what material was taught, what dress the teacher had on, etc., literally everything.
Not surprisingly, straight-A students remembered practically all the material that was being taught, but students whose average grade was on the edge of failing remembered, in the worst case, about 10% of what was taught in the class. This was expected. What was completely unexpected is that when the researchers counted the overall amount of information that a student remembered, the overall number of remembered bits differed by no more than a factor of two.
The only difference was that:
• the bits remembered by straight-A students included all the material taught in the class, while
• the bits remembered by not so successful students included how the teacher was dressed, which birds flew outside the window during the class, etc.
This experiment, together with other similar experiments (see, e.g., [2] and references therein), clearly supports what we called a liberal viewpoint.
This experimental result is used in teaching. This observation underlies successful teaching methods, e.g., a method tested and promoted by the Russian researcher Anatoly Zimichev and his group [1, 4, 6, 7]: we can make everyone learn well if we block all the other sources of outside information (no teacher, online learning, empty room, no other students, etc.). Other related teaching methods are described in [2, 3].
But why? This experimental result is helpful, but a natural question is: why? How can we explain this result? Why a factor of two and not, e.g., a factor of three or of 1.5? As we all know, genes are randomly combined, so why don’t we have a bigger variety of learning abilities?
What we do in this paper. In this paper, we provide a possible theoretical explanation for this empirically observed almost-equality.
2 Our Explanation Let us formulate this problem in precise terms. We are interested in differences between students. A natural way to gauge this difference is to have a numerical value d(a, b) > 0 that describes how different students a and b are. Intuitively, the difference between students a and b is exactly the same as the difference between students b and a, so we should have d(a, b) = d(b, a). It is also reasonable to require that the difference d(a, c) between students a and c should not exceed the sum of the differences d(a, b) between a and some other student b and d(b, c) between this auxiliary student b and the original student c: d(a, c) ≤ d(a, b) + d(b, c). In mathematics, a symmetric (d(a, b) = d(b, a)) function satisfying this inequality is known as a metric. We know that genes act randomly, so we expect the metric to also be random—in some reasonable sense. So, we are interested in the properties of a random metric. Analysis of the problem. We have a large population of people on Earth, the overall number n of people is in billions. So, for all practical purposes, we can assume that this n is close to infinity—in the sense that if there is an asymptotic property of a random metric, then this property is—with high confidence—satisfied for this n. This assumption makes perfect sense. For example, we know that 1/n tends to 0 as n tends to infinity. And indeed, if we divide one slice of pizza into several billion pieces—the favorite elementary school example of division—we will get practically nothing left for each person. So, we are interested in asymptotic properties of random metrics. Given students a and b, how can we determine the corresponding distance d(a, b)? The only way to do that is by measurement, whether it is counting bits in stories (as in the above example) or in any other way. But measurements are never absolutely accurate: whatever we measure, if we repeat the same procedure one more time, we will get a slightly different result. 
• Anyone who had physics labs knows that this is true when we measure any physical quantity, be it weight or current.
• Anyone who had their blood pressure measured at the doctor’s office knows that two consecutive measurements lead to slightly different results.
• And every psychology student knows that repeating the same test, be it an IQ test or any other test, leads to slightly different results.
In other words, as a result of the measurement, we only know the measured quantity with some accuracy. Let $\varepsilon > 0$ denote the accuracy with which we can measure the value $d(a, b)$. This means that, instead of the exact value $d(a, b)$, we, in effect, come up with one of the values $0, \varepsilon, 2\varepsilon$, etc., i.e., the values of the type $k \cdot \varepsilon$ for an integer $k$. There are
finitely many people on Earth, so there are finitely many such differences, so there is the largest of the corresponding values $k$; let us denote this largest number by $r$.
The value $\varepsilon$ is also not precisely defined. If we slightly increase or decrease $\varepsilon$, the value $r$, which is equal to the ratio between the largest possible distance $D$ and the accuracy $\varepsilon$, correspondingly slightly decreases or slightly increases. If the original value $r$ was odd, let us slightly decrease $\varepsilon$ and get an even number $r + 1$ instead. This does not change anything of substance, but helps in the analysis of the problem, since more is known about random metrics for even values $r$.
Now, we are ready to cite the corresponding mathematical result, the result that leads to the desired explanation.
Our explanation. A recent result [5] shows that in almost all randomly selected metric spaces, all the distances are between the largest value $D = r \cdot \varepsilon$ and its half $D/2 = (r/2) \cdot \varepsilon$. Here, “almost all” means that as $n$ increases, the probability of this property tends to 1. In view of our comment about asymptotic properties, this means that for humanity as a whole, this property should be true. Thus, different distances differ by no more than a factor of two, exactly as we observe in the above experiment.
Comment. Strictly speaking, the result from [5] refers to integer-valued metrics, but it can be easily extended to metrics whose values are $k \cdot \varepsilon > 0$ for some number $\varepsilon > 0$.
Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Aló, R., Kosheleva, O.: Optimization techniques under uncertain criteria, and their possible use in computerized education. In: Proceedings of the 25th International Conference of the North American Fuzzy Information Processing Society NAFIPS’2006, Montreal, Quebec, Canada, 3–6 June 2006, pp. 593–598 2. Boaler, J.: Limitless Mind: Learn, Lead, and Live Without Barriers. Harper One, New York (2019) 3. Iuculano, T., et al.: Cognitive tutoring induces widespread neuroplasticity and remediates brain function in children with mathematical learning disabilities. Nat. Commun. 6, Paper 8453 (2015) 4. Longpré, L., Kosheleva, O., Kreinovich, V.: How to estimate amount of useful information, in particular under imprecise probability. In: Proceedings of the 7th International Workshop on Reliable Engineering Computing REC’2016, Bochum, Germany, 15–17 June 2016, pp. 257–268 5. Mubayi, D., Terry, C.: Discrete metric spaces: structure, enumeration, and 0–1 laws. J. Symb. Log. 84(4), 1293–1325 (2019) 6. Zimichev, A.M., Kreinovich, V., et al.: Automatic control system for psychological experiments, Final report on Grant No. 0471-2213-20 from the Soviet Science Foundation (1982) (in Russian) 7. Zimichev, A.M., Kreinovich, V., et al.: Automatic system for fast education of computer operators, Final report on Grant No. 0471-0212-60 from the Soviet Science Foundation (1980–1982) (in Russian)
Finitely Generated Sets of Fuzzy Values: If “And” Is Exact, Then “Or” Is Almost Always Approximate, and Vice Versa—A Theorem Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich
Abstract In the traditional fuzzy logic, experts’ degrees of confidence are described by numbers from the interval [0, 1]. Clearly, not all the numbers from this interval are needed: in the whole history of the Universe, there will be only countably many statements and thus, only countably many possible degrees, while the interval [0, 1] is uncountable. It is therefore interesting to analyze what is the set S of actually used values. The answer depends on the choice of “and”-operations (t-norms) and “or”-operations (t-conorms). For the simplest pair of min and max, any finite set will do, as long as it is closed under negation 1 − a. For the next simplest pair (algebraic product and algebraic sum), we prove that for a finitely generated set, if the “and”-operation is exact, then the “or”-operation is almost always approximate, and vice versa. For other “and”- and “or”-operations, the situation can be more complex.
J. C. Urenda, Department of Mathematical Sciences and Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
O. Kosheleva, Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (B), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_19

1 Formulation of the Problem
Need for fuzzy degrees: a brief reminder. Computers are an important part of our lives. They help us understand the world, they help us make good decisions. It is desirable to make sure that these computers possess as much of our own knowledge as possible.
Some of this knowledge is precise; such knowledge is relatively easy to describe in computer-understandable terms. However, a significant part of our knowledge is described by using imprecise (“fuzzy”) words from natural language. For example, to design better self-driving cars, it sounds reasonable to ask experienced drivers what they do in different situations, and implement the corresponding rules in the car’s computer. The problem with this idea is that expert drivers usually describe their rules by saying something like “If the car in front of you is close, and it slows down a little bit, then ...”. Here, “close”, “a little bit”, etc. are imprecise words. To translate such knowledge into computer-understandable terms, Lotfi Zadeh invented fuzzy logic, in which each imprecise term like “close” is described by assigning, to each possible value x of the corresponding quantity (in this case, distance), a degree to which x satisfies the property under consideration (in this case, to what extent x is close); see, e.g., [3, 6, 11, 14, 15, 19]. In the original formulation of fuzzy logic, the degrees are described by numbers from the interval [0, 1], so that:
• 1 means that we are absolutely sure about the corresponding statement,
• 0 means that we are sure that this statement is false, and
• intermediate degrees correspond to intermediate degrees of certainty.
Need for operations on fuzzy degrees, i.e., for fuzzy logic: a brief reminder. Our rules often use logical connectives like “and” and “or”. In the above example of the car-related statement, the person used “and” and “if-then” (implication). Other statements use negation or “or”. In the ideal world, we should ask each expert to describe his or her degree of confidence in each such statement, e.g., in the statement that “the car in front of you is close, and it slows down a little bit”.
However, here is a problem: to describe each property like “close” for distance or “a little bit” for a change in speed, it is enough to list possible values of one variable. For a statement about two variables, as above, we already need to consider all possible pairs of values. So, if we consider N possible values of each variable, we need to ask the expert N^2 questions. If our statement involves three properties, which often happens, we need to consider N^3 possible combinations, etc. With a reasonably large N, it quickly becomes impossible to ask the expert all these thousands (and even millions) of questions. So, instead of explicitly asking all these questions about composite statements like A & B, we need to be able to estimate the expert’s degree of confidence in such a statement based on his or her known degrees of confidence a and b in the original statements A and B. A procedure that transforms these degrees a and b into the desired estimate for the degree of confidence in A & B is known as an “and”-operation (or, for historical reasons, a t-norm). We will denote this procedure by f & (a, b). Similarly, a procedure corresponding to “or” is called an “or”-operation or a t-conorm; it will be denoted by f ∨ (a, b). We can also have negation operations f ¬ (a), implication operations f → (a, b), etc.
Simple examples of operations on fuzzy degrees. In his very first paper on fuzzy logic, Zadeh considered the two simplest possible “and”-operations min(a, b) and
a · b, the simplest negation operation f ¬ (a) = 1 − a, and the simplest possible “or”-operations max(a, b) and a + b − a · b. It is easy to see that the corresponding “and”- and “or”-operations form two dual pairs, i.e., pairs for which f ∨ (a, b) = f ¬ ( f & ( f ¬ (a), f ¬ (b))). This reflects the fact that in our reasoning, a ∨ b is indeed usually equivalent to ¬(¬a & ¬b). Indeed, for example, what does it mean that a dish contains either pork or alcohol (or both)? It simply means that it is not true that this is an alcohol-free and pork-free dish. Both pairs of operations can be derived from the requirement of the smallest sensitivity to changes in a and b (see, e.g., [13, 14, 18]). This requirement makes sense, since experts can only mark their degree of confidence with some accuracy, and we do not want the result of, e.g., an “and”-operation to change drastically if we replace the original degree 0.5 with a practically indistinguishable degree 0.51. If we require that the worst-case change in the result of the operation be as small as possible, we get min(a, b) and max(a, b). If we require that the mean squared value of the change be as small as possible, we get a · b and a + b − a · b.
Implication A → B means that, if we know A and we know that implication is true, then we can conclude B. In other words, implication is the weakest of all statements C for which A & C implies B. So, if we know the degrees of confidence a and b in the statements A and B, then a reasonable definition of the implication f → (a, b) is the smallest degree c for which f & (a, c) ≥ b. In this sense, implication is an inverse to the “and”-operation. In particular, when the “and”-operation is multiplication f & (a, b) = a · b, the implication operation is simply division: f → (a, b) = b/a (if b ≤ a).
Not all values from the interval [0, 1] make sense.
While it is reasonable to use numbers from the interval [0, 1] to describe the corresponding degrees, the inverse is not true: not every number from the interval [0, 1] makes sense as a degree. Indeed, whatever degree we use corresponds to some person’s informal description of his or her degree of confidence. Whatever language we use, there are only countably many words, while, as is well known, the set of all real numbers from an interval is uncountable. Usually, we have a finite set of basic degrees, and everything else is obtained by applying some logical operations. A natural question is: what can we say about the resulting (countable) sets of actually used values? This is a general question to which, in this paper, we provide a partial answer.
Simplest case: min and max. The simplest case is when we have f & (a, b) = min(a, b), f ¬ (a) = 1 − a, and f ∨ (a, b) = max(a, b). In this case, if we start with a finite set of degrees a1 , . . . , an , then we add their negations 1 − a1 , . . . , 1 − an , and that, in effect, is it: min(a, b) and max(a, b) do not generate any new values, they just select one of the two given ones (a or b).
What about a · b and a + b − a · b? What about the next simplest pair of operations? Since the product is the simpler of the two, let us start with the product. Again, we start with a finite set of degrees a1 , . . . , an . We can also consider their negations
$$a_{n+1} \stackrel{\text{def}}{=} 1 - a_1,\ \ldots,\ a_{n+i} \stackrel{\text{def}}{=} 1 - a_i,\ \ldots,\ a_{2n} \stackrel{\text{def}}{=} 1 - a_n.$$
If we apply “and”-operation to these values, we get products, i.e., values of the type
$$a_1^{k_1} \cdot \ldots \cdot a_{2n}^{k_{2n}} \qquad (1)$$
for integers ki ≥ 0. If we also allow implication—i.e., in this case, division—then we get values of the same type (1), but with integers ki being possibly negative. The set of all such values is generated based on the original finite set of values. Thus, we can say that this set is finitely generated. Every real number can be approximated, with any given accuracy, by a rational number. Thus, without losing generality, we can assume that all the values ai are rational numbers—i.e., ratios of two integers. Since for dual operations, the result of applying the “or”-operation is the negation of the result of applying the “and”-operation—to negations of a and b—a natural question is: if we take values of type (1), how many of their negations are also of the same type? This is a question that we study in this paper.
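The contrast drawn in this section between the min/max pair and the product pair can be checked on a small example. The following is only a sketch under stated assumptions: the starting degrees 1/3 and 1/2 and all function and variable names are hypothetical choices made for illustration, and exact rational arithmetic is used so that set membership is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def f_not(a):
    """Negation 1 - a."""
    return 1 - a

def f_and(a, b):
    """Algebraic product."""
    return a * b

def f_or(a, b):
    """Algebraic sum a + b - a*b."""
    return a + b - a * b

# Hypothetical starting degrees, chosen only for illustration.
base = [Fraction(1, 3), Fraction(1, 2)]
S = set(base) | {f_not(a) for a in base}  # {1/3, 1/2, 2/3}
pairs = list(product(S, repeat=2))

# min and max only select one of their arguments, so nothing new appears.
new_from_min_max = {op(a, b) for a, b in pairs for op in (min, max)} - S

# The product, in contrast, leaves the starting set.
new_from_product = {f_and(a, b) for a, b in pairs} - S

# Duality: "or" is the negation of "and" applied to the negations.
duality_holds = all(
    f_or(a, b) == f_not(f_and(f_not(a), f_not(b))) for a, b in pairs
)

print(sorted(new_from_min_max), sorted(new_from_product), duality_holds)
```

For these starting degrees, min and max stay inside the three-element set, while the product already produces values such as 1/9 and 1/6; this is why the product case calls for the notion of a finitely generated set.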
2 Definitions and the Main Result
Definition 1 By a finitely generated set of fuzzy degrees, we mean a set S of values of the type (1) from the interval [0, 1], where a1 , . . . , an are given rational numbers, an+i = 1 − ai , and k1 , . . . , k2n are arbitrary integers.
Examples. If we take n = 1 and a1 = 1/2, then a2 = 1 − a1 = 1/2, so all the values of type (1) are 1/2, 1/4, 1/8, etc. Here, only for one number a1 = 1/2, the negation 1 − a1 belongs to the same set. If we take a1 = 1/3 (so that a2 = 1 − a1 = 2/3), then we have more than one number s from the set S for which its negation 1 − s is also in S:
• we have 1/4 = a1^2 · a2^(−2) ∈ S for which 1 − 1/4 = 3/4 = a1 · a2^(−2) ∈ S; and
• we have 1/9 = a1^2 ∈ S and 1 − 1/9 = 8/9 = a1^(−1) · a2^3 ∈ S.
Proposition 1 For each finitely generated set S of fuzzy degrees, there are only finitely many elements s ∈ S for which 1 − s ∈ S.
Proof is, in effect, contained in [2, 4, 5, 7, 8, 10, 12, 16, 17], where the values s ∈ S are called S-units and the desired formula s + s′ = 1 for s, s′ ∈ S is known as the S-unit equation.
Historical comment. The history of this mathematical result is unusual (see, e.g., [9]): the corresponding problem was first analyzed by Axel Thue in 1909, it was implicitly proven by Carl Ludwig Siegel in 1929, then another implicit proof was made by Kurt Mahler in 1933, but only reasonably recently was this result explicitly formulated and explicitly proven.
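The second example above can be explored by a bounded brute-force search. This is only an illustrative sketch, not one of the algorithms from the cited literature: the exponent bound K, the helper name is_generated_by, and the variable names are assumptions made here. The sketch uses the observation that for a1 = 1/3 and a2 = 2/3, a positive rational number is of type (1) exactly when its numerator and denominator contain no primes other than 2 and 3.

```python
from fractions import Fraction

def is_generated_by(x, primes):
    """True if x > 0 and only the given primes divide its numerator and denominator."""
    if x <= 0:
        return False
    for part in (x.numerator, x.denominator):
        for p in primes:
            while part % p == 0:
                part //= p
        if part != 1:
            return False
    return True

a1, a2 = Fraction(1, 3), Fraction(2, 3)
K = 6  # hypothetical bound on |k1| and |k2|, chosen only to keep the search finite

candidates = {
    a1 ** k1 * a2 ** k2
    for k1 in range(-K, K + 1)
    for k2 in range(-K, K + 1)
}
in_unit_interval = {s for s in candidates if 0 < s < 1}

# Keep the values s whose negation 1 - s is again of type (1).
closed_under_negation = sorted(
    s for s in in_unit_interval if is_generated_by(1 - s, (2, 3))
)
print(closed_under_negation)
```

Within this bound, the search returns seven values: 1/9, 1/4, 1/3, 1/2, 2/3, 3/4, and 8/9. They include the pairs listed in the examples, and the short list is in line with the finiteness stated in Proposition 1.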
Discussion. Proposition 1 says that for all but finitely many (“almost all”) values s ∈ S, the negation 1 − s is outside the finitely generated set S. Since, as we have mentioned, to get an “or”-operation out of “and” requires negation, this means that while for this set, “and”-operation is exact, the corresponding “or”-operation almost always leads us to a value outside S. So, if we restrict ourselves to the finitely generated set S, we can only represent the results of “or”-operation approximately. In other words, if “and” is exact, then “or” is almost always approximate. Due to duality between “and” and “or”, we can also conclude that if “or” is exact, then “and” is almost always approximate.
Computational aspects. The formulation of our main result sounds like (too) abstract mathematics: there exist finitely many such values s; but how can we find them? Interestingly, there exist reasonably efficient algorithms for finding such values; see, e.g., [1].
Relation to probabilities. Our current interest is in fuzzy logic, but it should be mentioned that a similar result holds for the case of probabilistic uncertainty, when, instead of degrees of confidence, we consider possible probability values ai . In this case:
• if an event has probability a, then its negation has probability 1 − a;
• if two independent events have probabilities a and b, then the probability that both events will happen is a · b; and
• if an event B is a particular case of an event A, then the conditional probability P(B | A) is equal to b/a.
Thus, in the case of probabilistic uncertainty, it also makes sense to consider multiplication and division operations, and thus, to consider sets which are closed under these operations.
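The statement from the Discussion that “and” is exact while “or” is almost always approximate can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the original argument: it takes the set generated by a1 = 1/3 and a2 = 2/3 (whose elements are the positive rationals built from the primes 2 and 3), restricts attention to a small hypothetical sample of its elements, and counts how often the product and the algebraic sum stay inside the set.

```python
from fractions import Fraction
from itertools import product

def in_S(x):
    """True if x > 0 and only 2 and 3 divide its numerator and denominator."""
    if x <= 0:
        return False
    for part in (x.numerator, x.denominator):
        for p in (2, 3):
            while part % p == 0:
                part //= p
        if part != 1:
            return False
    return True

# Hypothetical sample: values 2^i * 3^j in (0, 1) with small exponents.
sample = [
    Fraction(2) ** i * Fraction(3) ** j
    for i in range(-3, 4)
    for j in range(-3, 4)
]
sample = [s for s in sample if 0 < s < 1]
pairs = list(product(sample, repeat=2))

and_exact = sum(in_S(a * b) for a, b in pairs)          # product stays in S
or_exact = sum(in_S(a + b - a * b) for a, b in pairs)   # algebraic sum may leave S

print(len(pairs), and_exact, or_exact)
```

Every product stays in the set, while most algebraic sums do not. For example, for a = 1/3 and b = 1/9, the result a + b − a · b = 11/27 contains the prime 11 and is therefore outside S.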
3 How General Is This Result? Formulation of the problem. In the previous section, we considered the case when f & (a, b) = a · b and f ¬ (a) = 1 − a. What if we consider another pair of operations, will the result still be true? For example, is it true for strict Archimedean “and”-operations? Analysis of the problem. It is known that every strict Archimedean “and”-operation is equivalent to f & (a, b) = a · b—namely, we can reduce it to the product by applying an appropriate strictly increasing re-scaling r : [0, 1] → [0, 1]; see, e.g., [6, 14]. Thus, without losing generality, we can assume that the “and”-operation is exactly the product f & (a, b) = a · b, but the negation operation may be different—as long as f ¬ ( f ¬ (a)) = a for all a.
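The reduction of a strict Archimedean “and”-operation to the product can be made concrete on one example. The Hamacher product a · b/(a + b − a · b) used below is not discussed in the text; it is chosen here only as a familiar strict Archimedean operation, and the re-scaling r(x) = exp(1 − 1/x) together with all function names is an assumption of this sketch. The code checks numerically that f & (a, b) = r^(−1)(r(a) · r(b)).

```python
import math

def hamacher(a, b):
    """Hamacher product, used here as an example of a strict Archimedean operation."""
    if a == 0 or b == 0:
        return 0.0
    return a * b / (a + b - a * b)

def r(x):
    """Strictly increasing re-scaling of [0, 1] onto [0, 1]."""
    return 0.0 if x == 0 else math.exp(1 - 1 / x)

def r_inv(y):
    """Inverse of the re-scaling r."""
    return 0.0 if y == 0 else 1 / (1 - math.log(y))

grid = [i / 10 for i in range(1, 11)]
max_gap = max(
    abs(hamacher(a, b) - r_inv(r(a) * r(b))) for a in grid for b in grid
)
print(max_gap)  # only floating-point rounding error remains
```

After such a re-scaling, the “and”-operation becomes the product, but the negation 1 − a is transformed into a different involutive function, which is why the analysis of this section keeps the product and varies the negation.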
Result of this section. It turns out that there are some negation operations for which the above result does not hold.
Proposition 2 For each finitely generated set S, with the only exception of the set generated by a single value 1/2, there exists a negation operation f ¬ (a) for which, for infinitely many s ∈ S, we have f ¬ (s) ∈ S.
Proof. When at least one of the original values ai is different from 1/2, this means that the fractions ai and 1 − ai have different combinations of prime numbers in their numerators and denominators. In this case, for every ε > 0, there exists a number s ∈ S for which 1 − ε < s < 1. We know that one of the original values ai is different from 1/2. Without losing generality, let us assume that this value is a1 . If a1 > 1/2, then 1 − a1 < 1/2. So, again without losing any generality, we can assume that a1 < 1/2. Let us now define two monotonic sequences pn and qn . For the first sequence, we take the values p0 = 1/2 > p1 = a1 > p2 = a1^2 > p3 = a1^3 > · · · The second sequence is defined iteratively:
• As q0 , we take q0 = 1/2.
• As q1 , let us select some number (smaller than 1) from the set S which is greater than or equal to 1 − p1 .
• Once the values q1 , . . . , qk have been selected, we select, as qk+1 , a number (smaller than 1) from the set S which is larger than qk and larger than 1 − pk+1 , etc.
For values s ≤ 0.5, we can then define the negation operation as follows:
• for each k, we have f ¬ ( pk ) = qk and
• it is linear for pk+1 < s < pk , i.e.,
$$f_\neg(s) = q_{k+1} + (s - p_{k+1}) \cdot \frac{q_{k+1} - q_k}{p_{k+1} - p_k}.$$
The resulting function maps the interval [0, 0.5] to the interval [0.5, 1]. For values s ≥ 0.5, we can define f ¬ (a) as the inverse function to this.
4 What if We Allow Unlimited Number of “And”-Operations and Negations: Case Study Formulation of the problem. In the previous sections, we allowed an unlimited application of “and”-operation and implication. What if instead, we allow an unlimited application of “and”-operation and negation?
Here is our related result.
Proposition 3 The set S of degrees that can be obtained from 0, 1/2, and 1 by using “and”-operation f & (a, b) = a · b and negation f ¬ (a) = 1 − a is the set of all binary-rational numbers, i.e., all numbers of the type p/2^k for natural numbers p and k for which p ≤ 2^k.
Proof. Clearly, the product of two binary-rational numbers is binary-rational, and 1 minus a binary-rational number is also a binary-rational number. So, all elements of the set S are binary-rational. To complete the proof, we need to show that every binary-rational number p/2^k belongs to the set S, i.e., can be obtained from 0, 1/2, and 1 by using multiplication and 1 − a. We will prove this result by induction over k. For k = 1, this means that 0, 1/2, and 1 belong to the set S, and this is clearly true, since S consists of all numbers that can be obtained from these three, these three numbers included. Let us assume that this property is proved for k. Then, for p ≤ 2^k, each element p/2^(k+1) is equal to the product (1/2) · (p/2^k) of two numbers from the set S and thus, also belongs to S. For p > 2^k, we have p/2^(k+1) = 1 − (2^(k+1) − p)/2^(k+1). Since p > 2^k, we have 2^(k+1) − p < 2^k and thus, as we have just proved, (2^(k+1) − p)/2^(k+1) ∈ S. So, the ratio p/2^(k+1) is obtained by applying the negation operation to a number from the set S and is, therefore, itself an element of the set S. The induction step is proven, and so is the proposition.
Comment. If we also allow implication f → (a, b) = b/a, then we will get all possible rational numbers p/q from the interval [0, 1]. Indeed, if we pick k for which q < 2^k, then for a = q/2^k and b = p/2^k, we get b/a = p/q.
Acknowledgements This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
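The induction in the proof of Proposition 3 is constructive, so it can be turned into a short recursive procedure. The following sketch is an illustration only: the function name build and its argument conventions are hypothetical, and the recursion simply mirrors the two cases of the induction step (multiply by 1/2 when p is in the lower half, apply negation otherwise).

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def build(p, k):
    """Obtain p / 2**k from 0, 1/2, and 1 using only a*b and 1 - a."""
    if k < 1 or not 0 <= p <= 2 ** k:
        raise ValueError("need k >= 1 and 0 <= p <= 2**k")
    if k == 1:
        # Induction base: the three starting values 0, 1/2, and 1.
        return (Fraction(0), HALF, Fraction(1))[p]
    if p <= 2 ** (k - 1):
        # Lower half: p / 2**k = (1/2) * (p / 2**(k-1)).
        return HALF * build(p, k - 1)
    # Upper half: p / 2**k = 1 - (2**k - p) / 2**k, and 2**k - p is in the lower half.
    return 1 - build(2 ** k - p, k)

print(build(5, 3))  # 5/8, built as 1 - (1/2) * (1 - (1/2) * (1/2))
```

The recursion always terminates because the negation case is immediately followed by the multiplication case, which decreases k.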
References
1. Alvarado, A., Koutsianas, A., Malmskog, B., Rasmussen, C., Roe, D., Vincent, C., West, M.: Solve S-unit equation x + y = 1—Sage reference manual v8.7: algebraic numbers and number fields (2019). http://doc.sagemath.org/html/en/reference/number_fields/sage/rings/number_field/S_unit_solver.html
2. Baker, A., Wüstholz, G.: Logarithmic Forms and Diophantine Geometry. Cambridge University Press, Cambridge (2007)
3. Belohlavek, R., Dauben, J.W., Klir, G.J.: Fuzzy Logic and Mathematics: a Historical Perspective. Oxford University Press, New York (2017)
4. Bombieri, E., Gubler, W.: Heights in Diophantine Geometry. Cambridge University Press, Cambridge (2006)
5. Everest, G., van der Poorten, A., Shparlinski, I., Ward, Th.: Recurrence Sequences. American Mathematical Society, Providence (2003)
6. Klir, G., Yuan, B.: Fuzzy Sets and Fuzzy Logic. Prentice Hall, Upper Saddle River (1995)
7. Lang, S.: Elliptic Curves: Diophantine Analysis. Springer, Berlin (1978)
8. Lang, S.: Algebraic Number Theory. Springer, Berlin (1986)
9. Mackenzie, D.: Needles in an infinite haystack. In: Mackenzie, D. (ed.) What’s Happening in the Mathematical Sciences, vol. 11, pp. 123–136. American Mathematical Society, Providence (2019)
10. Malmskog, B., Rasmussen, C.: Picard curves over Q with good reduction away from 3. Lond. Math. Soc. (LMS) J. Comput. Math. 19(2), 382–408 (2016)
11. Mendel, J.M.: Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions. Springer, Cham (2017)
12. Neukirch, J.: Class Field Theory. Springer, Berlin (1986)
13. Nguyen, H.T., Kreinovich, V., Tolbert, D.: On robustness of fuzzy logics. In: Proceedings of the 1993 IEEE International Conference on Fuzzy Systems FUZZ-IEEE’93, San Francisco, California, vol. 1, pp. 543–547 (1993)
14. Nguyen, H.T., Walker, C.L., Walker, E.A.: A First Course in Fuzzy Logic. Chapman and Hall/CRC, Boca Raton (2019)
15. Novák, V., Perfilieva, I., Močkoř, J.: Mathematical Principles of Fuzzy Logic. Kluwer, Boston (1999)
16. Smart, N.P.: The solution of triangularly connected decomposable form equations. Math. Comput. 64(210), 819–840 (1995)
17. Smart, N.P.: The Algorithmic Resolution of Diophantine Equations. London Mathematical Society, London (1998)
18. Tolbert, D.: Finding “and” and “or” operations that are least sensitive to change in intelligent control. Master’s thesis, University of Texas at El Paso, Department of Computer Science (1994)
19. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
Fuzzy Logic Explains the Usual Choice of Logical Operations in 2-Valued Logic Julio C. Urenda, Olga Kosheleva, and Vladik Kreinovich
Abstract In the usual 2-valued logic, from the purely mathematical viewpoint, there are many possible binary operations. However, in commonsense reasoning, we only use a few of them: why? In this paper, we show that fuzzy logic can explain the usual choice of logical operations in 2-valued logic.
J. C. Urenda, Department of Mathematical Sciences and Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
O. Kosheleva, Department of Teacher Education, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
V. Kreinovich (B), Department of Computer Science, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. M. Ceberio and V. Kreinovich (eds.), How Uncertainty-Related Ideas Can Provide Theoretical Explanation For Empirical Dependencies, Studies in Systems, Decision and Control 306, https://doi.org/10.1007/978-3-030-65324-8_20

1 Formulation of the Problem
In 2-valued logic, there are many possible logical operations: reminder. In the usual 2-valued logic, in which each variable can have two possible truth values, 0 (false) or 1 (true), for each n, there are many possible n-ary logical operations, i.e., functions f : {0, 1}^n → {0, 1}. To describe each such function, we need to describe, for each of 2^n boolean vectors (a1 , . . . , an ), whether the resulting value f (a1 , . . . , an ) is 0 or 1.
Case of unary operations. For unary operations, i.e., operations corresponding to n = 1, we need to describe two values: f (0) and f (1). For each of these two values, there are 2 possible options, so overall, we have 2 · 2 = 2^2 = 4 possible unary operations:
• the case when f (0) = f (1) = 0 corresponds to a constant f (a) ≡ 0;
• the case when f (0) = 0 and f (1) = 1 corresponds to the identity function f (a) = a;
• the case when f (0) = 1 and f (1) = 0 corresponds to negation f (a) = ¬a; and
• the case when f (0) = f (1) = 1 corresponds to a constant function f (a) ≡ 1.
The only non-trivial case is negation, and it is indeed actively used in our logical reasoning.
Case of binary operations. For binary operations, i.e., operations corresponding to n = 2, we need to describe four values f (0, 0), f (0, 1), f (1, 0), and f (1, 1). For each of these four values, there are 2 possible options, so overall, we have 2^4 = 16 possible binary operations:
• the case when f (0, 0) = f (0, 1) = f (1, 0) = f (1, 1) = 0 corresponds to a constant function f (a, b) ≡ 0;
• the case when f (0, 0) = f (0, 1) = f (1, 0) = 0 and f (1, 1) = 1 corresponds to “and” f (a, b) = a & b;
• the case when f (0, 0) = f (0, 1) = 0, f (1, 0) = 1, and f (1, 1) = 0, corresponds to f (a, b) = a & ¬b;
• the case when f (0, 0) = f (0, 1) = 0 and f (1, 0) = f (1, 1) = 1, corresponds to f (a, b) = a;
• the case when f (0, 0) = 0, f (0, 1) = 1, and f (1, 0) = f (1, 1) = 0, corresponds to f (a, b) = ¬a & b;
• the case when f (0, 0) = 0, f (0, 1) = 1, f (1, 0) = 0, and f (1, 1) = 1, corresponds to f (a, b) = b;
• the case when f (0, 0) = 0, f (0, 1) = 1, f (1, 0) = 1, and f (1, 1) = 0, corresponds to exclusive “or” (= addition modulo 2) f (a, b) = a ⊕ b;
• the case when f (0, 0) = 0 and f (0, 1) = f (1, 0) = f (1, 1) = 1, corresponds to “or” f (a, b) = a ∨ b;
• the case when f (0, 0) = 1 and f (0, 1) = f (1, 0) = f (1, 1) = 0, corresponds to f (a, b) = ¬a & ¬b;
• the case when f (0, 0) = 1, f (0, 1) = f (1, 0) = 0, and f (1, 1) = 1, corresponds to equivalence (equality) f (a, b) = a ≡ b;
• the case when f (0, 0) = 1, f (0, 1) = 0, f (1, 0) = 1, and f (1, 1) = 0, corresponds to f (a, b) = ¬b;
• the case when f (0, 0) = 1, f (0, 1) = 0, and f (1, 0) = f (1, 1) = 1, corresponds to f (a, b) = a ∨ ¬b or, equivalently, to the implication f (a, b) = b → a;
• the case when f (0, 0) = f (0, 1) = 1, and f (1, 0) = f (1, 1) = 0, corresponds to f (a, b) = ¬a;
• the case when f (0, 0) = f (0, 1) = 1, f (1, 0) = 0, and f (1, 1) = 1, corresponds to f (a, b) = ¬a ∨ b, or, equivalently, to the implication f (a, b) = a → b;
• the case when f (0, 0) = f (0, 1) = f (1, 0) = 1, and f (1, 1) = 0, corresponds to f (a, b) = ¬a ∨ ¬b;
• the case when f (0, 0) = f (0, 1) = f (1, 0) = f (1, 1) = 1, corresponds to a constant function f (a, b) ≡ 1.
In commonsense reasoning, we use only some of the binary operations. Out of the above 16 operations,
• two are constants: f (a, b) = 0 and f (a, b) = 1, and
• four are actually unary: f (a, b) = a, f (a, b) = ¬a, f (a, b) = b, and f (a, b) = ¬b.
In addition to these 2 + 4 = 6 operations, there are also 10 non-constant and non-unary binary logical operations:
• six named operations “and”, “or”, exclusive “or”, equivalence, and two implications (a → b and b → a), and
• four usually un-named logical operations ¬a & b, a & ¬b, ¬a & ¬b, and ¬a ∨ ¬b.
In commonsense reasoning, however, we only use the named operations. Why?
Maybe it is the question of efficiency? To check whether efficiency is the reason, we can use the experience of computer design, where the constituent binary gates are selected so as to make computations more efficient. Unfortunately, this leads to a completely different set of binary operations: e.g., computers typically use “nand” gates, which implement the function f (a, b) = ¬(a & b) = ¬a ∨ ¬b, but they never use gates corresponding to implication. So, the usual selection of binary logical operations remains a mystery.
What we do in this paper. In this paper, we show that the usual choice of logical operations in the 2-valued logic can be explained by ... fuzzy logic.
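The count of 2 constant, 4 unary, and 10 genuinely binary operations can be verified by enumerating all 16 truth tables. The sketch below is only an illustration; the names INPUTS and classify are hypothetical.

```python
from itertools import product

# The four input pairs (a, b), in the order (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product((0, 1), repeat=2))

def classify(table):
    """Classify a truth table (dict from (a, b) to 0 or 1)."""
    if len(set(table.values())) == 1:
        return "constant"
    depends_on_a = any(table[(0, b)] != table[(1, b)] for b in (0, 1))
    depends_on_b = any(table[(a, 0)] != table[(a, 1)] for a in (0, 1))
    return "binary" if depends_on_a and depends_on_b else "unary"

counts = {"constant": 0, "unary": 0, "binary": 0}
for outputs in product((0, 1), repeat=4):
    table = dict(zip(INPUTS, outputs))
    counts[classify(table)] += 1

print(counts)  # {'constant': 2, 'unary': 4, 'binary': 10}
```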
2 Our Explanation
Why fuzzy logic. We are interested in operations in 2-valued logics, so why should we take fuzzy logic into account? The reason is straightforward: in commonsense reasoning, we deal not only with precisely defined statements, but also with imprecise (“fuzzy”) ones. For example:
• We can say that the age of a person is 18 or above and this person is a US citizen, so he or she is eligible to vote.
• We can also say that a student is good academically and enthusiastic about research, so this student can be recommended for the graduate school.
We researchers may immediately see the difference between these two uses of “and”: precisely defined (“crisp”) in the first case, fuzzy in the second case. However, to many people, these two examples are very similar. So, to understand why some binary operations are used in commonsense reasoning and some are not, it is desirable to consider the use of each operation not only in the 2-valued logic, but also in the more general fuzzy case; see, e.g., [1, 3, 4, 6, 7, 9].
Which fuzzy generalizations of binary operations should we consider? In fuzzy logic, in addition to value 1 (true) and 0 (false), we also consider intermediate values corresponding to uncertainty. To describe such intermediate degrees, it is reasonable to consider real numbers intermediate between 0 and 1. This was exactly Zadeh’s original idea, which is still actively used in applications of fuzzy logic. So, to compare different binary operations, we need to extend these operations from the original set {0, 1} to the whole interval [0, 1]. There are many possible extensions of this type; which one should we select?
A natural idea is to select the most robust operations. For each fuzzy statement, its degree of confidence has to be elicited from the person making this statement. These statements are fuzzy, so naturally, it is not reasonable to expect that the same expert will always produce the exact same number: for the same statement, the expert can one day produce one number, another day a slightly different number, etc. When we plot these numbers, we will get something like a bell-shaped histogram, similar to what we get when we repeatedly measure the same quantity by the same measuring instrument. It is therefore reasonable to say that, in effect, the value a marked by an expert can differ from the corresponding mean value ā by some small random value Δa, with zero mean and small standard deviation σ.
Since this small difference does not affect the user’s perception, it should not affect the result of commonsense reasoning. In particular, the result of applying a binary operation to the corresponding imprecise numbers should not change much if we use slightly different estimates of the same expert. For example, 0.9 and 0.91 probably represent the same degree of expert’s confidence. So, it is not reasonable to expect that we should get drastically different values of a & b if we use a = 0.9 or a = 0.91. To be more precise, if we fix the value of one of the variables in a binary operation, then the effect of changing the second value on the result should be as small as possible. Due to the probabilistic character, we can only talk about being small “on average”, i.e., about the smallest possible mean square difference. This idea was, in effect, presented in [5, 6, 8] for the case of “and”- and “or”-operations; let us show how it can be extended to all possible binary logical operations. For a function F(a) of one variable, if we replace a with a + Δa, then the value F(a) changes to F(a + Δa). Since the difference Δa is small, we can expand the above expression in Taylor series and ignore terms which are quadratic (or higher order) in terms of Δa. Thus, we keep only linear terms in this expansion: F(a + Δa) = F(a) + F′(a) · Δa, where F′(a), as usual, indicates the derivative. The resulting difference ΔF := F(a + Δa) − F(a) in the value of F(a) is equal to F′(a) · Δa. Here, the mean squared value (variance) of Δa is equal to σ^2;
thus, the mean squared value of F′(a) · Δa is equal to (F′(a))^2 · σ^2. We are interested in the mean value of this difference. Here, a can take any value from the interval [0, 1], so the resulting mean value takes the form $\int_0^1 (F'(a))^2 \cdot \sigma^2\,da$. Since σ is a constant, minimizing this difference is equivalent to minimizing the integral $\int_0^1 (F'(a))^2\,da$. Our goal is to extend a binary operation from the 2-valued set {0, 1}. So, we usually know the values of the function F(a) for a = 0 and a = 1. In general, a function f(x) that minimizes a functional $\int L(f, f')\,dx$ is described by the Euler–Lagrange equation
$$\frac{\partial L}{\partial f} - \frac{d}{dx}\,\frac{\partial L}{\partial f'} = 0;$$
see, e.g., [2]. In our case, L = (F′(a))^2, so the Euler–Lagrange equation has the form
$$\frac{d}{da}\left(2F'(a)\right) = 2F''(a) = 0.$$
Thus, F″(a) = 0, meaning that the function F(a) is linear. Thus, we conclude that in our extensions of binary (and other) operations, the corresponding function should be linear in each of the variables, i.e., that this function should be bilinear (or, in general, multi-linear).
Such an extension is unique: a proof. Let us show that this requirement of bilinearity uniquely determines the corresponding extension. Indeed, suppose that two bilinear expressions f (a, b) and g(a, b) have the same values when a ∈ {0, 1} and b ∈ {0, 1}. In this case, the difference d(a, b) = f (a, b) − g(a, b) is also a bilinear expression whose value for all four pairs (a, b) for which a ∈ {0, 1} and b ∈ {0, 1} is equal to 0. Since d(0, 0) = d(1, 0) = 0, by linearity, we conclude that d(a, 0) = 0 for all a. Similarly, since d(0, 1) = d(1, 1) = 0, by linearity, we conclude that d(a, 1) = 0 for all a. Now, since d(a, 0) = d(a, 1) = 0, by linearity, we conclude that d(a, b) = 0 for all a and b. Thus, indeed, f (a, b) = g(a, b) and the extension is indeed defined uniquely.
How can we find the corresponding bilinear extension. For negation, linear interpolation leads to the usual formula f (a) = 1 − a. Let us see what happens for binary operations.
Once we know the values $f(0, 0)$ and $f(1, 0)$, we can use linear interpolation to find the values $f(a, 0)$ for all $a$:
$$f(a, 0) = f(0, 0) + a \cdot (f(1, 0) - f(0, 0)).$$
Similarly, once we know the values $f(0, 1)$ and $f(1, 1)$, we can use linear interpolation to find the values $f(a, 1)$ for all $a$:
Fuzzy Logic Explains the Usual Choice of Logical Operations …
$$f(a, 1) = f(0, 1) + a \cdot (f(1, 1) - f(0, 1)).$$
Now, since we know, for each $a$, the values $f(a, 0)$ and $f(a, 1)$, we can use linear interpolation to find the value $f(a, b)$ for each $b$ as
$$f(a, b) = f(a, 0) + b \cdot (f(a, 1) - f(a, 0)).$$
Let us use this idea to find the bilinear extensions of all ten non-trivial binary operations.

Let us list bilinear extensions of all non-trivial binary operations.
• for $a \,\&\, b$, we have $f(a, b) = a \cdot b$;
• for $a \,\&\, \neg b$, we have $f(a, b) = a \cdot (1 - b) = a - a \cdot b$;
• for $\neg a \,\&\, b$, we have $f(a, b) = (1 - a) \cdot b = b - a \cdot b$;
• for $a \oplus b$, we have $f(a, b) = a + b - 2a \cdot b$;
• for $a \vee b$, we have $f(a, b) = a + b - a \cdot b$;
• for $\neg a \,\&\, \neg b$, we have $f(a, b) = (1 - a) \cdot (1 - b)$;
• for $a \equiv b$, we have $f(a, b) = 1 - a - b + 2a \cdot b$;
• for $\neg a \vee b = a \to b$, we have $f(a, b) = 1 - a + a \cdot b$;
• for $a \vee \neg b = b \to a$, we have $f(a, b) = 1 - b + a \cdot b$;
• finally, for $\neg a \vee \neg b = \neg(a \,\&\, b)$, we have $f(a, b) = 1 - a \cdot b$.
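The two-step interpolation described above can be sketched in Python as follows. The sketch builds the bilinear extension of a Boolean operation from its four values on {0, 1} × {0, 1} and compares it, on a grid of points, with the closed-form expressions listed above. The names `bilinear_extension`, `truth_table`, `CASES`, and `max_deviation` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product


def bilinear_extension(table):
    """Return f(a, b) built from the four corner values by linear interpolation.

    `table` maps each pair (a, b) with a, b in {0, 1} to a value 0 or 1.
    """
    def f(a, b):
        f_a0 = table[(0, 0)] + a * (table[(1, 0)] - table[(0, 0)])  # f(a, 0)
        f_a1 = table[(0, 1)] + a * (table[(1, 1)] - table[(0, 1)])  # f(a, 1)
        return f_a0 + b * (f_a1 - f_a0)                             # f(a, b)
    return f


def truth_table(op):
    """Tabulate a Boolean operation on all four pairs of 0/1 inputs."""
    return {(a, b): int(op(bool(a), bool(b))) for a, b in product((0, 1), repeat=2)}


# Each entry: (Boolean operation, closed-form bilinear expression from the list).
CASES = {
    "a & b": (lambda a, b: a and b, lambda a, b: a * b),
    "a & not b": (lambda a, b: a and not b, lambda a, b: a - a * b),
    "not a & b": (lambda a, b: (not a) and b, lambda a, b: b - a * b),
    "a xor b": (lambda a, b: a != b, lambda a, b: a + b - 2 * a * b),
    "a or b": (lambda a, b: a or b, lambda a, b: a + b - a * b),
    "not a & not b": (lambda a, b: not a and not b, lambda a, b: (1 - a) * (1 - b)),
    "a == b": (lambda a, b: a == b, lambda a, b: 1 - a - b + 2 * a * b),
    "a -> b": (lambda a, b: (not a) or b, lambda a, b: 1 - a + a * b),
    "b -> a": (lambda a, b: a or (not b), lambda a, b: 1 - b + a * b),
    "not (a & b)": (lambda a, b: not (a and b), lambda a, b: 1 - a * b),
}


def max_deviation(name, grid=11):
    """Largest gap between the interpolated extension and the listed formula."""
    op, formula = CASES[name]
    f = bilinear_extension(truth_table(op))
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(a, b) - formula(a, b)) for a in points for b in points)


for name in CASES:
    print(f"{name:15s} max deviation = {max_deviation(name):.2e}")
```

Agreement on a finite grid is only a sanity check; the uniqueness argument given earlier is what guarantees that the two expressions coincide everywhere.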
Which operations should we select as basic. As the basic operations, we should select the ones which are the easiest to compute. Of course, we should have negation $f(a) = 1 - a$, since it is the easiest to compute: it requires only one subtraction. Out of all the above ten binary operations, the simplest to compute is $f(a, b) = a \cdot b$ (corresponding to “and”), which requires only one arithmetic operation: multiplication. All other operations need at least one more arithmetic operation. This explains why “and” is one of the basic operations in commonsense reasoning.

Several other operations can be described in terms of the selected “and”- and “not”-operations $f_\&(a, b) = a \cdot b$ and $f_\neg(a) = 1 - a$:
• the operation $f(a, b) = a - a \cdot b$ corresponding to $a \,\&\, \neg b$ can be represented as $f_\&(a, f_\neg(b))$;
• the operation $f(a, b) = b - a \cdot b$ corresponding to $\neg a \,\&\, b$ can be represented as $f_\&(f_\neg(a), b)$;
• the operation $f(a, b) = a + b - a \cdot b$ corresponding to $a \vee b$ can be represented as $f_\neg(f_\&(f_\neg(a), f_\neg(b)))$;
• the operation $f(a, b) = (1 - a) \cdot (1 - b)$ corresponding to $\neg a \,\&\, \neg b$ can be represented as $f_\&(f_\neg(a), f_\neg(b))$;
• the operation $f(a, b) = 1 - a + a \cdot b$ corresponding to $\neg a \vee b = a \to b$ can be represented as $f_\neg(f_\&(a, f_\neg(b)))$;
• the operation $f(a, b) = 1 - b + a \cdot b$ corresponding to $a \vee \neg b = b \to a$ can be represented as $f_\neg(f_\&(f_\neg(a), b))$; and
• the operation $f(a, b) = 1 - a \cdot b$ corresponding to $\neg a \vee \neg b = \neg(a \,\&\, b)$ can be represented as $f_\neg(f_\&(a, b))$.

There are two binary operations which cannot be represented as compositions of the selected “and”- and “not”-operations, namely:
• the operation $f(a, b) = a + b - 2a \cdot b$ corresponding to exclusive “or” $a \oplus b$, and
• the operation $f(a, b) = 1 - a - b + 2a \cdot b$ corresponding to equivalence $a \equiv b$.
Thus, we need to add at least one of these operations to our list of basic operations. Out of these two operations, the simplest to compute, i.e., the one requiring the smallest number of arithmetic operations, is the function corresponding to exclusive “or”. This explains why we use exclusive “or” in commonsense reasoning.

What about operations that can be described in terms of “and” and “not”? For some of these operations, e.g., for the function $f(a, b) = a - a \cdot b = a \cdot (1 - b)$ corresponding to $a \,\&\, \neg b$, direct computation requires exactly as many arithmetic operations as computing the corresponding representation in terms of “and” and “not”. However, there is one major exception: for the function $f(a, b) = a + b - a \cdot b$ corresponding to “or”:
• its straightforward computation requires two additions/subtractions and one multiplication, while
• its computation as $f_\neg(f_\&(f_\neg(a), f_\neg(b)))$ requires one multiplication (when applying $f_\&$) but three subtractions (corresponding to three negation operations).
Thus, for the purpose of efficiency, it makes sense to consider “or” as a separate operation. This explains why we use “or” in commonsense reasoning.

So far, we have explained all basic operations except for implication. Explaining implication requires a slightly more subtle analysis.

Why implications.
In the above analysis, we considered addition and subtraction to be equally complex. This is indeed the case for computer-based computations, but for us humans, subtraction is slightly more complex than addition. This does not change our conclusion about operations like $f(a, b) = a - a \cdot b$: whether we compute them directly or as $f(a, b) = f_\&(a, f_\neg(b))$, in both cases, we use the same number of multiplications, the same number of additions, and the same number of subtractions. There is, however, a difference for implication operations such as $f(a, b) = 1 - a + a \cdot b = f_\neg(f_\&(a, f_\neg(b)))$:
• its direct computation requires one multiplication, one addition, and one subtraction, while
• its computation in terms of “and”- and “not”-operations requires one multiplication and two subtractions.
In this sense, the direct computation of implication is more efficient, which explains why we also use implication in commonsense reasoning.

Conclusion. By using fuzzy logic, we have explained why negation, “and”, “or”, implication, and exclusive “or” are used in commonsense reasoning while other binary 2-valued logical operations are not.
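The operation counts used in the preceding analysis can be reproduced mechanically. The following Python sketch is one possible way to do so, under the assumption that each `+`, `-`, and `*` counts as one arithmetic operation: it wraps numbers in a small class that tallies the operations, then evaluates the direct formula and the “and”/“not” composition for “or”, for implication, and for $a \,\&\, \neg b$. The class `Tracked` and the helper `evaluate` are hypothetical names introduced for this illustration.

```python
from collections import Counter


def _raw(x):
    """Unwrap a Tracked value; leave plain numbers unchanged."""
    return x.value if isinstance(x, Tracked) else x


class Tracked:
    """Number wrapper that tallies additions, subtractions and multiplications."""

    def __init__(self, value, tally):
        self.value, self.tally = value, tally

    def _result(self, value, kind):
        self.tally[kind] += 1
        return Tracked(value, self.tally)

    def __add__(self, other):
        return self._result(self.value + _raw(other), "add")
    __radd__ = __add__

    def __sub__(self, other):
        return self._result(self.value - _raw(other), "sub")

    def __rsub__(self, other):  # handles expressions such as 1 - a
        return self._result(_raw(other) - self.value, "sub")

    def __mul__(self, other):
        return self._result(self.value * _raw(other), "mul")
    __rmul__ = __mul__


def f_and(a, b):
    return a * b


def f_not(a):
    return 1 - a


def evaluate(fn, a=0.3, b=0.8):
    """Return (value, operation tally) for fn evaluated at (a, b)."""
    tally = Counter()
    result = fn(Tracked(a, tally), Tracked(b, tally))
    return result.value, dict(tally)


# Each entry: (direct formula, composition of "and" and "not").
OPERATIONS = {
    "a or b": (lambda a, b: a + b - a * b,
               lambda a, b: f_not(f_and(f_not(a), f_not(b)))),
    "a -> b": (lambda a, b: 1 - a + a * b,
               lambda a, b: f_not(f_and(a, f_not(b)))),
    "a & not b": (lambda a, b: a - a * b,
                  lambda a, b: f_and(a, f_not(b))),
}

for name, (direct, composed) in OPERATIONS.items():
    print(name, "direct:", evaluate(direct)[1], "composed:", evaluate(composed)[1])
```

The tallies depend on how each formula is written, so they illustrate the counting convention of the text rather than a lower bound on the number of operations.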
3 Auxiliary Result: Why the Usual Quantifiers?

Formulation of the problem. In the previous sections, we considered binary logical operations. In our reasoning, we also use quantifiers such as “for all” and “there exists” which are, in effect, $n$-ary logical operations, where $n$ is the number of possible objects. Why these quantifiers? Why not use additional quantifiers like “there exist at least two”? Let us analyze this question from the same fuzzy-based viewpoint from which we analyzed binary operations. It turns out that this way, we get a (partial) explanation for the usual choice of quantifiers.

Why universal quantifier. Let us consider the case of $n$ objects $1, \ldots, n$. We have some property $p(i)$ which, for each object $i$, can be true or false. We would like to combine these $n$ truth values into a single one, i.e., we need an $n$-ary operation $f(a_1, \ldots, a_n)$ that would transform $n$ truth values $p(1), \ldots, p(n)$ into a single combined value $f(p(1), \ldots, p(n))$. Similarly to the previous section, let us consider a fuzzy version. Just like all non-degenerate binary operations (i.e., operations that are not constants or unary operations) must contain a product of two numbers, all non-degenerate $n$-ary operations must contain the product of $n$ numbers. Thus, the simplest possible case, with the fastest computations, is when the operation is simply the product of the given $n$ numbers, i.e., the operation $f(a_1, \ldots, a_n) = a_1 \cdot \ldots \cdot a_n$. Indeed, every other operation also requires addition or subtraction. This operation transforms the values $p(1), \ldots, p(n)$ into their product $p(1) \cdot \ldots \cdot p(n)$, which corresponds exactly to the formula $p(1) \,\&\, \cdots \,\&\, p(n)$, i.e., to the formula $\forall i \, p(i)$. This explains the ubiquity of the universal quantifiers.

Why existential quantifier. The universal quantifier has the property that it does not change if we permute the objects:
$$\prod_{i=1}^{n} p(i) = \prod_{i=1}^{n} p(\pi(i))$$
for every permutation $\pi : \{1, \ldots, n\} \to \{1, \ldots, n\}$. This condition of permutation invariance holds for all quantifiers, and it is natural to require it. We did not explicitly impose this condition in our derivation of the universal quantifier for only one reason: we were able to derive this quantifier from the requirement of computational simplicity alone, without a need to also explicitly require permutation invariance. However, now that we go from the justification of the simplest possible quantifier to a justification of other quantifiers, we need to explicitly require permutation invariance. Otherwise, the next simplest operations are operations like $\neg a_1 \,\&\, a_2 \,\&\, \cdots \,\&\, a_n$. It turns out that among permutation-invariant $n$-ary logical operations, the simplest are:
• the operation $\neg \forall i \, p(i)$, for which the corresponding formula $1 - \prod_{i=1}^{n} p_i$ requires $n - 1$ multiplications and one subtraction;
• the operation $\forall i \, \neg p(i)$, for which the corresponding formula $\prod_{i=1}^{n} (1 - p_i)$ requires $n - 1$ multiplications and $n$ subtractions; and
• the operation $\exists i \, p(i)$, i.e., equivalently, $\neg \forall i \, \neg p(i)$, for which the corresponding formula $1 - \prod_{i=1}^{n} (1 - p_i)$ requires $n - 1$ multiplications and $n + 1$ subtractions.
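The product-based formulas for the quantifiers listed above can be written down directly. The following Python sketch does so and checks permutation invariance by brute force on a small example. The function names are hypothetical, and the sample degrees are arbitrary values chosen for illustration.

```python
from itertools import permutations
from math import prod


def fuzzy_forall(degrees):
    """Universal quantifier: the product of the degrees p_1, ..., p_n."""
    return prod(degrees)


def fuzzy_not_forall(degrees):
    """Negated universal quantifier: 1 minus the product of the degrees."""
    return 1 - prod(degrees)


def fuzzy_forall_not(degrees):
    """'For all i, not p(i)': the product of the values 1 - p_i."""
    return prod(1 - p for p in degrees)


def fuzzy_exists(degrees):
    """Existential quantifier: 1 minus the product of the values 1 - p_i."""
    return 1 - prod(1 - p for p in degrees)


def not_first_and_rest(degrees):
    """The non-symmetric operation 'not a_1 and a_2 and ... and a_n'."""
    return (1 - degrees[0]) * prod(degrees[1:])


def is_permutation_invariant(op, degrees, tol=1e-12):
    """Brute-force check that op gives the same value for every ordering."""
    reference = op(list(degrees))
    return all(abs(op(list(perm)) - reference) <= tol
               for perm in permutations(degrees))


sample = [0.2, 0.5, 0.9]  # arbitrary illustrative degrees
for op in (fuzzy_forall, fuzzy_not_forall, fuzzy_forall_not,
           fuzzy_exists, not_first_and_rest):
    print(op.__name__, is_permutation_invariant(op, sample))
```

On 0/1 inputs these functions reduce to the usual two-valued quantifiers, and the last function shows why permutation invariance has to be required explicitly once we go beyond the plain product.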
This explains the ubiquity of existential quantifiers.

Acknowledgements
This work was supported in part by the US National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References
1. Belohlavek, R., Dauben, J.W., Klir, G.J.: Fuzzy Logic and Mathematics: A Historical Perspective. Oxford University Press, New York (2017)
2. Gelfand, I.M., Fomin, S.V.: Calculus of Variations. Dover Publications, New York (2000)
3. Klir, G., Yuan, B.: Fuzzy Sets and Fuzzy Logic. Prentice Hall, Upper Saddle River (1995)
4. Mendel, J.M.: Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions. Springer, Cham (2017)
5. Nguyen, H.T., Kreinovich, V., Tolbert, D.: On robustness of fuzzy logics. In: Proceedings of the 1993 IEEE International Conference on Fuzzy Systems FUZZ-IEEE’93, San Francisco, California, vol. 1, pp. 543–547 (1993)
6. Nguyen, H.T., Walker, C.L., Walker, E.A.: A First Course in Fuzzy Logic. Chapman and Hall/CRC, Boca Raton (2019)
7. Novák, V., Perfilieva, I., Močkoř, J.: Mathematical Principles of Fuzzy Logic. Kluwer, Boston (1999)
8. Tolbert, D.: Finding “and” and “or” operations that are least sensitive to change in intelligent control. Master’s thesis, University of Texas at El Paso, Department of Computer Science (1994)
9. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)